Friday, November 20, 2015

N-Gram Extraction

For reporting purposes we do some simple word analysis. I've been looking to expand on that by analyzing the frequency of 2 and 3 word pairs. So I've been evaluating some n-gram extraction algorithms.

Wikipedia defines an n-gram as follows:

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram"

Initially I had been using string.split() to break the text into words and process each one but found the overhead to be too high (lots of string copying). So I changed the approach to loop through each character in the text and build up words. To get the word pairs (or triplets) my first thought was to use Queues. First I had to have a queue that was a fixed size. It needed to drop items from the front when new ones were added to the end. From StackOverflow I found a simple example to do just that:

internal class FixedSizedQueue<T> : Queue<T> {
  private readonly int maxQueueSize;
 
  public FixedSizedQueue(int maxQueueSize) {
    this.maxQueueSize = maxQueueSize;
  }
 
  public new void Enqueue(T item) {
    base.Enqueue(item);
    if (Count > maxQueueSize) Dequeue(); // Throw away
  }
}

With that in place I could do something like the following:

public static void Analyze(string text) {
  var wordBuilder = new StringBuilder();
  var bigramQueue = new FixedSizedQueue<string>(2);
 
  foreach (var c in text) {
    string word = "";
    if (char.IsLetter(c) || c == '\'') {
      wordBuilder.Append(c);
    } else {
      // Get the current word
      if (wordBuilder.Length > 0) {
        word = wordBuilder.ToString();
        Console.WriteLine("L0 {0}", word);
        wordBuilder = new StringBuilder();
      }
       
      // Add word to the bigramQueue.
      if (!string.IsNullOrEmpty(word)) bigramQueue.Enqueue(word);
      if (bigramQueue.Count == 2) {
        var bigram = string.Join(" ", bigramQueue.ToArray());
        Console.WriteLine("L1 {0}", bigram);
      }
       
      // Anything other than a space should clear the queue.
      // Not looking for bigrams over punctuation.
      if (c != ' ') {
        bigramQueue.Clear();
      }
    }
  }
}

That worked but still had some unnecessary overhead with a StringBuilder for each word and having to join the words back together to get the bigram. So I rewrote the code using pointers into the text and copying out the words (or word pairs) as I found them. This approach proved to be the fastest.

public static void AnalyzeText(string text) {
  string word;
  int currentIndex = 0;
  int currentWordStart = -1;
  int word1Start = -1;
   
  foreach (char c in text) {
   
    if (Char.IsLetter(c) || c == '\'') {
      if (currentWordStart == -1) currentWordStart = currentIndex;
      if (word1Start == -1) word1Start = currentIndex;
     
    } else {
      // Reached a non-letter check if we now have a word or word pairing.
      // Start with the greatest number of words and work backwords.
      // Look for 2 word pairings
      if (word1Start != currentWordStart) {
        var bigram = text.Substring(word1Start, currentIndex-word1Start);
        word1Start = currentWordStart; // reset index
        Console.WriteLine("2: {0}", bigram);
      }
       
      // Single
      if (currentWordStart != -1) {
        word = text.Substring(currentWordStart, currentIndex-currentWordStart);
        currentWordStart = -1;  // reset index
        Console.WriteLine("1: {0}", word);
      }
       
      // If this is anything other than a space (, . !) reset the previous word indexes.
      // Do this because I am not looking for word pairs that span punctuation
      if (c != ' ') {
        word1Start = -1;
      }   
    }
   
    // Increment the index
    currentIndex++;
  }
           
  // Look for 2 word pairings
  if (word1Start != currentWordStart) {
    var wordPair = text.Substring(word1Start, currentIndex-word1Start);
    word1Start = currentWordStart;
    Console.WriteLine("2: {0}", wordPair);
  }
   
  // Single
  if (currentWordStart != -1) {
    word = text.Substring(currentWordStart, currentIndex-currentWordStart);
    currentWordStart = -1;
    Console.WriteLine("1: {0}", word);
  }
}

I also took one more pass and made the code more generic. Rather than using specific variables for each pairing I used an array to hold on each word location. This code was slower than the previous due to the loops and array access. For looking for larger number of n-grams (say 4+) I think this would be the better code to use.

public static void AnalyzeText(string text) {
  int currentIndex = 0;
   
  int nGrams = 2;
  int[] wordStarts = new int[nGrams+1];
   
  foreach (char c in text) {
   
    // If we have a letter or apostrophe we are starting a word or in one.
    if (Char.IsLetter(c) || c == '\'') {
      // If any word start isn't set yet initialize it to the current index
      for (int i = 0; i < wordStarts.Length; i++) {
        if (wordStarts[i] == -1) wordStarts[i] = currentIndex;
      }
     
    } else {
      // Loop thru the word starts in reverse order. If a word start does not equal a previous start then we have a new gram
      for (int i = wordStarts.Length-1; i >= 0; i--) {
        // Use -1 for the check for the first word start.
        int previousWordStart = i == 0 ? -1 : wordStarts[i-1];
        if (wordStarts[i] != previousWordStart) {
          var wordPair = text.Substring(wordStarts[i], currentIndex-wordStarts[i]);
          // Move the pointer forward to the next word.
          wordStarts[i] = previousWordStart;
          Console.WriteLine("{0}: {1}", i, wordPair);
        }
      }
       
      // If this is anything other than a space (, . !) reset the previous word indexes.
      // Do this because I am not looking for word pairs that span punctuation
      if (c != ' ') {
        for (int i = 1; i < wordStarts.Length; i++) {
          wordStarts[i] = -1;
        }
      }
   
    }
   
    currentIndex++;
  }
}