For reporting purposes we do some simple word analysis. I've been looking to expand on that by analyzing the frequency of 2 and 3 word pairs. So I've been evaluating some n-gram extraction algorithms.
Wikipedia defines an n-gram as follows:
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram"
Initially I had been using string.split() to break the text into words and process each one but found the overhead to be too high (lots of string copying). So I changed the approach to loop through each character in the text and build up words. To get the word pairs (or triplets) my first thought was to use Queues. First I had to have a queue that was a fixed size. It needed to drop items from the front when new ones were added to the end. From StackOverflow I found a simple example to do just that:
internal class FixedSizedQueue<T> : Queue<T> { private readonly int maxQueueSize; public FixedSizedQueue( int maxQueueSize) { this .maxQueueSize = maxQueueSize; } public new void Enqueue(T item) { base .Enqueue(item); if (Count > maxQueueSize) Dequeue(); // Throw away } } |
With that in place I could do something like the following:
public static void Analyze( string text) { var wordBuilder = new StringBuilder(); var bigramQueue = new FixedSizedQueue< string >(2); foreach (var c in text) { string word = "" ; if ( char .IsLetter(c) || c == '\'' ) { wordBuilder.Append(c); } else { // Get the current word if (wordBuilder.Length > 0) { word = wordBuilder.ToString(); Console.WriteLine( "L0 {0}" , word); wordBuilder = new StringBuilder(); } // Add word to the bigramQueue. if (! string .IsNullOrEmpty(word)) bigramQueue.Enqueue(word); if (bigramQueue.Count == 2) { var bigram = string .Join( " " , bigramQueue.ToArray()); Console.WriteLine( "L1 {0}" , bigram); } // Anything other than a space should clear the queue. // Not looking for bigrams over punctuation. if (c != ' ' ) { bigramQueue.Clear(); } } } } |
That worked but still had some unnecessary overhead with a StringBuilder for each word and having to join the words back together to get the bigram. So I rewrote the code using pointers into the text and copying out the words (or word pairs) as I found them. This approach proved to be the fastest.
public static void AnalyzeText( string text) { string word; int currentIndex = 0; int currentWordStart = -1; int word1Start = -1; foreach ( char c in text) { if (Char.IsLetter(c) || c == '\'' ) { if (currentWordStart == -1) currentWordStart = currentIndex; if (word1Start == -1) word1Start = currentIndex; } else { // Reached a non-letter check if we now have a word or word pairing. // Start with the greatest number of words and work backwords. // Look for 2 word pairings if (word1Start != currentWordStart) { var bigram = text.Substring(word1Start, currentIndex-word1Start); word1Start = currentWordStart; // reset index Console.WriteLine( "2: {0}" , bigram); } // Single if (currentWordStart != -1) { word = text.Substring(currentWordStart, currentIndex-currentWordStart); currentWordStart = -1; // reset index Console.WriteLine( "1: {0}" , word); } // If this is anything other than a space (, . !) reset the previous word indexes. // Do this because I am not looking for word pairs that span punctuation if (c != ' ' ) { word1Start = -1; } } // Increment the index currentIndex++; } // Look for 2 word pairings if (word1Start != currentWordStart) { var wordPair = text.Substring(word1Start, currentIndex-word1Start); word1Start = currentWordStart; Console.WriteLine( "2: {0}" , wordPair); } // Single if (currentWordStart != -1) { word = text.Substring(currentWordStart, currentIndex-currentWordStart); currentWordStart = -1; Console.WriteLine( "1: {0}" , word); } } |
I also took one more pass and made the code more generic. Rather than using specific variables for each pairing I used an array to hold on each word location. This code was slower than the previous due to the loops and array access. For looking for larger number of n-grams (say 4+) I think this would be the better code to use.
public static void AnalyzeText( string text) { int currentIndex = 0; int nGrams = 2; int [] wordStarts = new int [nGrams+1]; foreach ( char c in text) { // If we have a letter or apostrophe we are starting a word or in one. if (Char.IsLetter(c) || c == '\'' ) { // If any word start isn't set yet initialize it to the current index for ( int i = 0; i < wordStarts.Length; i++) { if (wordStarts[i] == -1) wordStarts[i] = currentIndex; } } else { // Loop thru the word starts in reverse order. If a word start does not equal a previous start then we have a new gram for ( int i = wordStarts.Length-1; i >= 0; i--) { // Use -1 for the check for the first word start. int previousWordStart = i == 0 ? -1 : wordStarts[i-1]; if (wordStarts[i] != previousWordStart) { var wordPair = text.Substring(wordStarts[i], currentIndex-wordStarts[i]); // Move the pointer forward to the next word. wordStarts[i] = previousWordStart; Console.WriteLine( "{0}: {1}" , i, wordPair); } } // If this is anything other than a space (, . !) reset the previous word indexes. // Do this because I am not looking for word pairs that span punctuation if (c != ' ' ) { for ( int i = 1; i < wordStarts.Length; i++) { wordStarts[i] = -1; } } } currentIndex++; } } |