Friday, November 20, 2015

N-Gram Extraction

For reporting purposes we do some simple word analysis. I've been looking to expand on that by analyzing the frequency of 2 and 3 word pairs. So I've been evaluating some n-gram extraction algorithms.

Wikipedia defines an n-gram as follows:

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram"

Initially I had been using string.split() to break the text into words and process each one but found the overhead to be too high (lots of string copying). So I changed the approach to loop through each character in the text and build up words. To get the word pairs (or triplets) my first thought was to use Queues. First I had to have a queue that was a fixed size. It needed to drop items from the front when new ones were added to the end. From StackOverflow I found a simple example to do just that:

With that in place I could do something like the following:

That worked but still had some unnecessary overhead with a StringBuilder for each word and having to join the words back together to get the bigram. So I rewrote the code using pointers into the text and copying out the words (or word pairs) as I found them. This approach proved to be the fastest.

I also took one more pass and made the code more generic. Rather than using specific variables for each pairing I used an array to hold on each word location. This code was slower than the previous due to the loops and array access. For looking for larger number of n-grams (say 4+) I think this would be the better code to use.