A core function that any text analytics package needs is language detection. By language detection, we mean the following problem:
“Given an arbitrary piece of text of arbitrary length, determine in which language the text was written.”
That might sound simple for a human, assuming you know a thing or two about languages, but we’re talking about computers here. How does one automate this process? Before I dive into the solution, if you want to see this in the wild, go to translate.google.com and check out how Google does it. As soon as you start typing, Google guesses which language you’re typing in. How does it know?
The first thing you’ll need is a large corpus (read: millions of words) in each of the languages you’re interested in detecting. The words can’t just be random; they should form structured sentences. There are many sources for this, but the easiest is probably Wikipedia. You can download the entire Wikipedia corpus in the language of your choosing. It might take a while, but it’s worth it because the more text you have, the higher the accuracy you’ll achieve.
The next step is to generate n-grams over this corpus. An n-gram is a sequence of “n” consecutive words. So a unigram (1-gram) is one word, a bigram (2-gram) is two words, a trigram (3-gram) is three words, and so on. You probably only need to generate n-grams where n <= 3. Anything more will probably be overkill. How do you generate n-grams? Well, using the Repustate API of course. There are other n-gram generators on the internet, just google around. The benefit of using Repustate’s is that ours is blazingly fast, even when you take into account the network latency. As you generate n-grams, you need to store them in a set-like structure. We want a set rather than a list because a set stores each item only once and is much faster for lookups than a list.
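If you’d rather see what n-gram generation looks like locally, here is a minimal sketch in Python. It is not the Repustate API; the function name word_ngrams and the whitespace tokenization are my own simplifying assumptions, and real text would need proper handling of punctuation and case.

```python
def word_ngrams(text, max_n=3):
    """Return the set of word n-grams in `text` for n = 1..max_n.

    Hypothetical helper: tokenizes on whitespace only, which is an
    assumption made to keep the sketch short.
    """
    words = text.split()
    grams = set()
    for n in range(1, max_n + 1):
        # Slide a window of n words across the token list.
        for i in range(len(words) - n + 1):
            grams.add(" ".join(words[i:i + n]))
    return grams


if __name__ == "__main__":
    print(sorted(word_ngrams("I love Repustate")))
```

Because the result is a set, duplicates collapse automatically, which is exactly the property we want when building up the n-grams for a whole corpus.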
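I recommend using a bloom filter to store the n-grams. A bloom filter is a compact, probabilistic set: it never misses an item that was added, but it can occasionally report a match for an item that was never added. That trade-off is fine here because we only count matches, and it keeps memory use small. Bloom filters are awesome data structures; learn to use them and love them. At this point, all of our n-grams (there will be millions of them per language) are stored in bloom filters, one filter for each language.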
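To make the idea concrete, here is a small bloom filter sketch in Python using only the standard library. The class name, the default sizes (num_bits, num_hashes) and the choice of salted BLAKE2b hashes are all assumptions picked for illustration; in practice you would size the filter for your corpus or reach for an existing library.

```python
import hashlib


class BloomFilter:
    """A toy bloom filter: a bit array plus several hash functions.

    Sizes are illustrative assumptions, not tuned values.
    """

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            # A different salt per seed gives independent-looking hashes.
            digest = hashlib.blake2b(
                data, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every bit for this item is set. This can be a
        # false positive, but never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


if __name__ == "__main__":
    english = BloomFilter()
    english.add("I love")
    print("I love" in english)
```

The memory cost is fixed by num_bits no matter how many n-grams you add; what grows as the filter fills up is the false positive rate, so a filter holding millions of n-grams needs far more bits than this default.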
Next, we take the text for which we want to detect the language, and generate n-grams over it. Just for kicks, let’s generate n-grams for the sentence “I love Repustate”:
Unigrams: I, love, Repustate
Bigrams: I love, love Repustate
Trigrams: I love Repustate
Simple, right? Now for each n-gram above, check to see if it exists in each of the bloom filters that were created before. This is why using as large a corpus as possible is preferable. The more n-grams, the higher the chance of a positive match. The bloom filter that returns the highest number of matches tells you which language you’re dealing with.
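Here is a rough sketch of that matching step in Python. To keep it short and runnable on its own, it uses plain Python sets as stand-ins for the bloom filters (anything that supports the "in" operator would work the same way), and two tiny made-up corpora instead of Wikipedia. The names word_ngrams and detect_language are hypothetical.

```python
def word_ngrams(text, max_n=3):
    """Set of word n-grams for n = 1..max_n (whitespace tokens only)."""
    words = text.split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


def detect_language(text, filters):
    """Count n-gram hits per language and return (best_language, scores).

    `filters` maps a language code to any container supporting `in`.
    """
    grams = word_ngrams(text)
    scores = {
        lang: sum(1 for gram in grams if gram in known)
        for lang, known in filters.items()
    }
    # On a tie, max() returns the first language in `filters`.
    return max(scores, key=scores.get), scores


# Toy corpora, made up for this example; real ones need millions of words.
corpora = {
    "en": "the cat sat on the mat and the dog is in the house",
    "fr": "le chat est sur le tapis et le chien est dans la maison",
}
filters = {lang: word_ngrams(text) for lang, text in corpora.items()}

if __name__ == "__main__":
    print(detect_language("the dog is on the mat", filters))
```

With corpora this small, most input text will match nothing at all, which is the same point made above in miniature: the more n-grams each language has, the more likely a real sentence is to score some hits.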
Repustate has done all the heavy lifting already and if there’s enough demand (basically, if one person asks), we’ll add language detection to our text analytics API.