Repustate takes pride in offering a Named Entity Recognition API that is both accurate and fast. This API is the lifeblood of Deep Semantic Search, as it allows Deep Search to identify millions of entities across all supported languages in just a few milliseconds and make them all searchable.
But don't just take our word for it. We've compiled a benchmark that tests other vendors' offerings against Repustate's Named Entity Recognition capabilities to see which is best.
All of the code is open source, as is the input data.
Criterion | Details |
---|---|
Entity detection coverage | Are all entities present in the text recognized? This tests simple exact matches, but also context-sensitive disambiguations, Twitter handles, ticker symbols, aliases, acronyms, nicknames, etc. |
Granularity of entity classification | We evaluate how specifically the entities are classified. Does the vendor differentiate between Location types (cities, countries, rivers, etc.), or are all locations just tagged as "Location"? |
Language coverage | The sample data covers many languages, including some tricky ones like Arabic and Japanese. We tested to see which vendors handled these languages properly. |
Speed of API | The amount of time it takes to process the test data set. |
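
To make the coverage and speed criteria concrete, here is a minimal sketch of how such a measurement could be taken. It is not the benchmark's actual source code: the `run_benchmark` function, the sample format (text plus a set of expected entity strings), and the `toy_extractor` stand-in for a vendor call are all assumptions made for illustration.

```python
import time
from typing import Callable, Iterable

# Hypothetical sample format: (text, set of expected entity strings).
Sample = tuple[str, set[str]]


def run_benchmark(
    extract_entities: Callable[[str], Iterable[str]],
    samples: list[Sample],
) -> dict[str, float]:
    """Score one extractor: share of expected entities found, plus elapsed time."""
    expected_total = 0
    found_total = 0
    start = time.perf_counter()
    for text, expected in samples:
        # Normalise case so "toronto" and "Toronto" count as the same entity.
        found = {e.lower() for e in extract_entities(text)}
        expected_total += len(expected)
        found_total += sum(1 for e in expected if e.lower() in found)
    # Note: the timing includes the (small) scoring overhead of this loop.
    elapsed_ms = (time.perf_counter() - start) * 1000
    coverage = found_total / expected_total if expected_total else 0.0
    return {"coverage": coverage, "elapsed_ms": elapsed_ms}


def toy_extractor(text: str) -> list[str]:
    """Stand-in for a vendor call: treats every capitalised word as an entity."""
    return [w.strip(".,") for w in text.split() if w[:1].isupper()]


samples = [
    ("Apple opened an office in Toronto.", {"Apple", "Toronto"}),
    ("The bank raised rates in Canada.", {"Canada", "Bank of Canada"}),
]
result = run_benchmark(toy_extractor, samples)
print(f"coverage={result['coverage']:.0%}, time={result['elapsed_ms']:.2f} ms")
```

In a real run, `toy_extractor` would be swapped for a function that calls a vendor's API and returns the entity strings it reports. For HTTP-based vendors the elapsed time would then include network latency.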
Below are our findings. All of these results can be reproduced using the source code and sample data provided.
VENDOR | ACCURACY | GRANULARITY | LANGUAGES | SPEED (MS) |
---|---|---|---|---|
Repustate | 95% | ✓ | 23 | 60 |
Google Cloud NLP | 75% | ❌ | 10 | 1070 |
Amazon Comprehend | 67% | ❌ | 6 | 160 |
Dandelion | 63% | ❌ | 7 | 250 |
TextRazor | 61% | ❌ | 12 | 240 |
Microsoft Azure Cognitive | 50% | ❌ | 6 | 3210 |
spaCy | 45% | ❌ | 7 | 30¹ |
Aylien | 42% | ❌ | 6 | 150 |
¹ spaCy is run locally, while all other providers are accessed over HTTP. As such, there's no network latency in spaCy's time.
Use Repustate. Joking aside, Google's Cloud NLP performs quite well and has good language support. It's a bit slow compared to the others, but if your dataset isn't too big, that speed hit shouldn't be too bad. If for whatever reason you can't use Repustate, we recommend Google ... But seriously, use Repustate.