Chinese sentiment analysis is now part of the Repustate API
We are very proud to announce our new Chinese sentiment analysis engine. It is built on the same engine that powers our world-leading Arabic sentiment analysis, and it is both fast and accurate.
Conditional Random Fields
Unlike English and other languages written in the Latin alphabet, Chinese (simplified) doesn’t necessarily separate words with whitespace. For example, the following string of characters is a completely normal passage in Chinese:
团购分量比较一般,不过肉多,而且是和两个女生,所以基本都能吃饱。 猪手香肠无得讲,的确系一般餐厅做唔出的味道,其他就比较一般啦。 后来和朋友们正价去吃了一次,感觉分量比团购多,希望商家以后能一视同仁啦。
(For those who don’t read Chinese, this is a review of a restaurant.) You’ll see a few spaces here and there, but there are actually many more words being expressed than there are whitespace-separated tokens. So how do we know where one word (or idea) ends and the next begins?
We use a technique called conditional random fields (CRFs), a probabilistic model that infers the role of a particular glyph (character), such as whether it starts a new word or continues the current one, given the glyphs around it. With a large enough pre-tagged corpus of Chinese text, Repustate can achieve near-perfect accuracy in identifying the individual words or ideas being expressed in a long chain of Chinese glyphs.
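To make the idea concrete, one common way to frame segmentation for a CRF is to label every character with its position inside a word, then read the words back off the labels. The sketch below covers only that last step. It assumes the widely used B/M/E/S scheme (begin, middle, end, single-character word), and the tags are written by hand for illustration. In a real system they would come from a trained model, and the function name is hypothetical rather than part of Repustate’s API.

```python
def tags_to_words(chars, tags):
    """Rebuild words from per-character position tags.

    B = first character of a word, M = middle, E = last,
    S = a word made of a single character.
    """
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")

    words = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            # A single-character word closes anything left open.
            if current:
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":
            # A new word starts; flush any unfinished one first.
            if current:
                words.append(current)
            current = ch
        elif tag == "M":
            current += ch
        elif tag == "E":
            words.append(current + ch)
            current = ""
        else:
            raise ValueError(f"unknown tag: {tag!r}")
    if current:
        # Tolerate a sequence that ends without an E tag.
        words.append(current)
    return words


# A fragment of the review above, with hand-assigned (not predicted) tags.
fragment = "所以基本都能吃饱"
hand_tags = ["B", "E", "B", "E", "S", "S", "B", "E"]
segmented = tags_to_words(fragment, hand_tags)
print(segmented)
```

The hard part, choosing the tags, is what the probabilistic model is for: it scores each candidate tag for a character using the neighbouring characters and neighbouring tags, and picks the best-scoring sequence for the whole sentence.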
Part of speech tagging & sentiment
Now that we know which words are being used, we can apply part of speech tagging (nouns, verbs, adjectives, etc.) to help construct a grammatical overview of a piece of text. This then allows us to perform sentiment analysis using our proprietary engine. It’s the same engine that powers our Arabic sentiment analysis. Sentiment analysis uses a combination of probabilistic models, a dictionary of terms or phrases that connote sentiment, and hand-tuned, language-specific heuristics. All of this is done in a split second, so you can still analyze hundreds of Chinese documents in one HTTP request using Repustate’s Chinese sentiment analysis API.
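As a rough illustration of how a sentiment dictionary and a hand-tuned heuristic can work together on segmented text, here is a minimal sketch. It is not Repustate’s engine: the lexicon entries, their weights, the negator list, and the function name are all made up for the example, and the probabilistic component is left out entirely.

```python
# Toy lexicon: hypothetical words and weights, for illustration only.
LEXICON = {
    "好吃": 1.0,   # tasty
    "满意": 1.0,   # satisfied
    "难吃": -1.0,  # tastes bad
    "失望": -1.0,  # disappointed
}

# Toy language-specific heuristic: these words flip the next term.
NEGATORS = {"不", "没"}


def score_words(words):
    """Sum lexicon weights over already-segmented words.

    A sentiment term that directly follows a negator has its
    sign flipped. Words missing from the lexicon contribute 0.
    """
    total = 0.0
    for i, word in enumerate(words):
        weight = LEXICON.get(word, 0.0)
        if i > 0 and words[i - 1] in NEGATORS:
            weight = -weight
        total += weight
    return total


positive = score_words(["菜", "很", "好吃"])
negated = score_words(["不", "好吃"])
mixed = score_words(["服务", "不", "满意", "但是", "菜", "好吃"])
print(positive, negated, mixed)
```

Note that the heuristic depends on segmentation being right: the negator can only be matched to the term it modifies if both have been identified as separate words.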