How to Support Non-English-Language Documents in Search and Text Analysis

Posted on February 12, 2010


Many language dependencies need to be addressed when text-analysis technology is applied to non-English languages.

First, basic low-level character-encoding differences can have a huge impact on the general searchability of data: where English is often represented in plain ASCII, ANSI, or UTF-8, other languages use a variety of code pages and Unicode encodings (such as UTF-16), all of which map characters differently. Before an archive in a particular language can be full-text indexed and processed, an exact character-mapping step must be performed. Because this mapping may differ from file to file, and may also differ between electronic file formats, the exercise can be significant and labour intensive. Without it, words that contain language-specific special characters such as ñ, Ǽ, ç, or ß (and there are hundreds more of such characters) will not be recognized at all.
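A minimal sketch of why this mapping step matters (illustrative only, not ZyLAB's implementation): the same accented word stored under different encodings must be decoded with the correct code page before an index sees one canonical Unicode form.

```python
# The same word stored under three different encodings.
word = "Müller"

raw_latin1 = word.encode("latin-1")  # Western-European code page
raw_utf16 = word.encode("utf-16")    # Unicode (UTF-16), with byte-order mark
raw_utf8 = word.encode("utf-8")

# Decoding each byte stream with its own encoding recovers identical text.
assert raw_latin1.decode("latin-1") == word
assert raw_utf16.decode("utf-16") == word
assert raw_utf8.decode("utf-8") == word

# Decoding with the wrong code page silently garbles the word:
# UTF-8 bytes read as Latin-1 turn "Müller" into something unsearchable.
garbled = raw_utf8.decode("latin-1")
print(garbled)  # 'MÃ¼ller' -- a full-text query for "Müller" will never match
```

If a collection mixes code pages per file, this decode step has to be decided per file, which is exactly why the mapping exercise becomes labour intensive at scale.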

Next, the language needs to be recognized and the files need to be tagged with the proper language identification. For electronic files whose text is derived from an optical character recognition (OCR) process, or for data that still needs to be OCR-ed, this step can be especially complicated.
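To make the identification step concrete, here is a toy stopword-counting identifier. This is purely illustrative (real systems typically use character n-gram statistics over much larger models, and `guess_language` is a hypothetical name, not a ZyLAB function):

```python
# Tiny stopword lists for three languages -- far too small for
# production use, but enough to show the principle.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "not"},
    "de": {"der", "die", "und", "das", "ist", "nicht"},
    "nl": {"de", "het", "een", "en", "van", "niet"},
}

def guess_language(text: str) -> str:
    """Return the language whose stopwords occur most often in the text."""
    tokens = text.lower().split()
    scores = {
        lang: sum(1 for t in tokens if t in words)
        for lang, words in STOPWORDS.items()
    }
    return max(scores, key=scores.get)

print(guess_language("der Vertrag ist nicht unterzeichnet"))  # de
print(guess_language("the contract is not signed"))           # en
```

Note that OCR noise degrades exactly the short, frequent words such a method relies on, which is one reason language identification on OCR-ed data is harder.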

Straightforward text-analysis applications use regular expressions, dictionaries (of entities), or simple statistics (often Bayesian or Hidden Markov Models), all of which depend heavily on knowledge of the underlying language. For instance, many regular expressions encode US phone-number or US postal-address conventions, and these patterns will not work in other countries or other languages. Regular expressions used by text-analysis software also often presume that words starting with a capital letter are named entities, which does not hold for German, where all nouns are capitalized. Another example: in languages such as German and Dutch, words can be concatenated into new compound words, something English-oriented text-analysis tools rarely anticipate. Many more linguistic structures exist that cannot be handled by US-developed text-analysis tools.
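Both failure modes are easy to demonstrate. The patterns below are hypothetical examples of the US-centric conventions described above, not patterns from any particular product:

```python
import re

# A US-style phone pattern: "(425) 555-0123". It finds nothing in a
# Dutch document, where numbers look like "+31 20 555 0123".
us_phone = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

print(bool(us_phone.search("Call (425) 555-0123 today")))    # True
print(bool(us_phone.search("Bel +31 20 555 0123 vandaag")))  # False

# "Capitalized word = named entity" breaks down in German, where
# every noun is capitalized:
caps = re.compile(r"\b[A-Z][a-zäöüß]+\b")
print(caps.findall("Der Hund läuft über die Straße"))
# ['Der', 'Hund', 'Straße'] -- ordinary words flagged as if they were names
```

The same mismatch applies to compound words: a dictionary lookup for the Dutch compound "ziekenhuisrekening" fails unless the tool knows how to split it into "ziekenhuis" + "rekening".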

In order to recognize the start and end of named entities and to resolve anaphora and co-references, more advanced text-analysis approaches tag the words in a sentence with part-of-speech (POS) techniques. These natural-language-processing techniques depend completely on lexicons and on morphological, statistical, and grammatical knowledge of the underlying language. Without extensive knowledge of a particular language, such tools will not work at all.
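The lexicon dependency can be shown with a deliberately minimal tagger (a sketch for illustration; real POS taggers add morphological rules and statistical disambiguation on top of the lexicon):

```python
# A toy English lexicon mapping word forms to part-of-speech tags.
LEXICON_EN = {"the": "DET", "dog": "NOUN", "runs": "VERB"}

def tag(tokens, lexicon):
    """Tag each token via lexicon lookup; unknown words get 'UNKNOWN'."""
    return [(t, lexicon.get(t.lower(), "UNKNOWN")) for t in tokens]

print(tag(["The", "dog", "runs"], LEXICON_EN))
# [('The', 'DET'), ('dog', 'NOUN'), ('runs', 'VERB')]

# The same tagger applied to German with an English lexicon has
# nothing to work with -- every word comes back UNKNOWN.
print(tag(["Der", "Hund", "läuft"], LEXICON_EN))
# [('Der', 'UNKNOWN'), ('Hund', 'UNKNOWN'), ('läuft', 'UNKNOWN')]
```

Everything downstream of POS tagging, such as entity-boundary detection and co-reference resolution, inherits this dependency, which is why language resources are the gating factor.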

Only a few text-analysis and text-analytics solutions provide real coverage for languages other than English. Thanks to large investments by the US government, languages such as Arabic, Farsi, Urdu, Somali, Chinese, and Russian are often well covered, whereas German, Spanish, French, Dutch, and the Scandinavian languages are rarely fully supported. These limitations need to be taken into account when applying text-analysis technology in international cases.


ZyLAB’s text analysis supports multiple languages, which is critical when investigations go global and incorporate collections of information in various languages. ZyLAB reconciles differences in character sets and words, and it also makes intensive use of statistics and the linguistic properties (e.g., conjugation, grammar, senses or meanings) of a language. Basic text-mining technology is available for more than 400 languages; for 40 of them, ZyLAB provides deep linguistic analysis for the most challenging disambiguation and processing requirements.

In addition, ZyLAB offers product interfaces in twenty languages, including German, French, Dutch, Spanish, Portuguese, Italian, Swedish, Finnish, Polish, Arabic, Farsi, and Khmer. ZyLAB’s recognition and full-text indexing technology now supports over 400 languages, including all Western-European and Eastern-European languages, Russian, Chinese, Japanese, Arabic, Farsi, African and Asian languages, and most regional indigenous languages. Automatic language recognition, along with voting between different OCR engines, simplifies the OCR process and yields optimal OCR results in multilingual environments. ZyLAB can full-text index any language that can be represented in Unicode characters.