Dealing with Documents in Other Languages

Posted on January 28, 2014


High-stakes investigations and eDiscovery projects are not limited by national boundaries, and no investigator can afford to miss relevant information because it is in a foreign language and the cost of translation is too high.
Multilingual text collections hide more complexity than is apparent at first sight because, in addition to differences in character sets and words, text analysis makes intensive use of statistics as well as of the linguistic properties of a language (such as conjugation, grammar, tenses or meanings). These language dependencies need to be addressed when dealing with non-English content.
First, basic low-level character encoding differences can have a huge impact on the general searchability of data. Whereas English text is often stored in plain ASCII, a Windows "ANSI" code page or UTF-8, foreign-language text can arrive in a variety of legacy code pages or in Unicode encodings such as UTF-16, all of which map characters to bytes differently. Before an archive with foreign-language content can be full-text indexed and processed, the encoding of every file must be identified and its characters mapped correctly. Because the encoding may change from file to file, and may also differ between electronic file formats, this exercise can be significant and labor-intensive. In fact, words that contain language-specific characters such as ñ, Æ, ç, or ß (and there are hundreds more like them) will not be recognized at all if the wrong encoding or language model is chosen, or if the language is not known.
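To make the encoding problem concrete, here is a minimal Python sketch. The sample word and the helper `contains` are made up for illustration; the point is only that the same bytes read under the wrong code page no longer match a search term.

```python
# Hypothetical sample word; any text with non-ASCII characters behaves the same way.
word = "señor"

# The same word serialized under two different encodings.
utf8_bytes = word.encode("utf-8")      # ñ becomes two bytes: 0xC3 0xB1
latin1_bytes = word.encode("latin-1")  # ñ becomes one byte: 0xF1

# Reading UTF-8 bytes with the wrong code page silently produces garbled text.
garbled = utf8_bytes.decode("cp1252")


def contains(raw: bytes, encoding: str, term: str) -> bool:
    """Decode raw bytes with the given encoding and look for the search term.

    Undecodable bytes are replaced rather than raising, which mimics an
    indexer that keeps going and quietly indexes the wrong characters.
    """
    return term in raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    print(garbled)
    print(contains(utf8_bytes, "utf-8", word))    # right encoding
    print(contains(utf8_bytes, "cp1252", word))   # wrong encoding
    print(contains(latin1_bytes, "utf-8", word))  # wrong encoding
```

Only the first search finds the word. In the other two cases the file contains it, but the index would never return it.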
Next, the language needs to be recognized and the files need to be tagged with the proper language identification. For electronic files whose text is derived from an Optical Character Recognition (OCR) process, or for data that still needs to be OCRed, this process can be even more complex.
Text-analysis applications use more advanced linguistic analysis and often depend heavily on specific language characteristics and statistics to deal with complex information extraction and with the handling of co-references (anaphora) and negation. Without proper knowledge of the underlying language, these techniques will not work properly, so it is important that the technology you use can deal with multilingual documents and multilingual document collections.
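As a rough illustration of language tagging, the sketch below guesses a language by counting common function words. The word lists and the function name `guess_language` are assumptions invented for this example; production language identifiers typically rely on character n-gram statistics trained on far more data, and would be much more robust on short or noisy (OCRed) text.

```python
import re

# Tiny, hypothetical stop-word lists; real systems use far richer models.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to", "in", "that", "it"},
    "es": {"el", "la", "de", "que", "y", "en", "los", "es"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu"},
}


def guess_language(text: str) -> str:
    """Return the language code whose stop words occur most often in text."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # With no evidence at all, admit that the language is unknown.
    return best if scores[best] > 0 else "unknown"


if __name__ == "__main__":
    for sample in (
        "The cost of translation is too high",
        "El costo de la traducción es demasiado alto",
        "Die Kosten der Übersetzung sind zu hoch",
    ):
        print(guess_language(sample), "<-", sample)
```

The tag returned here is what a processing pipeline would attach to each file before choosing language-specific analysis steps.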
Statistical Machine Translation (SMT) methods, in combination with translation memory, provide a great tool for gaining quick and cost-effective insight into the content of documents. These methods differ considerably from the more traditional grammar-based methods. For many years, the field of computational linguistics consisted on the one hand of research based on Chomsky’s theories of generative grammar and on the other hand of more statistical approaches. Because of the complexity, lack of robustness and slow processing of the grammatical approach, statistical approaches are increasingly favored by the research community.
Over the years, the statistical and grammatical methods have more or less merged: the better-performing approach is now based on statistical algorithms in combination with large corpora of natural language that are tagged with (simple) linguistic properties and linguistic structures. Linguistic probabilities are calculated automatically from these large collections of data.
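A small sketch of what "calculating linguistic probabilities from tagged data" can mean in practice: estimating how likely each part-of-speech tag is for a word by relative frequency. The twelve-token corpus and the name `tag_probabilities` are invented for illustration; real corpora contain millions of tagged tokens.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged mini corpus of (word, part-of-speech tag) pairs.
TAGGED_CORPUS = [
    ("the", "DET"), ("book", "NOUN"), ("is", "AUX"), ("here", "ADV"),
    ("please", "INTJ"), ("book", "VERB"), ("the", "DET"), ("flight", "NOUN"),
    ("a", "DET"), ("book", "NOUN"), ("about", "ADP"), ("flight", "NOUN"),
]


def tag_probabilities(corpus):
    """Estimate P(tag | word) as the relative frequency of each tag per word."""
    counts = defaultdict(Counter)
    for word, tag in corpus:
        counts[word][tag] += 1
    return {
        word: {tag: n / sum(tags.values()) for tag, n in tags.items()}
        for word, tags in counts.items()
    }


if __name__ == "__main__":
    probs = tag_probabilities(TAGGED_CORPUS)
    print(probs["book"])  # "book" is ambiguous between noun and verb
    print(probs["the"])   # "the" is always a determiner in this corpus
```

Nobody wrote a rule saying that "book" is usually a noun; the estimate falls out of the counts, and it improves as the corpus grows.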
Statistical Machine Translation works on the same principle: from a large collection of sentence pairs in the source and target language, an SMT algorithm can derive the most probable translation for a particular sentence, phrase or word in a specific context. This approach overcomes many of the problems of traditional automated translation. While the translations may not be admissible in court, they do provide great insight into the content of large document and e-mail collections.
Now, why is SMT suddenly so good? There are two major reasons for this:
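The principle can be sketched with IBM Model 1, a classic textbook word-alignment model, used here purely for illustration and not as the method of any particular product. The three sentence pairs and the function names are assumptions made up for this example. Starting from uniform guesses, expectation-maximization learns word translation probabilities from the sentence pairs alone, with no dictionary and no grammar.

```python
from collections import defaultdict

# Hypothetical toy parallel corpus: (source tokens, target tokens).
SENTENCE_PAIRS = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]


def train_ibm_model1(pairs, iterations=20):
    """Learn P(target word | source word) with EM, as in IBM Model 1."""
    target_vocab = {t for _, tgt in pairs for t in tgt}
    uniform = 1.0 / len(target_vocab)
    prob = defaultdict(lambda: uniform)  # prob[(t, s)] = P(t | s)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (t, s) alignments
        total = defaultdict(float)  # expected count of s being aligned
        # E-step: spread each target word over the source words it may align to.
        for src, tgt in pairs:
            for t in tgt:
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    delta = prob[(t, s)] / norm
                    count[(t, s)] += delta
                    total[s] += delta
        # M-step: re-estimate the probabilities from the expected counts.
        for (t, s), c in count.items():
            prob[(t, s)] = c / total[s]
    return prob


def best_translation(prob, source_word):
    """Return the most probable target word for a source word."""
    candidates = {t: p for (t, s), p in prob.items() if s == source_word}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    model = train_ibm_model1(SENTENCE_PAIRS)
    for word in ("the", "house", "book", "a"):
        print(word, "->", best_translation(model, word))
```

Note that "house" only ever appears next to both "das" and "Haus"; the model still prefers "Haus" because "das" is better explained by "the", which occurs with it in two sentences. Systems used in practice work on whole phrases and add a language model for the target language, but the learning-from-examples idea is the same.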
• After 9/11, the US intelligence services were in great need of translations for specific languages. There were not enough screened translators, and it was impossible to train enough existing and newly screened employees to translate all the available data. Machine Translation was the only option. The problem with existing (grammar-based) translation tools was that training the system for a specific domain required the vendor to be involved. This was of course a problem because of the highly confidential nature of the data. And last but not least, understanding the training process required deep knowledge of computational linguistics, another hard-to-find talent. Statistical Machine Translation can learn automatically from sets of examples, a process that can be done in-house. SMT was also able to process the often corrupted data much more robustly than the traditional approaches. So, basically, SMT did a better job on all requirements, and US intelligence agencies invested heavily in this new technology, making it even better.
• Due to the availability of large translation databases from, for instance, the United Nations and the European Union, training the SMT algorithms is much easier than it was in the past. Finally, Moore’s law (a doubling of computing capacity roughly every two years), along with the matching growth in available data, works to our advantage!
There is one golden rule in all statistical linguistic algorithms: THE ONLY GOOD DATA IS MORE DATA. For that reason, I expect these algorithms only to become better and better, because the one thing we can be sure of is that we will end up with even MUCH more data a few years from now.
The applications of Machine Translation are endless: next to the obvious ones in intelligence and law enforcement, there are many other applications in the fields of eDiscovery, compliance, information governance, auditing, and of course knowledge management.
These translation solutions accelerate the way the world communicates by “unlocking” large volumes of digital content that would not be translated without automation.
So, with the right tools, documents in languages other than English no longer have to be a problem!