Handling Language Dependencies in eDiscovery and Records Management when using Content Analytics

Posted on November 12, 2010


Foreign language texts contain a lot of hidden information, making multilingual information extraction tools – and applications that allow cross-lingual information access – particularly useful. Only a few system developers offer their products for more than two or three languages. Typically, they develop the tools for one language and then adapt them to the others.

 Especially in the fields of eDiscovery and (international) records, information or knowledge management, this limited language coverage can lead to high translation costs. When computer systems are used to manage and analyse the information (which is almost always the case these days), developing deep language support for all relevant languages is also a very expensive task.

 Guidelines on how to produce highly multilingual applications with the least possible effort are scarce. Nevertheless, it is possible to develop text mining and other content analytics applications with multilingual requirements in mind.

 In order to develop such systems, one has to follow a number of basic rules:

  1. Avoid language-dependent algorithms and rules where possible.
  2. Limit the language-specific resources to a minimum: for instance, no more than a few keyword lists or character mappings. Avoid part-of-speech taggers, parsers and language-specific grammars.
  3. Store the language-specific information externally from the core algorithms.
  4. If there are language dependencies that cannot be avoided, use algorithms to generate them from example corpora bottom-up.
  5. At all times, avoid language pair dependencies, because the number of pairs grows quadratically with the number of languages: n languages give n × (n − 1) directed pairs (for instance language pairs in machine translation).
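
 As a rough illustration of rules 2 to 5, the following Python sketch keeps the core code language-neutral, derives a small keyword list per language from example text, and stores it as plain data outside the algorithms. Everything in it is an assumption made for illustration: the function names, the tiny example corpora, and the choice of "most frequent words" as the bottom-up method are not taken from any particular product. The simple tokenizer also assumes scripts that separate words with spaces or punctuation.

```python
import json
import re
from collections import Counter


def tokenize(text):
    """Language-neutral core: lowercase and split on Unicode word characters."""
    return re.findall(r"\w+", text.lower())


def derive_keyword_list(corpus, top_n):
    """Rule 4: build a language-specific list bottom-up from example documents.

    Here the list is simply the top_n most frequent tokens (an assumed stand-in
    for whatever corpus-driven method a real system would use).
    """
    counts = Counter(token for doc in corpus for token in tokenize(doc))
    return [word for word, _ in counts.most_common(top_n)]


def content_terms(text, resources):
    """Core algorithm: identical for every language, driven only by external data."""
    stop = set(resources["stopwords"])
    return [token for token in tokenize(text) if token not in stop]


def directed_pairs(n_languages):
    """Rule 5: number of source-to-target language pairs for n languages."""
    return n_languages * (n_languages - 1)


# Hypothetical example corpora, one small set of documents per language.
EXAMPLE_CORPORA = {
    "en": ["the cat sat on the mat", "the dog ate the bone"],
    "nl": ["de kat zat op de mat", "de hond at de worst"],
}

# Rules 2 and 3: the only language-specific resource is a short keyword list,
# kept as plain data that can be stored outside the code (here as JSON).
RESOURCES = {
    lang: {"stopwords": derive_keyword_list(docs, top_n=1)}
    for lang, docs in EXAMPLE_CORPORA.items()
}
RESOURCES_JSON = json.dumps(RESOURCES, ensure_ascii=False)

if __name__ == "__main__":
    print(content_terms("The cat sat", RESOURCES["en"]))
    print(directed_pairs(10))  # 10 languages already give 90 directed pairs
```

 Adding a language in this sketch means adding example documents, not code. Per-language resources grow linearly with the number of languages, whereas anything defined per language pair grows with n × (n − 1).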

 Few vendors fully understand or oversee the effects and limitations of how they currently build multi-language support into their applications. In many flashy sales demos, language dependencies go unmentioned and, more than once, end users have been disappointed to find out later that support for certain languages is very limited.

 When you consider purchasing a system with advanced functionality that depends on content analytics (such as automatic record classification, automatic legal review, predictive coding, etc.), make sure that the functionality also works properly for languages and domains other than those presented in the sales demos!