Technology Assisted Review, Concept Search and Predictive Coding: The Limitations and Risks

Posted on May 9, 2012


Technology Assisted Review (TAR) is a marketing term used in the eDiscovery community to describe the automatic classification of documents in a so-called legal review. Documents are classified by their similarity to training data or seed sets. Typical classes include Confidential, Privileged or Responsive. As the saying goes, “there’s more than one way to skin a cat”: TAR is also called Machine Assisted Review (MAR), Computer Assisted Review (CAR), Predictive Coding, Concept Search, or Meaning-based computing, all of which are marketing terms without any specific scientific meaning.

A recent US ruling by Judge Peck regarding the use of machine learning technology in legal review has created a lot of tumult in the eDiscovery community (see for more information and links to other articles). Not only has the opposing party filed complaints against the ruling (including accusing the judge of obtaining financial benefits from the vendor whose software was used), but the entire legal community seems to be engaged in a heated debate on the topic.

Now that there is case law on the use of TAR, and it has been confirmed by other judges, one can expect eDiscovery software buyers to require Predictive Coding, Concept Search or other TAR-related capabilities far more often.

The Science Behind TAR

Personally, I am an avid fan of artificial intelligence and machine learning. As a matter of fact, I have practiced in the field since 1985 and hold the special chair of text mining at the University of Maastricht, where I teach my students everything there is to learn about using text mining for tasks such as document classification, clustering and information extraction.

Machine learning and other techniques from artificial intelligence are not based on “hocus pocus”: they are based on solid mathematical and statistical frameworks in combination with common-sense or biology-inspired heuristics. In the case of text mining, there is an extra complication: the content of textual documents has to be translated, so to speak, into numbers (probabilities, mathematical notions such as vectors, etc.) that machine learning algorithms can interpret. The choices made during this translation can strongly influence the results of the machine learning algorithms.

For instance, the “bag-of-words” approach used by some products has several limitations: completely different documents may end up as the exact same vector for machine learning, while documents with the same meaning may end up as completely different vectors. See for more information on this topic. The garbage-in, garbage-out principle definitely applies here!
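Both failure modes of the bag-of-words approach can be illustrated with a minimal sketch. It assumes the simplest possible variant (lower-casing, splitting on whitespace, counting words); the function name and the example sentences are invented for illustration, and real products add steps such as stemming or weighting that this sketch leaves out.

```python
from collections import Counter


def bag_of_words(text):
    """Lower-case the text, split on whitespace and count each word.

    Word order is discarded, which is the defining property of the approach.
    """
    return Counter(text.lower().split())


# Two sentences with opposite meanings but identical word counts.
a = bag_of_words("the buyer sued the seller")
b = bag_of_words("the seller sued the buyer")

# Two sentences with a similar meaning but no words in common.
c = bag_of_words("counsel terminated the agreement")
d = bag_of_words("lawyers cancelled our contract")

same_vector = (a == b)
shared_words = set(c) & set(d)

print(same_vector)   # True: different documents, same representation
print(shared_words)  # set(): similar meaning, nothing in common
```

Whatever learning algorithm is applied afterwards, it cannot recover a distinction that the representation has already thrown away.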

Other complications arise when:

  • More than one language is used in the document set, for instance, if some documents are in English and some are in Dutch. Documents that mix several languages within a single document cause even more problems.
  • There are many document categories. The more categories there are, the lower the quality of the classification will be: it is easier to differentiate black from white than it is to differentiate 1,000 shades of gray.
  • There are not enough relevant training documents, which lowers the quality of classification. The number of required training documents grows faster than the number of classes. So, for 2 times more classes one may need 4 times more training documents.
  • The documents use very different or very ambiguous language for the same topics (e.g. there are many synonyms and homonyms).

Incremental document collections (e.g. new documents that are added after training) will also result in lower quality, or will require the model to be trained again from scratch.

Several risk factors are listed here, but there are more, depending on the specific machine learning technology that is used. Technology based on (naive) Bayes classifiers falsely presumes statistical independence between measured features (e.g. word occurrences). Latent Semantic Indexing (LSI) effectively uses a lossy information compression algorithm (SVD), and related methods such as Probabilistic Latent Semantic Analysis (PLSA) reduce the number of dimensions in a comparable way; both may result in more (irreversible) information loss than required. Knowledge of the specific parameter settings is integral to gaining a full understanding of the quality of specific machine learning models.
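To see why the independence assumption matters, consider a small sketch with invented numbers. Suppose a word occurs in 80% of responsive and 40% of non-responsive documents, and a second word always co-occurs with the first, so it carries no new information. A classifier that treats both words as independent counts the same evidence twice. The function name and all probabilities below are assumptions made for illustration only.

```python
def naive_bayes_posterior(prior, likelihoods_pos, likelihoods_neg):
    """Posterior P(positive | features), assuming the features are
    statistically independent given the class (the 'naive' assumption)."""
    score_pos = prior
    score_neg = 1.0 - prior
    for p_pos, p_neg in zip(likelihoods_pos, likelihoods_neg):
        score_pos *= p_pos
        score_neg *= p_neg
    return score_pos / (score_pos + score_neg)


# One informative word: P(word | responsive) = 0.8, P(word | other) = 0.4.
one_word = naive_bayes_posterior(0.5, [0.8], [0.4])

# Add a second word that always co-occurs with the first one. It adds no
# information, yet the independence assumption multiplies it in again.
two_words = naive_bayes_posterior(0.5, [0.8, 0.8], [0.4, 0.4])

print(round(one_word, 3))   # 0.667
print(round(two_words, 3))  # 0.8
```

The correct posterior is still about 0.667 in the second case, so the reported confidence of 0.8 is inflated. Words in real documents are strongly correlated, which is why scores from such classifiers should not be read as calibrated probabilities. The sketch does not cover the information loss of SVD-based methods.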

There is No Free Lunch

Machine learning requires significant set-up. Training the classification model (aka the classifier) and testing its quality is a time-consuming and demanding task that requires, at a minimum, the manual tagging and evaluation of both the training and the test set by more than one party (in order to prevent biased opinions). Testing has to be done according to best practice standards used in the information retrieval community (e.g. see the proceedings of the TREC conferences organized by the NIST). Deviation from such standards will be challenged in courts. All of this is expensive and should be factored into the cost-benefit analysis for the approach.
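Precision, recall and their harmonic mean (F1) are the measures most commonly reported in the information retrieval community, so a quality test will typically involve a calculation along the following lines. This is a minimal sketch: the function name and the ten-document test set are invented, and a defensible test needs a far larger, properly sampled set that has been tagged by more than one party.

```python
def precision_recall_f1(gold, predicted):
    """Compare classifier output with manual tags (True = Responsive)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # correctly found
    fp = sum(1 for g, p in pairs if not g and p)      # wrongly flagged
    fn = sum(1 for g, p in pairs if g and not p)      # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical test set: manual tags versus what the classifier predicted.
gold = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False, False, False]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

In legal review, recall usually deserves the most attention, because it measures how many of the truly responsive documents the classifier actually found.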

If the classifier does not work (e.g. a mutually agreed-upon, predefined quality level is not reached), the only remedy is to retrain the entire model with better training examples. Eventually, this process could negate any performance increases or cost savings that could have been achieved by applying the technology. In the worst case it proves impossible to improve the model, and all training and testing efforts will have been a waste of time. This may very well happen in cases that suffer from the complications described above.

Additionally, one has to be able to explain and defend the application of machine learning technology in court. This may not be a trivial task, given that machine learning is based on state-of-the-art principles in linear algebra and probability theory that are not commonly understood by those who may be involved in the lawsuit. Therefore, parties and the court will rely heavily on (expensive) expert witnesses.


So, before applying Predictive Coding, Concept Search or any other technology that goes by the name of Technology Assisted Review, first become informed of the potential risks of using the technology for a particular case. It may be the right choice for some cases, but not for others.

If it is not possible or too risky to apply machine learning techniques, then there are also other forms of automatic document classification, such as rules-based document classification. These may be a better choice in certain cases, especially when defensibility is an issue. They come with their fair share of set-up, but in almost all cases they are more defensible and easier to manage.
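To give an impression of what rules-based classification can look like, here is a minimal sketch in which every class is defined by an explicit pattern. The rules, class names and sample text are hypothetical; a real rule set would be much larger and drafted together with the legal team.

```python
import re

# Hypothetical rules: one regular expression per class.
RULES = [
    ("Privileged",
     re.compile(r"\b(attorney[- ]client|legal advice|privileged)\b",
                re.IGNORECASE)),
    ("Confidential",
     re.compile(r"\b(confidential|trade secret|do not distribute)\b",
                re.IGNORECASE)),
]


def classify(text):
    """Return a (class, matched phrase) pair for every rule that fires."""
    hits = []
    for label, pattern in RULES:
        match = pattern.search(text)
        if match:
            hits.append((label, match.group(0)))
    return hits


result = classify("This memo contains legal advice and is CONFIDENTIAL.")
no_hits = classify("Lunch at noon?")

print(result)   # [('Privileged', 'legal advice'), ('Confidential', 'CONFIDENTIAL')]
print(no_hits)  # []
```

Because each decision can be traced back to the exact phrase that triggered it, the outcome is comparatively easy to explain in court. The price is that someone has to write and maintain the rules.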