The Impact of Incorrect Training Sets and Rolling Collections on Technology-Assisted Review (TAR) and Defensible Disposition

Posted on July 16, 2013


Last week, I participated in the DESI Workshop, held as part of the International Conference on Artificial Intelligence in Law on June 14 in Rome, Italy. At the conference we submitted recent Technology-Assisted Review (TAR) findings described in the paper “The Impact of Incorrect Training Sets and Rolling Collections on Technology-Assisted Review,” which was written by Mary Mack, Tim van Cann and me.

The findings presented in this paper show that training samples can contain a large share of incorrectly coded documents before they impact TAR results unacceptably, but that documents added to a review set after the classifiers have been trained do result in an unacceptable TAR error rate.

Document classification and machine learning technology in eDiscovery are gaining attention under new names such as technology-assisted review (TAR), machine-assisted review (MAR), computer-assisted review (CAR) and predictive coding. Several judicial rulings, white papers and conferences have addressed typical legal concerns in relation to the quality of machine learning.

In the paper we address two of these very common concerns in document reduction and legal review and investigate their relation with machine learning quality in more detail.

First, we investigated the impact of the quality of training documents on the overall classification results when using machine learning with Support Vector Machines. We found that the impact of wrong training documents was much smaller than expected: deliberately inserting up to 25% wrong training documents reduced classification quality, measured as recall, by only 3-5%.
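The noise-injection step of such an experiment can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the protocol from the paper: the helper names flip_labels and recall, the binary responsive/non-responsive coding (1 and 0), and the toy label list are all hypothetical. In a real test, a classifier would be trained once on the clean labels and once on the noisy ones, and recall on a held-out set would be compared.

```python
import random


def flip_labels(labels, fraction, seed=0):
    """Return a copy of binary labels (0/1) with `fraction` of them flipped.

    Hypothetical helper: simulates incorrectly coded training documents.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    n_flip = int(round(fraction * len(labels)))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


def recall(y_true, y_pred):
    """Recall = true positives / all documents that are actually positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data (assumption): 100 training labels, half responsive.
clean = [1, 0] * 50
noisy = flip_labels(clean, 0.25)  # 25% deliberately wrong labels
n_changed = sum(1 for a, b in zip(clean, noisy) if a != b)
print(n_changed, "of", len(clean), "training labels flipped")
```

Flipping labels at random is only one way to model coding mistakes; real reviewer errors may be systematic, which could affect a classifier differently.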

Second, we found that classifiers based on the most commonly used document feature-extraction techniques, bag-of-words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), lost up to 50% in quality when used on completely new documents, such as those in a rolling collection.
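One plausible contributor to this drop (an assumption made here for illustration, not an explanation taken from the paper) is that a BoW or TF-IDF vocabulary is fixed when the classifier is trained, so terms that first appear in later documents are simply ignored. The sketch below uses hypothetical helper names and toy documents, and a smoothed IDF formula chosen only for convenience, to show how much of a newly added document can fall outside the trained vocabulary.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple whitespace tokenizer (assumption: real systems do more)."""
    return text.lower().split()


def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training docs only."""
    n = len(train_docs)
    df = Counter()
    for doc in train_docs:
        df.update(set(tokenize(doc)))  # count each term once per document
    return {term: math.log((1 + n) / (1 + count)) + 1.0
            for term, count in df.items()}


def transform(doc, idf):
    """TF-IDF weights for one document; out-of-vocabulary terms are dropped."""
    tf = Counter(tokenize(doc))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def coverage(doc, idf):
    """Fraction of a document's tokens that the trained vocabulary knows."""
    tokens = tokenize(doc)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in idf) / len(tokens)


# Toy collection (assumption): the vocabulary is fixed after training.
train = ["energy contract pricing memo", "contract renewal pricing terms"]
idf = fit_idf(train)

seen_like = "pricing terms contract"             # resembles the training set
rolled_in = "pipeline outage incident contract"  # added after training

print(coverage(seen_like, idf))   # every token is in the vocabulary
print(coverage(rolled_in, idf))   # only "contract" is in the vocabulary
print(transform(rolled_in, idf))  # the classifier sees a single feature
```

A low coverage value for newly collected documents may be a useful warning sign that the classifier should be retrained before it is applied to them.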

These research results show that robust machine learning algorithms such as Support Vector Machines do not suffer that much from wrong training samples, as long as there are enough training samples available. As a result, parties involved do not have to worry too much about a few incorrectly coded training samples. This may reduce expectations to disclose training documents, especially the non-responsive documents, in a very early phase of the case.

Conversely, the research shows that classifiers are better used on complete document sets, rather than on sets that will be supplemented after training.

In this research, we used a freely available source of documents and a simple, open protocol during the experiments to invite replication and enhancements. Machine learning can reduce costs and improve outcomes when used appropriately. It is equally important that the transparency required by cooperation be of the sort that does not unnecessarily disclose confidential or nonresponsive documents.

Obviously, the results apply not only to the use of machine learning in eDiscovery, but also to legacy data clean-up, defensible disposition, and automatic data classification in records management, enterprise information archiving and intelligent data migration from legacy systems to, for instance, the cloud.

The complete paper is available for download at