Why Some Enterprise Search Tools Can Compromise the Integrity of Your eDiscovery Process

Posted on October 7, 2010


As part of bringing eDiscovery in-house, one might consider using enterprise search tools, open source search tools, portal search, or embedded (free) search components to execute the negotiated Booleans, find relevant documents, and copy them to a preservation location (often a dedicated file server).

Although this may seem like a logical approach, using non-specialized search tools in an eDiscovery process carries many risks. In short, many of them lack functionality that is required to make the collection process defensible in court. The following are just a few examples of best practices [as defined by organizations such as EDRM (www.edrm.net) and the Sedona Conference (http://www.thesedonaconference.org/)] that many non-specialized enterprise search tools cannot support.

 1.       In order to make in-house eDiscovery defensible in court, one must prove that a file has not been modified since it was collected from a particular location. Therefore it is very important to track hash values (http://en.wikipedia.org/wiki/Fingerprint_(computing)) from the moment of collection until the final production of a document. If two Word documents have the same binary content (exactly the same text, file and document properties), then they will have exactly the same hash value; if even a single byte differs, the hash values will, for all practical purposes, differ as well. It is critical to calculate and store these hash values for every document that is collected. Non-specialized enterprise search tools cannot…

 2.       When a document is collected from a particular location, various metadata related to the collection need to be extracted and stored as well: data location, date and time (access, creation, and modification), and other available file (also known as operating system) properties such as, but not limited to, size and security access rights. Non-specialized enterprise search tools cannot…

 3.       For every document that is collected from a custodian or from a file system but not produced to opposing counsel, you will have to justify why you are not disclosing it to the other party. You will need to keep track of this as the eDiscovery process unfolds, and you will need to save this information for every document collected. For example, a document may be a duplicate, privileged, or confidential; if it is non-responsive, you need to track and document how and why you determined that. Non-specialized enterprise search tools cannot…
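
As an illustration of the three points above, here is a minimal sketch in Python of what a collection record could look like. It is not taken from any particular eDiscovery product: the function names, the field names, the choice of SHA-256, and the "disposition" field are all assumptions made for this example.

```python
import hashlib
import os
import stat
from datetime import datetime, timezone


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file's bytes in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def collection_record(path):
    """Build a record for one collected file (hypothetical field names)."""
    # Read the file system properties before hashing, because reading
    # the content may update the access time on some systems.
    info = os.stat(path)

    def as_utc(timestamp):
        return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()

    return {
        "source_path": os.path.abspath(path),
        "size_bytes": info.st_size,
        "modified": as_utc(info.st_mtime),
        "accessed": as_utc(info.st_atime),
        # Platform dependent: creation time on Windows,
        # time of the last metadata change on Unix.
        "created_or_changed": as_utc(info.st_ctime),
        "permissions": stat.filemode(info.st_mode),
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "sha256": sha256_of_file(path),
        # Assumed field: filled in later with the reason a document is
        # withheld (duplicate, privileged, non-responsive, ...).
        "disposition": None,
    }


def is_unmodified(path, record):
    """Re-hash the file and compare with the hash stored at collection."""
    return sha256_of_file(path) == record["sha256"]
```

Under these assumptions, calling is_unmodified() on the preserved copy at production time shows whether its content still matches what was collected, and two records with the same "sha256" value point to documents with identical binary content.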

 Additionally, some non-specialized enterprise search tools lack sufficient capabilities to properly handle in-house collection during eDiscovery:

 1.       Among the challenges of executing a Negotiated Boolean* query is the need to define the various occurrences of nouns and verbs, such as inflections, plurals, prefixes, suffixes, compound nouns (often very common in languages such as German or Dutch), abbreviations, named entity variations, synonyms, spelling errors, spelling variations (e.g. US and UK English or pharmaceutical or chemical names), and more. In order to address this, attorneys use so-called WILDCARDS and FUZZY searches. These operators allow word variations to be found. For example, a wildcard search on SCHOOL* will find SCHOOL and SCHOOLS. Wildcards can also be used at the beginning and/or in the middle of a word, and multiple wildcards can be used in the same word. For more examples, please consult: http://aiimcommunities.org/erm/blog/understand-benefits-and-limitations-traditional-web-search-engines-such-google-when-you-use-the.

 Many non-specialized search engines cannot handle these types of constructions, or they become extremely slow on large data sets. In order to execute wildcard searches effectively and quickly, words that are alike according to some wildcard or fuzzy algorithm need to be organized in special data structures in your search index. Some vendors use dictionaries of inflected words to limit the search scope, but that will not give you all spelling variations, and there is more and more case law in which a judge orders the search to be redone as a full wildcard search, often combined with penalties and sanctions.

 2.       You need to be prepared to explain in court, and in detail, how your search engine worked. Was there a level of fuzziness? Did you identify non-searchable data? How did you handle encrypted files? How did you deal with compressed files (ZIP, ARJ, ARC, …)? How did you address non-searchable bitmaps and PDFs? How did you handle other non-searchable data such as multimedia files? How many documents did you miss? Did you collect all documents or only the most relevant (some web-based search engines keep only 20,000 documents per keyword in the index tables for performance reasons)? How does your relevance ranking work: on popularity, probability, or some other form of statistics? In general, a judge will not like any type of collection based on popularity ranking! These are the questions that opposing counsel will ask you if they understand eDiscovery.
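
To make the point about special data structures for wildcard searches more concrete, here is a minimal sketch in Python. It assumes the index keeps a sorted list of all terms plus a second sorted list of the same terms spelled backwards; the class name TermDictionary and its methods are hypothetical, and real search engines use more elaborate structures. A trailing wildcard then becomes a range lookup on the first list and a leading wildcard a range lookup on the second, while all other patterns fall back to a slow scan of the whole vocabulary.

```python
import bisect
import fnmatch


class TermDictionary:
    """Sorted vocabulary of index terms (hypothetical, for illustration)."""

    def __init__(self, terms):
        self.terms = sorted({t.upper() for t in terms})
        # The same terms spelled backwards, so that a leading wildcard
        # (*SCHOOL) can also be answered as a prefix lookup.
        self.reversed_terms = sorted(t[::-1] for t in self.terms)

    @staticmethod
    def _prefix_range(sorted_terms, prefix):
        # Binary search for the slice of terms starting with `prefix`.
        # Assumes no term contains the sentinel character U+FFFF.
        low = bisect.bisect_left(sorted_terms, prefix)
        high = bisect.bisect_left(sorted_terms, prefix + "\uffff")
        return sorted_terms[low:high]

    def expand(self, pattern):
        """Return every index term that matches the wildcard pattern."""
        pattern = pattern.upper()
        body = pattern.strip("*")
        if "*" not in body and "?" not in body:
            if pattern.endswith("*") and not pattern.startswith("*"):
                return self._prefix_range(self.terms, body)
            if pattern.startswith("*") and not pattern.endswith("*"):
                hits = self._prefix_range(self.reversed_terms, body[::-1])
                return sorted(t[::-1] for t in hits)
        # Wildcards in the middle or on both sides: scan all terms.
        # This is the slow path that hurts on large data sets.
        return [t for t in self.terms if fnmatch.fnmatchcase(t, pattern)]


vocabulary = TermDictionary(
    ["school", "schools", "schooling", "preschool", "scholar"]
)
print(vocabulary.expand("SCHOOL*"))  # ['SCHOOL', 'SCHOOLING', 'SCHOOLS']
print(vocabulary.expand("*SCHOOL"))  # ['PRESCHOOL', 'SCHOOL']
```

Because expand() returns the full list of matched terms, that list can be stored alongside the query, which may help when you later have to explain exactly which word variations a wildcard search covered.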

In summary, desktop, portal, and web search tools do NOT provide the functionality that is essential to eDiscovery. Fortunately, there are many specialized identification, search, and collection tools that are defensible, auditable, and referenced in existing case law.

There is also a steadily growing body of case law in which parties are sanctioned for data spoliation and in which their process comes under scrutiny. As a result, it is increasingly important to use the right tools, rules, and methodology for the job so that you can avoid penalties, fines, the cost of redoing the work, strict deadlines for new productions, and bad PR.

 *When attorneys negotiate what data will need to be exchanged based on the Federal Rules of Civil Procedure, they determine not only the custodians, file locations, and file extensions, but they also define (or negotiate) a Boolean query. The content of the negotiated Boolean query, which can run to hundreds of pages in length, is extremely important for both parties. The claiming party typically wants to gain access to as much data as possible, often resulting in a fishing expedition. In contrast, the disclosing party aims to produce as little information as possible to avoid costly legal reviews and to limit legal exposure and other legal risks. You can find more details about negotiated Booleans at: http://aiimcommunities.org/erm/blog/learn-what-negotiated-boolean-ediscovery-all-about.