Do We Understand the Benefits and Limitations of Traditional Web Search Engines Such as Google When Their Appliances and Technology Are Used In-House for Mission-Critical Applications?

Posted on June 25, 2010


There is an inherent risk when a popular brand becomes the perceived archetype for a particular product group or task. Take Google as an example. Google is great at what it is designed to do: finding relevant websites with very high precision, provided you know the right words to use in a search query. Admittedly, I am a big user myself. If I am looking for a steakhouse in New York City, I will “Google it.” It is easy to use and fast, and the results are often accurate and precise. Google also does a phenomenal job of keeping up with new information, which is usually found quickly when it appears on popular sites. The risk arises when a brand like “Google” becomes synonymous with “search,” because search technology comes in varying degrees of depth and capability.

There are scenarios in which appliances and Internet-based search engines will not get the job done, and relying on them for certain mission-critical tasks is a mistake. These limitations apply not only to Google, but also to other in-house search solutions that are based on or derived from Internet search engine technology.

Many Internet search engines are optimized to answer pre-defined, specific and precise queries. In those instances, one must know exactly what words to use, and the search results for these words will be very precise and accurate. This is “focalized” search, a technique that provides little to no ability to explore data; it assumes the user knows the exact terms to investigate. This fits very well in a basic retrieval model, but if one does not know exactly what words to use in the search, then traditional search tools will not help.

For example, finding all documents that present a threat to national security, or finding the reasons behind the credit crisis, requires “exploratory” search. This type of search offers techniques that can deal with imprecise specifications and, even more importantly, are dynamic and self-adapting to changing environments and data sets. Exploratory search systems combine many different search techniques, search tools, text mining and content analytics with various interactive tools that help a user find proper keywords or navigate interactively through the data.

Let’s take a closer look at the limitations and the implications for users who require deeper and more thorough searching than traditional web search can provide:

Fast crawling and indexing: In order to crawl and full-text index as much data as possible on the Internet, the traditional web search index technology has to use optimizations and take a number of shortcuts to keep up with all the new data:

  • There is very little time to perform complex calculations at crawling time, when a web search engine visits new or changed web sites and updates its internal search index. As a result, more complex functions such as wildcard searches, fuzzy searches, hit highlighting, hit navigation, and other search tools such as taxonomies and faceted search cannot be prepared in advance; they all have to be calculated at search time, when the search engine algorithms use the search index to find relevant web pages or documents. Users pay for this: either these functions are not available at all, or the system takes a very long time to return results. This is a huge limitation if users do not know the exact words required for a search query.
  • Not all documents that contain particular words or combinations of words are stored in the index; there is often a cut-off after a specific number. This is very problematic if a web search tool is used for e-discovery collections. Users will then find only the most popular documents and not all of them. That is hard to explain in court to opposing counsel.
  • Most appliances and enterprise search engines do suggest alternatives to a user’s query, but these are based on frequently used queries and not on similar content in the documents. Therefore, a user will miss deliberate misspellings and other low-frequency or unexpected spelling errors. Again, this poses a major risk to intelligence, security, law enforcement and early case assessments.
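
The last two points can be illustrated with a minimal Python sketch. It is not how any particular search engine or appliance works; the tiny document set, the `max_postings` cap and the helper names `build_index`, `exact_search` and `fuzzy_search` are all assumptions made for illustration. It shows how a capped index silently drops matching documents, and how expanding a query against the words that actually occur in the documents (here with the standard library's `difflib`) finds rare misspellings that a query-log-based suggestion would never propose.

```python
from collections import defaultdict
from difflib import get_close_matches

# Hypothetical mini collection; documents 3 and 5 contain misspellings.
DOCS = {
    1: "wire transfer approved by finance",
    2: "the transfer was delayed",
    3: "urgent tranfser request from vendor",
    4: "second transfer to offshore account",
    5: "trasnfer completed quietly",
}

def build_index(docs, max_postings=None):
    """Map each lowercase token to the ids of the documents containing it.

    max_postings is a hypothetical cap that mimics an index cut-off:
    once a token has that many documents, later ones are not stored.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for token in set(docs[doc_id].lower().split()):
            postings = index[token]
            if max_postings is None or len(postings) < max_postings:
                postings.append(doc_id)
    return dict(index)

def exact_search(index, term):
    """Return the stored document ids for exactly this term."""
    return list(index.get(term.lower(), []))

def fuzzy_search(index, term, cutoff=0.8):
    """Expand the term against the index vocabulary, then merge the postings."""
    variants = get_close_matches(term.lower(), list(index), n=10, cutoff=cutoff)
    hits = set()
    for variant in variants:
        hits.update(index[variant])
    return sorted(hits), sorted(variants)

full = build_index(DOCS)
capped = build_index(DOCS, max_postings=2)

print(exact_search(full, "transfer"))    # [1, 2, 4]
print(exact_search(capped, "transfer"))  # [1, 2] -- document 4 is silently lost
print(fuzzy_search(full, "transfer"))    # also reaches documents 3 and 5
```

The similarity cutoff of 0.8 is an arbitrary choice for this sketch; a real system would tune it and would precompute such expansions at indexing time rather than scanning the vocabulary for every query.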

Relevance ranking based on popularity: With popular web search engines, everybody wants to be at the top of the list, and some even pay money to get there. But criminals and terrorists don’t want to be found; they try to hide what they are doing. The relevance ranking of web search engines does not account for this. Additionally, it is often impossible to use a ranking scheme other than popularity ranking, which is based on the number of incoming links. The ranking results are also often opaque, and they differ depending on the time and the location from which the search is executed.

Avoiding search spam: Web search engines deliberately make their relevance ranking dynamic and not 100 percent transparent. The basic principles are well published, but the details of the algorithm change all the time based on many different parameters such as location, time, relevance of a site, keywords used, etc. If they didn’t follow this practice, most web search engines would soon become victims of search spam, as AltaVista did. Search spam will lead searches for particular words to sites that have completely different content, but that may be relevant to the searcher. It goes without saying that this type of behavior is unacceptable in a legal or intelligence environment.

The following two examples help to put these limitations in the context of how they would impact enterprise users.

  • Under the Federal Rules of Civil Procedure (FRCP), many organizations are under subpoena and have to disclose all relevant e-mails and other electronic documents that contain certain keywords, from certain custodians, within a certain time period. Limiting searches to only metadata or only the first 20,000 e-mails would be incomplete and unacceptable, yet that is often what occurs when a web search engine is used for this type of scenario.

  • The NAL (National Agricultural Library) has published a large collection of very relevant documents on the Internet (http://naldr.nal.usda.gov/). The collection consists of many large scanned documents (often 500 pages or more). Because the documents are very old, the scans may contain many optical character recognition (OCR) errors. The current solution, based on the ZyLAB search engine, offers fuzzy search and sub-second hit navigation in the large documents. This allows the user to overcome OCR errors and spelling variations in agricultural, bug and animal names, and to navigate immediately to, for instance, a hit on page 400 of a 500-page document without having to review the rest of the document. With a typical “Google” web engine it is not possible to find documents that contain such errors. It is also difficult to review retrieved documents, because it takes an extremely long time to navigate through a lengthy document to the page with a relevant hit (if the hits are displayed at all). The search technology used by the NAL site also allows the result list to be ranked on any key field, and there is no limitation on how users can sort the documents. “Popularity” does not play any role in the current implementation because it is irrelevant.
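
To make the idea of fuzzy hit navigation concrete, here is a small Python sketch. It is only an illustration under stated assumptions and says nothing about how the ZyLAB engine or the NAL site is actually implemented; the function names `build_page_index` and `find_hits`, the sample text and the 0.8 similarity threshold are all hypothetical. Word positions are recorded per page at indexing time, so a query can jump straight to the page of a hit, and tokens are compared by similarity so that an OCR error such as “weevi1” still matches “weevil.”

```python
from difflib import SequenceMatcher

def build_page_index(pages):
    """Record, for every token, the (page_number, word_offset) pairs where it occurs."""
    index = {}
    for page_no, text in enumerate(pages, start=1):
        for offset, token in enumerate(text.lower().split()):
            index.setdefault(token, []).append((page_no, offset))
    return index

def find_hits(index, term, min_ratio=0.8):
    """Return sorted (page, offset, token) hits whose token is similar to term."""
    term = term.lower()
    hits = []
    for token, positions in index.items():
        # Similarity matching tolerates OCR substitutions such as l -> 1.
        if SequenceMatcher(None, term, token).ratio() >= min_ratio:
            hits.extend((page, offset, token) for page, offset in positions)
    return sorted(hits)

# Hypothetical 500-page scan with one relevant, badly OCR'd page.
pages = ["general notes on crop rotation"] * 500
pages[399] = "damage caused by the boll weevi1 in cotton"

index = build_page_index(pages)
for page, offset, token in find_hits(index, "weevil"):
    print(f"page {page}, word {offset}: {token}")
```

Because the page and word positions are stored in the index, the viewer can open the document directly at the reported page instead of scanning all 500 pages at search time.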

As you can see, for advanced search requirements, “Googling it” won’t suffice. Enterprises that rely solely on enterprise search appliances should understand and evaluate these limitations. If they use Google or other web search engines for mission-critical applications such as e-discovery, intelligence, security and law enforcement investigations, they should be aware that there are serious limitations that will affect the quality and defensibility of their work.
