How to find more!

Posted on April 28, 2010


8 ways to implement real exploratory search.

In some search tasks, finding only a few “best hits” is good enough to answer a question or solve a problem. A good example is looking for a hotel in a foreign city, say Amsterdam. Finding the best hotel site with the best deals will suffice; no one is interested in finding all web sites that mention hotel deals in Amsterdam. This type of search is often referred to as focalized search or web search, and it favors high precision over high recall.

But when a police officer, investigator, lawyer, or intelligence analyst searches, they need to go beyond just finding the “best hits”. They need to review all potentially relevant documents to avoid being confronted with unexpected information.

There are a few more problems they have to address. People who commit fraud do not want to be found, contrary to typical web search, where everybody wants to be at the top of the list. Also, the searchers do not always know the exact keywords to look for: unknown (code) names or synonyms may be used. As a result, searchers need a broader, more exploratory type of search that allows them to explore the data in a very interactive manner. The search should allow for high recall (i.e. find all potentially relevant documents), but at the same time suppress the noise of non-relevant documents.

Precision and recall tend to be inversely related: if you use tools to increase one, the other will usually decrease. For instance, fuzzy search will get you more relevant hits, but you will also find more false positives. Proximity search statements (find word A within X words of word B), on the other hand, will yield fewer false positives, but you may also miss relevant documents.
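
To make the trade-off concrete, here is a minimal sketch in Python. The document IDs and the two result lists are made up for illustration; they do not come from any real search engine. Precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list against a known relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth: four documents are actually relevant.
relevant = {"doc1", "doc2", "doc3", "doc4"}

# A strict query (for example a tight proximity statement) returns few hits.
narrow = ["doc1", "doc2"]

# A broad query (for example fuzzy search) returns everything relevant plus noise.
broad = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6", "doc7", "doc8"]

narrow_scores = precision_recall(narrow, relevant)  # (1.0, 0.5)
broad_scores = precision_recall(broad, relevant)    # (0.5, 1.0)

print("narrow query:", narrow_scores)
print("broad query: ", broad_scores)
```

The narrow query has no false positives but misses half of the relevant documents; the broad query finds them all, but half of its hits are noise.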

Professional exploratory search should provide the user not only with search tools to find more, but also with interactive tools to suppress irrelevant documents by being able to navigate fast, interactively, and by using different angles (facets) to look at the result lists.

Here are 8 basic techniques you should be looking for:

  1. Real fuzzy and wildcard search: this is essential to find words and phrases that look like the query but are not exactly the same. It is important to be able to vary not only the end of a word, but also the beginning or the middle (or any combination of these). At the same time, the system should not depend on dictionaries (because that limits what you can find), and it should still perform well on huge data sets. This requires a search index that is built to support these types of searches, which takes additional effort at indexing time; that is why almost all web engines do not support it. As a result, scanning, OCR, transliteration, and spelling errors and variations cannot be found by such search engines.
  2. Fast hit highlighting, hit navigation and keyword in context: these are essential tools to quickly navigate large documents from hit to hit. Only then can users efficiently determine why a document was retrieved and where the relevant words are. This should work in all file formats and also be fast: you don’t want to wait for 500 pages to be loaded individually before seeing the page with a hit. Keyword in context (aka KWIC view) allows users to see the words before and after a hit in the result list. This is very useful for looking into the content of a document directly from the result list. If a document contains more than one hit, multiple snippets from that document are shown in its result list entry.
  3. Tunable relevance ranking: all web engines are tuned for only one type of relevance ranking: mostly a popularity or page link algorithm. Exploratory searchers don’t want to find only popular documents; they want to find all relevant documents. In order to review them quickly in the result list, it is important to be able to organize or sort the result list on all available meta information: time, date, hit density, but also on any custom key fields that are attached to the document.
  4. Flexible proximity search and support for complex nested Boolean operators: (negotiated) Boolean queries are often large and complex, because they include both inclusive and exclusive keywords combined with AND, OR and NOT. One needs the ability to nest these with brackets, and one needs a proximity, near or preceding operator to specify that certain keywords must occur within the same sentence, the same paragraph, or within X words of each other. This is especially important in long documents with many different sections and chapters: a plain AND operator will retrieve documents that contain Word A AND Word B even if one is at the beginning and the other at the end, completely unrelated to each other.
  5. Quorum search: this is the ultimate combination between precision and recall. Not many vendors have this ability. With a quorum search, one can define a bucket of words (the recall component) and set that at least X of these words need to be in a document (the precision component). It typically looks like 2 of {tree, plant, flower, rose, tulip}. Higher values for X result in higher precision. Larger buckets of words will result in higher recall. Quorum search is perfect for defining complex concepts.
  6. Text and content analytics: the search of the future. These days there are many new tools that add additional searchable metadata to documents; unfortunately, not many search engines use them. Some examples are the extraction of document properties, file properties, entities, facts, events and concepts. Other tools include automatic summaries, machine translation, language detection, and many more. All this additional information provides more search options, but also the ability to export, for instance, all company or individual names that are mentioned in a set of documents. The more richly populated result lists also bring additional options for relevance ranking and advanced visualization.
  7. Faceted search (aka refine results or semantic relevance ranking): the additional content generated by the content analytics provides additional facets we can use to refine our results. For instance, we can define a facet like Country or Person, which will include all countries or persons that are named in a set of documents retrieved with a full text query. A simple click on one of the values of a facet will get you the documents that contain that specific value. Faceted search helps users to find suspicious documents or zoom in on certain tagged documents.
  8. Advanced data visualization: Text analysis is often mentioned in the same sentence as information (or data) visualization; in large part because visualization is one of the viable technical tools for information analysis after unstructured information has been structured.

Instead of showing all results in a (large) tabular result list, it is also possible to plot them on a Google Map, for example, to show their geographic locations. It is also possible to display the results as an interactive hyperbolic tree: the Star Tree. The user may click and drag the star to reveal the different relationships between the search hits, their properties, and their locations. Results can also be presented in a so-called TreeMap for a better understanding of the hierarchical relationships within the data set, based on the meta information. It has been proven that data inconsistencies and complex patterns can be found very easily with such techniques and often much faster than without them.
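
Two of the techniques in the list, proximity search (item 4) and quorum search (item 5), are easy to illustrate with a few lines of Python. This is only a rough sketch under simple assumptions, not how any particular search product implements them: it scans raw text instead of using an index, it splits words with a basic regular expression, and it measures "within X words" as the difference between word positions. The function names and sample sentences are invented for the example.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def quorum_match(text, bucket, minimum):
    """True if at least `minimum` distinct words from `bucket` occur in `text`."""
    present = set(tokenize(text)) & {word.lower() for word in bucket}
    return len(present) >= minimum


def within(text, word_a, word_b, max_distance):
    """True if word_a and word_b occur at most `max_distance` word positions apart."""
    tokens = tokenize(text)
    positions_a = [i for i, t in enumerate(tokens) if t == word_a.lower()]
    positions_b = [i for i, t in enumerate(tokens) if t == word_b.lower()]
    return any(abs(a - b) <= max_distance for a in positions_a for b in positions_b)


# Quorum search: "2 of {tree, plant, flower, rose, tulip}".
bucket = ["tree", "plant", "flower", "rose", "tulip"]
doc_garden = "The rose and the tulip bloom in spring."
doc_forest = "An old tree fell across the road."

print(quorum_match(doc_garden, bucket, 2))  # True: rose and tulip are present
print(quorum_match(doc_forest, bucket, 2))  # False: only tree is present

# Proximity search: a plain AND would match this text, a tight proximity does not.
report = ("The payment was approved in March. Months later, an unrelated "
          "audit mentioned the offshore account.")

print(within(report, "payment", "account", 5))   # False: 13 positions apart
print(within(report, "offshore", "account", 3))  # True: adjacent words
```

Raising the minimum in the quorum call makes the match stricter (higher precision), while adding words to the bucket lets more documents qualify (higher recall). Likewise, widening the proximity distance moves the query back toward the behavior of a plain AND.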

If you are not familiar with advanced searching, then this may be “too much information”, so to speak. If you are a bit dizzy now, that is understandable. The key takeaway is that if you are in a business that requires exploratory search, then you should go beyond the simple “Google” search box. You should use all the technology that is available from true search solutions. You will end up with a professional search dashboard that will help you to explore and find all information, even if you do not know exactly what you are looking for!

You can find more on these topics when you click on the links below. Also feel free to contact me should you require more information:

Advanced Search: http://www.zylab.com/Technology/advanced_search.html

Content analytics and text mining: http://www.zylab.com/Technology/text_mining_and_analytics.html

Advanced data visualization: http://www.zylab.com/Technology/data_visualization.html

Machine translation: http://www.zylab.com/Technology/machine_translation.html
