Audio and Phonetic Search: What’s Working and What’s Not

Posted on January 7, 2011


Companies are on alert when it comes to endless amounts of data stored within the enterprise. But they must be equally vigilant about information that exists as spoken word. It’s a major challenge, however, to effectively search audio files.

Why Audio Files Matter

Organizations don’t want to leave themselves open to a lawsuit or security breach. That’s why many need to search e-mails, e-docs and other traditional forms of Electronically Stored Information (ESI). But they often overlook a major source of risk: the wealth of data stored on audio files.

Audio data exists on traditional fixed-line phone systems, VoIP, mobile and specialist platforms like Skype or MSN Live. Recording some of these conversations is required under regulatory conditions, such as Conduct of Business (COBS) or the Federal Rules of Civil Procedure (FRCP), and such documentation is admissible as evidence. These recordings may contain critical pieces of information required to ensure a successful case outcome. This means organizations need effective solutions to search such files. But what solution works best? That’s what companies are constantly struggling with.

The Good, Bad and Ugly in Audio and Speech Search

There are essentially three distinct ways to conduct what’s called “audio discovery”:

1. The Human Listening Approach

This “old school” method involves hiring someone to sit down and listen to the recordings. This is highly inefficient, but there are some advantages. It allows for human interpretation and judgment, and a skilled listener can pick up on subtle nuances and inflections. However, there are some major, obvious weaknesses:

  •  A person can only listen to one call at a time.
  •  People have a limited attention span and fallible memory, which restricts the volume of data that can be recalled and the number of terms that can be searched for.
  •  Even the most skilled analysts are limited by working in real time. They’re human, and, therefore, can miss critical items. They can’t work 24/7 either.

 The upshot? Attempting to cover all audio with human listening is prohibitively expensive. It’s practical only in the most critical cases.

2. Speech-to-Text Technology

Speech-to-text technology, a.k.a. Large Vocabulary Continuous Speech Recognition (LVCSR), converts the speech content of audio into text, processing it using a large vocabulary dictionary. This technique requires a sophisticated language model for good recognition, resulting in heavy processing needs. If a certain word or name is not included in the dictionary, it will never be found.
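The dictionary limitation can be illustrated with a toy sketch in Python. It is a simplification under stated assumptions: the vocabulary, the company name "acmecorp" and both helper functions are hypothetical, and a real recognizer would usually substitute similar-sounding in-vocabulary words rather than emit a placeholder. Either way, the out-of-vocabulary name never reaches the transcript.

```python
# Hypothetical toy vocabulary; a real LVCSR dictionary holds far more words.
VOCABULARY = {"please", "call", "the", "trader", "about", "contract"}

def transcribe(spoken_words, vocabulary):
    """Mimic a dictionary-bound recognizer: unknown words become a placeholder."""
    return [word if word in vocabulary else "<unk>" for word in spoken_words]

def search_transcript(transcript, term):
    """Return the positions at which the term occurs in the transcript."""
    return [i for i, word in enumerate(transcript) if word == term]

# "acmecorp" is an invented company name that is not in the vocabulary.
spoken = ["please", "call", "acmecorp", "about", "the", "contract"]
transcript = transcribe(spoken, VOCABULARY)

print(transcript)
print(search_transcript(transcript, "contract"))  # in-vocabulary word is found
print(search_transcript(transcript, "acmecorp"))  # out-of-vocabulary name is lost
```

Finding the name would mean adding it to the dictionary and running recognition over the audio again.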

Once a transcript exists, speech-to-text can search it many times faster than real time; typically 2 to 4 GB of text may be searched in 0.1 seconds. (That’s the equivalent of more than a quarter-million pages of text, by conservative estimates.) However, converting the audio to that searchable text is processor-hungry and frequently runs only two to three times faster than real time. When ad hoc searching needs to be applied, a large dictionary is required to carry out the recognition. This further limits the volume of data that can be practically processed, and the speed at which it can be processed.

For these reasons, speech-to-text has also proved disappointing over the past few decades.

3. Pure Phonetic Search Technology

This leaves phonetic search as the best solution for searching large collections of audio and video files, especially when many speakers take part in the same conversation. This search technology transforms audio recordings into a phonetic representation, rather than written words. Next, user queries are converted into phoneme (sound-based) sequences, which are then matched against the phonemes recognized in the recordings. These matches are made possible by what’s commonly called “fuzzy” matching, which scores how closely two phoneme sequences agree instead of demanding an exact true/false match. That tolerance matters when the data (in this case, sound) is imprecise.
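The matching step can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: it assumes the phoneme sequences are already available (the hand-written ARPAbet-style symbols below are hypothetical), and it uses plain edit distance as a stand-in for whatever fuzzy measure a real engine applies.

```python
def best_fuzzy_match(query, stream):
    """Find the best approximate occurrence of `query` inside `stream`.

    Both arguments are lists of phoneme symbols. Returns (end, distance):
    `end` is the index just past the match in `stream`, and `distance` is
    the number of phoneme substitutions, insertions or deletions needed.
    """
    m = len(query)
    prev = list(range(m + 1))      # before any audio: every query phoneme is missing
    best = (0, prev[m])
    for end, phoneme in enumerate(stream, start=1):
        cur = [0] * (m + 1)        # cur[0] == 0 lets a match start anywhere
        for j in range(1, m + 1):
            substitution = prev[j - 1] + (query[j - 1] != phoneme)
            extra_in_audio = prev[j] + 1
            missing_in_audio = cur[j - 1] + 1
            cur[j] = min(substitution, extra_in_audio, missing_in_audio)
        if cur[m] < best[1]:
            best = (end, cur[m])
        prev = cur
    return best

# Hand-written phonemes for the query "Enron" (an assumption for illustration).
query = ["EH", "N", "R", "AA", "N"]
# Recognized stream for "the Enron deal", with one vowel misrecognized (AA -> AH).
stream = ["DH", "AH", "EH", "N", "R", "AH", "N", "D", "IY", "L"]

end, distance = best_fuzzy_match(query, stream)
print(end, distance)
```

Despite the recognition error, the query is still located, one edit away from a perfect match. An exact comparison would have missed it.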

 Here’s why phonetic search remains the best option: It includes a model that can interpret the way words are pronounced and is therefore not limited to only searching for words in a dictionary. This means that searches for personal or company names or brands can be successfully conducted – without the need to “re-ingest” the data.

 A search using phonetic recognition will run up to 80,000 times faster than real time. Using a single core of a typical Intel processor, eight hours of audio data can be searched in under a second. Preparation of the searchable content is conducted at rates up to 80 times faster than real time.
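The arithmetic behind these figures is easy to check. The short Python sketch below uses only the speed-up factors quoted in this article (80,000x for phonetic search, 80x for phonetic preparation, and two to three times real time for speech-to-text conversion); the function name is an assumption made for illustration.

```python
def processing_seconds(audio_hours, speedup):
    """Seconds needed to process the audio at `speedup` times real time."""
    return audio_hours * 3600 / speedup

audio_hours = 8  # the eight hours of audio quoted above

search = processing_seconds(audio_hours, 80_000)    # phonetic search
indexing = processing_seconds(audio_hours, 80)      # phonetic preparation
stt_fast = processing_seconds(audio_hours, 3)       # speech-to-text, best case
stt_slow = processing_seconds(audio_hours, 2)       # speech-to-text, worst case

print(f"Phonetic search:   {search:.2f} s")
print(f"Phonetic indexing: {indexing / 60:.0f} min")
print(f"Speech-to-text:    {stt_fast / 3600:.1f} to {stt_slow / 3600:.1f} h")
```

At the quoted peak rate, eight hours of audio takes about a third of a second to search, which is consistent with the “under a second” claim, and about six minutes to prepare.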

Other advantages include the ability to set a threshold that specifies where to cut off the search results. Phonetic search also provides a recognition-confidence level that can be used for relevance ranking.
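As a rough sketch of how a cut-off threshold and confidence-based ranking might fit together, consider the Python below. The `Hit` record, its field names, the file names and the scores are all hypothetical; real engines expose their own result formats.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    recording: str         # hypothetical file name
    offset_seconds: float  # where in the recording the match was found
    confidence: float      # recognition confidence between 0.0 and 1.0

def rank_hits(hits, threshold):
    """Drop hits below the threshold, then sort the rest by confidence."""
    kept = [hit for hit in hits if hit.confidence >= threshold]
    return sorted(kept, key=lambda hit: hit.confidence, reverse=True)

# Invented example results for a single query.
hits = [
    Hit("call_017.wav", 42.5, 0.61),
    Hit("call_003.wav", 310.0, 0.93),
    Hit("call_112.wav", 7.2, 0.38),
    Hit("call_045.wav", 128.9, 0.77),
]

for hit in rank_hits(hits, threshold=0.6):
    print(f"{hit.confidence:.2f}  {hit.recording} @ {hit.offset_seconds}s")
```

Lowering the threshold returns more candidate hits at the cost of more false positives; raising it does the opposite. That trade-off is what makes the search tunable.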

Phonetic Search for Compliance and eDiscovery

Phonetic-search solutions are needed now more than ever. In order to combat market abuse, insider dealing and market manipulation, the U.K.’s Financial Services Authority (FSA) now requires organizations that handle client orders to record and maintain records of transactions conducted over telephone lines. These records must be “readily accessible” should the relevant authorities require them. FRCP regulations in the U.S. now allow “sound recordings” to be considered for inclusion in the list of discoverable items that may be requested as part of case preparation and evidence gathering. The wider implications of Sarbanes-Oxley and SEC regulations also influence the frequency with which audio files are called upon as a source of evidence.

Then, there is the simple time-savings argument. In the Enron investigations, the FBI needed 10 to 12 people working seven days a week for two to three months to transcribe 2,800 hours of audio. Industry experts then had to sift through the recordings to identify presentable evidence. This process ultimately produced 10 hours of audio available for presentation, representing 82 separate exhibits.

The Enron scandal represented perhaps the best public example of how recordings are crucial to demonstrate the intent of call participants. Locating phrases and playing back those recordings provides an additional layer of information, otherwise discernible only by human listening. Written text is not able to fully communicate intent or emotion. Essentially, there is no substitute for actually hearing the audio concerned.


Fuzzy Matching: Why It Works

At the heart of the effectiveness of the phonetic-search solution is the quality of the fuzzy-search technology. Instead of aiming for an impossible 100 percent correct speech-to-text transcript, both speech and end-user queries are converted into phonemes. The phonemes in the query are then matched against the phonemes recognized in the sound collections using a fuzzy search. The result: a completely tunable search tool that can be used in eDiscovery, law enforcement and compliance applications where 100 percent recall is paramount.

Phonetic search engines use a fraction of the hardware required by traditional solutions to deliver greater depth of audio search. They have the flexibility to use multiple search terms, leading to greater accuracy and relevancy of results. And the exact number of audio files relating to a specific topic can be easily determined, even across extremely large data sets.

 Given this, why opt for any other kind of audio search?