WikiLeaks Makes the Case for Intelligent and Automatic Redaction

Posted on March 1, 2011


The WikiLeaks scandal involving the release of 250,000 classified State Department cables has put automatic redaction software in the spotlight. Apart from this example of the unauthorized release of confidential information, there are many other use cases for data redaction and examples of where it is critical for operations:

  • Government agencies censor sensitive material when responding to Freedom of Information Act (FOIA) requests
  • Employers and eCommerce providers remove personally identifiable information (PII) that could be intercepted by identity thieves
  • Legal professionals remove PII that is not germane to litigation, as well as privileged material
  • In some cases, such as the UN war crimes tribunals that have received a lot of press coverage lately, witnesses’ lives literally depend upon redaction of their PII. (See the slides from the LegalTech NY keynote on the use of the ZyLAB software in the war crimes tribunals.)

In such instances, missing a redaction simply isn’t an option. However, manual redaction is traditionally labor intensive and costly. For example, if a reviewer earning US$60 per hour can manually apply 60 redactions per hour, each redaction costs $1. At that rate, the cost of redacting a large document set can easily run into thousands of dollars (or more), and the result can still be rife with human errors. And some of the redaction tools currently on the market aren’t robust enough to rely on.
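
The arithmetic above can be written out as a small Python sketch. The function names and the figure of 5,000 redactions are illustrative assumptions; only the US$60 per hour and 60 redactions per hour come from the example in the text.

```python
def cost_per_redaction(hourly_rate: float, redactions_per_hour: float) -> float:
    """Labor cost of one manual redaction."""
    return hourly_rate / redactions_per_hour


def total_cost(num_redactions: int,
               hourly_rate: float = 60.0,
               redactions_per_hour: float = 60.0) -> float:
    """Labor cost of a whole review at a constant manual pace."""
    return num_redactions * cost_per_redaction(hourly_rate, redactions_per_hour)


print(cost_per_redaction(60.0, 60.0))  # 1.0 dollar per redaction
print(total_cost(5000))                # 5000.0 dollars (hypothetical volume)
```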

Simply put: there are varying degrees of redaction capability, and basic redaction techniques for electronic documents carry many risks. Gartner informs its subscribers that “manual and even semiautomatic processes are impractical” due to the huge volumes of information today. Some tools merely place a black “censor” bar on top of the text. That bar can inadvertently shift and reveal the text, or the text can be “retrieved” by certain software products, since the text “under the hood” hasn’t actually been removed from the document.
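
To make the difference concrete, here is a minimal Python sketch of a bar that sits on top of the text versus a redaction that removes the text. The `Document` class and `burn_in` function are hypothetical illustrations, not the format or API of any real product; a document is modeled simply as a string plus a list of overlay ranges.

```python
from dataclasses import dataclass, field

BAR = "\u2588"  # solid block character standing in for the black censor bar


@dataclass
class Document:
    text: str                                   # the underlying ("under the hood") text
    overlays: list[tuple[int, int]] = field(default_factory=list)  # (start, end) ranges

    def render(self) -> str:
        """What a reader sees: overlay ranges drawn as bars (ranges assumed in bounds)."""
        chars = list(self.text)
        for start, end in self.overlays:
            chars[start:end] = BAR * (end - start)
        return "".join(chars)


def burn_in(doc: Document) -> Document:
    """Return a copy whose underlying text no longer contains the covered content."""
    return Document(text=doc.render(), overlays=[])


doc = Document("Source: Jane Doe, Geneva")
start = doc.text.index("Jane Doe")
doc.overlays.append((start, start + len("Jane Doe")))

print(doc.render())             # the name appears as a bar on screen
print("Jane Doe" in doc.text)   # True: the overlay alone leaves the name recoverable

safe = burn_in(doc)
print("Jane Doe" in safe.text)  # False: the name is gone from the stored text
```

In this sketch the on-screen result is identical in both cases; the only difference is whether the sensitive characters still exist in the stored document.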

The Right Approach to Redaction 

The right approach to redaction involves the following functionality and features:

  • Enables users to replace the sensitive text (such as the name of a foreign leader) with a code word.
  • Redacted text is represented by a black censor bar that is burned into the document as opposed to sitting on top and masking the content. The software also enables the users to apply colors other than black to help delineate the different types of redactions that may be present (PII, privilege, classified).
  • Allows for a non-permanent XML overlay, with the option to make the redactions permanent just prior to disclosure.
  • Actually removes the sensitive content, just as Gartner recommends.
  • Produces the redaction logs that judges often require.
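
As a rough illustration of two of these features (code-word replacement and a redaction log), consider the Python sketch below. The `LogEntry` fields, the function name, and the sample names and code words are all assumptions made for the example; they do not describe any vendor's log format.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class LogEntry:
    start: int       # offset of the removed text in the original document
    length: int      # number of characters removed
    code_word: str   # what the text was replaced with
    reason: str      # e.g. "PII", "privilege", "classified"
    creator: str     # who applied the redaction
    timestamp: str   # UTC time in ISO 8601 format


def redact_with_code_words(text: str, code_words: dict[str, str],
                           reason: str, creator: str) -> tuple[str, list[LogEntry]]:
    """Replace each sensitive term with its code word and log every replacement."""
    if not code_words:
        return text, []
    # Longest terms first so a short term never pre-empts a longer one.
    terms = sorted(code_words, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    log: list[LogEntry] = []

    def replace(match: re.Match) -> str:
        code = code_words[match.group(0)]
        log.append(LogEntry(match.start(), len(match.group(0)), code, reason,
                            creator, datetime.now(timezone.utc).isoformat()))
        return code

    return pattern.sub(replace, text), log


redacted, log = redact_with_code_words(
    "President Example met Minister Sample. President Example left early.",
    {"President Example": "[LEADER-1]", "Minister Sample": "[OFFICIAL-2]"},
    reason="classified", creator="reviewer-01",
)
print(redacted)
for entry in log:
    print(entry)
```

Note that the log in this sketch records where and why text was removed, but not the removed text itself, so the log can be disclosed alongside the document.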

Redaction Workflow and Support for the Redaction Business Processes

Some redaction technology provides the ultimate protection of PII and confidential data while also helping to optimize the creation and quality control of redactions. For example, the ideal redaction technology offers the following:

  • Redactions are initially transparent, so a manager can review the applied redactions quickly and efficiently before they are burned in.
  • Redactions on scanned bitmaps as well as click-and-drag capabilities, version control, and more.
  • Redactions can be tagged with additional key field information, which can contain date and time stamps, exemption rules, comments or applicable legal rules, redaction creators, and additional user-defined key fields.
  • Metadata (document properties or other “hidden” information in electronic files) can be stripped from documents.
  • Supports redactions in other languages, including right-to-left and top-to-bottom text. See the following example for a Korean-English redacted document.

Redaction that is Intelligent and Automated

Intelligent Redaction is an automated process, developed by some vendors, that uses search functionality to identify words, phrases, or blocks of text and then automatically redacts all of the “hits.” Users can redact documents manually by drawing a box around the desired text or, using advanced “Intelligent Redaction,” simply perform a search and choose to redact all hits from the results. For example, an administrator could run a concept search query across an entire data set of millions of files and automatically redact any content that matches the search parameters.
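
The search-then-redact idea can be sketched in a few lines of Python. This is a deliberately simplified stand-in: it uses a literal, case-insensitive phrase search rather than a true concept search, and the function name, file names, and `[REDACTED]` mask are assumptions made for the example.

```python
import re


def redact_hits(documents: dict[str, str], query: str,
                mask: str = "[REDACTED]") -> tuple[dict[str, str], dict[str, int]]:
    """Redact every hit for `query` in every document and count hits per document."""
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    redacted: dict[str, str] = {}
    counts: dict[str, int] = {}
    for name, text in documents.items():
        # subn returns the new text and the number of replacements made.
        redacted[name], counts[name] = pattern.subn(lambda _m: mask, text)
    return redacted, counts


docs = {
    "cable-001.txt": "Meeting with John Smith. JOHN SMITH objected.",
    "cable-002.txt": "No names here.",
}
redacted_docs, hit_counts = redact_hits(docs, "john smith")
print(redacted_docs["cable-001.txt"])
print(hit_counts)
```

Returning the per-document hit counts gives a reviewer a quick way to check which files were affected before the redactions are made permanent.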

Additionally, advanced search and text mining technology can use regular expressions to recognize patterns (e.g., credit card numbers, Social Security numbers, license plates) and redact all content matching a pattern. You do not have to specify the exact Social Security or bank account numbers; the technology recognizes everything that follows the stated numeric or alphanumeric form. An example of how this is done in the ZyLAB software is shown below. You can see a variety of regular expressions that recognize, for instance, license plates, cell phone numbers, or dates for Ireland. For the pattern “dates”, a number of recognized patterns are shown.
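
Independent of any particular product, the general technique can be sketched with Python's standard `re` module. The two patterns below (a US Social Security number in the form NNN-NN-NNNN and a 13 to 16 digit card number) are simplified assumptions, not the expressions used by ZyLAB; the Luhn checksum is added here only to reduce false positives on card-like digit runs.

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def luhn_ok(candidate: str) -> bool:
    """Luhn checksum used by payment card numbers."""
    digits = [int(c) for c in candidate if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_patterns(text: str) -> str:
    """Redact SSN-shaped strings and card-shaped strings that pass the Luhn check."""
    text = SSN.sub("[SSN]", text)
    return CARD.sub(lambda m: "[CARD]" if luhn_ok(m.group(0)) else m.group(0), text)


sample = "SSN 123-45-6789, card 4111 1111 1111 1111, order 1234 5678 9012 3456."
print(redact_patterns(sample))
# SSN [SSN], card [CARD], order 1234 5678 9012 3456.
```

The order number in the sample has the same shape as a card number but fails the checksum, so it is left alone. Production-grade patterns would need to cover more formats and locales than this sketch does.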

Role-based Auto Redactions

As the WikiLeaks scandal illustrated, one needs to consider not only pre-emptive redaction of sensitive material but also the actions of people in possession of the information. To aid with this challenge, it is also possible to automatically redact sensitive keywords behind the scenes based on user or role, so that a given level of employees would never have the opportunity to see (much less steal or leak) unredacted content. With these more complex analytics it is possible to find any potential PII and either prompt the user to redact it or simply redact it automatically, reducing the chance that sensitive information is disclosed.
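
A minimal Python sketch of role-based redaction follows. The role names, clearance levels, and keyword list are hypothetical and chosen only for illustration; a real deployment would take them from its own access-control system.

```python
import re

# Hypothetical clearance level per role (higher sees more).
ROLE_CLEARANCE = {"analyst": 1, "manager": 2, "security_officer": 3}

# Hypothetical minimum clearance needed to see each sensitive keyword.
TERM_LEVEL = {"Operation Example": 3, "Jane Doe": 2}


def view_for_role(text: str, role: str, mask: str = "[REDACTED]") -> str:
    """Return the text as a user with `role` is allowed to see it."""
    clearance = ROLE_CLEARANCE.get(role, 0)  # unknown roles get no clearance
    hidden = [term for term, level in TERM_LEVEL.items() if level > clearance]
    if not hidden:
        return text
    # Longest terms first so overlapping keywords are matched in full.
    hidden.sort(key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in hidden), re.IGNORECASE)
    return pattern.sub(lambda _m: mask, text)


memo = "Jane Doe briefed the team on Operation Example."
for role in ("analyst", "manager", "security_officer"):
    print(role, "->", view_for_role(memo, role))
```

Because the redaction is applied when the text is served rather than by the reader, a user in a lower role never receives the unredacted content in the first place.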

Note: With the tools from ZyLAB, it is possible to recognize redaction patterns from a library of more than 200 different entity types in over 400 languages.