LANGUAGE ANALYTICS
How It Works
Thanks in part to its roots in the intelligence community, the CAAT engine can analyze documents in any language that can be expressed in Unicode-B (which includes Middle Eastern and Asian languages). CAAT can work multilingually as well as cross-lingually, an approach that has no parallel in the search world, and offers a number of powerful features for document search and classification.
In multilingual mode, CAAT permits searchers to use any Unicode-B language for document search or categorization and to receive results in that same language. There is no need for a “German” or “Spanish” version of CAAT’s APIs.
In primary language identification mode, CAAT automatically groups documents based on the primary language in which they were written.
In cross-lingual mode, CAAT can index sets of parallel corpora (identical documents translated into different languages) in a “training mode” (i.e., these are used only to “teach” the engine the nuances of each language), and then allow users to perform document searches in one language and find relevant documents in other languages without prior translation.
How does this work? Put simply, the CAAT engine trains itself on how each language relates conceptually to the other language or languages being indexed. It is akin to understanding that “Comment allez-vous?” in French actually means “How are you?” in English, rather than “How do you go?” (the literal translation).
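The idea of learning from parallel documents can be illustrated with a small Python sketch. This is not CAAT's algorithm, which this paper does not describe; it is a deliberately simplified stand-in in which every term is represented by the training pairs it appears in, so that terms from different languages that occur in the same translated documents end up with similar representations. All names and sample sentences below are hypothetical.

```python
from collections import Counter
import math

# Hypothetical parallel training pairs: (English, French) versions of the same text.
PAIRS = [
    ("the contract was signed", "le contrat a été signé"),
    ("the invoice was paid", "la facture a été payée"),
    ("the weather is nice", "le temps est beau"),
]

def tokenize(text):
    return text.lower().split()

def train(parallel_pairs):
    """Map each term (in any language) to counts over training-pair indices."""
    term_vectors = {}
    for idx, versions in enumerate(parallel_pairs):
        for text in versions:
            for term in tokenize(text):
                term_vectors.setdefault(term, Counter())[idx] += 1
    return term_vectors

def embed(text, term_vectors):
    """Represent a text as the sum of its known terms' vectors."""
    vec = Counter()
    for term in tokenize(text):
        vec.update(term_vectors.get(term, {}))
    return vec

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def search(query, documents, term_vectors):
    """Rank documents by similarity to the query, best match first."""
    q = embed(query, term_vectors)
    scored = [(cosine(q, embed(d, term_vectors)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

term_vectors = train(PAIRS)
results = search("contract", ["facture payée", "contrat signé"], term_vectors)
```

In this toy setup, the English query “contract” ranks the French document “contrat signé” first even though the two share no words, because both map to the same training pair. The parallel pairs serve only as training material, matching the “training mode” described above.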
How It Is Used in eDiscovery
The legal world is now global and multilingual: few companies operate only within the US, and fewer still deal only with US-based concerns. When foreign parties are involved, eDiscovery can become considerably more complex.
CAAT is deployed by numerous partners around the world; CAAT’s Unicode-B grounding ensures that neither the user’s language nor the target language is an impediment.
In cases where multiple languages are present in ESI (something that occurs more often as cases become increasingly global), CAAT’s primary language identification quickly sorts documents into language-based categories. This is crucial for further processing and review: unless you are fluent in the appropriate languages, telling, say, Swedish documents from Norwegian ones is nearly impossible. Double-byte or ideographic languages (such as Chinese and Japanese) further compound the problem, as readers of Western languages often cannot even recognize the script. For CAAT, this is not a problem.
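To make the sorting step concrete, here is a minimal Python sketch of grouping documents by primary language. It uses character-trigram profiles compared by cosine similarity, a common textbook technique chosen here for illustration; it is an assumption, not a description of how CAAT identifies languages. The sample texts, language codes, and function names are all hypothetical.

```python
from collections import Counter, defaultdict
import math

# Tiny hypothetical training samples; a real system would use far more text,
# especially to separate closely related languages.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et le chat dort dans la maison",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}

def trigrams(text):
    """Count overlapping three-character sequences, padded with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}

def primary_language(document):
    """Return the language whose trigram profile is closest to the document's."""
    doc_profile = trigrams(document)
    return max(PROFILES, key=lambda lang: cosine(doc_profile, PROFILES[lang]))

def group_by_language(documents):
    """Sort documents into language-based categories."""
    groups = defaultdict(list)
    for doc in documents:
        groups[primary_language(doc)].append(doc)
    return dict(groups)

groups = group_by_language([
    "the dog sleeps in the house",
    "le chat dort dans la maison",
    "die katze schläft im haus",
])
```

Each document lands in the category of its best-matching profile, so reviewers can be assigned by language without first reading the documents.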
Finally, the value of cross-lingual search is clear: it cuts translation costs substantially, because only relevant documents need to be translated. Cross-lingual analytics also saves valuable search time for users. Even with on-site translators and machine-assisted translation, it is difficult to translate more than 3-4 pages per person per hour. A large translation effort undertaken before any real relevance searching can therefore add weeks or more to the discovery period. With Content Analyst, that challenge is eliminated: only the relevant documents need be translated, and that translation can take place after initial discovery, while further processing and review are ongoing.
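The scale of the savings follows from simple arithmetic, sketched below in Python. Only the translation rate of 3-4 pages per person per hour comes from the text above (3.5 is used as a midpoint); the corpus size, team size, working hours, and number of relevant pages are hypothetical figures chosen purely for illustration.

```python
def translation_days(pages, translators, pages_per_hour=3.5, hours_per_day=8):
    """Working days needed for a team to translate the given number of pages."""
    return pages / (pages_per_hour * translators * hours_per_day)

# Hypothetical matter: 10,000 foreign-language pages and 5 translators.
translate_everything = translation_days(10_000, 5)   # about 71 working days

# Hypothetical outcome: cross-lingual search flags 500 pages (5%) as relevant.
translate_relevant = translation_days(500, 5)        # about 3.6 working days
```

Under these assumptions, translating everything up front takes roughly 14 working weeks, while translating only the relevant pages takes under a week and can proceed in parallel with review.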