NEAR-DUPLICATE DOCUMENT IDENTIFICATION

CAAT’s Dynamic Clustering includes the ability to group near-duplicate documents together based on concepts. The challenge in identifying near-duplicates is finding documents that are almost identical in content but may in fact be paraphrased. Near-duplication technologies from most other companies identify only documents that share a preponderance of the same phrases and sentences. CAAT identifies those documents as very near-duplicates, but also finds documents that are conceptually similar, that is, essentially paraphrased. Near-duplicate detection is an automatic operation: CAAT’s conceptual analytics identify and cluster documents whose content is highly similar in concept.

CAAT returns a numerical value for each document in a cluster of near-duplicates, so users can quickly determine just how nearly duplicate that document is. This is especially valuable in markets such as Legal Review, where the ability to identify and “bulk-tag” near-duplicate documents provides dramatic savings in cost and time.
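
To make the idea of a per-document similarity value concrete, the following is a minimal Python sketch of threshold-based near-duplicate grouping. It is not CAAT’s method: it assumes a simple bag-of-words cosine similarity, which captures only shared wording, whereas CAAT is described as comparing documents by concept. The function names, the 0.8 threshold, and the sample documents are all hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words term-count vector (lexical, not conceptual)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors, in the range 0.0 to 1.0."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_near_duplicates(docs, threshold=0.8):
    """Greedily group documents around a pivot document.

    Returns a list of (pivot_id, members), where members is a list of
    (doc_id, score) pairs and score is the similarity to the pivot.
    The threshold is an assumed, illustrative value.
    """
    vectors = {doc_id: vectorize(text) for doc_id, text in docs.items()}
    clusters = []
    for doc_id, vec in vectors.items():
        for pivot_id, members in clusters:
            score = cosine(vectors[pivot_id], vec)
            if score >= threshold:
                members.append((doc_id, score))
                break
        else:
            # No existing cluster is close enough, so this document starts a new one.
            clusters.append((doc_id, [(doc_id, 1.0)]))
    return clusters


if __name__ == "__main__":
    sample = {
        "a": "The contract was signed on Monday by both parties",
        "b": "The contract was signed on Monday by both parties today",
        "c": "Lunch menu for the cafeteria this week",
    }
    for pivot, members in cluster_near_duplicates(sample):
        print(pivot, [(doc_id, round(score, 3)) for doc_id, score in members])
```

In this sketch the score attached to each member plays the role of the numerical value described above: a reviewer could bulk-tag every document in a cluster whose score exceeds a chosen cutoff. A paraphrased document with little shared vocabulary would score low here, which is exactly the limitation of phrase-based approaches that concept-based analysis is meant to address.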

 
