Latent Semantic Indexing (LSI) Technology

Latent Semantic Indexing (LSI) is the core machine-learning technique that enables Content Analyst's CAAT and Cerebrant products to identify, represent, and compare concepts that exist within a collection of documents or other forms of unstructured content.

LSI is essentially a mathematical approach to text analytics designed to extract every contextual relation among every term in every text object within a collection. It then generates a vector-space representation of all terms based on those relations using many dimensions (often in the hundreds). Within that space, proximity is a good indicator of conceptual similarity. The result: similarities can be identified based on concepts within the material. Since LSI is mathematics-based, it requires no word lists, taxonomies, or thesauri for CAAT and Cerebrant to accurately identify conceptually related text.
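
The idea can be illustrated with a minimal sketch of textbook LSI: build a term-by-document count matrix, take a truncated singular value decomposition, and compare documents by cosine similarity in the reduced space. This is not Content Analyst's implementation; the toy corpus, the function names, and the choice of k = 2 dimensions are assumptions made purely for illustration (real indexes use far more documents and, as noted above, often hundreds of dimensions).

```python
import numpy as np

# Toy corpus (hypothetical). Documents 0 and 2 share no words, but both
# co-occur with the vocabulary of document 1.
docs = [
    "car engine repair",
    "automobile engine repair",
    "automobile dealer",
    "cat kitten purr",
    "kitten feline purr",
]

def build_term_doc_matrix(docs):
    """Return a (terms x documents) raw count matrix and the sorted vocabulary."""
    vocab = sorted({w for d in docs for w in d.split()})
    row = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            A[row[w], j] += 1.0
    return A, vocab

def lsi_doc_vectors(A, k):
    """Project documents into a k-dimensional space via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Row j is document j, scaled by the k largest singular values.
    return Vt[:k].T * s[:k]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

A, vocab = build_term_doc_matrix(docs)
D = lsi_doc_vectors(A, k=2)

print("raw word overlap, doc 0 vs doc 2:", cosine(A[:, 0], A[:, 2]))
print("LSI similarity,   doc 0 vs doc 2:", cosine(D[0], D[2]))
print("LSI similarity,   doc 0 vs doc 3:", cosine(D[0], D[3]))
```

In this sketch the two vehicle documents that share no words end up close together in the reduced space, while the vehicle and cat documents stay far apart, with no word list or thesaurus involved.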

LSI+ is Content Analyst's Decade-Long Adaptation of LSI

The Content Analyst team has extended and refined the original LSI concept, and the company now holds a number of patents related to text analytics. In addition, Content Analyst provides industry-leading integration support services to ensure fast, seamless integration and continuous enhancements.

Two misconceptions persist regarding LSI technology. The first questions its scalability; the second treats open source LSI libraries as comparable to Content Analyst's proprietary CAAT analytics engine.

CAAT is Content Analyst's proprietary text analytics engine. It relies on LSI technology for much of its core functionality, including conceptual search, clustering, auto-categorization, instant context, and conceptual near-duplicate detection. LSI+ is Content Analyst's extensively enhanced application of LSI, and it is only one part of CAAT's broad set of text analytics capabilities. Dozens of Content Analyst developers have been enhancing, strengthening, and augmenting the core LSI capabilities for more than 10 years. These enhancements give LSI+ both scalability and the functionality to support even the most sensitive and rigorous text analytics applications.

LSI+ Enhancements and Advantages

  1. Relevancy of Results: LSI+ has many capabilities that produce significantly more relevant and accurate results than alternative approaches.
  2. Scalability: The LSI+ engine has vastly superior scalability in terms of the number of documents that it can index and query. It has been successfully used in applications that have indexed hundreds of millions of documents, and billions when measured in aggregate. Few, if any, open source solutions can scale beyond ten thousand documents.
  3. Optimized Queries: LSI+ lets users query much larger indexes two orders of magnitude faster than the open source implementations.
  4. Resource Requirements: LSI+ performs well on standard computing equipment such as a laptop or standard PC. LSI+ has been proven to provide highly precise, fast, and accurate results on millions of documents running on a consumer-grade laptop.
  5. Configurable Indexes: LSI+ provides many configurable options to tune the index, e.g., the ability to change the number of dimensions of the index.
  6. Support for Pre-Defined Entities: LSI+ allows the use of pre-defined entities that enable building indexes with specialized knowledge of certain entities.
  7. Phrase Queries: LSI+ also provides the ability to perform ad-hoc queries on non-compositional phrases and return highly relevant results.
  8. Term Expansion: LSI+ has the ability to search on a concept space and find related terms, including synonyms, acronyms, misspellings, abbreviations and code names.
  9. Higher-Level Operations: LSI+ includes many higher-level features built into the engine, including a variety of clustering algorithms, cluster titling algorithms, categorization, and a SQL-like query capability into the LSI vector space with metadata.
  10. Infrastructure Around LSI+ Engine: LSI+ provides a host of functionality that makes it easier to process input text, resulting in higher-quality indexes that provide significantly more relevant results.
  11. ContentCare®: Content Analyst has earned a solid reputation for outstanding integration support, which includes thorough documentation, roadmap collaboration, live training, code reviews and live technical support from our team located in the Washington DC area.
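
Two of the items above, configurable dimensions (item 5) and term expansion (item 8), can be sketched in generic LSI terms. The example below is only an illustration using NumPy and a toy corpus; `related_terms`, its parameters `k` and `top_n`, and the corpus are hypothetical names chosen here and do not represent the LSI+ or CAAT API.

```python
import numpy as np

# Toy corpus (hypothetical), one whitespace-tokenized document per string.
docs = [
    "car engine repair",
    "automobile engine repair",
    "automobile dealer",
    "cat kitten purr",
    "kitten feline purr",
]

# Term-by-document count matrix.
vocab = sorted({w for d in docs for w in d.split()})
row = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[row[w], j] += 1.0

def related_terms(term, k=2, top_n=4):
    """Rank other terms by cosine similarity to `term` in a k-dimensional space.

    `k` plays the role of the configurable number of index dimensions.
    """
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    T = U[:, :k] * s[:k]  # one row per term in the reduced space
    q = T[row[term]]
    sims = T @ q / (np.linalg.norm(T, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)  # most similar first
    ranked = [(vocab[i], float(sims[i])) for i in order if vocab[i] != term]
    return ranked[:top_n]

for word, score in related_terms("car"):
    print(f"{word:12s} {score:.3f}")
```

Here "automobile" surfaces as a related term for "car" even though the two words never appear in the same document; they are linked only through shared context. Changing `k` changes how aggressively such indirect relations are merged.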

Content Analyst's dedicated software development team has sustained uninterrupted innovation in, and enhancement of, its LSI+ technology for more than 10 years. Our roadmap promises continued enhancements to maintain the company's reputation for superior analytics capabilities among the extensive community of software product companies that depend on CAAT and LSI+ for their customers.