
System Text for Information Extraction

A system for extracting structured information from unstructured text.

Date Posted: October 16, 2008

What is System Text for Information Extraction?

This system enables text-centered enterprise applications by extracting structured information from unstructured text. Unlike previous systems for information extraction, System Text incorporates AQL, a declarative rule language that makes it easy to express precise specifications for complex patterns in text.

Thanks to System Text's sophisticated, cost-based optimizer, these complex rules can run on enterprise-scale workloads with minimal hardware. System Text technology provides state-of-the-art information extraction for Lotus Notes® Live Text, IBM® OmniFind™ Personal Email Search (also available here at alphaWorks®), and several forthcoming IBM products.

This release of System Text for Information Extraction includes the Development Environment component, which supports building and testing extraction rules in AQL. It also includes example AQL rules and documentation for both the rule language and the Development Environment.

How does it work?

System Text for Information Extraction makes the process of writing information extraction code like that of building any other piece of enterprise software.

The system's Development Environment helps annotator writers develop and debug extraction rules. Rules in System Text are written in AQL, a language that combines the familiar declarative syntax of SQL with the expressive power of IBM's algebraic extraction technology. An example AQL rule is shown below:

[Figure: example AQL rule]
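
To give a feel for the kind of pattern a declarative extraction rule describes, here is a minimal Python sketch. It is not AQL and does not use any System Text API. The helper names (extract_dictionary, extract_regex, follows_within), the name list, and the phone pattern are all assumptions made for illustration. The rule it expresses is: "a known first name followed within 30 characters by a phone number."

```python
import re
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """A matched region of the document, with character offsets."""
    begin: int
    end: int
    text: str


# Hypothetical inputs: a tiny name dictionary and a 7-digit phone pattern.
FIRST_NAMES = ["Alice", "Bob"]
PHONE_PATTERN = r"\b\d{3}-\d{4}\b"


def extract_regex(pattern: str, doc: str) -> List[Span]:
    """Return every non-overlapping match of a regular expression."""
    return [Span(m.start(), m.end(), m.group()) for m in re.finditer(pattern, doc)]


def extract_dictionary(entries: List[str], doc: str) -> List[Span]:
    """Return every whole-word, case-insensitive match of a dictionary entry."""
    # Longest entries first so that longer names win over their prefixes.
    ordered = sorted(entries, key=len, reverse=True)
    alternation = "|".join(re.escape(e) for e in ordered)
    pattern = rf"\b(?:{alternation})\b"
    return [
        Span(m.start(), m.end(), m.group())
        for m in re.finditer(pattern, doc, flags=re.IGNORECASE)
    ]


def follows_within(left: Span, right: Span, max_gap: int) -> bool:
    """True if `right` starts at most `max_gap` characters after `left` ends."""
    return 0 <= right.begin - left.end <= max_gap


def person_phone(doc: str) -> List[Tuple[Span, Span]]:
    """Pair each name with each phone number that follows it closely."""
    names = extract_dictionary(FIRST_NAMES, doc)
    phones = extract_regex(PHONE_PATTERN, doc)
    return [(n, p) for n in names for p in phones if follows_within(n, p, 30)]


if __name__ == "__main__":
    text = "Call Alice at 555-1234 today."
    for name, phone in person_phone(text):
        print(name.text, phone.text)
```

The function person_phone states what to find (two kinds of matches and a proximity condition) without saying in which order to evaluate them, which is the property that lets a system choose an execution strategy on the developer's behalf.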

The Development Environment provides facilities for managing AQL rules and dictionary files, as well as for testing rules on collections of representative documents. Developers can add their own document collections and interactively explore the results of their annotators.

The advantages of System Text for Information Extraction go beyond the AQL language. Extracting information from text can be a CPU-intensive task, and making rules run efficiently has traditionally been a major problem for developers. System Text for Information Extraction addresses this problem by relieving developers of the burden of performance tuning. Behind the scenes, the system automatically optimizes rule execution for maximum throughput, allowing the developer to concentrate on building more accurate rules.
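
The general idea of cost-based optimization can be sketched in a few lines of Python. This is not System Text's optimizer, whose internals are not described here. The two plans, the cost formula, and every name in the code (plan_scan_then_join, plan_name_then_probe, choose_plan, expected_names) are hypothetical. The sketch shows two ways to evaluate the same rule, "a name followed within 30 characters by a phone number," and a crude estimate used to pick between them.

```python
import re
from typing import Callable, List, Tuple

Pair = Tuple[str, str]

NAME = re.compile(r"\b(?:Alice|Bob)\b")   # stand-in for a name dictionary
PHONE = re.compile(r"\b\d{3}-\d{4}\b")    # stand-in for a phone pattern
MAX_GAP = 30                              # max characters between name and phone
PHONE_LEN = 8                             # length of "ddd-dddd"


def plan_scan_then_join(doc: str) -> List[Pair]:
    """Plan A: run both extractors over the whole document, then join."""
    names = list(NAME.finditer(doc))
    phones = list(PHONE.finditer(doc))
    return [
        (n.group(), p.group())
        for n in names
        for p in phones
        if 0 <= p.start() - n.end() <= MAX_GAP
    ]


def plan_name_then_probe(doc: str) -> List[Pair]:
    """Plan B: find names first, then search for phones only near each name."""
    pairs = []
    for n in NAME.finditer(doc):
        # Search only the window where a qualifying phone could lie. The extra
        # character lets the trailing \b see the real character after the phone.
        window_end = n.end() + MAX_GAP + PHONE_LEN + 1
        for p in PHONE.finditer(doc, n.end(), window_end):
            if p.start() - n.end() <= MAX_GAP:
                pairs.append((n.group(), p.group()))
    return pairs


def choose_plan(doc_length: int, expected_names: float) -> Callable[[str], List[Pair]]:
    """Pick the plan with the lower estimated phone-matching cost.

    Both plans scan the document for names, so only the number of characters
    the phone pattern must examine is compared. `expected_names` is an assumed
    statistic, e.g. gathered from a sample of representative documents.
    """
    costs = {
        plan_scan_then_join: float(doc_length),
        plan_name_then_probe: expected_names * (MAX_GAP + PHONE_LEN + 1),
    }
    return min(costs, key=costs.get)


if __name__ == "__main__":
    text = "Call Alice at 555-1234 today. " + "x" * 500 + " Bob: 555-9999."
    plan = choose_plan(len(text), expected_names=2)
    print(plan.__name__, plan(text))
```

When names are rare relative to document length, the estimate favors probing near each name. When names are dense, a single full scan is estimated to be cheaper. The developer's rule is the same in both cases.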

This release of System Text for Information Extraction provides a preview of the capabilities of AQL. Forthcoming versions will support compilation of AQL rules into UIMA (Unstructured Information Management Architecture) annotators.

About the technology author(s)

Frederick Reiss, of IBM's Almaden Research Center, focuses on rule languages, infrastructure, and automated performance optimization technology for enterprise-scale, rule-based information extraction.

Rajasekar Krishnamurthy, Ph.D. (Almaden), focuses on scalability and quality in the context of large-scale information extraction systems and is building a declarative rule-based information extraction system.

Yunyao Li, Ph.D. (Almaden), focuses on designing, developing, and analyzing systems that can improve the accessibility of information for a wide spectrum of users in distributed, heterogeneous data environments.

Suresh Thalamati, a developer in the IBM IM Advanced Technology Group, focuses on Web 2.0 technologies, text analytics, and database engines.

Ganesh Ramakrishnan, Ph.D. (India), focuses on information extraction, especially rule and feature induction using inductive logic programming, speeding up rule-based information extraction techniques, and environments for developing and organizing rules.

Sriram Raghavan, Ph.D. (Almaden), develops powerful semantic search technology that exploits structured information extracted from text to enable high-precision keyword information retrieval.

Shivakumar Vaithyanathan, Ph.D. (Almaden), manages the Infrastructure for Intelligent Information Systems department at Almaden. He is an associate editor for the Journal of Statistical Analysis and Data Mining.
