
Unstructured Information Management Architecture SDK

A Java SDK that supports the implementation, composition, and deployment of applications working with unstructured information.


Date Posted: December 16, 2004

Update: September 6, 2007

IBM UIMA wrapper implementation that can run IBM UIMA components in Apache UIMA 2.2.

What is the Unstructured Information Management Architecture (UIMA) SDK?

Unstructured information management (UIM) applications are software systems that analyze unstructured information (text, audio, video, images, etc.) to discover, organize, and deliver relevant knowledge to the user. In analyzing unstructured information, UIM applications make use of a variety of analysis technologies, including statistical and rule-based Natural Language Processing (NLP), Information Retrieval (IR), machine learning, and ontologies. IBM®'s UIMA is an architectural and software framework that supports creation, discovery, composition, and deployment of a broad range of analysis capabilities and the linking of them to structured information services, such as databases or search engines. The UIMA framework provides a run-time environment in which developers can plug in and run their UIMA component implementations, along with other independently-developed components, and with which they can build and deploy UIM applications. The framework is not specific to any IDE or platform.

This technology, the UIMA SDK (Software Development Kit), is an all-Java™ implementation of the UIMA framework. It supports the implementation, description, composition, and deployment of UIMA components and applications, and it provides an Eclipse-based development environment with a set of tools and utilities for using UIMA. Components written in C++ can be included through an optional C++ enablement layer, which also enables UIMA components written in Perl, Python, and Tcl.

Improving text search is one large application area of text analysis, though not the only one. By detecting important terms and topics within documents, semantic search engines make it possible to search for concepts and relationships instead of keywords. IBM's enterprise search solution, IBM OmniFind® Enterprise Edition, has such semantic search capabilities. It allows UIMA annotators to be plugged into the OmniFind processing flow, so that semantic search can be performed on the extracted concepts.

New users of UIMA are strongly encouraged to use the UIMA version published by Apache, where all new development takes place. The UIMA SDK can still be obtained from four locations:

  • The UIMA SDK versions on alphaWorks® are the "older" IBM UIMA implementations. There will be no further development on these versions. The alphaWorks SDK versions are still available for users who need to integrate work with the IBM UIMA versions. A C++ enablement layer is available for this version.
  • The UIMA SDK on developerWorks is the "OmniFind-compatible" version of the SDK. It is intended for users who want to develop and deploy semantic search solutions with OmniFind or solutions that take advantage of OmniFind's capabilities for enterprise-scale document crawling and extraction. The developerWorks SDK is tested for compatibility with a specific OmniFind version and will be updated to keep it in sync with new OmniFind releases. As the SDK evolves, prior versions will still be available on developerWorks, to ensure that each supported OmniFind version has a corresponding SDK. For customers who have an OmniFind license, this SDK is supported via the IBM support channels and also via the developerWorks forum.
  • The Java and C++ source code for the IBM UIMA alphaWorks versions is available at SourceForge.
  • Current UIMA development for the Java and C++ versions is done at the Apache Software Foundation in the open-source community as part of the Apache UIMA incubation project. The latest Apache UIMA versions can be downloaded there. Further information about the UIMA standards activities can be found on OASIS.

Besides the older IBM UIMA releases, the alphaWorks UIMA pages also contain some additional components for the newer Apache UIMA releases. Currently, two packages are available that enhance the functionality of Apache UIMA:

  • SemanticSearch 2.1: The SemanticSearch package is based on Apache UIMA and provides a full-featured semantic search engine.
  • IBM UIMA wrapper: The IBM UIMA wrapper package enables you to run IBM UIMA components using Apache UIMA 2.2 or above. This package is designed for projects and products that migrate to Apache UIMA but also still need to be able to run older IBM UIMA components.

How does it work?

UIMA is an architecture in which basic building blocks called Analysis Engines (AEs) are composed in order to analyze a document. At the heart of AEs are the analysis algorithms that do all the work to analyze documents and record analysis results (for example, detecting person names). These algorithms are packaged within components that are called Annotators. AEs are the stackable containers for annotators and other analysis engines.
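
The nesting described above can be sketched in plain Java. This is a toy model, not the UIMA SDK API: the names Engine, CapitalizedWordAnnotator, NumberAnnotator, and AggregateEngine are hypothetical, and results are recorded as simple strings only to keep the example short. It shows one idea only: an aggregate engine can contain both annotator-style engines and other aggregates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AnalysisEngineSketch {

    /** Hypothetical stand-in for an analysis engine (not a UIMA interface). */
    interface Engine {
        void process(String document, List<String> results);
    }

    /** Annotator-style engine: records every capitalized token. */
    static class CapitalizedWordAnnotator implements Engine {
        @Override
        public void process(String document, List<String> results) {
            for (String token : document.split("\\s+")) {
                if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
                    results.add("Capitalized:" + token);
                }
            }
        }
    }

    /** Annotator-style engine: records every token made only of digits. */
    static class NumberAnnotator implements Engine {
        @Override
        public void process(String document, List<String> results) {
            for (String token : document.split("\\s+")) {
                if (token.matches("\\d+")) {
                    results.add("Number:" + token);
                }
            }
        }
    }

    /** Stackable container: runs its delegates in order; delegates may be aggregates too. */
    static class AggregateEngine implements Engine {
        private final List<Engine> delegates;

        AggregateEngine(List<Engine> delegates) {
            this.delegates = delegates;
        }

        @Override
        public void process(String document, List<String> results) {
            for (Engine delegate : delegates) {
                delegate.process(document, results);
            }
        }
    }

    /** Builds a two-level aggregate and runs it over one document. */
    static List<String> analyze(String document) {
        Engine inner = new AggregateEngine(Arrays.<Engine>asList(new CapitalizedWordAnnotator()));
        Engine outer = new AggregateEngine(Arrays.<Engine>asList(inner, new NumberAnnotator()));
        List<String> results = new ArrayList<>();
        outer.process(document, results);
        return results;
    }

    public static void main(String[] args) {
        // prints [Capitalized:Alice, Capitalized:Bob, Number:42]
        System.out.println(analyze("Alice paid 42 dollars to Bob"));
    }
}
```

Because the aggregate implements the same interface as its parts, callers do not need to know whether they hold a single annotator or a whole stack of engines.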

How Annotators represent and share their results is an important part of the UIMA architecture. To enable composition and reuse, UIMA defines a Common Analysis Structure (CAS) precisely for these purposes. The CAS is an object-based container that manages and stores typed objects having properties and values. Object types may be related to each other in a single-inheritance hierarchy. Annotators are given a CAS having the subject of analysis (the document), in addition to any previously created objects (from annotators earlier in the pipeline), and they add their own objects to the CAS. The CAS serves as a common data object, shared among the annotators that are assembled for an application.
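
To make the role of the CAS more concrete, here is a minimal plain-Java sketch of the idea. It is an illustration under simplifying assumptions, not the UIMA CAS API: the classes Type, TypedObject, and Cas, the type names Annotation, Token, and PersonName, and the properties "begin" and "end" are all hypothetical. The sketch shows typed objects with properties, a single-inheritance type hierarchy, and a second annotator step that reads objects created by an earlier one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CasSketch {

    /** A type with at most one parent, giving a single-inheritance hierarchy. */
    static class Type {
        final String name;
        final Type parent;

        Type(String name, Type parent) {
            this.name = name;
            this.parent = parent;
        }

        boolean isSubtypeOf(Type other) {
            for (Type t = this; t != null; t = t.parent) {
                if (t == other) return true;
            }
            return false;
        }
    }

    /** A typed object holding named properties and their values. */
    static class TypedObject {
        final Type type;
        final Map<String, Object> properties = new LinkedHashMap<>();

        TypedObject(Type type) {
            this.type = type;
        }

        TypedObject set(String name, Object value) {
            properties.put(name, value);
            return this;
        }
    }

    /** Toy container: the document plus every object added so far. */
    static class Cas {
        final String documentText;
        private final List<TypedObject> objects = new ArrayList<>();

        Cas(String documentText) {
            this.documentText = documentText;
        }

        void add(TypedObject object) {
            objects.add(object);
        }

        /** Returns a new list of all objects whose type is the given type or a subtype. */
        List<TypedObject> select(Type type) {
            List<TypedObject> matches = new ArrayList<>();
            for (TypedObject o : objects) {
                if (o.type.isSubtypeOf(type)) matches.add(o);
            }
            return matches;
        }
    }

    static final Type ANNOTATION = new Type("Annotation", null);
    static final Type TOKEN = new Type("Token", ANNOTATION);
    static final Type PERSON_NAME = new Type("PersonName", ANNOTATION);

    /** Runs two annotator-like steps over one shared Cas (text split on single spaces). */
    static Cas analyze(String text) {
        Cas cas = new Cas(text);

        // Step 1: a tokenizer adds one Token per space-separated word.
        int begin = 0;
        for (String word : text.split(" ")) {
            cas.add(new TypedObject(TOKEN).set("begin", begin).set("end", begin + word.length()));
            begin += word.length() + 1;
        }

        // Step 2: a naive name detector reads the Tokens and adds PersonName objects.
        for (TypedObject token : cas.select(TOKEN)) {
            int start = (Integer) token.properties.get("begin");
            if (Character.isUpperCase(text.charAt(start))) {
                cas.add(new TypedObject(PERSON_NAME)
                        .set("begin", start)
                        .set("end", token.properties.get("end")));
            }
        }
        return cas;
    }

    public static void main(String[] args) {
        Cas cas = analyze("Ada wrote code");
        System.out.println(cas.select(TOKEN).size());       // prints 3
        System.out.println(cas.select(PERSON_NAME).size()); // prints 1
        System.out.println(cas.select(ANNOTATION).size());  // prints 4
    }
}
```

Selecting by the parent type returns objects of every subtype, which is what lets a later component consume results without knowing which annotator produced them.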

Many UIM applications analyze entire collections of documents. UIMA supports this analysis through its Collection Processing Architecture. This part of the architecture allows specification of a "source-to-sink" flow by reading the data from the source, processing it, and storing the results in a data sink of your choice.
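
The "source-to-sink" flow can be pictured with a small plain-Java loop. This is only a sketch of the pattern, not the Collection Processing Architecture API: the class CollectionFlowSketch, the method run, and the in-memory source and sink are hypothetical stand-ins for whatever reader, analysis step, and data store an application would actually use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

public class CollectionFlowSketch {

    /** Pulls each document from the source, processes it, and hands the result to the sink. */
    static int run(Iterator<String> source, Function<String, String> processor, Consumer<String> sink) {
        int processed = 0;
        while (source.hasNext()) {
            sink.accept(processor.apply(source.next()));
            processed++;
        }
        return processed;
    }

    /** Example wiring: an in-memory collection, a token counter, and a list acting as the sink. */
    static List<String> demo() {
        List<String> collection = Arrays.asList("one two", "three four five");
        List<String> stored = new ArrayList<>();
        Function<String, String> countTokens =
                doc -> doc + " -> " + doc.split("\\s+").length + " tokens";
        Consumer<String> sink = stored::add;
        run(collection.iterator(), countTokens, sink);
        return stored;
    }

    public static void main(String[] args) {
        // prints [one two -> 2 tokens, three four five -> 3 tokens]
        System.out.println(demo());
    }
}
```

Keeping the source, the processing step, and the sink behind separate parameters means each can be swapped independently, for example replacing the in-memory list with a file reader or a database writer.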

Migration from IBM UIMA to Apache UIMA:

For users who want to migrate from IBM UIMA components or applications to Apache UIMA, the Apache UIMA release provides some helpful migration tools.


About the technology author(s):
The UIMA SDK was developed by teams from IBM Research and IBM Software Group. It is a world-wide effort, with significant participation from the following IBM sites:

  • IBM Thomas J. Watson Research Center (New York)
  • IBM Haifa Research Laboratory (Israel)
  • IBM Development Laboratory Boeblingen (Germany)
  • IBM Almaden Research Center (California)

Apache UIMA is being developed by the Apache open-source community.


Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
IBM, alphaWorks, and OmniFind are trademarks of IBM Corporation in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.


Related technologies

For platform(s):
Linux, Windows, Java, Intel

For topics:
analysis, data management, Eclipse, Natural Language, ontology, Perl, Search, semantics, UIMA


Related resources

Semantics Research topic

IBM Research: The UIMA Project

UIMA SDK on developerWorks

UIMA SDK on SourceForge

Press Articles

 
