Unstructured Information Management Architecture SDK
A Java SDK that supports the implementation, composition, and deployment of applications working with unstructured information.
Date Posted: December 16, 2004

1. What is UIMA?
2. What's the difference between UIMA and the UIMA SDK?
3. How does UIMA relate to IBM® products?
4. What is the Semantic Search package?
5. Can I build my UIM application on top of UIMA?
6. What is an annotation?
7. What is the CAS?
8. What does the CAS contain?
9. Does the CAS contain only annotations?
10. Is the CAS merely XML?
11. What is a type system?
12. What's the difference between an annotator and an analysis engine?
13. Are UIMA analysis engines Web services?
14. How do you scale a UIMA application?
15. What does it mean to embed UIMA in systems middleware?
16. Must analysis engines be "stateless"?
17. Is engine meta-data compatible with Web services and UDDI?
18. How is the CPM different from a CPE?
19. Is an XML Fragment Query supposed to be valid XML?
20. Does UIMA support modalities other than text?
21. How does UIMA compare to other similar work?
22. The printed version of the UIMA SDK user's guide has funny characters. What can I do?
23. The output in the viewer window appears to be missing the carriage-return and line-feed characters.
24. On Linux, the Java system seems to stop at random places and is unresponsive to any commands.
25. The documentation says the UIMA.LOG file will be created in the "default directory." Where is this directory?
26. JCasGen says it is generating in the default package, but then I see an exception being generated. What happened?
27. Why does my log output go to the Console and not to the uima.log file?
28. I can't see any Run menu item. What can I do?
29. When I invoke Run, instead of running, it shows a menu with Do you want to Save?
30. The Component Description Editor looks funny -- not as in the documentation.
31. The UIMA.LOG file is in my project directory; why don't I see it in the Package Explorer view of Eclipse?
32. When using UIMA in WebSphere II OmniFind, how can I modify the pear time-out value?
33. When using UIMA in WebSphere II OmniFind, how can I change the Java heap size for my custom annotator?
34. When using UIMA in WebSphere II OmniFind, how can I see my custom annotator log messages in the OmniFind logs?
35. For UIMA in WebSphere II OmniFind: My annotator works on XML tags. It works in the SDK, but not in OmniFind. What's wrong?
36. What is the C++ enablement layer?
37. What was added in previous updates?
38. How does a UIMA component written in Python, Perl, and TCL interoperate with Java and C++?
39. Tell me more about the open-source project and OASIS.

 |  UIMA stands for Unstructured Information Management Architecture. It is a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies.
UIMA processing occurs through a series of modules called analysis engines. The result of analysis is an assignment of semantics to the elements of unstructured data, for example, the indication that the phrase "Washington" refers to a person's name or that it refers to a place.
UIMA supports the rendering of these results in conventional structures (for example, relational databases or search engine indices), where the content of the original unstructured information may efficiently be accessed according to its inferred semantics.
UIMA is specifically designed to support the developer in the creation, integration, deployment, and sharing of components across platforms and among dispersed teams with different skills working to develop advanced analytics. | | |
 |  UIMA is an architecture that specifies component interfaces, design patterns, data representations, and development roles.
The UIMA Software Development Kit (SDK) is a software system that includes a run-time framework, APIs, and tools for implementing, composing, packaging, and deploying UIMA components. It comes with a semantic search engine for indexing and querying over the results of analysis.
The UIMA run-time framework allows developers to plug in their components and applications and run them on different platforms and according to different deployment options that range from tightly-coupled (running in the same process space) to loosely-coupled (distributed across different processes or machines for greater scale, flexibility, and recoverability). | | |
 |  This is a repackaging of the Semantic Search engine from the UIMA SDK, along with the demo tool for queries and a CAS consumer for generating the index, that works with the Apache UIMA version. It will allow you to index annotations as well as keywords; then you can use XML Fragments containing both keywords and annotations to query the index. All this is described in further detail in the package, as well as in the UIMA reference documentation chapter on building UIMA Applications. | | |
 |  Yes. The UIMA license does not restrict its usage to specific scenarios, and we are of course very interested in your feedback, which will help us make UIMA the right platform for building UIM applications. Please note, however, that we currently offer support on a "best we can do" basis. If you are interested in a more formal support agreement, or if you would like to include UIMA in a commercial solution, WebSphere Information Integrator OmniFind Edition is the product-level platform on which to build commercial solutions. | | |
 |  An annotation is a label, typically represented as a string of characters, associated with a region of a document. An example is the label "Person" associated with the span of text "George Washington". We say that "Person" annotates "George Washington" in the sentence "George Washington was the first president of the United States." The association of the label "Person" with a particular span of text is an annotation.
Annotations are not limited to text. A label may annotate a region of an image or a segment of audio. The same concepts apply. | | |
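The idea of a label attached to a region can be sketched in a few lines of plain Java. This is only an illustration of the concept under stated assumptions: the class name SpanAnnotation and its fields are hypothetical and are not part of the UIMA API.

```java
// Illustrative only: a plain-Java model of an annotation, not the UIMA API.
final class SpanAnnotation {
    final String label; // the label, e.g. "Person"
    final int begin;    // offset of the first character of the region
    final int end;      // offset just past the last character of the region

    SpanAnnotation(String label, int begin, int end) {
        if (begin < 0 || end < begin) {
            throw new IllegalArgumentException("invalid span: " + begin + ".." + end);
        }
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    // Returns the region of the document that this label annotates.
    String coveredText(String document) {
        return document.substring(begin, end);
    }

    public static void main(String[] args) {
        String doc = "George Washington was the first president of the United States.";
        SpanAnnotation person = new SpanAnnotation("Person", 0, 17);
        System.out.println(person.label + " annotates \"" + person.coveredText(doc) + "\"");
    }
}
```

For an image or audio artifact, the two character offsets would be replaced by whatever describes a region in that modality; the label-plus-region pairing stays the same.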
 |  The CAS stands for Common Analysis Structure. It provides cooperating UIMA components with a common representation and mechanism for shared access to the artifact being analyzed (for example, a document, audio file, video stream, etc.) and the current analysis results. | | |
 |  The CAS is a data structure for which UIMA provides multiple interfaces. It contains and provides the analysis writer with access to the following:
- the subject of analysis (the artifact being analyzed, such as the document)
- the analysis results or metadata (such as annotations, parse trees, relations, entities, etc.)
- indices to the analysis results
- the type system (a schema for the analysis results).
| | |
 |  No. The CAS contains the artifact being analyzed and the analysis results. Analysis results are those statements recorded by analysis engines in the CAS. The most common form of analysis result is the addition of an annotation. But an analysis engine may write any structure that conforms to the CAS's type system into the CAS. These may not be annotations but may be other things, such as links between annotations and properties of objects associated with annotations. | | |
 |  No; in fact there are many possible representations of the CAS. If all of the analysis engines are running in the same process, an efficient, in-memory data object is used. If a CAS must be sent to an analysis engine on a remote machine, it can be done via an XML or a binary serialization of the CAS. UIMA specifies an XML representation of the CAS. | | |
 |  Think of a type system as a schema for the CAS. It defines the types of objects and their properties (or features) that may be instantiated in a CAS. A CAS conforms to a particular type system. UIMA components declare their input and output with respect to a type system. Type systems include the definitions of types, their properties, and single-inheritance hierarchy of types. | | |
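As a rough sketch of what "a schema with a single-inheritance hierarchy" means, the toy class below records one parent per type and answers whether one type inherits from another. The class ToyTypeSystem and the type names used with it are hypothetical; the real UIMA type system is declared in XML descriptors and accessed through the SDK's own interfaces.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a toy type system, not the UIMA API.
final class ToyTypeSystem {
    // Maps each type name to its single parent; the root type maps to null.
    private final Map<String, String> parentOf = new HashMap<String, String>();

    ToyTypeSystem(String rootType) {
        parentOf.put(rootType, null);
    }

    // Declares a type with exactly one supertype (single inheritance).
    void addType(String name, String parent) {
        if (!parentOf.containsKey(parent)) {
            throw new IllegalArgumentException("unknown parent type: " + parent);
        }
        if (parentOf.containsKey(name)) {
            throw new IllegalArgumentException("type already defined: " + name);
        }
        parentOf.put(name, parent);
    }

    // True if 'type' is 'ancestor' itself or inherits from it.
    boolean subsumes(String ancestor, String type) {
        if (!parentOf.containsKey(type)) {
            return false;
        }
        for (String t = type; t != null; t = parentOf.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ToyTypeSystem ts = new ToyTypeSystem("Top");
        ts.addType("Annotation", "Top");
        ts.addType("com.myorg.Person", "Annotation");
        System.out.println(ts.subsumes("Annotation", "com.myorg.Person"));
    }
}
```

A component that declares "Annotation" as its input would, in this toy model, also accept "com.myorg.Person", because the hierarchy says every Person is an Annotation.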
 |  In the terminology of UIMA, an annotator is simply some code that analyzes documents and puts out annotations on the content of the documents. The UIMA framework takes the annotator, together with metadata describing such things as the input requirements and output of the annotator, and produces an analysis engine. Analysis engines contain the framework-provided infrastructure that allows them to be easily combined with other analysis engines in different flows and according to different deployment options (collocated or as Web services, for example). | | |
 |  Not necessarily. However, deploying an analysis engine as a Web service is one of the deployment options supported by the UIMA framework. | | |
 |  The UIMA framework allows components such as analysis engines and CAS consumers to be easily deployed as services or in other containers and managed by systems middleware designed to be scaled. UIMA applications tend to naturally scale-out across documents, allowing many documents to be analyzed in parallel. | | |
 |  An example of an embedding would be the deployment of a UIMA analysis engine as an Enterprise Java Bean inside an application server such as IBM WebSphere. Such an embedding allows the deployer to take advantage of the features and tools provided by WebSphere for achieving scalability, service management, recoverability, etc. UIMA is independent of any particular systems middleware, so analysis engines could be deployed on other types of middleware as well. | | |
 |  Technically, no. But analysis engine developers are encouraged not to maintain state between documents that would prevent their engine from working as advertised if switched into a different flow or onto a different document collection.
UIMA defines another type of component, the CAS Consumer, which is intended to maintain state across documents and is typically associated with some resource such as a database or search engine that aggregates analysis results across an entire collection. | | |
 |  All UIMA component implementations are associated with an XML descriptor that represents captured metadata describing various properties about the component in order to support discovery, reuse, validation, automatic composition, and development tooling. In principle, UIMA component metadata is compatible with Web services and UDDI. However, the UIMA framework currently uses its own XML representation for this metadata. It would not be difficult to convert between UIMA's XML representation and the WSDL and UDDI standards. | | |
 |  The UIMA framework includes a Collection Processing Manager (CPM) for managing the execution of a workflow of UIMA components orchestrated to analyze a large collection of documents. The UIMA developer does not implement or describe a CPM. It is a built-in part of the framework. It is a piece of infrastructure code that handles CAS transport, instance management, batching, check-pointing, statistics collection, and failure recovery in the execution of this collection processing workflow.
A Collection Processing Engine (CPE) is a component that the UIMA developer creates by specifying a CPE descriptor. A CPE descriptor points to a series of UIMA components, including a Collection Reader, CAS Initializer, Analysis Engine(s), and CAS Consumers. These components organized in a particular flow define a collection analysis job that acquires documents from a source collection, initializes CASs with document content, performs document analysis, and then produces collection level results (for example, search engine index, database, etc). The CPM is the execution engine for a CPE. | | |
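The division of labor described above (the developer supplies the components; the framework drives documents through them) can be illustrated with a toy pipeline in plain Java. Everything here is hypothetical: the Engine and Consumer interfaces, the run method, and the word-counting example are stand-ins chosen for brevity, a plain list plays the part of the collection reader, and none of it reflects the actual UIMA interfaces or the CPM's batching, check-pointing, or recovery features.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: hypothetical interfaces, not the UIMA API.
final class ToyCollectionJob {
    interface Engine { List<String> analyze(String document); }
    interface Consumer { void consume(List<String> results); }

    // Stands in for the execution engine: pushes every document through the flow.
    static int run(List<String> collection, Engine engine, Consumer consumer) {
        int processed = 0;
        for (String document : collection) {
            consumer.consume(engine.analyze(document));
            processed++;
        }
        return processed;
    }

    // Toy per-document analysis: report each capitalized word as a result.
    static List<String> capitalizedWords(String document) {
        List<String> found = new ArrayList<String>();
        for (String word : document.split("\\s+")) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                found.add(word);
            }
        }
        return found;
    }

    // Builds a collection-level result (word counts) across all documents.
    static Map<String, Integer> countNames(List<String> collection) {
        final Map<String, Integer> counts = new TreeMap<String, Integer>();
        run(collection, ToyCollectionJob::capitalizedWords, results -> {
            for (String r : results) {
                counts.merge(r, 1, Integer::sum);
            }
        });
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countNames(Arrays.asList(
                "Washington slept here", "Lincoln met Washington")));
    }
}
```

Note that only the consumer keeps state across documents (the counts map); the analysis step looks at one document at a time, which matches the guidance on state given earlier in this FAQ.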
 |  Not exactly. The XML Fragment query syntax used by the semantic search engine that is shipped with UIMA uses basic XML syntax as an intuitive way to describe hierarchical patterns of annotations that may occur in a CAS. It deviates from valid XML in a few minor ways in order to support queries over "overlapping" or "cross-over" annotations. | | |
 |  The UIMA architecture supports the development, discovery, composition, and deployment of multi-modal analytics including text, audio, and video. However, this release of the SDK includes only documentation and programming examples for text analysis. | | |
 |  A number of different frameworks for Natural Language Processing (NLP) have preceded UIMA. Two of them were developed at IBM Research and represent UIMA's early roots. For details, please see the UIMA article that appears in the IBM Systems Journal Vol. 43, No. 3.
UIMA has advanced the state of the art along a number of dimensions including support for distributed deployments in different middleware environments; easy framework embedding in different software product platforms (key for commercial applications); broader architectural coverage with its collection processing architecture; support for multiple modalities; support for efficient integration across programming languages; support for a modern software engineering discipline calling out different roles in the use of UIMA to develop applications; the extensive use of descriptive component metadata to support development tools; and component discovery and composition. (Please note that not all these features are available in this release of the SDK.) | | |
 |  We've observed that some printers print this PDF better if, on Windows®, you select the Advanced button that appears on the Print window and then change the Font and Resource Policy from Send by Range to Send at Start. | | |
 |  We've observed this problem with earlier releases of Java™. Try running the SDK with the supplied IBM Java 1.4.2. | | |
 |  We've seen this behavior on some machines with hyperthreading enabled, on earlier versions of Linux®. This problem disappeared when we upgraded to the current levels of the threading libraries. | | |
 |  It is usually the directory you were in when you invoked UIMA. If you are running from Eclipse, it may be in the project you had selected when you did a Run, or it may be the directory where the eclipse.exe file is. | | |
 |  The CAS types in the UIMA SDK must have a CAS name space. You can't have a type named MyType -- it must have a name such as com.myorg.MyType. The part of the name before the last period is the name space and is used in JCasGen to specify the package name of the generated files. | | |
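The rule "the part of the name before the last period is the name space" is easy to check before running JCasGen. The helper below is a minimal sketch of that rule in plain Java; the class TypeNames and its methods are hypothetical and are not part of the SDK.

```java
// Illustrative only: applies the naming rule described above; not part of the UIMA SDK.
final class TypeNames {
    // Returns the name space (the future package name), e.g. "com.myorg".
    static String nameSpace(String typeName) {
        int lastDot = typeName.lastIndexOf('.');
        if (lastDot <= 0) {
            throw new IllegalArgumentException("type name has no name space: " + typeName);
        }
        return typeName.substring(0, lastDot);
    }

    // Returns the part after the last period, e.g. "MyType".
    static String shortName(String typeName) {
        return typeName.substring(typeName.lastIndexOf('.') + 1);
    }

    public static void main(String[] args) {
        System.out.println(nameSpace("com.myorg.MyType") + " / " + shortName("com.myorg.MyType"));
    }
}
```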
 |  UIMA uses the standard Java logger. The default behavior of the Java logger is to send log output to the console. This can be overridden by the -Djava.util.logging.config.file system property, which must point to a Logger.properties file in the format specified by Java. An example Logger.properties file, which redirects output to the file uima.log, is located in the root directory of the UIMA SDK. All the .bat/.sh files and Eclipse run configurations that come with the UIMA SDK set -Djava.util.logging.config.file to point to this Logger.properties file. You should do the same in your own run configurations if you would like the log output to go to the uima.log file. For more information about logging, see the "Logging" section in Chapter 4 of the UIMA SDK documentation. | | |
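If setting the system property is inconvenient, the standard Java logging API can also send output to a file programmatically. The sketch below uses only java.util.logging classes from the JDK; the logger name "example.annotator" is a made-up placeholder, and this is an alternative illustration, not the mechanism the UIMA SDK scripts use.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.FileHandler;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

// Illustrative only: redirects one logger's output to a file with the standard Java logger.
final class FileLoggingSketch {
    static void logToFile(Path logFile, String message) throws IOException {
        Logger logger = Logger.getLogger("example.annotator"); // hypothetical logger name
        FileHandler handler = new FileHandler(logFile.toString(), true); // true = append
        handler.setFormatter(new SimpleFormatter());
        logger.setUseParentHandlers(false); // keep this message off the console
        logger.addHandler(handler);
        try {
            logger.log(Level.INFO, message);
        } finally {
            // Detach and close the handler so the output is flushed to disk.
            logger.removeHandler(handler);
            handler.close();
            logger.setUseParentHandlers(true);
        }
    }

    public static void main(String[] args) throws IOException {
        Path logFile = Files.createTempFile("uima-sketch", ".log");
        logToFile(logFile, "annotator started");
        System.out.println(new String(Files.readAllBytes(logFile), StandardCharsets.UTF_8));
    }
}
```

The properties-file approach described above remains the simpler choice when you want the same logging setup for every run configuration, because it requires no code changes.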
 |  Try switching to the Java perspective by selecting the following menu choices:
Window > Open Perspective > Java. | | |
 |  Eclipse checks to see if you've edited any files but not saved them, and if so, it will bring up this menu to give you the opportunity to save the files before running. The run action will happen after you decide whether you want to save the file(s) and take the appropriate action. | | |
 |  This appearance may be due to the editor not having enough room to be displayed. Try making the window larger. Try also double-clicking on the title tab for this editor at the top. This action should expand the window to the full Eclipse window. (You can return to the previous window configuration by double-clicking on the title tab again). | | |
 |  To see it, select the project and press F5 or right-click and select Refresh. Eclipse caches a view of the file system; it must be occasionally told when things have changed in the file system and that it should refresh its views. | | |
 |  In some cases, it is necessary to modify the standard time-out setting for a custom annotator. For example, if an annotator performs very complex text analysis, the default time-out value of 30 seconds may be too low. To change the time-out value, edit the custom annotator settings in EsCpeDescriptor.xml; the snippet below shows these settings.
<casProcessor deployment="remote" name="MyCustomAnnotator">
  <descriptor>
    <include href="/home/esadmin/config/col1.parserdriver/specifiers/EsSocketService.xml"/>
  </descriptor>
  <filter/>
  <errorHandling>
    <errorRateThreshold action="continue" value="0/100"/>
    <maxConsecutiveRestarts action="terminate" value="3"/>
    <timeout max="30000"/>
  </errorHandling>
  <checkpoint batch="1"/>
  <deploymentParameters>
    <parameter name="transport" type="string"
      value="com.ibm.es.control.casprocessor.server.CasProcessorSocketTransport"/>
  </deploymentParameters>
</casProcessor>
The time-out value is specified in milliseconds in the errorHandling section of the casProcessor element. If the annotator does not return within this time, a time-out event is triggered, so increase the value if your annotator needs longer. After increasing the time-out value for the custom annotator, you must also increase the time-out value for the CPM output queue. This setting is at the end of the same EsCpeDescriptor.xml file, in the following tag:
<outputQueue dequeueTimeout="100000" .../>
Increase this time-out value by the same factor used for the custom annotator.
| | |
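The "same factor" rule is simple arithmetic, worked through below as a small Java sketch. The two default values are the ones shown in the snippet above (30000 and 100000 milliseconds); the class TimeoutScaling and the factor of 4 are only an example.

```java
// Illustrative only: computes matching time-out values for the two settings.
final class TimeoutScaling {
    static final long DEFAULT_ANNOTATOR_TIMEOUT_MS = 30000L; // <timeout max="30000"/>
    static final long DEFAULT_DEQUEUE_TIMEOUT_MS = 100000L;  // <outputQueue dequeueTimeout="100000" .../>

    // Multiplies a time-out by a whole-number factor, failing loudly on overflow.
    static long scale(long timeoutMs, int factor) {
        if (factor < 1) {
            throw new IllegalArgumentException("factor must be at least 1");
        }
        return Math.multiplyExact(timeoutMs, (long) factor);
    }

    public static void main(String[] args) {
        int factor = 4; // example: allow the annotator four times as long
        System.out.println("timeout max=" + scale(DEFAULT_ANNOTATOR_TIMEOUT_MS, factor));
        System.out.println("dequeueTimeout=" + scale(DEFAULT_DEQUEUE_TIMEOUT_MS, factor));
    }
}
```

With a factor of 4, the annotator time-out becomes 120000 and the output queue time-out becomes 400000, so the ratio between the two settings is preserved.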
 |  The pear file, including the custom annotators that are associated with a collection, runs in a collection-specific, fenced box. The fenced box is a separate process called the CAS processor. In order to change the JVM heap size for that process, you must modify the following configuration file:
NodeRoot/master_config/colID_config.ini
Within the file, search for an expression such as
sessionN.type=casprocessor
to get the session number for the current collection's CAS processor. After obtaining the session number, change the heap size in the following setting:
sessionN.max_heap=size in MB
OmniFind must be restarted for the changes to take effect.
The default heap size is set to 200 MB. Be careful when increasing the heap size. For additional help, see the memory recommendations in the OmniFind installation guide.
| | |
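The lookup described above (find the session whose type is casprocessor, then edit that session's max_heap entry) can be sketched as a small Java helper. It assumes that N in sessionN is a decimal number and that each setting sits on its own line, which is inferred from the pattern shown here rather than from OmniFind documentation; the class name and the sample file contents are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: locates the CAS processor session in ini-style text.
final class CasProcessorSession {
    // Matches a whole line such as "session3.type=casprocessor" (assumed format).
    private static final Pattern TYPE_LINE =
            Pattern.compile("(?m)^session(\\d+)\\.type=casprocessor\\s*$");

    // Returns N from the first "sessionN.type=casprocessor" line.
    static int findSessionNumber(String iniText) {
        Matcher m = TYPE_LINE.matcher(iniText);
        if (!m.find()) {
            throw new IllegalArgumentException("no casprocessor session found");
        }
        return Integer.parseInt(m.group(1));
    }

    // Returns the name of the heap setting to edit for that session.
    static String maxHeapKey(int sessionNumber) {
        return "session" + sessionNumber + ".max_heap";
    }

    public static void main(String[] args) {
        // Hypothetical file contents, for demonstration only.
        String ini = "session1.type=other\nsession3.type=casprocessor\nsession3.max_heap=200\n";
        System.out.println(maxHeapKey(findSessionNumber(ini)));
    }
}
```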
 |  All custom annotator log messages are written to the OmniFind parser service's audit log file, located at esNodeRoot/logs/audit/<colId>.casprocessor_audit_<current_date>.log. Within OmniFind there are three different log levels: Error, Warning, and Informational. The OmniFind log level for audit log files is set to Informational and cannot be changed to another value. Within the UIMA logging architecture, there are seven possible log levels (Error, Warning, Info, Config, Fine, Finer, and Finest), only some of which are mapped to OmniFind log levels by default. The default level mapping is shown below:
OmniFind log level: UIMA log level
Error: Error
Warning: Warning
Informational: Info
not mapped: Config, Fine, Finer, Finest
Note that the mapping for Error and Warning messages cannot be changed. By default, only custom annotator log messages with the levels Info, Warning, and Error are written to the log file. This default behavior can be replaced with a special log-level mapping for log levels below Info, as follows:
- Modify the tokenizer.properties config file in the following directory:
EsNodeRoot/master_config/parserservice/
- Inside this file is a level configuration setting, such as
trevi.tokenizer.jedii.InformationalLevelMapping=Info
- In order to see more than UIMA annotator Info messages in the log file, replace this log level value with the desired UIMA log level. For example, use the following in order to see all UIMA annotator log messages in the OmniFind audit log:
trevi.tokenizer.jedii.InformationalLevelMapping=Finest
| | |
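The effect of the InformationalLevelMapping setting can be modeled as a threshold, as in the sketch below. This is an interpretation of the behavior described above, not OmniFind's implementation: it assumes that every UIMA level at or above the configured level (and below Warning) is written as Informational, and the class and method names are hypothetical.

```java
// Illustrative only: models the level mapping described above as a threshold.
final class LogLevelMapping {
    // UIMA levels as named in the text, ordered from least to most severe.
    enum UimaLevel { FINEST, FINER, FINE, CONFIG, INFO, WARNING, ERROR }

    // Returns the OmniFind level a message is written under, or null if it is dropped.
    static String omniFindLevel(UimaLevel level, UimaLevel informationalMapping) {
        if (level == UimaLevel.ERROR) {
            return "Error";   // fixed mapping, cannot be changed
        }
        if (level == UimaLevel.WARNING) {
            return "Warning"; // fixed mapping, cannot be changed
        }
        // Assumption: the configured level acts as a lower bound for Informational.
        return level.compareTo(informationalMapping) >= 0 ? "Informational" : null;
    }

    public static void main(String[] args) {
        // Default mapping (Info): a Fine message is dropped.
        System.out.println(omniFindLevel(UimaLevel.FINE, UimaLevel.INFO));
        // Mapping set to Finest: the same message is written as Informational.
        System.out.println(omniFindLevel(UimaLevel.FINE, UimaLevel.FINEST));
    }
}
```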
 |  The OmniFind XML parser models all XML tags as CAS annotations. They are removed from the actual document content. If you need to access XML information in your annotator, there are two ways of doing this, which can be combined:
- If you enable native XML search on the parse panel of your collection, OmniFind will create an annotation of type com.ibm.es.tt.MarkupTag for each XML tag found in a document. This annotation contains all the information of the original XML tag, namely, its attributes and their content. Moreover, OmniFind will automatically index these XML tags under the name in which they appear in the XML file, so you can use them for semantic searching right away. Your annotators could access these annotations in their processing instead of relying on the XML tags. They would need to iterate over the com.ibm.es.tt.MarkupTag annotations and look at the name feature of each annotation in order to find out which XML tag it originally represented.
- You can specify a so-called "XML to CAS" mapping file. In this file, you specify which XML tags should be mapped to which CAS types. OmniFind will automatically create annotations for these XML tags. This makes it even easier for your annotators to access certain XML tags than with the first option. For example, if one annotator is interested only in content within <technicianComments>, you could specify a mapping from this tag to a type com.yourco.TechnicianComment. Then your annotator need iterate only over annotations of this type. In the case of "XML to CAS" mapping, OmniFind doesn't index the XML tags automatically. If you still want to search for, say, <technicianComments>, you have two options:
  - Additionally enable native XML search.
  - In your CAS2Index mapping, add a rule that maps com.yourco.TechnicianComment to the span technicianComment.
For details, please refer to the OmniFind Programming Guide.
| | |
 |  The C++ enablement layer enables analytics written in C++ to be incorporated into the UIMA SDK analytic pipeline; these components can then be combined with others written in Java. The C++ enablement layer uses the Java Native Interface (JNI) and special serialization capabilities to create a C++-accessible version of the CAS, which is used by the C++ annotator modules. The layer includes C++ libraries that enable access to the CAS in a manner that parallels the Java methods. Supported C++ components include Annotators and CAS Consumers. C++ analytic components built using this layer are incorporated into a UIMA Aggregate Analysis Engine or Collection Processing Engine, just as if they were written in Java. Versions of this layer are available for Linux (on i386 platforms) and Microsoft Windows. | | |
 |  Release 1.3
- Changes to the JAR file structure (some classes have been repackaged).
- Removal of implementation for methods that were deprecated in UIMA 1.0.
- Removal of Eclipse 2 version of Component Descriptor Editor (please use Eclipse 3.0 or above for running this tool).
Release 1.2
- CAS Consumers can now be included within Aggregate Analysis Engines.
- Collection Processing Engines can now specify overriding parameter values when operating in Integrated mode.
- New metadata elements in the component descriptors specify whether the components can be multiply deployed and whether they modify the CAS.
- The alphaWorks version includes an initial version of the C++ enablement layer, permitting existing analytic components written in C++ to be integrated into the UIMA SDK. It supplies JNI (Java Native Interface) code to invoke and pass arguments to UIMA components written in C++, and it provides a C++ library and framework that gives these components access to the CAS. This layer is available for Windows® and Linux®. Some documentation is provided through a QuickStart introduction, and there is detailed documentation for the required C++ APIs.
Release 1.1
- This version is incorporated into IBM's enterprise search solution, WebSphere Information Integrator OmniFind Edition, allowing search to be augmented using UIMA analytics.
- Support for multiple Subjects of Analysis has been added - this is documented in a new chapter in the UIMA SDK User's Guide and Reference.
- The Component Descriptor Editor has been greatly enhanced for Eclipse 3, allowing you to edit most UIMA descriptor files (the main exception being the Collection Processing Engine descriptors).
- A new GUI-based tool allows interactive semantic search querying.
- The Collection Processing Manager has been enhanced, and the SDK now includes examples and documentation describing how to use it.
| | |
 |  Interoperation is achieved using generic C++ annotators called Pythonnator, Perltator, and Tclator, all of which use embedded interpreters to run a specified script in the desired language. For example: When a Pythonnator is initialized, the C++ code creates an embedded Python interpreter, imports the specified Python script, and calls the script's initialization method. When other Annotator methods, such as typeSystemInit() or process(), are called by the UIMA framework, the associated methods in the Scriptator's script are called. Each Scriptator also provides a library that implements an interface between the scripting language and the UIMA APIs of the UIMA C++ Enablement Layer. | | |
 |  UIMA 2.0 is moving to a new open-source project at Apache. IBM has donated UIMA to Apache, and ongoing development of UIMA Version 2 will be done in the open-source style by the Apache community. Earlier versions of the source code currently available on SourceForge will remain there.
It will be some time before the Apache UIMA Version 2 code is ready for distribution. Earlier versions and the current Version 2 beta will remain available here on alphaWorks. Some user adaptation is expected to be needed when switching to the Apache UIMA project -- for example, the name spaces are updated with org.apache prefixes (in place of com.ibm prefixes).
OASIS has established a technical committee to work on the standardization of UIMA; participation is open to interested parties who are members of OASIS. | |
For platform(s): Linux, Windows, Java, Intel
For topics: analysis, data management, Eclipse, Natural Language, ontology, Perl, Search, semantics, UIMA