
Text Analytics Tools and Runtime for IBM LanguageWare

An Eclipse application for building custom language analysis into IBM LanguageWare resources and their associated UIMA annotators.

Date Posted: December 7, 2006



Update: June 22, 2009
The LanguageWare Resource Workbench 7.1.1.3 contains significant improvements to its performance and memory footprint when annotating large documents and document collections. It also contains a new dictionary merge capability and some bug fixes.

1. What is the LanguageWare Resource Workbench? Why should I use it?

The LanguageWare Resource Workbench is a comprehensive Eclipse-based environment for developing UIMA Analyzers. The Workbench ships with a set of template annotators that can be modified through the Workbench to generate a new set of custom annotators, tailored to your specific needs. These new annotators can then be simply exported from the Workbench (as a Pear) and installed into any UIMA pipeline. The customization is achieved through building Domain Extraction Models (lexico-semantic resources and parsing rules) which describe the entities and relationships you wish to extract. If you satisfy the following criteria, then you will want to use LanguageWare:

2. Where should I start with LanguageWare?

The best way to get started with LanguageWare is to install LanguageWare Resource Workbench and the Demonstrator from alphaWorks. The Demonstrator provides a workspace that is fully populated with a set of dictionaries, rules, UIMA pipelines, and associated documents. This gives you a great starting point from which to build your personalized analysis pipelines. There are also a number of short "Getting Started" videos on the alphaWorks Download page that introduce you to the Workbench and show you how it works (using the Demonstrator environment).

3. What's new in this version of LanguageWare Resource Workbench?

All new features are outlined in the Workbench Release Notes, ReleaseNotes.htm, which is located in the Workbench installation directory.

4. Does the Workbench generate UIMA annotator code and configuration files that I can use in my application?

The Workbench generates the UIMA annotator code, configuration files, and LanguageWare resources required. The deployment packaging is not yet automated; instructions for extracting the required files are included in the Getting Started Guide. However, the alphaWorks license allows you only to use this code for evaluation purposes.

5. What version of UIMA do I need to use the LanguageWare Annotators?

LanguageWare Resource Workbench ships with, and has been tested against, Apache UIMA Version 2.2.2. The LanguageWare annotators should work with newer versions of Apache UIMA, but they have not been extensively tested for compatibility, so we recommend Apache UIMA 2.2.2. The annotators are not compatible with versions of UIMA prior to 2.1; those versions were released by IBM and have namespace conflicts with Apache UIMA.
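
The compatibility rule above can be written as a small check. This is a minimal sketch, not part of LanguageWare or Apache UIMA: the class and method names are hypothetical, and the version string is passed in as plain text so the example has no dependencies. In a real application the string would typically come from the UIMA runtime (for example, `UIMAFramework.getVersionString()` in Apache UIMA).

```java
/** Hypothetical helper illustrating the version guidance in this FAQ. */
final class UimaVersionCheck {
    private UimaVersionCheck() {}

    /** Parses a dotted version such as "2.2.2" (a "-suffix" is ignored) into numbers. */
    static int[] parse(String version) {
        String[] parts = version.trim().split("-")[0].split("\\.");
        int[] numbers = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            numbers[i] = Integer.parseInt(parts[i]);
        }
        return numbers;
    }

    /** Compares two dotted versions; missing trailing parts count as zero. */
    static int compare(String a, String b) {
        int[] x = parse(a);
        int[] y = parse(b);
        int n = Math.max(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int xi = i < x.length ? x[i] : 0;
            int yi = i < y.length ? y[i] : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** True for UIMA 2.1 and later; earlier versions are stated to be incompatible. */
    static boolean isCompatible(String version) {
        return compare(version, "2.1") >= 0;
    }

    /** True only for the version the Workbench was tested against (2.2.2). */
    static boolean isTested(String version) {
        return compare(version, "2.2.2") == 0;
    }
}
```

An application could log a warning when `isCompatible` is true but `isTested` is false, since newer versions are expected to work but have not been extensively tested.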

6. What documentation is available to help me use LanguageWare?

Context-sensitive help is provided as part of the LanguageWare Workbench. There is an online help system shipped with the LanguageWare Workbench (under Help / Help Contents). Several "Getting Started" videos are provided on the alphaWorks Download page. More detailed information about the underlying APIs will be provided for fully-licensed users of the technology.

7. Why is the Workbench shipped as an Eclipse-based application?

We built the Workbench on Eclipse because it provides a collaborative framework through which we can share components with other product teams across IBM, with our partners, and with our customers. This version of the Workbench is a complete, stand-alone application. However, users can still get the benefits of the Eclipse IDE by installing Eclipse features into the Workbench. Popular features include the Eclipse CVS feature for managing shared projects and the Eclipse XML feature for full XML editing support. See the Eclipse online help for more information about finding and installing new features. It is important to understand that while the Workbench is Eclipse-based, the Annotators that are exported from the Workbench (under File / Export) can be installed into any UIMA pipeline and can be deployed in a variety of ways. As part of the commercial LanguageWare license, the LanguageWare team provides integration source code to simplify the overall deployment and integration effort. This includes UIMA serializers, CAS consumers, and APIs for integrating through C/JNI, Eclipse, Web Services (REST), and others.

8. What are the known limitations with this release of the Workbench?

Any problems or limitations are outlined in the Workbench Release Notes, ReleaseNotes.htm, which is located in the Workbench installation directory and is part of the Workbench Help System.

9. What is the LanguageWare Demonstrator?

The LanguageWare Demonstrator is a workspace that is fully populated with a set of dictionaries, rules, UIMA pipelines, and associated documents that you can use within the LanguageWare Workbench. This workspace gives you a great starting point from which to build your personalized analysis pipelines.

10. What are the key lessons that you have learned in building the LanguageWare Workbench?

We've learned several important lessons:

11. What are Domain Extraction Models? How do I build a good Domain Extraction Model?

The process of customizing the template annotators provided with the Workbench requires you to build a set of resources that describe what you want to extract. We call them extraction models since they also give instructions, using the Parser Rules, for how you want to perform the identification/extraction process.

The models are a combination of:

The process of building data models is a simple iterative process within the Workbench.

The important thing is to understand exactly what you are being asked to extract and which concepts might make up the model. You don't have to overthink the process; however, be prepared to rewrite your model several times (this is an iterative process) before you end up with something that looks clean, robust, generalized, and maintainable.

Start by identifying dictionary concepts and create your dictionaries in the Workbench, then move onto the rules. Be careful not to "overfit" your model to the set of template sentences you started with. At regular intervals test your model against a separate set of documents and see how you are doing.

And the most important thing is to have fun! You may be surprised at how much you enjoy the challenge; it's like solving a multi-dimensional puzzle.

11. What do the LanguageWare Runtime libraries provide?

LanguageWare provides many run-time libraries. Although each of these libraries provides discrete functionality, many libraries build on the functionality provided by the core LanguageWare libraries. The following is a non-exhaustive list of the libraries and their functions.

  • dlt.jar, rule_dlt.jar and icu4j.jar: provide core functionality, such as lexical analysis, dictionary look-up, and spelling correction
  • tagger_dlt.jar: provides part-of-speech tagging; requires the lexical analysis libraries mentioned above
  • dltls.jar: provides support for ontology-based semantic analysis of documents
  • an_dlt.jar, an_tagger_dlt.jar: used for running LanguageWare annotators in a UIMA pipeline
  • jfst.jar, antlr.jar: used for running the rule-based annotator
  • jdemo.jar: used by the sample applications
  • DictionaryBuilder.jar: used to build LanguageWare dictionaries from the command line. Building dictionaries using LanguageWare Resource Workbench is recommended instead. Several supporting JAR files are required; they are included in the lib directory.
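
Before starting an application that depends on these libraries, it can be useful to confirm that the JAR files are actually present. The following is a minimal sketch using only the Java standard library. The class and method names are hypothetical, and the assumption that all of the JAR files sit together in one directory is ours; adjust the directory to match your installation.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; the JAR names are taken from the list above. */
final class RuntimeJarCheck {
    /** Core libraries: lexical analysis, dictionary look-up, spelling correction. */
    static final List<String> CORE_JARS =
            Arrays.asList("dlt.jar", "rule_dlt.jar", "icu4j.jar");

    private RuntimeJarCheck() {}

    /** Returns the names from 'required' that are not regular files in libDir. */
    static List<String> missingJars(Path libDir, List<String> required) {
        List<String> missing = new ArrayList<>();
        for (String name : required) {
            if (!Files.isRegularFile(libDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    /** Joins the JAR paths into a classpath string for the current platform. */
    static String classpath(Path libDir, List<String> jars) {
        List<String> entries = new ArrayList<>();
        for (String name : jars) {
            entries.add(libDir.resolve(name).toString());
        }
        return String.join(File.pathSeparator, entries);
    }
}
```

For example, `RuntimeJarCheck.missingJars(Paths.get("lib"), RuntimeJarCheck.CORE_JARS)` returns an empty list when all three core libraries are found in a directory named `lib`.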

12. How do I change the default editor for new file types in the Workbench?

Go to Window / Preferences / General / Editors / File Associations. If the content type is already listed, just add a new editor and pick the LanguageWare Text Editor. You can set this to be the default, or alternatively leave it as an option that you can choose, on right click, whenever you open a file of that type. You will need to restart the Workbench before this comes into effect. Note: Eclipse remembers the last viewer you used for a file type so if you opened a document with a different editor beforehand you may need to right-click on the file and explicitly choose the LanguageWare Text Editor the first time on restart.

12. What resources are included in the Runtime package?

There are several dictionaries included in the run-time environment. The latest official lexical analysis dictionaries are included in the IBM-dictionaries folder. In addition, dictionaries required for running the sample applications are stored in the samples\SampleDictionaries folder. Users can request the most recent dictionaries (if they are not present in the IBM-dictionaries folder) by contacting LanguageWare.

13. How do I develop my own custom UIMA Annotators in the Workbench?

The template annotators that ship with the Workbench are customized through building Domain Extraction Models (resources) which describe the entities/concepts that you want to recognize and how the concepts combine to generate new entities and relationships. Developing UIMA Annotators in the Workbench doesn't require any coding at all. You simply develop the Domain Models, create a UIMA configuration that includes these models and our underlying template Annotators, and export the resulting Annotator (or aggregate Annotator) from the Workbench (exported as a Pear).

13. How do I use the LanguageWare Runtime components?

The LanguageWare Runtime libraries can be used as part of a Java application. Several sample applications are included in the samples directory. These give an idea of some useful applications.

The libraries can also be used to create custom annotators based on LanguageWare. The preferred way to generate these annotators is by using LanguageWare Resource Workbench. However, it is possible to manually create LanguageWare-based UIMA annotators. UIMA requires XML descriptors for annotators. If you have downloaded LanguageWare Resource Workbench, you can use it to generate a PEAR file based on the sample rule-based annotator. This PEAR file contains descriptors for the core LanguageWare annotator, the POS Tagger annotator, and the rule-based annotator, as well as a descriptor for running all these together. The PEAR file also contains the libraries and resources required for running the annotators. See the Getting Started Guide for information about generating and installing a PEAR file.
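
A PEAR file is a ZIP archive, so you can check what the Workbench packaged before installing it. The sketch below uses only the Java standard library to list the archive entries, confirm that the PEAR installation descriptor (`metadata/install.xml` in the UIMA PEAR layout) is present, and pick out the other XML files, which include the annotator descriptors. The class and method names are hypothetical, and this is an inspection aid only; it does not install or run the annotators.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper for looking inside an exported PEAR file. */
final class PearInspector {
    /** Location of the installation descriptor in the UIMA PEAR layout. */
    static final String INSTALL_DESCRIPTOR = "metadata/install.xml";

    private PearInspector() {}

    /** Returns the sorted names of all entries in the PEAR (ZIP) archive. */
    static List<String> listEntries(File pearFile) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(pearFile)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        Collections.sort(names);
        return names;
    }

    /** True if the entry list contains the PEAR installation descriptor. */
    static boolean hasInstallDescriptor(List<String> names) {
        return names.contains(INSTALL_DESCRIPTOR);
    }

    /** Returns the XML entries other than the installation descriptor. */
    static List<String> xmlEntries(List<String> names) {
        List<String> result = new ArrayList<>();
        for (String name : names) {
            boolean isXml = name.toLowerCase(Locale.ROOT).endsWith(".xml");
            if (isXml && !name.equals(INSTALL_DESCRIPTOR)) {
                result.add(name);
            }
        }
        return result;
    }
}
```

A typical use would be `PearInspector.xmlEntries(PearInspector.listEntries(new File("my-annotator.pear")))`, where `my-annotator.pear` is a placeholder for the file you exported.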

14. How do I integrate the UIMA Analyzers that I develop in the Workbench?

Once you have completed building your Domain Extraction Models (dictionaries and rules), the Workbench provides an "Export as UIMA Pear" function under File / Export. This generates a PEAR file that contains all the code and resources required to run your pipeline in any UIMA-enabled application.

14. What documentation is available to help me use the LanguageWare Runtime components?

The documentation for the run-time environment can be found in the doc folder of LW70.zip. This file contains the User Guide for LanguageWare in HTML and PDF formats. The Javadoc folder contains documentation on the LanguageWare APIs. The documentation for LanguageWare Resource Workbench might also be of use.


Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
IBM, LanguageWare, and alphaWorks are trademarks of IBM Corporation in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.

15. How do I store my models?

The Workbench is designed primarily to help you build your domain extraction models, and this includes databases in which you can store your models. The Workbench ships with an embedded open-source database (Derby), but it can also connect to any enterprise database, such as DB2 or Oracle.

16. How do I open a non-anno file in the LanguageWare Text Editor?

You can either right-click on the file and choose the "LanguageWare Text Editor", or alternatively update the default editor for that particular file type. See the "How do I change the default editor for new file types in the Workbench?" question in this FAQ.

17. Is LanguageWare available as a services offering?

Yes, LanguageWare is available as a Services Offering. For more details, please see LanguageWare domain modelling and integration services.

18. What licensing conditions apply for LanguageWare on alphaWorks, for academic purposes, or for commercial use?

There are licensing conditions for using the LanguageWare Tools.

19. Is Language Identification identifying the wrong language?

Sometimes the default amount of text (1024 characters) used by Language Identification is not enough to identify the correct language. This happens especially when languages are closely related or when the text being analysed includes more than one language. In this case, it may help to increase the MaxCharsToExamine parameter. To do this, select from the LWR menu:

Window > Preferences > LanguageWare > UIMA Annotation Display.

Enable the checkbox for "Show edit advanced configuration option on pipeline stages". Select Apply and OK.

Next time you open a UIMA Pipeline Configuration file, you will notice an Advanced Configuration link at the Document Language stage. Click it to expand and display its contents; the MaxCharsToExamine parameter can now be edited. Increase the value from the default, save your changes, and try again to see whether Language Identification has improved.
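
To see why the size of the examined text matters, consider the toy illustration below. It is not LanguageWare's identification algorithm: the stop-word lists, the class, and the method names are all hypothetical, and the `maxChars` argument only plays the role that MaxCharsToExamine plays in the Workbench. The guess is based solely on the first `maxChars` characters, so a document that opens in one language and continues in another can be misidentified when the window is too small.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy illustration only; not how LanguageWare identifies languages. */
final class LanguageIdWindowDemo {
    // Tiny, hypothetical stop-word lists used purely for the demonstration.
    static final Set<String> ENGLISH =
            new HashSet<>(Arrays.asList("the", "and", "of", "is", "to"));
    static final Set<String> FRENCH =
            new HashSet<>(Arrays.asList("le", "la", "et", "est", "les", "des"));

    private LanguageIdWindowDemo() {}

    /** Returns the part of the text that would be examined. */
    static String examinedPrefix(String text, int maxChars) {
        return text.length() <= maxChars ? text : text.substring(0, maxChars);
    }

    /** Guesses "en", "fr", or "unknown" by counting stop words in the prefix. */
    static String guess(String text, int maxChars) {
        int english = 0;
        int french = 0;
        String prefix = examinedPrefix(text, maxChars).toLowerCase(Locale.ROOT);
        for (String token : prefix.split("[^\\p{L}]+")) {
            if (ENGLISH.contains(token)) {
                english++;
            }
            if (FRENCH.contains(token)) {
                french++;
            }
        }
        if (english == french) {
            return "unknown";
        }
        return english > french ? "en" : "fr";
    }
}
```

With a short English heading followed by a long French body, a window that covers only the heading yields "en", while a window that covers the whole text yields "fr".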

20. How do I contact the LanguageWare team?

You can contact the LanguageWare team using the contact form on alphaWorks.

21. What languages are supported by the LanguageWare Resource Workbench?

The following table shows a list of all languages supported by the LanguageWare Resource Workbench. It also indicates whether the Workbench includes support for the following features:

LanguageWare Resource Workbench Language Support

Language       Built-in Dictionaries   POS Tagging Support
Afrikaans      No                      No
Arabic         Yes                     Yes
Catalan        No                      No
Chinese        Yes                     Yes
Czech          No                      No
Danish         No                      Yes
Dutch          No                      Yes
English        Yes                     Yes
Finnish        No                      No
French         Yes                     Yes
German         Yes                     Yes
Greek          No                      No
Italian        Yes                     Yes
Japanese       Yes                     Yes
Korean         No                      No
Multilingual   No                      No
Norwegian      No                      No
Polish         No                      No
Portuguese     Yes                     Yes
Russian        No                      No
Spanish        Yes                     Yes
Swedish        No                      No
