Date Posted: December 7, 2006
Update: June 22, 2009
The LanguageWare Resource Workbench 7.1.1.3 contains significant improvements to its performance and memory footprint when annotating large documents and document collections. It also contains a new dictionary merge capability and some bug fixes.
- 1. What is the LanguageWare Resource Workbench? Why should I use it?
- 2. Where should I start with LanguageWare?
- 3. What's new in this version of LanguageWare Resource Workbench?
- 4. Does the Workbench generate UIMA annotator code and configuration files that I can use in my application?
- 5. What version of UIMA do I need to use the LanguageWare Annotators?
- 6. What documentation is available to help me use LanguageWare?
- 7. Why is the Workbench shipped as an Eclipse-based application?
- 8. What are the known limitations with this release of the Workbench?
- 9. What is the LanguageWare Demonstrator?
- 10. What are the key lessons that you have learned in building the LanguageWare Workbench?
- 11. What are Domain Extraction Models? How do I build a good Domain Extraction Model?
- 12. How do I change the default editor for new file types in the Workbench?
- 13. How do I develop my own custom UIMA Annotators in the Workbench?
- 14. How do I integrate the UIMA Analyzers that I develop in the Workbench?
- 15. How do I store my models?
- 16. How do I open a non-anno file in the LanguageWare Text Editor?
- 17. Is LanguageWare available as a services offering?
- 18. What licensing conditions apply for LanguageWare on alphaWorks, for academic purposes, or for commercial use?
- 19. Is Language Identification identifying the wrong language?
- 20. How do I contact the LanguageWare team?
- 21. What languages are supported by the LanguageWare Resource Workbench?
1. What is the LanguageWare Resource Workbench? Why should I use it?
The LanguageWare Resource Workbench is a comprehensive Eclipse-based environment for developing UIMA Analyzers. The Workbench ships with a set of template annotators that can be modified through the Workbench to generate a new set of custom annotators, tailored to your specific needs. These new annotators can then be simply exported from the Workbench (as a Pear) and installed into any UIMA pipeline. The customization is achieved through building Domain Extraction Models (lexico-semantic resources and parsing rules) which describe the entities and relationships you wish to extract. If you satisfy the following criteria, then you will want to use LanguageWare:
- You need a robust open standards (UIMA) text analyzer that can be easily customized to your specific domain and analysis challenges.
- You need a technology that will enable you to exploit your existing structured data repositories in the analysis of the unstructured sources.
- You need a technology that allows you to build custom domain models that become your intellectual property and differentiation in the marketplace.
- You need a technology that is multi-lingual, multi-platform, multi-domain, and high performance.
2. Where should I start with LanguageWare?
The best way to get started with LanguageWare is to install LanguageWare Resource Workbench and the Demonstrator from alphaWorks. The Demonstrator provides a workspace that is fully populated with a set of dictionaries, rules, UIMA pipelines, and associated documents. This gives you a starting point from which to build your personalized analysis pipelines. There are also a number of short "Getting Started" videos on the alphaWorks Download page that introduce the Workbench and show how it works (using the Demonstrator environment).
3. What's new in this version of LanguageWare Resource Workbench?
All new features are outlined in the Workbench Release Notes, ReleaseNotes.htm, which is located in the Workbench installation directory.
4. Does the Workbench generate UIMA annotator code and configuration files that I can use in my application?
The Workbench generates the UIMA annotator code, configuration files, and LanguageWare resources required. The deployment packaging is not yet automated; instructions for extracting the required files are included in the Getting Started Guide. Note, however, that the alphaWorks license allows you to use this code for evaluation purposes only.
5. What version of UIMA do I need to use the LanguageWare Annotators?
LanguageWare Resource Workbench ships with, and has been tested against, Apache UIMA Version 2.2.2. The LanguageWare annotators should work with newer versions of Apache UIMA; however, they have not been extensively tested for compatibility, so we recommend Apache UIMA v2.2.2. The annotators are not compatible with versions of UIMA prior to 2.1, which were released by IBM and have a namespace conflict with Apache UIMA.
6. What documentation is available to help me use LanguageWare?
Context-sensitive help is provided as part of the LanguageWare Workbench. There is an online help system shipped with the LanguageWare Workbench (under Help / Help Contents). Several "Getting Started" videos are provided on the alphaWorks Download page. More detailed information about the underlying APIs will be provided for fully-licensed users of the technology.
7. Why is the Workbench shipped as an Eclipse-based application?
We built the Workbench on Eclipse because it provides a collaborative framework through which we can share components with other product teams across IBM, with our partners, and with our customers. This version of the Workbench is a complete, stand-alone application. However, users can still get the benefits of the Eclipse IDE by installing Eclipse features into the Workbench. Popular features include the Eclipse CVS feature for managing shared projects and the Eclipse XML feature for full XML editing support. See the Eclipse online help for more information about finding and installing new features. It is important to understand that while the Workbench is Eclipse-based, the Annotators that are exported from the Workbench (under File / Export) can be installed into any UIMA pipeline and can be deployed in a variety of ways. As part of the commercial LanguageWare license, the LanguageWare team provides integration source code to simplify the overall deployment and integration effort. This includes UIMA serializers, CAS consumers, and APIs for integrating through C/JNI, Eclipse, Web Services (REST), and others.
8. What are the known limitations with this release of the Workbench?
Any problems or limitations are outlined in the Workbench Release Notes, ReleaseNotes.htm, which is located in the Workbench installation directory and is part of the Workbench Help System.
9. What is the LanguageWare Demonstrator?
The LanguageWare Demonstrator is a workspace that is fully populated with a set of dictionaries, rules, UIMA pipelines, and associated documents that you can use within the LanguageWare Workbench. This workspace gives you a great starting point with which to build your personalized analysis pipelines.
10. What are the key lessons that you have learned in building the LanguageWare Workbench?
We've learned several important lessons:
- NLP is personal to each customer, and must be transparent and customizable
- Building extraction models is an iterative process of discovery
- Getting the model “right” pays in reduced maintenance, support, and extensibility
- The ad-hoc iterative approach works and gets results in record time
- Speed in modelling can result in sub-optimal models without careful consideration, so respect the formal modelling process!
- Best practices help optimize models and reduce overall cost to develop, support and maintain
11. What are Domain Extraction Models? How do I build a good Domain Extraction Model?
The process of customizing the template annotators provided with the Workbench requires you to build a set of resources that describe what you want to extract. We call them extraction models since they also give instructions, using the Parser Rules, for how you want to perform the identification/extraction process. The models are a combination of:
- The morphological resources, which describe the basic language characteristics
- The lexico-semantic resources, which describe the entities/concepts that you want to recognize
- The POS tagger resource
- The parsing rules, which describe how concepts combine to generate new entities and relationships.
Building a good model follows these steps:
- The process starts by collecting a set of representative documents that contain examples of the type of entities and relationships you are looking to extract through the analyzer (model) you are building.
- The second step is to manually extract some good examples of sentences that can act as "templates" from which you can start modelling.
- Then stop and think... and think some more!
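To make the combination of lexico-semantic resources and parsing rules more concrete, here is a minimal conceptual sketch in plain Java. It does not use the LanguageWare resource formats or APIs; the dictionary entries, the concept type names (PERSON, JOIN_VERB, ORGANIZATION), and the EMPLOYED_BY relationship are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Conceptual sketch only: a toy dictionary plus one rule, not the LanguageWare resource format. */
public class ExtractionModelSketch {
    // Hypothetical lexico-semantic resource: surface form -> concept type.
    private static final Map<String, String> DICTIONARY = new HashMap<>();
    static {
        DICTIONARY.put("alice", "PERSON");
        DICTIONARY.put("bob", "PERSON");
        DICTIONARY.put("acme", "ORGANIZATION");
        DICTIONARY.put("joined", "JOIN_VERB");
    }

    /** Hypothetical rule: PERSON JOIN_VERB ORGANIZATION yields an EMPLOYED_BY relationship. */
    public static List<String> extract(String sentence) {
        // Lower-case the text and split on anything that is not a letter.
        String[] tokens = sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        List<String> relations = new ArrayList<>();
        // Slide a three-token window over the sentence and test the rule at each position.
        for (int i = 0; i + 2 < tokens.length; i++) {
            if ("PERSON".equals(DICTIONARY.get(tokens[i]))
                    && "JOIN_VERB".equals(DICTIONARY.get(tokens[i + 1]))
                    && "ORGANIZATION".equals(DICTIONARY.get(tokens[i + 2]))) {
                relations.add("EMPLOYED_BY(" + tokens[i] + ", " + tokens[i + 2] + ")");
            }
        }
        return relations;
    }

    public static void main(String[] args) {
        // Prints [EMPLOYED_BY(alice, acme)]
        System.out.println(extract("Alice joined Acme in 2008."));
    }
}
```

The point of the sketch is the division of labour: the dictionary recognizes individual concepts, and the rule describes how adjacent concepts combine into a new relationship. In the Workbench, both parts are built as resources rather than written as code.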
11. What do the LanguageWare Runtime libraries provide?
LanguageWare provides many run-time libraries. Although each of these libraries provides discrete functionality, many libraries build on the functionality provided by the core LanguageWare libraries. The following is a non-exhaustive list of the libraries and their functions.
- dlt.jar, rule_dlt.jar and icu4j.jar: provide core functionality, such as lexical analysis, dictionary look-up, and spelling correction
- tagger_dlt.jar: provides part-of-speech tagging; requires the lexical analysis libraries mentioned above
- dltls.jar: provides support for ontology-based semantic analysis of documents
- an_dlt.jar, an_tagger_dlt.jar: used for running LanguageWare annotators in a UIMA pipeline
- jfst.jar, antlr.jar: used for running the rule-based annotator
- jdemo.jar: used by the sample applications
- DictionaryBuilder.jar: used to build LanguageWare dictionaries from the command line. Building dictionaries using LanguageWare Resource Workbench is recommended instead. Several supporting JAR files are required; they are included in the lib directory.
12. How do I change the default editor for new file types in the Workbench?
Go to Window / Preferences / General / Editors / File Associations. If the content type is already listed, just add a new editor and pick the LanguageWare Text Editor. You can set this to be the default, or alternatively leave it as an option that you can choose, on right click, whenever you open a file of that type. You will need to restart the Workbench before this comes into effect. Note: Eclipse remembers the last viewer you used for a file type so if you opened a document with a different editor beforehand you may need to right-click on the file and explicitly choose the LanguageWare Text Editor the first time on restart.
12. What resources are included in the Runtime package?
There are several dictionaries included in the run-time environment. The latest official lexical analysis dictionaries are included in the IBM-dictionaries folder. In addition, dictionaries required for running the sample applications are stored in the samples\SampleDictionaries folder. Users can request the most recent dictionaries (if they are not present in the IBM-dictionaries folder) by contacting LanguageWare.
13. How do I develop my own custom UIMA Annotators in the Workbench?
The template annotators that ship with the Workbench are customized through building Domain Extraction Models (resources) which describe the entities/concepts that you want to recognize and how the concepts combine to generate new entities and relationships. Developing UIMA Annotators in the Workbench doesn't require any coding at all. You simply develop the Domain Models, create a UIMA configuration that includes these models and our underlying template Annotators, and export the resulting Annotator (or aggregate Annotator) from the Workbench (exported as a Pear).
13. How do I use the LanguageWare Runtime components?
The LanguageWare Runtime libraries can be used as part of a Java application. Several sample applications are included in the samples directory. These give an idea of some useful applications.
The libraries can also be used to create custom annotators based on LanguageWare. The preferred way to generate these annotators is by using LanguageWare Resource Workbench. However, it is possible to manually create LanguageWare-based UIMA annotators. UIMA requires XML descriptors for annotators. If you have downloaded LanguageWare Resource Workbench, you can use it to generate a PEAR file based on the sample rule-based annotator. This PEAR file contains descriptors for the core LanguageWare annotator, the POS Tagger annotator, and the rule-based annotator, as well as a descriptor for running all these together. The PEAR file also contains the libraries and resources required for running the annotators. See the Getting Started Guide for information about generating and installing a PEAR file.
14. How do I integrate the UIMA Analyzers that I develop in the Workbench?
Once you have completed building your Domain Extraction Models (dictionaries and rules), the Workbench provides an "Export as UIMA Pear" function under File / Export. This will generate a Pear file that contains all the code and resources required to run your pipeline in any UIMA-enabled application, that is, in a UIMA pipeline.
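Before installing an exported Pear into another application, it can be useful to check what the archive contains. The following is a minimal sketch, assuming only that the exported file is a standard UIMA PEAR package, which is a ZIP archive with its installation descriptor at metadata/install.xml. The class name PearInspector and its methods are hypothetical helpers, not part of LanguageWare or UIMA, and the sketch uses only the Java standard library.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper that inspects a PEAR archive using only java.util.zip. */
public class PearInspector {
    /** Location of the installation descriptor in the UIMA PEAR package layout. */
    static final String INSTALL_DESCRIPTOR = "metadata/install.xml";

    /** Returns the names of all entries in the archive. */
    public static List<String> listEntries(Path pear) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(pear.toFile())) {
            for (ZipEntry entry : Collections.list(zip.entries())) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** Returns true if the archive contains the PEAR installation descriptor. */
    public static boolean hasInstallDescriptor(Path pear) throws IOException {
        try (ZipFile zip = new ZipFile(pear.toFile())) {
            return zip.getEntry(INSTALL_DESCRIPTOR) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PearInspector path/to/exported.pear
        Path pear = Paths.get(args[0]);
        System.out.println("Install descriptor present: " + hasInstallDescriptor(pear));
        listEntries(pear).forEach(System.out::println);
    }
}
```

This only inspects the archive; installing it and running the annotators is still done with the UIMA PEAR tooling, as described in the Getting Started Guide.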
14. What documentation is available to help me use the LanguageWare Runtime components?
The documentation for the run-time environment can be found in the doc folder of LW70.zip. This file contains the User Guide for LanguageWare in HTML and PDF formats. The Javadoc folder contains documentation on the LanguageWare APIs. The documentation for LanguageWare Resource Workbench might also be of use.
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
IBM, LanguageWare, and alphaWorks are trademarks of IBM Corporation in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
15. How do I store my models?
The Workbench is designed primarily to help you build your domain extraction models, and this includes databases in which you can store your models. The Workbench ships with an embedded open source database (Derby), but it can also connect to any enterprise database, such as DB2 or Oracle.
16. How do I open a non-anno file in the LanguageWare Text Editor?
You can either right-click on the file and choose the "LanguageWare Text Editor", or alternatively update the default editor for that particular file type. See the "How do I change the default editor for new file types in the Workbench?" question in this FAQ.
17. Is LanguageWare available as a services offering?
Yes, LanguageWare is available as a Services Offering. For more details, please see LanguageWare domain modelling and integration services.
18. What licensing conditions apply for LanguageWare on alphaWorks, for academic purposes, or for commercial use?
There are licensing conditions for using the LanguageWare Tools.
- On alphaWorks:
The LanguageWare Resource Workbench, all the code contained within the package, and the generated Annotator code are provided on alphaWorks for evaluation purposes only, as a complimentary download for a 90-day trial period. The purpose of this alphaWorks download is to allow you to evaluate the technology, to get a feeling for how it works and whether it might be useful to you, and to share with us your feedback and suggestions on how we could improve the technology in order to speed up its development.
- For academic purposes:
If a University, or a representative of a University, wishes to use LanguageWare for entirely academic (non-commercial) purposes, then they should register on the IBM Academic Initiative Website and notify the LanguageWare team using the contact form. If a University, or a representative of a University, wishes to use IBM LanguageWare for a productive use purpose, then a commercial license must first be acquired. The fee for such productive use licensing arrangements will be based on the number of supported languages, the number of CPUs, and the volume of text for analysis. If you wish to apply for a productive use license, please include these details in your request to help expedite it. Address the request to your IBM Representative or IBM Business Partner, or send it to the LanguageWare team using the contact form.
- For commercial or other productive use purposes:
If IBM LanguageWare is being used for a commercial or other productive use purpose, then a license to do so must first be acquired. The fee for such productive use licensing arrangements will be based on the number of supported languages, the number of CPUs, and the volume of documentation for analysis. If you wish to apply for a productive use license, please include these details in your request to help expedite it. Address the request to your IBM Representative or IBM Business Partner, or send it to the LanguageWare team using the contact form.
19. Is Language Identification identifying the wrong language?
Sometimes the default amount of text (1024 characters) used by Language Identification is not enough to identify the correct language. This happens especially when languages are closely related or when the text being analysed includes text in more than one language.
In this case, it may help to increase the MaxCharsToExamine parameter. To do this, select from the LWR menu:
Window > Preferences > LanguageWare > UIMA Annotation Display.
Enable the checkbox for "Show edit advanced configuration option on pipeline stages". Select Apply and OK.
Next time you open a UIMA Pipeline Configuration file, you will notice an Advanced Configuration link at the Document Language stage.
Click on it to expand and display its contents. Notice that the MaxCharsToExamine parameter can be edited. Change the default number displayed to a larger value.
Save your changes and try again to see if the Language Identification has improved.
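To illustrate why a larger examination window can change the result, here is a toy sketch in plain Java. It is not the algorithm used by LanguageWare Language Identification; the stop-word lists, the class name, and the guess method are hypothetical, and only the idea of limiting analysis to the first maxCharsToExamine characters mirrors the parameter described above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy language guesser: illustrates the examination-window effect only. */
public class LanguageGuessSketch {
    // Hypothetical, deliberately tiny stop-word lists.
    private static final Set<String> ENGLISH =
            new HashSet<>(Arrays.asList("the", "and", "is", "of", "to"));
    private static final Set<String> FRENCH =
            new HashSet<>(Arrays.asList("le", "la", "et", "est", "les", "des"));

    /** Guesses "en" or "fr" from at most maxCharsToExamine leading characters; "unknown" on a tie. */
    public static String guess(String text, int maxCharsToExamine) {
        // Only the leading window is examined; everything after it is ignored.
        String window = text.substring(0, Math.min(text.length(), maxCharsToExamine));
        int en = 0;
        int fr = 0;
        for (String token : window.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (ENGLISH.contains(token)) en++;
            if (FRENCH.contains(token)) fr++;
        }
        if (en == fr) return "unknown";
        return en > fr ? "en" : "fr";
    }

    public static void main(String[] args) {
        String text = "Le rapport: The engine is fast and the parser is part of the toolkit.";
        System.out.println(guess(text, 12));            // prints fr (only "Le rapport: " is examined)
        System.out.println(guess(text, text.length())); // prints en (the whole sentence is examined)
    }
}
```

A document that opens with a short passage in one language and continues in another shows the same effect at a larger scale, which is why raising MaxCharsToExamine can help with mixed-language text.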
20. How do I contact the LanguageWare team?
You can contact the LanguageWare team using the contact form on alphaWorks.
21. What languages are supported by the LanguageWare Resource Workbench?
The following table shows a list of all languages supported by the LanguageWare Resource Workbench. It also indicates whether the Workbench includes support for the following features:
| Language | Built-in Dictionaries | POS Tagging Support |
|---|---|---|
| Afrikaans | No | No |
| Arabic | Yes | Yes |
| Catalan | No | No |
| Chinese | Yes | Yes |
| Czech | No | No |
| Danish | No | Yes |
| Dutch | No | Yes |
| English | Yes | Yes |
| Finnish | No | No |
| French | Yes | Yes |
| German | Yes | Yes |
| Greek | No | No |
| Italian | Yes | Yes |
| Japanese | Yes | Yes |
| Korean | No | No |
| Multilingual | No | No |
| Norwegian | No | No |
| Polish | No | No |
| Portuguese | Yes | Yes |
| Russian | No | No |
| Spanish | Yes | Yes |
| Swedish | No | No |
