
Text Analysis Perspective for DB2 Warehouse

A set of Eclipse plug-ins that allows you to configure and test text analysis engines and use them in warehouse and mining flows created by DB2 Warehouse Edition 9.5.


Date Posted: December 18, 2007

1. What is UIMA?
2. I use DB2® Warehouse Design Studio 9.5. Why do I need Text Analysis Perspective for DB2 Warehouse?
3. Can I test my own UIMA analysis engines in Text Analysis Perspective for DB2 Warehouse?
4. What types of documents can be used in document collections?
5. What practical business problems does this technology solve?
6. What are the business and technical limitations of the technology?
7. Please provide a practical example of how this technology could be used.
8. I cannot see analysis results for my third-party analysis engine. What's wrong?
9. I see irrelevant types in the Analysis Result view. How can I change them?
10. I cannot run an analysis engine after I reopen a text analysis project.
11. I get errors when deleting a text analysis project.
12. The analysis results or analysis difference view is empty.
13. Regular expression look-up run doesn't work after changes for feature types.
14. My custom analysis engine doesn't work in Text Analysis Perspective. Why not?
15. I see the error "Contributor com.ibm.uima.workbench.textAnalysisExplorer cannot be created." in my DB2 Design Studio log file. What's wrong?
16. My Dictionary Look-up analysis engine doesn't find the terms I expect.
17. What do the document languages "en-XX", "de-XX", and "fr-XX" mean?
18. Text search doesn't work anymore, and in the log file I see error messages saying "problems writing index".
19. When I open a document from the search results, I get a pop-up error window saying "OLE error opening".


1. What is UIMA?

UIMA stands for Unstructured Information Management Architecture. UIMA is a software architecture for the development, discovery, composition, and deployment of components for the analysis of unstructured information.

2. I use DB2® Warehouse Design Studio 9.5. Why do I need Text Analysis Perspective for DB2 Warehouse?

If you use text operators to extract information from text columns, the Text Analysis Perspective makes it easier for you to configure these text operators. It allows you to test the dictionaries and rules on sample documents created from your database content before using them in a data flow. You can compare the results for different configurations, so you can easily see the impact of your changes. The Text Analysis Perspective displays the extracted information in the context of the document, which makes it easier for you to validate the correctness of your configuration -- or to find out why a certain rule did not yield the expected result.

3. Can I test my own UIMA analysis engines in Text Analysis Perspective for DB2 Warehouse?

Yes. To create and build your annotator, use the UIMA SDK 1.4.5. Package the analysis engine as a processing engine archive (PEAR) file, and import this PEAR file into your text analysis project. Note: Analysis engines developed with IBM® UIMA 2.0 or the Apache UIMA SDK are not supported.

4. What types of documents can be used in document collections?

Text Analysis Perspective for DB2 Warehouse allows you to run and test analysis engines on plain-text documents, including text, HTML, and XML documents. Binary document formats, such as PDF or Microsoft® Word, are not supported.

5. What practical business problems does this technology solve?

Unstructured information is increasingly a focus of DB2 warehousing. Analyses of warranty claims or call center records benefit greatly from information that was previously locked in unstructured text. DB2 Warehouse Design Studio 9.5 can run UIMA annotators within an ETL (extract, transform, load) flow. For these annotators to be useful, non-technical users must be able to configure and test them easily. Supporting this task is the goal of the Text Analysis Perspective.

6. What are the business and technical limitations of the technology?

Business limitations: None.

Technical limitations:

  • Only plain-text documents can be used for testing; documents in binary formats such as PDF or Microsoft® Word cannot be used.
  • The user interface of Text Analysis Perspective is English-only (but the actual text analysis is not restricted to English).
  • The Text Analysis Perspective uses UIMA 1.4.5. Annotators based on UIMA 2.0 or above are not supported.

7. Please provide a practical example of how this technology could be used.

A database with product repair reports contains the product name in a structured column, but the parts that were actually repaired are mentioned only in a free-text comments column filled in by the mechanic. With Text Analysis Perspective for DB2 Warehouse, a warranty expert can import parts of the database, get an overview of the parts and actions mentioned in the warranty comments, and configure the list-based annotator within the Text Analysis Perspective to detect these parts. The expert can then test the annotator's performance by analyzing some database columns and, based on the results, enhance the annotator with additional terms. The configured annotator can then be used in an ETL flow within DB2 Warehouse Design Studio. In this flow, the part names are extracted from the comments column of the database, and a subsequent cross-tabbing analysis that correlates the products and part names reveals the most problematic parts per product -- an analysis that wasn't possible from the structured data alone.

8. I cannot see analysis results for my third-party analysis engine. What's wrong?

Text Analysis Perspective stores annotations of the types that are declared as output types in the capabilities section of the analysis engine descriptor. If this section is incorrect or empty, the documents will still be processed, and the XCAS files that are generated will contain the results. However, the Analysis Results and Analysis Differences views will not show any annotations.

9. I see irrelevant types in the Analysis Result view. How can I change them?

Text Analysis Perspective stores annotations of the types that are declared as output types in the capabilities section of the analysis engine descriptor. If you use a custom analysis engine, make sure that it declares only the output types that are relevant for your configuration task. Otherwise, too many irrelevant annotations will be stored, and that affects performance.

10. I cannot run an analysis engine after I reopen a text analysis project.

If you closed a text analysis project and reopened it in the same session, the analysis engine will not run anymore. You must restart your DB2 Warehouse Design Studio in order to make the analysis engine run correctly again for your current project.

11. I get errors when deleting a text analysis project.

If you delete a text analysis project and choose to delete the content of the project, and if you also previously ran an analysis engine, an error message will appear. There are two possible ways to solve the problem:
  • Delete the project without deleting the project content. You can later delete the project content manually from your file system. You will find the project content in your workspace directory named the same as your text analysis project.
  • Restart the DB2 Warehouse Design Studio and delete the project and the project content.

12. The analysis results or analysis difference view is empty.

The analysis results view provides a filter. If the filter is active, documents that do not contain annotations are not shown. The analysis results view displays the documents page by page, so if every document on a page is filtered out, that page is empty. Turn off the filter to see all documents of the current page.

The same is true for the analysis difference view. Here, the filter allows hiding of documents that have not changed between two analysis engine runs. Thus, a page in the analysis difference view can be empty if it only contains documents without changes. Turn off the filter to see all documents of the current page.
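Why a single page can be empty while other pages still show documents can be illustrated with a toy model. The sketch below assumes that the view first splits the collection into fixed-size pages and only then applies the filter to each page; the class name, method name, and page size are hypothetical and are not part of the product.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only; this is not the product's implementation.
final class PageFilterSketch {

    /**
     * Returns the documents shown on the given zero-based page.
     * With the filter active, documents without annotations are hidden.
     */
    static List<String> visibleOnPage(List<String> docs, List<Integer> annotationCounts,
                                      int pageSize, int page, boolean filterActive) {
        List<String> visible = new ArrayList<>();
        int start = page * pageSize;
        int end = Math.min(start + pageSize, docs.size());
        for (int i = start; i < end; i++) {
            if (!filterActive || annotationCounts.get(i) > 0) {
                visible.add(docs.get(i));
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a.txt", "b.txt", "c.txt", "d.txt");
        List<Integer> counts = Arrays.asList(0, 0, 3, 1); // annotations per document

        // Page 0 holds a.txt and b.txt, neither of which has annotations.
        System.out.println(visibleOnPage(docs, counts, 2, 0, true));  // prints []
        System.out.println(visibleOnPage(docs, counts, 2, 0, false)); // prints [a.txt, b.txt]
    }
}
```

In this model the first page is empty with the filter on, even though the second page still lists two documents.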


13. Regular expression look-up run doesn't work after changes for feature types.

The RegEx editor creates an invalid rule file if the type of a feature is changed after a subpattern reference has been specified for it. To fix this, delete the feature in question from your rule file and run the regular expression look-up. Then insert the feature again with the correct type, and create the subpattern reference again.

14. My custom analysis engine doesn't work in Text Analysis Perspective. Why not?

Ensure that your PEAR file works when running it with the pearinstaller utility of the UIMA SDK. If that is the case, the most likely cause is that your analysis engine needs environment variables other than the UIMA data path and the classpath. Such analysis engines are not supported in DB2 Warehouse. In addition, make sure that the output types defined in the capabilities section of your analysis engine are defined in the type system of your analysis engine.
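The last check, that every output type named in the capabilities section is also defined in the type system, can be sketched with the JDK's XML parser. This is a rough sketch under assumptions: the element names `typeDescription`, `name`, `outputs`, and `type` are assumed to match your descriptor layout, the `descriptor` root element in the sample is a placeholder, and the class and method names are hypothetical. Types that come from imported type systems or that are built into UIMA would also be reported as undeclared by this simple check.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper; element names are assumptions about the descriptor layout.
final class OutputTypeCheckSketch {

    /** Returns output types that have no matching typeDescription in the same descriptor. */
    static List<String> undeclaredOutputTypes(String descriptorXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(descriptorXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Descriptor is not well-formed XML", e);
        }

        // Collect declared type names: the direct <name> child of each <typeDescription>.
        Set<String> declared = new HashSet<>();
        NodeList typeDescriptions = doc.getElementsByTagName("typeDescription");
        for (int i = 0; i < typeDescriptions.getLength(); i++) {
            for (Node c = typeDescriptions.item(i).getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c instanceof Element && "name".equals(c.getNodeName())) {
                    declared.add(c.getTextContent().trim());
                    break; // skip nested feature names
                }
            }
        }

        // Compare against every <type> listed inside an <outputs> element.
        List<String> missing = new ArrayList<>();
        NodeList outputs = doc.getElementsByTagName("outputs");
        for (int i = 0; i < outputs.getLength(); i++) {
            NodeList types = ((Element) outputs.item(i)).getElementsByTagName("type");
            for (int j = 0; j < types.getLength(); j++) {
                String name = types.item(j).getTextContent().trim();
                if (!declared.contains(name)) {
                    missing.add(name);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String xml =
              "<descriptor>"
            + "<typeSystemDescription><types>"
            + "<typeDescription><name>com.example.Part</name></typeDescription>"
            + "</types></typeSystemDescription>"
            + "<capabilities><capability><outputs>"
            + "<type>com.example.Part</type><type>com.example.Action</type>"
            + "</outputs></capability></capabilities>"
            + "</descriptor>";
        System.out.println(undeclaredOutputTypes(xml)); // prints [com.example.Action]
    }
}
```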

15. I see the error "Contributor com.ibm.uima.workbench.textAnalysisExplorer cannot be created." in my DB2 Design Studio log file. What's wrong?

This is merely a limitation of the current version and has no actual effect. Even if this error occurs, you will still be able to work with Text Analysis Perspective.

16. My Dictionary Look-up analysis engine doesn't find the terms I expect.

Check whether the expected terms are shown in the "inflections" section of the dictionary editor. For example, the country "Vietnam" does not have any inflections by default, so the dictionary look-up will not find the country in a document that contains only the lowercase form "vietnam". If you created the dictionary based on frequent terms, make sure that the dictionary name does not contain spaces. If it does, create a new dictionary and import the entries from the existing dictionary, using the import function of the dictionary editor.
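The "Vietnam" case can be pictured with a toy model in which a term is found only if the text contains one of the surface forms listed for the dictionary entry. This is an illustration under that assumption, not the matching logic of the Dictionary Look-up analysis engine; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model only; the real engine's matching rules are not described here.
final class InflectionLookupSketch {

    /** True if the token equals one of the surface forms listed for the entry. */
    static boolean matches(Set<String> surfaceForms, String token) {
        return surfaceForms.contains(token);
    }

    public static void main(String[] args) {
        // Entry with its base form only, as in the "Vietnam" example.
        Set<String> vietnam = new HashSet<>(Arrays.asList("Vietnam"));
        System.out.println(matches(vietnam, "Vietnam")); // prints true
        System.out.println(matches(vietnam, "vietnam")); // prints false

        // If the lowercase form were listed as well, the token would match.
        vietnam.add("vietnam");
        System.out.println(matches(vietnam, "vietnam")); // prints true
    }
}
```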

17. What do the document languages "en-XX", "de-XX", and "fr-XX" mean?

These language codes are used by the Dictionary Look-up analysis engine to correctly determine inflections of terms in a dictionary. The "XX" means that no specific language variant (for example, British English) will be used when processing the document; instead, a general language dictionary that contains inflections from several variants (such as both British English and American English) will be used.

18. Text search doesn't work anymore, and in the log file I see error messages saying "problems writing index".

Most likely, a previous DB2 Design Studio session exited abnormally, which results in an inconsistent state of the search index. In order to restore the index state, follow these steps:
  1. Check the Design Studio log for a message such as "Lock obtain timed out: SimpleFSLock@<path to a text analysis project>\resources\.lucene\write.lock".
  2. Write down the path to the write.lock file.
  3. Close DB2 Design Studio.
  4. Delete the write.lock file.
  5. Restart DB2 Design Studio.
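The deletion in step 4 can also be scripted. The following is a minimal sketch that assumes the lock file sits at resources\.lucene\write.lock below the text analysis project directory, as in the log message above; the class and method names are hypothetical, and the program should only be run while DB2 Design Studio is closed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; run it only while DB2 Design Studio is closed.
final class StaleLockCleanupSketch {

    /** Builds the expected lock file location below a text analysis project directory. */
    static Path lockFileFor(Path projectDir) {
        return projectDir.resolve("resources").resolve(".lucene").resolve("write.lock");
    }

    /** Deletes the lock file if present; returns true if a file was removed. */
    static boolean deleteStaleLock(Path projectDir) {
        try {
            return Files.deleteIfExists(lockFileFor(projectDir));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: StaleLockCleanupSketch <path to a text analysis project>");
            return;
        }
        Path project = Paths.get(args[0]);
        boolean deleted = deleteStaleLock(project);
        System.out.println(deleted
                ? "Deleted " + lockFileFor(project)
                : "No write.lock found under " + project);
    }
}
```

Pass the project path noted in step 2 (without the trailing resources\.lucene\write.lock part) as the only argument.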

19. When I open a document from the search results, I get a pop-up error window saying "OLE error opening".

Text Analysis Perspective opens the search result document in the system editor that your operating system defines for that type of file. The error occurs if no such editor is defined. Use Windows Explorer to associate a suitable editor with the document's file type, and try again.


Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
IBM and DB2 are trademarks of IBM Corporation in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
