Text Analysis Perspective for DB2 Warehouse
A set of Eclipse plug-ins that allows you to configure and test text analysis engines and use them in warehouse and mining flows created by DB2 Warehouse Edition 9.5.
Date Posted: December 18, 2007
 |  If you use text operators to extract information from text columns, the Text Analysis Perspective makes it easier for you to configure these text operators. It allows you to test the dictionaries and rules on sample documents created from your database content before using them in a data flow. You can compare the results for different configurations, so you can easily see the impact of your changes. The Text Analysis Perspective displays the extracted information in the context of the document, which makes it easier for you to validate the correctness of your configuration -- or to find out why a certain rule did not yield the expected result. | | |
 |  Yes. To create and build your annotator, use the UIMA SDK 1.4.5. Package the analysis engine as a processing engine archive (PEAR) file, and import this PEAR file into your text analysis project. Note: Analysis engines developed with IBM® UIMA 2.0 or the Apache UIMA SDK are not supported. | | |
 |  Text Analysis Perspective for DB2 Warehouse allows you to run and test analysis engines on plain-text documents, including text, HTML, and XML documents. Binary document formats, such as PDF or Microsoft® Word, are not supported. | | |
 |  Unstructured information is increasingly a focus of DB2 warehousing. The analysis of warranty claims or call center records, for example, benefits greatly from information that was previously locked in unstructured text. DB2 Warehouse Design Studio 9.5 includes the ability to run UIMA annotators within an ETL (extract, transform, load) flow. For these annotators to be useful, non-technical users must be able to configure and test them easily. Supporting this task is the goal of the Text Analysis Perspective. | | |
 |  Business limitations: None.
Technical limitations:
- Only plain-text documents can be used for testing; documents in binary formats such as PDF or Microsoft® Word cannot be used.
- The user interface of Text Analysis Perspective is English-only (but the actual text analysis is not restricted to English).
- The Text Analysis Perspective uses UIMA 1.4.5. Annotators based on UIMA 2.0 or above are not supported.
| | |
 |  A database of product repair reports contains the product name in a structured column, but the parts that were actually repaired are mentioned only in a free-text comments column filled in by the mechanic. With Text Analysis Perspective for DB2 Warehouse, a warranty expert can import parts of the database, get an overview of the parts and actions mentioned in the warranty comments, and configure the list-based annotator within the Text Analysis Perspective to detect these parts. The expert can then test the annotator's performance by analyzing some database columns and, based on the results, enhance the annotator with additional terms. The configured annotator can then be used in an ETL flow within DB2 Warehouse Design Studio. In this flow, the part names are extracted from the comments column of the database, and a subsequent cross-tabbing analysis that correlates the products and part names reveals the most problematic parts per product -- an analysis that wasn't possible from the structured data alone.
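The flow described above can be sketched in a few lines of Python. This is only an illustration of the idea (list-based extraction followed by a cross-tabulation), not how DB2 Warehouse implements it; the sample rows, the part list, and the function names `extract_parts` and `cross_tab` are all hypothetical.

```python
from collections import Counter

# Hypothetical sample rows: (product, mechanic's free-text comment).
REPAIRS = [
    ("Mower X1", "Replaced spark plug and cleaned air filter."),
    ("Mower X1", "Spark plug fouled again, replaced."),
    ("Trimmer T2", "Replaced air filter; spark plug OK."),
]

# Hypothetical part list, standing in for the configured list-based annotator.
PARTS = ["spark plug", "air filter"]


def extract_parts(comment, parts=PARTS):
    """Return the listed part names mentioned in the comment (case-insensitive)."""
    text = comment.lower()
    return [part for part in parts if part in text]


def cross_tab(rows):
    """Count how often each (product, part) pair is mentioned."""
    counts = Counter()
    for product, comment in rows:
        for part in extract_parts(comment):
            counts[(product, part)] += 1
    return counts


if __name__ == "__main__":
    for (product, part), n in sorted(cross_tab(REPAIRS).items()):
        print(product, part, n)
```

Note that this naive matcher counts every mention, including "spark plug OK"; deciding which mentions are relevant is exactly the kind of tuning that testing an annotator on sample documents is meant to support.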
| | |
 |  Text Analysis Perspective stores annotations of the types that are declared as output types in the capabilities section of the analysis engine descriptor. If this section is incorrect or empty, the documents will still be processed, and the XCAS files that are generated will contain the results. However, the Analysis Results and Analysis Differences views will not show any annotations.
| | |
 |  Text Analysis Perspective stores annotations of the types that are declared as output types in the capabilities section of the analysis engine descriptor. If you use a custom analysis engine, make sure that it declares only the output types that are relevant for your configuration task. Otherwise, too many irrelevant annotations will be stored, and that affects performance.
| | |
 |  If you close a text analysis project and reopen it in the same session, the analysis engine will no longer run. You must restart DB2 Warehouse Design Studio to make the analysis engine run correctly again for your current project.
| | |
 |  If you delete a text analysis project and choose to delete the content of the project, and if you also previously ran an analysis engine, an error message will appear. There are two possible ways to solve the problem:
- Delete the project without deleting the project content. You can later delete the project content manually from your file system. The project content is located in your workspace, in the directory that has the same name as your text analysis project.
- Restart the DB2 Warehouse Design Studio and delete the project and the project content.
| | |
 |  The Analysis Results view provides a filter: If the filter is active, documents that do not contain annotations are not shown. The Analysis Results view displays the documents page by page. If the filter hides every document on a page, the page appears empty. Turn off the filter to see all documents of the current page.
The same is true for the Analysis Differences view. Here, the filter hides documents that have not changed between two analysis engine runs. Thus, a page in the Analysis Differences view can be empty if it contains only documents without changes. Turn off the filter to see all documents of the current page.
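The behavior described above follows if documents are first split into pages and the filter is then applied within each page. The following Python sketch models that assumption; it is not the product's code, and the document structure and the names `pages` and `visible` are hypothetical.

```python
def pages(documents, page_size):
    """Split the document list into fixed-size pages (before any filtering)."""
    return [documents[i:i + page_size] for i in range(0, len(documents), page_size)]


def visible(page, filter_active):
    """Return the documents shown on one page.

    With the filter active, documents without annotations are hidden,
    so a page can end up empty even though later pages have content.
    """
    if not filter_active:
        return list(page)
    return [doc for doc in page if doc["annotations"]]


# Hypothetical sample: the first two documents have no annotations.
DOCS = [
    {"name": "doc1", "annotations": []},
    {"name": "doc2", "annotations": []},
    {"name": "doc3", "annotations": ["Part"]},
    {"name": "doc4", "annotations": []},
]

if __name__ == "__main__":
    for number, page in enumerate(pages(DOCS, page_size=2), start=1):
        print(number, [doc["name"] for doc in visible(page, filter_active=True)])
```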
| | |
 |  The RegEx editor creates an invalid rule file if the type of a feature is changed after a subpattern reference has been specified for it. To fix this, delete the feature in question from your rule file. Then run the regular expression look-up, insert the feature again with the correct type, and create the subpattern reference again.
| | |
 |  Ensure that your PEAR file works when running it with the pearinstaller utility of the UIMA SDK. If that is the case, the most likely cause is that your analysis engine needs environment variables other than the UIMA data path and the classpath. Such analysis engines are not supported in DB2 Warehouse. In addition, make sure that the output types defined in the capabilities section of your analysis engine are defined in the type system of your analysis engine.
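The last check above (every output type in the capabilities section should also be defined in the type system) can be automated with a short script. The sketch below assumes a descriptor layout with `typeDescription/name` elements for type definitions and `outputs/type` elements inside the capabilities section; verify this against your own descriptor. It ignores XML namespaces and does not follow type system imports, so types defined in imported files would be reported as missing. The function name and the sample descriptor are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def undeclared_output_types(descriptor_xml):
    """Return output types listed under <outputs> that have no <typeDescription>."""
    root = ET.fromstring(descriptor_xml)
    declared, outputs = set(), set()
    for elem in root.iter():
        name = local(elem.tag)
        if name == "typeDescription":
            # Only direct <name> children, so feature names are not picked up.
            for child in elem:
                if local(child.tag) == "name" and child.text:
                    declared.add(child.text.strip())
        elif name == "outputs":
            for child in elem:
                if local(child.tag) == "type" and child.text:
                    outputs.add(child.text.strip())
    return sorted(outputs - declared)


# Simplified, hypothetical descriptor fragment used only for illustration.
SAMPLE = """<taeDescription>
  <analysisEngineMetaData>
    <typeSystemDescription>
      <types>
        <typeDescription>
          <name>com.example.Part</name>
          <supertypeName>uima.tcas.Annotation</supertypeName>
        </typeDescription>
      </types>
    </typeSystemDescription>
    <capabilities>
      <capability>
        <outputs>
          <type>com.example.Part</type>
          <type>com.example.Action</type>
        </outputs>
      </capability>
    </capabilities>
  </analysisEngineMetaData>
</taeDescription>"""

if __name__ == "__main__":
    print(undeclared_output_types(SAMPLE))
```

An empty result means that every declared output type was found in the descriptor's own type system.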
| | |
 |  This is merely a limitation of the current version and has no actual effect. Even if this error occurs, you will still be able to work with Text Analysis Perspective.
| | |
 |  Check whether the expected terms are shown in the "inflections" section of the dictionary editor. For example, the country "Vietnam" does not have any inflections by default, so the dictionary look-up will not find the country in a document that contains "vietnam". If you created the dictionary based on frequent terms, make sure that the dictionary name does not contain blanks. If it does, create a new dictionary and import the entries from the existing dictionary, using the import function of the dictionary editor.
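The "Vietnam" example can be pictured with a toy model in Python. The sketch assumes that a look-up matches only the exact surface forms known to the dictionary (the base term plus its listed inflections); the real Dictionary Look-up analysis engine may work differently, and the names `build_index` and `look_up` are hypothetical.

```python
def build_index(dictionary):
    """Map every known surface form (term or inflection) to its base term."""
    index = {}
    for term, inflections in dictionary.items():
        for surface in [term, *inflections]:
            index[surface] = term
    return index


def look_up(tokens, index):
    """Return the base terms for all tokens that exactly match a surface form."""
    return [index[token] for token in tokens if token in index]


if __name__ == "__main__":
    tokens = "shipped from vietnam".split()

    # Without inflections, only the exact form "Vietnam" is known.
    print(look_up(tokens, build_index({"Vietnam": []})))

    # Adding "vietnam" as an inflection makes the lowercase form match.
    print(look_up(tokens, build_index({"Vietnam": ["vietnam"]})))
```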
| | |
 |  These language codes are used by the Dictionary Look-up analysis engine to correctly determine inflections of terms in a dictionary. The "XX" means that no specific language variant (for example, British English) will be used when processing the document; instead, a general language dictionary that contains inflections from several variants (such as both British English and American English) will be used.
| | |
 |  Most likely, a previous DB2 Design Studio session exited abnormally and left the search index in an inconsistent state. To restore the index, follow these steps:
- Check the Design Studio log for a message such as Lock obtain timed out: SimpleFSLock@<path to a text analysis project>\resources\.lucene\write.lock.
- Write down the path to the write.lock file.
- Close DB2 Design Studio.
- Delete the write.lock file.
- Restart DB2 Design Studio.
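If several projects are affected, locating the lock files by hand can be tedious. The following Python sketch searches a workspace for them, assuming that each text analysis project is a direct subdirectory of the workspace and that the lock file sits under resources/.lucene/ as in the log message above. The function names are hypothetical, and the script should be run only while Design Studio is closed.

```python
from pathlib import Path


def find_lucene_locks(workspace):
    """Return all <project>/resources/.lucene/write.lock files in the workspace."""
    return sorted(Path(workspace).glob("*/resources/.lucene/write.lock"))


def delete_locks(lock_files):
    """Delete the given lock files and return how many were removed."""
    for lock in lock_files:
        lock.unlink()
    return len(lock_files)


if __name__ == "__main__":
    import sys

    # Usage: python clean_locks.py <path to workspace>
    locks = find_lucene_locks(sys.argv[1])
    for lock in locks:
        print("stale lock:", lock)
    # Deletion is left as an explicit step: call delete_locks(locks)
    # after reviewing the list and closing Design Studio.
```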
| | |
 |  Text Analysis Perspective opens your search result document in the system editor that your operating system defines for that document type. The error occurs if no such editor is defined. Use Windows Explorer to associate a suitable editor with the document type, and try again.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
IBM and DB2 are trademarks of IBM Corporation in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
For platform(s):
Win32, Windows, Windows XP
For topics:
analysis, Data Analysis, data mining, Eclipse, Java technology, Natural Language, semantics, UIMA, utilities