ToXgene
A template-based generator for complex, semantically-correlated collections of XML documents.
Date Posted: November 16, 2001
Update: June 10, 2005
Version 2.3: ToXgene has been completely redesigned and can now be accessed from within any Java application through an API.
What is ToXgene?
ToXgene is a template-based generator for complex, semantically-correlated collections of synthetic XML documents. ToXgene provides a declarative way of generating realistic synthetic data in which the user specifies what data is needed (using a template) without having to worry about how to generate that data. This tool is intended for cases in which the structure of the data to be generated is known and the data is required to conform to that structure. ToXgene provides APIs that allow it to be invoked from any Java™ application as well as to use other domain-specific data generators.
How does it work?
The main features of ToXgene are as follows:
- Generation of complex XML content: ToXgene supports all XML element content models (CDATA, element, and mixed) and allows the generation of attributes as well. CDATA values are generated according to a type declaration; various string, numeric, and date types are supported.
- Use of skewed distributions: The user can specify skewed distributions to determine the number of occurrences for elements as well as to control the generation of CDATA literals (such as the length of string values). ToXgene supports the uniform, exponential, normal, log-normal, geometric, and user-defined multinomial distributions.
- Element sharing: ToXgene allows different elements (or attributes) to share CDATA literals, thus allowing the generation of references among elements in the same (or in different) documents. This enables the generation of collections of correlated documents (that is, documents that can be joined by value).
- Integrity constraints: Element sharing in ToXgene is achieved by generating the shared content before the actual documents and storing it in memory, in lists (tox-lists). ToXgene allows the specification of most common integrity constraints (such as uniqueness) over the data in these lists; thus, one can generate consistent ID, IDREF, and IDREFS attributes. One can also specify integrity constraints over elements (or attributes) in different documents, which allows the generation of consistent single- or multi-document data sets.
- Reuse of existing data: ToXgene allows the user to load existing data into tox-lists; such data is treated like any other shared data in the generation process. This allows real data (such as names of countries) to be mixed with synthetic data, as common benchmarks often require, and allows existing collections of documents to be expanded without starting from scratch.
- Extensibility: ToXgene was developed in Java 2 and has a very simple interface for plugging in new CDATA generators. For convenience, we provide the source code for the CDATA generators that ship with ToXgene.
- Scalability: If necessary, ToXgene can use a Persistent Object Manager for storing temporary data structures that do not fit in main memory. One can customize the buffer management and take advantage of parallel I/O for optimal performance.
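To make the idea of skewed distributions more concrete, the following is a minimal Java sketch of how occurrence counts or string lengths could be drawn from a geometric, an exponential, or a user-defined weighted distribution. It is an illustration only and is not ToXgene's API: the class name `OccurrenceSampler`, its methods, and the clamping to a `[min, max]` range are all assumptions made for this example.

```java
import java.util.Random;

/**
 * Hypothetical helper (not part of ToXgene): draws integer counts,
 * such as the number of child elements, from skewed distributions.
 */
final class OccurrenceSampler {
    private final Random rng;

    OccurrenceSampler(long seed) {
        this.rng = new Random(seed);
    }

    /** Geometric draw on {1, 2, ...} with success probability p in (0, 1). */
    int geometric(double p, int min, int max) {
        double u = rng.nextDouble(); // uniform in [0, 1)
        // Inverse-transform sampling: number of trials until the first success.
        int k = 1 + (int) Math.floor(Math.log(1.0 - u) / Math.log(1.0 - p));
        return clamp(k, min, max);
    }

    /** Exponential draw with the given mean, rounded to the nearest integer. */
    int exponential(double mean, int min, int max) {
        double u = rng.nextDouble();
        double x = -mean * Math.log(1.0 - u);
        return clamp((int) Math.round(x), min, max);
    }

    /** Picks an index with probability proportional to its (non-negative) weight. */
    int multinomial(double[] weights) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        double r = rng.nextDouble() * total;
        double cumulative = 0.0;
        for (int i = 0; i < weights.length; i++) {
            cumulative += weights[i];
            if (r < cumulative) {
                return i;
            }
        }
        return weights.length - 1; // guard against floating-point rounding
    }

    // Clamping piles probability mass on the bounds; a real generator
    // might resample instead. This is a simplification for the sketch.
    private static int clamp(int value, int min, int max) {
        return Math.max(min, Math.min(max, value));
    }

    public static void main(String[] args) {
        OccurrenceSampler sampler = new OccurrenceSampler(42L);
        for (int i = 0; i < 5; i++) {
            System.out.println("children=" + sampler.geometric(0.3, 1, 10)
                    + " length=" + sampler.exponential(8.0, 1, 40)
                    + " category=" + sampler.multinomial(new double[] {0.7, 0.2, 0.1}));
        }
    }
}
```

A fixed seed keeps the generated counts reproducible across runs, which is usually desirable when synthetic data is used for benchmarking.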
ToXgene allows the reuse of previously generated content. Thus, an existing collection of documents can be expanded with its consistency maintained, rather than regenerated from scratch. Moreover, ToXgene can mix real and synthetic data during generation, which many practical situations require (for example, using real names of countries or provinces).
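The combination of shared lists, uniqueness constraints, and mixed real and synthetic data described above can be sketched in plain Java as follows. This is not ToXgene code and does not use its template language or API: the class `SharedListSketch`, its methods, and the `countries`/`cities` document shapes are hypothetical. The sketch builds one list of unique values first, then generates two documents that both draw from it, so the second document can be joined to the first by value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Hypothetical illustration (not ToXgene's API) of value sharing across documents. */
final class SharedListSketch {

    /**
     * Builds a list of unique values: the real values first, then synthetic
     * values until the list holds at least {@code size} entries.
     */
    static List<String> buildUniqueList(List<String> realValues, int size, Random rng) {
        Set<String> unique = new LinkedHashSet<>(realValues); // set enforces uniqueness
        while (unique.size() < size) {
            // Duplicates are silently rejected by the set, so the loop retries.
            unique.add("synthetic-" + rng.nextInt(Integer.MAX_VALUE));
        }
        return new ArrayList<>(unique);
    }

    /** First document: one element per shared value. Values are assumed to need no XML escaping. */
    static String countriesDocument(List<String> names) {
        StringBuilder sb = new StringBuilder("<countries>\n");
        for (int i = 0; i < names.size(); i++) {
            sb.append("  <country id=\"c").append(i).append("\">")
              .append(names.get(i)).append("</country>\n");
        }
        return sb.append("</countries>\n").toString();
    }

    /** Second document: every city refers to a value taken from the same shared list. */
    static String citiesDocument(List<String> countryNames, int cityCount, Random rng) {
        StringBuilder sb = new StringBuilder("<cities>\n");
        for (int i = 0; i < cityCount; i++) {
            String country = countryNames.get(rng.nextInt(countryNames.size()));
            sb.append("  <city name=\"city-").append(i)
              .append("\" country=\"").append(country).append("\"/>\n");
        }
        return sb.append("</cities>\n").toString();
    }

    public static void main(String[] args) {
        Random rng = new Random(7L);
        List<String> real = Arrays.asList("Canada", "Brazil"); // stands in for loaded real data
        List<String> shared = buildUniqueList(real, 5, rng);
        System.out.print(countriesDocument(shared));
        System.out.print(citiesDocument(shared, 3, rng));
    }
}
```

Because the shared list is built before either document, every `country` attribute in the second document is guaranteed to match an element in the first, which is the property that makes the two documents joinable.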
About the technology author(s):
Denilson Barbosa has been an Assistant Professor in the Department of Computer Science at the University of Calgary and a CAS Visiting Faculty since January 2005. His main research interests are the management of semistructured data (with emphasis on storage and indexing mechanisms), database and knowledge management systems, data compression, and parallel computing. Dr. Barbosa is a member of the ToX (Toronto XML Engine) project, under development at the Database Group of the Department of Computer Science at the University of Toronto.
Alberto Mendelzon is a professor in the Department of Computer Science at the University of Toronto, which he joined in 1980. His research interests are in databases and knowledge bases, including database design theory, query languages, data warehousing and OLAP, database visualization, query processing, belief revision and knowledge base update, and global information systems. Dr. Mendelzon's recent research projects include the WebSQL and WebOQL query languages for the WWW, TOPIC (the Toronto Page Influence Computation), and ToX (the Toronto XML Engine).
John Keenleyside is a senior software developer at the IBM Toronto Laboratory. He began his career at IBM in 1988 developing the C run time for IBM's compiler on OS/2. Later, he worked on the optimizers and code generators for C/C++ compilers targeting Windows® and AIX®. Recently, Mr. Keenleyside joined the DB2 Advanced Technology and Performance team, where he is exploiting and improving the performance of new technologies such as XML, IA64, and next-generation fabrics.