Advanced Pattern Search Toolkit for Sequential Data
A group of tools that allow users to search for complex patterns in sequential data such as the genome, protein sequences, text, and times series data.
Date Posted: March 16, 2004
|
|
 |
 |
|
 |  Given a large DNA contig, an ordered series of markers (smaller sequences that are to be found in the DNA), and a set of parameters that define the boundaries of the search (see below), the gene-finding tool uses substitutive (inexact) matching to produce the complete set of occurances of the markers (or their tolerable substitutes), in the given order, in the DNA contig.
| | |
 |  Given a large target sequence/text, an ordered series of markers (patterns that are to be found in target sequence), and a set of parameters that define the boundaries of the search (see below), the tool searches for those occurances in the target sequence where all the marker instances occur in the order that was provided and satisfying the other constraints.
| | |
 |  No, the toolkit is not a Web-based application. The downloadable file contains the set of JARs that contain the tool's core engine classes and a couple of stand-alone applications that demonstrate the use of the tool. The user may, however, integrate the tool as a part of a larger framework using the APIs presented. The tool may be converted to a Web-based application, such as a Web service, by writing wrapping client code over the tool.
| | |
 |  Yes. As mentioned above, the tools in the toolkit may be integrated and used as a part of a larger framework. Please refer to the documents "Gene Finding using APaS" and "Generic Pattern Finding using APaS," which are included in the toolkit.
| | |
 |  The gene-finding tool accepts data formatted as an XML string. The schema to which the input data must conform, and to which the output XML string will conform, are distributed with the download, along with example input and output files. The meaning of the individual parameters/elements in the XML document is given below. For more details, please refer to the white paper included in the toolkit.
| | |
 |  A marker is a recurring pattern that could be used to identify certain regions of a sequence. For example, certain motifs such as TATA box, etc., in a DNA contig indicate the presense of a downstream gene. Thus TATA box motifs can be used as markers to detect gene-coding regions in a DNA contig.
| | |
 |  The marker score associated with each marker element represents the maximum cumulative mismatch that can be had by any substring of the given DNA contig claiming to be an inexact match. Substrings with greater mismatches than the one mentioned for this marker are immediately filtered out. For filtering relative to the best match, please see "What does the Threshold parameter signify?"
| | |
 |  A profile is basically a table of scores that define the scope of inexact/substitutive matches to be permitted by the tool against a marker. The profile associated with the marker of length l constructed using an alphabet set of k characters is a matrix of dimensions k x l.
For example: Given a marker of length 10, acctagcgcg can be constructed using the alphabet set Σ={a,c,t,g}.
The associated profile, given the following
|Σ| = k = 4
and
l = 10,
What the matrix essentially represents is a series of floating-point mismatch score values to be associated with inexact matching when trying to compare a substring of length l of the DNA with the marker. It may be noted that the sum of any column is always 1. This is because each of the mismatch scores is a probability value. The significance of a mismatch score is addressed below.
The columns represent the spatial positions, while the rows represent the mismatch to be associated with the occurrence of that row's alphabet, if it is found to occur in the "column" positions in the substring.
For example, consider location (a,4). The mismatch score of 0.3 in that position indicates to the engine that if the alphabet(nucleotide) a is spotted in the fourth position of the substring (instead of t, which is the fourth nucleotide in the marker), then we add a mimatch of 0.3 to the cumulative score of the substring under consideration. The engine does not look into this profile table if it scans an exact match. It is only for mismatches that a look-up is made and a score associated with the mismatch.
The reader would have noted that the location (t,4) has a mismatch value of 0.3 associated with it as well! Although the score here should intuitively be 0.0 (exact match -- no mismatch score), the non-zero value in this position allows the user to manipulate the sum of the probabilities. If, for example, the user wants to associate no mismatch score with the fourth position, regardless of what alphabet is spotted, then the fourth column's entry would be (0.0, 0.0, 1.0, 0.0) instead of the current (0.3, 0.0, 0.3, 0.4). Since the mismatch data for t is not looked up at all (t in the fourth position would indicate a perfect match against the specified marker), the arrangement specified would also satisfy the sum of probabilities = 1 criterion.
| | |
 |  The profile data representative of a marker is provided so that user can define what the user means by inexact match. If correctly used, it can act as a powerful navigator that can intelligently find exactly those markers in the target DNA contig that the user wants to find.
Profile data can be generated based on what the user wants to mark as similar to a particular marker. The user may take a set of pattern strings that he considers equivalent to those he finds useful, and then, based on the distribution of nucleotides in the various markers, assign mismatch values to the marker. Any one pattern of the set, or the most frequently-occuring string, may be presented to the engine as the marker.
| | |
 |  As mentioned above, each location in a profile entry is used as a floating-point fraction.
| | |
 |  The input XML Schema is given in the download. A sample input profile matrix would be as follows:
<nucltd_a>0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1</nucltd_a>
<nucltd_c>0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2</nucltd_c>
<nucltd_t>0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3</nucltd_t>
<nucltd_g>0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4</nucltd_g>
Note that the data above is given only for illustrative purposes. Meaningful profile data wouldn't look like this, though this data is valid (since 0.1 + 0.2 + 0.3 + 0.4 = 1.0). The profile matrix is specified as a sequence of 4 rows: All child elements of the <Marker> parent. Individual floating point scores are separated from the next position's floating point score of the same nucleotide by a blank space (" ").
| | |
 |  The InterMarker Distance element specifies the distance within which any two adjacent markers of the ordered set must occur in a valid path. Since the engine attempts to find connections between the ordered set of markers, this parameter regulates the distance spread permitted between markeri and markeri+1.
| | |
 |  In addition to the score used to filter out markers that do not fall within a particular mismatch score, the Threshold parameter allows the user to optimize his search by specifying a more relative scoring parameter that allows him to perform filtering based on the kind of matches he gets at run time. The threshold is a floating-point factor that states the deviation from the score of the best marker that the search should consider.
For example, let a set of possible markers {V} be found corresponding to marker v, in the target DNA. If vmax is the score of the best match in {V} (that is, vmax is the minimum penalty), then the Threshold paramter (T) specifies that:
All the markers in {V} whose mismatch scores are greater than vmax * (1+T), are to be discarded.
This lets the user search only within a certain set of good matches. As a trivial case, if the user wants only the best match to be considered for a particular marker, but he doesn't know how good a match this target DNA might contain, he specifies a very liberal Score value for the marker (which means that a lot of substrings in the parent DNA will be shortlisted as markers), but a very stringent Threshold factor (0.0 in case of best match only). This will eliminate all but the best match.
| | |
 |  The end-to-end length parameter gives the user control over what must be the maximum spread from the first marker to the last marker in the ordered set in any found path. All found paths are eliminated if their difference in position from the first character of the first marker to the first character of the last marker is greater than specifed.
| | |
 |  Substitutive matching takes care of situations where the equivalent codon has been used rather than the one that has been frequently found (this might occur if the target organism experiences different environmental conditions and the codon bias is varied). Codons are defined as equivalent if they code for the same amino acid in the translation phase.
| |
|
|
 |
|
| |