| 
 | NeoBio API | ||||||||
| PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
| SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD | ||||||||
java.lang.Object
  |
  +--neobio.alignment.PairwiseAlignmentAlgorithm
        |
        +--neobio.alignment.CrochemoreLandauZivUkelson
This abstract class is the superclass of both global and local sequence alignment algorithms (with linear gap penalty function) due to Maxime Crochemore, Gad Landau and Michal Ziv-Ukelson (2002).
This implementation derives from the paper of M.Crochemore, G.Landau and M.Ziv-Ukelson, A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrices (available here as PDF or Postscript).
It employs Lempel-Ziv compression techniques to speed up the classic dynamic programmin approach to sequence alignment (see NeedlemanWunsch and SmithWaterman classes). It reduces the time and space complexity from O(n2) to O(n2/log n). In fact, the complexity is actually O(h n2/log n), where 0 <= h <= 1 is a real number denoting the entropy of the text (a measure of how compressible it is). This means that, the more compressible a sequence is, the less memory the algorithm will require, and the faster it will run.
The idea behind this improvement is to identify repetitions in the sequences and reuse the computation of their alignments. The first step is, therefore, to parse the sequences into LZ78-like factors. LZ78 is a popular compression algorithm of the Lempel-Ziv familiy due to J.Ziv and A.Lempel (1978). This factorisation is accomplished by the FactorSequence class (for more information about this factorisation, please refer to the specification of that class) that builds a doubly-linked list of factors. Each factor is an instance of the Factor class (refer to the specification of this class for more information).
Once the sequences have been parsed, the algorithm builds a matrix of blocks, called
 block table, that is vaguely similar to the dynamic programming matrix used by both
 NeedlemanWunsch and SmithWaterman. Each block contains an
 instance of an AlignmentBlock (please refer to its specification for more
 information on what information is stored) and represents the alignment beteween one
 factor of each sequence. This block table is, in fact, a partition of the alignment
 graph.
Consider a block B which corresponds to the alignment of factor F1 = xa from sequence S1 and factor F2 = yb from sequence S2. Here, F1 extends a previous factor of S1 with character a, while F2 extends a previous factor of S2 with character b. We can define the input border of B as the set of values at the left and top borders of block B, and the output border as the set of values at the right and bottom borders of B. Moreover, we can define the following prefix blocks of B:
Note that each factor has a pointer to its prefix factor, called ancestor (see the
 specification of the Factor class). This pointer makes it easy to retrieve
 any of the three prefix blocks of B in constant time.
Rather than computing each value of the alignment block B, the algorithm will only compute the values on its input and output borders. This is precisely what makes it more efficient.
In this class there is a general specification of how the block table is computed
 (see the computeBlockTable method for details). The actual
 method depends on the subclasses. In general, there are two phases:
In fact, for each block, only one column of the DIST matrix needs to be computed,
 all other columns are actually retrieved from its prefix blocks. This is precisely what
 is accomplished by the assembleDistMatrix method found in
 this class (it is general enough for both global and local alignment versions of the
 algorithm.
From the DIST matrix, we obtain the OUT matrix defined as
 OUT[i,j] = I[i] + DIST[i,j] where I is the input border array. This means
 that the OUT matrix is the DIST matrix updated by the input border of a block. The
 output border is then computed from the OUT matrix by taking the maximum value of each
 column. This class also have a general method for assembling the input border (see
 assembleInputBorder
The OUT matrix is encoded by the OutMatrix class that takes as both a DIST matrix and an input border array. Note that it does not compute the OUT matrix, it just stores the necessary information to retrieve a value at any position of the matrix.
A naive approach to compute the output border of a block from the OUT matrix of size n x n would take a time proportional to n2. However, it happens that, due to the nature of the DIST matrix, both DIST and OUT matrices are Monge arrays, which implies that they are also totally monotone. This property allows the computation of the output border of B in linear time with the SMAWK algorithm (see the specification of the Smawk class for more information on SMAWK).
This class contains a general specification that is pertinent to both global and local versions of the algorithm. For more information on each version of, please refer to the appropriate subclass.
A note about the performance of these algorithms. Although theoretical results suggest that these algorithms are faster and consume less memory than the classical methods, in practice it is hard to realise their performance gains.
These algorithms are extremely complex and require the storage of many extra
 pointers and other auxiliary data for each block (see the AlignmentBlock
 class for more details). Hence, even though the space requirement is
 O(n2/log n), which is less than O(n2), in practice, for most of
 the cases these algorithms will take more time and memory space than their clasical
 counterparts (we have to keep in mind that the Big Oh notation ignores all constants
 involved).
Therefore, in order to realise the full power of these algorithms, they have to be used with extremly large and redundant sequences. This will allow a proper reutilisation of the computations and, maybe, provide an improvement in terms of space and run time. For instance, it is easy to devise such a sequence if we use a one-character alphabet because, in this case, a sequence is factorised into a series of factors that are a prefix of the next.
CrochemoreLandauZivUkelsonGlobalAlignment, 
CrochemoreLandauZivUkelsonLocalAlignment, 
NeedlemanWunsch, 
SmithWaterman, 
FactorSequence, 
AlignmentBlock, 
OutMatrix, 
Smawk, 
computeBlockTable(), 
assembleDistMatrix(neobio.alignment.AlignmentBlock, int, int, int, int)| Field Summary | |
| protected  AlignmentBlock[][] | block_tableThe block table, which is a matrix of alignment blocks where each block represents the alignment between one factor of each sequence. | 
| protected static byte | DIAGONAL_DIRECTIONA constant that indicates that the diagonal direction must be followed to reach the source of an optimal path in a block during the trace back procedure to retrieve a high scoring alignment. | 
| protected static byte | LEFT_DIRECTIONA constant that indicates that the left direction must be followed to reach the source of an optimal path in a block during the trace back procedure to retrieve a high scoring alignment. | 
| protected  int | num_colsNumber of columns of the block table. | 
| protected  int | num_rowsNumber of rows of the block table. | 
| protected  OutMatrix | out_matrixAn instance of the OutMatrixclass that encodes the OUT matrix of a
 block when supplied with the DIST matrix and the input border array of a block. | 
| protected  FactorSequence | seq1The first factorised sequence being aligned. | 
| protected  FactorSequence | seq2The second factorised sequence being aligned. | 
| protected  Smawk | smawkAn instance of the Smawkclass that implements the SMAWK algorithm to
 compute the column maxima of a totally monotone matrix. | 
| protected static byte | STOP_DIRECTIONA constant that indicates that the source of an optimal path has been reached in a block and that the trace back procedure to retrieve a high scoring alignment can stop. | 
| protected static byte | TOP_DIRECTIONA constant that indicates that the top direction must be followed to reach the source of an optimal path in a block during the trace back procedure to retrieve a high scoring alignment. | 
| Fields inherited from class neobio.alignment.PairwiseAlignmentAlgorithm | 
| alignment, APPROXIMATE_MATCH_TAG, GAP_CHARACTER, GAP_TAG, MATCH_TAG, MISMATCH_TAG, score, score_computed, scoring, sequences_loaded, use_match_tag | 
| Constructor Summary | |
| CrochemoreLandauZivUkelson() | |
| Method Summary | |
| protected  int[][] | assembleDistMatrix(AlignmentBlock block,
                                     int dim,
                                     int r,
                                     int c,
                                     int lc)Assembles the DIST matrix of a block by retrieving the DIST columns of its prefix blocks. | 
| protected  int[] | assembleInputBorder(int dim,
                                       int r,
                                       int c,
                                       int lr)Assembles the input border of a block by retrieving the values at the output borders of the left and top blocks. | 
| protected abstract  PairwiseAlignment | buildOptimalAlignment()Retrieves an optimal alignment between the loaded sequences. | 
| protected  void | computeBlockTable()Computes the block table. | 
| protected  PairwiseAlignment | computePairwiseAlignment()Computes the block table (the result depends on subclasses, see computeBlockTablefor details) and requests subclasses to retrieve an
 optimal alignment between the loaded sequences. | 
| protected  int | computeScore()Computes the block table (the result depends on subclasses, see computeBlockTablefor details) and requests subclasses to locate the
 score of the highest scoring alignment between the two sequences in the block
 table. | 
| protected abstract  AlignmentBlock | createBlock(Factor factor1,
                       Factor factor2,
                       int row,
                       int col)Computes a block of the block table, which corresponds to an alignment block between one factor of sequence 1 and one factor of sequence 2. | 
| protected abstract  AlignmentBlock | createFirstColumnBlock(Factor factor1,
                                             Factor factor2,
                                             int row)Computes a block at the first column (column zero) of the block table, which corresponds to an alignment block between one factor of sequence 1 and an empty string. | 
| protected abstract  AlignmentBlock | createFirstRowBlock(Factor factor1,
                                       Factor factor2,
                                       int col)Computes a block at the first row (row zero) of the block table, which corresponds to an alignment block between one factor of sequence 2 and an empty string. | 
| protected abstract  AlignmentBlock | createRootBlock(Factor factor1,
                               Factor factor2)Computes the root block of the block table. | 
| protected  AlignmentBlock | getDiagonalPrefix(AlignmentBlock block)This method is a shorthand to retrieve the diagonal prefix of a block from the block table. | 
| protected  AlignmentBlock | getLeftPrefix(AlignmentBlock block)This method is a shorthand to retrieve the left prefix of a block from the block table. | 
| protected  AlignmentBlock | getTopPrefix(AlignmentBlock block)This method is a shorthand to retrieve the top prefix of a block from the block table. | 
| protected  void | loadSequencesInternal(java.io.Reader input1,
                                           java.io.Reader input2)Loads sequences into FactorSequenceinstances. | 
| protected abstract  int | locateScore()Locates the score of the highest scoring alignment between the two sequences in the block table after is thas been computed. | 
| protected  void | traverseBlock(AlignmentBlock block,
                           int source,
                           java.lang.StringBuffer gapped_seq1,
                           java.lang.StringBuffer tag_line,
                           java.lang.StringBuffer gapped_seq2)Traverses a block to retrieve a part of an optimal alignment from the specified source in the output border to an entry in the input border. | 
| protected  void | unloadSequencesInternal()Frees pointers to loaded sequences and the the block table so that their data can be garbage collected. | 
| Methods inherited from class neobio.alignment.PairwiseAlignmentAlgorithm | 
| getPairwiseAlignment, getScore, loadSequences, max, max, max, scoreDeletion, scoreInsertion, scoreSubstitution, setScoringScheme, unloadSequences, useMatchTag | 
| Methods inherited from class java.lang.Object | 
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait | 
| Field Detail | 
protected static final byte STOP_DIRECTION
protected static final byte LEFT_DIRECTION
protected static final byte DIAGONAL_DIRECTION
protected static final byte TOP_DIRECTION
protected FactorSequence seq1
protected FactorSequence seq2
protected AlignmentBlock[][] block_table
protected int num_rows
protected int num_cols
protected Smawk smawk
Smawk class that implements the SMAWK algorithm to
 compute the column maxima of a totally monotone matrix. It is used to speed up the
 computation of the output border of a block.
protected OutMatrix out_matrix
OutMatrix class that encodes the OUT matrix of a
 block when supplied with the DIST matrix and the input border array of a block.
 Note that it does not compute the OUT matrix itselft, it just stores the necessary
 information to retrieve a value at any position of the matrix.
 This object is then used to compute the output border of a block with the
 Smawk class. Note that the OutMatrix class implements the
 Matrix interface as required by the Smawk class.
Matrix, 
Smawk| Constructor Detail | 
public CrochemoreLandauZivUkelson()
| Method Detail | 
protected void loadSequencesInternal(java.io.Reader input1,
                                     java.io.Reader input2)
                              throws java.io.IOException,
                                     InvalidSequenceException
FactorSequence instances. In case of any error,
 an exception is raised by the constructor of FactorSequence (please
 check the specification of that class for specific requirements).
 A FactorSequence is an LZ78-like factorisation of the sequences
 being aligned.
loadSequencesInternal in class PairwiseAlignmentAlgorithminput1 - Input for first sequenceinput2 - Input for second sequence
java.io.IOException - If an I/O error occurs when reading the sequences
InvalidSequenceException - If the sequences are not validFactorSequenceprotected void unloadSequencesInternal()
unloadSequencesInternal in class PairwiseAlignmentAlgorithmPairwiseAlignmentAlgorithm.unloadSequences()
protected PairwiseAlignment computePairwiseAlignment()
                                              throws IncompatibleScoringSchemeException
computeBlockTable for details) and requests subclasses to retrieve an
 optimal alignment between the loaded sequences. The actual product depends on the
 subclasses which can produce a global (see
 CrochemoreLandauZivUkelsonGlobalAlignment) or local alignment (see
 CrochemoreLandauZivUkelsonLocalAlignment).
 Subclasses are required to implement the buildOptimalAlignment
 abstract method defined by this class according to their own method.
computePairwiseAlignment in class PairwiseAlignmentAlgorithmIncompatibleScoringSchemeException - If the scoring scheme is not compatible
 with the loaded sequences.CrochemoreLandauZivUkelsonGlobalAlignment, 
CrochemoreLandauZivUkelsonLocalAlignment, 
computeBlockTable(), 
buildOptimalAlignment()
protected int computeScore()
                    throws IncompatibleScoringSchemeException
computeBlockTable for details) and requests subclasses to locate the
 score of the highest scoring alignment between the two sequences in the block
 table. The result depends on the subclasses, and either a global alignment
 (see CrochemoreLandauZivUkelsonGlobalAlignment) or local alignment
 score (see CrochemoreLandauZivUkelsonLocalAlignment) will be produced.
 Subclasses are required to implement the locateScore abstract
 method defined by this class according to their own method.
Note that this method calculates the similarity value only (it doesn't trace back into the block table to retrieve the alignment itself).
computeScore in class PairwiseAlignmentAlgorithmIncompatibleScoringSchemeException - If the scoring scheme is not compatible
 with the loaded sequences.CrochemoreLandauZivUkelsonGlobalAlignment, 
CrochemoreLandauZivUkelsonLocalAlignment, 
locateScore()
protected void computeBlockTable()
                          throws IncompatibleScoringSchemeException
There are four different cases that defines four abstract methods in this class, which subclasses must implement:
Note that each factor has a serial number which indicates its order in the list of factors of a sequence. This number will match with the row and column index of a block in the block table. For instance, if a block has factors F1 and F2 with serial numbers 12 and 53, respectively, this means that this block is found at row 12, column 53 of the block table.
IncompatibleScoringSchemeException - If the scoring scheme is not compatible
 with the loaded sequences.createRootBlock(neobio.alignment.Factor, neobio.alignment.Factor), 
createFirstColumnBlock(neobio.alignment.Factor, neobio.alignment.Factor, int), 
createFirstRowBlock(neobio.alignment.Factor, neobio.alignment.Factor, int), 
createBlock(neobio.alignment.Factor, neobio.alignment.Factor, int, int)
protected int[][] assembleDistMatrix(AlignmentBlock block,
                                     int dim,
                                     int r,
                                     int c,
                                     int lc)
block - the block for which the DIST matrix is neededdim - the dimension of the DIST matrixr - the row index of this block in the block tablec - the column index of this block in the block tablelc - the number of columns of the alignment block
protected int[] assembleInputBorder(int dim,
                                    int r,
                                    int c,
                                    int lr)
ArrayIndexOutOfBoundsException.
dim - dimension of the input borderr - row index of the block in the block tablec - column index of the block in the block tablelr - number of row of the block
protected void traverseBlock(AlignmentBlock block,
                             int source,
                             java.lang.StringBuffer gapped_seq1,
                             java.lang.StringBuffer tag_line,
                             java.lang.StringBuffer gapped_seq2)
                      throws IncompatibleScoringSchemeException
block - the block to be traversedsource - the source of the path in the output bordergapped_seq1 - the StringBuffer to where the gapped sequence 1 is written totag_line - the StringBuffer to where the tag_line is written togapped_seq2 - the StringBuffer to where the gapped sequence 2 is written to
IncompatibleScoringSchemeException - If the scoring scheme is not compatible
 with the loaded sequences.protected AlignmentBlock getLeftPrefix(AlignmentBlock block)
block - the block
protected AlignmentBlock getDiagonalPrefix(AlignmentBlock block)
block - the block
protected AlignmentBlock getTopPrefix(AlignmentBlock block)
block - the block
protected abstract AlignmentBlock createRootBlock(Factor factor1,
                                                  Factor factor2)
                                           throws IncompatibleScoringSchemeException
factor1 - the factor of the first sequence being alignedfactor2 - the factor of the second sequence being aligned
IncompatibleScoringSchemeException - If the scoring scheme is not compatible
 with the loaded sequences.
protected abstract AlignmentBlock createFirstRowBlock(Factor factor1,
                                                      Factor factor2,
                                                      int col)
                                               throws IncompatibleScoringSchemeException
factor1 - the factor of the first sequence being alignedfactor2 - the factor of the second sequence being alignedcol - the column index of block in the block table
IncompatibleScoringSchemeException - If the scoring scheme is not compatible
 with the loaded sequences.
protected abstract AlignmentBlock createFirstColumnBlock(Factor factor1,
                                                         Factor factor2,
                                                         int row)
                                                  throws IncompatibleScoringSchemeException
factor1 - the factor of the first sequence being alignedfactor2 - the factor of the second sequence being alignedrow - the row index of block in the block table
IncompatibleScoringSchemeException - If the scoring scheme is not compatible
 with the loaded sequences.
protected abstract AlignmentBlock createBlock(Factor factor1,
                                              Factor factor2,
                                              int row,
                                              int col)
                                       throws IncompatibleScoringSchemeException
factor1 - the factor of the first sequence being alignedfactor2 - the factor of the second sequence being alignedrow - the row index of block in the block tablecol - the column index of block in the block table
IncompatibleScoringSchemeException - If the scoring scheme is not compatible
 with the loaded sequences.
protected abstract PairwiseAlignment buildOptimalAlignment()
                                                    throws IncompatibleScoringSchemeException
IncompatibleScoringSchemeException - If the scoring scheme is not compatible
 with the loaded sequences.protected abstract int locateScore()
IncompatibleScoringSchemeException - If the scoring scheme is not compatible
 with the loaded sequences.| 
 |  | ||||||||
| PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
| SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD | ||||||||