Biocceleration




	Home

	Products

	News

	Contact Us

PRODUCTS

Products > BioXL/H > Technical Information

BioXL/H - Technical Information

Highlights

BioXL/H TM is a network peripheral device designed to be a high-end solution for running rigorous searches of protein and nucleic acid sequence databases.

BioXL/H consists of computer firmware that has been configured as parallel processing units, optimized to run rigorous Dynamic Programming algorithms. The optimized logical circuitry runs search algorithms at speeds that are three orders of magnitude faster than today's high-end computer workstations. The configuration of the firmware is loaded at the beginning of a search, thereby giving BioXL/H the flexibility to run a variety of algorithms, including Smith-Waterman, ProfileSearch, Translated Searches, Frame+, ProfileFrame+ and HMM.

The speed at which sequence database algorithms run is typically measured in matrix cells per second. In its smallest configuration, a four-board BioXL/H runs Smith-Waterman and ProfileSearch applications at a rate of 3 billion matrix cells per second. An eight-board BioXL/H executes these applications at a speed of 6 billion matrix cells per second. As many as four BioXL/H devices can be combined in a virtual XLH-32 device that will run the same applications at a speed of 24 billion matrix cells per second.

BioXL/H's scaleable architecture allows for easy upgrades of the standard models. Several BioXL/H models are currently offered (the model number corresponds to the number of processing boards): XLH-4, XLH-8, XLH-12, XLH-16, XLH-24, XLH-32.

The user interface to BioXL/H is provided by GenCore TM , Biocceleration's sequence analysis software package.

Hardware Description

Bus Architecture
The processor boards and a 30GB internal hard disk are connected to a PCI bus. The bus is connected to an I/O processor, which in turn is connected to a fast 100Base-T network card. Figure 1 shows the internal configuration of a BioXL/H-8:

:: Figure 1: BioXL/H Box Configuration

Hardware Technology and Operation
The BioXL/H firmware is programmed into XilinxTM Field Programmable Gate Arrays (FPGA's), shown in Figure 2 as modules. The modules on a board are connected in a ring, making the result of each cell update available for the following processor, which needs it according to the algorithm.

Each BioXL/H board contains eight FPGA modules and 128MB of global memory.
Each of the modules is programmed to calculate four matrix cells per clock cycle (for the Smith-Waterman algorithm).
The clock rate of the system is 25-33MHz (programmable).

:: Figure 2: BioXL/H Board Scheme

Each module has its own local memory used to store the search query. The board global memory is used to store currently searched database sequences and search results.

Databases are transferred to BioXL/H through the network interface. Database information is stored in the global memory of processing boards and on the internal disk. The search calculations are performed concurrently with the database transfer into board memory and to the internal disk (the "streaming" mode). Databases that can fit into the combined global memory of all boards remain in the memory for subsequent searches (the "memory" mode). Larger sequence databases, such as GenBank, are read from the internal disk in subsequent searches (the "disk" mode).

At its current full speed, BioXL/H-8 rate of calculation is:
(4 updates/module)x(8 modules/board)x(8 boards)x28.7MHZ = 7.34 billion MCPS

System Requirements
BioXL/H is a network device, which is connected to a local network. The server software for BioXL/H can run on any supported SGI, SUN, Compaq or Intel-compatible host.

BioXL/H requires GenCore 5.0 or above to be installed on the host computer. The following platforms are supported:

All popular SGI servers and workstations running IRIX 6.5 or above
SUN Sparc and Ultra Sparc stations with Solaris 7 or above
All popular Digital/Compaq servers and workstations running Digital/Compaq UNIX (OSF/1) 4.0 D or above
Intel Pentium II compatible or above running any Linux-like OS, kernel version 2.4 or above

Network Configuration and Work Mode
Communication with BioXL/H is established through the host computer running the BioXL/H server process. The server process is responsible for handling job requests submitted by users. A typical network configuration is shown in Figure 3:

The BioXL/H work mode is similar to the work mode of a network printer, on which one job is executed at a time. Jobs submitted to BioXL/H are placed on a queue maintained by the BioXL/H server process. Users can run jobs either in the foreground or in the background as a batch job. When a batch job is completed, a user is notified by e-mail.

GenCore Applications Supported by BioXL/H

Application	Description
sw	sw is the standard Smith-Waterman application. It uses an amino acid query to search an amino acid database or a nucleic acid query to search a nucleic acid database.
tswp	tswp translates its nucleic acid query sequence in six reading frames and compares each frame to the sequences in an amino acid database.
tswn	tswn translates nucleic acid sequences in the database in six reading frames before comparing them to the amino acid query.
xsw	xsw is a Smith-Waterman application with asymmetric gap penalizing. It allows the use of separate gap penalties to penalize gaps opened in the query sequence, while penalties for gaps in a database sequence are treated the same as in sw applications. xsw performs all types of standard and translated searches-nucleic-nucleic, protein-protein, nucleic-protein and protein-nucleic. The query can be a sequence, profile or extended profile.
frame+_n2p	frame+_n2p compares a DNA query to a protein database, with separate treatment of sequencing errors and insertions of complete codons. FramePlus is an extended FrameSearch application based on Biocceleration's proprietary algorithms.
frame+_p2n	frame+_p2n compares a protein query sequence to a DNA database, with separate treatment of sequencing errors and insertions of complete codons. FramePlus is an extended FrameSearch application based on Biocceleration's proprietary algorithms.
profilesearch / hmmsearch	profilesearch / hmmsearch compares a protein profile / HMM to a protein database, or a nucleic profile to a nucleic database.
tprofilesearch	tprofilesearch compares protein profiles to nucleic database sequences translated into six reading frames.
proframe+	proframe+ compares a protein profile to a DNA database, allowing for separate treatment of sequencing errors and insertions of complete codons. proframe+ is based on Biocceleration's proprietary FramePlus algorithm.
profilescan / hmmscan	profilescan / hmmscan compares a protein sequence to a protein profile / HMM database.

:: Figure 3: A typical network configuration

Theoretical Background

Algorithms
BioXL/H algorithms utilize the dynamic programming method of sequence alignment by Needleman & Wuncsh (Journal of Molecular Biology, 48, 443-453, 1970), which was adapted for optimal local alignments by Smith & Waterman (Journal of Molecular Biology, 147, 195-197, 1981), with modifications by Gotoh (Journal of Molecular Biology,162, 705-708, 1982).

In Smith-Waterman database searches, BioXL/H uses the dynamic programming method to compare every database sequence to the query sequence and assign a score to each result. ProfileSearch, developed by Gribskov, et. al. (Proc. Natl. Acad. Sc. USA 84: 4355-4358, 1987), uses the same dynamic programming methodology; however, the query is a profile created from the alignment of multiple sequences.

HMMSearch, developed by Sean Eddy, et. al. (Dept. of Genetics, Washington University School of Medicine, HMMER package, http://genome.wustl.edu/eddy/hmmer.html), uses a more comprehensive profile structure and dynamic programming methodology to produce more sensitive homology measurement on the amino-acid level.

Dynamic Programming Model
The dynamic programming method checks every possible alignment between two given sequences. The two sequences define a matrix in which every cell represents the alignment of two specific positions in the two sequences. The alignments between specific positions are given either positive or negative scores, depending on the residues located in these positions.

The score of alignment in a certain position is then added to the sum of the highest scores of alignment leading to that position. The highest score can sometimes be achieved by adding gaps to the alignment, although introduction of a gap is penalized. After all the cells have been calculated, the position of the highest score is the final aligned position of the best local alignment between the two sequences.

In Figure 4, a sample matrix shows the alignment of sequence X (CAGCCUCGCUUAG, from left to right on the horizontal axis) and Y (AAUGCCAUUGACGG, from top to bottom on the vertical axis). The matrix is taken from Smith & Waterman (Advances in Applied Mathematics, 2, 482-489, 1981).

:: Figure 4: Sample Matrix

Scores in the first row and column are defined as zero. Entries Hi,j in all other cells of the matrix are defined as the score of the best alignment ending in the position matching xi and yj, and are calculated using the following equation (from Waterman and Vingron, Proc. Natl. Acad. Sci. USA, 91, 4625, 1994):

where
s(x i ,y j ) is the score of a match between xi and yj specified in the scoring matrix w is the gap penalty, which is a function of two parameters. If a new gap is opened, w equals (g+k), where g is a penalty for opening a gap, and k is a penalty for extending a gap. If a previously opened gap is extended, w equals k. The total penalty for a gap of length l is g + k*l. This is an "affine scoring penalty," as opposed to a non-affine penalty, for which g is zero.

H i,j equals to the first term in the equation if the maximum score is generated by aligning x i and y j . H i,j equals to the second term if a gap in the sequence Y is extended or opened in the current position, in which case the gap penalty is added to the score in the cell to the left of the current cell. H i,j equals to the third term if a gap in the sequence X is extended or opened in the current position, in which case the gap penalty is added to the score in the cell above the current cell.

Calculation of the best local alignment can therefore be viewed as diagonal movement from the top left cell to the bottom right cell. After scores in all cells are calculated, the highest score in the matrix is located in the final position of the best local alignment.

The sample matrix in Figure 4 was generated using a scoring matrix that assigns 1 to identical matches, and -1/3 to mismatches. The gap open penalty is 1 and the gap extend penalty is 1/3. The alignment leading to the cell with score 3.33 is the best local alignment:

Y: G C C A U U G
X: G C C . U C G

There is one gap in the alignment (represented as "." in sequence X), and one mismatch in the sixth position.

Speed of Calculations
Because calculation of the best local alignment requires calculation of scores in all cells in the matrix, the speed of computing the best alignment can be measured in matrix cell updates per second (MCPS). Software implementations of the Smith-Waterman algorithm, such as the BestFit application from GCG, can run at speeds of about thirty million MCPS on high-end workstations.

BioXL/H-8 can run at approximately six billion MCPS for long sequence queries. A search of all GenBank sequences (the total of 10 billion base pairs) with a DNA sequence of 1000 bases in length requires calculation of (1000*2*10,000 million) = 20,000 billion cells. The factor of 2 is added because both the sequence and its complement strand are searched. On BioXL/H-8, this search would take about 3300 seconds (about 56 minutes). The same search will take 11,100 minutes (more than seven and a half days) on a high-end UNIX machine.