Oracle Data Mining Concepts
10g Release 1 (10.1)
Part Number B10698-01
In addition to data mining functions that produce predictive and descriptive models, ODM supports specialized sequence search and alignment algorithms (BLAST). In life sciences, vast quantities of data including nucleotide and amino acid sequences are stored, typically in a database. These sequence data help biologists determine the chemical structure, biological function, and evolutionary history of organisms. A key requirement for managing the exponential growth in sequence data is the availability of fast, sensitive, and statistically rigorous techniques for detecting similarities between sequences.
As the amount of nucleotide and amino acid sequence data continues to grow, the data becomes increasingly useful in the analysis of newly sequenced genes and proteins because of the greater chance of finding such sequence similarities.
Sequence alignment is one of the most commonly used bioinformatics tasks. It is present in almost any research and development activity across the many industries in the area of life sciences including academia, biotech, services, software, pharma, and hospitals.
Of all the sequence alignment algorithms, the most widely used is BLAST (basic local alignment search tool). It is typically used to compare one query nucleotide or protein sequence against a database of sequences, and to uncover similarities and sequence matches. Its success and popularity come from its combination of speed, sensitivity, and statistical assessment of the results.
BLAST is a heuristic method for finding the high-scoring locally optimal alignments between a query sequence and a database. The BLAST algorithm and family of programs rely on the statistics of gapped and ungapped sequence alignments. These statistics allow the probability of obtaining an alignment with a particular score to be estimated. BLAST is unlikely to be as sensitive for all protein searches as a full dynamic programming algorithm. However, the underlying statistics provide a direct estimate of the significance of any match found.
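The statistical assessment described above can be illustrated with the standard Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where S is the raw alignment score, m is the query length, and n is the total database length. The following Python sketch is illustrative only: the function names are hypothetical, the values of lambda and K are assumed placeholders (real values depend on the scoring matrix and gap costs), and nothing here is taken from the ODM implementation.

```python
import math

# Assumed, illustrative Karlin-Altschul parameters. Real values depend on
# the scoring matrix and gap penalties; these are NOT taken from ODM.
LAMBDA = 0.3176
K = 0.134


def expect_value(score, query_len, db_len, lam=LAMBDA, k=K):
    """Expected number of chance alignments with a score of at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)


def bit_score(score, lam=LAMBDA, k=K):
    """Normalized score, so that E = m * n * 2 ** (-bit_score)."""
    return (lam * score - math.log(k)) / math.log(2)


def p_value(expect):
    """Probability of at least one chance alignment: 1 - exp(-E)."""
    return -math.expm1(-expect)


if __name__ == "__main__":
    # Hypothetical sizes: a 50-residue query against 1,000,000 database residues.
    for raw_score in (50, 100, 200):
        e = expect_value(raw_score, 50, 1_000_000)
        print(raw_score, round(bit_score(raw_score), 1), e, p_value(e))
```

Under these assumptions, a higher raw score always gives a smaller expect value, and for very small expect values the probability is practically equal to the expect value itself.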
The inclusion of BLAST in ODM positions the Oracle DBMS as a platform for bioinformatics.
Implementing BLAST in the database provides the following benefits:
Sequence search and alignment, with capabilities similar to those of NCBI BLAST 2.0, has been implemented in the database using table functions. This implementation enables users to perform queries against data that is held directly inside an Oracle database. As the algorithms are implemented as table functions, parallel computation is intrinsically supported.
The five core variants of BLAST have been implemented:
The BLAST table functions are implemented in the database, and can be invoked by SQL. Using SQL, it is possible to pre-process the sequences as well as perform any required post-processing. This additional processing capability means it is possible to combine BLAST searches with queries that involve images, date functions, literature search, etc. Using these complex queries makes it possible to perform BLAST searches on a required subset of data, potentially resulting in highly performant queries. This functionality is expected to provide considerable value.
BLAST queries can be invoked directly using the SQL interface or through an application. The query below shows an example of a SQL-invoked BLAST search where a protein sequence is compared with the protein database SwissProt, and sequences are filtered so that only human sequences that were deposited after 1 January 1990 are searched against. The column of numbers at the end of the query reflects the parameters chosen.
select t_seq_id, alignment_length, q_seq_start, q_seq_end, q_frame,
       t_seq_start, t_seq_end, t_frame, score, expect
from TABLE(
       BLASTP_ALIGN (
         (select sequence from query_db),
         CURSOR(SELECT seq_id, seq_data
                FROM swissprot
                WHERE organism = 'Homo sapiens (Human)'
                AND creation_date > '01-Jan-90'),
         1,
         -1,
         0,
         0,
         'BLOSUM62',
         10,
         0,
         0,
         0,
         0,
         0)
     );
The results of a SQL query can be displayed either through an application or a SQL*Plus interface. When the SQL*Plus interface is used, the user can decide how the results will be displayed. The following shows an example of the format of the output that could be displayed by the SQL query shown above.
T_SEQ_ID  ALIGNMENT_LENGTH  Q_SEQ_START  Q_SEQ_END  Q_FRAME  T_SEQ_START  T_SEQ_END  T_FRAME  SCORE  EXPECT
--------  ----------------  -----------  ---------  -------  -----------  ---------  -------  -----  ----------
P31946                  50            0         50        0           13         63        0    205  5.1694E-18
Q04917                  50            0         50        0           12         62        0    198  3.3507E-17
P31947                  50            0         50        0           12         62        0    169  7.7247E-14
P27348                  50            0         50        0           12         62        0    198  3.3507E-17
P58107                  21           30         51        0          792        813        0     94  6.34857645
The first row of information represents some of the attributes returned by the query; for example, the target sequence ID, the length of the alignment, the position where the alignment starts on the query sequence, the position where the alignment ends on the query sequence, which open reading frame was used for the query sequence, etc.
The next five rows represent the sequence alignments that were returned; for example, the protein with the highest-scoring alignment to the query sequence has the accession number "P31946", the alignment length was 50 amino acids, the alignment started at the first residue of the amino acid query and ended with the 50th residue of the amino acid query, etc.
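When results such as these are retrieved through an application, the rows are often ranked and filtered by expect value before further use. The following Python sketch shows one way this might be done on the sample rows above (a subset of the columns). The `Hit` type, the `significant_hits` function, and the cutoff of 1e-3 are hypothetical choices made for illustration; they are not part of ODM.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One result row (hypothetical type; subset of the returned columns)."""
    t_seq_id: str
    alignment_length: int
    score: int
    expect: float


# Rows copied from the sample output shown earlier.
HITS = [
    Hit("P31946", 50, 205, 5.1694e-18),
    Hit("Q04917", 50, 198, 3.3507e-17),
    Hit("P31947", 50, 169, 7.7247e-14),
    Hit("P27348", 50, 198, 3.3507e-17),
    Hit("P58107", 21, 94, 6.34857645),
]


def significant_hits(hits, max_expect=1e-3):
    """Keep hits at or below the cutoff, most significant first.

    The default cutoff is an assumed, illustrative value.
    """
    kept = [h for h in hits if h.expect <= max_expect]
    # sorted() is stable, so hits with equal expect keep their input order.
    return sorted(kept, key=lambda h: h.expect)


if __name__ == "__main__":
    for hit in significant_hits(HITS):
        print(hit.t_seq_id, hit.score, hit.expect)
```

With the assumed cutoff, the short alignment to P58107 (expect of about 6.3) is discarded, because an alignment of that score would be expected to occur several times by chance alone. The same filtering could equally be expressed in the SQL query itself with a WHERE clause on the expect column.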