8 Sequence Similarity Search and Alignment (BLAST)

This chapter describes Oracle Data Mining (ODM) support for certain problems in the life sciences. In addition to data mining functions that produce supervised and unsupervised models, ODM supports the Basic Local Alignment Search Tool (BLAST), an algorithm for sequence similarity search and alignment. In the life sciences, vast quantities of data, including nucleotide and amino acid sequences, are stored, typically in a database. These sequence data help biologists determine the chemical structure, biological function, and evolutionary history of organisms. Managing the exponential growth in sequence data sources depends on fast, sensitive, and statistically rigorous techniques for detecting similarities between these sequences.

As the amount of nucleotide and amino acid sequence data continues to grow, the data becomes increasingly useful in the analysis of newly sequenced genes and proteins because of the greater chance of finding such sequence similarities.

For detailed information about Oracle's implementation of BLAST, see Oracle Data Mining Application Developer's Guide.

8.1 Bioinformatics Sequence Search and Alignment

Sequence alignment is one of the most common bioinformatics tasks. It arises in almost every research and development activity across the life sciences, in academia, biotech, services, software, pharmaceutical companies, and hospitals.

The National Center for Biotechnology Information (NCBI) developed one of the commonly used versions of the Basic Local Alignment Search Tool (BLAST).

Of all the sequence alignment algorithms, BLAST is the most widely used. It is typically used to compare one query nucleotide or protein sequence against a database of sequences and to uncover similarities and sequence matches. Its success and popularity come from its combination of speed, sensitivity, and statistical assessment of the results.

BLAST is a heuristic method for finding high-scoring, locally optimal alignments between a query sequence and a database. The BLAST algorithm and family of programs rely on the statistics of gapped and ungapped sequence alignments. These statistics allow the probability of obtaining an alignment with a particular score to be estimated. BLAST is unlikely to be as sensitive for all protein searches as a full dynamic programming algorithm. However, the underlying statistics provide a direct estimate of the significance of any match found.
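To make the role of these statistics concrete, the sketch below applies the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), which estimates how many alignments with a raw score of at least S would be expected by chance when a query of length m is searched against a database of total length n. The values of lambda and K (0.267 and 0.041, as commonly reported for gapped BLOSUM62 searches) and the lengths used in the example are illustrative assumptions; they are not taken from the Oracle implementation.

```python
import math

# Illustrative Karlin-Altschul parameters (assumed for this sketch,
# not taken from the Oracle implementation).
LAMBDA = 0.267
K = 0.041


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalize a raw alignment score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Expected number of chance alignments scoring at least raw_score:
    E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    m, n = 250, 50_000_000  # hypothetical query length and database length
    for s in (40, 80, 160):
        print(s, round(bit_score(s), 1), f"{expect_value(s, m, n):.3g}")
```

A higher raw score gives a smaller expect value, and the same expect value can be written as m * n * 2^(-S'), where S' is the bit score. This is why bit scores can be compared across searches that use different scoring systems.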

The inclusion of BLAST in ODM positions the Oracle database as a platform for bioinformatics.

For more information about BLAST, see the NCBI BLAST Home Page at http://www.ncbi.nlm.nih.gov/BLAST.

8.2 BLAST in the Oracle Database

Implementing BLAST in the database provides the following benefits:

  • You can include BLAST in complex queries, thereby enabling complex analytical pipelines that include BLAST searches.

  • You can subselect portions of the database using SQL, thereby restricting searches.

  • Because sequence data is already stored in the database, it is not necessary to export the data, pre-process it to create BLAST data sets, and then import the results back into the database.

8.3 Oracle Data Mining Sequence Search and Alignment Capabilities

Sequence search and alignment, with capabilities similar to those of NCBI BLAST 2.0, has been implemented in the database using table functions. This implementation enables users to perform queries against data that is held directly inside an Oracle database. As the algorithms are implemented as table functions, parallel computation is intrinsically supported.

The five core variants of BLAST have been implemented:

  • BLASTN compares a nucleotide query sequence against a nucleotide database.

  • BLASTP compares a protein query sequence against a protein sequence database.

  • BLASTX compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.

  • TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).

  • TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
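The translated variants above rely on six-frame conceptual translation: three reading frames on the given strand and three on its reverse complement. The following is a minimal sketch of that idea using the standard genetic code. The function names and the frame numbering (+1 to +3, -1 to -3) are conventions chosen for this example and are not part of the Oracle table function interface.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate complete codons starting at offset; unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Map frame number (+1..+3 forward, -1..-3 reverse strand) to protein."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna, offset)
        frames[-(offset + 1)] = translate(rc, offset)
    return frames


if __name__ == "__main__":
    for frame, protein in sorted(six_frame_translations("ATGGCC").items()):
        print(frame, protein)
```

In a search such as BLASTX, each of the six translated products would then be compared against the protein database; TBLASTN applies the same translation to the database side instead.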

The BLAST table functions are implemented in the database and can be invoked from SQL. Using SQL, it is possible to pre-process the sequences as well as perform any required post-processing. This additional processing capability means it is possible to combine BLAST searches with queries that involve images, date functions, literature search, and so on. Such complex queries make it possible to perform BLAST searches on only the required subset of data, potentially resulting in highly performant queries.

The performance of BLAST searches can be improved if the data set is transformed into a compressed binary format and used in the searches. BLASTN_COMPRESS() compresses nucleotide sequence data. The BLAST queries can use this compressed format for searches.
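This chapter does not describe the binary format that BLASTN_COMPRESS() produces. As a general illustration of why nucleotide data compresses well, the sketch below packs each base into two bits, so four bases fit in one byte. The encoding, the function names, and the handling of sequence length are assumptions made for this example only; ambiguity codes such as N are not handled and raise a KeyError.

```python
# Two-bit codes for the four unambiguous nucleotides (assumed encoding).
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def pack(dna):
    """Pack a nucleotide string into bytes, four bases per byte.

    Returns (data, length); the length is needed to ignore padding bits
    in the final byte when unpacking.
    """
    out = bytearray((len(dna) + 3) // 4)
    for i, base in enumerate(dna.upper()):
        out[i // 4] |= _CODE[base] << (6 - 2 * (i % 4))
    return bytes(out), len(dna)


def unpack(data, length):
    """Recover the nucleotide string from packed bytes."""
    return "".join(
        _BASE[(data[i // 4] >> (6 - 2 * (i % 4))) & 3] for i in range(length)
    )


if __name__ == "__main__":
    packed, n = pack("ACGTTGCAAC")
    print(len(packed), unpack(packed, n))
```

A packed sequence occupies roughly a quarter of the space of its one-character-per-base form, which reduces the amount of data a search has to read.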

BLAST queries can be invoked directly using the SQL interface or through an application. The query below shows an example of a SQL-invoked BLAST search in which a protein sequence is compared with the protein database SwissProt, and the database is filtered so that only human sequences deposited after 1 January 1990 are searched. The column of numbers at the end of the query reflects the parameters chosen.

select t_seq_id, alignment_length, q_seq_start, q_seq_end,
       q_frame, t_seq_start, t_seq_end, t_frame, score, expect
  from TABLE( 
       BLASTP_ALIGN ( 
         (select sequence from query_db), 
         CURSOR(SELECT seq_id, seq_data 
                FROM swissprot 
                WHERE organism = 'Homo sapiens (Human)' AND 
                      creation_date > '01-Jan-90'), 

The results of a SQL query can be displayed either through an application or a SQL*Plus interface. When the SQL*Plus interface is used, the user can decide how the results will be displayed. The following shows an example of the format of the output that could be displayed by the SQL query shown above.

T_SEQ_ID  ALIGNMENT_LENGTH  Q_SEQ_START  Q_SEQ_END  Q_FRAME  T_SEQ_START  T_SEQ_END  T_FRAME  SCORE      EXPECT
P31946                  50            0         50        0           13         63        0    205  5.1694E-18
Q04917                  50            0         50        0           12         62        0    198  3.3507E-17
P31947                  50            0         50        0           12         62        0    169  7.7247E-14
P27348                  50            0         50        0           12         62        0    198  3.3507E-17
P58107                  21           30         51        0          792        813        0     94  6.34857645

The columns correspond, in order, to the attributes returned by the query: the target sequence ID, the length of the alignment, the positions where the alignment starts and ends on the query sequence, the reading frame used for the query sequence, the positions where the alignment starts and ends on the target sequence, the reading frame used for the target sequence, the score, and the expect value.

The five data rows represent the sequence alignments that were returned. For example, the protein with the highest-scoring alignment to the query sequence has the accession number "P31946"; the alignment is 50 amino acids long, starts at the first residue of the query, and ends at the 50th residue of the query.
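When results like these are consumed by an application, a common post-processing step is to parse the rows and keep only the statistically significant hits. The sketch below does this for the five sample rows shown above. The field names follow the select list of the query; the record type, the function names, and the cutoff of 1e-3 on the expect value are assumptions chosen for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    t_seq_id: str
    alignment_length: int
    q_seq_start: int
    q_seq_end: int
    q_frame: int
    t_seq_start: int
    t_seq_end: int
    t_frame: int
    score: int
    expect: float


# The five sample rows from the output shown above.
SAMPLE_OUTPUT = """\
P31946 50 0 50 0 13 63 0 205 5.1694E-18
Q04917 50 0 50 0 12 62 0 198 3.3507E-17
P31947 50 0 50 0 12 62 0 169 7.7247E-14
P27348 50 0 50 0 12 62 0 198 3.3507E-17
P58107 21 30 51 0 792 813 0 94 6.34857645
"""


def parse_hits(text):
    """Parse whitespace-separated result rows, skipping headers and blanks."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 10 or not fields[1].isdigit():
            continue
        hits.append(Hit(fields[0], *map(int, fields[1:9]), float(fields[9])))
    return hits


def significant(hits, max_expect=1e-3):
    """Return hits at or below the expect cutoff, most significant first."""
    return sorted((h for h in hits if h.expect <= max_expect),
                  key=lambda h: h.expect)


if __name__ == "__main__":
    for hit in significant(parse_hits(SAMPLE_OUTPUT)):
        print(hit.t_seq_id, hit.score, hit.expect)
```

With this cutoff the last row, P58107, is dropped: an expect value above 6 means that several alignments of that quality would be expected by chance alone.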