Oracle® Health Sciences Omics Data Bank Programmer's Guide Release 1.0.1 Part Number E27509-02
The ENSEMBL GVF file, which contains nucleotide variation references from dbSNP, COSMIC, and EMBL, is used to populate the variation tables in Omics Data Bank. The cross-reference information for these variants is identified by the Dbxref qualifier and is loaded into the W_EHA_VARIANT_XREF table. The standard format for Dbxref in a GVF file is shown below.
Dbxref=dbSNP_132:rs79772382;
The import program splits this value into DATABASE and REFERENCE_ID, using the first colon (:) as the delimiter. For the example above, the W_EHA_VARIANT_XREF columns are populated with the following data:
DATABASE = 'dbSNP_132'
REFERENCE_ID = 'rs79772382'
For some organisms, however, Dbxref is defined slightly differently. The following example is from the Rattus norvegicus GVF file.
Dbxref=ENSEMBL:celera:ENSRNOSNP2610581;
In such cases, the W_EHA_VARIANT_XREF columns are populated with the following data:
DATABASE = 'ENSEMBL'
REFERENCE_ID = 'celera:ENSRNOSNP2610581'
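The first-colon split described above can be sketched in Python as follows. This is only an illustration of the rule, not the ODB loader itself; the function name split_dbxref is hypothetical.

```python
def split_dbxref(attribute: str) -> tuple[str, str]:
    """Split a GVF Dbxref attribute into (DATABASE, REFERENCE_ID).

    Hypothetical helper: it mirrors the documented rule of splitting
    on the first colon only, so any later colons stay in REFERENCE_ID.
    """
    value = attribute.strip().rstrip(";")
    if value.startswith("Dbxref="):
        value = value[len("Dbxref="):]
    # maxsplit=1 keeps everything after the first colon together.
    database, reference_id = value.split(":", 1)
    return database, reference_id
```

Applied to the two examples above, the sketch yields ('dbSNP_132', 'rs79772382') and ('ENSEMBL', 'celera:ENSRNOSNP2610581').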
Because REFERENCE_ID may contain prefixed or suffixed data in cases such as the one above, we suggest using the SQL LIKE operator when querying REFERENCE_ID.
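A LIKE-based lookup might look like the sketch below. It uses an in-memory SQLite table as a stand-in so that the example is runnable; ODB itself is an Oracle schema, the real W_EHA_VARIANT_XREF table has more columns than the two shown, and the helper name find_xrefs is hypothetical.

```python
import sqlite3

# Stand-in table with only the two columns discussed in this section.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE W_EHA_VARIANT_XREF ("DATABASE" TEXT, REFERENCE_ID TEXT)'
)
conn.executemany(
    "INSERT INTO W_EHA_VARIANT_XREF VALUES (?, ?)",
    [
        ("dbSNP_132", "rs79772382"),
        ("ENSEMBL", "celera:ENSRNOSNP2610581"),
    ],
)


def find_xrefs(connection, accession):
    """Return rows whose REFERENCE_ID contains the accession anywhere."""
    cursor = connection.execute(
        'SELECT "DATABASE", REFERENCE_ID FROM W_EHA_VARIANT_XREF '
        "WHERE REFERENCE_ID LIKE ? ORDER BY REFERENCE_ID",
        ("%" + accession + "%",),
    )
    return cursor.fetchall()
```

Searching for ENSRNOSNP2610581 this way finds the row stored as 'celera:ENSRNOSNP2610581', which an equality comparison would miss.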
Note:
The REFERENCE_SUFFIX column of W_EHA_VARIANT_XREF is not populated with any data in the current model. The same scenario exists for SwissProt database cross-references, so the SQL LIKE operator is also suggested for queries against the W_EHA_PROT_XREF table.
The database cross-reference information for SwissProt is stored in W_EHA_PROT_XREF. This table holds the DATABASE, REFERENCE_ID, and REFERENCE_SUFFIX information. The standard format for a database cross-reference in a SwissProt file is shown below.
DR InterPro; IPR007031; Poxvirus_VLTF3.
The import program splits this line into DATABASE, REFERENCE_ID, and REFERENCE_SUFFIX, using the semicolon (;) as the delimiter. For the example above, the W_EHA_PROT_XREF columns are populated with the following data:
DATABASE = 'InterPro'
REFERENCE_ID = 'IPR007031'
REFERENCE_SUFFIX = 'Poxvirus_VLTF3'
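The semicolon split can be sketched as follows. The function name split_dr_line is hypothetical, and the choice to take the third field as REFERENCE_SUFFIX and ignore any further fields is an assumption; this section does not state how the loader treats DR lines with more than three fields.

```python
def split_dr_line(line: str):
    """Split a SwissProt DR line into (DATABASE, REFERENCE_ID, REFERENCE_SUFFIX).

    Hypothetical helper. Fields beyond the third are ignored here,
    which is an assumption rather than documented loader behavior.
    """
    body = line.strip()
    if body.startswith("DR"):
        body = body[2:]
    # Drop surrounding whitespace and the trailing period of the line.
    body = body.strip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    database = fields[0]
    reference_id = fields[1] if len(fields) > 1 else None
    reference_suffix = fields[2] if len(fields) > 2 else None
    return database, reference_id, reference_suffix
```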
Certain cross-references in the SwissProt file have the same REFERENCE_ID but a different REFERENCE_SUFFIX. Because of the indexing on REFERENCE_ID, only the first record found for a given REFERENCE_ID is stored; other records with the same REFERENCE_ID are left out.
For example:
DR EMBL; AL390732; CAH71826.2; JOINED; Genomic_DNA.
DR EMBL; AL390732; CAH73848.1; -; Genomic_DNA.
In the example above, the REFERENCE_ID for both cross-references is AL390732, so because of the indexing only the information from the first line is stored in the table.
Some information is lost at the REFERENCE_SUFFIX level but not at the REFERENCE_ID level; that is, every REFERENCE_ID is captured in the W_EHA_PROT_XREF table.
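The effect of this first-record-wins behavior can be illustrated with a short sketch. It assumes, as the text above suggests, that uniqueness is decided by REFERENCE_ID alone; the function name is hypothetical and the code is not the loader's implementation.

```python
def keep_first_per_reference_id(records):
    """Keep only the first (DATABASE, REFERENCE_ID, REFERENCE_SUFFIX)
    tuple seen for each REFERENCE_ID, preserving input order."""
    stored = {}
    for database, reference_id, reference_suffix in records:
        # Later records with an already-seen REFERENCE_ID are dropped.
        if reference_id not in stored:
            stored[reference_id] = (database, reference_id, reference_suffix)
    return list(stored.values())


records = [
    ("EMBL", "AL390732", "CAH71826.2"),
    ("EMBL", "AL390732", "CAH73848.1"),
]
kept = keep_first_per_reference_id(records)
```

Here kept holds only the CAH71826.2 record: the suffix CAH73848.1 is lost, while the REFERENCE_ID AL390732 is still present.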
References to the mitochondrial chromosome are stored as MT on the reference side of the model, namely in the W_EHA_VARIANT and W_EHA_HUGO_INFO tables. Any novel variant loaded into the W_EHA_VARIANT table from the result files has its chromosome value converted from M to MT. In the result tables, such as W_EHA_RSLT_COPY_NBR_VAR, the chromosome value is stored as M or as otherwise specified in the result file. Queries must therefore map any reference to M in the result tables to MT in the reference tables of ODB.
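When joining result rows to reference rows in application code, the mapping can be applied with a small helper such as the one below. The function name is hypothetical, and only the M to MT case described above is handled; other chromosome values pass through unchanged.

```python
def to_reference_chromosome(result_chromosome: str) -> str:
    """Map a chromosome value from a result table to the value
    used in the reference tables (M becomes MT)."""
    return "MT" if result_chromosome == "M" else result_chromosome
```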
Promoter region information is not available in the reference data set imported from ENSEMBL. A column is therefore provided in the W_EHA_SPECIES table for you to specify the promoter region upstream of the gene for a specific organism. By default, this value is taken as an input parameter during installation of ODB and is stored in the PROMOTER_OFFSET column of the W_EHA_SPECIES table. You can also change this value after installation by editing the W_EHA_SPECIES.PROMOTER_OFFSET column.
End position refers to the last nucleotide in the affected area. CGI counts the end position of variants differently from MAF and GVF, which also carry end position values. CGI positions are zero-based, whereas all other data formats are one-based. The loader code already compensates for the zero-based positions.
CGI treats the end position of insertions differently from that of single nucleotide variants (SNVs), whereas all other formats compute the end position in the same way for both. CGI stores the end position as the nucleotide just past the affected variation: for insertions it sets the end position equal to the start position, and for SNVs it sets the end position to the start position + 1. The CGI end position also varies with the size of the reference variant. In this release of ODB, the CGI loader has been modified to compute the end position in the same way as the VCF and MAF loaders.
The other formats' method of calculating the end position was adopted because the GVF file is treated as the authoritative reference: this data comes from Ensembl, and ODB follows the GVF format and position calculation.
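The conventions described above can be made concrete with a short sketch. It assumes that the CGI end position equals the zero-based start plus the length of the reference sequence (which reproduces end = start for insertions and end = start + 1 for SNVs), and that the one-based formats use an inclusive end. Both function names are hypothetical, and this is not the ODB loader code; insertions are deliberately left out of the conversion because this section does not state the exact value the loaders store for them.

```python
def cgi_end(start0: int, ref_length: int) -> int:
    """End position in the CGI style: zero-based, pointing just past
    the affected reference bases (assumed to be start + ref length)."""
    return start0 + ref_length


def to_one_based_inclusive(start0: int, end0: int) -> tuple[int, int]:
    """Convert a zero-based, end-exclusive interval to a one-based,
    end-inclusive interval. Insertions (end == start) are rejected
    because their target representation is not specified here."""
    if end0 <= start0:
        raise ValueError("insertions are not handled by this sketch")
    return start0 + 1, end0
```

For an SNV at zero-based position 100, cgi_end(100, 1) gives 101, and the converted one-based interval is (101, 101), so start and end coincide as in the one-based formats.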