1 Omics Data Model

This chapter contains the following topics:

Introduction
Logical Data Model
Reference Data Tables
Result Data Tables

Introduction

Oracle Health Sciences Omics Data Bank (ODB) consists of two groups of tables. One set of tables is a reference to provide the genome-features metadata required to link sample specimen results to specific portions of the genome. The second set of tables in the model is used to capture the sample specimen ('Omics') results and link each result to some object in the reference model and also link back to the patient. The patient link is accomplished by linking ODB with Cohort Data Model (part of Oracle Health Sciences Cohort Explorer 1.0).

This Omics data can be used by tools such as BI server that require a star schema to build dynamic SQL correctly. However, it is important to note that in this release the scope of the product is the data model only, without any front-end user interfaces. The additional components included with the model is a set of scripts that can be used to load reference genomic data from specific sources as well as several types of result data from a limited set of formats.

The model is intended to handle very large amounts of data (in the order of terabytes and more). To gauge the scale of the data, consider that the human genome is around 3 billion bases in each strand and the number of genes that produce proteins is around 23,000. Each gene has many different variants and attributes that will be described in detail. Since this holds true for each protein, a lot of supporting reference data is loaded for each gene/ protein/ pathway. The data cannot merely be reloaded to maintain an organized table as is the case with other tables that use ETL processes to load data. One approach is to use Index Organized tables since data can be added to the reference model as more reference data is discovered for each gene.

Reference Data

The reference data is loaded from 4 distinct sources:

The genomic information with corresponding gene data is loaded from EMBL files which are stored online in the Ensembl database (http://www.ensembl.org). Ensembl is a joint project of the European Bioinformatics Institute and the Wellcome Trust Sanger Institute. This online database maintains references to other online database projects (dbSNP, NCBI, Cosmic, and so on) and provides references to each of these database. The model loads this cross reference information to, including known variation data, allow queries to use specific database references if needed.
The second source is the online SwissProt database (http://www.ebi.ac.uk/uniprot/) from which the model will obtain protein information. This database project is also a consortium of various groups including the European Bioinformatics Institute.
The third source is the HUGO Gene Nomenclature Committee (http://www.genenames.org/). This source provides reference seed information required for identifying human gene locations annotation only. The HUGO gene names are needed to find various cross references as well as the correct chromosome number for each gene. The HUGO Gene Nomenclature Committee is the authoritative group for all gene names.
The fourth reference source is Pathway Commons http://www.pathwaycommons.org/pc/ which is used as the source for published pathways and proteins/genes participating in each pathway. The coverage of pathway as a reference is minimal.

In general, the above files use similar features to represent the data. The EMBL format is a flat file representation which provides an easy mechanism to parse and store data in a separate database structure. Both Ensembl and SwissProt have native schemas that can be downloaded. However, these schemas are 3NF structures that require a lot of work to coerce into a star schema model. The ODB model does not copy the source schema structures in any way. In addition, there are a lot of extra objects in the native schemas that are not necessary for the type of queries needed for the ODB requirements and these objects are omitted in ODB schema.

Most of the online databases let you download complete references, or specific references for sections of the genome. Since some customers may only need some genes or some proteins, and some may not need any protein information at all, the model will allow for any combination of specific data to be loaded as well as updated. Ensembl and SwissProt databases are maintained in an additive manner, so that new data is added on top of the existing data. This lets the ODB reference data to be expanded as required by the customer. The HUGO and Pathwaycommons databases are reloaded each time a new version of gene names or pathways is uploaded.

Result Data

In ODB, the data model handles two main types of genetic results: gene expression and sequencing. Gene Expression experiments capture information on how effectively certain genes respond to various conditions. On another hand, for sequencing results, while there are many different types of sequencing techniques, the net effect is to record all of the variants detected for each organism being tested, copy number variation and other related features.

The model is designed to facilitate gene expression results to be queried with sequencing results in the same SQL statement.

Logical Data Model

ODB contains two sets of tables:

Reference data tables
Result data tables

Each set of tables comes with a set of loading scripts to load data into these tables. The reference loaders write to the reference tables, while the result loaders write to result tables. However, there is one exception, the W_EHA_VARIANT table, where the result loaders enable you to report on any novel variants by writing to this table with any new variants found. A dedicated procedure invoked by the result loader reports on any novel variants.

Figure 1-1 shows how the reference tables link to create the ODB reference tables section. Only table names are shown in the figure.

Figure 1-1 Reference Data Logical Model (Table Names Only)

Description of "Figure 1-1 Reference Data Logical Model (Table Names Only)"

Figure 1-2 shows how the result tables link to create the ODB result tables section. Only table names are shown in the figure.

Figure 1-2 Result Data Logical Model (Table Names Only)

Description of "Figure 1-2 Result Data Logical Model (Table Names Only)"

Reference Data Tables

The reference data starts with the SPECIES Dimension table and the DNA_SOURCE table.

W_EHA_SPECIES:

This species table stores information about each genome in the database. The current model will allow for any number of species genomes to be loaded. This requires you to specify species in queries if there are similar genes between the organisms being tested.

W_EHA_DNA_SOURCE:

The DNA_SOURCE table will store multiple records for different reference DNA strands for each species. Each cell in the species will have a copy of this reference DNA. There are buffers of DNA considered to be the reference for each organism. These reference strands are then used to map detected variations for each organism tested. The W_EHA_DNA_SOURCE table has the foreign key top SPECIES and also has a CLOB field to store the reference strand information. The DNA has a specific character notation to keep track of each DNA base. There are additional characters used for sections that have not be sequenced ("N") and there are other characters used to represent other possible DNA bases. The records in this table are used as the parent records to map genes and gene components. If Ensembl releases patches that show how some genes are re-defined new DNA_SOURCE records will be created and then link the other records as needed. This table also stores the chromosome location which is described later.

Figure 1-3 Genome Division by the Data Model

Description of "Figure 1-3 Genome Division by the Data Model"

W_EHA_GENE:

Each chromosome of the DNA strand has many different genes, which have a starting position and ending position. The entire size of the gene does not create the protein directly, but there are recognized sections of the DNA that scientists agree should be considered as part of the gene. The W_EHA_GENE table has fields for how the Ensembl database refers to the gene, as well as the recognized gene name. The recognized gene name is maintained by HUGO Gene Nomenclature Committee (http://www.genenames.org/). This reference information is loaded into the model to provide accurate chromosome information for each gene since the patch DNA sequences loaded do not list chromosomes.

W_EHA_GENE_SEGMENT:

This table is required to map the different segments of a gene to each buffer. Some genes are sequenced in multiple buffers and require this joining table to track each segment. This table gives the location in the buffer as well as a sequence number to keep track of the order of each segment that compose the gene. There is also a COMPLEMENT field that is used to indicate if a gene is transcribed in reverse order to create the protein.

_XREF, _QUALIFIER:

The XREF table associated with many of the different tables is used to list all of the cross reference information stored in the Ensembl database. There is a finite list of databases used, and each database has a specific format for the reference ID. You can use these reference ID values for queries. The _QUALIFIER tables associated with the different tables is used to list the other attributes. Each object can have an unlimited number of attributes such as "/note" that provides information to annotate the object. The database model will store this annotation data in case it is needed for reference.

W_EHA_GENE_STRUCTURE:

The process of gene transcription is accomplished by many different interim molecules that originate from sections of the gene. For each protein created, there is a distinct set of sections which are used. Each of these groups is identified in the EMBL file having the same TRANSCRIPT_ID qualifier. A GENE_STRUCTURE record is created to link to the protein and to be used as a parent record for all of the gene components. A given gene can have multiple proteins that are created (sometimes using the same sections) and each will have a different structure. Also, earlier research may have incorrect gene structures and the information is kept for historical reasons.

W_EHA_GENE_COMPONENT:

This table is used to store the various gene components. The EMBL file has many different objects listed (mRNA, CDS, STS, tRNA, misc RNA) and they all have a specific meaning. Views will be used to group the various types of objects in case there are queries to find genetic results that intersect with various gene regions. More user friendly names will be given (such as, mRNA = MESSENGER_RNA, CDS = CODING_REGION, STS = STRUCTURAL_SEGMENTS, and so on). You will be able to run various queries searching for mutations that occur in any of these regions, including the entire gene region.

W_EHA_GENE_COMP_SEGMENT:

This table is used to link each component to the DNA_SOURCE. Many of the gene components have joined sections. Sometimes the joined sections are detected in different source buffers as well and the foreign key to DNA_SOURCE is required for each part of the gene components. There is a sequence number to keep track of the order of each section used in the gene component.

W_EHA_PROTEIN:

Most of the known genes produce different types of protein molecules. The EMBL file lists the amino acids that comprise each protein molecule and uses an identifier for each protein molecule. The SwissProt files contains much more information about the protein molecule. There are more descriptive names for each protein which are not stored in the EMBL file (such as, insulin). The SwissProt file will also provide links to cross references as well as literature references. Each protein molecule can have many different components which are linked to this parent PROTEIN record.

W_EHA_PROT_COMPONENT:

This table is used to store all the protein components that are stored in the SwissProt files. You can import all or as many of the SwissProt files needed. This data may also be important for queries or reference. This can be important to show changes that may occur when variants are detected in the gene regions used to generate the amino acids of the protein.

W_EHA_VARIANT:

The VARIANT table is used to record the known reference sequence, REFERENCE_SEQ, corresponding to one or more variants that differ from the reference DNA_SOURCE. Most of these variants are well documented and compiled from other research. When results are uploaded, sometimes novel variants are detected and there are no known references for this variant. These results will generate new VARIANT records which may be of interest to researchers. There is a STATUS field which will be used to indicate NOVEL or KNOWN variants. Since this table will be queried frequently, it will be quite large and will require partitioning. The VARIANT table has a foreign key to the DNA_SOURCE record, not the GENE record. The reason being that some genes may overlap, and there may also be several structures that are affected by a variant. This table will be used to create result foreign keys as described later.

W_EHA_HUGO_INFO:

This table is very important to store reference seed information needed for identifying gene locations. The EMBL files will report each gene with a LOCUS_TAG which uses the registered name with the HUGO Gene Nomenclature Committee (http://www.genenames.org/). The entire reference data from this group will be loaded as seed data in this table. This is important because of the way the EMBL files store patch sequences. The patch sequences (which are corrections to the human genome project) list the chromosome using the accession number of the DNA used for detection. The W_EHA_HUGO_INFO table is required to look up the HUGO gene names to find various cross references as well as the correct chromosome number for each gene.

W_EHA_PATHWAY:

Pathway is used to describe a series of interactions in a cell. Numerous biological pathways exist, including genetic, metabolic, signaling, and so on. This table is used to store publicly available pathways. Each pathway's participants are defined in PATHWAY_PROTEIN table which has a foreign key to PATHWAY table.

W_EHA_PATHWAY_PROTEIN:

This table is used to keep track which gene or proteins belong to a particular pathway. It has a foreign key to PATHWAY table which associates each gene or protein with one or more pathways.

Result Data Tables

The model currently supports the following two categories of results:

Sequencing
Gene Expression

Types of sequencing results include simple variants, copy number variation and no-call. These results are directed into four major tables:

W_EHA_RSLT_SEQUENCING
W_EHA_RSLT_GENE_EXP
W_EHA_RSLT_COPY_NBR_VAR
W_EHA_RSLT_NOCALL

W_EHA_RSLT_SEQUENCING:

This table contains sequencing results, more specifically variant information. It has a foreign key into the W_EHA_VARIANT table. The records in this table are linked to the record in the variant reference table. Information such as insertions, deletions, or substitutions would be recorded in this table along with any quality metrics on this information.

Note:

A record in the W_EHA_RSLT_SEQUENCING table may come from any three result file types, namely VCF, MAF, or Complete Genomics (CGI) masterVar file.

W_EHA_RSLT_SEQUENCING_X:

This table is an additional table to store less frequently used sequencing metadata from the input file. It is always used as a helper table along with the main RSLT_SEQUENCING table.

W_EHA_RSLT_NOCALL:

This table contains results coming from Complete Genomics sequencing files. Only CGI format records no-call results, that is instances when there is incomplete information to make a call regarding variant information on an allele.

W_EHA_RSLT_NOCALL_X

This table is an additional table to store less frequently used sequencing metadata from the input file. It is always used as a helper table along with the main RSLT_NOCALL table.

W_EHA_RSLT_COPY_NBR_VAR:

This table contains copy number variation results coming from the Complete Genomics platform along with any relevant quality or count metrics. Currently no loading scripts are provided that support automatic parsing and loading of this data into the copy number variation result table. However, you can generate custom sequel to populate this table in order to query the results.

W_EHA_RSLT_CNV_X

This table is an additional table to store less frequently used sequencing metadata from the input file. It is always used as a helper table along with the main RSLT_COPY_NBR_VAR table.

W_EHA_PROBE:

This table is intended to hold probe information for gene expression results, and each probe is designed to represent a particular gene. Since probe design varies by vendors, there may be multiple probes that correspond to same gene. In the rare instance where more than one gene matches a probe, the model has W_EHA_PROBE_ALT_LINK that would need to be manually populated. W_EHA_PROBE must be populated by the expression loader prior to loading any results corresponding to gene expression. In addition, any reference information pertaining to probes can be recorded in the W_EHA_PROBE_XREF table.

W_EHA_RSLT_GENE_EXP:

This table is loaded from gene expression results and allows for storing gene intensity measurements, as well as quality metrics such as p-value and call information. A record can be inserted into this table only if the specified probe already exists in the W_EHA_PROBE table in order to establish a foreign key relationship.

W_EHA_RSLT_TYPE:

This table is a pre-seeded table with the types of results currently supported by the model. The result types are listed in Appendix A, "Additional Result Tables". Each record in the W_EHA_RSLT% tables supports a single specific result type.

W_EHA_STUDY

This table allows each result to be linked to a study if so specified by the end user during loading. The study table is intended to be a shadow copy of a table in the clinical data model, namely Cohort Explorer Data Model study table, thus it only holds study name and description. This table should be populated by the end user before loading any results pertaining to a given study. The presence of this table allows partitioning the results into groups based on a study the results have been collected for. This partitioning scheme leads to significant performance improvements.

W_EHA_CHROMOSOME

This table holds all the chromosomes names and is pre-seeded with names for all Human chromosomes. Refer to Appendix A, "Additional Result Tables" for pre-seeded data information. A result record in W_EHA_RSLT_SEQUENCING, W_EHA_RSLT_NOCALL, W_EHA_RSLT_COPY_NBR_VAR may be linked to a particular chromosome. Similar to W_EHA_STUDY, this table is used to partition results for improved query performance.

W_EHA_RSLT_SPECIMEN

W_EHA_RSLT_SPECIMEN table is linked to all four result data tables. Every record in any of the result tables must have a foreign key that links to a particular specimen in this table. W_EHA_RSLT_SPECIMEN table in turn, links to W_EHA_DATASOURCE table which holds information about the database a given specimen comes from. This information, along with additional fields in SPECIMEN table such as SPECIMEN_NUMBER and SPECIMEN_VENDOR_NUMBER, can be sued to uniquely identify and pull more metadata about a given specimen from other source systems. This information is not stored in the ODB.

W_EHA_DATASOURCE:

This table stores information regarding specimen sources. Each genomic result must have a specimen record connected to it coming from another schema with patient results. Specimen identifier is the link between the clinical results and genomic results. This table needs to be populated by the user prior to running any result loaders. If ODB is intended to be used with Cohort Explorer data model, this table should be seeded with one specimen datasource, namely Cohort Data Model.

W_EHA_RSLT_FILE:

This table contains information about the input file you provide to populate the results. It stores information about the file storage type and the path to the file. The table is designed to store either external files (regular storage denoted by 'E' ), or SecureFiles (secure storage denoted by 'S'). Currently, only loading of regular files is supported by the loaders, although you can manually link to any type of storage including Secure Files. Specific file descriptions as to their native formats and so on is stored in the W_EHA_RSLT_FILE_TYPE table and referenced via foreign keys.

W_EHA_RSLT_FILE_TYPE:

This table is pre-seeded with file types supported by the loaders into the model. The list of valid pre-seeded values is specified in the appendix. W_EHA_RSLT_FILE has a foreign key into this table.