7 Miscellaneous Topics

This chapter contains the following topics:

Product Version and Product Profile Including Flanking Offsets
Querying Database Cross-References for Variations
Mitochondrial Chromosome Mappings
Promoter Offset
Loader Activity Logging
User Feedback for Loader Runs
VCF Loader Log
Creating Custom Gene Components
Additional Step on Exadata versus Non-Exadata
UNDO Tablespace Auto-extendable Issue
Chromosome Partitioned Tables

7.1 Product Version and Product Profile Including Flanking Offsets

Currently, there are two configuration tables in the ODB schema:

W_EHA_PRODUCT_VERSION table stores product version numbers and is updated and used by upgrade and patch installations. This table records each patch or release that is installed. It does not have ETL_PROC_WID or ENTERPRISE_ID columns. It does not require a loader to be populated and is global to all enterprises or tenants that use the schema. Two fields are used to record version information:

RELEASE_VERSION VARCHAR2(200);

PATCH_VERSION VARCHAR2(200);
W_EHA_PRODUCT_PROFILE table stores information specific to each enterprise or tenant - a single record for each ENTERPRISE_ID value, and a unique key to enforce this. It also stores global settings for different logging levels allowed for log records inserted into the log table. It stores the global setting to permit DBMS output. Finally, there is a VCF_FORMAT field intended to store a list of data types that are used in the Format column in the VCF data file.

The columns in this table are:
- PROMOTER_OFFSET NUMBER;
- FLANKING_OFFSET NUMBER;
- LOG_WARNING CHAR(1);
- LOG_INFO CHAR(1);
- LOG_DEBUG CHAR(1);
- LOG_TRACE CHAR(1);
- LOG_DBMS_OUTPUT CHAR(1);
- VCF_FORMAT VARCHAR2(4000 CHAR);

The FLANKING_OFFSET parameter is an input (right after PROMOTER_OFFSET) on running TRC install scripts and is used by various loader procedures. It defines the size of the region before and after a gene definition that are to be linked with sequencing results. These associations are important as a lot of research is focusing on areas before and after gene definitions. The value for FLANKING_OFFSET should be equal to or greater than the value used for PROMOTER_OFFSET, so that the queries generated by the Query Engine can find results linked to a gene that may exist in a promoter region that extends before or after the gene definition.

THE LOG_% fields are flags whose values are either Y or N. By default, the WARNING and INFO log levels are turned on and DBMS output is turned off. For the most part these settings are not needed or used, as logging is also configured in the loader scripts.

The VCF_FORMAT column is required to specify data types used in the specification of custom FORMAT columns. For a description of its usage, see the Section 4.5.2, "Custom Format Specification in VCF".

7.2 Querying Database Cross-References for Variations

7.2.1 Ensembl db_xref Qualifier Issue

The Ensembl GVF file, which contains nucleotide variation references from dbSNP, COSMIC and EMBL, is used to populate the variation tables in the Omics Data Bank. The cross-reference information for these variants is identified by the Dbxref qualifier and is loaded into the W_EHA_VARIANT_XREF table. The standard format for Dbxref in a GVF file is:

Dbxref=dbSNP_132:rs79772382;

To import this data, the program splits it into DATABASE and REFERENCE_ID using the first colon (:) as delimiter. Therefore, for the above example W_EHA_VARIANT_XREF populates columns with the following data:

DATABASE = 'dbSNP_132'

REFERENCE_ID = 'rs79772382'

However, for some organisms, Dbxref is defined differently. Following is an example from a Rattus norvegicus GVF file:

Dbxref=ENSEMBL:celera:ENSRNOSNP2610581;

For such cases, W_EHA_VARIANT_XREF columns are populated with the following data:

DATABASE = 'ENSEMBL'

REFERENCE_ID = 'celera:ENSRNOSNP2610581'

Since REFERENCE_ID may sometimes contain suffixed or prefixed data, Oracle recommends using the SQL LIKE operator when querying the REFERENCE_ID.

Note:

The REFERENCE_SUFFIX column in W_EHA_VARIANT_XREF is not populated with any data in the current model.

The same scenario hold good for the SwissProt database cross-reference. Oracle recommends using the SQL LIKE operator for querying against the W_EHA_PROT_XREF table.

7.2.2 Swissprot db_xref Qualifier Issue

The W_EHA_PROT_XREF table stores the database cross-reference information for SwissProt. This table populates the DATABASE, REFERENCE_ID and REFERENCE_SUFFIX information. The standard format for database cross-reference in a SwissProt file is:

DR InterPro; IPR007031; Poxvirus_VLTF3.

To import this data, the program splits it into DATABASE, REFERENCE_ID and REFERENCE_SUFFIX using the semi-colon (;) as delimiter. Hence, for the above example W_EHA_PROT_XREF, populate the columns with following data:

DATABASE = 'InterPro'

REFERENCE_ID = 'IPR007031'

REFERENCE_SUFFIX = 'Poxvirus_VLTF3'

There are certain cross-references in SwissProt file that have the same REFERENCE_ID but a different REFERENCE_SUFFIX. For indexing the REFERENCE_ID, only the distinct first found REFERENCE_ID is stored. The other records retain the same REFERENCE_ID.

For example:

DR EMBL; AL390732; CAH71826.2; JOINED; Genomic_DNA.

DR EMBL; AL390732; CAH73848.1; -; Genomic_DNA.

In the above example, the REFERENCE_ID for both the cross-references is AL390732. Therefore, only the first line information is stored in the table for indexing.

There is some loss of information on the REFERENCE_SUFFIX level but not on REFERENCE_ID. All REFERENCE_IDs are captured in the W_EHA_PROT_XREF table. Also, the count of instances where duplicate db_xrefs are not saved is now logged by the SwissProt loader as a warning.

7.2.3 W_EHA_VARIANT_X

The W_EHA_VARIANT_X table stores records for the following scenarios for both VCF and MAF format files:

When SVTYPE tag in the INFO column and < tag in the ALT column of the VCF file are present (for example, in case of large SV, there is some information on the SVTYPE in the INFO column such as SVTYPE=INS or SVTYPE=DEL, and so on); for such records the ALT column will have tags like <DEL> or <DUP>, and so on.
When the combined length of REF and ALT sequence is greater than or equal to 999, post trimming of overlapped bases. The checksum is calculated and reported in the W_EHA_VARIANT.SEQUENCE_CHECKSUM column.

The ALLELE column in the W_EHA_VARIANT_X table stores the complete value present in the ALT column.

If there is any overlap between the REF and ALT columns and some bases are trimmed from ALT column, then the ALLELE column still stores the untrimmed value. The checksum (stored in W_EHA_VARIANT.SEQUENCE_CHECKSUM) is calculated based on the trimmed ALT value.

To determine the position of the actual trimmed allele, W_EHA_VARIANT_X.START_POSITION value is used. If none of the bases are trimmed from the ALT column, then the START_POSITION value is 1. If first base is trimmed, then the START_POSITION value of the sequence existing in the ALLELE column is 2.

7.3 Mitochondrial Chromosome Mappings

The references to mitochondrial chromosome are stored as MT on the reference side in the W_EHA_VARIANT and W_EHA_HUGO_INFO tables, and on the result side in the W_EHA_CHROMOSOME tables of the model. Any novel variants reported into the W_EHA_VARIANT table from the result files have the chromosome value converted from M to MT. For example, while inserting in W_EHA_RSLT_COPY_NBR_VAR or W_EHA_RSLT_SEQUENCING, the FK to W_EHA_CHROMOSOME table for chromosome value MT is taken if the result file has chromosome M.

7.4 Promoter Offset

Promoter region information is not available in the reference data set imported from Ensembl EMBL files. Therefore, a column has been provided in the W_EHA_SPECIES table to specify the promoter region upstream to the gene for a specific organism. The column is named PROMOTER_OFFSET. This Promoter Offset is species-specific.

There is also a default Promoter Offset, stored in the PROMOTER_OFFSET column of the PRODUCT_PROFILE table. This default value is populated during ODB installation, from the value of the -promoter_offset argument.

If the Promoter Offset for a species is null in the W_EHA_SPECIES table, the default value from W_EHA_PRODUCT_PROFILE is used. The EMBL and SwissProt reference loaders, which typically create W_EHA_SPECIES records, do not populate the PROMOTER_OFFSET column there. Therefore, the only way to define a non-default Promoter Offset for a species, after its record has been created, is to manually edit the value in W_EHA_SPECIES.PROMOTER_OFFSET for it.

ODB provides a special view, W_EHA_PROMOTER_REGION_V, which uses the Promoter Offset described above to enable querying promoter regions for genes or gene structures in ODB.

7.5 Loader Activity Logging

The W_EHA_PRODUCT_PROFILE logging settings are singular levels, where each setting affects one level. These logging settings are generally not used by loaders as logging is configured in the loader scripts where the log levels are inherited - for example, if TRACE is on, DEBUG is also on. A single ETL_PROC_WID is used for each loader run. Any log records associated with that particular load can be looked up in the W_EHA_RSLT_LOG table, after the fact, using this ID.

The following named command-line option, present in all loaders, sets log level for that loader. -log_level [TRACE | DEBUG | INFO | WARNING | ERROR] default: INFO When -log_level is set to any other value, except the five listed above (for example, INHERIT), logging levels are inherited from W_EHA_PRODUCT_PROFILE.LOG_% settings. The following named command-line argument logs a summary log record to the database and prints the summary to the console.

-print_summary ['1'=on, '0'=off] default: '0' if on

Table 7-1 Log Events recorded in the W_EHA_RSLT_LOG table

Log Event	Modifiable by User	Log Event Description
START	No, always on	Displays the PLSQL package, operation and parameter list
END	No, always on	Displays the status of PLSQL operation
SUMMARY	Yes	Displays a summary report
FATAL	No, always on	Displays exception messages with error code and error trace
ERROR	No, always on	Displays exception messages with error code and error trace
WARNING	Yes	Displays warning messages
INFO	Yes	Displays informational messages such as sql%rowcounts
DEBUG	Yes	Displays debug messages
TRACE	Yes	Displays the finest grade debug messages

Table 7-2 How to enable log events using batch file parameters

Batch File Parameter Name	Batch File Parameter Value	WARNING Log Events Captured	INFO Log Events Captured	DEBUG Log Events Captured	TRACE Log Events Captured	SUMMARY Log Events Captured
log_level - - - - -	WARNING	Yes	No	No	No	-
	INFO	Yes	Yes	No	No	-
	DEBUG	Yes	Yes	Yes	No	-
	TRACE	Yes	Yes	Yes	Yes	-
	INHERIT	Each log event can be individually turned on or off by setting the corresponding attribute in the W_EHA_PRODUCT table				-
print_summary -	1	-	-	-	-	Yes
print_summary -	0	-	-	-	-	No

7.6 User Feedback for Loader Runs

User feedback for loader runs is provided in the form of summary log records that are printed at the console before the loader run terminates. The same summary is inserted as a log level SUMMARY record in the W_EHA_RSLT_LOG table. A summary feedback can be created for every ETL load by using its ETL_PROC_WID value.

The following function can be called anytime after a loader run to create a SUMMARY record in database, for instance if it was not run time of load:

odb_util.print_loader_summary(etl_proc_wid);

Following is an example of a loader summary output after a successful loader run:

--------------

Loader Summary

--------------

Command Used

ODB_RSLT_DUAL_CHANNEL_UTIL.process_dual_channel(i_data_file => 'unc.edu__AgilentG4502A_07_3__TCGA-A2-A0CX-01A-21R-A00Z-07__gene_expression_analysis_summary.txt', i_data_directory => 'ODB_LOAD', i_species_name => 'Homo sapiens', i_study_name => 'STUDY1', i_datasource_name => 'CDM', i_specimen_number => 'rna001', i_specimen_vendor => 'rnaseq', i_control_specimen => 'Stratagene Univeral Reference', i_user_label => 'ADF_p7', i_reference_version => 'GRCh37.p7', i_file_flg => 'E', i_preprocess_dir => null, i_preprocess_file => null, i_data_file_path => null, i_dbfs_store => null, i_alt_file_loc => null, i_read_size => null)

Properties

Start Date: 2013-06-07 15:36:41

ETL Process ID: 4116

Exadata: False

DB User: ODB25RES

OS User: vkamath

Hostname: NORTH\MUWKS0015

File Name: unc.edu__AgilentG4502A_07_3__TCGA-A2-A0CX-01A-21R-A00Z-07__gene_expression_analysis_summary.txt

File Type Code: 2-Channel Expression

File Type Name: Agilent TCGA 2-channel Expression analysis file

File Type Version: A

End Date: 2013-06-07 15:36:42

Status: SUCCESS

--------------

Insert Summary

--------------

w_eha_rslt_2channel_gxp: 61 rows

W_EHA_STG_2CHANNEL_GXP: 61 rows

w_eha_file: 1 row

w_eha_rslt_file_spec: 1 row

w_eha_file_qlfr: 0 rows

w_eha_rslt_specimen: 0 rows

-------------

Error Summary

-------------

---------------

Warning Summary

---------------

Info Summary

---------------

2013-06-07 15:36:41

Detail: Found specimen number=rna001

7.7 VCF Loader Log

Some variants are skipped from loading:

If EMBL reference data load is not complete, it might result in insufficient information for a chromosome or a chromosome region in the W_EHA_DNA_SOURCE table. In such cases, a reference is not available for a given region, so the variants falling in these regions are not loaded from VCF file.

These variants are shown in the summary report with the following warning message:
```
---------------
Warning Summary
---------------
2013-11-26 10:30:53
Detail: Some variants were skipped, check chromosome and genomics coordinates); [rowcount=16]
```
Some specimen numbers in the VCF file in CDM datasource are not found:

In a VCF file with multiple specimens, if some of the specimens are not available in the CDM tables, a warning message is logged to indicate how many specimens are found in CDM.

In the W_EHA_RSLT_LOG.RECORD_DETAIL column, a warning message stating number of specimens found out of total number of specimens in the file is shown. Following is an example:

returning 1664; [status=warning 2 of 3 were found, elapsed_time=00:00:19]

In the summary report, a warning message stating which specimen is not found is shown. Following is an example:

Detail: Could not find specimen number=HG00097

7.8 Creating Custom Gene Components

The end user can build queries by selecting the Variants or CNVs in regions, defined by Gene Components (the W_EHA_GENE_COMPONENT and associated tables). An example of this is selecting the Exon gene region, which is one of the gene component types provided by Ensembl.

You can add custom gene components to these tables, use custom gene components types and so on. These new gene component types can be made available in the TRC v2.0.1 (or higher) UI. To do this, you must understand the structure and functional use of these tables in the ODB schema.

The heart of the subsystem is the W_EHA_GENE_COMPONENT table, which stores one record per gene component. It has the following important columns:

ROW_WID - a surrogate primary key (PK), filled from the W_EHA_GENE_COMPONENT_S sequence.
STRUCTURE_WID - a foreign key (FK) to the W_EHA_GENE_STRUCTURE table. This is a mandatory column which links the component to its gene. If a gene component is associated with a known transcript with an Ensembl Transcript ID (for example, ENST00000384612), this FK should point to a gene structure with this TRANSCRIPT_ID. Otherwise, it points to a gene structure record for the same gene with TRANSCRIPT_ID = NO STRUCTURE (a catch-all gene structure).
COMPONENT_TYPE - identifies the component as belonging to a type. It is of type VARCHAR2(100) and not normalized, so you must use consistent casing for custom component types. Ensembl currently provides components of the following types - exon, CDS, STS, mRNA, misc_RNA, misc_feature.
PROTEIN_WID - an optional FK to the W_EHA_PROTEIN table. Setting this parameter lets you query by Pathway name, and Genes and Gene Sets.

The second table in the gene component subsystem is W_EHA_GENE_COMP_SEGMENT, which is used to map the component onto a DNA source and through it onto a chromosome. There can be one or more records per gene component in this table. There are the following important columns in this table:

ROW_WID - a surrogate PK, filled from the W_EHA_GENE_COMP_SEGMENT_S sequence.
GENE_COMPONENT_WID - the FK to the W_EHA_GENE_COMPONENT table.
SOURCE_WID - the FK to the W_EHA_DNA_SOURCE table.
NUMBER_IN_SEQUENCE - if there are multiple segments per component, their order is defined in this column.
START_POSITION and END_POSITION - the start and end positions of the segment relative to the W_EHA_DNA_SOURCE record (adding the W_EHA_DNA_SOURCE.START_POSITION value we get the absolute positions on the chromosome).
COMPLEMENT- a 0 or 1 flag. If the value is 1, the segment is a complement to the DNA Source sequence.

Finally, there are the W_EHA_GENE_COMP_XREF and W_EHA_GENE_COMP_QLFR tables, which have the same structure as all other XRef and Qualifier tables. These tables contain optional Database Reference and Qualifiers (such as notes) for the gene components (0 to many per component).

Currently, these tables contain data loaded from Ensembl EMBL files (by the EMBL Reference Loader). However, you can add your own data to the tables after loading the reference data from the Ensembl EMBL and SwissProt files. You can also define customer-specific component types and use them.

You can define a new component type by choosing the value. However, ensure that it does not coincide with any existing or future values from other sources. Since there is no way to know what component types can be added to Ensembl or other sources in the future, Oracle recommends using a customer-specific prefix. For example, to add a "First exon in a gene component" type, and the customer company name is XYZ, use XYZ_first_exon as the new component type. You can use it with the same letter casing for all first exons of genes that we add.

Next, provide the actual components. A superset of the data (exons) is already present in the W_EHA_GENE_COMPONENT table, with their segments already defined, and everything correctly linked to DNA Sources, Gene Structures, and Proteins. An extra set of records is required in W_EHA_GENE_COMPONENT, copied from some of the existing exon records, with the COMPONENT_TYPE = "XYZ_first_exon" instead of "exon", and the new segments for them. Inserting these records requires SQL or PL/SQL code, processing the pre-existing data, and inserting new records. The logic in the code is as follows:

For each W_EHA_GENE_COMPONENT record with COMPONENT_TYPE of exon, determine (by its position relative to the other exons for the same gene, if any) if this is the first exon in its gene.
If it is, insert a new record into W_EHA_GENE_COMPONENT, with the same values, except that the COMPONENT_TYPE is changed to "XYZ_first_exon" and a new ROW_ID obtained from the sequence.
Whenever a new component is inserted, insert copies of associated W_EHA_GENE_COMP_SEGMENT records, replacing their GENE_COMPONENT_WID values with the ROW_ID of the new W_EHA_GENE_COMPONENT record.

You must archive the script used to create these components, as it may be needed for re-creating the data in another database, re-installing the same database, or after loading another version of Ensembl reference files.

The previous example was that of copying the already existing components. For the general case, perform the following steps:

After selecting an existing or new Component Type (the same way as described above), ascertain the following for each gene component to be inserted:

Whether the gene (already existing in the W_EHA_GENE table) is associated with any information that uniquely identifies it: for example, W_EHA_GENE.ROW_WID or Ensemble Gene Id, or Species + HUGO Name.
Whether it is associated with any transcript for this gene (either a W_EHA_GENE_STRUCTURE.ROW_WID or an Ensembl Transcript ID, or not associated).
Whether it is associated with any existing protein record (in the W_EHA_PROTEIN table).
What range or ranges of chromosomal positions it maps to (chromosome name, start and end positions on the chromosome, complement or not). If there is more than one range, multiple segments should be created.
Whether you want to enter any DB Xref and (or) Qualifier entries for each component.

Once this information is gathered, you can create a script (for example, in PLSQL) to insert the new gene components. At the beginning of the script, get a new ETL_PROC_WID from the W_EHA_ETL_PROC_S sequence and use it in each subsequent INSERT statement for all tables. Then, looping over the components, perform the following steps:

Find the ROW_WID of the W_EHA_GENE_STRUCTURE that the new component will be linked to. This record must belong to the gene the component is associated with, and its TRANSCRIPT_ID must match the Ensembl Transcript ID of the new component, or be NO STRUCTURE if there is none. If there is no such record in W_EHA_GENE_STRUCTURE, it should be inserted and its ROW_WID used as the STRUCTURE_WID in step 3 (if it is found, use its ROW_WID as STRUCTURE_WID).
If the new component is linked to a W_EHA_PROTEIN record, find its ROW_WID, and use it as PROTEIN_WID in the next step. Otherwise, use NULL.
Insert a new record for the component into W_EHA_GENE_COMPONENT, getting a new ROW_WID for it from the W_EHA_GENE_COMPONENT_S sequence, and using the other values obtained above, and the chosen value for COMPONENT_TYPE. Store the new ROW_WID.
If there are DB XRef and Qualifier entries for the new gene component, insert them into the respective tables, using the appropriate sequences for the PKs and the new W_EHA_GENE_COMPONENT.ROW_WID as the FK to the W_EHA_GENE_COMPONENT record.
Create the Gene Component Segments. For each chromosomal range of the new component, you must find the W_EHA_DNA_SOURCE record that the range is contained in. You can do this easily as the DNA Sources are mapped to chromosomes except when the component range spans two (or more) DNA Sources. In this case, it has to be broken up into sub-ranges, each mapped to its own DNA Source.

There may be more than one matching DNA Source and you are required to select one of these. The selection is rather arbitrary; a reasonable rule is to select the longest of the matching DNA Sources. Once the DNA Sources for all ranges are selected, use their ROW_WIDs to recalculate the range start and stop positions from the chromosomal to the DNA Source coordinates.

Insert a new record for each range into W_EHA_GENE_COMP_SEGMENT table, using the ROW_WID of the component record inserted into W_EHA_GENE_COMPONENT as GENE_COMPONENT_WID, and getting the new ROW_WID value from the W_EHA_GENE_COMP_SEGMENT_S sequence.

Note:

The START_POSITIONs and END_POSITIONs of all segments and the DNA Sources themselves are 1-based, so when recalculating, use -1's where appropriate.

This outlines the process of adding new gene components and custom gene component types.

There is one extra step needed for each new gene component type to appear in the Query UI. It is necessary to add one or more records to the TRC_LOOKUP_CODE table in the Application schema (not the ODB schema). Enter the component type you added in ODB (for example, XYZ_first_exon) as CODE, TRC_QP_GENE_REGION as CODE_TYPE, the label you want in the UI for this Gene Region (for example, first exon) as CODE_NAME, an optional description as CODE_DESC, and the language or locale code (for example, en_US) as LANGUAGE_CODE. If the installation supports several languages, insert a separate record for each.

Following is information that will be useful when creating custom gene components. When inserting into a table, get the new ROW_WID from the appropriate sequence:

Table 7-3 Tables and Sequences

Table	Sequence to get the ROW_WIDs from
W_EHA_GENE_COMPONENT	W_EHA_GENE_COMPONENT_S
W_EHA_GENE_COMP_SEGMENT	W_EHA_GENE_COMP_SEGMENT_S
W_EHA_GENE_COMP_XREF	W_EHA_GENE_COMP_XREF_S
W_EHA_GENE_COMP_QLFR	W_EHA_GENE_COMP_QLFR_S
W_EHA_GENE_STRUCTURE	W_EHA_GENE_STRUCTURE_S
TRC_LOOKUP_CODE (column: CODE_ID)	S_ROW_ID_SEQ

For example, select W_EHA_GENE_COMPONENT_S.NEXTVAL from DUAL;

ETL_PROC_WID NUMBER(38), is a column in all ODB tables that should be filled using insertion scripts. For every run of a script, a new unique value should be obtained from the sequence W_EHA_ETL_PROC_S, stored in a script variable, and used whenever the script inserts a record or records into any ODB table. During insertion the current date and time should also be filled (using select SYSDATE from DUAL) in W_INSERT_DT DATE column of each table.

DNA Sources and Chromosomes

In ODB, Gene Component Segments are not mapped directly to Chromosomes, but rather to DNA Sources. The DNA Sources are stored in the W_EHA_DNA_SOURCE table and contain the actual genomic DNA sequence, an entire chromosome or about 1,000,000 bases per DNA Source, depending on the Ensembl version.

When inserting a new Gene Component, its Segments must be mapped to one or more DNA sources. For example, if we insert a new Gene Component with a single chromosomal Segment, spanning positions from 1450000 through 1460000 on Chromosome 1. We know that the species PK is 1. We must find the DNA source record, or records, covering this range.

The following SQL finds the W_EHA_DNA_SOURCE record the segment should be assigned to:

select min(ds.row_wid) from w_eha_dna_source ds

where ds.species_wid = 1

and ds.chromosome = '1'

and ds.start_position <= 1450000

and 1460000 <= ds.end_position

This works if the entire segment CHR1:1450000-1460000 fits into a single DNA source range. Instead if it spans across 2 or more DNA Source ranges, the query will not have a result. A more complex query must be used to get the partial DNA Sources. However, this is very unlikely to happen for Ensembl versions 67 or higher, where a DNA Source typically covers an entire chromosome.

Once you get the DNA source, the Segment can be mapped on it, with the following recalculation of the start and end positions:

W_EHA_GENE_COMP_SEGMENT.START_POSITION = 1450000 - W_EHA_DNA_SOURCE.START_POSITION + 1

and

W_EHA_GENE_COMP_SEGMENT.END_POSITION = 1460000 - W_EHA_DNA_SOURCE.START_POSITION + 1

7.8.1 Creating Custom Gene Region Views

The end user can build queries by selecting Variants or CNVs in regions defined in gene region views. An example of such a view is W_EHA_PROMOTER_REGION_V (selectable in the UI as Promoters). Now, TRC also supports custom gene region views, which can be created at customer sites and are specific to these sites.

Oracle recommends using custom gene region views only where calculated positions are involved (and not to use them instead of custom Gene Components). Using custom Gene Components does not significantly increase the size of the query SQL, but using each custom view in a query increases the SQL size by up to 500 characters.

Following are the guidelines for creating custom views:

The custom gene region views must be named following the ODB view naming conventions — the name must start with the W_EHA prefix and end with the _V suffix (indicating that it is a view). Use the standard Promoter Region view as an example - W_EHA_PROMOTER_REGION_V.

A gene region view must have the following columns:

STRUCTURE_WID (NUMBER) - an FK to the W_EHA_GENE_STRUCTURE table. This is a mandatory column, and is very important as it ultimately links the region to its gene. If the region is associated with a known transcript with an Ensemble Transcript ID (for example, ENST00000384612), this FK should point to a gene structure with this TRANSCRIPT_ID. Otherwise it points to a gene structure record for the same gene with TRANSCRIPT_ID = NO STRUCTURE (a catch-all gene structure).

There can be multiple W_EHA_STRUCTURE records for the same gene, if the gene has multiple transcripts, and the catch-all structure record for the gene must be used for all gene regions not associated with transcripts.
PROTEIN_WID (NUMBER) - an optional FK to the W_EHA_PROTEIN table. If this is set, you can query by Pathway name, Genes and Gene Sets.
COMPLEMENT (NUMBER) - a 0 or 1 flag. If the value is 1, the region is a complement versus the DNA Source sequence.
SOURCE_WID (NUMBER) - the FK to the W_EHA_DNA_SOURCE table. All genomic positions in ODB are relative to W_EHA_DNA_SOURCE records, not chromosomes, as DNA sequences are stored in this table.
START_POSITION and END_POSITION (both numbers) - the start and end positions of the region relative to the W_EHA_DNA_SOURCE record (adding the W_EHA_DNA_SOURCE.START_POSITION value we get the absolute positions on the chromosome).
The standard view in the ODB schema, W_EHA_PROMOTER_REGION_V, can be used as an example when creating custom gene region views. It is used in exactly the same way as the custom views are. The view calculates the estimated promoter position for a gene by using a standard offset from the start of the gene's first CDS. Other standard views that can be used as examples, are W_EHA_5P_FLANKING_REGION_V and W_EHA_3P_FLANKING_REGION_V.

To make a new custom view available in the TRC UI, perform the following steps:

A select privilege on the view must be granted to the TRC schema and the appropriate user roles, for example:

GRANT SELECT ON ODB.W_EHA_MY_CUSTOM_REGION_V TO OMICSDATAMARTADMIN;

GRANT SELECT ON ODB.W_EHA_MY_CUSTOM_REGION_V TO OMICSDATAMARTCONTRIBUTOR;

GRANT SELECT ON ODB.W_EHA_MY_CUSTOM_REGION_V TO OMICSDATAMARTUSER;

GRANT SELECT ON ODB.W_EHA_MY_CUSTOM_REGION_V TO TRC;

(run these commands as SYSTEM and customize the view name and the ODB and TRC schema names, if necessary).

A synonym for the view created in the Application (TRC) schema - it should have the same name as the view itself, for example:

CREATE SYNONYM TRC.W_EHA_MY_CUSTOM_REGION_V FOR ODB.W_EHA_MY_CUSTOM_REGION_V;

The view must be registered in the TRC Application's seed data, as follows:

add one or more records for it to the TRC_LOOKUP_CODE table in the Application (TRC) schema (not the ODB schema).

Enter CVIEW_ || <view name> as CODE (for example, if the view is named W_EHA_MY_CUSTOM_REGION_V, enter CVIEW_W_EHA_MY_CUSTOM_REGION_V), TRC_QP_GENE_REGION as CODE_TYPE, the label you want in the UI for this Gene Region (for example, Upstream region) as CODE_NAME, an optional description as CODE_DESC, and the language or locale code (for example, en_US) as LANGUAGE_CODE. If the installation supports several languages, insert a separate record for each.

The CVIEW_ prefix to the Code is necessary for the Query Engine to correctly identify the code as a custom gene region view.

7.8.2 Comparing Genomic Coordinates to Reference

For all result data loaded in the result tables, the given genomic coordinates are chromosome position specific, while the coordinate values (stored in the columns START_POSITION, END_POSITION) in the reference tables, with the exception of W_EHA_DNA_SOURCE, are specific to the source sequence stored in the W_EHA_DNA_SOURCE table. The genomic coordinates of this table are chromosome position specific.

Reference positions are counted from the first base with the positional base inclusive. This results in a single base increment when compared to other genomic coordinate systems. This should be taken into account when matching coordinates from result tables to reference table values.

Following is an example code for discovering overlapping regions of CNV against gene segments in the reference:

SELECT CNV.ROW_WID, GS.GENE_WID FROM W_EHA_RSLT_COPY_NBR_VAR CNV, W_EHA_GENE_SEGMENT GS, W_EHA_DNA_SOURCE DS, W_EHA_CHROMOSOME CH

WHERE CH.ROW_WID = CNV.CHROMOSOME_WID

AND DS.CHROMOSOME = CH.CHROMOSOME

AND GS.SOURCE_WID = DS.ROW_WID

AND (CNV.START_POSITION <= (GS.END_POSITION + DS.START_POSITION -1)

AND (CNV.END_POSITION >= (GS.START_POSITION + DS.START_POSITION -1))

7.9 Additional Step on Exadata versus Non-Exadata

All loaded result files are stored in staging tables on Exadata because direct path loading is required to get the best compression. The script, load_exadata.sh, triggers the SQL scripts required to load the data into result tables, including chromosome partitioned tables (suffixed _CHR). This script accepts a parameter which inquires the degree of parallelization for queries that insert data.

For each staging table, the script locks the staging table and loads into the result table directly. This script is only required on Exadata and must be executed by the ODB schema user. On Exadata, the loaders use a synonym which points to the staging tables and on non-Exadata the synonym points to the actual result table. For better compression run the script after a minimum of 50 sample result data have been loaded into the staging tables.

The load_exadata script takes the following parameters:

Username - to connect to the ODB schema. If using an Oracle Wallet this value is not used.
DBNAME or TNS NAME - WALLET NAME if using a wallet
Number for degree of parallelization - number which varies based on the Exadata machine type - quarter rack, half rack or a full rack. Consult the system administrator to evaluate and monitor resource capabilities to come up with the appropriate number.
1 for running script with Oracle Wallet, or 0 for running without Wallet.

7.10 UNDO Tablespace Auto-extendable Issue

Oracle DB UNDO tablespace should be auto-extendable. For Exadata systems the following error can occur on running Variant loaders:

Description: ORA-30036: unable to extend segment by 8 in undo tablespace 'UNDOTBS8'

Ensure that the undo tablespace is increased or is created as Big File, with maxsize around 120G (with an option to increase its size, if required).

7.11 Chromosome Partitioned Tables

Chromosome partitioned tables have identical columns as the existing result tables. The partitioning strategy uses a primary range or interval partition based on CHROMOSOME_WID. There is a subpartition using a range on the START_POSITION. This range partition uses a subpartition template that declares new partitions for every 1 million bases. This type of subpartition enables database administrators to easily add new partitions to the range, if any subpartition gets too large.

The Exadata version of ODB uses a secondary step to move results from staging tables in order to maximize the hybrid columnar compression used on all result tables. This load exadata task now has SQL statements to move staging data to each of the result tables, the specimen based partitioned table and the chromosome based partitioned tables as well.

On non-Exadata installations, there is no secondary task to move data from staging tables since hybrid columnar compression does not exist on Oracle 11g non-Exadata versions. There are indexes declared based on CHROMSOME_WID and START_POSITION to give some better performance. There are views declared for non-Exadata that let application code, referencing the chromosome based result tables, to work seamlessly on any database platform.

TRC currently only queries the chromosome based partitioned tables for file export. When users specify a genomic range to export, the SQL used to export this data dynamically uses the chromosome based partition tables for VCF (variant) and SEG (copy number variant) export files. For RES (gene expression) and GCT (dual channel gene expression), there are no chromosome based partitioned result tables since all results are always linked to specific genes. Any export that uses a genomic range for RES or GCT is used to find all genes in that range. For non-Exadata, all types of export always finds the list of genes to export based on genomic ranges. This is done primarily because chromosome partitioning only exists for the Exadata platform.