7 Miscellaneous Topics

This chapter contains the following topics:

Product Version and Product Profile including Flanking Offsets
Querying Database Cross-References for Variations
Mitochondrial Chromosome Mappings
Promoter Offset
Loader Activity Logging
User Feedback for Loader Runs
Creating Custom Gene Components
Additional Step on Exadata versus Non-Exadata

7.1 Product Version and Product Profile including Flanking Offsets

Currently, there are two configuration tables in the ODB schema:

W_EHA_PRODUCT_VERSION table stores version numbers used by upgrade and patch installations. This table records each patch or release that is installed. It does not have ETL_PROC_WID or ENTERPRISE_ID. It does not use a loader to populate it, and it is global to all enterprises or tenants that use the schema. Two fields are used to record version information:

RELEASE_VERSION VARCHAR2(200);

PATCH_VERSION VARCHAR2(200);
W_EHA_PRODUCT_PROFILE table stores information specific to each enterprise or tenant - a single record for each ENTERPRISE_ID value, and a unique key to enforce this. It also stores global settings for various logging levels allowed for log records inserted into the log table. It also stores global setting to allow DBMS output. Finally, there is a VCF_FORMAT field intended to store a list of data types that are used in the Format column in the VCF data file.

The columns in this table are:

PROMOTER_OFFSET NUMBER;

FLANKING_OFFSET NUMBER;

LOG_WARNING VARCHAR2(1 CHAR);

LOG_INFO VARCHAR2(1 CHAR);

LOG_DEBUG VARCHAR2(1 CHAR);

LOG_TRACE VARCHAR2(1 CHAR);

LOG_DBMS_OUTPUT VARCHAR2(1 CHAR);

VCF_FORMAT VARCHAR2(4000 CHAR);

FLANKING_OFFSET parameter is an input (right after PROMOTER_OFFSET) on running TRC install scripts. The FLANKING_OFFSET value is used by various loader procedures and defines the size of the region before and after gene definition that sequencing results are to be linked with . These associations are important as a lot of research is focusing on areas before and after gene definitions. The value for FLANKING_OFFSET should be equal or larger than the value used for PROMOTER_OFFSET, so that the query engine can find results linked to a gene that may exist in a promoter region that extends before or after the gene definition.

THE LOG_% fields are flags whose values are either Y or N. By default the WARNING and INFO log levels are turned on. DBMS output is by default turned off. For the most part these settings are not needed or used as logging is also configured in the loader scripts

The VCF_FORMAT column is required to specify data types used in the specification of custom FORMAT columns. For a description of its usage see the section 'Instructions on Custom Format Specification in VCF' under chapter sub-section '4.6.1 Functional Description of VCF Loader'.

7.2 Querying Database Cross-References for Variations

7.2.1 Ensembl db_xref Qualifier Issue

The ENSEMBL GVF file, which involves nucleotide variation references from dbSNP, COSMIC and EMBL, is used to populate the variation tables in Omics Data Bank. The cross-reference information for these variants is identified by the Dbxref qualifier and is loaded onto W_EHA_VARIANT_XREF table. The standard format for Dbxref in GVF file is:

Dbxref=dbSNP_132:rs79772382;

To import this data, the program splits it into DATABASE and REFERENCE_ID using the first colon (:) as delimiter. Hence, for the above example W_EHA_VARIANT_XREF, populate columns with following data:

DATABASE = 'dbSNP_132'

REFERENCE_ID = 'rs79772382'

However, for some organisms, Dbxref is defined differently. Following is an example from Rattus norvegicus GVF file:

Dbxref=ENSEMBL:celera:ENSRNOSNP2610581;

For such cases, W_EHA_VARIANT_XREF columns are populated with the following data:

DATABASE = 'ENSEMBL'

REFERENCE_ID = 'celera:ENSRNOSNP2610581'

Since REFERENCE_ID may contain suffixed or prefixed data for some of the cases mentioned above, when querying the REFERENCE_ID, Oracle recommends using the SQL LIKE operator.

Note:

REFERENCE_SUFFIX column for W_EHA_VARIANT_XREF is not populated with any data in the current model.

The same scenario exists for the SwissProt database cross-reference. Hence, Oracle recommends using the SQL LIKE operator for querying against the W_EHA_PROT_XREF table.

7.2.2 Swissprot db_xref Qualifier Issue

The W_EHA_PROT_XREF table stored the database cross-reference information for SwissProt. This table populates the DATABASE, REFERENCE_ID and REFERENCE_SUFFIX information. The standard format for database cross-reference in SwissProt file is:

DR InterPro; IPR007031; Poxvirus_VLTF3.

To import this data, the program splits it into DATABASE, REFERENCE_ID and REFERENCE_SUFFIX using the semi-colon (;) as delimiter. Hence, for the above example W_EHA_PROT_XREF, populate the columns with following data:

DATABASE = 'InterPro'

REFERENCE_ID = 'IPR007031'

REFERENCE_SUFFIX = 'Poxvirus_VLTF3'

There are certain cross-references in SwissProt file which have the same REFERENCE_ID but a different REFERENCE_SUFFIX. For indexing the REFERENCE_ID, only the distinct first found REFERENCE_ID is stored. The other records retain the same REFERENCE_ID.

For example:

DR EMBL; AL390732; CAH71826.2; JOINED; Genomic_DNA.

DR EMBL; AL390732; CAH73848.1; -; Genomic_DNA.

For the above example the REFERENCE_ID for both the cross-references is AL390732. Hence, for indexing only the first line information is stored in the table.

There is some loss of information on the REFERENCE_SUFFIX level but not on REFERENCE_ID, that is, all REFERENCE_IDs are captured in the W_EHA_PROT_XREF table. Also, all instances of not saving duplicate db_xrefs are now logged by the SwissProt loader as warnings.

7.3 Mitochondrial Chromosome Mappings

The references to mitochondrial chromosome are stored as MT on the reference side (in the W_EHA_VARIANT and W_EHA_HUGO_INFO tables) and on the result side (in the W_EHA_CHROMOSOME table) of the model. Any novel variants reported into the W_EHA_VARIANT table from the result files have the chromosome value converted from M to MT. When inserting into result tables, that is W_EHA_RSLT_COPY_NBR_VAR and W_EHA_RSLT_SEQUENCING, the FK to W_EHA_CHROMOSOME table for chromosome value MT is taken if the result file has chromosome M.

7.4 Promoter Offset

The promoter region information is not available in the reference data set imported from ENSEMBL. Therefore, a column has been provided in W_EHA_SPECIES table for you to specify the promoter region upstream to the gene for a specific organism. The column is named PROMOTER_OFFSET. This Promoter Offset is species-specific.

There is also a default Promoter Offset, stored in the PROMOTER_OFFSET column of the PRODUCT_PROFILE table. This default value is populated during ODB installation, from the value of the -promoter_offset argument.

If the Promoter Offset for a species is null in the W_EHA_SPECIES table, the default value from W_EHA_PRODUCT_PROFILE is used. The EMBL and SwissProt reference loaders, which typically create W_EHA_SPECIES records do not populate the PROMOTER_OFFSET column there. So the only way to define a non-default Promoter Offset for a species, after its record has been created, is to manually edit the value in W_EHA_SPECIES.PROMOTER_OFFSET for it.

ODB provides a special view, W_EHA_PROMOTER_REGION_V, which uses the Promoter Offset described above to allow querying promoter regions for genes/gene structures in ODB.

7.5 Loader Activity Logging

The W_EHA_PRODUCT_PROFILE logging settings are singular levels, where each setting affects 1 level. These logging settings are generally not used by loaders as logging is configured in the loader scripts where the log levels are inherited - for example, if TRACE is on, DEBUG is also on. A single ETL_PROC_WID is used for each run of a loader. Any log records associated with that particular load can be looked up in the W_EHA_RSLT_LOG table after the fact, using this id.

The following named command-line option present in all loaders sets log level for that loader. -log_level ["TRACE" | "DEBUG" | "INFO" | "WARNING" | "ERROR"] default: INFO When -log_level is set to any other value except the five listed above, like "INHERIT", logging levels are inherited from W_EHA_PRODUCT_PROFILE.LOG_% settings. The following named command-line argument,

-print_summary ['1'=on, '0'=off] default: '0' if on, logs a summary log record to the database and prints the summary to the console.

Table 7-1 Log Events recorded in the W_EHA_RSLT_LOG table

Log Event	Modifiable by User	Log Event Description
START	No, always on	Displays the pl/sql package, operation and parameter list
END	No, always on	Displays the status of pl/sql operation
SUMMARY	Yes	Displays a summary report
FATAL	No, always on	Displays exception messages with error code and error trace
ERROR	No, always on	Displays exception messages with error code and error trace
WARNING	Yes	Displays warning messages
INFO	Yes	Displays informational messages such as sql%rowcounts
DEBUG	Yes	Displays debug messages
TRACE	Yes	Displays the finest grade debug messages

Table 7-2 How to enable log events using batch file parameters

Batch File Parameter Name	Batch File Parameter Value	WARNING Log Events Captured	INFO Log Events Captured	DEBUG Log Events Captured	TRACE Log Events Captured	SUMMARY Log Events Captured
log_level - - - - -	WARNING	Yes	No	No	No	-
	INFO	Yes	Yes	No	No	-
	DEBUG	Yes	Yes	Yes	No	-
	TRACE	Yes	Yes	Yes	Yes	-
	INHERIT	Each log event can be individually turned on or off by setting the corresponding attribute in the W_EHA_PRODUCT table				-
print_summary -	1	-	-	-	-	Yes
print_summary -	0	-	-	-	-	No

7.6 User Feedback for Loader Runs

User feedback of loader runs is provided to the user in the form of summary log records that is printed at console before the loader run terminates. The same summary is inserted as a log level 'SUMMARY' record in the W_EHA_RSLT_LOG table. A summary feedback can be created to every ETL load using its ETL_PROC_WID value.

The following function can be called anytime after a loader run to create a SUMMARY record in database, for instance if it was not run time of load:

odb_util.print_loader_summary(etl_proc_wid);

The following is an example of a loader summary output after a successful loader run:

--------------

Loader Summary

--------------

Command Used

ODB_RSLT_DUAL_CHANNEL_UTIL.process_dual_channel(i_data_file => 'unc.edu__AgilentG4502A_07_3__TCGA-A2-A0CX-01A-21R-A00Z-07__gene_expression_analysis_summary.txt', i_data_directory => 'ODB_LOAD', i_species_name => 'Homo sapiens', i_study_name => 'STUDY1', i_datasource_name => 'CDM', i_specimen_number => 'rna001', i_specimen_vendor => 'rnaseq', i_control_specimen => 'Stratagene Univeral Reference', i_user_label => 'ADF_p7', i_reference_version => 'GRCh37.p7', i_file_flg => 'E', i_preprocess_dir => null, i_preprocess_file => null, i_data_file_path => null, i_dbfs_store => null, i_alt_file_loc => null, i_read_size => null)

Properties

Start Date: 2013-06-07 15:36:41

ETL Process ID: 4116

Exadata: False

DB User: ODB25RES

OS User: vkamath

Hostname: NORTH\MUWKS0015

File Name: unc.edu__AgilentG4502A_07_3__TCGA-A2-A0CX-01A-21R-A00Z-07__gene_expression_analysis_summary.txt

File Type Code: 2-Channel Expression

File Type Name: Agilent TCGA 2-channel Expression analysis file

File Type Version: A

End Date: 2013-06-07 15:36:42

Status: SUCCESS

--------------

Insert Summary

--------------

w_eha_rslt_2channel_gxp: 61 rows

W_EHA_STG_2CHANNEL_GXP: 61 rows

w_eha_rslt_file: 1 row

w_eha_rslt_file_spec: 1 row

w_eha_rslt_file_qlfr: 0 rows

w_eha_rslt_specimen: 0 rows

-------------

Error Summary

-------------

---------------

Warning Summary

---------------

Info Summary

---------------

2013-06-07 15:36:41

Detail: Found specimen number=rna001

7.7 Creating Custom Gene Components

The end user can build queries by selecting the Variants or CNVs in regions, defined by Gene Components (the W_EHA_GENE_COMPONENT and associated tables). An example of this is selecting the Exon gene region which is one of the gene component types provided by Ensembl.

You can add custom gene components to these tables, use custom gene components types, and so on, and these new gene component types can be made available in the TRC v2.0.1 UI. To do this, you must understand the structure and functional use of these tables in the ODB schema.

The heart of the subsystem is the W_EHA_GENE_COMPONENT table, which stores one record per gene component. It has the following important columns:

ROW_WID-a surrogate primary key (PK), filled from the W_EHA_GENE_COMPONENT_S sequence.

STRUCTURE_WID-a foreign key (FK) to the W_EHA_GENE_STRUCTURE table. This is a mandatory column which links the component to its gene. If a gene component is associated with a known transcript with an Ensembl Transcript ID (for example, ENST00000384612), this FK should point to a gene structure with this TRANSCRIPT_ID. Else, it points to a gene structure record for the same gene with TRANSCRIPT_ID = NO STRUCTURE (a catch-all gene structure).

COMPONENT_TYPE-identifies the component as belonging to a type. It is of type VARCHAR2(100), and is not normalized, so you must use consistent casing for custom component types. Ensembl currently provides components of the following types- exon, CDS, STS, mRNA, misc_RNA, misc_feature.

PROTEIN_WID- an optional FK to the W_EHA_PROTEIN table. When this is set, you can query by Pathway name, and by Genes and Gene Sets.

The second table in the Gene Component subsystem is W_EHA_GENE_COMP_SEGMENT, which is used to map the component onto a DNA Source and through it onto a Chromosome. There can be one or more records in this table per Gene Component. There are the following important columns in this table:

ROW_WID - a surrogate PK, filled from the W_EHA_GENE_COMP_SEGMENT_S sequence.

GENE_COMPONENT_WID-the FK to the W_EHA_GENE_COMPONENT table.

SOURCE_WID-the FK to the W_EHA_DNA_SOURCE table.

NUMBER_IN_SEQUENCE-if there are more than one segment per component, their order is defined in this column.

START_POSITION and END_POSITION-the start and end positions of the segment relative to the W_EHA_DNA_SOURCE record (adding the W_EHA_DNA_SOURCE.START_POSITION value we get the absolute positions on the chromosome).

COMPLEMENT- a 0 or 1 flag. If the value is 1, the segment is a complement to the DNA Source sequence

Finally, there are the W_EHA_GENE_COMP_XREF and W_EHA_GENE_COMP_QLFR tables, which have the same structure as all other XRef and Qualifier tables. These tables contain optional Database Reference and Qualifiers (such as notes) for the gene components (0 to many per component).

Currently, these tables contain data loaded from Ensembl (filled by the EMBL Reference Loader).

However, you can add your own data to the tables after loading the reference data from the Ensembl EMBL and SwissProt files. You can also define customer-specific component types and use them.

You can define a new component type by choosing the value. However, you must ensure that it does not coincide with any existing or future value from other sources. Since there is no way to know what component types can be added to Ensembl or other sources in the future, Oracle recommends using a customer-specific prefix. For example, if we want to add a First exon in a gene component type, and the customer company name is XYZ, we can use the XYZ_first_exon as the new component type.

We now have the component type we can use, and we can use it with the same letter casing for all first exons of genes that we add.

Next, we provide the actual components. We already have a superset of the data (exons) in the W_EHA_GENE_COMPONENT table, with their segments already defined, and everything correctly linked to DNA Sources, Gene Structures, and Proteins. All we need is an extra set of records in W_EHA_GENE_COMPONENT, copied from some of the existing exon records, with the COMPONENT_TYPE XYZ_first_exon instead of exon, and the new segments for them. Inserting these records requires SQL or PL/SQL code, processing the pre-existing data, and inserting new records. The logic in the code can be as follows:

For each W_EHA_GENE_COMPONENT record with COMPONENT_TYPE of exon determine (by its position relative to the other exons for the same gene, if any) if this is the first exon in its gene.
If it is, insert a new record into W_EHA_GENE_COMPONENT, with the same values, except that the COMPONENT_TYPE is changed to XYZ_first_exon and a new ROW_ID obtained from the sequence.
Whenever a new component is inserted, insert copies of associated W_EHA_GENE_COMP_SEGMENT records, replacing their GENE_COMPONENT_WID values with the ROW_ID of the new W_EHA_GENE_COMPONENT record.

You must archive the script used to create these components, as it may be needed for re-creating the data in another database or when re-installing the same database.

The previous example was a case of copying the already existing components. For a general case, perform the following steps:

After selecting an existing or choosing a new Component Type (the same way as described above), ascertain the following for each gene component to be inserted:

Whether the gene (already existing in the W_EHA_GENE table) is associated with any information that uniquely identifies it: for example, W_EHA_GENE.ROW_WID or Ensemble Gene Id, or Species + HUGO Name.
Whether it is associated with any transcript for this gene (either a W_EHA_GENE_STRUCTURE.ROW_WID or an Ensembl Transcript Id, or not associated).
Whether it is associated with any existing protein record (in the W_EHA_PROTEIN table).
What range or ranges of chromosomal positions it maps to (chromosome name, start and end positions on the chromosome, complement or not). If there is more than one range, more than one segments need to be created.
Whether you want to enter any DB Xref and (or) Qualifier entries for each component.

Once this information is gathered, you can create a script (for example, in PL/SQL) to insert the new gene components. At the beginning of the script, get a new ETL_PROC_WID from the W_EHA_ETL_PROC_S sequence and use it in each subsequent insert statement for all tables. Then, looping over the components, perform the following steps:

Find the ROW_WID of the W_EHA_GENE_STRUCTURE the new component will be linked to. This record must belong to the gene the component is associated with, and its TRANSCRIPT_ID must match the Ensembl Transcript ID of the new component, or be NO STRUCTURE if there is none. If there is no such record in W_EHA_GENE_STRUCTURE, it should be inserted and its ROW_WID used as the STRUCTURE_WID in step 3 (if it is found, use its ROW_WID as STRUCTURE_WID).
If the new component is linked to a W_EHA_PROTEIN record, find its ROW_WID, and use it as PROTEIN_WID in the next step. Otherwise, use NULL.
Insert a new record for the component into W_EHA_GENE_COMPONENT, getting a new ROW_WID for it from the W_EHA_GENE_COMPONENT_S sequence, and using the other values obtained above, and the chosen value for COMPONENT_TYPE. Store the new ROW_WID.
If there are DB XRef and Qualifier entries for the new gene component, insert them into the respective tables, using the appropriate sequences for the PKs and the new W_EHA_GENE_COMPONENT.ROW_WID as the FK to the W_EHA_GENE_COMPONENT record.
5. Next, yo must create the Gene component Segments. For each chromosomal range of the new component, you must find the W_EHA_DNA_SOURCE record that the range is contained in. You can do this easily as the DNA Sources are mapped to chromosomes except the case when the component range spans two (or more) DNA Sources. In this case, it has to be broken up into sub-ranges, each mapped to its own DNA Source.

There is a further complication here - there may be more than one matching DNA Source - and we need to select one of these. The selection is rather arbitrary; a reasonable rule is to select the longest of the matching DNA Sources. Once the DNA Sources for all ranges are selected, use their ROW_WIDs to recalculate the range start and stop positions from the chromosomal to the DNA Source coordinates.

Insert a new record for each range into W_EHA_GENE_COMP_SEGMENT table, using the ROW_WID of the component record inserted into W_EHA_GENE_COMPONENT as GENE_COMPONENT_WID, and getting the new ROW_WID value from the W_EHA_GENE_COMP_SEGMENT_S sequence.

Note:

The START_POSITIONs and END_POSITIONs of all segments and the DNA Sources themselves are 1-based, so when recalculating, use -1's where appropriate.

This outlines the process of adding new gene components and custom gene component types.

There is one extra step needed for each new gene component type to appear in the Query UI. It is necessary to add one or more records to the TRC_LOOKUP_CODE table in the Application schema (not the ODB schema). Enter the component type you added in ODB (for example, XYZ_first_exon) as CODE, TRC_QP_GENE_REGION as CODE_TYPE, the label you want in the UI for this Gene Region (for example, first exon) as CODE_NAME, an optional description as CODE_DESC, and the language or locale code (for example, en_US) as LANGUAGE_CODE. If the installation supports several languages, insert a separate record for each.

Here is some practical information, useful when creating custom gene components.

Tables and sequences (when inserting into a table, get the new ROW_WID from the appropriate sequence):

Table	Sequence to get the ROW_WIDs from
W_EHA_GENE_COMPONENT	W_EHA_GENE_COMPONENT_S
W_EHA_GENE_COMP_SEGMENT	W_EHA_GENE_COMP_SEGMENT_S
W_EHA_GENE_COMP_XREF	W_EHA_GENE_COMP_XREF_S
W_EHA_GENE_COMP_QLFR	W_EHA_GENE_COMP_QLFR_S
W_EHA_GENE_STRUCTURE	W_EHA_GENE_STRUCTURE_S
TRC_LOOKUP_CODE (column: CODE_ID)	S_ROW_ID_SEQ

For example: select W_EHA_GENE_COMPONENT_S.NEXTVAL from DUAL;

ETL_PROC_WID NUMBER(10), is a column in all ODB tables that needs to be filled using insertion scripts. For every run of a script, a new unique value should be obtained from the sequence W_EHA_ETL_PROC_S stored in a script variable, and used whenever the script inserts a record or records into any ODB table. During insertion the current date and time should also be filled using select SYSDATE from DUAL) in W_INSERT_DT DATE column of each table.

DNA Sources and Chromosomes: In ODB, Gene Component Segments are not mapped directly to Chromosomes, but rather to DNA Sources. The DNA Sources are stored in the W_EHA_DNA_SOURCE table and contain the actual genomic DNA sequence, typically about 1,000,000 bases per DNA Source.

When inserting a new Gene Component, its Segments must be mapped to one or more DNA sources. For example, if we insert a new Gene Component with a single chromosomal Segment, spanning positions from 1450000 through 1460000 on Chromosome 1. We know that the species PK is 1. We must find the DNA source record or records, covering this range.

The following SQL finds the W_EHA_DNA_SOURCE record the segment should be assigned to:

select min(ds.row_wid) from w_eha_dna_source ds

where ds.species_wid = 1

and ds.chromosome = '1'

and ds.start_position <= 1450000

and 1460000 <= ds.end_position

This works if the entire segment CHR1:1450000-1460000 fits into a single DNA source range. If it does not, but spans across 2 or more DNA Source ranges, the query will not have a result. A more complex query must be used to get the partial DNA Sources.

If we get our DNA source, the Segment can be mapped on it, with the following recalculation of the start and end positions:

W_EHA_GENE_COMP_SEGMENT.START_POSITION = 1450000 - W_EHA_DNA_SOURCE.START_POSITION + 1

and

W_EHA_GENE_COMP_SEGMENT.END_POSITION = 1460000 - W_EHA_DNA_SOURCE.START_POSITION + 1.

7.7.1 Creating Custom Gene Region Views

The end user can build queries can build queries by selecting Variants or CNVs in regions defined in gene region views. An example of such a view is W_EHA_PROMOTER_REGION_V (selectable in the UI as Promoters). Now, TRC also supports custom gene region views, which can be created at customer sites and are specific to these sites.

Oracle recommends using custom gene region views only where calculated positions are involved (and not to use them instead of custom Gene Components). Using custom Gene components does not significantly increase the size of the query SQL, but using each custom view in a query increases the SQL size by up to 500 characters.

The following are the guidelines for creating custom views:

The custom gene region views must be named following the TRC view naming conventions: the name must start with the W_EHA prefix and end with the _V suffix (indicating that it is a view). Use the standard Promoter Region view as an example: W_EHA_PROMOTER_REGION_V.

A gene region view must have the following columns:

STRUCTURE_WID (NUMBER)-an FK to the W_EHA_GENE_STRUCTURE table. This is a mandatory column, and is very important as it ultimately links the region to its gene. If the region is associated with a known transcript with an Ensemble Transcript ID (for example, ENST00000384612), this FK should point to a gene structure with this TRANSCRIPT_ID. Otherwise it points to a gene structure record for the same gene with TRANSCRIPT_ID = NO STRUCTURE (a catch-all gene structure).

There can be multiple W_EHA_STRUCTURE records for the same gene, if the gene has multiple transcripts, and the catch-all structure record for the gene must be used for all gene regions not associated with transcripts.
PROTEIN_WID (NUMBER)-an optional FK to the W_EHA_PROTEIN table. If this is set, you can query by Pathway name, Genes and Gene Sets.
COMPLEMENT (NUMBER)-a 0 or 1 flag. If the value is 1, the region is a complement versus the DNA Source sequence.
SOURCE_WID (NUMBER)-the FK to the W_EHA_DNA_SOURCE table. All genomic positions in ODB are relative to W_EHA_DNA_SOURCE records, not chromosomes, as DNA sequences are stored in this table.
START_POSITION and END_POSITION (both NUMBERs)-the start and end positions of the region relative to the W_EHA_DNA_SOURCE record (adding the W_EHA_DNA_SOURCE.START_POSITION value we get the absolute positions on the chromosome).
The standard view, W_EHA_PROMOTER_REGION_V, can be used as an example when creating custom gene region views. It is used by the TRC application in exactly the same way as the custom views are. The view calculates the estimated promoter position for a gene by using a standard offset from the start of the gene's first CDS.

To make a new custom view available in the TRC UI, perform the following steps:

A select privilege on the view must be granted to the TRC schema and the appropriate user roles, for example:

GRANT SELECT ON ODB.W_EHA_MY_CUSTOM_REGION_V TO OMICSDATAMARTADMIN;

GRANT SELECT ON ODB.W_EHA_MY_CUSTOM_REGION_V TO OMICSDATAMARTCONTRIBUTOR;

GRANT SELECT ON ODB.W_EHA_MY_CUSTOM_REGION_V TO OMICSDATAMARTUSER;

GRANT SELECT ON ODB.W_EHA_MY_CUSTOM_REGION_V TO TRC;

(run these commands as SYSTEM and customize the view name and the ODB and TRC schema names, if necessary).

A synonym for the view created in the Application (TRC) schema - it should have the same name as the view itself, for example:

CREATE SYNONYM TRC.W_EHA_MY_CUSTOM_REGION_V FOR ODB.W_EHA_MY_CUSTOM_REGION_V;

The view must be registered in the TRC Application's seed data, as follows:

add one or more records for it to the TRC_LOOKUP_CODE table in the Application (TRC) schema (not the ODB schema).

Enter CVIEW_ || <the view name> as CODE (for example, if the view is named W_EHA_MY_CUSTOM_REGION_V, enter CVIEW_W_EHA_MY_CUSTOM_REGION_V), TRC_QP_GENE_REGION as CODE_TYPE, the label you want in the UI for this Gene Region (for example, Upstream region) as CODE_NAME, an optional description as CODE_DESC, and the language or locale code (for example, en_US) as LANGUAGE_CODE. If the installation supports several languages, insert a separate record for each.

The CVIEW_ prefix to the Code is necessary for the Query Engine to correctly identify the code as a custom gene region view.

7.7.2 Comparing Genomic Coordinates to Reference:

For all result data loaded in the result tables, the given genomic coordinates are chromosome position specific, while the coordinate values (stored in the columns START_POSITION, END_POSITION) in the reference tables, with the exception of W_EHA_DNA_SOURCE, are specific to the source sequence stored in the W_EHA_DNA_SOURCE table. The genomic coordinates of this table are chromosome position specific.

Reference positions are counted from the 1st base with the positional base inclusive. This results in a single base increment when compared to other genomic coordinate systems. This should be taken into account when matching coordinates from result tables to reference table values.

Following is an example code for discovering overlapping regions of CNV against gene segments in the reference:

SELECT CNV.ROW_WID, GS.GENE_WID FROM W_EHA_RSLT_COPY_NBR_VAR CNV, W_EHA_GENE_SEGMENT GS, W_EHA_DNA_SOURCE DS, W_EHA_CHROMOSOME CH

WHERE CH.ROW_WID = CNV.CHROMOSOME_WID

AND DS.CHROMOSOME = CH.CHROMOSOME

AND GS.SOURCE_WID = DS.ROW_WID

AND (CNV.START_POSITION <= (GS.END_POSITION + DS.START_POSITION -1)

AND (CNV.END_POSITION >= (GS.START_POSITION + DS.START_POSITION -1))

7.8 Additional Step on Exadata versus Non-Exadata

All loaded result files are stored in staging tables on Exadata because direct path loading is required to get the best compression. The script, load_exadata.sh, triggers the sql-scripts required to load the data into result tables. This script accepts a parameter which asks about the degree of parallelization for the queries that insert data.

For each staging table, the script locks the staging table and loads into the result table directly. This script is only required on Exadata and must be executed by the ODB schema user. On Exadata, the loaders use a synonym which points to the staging tables and on non-Exadata the synonym points to the real result table. For better compression run the script after a minimum of 50 samples results data has been loaded into the staging tables.

The load_exadata script takes the following parameters:

Username: to connect to the database schema. If using an Oracle Wallet this value is not used.
DBNAME or TNS NAME: WALLET NAME if using a wallet
Number for degree of parallelization: number which varies based on the Exadata machine type - quarter rack, half rack or a full rack. Consult the system administrator to evaluate and monitor resource capabilities to come up with the appropriate number.
1 for running script with Oracle Wallet, or 0 for running without Wallet.