Oracle® Health Sciences Omics Data Bank Programmer's Guide
Release 1.0.1

Part Number E27509-02

4 Loaders for Result Data

This chapter includes the following topics:

Prerequisites

Before running the result loaders, ensure that the reference Ensembl files have been loaded using the Java loader.

Note:

The loaded reference must match the reference used for alignment in all other files to be loaded.

After the reference is loaded, perform the following steps to initialize your database:

  1. Create W_EHA_RSLT_STUDY records. A sequence (W_EHA_RSLT_STUDY_S) is associated with this table to allow for as many study records as needed for testing. The result data is now partitioned using the FK to study, so the required number of studies should be added to this table. The result loaders use the value of RESULT_STUDY_NAME to look up the corresponding study primary key. For simple testing you only need one record. (A SQL sketch of steps 1 through 3 appears after this list.)

  2. Configure a W_EHA_DATASOURCE record to identify the CDM schema to be used to validate specimen numbers. Each result record to be loaded must have its specimen present in the W_EHA_SPECIMEN_PATIENT_H table in the CDM schema. Note that the CDM schema needs the v1.01 patch installed, which adds a SPECIMEN_VENDOR_NUMBER field to be used for vendor-specific information. The W_EHA_DATASOURCE record can use a database link if the CDM schema is in another instance.

  3. Ensure that the ODB schema has SELECT privileges on the W_EHA_SPECIMEN_PATIENT_H table in the CDM schema.

  4. All of the specimens required for the example files should be added into the W_EHA_SPECIMEN_PATIENT_H table in the CDM schema.
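
The following is a minimal SQL sketch of these initialization steps, assuming placeholder study and datasource values and a CDM schema named CDM. Apart from the table, sequence, and column names mentioned above, the column lists and schema names below are illustrative and should be adjusted to your installed schema.

-- 1. Create a study record; RESULT_STUDY_NAME is the value the result loaders look up.
INSERT INTO w_eha_rslt_study (row_wid, result_study_name)
VALUES (w_eha_rslt_study_s.NEXTVAL, 'STUDY1');

-- 2. Register the CDM schema as a datasource (column list is illustrative; supply any
--    additional mandatory columns defined in your schema).
INSERT INTO w_eha_datasource (datasource_name)
VALUES ('CDM');

-- 3. From the CDM schema, grant the ODB schema read access to the specimen table.
GRANT SELECT ON cdm.w_eha_specimen_patient_h TO odb;

COMMIT;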

To install the loaders, copy the Results_Loader folder into a directory of your choice. The program should be run from the directory in which it is installed.

Overview of Result Loaders

Four result loaders are provided, one for each file type:

  1. Gene Expression

  2. CGI

  3. MAF

  4. VCF

These four loaders can be run using the .bat file in Windows or the shell scripts in Linux.

The .bat files to be run in Windows are as follows:

Each of the four result loaders will require the following parameters to be passed in the order:

Note:

Oracle Wallet must be set up before the batch files can be run successfully.

The shell scripts to be run in Linux are as follows:

Each of the four result loaders will require the following parameters to be passed in the order:

If Oracle Wallet is set up, the shell script uses those credentials to run sqlldr and sqlplus. If Oracle Wallet is not set up, the script prompts for a password and then connects to sqlldr and sqlplus.

Probe Loader

The probe loader is used to populate the W_EHA_PROBE table. You can use probe_loader.bat to run it in Windows or probe_loader.sh to run it in Linux. The probe loader is more of a reference loader than a result loader, but its content varies by vendor (for example, Affymetrix or Illumina).

Running Probe Loader on Windows

To run probe_loader.bat you need to pass the following parameters in order:

  1. Name and path of the probe data file

  2. Wallet name

The probe loader then calls the control file to load the data into the staging table and then calls the procedure that loads the data into W_EHA_PROBE. The bat file requires Oracle Wallet to be set up before it can run successfully.

The bat file can be run as follows:

C:\>probe_loader.bat <Name and path of probe data file> <Wallet name>

Running Probe Loader on Linux

To run probe_loader.sh you must pass the following parameters in order:

  1. Name and path of the probe data file

  2. Schema/Wallet name - if Oracle Wallet is set up, enter the wallet name; otherwise enter the schema name

  3. 1 if Oracle Wallet is set up; otherwise 0

  4. If the previous parameter is 0, the username of the schema

The shell script calls the control file to populate the data into the staging table and then calls the procedure that loads data into W_EHA_PROBE.

If Oracle Wallet is set up, the shell script uses those credentials to run sqlldr and sqlplus. If Oracle Wallet is not set up, the script prompts for a password and then connects to sqlldr and sqlplus.

If Oracle Wallet is set up, then the shell script can be run as follows :

sh probe_loader.sh <Name and path of the probe data file> <Wallet name> 1

If Oracle Wallet is not set up, the shell script can be run as follows :

sh probe_loader.sh <Name and path of the probe data file> <Schema Name> 0 <Schema username>

Once the loader script has run, check the log file (probe_loader.log) created in the folder from which the bat or sh script was run. This log file indicates the number of records read, the number of records loaded into the staging table, and the number of records discarded.

Assumptions for Data File

Following are the assumptions for the data file for the Probe Loader:

  1. The file is tab separated.

  2. The first row is always the header.

Mappings for Probe Loader

Table 4-1 Mappings for Probe Loader

Data File / W_EHA_PROBE Table

PROBESET

W_EHA_PROBE.PROBE_NAME

ACC

W_EHA_PROBE.ACCESSION

DESCP

W_EHA_PROBE.PROBE_DESC

GENEID

W_EHA_PROBE.PRIMARY_HUGO_NAME
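
As a rough sketch of how these mappings translate into the staging-to-target load, the insert resembles the following; the staging table name W_EHA_PROBE_STG and its column names are assumptions for illustration, not the shipped loader code.

-- Illustrative only: move parsed probe fields from an assumed staging table
-- into W_EHA_PROBE using the mappings in Table 4-1.
INSERT INTO w_eha_probe (probe_name, accession, probe_desc, primary_hugo_name)
SELECT probeset, acc, descp, geneid
FROM   w_eha_probe_stg;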


Gene Expression Loader

The gene expression loader loads the W_EHA_RSLT_FILE and W_EHA_RSLT_GENE_EXP tables. These tables can be loaded by running the gene_expression_loader.bat file in Windows or the gene_expression_loader.sh file in Linux.

Note:

The gene expression loader assumes that the probe loader (see Probe Loader) has already populated the W_EHA_PROBE table with probe names corresponding to genes.

Running Gene Expression Loader on Windows

The batch file requires the following parameters to be passed in the order listed. If any of the parameters contain spaces, enclose them in double quotation marks (""):

  • File name of the gene expression data file

  • Oracle Directory Object

  • Study name

  • Datasource Name

  • Specimen Number

  • Specimen Vendor Number

  • Wallet Name

The W_EHA_RSLT_ERR_LOG table contains error records for any records that were not successfully loaded into the target tables.

Note:

Oracle Wallet must be set up before the batch files can be run successfully.

You can run gene_expression_loader.bat as follows:

C:\>gene_expression_loader.bat <File name of the gene expression data file> <Oracle Directory Object> <Study Name> <Datasource Name> <Specimen Number> <Specimen Vendor Number> <Wallet Name>

Running Gene Expression Loader on Linux

The gene_expression_loader.sh script requires the following parameters to be passed in the order listed. If any of the parameters contain spaces, enclose them in double quotation marks (""):

  • File name of the gene expression data file

  • Oracle Directory Object

  • Study Name

  • Datasource Name

  • Specimen Number

  • Specimen Vendor Number

  • Schema/Wallet Name - If Oracle Wallet is set up, enter the Wallet name; otherwise enter the schema name.

  • 1 if Oracle Wallet is set up; otherwise 0

  • If the previous parameter was 0, the schema username

The W_EHA_RSLT_ERR_LOG table contains error records for any records that were not successfully loaded into the target tables.

If Oracle Wallet is set up, the shell script uses those credentials to run sqlldr and sqlplus. If Oracle Wallet is not set up, the script prompts for a password and then connects to sqlldr and sqlplus.

If the Oracle wallet is set up, you can run the gene_expression_loader.sh script as follows:

gene_expression_loader.sh <File name of the gene expression data file> <Oracle Directory Object> <Study Name> <Data Source Name> <Specimen Number> <Specimen Vendor Number> <Wallet Name> 1

If Oracle Wallet is not set up, you can run the gene_expression_loader.sh script as follows:

gene_expression_loader.sh <File name of the gene expression data file> <Oracle Directory Object> <Study Name> <Data Source Name> <Specimen Number> <Specimen Vendor Number> <Schema Name> 0 <Schema Username>

Assumptions for Data File

Following are the assumptions for the data file for the Gene Expression Loader:

  1. The file is tab separated.

  2. The first row is always the header.

  3. The first column is named DATA.

  4. Each hybridization present in the data file should have three columns in the following order:

    1. Intensity - Header value should be the Hybridization Name

    2. Call

    3. P-Value

  5. The header of the first column for each hybridization should contain only the hybridization name. The values in this column will be the hybridization intensity values. (An example layout is shown after this list.)

  6. The total size of the header in the data file should not be greater than 32000 characters.
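
For illustration, a minimal header and one data row consistent with these assumptions might look like the following. The hybridization names HYB_A and HYB_B, the probe name, and the Call and P-Value header suffixes are hypothetical; the columns below are separated by spaces for readability, but the actual file must use tabs.

DATA        HYB_A    HYB_A_Call    HYB_A_P-VALUE    HYB_B    HYB_B_Call    HYB_B_P-VALUE
1007_s_at   1130.8   P             0.00241          980.2    P             0.00318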

Mappings for Gene Expression Loader

Table 4-2 Mappings for Gene expression Loader

Data File / W_EHA_RSLT_GENE_EXP

DATA

The value is looked up in W_EHA_PROBE, and the corresponding ROW_WID is populated in W_EHA_RSLT_GENE_EXP.PROBE_WID.

HYBRIDIZATION - Header

W_EHA_RSLT_GENE_EXP.HYBRIDIZATION_NAME

HYBRIDIZATION - Data Values

W_EHA_RSLT_GENE_EXP.INTENSITY

HYBRIDIZATION_Call

W_EHA_RSLT_GENE_EXP.CALL

HYBRIDIZATION_P-VALUE

W_EHA_RSLT_GENE_EXP.P_VALUE
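
A minimal sketch of the probe lookup implied by this mapping is shown below; the staging table W_EHA_GENE_EXP_STG and its columns are assumed names for illustration only, while the W_EHA_PROBE and W_EHA_RSLT_GENE_EXP columns come from the tables above.

-- Illustrative only: resolve each DATA value (probe name) to its W_EHA_PROBE.ROW_WID
-- and load one result row per hybridization value set.
INSERT INTO w_eha_rslt_gene_exp (probe_wid, hybridization_name, intensity, call, p_value)
SELECT p.row_wid, s.hybridization_name, s.intensity, s.call, s.p_value
FROM   w_eha_gene_exp_stg s
JOIN   w_eha_probe p
  ON   p.probe_name = s.probe_name;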


CGI Sequence Data Loader

The CGI 2.0 file format is described here:

ftp://ftp2.completegenomics.com/

Each section in a CGI file is self-contained and separate. There are three types of sections:

The main challenge in loading a CGI file is to parse the #SAMPLE information from the comments section and then map it to the rest of the data. This sample information is used to retrieve the Specimen_Id from the data source specified when executing the batch file.

Files to Load

The execution call of the stored procedure odb_result_util.process_cgi_stage() is defined in the script file load_cgi.sql. This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, SPECIES NAME, STUDY NAME, DATA SOURCE, and SPECIMEN VENDOR as input parameters.

Note:

This stored procedure will create an external table dynamically and upload data from the source file to it. External tables allow Oracle to query data that is stored outside the database in flat files. The ORACLE_LOADER driver can be used to access any data stored in any format that can be loaded by the SQL*Loader. No DML can be performed on the external tables but they can be used for query, join and sort operations.

Two external tables will be created dynamically. The first table will store the specimen number, which is parsed from the comments section of the file; it will also store the #SAMPLE tag and the related specimen number. The second external table will store the complete result data. The odb_util.get_specimen_wid() function retrieves the specimen_id for the corresponding specimen number and inserts a record into the W_EHA_RSLT_SPECIMEN table.
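
For background, an ORACLE_LOADER external table of the kind described in the note above looks roughly like the sketch below; the table name, column list, and file name are placeholders, not the loader's actual generated definition, and only the ODB_LOAD directory object is taken from the examples in this chapter.

-- Illustrative sketch of a dynamically created external table over a tab-separated
-- CGI result file placed in the ODB_LOAD directory object.
CREATE TABLE cgi_data_ext (
  chromosome  VARCHAR2(30),
  begin_pos   NUMBER,
  end_pos     NUMBER,
  vartype     VARCHAR2(30),
  reference   VARCHAR2(4000),
  allele1seq  VARCHAR2(4000),
  allele2seq  VARCHAR2(4000)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY ODB_LOAD
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY X'09'          -- tab separated
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('masterVar.tsv')
)
REJECT LIMIT UNLIMITED;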

Two multi-table insert statements will be written dynamically. One will insert records into the W_EHA_VARIANT_STG, W_EHA_RSLT_NOCALL, and W_EHA_RSLT_NOCALL_X tables, and the other will insert records into the W_EHA_RSLT_SEQUENCING and W_EHA_RSLT_SEQUENCING_X tables. After inserting the records into the W_EHA_VARIANT_STG table, the PROCESS_VARIANT() procedure is called, which populates the W_EHA_VARIANT table.
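
A simplified multi-table insert of this kind follows the INSERT ALL pattern sketched below. It reuses the illustrative external table from the previous sketch, uses a drastically reduced column list, and omits the chromosome, specimen, and overlap processing performed by the real loader.

-- Illustrative only: route no-call alleles to the no-call result table and other
-- non-reference alleles to the variant staging table in a single pass.
INSERT ALL
  WHEN vartype = 'no-call' THEN
    INTO w_eha_rslt_nocall (start_position, end_position, nocall_type)
    VALUES (start_pos, end_pos, vartype)
  WHEN vartype NOT IN ('no-call', 'ref') THEN
    INTO w_eha_variant_stg (start_position, end_position, replace_tag)
    VALUES (start_pos, end_pos, replace_tag)
SELECT e.vartype,
       e.begin_pos + 1                    AS start_pos,   -- CGI positions are zero based
       e.end_pos                          AS end_pos,
       e.reference || '/' || e.allele1seq AS replace_tag
FROM   cgi_data_ext e;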

Data Load

Any allele that is marked as "ref" indicates that the section of the genome matches the reference sequence; this data is not saved to the database. Any allele that does not match the reference generates two records in the ODB. Specimen identifiers are stored in the CGI data file header. Several values are used to get the correct primary key value of the corresponding SPECIMEN record, which is external to the ODB schema. The SPECIMEN identifier is the first field used to look up the external database. This will most likely be a barcode or some other natural key. This value is not necessarily unique in the other database (especially coming from HDM), where different vendors can have different barcode systems that may overlap, so a specimen vendor number is also used to look up the correct SPECIMEN record. The last value needed is used to specify the source when multiple sources provide SPECIMEN records. Each result table stores a FK to the correct datasource for the SPECIMEN as well as the primary key value for the SPECIMEN record. The loading code must use all three fields to find the correct values for the result records.

The alleles identified as "no-call" will create a W_EHA_RSLT_NOCALL and W_EHA_RSLT_NOCALL_X record. All other non-ref alleles will create a W_EHA_RSLT_SEQUENCING and W_EHA_RSLT_SEQUENCING_X record.

The other columns are mapped as shown below. If the two alleles differ and both are non-reference, a total of four records is created for such a row of the master variant file.
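
As an illustrative example (values invented for clarity): a heterozygous row with reference A and alleles G and T produces two sequencing records, with replace tags A/G and A/T; an insertion of CT relative to the reference is written as -/CT and a deletion of CT as CT/-, following the notation described in Table 4-3.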

Note:

The batch file requires Oracle Wallet to be set up to run correctly.

The batch file requires the following parameters to be passed in the order listed. If any of the parameters contain spaces, enclose them in double quotation marks (""):

  • File name of the CGI data file

  • Oracle Directory Object

  • Study name

  • Datasource Name

  • Species Name

  • Specimen Vendor Number

  • Wallet Name

The following command is used to execute the batch file to upload the CGI data.

d:\> CGI_loader.bat <CGI file Name with full path> <Oracle Directory Object> <Species Name> <Study Name> <Data Source Name> <Specimen Vendor Number> <Wallet Name>

For example,

d:\>cgi_loader.bat "masterVarBeta-NA19240-L2-200-37-ASM small.tsv" "ODB_LOAD" "'Homo sapiens'" "STUDY1" "CDM" "vendor2" "db004_w"

The shell script requires the following parameters to be passed in the order listed. If any of the parameters contain spaces, enclose them in double quotation marks (""):

  • File name of the CGI data file

  • Oracle Directory Object

  • Study name

  • Datasource Name

  • Species Name

  • Specimen Vendor Number

  • Schema/Wallet Name - If Oracle Wallet is set up, enter the Wallet name; otherwise enter the schema name

  • 1 if Oracle Wallet is set up; otherwise 0

  • If previous parameter was 0 then enter schema username

The shell script can be run with or without Oracle Wallet being set up. The following command will be used to execute the shell script to upload CGI Data when Oracle Wallet is set up.

sh CGI_loader.sh <CGI file Name> <Oracle Directory Object> <Species Name> <Study Name> <Data Source Name> <Specimen Vendor Number> <Wallet Name> 1

For example,

sh cgi_loader.sh "masterVarBeta-NA19240-L2-200-37-ASM small.tsv" "ODB_LOAD" "'Homo sapiens'" "STUDY1" "CDM" "vendor2" "db004_w" 1

The following command will be used to execute the shell script to upload the CGI Data when Oracle Wallet is not set up.

sh CGI_loader.sh <CGI file Name> <Oracle Directory Object> <Species Name> <Study Name> <Data Source Name> <Specimen Vendor Number> <Schema Name> 0 <Schema Username>

For example,

sh CGI_loader.sh "masterVarBeta-NA19240-L2-200-37-ASM small.tsv"ODB_LOAD" "'Homo sapiens'" STUDY1" "CDM" "vendor2","db004",0,gdm

Note:

In Linux, it is not necessary to enclose 'Homo sapiens' in double quotes. That requirement applies only to Windows.

Table 4-3 Mapping of CGI Result File

Column Name / Table and Column Name in ODB / Description

chromosome

W_EHA_RSLT_NOCALL.CHROMOSOME_WID

W_EHA_RSLT_SEQUENCING.CHROMOSOME_WID

W_EHA_VARIANT.CHROMOSOME

For no-call results this is stored directly in the CHROMOSOME field. For non-reference alleles, this field is used with the begin position to find the correct DNA_SOURCE record and then to find or create a VARIANT record. Three values are needed to find existing VARIANT records: the chromosome, the begin position, and the replace tag, which is a notation combining the reference and allele sequences. For NOVEL variants, a new record is created in the W_EHA_VARIANT table with the chromosome value.

begin

W_EHA_RSLT_NOCALL.START_POSITION

W_EHA_RSLT_SEQUENCING.START_POSITION

W_EHA_VARIANT_STG.START_POSITION

W_EHA_VARIANT.ABSOLUTE_POSITION

One must be added to the value of this field, since CGI uses zero-based offsets and all other references use one-based offsets. This field is used as described above. For no-call results this is stored in the START_POSITION field (after adding 1).

For NOVEL variants, 'begin' is stored as is in the ABSOLUTE_POSITION column of the W_EHA_VARIANT table, while W_EHA_VARIANT.START_POSITION is calculated relative to W_EHA_DNA_SOURCE.START_POSITION using the 'begin' value.

end

W_EHA_RSLT_NOCALL.END_POSITION

W_EHA_VARIANT_STG.END_POSITION

This value is used for no-call results and stored in the END_POSITION field. It is also used to calculate the relative end position (W_EHA_VARIANT.END_POSITION) based on W_EHA_DNA_SOURCE.START_POSITION for novel variants.

zygosity

W_EHA_RSLT_SEQUENCING.ZYGOSITY

Stored for sequencing alleles in ZYGOSITY field.

varType

W_EHA_RSLT_SEQUENCING.VARIANT_TYPE

W_EHA_RSLT_NOCALL.NOCALL_TYPE

Stored either in NOCALL_TYPE or VARIANT_TYPE.

reference

W_EHA_VARIANT.REPLACE_TAG

This value is used in conjunction with allele1Seq and allele2Seq to construct a replace tag value used to find existing VARIANT records in the W_EHA_VARIANT table. For CGI, two overlap values are computed, one for each allele of each row of the result file. Using the overlap value, the reference sequence and the allele sequence are shortened and the start and end positions are incremented. The overlap value is used to create a replace tag with the shortened reference and allele sequences. For insertions, the reference sequence uses a "-" and for deletions the allele sequence uses "-". This is standard notation used in most references.

For some in-dels the representation can be as follows: an insertion can be "AT/ATCTA" and a deletion can be "ATCTA/AT". The logic for checking and inserting variants is in the called procedure, which handles the varying representations of variants coming from any of the sequencing file types.

This field in some cases is empty for insertions.

allele1Seq

W_EHA_VARIANT.REPLACE_TAG

W_EHA_RSLT_SEQUENCING_X.ALLELE / W_EHA_RSLT_SEQUENCING_X.ALLELE_CLOB

W_EHA_RSLT_NOCALL_X.ALLELE / W_EHA_RSLT_NOCALL_X.ALLELE_CLOB

For sequencing results, this value is used to construct the replace tag. In some cases, this field is empty for deletions.

allele2Seq

W_EHA_VARIANT.REPLACE_TAG

W_EHA_RSLT_SEQUENCING_X.ALLELE / W_EHA_RSLT_SEQUENCING_X.ALLELE_CLOB

W_EHA_RSLT_NOCALL_X.ALLELE / W_EHA_RSLT_NOCALL_X.ALLELE_CLOB

For sequencing results, this value is used to construct the replace tag. In some cases, this field is empty for deletions.

allele1VarScoreVAF

W_EHA_RSLT_SEQUENCING.SCORE_VAF

Used for sequencing results and stored in the SCORE_VAF field. If the variant is homozygous, only a single allele record is stored; in that case the lower of the two scores is stored.

Allele2VarScoreVAF

W_EHA_RSLT_SEQUENCING.SCORE_VAF

Used for sequencing results and stored in the SCORE_VAF field. If the variant is homozygous, only a single allele record is stored; in that case the lower of the two scores is stored.

allele1VarScoreEAF

W_EHA_RSLT_SEQUENCING.SCORE_EAF

Used for sequencing results and stored in the SCORE_EAF field. If the variant is homozygous, only a single allele record is stored; in that case the lower of the two scores is stored.

Allele2VarScoreEAF

W_EHA_RSLT_SEQUENCING.SCORE_EAF

Used for sequencing results and stored in the SCORE_EAF field. If the variant is homozygous, only a single allele record is stored; in that case the lower of the two scores is stored.

allele1HapLink

W_EHA_RSLT_SEQUENCING_X.HAPLINK

W_EHA_RSLT_NOCALL_X.HAPLINK

Used for both no-call and sequencing results and stored in the HAPLINK field.

allele2HapLink

W_EHA_RSLT_SEQUENCING_X.HAPLINK

W_EHA_RSLT_NOCALL_X.HAPLINK

Used for both no-call and sequencing results and stored in the HAPLINK field.

allele1ReadCount

W_EHA_RSLT_SEQUENCING.ALLELE_READ_COUNT

Used for sequencing results and stored in the ALLELE_READ_COUNT field.

Allele2ReadCount

W_EHA_RSLT_SEQUENCING.ALLELE_READ_COUNT

Used for sequencing results and stored in the ALLELE_READ_COUNT field.

referenceAlleleReadCount

W_EHA_RSLT_SEQUENCING.REFERENCE_READ_COUNT

Used for sequencing results and stored in the REFERENCE_READ_COUNT field.


MAF Sequence Data Loader

Mutation Annotation Format (MAF) is created by TCGA. MAF files store variation data for multiple samples. The MAF file format is described here:

http://tcga-data.nci.nih.gov/tcga/dataAccessMatrix.htm

A MAF file consists of tab-delimited columns. This file is the simplest among all result files. ODB supports import of MAF versions 2.0 through 2.2.

Files to Load

The execution call of the stored procedure odb_result_util.process_maf() is defined in the script file load_maf.sql. This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, SPECIES NAME, STUDY NAME, DATA SOURCE, and SPECIMEN VENDOR as input parameters.

This stored procedure will create an external table dynamically and upload data from the source file into it. External tables allow Oracle to query data that is stored outside the database in flat files. The ORACLE_LOADER driver can be used to access data stored in any format that can be loaded by the SQL*Loader. No DML can be performed on external tables but they can be used for query, join and sort operations.

Only one external table is created dynamically and it holds the complete result data. A global temporary table named W_EHA_MAF_SPECIMEN is created explicitly to store the normal and tumor sample barcodes. There are two passes through the MAF file. The first creates W_EHA_VARIANT_STG records and collects each unique specimen number into the global temporary table. The bulk collect then calls odb_util.get_specimen_wid for all of the specimen numbers in one statement.

Another bulk insert statement then inserts the data into W_EHA_RSLT_SEQUENCING and W_EHA_RSLT_SEQUENCING_X. After inserting the records into the W_EHA_VARIANT_STG table, the PROCESS_VARIANT() procedure is called, which populates the W_EHA_VARIANT table.
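
A minimal sketch of the first pass over the specimen barcodes is shown below; the external table name MAF_DATA_EXT and the column list of the W_EHA_MAF_SPECIMEN global temporary table are assumptions for illustration, and the actual call signature of odb_util.get_specimen_wid is internal to the loader.

-- Illustrative only: collect the distinct tumor and normal barcodes into the
-- global temporary table so they can be resolved to specimen IDs in bulk.
INSERT INTO w_eha_maf_specimen (specimen_number)
SELECT tumor_sample_barcode FROM maf_data_ext
UNION
SELECT matched_norm_sample_barcode FROM maf_data_ext;
-- odb_util.get_specimen_wid is then applied to each collected barcode.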

Data Load

Each row of the MAF file contains allele information for two sample types: a tumor sample and a normal sample. The two sample IDs are specified in each row of the file: the tumor sample ID in the "Tumor_Sample_Barcode" column and the normal sample ID in the "Matched_Norm_Sample_Barcode" column. So, for a single row, a maximum of eight records can be created in ODB, depending on heterozygosity and resemblance to the reference sequence: four for the tumor sample and four for the normal sample.

For allele sequences, "-" for a deletion represents a variant, and "-" for an insertion represents a wild-type allele. If an allele sequence is the same as the Reference_Allele sequence, that allele information is not stored in the data bank. There is no information on NOCALL in MAF data, hence all the data goes to the W_EHA_VARIANT, W_EHA_RSLT_SEQUENCING, and W_EHA_RSLT_SEQUENCING_X tables.

Note:

The batch file requires Oracle Wallet to be set up to run correctly.

The batch file requires the following parameters to be passed in the order listed. If any of the parameters contain spaces, enclose them in double quotation marks (""):

  • File name of the MAF data file

  • Oracle Directory Object

  • Study name

  • Datasource Name

  • Species Name

  • Specimen Vendor Number

  • Wallet Name

The following command is used to execute the batch file to upload the MAF data.

C:\> MAF_loader.bat <MAF file Name> <Oracle Directory Object> <Species Name> <Study Name> <Data Source Name> <Specimen Vendor Number> <Wallet Name>

For example,

d:\>maf_loader.bat "hgsc.bcm.edu__Applied_Biosystems_Sequence_data_level3.maf" "ODB_LOAD" "'Homo sapiens'" "STUDY1" "CDM" "vendor3" "db004_w"

The shell script requires the following parameters to be passed in the order listed. If any of the parameters contain spaces, enclose them in double quotation marks (""):

  • File name of the MAF data file

  • Oracle Directory Object

  • Study name

  • Datasource Name

  • Species Name

  • Specimen Vendor Number

  • Schema/Wallet Name - If Oracle Wallet is set up, enter the Wallet name; otherwise enter the schema name

  • 1 if Oracle Wallet is set up; otherwise 0

  • If previous parameter was 0 then enter schema username

The shell script can be run with or without Oracle Wallet being set up. The following command will be used to execute the shell script to upload the MAF Data when Oracle Wallet is set up.

sh maf_loader.sh <MAF file Name> <Oracle Directory Object> <Species Name> <Study Name> <Data Source Name> <Specimen Vendor Number> <Wallet Name> 1

For example,

sh maf_loader.sh "hgsc.bcm.edu__Applied_Biosystems_Sequence_data_level3.maf" "ODB_LOAD" "'Homo sapiens'" STUDY1" "CDM" "vendor2","db004",1

The following command will be used to execute the shell script to upload the MAF Data when Oracle Wallet is not set up.

sh maf_loader.sh <MAF file Name> <Oracle Directory Object> <Species Name> <Study Name> <Data Source Name> <Specimen Vendor Number> <Schema Name> 0 <Schema Username>

For example,

sh maf_loader.sh "hgsc.bcm.edu__Applied_Biosystems_Sequence_data_level3.maf" "ODB_LOAD" "'Homo sapiens'" STUDY1" "CDM" "vendor2","db004",0,gdm

Table 4-4 Mapping of MAF Result File

Column Name in Result File / Table and Column Name in ODB / Description

Chromosome

W_EHA_RSLT_SEQUENCING.CHROMOSOME_WID

W_EHA_VARIANT.CHROMOSOME

This field is used with the begin position to find the correct DNA_SOURCE record and then to find or create a VARIANT record. Note that three values are needed to find existing VARIANT records: the chromosome, the begin position, and the replace tag, which is a notation combining the reference and allele sequences. For NOVEL variants, a new record is created in the W_EHA_VARIANT table with the chromosome value.

Start_Position

W_EHA_RSLT_SEQUENCING.START_POSITION

W_EHA_VARIANT_STG.START_POSITION

W_EHA_VARIANT.ABSOLUTE_POSITION

This field is used as described above. For no-call results this is stored in the START_POSITION field (after adding 1). For NOVEL variants, 'Start_Position' is stored as is in the ABSOLUTE_POSITION column of the W_EHA_VARIANT table, while W_EHA_VARIANT.START_POSITION is calculated relative to W_EHA_DNA_SOURCE.START_POSITION using the 'Start_Position' value.

End_Position

W_EHA_VARIANT_STG.END_POSITION

This value is used for no-call results and stored in the END_POSITION field. It is also used to calculate the relative end position (W_EHA_VARIANT.END_POSITION) based on W_EHA_DNA_SOURCE.START_POSITION for novel variants.

Strand

W_EHA_VARIANT.STRAND

W_EHA_VARIANT_STG.STRAND

This value is used to indicate forward or reverse strand.

Variant_Type

W_EHA_RSLT_SEQUENCING.VARIANT_TYPE

Type of variant including snp, insertion, or deletion. Stored in VARIANT_TYPE in W_EHA_RSLT_SEQUENCING table.

Reference_Allele

W_EHA_VARIANT.REPLACE_TAG

This value is used for REPLACE_TAG. REPLACE_TAG is used twice, once for the tumor sample and once for the normal sample. This value is used in conjunction with Tumor_Seq_Allele1 and Tumor_Seq_Allele2 to find existing VARIANT records for the tumor sample; similarly, for the normal sample the reference allele is used for REPLACE_TAG. Note that for insertions the reference sequence uses a "-" and for deletions the allele sequence uses "-". This is standard notation used in most references. The loader logic is implemented in the procedure called to populate the variant table.

For a deletion this value contains the deleted sequence and for an insertion it contains "-".

Tumor_Seq_Allele1

W_EHA_RSLT_SEQUENCING_X.ALLELE/ W_EHA_RSLT_SEQUENCING_X.ALLELE_CLOB

W_EHA_VARIANT.REPLACE_TAG

For sequencing results this value is used to construct the replace tag. '-' value represents a deletion.

Tumor_Seq_Allele2

W_EHA_RSLT_SEQUENCING_X.ALLELE / W_EHA_RSLT_SEQUENCING_X.ALLELE_CLOB

W_EHA_VARIANT.REPLACE_TAG

For sequencing results this value is used to construct the replace tag. '-' value represents a deletion.

Tumor_Sample_Barcode

W_EHA_RSLT_SPECIMEN.SPECIMEN_NUMBER

This value represents the tumor sample ID. This barcode ID has the form TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID.

Matched_Norm_Sample_Barcode

W_EHA_RSLT_SEQUENCING.RESULT_SPEC_WID

W_EHA_RSLT_SEQUENCING.SPECIMEN_WID

This value represents the normal sample ID. This barcode ID has the form TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID. The complete barcode ID is used as the foreign key to the RSLT_SPECIMEN record.

Match_Norm_Seq_Allele1

W_EHA_RSLT_SEQUENCING_X.ALLELE / W_EHA_RSLT_SEQUENCING_X.ALLELE_CLOB

W_EHA_VARIANT.REPLACE_TAG

For sequencing results this value is used to construct the replace tag. A '-' value represents a deletion.

Match_Norm_Seq_Allele2

W_EHA_RSLT_SEQUENCING_X.ALLELE / W_EHA_RSLT_SEQUENCING_X.ALLELE_CLOB

W_EHA_RSLT_SEQUENCING.ALLELE

W_EHA_VARIANT.REPLACE_TAG

For sequencing results this value is used to construct the replace tag. A '-' value represents a deletion.

Score

W_EHA_RSLT_SEQUENCING.SCORE_VAF

This is mapped to W_EHA_RSLT_SEQUENCING.SCORE_VAF.


VCF Sequence Data Loader

The VCF file format is described here:

http://tcga-data.nci.nih.gov/tcga/dataAccessMatrix.htm

The mapping currently follows the 1000 Genomes genotype data and supports VCF v4.1. Two kinds of VCF files are available at 1000 Genomes: sites and genotypes. The sites file does not contain genotype data and sample details, whereas the genotype VCF file contains individual genotype data along with the sample information. Hence the current loader supports genotype VCF files from 1000 Genomes. The current model supports only SNP information and does not support Structural Variation, whose representation differs from SNP in the VCF file.

Files to Load

The execution call of the stored procedure odb_result_util.process_vcf() is defined in the script file load_vcf.sql. This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, SPECIES NAME, STUDY NAME, DATA SOURCE, and SPECIMEN VENDOR as input parameters.

This stored procedure will create an external table dynamically and upload data from the source file into it. External tables allow Oracle to query data that is stored outside the database in flat files. The ORACLE_LOADER driver can be used to access data stored in any format that can be loaded by the SQL*Loader. No DML can be performed on external tables but they can be used for query, join and sort operations. The loader will support 6000 specimens per VCF file.

The stored procedure will dynamically create seven external tables, where each table stores the common columns, such as chromosome and ref, along with 990 specimens. The first external table will store the common attributes and the first 990 specimens, the second external table will store the common attributes and the second 990 specimens, and so on. Dynamic SQL is used first to parse the header row in the data file and store it in an external table named VCF_SPEC_!!SEQ!!, which contains eight header columns. These columns are then appended and parsed to obtain the specimen identifiers. A second external table named VCF_DATA_!!SEQ!! stores the complete result data and maps all the fields existing in the result file.

Note:

In both external tables, the "!!SEQ!!" string is replaced by the ETL_PROC_ID at run time.
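
The token substitution works roughly as sketched below; the column list, file name, and the single-column table definition are placeholders, and the shipped loader builds a far longer statement.

-- Illustrative only: build the DDL with the !!SEQ!! token, substitute the
-- ETL process identifier, and execute the statement dynamically.
DECLARE
  v_etl_proc_id NUMBER := 101;  -- illustrative value
  v_sql         VARCHAR2(32767);
BEGIN
  v_sql := 'CREATE TABLE VCF_SPEC_!!SEQ!! (header_line VARCHAR2(4000)) '
        || 'ORGANIZATION EXTERNAL (TYPE ORACLE_LOADER '
        || 'DEFAULT DIRECTORY ODB_LOAD '
        || 'ACCESS PARAMETERS (RECORDS DELIMITED BY NEWLINE) '
        || 'LOCATION (''genotypes.vcf''))';
  EXECUTE IMMEDIATE REPLACE(v_sql, '!!SEQ!!', TO_CHAR(v_etl_proc_id));
END;
/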

Two multi-table insert statements will be written dynamically. One will insert records into W_EHA_VARIANT_STG and the other will insert records into the W_EHA_RSLT_SEQUENCING and W_EHA_RSLT_SEQUENCING_X tables.

The select statement that inserts data into W_EHA_VARIANT_STG computes the overlap value by comparing the reference with the allele sequence. Using that overlap value, the reference sequence and the allele sequence are shortened and the start and end positions are incremented. The overlap value is also used to create a replace tag with the shortened reference and allele sequences.

After inserting the record into W_EHA_VARIANT_STG table, a PROCESS_VARIANT() procedure will be called which will populate the W_EHA_VARIANT table.

Dynamic SQL is also used to parse the data columns from the data file into an external table (VCF_DATA_!!SEQ!!) and to populate the data from the external table into the W_EHA_RSLT_SEQUENCING and W_EHA_RSLT_SEQUENCING_X tables using multi-table insert statements.

The procedure can load any number of samples, processing them in batches of 30. Dynamic SQL is used to create and load the data into an external table and to populate the W_EHA_RSLT_SEQUENCING and W_EHA_RSLT_SEQUENCING_X tables.

Data Load

Two kinds of VCF files are available at 1000 Genomes, namely sites and genotypes. The sites file does not contain genotypic data and sample details, whereas the genotype VCF file contains individual genotypic data along with sample information. Therefore, the current loader supports genotype VCF files from 1000 Genomes. The current model supports only SNP information, including InDels, and does not support Structural Variation, whose representation differs from SNP in the VCF file. If you have a merged file containing SNP, INDEL, and SV data, the SV data should be removed from the file before loading results.

The sample information is present in the header row of the VCF data, in the columns following the "FORMAT" column. Each such column represents one sample.

The data type representation format and its order for each sample are specified in the "FORMAT" column. All the alleles for all samples are stored in the ALT column, but to get the allele information for each sample the GT identifier from the FORMAT column for that sample is used. The allele value is represented in numerals (for example, 0/1, 1/2), where '0' represents the reference allele, and '1' and '2' represent the alleles in the order in which they are specified in ALT.
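
For example (illustrative values): if REF is A and ALT is G,T for a row, then a sample whose GT is 0|1 carries alleles A and G, a sample whose GT is 1/2 carries alleles G and T, and a GT of 0|0 indicates that both alleles match the reference.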

Note:

The batch file requires Oracle Wallet to be set up to run correctly.

The batch file requires the following parameters to be passed in the order listed. If any of the parameters contain spaces, enclose them in double quotation marks (""):

  • File name of the VCF data file

  • Oracle Directory Object

  • Study name

  • Datasource Name

  • Species Name

  • Specimen Vendor Number

  • Wallet Name

The following command will be used to execute the batch file to upload the VCF data.

C:\> VCF_loader.bat <VCF file Name><Oracle Directory Object><Study name><Data Source Name><Species Name><Specimen Vendor Number><Wallet Name>

For example,

d:\>vcf_loader.bat "d:\trc-v2\testdata\ALL.chr20.merged_beagle_mach.20101123.snps_indels_svs.genotypes.vcf" "ODB_LOAD" "Study1" "cdm" "Homo sapiens" "vendor3" "db004_w"

The shell script requires the following parameters to be passed in the order listed. If any of the parameters contain spaces, enclose them in double quotation marks (""):

  • Name of the VCF data file

  • Oracle Directory Object

  • Study name

  • Datasource Name

  • Species Name

  • Specimen Vendor Number

  • Schema/Wallet Name - If Oracle Wallet is set up, enter the Wallet name; otherwise enter the schema name

  • 1 if Oracle Wallet is set up else 0

  • If previous parameter was 0 then enter schema username

The shell script can be run with or without Oracle Wallet being set up. The following command is used to execute the shell script to upload the VCF data when Oracle Wallet is set up.

sh vcf_loader.sh <VCF file Name> <Oracle Directory Object> <Study Name> <Data Source Name> <Species Name> <Specimen Vendor Number> <Wallet Name> 1

For example,

sh vcf_loader.sh "/trc-v2/testdata/ ALL.chr20.merged_beagle_mach.20101123.snps_indels_svs.genotypes.vcf","ODB_LOAD","Study1","cdm","Homo sapiens"," ","vendor3","db004_w",1

The following command will be used to execute the shell script to upload the VCF Data when Oracle Wallet is not set up.

sh vcf_loader.sh <VCF file Name> <Oracle Directory Object> <Study Name> <Data Source Name> <Species Name> <Specimen Vendor Number> <Schema Name> 0 <Schema Username>

For example,

sh vcf_loader.sh "/trc-v2/testdata/ ALL.chr20.merged_beagle_mach.20101123.snps_indels_svs.genotypes.vcf","ODB_LOAD","Study1","cdm","Homo sapiens"," ","vendor3","db004_w",0,gdm

Table 4-5 Mapping of VCF Result File

Column Name in Result File / Table and Column Name in ODB / Description

CHROM

W_EHA_VARIANT.CHROMOSOME

W_EHA_RSLT_SEQUENCING.CHROMOSOME_WID

This field is used with the POS to find the correct DNA_SOURCE record and then to find or create a VARIANT record. Note that three values are needed to find existing VARIANT records: the chromosome, the POS, and the replace tag, which is a notation combining the reference and allele sequences. For NOVEL variants, a new record is created in the W_EHA_VARIANT table with the chromosome value.

POS

W_EHA_VARIANT_STG.START_POSITION

W_EHA_VARIANT_STG.END_POSITION

W_EHA_VARIANT.ABSOLUTE_POSITION

This field is used as described above in the Chromosome column. For NOVEL variants, 'POS' is stored as is in the ABSOLUTE_POSITION column of the W_EHA_VARIANT table, while W_EHA_VARIANT.START_POSITION is calculated relative to W_EHA_DNA_SOURCE.START_POSITION using the 'POS' value. Since VCF does not have end position information, while inserting novel variants into the VARIANT table, END_POSITION needs to be calculated from the POS information and the number of bases in REF.

REF

W_EHA_VARIANT.REPLACE_TAG

This value is used in conjunction with the ALT sequence to construct a replace tag value used to find existing VARIANT records. The replace tag is constructed with the reference first, followed by "/" and then the allele sequence. Note that for insertions the reference sequence uses a "-" and for deletions the allele sequence uses "-". This is standard notation used in most references. For some in-dels the representation can be as follows: an insertion can be "AT/ATCTA" and a deletion can be "ATCTA/AT". The logic is implemented in a procedure that can be called with any of the above formats.

ALT

W_EHA_RSLT_SEQUENCING_X.ALLELE

W_EHA_VARIANT.REPLACE_TAG

For sequencing results, this value is used to construct the replace tag and to store the value in the results table as per rules.

FILTER

W_EHA_RSLT_SEQUENCING_X.FILTER

All variants are loaded, whether the value is "PASS" or a failure code. Failure codes are based on the quality value and the percentage of samples having data, for example "q10;s50". If filtering has not been done, the field contains "." and the record is still loaded.

INFO

This field is not mapped with any database column. But will be used in cursor to filter out the SV data.

This is mainly used for integrated VCF files, which contain SNP, INDEL, and SV data. The first field before the semicolon specifies the variation type ID. This is used to filter out the SV data from the integrated VCF file.

FORMAT.GT

W_EHA_VARIANT.REPLACE_TAG

W_EHA_RSLT_SEQUENCING_X.ALLELE

Used to get the allele information for each sample. It is represented as '<allele1_num>/<allele2_num>'. In some cases instead of "/" there could be "|".

  1. For diploid: "0|0" represents both the alleles from REF.

  2. "0|1" represents one allele from REF and other from ALT allele.

  3. "0/2" represents one allele from REF and other from ALT allele 2.

  4. For haploid: only one allele number is represented.

  5. '.' is specified if a call cannot be made for a sample at that locus.

FORMAT.FT

W_EHA_RSLT_SEQUENCING_X. GENOTYPE_FILTER

Sample genotype filter indicating if this genotype was "called" (similar in concept to the FILTER field). Again, PASS indicates that all filters have been passed, a semicolon-separated list of codes indicates the filters that failed, and "." indicates that filters have not been applied. These values should be described in the meta-information in the same way as FILTERs (String, no white space or semicolons permitted).

FORMAT.GQ

W_EHA_RSLT_SEQUENCING.SCORE_VAF

This is mapped to W_EHA_RSLT_SEQUENCING.SCORE_VAF


Typical Errors Associated with Result Loaders (CGI, MAF, VCF, Gene Expression)

For each loader, a VARCHAR2(100) variable is defined. This variable is set before every SQL call to specify the context information. It is then used in standard error logging, where the RECORD_DETAIL column of the W_EHA_RSLT_ERR_LOG table is populated with a simple context error message. Some common context messages are 'Generating ETL PROC ID', 'Verifying result file type and creating result file record', 'Verifying Species Name', 'Verifying Study Name', 'Processing Variant Staging records', 'Processing variant records', 'Processing result records', and 'Dropping external tables'.
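
A simplified PL/SQL sketch of this pattern is shown below; only the W_EHA_RSLT_ERR_LOG table and its RECORD_DETAIL and ERROR_DESC columns come from this guide, and the variable name and the reduced column list are illustrative.

-- Illustrative only: set the context before each SQL call and log it together
-- with the Oracle error if that call fails.
DECLARE
  v_context VARCHAR2(100);
BEGIN
  v_context := 'Verifying Study Name';
  RAISE NO_DATA_FOUND;  -- stands in for a lookup that finds no matching study
EXCEPTION
  WHEN OTHERS THEN
    INSERT INTO w_eha_rslt_err_log (record_detail, error_desc)
    VALUES (v_context, SQLERRM);
    COMMIT;
END;
/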

Examples of common error log records which are used in CGI, MAF, and VCF loaders are shown in Table 4-6.

Table 4-6 Columns Generated while Parsing CGI Files

Column Name / Description / Examples

ROW_WID

Record identifier

 

RESULT_TYPE_NAME

Result Type Name

For CGI — CGI masterVar

SPECIES_NAME

Species Name

Homo sapiens

SPECIMEN_NUMBER

Specimen Number parsed from the comments section of file.

GS00706-DNA_C01

SPECIMEN_VENDOR_NUMBER

Name of vendor passed as parameter with batch file

For CGI: CGI

DATASOURCE_NM

Data source Name

CDM

ERROR_DESC

Error message

ORA-01403 NO DATA FOUND

RECORD_DETAIL

 

Verifying Study Name

ETL_PROC_WID

Each load identifier.

 

Following are some errors that are handled by each loader's processes.

CGI

When external table processing for the specimen number fails, the error Processing external table for specimen header is logged. When external table processing for the complete CGI result record fails, the error Processing external table for specimen data is logged. If the process fails to retrieve the specimen ID from the W_EHA_RSLT_SPECIMEN table, the error Retrieving specimen number from external table is logged along with the ORA error in the W_EHA_RSLT_ERR_LOG table.

VCF

Similar to CGI, when external table processing for the specimen number fails, the error Processing external table for specimen header is logged. If the process fails to retrieve the specimen ID from the W_EHA_RSLT_SPECIMEN table, the error Retrieving specimen number from external table is logged along with the ORA error in the W_EHA_RSLT_ERR_LOG table.

After retrieving the specimen ID from the header table, if the process fails while dropping the external header table, the error Dropping specimen header external table is logged into the error table. For VCF, a batch of 30 specimens is processed at a time, so if the process fails on the first batch, the error Processing data table for first set of specimen is logged into the error table.

After processing each batch of specimens, the loader drops the external table and recreates it for the next set of specimens. If the process fails while dropping the external table, the error Processing data table for first set of specimen is logged into the error table. If the process fails while processing the next batch of specimens, the error Processing data table for next set of specimen is logged into the error table.

At the end of the result processing, the loader drops the last version of the external table. If this fails, the error Dropping last version of data table is logged into the error table.

MAF

Only one external table is created for MAF result data. If the process fails at this step, the error Processing MAF external data table is logged into the error table. The first bulk insert statement creates variant staging and specimen records. If the process fails at this stage, the error Processing variant staging and specimen staging record is logged into the error table.

If the process fails while retrieving the specimen ID for a record from the global temporary table, the error Processing variant staging and specimen staging record is logged into the error table.

Gene Expression

If the process fails while retrieving the specimen ID, the error Verifying Specimen ID is logged into the error table. If the process fails while processing the hybridization header external table, the error Processing hybridization header table is logged. If the process fails while retrieving the hybridization name from the header table, the error Retrieving hybridization name from header table is logged. Similar to VCF, a batch of 45 expression records is processed at a time; if the process fails at this stage, the error Processing data table for the set of gene expression is logged. If the process fails while dropping the expression and hybridization external tables, the error messages Dropping expression data table and Dropping hybridization table, respectively, are logged.

Note:

For the above loaders, the RECORD_DETAIL column of the W_EHA_RSLT_ERR_LOG table describes only the context in which the process failed.

Template of Command Line Arguments for Result Loaders

This section provides a summary of templates for running each data loader, along with an example using sample input files.

Glossary of Parameter Templates

The following is a glossary of common data entries and a brief description of their point of origin.

Table 4-7 Glossary of Parameter Templates

Parameter Name / Description

user

DB user name

host_url:dbname

DB connection string (system-address:port:dbname), to be used if no Wallet is provided.

schema_name

DB instance name

species_file

File with a list of Species names

input_file(.dat)

Input File, with path if not present in same folder (File type)

wallet_name

Name of Wallet created

sqlnet.ora_path

Path to directory with file sqlnet.ora and tnsnames.ora

wallet_dir

Directory with wallet files

species_wid

Primary Key ID from W_EHA_SPECIES

file_type_Code

File_type_code from W_EHA_RSLT_FILE_TYPE

file_version

File_type_version from W_EHA_RSLT_FILE_TYPE

vendor_name

User input of Vendor name for Result files

data_source_name

datasource_name from W_EHA_DATASOURCE

species_name

species_name from W_EHA_SPECIES

specimen_number

If datasource is CDM, give the SPECIMEN_NUMBER under W_EHA_SPECIMEN_PATIENT_H. (Has to be present in CDM table, but should only be given for gene_expression_loader.)

specimen_vendor_number

If datasource is CDM, give the SPECIMEN_VENDOR_NUMBER under W_EHA_SPECIMEN_PATIENT_H. (Has to be present in CDM table.)


Probe Annotation

Loads probe annotation files into the W_EHA_PROBE table.

Without Oracle Wallet:

Linux:

>sh probe_loader.sh <input_file(text)> <schema_name> 0 <user>

For example,

>probe_loader.sh homo-sapiens-9606-gene-symbol.gmt DB001 0 gdm

With Oracle Wallet:

Windows:

>probe_loader.bat <input_file(text)> <wallet_name>

Linux:

>sh probe_loader.sh <input_file(text)> <wallet_name> 1

Example:

>probe_loader.bat probeset_annotation_summary_ref.txt DB001Wallet

Gene Expression

Loads gene expression results into the W_EHA_RSLT_FILE and W_EHA_RSLT_GENE_EXP tables.

Without Oracle Wallet:

Linux:

>sh gene_expression_loader.sh <input_file(text)> <oracle_directory_object> <study_name> <data_source_name> <specimen_number> <specimen_vendor_number> <schema_name> 0 <user>

Example:

>sh gene_expression_loader.sh mas5_expression_summary_part1.txt "ODB_LOAD" "study1" "CDM" "exp01" "affy01" DB001 0 gdm

With Oracle Wallet:

Windows:

>gene_expression_loader.bat <input_file(text)> <oracle_directory_object> <study_name> <data_source_name> <specimen_number> <specimen_vendor_number> <wallet_name>

Linux:

>sh gene_expression_loader.sh <input_file(text)> <oracle_directory_object> <study_name> <data_source_name> <specimen_number> <specimen_vendor_number> <wallet_name> 1

Example:

>sh gene_expression_loader.sh mas5_expression_summary_part1.txt "ODB_LOAD" "study1" "CDM" "exp01" "affy01" DB001Wallet 1

CGI masterVar

Loads variation/mutation results from CGI masterVar files to RSLT_Sequencing and RSLT_Nocall tables.

Without Oracle Wallet:

Linux:

>sh CGI_loader.sh <input_file(.tsv)> <oracle_directory_object> <species_name> <study_name> <data_source_name> <specimen_vendor_number> <schema_name> 0 <user>

Example:

>sh CGI_loader.sh "masterVarBeta-NA19240-L2-200-37-ASM.tsv" "ODB_LOAD" "Homo Sapiens" "Study1" "CDM" "CGI" DB001 0 gdm

With Oracle Wallet:

Windows:

>CGI_loader.bat <input_file(.tsv)> <oracle_directory_object> <species_name> <study_name> <data_source_name> <specimen_vendor_number> <wallet_name>

Linux:

>sh CGI_loader.sh <input_file(.tsv)> <oracle_directory_object> <species_name> <study_name> <data_source_name> <specimen_vendor_number> <wallet_name> 1

Example:

>sh CGI_loader.sh "masterVarBeta-NA19240-L2-200-37-ASM.tsv" "ODB_LOAD" "Homo Sapiens" "Study1" "CDM" "CGI" DB001Wallet 1

MAF

Loads variation/mutation results from MAF files to RSLT_Sequencing and RSLT_Nocall tables.

Without Oracle Wallet:

Linux:

>sh MAF_loader.sh <input_file(.maf)> <file_type_Code> <file_version> <vendor_name> <data_source_name> <species_name> <specimen_number> <specimen_vendor_number> <schema_name> 0 <user>

Example:

>sh MAF_loader.sh "hgsc.bcm.edu__Applied_Biosystems_Sequence_data_level3.maf" "MAF" "2.2" "MAF" "CDM" "Homo sapiens" "" "MAF" DB001 0 gdm

With Oracle Wallet:

Windows:

>MAF_loader.bat <input_file(.maf)> <file_type_Code> <file_version> <vendor_name> <data_source_name> <species_name> <specimen_number> <specimen_vendor_number> <wallet_name>

Linux:

>sh MAF_loader.sh <input_file(.maf)> <file_type_Code> <file_version> <vendor_name> <data_source_name> <species_name> <specimen_number> <specimen_vendor_number> <wallet_name> 1

Example:

>sh MAF_loader.sh "hgsc.bcm.edu__Applied_Biosystems_Sequence_data_level3.maf" "MAF" "2.2" "MAF" "CDM" "Homo sapiens" "" "MAF" DB001Wallet 1

VCF

Loads variation/mutation results from VCF files to RSLT_Sequencing and RSLT_Nocall tables.

Without Oracle Wallet:

Linux:

>sh VCF_loader.sh <input_file(.vcf)> <oracle_directory_object> <species_name> <study_name> <data_source_name> <specimen_vendor_number> <schema_name> 0 <user>

Example:

>sh VCF_loader.sh "ALL.chr20.merged_beagle_mach.20101123.snps_indels_svs.genotyp_copy.vcf" "ODB_LOAD" "Homo Sapiens" "Study1" "CDM" "VCF" DB001 0 gdm

With Oracle Wallet:

Windows:

>VCF_loader.bat <input_file(.vcf)> <oracle_directory_object> <species_name> <study_name> <data_source_name> <specimen_vendor_number> <wallet_name>

Linux:

>sh VCF_loader.sh <input_file(.vcf)> <oracle_directory_object> <species_name> <study_name> <data_source_name> <specimen_vendor_number> <wallet_name> 1

Example:

>sh VCF_loader.sh "ALL.chr20.merged_beagle_mach.20101123.snps_indels_svs.genotyp_copy.vcf" ODB_LOAD" "Homo Sapiens" "Study1 "CDM" "VCF" DB001Wallet 1