Oracle® Health Sciences Omics Data Bank Programmer's Guide Release 1.0.1 Part Number E27509-02
This chapter includes the following topics:
Before running the result loaders, ensure that the reference Ensembl files have been loaded using the Java loader.
Note:
The loaded reference must match the reference used for alignment of all other files to be loaded. After the reference is loaded, perform the following steps to initialize your database:
Create W_EHA_RSLT_STUDY records. A sequence (W_EHA_RSLT_STUDY_S) is associated with this table to allow as many study records as needed for testing. The result data is partitioned using the foreign key to study, so the required number of studies should be added to this table. The result loaders use the value of RESULT_STUDY_NAME to look up the corresponding study primary key. For simple testing, only one record is needed.
Configure a W_EHA_DATASOURCE record to identify the CDM schema used to validate specimen numbers. For each result record to be loaded, the specimen must exist in the W_EHA_SPECIMEN_PATIENT_H table in the CDM schema. Note that the CDM schema needs the v1.01 patch installed, which adds a SPECIMEN_VENDOR_NUMBER field for vendor-specific information. The W_EHA_DATASOURCE record can use a database link if the CDM schema is in another instance.
Ensure that the ODB schema has SELECT privileges on the W_EHA_SPECIMEN_PATIENT_H table in the CDM schema.
All of the specimens required for the example files should be added into the W_EHA_SPECIMEN_PATIENT_H table in the CDM schema.
To install the loader, copy the Results_Loader folder into a directory of your choice. The program should be run from the directory in which it is installed.
There are four result loaders, one for each file type:
Gene Expression
CGI
MAF
VCF
These four loaders can be run using the .bat file in Windows or the shell scripts in Linux.
The .bat files to be run in Windows are as follows:
gene_expression_loader.bat
CGI_loader.bat
MAF_loader.bat
VCF_loader.bat
Each of the four result loaders requires the following parameters, passed in this order:
Name of the result file - This can include a subdirectory under the directory used. Any subdirectory must match the syntax of the operating system on the database server. The database server must have full access privileges on this directory.
The Oracle Directory object - For more information refer to Chapter 2, "Prerequisites for Loading Data". Oracle recommends that you always use capitalized names for Oracle directory objects.
Species Type - The actual species name should be passed. Usually this is "Homo sapiens" if the reference is loaded for humans. The name can be found in the W_EHA_SPECIES table.
Study name - This should be a value in the W_EHA_RSLT_STUDY.RESULT_STUDY_NAME column.
Data Source Name - This is a value in W_EHA_DATASOURCE.DATASOURCE_NM.
Specimen Number - Should be given for the gene expression loader only. If CDM is referenced, this value must be present for a record in the W_EHA_SPECIMEN_PATIENT_H table under SPECIMEN_NUMBER.
Specimen Vendor Number - If CDM is referenced, this value must be present for a record in the W_EHA_SPECIMEN_PATIENT_H table under SPECIMEN_VENDOR_NUMBER.
Wallet Name
Note:
Oracle Wallet must be set up before the batch files can be run successfully. The shell scripts to be run in Linux are as follows:
gene_expression_loader.sh
CGI_loader.sh
MAF_loader.sh
VCF_loader.sh
Each of the four result loaders requires the following parameters, passed in this order:
Name of the result file - This can include a subdirectory under the directory used. Any subdirectory must match the syntax of the operating system on the database server. The database server must have full access privileges on this directory.
The Oracle Directory object - For more information refer to Chapter 2, "Prerequisites for Loading Data". Oracle recommends that you always use capitalized names for Oracle directory objects.
Species Type - The actual species name should be passed. Usually this is "Homo sapiens" if the reference is loaded for humans. The name can be found in the W_EHA_SPECIES table.
Name of the Study
Data Source Name
Specimen Number - Should be given for gene_expression_loader ONLY. If CDM is referenced, this value must be present for a record in the W_EHA_SPECIMEN_PATIENT_H table under SPECIMEN_NUMBER.
Specimen Vendor Number
Schema/Wallet name - If Oracle Wallet is set up, enter Wallet name else enter the schema name.
1 if Oracle Wallet is set up else 0
If previous parameter was 0 then enter schema username
If Oracle Wallet is set up, the shell script uses those credentials to run SQL*Loader and SQL*Plus. If Oracle Wallet is not set up, the script prompts for a password and connects to SQL*Loader and SQL*Plus.
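The credential handling above is common to all the loader shell scripts. The following Python sketch illustrates the idea of assembling the connect argument for SQL*Plus/SQL*Loader in both modes; the function name and parameters are illustrative, not taken from the shipped scripts.

```python
def connect_argument(use_wallet, wallet_or_schema, username=None, password=None):
    """Assemble the credential argument passed to sqlplus/sqlldr.

    With Oracle Wallet, the tools can be invoked as 'sqlplus /@<wallet_alias>'
    so no password appears on the command line; without a wallet, the script
    must prompt for a password and build '<user>/<password>@<alias>'.
    Illustrative sketch only.
    """
    if use_wallet:
        return "/@{}".format(wallet_or_schema)
    if username is None or password is None:
        raise ValueError("username and password required when no wallet is configured")
    return "{}/{}@{}".format(username, password, wallet_or_schema)
```

With a wallet named db004_w this yields `/@db004_w`; without one, a prompted password is interpolated into the classic `user/password@alias` form.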
The probe loader populates the W_EHA_PROBE table. Use probe_loader.bat on Windows or probe_loader.sh on Linux. The probe loader is closer to a reference loader than a result loader, but its data varies by vendor, for example, Affymetrix or Illumina.
Running Probe Loader on Windows
To run probe_loader.bat you need to pass the following parameters in order:
Name and path of the probe data file
Wallet name
The probe loader calls the control file to populate the data into the staging table, and then calls the procedure that loads the data into W_EHA_PROBE. The bat file requires Oracle Wallet to be set up before it can run successfully.
The bat file can be run as follows:
C:\>probe_loader.bat <Name and path of probe data file> <Wallet name>
To run probe_loader.sh you must pass the following parameters in order:
Name and path of the probe data file
Schema/Wallet name - if Oracle Wallet is set up enter the wallet name else enter the schema name
If Oracle Wallet is set up, then 1 else 0
If the previous parameter is 0 then enter the username of the schema
The shell script calls the control file to populate the data into the staging table and then calls the procedure that loads data into W_EHA_PROBE.
If Oracle Wallet is set up, the shell script uses those credentials to run SQL*Loader and SQL*Plus. If Oracle Wallet is not set up, the script prompts for a password and connects to SQL*Loader and SQL*Plus.
If Oracle Wallet is set up, then the shell script can be run as follows :
sh probe_loader.sh <Name and path of the probe data file> <Wallet name> 1
If Oracle Wallet is not set up, the shell script can be run as follows :
sh probe_loader.sh <Name and path of the probe data file> <Schema Name> 0 <Schema username>
Once the loader scripts are run, check the log file (probe_loader.log) created in the folder from which the bat or sh script was run. This log file indicates the number of records read, the number of records loaded into the staging table, and the number of records discarded.
The gene expression loader loads the W_EHA_RSLT_FILE and W_EHA_RSLT_GENE_EXP tables. These tables can be loaded by running gene_expression_loader.bat in Windows or gene_expression_loader.sh in Linux.
Note:
The gene expression loader assumes that the probe loader (see Probe Loader) has already populated the W_EHA_PROBE table with probe names corresponding to genes. The batch file requires the following parameters to be passed in order. If any of the parameters contain spaces, enclose them in "":
File name of the gene expression data file
Oracle Directory Object
Study name
Datasource Name
Specimen Number
Specimen Vendor Number
Wallet Name
The w_eha_rslt_err_log table will contain error records if the records were not successfully loaded into the target tables.
Note:
Oracle Wallet must be set up before the batch files can be run successfully. You can run gene_expression_loader.bat as follows:
C:\>gene_expression_loader.bat <File name of the gene expression data file> <Oracle Directory Object> <Study Name> <Datasource Name> <Specimen Number> <Specimen Vendor Number> <Wallet Name>
The gene_expression_loader.sh script requires the following parameters to be passed in order. If any of the parameters contain spaces, enclose them in "":
File name of the gene expression data file
Oracle Directory Object
Study Name
Datasource Name
Specimen Number
Specimen Vendor Number
Schema/Wallet Name - If Oracle Wallet is set up enter the Wallet name else enter the schema name.
1 if Oracle Wallet is set up else 0
If previous parameter was 0 then enter schema username
The w_eha_rslt_err_log table will contain error records if the records were not successfully loaded into the target tables.
If Oracle Wallet is set up, the shell script uses those credentials to run SQL*Loader and SQL*Plus. If Oracle Wallet is not set up, the script prompts for a password and connects to SQL*Loader and SQL*Plus.
If the Oracle wallet is set up, you can run the gene_expression_loader.sh script as follows:
gene_expression_loader.sh <File name of the gene expression data file> <Oracle Directory Object> <Study Name> <Data Source Name> <Specimen Number> <Specimen Vendor Number> <Wallet Name> 1
If Oracle Wallet is not set up, you can run the gene_expression_loader.sh script as follows:
gene_expression_loader.sh <File name of the gene expression data file> <Oracle Directory Object> <Study Name> <Data Source Name> <Specimen Number> <Specimen Vendor Number> <Schema Name> 0 <Schema Username>
Following are the assumptions for the data file for the Gene Expression Loader:
The file is tab separated.
The first row is always the header.
The first column is named DATA.
Each hybridization present in the data file should have three columns in the following order:
Intensity - Header value should be the Hybridization Name
Call
P-Value
The first column for each hybridization should contain only the hybridization name as its header. The values in this column will be the hybridization intensity values.
The total size of the header in the data file should not be greater than 32000 characters.
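The layout assumed above can be made concrete with a small parser. The following Python sketch (not the shipped loader, which uses external tables) reads such a tab-separated file and yields one record per (probe, hybridization) pair:

```python
import csv
import io

def parse_gene_expression(text):
    """Parse a tab-separated gene expression file laid out as described above:
    row 1 is the header, column 1 is named DATA (the probe name), and every
    hybridization contributes three columns -- intensity (headed by the
    hybridization name), call, and p-value. Illustrative sketch only.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    assert header[0] == "DATA", "first column must be named DATA"
    # Hybridization names are the headers of the intensity columns: 1, 4, 7, ...
    hyb_names = [header[i] for i in range(1, len(header), 3)]
    rows = []
    for rec in reader:
        probe = rec[0]
        for n, hyb in enumerate(hyb_names):
            base = 1 + 3 * n
            rows.append({
                "probe_name": probe,            # looked up in W_EHA_PROBE
                "hybridization_name": hyb,      # -> HYBRIDIZATION_NAME
                "intensity": float(rec[base]),  # -> INTENSITY
                "call": rec[base + 1],          # -> CALL
                "p_value": float(rec[base + 2]) # -> P_VALUE
            })
    return rows
```

Each returned dict corresponds to one W_EHA_RSLT_GENE_EXP row after the probe name is resolved to a PROBE_WID.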
Mappings for Gene Expression Loader
Table 4-2 Mappings for Gene expression Loader
Data File | W_EHA_RSLT_GENE_EXP
---|---
DATA | A lookup is done in W_EHA_PROBE; the corresponding ROW_WID is populated in W_EHA_RSLT_GENE_EXP.PROBE_WID
HYBRIDIZATION - Header | W_EHA_RSLT_GENE_EXP.HYBRIDIZATION_NAME
HYBRIDIZATION - Data Values | W_EHA_RSLT_GENE_EXP.INTENSITY
HYBRIDIZATION_Call | W_EHA_RSLT_GENE_EXP.CALL
HYBRIDIZATION_P-VALUE | W_EHA_RSLT_GENE_EXP.P_VALUE
The CGI 2.0 file format is described here:
ftp://ftp2.completegenomics.com/
Each section in a CGI file is self-contained and separate. There are three types of sections:
Comment lines - beginning with #
Header - beginning with >
Actual result data with information about the Zygosity, Variant Type, Reference, Alleles, Scores and Count.
The main challenge in loading a CGI file is parsing the #SAMPLE information from the comments section and then mapping it to the rest of the data. This sample information is important for retrieving the Specimen_Id from the data source you specified while executing the batch file.
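The comment-scanning step described above can be sketched as follows. This is an illustrative Python sketch of pulling the #SAMPLE value out of a CGI 2.0 masterVar file, not the stored procedure's actual parsing; the tab-separated `TAG<TAB>VALUE` comment layout is assumed from the CGI format.

```python
def extract_sample(lines):
    """Return the #SAMPLE value from the comment section of a CGI file.
    Comment lines start with '#', the header line with '>', and result
    rows follow the header. Illustrative sketch only.
    """
    for line in lines:
        if line.startswith(">"):
            break  # header reached: the comment section is over
        if line.startswith("#SAMPLE"):
            # comment lines are 'TAG<TAB>VALUE'
            return line.rstrip("\n").split("\t", 1)[1]
    return None
```

The returned value is what the loader would hand to Odb_util.get_specimen_wid() to resolve the specimen.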
The execution call of the stored procedure odb_result_util.process_cgi_stage() is made in the script file load_cgi.sql. This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, SPECIES NAME, STUDY NAME, DATA SOURCE, and SPECIMEN VENDOR as input parameters.
Note:
This stored procedure creates external tables dynamically and uploads data from the source file into them. External tables allow Oracle to query data that is stored outside the database in flat files. The ORACLE_LOADER driver can be used to access data stored in any format that can be loaded by SQL*Loader. No DML can be performed on external tables, but they can be used for query, join, and sort operations. Two external tables are created dynamically. The first stores the specimen number parsed from the comments section of the file, that is, the #SAMPLE tag and the related specimen number. The second external table stores the complete result data. The Odb_util.get_specimen_wid() function retrieves the specimen_id for the corresponding specimen number and inserts a record into the W_EHA_RSLT_SPECIMEN table.
Two multi-table insert statements are written dynamically. One inserts records into the W_EHA_VARIANT_STG, W_EHA_RSLT_NOCALL, and W_EHA_RSLT_NOCALL_X tables, and the other inserts records into the W_EHA_RSLT_SEQUENCING and W_EHA_RSLT_SEQUENCING_X tables. After inserting records into the W_EHA_VARIANT_STG table, the PROCESS_VARIANT() procedure is called, which populates the W_EHA_VARIANT table.
Any allele that is marked as "ref" indicates that the section of the genome matches the reference sequence. This data is not saved to the database. Any allele not matching the reference generates two records in the ODB. Specimen identifiers are stored in the CGI data file header. Several values are used to get the correct primary key value of the corresponding SPECIMEN record, which is external to the ODB schema. The SPECIMEN identifier is the first field used to look up the external database. This will most likely be a barcode or some other natural key. This value is not necessarily unique in the other database (especially coming from HDM), where different vendors can have different barcode systems that may overlap, so a specimen vendor number is also used to look up the correct SPECIMEN record. The last value needed specifies which source is used when multiple sources provide SPECIMEN records. Each result table stores a FK to the correct datasource for the SPECIMEN as well as the primary key value of the SPECIMEN record. The loading code must use all three fields to find the correct values for the result records.
The alleles identified as "no-call" will create a W_EHA_RSLT_NOCALL and W_EHA_RSLT_NOCALL_X record. All other non-ref alleles will create a W_EHA_RSLT_SEQUENCING and W_EHA_RSLT_SEQUENCING_X record.
The other columns are mapped as shown below. If the alleles are heterozygous and both are non-reference, a total of four records is created for such a row of the master variant file.
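The record-creation rules above can be summarized in a small classification sketch. This Python function is illustrative only (the loader does this inside multi-table inserts); it returns the table names of the record pairs a single CGI row would produce for its two alleles:

```python
def records_for_row(allele1, allele2):
    """Apply the rules described above: 'ref' alleles are skipped, a
    homozygous variant stores only a single allele record pair, 'no-call'
    alleles create a W_EHA_RSLT_NOCALL/_X pair, and any other allele
    creates a W_EHA_RSLT_SEQUENCING/_X pair. Illustrative sketch only.
    """
    if allele1 == allele2 and allele1 not in ("ref", "no-call"):
        # homozygous variant: only one allele record pair is stored
        return ["W_EHA_RSLT_SEQUENCING", "W_EHA_RSLT_SEQUENCING_X"]
    records = []
    for allele in (allele1, allele2):
        if allele == "ref":
            continue  # matches the reference; nothing is stored
        if allele == "no-call":
            records += ["W_EHA_RSLT_NOCALL", "W_EHA_RSLT_NOCALL_X"]
        else:
            records += ["W_EHA_RSLT_SEQUENCING", "W_EHA_RSLT_SEQUENCING_X"]
    return records
```

A heterozygous row with two non-reference alleles thus yields four records, matching the statement above.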
Note:
The batch file requires Oracle Wallet to be set up to run correctly. The batch file requires the following parameters to be passed in order. If any of the parameters contain spaces, enclose them in "":
File name of the CGI data file
Oracle Directory Object
Study name
Datasource Name
Species Name
Specimen Vendor Number
Wallet Name
The following command is used to execute the batch file to upload the CGI data.
d:\> CGI_loader.bat <CGI file name with full path> <Oracle Directory Object> <Species Name> <Study Name> <Data Source Name> <Specimen Vendor Number> <Wallet Name>
For example,
d:\>cgi_loader.bat "masterVarBeta-NA19240-L2-200-37-ASM small.tsv" "ODB_LOAD" "'Homo sapiens'" "STUDY1" "CDM" "vendor2" "db004_w"
The shell script requires the following parameters to be passed in order. Enclose spaces in "":
File name of the CGI data file
Oracle Directory Object
Study name
Datasource Name
Species Name
Specimen Vendor Number
Schema/Wallet Name - If Oracle Wallet is set up, enter the Wallet name; else enter the schema name
1 if Oracle Wallet is set up, else 0
If the previous parameter was 0, enter the schema username
The shell script can be run with or without Oracle Wallet being set up. The following command is used to execute the shell script to upload CGI data when Oracle Wallet is set up.
sh CGI_loader.sh <CGI file name> <Oracle Directory Object> <Species Name> <Study Name> <Data Source Name> <Specimen Vendor Number> <Wallet Name> 1
For example,
sh cgi_loader.sh "masterVarBeta-NA19240-L2-200-37-ASM small.tsv" "ODB_LOAD" "'Homo sapiens'" "STUDY1" "CDM" "vendor2" "db004_w" 1
The following command will be used to execute the shell script to upload the CGI Data when Oracle Wallet is not set up.
sh CGI_loader.sh <CGI file name> <Oracle Directory Object> <Species Type> <Study Name> <Data Source Name> <Specimen Vendor Number> <Schema Name> 0 <Schema username>
For example,
sh CGI_loader.sh "masterVarBeta-NA19240-L2-200-37-ASM small.tsv" "ODB_LOAD" "'Homo sapiens'" "STUDY1" "CDM" "vendor2" "db004" 0 gdm
Note:
In Linux, it is not required to enclose 'Homo sapiens' in double quotes. That requirement applies only to Windows.
Table 4-3 Mapping of CGI Result File
Column Name | Table and Column Name in ODB | Description |
---|---|---|
chromosome |
W_EHA_RSLT_NOCALL.CHROMOSOME_WID W_EHA_RSLT_SEQUENCING.CHROMOSOME_WID W_EHA_VARIANT.CHROMOSOME |
For no-call results this is stored directly in the CHROMOSOME field. For non-reference alleles, this field is used with the begin position to find the correct DNA_SOURCE record in order to find or create a VARIANT record. Three values are needed to find existing VARIANT records: the chromosome, the begin position, and the replace tag, which is notation combining reference and allele sequences. For NOVEL variants, a new record is created in the W_EHA_VARIANT table with the chromosome value. |
begin |
W_EHA_RSLT_NOCALL.START_POSITION W_EHA_RSLT_SEQUENCING.START_POSITION W_EHA_VARIANT_STG.START_POSITION W_EHA_VARIANT.ABSOLUTE_POSITION |
1 must be added to the value of this field, since CGI uses zero-based offsets and all other references use one-based offsets. This field is used as described above. For no-call results this is stored in the START_POSITION field (after adding 1). For NOVEL variants, 'begin' is stored as-is in the ABSOLUTE_POSITION column of the W_EHA_VARIANT table, while W_EHA_VARIANT.START_POSITION is calculated relative to W_EHA_DNA_SOURCE.START_POSITION using the 'begin' value. |
end |
W_EHA_RSLT_NOCALL.END_POSITION W_EHA_VARIANT_STG.END_POSITION |
This value is used for no-call results and stored in the END_POSITION field. It is also used to calculate the relative end position, based on W_EHA_DNA_SOURCE.START_POSITION, for W_EHA_VARIANT.END_POSITION for novel variants. |
zygosity |
W_EHA_RSLT_SEQUENCING.ZYGOSITY |
Stored for sequencing alleles in ZYGOSITY field. |
varType |
W_EHA_RSLT_SEQUENCING. VARIANT_TYPE W_EHA_RSLT_NOCALL. NOCALL_TYPE |
Stored either in NOCALL_TYPE or VARIANT_TYPE. |
reference |
W_EHA_VARIANT. REPLACE_TAG |
This value is used in conjunction with allele1Seq and allele2Seq to construct a replace tag value used to find existing VARIANT records in the W_EHA_VARIANT table. For CGI, two overlap values are computed, one for each allele of each row of the result file. Using the overlap value, the reference sequence and the allele sequence are shortened and the start and end positions are incremented. The overlap value is used to create a replace tag with the shortened reference and allele sequences. For insertions, the reference sequence uses a "-" and for deletions the allele sequence uses a "-". This is standard notation used in most references. Some in-dels can be represented differently: an insertion can be "AT/ATCTA" and a deletion "ATCTA/AT". The logic for checking and inserting variants is in the called procedure, which should handle varying representations for variants coming from any of the sequencing file types. This field is in some cases empty for insertions. |
allele1Seq |
W_EHA_VARIANT.REPLACE_TAG W_EHA_RSLT_SEQUENCING_X.ALLELE/ W_EHA_RSLT_SEQUENCING_X.ALLELE_CLOB W_EHA_RSLT_NOCALL_X.ALLELE/ W_EHA_RSLT_NOCALL_X.ALLELE_CLOB |
For sequencing results, this value is used to construct the replace tag. In some cases, this field is empty for deletions. |
allele2Seq |
W_EHA_VARIANT.REPLACE_TAG W_EHA_RSLT_SEQUENCING_X.ALLELE/ W_EHA_RSLT_SEQUENCING_X.ALLELE_CLOB W_EHA_RSLT_NOCALL_X.ALLELE/ W_EHA_RSLT_NOCALL_X.ALLELE_CLOB |
For sequencing results, this value is used to construct the replace tag. In some cases, this field is empty for deletions. |
allele1VarScoreVAF |
W_EHA_RSLT_SEQUENCING.SCORE_VAF |
Used for sequencing results and stored in the SCORE_VAF field. If the variant is homozygous, only a single allele record is stored; in that case the lesser of the two score values is stored. |
Allele2VarScoreVAF |
W_EHA_RSLT_SEQUENCING.SCORE_VAF |
Used for sequencing results and stored in the SCORE_VAF field. If the variant is homozygous, only a single allele record is stored; in that case the lesser of the two score values is stored. |
allele1VarScoreEAF |
W_EHA_RSLT_SEQUENCING.SCORE_EAF |
Used for sequencing results and stored in the SCORE_EAF field. If the variant is homozygous, only a single allele record is stored; in that case the lesser of the two score values is stored. |
Allele2VarScoreEAF |
W_EHA_RSLT_SEQUENCING.SCORE_EAF |
Used for sequencing results and stored in the SCORE_EAF field. If the variant is homozygous, only a single allele record is stored; in that case the lesser of the two score values is stored. |
allele1HapLink |
W_EHA_RSLT_SEQUENCING_X.HAPLINK W_EHA_RSLT_NOCALL_X.HAPLINK |
Used for both no-call and sequencing results and stored in the HAPLINK field. |
allele2HapLink |
W_EHA_RSLT_SEQUENCING_X.HAPLINK W_EHA_RSLT_NOCALL_X.HAPLINK |
Used for both no-call and sequencing results and stored in the HAPLINK field. |
allele1ReadCount |
W_EHA_RSLT_SEQUENCING. ALLELE_READ_COUNT |
Used for sequencing results and stored in the ALLELE_READ_COUNT field. |
Allele2ReadCount |
W_EHA_RSLT_SEQUENCING. ALLELE_READ_COUNT |
Used for sequencing results and stored in the ALLELE_READ_COUNT field. |
referenceAlleleReadCount |
W_EHA_RSLT_SEQUENCING. REFERENCE_READ_COUNT |
Used for sequencing results and stored in the REFERENCE_READ_COUNT field. |
Mutation Annotation Format (MAF) is created by TCGA. MAF files store variation data for multiple samples. The MAF file format is described here:
http://tcga-data.nci.nih.gov/tcga/dataAccessMatrix.htm
A MAF file consists of tab-delimited columns. This file is the simplest of all the result files. ODB supports import of MAF versions 2.0 through 2.2.
The execution call of the stored procedure odb_result_util.process_maf() is made in the script file load_maf.sql. This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, SPECIES NAME, STUDY NAME, DATA SOURCE, and SPECIMEN VENDOR as input parameters.
This stored procedure creates an external table dynamically and uploads data from the source file into it. External tables allow Oracle to query data that is stored outside the database in flat files. The ORACLE_LOADER driver can be used to access data stored in any format that can be loaded by SQL*Loader. No DML can be performed on external tables, but they can be used for query, join, and sort operations.
Only one external table is created dynamically and holds the complete result data. A global temporary table named W_EHA_MAF_SPECIMEN is created explicitly to store the normal and tumor sample barcodes. There are two passes through the MAF file. The first creates W_EHA_VARIANT_STG records and collects each unique specimen number into the global temporary table. A bulk collect then calls Odb_util.GET_SPECIMEN_WID for all of the specimen numbers in one statement.
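The first pass's deduplication step can be sketched as follows. This illustrative Python function (standing in for the global temporary table, not the actual PL/SQL) gathers each unique tumor and matched-normal barcode so the specimen lookup runs once per barcode:

```python
def collect_specimen_numbers(maf_rows):
    """First pass over MAF rows: gather each unique tumor and matched-normal
    sample barcode, preserving first-seen order, mirroring the
    W_EHA_MAF_SPECIMEN global temporary table described above.
    Illustrative sketch only.
    """
    seen, ordered = set(), []
    for row in maf_rows:
        for barcode in (row["Tumor_Sample_Barcode"], row["Matched_Norm_Sample_Barcode"]):
            if barcode not in seen:
                seen.add(barcode)
                ordered.append(barcode)
    return ordered
```

The resulting list is what a single bulk GET_SPECIMEN_WID call would then resolve to specimen keys.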
Another bulk insert statement then inserts the data into W_EHA_RSLT_SEQUENCING and W_EHA_RSLT_SEQUENCING_X. After inserting records into the W_EHA_VARIANT_STG table, the PROCESS_VARIANT() procedure is called, which populates the W_EHA_VARIANT table.
Each row of the MAF file has allele information for two sample types: a tumor sample and a normal sample. The two sample IDs are specified in each row of the file: the tumor sample ID in the "Tumor_Sample_Barcode" column and the normal sample ID in the "Matched_Norm_Sample_Barcode" column. So, for a single row, a maximum of eight records can be created in ODB, depending on heterozygosity and resemblance to the reference sequence: four for the tumor sample and four for the normal sample.
For allele sequences, "-" for a deletion represents a variant, while "-" for an insertion represents the wild-type allele. If an allele sequence is the same as the Reference_Allele sequence, that allele information is not stored in the data bank. There is no NOCALL information in MAF data, hence all the data goes to the W_EHA_VARIANT, W_EHA_RSLT_SEQUENCING, and W_EHA_RSLT_SEQUENCING_X tables.
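The wild-type filtering rule above can be captured in a short sketch. This illustrative Python function (not the shipped loader logic) decides which of a row's two alleles are stored for one sample:

```python
def stored_alleles(reference_allele, allele1, allele2):
    """Apply the MAF rules above: an allele equal to Reference_Allele is
    wild type and is not stored. This also covers insertions, where
    Reference_Allele is '-' and a '-' allele is wild type; a '-' allele
    against a non-'-' reference is a deletion and IS stored.
    Illustrative sketch only.
    """
    kept = []
    for allele in (allele1, allele2):
        if allele == reference_allele:
            continue  # wild type: matches the reference, not stored
        kept.append(allele)
    return kept
```

Applying this separately to the tumor and normal allele pairs gives between zero and four stored alleles per MAF row, consistent with the maximum of eight records (each stored allele producing a W_EHA_RSLT_SEQUENCING and a W_EHA_RSLT_SEQUENCING_X record).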
Note:
The batch file requires Oracle Wallet to be set up to run correctly. The batch file requires the following parameters to be passed in order. Any parameters containing spaces should be enclosed in "":
File name of the MAF data file
Oracle Directory Object
Study name
Datasource Name
Species Name
Specimen Vendor Number
Wallet Name
The following command is used to execute the batch file to upload the MAF data.
C:\> MAF_loader.bat <MAF file name> <Oracle Directory Object> <Species Name> <Study Name> <Data Source Name> <Specimen Vendor Number> <Wallet Name>
For example,
d:\>maf_loader.bat "hgsc.bcm.edu__Applied_Biosystems_Sequence_data_level3.maf" "ODB_LOAD" "'Homo sapiens'" "STUDY1" "CDM" "vendor3" "db004_w"
The shell script requires the following parameters to be passed in order. Any spaces present in the parameters should be enclosed within "" :
File name of the MAF data file
Oracle Directory Object
Study name
Datasource Name
Species Name
Specimen Vendor Number
Schema/Wallet Name - If Oracle Wallet is set up, enter the Wallet name; else enter the schema name
1 if Oracle Wallet is set up, else 0
If the previous parameter was 0, enter the schema username
The shell script can be run with or without Oracle Wallet being set up. The following command is used to execute the shell script to upload the MAF data when Oracle Wallet is set up.
sh maf_loader.sh <MAF file name> <Oracle Directory Object> <Species Name> <Study Name> <Data Source Name> <Specimen Vendor Number> <Wallet Name> 1
For example,
sh maf_loader.sh "hgsc.bcm.edu__Applied_Biosystems_Sequence_data_level3.maf" "ODB_LOAD" "'Homo sapiens'" "STUDY1" "CDM" "vendor2" "db004" 1
The following command will be used to execute the shell script to upload the MAF Data when Oracle Wallet is not set up.
sh maf_loader.sh <MAF file name> <Oracle Directory Object> <Species Type> <Study Name> <Data Source Name> <Specimen Vendor Number> <Schema Name> 0 <Schema username>
For example,
sh maf_loader.sh "hgsc.bcm.edu__Applied_Biosystems_Sequence_data_level3.maf" "ODB_LOAD" "'Homo sapiens'" "STUDY1" "CDM" "vendor2" "db004" 0 gdm
Table 4-4 Mapping of MAF Result File
Column Name in Result File | Table and Column Name in ODB | Description |
---|---|---|
Chromosome |
W_EHA_RSLT_SEQUENCING.CHROMOSOME_WID W_EHA_VARIANT.CHROMOSOME |
This field is used with the begin position to find the correct DNA_SOURCE record in order to find or create a VARIANT record. Note that three values are needed to find existing VARIANT records: the chromosome, the begin position, and the replace tag, which is notation combining reference and allele sequences. For NOVEL variants, a new record is created in the W_EHA_VARIANT table with the chromosome value. |
Start_Position |
W_EHA_RSLT_SEQUENCING.START_POSITION W_EHA_VARIANT_STG.START_POSITION W_EHA_VARIANT.ABSOLUTE_POSITION |
This field is used as described above. For NOVEL variants, 'Start_Position' is stored as-is in the ABSOLUTE_POSITION column of the W_EHA_VARIANT table, while W_EHA_VARIANT.START_POSITION is calculated relative to W_EHA_DNA_SOURCE.START_POSITION using the 'Start_Position' value. |
End_Position |
W_EHA_VARIANT_STG.END_POSITION |
This value is stored in the END_POSITION field. It is also used to calculate the relative end position, based on W_EHA_DNA_SOURCE.START_POSITION, for W_EHA_VARIANT.END_POSITION for novel variants. |
Strand |
W_EHA_VARIANT.STRAND W_EHA_VARIANT_STG.STRAND |
This value is used to indicate forward or reverse strand. |
Variant_Type |
W_EHA_RSLT_SEQUENCING.VARIANT_TYPE |
Type of variant including snp, insertion, or deletion. Stored in VARIANT_TYPE in W_EHA_RSLT_SEQUENCING table. |
Reference_Allele |
W_EHA_VARIANT.REPLACE_TAG |
This value is used for the REPLACE_TAG; the REPLACE_TAG is used twice, once for the tumor sample and once for the normal sample. This value is used in conjunction with Tumor_Seq_Allele1 and Tumor_Seq_Allele2 to find existing VARIANT records for the tumor sample. Similarly, for the normal sample, the reference allele is used for the REPLACE_TAG. Note that for insertions the reference sequence uses a "-" and for deletions the allele sequence uses a "-". This is standard notation used in most references. The loader logic is implemented in the procedure that inserts into the variant table. For a deletion this value holds the deleted sequence, and for an insertion it holds "-". |
Tumor_Seq_Allele1 |
W_EHA_RSLT_SEQUENCING_X.ALLELE/ W_EHA_RSLT_SEQUENCING_X.ALLELE_CLOB W_EHA_VARIANT.REPLACE_TAG |
For sequencing results this value is used to construct the replace tag. '-' value represents a deletion. |
Tumor_Seq_Allele2 |
W_EHA_RSLT_SEQUENCING_X.ALLELE / W_EHA_RSLT_SEQUENCING_X.ALLELE_CLOB W_EHA_VARIANT.REPLACE_TAG |
For sequencing results this value is used to construct the replace tag. '-' value represents a deletion. |
Tumor_Sample_Barcode |
W_EHA_RSLT_SPECIMEN.SPECIMEN_NUMBER |
This value represents the tumor sample ID. This barcode ID follows the format TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID. |
Matched_Norm_Sample_Barcode |
W_EHA_RSLT_SEQUENCING.RESULT_SPEC_WID W_EHA_RSLT_SEQUENCING.SPECIMEN_WID |
This value represents the normal sample ID. This barcode ID follows the format TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID. The complete barcode ID is a foreign key to the RSLT_SPECIMEN record. |
Match_Norm_Seq_Allele1 |
W_EHA_RSLT_SEQUENCING_X.ALLELE / W_EHA_RSLT_SEQUENCING_X.ALLELE_CLOB W_EHA_VARIANT.REPLACE_TAG |
For sequencing results this value is used to construct the replace tag. '-' represents a deletion. |
Match_Norm_Seq_Allele2 |
W_EHA_RSLT_SEQUENCING_X.ALLELE / W_EHA_RSLT_SEQUENCING_X.ALLELE_CLOB W_EHA_RSLT_SEQUENCING.ALLELE W_EHA_VARIANT.REPLACE_TAG |
For sequencing results this value is used to construct the replace tag. '-' represents a deletion. |
Score |
W_EHA_RSLT_SEQUENCING.SCORE_VAF |
This is mapped to W_EHA_RSLT_SEQUENCING.SCORE_VAF. |
The VCF file format is described here:
http://tcga-data.nci.nih.gov/tcga/dataAccessMatrix.htm
The mapping currently follows the 1000 Genomes genotype data supporting VCF v4.1. Two kinds of VCF files are available at 1000 Genomes: sites and genotypes. The sites file does not contain genotype data or sample details, whereas the genotype VCF file contains individual genotype data along with the sample information. Hence the current loader supports genotype VCF files from 1000 Genomes. The current model supports only SNP information and does not support structural variation, including InDels, whose representation differs from SNPs in a VCF file.
The execution call of the stored procedure odb_result_util.process_vcf() is made in the script file load_vcf.sql. This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, SPECIES NAME, STUDY NAME, DATA SOURCE, and SPECIMEN VENDOR as input parameters.
This stored procedure creates external tables dynamically and uploads data from the source file into them. External tables allow Oracle to query data that is stored outside the database in flat files. The ORACLE_LOADER driver can be used to access data stored in any format that can be loaded by SQL*Loader. No DML can be performed on external tables, but they can be used for query, join, and sort operations. The loader supports 6000 specimens per VCF file.
The stored procedure dynamically creates up to seven external tables, where each table stores the common columns (such as chromosome and ref) along with up to 990 specimens: the first external table stores the common attributes and the first 990 specimens, the second stores the common attributes and the next 990 specimens, and so on. Dynamic SQL is used first to parse the header row in the data file and store it in an external table named VCF_SPEC_!!SEQ!!, which contains the eight header columns. These columns are then appended and parsed to obtain the specimen identifiers. A second external table named VCF_DATA_!!SEQ!! stores the complete result data. This table maps all the fields existing in the result file.
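The 990-specimen partitioning above can be sketched directly. This illustrative Python function (the actual splitting happens in dynamically generated DDL) chunks the specimen columns into the per-external-table batches:

```python
def specimen_batches(specimen_columns, batch_size=990):
    """Split the specimen columns of a VCF header into the per-external-table
    batches described above: 990 specimens per table, so a full 6000-specimen
    file needs seven tables (6 x 990 = 5940, plus one table for the rest).
    Illustrative sketch only.
    """
    return [specimen_columns[i:i + batch_size]
            for i in range(0, len(specimen_columns), batch_size)]
```

Each batch corresponds to one dynamically created external table holding the common columns plus that batch's specimens.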
Note:
In both external tables, the "!!SEQ!!" string will be replaced by ETL_PROC_ID at run time. There will be two multi-table insert statements written dynamically. One inserts records into W_EHA_VARIANT_STG and the other inserts records into the W_EHA_RSLT_SEQUENCING and W_EHA_RSLT_SEQUENCING_X tables.
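The header parse described above can be pictured with a short Python sketch. This is an illustrative helper, not code from the loader (which does this with dynamic SQL over the VCF_SPEC external table), and it assumes a tab-delimited genotype VCF header row:

```python
def specimen_ids(header_line):
    """Return the specimen (sample) identifiers from a VCF header row.

    The eight fixed columns (#CHROM, POS, ID, REF, ALT, QUAL, FILTER,
    INFO) plus FORMAT come first; every column after FORMAT names a
    specimen.
    """
    columns = header_line.rstrip("\n").split("\t")
    return columns[9:]  # skip the 8 fixed columns and FORMAT
```

For example, a header row ending in `FORMAT\tNA00001\tNA00002` yields the specimen list `["NA00001", "NA00002"]`.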
A SELECT statement that inserts data into W_EHA_VARIANT_STG computes the overlap value by comparing the reference sequence with the allele sequence. Using that overlap value, the reference and allele sequences are shortened and the start and end positions are incremented. The overlap value is also used to create a replace tag from the shortened reference and allele sequences.
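The trimming logic can be sketched in Python. This is a hypothetical rendering of the description above (the actual implementation is PL/SQL inside odb_result_util); it assumes the overlap is the run of leading bases shared by REF and ALT, and that END_POSITION is derived from POS and the REF length:

```python
def trim_and_tag(pos, ref, alt):
    """Trim the leading bases shared by REF and ALT, advance the start
    position by the overlap, and build a replace tag "ref/alt" where an
    empty side is written as "-" (insertion/deletion notation)."""
    overlap = 0
    while overlap < min(len(ref), len(alt)) and ref[overlap] == alt[overlap]:
        overlap += 1
    new_ref = ref[overlap:] or "-"
    new_alt = alt[overlap:] or "-"
    start = pos + overlap
    # END_POSITION from POS and the number of bases in REF; for a pure
    # insertion (REF fully consumed by the overlap) it collapses to start.
    end = pos + len(ref) - 1 if len(ref) > overlap else start
    replace_tag = f"{new_ref}/{new_alt}"
    return start, end, replace_tag
```

So a deletion record REF=ATCTA, ALT=AT shortens to the tag "CTA/-" with the start advanced by two bases, and an insertion REF=AT, ALT=ATCTA becomes "-/CTA".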
After inserting the records into the W_EHA_VARIANT_STG table, the PROCESS_VARIANT() procedure is called, which populates the W_EHA_VARIANT table.
Another dynamic SQL statement is used to parse the data columns from the data file into an external table (VCF_DATA_!!SEQ!!) and to populate the data from the external table into the W_EHA_RSLT_SEQUENCING and W_EHA_RSLT_SEQUENCING_X tables using multi-table insert statements.
The procedure can load any number of samples, processing them in batches of 30. Dynamic SQL is used to create and load the data into an external table and to populate the W_EHA_RSLT_SEQUENCING and W_EHA_RSLT_SEQUENCING_X tables.
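The batching behavior amounts to simple chunking of the specimen column list. A hypothetical Python sketch (the loader itself does this with dynamic SQL, recreating the external table between batches):

```python
def specimen_batches(specimens, batch_size=30):
    """Split the specimen list into successive batches; the loader
    processes one batch of 30 per pass over the external data table."""
    return [specimens[i:i + batch_size]
            for i in range(0, len(specimens), batch_size)]
```

For 70 specimens this yields batches of 30, 30, and 10.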
Two kinds of VCF files are available at 1000 Genomes, namely sites and genotypes. A sites file does not contain genotypic data or sample details, whereas a genotype VCF file contains individual genotypic data along with sample information. Therefore, the current loader supports genotype VCF files from 1000 Genomes. The current model supports only SNP information, including InDels, and does not support Structural Variation, whose representation differs from SNPs in a VCF file. If the user has a merged file containing SNP, INDEL, and SV data, the SV data should be removed from the file before loading results.
The sample information is present on the header row of the VCF data following the "FORMAT" column; each column after "FORMAT" represents one sample.
The data type representation format and its order for each sample are specified in the "FORMAT" column. The alternate alleles for all samples are stored in the ALT column; to get the allele information for each sample, the GT identifier from the FORMAT column is used. The allele value is represented in numerals (for example, 0/1, 1/2), where '0' represents the reference allele and '1' and '2' represent the alternate alleles in the order specified in ALT.
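The GT lookup can be illustrated with a small Python helper. This is an assumption-labeled sketch, not the loader's code; it assumes a comma-separated ALT field and a GT value using either "/" (unphased) or "|" (phased) as the separator:

```python
def resolve_genotype(ref, alt, gt):
    """Map a VCF GT value such as "0/1" or "1|2" to allele sequences.

    '0' is the REF allele; '1', '2', ... index into the comma-separated
    ALT list in order.
    """
    alleles = [ref] + alt.split(",")
    sep = "|" if "|" in gt else "/"  # phased genotypes use "|" instead of "/"
    return [alleles[int(i)] for i in gt.split(sep)]
```

For example, with REF=A and ALT=G,T, a GT of 1/2 resolves to the alleles G and T.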
Note:
The batch file requires Oracle Wallet to be set up to run correctly. The batch file requires the following parameters to be passed in order. Any spaces present in the parameters should be enclosed within "":
File name of the VCF data file
Oracle Directory Object
Study name
Datasource Name
Species Name
Specimen Vendor Number
Wallet Name
The following command will be used to execute the batch file to upload the VCF data.
C:\> VCF_loader.bat <VCF file Name><Oracle Directory Object><Study name><Data Source Name><Species Name><Specimen Vendor Number><Wallet Name>
For example,
d:\>vcf_loader.bat "d:\trc-v2\testdata\ALL.chr20.merged_beagle_mach.20101123.snps_indels_svs.genotypes.vcf","ODB_LOAD","Study1","cdm","Homo sapiens"," ","vendor3","db004_w"
The shell script requires the following parameters to be passed in order. Any spaces present in the parameters should be enclosed within "":
Name of the VCF data file
Oracle Directory Object
Study name
Datasource Name
Species Name
Specimen Vendor Number
Schema/Wallet Name. If Oracle Wallet is set up, enter the Wallet name; else enter the schema name
1 if Oracle Wallet is set up else 0
If previous parameter was 0 then enter schema username
The shell script can be run with or without Oracle Wallet being set up. The following command will be used to execute the shell script to upload the VCF data when Oracle Wallet is set up.
sh vcf_loader.sh <VCF file Name><Oracle Directory Object><Study Name><Data Source Name><Species name><Specimen Vendor Number><Schema Name>1
For example,
sh vcf_loader.sh "/trc-v2/testdata/ALL.chr20.merged_beagle_mach.20101123.snps_indels_svs.genotypes.vcf","ODB_LOAD","Study1","cdm","Homo sapiens"," ","vendor3","db004_w",1
The following command will be used to execute the shell script to upload the VCF Data when Oracle Wallet is not set up.
sh vcf_loader.sh <VCF file Name><Oracle Directory Object><Study Name><Data Source Name><Species Name><Specimen Vendor Number><Schema Name>0<schema username>
For example,
sh vcf_loader.sh "/trc-v2/testdata/ALL.chr20.merged_beagle_mach.20101123.snps_indels_svs.genotypes.vcf","ODB_LOAD","Study1","cdm","Homo sapiens"," ","vendor3","db004_w",0,gdm
Table 4-5 Mapping of VCF Result File
| Column Name in Result File | Table and Column Name in ODB | Description |
|---|---|---|
| CHROM | W_EHA_VARIANT.CHROMOSOME, W_EHA_RSLT_SEQUENCING.CHROMOSOME_WID | This field is used with POS to find the correct DNA_SOURCE record, which in turn is used to find or create a VARIANT record. Note that three values are needed to find existing VARIANT records: the chromosome, the POS, and the replace tag, which is a notation combining the reference and allele sequences. For NOVEL variants, a new record is created in the W_EHA_VARIANT table with the chromosome value. |
| POS | W_EHA_VARIANT_STG.START_POSITION, W_EHA_VARIANT_STG.END_POSITION, W_EHA_VARIANT.ABSOLUTE_POSITION | This field is used as described for the CHROM column. For NOVEL variants, POS is stored as-is in the ABSOLUTE_POSITION column of the W_EHA_VARIANT table, while W_EHA_VARIANT.START_POSITION is calculated relative to W_EHA_DNA_SOURCE.START_POSITION using the POS value. Since VCF does not carry end position information, when inserting novel variants into the VARIANT table, END_POSITION needs to be calculated from POS and the number of bases in REF. |
| REF | W_EHA_VARIANT.REPLACE_TAG | This value is used in conjunction with the ALT sequence to construct a replace tag value used to find existing VARIANT records. The replace tag is constructed with the reference first, followed by "/", and then the allele sequence. Note that for insertions the reference sequence uses "-" and for deletions the allele sequence uses "-"; this is the standard notation used in most references. For some InDels the representation can be as follows: an insertion can be "AT/ATCTA" and a deletion can be "ATCTA/AT". The logic is implemented in a procedure that can be called with any of the above formats. |
| ALT | W_EHA_RSLT_SEQUENCING_X.ALLELE, W_EHA_VARIANT.REPLACE_TAG | For sequencing results, this value is used to construct the replace tag and to store the value in the results table as per the rules. |
| FILTER | W_EHA_RSLT_SEQUENCING_X.FILTER | All variants, whether "PASS" or "FAIL", are loaded. The value is "PASS"/"FAIL", based on the quality value and the percentage of samples having data, for example "q10;s50". If filtering is not done, the field contains "." (also loaded). |
| INFO | This field is not mapped to any database column, but is used in a cursor to filter out the SV data. | This is mainly used for integrated VCF files which contain SNP, INDEL, and SV data. The first field before the semicolon specifies the variation type ID; this is used to filter out the SV data from the integrated VCF file. |
| FORMAT.GT | W_EHA_VARIANT.REPLACE_TAG, W_EHA_RSLT_SEQUENCING_X.ALLELE | Used to get the allele information for each sample. It is represented as '<allele1_num>/<allele2_num>'. In some cases "|" appears instead of "/". |
| FORMAT.FT | W_EHA_RSLT_SEQUENCING_X.GENOTYPE_FILTER | Sample genotype filter indicating if this genotype was "called" (similar in concept to the FILTER field). Again, "PASS" indicates that all filters have been passed; a semicolon-separated list of codes indicates filters that fail; "." indicates that filters have not been applied. These values should be described in the meta-information in the same way as FILTERs (String, no white space or semicolons permitted). |
| FORMAT.GQ | W_EHA_RSLT_SEQUENCING.SCORE_VAF | This is mapped to W_EHA_RSLT_SEQUENCING.SCORE_VAF. |
For each loader, a new VARCHAR2(100) variable is defined. This variable is set before every SQL call to specify the context information. It is then used in standard error logging, where the RECORD_DETAIL column of the W_EHA_RSLT_ERR_LOG table is populated with a simple context error message. Some common context messages are 'Generating ETL PROC ID', 'Verifying result file type and creating result file record', 'Verifying Species Name', 'Verifying Study Name', 'Processing Variant Staging records', 'Processing variant records', 'Processing result records', and 'Dropping external tables'.
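The same pattern can be sketched in Python: a context string is updated before each step and copied into the error record when a step fails. This is illustrative only; the loaders implement it in PL/SQL against W_EHA_RSLT_ERR_LOG:

```python
def run_steps(steps):
    """Run (context, action) pairs in order; on the first failure, log an
    error record carrying the current context message, mirroring how
    RECORD_DETAIL and ERROR_DESC are populated."""
    error_log = []
    for context, action in steps:
        try:
            action()
        except Exception as exc:
            error_log.append({"RECORD_DETAIL": context,
                              "ERROR_DESC": str(exc)})
            break
    return error_log
```

A failure during result processing would thus be logged with RECORD_DETAIL 'Processing result records' and the accompanying ORA error text.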
Examples of common error log records which are used in CGI, MAF, and VCF loaders are shown in Table 4-6.
Table 4-6 Columns Generated while Parsing CGI Files
| Column Name | Description | Examples |
|---|---|---|
| ROW_WID | Record identifier | |
| RESULT_TYPE_NAME | Result Type Name | For CGI: CGI masterVar |
| SPECIES_NAME | Species Name | Homo sapiens |
| SPECIMEN_NUMBER | Specimen Number parsed from the comments section of the file | GS00706-DNA_C01 |
| SPECIMEN_VENDOR_NUMBER | Name of vendor passed as a parameter with the batch file | For CGI: CGI |
| DATASOURCE_NM | Data source Name | CDM |
| ERROR_DESC | Error message | ORA-01403 NO DATA FOUND |
| RECORD_DETAIL | Context in which the process failed | Verifying Study Name |
| ETL_PROC_WID | Identifier for each load | |
The following errors are handled by the loader processes.
When external table processing for the Specimen Number fails, an error will be logged as 'Processing external table for specimen header'. When external table processing for the entire CGI result record fails, an error will be logged as 'Processing external table for specimen data'. If the process fails to retrieve the specimen ID from the W_EHA_RSLT_SPECIMEN table, the error 'Retrieving specimen number from external table' will be logged along with the ORA error in the W_EHA_RSLT_ERR_LOG table.
Similar to CGI, when external table processing for the Specimen Number fails, an error will be logged as 'Processing external table for specimen header'. If the process fails to retrieve the specimen ID from the W_EHA_RSLT_SPECIMEN table, the error 'Retrieving specimen number from external table' will be logged along with the ORA error in the W_EHA_RSLT_ERR_LOG table.
After retrieving the specimen ID from the header table, if the process fails while dropping the external header table, the error 'Dropping specimen header external table' will be logged into the error table. For VCF, a batch of 30 specimens is processed at a time, so if the process fails on the first batch, the error 'Processing data table for first set of specimen' will be logged into the error table.
After processing each batch of specimens, the loader drops the external table and recreates it for the next set of specimens. If the process fails while dropping the external table, the error 'Processing data table for first set of specimen' will be logged into the error table. If the process fails during the next specimen batch processing, 'Processing data table for next set of specimen' will be logged into the error table.
At the end of result processing, the loader drops the last version of the external table. If the process fails, the error 'Dropping last version of data table' will be logged into the error table.
Only one external table is created for MAF result data. If the process fails at this step, the error 'Processing MAF external data table' will be logged into the error table. A first bulk insert statement creates the variant staging and specimen staging records. If the process fails at this stage, the error 'Processing variant staging and specimen staging record' will be logged into the error table.
If the process fails while retrieving the specimen ID for a record from the global temporary table, the error 'Processing variant staging and specimen staging record' will be logged into the error table.
If the process fails while retrieving the specimen ID, the error 'Verifying Specimen ID' will be logged into the error table. If the process fails while processing the hybridization header external table, 'Processing hybridization header table' will be logged. If the process fails while retrieving the hybridization name from the header table, the error 'Retrieving hybridization name from header table' will be logged. Similar to VCF, a batch of 45 expression records is processed at a time; if the process fails at this stage, the error 'Processing data table for the set of gene expression' will be logged. If the process fails while dropping the expression and hybridization external tables, the error messages 'Dropping expression data table' and 'Dropping hybridization table', respectively, will be logged.
Note:
For the above loaders, the RECORD_DETAIL column of the W_EHA_RSLT_ERR_LOG table will only describe the context in which the process failed. This section provides a summary of templates for running each data loader, along with an example using sample summary input files.
The following is a glossary of common data entries and a brief description of their point of origin.
Table 4-7 Glossary of Parameter Templates
| Parameter Name | Description |
|---|---|
| user | DB user name |
| host_url:dbname | DB connection string (system-add:port:dbname); used if no Wallet is provided |
| schema_name | DB instance name |
| species_file | File with a list of Species names |
| input_file(.dat) | Input file, with path if not present in the same folder (file type) |
| wallet_name | Name of the Wallet created |
| sqlnet.ora_path | Path to the directory containing the files sqlnet.ora and tnsnames.ora |
| wallet_dir | Directory with wallet files |
| species_wid | Primary key ID from W_EHA_SPECIES |
| file_type_Code | File_type_code from W_EHA_RSLT_FILE_TYPE |
| file_version | File_type_version from W_EHA_RSLT_FILE_TYPE |
| vendor_name | User input of vendor name for result files |
| data_source_name | datasource_name from W_EHA_DATASOURCE |
| species_name | species_name from W_EHA_SPECIES |
| specimen_number | If the datasource is CDM, give the SPECIMEN_NUMBER under W_EHA_SPECIMEN_PATIENT_H (has to be present in the CDM table, but should only be given for gene_expression_loader) |
| specimen_vendor_number | If the datasource is CDM, give the SPECIMEN_VENDOR_NUMBER under W_EHA_SPECIMEN_PATIENT_H (has to be present in the CDM table) |
Loads probe annotation files to the W_EHA_PROBE table.
Linux:
>sh probe_loader.sh <input_file(text)> <schema_name> 0 <user>
For example,
>probe_loader.sh homo-sapiens-9606-gene-symbol.gmt DB001 0 gdm
Windows:
>probe_loader.bat <input_file(text)> <wallet_name>
Linux:
>sh probe_loader.sh <input_file(text)> <wallet_name> 1
Example:
>probe_loader.bat probeset_annotation_summary_ref.txt DB001Wallet
Loads gene expression result files.
Linux:
>sh gene_expression_loader.sh <input_file(text)> <oracle_directory_object> <study_name> <data_source_name> <species_name> <specimen_number> <specimen_vendor_number> <schema_name> 0 <user>
Example:
>sh gene_expression_loader.sh mas5_expression_summary_part1.txt "ODB_LOAD" "study1" "CDM" "Homo sapiens" "exp01" "affy01" DB001 0 gdm
Windows:
>gene_expression_loader.bat <input_file(text)> <oracle_directory_object> <study_name> <data_source_name> <species_name> <specimen_number> <specimen_vendor_number> <wallet_name>
Linux:
>sh gene_expression_loader.sh <input_file(text)> <oracle_directory_object> <study_name> <data_source_name> <species_name> <specimen_number> <specimen_vendor_number> <wallet_name> 1
Example:
>sh gene_expression_loader.sh mas5_expression_summary_part1.txt "ODB_LOAD" "study1" "CDM" "Homo sapiens" "exp01" "affy01" DB001Wallet 1
Loads variation/mutation results from CGI masterVar files to RSLT_Sequencing and RSLT_Nocall tables.
Linux:
>sh CGI_loader.sh <input_file(.tsv)> <oracle_directory_object> <species_name> <study_name> <data_source_name> <specimen_vendor_number> <schema_name> 0 <user>
Example:
>sh CGI_loader.sh "masterVarBeta-NA19240-L2-200-37-ASM.tsv" "ODB_LOAD" "Homo Sapiens" "Study1" "CDM" "CGI" DB001 0 gdm
Windows:
>CGI_loader.bat <input_file(.tsv)> <oracle_directory_object> <species_name> <study_name> <data_source_name> <specimen_vendor_number> <wallet_name>
Linux:
>sh CGI_loader.sh <input_file(.tsv)> <oracle_directory_object> <species_name> <study_name> <data_source_name> <specimen_vendor_number> <wallet_name> 1
Example:
>sh CGI_loader.sh "masterVarBeta-NA19240-L2-200-37-ASM.tsv" "ODB_LOAD" "Homo Sapiens" "Study1" "CDM" "CGI" DB001Wallet 1
Loads variation/mutation results from MAF files to RSLT_Sequencing and RSLT_Nocall tables.
Linux:
>sh MAF_loader.sh <input_file(.maf)> <file_type_Code> <file_version> <vendor_name> <data_source_name> <species_name> <specimen_number> <specimen_vendor_number> <schema_name> 0 <user>
Example:
>sh MAF_loader.sh "hgsc.bcm.edu__Applied_Biosystems_Sequence_data_level3.maf" "MAF" "2.2" "MAF" "CDM" "Homo sapiens" "" "MAF" DB001 0 gdm
Windows:
>MAF_loader.bat <input_file(.maf)> <file_type_Code> <file_version> <vendor_name> <data_source_name> <species_name> <specimen_number> <specimen_vendor_number> <wallet_name>
Linux:
>sh MAF_loader.sh <input_file(.maf)> <file_type_Code> <file_version> <vendor_name> <data_source_name> <species_name> <specimen_number> <specimen_vendor_number> <wallet_name> 1
Example:
>sh MAF_loader.sh "hgsc.bcm.edu__Applied_Biosystems_Sequence_data_level3.maf" "MAF" "2.2" "MAF" "CDM" "Homo sapiens" "" "MAF" DB001Wallet 1
Loads variation/mutation results from VCF files to RSLT_Sequencing and RSLT_Nocall tables.
Linux:
>sh VCF_loader.sh <input_file(.vcf)> <oracle_directory_object> <species_name> <study_name> <data_source_name> <specimen_vendor_number> <schema_name> 0 <user>
Example:
>sh VCF_loader.sh "ALL.chr20.merged_beagle_mach.20101123.snps_indels_svs.genotyp_copy.vcf" "ODB_LOAD" "Homo Sapiens" "Study1" "CDM" "VCF" DB001 0 gdm
Windows:
>VCF_loader.bat <input_file(.vcf)> <oracle_directory_object> <species_name> <study_name> <data_source_name> <specimen_vendor_number> <wallet_name>
Linux:
>sh VCF_loader.sh <input_file(.vcf)> <oracle_directory_object> <species_name> <study_name> <data_source_name> <specimen_vendor_number> <wallet_name> 1
Example:
>sh VCF_loader.sh "ALL.chr20.merged_beagle_mach.20101123.snps_indels_svs.genotyp_copy.vcf" "ODB_LOAD" "Homo Sapiens" "Study1" "CDM" "VCF" DB001Wallet 1