Oracle® Health Sciences Omics Data Bank Programmer's Guide
Release 2.5

E35680-04

4 Loaders for Result Data

This chapter includes the following topics:

4.1 Prerequisites

Before using the result loaders, ensure that at least one version of the Ensembl reference files has been loaded using the Java loader.

Note:

The loaded reference must match the reference used to align the result files.

In addition, ensure that Oracle optimizer statistics are gathered after the reference data is loaded.

After the reference is loaded, perform the following steps to initialize your database:

  1. Create a record in the W_EHA_RSLT_STUDY table. A sequence (W_EHA_RSLT_STUDY_S) is associated with this table so that as many study records as required can be created for testing. Result data is now partitioned using the foreign key to the study, so add the required number of studies to this table. The result loaders use the value of RESULT_STUDY_NAME to look up the corresponding study primary key. For simple testing, one record is sufficient.

  2. Configure a W_EHA_DATASOURCE record to identify the CDM schema used to validate specimen numbers. Each result record to be loaded must reference a specimen that exists in the W_EHA_SPECIMEN_PATIENT_H table in the CDM schema. Note that the CDM schema needs the v1.01 patch installed, which adds a SPECIMEN_VENDOR_NUMBER field for vendor-specific information. The W_EHA_DATASOURCE record can use a database link if the CDM schema is in another instance.

  3. Ensure that the ODB schema has SELECT privileges on the W_EHA_SPECIMEN_PATIENT_H table in the CDM schema.

  4. Add all the specimens required for the example files to the W_EHA_SPECIMEN_PATIENT_H table in the CDM schema.

To install the loaders, copy the Result_Loader folder into a directory. You must run a loader from the directory it is installed in (on Linux this requires execute permission on all .sh scripts).

4.1.1 Setting Default Cache Sizes for Result Loading

Each result table has a corresponding sequence, named after the table with an "_S" suffix. Each of these sequences has a default cache size that reflects the average number of records that might be inserted in a typical load of result files.

Most of the sequences have a default cache size of 6000, whereas all of the sequencing-related result table sequences have a cache size of 15000. There may be a need to load much larger files (for example, TCGA VCF data can have 4.5 million rows). For larger files, it is recommended that the DBA adjust all of the corresponding sequence cache sizes to at least 100,000.

Lower sequence cache sizes can result in waits as each parallel process tries to obtain the next cache of sequence values. The decision to increase a sequence cache size should be based on the number of rows expected to be inserted during any load.

An example of the SQL to alter a sequence is:

alter sequence w_eha_rslt_gene_exp_s cache 100000;

The current list of sequences used by the relevant result loaders is as follows:

  1. VCF loader

    • ODB_RSLT_GVCF_UTIL (default cache 15000)

      • W_EHA_RSLT_CONFLICT_S

      • W_EHA_RSLT_NOCALL1_S

      • W_EHA_RSLT_NON_VARIANT_S

      • W_EHA_RSLT_SEQUENCING1_S

      • W_EHA_RSLT_STRUCT_VAR_S

      • W_EHA_RSLT_SV_BREAKEND_S

  2. MAF Loader

    • ODB_RSLT_MAF_UTIL (default cache 15000)

      • W_EHA_RSLT_SEQUENCING1_S

  3. RNA-seq loader

    • ODB_RSLT_RNA_SEQ_UTIL (default cache 6000)

      • W_EHA_RSLT_RNA_SEQ_S

  4. CNV loader

    • ODB_RSLT_CNV_UTIL (default cache 6000)

      • W_EHA_RSLT_COPY_NBR_VAR_S

  5. Single channel loader

    • ODB_RSLT_SINGLE_CHANNEL_UTIL (default cache 6000)

      • W_EHA_RSLT_GENE_EXP_S

  6. Dual channel loader

    • ODB_RSLT_DUAL_CHANNEL_UTIL (default cache 6000)

      • W_EHA_RSLT_2CHANNEL_GXP_S

4.2 Overview of Result Loaders

The following result loaders are provided, one for each file type:

Additionally, there are:

All loaders can be run using .bat files in Windows or shell scripts in Linux.

The .bat files to be run in Windows are as follows:

Note:

Oracle Wallet must be set up before the batch files can be run successfully.

If Oracle Wallet is set up, the shell script uses those credentials to run SQL*Plus.

If Oracle Wallet is not set up, the shell script prompts for a password and connects to SQL*Plus.

The shell scripts to be run in Linux are as follows:

4.3 Version Info Utility

The Version info utility is used to check for all available versions in the database instance for any of the version types allowed in the W_EHA_VERSION table. It is a command line API which:

4.3.1 Functional Description

The version info utility is a standalone script, version_info.sh (version_info.bat for Windows), with an optional named argument '-list_ver', to which a reference version type is passed. The loader calls the odb_reference_util.list_version_info function, passing the version type argument as a parameter. The function queries the W_EHA_VERSION table for VERSION_LABEL values where the VERSION_TYPE column equals the input parameter. These version labels are listed on the command line. If the user does not pass the parameter, the function displays the list of all versions for each version type.

4.3.2 Running the Version Check Utility

Name

version_info.sh - Lists Version Labels

Synopsis

version_info.sh -help

version_info.sh <...options>

Description

Validates input options and calls the loader script list_version.sql, which invokes odb_reference_util.list_version_info.

Options

(*) required

-db_wallet* <VARCHAR2>

Oracle wallet name, see "Setting up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string, for example, "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-list_ver <VARCHAR2>

A Reference Version type allowed for W_EHA_VERSION.VERSION_TYPE column. Known Version types are: 'DNA','PROTEIN','HUGO','PATHWAY','SIFT','POLYPHEN','PROBE', and 'BIOBASE'.

4.4 VCF Sequence Data Loader

4.4.1 Functional Description

The VCF loader takes the chromosome, position, and reference version details of a record from the VCF file and checks whether the corresponding region of that chromosome exists in the W_EHA_DNA_SOURCE table for the specific reference version given as input to the loader. If it is present, the loader maps this W_EHA_DNA_SOURCE record as W_EHA_VARIANT.SOURCE_WID. If the region does not exist, the loader ignores that record and does not log it to the W_EHA_RSLT_LOG table.

The loader does not validate chromosome numbers or position details. If it encounters invalid data of this kind, the loader ignores that record and does not log it to the W_EHA_RSLT_LOG table. The loader supports both chromosome representations in the VCF file: both chr10 and 10 are loaded without error. For the mitochondrial chromosome, the loader can read chrM, chrMT, M, and MT from the file.

A VCF file can contain information for multiple specimens; if one or more of the specimen values do not exist in the CDM schema, that particular specimen's data is not loaded, and the omission is logged in W_EHA_RSLT_LOG.

The loader does not validate the accuracy of the reference nucleotides in the database. It assumes that the same reference version to which the VCF data was mapped has been loaded into ODB. The user must make sure that the reference version matches between the results file being loaded and the reference data available in ODB. ODB now supports multiple reference versions, so the VCF loader must be told which reference data the results are mapped to. This version information, present in the VERSION_LABEL column of the W_EHA_VERSION table, must be given to the VCF loader as a parameter. Refer to the loader parameter list for more details.

The VCF loader has been extended from its existing support for the 1000 Genomes VCF 4.1 format, which includes mutations such as SNVs and small indels, to large structural variants and structural re-arrangements from 1000 Genomes VCF 4.1. The updated VCF loader also supports gVCF (genome VCF) data from Illumina, version 20120906a.

Details on 1000 Genomes VCF 4.1 specification can be found at the following link:

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

Details on gVCF specification can be found at the following link:

https://sites.google.com/site/gvcftools/home/about-gvcf

Following is a brief description of each data type supported by the VCF loader.

4.4.1.1 1000 Genomes VCF 4.1 Version

The 1000 Genomes VCF 4.1 format can be broadly classified into three categories based on the type of variant:

  1. SNV and small indel: These mutations are loaded into the W_EHA_RSLT_SEQUENCING, W_EHA_RSLT_NOCALL, and W_EHA_VARIANT tables. The W_EHA_VARIANT table is only populated for novel variants not already existing in that table.

    The loader populates the W_EHA_RSLT_NOCALL table based on GT values of './.'. Consecutive records with this nocall genotype are collapsed when loaded to this table. For example, if there are three records with POS 1001, 1002, and 1003 with the nocall genotype, only one record is created in the W_EHA_RSLT_NOCALL table, with START_POSITION 1001 and END_POSITION 1003.

  2. Large Structural Variation: These large structural changes in the genome are recorded in W_EHA_RSLT_STRUCT_VAR, W_EHA_VARIANT, and W_EHA_VARIANT_X. The W_EHA_VARIANT and W_EHA_VARIANT_X tables are only populated for novel variants not already existing in those tables. The W_EHA_VARIANT_X table stores only the ALT column value from the VCF file, in its ALLELE CLOB column.

    Currently, any variant with an ALT value of more than 4000 characters is not loaded. These records are logged by the loader in the W_EHA_RSLT_LOG table and also on the loader's SQL*Plus console.

  3. Structural re-arrangement: 1000 Genomes data for structural re-arrangements has not yet been released, and the 1000 Genomes manual offers neither detailed documentation nor proper examples covering all scenarios for this data set. In view of this, the loader is built on the following assumptions, based on the limited information available in the 1000 Genomes VCF 4.1 manual.

    1. The loader assumes that there is GT information in the form of either '0' or '1'.

    2. The loader identifies structural re-arrangements from the tag 'SVTYPE=BND' present in the INFO column.

    3. Allele depth (AD) and total depth (DP) are expected to be a single value by the loader.

    These genomic re-arrangements are stored in W_EHA_RSLT_SV_BREAKEND table.

    Since a VCF 4.1 file can contain all three of the above types of mutations in the same file, the VCF loader automatically distinguishes these three types of data using the INFO column, and records are created in their respective result tables as described above.

    • If the INFO column of the VCF file does not have the 'SVTYPE' tag, the mutation is considered either an SNV or a small indel and the data is loaded into the W_EHA_RSLT_SEQUENCING and W_EHA_RSLT_SEQUENCING_X (only if the allele length is greater than 500) tables.

    • If the INFO column has 'SVTYPE=BND', then this mutation is considered as structural re-arrangement and records are loaded to W_EHA_RSLT_SV_BREAKEND table.

    • If the INFO column has 'SVTYPE' other than 'BND', for example, 'SVTYPE=DEL', then these mutations are considered as large structural variants and records are loaded to W_EHA_RSLT_STRUCT_VAR table.
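The classification above can be sketched as follows. This is an illustrative Python sketch, not the loader's actual PL/SQL; the function names and returned table names are chosen here only to mirror the rules just described.

```python
def parse_info(info):
    """Split a VCF INFO column into a tag -> value dict.
    Bare flags such as 'IMPRECISE' map to None."""
    fields = {}
    for item in info.split(";"):
        if "=" in item:
            tag, value = item.split("=", 1)
            fields[tag] = value
        else:
            fields[item] = None
    return fields

def route_vcf_record(info):
    """Pick the target result table from the INFO column, per the rules above."""
    svtype = parse_info(info).get("SVTYPE")
    if svtype is None:
        return "W_EHA_RSLT_SEQUENCING"      # SNV or small indel
    if svtype == "BND":
        return "W_EHA_RSLT_SV_BREAKEND"     # structural re-arrangement
    return "W_EHA_RSLT_STRUCT_VAR"          # large structural variant
```

For example, an INFO value of "IMPRECISE;SVTYPE=DEL;END=321887" routes to the large structural variant table.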

4.4.1.2 Genome Variant Call Format (gVCF)

gVCF is designed to store both variant and non-variant information related to a single sample only. gVCF follows the 1000 Genomes VCF 4.1 conventions, with additional features such as siteConflicts. The VCF loader processes gVCF data into W_EHA_RSLT_SEQUENCING, W_EHA_RSLT_SEQUENCING_X (only if the allele length is greater than 500), W_EHA_RSLT_NON_VARIANT, W_EHA_RSLT_CONFLICT, W_EHA_RSLT_NOCALL, and W_EHA_VARIANT. The W_EHA_VARIANT table is only populated for novel variants not already existing in that table. Mutations are stored in W_EHA_RSLT_SEQUENCING and W_EHA_RSLT_SEQUENCING_X (only if the allele length is greater than 500), non-variant information is stored in the W_EHA_RSLT_NON_VARIANT table, conflict variants are stored in W_EHA_RSLT_CONFLICT, and nocall information is stored in the W_EHA_RSLT_NOCALL table. Collapsing of nocall data for consecutive positions is not done in a gVCF load, because the nocall data is already compressed in the gVCF file based on similar quality scores and other parameters defined when the gVCF file was created.

The following logic is used by the loader when populating gVCF records in the target tables.

  1. Any gVCF file record with a GT value of 0/1, 1/0, 1/2, or n/n, where n is not zero, is stored in the W_EHA_RSLT_SEQUENCING table.

  2. Any record with a GT value of 0/0 or just 0 is stored in the W_EHA_RSLT_NON_VARIANT table.

  3. Any record with a GT value of '.' or './.' and an ALT value of '.' is stored in the W_EHA_RSLT_NOCALL table.

  4. Any record with a GT value of '.' and an ALT value other than '.' is stored in the W_EHA_RSLT_CONFLICT table.
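The four rules can be sketched in Python as follows. This is an illustrative sketch only (the loader itself is PL/SQL); the function name and return values are invented here to mirror the GT/ALT routing above.

```python
def route_gvcf_record(gt, alt):
    """Route a gVCF record to a result table based on its GT and ALT values."""
    # Normalize phased separators and split into individual allele numbers.
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        # No-call genotype: the ALT value decides nocall versus conflict.
        return "W_EHA_RSLT_NOCALL" if alt == "." else "W_EHA_RSLT_CONFLICT"
    if any(a != "0" for a in alleles):
        return "W_EHA_RSLT_SEQUENCING"       # at least one non-reference allele
    return "W_EHA_RSLT_NON_VARIANT"          # 0/0 or haploid 0
```

For example, a record with GT "0/0" and ALT "." routes to the non-variant table, while GT "." with a real ALT value routes to the conflict table.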

4.4.1.3 FILE_TYPE_CODE and LOAD_MODE of VCF Loader

As mentioned earlier, the VCF loader can be used to load all types of VCF data, that is, SNP and small indel, large structural variation, structural re-arrangement, and gVCF data. Since there is just one loader for all these data types, the loader must be given some information to identify the file type. There is also one additional parameter that can be used to load data in a specific mode, as described below.

Two main parameters required by the VCF loader determine the input file type and the mode in which the user wants to load the data.

These parameters are:

  1. FILE_TYPE_CODE: The 'FILE_TYPE_CODE' parameter provides the file type information, based on which the file-type-related package is called internally by the loader. If a user is loading a VCF file containing SNPs and small indels, large SVs, or SV re-arrangements, the FILE_TYPE_CODE should be given as 'VCF'; if the user is loading a gVCF file, the FILE_TYPE_CODE should be given as 'GVCF'.

  2. LOAD_MODE: The VCF loader can load VCF and gVCF data in three different modes using the 'load_mode' parameter. The following are the three modes identified by the loader:

    1. 'VCF' mode: This mode can be used for both VCF and gVCF file types. It loads only mutations and nocall data, populating the W_EHA_RSLT_SEQUENCING, W_EHA_RSLT_SEQUENCING_X (only if the allele length is greater than 500), and W_EHA_RSLT_NOCALL tables respectively. If a user gives FILE_TYPE_CODE as GVCF and LOAD_MODE as VCF, all the non-variant and conflict records from the gVCF file are skipped and only variants and nocalls are loaded.

    2. 'GVCF' mode: This mode loads data into all the tables designed for gVCF, that is, the W_EHA_RSLT_SEQUENCING, W_EHA_RSLT_SEQUENCING_X (only if the allele length is greater than 500), W_EHA_RSLT_NON_VARIANT, W_EHA_RSLT_NOCALL, and W_EHA_RSLT_CONFLICT tables. Since a gVCF file has all the information about a genome (variants, non-variants, nocalls, and conflicts), this mode is best suited to the gVCF file type. However, a user who has a VCF file with all non-variant information for a specific genome can also use this mode to load the data. Usually, a VCF file does not contain all the non-variant information of a genome and only shows a few records as non-variant for a specimen when there is a mutation at that position for a different specimen. Such files should not be loaded in GVCF mode, because the user would end up loading incomplete non-variant information for that specimen.

    3. 'NON-VAR' mode: This mode loads data into only the W_EHA_RSLT_NON_VARIANT and W_EHA_RSLT_NOCALL tables. It is designed for scenarios where the user has already loaded mutations from a VCF file and now wants to load only the non-variant information from a gVCF file. Like GVCF mode, this mode is mostly suited to the gVCF file type, as a gVCF file contains all non-variant information. It is not advisable to load a VCF file that does not contain all the non-variant information for a specimen in NON-VAR mode, as incomplete non-variant information would be stored in the database.
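The three modes can be summarized as a simple filter. The sketch below is illustrative only; the record-class labels and function name are invented for this example, and show which classes of records each LOAD_MODE retains.

```python
# Record classes each LOAD_MODE retains (labels are illustrative, not loader API).
ALLOWED_BY_MODE = {
    "VCF":     {"variant", "nocall"},
    "GVCF":    {"variant", "non_variant", "nocall", "conflict"},
    "NON-VAR": {"non_variant", "nocall"},
}

def keep_record(record_class, load_mode):
    """Return True if a record of the given class is loaded under load_mode."""
    return record_class in ALLOWED_BY_MODE[load_mode]
```

So, for instance, conflict records from a gVCF file are kept only in GVCF mode, and variant records are skipped entirely in NON-VAR mode.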

4.4.2 Custom Format Specification in VCF

The VCF loader also supports loading custom data types from the FORMAT column of the VCF file. The following are the details of loading custom formats into ODB.

The custom format option lets users load certain VCF FORMAT column fields that are not currently mapped in ODB. Before executing the loader, the user must manually create a column in the W_EHA_RSLT_SEQUENCING, W_EHA_RSLT_SV_BREAKEND, and W_EHA_RSLT_STRUCT_VAR tables. For Exadata, the staging tables (W_EHA_STG_SEQUENCING, W_EHA_STG_SV_BREAKEND, and W_EHA_STG_STRUCT_VAR) must also be appended with the additional column, in the same order as defined in the main result tables.

The column names must follow a specific naming convention. If a user wants to map a 'PL' data type from the FORMAT column in the input file, the user must create a column named 'CUST_PL'. Once the column is created, give the details of the mapping to the loader in the 'custom_format' parameter as "PL=CUST_PL". To load multiple custom format columns, give the values comma separated, for example, "PL=CUST_PL,GL=CUST_GL".
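The shape of the 'custom_format' parameter can be illustrated with a small parser. This Python sketch is not the loader's own parsing (which is PL/SQL); the CUST_ prefix check simply reflects the naming convention described above.

```python
def parse_custom_format(spec):
    """Parse a custom_format string such as "PL=CUST_PL,GL=CUST_GL"
    into a {format_id: column_name} mapping, enforcing the CUST_ convention."""
    mapping = {}
    for pair in spec.split(","):
        fmt, col = (part.strip() for part in pair.split("="))
        if not col.upper().startswith("CUST_"):
            raise ValueError(
                f"column {col} does not follow the CUST_ naming convention")
        mapping[fmt] = col
    return mapping
```

For example, "PL=CUST_PL,GL=CUST_GL" yields a mapping of PL to the CUST_PL column and GL to the CUST_GL column.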

There is no longer a limitation on the number of custom formats supported by the loader as such, but the loader can read only 32 FORMAT data types at a time, so a custom format data type beyond the 32nd data type is not loaded. The order of the custom columns must be the same in the main tables and in the staging tables; otherwise there could be a mismatch in the data loaded. The custom column should be added with a VARCHAR2(%n) data type, as the field can contain comma-separated values and other alphanumeric characters. The %n should be defined based on the string requirement of the FORMAT data type for which the column is created.

This is a global change, and the global PRODUCT_PROFILE should always map to the correct DDL structure of the SEQUENCING table. The loader stored procedure does allow a per-load override of the mapping: an extra parameter in the call to process_vcf, named i_custom_format, lets the user supply a per-load mapping. In practice, however, this would rarely be used.

The reason is that the SEQUENCING table could end up with different data mapped to the user-defined columns. It is therefore always recommended that users set the proper mapping in the PRODUCT_PROFILE table. In the PRODUCT_PROFILE table, if there are two or more custom formats (for example, 'CUST_PL' and 'CUST_GL'), the VCF_FORMAT column should be inserted with the value 'CUST_PL, CUST_GL'.

4.4.3 Data Load

The execution call of the stored procedure odb_result_util.process_vcf() is designed in one of the script files (load_vcf.sql). This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, SPECIES NAME, STUDY NAME, DATA SOURCE, SPECIMEN VENDOR, REFERENCE VERSION, FILE FLAG, CUSTOM FORMAT (for new columns), PREPROCESS DIR, PREPROCESS FILE, DATA FILE PATH, DBFS STORE, FILE TYPE CODE, LOAD MODE, XREF DB, FILE VERSION, ALT FILE LOCATION, and READ SIZE as input parameters.

This stored procedure creates an external table dynamically and uploads data from the source file into it. External tables let Oracle query data that is stored outside the database in flat files. The ORACLE_LOADER driver can be used to access data stored in any format that can be loaded by SQL*Loader. No DML can be performed on external tables, but they can be used for query, join, and sort operations. The loader supports unlimited specimens per VCF file.

The loader creates three external tables: one to store metadata, one to store the specimen information, and one to store the actual specimen data for the result tables.

Dynamic SQL first parses the header row in the data file and stores it in an external table named gvcf_spec_!!SEQ!!, which contains nine header columns.

The loader then creates an external table to store the metadata of the result file. This metadata resides in the header part of the result file, in lines that start with the "##" string. This external table then populates the W_EHA_RSLT_FILE_QLF table, which is simply a name-value pair table.

Most of the file metadata is in an XML-like format, with an identifier or tag followed by an attribute value or XML definition. The metadata load sets QUALIFIER_TAG to the identifier before the "=" character (for example, FORMAT, INFO, FILTER), and everything after the equals sign is copied to the QUALIFIER_VALUE column of the W_EHA_RSLT_FILE_QLF table.
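The tag/value split can be sketched as follows; split_metadata_line is a hypothetical helper, shown only to illustrate how a "##" header line maps to the QUALIFIER_TAG and QUALIFIER_VALUE columns.

```python
def split_metadata_line(line):
    """Split a "##" metadata header line into (QUALIFIER_TAG, QUALIFIER_VALUE)
    at the first '=' character; later '=' characters stay in the value."""
    body = line.lstrip("#")
    tag, _, value = body.partition("=")
    return tag, value
```

For example, "##INFO=&lt;ID=DP,Number=1,Type=Integer&gt;" splits into the tag "INFO" and the value "&lt;ID=DP,Number=1,Type=Integer&gt;".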

The next external table declaration, for the data, has 10 columns for the static VCF fields, including the row number. This means that the table declaration can allow for 986 specimen columns. The loader checks for any file that has more than 986 specimens and raises an error. Users can use various command line tools (awk, cut, and so on) to create files with the appropriate number of columns if a file has more than 986 columns.

The SQL to create the data external table can exceed 32K, so a cursor is used to create it. The constant string used was broken up into three separate strings, and the statements now allow the list of specimen columns to be added as a separate string. This allows a statement of more than 32K to create the external table.

The loader retrieves the list of specimens from the external tables and stores it in a dynamic array. It also validates and parses the flex fields and stores them in an array.

Note:

In both the external tables, the "!!SEQ!!" string is replaced by ETL_PROC_ID at run time.

The loader first processes the reference data. A select statement that inserts data into W_EHA_VARIANT_STG computes the overlap value by comparing the reference with the allele sequence. Using that overlap value, the reference and allele sequences are shortened and the start and end positions are incremented. The overlap value is also used to create a replace tag from the shortened reference and allele sequences.
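The overlap computation and replace-tag construction can be sketched in Python as follows. This is an illustrative sketch of the logic described above, not the loader's SQL; the trimming of the shared leading bases and the "-" notation follow the replace-tag rules described later in this chapter.

```python
def make_replace_tag(ref, alt):
    """Trim the common leading bases shared by REF and ALT (the overlap),
    then build the 'reference/allele' replace tag, using '-' for an empty
    side. Returns the tag and the overlap, which shifts the start position."""
    overlap = 0
    while overlap < min(len(ref), len(alt)) and ref[overlap] == alt[overlap]:
        overlap += 1
    ref_trim = ref[overlap:] or "-"   # empty reference means an insertion
    alt_trim = alt[overlap:] or "-"   # empty allele means a deletion
    return ref_trim + "/" + alt_trim, overlap
```

For example, the insertion REF=AT, ALT=ATCTA yields the tag "-/CTA" with an overlap of 2, and the deletion REF=ATCTA, ALT=AT yields "CTA/-".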

After inserting the record into the W_EHA_VARIANT_STG table, a PROCESS_VARIANT() procedure is called which populates the W_EHA_VARIANT table.

The loader then parses the first set of result data, which does not use the W_EHA_VARIANT foreign key. This includes the W_EHA_RSLT_NOCALL, W_EHA_RSLT_NON_VARIANT, and W_EHA_RSLT_SV_BREAKEND tables.

The loader then parses the data to link all records to the W_EHA_VARIANT table. This includes W_EHA_RSLT_SEQUENCING, W_EHA_RSLT_SEQUENCING_X (if the allele length is greater than 500), and W_EHA_RSLT_STRUCT_VAR.

The loader makes a separate pass to parse the conflict data. This pass populates the W_EHA_RSLT_CONFLICT table.

At the end of all data parsing, the loader calls the ext_tables_error_log() procedure, which finds all rows whose ALT_SEQ columns are larger than 4000 characters (and were therefore not processed) and logs an error in the W_EHA_RSLT_LOG table.

The loader code is designed so that most of the parsing passes share common code. The code declares constant variables that compute the data for the gene WID and variant WID. Two constants, c_insert_result_nonvar and c_insert_result_var, generate the dynamic insert statements for non-variant and variant records respectively.

The code also creates a temporary table that stores the gene information computed from the first nine fields of the result file. This temporary table, loaded with the possible set of genes, is then joined with the specimens to populate the gene_wid for both variant and non-variant records.

A string constant is defined to get each value from the INFO column and to calculate the format offset positions; it is parsed dynamically for each field into the cursor separately.

4.4.3.1 Data Files

Two kinds of VCF files are available at 1000 Genomes: sites and genotypes. The sites file does not contain genotypic data or sample details, whereas the genotypes VCF file contains individual genotypic data along with sample information. The current loader supports only VCF files with sample and genotype data. The sample information is present on the header row of the VCF data, following the FORMAT column; each such column represents one sample.

The data type representation format, and its order for each sample, is specified in the FORMAT column. All the alleles for all samples are stored in the ALT column, but to get the allele information for each sample, the GT identifier from the FORMAT column for each sample is used. The allele value is represented in numerals (for example, 0/1, 1/2), where 0 represents the reference allele and 1 and 2 represent alleles in the order specified in ALT.
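The GT-to-allele translation can be sketched as follows; resolve_genotype is a hypothetical helper showing how the numeric GT values index into REF and the comma-separated ALT list.

```python
def resolve_genotype(gt, ref, alts):
    """Translate a numeric GT such as '0/1' or '1|2' into allele sequences.
    0 is the REF allele; 1, 2, ... index into the ALT allele list."""
    alleles = []
    for num in gt.replace("|", "/").split("/"):
        if num == ".":
            alleles.append(None)            # no call for this sample at this locus
        elif num == "0":
            alleles.append(ref)             # reference allele
        else:
            alleles.append(alts[int(num) - 1])  # nth ALT allele
    return alleles
```

For example, with REF=A and ALT=G,T, a GT of "1|2" resolves to the alleles G and T, and a haploid GT of "0" resolves to the single allele A.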

The following passes are used to process each VCF file.

  1. The file is parsed for header lines, indicated by "##", which are stored in the W_EHA_RSLT_FILE_QLFR table.

  2. The file is parsed to create all referenced W_EHA_VARIANT records. Note that this pass does not require the "GT" format field, so reference VCF files can be loaded.

  3. The file is parsed to add records that do not use W_EHA_VARIANT foreign keys. This includes W_EHA_RSLT_NOCALL, W_EHA_RSLT_NON_VARIANT, W_EHA_RSLT_SV_BREAKEND.

  4. The file is parsed to link all records to W_EHA_VARIANT. This includes W_EHA_RSLT_SEQUENCING, W_EHA_RSLT_STRUCT_VAR.

  5. In GVCF mode, the file is parsed to add W_EHA_RSLT_CONFLICT records.

  6. The last validation pass checks for all rows that have ALT_SEQ columns larger than 4000 characters to log warnings.

  7. The loader can read only 32 data types from the FORMAT column in the VCF file. Any data type, supported or custom, not present in the first 32 data types of the FORMAT column is not loaded.
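The FORMAT offset lookup and the 32-data-type limit can be sketched as follows; read_sample_field is a hypothetical helper, not part of the loader.

```python
def read_sample_field(fmt, sample, field, max_types=32):
    """Locate `field` in the colon-separated FORMAT column and return the
    matching value from a sample column. Fields beyond max_types are ignored,
    mirroring the loader's 32-data-type limit."""
    keys = fmt.split(":")[:max_types]
    values = sample.split(":")
    try:
        idx = keys.index(field)
    except ValueError:
        return None     # field absent, or beyond the 32-type limit
    return values[idx] if idx < len(values) else None
```

For example, with FORMAT "GT:GQ:DP" and a sample column "0/1:99:30", the GQ field resolves to "99".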

Note:

The batch file requires Oracle Wallet to be set up to run correctly.

Table Mapping of VCF Result File (snps, indels, large SVs, and re-arrangements) and gVCF

Column Name in Result File | Table and Column Name in ODB | Description
CHROM W_EHA_VARIANT.CHROMOSOME_WID

W_EHA_RSLT_SEQUENCING.CHROMOSOME_WID

W_EHA_RSLT_NON_VARIANT.CHROMOSOME_WID

W_EHA_RSLT_NOCALL.CHROMOSOME_WID

W_EHA_RSLT_CONFLICT.CHROMOSOME_WID

W_EHA_RSLT_STRUCT_VAR.CHROMOSOME_WID

W_EHA_RSLT_SV_BREAKEND.REF_CHROMOSOME_WID

This field is used with the POS to find the correct DNA_SOURCE record, to find a VARIANT record, or create a VARIANT record.

Three values are needed to find existing VARIANT records: the chromosome, the POS, and the replace tag, which is a notation combining the reference and allele sequences. For NOVEL variants, a new record is created in the W_EHA_VARIANT table with the chromosome value.

POS W_EHA_VARIANT_STG.START_POSITION

W_EHA_VARIANT_STG.END_POSITION

W_EHA_VARIANT.ABSOLUTE_POSITION

W_EHA_RSLT_SEQUENCING.START_POSITION

W_EHA_RSLT_NON_VARIANT.START_POSITION

W_EHA_RSLT_NOCALL.START_POSITION

W_EHA_RSLT_CONFLICT.START_POSITION

W_EHA_RSLT_STRUCT_VAR.START_POSITION

W_EHA_RSLT_SV_BREAKEND.REF_START_POSITION

This field is used as described above in the Chromosome column.

For novel variants, POS is stored in the ABSOLUTE_POSITION column in the W_EHA_VARIANT table.

For W_EHA_VARIANT.START_POSITION, the start position is relative to the value of W_EHA_DNA_SOURCE.START_POSITION, using the POS value. Since VCF does not carry end position information, when inserting novel variants into the VARIANT table the END_POSITION must be calculated from the POS value and the number of bases in REF.

ID W_EHA_RSLT_SV_BREAKEND.BREAKEND_ID This value is only stored for structural re-arrangement data where 'SVTYPE=BND' is present in the INFO column. The value in the ID column is stored in the BREAKEND_ID column.
REF W_EHA_VARIANT.REPLACE_TAG This value is used in conjunction with ALT sequence to construct a replace tag value used to find existing VARIANT records. The replace tag is constructed with reference first followed by "/" and then the allele sequence.

Note that for insertions, the reference sequence uses a "-" and for deletions the allele sequence uses "-". This is standard notation used in most references. At some in-dels, the representation can be as follows:

ins can be AT/ATCTA and del can be ATCTA/AT.

The logic is implemented in the procedure which can be called with any of the above formats.

ALT W_EHA_RSLT_SEQUENCING.ALLELE

W_EHA_RSLT_SEQUENCING_X.ALLELE_CLOB

W_EHA_RSLT_CONFLICT.ALLELE

W_EHA_RSLT_CONFLICT.ALLELE_CLOB

W_EHA_RSLT_STRUCT_VAR.ALLELE

W_EHA_RSLT_STRUCT_VAR.ALLELE_CLOB

W_EHA_VARIANT_X.ALLELE

W_EHA_RSLT_SV_BREAKEND.ALLELE

W_EHA_RSLT_SV_BREAKEND.ALLELE_CLOB

For sequencing results, this value is used to construct the replace tag, and the value is stored in the results table according to the rules above.

If the value of ALT is greater than 500 char then it is stored in W_EHA_RSLT_SEQUENCING_X.ALLELE_CLOB, W_EHA_RSLT_CONFLICT.ALLELE_CLOB and W_EHA_RSLT_STRUCT_VAR.ALLELE_CLOB.

INFO.SVTYPE If the INFO column of the VCF file does not have the 'SVTYPE' tag, the mutation is considered either an SNV or a small indel. If the INFO column has 'SVTYPE=BND', the mutation is considered a structural re-arrangement. If the INFO column has an 'SVTYPE' other than 'BND' (for example, 'SVTYPE=DEL'), the mutation is considered a large structural variant. -
INFO.END W_EHA_RSLT_NON_VARIANT.END_POSITION

W_EHA_RSLT_NOCALL.END_POSITION

W_EHA_RSLT_CONFLICT.END_POSITION

W_EHA_RSLT_STRUCT_VAR.END_POSITION

For gVCF and large SV data, the END_POSITION value is taken from the 'END=' tag present in the INFO column.
INFO.CIPOS W_EHA_RSLT_STRUCT_VAR.CIPOS_START

W_EHA_RSLT_STRUCT_VAR.CIPOS_END

For large SV, INFO column contains tag 'CIPOS' which contains two values. First value is stored in CIPOS_START and second value is stored in CIPOS_END.
INFO.CIEND W_EHA_RSLT_STRUCT_VAR.CIEND_START

W_EHA_RSLT_STRUCT_VAR.CIEND_END

For large SV, INFO column contains tag 'CIEND' which contains two values. First value is stored in CIEND_START and second value is stored in CIEND_END.
INFO.HOMLEN W_EHA_RSLT_STRUCT_VAR.HOMLEN For large SV, the value for the tag 'HOMLEN' present in INFO column is stored here.
INFO.HOMSEQ W_EHA_RSLT_STRUCT_VAR.HOMSEQ For large SV, the value for the tag 'HOMSEQ' present in INFO column is stored here.
INFO.MEINFO W_EHA_RSLT_STRUCT_VAR.MEINFO For large SV, the value for the tag 'MEINFO' present in INFO column is stored here.
INFO.MATE_ID W_EHA_RSLT_SV_BREAKEND.MATE_ID For structural re-arrangement data, MATE_ID tag value present in INFO column is stored here.
INFO.EVENT_ID For structural re-arrangement data, EVENT_ID tag value present in INFO column is stored here. -
INFO.PRECISION W_EHA_VARIANT.PERCISION

W_EHA_RSLT_SV_BREAKEND.PRECISION

For either large SV or SV re-arrangement data, if the 'IMPRECISE' tag is present in the INFO column, 'IMPRECISE' is populated in these columns; otherwise 'PRECISE' is populated.
FORMAT.GT Gets the allele information for each sample. It is represented as '<allele1_num>/<allele2_num>'. In some cases "|" is used instead of "/".
  • For diploid: 0|0 represents both alleles from REF.

  • 0|1 represents one allele from REF and the other from the ALT allele.

  • 0/2 represents one allele from REF and the other from the second ALT allele.

  • 1/3 represents one allele from the first ALT value and the second allele from the third ALT value.

  • For haploid: only one allele number is represented.

  • '.' or './.' is specified if a call cannot be made for a sample at that locus.

-
FORMAT.FT W_EHA_RSLT_SEQUENCING.GENOTYPE_FILTER

W_EHA_RSLT_NON_VARIANT.GENOTYPE_FILTER

W_EHA_RSLT_NOCALL.GENOTYPE_FILTER

W_EHA_RSLT_CONFLICT.GENOTYPE_FILTER

W_EHA_RSLT_STRUCT_VAR.GENOTYPE_FILTER

W_EHA_RSLT_SV_BREAKEND.GENOTYPE_FILTER

Sample genotype filter indicating whether this genotype was called (the concept is similar to the FILTER field). PASS indicates that all filters have been passed; a semicolon-separated list gives the codes of filters that fail; and "." indicates that filters have not been applied. These values should be described in the meta-information in the same way as FILTERs (white space and semicolons are not permitted in the string).
FORMAT.GQ W_EHA_RSLT_SEQUENCING.SCORE_VAF

W_EHA_RSLT_NON_VARIANT.SCORE_VAF

W_EHA_RSLT_NOCALL.SCORE_VAF

W_EHA_RSLT_CONFLICT.SCORE_VAF

W_EHA_RSLT_STRUCT_VAR.SCORE_VAF

W_EHA_RSLT_SV_BREAKEND.SCORE_VAF

Genotype quality, mapped to the SCORE_VAF column of each of the tables listed.
QUAL W_EHA_RSLT_SEQUENCING.QUAL

W_EHA_RSLT_NON_VARIANT.QUAL

W_EHA_RSLT_CONFLICT.QUAL

W_EHA_RSLT_STRUCT_VAR.QUAL

W_EHA_RSLT_SV_BREAKEND.QUAL

Quality of the allele sequence. Mapped to QUAL column of VCF file.
FILTER W_EHA_RSLT_SEQUENCING.FILTER

W_EHA_RSLT_NON_VARIANT.FILTER

W_EHA_RSLT_NOCALL.FILTER

W_EHA_RSLT_CONFLICT.FILTER

W_EHA_RSLT_STRUCT_VAR.FILTER

W_EHA_RSLT_SV_BREAKEND.FILTER

Filter applied to the particular record. Mapped to FILTER column in the VCF file.
FORMAT.DP W_EHA_RSLT_SEQUENCING.TOTAL_READ_COUNT

W_EHA_RSLT_NON_VARIANT.TOTAL_READ_COUNT

W_EHA_RSLT_NOCALL.TOTAL_READ_COUNT

W_EHA_RSLT_CONFLICT.TOTAL_READ_COUNT

W_EHA_RSLT_STRUCT_VAR.TOTAL_READ_COUNT

W_EHA_RSLT_SV_BREAKEND.TOTAL_READ_COUNT

Total number of reads that mapped to the defined allele sequence
FORMAT.AD W_EHA_RSLT_SEQUENCING.ALLELE_READ_COUNT W_EHA_RSLT_SEQUENCING.REFERENCE_READ_COUNT

W_EHA_RSLT_NON_VARIANT.REFERENCE_READ_COUNT

W_EHA_RSLT_NOCALL.ALLELE_READ_COUNT

W_EHA_RSLT_NOCALL.REFERENCE_READ_COUNT

W_EHA_RSLT_CONFLICT.ALLELE_READ_COUNT

W_EHA_RSLT_CONFLICT.REFERENCE_READ_COUNT

W_EHA_RSLT_STRUCT_VAR.ALLELE_READ_COUNT

W_EHA_RSLT_STRUCT_VAR.REFERENCE_READ_COUNT

W_EHA_RSLT_SV_BREAKEND.ALLELE_READ_COUNT

The first value is mapped to REFERENCE_READ_COUNT and subsequent values are mapped to ALLELE_READ_COUNT.
FORMAT.BQ W_EHA_RSLT_SEQUENCING.RMS_BASE_QUAL

W_EHA_RSLT_NON_VARIANT.RMS_BASE_QUAL

W_EHA_RSLT_NOCALL.RMS_BASE_QUAL

W_EHA_RSLT_CONFLICT.RMS_BASE_QUAL

W_EHA_RSLT_STRUCT_VAR.RMS_BASE_QUAL

W_EHA_RSLT_SV_BREAKEND.RMS_BASE_QUAL

RMS base quality at this position.
FORMAT.MQ W_EHA_RSLT_SEQUENCING.RMS_MAPPING_QUAL

W_EHA_RSLT_NON_VARIANT.RMS_MAPPING_QUAL

W_EHA_RSLT_NOCALL.RMS_MAPPING_QUAL

W_EHA_RSLT_CONFLICT.RMS_MAPPING_QUAL

W_EHA_RSLT_STRUCT_VAR.RMS_MAPPING_QUAL

W_EHA_RSLT_SV_BREAKEND.RMS_MAPPING_QUAL

RMS mapping quality at this position.
FORMAT.GQX W_EHA_RSLT_SEQUENCING.GENOTYPE_QUAL_X

W_EHA_RSLT_NON_VARIANT.GENOTYPE_QUAL_X

W_EHA_RSLT_NOCALL.GENOTYPE_QUAL_X

W_EHA_RSLT_CONFLICT.GENOTYPE_QUAL_X

W_EHA_RSLT_STRUCT_VAR.GENOTYPE_QUAL_X

W_EHA_RSLT_SV_BREAKEND.GENOTYPE_QUAL_X

GQX - minimum of {genotype quality assuming a variant position, genotype quality assuming a non-variant position}.
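The FORMAT field handling described in the mapping above (GT allele indices and AD read counts) can be sketched as follows. This is an illustrative Python sketch with hypothetical helper names, not the loader's actual PL/SQL implementation:

```python
def parse_gt(gt):
    # Split a GT value such as '0|1' or '1/3' into allele indices.
    # Both '|' (phased) and '/' (unphased) separators are handled;
    # a '.' allele (no call) becomes None.
    sep = "|" if "|" in gt else "/"
    return [None if a == "." else int(a) for a in gt.split(sep)]

def split_ad(ad):
    # The first comma-separated AD value maps to REFERENCE_READ_COUNT;
    # subsequent values map to ALLELE_READ_COUNT.
    counts = [int(v) for v in ad.split(",")]
    return counts[0], counts[1:]
```

For example, parse_gt('0/2') yields [0, 2]: one allele from REF and one from the second ALT value.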

4.4.4 Command-Line Argument List

Name

VCF_loader.sh - load records

Synopsis

VCF_loader.sh -help

VCF_loader.sh <...options>

Description

Validates input options and calls the loader script: for VCF and GVCF modes, load_vcf.sql#odb_rslt_gvcf_util.process_gvcf; for NON-VAR mode, odb_nonvar_gvcf_util.process_nonvar_gvcf.

Options

(*) required

-db_wallet* <VARCHAR2>

Oracle wallet name, see "Setting up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string, for example, "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see "Setting up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES, for example, "Homo sapiens" for humans

-study_name* <VARCHAR2>

Study name defined in W_EHA_RSLT_STUDY.RESULT_STUDY_NAME

-datasource_name* <VARCHAR2>

Datasource name defined in W_EHA_DATASOURCE.DATASOURCE_NM [default: CDM]

-specimen_vendor* <VARCHAR2>

Specimen vendor - Sample vendor number of specimen with genomic result data. If CDM is referenced, this value should be defined in the W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_VENDOR_NUMBER

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-file_flg* <CHAR>

File flag (E=external|S=copy to secure data file directory) [default: E]

-custom_format <VARCHAR2>

Custom format comma delimited (format_name=column(,format_name=column)*)

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR

-data_file_path <VARCHAR2>

File system path to secure data file directory

-dbfs_store <VARCHAR2>

Database file system store

-file_type_code* <VARCHAR2>

File type code (GVCF|VCF) [default: VCF]

-alt_file_loc <VARCHAR2>

Alternate file location link, for example, ftp:location, http:location

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

-data_file_dir <VARCHAR2>

File system path to Oracle directory object

4.4.5 Examples

UNIX: with 'GVCF' file_type_code and 'GVCF' load_mode

$ sh VCF_loader.sh -db_wallet odb_user -data_file "YRI.trio.2010_03.snps.genotypes_NEW.vcf" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_vendor "vendor1" -reference_version "GRCh37.p8" -file_flg "E" -custom_format null -preprocess_dir null -preprocess_file null -data_file_path null -dbfs_store null -file_type_code "GVCF" -load_mode "GVCF" -alt_file_loc null -read_size null

Windows: with 'GVCF' file_type_code and 'GVCF' load_mode

C:\> VCF_loader.bat -db_wallet odb_user -data_file "YRI.trio.2010_03.snps.genotypes_NEW.vcf" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_vendor "vendor1" -reference_version "GRCh37.p8" -file_flg "E" -custom_format null -preprocess_dir null -preprocess_file null -data_file_path null -dbfs_store null -file_type_code "GVCF" -load_mode "GVCF" -alt_file_loc null -read_size null

For file type VCF and load mode VCF, the same commands shown above are used; the user only has to pass -file_type_code "VCF" and -load_mode "VCF" as parameters to the SH or BAT command.

4.5 MAF Sequence Data Loader

Mutation Annotation Format (MAF) was created by TCGA. MAF files store variation data for multiple samples. The MAF file format is described here:

http://tcga-data.nci.nih.gov/tcga/dataAccessMatrix.htm

A MAF file consists of tab-delimited columns and is the simplest of the result file formats. ODB supports import of MAF versions 2.0-2.2.

4.5.1 Functional Description

The MAF loader currently loads only the variant records and does not load wild type (WT) information. For example:

  • If one of the alleles is WT while the other is a variant, the loader records only the variant allele

  • If both alleles are WT, the loader does not load either allele

  • If both alleles are variants and homozygous, it stores only one record

  • If both alleles are variants and heterozygous, it stores them as two separate records
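The rules above can be expressed as a small sketch. This is a hypothetical Python helper for illustration, not the loader's PL/SQL implementation:

```python
def alleles_to_store(ref, allele1, allele2):
    # Drop wild-type alleles (those equal to the reference); a
    # homozygous variant pair collapses to a single record, while a
    # heterozygous variant pair yields two separate records.
    variants = [a for a in (allele1, allele2) if a != ref]
    if len(variants) == 2 and variants[0] == variants[1]:
        return [variants[0]]  # homozygous variant: one record
    return variants           # heterozygous: up to two records
```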

The loader takes the chromosome and position details of a record from the MAF file and checks whether the corresponding region of that chromosome exists in the W_EHA_DNA_SOURCE table for the reference version specified as user input to the loader. If it is present, the loader maps this W_EHA_DNA_SOURCE record as W_EHA_VARIANT.SOURCE_WID. If the region does not exist, the loader ignores that record and does not log it to the W_EHA_RSLT_LOG table.

The loader does not validate chromosome number or position details. If it encounters invalid data, the loader ignores that record and does not log it to the W_EHA_RSLT_LOG table.

The loader does not validate the accuracy of the reference nucleotides in the database. It assumes that the MAF data was mapped against the same reference version that is loaded into ODB. You must ensure that the reference version matches between the result file being loaded and the reference data available in ODB.

A single MAF record has data for both a normal and a tumor sample. The loader loads data for both samples.

A MAF file contains information for multiple specimens. If one or more specimen values do not exist in the CDM schema, the loader skips that row and logs the error details in W_EHA_RSLT_LOG.

4.5.2 Data Load

The execution call of the stored procedure ODB_RSLT_MAF_UTIL.process_maf() is made in one of the script files (load_maf.sql). This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, SPECIES NAME, STUDY NAME, DATA SOURCE, SPECIMEN VENDOR, Reference Version, File Flag, Preprocess Directory, Preprocess File, Data File Path, DBFS Store, Alternate File Location (ftp location/http location), and Read Size as input parameters.

This stored procedure creates an external table dynamically and uploads data from the source file into it. External tables let Oracle query data that is stored outside the database in flat files. The ORACLE_LOADER driver can access data stored in any format that SQL*Loader can load. No DML can be performed on external tables, but they can be used for query, join, and sort operations.

Only one external table is created dynamically; it holds the complete result data. A global temporary table named W_EHA_MAF_SPECIMEN is created explicitly to store the normal and tumor sample barcodes. There are two passes through the MAF file. The first creates a W_EHA_VARIANT_STG record and collects each unique specimen number into the global temporary table. A bulk collect then calls Odb_util.GET_SPECIMEN_WID for all of the specimen numbers in one statement.
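The pattern of deduplicating specimen numbers and then resolving them with one bulk call can be illustrated as follows. This is a Python sketch only; `get_specimen_wid_bulk` is a hypothetical stand-in for the Odb_util.GET_SPECIMEN_WID bulk call:

```python
def collect_specimen_wids(rows, get_specimen_wid_bulk):
    # First pass: gather each unique specimen barcode (tumor and
    # normal) from the parsed MAF rows, preserving first-seen order.
    unique, seen = [], set()
    for row in rows:
        for barcode in (row["Tumor_Sample_Barcode"],
                        row["Matched_Norm_Sample_Barcode"]):
            if barcode not in seen:
                seen.add(barcode)
                unique.append(barcode)
    # Then resolve every specimen number with a single bulk lookup.
    return dict(zip(unique, get_specimen_wid_bulk(unique)))
```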

Another bulk insert statement then inserts the data into W_EHA_RSLT_SEQUENCING and W_EHA_RSLT_SEQUENCING_X.

A select statement parses the data from the external table and joins it with the w_eha_variant and w_eha_dna_source tables; the resulting dataset is then used to compute the gene_wid from the w_eha_gene_segment table. The join checks whether the start position of the variant record is less than or equal to the end position of the gene segment, and whether the end position of the variant record is greater than or equal to the start position of the gene segment, and populates GENE_WID in the W_EHA_STG_SEQUENCING table.

After the record is inserted into the W_EHA_VARIANT_STG table, the PROCESS_VARIANT() procedure is called to populate the W_EHA_VARIANT table.

4.5.2.1 Data files

Each row of the MAF file has allele information for two sample types: a tumor and a normal sample. The two sample IDs are specified in each row of the file: the tumor sample ID in the Tumor_Sample_Barcode column and the normal sample ID in the Matched_Norm_Sample_Barcode column. For a single row, a maximum of eight records can be created in ODB, depending on heterozygosity and resemblance to the reference sequence: four for the tumor sample and four for the normal sample.

For allele sequences, "-" for a deletion represents a variant and "-" for an insertion represents a wild-type allele. If an allele sequence is the same as the Reference_Allele sequence, that allele information is not stored in the data bank. There is no NOCALL information in MAF data, hence all the data goes to the W_EHA_VARIANT, W_EHA_RSLT_SEQUENCING, and W_EHA_RSLT_SEQUENCING_X tables.

The W_EHA_RSLT_SEQUENCING_X table is populated only for rows where the computed allele length is greater than 500. The loader validates the reference version passed as an input parameter against the W_EHA_VERSION table and populates VERSION_WID in the corresponding result table.

Note:

The batch file requires Oracle Wallet to be set up to run correctly.

Table Mapping of MAF Result File

Column Name in Result File Table and Column Name in ODB Description
Chromosome W_EHA_RSLT_SEQUENCING.CHROMOSOME_WID

W_EHA_VARIANT.CHROMOSOME

This field is used with the begin position to find the correct DNA_SOURCE record, to find a VARIANT record, or create a VARIANT record.

Three values are needed to find existing VARIANT records. The chromosome, the begin position, and the replace tag, which is notation combining reference and allele sequences.

For novel variants, a new record is created in the W_EHA_VARIANT table with the chromosome value.

Start_Position W_EHA_RSLT_SEQUENCING.START_POSITION

W_EHA_VARIANT_STG.START_POSITION

W_EHA_VARIANT.ABSOLUTE_POSITION

This field is used as described above. For no-call results, this is stored in the START_POSITION field (after adding 1).

For novel variants, Start_Position is stored in the ABSOLUTE_POSITION column in the W_EHA_VARIANT table.

For W_EHA_VARIANT.START_POSITION, Start_Position is relative to the value in the W_EHA_DNA_SOURCE.START_POSITION column.

End_Position W_EHA_VARIANT_STG.END_POSITION This value is used for no-call results and stored in the END_POSITION field. For novel variants, it is also used to calculate the relative end position, based on W_EHA_DNA_SOURCE.START_POSITION, stored in W_EHA_VARIANT.END_POSITION.
Strand W_EHA_VARIANT.STRAND

W_EHA_VARIANT_STG.STRAND

This value indicates forward or reverse strand.
Variant_Type W_EHA_RSLT_SEQUENCING.VARIANT_TYPE Type of variant including snp, insertion, or deletion. Stored in VARIANT_TYPE in the W_EHA_RSLT_SEQUENCING table.
Reference_Allele W_EHA_VARIANT.REPLACE_TAG This value is used for REPLACE_TAG twice, once for the tumor sample and once for the normal sample. It is used in conjunction with Tumor_Seq_Allele1 and Tumor_Seq_Allele2 to find existing VARIANT records for the tumor sample; similarly, for the normal sample, the reference allele is used for REPLACE_TAG.

Note that for insertions the reference sequence uses a "-" and for deletions the allele sequence uses "-". This is standard notation used in most references. The loader logic is implemented in the called procedure that populates the variant table.

For a deletion, this value holds the deleted sequence; for an insertion, it holds "-".

Tumor_Seq_Allele1 W_EHA_RSLT_SEQUENCING_X.ALLELE/ W_EHA_RSLT_SEQUENCING_X.ALLELE_CLOB

W_EHA_VARIANT.REPLACE_TAG

For sequencing results this value constructs the replace tag.

'-' value represents a deletion.

Tumor_Seq_Allele2 W_EHA_RSLT_SEQUENCING_X.ALLELE / W_EHA_RSLT_SEQUENCING_X.ALLELE_CLOB

W_EHA_VARIANT.REPLACE_TAG

For sequencing results, this value constructs the replace tag.

'-' value represents a deletion.

Tumor_Sample_Barcode W_EHA_RSLT_SPECIMEN.SPECIMEN_NUMBER This value represents the tumor sample ID. The barcode ID comprises TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID.
Matched_Norm_Sample_Barcode W_EHA_RSLT_SEQUENCING.RESULT_SPEC_WID

W_EHA_RSLT_SEQUENCING.SPECIMEN_WID

This value represents the normal sample ID. The barcode ID comprises TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID. The complete barcode ID is used as the foreign key to the RSLT_SPECIMEN record.
Match_Norm_Seq_Allele1 W_EHA_RSLT_SEQUENCING_X.ALLELE / W_EHA_RSLT_SEQUENCING_X.ALLELE_CLOB

W_EHA_VARIANT.REPLACE_TAG

For sequencing results this value constructs the replace tag.

'-' value represents a deletion.

Match_Norm_Seq_Allele2 W_EHA_RSLT_SEQUENCING_X.ALLELE / W_EHA_RSLT_SEQUENCING_X.ALLELE_CLOB

W_EHA_RSLT_SEQUENCING.ALLELE

W_EHA_VARIANT.REPLACE_TAG

For sequencing results this value constructs the replace tag.

'-' value represents a deletion.

Score W_EHA_RSLT_SEQUENCING.SCORE_VAF This is mapped to W_EHA_RSLT_SEQUENCING.SCORE_VAF.
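The "-" notation used by Reference_Allele and the allele columns above can be summarized in a small sketch. This Python helper is illustrative only and is not part of the loader:

```python
def classify(ref, allele):
    # A '-' reference sequence means the allele was inserted;
    # a '-' allele sequence means the reference was deleted.
    if ref == "-":
        return "insertion"
    if allele == "-":
        return "deletion"
    # Otherwise a simple base-for-base substitution (snp when both
    # sequences are a single nucleotide).
    return "snp" if len(ref) == len(allele) == 1 else "substitution"
```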

4.5.3 Command-Line Argument List

Name

MAF_loader.sh - load records

Synopsis

MAF_loader.sh -help

MAF_loader.sh <...options>

Description

Validates input options and calls the loader script load_maf.sql#odb_rslt_maf_util.process_maf

Options

(*) required

-db_wallet* <VARCHAR2>

Oracle wallet name, see "Setting up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string, for example, "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see "Setting up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES, for example, "Homo sapiens" for humans

-study_name* <VARCHAR2>

Study name defined in W_EHA_RSLT_STUDY.RESULT_STUDY_NAME

-datasource_name* <VARCHAR2>

Datasource name defined in W_EHA_DATASOURCE.DATASOURCE_NM [default: CDM]

-specimen_vendor* <VARCHAR2>

Specimen vendor - Sample vendor number of specimen with genomic result data. If CDM is referenced, this value should be defined in the W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_VENDOR_NUMBER

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-file_flg* <CHAR>

File flag (E=external|S=copy to secure data file directory) [default: E]

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR

-data_file_path <VARCHAR2>

File system path to secure data file directory

-dbfs_store <VARCHAR2>

Database file system store

-alt_file_loc <VARCHAR2>

Alternate file location link, for example, ftp:location, http:location

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

-data_file_dir <VARCHAR2>

File system path to Oracle directory object

4.5.4 Examples

UNIX

$ sh MAF_loader.sh -db_wallet odb_user -data_file "ut_maf.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_vendor "vendor3" -reference_version "GRCh37.p8" -file_flg "E" -preprocess_dir null -preprocess_file null -data_file_path null -dbfs_store null -alt_file_loc null -read_size null

Windows

C:\> MAF_loader.bat -db_wallet odb_user -data_file "ut_maf.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_vendor "vendor3" -reference_version "GRCh37.p8" -file_flg "E" -preprocess_dir null -preprocess_file null -data_file_path null -dbfs_store null -alt_file_loc null -read_size null

4.6 CGI masterVar Loader

The CGI masterVar loader has been removed from Omics Data Bank 2.5 and will be reintroduced in a future ODB version.

4.7 RNA-Seq Loader

4.7.1 Functional Description

The TCGA RNA SEQ data file format specifications are described here:

https://wiki.nci.nih.gov/display/TCGA/RNASeq+Data+Format+Specification#RNASeqDataFormatSpecification-Datafiles

Currently, support is provided for loading the exon version of the data files; TCGA has three different types of files: exon, gene, and splice junction. Only the exon files are measured by exact chromosome locations. The other two file types are calculated estimations based on gene locations, using this exon data file.

Once the exon data file is loaded, the user can choose genes which then can map to specific chromosome regions. RNA Sequencing-based data have a data type alias of Quantification-Exon.

An RNASeq exon quantification file is a tabular, tab-separated, text-based dataset with a single header row stating the column names. The file consists of the following columns:

  • barcode: Identifies the sample. This column may or may not be used by the loader, but from ODB v2.5 onwards, only files with this column can be loaded.

  • exon: Gives the standard chromosome token (chr1-chr22, chrX, chrY, chrM) followed by a coordinate pair, with the strand indicated by +/-, for example, chr1:12227:-,chr1:12595:+

  • raw_counts: Stores raw read counts in positive floating point values or a zero if unavailable.

  • median_length_normalized: A normalized region length calculation in positive float or zero.

  • RPKM: (Reads Per Kilobase exon Model per million mapped reads), calculated expression intensity values in positive float or zero.
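As a sketch, the exon identifier can be split into the fields the loader derives, assuming the '<chromosome>:<absolute start position>-<absolute end position>:<strand>' layout used in the exon column. This Python helper is illustrative only:

```python
def parse_exon(token):
    # Split 'chr1:12227-12595:+' into chromosome, coordinate pair,
    # and strand; the whole token is kept as RESULT_EXON_NAME.
    chrom, coords, strand = token.split(":")
    start, end = (int(p) for p in coords.split("-"))
    return {"chromosome": chrom, "start": start, "end": end,
            "strand": strand, "result_exon_name": token}
```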

4.7.2 Data Load

The RNASeq exon quantification file inputs the data into the following table: W_EHA_RSLT_RNA_SEQ.

The last four columns of the TCGA exon file are populated into the table mentioned above. The EXON type file specifies a chromosome and range. This field is parsed to find the corresponding chromosome record and strand, and to separate the start and end positions. The exact result file value is stored in RESULT_EXON_NAME in W_EHA_RSLT_RNA_SEQ for user reference.

The table also stores an FK value for SPECIES, an FK value for the version of the DNA reference mapped to the result file, and an FK value to the W_EHA_GENE table for each gene the reference mapping associates with a record. If no such gene is found in the current reference, a '0' value is input into the column. If the RPKM value for an input row is null (blank) or has the text 'null', the row is skipped by the loader.

The execution call of the stored procedure ODB_RSLT_RNA_SEQ_UTIL.process_tcga_rna_seq() is made in one of the script files (load_tcga_rna_seq.sql). This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, SPECIES NAME, STUDY NAME, DATA SOURCE, SPECIMEN VENDOR, Reference Version, File Flag, Preprocess Directory, Preprocess File, Data File Path, DBFS Store, Alternate File Location (ftp location/http location), and Read Size as input parameters.

This stored procedure creates an external table dynamically and uploads data from the source file into it. External tables let Oracle query data that is stored outside the database in flat files.

Only one external table is created dynamically; it holds the complete result data. A single bulk insert statement inserts data into the W_EHA_RSLT_RNA_SEQ table. The query uses two inline views: one computes the gene WID and the other computes the start position, end position, chromosome WID, strand, raw_counts, and median_length columns. These two inline views are then joined to populate the W_EHA_RSLT_RNA_SEQ table.

Note:

The result types, and mainly the identifiers used in the first column are different for TCGA-Exon and TCGA-Gene. Additional columns specified to hold gene result record identifiers have been created, which remain empty as these are not filled by the exon loader. This is due to the gene record identifiers having a different querying requirement and hence separated in table from the columns that are populated with exon result-identifiers. The remaining column fields (RPKM, median length, raw count) are common for both formats.

4.7.2.1 Data File

The batch file of the RNA-seq loader requires Oracle Wallet to be set up to run correctly.

Specimen number -

A valid specimen number should either be provided by the user, or it is retrieved from the data file (the "barcode" column). It is used to link the result records, by specimen, to the associated external datasource given in the previous parameter. If the datasource is CDM, this value should be present for a record in the W_EHA_SPECIMEN_PATIENT_H table under SPECIMEN_NUMBER.

Note:

In Linux, it is not required to enclose Homo sapiens in double quotation marks. That requirement applies only to Windows.

Table Mapping of RNASeq exon Result File

Column Name in Result File Table and Column Name in ODB Description
exon W_EHA_RSLT_RNA_SEQ.CHROMOSOME_WID

W_EHA_RSLT_RNA_SEQ.START_POSITION

W_EHA_RSLT_RNA_SEQ.END_POSITION

W_EHA_RSLT_RNA_SEQ.STRAND

W_EHA_RSLT_RNA_SEQ.RESULT_EXON_NAME

The column contains values in the following format:

'<chromosome>:<absolute start position>-<absolute end position>:<strand>'

The loader parses each value and populates the respective fields. The chromosome value is looked up in W_EHA_CHROMOSOME for its ROW_WID value to populate CHROMOSOME_WID. The entire value is placed in RESULT_EXON_NAME.

raw_counts W_EHA_RSLT_RNA_SEQ.RAW_COUNTS Raw read counts; a positive floating point value or zero.
median_length_normalized W_EHA_RSLT_RNA_SEQ.MEDIAN_LENGTH Calculated average normalized median length of the exon region for which an RPKM count is generated. Stores a positive float or zero.
RPKM W_EHA_RSLT_RNA_SEQ.RPKM Reads Per Kilobase exon Model per million mapped reads. Stores a positive float or zero.

The barcode column in the file is optionally used to identify the sample, if the user does not pass the sample number to the loader as an argument. The value from the second row (first after the header) is then taken as the specimen number.

4.7.3 Command-Line Argument List

Name

TCGA_RNA_SEQ_loader.sh - load records

Synopsis

TCGA_RNA_SEQ_loader.sh -help

TCGA_RNA_SEQ_loader.sh <...options>

Description

Validates input options and calls the loader script load_tcga_rna_seq.sql#odb_rslt_rna_seq_util.process_tcga_rna_seq

Options

(*) required

-db_wallet* <VARCHAR2> (Required unless the -db_conn/-db_user combination is used to log into the database)

Oracle wallet name, see "Setting up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string, for example, "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see "Setting up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES, for example, "Homo sapiens" for humans

-study_name* <VARCHAR2>

Study name defined in W_EHA_RSLT_STUDY.RESULT_STUDY_NAME

-datasource_name* <VARCHAR2>

Datasource name defined in W_EHA_DATASOURCE.DATASOURCE_NM [default: CDM]

-specimen_number <VARCHAR2>

Specimen number - Identification number of specimen for which the genomic result file is being loaded. If CDM is referenced, this value should be defined in W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_NUMBER. If the specimen number argument is missing or is null (""), the value from the second row (first after the header) in the "barcode" column of the data file is used as the specimen number.

-specimen_vendor* <VARCHAR2>

Specimen vendor - Sample vendor number of specimen with genomic result data. If CDM is referenced, this value should be defined in the W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_VENDOR_NUMBER

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-file_flg* <CHAR>

File flag (E=external|S=copy to secure data file directory) [default: E]

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR

-data_file_path <VARCHAR2>

File system path to secure data file directory

-dbfs_store <VARCHAR2>

Database file system store

-alt_file_loc <VARCHAR2>

Alternate file location link, for example, ftp:location, http:location

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

-data_file_dir <VARCHAR2>

File system path to Oracle directory object

4.7.4 Examples

UNIX

$ sh TCGA_RNA_SEQ_loader.sh -db_wallet odb_user -data_file "summary_TCGA-AB-2803-03A-01T-0734-13.exon.quantification.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_number "RNA01" -specimen_vendor "vendor1" -reference_version "GRCh37.p8" -file_flg "E" -read_size ''

Windows

C:\> TCGA_RNA_SEQ_loader.bat -db_wallet odb_user -data_file "summary_TCGA-AB-2803-03A-01T-0734-13.exon.quantification.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_number "RNA01" -specimen_vendor "vendor1" -reference_version "GRCh37.p8" -file_flg "E"

4.8 File Lineage Linker

File lineage linker has been removed from Omics Data Bank 2.5 release and will be reintroduced in the next release.

4.9 Copy Number Variation Loader

4.9.1 Functional Description

The CNV loader is meant to load data from TCGA belonging to Affymetrix Genome-Wide Human SNP Array 6.0 platform. The input file should contain columns in the following order for the loader to perform correctly:

  1. Sample

  2. Chromosome

  3. Start

  4. End

  5. Num_Probes

  6. Segment_Mean

An extract of the input CNV file is shown in the table:

Table Extract of input CNV file

Sample Chromosome Start End Num_Probes Segment_Mean
JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020 1 51598 219036 22 0.8546
JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020 1 219482 1176387 120 0.0513
JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020 1 1176449 1243413 47 0.623
JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020 1 1243440 5290540 2132 0.0167
JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020 1 5291209 5308749 6 0.6214
JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020 1 5308775 9230624 2368 -0.0261
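A row of this file can be parsed into typed fields in the required column order. This Python sketch is illustrative only; the actual loader maps the columns through an Oracle external table:

```python
# Fixed column order required by the CNV loader, per the list above.
CNV_COLUMNS = ("sample", "chromosome", "start", "end",
               "num_probes", "segment_mean")

def parse_cnv_line(line):
    # Split one tab-separated row and convert numeric columns.
    fields = line.rstrip("\n").split("\t")
    row = dict(zip(CNV_COLUMNS, fields))
    row["start"], row["end"] = int(row["start"]), int(row["end"])
    row["num_probes"] = int(row["num_probes"])
    row["segment_mean"] = float(row["segment_mean"])
    return row
```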

4.9.2 Data Load

The execution call of the stored procedure odb_rslt_cnv_util.process_cnv_nbr_var() is made in one of the script files (load_cnv.sql). This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, SPECIES NAME, STUDY, DATASOURCE NAME, SPECIMEN VENDOR, FILE FLAG (External or Secured), DBFS_STORE, DNA_VERSION, and a few other input parameters.

This stored procedure creates an external table and uploads data from the source file into it. External tables let Oracle query data that is stored outside the database in flat files. The stored procedure creates cnv_data_!!SEQ!! as an external table. This external table stores the complete result data. This table maps all the fields existing in the result file.

Note:

In the above external table, the !!SEQ!! string is replaced by ETL_PROC_ID at run time.

A single bulk insert statement inserts the records into the W_EHA_STG_COPY_NBR_VAR table.

A select statement parses the data from the external table and uses an inline query that gets the dataset of gene segment records. The query checks whether the start position in the result file is less than or equal to the end position of the gene segment plus the start position of the DNA source, and whether the end position in the result file is greater than or equal to the start position of the gene segment plus the start position of the DNA source. The datasets of the inline queries are then outer joined with the ROW_WID of the external table to look up GENE_WID and populate the records in the W_EHA_STG_COPY_NBR_VAR table.
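The position test described above amounts to an interval-overlap check, with the gene segment positions offset by the DNA source start position. A minimal illustrative sketch, not the loader's SQL:

```python
def overlaps_gene_segment(rslt_start, rslt_end,
                          seg_start, seg_end, dna_source_start):
    # The gene segment positions are relative to the DNA source, so
    # each is offset by the DNA source start position before being
    # compared with the absolute result-file positions.
    return (rslt_start <= seg_end + dna_source_start and
            rslt_end >= seg_start + dna_source_start)
```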

The loader associates the data with FILE_TYPE_CODE 'Genome_Wide_SNP_6' in W_EHA_FILE_TYPE table to distinguish this data.

Note:

The batch file requires Oracle Wallet to be set up to run correctly.

4.9.3 Command-Line Argument List

Name

CNV_loader.sh - load records

Synopsis

CNV_loader.sh -help

CNV_loader.sh <...options>

Description

Validates input options and calls the loader script load_cnv.sql#odb_rslt_cnv_util.process_cnv_nbr_var

Options

(*) required

-db_wallet* <VARCHAR2> (Required, unless the -db_conn/-db_user combination is used to log into the database)

Oracle wallet name, see "Setting up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string that is, "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see "Setting up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES, that is, for humans "Homo sapiens"

-study_name* <VARCHAR2>

Study name defined in W_EHA_RSLT_STUDY.RESULT_STUDY_NAME

-datasource_name* <VARCHAR2>

Datasource name defined in W_EHA_DATASOURCE.DATASOURCE_NM [default: CDM]

-specimen_number <VARCHAR2>

Specimen number - Identification number of specimen for which the genomic result file is being loaded. If CDM is referenced, this value should be defined in W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_NUMBER. If the specimen number is not passed on the command-line, or is passed as "", the value in the second (first after the header) row in the first column ("sample") in the file will be used as the Specimen Number.

-specimen_vendor* <VARCHAR2>

Specimen vendor - Sample vendor number of specimen with genomic result data. If CDM is referenced, this value should be defined in the W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_VENDOR_NUMBER

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-file_flg* <CHAR>

File flag (E=external|S=copy to secure data file directory) [default: E]

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR

-data_file_path <VARCHAR2>

File system path to secure data file directory

-dbfs_store <VARCHAR2>

Database file system store

-alt_file_loc <VARCHAR2>

Alternate file location link that is, ftp:location, http:location

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

-data_file_dir <VARCHAR2>

File system path to Oracle directory object
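The -specimen_number fallback described in the options above (using the first data row's "sample" column when no value is passed) can be sketched as follows. This is an illustrative Python sketch; the helper name and the tab-separated layout are assumptions, not part of the actual loader:

```python
def resolve_specimen_number(cli_value, data_lines):
    """Hypothetical helper: if -specimen_number is omitted or passed as
    an empty string, fall back to the first column of the first data
    row after the header, as the option description states."""
    if cli_value:
        return cli_value
    # data_lines[0] is the header row; data_lines[1] is the first data row.
    return data_lines[1].split("\t")[0]

lines = [
    "sample\tchrom\tstart\tend\tnum_mark\tseg_mean",
    "JOUAL_p_TCGA_b96\t1\t219482\t1176387\t120\t0.0513",
]
print(resolve_specimen_number("", lines))       # falls back to "JOUAL_p_TCGA_b96"
print(resolve_specimen_number("SPEC42", lines)) # explicit value wins
```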

4.9.4 Examples

UNIX

$ sh CNV_loader.sh -db_wallet odb_user -data_file "JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020.hg19.seg.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_number "JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020" -specimen_vendor "vendor1" -reference_version "GRCh37.p8" -file_flg "E" -read_size ''

Windows

C:\> CNV_loader.bat -db_wallet odb_user -data_file "JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020.hg19.seg.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_number "JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020" -specimen_vendor "vendor1" -reference_version "GRCh37.p8" -file_flg "E"

4.10 Single Channel Gene Expression Loader

4.10.1 Functional Description

The single channel gene expression loader loads gene expression data into the W_EHA_RSLT_GENE_EXP table. The loader starts by loading all the hybridization sets (consisting of intensity, call, and P-value) from the input file into an intermediary staging table, W_EHA_STG_GENE_EXP.

While reading from the file, it assumes that the file has a maximum of 15 hybridization sets (which translates to a maximum of 45 columns). If the file has fewer, it uses empty fillers in the staging table for the unavailable hybridization sets. Then, for each record from the staging table, it verifies whether the probe name exists in the W_EHA_PROBE table along with a matching version and species ID, which are passed as input parameters.

If it matches, it inserts all the sets of hybridizations available for that record into the W_EHA_RSLT_GENE_EXP table, excluding the empty fillers. If the probe name does not exist in the W_EHA_PROBE table, it skips that record; if the probe exists with an unmatched version and/or species ID, it logs a warning into the W_EHA_RSLT_LOG table saying that the version and/or species do not match.

The W_EHA_RSLT_LOG table will contain error records if the records were not loaded successfully into the target tables. If an input row intensity value is a null value (or blank) or has the text 'null' then the loader skips the insert of this record.
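The null-handling rule above can be sketched as a simple predicate. This is an illustrative Python sketch, not the loader's code; treating 'null' case-insensitively and trimming whitespace are assumptions:

```python
def keep_row(intensity):
    """Return False for rows the loader would skip: a None/blank
    intensity or the literal text 'null' (case handling assumed)."""
    if intensity is None:
        return False
    value = intensity.strip()
    return value != "" and value.lower() != "null"

print([keep_row(v) for v in ["123.4", "", "null", None]])  # → [True, False, False, False]
```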

Note:

Oracle Wallet must be set up before the batch files can be run successfully.

4.10.2 Data Load

The gene expression loader loads primarily into the W_EHA_RSLT_GENE_EXP tables. These tables can be loaded by running the gene_expression_loader.bat file in Windows or the gene_expression_loader.sh file in Linux.

Note:

The gene expression loader assumes that probe loader (see Probe Loader) has already populated W_EHA_PROBE table with probe names corresponding to genes.

The input file for this loader should contain normalized intensity values, with optional present/absent calls and P-values (such as the output of Affymetrix's MAS5 algorithm). The input file allows multiple hybridization intensity datasets in a tabular format. The following section lists the specifics of the input data file format.

4.10.2.1 Assumptions for Data File

Following are the assumptions for the data file for the Gene Expression Loader:

  1. The file is tab separated.

  2. The first row is always the header.

  3. The first column is named DATA.

  4. Each hybridization present in the data file should have three columns in the following order:

    • Intensity - Header value should be the Hybridization Name

    • Call

    • P-Value

  5. The header of the first column for each hybridization should contain only the hybridization name. The values in this column are the hybridization intensity values.

  6. The total size of the header in the data file should not be greater than 32000 characters.
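The layout assumptions above can be illustrated with a small parser sketch. This is illustrative Python only (the sample data, function name, and column values are invented for the example), not part of the loader:

```python
import csv
import io

# Invented sample: DATA column followed by two hybridization triples
# (intensity with the hybridization name as header, call, P-value).
SAMPLE = (
    "DATA\tHYB1\tHYB1_Call\tHYB1_P-VALUE\tHYB2\tHYB2_Call\tHYB2_P-VALUE\n"
    "probe_001\t153.2\tP\t0.002\t98.7\tA\t0.41\n"
)

def parse_gene_expression(text):
    """Group each hybridization's (intensity, call, P-value) triple by
    probe name, following the file assumptions listed above."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    # Hybridization names come from the intensity column headers,
    # i.e. every third column after DATA.
    hyb_names = header[1::3]
    rows = []
    for line in reader:
        probe = line[0]
        for i, name in enumerate(hyb_names):
            intensity, call, p_value = line[1 + 3 * i: 4 + 3 * i]
            rows.append((probe, name, intensity, call, p_value))
    return rows

for row in parse_gene_expression(SAMPLE):
    print(row)
```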

4.10.2.2 Mappings for Gene Expression Loader

Table Mappings for gene expression loader

Data File | W_EHA_RSLT_GENE_EXP
DATA | Looked up in W_EHA_PROBE; the corresponding ROW_WID is populated in W_EHA_RSLT_GENE_EXP.PROBE_WID
HYBRIDIZATION - Header | W_EHA_RSLT_GENE_EXP.HYBRIDIZATION_NAME
HYBRIDIZATION - Data Values | W_EHA_RSLT_GENE_EXP.INTENSITY
HYBRIDIZATION_Call | W_EHA_RSLT_GENE_EXP.CALL
HYBRIDIZATION_P-VALUE | W_EHA_RSLT_GENE_EXP.P_VALUE

4.10.3 Command-Line Argument List

Name

single_channel_gene_expr_loader.sh - load records

Synopsis

single_channel_gene_expr_loader.sh -help

single_channel_gene_expr_loader.sh <...options>

Description

Validates input options and calls the loader script load_single_channel_gene_expr.sql#odb_rslt_single_channel_util.process_single_channel

Options

(*) required

-db_wallet* <VARCHAR2>

Oracle wallet name, see "Setting up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string that is, "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see "Setting up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES, for example, for humans "Homo sapiens"

-study_name* <VARCHAR2>

Study name defined in W_EHA_RSLT_STUDY.RESULT_STUDY_NAME

-datasource_name* <VARCHAR2>

Datasource name defined in W_EHA_DATASOURCE.DATASOURCE_NM [default: CDM]

-specimen_number* <VARCHAR2>

Specimen number - Identification number of specimen for which the genomic result file is being loaded. If CDM is referenced, this value should be defined in W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_NUMBER

-specimen_vendor* <VARCHAR2>

Specimen vendor - Sample vendor number of specimen with genomic result data. If CDM is referenced, this value should be defined in the W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_VENDOR_NUMBER

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-file_flg* <CHAR>

File flag (E=external|S=copy to secure data file directory) [default: E]

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR

-data_file_path <VARCHAR2>

File system path to secure data file directory

-dbfs_store <VARCHAR2>

Database file system store

-alt_file_loc <VARCHAR2>

Alternate file location link that is, ftp:location, http:location

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

-data_file_dir <VARCHAR2>

File system path to Oracle directory object

4.10.4 Examples

UNIX

$ sh single_channel_gene_expr_loader.sh -db_wallet odb_user -data_file "mas5_expression_summary_part1.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_number "RNA01" -specimen_vendor "vendor1" -reference_version "GRCh37.p8" -file_flg "E" -preprocess_dir null -preprocess_file null -data_file_path null -dbfs_store null -alt_file_loc null -read_size null

Windows

C:\> single_channel_gene_expr_loader.bat -db_wallet odb_user -data_file "mas5_expression_summary_part1.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_number "RNA01" -specimen_vendor "vendor1" -reference_version "GRCh37.p8" -file_flg "E" -preprocess_dir null -preprocess_file null -data_file_path null -dbfs_store null -alt_file_loc null -read_size null

4.11 Dual Channel Loader

4.11.1 Functional Description

The dual channel loader supports Agilent 244K Custom Gene Expression G4502A-07 platform-specific Level-3 (gene level) input files from TCGA. It inputs Level-3 result data, which contains gene symbols and associated LOWESS log2-transformed ratio gene expression values, and loads it into the W_EHA_RSLT_2CHANNEL_GXP result table.

ADF data with a specific User Label is used to link the Dual Channel data with genes of the specific DNA Reference Version via the GENE_WID foreign key in the W_EHA_RSLT_2CHANNEL_GXP table. Currently, for each Gene Symbol or Ratio input from the file, the loader generates a record for each GENE_WID value taken from the W_EHA_GENE_SEGMENT reference table, where the genomic coordinates of the corresponding Composite Name at least partially match the genomic coordinates of a Gene Segment in the EMBL reference.

To correctly map the genomic coordinates of the result composite genes (using the ADF file composite annotation loaded through the ADF Data Loader) to the EMBL reference genome, the genomic reference version of the loaded EMBL release must match the reference version of the ADF file. That is, the EMBL data loaded in ODB must be the same genomic release as that given in the ADF file.

The ADF file data must be input into ODB, using the ADF data loader (and with the same ADF User Label) before input of 2channel result data with this loader.

If the Log2 ratio value for an input row is a null value (a blank) or has the text 'null', the loader skips this row on insert to the result table.

4.11.2 Data Load

The execution call of the stored procedure odb_rslt_dual_channel_util.process_dual_channel() is defined in one of the script files (load_dual_channel.sql). This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, STUDY, DATASOURCE NAME, SPECIMEN NAME, SPECIMEN VENDOR, SPECIES NAME, (ADF) USER LABEL, (DNA) REFERENCE VERSION, FILE FLAG (External or Secure), DBFS_STORE, DNA_VERSION, and a few other input parameters.

This stored procedure creates an external table and uploads data from the source file into it. External tables let Oracle query data that is stored outside the database in flat files. The stored procedure creates dual_channel_data_!!SEQ!! as an external table. This external table stores the complete result data. This table maps all the fields existing in the result file.

Note:

In the above external table, the !!SEQ!! string is replaced by ETL_PROC_ID at run time.

A single bulk insert statement, written dynamically, inserts the records into the W_EHA_STG_2CHANNEL_GXP table.

A select statement that parses the data from the external table uses an inline query to get the dataset of gene segment records. The query looks up each result row against this dataset, checking whether the start position in the result file is less than or equal to the end position of the gene segment plus the start position of the DNA source, and whether the end position in the result file is greater than or equal to the start position of the gene segment plus the start position of the DNA source. The datasets from the inline queries are then outer joined with the ROW_WID of the external table to look up the GENE_WID, and the records are populated into the W_EHA_STG_2CHANNEL_GXP table.

Note:

The batch file requires Oracle Wallet to be set up to run correctly.

4.11.3 Command Line Argument List

Name

dual_channel_gene_expr_loader.sh - load records

Synopsis

dual_channel_gene_expr_loader.sh -help

dual_channel_gene_expr_loader.sh <...options>

Description

Validates input options and calls the loader script load_dual_channel_gene_expr.sql#odb_rslt_dual_channel_util.process_dual_channel

Options

(*) required

-db_wallet* <VARCHAR2>

Oracle wallet name, see "Setting up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string that is, "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see "Setting up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES, for example, for humans "Homo sapiens"

-study_name* <VARCHAR2>

Study name defined in W_EHA_RSLT_STUDY.RESULT_STUDY_NAME

-datasource_name* <VARCHAR2>

Datasource name defined in W_EHA_DATASOURCE.DATASOURCE_NM [default: CDM]

-specimen_number* <VARCHAR2>

Specimen number - Identification number of specimen for which the genomic result file is being loaded. If CDM is referenced, this value should be defined in W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_NUMBER

-specimen_vendor* <VARCHAR2>

Specimen vendor - Sample vendor number of specimen with genomic result data. If CDM is referenced, this value should be defined in the W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_VENDOR_NUMBER

-control_specimen* <VARCHAR2>

Control specimen

-user_label* <VARCHAR2>

User label (W_EHA_ADF.USER_LABEL) used to identify a composite record's source ADF dataset, that is, AgilentG4502A_07_1

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-file_flg* <CHAR>

File flag (E=external|S=copy to secure data file directory) [default: E]

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR

-data_file_path <VARCHAR2>

File system path to secure data file directory

-dbfs_store <VARCHAR2>

Database file system store

-alt_file_loc <VARCHAR2>

Alternate file location link that is, ftp:location, http:location

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

-data_file_dir <VARCHAR2>

File system path to Oracle directory object

4.11.4 Examples

UNIX

$ sh dual_channel_gene_expr_loader.sh -db_wallet odb_user -data_file "dual_channel_summary.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_number "JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020" -specimen_vendor "vendor1" -control_specimen "Stratagene Universal Reference" -user_label "AgilentG4502A_07_01" -reference_version "GRCh37.p8" -file_flg "E"

Windows

C:\> dual_channel_gene_expr_loader.bat -db_wallet odb_user -data_file "dual_channel_summary.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_number "JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020" -specimen_vendor "vendor1" -control_specimen "Stratagene Universal Reference" -user_label "AgilentG4502A_07_01" -reference_version "GRCh37.p8" -file_flg "E"

4.12 Typical Errors Associated with Result Loaders

4.12.1 Errors Relevant to Sequencing Loads

For each loader, a new VARCHAR2(100) variable is defined. This variable is set before every SQL call to specify the context information. It is then used in standard error logging, where the RECORD_DETAIL column of the W_EHA_RSLT_LOG table is populated with a simple context error message.

Some common error messages are 'Generating ETL PROC ID', 'Verifying result file type and creating result file record', 'Verifying Species Name', 'Verifying Study Name', 'Processing Variant Staging records', 'Processing variant records', 'Processing result records', and 'Dropping external tables'.

Examples of common error log records which are used in CGI, MAF, and VCF loaders are shown in Table 4-1.

Table 4-1 Errors Generated while Loading Sequencing Files (CGI masterVar file used as example)

Column Name | Description | Examples
ROW_WID | Record identifier | -
RESULT_TYPE_NAME | Result Type Name | For CGI: CGI masterVar
SPECIES_NAME | Species Name | Homo sapiens
SPECIMEN_NUMBER | Specimen Number parsed from the comments section of the file | GS00706-DNA_C01
SPECIMEN_VENDOR_NUMBER | Name of vendor passed as parameter with batch file | For CGI: CGI
DATASOURCE_NM | Data source Name | CDM
ERROR_DESC | Error message | ORA-01403 NO DATA FOUND
RECORD_DETAIL | Verifying Study Name | -
ETL_PROC_WID | Each load identifier | -


Following are some errors that are handled according to each loader's processes.

4.12.2 VCF Loader Errors

The VCF/gVCF loader processes the file in multiple passes and creates three external tables. If the process fails while processing the specimen specification table, the error 'Processing external specimen table for specimen headers' is logged in the w_eha_rslt_log table. If the process fails while creating the metadata external table, the error 'Processing external table for metadata' is logged in the result log table. If the process fails while creating qualifier records, the error 'Processing metadata records(w_eha_rslt_file_qlfr)' is logged in the result log table.

If the process fails while retrieving specimens from the external table, the error 'Retrieving specimen numbers from external specimen table' is logged in the result log table. If the loader tries to load a file that has more than 986 samples, the error 'File loaded cannot have more than 986 specimens' is logged in the result log table.

If the process fails while executing variant staging records, the error 'Processing variant staging records(w_eha_variant_stg, w_eha_variant_x_stg)' is logged in the result log table. If the process fails while populating the variant table from the variant staging table, the error 'Processing variant records(w_eha_variant, w_eha_variant_log)' is logged in the result log table.

If the process fails while populating variant result records, the error 'Processing result sequencing records(w_eha_stg_sequencing, w_eha_stg_sequencing_x, w_eha_stg_struct_var)' is logged in the result log table; if it fails while processing non-variant records, the error 'Processing result sequencing records(w_eha_stg_struct_var, w_eha_stg_sv_breakend, w_eha_stg_nocall, w_eha_stg_non_variant)' is logged in the result log table. If the process fails while populating conflict records, the error 'Processing result sequencing records(w_eha_stg_conflict)' is logged in the result log table.

If the loader processes the xref records and the process fails, the error 'Processing variant xref records(w_eha_variant_xref)' is logged; if the process fails while computing the nocall collapsing function, the error 'Collapsing result nocall staging records(w_eha_rslt_nocall)' is logged in the result log table.

At the end of all data parsing, the loader calls the ext_tables_error_log() procedure, which parses all rows that have ALT_SEQ columns larger than 4000 characters (and are therefore not processed) and logs the error 'Log failed external table records' in the W_EHA_RSLT_LOG table.

Other possible errors are: 'Generating etl process id', 'Generating enterprise id', 'Verifying result file type ({0}, {1}) and processing result file record(w_eha_rslt_file)', 'Verifying {0} reference version={1}', 'Verifying result type', 'Verifying datasource name', 'Verifying species name', 'Verifying study name', 'Parsing global flex fields', 'Retrieving specimen ids', 'Getting flanking offset', 'Processing external data table', and 'Dropping external tables'.

4.12.3 MAF Loader Errors

Only one external table is created for MAF result data. If the process fails at this step, an error 'Processing external data table' is logged into the w_eha_rslt_log table. A first BULK insert statement creates a variant staging and specimen record. If the process fails at this stage, an error 'Processing variant staging and specimen staging records (w_eha_variant_stg, w_eha_maf_specimen, w_eha_maf_specimen_log)' is logged into the error table.

If the process fails while retrieving the specimen ID for a record from the global temporary table, an error 'Processing list of specimens' is logged into the error table. If an error occurs while processing variant staging records (which populate the w_eha_variant table), an error 'Processing variant records (w_eha_variant, w_eha_variant_log)' is logged into the error log table.

If the process fails while executing the bulk insert statement that populates the target result table, the error 'Processing result records (w_eha_stg_sequencing, w_eha_stg_sequencing_x)' is logged into the error log table.

Other possible errors are: 'Generating etl process id', 'Generating enterprise id', 'Verifying result file type ({0}, {1}) and processing result file record(w_eha_rslt_file)', 'Verifying {0} reference version={1}', 'Verifying result type', 'Verifying datasource name', 'Verifying species name', 'Verifying study name', 'Getting flanking offset', 'Processing external data table', and 'Dropping external tables'.

4.12.4 Single Channel Gene Expression Loader Errors

If the process fails while retrieving any of the DNA version ID, specimen ID, datasource name, or study ID, an error is logged into the error table and shown in the error summary at the end. If the process fails while processing the hybridization header external table, the error 'Processing hybridization header table' is logged. If the process fails while retrieving the hybridization name from the header table, the error 'Retrieving hybridization name from header table' is logged.

Similar to VCF, a batch of 45 expression data columns is processed at a time; if the process fails at this stage, the error 'Processing data table for the set of gene expression' is logged. If the process fails while dropping the expression and hybridization external tables, the error messages 'Dropping expression data table' and 'Dropping hybridization table' respectively are logged.

4.12.5 Dual Channel Gene Expression Loader Errors

The Dual Channel loader verifies the existence of and looks up a W_EHA_ADF record with the relevant User Label. If the record is not found, the error is 'Getting the ROW_WID of the relevant W_EHA_ADF record, and checking its contents'. Subsequently, the ADF record is verified to match the Species ('The Species of the W_EHA_ADF record with user_label=… does not match the Species Name argument' error if it does not), and the DNA Reference Version ('The DNA reference version of the W_EHA_ADF record with user_label=… does not match the version argument', if it does not).

If the process of loading the data from the file into the external table fails, the error is 'Processing external data table'.

If processing the data from the external table and insertion into the result table fails, the error is 'Processing result records (w_eha_stg_2channel_gxp)'.

Other possible errors are: 'Verifying DNA reference version', 'Verifying result type', 'Verifying datasource name', 'Verifying species name', 'Verifying study name', 'Getting flanking offset', 'Retrieving specimen id', and 'Dropping external tables'.

4.12.6 RNA-seq Loader Errors

Only one external table is created for RNA Seq result data. If the process fails at this step, an error 'Processing external table' is logged into the log table.

When the external table processing for the entire RNA seq record set fails, an error 'Processing result records (w_eha_stg_rna_seq)' is logged.

Other possible errors are: 'Verifying DNA reference version', 'Verifying result type', 'Verifying datasource name', 'Verifying species name', 'Verifying study name', 'Getting flanking offset', 'Retrieving specimen number from external table', and 'Retrieving specimen id'.

4.12.7 Copy Number Variation Loader Errors

Only one external table is created for CNV result data. When the process fails during creation of the external table, the error 'Processing external data table' is logged in the W_EHA_RSLT_LOG table. When the external table processing for the entire CNV result record set fails, the error 'Processing results records (w_eha_stg_copy_nbr_var)' is logged. If the process fails while dropping the CNV external table, the error 'Dropping external table' is logged in the W_EHA_RSLT_LOG table.

Other possible errors are: 'Verifying DNA reference version', 'Verifying result type', 'Verifying datasource name', 'Verifying species name', 'Verifying study name', 'Getting flanking offset', 'Retrieving specimen number from external table', and 'Retrieving specimen id'.

Note:

For the above loaders, the REC_DETAILS column of the W_EHA_RSLT_ERR_LOG table only describes the context in which the process failed.

4.13 Collecting Oracle Optimizer Statistics

Oracle statistics are a collection of data about database objects, such as tables and indexes, required by the Oracle optimizer to estimate the most efficient query execution plan. Missing or stale statistics can severely degrade query performance.

Oracle recommends gathering table and index statistics after a significant amount of data is loaded into a table. The statistics should be gathered after large bulk loads, most notably after reference tables are populated, but also after initial result runs. Later on, when a batch size becomes relatively small compared to the size of the schema, it is not required to gather statistics after each load. Instead, periodic statistics gathering on a weekly schedule is recommended. Statistics should be collected when major loading procedures are not running.

To collect statistics, connect to the database as the ODB schema owner using SQL*Plus and execute the command:

exec dbms_stats.gather_schema_stats ('ODB_', cascade=>true,estimate_percent=>dbms_stats.auto_sample_size);