Oracle® Health Sciences Omics Data Bank Programmer's Guide
Release 3.0.2.1

E35680-12

4 Loaders for Result Data

This chapter includes the following topics:

4.1 Prerequisites

Before using the result loaders, ensure that at least one version of reference Ensembl files has been loaded using the Java loader.

Note:

The loaded reference must match the reference used for alignment of the result files.

Also, ensure that Oracle optimizer statistics were gathered after the reference data was loaded.

After the reference is loaded, perform the following steps to initialize your database:

  1. Create a record in the W_EHA_RSLT_STUDY table. A sequence (W_EHA_RSLT_STUDY_S) is associated with this table and lets you create as many study records as required (see the sketch after this list). The result data can be partitioned by study or by gene, depending on how the ODB schema was created. Add the required number of studies to this table. The result loaders use the RESULT_STUDY_NAME value to look up the corresponding study primary key.

  2. Verify that the W_EHA_DATASOURCE record, which identifies the CDM schema used to validate specimen numbers, is correct. Each result record to be loaded must reference a specimen that exists in the CDM schema. Patient-related specimens should be in the W_EHA_SPECIMEN_PATIENT_H bridge table and subject-related ones in the W_EHA_SPECIMEN_SUBJECT_H table. These CDM schema bridge tables have a SPECIMEN_VENDOR_NUMBER field for vendor-specific information and a SPECIMEN_NUMBER field for a universal specimen identifier. The W_EHA_DATASOURCE table has a DB_LINK_NAME field so that a database link can be used if the CDM schema is in another instance. The table stores the CDM-lookup function call in the VALIDATION_PROC field. Four seed data entries are provided in this table; the result loaders accept one of the following DATASOURCE_CD values as an input parameter:

    • 'CDM': the default datasource, with the VALIDATION_PROC value 'ODB_UTIL.VALIDATE_CDM_PATIENT_SPEC', which does a patient-specific CDM specimen lookup.

    • 'CDM_PATIENT': has the VALIDATION_PROC value 'ODB_UTIL.VALIDATE_CDM_PATIENT_SPEC', which does a patient-specific CDM specimen lookup.

    • 'CDM_SUBJECT': has the VALIDATION_PROC value 'ODB_UTIL.VALIDATE_CDM_SUBJECT_SPEC', which does a subject-specific CDM specimen lookup.

    • 'CDM_BOTH': has the VALIDATION_PROC value 'ODB_UTIL.VALIDATE_CDM_BOTH_SPEC', which internally calls both the ODB_UTIL.VALIDATE_CDM_PATIENT_SPEC and ODB_UTIL.VALIDATE_CDM_SUBJECT_SPEC functions to look up and link to both subject and patient specimens.

  3. Ensure that the ODB schema has SELECT privileges on the W_EHA_SPECIMEN_PATIENT_H_V and W_EHA_SPECIMEN_SUBJECT_H_V views in the CDM schema.

  4. All the specimens required for the example files should be added to the W_EHA_SPECIMEN_PATIENT_H table for patient-related specimens and to the W_EHA_SPECIMEN_SUBJECT_H table for subject-related specimens in the CDM schema.
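
For example, a study record might be created with a statement along the following lines. This is a minimal sketch: the RESULT_STUDY_NAME column and the W_EHA_RSLT_STUDY_S sequence are described above, but the ROW_WID primary key column name is an assumption and should be checked against your ODB schema.

insert into w_eha_rslt_study (row_wid, result_study_name)   -- ROW_WID is an assumed column name
values (w_eha_rslt_study_s.nextval, 'STUDY1');
commit;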

To install the loaders, copy the Result_Loader folder into a directory. Run a loader from the directory in which it is installed. On Linux, this requires execute permission on all SH scripts, which should be granted after the files are copied (for example, using the Linux chmod command).

4.1.1 Setting Default Cache Sizes for Result Loading

Each result table has a corresponding sequence, named after the table with an _S suffix. Each of these result table sequences has a default cache size that reflects an average number of records that might be inserted for any load of result files.

Most of the sequences have a default cache size of 6000, whereas all of the sequencing-related result table sequences have a cache size of 15000. There may be a need to load much larger files (for example, TCGA VCF data can have 4.5 million rows). For larger files, Oracle recommends that a DBA adjust all the corresponding sequence cache sizes to at least 100,000.

Lower sequence cache sizes can result in delays for each parallel process trying to get the next cache of sequences. The actual decision to increase sequence cache size should be based on the number of rows estimated to be inserted during any load.

An example of the SQL to alter a sequence is:

alter sequence w_eha_rslt_gene_exp_s cache 100000;
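
To review the current cache settings before altering them, you can query the Oracle data dictionary as the ODB schema owner, for example:

select sequence_name, cache_size
from   user_sequences
where  sequence_name like 'W_EHA_RSLT%_S'
order  by sequence_name;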

The sequences used by each result loader are as follows:

  1. CGI loader

    • ODB_RSLT_CGI_UTIL (default cache 15000)

      • W_EHA_RSLT_SEQUENCING1_S

      • W_EHA_RSLT_NOCALL1_S

      • W_EHA_RSLT_NON_VARIANT_S

  2. VCF loader

    • ODB_RSLT_GVCF_UTIL (default cache 15000)

      • W_EHA_RSLT_CONFLICT_S

      • W_EHA_RSLT_NOCALL1_S

      • W_EHA_RSLT_NON_VARIANT_S

      • W_EHA_RSLT_SEQUENCING1_S

      • W_EHA_RSLT_STRUCT_VAR_S

      • W_EHA_RSLT_SV_BREAKEND_S

  3. MAF Loader

    • ODB_RSLT_MAF_UTIL (default cache 15000)

      • W_EHA_RSLT_SEQUENCING1_S

  4. RNA-seq loader

    • ODB_RSLT_RNA_SEQ_UTIL (default cache 6000)

      • W_EHA_RSLT_RNA_SEQ_S

  5. CNV loader

    • ODB_RSLT_CNV_UTIL (default cache 6000)

      • W_EHA_RSLT_COPY_NBR_VAR_S

  6. Single channel loader

    • ODB_RSLT_SINGLE_CHANNEL_UTIL (default cache 6000)

      • W_EHA_RSLT_GENE_EXP_S

  7. Dual channel loader

    • ODB_RSLT_DUAL_CHANNEL_UTIL (default cache 6000)

      • W_EHA_RSLT_2CHANNEL_GXP_S

  8. QC metadata loader

    • ODB_RSLT_METADATA_UTIL (default cache 6000)

      • W_EHA_RSLT_SPEC_QLFR_S

4.2 Overview of Result Loaders

Following are the result loaders provided, one for each file type:

Additionally, there are:

All loaders can be run using .bat files in Windows or shell scripts in Linux.

The .bat files to be run in Windows are as follows:

Note:

An Oracle Wallet must be set up before the batch files can be run successfully.

If an Oracle Wallet is set up, a shell script can use the wallet credentials to run SQL*Plus.

If Oracle Wallet is not set up, the shell script prompts for a password and connects to SQL*Plus.

The shell scripts to be run in Linux are as follows:

4.3 Version Information Utility

The version information utility checks for all available versions in the database instance for any of the version types allowed in the W_EHA_VERSION table. It is a command-line API which:

4.3.1 Functional Description

The version info utility is a standalone script, version_info.sh (version_info.bat on Windows), with an optional named argument '-list_ver', to which a reference version type is passed. The loader calls the odb_reference_util.list_version_info function, passing the version type argument as a parameter. The function queries the W_EHA_VERSION table for the VERSION_LABEL values whose VERSION_TYPE column matches the input parameter. These version labels are listed to the user on the command line. As mentioned in the previous section, if the user does not pass the parameter, the function displays the list of all versions for each version type.
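
The same list can also be inspected directly with a query against the W_EHA_VERSION table, for example, for the 'DNA' version type:

select version_label
from   w_eha_version
where  version_type = 'DNA'
order  by version_label;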

4.3.2 Running the Version Check Utility

Name

version_info.sh - Lists Version Labels

Synopsis

version_info.sh -help

version_info.sh <...options>

Description

Validates input options and calls the loader script list_version.sql#odb_reference_util.list_version_info

Options

(*) required

-db_wallet* <VARCHAR2>

Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string, for example,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-list_ver <VARCHAR2>

A Reference Version type allowed for W_EHA_VERSION.VERSION_TYPE column. Known Version types are: 'DNA','PROTEIN','HUGO','PATHWAY','SIFT','POLYPHEN','PROBE', 'GENETIC_CODE', 'COSMIC', and 'BIOBASE'

4.4 CGI masterVar Data Loader

The CGI masterVar file is an integrated report of variant calls and annotations, with each file representing variants per sample. The current CGI masterVar loader supports four versions of the CGI format: 2.0, 2.2, 2.4, and 2.5.

Because the number of columns differs significantly between CGI format versions, the loader is built to detect the version and load the data accordingly.

The CGI 2.0 file format is described here:

ftp://ftp2.completegenomics.com/

Each section in a CGI file is self-contained and separate. The following three types of sections are present:

The main challenge while loading a CGI file is to parse the #SAMPLE information from the comments section and then map it to the rest of the data. This sample information is used to retrieve the specimen ID from the data source specified when executing the batch file. The loader also parses the #FORMAT information, which contains the CGI format version details, and maps it to the CGI file type in the W_EHA_FILE_TYPE table.

4.4.1 Functional Description of CGI Loader

The CGI loader currently loads variant records as well as the wild type (WT) information. For example,

  • If one of the alleles is WT while the other is variant, then the loader only records the variant allele.

  • If both the alleles are WT then the loader creates a single record in W_EHA_RSLT_NON_VARIANT table.

  • If both alleles are variants and homozygous, then it stores only one record. In such cases, it stores the lesser score value in the SCORE_VAF and SCORE_EAF columns.

  • If both alleles are variant and heterozygous, then it stores the data as two separate records.

  • If a record is a nocall and both alleles have the same value, then a single record is created in the W_EHA_RSLT_NOCALL table.

  • If a record has one allele as a variant and the other as a nocall, then the loader creates one record in the variant table and one record in the nocall table.

  • A haploid record contains information for only one allele; if it is a variant, it is reported as a variant record.

  • If a haploid record is wildtype, then it is reported in W_EHA_RSLT_NON_VARIANT table.

The loader takes the chromosome and position details of a record from the CGI masterVar file and checks if the corresponding region exists in the W_EHA_DNA_SOURCE table. If it is present, the loader maps this W_EHA_DNA_SOURCE record as W_EHA_VARIANT.SOURCE_WID. If the region does not exist, the loader ignores that record and logs information in the W_EHA_RSLT_LOG table indicating that some records were not loaded. The loader does not validate the chromosome number or position details; if it encounters such invalid data, it ignores the record and does not log it to the W_EHA_RSLT_ERR_LOG table.
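
Conceptually, the region check resembles the following query. This is a sketch only: W_EHA_DNA_SOURCE.START_POSITION is referenced elsewhere in this chapter, but the ROW_WID, CHROMOSOME_WID, and END_POSITION column names and the bind variables are assumptions, not the loader's actual SQL.

select ds.row_wid                                    -- candidate SOURCE_WID (assumed key name)
from   w_eha_dna_source ds
where  ds.chromosome_wid = :chromosome_wid           -- assumed chromosome link
and    :result_position between ds.start_position and ds.end_position;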

The loader does not validate the accuracy of the reference nucleotides in the database. It assumes that the same version of reference-mapped CGI masterVar data is loaded into ODB. Ensure that the reference version of the results file being loaded matches that of the reference data available in ODB.

4.4.2 Files to Load

The execution call of the stored procedure odb_rslt_cgi_util.process_cgi() is designed in the script file load_cgi.sql. This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, SPECIES NAME, STUDY NAME, DATA SOURCE, and SPECIMEN VENDOR as input parameters.

4.4.3 Data Load

The loader creates 2 external tables. One table stores the metadata and the other stores the actual result data.

The loader first creates an external table to store the metadata of the result file. This metadata resides in the header part of the result file, which starts with the ## string. This external table is then used to populate the W_EHA_RSLT_FILE_QLF table. The W_EHA_RSLT_FILE_QLF table is simply a name-value pair table.

Most of the file metadata is in XML format, where an identifier or tag is followed by an attribute value or XML definition. The metadata load sets QUALIFIER_TAG to the identifier before the = character (for example, FORMAT, INFO, FILTER), and everything after the = is copied to the QUALIFIER_VALUE column of the W_EHA_RSLT_FILE_QLF table.
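
The split on the = character can be pictured with a simple expression. This is illustrative only; the header line below is a made-up example, not loader code.

select substr(hdr, 3, instr(hdr, '=') - 3) as qualifier_tag,    -- text between ## and =
       substr(hdr, instr(hdr, '=') + 1)    as qualifier_value   -- everything after =
from   (select '##FORMAT=<ID=GT,Number=1,Type=String>' as hdr from dual);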

The SQL to create the data external table can exceed 32K, so a cursor is used to create it. The constant string used is broken up into five separate strings, which allows a statement of more than 32K to create the external table. The structure of the external table depends on the version of the file type.

The loader first processes the reference data. A select statement that inserts data into W_EHA_VARIANT_STG computes the overlap value by comparing the reference with the allele sequence. Using this overlap value, the reference and allele sequences are shortened and the start and end positions are incremented. The overlap value is also used to create a replace tag from the shortened reference and allele sequences.

After inserting the record into the W_EHA_VARIANT_STG table, the PROCESS_VARIANT() procedure is called, which populates the W_EHA_VARIANT table. This procedure also populates the W_EHA_VARIANT_LINK_STG table if it exists; otherwise, it creates and populates a new _link table. This procedure has also been extended to populate the new W_EHA_VARIANT_GENE_MAP table, which stores the link between the variant and the GENE_WID. To populate this table, a select query maps the GENE_WID of W_EHA_GENE_SEGMENT to the variant record using the begin and end positions of the result file data.

The loader then parses the first set of result data, which does not use the W_EHA_VARIANT foreign key. This includes the W_EHA_RSLT_NOCALL and W_EHA_RSLT_NON_VARIANT tables.

The loader then parses the data to link all records to the W_EHA_VARIANT table. This includes W_EHA_RSLT_SEQUENCING.

Specimen identifiers are stored in the CGI data file header. Several values are used to obtain the correct primary key value of the corresponding SPECIMEN record, which is external to the ODB schema. The SPECIMEN identifier is the first field used to look up the external database; the loader reads it from the #SAMPLE tag in the header part of the result file. It is usually a barcode or some other natural key. This value is not necessarily unique in the other database (especially when coming from HDM), where different vendors can have different barcode systems that may overlap. A specimen vendor number is also used to look up the correct SPECIMEN record. The last value required specifies whether multiple sources are used to provide SPECIMEN records. Each result table stores a foreign key to the correct datasource for the SPECIMEN and the primary key value for the SPECIMEN record. The loading code must use all three fields to find the correct values for the result records.

The alleles identified as no-call create a W_EHA_RSLT_NOCALL record. All other non-reference alleles create a W_EHA_RSLT_SEQUENCING record. The query used to create sequencing records joins with the W_EHA_VARIANT_GENE_MAP table to compute the GENE_WID for the sequencing record, matching the variant WID of W_EHA_RSLT_SEQUENCING against that of W_EHA_VARIANT_GENE_MAP. All wildtype records with a zygosity of hom or hap and a vartype of ref are loaded into the W_EHA_RSLT_NON_VARIANT table.

To compute the GENE_WID for nocall and non-variant records, a temporary intermediate table is created, which calculates the relative position of the records. This table is then used to map against the start and end positions of the W_EHA_RSLT_NOCALL or W_EHA_RSLT_NON_VARIANT table for nocall and non-variant records.

Note:

  • The batch file requires Oracle Wallet to be set up to run correctly.

  • On Linux, you do not need to enclose Homo sapiens in double quotation marks (""). This requirement applies only to Windows.

Table 4-1 Mapping of CGI Result File

Column Name | Table and Column Name in ODB | Description

Chromosome

W_EHA_RSLT_NOCALL.CHROMOSOME_WID

W_EHA_RSLT_SEQUENCING.CHROMOSOME_WID

W_EHA_VARIANT.CHROMOSOME

W_EHA_RSLT_NON_VARIANT.CHROMOSOME_WID

For no-call results, this is stored directly in the CHROMOSOME field. For non-reference alleles, this field is used with the begin position to find the correct DNA_SOURCE record, a VARIANT record, or create a VARIANT record.

Three values are needed to find existing VARIANT records. The chromosome, the begin position, and the replace tag which is the notation combining reference and allele sequences.

For NOVEL variants, a new record is created in the W_EHA_VARIANT table with chromosome value.

Begin

W_EHA_RSLT_NOCALL.START_POSITION

W_EHA_RSLT_SEQUENCING.START_POSITION

W_EHA_VARIANT_STG.START_POSITION

W_EHA_VARIANT.ABSOLUTE_POSITION

W_EHA_RSLT_NON_VARIANT.START_POSITION

The value of this field must be incremented by 1, since CGI uses zero-based offsets and all other references use one-based offsets. This field is used as described above.

For no-call results, this is stored in the START_POSITION field (after adding 1).

For NOVEL variants, begin is stored in the ABSOLUTE_POSITION column of the W_EHA_VARIANT table.

For W_EHA_VARIANT.START_POSITION, the start position is relative to the value in W_EHA_DNA_SOURCE.START_POSITION, using the begin value.

End

W_EHA_RSLT_NOCALL.END_POSITION

W_EHA_VARIANT_STG.END_POSITION

W_EHA_RSLT_NON_VARIANT.END_POSITION

This value is used for no-call results and stored in the END_POSITION field. For novel variants, this value is also used to calculate the relative end position, based on W_EHA_DNA_SOURCE.START_POSITION, which is stored in W_EHA_VARIANT.END_POSITION.

Zygosity

W_EHA_RSLT_SEQUENCING.ZYGOSITY

W_EHA_RSLT_NON_VARIANT.ZYGOSITY

Stored for sequencing alleles in ZYGOSITY field.

Vartype

W_EHA_RSLT_SEQUENCING.VARIANT_TYPE

W_EHA_RSLT_NOCALL.NOCALL_TYPE

Stored either in NOCALL_TYPE or VARIANT_TYPE.

reference

W_EHA_VARIANT.REPLACE_TAG

This value is used in conjunction with allele1Seq and allele2Seq to construct a replace tag value to find existing VARIANT records in W_EHA_VARIANT table.

For CGI, two overlap values are computed, one for each allele of each row of the result file. Using the overlap value, the reference sequence and the allele sequence are shortened and the start and end positions are incremented. The overlap value is used to create a replace tag from the shortened reference and allele sequences. For insertions, the reference sequence uses a - and for deletions the allele sequence uses a -. This is standard notation used in most references.

At some in-dels the representation can be as follows:

ins can be AT/ATCTA and del can be ATCTA/AT.

The logic for checking and inserting variants into the file is in the called procedure. The procedure should handle varying representations for variants coming from any of the sequencing file types.

In some cases, this field is empty for insertions.

allele1Seq

W_EHA_VARIANT.REPLACE_TAG

W_EHA_VARIANT_X.ALLELE

For sequencing results, this value constructs the replace tag. In some cases, this field is empty for deletions.

allele2Seq

W_EHA_VARIANT.REPLACE_TAG

W_EHA_VARIANT_X.ALLELE

For sequencing results, this value constructs the replace tag. In some cases, this field is empty for deletions.

allele1VarScoreVAF

W_EHA_RSLT_SEQUENCING.SCORE_VAF

W_EHA_RSLT_NON_VARIANT.SCORE_VAF

Used for sequencing results and stored in the SCORE_VAF field. If the variant is homozygous, only a single allele record is stored, and the lesser of the two scores is stored.

allele2VarScoreVAF

W_EHA_RSLT_SEQUENCING.SCORE_VAF

W_EHA_RSLT_NON_VARIANT.SCORE_VAF

Used for sequencing results and stored in the SCORE_VAF field. If the variant is homozygous, only a single allele record is stored, and the lesser of the two scores is stored.

allele1VarScoreEAF

W_EHA_RSLT_SEQUENCING.SCORE_EAF

Used for sequencing results and stored in the SCORE_EAF field. If the variant is homozygous, only a single allele record is stored, and the lesser of the two scores is stored.

allele2VarScoreEAF

W_EHA_RSLT_SEQUENCING.SCORE_EAF

Used for sequencing results and stored in the SCORE_EAF field. If the variant is homozygous, only a single allele record is stored, and the lesser of the two scores is stored.

allele1HapLink

W_EHA_RSLT_SEQUENCING_X.HAPLINK

W_EHA_RSLT_NOCALL_X.HAPLINK

Used for both no-call and sequencing results and stored in the HAPLINK field.

allele2HapLink

W_EHA_RSLT_SEQUENCING_X.HAPLINK

W_EHA_RSLT_NOCALL_X.HAPLINK

Used for both no-call and sequencing results and stored in the HAPLINK field.

allele1ReadCount

W_EHA_RSLT_SEQUENCING.ALLELE_READ_COUNT

Used for sequencing results and stored in the ALLELE_READ_COUNT field.

allele2ReadCount

W_EHA_RSLT_SEQUENCING.ALLELE_READ_COUNT

Used for sequencing results and stored in the ALLELE_READ_COUNT field.

referenceAlleleReadCount

W_EHA_RSLT_SEQUENCING.REFERENCE_READ_COUNT

W_EHA_RSLT_NON_VARIANT.REFERENCE_READ_COUNT

Used for sequencing results and stored in the REFERENCE_READ_COUNT field.

totalReadCount

W_EHA_RSLT_SEQUENCING.TOTAL_READ_COUNT

W_EHA_RSLT_NON_VARIANT.TOTAL_READ_COUNT

Used for sequencing and non-variant results and stored in the TOTAL_READ_COUNT field.


4.4.4 Running the CGI Loader with Named Command-Line Arguments

Name

CGI_masterVar_loader.sh - load records

Synopsis

CGI_masterVar_loader.sh -help

CGI_masterVar_loader.sh <...options>

Description

Validates input options and calls the loader script load_cgi.sql#odb_rslt_cgi_util.process_cgi

Options

(*) required

-db_wallet* <VARCHAR2>

Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string, for example,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see Section 2.1, "Setting Up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES, for example, "Homo sapiens" for humans

-study_name* <VARCHAR2>

Study name defined in W_EHA_RSLT_STUDY.RESULT_STUDY_NAME

-datasource_name* <VARCHAR2>

Datasource name defined in W_EHA_DATASOURCE.DATASOURCE_NM [default: CDM]

-specimen_vendor* <VARCHAR2>

Specimen vendor - Sample vendor number of specimen with genomic result data. If CDM is referenced, this value should be defined in the W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_VENDOR_NUMBER

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-file_flg* <CHAR>

File flag (E=external|S=copy to secure data file directory) [default: E]

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR

-data_file_path <VARCHAR2>

File system path to secure data file directory

-dbfs_store <VARCHAR2>

Database file system store

-alt_file_loc <VARCHAR2>

Alternate file location link, for example, ftp:location or http:location

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

-data_file_dir <VARCHAR2>

File system path to Oracle directory object

-parallel_degree <NUMBER>

The degree of parallelism, or number of parallel execution servers associated with the load operation.

4.4.5 Examples

UNIX:

$ CGI_masterVar_loader.sh -db_wallet odb_user s04jsnx1 -check_version 0 -check_version_non_i 1 -data_file 'summary_masterVarBeta_2.4_format.tsv' -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM_PATIENT" -specimen_vendor "vendor2" -reference_version "GRCh37.p8" -file_flg "E" -parallel_degree 8

Windows:

C:\> CGI_masterVar_loader.bat -db_wallet odb_user s04jsnx1 -check_version 0 -check_version_non_i 1 -data_file 'summary_masterVarBeta_2.4_format.tsv' -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM_PATIENT" -specimen_vendor "vendor2" -reference_version "GRCh37.p8" -file_flg "E" -parallel_degree 8

4.5 VCF Sequence Data Loader

4.5.1 Functional Description

The VCF loader obtains the chromosome, position, and reference version details of a record from a VCF file and checks if the corresponding region of that chromosome exists in the W_EHA_DNA_SOURCE table for the specific reference version given as input to the loader. If present, it maps this W_EHA_DNA_SOURCE record as W_EHA_VARIANT.SOURCE_WID. If the region is not found (for example, if the chromosome name and (or) the position details are invalid), the loader ignores the record and does not log it to the W_EHA_RSLT_LOG table.

The loader supports two types of chromosome representation in the VCF file; for example, both chr10 and 10 are loaded without error. For the mitochondrial chromosome, the loader can read chrM, chrMT, M, and MT from the file.

A typical VCF file contains data for multiple specimens. If one or more specimen values do not exist in the CDM schema, then the data for those specimens is not loaded, but is logged in W_EHA_RSLT_LOG.

The loader does not validate the accuracy of the reference nucleotides in the database. It assumes that the same version of reference-mapped VCF data is loaded to ODB. Therefore, ensure that the reference version of the results file being loaded matches that of the reference data available in the ODB. ODB now supports multiple reference versions, so the VCF loader must be instructed as to which reference data it must map the results to. You must provide the version information, present in the VERSION_LABEL column of the W_EHA_VERSION table, to the VCF loader as a parameter. Refer to the loader parameter list for more details.

The VCF loader has been extended from the existing support of the 1000 Genomes VCF 4.1 format to cover mutations ranging from SNVs and small indels to large structural variants and structural rearrangements. The updated VCF loader also supports gVCF (genome VCF) data from Illumina, version 20120906a.

Details of the 1000 Genomes VCF 4.1 specification can be found at the following link:

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

Details of the gVCF specification can be found at the following link:

https://sites.google.com/site/gvcftools/home/about-gvcf

Following is a brief description of each format supported by the VCF loader.

4.5.1.1 1000 genomes VCF4.1 Version

The 1000 genomes VCF 4.1 format can be broadly classified into 3 categories based on the type of variants given below:

  1. SNV and small indel: These mutations are loaded in the W_EHA_RSLT_SEQUENCING, W_EHA_RSLT_NOCALL, and W_EHA_VARIANT tables. The W_EHA_VARIANT table is only populated for novel variants not already present in that table.

    The loader populates the W_EHA_RSLT_NOCALL table based on GT values of './.'. Consecutive records with this type of nocall genotype are collapsed while loading to this table. For example, if there are three records with POS 1001, 1002, and 1003 with a nocall genotype, then only one record is created in the W_EHA_RSLT_NOCALL table with START_POSITION 1001 and END_POSITION 1003 (a sketch of this collapsing logic appears after this list).

  2. Large Structural Variation: These large structural changes in the genome are recorded in the W_EHA_RSLT_SEQUENCING, W_EHA_VARIANT, and W_EHA_VARIANT_X tables. The W_EHA_VARIANT and W_EHA_VARIANT_X tables are only populated for novel variants not already present in those tables. The W_EHA_VARIANT_X table is used to store only the ALT column value from the VCF file, in the ALLELE CLOB column of that table.

  3. Structural rearrangement: Currently, 1000 Genomes data for structural rearrangements is not yet released. There is neither detailed documentation nor proper examples in the 1000 Genomes manual covering all scenarios for this data set. In view of this, the loader is built on the following assumptions, derived from the small amount of information available in the 1000 Genomes VCF 4.1 manual.

    1. The loader assumes that there is GT information in the form of either '0' or '1'.

    2. The loader identifies structural rearrangements from the tag 'SVTYPE=BND' present in the INFO column.

    3. Allele depth (AD) and total depth (DP) are expected to be a single value by the loader.

    These genomic re-arrangements are stored in W_EHA_RSLT_SV_BREAKEND table.

    Since a VCF 4.1 file can contain all the above three types of mutations in the same file, the VCF loader automatically distinguishes these three types of data using the INFO column and records are created in their respective result tables as described above.

    • If the INFO column of the VCF file does not have an 'SVTYPE' tag or has 'SVTYPE' other than 'BND' (for example, 'SVTYPE=DEL'), then this mutation is considered as either SNV, small indel or large SV and data is loaded to the W_EHA_RSLT_SEQUENCING table.

    • If the INFO column has 'SVTYPE=BND', then this mutation is considered as a structural rearrangement and records are loaded to the W_EHA_RSLT_SV_BREAKEND table.
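
The collapsing of consecutive nocall positions described in item 1 can be sketched as a gaps-and-islands query. This is illustrative only; the NOCALL_STG table and its POS column are assumptions, not actual loader objects.

select min(pos) as start_position,
       max(pos) as end_position
from  (select pos,
              pos - row_number() over (order by pos) as island   -- constant within a consecutive run
       from   nocall_stg)
group  by island;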

4.5.1.2 Genome Variant Call Format (gVCF)

gVCF is designed to store both variant and non-variant information related to a single sample only. gVCF follows the 1000 Genomes VCF 4.1 conventions, with additional features such as siteConflicts. The VCF loader processes gVCF data into W_EHA_RSLT_SEQUENCING, W_EHA_RSLT_NON_VARIANT, W_EHA_RSLT_CONFLICT, W_EHA_RSLT_NOCALL, and W_EHA_VARIANT. The W_EHA_VARIANT table is only populated for novel variants not already present in that table. The mutations are stored in W_EHA_RSLT_SEQUENCING, the non-variant information is stored in the W_EHA_RSLT_NON_VARIANT table, the conflict variants are stored in W_EHA_RSLT_CONFLICT, and nocall information is stored in the W_EHA_RSLT_NOCALL table. Nocall data for consecutive positions is not collapsed in a gVCF load because it is already compressed in the gVCF file, based on similar quality scores and other parameters defined while creating the gVCF file.

The following logic is used while populating gVCF records in target tables:

  • Any gVCF file records with GT values 0/1, 1/0, 1/2, or n/n, where n is not zero, are stored in the W_EHA_RSLT_SEQUENCING table.

  • Any records with GT values 0/0 or just 0 are stored in the W_EHA_RSLT_NON_VARIANT table.

  • Any records with GT values '.' or './.' and ALT value '.' are stored in the W_EHA_RSLT_NOCALL table.

  • Any records with GT value '.' and ALT value not '.' are stored in the W_EHA_RSLT_CONFLICT table.

4.5.1.3 FILE_TYPE_CODE and LOAD_MODE of VCF Loader

As mentioned earlier, the VCF loader can be used to load all types of VCF data, that is, SNP and small indel, large structural variation, structural rearrangement, and gVCF data. Since there is just one loader for loading all the data types, it has to be provided with some information to identify the file type. There is one additional parameter which can be used to load data in a specific mode as described below.

There are mainly two parameters required by the VCF loader which determine the input file type and the mode in which to load this data.

These parameters are:

  • FILE_TYPE_CODE: The 'FILE_TYPE_CODE' parameter is used to provide the file type information. While loading a VCF file containing either SNP and small indel or large SV or SV-rearrangements, give FILE_TYPE_CODE as 'VCF'. While loading a gVCF file, FILE_TYPE_CODE should be given as 'GVCF'.

  • LOAD_MODE: The VCF loader has options to load VCF and gVCF data in three different modes using the parameter 'load_mode'. Following are the three types of modes identified by the loader:

    • 'VCF' mode: This mode can be used for both VCF and gVCF file types. This mode loads mutations and nocall data and only populates the W_EHA_RSLT_SEQUENCING and W_EHA_RSLT_NOCALL tables. If FILE_TYPE_CODE is GVCF and LOAD_MODE is VCF, then all non-variant and conflict records from the gVCF file are skipped and only variants and nocalls are loaded.

    • 'GVCF' mode: This mode loads data to all tables made for gVCF, that is, W_EHA_RSLT_SEQUENCING, W_EHA_RSLT_NON_VARIANT, W_EHA_RSLT_NOCALL and W_EHA_RSLT_CONFLICT tables. Since the gVCF file has all the information about a genome such as variants, non-variants, nocalls and conflicts, this mode is best suited for the gVCF file type. However, this mode can also be used to load data from a VCF file with all non-variant information for a specific genome. Usually, a VCF file doesn't contain all the non-variant information of a genome and only shows few records as non-variant for a specimen when there is a mutation at that position for a different specimen. It is advisable not to load such files using GVCF mode because incomplete non-variant information will be loaded for that specimen.

    • 'NON-VAR' mode: This mode loads data only into the W_EHA_RSLT_NON_VARIANT and W_EHA_RSLT_NOCALL tables. It is designed for scenarios where mutations have already been loaded from a VCF file and only non-variant information is to be loaded. Like the GVCF mode, this mode is also mostly suited for the gVCF file type as the gVCF file contains all non-variant information. It is not advisable to load a VCF file which does not contain all the non-variant information for a specimen in NON-VAR mode, as incomplete non-variant information will be stored in the database.

4.5.2 Custom Format Specification in VCF

The VCF loader also supports loading custom data types from the FORMAT column of a VCF file. Following are the details on loading custom formats to ODB.

The custom format option helps load certain VCF FORMAT column fields which are currently not mapped in ODB. Before executing the loader, you must manually create a column in W_EHA_RSLT_SEQUENCING and W_EHA_RSLT_SV_BREAKEND tables. For Exadata, the staging tables like W_EHA_STG_SEQUENCING and W_EHA_STG_SV_BREAKEND should also be appended with the additional column in the same order as defined in the main result tables.

The column names should follow a specific naming convention. To map a 'PL' data type from the FORMAT column in the input file, create a column with the field name 'CUST_PL'. Then provide the mapping details to the loader under 'custom_format' parameter for the loader as "PL=CUST_PL". To load multiple custom format columns, provide the values as a comma separated string, for example, "PL=CUST_PL,GL=CUST_GL".

Although there is now no limitation on the number of custom formats supported by the loader, it can read only 32 format data types at a time, so a custom format data type that occurs beyond the first 32 data types is not loaded. The order of the custom columns created should be the same in the main tables and in the corresponding staging tables; otherwise there could be a mismatch in the data loaded. It is recommended to add a custom column with a VARCHAR2(%n) data type, as there could be comma-separated values and other alphanumeric characters in the field. The %n should be defined based on the string requirement of the FORMAT data type for which the column is created.
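
For example, to map a 'PL' format field, the columns might be added as follows. This is a sketch only: the VARCHAR2 length of 100 is an arbitrary choice, and the staging-table statements apply only to Exadata installations.

alter table w_eha_rslt_sequencing  add (cust_pl varchar2(100));
alter table w_eha_rslt_sv_breakend add (cust_pl varchar2(100));
-- Exadata only: append the same column, in the same order, to the staging tables
alter table w_eha_stg_sequencing   add (cust_pl varchar2(100));
alter table w_eha_stg_sv_breakend  add (cust_pl varchar2(100));

The loader would then be run with -custom_format "PL=CUST_PL".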

Alternatively, set these custom format field mappings as a global variable in the VCF_FORMAT column of the W_EHA_PRODUCT_PROFILE table. This has to be set manually by the user in this table. The -custom_format input parameter of the VCF loader, which is a temporary setting, takes precedence over the globally defined variable in the W_EHA_PRODUCT_PROFILE.VCF_FORMAT column while loading VCF data. The format specification in the VCF_FORMAT column is the same as that used for the -custom_format input parameter.

The VCF export functionality of the Cohort Explorer UI also requires that the custom format details be defined in the global variable, that is, the W_EHA_PRODUCT_PROFILE.VCF_FORMAT column. The exported VCF file contains the custom format field mapping data only if the VCF_FORMAT global variable is set; otherwise only the default format mapping data is exported. When updating the VCF_FORMAT column, ensure that only new custom format fields are appended, not existing ones.

4.5.2.1 Debugging Inconsistent Datatypes for FORMAT Field in VCF File

VCF files can be generated by various in-house and open source tools, so there is a possibility of non-standard datatypes being used for some of the VCF-mapped columns in ODB. In such cases, an invalid number error is generated by the loader without providing information about the line that has the datatype mismatch. To find the exact location of the datatype mismatch, the VCF loader has an additional parameter, '-validate_numbs'. The default value of this parameter is 'N'. Pass 'Y' as the parameter value to validate the numeric values in the VCF file. This parameter determines if there is a character value instead of the expected numeric value and displays the line number and sample order number that have the issue. For example, the error appears as follows:

  • Detail: Error in number conversion of row number 16 and specimen column S1

  • Description: ORA-06502: PL/SQL: numeric or value error: character to number conversion error

The error indicates that there is a data format issue for the first sample in line number 16 of the VCF file.

4.5.3 Data Load

The loader code uses an external table through a cursor to parse the header rows, store metadata records, and validate specimens. Most of the file metadata is in XML format, where an identifier or tag is followed by an attribute value or XML definition. The metadata load sets the QUALIFIER_TAG to the identifier before the "=" character (for example, FORMAT, INFO, FILTER), and everything after the "=" character is copied to the QUALIFIER_VALUE column of the W_EHA_FILE_LOAD_QLFR table.

A separate external table is then used to read the data file in a single sequential (non-parallel) pass. This promotes accurate line number values and efficient loading from various storage locations such as DBFS. All other DML statements use the temporary table into which this data is loaded.

The loader then utilizes an insert statement to create all W_EHA_VARIANT records as long as the "i_load_mode" parameter is not set to "NON-VAR". This process creates the W_EHA_VARIANT_X records, when there is a large structural variant or the replace tag for non-deletion variants is larger than 1000 characters. The variant record is created by eliminating any overlapping DNA sequences and then adjusting the start position. Checksums and length checks are performed to avoid duplicate records in the W_EHA_VARIANT_X table. If the "i_xref_db" parameter is specified, all rows that have XREF information also create the W_EHA_VARIANT_XREF records.

Note:

The current loader can be used to load VCF files that do not have any specimens. This is convenient for loading information from dbSNP and Cosmic.

If the "i_validate_numbs" parameter is set to 'Y", then all fields that are expected to be numeric, are evaluated and any rows that have invalid numbers are stored in the log file. No further processing is performed. This is useful for debugging files that do not create any result records.

The loader then uses separate insert statements to create the following records:

  • W_EHA_RSLT_SEQUENCING

  • W_EHA_RSLT_NOCALL

  • W_EHA_RSLT_NON_VARIANT

  • W_EHA_RSLT_SV_BREAKEND

  • W_EHA_RSLT_CONFLICT

The following table indicates the tables inserted based on LOAD_MODE:

Table 4-2 Tables inserted based on LOAD_MODE

Name of the Record          VCF    GVCF   NON-VAR
W_EHA_RSLT_SEQUENCING       X      X      -
W_EHA_RSLT_NOCALL           X      X      X
W_EHA_RSLT_NON_VARIANT      -      X      X
W_EHA_RSLT_SV_BREAKEND      X      X
W_EHA_RSLT_CONFLICT         -      X      -

4.5.3.1 Data Files

Two kinds of VCF files are available at 1000 Genomes: sites and genotypes. A sites file does not contain genotypic data and sample details, whereas a genotypes VCF file contains individual genotypic data along with sample information. The current loader supports only VCF files with sample and genotype data. The sample information is present in the header row of the VCF data following the FORMAT column; each such column represents one sample.

The data type representation format and its order for each sample are specified in the FORMAT column. All alternate alleles for all samples are stored in the ALT column. To obtain the allele information for each sample, the GT identifier from the FORMAT column for each sample is used. The allele value is represented numerically (for example, 0/1, 1/2), where 0 represents the reference allele and 1 and 2 represent the alleles in the order specified in ALT.
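
As an illustration of the GT decoding (the literals below are made-up values, not loader code), the allele number selects an entry from the comma-separated ALT list, with 0 falling back to REF:

select decode(allele_num, 0, ref,
              regexp_substr(alt, '[^,]+', 1, allele_num)) as allele_seq
from  (select 'G' as ref, 'A,T' as alt, 2 as allele_num from dual);
-- returns 'T': allele number 2 is the second value in ALT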

Following is the list of passes that are used to process each VCF file:

  1. The file is parsed for header columns that are indicated by "##" and the results are stored in the W_EHA_FILE_LOAD_QLFR table.

  2. The file is parsed to create all referenced W_EHA_VARIANT records.

    Note:

    This pass does not require the "GT" format field, to facilitate loading reference VCF files.
  3. The file is parsed to add records that do not use W_EHA_VARIANT foreign keys. This includes W_EHA_RSLT_NOCALL, W_EHA_RSLT_NON_VARIANT, and W_EHA_RSLT_SV_BREAKEND.

  4. The file is parsed to link all records to W_EHA_VARIANT. This includes W_EHA_RSLT_SEQUENCING.

  5. In GVCF mode, the file is parsed to add W_EHA_RSLT_CONFLICT records.

  6. The loader can read only 32 data types from the FORMAT column in the VCF file. Any data type, either supported or custom, not present in the first 32 data types of FORMAT column will not be loaded.

Note:

The batch file requires Oracle Wallet to be set up to run correctly.

Following is the table mapping of VCF Result File (snps, indels, large SVs, and rearrangements) and gVCF:

Column Name in Result File | Table and Column Name in ODB | Description

CHROM

W_EHA_VARIANT.CHROMOSOME_WID

W_EHA_RSLT_SEQUENCING.CHROMOSOME_WID

W_EHA_RSLT_NON_VARIANT.CHROMOSOME_WID

W_EHA_RSLT_NOCALL.CHROMOSOME_WID

W_EHA_RSLT_CONFLICT.CHROMOSOME_WID

W_EHA_RSLT_SV_BREAKEND.REF_CHROMOSOME_WID

This field is used with the POS to find the correct DNA_SOURCE record, to find a VARIANT record, or create a VARIANT record.

Three values are required to find existing VARIANT records. The chromosome, the POS, and the replace tag, which is notation combining reference and allele sequences. For NOVEL variants, a new record is created in the W_EHA_VARIANT table with chromosome value.

POS

W_EHA_VARIANT_STG.START_POSITION

W_EHA_VARIANT_STG.END_POSITION

W_EHA_VARIANT.ABSOLUTE_POSITION

W_EHA_RSLT_SEQUENCING.START_POSITION

W_EHA_RSLT_NON_VARIANT.START_POSITION

W_EHA_RSLT_NOCALL.START_POSITION

W_EHA_RSLT_CONFLICT.START_POSITION

W_EHA_RSLT_SV_BREAKEND.REF_START_POSITION

This field is used as described above in the Chromosome column.

For novel variants, POS is stored in the ABSOLUTE_POSITION column in the W_EHA_VARIANT table.

For W_EHA_VARIANT.START_POSITION, the start position is relative to the value in W_EHA_DNA_SOURCE.START_POSITION, using the POS value. Since VCF does not have end position information, when novel variants are inserted into the VARIANT table, END_POSITION is calculated from the POS value and the number of bases in REF.

ID

W_EHA_RSLT_SV_BREAKEND.BREAKEND_ID

This value is only stored for structural rearrangement data where 'SVTYPE=BND' is present in the INFO column. The value in the ID column is stored in the BREAKEND_ID column.

REF

W_EHA_VARIANT.REPLACE_TAG

This value is used in conjunction with the ALT sequence to construct a replace tag value used to find existing VARIANT records. The replace tag is constructed with the reference first, followed by "/", and then the allele sequence.

For insertions, the reference sequence uses a "-" and for deletions the allele sequence uses "-". This is standard notation used in most references. At some in-dels, the representation can be as follows:

ins can be AT/ATCTA and del can be ATCTA/AT.

The logic is implemented in the procedure which can be called with any of the above formats.

ALT

W_EHA_RSLT_SEQUENCING.ALLELE

W_EHA_RSLT_CONFLICT.ALLELE

W_EHA_RSLT_CONFLICT.ALLELE_CLOB

W_EHA_VARIANT_X.ALLELE

W_EHA_RSLT_SV_BREAKEND.ALLELE

W_EHA_RSLT_SV_BREAKEND.ALLELE_CLOB

For sequencing results, this value constructs the replace tag and stores the value in the results table as per rules.

INFO.SVTYPE

-

If the INFO column of the VCF file does not have an 'SVTYPE' tag, then the mutation is considered as either an SNV or a small indel. If the INFO column has 'SVTYPE=BND', then the mutation is considered as a structural rearrangement. If the INFO column has 'SVTYPE' other than 'BND', for example, 'SVTYPE=DEL', then the mutation is considered as a large structural variant.
INFO.END

W_EHA_RSLT_NON_VARIANT.END_POSITION

W_EHA_RSLT_NOCALL.END_POSITION

W_EHA_RSLT_CONFLICT.END_POSITION

For gVCF and large SV data, the END_POSITION value is taken from the 'END=' tag present in the INFO column.

INFO.CIPOS

W_EHA_RSLT_SEQUENCING.CIPOS_START

W_EHA_RSLT_SEQUENCING.CIPOS_END

For large SV, the INFO column contains the tag 'CIPOS', which contains two values. The first value is stored in CIPOS_START and the second in CIPOS_END.

INFO.CIEND

W_EHA_RSLT_SEQUENCING.CIEND_START

W_EHA_RSLT_SEQUENCING.CIEND_END

For large SV, the INFO column contains the tag 'CIEND', which contains two values. The first value is stored in CIEND_START and the second in CIEND_END.

INFO.HOMLEN

W_EHA_RSLT_SEQUENCING.HOMLEN

For large SV, the value for the tag 'HOMLEN' present in the INFO column is stored here.

INFO.HOMSEQ

W_EHA_RSLT_SEQUENCING.HOMSEQ

For large SV, the value for the tag 'HOMSEQ' present in the INFO column is stored here.

INFO.MEINFO

W_EHA_RSLT_SEQUENCING.MEINFO

For large SV, the value for the tag 'MEINFO' present in the INFO column is stored here.

INFO.MATE_ID

W_EHA_RSLT_SV_BREAKEND.MATE_ID

For structural rearrangement data, the MATE_ID tag value present in the INFO column is stored here.

INFO.EVENT_ID

-

For structural rearrangement data, the EVENT_ID tag value present in the INFO column is stored here.

INFO.PRECISION

W_EHA_VARIANT.PERCISION

W_EHA_RSLT_SV_BREAKEND.PRECISION

For either Large SV or SV rearrangement data, if the 'IMPRECISE' tag is present in the INFO column, then 'IMPRECISE' value is populated in these columns, otherwise 'PRECISE' is populated.
FORMAT.GT

-

Gets the allele information for each sample. It is represented as '<allele1_num>/<allele2_num>'. In some cases, "|" is used instead of "/".

  • For diploid: 0|0 represents both alleles from REF.

  • 0|1 represents one allele from REF and the other from the ALT allele.

  • 0/2 represents one allele from REF and the other from the second ALT value.

  • 1/3 represents one allele from the first ALT value and the second allele from the third ALT value.

  • For haploid: only one allele number is represented.

  • '.' or './.' is specified if a call cannot be made for a sample at that locus.

FORMAT.FT

W_EHA_RSLT_SEQUENCING.GENOTYPE_FILTER

W_EHA_RSLT_NON_VARIANT.GENOTYPE_FILTER

W_EHA_RSLT_NOCALL.GENOTYPE_FILTER

W_EHA_RSLT_CONFLICT.GENOTYPE_FILTER

W_EHA_RSLT_SV_BREAKEND.GENOTYPE_FILTER

Sample genotype filter indicating if this genotype is called (the concept is similar to the FILTER field). PASS indicates that all filters have been passed. A semicolon-separated list of codes indicates filters that fail, and "." indicates that filters have not been applied. These values should be described in the meta-information in the same way as FILTERs (for strings, white space or semicolons are not permitted).

FORMAT.GQ

W_EHA_RSLT_SEQUENCING.SCORE_VAF

W_EHA_RSLT_NON_VARIANT.SCORE_VAF

W_EHA_RSLT_NOCALL.SCORE_VAF

W_EHA_RSLT_CONFLICT.SCORE_VAF

W_EHA_RSLT_SV_BREAKEND.SCORE_VAF

Genotype quality; this value is mapped to the SCORE_VAF column of the listed tables.

QUAL

W_EHA_RSLT_SEQUENCING.QUAL

W_EHA_RSLT_NON_VARIANT.QUAL

W_EHA_RSLT_CONFLICT.QUAL

W_EHA_RSLT_SV_BREAKEND.QUAL

Quality of the allele sequence. Mapped to QUAL column of VCF file.
FILTER

W_EHA_RSLT_SEQUENCING.FILTER

W_EHA_RSLT_NON_VARIANT.FILTER

W_EHA_RSLT_NOCALL.FILTER

W_EHA_RSLT_CONFLICT.FILTER

W_EHA_RSLT_SV_BREAKEND.FILTER

Filter applied to the particular record. Mapped to FILTER column in the VCF file.
FORMAT.DP

W_EHA_RSLT_SEQUENCING.TOTAL_READ_COUNT

W_EHA_RSLT_NON_VARIANT.TOTAL_READ_COUNT

W_EHA_RSLT_NOCALL.TOTAL_READ_COUNT

W_EHA_RSLT_CONFLICT.TOTAL_READ_COUNT

W_EHA_RSLT_SV_BREAKEND.TOTAL_READ_COUNT

Total number of reads that mapped to the defined allele sequence
FORMAT.AD

W_EHA_RSLT_SEQUENCING.ALLELE_READ_COUNT

W_EHA_RSLT_SEQUENCING.REFERENCE_READ_COUNT

W_EHA_RSLT_NON_VARIANT.REFERENCE_READ_COUNT

W_EHA_RSLT_NOCALL.ALLELE_READ_COUNT

W_EHA_RSLT_NOCALL.REFERENCE_READ_COUNT

W_EHA_RSLT_CONFLICT.ALLELE_READ_COUNT

W_EHA_RSLT_CONFLICT.REFERENCE_READ_COUNT

W_EHA_RSLT_SV_BREAKEND.ALLELE_READ_COUNT

The first value is mapped to REFERENCE_READ_COUNT and consecutive values are mapped to ALLELE_READ_COUNT
FORMAT.BQ

W_EHA_RSLT_SEQUENCING.RMS_BASE_QUAL

W_EHA_RSLT_NON_VARIANT.RMS_BASE_QUAL

W_EHA_RSLT_NOCALL.RMS_BASE_QUAL

W_EHA_RSLT_CONFLICT.RMS_BASE_QUAL

W_EHA_RSLT_SV_BREAKEND.RMS_BASE_QUAL

RMS base quality at this position.
FORMAT.MQ

W_EHA_RSLT_SEQUENCING.RMS_MAPPING_QUAL

W_EHA_RSLT_NON_VARIANT.RMS_MAPPING_QUAL

W_EHA_RSLT_NOCALL.RMS_MAPPING_QUAL

W_EHA_RSLT_CONFLICT.RMS_MAPPING_QUAL

W_EHA_RSLT_SV_BREAKEND.RMS_MAPPING_QUAL

RMS mapping quality at this position.
FORMAT.GQX

W_EHA_RSLT_SEQUENCING.GENOTYPE_QUAL_X

W_EHA_RSLT_NON_VARIANT.GENOTYPE_QUAL_X

W_EHA_RSLT_NOCALL.GENOTYPE_QUAL_X

W_EHA_RSLT_CONFLICT.GENOTYPE_QUAL_X

W_EHA_RSLT_SV_BREAKEND.GENOTYPE_QUAL_X

GQX - Minimum of {Genotype quality assuming variant position, Genotype quality assuming non-variant position}
FORMAT.SS

W_EHA_RSLT_SEQUENCING.SOMATIC_STATUS_WID

The W_EHA_SOMATIC_STATUS table is pre-seeded with all values during installation. For more information, see Table A-4.

The loader maps the SS value from the VCF file to the SOMATIC_STATUS_CODE column of the W_EHA_SOMATIC_STATUS table, and a foreign key is created in W_EHA_RSLT_SEQUENCING in the SOMATIC_STATUS_WID column.

FORMAT.SSC

W_EHA_RSLT_SEQUENCING.SOMATIC_SCORE

Somatic score of the variant.

4.5.4 Command-Line Argument List

Name

VCF_loader.sh - load records

Synopsis

VCF_loader.sh -help

VCF_loader.sh <...options>

Description

Validates input options and calls the loader script: for VCF and GVCF modes, load_vcf.sql#odb_rslt_gvcf_util.process_gvcf, and for NON-VAR mode, odb_nonvar_gvcf_util.process_nonvar_gvcf

Options

(*) required

-db_wallet* <VARCHAR2>

Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string, for example,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see Section 2.1, "Setting Up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES, for example, "Homo sapiens" for humans

-study_name* <VARCHAR2>

Study name defined in W_EHA_RSLT_STUDY.RESULT_STUDY_NAME

-datasource_name* <VARCHAR2>

Datasource name defined in W_EHA_DATASOURCE.DATASOURCE_NM [default: CDM]

-specimen_vendor* <VARCHAR2>

Specimen vendor - Sample vendor number of specimen with genomic result data. If CDM is referenced, this value should be defined in the W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_VENDOR_NUMBER

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-file_flg* <CHAR>

File flag (E=external|S=copy to secure data file directory) [default: E]

-custom_format <VARCHAR2>

Custom format comma delimited (format_name=column(,format_name=column)*)

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR

-data_file_path <VARCHAR2>

File system path to secure data file directory

-dbfs_store <VARCHAR2>

Database file system store

-file_type_code* <VARCHAR2>

File type code (GVCF|VCF) [default: VCF]

-load_mode <VARCHAR2>

Load mode (VCF|GVCF|NON-VAR)[default: VCF]

-validate_numbs <CHAR>

Validate all number fields before insert (Y|N) [default: N]

-alt_file_loc <VARCHAR2>

Alternate file location link, for example, ftp:location or http:location

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

-data_file_dir <VARCHAR2>

File system path to Oracle directory object

4.5.5 Examples

UNIX: with 'GVCF' file_type_code and 'GVCF' load_mode

$ sh VCF_loader.sh -db_wallet odb_user -data_file "YRI.trio.2010_03.snps.genotypes_NEW.vcf" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_vendor "vendor1" -reference_version "GRCh37.p8" -file_flg "E" -custom_format "" -preprocess_dir "" -preprocess_file "" -data_file_path "" -dbfs_store "" -file_type_code "GVCF" -load_mode "GVCF" -alt_file_loc "" -read_size ""

Windows: with 'GVCF' file_type_code and 'GVCF' load_mode

C:\> VCF_loader.bat -db_wallet odb_user -data_file "YRI.trio.2010_03.snps.genotypes_NEW.vcf" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_vendor "vendor1" -reference_version "GRCh37.p8" -file_flg "E" -custom_format "" -preprocess_dir "" -preprocess_file "" -data_file_path "" -dbfs_store "" -file_type_code "GVCF" -load_mode "GVCF" -alt_file_loc "" -read_size ""

When the file type is VCF and the load mode is VCF, the same command is used. Pass -file_type_code as VCF and -load_mode as VCF as parameters to the SH or BAT command.

4.6 MAF Sequence Data Loader

The Mutation Annotation Format (MAF) is created by TCGA. MAF files store variation data for multiple samples. The MAF file format is described here:

http://tcga-data.nci.nih.gov/tcga/dataAccessMatrix.htm

A MAF file consists of tab-delimited columns. This is the simplest of all the result files. ODB supports importing MAF versions 2.0 through 2.2.

4.6.1 Functional Description

The MAF loader currently loads only the variant records and not wild type (WT) information. For example:

  • If one of the alleles is WT while the other is variant, then the loader only records the variant allele.

  • If both the alleles are WT, then the loader does not load any of these alleles.

  • If both alleles are variants and homozygous, then it stores only one record.

  • If both alleles are mutant (MT) and heterozygous, then it stores them as two separate records.

The loader obtains the chromosome and position details of a record from the MAF file and checks if the corresponding region of that chromosome exists in the W_EHA_DNA_SOURCE table for the reference version specified as input. If it is present, it maps this W_EHA_DNA_SOURCE record as W_EHA_VARIANT.SOURCE_WID. If the region is not found (for example, the chromosome and (or) position information is invalid), the loader ignores that record and does not log it to the W_EHA_RSLT_LOG table.

The loader does not validate the accuracy of the reference nucleotides in the database. It assumes that the same version of reference mapped MAF data is loaded in to ODB. Ensure that the reference version of the results file being loaded matches that of the reference data available in ODB.

A single record in a MAF file contains data for both normal and tumor samples. The loader loads data for both these samples.

A typical MAF file contains information about multiple specimens. If one or more of the specimens do not exist in the CDM schema, then the loader skips that row and logs an error with details in W_EHA_RSLT_LOG.

4.6.2 Data Load

The execution call of the stored procedure ODB_RSLT_MAF_UTIL.process_maf() is designed in one of the script files (load_maf.sql). This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, SPECIES NAME, STUDY NAME, DATA SOURCE, SPECIMEN VENDOR, Reference Version, File Flag, Preprocess Directory, Preprocess File, Data File Path, DBFS Store, Alternate File Location (ftp/http location), and Read Size as input parameters.

It creates an external table dynamically and uploads data from the source file into it. External tables let Oracle query data that is stored outside the database in flat files. The ORACLE_LOADER driver can be used to access data stored in any format that can be loaded by the SQL*Loader. No DML can be performed on external tables but they can be used for query, join, and sort operations.
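
A simplified shape of such an external table definition is shown below. This is a sketch only: the column list and file name are illustrative and the loader builds its own definition dynamically; only the ODB_LOAD directory object follows the examples in this chapter.

create table maf_ext_example (
  hugo_symbol    varchar2(100),
  chromosome     varchar2(10),
  start_position number,
  end_position   number
)
organization external (
  type oracle_loader
  default directory ODB_LOAD
  access parameters (
    records delimited by newline
    fields terminated by 0X'09'       -- tab-delimited
    missing field values are null
  )
  location ('example.maf')
)
reject limit unlimited;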

Only one external table is created dynamically and will hold the complete result data. A global temporary table, W_EHA_MAF_SPECIMEN, is created explicitly to store the Normal and Tumor sample barcodes. There are two passes through the MAF file. One creates a W_EHA_VARIANT_STG record and collects each unique specimen number into the global temporary table. A bulk collect then calls Odb_util.GET_SPECIMEN_WID for all the specimen numbers in one statement.

Another bulk insert statement inserts the data into W_EHA_RSLT_SEQUENCING.

A select statement parses data from the external table and joins it with the W_EHA_VARIANT and W_EHA_DNA_SOURCE tables to obtain the dataset, which is then used to compute GENE_WID from the W_EHA_GENE_SEGMENT table. A variant record is matched to a gene segment when the start position of the variant record is less than or equal to the end position of the gene segment, and the end position of the variant record is greater than or equal to the start position of the gene segment; the matching GENE_WID is populated in the W_EHA_STG_SEQUENCING table.
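The position-overlap test can be pictured as follows. This is a minimal sketch, not the loader's actual statement; the W_EHA_GENE_SEGMENT column names (GENE_WID, START_POSITION, END_POSITION) are assumed for illustration.

SELECT stg.start_position,
       stg.end_position,
       seg.gene_wid
FROM   w_eha_variant_stg  stg
JOIN   w_eha_gene_segment seg
  ON   stg.start_position <= seg.end_position    -- variant starts before the segment ends
 AND   stg.end_position   >= seg.start_position; -- and ends after the segment starts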

After inserting the record into the W_EHA_VARIANT_STG table, the PROCESS_VARIANT() procedure is called, which populates the W_EHA_VARIANT table.

4.6.2.1 Data files

Each row of a MAF file carries allele information for two sample types – a tumor sample and a normal one. These two sample IDs are specified in each row of the file. The sample ID of the tumor sample is specified in the Tumor_Sample_Barcode column and that of the normal sample in the Matched_Norm_Sample_Barcode column. For a single file row, a maximum of eight records can be created in ODB, depending on heterozygosity and resemblance to the reference sequence: four for the tumor sample and four for the normal one.

Allele sequences for a deletion represent a variant and for an insertion represent a wild-type allele. If an allele sequence is the same as the Reference_Allele sequence, its information is not stored in the data bank. There is no information on NOCALL in MAF data, hence all the data is populated in the W_EHA_VARIANT and W_EHA_RSLT_SEQUENCING tables.

The loader validates the reference version, which is passed as an input parameter, against the W_EHA_VERSION table and populates the VERSION_WID in the corresponding result table.

Note:

The batch file requires an Oracle Wallet to be set up to run correctly.

Following is the table mapping of the MAF Result File:

Column Name in Result File | Table and Column Name in ODB | Description

Chromosome | W_EHA_RSLT_SEQUENCING.CHROMOSOME_WID; W_EHA_VARIANT.CHROMOSOME | This field is used with the begin position to find the correct DNA_SOURCE record and VARIANT record, or to create a VARIANT record. Three values are needed to find existing VARIANT records: the chromosome, the begin position, and the replace tag, which is a notation combining the reference and allele sequences. For novel variants, a new record is created in the W_EHA_VARIANT table with the chromosome value.

Start_Position | W_EHA_RSLT_SEQUENCING.START_POSITION; W_EHA_VARIANT_STG.START_POSITION; W_EHA_VARIANT.ABSOLUTE_POSITION | This field is used as described above. For no-call results, this is stored in the START_POSITION field (after adding 1). For novel variants, Start_Position is stored in the ABSOLUTE_POSITION column in the W_EHA_VARIANT table. For W_EHA_VARIANT.START_POSITION, Start_Position is relative to the value in the W_EHA_DNA_SOURCE.START_POSITION column.

End_Position | W_EHA_VARIANT_STG.END_POSITION | This value is used for no-call results and stored in the END_POSITION field. It is also used to calculate the relative end position, based on W_EHA_DNA_SOURCE.START_POSITION, for W_EHA_VARIANT.END_POSITION for novel variants.

Strand | W_EHA_VARIANT.STRAND; W_EHA_VARIANT_STG.STRAND | This value indicates the forward or reverse strand.

Variant_Type | W_EHA_RSLT_SEQUENCING.VARIANT_TYPE | Type of variant, including snp, insertion, or deletion. Stored in VARIANT_TYPE in the W_EHA_RSLT_SEQUENCING table.

Reference_Allele | W_EHA_VARIANT.REPLACE_TAG | This value is used for REPLACE_TAG, once for the tumor sample and once for the normal sample. It is used in conjunction with Tumor_Seq_Allele1 and Tumor_Seq_Allele2 to find existing VARIANT records for the tumor sample. Similarly, for the normal sample, the reference allele is used for REPLACE_TAG. For insertions, the reference sequence uses a "-" and for deletions the allele sequence uses "-"; this is standard notation used in most references. The loader logic is implemented in the procedure called to populate the variant table. For a deletion, this value contains the deleted sequence; for an insertion, it contains "-".

Tumor_Seq_Allele1 | W_EHA_VARIANT.REPLACE_TAG | For sequencing results, this value constructs the replace tag. A '-' value represents a deletion.

Tumor_Seq_Allele2 | W_EHA_VARIANT.REPLACE_TAG | For sequencing results, this value constructs the replace tag. A '-' value represents a deletion.

Tumor_Sample_Barcode | W_EHA_RSLT_SPECIMEN.SPECIMEN_NUMBER | This value represents the tumor sample ID. The barcode ID has the format TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID.

Matched_Norm_Sample_Barcode | W_EHA_RSLT_SEQUENCING.RESULT_SPEC_WID; W_EHA_RSLT_SEQUENCING.SPECIMEN_WID | This value represents the normal sample ID. The barcode ID has the format TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID. The complete barcode ID is a foreign key to the RSLT_SPECIMEN record.

Match_Norm_Seq_Allele1 | W_EHA_VARIANT.REPLACE_TAG | For sequencing results, this value constructs the replace tag. A '-' value represents a deletion.

Match_Norm_Seq_Allele2 | W_EHA_RSLT_SEQUENCING.ALLELE; W_EHA_VARIANT.REPLACE_TAG | For sequencing results, this value constructs the replace tag. A '-' value represents a deletion.

Score | - | This column is not functional in the MAF files and is currently not mapped.

4.6.3 Command-Line Argument List

Name

MAF_loader.sh - load records

Synopsis

MAF_loader.sh -help

MAF_loader.sh <...options>

Description

Validates input options and calls the loader script load_maf.sql#odb_rslt_maf_util.process_maf

Options

(*) required

-db_wallet* <VARCHAR2>

Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string that is,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see Section 2.1, "Setting Up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES, that is, for humans "Homo sapiens"

-study_name* <VARCHAR2>

Study name defined in W_EHA_RSLT_STUDY.RESULT_STUDY_NAME

-datasource_name* <VARCHAR2>

Datasource name defined in W_EHA_DATASOURCE.DATASOURCE_NM [default: CDM]

-specimen_vendor* <VARCHAR2>

Specimen vendor - Sample vendor number of specimen with genomic result data. If CDM is referenced, this value should be defined in the W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_VENDOR_NUMBER

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-file_flg* <CHAR>

File flag (E=external|S=copy to secure data file directory) [default: E]

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR

-data_file_path <VARCHAR2>

File system path to secure data file directory

-dbfs_store <VARCHAR2>

Database file system store

-alt_file_loc <VARCHAR2>

Alternate file location, link that is, ftp:location, http:location

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

-data_file_dir <VARCHAR2>

File system path to Oracle directory object

4.6.4 Examples

UNIX

$ sh MAF_loader.sh -db_wallet odb_user -data_file "ut_maf.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_vendor "vendor3" -reference_version "GRCh37.p8" -file_flg "E" -preprocess_dir "" -preprocess_file "" -data_file_path "" -dbfs_store "" -alt_file_loc "" -read_size ""

Windows

C:\> MAF_loader.bat -db_wallet odb_user -data_file "ut_maf.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_vendor "vendor3" -reference_version "GRCh37.p8" -file_flg "E" -preprocess_dir "" -preprocess_file "" -data_file_path "" -dbfs_store "" -alt_file_loc "" -read_size ""

4.7 RNA-Seq Loader

4.7.1 Functional Description

The TCGA RNA SEQ data file format specifications are described here:

https://wiki.nci.nih.gov/display/TCGA/RNASeq+Data+Format+Specification#RNASeqDataFormatSpecification-Datafiles

TCGA has three different types of files: exon, gene, and splice junctions. Only the exon files are measured by exact chromosome locations. The other two files are calculated estimations based upon gene locations using this exon data file. Currently, support is provided for loading the exon version data files.

Once an exon data file is loaded you can select genes, which can map to specific chromosome regions. RNA sequencing based data has a data type alias of Quantification-Exon.

An RNASeq exon quantification file is a tabular, text-based, tab-separated dataset, with a single header row stating the column names. The file consists of the following columns:

  • barcode: Identifies the sample. This column may or may not be used by the loader. For ODB v3.0, only files with this column can be loaded.

  • exon: Provides the standard chromosome token (chr1-chr22, chrX, chrY, chrM), followed by a coordinate pair and the strand indicated with +/-, for example, chr1:12227:-, chr1:12595:+

  • raw_counts: Stores raw read counts in positive floating point values or a zero, if unavailable.

  • median_length_normalized: A normalized region length calculation in positive float or zero.

  • RPKM: (Reads Per Kilobase exon Model per million mapped reads) Calculated expression intensity values in positive float or zero.

4.7.2 Data Load

The RNASeq Loader inserts data from an exon quantification file into the W_EHA_RSLT_RNA_SEQ table.

The last four columns of a TCGA exon file are populated into the table mentioned above. The EXON type file specifies a chromosome and range. This field is parsed to find the corresponding chromosome record and strand, and to separate the start and end positions. The actual value is stored in RESULT_EXON_NAME in W_EHA_RSLT_RNA_SEQ for reference.

The table also stores foreign key values for SPECIES, for the DNA reference version to which the results are mapped, and for the W_EHA_GENE table for each gene that the reference mapping associates with a record. If no such gene is found in the current reference, a value of '0' is added to the column. If the RPKM value for an input row is null (a blank, or the text 'null'), the row is skipped by the loader.

The execution call of the stored procedure ODB_RSLT_RNA_SEQ_UTIL.process_tcga_rna_seq() is defined in one of the script files (load_tcga_rna_seq.sql). This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, SPECIES NAME, STUDY NAME, and Reference Version as mandatory input parameters, and DATA SOURCE, SPECIMEN VENDOR, File Flag, Preprocess directory, Preprocess File, Data File Path, DBFS Store, Alternate file location (ftp location or http location), and Read Size as optional input parameters.

This stored procedure creates an external table dynamically and uploads data from the source file into it. External tables let Oracle query data that is stored outside the database in flat files.

Only one external table is created dynamically and holds the complete result data. There is one bulk insert statement which inserts data into the W_EHA_RSLT_RNA_SEQ table. The query uses the two inline views, one of which computes the gene WID and the other computes the start position, end position, chromosome WID, strand, row_count, median_length columns. These two inline views are then joined to populate the W_EHA_RSLT_RNA_SEQ table.

Note:

The result types, and mainly the identifiers used in the first column are different for TCGA-Exon and TCGA-Gene. Additional columns specified to hold gene result record identifiers have been created, which remain empty as these are not filled by the exon loader. This is because gene record identifiers have a different querying requirement and are therefore separated in the table from the columns that are populated with exon result-identifiers. The remaining column fields (RPKM, median length, raw count) are common for both formats.

4.7.2.1 Data File

The Windows batch file of the RNA-seq loader requires an Oracle Wallet to be set up to run correctly.

Specimen number -

A valid specimen number that is either provided by the user or retrieved from the data file (the "barcode" column). It links the result records, through the specimen, to the external datasource given in the previous parameter. If the datasource is CDM or CDM_PATIENT, this value should be present in a record of the W_EHA_SPECIMEN_PATIENT_H table under SPECIMEN_NUMBER. If the datasource is CDM_SUBJECT, this value should be present in a record of the W_EHA_SPECIMEN_SUBJECT_H table.

Note:

On Linux, it is not required to enclose Homo sapiens in double quotation marks (""). That requirement applies only to Windows.

Following is the table mapping of RNASeq exon result file:

Column Name in Result File | Table and Column Name in ODB | Description

exon | W_EHA_RSLT_RNA_SEQ.CHROMOSOME_WID; W_EHA_RSLT_RNA_SEQ.START_POSITION; W_EHA_RSLT_RNA_SEQ.END_POSITION; W_EHA_RSLT_RNA_SEQ.STRAND; W_EHA_RSLT_RNA_SEQ.RESULT_EXON_NAME | The column contains values in the following format: '<chromosome>:<absolute start position>-<absolute end position>:<strand>'. The loader parses each value and populates data in the respective fields. The chromosome value is looked up in W_EHA_CHROMOSOME for its ROW_WID value to populate CHROMOSOME_WID. The entire value is placed in RESULT_EXON_NAME.

raw_counts | W_EHA_RSLT_RNA_SEQ.RAW_COUNTS | Raw read counts; a positive floating point value or zero.

median_length_normalized | W_EHA_RSLT_RNA_SEQ.MEDIAN_LENGTH | Calculated average normalized median length of the exon region for which an RPKM count is generated. Stores a positive float or zero.

RPKM | W_EHA_RSLT_RNA_SEQ.RPKM | Reads Per Kilobase exon Model per million mapped reads. Stores a positive float or zero.

The barcode column in the file is optionally used to identify the sample when the sample number is not passed to the loader as an argument. The value from the second row (first after the header) is then taken as the specimen number.
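A minimal sketch of how the exon value could be split in Oracle SQL is shown below. This is not the loader's actual statement; the sample value and the REGEXP_SUBSTR patterns are illustrative only.

SELECT REGEXP_SUBSTR(exon, '^([^:]+):', 1, 1, NULL, 1)             AS chromosome,       -- for example, chr1
       TO_NUMBER(REGEXP_SUBSTR(exon, ':([0-9]+)-', 1, 1, NULL, 1)) AS start_position,
       TO_NUMBER(REGEXP_SUBSTR(exon, '-([0-9]+):', 1, 1, NULL, 1)) AS end_position,
       REGEXP_SUBSTR(exon, '([+-])$', 1, 1, NULL, 1)               AS strand,
       exon                                                        AS result_exon_name  -- stored as-is
FROM  (SELECT 'chr1:12227-12595:+' AS exon FROM dual);              -- hypothetical exon value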

4.7.3 Command-Line Argument List

Name

TCGA_RNA_SEQ_loader.sh - load records

Synopsis

TCGA_RNA_SEQ_loader.sh -help

TCGA_RNA_SEQ_loader.sh <...options>

Description

Validates input options and calls the loader script load_tcga_rna_seq.sql#odb_rslt_rna_seq_util.process_tcga_rna_seq

Options

(*) required

-db_wallet* <VARCHAR2> (Required unless the -db_conn/-db_user combination is used to log into the database)

Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string that is,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 1]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see Section 2.1, "Setting Up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES, that is, for humans "Homo sapiens"

-study_name* <VARCHAR2>

Study name defined in W_EHA_RSLT_STUDY.RESULT_STUDY_NAME

-datasource_name* <VARCHAR2>

Datasource name defined in W_EHA_DATASOURCE.DATASOURCE_NM [default: CDM]

-specimen_number <VARCHAR2>

Specimen number - Identification number of the specimen for which the genomic result file is being loaded. If CDM is referenced, this value should be defined in W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_NUMBER. If the specimen number argument is missing or is null (""), the value from the second row (first after the header) in the "barcode" column of the data file is used as the specimen number.

-specimen_vendor* <VARCHAR2>

Specimen vendor - Sample vendor number of the specimen with genomic result data. If CDM is referenced, this value should be defined in the W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_VENDOR_NUMBER

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-file_flg* <CHAR>

File flag (E=external|S=copy to secure data file directory) [default: E]

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR [default: NULL]

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR [default: NULL]

-data_file_path <VARCHAR2>

File system path to secure data file directory [default: NULL]

-dbfs_store <VARCHAR2>

Database file system store [default: NULL]

-alt_file_loc <VARCHAR2>

Alternate file location, link that is, ftp:location, http:location [default: NULL]

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE [default: NULL]

-data_file_dir <VARCHAR2>

File system path to Oracle directory object [default: NULL]

-file_version <VARCHAR2>

File version of a result file [default: NULL]

4.7.4 Examples

UNIX

$ sh TCGA_RNA_SEQ_loader.sh -db_wallet odb_user -data_file "summary_TCGA-AB-2803-03A-01T-0734-13.exon.quantification.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_number "RNA01" -specimen_vendor "vendor1" -reference_version "GRCh37.p8" -file_flg "E" -read_size '' -file_version '3.1.4.0'

Windows

C:\> TCGA_RNA_SEQ_loader.bat -db_wallet odb_user -data_file "summary_TCGA-AB-2803-03A-01T-0734-13.exon.quantification.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_number "RNA01" -specimen_vendor "vendor1" -reference_version "GRCh37.p8" -file_flg "E" -file_version '3.1.4.0'

4.8 File Specimen Loader and File Lineage Linker

The File Lineage Linker facilitates associating file records with each other. Each association (a File Link) is directional, with one file being the parent and the other the child; the user decides which file is the parent and which is the child when running the File Lineage Linker. Another loader, the File-Specimen Loader, is used to create file records for low-level (or other) files for which no real loader is provided in ODB. These files can then be linked to other files or to each other. This loader also permits associating files with specimens.

4.8.1 File-Specimen Loader

The File-Specimen Loader creates records in the W_EHA_FILE table, representing files not loaded by any of the loaders provided in ODB. This permits creating records for files of unsupported types, such as BAM and other low level files. These records can be then linked with other file records (such as VCF files), to provide lineage associations. The File-Specimen Loader can also associate new or existing file records with specimen records.

Note:

The File Specimen Loader can create a file record for any File Type defined in the W_EHA_FILE_TYPE table, including file types supported by "real" loaders, such as VCF, CNV, and so on. It will not, however, load the file's contents into the database. Therefore, it should not be used for files supported by the provided loaders.

The loader can be run in two different modes. In one mode, it creates a new File record and associates it with at least one specimen. In this case, provide a file name and data directory. In the other mode, an existing File record is associated with one or more specimens. In this mode, the only required arguments (besides the connection arguments) are the File URI, the Specimen Number(s), and the Vendor. The second mode is selected by using the -append_specimen 1 argument.

In both modes, at least one specimen must be specified. Specimen numbers are supplied as a delimited string (the default delimiter is a comma, but an argument lets you specify a different delimiter), and the string can contain any number of items. However, only one Specimen Vendor number can be supplied, so all specimens associated with a file in one run of this loader must share the same Specimen Vendor Number. Subsequent runs in the Append Specimen mode can be used to associate specimens with other Specimen Vendor Numbers with the same file.
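For example, if the specimen numbers themselves contain commas, a pipe character could be supplied as the delimiter. This is a hypothetical invocation; the file URI, vendor, and specimen values are placeholders:

$ sh File_specimen_linker.sh -db_wallet odb_user -file_uri "MYFILE1" -append_specimen 1 -datasource_name "CDM" -specimen_numbers "Spec 1,A|Spec 2,B" -specimen_vendor "vendor1" -delimiter_char "|"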

The command-line arguments for the File-Specimen Loader are as follows:

Name

File_specimen_linker.sh

Synopsis

File_specimen_linker.sh -help

File_specimen_linker.sh <...options>

Description

Validates input options and calls the loader script

load_file_spec.sql#odb_rslt_file_spec_util.process_file_spec

Options

(*) denotes that it is required

  • Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

    -db_wallet* <VARCHAR2>
    

    Required, unless the -db_conn/-db_user combination is used to log into the database

  • Oracle connection string that is,

    "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CO NNECT_DATA=(SID=XE)))"

    -db_conn* <VARCHAR2>
    

    Required if -db_wallet is not provided

  • ODB user name for the Database connection

    -db_user* <VARCHAR2>
    

    Required if -db_conn is provided

  • Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

    -log_level <VARCHAR2>
    
  • Print summary (1=yes|0=no) [default: 0]

    -print_summary <NUMBER>
    
  • Data file name - Oracle external table LOCATION

    -data_file* <VARCHAR2>
    

    Required if loading a new file

  • Oracle directory object: Oracle external table DIRECTORY. For information, see Section 2.1, "Setting Up a Directory Object".

    -data_directory* <VARCHAR2>
    

    Required if loading a new file

  • Datasource name defined in W_EHA_DATASOURCE.DATASOURCE_NM [default: CDM]

    -datasource_name* <VARCHAR2>
    
  • Specimen vendor: Sample vendor number of specimens with genomic result data. If CDM is referenced, this value should be defined in the W_EHA_SPECIMEN_ PATIENT_H.SPECIMEN_VENDOR_NUMBER

    -specimen_vendor* <VARCHAR2>
    
  • Specimen numbers: A delimited list of identification numbers of specimens to be associated with the newly or previously loaded file. The default delimiter is comma, if this is not acceptable (for example, if one or more Specimen Numbers contain comma(s)), an alternative single-character delimiter can be defined using the -delimiter_char <CHAR> argument.

    -specimen_numbers* <VARCHAR2>
    
  • File flag (E=external|S=copy to secure data file directory) [default: E]

    -file_flg <CHAR>
    
  • File system path to secure data file directory

    -data_file_path <VARCHAR2>
    
  • Database file system store

    -dbfs_store <VARCHAR2>
    
  • Alternate file location link that is, ftp:location, http:location

    -alt_file_loc <VARCHAR2>
    
  • File system path to Oracle directory object

    -data_file_dir <VARCHAR2>
    
  • URI (unique resource identifier) for the file

    -file_uri* <VARCHAR2>
    

    Required, if appending associated specimen to an already loaded file. Optional, if a new file is being loaded

  • The file type of the new file being loaded (the type must exist in the W_EHA_FILE_TYPE table)

    -file_type* <VARCHAR2>
    

    Mandatory, if loading a new file

  • The file type version of the new file being loaded (the version must exist in the W_EHA_FILE_TYPE table, and the file type of the record must match the value of the -file_type argument)

    -file_version <VARCHAR2>
    
  • Sets the loader mode (1=append specimen associations to an already loaded file|0=load a new file) [default: 0]

    -append_specimen <NUMBER>
    
  • Character to use as a delimiter when parsing the -specimen_numbers value [default: comma]

    -delimiter_char <CHAR>
    

Examples

  • UNIX: Loading a new file

    $ sh File_specimen_linker.sh -db_wallet odb_user -data_file "File020113.bam" -file_uri "MYFILE1" -data_directory "ODB_LOAD" -datasource_name "CDM" -specimen_numbers "Spec 1,Spec 2,Spec 3" -specimen_vendor "vendor1" -file_flg "E" -file_type "BAM" -file_version "1.4"
    
  • Windows: Loading a new file

    C:\> File_specimen_linker.bat -db_wallet odb_user -data_file "File020113.bam" -file_uri "MYFILE1" -data_directory "ODB_LOAD" -datasource_name "CDM" -specimen_numbers "Spec 1,Spec 2,Spec 3" -specimen_vendor "vendor1" -file_flg "E" -file_type "BAM" -file_version "1.4"
    
  • UNIX: Appending specimen to an already loaded file

    $ sh File_specimen_linker.sh -db_wallet odb_user -file_uri "MYFILE1" -append_specimen 1 -datasource_name "CDM" -specimen_numbers "Spec 4,Spec 5" -specimen_vendor "vendor1"
    
  • Windows: Appending specimen to an already loaded file

    C:\> File_specimen_linker.bat -db_wallet odb_user -file_uri "MYFILE1" -append_specimen 1 -datasource_name "CDM" -specimen_numbers "Spec 4,Spec 5" -specimen_vendor "vendor1"
    

4.8.2 File Lineage Linker

The File Lineage Linker creates a directional association between two files (W_EHA_FILE records), in which one file is a parent and the other a child. The user decides which file should be the child, and which the parent. The association record is created (unless it already exists) in the W_EHA_FILE_LINK table.

Normally, the files have to be associated with at least one common specimen to be linked. The File Lineage Linker verifies this before creating a link. However, this requirement can be overridden by using the -force_link 1 argument.

The File Lineage Linker requires that the parent and child files are identified by their file URIs. This ensures that no more than one file link is created each time the Linker runs.

The command-line arguments for the File-Lineage Linker are as follows:

Name

File_lineage_linker.sh

Synopsis

File_lineage_linker.sh -help

File_lineage_linker.sh <...options>

Description

Validates input options and calls the loader script load_link.sql#odb_rslt_link_util.process_link

Options

(*) denotes that it is required

  • Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

    -db_wallet* <VARCHAR2>
    

    Required, unless the -db_conn/-db_user combination is used to log into the database

  • Oracle connection string that is,

    "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CO NNECT_DATA=(SID=XE)))"

    -db_conn* <VARCHAR2>
    

    Required if -db_wallet is not provided

  • ODB user name for the Database connection

    -db_user* <VARCHAR2>
    

    Required if -db_conn is provided

  • Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

    -log_level <VARCHAR2>
    
  • Print summary (1=yes|0=no) [default: 0]

    -print_summary <NUMBER>
    
  • The File URI of the child file in the link

    -child_file_uri* <VARCHAR2>
    
  • The File URI of the parent file in the link

    -parent_file_uri* <VARCHAR2>
    
  • Allow the files to be linked even if they have no associated specimen in common (1=yes|0=no) [default: 0]

    -force_link <NUMBER>
    

Examples

  • UNIX

    $ sh File_lineage_linker.sh -db_wallet odb_user -parent_file_uri "MYFILE1" -child_file_uri "MYFILE2" -force_link 1
    
  • Windows

    C:\> File_lineage_linker.bat -db_wallet odb_user -parent_file_uri "MYFILE1" -child_file_uri "MYFILE2" -force_link 1
    

4.9 Copy Number Variation Loader

4.9.1 Functional Description

The Copy Number Variation (CNV) loader loads data from TCGA belonging to the Affymetrix Genome-Wide Human SNP Array 6.0 platform. The input file should contain columns in the following order for the loader to perform correctly:

  1. Sample

  2. Chromosome

  3. Start

  4. End

  5. Num_Probes

  6. Segment_Mean

An extract of the input CNV file is shown in the table:

Sample Chromosome Start End Num_Probes Segment_Mean
JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020 1 51598 219036 22 0.8546
JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020 1 219482 1176387 120 0.0513
JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020 1 1176449 1243413 47 0.623
JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020 1 1243440 5290540 2132 0.0167
JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020 1 5291209 5308749 6 0.6214
JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020 1 5308775 9230624 2368 -0.0261

4.9.2 Data Load

The execution call of the stored procedure odb_rslt_cnv_util.process_cnv_nbr_var() is defined in one of the script files (load_cnv.sql). This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, SPECIES NAME, STUDY, DATASOURCE NAME, SPECIMEN VENDOR, FILE FLAG (External or Secured), DBFS_STORE, DNA_VERSION, and a few other input parameters.

It creates cnv_data_!!SEQ!! as an external table and uploads data from the source file into it. This external table stores the complete result data and maps all the fields existing in the result file.

Note:

In the above external table, the !!SEQ!! string is replaced by ETL_PROC_ID at run time.

There is a single bulk insert statement which inserts records into the W_EHA_STG_COPY_NBR_VAR table.

A select statement, which parses the data from the external table, uses an inline query that retrieves the dataset of gene segment records. The query checks whether the start position in the result file is less than or equal to (the end position of the gene segment) + (the start position of the DNA source), and whether the end position in the result file is greater than or equal to (the start position of the gene segment) + (the start position of the DNA source). The datasets of the inline queries are then outer joined with the ROW_WID of the external table to look up GENE_WID, and records are inserted into the W_EHA_STG_COPY_NBR_VAR table.
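For illustration, the offset overlap test can be expressed as follows. This is a self-contained sketch using literal values, not the loader's statement; the real query joins the external table with the gene segment and DNA source inline views as described above.

SELECT CASE
         WHEN 51400 <= 500 + 51000    -- result start <= gene segment end + DNA source start
          AND 51600 >= 100 + 51000    -- result end   >= gene segment start + DNA source start
         THEN 'overlaps'
         ELSE 'no overlap'
       END AS gene_match
FROM   dual;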

The loader associates this data with the FILE_TYPE_CODE 'Genome_Wide_SNP_6' in W_EHA_FILE_TYPE table to distinguish it.

Note:

The batch file requires an Oracle Wallet to be set up to run correctly.

4.9.3 Command-Line Argument List

Name

CNV_loader.sh - load records

Synopsis

CNV_loader.sh -help

CNV_loader.sh <...options>

Description

Validates input options and calls the loader script load_cnv.sql#odb_rslt_cnv_util.process_cnv_nbr_var

Options

(*) required

-db_wallet* <VARCHAR2> (Required, unless the -db_conn/-db_user combination is used to log into the database)

Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string that is,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see Section 2.1, "Setting Up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES, that is, for humans "Homo sapiens"

-study_name* <VARCHAR2>

Study name defined in W_EHA_RSLT_STUDY.RESULT_STUDY_NAME

-datasource_name* <VARCHAR2>

Datasource name defined in W_EHA_DATASOURCE.DATASOURCE_NM [default: CDM]

-specimen_number <VARCHAR2>

Specimen number - Identification number of the specimen for which the genomic result file is being loaded. If CDM is referenced, this value should be defined in W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_NUMBER. If the specimen number is not passed on the command-line, or is passed as "", the value in the second (first after the header) row in the first column ("sample") in the file will be used as the Specimen Number.

-specimen_vendor* <VARCHAR2>

Specimen vendor - Sample vendor number of the specimen with genomic result data. If CDM is referenced, this value should be defined in the W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_VENDOR_NUMBER

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-file_flg* <CHAR>

File flag (E=external|S=copy to secure data file directory) [default: E]

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR

-data_file_path <VARCHAR2>

File system path to secure data file directory

-dbfs_store <VARCHAR2>

Database file system store

-alt_file_loc <VARCHAR2>

Alternate file location link that is, ftp:location, http:location

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

-data_file_dir <VARCHAR2>

File system path to Oracle directory object

4.9.4 Examples

UNIX

$ sh CNV_loader.sh -db_wallet odb_user -data_file "JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020.hg19.seg.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_number "JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020" -specimen_vendor "vendor1" -reference_version "GRCh37.p8" -file_flg "E" -read_size ''

Windows

C:\> CNV_loader.bat -db_wallet odb_user -data_file "JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020.hg19.seg.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_number "JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020" -specimen_vendor "vendor1" -reference_version "GRCh37.p8" -file_flg "E"

4.10 Single Channel Gene Expression Loader

4.10.1 Functional Description

The single channel gene expression loader loads gene expression data into the W_EHA_RSLT_GENE_EXP table. The loader begins with loading all the hybridization sets (consisting of intensity, call and P-value) from the input file into an intermediary staging table W_EHA_STG_GENE_EXP.

While reading from the file, it assumes that there are a maximum of 15 hybridization sets, which translates to a maximum of 45 columns. If there are fewer columns, the loader uses empty fillers in the staging table for the unavailable hybridization sets. For each record from the staging table, it verifies if the probe name exists in the W_EHA_PROBE table along with the matching version and species ID, which are passed as input parameters.

If a match is found, the loader inserts all the hybridization sets available for that record into W_EHA_RSLT_GENE_EXP table, excluding the empty fillers. If the probe name does not exist in the W_EHA_PROBE table, the loader skips that record. If the probe exists, but with non-matching version and (or) species ID, the loader logs a warning into W_EHA_RSLT_LOG table, with a message that version and (or) species do not match.

The W_EHA_RSLT_LOG table contains error records if records were not loaded successfully into the target tables. If an input row intensity value is a null value (or blank, or has the text null), then the loader skips this row.

Note:

An Oracle Wallet must be set up before the batch files can be run successfully.

4.10.2 Data Load

The gene expression loader primarily loads into the W_EHA_RSLT_GENE_EXP table. It also updates two aggregate tables, W_EHA_RSLT_GXP_HYBRID_AGG and W_EHA_RSLT_GXP_PROBE_AGG. To run the loader, use the gene_expression_loader.bat file in Windows or the gene_expression_loader.sh file in Linux.

Note:

The gene expression loader assumes that probe loader (for details, see Section 3.6, "Probe Loader") has already populated W_EHA_PROBE table with probe names corresponding to the genes.

The input file for this loader should contain normalized intensity values, with optional present or absent calls and P-values (such as the output of Affymetrix's MAS5 algorithm). The input file can contain intensity data for multiple hybridizations in a tabular format.

4.10.2.1 Assumptions for Data File

Following are the assumptions for the Gene Expression Loader data file:

  • The file is tab delimited.

  • The first row is always the header.

  • The first column is named DATA.

  • Each hybridization present in the data file should have three columns in the following order:

    • Intensity - The header value should be the Hybridization Name

    • Call

    • P-Value

  • The header of the first column for each hybridization should contain only the hybridization name. The values in this column are the hybridization intensity values.

  • The total size of the header in the data file should not be greater than 32000 characters.

4.10.2.2 Mappings for Gene Expression Loader

Following are the table mappings for gene expression loader:

Data File | W_EHA_RSLT_GENE_EXP
DATA | The value is looked up in W_EHA_PROBE; the corresponding ROW_WID is populated in W_EHA_RSLT_GENE_EXP.PROBE_WID
HYBRIDIZATION - Header | W_EHA_RSLT_GENE_EXP.HYBRIDIZATION_NAME
HYBRIDIZATION - Data Values | W_EHA_RSLT_GENE_EXP.INTENSITY
HYBRIDIZATION_Call | W_EHA_RSLT_GENE_EXP.CALL
HYBRIDIZATION_P-VALUE | W_EHA_RSLT_GENE_EXP.P_VALUE

4.10.2.3 Aggregate Tables

Two tables were introduced in version 3.0, namely W_EHA_RSLT_GXP_HYBRID_AGG and W_EHA_RSLT_GXP_PROBE_AGG. These tables are populated with calculated aggregate values for the normalized intensity column of the result file. The aggregates are: median (MEDIAN), average (AVG), minimum (MIN), maximum (MAX), standard deviation (STDDEV), and variance (VARIANCE). The data in W_EHA_RSLT_GXP_HYBRID_AGG is aggregated over hybridizations, and in W_EHA_RSLT_GXP_PROBE_AGG over probes.
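The following minimal sketch shows how such aggregates could be computed over the result table, grouped by hybridization. It is not the loader's actual statement, and the aggregate table column names are not shown; only the HYBRIDIZATION_NAME and INTENSITY columns documented in the mapping above are used.

SELECT hybridization_name,
       MEDIAN(intensity)   AS median_intensity,
       AVG(intensity)      AS avg_intensity,
       MIN(intensity)      AS min_intensity,
       MAX(intensity)      AS max_intensity,
       STDDEV(intensity)   AS stddev_intensity,
       VARIANCE(intensity) AS variance_intensity
FROM   w_eha_rslt_gene_exp
GROUP  BY hybridization_name;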

4.10.3 Command-Line Argument List

Name

single_channel_gene_expr_loader.sh - load records

Synopsis

single_channel_gene_expr_loader.sh -help

single_channel_gene_expr_loader.sh <...options>

Description

Validates input options and calls the loader script load_single_channel_gene_expr.sql#odb_rslt_single_channel_util.process_single_channel

Options

(*) required

-db_wallet* <VARCHAR2>

Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string that is,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see Section 2.1, "Setting Up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES, that is, for humans "Homo sapiens"

-study_name* <VARCHAR2>

Study name defined in W_EHA_RSLT_STUDY.RESULT_STUDY_NAME

-datasource_name* <VARCHAR2>

Datasource name defined in W_EHA_DATASOURCE.DATASOURCE_NM [default: CDM]

-specimen_number* <VARCHAR2>

Specimen number - Identification number of specimen for which the genomic result file is being loaded. If CDM is referenced, this value should be defined in W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_NUMBER

-specimen_vendor* <VARCHAR2>

Specimen vendor - Sample vendor number of specimen with genomic result data. If CDM is referenced, this value should be defined in the W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_VENDOR_NUMBER

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-file_flg* <CHAR>

File flag (E=external|S=copy to secure data file directory) [default: E]

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR

-data_file_path <VARCHAR2>

File system path to secure data file directory

-dbfs_store <VARCHAR2>

Database file system store

-alt_file_loc <VARCHAR2>

Alternate file location link that is, ftp:location, http:location

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

-data_file_dir <VARCHAR2>

File system path to Oracle directory object

4.10.4 Examples

UNIX

$ sh single_channel_gene_expr_loader.sh -db_wallet odb_user -data_file "mas5_expression_summary_part1.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_number "RNA01" -specimen_vendor "vendor1" -reference_version "GRCh37.p8" -file_flg "E" -preprocess_dir "" -preprocess_file "" -data_file_path "" -dbfs_store "" -alt_file_loc "" -read_size ""

Windows

C:\> single_channel_gene_expr_loader.bat -db_wallet odb_user -data_file "mas5_expression_summary_part1.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_number "RNA01" -specimen_vendor "vendor1" -reference_version "GRCh37.p8" -file_flg "E" -preprocess_dir "" -preprocess_file "" -data_file_path "" -dbfs_store "" -alt_file_loc "" -read_size ""

4.11 Dual Channel Loader

4.11.1 Functional Description

The dual channel loader supports Agilent 244K Custom Gene Expression G4502A-07 platform-specific Level-3 (gene level) input files from TCGA. It reads Level-3 result data, which contains gene symbols and the associated LOWESS log2-transformed ratio gene expression values, and loads it into the W_EHA_RSLT_2CHANNEL_GXP result table.

ADF data with a specific user label links the Dual Channel data comprising genes with the specific DNA Reference Version through the GENE_WID foreign key in the W_EHA_RSLT_2CHANNEL_GXP table. Currently, for each Gene Symbol or Ratio input from the file, the loader is set to generate a record for each GENE_WID value taken from W_EHA_GENE_SEGMENT reference table where the genomic coordinates of the corresponding Composite Name at least partially match the genomic coordinates of a Gene Segment in the EMBL reference.

To correctly map the genomic coordinates of the result composite genes (using the ADF file composite annotation loaded through the ADF Data Loader) to the EMBL reference genome, the genomic reference version of the loaded EMBL release must match the reference version of the ADF file. That is, the EMBL data loaded in ODB must be the same genomic release as that given in the ADF file.

The ADF file data must be loaded into ODB using the ADF data loader (and with the same ADF User Label) before loading dual channel result data with this loader.

If the Log2 ratio value for an input row is a null value (a blank, or has the text 'null'), then this row is skipped by the loader during the insert to the result table.

4.11.2 Data Load

The execution call of the stored procedure odb_rslt_dual_channel_util.process_dual_channel() is defined in one of the script files (load_dual_channel.sql). This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, STUDY, DATASOURCE NAME, SPECIMEN NAME, SPECIMEN VENDOR, SPECIES NAME, (ADF) USER LABEL, (DNA) REFERENCE VERSION, FILE FLAG (External or Secure), DBFS_STORE, DNA_VERSION, and a few other input parameters.

It creates an external table and uploads data from the source file into it. The stored procedure creates dual_channel_data_!!SEQ!! as an external table. This external table stores the complete result data. This table maps all the fields existing in the result file.

Note:

In the above external table, the !!SEQ!! string is replaced by ETL_PROC_ID at the run time.

There is a single bulk insert statement written dynamically. This statement inserts the record into the W_EHA_STG_2CHANNEL_GXP table.

A select statement, which parses the data from the external table, uses an inline query that retrieves the dataset of gene segment records. The query checks whether the start position in the result file is less than or equal to (the end position of the gene segment) + (the start position of the DNA source), and whether the end position in the result file is greater than or equal to (the start position of the gene segment) + (the start position of the DNA source). The datasets of the inline queries are then outer joined with the ROW_WID of the external table to look up the GENE_WID, and the records are populated in the W_EHA_STG_2CHANNEL_GXP table.

Note:

The batch file requires an Oracle Wallet to be set up to run correctly.

4.11.3 Command Line Argument List

Name

dual_channel_gene_expr_loader.sh - load records

Synopsis

dual_channel_gene_expr_loader.sh -help

dual_channel_gene_expr_loader.sh <...options>

Description

Validates input options and calls the loader script load_dual_channel_gene_expr.sql#odb_rslt_dual_channel_util.process_dual_channel

Options

(*) required

-db_wallet* <VARCHAR2>

Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string that is,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see Section 2.1, "Setting Up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES, that is, for humans "Homo sapiens"

-study_name* <VARCHAR2>

Study name defined in W_EHA_RSLT_STUDY.RESULT_STUDY_NAME

-datasource_name* <VARCHAR2>

Datasource name defined in W_EHA_DATASOURCE.DATASOURCE_NM [default: CDM]

-specimen_number* <VARCHAR2>

Specimen number - Identification number of specimen for which the genomic result file is being loaded. If CDM is referenced, this value should be defined in W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_NUMBER

-specimen_vendor* <VARCHAR2>

Specimen vendor - Sample vendor number of specimen with genomic result data. If CDM is referenced, this value should be defined in the W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_VENDOR_NUMBER

-control_specimen* <VARCHAR2>

Control specimen

-user_label* <VARCHAR2>

User label (W_EHA_ADF.USER_LABEL) used to identify a composite record's source ADF dataset, that is, AgilentG4502A_07_1

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-file_flg* <CHAR>

File flag (E=external|S=copy to secure data file directory) [default: E]

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR

-data_file_path <VARCHAR2>

File system path to secure data file directory

-dbfs_store <VARCHAR2>

Database file system store

-alt_file_loc <VARCHAR2>

Alternate file location link that is, ftp:location, http:location

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

-data_file_dir <VARCHAR2>

File system path to Oracle directory object

4.11.4 Examples

UNIX

$ sh dual_channel_gene_expr_loader.sh -db_wallet odb_user -data_file "dual_channel_summary.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_number "JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020" -specimen_vendor "vendor1" -control_specimen "Stratagene Univeral Reference" -user_label "AgilentG4502A_07_01" -reference_version "GRCh37.p8" -file_flg "E"

Windows

C:\> dual_channel_gene_expr_loader.bat -db_wallet odb_user -data_file "dual_channel_summary.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -study_name "STUDY1" -datasource_name "CDM" -specimen_number "JOUAL_p_TCGA_b96_SNP_N_GenomeWideSNP_6_A01_748020" -specimen_vendor "vendor1" -control_specimen "Stratagene Univeral Reference" -user_label "AgilentG4502A_07_01" -reference_version "GRCh37.p8" -file_flg "E"

4.12 Quality Control Metadata Loader

The Quality Control Metadata Loader captures Specimen and Analysis metadata and loads it into ODB. Although ODB provides standard %_QLFR suffixed tables that can accommodate extensible metadata attributes as name-value pairs, it does not support range queries on them. To provide such a feature, a table (W_EHA_QUALIFIER) is created in ODB, and two additional columns, which store numeric and date metadata values, are appended to this table and to the QLFR-suffixed tables. These columns are populated only for tags that represent dates or numbers. The metadata input file is a standard CSV file, created by the user. A detailed description of this file is provided below.

4.12.1 Functional Description

Two new tables provide context and improve qualifier search and display:

  • W_EHA_QLFR_CATEGORY: Stores qualifier categories, which group qualifiers into distinct functional areas. Some category examples are:

    • Run

    • Analysis

    • Analysis Component

    • Loader

    • Sample Preparation

    • Specimen

  • W_EHA_QUALIFIER: Describes the qualifiers used for all new %_QLFR tables. The logical key is a combination of the table name, the qualifier tag name, and the qualifier category (numeric, date, or character string). The table contains foreign keys to W_EHA_UNIT_OF_MEASURE, providing the option to name a unit of measure for values, and to W_EHA_QLFR_CATEGORY to store user-defined categories. W_EHA_QLFR_TABLE links to this table and stores the names of the %_QLFR suffixed tables that use a qualifier.

Each record loaded into the QUALIFIER table always has a QLFR_CHAR_VALUE column populated with a reported value. Additionally, all numeric values are loaded into QLFR_NUMB_VALUE column. All values for tag type date are loaded into QLFR_DATE_VALUE column. Both date and numeric values can be queried and sorted by range. DISPLAY_ORDER and DISPLAY_NAME attributes provide additional means to improve metadata reports. This QUALIFIER table also includes fields to support units of measure. PREFERRED_UNIT is a flag that indicates whether a particular unit is a preferred unit. The measurements that are not associated with preferred units can be converted into preferred units using translation rules defined in the W_EHA_QLFR_TRANSLATION table.
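Because numeric and date values are stored in typed columns, range queries become straightforward. The following is a minimal sketch, assuming the QLFR_NUMB_VALUE and QLFR_DATE_VALUE columns reside on the %_QLFR result tables as described above; the range bounds are illustrative only.

-- Numeric range, for example concentration values between 0.005 and 0.01
SELECT qlfr_char_value, qlfr_numb_value
FROM   w_eha_rslt_spec_qlfr
WHERE  qlfr_numb_value BETWEEN 0.005 AND 0.01;

-- Date range
SELECT qlfr_char_value, qlfr_date_value
FROM   w_eha_rslt_spec_qlfr
WHERE  qlfr_date_value >= DATE '2013-08-01'
  AND  qlfr_date_value <  DATE '2013-09-01';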

4.12.2 Data Load

The QC Metadata loader takes one or more CSV files as input and loads them into the following tables:

  • w_eha_rslt_file_spec_qlfr

  • w_eha_rslt_spec_qlfr

  • w_eha_qualifier

  • w_eha_qlfr_table

  • w_eha_qlfr_category

  • w_eha_unit_of_measure

The execution call of the stored procedure ODB_RSLT_METADATA_UTIL.process_metadata() is defined in one of the script files (load_metadata.sql). This stored procedure accepts DATA FILE LIST, ORACLE DIRECTORY OBJECT, DATA SOURCE, SPECIMEN VENDOR, FILE FLAG, DATA FILE PATH, DBFS STORE, DATE FORMAT, and READ SIZE as input parameters.

It creates an external table dynamically and uploads data from each source file in the file list into it. This lets Oracle query data that is sourced from the flat files outside the database. The procedure processes the external metadata and inserts it into a staging table, W_EHA_QUALIFIER_STG. It then checks for the presence of each qualifier in the W_EHA_QUALIFIER table before inserting new qualifiers into it. A specimen lookup procedure is called and the process, based on the value of the TABLE_NAME input field, adds records to W_EHA_RSLT_SPEC_QLFR and W_EHA_RSLT_FILE_SPEC_QLFR, linking the qualifier value to the result data and file.
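The "insert only new qualifiers" step could look like the following minimal sketch. It is not the loader's actual statement: the column list is reduced to a single assumed QLFR_TAG column, whereas the real logical key also includes the table name and qualifier category.

INSERT INTO w_eha_qualifier (qlfr_tag)       -- column list simplified for illustration
SELECT DISTINCT stg.qlfr_tag
FROM   w_eha_qualifier_stg stg
WHERE  NOT EXISTS (
         SELECT 1
         FROM   w_eha_qualifier q
         WHERE  q.qlfr_tag = stg.qlfr_tag);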

4.12.2.1 Data File

Each metadata file is a comma separated text file with the following fields:

Field Header | Description
TABLE_NAME | Target ODB qualifier table; can be W_EHA_RSLT_SPEC_QLFR or W_EHA_RSLT_FILE_SPEC_QLFR
QLFR_CATEGORY_NAME | Qualifier category type
QLFR_TAG | Qualifier tag type name
DISPLAY_NAME | Display name of the tag type
DISPLAY_ORDER | Order of display
DATA_TYPE | Value data type
UOM_NAME | Unit of measure type
PREFERRED_UNIT | -
SPECIMEN_NUMBER | Specimen ID of the qualifier
SPECIMEN_VENDOR_NUMBER | Specimen vendor ID
SPECIMEN_DATA_SOURCE | Data source for specimen lookup
FILE_URI | Unique identifier of the specimen result file
QLFR_CHAR_VALUE | Stores character string qualifier values
QLFR_NUMB_VALUE | Stores numeric qualifier values
QLFR_DATE_VALUE | Stores date-formatted qualifier values

The following table lists the sample content of a metadata file.

Table 4-3 Sample Content

(Each row is shown on a single line; "-" indicates an empty field.)

TABLE_NAME | QLFR_CATEGORY_NAME | QLFR_TAG | DISPLAY_NAME | DISPLAY_ORDER | DATA_TYPE | UOM_NAME | PREFERRED_UNIT | SPECIMEN_NUMBER | SPECIMEN_VENDOR_NUMBER | SPECIMEN_DATA_SOURCE | FILE_URI | QLFR_CHAR_VALUE | QLFR_NUMB_VALUE | QLFR_DATE_VALUE
W_EHA_RSLT_SPEC_QLFR | SAMPLE_PREP | COLLECTION_DATE | - | - | DATE | - | - | TCGA-02-0075-10A-01W | vendor3 | CDM_PATIENT | - | 28-Aug-13 | - | 8/28/2013 14:22
W_EHA_RSLT_SPEC_QLFR | SAMPLE_PREP | COLLECTION_DATE | - | - | DATE | - | - | TCGA-08-0389-01A-01W | vendor3 | CDM_PATIENT | - | 28-Aug-13 | - | 8/28/2013 14:22
W_EHA_RSLT_SPEC_QLFR | SAMPLE_PREP | COLLECTION_DATE | - | - | DATE | - | - | TCGA-08-0389-11A-01W | vendor3 | CDM_PATIENT | - | 28-Aug-13 | - | 8/28/2013 14:22
W_EHA_RSLT_SPEC_QLFR | SAMPLE_PREP | DNA_CONC (MG/ML) | DNA Concentration | 1 | NUMBER | mg/ml | - | TCGA-02-0007-01A-01W | vendor3 | CDM_PATIENT | - | .006 mg/ml | 0.006 | -
W_EHA_RSLT_SPEC_QLFR | SAMPLE_PREP | DNA_CONC (MG/ML) | DNA Concentration | 1 | NUMBER | mg/ml | - | TCGA-02-0007-10A-01W | vendor3 | CDM_PATIENT | - | .007 mg/ml | 0.007 | -
W_EHA_RSLT_SPEC_QLFR | SAMPLE_PREP | DNA_CONC (MG/ML) | DNA Concentration | 1 | NUMBER | mg/ml | - | TCGA-02-0028-01A-01W | vendor3 | CDM_PATIENT | - | .008 mg/ml | 0.008 | -
W_EHA_RSLT_SPEC_QLFR | SAMPLE_PREP | DNA_CONC (MG/ML) | DNA Concentration | 1 | NUMBER | mg/ml | - | TCGA-02-0028-10A-01W | vendor3 | CDM_PATIENT | - | .009 mg/ml | 0.009 | -
W_EHA_RSLT_FILE_SPEC_QLFR | RUN | RUNID | RunID | 1 | CHARACTER | - | - | TCGA-08-0390-01A-01W | vendor3 | CDM_PATIENT | ABCDEFMAF1 | 120126_SN316_0202_19 | - | -
W_EHA_RSLT_FILE_SPEC_QLFR | RUN | RUNID | RunID | 1 | CHARACTER | - | - | TCGA-08-0390-11A-01W | vendor3 | CDM_PATIENT | ABCDEFMAF1 | 120126_SN316_0202_20 | - | -
W_EHA_RSLT_FILE_SPEC_QLFR | RUN | RUNID | RunID | 1 | CHARACTER | - | - | TCGA-08-0390-01A-01W | vendor3 | CDM_PATIENT | ABCDEFMAF1 | 120126_SN316_0202_21 | - | -
W_EHA_RSLT_FILE_SPEC_QLFR | RUN | RUNID | RunID | 1 | CHARACTER | - | - | TCGA-08-0390-10A-01W | vendor3 | CDM_PATIENT | ABCDEFMAF1 | 120126_SN316_0202_22 | - | -


4.12.3 Command-Line Argument

Name

METADATA_loader.sh - load records

Synopsis

METADATA_loader.sh -help

METADATA_loader.sh <...options>

Description

Validates input options and calls the loader script load_metadata.sql#odb_rslt_metadata_util.process_metadata

Options

(*) required

-db_wallet* <VARCHAR2> (required unless the -db_conn/-db_user combination is used to log into the database)

Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string that is,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CO NNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file_list* <VARCHAR2>,<VARCHAR2>…

A comma separated list of data file names - each an Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see Section 2.1, "Setting Up a Directory Object"

-datasource_name* <VARCHAR2>

Datasource name defined in W_EHA_DATASOURCE.DATASOURCE_NM [default: CDM]

-specimen_vendor* <VARCHAR2>

Specimen vendor - Sample vendor number of the specimen with genomic result data. If CDM is referenced, this value should be defined in the W_EHA_SPECIMEN_PATIENT_H.SPECIMEN_VENDOR_NUMBER column.

-file_flg* <CHAR>

File flag (E=external|S=copy to secure data file directory) [default: E]

-data_file_path <VARCHAR2>

File system path to secure data file directory [default: NULL]

-dbfs_store <VARCHAR2>

Database file system store [default: NULL]

-date_format <VARCHAR2>

Format of the date data type file input field [default: 'YYYY-MM-DD HH24:MI:SS']

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE [default: NULL]

-data_file_dir <VARCHAR2>

File system path to Oracle directory object [default: NULL]

4.12.4 Examples

UNIX

$ sh METADATA_loader.sh -db_wallet odb_user -data_file_list "FILE_NOV_28_2013_04.csv,FILE_AUG_28_2013_04.csv" -data_directory "ODB_LOAD" -datasource_name "CDM_PATIENT" -specimen_vendor "vendor1" -file_flg "E" -date_format "mm/dd/yyyy HH24:MI" -data_file_path "" -dbfs_store "" -read_size ""

Windows

C:\> METADATA_loader.bat -db_wallet odb_user -data_file_list "FILE_NOV_28_2013_04.csv,FILE_AUG_28_2013_04.csv" -data_directory "ODB_LOAD" -datasource_name "CDM" -specimen_vendor "vendor1" -file_flg "E" -date_format "mm/dd/yyyy HH24:MI" -data_file_path "" -dbfs_store "" -read_size ""

4.13 Typical Errors Associated with Result Loaders

The following sections describe errors that have been observed while running the various ODB result loaders, grouped by loader type. A runtime error common to several loaders, ORA-01460, is covered in Section 4.13.10.

4.13.1 Errors Relevant to Sequencing Loads

For each loader, a VARCHAR2(100) variable is defined. This variable is set before each SQL call to record the current context. It is then used in standard error logging, where the RECORD_DETAIL column of the W_EHA_RSLT_LOG table is populated with a simple context error message.

Some common context messages are:

  • Generating ETL PROC ID

  • Verifying result file type and creating result file record

  • Verifying Species Name

  • Verifying Study Name

  • Processing Variant Staging records

  • Processing variant records

  • Processing result records

  • Dropping external tables

Examples of common error log records generated by the CGI, MAF, and VCF loaders are shown in Table 4-4.

Table 4-4 Errors Generated while Loading Sequencing Files (CGI masterVar file used as example)

Column Name | Description | Examples
ROW_WID | Record identifier | -
RESULT_TYPE_NAME | Result type name | For CGI - CGI masterVar
SPECIES_NAME | Species name | Homo sapiens
SPECIMEN_NUMBER | Specimen number parsed from the comments section of the file | GS00706-DNA_C01
SPECIMEN_VENDOR_NUMBER | Name of the vendor passed as a parameter with the batch file | For CGI: CGI
DATASOURCE_NM | Data source name | CDM
ERROR_DESC | Error message | ORA-01403 NO DATA FOUND
RECORD_DETAIL | Context in which the process failed | Verifying Study Name
ETL_PROC_WID | Identifier of each load | -
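
Records in this log can be reviewed with an ordinary SQL query. The following is a minimal sketch that assumes only the W_EHA_RSLT_LOG column names shown in Table 4-4; the bind variable for ETL_PROC_WID is a placeholder for the identifier of the load being investigated.

SELECT etl_proc_wid,
       result_type_name,
       specimen_number,
       error_desc,          -- Oracle error message
       record_detail        -- context in which the load failed
  FROM w_eha_rslt_log
 WHERE etl_proc_wid = :etl_proc_wid
 ORDER BY row_wid;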


4.13.2 VCF Loader Errors

The VCF or gVCF loader processes files in multiple passes. The loader creates 3 external tables.

  • If the process fails while processing the specimen specification table, an error 'Processing external specimen table for specimen headers' is logged in the w_eha_rslt_log table.

  • If the process fails while creating the metadata external table, an error 'Processing external table for metadata' is logged in the result log table.

  • If the process fails while creating qualifier records, an error 'Processing metadata records(w_eha_file_load_qlfr)' is logged in the result log table.

  • If the process fails while retrieving specimens from the external table, an error 'Retrieving specimen numbers from external specimen table' is logged in the result log table.

  • If the loader attempts to load a file that has more than 986 samples, the error 'File loaded cannot have more than 986 specimens' is logged in the result log table.

  • If the process fails while executing variant staging records, an error 'Processing variant staging records (w_eha_variant_stg, w_eha_variant_x_stg)' is logged into the result log table.

  • If the process fails while populating variant table from variant staging table, an error 'Processing variant records(w_eha_variant, w_eha_variant_log)' is logged in the result log table.

  • If the process fails while populating variant result records, an error 'Processing result sequencing records (w_eha_stg_sequencing, w_eha_stg_sequencing_x, w_eha_stg_struct_var)' is logged in the result log table. If it fails while processing non-variant records, an error 'Processing result sequencing records (w_eha_stg_struct_var, w_eha_stg_sv_breakend, w_eha_stg_nocall, w_eha_stg_non_variant)' is logged in the result log table.

  • If the process fails while populating conflict records, then an error 'Processing result sequencing records(w_eha_stg_conflict)' is logged in the result log table.

  • If the process fails while processing the xref records, an error 'Processing variant xref records (w_eha_variant_xref)' is logged. If the process fails while computing the nocall collapsing function, an error 'Collapsing result nocall staging records (w_eha_rslt_nocall)' is logged in the result log table.

  • If there are any non-standard data type formats in the file, for example a character where the loader expects a number for the mapped data type, an 'Invalid number' error is generated by the loader. To find the exact location of the error in the file, use the '-validate_numbs Y' parameter of the VCF loader. This parameter checks whether a character value appears instead of the expected numeric value and outputs the line number and sample order number of the offending value.

  • Other possible errors are: 'Generating etl process id', 'Generating enterprise id', 'Verifying result file type ({0}, {1}) and processing result file record(w_eha_file)', 'Verifying {0} reference version={1}', 'Verifying result type', 'Verifying datasource name', 'Verifying species name', 'Verifying study name', 'Parsing global flex fields', 'Retrieving specimen ids', 'Getting flanking offset', 'Processing external data table' and 'Dropping external tables'.

4.13.3 CGI Loader Errors

If the loader is unable to verify the result file type and version, the error message 'verifying result file type ({type name}, {version}) and processing result file record(w_eha_rslt_file)' is logged in the result log table.

4.13.4 MAF Loader Errors

Only one external table is created for MAF result data. If the process fails at this step, an error 'Processing external data table' is logged into the w_eha_rslt_log table. The first bulk insert statement creates variant staging and specimen staging records. If the process fails at this stage, an error 'Processing variant staging and specimen staging records (w_eha_variant_stg, w_eha_maf_specimen, w_eha_maf_specimen_log)' is logged into the result log table.

If the process fails while retrieving the specimen ID for a record from the global temporary table, an error 'Processing list of specimens' is logged into the result log table. If the error occurs while processing variant staging records (which populates the w_eha_variant table), an error 'Processing variant records (w_eha_variant, w_eha_variant_log)' is logged into the result log table.

If the process fails while executing the bulk insert statement which populates the target result table, an error 'Processing result records (w_eha_stg_sequencing, w_eha_stg_sequencing_x)' is logged into the result log table.

Other possible errors are: 'Generating etl process id', 'Generating enterprise id', 'Verifying result file type ({0}, {1}) and processing result file record(w_eha_file)', 'Verifying {0} reference version={1}', 'Verifying result type', 'Verifying datasource name', 'Verifying species name', 'Verifying study name', 'Getting flanking offset', 'Processing external data table' and 'Dropping external tables'.

4.13.5 Single Channel Gene Expression Loader Errors

If the process fails while retrieving the DNA version ID, specimen ID, datasource name, or study ID, an error is logged into the result log table and displayed in the error summary at the end. If the process fails while processing the hybridization header external table, the error 'Processing hybridization header table' is logged. If the process fails while retrieving the hybridization name from the header table, the error 'Retrieving hybridization name from header table' is logged.

As with the VCF loader, expression data is processed in batches of 45 at a time. If the process fails at this stage, an error 'Processing data table for the set of gene expression' is logged. If the process fails while dropping the expression and hybridization external tables, the error messages 'Dropping expression data table' and 'Dropping hybridization table', respectively, are logged.

4.13.5.1 Missing Probe Link Issue

When the probe loader is re-run to load probes that have the same probe names as existing probes, the loader updates the existing probes to point to the new DNA reference and probe versions supplied to the loader.

When the gene expression loader is subsequently run, a result record is inserted into the result table only when the probe name, DNA reference version, and species match an entry in the probe table.

This means that if the gene expression loader is run with a reference version parameter (say, 'GRCH37.P7') while the corresponding probe records point to another reference version (say, 'GRCH37.P8'), the loader logs a warning message (for example, 'Version and/or species id are not matching for the probe=1554103_at') and does not load the record.

To load result records pointing to multiple reference versions, load the gene expression result files for a particular reference version BEFORE re-running the probe loader to update the probes in the probe table to a new reference version. Then load the next set of gene expression result files pointing to the new version.

When existing probes are refreshed with a new probe version, any existing gene expression result records no longer have a valid probe reference key. This issue can be solved by reloading such result files to create new result records with reference keys to the refreshed probes.

Alternatively, if Flashback Archive is enabled for the database, any SQL query that retrieves probe annotation for such result records can use a flashback query with the "AS OF" clause and the creation date of the gene expression result.
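
For example, a flashback query along the following lines retrieves the probe annotation as it existed when the gene expression result was created. This is only a sketch: the probe table and column names (w_eha_probe, probe_name) and the timestamp are assumptions that must be adapted to your schema and to the creation date of the result record; the probe name is taken from the warning message above.

SELECT *
  FROM w_eha_probe AS OF TIMESTAMP
       TO_TIMESTAMP('2013-08-28 14:22:00', 'YYYY-MM-DD HH24:MI:SS')  -- creation date of the gene expression result
 WHERE probe_name = '1554103_at';                                    -- probe named in the warning message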

4.13.6 Dual Channel Gene Expression Loader Errors

The Dual Channel loader verifies the existence of and looks up a W_EHA_ADF record with the relevant User Label. If the record is not found, the error is 'Getting the ROW_WID of the relevant W_EHA_ADF record, and checking its contents'. Subsequently, the ADF record is verified to match the Species ('The Species of the W_EHA_ADF record with user_label=… does not match the Species Name argument' error if it does not), and the DNA Reference Version ('The DNA reference version of the W_EHA_ADF record with user_label=… does not match the version argument', if it does not).

If the process of loading the data from the file into the external table fails, the error is 'Processing external data table'.

If processing the data from the external table and insertion into the result table fails, the error is 'Processing result records (w_eha_stg_2channel_gxp)'.

Other possible errors are: 'Verifying DNA reference version', 'Verifying result type', 'Verifying datasource name', 'Verifying species name', 'Verifying study name', 'Getting flanking offset', 'Retrieving specimen id', and 'Dropping external tables'.

4.13.7 RNA-seq Loader Errors

Only one external table is created for RNA Seq result data. If the process fails at this step, an error 'Processing external table' is logged into the log table.

When the external table processing for the entire RNA seq record set fails, an error 'Processing result records (w_eha_stg_rna_seq)' is logged.

Other possible errors are: 'Verifying DNA reference version', 'Verifying result type', 'Verifying datasource name', 'Verifying species name', 'Verifying study name', 'Getting flanking offset', 'Retrieving specimen number from external table', and 'Retrieving specimen id'.

4.13.8 Copy Number Variation Loader Errors

Only one external table is created for CNV result data. If the process fails while creating the external table, the error 'Processing external data table' is logged in the W_EHA_RSLT_LOG table. When the external table processing for the entire CNV result record set fails, an error 'Processing results records (w_eha_stg_copy_nbr_var)' is logged. If the process fails while dropping the CNV external table, an error 'Dropping external table' is logged in the W_EHA_RSLT_LOG table.

Other possible errors are: 'Verifying DNA reference version', 'Verifying result type', 'Verifying datasource name', 'Verifying species name', 'Verifying study name', 'Getting flanking offset', 'Retrieving specimen number from external table', and 'Retrieving specimen id'.

Note:

For the above loaders, the REC_DETAILS column of the W_EHA_RSLT_ERR_LOG table only describes the context in which the process failed.

4.13.9 File Lineage Linker Errors

The File Lineage Linker can fail because either the parent or child file URI is not provided or is invalid (there is no file record in W_EHA_FILE with this FILE_URI). The Linker also fails if there are no common associated specimens for the two files being linked. However, if the "-force_link 1" argument is used, this error becomes a warning and the files are linked.
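
Before running the linker, it can help to confirm that both file records exist. The following is a minimal sketch; the two URI values are placeholders for the parent and child FILE_URI arguments passed to the linker.

SELECT file_uri
  FROM w_eha_file
 WHERE file_uri IN ('PARENT_FILE_URI', 'CHILD_FILE_URI');  -- replace with the actual URIs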

4.13.10 Loader Runtime Error: ORA-01460 Unimplemented or Unreasonable Conversion Requested

Errors have been observed while running various ODB loaders. The loader run aborts prematurely with the following error message: ORA-01460 unimplemented or unreasonable conversion requested.

Oracle recommends applying an RDBMS patch to the TRC database that fixes this bug. See Oracle Support Bug 13099577 (ORA-1460 WHEN PARALLEL QUERY SERVERS ARE USED) available here https://mosemp.us.oracle.com/epmos/faces/BugDisplay?id=13099577 for details.

4.14 Collecting Oracle Optimizer Statistics

Oracle statistics are a collection of data about database objects, such as tables and indexes, that the Oracle optimizer requires to estimate the most efficient query execution plan. Missing or stale statistics can severely degrade query performance.

Oracle recommends gathering table and index statistics after a significant amount of data is loaded into a table. Statistics should be gathered after large bulk loads, most notably after reference tables are populated, but also after the initial result loads. Later, when a batch size becomes relatively small compared to the size of the already loaded data, statistics need not be gathered after each load. Instead, Oracle recommends gathering statistics on a weekly schedule. Statistics should be collected when major loading procedures are not running.

To collect statistics, connect to the database as the ODB schema owner using SQL*Plus and execute the following command, substituting your ODB schema name for 'ODB':

exec dbms_stats.gather_schema_stats('ODB', cascade=>true, estimate_percent=>dbms_stats.auto_sample_size);
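
In addition to schema-level statistics, statistics can be gathered for an individual table immediately after a large bulk load. The following is a minimal sketch, run as the ODB schema owner; w_eha_variant is used only as an illustration of a table that the sequencing loaders populate heavily.

-- Gather statistics (including index statistics) for a single heavily loaded table
exec dbms_stats.gather_table_stats(ownname => user, tabname => 'W_EHA_VARIANT', cascade => true, estimate_percent => dbms_stats.auto_sample_size);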