Oracle® Health Sciences Omics Data Bank Programmer's Guide
Release 3.0.2.1

E35680-12

3 Loaders for Reference Data

This chapter contains the following topics:

Note:

As the Reference part of the ODB model changes, the Reference Data Loaders have to adapt to these changes. Therefore, both should be updated together.

3.1 Ensembl and SwissProt Loaders

3.1.1 Installing the Loaders

Following are the prerequisites for installing the Ensembl and SwissProt reference loaders:

  1. You must have an Oracle database instance with ODB installed in a schema, and you must know the schema name and password. Running the loaders can be simplified by creating an Oracle Wallet with these credentials. Note that using an Oracle Wallet from the Java loaders differs from using it with SQL scripts: two additional arguments are required, both OS paths, one pointing to the Oracle Home (the directory where the tnsnames.ora file resides) and the other to the directory where the wallet was created.

    Note:

    Ensembl and SwissProt loaders are written in Java.

    Note:

    Since version 10.2, the JDBC Thin driver has the ability to read a tnsnames.ora file using the oracle.net.tns_admin property. The JDBC Thin driver, however, cannot access additional TNS content pointed to by an IFILE clause residing inside tnsnames.ora. This is a JDBC Thin driver-specific issue and can occur on any platform.

    This limitation applies to JDBC 10.2.0.4 and later (Release 10.2 and later).

  2. Java Runtime 1.7 or higher must be installed and should be the default on the machine (this can be verified by running the java -version command from the command prompt).

  3. Copy the Reference Loader folder into a directory of your choice. The program should be run from the directory it is installed in. If you are using shell scripts on Unix/Linux, they must be made executable (for example, chmod +x SwissProt.sh).

3.1.2 Files to Load

Following is a list of files to be loaded:

  1. The Ensembl multi-gene EMBL files can be downloaded from ftp://ftp.ensembl.org/pub/release-74/embl/homo_sapiens.

    The link above may not reflect the most recent version of the multi-gene files available. Oracle recommends that you use the latest release available from Ensembl. You can navigate to it from the following web page: http://www.ensembl.org/info/data/ftp/index.html (select the EMBL format).

    The files are organized by chromosome. There are also some configuration and patch files. At the very least, you must load all chromosome files to cover the entire Human genome. The files are gzipped and can be loaded without being extracted.

    Prior to version 68, Ensembl EMBL files were organized differently: the data was divided into segments, each approximately 100,000 base pairs in length. Starting with version 68, the sequence of an entire chromosome comes in one section. This initially presented memory problems for the EMBL loader, which were largely overcome in ODB 3.0, so the Human EMBL files can be loaded with the standard Java Virtual Machine maximum heap allocation of 1 GB. However, this is not guaranteed for other species. If an out-of-memory error occurs when loading an EMBL file, repeat the load with a higher JVM maximum heap allocation, as described below; this may require a computer with Linux or a 64-bit Windows operating system and at least 4 GB of RAM (Oracle recommends 8 GB).

  2. The SwissProt file (a single file) can be downloaded from http://www.ebi.ac.uk/uniprot/database/download.html.

  3. Get the (UniProtKB/SwissProt) Flat File.

  4. Both EMBL and SwissProt files can be loaded in the compressed form, without extracting the contents, just as they are downloaded. The EMBL and SwissProt Loader can handle GZIP archives (identified by the .gzip extension) as well as some ZIP archives (a ZIP archive must contain only one file to be handled correctly, and is identified by the .zip extension).

3.1.3 Loading the Data

Before you begin:

  1. As the Loader runs, it logs some information in the gdm.log file. The file is always appended to and keeps growing. You may occasionally want to delete it and start from scratch the next time you run the Loader. Some of the information logged can be very useful for investigative purposes. Oracle recommends that you check the log when you have a problem. The EMBL/SwissProt loader also logs into the W_EHA_RSLT_LOG table, like the ODB SQL loaders. However, it logs into the database only while connected to it, that is, a connection failure is not logged in the database.

  2. The order of loading EMBL and SwissProt files is not important. The scenario outlined below is just an example. Both DNA and Protein reference data are now versioned. Versions stored in the W_EHA_VERSION table are used. For DNA sources and all reference data that are linked to them (genes and so on) the version must be of type DNA. For Protein Info records and all the reference data linked to them the version must be of type PROTEIN. Starting from version 3.0, a Protein record is loaded only if there is no identical record already loaded with another version. Each part of a SwissProt file is date-stamped with a release date, and this is used to determine if a record in a new file has changed from an already loaded version or not.

    The EMBL Loader accepts a version argument, which must match an existing version of type DNA. If the version is not found, a prompt lets you use the provided version label; on confirmation, the version is created. If no version argument is provided, the loader enters the full interactive mode, letting you select an existing version from a list or create a new one. The same is true for the SwissProt loader, except that the version is of type PROTEIN.

    Version labels are not case-sensitive and are stored in uppercase.

    Note:

    If the version label is not provided, or the provided version does not exist, the EMBL or SwissProt loader prompts for input and does not continue until you respond. Therefore, to run the loader in non-interactive mode, create a version of the correct type beforehand and ensure it is passed to the loader correctly. You can also prevent the loader from waiting for user input when an incorrect version is passed to it by appending >outlog <empty.txt to the command line and creating an empty text file named empty.txt. In that case, the loader fails if the version does not exist.
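
    For example, before a non-interactive run you can confirm from SQL*Plus that the version label already exists. This is a minimal sketch that uses only the W_EHA_VERSION.VERSION_LABEL column documented in this guide; the loader additionally checks that the version is of the correct type (DNA or PROTEIN).

    -- Returns one row if the version label has already been created
    -- (version labels are stored in uppercase).
    SELECT version_label FROM w_eha_version WHERE version_label = UPPER('GRCh37.p8');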

Perform the following steps to load files using the Linux or Windows shell scripts. A native Java command-line interface is also available; it is described in Section 3.1.4, "Running the Embl/Swissprot Loader with Named Command-Line Arguments".

  1. Since the SwissProt file is a single file, it can be loaded in under an hour (Human proteins only) or in a day (all species). To load the SwissProt file, run SwissProt.bat (or SwissProt.sh on Linux). When SwissProt.bat is executed, you can optionally specify a Species List file, which restricts the load to protein information for the organism(s) you want.

    The format of the file is simple: enter the species' primary (Latin) names, one species per line. A file for the human genome only, Species.dat, is included in the distribution in the main loader directory. If no species list file is specified (the -protFile option in the named-argument interface), ALL proteins for ALL species are loaded, which takes much longer.

    If an Oracle Wallet is set up, SwissProt.bat/.sh can use the credentials stored in the Wallet to connect to the schema; otherwise, it prompts for a password. If an Oracle Wallet is set up, pass the following parameters to run SwissProt.bat/.sh:

    1. Username - when an Oracle Wallet is set up, enter ""

    2. Url — instance alias for which the Wallet credential was created. If you start SQL*Plus as sqlplus /@DB001Wallet, then this value must be DB001Wallet.

    3. Schema name

    4. Directory location of the tnsnames.ora file

    5. Path to Wallet

    6. Path and name of the species list file. This is an optional parameter. If you do not have this list, enter ""

    7. Complete path and name of the data file

    8. Reference version of type 'PROTEIN' to use (omit to enter the interactive mode):

      Following is an example of how the SwissProt.bat is to be run if an Oracle Wallet is set up and a Species list file is not present or not used:

      C:\>swissProt.bat "" DB001Wallet trc_gdm C:\ora11g\product\11.2.0\dbhome_2\NETWORK\ADMIN D:\wallets "" SwissProt.dat "VERSIONP1"

      Example using SwissProt.sh

      > sh SwissProt.sh "" DB001Wallet trc_gdm "/app/oracle/product/11.2.0.2.0/network/admin" "/app/wallet/" "" SwissProt.dat "VERSIONP1"

      Following is an example of how the swissProt.bat is to be run if Oracle Wallet is set up and Species list is present:

      C:\>swissProt.bat "" DB001Wallet trc_gdm C:\ora11g\product\11.2.0\dbhome_2\NETWORK\ADMIN D:\wallets Species.dat SwissProt.dat "VERSIONP1"

      Example using SwissProt.sh

      > sh SwissProt.sh "" DB001Wallet trc_gdm "/app/oracle/product/11.2.0.2.0/network/admin" "/app/wallet/" Species.dat SwissProt.dat "VERSIONP1"

    If no Oracle Wallet is set up, you have to pass the following parameters when swissProt.bat is run:

    1. Username to connect to schema

    2. Url — Full DB URL (host:port:instance, or scan-server-name:port:SID for a multiple node DB). For example, Localhost:1613:devdb1

    3. Schema name

    4. Directory location of the tnsnames.ora file. When Oracle Wallet is not set up enter ""

    5. Path to Wallet — when Oracle Wallet is not set up enter ""

    6. Path and name of the species list file. This is an optional parameter. If you do not have this list, enter ""

    7. Complete path and name of the data file

    8. Optional version of type 'PROTEIN' to use (omit to enter the interactive mode)

    Following is an example of how the swissProt.bat is to be run if Oracle Wallet is not set up and Species list is not present:

    C:\>swissProt.bat trc_gdm localhost:1613:devdb1 trc_gdm "" "" "" SwissProt.dat "VERSIONP1"

    Example using SwissProt.sh

    >sh SwissProt.sh trc_gdm localhost:1613:devdb1 trc_gdm "" "" "" SwissProt.dat "VERSIONP1"

    This is an example of how the swissProt.bat is to be run if an Oracle Wallet is not set up and a Species list file is used:

    C:\>swissProt.bat trc_gdm localhost:1613:devdb1 trc_gdm "" "" Species.dat SwissProt.dat

    Example using SwissProt.sh

    >sh SwissProt.sh trc_gdm localhost:1613:devdb1 trc_gdm "" "" Species.dat SwissProt.dat "VERSIONP1"

    The SwissProt Loader can also be run with named command-line arguments.

  2. The Ensembl EMBL files can be loaded next. The files can be loaded in any order, but without duplication: each file should be loaded only once; otherwise, at present, some information is duplicated. Multiple EMBL files can be loaded concurrently (for example, in separate terminal windows). However, to create a new DNA reference version using the interactive mode, one file should be loaded first (creating the new version in the process) and only then can multiple files be loaded concurrently.

    If an Oracle Wallet is set up, embl.bat/.sh can use the credentials stored in the Wallet to connect to the schema; otherwise, it prompts for a password. If an Oracle Wallet is set up, pass the following parameters when Embl.bat is run:

    1. Username — when an Oracle Wallet is set up, enter ""

    2. Url — instance alias for which the Wallet credential was created. If you start SQL*Plus as sqlplus /@DB001Wallet, then this value must be DB001Wallet.

    3. Schema name

    4. Directory location of the tnsnames.ora file

    5. Path to Wallet

    6. Complete path and name of the data file

    7. Depending on the file being loaded and on the operating system, a seventh, optional, argument may need to be used: the Java Virtual Machine heap size, in MB.

      By default, embl.bat (and the corresponding UNIX shell script, embl.sh) specifies 1 GB of Java Virtual Machine (JVM) maximum heap space (using the -Xmx1024M option). This is known to be sufficient for loading all Human Ensembl files, version 68, and works on most Linux or Windows systems (Oracle recommends 8 GB of RAM or more). Hence, you do not need to provide the heap size argument.

      However, some Ensembl EMBL files for other species, version 68 or higher, may require more heap space. The seventh argument should then be provided as 2048 (that is 2048 MB). These files can then be loaded successfully on Linux or 64-bit Windows with enough RAM. If there is an Out-of-memory error while loading a file, use a larger maximum heap size and use the same option for all subsequent Ensembl files for the same organism.

      If the specific computer or operating system cannot handle the specified JVM heap size, a Java Virtual Machine creation failed error occurs. You will then have to use another machine.

    8. The version of type DNA can be passed as the eighth argument (omit to enter the interactive mode).

      Following is an example of how embl.bat is to be run if Oracle Wallet is set up:

      C:\>embl.bat "" DB001Wallet trc_gdm C:\ora11g\product\11.2.0\dbhome_2\NETWORK\ADMIN D:\wallets embl.dat 2560

      Example for the embl.sh shell script for Linux:

      >sh embl.sh "" DB001Wallet trc_gdm /app/ora11g/product/11.2.0/dbhome_2/NETWORK/ADMIN /app/wallets embl.dat 2048 "GRCH37.P8"

    If no Oracle Wallet is set up, pass the following parameters when Embl.bat is run:

    1. Username to connect to schema

    2. Url — Full DB URL (host:port:instance). For example, Localhost:1613:devdb1

    3. Schema name

    4. Directory location of the tnsnames.ora file. When Oracle Wallet is not set up enter ""

    5. Path to Wallet — when Oracle Wallet is not set up enter ""

    6. Complete path and name of the data file

    7. Optional maximum heap size (in MB), or ""

    8. Optional version of type 'DNA' (omit to enter the interactive mode to select or create a version)

      This is an example of how the embl.bat is to be run if Oracle Wallet is not set up:

      C:\>embl.bat trc_gdm localhost:1613:devdb1 trc_gdm "" "" embl.dat 2048 "GRCH37.P8"

      Example for embl.sh:

      >sh embl.sh trc_gdm localhost:1613:devdb1 trc_gdm "" "" embl.dat 2048 "GRCH37.P8"

The EMBL Loader can also be run with named command-line arguments.

3.1.4 Running the Embl/Swissprot Loader with Named Command-Line Arguments

The EMBL/SwissProt Loader is a single Java application, packaged into a JAR archive GDM.jar. It can be used without any shell or Windows script, using the arguments described in this section. Running the application using these arguments provides some additional capabilities, not supported by the shell or Windows scripts installed with it.

Java Runtime 1.7 is required to run the EMBL/SwissProt Loader. Either the Java 1.7 executable must be on the PATH environment variable, or the full path to a version 1.7 Java executable must be specified.

If the Java 1.7 executable is in the PATH environment variable, the EMBL/SwissProt loader is run as follows:

java -Xmx2048m -jar gdm.jar <argument 1>…<argument N>

The -XmxNNNNm Java option is optional. In the 3.0 (or later) version of the EMBL/SwissProt Loader, it is usually not required because the memory usage has been optimized to handle chromosome-wide Ensembl files. Use it if you encounter an Out-of-memory error for any Ensembl EMBL file (provided that the machine has enough memory and the operating system is Linux or 64-bit Windows).

The arguments (except for the path or name of the data file to load, which must always be the last argument) are not positional, and can be used in any order. The key or option arguments begin with the "-" character. Some of them require a value as the following argument, others are stand-alone. For the up-to-date list of all available arguments, execute the following command:

java -jar gdm.jar -help

Following is more detailed information about the arguments and their usage:

-url <url> - specifies the URL of the database to connect to. <url> must be in the form host:port:instance. This argument is mandatory.

-schema <schema name> - specifies the ODB schema name. This argument is mandatory.

-user <user> - specifies the user name used to log into the database (needed only when not using an Oracle Wallet).

-wallet <wallet> - specifies the directory where the Oracle Wallet is set up.

-orahome <oracle home directory> - specifies the Oracle home directory, when a Wallet is used (this is the directory where the tnsnames.ora file resides).

-sprot - this key is used to load a SwissProt file. If it is absent, the data file will be loaded as an EMBL file.

-protFile <species file> - specifies the file path (optional) and name of a Species List file, used to optionally filter the contents of a SwissProt file by species (used when loading SwissProt only).

-version <version label> - specifies the version label of the DNA or Protein reference version. If this argument is present, it must match an existing version of appropriate type (DNA for EMBL, PROTEIN for SwissProt). If it is omitted, the Loader will start in the interactive mode, prompting you to select an existing version, or create a new one.

-verbose - if this argument is present, additional information will be printed on the screen and logged.

-updateDB - this argument is necessary if you want to actually load the contents of the file into the database. If it is omitted, the contents of the data file are parsed, but nothing is inserted into the database tables (except the version, if the Loader is started in the interactive mode and you choose to create a new version). Omitting this option is useful for verifying that the file parses without errors.

-print_summary - if this key is present, summary info will be printed on the console at the end of the run.

Examples:

Loading an EMBL file without a Wallet, Windows:

java -jar GDM.jar -url localhost:1521:b41804x1 -schema odb -user odb -version V1 -updateDB EMBLFile.dat

Verifying (without loading) the same file:

java -jar GDM.jar -url localhost:1521:b41804x1 -schema odb -user odb -version V1 EMBLFile.dat

Loading a SwissProt file with a Wallet, Windows:

java -jar GDM.jar -url localhost:1521:b41804x1 -schema odb -orahome C:\ora11g\product\11.2.0\dbhome\NETWORK\ADMIN -wallet C:\Wallets -version P1 -updateDB -sprot -protFile species.dat SPFile.dat

Loading an EMBL file with a Wallet, Linux, interactive mode for version:

java -jar GDM.jar -url localhost:1521:b41804x1 -schema odb -orahome /home/apps/ora11g/product/11.2.0/dbhome/NETWORK/ADMIN -wallet /home/Wallets -updateDB EMBLFile.dat

3.1.5 Index-Organized Tables Loader

The Java (EMBL/SwissProt) loader uses staging tables to prevent blocking when multiple sessions are loading concurrently. The EMBL- and SwissProt-linked XREF and QLFR data tables are index organized. A stored procedure, ODB_REFERENCE_UTIL.create_referecne_iot, is called after all Ensembl and SwissProt data has been loaded (this builds the XREF and QLFR tables more efficiently).

After loading the EMBL and SwissProt reference files, call the shell script IOT_loader.sh, which in turn executes load_from_stage_to_iot.sql, which calls the stored procedure.

Following are the command-line options for the loader script:

sh IOT_loader.sh <options>

Options

(*) required

-db_wallet* <VARCHAR2> (required, unless the -db_conn/-db_user combination is used to log into the database)

Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

SID, or the Oracle connection string that is,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

Example

sh ./IOT_loader.sh -db_wallet TRCQC

3.1.6 Gathering Optimizer Statistics

Oracle recommends gathering table and index statistics after completing an Ensembl and (or) SwissProt data load. Missing or stale statistics can significantly degrade query performance. Oracle statistics is a collection of data about database objects such as tables and indexes. It is required by the Oracle optimizer to estimate the most efficient query execution plan.

To collect statistics, connect to the database as the ODB schema owner using SQL*Plus and execute the following command:

exec dbms_stats.gather_schema_stats ('ODB_', cascade=>true,estimate_percent=>dbms_stats.auto_sample_size);

3.2 HUGO Loader

3.2.1 Description and Files to Load

The Hugo Loader populates curated gene nomenclature records, taken from an online resource maintained by the HUGO Gene Nomenclature Committee (HGNC), into ODB's reference database. The input data for the loader is the complete HGNC dataset, which can be retrieved from the HGNC Statistics and Downloads web page at http://www.genenames.org/cgi-bin/hgnc_stats. You must specifically download the complete HGNC dataset by clicking the hyperlink in the sentence Click here for the complete HGNC dataset on that web page. The downloaded data is a large tab-delimited text file with column headers.

Important:

The format of the complete HGNC files has changed as of May 2013. The ODB HUGO Loader 3.0 only supports the new file format, while the previous versions of the loader (ODB 2.0.2.1 and prior) only support the old format.

A batch file for Windows and an alternative shell script for Linux (bash) are provided for loading the data.

3.2.2 Running the Loader

The loader is found bundled in the latest ODB build in the /ODB_Loaders/Reference_Loader/Hugo_loader directory. This folder contains 8 files:

  • hugo_loader.bat

  • hugo_loader.sh

  • hugo_script.sql

  • several common sh, bat, and SQL scripts for reference version checking

To run the loader, perform the following:

  1. Copy the above files into a folder on your system along with the downloaded input file from HUGO.

  2. Open a command prompt terminal and change the directory to where the hugo_loader.bat and (or) hugo_loader.sh file resides.

  3. If working on Linux, ensure the scripts are executable (you may need to run chmod u+x *.sh)

  4. The hugo_loader.sh/.bat scripts use the credentials stored in an Oracle Wallet to connect to the schema that has the ODB. Pass the following parameters when the hugo_loader.bat is run:

    1. The Hugo data file.

    2. Oracle Directory Object

    3. Wallet name

    Note: Operation without an Oracle Wallet (with user name and database connection arguments) is only supported on Linux.

  5. Execute hugo_loader.sh/.bat, with appropriate arguments, to load the data.

3.2.3 Command-Line Argument List

Synopsis

hugo_loader.sh -help

hugo_loader.sh <...options>

Description

Validates input options and calls the loader script hugo_script.sql#hugo_loader.load_hugo

Options

(*) required

-db_wallet* <VARCHAR2> (required, unless the -db_conn/-db_user combination is used to log into the database)

Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

SID, or the Oracle connection string that is,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version* <NUMBER>

Run check version (1=yes|0=no)

-check_version_non_i* <NUMBER>

Run check version in non-interactive mode (1=yes|0=no)

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table location. Refer to the programmer's guide on how to retrieve this file from HGNC's web portal.

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table directory. For details, see Section 2.1, "Setting Up a Directory Object"

-reference_version <VARCHAR2>

The reference version label of the Hugo file being loaded. A "HUGO" reference version label is defined in W_EHA_VERSION.VERSION_LABEL. If the version label is not present in the W_EHA_VERSION table, the loader prompts the user to confirm whether to continue with the version label provided. If yes, the loader inserts the new record in the version table with the given Version_label and proceeds with the load.

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

Examples

UNIX

$ sh hugo_loader.sh -db_wallet odb_user -check_version 1 -check_version_non_i 0 -data_file "genefam_list.pl" -data_directory "ODB_LOAD" -reference_version "DLD_DT_01062013" -read_size ""

Windows

C:\> hugo_loader.bat -db_wallet odb_user -check_version 1 -check_version_non_i 0 -data_file "genefam_list.pl" -data_directory "ODB_LOAD" -reference_version "DLD_DT_01062013" -read_size ""

Once loading is complete, log in to SQL Developer or SQL*Plus as the ODB schema owner and verify that 35,000 or more records are populated in the W_EHA_HUGO_INFO table. The execution information is logged in the W_EHA_RSLT_LOG table. If run with the -print_summary 1 option, the loader also prints information about the execution on the console, including the count of inserted records and errors, if any.
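
For example, a quick record-count check from SQL*Plus (a minimal sketch, using only the table name given above):

-- Expect roughly 35,000 or more rows after a successful load.
SELECT COUNT(*) FROM w_eha_hugo_info;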

3.3 GVF Ensembl Loader

3.3.1 Description and Files to Load

The GVF Ensembl loader loads known variants for any given species for which DNA source records are present. It loads only those variant records from the input file for which matching DNA source records exist in the database (that is, those variants that fall into the absolute position ranges of a DNA source record with the same chromosome and species). Therefore, you must ensure that the EMBL loader is run first with the relevant species' EMBL input files.

Since GVF files do not contain information about the species, you must identify the species when running the loader (see the -species_name argument below). Therefore, the W_EHA_SPECIES table must already contain the relevant species record.
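
As a quick sanity check before running the loader, you can list the species records from SQL*Plus and confirm that the relevant entry (for example, Homo sapiens) is present:

-- List the species records currently defined in ODB.
SELECT * FROM w_eha_species;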

Any GVF file can be loaded multiple times. For Homo sapiens, GVF input files can be downloaded from the Ensembl FTP website ftp://ftp.ensembl.org/pub/release-65/variation/gvf/homo_sapiens/.

The preceding link may not necessarily reflect the most recent version of GVF files available. Oracle recommends that you use the latest GVF files.

Note:

Ensembl now keeps Germ-line mutations and Somatic mutations in separate gvf files. To have both datasets, Oracle recommends loading both the Homo_sapiens.gvf.gz and the Homo_sapiens_somatic.gvf.gz files present in the link.

3.3.2 Running the Loader

The .bat and .sh files for GVF loaders require the same set of named command line arguments, except that the .bat script only supports using an Oracle Wallet connection (like for all other SQL loaders).

3.3.3 Command-Line Argument List

Name

GVF_loader.sh - load records

Synopsis

GVF_loader.sh -help

GVF_loader.sh <...options>

Description

Validates input options and calls the loader script load_gvf.sql#odb_ref_gvf_util.process_gvf

Options

(*) required

-db_wallet* <VARCHAR2> (required, unless the -db_conn/-db_user combination is used to log into the database)

Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle SID, or the Oracle connection string that is,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see Section 2.1, "Setting Up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES that is, for humans "Homo sapiens"

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

Examples

UNIX

$ sh GVF_loader.sh -db_wallet odb_user -data_file "som_variants.gvf" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -reference_version "GRCh37.p8" -preprocess_dir "" -preprocess_file "" -read_size ""

Windows

C:\> GVF_loader.bat -db_wallet odb_user -data_file "som_variants.gvf" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -reference_version "GRCh37.p8" -preprocess_dir "" -preprocess_file "" -read_size ""

3.3.4 Gathering Optimizer Statistics

Oracle recommends gathering table and index statistics after running the GVF loader. Oracle statistics is a collection of data about database objects such as tables and indexes. It is required by the Oracle optimizer to estimate the most efficient query execution plan. Missing or stale statistics can significantly degrade query performance.

To collect statistics, connect to the database as the ODB schema owner using SQL*Plus and execute the following command:

exec dbms_stats.gather_schema_stats ('ODB_', cascade=>true,estimate_percent=>dbms_stats.auto_sample_size);

3.4 Pathway Loader

3.4.1 Description and Files to Load

Pathway_loader is a script for extracting, transforming, and loading GSEA standard file formats.

The data used can be downloaded from http://www.pathwaycommons.org/archives/PC1/last_release-2011/gsea/by_species/homo-sapiens-9606-gene-symbol.gmt.zip.

Currently, this is the last supported pathway file format. More recent versions exist but should be reformatted to the above supported version.

The first and second columns in this file are normal tab-delimited columns, but the third column is a string containing delimited values.

The pathway_loader utility is compatible with Oracle RDBMS 10.2 and above. It is not operating system dependent and works entirely within the Oracle database. This section describes the setup procedure and also illustrates how to use the utility to load data from the GSEA file located on your system.

3.4.2 Running the Loader

The loader is made up of 10 files:

  • pathway_loader.bat

  • pathway_loader.sh

  • pathway_script.sql

  • several common sh, bat, and SQL scripts for reference version checking

The call to the stored procedure load_pathway() is defined in one of the script files (pathway_script.sql). This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, SPECIES_NAME, PATHWAY REFERENCE VERSION and READ_SIZE as input parameters.

It creates an external table dynamically and uploads data from the source file into it. External tables let Oracle query data that is stored outside the database in flat files. The ORACLE_LOADER driver accesses data stored in any format that can be loaded by the SQL*Loader.

The stored procedure dynamically creates PATH_DATA_!!SEQ!! as an external table. This external table stores the complete pathway data. This table maps all the fields existing in the pathway file.

Note:

In the above external table, the "!!SEQ!!" string is replaced by ETL_PROC_ID at run time.
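
For illustration only, the following is a simplified sketch of what such an ORACLE_LOADER external table definition can look like. The column names, access parameters, and the literal sequence value are assumptions and do not match the DDL the loader actually generates.

-- Illustrative sketch only; not the loader's generated DDL.
CREATE TABLE path_data_1234 (            -- "1234" stands in for the ETL_PROC_ID value
  pathway_name   VARCHAR2(4000),
  external_ref   VARCHAR2(4000),
  protein_list   VARCHAR2(4000)          -- third column: a string of delimited values
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY ODB_LOAD             -- the -data_directory directory object
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY X'09'
  )
  LOCATION ('homo-sapiens-9606-gene-symbol.gmt')
)
REJECT LIMIT UNLIMITED;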

There are two bulk insert statements executed dynamically: one inserts records into the W_EHA_PATHWAY table and the other inserts records into the W_EHA_PATHWAY_PROTEIN table.

Multiple versions of pathway data can be loaded into the table. The VERSION_WID column saves the version ID of the particular version loaded, which is retrieved from the W_EHA_VERSION table.
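
For example, the rows that belong to a particular pathway release can be identified by joining back to W_EHA_VERSION. This is a minimal sketch: it assumes ROW_WID is the key column of W_EHA_VERSION, so verify the column name against your schema before using it.

-- Count pathway records loaded under a given version label
-- (ROW_WID as the W_EHA_VERSION key column is an assumption).
SELECT COUNT(*)
FROM   w_eha_pathway p
JOIN   w_eha_version v ON v.row_wid = p.version_wid
WHERE  v.version_label = UPPER('feb-2011');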

The pathway_loader.bat file requires the logon credentials to be stored in an Oracle Wallet, while the pathway_loader.sh script can be run with or without an Oracle Wallet.

The load_pathway procedure supports error logging associated with Pathway Reference, species_name, sql_err and Species lookup failure. These errors are logged into the W_EHA_RSLT_LOG table.

3.4.3 Command-Line Argument List

Name

pathway_loader.sh - load records

Synopsis

pathway_loader.sh -help

pathway_loader.sh <...options>

Description

Validates input options and calls the loader script pathway_script.sql#pathway_loader.load_pathway

Options

(*) required

-db_wallet* <VARCHAR2>

Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

SID or Oracle connection string that is,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see Section 2.1, "Setting Up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES that is, For humans "Homo sapiens"

-reference_version <VARCHAR2>

"PATHWAY" reference version label defined in W_EHA_VERSION.VERSION_LABEL. If the version label is not present in the W_EHA_VERSION table, the loader interactively prompts for confirmation to continue with the version label provided. If yes, the loader inserts the new record in the version table with the given Version_label and proceeds with the load.

-read_size <NUMBER>

Read size in bytes. Oracle external table READSIZE

Examples

UNIX

$ sh pathway_loader.sh -db_wallet odb_user -data_file "homo-sapiens-9606-gene-symbol.gmt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -reference_version "feb-2011" -read_size ""

Windows

C:\> pathway_loader.bat -db_wallet odb_user -data_file "homo-sapiens-9606-gene-symbol.gmt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -reference_version "feb-2011" -read_size ""

3.5 Prediction Score (PolyPhen, SIFT) Loader

3.5.1 Description and Files to Load

In Ensembl, human mutations that cause amino acid substitutions are further analyzed for the effect of the substitution on protein function. This is done using the SIFT and PolyPhen predictive algorithms. The source files produced by running either SIFT or PolyPhen contain a prediction and a score, which are stored in the target tables. The model supports multiple versions of SIFT and PolyPhen data. The versions are recorded in the W_EHA_VERSION and W_EHA_FILE_TYPE tables.

Downloading Data from Ensembl BioMart

The source data is downloaded from the Ensembl BioMart tool. Download SIFT and POLYPHEN data separately and load them in separate loader runs.

The data can be downloaded from one of the following links:

http://uswest.ensembl.org/biomart/martview/

http://asia.ensembl.org/biomart/martview/

  1. Select Database as Ensembl Variation <ver>.

  2. Select Dataset as either Homo sapiens Somatic Short Variants (COSMIC sourced) or Homo Sapiens Short Variants (dbSNP sourced).

  3. From Filters: If you want to download data for a specific region of the chromosome, then use this option. Otherwise you can retain the default filter options.

  4. From Attributes: Select the following options in the specific order defined below:

    • From SEQUENCE VARIATION:

      • Variation Name

      • Chromosome Name

      • Position on Chromosome (bp)

      • Strand

      • Variant Allele

    • From GENE ASSOCIATED INFORMATION:

      • Consequence specific allele

      • Ensembl Transcript ID

      • Polyphen prediction or SIFT prediction

      • Polyphen score or SIFT score.

  5. Click Result at the top of the screen.

  6. In Export all results to, select the following:

    1. Select File

    2. Select TSV

    3. Select Unique results only.

  7. Click Go to download the file.

Select the attributes in the specified order. Also, ensure that you select PolyPhen prediction and PolyPhen score when downloading PolyPhen data, and SIFT prediction and SIFT score when downloading SIFT data.

For downloading large SIFT and PolyPhen datasets, you can alternatively use the Ensembl Perl APIs. For details, refer to the following links:

http://www.ensembl.org/info/docs/index.html

http://www.ensembl.org/info/docs/api/variation/index.html#api

The call to the stored procedure odb_result_util.process_variant_prediction() is defined in one of the script files (load_prediction_score.sql). This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, FILE TYPE (either SIFT or PolyPhen), FILE VERSION and DNA REFERENCE VERSION as input parameters.

It creates an external table dynamically and uploads data from the source file into it. External tables let Oracle query data that is stored outside the database in flat files. The ORACLE_LOADER driver accesses data stored in any format that can be loaded by the SQL*Loader. No DML can be performed on external tables but they can be used for query, join, and sort operations.

The stored procedure dynamically creates PREDICTION_DATA_!!SEQ!! as an external table. This external table stores the complete result data. This table maps all the fields existing in the input file.

Note:

In the above external table, the "!!SEQ!!" string is replaced by ETL_PROC_ID at run time.

There are two multi-table insert statements executed dynamically. One inserts records into the w_eha_variant_prediction table and another inserts records into the w_eha_rslt_log table.

A select statement that parses the data from the external table uses an inline query, which gets the dataset of variant and transcript records. The inline query uses an analytic (PARTITION BY) function to avoid duplicate records for different variation names with the same variant record (VARIANT_WID). The dataset of this inline query is then joined with the W_EHA_VARIANT_PREDICTION table to look up VARIANT_WID and STRUCTURE_WID, and a record is inserted into either the W_EHA_VARIANT_PREDICTION table or the W_EHA_RSLT_LOG table.
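
For illustration only, the following is a much-simplified sketch of the deduplication pattern described above. The staging table name and its column names are hypothetical, and the SQL the loader actually generates is considerably more involved.

-- Keep one row per variation/transcript combination from the staged data
-- (table and column names below are hypothetical).
SELECT variation_name, transcript_id, prediction, score
FROM (
  SELECT s.variation_name,
         s.transcript_id,
         s.prediction,
         s.score,
         ROW_NUMBER() OVER (PARTITION BY s.variation_name, s.transcript_id
                            ORDER BY s.score DESC) AS rn
  FROM   prediction_data_1234 s          -- the dynamically created external table
)
WHERE  rn = 1;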

If you upload the same file (that is, a file with the same name as the one previously loaded) with a different version, the loader considers this file as a new file and uploads the record into the target (W_EHA_VARIANT_PREDICTION) table with a different file version.

3.5.2 Running the Loader

A record is inserted in the W_EHA_VARIANT_PREDICTION table for a variant belonging to a specific transcript; hence, reference_id and transcript_id are looked up before inserting a record in this table.

Before inserting a record in this table, the loader also checks if a record already exists in the W_EHA_VARIANT_PREDICTION table for a specific reference_id and transcript_id and for the version of SIFT or PolyPhen data.

The following operation is performed based on the above condition:

  • If reference_id (variant_wid) and transcript_id (structure_wid) exist in the target table but the SIFT or PolyPhen version is different, a new record is inserted in this table.

  • If reference_id (variant_wid) and transcript_id (structure_wid) exist in the target table and the SIFT or PolyPhen version is also the same, the score is compared:

    • If the score is the same, the variant record is not updated.

    • If the score is not the same, the existing record is not updated, but the reference_id along with the transcript_id is reported to the W_EHA_RSLT_LOG table, stating that the score was different.

  • If reference_id (variant_wid) and transcript_id (structure_wid) do not exist in the table, a new record is inserted.

Note:

The batch file requires an Oracle Wallet to be set up to run correctly. However, the shell script can be run with or without an Oracle Wallet.

SIFT or PolyPhen Reference Version

The Program Version refers to the SIFT or PolyPhen program version used to generate the data. You can check the version from http://asia.ensembl.org/info/genome/variation/predicted_data.html

Table 3-1 Mapping of Polyphen or SIFT File

Each entry below gives the column name in the result file, the corresponding table and column name in ODB, and a description.

Variation Name
ODB: REFERENCE_ID in the W_EHA_VARIANT_XREF table
This is a FK to the W_EHA_VARIANT table. It is extracted by the loader using a lookup of Variation Name from the source file against the REFERENCE_ID in the W_EHA_VARIANT_XREF table.

Chromosome Name
ODB: W_EHA_CHROMOSOME.CHROMOSOME
Chromosome name.

Position on Chromosome (bp)
ODB: W_EHA_RSLT_SEQUENCING.START_POSITION
Stores the start position on the chromosome.

Strand
ODB: N/A
N/A

Variant Alleles
ODB: N/A
N/A

Ensembl Transcript ID
ODB: W_EHA_GENE_STRUCTURE.TRANSCRIPT_ID
This is a FK to the W_EHA_GENE_STRUCTURE table. It is extracted by the loader using a lookup of Ensembl Transcript ID from the source file against the TRANSCRIPT_ID in the W_EHA_GENE_STRUCTURE table.

PolyPhen/SIFT prediction
ODB: W_EHA_PREDICTION_CODE.CODE
This is a FK to the W_EHA_CODE table, which stores all possible predictions for SIFT and PolyPhen data. The FK corresponding to the prediction of this record is available in the source file.

PolyPhen or SIFT score
ODB: W_EHA_VARIANT_PREDICTION.PREDICTION_SCORE
Stores the prediction score value from the source file for a particular variant.


3.5.3 Command-Line Argument List

Name

prediction_score_loader.sh - load records

Synopsis

prediction_score_loader.sh -help

prediction_score_loader.sh <...options>

Description

Validates input options and calls the loader script load_prediction_score.sql#odb_ref_prediction_util.process_variant_prediction

Options

(*) required

-db_wallet* <VARCHAR2>

Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string that is,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see Section 2.1, "Setting Up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES that is, For humans "Homo sapiens"

-prediction_version_type* <VARCHAR2>

Prediction reference version type (SIFT|POLYPHEN)

-prediction_version_label* <VARCHAR2>

"SIFT"|"Polyphen" reference version label defined in W_EHA_VERSION.VERSION_LABEL. If the version label is not present in the W_EHA_VERSION table, the loader prompts for confirmation to continue with the version label provided. If yes, the loader inserts the new record in the version table with the given Version_label and proceeds with the load.

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

Examples

UNIX

$ sh prediction_score_loader.sh -db_wallet odb_user -data_file "ut_variant_pred1.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -prediction_version_type "Polyphen" -prediction_version_label "5.0" -reference_version "GRCh37.p8" -preprocess_dir "" -preprocess_file "" -read_size ""

Windows

C:\> prediction_score_loader.bat -db_wallet odb_user -data_file "ut_variant_pred1.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -prediction_version_type "Polyphen" -prediction_version_label "5.0" -reference_version "GRCh37.p8" -preprocess_dir "" -preprocess_file "" -read_size ""

Typical Errors Associated with prediction_score Loader

  • A "Record not inserted" message is logged if the same version and the same file type are loaded but the input record has a score different from the existing record.

  • Other possible errors are:

    • Generating etl process id

    • Generating enterprise id

    • Verifying species name

    • Verifying reference version

    • Verifying prediction version

    • Inserting file type

    • Verifying file type

    • Processing result records

3.6 Probe Loader

3.6.1 Description and Files to Load

The probe loader populates the W_EHA_PROBE table. You can use probe_loader.bat to run it in Windows or probe_loader.sh to run it in Linux. The probe loader is mostly a reference loader, but its input data varies with the vendor (for example, Affymetrix or Illumina).

Following are the assumptions for the data file for the Probe Loader:

  • The file is tab-delimited.

  • The first row is always the header.

Mappings for Probe Loader

Table Mappings for Probe Loader

Data File Column    W_EHA_PROBE Column
PROBESET            W_EHA_PROBE.PROBE_NAME
ACC                 W_EHA_PROBE.ACCESSION
DESCP               W_EHA_PROBE.PROBE_DESC
GENEID              W_EHA_PROBE.PRIMARY_HUGO_NAME

3.6.2 Running the Loader

The call to the stored procedure PROBE_LOADER() is in the script file probe_script.sql. This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, SPECIES NAME, DNA VERSION LABEL, PROBE VERSION LABEL and READ_SIZE as input parameters.

It creates an external table dynamically and uploads data from the source file into it. External tables let Oracle query data that is stored outside the database in flat files. The ORACLE_LOADER driver accesses data stored in any format that can be loaded by the SQL*Loader.

The stored procedure dynamically creates PROBE_DATA_!!SEQ!! as an external table. This external table stores the complete probe data and maps all the fields existing in the probe file.

There is a merge statement, executed dynamically, that either inserts a new probe record or updates the existing one in the w_eha_probe table. When updating a record, it updates all the columns, including replacing the row_wid with a new row_wid for that probe name.
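
For illustration only, the following is a much-simplified sketch of that merge pattern, keyed on the probe name and using the data file columns from the mapping table above. The staging table name is hypothetical, and the generated statement sets additional columns, including a new ROW_WID on update.

-- Simplified sketch only; not the loader's generated statement.
MERGE INTO w_eha_probe p
USING probe_data_1234 s                  -- the dynamically created external table
   ON (p.probe_name = s.probeset)
WHEN MATCHED THEN UPDATE SET
     p.accession         = s.acc,
     p.probe_desc        = s.descp,
     p.primary_hugo_name = s.geneid
WHEN NOT MATCHED THEN INSERT
     (probe_name, accession, probe_desc, primary_hugo_name)
     VALUES (s.probeset, s.acc, s.descp, s.geneid);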

Old records can be referenced using the Flashback Data Archive (FDA) approach, which is used for securely tracking the contextual history of all data. FDA makes it possible to automatically and transparently track all changes to tables in the database and easily query data in those tables as of any point in time or over any interval within the specified retention period, with minimal performance impact.
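
For example, an overwritten probe record can be viewed with a standard Oracle Flashback Query, assuming FDA is enabled on the table and the timestamp falls within the retention period. The probe name below is hypothetical.

-- View W_EHA_PROBE as it was one day ago (timestamp and probe name are illustrative).
SELECT *
FROM   w_eha_probe AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '1' DAY)
WHERE  probe_name = 'PROBESET_1';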

To load gene expression result files for a particular older reference version, run the result loader BEFORE re-running the probe loader, which updates the probes in the table to a new reference version. Then load the next set of gene expression result files pointing to the new version. Gene expression records lose their probe reference keys when the probe row_wids are updated.

The .bat file requires an Oracle Wallet to be set up before it can run successfully.

For shell scripts, if an Oracle Wallet is set up, the script uses those credentials to run SQL*Plus. If an Oracle Wallet is not set up, the script prompts for a password and then connects to SQL*Plus.

3.6.3 Command-Line Argument List

Name

probe_loader.sh - load records

Synopsis

probe_loader.sh -help

probe_loader.sh <...options>

Description

Validates input options and calls the loader script probe_script.sql#probe_loader.load_probe

Options

(*) required

-db_wallet* <VARCHAR2> (required, unless the -db_conn/-db_user combination is used to log into the database)

Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string, that is,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see Section 2.1, "Setting Up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES that is, For humans "Homo sapiens"

-dna_version_label* <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-probe_version_label* <VARCHAR2>

"PROBE" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

Examples

UNIX

$ sh probe_loader.sh -db_wallet odb_user -data_file "dummy_probeset_annotation_summary.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -dna_version_label "GRCh37.p8" -probe_version_label "PROBE_VER_1" -read_size ""

Windows

C:\> probe_loader.bat -db_wallet odb_user -data_file "dummy_probeset_annotation_summary.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -dna_version_label "GRCh37.p8" -probe_version_label "PROBE_VER_1" -read_size ""

3.7 ADF Data Loader

3.7.1 Description and Files to Load

An Array Description Format (ADF) file, or array design file, is a microarray platform-specific, tab-delimited file that describes the design of an array. For a particular array platform, this file lists the features (spots) found on an array, along with their locations and associated annotation information, including the Reporters (oligo probes) found at each feature and the Composite Elements (genomic features such as genes) represented by it.

The ADF Data Loader loads the ADF file for the AgilentG4502A_07_1 (Agilent 244K Custom Gene Expression G4502A-07-1) platform, which contains annotation data for all Gene Composite elements found in the AgilentG4502A_07 Level-3 data files present in TCGA, which are loaded into ODB using the Dual Channel Loader.

TCGA provides all available ADF files on its Platform Design page: https://tcga-data.nci.nih.gov/tcga/tcgaPlatformDesign.jsp

The TCGA ADF file for AgilentG4502A_07_1 can be retrieved from the following link: http://tcga-data.nci.nih.gov/docs/integration/adfs/tcga/AgilentG4502A_07_01.tcga.adf.zip.

3.7.2 Running the Loader

The call to the stored procedure odb_result_util.process_adf() is defined in the script file load_adf.sql. This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, SPECIES, USER LABEL, Reference Version, File Flag, Preprocess Directory, Preprocess File, Data File Path, DBFS Store, Alternate File Location (ftp location or http location), and Read Size as input parameters.

It creates an external table dynamically and uploads data from the source file into it. External tables let Oracle query data that is stored outside the database in flat files. The ORACLE_LOADER driver accesses data stored in any format that can be loaded by the SQL*Loader. No DML can be performed on external tables but they can be used for query, join, and sort operations.

The stored procedure dynamically creates ADF_DATA_!!SEQ!! as an external table. This external table stores the complete result data. This table maps all the fields existing in the result file.

The procedure first loads a row into the w_eha_adf table. A simple merge statement looks up the record using the user label, file WID, version WID, ETL proc WID, and enterprise WID, all of which are passed as input parameters. If these values match an existing w_eha_adf record, the loader updates all the columns (except w_update_dt) of the w_eha_adf table for the corresponding user_label.

Three multi-table insert statements, executed dynamically, insert records into the w_eha_adf_composite, w_eha_adf_reporter, and w_eha_adf_reporter_coord tables.

The deleted records can be referenced using the Flashback Data Archive (FDA) approach, which is used for securely tracking the contextual history of all data. FDA makes it possible to automatically and transparently track all of the changes to the tables in the database, and to easily query data in those tables as of any point in time or over any interval within the specified retention period, with minimal performance impact.

Note:

The batch file requires an Oracle Wallet to be set up to run correctly. The shell script can be run with or without an Oracle Wallet.

3.7.3 Command-Line Argument List

Name

ADF_loader.sh - load records

Synopsis

ADF_loader.sh -help

ADF_loader.sh <...options>

Description

Validates input options and calls the loader script load_adf.sql, which calls odb_ref_adf_util.process_adf

Options

(*) required

-db_wallet* <VARCHAR2> (required, unless the -db_conn/-db_user combination is used to log into the database)

Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string, that is,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB username for the database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see Section 2.1, "Setting Up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES that is, For humans "Homo sapiens"

-user_label* <VARCHAR2>

User label used to identify a composite record's source ADF dataset that is, AgilentG4502A_07_1

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-file_flg* <CHAR>

File flag (E=external|S=copy to secure data file directory) [default: E]

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR

-data_file_path <VARCHAR2> (required, if -file_flg is "S")

File system path to secure data file directory

-dbfs_store <VARCHAR2> (required, if -file_flg is "S")

Database file system store

-alt_file_loc <VARCHAR2>

Alternate file location link that is, ftp:location, http:location

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

-data_file_dir <VARCHAR2>

File system path to Oracle directory object

Examples

UNIX

$ sh ADF_loader.sh -db_wallet odb_user -data_file "adf_summary.adf" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -user_label "AgilentG4502A_07_01" -reference_version "GRCh37.p8" -file_flg "E" -preprocess_dir "" -preprocess_file "" -data_file_path "" -dbfs_store "" -alt_file_loc "" -read_size ""

Windows

C:\> ADF_loader.bat -db_wallet odb_user -data_file "adf_summary.adf" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -user_label "AgilentG4502A_07_01" -reference_version "GRCh37.p8" -file_flg "E" -preprocess_dir "" -preprocess_file "" -data_file_path "" -dbfs_store "" -alt_file_loc "" -read_size ""

3.8 HGMD (BioBase) Loader

3.8.1 Description and Files to Load

The HGMD loader is used to load mutations and associated disease, drug and other annotations from GFF formatted source files that can be obtained from BioBase Biological Databases as a part of the Genome Trax™ data set. Files can be loaded from the BioBase site at the following location:

https://portal.biobase-international.com/download/genometrax/

A valid Genome Trax license is required for the download. Download only the gff archive for the desired reference genome assembly. The loader currently loads 3 data source file types:

  • hgmd_hg*.gff

  • hgmd_disease_hg*.gff

  • drug_hg*.gff

Note:

BioBase provides scheduled updates as a full set of curated data assembled against the two most recent reference genome builds.

The BioBase release version of these data sets is not available in data files and has to be provided as a command-line argument (see -hgmd_version_label below). Currently the HGMD datasets are built against Human Genome 18 (HG18) and Human Genome 19 (HG19) in UCSC notation (or Build 36 and Build 37 in NCBI notation).

You can load both HG18 and HG19 curated data, but they have to be linked to the correct reference genome versions that exist in ODB. Loading a new HGMD release version will overwrite all the linkage data previously loaded for the same reference genome version. The overwritten records can be referenced using the Flashback Data Archive (FDA) approach.

The file processing starts with the hgmd_hg*.gff file representing the inherited mutations track. All variants listed in the source file are identified and the novel ones are added to the W_EHA_VARIANT table. A reference to the HGMD accession for all variants from the source file is inserted into the W_EHA_VARIANT_XREF table. New disease names are added to the W_EHA_DISEASE table. Finally, all curated associations are loaded into W_EHA_DISEASE_G_VARIANT table and literature links are added to the W_EHA_DISEASE_G_VAR_XREF table.

The second source file, hgmd_disease_hg*.gff, which represents gene-disease linkage, is loaded into the W_EHA_DISEASE_GENE and W_EHA_DISEASE_GENE_XREF tables. The latter table stores literature links.

Lastly, the drug_hg*.gff file is processed to populate the W_EHA_DRUG, W_EHA_DRUG_XREF, W_EHA_DRUG_TARGET, and W_EHA_DRUG_TARGET_XREF tables. The first two tables store drug records and drug references, respectively. Currently, drugs are referenced by their DrugBank IDs (http://www.drugbank.ca/). The other two tables store gene-drug linkage and references to supporting research findings.
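
After an HGMD load completes, a quick row-count check over the tables named above can confirm that each file type was processed. This is a minimal sketch; it assumes only the ODB tables listed in this section and that the connected user can query them.

-- Row counts for tables populated by the HGMD loader.
SELECT 'W_EHA_VARIANT_XREF' AS table_name, COUNT(*) AS row_count FROM W_EHA_VARIANT_XREF
UNION ALL SELECT 'W_EHA_DISEASE', COUNT(*) FROM W_EHA_DISEASE
UNION ALL SELECT 'W_EHA_DISEASE_G_VARIANT', COUNT(*) FROM W_EHA_DISEASE_G_VARIANT
UNION ALL SELECT 'W_EHA_DISEASE_GENE', COUNT(*) FROM W_EHA_DISEASE_GENE
UNION ALL SELECT 'W_EHA_DISEASE_GENE_XREF', COUNT(*) FROM W_EHA_DISEASE_GENE_XREF
UNION ALL SELECT 'W_EHA_DRUG', COUNT(*) FROM W_EHA_DRUG
UNION ALL SELECT 'W_EHA_DRUG_TARGET', COUNT(*) FROM W_EHA_DRUG_TARGET;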

3.8.2 Running the Loader

The loader is implemented as a PL/SQL stored procedure that can be invoked from a provided shell or batch file. The procedure accepts an Oracle Directory Object, file suffix, species name, reference genome version, BioBase release version and read size as input parameters.

The file suffix, the hg* portion of the input file name, reflects the version of the reference genome assembly (HG18 or HG19) that was used to prepare the source reference files. The loader parameter called reference genome version specifies which reference genome release loaded into the ODB schema should be used to link the HGMD annotations. Therefore, the file suffix and the target reference genome version should be within the same major release.

For example, HG19 files can be linked with the GRCH37.P8 or GRCH37.P9 reference genome, but not with Build 36.
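
To see which reference genome versions are available to link against before running the loader, you can list the version labels already loaded into ODB. This is a minimal sketch; it uses only the W_EHA_VERSION.VERSION_LABEL column referenced by the -dna_version_label option below.

-- Version labels known to ODB; pass one of these as -dna_version_label.
SELECT VERSION_LABEL
  FROM W_EHA_VERSION
 ORDER BY VERSION_LABEL;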

The loader does not require specific filenames to process. Given the file suffix, it loads all currently supported files.

Note:

The batch file requires an Oracle Wallet to be set up to run correctly. The shell script can be run with or without an Oracle Wallet.

3.8.3 Command-Line Argument List

Name

HGMD_loader.sh - load records

Synopsis

HGMD_loader.sh -help

HGMD_loader.sh <...options>

Description

Validates input options and calls the loader script load_hgmd.sql, which calls odb_ref_hgmd_util.process_hgmd

Options

(*) required

-db_wallet* <VARCHAR2> (required, unless the -db_conn/-db_user combination is used to log into the database)

Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string, for example,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see Section 2.1, "Setting Up a Directory Object"

-data_file_suffix* <VARCHAR2>

Data file suffix name - Oracle external table LOCATION

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES, for example, "Homo sapiens" for humans

-dna_version_label* <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-hgmd_version_label* <VARCHAR2>

"HGMD" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

Examples

UNIX

$ sh HGMD_loader.sh -db_wallet odb_user -data_file_suffix "hg19_12" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -dna_version_label "GRCh37.p8" -hgmd_version_label "2012.4" -read_size ""

Windows

C:\> HGMD_loader.bat -db_wallet odb_user -data_file_suffix "hg19_12" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -dna_version_label "GRCh37.p8" -hgmd_version_label "2012.4" -read_size ""

3.9 COSMIC Loader

3.9.1 Description and Files to Load

The COSMIC loader is used to load information on cancer causing somatic mutations such as sample data, histology, anatomical site observed, the genomic location and the publication IDs from where the data is curated. The dataset is available at the following official Sanger website:

http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/.

COSMIC has changed its licensing model, and versions after v71 are not free to download. Oracle supports loading COSMIC data for version v71 or earlier.

For details on downloading the COSMIC dataset, visit the following location

http://cancer.sanger.ac.uk/cosmic/download

The loader takes four files of two types as input, which must be retrieved from Sanger's FTP download repository from separate file paths.

  • Two coding and non-coding annotation data files with the following file name patterns:

    • CosmicCompleteExport_vxx_xxxxxx.tsv.gz: contains annotations on somatic variants that affect coding regions of the genome.

    • CosmicNCV_vxx_xxxxxx.csv.gz: contains available annotations on variants found in non-coding regions of the genome.

  • Two VCF format variant genomic feature files with the following file name patterns:

    • CosmicCodingMuts_vxx_xxxxxx_noLimit.vcf.gz: contains genomic feature information for coding variants.

    • CosmicNonCodingVariants_vxx_xxxxxx_noLimit.vcf.gz: contains genomic feature information for non-coding variants.

You must decompress all files before running the loader.

3.9.2 Running the Loader

The loader is implemented as a PL/SQL stored procedure that can be invoked from a provided shell or batch file. The procedure accepts all four input file names, an Oracle Directory Object, species name, reference genome version, COSMIC release version, and read size as input parameters.

The VCF files from COSMIC, which map mutation identifiers to the genomic position or sequence variation, list the reference genome version in the header portion. For example, ##reference=GRCh37.

When you load VCF files from COSMIC (the COSMIC loader invokes the VCF loader), you must provide a reference genome version. You can link the files to any reference genome patch for this reference genome version.

All COSMIC versions should be linked to GRCh37-based DNA versions such as GRCH37.P8 or GRCH37.P7. If your analysis involves both p8 and p7, run the loader twice and link the same files to both patches.

The COSMIC loader first calls process_gvcf to process the two VCF format files and load the reference variants they contain into the variant tables; any new variants are thus added to ODB. After the variant load, the data files are processed by the loader procedure, which loads sample, histology, and anatomical site data into ODB and bridges the data to the variant and gene information.
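
Because process_gvcf adds any new variants to W_EHA_VARIANT, a coarse way to confirm that the VCF portion of the load did its work is to compare the variant row count before and after the run. This is a minimal sketch; the single-statement comparison assumes the earlier timestamp still falls within the undo retention window, and the one-hour offset is illustrative.

-- Current number of variant rows (run before and after the load to compare).
SELECT COUNT(*) AS variant_rows FROM W_EHA_VARIANT;

-- Or compare against an earlier point in time in a single statement.
SELECT (SELECT COUNT(*) FROM W_EHA_VARIANT) AS rows_now,
       (SELECT COUNT(*)
          FROM W_EHA_VARIANT AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '1' HOUR)) AS rows_one_hour_ago
  FROM dual;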

Note:

The batch file requires an Oracle Wallet to be set up to run correctly. The shell script can be run with or without an Oracle Wallet.

3.9.3 Command-Line Argument List

Name

COSMIC_loader.sh - load records from COSMIC files

Synopsis

COSMIC_loader.sh -help

COSMIC_loader.sh <...options>

Description

Validates input options and calls the loader script load_cosmic.sql, which calls odb_ref_cosmic_util.process_cosmic

Options

(*) required

-db_wallet* <VARCHAR2> (required, unless the -db_conn/-db_user combination is used to log into the database)

Oracle wallet name, see Section 2.2, "Setting Up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string, for example,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see Section 2.1, "Setting Up a Directory Object"

-data_file_coding* <VARCHAR2>

Coding mutations data file name - Oracle external table LOCATION

-data_file_noncoding* <VARCHAR2>

Non-coding mutations data file name - Oracle external table LOCATION

-vcf_file_coding* <VARCHAR2>

Coding region variant's VCF file name - Oracle external table LOCATION

-vcf_file_noncoding* <VARCHAR2>

Non-coding region variant's VCF file name - Oracle external table LOCATION

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES, for example, "Homo sapiens" for humans

-dna_version_label* <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-cosmic_version_label* <VARCHAR2>

"COSMIC" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

Examples

UNIX

sh COSMIC_loader.sh -db_wallet slc04lx3 -data_file_coding "CosmicCompleteExport_v67_241013.tsv" -data_file_noncoding "CosmicNCV_v67_241013.tsv" -vcf_file_coding "CosmicCodingMuts_v67_20131024.vcf" -vcf_file_noncoding "CosmicNonCodingVariants_v67_20131024.vcf" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -dna_version_label "GRCH37.P8" -cosmic_version_label "COSMIC.V67"

Windows

C:\> COSMIC_loader.bat -db_wallet slc04lx3 -data_file_coding "CosmicCompleteExport_v67_241013.tsv" -data_file_noncoding "CosmicNCV_v67_241013.tsv" -vcf_file_coding "CosmicCodingMuts_v67_20131024.vcf" -vcf_file_noncoding "CosmicNonCodingVariants_v67_20131024.vcf" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -dna_version_label "GRCH37.P8" -cosmic_version_label "COSMIC.V67"

3.10 Variant Effect Job

The variant effect job is responsible for loading the impact of the presence of a genomic variant on any feature transcript it falls on or is close to. The variant effect links a variant to a gene transcript component, such as a coding exon segment, the protein coded, and the genomic source sequence, and then calculates the following:

By default, the variant effect job is disabled. When enabled, it is set to run once every 24 hours by the DBMS Scheduler. Each time the job runs, the procedure it calls, odb_var_effect_util.process_var_effect, loads impact data for any new variants added to W_EHA_VARIANT.
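
The name of the predefined job is not listed here, so the sketch below first looks it up in the scheduler catalog and then enables it; the job name passed to DBMS_SCHEDULER.ENABLE is a placeholder to be replaced with the name the query returns.

-- Locate the scheduled job that invokes the variant effect procedure.
SELECT job_name, enabled, repeat_interval
  FROM user_scheduler_jobs
 WHERE job_action LIKE '%process_var_effect%';

-- Enable it (replace VARIANT_EFFECT_JOB with the job name returned above).
BEGIN
  DBMS_SCHEDULER.enable('VARIANT_EFFECT_JOB');
END;
/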

The procedure can also be run once, immediately, by adding a new job to the scheduler without a time specification.

Executing the following command creates a single-run job:

BEGIN
  -- Job defined entirely by the CREATE JOB procedure.
  DBMS_SCHEDULER.create_job (
    job_name        => 'VARIANT_EFFECT_JOB3',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN odb_var_effect_util.process_var_effect; END;',
    enabled         => true,
    comments        => 'Test Job.');
END;
/
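
After the job has run, its outcome can be checked from the scheduler's run history. This is a minimal sketch using the standard USER_SCHEDULER_JOB_RUN_DETAILS view and the job name created above.

-- Status of recent runs of the single-run job.
SELECT job_name, status, actual_start_date, run_duration
  FROM user_scheduler_job_run_details
 WHERE job_name = 'VARIANT_EFFECT_JOB3'
 ORDER BY actual_start_date DESC;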

3.11 Typical Errors Associated with Reference Loaders

3.11.1 Loader Runtime Error: ORA-01460 Unimplemented or Unreasonable Conversion Requested

Errors have been observed while running various ODB loaders. The loader run aborts prematurely with the following error message: ORA-01460 unimplemented or unreasonable conversion requested.

Oracle recommends applying an RDBMS patch to the TRC database that fixes this bug. For details, see Oracle Support Bug 13099577 (ORA-1460 WHEN PARALLEL QUERY SERVERS ARE USED), available at https://mosemp.us.oracle.com/epmos/faces/BugDisplay?id=13099577.