Skip Headers
Oracle® Health Sciences Omics Data Bank Programmer's Guide
Release 2.5

E35680-04
Go to Table of Contents
Contents
Go to Index
Index
Go to Feedback page
Contact Us

Go to previous page
Previous
Go to next page
Next
PDF · Mobi · ePub

3 Loaders for Reference Data

This chapter contains the following topics:

Note:

As the Reference part of the ODB model changes, the Reference Data Loaders has to adapt to these changes. Therefore, both should be updated together.

3.1 Ensembl and SwissProt Loaders

3.1.1 Installing the Loaders

Following are the prerequisites for installing the Ensembl and SwissProt reference loaders:

  1. You must have an Oracle database instance with ODB installed in a schema, for which the name and password are known.

    Note:

    Ensembl and SwissProt loaders are written in Java.
  2. Java Runtime 1.7 or higher must be installed and should be the default on the machine (can be verified using the java -version command from the command prompt).

  3. Copy the Reference Loader folder into a directory of your choice. The program should be run from the directory it is installed in.

3.1.2 Files to Load

Following is a list of files to be loaded:

  1. The Ensembl multi-gene EMBL files can be downloaded from:

    ftp://ftp.ensembl.org/pub/release-71/embl/homo_sapiens

    The link above may not reflect the most recent version of the multi-gene files available. Oracle recommends that you use the latest release available from Ensembl.

    The files are organized by chromosome. There are also some configuration and patch files. At least all chromosome files must be loaded to cover the entire Human genome. The files are gzipped, and can be loaded without un-gzipping.

    Prior to version 68, Ensembl EMBL files were organized differently, the data was divided into segments, each approximately 100,000 base pairs in length. Starting with version 68, the sequence of an entire chromosome comes in one section. This has presented memory problems for the EMBL loader, which were largely overcome in ODB 2.5, allowing loading the Human EMBL files using the standard Java Virtual Machine maximum heap allocation of 1 GB. However, this is not guaranteed for other species. So if an out-of-memory error occurs when loading an EMBL file, the load should be repeated with a higher JVM maximum heap allocation as described below and may require a computer with Linux or 64-bit Windows operating system and at least 4 GB of RAM (Oracle recommends 8GB).

  2. The SwissProt file (a single file) can be downloaded here:

    http://www.ebi.ac.uk/uniprot/database/download.html

  3. Get the (UniProtKB/SwissProt) Flat File.

  4. Both EMBL and SwissProt files can be loaded without extracting the contents, just as they are downloaded. The EMBL and SwissProt Loader can handle GZIP archives (identified by the gzip extension) as well as some ZIP archives (a ZIP archive must contain only one file to be handled correctly, and is identified by the zip extension).

3.1.3 Loading the Data

Before you begin:

  1. As the Loader runs, it logs some information in the gdm.log file. The file is always appended and keeps growing. So, you may occasionally want to delete it and start from scratch the next time you run the Loader. Some of the information logged can be very useful for investigative purposes. Oracle recommends that you check the log when you have a problem. The EMBL/SwissProt loader also logs into the W_EHA_RSLT_LOG table, like the ODB SQL loaders. However, it logs into the database only while connected to it, that is, connection failure is not logged. Also, a SUMMARY log record is not created by the EMBL/SwissProt loader 2.5.

  2. The order of loading EMBL and SwissProt files is not important. The scenario outlined below is just an example. Both DNA and Protein reference data are now versioned. Versions stored in the W_EHA_VERSION table are used. For DNA Sources and all reference data that are linked to them (genes and so on) the version must be of type 'DNA', for Protein Info records and all the reference data linked to them the version must be of type 'PROTEIN'.

    The EMBL Loader accepts a version argument which must match an existing version of type 'DNA'. If the version is not found, there is a prompt allowing the user to use the provided version label, and if the user confirms, the version is created. If no version argument is provided, the loaders enter full interactive mode, allowing the user to select an existing version from a list or create a new one. The same is true for the SwissProt loader, except here the version is of type 'PROTEIN'.

    The version labels are not case-sensitive and are stored in uppercase.

    Note:

    If the version label is not provided or the provided version does not exist, the EMBL or SwissProt loader prompts the user for input and do not continue until the user responds. Therefore, if you want to run the loader in the non-interactive mode, it is necessary to create the version of the correct type beforehand and to ensure it is passed to the loader correctly. It is also possible to prevent the loader from waiting for the user input in the case of an incorrect version being passed to it by appending >outlog <empty.txt to the command line and creating an empty text file empty.txt. In this case the loader fails if the version does not exist.

Perform the following steps to load files (using Linux or Windows shell scripts. There is native JAVA command-line interface available described in Section 3.1.4):

  1. Since the SwissProt file is a single file, it can be loaded in under an hour (Human proteins only) or in a day (all species). To load the SwissProt file, run SwissProt.bat (or SwissProt.sh on Linux). When SwissProt.bat is run you can optionally pass the Species List file. The purpose of the Species List file is to let only loading protein information for the organism you want.

    The format of the file is simple: type in the species primary (Latin) name, one species per line. A file for just the human genome is now included in the distribution — it is named Species.dat and contained in the main Loader Directory. If there is no -protFile option with the name of a species list file, ALL proteins for ALL species are loaded (which takes much longer).

    If an Oracle Wallet is set up, SwissProt.bat/.sh can use the credentials stored in the Wallet to connect to the schema, otherwise it prompts for a password. If an Oracle Wallet is set up, the user has to pass the following parameters to run SwissProt.bat/.sh:

    1. Username — when an Oracle Wallet is set up enter ""

    2. Url — instance alias for which the Wallet credential was created - if you start SqlPlus as 'sqlplus /@DB001_Wallet, then this value here must be DB001Wallet.

    3. Schema name

    4. Path for Oracle Home

    5. Path to Wallet

    6. Path and name of species list. This is an optional parameter. If you do not have this list enter "

    7. Complete path and name of the data file

    8. Reference version of type 'PROTEIN' to use (omit to enter the interactive mode):

      Following is an example of how the SwissProt.bat is to be run if Oracle Wallet is set up and Species list is not present:

      C:\>swissProt.bat "" DB001Wallet trc_gdm C:\ora11g\product\11.2.0\dbhome_2\NETWORK\ADMIN D:\wallets "" SwissProt.dat "VERSIONP1"

      Example using SwissProt.sh

      > sh SwissProt.sh "" DB001Wallet trc_gdm "/app/oracle/product/11.2.0.2.0/network/admin" "/app/wallet/" "" SwissProt.dat "VERSIONP1"

      Following is an example of how the swissProt.bat is to be run if Oracle Wallet is set up and Species list is present:

      C:\>swissProt.bat "" DB001Wallet trc_gdm C:\ora11g\product\11.2.0\dbhome_2\NETWORK\ADMIN D:\wallets Species.dat SwissProt.dat "VERSIONP1"

      Example using SwissProt.sh

      > sh SwissProt.sh "" DB001Wallet trc_gdm "/app/oracle/product/11.2.0.2.0/network/admin" "/app/wallet/" Species.dat SwissProt.dat "VERSIONP1"

      If Oracle Wallet is not set up, you have to pass the following parameters when swissProt.bat is run:

    9. Username to connect to schema

    10. Url — Full DB URL (host:port:instance, or scan-server-name:port:SID for a multiple node DB). For example, Localhost:1613:devdb1

    11. Schema name

    12. Path for Oracle Home — when Oracle Wallet is not set up enter ""

    13. Path to Wallet — when Oracle Wallet is not set up enter ""

    14. Path and name of species list. This is an optional parameter. If you do not have this list enter ""

    15. Complete path and name of the data file

    16. Optional version of type 'PROTEIN' to use (omit to enter the interactive mode)

    Following is an example of how the swissProt.bat is to be run if Oracle Wallet is not set up and Species list is not present:

    C:\>swissProt.bat trc_gdm localhost:1613:devdb1 trc_gdm "" "" "" SwissProt.dat "VERSIONP1"

    Example using SwissProt.sh

    >sh SwissProt.sh trc_gdm localhost:1613:devdb1 trc_gdm "" "" "" SwissProt.dat "VERSIONP1"

    This is an example of how the swissProt.bat is to be run if Oracle Wallet is not set up and Species list is present:

    C:\>swissProt.bat trc_gdm localhost:1613:devdb1 trc_gdm "" "" Species.dat SwissProt.dat

    Example using SwissProt.sh

    >sh SwissProt.sh trc_gdm localhost:1613:devdb1 trc_gdm "" "" Species.dat SwissProt.dat "VERSIONP1"

    The SwissProt Loader can also be run with named command-line arguments.

  2. The Ensembl EMBL file can be loaded next. Multiple EMBL files can be loaded one at a time, in any order, but without duplication (each file can only run once - otherwise, at present, some information is duplicated). Multiple EMBL files can be loaded concurrently (for example, in separate terminal windows). However, if the user wants to create a new DNA reference version using the interactive mode, one file should be loaded first (creating the new version in the process) and only then multiple files can be loaded concurrently.

    If Oracle Wallet is set up, embl.bat/.sh can use the credentials stored in the Wallet to connect to the schema else it prompts you for a password. If Oracle Wallet is set up the user have to pass the following parameters when Embl.bat is run:

    1. Username — when Oracle Wallet is set up, enter

    2. Url — instance alias for which the Wallet credential was created - if you start SqlPlus as 'sqlplus /@DB001_Wallet, then this value here must be DB001Wallet.

    3. Schema name

    4. Path for Oracle Home

    5. Path to Wallet

    6. Complete path and name of the data file

    7. Depending on the file being loaded and on the operating system, a seventh, optional, argument may need to be used: the Java Virtual Machine heap size, in MB.

      By default, embl.bat (and the corresponding UNIX shell script, embl.sh) specifies 1 GB of Java Virtual Machine (JVM) maximum heap space (using the –Xmx1024M option). This is known to be sufficient for loading all Human Ensembl files, version 68, and works on most Linux or Windows systems (Oracle recommends 8 GB of RAM or more). Hence, you do not need to provide the heap size argument.

      However, some Ensembl EMBL files for other species, version 68 or higher may require more heap space, and the seventh argument should then be provided as 2048 (that is 2048 MB). Then these files can be loaded successfully on Linux or 64-bit Windows with enough RAM. If there is an Out-of-memory error while loading a file, use a larger maximum heap size and use the same option for all subsequent Ensembl files for the same organism.

      If the specific computer or operating system cannot handle the specified JVM heap size, a Java Virtual Machine creation failed error occurs. Then you need to use another machine.

    8. The version of type 'DNA' can be passed as the 8th argument (omit to enter the interactive mode).

      Following is an example of how embl.bat is to be run if Oracle Wallet is set up:

      C:\>embl.bat "" DB001Wallet trc_gdm C:\ora11g\product\11.2.0\dbhome_2\NETWORK\ADMIN D:\wallets embl.dat 2560

      Example of embl.sh shell script for Linux:

      >sh embl.sh "" DB001Wallet trc_gdm /app/ora11g/product/11.2.0/dbhome_2/NETWORK/ADMIN /app/wallets embl.dat 2048 "GRCH37.P8"

    If Oracle Wallet is not set, you have to pass the following parameters when Embl.bat is run:

    1. Username to connect to schema

    2. Url — Full DB URL (host:port:instance). For example, Localhost:1613:devdb1

    3. Schema name

    4. Path for Oracle Home — when Oracle Wallet is not set up enter

    5. Path to Wallet — when Oracle Wallet is not set up enter

    6. Complete path and name of the data file

    7. Optional maximum heap size (in MB), or ""

    8. Optional version of type 'DNA' (omit to enter the interactive mode to select or create a version)

      This is an example of how the embl.bat is to be run if Oracle Wallet is not set up:

      C:\>embl.bat trc_gdm localhost:1613:devdb1 trc_gdm "" "" embl.dat 2048 "GRCH37.P8"

      Example of embl.sh:

      >sh.embl.sh trc_gdm localhost:1613:devdb1 trc_gdm "" "" embl.dat 2048 "GRCH37.P8"

The EMBL Loader can also be run with named command-line arguments.

3.1.4 Running the Embl/Swissprot Loader with Named Command-Line Arguments

The EMBL/SwissProt Loader is a single Java application, packaged into a JAR archive GDM.jar. It can be used without any shell or Windows script, using the following arguments. Running the application using these arguments provides some additional capabilities not supported by the shell/Windows scripts installed with it.

Java Runtime 1.7 is required to run the EMBL/SwissProt Loader. It has to be either in the PATH environment variable or the user should specify the full path to the Java executable with version 1.7.

If the Java 1.7 executable is in the PATH environment variable, the EMBL/SwissProt loader is run as follows:

java -Xmx2048m -jar gdm.jar <argument 1>…<argument N>

The -XmxNNNNm Java option is optional. In the 2.5 version of the EMBL/SwissProt Loader it is usually not necessary, because the memory usage has been optimized to handle chromosome-wide Ensembl files. However, if you ever encounter an out-of-memory error for any Enseml EMBL file, use it (provided that the machine has enough memory and the operating system is Linux or 64-bit Windows).

The arguments (except for the name of the data file to load, which must always be the last argument) are not positional, and can be used in any order. The key/option arguments begin with the "-" character. Some of them require a value as the following argument, others are stand-alone. To get the up-to-date list of all available arguments, run the following command:

java -jar gdm.jar -help

Here is more detailed information about the arguments and their usage:

-url <url> - specifies the URL of the database to comment to. <url> must be in the form: host:port:instance . This argument is mandatory.

-schema <schema name> - specifies the ODB schema name. This argument is mandatory.

-user <user> - specifies the user name to log into the database with (needed only when not using an Oracle Wallet).

-wallet <wallet> - specifies the directory where the Oracle Wallet is used

-orahome <oracle home directory> - specifies the oracle home directory, when a Wallet is used

-sprot - this key is used to load a SwissProt file. If it is absent, the data file will be loaded as an EMBL file.

-protFile <species file> - specifies the file path (optional) and name of a Species List file, used to optionally filter the contents of a SwissProt file by species (used when loading SwissProt only)

-version <version label> - specifies the version label of the DNA or Protein reference version. If this argument is present, it must match an existing version of appropriate type (DNA for EMBL, PROTEIN for SwissProt). If it is omitted, the Loader will start in the interactive mode, prompting the user to select an existing version, or create a new one.

-verbose - if this argument is present, additional information will be printed on the screen and logged.

-updateDB - this argument is necessary, if the user wants to actually load the contents of the file into the database. If it is omitted, the contents of the data file are parsed, but nothing is inserted into the database tables (except the version, if the Loader is started in the interactive mode and the user chooses to create a new version). Omitting this option is useful for verifying that the file parses without errors.

Examples:

Loading an EMBL file without a wallet, Windows:

java-jar GDM.jar -url localhost:1521:b41804x1 -schema odb -user odb -version V1 -updateDB EMBLFile.dat

Verifying (without loading) the same file:

java-jar GDM.jar -url localhost:1521:b41804x1 -schema odb -user odb -version V1 EMBLFile.dat

Loading a SwissProt file with a wallet, Windows:

java-jar GDM.jar -url localhost:1521:b41804x1 -schema odb -orahome C:\ora11g\product\11.2.0\dbhome\NETWORK\ADMIN -wallet C:\Wallets -version P1 -updateDB -sprot -protFile species.dat SPFile.dat

Loading an EBML file with a Wallet, Linux, interactive mode for version:

java-jar GDM.jar -url localhost:1521:b41804x1 -schema odb -orahome /home/apps/ora11g/product/11.2.0/dbhome/NETWORK/ADMIN -wallet /home/Wallets -updateDB EMBLFile.dat

3.1.5 Gathering Optimizer Statistics

Oracle recommends gathering table and index statistics after completing Ensembl data load. Oracle statistics is a collection of data about database objects such as tables and indexes. It is required by Oracle optimizer to estimate the most efficient query execution plan. Missing or stale statistics can profoundly deteriorate query performance.

To collect statistics connect to a database as ODB_SCHEMA owner using sqlplus and execute the command:

exec dbms_stats.gather_schema_stats ('ODB_', cascade=>true,estimate_percent=>dbms_stats.auto_sample_size);

3.2 HUGO Loader

3.2.1 Description and Files to Load

The Hugo Loader is responsible for populating curated gene nomenclature records, taken from an online resource maintained by HUGO Gene Nomenclature Committee (HGNC), into ODB's reference database. The input data for the loader comprises the complete HGNC dataset which can be retrieved from their Statistics and Downloads Webpage here: http://www.genenames.org/cgi-bin/hgnc_stats. There are multiple datasets at this web page. The user has to specifically download complete HGNC dataset by clicking the hyperlink in the sentence Click here for the complete HGNC dataset provided in the above webpage. The data downloaded is a large file with tabular text in tab-separated values given with column headers.

Important: The format of the complete HGNC files has changed as of May 2013. The ODB HUGO Loader 2.5 only supports the new file format, while the previous versions of the loader (ODB 2.0.2.1 and prior) only support the old format.

A batch file for Windows and an alternative shell script for Linux-bash, have been provided for loading the data. The content below shows the step-by step process of this load procedure.

3.2.2 Running the Loader

The loader is found bundled in the latest ODB build in the compressed file Hugo_Loader.zip. This folder consists of 8 files:

  • hugo_loader.bat

  • hugo_loader.sh

  • hugo_script.sql

  • several common sh, bat, and sql scripts for reference version checking

To run the loader, perform the following:

  1. Copy the above files into a folder on your system along with the downloaded input file from HUGO.

  2. Open a command prompt terminal and change the directory to where the hugo_loader.bat and/or hugo_loader.sh file resides.

  3. If working on Linux, make sure the scripts are executable (you may need to run 'chmod u+x *.sh')

  4. The hugo_loader.sh/.bat scripts can use the credentials stored in an Oracle Wallet to connect to the schema that has the ODB. The following parameters should be passed when the hugo_loader.bat is run:

    1. The Hugo data file.

    2. Oracle Directory Object

    3. Wallet name

    4. Operation without an Oracle Wallet (with user name and database connection arguments) is only supported on Linux.

  5. Execute hugo_loader.sh/.bat with appropriate arguments - to load the data

3.2.3 Command-Line Argument List

Synopsis

hugo_loader.sh -help

hugo_loader.sh <...options>

Description

Description:

Validates input options and calls the loader script hugo_script.sql#hugo_loader.load_hugo

Options

(*) required

-db_wallet* <VARCHAR2> (required, unless the -db_conn/-db_user combination is used to log into the database)

Oracle wallet name, see "Setting up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

SID, or the Oracle connection string that is, "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION, refer to the programmer's guide on how to retrieve this file from HGNC's web portal.

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see "Setting up a Directory Object"

-reference_version <VARCHAR2>

The version of the Hugo file version being loaded, A "HUGO" reference version label is defined in W_EHA_VERSION.VERSION_LABEL. If the version label is not present in the W_EHA_VERSION table, the loader interactively ask the user if he wishes to continue with the version label provided. If yes, the loader inserts the new record in the version table with the given Version_label and proceeds with the load.

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

Examples

UNIX

$ sh hugo_loader.sh -db_wallet odb_user -data_file "genefam_list.pl" -data_directory "ODB_LOAD" -reference_version "DLD_DT_01062013" -read_size null

Windows

C:\> hugo_loader.bat -db_wallet odb_user -data_file "genefam_list.pl" -data_directory "ODB_LOAD" -reference_version "DLD_DT_01062013" -read_size null

Once the loading is complete, log into SQL developer, or SQP*Plus with ODB Schema and verify that 35000 or more records are populated in W_EHA_HUGO_INFO table. The information about the execution is logged in the W_EHA_RSLT_LOG table. If run with the -print_summary 1 option, the loader will also print on the console information about the execution, including the count of inserted records, and the errors, if there are any.

3.3 GVF Ensembl Loader

3.3.1 Description and Files to Load

The GVF Ensembl loader is responsible for the input of known variants of any given species for which DNA source records are present. The loader loads only those variant records from the input file, for which matching DNA source records exist in the DB. (that is, those variants that fall into the absolute position ranges of a DNA source record with the same chromosome and species ID). Hence ensure the EMBL loader is run first with the relevant species' EMBL input files.

Since GVF files do not contain information about the species, it is necessary to pass a species_ID value as parameter to run the loader. This requires W_EHA_Species table to have the relevant species record with a primary key ID which is then passed as said parameter.

Any GVF file can be loaded multiple times. For Homo sapiens, the input file can be downloaded from the ensemble FTP link:

ftp://ftp.ensembl.org/pub/release-65/variation/gvf/homo_sapiens/

The link above may not necessarily reflect the most recent version of GVF file available. Oracle recommends that you use the latest GVF file.

3.3.2 Running the Loader

For the GVF loaders, the.bat and.sh files require the same set of named command line arguments, except that the BAT script only supports using a Oracle Wallet connection (like all other loaders).

3.3.3 Command-Line Argument List

Name

GVF_loader.sh - load records

Synopsis

GVF_loader.sh -help

GVF_loader.sh <...options>

Description

Validates input options and calls the loader script load_gvf.sql#odb_ref_gvf_util.process_gvf

Options

(*) required

-db_wallet* <VARCHAR2> (required, unless the -db_conn/-db_user combination is used to log into the database)

Oracle wallet name, see "Setting up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle SID, or the Oracle connection string that is, "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see "Setting up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES that is, for humans "Homo sapiens"

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

Examples

UNIX

$ sh GVF_loader.sh -db_wallet odb_user -data_file "som_variants.gvf" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -reference_version "GRCh37.p8" -preprocess_dir null -preprocess_file null -read_size null

Windows

C:\> GVF_loader.bat -db_wallet odb_user -data_file "som_variants.gvf" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -reference_version "GRCh37.p8" -preprocess_dir null -preprocess_file null -read_size null

3.3.4 Gathering Optimizer Statistics

Oracle recommends gathering table and index statistics after running the GVF loader. Oracle statistics is a collection of data about database objects such as tables and indexes.It is required by Oracle optimizer to estimate the most efficient query execution plan. Missing or stale statistics can profoundly deteriorate query performance.

To collect statistics connect to a database as ODB_SCHEMA owner using sqlplus and execute command:

exec dbms_stats.gather_schema_stats ('ODB_', cascade=>true,estimate_percent=>dbms_stats.auto_sample_size);

3.4 Pathway Loader

3.4.1 Description and Files to Load

Pathway_loader is a script for extracting, transforming, and loading GSEA standard file formats.

The data used can be downloaded from

http://www.pathwaycommons.org/pc-snapshot/current-release/gsea/by_species/homo-sapiens-9606-gene-symbol.gmt.zip.

The first column and the second column in this file are normal tab delimited but the third column in the file is a string containing delimited values.

The pathway_loader utility is compatible with Oracle RDBMS 10.2 and above. It is not operating system dependent and works entirely within Oracle database. This section describes the setup procedure and also shows how to use the utility to load data from the GSEA file located on your system.

3.4.2 Running the Loader

The loader is made up of three files:

  • pathway_loader.bat

  • pathway_loader.sh

  • pathway_script.sql

The execution call of the stored procedure load_pathway()is designed in one of the script files (pathway_script.sql). This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, SPECIES_NAME, PATHWAY REFERENCE VERSION and READ_SIZE as input parameters.

This stored procedure creates an external table dynamically and uploads data from the source file into it. External tables let Oracle to query data that is stored outside the database in flat files. The ORACLE_LOADER driver access data stored in any format that can be loaded by the SQL*Loader.

The stored procedure dynamically creates PATH_DATA_!!SEQ!! as an external table. This external table stores the complete pathway data. This table maps all the fields existing in the pathway file.

Note:

In the above external table, the "!!SEQ!!" string is replaced by ETL_PROC_ID at the run time.

There are two bulk insert statements written dynamically. One sql inserts the record into the W_EHA_PATHWAY table and the other inserts the record into the W_EHA_PATHWAY_PROTEIN table.

Now multiple versions of pathway data can be loaded into the table. The VERSION_WID column saves the version id of the particular version loaded, which is retrieved from the W_EHA_VERSION table.

The pathway_loader.bat file requires the logon credentials to be stored in Oracle Wallet while the pathway_loader.sh script can be run with or without Oracle Wallet being set up.

The load_pathway procedure has been altered to include an error logging associated with Pathway Reference, species_name, sql_err and Species lookup failure. These errors are logged into the W_EHA_RSLT_LOG table.

3.4.3 Command-Line Argument List

Name

pathway_loader.sh - load records

Synopsis

pathway_loader.sh -help

pathway_loader.sh <...options>

Description

Validates input options and calls the loader script pathway_script.sql#pathway_loader.load_pathway

Options

(*) required

-db_wallet* <VARCHAR2>

Oracle wallet name, see "Setting up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

SID or Oracle connection string that is, "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection. -check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see "Setting up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES that is, For humans "Homo sapiens"

-reference_version <VARCHAR2>

"PATHWAY" reference version label defined in W_EHA_VERSION.VERSION_LABEL. If the version label is not present in the W_EHA_VERSION table, the loader interactively ask the user if he wishes to continue with the version label provided. If yes, the loader inserts the new record in the version table with the given Version_label and proceeds with the load.

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

Examples

UNIX

$ sh pathway_loader.sh -db_wallet odb_user -data_file "homo-sapiens-9606-gene-symbol.gmt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -reference_version "feb-2011" -read_size null

Windows

C:\> pathway_loader.bat -db_wallet odb_user -data_file "homo-sapiens-9606-gene-symbol.gmt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -reference_version "feb-2011" -read_size null

3.5 Prediction Score (PolyPhen, SIFT) Loader

3.5.1 Description and Files to Load

In Ensembl, human mutations affecting the amino acid substitutions are further analyzed for the effect of this substitution on protein function. This is done using SIFT and PolyPhen predictive algorithms. The source files from running either SIFT or PolyPhen contain prediction and score which is stored in the target tables. The model supports multiple version of SIFT and PolyPhen data. The versions are recorded in the W_EHA_VERSION and W_EHA_FILE_TYPE tables.

Steps to Download Data from ENSEMBL BioMart

The source data is downloaded from the ENSEMBL BioMart tool. You have to download SIFT and POLYPHEN data separately and load them separately.

The data can be downloaded from one of the following links:

http://uswest.ensembl.org/biomart/martview/

http://asia.ensembl.org/biomart/martview/

  • From Dataset: Select following options

    • Select Ensembl Variation <ver>.

    • Select either Homo sapiens Somatic Variation (COSMIC <ver>) or Homo sapiens Variation (dbSNP <ver>;ENSEMBL).

  • From Filters: If you want to download data for a specific region of the chromosome then use this option. Otherwise you can retain the default filter options.

  • From Attributes: Select following options in the specific order defined below

    • From SEQUENCE VARIATION:

      • Variation Name

      • Chromosome Name

      • Position on Chromosome (bp)

      • Strand

      • Variant Allele

    • From GENE ASSOCIATED INFORMATION:

      • Consequence specific allele

      • Ensemble Transcript ID

      • Polyphen prediction or SIFT prediction

      • Polyphen score or SIFT score.

  • Then click Result at the top of the screen.

  • FromExport all results to select following

    • Select File

    • Select TSV

    • Select Unique results only.

  • Click Go to download the file.

User should ensure that he selects the attributes in the specified order. Also, user should ensure ensure that he selects PolyPhenprediction and PolyPhen score when downloading PolyPhen data, and SIFT prediction and SIFT score when downloading SIFT data.

The execution call of the stored procedure odb_result_util. process_variant_prediction ()is designed in one of the script files (load_prediction_score.sql). This stored procedure accepts FILE NAME, ORACLE DIRECTORY OBJECT, FILE TYPE (being either SIFT or Polyphen), FILE VERSION and DNA REFERENCE VERSION as an input parameter.

This stored procedure creates an external table dynamically and uploads data from the source file into it. External tables let Oracle query data that is stored outside the database in flat files. The ORACLE_LOADER driver access data stored in any format that can be loaded by the SQL*Loader. No DML can be performed on external tables but they can be used for query, join, and sort operations.

The stored procedure dynamically creates PREDICTION_DATA_!!SEQ!! as an external table. This external table stores the complete result data. This table maps all the fields existing in the result file.

Note:

In the above external table, the "!!SEQ!!" string is replaced by ETL_PROC_ID at run time.

There are two multi-table insert statements written dynamically. One inserts record into the w_eha_variant_prediction table and other inserts record into the w_eha_rslt_log table.

A select statement which parses the data from the external table uses an inline query which gets the dataset of variant and transcript records. An inline query uses a partition (analytical function) function to avoid the duplicate records for different variation name which has the same variant record (VARIANT_WID). The dataset of this inline query will then be the outer join with the W_EHA_VARIANT_PREDICTION table to lookup for VARIANT_WID and STRUCTURE_WID and will insert a record into either the W_EHA_VARIANT_PREDICTION table or the W_EHA_RSLT_LOG table.

If you upload the same file (file name same as previously loaded) with different version, the loader considers this file as a new file and uploads the record into the target (W_EHA_VARIANT_PREDICTION) table with a different file version.

3.5.2 Running the Loader

Record is inserted in the W_EHA_VARIANT_PREDICTION table for a variant belonging to a specific transcript; hence lookup of reference_id and transcript_id is performed before inserting a record in this table.

Before inserting a record in this table, the loader also checks if a record already exists in the W_EHA_VARIANT_PREDICTION table for a specific reference_id and transcript_id and for the version of SIFT or polyphen data.

Following operation is performed based on the above condition:

  • If reference_id (variant_wid) and transcript_id (structure_wid) exist in the target table but SIFT or polyphen version is different, a new record is inserted in this table.

  • If reference_id (variant_wid) and transcript_id (structure_wid) exist in the target table and SIFT or polyphen version are also the same, the score is compared.

  • If the score is same, the variant record is not updated.

  • If the score is not same, the existing record is not updated but the reference_id along with transcript_id is reported to the W_EHA_RSLT_LOG table stating that the score was different.

  • If reference_id (variant_wid) and transcript_id (structure_wid) do not exist in the table, a new record is inserted.

Note:

The batch file requires Oracle Wallet to be set up to run correctly. However, the shell script can be run with or without Oracle Wallet being set up.

SIFT/Polyphen reference version

The Program Version refers to the SIFT or polyphen program version used to generate the data. You can check the version from: http://useast.ensembl.org/info/docs/variation/vep/vep_script.html

Table 3-1 Mapping of Polyphen or SIFT File

Column Name in Result File Table and Column Name in ODB Description

Variation Name

REFERENCE_ID in W_EHA_VARIANT_XREF table

This is a FK to W_EHA_VARIANT table. This is extracted by the loader using a lookup of Variation Name from the source file against the REFERENCE_ID in the W_EHA_VARIANT_XREF table.

Chromosome Name

W_EHA_CHROMOSOME .CHROMOSOME

Chromosome Name.

Position on Chromosome (bp)

W_EHA_RSLT_SEQUENCING.START_POSITION

Stores the start position on chromosome.

Strand

N/A

N/A

Variant Alleles

N/A

N/A

Ensembl Transcript ID

W_EHA_GENE_STRUCTURE. TRANSCRIPT_ID

This is a FK to W_EHA_GENE_STRUCTURE table. This is extracted by the loader using a lookup of Ensembl Transcript ID from the source file against the 'TRANSCRIPT_ID' in W_EHA_GENE_STRUCTURE table.

PolyPhen /SIFT prediction

W_EHA_PREDICTION_CODE .CODE

This is a FK to W_EHA_CODE table, which stores all possible predictions for SIFT and PolyPhen data. The FK corresponding to the prediction of this record is available in the source file.

PolyPhen or SIFT score

W_EHA_VARIANT_PREDICTION. PREDICTION_SCORE

Stores the prediction score value from the source file for a particular variant


3.5.3 Command-Line Argument List

Name

prediction_score_loader.sh - load records

Synopsis

prediction_score_loader.sh -help

prediction_score_loader.sh <...options>

Description

Validates input options and calls the loader script load_prediction_score.sql#odb_ref_prediction_util.process_variant_prediction

Options

(*) required

-db_wallet* <VARCHAR2>

Oracle wallet name, see "Setting up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string that is, "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see "Setting up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES that is, For humans "Homo sapiens"

-prediction_version_type* <VARCHAR2>

Prediction reference version type (SIFT|POLYPHEN)

-prediction_version_label* <VARCHAR2>

"SIFT"|"Polyphen" reference version label defined in W_EHA_VERSION.VERSION_LABEL. If the version label is not present in the W_EHA_VERSION table, the loader interactively ask the user if he wishes to continue with the version label provided. If yes, the loader inserts the new record in the version table with the given Version_label and proceeds with the load.

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

Examples

UNIX

$ sh prediction_score_loader.sh -db_wallet odb_user -data_file "ut_variant_pred1.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -prediction_version_type "Polyphen" -prediction_version_label "5.0" -reference_version "GRCh37.p8" -preprocess_dir null -preprocess_file null -read_size null

Windows

C:\> prediction_score_loader.bat -db_wallet odb_user -data_file "ut_variant_pred1.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -prediction_version_type "Polyphen" -prediction_version_label "5.0" -reference_version "GRCh37.p8" -preprocess_dir null -preprocess_file null -read_size null

Typical Errors Associated with prediction_score Loader

"Record not inserted" is logged if same version, same file type loaded but input record being inserted has score different than existing record.

Other possible errors are similar to other loaders: 'Generating etl process id', 'Generating enterprise id', 'Verifying species name', 'Verifying reference version', 'Verifying prediction version', 'Inserting file type', 'Verifying file type', 'Processing result records'

3.6 Probe Loader

3.6.1 Description and Files to Load

The probe loader populates the W_EHA_PROBE table. You can use probe_loader.bat to run in Windows or probe_loader.sh to run in Linux. Probe loader is a reference loader rather than result but it varies with vendors, for example, Affymetrix, Illumina.

Following are the assumptions for the data file for the Probe Loader:

  • The file is tab separated.

  • The first row is always the header.

Mappings for Probe Loader

Table Mappings for Probe Loader

Data File W_EHA_PROBE table
PROBESET W_EHA_PROBE.PROBE_NAME
ACC W_EHA_PROBE.ACCESSION
DESCP W_EHA_PROBE.PROBE_DESC
GENEID W_EHA_PROBE.PRIMARY_HUGO_NAME

3.6.2 Running the Loader

The execution call of the stored procedure PROBE_LOADER() is in one of the script files (probe_script.sql). This stored procedure accepts FILE NAME, ORACLE DIRECTYORY OBJECT,SPECIES NAME, DNA VERSION LABEL, PROBE VERSION LABEL and READ_SIZE as input parameters.

This stored procedure creates an external table dynamically and uploads data from the source file into it. External tables let Oracle to query data that is stored outside the database in flat files. The ORACLE_LOADER driver access data stored in any format that can be loaded by the SQL*Loader.

The stored procedure dynamically creates PROBE_DATA_!!SEQ!! as an external table. This external table stores the complete probe data and maps all the fields existing in the probe file.

There is a merge statement that dynamically either inserts or updates the existing probe record in w_eha_probe table. During updation of a record, it updates all the columns including the row_wid with new row_wid for a particular probe name.

The old records can be referenced using the Flashback Data Archive (FDA) approach.

The FDA approach is used for securely tracking the contextual history of all data. FDA makes it possible to automatically and transparently track all of the changes to the tables in the database, and to easily query data in those tables as of any point in time or over any interval within the specified retention period, with minimal performance impact.

The bat file requires Oracle Wallet to be set up before it can run successfully.

For Shell scripts, if Oracle Wallet is set up, the shell script uses those credentials to run the Sqlplus. If Oracle Wallet is not set up, the script prompts for a password and connects to Sqlplus.

3.6.3 Command-Line Argument List

Name

probe_loader.sh - load records

Synopsis

probe_loader.sh -help

probe_loader.sh <...options>

Description

Validates input options and calls the loader script probe_script.sql#probe_loader.load_probe

Options

(*) required

-db_wallet* <VARCHAR2> (required, unless the -db_conn/-db_user combination is used to log into the database)

Oracle wallet name, see "Setting up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string, that is, "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see "Setting up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES that is, For humans "Homo sapiens"

-dna_version_label* <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-probe_version_label* <VARCHAR2>

"PROBE" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

Examples

UNIX

$ sh probe_loader.sh -db_wallet odb_user -data_file "dummy_probeset_annotation_summary.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -dna_version_label "GRCh37.p8" -probe_version_label "PROBE_VER_1" -read_size null

Windows

C:\> probe_loader.bat -db_wallet odb_user -data_file "dummy_probeset_annotation_summary.txt" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -dna_version_label "GRCh37.p8" -probe_version_label "PROBE_VER_1" -read_size null

3.7 ADF Data Loader

3.7.1 Description and Files to Load

An Array Description Format (ADF) file, or an array design file, is a microarray platform-specific, tab-delimited file that describes the design of an array. For a particular array platform, the file lists out the Features (spots) found on an array, along with its location on the array and its associated annotation information including the Reporters (Oligo Probes) found at that feature and the Composite Elements (Genomic Features such as Genes) represented by it.

The ADF Data Loader loads the ADF file for AgilentG402A_07_1 (Agilent 244K Custom Gene Expression G4502A-07-1) platform, which contains annotation data for all Gene Composite elements found in AgilentG402A_07 Level-3 data files present in TCGA and are loaded using the Dual Channel Loader into ODB.

TCGA provides all available ADF files on its Platform Design page:

https://tcga-data.nci.nih.gov/tcga/tcgaPlatformDesign.jsp

The TCGA ADF file for AgilentG4502A_07_1 can be retrieved from the following link in the above wedpage: http://tcga-data.nci.nih.gov/docs/integration/adfs/tcga/AgilentG4502A_07_01.tcga.adf.zip

3.7.2 Running the Loader

The execution call of the stored procedure odb_result_util.process_adf() is designed in one of the script files (load_adf.sql). This stored procedure accepts FILE NAME, ORACLE DIRECTYORY OBJECT, SPECIES, USER LABEL, Reference Version, File Flag, Preprocess directory, Preprocess File, Data File Path, DBFS Store, Alternate file location (ftp location/http location) ,Read Size as an input parameter.

This stored procedure creates an external table dynamically and uploads data from the source file into it. External tables let Oracle to query data that is stored outside the database in flat files. The ORACLE_LOADER driver access data stored in any format that can be loaded by the SQL*Loader. No DML can be performed on external tables but they can be used for query, join, and sort operations.

The stored procedure dynamically creates ADF_DATA_!!SEQ!! as an external table. This external table stores the complete result data. This table will map all the fields existing in the result file.

The procedure first loads a row into the w_eha_adf table.A simple merge command is written which do lookup against user label, file wid, version wid, etl proc wid and enterprise wid dataset. All the values are passed as an input parameters. If the dataset matches with any of the w_eha_adf record then the loader will just update all the columns (except w_update_dt) of w_eha_adf table for corresponding user_label of w_eha_adf table.

The three multi-table insert statements written dynamically inserts record into the w_eha_adf_composite table, the w_eha_adf_reporter, and the w_eha_adf_reporter_coord tables.

The deleted records can be referenced using the Flashback Data Archive (FDA) approach.

The FDA approach is used for securely tracking the contextual history of all data. FDA makes it possible to automatically and transparently track all of the changes to the tables in the database, and to easily query data in those tables as of any point in time or over any interval within the specified retention period, with minimal performance impact

Note:

The batch file requires Oracle Wallet to be set up to run correctly. The shell script can be run with or without Oracle Wallet being set up.

3.7.3 Command-Line Argument List

Name

ADF_loader.sh - load records

Synopsis

ADF_loader.sh -help

ADF_loader.sh <...options>

Description

Validates input options and calls the loader script load_adf.sql, which calls odb_ref_adf_util.process_adf

Options

(*) required

-db_wallet* <VARCHAR2> (required, unless the -db_conn/-db_user combination is used to log into the database)

Oracle wallet name, see "Setting up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string, that is, "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB username for the database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_file* <VARCHAR2>

Data file name - Oracle external table LOCATION

-data_directory* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see "Setting up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES that is, For humans "Homo sapiens"

-user_label* <VARCHAR2>

User label used to identify a composite record's source ADF dataset that is, AgilentG4502A_07_1

-reference_version <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-file_flg* <CHAR>

File flag (E=external|S=copy to secure data file directory) [default: E]

-preprocess_dir <VARCHAR2>

Preprocess directory - Oracle external table PREPROCESSOR

-preprocess_file <VARCHAR2>

Preprocess file - Oracle external table PREPROCESSOR

-data_file_path <VARCHAR2> (required, if -file_flg is "S")

File system path to secure data file directory

-dbfs_store <VARCHAR2> (required, if -file_flg is "S")

Database file system store

-alt_file_loc <VARCHAR2>

Alternate file location link that is, ftp:location, http:location

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

-data_file_dir <VARCHAR2>

File system path to Oracle directory object

Examples

UNIX

$ sh ADF_loader.sh -db_wallet odb_user -data_file "adf_summary.adf" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -user_label "AgilentG4502A_07_01" -reference_version "GRCh37.p8" -file_flg "E" -preprocess_dir null -preprocess_file null -data_file_path null -dbfs_store null -alt_file_loc null -read_size null

Windows

C:\> ADF_loader.bat -db_wallet odb_user -data_file "adf_summary.adf" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -user_label "AgilentG4502A_07_01" -reference_version "GRCh37.p8" -file_flg "E" -preprocess_dir null -preprocess_file null -data_file_path null -dbfs_store null -alt_file_loc null -read_size null

3.8 HGMD (BioBase) Loader

3.8.1 Description and Files to Load

The HGMD loader is used to load mutations and associated disease, drug and other annotations from BIOBASE GFF formatted source files to ODB reference tables. The loader currently loads 3 data source file types: 'hgmd_hg*.gff', 'hgmd_disease_hg*.gff', and 'drug_hg*.gff'.

Please note that BioBase provides scheduled updates as a full set of curated data assembled against two most recent reference genome builds. The BioBase release version of these data sets is not available in data files and has to be provided as a command-line argument (see -hgmd_version_label below). Currently the HGMD datasets are built against Human Genome 18 and Human Genome 19 in UCSC notation (or Build 36 and Build 37 in NCBI notation).

One can load both HG18 and HG19 curated data, but they have to be linked to correct reference genome versions that exist in ODB. Loading a new HGMD release version will overwrite all the linkage data previously loaded for the same reference genome version. The overwritten records can be referenced using the Flashback Data Archive (FDA) approach.

The file processing starts with the hgmd_hg*.gff file representing the inherited mutations track. The first step is to identify all variants listed in the source file and add the novel ones to the W_EHA_VARIANT table. A reference to the HGMD accession for all variants from the source file is inserted into the W_EHA_VARIANT_XREF table. The next step is to add new disease names to the W_EHA_DISEASE table. Finally all curated associations are loaded into W_EHA_DISEASE_G_VARIANT table and literature links are added to the W_EHA_DISEASE_G_VAR_XREF table.

The second source file hgmd_disease_hg*.gff that represents gene/disease linkage is loaded into w_eha_disease_gene and w_eha_disease_gene_xref tables. The latter table stores literature links.

Lastly the drug_hg*.gff file is processed to populate W_EHA_DRUG , W_EHA_DRUG _XREF, W_EHA_DRUG_TARGET and W_EHA_DRUG_TARGET_XREF tables. The first two tables store drug records and drug references respectively. Currently drugs a referenced by the DruBank ids (www.drugbank.ca). Two other tables store gene/drug linkage and references to supporting research findings.

3.8.2 Running the Loader

The loader is implemented as a PL/SQL stored procedure that could be invoked from a provided shell or batch file, The procedure accepts Oracle Directory Object, file suffix, species name, reference genome version, BioBase release version and read size as input parameters.

The file suffix - the hg* portion of the input file name - reflects the version of the reference genome assembly (HG18 or HG19) that the source reference files are prepared with. The loader parameter called "reference genome version" specifies which reference genome release loaded into ODB schema should the used to link HGMD annotations. Therefore the file suffix and the target reference genome version should be within the same major release.

For example HG19 files can be linked with GRCH37.P8 or GRCH37.P9 reference genome but not Build 36.

The loader does not require specific filenames to process. Given the file suffix it loads all currently supported files.

Note:

The batch file requires Oracle Wallet to be set up to run correctly. The shell script can be run with or without Oracle Wallet being set up.

3.8.3 Command-Line Argument List

Name

HGMD_loader.sh - load records

Synopsis

HGMD_loader.sh -help

HGMD_loader.sh <...options>

Description

Validates input options and calls the loader script load_hgmd.sql#odb_ref_hgmd_util.process_hgmd

Options

(*) required

-db_wallet* <VARCHAR2>

Oracle wallet name, see "Setting up an Oracle Wallet"

-db_conn* <VARCHAR2> (required if -db_wallet is not provided)

Oracle connection string that is, "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=XE)))"

-db_user* <VARCHAR2> (required if -db_conn is provided)

ODB user name for the Database connection.

-check_version <NUMBER>

Run check version (1=yes|0=no) [default: 0]

-check_version_non_i <NUMBER>

Run check version in non-interactive mode (1=yes|0=no) [default: 1]

-log_level <VARCHAR2>

Set log level TRACE, DEBUG, INFO, WARNING, ERROR [default: INFO]

-print_summary <NUMBER>

Print summary (1=yes|0=no) [default: 0]

-data_directory* <VARCHAR2>

Data file suffix name - Oracle external table LOCATION

-data_file_suffix* <VARCHAR2>

Oracle directory object - Oracle external table DIRECTORY, see "Setting up a Directory Object"

-species_name* <VARCHAR2>

Species name defined in W_EHA_SPECIES that is, For humans "Homo sapiens"

-dna_version_label* <VARCHAR2>

"DNA" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-hgmd_version_label* <VARCHAR2>

"HGMD" reference version label defined in W_EHA_VERSION.VERSION_LABEL

-read_size <NUMBER>

Read size in bytes - Oracle external table READSIZE

Examples

UNIX

$ sh HGMD_loader.sh -db_wallet odb_user -data_file_suffix "hg19_12" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -dna_version_label "GRCh37.p8" -hgmd_version_label "2012.4" -read_size null

Windows

C:\> HGMD_loader.bat -db_wallet odb_user -data_file_suffix "hg19_12" -data_directory "ODB_LOAD" -species_name "Homo sapiens" -dna_version_label "GRCh37.p8" -hgmd_version_label "2012.4" -read_size null