3 Loaders for Reference Data

This chapter contains the following topics:

Ensembl and SwissProt Loaders (Java)
HUGO Loader (PLSQL)
GVF Ensembl Loader (PLSQL)
Pathway Loader
Template of Command Line Arguments for Reference Loaders

Ensembl and SwissProt Loaders (Java)

Installing the Loader

Following are the prerequisites to installing the loader for reference data:

You must have an Oracle database instance with ODB installed in a schema, for which the name and password are known.

Note:
As the Reference part of the ODB model changes, the Reference Data Loader has to adapt to these changes. Therefore, both should be updated together.
Java Runtime 1.6 or higher must be installed and should be the default on the machine (can be verified using the "java -version" command from the command prompt).

Perform the following steps to install the loader:

Copy the Reference Loader folder into a directory of your choice. The program should be run from the directory it is installed in.

Files to Load

Following is a list of files to be loaded:

The Ensembl multi-gene EMBL files can be downloaded from:

ftp://ftp.ensembl.org/pub/release-66/embl/homo_sapiens

The link above may not necessarily reflect the most recent version of multi-gene file available. Oracle recommends that you use the latest release available from Ensembl.

The whole Human genome (including various patches) is contained in four zipped .DAT files, and each one has to be loaded.

Each file has many contigs in it, each with a sequence of approximately 100,000 base pairs, multiple genes and other features pertaining to this sequence.
The SwissProt file (a single file) can be downloaded here:

http://www.ebi.ac.uk/uniprot/database/download.html
Get the (UniProtKB/SwissProt) Flat File.

Loading the Data

Before you begin:

As the Loader runs, it will log some information in the GDM.LOG file. The file is always appended, so it will be growing. So, occassionally you may want to delete it and start from scratch the next time you run the Loader. Some of the information logged can be very useful for investigative purposes. Oracle recommends that you check the log when you have a problem.
The order of loading EMBL and SwissProt files is unimportant. The scenario outlined below is just an example. However, the GVF files (which are now loaded using a separate loader application) can only be loaded after all the Ensembl EMBL files have been loaded.

Perform the following steps to load files:

Since the SwissProt file is a single file, it can be loaded in under an hour (Human proteins only) or in a day (all species).To load the SwissProt file, run SwissProt.bat. When swissProt.bat is run you can optionally pass the Species List file. The purpose of the Species List file is to allow only loading protein information for the organism(s) you want. The format of the file is simple: type in the species primary (Latin) name, one species per line. A file for just the human genome is now included in the distribution - it is named Species.dat and contained in the main Loader Directory. If there is no -protFile option with the name of a species list file, ALL proteins for ALL species will be loaded (which will take much longer).

If Oracle Wallet is set up swissProt.bat will use the credentials stored in the Wallet to connect to the schema else it will prompt for a password. If Oracle Wallet is set up the user will have to pass the following parameters to run swissProt.bat:
1. Username — when oracle Wallet is set up enter ""
2. Url — instance alias for which the Wallet credential was created - if you start SqlPlus as 'sqlplus /@DB001_Wallet, then this value here must be DB001Wallet.
3. Schema name
4. Path for Oracle Home
5. Path to Wallet
6. Path and name of species list. This is an optional parameter. If you do not have this list enter "
7. Complete path and name of the data file
Following is an example of how the swissProt.bat is to be run if Oracle Wallet is set up and Species list is not present:

C:\>swissProt.bat "" DB001Wallet trc_gdm C:\ora11g\product\11.2.0\dbhome_2\NETWORK\ADMIN D:\wallets "" SwissProt.dat

Following is an example of how the swissProt.bat is to be run if Oracle Wallet is set up and Species list is present:

C:\>swissProt.bat "" DB001Wallet trc_gdm C:\ora11g\product\11.2.0\dbhome_2\NETWORK\ADMIN D:\wallets Species.dat SwissProt.dat

If Oracle Wallet is not set up, you will have to pass the following parameters when swissProt.bat is run:
1. Username to connect to schema
2. Url — Full DB URL (host:port:instance). For example, Localhost:1613:devdb1
3. Schema name
4. Path for Oracle Home — when Oracle Wallet is not set up enter ""
5. Path to Wallet — when Oracle Wallet is not set up enter ""
6. Path and name of species list. This is an optional parameter. If you do not have this list enter ""
7. Complete path and name of the data file
Following is an example of how the swissProt.bat is to be run if Oracle Wallet is not set up and Species list is not present:

C:\>swissProt.bat trc_gdm localhost:1613:devdb1 trc_gdm "" "" "" SwissProt.dat

This is an example of how the swissProt.bat is to be run if Oracle Wallet is not set up and Species list is present:

C:\>swissProt.bat trc_gdm localhost:1613:devdb1 trc_gdm "" "" Species.dat SwissProt.dat
The Ensembl EMBL file(s) can be loaded next. Multiple EMBL files must be loaded one at a time, in any order, but without duplication (each file can only run once - otherwise, at present, some information is duplicated).

If Oracle Wallet is set up, Embl.bat will use the credentials stored in the Wallet to connect to the schema else it will prompt you for a password. If Oracle Wallet is set up the user will have to pass the following parameters when Embl.bat is run:
1. Username — when Oracle Wallet is set up, enter ""
2. Url — instance alias for which the Wallet credential was created - if you start SqlPlus as 'sqlplus /@DB001_Wallet, then this value here must be DB001Wallet.
3. Schema name
4. Path for Oracle Home
5. Path to Wallet
6. Complete path and name of the data file
Following is an example of how embl.bat is to be run if Oracle Wallet is set up:

C:\>embl.bat "" DB001Wallet trc_gdm C:\ora11g\product\11.2.0\dbhome_2\NETWORK\ADMIN D:\wallets embl.dat

If Oracle Wallet is not set, you will have to pass the following parameters when Embl.bat is run:
1. Username to connect to schema
2. Url — Full DB URL (host:port:instance). For example, Localhost:1613:devdb1
3. Schema name
4. Path for Oracle Home — when Oracle Wallet is not set up enter ""
5. Path to Wallet — when Oracle Wallet is not set up enter ""
6. Complete path and name of the data file
This is an example of how the embl.bat is to be run if Oracle Wallet is not set up:

C:\>embl.bat trc_gdm localhost:1613:devdb1 trc_gdm "" "" embl.dat

HUGO Loader (PLSQL)

The Hugo Loader is responsible for populating curated gene nomenclature records, taken from an online resource maintained by HUGO Gene Nomenclature Committee (HGNC), into ODB's reference database. The input data for the loader comprises the complete HGNC dataset which can be retrieved from their Statistics and Downloads webpage here: http://www.genenames.org/cgi-bin/hgnc_stats.pl. The data downloaded will be a large file with tabular text in tab-separated values given with column headers.

A batch file for Windows, and an alternative shell script for Linux-bash, has been provided for loading data. The content below shows the step-by step process of this load procedure.

Installing and Running the Loader

The loader can found bundled in the latest ODB build in the compressed file Hugo_Loader.zip. This folder consists of four files:

loaddata_hugo.ctl
hugo_loader.bat
hugo_loader.sh

To run the loader, perform the following steps:

Copy the above files into a folder on your system along with the downloaded input file from HUGO.
Open a command prompt terminal and change the directory to where the above files reside.
The hugo_loader.bat file uses the credentials stored in Oracle Wallet to connect to the schema that has the ODB. The following parameters should be passed when the hugo_loader.bat is run:
1. The name and complete path of the Hugo data file.
2. The schema name
The format of the command for the batch file is as follows:

C:\> hugo_loader.bat <HUGO file Name with full path ><Schema Name>
Note:
- The HUGO file name should not have any special characters such as '=' in its file name.
The hugo_loader.sh can be run with or without Oracle Wallet. The following parameters need to be passed when hugo_loader.sh is run:
1. The name and complete path of the Hugo data file
2. The schema name
3. 1 if Oracle Wallet is set up else 0
4. If the previous parameter is 0 enter the username to connect to the schema
Examples:

Following is an example of how hugo_loader.sh should be run when Oracle Wallet is set up:

sh hugo_loader.sh <HUGO file Name with full path ><Schema Name>1

Following is an example of how hugo_loader.sh should be run when Oracle Wallet is not set up:

sh hugo_loader.sh <HUGO file Name with full path ><Schema Name>0<username>
Once the loading is complete, log into SQL developer, or SQP*Plus with ODB Schema and verify that 35000 or more records are populated in W_EHA_HUGO_INFO table. Also check the log file (hugo.log) that is created in the same folder from which the .bat or .sh script was run, which will indicate the number of records that have been read, number of records that have been loaded into the staging table and the number of records that have been discarded.

GVF Ensembl Loader (PLSQL)

The GVF Ensembl loader is responsible for the input of known variants of any given species for which DNA source records are present. The loader will load only those variant records from the input file, for which matching DNA source records exist in the DB. (that is, those variants that fall into the absolute position ranges of a DNA source record with the same chromosome and species ID). Hence ensure the EMBL loader is run first with the relevant species' EMBL input files.

Since GVF files do not contain information about the species, it is necessary to pass a species_ID value as parameter to run the loader. This requires W_EHA_Species table to have the relevant species record with a primary key ID which is then passed as said parameter.

Any GVF file can be loaded multiple times. For Homo sapiens, the input file can be downloaded from the ensemble FTP link:

ftp://ftp.ensembl.org/pub/release-65/variation/gvf/homo_sapiens/

The link above may not necessarily reflect the most recent version of GVF file available. Oracle recommends that you use the latest GVF file.

For the GVF loaders, the .bat and .sh files have the same parameters except that the BAT script only supports using the Oracle Wallet connection (like all other loaders). The list of parameters are as follow (in this expected order):

GVF file to be loaded. Should include full path to allow sqlldr to access correctly.
SPECIES_WID to match the primary key ID on the SPECIES table matching the data. This will usually be 1 since only Homo sapiens will be installed.
TNS name to connect to the database.
1 to use wallet connection to database, 0 to prompt for password (used only in SH script).
Username to connect to the database (used only in SH script).

Following is the command line input for windows:

gvf_loader.bat <input_file(.gvf)> <species_wid> <wallet_name> 1

Following is the command line input for a shell scipt without the use of a wallet instance:

sh gvf_loader.sh <input_file(.gvf)> <species_wid> <dbname> 0 <user>

Following is the command line input for a shell scipt with a wallet instance:

sh gvf_loader.sh <input_file(.gvf)> <species_wid> <wallet_name> 1

Examples:

gvf_loader.sh Homo_sapiens.gvf 1 DB001 0 gdm

>gvf_loader.bat Homo_sapiens.gvf 1 DB001Wallet 1

Pathway Loader

Description

Pathway_loader is the utility for extracting, transforming and loading GSEA standard file formats.

The data used in our design case is can be downloaded from http://www.pathwaycommons.org/pc-snapshot/current-release/gsea/by_species/homo-sapiens-9606-gene-symbol.gmt.zip.

The pathway_loader utility is compatible with Oracle RDBMS 10.2 and above. It is not operating system dependent and works entirely within Oracle database. This section describes the setup procedure and also shows how to use the utility to load data from GSEA file located on your system.

Installing and Running the Loader

The loader is made up of three files:

pathway_loader.bat
pathway_loader.sh
gseaload.ctl

The pathway_loader.bat and pathway_loader.sh script first run sql loader that loads the data into the staging table from the data file. The scripts then call stored procedure to load the data from the staging table to the target table.

The pathway_loader.bat file requires the logon credentials to be stored in Oracle Wallet. The following parameters need to be passed when the bat file is run:

The name and complete path of the data file
The schema name

Following is an example of how the pathway_loader.bat file should be run:

C:\> pathway_loader.bat <Data file Name with full path ><Schema Name>

The pathway_loader.sh script can be run with or without Oracle Wallet being set up. The following parameters need to be passed when pathway_loader.sh is run:

The name and complete path of the Hugo data file
The schema name
The schema name
If the previous parameter is 0 enter the username to connect to the schema

Following is an example of how pathway_loader.sh should be run when Oracle Wallet is set up:

Sh pathway_loader.sh <Data file Name with full path ><Schema Name>1

Following is an example of how hugo_loader.sh should be run when Oracle Wallet is not set up:

Sh pathway_loader.sh <Data file Name with full path ><Schema Name>0<username>

Also check the log file that is created in the same folder from which the bat or sh script was run, which will indicate the number of records that have been read, number of records that have been loaded into the staging table and the number of records that have been discarded.

The procedure to load the data from the staging table to the target will create a function call process_string() and a procedure called load_pathway(). The process_string function shred through a single column tab, space, semicolon, pipe or similar delimiters data and tokenize them. The procedure load_pathway called this function and loop through the tokenized list and insert them into the destination tables accordingly.

The program in the utility is a PL/SQL program utilizing BULK COLLECT method. The program will load 8800 records in 6.77 seconds. The main concept used in the SELECT statement bind the result set of the query to a collection providing much less communication between the PL/SQL and SQL engines. All variables in the INTO clause are collection.

Files and Tables Used in Loading

The following file and tables are used in the loading process.

Source file: homo-sapiens-9606-gene-symbol.gmt. You can view this file by downloading it from

http://www.pathwaycommons.org/pc-snapshot/current-release/gsea/by_species/homo-sapiens-9606-gene-symbol.gmt.zip.

The first column and the second column in this file are normal tab delimited but the third column in the file is a string containing delimited values.
Staging table:

Table name: W_EHA_PATHWAY_GSEA_STG

PATHWAY_NAME VARCHAR2(100 CHAR)

PATHWAY_SOURCE_ID VARCHAR2(15 CHAR)

PATHWAY_PROTEIN_SYMBOL CLOB

);
Master destination table:

Table name: W_EHA_PATHWAY

ROW_WID NUMBER(38,0)

PATHWAY_SOURCE_ID VARCHAR2(50 BYTE)

PATHWAY_NAME VARCHAR2(200 BYTE)

W_INSERT_DT DATE

W_UPDATE_DT DATE

ETL_PROC_WID NUMBER(10,0)

ENTERPRISE_ID NUMBER(38,0)

);
Child destination table:

Table name: W_EHA_PATHWAY_PROTEIN

ROW_WID NUMBER(38,0)

PATHWAY_WID NUMBER(38,0)

HUGO_SYMBOL VARCHAR2(50 BYTE)

W_INSERT_DT DATE

W_UPDATE_DT DATE

ETL_PROC_WID NUMBER(10,0)

ENTERPRISE_ID NUMBER(38,0)

);

For relationship information, refer to the ODB data model.

Template of Command Line Arguments for Reference Loaders

This section provides a summary of templates for running each data loader, along with an example using sample summary input files.

Glossary of Parameter Templates

The following is a glossary of common data entries and a brief description of their point of origin.

Table 3-1 Glossary of Parameter Templates

Parameter Name	Description
user	DB user name
host_url:dbname	DB connection string (include system-add:port:dbname). to be used if no Wallet is provided. I think this DB connection string is required only for Java loaders. For all other loaders if u enter the schema name its enough.
schema_name	DB instance name
species_file	File with a list of Species names
input_file(.dat)	Input File, with path if not present in same folder (File type)
wallet_name	Name of Wallet created
sqlnet.ora_path	Path to directory with file sqlnet.ora and tnsnames.ora
wallet_dir	Directory with wallet files
species_wid	Primary Key ID from W_EHA_SPECIES
file_type_Code	File_type_code from W_EHA_RSLT_FILE_TYPE
file_version	File_type_version from W_EHA_RSLT_FILE_TYPE
vendor_name	User input of Vendor name for Result files
data_source_name	datasource_name from W_EHA_DATASOURCE
species_name	species_name from W_EHA_SPECIES
specimen_number	If datasource is CDM, give the SPECIMEN_NUMBER under W_EHA_SPECIMEN_PATIENT_H. (Has to be present in CDM table, but should only be given for gene_expression_loader.)
specimen_vendor_number	If datasource is CDM, give the SPECIMEN_VENDOR_NUMBER under W_EHA_SPECIMEN_PATIENT_H. (Has to be present in CDM table.)

SwissProt

There are four scenarios for command line input:

Without Oracle Wallet and with species:

Windows:

>swissProt.bat <user> <host_url:dbname> <schema_name> "" "" <species_file> <input_file(.dat)>

Linux:

>sh SwissProt.sh <user> <host_url:dbname> <schema_name> "" "" <species_file> <input_file(.dat)>

For example,

>swissProt.bat gdm localhost:1521:db001 gdm "" "" Species.dat uniprot_sprot.dat

Without Oracle Wallet and without species:

Windows:

>swissProt.bat <user> <host_url:dbname> <schema_name> "" "" "" <input_file(.dat)>

Linux:

>sh SwissProt.sh <user> <host_url:dbname> <schema_name> "" "" "" <input_file(.dat)>

Example:

>swissProt.bat gdm localhost:1521:db001 gdm "" "" "" uniprot_sprot.dat

With Oracle Wallet and with species:

Windows:

>swissProt.bat "" <wallet_name> <schema_name> <sqlnet.ora_path> <wallet_dir> <species_file> <input_file(.dat)>

Linux:

>sh SwissProt.sh "" <wallet_name> <schema_name> <sqlnet.ora_path> <wallet_dir> <species_file> <input_file(.dat)>

Example:

>swissProt.bat "" DB001Wallet gdm C:\ora11g\product\11.2.0\dbhome_2\NETWORK\ADMIN D:\wallets Species.dat uniprot_sprot.dat

With Oracle Wallet and without species:

Windows:

>swissProt.bat "" <wallet_name> <schema_name> <sqlnet.ora_path> <wallet_dir> "" <input_file(.dat)>

Linux:

>sh SwissProt.sh "" <wallet_name> <schema_name> <sqlnet.ora_path> <wallet_dir> "" <input_file(.dat)>

Example:

>swissProt.bat "" DB001Wallet gdm C:\ora11g\product\11.2.0\dbhome_2\NETWORK\ADMIN D:\wallets "" uniprot_sprot.dat

EMBL

The input is similar to SwissProt without species as input.

Without Oracle Wallet:

Windows:

>embl.bat <user> <host_url:dbname> <schema_name> "" "" <input_file(.dat)>

Linux:

>sh embl.sh <user> <host_url:dbname> <schema_name> "" "" <input_file(.dat)>

Example:

>embl.bat gdm localhost:1521:db001 gdm "" "" Homo_sapiens.12000.dat

With Oracle Wallet:

Windows:

>embl.bat "" <wallet_name> <schema_name> <sqlnet.ora_path> <wallet_dir> <input_file>

Windows:

>embl.bat "" <wallet_name> <schema_name> <sqlnet.ora_path> <wallet_dir> <input_file(.dat)>

Linux:

>sh embl.sh "" <wallet_name> <schema_name> <sqlnet.ora_path> <wallet_dir> <input_file(.dat)>

Example:

>embl.bat "" DB001Wallet gdm C:\ora11g\product\11.2.0\dbhome_2\NETWORK\ADMIN D:\wallets Homo_sapiens.12000.dat

HUGO

The Hugo batch file runs only with an Oracle Wallet instance. The Shell scripts can be executed in two modes:

Without Oracle Wallet:

Linux:

>sh hugo_loader.sh <input_file(text)> <schema_name> 0 <user>

Example:

>hugo_loader.sh hgnc_downloads.cgi DB001 0 gdm

With Oracle Wallet:

Windows:

>hugo_loader.bat <input_file(text)> <wallet_name>

Linux:

>sh hugo_loader.sh <input_file(text)> <wallet_name> 1

Example:

>hugo_loader.bat hgnc_downloads.cgi DB001Wallet

GVF

Like Hugo, the batch file will only run with a Wallet instance.

Without Oracle Wallet:

Linux:

>sh gvf_loader.sh <input_file(.gvf)> <species_wid> <schema_name> 0 <user>

Example:

>gvf_loader.sh Homo_sapiens.gvf 1 DB001 0 gdm

With Oracle Wallet:

Windows:

>gvf_loader.bat <input_file(.gvf)> <species_wid> <wallet_name>

Linux:

>sh gvf_loader.sh <input_file(.gvf)> <species_wid> <wallet_name> 1

Example:

>gvf_loader.bat Homo_sapiens.gvf 1 DB001Wallet

Pathway

The files run similar to HUGO.

Without Oracle Wallet:

Linux:

>sh pathway_loader.sh <input_file(.gmt)> <schema_name> 0 <user>

Example:

>pathway_loader.sh homo-sapiens-9606-gene-symbol.gmt DB001 0 gdm

With Oracle Wallet:

Windows:

>pathway_loader.bat <input_file(.gmt)> <wallet_name>

Linux:

>sh pathway_loader.sh <input_file(.gmt)> <wallet_name> 1

Example:

>pathway_loader.bat homo-sapiens-9606-gene-symbol.gmt DB001Wallet