Oracle® Health Sciences Omics Data Bank Programmer's Guide
Release 2.5

E35680-04

2 Prerequisites for Loading Data

This chapter contains the following topics:

  • Setting up a Directory Object

  • Setting up Oracle Wallet

  • Setting Up User Privileges for Querying or Loading Data

  • Integration with Oracle Health Sciences Cohort Explorer Data Model or Another External Data Model

  • Reference Version Compatibility

  • Handling Newline Characters in Input Files

2.1 Setting up a Directory Object

All loaders (except the EMBL and SwissProt loaders) use an external table to access data in the input files, which requires an Oracle directory object to be created. Oracle directory objects require that the database operating system user account has access to the directory. Therefore, the directory must be mounted and all permissions granted before the database server is started. For more information about creating directory objects, see Oracle Database SQL Language Reference 11g Release 2.

Oracle recommends that the Oracle directory used for loading files into the database not be located in DBFS SecureFiles. When DBFS holds the data files to be loaded, the database contends with itself for resources while loading: each read of the data file through the external table requires the database to first send the buffer to the operating system where DBFS is mounted, and the operating system then returns that buffer to the database to satisfy the read request. This slows down loading and causes many database wait events while the data file is read. Therefore, Oracle recommends not using Oracle directory objects located in DBFS.

All the loaders described here now use external tables. The ODB schema user must have the CREATE ANY DIRECTORY and CREATE ANY TABLE privileges. The name of the directory object is passed as a parameter to all of the loaders. The Oracle database OS account must have permission to access the directory specified in the Oracle directory object.

An example of a command to create an Oracle directory object is:

>create directory TRC_ODB as '/home/oracle/perm_loc';

Note:

The directory used must reflect the path requirements of the operating system on which the database server is installed. Windows database servers have a different naming convention than Linux servers. Also, the directory must be mounted on the host OS of the database server before the database server is started.

After a directory is created, the user creating the directory object needs to grant READ and WRITE privileges on the directory to other users as follows:

GRANT READ ON DIRECTORY <<DIR_NAME>> TO <<ODB USER NAME>>;

GRANT WRITE ON DIRECTORY <<DIR_NAME>> TO <<ODB USER NAME>>;

Here, DIR_NAME is the name of the Oracle directory where all the result files are kept, and ODB USER NAME is the name of the database user that executes the loaders. A database user must have both READ and WRITE grants on the Oracle directory to run the loaders.

2.2 Setting up Oracle Wallet

Oracle Wallet must be set up with the credentials used to connect to the schema where gdm is installed. Perform the following steps to set up Oracle Wallet:

  1. Add the following entry to tnsnames.ora under $ORACLE_HOME\NETWORK\ADMIN:

    DB001_Wallet =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (ADDRESS = (PROTOCOL = TCP)(HOST = 10.178.187.186)(PORT = 1521))
        )
        (CONNECT_DATA =
          (SERVICE_NAME = db001)
        )
      )

    Note:

    Set the SERVICE_NAME and HOST values above to point to your database installation.
  2. An Oracle wallet can be created on the client or the middle-tier system. Open a command prompt and execute the following:

    >cd d:
    >d:
    >mkdir wallets
    >cd wallets
    >mkstore -wrl D:\wallets -create -nologo
    Enter password: <type an 8-character alphanumeric password>
    Enter password again: <retype the password>
    >dir
    Volume in drive D is Data
    Volume Serial Number is C###
    Directory of D:\wallets
    11/24/2011 09:24 PM <DIR> .
    11/24/2011 09:24 PM <DIR> ..
    11/24/2011 09:13 PM 3,965 cwallet.sso
    11/24/2011 09:13 PM 3,888 ewallet.p12

    Note:

    The dir command should show the two files created by running mkstore -create: cwallet.sso and ewallet.p12.
  3. Add your database credentials to the wallet:

    >mkstore -wrl D:\wallets -createCredential DB001_Wallet odb
    Your secret/Password is missing in the command line
    Enter your secret/Password: <enter the password for the odb user>
    Re-enter your secret/Password: <re-enter the password>
    Enter wallet password: <enter the 8-character password given while creating the wallet>

    Note:

    For every user credential added to the wallet, you must create a new net service name entry in tnsnames.ora. This example assumes the username odb.
  4. Configure SQL*Net to look for the wallet. Add the following lines to sqlnet.ora under $ORACLE_HOME\NETWORK\ADMIN:

    WALLET_LOCATION = (SOURCE=(METHOD=FILE)(METHOD_DATA=(DIRECTORY=D:\wallets)))
    SQLNET.WALLET_OVERRIDE = TRUE

  5. Test connectivity using SQL*Plus. At any command prompt, enter the following:

    >sqlplus /@DB001_Wallet

    You should see output similar to the following:

    SQL*Plus: Release 11.2.0.1.0 Production on Fri June 7 15:54:35 2013
    Copyright (c) 1982, 2010, Oracle. All rights reserved.
    Connected to:
    Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
    With the Partitioning, OLAP, Data Mining and Real Application Testing options

2.3 Setting Up User Privileges for Querying or Loading Data

Perform the following steps to set up user privileges for querying or loading data:

  1. You must create a database user having the following privileges:

    • CREATE SYNONYM

    • CREATE SESSION

  2. You must assign the user an appropriate role from the following three roles:

    • OmicsDatamartUser

      Queries the ODB schema. This role can only perform queries on the model and cannot write to it. It is typically given to named users of the UIs whose credentials are passed to the database layer for querying only.

    • OmicsDatamartAdmin

      Loads data into the reference side of ODB and refreshes all reference data in the reference side of the schema, including the W_EHA_VARIANT table. This user cannot create new data definitions for objects such as tables and views.

    • OmicsDatamartContributor

      Loads result data into the Omics Data Bank using the provided result loaders. This role enables a named user to write to the result side of the ODB model.

  3. A local synonym should be created for the database user. The create_odb_synonyms.sql script is available in the "odb_install" folder of the ODB software package.

    1. Connect to the database instance with the database user.

    2. Execute the following command, replacing odb_schema_name with the actual name of your Omics Data Bank schema:

      @create_odb_synonyms.sql &&odb_schema_name
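The three steps above can be consolidated into the following SQL sketch. The user name odb_app and the password are hypothetical placeholders, and the user-creation statements must be run by a DBA:

```sql
-- Step 1: create a database user with the required privileges.
-- The user name odb_app and the password are placeholders.
CREATE USER odb_app IDENTIFIED BY "Change_Me_1";
GRANT CREATE SYNONYM, CREATE SESSION TO odb_app;

-- Step 2: assign one of the three ODB roles, here the query-only role.
GRANT OmicsDatamartUser TO odb_app;

-- Step 3: connect as the new user and create local synonyms, replacing
-- odb_schema_name with the actual name of your Omics Data Bank schema.
-- CONNECT odb_app
-- @create_odb_synonyms.sql &&odb_schema_name
```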

2.4 Integration with Oracle Health Sciences Cohort Explorer Data Model or Another External Data Model

Integration with other data models is done through the specimen record. Each genomic data result file must be accompanied by SPECIMEN_NUMBER, SPECIMEN_VENDOR_NUMBER, and SPECIMEN_DATASOURCE information. These values must match a record in the specimen datasource schema. The OHSCE Data Model is the default specimen datasource in the current release. The SPECIMEN_WID in the W_EHA_RSLT_SPECIMEN table in ODB must match the ROW_WID in the W_EHA_SPECIMEN_PATIENT_H table in the OHSCE Data Model.

Every result record imported into the ODB schema is linked to a SPECIMEN record. The SPECIMEN data can come from any external schema, either on the same instance (co-located) or on an external instance. The loaders that load the various files call a specific stored procedure to validate that the SPECIMEN exists.

The assumption is that any database used to provide SPECIMEN information has a single numeric field identifying the correct SPECIMEN. This numeric value is stored in each result record (without a direct FK definition). The stored procedure is in the package ODB_UTIL, which is not wrapped, so that additional external databases can be supported. Currently, this stored procedure supports validating SPECIMEN records stored in the Cohort Data Model. You need to extend this stored procedure to support other schemas that are intended to provide specimen data.

Following is the structure of ODB_UTIL.GET_SPECIMEN_WID stored procedure:

function get_specimen_wid
(
  i_datasource_id    number,
  i_specimen_number  in varchar2,
  i_specimen_vendor  varchar2,
  i_study_id         number,
  i_etl_proc_id      number,
  i_enterprise_id    number,
  i_file_wid         in number,
  i_logger           result_logger default null,
  i_call_count       number default null
) return number;

This stored procedure has seven mandatory parameters and two optional parameters (i_logger and i_call_count).

The stored procedure looks up the W_EHA_DATASOURCE record and compares the name field. Currently, there is a check for the name CDM, followed by code that looks for specimen data in CDM. If additional database schemas are to provide specimen information, you must first add a record to W_EHA_DATASOURCE with a unique name.

The stored procedure must specifically handle that data source name and validate the specimen number and specimen vendor number passed in. Most data files supply a specimen number, and the loaders currently take a specimen vendor number as a parameter.

If the specimen exists in the CDM schema (w_eha_specimen_h table) and the i_file_wid parameter value is not null, the stored procedure calls the local procedure add_file_spec(), which inserts a record into the w_eha_rslt_file_spec table. This establishes the link between the result file and the specimen used for loading result data.

If the get_specimen_wid stored procedure raises the NO_DATA_FOUND exception (that is, the specimen does not exist in the w_eha_rslt_specimen table), the code retrieves the validation procedure name (ODB_UTIL.VALIDATE_CDM_SPECIMEN) from the w_eha_datasource table. This procedure retrieves the specimen_wid from the w_eha_specimen_h table for the given specimen number and specimen vendor number, which might come from an external data source.

For an external data source, you must specify the DB link in w_eha_datasource. If the procedure does not exist in the CDM schema, the function logs the error 'Could not find specimen number' in the w_eha_rslt_log table; otherwise, it creates a record in both the w_eha_rslt_specimen and w_eha_rslt_file_spec tables and returns the specimen_wid to the loader.

This procedure also logs relevant warnings, errors, and other information.
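Given the signature shown above, a loader's call to the function can be sketched as follows. All argument values, and the decision to omit the optional i_logger and i_call_count parameters, are illustrative assumptions rather than values from an actual installation:

```sql
-- Hypothetical invocation of ODB_UTIL.GET_SPECIMEN_WID.
DECLARE
  v_specimen_wid NUMBER;
BEGIN
  v_specimen_wid := odb_util.get_specimen_wid(
    i_datasource_id   => 1,            -- W_EHA_DATASOURCE row for CDM (placeholder)
    i_specimen_number => 'SPEC-001',   -- placeholder specimen number
    i_specimen_vendor => 'VENDOR-01',  -- placeholder specimen vendor number
    i_study_id        => 1,
    i_etl_proc_id     => 1,
    i_enterprise_id   => 0,
    i_file_wid        => NULL          -- NULL skips the w_eha_rslt_file_spec link
  );
  DBMS_OUTPUT.put_line('specimen_wid = ' || v_specimen_wid);
END;
/
```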

2.4.1 Specimen and Vendor Number Requirement

To load results into any of the result tables, each specimen referred to in the input result files (VCF, MAF, CGI masterVar, and gene expression) must be present in the OHSCE data model. If a file with multiple specimens is loaded, such as a VCF or MAF file, and one of the specimens is not found in the Cohort Explorer datamart schema, the loader skips that row and loads the rest of the data into the target tables.

2.5 Reference Version Compatibility

You must ensure that the ENSEMBL version loaded into ODB is compatible with the other reference and results data before loading them into ODB.

The following data files must be checked for version compatibility:

  1. The GVF data file should belong to the same ENSEMBL version that exists in ODB. For example, if ODB is loaded with ENSEMBL version 66, the GVF file to be loaded should also belong to ENSEMBL 66.

  2. Variation data files, which include VCF, gVCF, MAF, and CGI masterVar, should be based on the same reference genome used by the ENSEMBL version. For example, if the loaded ENSEMBL 66 version uses the GRCh37 reference genome, the results to be loaded should also be mapped to GRCh37.

  3. Copy number variation result data should also be checked for reference genome compatibility with the ENSEMBL version, as described in item 2.

  4. TCGA RNASeq exon data should similarly be matched to the correct reference version. TCGA provides a 'description.txt' file along with the other MAGE-TAB analysis files for every RNASeq dataset. You can find the mapping reference version details in this file.

2.6 Handling Newline Characters in Input Files

All reference and result input text files must follow the end-of-line character convention of the operating system on which the database server runs. For a Windows database server, text files created in a Linux or UNIX environment must be converted to DOS format with the unix2dos tool.
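When unix2dos is not available, the same conversion can be sketched with standard GNU tools; the file name results.vcf below is a placeholder:

```shell
# Create a sample Unix-format (LF-only) input file; the name is a placeholder.
printf 'chrom\tpos\n1\t12345\n' > results.vcf

# Append a carriage return to each line, converting LF endings to CRLF
# (DOS format), which is what unix2dos does.
sed -i 's/$/\r/' results.vcf

# Verify that the file now contains carriage returns.
od -c results.vcf | grep -q '\\r' && echo "converted to DOS format"
```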