Oracle® Health Sciences Omics Data Bank Programmer's Guide

Release 1.0.1

Part Number E27509-02

2 Prerequisites for Loading Data

This chapter contains the following topics:

Setting up a Directory Object

Setting up Oracle Wallet

Integration with Oracle Health Sciences Cohort Explorer Data Model or Another External Data Model

Specimen and Vendor Number Requirement

Reference Version Compatibility

Handling Newline Characters in Input Files

Setting up a Directory Object

Configure an Oracle directory object to be used for loading data files. Note that all of the loaders described here now use external tables. The ODB schema user must have the "CREATE ANY DIRECTORY" privilege as well as the "CREATE ANY TABLE" privilege. The name of the directory object is passed as a parameter to all of the loaders. The Oracle database OS account must have permission to access the path specified in the Oracle directory object. An example of this command is:

>create directory ODB_LOAD as '/home/oracle';

Note:

The directory used must reflect the path requirements of the operating system on which the database server is installed. Windows database servers have a different naming convention than Linux servers. Also, the directory must be mounted on the host OS of the database server before the database server is started.
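
The privileges mentioned above are typically granted by a DBA before the directory object is created. A minimal example, assuming the ODB schema user is named odb (as in the GRANT statement below), is:

GRANT CREATE ANY DIRECTORY, CREATE ANY TABLE TO odb;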

After a directory is created, the user creating the directory object needs to grant READ and WRITE privileges on the directory to other users as follows:

GRANT READ, WRITE ON DIRECTORY ODB_LOAD TO odb;

Extra Step on Exadata vs non-Exadata:

On Exadata, all loaded result files are stored in staging tables because direct path loading is required to get the best compression. There is now a new script, load_exadata_results.sql, which is to be run by a DBA when the loaders are not loading result files. This script enables parallel DML for the session so that direct path loading is used. For each staging table, the script locks the staging table and the corresponding result table exclusively before moving the data. This script is only required on Exadata and must be executed by the ODB schema user. On Exadata, the loaders use a synonym that points to the staging tables; on non-Exadata, the synonym points to the real result table.
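
The shipped load_exadata_results.sql handles every staging table; the sketch below only illustrates the general pattern for a single staging/result pair. The table names W_EHA_RSLT_GENE_EXP_STG and W_EHA_RSLT_GENE_EXP are hypothetical placeholders, and the sketch is not the contents of the shipped script.

-- Enable parallel DML for this session so the insert can run direct-path in parallel.
ALTER SESSION ENABLE PARALLEL DML;

-- Hypothetical staging/result table pair, locked exclusively before the move.
LOCK TABLE W_EHA_RSLT_GENE_EXP_STG IN EXCLUSIVE MODE;
LOCK TABLE W_EHA_RSLT_GENE_EXP IN EXCLUSIVE MODE;

-- Direct-path (APPEND) insert from staging into the result table,
-- assuming both tables have identical column structure.
INSERT /*+ APPEND PARALLEL(t) */ INTO W_EHA_RSLT_GENE_EXP t
SELECT * FROM W_EHA_RSLT_GENE_EXP_STG;

-- Clear the staging rows that were just copied, then commit both changes.
DELETE FROM W_EHA_RSLT_GENE_EXP_STG;
COMMIT;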

Setting up Oracle Wallet

Oracle Wallet must be set up with the credentials used to connect to the schema where gdm is installed. Perform the following steps to set up Oracle Wallet:

  1. Add the following entry to tnsnames.ora under $ORACLE_HOME\NETWORK\ADMIN:

    DB001_Wallet =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (ADDRESS = (PROTOCOL = TCP)(HOST = 10.178.187.186)(PORT = 1521))
        )
        (CONNECT_DATA =
          (SERVICE_NAME = db001)
        )
      )

    Note:

    Set the SERVICE_NAME and HOST values above to point to your database installation.
  2. Oracle wallet can be created on the client or the middle tier system. Open a command prompt terminal and execute the following:

    >d:

    >mkdir wallets

    >cd wallets

    >mkstore -wrl D:\wallets -create -nologo

    Enter password: <type an 8-character alphanumeric password>

    Enter password again: <retype the above password>

    >dir

    Volume in drive D is Data

    Volume Serial Number is C###

    Directory of D:\wallets

    11/24/2011 09:24 PM <DIR> .

    11/24/2011 09:24 PM <DIR> ..

    11/24/2011 09:13 PM 3,965 cwallet.sso

    11/24/2011 09:13 PM 3,888 ewallet.p12

    Note:

    The last command should show the two files created by running mkstore -create: cwallet.sso and ewallet.p12.
  3. Add your database credentials to your wallet.

    >mkstore -wrl D:\wallets -createCredential DB001_Wallet gdm

    Your secret/Password is missing in the command line

    Enter your secret/Password: <enter password for gdm user>

    Re-enter your secret/Password:<re-enter password>

    Enter wallet password: <enter the wallet password given while creating the wallet>

    Note:

    For every user credential added to the wallet, you must create a new net service name entry in tnsnames.ora, as shown in the example after this procedure. The examples in this section assume the username is gdm.
  4. Configure SQL*Net to look for the wallet. Add the following lines to 'sqlnet.ora' under $ORACLE_HOME\NETWORK\ADMIN:

    WALLET_LOCATION = (SOURCE=(METHOD=FILE)(METHOD_DATA=(DIRECTORY=D:\wallets)))

    SQLNET.WALLET_OVERRIDE = TRUE

  5. Test connectivity via SQL*Plus. At any command prompt, enter the following:

    >sqlplus /@DB001_Wallet

    You should see output similar to the following:

    SQL*Plus: Release 11.2.0.1.0 Production on Fri Nov 25 15:54:35 2011

    Copyright (c) 1982, 2010, Oracle. All rights reserved.

    Connected to:

    Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production

    With the Partitioning, OLAP, Data Mining and Real Application Testing options
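
As noted in step 3, every credential added to the wallet needs its own net service name in tnsnames.ora. The following sketch adds a credential for an odb schema user under a hypothetical alias DB001_Wallet_ODB; both the alias and the username are illustrative and must match your own installation:

    DB001_Wallet_ODB =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (ADDRESS = (PROTOCOL = TCP)(HOST = 10.178.187.186)(PORT = 1521))
        )
        (CONNECT_DATA =
          (SERVICE_NAME = db001)
        )
      )

    >mkstore -wrl D:\wallets -createCredential DB001_Wallet_ODB odb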

Integration with Oracle Health Sciences Cohort Explorer Data Model or Another External Data Model

Integration with other data models is done through the specimen record. Each genomic data result file must be accompanied by SPECIMEN_NUMBER and SPECIMEN_VENDOR_NUMBER information, as well as a SPECIMEN_DATASOURCE. These entities should match the record in the specimen datasource schema. The OHSCE data model is the default specimen datasource in the current release. SPECIMEN_WID in the W_EHA_RSLT_SPECIMEN table should match SPECIMEN_WID in the W_EHA_SPECIMEN table in the OHSCE data model.
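
One way to sanity-check this link before loading is to look for result specimen rows without a matching OHSCE specimen. The sketch below assumes the W_EHA_SPECIMEN table is visible to the ODB user (for example through a synonym or a schema prefix) and relies only on the SPECIMEN_WID column in each table:

-- Result specimen rows whose SPECIMEN_WID has no match in the OHSCE data model.
SELECT r.SPECIMEN_WID
FROM   W_EHA_RSLT_SPECIMEN r
WHERE  NOT EXISTS (
         SELECT 1
         FROM   W_EHA_SPECIMEN s
         WHERE  s.SPECIMEN_WID = r.SPECIMEN_WID
       );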

Every result record imported into the ODB schema is linked to a SPECIMEN record. The SPECIMEN data can come from any external schema, either on the same instance (co-located) or on an external instance. The loaders that load the various files call a specific stored procedure to validate that the SPECIMEN exists. The assumption is that any database used to provide the SPECIMEN information has a single numeric field to identify the correct SPECIMEN. This numeric value is stored in each result record (without a direct FK definition). The stored procedure is in the ODB_UTIL package, which is not wrapped, so that additional external databases can be supported. Currently, this stored procedure is implemented to validate SPECIMEN records stored in the Cohort Data Model. You need to expand this stored procedure to support other schemas that are intended to provide specimen data.

Following is the structure of the ODB_UTIL.GET_SPECIMEN_WID stored procedure:

function get_specimen_wid (
  i_datasource_id   in number,
  i_specimen_number in varchar2,
  i_specimen_vendor in varchar2
) return number;

This stored procedure has three parameters:

    i_datasource_id - ID of the W_EHA_DATASOURCE record that identifies the specimen datasource

    i_specimen_number - specimen number to validate

    i_specimen_vendor - specimen vendor number to validate

The stored procedure looks up the W_EHA_DATASOURCE record and compares the name field. Currently, there is a check for the name CDM, followed by code that looks up specimen data in CDM. If additional database schemas need to provide specimen information, you must first add a record to W_EHA_DATASOURCE with a unique name. The stored procedure must then be extended to handle that datasource name and to validate the passed specimen number and specimen vendor number. Note that most of the data files supply a specimen number, and the loaders currently have the specimen vendor number passed as a parameter.
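
As an illustration only, the lookup added for a new datasource could follow the pattern below. The branch in ODB_UTIL.GET_SPECIMEN_WID for the new datasource name would run an equivalent query; the helper function name, the my_specimen table, and its columns are assumptions for illustration, not part of the product.

-- Hypothetical lookup for an additional specimen datasource.
create or replace function my_src_specimen_wid (
  i_specimen_number in varchar2,
  i_specimen_vendor in varchar2
) return number
is
  l_wid number;
begin
  -- my_specimen and its column names are placeholders; replace them with the
  -- table and columns of the schema that actually stores your specimens.
  select s.specimen_id
    into l_wid
    from my_specimen s
   where s.specimen_number = i_specimen_number
     and s.specimen_vendor_number = i_specimen_vendor;
  return l_wid;
exception
  when no_data_found then
    -- How a missing specimen is reported (NULL here) must match the
    -- convention already used by ODB_UTIL.GET_SPECIMEN_WID.
    return null;
end;
/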

Specimen and Vendor Number Requirement

To load results into any of the result tables, each specimen referred to in the input result files (VCF, MAF, CGI masterVar, gene expression) must be present in the OHSCE data model. If a file with multiple specimens is being loaded, for example a VCF file, and one of the specimens is not found in the Cohort Explorer datamart schema, then NONE of the records in the entire file will be loaded. You also need to consider the compatibility of the available ENSEMBL version in ODB with other variation reference and result data before loading them into ODB, as described in the next section.

Reference Version Compatibility

You must consider the compatibility of the available ENSEMBL version in ODB with other variation reference and results data before loading them to ODB.

Following is the list of data files to be considered for version compatibility:

  1. The GVF data file should belong to the same ENSEMBL version that exists in ODB. For example, if ODB is loaded with ENSEMBL version 66, then the GVF file to be loaded should also belong to ENSEMBL 66.

  2. Variation data files, which include CGI masterVar, VCF, and MAF, should be based on the same reference genome that was used by the ENSEMBL version. For example, if the loaded ENSEMBL 66 version uses the GRCh37 (hg19) reference genome, then the results to be loaded should also be mapped to GRCh37 (hg19).

  3. Copy number variation result data should also be checked for reference genome version compatibility with the ENSEMBL version, as described in point 2 above.

Handling Newline Characters in Input Files

All reference and result input text files must follow the end-of-line character convention of the operating system on which the database server is installed. For a Windows database server, text files created in a Linux or UNIX environment must be processed with the unix2dos tool to convert them to DOS format.
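
For example, assuming the unix2dos utility is available on the system where the file resides, a result file can be converted in place before it is loaded (the file name below is illustrative):

>unix2dos gene_expression_results.txt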