Oracle® Health Sciences Omics Data Bank Programmer's Guide
Release 3.0.2.1

E35680-12

2 Prerequisites for Loading Data

This chapter describes the prerequisites for loading data into the Omics Data Bank (ODB).

2.1 Setting Up a Directory Object

All loaders (except for EMBL and SwissProt) use external tables to access data in files. This requires an Oracle directory object to be created. Oracle directory objects require the database Operating System (OS) user account to have access to this directory. Therefore, the directory must be mounted and all permissions granted before the database server is started. For more information about creating directory objects, see Oracle Database SQL Language Reference 11g Release 2.

The directory object name is passed as a parameter to all the loaders. The ODB schema user must have the CREATE ANY DIRECTORY and CREATE ANY TABLE privileges. The Oracle database OS account must have permission to access the directory specified in the Oracle directory object.
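For example, a DBA can grant these privileges as follows (odb here is a placeholder for your ODB schema user name):

```sql
-- odb is a placeholder for the ODB schema user name
GRANT CREATE ANY DIRECTORY TO odb;
GRANT CREATE ANY TABLE TO odb;
```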

An example of a command to create an Oracle directory object is:

SQL> create directory TRC_ODB as '/home/oracle/perm_loc';

Note:

The directory used must reflect the path requirements of the operating system that the database server is installed on. Windows database servers have different naming conventions than Linux servers. Also, the directory used must be mounted on the host OS of the database server before the database server is started.

After a directory is created, grant READ and WRITE privileges on the directory to other users as follows:

GRANT READ ON DIRECTORY <<DIR_NAME>> TO <<ODB USER NAME>>;

GRANT WRITE ON DIRECTORY <<DIR_NAME>> TO <<ODB USER NAME>>;

Here, DIR_NAME is the name of the Oracle directory where all the result files are kept, and ODB USER NAME is the database user executing the loaders. The database user must have both READ and WRITE grants on the Oracle directory to run the loaders.

2.2 Setting Up an Oracle Wallet

An Oracle Wallet must be set up with the credentials used to connect to the schema where ODB is installed.

Perform the following steps to set up the Oracle Wallet:

  1. Add the following entry to tnsnames.ora under $ORACLE_HOME\NETWORK\ADMIN:

    DB001_Wallet =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (ADDRESS = (PROTOCOL = TCP)(HOST = 10.178.187.186)(PORT = 1521))
        )
        (CONNECT_DATA =
          (SERVICE_NAME = db001)
        )
      )

    Note:

    Set the SERVICE_NAME and HOST values to point to your database installation.
  2. The Oracle wallet can be created on the client or middle tier system. Open a command prompt terminal and execute the following:

    >d:

    >mkdir wallets

    >cd wallets

    >mkstore -wrl D:\wallets -create -nologo

    Enter password: <type an 8-character alphanumeric password>

    Enter password again: <retype the same password>

    >dir

    Volume in drive D is Data

    Volume Serial Number is C###

    Directory of D:\wallets

    11/24/2011 09:24 PM <DIR> .

    11/24/2011 09:24 PM <DIR> ..

    11/24/2011 09:13 PM 3,965 cwallet.sso

    11/24/2011 09:13 PM 3,888 ewallet.p12

    Note:

    The last command should show two files created by running mkstore -create: cwallet.sso and ewallet.p12.
  3. Add your database credentials to your wallet.

    >mkstore -wrl D:\wallets -createCredential DB001_Wallet odb

    Your secret/Password is missing in the command line

    Enter your secret/Password: <enter password for odb user>

    Re-enter your secret/Password:<re-enter password>

    Enter wallet password:<enter the 8-character password given while creating the wallet>

    Note:

    For every user credential added to the wallet, you must create a new TNS alias in tnsnames.ora. This example assumes the username is odb.
  4. Configure SQLNET to look for the wallet. Add the following lines of code to sqlnet.ora under $ORACLE_HOME\NETWORK\ADMIN:

    WALLET_LOCATION = (SOURCE=(METHOD=FILE)(METHOD_DATA=(DIRECTORY=D:\wallets)))

    SQLNET.WALLET_OVERRIDE = TRUE

  5. Test connectivity using sqlplus. Enter the following on any command prompt terminal:

    >sqlplus /@DB001_Wallet

    You should see output similar to the following:

    SQL*Plus: Release 11.2.0.1.0 Production on Fri June 7 15:54:35 2013

    Copyright (c) 1982, 2010, Oracle. All rights reserved.

    Connected to:

    Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production

    With the Partitioning, OLAP, Data Mining and Real Application Testing options

2.3 Setting Up User Privileges for Querying or Loading Data

Perform the following steps to set up user privileges for querying or loading data:

  1. Create a database user having the following privileges:

    • CREATE SYNONYM

    • CREATE SESSION
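For example, such a user can be created as follows (the user name odb_user and its password are placeholder values; substitute your own):

```sql
-- odb_user and its password are placeholder values
CREATE USER odb_user IDENTIFIED BY "Welcome_1";
GRANT CREATE SESSION TO odb_user;
GRANT CREATE SYNONYM TO odb_user;
```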

  2. Assign the user an appropriate role from the following three roles:

    • OmicsDatamartUser

      Queries the ODB schema. Can only perform queries on the model and cannot write to the model. This role will typically be given to named UI users whose credentials are passed to the database layer for querying only.

    • OmicsDatamartAdmin

      Loads data into the reference side of ODB and refreshes all reference data in the reference side of the schema, including the W_EHA_VARIANT table. This user cannot create new data definitions for objects such as tables and views.

    • OmicsDatamartContributor

      Can load result data into the Omics Data Bank through provided result loaders. This role enables a named user to write to the ODB result side of the model.
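For example, to let a named user load result data, a DBA can grant the contributor role listed above (odb_user is a placeholder user name):

```sql
-- odb_user is a placeholder user name
GRANT OmicsDatamartContributor TO odb_user;
```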

  3. Create a local synonym for the database user. The create_synonym_for_user.sql script is available in the master_install folder in the TRC Software package.

    1. Connect to the database instance with the above created database user.

    2. Execute the following command, replacing cdm_schema_name with the Cohort Explorer schema name, odb_schema_name with the actual name of your Omics Data Bank schema, apps_schema_name with the application schema name, and job_schema_name with the job engine schema name:

      @create_synonym_for_user.sql <<cdm_schema_name>> <<odb_schema_name>> <<apps_schema_name>> <<job_schema_name>>

2.4 Integration with Oracle Health Sciences Cohort Explorer Data Model or Another External Data Model

Integration with other data models is done through the specimen record. Each genomic data result file must be accompanied by SPECIMEN_NUMBER, SPECIMEN_VENDOR_NUMBER, and SPECIMEN_DATASOURCE information. These values must match a record in the specimen datasource schema. The OHSCE Data Model is the default specimen datasource in the current release. ODB contains three tables, W_EHA_SPEC_PATIENT, W_EHA_SPEC_SUBJECT, and W_EHA_SPEC_EXTERNAL, each having a foreign key, RSLT_SPECIMEN_WID, to the W_EHA_RSLT_SPECIMEN table. All three tables have a column to store the external table ID value that registers a specific result specimen. For W_EHA_SPEC_PATIENT in ODB, this value is SPEC_PATIENT_WID, which should match ROW_WID in the W_EHA_SPECIMEN_PATIENT_H table in the OHSCE Data Model. Similarly, OHSCE's subject-specific content can be linked through W_EHA_SPEC_SUBJECT.SPEC_SUBJECT_WID matched to the subject-specific table W_EHA_SPECIMEN_SUBJECT_H. The table W_EHA_SPEC_EXTERNAL is used to register external non-OHSCE specimen sources, where the external specimen ID is matched to EXT_SPECIMEN_WID.
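As an illustrative sketch (not a query shipped with the product), the patient-side linkage described above can be traversed like this; CDM is a placeholder for your Cohort Data Model schema name:

```sql
-- Illustrative only: walk from ODB result specimens to OHSCE patient specimens.
-- CDM is a placeholder for the Cohort Data Model schema name.
select rs.row_wid  as rslt_specimen_wid,
       rs.specimen_number,
       sph.row_wid as patient_specimen_wid
from   w_eha_rslt_specimen            rs,
       w_eha_spec_patient             sp,
       CDM.w_eha_specimen_patient_h_v sph
where  sp.rslt_specimen_wid = rs.row_wid
and    sp.spec_patient_wid  = sph.row_wid;
```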

Every result record imported into the ODB schema is linked to a SPECIMEN record. The SPECIMEN data can come from any external schema either on the same instance (co-located) or on an external instance. The loaders call a specific stored procedure to validate that the SPECIMEN exists.

It is assumed that any database used to provide the SPECIMEN information has a single numeric field that identifies the correct SPECIMEN. This numeric value is stored in each result record (without a direct FK definition). Three stored function calls are provided as options to the user through a stored procedure in the ODB_UTIL package. The procedure is not wrapped, so additional external databases can be supported. Currently, the function calls are implemented to support validating SPECIMEN records stored in the Cohort Data Model.

This stored procedure can be expanded to support other schemas that are intended to provide specimen data.

Following is the signature of the ODB_UTIL.GET_SPECIMEN_WID function that performs this validation:

function get_specimen_wid(
    i_datasource_id   in number,
    i_specimen_number in varchar2,
    i_specimen_vendor in varchar2,
    i_study_id        in number,
    i_etl_proc_id     in number,
    i_enterprise_id   in number,
    i_file_wid        in number,
    i_logger          in result_logger default null,
    i_lock_table      in number default 1
) return number;
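A call to this function from PL/SQL might look like the following sketch; all literal values are placeholders for illustration, and the optional parameters are left at their defaults:

```sql
DECLARE
  v_specimen_wid NUMBER;
BEGIN
  -- All literal values below are placeholders for illustration
  v_specimen_wid := ODB_UTIL.GET_SPECIMEN_WID(
    i_datasource_id   => 1,
    i_specimen_number => 'SPEC-0001',
    i_specimen_vendor => 'VENDOR-01',
    i_study_id        => 100,
    i_etl_proc_id     => 1,
    i_enterprise_id   => 0,
    i_file_wid        => NULL);
  DBMS_OUTPUT.PUT_LINE('SPECIMEN_WID = ' || v_specimen_wid);
END;
/
```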

This function has seven mandatory parameters and two optional parameters (i_logger and i_lock_table, which have default values).

The stored procedure looks up the W_EHA_DATASOURCE record and compares the name field. Currently, there is a check for the CDM name, and code for searching specimen data in CDM. If additional database schemas are to be used to provide specimen information, you must first add a record to W_EHA_DATASOURCE with a unique name.

The stored procedure has to handle that data source name specifically and validate the specimen number and specimen vendor number passed in. Most data files supply a specimen number, and the loaders currently take a specimen vendor number as a parameter.

If the specimen exists in the CDM schema (W_EHA_SPECIMEN_H table) and I_FILE_WID parameter value is not null, then the stored procedure calls the local procedure add_file_spec(), which inserts a record into W_EHA_RSLT_FILE_SPEC table. This establishes the link between result file and the specimen used for loading result data.

If the get_specimen_wid stored procedure raises the NO_DATA_FOUND exception (that is, the specimen does not exist in the W_EHA_RSLT_SPECIMEN_H table), then the code retrieves the validation procedure name from the W_EHA_DATASOURCE table (ODB_UTIL.VALIDATE_CDM_SPECIMEN). This stored procedure retrieves the SPECIMEN_WID from the W_EHA_SPECIMEN_H table against the specimen number and specimen vendor number, which might be used as an external data source.

For an external data source, specify the DB link in W_EHA_DATASOURCE. If the specimen does not exist in the CDM schema, the function logs the error 'Could not find specimen number' in the W_EHA_RSLT_LOG table; otherwise it creates a record in both the W_EHA_RSLT_SPECIMEN_H and W_EHA_RSLT_FILE_SPEC tables, and returns the SPECIMEN_WID to the loader.

This procedure also logs relevant warning, error, and other information.

2.4.1 Specimen and Vendor Number Requirement

To load results into any of the result tables, each specimen referred to in the input result files (VCF, MAF, CGI masterVar, and gene expression) must be present in the OHSCE data model. If a file with multiple specimens, such as a VCF or MAF file, is loaded and one of the specimens is not found in the Cohort Explorer datamart schema, then the loader skips that row and loads the rest of the data into the target tables.

2.5 Migrating W_EHA_RSLT_STUDY, W_EHA_SPEC_PATIENT and W_EHA_SPEC_SUBJECT Tables

If the CDM schema data is refreshed using a full ETL load after result files have been loaded in ODB, then the ODB-to-CDM references are broken, and the implicit foreign keys in the ODB schema must be updated to point to the new CDM record values.

Before the ODB specimen records are updated, the function-based indexes that link the ODB schema to Patient and Subject records must be rebuilt. Log in as the ODB schema user and use the following SQL to rebuild these indexes:

SQL> alter index W_EHA_SPEC_PATIENT_M1 rebuild;
SQL> alter index W_EHA_SPEC_SUBJECT_M1 rebuild;

Following are the SQL statements to update CDM-related tables:

W_EHA_RSLT_STUDY:

update w_eha_rslt_study rs set external_study_wid = (
  select row_wid from cdm.w_eha_study_d sd
  where rs.result_study_name = sd.study_name
);

W_EHA_SPEC_PATIENT:

update w_eha_spec_patient sp
set spec_patient_wid = (
  select sph.row_wid
  from CDM.w_eha_specimen_patient_h_v sph,
       w_eha_rslt_specimen rs
  where rs.row_wid = sp.rslt_specimen_wid
    and rs.specimen_number = sph.specimen_number
    and nvl(rs.specimen_vendor_number, 'n0Ne') = nvl(sph.specimen_vendor_number, 'n0Ne'));

W_EHA_SPEC_SUBJECT:

update w_eha_spec_subject ss
set spec_subject_wid = (
  select sph.row_wid
  from CDM.w_eha_specimen_subject_h_v sph,
       w_eha_rslt_specimen rs
  where rs.row_wid = ss.rslt_specimen_wid
    and rs.specimen_number = sph.specimen_number
    and nvl(rs.specimen_vendor_number, 'n0Ne') = nvl(sph.specimen_vendor_number, 'n0Ne'));

Also, if the CDM schema is refreshed, ODB aggregates must be rebuilt using the following procedure:

BEGIN
  DBMS_SCHEDULER.create_job (
    job_name        => 'RebuildAggregates',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN ODB_VARIANT_AGG_UTIL.rebuild_aggregates ; END;',
    enabled         => true,
    comments        => 'To rebuild aggregates from scratch');
END;
/

Note:

Since this procedure may take some time to execute, Oracle recommends that you run it as a scheduled job.

2.6 Reference Version Compatibility

Before loading the ENSEMBL version to ODB, ensure its compatibility with other reference and results data.

Following are the data files to be considered for version compatibility:

  1. GVF data files should belong to versions of Ensembl that exist in ODB. For example, if ODB is loaded with Ensembl version 66, then Oracle recommends that the GVF file to be loaded with the same Version Label also belong to Ensembl 66. However, a GVF data file can be loaded for any DNA version that shares the same genomic alignment without losing validity.

  2. Variation data files, which include VCF, gVCF, MAF, and CGI masterVar, should be based on the same reference genome that was used by the Ensembl version loaded with the same Version Label. For example, if the loaded Ensembl 66 version uses the GRCh37 reference genome, then the results to be loaded should also be mapped to GRCh37.

    Note:

    If the same reference DNA version source file is loaded more than once, then for each variant, one record is created in the variant table for every source record matched.
  3. Copy Number Variation result data should also be checked for reference genome version compatibility with Ensembl version as specified in point 2.

  4. TCGA RNASeq exon data should similarly be matched to the correct reference version. TCGA provides a description.txt file along with other mage-tab analysis files for every RNAseq dataset. Mapping reference version details are available in this file.

2.7 Handling Newline Characters in Input Files

All reference and result input text files must follow the end-of-line character convention of the operating system on which the database server is installed. For a Windows database server, text files created in a Linux or UNIX environment must be converted to DOS format with the unix2dos tool.

2.8 Periodically Purge the Recycle Bin

All loaders that use dynamic tables to load data also contain code to drop any temporary tables as well as external tables. These dropped tables accumulate in the Oracle database recycle bin. Periodically execute the following command to remove these unused objects and free up space:

SQL>purge recyclebin;