Export genomic data

The Genomic Data Export page is used to export the genomic data for filtered patients or subjects based on Study, Specimen type and Anatomical Site in a specific file format.

Note:

Currently, the export supports variation data from sequencing platforms, single-channel and dual-channel gene expression data, and Copy Number Variation data, in the VCF, SEG, RES, and GCT file formats. These formats are supported by the IGV browser.

Select results to export

Currently only one version of data can be exported at a time.
To select results:
  1. On the left, click the Patients or Subjects section, and select a source for your results.
  2. Select the Assembly Version from the drop-down list.
  3. Enter a Specimen Type or search for one using the search icon (a magnifying glass).
  4. Enter an Anatomical Site or search for one using the search icon (a magnifying glass).

Set the location

You can export genomic data using three different methods:

  • Click the In Genes from radio button and select one or all of the following options:
    • Pathway, to select one or more pathways.
    • Gene Set, to use a user-defined collection of genes.
    • Ad-hoc List, to select one or more genes.
  • Click the radio button for At Genomic Position and enter a specific chromosome region.

    The following chromosome region formats are supported:

    CHR15:10000-200000 - Considers the region from position 10000 to position 200000 on chromosome 15.
    CHR15:1,200,000+5000 - Considers 5000 bases upstream from position 1,200,000 on chromosome 15.
    CHR15 - Considers the whole of chromosome 15.
    CHR15:1000 - Considers the nucleotide at position 1000 on chromosome 15.
    
  • Click the radio button for All Data to export the complete genomic data of patients or subjects.
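
The chromosome region formats listed for At Genomic Position can be recognized with a single pattern. The following is a minimal Python sketch, not the product's own parser. The names `Region` and `parse_region` are hypothetical. The number after "+" is kept as a separate offset and is not converted into an end coordinate, because the direction of the offset is not something this sketch should guess.

```python
import re
from typing import NamedTuple, Optional


class Region(NamedTuple):
    """Hypothetical container for a parsed region string."""
    chrom: str              # chromosome label, for example "15"
    start: Optional[int]    # first position given, or None for a whole chromosome
    end: Optional[int]      # end of an explicit start-end range, else None
    offset: Optional[int]   # number given after "+", else None


# CHR<name>, optionally followed by :<start> and then -<end> or +<offset>.
_REGION = re.compile(
    r"^CHR(?P<chrom>[0-9A-Za-z]+)"
    r"(?::(?P<start>[\d,]+)(?:(?P<sep>[-+])(?P<second>[\d,]+))?)?$",
    re.IGNORECASE,
)


def _to_int(text: str) -> int:
    # Thousands separators such as "1,200,000" are accepted and stripped.
    return int(text.replace(",", ""))


def parse_region(text: str) -> Region:
    match = _REGION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized region: {text!r}")
    chrom = match["chrom"]
    if match["start"] is None:
        return Region(chrom, None, None, None)        # whole chromosome
    start = _to_int(match["start"])
    if match["sep"] == "-":
        return Region(chrom, start, _to_int(match["second"]), None)
    if match["sep"] == "+":
        return Region(chrom, start, None, _to_int(match["second"]))
    return Region(chrom, start, None, None)           # single position


for example in ("CHR15:10000-200000", "CHR15:1,200,000+5000", "CHR15", "CHR15:1000"):
    print(example, "->", parse_region(example))
```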

Select file type

Select one or all of the available file types:

  • Mutation - VCF
  • Copy Number Variation - SEG
  • Microarray Expression - RES
  • Microarray Expression Dual Channel - GCT

Note:

For the At Genomic Position location selection, the Gene Expression - RES and Gene Expression Dual Channel options are disabled.

About the file types:

Mutation - VCF

This option exports the sequencing variation data for the selected patients or subjects, restricted to the genes, pathways, gene sets, or chromosome region selected in the previous step. A single VCF file can hold data for multiple specimens.

The metadata header contains the following information, which varies with the search criteria:

  1. ##fileformat=VCFv4.1
  2. ##fileDate: Date and time at which the VCF file was generated.
  3. ##source=Oracle Healthcare Omics (OHO, formerly known as Omics Data Bank)
  4. ##Total Number of patients included in this VCF file
  5. ##Total Number of samples included in this VCF file
  6. ##INFO=<ID=NS, Number=1, Type=Integer, Description=Number of Samples With Data>
  7. ##FORMAT=<ID=GT,Number=1,Type=String,Description=Genotype>
  8. ##FORMAT=<ID=GQ, Number=1, Type=Integer, Description=Genotype Quality>
  9. ##FORMAT=<ID=GQVAF, Number=2, Type=Integer, Description=Genotype_quality_X>
  10. ##FORMAT=<ID=DP, Number=1, Type=Integer, Description=Read Depth>
  11. ##FORMAT=<ID=AD,Number=.,Type=Integer,Description=Allelic depths for the ref and alt alleles in the order listed >
  12. ##FORMAT=<ID=HQ, Number=2, Type=Integer, Description=Haplotype Quality>
  13. ##FORMAT=<ID=BQ,Number=.,Type=Integer,Description=Average base quality >
  14. ##FORMAT=<ID=MQ,Number=.,Type=Integer,Description=Average mapping quality >
  15. ##FORMAT=<ID=SS,Number=1,Type=Integer,Description=Variant status relative to non-adjacent Normal,0=wildtype,1=germline,2=somatic,3=LOH,4=post-transcriptional modification,5=unknown>
  16. ##FORMAT=<ID=SSC,Number=1,Type=Integer,Description=Somatic Score>
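
A downstream script may need to read the ##INFO and ##FORMAT definitions back out of an exported header. The following is a minimal Python sketch that assumes the key order shown in the list above (ID, Number, Type, Description) and treats everything after `Description=` as free text, because the descriptions shown here are unquoted and can contain commas. The function name `parse_field_definition` is hypothetical.

```python
import re
from typing import Dict, Optional

# Assumes the key order ID, Number, Type, Description used in the header above.
_FIELD_DEF = re.compile(
    r"^##(?P<kind>INFO|FORMAT)=<"
    r"ID=(?P<id>[^,]+),\s*"
    r"Number=(?P<number>[^,]+),\s*"
    r"Type=(?P<type>[^,]+),\s*"
    r"Description=(?P<description>.*?)\s*>$"
)


def parse_field_definition(line: str) -> Optional[Dict[str, str]]:
    """Return the parts of an ##INFO or ##FORMAT line, or None for other lines."""
    match = _FIELD_DEF.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Tolerate descriptions wrapped in double quotes as well as bare ones.
    fields["description"] = fields["description"].strip('"')
    return fields


header = [
    "##fileformat=VCFv4.1",
    "##FORMAT=<ID=GQ, Number=1, Type=Integer, Description=Genotype Quality>",
    "##FORMAT=<ID=AD,Number=.,Type=Integer,Description=Allelic depths for the ref and alt alleles in the order listed >",
]
for line in header:
    print(parse_field_definition(line))
```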

The following data types are exported to the VCF file:

Data Type          Description
CHROM              Chromosome.
POS                Position of the variation.
ID                 dbSNP ID or COSMIC ID associated with a variant.
REF                Reference allele.
ALT                Variant alleles.
QUAL               Not populated. '.' is specified in this column.
FILTER             Populated as PASS.
INFO               Not populated. '.' is specified in this column.
FORMAT:GT          Genotype data for each specimen.
FORMAT:GQ          Genotype quality. If no value is available in the database, '.' is specified in the file.
FORMAT:GQX         Mapped to the GENOTYPE_QUALITY_X column.
FORMAT:DP          Stores the TotalReadCount for a specific variant.
FORMAT:AD          Stores the reference read count and the allele read count for a specific variant.
FORMAT:HQ          Not currently populated. '.' is specified in this column.
FORMAT:FT          Stores the GENOTYPE_FILTER column value.
FORMAT:BQ          Stores the RMS base quality.
FORMAT:MQ          Stores the RMS mapping quality.
FORMAT:SS          Stores the somatic status.
FORMAT:SSC         Stores the somatic status score value.
Flex field format  If any custom formats are available, they are also included in the export.

The export follows the 1000 Genomes VCF 4.1 conventions for variation data. However, certain non-standard data types, such as BQ and MQ, may differ in convention for some customers because there is no standard way to represent them.
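
To show how the columns described above fit together, the following minimal Python sketch splits one VCF data line into its fixed columns and one dictionary per specimen. It relies only on the standard VCF 4.1 layout (eight fixed tab-separated columns, then FORMAT, then one column per sample). The function name `parse_data_line`, the specimen names, and the sample line are hypothetical illustrations, not output of the export.

```python
from typing import Dict, List

# The eight fixed columns of a VCF 4.1 data line, in order.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_data_line(line: str, sample_names: List[str]) -> Dict[str, object]:
    """Split one tab-separated VCF data line into fixed columns and per-sample fields."""
    parts = line.rstrip("\n").split("\t")
    record: Dict[str, object] = dict(zip(FIXED_COLUMNS, parts[:8]))
    format_keys = parts[8].split(":")
    samples: Dict[str, Dict[str, str]] = {}
    for name, column in zip(sample_names, parts[9:]):
        # A bare "." means there is no data for this specimen.
        samples[name] = {} if column == "." else dict(zip(format_keys, column.split(":")))
    record["samples"] = samples
    return record


# Hypothetical line: QUAL and INFO are ".", FILTER is PASS, second specimen has no data.
line = "15\t10000\t.\tA\tG\t.\tPASS\t.\tGT:GQ:DP:AD\t1/0:99:15:10,5\t."
record = parse_data_line(line, ["SPEC1", "SPEC2"])
print(record["samples"]["SPEC1"])
```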

Handle Non-variant and No-call Data

If NON_VARIANT and/or NOCALL records exist for a given position, the zygosity is checked to determine whether the format information from these tables is used.

Note:

For het-ref or half zygosity values, these other format fields are compared with the existing SEQUENCING information. This information is then used with zygosity to create the format string.

The NON_VARIANT data allows for GQ, GQX, MQ, BQ and the first reference read count of AD. The NOCALL data allows for all format fields to be compared. Neither NON_VARIANT nor NOCALL supports exporting flex fields. The GT value of the format string reflects the stored zygosity as follows:

Zygosity  FORMAT string GT:GQ:GQX:BQ:MQ:AD:DP
het-ref   1/0:99:98:38:45:20:10,10
Half      1/.:99:98:34,34:45,45:20:10,5
Het-alt   1/2:99:98:43,44:56,67:20:0,10,10
Hom       1/1:99:98:34,34:45,45:20:0,19

If a specimen has no result records, the export writes "." for that specimen, with no other format information.

Handle Duplicate Genetic Information Exports

Users might reload genetic information multiple times for the same specimen, which can create ambiguous values for the fields in the VCF export file. For ambiguous numerical values that represent quality (that is, GQ, GQX, AD, BQ, MQ), the export code computes the minimum so that the value of least confidence is reported. More complex cases can also occur, for instance two different alleles at the same position for the same specimen, or variants at the same position for the same specimen with different zygosity. To handle these, the export code applies MIN functions to all values, including all text fields. This lets the VCF export create a valid file that can be loaded into genome browsers.
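
The effect of taking minimum values across duplicate loads can be illustrated with a small Python sketch. This is not the export code itself, which is not shown in this document. The function name `collapse_duplicates` and the record field names (`specimen`, `chrom`, `pos`, and so on) are hypothetical, and the sketch assumes every duplicate row carries the same fields with comparable values.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def collapse_duplicates(
    records: Iterable[dict],
    key_fields: Sequence[str] = ("specimen", "chrom", "pos"),
) -> List[dict]:
    """Merge rows that share the key fields, keeping the minimum of every other field."""
    groups: Dict[tuple, List[dict]] = defaultdict(list)
    for record in records:
        groups[tuple(record[f] for f in key_fields)].append(record)

    collapsed = []
    for key, rows in groups.items():
        merged = dict(zip(key_fields, key))
        for field in rows[0]:
            if field not in key_fields:
                # min() on numbers keeps the least-confident quality value;
                # min() on text picks the alphabetically first value.
                merged[field] = min(row[field] for row in rows)
        collapsed.append(merged)
    return collapsed


# Hypothetical duplicate loads for one specimen and position.
loads = [
    {"specimen": "S1", "chrom": "15", "pos": 1000, "GQ": 99, "MQ": 45, "zygosity": "hom"},
    {"specimen": "S1", "chrom": "15", "pos": 1000, "GQ": 60, "MQ": 50, "zygosity": "het-ref"},
]
print(collapse_duplicates(loads))
```

In this sketch each field is minimized independently, so the merged row can combine values that came from different loads.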

Alternatively, you can choose not to consider data from a specific specimen or a specific file using following methods:

  • Using DELETE_FLG - A user may load results for a specimen more than once, and the new results can completely contradict the previous ones. Set DELETE_FLG to 'Y' on W_EHA_RSLT_SPECIMEN and/or W_EHA_SPEC_PATIENT or W_EHA_SPEC_SUBJECT to exclude the previous loads, and then reload the correct result files. A subsequent export considers only the latest loaded specimen data.
  • Using FILE_URI - Oracle recommends this method because, unlike the previous method, it does not require reloading the data. When multiple files with contradicting data are loaded for the same specimen, you can mark some files as obsolete by changing the W_EHA_FILE_LOAD.FILE_WID column. For example, if you have loaded the same specimen data three times and want only the latest loaded file considered for export, first identify the latest FILE_WID in the W_EHA_FILE_LOAD table. Then change the FILE_WID of the two older files in the W_EHA_FILE_LOAD table to the latest FILE_WID. All three records for the three file loads then contain the same FILE_WID, which represents the latest file load, and only data from the latest file is exported.

Represent AD Values

Allele depth values under the AD data type follow the order of the alleles in the GT value. The following table gives examples:

ALT    FORMAT  SAMPLE1    --
G,C,T  GT:AD   1/2:0,4,6  0 represents reference_read_count
                          4 represents allele_read_count of 'G'
                          6 represents allele_read_count of 'C'
G,T    GT:AD   2/2:0,4    0 represents reference_read_count
                          4 represents allele_read_count of 'T'
G,T    GT:AD   1/0:10,5   10 represents reference_read_count
                          5 represents allele_read_count of 'G'
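
The three examples above are consistent with one rule: the reference read count comes first, followed by one read count for each distinct non-reference allele in the order it appears in GT. The following minimal Python sketch applies that inferred rule. The function name `label_allele_depths` is hypothetical, and the rule is an assumption drawn from the examples, not a statement of the export code's logic.

```python
import re
from typing import List, Tuple


def label_allele_depths(alt: str, sample: str) -> List[Tuple[str, int]]:
    """Pair each AD value with the allele it counts.

    alt    -- the ALT column, for example "G,C,T"
    sample -- a "GT:AD" sample value, for example "1/2:0,4,6"
    """
    gt, ad = sample.split(":")
    alt_alleles = alt.split(",")
    depths = [int(value) for value in ad.split(",")]

    # Distinct non-reference allele indexes, in the order they appear in GT.
    indexes: List[int] = []
    for token in re.split(r"[/|]", gt):
        if token not in (".", "0") and int(token) not in indexes:
            indexes.append(int(token))

    # Assumed rule: reference count first, then one count per allele found in GT.
    labels = ["reference"] + [alt_alleles[i - 1] for i in indexes]
    if len(labels) != len(depths):
        raise ValueError(f"AD has {len(depths)} values but GT implies {len(labels)}")
    return list(zip(labels, depths))


for alt, sample in (("G,C,T", "1/2:0,4,6"), ("G,T", "2/2:0,4"), ("G,T", "1/0:10,5")):
    print(alt, sample, "->", label_allele_depths(alt, sample))
```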

Copy Number Variation - SEG

The copy number variation data is exported in SEG format. Currently, the export supports CNV data from any array-based system, such as the Affymetrix Genome Wide SNP 6 array, whose data is in SEG format when it is loaded into OHO. The main requirement for exporting CNV data is that the SEG_MEAN value is present in the CNV table of OHO.

To export data that is not loaded from SEG files, for example, data from CGI CNV files or any other source of CNV data, users must create their own loader. The loader is expected to calculate the SEG_MEAN value, because this value is the most important one for the export.

The following data types are exported to the SEG file:

  1. ID: specimen ID of the reported CNV segment
  2. chrom: chromosome name
  3. loc.start: start position of the CNV segment
  4. loc.end: end position of the CNV segment
  5. num.mark: for array-based CNV data, stores the number of probes
  6. seg.mean: stores the segment mean value from the SEG_MEAN column in the CNV table
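
A custom loader or a test harness may want to produce a file with the same six columns. The following minimal Python sketch writes them as tab-delimited text with one header row, which is the SEG layout that IGV documents. The function name `write_seg` and the sample row are hypothetical, and the exact header spelling emitted by the export is assumed to match the column names listed above.

```python
import csv
import io
from typing import Iterable, Mapping

# Column names taken from the list above.
SEG_COLUMNS = ["ID", "chrom", "loc.start", "loc.end", "num.mark", "seg.mean"]


def write_seg(rows: Iterable[Mapping[str, object]], handle) -> None:
    """Write CNV segments as tab-delimited text with a single header row."""
    writer = csv.DictWriter(
        handle, fieldnames=SEG_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical segment for one specimen.
buffer = io.StringIO()
write_seg(
    [{"ID": "SPEC1", "chrom": "15", "loc.start": 10000, "loc.end": 200000,
      "num.mark": 120, "seg.mean": -0.42}],
    buffer,
)
print(buffer.getvalue())
```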

Gene Expression - RES

RES is one of the gene expression formats supported by the IGV browser. Currently, only microarray gene expression data is exported to this format. The following data types are exported to the RES format:

  1. Description: HUGO name of a specific probe
  2. Accession: probe ID
  3. Intensity: intensity value of the associated probe
  4. Call: call of the associated probe

Gene Expression Dual Channel - GCT

GCT is one of the gene expression formats supported by the IGV browser. Currently, only AgilentG4502A platform microarray gene expression data is exported to this format. The following data types are exported to the GCT format:

  1. Description: Gene symbol of a specific probe
  2. Accession: probe ID
  3. Intensity: intensity value of the associated probe
  4. Call: call of the associated probe

Note:

The GCT file takes the gene symbol for the probe from the 2-channel composite element of the ADF file, which is loaded into the ADF composite table in OHO. In certain cases this value may not match the HUGO name, because OHTR associates 2-channel records in the result table based on partial genomic coordinate overlaps (including a flanking region set by the user) between composite elements and gene segments in the reference. In some cases this can also result in more than one unique gene in the reference mapping to a single composite element.

Specify the export options

Export data in three ways:

  • Select the option to download last loaded file(s)
  • Immediately, which is the default option
  • Schedule

After selecting the Export option, click Submit to finalize the genomic data export. The data is generated and a separate link is provided in the bottom panel for each result type.

About Export Options

The Immediately option displays the file link on the same screen, and you can click it to download the file right away. The link follows a specific naming convention: file type_OHO_date:MM-DD-YYYY_time:HH24*-MI-SS.file_type_extension. For example, RES_OHO_09-14-2014_04-26.res. A short description of the file, stating the data type and advice on the expected count of features, is displayed below the created link.
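
The naming convention can be reproduced in a script, for example to predict or match file names. The following minimal Python sketch follows the stated pattern, including seconds, although the example name above shows only two time components. The function name `export_file_name` and the lowercase extensions for VCF, SEG, and GCT are assumptions. Only ".res" appears in this document.

```python
from datetime import datetime

# Assumed extensions; only ".res" is shown in the documented example.
EXTENSIONS = {"VCF": "vcf", "SEG": "seg", "RES": "res", "GCT": "gct"}


def export_file_name(file_type: str, created: datetime) -> str:
    """Build <file type>_OHO_<MM-DD-YYYY>_<HH24-MI-SS>.<extension>."""
    stamp = created.strftime("%m-%d-%Y_%H-%M-%S")  # %H is the 24-hour clock
    return f"{file_type}_OHO_{stamp}.{EXTENSIONS[file_type]}"


print(export_file_name("RES", datetime(2014, 9, 14, 4, 26, 7)))
```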

The Schedule option runs the process as a job. You can track the status of the job from the Home, Jobs tab. This option is best suited for exporting large data sets like All Data or whole chromosome variants. For the scheduling option, you must provide a job name and description.

The database can contain replicate and duplicate data, typically because multiple files belonging to the same specimen_number were loaded. This can happen if the same library is sequenced multiple times or the data is reanalyzed. For example, if the reads are realigned against a new reference version, new VCF or gVCF files are created for the same sample. In this scenario, you can use the option to export VCF data only from the last loaded files. For example, if variation data was loaded for a specimen in Jan 2015, Mar 2015 and July 2015, this option exports data from the file loaded in July 2015 and ignores variants from the files loaded in Jan and Mar 2015.
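
The selection described above, keeping only the most recently loaded file for each specimen, can be sketched in Python as follows. This illustrates the behavior and is not the product's implementation. The function name `keep_last_loaded` and the field names `specimen`, `load_date`, and `variant` are hypothetical, and the sketch assumes each file load for a specimen has a distinct load date.

```python
from datetime import date
from typing import Dict, Iterable, List


def keep_last_loaded(variants: Iterable[dict]) -> List[dict]:
    """Keep only the variants that come from each specimen's most recent load."""
    rows = list(variants)
    latest: Dict[str, date] = {}
    for row in rows:
        specimen = row["specimen"]
        if specimen not in latest or row["load_date"] > latest[specimen]:
            latest[specimen] = row["load_date"]
    return [row for row in rows if row["load_date"] == latest[row["specimen"]]]


# Hypothetical loads mirroring the Jan, Mar and July 2015 example.
loaded = [
    {"specimen": "S1", "load_date": date(2015, 1, 10), "variant": "15:1000 A>G"},
    {"specimen": "S1", "load_date": date(2015, 3, 5), "variant": "15:1000 A>T"},
    {"specimen": "S1", "load_date": date(2015, 7, 20), "variant": "15:1000 A>C"},
]
print(keep_last_loaded(loaded))
```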

Note:

The Schedule Jobs option uses an asynchronous approach to store the file in DBFS. As an alternative to downloading the file using the link on the Oracle Healthcare Translational Research page, there are other ways to access DBFS. From a Linux OS, you can mount DBFS using the dbfs_client application and then browse the directories. Windows OS does not support the FUSE interface and cannot mount DBFS directly. However, there is a dbfs_client application for Windows that can execute commands to access DBFS. The Windows version of dbfs_client lets you use the command line to execute normal directory commands. You can list the DBFS directories as well as copy data from DBFS to the local drive. The dbfs_client application is part of the standard Oracle client software.

For more information about using dbfs_client, see http://docs.oracle.com/cd/E11882_01/appdev.112/e18294/adlob_client.htm#ADLOB0006.