14.9 PARSE Function

The PARSE function parses XML, XLSX, CSV, or JSON files and returns a generic table of the following structure:

LINE_NUMBER COL001 COL002 COL003 COL004 ... COL300

Values are generally returned in VARCHAR2 format. A returned table row can have a maximum of 300 columns. The maximum length for a VARCHAR2 table column is 4000 bytes; there is no maximum line length. 20 of the 300 supported columns can be handled as a CLOB.

File parsing happens on-the-fly as this function is invoked. No data is written to a collection or to a temporary table.

If the P_FILE_PROFILE parameter is not passed, the function computes a file profile with column information during parsing. If P_DETECT_DATA_TYPES is passed as 'Y' (the default), the function also detects column data types during parsing. The computed file profile can be retrieved using GET_FILE_PROFILE after this function has finished.

  1. Invoke PARSE - Use this table function to parse the files and get rows and columns in order to display a data preview.
  2. Invoke GET_FILE_PROFILE - Retrieve file profile information in JSON format.
  3. Process the data - Generate a SQL query based on the data profile to perform custom processing.
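The first two steps of this workflow can be sketched as follows. The bind variable :file_blob is a placeholder for the uploaded file content; column names and the row limit are illustrative.

```sql
-- Step 1: parse the file and fetch a few rows for a data preview
SELECT line_number, col001, col002, col003
  FROM TABLE( apex_data_parser.parse(
                  p_content   => :file_blob,
                  p_file_name => 'data.csv' ) )
 WHERE rownum <= 10;

-- Step 2: retrieve the file profile computed during the parse, as JSON
SELECT apex_data_parser.get_file_profile
  FROM dual;
```

The profile JSON returned by GET_FILE_PROFILE can then drive generation of a custom SQL query for step 3.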

Note:

  • JSON parsing is supported on 11.2 and 12.1.0.1 database versions. In this case, the function uses the APEX_JSON package and the XMLTABLE function. For performance reasons, it is recommended to upgrade the database to at least 12.2 - JSON parsing is faster by orders of magnitude on these versions.
  • XLSX parsing is done by using APEX_ZIP to extract individual XML files from the XLSX archive; the actual XLSX parsing is then done by using the XMLTABLE SQL function.

About CLOB Support

Starting with release 19.2, this package supports string values larger than 4000 bytes. 20 of the 300 supported columns can be handled as a CLOB. The level of CLOB support depends upon the file type being parsed.

  • CSV and XLSX
    • CLOB values are supported up to 32K
    • CLOB columns can be detected during discovery
    • When the data profile is discovered, values below 4000 bytes are returned in the COLNNN columns as usual. CLOB values are returned in the CLOBNN column, and the first 1000 characters are also returned in COLNNN. If a data profile with CLOB columns defined is passed in, all values are returned in the CLOBNN column only.
  • XML
    • CLOB values with more than 32K are supported
    • CLOB columns can be detected during discovery
    • When the data profile is discovered, values below 4000 bytes are returned in the COLNNN columns as usual. CLOB values are returned in the CLOBNN column, and the first 1000 characters are also returned in COLNNN. If a data profile with CLOB columns defined is passed in, all values are returned in the CLOBNN column only.
  • JSON
    • CLOB values with more than 32K are supported
    • CLOB columns are *not* detected during discovery; CLOB support is only active if a file profile containing CLOB columns is passed in as the P_FILE_PROFILE parameter
    • Because JSON_TABLE does not support CLOBs on 12c databases, the parser uses XMLTYPE-based processing if a file profile with CLOB columns is passed in. Processing will be significantly slower.

Syntax

FUNCTION PARSE(
    p_content                      IN BLOB,
    p_file_name                    IN VARCHAR2     DEFAULT NULL,
    p_file_type                    IN T_FILE_TYPE  DEFAULT NULL,
    p_file_profile                 IN CLOB         DEFAULT NULL,
    p_detect_data_types            IN VARCHAR2     DEFAULT 'Y',
    p_decimal_char                 IN VARCHAR2     DEFAULT NULL,
    p_xlsx_sheet_name              IN VARCHAR2     DEFAULT NULL,
    p_row_selector                 IN VARCHAR2     DEFAULT NULL,
    p_csv_row_delimiter            IN VARCHAR2     DEFAULT LF,
    p_csv_col_delimiter            IN VARCHAR2     DEFAULT NULL,
    p_csv_enclosed                 IN VARCHAR2     DEFAULT '"',
    p_skip_rows                    IN PLS_INTEGER  DEFAULT 0,
    p_add_headers_row              IN VARCHAR2     DEFAULT 'N',
    p_file_charset                 IN VARCHAR2     DEFAULT 'AL32UTF8',
    p_max_rows                     IN NUMBER       DEFAULT NULL,
    p_return_rows                  IN NUMBER       DEFAULT NULL,
    p_store_profile_to_collection  IN VARCHAR2     DEFAULT NULL ) RETURN apex_t_parser_table pipelined;

Parameters

Table 14-6 PARSE Function Parameters

Parameter Description

P_CONTENT

The file content to be parsed as a BLOB

P_FILE_NAME

The name of the file; only used to derive the file type. Either P_FILE_NAME, P_FILE_TYPE or P_FILE_PROFILE must be passed in.

P_FILE_TYPE

The type of the file to be parsed. Use this to explicitly pass the file type in. Either P_FILE_NAME, P_FILE_TYPE or P_FILE_PROFILE must be passed in.

P_FILE_PROFILE

File profile to be used for parsing. The file profile might have been computed in a previous PARSE() invocation. If passed in again, the function skips some profile detection logic and uses the passed-in profile, in order to improve performance.

P_DETECT_DATA_TYPES

Whether to detect data types (NUMBER, DATE, TIMESTAMP) during parsing. If set to 'Y', the function will compute the file profile and also add data type information to it. If set to 'N', no data types will be detected and all columns will be VARCHAR2. Default is 'Y'.

P_DECIMAL_CHAR

Use this decimal character when trying to detect NUMBER data types. If not specified, the function auto-detects the decimal character.

P_XLSX_SHEET_NAME

For XLSX workbooks. The name of the worksheet to parse. If omitted, the function uses the first worksheet found.

P_ROW_SELECTOR

For JSON and XML files. Pointer to the array / list of rows within the JSON or XML file. If omitted, the function will:

  • For XML files: Use "/*/*" (first tag under the root tag) as the row selector.
  • For JSON files: Look for a JSON array and use the first array found.
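For example, a JSON file whose rows live in a nested array can be parsed by pointing P_ROW_SELECTOR at that array. This is a hedged sketch; the attribute name "items" and the column names are hypothetical and depend on the actual file structure.

```sql
-- Parse a JSON file of the form { "items": [ {...}, {...}, ... ] }
SELECT line_number, col001, col002
  FROM TABLE( apex_data_parser.parse(
                  p_content      => :json_blob,
                  p_file_name    => 'data.json',
                  p_row_selector => 'items' ) );
```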

P_CSV_ROW_DELIMITER

Override the default row delimiter for CSV parsing. Limited to one character and defaults to Linefeed (LF). Note that the Linefeed row delimiter also handles "Carriage Return/Linefeed" (CRLF).

P_CSV_COL_DELIMITER

Use a specific CSV column delimiter. If omitted, the function will detect the column delimiter based on the first row contents.

P_CSV_ENCLOSED

Override the default enclosure character for CSV parsing.

P_SKIP_ROWS

Skip the first N rows when parsing.

P_ADD_HEADERS_ROW

For XML and JSON files: emit the column headers (tag or attribute names) as the first row.

P_FILE_CHARSET

File encoding, if not UTF-8 (AL32UTF8).

P_MAX_ROWS

Stop parsing after P_MAX_ROWS have been returned.

P_RETURN_ROWS

The number of rows to return. This is useful when the parser is to parse more rows (for data type detection) than it is supposed to return. When the specified number of rows has been emitted, the function continues parsing (and refining the detected data types) until P_MAX_ROWS has been reached, or until the ROWNUM < x clause of the SQL query kicks in and stops execution.
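The interplay of the two parameters can be sketched as follows: only 20 rows are emitted for the preview, but up to 500 rows are parsed so that data type detection sees a larger sample. File and column names are illustrative.

```sql
-- Return 20 preview rows, but base data type detection on up to 500 rows
SELECT line_number, col001, col002, col003
  FROM TABLE( apex_data_parser.parse(
                  p_content     => :file_blob,
                  p_file_name   => 'data.csv',
                  p_return_rows => 20,
                  p_max_rows    => 500 ) );
```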

P_STORE_PROFILE_TO_COLLECTION

Store the file profile computed during parsing into a collection. If the collection exists, it will be cleared. This parameter is only used for computed profiles.

Returns

Returns rows of the APEX_T_PARSER_ROW type.

LINE_NUMBER COL001 COL002 COL003 COL004 ... COL300
 

Example

select line_number, col001,col002,col003,col004,col005,col006,col007,col008 
   from table( 
              apex_data_parser.parse(
                  p_content         => {BLOB containing XLSX spreadsheet},
                  p_file_name       => 'test.xlsx',
                  p_xlsx_sheet_name => 'sheet1.xml') ) ;

LINE_NUMBER COL001   COL002       COL003       COL004   COL005          COL006   COL007       COL008 
----------- -------- ------------ ------------ -------- --------------- -------- ------------ ------------- 
          1 0        First Name   Last Name    Gender   Country         Age      Date         Id       
          2 1        Dulce        Abril        Female   United States   32       15/10/2017   1562     
          3 2        Mara         Hashimoto    Female   Great Britain   25       16/08/2016   1582     
          4 3        Philip       Gent         Male     France          36       21/05/2015   2587     
          5 4        Kathleen     Hanner       Female   United States   25       15/10/2017   3549     
          6 5        Nereida      Magwood      Female   United States   58       16/08/2016   2468     
          7 6        Gaston       Brumm        Male     United States   24       21/05/2015   2554     
          8 7        Etta         Hurn         Female   Great Britain   56       15/10/2017   3598     
          9 8        Earlean      Melgar       Female   United States   27       16/08/2016   2456     
         10 9        Vincenza     Weiland      Female   United States   40       21/05/2015   6548     
          : :        :            :            :        :               :        :            :

select line_number, col001,col002,col003,col004,col005,col006,col007,col008 
   from table( 
              apex_data_parser.parse(
                  p_content         => {BLOB containing JSON file},
                  p_file_name       => 'test.json') ) ;

LINE_NUMBER COL001    COL002   COL003                                COL004          COL005          
----------- --------- -------- ------------------------------------- --------------- ---------------
          1 Feature   1.5      41km E of Cape Yakataga, Alaska       1536513727239   1536514117117   
          2 Feature   0.21     11km ENE of Aguanga, CA               1536513299520   1536513521231   
          3 Feature   1.84     5km SSW of Pahala, Hawaii             1536513262940   1536513459610   
          4 Feature   2.55     9km W of Volcano, Hawaii              1536513100890   1536513446680   
          5 Feature   1.3      62km ESE of Cape Yakataga, Alaska     1536512917361   1536513322236   
          6 Feature   1.79     7km SW of Tiptonville, Tennessee      1536512379690   1536512668010   
          7 Feature   1.9      126km NNW of Arctic Village, Alaska   1536512346186   1536512846567   
          8 Feature   1.4      105km NW of Arctic Village, Alaska    1536512140162   1536512846334

About Large CSV Files

If the BLOB passed to APEX_DATA_PARSER.PARSE is smaller than 50 MB, Oracle Application Express copies the BLOB to an _internal, cached_ temporary LOB, so all CSV parsing is done in memory. For larger BLOBs, Oracle Application Express does CSV parsing on the original BLOB locator. If that BLOB is selected from a table, CSV parsing can happen on disk and might be significantly slower. Expect a performance degradation when parsed CSV files grow beyond 50 MB.

However, developers can also use the DBMS_LOB.CREATETEMPORARY (passing CACHE => TRUE ) and DBMS_LOB.COPY procedures in order to explicitly create a cached temporary LOB, even for a larger file. Instead of the original BLOB, the cached temporary LOB can be passed to APEX_DATA_PARSER.PARSE. This approach also enables in-memory parsing for files larger than 50 MB.
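A minimal PL/SQL sketch of this approach follows. The table MY_FILES and its FILE_CONTENT column are hypothetical placeholders for wherever the large file is stored.

```sql
DECLARE
    l_file   BLOB;   -- the original (large) file, e.g. selected from a table
    l_cached BLOB;   -- cached temporary LOB, enabling in-memory parsing
BEGIN
    SELECT file_content INTO l_file FROM my_files WHERE id = 1;

    -- create a cached temporary LOB and copy the file contents into it
    DBMS_LOB.CREATETEMPORARY( lob_loc => l_cached, cache => TRUE );
    DBMS_LOB.COPY( dest_lob => l_cached,
                   src_lob  => l_file,
                   amount   => DBMS_LOB.GETLENGTH( l_file ) );

    -- pass l_cached (instead of l_file) to APEX_DATA_PARSER.PARSE
END;
```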

Note:

"CREATETEMPORARY Procedures" and "COPY Procedures" in Oracle Database PL/SQL Packages and Types Reference