Linking to Objects in Cloud Storage

When you create a link to files in a cloud store bucket from your Oracle Autonomous Database, you create an external table that links to the files in the cloud store.

You can link to files in these file formats: AVRO, CSV, JSON, GeoJSON, Parquet, ORC, Delimited TXT. For information on supported file formats, see Format Specifications for JSON, AVRO, and XML Files.

Configure and run a data link job from the Link Cloud Object page. To open that page:

  1. Open the Database Actions launchpad, click the Data Studio tab, and select the Data Load menu. See The Data Load Page.
  2. Select LINK DATA and CLOUD STORE.

On the left side of the page is a navigator pane, where you choose a cloud store connection and the folders or files containing the data. On the right of the page is the data load cart, where you stage the files and folders for the data link job. You can set options for the data link job before running it. The Autonomous Database comes with predefined CPU/IO shares assigned to different consumer groups. You can set the consumer group to either low, medium or high while executing a data load job depending on your workload.

To link to files from a cloud store, do the following:

Manage Cloud Storage Links for Data Link Jobs

Before you can link to data in a cloud store, you must establish a connection to the cloud store you want to use.

On the Link Cloud Object page:

  1. Click the Manage cloud store icon beside the field where you enter the cloud store location. Select + Create Cloud Store Location.

  2. Enter your information in the Add Cloud Store Location pane.

See Managing Connections.

To return to the Link Cloud Object page, click Data Load in the breadcrumbs at the top of the page and then navigate back to the page.
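As an alternative to the UI, a cloud store connection can also be registered in SQL with the DBMS_CLOUD package. The following is a minimal sketch; the credential name, username, and password value are hypothetical placeholders, not values from this document:

```sql
-- Sketch: register a cloud store credential with DBMS_CLOUD.
-- All names and values below are hypothetical.
BEGIN
  DBMS_CLOUD.CREATE_CREDENTIAL(
    credential_name => 'MY_CLOUD_CRED',          -- hypothetical credential name
    username        => 'cloud_user@example.com', -- cloud account user
    password        => 'auth_token_value'        -- auth token, not the account password
  );
END;
/
```

A credential registered this way can then be referenced by name when creating external tables over objects in the bucket.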

Prepare the Data Link Job

You may have to adjust your source data or your target table so that the source data links correctly to the external target table. Consider:

  • If you're linking to multiple files, you must make sure that:

    • All the source files are of the same type, for example, CSV, JSON, etc.

    • The number, order, and data types of the columns in all the source files match.

  • If you want to partition by date:

    • The source files must contain a column whose data type is DATE or TIMESTAMP.

    • You must load a folder containing two or more data sources.

    • The names of the files in the folder must indicate a date or dates, for example, MAR-1999.csv or 2017-04-21.xlsx.
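Under the hood, a date-partitioned data link corresponds to a partitioned external table, with one partition location per source file. The sketch below uses DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE with an explicit partitioning clause; the table name, credential, columns, and object URLs are all hypothetical, and the actual SQL generated by the tool (visible on the SQL tab) may differ:

```sql
-- Sketch: a date-partitioned external table, one partition per source file.
-- Table name, credential, columns, and URLs are hypothetical.
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE(
    table_name      => 'SALES_EXT',
    credential_name => 'MY_CLOUD_CRED',
    format          => json_object('type' value 'csv', 'skipheaders' value '1'),
    column_list     => 'sale_date DATE, amount NUMBER',
    partitioning_clause =>
      'PARTITION BY RANGE (sale_date)
       (PARTITION p_apr VALUES LESS THAN (DATE ''2017-05-01'')
          LOCATION (''https://objectstorage.example.com/b/bucket/o/2017-04-21.csv''),
        PARTITION p_may VALUES LESS THAN (DATE ''2017-06-01'')
          LOCATION (''https://objectstorage.example.com/b/bucket/o/2017-05-21.csv''))'
  );
END;
/
```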

Add Files or Folders for the Data Link Job

Add files from the cloud store to the data link cart, where you can edit the details of the data link job. To add the files:

  1. From the list at the top of the navigator pane on the left, select the bucket with your source data.

    The list shows links that were established on the Manage Cloud Storage page. If you haven't yet registered the cloud store you want to use, click the Connections button under the Data Load menu in the Data Studio suite of tools and register a connection.

  2. Drag one or more items from the file navigator on the left and drop them into the cart on the right.

    • You can add files, folders, or both. A card is added to the cart for each file or folder you drag into it. The card lists the name of the source file or folder and a proposed name for the target table.

    • If you add a folder that contains multiple files, all the files must be of the same type, that is, CSV, TXT, etc.

      When you add the folder to the cart, a prompt is displayed that asks if you want to load all the objects from the multiple source files into a single target table. Click Yes to continue or No to cancel.

    • When you add multiple individual files or multiple folders to the cart, the data represented by each card will be loaded into a separate table, but all the items in the cart will be processed as part of the same data load job.

    • You can add files or folders from a different bucket, but if you do that, you're prompted to remove all files that are already in the cart before proceeding. To select files from a different bucket, select the bucket from the drop-down list in the navigator pane on the left and then add the file(s), as described above.

    • You can drop files or folders into the data load cart and then navigate away from the Data Link Object page. When you return to the page, those items remain on the page, but you may receive a message, "Remove All Data Link Items. Changing to another Cloud storage location requires all items to be removed from the data load job. Do you wish to continue?" Click Yes to remove the items from the cart. Click No to keep the items in the cart. Then you can continue to work.

You can remove items from the cart before running the data link job:

  • To remove an item from the cart, click Remove on the card for that item.

  • To remove all items from the cart, click Remove All in the data link cart menu bar at the top of the pane.

Enter Details for the Data Link Job

Enter the details about the data link job in the Link Data from Cloud Storage pane.

On the card in the data link cart, click Settings to open the Link Data from Cloud Storage pane for that job. The pane contains:

Settings Tab - Table Section

Set details about the target table in the Table section.

  • Name: The name of the target table.
  • Partition Column:

    List partitioning and date-based partitioning are the two types of partitioning available in data linking.

    List partitioning is required when you specifically want to map rows to partitions based on discrete values.

    To partition according to a specific column, click the Partition Column drop-down list and select the column you want to use for the partitioning.

    You will have N files per partition value, all partitioned by the partition column you select.

    Note:

    • For linked files (from external tables) there is also a requirement that for each file, the list partitioning column can contain only a single distinct value across all of the rows.
    • If a file is list partitioned, the partitioning key can only consist of a single column of the table.

    Date-based partitioning is available when you link a folder containing two or more data sources that have columns that contain date or timestamp data.

    To partition according to date, click the Partition Column drop-down list and select the DATE or TIMESTAMP column you want to use for the partitioning.

  • Validation Type: Validation examines the source files and any optional partitioning information, and reports rows that do not match the format options specified. Select None for no validation, Sample to perform validation based on a sample of the data, or Full to perform validation based on all the data.

  • Use Wildcard: Select this check box to use wildcard characters in the search condition, retrieving the group of files that matches the filter criteria. The external table then also detects and adds new files that match the filter.

    For example, if you enter file*, then file01, file02, file03, and so on match the filter. The asterisk (*) matches zero or more characters.

    Note:

    Wildcard support is incompatible with partitioning: validation of the source files fails if you use wildcards with partitioned data.
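The Validation Type option corresponds to the DBMS_CLOUD.VALIDATE_EXTERNAL_TABLE procedure, which can be run by hand against a linked table. A minimal sketch, assuming a hypothetical table name:

```sql
-- Sketch: validate a linked external table (table name is hypothetical).
-- rowcount => 100 checks a sample of rows; omit rowcount to validate all rows.
BEGIN
  DBMS_CLOUD.VALIDATE_EXTERNAL_TABLE(
    table_name => 'SALES_EXT',
    rowcount   => 100
  );
END;
/
```

Rows that do not match the format options are written to log and bad-file tables rather than silently dropped, which is what the Job Report surfaces as rejected rows.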

Settings Tab - Properties Section

Specify options to control how the source data is interpreted, previewed, and processed. These options vary, depending on the type of source data.

  • Encoding: Select a character encoding type from the list. This option is available when the linked file is in plain text format (CSV, TSV, or TXT). The default encoding type is UTF-8.

  • Text enclosure: Select the character for enclosing text: " (double-quote character), ' (single-quote character) or None. This option is visible only when the selected file is in plain text format (CSV, TSV, or TXT).

  • Field delimiter: Select the delimiter character used to separate columns in the source. For example, if the source file uses semicolons to delimit the columns, select Semicolon from this list. The default is Comma. This option is visible only when the selected file is in plain text format (CSV, TSV, or TXT).

  • Start processing data at row: Specifies the number of rows to skip when linking the source data to the target external table:

    • If you select the Column header row option under Source column name (see below) and if you enter a number greater than 0 in the Start processing data at row field, then that number of rows after the first row are not linked to the target.

    • If you deselect the Column header row option under Source column name, and if you enter a number greater than 0 in the Start processing data at row field, then that number of rows including the first row are not linked to the target.

  • Source column name: Select the Column header row checkbox to use the column names from the source file in the target table.

    • If you select this option, the first row in the file is processed as column names. The rows in the Mapping section, below, are filled with those names (and with the existing data types, unless you change them).

    • If you deselect this option, the first row is processed as data. To specify column names manually, enter a name for each target column in the Mapping section. (You will also have to enter data types.)

  • Numeric column: Select the Convert invalid data to null checkbox to convert an invalid numeric column value into a null value.

  • Newlines included in data values: Select this option if the data fields contain newline or carriage-return characters. Selecting this option increases the time taken to process the load. If you do not select this option, rows with newlines in their fields are rejected; you can view the rejected rows in the Job Report panel.
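Most of the Properties settings above map onto format options of DBMS_CLOUD.CREATE_EXTERNAL_TABLE. The sketch below shows plausible equivalents; the table name, credential, URL, and columns are hypothetical, and the authoritative SQL for any given job is shown on the SQL tab:

```sql
-- Sketch: Properties settings expressed as DBMS_CLOUD format options.
-- Table name, credential, URL, and columns are hypothetical.
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
    table_name      => 'EMP_EXT',
    credential_name => 'MY_CLOUD_CRED',
    file_uri_list   => 'https://objectstorage.example.com/b/bucket/o/emp*.csv', -- wildcard
    format          => json_object(
                         'type'             value 'csv',
                         'characterset'     value 'UTF8',       -- Encoding
                         'quote'            value '"',          -- Text enclosure
                         'delimiter'        value ';',          -- Field delimiter
                         'skipheaders'      value '1',          -- Column header row
                         'conversionerrors' value 'store_null'  -- Convert invalid data to null
                       ),
    column_list     => 'emp_id NUMBER, emp_name VARCHAR2(100)'
  );
END;
/
```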

Settings Tab - Mapping Section

The settings in the Mapping section control how data from the source files are linked to the rows of the target external table. For each row, the data from the column listed under Source column will be linked to the column listed under Target column.

  • Source column: Lists the columns from the source file.

    If the Column header row option under Properties is selected, Source column shows the names of the columns in the source file. If the Column header row option is not selected, generic names like COLUMN_1, COLUMN_2, etc., are used. This field is always read only.

    Two additional source columns are available: FILE$NAME and SYSTIMESTAMP. The FILE$NAME column enables you to locate the source file containing a particular data record. For example, suppose you load a folder of files whose names refer to department names across the organization; a finance.txt file contains data from the Finance department. In the mapping, you can use string data types to extract the department name from the file name column, and then use the extracted name to process data differently for each department.

    The SYSTIMESTAMP column shows the current timestamp in the database.

    Note:

    • FILE$NAME and SYSTIMESTAMP source columns are not included by default. You must check the Include check box and run the load for the target table to display these two columns.
    • When you are creating a livefeed, the FILE$NAME and SYSTIMESTAMP source columns appear in the Mapping table by default.
  • Target column: Lists the columns in the target table.

    • If the Column header row option is selected, then the Target column uses the names of the columns in the source file. You can change the name of a target column by replacing the provided name with a new one. A target column name must not be empty, must not duplicate the name of another target column, and must not exceed 128 bytes (a database limit).

    • If the Column header row option is not selected, then generic names like COLUMN_1, COLUMN_2, etc., are used. You can change the name of a target column by replacing the provided name with a new one.

    Note:

    If you're linking multiple files from a folder in a single data link job, only the first file will be shown in the Mapping section. However, as long as the column names and data types match, the data from all source files will be linked.

  • Data Type: Lists the data type to use for data in that column. The contents change depending on whether the Column header row option is selected.

    • If the Column header row option is selected, Data type shows the data types of the columns in the source file. If you want to change the data type for the target, click the name and select a different one from the list.

    • If the Column header row option is not selected, Data type shows all available data types. Select the data type to use for the target column from the list.

  • Length/Precision (Optional): For columns where the Data Type is NUMBER, enter the length/precision for the numbers in the column. Precision is the number of significant digits in a number. Precision can range from 1 to 38.

    For columns where the Data Type is VARCHAR2, selecting Auto in the Length/Precision field enables the Auto Size feature, which automatically sizes the column to fit the largest value in the column. Select Auto from the Length/Precision drop-down list, or pick a specific value from the list.

  • Scale (Optional): For columns where the Data Type is NUMBER, enter the scale for the numbers in the column. Scale is the number of digits to the right (positive) or left (negative) of the decimal point. Scale can range from -84 to 127.

  • Format: If the data type in the Data type column is DATE or one of the TIMESTAMP types, select a format for that type from the Format drop-down list.
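The Mapping settings likewise translate into the column list and date/timestamp format options of the generated SQL. A sketch under the same assumptions (all names and values hypothetical):

```sql
-- Sketch: Mapping settings expressed in SQL.
-- Length/Precision and Scale become NUMBER(p,s); Format becomes dateformat.
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
    table_name      => 'ORDERS_EXT',
    credential_name => 'MY_CLOUD_CRED',
    file_uri_list   => 'https://objectstorage.example.com/b/bucket/o/orders.csv',
    format          => json_object(
                         'type'        value 'csv',
                         'skipheaders' value '1',
                         'dateformat'  value 'YYYY-MM-DD'  -- Format for DATE columns
                       ),
    column_list     => 'order_id   NUMBER(10),    -- Length/Precision 10
                        amount     NUMBER(12,2),  -- Precision 12, Scale 2
                        order_date DATE,
                        notes      VARCHAR2(4000)'
  );
END;
/
```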

Preview Tab

The Load Preview menu in the Preview tab displays the source data in tabular form. The display reflects the settings you chose in the Properties section. The File menu displays source data with the column names.

If you dragged a folder containing multiple files into the data link cart and then clicked Settings for that card, the Preview pane includes a Preview Object (File) drop-down list at the top of the pane that lists all the files in the folder. Select the source file you want to preview from that list.

Table Tab

The Table tab displays what the target table is expected to look like after the data has been linked.

SQL Tab

The SQL tab displays the SQL commands that will be run to complete this data link job.

Note:

You can see the SQL code even before the table is created.

Close Button - Save and Close the Pane

After entering all the details for the data link job, click Close at the bottom of the page. This saves the details you entered and returns you to the Link Cloud Object page.

Run the Data Link Job

Once you've added data sources to the data link cart and entered details about the data link job, you can run the job.

To run the job:

  1. If you haven't already done so, click the Close button in the Link Data from Cloud Storage pane to save your settings and close the pane. If any of the settings are invalid, an error message reports the problem. Fix the problem and click Close.
  2. Click Start in the data link cart menu bar. To stop the data link job, click Stop.

    Once the data link job starts, you can view its progress on the Data Load dashboard.

    When the data link job completes, the Data Load Dashboard page displays the results of the job under the Table and View Loads section.

View Details About the Data Link Job After It Is Run

You can view the progress of the job on the Data Load dashboard.

When the data load job completes, the Data Load dashboard page displays the results of the job. In the header of the table load, you can view the name of the table along with the total number of columns in the table.

Click Job Report to view the total number of rows processed successfully and the count of rejected rows. You can also view the Start time. The SQL pane of the Job Report displays the equivalent SQL code of the job.

To view information about an item in the job, click the Actions icon on the Table Load.

To view a log of the load operation, click the Logging icon. You can save the log, clear it, or refresh it. Click OK to dismiss the log.

View the Table Resulting from the Data Link Job

After running a data link job, you can view the table created by the data link job on the Data Load dashboard.

Fix your data load job. After your data load job runs, you might see errors that you want to correct, or, upon inspection, realize that you wanted to name a column differently. In such cases, click the Reload option on the selected Table Load to reload the cards from your recent cart and edit them as you did before your first attempt. The Reload icon reloads the source data with the fixes suggested by the tool. To make other changes to the data load job (for example, changing a column name), click the Actions icon on the table header, click Table, and select Edit.