Source data format

You can load source data from a variety of formats.

Your Information Discovery applications will most often read data directly from one or more database systems, or from database extracts. Input components load records in a variety of formats including delimited, JDBC, and XML. Each input component has its own set of configuration properties. One of the most commonly used type of input component loads data stored in delimited format.

The format used as an example in this chapter is a two-dimensional format similar to the tables found in database management systems. Database tables are organized into rows of records, with columns that represent the source properties and property values for each record. (This type of format is often called a rectangular data format.) The source records are stored in a CSV file named FactSales.csv (the file is in the data-in folder of the Integrator Quick Start sample application).

The following image, which shows the beginning lines of the FactSales.csv input file, illustrates how the source data is organized in a two-dimensional format:

You specify the location and format of the source data to be loaded in the Integrator reader component in the graph. The reader component passes the data to the Information Discovery connector, which is configured to connect to either the Data Ingest Web Service (DIWS) or the Bulk Load Interface, both of which reside on the Dgraph. The records are then loaded into the Dgraph in batches of a pre-configured size. During the ingest operation, each source row is transformed into an Endeca record with a key-value pair for each non-null source column. The Dgraph then indexes the records for use during search queries.

Primary key attribute

You will be using one of the Endeca standard attributes as the primary-key attribute for the records. (The primary-key property is also known as the record spec property.) The primary-key property must be a unique property. For more information on primary keys, see the Oracle Endeca Server Data Loading Guide.

In our sample graph, the FactSales.csv input file does not have a field that contains unique values. Therefore, the Create Spec component creates the FactSales_RecordSpec primary-key attribute by concatenating two attributes. The name of the primary-key attribute will be specified in the Metadata definition for the Edge component.

Use of hyphens in input property names

Although the Dgraph will accept attribute names with hyphens (because hyphens are valid NCName characters), the Integrator will not accept source property names with hyphens as metadata. Therefore, if you have a source property name such as "Ship-Date", make sure you remove the hyphen from the name.

Using multi-assign data

Your source data may have multi-assign properties, that is, a property that has more than one value. For example, instead of having two properties (say, Color1 and Color2) in which each property has only one value, you can instead have one property (say, Color) with multiple values, as in this simple example:

ComponentID|Color|Size
123|Blue|Medium
456|Blue;Red|Small
789|Red;Black;Silver|Large

In the example, the pipe character (|) is the delimiter between the properties, while the semi-colon (;) is the delimiter between multiple values in a given property. For example, the Color property for record 789 has values of "Red", "Black", and "Silver".

When configuring the Bulk Add/Replace Records connector, you can then specify that the semi-colon is to be used as the delimiter for multi-assign properties.

Keep in mind that an Endeca property that is multi-assign must have the mdex-property_IsSingleAssign property set to false in its PDR. The default value of the property is false, which means the property is enabled for multi-assign by default.