Chapter 5 Configuring the Universal Text Parser

The Universal Text Parser links an external data source with the Meta-Directory Join Engine. You can also synchronize various text-based data with the Meta-Directory views.

This chapter has the following sections:

Overview

About the Task.cfg Configuration File

Setting Up the Universal Text Parser

Creating the Configuration Files for Data Files

About UTC Data Exchange Format

Using Special Characters and UTF8 Data in DN

Overview

The Universal Text Parser (UTP) is a generic text file parser and generator that you can use to build connectors for sources that supply data in a text-based format. You can customize the Universal Text Parser by describing the text-based data that is input to the Universal connector. You can also specify the output of the Universal connector, enabling you to fully synchronize data with an external data source.

The Universal Text Parser uses Perl scripts to provide the link between the external data source and the Universal connector. To use the Universal Text Parser, you export the external database to an ASCII text file, which the Universal Text Parser synchronizes with the Connector View. In the other direction, the Universal Text Parser writes data to a file that the external database can import. (The external database itself performs the export and the import.)

Meta-Directory supplies three pre-configured configuration files for the most common text-based data representations:

Comma-separated values

Name-value pairs

LDIF

You can configure the Universal Text Parser to support data in any text format. In the input text file, UTF8 data in distinguished name (DN) and vrn values must be escaped with the \xx notation, according to RFC 2253. UTF8 data in other attribute values should be base64 encoded over the UTF8 encoding. Special characters in distinguished name and vrn values should also be escaped.

Universal Text Parser Modules

The Universal Text Parser consists of the following program modules:

Table 5-1 Universal Text Parser modules
task.cfg	Text configuration file used to describe data that you want to flow into and out of the Connector View. You must customize this file for each external data source to synchronize with the Meta-Directory views.
template.pl	Perl library that implements the interface between the script and Universal Text Connector.
universal.pm	Perl module that contains the main engine for interpreting user settings in the configuration file.
textparser.pm	Perl module that contains a generic set of routines for parsing files.
connectorutils.pm	Perl module that contains various generic routines that a Perl connector may use.

These modules are located in the following directory of the installed Meta-Directory:

NETSITE_ROOT/bin/utc50/install/templates/universalparser

NETSITE_ROOT is the root directory of the Sun ONE Meta-Directory installation.



Caution	Do not modify any of the Universal Text Parser modules other than the task.cfg configuration file.

About the Task.cfg Configuration File

To synchronize information between a text-based data source and the Meta-Directory views, you must configure an instance of the Universal Text Parser for each separate text-based data source you need to synchronize.

You can configure the link between the external data source and Universal Text Parser by creating a custom configuration file ‘task.cfg’. This configuration file describes the data format and file specifics of the text-based data.

The task.cfg configuration file describes the text-based data that the external data source exports to the Universal Text Parser. The task.cfg file also describes the text-based output of the Universal Text Parser, which you can import to the external database.

Pre-Configured Files

The task.cfg configuration file specifies the details of the text-based data. The Meta-Directory package includes three pre-configured files that you can use as the boilerplate for your task.cfg file. Each pre-configured file supports a specific format of text-based information, as listed in Table 5-2.

Table 5-2 Pre-configured templates for the task.cfg configuration file
Format Supported	Boilerplate File Name	Description
Comma-Separated Value	csv.cfg	Supports the synchronization of data formatted as a comma-separated value (CSV) list.
Name-Value Pair	nvp.cfg	Supports the synchronization of data formatted as a name-value pair (NVP) list.
LDIF	ldif.cfg	Supports the synchronization of data formatted in LDIF.

Each of the pre-configured boilerplate files has an example input file that matches the settings in the boilerplate. Accordingly, these files are named sampledata.csv, sampledata.nvp, and sampledata.ldif, and can be found in the directory containing the rest of the Universal Text Parser files.

To customize a configuration file for your requirements, modify the appropriate boilerplate file to match the entry attributes and mappings of the data stream. Details on how to customize each file can be reviewed in the file as comments.

Non-Conforming Formats

If you have text-based data that does not conform to the format of one of the pre-configured files, you can modify the task.cfg file.

The modification of the supplied task.cfg file is supported through the services of the Sun ONE Professional Services. If your needs require a more detailed modification of the Universal Text Parser modules than what is described in this chapter, contact your Sun ONE Professional Services representative.

Setting Up the Universal Text Parser

To set up the files in the Universal Text Parser

Copy these Universal Text Parser modules to the directory containing the text-based data for processing:

template.pl

connectorutils.pm

textparser.pm

universal.pm

These modules are described in "Universal Text Parser Modules".

Copy the appropriate boilerplate file to the same directory as the modules and rename the file to task.cfg.

The boilerplate file is either csv.cfg, nvp.cfg or ldif.cfg, as described in the section "Pre-Configured Files".

Modify the boilerplate file (now named task.cfg) according to the instructions described in these sections:

Creating the File for Comma-Separated Value Files

Creating the File for Name-Value Pair Files

Creating the File for LDIF Files

When modifying the task.cfg file, note that regular expressions are parsed using the Perl regular expression syntax, and not the regular expression syntax used by UNIX systems.

Configure the Universal Connector as described in Chapter 4, "Configuring the Universal Connector."

When you create the connector instance, ensure that you load the schema, as described in "Creating the Universal Connector Instance".

Select the connector instance, and then select the Script tab.

Enter the path name and file name of the script .pl file.

Restart the connector instance, as described in "Restarting the Connector Instance."
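The file-copy steps at the start of this procedure can be scripted. The following Python sketch is one possible way to do that and is not a tool shipped with Meta-Directory. The function name stage_utp_files and its parameters are hypothetical; the file names come from this chapter, and template_dir stands for the universalparser directory under your own NETSITE_ROOT.

```python
import shutil
from pathlib import Path

# Module files named in this chapter; task.cfg is created from a boilerplate.
UTP_MODULES = ["template.pl", "connectorutils.pm", "textparser.pm", "universal.pm"]
BOILERPLATES = {"csv": "csv.cfg", "nvp": "nvp.cfg", "ldif": "ldif.cfg"}

def stage_utp_files(template_dir, data_dir, data_format):
    """Copy the UTP modules and one boilerplate (as task.cfg) into data_dir."""
    src, dst = Path(template_dir), Path(data_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for name in UTP_MODULES:
        shutil.copy2(src / name, dst / name)
    # Copy the boilerplate under its new name so the original stays untouched.
    shutil.copy2(src / BOILERPLATES[data_format], dst / "task.cfg")
    return sorted(p.name for p in dst.iterdir())
```

The remaining steps (editing task.cfg, configuring the connector instance, and restarting it) still have to be done as described above.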

Creating the Configuration Files for Data Files

Creating the File for Comma-Separated Value Files

Use the csv.cfg file as a boilerplate for the task.cfg configuration file if the data is formatted in a comma-separated value (CSV) ASCII text file.

Before You Begin

Consider the following information when synchronizing a comma-separated value data file with the Universal Text Parser:

All attribute names in the LineFormat and ImportLineFormat statements must map to the attribute names in the declared LDAP schema before the Universal Text Parser can process entries. By default, the comma-separated value boilerplate file uses inetOrgPerson as the LDAP schema.

When modifying the task.cfg text file, be sure to use a text editor that does not delete trailing whitespace from the end of lines. Removing the whitespace at the end of certain lines might lead to errors when the Universal Text Parser interprets these lines.

Creating the File

After you have set up the Universal Text Parser, make the following modifications to the task.cfg file:

Modify the LineFormat statement so that it contains the attribute names of the data you want to synchronize. Separate each attribute name with a comma.

If needed, map the attribute names in the data file you are inputting to the names defined in the specified LDAP schema.

All attribute names must be contained in the LDAP schema that is specified in the task.cfg file. By default, this is inetOrgPerson. If the attribute names in your input file differ from those contained in the declared schema, you must map the attribute names in your input file to ones defined in inetOrgPerson.

Map the attribute names by first specifying the external database attribute name, a colon, then the associated LDAP attribute name. Separate attributes with a comma, as shown in the following example:

LineFormat=ID:uid,NTDOMAINACCOUNT:cn,SURNAME:sn,
GIVENNAME:givenName,INITIALS:initials

The order in which you specify the attributes is important; it must match exactly the order of the data supplied in your comma-separated value data file.

Normally, you can find the order of the attributes listed in the header of the data file. For an example of this, review the comma-separated value data file that is supplied with the product, sampledata.csv.
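To make the positional mapping concrete, the following Python sketch pairs the values of one data line with the LDAP names from a LineFormat value. It only illustrates the idea and is not how the Universal Text Parser is implemented: the function names and the sample record are made up, the LineFormat value is written on a single line, and the optional pound sign (#) formats are not handled.

```python
import csv
import io

def parse_line_format(line_format):
    """Split a LineFormat value into (external_name, ldap_name) pairs, in order."""
    pairs = []
    for item in line_format.split(","):
        item = item.strip()
        if not item:
            continue
        external, _, ldap = item.partition(":")
        # Assumption: an item without a colon uses the same name on both sides.
        pairs.append((external, ldap or external))
    return pairs

def map_csv_record(line_format, record_line):
    """Pair each value in one CSV line with the LDAP attribute at the same position."""
    pairs = parse_line_format(line_format)
    values = next(csv.reader(io.StringIO(record_line)))
    if len(values) != len(pairs):
        raise ValueError("record has %d fields, LineFormat names %d"
                         % (len(values), len(pairs)))
    return {ldap: value for (_, ldap), value in zip(pairs, values)}

LINE_FORMAT = "ID:uid,NTDOMAINACCOUNT:cn,SURNAME:sn,GIVENNAME:givenName,INITIALS:initials"
entry = map_csv_record(LINE_FORMAT, "1001,jsmith,Smith,John,JS")
```

Because the pairing is purely positional, swapping two names in LineFormat silently assigns values to the wrong attributes, which is why the order must match the data file exactly.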

Specify a “format” if your data file uses a separator other than a comma.

The format can be either a character delimiter, a regular expression, or a field size (which is indicated by a digit). To specify a format, follow the external database attribute name with a pound sign (#) and list the necessary format, as shown below in the example for ImportLineFormat.

Specify the ImportLineFormat if you are going to import into your external database any modifications that are generated through the Universal Text Connector.

In the ImportLineFormat statement, supply the names of attributes generated by the Universal Text Connector in the order that you need them to be written to the output file. This is the order that your external database will import the entry attributes.

Specify the LDAP attribute name, a colon, then the external database attribute name. Follow the external database attribute name with a pound sign (#) and the delimiter, which in the following example is a comma. Separate each attribute listed with a comma, as shown in the following example:

ImportLineFormat=operation#,,uid:ID#,,
cn:NTDOMAINACCOUNT#,, sn:SURNAME#,,
givenName:GIVENNAME#,,initials:INITIALS#,,

As shown in the example above, the first data value is an operation, the value of which can be either add, modify, or delete. The operation indicates the database action needed to process the respective entry.

ImportLineFormat is like the LineFormat statement in that you must map the external database attributes to attribute names defined in the specified LDAP schema, which is inetOrgPerson by default.

If needed, you can specify a different LDAP object class using the AdditionalAttributes statement. Note, however, that you can use only a single schema in your comma-separated value file.
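The following Python sketch shows one way to read an ImportLineFormat value like the one above and write an output line from it. It rests on assumptions that this chapter does not confirm: each delimiter is a single character, every value is followed by its delimiter (so the line ends with a delimiter), and a missing attribute is written as an empty value. All function names and the sample entry are hypothetical.

```python
import re

# One item looks like "ldapName:EXTERNALNAME#delimiter" or "operation#delimiter".
ITEM = re.compile(r"\s*(\w+)(?::(\w+))?#(.),?")

def parse_import_line_format(value):
    """Return (ldap_name, external_name, delimiter) triples in output order."""
    return [(m.group(1), m.group(2) or m.group(1), m.group(3))
            for m in ITEM.finditer(value)]

def render_import_line(value, entry):
    """Write the entry's values in ImportLineFormat order, each followed by its delimiter."""
    return "".join(entry.get(ldap, "") + delim
                   for ldap, _, delim in parse_import_line_format(value))

IMPORT_LINE_FORMAT = ("operation#,,uid:ID#,,cn:NTDOMAINACCOUNT#,, sn:SURNAME#,,"
                      "givenName:GIVENNAME#,,initials:INITIALS#,,")
sample = {"operation": "add", "uid": "1001", "cn": "jsmith",
          "sn": "Smith", "givenName": "John", "initials": "JS"}
line = render_import_line(IMPORT_LINE_FORMAT, sample)
```

Check the actual output file produced by your connector before relying on any of these assumptions in an import script.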

Modify the IndexAttribute= statement if you need to provide an attribute to index other than cn. By default, the line reads:

IndexAttribute=cn

To index on a different attribute value, specify the required attribute name in this statement. When specifying an attribute to index, use an attribute that contains unique values among each of the database entries. It is important, however, that you not use the distinguished name (dn) as the IndexAttribute value.
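Before you choose an index attribute, it can help to confirm that its values really are unique in your data. The following Python sketch does this for entries that have already been read into dictionaries; the function name and the sample entries are hypothetical.

```python
from collections import Counter

def duplicate_index_values(entries, index_attribute="cn"):
    """Return the index values that occur more than once, sorted."""
    counts = Counter(entry.get(index_attribute) for entry in entries)
    return sorted(v for v, n in counts.items() if n > 1 and v is not None)

entries = [
    {"uid": "1001", "cn": "John Smith"},
    {"uid": "1002", "cn": "John Smith"},
    {"uid": "1003", "cn": "Ann Jones"},
]
```

An empty result for an attribute suggests that it is a usable IndexAttribute value for that data set; here cn has a duplicate and uid does not.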

Specify the name of the file containing your comma-separated value data on the line that reads:

InputFile=%ScriptBase%sampledata.csv

On this line, replace the text sampledata.csv with the name of your data file.

InputFile is generated by DumpCommand, and OutputFile is used by ImportCommand. By default, the InputFile statement declares that your data file is located in the same directory as the Universal Text Parser modules. Because of this, you must copy your data file to this directory so the Universal Text Parser can read the file. If needed, you can place your data file in a different directory; be sure to supply the full path and file name to your data file in this statement.
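The default value suggests that %ScriptBase% stands for the directory that holds the Universal Text Parser modules, including the trailing path separator. Under that assumption, which this chapter implies but does not state, the following Python sketch shows how an InputFile value would resolve. The function name is hypothetical.

```python
import os

def resolve_input_file(setting, script_path):
    """Expand %ScriptBase% to the script's directory, including a trailing separator."""
    base = os.path.join(os.path.dirname(os.path.abspath(script_path)), "")
    return setting.replace("%ScriptBase%", base)
```

A value that contains a full path and no %ScriptBase% token is returned unchanged, which matches the option of placing the data file in a different directory.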

Configure the ImportCommand statement if you are going to import data into your external database. To do so:

Uncomment the ImportCommand statement by removing the pound sign (#) listed at the beginning of the statement.

Specify the import command your external database uses.

If it is supported, supply the command that exports the data from your external database using the DumpCommand statement:

Uncomment the DumpCommand statement by removing the pound sign (#) listed at the beginning of the statement.

Specify the export command your external database uses.

Use the MultiValToNoValAttr statement to specify a comma-separated list of the attributes whose value can change from one or more values to no value. List the attribute names as they are used in the external data source, not the names used in the Connector View. For example:

MultiValToNoValAttr=EMAIL,DESCRIPTION

Creating the File for Name-Value Pair Files

Use the nvp.cfg file as a boilerplate for the task.cfg configuration file if the data is formatted in a name-value pair (NVP) ASCII text file.

Before You Begin

Consider the following information when synchronizing a name-value pair data file with the Universal Text Parser:

All attribute names listed in the data file must map to the attribute names contained in the declared LDAP schema before the Universal Text Parser can process the entries. By default, the name-value pair boilerplate file uses inetOrgPerson as the default LDAP schema.

The name-value pairs for each entry should include the object classes of the attributes for that record. For example, the following lines show a valid entry in a name-value pair text file:

uid=nvp1_uid
ObjectClass=top
ObjectClass=person
ObjectClass=organizationalperson
ObjectClass=inetOrgPerson
cn=nvp1_cn
sn=nvp1
mail=nvp1@madisonparc.com
title=Title for nvp1
description=This is the description for nvp1

If you are importing data into your external database that the Universal Text Parser has generated, the external database must not assume the order of the attributes generated for any given entry.
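An import routine on the external database side can avoid depending on attribute order by collecting each entry into a map keyed by attribute name. The following Python sketch shows this for a single entry. The function name is hypothetical, and lowercasing the names is a choice made here because LDAP attribute names are not case sensitive.

```python
from collections import defaultdict

def parse_nvp_entry(text):
    """Parse name=value lines into {lowercased name: [values]}, ignoring line order."""
    entry = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError("not a name-value pair: %r" % line)
        entry[name.strip().lower()].append(value)
    return dict(entry)
```

Repeated names such as ObjectClass accumulate into one list, so the result is the same whether uid appears first or last in the entry.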

To create the file

After you have set up the Universal Text Parser, make the following modifications to the task.cfg file:

Modify the IndexAttribute= statement if you need to provide an attribute to index other than cn. By default, the line reads:

IndexAttribute=cn

Specify the name of the file containing your name-value pair data on the line that reads:

InputFile=%ScriptBase%sampledata.nvp

On this line, replace the text sampledata.nvp with the name of your data file.

By default, the InputFile statement declares that your data file is located in the same directory as the Universal Text Parser modules. Because of this, you must copy your external data file to this directory so the Universal Text Parser can read the file. If needed, you can place your data file in a different directory; be sure to supply the full path and file name to your data file in this statement.

Configure the ImportCommand statement if you are going to import data into your external database. To do so:

Uncomment the ImportCommand statement by removing the pound sign (#) listed at the beginning of the statement.

Specify the import command that your external database uses.

If it is supported, supply the command that exports the data from your external database using the DumpCommand statement:

Uncomment the DumpCommand statement by removing the pound sign (#) listed at the beginning of the statement.

Specify the export command that your external database uses.

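Use the MultiValToNoValAttr statement to specify a comma-separated list of the attributes whose value can change from one or more values to no value. For example:

MultiValToNoValAttr=mail,description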

Creating the File for LDIF Files

Use the ldif.cfg file as a boilerplate for the task.cfg configuration file if the data is formatted in LDIF.

Before You Begin

Consider the following information when synchronizing an LDIF data file with the Universal Text Parser:

All attribute names in the ImportLineFormat statement must map to the attribute names in the declared LDAP schema before the Universal Text Parser can process the entries. By default, the LDIF boilerplate file uses inetOrgPerson as the LDAP schema.

In particular, note that the two lines ImportAttributeNamesSeparator and AttributeNamesSeparator must both end with a trailing whitespace.

If you specify an attribute flow rule, you turn on attribute data flow and you must provide the mappings. If you do not specify an attribute flow rule (if you use only entry level data flow), then all the mappings are configured in the LineFormat statement in your task.cfg file.
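Because a text editor can silently strip trailing whitespace, it is worth checking the two separator lines after every edit. The following Python sketch is one way to do that. The function name is hypothetical, and the separator values used to try it out are invented; use the values from your own ldif.cfg boilerplate.

```python
REQUIRED_TRAILING_SPACE = ("ImportAttributeNamesSeparator", "AttributeNamesSeparator")

def missing_trailing_whitespace(cfg_text):
    """Return the required settings whose line does not end in a space or tab."""
    problems = []
    for line in cfg_text.split("\n"):
        line = line.rstrip("\r")  # tolerate Windows line endings
        name = line.split("=", 1)[0].strip()
        # Lines commented out with a pound sign do not match and are skipped.
        if name in REQUIRED_TRAILING_SPACE and not line.endswith((" ", "\t")):
            problems.append(name)
    return problems
```

Pass the function the full text of task.cfg; an empty list means that both separator lines still end with whitespace.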

To create the file

After you have set up the Universal Text Parser, make the following modifications to the task.cfg file:

Specify the ImportLineFormat if you are going to import into your external database any modifications that are generated through the Universal Text Connector.

In the ImportLineFormat statement, supply the names of attributes generated by the Universal Text Connector in the order that you need them to be written to the output file, with each attribute separated by a comma. For example:

ImportLineFormat=distinguishedname:dn,
operation:changetype,objectclass,uid,cn,sn,mail,
title,description

The Universal Text Parser will not generate any attributes excluded from this line.

Before listing the attributes you want to generate, you must specify two special values to support data output in LDIF. You specify these values as follows:

distinguishedname:dn,operation:changetype,
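The effect of the ImportLineFormat statement on LDIF output can be sketched as follows. This Python example is an illustration only: the function name, the sample entry, and the ": " separator are assumptions, and the ImportLineFormat value is written on a single line.

```python
def render_ldif_record(import_line_format, entry, separator=": "):
    """Emit one LDIF record with only the attributes listed in ImportLineFormat."""
    lines = []
    for item in import_line_format.split(","):
        item = item.strip()
        if not item:
            continue
        internal, _, external = item.partition(":")
        values = entry.get(internal, [])
        if isinstance(values, str):
            values = [values]
        # Use the mapped output name when one is given, else the name itself.
        lines.extend((external or internal) + separator + v for v in values)
    return "\n".join(lines)

FORMAT = ("distinguishedname:dn,operation:changetype,"
          "objectclass,uid,cn,sn,mail,title,description")
record = render_ldif_record(FORMAT, {
    "distinguishedname": "uid=utp_user1",
    "operation": "add",
    "objectclass": ["person", "inetorgperson"],
    "uid": "utp_user1",
    "telephonenumber": "4084913876",  # not listed in FORMAT, so not written
})
```

The telephonenumber value is dropped because the attribute is absent from the format, which mirrors the rule that attributes excluded from the ImportLineFormat line are not generated.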

Modify the IndexAttribute= statement if you need to provide an attribute to index other than cn. By default, the line reads:

IndexAttribute=cn

Specify the name of the file containing your LDIF data on the line that reads:

InputFile=%ScriptBase%sampledata.ldif

On this line, replace the text sampledata.ldif with the name of your data file.

By default, the InputFile statement declares that your data file is located in the same directory as the Universal Text Parser modules. Because of this, you must copy your data file to this directory so the Universal Text Parser can read the file. If needed, you can place your data file in a different directory; be sure to supply the full path and file name to your data file in this statement.

Configure the ImportCommand statement if you are going to import data into your external database. To do so:

Uncomment the ImportCommand statement by removing the pound sign (#) listed at the beginning of the statement.

Specify the import command that your external database uses. For example, you could specify the following ImportCommand statement:

ImportCommand=myDBUtil -l administrator -p password -f 'd:\\sunone\\servers\\utc-cv3\\logs\\test.out'

If it is supported, supply the command that exports the data from your external database using the DumpCommand statement:

Uncomment the DumpCommand statement by removing the pound sign (#) listed at the beginning of the statement.

Specify the export command that your external database uses.

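Use the MultiValToNoValAttr statement to specify a comma-separated list of the attributes whose value can change from one or more values to no value. For example:

MultiValToNoValAttr=mail,description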

About UTC Data Exchange Format

UTC and the connector Perl script exchange data using the UTC Data Exchange Format (UDEF). UDEF lets you specify binary values, and multiple values for a multi-valued attribute, using special tags.

Binary Values

Binary (base64 encoded) values are prefixed with the tag [B].

Examples:

For ldif data:

jpegphoto: [B]w6Rrw7/DvGzDvA==

For nvp data:

jpegphoto=[B]w6Rrw7/DvGzDvA==

For csv data:

[B]w6Rrw7/DvGzDvA==

Multiple Values

There are two ways to specify multiple values. If the data is multiline (that is, each attribute name-value pair is on its own line), you can specify each value of the attribute on a separate line, as shown below:

For ldif data:

telephonenumber: 4084913876
telephonenumber: 4085924901
telephonenumber: 4089027641

For nvp data:

telephonenumber=4084913876
telephonenumber=4085924901
telephonenumber=4089027641

This option is not available for csv data because in this data the record is specified in a single line.

Another way to specify multiple values is to use a comma-separated list of values prefixed with the [L] tag.

Examples:

For ldif data:

telephonenumber: [L]4084913876,4085924901,4089027641

For nvp data:

telephonenumber=[L]4084913876,4085924901,4089027641

For csv data:

Because values for different attributes are separated by commas in csv data, the comma-separated list of multiple values is enclosed in double quotes, as follows:

"[L]4084913876,4085924901,4089027641"

All commas (,) appearing in individual values should be escaped using a backslash (\).
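The rules for the [L] tag can be sketched as a small helper that builds a single-line multi-valued field. The function name is hypothetical, and the helper only applies the rules stated above: escape commas inside values, join with commas, and add double quotes for csv data.

```python
def udef_list(values, csv_field=False):
    """Join values into one [L]-tagged field, escaping commas inside each value."""
    body = "[L]" + ",".join(v.replace(",", "\\,") for v in values)
    # In csv data the whole list is wrapped in double quotes.
    return '"%s"' % body if csv_field else body

phones = ["4084913876", "4085924901", "4089027641"]
ldif_line = "telephonenumber: " + udef_list(phones)
csv_value = udef_list(phones, csv_field=True)
```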

Multiple Binary Values:

You can combine the two tags to specify multiple binary values, as follows:

Multiline data:

jpegphoto: [B]w6Rrw7/DvGzDvA==
jpegphoto: [B]dGVzdOODkeODkuODkw==

Single line data:

jpegphoto: [L][B]w6Rrw7/DvGzDvA==,[B]dGVzdOODkeODkuODkw==
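Reading the tags back can be sketched in the same way. The following Python function is a hypothetical decoder for one UDEF value, not the routine used by UTC; it handles the [L] and [B] tags as described in this section and does not attempt to handle a literal backslash placed directly before a comma.

```python
import base64
import re

def decode_udef(value):
    """Return the list of values in one UDEF field; [B] items come back as bytes."""
    if value.startswith("[L]"):
        # Split on commas that are not preceded by a backslash.
        items = re.split(r"(?<!\\),", value[3:])
        items = [item.replace("\\,", ",") for item in items]
    else:
        items = [value]
    return [base64.b64decode(i[3:]) if i.startswith("[B]") else i for i in items]

photos = decode_udef("[L][B]w6Rrw7/DvGzDvA==,[B]dGVzdOODkeODkuODkw==")
```

An untagged value comes back as a one-element list, so callers can treat single-valued and multi-valued attributes alike.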

Using Special Characters and UTF8 Data in DN

This section gives guidelines for using special characters and UTF8 data in the DN value when you prepare input data files for UTC-UTP based connectors.

Special characters (comma, plus sign, backslash, double quote, less-than sign, greater-than sign, or semicolon) appearing in the DN should be escaped using a backslash (\, ASCII 92). Also, if the DN contains UTF8 data, it must be escaped using the \xx notation (RFC 2253). The effect of the configuration parameter EvaluateEscapes (specified in the task.cfg file) must also be considered.

EvaluateEscapes=(TRUE|FALSE)

TRUE means that escape characters (\) are evaluated. For example, ‘test\,1’ becomes ‘test,1’.

FALSE or undefined means that escape characters (\) are not evaluated. Data is treated literally.

Special Characters in DN

Case 1: EvaluateEscapes=TRUE

Each special character in the DN value should be prefixed by two backslashes.

For example:

dn: uid=utp_u\\+ser4
objectclass: person
objectclass: organizationalperson
objectclass: inetorgperson
uid: utp_u+ser4
cn: utp_user4
sn: utp_user4
mail: utp_user4@madisonparc.com
title: Title for utp_user4
description: This is the description for utp_user4

Case 2: EvaluateEscapes=FALSE or undefined

Each special character in the DN value should be prefixed by one backslash.

For example:

dn: uid=utp_u\+ser4
objectclass: person
objectclass: organizationalperson
objectclass: inetorgperson
uid: utp_u+ser4
cn: utp_user4
sn: utp_user4
mail: utp_user4@madisonparc.com
title: Title for utp_user4
description: This is the description for utp_user4



Note	Escape special characters only in the DN value, not in the value of the attribute used to form the DN (uid in the above example). This escaping is required only when the DN or vrn is explicitly specified for an entry in the input file and the DN or vrn value contains special characters.

For example, among the sample input files provided with the UTC installation, only the LDIF files specify the DN explicitly for each entry (see the above examples). The CSV and NVP files do not specify the DN explicitly.

For example, see the NVP entry given below (cn is the naming attribute):

uid=nvp4_uid
ObjectClass=top
ObjectClass=person
ObjectClass=organizationalperson
ObjectClass=inetOrgPerson
cn=nv+p4_cn
sn=nvp4
mail=nvp41@madisonparc.com
title=Title for nvp4
description=This is the description for nvp4
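When a DN is specified explicitly, the escaping rule from the two cases above can be sketched as follows. The function name is hypothetical; it applies only to a value inside the DN (the part after uid= in the LDIF examples), not to the attribute type or to the separators between DN components.

```python
# The special characters listed at the start of this section: , + " \ < > ;
DN_SPECIALS = ',+"\\<>;'

def escape_dn_specials(value, evaluate_escapes=False):
    """Prefix each special character with one backslash, or two if escapes are evaluated."""
    prefix = "\\\\" if evaluate_escapes else "\\"
    return "".join(prefix + ch if ch in DN_SPECIALS else ch for ch in value)

dn_case1 = "dn: uid=" + escape_dn_specials("utp_u+ser4", evaluate_escapes=True)
dn_case2 = "dn: uid=" + escape_dn_specials("utp_u+ser4")
```

The uid attribute line itself is written with the unescaped value, as in both examples above.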

UTF8 Data in DN

All data other than 7-bit ASCII is considered UTF8 data, and it must be escaped using the \xx notation when it appears in the DN. For details on enabling UTF8 support for Meta-Directory, see "Enabling UTF8 in Indirect Connectors."

Case 1: EvaluateEscapes=TRUE

dn: uid=\\c3\\a4k\\c3\\bf\\c3\\bcl\\c3\\bc
objectclass: person
objectclass: organizationalperson
objectclass: inetorgperson
uid: [B]w6Rrw7/DvGzDvA==
cn: [B]w6Rrw7/DvGzDvA==
sn: [B]w6Rrw7/DvGzDvA==
mail: utp_user6@madisonparc.com
title: Title for utp_user6
description: This is the description for utp_user6

Case 2: EvaluateEscapes=FALSE or undefined

dn: uid=\c3\a4k\c3\bf\c3\bcl\c3\bc
objectclass: person
objectclass: organizationalperson
objectclass: inetorgperson
uid: [B]w6Rrw7/DvGzDvA==
cn: [B]w6Rrw7/DvGzDvA==
sn: [B]w6Rrw7/DvGzDvA==
mail: utp_user6@madisonparc.com
title: Title for utp_user6
description: This is the description for utp_user6



Note	Escape UTF8 data with the \xx notation only in the DN value, not in the value of the attribute used to form the DN (uid in the above example). If an attribute value contains UTF8 data, base64 encode it over UTF8.

This escaping is required only when the DN or vrn is explicitly specified for an entry in the input file and the DN or vrn value contains UTF8 data.
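Both encodings can be produced from the same string. The following Python sketch writes each non-ASCII byte of the UTF8 encoding as \xx for the DN and base64 encodes the same bytes, with the [B] tag, for the attribute value. The function names are hypothetical; the sample string is the uid value that the examples above encode.

```python
import base64

def escape_utf8_in_dn(value, evaluate_escapes=False):
    """Write each byte above 7-bit ASCII as a backslash and two hex digits."""
    # Two backslashes are needed when EvaluateEscapes=TRUE, one otherwise.
    prefix = "\\\\" if evaluate_escapes else "\\"
    parts = []
    for byte in value.encode("utf-8"):
        parts.append(chr(byte) if byte < 0x80 else "%s%02x" % (prefix, byte))
    return "".join(parts)

def base64_attribute_value(value):
    """Base64 encode the UTF-8 bytes and add the [B] tag used for binary values."""
    return "[B]" + base64.b64encode(value.encode("utf-8")).decode("ascii")

name = "\u00e4k\u00ff\u00fcl\u00fc"  # the non-ASCII uid used in the examples above
dn_line = "dn: uid=" + escape_utf8_in_dn(name)
uid_line = "uid: " + base64_attribute_value(name)
```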

For example, among the sample input files provided with the UTC installation, only the LDIF files specify the DN explicitly for each entry (see the above examples). The CSV and NVP files do not specify the DN explicitly.

See the NVP entry given below (cn is the naming attribute):

uid=[B]w6Rrw7/DvGzDvA==
ObjectClass=top
ObjectClass=person
ObjectClass=organizationalperson
ObjectClass=inetOrgPerson
cn=[B]w6Rrw7/DvGzDvA==
sn=[B]w6Rrw7/DvGzDvA==
mail=nvp41@madisonparc.com
title=Title for nvp4
description=This is the description for nvp4

For more details on escaping special characters and UTF8 data in the DN, see RFC 2253, "Lightweight Directory Access Protocol (v3): UTF-8 String Representation of Distinguished Names," at http://www.ietf.org/rfc/rfc2253.txt.
