9 Understanding Input Agents

Learn how to enable the Input Agent and create the input files for batch uploading in Imaging.

The Input Agent is an Imaging service used to upload and index documents in bulk into the Imaging system.

9.1 Enabling Input Agent

Input Agent indexes Imaging documents in bulk by using an application definition, input definition, and a specially formatted text file called an input file. The input file specifies the list of images to index and the metadata to associate with them in the application. Input Agent is multithreaded and is configurable to handle large and small volumes of data.

To configure the Input Agent, do the following:

Start the managed servers and navigate to the Imaging configuration MBean using WLST or the Enterprise Manager MBean browser.
Set CheckInterval to a value that is appropriate for your environment. The CheckInterval MBean is a system setting that specifies how many minutes to pause before checking for new work to do. The default is 15 minutes.
Set the InputAgentRetryCount to control how many times a job can be retried after it has failed. The default is 3, after which the job is placed in the failed directory.
Set the InputDirectories MBean to specify the paths to the input files. This value can be expressed as an array of locations. In a multinode installation of Imaging, this location is shared among all the Input Agents and must be accessible by all of them. If the Input Agents run on different machines, it must be a shared network location.

Note:

In order to process input files, the Input Agent must have the appropriate permissions on the input directory and the input directory must allow file locking. The Input Agent requires that the user account that is running the WebLogic Server service have read and write privileges to the input directory and all files and subdirectories in the input directory. These privileges are required so that Input Agent can move the files to the various directories as it works on them. File locking on the share is needed by Input Agent to coordinate actions between servers in the cluster.

After completing these steps, the Input Agent is active and ready to process work. Once you create an application (see Creating An Application) and input definitions (see Creating Input Definitions), the Input Agent will start processing jobs.
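
The permission requirements in the note above can be checked before the Input Agent is enabled. The following Python sketch is an illustration only and is not part of Imaging: the function name check_input_directory is hypothetical, and the script is assumed to be run under the same user account that runs the WebLogic Server service. It confirms that the directory is readable and writable and that a file can be created and moved into a subdirectory. It does not verify that the share supports file locking.

```python
import os
import tempfile
from pathlib import Path
from typing import List


def check_input_directory(directory: str) -> List[str]:
    """Return a list of problems found; an empty list means the basic checks passed.

    Hypothetical helper: run it as the account that runs WebLogic Server.
    """
    root = Path(directory)
    if not root.is_dir():
        return [f"{directory} is not an existing directory"]
    if not os.access(root, os.R_OK | os.W_OK):
        return [f"{directory} is not readable and writable by this account"]

    problems: List[str] = []
    probe = None
    try:
        # Create a probe file and move it into a subdirectory, mimicking the
        # moves Input Agent performs while it works on an input file.
        with tempfile.TemporaryDirectory(dir=root) as sub:
            fd, probe = tempfile.mkstemp(dir=root, suffix=".probe")
            os.close(fd)
            os.replace(probe, Path(sub) / Path(probe).name)
            probe = None  # moved; it is removed together with the subdirectory
    except OSError as exc:
        problems.append(f"could not create and move a probe file: {exc}")
    finally:
        if probe and os.path.exists(probe):
            os.remove(probe)
    return problems
```

An empty result only shows that the basic file operations succeed. File locking on the share still needs to be confirmed separately.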

9.2 Understanding Input Files

The Input Agent performs work based on input files. These are simple text documents, similar to CSV (comma-separated values) files, that contain lists of files and associated metadata to index into the Imaging system. The input file can use different encodings as long as the correct encoding is specified in the input definition. Input Agent looks for all input files that match the input mask of the input definition and not the sample file that is used to define the input definition. Note that sample files are not required when creating an input through the API. They are only used when creating an input through the user interface so a user can see the data to help choose the columns.

WARNING:

Input file masks must be unique within the Imaging system and cannot overlap. Input Agent processes an input file for only one input and will not restage the file to be processed again for a different input definition. Inputs are processed in random order, so there is no way to predict which input will pick up a shared input file.
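
One way to catch overlapping masks before they cause problems is to test a list of expected file names against every mask. The Python sketch below is an illustration, not an Imaging feature. It assumes that input masks behave like simple wildcard patterns (for example inv*.txt), which this section does not specify, and the function name find_shared_files and the input names in the usage note are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List, Tuple


def find_shared_files(
    masks: Dict[str, str], file_names: Iterable[str]
) -> List[Tuple[str, List[str]]]:
    """Return (file name, [input names]) for every file matched by more than one mask.

    masks maps an input definition name to its mask. Wildcard semantics are
    an assumption; adjust the matching if your masks behave differently.
    """
    shared = []
    for name in file_names:
        owners = [
            input_name
            for input_name, mask in masks.items()
            if fnmatchcase(name, mask)
        ]
        if len(owners) > 1:
            shared.append((name, owners))
    return shared
```

For example, with the masks inv*.txt and *.txt, the file inv001.txt is reported as shared because either input could pick it up.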

A sample input file looks like:

C:\IPMData\Input Files\print\NewPrintstreams\doc16.txt|NEW ORDER|10/06/94|B82L|218482
C:\IPMData\Input Files\print\NewPrintstreams\doc17.txt|NEW ORDER|10/06/94|N71H|007124
C:\IPMData\Input Files\print\NewPrintstreams\doc18.txt|NEW ORDER|10/06/94|B83W|24710

The detailed structure of an input file is defined as:

[path to document file][delimiter][metadata value 1]<[delimiter]<metadata value 2> ... <delimiter>>

Items in brackets ([]) are required and items in angle brackets (<>) are optional.
path to document file is the location of the tiff, jpeg, doc or other file type that is being saved to Imaging. It must be a path that is accessible to the user account running the Input Agent.
delimiter is the character that separates the values from one another, such as the | character.
metadata value x are the index values that the application uses to index the document.
The delimiter character must be the same character throughout the entire input file and match what is specified in the input definition. The default is a pipe character (|).
Only one metadata value is required per required field in the application. For example, if a Name and Date field are both marked as Required in an application, then the input file must have values for both the Name and Date field as well. Additional values are optional but they must continue to follow the [delimiter]<metadata value> format.
There is no length restriction per line, but all metadata pertaining to the file must be on a single line because the newline character specifies the start of a new document.
Each value is separated by a delimiter, with the delimited values treated by the Input Agent as Column 1... Column N. Any commands on the line do not count as a column. See Using Input Filing Commands.
Columns in the input file need not match the ordering of the Application, but they must be in the same locations as specified in the input definition to be indexed correctly.
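
The line structure described above can be sketched in a few lines of Python. This is an illustration of the format only, not the parser that Input Agent uses. The names InputLine and parse_input_line are hypothetical, and the sketch does not handle the filing commands described in Using Input Filing Commands.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InputLine:
    document_path: str
    metadata: List[str]  # metadata[0] is "metadata value 1"


def parse_input_line(line: str, delimiter: str = "|") -> InputLine:
    """Split one input file line into the document path and its metadata values.

    A trailing delimiter produces an empty final metadata value.
    """
    line = line.rstrip("\r\n")
    if not line:
        raise ValueError("empty line")
    fields = line.split(delimiter)
    path, metadata = fields[0], fields[1:]
    if not path:
        raise ValueError("missing path to document file")
    if not metadata:
        raise ValueError("at least one metadata value is required")
    return InputLine(path, metadata)
```

Applied to the first line of the sample input file, this yields the path ending in doc16.txt and the four metadata values NEW ORDER, 10/06/94, B82L, and 218482.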

Note:

Dates and times specified in the input file are subject to current Daylight Saving Time (DST) rules, not the DST rules in effect for the specified date. This can shift the timestamp of the document forward or back by up to two hours. If the timestamp shifts across midnight, the date used for the document input may also shift.

9.3 Using Input Filing Commands

Users can gain more control over the filing process by inserting special command sequences in the input file. An input definition applies to all files, but commands are inserted in individual input files as needed and can change from file to file, which makes it possible to set a specific behavior per file, such as the file locale used for date formats or numeric display.

These commands can be used for processing the entire input file or just a single row of the file, depending on the command. The details of the individual commands are specified below.

9.3.1 Locale

The locale command changes the locale which the agent uses to parse the data. This command can only be used once at the beginning of the input file before any documents are specified. If the command is used after data has been processed then an error will occur and the filing will stop.

Syntax

@Locale[delimiter][locale]

Example

@Locale|es-es

Notes

This command can only be used at the very beginning of the input file and applies to the whole file. If multiple locales need to be used then that data must be separated into different files. The delimiter must be the same as is used throughout the input file. The locale follows the format of ISO Language - ISO Country code.

9.3.2 New

The new command creates a new document in the Imaging system and behaves the same as leaving the index data on a line by itself. The command only applies to the line that is annotated and will reset on the next line.

Syntax

@New[delimiter][line data]

Line Data: The metadata values for the document as would exist on a typical input file.

Example

@New|TestTiff.TIF|98.765|Good Company LTD|10/08/2003|0000|1.733,12|10/09/2003

Notes

The @New at the beginning of the line is not counted as one of the columns to be mapped.

9.3.3 Supporting Content

The supporting content command allows the user to apply a file as supporting content to a document instead of creating a new document. The content is applied to the last new document line that appears in the input file unless an explicit document ID is specified in the command. If the last new document fails to index then the supporting content command also fails since the intended document to add content to doesn't exist.

Syntax

@Support[delimiter][key][delimiter][content path]<[delimiter][document id]>

Key: The supporting content key to store the file under. It must be unique for the document.
Content Path: The path to the file to save as supporting content.
Document ID (optional): The Imaging document ID that the supporting content should be applied to. If this value is given then the previous new document is ignored and the supporting content is placed on the document ID given.

Example

@Support|supporting content key 1|C:\temp\sample.tif

9.3.4 Apply Annotations

The apply annotations command applies a pre-generated annotation file to a document. The annotation is applied to the last new document line that appears in the input file unless an explicit document ID is specified in the command. If the last new document fails to index then the apply annotations command also fails since the intended document to apply annotations to doesn't exist.

Note that multiple annotation commands overwrite each other. They are not cumulative.

Note:

Use this command to apply annotations only when uploading new documents to Imaging using the Input Agent. It is not recommended to use this command to apply annotations to existing documents as it will overwrite any existing annotations associated with the document.

Syntax

@Annotation[delimiter][file path]<[delimiter][document id]>

File Path: The path to the annotation file to apply to the document.
Document ID (optional): The Imaging document ID that the annotation should be applied to. If this value is given then the previous new document is ignored and the annotation is applied to the document ID given.

Example

@Annotation|C:\temp\annot.xml
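
The Locale, New, Supporting Content, and Apply Annotations commands can be combined in a single input file. The Python sketch below shows one way to generate such lines from the syntax given in the preceding sections. The function names are hypothetical, and the check that rejects values containing the delimiter is an assumption, since this chapter does not describe any escaping mechanism.

```python
from typing import List, Optional, Sequence


def _join(parts: Sequence[str], delimiter: str) -> str:
    # Assumption: values cannot contain the delimiter or a newline, because
    # the newline starts a new document and no escaping is documented.
    for part in parts[1:]:
        if delimiter in part or "\n" in part or "\r" in part:
            raise ValueError(f"value contains delimiter or newline: {part!r}")
    return delimiter.join(parts)


def locale_line(locale: str, delimiter: str = "|") -> str:
    return _join(["@Locale", locale], delimiter)


def new_line(path: str, metadata: Sequence[str], delimiter: str = "|") -> str:
    return _join(["@New", path, *metadata], delimiter)


def support_line(key: str, content_path: str,
                 document_id: Optional[str] = None, delimiter: str = "|") -> str:
    parts: List[str] = ["@Support", key, content_path]
    if document_id:
        parts.append(document_id)
    return _join(parts, delimiter)


def annotation_line(file_path: str,
                    document_id: Optional[str] = None, delimiter: str = "|") -> str:
    parts: List[str] = ["@Annotation", file_path]
    if document_id:
        parts.append(document_id)
    return _join(parts, delimiter)


def build_example() -> str:
    """Assemble a small input file: locale first, then one document with extras."""
    lines = [
        locale_line("es-es"),
        new_line("TestTiff.TIF", ["98.765", "Good Company LTD", "10/08/2003"]),
        support_line("supporting content key 1", r"C:\temp\sample.tif"),
        annotation_line(r"C:\temp\annot.xml"),
    ]
    return "\n".join(lines) + "\n"
```

The locale line is emitted first because the command is only valid before any documents are specified, and the supporting content and annotation lines follow the document line they apply to.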

9.3.5 Workflow Inject Document

The workflow inject document command kicks off a workflow injection for the specified document id. The command is only intended for use in the error file and is documented here for informational purposes only.

Syntax

@WorkflowInjectDoc[delimiter][document id]

Document ID (required): The Imaging document ID to inject into workflow.

Example

@WorkflowInjectDoc|2.IPM_014404

9.4 Input Agent Processing

This section describes how the Input Agent processes the input files.

9.4.1 Input Directory Structure

The input directory specified in the configuration MBean is the top level of the directory structure. Below it, the Input Agent creates and manages the following directories to process its work. The directories are laid out and defined as follows.

Input
   - Errors
   - Holding
   - Processed
      - YYYY-MM-DD
   - Samples
   - Stage

Directory	Definitions
Input	This is the top level that is defined in the configuration MBean. This is where Input Agent looks for new input files. There can be multiple input directories defined in the MBean and each entry in the MBean will have this same structure below it.
Errors	Whenever an input file has a mixture of failed index attempts along with some successful indexes, an error file is created for that filing in this directory.
Holding	If CleanupExpireDays and CleanupFileExclusionList MBeans are enabled, the holding directory stores any successfully processed file, including annotation and supporting content files. The images remain there until the number of days specified in the CleanupExpireDays setting elapses. After that point the files and the batch folder are deleted. Specify any files that should not be deleted in the CleanupFileExclusionList setting with exact file names.
YYYY-MM-DD	These directories are date values in the form of year-month-day (such as 2009-04-01) that organize the input files by the date they were processed. This gives the date of when the file was processed and prevents any one directory from getting too many files in it.
Processed	Files under this directory have been parsed all the way through the filing process. If an error occurred during processing, then an error file is placed in the Errors directory and the original file is placed in the Processed directory even if no document is created in the Imaging system.
Samples	This directory contains all the sample files that work with input objects through the user interface. Files in this directory are visible in the input wizard under the user interface and should not contain production data. Note that the Samples directory location is configured separately from the input directories and may not be under the input directory.
Stage	Files in this directory have been selected for processing and are being worked on by the agent. Once the filing is complete, the file is moved to the Processed directory. If the processing fails, an error report is generated.
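
When monitoring an input directory, it can help to see how many files are currently sitting in each of the directories defined above. The following Python sketch is an illustrative monitoring aid, not part of Imaging, and the function name summarize_input_directory is hypothetical. The Samples directory is left out because its location is configured separately.

```python
from pathlib import Path
from typing import Dict

# Subdirectories that Input Agent manages below each input directory.
SUBDIRECTORIES = ("Errors", "Holding", "Processed", "Stage")


def summarize_input_directory(input_dir: str) -> Dict[str, int]:
    """Count files waiting at the top level and files under each managed subdirectory."""
    root = Path(input_dir)
    # Files directly in the input directory are new input files not yet staged.
    counts = {"Input": sum(1 for p in root.iterdir() if p.is_file())}
    for name in SUBDIRECTORIES:
        sub = root / name
        # rglob also counts files in nested folders such as Processed/YYYY-MM-DD.
        counts[name] = (
            sum(1 for p in sub.rglob("*") if p.is_file()) if sub.is_dir() else 0
        )
    return counts
```

A growing count in Stage or at the top level may indicate that files are arriving faster than they are processed.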

9.4.2 Input Agent Processing Order

Input Agent polls for input files, stages them, and posts a message to the JMS queue that there are files available for processing. Input ingestors listen to the JMS queue and start processing staged files. The sequence of events is as follows:

9.4.2.1 Polling

First, Input Agent polls for files:

1. When Input Agent wakes up (at the interval specified by the CheckInterval MBean), it gets a list of the currently online input definitions.
2. For each of the input definitions, Input Agent checks all input directories for files that match the input file mask.
3. When a file is found, it is moved to the Stage directory and a message is generated on a JMS queue to process the file, at which point input ingestors are notified and processing can begin.
4. Steps 2 and 3 are repeated until all input definitions and directories have been checked.
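
The polling sequence above can be modeled as a simple loop. The Python sketch below is a simplified simulation for understanding the flow, not the Input Agent implementation: a queue.Queue stands in for the JMS queue, masks are assumed to be simple wildcard patterns, and the function name poll_once is hypothetical.

```python
import queue
import shutil
from fnmatch import fnmatchcase
from pathlib import Path
from typing import Dict, Iterable


def poll_once(online_inputs: Dict[str, str],
              input_directories: Iterable[str],
              work_queue: "queue.Queue") -> int:
    """Stage every matching file and post one message per staged file.

    online_inputs maps an input definition name to its file mask.
    Returns the number of files staged during this poll.
    """
    directories = list(input_directories)
    staged_count = 0
    for input_name, mask in online_inputs.items():
        for directory in directories:
            root = Path(directory)
            stage = root / "Stage"
            # Snapshot the listing first so moving files does not disturb iteration.
            for candidate in sorted(root.iterdir()):
                if candidate.is_file() and fnmatchcase(candidate.name, mask):
                    stage.mkdir(exist_ok=True)
                    staged = stage / candidate.name
                    shutil.move(str(candidate), str(staged))
                    # Stand-in for the JMS message that notifies the ingestors.
                    work_queue.put((input_name, staged))
                    staged_count += 1
    return staged_count
```

Because a file is moved to Stage as soon as one input matches it, a second input with an overlapping mask never sees it, which is why masks must not overlap.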

9.4.2.2 Processing

Once input ingestors are notified that there are files staged for processing in the JMS queue, they begin processing the files:

The ingestor opens a connection to the repository and creates an error file and a new batch object for tracking the documents.
The thread begins parsing the input file and indexing the documents into Imaging. Any errors that are encountered during indexing are recorded in the error file. This step is repeated for all entries in the input file.
After all the documents have been processed, the batch is closed and, if there were no error entries, the error file is deleted.
The ingestor closes the connection to the repository and the input file is moved to the current date directory under the Processed or Failed directory, and the ingestor moves on to the next staged input file.
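
The processing steps above can be modeled in the same way. The Python sketch below is a simplified simulation, not the ingestor implementation. The callback index_document is a hypothetical stand-in for indexing a document into the repository, the error file name follows the format given in Checking Results and Error Files, and the sketch always moves the input file under Processed rather than modeling the retry and failure handling.

```python
import datetime as dt
import shutil
from pathlib import Path
from typing import Callable, List, Optional


def process_staged_file(staged: Path,
                        index_document: Callable[[List[str]], None],
                        now: Optional[dt.datetime] = None,
                        delimiter: str = "|") -> Path:
    """Index every line of a staged input file and return its final location."""
    now = now or dt.datetime.now()
    input_root = staged.parent.parent  # Stage sits directly under the input directory
    errors_dir = input_root / "Errors"
    errors_dir.mkdir(exist_ok=True)
    error_file = errors_dir / f"{staged.name}.{now:%Y-%m-%d.%H-%M-%S}.err"

    failures = 0
    with staged.open(encoding="utf-8") as source, \
            error_file.open("w", encoding="utf-8") as errors:
        for raw in source:
            line = raw.rstrip("\r\n")
            if not line:
                continue
            try:
                index_document(line.split(delimiter))
            except Exception as exc:
                # Record the failed line with the error message as an extra column.
                sep = "" if line.endswith(delimiter) else delimiter
                errors.write(f"{line}{sep}{exc}\n")
                failures += 1

    if failures == 0:
        error_file.unlink()  # no error entries, so the error file is deleted

    dated = input_root / "Processed" / f"{now:%Y-%m-%d}"
    dated.mkdir(parents=True, exist_ok=True)
    destination = dated / staged.name
    shutil.move(str(staged), str(destination))
    return destination
```

The error file is created up front and removed only when no entries were written, which mirrors the order of the steps above.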

9.4.3 Changing Oracle WebLogic Server Work Manager Settings

A work manager is an Oracle WebLogic Server concept for controlling how many threads are assigned to a process. In Imaging, work managers control how many threads are assigned to the Input Agents, which increases or decreases their load on the system. On a new installation, Input Agent is assigned 10 threads. You can reconfigure how many threads Oracle WebLogic Server should provide to the Input Agents by changing the default settings of the WebLogic Server work manager InputAgentMaxThreadsConstraint (default 10) to match your system needs. The number of maximum threads should be adjusted equally on all systems to avoid one machine falling behind and creating a backlog. A value of -1 or 0 disables the constraint. Values above 1 constrain the number of threads to the specified number.

To update thread settings, complete the following steps from the WebLogic Server administration console. For more information about WebLogic Server, see Administering Server Environments for Oracle WebLogic Server.

Bring up the WebLogic console for the domain and go to the deployments section.
Select the Imaging application to display the details for Imaging.
Select the Configuration and then Workload tabs to get to the Work Manager list.
Select InputAgentWorkManager to adjust.
Select InputAgentMaxThreadsConstraint at the bottom of the page.
Update the count to the new maximum thread count and click Save.
Restart the managed server(s) and the new thread count will be in effect.

9.5 Checking Results and Error Files

Input Agent has a retry mechanism to allow it to reattempt processing the input file in the event of a recoverable error. An example of this type of error is when the repository is not yet available and needs to finish initializing. When Input Agent detects a recoverable error, it puts the filing back on the JMS queue. The queue has a configurable retry wait timer that prevents the input file from being reprocessed immediately. You can also set the InputAgentRetryCount MBean to control how many times a job can be retried. The default is 3, after which the job is placed in the failed directory.

To troubleshoot any input file errors, do the following:

Determine if a severe error is preventing the input file from being parsed by examining the Errors directory. In the case of a severe error, two files are placed in the Errors directory: the original input file and an error file. The original input file is renamed by appending an error code, date and time the error occurred to the original file name in the following format:
```
original name-error code.YYYY-MM-DD.HH-mm-SS.original ext
```
The error file is named using the original input file name appended with the date and time of the error with a .err extension in the following format:
```
original name.YYYY-MM-DD.HH-mm-SS.err
```
For example, the failed input file invoices.dat would be renamed and placed in the Errors directory as invoices.dat-errorcode.05-21-2010.16_36_07.dat and have an associated error file of invoices.dat.05-21-2010.16_36_07.err.

Note:

The file name character limitations of the file system being used should be considered when naming input files. Input file names should be at least 55 characters less than the file system limit in order to allow for the appended error codes and date-time stamps.

For example, the Windows NTFS file system restricts the file name plus the length of the file path to 256 characters. If the Errors directory path is 150 characters long, then no input file name should exceed 51 characters: 256 character limit - 150 character directory path - 55 character error code date-time stamp = 51 characters left for a file name. In this example, if more than 51 characters are required for input file names, then the Errors directory must be moved to a shorter path.
If an error report does exist it will contain a list of all lines from the original input file that failed with an additional column containing the error message. So, an original line of:
```
C:\IBPM Data\WorkFiles\Filer\input\Images\C885|Identifier 165|27/06/2008|28215|495.75|
```
would be listed in the error file as the following:
```
C:\IBPM Data\WorkFiles\Filer\input\Images\C885|Identifier 165|27/06/2008|28215|495.75|Could not find file C:\IBPM Data\WorkFiles\Filer\input\Images\C885
```
If a more detailed logging level is enabled for Imaging and a filing was placed in the processed directory, a log entry is created stating:
```
Filing <Input Name> completed successfully with <indexed doc count> documents processed successfully out of <total doc count> documents.
```
If the filing failed, then a log entry is created that states:
```
An error occurred while completing a batch.
```
Common causes of errors on a line by line basis are problems with proper formatting of metadata to be loaded, or invalid value ranges and truncation of data.
Refer to the server's Oracle Diagnostic Logging (ODL) framework logs. The most common way to check these is through the Enterprise Manager's log viewer for the Imaging application. Typical problems found here stem from underlying repository or file permission issues.
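
Two parts of the troubleshooting steps above lend themselves to a small script: the file name length budget and the extraction of the error message from an error file line. The Python sketch below is illustrative only, and the function names are hypothetical. It assumes that the error message is the last delimited column and that the message itself does not contain the delimiter.

```python
from typing import Tuple


def max_input_file_name_length(path_limit: int,
                               errors_dir_length: int,
                               suffix_allowance: int = 55) -> int:
    """Characters left for an input file name after the Errors path and appended suffix."""
    return path_limit - errors_dir_length - suffix_allowance


def split_error_line(line: str, delimiter: str = "|") -> Tuple[str, str]:
    """Split an error file line into the original columns and the error message.

    The returned original text does not include the delimiter that
    preceded the error message column.
    """
    original, _, message = line.rstrip("\r\n").rpartition(delimiter)
    return original, message
```

With the figures from the note above, max_input_file_name_length(256, 150) gives 51, matching the worked example.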