Appendix: Performing Content Autocategorization

This appendix provides overviews of the autocategorization process and setup tasks, and discusses how to define autocategorization engine vocabularies and perform autocategorization.

Note. PeopleSoft Enterprise Portal does not include an autocategorization engine. However, integrating one is relatively straightforward. Before performing content autocategorization, you must install an engine.

See Also

Setting Up to Run the Content Categorization Spider

Understanding the Autocategorization Process

You can configure and use multiple autocategorization engines. Each engine must be accessible by HTTP GET or POST. The HTTP request typically targets an adapter object that maps the generic autocategorization input parameters to the proprietary interface of a specific engine.

The following table describes the input query parameters to the autocategorization engine.

Note. Certain autocategorization engines may not use all of these parameters. The valid range of values for parameters may vary.

| Parameter | Description |
|-----------|-------------|
| DOC | The URL of the document to be classified. |
| DOC_TYPE | The engine-specific code that indicates that the DOC parameter refers to a URL. This may be necessary for engines that support alternatives, such as sending the entire contents of the document instead of just the URL. |
| HIERARCHY | The vocabulary into which the document should be classified. This is the same value that you specify in the Vocabulary field when you define vocabularies on the Autocategorization Vocabularies page. |
| MAX_CATS | The maximum number of categories into which a document can be classified. |
| THRESHOLD | The minimum confidence ranking required for a document to be classified within a category. |
| USERID | The user ID to use when accessing the autocategorization engine. |
| PASSWORD | The password to use when accessing the autocategorization engine. |
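As a rough illustration of how these parameters might be passed in an HTTP GET request, the following Python sketch builds a request URL. The adapter URL, the DOC_TYPE code `URL`, the vocabulary name, and all default values are placeholders invented for this example; the actual values depend on the engine and adapter that you install.

```python
from urllib.parse import urlencode


def build_request_url(adapter_url, doc_url, vocabulary, doc_type="URL",
                      max_cats=3, threshold=0.1, user_id="", password=""):
    """Return a GET URL carrying the generic autocategorization parameters.

    All default values are illustrative assumptions, not product defaults.
    """
    params = {
        "DOC": doc_url,
        "DOC_TYPE": doc_type,      # engine-specific code; "URL" is hypothetical
        "HIERARCHY": vocabulary,
        "MAX_CATS": max_cats,
        "THRESHOLD": threshold,
        "USERID": user_id,
        "PASSWORD": password,
    }
    # urlencode percent-encodes each value, so the document URL can be
    # embedded safely in the query string.
    return adapter_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_request_url(
        "http://adapter.example.com/categorize",        # hypothetical adapter
        "http://files.example.com/docs/contract.html",  # hypothetical document
        "BUSINESS",                                     # hypothetical vocabulary
    ))
```

Because USERID and PASSWORD travel with the request, sending the same parameters by POST over a secured connection is likely the safer choice where the adapter supports it.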

The output of the autocategorization request should be a simple text array of classification definitions in the following format. Each element of the array represents a separate classification for the same document. A space character separates element pairs.

<confidence score>,<category name or error message>

The confidence score should be a signed number that conforms to the ranking scheme of the engine. A confidence score of -1 denotes an error.

For example, output from a successful call to an engine might look like this:

.23,/business/bus_law .18,/computers/internet .17,/business/industry/tech

The output from an unsuccessful call might look like this:

-1,Server Not Responding; Engine May Not Be Running.
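A caller might parse this output along the following lines. This is a minimal Python sketch, not the parser that the portal uses: the names `parse_response` and `EngineError` are hypothetical, and the code assumes that category names contain no spaces, as in the example above.

```python
class EngineError(Exception):
    """Raised when the engine reports a failure (confidence score of -1)."""


def parse_response(text):
    """Parse '<score>,<category>' elements separated by spaces.

    Returns a list of (score, category) tuples in the order received.
    """
    text = text.strip()
    # An error message may itself contain spaces, so check for the error
    # marker before splitting the text into elements.
    if text.startswith("-1,"):
        raise EngineError(text[len("-1,"):])
    results = []
    for element in text.split():
        # Split at the first comma only: score on the left, category on the right.
        score, _, category = element.partition(",")
        results.append((float(score), category))
    return results


if __name__ == "__main__":
    sample = ".23,/business/bus_law .18,/computers/internet .17,/business/industry/tech"
    for score, category in parse_response(sample):
        print(score, category)
```

Checking for the error marker first matters because the error text in the example contains spaces and would otherwise be split into several malformed elements.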

PeopleSoft Enterprise Portal provides sample ASP and Java servlet adapters as templates.

The templates illustrate how to access the input parameters of the autocategorization request, forward them to sample autocategorization engines, and then format and return the engines’ responses.

The ASP template demonstrates how to integrate with a Component Object Model interface; the servlets show how to integrate with an Enterprise JavaBeans or custom C interface.
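The adapter pattern that the templates illustrate can be outlined in a few lines. The following Python sketch is not one of the delivered templates: `handle_request` and `format_output` are hypothetical names, and `engine` stands in for whatever proprietary interface a real adapter would call. It also assumes that the adapter, rather than the engine, applies THRESHOLD and MAX_CATS; some engines may accept those values directly.

```python
def format_output(classifications):
    """Render (score, category) pairs as space-separated 'score,category'."""
    return " ".join(f"{score},{category}" for score, category in classifications)


def handle_request(params, engine):
    """Map generic request parameters onto an engine call and format the reply.

    `engine` is a hypothetical callable that takes (doc_url, vocabulary) and
    returns (score, category) pairs; a real adapter would call the engine's
    proprietary interface here instead.
    """
    try:
        raw = list(engine(params["DOC"], params["HIERARCHY"]))
    except Exception as exc:
        # Report any failure in the -1 error format.
        return f"-1,{exc}"
    threshold = float(params.get("THRESHOLD", 0))
    max_cats = int(params.get("MAX_CATS", len(raw)))
    # Keep only classifications at or above the threshold, best score first.
    kept = [(score, cat) for score, cat in raw if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return format_output(kept[:max_cats])


if __name__ == "__main__":
    def fake_engine(doc_url, vocabulary):
        # Stand-in engine returning fixed results for demonstration only.
        return [(0.18, "/computers/internet"), (0.23, "/business/bus_law")]

    request = {"DOC": "http://files.example.com/doc.html", "HIERARCHY": "BUSINESS"}
    print(handle_request(request, fake_engine))
```

Wiring this function to an actual HTTP endpoint is omitted; the sketch covers only the three steps named above: reading the input parameters, forwarding them to an engine, and formatting the response.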

Understanding Autocategorization Setup Tasks

This section provides a summary of the tasks required to perform autocategorization.

Note. The categorization spider is used to perform the autocategorization process.

To perform autocategorization:

  1. Define autocategorization engine vocabularies.

    This task enables you to set up meaningful names for the different vocabularies within an autocategorization engine. Some engines identify vocabularies by codes, such as numbers. Assigning a name within Content Categorization enables an administrator to refer to a vocabulary in a more meaningful way.

    See Defining Autocategorization Engine Vocabularies.

  2. Define content sources on the Run Categorization Spider page.

    This is where the spider will locate content.

    See Creating a Spider Run Control Entry.

  3. Create taxonomies.

    This is where you create your taxonomy and connect the defined content source to the taxonomy. The top folder name should correspond to the autocategorization engine vocabulary name. Information entered to establish the content source for your taxonomy is based on the content source defined using the Run Categorization Spider page.

    This table shows how to complete the Content Source page:

    | Spider Source Field | Value |
    |---------------------|-------|
    | Source Type | Auto Categorized File Server. |
    | Source Name | Select the source name you entered on the Run Categorization Spider page. |
    | Source Path | This value is automatically filled with the URL you entered on the Run Categorization Spider page for the selected source name. |
    | Auto Expand Folder | Select to allow subfolders to be created automatically. |

    Note. Auto-created folders cannot have content added manually or added automatically from other content sources.

    See Maintaining Folder Properties for Categorized Content.

Defining Autocategorization Engine Vocabularies

This section discusses how to define autocategorization engine vocabularies.

Defining Autocategorization Engine Vocabularies

Access the Autocategorization Engine Vocabulary Definition page (EPPCM_SPIDR_VOC) (select Content Management, Categorized Contents, Autocategorization).

Use the Autocategorization Engine Vocabulary Definition page to set the autocategorization vocabulary that is used to categorize content.

Name

Enter the name of the autocategorization engine.

URL

Enter the autocategorization engine's URL.

Vocabulary Name and Long Description

Enter a name that is used by the autocategorization engine for a taxonomy into which it can classify documents. The description appears on the Content Categorization pages to help identify the vocabulary.

Performing Autocategorization

After the taxonomy has been created and content sources have been mapped, the spider can be invoked.