Data Sets

Unused Data Sets

An Unused Data Set is a collection of transaction records that have been loaded into Spend Classification but not yet classified or used for training purposes.

Training Data Sets

A Training Data Set is a collection of high-quality spend transaction records that are already accurately categorized and that provide good examples of the description, item reference, supplier, site, and price keywords typically found on purchases in a specific category. For example, the training data set records for categorizing spend on Desserts might include transactions for Chocolate cake, Meringue tart, and Black Forest.

Spend Classification uses a training data set to build a Knowledge Base. The Knowledge Base identifies data patterns that are used to predict spend categories in wrongly classified or unclassified spend data.

During a Spend Classification implementation, expect to use an iterative process to gradually improve the quality of training data sets, and thereby the results from the tool. Typically, after a set of spend transactions has been classified, you review the results and use any classification errors to identify modifications or additions to the training data for specific categories.

There are two ways to build a training data set:

The training data can be created from scratch and imported into Spend Classification.
A utility is provided that will analyze an existing data set of spend transaction records to identify a subset of examples of distinct transactions for manual classification.

An appropriate set of data that makes up the training data set should have these characteristics:

There must be a high number of examples for each leaf node to build a robust knowledge base so that the application can predict the appropriate category with higher accuracy.
The examples should represent all the variations needed to assign a transaction to a given category. For example, if Item Description, Price, Supplier, Supplier Site, Amount of PO, and Invoice are considered in the classification process that's currently used, then the training data set must contain examples that cover the values that describe a spend transaction for each category code used. Extract sample transactions for each category, and repeat this for every category present in the transaction tables, because users can order the same item or service in different ways. To build an effective knowledge base, expose it to as many of these variations as possible so that it can pick up the keywords properly during the classification process.
The data should have referential integrity. This means that the data must always reference categories that are valid and exist in the instance.
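
The first and third characteristics can be checked on a candidate training file before it is imported. The following is a minimal sketch in Python and isn't part of Spend Classification. The record layout (a dictionary with description and category keys), the valid_categories set, and the min_examples threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def check_training_set(records, valid_categories, min_examples=3):
    """Return (invalid_records, sparse_categories) for a list of record dicts.

    Hypothetical layout: each record has a "category" key. min_examples is
    an illustrative threshold, not a product setting.
    """
    # Referential integrity: every record must point at a known category.
    invalid = [r for r in records if r["category"] not in valid_categories]

    # Coverage: count examples per valid category (missing categories count as 0).
    counts = Counter(
        r["category"] for r in records if r["category"] in valid_categories
    )
    sparse = sorted(c for c in valid_categories if counts[c] < min_examples)
    return invalid, sparse


records = [
    {"description": "Chocolate cake", "category": "Desserts"},
    {"description": "Meringue tart", "category": "Desserts"},
    {"description": "Black Forest", "category": "Desserts"},
    {"description": "Office chair", "category": "Furniture"},
    {"description": "Toner", "category": "Unknown"},
]
invalid, sparse = check_training_set(records, {"Desserts", "Furniture"})
```

In this sample, the Toner record references a category that doesn't exist, and Furniture has too few examples under the assumed threshold.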

Classification Data Sets

A Classification Data Set is a collection of spend transaction records of a similar type that have been processed to generate spend category predictions. The predictions and system confidence in these predictions are stored in a Batch.

The maximum number of spend transactions that can be processed and classified from a data set is 100,000 records. When a Data Set is selected for classification, the user has the option to apply business unit and date filters to ensure that the classification batch that will be generated won't exceed the maximum size.

When a subset of the records in a Data Set is selected for classification into a taxonomy, the next time that data set is selected for classification into the same taxonomy, only those records not yet classified will be available for selection. For example, if a Data Set contains 140,000 spend data records for an organization's buying activity during 2019 and 2020 (80,000 and 60,000 yearly records respectively), the user could select all transactions for 2019 for classification into taxonomy 1. This would process the 80,000 records from 2019. The next time they select the data set for classification into taxonomy 1, only the 60,000 records from 2020 would be available. However, all 140,000 records would still be available for classification into any of taxonomies 2, 3, 4, or 5.
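
The selection rule in this example can be modeled as simple set bookkeeping. The sketch below is illustrative only. The DataSet class, its methods, and the record layout are hypothetical names and don't correspond to any Spend Classification API.

```python
MAX_BATCH_SIZE = 100_000  # maximum batch size stated in this section


class DataSet:
    """Toy model that tracks which record IDs are classified per taxonomy."""

    def __init__(self, records):
        # records: dict mapping record ID -> year (hypothetical layout)
        self.records = dict(records)
        self.classified = {}  # taxonomy -> set of classified record IDs

    def available(self, taxonomy, year=None):
        """Records not yet classified into this taxonomy, optionally by year."""
        done = self.classified.get(taxonomy, set())
        return [
            rid for rid, y in self.records.items()
            if rid not in done and (year is None or y == year)
        ]

    def classify(self, taxonomy, year=None):
        """Classify the available records and return the batch size."""
        batch = self.available(taxonomy, year)
        if len(batch) > MAX_BATCH_SIZE:
            raise ValueError("batch exceeds the maximum size; apply a filter")
        self.classified.setdefault(taxonomy, set()).update(batch)
        return len(batch)


# 80,000 records from 2019 and 60,000 records from 2020
records = {i: 2019 for i in range(80_000)}
records.update({i: 2020 for i in range(80_000, 140_000)})

ds = DataSet(records)
processed = ds.classify(taxonomy=1, year=2019)
remaining_taxonomy_1 = len(ds.available(1))
remaining_taxonomy_2 = len(ds.available(2))
```

After the 2019 records are classified into taxonomy 1, the model reports 60,000 records still available for taxonomy 1 and all 140,000 available for taxonomy 2.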

Note: The reset feature can be used to allow batches or data sets to be reprocessed to generate updated category predictions. See Reset a Data Set.

Seeded Classification Data Sets

There are four seeded Data Sets in Spend Classification used to process transactions generated by the applications.

These are the Requisitions, Purchase Orders, Invoices, and Expenses data sets. These data sets are unique because they grow automatically as users enter and process transactions within the applications. For example, on 31-Dec-2020 an administrator creates 2 batches to process all 160,000 transactions from 2019 and 2020 in the Requisitions data set for classification into taxonomy 1. During January 2021, users enter and process another 10,000 requisitions. When the administrator selects the data set for classification on 31-Jan-2021, the 10,000 new transactions are available for classification into taxonomy 1, and the cumulative 170,000 transactions are available for classification into any of taxonomies 2, 3, 4, or 5.

The same processing limits apply to these seeded data sets, so any classification batch generated from Requisitions, Purchase Orders, Invoices, or Expenses can't exceed 100,000 records in size. The same selection process also applies to the records in these data sets, so a record can be included in only a single batch for classification into a specific taxonomy.
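
The batch counts in the example follow from the 100,000-record limit. As a rough sketch in Python, the smallest number of batches needed for a given record count is a ceiling division. The function name min_batches is hypothetical, and the result is only a lower bound, because real batches are split by business unit and date filters rather than by exact record count.

```python
import math

MAX_BATCH_SIZE = 100_000  # maximum batch size stated in this section


def min_batches(record_count, max_batch_size=MAX_BATCH_SIZE):
    """Lower bound on the number of batches needed to classify all records."""
    return math.ceil(record_count / max_batch_size)


initial_load = min_batches(160_000)    # 2019 and 2020 requisitions, taxonomy 1
january_delta = min_batches(10_000)    # new January 2021 requisitions, taxonomy 1
other_taxonomy = min_batches(170_000)  # all records, any other taxonomy
```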

Managing Data Sets

The Data sets tab enables you to search for data sets, reset data sets, create a knowledge base, improve a knowledge base, submit a classification run, upload or download spend data in a supported file format (tab-delimited file), and monitor data set activity logs. Use the applicable search criteria to find a data set. You can then download the entire data set, or download a specific set of transactions using filters such as business unit and date range.
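
If you work with a downloaded tab-delimited file offline, the same kind of business unit and date range filter can be applied with a few lines of Python. This is a sketch under stated assumptions: the column names TRANSACTION_ID, BUSINESS_UNIT, TRANSACTION_DATE, and DESCRIPTION, and the ISO date format, are hypothetical and may not match the actual file layout.

```python
import csv
import io
from datetime import date


def filter_rows(tsv_text, business_unit, start, end):
    """Return rows for one business unit within an inclusive date range."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row for row in reader
        if row["BUSINESS_UNIT"] == business_unit
        and start <= date.fromisoformat(row["TRANSACTION_DATE"]) <= end
    ]


# Hypothetical file content; real column names may differ.
sample = (
    "TRANSACTION_ID\tBUSINESS_UNIT\tTRANSACTION_DATE\tDESCRIPTION\n"
    "1\tUS1\t2019-03-14\tChocolate cake\n"
    "2\tUS1\t2020-07-01\tMeringue tart\n"
    "3\tUK1\t2019-11-30\tBlack Forest\n"
)
rows = filter_rows(sample, "US1", date(2019, 1, 1), date(2019, 12, 31))
```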

Import Data Sets

To import data sets:

In the Spend Classification work area, click Upload.
On the Scheduled Processes page, click Schedule New Process and select the Load Interface File for Import job.
On the Process Details page, select the Import Data Sets import process, upload the data file, and submit the job.

After the import is successful, you can search for and view the data.

Create a Copy of a Data Set

Create a new data set by creating a copy of an existing user-generated data set. This is not supported for seeded data sets such as Requisitions, Purchase Orders, Invoices, and Expenses.

To create a copy of a data set:

Click Copy in the data set’s menu options.
Note: The activity log of the original data set shows the status of the copy action.
Enter the name and select a purpose for the new data set and click Create.
After the new data set is created, refresh the Configuration page to find it in the list of data sets.

Create a Sample Training Set

A data set can contain so many transactions that manually categorizing all of them isn't practical, so you manually classify only a subset. Selecting that subset by hand can be a daunting task, and you could miss good candidate transactions when trying to reduce the size of the data. When you create a sample training data set, Spend Classification extracts the most distinct variety of transactions. You can then download this sample training set, manually classify the transactions, and upload it with a different name to finalize the training data set.

To create a sample training data set:

On the Configuration page, on the Data Set tab, click the menu for a data set and select Create Sample Training Set.
In the Create Sample Training Set dialog box:
- Name the data set.
- Optionally select a business unit.
- Optionally specify a date range by entering the from and to dates.
- Enter the sampling volume percentage, which determines what percentage of the transactions in the data set is extracted into the sample training set. Typically, this is around 10 percent. Click Get Transactions to see the approximate number of transactions that the sample training set will contain at that percentage.
- Advanced users can use these options to create a sample training set as per their requirements:
  - Distance factor: The method used to determine the distance between two keywords within a cluster. A cluster is a group of objects that are similar to each other. In data mining, similarity is measured as a distance with dimensions describing object features.
  - Cluster size: Label that indicates the number of similar objects that are grouped together as a cluster.
Click Create.
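
To make the idea of distance and clusters more concrete, the sketch below groups similar descriptions and keeps one representative per group. It is not the algorithm that Spend Classification uses. Jaccard distance over keywords, the greedy grouping, the max_distance value, and all function names are stand-ins chosen for illustration.

```python
def jaccard_distance(a, b):
    """Distance between two descriptions: 1 minus shared keywords over all keywords."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def pick_representatives(descriptions, max_distance=0.5):
    """Greedy grouping: a description joins the first cluster whose
    representative is within max_distance; otherwise it starts a new cluster."""
    clusters = []  # list of (representative, members)
    for d in descriptions:
        for rep, members in clusters:
            if jaccard_distance(rep, d) <= max_distance:
                members.append(d)
                break
        else:
            clusters.append((d, [d]))
    return clusters


descriptions = [
    "chocolate cake large",
    "chocolate cake small",
    "office chair black",
    "office chair blue",
    "laser printer toner",
]
clusters = pick_representatives(descriptions)
representatives = [rep for rep, _ in clusters]
cluster_sizes = [len(members) for _, members in clusters]
```

In this toy version, only the representatives would be sent for manual classification, which is how a sample can stay small while still covering distinct kinds of transactions.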

Reset a Data Set

To reclassify a data set after the classification process, you must first delete the classification codes that the Spend engine stamped on the applicable transactions or data sets. Use the Reset action to do this. The Reset action is enabled when a data set contains at least one classified transaction that hasn't been reset. If a data set has never been classified, the Reset action is disabled.

Select one or more data sets and click the Reset action. Because a transaction can be classified using more than one taxonomy, select the taxonomy whose classification codes you want to delete. Only the taxonomies that you have used for classifying the data sets are available for resetting.

Reset removes all category predictions and should be used infrequently, only in special circumstances. Batches already created for a data set remain unaffected when you reset the data set. However, the category predictions in approved batches are no longer available in analytic applications after the reset.

Transactions can be classified using up to 5 taxonomies, so you don't need to reset a data set to classify it in a different taxonomy. Reset is required only if transactions need to be reclassified using the same taxonomy.
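
The reset rules described in this section can be summarized in a small model. The sketch below is illustrative only. The ClassifiedDataSet class and its method names are hypothetical and don't represent the Spend engine or any product API.

```python
class ClassifiedDataSet:
    """Toy model of classification codes stamped per record and taxonomy."""

    def __init__(self, record_ids):
        # record ID -> {taxonomy: category code}
        self.codes = {rid: {} for rid in record_ids}

    def stamp(self, record_id, taxonomy, category):
        """Record a category prediction for one record in one taxonomy."""
        self.codes[record_id][taxonomy] = category

    def resettable_taxonomies(self):
        """Only taxonomies already used for classification can be reset."""
        return sorted({t for codes in self.codes.values() for t in codes})

    def reset_enabled(self):
        """Reset is enabled once at least one record has been classified."""
        return bool(self.resettable_taxonomies())

    def reset(self, taxonomy):
        """Delete the codes for one taxonomy; other taxonomies are untouched."""
        for codes in self.codes.values():
            codes.pop(taxonomy, None)


ds = ClassifiedDataSet(["A", "B"])
enabled_before = ds.reset_enabled()

ds.stamp("A", taxonomy=1, category="Desserts")
ds.stamp("A", taxonomy=2, category="Food")
enabled_after = ds.reset_enabled()
resettable = ds.resettable_taxonomies()

ds.reset(taxonomy=1)
remaining = ds.codes["A"]
```

Resetting taxonomy 1 in this model leaves the taxonomy 2 code in place, which mirrors why a reset isn't needed to classify into a different taxonomy.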