SVM-Based Supervised Classification

The second method we can use for training purposes is known as Support Vector Machine (SVM) classification. SVM is a type of machine learning algorithm derived from statistical learning theory. A property of SVM classification is the ability to learn from a very small sample set.

Using the SVM classifier is much the same as using the Decision Tree classifier, with the following differences.

  • The preference used in the call to CTX_CLS.TRAIN should be of type SVM_CLASSIFIER instead of RULE_CLASSIFIER. (If you do not want to modify any attributes, you can use the predefined preference CTXSYS.SVM_CLASSIFIER.)

  • The CONTEXT index on the table does not have to be populated; that is, you can use the NOPOPULATE keyword. The classifier uses it only to find the source of the text, by means of datastore and filter preferences, and to determine how to process the text, through lexer and sectioner preferences.

  • The table for the generated rules must have (as a minimum) these columns:

    cat_id      number,
    type        number,
    rule        blob;

As you can see, the generated rule is written into a BLOB column. It is therefore opaque to the user, and unlike Decision Tree classification rules, it cannot be edited or modified. The trade-off here is that you often get considerably better accuracy with SVM than with Decision Tree classification.

With SVM classification, allocated memory has to be large enough to load the SVM model; otherwise, the application built on SVM will incur an out-of-memory error. Here is how to calculate the memory allocation:

Minimum memory request (in bytes) = number of unique categories x number of features 
                                    example: (value of MAX_FEATURES attributes) x 8

If necessary to meet the minimum memory requirements, either:

  • increase SGA memory (if in shared server mode)

  • increase PGA memory (if in dedicated server mode)