9.4 Rule-Based Classification

Rule-based classification is the basic solution for creating an Oracle Text classification application.

The basic steps for rule-based classification are as follows. Specific steps are explored in greater detail in the example.

  1. Create a table for the documents to be classified, and then populate it.

  2. Create a rule table (also known as a category table). The rule table consists of categories that you name, such as "medicine" or "finance," and the rules that sort documents into those categories.

    These rules are actually queries. For example, you define the "medicine" category as documents that include the words "hospital," "doctor," or "disease." Therefore, you would set up a rule in the form of "hospital OR doctor OR disease."

  3. Create a CTXRULE index on the rule table.

  4. Classify the documents.
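The four steps can be sketched end to end for the "medicine" rule described above. This is a minimal illustration, not part of the worked example that follows: the names doc_rules, rule_id, rule_text, and doc_rules_idx are hypothetical, and the sample sentence passed to MATCHES is invented.

```sql
-- Hypothetical names (doc_rules, doc_rules_idx) used only for illustration.

-- Step 2: a rule table holding one category and its rule (a query).
create table doc_rules (
  rule_id   number primary key,
  category  varchar2(100),
  rule_text varchar2(2000));

insert into doc_rules values (1, 'medicine', 'hospital OR doctor OR disease');
commit;

-- Step 3: index the rule column with the CTXRULE index type.
create index doc_rules_idx on doc_rules(rule_text)
  indextype is ctxsys.ctxrule;

-- Step 4: classify one document by matching its text against the rules.
-- MATCHES returns a value greater than 0 for each rule the text satisfies.
select category
  from doc_rules
 where matches(rule_text, 'The doctor admitted the patient to the ward.') > 0;
```

Step 1 (the document table) is omitted here because MATCHES accepts the document text directly, so a literal string is enough to try out a rule.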

See Also:

"CTXRULE Parameters and Limitations" for information on which operators are allowed for queries

9.4.1 Rule-Based Classification Example

In this example, you gather news articles about different subjects and then classify them. After you create the rules, you index them and then use the MATCHES operator to classify documents.

To classify documents:

  1. Create the schema to store the data.

    The news_table stores the documents to be classified. The news_categories table stores the categories and rules that define the categories. The news_id_cat table stores the document IDs and their associated categories after classification.

    create table news_table (
           tk number primary key not null,
           title varchar2(1000),
           text clob);
    
    create table news_categories (
            queryid  number primary key not null,
            category varchar2(100),
            query    varchar2(2000));
    
    create table news_id_cat (
            tk number, 
            category_id number);

  2. Load the documents with SQLLDR.

    Use the SQLLDR program to load the HTML news articles into the news_table. The file names and titles are read from loader.dat.

    LOAD DATA
         INFILE 'loader.dat'
         INTO TABLE news_table
         REPLACE
         FIELDS TERMINATED BY ';'
         (tk         INTEGER EXTERNAL,
          title      CHAR,
          text_file  FILLER CHAR,
          text       LOBFILE(text_file) TERMINATED BY EOF)

  3. Create the categories and write the rules for each category.

    The defined categories are Asia, Europe, Africa, Middle East, Latin America, United States, Conflicts, Finance, Technology, Consumer Electronics, World Politics, U.S. Politics, Astronomy, Paleontology, Health, Natural Disasters, Law, and Music News.

    A rule is a query that selects documents for the category. For example, the 'Asia' category has a rule of 'China or Pakistan or India or Japan'. Insert the rules in the news_categories table.

    insert into news_categories values
      (1,'United States','Washington or George Bush or Colin Powell');
    
    insert into news_categories values
      (2,'Europe','England or Britain or Germany');
    
    insert into news_categories values
      (3,'Middle East','Israel or Iran or Palestine');
    
    insert into news_categories values(4,'Asia','China or Pakistan or India or Japan');
    
    insert into news_categories values(5,'Africa','Egypt or Kenya or Nigeria');
    
    insert into news_categories values
      (6,'Conflicts','war or soldiers or military or troops');
    
    insert into news_categories values(7,'Finance','profit or loss or wall street');
    insert into news_categories values
      (8,'Technology','software or computer or Oracle 
       or Intel or IBM or Microsoft');
    
    insert into news_categories values
      (9,'Consumer electronics','HDTV or electronics');
    
    insert into news_categories values
      (10,'Latin America','Venezuela or Colombia 
       or Argentina or Brazil or Chile');
    
    insert into news_categories values
      (11,'World Politics','Hugo Chavez or George Bush 
       or Tony Blair or Saddam Hussein or United Nations');
    
    insert into news_categories values
      (12,'US Politics','George Bush or Democrats or Republicans 
       or civil rights or Senate');
    
    insert into news_categories values
      (13,'Astronomy','Jupiter or Earth or star or planet or Orion 
       or Venus or Mercury or Mars or Milky Way 
       or Telescope or astronomer 
       or NASA or astronaut');
    
    insert into news_categories values
      (14,'Paleontology','fossils or scientist 
       or paleontologist or dinosaur or Nature');
    
    insert into news_categories values
      (15,'Health','stem cells or embryo or health or medical
       or medicine or World Health Organization 
       or virus or centers for disease control or vaccination');
    
    insert into news_categories values
      (16,'Natural Disasters','earthquake or hurricane or tornado');
    
    insert into news_categories values
      (17,'Law','Supreme Court or legislation');
    
    insert into news_categories values
      (18,'Music News','piracy or anti-piracy 
       or Recording Industry Association of America 
       or copyright or copy-protection or CDs 
       or music or artist or song');
    
    commit;

  4. Create the CTXRULE index on the news_categories query column.

    create index news_cat_idx on news_categories(query)
    indextype is ctxsys.ctxrule;

  5. To classify the documents, use the CLASSIFIER.THIS PL/SQL procedure (a simple procedure designed for this example).

    The procedure scrolls through the news_table, matches each document to a category, and writes the categorized results into the news_id_cat table.

    create or replace package classifier as
      procedure this;
    end;
    /
    
    show errors
    
    create or replace package body classifier as
    
     procedure this
     is
      v_document    clob;
      v_item        number;
      v_doc         number;
     begin
    
      for doc in (select tk, text from news_table)
         loop
            v_document := doc.text;
            v_item := 0;
            v_doc  := doc.tk;
            for c in (select queryid, category from news_categories
                 where matches(query, v_document) > 0 )
              loop
                v_item := v_item + 1;
                insert into news_id_cat values (doc.tk,c.queryid);
              end loop;
       end loop;
    
     end this;
    
    end;
    /
    show errors
    exec classifier.this
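After the procedure runs, the results can be inspected by joining the three tables. The queries below are a sketch that uses only the tables and columns defined in step 1. The explicit commit is an assumption about how the session is run, since the procedure itself does not commit.

```sql
-- Make the rows written by classifier.this permanent.
commit;

-- One row per (document, matching category) pair.
select n.tk, n.title, c.category
  from news_id_cat ic
  join news_table n on n.tk = ic.tk
  join news_categories c on c.queryid = ic.category_id
 order by n.tk, c.category;

-- Documents that matched no rule at all.
select n.tk, n.title
  from news_table n
 where not exists (select 1 from news_id_cat ic where ic.tk = n.tk);
```

Because news_id_cat has no primary key or unique constraint, running classifier.this a second time inserts the same pairs again. Clearing the table first (for example, with delete from news_id_cat) avoids duplicates.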

9.4.2 CTXRULE Parameters and Limitations

The following considerations apply to creating a CTXRULE index:

  • If you use the SVM_CLASSIFIER classifier, then you may use the BASIC_LEXER, CHINESE_LEXER, JAPANESE_LEXER, or KOREAN_MORPH_LEXER lexers. If you do not use SVM_CLASSIFIER, then you can use only the BASIC_LEXER lexer type to index your query set.

  • Filter, memory, datastore, and [no]populate parameters are not applicable to the CTXRULE index type.

  • The CREATE INDEX storage clause is supported for creating the index on the queries.

  • Wordlists are supported for stemming operations on your query set.

  • Queries for CTXRULE are similar to the CONTAINS queries. Basic phrasing ("dog house") is supported, as are the following CONTAINS operators: ABOUT, AND, NEAR, NOT, OR, STEM, WITHIN, and THESAURUS. Section groups are supported for using the MATCHES operator to classify documents. Field sections are also supported; however, CTXRULE does not directly support field queries, so you must use a query rewrite on a CONTEXT query.

  • You must drop the CTXRULE index before exporting or downgrading the database.
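As an illustration of the lexer and wordlist points above, the following sketch creates the rule index with explicit preferences so that rules can use the stem operator ($). The preference names news_rule_lexer and news_rule_wordlist and the 'Weather' rule are hypothetical. The sketch assumes it is run in place of the plain CREATE INDEX in step 4 of the example, or after that index is dropped.

```sql
-- Hypothetical preference names; created with the CTX_DDL package.
begin
  ctx_ddl.create_preference('news_rule_lexer', 'BASIC_LEXER');
  ctx_ddl.create_preference('news_rule_wordlist', 'BASIC_WORDLIST');
  -- Stemming for rules written with the $ operator.
  ctx_ddl.set_attribute('news_rule_wordlist', 'STEMMER', 'ENGLISH');
end;
/

-- Hypothetical rule that relies on stemming (for example, storm and storms).
insert into news_categories values (19, 'Weather', '$storm or $flood');
commit;

-- Assumes no index named news_cat_idx exists yet on this column.
create index news_cat_idx on news_categories(query)
  indextype is ctxsys.ctxrule
  parameters ('lexer news_rule_lexer wordlist news_rule_wordlist');
```

The rule is inserted before the index is created so that it is indexed along with the existing rules. Rules added afterward are not visible to MATCHES until the index is synchronized.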

See Also: