Performing Sentiment Analysis Using Oracle Text

12.1 Overview of Sentiment Analysis

Sentiment analysis uses trained sentiment classifiers to provide sentiment information for documents or topics within documents.

This section contains the following topics:

12.1.1 About Sentiment Analysis

Oracle Text enables you to perform sentiment analysis for a topic or document by using sentiment classifiers that are trained to identify sentiment metadata.

With growing amounts of data, it would be beneficial if organizations could gain more insights into their data rather than just obtaining “hits” in response to a search query. The insight could be in the form of answering certain basic types of queries (such as weather queries or queries about recent events) or providing opinions about a user-specified topic. Keyword searches provide a list of results containing the search term. However, to identify a sentiment or opinion with regard to the search term, you need to perform further data analysis by browsing through all the results and then manually locating the required sentiment information. Sentiment analysis provides a one-step process to identify sentiment information within a set of documents.

Sentiment analysis is the process of identifying and extracting sentiment metadata related to a specified topic or entity from a set of documents. The sentiment is identified using trained sentiment classifiers. When you run a query using sentiment analysis, in addition to the search results, sentiment metadata is also identified and displayed. Sentiment analysis provides answers to questions such as “Is a product review positive or negative?” or “Is the customer satisfied or dissatisfied?” For example, from a document set consisting of multiple reviews for a particular product, you can determine an overall sentiment that indicates if the product is good or bad.

12.1.2 About Sentiment Classifiers

A sentiment classifier is a type of document classifier that is used to extract sentiment metadata related to a topic or document.

To perform sentiment analysis using a sentiment classifier, you must first associate a sentiment classifier preference with the sentiment classifier and then train the sentiment classifier.

User-defined sentiment classifiers can be associated with a sentiment classifier preference of type SENTIMENT_CLASSIFIER. A sentiment classifier preference specifies the parameters that are used to train a sentiment classifier. These parameters are defined as attributes of the sentiment classifier preference. You can either create a sentiment classifier preference or use the default CTXSYS.DEFAULT_SENTIMENT_CLASSIFIER. To create a user-defined sentiment classifier preference, use the CTX_DDL.CREATE_PREFERENCE procedure to define a sentiment classifier preference and the CTX_DDL.SET_ATTRIBUTE procedure to define its parameters.

To train a sentiment classifier, you need to provide an associated sentiment classifier preference, a training set of documents, and the sentiment categories. If no classifier preference is specified, then Oracle Text uses default values for the training parameters. The sentiment classifier is trained using the set of sample documents and the specified preference. Each sample document is assigned to a particular category. Oracle Text deduces a set of classification rules that define how sentiment analysis must be performed using this sentiment classifier. Use the CTX_CLS.SA_TRAIN_MODEL procedure to train a sentiment classifier.

Typically, you would define and train separate sentiment classifiers for different categories of documents such as finance, product reviews, music, and so on. If you do not want to create your own sentiment classifier or if suitable training data is not available to train your classifier, you can use the default sentiment classifier provided by Oracle Text. The default sentiment classifier is unsupervised.

Note:

The default sentiment classifier works only with AUTO_LEXER. Do not use AUTO_LEXER when using user-defined sentiment classifiers.

12.1.3 About Performing Sentiment Analysis with Oracle Text

To perform sentiment analysis, you run a sentiment query that includes the sentiment classifier that must be used to identify sentiment information. The classifier can be the default sentiment classifier or a user-defined sentiment classifier.

Sentiment analysis can be performed only as part of a search operation. Oracle Text searches for the specified keywords and generates a result set. Then, sentiment analysis is performed on the result set to identify a sentiment score for each result. If you do not explicitly specify a sentiment classifier in your query, the default classifier is used.

You can either identify one single sentiment for the entire document or separate sentiments for each topic within a document. Most often, a document contains multiple topics and the author’s sentiment towards each topic may be different. In such cases, document-level sentiment scores may not be useful because they cannot identify sentiment scores associated with different topics in the document. Identifying topic-level sentiment scores provides the required answers. For example, when searching through a set of documents containing reviews for a camera, a document-level sentiment tells you whether the camera is good or not. Assume that you want the general opinion about the picture quality of the particular camera. Performing a topic-level sentiment analysis, with “picture quality” as one of the topics will provide the required information.

Note:

If you do not specify a topic of interest for sentiment analysis, then Oracle Text returns the overall sentiment for the entire document.

12.1.4 Interfaces for Performing Sentiment Analysis

Oracle Text supports multiple interfaces for performing sentiment analysis.

Use one of the following interfaces to run a sentiment query:

Procedures in the CTX_DOC package
XML Query Result Set Interface (RSI)

12.2 Creating a Sentiment Classifier Preference

Use the CTX_DDL.CREATE_PREFERENCE procedure to create a sentiment classifier preference and the CTX_DDL.SET_ATTRIBUTE procedure to define its attributes. The classifier type associated with a user-defined sentiment classifier preference is SENTIMENT_CLASSIFIER.

To create a sentiment classifier preference:

Define a sentiment classifier preference using the CTX_DDL.CREATE_PREFERENCE procedure. The classifier must be of type SENTIMENT_CLASSIFIER.
Define attributes for the sentiment classifier preference using the CTX_DDL.SET_ATTRIBUTE procedure. The attributes define the parameters that are used to train the sentiment classifier.

Example 12-1 Creating a Sentiment Classifier Preference

The following example creates a sentiment classifier preference named clsfier_camera. This preference will be used to classify a set of documents that contain reviews for SLR cameras.

Define a sentiment classifier preference of type SENTIMENT_CLASSIFIER.

The following command defines a sentiment classifier preference named clsfier_camera.
```
exec ctx_ddl.create_preference('clsfier_camera','SENTIMENT_CLASSIFIER');
```
Define the attributes of the sentiment classifier preference clsfier_camera.

The following commands define attributes for the clsfier_camera sentiment classifier preference. The maximum number of features to be extracted is set to 1000 and the number of iterations for which the classifier runs is set to 600.
```
exec ctx_ddl.set_attribute('clsfier_camera','MAX_FEATURES','1000');
exec ctx_ddl.set_attribute('clsfier_camera','NUM_ITERATIONS','600');
```

For attributes that are not explicitly defined, the default values are used.
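
To confirm which attributes were stored for the preference, one option is to query the Oracle Text data dictionary. The following sketch assumes the CTX_USER_PREFERENCE_VALUES view and its PRV_PREFERENCE, PRV_ATTRIBUTE, and PRV_VALUE columns as described in the Oracle Text Reference. Attributes left at their default values may not be listed.

```sql
-- List the attributes explicitly set on the clsfier_camera preference.
-- Preference names are stored in uppercase in the data dictionary.
select prv_attribute, prv_value
from   ctx_user_preference_values
where  prv_preference = 'CLSFIER_CAMERA'
order  by prv_attribute;
```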

12.3 Training Sentiment Classifiers

Training a sentiment classifier generates the classification rules that will be used to provide a positive or negative sentiment with respect to a search keyword.

The following example trains a sentiment classifier that can perform sentiment analysis on user reviews of cameras:

Create and populate the training document table. This table contains the actual text of the training set documents or the filenames (if the documents are present externally).

Ensure that the training documents are randomly selected to avoid any possible bias in the trained sentiment classifier. The distribution of positive and negative documents must not be skewed. Oracle Text checks for this while training the sentiment classifier.
```
create table training_camera (review_id number primary key, text varchar2(2000));
insert into training_camera values( 1,'/sa/reviews/cameras/review1.txt');
insert into training_camera values( 2,'/sa/reviews/cameras/review2.txt');
insert into training_camera values( 3,'/sa/reviews/cameras/review3.txt');
insert into training_camera values( 4,'/sa/reviews/cameras/review4.txt');
```
Create and populate the category table.

This table specifies training labels for the documents present in the document table. It tells the classifier the true sentiment of the training set documents.

The primary key of the document table must have a foreign key relationship with the unique key of the category table. The names of these columns must be passed to the CTX_CLS.SA_TRAIN_MODEL procedure so that the sentiment label can be associated with the corresponding document.

Oracle Text validates the parameters specified for the classifier preference and the category values. The category values are restricted to 1 for positive and 2 for negative sentiment. If certain documents are neutral (neither positive nor negative) in sentiment, or if certain documents are not to be trained for positive or negative sentiment classification, then do not add them to the training set category table. In this example, train_category is the training set category table. Columns in the category table other than document ID and category are not used by the classifier.
```
create table train_category (doc_id number, category number, category_desc varchar2(100));

insert into train_category values (2,1,'positive');
insert into train_category values (3,2,'negative');
insert into train_category values (4,2,'negative');
```
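
Because the distribution of positive and negative documents must not be skewed, it can help to inspect the category table before training. The following queries are a minimal sketch against the example tables above; they are not part of the Oracle Text API, and the share shown here is only an approximation of whatever ratio Oracle Text computes internally.

```sql
-- Share of each sentiment label in the category table
-- (1 = positive, 2 = negative, as required by the classifier).
select category,
       count(*)                                    as doc_count,
       round(ratio_to_report(count(*)) over (), 2) as share
from   train_category
where  category in (1, 2)
group  by category
order  by category;

-- Labels that point at no training document (should return no rows).
select c.doc_id
from   train_category c
where  not exists (select 1
                   from   training_camera t
                   where  t.review_id = c.doc_id);
```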
Create the context index on the training document table. This index will be used to extract metadata related to training documents while training the sentiment classifier.
In this example, we create an index without populating it.
```
exec ctx_ddl.create_preference('fds','FILE_DATASTORE');
create index docx on training_camera(text) indextype is ctxsys.context parameters ('datastore fds nopopulate');
```
(Optional) Create a sentiment classifier preference, clsfier_camera, that performs sentiment analysis on a document set consisting of camera reviews.
Train the sentiment classifier clsfier_camera.
During training, Oracle Text determines the ratio of positive to negative documents. If this ratio is not in the range of 0.4 to 0.6, then a warning is written to the CTX log indicating that the sentiment classifier generated is skewed. After the sentiment classifier is trained, it is ready to be used in sentiment queries to perform sentiment analysis.

In the following example, clsfier_camera is the name of the sentiment classifier that is being trained, docx is the context index on the training document table, review_id is the name of the document ID column in the document training set, train_category is the name of the category table that contains the labels for the training set documents, doc_id is the document ID column in the category table, category is the category column in the category table, and clsfier is the name of the sentiment classifier preference that is used to train the classifier.
```
exec ctx_cls.sa_train_model('clsfier_camera','docx','review_id','train_category','doc_id','category','clsfier');
```
Note:

If you do not specify a sentiment classifier preference when running the CTX_CLS.SA_TRAIN_MODEL procedure, then Oracle Text uses the default preference CTXSYS.DEFAULT_SENTIMENT_CLASSIFIER.
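
As a sketch of the case described in this note, the following call trains a classifier without naming a preference. It assumes that the preference argument can simply be omitted from the call, and clsfier_camera_dflt is a hypothetical classifier name used only for illustration.

```sql
-- Same index, document ID column, and category table as in the example
-- above, but no preference argument, so the default training parameters
-- apply. clsfier_camera_dflt is a hypothetical classifier name.
exec ctx_cls.sa_train_model('clsfier_camera_dflt','docx','review_id','train_category','doc_id','category');
```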

12.4 Performing Sentiment Analysis Using the CTX_DOC Package

Use the procedures in the CTX_DOC package to perform sentiment analysis on a single document within a document set. For each document, you can either determine a single sentiment score for the entire document or individual sentiment scores for each topic within the document.

Before you perform sentiment analysis, you must create a context index on the document set. The following command creates a context index camera_revidx on the document set contained in the camera_reviews table.

create index camera_revidx on camera_reviews(review_text) indextype is ctxsys.context parameters ('lexer mylexer stoplist ctxsys.default_stoplist');
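
The index definition above refers to a document table and a lexer preference, mylexer, that this section does not define; both would need to exist before the index is created. The following is a sketch under stated assumptions: the review_id column and the column types are hypothetical, and mylexer is assumed to be an AUTO_LEXER preference, which is what the default sentiment classifier requires. For a user-defined classifier, a different lexer type such as BASIC_LEXER would be needed instead.

```sql
-- Hypothetical document table; only review_text is named in the text above.
create table camera_reviews (review_id number primary key, review_text clob);

-- Assumed definition of the lexer preference referenced by the index.
-- Use a different lexer type (for example, BASIC_LEXER) with a
-- user-defined sentiment classifier.
exec ctx_ddl.create_preference('mylexer','AUTO_LEXER');
```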

To perform sentiment analysis with the CTX_DOC package, use one of the following methods:

Run the CTX_DOC.SENTIMENT_AGGREGATE procedure with the required parameters

This procedure provides a single consolidated sentiment score for the entire document.

The sentiment score is a value in the range -100 to 100 and indicates the strength of the sentiment. A negative score represents a negative sentiment and a positive score represents a positive sentiment. Based on the sentiment scores, you can choose to group scores into labels such as Strongly Negative (-100 to -80), Negative (-80 to -50), Neutral (-50 to +50), Positive (+50 to +80), and Strongly Positive (+80 to +100).
Run the CTX_DOC.SENTIMENT procedure with the required parameters

This procedure returns the individual segments within the document that contain the search term and provides an associated sentiment score for each segment.
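
The grouping of scores into labels described above can be sketched with a CASE expression. The sample scores are inline values, not query output, and because the suggested ranges share their boundary values, the assignment of a boundary score (for example, exactly -80 or +50) to one label or the other is an assumption of this sketch.

```sql
-- Map sentiment scores (-100 to 100) to the labels suggested above.
-- Boundary handling is an assumption: -80 counts as Strongly Negative,
-- -50 as Negative, +50 as Positive, and +80 as Strongly Positive.
with scores (doc_id, score) as (
  select 1,  74 from dual union all
  select 2, -62 from dual union all
  select 3,   2 from dual
)
select doc_id,
       score,
       case
         when score <= -80 then 'Strongly Negative'
         when score <= -50 then 'Negative'
         when score <   50 then 'Neutral'
         when score <   80 then 'Positive'
         else                   'Strongly Positive'
       end as sentiment_label
from   scores;
```

With these sample values, 74 is labeled Positive, -62 is labeled Negative, and 2 is labeled Neutral.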

Example 12-2 Obtaining a Single Sentiment Score for a Document

The following example uses the sentiment classifier clsfier_camera to provide a single aggregate sentiment score for the entire document. The sentiment classifier has been created and trained. The table containing the document set has a context index called camera_revidx. The doc_id of the document within the document table for which sentiment analysis must be performed is 49. The topic for which a sentiment score is being generated is ‘Nikon’.

select ctx_doc.sentiment_aggregate('camera_revidx','49','Nikon','clsfier_camera') from dual;

CTX_DOC.SENTIMENT_AGGREGATE('CAMERA_REVIDX','49','NIKON','CLSFIER_CAMERA')
--------------------------------------------------------------------------
                            74
1 row selected.
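
Because CTX_DOC.SENTIMENT_AGGREGATE works on one document at a time, scoring several documents means calling it once per row. The following query is a sketch only: it assumes that camera_reviews has a numeric primary key column named review_id (a hypothetical name) and that this key is the text key of the camera_revidx index.

```sql
-- Aggregate 'Nikon' sentiment for every review, most negative first.
-- review_id is an assumed primary key column of camera_reviews.
select r.review_id,
       ctx_doc.sentiment_aggregate('camera_revidx', to_char(r.review_id),
                                   'Nikon', 'clsfier_camera') as nikon_score
from   camera_reviews r
order  by nikon_score;
```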

Example 12-3 Obtaining a Single Sentiment Score Using the Default Classifier

The following example uses the default sentiment classifier to provide an aggregate sentiment score for the entire document. The table containing the document set has a context index called camera_revidx. The doc_id of the document, within the document table, for which sentiment analysis must be performed is 1.

select ctx_doc.sentiment_aggregate('camera_revidx','1') from dual;

CTX_DOC.SENTIMENT_AGGREGATE('CAMERA_REVIDX','1')
--------------------------------------------
                                           2

1 row selected.

Example 12-4 Obtaining Sentiment Scores for Each Topic within a Document

The following example uses the sentiment classifier clsfier_camera to generate sentiment scores for each segment within the document. The sentiment classifier has been created and trained. The table containing the document set has a context index called camera_revidx. The doc_id of the document within the document table for which sentiment analysis must be performed is 49. The topic for which a sentiment score is being generated is ‘Nikon’. The result table, restab, that will be populated with the analysis results has been created with the columns snippet (CLOB) and score (NUMBER).
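
The result table described above could be created beforehand with a statement such as the following, which uses only the column names and types given in the description.

```sql
-- Result table for CTX_DOC.SENTIMENT, with the two columns named above.
create table restab (snippet clob, score number);
```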

exec ctx_doc.sentiment('camera_revidx','49','Nikon','restab','clsfier_camera', starttag=>'<<', endtag=>'>>');

SQL> select * from restab;
SNIPPET						
--------------------------------------------------------------------------------
     SCORE
----------
It took <<Nikon>> a while to produce a superb compact 85mm lens, but this time they finally got it right.
        65

Without a doubt, this is a fine portrait lens for photographing head-and-shoulder portraits (The only lens which is optically better is <<Nikon>>'s legendary 105mm f2.5 Nikkor lens, and its close optical twin, the 105mm f2.8 Micro Nikkor.
        75

Since the 105mm f2.5 Nikkor lens doesn't have an autofocus version, then this might be the perfect moderate telephoto lens for owners of <<Nikon>> autofocus SLR cameras.
        84
        84
3 rows selected.

Example 12-5 Obtaining a Sentiment Score for a Topic Within a Document

The following example uses the sentiment classifier tdrbrtsent03_cl to generate a sentiment score for each segment within the document. The sentiment classifier has been created and trained. The table containing the document set has a context index called tdrbrtsent03_idx. The doc_id of the document within the document table for which sentiment analysis must be performed is 1. The topic for which a sentiment score is being generated is ‘movie’. The result table, tdrbrtsent03_rtab, that will be populated with the analysis results has been created with the columns snippet and score.

SQL> exec ctx_doc.sentiment('tdrbrtsent03_idx','1','movie','tdrbrtsent03_rtab','tdrbrtsent03_cl');
PL/SQL procedure successfully completed.  

SQL> select * from tdrbrtsent03_rtab;
SNIPPET
--------------------------------------------------------------------------------      
SCORE
---------- 
the <b>movie</b> is a bit overlong , but nicholson is such good fun that the running time passes by pretty quickly
 -62

1 row selected.

See Also:

CTX_DOC.SENTIMENT_AGGREGATE in the Oracle Text Reference
CTX_DOC.SENTIMENT in the Oracle Text Reference

12.5 Performing Sentiment Analysis Using Result Set Interface

The XML Query Result Set Interface (RSI) enables you to perform sentiment analysis on a set of documents by using either the default sentiment classifier or a user-defined sentiment classifier. The documents on which sentiment analysis must be performed are stored in a document table.

The sentiment element in the input RSI is used to indicate that sentiment analysis must be performed at query time in addition to other operations specified in the result set descriptor. If you specify a value for the classifier attribute of the sentiment element, then the specified sentiment classifier is used to perform the sentiment analysis. If the classifier attribute is omitted, then Oracle Text performs sentiment analysis using the default sentiment classifier. The sentiment element contains a child element called item that specifies the topic or concept about which a sentiment must be generated during sentiment analysis.

You can either generate a single sentiment score for each document or separate sentiment scores for each topic within the document. Use the agg attribute of the element item to generate a single aggregated sentiment score for each document.

Sentiment classification can be performed using a keyword query or by using the ABOUT operator. When you use the ABOUT operator, the result set includes synonyms of the keyword that are identified using the thesaurus.

To perform sentiment analysis using RSI:

Create and train the sentiment classifier that will be used to perform sentiment analysis.
Create the document table that contains the documents to be analyzed and a context index on the document table.
Use the required elements and attributes within a query to perform sentiment analysis.

The result set descriptor must contain the sentiment element.
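
As a minimal sketch of these requirements, the following block runs an ABOUT query and omits the classifier attribute so that the default sentiment classifier is used. It assumes that camera_revidx was created with AUTO_LEXER, which the default classifier requires, and it requests one aggregated score per document for the topic ‘lens’.

```sql
-- Bind variable that receives the XML result set.
var rs clob;
exec dbms_lob.createtemporary(:rs, TRUE, DBMS_LOB.SESSION);

begin
  -- No classifier attribute on <sentiment>: the default classifier is used.
  -- agg="true" requests a single aggregated score per document.
  ctx_query.result_set('camera_revidx', 'about(camera)', '
    <ctx_result_set_descriptor>
      <hitlist start_hit_num="1" end_hit_num="10" order="score desc">
        <sentiment>
          <item topic="lens" agg="true" />
        </sentiment>
      </hitlist>
    </ctx_result_set_descriptor>', :rs);
end;
/
```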

Example 12-6 Input Result Set Descriptor to Perform Sentiment Analysis

The following example performs sentiment analysis and generates a sentiment for the topic ‘lens’. The driving query is a keyword query for ‘camera’. The sentiment element specifies that sentiment analysis must be performed using the sentiment classifier clsfier_camera. This classifier has been previously created and trained using the CTX_CLS.SA_TRAIN_MODEL procedure. camera_revidx is a context index on the document set table.

The sentiment score ranges from -100 to 100. A positive score indicates positive sentiment whereas a negative score indicates a negative sentiment. The absolute value of the score is indicative of the magnitude of positive/negative sentiment.

To perform sentiment analysis and obtain a sentiment score for each topic within the document:

Declare the CLOB bind variable, rs, that will store the results of the search operation, and allocate a temporary LOB for it.

SQL> var rs clob;
SQL> exec dbms_lob.createtemporary(:rs, TRUE, DBMS_LOB.SESSION);

Perform sentiment analysis as part of a search query.

The keyword being searched for is ‘camera’. The topic for which sentiment analysis is performed is ‘lens’.

begin
  ctx_query.result_set('camera_revidx', 'camera', '
    <ctx_result_set_descriptor>
      <hitlist start_hit_num="1" end_hit_num="10" order="score desc">
        <sentiment classifier="clsfier_camera">
          <item topic="lens" />
          <item topic="picture quality" agg="true" />
        </sentiment>
      </hitlist>
    </ctx_result_set_descriptor>', :rs);
end;
/

View the results stored in the rs variable.

The XML result set can be used by other applications for further processing. Some of the output has been removed for brevity. Notice that each segment within the document has its own sentiment score.

SQL> select xmltype(:rs) from dual; 
XMLTYPE(:RS) 
-------------------------------------------------------------------------------- 
<ctx_result_set>
  <hitlist>
    <hit>
      <sentiment>
        <item topic="lens">
          <segment>
            <segment_text>The first time it was sent in was because the <b>lens </b> door failed to turn on the camera and it was almost to come off of its track . Eight months later, the flash quit working in all modes AND the door was failing AGAIN!</segment_text>
            <segment_score>-81</segment_score>
          </segment>
        </item>
        <item topic="picture quality">
          <score> -75 </score>
        </item>
      </sentiment>
    </hit>
    <hit>
      <sentiment>
        <item topic="lens">
          <segment>
            <segment_text>I was actually quite impressed with it. Powerful zoom , sharp <b>lens</b>, decent picture quality. I also played with some other Panasonic models in various stores just to get a better feel for them, as well as spent a few hours on </segment_text>
            <segment_score> 67 </segment_score>
          </segment>
        </item>
        <item topic="picture quality">
          <score>-1</score>
        </item>
      </sentiment>
    </hit>
    . . .
  . . .
  </hitlist>
</ctx_result_set>
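
For further processing inside the database, the per-segment scores in a result set of this shape could be flattened into rows with XMLTABLE. This is a sketch that assumes the :rs bind variable still holds the result set shown above; the element and attribute names are taken from that output.

```sql
-- One row per scored segment: the topic it belongs to and its score.
-- Scores are read as text and trimmed because the output may pad them
-- with spaces (for example, " 67 ").
select i.topic,
       to_number(trim(s.segment_score)) as segment_score
from   xmltable('/ctx_result_set/hitlist/hit/sentiment/item'
         passing xmltype(:rs)
         columns topic    varchar2(100) path '@topic',
                 segments xmltype       path 'segment') i,
       xmltable('/segment'
         passing i.segments
         columns segment_score varchar2(20) path 'segment_score') s;
```

Items that carry only an aggregated score element, such as the ‘picture quality’ topic above, have no segment children and therefore produce no rows in this query.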

See Also:

Oracle Text Reference