Updating TF.IDF models

This topic describes how to set up and update the TF.IDF model with new training data.

For the TF.IDF training data, you provide one or more language-specific <lang>_abstracts.zip files, where <lang> is one of the supported language codes:
  • de (German)
  • en (US English)
  • es (Spanish)
  • fr (French)
  • gb (UK English)
  • is (Icelandic)
  • it (Italian)
  • pt (Portuguese)
Each ZIP file contains a large number of training files, which can be any text written in the given language. You can draw the text from a variety of corpora.

All the ZIP files must be in the same directory, which can have any name. The example below assumes this directory structure:
/share/models/tfidf/en_abstracts.zip
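
Because the file names must follow the <lang>_abstracts.zip pattern, it can help to check the directory before running an update. The following is a minimal sketch, not part of bdd-admin: check_tfidf_dir is a hypothetical helper, and it assumes that the eight codes listed above are the complete supported set.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the product): verifies that every
# ZIP file in a TF.IDF training directory is named <lang>_abstracts.zip,
# where <lang> is one of the codes listed in this topic.
check_tfidf_dir() {
    local dir="$1" status=0 zip name lang

    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 2; }

    for zip in "$dir"/*.zip; do
        # With no matches the glob stays literal, so test for existence.
        [ -e "$zip" ] || { echo "no ZIP files found in $dir" >&2; return 1; }

        name="$(basename "$zip")"
        lang="${name%_abstracts.zip}"   # strip the required suffix

        case "$lang" in
            de|en|es|fr|gb|is|it|pt)
                echo "ok: $name" ;;
            *)
                echo "unexpected file name: $name" >&2
                status=1 ;;
        esac
    done
    return "$status"
}
```

For example, check_tfidf_dir /share/models/tfidf returns a nonzero status if the directory contains a misnamed file such as english.zip or a name with a doubled underscore.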

The following procedure assumes that you have downloaded a corpus ZIP file and renamed it to en_abstracts.zip.

To update the TF.IDF model:

  1. Create the directory described above to hold the TF.IDF training ZIP files (/share/models/tfidf in this example).
  2. Copy the en_abstracts.zip training file into the /share/models/tfidf directory.
  3. Run the bdd-admin script with the update-model command, the tfidf model-type argument, and the absolute path to the /tfidf directory:
    ./bdd-admin.sh update-model tfidf /share/models/tfidf
If the command succeeds, it prints messages similar to these:
[2016/07/15 11:21:42 -0400] [Admin Server] Generating the tfidf model file using new model file...Success!
[2016/07/15 11:24:45 -0400] [Admin Server] Publishing the tfidf model file...
[2016/07/15 11:24:57 -0400] [Admin Server] Successfully published the model file.

The operation replaces the TF.IDF model's current JAR on the YARN worker nodes with the new one.
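
The three steps of the procedure can be collected into one shell function. This is a sketch under stated assumptions rather than a supported tool: update_tfidf_model and the BDD_ADMIN variable are hypothetical names introduced here, and only the update-model tfidf invocation itself comes from this topic.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the documented procedure.
# Usage: update_tfidf_model <lang> <corpus.zip> [model_dir]
# BDD_ADMIN (an assumption, not a product setting) may point at bdd-admin.sh;
# by default the script is expected in the current directory.
update_tfidf_model() {
    local lang="$1"
    local corpus_zip="$2"
    local model_dir="${3:-/share/models/tfidf}"
    local admin="${BDD_ADMIN:-./bdd-admin.sh}"

    # The update-model command expects an absolute path.
    case "$model_dir" in
        /*) ;;
        *) echo "model directory must be an absolute path: $model_dir" >&2
           return 2 ;;
    esac

    # Step 1: create the directory that holds the training ZIP files.
    mkdir -p "$model_dir" || return 1

    # Step 2: copy the corpus in under the required <lang>_abstracts.zip name.
    cp "$corpus_zip" "$model_dir/${lang}_abstracts.zip" || return 1

    # Step 3: rebuild and publish the TF.IDF model.
    "$admin" update-model tfidf "$model_dir"
}
```

Run from the directory that contains bdd-admin.sh, a call such as update_tfidf_model en /tmp/corpus.zip (where /tmp/corpus.zip stands in for your downloaded corpus) performs the same actions as the numbered steps, using /share/models/tfidf as the default directory.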

You can revert the model by running the command without the path argument:
./bdd-admin.sh update-model tfidf

This reverts the TF.IDF model to the original, shipped version.