Create and Use Custom Vocabulary

This example shows how to create and use your own vocabulary of tokens when chunking data.

Here, you use the chunker helper function CREATE_VOCABULARY from the DBMS_VECTOR_CHAIN package to load your own vocabulary file. The vocabulary file contains a list of tokens that are recognized by your model's tokenizer.
  1. Start SQL*Plus and connect to Oracle Database as a local test user:
    1. Log in to SQL*Plus as the sys user, connecting as sysdba to a pluggable database (PDB) in your multitenant container database (CDB), and create a tablespace for the test user:
      conn sys/password@CDB_PDB as sysdba
      CREATE TABLESPACE tbs1
      DATAFILE 'tbs1.dbf' SIZE 20G AUTOEXTEND ON
      EXTENT MANAGEMENT LOCAL
      SEGMENT SPACE MANAGEMENT AUTO;
    2. Create a local test user (docuser) and grant necessary privileges:
      drop user docuser cascade;
      create user docuser identified by docuser DEFAULT TABLESPACE tbs1 quota unlimited on tbs1;
      grant DB_DEVELOPER_ROLE to docuser;
    3. Create a local directory on your server (VEC_DUMP) to store your vocabulary file. Grant necessary privileges:
      create or replace directory VEC_DUMP as '/my_local_dir/';
      grant read, write on directory VEC_DUMP to docuser;
      
      commit;
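      Optionally, confirm that the directory object exists and points at the intended path. This quick sanity check (an addition to the original steps) queries the standard ALL_DIRECTORIES dictionary view:
      SELECT directory_path FROM all_directories WHERE directory_name = 'VEC_DUMP';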
    4. Transfer the vocabulary file for your required model to the VEC_DUMP directory.
      For example, if using WordPiece tokenization, you can download the vocab.txt vocabulary file for the "bert-base-uncased" model, then transfer it to VEC_DUMP under the file name that the external table in step 2 expects (here, bert-vocabulary-uncased.txt).
    5. Connect to Oracle Database as the test user and alter the environment settings for your session:
      conn docuser/password@CDB_PDB;
      
      SET ECHO ON
      SET FEEDBACK 1
      SET NUMWIDTH 10
      SET LINESIZE 80
      SET TRIMSPOOL ON
      SET TAB OFF
      SET PAGESIZE 10000
      SET LONG 10000
  2. Create an external table (doc_vocabtab) that reads your vocabulary tokens from the file in VEC_DUMP:
    CREATE TABLE doc_vocabtab(token nvarchar2(64))
      ORGANIZATION EXTERNAL
      (default directory VEC_DUMP
       ACCESS PARAMETERS (RECORDS DELIMITED BY NEWLINE)
       location ('bert-vocabulary-uncased.txt'));
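    Because doc_vocabtab is an external table, querying it reads the vocabulary file directly, so a quick count confirms that the file is in place and readable (for reference, the bert-base-uncased vocabulary contains 30,522 tokens):
    SELECT COUNT(*) FROM doc_vocabtab;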
  3. Run DBMS_VECTOR_CHAIN.CREATE_VOCABULARY to create a vocabulary (doc_vocab):
    DECLARE
      vocab_params clob := '{"table_name"     : "doc_vocabtab",
                           "column_name"      : "token",
                           "vocabulary_name"  : "doc_vocab",
                           "format"           : "bert",
                           "cased"            : false}';
    
    BEGIN
      dbms_vector_chain.create_vocabulary(json(vocab_params));
    END;
    /
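    To reload the vocabulary later (for example, after changing the vocabulary file), drop it first and re-run the block above. A minimal sketch using the companion DROP_VOCABULARY chunker helper from the same package:
    BEGIN
      dbms_vector_chain.drop_vocabulary('doc_vocab');
    END;
    /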
After loading the token vocabulary, you can use the BY VOCABULARY chunking mode (with the VECTOR_CHUNKS SQL function or the DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS function) to split data by counting the number of tokens, as in the sketch below.
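For example, the following query chunks a short passage into pieces of at most 100 tokens, counted against doc_vocab. This is a minimal sketch: the sample text and the MAX, OVERLAP, SPLIT, LANGUAGE, and NORMALIZE settings are illustrative choices, not required values:
  SELECT C.chunk_offset, C.chunk_length, C.chunk_text
  FROM (SELECT to_clob('Oracle AI Vector Search stores vector embeddings. '
                    || 'Embeddings enable similarity search over documents.') AS text
        FROM dual) D,
       VECTOR_CHUNKS(D.text
                     BY vocabulary doc_vocab
                     MAX 100
                     OVERLAP 0
                     SPLIT BY sentence
                     LANGUAGE american
                     NORMALIZE all) C;
The same mode is available through DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS by passing JSON chunking parameters such as "by": "vocabulary" and "vocabulary": "doc_vocab"; check the package documentation in your release for the exact parameter names.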