Create and Use Custom Vocabulary

This example shows how to create and use your own vocabulary of tokens when chunking data.

Here, you use the chunker helper function CREATE_VOCABULARY from the DBMS_VECTOR_CHAIN package to load your own vocabulary file. The vocabulary file contains a list of tokens that are recognized by your model's tokenizer.
  1. Start SQL*Plus and connect to Oracle Database as a local test user:
    1. Log in to SQL*Plus as the sys user, connecting as sysdba to a pluggable database (PDB) in your multitenant container database (CDB), and create a tablespace for the test user:
      conn sys/password@CDB_PDB as sysdba
      CREATE TABLESPACE tbs1
      DATAFILE 'tbs1.dbf' SIZE 20G AUTOEXTEND ON
      EXTENT MANAGEMENT LOCAL
      SEGMENT SPACE MANAGEMENT AUTO;
    2. Create a local test user (docuser) and grant necessary privileges:
      drop user docuser cascade;
      create user docuser identified by docuser DEFAULT TABLESPACE tbs1 quota unlimited on tbs1;
      grant DB_DEVELOPER_ROLE to docuser;
    3. Create a local directory on your server (VEC_DUMP) to store your vocabulary file. Grant necessary privileges:
      create or replace directory VEC_DUMP as '/my_local_dir/';
      grant read, write on directory VEC_DUMP to docuser;
      
      commit;
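      Optionally, confirm that the directory object exists and points at the intended path. This quick sanity check (an addition to the original steps) queries the standard ALL_DIRECTORIES dictionary view:
      SELECT directory_path FROM all_directories WHERE directory_name = 'VEC_DUMP';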
    4. Transfer the vocabulary file for your required model to the VEC_DUMP directory.
      For example, if using WordPiece tokenization, you can download the vocab.txt vocabulary file for the "bert-base-uncased" model, then transfer it to VEC_DUMP under the file name that the external table in step 2 expects (here, bert-vocabulary-uncased.txt).
    5. Connect to Oracle Database as the test user and alter the environment settings for your session:
      conn docuser/password@CDB_PDB;
      
      SET ECHO ON
      SET FEEDBACK 1
      SET NUMWIDTH 10
      SET LINESIZE 80
      SET TRIMSPOOL ON
      SET TAB OFF
      SET PAGESIZE 10000
      SET LONG 10000
  2. Create an external table (doc_vocabtab) that reads your vocabulary tokens from the file in VEC_DUMP:
    CREATE TABLE doc_vocabtab(token nvarchar2(64))
      ORGANIZATION EXTERNAL
      (default directory VEC_DUMP
       ACCESS PARAMETERS (RECORDS DELIMITED BY NEWLINE)
       location ('bert-vocabulary-uncased.txt'));
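    Because doc_vocabtab is an external table, querying it reads the vocabulary file directly, so a quick count confirms that the file is in place and readable (for reference, the bert-base-uncased vocabulary contains 30,522 tokens):
    SELECT COUNT(*) FROM doc_vocabtab;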
  3. Run DBMS_VECTOR_CHAIN.CREATE_VOCABULARY to create a vocabulary (doc_vocab):
    DECLARE
      vocab_params clob := '{"table_name"     : "doc_vocabtab",
                           "column_name"      : "token",
                           "vocabulary_name"  : "doc_vocab",
                           "format"           : "bert",
                           "cased"            : false}';
    
    BEGIN
      dbms_vector_chain.create_vocabulary(json(vocab_params));
    END;
    /
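    To reload the vocabulary later (for example, after changing the vocabulary file), drop it first and re-run the block above. A minimal sketch using the companion DROP_VOCABULARY chunker helper from the same package:
    BEGIN
      dbms_vector_chain.drop_vocabulary('doc_vocab');
    END;
    /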
After loading the token vocabulary, you can use the BY VOCABULARY chunking mode (with the VECTOR_CHUNKS SQL function or the DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS function) to split data by counting the number of tokens, as in the sketch below.
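For example, the following query chunks a short passage into pieces of at most 100 tokens, counted against doc_vocab. This is a minimal sketch: the sample text and the MAX, OVERLAP, SPLIT, LANGUAGE, and NORMALIZE settings are illustrative choices, not required values:
  SELECT C.chunk_offset, C.chunk_length, C.chunk_text
  FROM (SELECT to_clob('Oracle AI Vector Search stores vector embeddings. '
                    || 'Embeddings enable similarity search over documents.') AS text
        FROM dual) D,
       VECTOR_CHUNKS(D.text
                     BY vocabulary doc_vocab
                     MAX 100
                     OVERLAP 0
                     SPLIT BY sentence
                     LANGUAGE american
                     NORMALIZE all) C;
The same mode is available through DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS by passing JSON chunking parameters such as "by": "vocabulary" and "vocabulary": "doc_vocab"; check the package documentation in your release for the exact parameter names.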