Create and Use Custom Language Data

You can create and use your own language-specific conditions (such as common abbreviations) when chunking data.

Here, you use the chunker helper function CREATE_LANG_DATA from the DBMS_VECTOR_CHAIN package to load the data file for Simplified Chinese. This data file contains abbreviation tokens for your chosen language.
  1. Start SQL*Plus and connect to Oracle Database as a local test user:
    1. Log in to SQL*Plus as the sys user, connecting as sysdba, to a pluggable database (PDB) within your multitenant container database (CDB):
      conn sys/password@CDB_PDB as sysdba
      CREATE TABLESPACE tbs1
      DATAFILE 'tbs5.dbf' SIZE 20G AUTOEXTEND ON
      EXTENT MANAGEMENT LOCAL
      SEGMENT SPACE MANAGEMENT AUTO;
    2. Create a local test user (docuser) and grant necessary privileges:
      drop user docuser cascade;
      create user docuser identified by docuser DEFAULT TABLESPACE tbs1 quota unlimited on tbs1;
      grant DB_DEVELOPER_ROLE to docuser;
    3. Create a local directory on your server (VEC_DUMP) to store your language data file. Grant necessary privileges:
      create or replace directory VEC_DUMP as '/my_local_dir/';
      grant read, write on directory VEC_DUMP to docuser;
      
      commit;
    4. Transfer the data file for your required language to the VEC_DUMP directory. For example, dreoszhs.txt for Simplified Chinese.

      To know the data file location for your language, see Supported Languages and Data File Locations.

    5. Connect to Oracle Database as the test user and alter the environment settings for your session:
      conn docuser/password@CDB_PDB;
      
      SET ECHO ON
      SET FEEDBACK 1
      SET NUMWIDTH 10
      SET LINESIZE 80
      SET TRIMSPOOL ON
      SET TAB OFF
      SET PAGESIZE 10000
      SET LONG 10000
  2. Create a relational table (doc_langtab) to store your abbreviation tokens in it:
    CREATE TABLE doc_langtab(token nvarchar2(64))
      ORGANIZATION EXTERNAL
      (default directory VEC_DUMP
       ACCESS PARAMETERS (RECORDS DELIMITED BY NEWLINE CHARACTERSET AL32UTF8)
       location ('dreoszhs.txt'));
  3. Run DBMS_VECTOR_CHAIN.CREATE_LANG_DATA to create language data (doc_lang_data).
    DECLARE
      lang_params clob := '{"table_name"     : "doc_langtab",
                           "column_name"     : "token",
                           "language"        : "simplified chinese",
                           "preference_name" : "doc_lang_data"}';
    BEGIN
      dbms_vector_chain.create_lang_data(json(lang_params));
    END;
    /
After loading the language data, you can now use language-specific chunking by specifying the LANGUAGE chunking parameter with VECTOR_CHUNKS or UTL_TO_CHUNKS.