Create and Use Custom Language Data

Create and use your own language-specific conditions (such as common abbreviations) when chunking data.

Here, you use the chunker helper function CREATE_LANG_DATA from the DBMS_VECTOR_CHAIN package to load the data file for Simplified Chinese. This data file contains abbreviation tokens for your chosen language.

Connect as a local user and prepare your data dump directory.

conn sys/password as sysdba

CREATE TABLESPACE tbs1
DATAFILE 'tbs5.dbf' SIZE 20G AUTOEXTEND ON
EXTENT MANAGEMENT LOCAL
SEGMENT SPACE MANAGEMENT AUTO;

SET ECHO ON
SET FEEDBACK 1
SET NUMWIDTH 10
SET LINESIZE 80
SET TRIMSPOOL ON
SET TAB OFF
SET PAGESIZE 10000
SET LONG 10000

Create a local user (docuser) and grant necessary privileges:

drop user docuser cascade;

create user docuser identified by docuser DEFAULT TABLESPACE tbs1 quota unlimited on tbs1;

grant DB_DEVELOPER_ROLE to docuser;

Create a local directory (VEC_DUMP) to store your language data file. Grant necessary privileges:

create or replace directory VEC_DUMP as '/my_local_dir/';

grant read, write on directory VEC_DUMP to docuser;

commit;

Transfer the data file for your required language to the VEC_DUMP directory. For example, dreoszhs.txt for Simplified Chinese.

To know the data file location for your language, see Supported Languages and Data File Locations.
Connect as the local user (docuser):
```
conn docuser/password
```

Create a relational table (doc_langtab) to store your abbreviation tokens in it:

CREATE TABLE doc_langtab(token nvarchar2(64))
  ORGANIZATION EXTERNAL
  (default directory VEC_DUMP
   ACCESS PARAMETERS (RECORDS DELIMITED BY NEWLINE CHARACTERSET AL32UTF8)
   location ('dreoszhs.txt'));

Create language data (doc_lang_data) by calling DBMS_VECTOR_CHAIN.CREATE_LANG_DATA:

DECLARE
  lang_params clob := '{
                         "table_name"      : "doc_langtab",
                         "column_name"     : "token",
                         "language"        : "simplified chinese",
                         "preference_name" : "doc_lang_data"
                       }';
BEGIN
  dbms_vector_chain.create_lang_data(json(lang_params));
END;
/

After loading the language data, you can now use language-specific chunking by specifying the LANGUAGE chunking parameter with VECTOR_CHUNKS or UTL_TO_CHUNKS.

Related Topics

Parent topic: Configure Chunking Parameters