2.3 Document Classifier

This topic describes how documents are classified.

To index documents correctly, the application must first classify them. The application supports two ways of classifying a document: it can look for a reference keyword in the file name, or it can use an ML-based classifier.

For example, the document type can be indicated by a keyword in the file name, such as statement for bank statements, paystub for salary slips, and personal_doc for personal documents. Alternatively, the user can opt for the provided document classifier, which predicts the document type using an LLM.
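The file-name approach can be illustrated with a minimal sketch. The function name classify_by_filename and the substring matching rule are assumptions made for illustration; the keywords come from the example above, and the category labels reuse the names from the docCategory-config.json example in this topic. The application's actual matching logic may differ.

```python
from pathlib import Path
from typing import Optional

# Keywords are taken from the example above. The labels on the right are
# illustrative and may not match the values the application uses internally.
FILENAME_KEYWORDS = {
    "statement": "BANK_STATEMENT",
    "paystub": "SALARY_SLIP",
    "personal_doc": "PERSONAL_DOC",
}


def classify_by_filename(file_name: str) -> Optional[str]:
    """Return the type whose keyword appears in the file name, or None."""
    # Assumption: matching is a case-insensitive substring check on the
    # file name without its extension.
    stem = Path(file_name).stem.lower()
    for keyword, doc_type in FILENAME_KEYWORDS.items():
        if keyword in stem:
            return doc_type
    return None
```

A file whose name contains none of the keywords returns None, which a caller could treat as a signal to fall back to the document classifier.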

To use the document classifier, specify CLASSIFIER_PARENT_DIR in the application-config.json file. This path is the location where all the training data for the classifier model is stored. To train the model, send a request to the endpoint /docGpt/docClassify/train.

The body of the request is as follows:

{
  "trainDir": "YOUR-FOLDER-NAME-WITH-TRAINING-DATA",
  "llmAPIKey": "YOUR-LLM-API-KEY",
  "llm": "YOUR-LLM-NAME"
}
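As a rough sketch, the training request could be assembled in Python as shown here. The helper name build_train_request, the base URL, the POST method, and the JSON content type are all assumptions; this topic specifies only the endpoint path and the body fields.

```python
import json
import urllib.request

TRAIN_PATH = "/docGpt/docClassify/train"  # endpoint path from this topic


def build_train_request(base_url: str, train_dir: str,
                        llm_api_key: str, llm: str) -> urllib.request.Request:
    """Build (but do not send) the training request."""
    body = {
        "trainDir": train_dir,
        "llmAPIKey": llm_api_key,
        "llm": llm,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + TRAIN_PATH,
        data=json.dumps(body).encode("utf-8"),
        # Assumption: the service accepts a JSON body sent with POST.
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends the request. The host and port depend on the deployment and are not given here.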

This request causes the training data in the trainDir folder, under the CLASSIFIER_PARENT_DIR path, to be uploaded. Inside the trainDir folder, organize the data by country name and then by document type. The recommended folder structure is:

CLASSIFIER_PARENT_DIR > trainDir > country name > document type

Below is an example showing the contents of the trainDir folder.
[Illustration: salary-slips.png]
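Before training, it can help to check that the folder follows the recommended structure. The sketch below walks the two folder levels and reports what it finds. The function name list_training_layout is hypothetical, and the check is not part of the application.

```python
from pathlib import Path
from typing import Dict, List


def list_training_layout(parent_dir: str, train_dir: str) -> Dict[str, List[str]]:
    """Map each country folder to the document-type folders inside it."""
    root = Path(parent_dir) / train_dir
    if not root.is_dir():
        raise FileNotFoundError(f"training folder not found: {root}")
    layout: Dict[str, List[str]] = {}
    # First level: country names. Second level: document types.
    # Plain files at either level are ignored.
    for country in sorted(p for p in root.iterdir() if p.is_dir()):
        layout[country.name] = sorted(
            d.name for d in country.iterdir() if d.is_dir()
        )
    return layout
```

A country that maps to an empty list has no document-type folders and probably needs attention before the training request is sent.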

Once the model is trained, a model ID is returned. Set this model ID as the value of CLASSIFIER_MODEL_ID in the application-config.json file so that it can be used for inference. The model is then used internally by other APIs to classify a document and to perform Q&A on it.
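The configuration step can be scripted, for instance as in the following sketch. It assumes that application-config.json is a JSON object and that CLASSIFIER_MODEL_ID is a top-level key, neither of which is confirmed by this topic. The function name set_classifier_model_id is hypothetical.

```python
import json
from pathlib import Path


def set_classifier_model_id(config_path: str, model_id: str) -> None:
    """Write the trained model ID into the configuration file."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: CLASSIFIER_MODEL_ID sits at the top level of the file.
    config["CLASSIFIER_MODEL_ID"] = model_id
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
```

All other keys in the file are preserved, although the rewrite normalizes the indentation to two spaces.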

Note:

This feature is currently supported for Cohere only.
If the document classifier is not used, the system classifies documents for further training and querying by using the metadata associated with the uploaded files that come from the DMS. The documentCatID is used for this purpose. A separate configuration file, docCategory-config.json, specifies which documentCatIDs belong to which categories. For example:
{
  "PERSONAL_DOC": ["passport", "dl", "pan", "voter", "aadhaar", "birth"],
  "SALARY_SLIP": ["paystub", "salary"],
  "BANK_STATEMENT": ["bank", "statement"]
}
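The mapping above lists the IDs for each category, whereas classification needs the reverse lookup, from a documentCatID to its category. A minimal sketch of that lookup follows. The function names are hypothetical, and exact case-insensitive matching is an assumption, because this topic does not say how the application compares documentCatID values.

```python
from typing import Dict, List, Optional

# Same content as the docCategory-config.json example above.
DOC_CATEGORY_CONFIG: Dict[str, List[str]] = {
    "PERSONAL_DOC": ["passport", "dl", "pan", "voter", "aadhaar", "birth"],
    "SALARY_SLIP": ["paystub", "salary"],
    "BANK_STATEMENT": ["bank", "statement"],
}


def build_category_index(config: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert category -> IDs into a documentCatID -> category lookup."""
    return {
        cat_id.lower(): category
        for category, cat_ids in config.items()
        for cat_id in cat_ids
    }


def category_for(document_cat_id: str, index: Dict[str, str]) -> Optional[str]:
    """Return the category for a documentCatID, or None if it is unmapped."""
    # Assumption: IDs are compared as whole strings, ignoring case.
    return index.get(document_cat_id.lower())
```

Building the index once and reusing it keeps each lookup to a single dictionary access. An unmapped documentCatID returns None, so the caller decides how to handle it.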