2.3 Document Classifier
This topic describes how the application classifies documents.
Documents must be classified before they can be indexed correctly. The application supports two ways of classifying a document: it can look for a keyword in the file name, or it can use an ML-based classifier.
For example, the document type can be indicated by a keyword in the file name, such as statement for bank statements, paystub for salary slips, and personal_doc for personal documents. Alternatively, the user can opt for the document classifier provided with the application, which predicts the document type using an LLM.
To use the document classifier, specify CLASSIFIER_PARENT_DIR in the application-config.json file. This path is the location where all training data for the classifier model is stored. To train the model, call the endpoint /docGpt/docClassify/train.
The request body is as follows:
{
"trainDir": "YOUR-FOLDER-NAME-WITH-TRAINING-DATA",
"llmAPIKey": "YOUR-LLM-API-KEY",
"llm": "YOUR-LLM-NAME"
}
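As an illustration of how such a request might be assembled, here is a minimal Python sketch using only the standard library. The endpoint path and the three body fields come from the text above. The POST method, the JSON content type, the helper name build_train_request, and the base URL http://localhost:8080 are assumptions, because this topic does not state them; adjust them to match your deployment.

```python
import json
import urllib.request

# Endpoint path as given in this topic.
TRAIN_ENDPOINT = "/docGpt/docClassify/train"


def build_train_request(base_url, train_dir, llm_api_key, llm):
    """Build (but do not send) a training request for the classifier.

    Assumption: the endpoint accepts a JSON body via HTTP POST.
    """
    body = {
        "trainDir": train_dir,
        "llmAPIKey": llm_api_key,
        "llm": llm,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + TRAIN_ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and placeholder values; replace them with your own.
request = build_train_request(
    "http://localhost:8080",
    "YOUR-FOLDER-NAME-WITH-TRAINING-DATA",
    "YOUR-LLM-API-KEY",
    "YOUR-LLM-NAME",
)
# To send the request, pass it to urllib.request.urlopen(request).
```

Building the request separately from sending it makes it easy to inspect the URL and body before any API key leaves the machine.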
This request uploads the training data to the trainDir folder under the CLASSIFIER_PARENT_DIR path. Inside the trainDir folder, organize the data by country name and then by document type. Use the following recommended folder structure:
CLASSIFIER_PARENT_DIR > trainDir > country name > document type
Below is an example showing the contents of the trainDir folder.
[Illustration: salary-slips.png]
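Before starting a training run, it can help to check that the training data follows the recommended layout. The Python sketch below walks a trainDir folder and lists every country name and document type pair it finds. The function name list_training_classes is hypothetical, and the sketch assumes that each country and each document type is a directory; it is not part of the application.

```python
from pathlib import Path


def list_training_classes(classifier_parent_dir, train_dir):
    """Return sorted (country, document_type) pairs found under trainDir.

    Assumes the layout: CLASSIFIER_PARENT_DIR / trainDir / country / document type.
    """
    root = Path(classifier_parent_dir) / train_dir
    pairs = []
    for country in root.iterdir():
        if not country.is_dir():
            continue  # ignore stray files at the country level
        for doc_type in country.iterdir():
            if doc_type.is_dir():
                pairs.append((country.name, doc_type.name))
    return sorted(pairs)
```

An empty result suggests that the folder is missing a level, for example document type folders placed directly under trainDir.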
Note:
This feature is currently supported by Cohere only.
{
"PERSONAL_DOC": ["passport", "dl", "pan", "voter", "aadhaar", "birth"],
"SALARY_SLIP": ["paystub", "salary"],
"BANK_STATEMENT": ["bank", "statement"]
}
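The map above appears to associate each document type with the file-name keywords that indicate it. A minimal Python sketch of file-name-based classification using that map is shown here. The function name classify_by_file_name and the case-insensitive substring matching rule are assumptions made for illustration; the application's actual matching logic is not described in this topic.

```python
from pathlib import Path

# Keyword map copied from this topic.
KEYWORDS = {
    "PERSONAL_DOC": ["passport", "dl", "pan", "voter", "aadhaar", "birth"],
    "SALARY_SLIP": ["paystub", "salary"],
    "BANK_STATEMENT": ["bank", "statement"],
}


def classify_by_file_name(file_name, keywords=KEYWORDS):
    """Return the first document type whose keyword occurs in the file name.

    Assumption: matching is a case-insensitive substring test on the name
    without its extension. Returns None when no keyword matches.
    """
    stem = Path(file_name).stem.lower()
    for doc_type, words in keywords.items():
        if any(word in stem for word in words):
            return doc_type
    return None
```

Under this matching rule, short keywords such as dl and pan also match inside longer words (for example, company contains pan), so a stricter rule such as matching whole tokens may be preferable in practice.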