Label Studio
Use Label Studio for labeling datasets.
Oracle’s Data Labeling service is being deprecated. You're required to migrate any labeled datasets to Label Studio, an open source and marketplace-supported labeling tool. This section describes how to convert Data Labeling snapshot exports to the Label Studio import and Label Studio JSON export formats. You can use these formats for further annotation in Label Studio or for direct model training.
To train custom models, you need the following two files:
- Manifest File
  - This file contains metadata about the annotation files and typically has a .jsonl extension.
- Record File
  - This is the dataset exported from Label Studio after annotation. The file has a .json extension.
  - For Custom KV, the record file is in JSON-MIN format.
  - For Custom DC, the record file is in standard JSON format.
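As a rough illustration of how the two file types differ on disk, the sketch below reads a JSON Lines manifest (one JSON object per line) and a Label Studio record file (a single JSON document). The function names are hypothetical, and no field names are assumed because this section doesn't show the contents of either file.

```python
import json
from pathlib import Path


def load_manifest(path):
    """Read a .jsonl file: one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                records.append(json.loads(line))
    return records


def load_records(path):
    """Read a .json record file, which holds a single JSON document."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)
```

Both loaders return plain Python objects, so the same code can be used for the JSON-MIN and the standard JSON record files; only the structure of the parsed result differs.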
1. Setting Up the Label Studio Environment
1.1 Start Label Studio
2. Labeling Workflow for Custom Key Value Extraction
Label Studio doesn't natively support PDF annotation. For more information, see Fundamental Tools for PDF Labeling in the Label Studio documentation.
A workaround is available through Paginated Multi-Image Labeling.
Follow these steps to label PDFs:
2.1 Generate task list
If the training documents are in PDF format, you’ll need to convert them to images first. The pdf_to_images function performs this conversion and saves the images in an output_images folder (Images_input_root), which is created at the root of the provided input directory. For each PDF, a separate folder (named after the PDF file) is created inside the output_images directory to store the corresponding images.
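The folder layout described above can be sketched as follows. This isn't the pdf_to_images function from the utility scripts and it doesn't render any pages; it only creates the directory structure. The helper names are hypothetical, and using the PDF file name without its extension as the folder name is an assumption.

```python
from pathlib import Path


def plan_image_folders(input_dir):
    """Map each PDF in input_dir to the folder that would hold its page images."""
    input_dir = Path(input_dir)
    output_root = input_dir / "output_images"  # created at the root of the input directory
    return {
        pdf.name: output_root / pdf.stem  # assumption: folder name is the PDF name without ".pdf"
        for pdf in sorted(input_dir.glob("*.pdf"))
    }


def create_image_folders(input_dir):
    """Create output_images/<pdf name>/ for every PDF and return the mapping."""
    plan = plan_image_folders(input_dir)
    for folder in plan.values():
        folder.mkdir(parents=True, exist_ok=True)
    return plan
```

For an input directory that contains invoice.pdf, this produces output_images/invoice/ next to the PDF, ready to receive the page images.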
- For Label Studio annotation, you need to generate a task list. Each task corresponds to the annotation of a single document.
- From a command line, run the file generate_tasks.kv.pv contained in the utility scripts downloaded in step 6 of the previous task.
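To make the idea of a task list concrete, here is a minimal sketch that builds one Label Studio task per document folder, with the page images listed in order. It isn't the utility script referenced above. The function name, the "pages" data key, and the url_prefix parameter are assumptions; the data key would need to match the variable used in the project's labeling configuration.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def build_tasks(images_root, url_prefix=""):
    """Return a list of Label Studio tasks, one per document folder."""
    tasks = []
    for doc_dir in sorted(Path(images_root).iterdir()):
        if not doc_dir.is_dir():
            continue
        # Lexicographic sort: assumes page numbers in file names are zero-padded.
        pages = [
            f"{url_prefix}{doc_dir.name}/{image.name}"
            for image in sorted(doc_dir.iterdir())
            if image.suffix.lower() in IMAGE_SUFFIXES
        ]
        if pages:
            # "pages" is a hypothetical key holding every page of one document.
            tasks.append({"data": {"pages": pages}})
    return tasks
```

Writing the result with json.dump gives a JSON array of tasks that Label Studio can import, where each task carries every page of a single document.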
2.2 Set Up OCR Integration for Preannotation
To streamline the annotation workflow and minimize manual effort, you can enable interactive preannotation in Label Studio. In this setup, OCI OCR is integrated as the ML backend and automatically generates bounding boxes on images for key-value annotation. Clone the following repository and install the required dependencies:
For more information, see Write your own ML backend in the Label Studio documentation.
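As a sketch of what a preannotation backend has to return, the code below converts pixel bounding boxes into Label Studio rectangle regions, whose coordinates are expressed as percentages of the image size. It isn't the code from the repository mentioned above. The function names, the (left, top, width, height) pixel box format, and the "bbox" and "image" tag names are all assumptions; the actual OCR response format isn't shown in this section.

```python
def to_rectangle_result(box, image_width, image_height,
                        from_name="bbox", to_name="image"):
    """Convert one pixel box (left, top, width, height) to a Label Studio region."""
    left, top, width, height = box
    return {
        "from_name": from_name,  # hypothetical name of the Rectangle tag
        "to_name": to_name,      # hypothetical name of the Image tag
        "type": "rectangle",
        "original_width": image_width,
        "original_height": image_height,
        "value": {
            # Label Studio stores region coordinates as percentages (0-100).
            "x": 100.0 * left / image_width,
            "y": 100.0 * top / image_height,
            "width": 100.0 * width / image_width,
            "height": 100.0 * height / image_height,
            "rotation": 0,
        },
    }


def to_prediction(boxes, image_width, image_height):
    """Wrap all regions for one image in a single prediction object."""
    return {
        "result": [
            to_rectangle_result(box, image_width, image_height) for box in boxes
        ]
    }
```

Keeping the conversion separate from the OCR call makes it easy to check the percentage math without access to the OCR service.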
2.3 Project Creation and Configuration
2.4 Dataset Annotation in Label Studio
3. Labeling Workflow for Custom Document Classification
Label Studio doesn't natively support PDF annotation. For more information, see Fundamental Tools for PDF Labeling in the Label Studio documentation.
A workaround is available through Paginated Multi-Image Labeling.
Follow these steps to label PDFs:
3.1 Generate task list
If the training documents are in PDF format, you’ll need to convert them to images first. The pdf_to_images function performs this conversion and saves the images in an output_images folder (Images_input_root), which is created at the root of the provided input directory. For each PDF, a separate folder (named after the PDF file) is created inside the output_images directory to store the corresponding images.
- For Label Studio annotation, you need to generate a task list. Each task corresponds to the annotation of a single document.
- From a command line, run the file generate_tasks_dc.pv contained in the utility scripts downloaded in step 6 of the previous task.