Label Studio

Use Label Studio for labeling datasets.

Oracle’s Data Labeling service is being deprecated, so you must migrate any labeled datasets to Label Studio, an open source, marketplace-supported labeling tool. This section explains how to convert Data Labeling snapshot exports to the Label Studio import and Label Studio JSON export formats. You can use these formats for further annotation in Label Studio or for direct model training.

To train custom models, you need the following two files:

Manifest File
This file contains metadata about the annotation files and typically has a .jsonl extension.
Record File
This is the dataset exported from Label Studio after annotation. The file has a .json extension.

  • For Custom KV, the record file is in JSON-MIN format.

  • For Custom DC, the record file is in standard JSON format.

1. Setting Up the Label Studio Environment

You can use a virtual environment or Conda:
  1. Create the requirements.txt file:
    label-studio==1.19.0
    pdf2image
    oci
    label-studio-ml
  2. Create the environment:
    • Using Virtual environment:
      python3 -m venv env_name
    • Using Conda:
      conda create --name env_name
  3. Activate the environment:
    • Using Virtual environment:
      source env_name/bin/activate
    • Using Conda:
      conda activate env_name
  4. Install Label Studio and its dependencies:
    pip install -r requirements.txt
  5. Enable local file serving.
    Because the files are stored locally, you need to export the following environment variables to enable Label Studio to access and annotate them:
    export LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true
    export LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=<absolute-path-to-parent-directory-of-folder-where-documents-are-stored>
    For more information, see: Set environment variables in the Label Studio documentation.

1.1 Start Label Studio

  1. Start Label Studio.
    By default, Label Studio runs on port 8080. If that port is already in use or to specify a different port, start Label Studio with the following command:
    label-studio start --port <port_number>
    For more information, see Start Label Studio in the Label Studio documentation.
  2. Create an account.

    When you first start Label Studio, you see the sign up screen.

    1. Create an account with your email address and a password.
    2. Log in to Label Studio.

2. Labeling Workflow for Custom Key Value Extraction

Label Studio doesn't natively support PDF annotation. For more information, see Fundamental Tools for PDF Labeling in the Label Studio documentation.

A workaround exists: Paginated Multi-Image Labeling.

Follow these steps to label PDFs:

  1. Convert the PDFs to images. For more information, see this code tutorial.
  2. Store the images somewhere accessible by Label Studio.
  3. Create a new Label Studio project and upload the data.
    For every task, you have a list of URLs, one for each image (or page) in the document, pointing to where the images are hosted.
  4. Use the new Multi-Page Document Annotation template for configuring the labeling.
  5. Label the data.
  6. Download the utility scripts and extract utility_scripts_kv.zip for key value annotation.

2.1 Generate task list

If the training documents are in PDF format, you need to convert them to images first. The pdf_to_images function performs this conversion and saves the images in an output_images folder (Images_input_root), which is created at the root of the provided input directory. For each PDF, a separate folder (named after the PDF file) is created inside the output_images directory to store the corresponding images.
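A minimal sketch of this conversion, assuming pdf2image (and its poppler dependency) is installed; the helper function output_dir_for is illustrative, while pdf_to_images and the folder layout follow the description above:

```python
from pathlib import Path

def output_dir_for(pdf_path, input_root):
    # Each PDF gets its own folder, named after the file, inside the
    # output_images directory at the root of the input directory.
    return Path(input_root) / "output_images" / Path(pdf_path).stem

def pdf_to_images(input_root, dpi=200):
    # pdf2image wraps the poppler utilities, which must be installed separately.
    from pdf2image import convert_from_path
    for pdf in sorted(Path(input_root).glob("*.pdf")):
        out_dir = output_dir_for(pdf, input_root)
        out_dir.mkdir(parents=True, exist_ok=True)
        for page_no, image in enumerate(convert_from_path(str(pdf), dpi=dpi), start=1):
            image.save(out_dir / f"page_{page_no}.png", "PNG")
```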

  1. For Label Studio annotation, you need to generate a task list. Each task corresponds to the annotation of a single document.
  2. From a command line, run the file generate_tasks_kv.py contained in the utility scripts downloaded in step 6 of the previous task.
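The task list the script produces can be sketched as follows. This is a simplified illustration, not the utility script itself: it assumes each page object carries a "page" URL (matching $pages[{{idx}}].page in the labeling configuration) served through Label Studio's local-files endpoint, with paths relative to the directory set in LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT.

```python
import json
from pathlib import Path

def build_tasks(images_root, document_root):
    # One task per document folder under output_images.
    tasks = []
    for doc_dir in sorted(p for p in Path(images_root).iterdir() if p.is_dir()):
        pages = [
            {"page": "/data/local-files/?d=" + img.relative_to(document_root).as_posix()}
            for img in sorted(doc_dir.glob("*.png"))
        ]
        if pages:
            tasks.append({"data": {"pages": pages}})
    return tasks

# Write the list to tasks.json for import into the project, for example:
# Path("tasks.json").write_text(json.dumps(build_tasks("docs/output_images", "docs"), indent=2))
```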

2.2 Set Up OCR Integration for Preannotation

To streamline the annotation workflow and minimize manual effort, you can enable interactive preannotation in Label Studio. This setup automatically generates bounding boxes using the OCR service: OCI OCR is integrated as the ML backend to generate bounding boxes on images for key-value annotation. To set it up, clone the label-studio-ml-backend repository and install the required dependencies.

For more information, see Write your own ML backend in the Label Studio documentation.

  1. Install label-studio-ml-backend:
    git clone https://github.com/HumanSignal/label-studio-ml-backend.git
    cd label-studio-ml-backend/
    pip install -e .
  2. Configure the parameters to call the OCI Text extraction service:
    CONFIG_PROFILE = "DEFAULT"
    COMPARTMENT_ID = "ocid1.compartment.oc1.xxxxxxxxxxxxxxxxxxxxxxxx"
    SERVICE_ENDPOINT = "https://document-preprod.aiservice.xxxxxxxxxxxxxxxx"
    LANGUAGE="ENG"
  3. Start the Text extraction service:
    1. Download the utility scripts and extract the contents of ociocr.zip into the label-studio-ml-backend/label_studio_ml/examples/ directory.
    2. Run the following command to start the OCR service:
      label-studio-ml start ./ociocr --port <port>
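Once connected, the backend's core job is to translate OCR word boxes into Label Studio rectangle predictions. A minimal sketch of that translation, assuming the OCR response supplies a normalized (0-1) bounding polygon per word; the "vertices" field name is illustrative, not the exact OCI response schema. Label Studio rectangles use percent coordinates with the origin at the image's top-left corner.

```python
def word_to_ls_region(word, page_idx):
    # "vertices" is assumed to hold the word's normalized (0-1) bounding polygon.
    xs = [v["x"] for v in word["vertices"]]
    ys = [v["y"] for v in word["vertices"]]
    return {
        "from_name": f"bbox_{page_idx}",   # matches the Rectangle name in the labeling config
        "to_name": f"page_{page_idx}",     # matches the Image name in the labeling config
        "type": "rectangle",
        "value": {
            "x": min(xs) * 100,
            "y": min(ys) * 100,
            "width": (max(xs) - min(xs)) * 100,
            "height": (max(ys) - min(ys)) * 100,
        },
    }
```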

2.3 Project Creation and Configuration

  1. Create a new project by following the steps in the Label Studio documentation.
  2. Add source data storage.
    1. Under Settings, select Cloud Storage.
    2. Select Local Files.
    3. Enter the local file storage paths to the input_pdf folder and the output_images folder. For more information, see the Label Studio documentation on local storage.
  3. Set up labeling configuration.
    1. Under Settings, select Labeling Interface.
    2. Select Code.
    3. Add the following code for the labeling configuration:
      <View>
        <Repeater on="$pages" indexFlag="{{idx}}" mode="pagination">
          <View style="display:flex;align-items:start;gap:8px;flex-direction:row">
            <Image name="page_{{idx}}" value="$pages[{{idx}}].page" inline="true"/>
              <Labels name="labels_{{idx}}" toName="page_{{idx}}" showInline="false">
                <Label value="ignore" background="#FFA39E"/>
                <Label value="Invoice_Number" background="#a59eff"/>
                <Label value="Invoice_Date" background="#0dd377"/>
                <Label value="Total" background="#ffdf6b"/>
              </Labels>
          </View>
           
          <Rectangle name="bbox_{{idx}}" smart="true" toName="page_{{idx}}" strokeWidth="3"/>
          <TextArea name="transcription_{{idx}}" toName="page_{{idx}}" editable="true" perRegion="true" required="true" maxSubmissions="1" rows="5" placeholder="Recognized Text" displayMode="region-list"/>
        
        </Repeater>
      </View>
    4. Save it.
    5. Select Visual to add or remove labels as necessary.
  4. Enable interactive preannotation (use the Connect Model option to call OCI OCR).
    1. Under Settings, select Model.
    2. Select Add Model.
    3. Add or update the following:
      • Enable the Connect Model in Label Studio's project settings.

      • Configure Backend URL to point to the OCI OCR endpoint.

      • Ensure the required authentication keys are in the .env file to securely connect to OCI OCR.

      • After connecting, Label Studio automatically sends each uploaded document to the OCI OCR service and displays the predicted bounding boxes as preannotations.

      • You can review, change, or accept the preannotated results interactively.

2.4 Dataset Annotation in Label Studio

  1. Create a new dataset annotation.
    1. Generate a task JSON for annotation as mentioned in section 2.1 Generate Task List.
    2. Import them for annotation.
  2. Extend the existing Data Labeling-annotated data in Label Studio.
    1. Migrate Data Labeling-annotated datasets into Label Studio-compatible format.
    2. After conversion, you can import these datasets into Label Studio to extend, refine, or complete the annotations as needed. Use the DLS2LS_conversion_kv.py script from the utility scripts.
  3. Import the tasks for annotations.
    1. Import the generated tasks.json file as created in step 1a for fresh annotation.
    2. For existing Data Labeling annotations, convert them into Label Studio format.
    3. Import them into Label Studio to extend, refine, or complete the annotations as needed.
  4. Begin labeling, following the steps in the Label Studio documentation.
  5. Export the annotations in JSON-MIN format.

    The record files are exported directly from Label Studio after completing the annotation process.

  6. Generate the manifest file, using the script, Generate_manifest.py, from the utility scripts.
  7. Upload the manifest file, record file, and image or PDF documents to the bucket and path defined in the manifest.
    Note

    The local document_root directory maps to <bucket_name/prefix> on the Cloud. Maintain the same folder structure as in the local storage. The training pipeline expects this structure and relies on it to find the files correctly.
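The note's path mapping can be sketched as follows; the function name and prefix handling are illustrative, and the commented loop hints at how the oci SDK's ObjectStorageClient.put_object could perform the actual upload (namespace, bucket_name, and prefix are placeholders):

```python
from pathlib import Path

def object_name_for(local_path, document_root, prefix=""):
    # Mirror the local layout under <bucket_name>/<prefix>: the path
    # relative to document_root becomes the object name.
    rel = Path(local_path).relative_to(document_root).as_posix()
    return f"{prefix.rstrip('/')}/{rel}" if prefix else rel

# Upload sketch (requires a valid OCI config file):
# import oci
# client = oci.object_storage.ObjectStorageClient(oci.config.from_file())
# for f in Path(document_root).rglob("*"):
#     if f.is_file():
#         client.put_object(namespace, bucket_name,
#                           object_name_for(f, document_root, prefix),
#                           f.read_bytes())
```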

3. Labeling Workflow for Custom Document Classification

Label Studio doesn't natively support PDF annotation. For more information, see Fundamental Tools for PDF Labeling in the Label Studio documentation.

A workaround exists: Paginated Multi-Image Labeling.

Follow these steps to label PDFs:

  1. Convert the PDFs to images. For more information, see this code tutorial.
  2. Store the images somewhere accessible by Label Studio.
  3. Create a new Label Studio project and upload the data.
    For every task, you have a list of URLs, one for each image (or page) in the document, pointing to where the images are hosted.
  4. Use the new Multi-Page Document Annotation template for configuring the labeling.
  5. Label the data.
  6. Download the utility scripts and extract utility_scripts_dc.zip for document classification annotation.

3.1 Generate task list

If the training documents are in PDF format, you need to convert them to images first. The pdf_to_images function performs this conversion and saves the images in an output_images folder (Images_input_root), which is created at the root of the provided input directory. For each PDF, a separate folder (named after the PDF file) is created inside the output_images directory to store the corresponding images.

  1. For Label Studio annotation, you need to generate a task list. Each task corresponds to the annotation of a single document.
  2. From a command line, run the file generate_tasks_dc.py contained in the utility scripts downloaded in step 6 of the previous task.

3.2 Project Creation and Configuration

  1. Create a new project by following the steps in the Label Studio documentation.
  2. Add source data storage.
    1. Under Settings, select Cloud Storage.
    2. Select Local Files.
    3. Enter the local file storage paths to the input_pdf folder and the output_images folder. For more information, see the Label Studio documentation on local storage.
  3. Set up labeling configuration.
    1. Under Settings, select Labeling Interface.
    2. Select Code.
    3. Add the following code for the labeling configuration:
      <View style="display: flex; flex-direction: row;">
        <Image valueList="$pages" name="pdf"/>
        <Choices name="choices" toName="pdf" choice="single" perItem="true" layout="vertical" style="margin-left: 20px;">
          <Choice value="Label1"/>
          <Choice value="Label2"/>
        </Choices>
      </View>

      For single labels, set choice="single". For multilabel classification, set choice="multiple".

    4. Save it.
    5. Select Visual to add or remove labels as necessary.
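Note that, unlike the key-value template, Image with valueList="$pages" expects pages to be a flat list of image URLs rather than objects. A minimal sketch of one classification task under that assumption (paths are illustrative):

```python
import json

# One task per document; each entry in "pages" is a URL served through
# Label Studio's local-files endpoint (paths here are illustrative).
task = {
    "data": {
        "pages": [
            "/data/local-files/?d=output_images/doc1/page_1.png",
            "/data/local-files/?d=output_images/doc1/page_2.png",
        ]
    }
}
print(json.dumps(task, indent=2))
```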

3.3 Dataset Annotation for Document Classification in Label Studio

  1. Create a new dataset annotation.
    1. Generate a task JSON for annotation as mentioned in section 3.1 Generate Task List.
    2. Import them for annotation.
  2. Extend the existing Data Labeling-annotated data in Label Studio.
    1. Migrate Data Labeling-annotated datasets into Label Studio-compatible format.
    2. After conversion, these datasets can be imported into Label Studio to extend, refine, or complete the annotations as needed. Use the script, DLS2LS_conversion_dc.py, from the utility scripts.
  3. Import the tasks for annotations.
    1. Import the generated tasks.json file as created in step 1a for fresh annotation.
    2. For existing Data Labeling annotations, convert them into Label Studio format.
    3. Import them into Label Studio to extend, refine, or complete the annotations as needed.
  4. Assign a label to each page.
  5. Export the annotations in JSON format.

    The record files are exported directly from Label Studio after completing the annotation process.

  6. Generate the manifest file, using the script, Generate_manifest.py, from the utility scripts.
  7. Upload the manifest file, record file, and image or PDF documents to the bucket and path defined in the manifest.
    Note

    The local document_root directory maps to <bucket_name/prefix> on the Cloud. Maintain the same folder structure as in the local storage. The training pipeline expects this structure and relies on it to find the files correctly.