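Label Studio Integration

Oracle's Data Labeling service is deprecated. We recommend migrating your labeled datasets to Label Studio, an open source labeling tool that is also supported in the marketplace.

Because Data Labeling is deprecated, follow these steps to convert Data Labeling snapshot exports to the Label Studio import format and the Label Studio raw JSON export format. Files in these formats can be annotated further in Label Studio or used directly for model training.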

Labeling Workflow for Custom NER or Custom TXTC

Annotate a new dataset in Label Studio.
  1. Upload the dataset (either text files or a pre-annotated dataset converted from Data Labeling).
  2. (Optional) Annotate entity spans as needed.
  3. Export the annotated data.
    • The dataset is exported from Label Studio in raw JSON format.
    • The exported data can be uploaded to Object Storage for further processing.

Use the data in the model training workflow as part of the Language pipeline.
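
Before handing an export to training, it can help to check which labels it actually contains. The following is a minimal sketch, assuming the export is a JSON array of tasks that use the `annotations`, `result`, `value`, and `labels` keys shown in the CNER examples on this page. The function name `count_labels` and the one-task sample are hypothetical, and text classification exports would carry `choices` instead of `labels`.

```python
import json
from collections import Counter


def count_labels(tasks):
    """Count span labels across all non-cancelled annotations of each task."""
    counts = Counter()
    for task in tasks:
        for annotation in task.get("annotations", []):
            if annotation.get("was_cancelled"):
                continue  # cancelled annotations are ignored in this sketch
            for result in annotation.get("result", []):
                counts.update(result.get("value", {}).get("labels", []))
    return counts


# Hypothetical one-task export; a real file would be read with json.load().
sample_export = json.loads("""
[{"id": 1,
  "data": {"text": "Since then , Gigi has released six albums ."},
  "annotations": [{"was_cancelled": false,
                   "result": [{"value": {"start": 13, "end": 17,
                                         "text": "Gigi", "labels": ["ORG"]},
                               "from_name": "label", "to_name": "text",
                               "type": "labels"}]}]}]
""")

label_counts = count_labels(sample_export)
print(dict(label_counts))
```

A label that is missing from the counts, or that appears only a handful of times, is usually worth reviewing in Label Studio before training.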

Migrating Existing Data Labeling Annotated Data to Label Studio

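If you have already annotated data in Data Labeling, a migration script is provided to convert the existing dataset to a Label Studio-compatible format.
  1. Export the dataset from Data Labeling in JSONL format to an Object Storage bucket.
  2. Run the conversion script.
    The Python script processes the dataset and converts it to:
    • Label Studio import format, which can be used directly in Label Studio for further annotation.
    • Label Studio export format, which is structured for direct use in training workflows without further changes.
  3. Upload the converted import format file to Label Studio to update the annotations.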
  4. Export the updated dataset in raw JSON format to Object Storage.
Use the data for training in Language.
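
As an illustration of what step 1 produces and what the conversion script consumes, the sketch below parses JSONL lines into a header and a list of records. It assumes the first non-blank line is the dataset header (the line carrying `labelsSet`), as in the portable JSONL example on this page. The helper name `parse_jsonl` and the shortened sample lines are hypothetical.

```python
import json


def parse_jsonl(lines):
    """Parse JSONL text lines into (header, records), skipping blank lines."""
    parsed = [json.loads(line) for line in lines if line.strip()]
    if not parsed:
        raise ValueError("empty JSONL input")
    return parsed[0], parsed[1:]


# Hypothetical two-line export: one header line and one annotated record.
sample_lines = [
    '{"labelsSet": [{"name": "ORG"}], "annotationFormat": "ENTITY_EXTRACTION", '
    '"datasetFormatDetails": {"formatType": "TEXT"}}',
    '{"sourceDetails": {"text": "Since then , Gigi has released six albums ."}, '
    '"annotations": [{"entities": [{"entityType": "TEXTSELECTION", '
    '"labels": [{"label_name": "ORG"}], "textSpan": {"offset": 13, "length": 4}}]}]}',
]

header, records = parse_jsonl(sample_lines)
label_names = [label["name"] for label in header["labelsSet"]]
print(label_names, len(records))
```

For a file on disk, the same function can be given an open file handle, for example `with open(path, encoding="utf-8") as f: header, records = parse_jsonl(f)`.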

Data Formats

CNER

CNER data can be stored in the following formats:
  • Data Labeling format
  • Portable JSONL format (self-contained text)
  • Label Studio raw JSON export format

Data Labeling Format (Portable JSONL)

The main difference between the Data Labeling format and the portable JSONL format is how the text is stored.

• Portable JSONL: The actual text is stored in the same file, under sourceDetails['text'].

• Data Labeling format: Instead of the text, sourceDetails['path'] holds a path that points to the external location where the text is stored.
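
The difference can be handled with a single lookup when reading records. This is a minimal sketch, assuming each record has either `sourceDetails['text']` or `sourceDetails['path']` as described above. The function `resolve_text`, the `read_external` callback, and the in-memory `fake_store` are hypothetical stand-ins for however the external text is actually fetched.

```python
def resolve_text(record, read_external):
    """Return the record's text, fetching it via `read_external` if only a path is stored."""
    details = record["sourceDetails"]
    if "text" in details:
        # Portable JSONL: the text is inline.
        return details["text"]
    # Data Labeling format: the text lives at an external location.
    return read_external(details["path"])


# Hypothetical stand-in for external storage: a dict mapping paths to contents.
fake_store = {"docs/0001.txt": "Since then , Gigi has released six albums ."}

inline = resolve_text({"sourceDetails": {"text": "Hello"}}, fake_store.__getitem__)
external = resolve_text({"sourceDetails": {"path": "docs/0001.txt"}}, fake_store.__getitem__)
print(inline)
print(external)
```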

Examples of the different formats:
CNER portable JSONL format
{"labelsSet": [{"name": "PER"}, {"name": "MISC"}, {"name": "LOC"}, {"name": "ORG"}], "annotationFormat": "ENTITY_EXTRACTION", "datasetFormatDetails": {"formatType": "TEXT"}}
{"sourceDetails": {"text": "His current band , Gigi , was formed in 1994 with Baron ( guitar ) , Thomas ( bass ) , Armand ( vocal ) and Ronald ( drum ) ."}, "annotations": [{"entities": [{"entityType": "TEXTSELECTION", "labels": [{"label_name": "ORG"}], "textSpan": {"offset": 19, "length": 4}}, {"entityType": "TEXTSELECTION", "labels": [{"label_name": "PER"}], "textSpan": {"offset": 50, "length": 5}}, {"entityType": "TEXTSELECTION", "labels": [{"label_name": "PER"}], "textSpan": {"offset": 69, "length": 6}}, {"entityType": "TEXTSELECTION", "labels": [{"label_name": "PER"}], "textSpan": {"offset": 87, "length": 6}}, {"entityType": "TEXTSELECTION", "labels": [{"label_name": "PER"}], "textSpan": {"offset": 108, "length": 6}}]}]}
{"sourceDetails": {"text": "Since then , Gigi has released six albums ."}, "annotations": [{"entities": [{"entityType": "TEXTSELECTION", "labels": [{"label_name": "ORG"}], "textSpan": {"offset": 13, "length": 4}}]}]}
{"sourceDetails": {"text": "He has also released his own solo albums : Nusa Damai ; Gitarku ; Samsara ; and Home , a tribute album to the December 2004 tsunami victims ."}, "annotations": [{"entities": [{"entityType": "TEXTSELECTION", "labels": [{"label_name": "MISC"}], "textSpan": {"offset": 43, "length": 10}}, {"entityType": "TEXTSELECTION", "labels": [{"label_name": "MISC"}], "textSpan": {"offset": 56, "length": 7}}, {"entityType": "TEXTSELECTION", "labels": [{"label_name": "MISC"}], "textSpan": {"offset": 66, "length": 7}}, {"entityType": "TEXTSELECTION", "labels": [{"label_name": "MISC"}], "textSpan": {"offset": 80, "length": 4}}]}]}
{"sourceDetails": {"text": "Since Dewa Budjana started his professional career as a musician , he has mainly used a Parker Fly Delux as his main guitar , occasionally using Klein and Gibson guitars SG series instead ."}, "annotations": [{"entities": [{"entityType": "TEXTSELECTION", "labels": [{"label_name": "PER"}], "textSpan": {"offset": 6, "length": 12}}, {"entityType": "TEXTSELECTION", "labels": [{"label_name": "MISC"}], "textSpan": {"offset": 88, "length": 16}}, {"entityType": "TEXTSELECTION", "labels": [{"label_name": "ORG"}], "textSpan": {"offset": 145, "length": 16}}]}]}
{"sourceDetails": {"text": "Budjana also owns a double neck Klein guitar which was used on his latest album , Home , and pictured on the CD cover ."}, "annotations": [{"entities": [{"entityType": "TEXTSELECTION", "labels": [{"label_name": "PER"}], "textSpan": {"offset": 0, "length": 7}}, {"entityType": "TEXTSELECTION", "labels": [{"label_name": "ORG"}], "textSpan": {"offset": 32, "length": 5}}, {"entityType": "TEXTSELECTION", "labels": [{"label_name": "MISC"}], "textSpan": {"offset": 82, "length": 4}}]}]}
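
To read the records above, note that each entity is located by `offset` and `length` into the record's own text. The sketch below recovers the labeled substrings for the second record of the example. It assumes offsets index into the text the same way Python string indices do, which holds for this ASCII sample but is not stated by the source for other text. The function name `entity_spans` is hypothetical.

```python
def entity_spans(record):
    """Yield (label, surface_text) for each entity in a portable JSONL record."""
    text = record["sourceDetails"]["text"]
    for entity in record["annotations"][0]["entities"]:
        start = entity["textSpan"]["offset"]
        end = start + entity["textSpan"]["length"]
        yield entity["labels"][0]["label_name"], text[start:end]


record = {
    "sourceDetails": {"text": "Since then , Gigi has released six albums ."},
    "annotations": [{"entities": [{
        "entityType": "TEXTSELECTION",
        "labels": [{"label_name": "ORG"}],
        "textSpan": {"offset": 13, "length": 4},
    }]}],
}

spans = list(entity_spans(record))
print(spans)  # [('ORG', 'Gigi')]
```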
Label Studio import format
[{'data': {'text': 'His current band , Gigi , was formed in 1994 with Baron ( guitar ) , Thomas ( bass ) , Armand ( vocal ) and Ronald ( drum ) .'},
  'predictions': [{'result': [{'value': {'start': 19,
       'end': 23,
       'text': 'Gigi',
       'labels': ['ORG']},
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels'},
     {'value': {'start': 50, 'end': 55, 'text': 'Baron', 'labels': ['PER']},
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels'},
     {'value': {'start': 69, 'end': 75, 'text': 'Thomas', 'labels': ['PER']},
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels'},
     {'value': {'start': 87, 'end': 93, 'text': 'Armand', 'labels': ['PER']},
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels'},
     {'value': {'start': 108, 'end': 114, 'text': 'Ronald', 'labels': ['PER']},
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels'}]}]},
 {'data': {'text': 'Since then , Gigi has released six albums .'},
  'predictions': [{'result': [{'value': {'start': 13,
       'end': 17,
       'text': 'Gigi',
       'labels': ['ORG']},
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels'}]}]},
 {'data': {'text': 'He has also released his own solo albums : Nusa Damai ; Gitarku ; Samsara ; and Home , a tribute album to the December 2004 tsunami victims .'},
  'predictions': [{'result': [{'value': {'start': 43,
       'end': 53,
       'text': 'Nusa Damai',
       'labels': ['MISC']},
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels'},
     {'value': {'start': 56, 'end': 63, 'text': 'Gitarku', 'labels': ['MISC']},
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels'},
     {'value': {'start': 66, 'end': 73, 'text': 'Samsara', 'labels': ['MISC']},
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels'},
     {'value': {'start': 80, 'end': 84, 'text': 'Home', 'labels': ['MISC']},
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels'}]}]},
 {'data': {'text': 'Since Dewa Budjana started his professional career as a musician , he has mainly used a Parker Fly Delux as his main guitar , occasionally using Klein and Gibson guitars SG series instead .'},
  'predictions': [{'result': [{'value': {'start': 6,
       'end': 18,
       'text': 'Dewa Budjana',
       'labels': ['PER']},
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels'},
     {'value': {'start': 88,
       'end': 104,
       'text': 'Parker Fly Delux',
       'labels': ['MISC']},
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels'},
     {'value': {'start': 145,
       'end': 161,
       'text': 'Klein and Gibson',
       'labels': ['ORG']},
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels'}]}]}]
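
The listing above is printed as Python literals (single quotes), which is not valid JSON, so a file meant for upload has to be serialized with a JSON writer. The sketch below checks that each predicted span matches the task text and then serializes the tasks. The helper `check_task` and the one-task sample are hypothetical. The `from_name` and `to_name` values are taken from the example and are expected to match the names used in the project's labeling configuration.

```python
import json


def check_task(task):
    """Return the prediction values whose 'text' disagrees with data text[start:end]."""
    text = task["data"]["text"]
    mismatches = []
    for prediction in task.get("predictions", []):
        for result in prediction["result"]:
            value = result["value"]
            if text[value["start"]:value["end"]] != value["text"]:
                mismatches.append(value)
    return mismatches


tasks = [{
    "data": {"text": "Since then , Gigi has released six albums ."},
    "predictions": [{"result": [{
        "value": {"start": 13, "end": 17, "text": "Gigi", "labels": ["ORG"]},
        "from_name": "label", "to_name": "text", "type": "labels",
    }]}],
}]

problems = [value for task in tasks for value in check_task(task)]
payload = json.dumps(tasks, indent=2)  # double-quoted JSON text, ready to write to a .json file
print(len(problems))
```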
Label Studio export format (raw JSON)
[{'id': 141,
  'annotations': [{'id': 42,
    'completed_by': 4,
    'result': [{'value': {'start': 19,
       'end': 23,
       'text': 'Gigi',
       'labels': ['ORG']},
      'id': '467b95b0',
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels',
      'origin': 'prediction'},
     {'value': {'start': 50, 'end': 55, 'text': 'Baron', 'labels': ['PER']},
      'id': '8f12fa43',
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels',
      'origin': 'prediction'},
     {'value': {'start': 69, 'end': 75, 'text': 'Thomas', 'labels': ['PER']},
      'id': '936f63f0',
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels',
      'origin': 'prediction'},
     {'value': {'start': 87, 'end': 93, 'text': 'Armand', 'labels': ['PER']},
      'id': 'f02d2509',
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels',
      'origin': 'prediction'},
     {'value': {'start': 108, 'end': 114, 'text': 'Ronald', 'labels': ['PER']},
      'id': 'cb479df1',
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels',
      'origin': 'prediction'}],
    'was_cancelled': False,
    'ground_truth': False,
    'created_at': '2025-01-19T16:47:10.630009Z',
    'updated_at': '2025-01-19T16:47:10.630009Z',
    'draft_created_at': None,
    'lead_time': 0,
    'prediction': {'id': 95,
     'result': [{'value': {'start': 19,
        'end': 23,
        'text': 'Gigi',
        'labels': ['ORG']},
       'id': '467b95b0',
       'from_name': 'label',
       'to_name': 'text',
       'type': 'labels',
       'origin': 'prediction'},
      {'value': {'start': 50, 'end': 55, 'text': 'Baron', 'labels': ['PER']},
       'id': '8f12fa43',
       'from_name': 'label',
       'to_name': 'text',
       'type': 'labels',
       'origin': 'prediction'},
      {'value': {'start': 69, 'end': 75, 'text': 'Thomas', 'labels': ['PER']},
       'id': '936f63f0',
       'from_name': 'label',
       'to_name': 'text',
       'type': 'labels',
       'origin': 'prediction'},
      {'value': {'start': 87, 'end': 93, 'text': 'Armand', 'labels': ['PER']},
       'id': 'f02d2509',
       'from_name': 'label',
       'to_name': 'text',
       'type': 'labels',
       'origin': 'prediction'},
      {'value': {'start': 108,
        'end': 114,
        'text': 'Ronald',
        'labels': ['PER']},
       'id': 'cb479df1',
       'from_name': 'label',
       'to_name': 'text',
       'type': 'labels',
       'origin': 'prediction'}],
     'model_version': 'undefined',
     'created_ago': '0\xa0minutes',
     'score': None,
     'cluster': None,
     'neighbors': None,
     'mislabeling': 0.0,
     'created_at': '2025-01-19T16:47:10.630009Z',
     'updated_at': '2025-01-19T16:47:10.630009Z',
     'model': None,
     'model_run': None,
     'task': 141,
     'project': 14},
    'result_count': 5,
    'unique_id': '54a60d2b-6960-48c9-baae-bf24b15ce273',
    'import_id': None,
    'last_action': None,
    'task': 141,
    'project': 14,
    'updated_by': 4,
    'parent_prediction': 95,
    'parent_annotation': None,
    'last_created_by': None}],
  'file_upload': 'example_file.json',
  'drafts': [],
  'predictions': [95],
  'data': {'text': 'His current band , Gigi , was formed in 1994 with Baron ( guitar ) , Thomas ( bass ) , Armand ( vocal ) and Ronald ( drum ) .'},
  'meta': {},
  'created_at': '2025-01-19T16:47:10.630009Z',
  'updated_at': '2025-01-19T16:47:10.630009Z',
  'inner_id': 1,
  'total_annotations': 1,
  'cancelled_annotations': 0,
  'total_predictions': 1,
  'comment_count': 0,
  'unresolved_comment_count': 0,
  'last_comment_updated_at': None,
  'project': 14,
  'updated_by': 4,
  'comment_authors': []},
 {'id': 141,
  'annotations': [{'id': 60,
    'completed_by': 4,
    'result': [{'value': {'start': 13,
       'end': 17,
       'text': 'Gigi',
       'labels': ['ORG']},
      'id': 'ce521973',
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels',
      'origin': 'prediction'}],
    'was_cancelled': False,
    'ground_truth': False,
    'created_at': '2025-01-19T16:47:10.630104Z',
    'updated_at': '2025-01-19T16:47:10.630104Z',
    'draft_created_at': None,
    'lead_time': 0,
    'prediction': {'id': 95,
     'result': [{'value': {'start': 13,
        'end': 17,
        'text': 'Gigi',
        'labels': ['ORG']},
       'id': 'ce521973',
       'from_name': 'label',
       'to_name': 'text',
       'type': 'labels',
       'origin': 'prediction'}],
     'model_version': 'undefined',
     'created_ago': '0\xa0minutes',
     'score': None,
     'cluster': None,
     'neighbors': None,
     'mislabeling': 0.0,
     'created_at': '2025-01-19T16:47:10.630104Z',
     'updated_at': '2025-01-19T16:47:10.630104Z',
     'model': None,
     'model_run': None,
     'task': 141,
     'project': 14},
    'result_count': 1,
    'unique_id': '4a4b5631-4b51-4a09-99c4-ad8927074503',
    'import_id': None,
    'last_action': None,
    'task': 141,
    'project': 14,
    'updated_by': 4,
    'parent_prediction': 95,
    'parent_annotation': None,
    'last_created_by': None}],
  'file_upload': 'example_file.json',
  'drafts': [],
  'predictions': [95],
  'data': {'text': 'Since then , Gigi has released six albums .'},
  'meta': {},
  'created_at': '2025-01-19T16:47:10.630104Z',
  'updated_at': '2025-01-19T16:47:10.630104Z',
  'inner_id': 1,
  'total_annotations': 1,
  'cancelled_annotations': 0,
  'total_predictions': 1,
  'comment_count': 0,
  'unresolved_comment_count': 0,
  'last_comment_updated_at': None,
  'project': 14,
  'updated_by': 4,
  'comment_authors': []},
 {'id': 141,
  'annotations': [{'id': 96,
    'completed_by': 4,
    'result': [{'value': {'start': 43,
       'end': 53,
       'text': 'Nusa Damai',
       'labels': ['MISC']},
      'id': 'a013849e',
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels',
      'origin': 'prediction'},
     {'value': {'start': 56, 'end': 63, 'text': 'Gitarku', 'labels': ['MISC']},
      'id': '57423aa5',
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels',
      'origin': 'prediction'},
     {'value': {'start': 66, 'end': 73, 'text': 'Samsara', 'labels': ['MISC']},
      'id': 'eeee84bb',
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels',
      'origin': 'prediction'},
     {'value': {'start': 80, 'end': 84, 'text': 'Home', 'labels': ['MISC']},
      'id': '3df96d57',
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels',
      'origin': 'prediction'}],
    'was_cancelled': False,
    'ground_truth': False,
    'created_at': '2025-01-19T16:47:10.630126Z',
    'updated_at': '2025-01-19T16:47:10.630126Z',
    'draft_created_at': None,
    'lead_time': 0,
    'prediction': {'id': 95,
     'result': [{'value': {'start': 43,
        'end': 53,
        'text': 'Nusa Damai',
        'labels': ['MISC']},
       'id': 'a013849e',
       'from_name': 'label',
       'to_name': 'text',
       'type': 'labels',
       'origin': 'prediction'},
      {'value': {'start': 56,
        'end': 63,
        'text': 'Gitarku',
        'labels': ['MISC']},
       'id': '57423aa5',
       'from_name': 'label',
       'to_name': 'text',
       'type': 'labels',
       'origin': 'prediction'},
      {'value': {'start': 66,
        'end': 73,
        'text': 'Samsara',
        'labels': ['MISC']},
       'id': 'eeee84bb',
       'from_name': 'label',
       'to_name': 'text',
       'type': 'labels',
       'origin': 'prediction'},
      {'value': {'start': 80, 'end': 84, 'text': 'Home', 'labels': ['MISC']},
       'id': '3df96d57',
       'from_name': 'label',
       'to_name': 'text',
       'type': 'labels',
       'origin': 'prediction'}],
     'model_version': 'undefined',
     'created_ago': '0\xa0minutes',
     'score': None,
     'cluster': None,
     'neighbors': None,
     'mislabeling': 0.0,
     'created_at': '2025-01-19T16:47:10.630126Z',
     'updated_at': '2025-01-19T16:47:10.630126Z',
     'model': None,
     'model_run': None,
     'task': 141,
     'project': 14},
    'result_count': 4,
    'unique_id': 'd2b8ec9a-cc62-4881-a42b-833546df7953',
    'import_id': None,
    'last_action': None,
    'task': 141,
    'project': 14,
    'updated_by': 4,
    'parent_prediction': 95,
    'parent_annotation': None,
    'last_created_by': None}],
  'file_upload': 'example_file.json',
  'drafts': [],
  'predictions': [95],
  'data': {'text': 'He has also released his own solo albums : Nusa Damai ; Gitarku ; Samsara ; and Home , a tribute album to the December 2004 tsunami victims .'},
  'meta': {},
  'created_at': '2025-01-19T16:47:10.630126Z',
  'updated_at': '2025-01-19T16:47:10.630126Z',
  'inner_id': 1,
  'total_annotations': 1,
  'cancelled_annotations': 0,
  'total_predictions': 1,
  'comment_count': 0,
  'unresolved_comment_count': 0,
  'last_comment_updated_at': None,
  'project': 14,
  'updated_by': 4,
  'comment_authors': []},
 {'id': 141,
  'annotations': [{'id': 7,
    'completed_by': 4,
    'result': [{'value': {'start': 6,
       'end': 18,
       'text': 'Dewa Budjana',
       'labels': ['PER']},
      'id': '2d675f56',
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels',
      'origin': 'prediction'},
     {'value': {'start': 88,
       'end': 104,
       'text': 'Parker Fly Delux',
       'labels': ['MISC']},
      'id': '13f6536b',
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels',
      'origin': 'prediction'},
     {'value': {'start': 145,
       'end': 161,
       'text': 'Klein and Gibson',
       'labels': ['ORG']},
      'id': '283b9e25',
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels',
      'origin': 'prediction'}],
    'was_cancelled': False,
    'ground_truth': False,
    'created_at': '2025-01-19T16:47:10.630165Z',
    'updated_at': '2025-01-19T16:47:10.630165Z',
    'draft_created_at': None,
    'lead_time': 0,
    'prediction': {'id': 95,
     'result': [{'value': {'start': 6,
        'end': 18,
        'text': 'Dewa Budjana',
        'labels': ['PER']},
       'id': '2d675f56',
       'from_name': 'label',
       'to_name': 'text',
       'type': 'labels',
       'origin': 'prediction'},
      {'value': {'start': 88,
        'end': 104,
        'text': 'Parker Fly Delux',
        'labels': ['MISC']},
       'id': '13f6536b',
       'from_name': 'label',
       'to_name': 'text',
       'type': 'labels',
       'origin': 'prediction'},
      {'value': {'start': 145,
        'end': 161,
        'text': 'Klein and Gibson',
        'labels': ['ORG']},
       'id': '283b9e25',
       'from_name': 'label',
       'to_name': 'text',
       'type': 'labels',
       'origin': 'prediction'}],
     'model_version': 'undefined',
     'created_ago': '0\xa0minutes',
     'score': None,
     'cluster': None,
     'neighbors': None,
     'mislabeling': 0.0,
     'created_at': '2025-01-19T16:47:10.630165Z',
     'updated_at': '2025-01-19T16:47:10.630165Z',
     'model': None,
     'model_run': None,
     'task': 141,
     'project': 14},
    'result_count': 3,
    'unique_id': '21115ec1-4a96-43c4-b1ec-1734cd004160',
    'import_id': None,
    'last_action': None,
    'task': 141,
    'project': 14,
    'updated_by': 4,
    'parent_prediction': 95,
    'parent_annotation': None,
    'last_created_by': None}],
  'file_upload': 'example_file.json',
  'drafts': [],
  'predictions': [95],
  'data': {'text': 'Since Dewa Budjana started his professional career as a musician , he has mainly used a Parker Fly Delux as his main guitar , occasionally using Klein and Gibson guitars SG series instead .'},
  'meta': {},
  'created_at': '2025-01-19T16:47:10.630165Z',
  'updated_at': '2025-01-19T16:47:10.630165Z',
  'inner_id': 1,
  'total_annotations': 1,
  'cancelled_annotations': 0,
  'total_predictions': 1,
  'comment_count': 0,
  'unresolved_comment_count': 0,
  'last_comment_updated_at': None,
  'project': 14,
  'updated_by': 4,
  'comment_authors': []}]
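
Most fields in the export are bookkeeping. For entity data, the text is under `data.text` and the spans are under `annotations[].result[].value`. The same results are repeated under `annotations[].prediction`, so the sketch below reads only `annotations[].result`. It is an illustration under those assumptions, not the training pipeline's own reader. The function name `spans_from_export` and the trimmed sample task are hypothetical.

```python
def spans_from_export(tasks):
    """Return [(text, [(start, end, label), ...]), ...] from a raw JSON export."""
    rows = []
    for task in tasks:
        spans = []
        for annotation in task["annotations"]:
            for result in annotation["result"]:
                if result["type"] != "labels":
                    continue  # skip anything that is not a span label
                value = result["value"]
                for label in value["labels"]:
                    spans.append((value["start"], value["end"], label))
        rows.append((task["data"]["text"], sorted(spans)))
    return rows


# Trimmed version of one exported task; bookkeeping fields are omitted.
sample_tasks = [{
    "id": 141,
    "annotations": [{"result": [
        {"value": {"start": 13, "end": 17, "text": "Gigi", "labels": ["ORG"]},
         "from_name": "label", "to_name": "text", "type": "labels",
         "origin": "prediction"},
    ]}],
    "data": {"text": "Since then , Gigi has released six albums ."},
}]

rows = spans_from_export(sample_tasks)
print(rows)
```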

Custom TXTC (CTXTC) - Text Classification

CTXTC datasets support several formats for flexible annotation and processing:
  • Data Labeling format
  • CSV format
  • Label Studio format
Data Labeling format
CTXTC portable JSONL format
CSV format
CTXTC CSV format - single-label
CTXTC CSV format - multi-label
Label Studio import format
CTXTC Label Studio import format - single-label
CTXTC Label Studio import format - multi-label
[{'data': {'text': "I need to book a hotel in the east that has 4 stars. I can help you with that. What is your price range? That doesn't matter as long as it has free wifi and parking. If you'd like something cheap, I recommend the Allenbell. For something moderately priced, I would recommend the Warkworth House. Could you book the Wartworth for one night, 1 person? What day will you be staying? Friday and Can you book it for me and get a reference number ? Booking was successful. Reference number is : BMUKPTG6.  Can I help you with anything else today? I am looking to book a train that is leaving from Cambridge to Bishops Stortford on Friday. There are a number of trains leaving throughout the day.  What time would you like to travel? I want to get there by 19:45 at the latest. Okay! The latest train you can take leaves at 17:29, and arrives by 18:07. Would you like for me to book that for you? Yes please. I also need the travel time, departure time, and price. Reference number is : UIFV8FAS. The price is 10.1 GBP and the trip will take about 38 minutes. May I be of any other assistance? Yes. Sorry, but suddenly my plans changed. Can you change the Wartworth booking to Monday for 3 people and 4 nights? I have made that change and your reference number is YF86GE4J. Thank you very much, goodbye. You're welcome. Have a nice day!"},
  'predictions': [{'model_version': '1.3',
    'result': [{'id': '0',
      'from_name': 'textClassification',
      'to_name': 'text',
      'type': 'choices',
      'value': {'choices': ['hotel', 'train']}}]}]},
 {'data': {'text': 'Howdy, I need a train heading into cambridge. I would be happy to help you find a train.  Where are you departing from? I am departing from norwich.  I need to leave after 18:45 on Wednesday. I have several options for you. Where is your destination? I will be heading to cabridge. The earliest after 18:45 is the TR8658, leaving Norwich at 19:16. Can I reserve you one or more seats on this train? yeah, i need one ticket. Booking was successful, the total fee is 17.6 GBP payable at the station. Your reference number is AXH1NM1I. Do you need assistance with anything else? I am also looking for a multi sport in the East. It looks like there is The Cherry Hinton Village Centre.  Can I get you more information about it? I would like to get the phone number, please. their phone number is 01223576412. anything else? Oh, and what is their postcode, please? Sure, the postcode is cb19ej. Can I help you find any other information about Cambridge? That is all for now thank you. enjoy your time in Cambridge!'},
  'predictions': [{'model_version': '1.3',
    'result': [{'id': '1',
      'from_name': 'textClassification',
      'to_name': 'text',
      'type': 'choices',
      'value': {'choices': ['attraction', 'train']}}]}]},
 {'data': {'text': "What can you tell me about the Riverside Brasserie? It is a restaurant that serves modern european food near the centre of town. It is moderately priced. The phone number is 01223259988. Perfect. Can you help me with a reservation for 6 people at 14:30 this coming sunday? And please make sure I have a confirmation number to use. Your reservation is set! The table will be reserved for 15 minutes. Your reference number is LZLUDTVI. Is there anything else you need? I'm also looking for a place to stay.  In the south preferably. What price range were you thinking? No particular price range, but I would like it to be a 4 star hotel. There are no hotels that fit your criteria in the South, but there are two Guesthouses. Would you like to book one of those? Sure, that will work. Can you tell me more about them? Aylesbray Lodge Guesthouse and Rosa's Bed and Breakfast, both are rated at 4 stars and both include free parking and internet. Would you like a reservation for one of them? Can I get the postcode for both of them? Aylesbray postcode is cd17sr and Rosa's postcode is cb22ha. Is there anything else I can help you with today? No thanks. That's all the help I need. Take care. Bye. thank you! Enjoy your stay!"},
  'predictions': [{'model_version': '1.3',
    'result': [{'id': '2',
      'from_name': 'textClassification',
      'to_name': 'text',
      'type': 'choices',
      'value': {'choices': ['hotel', 'restaurant']}}]}]},
 {'data': {'text': "I am looking for a specific hotel, its name is express by holiday inn cambridge. I have the Express by Holiday Inn Cambridge located on 15-17 norman way, coldhams business park. Their phone number is 01223866800. Would you like to know anything else? Yes, could you book the hotel room for me for 7 people? Yes, of course. What day would you like to stay? Monday, please. There will be 7 of us and we'd like to stay for 4 days. Here is the booking information:Booking was successful. Reference number is : 5F8G6J1G. Thank you. I would also like to book a train, please. Sure, which stations will you be traveling between? I will be going from cambridge to birmingham new street. What time would you like to leave?  The trains depart every hour. Whenever will get me there by 17:30. I do need to leave on Friday and I will need the travel time please. There are 11 results. Would you prefer the earliest departure time or latest?"},
  'predictions': [{'model_version': '1.3',
    'result': [{'id': '3',
      'from_name': 'textClassification',
      'to_name': 'text',
      'type': 'choices',
      'value': {'choices': ['hotel', 'train']}}]}]}]
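
The tasks above all share one shape: the text under `data`, and a single `choices` result listing the labels. A minimal sketch of building such tasks from (text, labels) pairs follows. The function name `build_classification_tasks` and the shortened sample texts are hypothetical. The `from_name`, `to_name`, and `model_version` values are copied from the example and would need to match the actual project.

```python
def build_classification_tasks(examples, model_version="1.3"):
    """Build Label Studio import tasks from (text, labels) pairs."""
    tasks = []
    for index, (text, labels) in enumerate(examples):
        tasks.append({
            "data": {"text": text},
            "predictions": [{
                "model_version": model_version,
                "result": [{
                    "id": str(index),  # the example uses the row index as a string id
                    "from_name": "textClassification",
                    "to_name": "text",
                    "type": "choices",
                    # One label gives a single-label task, several give multi-label.
                    "value": {"choices": list(labels)},
                }],
            }],
        })
    return tasks


examples = [
    ("I need to book a hotel and then a train.", ["hotel", "train"]),
    ("Howdy, I need a train heading into cambridge.", ["train"]),
]
ctxtc_tasks = build_classification_tasks(examples)
print(ctxtc_tasks[0]["predictions"][0]["result"][0]["value"])
```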
Label Studio export format
CTXTC Label Studio export format - single-label
CTXTC Label Studio export format - multi-label
[{'id': 237,
  'annotations': [{'id': 110,
    'completed_by': 4,
    'result': [{'id': '0',
      'from_name': 'textClassification',
      'to_name': 'text',
      'type': 'choices',
      'value': {'choices': ['hotel', 'train']},
      'origin': 'prediction'}],
    'was_cancelled': False,
    'ground_truth': False,
    'created_at': '2025-01-19T19:15:56.447771Z',
    'updated_at': '2025-01-19T19:15:57.447771Z',
    'draft_created_at': None,
    'lead_time': 1.101,
    'prediction': {'id': 253,
     'result': [{'id': '0',
       'from_name': 'textClassification',
       'to_name': 'text',
       'type': 'choices',
       'value': {'choices': ['hotel', 'train']}}],
     'model_version': '1.3',
     'created_ago': '1\xa0minute',
     'score': None,
     'cluster': None,
     'neighbors': None,
     'mislabeling': 0.0,
     'created_at': '2025-01-19T19:15:56.447771Z',
     'updated_at': '2025-01-19T19:15:57.447771Z',
     'model': None,
     'model_run': None,
     'task': 237,
     'project': 22},
    'result_count': 1,
    'unique_id': '6aa5eb24-4c5b-4e7a-9770-eca304ee1420',
    'import_id': None,
    'last_action': None,
    'task': 237,
    'project': 22,
    'updated_by': 4,
    'parent_prediction': 253,
    'parent_annotation': None,
    'last_created_by': None}],
  'file_upload': '0606dc5a-multiLabel_ethos_train_LabelStudio_Upload.json',
  'drafts': [],
  'predictions': [253],
  'data': {'text': "I need to book a hotel in the east that has 4 stars. I can help you with that. What is your price range? That doesn't matter as long as it has free wifi and parking. If you'd like something cheap, I recommend the Allenbell. For something moderately priced, I would recommend the Warkworth House. Could you book the Wartworth for one night, 1 person? What day will you be staying? Friday and Can you book it for me and get a reference number ? Booking was successful. Reference number is : BMUKPTG6.  Can I help you with anything else today? I am looking to book a train that is leaving from Cambridge to Bishops Stortford on Friday. There are a number of trains leaving throughout the day.  What time would you like to travel? I want to get there by 19:45 at the latest. Okay! The latest train you can take leaves at 17:29, and arrives by 18:07. Would you like for me to book that for you? Yes please. I also need the travel time, departure time, and price. Reference number is : UIFV8FAS. The price is 10.1 GBP and the trip will take about 38 minutes. May I be of any other assistance? Yes. Sorry, but suddenly my plans changed. Can you change the Wartworth booking to Monday for 3 people and 4 nights? I have made that change and your reference number is YF86GE4J. Thank you very much, goodbye. You're welcome. Have a nice day!"},
  'meta': {},
  'created_at': '2025-01-19T19:15:56.447771Z',
  'updated_at': '2025-01-19T19:15:57.447771Z',
  'inner_id': 2,
  'total_annotations': 1,
  'cancelled_annotations': 0,
  'total_predictions': 1,
  'comment_count': 0,
  'unresolved_comment_count': 0,
  'last_comment_updated_at': None,
  'project': 22,
  'updated_by': 4,
  'comment_authors': []},
 {'id': 237,
  'annotations': [{'id': 110,
    'completed_by': 4,
    'result': [{'id': '1',
      'from_name': 'textClassification',
      'to_name': 'text',
      'type': 'choices',
      'value': {'choices': ['attraction', 'train']},
      'origin': 'prediction'}],
    'was_cancelled': False,
    'ground_truth': False,
    'created_at': '2025-01-19T19:15:56.447847Z',
    'updated_at': '2025-01-19T19:15:57.447847Z',
    'draft_created_at': None,
    'lead_time': 1.101,
    'prediction': {'id': 253,
     'result': [{'id': '1',
       'from_name': 'textClassification',
       'to_name': 'text',
       'type': 'choices',
       'value': {'choices': ['attraction', 'train']}}],
     'model_version': '1.3',
     'created_ago': '1\xa0minute',
     'score': None,
     'cluster': None,
     'neighbors': None,
     'mislabeling': 0.0,
     'created_at': '2025-01-19T19:15:56.447847Z',
     'updated_at': '2025-01-19T19:15:57.447847Z',
     'model': None,
     'model_run': None,
     'task': 237,
     'project': 22},
    'result_count': 1,
    'unique_id': 'adfa5ebc-bd45-4c9a-9f7e-1e797329dd85',
    'import_id': None,
    'last_action': None,
    'task': 237,
    'project': 22,
    'updated_by': 4,
    'parent_prediction': 253,
    'parent_annotation': None,
    'last_created_by': None}],
  'file_upload': '0606dc5a-multiLabel_ethos_train_LabelStudio_Upload.json',
  'drafts': [],
  'predictions': [253],
  'data': {'text': 'Howdy, I need a train heading into cambridge. I would be happy to help you find a train.  Where are you departing from? I am departing from norwich.  I need to leave after 18:45 on Wednesday. I have several options for you. Where is your destination? I will be heading to cabridge. The earliest after 18:45 is the TR8658, leaving Norwich at 19:16. Can I reserve you one or more seats on this train? yeah, i need one ticket. Booking was successful, the total fee is 17.6 GBP payable at the station. Your reference number is AXH1NM1I. Do you need assistance with anything else? I am also looking for a multi sport in the East. It looks like there is The Cherry Hinton Village Centre.  Can I get you more information about it? I would like to get the phone number, please. their phone number is 01223576412. anything else? Oh, and what is their postcode, please? Sure, the postcode is cb19ej. Can I help you find any other information about Cambridge? That is all for now thank you. enjoy your time in Cambridge!'},
  'meta': {},
  'created_at': '2025-01-19T19:15:56.447847Z',
  'updated_at': '2025-01-19T19:15:57.447847Z',
  'inner_id': 2,
  'total_annotations': 1,
  'cancelled_annotations': 0,
  'total_predictions': 1,
  'comment_count': 0,
  'unresolved_comment_count': 0,
  'last_comment_updated_at': None,
  'project': 22,
  'updated_by': 4,
  'comment_authors': []},
 {'id': 237,
  'annotations': [{'id': 110,
    'completed_by': 4,
    'result': [{'id': '2',
      'from_name': 'textClassification',
      'to_name': 'text',
      'type': 'choices',
      'value': {'choices': ['hotel', 'restaurant']},
      'origin': 'prediction'}],
    'was_cancelled': False,
    'ground_truth': False,
    'created_at': '2025-01-19T19:15:56.447870Z',
    'updated_at': '2025-01-19T19:15:57.447870Z',
    'draft_created_at': None,
    'lead_time': 1.101,
    'prediction': {'id': 253,
     'result': [{'id': '2',
       'from_name': 'textClassification',
       'to_name': 'text',
       'type': 'choices',
       'value': {'choices': ['hotel', 'restaurant']}}],
     'model_version': '1.3',
     'created_ago': '1\xa0minute',
     'score': None,
     'cluster': None,
     'neighbors': None,
     'mislabeling': 0.0,
     'created_at': '2025-01-19T19:15:56.447870Z',
     'updated_at': '2025-01-19T19:15:57.447870Z',
     'model': None,
     'model_run': None,
     'task': 237,
     'project': 22},
    'result_count': 1,
    'unique_id': '2b0c6f75-82a7-4a0f-a5d5-a5a697ef6798',
    'import_id': None,
    'last_action': None,
    'task': 237,
    'project': 22,
    'updated_by': 4,
    'parent_prediction': 253,
    'parent_annotation': None,
    'last_created_by': None}],
  'file_upload': '0606dc5a-multiLabel_ethos_train_LabelStudio_Upload.json',
  'drafts': [],
  'predictions': [253]}]

Conversion Scripts - CNER

Script: Convert Data Labeling Format to Label Studio Format
A Python script named cner_export_to_LS.py is provided to convert the Data Labeling JSONL format to the Label Studio import and export formats.
import os
import argparse
import json
import uuid
from datetime import datetime
import random
import glob
 
 
def convert_to_label_studio_import(data):
    """ Convert portable JSONL format to Label Studio import format. """
    dd = []
    for i in data[1:]:
        ents = i['annotations'][0]['entities']
        ee = []
        text = i['sourceDetails']['text']
        for e in ents:
            start = e['textSpan']['offset']
            end = e['textSpan']['offset'] + e['textSpan']['length']
            ee.append({
                'value': {
                    'start': start,
                    'end': end,
                    'text': text[start:end],
                    'labels': [e['labels'][0]['label_name']]
                },
                'from_name': 'label',
                'to_name': 'text',
                'type': 'labels'
            })
        dd.append({
            'data': {'text': i['sourceDetails']['text']},
            'predictions': [{'result': ee}]
        })
    return dd
 
 
def convert_to_label_studio_raw_export(input_data):
    """ Convert portable JSONL format to Label Studio raw JSON export format. """
    # Extract text and annotations
    text = input_data['sourceDetails']['text']
    entities = input_data['annotations'][0]['entities']
     
    # Initialize variables
    task_id = 141
    project_id = 14
    now = datetime.utcnow().isoformat() + "Z"  # UTC, so the "Z" suffix is accurate
    unique_id = str(uuid.uuid4())
     
    # Process entities
    results = []
    for entity in entities:
        offset = entity['textSpan']['offset']
        length = entity['textSpan']['length']
        label_name = entity['labels'][0]['label_name']
        entity_text = text[offset:offset + length]
         
        result = {
            'value': {
                'start': offset,
                'end': offset + length,
                'text': entity_text,
                'labels': [label_name]
            },
            'id': str(uuid.uuid4())[:8],
            'from_name': 'label',
            'to_name': 'text',
            'type': 'labels',
            'origin': 'prediction'
        }
        results.append(result)
     
    # Construct the final output
    output = {
        'id': task_id,
        'annotations': [{
            'id': random.randint(1, 100),
            'completed_by': 4,
            'result': results,
            'was_cancelled': False,
            'ground_truth': False,
            'created_at': now,
            'updated_at': now,
            'draft_created_at': None,
            'lead_time': 0,
            'prediction': {
                'id': 95,
                'result': results,
                'model_version': 'undefined',
                'created_ago': '0\xa0minutes',
                'score': None,
                'cluster': None,
                'neighbors': None,
                'mislabeling': 0.0,
                'created_at': now,
                'updated_at': now,
                'model': None,
                'model_run': None,
                'task': task_id,
                'project': project_id
            },
            'result_count': len(results),
            'unique_id': unique_id,
            'import_id': None,
            'last_action': None,
            'task': task_id,
            'project': project_id,
            'updated_by': 4,
            'parent_prediction': 95,
            'parent_annotation': None,
            'last_created_by': None
        }],
        'file_upload': 'example_file.json',
        'drafts': [],
        'predictions': [95],
        'data': {'text': text},
        'meta': {},
        'created_at': now,
        'updated_at': now,
        'inner_id': 1,
        'total_annotations': 1,
        'cancelled_annotations': 0,
        'total_predictions': 1,
        'comment_count': 0,
        'unresolved_comment_count': 0,
        'last_comment_updated_at': None,
        'project': project_id,
        'updated_by': 4,
        'comment_authors': []
    }
     
    return output
 
 
def load_jsonl(file_path):
    """ Load a JSONL file and return a list of parsed JSON objects. """
    with open(file_path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f]
 
 
def save_jsonl(data, file_path):
    """ Save a list of JSON objects to a JSONL file. """
    with open(file_path, 'w', encoding='utf-8') as f:
        for entry in data:
            f.write(json.dumps(entry) + '\n')
 
 
def save_json(data, file_path):
    """ Save a JSON object to a JSON file. """
    with open(file_path, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4)
 
 
def process_folder(input_folder, output_folder):
    """ Process all test, train, and dev files in the folder. """
    for split in ["test", "train", "dev"]:
        input_file = os.path.join(input_folder, f"{split}.jsonl")
        if not os.path.exists(input_file):
            print(f"Skipping {input_file}: File not found.")
            continue
 
        # Load input data
        data = load_jsonl(input_file)
         
        # Convert and save import format
        import_output_file = os.path.join(output_folder, f"{split}_import.json")
        import_data = convert_to_label_studio_import(data)
        save_json(import_data, import_output_file)
        print(f"Saved Label Studio import data to {import_output_file}")
         
        # Convert and save export format
        export_output_file = os.path.join(output_folder, f"{split}_export.json")
        # Skip the first record: it is the dataset header (labelsSet), not an annotated text
        export_data = [convert_to_label_studio_raw_export(item) for item in data[1:]]
        save_json(export_data, export_output_file)
        print(f"Saved Label Studio export data to {export_output_file}")
 
 
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Convert JSONL files to Label Studio formats.")
    parser.add_argument("input_folder", type=str, help="Folder containing input JSONL files (test, train, dev).")
    parser.add_argument("output_folder", type=str, help="Folder to save converted files.")
    args = parser.parse_args()
 
    # Ensure output folder exists
    os.makedirs(args.output_folder, exist_ok=True)
 
    # Process input folder
    process_folder(args.input_folder, args.output_folder)
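
As a small worked illustration of the span mapping performed by convert_to_label_studio_import, the sketch below converts Data Labeling entities (offset and length) into Label Studio results (start and end) for a shortened version of the sample sentence from the portable JSONL example. The helper name span_to_result is hypothetical and not part of the provided script; the from_name and to_name values are the ones the script uses and are assumed to match the labeling configuration of your project.

```python
def span_to_result(text, entity):
    """Map one Data Labeling entity to a Label Studio 'labels' result.

    Hypothetical helper for illustration; it mirrors the loop body of
    convert_to_label_studio_import in the script above.
    """
    start = entity["textSpan"]["offset"]
    end = start + entity["textSpan"]["length"]  # exclusive, as in text[start:end]
    return {
        "value": {
            "start": start,
            "end": end,
            "text": text[start:end],
            "labels": [entity["labels"][0]["label_name"]],
        },
        "from_name": "label",  # assumed to match the project's labeling config
        "to_name": "text",
        "type": "labels",
    }


# Shortened sample sentence with two of its entities.
text = "His current band , Gigi , was formed in 1994 with Baron ( guitar ) ."
entities = [
    {"entityType": "TEXTSELECTION", "labels": [{"label_name": "ORG"}],
     "textSpan": {"offset": 19, "length": 4}},
    {"entityType": "TEXTSELECTION", "labels": [{"label_name": "PER"}],
     "textSpan": {"offset": 50, "length": 5}},
]

results = [span_to_result(text, e) for e in entities]
for r in results:
    v = r["value"]
    print(v["start"], v["end"], v["text"], v["labels"])
```

The offset is carried over unchanged as start, and end is offset plus length, so the stored text is always the slice text[start:end].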

Usage Instructions

  1. Place all Data Labeling JSONL files (test, train, dev) in the input directory.
  2. Run the cner_conversion.ipynb script.
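    import os
    import glob

    # Assumes the conversion script above is saved as cner_export_to_LS.py
    # in the same directory as this notebook.
    from cner_export_to_LS import (
        convert_to_label_studio_import,
        convert_to_label_studio_raw_export,
        load_jsonl,
        save_json,
    )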
     
     
    def list_folders_in_directory(directory_path):
        """List all folders in a given directory using glob."""
        folder_paths = glob.glob(os.path.join(directory_path, "*/"))
        # folder_names = [os.path.basename(os.path.normpath(folder)) for folder in folder_paths]
        return folder_paths
     
     
    def list_files_in_directory(directory_path, extension="*"):
        """List all files in a directory with an optional extension filter."""
        file_paths = glob.glob(os.path.join(directory_path, f"*.{extension}"))
        # file_names = [os.path.basename(file) for file in file_paths]
        return file_paths
     
    directory_path = "/home/niksoni/cner_labelstudio_integeration/datasets/cner"
    output_folder = "/home/niksoni/cner_labelstudio_integeration/datasets/cner/label_studio_formats"
    folders = list_folders_in_directory(directory_path)
    for folder in folders:
        folder_name = os.path.basename(os.path.normpath(folder))
        output_folder_path = f"{output_folder}/{folder_name}"
        os.makedirs(output_folder_path, exist_ok=True)
        files = list_files_in_directory(folder)
        for input_file in files:
            if not os.path.exists(input_file):
                print(f"Skipping {input_file}: File not found.")
                continue
            data = load_jsonl(input_file)
            # Convert and save import format
            file_name = os.path.basename(input_file)
            import_output_file = os.path.join(output_folder_path, f"{file_name}_LS_import.json")
            import_data = convert_to_label_studio_import(data)
            save_json(import_data, import_output_file)
            print(f"Saved Label Studio import data to {import_output_file}")
             
            export_output_file = os.path.join(output_folder_path, f"{file_name}_LS_export.json")
            export_data = [convert_to_label_studio_raw_export(item) for item in data[1:]]
            save_json(export_data, export_output_file)
            print(f"Saved Label Studio export data to {export_output_file}")

    The script generates the following:

    • *_import.json for direct import into Label Studio
    • *_export.json for direct use in training workflows
    After conversion, the input and output folders have the following structure:
    /input_folder/
    │── test.jsonl
    │── train.jsonl
    │── dev.jsonl
    
    /output_folder/
    │── test_import.json
    │── test_export.json
    │── train_import.json
    │── train_export.json
    │── dev_import.json
    │── dev_export.json
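
Before importing a converted file, it can help to confirm that every pre-annotated span still lines up with its text. The following is a minimal sketch under that assumption; find_misaligned_spans is a hypothetical helper, not part of the provided scripts, and it expects tasks in the import structure produced by convert_to_label_studio_import.

```python
def find_misaligned_spans(tasks):
    """Return (task_index, start, end) for every span whose stored text does
    not match the corresponding slice of the task text.

    Hypothetical helper for illustration only.
    """
    problems = []
    for index, task in enumerate(tasks):
        text = task["data"]["text"]
        for prediction in task.get("predictions", []):
            for result in prediction.get("result", []):
                value = result["value"]
                if text[value["start"]:value["end"]] != value["text"]:
                    problems.append((index, value["start"], value["end"]))
    return problems


# Two in-memory tasks: the first is consistent, the second has a shifted span.
tasks = [
    {"data": {"text": "Gigi was formed in 1994 ."},
     "predictions": [{"result": [
         {"value": {"start": 0, "end": 4, "text": "Gigi", "labels": ["ORG"]},
          "from_name": "label", "to_name": "text", "type": "labels"}]}]},
    {"data": {"text": "Baron plays guitar ."},
     "predictions": [{"result": [
         {"value": {"start": 1, "end": 6, "text": "Baron", "labels": ["PER"]},
          "from_name": "label", "to_name": "text", "type": "labels"}]}]},
]

print(find_misaligned_spans(tasks))
```

To check a converted file, load it with json.load and pass the resulting list to the function; an empty list means every span matches its text.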

Conversion Scripts - CTXTC

Script: Convert CSV Format to Label Studio Format
A Python script named ctxtc_export_to_LS.py is provided to convert the CSV format to the Label Studio import and export formats.
import os
import argparse
import json
import uuid
import random
import glob
from datetime import datetime, timedelta
 
import pandas as pd
 
def load_jsonl(file_path):
    """ Load a JSONL file and return a list of parsed JSON objects. """
    with open(file_path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f]
 
 
def save_jsonl(data, file_path):
    """ Save a list of JSON objects to a JSONL file. """
    with open(file_path, 'w', encoding='utf-8') as f:
        for entry in data:
            f.write(json.dumps(entry) + '\n')
             
             
def save_to_json(data, file_path):
    """ Save a JSON-serializable object to a JSON file. """
    with open(file_path, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4)
    print(f"Data successfully saved to {file_path}")
 
def load_from_json(file_path):
    """ Load and return the contents of a JSON file. """
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    print(f"Data successfully loaded from {file_path}")
    return data
 
 
# Function to convert dataframe to Label Studio format
def convert_to_label_studio(df, label_name="textClassification"):
    data = []
     
    for idx, row in df.iterrows():
        text = row['text']
        label = str(row['labels']).split("|")
         
        data_entry = {
            "data": {
                "text": text
            },
            "predictions": [
                {
                    "model_version": "1.3",  # This can be adjusted if needed
                    "result": [
                        {
                            "id": str(idx),
                            "from_name": label_name,  # Label name changed here
                            "to_name": "text",
                            "type": "choices",
                            "value": {
                                # "score": 1.0,  # You can adjust the score based on confidence if needed
                                "choices": label
                            }
                        }
                    ]
                }
            ]
        }
        data.append(data_entry)
     
    return json.dumps(data, indent=4)
 
 
 
 
def convert_prediction_to_label_studio_export_format(input_data):
    # Generate placeholders
    current_time = datetime.utcnow()
    created_at = current_time.isoformat() + "Z"
    updated_at = (current_time + timedelta(seconds=1)).isoformat() + "Z"
    prediction_id = 253
    task_id = 237
    project_id = 22
    annotation_id = 110
    completed_by = 4
 
    # Extract values from input
    text = input_data['data']['text']
    predictions = input_data['predictions'][0]
    model_version = predictions['model_version']
    result = predictions['result']
 
    # Build the output structure
    converted_data = {
        'id': task_id,
        'annotations': [{
            'id': annotation_id,
            'completed_by': completed_by,
            'result': [
                {
                    **item,
                    'origin': 'prediction'
                }
                for item in result
            ],
            'was_cancelled': False,
            'ground_truth': False,
            'created_at': created_at,
            'updated_at': updated_at,
            'draft_created_at': None,
            'lead_time': 1.101,
            'prediction': {
                'id': prediction_id,
                'result': result,
                'model_version': model_version,
                'created_ago': '1\xa0minute',
                'score': None,
                'cluster': None,
                'neighbors': None,
                'mislabeling': 0.0,
                'created_at': created_at,
                'updated_at': updated_at,
                'model': None,
                'model_run': None,
                'task': task_id,
                'project': project_id
            },
            'result_count': len(result),
            'unique_id': str(uuid.uuid4()),
            'import_id': None,
            'last_action': None,
            'task': task_id,
            'project': project_id,
            'updated_by': completed_by,
            'parent_prediction': prediction_id,
            'parent_annotation': None,
            'last_created_by': None
        }],
        'file_upload': '0606dc5a-multiLabel_ethos_train_LabelStudio_Upload.json',
        'drafts': [],
        'predictions': [prediction_id],
        'data': {
            'text': text
        },
        'meta': {},
        'created_at': created_at,
        'updated_at': updated_at,
        'inner_id': 2,
        'total_annotations': 1,
        'cancelled_annotations': 0,
        'total_predictions': 1,
        'comment_count': 0,
        'unresolved_comment_count': 0,
        'last_comment_updated_at': None,
        'project': project_id,
        'updated_by': completed_by,
        'comment_authors': []
    }
    return converted_data
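
To make the import conversion concrete, the sketch below builds the same task structure as convert_to_label_studio from two in-memory CSV rows, using only the standard library instead of pandas. The column names text and labels and the | separator are taken from the script above; the helper name rows_to_tasks and the sample rows are illustrative assumptions.

```python
import csv
import io


def rows_to_tasks(csv_text, label_name="textClassification"):
    """Build Label Studio import tasks from CSV text with 'text' and 'labels'
    columns. Hypothetical helper mirroring convert_to_label_studio."""
    tasks = []
    for idx, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        tasks.append({
            "data": {"text": row["text"]},
            "predictions": [{
                "model_version": "1.3",  # placeholder value, as in the script
                "result": [{
                    "id": str(idx),
                    "from_name": label_name,
                    "to_name": "text",
                    "type": "choices",
                    # Multi-label rows separate their labels with "|"
                    "value": {"choices": row["labels"].split("|")},
                }],
            }],
        })
    return tasks


# Illustrative rows; real files are assumed to use the same two columns.
sample_csv = (
    "text,labels\n"
    "I need a train to Cambridge.,train\n"
    "Book a room and a table for two.,hotel|restaurant\n"
)

tasks = rows_to_tasks(sample_csv)
for task in tasks:
    print(task["predictions"][0]["result"][0]["value"]["choices"])
```

A single-label row yields a one-element choices list, so the same structure covers both single-label and multi-label classification.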

Usage Instructions

  1. Place all Data Labeling JSONL files (test, train, dev) in the input directory.
  2. Run the ctxtc_conversion.ipynb script.
    import json
    from ctxtc_export_to_LS import *  # the conversion script above; also provides os and pd
    import glob
     
    def list_folders_in_directory(directory_path):
        """List all folders in a given directory using glob."""
        folder_paths = glob.glob(os.path.join(directory_path, "*/"))
        # folder_names = [os.path.basename(os.path.normpath(folder)) for folder in folder_paths]
        return folder_paths
     
     
    def list_files_in_directory(directory_path, extension="*"):
        """List all files in a directory with an optional extension filter."""
        file_paths = glob.glob(os.path.join(directory_path, f"*.{extension}"))
        # file_names = [os.path.basename(file) for file in file_paths]
        return file_paths
     
    # cls_types = ["singleLabel", "multilabel"]
    cls_types = ["multilabel"]
    directory_path = "/home/niksoni/cner_labelstudio_integeration/datasets/ctxtc"
    output_folder = "/home/niksoni/cner_labelstudio_integeration/datasets/ctxtc/label_studio_formats"
    for cls_type in cls_types:
        cls_directory_path = f"{directory_path}/{cls_type}"
        folders = list_folders_in_directory(cls_directory_path)
        for folder in folders:
            folder_name = os.path.basename(os.path.normpath(folder))
            output_folder_path = f"{output_folder}/{cls_type}/{folder_name}"
            os.makedirs(output_folder_path, exist_ok=True)
            files = list_files_in_directory(folder)
            # print(files)
            for input_file in files:
                if not os.path.exists(input_file):
                    print(f"Skipping {input_file}: File not found.")
                    continue
                df = pd.read_csv(input_file)
                json_data = convert_to_label_studio(df)
                json_data = json.loads(json_data)
                file_name = os.path.basename(input_file)
                # save_to_json(json_data,file_name)
                import_output_file = os.path.join(output_folder_path, f"{file_name}_LS_import.json")
                # import_data = convert_to_label_studio(data)
                save_to_json(json_data, import_output_file)
                print(f"Saved Label Studio import data to {import_output_file}")
     
                export_output_file = os.path.join(output_folder_path, f"{file_name}_LS_export.json")
                export_data = [convert_prediction_to_label_studio_export_format(item) for item in json_data]
                save_to_json(export_data, export_output_file)
                print(f"Saved Label Studio export data to {export_output_file}")

    The script generates the following:

    • *_import.json for direct import into Label Studio
    • *_export.json for direct use in training workflows
    After conversion, the input and output folders have the following structure:
    /input_folder/
    │── test.jsonl
    │── train.jsonl
    │── dev.jsonl
    
    /output_folder/
    │── test_import.json
    │── test_export.json
    │── train_import.json
    │── train_export.json
    │── dev_import.json
    │── dev_export.json
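
Because the export file feeds training directly, a quick look at the label distribution can catch conversion problems early. This is a minimal sketch, assuming tasks in the raw export structure produced by convert_prediction_to_label_studio_export_format; count_choices is a hypothetical helper, and the two sample tasks are trimmed to the fields it reads.

```python
from collections import Counter


def count_choices(export_tasks):
    """Count how often each class label appears in 'choices' results.

    Hypothetical helper for illustration; cancelled annotations are skipped.
    """
    counts = Counter()
    for task in export_tasks:
        for annotation in task.get("annotations", []):
            if annotation.get("was_cancelled"):
                continue
            for item in annotation.get("result", []):
                if item.get("type") == "choices":
                    counts.update(item["value"]["choices"])
    return counts


# Trimmed sample tasks in the raw export structure.
export_tasks = [
    {"id": 237, "annotations": [{
        "was_cancelled": False,
        "result": [{"type": "choices",
                    "value": {"choices": ["hotel", "restaurant"]}}]}]},
    {"id": 238, "annotations": [{
        "was_cancelled": False,
        "result": [{"type": "choices",
                    "value": {"choices": ["hotel"]}}]}]},
]

print(count_choices(export_tasks).most_common())
```

For a real file, load the *_export.json content with json.load and pass the resulting list to the function.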

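Integration with Language

Use the converted data in the following training flow:
  1. Upload the dataset, either annotated manually in Label Studio or converted from Data Labeling.
  2. Select the dataset stored in Object Storage for model training.
    Language processes the dataset and continues with the training pipeline.

    The model is validated, trained, and deployed in the same way as before.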

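Summary

  • The migration scripts provide a seamless migration path from Data Labeling to Label Studio.
  • You can continue annotating existing data in Label Studio.
  • Exported datasets are fully compatible with the Language training pipeline.
  • The existing JSONL format remains supported for backward compatibility.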