21.7 Before Using the Converter (2): Optionally Create Data-Guide JSON Schemas

A data-guide JSON schema provides frequency information about the fields in a document set, in addition to structure and type information. You can use such schemas to get an idea how migration might proceed, and you can compare them with other JSON schemas as a shortcut for comparing document sets.

A JSON data guide created using keyword FORMAT_SCHEMA is a special kind of JSON schema that includes not only the usual document structure and type information but also statistical information about the specific content; in particular, for each field, in what percentage of documents it occurs, in what percentage of documents it has values of which types, and the range of values for each type.

Creating data-guide JSON schemas for your input document sets is optionalFoot 1, and you can create them at any time (as long as you still have the transfer tables of input documents). But it's a good idea to create them before starting to convert your input document sets, in particular because they can help guide how you configure the converter.

Example 21-5 Create JSON Data Guides For Input Document Sets

This example uses Oracle SQL function json_dataguide to create data guides for the input student, teacher, and course document sets. These are JSON schemas that can be used to validate their documents.

Parameter DBMS_JSON.FORMAT_SCHEMA ensures that the data guide is usable for validating. Parameter DBMS_JSON.PRETTY pretty-prints the result. Parameter DBMS_JSON.GATHER_STATS provides the data guide with statistical fields such as o:frequency, which specifies the percentage of documents in which a given field occurs or has a given type of value.

SELECT json_dataguide(data,
                      DBMS_JSON.FORMAT_SCHEMA,
                      DBMS_JSON.PRETTY+DBMS_JSON.GATHER_STATS)
  FROM student_tab;

SELECT json_dataguide(data,
                      DBMS_JSON.FORMAT_SCHEMA,
                      DBMS_JSON.PRETTY+DBMS_JSON.GATHER_STATS)
  FROM teacher_tab;

SELECT json_dataguide(data,
                      DBMS_JSON.FORMAT_SCHEMA,
                      DBMS_JSON.PRETTY+DBMS_JSON.GATHER_STATS)
  FROM course_tab;

See Also:

  • JSON_DATAGUIDE in Oracle Database SQL Language Reference

  • DBMS_JSON Constants in Oracle Database PL/SQL Packages and Types Reference for information about constants DBMS_JSON.FORMAT_SCHEMA, DBMS_JSON.GATHER_STATS, and DBMS_JSON.PRETTY

The resulting data guides for input student document set, input teacher document set, and input course documentation set are shown in the JSON data guide examples that follow in this topic, "JSON Data Guide for Input Student Document Set", "JSON Data Guide for Input Teacher Document Set", and "JSON Data Guide for Input Course Document Set", which describe the documents in JSON-type column data of tables student_tab, teacher_tab, and course_tab, respectively.

Comparing JSON schemas can serve as a useful proxy for comparing entire document sets — in particular, migration input document sets versus output document collections supported by duality views (proposed during migration or fully created).

  • As the first conversion step, PL/SQL function DBMS_JSON_DUALITY.infer_schema produces a JSON schema that describes the entire proposed relational schema for a migration, that is, the proposed duality views plus their underlying tables.

    The JSON schema produced by function infer_schema is not a data-guide JSON schema — there are no duality views yet, so there are no supported document collections from which statistical information can be gathered. But it does specify the structure and typing of the document sets that could result from a migration.

    You can use the view parts of this JSON schema to compare against JSON schemas for input document sets (in a transfer table).

    For example, instead of comparing the individual input documents in table course_tab with the individual documents to be supported by the (inferred) course duality view, you can compare the data-guide JSON schema from "JSON Data Guide for Input Course Document Set" in this topic with the COURSE duality-view part of the JSON schema inferred by infer_schema — see the section "JSON Schema from INFER_SCHEMA for Duality Vies with No Outliers" in Migrating To Duality, Simplified Recipe. When you do that, you can ignore fields that are relevant to only one or the other kind of JSON schema--for example, fields named with prefix "o:" (for Oracle) and fields named with prefix "db" (for database).

  • Similarly, comparing a JSON schema for an input document set against a JSON schema for the created and populated duality view that replaces it after migration can highlight differences. For example, you can compare the data-guide JSON schema for the course input table in the example "JSON Data Guide For Import Course Document Set" against a data-guide JSON schema for the course duality view. A data-guide schema serves as a shortcut (proxy) for comparing the documents supported by a duality view with the corresponding input documents.

Example 21-6 JSON Data Guide For Input Student Document Set

This data guide summarizes the input set of student documents stored in transfer table student_tab.

{
  "type" : "object",
  "o:frequency" : 100,
  "o:last_analyzed" : "2024-12-30T18:12:41",
  "o:sample_size" : 10,
  "required" : true,
  "properties" :
  {
    "age" :
    {
      "oneOf" :
      [
        {
          "type" : "number",
          "o:preferred_column_name" : "age",
          "o:frequency" : 90,
          "o:low_value" : 19,
          "o:high_value" : 21,
          "o:num_nulls" : 0,
          "o:last_analyzed" : "2024-12-30T18:12:41",
          "o:sample_size" : 10,
          "maximum" : 21,
          "minimum" : 19
        },
        {
          "type" : "string",
          "o:length" : 8,
          "o:preferred_column_name" : "age",
          "o:frequency" : 10,
          "o:low_value" : "Nineteen",
          "o:high_value" : "Nineteen",
          "o:num_nulls" : 0,
          "o:last_analyzed" : "2024-12-30T18:12:41",
          "o:sample_size" : 10,
          "maxLength" : 8,
          "minLength" : 8
        }
      ]
    },
    "name" :
    {
      "type" : "string",
      "o:length" : 16,
      "o:preferred_column_name" : "name",
      "o:frequency" : 100,
      "o:low_value" : "Donald P.",
      "o:high_value" : "Ming L.",
      "o:num_nulls" : 0,
      "o:last_analyzed" : "2024-12-30T18:12:41",
      "o:sample_size" : 10,
      "required" : true,
      "maxLength" : 10,
      "minLength" : 6
    },
    "courses" :
    {
      "type" : "array",
      "o:preferred_column_name" : "courses",
      "o:frequency" : 100,
      "o:last_analyzed" : "2024-12-30T18:12:41",
      "o:sample_size" : 10,
      "required" : true,
      "items" :
      {
        "properties" :
        {
          "name" :
          {
            "type" : "string",
            "o:length" : 16,
            "o:preferred_column_name" : "name",
            "o:frequency" : 100,
            "o:low_value" : "Advanced Algebra",
            "o:high_value" : "Data Structures",
            "o:num_nulls" : 0,
            "o:last_analyzed" : "2024-12-30T18:12:41",
            "o:sample_size" : 10,
            "required" : true,
            "maxLength" : 16,
            "minLength" : 7
          },
          "avgGrade" :
          {
            "oneOf" :
            [
              {
                "type" : "number",
                "o:preferred_column_name" : "avgGrade",
                "o:frequency" : 100,
                "o:low_value" : 75,
                "o:high_value" : 95,
                "o:num_nulls" : 0,
                "o:last_analyzed" : "2024-12-30T18:12:41",
                "o:sample_size" : 10,
                "required" : true,
                "maximum" : 95,
                "minimum" : 75
              },
              {
                "type" : "string",
                "o:length" : 4,
                "o:preferred_column_name" : "avgGrade",
                "o:frequency" : 50,
                "o:low_value" : "TBD",
                "o:high_value" : "TBD",
                "o:num_nulls" : 0,
                "o:last_analyzed" : "2024-12-30T18:12:41",
                "o:sample_size" : 10,
                "maxLength" : 3,
                "minLength" : 3
              }
            ]
          },
          "courseNumber" :
          {
            "type" : "string",
            "o:length" : 8,
            "o:preferred_column_name" : "courseNumber",
            "o:frequency" : 100,
            "o:low_value" : "CS101",
            "o:high_value" : "MATH103",
            "o:num_nulls" : 0,
            "o:last_analyzed" : "2024-12-30T18:12:41",
            "o:sample_size" : 10,
            "required" : true,
            "maxLength" : 7,
            "minLength" : 5
          }
        }
      }
    },
    "advisorId" :
    {
      "type" : "number",
      "o:preferred_column_name" : "advisorId",
      "o:frequency" : 100,
      "o:low_value" : 101,
      "o:high_value" : 103,
      "o:num_nulls" : 0,
      "o:last_analyzed" : "2024-12-30T18:12:41",
      "o:sample_size" : 10,
      "required" : true,
      "maximum" : 103,
      "minimum" : 101
    },
    "dormitory" :
    {
      "type" : "object",
      "o:preferred_column_name" : "dormitory",
      "o:frequency" : 100,
      "o:last_analyzed" : "2024-12-30T18:12:41",
      "o:sample_size" : 10,
      "required" : true,
      "properties" :
      {
        "dormId" :
        {
          "type" : "number",
          "o:preferred_column_name" : "dormId",
          "o:frequency" : 100,
          "o:low_value" : 201,
          "o:high_value" : 205,
          "o:num_nulls" : 0,
          "o:last_analyzed" : "2024-12-30T18:12:41",
          "o:sample_size" : 10,
          "required" : true,
          "maximum" : 205,
          "minimum" : 201
        },
        "dormName" :
        {
          "type" : "string",
          "o:length" : 4,
          "o:preferred_column_name" : "dormName",
          "o:frequency" : 100,
          "o:low_value" : "ABC",
          "o:high_value" : "XYZ",
          "o:num_nulls" : 0,
          "o:last_analyzed" : "2024-12-30T18:12:41",
          "o:sample_size" : 10,
          "required" : true,
          "maxLength" : 3,
          "minLength" : 3
        }
      }
    },
    "studentId" :
    {
      "type" : "number",
      "o:preferred_column_name" : "studentId",
      "o:frequency" : 100,
      "o:low_value" : 1,
      "o:high_value" : 10,
      "o:num_nulls" : 0,
      "o:last_analyzed" : "2024-12-30T18:12:41",
      "o:sample_size" : 10,
      "required" : true,
      "maximum" : 10,
      "minimum" : 1
    }
  }
}

Field age has a type that is either (1) a number, with o:frequency 90, or (2) a string, with o:frequency 10. This means that a numeric age appears in 90% of the documents, and a string grade appears in 10% of the documents. Field age is thus a mixed-type field.

Similarly, field avgGrade is a mixed-type field, with a numeric grade in 100% of the documents, and a string grade in 50% of the documents.

Converter configuration fields minFieldFrequency and minTypeFrequency test the percentage of documents where a field, or a field of a given type, respectively, is present across the document set.

All of the fields in the student documents are present in 100% of the documents, so none of them can be occurrence outliers, regardless of the value of minFieldFrequency.

If the converter is used with a value of 15 for configuration field minTypeFrequency then field age will be considered a type-occurence outlier, because it occurs with a string value in only 10% of the student documents (10 < 15). Field avgGrade will not be considered a type-occurrence outlier, because neither of its types is used in less than 15% of the student documents.

As a type-occurrence outlier, field age would be mapped by the converter to a column with SQL type NUMBER (JSON number type being dominant for the field). Then the importer would try, and fail, to convert the string value "Nineteen" to a number, and would log an error for the document where age is "Nineteen".

A field that doesn't occur rarely but has a type that occurs rarely is not removed from the data.

Because there is no SQL data type of number-or-string, non-outlier mixed-type field avgGrade will be mapped by the converter to a JSON-type column, and it will apply a JSON schema to that column as a validating check constraint, to require the value to always be either a string or a number.

Example 21-7 JSON Data Guide For Input Teacher Document Set

This data guide summarizes the input set of teacher documents stored in transfer table teacher_tab.

{
  "type" : "object",
  "o:frequency" : 100,
  "o:last_analyzed" : "2024-12-30T18:12:41",
  "o:sample_size" : 4,
  "required" : true,
  "properties" :
  {
    "_id" :
    {
      "type" : "number",
      "o:preferred_column_name" : "_id",
      "o:frequency" : 100,
      "o:low_value" : 101,
      "o:high_value" : 104,
      "o:num_nulls" : 0,
      "o:last_analyzed" : "2024-12-30T18:12:41",
      "o:sample_size" : 4,
      "required" : true,
      "maximum" : 104,
      "minimum" : 101
    },
    "name" :
    {
      "type" : "string",
      "o:length" : 16,
      "o:preferred_column_name" : "name",
      "o:frequency" : 100,
      "o:low_value" : "Abdul J.",
      "o:high_value" : "Natalie C.",
      "o:num_nulls" : 0,
      "o:last_analyzed" : "2024-12-30T18:12:41",
      "o:sample_size" : 4,
      "required" : true,
      "maxLength" : 10,
      "minLength" : 8
    },
    "salary" :
    {
      "type" : "number",
      "o:preferred_column_name" : "salary",
      "o:frequency" : 100,
      "o:low_value" : 180000,
      "o:high_value" : 300000,
      "o:num_nulls" : 0,
      "o:last_analyzed" : "2024-12-30T18:12:41",
      "o:sample_size" : 4,
      "required" : true,
      "maximum" : 300000,
      "minimum" : 180000
    },
    "department" :
    {
      "type" : "string",
      "o:length" : 16,
      "o:preferred_column_name" : "department",
      "o:frequency" : 100,
      "o:low_value" : "Computer Science",
      "o:high_value" : "Mathematics",
      "o:num_nulls" : 0,
      "o:last_analyzed" : "2024-12-30T18:12:41",
      "o:sample_size" : 4,
      "required" : true,
      "maxLength" : 16,
      "minLength" : 11
    },
    "phoneNumber" :
    {
      "oneOf" :
      [
        {
          "type" : "string",
          "o:length" : 16,
          "o:preferred_column_name" : "phoneNumber",
          "o:frequency" : 50,
          "o:low_value" : "222-555-022",
          "o:high_value" : "222-555-044",
          "o:num_nulls" : 0,
          "o:last_analyzed" : "2024-12-30T18:12:41",
          "o:sample_size" : 4,
          "maxLength" : 11,
          "minLength" : 11
        },
        {
          "type" : "array",
          "o:preferred_column_name" : "phoneNumber",
          "o:frequency" : 50,
          "o:last_analyzed" : "2024-12-30T18:12:41",
          "o:sample_size" : 4,
          "items" :
          {
            "type" : "string",
            "o:length" : 16,
            "o:preferred_column_name" : "scalar_string",
            "o:frequency" : 50,
            "o:low_value" : "222-555-011",
            "o:high_value" : "222-555-023",
            "o:num_nulls" : 0,
            "o:last_analyzed" : "2024-12-30T18:12:41",
            "o:sample_size" : 4,
            "maxLength" : 11,
            "minLength" : 11
          }
        }
      ]
    },
    "coursesTaught" :
    {
      "type" : "array",
      "o:preferred_column_name" : "coursesTaught",
      "o:frequency" : 100,
      "o:last_analyzed" : "2024-12-30T18:12:41",
      "o:sample_size" : 4,
      "required" : true,
      "items" :
      {
        "properties" :
        {
          "name" :
          {
            "type" : "string",
            "o:length" : 16,
            "o:preferred_column_name" : "name",
            "o:frequency" : 75,
            "o:low_value" : "Advanced Algebra",
            "o:high_value" : "Data Structures",
            "o:num_nulls" : 0,
            "o:last_analyzed" : "2024-12-30T18:12:41",
            "o:sample_size" : 4,
            "maxLength" : 16,
            "minLength" : 7
          },
          "courseId" :
          {
            "type" : "string",
            "o:length" : 8,
            "o:preferred_column_name" : "courseId",
            "o:frequency" : 75,
            "o:low_value" : "CS101",
            "o:high_value" : "MATH103",
            "o:num_nulls" : 0,
            "o:last_analyzed" : "2024-12-30T18:12:41",
            "o:sample_size" : 4,
            "maxLength" : 7,
            "minLength" : 5
          },
          "classType" :
          {
            "type" : "string",
            "o:length" : 16,
            "o:preferred_column_name" : "classType",
            "o:frequency" : 75,
            "o:low_value" : "In-person",
            "o:high_value" : "Online",
            "o:num_nulls" : 0,
            "o:last_analyzed" : "2024-12-30T18:12:41",
            "o:sample_size" : 4,
            "maxLength" : 9,
            "minLength" : 6
          }
        }
      }
    },
    "studentsAdvised" :
    {
      "type" : "array",
      "o:preferred_column_name" : "studentsAdvised",
      "o:frequency" : 100,
      "o:last_analyzed" : "2024-12-30T18:12:41",
      "o:sample_size" : 4,
      "required" : true,
      "items" :
      {
        "properties" :
        {
          "name" :
          {
            "type" : "string",
            "o:length" : 16,
            "o:preferred_column_name" : "name",
            "o:frequency" : 75,
            "o:low_value" : "Donald P.",
            "o:high_value" : "Ming L.",
            "o:num_nulls" : 0,
            "o:last_analyzed" : "2024-12-30T18:12:41",
            "o:sample_size" : 4,
            "maxLength" : 10,
            "minLength" : 6
          },
          "dormId" :
          {
            "type" : "number",
            "o:preferred_column_name" : "dormId",
            "o:frequency" : 75,
            "o:low_value" : 201,
            "o:high_value" : 205,
            "o:num_nulls" : 0,
            "o:last_analyzed" : "2024-12-30T18:12:41",
            "o:sample_size" : 4,
            "maximum" : 205,
            "minimum" : 201
          },
          "studentId" :
          {
            "type" : "number",
            "o:preferred_column_name" : "studentId",
            "o:frequency" : 75,
            "o:low_value" : 1,
            "o:high_value" : 10,
            "o:num_nulls" : 0,
            "o:last_analyzed" : "2024-12-30T18:12:41",
            "o:sample_size" : 4,
            "maximum" : 10,
            "minimum" : 1
          }
        }
      }
    }
  }
}

Field phoneNumber has a type that is either (1) a string or (2) an array of strings. Each of those types occurs in 50% (o:frequency = 50) of the teacher documents. It is thus a mixed-type field. If the converter is used with a minTypeFrequency value of 15 then it will not be considered a type-occurrence outlier.

Because there is no SQL data type of string-or-array-of-strings, (non-outlier) mixed-type field phoneNumber will be mapped by the converter to a JSON-type column, and it will apply a JSON schema to that column as a validating check constraint, to require the value to always be either a string or an array of strings.

Example 21-8 JSON Data Guide For Input Course Document Set

This data guide summarizes the input set of course documents stored in transfer table course_tab.

{
  "type" : "object",
  "o:frequency" : 100,
  "o:last_analyzed" : "2024-12-30T18:12:41",
  "o:sample_size" : 5,
  "required" : true,
  "properties" :
  {
    "name" :
    {
      "type" : "string",
      "o:length" : 16,
      "o:preferred_column_name" : "name",
      "o:frequency" : 100,
      "o:low_value" : "Advanced Algebra",
      "o:high_value" : "Data Structures",
      "o:num_nulls" : 0,
      "o:last_analyzed" : "2024-12-30T18:12:41",
      "o:sample_size" : 5,
      "required" : true,
      "maxLength" : 16,
      "minLength" : 7
    },
    "Notes" :
    {
      "type" : "string",
      "o:length" : 64,
      "o:preferred_column_name" : "Notes",
      "o:frequency" : 20,
      "o:low_value" : "Prerequisite for Advanced Algebra",
      "o:high_value" : "Prerequisite for Advanced Algebra",
      "o:num_nulls" : 0,
      "o:last_analyzed" : "2024-12-30T18:12:41",
      "o:sample_size" : 5,
      "maxLength" : 33,
      "minLength" : 33
    },
    "teacher" :
    {
      "type" : "object",
      "o:preferred_column_name" : "teacher",
      "o:frequency" : 100,
      "o:last_analyzed" : "2024-12-30T18:12:41",
      "o:sample_size" : 5,
      "required" : true,
      "properties" :
      {
        "name" :
        {
          "type" : "string",
          "o:length" : 8,
          "o:preferred_column_name" : "name",
          "o:frequency" : 100,
          "o:low_value" : "Abdul J.",
          "o:high_value" : "Colin J.",
          "o:num_nulls" : 0,
          "o:last_analyzed" : "2024-12-30T18:12:41",
          "o:sample_size" : 5,
          "required" : true,
          "maxLength" : 8,
          "minLength" : 8
        },
        "teacherId" :
        {
          "type" : "number",
          "o:preferred_column_name" : "teacherId",
          "o:frequency" : 100,
          "o:low_value" : 101,
          "o:high_value" : 103,
          "o:num_nulls" : 0,
          "o:last_analyzed" : "2024-12-30T18:12:41",
          "o:sample_size" : 5,
          "required" : true,
          "maximum" : 103,
          "minimum" : 101
        }
      }
    },
    "courseId" :
    {
      "type" : "string",
      "o:length" : 8,
      "o:preferred_column_name" : "courseId",
      "o:frequency" : 100,
      "o:low_value" : "CS101",
      "o:high_value" : "MATH103",
      "o:num_nulls" : 0,
      "o:last_analyzed" : "2024-12-30T18:12:41",
      "o:sample_size" : 5,
      "required" : true,
      "maxLength" : 7,
      "minLength" : 5
    },
    "students" :
    {
      "type" : "array",
      "o:preferred_column_name" : "students",
      "o:frequency" : 100,
      "o:last_analyzed" : "2024-12-30T18:12:41",
      "o:sample_size" : 5,
      "required" : true,
      "items" :
      {
        "properties" :
        {
          "name" :
          {
            "type" : "string",
            "o:length" : 16,
            "o:preferred_column_name" : "name",
            "o:frequency" : 100,
            "o:low_value" : "Donald P.",
            "o:high_value" : "Ming L.",
            "o:num_nulls" : 0,
            "o:last_analyzed" : "2024-12-30T18:12:41",
            "o:sample_size" : 5,
            "required" : true,
            "maxLength" : 10,
            "minLength" : 6
          },
          "studentId" :
          {
            "type" : "number",
            "o:preferred_column_name" : "studentId",
            "o:frequency" : 100,
            "o:low_value" : 1,
            "o:high_value" : 10,
            "o:num_nulls" : 0,
            "o:last_analyzed" : "2024-12-30T18:12:41",
            "o:sample_size" : 5,
            "required" : true,
            "maximum" : 10,
            "minimum" : 1
          }
        }
      }
    },
    "creditHours" :
    {
      "oneOf" :
      [
        {
          "type" : "number",
          "o:preferred_column_name" : "creditHours",
          "o:frequency" : 80,
          "o:low_value" : 3,
          "o:high_value" : 5,
          "o:num_nulls" : 0,
          "o:last_analyzed" : "2024-12-30T18:12:41",
          "o:sample_size" : 5,
          "maximum" : 5,
          "minimum" : 3
        },
        {
          "type" : "string",
          "o:length" : 1,
          "o:preferred_column_name" : "creditHours",
          "o:frequency" : 20,
          "o:low_value" : "3",
          "o:high_value" : "3",
          "o:num_nulls" : 0,
          "o:last_analyzed" : "2024-12-30T18:12:41",
          "o:sample_size" : 5,
          "maxLength" : 1,
          "minLength" : 1
        }
      ]
    }
  }
}

Field Notes occurs in only 20% of the documents (field o:frequency is 20). If the converter is used with a value of 25 for configuration field minFieldFrequency then field Notes is considered an occurrence outlier, and the converter will not map it to any column.

However, if configuration field useFlexFields is true (the default) then the converter creates flex columns, and when flex columns exist the importer places all unmapped fields into flex columns. In that default (useFlexFields = true) case, field Notes will therefore be supported (by a flex column) after migration. If the converter is used with useFlexFields = false then the importer will log an error for rare field Notes.

Field creditHours has a type that is either (1) a number, with o:frequency = 80, or (2) a string, with o:frequency = 20. It is thus a mixed-type field. If the converter is used with a minTypeFrequency value of 15 then it will not be considered a type-occurrence outlier.

Because there is no SQL data type of number-or-string, (non-outlier) mixed-type field creditHours will be mapped by the converter to a JSON-type column, and it will apply a JSON schema to that column as a validating check constraint, to require the value to always be either a string or a number.

See Also:



Footnote Legend

Footnote 1: Transfer tables for your input document sets are all you need, to use the JSON-To-Duality converter.