Chapter 7. Avro Schemas

Table of Contents

Creating Avro Schemas
Avro Schema Definitions
Primitive Data Types
Complex Data Types
Using Avro Schemas
Schema Evolution
Rules for Changing Schema
Writer and Reader Schema
How Schema Evolution Works
Managing Avro Schema in the Store
Adding Schema
Changing Schema
Disabling and Enabling Schema
Showing Schema

Avro is used to define the data schema for a record's value. This schema describes the fields allowed in the value, along with their data types.

You apply a schema to the value portion of an Oracle NoSQL Database record using Avro bindings. These bindings are used to serialize values before writing them, and to deserialize values after reading them. The usage of these bindings requires your applications to use the Avro data format, which means that each stored value is associated with a schema.

The use of Avro schemas allows serialized values to be stored in a very space-efficient binary format. Each value is stored without any metadata other than a small internal schema identifier, between 1 and 4 bytes in size. One such reference is stored per key-value pair. In this way, the serialized Avro data format is always associated with the schema used to serialize it, with minimal overhead. This association is made transparently to the application, and the internal schema identifier is managed by the bindings supplied by the AvroCatalog class. The application never sees or uses the internal identifier directly.

The Avro API is the result of an open source project provided by the Apache Software Foundation. It is formally described here: http://avro.apache.org.

In addition, Avro makes use of the Jackson APIs for parsing JSON. This is likely to be of interest to you if you are integrating Oracle NoSQL Database with a JSON-based system. Jackson is formally described here: http://wiki.fasterxml.com/JacksonHome.

Creating Avro Schemas

An Avro schema is created using JSON format. JSON is short for JavaScript Object Notation, and it is a lightweight, text-based data interchange format that is intended to be easy for humans to read and write. JSON is described in a great many places, both on the web and in after-market documentation. However, it is formally described in the IETF's RFC 4627, which can be found at http://www.ietf.org/rfc/rfc4627.txt?number=4627.

To describe an Avro schema, you create a JSON record which identifies the schema, like this:

{
     "type": "record",
     "namespace": "com.example",
     "name": "FullName",
     "fields": [
       { "name": "first", "type": "string" },
       { "name": "last", "type": "string" }
     ]
} 

The above example is a JSON record which describes schema that might be used by the value portion of a key-value pair in the store. It describes a schema for a person's full name.

Notice that for the record, there are four fields:

  • type

    Identifies the JSON field type. For Avro schemas, this must always be record when it is specified at the schema's top level. The type record means that there will be multiple fields defined.

  • namespace

    This identifies the namespace in which the object lives. Essentially, this is meant to be a URI that has meaning to you and your organization. It is used to differentiate one schema type from another should they share the same name.

  • name

    This is the schema name which, when combined with the namespace, uniquely identifies the schema within the store. In the above example, the fully qualified name for the schema is com.example.FullName.

  • fields

    This is the actual schema definition. It defines what fields are contained in the value, and the data type for each field. A field can be a simple data type, such as an integer or a string, or it can be complex data. We describe this in more detail, below.

    Note that schema field names must begin with [A-Za-z_], and subsequently contain only [A-Za-z0-9_].

To use the schema, you must define it in a flat text file, and then add the schema to your store using the appropriate command line call. You must also somehow provide it to your code. The schema that your code is using must correspond to the schema that has been added to your store.

The remainder of this chapter describes schemas and how to add them to your store. For a description of how to use schemas in your code, see Avro Bindings.

Avro Schema Definitions

Avro schema definitions are JSON records. Because it is a record, it can define multiple fields which are organized in a JSON array. Each such field identifies the field's name as well as its type. The type can be something simple, like an integer, or something complex, like another record.

For example, the following trivial Avro schema definition can be used for a value that contains just someone's age:

{
    "type" : "record",
    "name" : "userInfo",
    "namespace" : "my.example",
    "fields" : [{"name" : "age", "type" : "int"}]
} 

Of course, if your data storage needs are this simple, you can just use a byte-array to store the integer in the store. (Although this is not considered best practice.)

Notice in the previous example that the top-level type for the schema definition is of type record, even though we are defining a single-field schema. Oracle NoSQL Database requires you to use record for the top-level type, even if you only need one field.

Also, it is best-practice to define default values for the fields in your schema. While this is optional, should you ever decide to change your schema, it can save you a lot of trouble. To define a default value, use the default attribute:

{
    "type" : "record",
    "name" : "userInfo",
    "namespace" : "my.example",
    "fields" : [{"name" : "age", "type" : "int", "default" : -1}]
}

You almost certainly will not be using single-field definitions. To add multiple fields, specify an array in the fields field. For example:

{
    "type" : "record",
    "name" : "userInfo",
    "namespace" : "my.example",
    "fields" : [{"name" : "username", 
                "type" : "string", 
                "default" : "NONE"},

                {"name" : "age", 
                "type" : "int", 
                "default" : -1},

                {"name" : "phone", 
                "type" : "string", 
                "default" : "NONE"},

                {"name" : "housenum", 
                "type" : "string", 
                "default" : "NONE"},

                {"name" : "street", 
                "type" : "string", 
                "default" : "NONE"},

                {"name" : "city", 
                "type" : "string", 
                "default" : "NONE"},

                {"name" : "state_province", 
                "type" : "string", 
                "default" : "NONE"},

                {"name" : "country", 
                "type" : "string", 
                "default" : "NONE"},

                {"name" : "zip", 
                "type" : "string", 
                "default" : "NONE"}]
} 

The above schema definition provides a lot of information. However, simple as it is, you could add some more structure to it by using an embedded record:

{
    "type" : "record",
    "name" : "userInfo",
    "namespace" : "my.example",
    "fields" : [{"name" : "username", 
                 "type" : "string", 
                 "default" : "NONE"},

                {"name" : "age", 
                 "type" : "int",
                 "default" : -1},

                 {"name" : "phone", 
                  "type" : "string", 
                  "default" : "NONE"},

                 {"name" : "housenum", 
                  "type" : "string", 
                  "default" : "NONE"},

                  {"name" : "address", 
                     "type" : {
                         "type" : "record",
                         "name" : "mailing_address",
                         "fields" : [
                            {"name" : "street", 
                             "type" : "string", 
                             "default" : "NONE"},

                            {"name" : "city", 
                             "type" : "string", 
                             "default" : "NONE"},

                            {"name" : "state_prov", 
                             "type" : "string", 
                             "default" : "NONE"},

                            {"name" : "country", 
                             "type" : "string", 
                             "default" : "NONE"},

                            {"name" : "zip", 
                             "type" : "string", 
                             "default" : "NONE"}
                            ]}
                }
    ]
} 

Note

It is unlikely that you will need just one record definition for your entire store. Probably you will have more than one type of record. You handle this by providing each of your record definitions individually in separate files. Your code must then be written to handle the different record definitions. We will discuss how to do that later in this chapter.

Primitive Data Types

In the previous Avro schema examples, we have only shown strings and integers. The complete list of primitive types which Avro supports are:

  • null

    No value.

  • boolean

    A binary value.

  • int

    A 32-bit signed integer.

  • long

    A 64-bit signed integer.

  • float

    A single precision (32 bit) IEEE 754 floating-point number.

  • double

    A double precision (64-bit) IEEE 754 floating-point number.

  • bytes

    A sequence of 8-bit unsigned bytes.

  • string

    A Unicode character sequence.

These primitive types do not have any specified attributes. Primitive type names are also defined type names. For example, the schema "string" is equivalent to:

{"type" : "string"}

Complex Data Types

Beyond the primitive data types described in the previous section, Avro also supports six complex data types: Records, Enums, Arrays, Maps, Unions, and Fixed. They are described in this section.

record

A record represents an encapsulation of attributes that, all combined, describe a single thing. The attributes that an Avro record supports are:

  • name

    This is the record's name, and it is required. It is meant to identify the thing that the record describes. For example: PersonInformation or Automobiles or Hats or BankDeposit.

    Note that record names must begin with [A-Za-z_], and subsequently contain only [A-Za-z0-9_].

  • namespace

    A namespace is an optional attribute that uniquely identifies the record. It is optional, but it should be used when there is a chance that the record's name will collide with another record's name. For example, suppose you have a record that describes an employee. However, you might have several different types of employees: full-time, part time, and contractors. So you might then create all three types of records with the name EmployeeInfo, but then with namespaces such as FullTime, PartTime and Contractor. The fully qualified name for the records used to describe full time employees would then be FullTime.EmployeeInfo.

    Alternatively, if your store contains information for many different organizations, you might want to use a namespace that identifies the organization used by the record so as to avoid collisions in the record names. In this case, you could end up with fully qualified records with names such as My.Company.Manufacturing.EmployeeInfo and My.Company.Sales.EmployeeInfo.

  • doc

    This optional attribute simply provides documentation about the record. It is parsed and stored with the schema, and is available from the Schema object using the Avro API, but it is not used during serialization.

  • aliases

    This optional attribute provides a JSON array of strings that are alternative names for the record. Note that there is no such thing as a rename operation for JSON schema. So if you want to refer to a schema by a name other than what you initially defined in the name attribute, use an alias.

  • type

    A required attribute that is either the keyword record, or an embedded JSON schema definition. If this attribute is for the top-level schema definition, record must be used.

  • fields

    A required attribute that provides a JSON array which lists all of the fields in the schema. Each field must provide a name and a type attribute. Each field may provide doc, order, aliases and default attributes:

    • The name, type, doc and aliases attributes are used in the exact same way as described earlier in this section.

      As is the case with record names, field names must begin with [A-Za-z_], and subsequently contain only [A-Za-z0-9_].

    • The order attribute is optional, and it is ignored by Oracle NoSQL Database. For applications (other than Oracle NoSQL Database) that honor it, this attribute describes how this field impacts sort ordering of this record. Valid values are ascending, descending, or ignore. For more information on how this works, see http://http://avro.apache.org/docs/current/spec.html#order.

    • The default attribute is optional, but highly recommended in order to support schema evolution. It provides a default value for the field that is used only for the purposes of schema evolution. Use of the default attribute does not mean that you can fail to initialize the field when creating a new value object; all fields must be initialized regardless of whether the default attribute is present.

      Schema evolution is described in Schema Evolution.

      Permitted values for the default attribute depend on the field's type. Default values for unions depend on the first field in the union. Default values for bytes and fixed fields are JSON strings.

Enum

Enums are enumerated types, and it supports the following attributes

  • name

    A required attribute that provides the name for the enum. This name must begin with [A-Za-z_], and subsequently contain only [A-Za-z0-9_].

  • namespace

    An optional attribute that qualifies the enum's name attribute.

  • aliases

    An optional attribute that provides a JSON array of alternative names for the enum.

  • doc

    An optional attribute that provides a comment string for the enum.

  • symbols

    A required attribute that provides the enum's symbols as an array of names. These symbols must begin with [A-Za-z_], and subsequently contain only [A-Za-z0-9_].

For example:

{ "type" : "enum",
  "name" : "Colors",
  "namespace" : "palette",
  "doc" : "Colors supported by the palette.",
  "symbols" : ["WHITE", "BLUE", "GREEN", "RED", "BLACK"]}

Arrays

Defines an array field. It only supports the items attribute, which is required. The items attribute identifies the type of the items in the array:

{"type" : "array", "items" : "string"}

Maps

A map is an associative array, or dictionary, that organizes data as key-value pairs. The key for an Avro map must be a string. Avro maps supports only one attribute: values. This attribute is required and it defines the type for the value portion of the map.

{"type" : "map", "values" : "int"}

Unions

A union is used to indicate that a field may have more than one type. They are represented as JSON arrays.

For example, suppose you had a field that could be either a string or null. Then the union is represented as:

["string", "null"]

You might use this in the following way:

{
     "type": "record",
     "namespace": "com.example",
     "name": "FullName",
     "fields": [
       { "name": "first", "type": ["string", "null"] },
       { "name": "last", "type": "string", "default" : "Doe" }
     ]
} 

Fixed

A fixed type is used to declare a fixed-sized field that can be used for storing binary data. It has two required attributes: the field's name, and the size in 1-byte quantities.

For example, to define a fixed field that is one kilobyte in size:

{"type" : "fixed" , "name" : "bdata", "size" : 1048576}