Schema evolution describes how the store behaves when an Avro schema is changed after data has been written to the store using an older version of that schema. To change an existing schema, you update the schema as stored in its flat-text file, then add the new schema to the store using the ddl add-schema command with the -evolve flag.
For example, if a middle name property is added to the FullName schema, it might be stored in a file named schema2.avsc and then added to the store using the ddl add-schema command.
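From the administration CLI, adding the updated schema file might look like this (the file name is illustrative; the -evolve flag tells the store that this is a new version of an existing schema rather than a brand-new one):

```
kv-> ddl add-schema -file schema2.avsc -evolve
```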
Note that when you change a schema by adding a field, the new field must be given a default value. This prevents errors when clients using an old version of the schema create new values that would otherwise be missing the new field:
{
  "type": "record",
  "namespace": "com.example",
  "name": "FullName",
  "fields": [
    { "name": "first", "type": "string" },
    { "name": "middle", "type": "string", "default": "" },
    { "name": "last", "type": "string" }
  ]
}
These are the modifications you can safely perform to your schema without any concerns:
A field with a default value is added.
A field that was previously defined with a default value is removed.
A field's doc attribute is changed, added, or removed.
A field's order attribute is changed, added, or removed.
A field's default value is added or changed.
Field or type aliases are added or removed.
A non-union type may be changed to a union that contains only the original type, or vice-versa.
Beyond these kinds of changes, there are unsafe changes. Some will cause the schema to be rejected when you attempt to add it to the store; others can be performed so long as you are careful about how you upgrade the clients that use the schema. These issues are identified when you try to modify (evolve) a schema that is currently enabled in the store. See Changing Schema for details.
There are a few rules you need to remember if you are modifying a schema that is already in use in your store:
For best results, always provide a default value for the fields in your schema. This makes it possible to delete fields later on if you decide it is necessary. If you do not provide a default value for a field, you cannot delete that field from your schema.
You cannot change a field's data type. If you have decided that a field should be some data type other than what it was originally created using, then add a whole new field to your schema that uses the appropriate data type.
When adding a field to your schema, you must provide a default value for the field.
You cannot rename an existing field. However, if you want to access the field by some name other than what it was originally created using, add and use aliases for the field.
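For example, suppose you later want clients to refer to a name field as fullname. Rather than renaming the field, you can declare the field under the new name and list the original name as an alias; during schema resolution, Avro matches the aliased field against values written with the old name. This is a sketch, and fullname is an illustrative name:

```json
{
  "type" : "record",
  "name" : "userInfo",
  "namespace" : "my.example",
  "fields" : [
    { "name" : "fullname",
      "aliases" : [ "name" ],
      "type" : "string",
      "default" : "" }
  ]
}
```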
When a schema is changed, multiple versions of the schema exist and are maintained by the store. The version of the schema used to serialize a value before it is written to the store is called the writer schema. The writer schema is specified by the application when creating a binding. It is associated with the value when calling the binding's AvroBinding.toValue() method to serialize the data, and it is associated internally with every stored value.
The reader schema is used to
deserialize a value after reading it from the store. Like
the writer schema, the reader schema is specified by the
client application when creating a binding. It is used to
deserialize the data when calling the binding's
AvroBinding.toObject()
method,
after reading a value from the store.
Schema evolution is the automatic transformation of values between versions of an Avro schema. This transformation is between the version of the schema that the client is using (its local copy) and the version that was used to write the value. When the local copy of the schema is not identical to the schema used to write the value (that is, when the reader schema is different from the writer schema), this data transformation is performed. When the reader schema matches the schema used to write the value, no transformation is necessary.
Schema evolution is applied only during deserialization. If the reader schema is different from the value's writer schema, then the value is automatically modified during deserialization to conform to the reader schema. To do this, default values are used.
There are two cases to consider when using schema evolution: when you add a field and when you delete a field. Schema evolution takes care of both scenarios, so long as you originally assigned default values to the fields that were deleted, and assigned default values to the fields that were added.
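The effect of these two cases can be illustrated with a small stand-alone sketch (plain Python modeling the resolution rule, not the Oracle NoSQL API): deserialization keeps only the fields that the reader schema declares, and fills any missing field from the reader schema's default.

```python
def resolve(value, reader_schema):
    """Illustrative model of Avro schema resolution during
    deserialization: the result contains exactly the reader
    schema's fields, taking each from the stored value when
    present and from the field's default otherwise."""
    return {
        f["name"]: value.get(f["name"], f["default"])
        for f in reader_schema["fields"]
    }

v1_schema = {"fields": [{"name": "name", "default": ""}]}
v2_schema = {"fields": [{"name": "name", "default": ""},
                        {"name": "age", "default": -1}]}

old_value = {"name": "Pat"}             # written with schema version 1
new_value = {"name": "Pat", "age": 38}  # written with schema version 2

# A v2 reader sees the missing age field filled with its default ...
print(resolve(old_value, v2_schema))    # {'name': 'Pat', 'age': -1}
# ... and a v1 reader sees the unexpected age field dropped.
print(resolve(new_value, v1_schema))    # {'name': 'Pat'}
```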
Suppose you had the following schema:
{
  "type" : "record",
  "name" : "userInfo",
  "namespace" : "my.example",
  "fields" : [
    { "name" : "name", "type" : "string", "default" : "" }
  ]
}
In version 2 of the schema, you add a field:
{
  "type" : "record",
  "name" : "userInfo",
  "namespace" : "my.example",
  "fields" : [
    { "name" : "name", "type" : "string", "default" : "" },
    { "name" : "age", "type" : "int", "default" : -1 }
  ]
}
In this scenario, a client that is using the new schema can deserialize a value that uses the old schema, even though the age field will be missing from the value. Upon deserialization, the value retrieved from the store will be automatically transformed such that the age field is contained in the value. The age field will be set to the default value, which is -1 in this case.
The reverse also works. A client that is using the old version of the schema can deserialize a value that was written using the new version of the schema. In this case, the value retrieved from the store contains the age field, which from the client's perspective is unexpected. So upon deserialization the age field is automatically removed from the retrieved object.
This has ramifications if you change your schema, and then have clients concurrently running that are using different schema versions. This scenario is not unusual in a large, distributed system of the type that Oracle NoSQL Database supports.
In this scenario, you might see fields revert to their default value, even though no client has explicitly touched those fields. This can happen in the following way:
Client v.2 creates a my.example.userInfo record, and sets the age field to 38. Then it writes that value to the store. Client v.2 is using schema version 2.
Client v.1 reads the record. It is using version 1 of the schema, so the age field is automatically removed from the value during deserialization.
Client v.1 modifies the name field and then writes the record back to the store. When it does this, the age field is missing from the value that it writes to the store.
Client v.2 reads the record again. Because the age field is missing from the record (because Client v.1 last wrote it), the age field is set to the default value, which is -1. This means that the value of the age field has reverted to the default, even though no client explicitly modified it.
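The sequence above can be traced with a simplified model of deserialization (plain Python standing in for the store's behavior, not the actual API); the store is modeled as a single shared dict:

```python
def resolve(value, fields):
    # Simplified Avro resolution: keep only the reader's fields,
    # filling missing ones from their defaults.
    return {name: value.get(name, default) for name, default in fields}

V1 = [("name", "")]                 # schema version 1
V2 = [("name", ""), ("age", -1)]    # schema version 2

# Client v.2 writes a record with age set to 38.
store = {"name": "Pat", "age": 38}

# Client v.1 reads it: age is stripped during deserialization.
record = resolve(store, V1)
assert "age" not in record

# Client v.1 changes the name and writes the record back;
# the stored value now has no age field at all.
record["name"] = "Patricia"
store = record

# Client v.2 reads again: age has reverted to the default, -1,
# even though no client explicitly modified it.
assert resolve(store, V2) == {"name": "Patricia", "age": -1}
```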
Field deletion works largely the same way as field addition, with the same concern for field values automatically reverting to the default. Suppose you had the following trivial schema:
{
  "type" : "record",
  "name" : "userInfo",
  "namespace" : "my.example",
  "fields" : [
    { "name" : "name", "type" : "string", "default" : "" },
    { "name" : "age", "type" : "int", "default" : -1 }
  ]
}
In version 2 of the schema, you delete the age field:
{
  "type" : "record",
  "name" : "userInfo",
  "namespace" : "my.example",
  "fields" : [
    { "name" : "name", "type" : "string", "default" : "" }
  ]
}
In this scenario, a client that is using the new schema can deserialize a value that uses the old schema, even though the age field is contained in that value. In this case, the age field is silently removed from the value during deserialization.
Further, a client that is using the old version of the schema can deserialize a value that uses the new version of the schema. In this case, the value retrieved from the store does not contain the age field. So upon deserialization, the age field is automatically inserted into the value (because the reader schema requires it), and the default value is used for the newly inserted field.
As with adding fields, this has ramifications if you change your schema, and then have clients concurrently running that are using different schema versions.
Client v.1 creates a my.example.userInfo record, and sets the age field to 38. Then it writes that value to the store. Client v.1 is using schema version 1.
Client v.2 reads the record. It is using version 2 of the schema, so it is not expecting the age field. As a result, the age field is automatically stripped from the value during deserialization.
Client v.2 modifies the name field and then writes the record back to the store. When it does this, the age field is missing from the value that it writes to the store.
Client v.1 reads the record again. Because the age field is missing from the record (because Client v.2 last wrote it), the age field is automatically inserted into the value, using the default of -1. This means that the value of the age field has reverted to the default, even though no client explicitly modified it.
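The deletion sequence can be traced the same way (again a simplified stand-in for the store's deserialization, not the actual API):

```python
def resolve(value, fields):
    # Simplified Avro resolution: keep only the reader's fields,
    # filling missing ones from their defaults.
    return {name: value.get(name, default) for name, default in fields}

V1 = [("name", ""), ("age", -1)]    # original schema
V2 = [("name", "")]                 # version 2: age deleted

# Client v.1 writes a record with age set to 38.
store = {"name": "Pat", "age": 38}

# Client v.2 reads and rewrites it; age is stripped on the way in
# and therefore absent from the value it writes back.
record = resolve(store, V2)
record["name"] = "Patricia"
store = record

# Client v.1 reads again: its reader schema requires age, so the
# field is reinserted with its default, -1.
assert resolve(store, V1) == {"name": "Patricia", "age": -1}
```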