Combine Existing Non-Sharded Databases into a Federated Sharded Database

If you have several database installations in different locations that run the same application, and you want to include the data from all of them, for example to run data analytics queries, you can combine the independent databases into a sharded database to take advantage of Oracle Globally Distributed Database multi-shard queries.

Overview

About Federated Sharding

Learn what a federated sharding configuration is, why you need it, and how it works.

Federated sharding is a sharded database configuration where the shards consist of independent databases with similar schemas.

Creating a sharded database from independent databases reduces the need to import large volumes of data into a single location for data analytics.

Consider the following benefits to this approach.

  • Create a sharding environment using existing, geographically distributed databases; there is no need to provision new systems.
  • Run multi-shard queries; access data from many locations in a single query.

Oracle Globally Distributed Database, in a federated sharding configuration, treats each independent database as a shard, and as such can issue multi-shard queries on those shards.

You can create a federated sharding configuration with minor version mismatches between the shards. For example, one region could be on Oracle Database 21.1 and another could be on Oracle Database 21.3. All database shards and the shard catalog must be on Oracle Database 21c or later.

Federated Sharding Schema Requirements

You can convert existing databases running the same application into a federated sharding configuration, without modifying the database schemas or the application.

However, the databases must have the same schema, or schemas with only minor differences. For example, a table can have an extra column in one of the databases.

An application upgrade can trigger changes in the schema, such as a new table, a new column, a new check constraint, or a modified column data type. When part of an overall federated sharding configuration, Oracle Globally Distributed Database handles the schema differences caused by an application upgrade, as long as the overall schema structure stays the same.

Sharded and Duplicated Tables in a Federated Sharding Configuration

Tables that have different sets of data on each of the federated databases are equivalent to the sharded tables in a traditional sharded database. Tables with the same content on all of the federated databases are equivalent to the duplicated tables in a traditional sharded database.

When you create the federated sharding configuration, the system assumes that all of the tables are sharded, so you must explicitly mark the tables that must be considered duplicated by the multi-shard query coordinator.

Limitations to Federated Sharding

There are some limitations to creating a federated sharding configuration.

  • There is no concept of chunk in a federated sharding configuration, so the GDSCTL MOVE CHUNK command is not supported.
  • Application sharding key-based routing is not supported.
  • The existing databases, before being added to a federated sharding configuration, must be upgraded to Oracle Database 21c or later.
  • DDL statements and cross-shard DML (insert, update, and delete) issued from the shard catalog under ENABLE SHARD DDL are not supported in a federated sharding architecture.

Federated Sharding Security

The database users do not need to exist on all of the federated databases, but the schema owners should exist on all of the databases. The privileges and the passwords of these schema owners can be different. Only common privileges are imported for security.

Creating and Deploying a Federated Sharding Configuration

To deploy a federated sharding environment using existing databases, you define the database layout just as you would for user-defined sharding, using GDSCTL commands.

The following is a high level description of the process for creating and deploying a federated sharding configuration.

  1. Run the GDSCTL CREATE SHARDCATALOG command with the FOR_FEDERATED_DATABASE option to create the federated sharding configuration.

  2. Add shard directors to the configuration.

  3. Add a shardspace to the configuration. A shardspace is defined as an existing database and its replica.

  4. Add a shard by adding the existing database to the shardspace, then run DEPLOY.

  5. Run GDSCTL SYNC SCHEMA to compare the schemas in the federated sharding configuration and retrieve the common shared schemas. Use SYNC SCHEMA to inspect and apply the DDLs.

  6. Use SQL ALTER TABLE on the shard catalog to convert tables containing the same data across the federated shards to duplicated tables.

  7. Prepare the shards in the federated sharding configuration for multi-shard queries.
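
The steps above can be sketched as a single GDSCTL session. This is a hedged outline only; the connect strings, shard director, shardspace, and password names (cathost:1521/catpdb, sharddirector1, europe, and so on) are hypothetical placeholders, and your environment may require additional options.

    gdsctl> create shardcatalog -database cathost:1521/catpdb -user gsmadmin/gsmadmin_pw -sdb fedsdb -for_federated_database
    gdsctl> add gsm -gsm sharddirector1 -catalog cathost:1521/catpdb -pwd gsmcatuser_pw
    gdsctl> start gsm -gsm sharddirector1
    gdsctl> add shardspace -shardspace europe
    gdsctl> add shard -shardspace europe -connect dbhost1:1521/eupdb -pwd gsmuser_pw
    gdsctl> deploy
    gdsctl> sync schema -schema all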

The following topics describe the federated sharding-specific tasks in detail.

Create the Federated Sharding Configuration

The GDSCTL command CREATE SHARDCATALOG creates the federated sharding configuration.

Its usage is similar to creating the shard catalog in user-defined sharding, except that you specify the FOR_FEDERATED_DATABASE option instead of a sharding method in the SHARDING parameter. The FOR_FEDERATED_DATABASE and SHARDING options are mutually exclusive.

CREATE SHARDCATALOG -DATABASE connect_identifier
  [-USER username[/password]]
  [-REGION region_name_list] 
  [-CONFIGNAME config_name]
  [-AUTOVNCR ON/OFF]
  [-FORCE] 
  [-SDB sdb_name]
  [-SHARDSPACE shardspace_name_list]
  -FOR_FEDERATED_DATABASE

The CREATE SHARDCATALOG syntax statement above shows which parameters are supported. Parameters not shown, such as -AGENT_PASSWORD, -REPFACTOR, and the Oracle Data Guard protection mode -PROTECTMODE, are not supported with the FOR_FEDERATED_DATABASE option.

Note:

Only Oracle Data Guard replication is supported for federated sharding configurations. Oracle Globally Distributed Database doesn't handle the creation and management of the Data Guard configuration, but you can use Data Guard parameters with the ADD SHARD command so that you can add the primary and standbys to see the status in GDSCTL.

See Also:

The GDSCTL create shardcatalog topic in Oracle Database Global Data Services Concepts and Administration Guide for usage notes and command options.

Retrieve, Inspect, and Apply the DDLs

Run the GDSCTL SYNC SCHEMA command in phases to create the schema objects common to the existing databases in the shard catalog.

The GDSCTL SYNC SCHEMA syntax shown here illustrates the three phases of the operation.

sync[hronize] schema
  [-schema [schemalist | all] [-retrieve_only] [-restart [-force]]
 | -apply [-skip_first]
 | -show [[-ddl ddlnum] [-count n] | [-failed_only]]]

SYNC SCHEMA should be run in phases, as described here.

  1. Retrieve Phase

    Run SYNC SCHEMA with the -retrieve_only option to inspect and verify the DDLs before they are run on the shard catalog.

    sync schema -schema schemalist -retrieve_only

    When SYNC SCHEMA is run without -retrieve_only, the DDL is retrieved and applied at the same time.

  2. Inspection Phase

    You can examine the DDL statements and their processing status with the -show option. The -ddl ddlnum option shows the specified DDL, and the -count n option specifies the maximum number of entries to show.

    sync schema -show -ddl ddlnum -count n

    Or you can use the -failed_only option to examine only the errored out statements.

    sync schema -show -failed_only
  3. Apply Phase

    In the final phase, you run the DDLs on the shard catalog to create the schemas and their objects.

    sync schema -apply

    If you get an error in the apply phase, there are a couple of ways to work around it:

    • If you can fix the cause of the error, fix it and then retry SYNC SCHEMA -apply, which retries the failed DDL.

    • If the DDL cannot be fixed or is not required, run SYNC SCHEMA -apply -skip_first, which skips the failed DDL and resumes the apply phase from the next statement.

    For security reasons, Oracle Globally Distributed Database doesn't offer a way to edit the DDLs.

  4. Import Incremental Changes

    If the schemas change at a later point, you can run the previous phases again to import the incremental changes; for example, when new objects are added, or a new column is added to a table (which generates an ALTER TABLE ADD statement).
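
    For example, if an application upgrade adds a column on all of the shards, a hedged sequence to pick up the change might look like the following, where app is a hypothetical schema name.

    gdsctl> sync schema -schema app -retrieve_only
    gdsctl> sync schema -show -count 10
    gdsctl> sync schema -apply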

See Also:

The sync schema (synchronize schema) topic in Oracle Database Global Data Services Concepts and Administration Guide for more SYNC SCHEMA usage notes and option details.

SYNC SCHEMA Operations for information about the tasks performed by SYNC SCHEMA

Convert Tables to Duplicated Tables

Use ALTER TABLE table_name externally duplicated to mark tables as duplicated in a federated sharding configuration.

Any table created by SYNC SCHEMA is considered by the multi-shard query layer to be an externally sharded table. If a table contains the same data on all of the shards, you can alter the table to externally duplicated, so that a multi-shard query retrieves its data from one shard only, even when the query has no filter predicate on ORA_SHARDSPACE_NAME.

ALTER TABLE table_name [externally duplicated | externally sharded]
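
For example, assuming a reference table named app.countries that holds identical rows on every federated database (both names are hypothetical), you could mark it duplicated on the shard catalog, or later revert it to sharded, as follows.

ALTER TABLE app.countries EXTERNALLY DUPLICATED;
ALTER TABLE app.countries EXTERNALLY SHARDED;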

Prepare the Shards For Multi-Shard Queries

Create all shard users and use the ORA_SHARDSPACE_NAME pseudo-column to perform queries on specific shards.

All Shard Users

Before running multi-shard queries from the shard catalog, you must create all shard users and grant them access to the sharded and duplicated tables. These users and their privileges should be created in the shard catalog under shard DDL enabled.
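
A minimal sketch of creating such an all shard user on the shard catalog follows; the user name report_user, its password, and the imported table app.customers are hypothetical.

ALTER SESSION ENABLE SHARD DDL;
CREATE USER report_user IDENTIFIED BY a_strong_password;
GRANT CREATE SESSION TO report_user;
GRANT SELECT ON app.customers TO report_user;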

Create Shardspace-Specific Queries

A shardspace in federated sharding is a set consisting of a primary shard and zero or more standby shards. To filter query results for a particular shardspace, a pseudo-column called ORA_SHARDSPACE_NAME is added to every externally sharded table. The value of this pseudo-column is the name of the shardspace the row resides in.

Depending on the value of MULTISHARD_QUERY_DATA_CONSISTENCY, the rows can be fetched from the primary or from any of the standbys in the shardspace. To run a multi-shard query on a given shard, you can filter the query with the predicate ORA_SHARDSPACE_NAME = shardspace_name_shard_belongs_to.

A query like SELECT CUST_NAME, CUST_ID FROM CUSTOMER, where the table CUSTOMER is marked as externally sharded, runs on all of the shards.

A query like SELECT CUST_NAME, CUST_ID FROM CUSTOMER WHERE ora_shardspace_name = 'EUROPE' runs only on the shards belonging to the shardspace EUROPE. Depending on the MULTISHARD_QUERY_DATA_CONSISTENCY parameter value, the query is run on either the primary shard of the shardspace EUROPE or on its standbys.

You can join sharded tables from different shardspaces. For example, to find the customers from shardspace Europe with orders in shardspace NA, write a query similar to the following.

SELECT order_id, customer_name FROM customers c, orders o
WHERE c.cust_id = o.cust_id AND c.ora_shardspace_name = 'Europe'
AND o.ora_shardspace_name = 'NA'

Querying an externally duplicated table, with or without the ORA_SHARDSPACE_NAME predicate, should go to only one of the shardspaces. The MULTISHARD_QUERY_DATA_CONSISTENCY parameter value determines whether to query a primary shard in the shardspace or its replicas.
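
As an illustration, the consistency level can be set at the session level before issuing multi-shard queries. The value shown is one possible setting; check the Oracle Database Reference entry for MULTISHARD_QUERY_DATA_CONSISTENCY for the full list of values and their semantics.

ALTER SESSION SET MULTISHARD_QUERY_DATA_CONSISTENCY = 'DELAYED_STANDBY_ALLOWED';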

Federated Sharding Reference

SYNC SCHEMA Operations

DDL Synchronization

DDL synchronization is an operation that SYNC SCHEMA runs just after the deployment of the shards in a federated sharding configuration.

The goal of this operation is to import the object definitions from all of the shards, compare the definitions across the shards, and generate DDLs for the objects that exist on all of the shards (common objects). Once the DDLs are run and the objects are created, you can reference these objects in multi-shard queries.

Import Users

A user or schema is a candidate for import by SYNC SCHEMA if it exists on all of the shards and owns importable schema objects.

You can narrow the list of users to be imported by passing a list of users in the -SCHEMA parameter. For example,

gdsctl> sync schema -schema scott
gdsctl> sync schema -schema scott,myschema

For case-sensitive schemas, use quoted identifiers.

gdsctl> sync schema -schema "O'Brien",scott

To include all non-Oracle schemas, use the value ALL in the SCHEMA parameter.

gdsctl> sync schema -schema all

Before importing the users, SYNC SCHEMA verifies that each discovered user exists on all of the shards, and that no user with the same name already exists on the shard catalog. The users are then created on the shard catalog as local users, and their accounts are locked. Because these are local users, they share only a name with the users on the shards, just as any two users with the same name in different databases are distinct. These users cannot log in and issue queries because they are not all shard users. To issue multi-shard queries, an all shard user must be created.
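
To confirm how an imported user landed on the shard catalog, you can query the data dictionary; SCOTT is a placeholder schema name, and the account status of imported users is expected to show as locked.

SELECT username, account_status FROM dba_users WHERE username = 'SCOTT';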

Note:

Only users local to a PDB are imported. Common CDB users are not imported.

Grant User Roles and Privileges

For the imported users, SYNC SCHEMA compares users' privileges.

SYNC SCHEMA grants only the privileges that are granted on all of the shards (common grants). For example, a user who has the DBA role on shard1, but does not have the DBA role on shard2, is not granted the DBA role in the shard catalog.

Import Object Definitions

The objects compared and imported by SYNC SCHEMA to the shard catalog are the objects that will be referenced in multi-shard queries or used by multi-shard query processing.

These objects are:

  • Tables
  • Views and Materialized Views (imported as tables)
  • Check Constraints
  • Object Types
  • Synonyms

Running SYNC SCHEMA does not import objects related to storage, or objects that have no impact on multi-shard query processing, such as tablespaces, indexes, indextypes, directories, or zone maps.

Schema Object Comparison

The objects, from one shard to another, can have different definitions. SYNC SCHEMA compares the different definitions and creates a common definition to enable multi-shard queries against imported objects.

SYNC SCHEMA detects the objects' differences at two levels: number of objects, and object definitions.

First, SYNC SCHEMA considers the number of objects. It is likely that, during an application upgrade, some objects are added to the schemas. Only objects that are on all of the shards will be imported into the shard catalog.

Second, the object definitions from one shard to another can have different attributes. For the objects that SYNC SCHEMA imports, the following differences are noted:

Differences in Tables

When comparing objects in a federated sharding configuration, some differences in tables have an impact on multi-shard queries and some do not.

Column Differences

Only column differences have an impact on multi-shard queries, so SYNC SCHEMA reconciles only these differences.

  • The number of columns can be different.

  • The data type of a given column can be different.

  • The default value of a given column can be different.

  • The expression of a virtual column can be different.

When a table has different numbers of columns on different shards, SYNC SCHEMA creates a table that contains the union of all of the columns. Taking the union, rather than the intersection, spares you from rewriting multi-shard queries after an incremental deploy, when a newly added shard has fewer columns than indicated in the shard catalog.

When a column has different data types on different shards, SYNC SCHEMA defines it with the highest (largest) data type.

When a column has different data types, and one of the columns is a user-defined object type, then that column is not imported into the shard catalog.

When a column has different default values, SYNC SCHEMA sets NULL as the default value.

Nested table columns are not imported into the shard catalog.

Example: a Customer table is defined on shard1 and shard2 as shown here.

On shard1:

Customer( Cust_id number, Name varchar(30),
    Address varchar(50), Zip_code number)

On shard2:

Customer( Cust_id varchar(20), Name varchar(30),
    Address varchar(50), Zip_code number,
    Country_code number)

Note that the column Cust_id is a number on shard1 and a varchar(20) on shard2. Also, note that Country_code exists on shard2 but does not exist on shard1.

The Customer table created by SYNC SCHEMA in the shard catalog has all of the columns, including Country_code, and the Cust_id type is varchar(20).

Customer( Cust_id varchar(20), Name varchar(30),
    Address varchar(50), Zip_code number,
    Country_code number)

SYNC SCHEMA keeps track of these differences between schemas in the shard catalog. A query issued on the catalog database that accesses these heterogeneous columns is rewritten to address the differences before it is sent to the shards. On the shard, if there is a data type mismatch, the data is CAST into the "superior" data type as created on the catalog. If the column is missing on the shard, the default value is returned as set on the catalog.
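
To illustrate, a query such as SELECT * FROM Customer issued on the catalog might effectively behave on shard1 (which lacks Country_code and stores Cust_id as a number) like the hedged sketch below; the actual internal rewrite is not exposed.

SELECT CAST(Cust_id AS varchar(20)) Cust_id,
       Name, Address, Zip_code,
       NULL Country_code
FROM   Customer;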

Partition Scheme Differences

Note that these differences have no impact on multi-shard queries, and are ignored.

  • The partitioning column can be different.
  • The partition type can be different.
  • The number of partitions can be different.

Storage Attribute Differences

Note that these differences have no impact on multi-shard queries, and are ignored.

  • The tablespaces on which the table is created can be different.
  • The encryption can be different.
  • The INMEMORY attribute can be different.

Differences in Views

Views on shards are created and handled as tables in the shard catalog. The same restrictions that apply to tables also apply to views.

Differences in Constraints

Only CHECK constraints are created in the shard catalog. The CHECK constraint condition should be same on all of the shards.

Differences in Object Types

Object types and type bodies are only created if they have the same definition on all of the shards.

Troubleshooting Federated Sharding

Solve common federated sharding issues with these troubleshooting tips.

ORA-03851: Operation not supported in federated database

ORA-03701: Invalid parameter combination: federated database and ...

Some of the operations and command options that apply to a traditional sharded database are not applicable to a federated database. This is because:

  • There is no concept of a chunk in a federated database. Any chunk-related operation is invalid, for example SPLIT CHUNK and MOVE CHUNK.

  • The Data Guard broker configuration is not set up or managed by the system in a federated database, because the existing shards may already have been set up with their own high availability configurations. Operations such as SET DATAGUARD_PROPERTY or MODIFY SHARDSPACE are not supported.

  • The CREATE SHARD command is not supported.

ORA-03885: Some primary shards are undeployed or unavailable

The SYNC SCHEMA operation requires that all primary shards be available. Check the output of the CONFIG SHARD command, and check the status of all primary shards. Fix any issues and retry the operations when the shards become available.

ORA-03871: Some DDL statements are not applied to the catalog

The SYNC SCHEMA operation cannot import object definitions from the shards when some statements from the previous issuance are still not applied on the shard catalog. Run SYNC SCHEMA with the -apply option to run these statements.

Handling Errors During Multi-Shard Queries

If a multi-shard query fails with an error due to a mismatch between the object definition on a shard and on the catalog, make sure that the shard catalog has the latest schema changes imported. Any time there are schema changes in the federated database, you must run SYNC SCHEMA to import the changes from the shards.

Note that subsequent runs of SYNC SCHEMA will not drop and recreate the object, but will generate ALTER statements to incorporate the definition changes. This ensures that if there are queries already running during the SYNC SCHEMA operation, they won't fail with invalid object errors.

Handling Errors During DDL Processing Phase

If DDL fails on the shard catalog, the status of each DDL can be examined with the SYNC SCHEMA -show option.

gdsctl> sync schema -show

Note: The SYNC SCHEMA -show command is different from the SHOW DDL command. SHOW DDL lists DDL statements run by an all shard user, which are first run on the catalog and then propagated to the shards, whereas SYNC SCHEMA -show lists DDL statements generated from the objects imported from the shards.

By default, SYNC SCHEMA -show lists a fixed number of the latest DDLs. The -count and -ddl options can be used to inspect a specific range of DDLs. For example,

gdsctl> sync schema -show -count 20
gdsctl> sync schema -show -count 20 -ddl 5

To check the complete DDL text and error message, if any, use the -ddl option.

gdsctl> sync schema -show -ddl 5

To list only the failed DDL statements, use the -failed_only option.

gdsctl> sync schema -show -failed_only

Based on the error message of the failed DDL, fix the cause of the error and perform the apply phase.

gdsctl> sync schema -apply

The SYNC SCHEMA command also has a -restart option that performs the complete operation from the beginning, as if it were run for the first time. This option drops all schemas imported during previous runs of SYNC SCHEMA, along with any related metadata. Be aware that this causes any running queries on these objects to fail.

gdsctl> sync schema -restart