Troubleshooting multi-region data store setup

  1. Find agent logs for a multi-region setup

    You can find the logs of an XRegion agent at the path specified in the JSON config file. Like the data store logs, the agent logs contain all diagnostic information from the service agent. To learn more about the JSON config file used by the XRegion agent, see Configure XRegion Service.

  2. Access the statistics of an XRegion agent

    The XRegion agent collects statistics periodically and posts them to a system table in the local region. You can query the system table for XRegion agent statistics by using the standard CLI command “show”, which returns a JSON string of agent statistics.
    show mrtable-agent-statistics
            [-agent <agentID>][-table <tableName>][-json]

    The show command with the mrtable-agent-statistics option shows the latest statistics for the XRegion agent, collected over the most recent one-minute interval. With no arguments, this command shows the combined statistics over all regions that the multi-region table spans. To limit the statistics to a particular agent, specify the agent ID. If you specify a table name, the statistics are limited to that multi-region table. For more details about using the show command to obtain statistics for a multi-region setup, see show mrtable-agent-statistics.

  3. Display the status of a multi-region table syncing up with remote regions

    The statistic lastModificationMs in the output of the show mrtable-agent-statistics command is the timestamp, in milliseconds, of the last operation performed in each remote region. By comparing this statistic at the local region with the time of the last write at the remote region, you can determine whether the local region has caught up with the remote region or is still lagging behind.

    For example, suppose the time of the last write made to a remote region is T1, and the statistic lastModificationMs reported at the local region for that remote region is T2. If T2 < T1, the multi-region table has caught up with that remote region for all writes up to T2 and will continue catching up on the writes made between T2 and T1. If T2 = T1, the multi-region table has caught up with all writes made at the remote region. T2 can never be greater than T1.
    # MR table agent statistics for a specific agent
    kv-> show mrtable-agent-statistics -agent 0 -json
    {
       "operation": "show mrtable-agent-statistics",
       "returnCode": 5000,
       "description": "Operation ends successfully",
       "returnValue": {
          "XRegionService-1_0": {
             "timestamp": 1592901180001,
             "statistics": {
                "agentId": "XRegionService-1_0",
                "beginMs": 1592901120001,
                "dels": 1024,
                "endMs": 1592901180001,
                "incompatibleRows": 100,
                "intervalMs": 60000,
                "localRegion": "slc1",
                "persistStreamBytes": 524288,
                "puts": 2048,
                "regionStat": {
                   "lnd": {
                      "completeWriteOps": 10,
                      "laggingMs": {
                         "avg": 512,
                         "max": 998,
                         "min": 31
                      },
                      "lastMessageMs": 1591594977587,
                      "lastModificationMs": 1591594941686,
                      "latencyMs": {
                         "avg": 20,
                         "max": 40,
                         "min": 10
                      }
                   },
                   "dub": {
                      "completeWriteOps": 20,
                      "laggingMs": {
                         "avg": 535,
                         "max": 1024,
                         "min": 45
                      },
                      "lastMessageMs": 1591594978254,
                      "lastModificationMs": 1591594956786,
                      "latencyMs": {
                         "avg": 30,
                         "max": 45,
                         "min": 15
                      }
                   }
                },
                "requests": 12,
                "responses": 12,
                "streamBytes": 1048576,
                "winDels": 1024,
                "winPuts": 2048
             }
          }
       }
    }
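
    The T1/T2 comparison described in this step can be scripted. The following is a minimal Python sketch, assuming the JSON output of show mrtable-agent-statistics has been captured as a string. The function name region_sync_report and the last_remote_write_ms argument (the T1 values, which you would have to obtain yourself, for example from the application writing to each remote region) are hypothetical and not part of the product.

```python
import json


def region_sync_report(show_output: str, last_remote_write_ms: dict) -> dict:
    """Compare lastModificationMs (T2) per remote region with the time of the
    last write known to have been made in that region (T1, caller-supplied)."""
    doc = json.loads(show_output)
    report = {}
    for agent_id, entry in doc["returnValue"].items():
        for region, stat in entry["statistics"]["regionStat"].items():
            t1 = last_remote_write_ms.get(region)
            if t1 is None:
                continue  # no reference timestamp supplied for this region
            t2 = stat["lastModificationMs"]
            report[(agent_id, region)] = {
                # T2 should never exceed T1; >= only guards against bad input.
                "caught_up": t2 >= t1,
                "behind_ms": max(t1 - t2, 0),
            }
    return report


# Trimmed-down sample using the lastModificationMs values from the output above.
sample_output = json.dumps({
    "returnValue": {
        "XRegionService-1_0": {
            "statistics": {
                "regionStat": {
                    "lnd": {"lastModificationMs": 1591594941686},
                    "dub": {"lastModificationMs": 1591594956786},
                }
            }
        }
    }
})

# Hypothetical T1 values: lnd has no newer writes, dub has one 1000 ms newer.
report = region_sync_report(
    sample_output, {"lnd": 1591594941686, "dub": 1591594957786}
)
for key, status in report.items():
    print(key, status)
```

    In this sketch, a region is reported as caught up only when T2 has reached the supplied T1; otherwise behind_ms gives the gap between the two timestamps.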

  4. Troubleshoot problems with the XRegion agent

    If the XRegion agent encounters a problem, for example a dropped network connection, investigate the cause of the failure and fix the connection. Meanwhile, the XRegion agent keeps trying to reconnect to the remote region until the remote region is up again. After reconnecting successfully, the agent resumes from the stream position or the last checkpoint made before the connection was dropped. While reconnecting, the agent may write warning messages to the log to alert users that the connection to a region, or to a shard in that region, is lost.
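
    The reconnect-and-resume behavior can be pictured with a short Python sketch. This is an illustration of the general pattern only, not the agent's implementation: the connect callable, the integer stream positions, and the retry limit are all assumptions made for the example (the text above describes an agent that keeps retrying until the region is back).

```python
import time


def stream_with_reconnect(connect, handle, checkpoint=0, max_attempts=5,
                          base_delay_s=0.0, log=print):
    """Consume a change stream, reconnecting and resuming after the checkpoint.

    connect(after) is a hypothetical callable that yields (position, change)
    pairs for every position greater than `after`.
    """
    attempts = 0
    while True:
        try:
            for position, change in connect(checkpoint):
                handle(change)
                checkpoint = position  # advance only once the change is handled
                attempts = 0
            return checkpoint  # stream ended normally
        except ConnectionError as exc:
            attempts += 1
            if attempts > max_attempts:
                raise
            log(f"WARNING: connection lost after position {checkpoint}: {exc}")
            time.sleep(base_delay_s * 2 ** (attempts - 1))  # exponential backoff


def make_flaky_source(changes, fail_at):
    """Build a connect() that drops the link once at each position in fail_at."""
    already_failed = set()

    def connect(after):
        for pos in range(after + 1, len(changes) + 1):
            if pos in fail_at and pos not in already_failed:
                already_failed.add(pos)
                raise ConnectionError("link dropped")
            yield pos, changes[pos - 1]

    return connect


seen, warnings = [], []
final_position = stream_with_reconnect(
    make_flaky_source(["a", "b", "c", "d"], fail_at={3}),
    seen.append,
    log=warnings.append,
)
print(seen, final_position, warnings)
```

    Because the checkpoint advances only after a change has been handled, the retry picks up at the first unprocessed position and no change is applied twice in this sketch.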

  5. Troubleshoot when the local region or remote region goes down

    The XRegion agent streams changes to the multi-region table from each remote region and persists them in the local region. Therefore, if the local region is down, the agent keeps retrying but cannot write any changes. After a period of time, when the buffer in the XRegion agent is full, the agent stops streaming data from the remote regions and the data flow freezes. When the local region is back, the XRegion agent resumes the stream and the workflow. No manual intervention with the XRegion agent is needed. However, you may have to fix the issue with the local region manually.
    [Illustration: mr_table_local_region_down_troubleshoot.eps]
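
    The buffer-full behavior is a form of backpressure. The Python sketch below models it under stated assumptions: the class BoundedRelay, its capacity, and the write_local callable are hypothetical names chosen for the example, and the real agent's buffer size and internals are not described in this section.

```python
from collections import deque


class BoundedRelay:
    """Hold changes from remote streams until they can be written locally."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()

    def receive(self, change):
        """Accept a change from a remote stream.

        Returns False when the buffer is full, meaning streaming should pause.
        """
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append(change)
        return True

    def drain(self, write_local):
        """Write buffered changes in order; stop at the first failed write."""
        written = 0
        while self.pending:
            try:
                write_local(self.pending[0])
            except ConnectionError:
                break  # local region still down; keep the change for a retry
            self.pending.popleft()
            written += 1
        return written


def local_down(change):
    raise ConnectionError("local region unavailable")


stored = []
relay = BoundedRelay(capacity=2)

accepted = [relay.receive(c) for c in ("c1", "c2", "c3")]  # third is refused
written_while_down = relay.drain(local_down)               # nothing is written
written_after_recovery = relay.drain(stored.append)        # buffer empties
accepted_after_recovery = relay.receive("c3")              # streaming resumes
print(accepted, written_while_down, written_after_recovery, stored)
```

    Nothing is lost while the local region is down in this model: refused changes stay at the remote side of the stream and are accepted once the buffer has drained.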

    If a particular remote region is down, the XRegion agent keeps retrying until that remote region is back. This issue is similar to any network connection problem with the XRegion agent. Until the connection to the remote region is re-established, the multi-region table at the local region cannot see the changes made in that remote region. Changes in the other remote regions are not affected as long as the XRegion agent maintains its connection to those regions.
    [Illustration: mr_table_remote_down_troubleshoot.eps]

  6. Handle schema evolution in a multi-region setup

    Schema evolution happens when there is a schema change in any of the remote regions, so that the schema of a multi-region table at the local region differs from that in the remote region. In such a situation, the XRegion agent tries to resolve the difference by converting each row from the remote region to the schema of the local region. For example, suppose you add a new column to a multi-region table at a remote region but have not yet added it in the local region. The multi-region table at the local region will not see the new column in the changes streamed from the remote region, but it will still see the other columns. This lasts until you add the same column in the local region, which ends the schema divergence.

    In a multi-region table, there is no automatic notification to other regions when the schema changes in one region. The XRegion agent of the local region detects the change when it sees data from a remote region with a higher table version, and it then refreshes its table metadata from the remote region to get the latest schema.
    [Illustration: schema_evolution_matched_metadata.eps]
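
    The row conversion and the version check can be sketched in a few lines of Python. This is a simplified model, assuming rows are plain dictionaries and table versions are integers; the function names and the sample columns are hypothetical and do not correspond to product APIs.

```python
def needs_metadata_refresh(cached_table_version: int, row_table_version: int) -> bool:
    """A row stamped with a higher table version signals a remote schema change."""
    return row_table_version > cached_table_version


def convert_to_local_schema(remote_row: dict, local_columns: list) -> dict:
    """Keep only the columns that the local schema defines."""
    return {col: remote_row[col] for col in local_columns if col in remote_row}


# The remote region has added "email"; the local region has not added it yet.
local_columns = ["id", "name"]
remote_row = {"id": 7, "name": "Ada", "email": "ada@example.com"}

local_view = convert_to_local_schema(remote_row, local_columns)
refresh = needs_metadata_refresh(cached_table_version=1, row_table_version=2)
print(local_view, refresh)
```

    In this model the local view keeps the shared columns and omits the column that only the remote schema defines, which mirrors the divergence described above until the column is added locally.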

    Consider the situation where the schemas in different regions diverge in a way that the agent cannot resolve by refreshing the local region's table metadata from the remote region. For example, if you add a new column “Foo” with type “STRING” in the remote region but add the same column with type “LONG” in the local region, the changes at the remote region are considered incompatible with the local region, and the agent cannot fix this difference. These changes from the remote region are not persisted locally; they are discarded and counted in the per-table statistic incompatibleRows. For details about the persistence of remote data, see the show mrtable-agent-statistics section.
    [Illustration: schema_evolution_mismatched_metadata.eps]
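
    The effect of an incompatible change on the incompatibleRows statistic can be modelled as follows. This Python sketch is a deliberate simplification: it checks each value against a hypothetical local schema expressed as Python types (LONG as int, STRING as str), whereas the real agent works from table metadata. The function name and the sample rows are assumptions made for the example.

```python
def apply_remote_rows(rows, local_schema):
    """Persist rows that fit the local column types; count the rest.

    local_schema maps column name -> Python type (hypothetical modelling).
    Returns (persisted_rows, incompatible_rows).
    """
    persisted, incompatible_rows = [], 0
    for row in rows:
        compatible = all(
            isinstance(value, local_schema[col])
            for col, value in row.items()
            if col in local_schema
        )
        if compatible:
            persisted.append(row)
        else:
            incompatible_rows += 1  # discarded; only the counter records it
    return persisted, incompatible_rows


local_schema = {"id": int, "Foo": int}  # "Foo" was added locally as LONG
remote_rows = [
    {"id": 1, "Foo": "bar"},  # "Foo" was added remotely as STRING
    {"id": 2, "Foo": 42},     # a row whose values fit the local types
]

persisted, incompatible_rows = apply_remote_rows(remote_rows, local_schema)
print(persisted, incompatible_rows)
```

    A steadily rising incompatibleRows value in the agent statistics is therefore a sign that the schemas in two regions have diverged in a way that needs a manual fix.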

  7. Handle difference in software versions between regions

    For any particular region, upgrade the data store first and then upgrade the agent to the same version. If a multi-region table runs different software versions in different regions, an agent with an older version may not correctly process the rows streamed from regions with a newer version, and the older agent may treat some data as incompatible. For example, if the local region is upgraded to support TTL (Time to Live) while the remote region has not yet been upgraded, the changes made in the remote region are persisted to the local region without any expiration information, which means the rows never expire. The same applies if the remote region has been upgraded to support TTL while the local region has not: all changes made with a TTL in the remote region lose their TTL when applied to the local region, so these rows never expire. If this is undesirable, upgrade all regions before writing data to the table, to ensure that every region can process the data correctly. A feature is fully available to a multi-region table only after all regions have been upgraded to the same version.