Changes in 18.1.13

The following changes were made in Oracle NoSQL Database Release 18.1.13 Enterprise Edition.

New Features

  1. Changed the behavior of the CLI command plan stop-service to ensure store health. After this change, if stopping the specified services would cause the store to fall into an unhealthy state, the plan fails and reports detailed health check information, such as:

    One of the groups is not healthy enough for the operation: [rg1] Only 1 primary nodes are running such that a simple majority cannot be formed which requires 2 primary nodes. The shard is vulnerable and will not be able to elect a new master. Nodes not running: [rg1-rn1]. Nodes to stop: {rg1=[rg1-rn2]} ... ...

    The service can be forcefully stopped by adding the -force flag.

    Note that there is one exception: since plan stop-service -all-rns will always result in an unhealthy store, the health check is skipped for such plans and the -force flag is not required. [#22425]

  2. Implemented the group-by clause and the seq_transform expression in SQL for Oracle NoSQL.

    Group-by is similar to its counterpart in standard SQL. However, in Oracle NoSQL, grouping is possible only if there is an index that sorts the rows by the group-by expressions.

    Together with group-by, the following aggregate functions were implemented: count(*), count(expr), sum(expr), avg(expr), min(expr), and max(expr).

    The seq_transform expression takes as input a "source" expression and a "mapper" expression. It evaluates the source expr, producing a sequence of zero or more items, and "transforms" this source sequence by evaluating the mapper expr on each item of the source sequence and concatenating the results of these evaluations. The current source item can be accessed by the mapper expr via the $ variable.

    Here is an example that demonstrates both group-by and seq_transform:

    Assume a "sales" table, whose rows look like this:

    {
      "id":1,
      "sale":
      {
        "acctno" : 349,
        "year" : 2000, "month" : 10, "day" : 23,
        "state" : "CA", "city" : "San Jose", "storeid" : 76,
        "prodcat" : "vegies",
    
        "items" : [ { "prod" : "tomatoes", "qty" : 3, "price" : 10.0 },
                    { "prod" : "carrots",  "qty" : 1, "price" :  5.0 },
                    { "prod" : "pepers",   "qty" : 1, "price" : 15.0 }
                  ]
      }
    }

    Assume there is the following index, named idx1 here, on the sales table:

    create index idx1 on sales (sale.acctno as integer,
                                sale.year as integer,
                                sale.prodcat as string)

    Then one can write the following query, which returns the total sales per acctno and year: [#26427]

    select t.sale.acctno,
           t.sale.year,
           sum(seq_transform(t.sale.items[], $.price * $.qty)) as sales
    from sales t
    group by t.sale.acctno, t.sale.year
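
    For reference, such a query can be executed from a Java application with the query API. The following is a minimal sketch; the store name, helper host/port, and output handling are illustrative assumptions.

      import oracle.kv.KVStore;
      import oracle.kv.KVStoreConfig;
      import oracle.kv.KVStoreFactory;
      import oracle.kv.StatementResult;
      import oracle.kv.table.RecordValue;

      public class SalesByAccountYear {
          public static void main(String[] args) {
              // Assumed store name and helper host:port.
              KVStore store = KVStoreFactory.getStore(
                  new KVStoreConfig("kvstore", "localhost:5000"));

              String query =
                  "select t.sale.acctno, t.sale.year, " +
                  "       sum(seq_transform(t.sale.items[], $.price * $.qty)) as sales " +
                  "from sales t " +
                  "group by t.sale.acctno, t.sale.year";

              // StatementResult is Iterable<RecordValue>; each record is one group.
              StatementResult result = store.executeSync(query);
              for (RecordValue record : result) {
                  System.out.println(record.toJsonString(false));
              }
              store.close();
          }
      }
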
  3. Implemented parent-child joins in SQL.

    More generally, this feature allows for joins among tables in the same table hierarchy. Syntactically, this is done by a new NESTED TABLES clause that may appear in the FROM clause of an SQL query. The NESTED TABLES clause specifies a target table and a number of ancestor and/or descendant tables of the target table. This is similar to using a MultiRowOptions object as a parameter to the tableIterator and tableKeysIterator methods of TableAPI. However, NESTED TABLES greatly extends the capabilities of these programmatic APIs and provides more standard semantics as well as better performance. Specifically, NESTED TABLES:

    • Is equivalent to a combination of LEFT OUTER JOIN and UNION operations, as defined by standard SQL.

    • Allows projection: the rest of the query can select any subset of the columns of the participating tables.

    • Allows predicates to be specified on the ancestor and descendant tables, using the ON clause from standard SQL.

    • Allows the target table to be accessed via a secondary index even when descendant tables are specified.

    As an example, consider an application that tracks a population of users and the emails sent or received by these users. Given that SQL for Oracle NoSQL does not currently support general purpose joins, the emails are stored in a table that is created as a child of users, so that queries can be written that combine information from both tables using the NESTED TABLES clause. The create table statements for the two tables are shown below.

    create table users(
        uid integer,
        name string,
        email_address string,
        salary integer,
        address json,
        primary key(uid))
    
    create table users.emails(
        eid integer,
        sender_address string,   // sender email address
        receiver_address string, // receiver email address
        time timestamp(3),
        size integer,
        content string,
        primary key(eid))

    Here are two queries that can be written over the users and emails tables. Given that users.emails is a child table of users, it has an additional column, not included in its create table definition. This implicit column is named uid and has type integer. Normally, its value will be a user id that appears in the users table, but this is not necessary.

    #
    # Count the number of emails sent in 2017 by all users whose salary is greater
    # than 200K
    #
    select count(eid)
    from NESTED TABLES(
           users
           descendants(users.emails ON email_address = sender_address and
                                       year(time) = 2017)
         )
    where salary > 200

    The above NESTED TABLES clause is equivalent to the following left outer join:

    users u left outer join users.emails e on u.uid = e.uid and
                                              email_address = sender_address and
                                              year(time) = 2017
    #
    # For each email whose size is greater than 100KB and was sent by a user in
    # the users table, return the name and address of that user.
    #
    select name, address
    from NESTED TABLES(users.emails ancestors(users))
    where size > 100 and sender_address = email_address

    The above NESTED TABLES clause is equivalent to the following left outer join: [#26670]

    users.emails e left outer join users u on u.uid = e.uid
  4. Introduced a new JSON output format for Admin CLI commands; running a runadmin CLI command with the "-json" flag will display output in the new JSON format. For compatibility, the previous Admin CLI JSON output is still supported; users can specify "-json-v1" to display the previous JSON v1 output format. [#25917]

  5. Introduced a new plan, plan verify-data, that verifies the primary tables and secondary indices for data integrity. Users can run this plan on Admins and/or RepNodes, and can choose to verify the checksum of data records, the B-tree of databases, or both.

    For example:

    plan verify-data -all-rns

    verifies both data record integrity and b-tree integrity of primary tables and secondary indices for all RepNodes.

    And:

    plan verify-data -verify-log disable -verify-btree enable -index disable -all-rns

    verifies the b-tree integrity of primary tables for all RepNodes. [#26284]

  6. Modified the Admin CLI command to support multiple helper hosts. Users can now use either -helper-hosts or -host/-port to connect to the master admin. The command can find the admin as long as it can contact any service at the given hosts/ports, so the given hosts/ports do not need to host an admin. The two flags -admin-host and -admin-port have been removed. Scripts that rely on them can simply remove these two flags, as long as the hosts/ports given for -helper-hosts or -host/-port can connect to an SN in the store. [#26633]

  7. Changed the release numbering convention to use the last two digits of the current year as the major version number, and a sequential number incremented for each release as the minor version number. This numbering scheme matches similar changes being made to other Oracle products, so the Oracle major and minor release numbers (KVVersion.getOracleMajor() and getOracleMinor()) are now the same as the regular release numbers (KVVersion.getMajor() and getMinor()). In addition, the release string no longer contains separate Oracle version numbers. Applications should use the KVVersion methods to access individual version number fields. If applications parse the version string, they will need to be updated to account for the removal of the Oracle version numbers. [#26756]
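
    As a minimal sketch of using those KVVersion methods instead of parsing the version string (the printed labels are illustrative only):

      import oracle.kv.KVVersion;

      public class ShowVersion {
          public static void main(String[] args) {
              KVVersion v = KVVersion.CURRENT_VERSION;
              // With the new numbering scheme, these pairs report the same values.
              System.out.println("major=" + v.getMajor() +
                                 " oracleMajor=" + v.getOracleMajor());
              System.out.println("minor=" + v.getMinor() +
                                 " oracleMinor=" + v.getOracleMinor());
              // Full release string (no separate Oracle version numbers anymore).
              System.out.println(v.toString());
          }
      }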

  8. Added two new attributes to the RepNodeMXBean that is available via JMX.

    • RepNodeMXBean.getOpMetric returns a JSON string containing a bundle of operation-related metrics.

    • RepNodeMXBean.getEnvMetric returns a JSON string containing a bundle of environment-related metrics.

    These JSON objects have been, and remain, available as the JMX notifications oracle.kv.repnode.opmetric and oracle.kv.repnode.envmetric. They are described in the Run Book. [#26760]
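
    As a sketch of reading the new attributes with a standard JMX client: the JMX service URL and the MBean ObjectName used below ("Oracle NoSQL Database:type=RepNode") are assumptions that may need to be adapted to the deployment, and the attribute names follow from the getter names above.

      import javax.management.MBeanServerConnection;
      import javax.management.ObjectName;
      import javax.management.remote.JMXConnector;
      import javax.management.remote.JMXConnectorFactory;
      import javax.management.remote.JMXServiceURL;

      public class RepNodeMetrics {
          public static void main(String[] args) throws Exception {
              // Assumed JMX service URL; adjust the host and registry port.
              JMXServiceURL url = new JMXServiceURL(
                  "service:jmx:rmi:///jndi/rmi://localhost:5000/jmxrmi");
              JMXConnector connector = JMXConnectorFactory.connect(url);
              try {
                  MBeanServerConnection mbs = connector.getMBeanServerConnection();
                  // Assumed ObjectName of the RepNode MBean.
                  ObjectName rn = new ObjectName("Oracle NoSQL Database:type=RepNode");
                  // Attribute names follow from getOpMetric()/getEnvMetric().
                  System.out.println(mbs.getAttribute(rn, "OpMetric"));
                  System.out.println(mbs.getAttribute(rn, "EnvMetric"));
              } finally {
                  connector.close();
              }
          }
      }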

  9. The Key Distribution Statistics Utility (described in Appendix G of the Admin Guide) has been changed so that, if enabled (via the rnStatisticsEnabled parameter), it is scheduled automatically when a RepNode is lightly loaded. As a result, the rnStatisticsLowActivePeriod and rnStatisticsRequestThreshold parameters have been rendered obsolete and are no longer supported. Admin CLI scripts that set these parameters should be updated to eliminate their use. [#26635]

  10. Added a new parameter, rnStatisticsTTL, to govern how long data collected by the Key Distribution Statistics Utility remains in system tables once statistics gathering is disabled (via the rnStatisticsEnabled parameter) or a table or index has been dropped. The default time-to-live (TTL) is 60 days. The time unit specified must be either days or hours. [#26796]

  11. Release version strings now identify which edition is being used. The result of calling KVVersion.CURRENT_VERSION.toString() will mention the Client when using the kvclient.jar file. For example:

    18.1.1 2018-01-24 09:06:12 UTC  Build id: 3eef91c0eaf6 Edition: Client

    If the kvstore.jar file is used, then the version string will identify the server edition of the release. [#24136]

  12. Implemented support for specifying the Admin directory, Admin directory size and RN log directory in the makebootconfig command:

    • -admindir <directory path>

      A path to the directory that will contain the environment associated with an Admin Node. In the absence of explicit directory arguments, the environment files are located under the KVROOT/KVSTORE/SN'ID'/Admin'Id'/ directory. This argument is optional in makebootconfig but recommended.

    • -admindirsize <directory size>

      The size of the Admin storage directory. This argument is optional in makebootconfig but recommended.

    • -rnlogdir <directory path>

      A path to the directory that will contain the log files associated with a Replication Node. For capacity values greater than one, multiple rnlogdir parameters must be specified in makebootconfig, one for each Replication Node that will be hosted on the Storage Node. If rnlogdir is not specified, by default the logs will be placed under the KVROOT/KVSTORE/log directory. This argument is optional in makebootconfig but recommended.

    If -rnlogdir is specified, then the je.info, je.config, and je.stat files for the specific RN will be stored in rnlogdir. In all cases, the je.info, je.stat, and je.config files for Admins will be stored under the kvroot log directory. [#26444]

  13. Full Text Search now uses an internal HTTP client which supports HTTPS connections to Elasticsearch.

    FTS no longer uses the Elasticsearch transport client. It now uses its own HTTP client built on Apache's httpasyncclient library. As a result, the port used when registering an Elasticsearch cluster is now the HTTP port rather than the transport port used in the earlier release.

    In the command shown below:

    plan register-es -clustername <es_cluster_name> -host <host_name>
                     -port <http_port> -secure true

    The port specified should be the HTTP port of the ES cluster. It was the transport port in the previous release.

    As can be seen in the above command, a new -secure flag has been added, whose default value is true. That is, FTS now runs in secure mode by default, which requires some additional certificate setup, as described in the FTS documentation.

    If an existing KVStore already has a registered Elasticsearch cluster, then, after the upgrade, the cluster needs to be registered again using the plan register-es command. The additional registration is needed because the registered port has changed from the transport port to the HTTP port. [#26059]

  14. FTS security is only available in Enterprise Edition.

    For the Basic and Community Editions, the -secure flag needs to be set to false explicitly: [#26781]

    plan register-es -clustername <es_cluster_name> -host <host_name>
                     -port <http_port> -secure false

Bug and Performance Fixes

  1. Fixed a bug that could cause temporary failures in the plan switchover Admin CLI command. The bug caused plan task UpdateRepNodeParams to fail. The failure could be identified by the plan status output, such as,

    Failures: Task 72 ERROR at 2018-01-01 01:01:01 UTC: UpdateRepNodeParams rg1-rn1: 72/UpdateRepNodeParams rg1-rn1 failed.: null,

    together with a message inside the admin log, such as,

    2018-01-01 01:01:01.001 UTC INFO [admin1] Couldn't update parameters for rg1-rn1 because unexpected exception: com.sleepycat.je.rep.ReplicaStateException: (JE 7.3.6) (JE 7.3.6) GroupService operation can only be performed at master. [#26776]

  2. Fixed a bug and improved read availability when a rep node reaches its disk limit. The bug caused a rep node to fail to restart after reaching the disk limit, with a message inside the rep node log such as:

    2018-01-01 01:01:01.001 UTC SEVERE [rg1-rn1] Process exiting com.sleepycat.je.DiskLimitException: (JE 7.6.3) Disk usage is not within je.maxDisk or je.freeDisk limits and write operations are prohibited: ... ...

    The fix also enables the rep node to keep serving read requests after reaching the disk limit. [#26701]

  3. Allow predicates inside path-filtering steps to be used as index-filtering predicates.

    For example, consider the following query:

    select id
    from Foo f
    where exists f.info.address.phones[$element.areacode > 408 and
                                       $element.kind = "home"]

    Assume we have an index on (info.address.phones.areacode, info.address.phones.kind).

    The areacode predicate will be used as a start/stop predicate for the index scan. The kind predicate can be used as an index-filtering predicate, but before this fix it wouldn't be used as such. If the kind predicate is not pushed to the index, we also lose the "covering-index" optimization for this query. [#26162]

  4. In the case of a covering secondary index, locks were being acquired on the table rows, which means that the primary index was being looked up for every such lock. Instead, in this case, only the qualifying index entries need to be locked. This fix can result in a significant performance improvement if all data accessed is found in main memory at the server. For example, for a simple prepared query, this fix showed a 20% improvement in end-to-end query latency. [#26451]

  5. Avoid data cloning in UPDATE query statements. In previous releases, new values (computed in SET, ADD, or PUT clauses) were cloned before they were used to update the target item(s). This was done because it is possible to create cycles among items, resulting in stack overflows when such a circular data structure is serialized. This fix avoids such cloning in the most common cases by detecting at compile time that a cycle is not possible.

  6. When a query uses a multi-key index, it is often the case that duplicate results must be eliminated. This is done based on primary key values. A bug was fixed that would cause an exception to be thrown if a primary key column is of type Number or Timestamp. [#26460]

  7. Fixed a bug that was preventing external variables from being used in update statements. [#26504]

  8. Fixed a bug that occurred when the SELECT clause contained a single expression with no AS clause. In this case, the value returned by this expression must be wrapped in a single-field record (because it may not be a record, and queries must always return records having the same record type). However, this wrapping was not always done. [#26767]

  9. Fixed a query bug showing up when querying a key-only table with an order-by query. A QueryStateException would be thrown in this case during compilation, because the order-by expressions were not being rewritten to access the index fields instead of the table columns. [#26632]

  10. Fixed a bug with a select-star query that (a) queries a key-only table and (b) uses a secondary index that indexes all the table columns. The bug would cause the index entries to be returned instead of the table rows. Although, in this case, the index entries contain the same information as the table rows, the ordering of the columns and/or their names may be different. As part of this fix, an index that indexes all the table columns will always be recognized as a covering index, whereas this was not the case before. [#26838]

  11. Fixed a bug showing up in queries like the following example:

    declare $ext string;
    select f.str
    from foo f
    where f.str >= $ext and f.str <= "ab"

    If the $ext variable is set to "ab", the query will return all rows whose str column starts with "ab" rather than being equal to "ab". Notice that if an external variable were not used, the bug would not show up, because the compiler would convert the WHERE condition to f.str = "ab". [#26671]
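
    For reference, a sketch of how the external variable in such a query is bound from a Java application; the store handle, bound value, and result handling are placeholders.

      import oracle.kv.KVStore;
      import oracle.kv.StatementResult;
      import oracle.kv.query.BoundStatement;
      import oracle.kv.query.PreparedStatement;
      import oracle.kv.table.RecordValue;

      public class ExternalVariableExample {
          // Assumes an already opened KVStore handle.
          static void run(KVStore store) {
              PreparedStatement ps = store.prepare(
                  "declare $ext string; " +
                  "select f.str from foo f " +
                  "where f.str >= $ext and f.str <= \"ab\"");
              BoundStatement bs = ps.createBoundStatement();
              bs.setVariable("$ext", "ab");
              // With the fix, only rows whose str column equals "ab" are returned.
              StatementResult result = store.executeSync(bs, null);
              for (RecordValue record : result) {
                  System.out.println(record.get("str"));
              }
          }
      }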

  12. Fixed an issue where the topology validate command returns "null" when the topology candidate that is being validated contains Storage Nodes that do not exist in the current store. [#26294]

  13. There is now a limit on the number of plans stored by the Admin. Plans with an ID 1000 less than the latest plan will be automatically removed from the Admin's persistent store. Only plans that are in a terminal state (SUCCEEDED or CANCELED) are removed. [#22963]

  14. A limitation has been placed on plan names that prevents the creation of new plans with a name starting with "SYS$". This prefix is reserved for plans that are created internally by the Admin to perform its own administrative operations. Existing plans with names starting with "SYS$" are not affected. [#26279]

  15. Fixed some issues that prevented requests from completing within the specified request timeout, in particular when checking for quorum and specifying JE transaction timeouts. Also modified request handling to prioritize ConsistencyException over RequestTimeoutException, since ConsistencyException provides more specific information about the cause of the request failure. [#22849]

  16. Fixed a table scan issue where the order of rows returned by the following method on the TableAPI interface could be incorrect when Direction.FORWARD or Direction.REVERSE was specified. Previously, correct ordering required that the primary key field(s) be declared in order at the beginning of the DDL statement that creates the table; that is no longer the case. This is the method with the problem:

    TableIterator<Row> tableIterator(...) [#26769]
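
    For reference, a minimal sketch of the kind of ordered scan affected by this fix; the table name, key, and iterator options are illustrative assumptions.

      import java.util.concurrent.TimeUnit;

      import oracle.kv.Direction;
      import oracle.kv.KVStore;
      import oracle.kv.table.PrimaryKey;
      import oracle.kv.table.Row;
      import oracle.kv.table.Table;
      import oracle.kv.table.TableAPI;
      import oracle.kv.table.TableIterator;
      import oracle.kv.table.TableIteratorOptions;

      public class OrderedScan {
          // Assumes an open KVStore handle and a table named "users".
          static void scan(KVStore store) {
              TableAPI tableAPI = store.getTableAPI();
              Table table = tableAPI.getTable("users");
              PrimaryKey key = table.createPrimaryKey(); // empty key: scan all rows

              // Ask for rows in descending primary-key order.
              TableIteratorOptions opts = new TableIteratorOptions(
                  Direction.REVERSE, null /* default consistency */,
                  0, TimeUnit.MILLISECONDS /* default timeout */);

              TableIterator<Row> iter = tableAPI.tableIterator(key, null, opts);
              try {
                  while (iter.hasNext()) {
                      System.out.println(iter.next().toJsonString(false));
                  }
              } finally {
                  iter.close();
              }
          }
      }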

  17. Made code changes and upgraded some external libraries to support Java 9. Note that these changes are still preliminary: full testing will move to the latest Java version in a future release. [#25278]

  18. Fixed an issue where a Storage Node configured with the "-noadmin" flag did not suppress starting a Bootstrap Admin. [#26840]

  19. Fixed an issue where executing an update statement in the SQL shell returned "Unknown statement". [#26357]

  20. Added support for parsing Timestamp strings with a zone offset in the format "+HH:MM"/"-HH:MM" or 'Z' (representing UTC) using the default pattern. [#25808]

  21. Fixed the output message of the "show pool -name" command when a nonexistent pool name is supplied. Previously, it displayed "Unknown Exception" along with a stack trace. [#26639]

  22. The output of the "ping" command is now directed to stdout when the command exit code is 0. Previously, all "ping" output was directed to stderr. [#26693]

  23. Added an optional -shard argument to the ping command to check the status of services specific to a particular shard. This change applies to both the admin and top-level versions of ping. ping -shard shardId reports the status of the SN, RNs, and Arbiter associated with the specified shard. In addition, the show topology output now includes the number of shards in the topology as numShard=X. [#25348]

  24. Fixed a problem which prevented information about storage directories specified with the -storagedir and -storagedirsize flags to the makebootconfig command from being included in the output of the generateconfig command. [#26353]

Storage Engine Changes (JE 18.1)

  1. Fixed a bug that could on occasion cause a NullPointerException after an RN transitions from Master to Replica with the representative stack trace below: [#26495]

       2017-08-08 06:59:15.498 UTC SEVERE [rg3-rn1] JE: Uncaught exception in feeder
       thread Thread[Feeder Output for rg1-rn1,5,main] nulljava.lang.NullPointerException
        at com.sleepycat.je.rep.vlsn.VLSNIndex.getLatestAllocatedVal(VLSNIndex.java:488)
        at com.sleepycat.je.rep.impl.node.Feeder$OutputThread.writeAvailableEntries(Feeder.java:1337)
        at com.sleepycat.je.rep.impl.node.Feeder$OutputThread.run(Feeder.java:1163)
  2. JE's per-Database (per-partition) disk utilization metadata is no longer maintained in order to better support very large numbers of partitions. Using a worst case example of 20K partitions per RN with records spread over 1,000 data files, previously the utilization metadata would occupy roughly 2GB of memory and over 100GB of disk. This metadata has been removed. Due to the metadata removal, the dataAdminBytes cache statistic has also been removed. [#26597]

  3. JE recovery time (RN and Admin startup time after a crash) has been reduced. To go along with this optimization, NoSQL DB has increased the default JE checkpoint interval to reduce writing and improve disk utilization. [#26179]

  4. Fixed a NullPointerException during deadlock detection on an RN or Admin, for example:

    java.lang.NullPointerException:
      at com.sleepycat.je.txn.LockManager$DeadlockChecker.hasCycleInternal(LockManager.java:1942)
    ...

    This was never reported for NoSQL DB, but if it did occur would have caused the RN or Admin to restart. It did not cause persistent corruption. [#26570]

  5. Fixed two bugs that sometimes caused internal operations to be included in JE operation stats (priSearchOps, etc). [#26694]

    • The Btree verifier was incorrectly contributing to operation stats.

    • Deletion operations sometimes incorrectly included search and position stats, depending on the timing.

  6. Fixed a very rare, timing related bug that could cause internal corruption. The bug has always been present in JE HA and has been seen only once in a stress test. [#26706]

  7. The default for JE EnvironmentConfig.TREE_COMPACT_MAX_KEY_LENGTH has been changed from 16 to 42 bytes. This reduces cache usage for Btree internal nodes by 5 to 25%, depending on the key size used (smaller key sizes give larger savings). [ANDC-203]

  8. Fixed a bug that could cause the following exception on a replica node, when using TTL in conjunction with record deletions:

      2018-02-10 12:00:00.006 UTC INFO [rg1-rn1] JE: Replay thread exiting
      with exception:Environment invalid because of previous exception: ...
      Replicated operation could not be applied. DEL_LN_TX/14
      vlsn=1,711,027,333 isReplicated="1" txn=-834602813 LOG_INCOMPLETE:
      Transaction logging is incomplete, replica is invalid. Environment is
      invalid and must be closed. Problem seen replaying entry DEL_LN_TX/14
      vlsn=1,711,027,333 isReplicated="1" txn=-834602813

    The problem can occur under the following conditions:

    • The TTL feature is used.

    • A replica is lagging or down (this is not uncommon).

    • Records with a TTL are also sometimes deleted explicitly at around the time the record expires.

    Prior to this bug fix, the problem could be corrected only by performing a network restore on the replica. [#26851]
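
    For reference, the write pattern involved combines a row stored with a TTL and a later explicit deletion near the expiration time, roughly as in the sketch below; the table and field names are illustrative assumptions.

      import oracle.kv.KVStore;
      import oracle.kv.table.PrimaryKey;
      import oracle.kv.table.Row;
      import oracle.kv.table.Table;
      import oracle.kv.table.TableAPI;
      import oracle.kv.table.TimeToLive;

      public class TtlAndDelete {
          // Assumes an open KVStore handle and a table "sessions" with an
          // integer primary key "id" and a string field "data".
          static void writeThenDelete(KVStore store) {
              TableAPI tableAPI = store.getTableAPI();
              Table table = tableAPI.getTable("sessions");

              Row row = table.createRow();
              row.put("id", 1);
              row.put("data", "example");
              row.setTTL(TimeToLive.ofDays(7)); // row expires after 7 days
              tableAPI.put(row, null, null);

              // An explicit delete at around the time the row expires, replayed
              // on a lagging replica, could hit the problem fixed here.
              PrimaryKey key = table.createPrimaryKey();
              key.put("id", 1);
              tableAPI.delete(key, null, null);
          }
      }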

Utility Changes

  1. The snapshot utility now, by default, creates a snapshot of the configurations as well as the service data. Introduced a new argument, -restore-from-snapshot, for the "start" command line to allow users to restore directly from a previous snapshot when starting up an SN. [#26119]

  2. Added a new CLI command to limit the type of client requests enabled for the whole store or for specific shards.

    plan enable-requests -request-type {ALL|READONLY|NONE}
         {-shard <shardId[,shardId]*> | -store}

    Three request types can be configured by this command: ALL means the store or shards can process both read and write requests; READONLY makes the store or shards respond only to read requests; NONE means no requests will be processed by the store or shards. [#25422]

  3. Release version information provided in the output of the version and ping commands, and also the show versions, ping, and verify configuration commands in the Admin CLI, now identifies the edition of the release being used. For version, the output identifies the edition of the JAR file used for the command. For show versions, the server information identifies the edition of the Admin service. For the other commands, the version information for each Storage Node identifies the edition of the JAR file being used to run that Storage Node. For example: [#24136]

      java -jar kvstore.jar version
      18.1.1 2018-01-24 09:06:12 UTC  Build id: 3eef91c0eaf6 Edition: Community

    And:

      java -jar dist/lib/kvstore.jar ping -host localhost -port 5000 \
           -security /tmp/kvroot/security/user.security
      Pinging components of store kvstore based upon topology sequence #14
      10 partitions and 1 storage nodes
      Time: 2018-01-24 21:35:59 UTC   Version: 18.1.1
      Shard Status: healthy:1 writable-degraded:0 read-only:0 offline:0 total:1
      Admin Status: healthy
      Zone [name=KVLite id=zn1 type=PRIMARY allowArbiters=false
          masterAffinity=false]   RN Status: online:1 offline:0
      Storage Node [sn1] on localhost:5000    Zone: [name=KVLite id=zn1
          type=PRIMARY allowArbiters=false masterAffinity=false]
          Status: RUNNING   Ver: 18.1.1 2018-01-24 06:34:44 UTC
          Build id: 3eef91c0eaf6 Edition: Community
    	Admin [admin1]		Status: RUNNING,MASTER
    	Rep Node [rg1-rn1]	Status: RUNNING,MASTER sequenceNumber:50
                haPort:5006
  4. Implemented a new feature, master affinity zones, for KVStore. This feature allows users to set master affinity or no-master affinity for a zone when deploying it. A zone with the master affinity property has higher priority for hosting master RNs. The feature also provides a command that allows users to change the master affinity of deployed zones. [#25157]

    The feature adds the new flags -master-affinity/-no-master-affinity to the existing deploy-zone command, and adds a new command, topology change-zone-master-affinity. The usage of the new command is as follows:

        Usage: topology change-zone-master-affinity -name <name>
        {-zn <id> | -znname <name>}
        {-master-affinity | -no-master-affinity}
        Modifies the topology to change the master affinity of the specified
        zone.