Details: Cache Fusion Hardening

This page provides more information about the new cache fusion hardening features.

In RAC, each cache fusion lock has a set of client instances and a single master instance. At both the client and master instances, LMS is the main process that handles cache fusion requests, and most cache fusion protocol traffic passes through LMS processes. There is significant communication between the client and master instances, as well as some communication between any two instances through remote block transfers. All of this communication carries lock state information, which is eventually carried back to the master or client instances. Most errors encountered by LMS are caused by discrepancies among instances, such as a missing or stale last disk write SCN, or missing or delayed inter-instance messages. This feature introduces a way to detect such discrepancies and recover from the resulting inconsistencies, instead of crashing LMS, a critical process whose failure results in instance termination. Similar lock state reconciliation can be applied to other fatal processes involved in the cache fusion protocol.
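The difference between failing on a discrepancy and reconciling it can be illustrated with a toy model. This is a minimal sketch, not Oracle's implementation: `LockState`, `check_strict`, `reconcile`, the mode labels, and the reconciliation rules (the master's lock mode is authoritative, and the higher last disk write SCN wins) are all hypothetical, chosen only to contrast a fatal consistency check with one that repairs a stale SCN.

```python
from dataclasses import dataclass


@dataclass
class LockState:
    """One instance's view of a cache fusion lock (hypothetical, simplified)."""
    mode: str            # e.g. "NULL", "SHARED", "EXCLUSIVE" (illustrative labels)
    last_write_scn: int  # last disk write SCN as known to this instance


class LockStateMismatch(Exception):
    """Stands in for the fatal error that would otherwise bring down LMS."""


def check_strict(master: LockState, client: LockState) -> None:
    # Unhardened behavior (modelled): any discrepancy is treated as fatal.
    if master != client:
        raise LockStateMismatch(f"master={master} client={client}")


def reconcile(master: LockState, client: LockState) -> LockState:
    # Hardened behavior (modelled): agree on one state instead of failing.
    # Assumed rules: the master's lock mode is authoritative, and since SCNs
    # only move forward, the higher last disk write SCN is taken as current.
    agreed = LockState(master.mode, max(master.last_write_scn, client.last_write_scn))
    for view in (master, client):
        view.mode = agreed.mode
        view.last_write_scn = agreed.last_write_scn
    return agreed


if __name__ == "__main__":
    master = LockState("EXCLUSIVE", 1000)
    client = LockState("EXCLUSIVE", 1042)  # the master's SCN is stale
    try:
        check_strict(master, client)
    except LockStateMismatch:
        print("strict check failed")        # in the unhardened model, LMS would crash here
    print(reconcile(master, client))        # LockState(mode='EXCLUSIVE', last_write_scn=1042)
    check_strict(master, client)            # passes: both views now agree
```

The point of the sketch is only the shape of the change: the same discrepancy that makes `check_strict` raise is instead resolved into one agreed state that both sides adopt.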

Discrepancies in lock states among instances can also cause cache fusion requests to hang. Consider cache fusion write requests: the master may believe it has already granted write permission to the instance it picked as the writer when a write request arrives from that same instance. The master may then ignore that request, assuming it crossed paths with the write permission message the master sent in response to a write request from another instance. If the writer instance never receives that first write permission, the cache fusion write can hang, which may in turn cause a checkpointing hang. A mechanism monitors long-pending requests to detect this kind of cache fusion hang and triggers the protocols that resolve the inconsistencies among instances.
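The lost-permission hang and its detection can be replayed in a small simulation. This is a hypothetical model, not the actual RAC protocol: `ToyMaster`, `PendingRequestMonitor`, the 5-second timeout, keying requests by instance name, and the recovery step (re-sending the write permission) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


class ToyMaster:
    """Hypothetical master for one lock; remembers which instance it picked as writer."""

    def __init__(self) -> None:
        self.chosen_writer: Optional[str] = None
        self.sent: List[str] = []  # write permissions the master has sent

    def pick_writer(self, instance: str) -> None:
        # Prompted by another instance's write request: choose `instance` as the
        # writer and send it write permission (delivery is modelled by the caller).
        self.chosen_writer = instance
        self.sent.append(instance)

    def on_write_request(self, instance: str) -> bool:
        # If `instance` is already the chosen writer, assume its request crossed
        # paths with the permission already sent and ignore it (return False).
        return instance != self.chosen_writer

    def resend_permission(self, instance: str) -> None:
        # Assumed recovery step: re-send write permission to the chosen writer.
        if instance == self.chosen_writer:
            self.sent.append(instance)


@dataclass
class PendingRequestMonitor:
    """Flags requests pending longer than `timeout` and passes them to `on_hang`."""
    timeout: float
    on_hang: Callable[[str], None]
    pending: Dict[str, float] = field(default_factory=dict)  # request id -> submit time

    def submit(self, request_id: str, now: float) -> None:
        self.pending[request_id] = now

    def complete(self, request_id: str) -> None:
        self.pending.pop(request_id, None)

    def scan(self, now: float) -> List[str]:
        # Snapshot overdue requests first, so `on_hang` may safely call complete().
        hung = [r for r, t in self.pending.items() if now - t > self.timeout]
        for request_id in hung:
            self.on_hang(request_id)
        return hung


if __name__ == "__main__":
    master = ToyMaster()
    received: List[str] = []  # permissions that actually reached an instance

    def recover(instance: str) -> None:
        master.resend_permission(instance)
        received.append(instance)  # assume the re-sent permission is delivered
        monitor.complete(instance)

    monitor = PendingRequestMonitor(timeout=5.0, on_hang=recover)
    master.pick_writer("A")               # the permission to A is lost: `received` stays empty
    monitor.submit("A", now=0.0)          # A now waits on its own write request
    print(master.on_write_request("A"))   # False: ignored as a crossed message
    print(monitor.scan(now=3.0))          # []: not overdue yet
    print(monitor.scan(now=6.0))          # ['A']: overdue, recovery triggered
    print(received)                       # ['A']
```

Passing the current time in as `now`, rather than reading a real clock, keeps the monitor deterministic and lets the example run instantly; a real monitor would presumably scan on a periodic timer.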