Oracle® Clusterware Administration and Deployment Guide
12c Release 1 (12.1)

E17886-14

I Troubleshooting Oracle Clusterware

This appendix introduces monitoring the Oracle Clusterware environment and explains how you can enable dynamic debugging to troubleshoot Oracle Clusterware processing, and enable debugging and tracing for specific components and specific Oracle Clusterware resources to focus your troubleshooting efforts.

This appendix includes the following topics:

  • Monitoring Oracle Clusterware

  • Oracle Clusterware Diagnostic and Alert Log Data

  • Diagnostics Collection Script

  • Rolling Upgrade and Driver Installation Issues

  • Testing Zone Delegation

  • Oracle Clusterware Alerts

Monitoring Oracle Clusterware

You can use various tools to monitor Oracle Clusterware. While Oracle recommends that you use Oracle Enterprise Manager to monitor the everyday operations of Oracle Clusterware, Cluster Health Monitor (CHM) monitors the complete technology stack, including the operating system, for the purpose of ensuring smooth cluster operations. Both tools are enabled, by default, for any Oracle cluster, and Oracle strongly recommends that you use them.

This section includes the following topics:

  • Oracle Enterprise Manager

  • Cluster Health Monitor

Oracle Enterprise Manager

You can use Oracle Enterprise Manager to monitor the Oracle Clusterware environment. When you log in to Oracle Enterprise Manager using a client browser, the Cluster Database Home page appears, where you can monitor the status of the Oracle Clusterware environment. Monitoring can include such things as:

  • Notification if there are any VIP relocations

  • Status of the Oracle Clusterware on each node of the cluster using information obtained through the Cluster Verification Utility (cluvfy)

  • Notification if node applications (nodeapps) start or stop

  • Notification of issues in the Oracle Clusterware alert log for the Oracle Cluster Registry, voting file issues (if any), and node evictions

The Cluster Database Home page is similar to a single-instance Database Home page. However, on the Cluster Database Home page, Oracle Enterprise Manager displays the system state and availability. This includes a summary about alert messages and job activity, and links to all the database and Oracle Automatic Storage Management (Oracle ASM) instances. For example, you can track problems with services on the cluster including when a service is not running on all of the preferred instances or when a service response time threshold is not being met.

You can use the Oracle Enterprise Manager Interconnects page to monitor the Oracle Clusterware environment. The Interconnects page shows the public and private interfaces on the cluster, the overall throughput on the private interconnect, individual throughput on each of the network interfaces, error rates (if any) and the load contributed by database instances on the interconnect, including:

  • Overall throughput across the private interconnect

  • Notification if a database instance is using the public interface due to misconfiguration

  • Throughput and errors (if any) on the interconnect

  • Throughput contributed by individual instances on the interconnect

All of this information is also available as collections that have a historic view, which is useful when analyzing cluster cache coherency, such as when diagnosing problems related to cluster wait events. You can access the Interconnects page by clicking the Interconnect tab on the Cluster Database home page.

Also, the Oracle Enterprise Manager Cluster Database Performance page provides a quick glimpse of the performance statistics for a database. Statistics are rolled up across all the instances in the cluster database in charts. Using the links next to the charts, you can get more specific information and perform any of the following tasks:

  • Identify the causes of performance issues.

  • Decide whether resources must be added or redistributed.

  • Tune your SQL plan and schema for better optimization.

  • Resolve performance issues.

The charts on the Cluster Database Performance page include the following:

  • Chart for Cluster Host Load Average: The Cluster Host Load Average chart in the Cluster Database Performance page shows potential problems that are outside the database. The chart shows maximum, average, and minimum load values for available nodes in the cluster for the previous hour.

  • Chart for Global Cache Block Access Latency: Each cluster database instance has its own buffer cache in its System Global Area (SGA). Using Cache Fusion, Oracle RAC environments logically combine each instance's buffer cache to enable the database instances to process data as if the data resided on a logically combined, single cache.

  • Chart for Average Active Sessions: The Average Active Sessions chart in the Cluster Database Performance page shows potential problems inside the database. Categories, called wait classes, show how much of the database is using a resource, such as CPU or disk I/O. Comparing CPU time to wait time helps to determine how much of the response time is consumed with useful work rather than waiting for resources that are potentially held by other processes.

  • Chart for Database Throughput: The Database Throughput charts summarize any resource contention that appears in the Average Active Sessions chart, and also show how much work the database is performing on behalf of the users or applications. The Per Second view shows the number of transactions compared to the number of logons, and the amount of physical reads compared to the redo size for each second. The Per Transaction view shows the amount of physical reads compared to the redo size for each transaction. Logons is the number of users that are logged on to the database.

In addition, the Top Activity drilldown menu on the Cluster Database Performance page enables you to see the activity by wait events, services, and instances. You can also see details about SQL and sessions at a prior point in time by moving the slider on the chart.

Cluster Health Monitor

The Cluster Health Monitor (CHM) detects and analyzes operating system and cluster resource-related degradation and failures. CHM stores real-time operating system metrics in the Oracle Grid Infrastructure Management Repository that you can use for later triage with the help of My Oracle Support should you have cluster issues.

This section includes the following CHM topics:

  • CHM Services

  • Collecting CHM Data

  • OCLUMON Command Reference

CHM Services

CHM consists of the following services:

System Monitor Service

There is one system monitor service on every node. The system monitor service (osysmond) is a real-time monitoring and operating system metric collection service that sends the data to the cluster logger service. The cluster logger service receives the information from all the nodes and persists it in the Oracle Grid Infrastructure Management Repository database.
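
As an informal check (not an Oracle-documented procedure) that the system monitor service is running on a node, you can look for the osysmond.bin process, whose name also appears in the sample node view later in this appendix:

$ ps -ef | grep osysmond.bin | grep -v grep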

Cluster Logger Service

There is one cluster logger service (OLOGGERD) for every 32 nodes in a cluster. Another OLOGGERD is spawned for every additional 32 nodes (which can be a sum of Hub and Leaf Nodes). If the cluster logger service fails (because the service is not able to come up after a fixed number of retries or because the node where it was running is down), then Oracle Clusterware starts OLOGGERD on a different node. The cluster logger service manages the operating system metric database in the Oracle Grid Infrastructure Management Repository.
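
To see which nodes are currently hosting the cluster logger service, you can use the oclumon manage command documented later in this appendix; for example (the node name in the sample output is illustrative):

$ oclumon manage -get master
Master = node1

$ oclumon manage -get mylogger -details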

Oracle Grid Infrastructure Management Repository

The Oracle Grid Infrastructure Management Repository:

  • Is an Oracle database that stores real-time operating system metrics collected by CHM. You configure the Oracle Grid Infrastructure Management Repository during an installation of or upgrade to Oracle Clusterware 12c on a cluster.

    Note:

    If you are upgrading Oracle Clusterware to Oracle Clusterware 12c and Oracle Cluster Registry (OCR) and the voting file are stored on raw or block devices, then you must move them to Oracle ASM or a shared file system before you upgrade your software.
  • Runs on one node in the cluster (this must be a Hub Node in an Oracle Flex Cluster configuration), and must support failover to another node in case of node or storage failure.

    You can locate the Oracle Grid Infrastructure Management Repository on the same node as the OLOGGERD to improve performance and decrease private network traffic.

  • Communicates with any cluster clients (such as OLOGGERD and OCLUMON) through the private network. Oracle Grid Infrastructure Management Repository communicates with external clients over the public network only.

  • Data files are located in the same disk group as the OCR and voting file.

    If OCR is stored in an Oracle ASM disk group called +MYDG, then configuration scripts will use the same disk group to store the Oracle Grid Infrastructure Management Repository.

    Oracle increased the Oracle Clusterware shared storage requirement to accommodate the Oracle Grid Infrastructure Management Repository, which can be a network file system (NFS), cluster file system, or an Oracle ASM disk group.

  • Size and retention are managed with OCLUMON, as shown in the sketch following this list.
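
For example, a minimal sketch of inspecting and resizing the repository with OCLUMON, using only the keywords and options documented in "oclumon manage" later in this appendix (the 6000 MB value is illustrative, and shrinking the repository deletes previously collected data):

$ oclumon manage -get repsize reppath
$ oclumon manage -repos changerepossize 6000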

Collecting CHM Data

You can collect CHM data from any node in the cluster by running the Grid_home/bin/diagcollection.pl script on the node.

Notes:

  • Oracle recommends that, when you run the diagcollection.pl script to collect CHM data, you run the script on all nodes in the cluster to ensure that all of the information needed for analysis is gathered.

  • You must run this script as a privileged user.

See Also:

"Diagnostics Collection Script" for more information about the diagcollection.pl script

To run the data collection script on only the node where the cluster logger service is running:

  1. Run the following command to identify the node running the cluster logger service:

    $ Grid_home/bin/oclumon manage -get master
    
  2. Run the following command from a writable directory outside the Grid home as a privileged user on the cluster logger service node to collect all the available data in the Oracle Grid Infrastructure Management Repository:

    # Grid_home/bin/diagcollection.pl --collect
    

    On Windows, run the following commands:

    C:\Grid_home\perl\bin\perl.exe
    C:\Grid_home\bin\diagcollection.pl --collect
    

    The diagcollection.pl script creates a file called chmosData_host_name_time_stamp.tar.gz, similar to the following:

    chmosData_stact29_20121006_2321.tar.gz
    

To limit the amount of data you want collected, enter the following command on a single line:

# Grid_home/bin/diagcollection.pl --collect --chmos
   --incidenttime time --incidentduration duration

In the preceding command, the format for the --incidenttime argument is MM/DD/YYYYHH24:MM:SS and the format for the --incidentduration argument is HH:MM. For example:

# Grid_home/bin/diagcollection.pl --collect --crshome Grid_home
   --chmos --incidenttime 07/21/2013 01:00:00 --incidentduration 00:30

OCLUMON Command Reference

The OCLUMON command-line tool is included with CHM and you can use it to query the CHM repository to display node-specific metrics for a specified time period. You can also use OCLUMON to perform miscellaneous administrative tasks, such as changing the debug levels, querying the version of CHM, and changing the metrics database size.

This section details the following OCLUMON commands:

  • oclumon debug

  • oclumon dumpnodeview

  • oclumon manage

  • oclumon version

oclumon debug

Use the oclumon debug command to set the log level for the CHM services.

Syntax

oclumon debug [log daemon module:log_level] [version]

Parameters

Table I-1 oclumon debug Command Parameters

Parameter Description
log daemon module:log_level

Use this option to change the log level of daemons and daemon modules. Supported daemons are:


osysmond
ologgerd
client
all

Supported daemon modules are:


osysmond: CRFMOND, CRFM, and allcomp
ologgerd: CRFLOGD, CRFLDREP, CRFM, and allcomp
client: OCLUMON, CRFM, and allcomp
all: allcomp

Supported log_level values are 0, 1, 2, and 3.

version

Use this option to display the versions of the daemons.


Example

The following example sets the log level of the system monitor service (osysmond):

$ oclumon debug log osysmond CRFMOND:3
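
The version option shown in the syntax above reports the daemon versions; for example:

$ oclumon debug version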

oclumon dumpnodeview

Use the oclumon dumpnodeview command to view log information from the system monitor service in the form of a node view.

A node view is a collection of all metrics collected by CHM for a node at a point in time. CHM attempts to collect metrics every five seconds on every node. Some metrics are static while other metrics are dynamic.

A node view consists of eight views when you display verbose output:

  • SYSTEM: Lists system metrics such as CPU COUNT, CPU USAGE, and MEM USAGE

  • TOP CONSUMERS: Lists the top consuming processes in the following format:

    metric_name: 'process_name(process_identifier) utilization'
    
  • PROCESSES: Lists process metrics such as PID, name, number of threads, memory usage, and number of file descriptors

  • DEVICES: Lists device metrics such as disk read and write rates, queue length, and wait time per I/O

  • NICS: Lists network interface card metrics such as network receive and send rates, effective bandwidth, and error rates

  • FILESYSTEMS: Lists file system metrics, such as total, used, and available space

  • PROTOCOL ERRORS: Lists any protocol errors

  • CPUS: Lists statistics for each CPU

You can generate a summary report that only contains the SYSTEM and TOP CONSUMERS views.

"Metric Descriptions" lists descriptions for all the metrics associated with each of the views in the preceding list.

Note:

Metrics displayed in the TOP CONSUMERS view are described in Table I-4, "PROCESSES View Metric Descriptions".

Example I-1 shows an example of a node view.

Syntax

oclumon dumpnodeview [[-allnodes] | [-n node1 node2 noden] [-last "duration"] | 
[-s "time_stamp" -e "time_stamp"] [-v]] [-h]

Parameters

Table I-2 oclumon dumpnodeview Command Parameters

Parameter Description
-allnodes

Use this option to dump the node views of all the nodes in the cluster.

-n node1 node2

Specify one node (or several nodes in a space-delimited list) for which you want to dump the node view.

-last "duration"

Use this option to specify a time, given in HH24:MM:SS format surrounded by double quotation marks (""), to retrieve the last metrics. For example:

"23:05:00"
-s "time_stamp" -e "time_stamp"

Use the -s option to specify a time stamp from which to start a range of queries and use the -e option to specify a time stamp to end the range of queries. Specify time in YYYY-MM-DD HH24:MM:SS format surrounded by double quotation marks ("").

"2011-05-10 23:05:00"

Note: You must specify these two options together to obtain a range.

-v

Displays verbose node view output.

-h

Displays online help for the oclumon dumpnodeview command.


Usage Notes

  • In certain circumstances, data can be delayed for some time before it is replayed by this command.

  • The default is to continuously dump node views. To stop continuous display, use Ctrl+C on Linux and Windows.

  • Both the local system monitor service (osysmond) and the cluster logger service (ologgerd) must be running to obtain node view dumps.

Examples

The following example dumps node views from node1, node2, and node3 collected over the last twelve hours:

$ oclumon dumpnodeview -n node1 node2 node3 -last "12:00:00"

The following example displays node views from all nodes collected over the last fifteen minutes:

$ oclumon dumpnodeview -allnodes -last "00:15:00"
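
The following example shows a range query using the -s and -e options described in Table I-2, with verbose output (the time stamps are illustrative):

$ oclumon dumpnodeview -s "2013-07-17 23:00:00" -e "2013-07-17 23:30:00" -v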

Metric Descriptions

This section includes descriptions of the metrics in each of the views that comprise a node view, listed in the following tables.

Table I-3 SYSTEM View Metric Descriptions

Metric Description
#disks

Number of disks

#fds

Number of open file descriptors

Number of open handles on Windows

#nics

Number of network interface cards

#sysfdlimit

System limit on number of file descriptors

Note: This metric is not available on Windows systems.

chipname

The name of the CPU vendor

cpu

Average CPU utilization per processing unit within the current sample interval (%).

cpuht

CPU hyperthreading enabled (Y) or disabled (N)

cpuq

Number of processes waiting in the run queue within the current sample interval

hugepagefree

Free size of huge page in KB

Note: This metric is not available on Solaris or Windows systems.

hugepagesize

Smallest unit size of huge page

Note: This metric is not available on Solaris or Windows systems.

hugepagetotal

Total size of huge pages in KB

Note: This metric is not available on Solaris or Windows systems.

ior

Average total disk read rate within the current sample interval (KB per second)

ios

Average disk I/O operation rate within the current sample interval (I/O operations per second)

iow

Average total disk write rate within the current sample interval (KB per second)

mcache

Amount of physical RAM used for file buffers plus the amount of physical RAM used as cache memory (KB)

Note: This metric is not available on Solaris or Windows systems.

netr

Average total network receive rate within the current sample interval (KB per second)

netw

Average total network send rate within the current sample interval (KB per second)

nicErrors

Average total network error rate within the current sample interval (errors per second)

#pcpus

The number of physical CPUs

pgin

Average page in rate within the current sample interval (pages per second)

pgout

Average page out rate within the current sample interval (pages per second)

physmemfree

Amount of free RAM (KB)

physmemtotal

Amount of total usable RAM (KB)

procs

Number of processes

rtprocs

Number of real-time processes

swapfree

Amount of swap memory free (KB)

swaptotal

Total amount of physical swap memory (KB)

swpin

Average swap in rate within the current sample interval (KB per second)

Note: This metric is not available on Windows systems.

swpout

Average swap out rate within the current sample interval (KB per second)

Note: This metric is not available on Windows systems.

#vcpus

Number of logical compute units


Table I-4 PROCESSES View Metric Descriptions

Metric Description
name

The name of the process executable

pid

The process identifier assigned by the operating system

#procfdlimit

Limit on number of file descriptors for this process

Note: This metric is not available on Windows, Solaris, AIX, and HP-UX systems.

cpuusage

Process CPU utilization (%)

Note: The utilization value can be up to 100 times the number of processing units.

privmem

Process private memory usage (KB)

shm

Process shared memory usage (KB)

Note: This metric is not available on Windows, Solaris, and AIX systems.

workingset

Working set of a program (KB)

Note: This metric is only available on Windows.

#fd

Number of file descriptors open by this process

Number of open handles by this process on Windows

#threads

Number of threads created by this process

priority

The process priority

nice

The nice value of the process

state

The state of the process


Table I-5 DEVICES View Metric Descriptions

Metric Description
ior

Average disk read rate within the current sample interval (KB per second)

iow

Average disk write rate within the current sample interval (KB per second)

ios

Average disk I/O operation rate within the current sample interval (I/O operations per second)

qlen

Number of I/O requests in wait state within the current sample interval

wait

Average wait time per I/O within the current sample interval (msec)

type

If applicable, identifies what the device is used for. Possible values are SWAP, SYS, OCR, ASM, and VOTING.


Table I-6 NICS View Metric Descriptions

Metric Description
netrr

Average network receive rate within the current sample interval (KB per second)

netwr

Average network sent rate within the current sample interval (KB per second)

neteff

Average effective bandwidth within the current sample interval (KB per second)

nicerrors

Average error rate within the current sample interval (errors per second)

pktsin

Average incoming packet rate within the current sample interval (packets per second)

pktsout

Average outgoing packet rate within the current sample interval (packets per second)

errsin

Average error rate for incoming packets within the current sample interval (errors per second)

errsout

Average error rate for outgoing packets within the current sample interval (errors per second)

indiscarded

Average drop rate for incoming packets within the current sample interval (packets per second)

outdiscarded

Average drop rate for outgoing packets within the current sample interval (packets per second)

inunicast

Average packet receive rate for unicast within the current sample interval (packets per second)

type

Whether PUBLIC or PRIVATE

innonunicast

Average packet receive rate for multi-cast (packets per second)

latency

Estimated latency for this network interface card (msec)


Table I-7 FILESYSTEMS View Metric Descriptions

Metric Description
total

Total amount of space (KB)

used

Amount of used space (KB)

available

Amount of available space (KB)

used%

Percentage of used space (%)

ifree%

Percentage of free file nodes (%)

Note: This metric is not available on Windows systems.


Table I-8 PROTOCOL ERRORS View Metric Descriptions (Footnote 1)

Metric Description
IPHdrErr

Number of input datagrams discarded due to errors in their IPv4 headers

IPAddrErr

Number of input datagrams discarded because the IPv4 address in their IPv4 header's destination field was not a valid address to be received at this entity

IPUnkProto

Number of locally-addressed datagrams received successfully but discarded because of an unknown or unsupported protocol

IPReasFail

Number of failures detected by the IPv4 reassembly algorithm

IPFragFail

Number of IPv4 discarded datagrams due to fragmentation failures

TCPFailedConn

Number of times that TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, plus the number of times that TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state

TCPEstRst

Number of times that TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state

TCPRetraSeg

Total number of TCP segments retransmitted

UDPUnkPort

Total number of received UDP datagrams for which there was no application at the destination port

UDPRcvErr

Number of received UDP datagrams that could not be delivered for reasons other than the lack of an application at the destination port


Footnote 1 All protocol errors are cumulative values since system startup.

Table I-9 CPUS View Metric Descriptions

Metric Description
cpuid

Virtual CPU

sys-usage

CPU usage in system space

user-usage

CPU usage in user space

nice

The nice value for a specific CPU

usage

CPU usage for a specific CPU

iowait

CPU wait time for I/O operations


Example I-1 Sample Node View

----------------------------------------
Node: node1 Clock: '07-17-13 23.33.25' SerialNo:34836
----------------------------------------

SYSTEM:
#pcpus: 12 #vcpus: 12 cpuht: N chipname: Intel(R) cpu: 2.43 cpuq: 2
physmemfree: 56883116 physmemtotal: 74369536 mcache: 13615352 swapfree: 18480408
swaptotal: 18480408 ior: 170 iow: 37 ios: 37 swpin: 0 swpout: 0 pgin: 170
pgout: 37 netr: 40.301 netw: 57.211 procs: 437 rtprocs: 33 #fds: 15008
#sysfdlimit: 6815744 #disks: 9 #nics: 5  nicErrors: 0

TOP CONSUMERS:
topcpu: 'osysmond.bin(9103) 2.59' topprivmem: 'java(26616) 296808'
topshm: 'ora_mman_orcl_4(32128) 1222220' topfd: 'ohasd.bin(7594) 150'
topthread: 'crsd.bin(9250) 43'

PROCESSES:

name: 'mdnsd' pid: 12875 #procfdlimit: 8192 cpuusage: 0.19 privmem: 9300
shm: 8604 #fd: 36 #threads: 3 priority: 15 nice: 0
name: 'ora_cjq0_rdbms3' pid: 12869 #procfdlimit: 8192 cpuusage: 0.39
privmem: 10572 shm: 77420 #fd: 23 #threads: 1 priority: 15 nice: 0
name: 'ora_lms0_rdbms2' pid: 12635 #procfdlimit: 8192 cpuusage: 0.19
privmem: 15832 shm: 49988 #fd: 24 #threads: 1 priority: 15 nice: 0
name: 'evmlogger' pid: 32355 #procfdlimit: 8192 cpuusage: 0.0 privmem: 4600
shm: 8756 #fd: 9 #threads: 3 priority: 15 nice: 0
.
.
.

DEVICES:

xvda ior: 0.798 iow: 193.723 ios: 33 qlen: 0 wait: 0 type: SWAP
xvda2 ior: 0.000 iow: 0.000 ios: 0 qlen: 0 wait: 0 type: SWAP
xvda1 ior: 0.798 iow: 193.723 ios: 33 qlen: 0 wait: 0 type: SYS

CPUS:

cpu0: sys-0.93 user-0.41 nice-0.0 usage-1.35 iowait-0.0
cpu1: sys-0.20 user-0.72 nice-0.0 usage-0.93 iowait-7.16
cpu3: sys-0.40 user-0.40 nice-0.0 usage-0.81 iowait-0.51
cpu2: sys-0.30 user-0.20 nice-0.0 usage-0.50 iowait-0.0

NICS:

lo netrr: 35.743  netwr: 35.743  neteff: 71.486  nicerrors: 0 pktsin: 22
pktsout: 22  errsin: 0  errsout: 0  indiscarded: 0  outdiscarded: 0
inunicast: 22 innonunicast: 0  type: PUBLIC
eth0 netrr: 7.607  netwr: 1.363  neteff: 8.971  nicerrors: 0 pktsin: 41  
pktsout: 18  errsin: 0  errsout: 0  indiscarded: 0  outdiscarded: 0  
inunicast: 41  innonunicast: 0  type: PRIVATE latency: <1

FILESYSTEMS:

mount: / type: rootfs total: 155401100 used: 125927608 available: 21452240
used%: 85 ifree%: 93 [ORACLE_HOME CRF_HOME rdbms2 rdbms3 rdbms4 has51]

mount: /scratch type: ext3 total: 155401100 used: 125927608 available: 21452240
used%: 85 ifree%: 93 [rdbms2 rdbms3 rdbms4 has51]

mount: /net/adc6160173/scratch type: ext3 total: 155401100 used: 125927608
 available: 21452240 used%: 85 ifree%: 93 [rdbms2 rdbms4 has51]

PROTOCOL ERRORS:

IPHdrErr: 0 IPAddrErr: 19568 IPUnkProto: 0 IPReasFail: 0 IPFragFail: 0
TCPFailedConn: 931776 TCPEstRst: 76506 TCPRetraSeg: 12258 UDPUnkPort: 29132
UDPRcvErr: 148

oclumon manage

Use the oclumon manage command to view and change configuration information from the system monitor service.

Syntax

oclumon manage -repos {{changeretentiontime time} | {changerepossize 
memory_size}} | -get {key1 [key2 ...] | alllogger [-details] | mylogger [-details]}

Parameters

Table I-10 oclumon manage Command Parameters

Parameter Description
-repos 
{{changeretentiontime time} |
{changerepossize memory_size}}

The -repos flag is required to specify the following CHM repository-related options:

  • changeretentiontime time: Use this option to confirm that there is sufficient tablespace to hold the amount of CHM data that can be accumulated in a specific amount of time.

    Note: This option does not change retention time.

  • changerepossize memory_size: Use this option to change the CHM repository space limit to a specified number of MB

    Caution: If you decrease the space limit of the CHM repository, then all data collected before the resizing operation is permanently deleted.

-get key1 [key2 ...]

Use this option to obtain CHM repository information using the following keywords:


repsize: Size of the CHM repository, in seconds
reppath: Directory path to the CHM repository
master: Name of the master node
alllogger: Special key to obtain a list of all nodes running the cluster logger service
mylogger: Special key to obtain the node running the cluster logger service that serves the current node
  • -details: Use this option with alllogger and mylogger to list the nodes served by the cluster logger service

You can specify any number of keywords in a space-delimited list following the -get flag.

-h

Displays online help for the oclumon manage command


Usage Notes

  • The local system monitor service must be running to change the retention time of the CHM repository.

  • The Cluster Logger Service must be running to change the retention time of the CHM repository.

Examples

The following examples show commands and sample output:

$ oclumon manage -get MASTER
Master = node1

$ oclumon manage -get alllogger -details
Logger = node1
Nodes = node1,node2

$ oclumon manage -repos changeretentiontime 86400

$ oclumon manage -repos changerepossize 6000

oclumon version

Use the oclumon version command to obtain the version of CHM that you are using.

Syntax

oclumon version

Example

This command produces output similar to the following:

Cluster Health Monitor (OS), Version 12.1.0.0.2 - Production Copyright 2013
Oracle. All rights reserved.

Oracle Clusterware Diagnostic and Alert Log Data

Oracle Clusterware uses a unified log directory structure to consolidate its component log files. This consolidated structure simplifies diagnostic information collection and assists during data retrieval and problem analysis.

Alert files are stored in the directory structures shown in Table I-11.

Table I-11 Locations of Oracle Clusterware Component Log Files

Component Log File Location (Footnote 1)

Cluster Health Monitor (CHM)

The system monitor service and cluster logger service record log information in the following locations, respectively:

Grid_home/log/host_name/crfmond
Grid_home/log/host_name/crflogd

Cluster Ready Services Daemon (CRSD) Log Files

ORACLE_BASE/diag/crs/host_name/crs

Cluster Synchronization Services (CSS)

Grid_home/log/host_name/cssd

Cluster Time Synchronization Service (CTSS)

Grid_home/log/host_name/ctssd

Grid Plug and Play

Grid_home/log/host_name/gpnpd

Multicast Domain Name Service Daemon (MDNSD)

Grid_home/log/host_name/mdnsd

Oracle Cluster Registry

Oracle Cluster Registry tools (OCRDUMP, OCRCHECK, OCRCONFIG) record log information in the following location (Footnote 2):

Grid_home/log/host_name/client

Cluster Ready Services records Oracle Cluster Registry log information in the following location:

Grid_home/log/host_name/crsd

Grid Naming Service (GNS)

Grid_home/log/host_name/gnsd

Oracle High Availability Services Daemon (OHASD)

Grid_home/log/host_name/ohasd

Oracle Automatic Storage Management Cluster File System (Oracle ACFS)

Grid_home/log/host_name/acfsrepl
Grid_home/log/host_name/acfsreplroot
Grid_home/log/host_name/acfssec
Grid_home/log/host_name/acfs

Event Manager (EVM) information generated by evmd

Grid_home/log/host_name/evmd

Cluster Verification Utility (CVU)

Grid_home/log/host_name/cvu

Oracle RAC RACG

The Oracle RAC high availability trace files are located in the following two locations:

Grid_home/log/host_name/racg
$ORACLE_HOME/log/host_name/racg

Core files are in subdirectories of the log directory. Each RACG executable has a subdirectory assigned exclusively for that executable. The name of the RACG executable subdirectory is the same as the name of the executable.

Additionally, you can find logging information for the VIP in Grid_home/log/host_name/agent/crsd/orarootagent_root and for the database in $ORACLE_HOME/log/host_name/racg.

Server Manager (SRVM)

Grid_home/log/host_name/srvm

Disk Monitor Daemon (diskmon)

Grid_home/log/host_name/diskmon

Grid Interprocess Communication Daemon (GIPCD)

Grid_home/log/host_name/gipcd

Footnote 1 The directory structure is the same for Linux, UNIX, and Windows systems.

Footnote 2  To change the amount of logging, edit the path in the Grid_home/srvm/admin/ocrlog.ini file.

Diagnostics Collection Script

When an Oracle Clusterware error occurs, run the diagcollection.pl diagnostics collection script to collect diagnostic information from Oracle Clusterware into trace files. The diagnostics provide additional information so My Oracle Support can resolve problems. Run this script as root from the Grid_home/bin directory.

Syntax

Use the diagcollection.pl script with the following syntax:

diagcollection.pl {--collect [--crs | --acfs | --all] [--chmos] [--adr location
   [--aftertime time [--beforetime time]]] [--crshome path]
   [--incidenttime time [--incidentduration time]]
   | --clean | --coreanalyze}

Note:

The diagcollection.pl script arguments are all preceded by two dashes (--).

Parameters

Table I-12 lists and describes the parameters used with the diagcollection.pl script.

Table I-12 diagcollection.pl Script Parameters

Parameter Description
--collect

Use this parameter with any of the following arguments:

  • --crs: Use this argument to collect Oracle Clusterware diagnostic information

  • --acfs: Use this argument to collect Oracle ACFS diagnostic information

    Note: You can only use this argument on UNIX systems.

  • --all: Use this argument to collect all diagnostic information except CHM (OS) data

    Note: This is the default.

  • --chmos: Use this argument to collect CHM diagnostic information

  • --adr location: Use this argument to collect Automatic Diagnostic Repository (ADR) diagnostic information

    Note: You must specify the ADR location.

  • --aftertime time: Use this argument with the --adr argument to collect archives after the specified time

    Note: The time format is YYYYMMDDHHMISS24.

  • --beforetime time: Use this argument with the --adr argument to collect archives before the specified time

    Note: The time format is YYYYMMDDHHMISS24.

  • --crshome path: Use this argument to override the location of the Oracle Clusterware home

    Note: The diagcollection.pl script typically derives the location of the Oracle Clusterware home from the system configuration (either the olr.loc file or the Windows registry), so this argument is not required.

  • --incidenttime time: Use this argument to collect CHM (OS) data from the specified time

    Note: The time format is MM/DD/YYYYHH24:MM:SS.

  • --incidentduration time: Use this argument with --incidenttime to collect CHM (OS) data for the duration after the specified time

    Note: The time format is HH:MM. If you do not use --incidentduration, then all CHM (OS) data after the time you specify in --incidenttime is collected.

--clean

Use this parameter to clean up the diagnostic information gathered by the diagcollection.pl script.

Note: You cannot use this parameter with --collect.

--coreanalyze

Use this parameter to extract information from core files and store it in a text file.

Note: You can only use this parameter on UNIX systems.
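
As an illustration of combining these parameters, the following sketch collects Oracle Clusterware diagnostics together with ADR archives from a one-day window (the ADR location and time stamps are placeholders; enter the command on a single line):

# Grid_home/bin/diagcollection.pl --collect --crs --adr /u01/app/oracle/diag
   --aftertime 20130720000000 --beforetime 20130721000000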


Rolling Upgrade and Driver Installation Issues

During an upgrade, while running the Oracle Clusterware root.sh script, you may see the following messages:

  • ACFS-9427 Failed to unload ADVM/ACFS drivers. A system restart is recommended.

  • ACFS-9428 Failed to load ADVM/ACFS drivers. A system restart is recommended.

If you see these error messages during the upgrade of the initial (first) node, then do the following:

  1. Complete the upgrade of all other nodes in the cluster.

  2. Restart the initial node.

  3. Run the root.sh script on the initial node again.

  4. Run the Oracle_home/cfgtoollogs/configToolAllCommands script as root to complete the upgrade (see the command sketch following these steps).
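
A minimal command sketch of steps 3 and 4, run as root on the initial node after it has been restarted (Grid_home and Oracle_home are placeholders for your actual home directories):

# Grid_home/root.sh
# Oracle_home/cfgtoollogs/configToolAllCommands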

For nodes other than the initial node (the node on which you started the installation):

  1. Restart the node where the error occurs.

  2. Run the orainstRoot.sh script as root on the node where the error occurs.

  3. Change directory to the Grid home, and run the root.sh script on the node where the error occurs.

Testing Zone Delegation

See Also:

Appendix E, "Oracle Clusterware Control (CRSCTL) Utility Reference" for information about using the CRSCTL commands referred to in this procedure

Use the following procedure to test zone delegation (a worked example with sample values follows the steps):

  1. Start the GNS VIP by running the following command as root:

    # crsctl start ip -A IP_name/netmask/interface_name
    

    The interface_name should be the public interface, and the netmask should be the netmask of the public network.

  2. Start the test DNS server on the GNS VIP by running the following command (you must run this command as root if the port number is less than 1024):

    # crsctl start testdns -address address [-port port]
    

    This command starts the test DNS server to listen for DNS forwarded packets at the specified IP and port.

  3. Ensure that the GNS VIP is reachable from other nodes by running the following command as root:

    crsctl status ip -A IP_name
    
  4. Query the DNS server directly by running the following command:

    crsctl query dns -name name -dnsserver DNS_server_address
    

    This command is expected to fail with the following error:

    CRS-10023: Domain name look up for name asdf.foo.com failed. Operating system error: Host name lookup failure

    Look at Grid_home/log/host_name/client/odnsd_*.log to see if the query was received by the test DNS server. This validates that the DNS queries are not being blocked by a firewall.

  5. Query the DNS delegation of GNS domain queries by running the following command:

    crsctl query dns -name name
    

    Note:

    The only difference between this step and the previous step is that you do not provide the -dnsserver DNS_server_address option, which causes the command to query the name servers configured in /etc/resolv.conf. As in the previous step, the command fails with the same error. Again, look at odnsd*.log to ensure that odnsd received the queries. If the previous step succeeds but this step does not, then you must check the DNS configuration.
  6. Stop the test DNS server by running the following command:

    crsctl stop testdns -address address
    
  7. Stop the GNS VIP by running the following command as root:

    crsctl stop ip -A IP_name/netmask/interface_name
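
Putting the steps together, a worked sketch with purely illustrative values (the VIP address 192.0.2.10, netmask 255.255.255.0, interface eth0, port 53, and name test.example.com are placeholders for your own values) might look like the following:

# crsctl start ip -A 192.0.2.10/255.255.255.0/eth0
# crsctl start testdns -address 192.0.2.10 -port 53
# crsctl status ip -A 192.0.2.10
$ crsctl query dns -name test.example.com -dnsserver 192.0.2.10
$ crsctl query dns -name test.example.com
# crsctl stop testdns -address 192.0.2.10
# crsctl stop ip -A 192.0.2.10/255.255.255.0/eth0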
    

Oracle Clusterware Alerts

Oracle Clusterware posts alert messages when important events occur. The following is an example of an alert from the CRSD process:

2009-07-16 00:27:22.074
[ctssd(12817)]CRS-2403:The Cluster Time Synchronization Service on host stnsp014 is in observer mode.
2009-07-16 00:27:22.146
[ctssd(12817)]CRS-2407:The new Cluster Time Synchronization Service reference node is host stnsp013.
2009-07-16 00:27:22.753
[ctssd(12817)]CRS-2401:The Cluster Time Synchronization Service started on host stnsp014.
2009-07-16 00:27:43.754
[crsd(12975)]CRS-1012:The OCR service started on node stnsp014.
2009-07-16 00:27:46.339
[crsd(12975)]CRS-1201:CRSD started on node stnsp014.

On Linux, UNIX, and Windows systems, this alert log is located in the following directory, where Grid_home is the directory in which the Oracle Grid Infrastructure is installed: Grid_home/log/host_name.
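
For example, assuming the alert log follows the conventional alerthost_name.log naming (an assumption; the file name is not shown in this appendix) and using the host name from the sample messages above, you could follow new alert messages with:

$ tail -f Grid_home/log/stnsp014/alertstnsp014.log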

The following example shows the start of the Oracle Cluster Time Synchronization Service (OCTSS) after a cluster reconfiguration:

[ctssd(12813)]CRS-2403:The Cluster Time Synchronization Service on host stnsp014 is in observer mode.
2009-07-15 23:51:18.292
[ctssd(12813)]CRS-2407:The new Cluster Time Synchronization Service reference node is host stnsp013.
2009-07-15 23:51:18.961
[ctssd(12813)]CRS-2401:The Cluster Time Synchronization Service started on host stnsp014.

Alert Messages Using Diagnostic Record Unique IDs

Beginning with Oracle Database 11g release 2 (11.2), certain Oracle Clusterware messages contain a text identifier surrounded by "(:" and ":)". Usually, the identifier is part of the message text that begins with "Details in..." and includes an Oracle Clusterware diagnostic log file path and name similar to the following example. The identifier is called a DRUID, or diagnostic record unique ID:

2012-11-16 00:18:44.472
[/scratch/12.1/grid/bin/orarootagent.bin(13098)]CRS-5822:Agent
 '/scratch/12.1/grid/bin/orarootagent_root' disconnected from server. Details at
 (:CRSAGF00117:) in /scratch/12.1/grid/log/stnsp014/agent/crsd/orarootagent_
root/orarootagent_root.log.

DRUIDs are used to relate external product messages to entries in a diagnostic log file and to internal Oracle Clusterware program code locations. They are not directly meaningful to customers and are used primarily by My Oracle Support when diagnosing problems.
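
Because a DRUID appears both in the alert message and in the referenced log file, a simple text search ties the two together. For example, using the DRUID and log file path from the message above:

$ grep ":CRSAGF00117:" /scratch/12.1/grid/log/stnsp014/agent/crsd/orarootagent_root/orarootagent_root.log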