As with any software, you might have occasional difficulties with N1 Provisioning Server software or an Infrastructure Fabric (I-Fabric). This chapter helps you identify and resolve such problems by describing potential problems, their likely causes, and practical solutions. It features useful troubleshooting tables and flowcharts. The following major topics are covered in this chapter:
Failover of resource pool servers within an I-Fabric is automated, and requires you to perform few or no manual tasks after a failover and startup of a new device. Any tasks to be performed after a failover are described within the sections dedicated to specific I-Fabric devices within this chapter.
You can set the debug level in the tspr.properties configuration file according to your preference of debugging detail. The higher the debug level, the more logging information you receive. A debug level setting of 9 is recommended. You can view debug information in the tspr.debug log file. The following is an example tspr.properties file with the debug level set to 9. All entries in the tspr.properties file must have the following format:
package-name.class-name.attribute=value

com.terraspring.core.sys.GridOS.debuglevel=9
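For example, the following sketch checks the current debug level and then follows the debug log while you reproduce a problem. It assumes the default file locations described later in this chapter, /etc/opt/terraspring/tspr.properties and /var/adm/tspr.debug.

# Check the current debug level setting
grep debuglevel /etc/opt/terraspring/tspr.properties

# Follow the debug log while reproducing the problem
tail -f /var/adm/tspr.debug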
N1 Provisioning Server software actively and automatically monitors the availability of all devices within a farm to enable automated failover of logical server farm devices. No special configuration is necessary to enable this monitoring. When the software detects that a device is no longer available, the software can take the necessary automated steps to replace the physical device with an identical device from the pool of unused devices. In replacing the failed physical device, the system duplicates configuration and logically reattaches storage to the new device so that the new device can take over the role of the failed device.
Whenever a failure is detected, the system starts the failover process by placing a failover “job” request into the N1 Provisioning Server queuing mechanism. This same queuing mechanism is used to process all farm activation and farm update requests. You can view the current set of tasks and their status by running the request -l command.
There are two options for specifying failover behavior when the monitoring system detects a device failure:
Automatic failover consists of the N1 Provisioning Server placing and automatically processing a replacefaileddevice request in the queue. You are not required to intervene.
Manual failover consists of the N1 Provisioning Server placing a replacefaileddevice request as a blocked request in the request queue for the replacement of the failed device. You must unblock the request before it can be processed and the device replaced. Run the command request -u request-ID to do this.
A blocked failover request blocks all subsequent requests for that farm, including farm update requests made through the Control Center. The system does not process any other changes to the farm until the failover request is either unblocked or deleted.
The automatic option enables immediate processing of the failover when the request enters the system's request queue. The manual option causes the failover request to be blocked in the system's request queue. As a result, the blocked failover request and any subsequent request for the failed device's farm are not processed until you either unblock the request to allow the failover or you delete the request to abort the failover.
Automatic failover can be costly if the server has not really failed but was only temporarily unavailable. In this case, the device might be replaced unnecessarily.
To configure the behavior of the failover mechanism, alter the com.terraspring.cs.services.DeviceStatus.blockReqFailedDevice property in the /etc/opt/terraspring/tspr.properties file.
If the com.terraspring.cs.services.DeviceStatus.blockReqFailedDevice property is set to false, any detected device failures will not result in blocked failover requests. If this property is set to true, any detected device failures will result in blocked failover requests. The property is set to true by default, which is the recommended setting.
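For reference, the corresponding entry in /etc/opt/terraspring/tspr.properties looks like the following sketch, which uses the entry format shown at the start of this chapter. Change the value to false only if you want detected failures to be processed automatically.

# Recommended: keep failover requests blocked until an operator reviews them
com.terraspring.cs.services.DeviceStatus.blockReqFailedDevice=true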
Upon the detection of a farm device failure, the monitoring software sends a device DOWN event to the segment manager of the N1 Provisioning Server that manages the farm.
The monitoring software sends a device DOWN event only for active farms.
When the segment manager receives a DOWN event from the monitoring software, the segment manager performs the following procedures:
A blocked replacePhysicalDevices request is sent to the farm manager of the farm that owns that device.
A critical error message is logged into the log file to alert operations. The critical message that the segment manager generates contains the failed device's name and the corresponding farm ID.
Perform the following procedure on a control plane server if you receive a critical farm device failure message.
If automatic failover is configured (that is, the blockReqFailedDevice property is set to false), no action needs to be taken.
List the current requests for the farm:
request -lf farm-ID
Review the list and obtain the requestID of the blocked replacePhysicalDevices request generated by the segment manager for the farm.
You can identify the requestID by the replacePhysicalDevices request whose state is listed as QUEUED_BLOCKED. The second argument of the replacePhysicalDevices request specifies the IDs of the devices that failed.
Verify that the physical device has actually failed and that it is not a spurious error. See Handling a Failed Control Plane Server for details. Temporary network failures can cause spurious errors.
If the device has not failed, or you do not want to replace the device, delete the request by typing request -d request-ID.
If only one device failed, start the device replacement by unblocking the replacePhysicalDevices request, typing request -u request-ID.
If multiple devices failed, you will see many replacePhysicalDevices requests.
After replacing failed devices, delete the replacePhysicalDevices requests by typing:
request -d request-ID
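The following command sequence is a sketch of this procedure, using only the request options documented above. Replace farm-ID and request-ID with the values from your own request queue.

# List current requests for the farm and note the blocked replacePhysicalDevices request
request -lf farm-ID

# If the device really failed and you want it replaced, unblock the request
request -u request-ID

# If the error was spurious, or you do not want to replace the device, delete the request
request -d request-ID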
If the control plane server fails, see the server documentation for details on replacing the server. If the Oracle database on the control plane database (CPDB) server has been damaged, contact Sun Service at http://sun.com/service/contacting/index.html for assistance.
See Chapter 5, Backing Up and Restoring Components for a description of the files that you need to restore. See the N1 Provisioning Server 3.1, Blades Edition, Installation Guide for details on how to reinstall the software that runs on the control plane, such as the Control Center software and the N1 Provisioning Server software.
This section describes how to troubleshoot the failure of software processes on the control plane.
If the monitoring manager fails, attempt to restart it by running the /opt/terraspring/sbin/mmd start command. If the monitoring manager does not restart, see the N1 Provisioning Server 3.1, Blades Edition, Installation Guide for details on how to reinstall the monitoring manager software and restore the most recent backup.
If the Control Center fails, attempt to restart it by running the /opt/terraspring/sunone/bin/appserv start command. If the Control Center does not restart, see the N1 Provisioning Server 3.1, Blades Edition, Installation Guide for details on how to reinstall the Control Center software and restore the most recent backup.
Almost all farm operations are carried out asynchronously. That is, a request (message) is queued for the N1 Provisioning Server, which processes requests in the order in which they are received. Use the request command to view all pending and in-progress requests in the N1 Provisioning Server.
Any critical error occurring during the execution of a farm operation is reported by the Monitoring Manager through the monitoring system. Critical messages are also logged to the /var/adm/messages file on the N1 Provisioning Server. The critical error causes the current farm operation to exit and moves the farm into an error state. When the farm is in an error state, requests in the queue cannot be processed until the error is cleared manually. After the problem causing the error is resolved, you must reset the farm error state so that the stopped operation can be restarted.
In addition to the /var/adm/messages file, various debug messages are logged in the file /var/adm/tspr.debug. The amount of information in this file is determined by the debug level set in the /etc/opt/terraspring/tspr.properties file. The default debug level is 9. When a critical error is encountered, the additional information in the /var/adm/tspr.debug file is useful in determining the cause of the error.
To diagnose a problem with the N1 Provisioning Server, follow the steps illustrated in Figure 7–1.
All messages logged into the /var/adm/messages and the /var/adm/tspr.debug files are in a standard format. The following example is a typical message:
Nov 12 17:22:57 sp3 java[23033]: [ID 398540 user.info] TSPR [sev=crit] [fmid=1211] [MSG6718] FM Activate: Ready timeout expired.
Table 7–1 Message Log File Format
| Message Element | Description |
| --- | --- |
| Nov 12 17:22:57 | The date and timestamp of the message. |
| sp3 | The name of the control plane server. |
| java[23033] | The unique ID of the Solaris process. |
| [ID 398540 | The unique ID of the system log. |
| user.info] | The type of system log entry. |
| TSPR | The type of message. In the example, the message is an N1 Provisioning Server message. |
| [sev=crit] | Indicates the severity of the message. In the example, crit means this error message is critical. Other severity types are warn, okay, and dbug. |
| [fmid=1211] | Indicates the application ID that generated the message. In the example, the application is the farm manager for farm 1211. The segment manager application ID is in the format [smid=27003]. The command-line tools application ID is in the format [apps=26765]. |
| [MSG6718] | The message ID assigned by the N1 Provisioning Server. |
| FM Activate: Ready timeout expired | The message itself. |
When an error or exception occurs during execution, the name of the exception is logged as part of the message along with a description of the exception.
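Because all messages follow the format shown in Table 7–1, you can filter the system log for critical N1 Provisioning Server messages with a command similar to the following sketch:

# Show critical N1 Provisioning Server messages in the system log
grep "TSPR \[sev=crit\]" /var/adm/messages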
The farm operations initiated using the Control Center or the farm command are queued for the segment manager. The segment manager processes the request, and dispatches the request to the farm manager, which actually does the work. When investigating problems, ensure that your request has not stalled along the way to the farm manager.
If there are no critical errors or messages relating to requests that are being processed in the tspr.debug file of the segment manager, check the request queue to verify that your request has been processed.
The request command is used to view the request queue:
To list all current requests in the N1 Provisioning Server, run the command request -l.
To list all current requests for a farm, run the command request -lf farm-ID.
To list all completed requests for a farm, run the command request -lcf farm-ID.
To list all requests for a farm, run the command request -laf farm-ID.
For more information on the request command and its usage, see the request man page.
If your request is still in the queue in the QUEUED state, the request has not been processed yet. If no other requests are ahead of this request for your farm, the request queue might have stalled. The next section describes how to resolve this problem.
A request can sometimes fail to be processed by its intended server and might stay in the queue unattended. This section explains how to diagnose and solve this problem.
If starting the segment manager does not process all requests on the queue, some requests might be blocked. Blocked requests require manual intervention prior to their processing. When you have reviewed the blocked requests, you can unblock them to process them or delete them. You can run the request -u request-ID command to unblock or the request -d request-ID command to delete blocked requests.
After the blocked requests are cleared, the requests in the queue are processed in the order in which they were received.
Sometimes the request is not blocked but just queued, and the farm manager is still not processing the request. In this case, ping the farm. Issuing the ping command to the farm activates the farm manager queue handler to process requests. This command especially applies to requests for farms that are in an error state. Run the following command to ping the farm to activate the request queue handler:
farm -p farm-ID
If the farm was in an error state (that is, a previous operation ended in error), reset the error to move the queue:
farm -pf farm-ID
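As a sketch of this recovery sequence, where farm-ID identifies the stalled farm:

# Activate the farm manager's request queue handler
farm -p farm-ID

# If a previous operation ended in error, also reset the farm error state
farm -pf farm-ID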
You can look for failed power or switch operation requests by checking the tspr.debug log for ExpectTimedOut messages. Power operation requests include powerUp, ispowerUp, and powerDown operations. Switch operation requests include addPort, removeVlan, and removeAll operations.
A failure can occur if the N1 Provisioning Server is unable to access a device or the N1 Provisioning Server receives unexpected output from a device. If a failure occurs, check the following items:
Verify that the IP address for the device is set correctly and that communication from the N1 Provisioning Server to the device exists.
Run the device -lv device-ID command from the N1 Provisioning Server to verify that all login names and passwords for this device are correct.
Ensure that the firmware is the supported version and that any changes made to the default settings are acceptable.
If you need to debug further to resolve the issue, enable the Expect log by setting the following properties:
com.terraspring.drivers.util.expect.Expect.print=true
com.terraspring.drivers.util.expect.Expect.output=log-name-path
Run the tail -f log-name-path command and reissue your request to view the operation.
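The following sketch enables the Expect log and follows it while you reissue the failing request. It assumes that these properties belong in /etc/opt/terraspring/tspr.properties like the other entries shown earlier in this chapter, and the log path /tmp/expect.log is an arbitrary example.

# Enable Expect logging; /tmp/expect.log is an example path, choose your own
echo "com.terraspring.drivers.util.expect.Expect.print=true" >> /etc/opt/terraspring/tspr.properties
echo "com.terraspring.drivers.util.expect.Expect.output=/tmp/expect.log" >> /etc/opt/terraspring/tspr.properties

# Watch the Expect log while you reissue the request
tail -f /tmp/expect.log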
The following table describes N1 Provisioning Server troubleshooting issues related to farm operations. This list is not exhaustive. The message log file indicates whether your problem relates to farm operations.
Table 7–2 Troubleshooting Farm Operations
Every farm has an error status code that is associated with the farm to indicate whether the farm is currently in an abnormal state.
The error status code of 0 represents a healthy state.
The error status code of 1000 means that the farm manager is processing a request.
An error status code other than 0 or 1000 means that the farm has an error.
During request processing, the farm's internal state changes whenever a transition completes successfully. If the farm fails to transition from one internal state to another, the farm's internal state is not changed and the farm's error status is set. The value of the error status code is the value of the internal state that the farm failed to reach.
For example, if the farm failed the transition from state ALLOCATED (20) to state WIRED (30), the farm is still left with the internal state ALLOCATED (20) and the farm error status code is set to 30 to represent the failed state WIRED. Just before the code is set to 30, it is 1000 to indicate that the request is in progress.
Whenever a farm error occurs, a critical error message is generated in the system log file /var/adm/messages.
The farm manager will not process any further farm requests until the error condition is changed and the error status code is cleared to 0.
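Because farm manager messages carry an [fmid=farm-ID] tag (see Table 7–1), you can review the critical messages for a specific farm with a command like the following sketch, shown here for farm 1211 from the earlier example:

# Show critical messages logged by the farm manager for farm 1211
grep "\[sev=crit\] \[fmid=1211\]" /var/adm/messages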
The following table contains descriptions of troubleshooting scenarios involving images. This list is not exhaustive. The message log file indicates whether your problem relates to images.
Table 7–3 Troubleshooting Image Problems
| Problem | Possible Cause | Solution |
| --- | --- | --- |
| While allocating the resources for a farm, the N1 Provisioning Server generates a message indicating that it cannot find the named image. | The image ID in the FML file is incorrect. | Contact Sun Service at http://sun.com/service/contacting/index.html for assistance. Synchronize the Control Center. Also check the images list on the CPDB by running the command image -ls, and ensure that this list matches what displays in the Control Center. |
| A farm update request failed after making a snapshot. | The server was not completely backed up and running when the update request was made. | When you issue a farm update request after taking a snapshot, make sure that the server is completely up and running before issuing the request. Otherwise, the farm update fails. To verify that the server is up and running, run the command ping server-IP-address. |
| Farm activation fails after image creation. | Not enough external subnets. | The farm that is automatically created by the image creation process includes an external subnet. Define a new external subnet using the command subnet -cx -m netmask-size network-IP. |
| Farm standby request failed. | Not enough space on the image server for storing the image. | Go to the directory where all images are stored and issue the command df -k to determine how much space is available for storing images. If the capacity is above 85 percent, add more storage space for images. |
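For example, the following sketch checks the capacity of the image directory. The path /export/images is a placeholder for the directory where your images are actually stored.

# Check free space in the image directory (replace /export/images with your image directory)
cd /export/images
df -k .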
The image server manages images. The image server can either be a stand-alone NFS server or it can run on a control plane server. See Chapter 3, Managing Software Images for details.
If your image requires Gigabit Ethernet support for the Solaris operating environment, ensure that the appropriate drivers are loaded each time the system boots. You can initialize the list of managed interfaces by running the following commands:
devfsadm
ifconfig -a plumb
ifconfig -a | grep flags | cut -d: -f1 | grep -v lo0 > /etc/opt/terraspring/managed_interfaces
Now edit the contents of /etc/opt/terraspring/managed_interfaces by commenting out any interfaces that should not be managed by N1 Provisioning Server software.
The instance assigned to any Gigabit Ethernet card must always be 0. For example, ce0, skge0, alt0.
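The resulting file simply lists one interface name per line. The following is an illustrative sketch only: the interface names are examples, and the use of a leading # to comment out an unmanaged interface is an assumption rather than documented syntax.

# Example /etc/opt/terraspring/managed_interfaces (illustrative interface names)
ce0
ce1
# bge0  - example of an interface commented out so it is not managed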
This section describes some common monitoring problems. It shows how to diagnose these problems and suggests possible corrective actions. See Chapter 4, Monitoring and Messaging for an introduction to monitoring concepts.
The most common problems are as follows:
An UP message is not received from one or more resource pool servers.
Monitors do not show colors on the Element Monitor window of the Control Center.
A farm does not activate even though an UP message was sent.
Frequent UP and DOWN messages are received.
Many of these problems have interconnected root causes. The most common causes are:
Network, DNS, or DHCP issues.
Monitoring processes are not running on the control plane server.
Agent processes are not running on the resource pool servers.
This section describes how to diagnose these symptoms. Corrective Actions for Monitoring describes the corrective action to take to resolve these problems.
You can confirm an UP message problem on the control plane server by checking whether the following conditions exist:
In the /var/adm/tspr.debug file, a message is listed similar to the following:
"Still waiting for 1 device(s) in 2879974 ms" |
The farm activation shows ERROR 50, as shown in the following example:
FARM_ID FARM_NAME CUSTOMER STATE ISTATE     ERROR
123     Farm_Name Customer NEW   DISPATCHED 50
The following figure shows the steps needed to diagnose and resolve this problem.
The preceding illustration shows the following troubleshooting sequence:
Check for a network, DNS, or DHCP problem. See Network, DNS, or DHCP Problems for details on how to do this.
Check that the monitoring processes are running on the control plane server. See Monitoring Processes Are Not Running on the Control Plane for details. Follow the instructions in this section to restart the processes.
Check that the agent processes are running on the resource pool server. See Agent Processes Are Not Running on a Resource Pool Server for details. If the agent processes are not running, follow the instructions in this section to restart them.
Farm-specific monitors might not appear on the Control Center. This condition could be caused by one of the following problems:
Agent processes are not running on the servers.
The mapping between the gw-mon-vip and the IP address of the Control Center server software is not set in the /etc/hosts file on the control plane server.
The listener on the Control Center is not running. See Control Plane Server-to-Control Center Messages Not Working for information on how to verify this condition.
Figure 7–2 shows the sequence of steps for you to follow to diagnose and resolve the above error condition. See Control Plane Server-to-Control Center Messages Not Working for details on how to resolve these problems.
Even though the UP message was sent by the monitoring system, the segment manager might not be running. In this case, restart the segment manager. See Check for Blocked Requests for details on this procedure.
Frequent UP and DOWN messages for a server might be received as a result of incorrect configuration of the interfaces on the N1 Provisioning Server.
Clear the duplicate Ethernet interfaces on the control plane server by running the clearNicInterface command. See the man pages for details on using this command.
Several symptoms are common to a number of problems. This section describes how to diagnose these symptoms.
Perform the checks in the following table for network, DNS, or DHCP problems:
Table 7–4 Checking for Errors
| Error Check | Error Confirmation |
| --- | --- |
| Verify that all the resource pool servers can receive ping signals by running the following command on the control plane server: /opt/terraspring/sbin/mls -lf farm-ID. Note – This command lists all the servers in the farm that can receive ping signals. | Any of the servers are listed as ADDED. |
| Verify that all the resource pool servers are reachable by performing a telnet to each of the servers. | Any of the servers are not reachable with telnet. |
Sometimes a server can receive ping signals but is not reachable with telnet when the server is in single-user mode. To resolve this problem, connect to the console port and boot into multiuser mode.
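As a sketch, the checks from Table 7–4 can be run as follows, where farm-ID and server-IP are placeholders for your own values:

# List the servers in the farm that answer ping; investigate any listed as ADDED
/opt/terraspring/sbin/mls -lf farm-ID

# Verify that each resource pool server is reachable with telnet
telnet server-IP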
To determine whether the monitoring processes are running on the control plane server, run the following command:
/usr/ucb/ps -auxww | grep MM
If the monitoring process is running, you will see an output similar to this example:
USER PID %CPU %MEM SZ RSS TT S START TIME COMMAND
root 14540 0.2 1.14 485 620 608? S Mar 05 18:32 /bin/../java/bin/../bin/sparc/native_threads/java -Dsun.net.inetaddr.ttl=0 com.terraspring.mon.MM
root 9529 0.1 0.1 976 672 pts/2 S 11:49:40 0:00 grep MM
If the monitoring process is not running, you will see an output similar to this example:
USER PID %CPU %MEM SZ RSS TT S START TIME COMMAND
root 9565 0.1 0.1 976 672 pts/2 S 11:50:28 0:00 grep MM
See Restart the Monitoring Processes on the Control Plane Server for details on how to restart the process.
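A minimal check-and-restart sketch, using the mmd commands described in Corrective Actions for Monitoring later in this chapter:

# Check whether the monitoring manager (MM) process is running
/usr/ucb/ps -auxww | grep MM

# If it is not running, stop any leftover processes and restart the monitoring manager
/opt/terraspring/sbin/mmd stop
/opt/terraspring/sbin/mmd start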
Agent processes might not be running on a resource pool server. You can verify this condition by one of two methods:
On the control plane server run the following command:
/opt/terraspring/sbin/mls -a IP-address-of-host
To be able to use this command, you must know the IP address of the server.
On the server on which the agent you want to verify is running, run the following command:
/usr/ucb/ps -auxww | grep tspragt
If the agent processes are running, you will see output similar to the following example:
root 7652 0.1 0.1 976 656 pts/1 S 11:37:30 0:00 grep tspragt
root 321 0.1 0.73167213816 ? S 16:26:37 0:10 /usr/bin/../java/bin/../bin/sparc/native_threads/java -Dsun.net.inetaddr.ttl=0 com.terraspring.mon.client.tspragt start 10.42.14.2
If the agent processes are not running, you will see output similar to the following example:
root 7709 0.1 0.1 976 656 pts/1 S 11:39:54 0:00 grep tspragt
See Restart the Agent Processes on a Resource Pool Server for details on how to restart the process.
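As a sketch, where server-IP is the IP address of the resource pool server:

# From the control plane server, check the agent state for the server
/opt/terraspring/sbin/mls -a server-IP

# On the resource pool server itself, check for the agent process
/usr/ucb/ps -auxww | grep tspragt

# If the agent is not running, restart it on the resource pool server
/etc/init.d/N1PSagt start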
Messages between the control plane server and the Control Center might not work for a number of reasons. The most common reasons include:
The mapping between gw-mon-vip and the IP address of the Control Center server software is not set in the /etc/hosts file on the control plane server. To check this condition, verify that a suitable entry is present.
For example:
10.5.131.19 gw-mon-vip
The listener on the Control Center server software is not running. You can verify this condition by running finger test@gw-mon-vip on the control plane server. The expected sample output is similar to the following examples:
[gw-mon-vip]
or
[hostname]
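A sketch of both checks, run on the control plane server:

# Verify that gw-mon-vip is mapped to the Control Center IP address
grep gw-mon-vip /etc/hosts

# Verify that the listener on the Control Center is running
finger test@gw-mon-vip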
This section describes a number of corrective actions that you can take to resolve a monitoring problem.
To restart the monitoring processes, run the following commands on the control plane server:
/opt/terraspring/sbin/mmd stop
This command ensures that all relevant processes are stopped. Restart the monitoring processes with the following command:
/opt/terraspring/sbin/mmd start
If the control agent process terminated on the server, start the process on the server with the following command:
/etc/init.d/N1PSagt start
To verify that the processes have restarted, run the following command from the control plane server:
/opt/terraspring/sbin/mls -a server-IP-address
If the agent is running you will see output similar to the following:
FARM_ID IP_ADDRESS TYPE   STATE DB_STATE SINCE
134     10.9.0.35  Server UP    UP       Feb 05 14:15:32
Because the database state is updated only every five minutes, the real-time state (STATE) and the database state (DB_STATE) can differ temporarily. For example, an agent that is down in real time might still be marked as UP in the database. In that case, you will see output similar to the following:
FARM_ID IP_ADDRESS TYPE   STATE DB_STATE SINCE
134     10.9.0.35  Server DOWN  UP       Feb 10 14:20:33
The main problem you might encounter when working with the Control Center relates to the CPDB connection: either the connection between the Control Center and the CPDB is down, or parameters are incorrectly configured in the Control Center database.
To determine which of these two problems you might be encountering:
Log in as an administrator in the Control Center using the browser.
In the configuration tools section, select I-Fabrics to bring up a list of I-Fabrics.
On the I-Fabrics List, select the I-Fabric whose connection needs to be checked.
Click OK when the following dialog displays: Improper Configuration of these Properties may disrupt System Operation. Proceed with Caution.
In the Property Name Property Value dialog, take note of the IP address of the database server (defined in the primary field) and its port.
If no IP address is listed, telnet to the hostname of the URL on which the Control Center runs.
Try the telnet command (assuming 10.0.0.18 and 1521 are the IP address and port obtained at step 4):
telnet 10.0.0.18 1521
Trying 10.0.0.18...
If you see the following response:
Connected to cpdb
Escape character is "^]"
the connection is okay. Press Ctrl-] and then type quit to close the telnet session.
If you see the following response:
telnet: Unable to connect to remote host: Connection refused
the IP is okay, but the CPDB database server is not running (or it is running on a different port). Contact your DBA or consult the database manufacturer's documentation for information on how to verify the listener port.
If the telnet attempt does not respond or times out, check basic connectivity to the database server host:
ping 10.0.0.18
If you see the following result:
no answer from 10.0.0.18
no communication exists between the Control Center host and the database server host. This error might be caused by a routing problem or by the database server host being down.
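As a quick recap, the two connectivity checks in this procedure can be run as follows; 10.0.0.18 and 1521 are the example address and port used above.

# Verify basic IP connectivity to the database server host
ping 10.0.0.18

# Verify that the CPDB listener accepts connections on its port
telnet 10.0.0.18 1521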
The connection from the Control Center to the CPDB might at times be too slow, causing a timeout. This condition might be caused by either the database or the network being slow due to an excessive workload. To prevent a timeout when the connection is slow, increase the value in the timeout field on the I-Fabric Properties screen in the Control Center. The default is 30 seconds.