Recover from a Failed Upgrade

A.9 Recover from a Failed Upgrade

The following procedure provides the steps required to recover a server after a failed upgrade. Due to the complexity of the DSR system and the nature of troubleshooting, it is recommended to contact My Oracle Support (MOS) for guidance while executing this procedure.

A.9 Active NOAM VIP: Select Affected Server Group Containing the Failed Server

Log in to the NOAM GUI using the VIP.
Navigate to Administration, then Software Management, and then Upgrade.
Select the server group tab for the server to be recovered.

Note:
If the failed server was upgraded using the Upgrade Server option, then skip to step 7 of this procedure.
If the failed server was upgraded using the Auto Upgrade option, then continue with step 2 of this procedure.

A.9 Active NOAM VIP: Navigate to the Active Tasks Screen to View Active Tasks

Navigate to Status & Manage, then Tasks, and then Active Tasks.

A.9 Active NOAM VIP: Use the Filter to Locate the Server Group Upgrade Task

From the Filter option, enter the following filter values:

Network Element: All

Display Filter: Name Like *upgrade*
Click Go.

A.9 Active NOAM VIP: Identify the Upgrade Task

In the search results list, locate the Server Group Upgrade task.

If not already selected, select the tab displaying the host name of the active NOAM server.
Locate the task for the Server Group Upgrade. It shows a status of paused.

Note:
In some upgrade cycles, one or more servers in the server group show a status of exception (that is, failed) while the other servers in that server group have upgraded successfully, yet the server group upgrade task still shows as running. In this case, cancel the running upgrade task for that server group before reattempting ASU for it.

Caution:
Before clicking Cancel for the server group upgrade task, ensure that every individual server in that server group has an upgrade status of completed or exception (that is, failed for some reason). Do not cancel a task while any of its servers is still in the running state.

A.9 Active NOAM VIP: Cancel the Server Group Upgrade Task

Click the Server Group Upgrade task to select it.
Click Cancel to cancel the task.
Click OK on the confirmation screen to confirm the cancellation.

A.9 Active NOAM VIP: Verify the Server Group Upgrade Task is Cancelled

On the Active Tasks screen, verify the task that was cancelled in step 5 shows a status of completed.

A.9 Failed Server CLI: Inspect Upgrade Log

Log in to the failed server to inspect the upgrade log for the cause of the failure.
Use an SSH client to connect to the failed server:

ssh <XMI IP address>

login as: admusr

password: <enter password>

Note:
The static XMI IP address for each server should be available in Table 5.
View or edit the upgrade log at /var/TKLC/log/upgrade/upgrade.log to identify the cause of the upgrade failure.
If the upgrade log contains a message similar to the one shown below, inspect the early upgrade log at /var/TKLC/log/upgrade/earlyChecks.log for additional clues.

1440613685::Early Checks failed for the next upgrade

1440613691::Look at earlyChecks.log for more info
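The log check described above can be sketched as a pair of small shell helpers. This is an illustrative sketch only, not part of the product: the function names early_checks_failed and show_early_check_lines are hypothetical, and the search strings are taken from the sample messages shown above, so they may need adjusting if the wording in your log differs. Reading the log may also require sudo, depending on file permissions.

```shell
# Hypothetical helper: succeeds (exit 0) when the upgrade log records an
# early-checks failure. The default path is the one named in this procedure.
early_checks_failed() {
    log="${1:-/var/TKLC/log/upgrade/upgrade.log}"
    grep -q 'Early Checks failed' "$log"
}

# Hypothetical helper: prints the matching lines, prefixed with their line
# numbers, so they can be located quickly when viewing the log.
show_early_check_lines() {
    log="${1:-/var/TKLC/log/upgrade/upgrade.log}"
    grep -n -e 'Early Checks failed' -e 'earlyChecks.log' "$log"
}
```

With these defined, running early_checks_failed && show_early_check_lines on the failed server would indicate whether /var/TKLC/log/upgrade/earlyChecks.log should be inspected next.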

Caution:
Although outside of the scope of this document, the user is expected to use standard troubleshooting techniques to clear the alarm condition from the failed server.
If troubleshooting assistance is needed, it is recommended to contact My Oracle Support (MOS).

Do not proceed to the next procedure until the alarm condition has been cleared.

A.9 Failed Server CLI: Verify Platform Alarms are Cleared from the Failed Server

Use the alarmMgr utility to verify all platform alarms have been cleared from the system.

$ sudo alarmMgr --alarmstatus

Example output:

[admusr@SO2 ~]$ sudo alarmMgr --alarmstatus

SEQ: 2 UPTIME: 827913 BIRTH: 1458738821 TYPE: SET ALARM: TKSPLATMI10|tpdNTPDaemonNotSynchronizedWarning|1.3.6.1.4.1.323.5.3.18.3.1.3.10|32509|Communications|Communications Subsystem Failure

***user troubleshoots alarm and is able to resolve NTP sync issue and clear alarm***

[admusr@SO2 ~]$ sudo alarmMgr --alarmstatus

[admusr@SO2 ~]$
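The alarm verification above can be wrapped in a small shell helper for repeated checks while troubleshooting. This is a minimal sketch under stated assumptions: it treats empty output from alarmMgr --alarmstatus as "no platform alarms", as in the example output above, and it conservatively treats a non-zero exit status as "not clear" because the exit-code behavior of alarmMgr is not described in this procedure. The function name platform_alarms_clear and the ALARM_CMD override variable are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: succeeds (exit 0) only when the alarm status command
# runs successfully and prints nothing.
# ALARM_CMD is an illustrative override, not a product setting; by default
# the command shown in this procedure is used.
platform_alarms_clear() {
    # Intentionally unquoted so the command string splits into words.
    out=$(${ALARM_CMD:-sudo alarmMgr --alarmstatus} 2>&1) || return 1
    [ -z "$out" ]
}
```

For example, platform_alarms_clear && echo "No platform alarms reported" prints the message only when the alarm list is empty. Do not re-execute the upgrade until this check passes.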

A.9 Active NOAM VIP: Re-execute the Server Upgrade

Return to the upgrade procedure being executed when the failure occurred. Re-execute the upgrade for the failed server using the Upgrade Server option.

Note:
Once a server has failed while using the Automated Server Group Upgrade option, the Auto Upgrade option cannot be used again on that server group. The remaining servers in that server group must be upgraded using the Upgrade Server option.