Appendix C
Error Messages
This appendix gives several error messages that you might see while operating or servicing your Netra CT server, their meanings, and the actions necessary for each. All error messages in this appendix are written to the /var/adm/messages file on your system.
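As a rough illustration, the following Python sketch filters log text for lines produced by the scsb driver. The prefix pattern is derived only from the message texts listed in this appendix. The function name and the sample lines are hypothetical, and real entries in /var/adm/messages carry additional syslog fields in front of the message.

```python
import re

# Matches the driver prefixes used by messages in this appendix:
# "scsb#0:", "scsb0:" and "SCSB:". This pattern is an assumption based
# only on the message texts listed here.
SCSB_PREFIX = re.compile(r"\b(?:scsb#?\d+|SCSB):")


def find_scsb_messages(lines):
    """Return the lines that contain an scsb driver message."""
    return [line.rstrip("\n") for line in lines if SCSB_PREFIX.search(line)]


if __name__ == "__main__":
    # Hypothetical sample input; on a server you might pass
    # open("/var/adm/messages") instead.
    sample = [
        "scsb#0: Slot 3 successfully taken offline\n",
        "unrelated kernel message\n",
        "scsb0: I2C TRANSFER Failed\n",
    ]
    for message in find_scsb_messages(sample):
        print(message)
```

Because the function accepts any iterable of lines, the same sketch works on a saved copy of the log as well as on the live file.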
This program must be run on the same chassis.
You must restart mcnet. Change directories to the mcn directory.
Alarm and Slot presence state bits do not match!
A problem was encountered when a hot swap alarm card was installed into the server.
Run prtdiag to determine the state of the I/O slot. If the alarm card is not listed when you run prtdiag, reinsert the alarm card into the slot.
SCSB: Should NOT remove SCB(#) while cPCI Slot # is in RESET with a possible bad board.
scsb#0: Slot # Now out of Reset!
The system controller board was removed from the server while the amber Okay to Remove LED was ON for an I/O slot.
Enable basic hot swap on all the I/O slots in the server using the instructions in Chapter 5. Once basic hot swap is enabled on all I/O slots, it is safe to remove the system controller board from the server.
scsb#0: ALERT! Lost HEALTHY# on Slot #, Occupant Offline
scsb#0: ALERT! Lost HEALTHY# on Slot #, Occupant Online!!!
The CompactPCI card has lost its HEALTHY report.
The CompactPCI card has failed or is damaged. Configure the card, and then unconfigure it using the instructions in Chapter 6. If the error messages repeat, then the card has failed. Replace the I/O card using the instructions in Chapter 6.
If the system has already taken the card offline because the card stopped sending a HEALTHY signal, the following message is displayed:
scsb#0: Slot # successfully taken offline
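Because the recommended action depends on whether the message repeats, it can help to count occurrences per slot. The sketch below is one possible way to do that in Python. It assumes the slot placeholder (#) appears in the log as a decimal number, and the function names and the repeat threshold of 2 are hypothetical choices, not values defined by the system.

```python
import re
from collections import Counter

# Assumes the "#" placeholder in the message is logged as decimal digits.
LOST_HEALTHY = re.compile(r"Lost HEALTHY# on Slot (\d+)")


def count_lost_healthy(lines):
    """Count 'Lost HEALTHY#' messages per slot number."""
    counts = Counter()
    for line in lines:
        match = LOST_HEALTHY.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def slots_with_repeats(lines, threshold=2):
    """Return slots whose message count reaches the (hypothetical) threshold."""
    counts = count_lost_healthy(lines)
    return sorted(slot for slot, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    # Hypothetical sample lines modeled on the messages above.
    sample = [
        "scsb#0: ALERT! Lost HEALTHY# on Slot 3, Occupant Online!!!",
        "scsb#0: ALERT! Lost HEALTHY# on Slot 5, Occupant Offline",
        "scsb#0: ALERT! Lost HEALTHY# on Slot 3, Occupant Online!!!",
    ]
    print(slots_with_repeats(sample))
```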
scsb#0: Bad (non friendly ?) Board in Slot # ? Taking it Offline.
The system has identified an I/O card that is sending repeated interrupts and has taken it offline.
Replace the I/O card using the instructions in Chapter 6.
scsb#0: Could not Update %s LEDs.
scsb#0: Could not Blink %s LEDs.
An I2C error has resulted in an LED change failure. The LEDs on the system status panel may give incorrect information as a result.
Use the prtdiag tool to print the correct LED states. Remove and reinstall the system controller board to correct the problem. Refer to Section 8.2, System Controller Board, for those instructions.
scsb#0: hsc_board_healthy: No Slot Info.
A disabled slot that is no longer being monitored by the system (due to errors or user request) is having HEALTHY state changes and sending full hot swap style interrupts to the CPU.
Remove the I/O card from that slot. If the error messages repeat, set the I/O slot to basic hot swap using the instructions in Section 5.2.3.2, Enabling Basic Hot Swap on I/O Slots.
scsb#0: hsc_enum_intr: No Last Board Insertion Info.
A CompactPCI card that is probably damaged was installed into an I/O slot in the system. The card has an error that causes it to continually interrupt the CPU with hot swap service events when there is no change to the board's state. The card continually reports itself 'inserted' after it has already been acknowledged. Since no board is 'claiming' the event, no slot # can be given. Also see the entry for scsb#0: Bad (non friendly ?) Board in Slot # ? Taking it Offline.
Remove the I/O card from the server using the instructions in Chapter 6. If the error message repeats, the system controller board may have failed. Try replacing the system controller board using the instructions in Section 8.2, System Controller Board.
scsb#0: hsc_restore: Cannot reset disconnected slot #
The system controller board was installed in the server while the amber Okay to Remove LED was ON for an I/O slot.
Enable basic hot swap on all the I/O slots in the server using the instructions in Chapter 5. Once basic hot swap is enabled on all I/O slots, remove the system controller board from the server.
scsb0: I2C TRANSFER Failed
scsb0: Error Reading Healthy# Registers
scsb#0: scsb_reset_slot: error reading Reset regs
An error occurred when the scsb driver received the retry command from the system controller board.
Retry. If the error persists, the system controller board is damaged and should be replaced. Refer to Section 8.2, System Controller Board, for those instructions.
scsb#0: no HEALTHY# signal on slot#
You tried to connect or configure a hot-swappable I/O card that was not reporting itself HEALTHY. The card has failed or was not inserted properly.
Remove the I/O card from the server and reinsert it, making sure the card is completely and properly inserted into the server. If the error message repeats, then the card has failed. Replace the I/O card using the instructions in Chapter 6.
scsb#0: Reset Not Asserted on Healthy# Failed slot#
You rebooted the system with a failed board. While the board is not reporting itself HEALTHY, the OpenBoot PROM has taken it out of reset and probed it anyway.
The board is probably damaged and should not be used. Unconfigure the board manually and remove the board from the system using the instructions in Chapter 6.
scsb#0: slot # Occupant configured, Regained HEALTHY#!
scsb#0: slot # Occupant Unconfigured, Regained HEALTHY#!
A CompactPCI card is sending conflicting HEALTHY and UNHEALTHY signals.
The card has failed. Replace the I/O card using the instructions in Chapter 6.
scsb#0: Successfully Downgraded to Basic Hotswap Mode
Basic hot swap was enabled on the system.
scsb#0: Successfully Upgraded to Full Hotswap Mode
Full hot swap was enabled on the system.
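The two mode-change messages above can be used to work out which hot swap mode was set most recently. The following Python sketch shows one way to do this. The function name is hypothetical, and the sketch assumes the lines are supplied in chronological order, as they would be when read from a log file.

```python
def last_hot_swap_mode(lines):
    """Return 'basic', 'full', or None, based on the last mode-change message.

    Assumes the lines are in chronological order.
    """
    mode = None
    for line in lines:
        if "Successfully Downgraded to Basic Hotswap Mode" in line:
            mode = "basic"
        elif "Successfully Upgraded to Full Hotswap Mode" in line:
            mode = "full"
    return mode


if __name__ == "__main__":
    # Hypothetical sample lines using the message texts above.
    sample = [
        "scsb#0: Successfully Upgraded to Full Hotswap Mode",
        "scsb#0: Successfully Downgraded to Basic Hotswap Mode",
    ]
    print(last_hot_swap_mode(sample))
```

A result of None only means that no mode-change message was found in the lines supplied. It says nothing about the mode the system is actually in.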
Interrupt Level 4--Not serviced
This message, occurring intermittently, is always the result of the underlying hardware doing something unpredictable.
Transient interrupts occur when, for example, a fan is starting to fail: it fails long enough to generate an interrupt and then resumes operation. By the time the fan driver is queried, it denies having raised the interrupt because the fan is now functioning normally.
The condition is a result of the architecture of interrupt generation and response. As long as the generating hardware has resumed normal operation, no further action is required.
Interrupt Level 4--Not serviced
This message, occurring continuously, signals a soft hang of the system. The presenting symptom is that the system is noticeably sluggish because it is busy processing interrupts.
A soft hang occurs when a component such as a power supply sends a level-high interrupt and keeps it high. The kernel notices and polls the devices. Each device answers negative, including the culprit power supply. Meanwhile, the CPU continues with minimal work before returning to the querying process. This is a serious problem because the failing component remains unidentified.
Completely power the server off and then on again using the instructions in Chapter 2. When the system boots, it always starts with interrupts low (masked) and attaches the drivers one by one. You can also use OpenBoot PROM commands to probe the components and determine which one has failed.
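The difference between the intermittent and the continuous form of this message is a matter of frequency. As a minimal sketch, the Python function below classifies a series of occurrence times. The function name, the 60-second window, and the threshold of 10 messages are hypothetical values chosen for illustration only. This appendix does not define a numeric boundary between the two cases.

```python
def classify_interrupt_pattern(timestamps, window_seconds=60.0, continuous_count=10):
    """Classify occurrences of 'Interrupt Level 4--Not serviced'.

    timestamps: occurrence times in seconds (any common reference point).
    Returns "none", "intermittent", or "continuous". The window and
    count defaults are illustrative assumptions, not documented limits.
    """
    times = sorted(timestamps)
    if not times:
        return "none"
    start = 0
    for end in range(len(times)):
        # Shrink the window until it spans at most window_seconds.
        while times[end] - times[start] > window_seconds:
            start += 1
        if end - start + 1 >= continuous_count:
            return "continuous"
    return "intermittent"


if __name__ == "__main__":
    # Three messages an hour apart, then a burst of one per second.
    print(classify_interrupt_pattern([0, 3600, 7200]))
    print(classify_interrupt_pattern(range(20)))
```

Extracting the timestamps from the log is left out of the sketch because the timestamp format depends on the syslog configuration.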
NO ADDRESS ACK 80
This message indicates a problem with I2C. It is often the pcf8584 driver that complains, and the message is followed by the address the driver was trying to access. For example, NO ADDRESS ACK 80 indicates a problem with address 80, which is the fixed address of the system controller board.
Most of the Sun drivers print a secondary error message, but the principal error message comes from pcf8584. The interface to this driver is through an ioctl, so it's done through software. This message indicates a problem, but not its severity. Sometimes such a message is normal.
For example, when a power supply is removed, the Present line goes low and the SCB sets the bit high (interrupt). The kernel pcf8584 driver goes down the device line querying for interrupts in the order in which the devices boot, each one answering. The message 8584 NO ADDR ACK 0x9E occurs when the device is removed. Because it happened after the driver tried to query the hardware, this spurious error message occurs. This happens with fans and power supplies.
If the error message occurs during a hot swap operation, it is erroneous and should be ignored. If the error message occurs during normal operation, it may indicate a problem with the I2C device.
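To see which I2C address a message refers to, the address can be pulled out of the message text. The Python sketch below handles both spellings shown above (NO ADDRESS ACK 80 and 8584 NO ADDR ACK 0x9E). It assumes the address is hexadecimal even when the 0x prefix is missing. The function name and the lookup table are hypothetical, and the only entry in the table is the system controller board address stated in this appendix.

```python
import re

# Accepts "NO ADDRESS ACK 80" and "NO ADDR ACK 0x9E".
ACK_PATTERN = re.compile(r"NO ADDR(?:ESS)? ACK (?:0x)?([0-9A-Fa-f]+)")

# Only address 80 is identified in this appendix; treating it as
# hexadecimal (0x80) is an assumption.
KNOWN_ADDRESSES = {0x80: "system controller board"}


def parse_no_address_ack(line):
    """Return (address, device name) for a NO ADDRESS ACK line, else None."""
    match = ACK_PATTERN.search(line)
    if match is None:
        return None
    address = int(match.group(1), 16)
    return address, KNOWN_ADDRESSES.get(address, "unknown device")


if __name__ == "__main__":
    for text in ("NO ADDRESS ACK 80", "8584 NO ADDR ACK 0x9E"):
        print(text, "->", parse_no_address_ack(text))
```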
Bus busy, cleared after initializing
This is a transient I2C error message.
Usually no action is necessary because the system should recover from most transient I2C errors. If the system becomes unresponsive, completely power the server off and then power it back on. Watch the Power-On Self-Test messages to determine the cause of the error.
Copyright © 2003, Sun Microsystems, Inc. All rights reserved.