This section describes system Alerts, how they are customized, and where to find alert logs. To monitor statistics from Analytics, create custom threshold alerts. To configure the system to respond to certain types of alerts, use Alert actions.
Important appliance events, including hardware and software faults, trigger alerts. These alerts appear in the Maintenance Logs and may also be configured to execute any of the Alert actions.
The following actions are supported.
An email containing the alert details can be sent. The configuration requires an email address and email subject line. The following is a sample email sent based on a threshold alert:
From email@example.com Mon Oct 13 15:24:47 2009
Date: Mon, 13 Oct 2009 15:24:21 +0000 (GMT)
From: Appliance on caji <firstname.lastname@example.org>
Subject: High CPU on caji
To: email@example.com

SUNW-MSG-ID: AK-8000-TT, TYPE: Alert, VER: 1, SEVERITY: Minor
EVENT-TIME: Mon Oct 13 15:24:12 2009
PLATFORM: i86pc, CSN: 0809QAU005, HOSTNAME: caji
SOURCE: svc:/appliance/kit/akd:default, REV: 1.0
EVENT-ID: 15a53214-c4e7-eae4-dae6-a652a51ea29b
DESC: cpu.utilization threshold of 90 is violated.
AUTO-RESPONSE: None.
IMPACT: The impact depends on what statistic is being monitored.
REC-ACTION: The suggested action depends on what statistic is being monitored.
SEE: https://192.168.2.80:215/#maintenance/alert=15a53214-c4e7-eae4-dae6-a652a51ea29b
Details on how the appliance sends mail can be configured on the SMTP service screen.
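A mail recipient that wants to act on these messages automatically could split the alert body into its named fields. The following Python sketch is illustrative only: the `KEY: value` layout, with several pairs separated by commas on one line, is inferred from the single sample email above and is not a documented format, and the names `parse_alert_body` and `SAMPLE_BODY` are hypothetical.

```python
import re

# A field name is assumed to be upper-case letters and hyphens, followed by
# a colon, appearing at the start of a line or after a comma. This is
# inferred from the sample email only.
KEY_RE = re.compile(r"(?:^|,\s*)([A-Z][A-Z-]+):\s+")

# Body lines copied from the sample alert email.
SAMPLE_BODY = """\
SUNW-MSG-ID: AK-8000-TT, TYPE: Alert, VER: 1, SEVERITY: Minor
EVENT-TIME: Mon Oct 13 15:24:12 2009
PLATFORM: i86pc, CSN: 0809QAU005, HOSTNAME: caji
SOURCE: svc:/appliance/kit/akd:default, REV: 1.0
EVENT-ID: 15a53214-c4e7-eae4-dae6-a652a51ea29b
DESC: cpu.utilization threshold of 90 is violated.
"""


def parse_alert_body(text):
    """Return a dict mapping each field name to its value."""
    fields = {}
    for line in text.splitlines():
        matches = list(KEY_RE.finditer(line))
        for i, match in enumerate(matches):
            # A value runs until the next field name on the same line,
            # or to the end of the line for the last field.
            end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
            fields[match.group(1)] = line[match.end():end].strip()
    return fields


fields = parse_alert_body(SAMPLE_BODY)
print(fields["SEVERITY"], "-", fields["DESC"])
```

Because the pattern only accepts upper-case names, ordinary mail headers such as `Subject:` are ignored, and colons inside values (timestamps, the `svc:` source string) do not start a new field.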
An SNMP trap containing alert details can be sent, if an SNMP trap destination is configured in the SNMP service, and that service is online. The following is an example SNMP trap, as seen from the Net-SNMP tool snmptrapd -P:
# /usr/sfw/sbin/snmptrapd -P
2009-10-13 15:31:15 NET-SNMP version 5.0.9 Started.
2009-10-13 15:31:34 caji.com [192.168.2.80]:
iso.220.127.116.11.18.104.22.168 = Timeticks: (2132104431) 246 days, 18:30:44.31
iso.22.214.171.124.126.96.36.199.1.0 = OID: iso.188.8.131.52.184.108.40.206.220.127.116.11
iso.18.104.22.168.22.214.171.124.126.96.36.199.188.8.131.52.184.108.40.206.220.127.116.11.
    18.104.22.168.22.214.171.124.126.96.36.199.188.8.131.52.184.108.40.206.50.54.
    98.55.57 = STRING: "7cf0acd4-30c1-4c19-e9cb-ac27f7126b79"
iso.220.127.116.11.18.104.22.168.22.214.171.124.126.96.36.199.188.8.131.52.184.108.40.206.
    220.127.116.11.18.104.22.168.22.214.171.124.126.96.36.199.188.8.131.52.50.54.
    98.55.57 = STRING: "alert.ak.xmlrpc.threshold.violated"
iso.184.108.40.206.220.127.116.11.18.104.22.168.22.214.171.124.126.96.36.199.52.45.51.
    188.8.131.52.184.108.40.206.220.127.116.11.18.104.22.168.22.214.171.124.49.50.
    126.96.36.199 = STRING: "cpu.utilization threshold of 90 is violated."
A syslog message containing alert details can be sent to one or more remote systems, if the Syslog service is enabled. Refer to the documentation describing the Syslog Relay service for example syslog payloads and a description of how to configure syslog receivers on other operating systems.
Analytics Datasets may be resumed or suspended. This is particularly useful when tracking down sporadic performance issues, and when enabling these datasets 24x7 is not desirable.
For example, imagine you notice a spike in CPU activity once or twice a week, and other analytics show an associated drop in NFS performance. You enable some additional datasets, but you still do not have enough information to prove what the problem is. If you could enable the NFS by hostname and filename datasets, you are certain you would understand the cause much better. However, those particular datasets can be heavy-handed: leaving them enabled 24x7 will degrade performance for everyone. This is where the resume/suspend dataset actions may be of use. A threshold alert could be configured to resume the paused NFS by hostname and filename datasets only when the CPU activity spike is detected; a second alert can be configured to then suspend those datasets after a short interval of data is collected. The end result: you collect the data you need only during the issue, and minimize the performance impact of this data collection.
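The scenario above can be pictured with a small stand-alone Python model. It does not call any appliance interface. The function name `simulate_capture`, the dataset labels, the sample values, and the idea of counting the capture interval in samples are all assumptions made for illustration.

```python
def simulate_capture(cpu_samples, dataset_names, threshold=90, capture_samples=2):
    """Toy model of the resume/suspend pattern.

    For each CPU sample, return the list of datasets that are collecting.
    A sample above `threshold` stands in for the first alert (resume);
    once `capture_samples` further samples have been collected, the
    datasets are suspended again, standing in for the second alert.
    """
    resumed = False
    remaining = 0
    timeline = []
    for cpu in cpu_samples:
        if cpu > threshold:
            # Spike detected: resume the expensive datasets.
            resumed = True
            remaining = capture_samples
        elif remaining > 0:
            # Still inside the capture interval.
            remaining -= 1
        else:
            # Capture interval is over: suspend again.
            resumed = False
        timeline.append(list(dataset_names) if resumed else [])
    return timeline


names = ["NFS by hostname", "NFS by filename"]
for cpu, active in zip([50, 95, 60, 60, 50, 50],
                       simulate_capture([50, 95, 60, 60, 50, 50], names)):
    print(cpu, active)
```

In this model the datasets collect data only for the spike sample and the two samples after it, which is the point of the technique: the cost of the heavy datasets is paid only around the event of interest.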
These actions resume or suspend an entire Analytics Worksheet, which may contain numerous datasets. The reasons for doing this are similar to those for resuming and suspending datasets.
These are alerts based on the statistics from Analytics. The following are properties when creating threshold alerts:
The "Add Threshold Alert" dialog has been organized so that it can be read as though it is a paragraph describing the alert. The default reads:
Threshold: CPU: percent utilization exceeds 95 percent
Timing: for at least 5 minutes only between 0:00 and 0:00 only during weekdays
Repost alert every 5 minutes while this condition persists.
Also post alert when this condition clears for at least 5 minutes
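One way to read the default settings above is as a small state machine over a stream of readings. The Python sketch below is an interpretation, not the appliance's implementation. It assumes one reading per minute, treats "at least 5 minutes" as five consecutive readings, reposts only on readings that are above the limit, and leaves out the time-of-day and day-of-week window. The function and parameter names are hypothetical.

```python
def evaluate_threshold(samples, limit=95, hold=5, repost=5, clear=5):
    """Return (minute, event) tuples for a list of per-minute readings.

    Events are "post" (condition held for `hold` minutes), "repost"
    (every `repost` minutes while it persists) and "clear" (condition
    absent for `clear` minutes).
    """
    events = []
    over = 0          # consecutive minutes above the limit
    under = 0         # consecutive minutes at or below the limit
    active = False    # True between a "post" and the matching "clear"
    last_post = None  # minute of the most recent post or repost

    for minute, value in enumerate(samples):
        if value > limit:
            over += 1
            under = 0
        else:
            under += 1
            over = 0

        if not active and over >= hold:
            active = True
            last_post = minute
            events.append((minute, "post"))
        elif active and value > limit and minute - last_post >= repost:
            last_post = minute
            events.append((minute, "repost"))
        elif active and under >= clear:
            active = False
            events.append((minute, "clear"))
    return events


# Twelve minutes above the limit, then six minutes below it.
print(evaluate_threshold([96] * 12 + [50] * 6))
```

With these inputs the alert is posted once the condition has held for five readings, reposted five minutes later, and cleared after five consecutive readings at or below the limit. A burst shorter than five minutes produces no events at all.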