Sometimes Solaris Volume Manager encounters a problem, such as being unable to write to a volume due to physical errors at the slice level. When a problem occurs, Solaris Volume Manager changes the status of the volume so that system administrators can be informed. However, unless you regularly check the status in the Solaris Volume Manager GUI through the Solaris Management Console, or by running the metastat command, you might not see these status changes promptly.

This chapter provides information about the tools available to monitor Solaris Volume Manager. One tool is the Solaris Volume Manager SNMP agent, which is a subagent of the Solstice Enterprise Agents™ monitoring software. In addition to configuring this tool to report SNMP traps, you can create a shell script to actively monitor many Solaris Volume Manager functions. This shell script can run as a cron job and is valuable in identifying potential problems.
This chapter includes the following information:
The following task map identifies the procedures that are needed to manage error reporting for Solaris Volume Manager.
| Task | Description | For Instructions |
| --- | --- | --- |
| Configure the mdmonitord daemon to periodically check for errors | Configure the error-checking interval used by the mdmonitord daemon by editing the /lib/svc/method/svc-mdmonitor script. | |
| Configure the Solaris Volume Manager SNMP agent | Edit the configuration files in the /etc/snmp/conf directory so that Solaris Volume Manager properly sends traps to the appropriate system. | |
| Monitor Solaris Volume Manager with a script run by the cron command | Create or adapt a script to check for errors, then run the script from the cron command. | |
Solaris Volume Manager includes the /usr/sbin/mdmonitord daemon. When a disk fails, Solaris Volume Manager detects the failure and generates an error. This error event triggers the mdmonitord daemon to perform a check of RAID-1 (mirror) volumes, RAID-5 volumes, and hot spares. However, you can also configure this program to actively check for errors at an interval that you specify.
Edit the /lib/svc/method/svc-mdmonitor script to add the time interval for periodic checking.

Become superuser.

Open the /lib/svc/method/svc-mdmonitor script in the editor of your choice. Locate the following section in the script:
```
$MDMONITORD
error=$?
case $error in
0)      exit 0
        ;;

*)      echo "Could not start $MDMONITORD. Error $error."
        exit 0
```
Change the line that starts the mdmonitord command by adding a -t flag and the number of seconds between checks.
```
$MDMONITORD -t 3600
error=$?
case $error in
0)      exit 0
        ;;

*)      echo "Could not start $MDMONITORD. Error $error."
        exit 0
        ;;
esac
```
Restart the mdmonitord daemon to activate your changes.
```
# svcadm restart system/mdmonitor
```
For more information, see the mdmonitord(1M) man page.
The Solaris Volume Manager SNMP trap agent requires the core packages SUNWlvmr and SUNWlvma, as well as the packages for the Solstice Enterprise Agents. These core packages include:
SUNWmibii
SUNWsacom
SUNWsadmi
SUNWsasnm
These packages are part of the Solaris OS. They are installed by default unless the package selection was modified at install time or a minimal set of packages was installed. To confirm that these packages are available, use the pkginfo pkgname command, as in pkginfo SUNWsasnm. After you confirm that all five packages are available, you need to configure the Solaris Volume Manager SNMP agent, as described in the following section.
By default, the Solaris Volume Manager SNMP agent is disabled. Use the following procedure to enable SNMP traps.

Whenever you upgrade the Solaris OS, you will probably need to edit the /etc/snmp/conf/enterprises.oid file and append the line in Step 6 again, then restart the Solaris Enterprise Agents server.

After you complete this procedure, your system will issue SNMP traps to the host or hosts that you specified. You will need to use an appropriate SNMP monitor, such as Solstice Enterprise Agents software, to view the traps as they are issued.

Set the mdmonitord command to probe your system regularly to help ensure that you receive traps if problems arise. See Configuring the mdmonitord Command for Periodic Error Checking. Also see Monitoring Solaris Volume Manager With a cron Job for additional error-checking options.
Become superuser.

Move the /etc/snmp/conf/mdlogd.rsrc- configuration file to /etc/snmp/conf/mdlogd.rsrc.
```
# mv /etc/snmp/conf/mdlogd.rsrc- /etc/snmp/conf/mdlogd.rsrc
```
Edit the /etc/snmp/conf/mdlogd.acl file to specify which hosts should receive SNMP traps. Look in the file for the following:
```
trap = {
        {
        trap-community = SNMP-trap
        hosts = corsair
        {
          enterprise = "Solaris Volume Manager"
          trap-num = 1, 2, 3
        }
```
Change the line that contains hosts = corsair to specify the name of the host that is to receive Solaris Volume Manager SNMP traps. For example, to send SNMP traps to lexicon, you would change the line to hosts = lexicon. If you want to include multiple hosts, provide a comma-delimited list of host names, as in hosts = lexicon, idiom.
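For illustration, after this edit (using the example host names lexicon and idiom mentioned above), the trap block in /etc/snmp/conf/mdlogd.acl would look similar to the following fragment:

```
trap = {
        {
        trap-community = SNMP-trap
        hosts = lexicon, idiom
        {
          enterprise = "Solaris Volume Manager"
          trap-num = 1, 2, 3
        }
```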
You also need to edit the /etc/snmp/conf/snmpdx.acl file to specify which hosts should receive the SNMP traps.

Find the block that begins with trap = and add the same list of hosts that you added in the previous step. This section might be commented out with #. If so, you must remove the # at the beginning of the required lines in this section. Additional lines in the trap section are also commented out. However, you can leave those lines alone or delete them for clarity. After uncommenting the required lines and updating the hosts line, this section could look similar to the following:
```
###################
# trap parameters #
###################
trap = {
        {
        trap-community = SNMP-trap
        hosts = lexicon
        {
          enterprise = "sun"
          trap-num = 0, 1, 2-5, 6-16
        }
#       {
#         enterprise = "3Com"
#         trap-num = 4
#       }
#       {
#         enterprise = "snmp"
#         trap-num = 0, 2, 5
#       }
#       }
#       {
#         trap-community = jerry-trap
#         hosts = jerry, nanak, hubble
#         {
#           enterprise = "sun"
#           trap-num = 1, 3
#         }
#         {
#           enterprise = "snmp"
#           trap-num = 1-3
#         }
        }
}
```
Make sure that the /etc/snmp/conf/snmpdx.acl file contains the same number of opening and closing brackets.
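The bracket check can be scripted. The following is a minimal sketch that counts opening and closing braces; it uses a self-contained sample file in /tmp, whereas on a real system you would point it at /etc/snmp/conf/snmpdx.acl:

```shell
# Count { and } in an acl-style file; the counts must match for the
# file to parse. The sample file below stands in for snmpdx.acl.
acl=/tmp/snmpdx.acl.sample
printf 'trap = {\n {\n  hosts = lexicon\n }\n}\n' > "$acl"
opens=$(tr -cd '{' < "$acl" | wc -c)
closes=$(tr -cd '}' < "$acl" | wc -c)
if [ "$opens" -eq "$closes" ]; then
    echo "brackets balanced"
else
    echo "bracket mismatch: $opens open vs $closes close"
fi
rm -f "$acl"
```

Running such a check after each edit catches the most common cause of a broken snmpdx.acl file before the agent is restarted.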
In the /etc/snmp/conf/snmpdx.acl file, add a new Solaris Volume Manager section inside the section that you uncommented in the previous step.
```
        trap-community = SNMP-trap
        hosts = lexicon
        {
          enterprise = "sun"
          trap-num = 0, 1, 2-5, 6-16
        }
        {
          enterprise = "Solaris Volume Manager"
          trap-num = 1, 2, 3
        }
```
Note that the four lines that you added are placed immediately after the enterprise = "sun" block.

Append the following line to the /etc/snmp/conf/enterprises.oid file:
"Solaris Volume Manager" "1.3.6.1.4.1.42.104" |
Stop and restart the Solstice Enterprise Agents server.
```
# /etc/init.d/init.snmpdx stop
# /etc/init.d/init.snmpdx start
```
The Solaris Volume Manager SNMP agent does not issue traps for all of the problems that an administrator needs to know about. Specifically, the agent issues traps only in the following instances:
A RAID-1 or RAID-5 subcomponent goes into a "Needs Maintenance" state

A hot spare is swapped into service

A hot spare starts to resynchronize

A hot spare completes resynchronization

A mirror is taken offline

A disk set is taken by another host and the current host panics
Many problems, such as an unavailable disk with RAID-0 volumes or soft partitions on it, do not result in SNMP traps, even when reads and writes to the device are attempted. SCSI or IDE errors are generally reported in these cases. However, for those errors to be reported to a monitoring console, the traps must be issued by a different SNMP agent.
To automatically check your Solaris Volume Manager configuration for errors, create a script that the cron utility can run periodically.

The following example shows a script that you can adapt and modify for your needs.

This script serves as a starting point for automating Solaris Volume Manager error checking. You might need to modify this script for your own configuration.
```
#!/bin/ksh
#ident  "@(#)metacheck.sh       1.3     96/06/21 SMI"
#ident='%Z%%M%  %I%     %E% SMI'
#
# Copyright (c) 1999 by Sun Microsystems, Inc.
#
# metacheck
#
# Check on the status of the metadevice configuration.  If there is a problem
# return a non zero exit code.  Depending on options, send email notification.
#
# -h
#       help
# -s setname
#       Specify the set to check.  By default, the 'local' set will be checked.
# -m recipient [recipient...]
#       Send email notification to the specified recipients.  This
#       must be the last argument.  The notification shows up as a short
#       email message with a subject of
#          "Solaris Volume Manager Problem: metacheck.who.nodename.setname"
#       which summarizes the problem(s) and tells how to obtain detailed
#       information.  The "setname" is from the -s option, "who" is from
#       the -w option, and "nodename" is reported by uname(1).
#       Email notification is further affected by the following options:
#           -f  to suppress additional messages after a problem
#               has been found.
#           -d  to control the suppression.
#           -w  to identify who generated the email.
#           -t  to force email even when there is no problem.
# -w who
#       indicate who is running the command.  By default, this is the
#       user-name as reported by id(1M).  This is used when sending
#       email notification (-m).
# -f
#       Enable filtering.  Filtering applies to email notification (-m).
#       Filtering requires root permission.  When sending email notification
#       the file /etc/lvm/metacheck.setname.pending is used to
#       control the filter.  The following matrix specifies the behavior
#       of the filter:
#
#           problem_found   file_exists
#           yes             no      Create file, send notification
#           yes             yes     Resend notification if the current date
#                                   (as specified by -d datefmt) is
#                                   different than the file date.
#           no              yes     Delete file, send notification
#                                   that the problem is resolved.
#           no              no      Send notification if -t specified.
#
# -d datefmt
#       Specify the format of the date for filtering (-f).  This option
#       controls how often re-notification via email occurs.  If the
#       current date according to the specified format (strftime(3C)) is
#       identical to the date contained in the
#       /etc/lvm/metacheck.setname.pending file then the message is
#       suppressed.  The default date format is "%D", which will send one
#       re-notification per day.
# -t
#       Test mode.  Enable email generation even when there is no problem.
#       Used for end-to-end verification of the mechanism and email addresses.
#
#
# These options are designed to allow integration of metacheck
# into crontab.  For example, a root crontab entry of:
#
# 0,15,30,45 * * * * /usr/sbin/metacheck -f -w SVMcron \
#       -d '\%D \%h' -m notice@example.com 2148357243.8333033@pager.example.com
#
# would check for problems every 15 minutes, and generate an email to
# notice@example.com (and send to an email pager service) every hour when
# there is a problem.  Note the \ prior to the '%' characters for a
# crontab entry.  Bounced email would come back to root@nodename.
# The subject line for email generated by the above line would be
#       Solaris Volume Manager Problem: metacheck.SVMcron.nodename.local
#

# display a debug line to controlling terminal (works in pipes)
decho()
{
	if [ "$debug" = "yes" ] ; then
		echo "DEBUG: $*" < /dev/null > /dev/tty 2>&1
	fi
}

# if string $1 is in $2-* then return $1, else return ""
strstr()
{
	typeset look="$1"
	typeset ret=""

	shift
#	decho "strstr LOOK .$look. FIRST .$1."
	while [ $# -ne 0 ] ; do
		if [ "$look" = "$1" ] ; then
			ret="$look"
		fi
		shift
	done
	echo "$ret"
}

# if string $1 is in $2-* then delete it.  return result
strdstr()
{
	typeset look="$1"
	typeset ret=""

	shift
#	decho "strdstr LOOK .$look. FIRST .$1."
	while [ $# -ne 0 ] ; do
		if [ "$look" != "$1" ] ; then
			ret="$ret $1"
		fi
		shift
	done
	echo "$ret"
}

merge_continued_lines()
{
	awk -e '\
	BEGIN { line = "";} \
	$NF == "\\" { \
	    $NF = ""; \
	    line = line $0; \
	    next; \
	} \
	$NF != "\\" { \
	    if ( line != "" ) { \
		print line $0; \
		line = ""; \
	    } else { \
		print $0; \
	    } \
	}'
}

# trim out stuff not associated with metadevices
find_meta_devices()
{
	typeset devices=""

#	decho "find_meta_devices .$*."
	while [ $# -ne 0 ] ; do
		case $1 in
		d+([0-9]) )	# metadevice name
			devices="$devices $1"
			;;
		esac
		shift
	done
	echo "$devices"
}

# return the list of top level metadevices
toplevel()
{
	typeset comp_meta_devices=""
	typeset top_meta_devices=""
	typeset devices=""
	typeset device=""
	typeset comp=""

	metastat$setarg -p | merge_continued_lines | while read line ; do
		echo "$line"
		devices=`find_meta_devices $line`
		set -- $devices
		if [ $# -ne 0 ] ; then
			device=$1
			shift
			# check to see if device already referred to as component
			comp=`strstr $device $comp_meta_devices`
			if [ -z $comp ] ; then
				top_meta_devices="$top_meta_devices $device"
			fi
			# add components to component list, remove from top list
			while [ $# -ne 0 ] ; do
				comp=$1
				comp_meta_devices="$comp_meta_devices $comp"
				top_meta_devices=`strdstr $comp $top_meta_devices`
				shift
			done
		fi
	done > /dev/null 2>&1
	echo $top_meta_devices
}

#
# - MAIN
#
METAPATH=/usr/sbin
PATH=/usr/bin:$METAPATH
USAGE="usage: metacheck [-s setname] [-h] [[-t] [-f [-d datefmt]] \
	[-w who] -m recipient [recipient...]]"

datefmt="%D"
debug="no"
filter="no"
mflag="no"
set="local"
setarg=""
testarg="no"
who=`id | sed -e 's/^uid=[0-9][0-9]*(//' -e 's/).*//'`

while getopts d:Dfms:tw: flag
do
	case $flag in
	d)	datefmt=$OPTARG;
		;;
	D)	debug="yes"
		;;
	f)	filter="yes"
		;;
	m)	mflag="yes"
		;;
	s)	set=$OPTARG;
		if [ "$set" != "local" ] ; then
			setarg=" -s $set";
		fi
		;;
	t)	testarg="yes";
		;;
	w)	who=$OPTARG;
		;;
	\?)	echo $USAGE
		exit 1
		;;
	esac
done

# if mflag specified then everything else part of recipient
shift `expr $OPTIND - 1`
if [ $mflag = "no" ] ; then
	if [ $# -ne 0 ] ; then
		echo $USAGE
		exit 1
	fi
else
	if [ $# -eq 0 ] ; then
		echo $USAGE
		exit 1
	fi
fi
recipients="$*"

curdate_filter=`date +$datefmt`
curdate=`date`
node=`uname -n`

# establish files
msg_f=/tmp/metacheck.msg.$$
msgs_f=/tmp/metacheck.msgs.$$
metastat_f=/tmp/metacheck.metastat.$$
metadb_f=/tmp/metacheck.metadb.$$
metahs_f=/tmp/metacheck.metahs.$$
pending_f=/etc/lvm/metacheck.$set.pending
files="$metastat_f $metadb_f $metahs_f $msg_f $msgs_f"

rm -f $files > /dev/null 2>&1
trap "rm -f $files > /dev/null 2>&1; exit 1" 1 2 3 15

# Check to see if metadb is capable of running
have_metadb="yes"
metadb$setarg > $metadb_f 2>&1
if [ $? -ne 0 ] ; then
	have_metadb="no"
fi
grep "there are no existing databases" < $metadb_f > /dev/null 2>&1
if [ $? -eq 0 ] ; then
	have_metadb="no"
fi
grep "/dev/md/admin" < $metadb_f > /dev/null 2>&1
if [ $? -eq 0 ] ; then
	have_metadb="no"
fi

# check for problems accessing metadbs
retval=0
if [ "$have_metadb" = "no" ] ; then
	retval=1
	echo "metacheck: metadb problem, can't run '$METAPATH/metadb$setarg'" \
	    >> $msgs_f
else
	# snapshot the state
	metadb$setarg 2>&1 | sed -e '1d' | merge_continued_lines > $metadb_f
	metastat$setarg 2>&1 | merge_continued_lines > $metastat_f
	metahs$setarg -i 2>&1 | merge_continued_lines > $metahs_f

	#
	# Check replicas for problems, capital letters in the flags
	# indicate an error, fields are separated by tabs.
	#
	problem=`awk < $metadb_f -F\t '{if ($1 ~ /[A-Z]/) print $1;}'`
	if [ -n "$problem" ] ; then
		retval=`expr $retval + 64`
		echo "\
metacheck: metadb problem, for more detail run:\n\t$METAPATH/metadb$setarg -i" \
		    >> $msgs_f
	fi

	#
	# Check the metadevice state
	#
	problem=`awk < $metastat_f -e \
	    '/State:/ {if ($2 != "Okay" && $2 != "Resyncing") print $0;}'`
	if [ -n "$problem" ] ; then
		retval=`expr $retval + 128`
		echo "\
metacheck: metadevice problem, for more detail run:" \
		    >> $msgs_f

		# refine the message to toplevel metadevices that have a problem
		top=`toplevel`
		set -- $top
		while [ $# -ne 0 ] ; do
			device=$1
			problem=`metastat $device | awk -e \
			    '/State:/ {if ($2 != "Okay" && $2 != "Resyncing") print $0;}'`
			if [ -n "$problem" ] ; then
				echo "\t$METAPATH/metastat$setarg $device" >> $msgs_f
				# find out what is mounted on the device
				mp=`mount|awk -e '/\/dev\/md\/dsk\/'$device'[ \t]/{print $1;}'`
				if [ -n "$mp" ] ; then
					echo "\t\t$mp mounted on $device" >> $msgs_f
				fi
			fi
			shift
		done
	fi

	#
	# Check the hotspares to see if any have been used.
	#
	problem=""
	grep "no hotspare pools found" < $metahs_f > /dev/null 2>&1
	if [ $? -ne 0 ] ; then
		problem=`awk < $metahs_f -e \
		    '/blocks/ { if ( $2 != "Available" ) print $0;}'`
	fi
	if [ -n "$problem" ] ; then
		retval=`expr $retval + 256`
		echo "\
metacheck: hot spare in use, for more detail run:\n\t$METAPATH/metahs$setarg -i" \
		    >> $msgs_f
	fi
fi

# If any errors occurred, then mail the report
if [ $retval -ne 0 ] ; then
	if [ -n "$recipients" ] ; then
		re=""
		if [ -f $pending_f ] && [ "$filter" = "yes" ] ; then
			re="Re: "
			# we have a pending notification, check date to see if we resend
			penddate_filter=`cat $pending_f | head -1`
			if [ "$curdate_filter" != "$penddate_filter" ] ; then
				rm -f $pending_f > /dev/null 2>&1
			else
				if [ "$debug" = "yes" ] ; then
					echo "metacheck: email problem notification still pending"
					cat $pending_f
				fi
			fi
		fi
		if [ ! -f $pending_f ] ; then
			if [ "$filter" = "yes" ] ; then
				echo "$curdate_filter\n\tDate:$curdate\n\tTo:$recipients" \
				    > $pending_f
			fi
			echo "\
Solaris Volume Manager: $node: metacheck$setarg: Report: $curdate" >> $msg_f
			echo "\
--------------------------------------------------------------" >> $msg_f
			cat $msg_f $msgs_f | mailx -s \
			    "${re}Solaris Volume Manager Problem: metacheck.$who.$set.$node" $recipients
		fi
	else
		cat $msgs_f
	fi
else
	# no problems detected,
	if [ -n "$recipients" ] ; then
		# default is to not send any mail, or print anything.
		echo "\
Solaris Volume Manager: $node: metacheck$setarg: Report: $curdate" >> $msg_f
		echo "\
--------------------------------------------------------------" >> $msg_f
		if [ -f $pending_f ] && [ "$filter" = "yes" ] ; then
			# pending filter exists, remove it and send OK
			rm -f $pending_f > /dev/null 2>&1
			echo "Problem resolved" >> $msg_f
			cat $msg_f | mailx -s \
			    "Re: Solaris Volume Manager Problem: metacheck.$who.$node.$set" $recipients
		elif [ "$testarg" = "yes" ] ; then
			# for testing, send mail every time even though there is no problem
			echo "Messaging test, no problems detected" >> $msg_f
			cat $msg_f | mailx -s \
			    "Solaris Volume Manager Problem: metacheck.$who.$node.$set" $recipients
		fi
	else
		echo "metacheck: Okay"
	fi
fi

rm -f $files > /dev/null 2>&1

exit $retval
```
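The heart of the script's metadevice check is a one-line awk filter that flags any State: line that is neither Okay nor Resyncing. The following sketch demonstrates that filter in isolation, using made-up metastat-style output (the device names and states are hypothetical, since metastat itself exists only on Solaris):

```shell
# Feed the script's State filter some sample metastat-style output.
# Only lines whose state is neither "Okay" nor "Resyncing" survive.
printf 'd10: Mirror\n    State: Okay\nd20: RAID\n    State: Needs Maintenance\n' |
awk '/State:/ { if ($2 != "Okay" && $2 != "Resyncing") print $0 }'
# prints only the "Needs Maintenance" line
```

This is why the script can compute a meaningful exit code: an empty result from the filter means every metadevice is healthy, while any surviving line marks a device that needs attention.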
For information on invoking scripts by using the cron utility, see the cron(1M) man page.