Sometimes Solaris Volume Manager encounters a problem, such as being unable to write to a volume due to physical errors at the slice level. When a problem occurs, Solaris Volume Manager changes the status of the volume so that system administrators can be informed. However, unless you regularly check the status in the Solaris Volume Manager GUI through the Solaris Management Console, or by running the metastat command, you might not see these status changes promptly.

This chapter provides information about the tools available to monitor Solaris Volume Manager. One tool is the Solaris Volume Manager SNMP agent, which is a subagent of the Solstice Enterprise Agents™ monitoring software. In addition to configuring this tool to report SNMP traps, you can create a shell script to actively monitor many Solaris Volume Manager functions. This shell script can run as a cron job and is valuable in identifying potential problems.
This chapter includes the following information:
The following task map identifies the procedures that are needed to manage error reporting for Solaris Volume Manager.
| Task | Description | For Instructions |
| --- | --- | --- |
| Configure the mdmonitord daemon to periodically check for errors | Configure the error-checking interval used by the mdmonitord daemon by editing the /lib/svc/method/svc-mdmonitor script. | |
| Configure the Solaris Volume Manager SNMP agent | Edit the configuration files in the /etc/snmp/conf directory so that Solaris Volume Manager properly sends traps to the appropriate system. | |
| Monitor Solaris Volume Manager with a script run by the cron command | Create or adapt a script to check for errors, then run the script from the cron command. | |
Solaris Volume Manager includes the /usr/sbin/mdmonitord daemon. When a disk fails, Solaris Volume Manager detects the failure and generates an error. This error event triggers the mdmonitord daemon to perform a check of RAID-1 (mirror) volumes, RAID-5 volumes, and hot spares. However, you can also configure this program to actively check for errors at an interval that you specify.
Edit the /lib/svc/method/svc-mdmonitor script to add the time interval for periodic checking.

Become superuser.

Open the /lib/svc/method/svc-mdmonitor script in the editor of your choice. Locate the following section in the script:
```
$MDMONITORD
error=$?
case $error in
0)      exit 0
        ;;

*)      echo "Could not start $MDMONITORD. Error $error."
        exit 0
```
Change the line that starts the mdmonitord command by adding a -t flag and the number of seconds between checks.
```
$MDMONITORD -t 3600
error=$?
case $error in
0)      exit 0
        ;;

*)      echo "Could not start $MDMONITORD. Error $error."
        exit 0
        ;;
esac
```
Restart the mdmonitord daemon to activate your changes.
```
# svcadm restart system/mdmonitor
```
For more information, see the mdmonitord(1M) man page.
The Solaris Volume Manager SNMP trap agent requires the core packages SUNWlvmr and SUNWlvma, as well as the packages for the Solstice Enterprise Agents. These core packages include:
SUNWmibii
SUNWsacom
SUNWsadmi
SUNWsasnm
These packages are part of the Solaris OS. They are installed by default unless the package selection was modified at install time or a minimal set of packages was installed. To confirm that these packages are available, use the pkginfo pkgname command, as in pkginfo SUNWsasnm. After you confirm that all five packages are available, you need to configure the Solaris Volume Manager SNMP agent, as described in the following section.
By default, the Solaris Volume Manager SNMP agent is disabled. Use the following procedure to enable SNMP traps.

Whenever you upgrade the Solaris OS, you will probably need to edit the /etc/snmp/conf/enterprises.oid file and append the line in Step 6 again, then restart the Solaris Enterprise Agents server.

After you complete this procedure, your system will issue SNMP traps to the host or hosts that you specified. You will need to use an appropriate SNMP monitor, such as Solstice Enterprise Agents software, to view the traps as they are issued.

Set the mdmonitord command to probe your system regularly to help ensure that you receive traps if problems arise. See Configuring the mdmonitord Command for Periodic Error Checking. Also see Monitoring Solaris Volume Manager With a cron Job for additional error-checking options.
Become superuser.

Move the /etc/snmp/conf/mdlogd.rsrc- configuration file to /etc/snmp/conf/mdlogd.rsrc.
```
# mv /etc/snmp/conf/mdlogd.rsrc- /etc/snmp/conf/mdlogd.rsrc
```
Edit the /etc/snmp/conf/mdlogd.acl file to specify which hosts should receive SNMP traps. Look in the file for the following:
```
trap = {
        {
        trap-community = SNMP-trap
        hosts = corsair
        {
          enterprise = "Solaris Volume Manager"
          trap-num = 1, 2, 3
        }
```
Change the line that contains hosts = corsair to specify the name of the host that is to receive Solaris Volume Manager SNMP traps. For example, to send SNMP traps to lexicon, you would change the line to hosts = lexicon. If you want to include multiple hosts, provide a comma-delimited list of host names, as in hosts = lexicon, idiom.
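For illustration, after this edit (using the example host names lexicon and idiom mentioned above), the trap block in /etc/snmp/conf/mdlogd.acl would look similar to the following fragment:

```
trap = {
        {
        trap-community = SNMP-trap
        hosts = lexicon, idiom
        {
          enterprise = "Solaris Volume Manager"
          trap-num = 1, 2, 3
        }
```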
You also need to edit the /etc/snmp/conf/snmpdx.acl file to specify which hosts should receive the SNMP traps.

Find the block that begins with trap = and add the same list of hosts that you added in the previous step. This section might be commented out with #. If so, you must remove the # at the beginning of the required lines in this section. Additional lines in the trap section are also commented out. However, you can leave those lines alone or delete them for clarity. After uncommenting the required lines and updating the hosts line, this section could look similar to the following:
```
###################
# trap parameters #
###################
trap = {
        {
        trap-community = SNMP-trap
        hosts = lexicon
        {
          enterprise = "sun"
          trap-num = 0, 1, 2-5, 6-16
        }
#       {
#         enterprise = "3Com"
#         trap-num = 4
#       }
#       {
#         enterprise = "snmp"
#         trap-num = 0, 2, 5
#       }
#       }
#       {
#         trap-community = jerry-trap
#         hosts = jerry, nanak, hubble
#         {
#           enterprise = "sun"
#           trap-num = 1, 3
#         }
#         {
#           enterprise = "snmp"
#           trap-num = 1-3
#         }
        }
}
```
Make sure that the /etc/snmp/conf/snmpdx.acl file contains the same number of opening and closing brackets.
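The bracket check can be scripted. The following is a minimal sketch that counts opening and closing braces; it uses a self-contained sample file in /tmp, whereas on a real system you would point it at /etc/snmp/conf/snmpdx.acl:

```shell
# Count { and } in an acl-style file; the counts must match for the
# file to parse. The sample file below stands in for snmpdx.acl.
acl=/tmp/snmpdx.acl.sample
printf 'trap = {\n {\n  hosts = lexicon\n }\n}\n' > "$acl"
opens=$(tr -cd '{' < "$acl" | wc -c)
closes=$(tr -cd '}' < "$acl" | wc -c)
if [ "$opens" -eq "$closes" ]; then
    echo "brackets balanced"
else
    echo "bracket mismatch: $opens open vs $closes close"
fi
rm -f "$acl"
```

Running such a check after each edit catches the most common cause of a broken snmpdx.acl file before the agent is restarted.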
In the /etc/snmp/conf/snmpdx.acl file, add a new Solaris Volume Manager section inside the section that you uncommented in the previous step.
```
        trap-community = SNMP-trap
        hosts = lexicon
        {
          enterprise = "sun"
          trap-num = 0, 1, 2-5, 6-16
        }
        {
          enterprise = "Solaris Volume Manager"
          trap-num = 1, 2, 3
        }
```
Note that the four lines that you added are placed immediately after the enterprise = "sun" block.

Append the following line to the /etc/snmp/conf/enterprises.oid file:
"Solaris Volume Manager" "1.3.6.1.4.1.42.104" |
Stop and restart the Solstice Enterprise Agents server.
```
# /etc/init.d/init.snmpdx stop
# /etc/init.d/init.snmpdx start
```
The Solaris Volume Manager SNMP agent does not issue traps for all of the problems that an administrator needs to know about. Specifically, the agent issues traps only in the following instances:
A RAID-1 or RAID-5 subcomponent goes into a "Needs Maintenance" state

A hot spare is swapped into service

A hot spare starts to resynchronize

A hot spare completes resynchronization

A mirror is taken offline

A disk set is taken by another host and the current host panics
Many problems, such as an unavailable disk with RAID-0 volumes or soft partitions on it, do not result in SNMP traps, even when reads and writes to the device are attempted. SCSI or IDE errors are generally reported in these cases. However, for those errors to be reported to a monitoring console, the traps must be issued by a different SNMP agent.
To automatically check your Solaris Volume Manager configuration for errors, create a script that the cron utility can run periodically.

The following example shows a script that you can adapt and modify for your needs.

This script serves as a starting point for automating Solaris Volume Manager error checking. You might need to modify this script for your own configuration.
```
#!/bin/ksh
#ident  "@(#)metacheck.sh       1.3     96/06/21 SMI"
#ident='%Z%%M%  %I%     %E% SMI'
#
# Copyright (c) 1999 by Sun Microsystems, Inc.
#
# metacheck
#
# Check on the status of the metadevice configuration.  If there is a problem
# return a non zero exit code.  Depending on options, send email notification.
#
# -h
#       help
# -s setname
#       Specify the set to check.  By default, the 'local' set will be checked.
# -m recipient [recipient...]
#       Send email notification to the specified recipients.  This
#       must be the last argument.  The notification shows up as a short
#       email message with a subject of
#          "Solaris Volume Manager Problem: metacheck.who.nodename.setname"
#       which summarizes the problem(s) and tells how to obtain detailed
#       information.  The "setname" is from the -s option, "who" is from
#       the -w option, and "nodename" is reported by uname(1).
#       Email notification is further affected by the following options:
#           -f  to suppress additional messages after a problem
#               has been found.
#           -d  to control the suppression.
#           -w  to identify who generated the email.
#           -t  to force email even when there is no problem.
# -w who
#       indicate who is running the command.  By default, this is the
#       user-name as reported by id(1M).  This is used when sending
#       email notification (-m).
# -f
#       Enable filtering.  Filtering applies to email notification (-m).
#       Filtering requires root permission.  When sending email notification
#       the file /etc/lvm/metacheck.setname.pending is used to
#       control the filter.  The following matrix specifies the behavior
#       of the filter:
#
#           problem_found   file_exists
#           yes             no      Create file, send notification
#           yes             yes     Resend notification if the current date
#                                   (as specified by -d datefmt) is
#                                   different than the file date.
#           no              yes     Delete file, send notification
#                                   that the problem is resolved.
#           no              no      Send notification if -t specified.
#
# -d datefmt
#       Specify the format of the date for filtering (-f).  This option
#       controls how often re-notification via email occurs.  If the
#       current date according to the specified format (strftime(3C)) is
#       identical to the date contained in the
#       /etc/lvm/metacheck.setname.pending file then the message is
#       suppressed.  The default date format is "%D", which will send one
#       re-notification per day.
# -t
#       Test mode.  Enable email generation even when there is no problem.
#       Used for end-to-end verification of the mechanism and email addresses.
#
#
# These options are designed to allow integration of metacheck
# into crontab.  For example, a root crontab entry of:
#
# 0,15,30,45 * * * * /usr/sbin/metacheck -f -w SVMcron \
#       -d '\%D \%h' -m notice@example.com 2148357243.8333033@pager.example.com
#
# would check for problems every 15 minutes, and generate an email to
# notice@example.com (and send to an email pager service) every hour when
# there is a problem.  Note the \ prior to the '%' characters for a
# crontab entry.  Bounced email would come back to root@nodename.
# The subject line for email generated by the above line would be
#       Solaris Volume Manager Problem: metacheck.SVMcron.nodename.local
#

# display a debug line to controlling terminal (works in pipes)
decho()
{
	if [ "$debug" = "yes" ] ; then
		echo "DEBUG: $*" < /dev/null > /dev/tty 2>&1
	fi
}

# if string $1 is in $2-* then return $1, else return ""
strstr()
{
	typeset look="$1"
	typeset ret=""

	shift
#	decho "strstr LOOK .$look. FIRST .$1."
	while [ $# -ne 0 ] ; do
		if [ "$look" = "$1" ] ; then
			ret="$look"
		fi
		shift
	done
	echo "$ret"
}

# if string $1 is in $2-* then delete it.  return result
strdstr()
{
	typeset look="$1"
	typeset ret=""

	shift
#	decho "strdstr LOOK .$look. FIRST .$1."
	while [ $# -ne 0 ] ; do
		if [ "$look" != "$1" ] ; then
			ret="$ret $1"
		fi
		shift
	done
	echo "$ret"
}

merge_continued_lines()
{
	awk -e '\
	BEGIN { line = "";} \
	$NF == "\\" { \
	    $NF = ""; \
	    line = line $0; \
	    next; \
	} \
	$NF != "\\" { \
	    if ( line != "" ) { \
		print line $0; \
		line = ""; \
	    } else { \
		print $0; \
	    } \
	}'
}

# trim out stuff not associated with metadevices
find_meta_devices()
{
	typeset devices=""

#	decho "find_meta_devices .$*."
	while [ $# -ne 0 ] ; do
		case $1 in
		d+([0-9]) )	# metadevice name
			devices="$devices $1"
			;;
		esac
		shift
	done
	echo "$devices"
}

# return the list of top level metadevices
toplevel()
{
	typeset comp_meta_devices=""
	typeset top_meta_devices=""
	typeset devices=""
	typeset device=""
	typeset comp=""

	metastat$setarg -p | merge_continued_lines | while read line ; do
		echo "$line"
		devices=`find_meta_devices $line`
		set -- $devices
		if [ $# -ne 0 ] ; then
			device=$1
			shift
			# check to see if device already referred to as component
			comp=`strstr $device $comp_meta_devices`
			if [ -z $comp ] ; then
				top_meta_devices="$top_meta_devices $device"
			fi
			# add components to component list, remove from top list
			while [ $# -ne 0 ] ; do
				comp=$1
				comp_meta_devices="$comp_meta_devices $comp"
				top_meta_devices=`strdstr $comp $top_meta_devices`
				shift
			done
		fi
	done > /dev/null 2>&1
	echo $top_meta_devices
}

#
# - MAIN
#
METAPATH=/usr/sbin
PATH=/usr/bin:$METAPATH
USAGE="usage: metacheck [-s setname] [-h] [[-t] [-f [-d datefmt]] \
	[-w who] -m recipient [recipient...]]"

datefmt="%D"
debug="no"
filter="no"
mflag="no"
set="local"
setarg=""
testarg="no"
who=`id | sed -e 's/^uid=[0-9][0-9]*(//' -e 's/).*//'`

while getopts d:Dfms:tw: flag
do
	case $flag in
	d)	datefmt=$OPTARG;
		;;
	D)	debug="yes"
		;;
	f)	filter="yes"
		;;
	m)	mflag="yes"
		;;
	s)	set=$OPTARG;
		if [ "$set" != "local" ] ; then
			setarg=" -s $set";
		fi
		;;
	t)	testarg="yes";
		;;
	w)	who=$OPTARG;
		;;
	\?)	echo $USAGE
		exit 1
		;;
	esac
done

# if mflag specified then everything else part of recipient
shift `expr $OPTIND - 1`
if [ $mflag = "no" ] ; then
	if [ $# -ne 0 ] ; then
		echo $USAGE
		exit 1
	fi
else
	if [ $# -eq 0 ] ; then
		echo $USAGE
		exit 1
	fi
fi
recipients="$*"

curdate_filter=`date +$datefmt`
curdate=`date`
node=`uname -n`

# establish files
msg_f=/tmp/metacheck.msg.$$
msgs_f=/tmp/metacheck.msgs.$$
metastat_f=/tmp/metacheck.metastat.$$
metadb_f=/tmp/metacheck.metadb.$$
metahs_f=/tmp/metacheck.metahs.$$
pending_f=/etc/lvm/metacheck.$set.pending
files="$metastat_f $metadb_f $metahs_f $msg_f $msgs_f"

rm -f $files > /dev/null 2>&1
trap "rm -f $files > /dev/null 2>&1; exit 1" 1 2 3 15

# Check to see if metadb is capable of running
have_metadb="yes"
metadb$setarg > $metadb_f 2>&1
if [ $? -ne 0 ] ; then
	have_metadb="no"
fi
grep "there are no existing databases" < $metadb_f > /dev/null 2>&1
if [ $? -eq 0 ] ; then
	have_metadb="no"
fi
grep "/dev/md/admin" < $metadb_f > /dev/null 2>&1
if [ $? -eq 0 ] ; then
	have_metadb="no"
fi

# check for problems accessing metadbs
retval=0
if [ "$have_metadb" = "no" ] ; then
	retval=1
	echo "metacheck: metadb problem, can't run '$METAPATH/metadb$setarg'" \
	    >> $msgs_f
else
	# snapshot the state
	metadb$setarg 2>&1 | sed -e '1d' | merge_continued_lines > $metadb_f
	metastat$setarg 2>&1 | merge_continued_lines > $metastat_f
	metahs$setarg -i 2>&1 | merge_continued_lines > $metahs_f

	#
	# Check replicas for problems, capital letters in the flags
	# indicate an error, fields are separated by tabs.
	#
	problem=`awk < $metadb_f -F\t '{if ($1 ~ /[A-Z]/) print $1;}'`
	if [ -n "$problem" ] ; then
		retval=`expr $retval + 64`
		echo "\
metacheck: metadb problem, for more detail run:\n\t$METAPATH/metadb$setarg -i" \
		    >> $msgs_f
	fi

	#
	# Check the metadevice state
	#
	problem=`awk < $metastat_f -e \
	    '/State:/ {if ($2 != "Okay" && $2 != "Resyncing") print $0;}'`
	if [ -n "$problem" ] ; then
		retval=`expr $retval + 128`
		echo "\
metacheck: metadevice problem, for more detail run:" \
		    >> $msgs_f

		# refine the message to toplevel metadevices that have a problem
		top=`toplevel`
		set -- $top
		while [ $# -ne 0 ] ; do
			device=$1
			problem=`metastat $device | awk -e \
			    '/State:/ {if ($2 != "Okay" && $2 != "Resyncing") print $0;}'`
			if [ -n "$problem" ] ; then
				echo "\t$METAPATH/metastat$setarg $device" >> $msgs_f
				# find out what is mounted on the device
				mp=`mount|awk -e '/\/dev\/md\/dsk\/'$device'[ \t]/{print $1;}'`
				if [ -n "$mp" ] ; then
					echo "\t\t$mp mounted on $device" >> $msgs_f
				fi
			fi
			shift
		done
	fi

	#
	# Check the hotspares to see if any have been used.
	#
	problem=""
	grep "no hotspare pools found" < $metahs_f > /dev/null 2>&1
	if [ $? -ne 0 ] ; then
		problem=`awk < $metahs_f -e \
		    '/blocks/ { if ( $2 != "Available" ) print $0;}'`
	fi
	if [ -n "$problem" ] ; then
		retval=`expr $retval + 256`
		echo "\
metacheck: hot spare in use, for more detail run:\n\t$METAPATH/metahs$setarg -i" \
		    >> $msgs_f
	fi
fi

# If any errors occurred, then mail the report
if [ $retval -ne 0 ] ; then
	if [ -n "$recipients" ] ; then
		re=""
		if [ -f $pending_f ] && [ "$filter" = "yes" ] ; then
			re="Re: "
			# we have a pending notification, check date to see if we resend
			penddate_filter=`cat $pending_f | head -1`
			if [ "$curdate_filter" != "$penddate_filter" ] ; then
				rm -f $pending_f > /dev/null 2>&1
			else
				if [ "$debug" = "yes" ] ; then
					echo "metacheck: email problem notification still pending"
					cat $pending_f
				fi
			fi
		fi
		if [ ! -f $pending_f ] ; then
			if [ "$filter" = "yes" ] ; then
				echo "$curdate_filter\n\tDate:$curdate\n\tTo:$recipients" \
				    > $pending_f
			fi
			echo "\
Solaris Volume Manager: $node: metacheck$setarg: Report: $curdate" >> $msg_f
			echo "\
--------------------------------------------------------------" >> $msg_f
			cat $msg_f $msgs_f | mailx -s \
			    "${re}Solaris Volume Manager Problem: metacheck.$who.$set.$node" $recipients
		fi
	else
		cat $msgs_f
	fi
else
	# no problems detected,
	if [ -n "$recipients" ] ; then
		# default is to not send any mail, or print anything.
		echo "\
Solaris Volume Manager: $node: metacheck$setarg: Report: $curdate" >> $msg_f
		echo "\
--------------------------------------------------------------" >> $msg_f
		if [ -f $pending_f ] && [ "$filter" = "yes" ] ; then
			# pending filter exists, remove it and send OK
			rm -f $pending_f > /dev/null 2>&1
			echo "Problem resolved" >> $msg_f
			cat $msg_f | mailx -s \
			    "Re: Solaris Volume Manager Problem: metacheck.$who.$node.$set" $recipients
		elif [ "$testarg" = "yes" ] ; then
			# for testing, send mail every time even though there is no problem
			echo "Messaging test, no problems detected" >> $msg_f
			cat $msg_f | mailx -s \
			    "Solaris Volume Manager Problem: metacheck.$who.$node.$set" $recipients
		fi
	else
		echo "metacheck: Okay"
	fi
fi

rm -f $files > /dev/null 2>&1

exit $retval
```
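The heart of the script's metadevice check is a one-line awk filter that flags any State: line that is neither Okay nor Resyncing. The following sketch demonstrates that filter in isolation, using made-up metastat-style output (the device names and states are hypothetical, since metastat itself exists only on Solaris):

```shell
# Feed the script's State filter some sample metastat-style output.
# Only lines whose state is neither "Okay" nor "Resyncing" survive.
printf 'd10: Mirror\n    State: Okay\nd20: RAID\n    State: Needs Maintenance\n' |
awk '/State:/ { if ($2 != "Okay" && $2 != "Resyncing") print $0 }'
# prints only the "Needs Maintenance" line
```

This is why the script can compute a meaningful exit code: an empty result from the filter means every metadevice is healthy, while any surviving line marks a device that needs attention.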
For information on invoking scripts by using the cron utility, see the cron(1M) man page.