Solaris Volume Manager Administration Guide

Chapter 24 Monitoring and Error Reporting (Tasks)

Sometimes Solaris Volume Manager encounters a problem, such as being unable to write to a volume due to physical errors at the slice level. When problems occur, Solaris Volume Manager changes the status of the volume so that system administrators can stay informed. However, unless you regularly check the status in the Solaris Volume Manager GUI through the Solaris Management Console, or by running the metastat command, you might not see these status changes promptly.
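
For example, you can check volume status manually from the command line. This is a minimal manual check; metastat with no arguments reports the status of all volumes in the local set (the concise -c option is shown as an assumption, since it is available only on releases that support it):

    # metastat
    # metastat -c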

This chapter provides information about the monitoring tools that are available with Solaris Volume Manager. One tool is the Solaris Volume Manager SNMP agent, which is a subagent of the Solstice Enterprise Agents™ monitoring software. In addition to configuring this tool to report SNMP traps, you can create a shell script to constantly monitor many Solaris Volume Manager functions. This shell script could run as a cron job and be valuable in identifying potential problems.

This chapter includes the following information:

    Solaris Volume Manager Monitoring and Reporting (Task Map)
    Configuring the mdmonitord Command for Periodic Error Checking
    Overview of the Solaris Volume Manager SNMP Agent
    Configuring the Solaris Volume Manager SNMP Agents
    Limitations of the Solaris Volume Manager SNMP Agent
    Monitoring Solaris Volume Manager With a cron Job

Solaris Volume Manager Monitoring and Reporting (Task Map)

The following task map identifies the procedures that are needed to manage error reporting for Solaris Volume Manager.

Task: Configure the mdmonitord daemon to periodically check for errors.
Description: Configure the error-checking interval used by the mdmonitord daemon by editing the /lib/svc/method/svc-mdmonitor script.
For Instructions: "Configuring the mdmonitord Command for Periodic Error Checking"

Task: Configure the Solaris Volume Manager SNMP agent.
Description: Edit the configuration files in the /etc/snmp/conf directory so that Solaris Volume Manager sends traps appropriately, to the correct system.
For Instructions: "Configuring the Solaris Volume Manager SNMP Agents"

Task: Monitor Solaris Volume Manager with scripts run by the cron command.
Description: Create or adapt a script to check for errors, and then run the script from the cron command.
For Instructions: "Monitoring Solaris Volume Manager With a cron Job"

Configuring the mdmonitord Command for Periodic Error Checking

Solaris Volume Manager includes the /usr/sbin/mdmonitord daemon. When a disk fails, Solaris Volume Manager detects the failure and generates an error. This error event triggers the mdmonitord daemon to perform a check of RAID-1 (mirror) volumes, RAID-5 volumes, and hot spares. However, you can also configure this program to check for errors continually at a specified interval.
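
Before you change the checking interval, you can verify that the daemon is running. This is an optional sanity check, assuming the system/mdmonitor SMF service name that is used later in this procedure:

    # svcs system/mdmonitor
    # pgrep -fl mdmonitord

The pgrep -fl output also shows the arguments that the daemon was started with, which is useful after you add the -t flag.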

How to Configure the mdmonitord Command for Periodic Error Checking

Edit the /lib/svc/method/svc-mdmonitor script to add the desired interval for periodic checking.

  1. Become superuser.

  2. Open the /lib/svc/method/svc-mdmonitor script in the editor of your choice. Locate the following section in the script:


    $MDMONITORD
    error=$?
    case $error in
    0)      exit 0
            ;;

    *)      echo "Could not start $MDMONITORD. Error $error."
            exit 0

  3. Change the line that begins with the mdmonitord command, adding a -t flag and the number of seconds between checks.

    $MDMONITORD -t 3600
    error=$?
    case $error in
    0)      exit 0
            ;;

    *)      echo "Could not start $MDMONITORD. Error $error."
            exit 0
            ;;
    esac
    
  4. Restart the mdmonitord command to activate your changes.


    # svcadm restart system/mdmonitor
    

    For more information, see the mdmonitord(1M) man page.
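
    To confirm that the restarted daemon picked up the new interval, you can list the running process and its arguments:

    # pgrep -fl mdmonitord

    The output should include the -t value that you set, for example mdmonitord -t 3600.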

Overview of the Solaris Volume Manager SNMP Agent

The Solaris Volume Manager SNMP trap agent requires the core package SUNWlvmr, the SUNWlvma package, and the packages for the Solstice Enterprise Agents. These core packages include the following:

    SUNWmibii
    SUNWsacom
    SUNWsadmi
    SUNWsasnm

These packages are part of the Solaris operating system. They are normally installed by default unless the package selection was modified at installation time or a minimal set of packages was installed. To confirm that these packages are available, use the pkginfo pkgname command, as in pkginfo SUNWsasnm. After you confirm that these packages are available, you need to configure the Solaris Volume Manager SNMP agent, as described in the following section.
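
To check for all of the required packages at once, a small shell loop such as the following sketch works (adjust the package list to your release; the four Solstice Enterprise Agents package names are the ones listed above):

    # for pkg in SUNWlvmr SUNWlvma SUNWmibii SUNWsacom SUNWsadmi SUNWsasnm
    > do
    >     pkginfo $pkg > /dev/null 2>&1 || echo "$pkg is not installed"
    > done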

Configuring the Solaris Volume Manager SNMP Agents

The Solaris Volume Manager SNMP agents are not enabled by default. Use the following procedure to enable SNMP traps.

Whenever you upgrade your Solaris operating system, you will probably need to edit the /etc/snmp/conf/enterprises.oid file and append the line in Step 6 again, then restart the Solaris Enterprise Agents server.

After you complete this procedure, your system will issue SNMP traps to the host or hosts that you specified. You will need to use an appropriate SNMP monitor, such as Solstice Enterprise Agents software, to view the traps as they are issued.
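
If you do not yet have a trap console available, one way to verify that traps are leaving the system is to watch the SNMP trap port with the snoop command. The interface name net0 below is an assumption; substitute the name of your network interface:

    # snoop -d net0 port 162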

Set the mdmonitord command to probe your system regularly to help ensure that you receive traps if problems arise. See Configuring the mdmonitord Command for Periodic Error Checking. Also, see Monitoring Solaris Volume Manager With a cron Job for additional error-checking options.

How to Configure the Solaris Volume Manager SNMP Agents

  1. Become superuser.

  2. Move the /etc/snmp/conf/mdlogd.rsrc- configuration file to /etc/snmp/conf/mdlogd.rsrc.


    # mv /etc/snmp/conf/mdlogd.rsrc- /etc/snmp/conf/mdlogd.rsrc
    
  3. Edit the /etc/snmp/conf/mdlogd.acl file to specify which hosts should receive SNMP traps. Look in the file for the following:


            trap = {
                 {
                    trap-community = SNMP-trap
                    hosts = corsair
                    {
                        enterprise = "Solaris Volume Manager"
                        trap-num = 1, 2, 3
                    }

    Change the line that contains hosts = corsair to specify the host name that you want to receive Solaris Volume Manager SNMP traps. For example, to send SNMP traps to lexicon, you would change the line to hosts = lexicon. If you want to include multiple hosts, provide a comma-delimited list of host names, as in hosts = lexicon, idiom.

  4. Also edit the /etc/snmp/conf/snmpdx.acl file to specify which hosts should receive the SNMP traps.

    Find the block that begins with trap = and add the same list of hosts that you added in the previous step. This section might be commented out with #. If so, you must remove the # at the beginning of the required lines in this section. Additional lines in the trap section are also commented out. However, you can leave those lines alone or delete them for clarity. After uncommenting the required lines and updating the hosts line, this section could look similar to the following:


    ###################
    # trap parameters #
    ###################

    trap = {
      {
            trap-community = SNMP-trap
            hosts = lexicon
            {
              enterprise = "sun"
              trap-num = 0, 1, 2-5, 6-16
            }
    #       {
    #         enterprise = "3Com"
    #         trap-num = 4
    #       }
    #       {
    #         enterprise = "snmp"
    #         trap-num = 0, 2, 5
    #       }
    #  }
    #  {
    #       trap-community = jerry-trap
    #       hosts = jerry, nanak, hubble
    #       {
    #         enterprise = "sun"
    #         trap-num = 1, 3
    #       }
    #       {
    #         enterprise = "snmp"
    #         trap-num = 1-3
    #       }
      }
    }

    Note –

    Make sure that you have the same number of opening and closing brackets in the /etc/snmp/conf/snmpdx.acl file.
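
    A quick way to check the bracket count is the following nawk sketch, which ignores commented-out lines:

    # nawk '!/^[ \t]*#/ { o += gsub(/\{/, "&"); c += gsub(/\}/, "&") } END { printf "%d opening, %d closing\n", o, c }' /etc/snmp/conf/snmpdx.acl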


  5. Add a new Solaris Volume Manager section to the /etc/snmp/conf/snmpdx.acl file, inside the section that you uncommented in the previous step.


            trap-community = SNMP-trap
            hosts = lexicon
            {
              enterprise = "sun"
              trap-num = 0, 1, 2-5, 6-16
            }
            {
                enterprise = "Solaris Volume Manager"
                trap-num = 1, 2, 3
            }

    Note that the four lines that you added are placed immediately after the enterprise = "sun" block.

  6. Append the following line to the /etc/snmp/conf/enterprises.oid file:


    "Solaris Volume Manager"                           "1.3.6.1.4.1.42.104"
  7. Stop and restart the Solstice Enterprise Agents server.


    # /etc/init.d/init.snmpdx stop
    
    # /etc/init.d/init.snmpdx start
    
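    To verify that the master agent restarted, you can list the matching process:

    # pgrep -fl snmpdx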

Limitations of the Solaris Volume Manager SNMP Agent

The Solaris Volume Manager SNMP agent does not issue traps for all of the Solaris Volume Manager problems that system administrators need to be aware of. Specifically, the agent issues traps only in the following instances:

    A RAID-1 or RAID-5 subcomponent goes into a "Needs Maintenance" state
    A hot spare is swapped into service
    A hot spare starts to resynchronize
    A hot spare completes resynchronization
    A mirror is taken offline
    A RAID-1 or RAID-5 element is erred

Many problems, such as an unavailable disk with RAID-0 volumes or soft partitions on it, do not result in SNMP traps, even when reads and writes to the device are attempted. SCSI or IDE errors are generally reported in these cases. However, for those errors to be reported to a monitoring console, they would have to be issued as traps by a different SNMP agent.

Monitoring Solaris Volume Manager With a cron Job

How to Automatically Check for Errors in Volumes

    To automatically check your Solaris Volume Manager configuration for errors, create a script that the cron utility can run periodically.

    The following example shows a script that you can adapt and modify for your needs.


    Note –

    This script serves as a starting point for automating Solaris Volume Manager error checking. You might need to modify this script for your own configuration.


    #!/bin/ksh
    #ident "@(#)metacheck.sh   1.3     96/06/21 SMI"
    # ident='%Z%%M%   %I%     %E% SMI'
    #
    # Copyright (c) 1999 by Sun Microsystems, Inc.
    #
    # metacheck
    #
    # Check on the status of the metadevice configuration.  If there is a problem
    # return a non zero exit code.  Depending on options, send email notification.
    #
    # -h
    #	help
    # -s setname
    #	Specify the set to check.  By default, the 'local' set will be checked.
    # -m recipient [recipient...]
    #	Send email notification to the specified recipients.  This
    #	must be the last argument. The notification shows up as a short 
    #	email message with a subject of 
    #		"Solaris Volume Manager Problem: metacheck.who.nodename.setname"
    #	which summarizes the problem(s) and tells how to obtain detailed 
    #	information. The "setname" is from the -s option, "who" is from 
    #	the -w option, and "nodename" is reported by uname(1).
    #	Email notification is further affected by the following options:
    #		-f	to suppress additional messages after a problem 
    #			has been found. 
    #		-d	to control the suppression.
    #		-w	to identify who generated the email.
    #		-t	to force email even when there is no problem.
    # -w who
    #	indicate who is running the command. By default, this is the
    #	user-name as reported by id(1M). This is used when sending
    #	email notification (-m).
    # -f 
    #	Enable filtering.  Filtering applies to email notification (-m).
    #	Filtering requires root permission.  When sending email notification
    #	the file /etc/lvm/metacheck.setname.pending is used to 
    #	control the filter.  The following matrix specifies the behavior
    #	of the filter:
    #
    #	problem_found	file_exists
    #	  yes		  no		Create file, send notification
    #	  yes		  yes		Resend notification if the current date 
    #					(as specified by -d datefmt) is 
    #					different than the file date.
    #	  no		  yes		Delete file, send notification 
    #					that the problem is resolved.
    #	  no		  no		Send notification if -t specified.
    #	
    # -d datefmt
    #	Specify the format of the date for filtering (-f).  This option 
    #	controls how often re-notification via email occurs. If the 
    #	current date according to the specified format (strftime(3C)) is 
    #	identical to the date contained in the 
    #	/etc/lvm/metacheck.setname.pending file then the message is 
    #	suppressed. The default date format is "%D", which will send one 
    #	re-notification per day.
    # -t
    #	Test mode.  Enable email generation even when there is no problem.
    #	Used for end-to-end verification of the mechanism and email addresses.
    #	
    #
    # These options are designed to allow integration of metacheck
    # into crontab.  For example, a root crontab entry of:
    #
    # 0,15,30,45 * * * * /usr/sbin/metacheck -f -w SVMcron \
    #   -d '\%D \%h' -m notice@example.com 2148357243.8333033@pager.example.com
    #
    # would check for problems every 15 minutes, and generate an email to
    # notice@example.com (and send to an email pager service) every hour when 
    # there is a problem.  Note the \ prior to the '%' characters for a 
    # crontab entry.  Bounced email would come back to root@nodename.
    # The subject line for email generated by the above line would be
    # Solaris Volume Manager Problem: metacheck.SVMcron.nodename.local
    #
    
    # display a debug line to controlling terminal (works in pipes)
    decho()
    {
        if [ "$debug" = "yes" ] ; then
    	echo "DEBUG: $*"	< /dev/null > /dev/tty 2>&1
        fi
    }
    
    # if string $1 is in $2-* then return $1, else return ""
    strstr()
    {
        typeset	look="$1"
        typeset	ret=""
    
        shift
    #   decho "strstr LOOK .$look. FIRST .$1."
        while [ $# -ne 0 ] ; do
    	if [ "$look" = "$1" ] ; then
    	    ret="$look"
    	fi
    	shift
        done
        echo "$ret"
    }
    
    # if string $1 is in $2-* then delete it. return result
    strdstr()
    {
        typeset	look="$1"
        typeset	ret=""
    
        shift
    #   decho "strdstr LOOK .$look. FIRST .$1."
        while [ $# -ne 0 ] ; do
    	if [ "$look" != "$1" ] ; then
    	    ret="$ret $1"
    	fi
    	shift
        done
        echo "$ret"
    }
    
    merge_continued_lines()
    {
        awk -e '\
    	BEGIN { line = "";} \
    	$NF == "\\" { \
    	    $NF = ""; \
    	    line = line $0; \
    	    next; \
    	} \
    	$NF != "\\" { \
    	    if ( line != "" ) { \
    		print line $0; \
    		line = ""; \
    	    } else { \
    		print $0; \
    	    } \
    	}'
    }
    
    # trim out stuff not associated with metadevices
    find_meta_devices()
    {
        typeset	devices=""
    
    #   decho "find_meta_devices .$*."
        while [ $# -ne 0 ] ; do
    	case $1 in
    	d+([0-9]) )	# metadevice name
    	    devices="$devices $1"
    	    ;;
    	esac
    	shift
        done
        echo "$devices"
    }
    
    # return the list of top level metadevices
    toplevel()
    {
        typeset	comp_meta_devices=""
        typeset	top_meta_devices=""
        typeset	devices=""
        typeset	device=""
        typeset	comp=""
    
        metastat$setarg -p | merge_continued_lines | while read line ; do
    	echo "$line"
    	devices=`find_meta_devices $line`
    	set -- $devices
    	if [ $# -ne 0 ] ; then
    	    device=$1
    	    shift
	    # check to see if device already referred to as component
    	    comp=`strstr $device $comp_meta_devices`
	    if [ -z "$comp" ] ; then 
    		top_meta_devices="$top_meta_devices $device"
    	    fi
    	    # add components to component list, remove from top list
    	    while [ $# -ne 0 ] ; do
    		comp=$1
    		comp_meta_devices="$comp_meta_devices $comp"
    		top_meta_devices=`strdstr $comp $top_meta_devices`
    		shift
    	    done
    	fi
        done > /dev/null 2>&1
        echo $top_meta_devices
    }
    
    #
    # - MAIN
    #
    METAPATH=/usr/sbin
    PATH=/usr/bin:$METAPATH
    USAGE="usage: metacheck [-s setname] [-h] [[-t] [-f [-d datefmt]] \
        [-w who] -m recipient [recipient...]]"
    
    datefmt="%D"
    debug="no"
    filter="no"
    mflag="no"
    set="local"
    setarg=""
    testarg="no"
    who=`id | sed -e 's/^uid=[0-9][0-9]*(//' -e 's/).*//'`
    
    while getopts d:Dfms:tw: flag
    do
        case $flag in
        d)	datefmt=$OPTARG;
    	;;
        D)	debug="yes"
    	;;
        f)	filter="yes"
    	;;
        m)	mflag="yes"
    	;;
        s)	set=$OPTARG;
    	if [ "$set" != "local" ] ; then
    		setarg=" -s $set";
    	fi
    	;;
        t)	testarg="yes";
    	;;
        w)	who=$OPTARG;
    	;;
        \?)	echo $USAGE
    	exit 1
    	;;
        esac
    done
    
    # if mflag specified then everything else part of recipient
    shift `expr $OPTIND - 1`
    if [ $mflag = "no" ] ; then
        if [ $# -ne 0 ] ; then 
    	echo $USAGE
    	exit 1
        fi
    else
        if [ $# -eq 0 ] ; then 
    	echo $USAGE
    	exit 1
        fi
    fi
    recipients="$*"
    
    curdate_filter=`date +$datefmt`
    curdate=`date`
    node=`uname -n`
    
    # establish files
    msg_f=/tmp/metacheck.msg.$$
    msgs_f=/tmp/metacheck.msgs.$$
    metastat_f=/tmp/metacheck.metastat.$$
    metadb_f=/tmp/metacheck.metadb.$$
    metahs_f=/tmp/metacheck.metahs.$$
    pending_f=/etc/lvm/metacheck.$set.pending 
    files="$metastat_f $metadb_f $metahs_f $msg_f $msgs_f"
    
    rm -f $files							> /dev/null 2>&1
    trap "rm -f $files > /dev/null 2>&1; exit 1" 1 2 3 15
    
    # Check to see if metadb is capable of running
    have_metadb="yes"
    metadb$setarg 							> $metadb_f 2>&1
    if [ $? -ne 0 ] ; then
        have_metadb="no"
    fi
    grep "there are no existing databases"  	< $metadb_f	> /dev/null 2>&1
    if [ $? -eq 0 ] ; then
        have_metadb="no"
    fi
    grep "/dev/md/admin"				< $metadb_f	> /dev/null 2>&1
    if [ $? -eq 0 ] ; then
        have_metadb="no"
    fi
    
    # check for problems accessing metadbs
    retval=0
    if [ "$have_metadb" = "no" ] ; then
        retval=1
        echo "metacheck: metadb problem, can't run '$METAPATH/metadb$setarg'" \
    								>> $msgs_f
    else
        # snapshot the state
        metadb$setarg 2>&1 | sed -e '1d' | merge_continued_lines	> $metadb_f
        metastat$setarg 2>&1 | merge_continued_lines		> $metastat_f
        metahs$setarg -i 2>&1 | merge_continued_lines		> $metahs_f
    
        #
        # Check replicas for problems, capital letters in the flags
    # indicate an error, fields are separated by tabs.
        #
        problem=`awk < $metadb_f -F\t '{if ($1 ~ /[A-Z]/) print $1;}'`
        if [ -n "$problem" ] ; then
    	retval=`expr $retval + 64`
    	echo "\
    metacheck: metadb problem, for more detail run:\n\t$METAPATH/metadb$setarg -i" \
    								>> $msgs_f
        fi
    
        #
        # Check the metadevice state
        #
        problem=`awk < $metastat_f -e \
    		'/State:/ {if ($2 != "Okay" && $2 != "Resyncing") print $0;}'`
        if [ -n "$problem" ] ; then
    	retval=`expr $retval + 128`
    	echo "\
    metacheck: metadevice problem, for more detail run:" \
    								>> $msgs_f
    
    	# refine the message to toplevel metadevices that have a problem
    	top=`toplevel`
    	set -- $top
    	while [ $# -ne 0 ] ; do
    	    device=$1
    	    problem=`metastat $device | awk -e \
    		'/State:/ {if ($2 != "Okay" && $2 != "Resyncing") print $0;}'`
    	    if [ -n "$problem" ] ; then
    		echo "\t$METAPATH/metastat$setarg $device"	>> $msgs_f
    		# find out what is mounted on the device
    		mp=`mount|awk -e '/\/dev\/md\/dsk\/'$device'[ \t]/{print $1;}'`
    		if [ -n "$mp" ] ; then
    		    echo "\t\t$mp mounted on $device"		>> $msgs_f
    		fi
    	    fi
    	    shift
    	done
        fi
    
        #
        # Check the hotspares to see if any have been used.
        #
        problem=""
        grep "no hotspare pools found"	< $metahs_f		> /dev/null 2>&1
        if [ $? -ne 0 ] ; then
    	problem=`awk < $metahs_f -e \
    	    '/blocks/ { if ( $2 != "Available" ) print $0;}'`
        fi
        if [ -n "$problem" ] ; then
    	retval=`expr $retval + 256`
    	echo "\
    metacheck: hot spare in use, for more detail run:\n\t$METAPATH/metahs$setarg -i" \
    								 >> $msgs_f
        fi
    fi
    
    # If any errors occurred, then mail the report
    if [ $retval -ne 0 ] ; then
        if [ -n "$recipients" ] ; then 
    	re=""
    	if [ -f $pending_f ] && [ "$filter" = "yes" ] ; then
    	    re="Re: "
    	    # we have a pending notification, check date to see if we resend
    	    penddate_filter=`cat $pending_f | head -1`
    	    if [ "$curdate_filter" != "$penddate_filter" ] ; then
    		rm -f $pending_f				> /dev/null 2>&1
    	    else
    	 	if [ "$debug" = "yes" ] ; then
    		    echo "metacheck: email problem notification still pending"
    		    cat $pending_f
    		fi
    	    fi
    	fi
    	if [ ! -f $pending_f ] ; then
    	    if [ "$filter" = "yes" ] ; then
    		echo "$curdate_filter\n\tDate:$curdate\n\tTo:$recipients" \
    								> $pending_f
    	    fi
    	    echo "\
    Solaris Volume Manager: $node: metacheck$setarg: Report: $curdate"		>> $msg_f
    	    echo "\
    --------------------------------------------------------------" >> $msg_f
    	    cat $msg_f $msgs_f | mailx -s \
    		"${re}Solaris Volume Manager Problem: metacheck.$who.$set.$node" $recipients
    	fi
        else
    	cat $msgs_f
        fi
    else
        # no problems detected,
        if [ -n "$recipients" ] ; then
    	# default is to not send any mail, or print anything.
    	echo "\
    Solaris Volume Manager: $node: metacheck$setarg: Report: $curdate"		>> $msg_f
    	echo "\
    --------------------------------------------------------------" >> $msg_f
    	if [ -f $pending_f ] && [ "$filter" = "yes" ] ; then
	    # pending filter exists, remove it and send OK
    	    rm -f $pending_f					> /dev/null 2>&1
    	    echo "Problem resolved"				>> $msg_f
    	    cat $msg_f | mailx -s \
    		"Re: Solaris Volume Manager Problem: metacheck.$who.$node.$set" $recipients
    	elif [ "$testarg" = "yes" ] ; then
	    # for testing, send mail every time even though there is no problem
    	    echo "Messaging test, no problems detected"		>> $msg_f
    	    cat $msg_f | mailx -s \
    		"Solaris Volume Manager Problem: metacheck.$who.$node.$set" $recipients
    	fi
        else
    	echo "metacheck: Okay"
        fi
    fi
    
    rm -f $files							> /dev/null 2>&1
    exit $retval

    For information on invoking scripts by using the cron utility, see the cron(1M) man page.
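
    For example, if you save the script as /usr/local/bin/metacheck.sh (a path chosen here only for illustration) and make it executable, a root crontab entry similar to the one described in the script's header comments would check for problems every 15 minutes and mail root when a problem is found. Note the \ before each % character, which is required in a crontab entry:

    0,15,30,45 * * * * /usr/local/bin/metacheck.sh -f -w SVMcron -d '\%D \%h' -m root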