Solaris Volume Manager Administration Guide

Chapter 24 Monitoring and Error Reporting (Tasks)

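Solaris Volume Manager can run into problems such as being unable to write to a volume because of a physical error at the slice level. When a problem occurs, Solaris Volume Manager changes the state of the volume so that the system administrator can stay informed. However, unless you check for state changes regularly, either in the Solaris Volume Manager GUI through the Solaris Management Console or by running the metastat command, you might not learn of these changes in a timely manner.

This chapter describes the monitoring tools that are available for Solaris Volume Manager, including the Solaris Volume Manager SNMP agent, which is a subagent of the Solstice Enterprise Agents™ monitoring software. In addition to configuring this tool to report SNMP traps, you can write a shell script that actively monitors many Solaris Volume Manager functions. Such a script can run as a cron job and helps you detect potential problems before they become serious.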
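As a rough illustration of the kind of check such a script performs, the following sketch filters metastat output for state lines that are neither Okay nor Resyncing. This is the same test that the metacheck example later in this chapter applies. The function name svm_bad_states is hypothetical and is not part of Solaris Volume Manager.

```shell
# svm_bad_states (hypothetical name): read metastat output on standard
# input and print every "State:" line whose state is neither Okay nor
# Resyncing.
svm_bad_states() {
    awk '/State:/ && $2 != "Okay" && $2 != "Resyncing" { print }'
}

# Typical use on a system with Solaris Volume Manager installed:
#   metastat | svm_bad_states
```

Because the function reads standard input, it can also be run against saved metastat output when you review a problem after the fact.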

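This chapter covers the following topics: the monitoring and reporting task map, configuring the `mdmonitord` daemon for periodic error checking, an overview of the Solaris Volume Manager SNMP agent together with its configuration and limitations, and monitoring Solaris Volume Manager with a `cron` job.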

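Solaris Volume Manager Monitoring and Reporting (Task Map)

The following table lists the tasks required to manage error reporting for Solaris Volume Manager.

Task	Description	For Instructions
Configure the `mdmonitord` daemon to check for errors periodically	Edit the `/lib/svc/method/svc-mdmonitor` script to set the error-checking interval used by the `mdmonitord` daemon.	"Configuring the `mdmonitord` Daemon for Periodic Error Checking"
Configure the Solaris Volume Manager SNMP agent	Edit the configuration files in the `/etc/snmp/conf` directory so that Solaris Volume Manager sends traps to the appropriate system.	"Configuring the Solaris Volume Manager SNMP Agent"
Monitor Solaris Volume Manager with scripts run by the `cron` command	Create or adapt a script that checks for errors, then run the script with the `cron` command.	"Monitoring Solaris Volume Manager With a `cron` Job"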

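Configuring the `mdmonitord` Daemon for Periodic Error Checking

Solaris Volume Manager includes the /usr/sbin/mdmonitord daemon, a program that checks Solaris Volume Manager volumes for errors. By default, this program checks RAID-1 (mirror) volumes, RAID-5 volumes, and hot spares for errors only when an error, such as a write error, is detected on a volume. However, you can also configure the program to actively check for errors at a specified interval.

How to Configure the `mdmonitord` Command for Periodic Error Checking

Edit the /lib/svc/method/svc-mdmonitor script to add a time interval for periodic checking.

Become superuser.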

Open the /lib/svc/method/svc-mdmonitor script in the editor of your choice. Locate the following section of the script:

$MDMONITORD
error=$?
case $error in
0)      exit 0
        ;;

*)      echo "Could not start $MDMONITORD. Error $error."
        exit 0
        ;;
esac

On the line that starts the mdmonitord command ($MDMONITORD), add the -t flag followed by the number of seconds between checks.

$MDMONITORD -t 3600
error=$?
case $error in
0)      exit 0
        ;;

*)      echo "Could not start $MDMONITORD. Error $error."
        exit 0
        ;;
esac

Restart the mdmonitord command to activate your changes.
# svcadm restart system/mdmonitor
For more information, see the mdmonitord(1M) man page.
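To confirm that the edit took effect, you can read the interval back from the script. The following is a minimal sketch, assuming a POSIX sed and a start line of the form shown above. The function name mdmonitord_interval is hypothetical.

```shell
# mdmonitord_interval (hypothetical name): print the -t value, in seconds,
# from the line that starts the daemon in the script named by $1.
# Prints nothing if that line has no -t flag.
mdmonitord_interval() {
    sed -n 's/^[[:space:]]*\$MDMONITORD[[:space:]][[:space:]]*-t[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1"
}

# Typical use:
#   mdmonitord_interval /lib/svc/method/svc-mdmonitor
```

With the example edit shown earlier, the function would be expected to print 3600. Empty output suggests that the daemon still checks only when an error is detected.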

Solaris Volume Manager SNMP Agents Overview

In addition to the core packages SUNWlvmr and SUNWlvma, the Solaris Volume Manager SNMP trap agent requires Solstice Enterprise Agents. The required Solstice Enterprise Agents packages are as follows:

SUNWmibii
SUNWsacom
SUNWsadmi
SUNWsasnm

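These packages are part of the Solaris operating system. They are normally installed by default unless you changed the package selection at installation time or installed a minimal set of packages. To confirm that each package is available, use the pkginfo pkgname command, as in pkginfo SUNWsasnm. Confirm that all of the required packages are available before you configure the Solaris Volume Manager SNMP agent as described in the following section.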
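Checking the packages one at a time is easy to get wrong, so a small loop can help. The following sketch relies on the -q option of pkginfo, which prints nothing and reports through its exit status whether a package is installed. The function name missing_pkgs is hypothetical.

```shell
# missing_pkgs (hypothetical name): print the name of every listed package
# that pkginfo does not report as installed.
missing_pkgs() {
    for pkg in "$@"; do
        pkginfo -q "$pkg" || echo "$pkg"
    done
}

# Typical use:
#   missing_pkgs SUNWlvmr SUNWlvma SUNWmibii SUNWsacom SUNWsadmi SUNWsasnm
```

Empty output means that every package named on the command line is installed.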

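Configuring the Solaris Volume Manager SNMP Agent

The Solaris Volume Manager SNMP agent is not enabled by default. Use the following procedure to enable SNMP traps.

Whenever you upgrade your Solaris operating system, you will probably need to edit the /etc/snmp/conf/enterprises.oid file again, append the "Solaris Volume Manager" line shown in Step 6 of the procedure to the end of the file, and then stop and restart the Solstice Enterprise Agents server.

After you complete this procedure, SNMP traps are sent to the hosts that you specified. To view the traps, you need an appropriate SNMP monitor, such as the Solstice Enterprise Agents software.

To receive traps when problems occur, set the mdmonitord command to check your system periodically. See "Configuring the mdmonitord Daemon for Periodic Error Checking". For other error-checking options, see "Monitoring Solaris Volume Manager With a cron Job".

How to Configure the Solaris Volume Manager SNMP Agent

Become superuser.

Move the /etc/snmp/conf/mdlogd.rsrc- configuration file to /etc/snmp/conf/mdlogd.rsrc.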
# mv /etc/snmp/conf/mdlogd.rsrc- /etc/snmp/conf/mdlogd.rsrc

Edit the /etc/snmp/conf/mdlogd.acl file to specify which hosts should receive SNMP traps. Look for the following section in the file:

        trap = {
             {
                trap-community = SNMP-trap
                hosts = corsair
                {
                    enterprise = "Solaris Volume Manager"
                    trap-num = 1, 2, 3
                }

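On the line that contains hosts = corsair, specify the host that is to receive Solaris Volume Manager SNMP traps. For example, to send SNMP traps to lexicon, change the line to hosts = lexicon. To specify multiple hosts, separate the host names with commas, as in hosts = lexicon, idiom.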
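If you prefer not to edit the file by hand, the change to the hosts line can be scripted. The following is a minimal sketch, assuming a POSIX sed and a host list that contains no / or & characters. The function name set_trap_hosts is hypothetical. It writes the result to standard output so that you can review it before replacing the original file.

```shell
# set_trap_hosts (hypothetical name): copy the ACL file named by $1 to
# standard output, replacing the value on every uncommented "hosts = ..."
# line with the host list given in $2.
set_trap_hosts() {
    sed "s/^\([[:space:]]*hosts[[:space:]]*=\).*/\1 $2/" "$1"
}

# Typical use:
#   set_trap_hosts /etc/snmp/conf/mdlogd.acl "lexicon, idiom" > /tmp/mdlogd.acl.new
```

Lines that begin with # are left unchanged, so commented-out examples in the file are preserved.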

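Next, edit the /etc/snmp/conf/snmpdx.acl file to specify which hosts should receive the SNMP traps.

Find the block that begins with trap = and specify the same list of hosts that you specified in the previous step. This section might be commented out with #. If so, remove the # at the beginning of the required lines. The trap section contains additional commented-out lines, which you can either leave in place or delete for clarity. After you uncomment the required lines and update the hosts line, the section looks like this: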

###################
# trap parameters #
###################

trap = {
  {
        trap-community = SNMP-trap
        hosts = lexicon
        {
          enterprise = "sun"
          trap-num = 0, 1, 2-5, 6-16
        }
#       {
#         enterprise = "3Com"
#         trap-num = 4
#       }
#       {
#         enterprise = "snmp"
#         trap-num = 0, 2, 5
#       }
#  }
#  {
#       trap-community = jerry-trap
#       hosts = jerry, nanak, hubble
#       {
#         enterprise = "sun"
#         trap-num = 1, 3
#       }
#       {
#         enterprise = "snmp"
#         trap-num = 1-3
#       }
  }
}

Note –

Make sure that the /etc/snmp/conf/snmpdx.acl file contains the same number of opening and closing braces.
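One way to check the balance is to count the braces that remain after comments are removed. The following is a rough sketch, assuming that # always starts a comment in this file. The function name brace_counts is hypothetical.

```shell
# brace_counts (hypothetical name): print the number of "{" and "}"
# characters that appear outside of # comments in the file named by $1.
brace_counts() {
    awk '{
        sub(/#.*/, "")                # drop the comment part of the line
        opened += gsub(/[{]/, "")     # gsub returns the number of matches
        closed += gsub(/[}]/, "")
    }
    END { printf "%d %d\n", opened, closed }' "$1"
}

# Typical use:
#   brace_counts /etc/snmp/conf/snmpdx.acl
```

The two numbers should be equal. Equal counts do not prove that the nesting is correct, but unequal counts indicate that a # was removed from an opening or closing line without its partner.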

Add a new Solaris Volume Manager section inside the section of the /etc/snmp/conf/snmpdx.acl file that you uncommented in the previous step.

        trap-community = SNMP-trap
        hosts = lexicon
        {
          enterprise = "sun"
          trap-num = 0, 1, 2-5, 6-16
        }
        {
            enterprise = "Solaris Volume Manager"
            trap-num = 1, 2, 3
        }

The four lines that you add must be inserted immediately after the enterprise = "sun" block.

Append the following line to the end of the /etc/snmp/conf/enterprises.oid file:
"Solaris Volume Manager" "1.3.6.1.4.1.42.104"
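Because this line might need to be restored after an operating system upgrade, it can be convenient to add it in a way that is safe to repeat. The following is a minimal sketch. The function name ensure_svm_oid is hypothetical.

```shell
# ensure_svm_oid (hypothetical name): append the Solaris Volume Manager
# entry to the file named by $1 unless a line mentioning it already exists.
ensure_svm_oid() {
    if grep '"Solaris Volume Manager"' "$1" > /dev/null 2>&1; then
        return 0
    fi
    echo '"Solaris Volume Manager" "1.3.6.1.4.1.42.104"' >> "$1"
}

# Typical use, as superuser:
#   ensure_svm_oid /etc/snmp/conf/enterprises.oid
```

Running the function a second time leaves the file unchanged, so it does not create duplicate entries.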

Stop and restart the Solstice Enterprise Agents server.
# /etc/init.d/init.snmpdx stop
# /etc/init.d/init.snmpdx start

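Limitations of the Solaris Volume Manager SNMP Agent

The Solaris Volume Manager SNMP agent does not issue traps for every Solaris Volume Manager problem that system administrators need to know about. Specifically, the agent issues traps only in the following cases:

A RAID-1 or RAID-5 subcomponent goes into the "Needs Maintenance" state
A hot spare is swapped into service in place of a failed disk
A hot spare starts to resynchronize
A hot spare completes resynchronization
A mirror is taken offline
A disk set is taken by another host and the current host panics

Many problems do not result in an SNMP trap, for example when a disk that has RAID-0 volumes or soft partitions defined on it becomes unavailable. No trap is sent even when reads or writes to the device are attempted. SCSI or IDE errors are generally reported in these situations. However, for those errors to reach a monitoring console, other SNMP agents must issue the traps.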

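Monitoring Solaris Volume Manager With a `cron` Job

How to Automate Checking for Errors in Volumes

To check a Solaris Volume Manager configuration for errors automatically, create a script that the cron utility can run periodically.

The following example script can be adapted to your needs.

Note –

This script is a starting point for automating error checking for Solaris Volume Manager. You must modify it to fit your own configuration.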

#!/bin/ksh
#
#ident "@(#)metacheck.sh   1.3     96/06/21 SMI"
# ident='%Z%%M%   %I%     %E% SMI'
#
# Copyright (c) 1999 by Sun Microsystems, Inc.
#
# metacheck
#
# Check on the status of the metadevice configuration.  If there is a problem
# return a non zero exit code.  Depending on options, send email notification.
#
# -h
#	help
# -s setname
#	Specify the set to check.  By default, the 'local' set will be checked.
# -m recipient [recipient...]
#	Send email notification to the specified recipients.  This
#	must be the last argument. The notification shows up as a short 
#	email message with a subject of 
#		"Solaris Volume Manager Problem: metacheck.who.nodename.setname"
#	which summarizes the problem(s) and tells how to obtain detailed 
#	information. The "setname" is from the -s option, "who" is from 
#	the -w option, and "nodename" is reported by uname(1).
#	Email notification is further affected by the following options:
#		-f	to suppress additional messages after a problem 
#			has been found. 
#		-d	to control the suppression.
#		-w	to identify who generated the email.
#		-t	to force email even when there is no problem.
# -w who
#	indicate who is running the command. By default, this is the
#	user-name as reported by id(1M). This is used when sending
#	email notification (-m).
# -f 
#	Enable filtering.  Filtering applies to email notification (-m).
#	Filtering requires root permission.  When sending email notification
#	the file /etc/lvm/metacheck.setname.pending is used to 
#	control the filter.  The following matrix specifies the behavior
#	of the filter:
#
#	problem_found	file_exists
#	  yes		  no		Create file, send notification
#	  yes		  yes		Resend notification if the current date 
#					(as specified by -d datefmt) is 
#					different than the file date.
#	  no		  yes		Delete file, send notification 
#					that the problem is resolved.
#	  no		  no		Send notification if -t specified.
#	
# -d datefmt
#	Specify the format of the date for filtering (-f).  This option 
#	controls how often re-notification via email occurs. If the 
#	current date according to the specified format (strftime(3C)) is 
#	identical to the date contained in the 
#	/etc/lvm/metacheck.setname.pending file then the message is 
#	suppressed. The default date format is "%D", which will send one 
#	re-notification per day.
# -t
#	Test mode.  Enable email generation even when there is no problem.
#	Used for end-to-end verification of the mechanism and email addresses.
#	
#
# These options are designed to allow integration of metacheck
# into crontab.  For example, a root crontab entry of:
#
# 0,15,30,45 * * * * /usr/sbin/metacheck -f -w SVMcron \
#   -d '\%D \%h' -m notice@example.com 2148357243.8333033@pager.example.com
#
# would check for problems every 15 minutes, and generate an email to
# notice@example.com (and send to an email pager service) every hour when 
# there is a problem.  Note the \ prior to the '%' characters for a 
# crontab entry.  Bounced email would come back to root@nodename.
# The subject line for email generated by the above line would be
# Solaris Volume Manager Problem: metacheck.SVMcron.nodename.local
#

# display a debug line to controlling terminal (works in pipes)
decho()
{
    if [ "$debug" = "yes" ] ; then
	echo "DEBUG: $*"	< /dev/null > /dev/tty 2>&1
    fi
}

# if string $1 is in $2-* then return $1, else return ""
strstr()
{
    typeset	look="$1"
    typeset	ret=""

    shift
#   decho "strstr LOOK .$look. FIRST .$1."
    while [ $# -ne 0 ] ; do
	if [ "$look" = "$1" ] ; then
	    ret="$look"
	fi
	shift
    done
    echo "$ret"
}

# if string $1 is in $2-* then delete it. return result
strdstr()
{
    typeset	look="$1"
    typeset	ret=""

    shift
#   decho "strdstr LOOK .$look. FIRST .$1."
    while [ $# -ne 0 ] ; do
	if [ "$look" != "$1" ] ; then
	    ret="$ret $1"
	fi
	shift
    done
    echo "$ret"
}

merge_continued_lines()
{
    awk -e '\
	BEGIN { line = "";} \
	$NF == "\\" { \
	    $NF = ""; \
	    line = line $0; \
	    next; \
	} \
	$NF != "\\" { \
	    if ( line != "" ) { \
		print line $0; \
		line = ""; \
	    } else { \
		print $0; \
	    } \
	}'
}

# trim out stuff not associated with metadevices
find_meta_devices()
{
    typeset	devices=""

#   decho "find_meta_devices .$*."
    while [ $# -ne 0 ] ; do
	case $1 in
	d+([0-9]) )	# metadevice name
	    devices="$devices $1"
	    ;;
	esac
	shift
    done
    echo "$devices"
}

# return the list of top level metadevices
toplevel()
{
    typeset	comp_meta_devices=""
    typeset	top_meta_devices=""
    typeset	devices=""
    typeset	device=""
    typeset	comp=""

    metastat$setarg -p | merge_continued_lines | while read line ; do
	echo "$line"
	devices=`find_meta_devices $line`
	set -- $devices
	if [ $# -ne 0 ] ; then
	    device=$1
	    shift
	    # check to see if device already referred to as component
	    comp=`strstr $device $comp_meta_devices`
	    if [ -z "$comp" ] ; then 
		top_meta_devices="$top_meta_devices $device"
	    fi
	    # add components to component list, remove from top list
	    while [ $# -ne 0 ] ; do
		comp=$1
		comp_meta_devices="$comp_meta_devices $comp"
		top_meta_devices=`strdstr $comp $top_meta_devices`
		shift
	    done
	fi
    done > /dev/null 2>&1
    echo $top_meta_devices
}

#
# - MAIN
#
METAPATH=/usr/sbin
PATH=/usr/bin:$METAPATH
USAGE="usage: metacheck [-s setname] [-h] [[-t] [-f [-d datefmt]] \
    [-w who] -m recipient [recipient...]]"

datefmt="%D"
debug="no"
filter="no"
mflag="no"
set="local"
setarg=""
testarg="no"
who=`id | sed -e 's/^uid=[0-9][0-9]*(//' -e 's/).*//'`

while getopts d:Dfms:tw: flag
do
    case $flag in
    d)	datefmt=$OPTARG;
	;;
    D)	debug="yes"
	;;
    f)	filter="yes"
	;;
    m)	mflag="yes"
	;;
    s)	set=$OPTARG;
	if [ "$set" != "local" ] ; then
		setarg=" -s $set";
	fi
	;;
    t)	testarg="yes";
	;;
    w)	who=$OPTARG;
	;;
    \?)	echo $USAGE
	exit 1
	;;
    esac
done

# if mflag specified then everything else part of recipient
shift `expr $OPTIND - 1`
if [ $mflag = "no" ] ; then
    if [ $# -ne 0 ] ; then 
	echo $USAGE
	exit 1
    fi
else
    if [ $# -eq 0 ] ; then 
	echo $USAGE
	exit 1
    fi
fi
recipients="$*"

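# quote the format so that a datefmt containing spaces (such as "%D %h")
# is passed to date as a single argument
curdate_filter=`date "+$datefmt"`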
curdate=`date`
node=`uname -n`

# establish files
msg_f=/tmp/metacheck.msg.$$
msgs_f=/tmp/metacheck.msgs.$$
metastat_f=/tmp/metacheck.metastat.$$
metadb_f=/tmp/metacheck.metadb.$$
metahs_f=/tmp/metacheck.metahs.$$
pending_f=/etc/lvm/metacheck.$set.pending 
files="$metastat_f $metadb_f $metahs_f $msg_f $msgs_f"

rm -f $files							> /dev/null 2>&1
trap "rm -f $files > /dev/null 2>&1; exit 1" 1 2 3 15

# Check to see if metadb is capable of running
have_metadb="yes"
metadb$setarg 							> $metadb_f 2>&1
if [ $? -ne 0 ] ; then
    have_metadb="no"
fi
grep "there are no existing databases"  	< $metadb_f	> /dev/null 2>&1
if [ $? -eq 0 ] ; then
    have_metadb="no"
fi
grep "/dev/md/admin"				< $metadb_f	> /dev/null 2>&1
if [ $? -eq 0 ] ; then
    have_metadb="no"
fi

# check for problems accessing metadbs
retval=0
if [ "$have_metadb" = "no" ] ; then
    retval=1
    echo "metacheck: metadb problem, can't run '$METAPATH/metadb$setarg'" \
								>> $msgs_f
else
    # snapshot the state
    metadb$setarg 2>&1 | sed -e '1d' | merge_continued_lines	> $metadb_f
    metastat$setarg 2>&1 | merge_continued_lines		> $metastat_f
    metahs$setarg -i 2>&1 | merge_continued_lines		> $metahs_f

    #
    # Check replicas for problems, capital letters in the flags
    # indicate an error, fields are separated by tabs.
    #
    problem=`awk < $metadb_f -F\t '{if ($1 ~ /[A-Z]/) print $1;}'`
    if [ -n "$problem" ] ; then
	retval=`expr $retval + 64`
	echo "\
metacheck: metadb problem, for more detail run:\n\t$METAPATH/metadb$setarg -i" \
								>> $msgs_f
    fi

    #
    # Check the metadevice state
    #
    problem=`awk < $metastat_f -e \
		'/State:/ {if ($2 != "Okay" && $2 != "Resyncing") print $0;}'`
    if [ -n "$problem" ] ; then
	retval=`expr $retval + 128`
	echo "\
metacheck: metadevice problem, for more detail run:" \
								>> $msgs_f

	# refine the message to toplevel metadevices that have a problem
	top=`toplevel`
	set -- $top
	while [ $# -ne 0 ] ; do
	    device=$1
	    problem=`metastat$setarg $device | awk -e \
		'/State:/ {if ($2 != "Okay" && $2 != "Resyncing") print $0;}'`
	    if [ -n "$problem" ] ; then
		echo "\t$METAPATH/metastat$setarg $device"	>> $msgs_f
		# find out what is mounted on the device
		mp=`mount|awk -e '/\/dev\/md\/dsk\/'$device'[ \t]/{print $1;}'`
		if [ -n "$mp" ] ; then
		    echo "\t\t$mp mounted on $device"		>> $msgs_f
		fi
	    fi
	    shift
	done
    fi

    #
    # Check the hotspares to see if any have been used.
    #
    problem=""
    grep "no hotspare pools found"	< $metahs_f		> /dev/null 2>&1
    if [ $? -ne 0 ] ; then
	problem=`awk < $metahs_f -e \
	    '/blocks/ { if ( $2 != "Available" ) print $0;}'`
    fi
    if [ -n "$problem" ] ; then
	# exit statuses are truncated to 8 bits, so 256 would be reported as 0
	retval=`expr $retval + 32`
	echo "\
metacheck: hot spare in use, for more detail run:\n\t$METAPATH/metahs$setarg -i" \
								 >> $msgs_f
    fi
fi

# If any errors occurred, then mail the report
if [ $retval -ne 0 ] ; then
    if [ -n "$recipients" ] ; then 
	re=""
	if [ -f $pending_f ] && [ "$filter" = "yes" ] ; then
	    re="Re: "
	    # we have a pending notification, check date to see if we resend
	    penddate_filter=`cat $pending_f | head -1`
	    if [ "$curdate_filter" != "$penddate_filter" ] ; then
		rm -f $pending_f				> /dev/null 2>&1
	    else
	 	if [ "$debug" = "yes" ] ; then
		    echo "metacheck: email problem notification still pending"
		    cat $pending_f
		fi
	    fi
	fi
	if [ ! -f $pending_f ] ; then
	    if [ "$filter" = "yes" ] ; then
		echo "$curdate_filter\n\tDate:$curdate\n\tTo:$recipients" \
								> $pending_f
	    fi
	    echo "\
Solaris Volume Manager: $node: metacheck$setarg: Report: $curdate"		>> $msg_f
	    echo "\
--------------------------------------------------------------" >> $msg_f
	    cat $msg_f $msgs_f | mailx -s \
		"${re}Solaris Volume Manager Problem: metacheck.$who.$node.$set" $recipients
	fi
    else
	cat $msgs_f
    fi
else
    # no problems detected,
    if [ -n "$recipients" ] ; then
	# default is to not send any mail, or print anything.
	echo "\
Solaris Volume Manager: $node: metacheck$setarg: Report: $curdate"		>> $msg_f
	echo "\
--------------------------------------------------------------" >> $msg_f
	if [ -f $pending_f ] && [ "$filter" = "yes" ] ; then
	    # pending filter exists, remove it and send OK
	    rm -f $pending_f					> /dev/null 2>&1
	    echo "Problem resolved"				>> $msg_f
	    cat $msg_f | mailx -s \
		"Re: Solaris Volume Manager Problem: metacheck.$who.$node.$set" $recipients
	elif [ "$testarg" = "yes" ] ; then
	    # for testing, send mail every time even though there is no problem
	    echo "Messaging test, no problems detected"		>> $msg_f
	    cat $msg_f | mailx -s \
		"Solaris Volume Manager Problem: metacheck.$who.$node.$set" $recipients
	fi
    else
	echo "metacheck: Okay"
    fi
fi

rm -f $files							> /dev/null 2>&1
exit $retval

For information about invoking scripts by using the cron utility, see the cron(1M) man page.
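The crontab entry suggested in the script's header comments can be added without disturbing entries that already exist. The following is a minimal sketch for ksh or another POSIX shell, assuming that the script has been installed as /usr/sbin/metacheck, which is the path used in the header comments. The function name add_cron_entry and the temporary file name are hypothetical.

```shell
# add_cron_entry (hypothetical name): copy a crontab from standard input
# to standard output and append the entry given in $1 unless an identical
# line is already present.
add_cron_entry() {
    entry=$1
    found=no
    while IFS= read -r line; do
        printf '%s\n' "$line"
        if [ "$line" = "$entry" ]; then
            found=yes
        fi
    done
    [ "$found" = yes ] || printf '%s\n' "$entry"
}

# Typical use, as root (note the \ before each % in a crontab entry):
#   crontab -l | add_cron_entry \
#       "0,15,30,45 * * * * /usr/sbin/metacheck -f -w SVMcron -d '\%D \%h' -m notice@example.com" \
#       > /tmp/crontab.new && crontab /tmp/crontab.new
```

Writing the merged crontab to a file first lets you inspect it before the crontab command installs it.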
