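Use Case: Automating OS Patching Using a Self-Hosted Instance

As an organization, you can automate OS patching of Oracle Base Database Service DBNode resources by using a self-hosted instance in Fleet Application Management.

This use case describes a scenario in which you need to patch the operating system of Oracle Base Database Service resources. Because Oracle Base Database Service doesn't provide native OS patching support, you can use the self-hosted instance feature of Fleet Application Management to automate OS patching for DBNode resources and keep those systems up to date.

For more information about self-hosted instances, see Self-Hosted Instances in Fleet Application Management.

To automate OS patching of Oracle Base Database Service DBNode resources by using the self-hosted instance feature of Fleet Application Management, perform the following steps.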

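1. Create and Set Up a Compute Instance
2. Enable Instance Principal Authentication for the Instance
3. Create Secrets
4. Prepare the DBNode OS Patching Script in the Instance
5. Run the Script by Using the Self-Hosted Instance in a Runbook

1. Create and Set Up a Compute Instance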

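Create a self-hosted compute instance in Oracle Cloud Infrastructure.

Access the Console and sign in with your credentials.
Go to the compute Instances page and start the "Create instance" workflow.
- Open the navigation menu and select "Compute". Under "Compute", select "Instances".
- Select "Create instance".
- Name: Enter a name for the instance (for example, Self-Hosted-Instance1).
- Compartment: Select the compartment in which to create the instance.
- Image and shape: Select an Oracle Linux image and a shape (for example, VM.Standard.E4.Flex).
  Note
  
  Select a compute shape with enough resources (for example, 2 OCPUs and 8 GB of RAM) to handle processing for a large tenancy. For more information, see Compute Shapes.
- Networking: Use an existing virtual cloud network (VCN) or create a new one. Make sure that the instance has a public IP address or is otherwise reachable (for example, through a bastion host).
- SSH keys: Add your public SSH key or generate a new key pair for secure access.
- Select "Create".
Verify the instance: After creation, confirm in the OCI Console that the instance state is "Running".
Get the private IP address of the compute instance.
On the compute Instances list page, select the new instance and note its private IP address.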
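Connect by using SSH.
- Open a terminal on your local machine.
- Run the SSH command with the private IP address that you noted.
```
ssh -i <your-key.pem> opc@<instance-ip>
```
  Replace <your-key.pem> and <instance-ip> with your key file and the instance's IP address.
- If the instance is in a private subnet, connect through a bastion host or Cloud Shell.
Verify the connection.
After connecting, confirm that an Oracle Linux prompt (for example, [opc@Self-Hosted-Instance1 ~]$) appears.
Update the system.
Run the following command to bring the system up to date: sudo yum update -y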

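2. Enable Instance Principal Authentication for the Instance

Set up instance principal authentication so that the instance can access the SSH account credentials and private key for the target DBNodes.

Navigate to the Dynamic groups page and start the "Create dynamic group" flow.
- Open the navigation menu and select "Identity & Security". Under "Identity", select "Domains".
- From "Domains", select the relevant domain.
- On the details page, select the "Dynamic groups" tab, and then select "Create dynamic group".
- Enter a name (for example, FAMS-SelfHost-Scheduler-Mgmt-DG) and a description (for example, Dynamic group for the instance that you created).
- Add a matching rule that includes the instance by OCID: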
```
instance.id = 'ocid1.instance.oc1..<your-instance-ocid>'
```
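  Note
  
  To find the instance OCID, go to the compute Instances list page, select the instance, and copy its OCID.
- Select "Create".
Add policies for the dynamic group.
- Open the navigation menu and select "Identity & Security". Under "Identity", select "Policies".
- Select "Create Policy".
  - Enter a name (for example, FAMS-SelfHost-Scheduler-Mgmt-DG) and a description (for example, Policy for the OS patching script instance principal).
  - Select the root compartment or, if required, another compartment.
  - Add the following policy statements: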
    Allow dynamic-group FAMS-SelfHost-Scheduler-Mgmt-DG to {VAULT_READ, SECRET_BUNDLE_READ, OBJECT_INSPECT, OBJECT_READ} in tenancy where any {target.compartment.name in ('Services-Comp1', 'Services-Comp2')}
    Allow dynamic-group FAMS-SelfHost-Scheduler-Mgmt-DG to read database-family in tenancy where any {target.compartment.name in ('Services-Comp1', 'Services-Comp2')}
    Allow dynamic-group FAMS-SelfHost-Scheduler-Mgmt-DG to read vnic in tenancy where any {target.compartment.name in ('Services-Comp1', 'Services-Comp2')}
    Allow dynamic-group FAMS-SelfHost-Scheduler-Mgmt-DG to {FAMS_SCHEDULE_JOB_UPDATE} in tenancy
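  - Select "Create".
Note

Verify the instance OCID in the dynamic group rule so that the instance receives the required permissions. Without these policies, the DBNode OS patching script fails with permission errors when it calls OCI APIs.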
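
The policy statements above differ only in the dynamic group name and the list of compartment names, so they can be generated instead of edited by hand. The following is a minimal sketch under that assumption, not an official tool: the function name build_policy_statements is hypothetical, it only formats strings, and the statement text simply mirrors the example in this step.

```python
def build_policy_statements(dynamic_group, compartments):
    """Return the example policy statements for one dynamic group.

    Hypothetical helper: it only builds strings and never calls OCI.
    """
    if not compartments:
        raise ValueError("at least one compartment name is required")
    # Quote each compartment name the way the example statements do.
    names = ", ".join(f"'{name}'" for name in compartments)
    condition = f"where any {{target.compartment.name in ({names})}}"
    subject = f"Allow dynamic-group {dynamic_group} to"
    return [
        f"{subject} {{VAULT_READ, SECRET_BUNDLE_READ, OBJECT_INSPECT, OBJECT_READ}} "
        f"in tenancy {condition}",
        f"{subject} read database-family in tenancy {condition}",
        f"{subject} read vnic in tenancy {condition}",
        f"{subject} {{FAMS_SCHEDULE_JOB_UPDATE}} in tenancy",
    ]


if __name__ == "__main__":
    for statement in build_policy_statements(
        "FAMS-SelfHost-Scheduler-Mgmt-DG", ["Services-Comp1", "Services-Comp2"]
    ):
        print(statement)
```

Paste the printed statements into the policy editor after reviewing them; the compartment names used here are the placeholders from this topic.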

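3. Create Secrets

Create a secret for the SSH private key (ssh_private_key) of the SSH account (opc):

Open the navigation menu and select "Identity & Security", and then select "Vault".
If no vault exists, create one. See Creating a Vault.
Create a secret in the vault for the SSH private key. See Creating a Secret in a Vault.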
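
When the patching script later reads this secret, the Vault service typically returns the content base64-encoded, so the key must be decoded before an SSH library can use it. The following is a minimal sketch of that decoding step, assuming the secret was stored as plain PEM text. The function names are hypothetical, and the retrieval call itself (for example, through the OCI SDK secrets client) is deliberately left out.

```python
import base64
import binascii


def decode_secret_content(encoded):
    """Decode base64 secret content into text (hypothetical helper)."""
    # Tolerate stray whitespace or line breaks around the encoded value.
    compact = "".join(encoded.split())
    try:
        raw = base64.b64decode(compact, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("secret content is not valid base64") from exc
    return raw.decode("utf-8")


def looks_like_private_key(text):
    """Cheap sanity check that the decoded text is a PEM private key."""
    lines = text.strip().splitlines()
    return (
        len(lines) >= 2
        and lines[0].startswith("-----BEGIN")
        and lines[0].rstrip("-").endswith("PRIVATE KEY")
        and lines[-1].startswith("-----END")
    )


if __name__ == "__main__":
    # Dummy content for illustration only; this is not a usable key.
    pem = "-----BEGIN RSA PRIVATE KEY-----\nZHVtbXk=\n-----END RSA PRIVATE KEY-----\n"
    encoded = base64.b64encode(pem.encode("utf-8")).decode("ascii")
    print(looks_like_private_key(decode_secret_content(encoded)))
```

Checking the decoded text before opening an SSH connection makes a wrong or truncated secret fail early with a clear message.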

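4. Prepare the DBNode OS Patching Script in the Instance

Connect by using SSH to the instance that you created earlier.
Copy the DBNode OS patching script (for example, the Sample DBNode OS Patching Script).
Use scp to upload the script file (for example, run_os_patching.py) to the instance.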
```
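# On your local machine: upload the script to the instance.
scp -i your-key.pem run_os_patching.py opc@instance-ip:/home/opc
# On the instance: move the script to the directory it runs from.
sudo mv /home/opc/run_os_patching.py /root/fams_os_patching/run_os_patching.py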
```
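Set up the environment, for example:
- Install the Python dependencies.
- Configure any required credentials (for example, Jira).

Run the script as a test.

python3 /root/fams_os_patching/run_os_patching.py

Review the output to track progress and check for errors.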
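
The sample script at the end of this topic reads args.display_name and args.option, and the runbook command in step 5 passes --display-name and --option precheck. The following is a minimal sketch of a matching parse_arguments() helper. It assumes that precheck and update are the only options, and the help text is illustrative rather than taken from the actual script.

```python
import argparse


def parse_arguments(argv=None):
    """Parse the command-line flags used by the patching script (sketch)."""
    parser = argparse.ArgumentParser(description="DBNode OS patching (sketch)")
    parser.add_argument(
        "--display-name",
        required=True,
        help="Display name of the DB system to patch",
    )
    parser.add_argument(
        "--option",
        required=True,
        choices=["precheck", "update"],
        help="Run only the precheck, or the precheck followed by the update",
    )
    # argv=None makes argparse read sys.argv; a list can be passed for testing.
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_arguments(["--display-name", "my-db-system", "--option", "precheck"])
    print(f"display-name={args.display_name}, option={args.option}")
```

With a parser like this, a test run would pass both flags explicitly, and an unsupported option stops the script before it contacts any DBNode.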

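5. Run the Script by Using the Self-Hosted Instance in a Runbook

Use the self-hosted instance in a runbook to run the DBNode OS patching script. This process involves configuring the instance, defining the runbook, and monitoring the process.

Assign the instance that you created (Self-Hosted-Instance1) as a self-hosted instance in Fleet Application Management.
Select the instance from the list of compute instances to add it as a self-hosted instance. See Creating a Self-Hosted Instance.
Confirm that the instance is attached and appears in Fleet Application Management.
Create and configure a fleet for the instance.
- Create a fleet (for example, fams_db-os-patching) and add the self-hosted instance as a resource. You don't need to add a product. See Creating a Fleet.
- Confirm that the fleet is in the appropriate compartment and, if applicable, is set to the Production environment type.
Create a runbook for the self-hosted instance.
- Create a runbook (for example, fams_os_dbaas_patching_test_runbook) that uses patching as its lifecycle operation. See Creating a Runbook.
- Add a task to the runbook that runs a shell command on the self-hosted instance to start the script (for example, /root/fams_os_patching/run_os_dbaas_patching.py).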
```
sh -c '. /root/fams_os_patching/os_dbaas/bin/activate; set -eu; dbsystemname=""; for arg in "$@"; do case "$arg" in dbsystemname=*) dbsystemname="${arg#dbsystemname=}";; esac; done; : "${dbsystemname:?dbsystemname not provided}"; echo "dbsystemname=${dbsystemname}"; exec python3 /root/fams_os_patching/run_os_dbaas_patching.py --display-name "${dbsystemname}" --option precheck' sh "$@"
```
- Save the runbook.
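
The shell command in the runbook task scans its arguments for a dbsystemname=<value> pair and stops with an error if none is supplied. For readers more comfortable with Python, the same extraction could be sketched as follows. The function name is hypothetical, and this sketch is not part of the runbook task itself.

```python
def extract_dbsystemname(args):
    """Return the value of the last dbsystemname=... argument.

    Mirrors the shell loop: later occurrences override earlier ones, and a
    missing or empty value is treated as an error.
    """
    prefix = "dbsystemname="
    value = ""
    for arg in args:
        if arg.startswith(prefix):
            value = arg[len(prefix):]
    if not value:
        raise ValueError("dbsystemname not provided")
    return value


if __name__ == "__main__":
    print(extract_dbsystemname(["region=example", "dbsystemname=my-db-system"]))
```

The extracted value is what the task passes to the script as --display-name.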
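Create a runbook process for the fleet by using the runbook.
- Schedule or trigger the runbook process to run the DBNode OS patching script. See Processing Runbooks.
- From the "Fleets" list page, select the fleet (for example, fams_db-os-patching). See Viewing Fleet Details.
- Create a job or runbook process. Select the appropriate runbook (for example, fams_os_dbaas_patching_test_runbook) and lifecycle operation. See Processing Runbooks.
- Schedule the job or run it immediately.
Monitor the runbook process logs.
- In the Fleet Application Management logs, confirm that the script ran successfully. See Getting Runbook Process Log Details for a Fleet.
- Select the runbook process job and view the logs for progress and error messages (for example, the progress of the OS patching job).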

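Sample DBNode OS Patching Script

The following sample script patches the DBNode operating system. Specify the database display name and select the option to run, such as precheck or update. The script is a skeleton: lines that begin with // and values in angle brackets (such as <yes/no> or <expected outcome>) are placeholders to replace with your own logic.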

def main():
    """Main function to orchestrate the DBaaS patching process."""
    args = parse_arguments()
    logger.info(f"Starting script with arguments: display-name={args.display_name}, option={args.option}")
    print(f"Starting script with arguments: display-name={args.display_name}, option={args.option}")

    # Get tenancy and region
    tenancy_id, region = get_tenancy_and_region()

    # Initialize OCI clients
    db_client, compute_client, virtual_network_client, identity_client, secrets_client = initialize_oci_clients(region)

    # Get all compartments
    try:
        compartments = oci.pagination.list_call_get_all_results(identity_client.list_compartments, tenancy_id).data
        logger.info(f"Retrieved {len(compartments)} compartments")        
    except oci.exceptions.ServiceError as e:
        logger.error(f"Failed to list compartments: {str(e)}. Exiting program.")
        //handle exception and exit

    # Get DB System by display name
    db_system, compartment_id = get_db_system_by_display_name(db_client, compartments, args.display_name)
    if not db_system:
        //handle exception and exit

    # Get DB Nodes
    db_nodes = get_db_nodes(db_client, compartment_id, db_system.id)
    if not db_nodes:
        //handle exception and exit

    # Retrieve secret for SSH
    secret_id = //handle fetching secrets from vault if required either from arguments or a suitable mechanism 
    private_key_content = get_secret_content(secrets_client, secret_id)
    if not private_key_content:
        //handle exception and exit

    # Process each DB Node
    for node in db_nodes:
        node_ip = get_node_ip(virtual_network_client, node.vnic_id)
        if not node_ip:
            //handle exception and exit

        # Initialize SSH client example using any suitable library based on your use case
        ssh_client = paramiko.SSHClient()
        ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        try:
            logger.info(f"Connecting to {node_ip} as <user>")
            private_key_file = StringIO(private_key_content)
            private_key = paramiko.RSAKey.from_private_key(private_key_file)
            ssh_client.connect(node_ip, username=user, pkey=private_key)
            logger.info(f"Connected to {node_ip}")            
        except Exception as e:
            //handle exception and exit

        # Check DCS agent status and attempt to restart if down
        //handle agent check if required
        ...
		
        # Determine storage type to check if ASM is used if required
        is_asm = identify_storage_type(ssh_client, command)

        # Perform precheck or update
        if args.option == "precheck":
            if not os_update_precheck(ssh_client, node_ip, is_asm):
                //handle exception and exit
            logger.info(f"OS update precheck completed successfully on {node_ip}")            
        elif args.option == "update":
            if not os_update_precheck(ssh_client, node_ip, is_asm):
                //handle exception and exit
            if not os_update(ssh_client, node_ip, is_asm, secrets_client, secret_id):
                //handle exception and exit
            logger.info(f"OS update completed successfully on {node_ip}")            

        ssh_client.close()

def os_update(ssh_client, node_ip, is_asm, secrets_client, secret_id):
    # Pre-patching checks
    if is_asm:
        # Check grid user permissions
        output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>, sudo_user=user)
        if error:
            //handle exception and exit

        # Check CRS status
        output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>)
        if error or output != "<expected outcome>":
            //handle exception and exit
        logger.info(f"CRS is online on {node_ip}")        

        # Check DB processes
        output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>)
        if error or int(output) <= <expected outcome>:
            //handle exception and exit
        logger.info(f"DB services are up on {node_ip} with {output} processes")
        
    else:
        # Check DB processes
        output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>, sudo_user=user)
        if error or int(output) <= <expected outcome>:
            //handle exception and exit
        logger.info(f"DB services are up on {node_ip} with {output} processes")

        # Check alert log for startup
        output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>, sudo_user=user)
        if error or int(output) <= <expected outcome>:
            //handle exception and exit
        logger.info(f"Database startup confirmed in alert log on {node_ip}")

    # Kernel control check
    output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>, sudo_user=user)
    if error:
        //handle exception and exit
    kernel = output
    if "<kernel version 1>" in kernel:
        repo_file = "<version suitable repo>"
    elif "<kernel version 2>" in kernel:
        logger.warning(f"Node {node_ip} is running a version, which is end of life. Skipping OS patching.")
        return False
    else:
        repo_file = "<version suitable repo>"


    # Start OS patching
    output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>)
    if error:
        //handle exception and exit
    if not output:
        logger.error(f"No output from dbcli update server on {node_ip}, cannot proceed. Exiting program.")
        //handle exception and exit

    logger.info(f"dbcli update output: {output}")
    
    try:
        job_data = json.loads(output)
        job_id = job_data.get('jobId')
        if not job_id:
            logger.error(f"No jobId found in dbcli update server output on {node_ip}. Exiting program.")
            //handle exception and exit
        logger.info(f"Update Job ID: {job_id}")
        
    except json.JSONDecodeError:
        //handle exception and exit

    # Monitor job status every 5 minutes for up to 3 hours
    start_time = time.time()
    timeout = 10800  # 3 hours in seconds
    polling_interval = 300  # 5 minutes in seconds
    while True:
        output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>)
        if error:
            logger.error(f"Failed to check job status for {job_id} on {node_ip}: {error}. Exiting program.")
            //handle exception and exit
        if not output:
            logger.error(f"No output from dbcli describe job for {job_id} on {node_ip}. Exiting program.")
            //handle exception and exit
        logger.info(f"Job {job_id} status output: {output}")
        
        try:
            job_data = json.loads(output)
            status = job_data.get('status')
            if not status:
                logger.error(f"No status found in dbcli describe-job output for {job_id} on {node_ip}. Exiting program.")
                //handle exception and exit
            logger.info(f"Job {job_id} status: {status}")
            if status == "Success":
                logger.info(f"OS patching job {job_id} completed successfully on {node_ip}")                
                break
            elif status == "Failure":
                logger.error(f"OS patching job {job_id} failed on {node_ip}. Exiting program.")
                //handle exception and exit
            elif status in ["Running", "InProgress", "In_Progress"]:
                elapsed = time.time() - start_time
                if elapsed > timeout:
                    logger.error(f"OS patching job {job_id} timed out after 3 hours on {node_ip}. Exiting program.")
                    //handle exception and exit
                logger.info(f"Job {job_id} still {status}, checking again in 5 minutes")                
                time.sleep(polling_interval)
            else:
                logger.error(f"Unexpected job status for {job_id} on {node_ip}: {status}. Exiting program.")
                //handle exception and exit
        except json.JSONDecodeError:
            //handle exception and exit

    # Shutdown CRS/DB before reboot
    if is_asm:
        output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>)
        logger.info(f"Pre-reboot CRS status output (as root): {output}")
        if output == <expected outcome>:
            logger.info(f"CRS is up, shutting down CRS on {node_ip} as root")
            if error:
                //handle exception and exit
            time.sleep(120)
        else:
            logger.info(f"CRS is already down on {node_ip}, proceeding with reboot")
            
    else:
        output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>, sudo_user=user)
        logger.info(f"Pre-reboot database processes output: {output}")
        print(f"Pre-reboot database processes output: {output}")
        if output == <expected outcome>:
            logger.info(f"Database is up, shutting down database on {node_ip}")
            if error:
                //handle exception and exit
            output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>, sudo_user=user) # check trace log
            if output != <expected outcome>:
                logger.error(f"Database shutdown incomplete on {node_ip}, expected 'Shutting down instance' in alert log. Exiting program.")
                //handle exception and exit
            time.sleep(120)
        else:
            logger.info(f"Database is already down on {node_ip}, proceeding with reboot")            

    # Reboot the server
    output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>)
    if error:
        //handle exception and exit
    logger.info(f"Initiated reboot on {node_ip}")    
    time.sleep(120)  # Wait for reboot to initiate

    # Check host status with fresh SSH client
    start_time = time.time()
    timeout = 1440  # 24 minutes in seconds
    new_ssh_client = None
    while True:
        new_ssh_client = paramiko.SSHClient()
        new_ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        try:
            //attempt connecting SSH to ensure its online
            break  # connected, so the node is back online
        except Exception as e:
            //connection failed; the node may still be rebooting, so fall through and retry
        elapsed = time.time() - start_time
        if elapsed > timeout:
            logger.error(f"Node {node_ip} failed to come online after {timeout} seconds. Exiting program.")
            //handle exception and exit
        logger.info(f"{node_ip} not up yet. Waiting 30 seconds...")
        time.sleep(30)

    # Post-reboot wait and checks with new SSH client
    

    # Perform post-reboot service startup if needed
    ...

    # Perform post-reboot checks if required
    ...

    logger.info(f"OS update completed successfully on {node_ip}")
    new_ssh_client.close()
    return True
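
The job-monitoring loop in os_update() follows a general pattern: poll a status source at a fixed interval until it reports success or failure, or until a deadline passes. The following is a minimal, self-contained sketch of that pattern, not the script's actual implementation. The names wait_for_job and get_status are hypothetical, the status strings are the ones used in the sample above, and the clock and sleep functions are injectable only so that the loop can be exercised without waiting.

```python
import time


def wait_for_job(get_status, timeout=10800, interval=300,
                 clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until the job succeeds, fails, or times out.

    Returns True on "Success" and False on "Failure". Raises TimeoutError
    when the deadline passes and ValueError on an unknown status.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status == "Success":
            return True
        if status == "Failure":
            return False
        if status not in ("Running", "InProgress", "In_Progress"):
            raise ValueError(f"unexpected job status: {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout} seconds")
        sleep(interval)


if __name__ == "__main__":
    # Simulated job that reports Running twice and then Success.
    statuses = iter(["Running", "Running", "Success"])
    print(wait_for_job(lambda: next(statuses), interval=0))
```

Keeping the timeout check outside the status branches means the deadline applies to every in-progress status, and separating the polling logic from the SSH call makes it testable with a simulated status source.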

For more information about database command-line interface (DBCLI) commands, see the Oracle Database CLI Reference.