Use Case: Automating OS Patching by Using a Self-Hosted Instance

As an organization, I want to automate OS patching for Oracle Base Database Service DBNode resources by using a self-hosted instance with Fleet Application Management.

This use case describes an example in which you need to patch the operating system for Oracle Base Database Service resources. Because Oracle Base Database Service doesn't provide native OS patching support, you can use the self-hosted instance feature in Fleet Application Management to automate OS patching on DBNode resources and keep systems up-to-date.

For information about self-hosted instances, see Self-Hosted Instances in Fleet Application Management.

Perform the following steps to automate OS patching for Oracle Base Database Service DBNode resources using the self-hosted instance feature in Fleet Application Management:

1. Create and Set Up the Compute Instance
2. Enable Instance Principal Authentication for the Instance
3. Create Secrets
4. Prepare the DBNode OS Patching Script on the Instance
5. Run the Script by Using the Self-Hosted Instance with a Runbook

1. Create and Set Up the Compute Instance

Create a self-hosted Compute instance in Oracle Cloud Infrastructure.

Access the Console and sign in with your credentials.
Navigate to the Compute instances page and start the Create Instance workflow.
- Open the navigation menu and select Compute. Under Compute, select Instances.
- Select Create instance.
- Name: Enter a name for the instance (for example, Self-Hosted-Instance1).
- Compartment: Select the compartment where you want to create the instance.
- Image: Select an Oracle Linux image (for example, VM.Standard.E4.Flex).
  Note
  
  Select a Compute shape with enough resources (for example, 2 OCPUs, 8 GB RAM) to handle the processing for large tenancies. For more information, see Compute Shapes.
- Networking: Use an existing Virtual Cloud Network (VCN) or create a new one. Ensure the instance has a public IP address or is accessible (for example, by using a Bastion host).
- SSH Keys: Add your public SSH key or generate new keys for secure access.
- Select Create.
Verify instance: After creating, confirm that the instance state is Running in the OCI Console.
Get the private IP address from the Compute instance.
On the Compute instances list page, select your new instance, and note the private IP address.
Connect by using SSH.
- Open a terminal on your local machine.
- Use the SSH command with the private IP address you noted:
```
ssh -i <your-key.pem> opc@instance-ip
```
  Replace your-key.pem and instance-ip with your file and address.
- Use a Bastion host or Cloud Shell to connect if the instance is in a private subnet.
Verify connection.
After you connect, verify you see the Oracle Linux prompt (for example, [opc@Self-Hosted-Instance1 ~]$).
Update the system.
Run the following command to ensure the system is up-to-date: sudo yum update -y

2. Enable Instance Principal Authentication for the Instance

Set up instance principal authentication to allow the instance to access SSH account credentials and private keys for target DBNodes.

Navigate to the Dynamic Groups page and start the Create dynamic group flow.
- Open the navigation menu and select Identity & Security. Under Identity, select Domains.
- From the Domains, select the relevant domain.
- On the details page, select the Dynamic groups tab, then select Create dynamic group.
- Enter a name (for example, FAMS-SelfHost-Scheduler-Mgmt-DG) and description (for example, Dynamic group for the instance you created).
- Add a matching rule to include your instance by OCID:
```
instance.id = 'ocid1.instance.oc1..<your-instance-ocid>'
```
  Note
  
  To find the instance OCID, go to the Compute instances list page, select your instance, and copy the OCID.
- Select Create.

Add policies for the dynamic group.

Open the navigation menu and select Identity & Security. Under Identity, select Policies.

Select Create policy.

Enter a name (for example, FAMS-SelfHost-Scheduler-Mgmt-DG) and description (for example, Policy for OS patching script instance principal).
Select the root compartment, or another compartment if needed.

Add the following policy statements:

Allow dynamic-group FAMS-SelfHost-Scheduler-Mgmt-DG to {VAULT_READ, SECRET_BUNDLE_READ, OBJECT_INSPECT, OBJECT_READ} in tenancy where any {target.compartment.name in ('Services-Comp1', 'Services-Comp2')}
Allow dynamic-group FAMS-SelfHost-Scheduler-Mgmt-DG to read database-family in tenancy where any {target.compartment.name in ('Services-Comp1', 'Services-Comp2')}
Allow dynamic-group FAMS-SelfHost-Scheduler-Mgmt-DG to read vnic in tenancy where any {target.compartment.name in ('Services-Comp1', 'Services-Comp2')}
Allow dynamic-group FAMS-SelfHost-Scheduler-Mgmt-DG to {FAMS_SCHEDULE_JOB_UPDATE} in tenancy

Select Create.

Note

Verify the instance OCID in the dynamic group rules to ensure the instance has required permissions. Without these policies, the DBNode OS patching script fails with permission errors when accessing OCI APIs.

3. Create Secrets

Create a secret for SSH account (opc) ssh_private_key:

Open the navigation menu , select Identity & Security, and then select Vault.
If no Vault exists, create a new one. See Creating a Vault.
Create a secret within the Vault for the SSH private key. See Creating a Secret in a Vault.

4. Prepare the DBNode OS Patching Script on the Instance

Use SSH to connect to the instance you created before.
Copy the DBNode OS patching script (for example, Sample DBNode OS Patching Script).
Upload the script file (for example, run_os_patching.py) to the instance using scp:
```
scp -i your-key.pem run_os_patching.py opc@instance-ip:/home/opc
sudo mv /home/opc/run_os_patching.py /root/fams_os_patching/run_os_patching.py
```
Set up an environment:
- Install Python dependencies.
- Configure required credentials (for example, Jira).

Test run the script:

python3 /root/fams_os_patching/run_os_patching.py

Observe the output to verify progress and check for errors.

5. Run the Script by Using the Self-Hosted Instance with a Runbook

Run the DBNode OS patching script by using a self-hosted instance in a runbook. The process involves configuring the instance, defining a runbook, and monitoring the process.

Assign the instance you created (Self-Hosted-Instance1) as a self-hosted instance in Fleet Application Management.
Add the instance as a self-hosted instance by selecting it from the list of Compute instances. See Creating a Self-Hosted Instance.
Confirm the instance is attached and visible in Fleet Application Management.
Create and configure a fleet for the instance.
- Create a fleet (for example, fams_db-os-patching) and add the self-hosted instance as a resource. No products need to be added. See Creating a Fleet.
- Ensure the fleet is in the appropriate compartment and set to the Production environment type if applicable.

Create a runbook for the self-hosted instance.

Create a runbook—for example, fams_os_dbaas_patching_test_runbook—that uses patching as the lifecycle operation. See Creating a Runbook.

Add a task to the runbook to run the shell script (for example, /root/fams_os_patching/run_os_dbaas_patching.py) on the self-hosted instance:

sh -c '. /root/fams_os_patching/os_dbaas/bin/activate; set -eu; dbsystemname=""; for arg in "$@"; do case "$arg" in dbsystemname=*) dbsystemname="${arg#dbsystemname=}";; esac; done; : "${dbsystemname:?dbsystemname not provided}"; echo "dbsystemname=${dbsystemname}"; exec python3 /root/fams_os_patching/run_os_dbaas_patching.py --display-name "${dbsystemname}" --option precheck' sh "$@"

Save the runbook.

Create a runbook process for the fleet by using the runbook.
- Schedule or trigger the runbook process to run the DBNode OS patching script. See Processing a Runbook.
- Select the fleet (for example, fams_db-os-patching) from the Fleets list page. See Getting a Fleet's Details.
- Create a job or runbook process. Select the appropriate runbook such as, fams_os_dbaas_patching_test_runbook) and the lifecycle operation. See Processing a Runbook.
- Schedule or run the job immediately.
Monitor the runbook process logs.
- Check the logs in Fleet Application Management confirm the script ran successfully. See Getting Runbook Process Log Details for a Fleet.
- Select the runbook process job and view its logs for progress and error messages (for example, OS patching job progress).

Sample DBNode OS Patching Script

The following is a sample script to patch the DBNode operating system. Specify the database display name and select the options you want to perform, such as precheck or update.

def main():
    """Main function to orchestrate the DBaaS patching process."""
    args = parse_arguments()
    logger.info(f"Starting script with arguments: display-name={args.display_name}, option={args.option}")
    print(f"Starting script with arguments: display-name={args.display_name}, option={args.option}")

    # Get tenancy and region
    tenancy_id, region = get_tenancy_and_region()

    # Initialize OCI clients
    db_client, compute_client, virtual_network_client, identity_client, secrets_client = initialize_oci_clients(region)

    # Get all compartments
    try:
        compartments = oci.pagination.list_call_get_all_results(identity_client.list_compartments, tenancy_id).data
        logger.info(f"Retrieved {len(compartments)} compartments")        
    except oci.exceptions.ServiceError as e:
        logger.error(f"Failed to list compartments: {str(e)}. Exiting program.")
        //handle exception and exit

    # Get DB System by display name
    db_system, compartment_id = get_db_system_by_display_name(db_client, compartments, args.display_name)
    if not db_system:
        //handle exception and exit

    # Get DB Nodes
    db_nodes = get_db_nodes(db_client, compartment_id, db_system.id)
    if not db_nodes:
        //handle exception and exit

    # Retrieve secret for SSH
    secret_id = //handle fetching secrets from vault if required either from arguments or a suitable mechanism 
    private_key_content = get_secret_content(secrets_client, secret_id)
    if not private_key_content:
        //handle exception and exit

    # Process each DB Node
    for node in db_nodes:
        node_ip = get_node_ip(virtual_network_client, node.vnic_id)
        if not node_ip:
            //handle exception and exit

        # Initialize SSH client example using any suitable library based on your use case
        ssh_client = paramiko.SSHClient()
        ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        try:
            logger.info(f"Connecting to {node_ip} as <user>")
            private_key_file = StringIO(private_key_content)
            private_key = paramiko.RSAKey.from_private_key(private_key_file)
            ssh_client.connect(node_ip, username=user, pkey=private_key)
            logger.info(f"Connected to {node_ip}")            
        except Exception as e:
            //handle exception and exit

        # Check DCS agent status and attempt to restart if down
        //handle agent check if required
        ...
		
		# Determine storage type to check if ASM is used if required
        is_asm = identify_storage_type(ssh_client,command)

        # Perform precheck or update
        if args.option == "precheck":
            if not os_update_precheck(ssh_client, node_ip, is_asm):
                //handle exception and exit
            logger.info(f"OS update precheck completed successfully on {node_ip}")            
        elif args.option == "update":
            if not os_update_precheck(ssh_client, node_ip, is_asm):
                //handle exception and exit
            if not os_update(ssh_client, node_ip, is_asm, secrets_client, secret_id):
                //handle exception and exit
            logger.info(f"OS update completed successfully on {node_ip}")            

        ssh_client.close()

def os_update(ssh_client, node_ip, is_asm, secrets_client, secret_id):
    # Pre-patching checks
    if is_asm:
        # Check grid user permissions
        output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>, sudo_user=user)
        if error:
            //handle exception and exit

        # Check CRS status
        output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>)
        if error or output != "<expected outcome>":
            //handle exception and exit
        logger.info(f"CRS is online on {node_ip}")        

        # Check DB processes
        output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>)
        if error or int(output) <= <expected outcome>:
            //handle exception and exit
        logger.info(f"DB services are up on {node_ip} with {output} processes")
        
    else:
        # Check DB processes
        output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>, sudo_user=user)
        if error or int(output) <= <expected outcome>:
            //handle exception and exit
        logger.info(f"DB services are up on {node_ip} with {output} processes")

        # Check alert log for startup
        output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>, sudo_user=user)
        if error or int(output) <= <expected outcome>:
            //handle exception and exit
        logger.info(f"Database startup confirmed in alert log on {node_ip}")

    # Kernel control check
    output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>, sudo_user=user)
    if error:
        //handle exception and exit
    kernel = output
    if "<kernel version 1>" in kernel:
        repo_file = "<version suitable repo>"
    elif "<kernel version 2>" in kernel:
        logger.warning(f"Node {node_ip} is running a version, which is end of life. Skipping OS patching.")
        return False
    else:
        repo_file = "<version suitable repo>"


    # Start OS patching
    output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>)
    if error:
        //handle exception and exit
    if not output:
        logger.error(f"No output from dbcli update server on {node_ip}, cannot proceed. Exiting program.")
        //handle exception and exit

    logger.info(f"dbcli update output: {output}")
    
    try:
        job_data = json.loads(output)
        job_id = job_data.get('jobId')
        if not job_id:
            logger.error(f"No jobId found in dbcli update server output on {node_ip}. Exiting program.")
            //handle exception and exit
        logger.info(f"Update Job ID: {job_id}")
        
    except json.JSONDecodeError:
        //handle exception and exit

    # Monitor job status every 5 minutes for up to 3 hours
    start_time = time.time()
    timeout = 10800  # 3 hours in seconds
    polling_interval = 300  # 5 minutes in seconds
    while True:
        output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>)
        if error:
            logger.error(f"Failed to check job status for {job_id} on {node_ip}: {error}. Exiting program.")
            //handle exception and exit
        if not output:
            logger.error(f"No output from dbcli describe job for {job_id} on {node_ip}. Exiting program.")
            //handle exception and exit
        logger.info(f"Job {job_id} status output: {output}")
        
        try:
            job_data = json.loads(output)
            status = job_data.get('status')
            if not status:
                logger.error(f"No status found in dbcli describe-job output for {job_id} on {node_ip}. Exiting program.")
                //handle exception and exit
            logger.info(f"Job {job_id} status: {status}")
            if status == "Success":
                logger.info(f"OS patching job {job_id} completed successfully on {node_ip}")                
                break
            elif status == "Failure":
                logger.error(f"OS patching job {job_id} failed on {node_ip}. Exiting program.")
                //handle exception and exit
            elif status in ["Running", "InProgress", "In_Progress"]:
                elapsed = time.time() - start_time
                if elapsed > timeout:
                    logger.error(f"OS patching job {job_id} timed out after 3 hours on {node_ip}. Exiting program.")
                    //handle exception and exit
                logger.info(f"Job {job_id} still {status}, checking again in 5 minutes")                
                time.sleep(polling_interval)
            else:
                logger.error(f"Unexpected job status for {job_id} on {node_ip}: {status}. Exiting program.")
                //handle exception and exit
        except json.JSONDecodeError:
            //handle exception and exit

    # Shutdown CRS/DB before reboot
    if is_asm:
        output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>)
        logger.info(f"Pre-reboot CRS status output (as root): {output}")
        if output == <expected outcome>:
            logger.info(f"CRS is up, shutting down CRS on {node_ip} as root")
            if error:
                //handle exception and exit
            time.sleep(120)
        else:
            logger.info(f"CRS is already down on {node_ip}, proceeding with reboot")
            
    else:
        output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>, sudo_user=user)
        logger.info(f"Pre-reboot database processes output: {output}")
        print(f"Pre-reboot database processes output: {output}")
        if output == <expected outcome>:
            logger.info(f"Database is up, shutting down database on {node_ip}")
            if error:
                //handle exception and exit
            output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>, sudo_user=user) # check trace log
            if output != <expected outcome>:
                logger.error(f"Database shutdown incomplete on {node_ip}, expected 'Shutting down instance' in alert log. Exiting program.")
                //handle exception and exit
            time.sleep(120)
        else:
            logger.info(f"Database is already down on {node_ip}, proceeding with reboot")            

    # Reboot the server
    output, error = execute_ssh_command(ssh_client, command, user, sudo=<yes/no>)
    if error:
        //handle exception and exit
    logger.info(f"Initiated reboot on {node_ip}")    
    time.sleep(120)  # Wait for reboot to initiate

    # Check host status with fresh SSH client
    start_time = time.time()
    timeout = 1440  # 24 minutes in seconds
    new_ssh_client = None
    while True:
        new_ssh_client = paramiko.SSHClient()
        new_ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        try:
            //attempt connecting SSH to ensure its online
        except Exception as e:
            //handle exception and exit
        elapsed = time.time() - start_time
        if elapsed > timeout:
            logger.error(f"Node {node_ip} failed to come online after {timeout} seconds. Exiting program.")
            //handle exception and exit
        logger.info(f"{node_ip} not up yet. Waiting 30 seconds...")
        time.sleep(30)

    # Post-reboot wait and checks with new SSH client
    

    # Perform post-reboot service startup if needed
    ...

    # Perform post-reboot checks if required
    ...

    logger.info(f"OS update completed successfully on {node_ip}")
    new_ssh_client.close()
    return True

For more information about Database Command Line Interface (DBCLI) commands, see the Oracle Database CLI Reference.