Troubleshoot a Stack

Identify common problems in an Oracle WebLogic Server for OKE stack and learn how to diagnose and solve them.

Stack Creation Failed

Troubleshoot a failed Oracle WebLogic Server domain that you created using Oracle WebLogic Server for OKE.

Failed to install WebLogic Operator

Stack provisioning might fail when you create a domain with Oracle WebLogic Server for OKE in a new subnet for an existing VCN, due to an error installing the WebLogic Server Kubernetes Operator.

Example message:
module.provisioner.null_resource.check_provisioning_status_1  (remote-exec):
<Aug 27, 2020 07:01:31 PM GMT> <INFO>  <install_wls_operator.sh>
<(host:sample-admin.admin1.existingnetwork.oraclevcn.com) -  <WLSOKE-VM-INFO-0020> :
Installing weblogic operator in namespace [wrjrf8-operator-ns]>
module.provisioner.null_resource.check_provisioning_status_1  (remote-exec): <Aug 27, 2020
07:02:12 PM GMT> <ERROR>  <install_wls_operator.sh>
<(host:sample-admin.admin1.existingnetwork.oraclevcn.com) -  <WLSOKE-VM-ERROR-0013> : Error
installing weblogic operator. Exit  code[1]>
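
To diagnose the failure, you can inspect the operator installation from the administration host. This is an optional check, assuming kubectl and Helm are configured for the OKE cluster and that the operator namespace matches the one shown in the log (wrjrf8-operator-ns in this example):
kubectl get pods -n wrjrf8-operator-ns
kubectl get events -n wrjrf8-operator-ns --sort-by=.lastTimestamp
helm list -n wrjrf8-operator-ns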

Run a Destroy job on the stack, and then run an Apply job again to recreate the resources using the same database.

Failed to create service account

Stack provisioning might fail with an HTTP 409 conflict error if the service account creation fails.

Example message:
module.provisioner.null_resource.check_provisioning_status_1 (remote-exec):
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":
"Operation cannot be fulfilled on serviceaccounts \"default\": the object has been modified;
please apply your changes to the latest version and try again","reason":"Conflict","details":
{"name":"default","kind":"serviceaccounts"}

,"code":409}

Run a Destroy job on the stack, and then run an Apply job again to recreate the resources using the same database.

Failed to log in to OCIR

Stack provisioning might fail if the Docker login to the OCI Registry (OCIR) is not successful.

Example message:
module.provisioner.null_resource.check_provisioning_status_1 (remote-exec):
<Sep 22, 2020 02:33:46 PM GMT> <ERROR> <docker_init.sh> <(host:sample-admin.admin.existingnetwork.oraclevcn.com)
- <WLSOKE-VM-ERROR-0003> : Unable to login to custom OCIR [phx.ocir.io]>
module.provisioner.null_resource.check_provisioning_status_1 (remote-exec):
<Sep 22, 2020 02:33:46 PM GMT> <ERROR> <docker_init.py> <(host:sample-admin.admin.existingnetwork.oraclevcn.com)
- <WLSOKE-VM-ERROR-0020> : Error executing sh /u01/scripts/bootstrap/docker_init.sh. Exit code [1]>
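
To confirm the registry credentials before recreating the stack, you can test the login manually from the administration host. This is a sketch that assumes the phx.ocir.io endpoint shown in the log; substitute your own region key, tenancy namespace, user name, and auth token:
docker login phx.ocir.io -u <tenancy-namespace>/<username>
Enter your auth token when prompted for the password.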

Run a Destroy job on the stack, and then run an Apply job again to recreate the resources using the same database.

Failed to verify OKE cluster node status

Stack provisioning fails if the OKE cluster worker nodes are inactive when you create the WebLogic domain with Oracle WebLogic Server for OKE.

Example message:
<INFO> <oke_worker_status.py> <(host:sample-admin.nokeadmin.okevcn.oraclevcn.com) - <WLSOKE-VM-INFO-0011> :
Waiting for the workers nodes to be Active. Retrying...>
<Dec 17, 2020 04:47:56 PM GMT> <ERROR> <markers.py> <(host:sample-admin.okeadmin.okevcn.oraclevcn.com)
- <Dec 17, 2020 16:47:56> - <WLS-OKE-ERROR-003> - Failed to verify oke cluster nodes status.
[Exit code : Status check timed out]>
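
To see why the worker nodes did not become active, you can check their status directly, assuming kubectl is configured against the OKE cluster; <node-name> is a placeholder for a node listed by the first command:
kubectl get nodes -o wide
kubectl describe node <node-name>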

Run a Destroy job on the stack, and then run an Apply job again to recreate the resources using the same database.

Nodepools are not Recreated with the Latest Kubernetes Version

Issue: If you upgrade an existing Kubernetes cluster and scale out a nodepool, the new nodes are created with Kubernetes version 1.20 or later.

Note:

This topic applies to instances provisioned prior to release 22.1.2.

Workaround:

  1. Sign in to the Jenkins console for your domain. See Access the Jenkins Console.
  2. On the Dashboard page, click create domain.
  3. Open the pipeline and locate its Groovy pipeline definition.
  4. Search for agent-label-jenkins and replace it with agent-label.
  5. On the Dashboard page, click create wls nodepool, and then repeat step 3 and step 4.
  6. On the Dashboard page, click create base image, and then repeat step 3 and step 4.
  7. Sign in to the Oracle Cloud Infrastructure Console.
  8. Go to the recreated nodepools and make a note of the IP addresses of the nodes that are on Kubernetes version 1.20 or later.
  9. Log in to each node as the opc user.
  10. Run the following commands on each node (a verification check follows this procedure):
    sudo yum install docker-engine-19.03.11.ol-4.el7
    sudo systemctl start docker
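
After completing step 10, you can optionally confirm on each node that the Docker engine is installed and running; this check assumes the same opc login used above:
sudo systemctl status docker --no-pager
sudo docker info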

Load Balancer Creation Failed

After creating a stack, you might encounter an issue where the internal Load Balancer (LB) is missing.

When you run the following command, the internal IP address for the load balancer is displayed as <pending>:
kubectl get svc <domain-name>-internal -n wlsoke-ingress-nginx
Load balancer creation can fail for the following reasons:
  1. Lack of quota for the selected LB shapes.
  2. Lack of available private IPs in the VCN or subnets selected during provisioning.
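
To identify which of these applies, you can inspect the events recorded on the pending service; this check assumes the <domain-name>-internal service name and the wlsoke-ingress-nginx namespace used elsewhere in this topic:
kubectl describe svc <domain-name>-internal -n wlsoke-ingress-nginx
The Events section usually shows the error returned when the load balancer could not be provisioned, such as a quota or private IP message.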

Check the Status of the Load Balancers

You can view the status of the load balancers by checking the load balancer services and the provisioning logs.

Load Balancer Services:

To check the load balancer services, run the following command:
kubectl get svc -n wlsoke-ingress-nginx

If the output lists any of the load balancer services with <pending> under the EXTERNAL-IP column, then those load balancers were not created.

Sample output:
NAME               TYPE           CLUSTER-IP      EXTERNAL-IP       PORT(S)         AGE
okename-internal   LoadBalancer   10.96.185.81    <pending>         443:30618/TCP   11m

Provisioning logs:

If the internal load balancer is not created successfully, the /u01/logs/provisioning.log file includes an error message.

Sample of the error message:
<WLSOKE-VM-INFO-0058> : Installing ingress controller charts for jenkins [ ingress-controller ]>
<WLSOKE-VM-ERROR-0058> : Error installing ingress controller with Helm. Exit code [1]>
In addition, the /u01/logs/provisioning_cmd.out file includes the following error message:
<install_ingress_controller.sh>  -  Error: timed out waiting for the condition
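
The timeout usually indicates that the ingress controller pods never reached a ready state. Before reinstalling, you can check the pod status and the Helm release; this assumes the ingress-controller release name used in the following procedure:
kubectl get pods -n wlsoke-ingress-nginx
helm status ingress-controller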

Reinstall the Load Balancer

After identifying and fixing the cause of the failure, such as increasing the quota for the selected LB shape, you can reinstall the private load balancer in the stack.

  1. Run the following command to delete the existing ingress controller deployment:
    kubectl delete deployment.apps/nginx-ingress-controller -n wlsoke-ingress-nginx
  2. Run the following command to delete the load balancer that has an issue:
    kubectl delete service/<service-prefix>-internal -n wlsoke-ingress-nginx 
  3. Run the following command to remove the existing helm release:
    helm uninstall ingress-controller
  4. Copy the YAML file to the temporary folder:
    cp /u01/provisioning-data/*.yaml /tmp 
  5. Run the following command to install the load balancer:
    /u01/scripts/bootstrap/install_ingress_controller.sh /tmp/ingress-controller-input-values.yaml 
  6. Run the following command to verify that the load balancer services are created and have IP addresses:
    kubectl get svc -n wlsoke-ingress-nginx
    Sample output:
    NAME                   TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
    domain_name-internal   LoadBalancer   10.0.0.1     100.0.0.1     80:30605/TCP   12m
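
If a service still shows <pending> after the reinstall, you can recheck the ingress controller pods and the service events; this assumes the same wlsoke-ingress-nginx namespace:
kubectl get pods -n wlsoke-ingress-nginx
kubectl describe svc <service-prefix>-internal -n wlsoke-ingress-nginx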