Rebooting Worker Nodes

Find out how to reboot a worker node in a Kubernetes cluster that you've created using Kubernetes Engine (OKE).

Note

You can only cycle nodes to reboot worker nodes when using enhanced clusters. See Working with Enhanced Clusters and Basic Clusters.

You can cycle nodes to reboot nodes with both virtual machine shapes and bare metal shapes.

You can cycle nodes to reboot both managed nodes and self-managed nodes.

Sometimes, rebooting a worker node is the best way to resolve an issue with the compute instance hosting the worker node. Rebooting a worker node power cycles the compute instance, which (for example) clears all the rules in the instance's iptables. In the case of bare metal GPU compute instances, rebooting a worker node might resolve issues such as:

  • Lowered job performance or thermal throttling, caused by high GPU memory temperatures.
  • Reports of fewer than the expected number of GPUs.
  • NVLink errors, indicated by the NVIDIA Fabric Manager failing to start, or by NCCL jobs failing to run.

Using Kubernetes Engine, you can:

  • Reboot specific managed nodes.
  • Reboot specific self-managed nodes.

When you cycle and reboot a worker node, Kubernetes Engine automatically cordons and drains the worker node before shutting it down. The compute instance hosting the worker node is then rebooted. The shutdown command that is sent to the compute instance hosting the worker node depends on the number of minutes you specify as the eviction grace period (the length of time to allow to cordon and drain worker nodes):

  • If you specify an eviction grace period of zero minutes, a RESET command is sent to the compute instance. The instance is immediately powered off, and then powered back on.
  • If you specify an eviction grace period greater than zero minutes, a SOFTRESET command is sent to the compute instance. After allowing up to 15 minutes for the operating system to shut down gracefully, the instance is powered off, and then powered back on.

Note that the instance itself is not terminated, and keeps the same OCID and network address.
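The two cases above can be sketched as a short shell snippet. This is a minimal illustration, not part of the product: the grace-period value is arbitrary, and the PT<n>M duration mirrors the ISO-8601 format used by the CLI's --node-eviction-settings parameter later in this topic.

```shell
# Illustrative only: map an eviction grace period (in minutes) to the
# shutdown command sent to the compute instance, and to the ISO-8601
# duration accepted by the CLI's --node-eviction-settings parameter.
GRACE_MINUTES=30
DURATION=$(printf 'PT%dM' "$GRACE_MINUTES")
if [ "$GRACE_MINUTES" -eq 0 ]; then
  COMMAND_SENT="RESET"       # immediate power off, then power on
else
  COMMAND_SENT="SOFTRESET"   # graceful OS shutdown first, then power cycle
fi
echo "$DURATION $COMMAND_SENT"
```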

Note the following considerations when cycling to reboot worker nodes:

  • You must cycle and reboot managed nodes individually. You cannot select a managed node pool and cycle and reboot all the managed nodes within it.
  • You can use the Console, the CLI, or the API to cycle and reboot managed nodes.
  • You must use the CLI or the API to cycle and reboot self-managed nodes. You cannot use the Console to cycle and reboot self-managed nodes.

Cordoning and draining when cycling and rebooting nodes

When you select an individual worker node (either a managed node or a self-managed node), and specify that you want to cycle and reboot that node, you can specify Cordon and drain options. In the case of managed nodes, the Cordon and drain options you specify for a managed node override the Cordon and drain options specified for the node pool.

For more information, see Cordoning and Draining Managed Nodes Before Shut Down or Termination.

Rebooting Worker Nodes

  • To reboot a specific managed node:

    1. Open the navigation menu and select Developer Services. Under Containers & Artifacts, select Kubernetes Clusters (OKE).
    2. Select the compartment that contains the cluster.
    3. On the Clusters page, click the name of the cluster that contains the worker node that you want to reboot.
    4. Under Resources, click Node Pools and then click the name of the node pool that contains the worker node that you want to reboot.
    5. Select Cycle node from the Actions menu beside the node that you want to reboot.

    6. In the Cycle node dialog:
      1. Select Reboot node from the Cycling options list.
      2. Specify when and how to cordon and drain the worker node before performing the reboot action, by specifying:

        • Eviction grace period (mins): The length of time to allow to cordon and drain the worker node before performing the action. Either accept the default (60 minutes) or specify an alternative. For example, you might want to allow 30 minutes to cordon a worker node and drain it of its workloads. To perform the action immediately, without cordoning and draining the worker node, specify 0 minutes.
        • Force action after grace period: Whether to perform the action at the end of the eviction grace period, even if the worker node hasn't been successfully cordoned and drained. By default, this option isn't selected.

        See Cordoning and Draining Managed Nodes Before Shut Down or Termination.

      3. Click Cycle node to start the reboot operation.
    7. Monitor the progress of the operation by viewing the status of the associated work request on the Cluster details page (see Getting Work Request Details).

  • To reboot a specific managed node or self-managed node using the CLI:

    To reboot a specific managed node or self-managed node, use the oci ce cluster reboot-cluster-node command and required parameters:

    oci ce cluster reboot-cluster-node --cluster-id <cluster-ocid> --node-id <instance-ocid> [OPTIONS]

    For example:

    oci ce cluster reboot-cluster-node --cluster-id ocid1.cluster.oc1.iad.aaaaaaaaaf______jrd --node-id ocid1.instance.oc1.iad.anu__flq --node-eviction-settings "{\"evictionGraceDuration\": \"PT0M\",\"isForceActionAfterGraceDuration\": true}"
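    The reboot then runs asynchronously as a work request, which you can poll from the CLI. The sketch below prints the command rather than executing it (it needs a configured OCI CLI), and the work-request OCID is a placeholder of our choosing; substitute the ID returned by the reboot command.

```shell
# Placeholder OCID; substitute the work-request ID that
# reboot-cluster-node returns.
WORK_REQUEST_ID="ocid1.clustersworkrequest.oc1..exampleuniqueid"
# Printed, not executed: running it requires a configured OCI CLI.
echo oci ce work-request get --work-request-id "$WORK_REQUEST_ID"
```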
  • To reboot a specific managed node or self-managed node using the OCI API:

    Run the RebootClusterNode operation to reboot the node.

  • To reboot a managed node or self-managed node using the Kubernetes API:

    Note

    To use the Kubernetes API to reboot a managed node or self-managed node that uses a custom image (rather than a platform image or an OKE image), an IAM policy must provide access to the custom image. If such a policy does not already exist, create a policy with the following policy statement:

    ALLOW any-user to read instance-images in TENANCY where request.principal.type = 'cluster'

    See Policy Configuration for Cluster Creation and Deployment.
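    If the policy does not exist, one way to create it is with the OCI CLI. The sketch below prints (rather than runs) an oci iam policy create command; the tenancy OCID, policy name, and description are placeholder values of our choosing, not from this documentation.

```shell
# Placeholders; substitute your own tenancy OCID and preferred policy name.
TENANCY_ID="ocid1.tenancy.oc1..exampleuniqueid"
STATEMENT="ALLOW any-user to read instance-images in TENANCY where request.principal.type = 'cluster'"
# Printed, not executed: running it requires a configured OCI CLI.
echo oci iam policy create \
  --compartment-id "$TENANCY_ID" \
  --name "oke-custom-image-read" \
  --description "Allow clusters to read custom images" \
  --statements "[\"$STATEMENT\"]"
```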

    1. Create a YAML file to define a NodeOperationRule custom resource, similar to the following:
      apiVersion: oci.oraclecloud.com/v1beta1
      kind: NodeOperationRule
      metadata:
        name: <rule-name>
      spec:
        actions:
          - "reboot"
        nodeSelector:
          matchTriggerLabel:
            oke.oraclecloud.com/node_operation: "<value>"
          matchCustomLabels:
            <custom-key>: "<value>"
        maxParallelism: <n>
        nodeEvictionSettings:
          evictionGracePeriod: <number-of-minutes>
          isForceActionAfterGraceDuration: <true|false>
      

      where:

      • name: <rule-name> specifies a name of your choosing for the NodeOperationRule custom resource. For example, name: my-reboot-rule
      • oke.oraclecloud.com/node_operation: "<value>" specifies a value of your choosing for the oke.oraclecloud.com/node_operation label key. Nodes that you want to reboot must have this label key-value pair attached to them. For example:
            matchTriggerLabel:
              oke.oraclecloud.com/node_operation: "my-reboot-value"

        Note that the value you specify for the oke.oraclecloud.com/node_operation label key must conform to the requirements in the Labels and Selectors topic in the Kubernetes documentation. Only Kubernetes equality-based requirements are supported.

      • matchCustomLabels optionally specifies a custom label with a key-value pair of your choosing, in the format <custom-key>: "<value>", to suit your particular use case. For example:
            matchCustomLabels:
              deployment: "green"

        The custom label key-value pair you specify must conform to the requirements in the Labels and Selectors topic in the Kubernetes documentation. Only Kubernetes equality-based requirements are supported.

        Note that if you do specify a custom label key-value pair in the manifest, then nodes are only rebooted if they have both this custom label and the oke.oraclecloud.com/node_operation: "<value>" label.

      • maxParallelism: <n> specifies the number of worker nodes to reboot in parallel, up to a maximum of 20.
      • evictionGracePeriod: <number-of-minutes> specifies the length of time to allow to cordon and drain worker nodes before rebooting them. Either accept the default (60 minutes) or specify an alternative. For example, you might want to allow 30 minutes to cordon worker nodes and drain them of their workloads. To reboot worker nodes immediately, without cordoning and draining them, specify 0 minutes.
      • isForceActionAfterGraceDuration: <true|false> specifies whether to reboot worker nodes at the end of the eviction grace period, even if they haven't been successfully cordoned and drained. Defaults to false if not specified.

      For example:

      apiVersion: oci.oraclecloud.com/v1beta1
      kind: NodeOperationRule
      metadata:
        name: my-reboot-rule
      spec:
        actions:
          - "reboot"
        nodeSelector:
          matchTriggerLabel:
            oke.oraclecloud.com/node_operation: "my-reboot-value"
          matchCustomLabels:
            deployment: "green"
        maxParallelism: 2
        nodeEvictionSettings:
          evictionGracePeriod: 300
          isForceActionAfterGraceDuration: true
      
    2. Use kubectl to apply the YAML file to the cluster by entering:

      kubectl apply -f <filename>.yaml
    3. Use kubectl to confirm that the NodeOperationRule custom resource has been created successfully by entering:

      kubectl get nor
    4. Use kubectl to add a label to the node that specifies the value for the oke.oraclecloud.com/node_operation label key by entering:

      kubectl label node <node-name> oke.oraclecloud.com/node_operation=<value>

      For example:

      kubectl label node 10.0.10.53 oke.oraclecloud.com/node_operation=my-reboot-value
    5. If you included a matchCustomLabels element in the manifest to specify a custom label key-value pair, use kubectl to add a label to the node that specifies the key-value pair by entering:

      kubectl label node <node-name> <custom-key>=<value>

      For example:

      kubectl label node 10.0.10.53 deployment=green
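      If the rule is meant to match several nodes (up to maxParallelism of them are rebooted in parallel), you can apply both labels to each node in a loop. This is a sketch: the node names and label values are illustrative, and the commands are printed rather than executed so the snippet is self-contained.

```shell
# Illustrative node names and label values; printed, not executed.
for NODE in 10.0.10.53 10.0.10.54; do
  echo kubectl label node "$NODE" \
    oke.oraclecloud.com/node_operation=my-reboot-value \
    deployment=green
done
```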
    6. (optional) You can view the node reboot action in progress by entering:

      kubectl describe nor <rule-name>

      For example:

      kubectl describe nor my-reboot-rule

      Example output:
      Name:         my-reboot-rule
      Namespace:   
      Labels:       <none>
      Annotations:  <none>
      API Version:  oci.oraclecloud.com/v1beta1
      Kind:         NodeOperationRule
      Metadata:
        Creation Timestamp:  2025-02-11T00:11:11Z
        Finalizers:
          nodeoperationrule.oci.oraclecloud.com/finalizers
        Generation:        1
        Resource Version:  244259806
        UID:               xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      Spec:
        Actions:
          reboot
        Max Parallelism:  2
        Node Eviction Settings:
          Eviction Grace Period:                 300
          Is Force Action After Grace Duration:  true
        Node Selector:
          Match Trigger Label:
            oke.oraclecloud.com/node_operation:  my-reboot-value
          Match Custom Label:
            deployment: green
      Status:
        Back Off Nodes:
        Canceled Nodes:
        In Progress Nodes:
          Node Name:        10.0.10.53
          Work Request Id:  ocid1.clustersworkrequest.oc1.phx.aaaa______jda
        Pending Nodes:
        Succeeded Nodes:
      Events:
        Type    Reason                  Age   From               Message
        ----    ------                  ----  ----               -------
        Normal  StartedNodeOperation    1m   NodeOperationRule  Started node operation on node 10.0.10.105 with work request ID: ocid1.clustersworkrequest.oc1.phx.aaaa______jda
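      Once a node appears under Succeeded Nodes, you might want to take it out of the rule's scope so it is not matched again. Whether the trigger label is removed automatically is not stated here, so removing it yourself is a precaution; a trailing "-" after a label key removes that label. The node name below is illustrative, and the command is printed rather than executed.

```shell
# A trailing "-" after the key removes the label.
# Illustrative node name; printed, not executed.
echo kubectl label node 10.0.10.53 oke.oraclecloud.com/node_operation-
```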