Enable Lazy Container Image Loading in Oracle Cloud Infrastructure Kubernetes Engine (OKE) Using Stargz Store

Introduction

Lazy container image loading (sometimes called lazy pulling or on-demand image loading) is a technique that allows a container to start running before the full image is downloaded. Only the parts of the image needed at runtime are fetched, which reduces startup time, especially for large images.

Normally, when you run a container:

  1. The container runtime (e.g., CRI-O, containerd) downloads the entire image from the registry.
  2. It unpacks all image layers.
  3. Only after all layers are downloaded and unpacked does the container start.

With lazy loading:

  1. The container starts immediately, using a virtual filesystem (the image is not locally present yet).
  2. When the application accesses a file, the runtime fetches only the required content from the remote registry on demand.
  3. Additional files are downloaded only when actually used.
  4. The full container image is pulled in the background.

Container images need to be repacked to support lazy loading. Using the eStargz format, the image is repacked into small, independently decompressible chunks. This allows the container runtime to fetch only the necessary chunks when the container starts.

eStargz images contain a special file called a TOC (Table of Contents), which records metadata (e.g., name, file type, owners, offset) of all file entries in the eStargz layer, except the TOC itself. Container runtimes can use the TOC to mount the container’s filesystem without downloading the entire layer contents.
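
To make the role of the TOC concrete, here is a toy sketch in shell. It is not the real eStargz layout: the blob file, the tab-separated TOC file, and the helper names add_entry and read_entry are all hypothetical. It only illustrates the idea that an index of names, offsets, and sizes lets a reader fetch one file's bytes without reading the whole blob.

```shell
#!/bin/sh
# Toy illustration only: NOT the real eStargz format.
set -eu

workdir="$(mktemp -d)"
blob="${workdir}/layer.blob"   # stands in for a layer stored in a registry
toc="${workdir}/toc.tsv"       # stands in for the TOC: name, offset, size
: > "${blob}"
: > "${toc}"

# Append one file's content to the blob and record where it landed.
add_entry() {
  # $1 = file name, $2 = file content
  offset="$(wc -c < "${blob}" | tr -d '[:space:]')"
  printf '%s' "$2" >> "${blob}"
  end="$(wc -c < "${blob}" | tr -d '[:space:]')"
  printf '%s\t%s\t%s\n' "$1" "${offset}" "$((end - offset))" >> "${toc}"
}

# Look the name up in the TOC, then read only that byte range from the blob.
read_entry() {
  # $1 = file name
  entry="$(awk -F'\t' -v n="$1" '$1 == n { print $2, $3 }' "${toc}")"
  entry_offset="${entry% *}"
  entry_size="${entry#* }"
  dd if="${blob}" bs=1 skip="${entry_offset}" count="${entry_size}" 2>/dev/null
}

add_entry "etc/hostname" "node-a"
add_entry "app/config.yaml" "replicas: 1"
add_entry "app/model.bin" "0123456789"

# Reads 11 bytes starting at offset 6; the other entries are never touched.
read_entry "app/config.yaml"
echo
```

In a real lazy-pulling setup the byte-range read would be an HTTP range request against the registry rather than a local dd call, and each chunk would also need to be decompressed.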

In this tutorial, you will learn how to:

  1. Repack an existing container image to use the eStargz format
  2. Configure Oracle Cloud Infrastructure Kubernetes Engine (OKE) CRI-O to use the Stargz Store plugin
  3. Run a sample workload to demonstrate the container startup time improvements

Prerequisites

Setup

Rebuild the container image using nerdctl to use the eStargz format

  1. Create an image repository in OCI Registry (OCIR). In this tutorial, the tenancy-namespace is idxzjcdxqj and we will create a repo named vllm/vllm-openai.

    OCIR Repository

    You can follow these instructions to authenticate to OCIR and create a new repository: Pushing Images Using the Docker CLI.

    Make sure to replace iad.ocir.io with the OCIR domain of the OCI region you are working on. Here is a list of the region keys.

    docker login iad.ocir.io
  2. Download the latest version of nerdctl on the machine with Docker installed.

    export NERDCTL_VERSION=2.2.2
    wget https://github.com/containerd/nerdctl/releases/download/v${NERDCTL_VERSION}/nerdctl-${NERDCTL_VERSION}-linux-amd64.tar.gz
    tar -xf nerdctl-${NERDCTL_VERSION}-linux-amd64.tar.gz
  3. Authenticate nerdctl to OCIR.

    ./nerdctl login iad.ocir.io
  4. Convert the vllm container image to the eStargz format.

    ./nerdctl image convert --estargz --oci docker.io/vllm/vllm-openai:v0.11.0 iad.ocir.io/idxzjcdxqj/vllm/vllm-openai:v0.11.0-esgz
  5. Push the eStargz vllm container image to OCIR.

    ./nerdctl image push iad.ocir.io/idxzjcdxqj/vllm/vllm-openai:v0.11.0-esgz
  6. For comparison, use Docker to pull the regular image and push it to OCIR.

    docker image pull docker.io/vllm/vllm-openai:v0.11.0
    docker image tag docker.io/vllm/vllm-openai:v0.11.0 iad.ocir.io/idxzjcdxqj/vllm/vllm-openai:v0.11.0-regular
    docker image push iad.ocir.io/idxzjcdxqj/vllm/vllm-openai:v0.11.0-regular
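
The image references in the steps above all follow the same pattern, so they can be derived from a few variables instead of being edited by hand. The following is a minimal sketch under that assumption. The helper name ocir_ref and the variable names are hypothetical, and the script only prints the commands from the steps above so they can be reviewed before being run.

```shell
#!/bin/sh
set -eu

# Build a full OCIR image reference from its parts.
ocir_ref() {
  # $1 = region key, $2 = tenancy namespace, $3 = repository, $4 = tag
  printf '%s.ocir.io/%s/%s:%s\n' "$1" "$2" "$3" "$4"
}

# Defaults mirror the values used in this tutorial; override them for your tenancy.
REGION_KEY="${REGION_KEY:-iad}"
TENANCY_NAMESPACE="${TENANCY_NAMESPACE:-idxzjcdxqj}"
REPO="${REPO:-vllm/vllm-openai}"
TAG="${TAG:-v0.11.0}"

SOURCE_IMAGE="docker.io/${REPO}:${TAG}"
ESGZ_IMAGE="$(ocir_ref "${REGION_KEY}" "${TENANCY_NAMESPACE}" "${REPO}" "${TAG}-esgz")"
REGULAR_IMAGE="$(ocir_ref "${REGION_KEY}" "${TENANCY_NAMESPACE}" "${REPO}" "${TAG}-regular")"

# Print the commands instead of running them (dry run).
cat <<EOF
./nerdctl image convert --estargz --oci ${SOURCE_IMAGE} ${ESGZ_IMAGE}
./nerdctl image push ${ESGZ_IMAGE}
docker image pull ${SOURCE_IMAGE}
docker image tag ${SOURCE_IMAGE} ${REGULAR_IMAGE}
docker image push ${REGULAR_IMAGE}
EOF
```

Printing rather than executing keeps the sketch safe to run anywhere; piping its output to sh on the build machine would execute the same commands as the manual steps.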

Install Stargz Store Plugin on the OKE Nodes

Because OKE uses CRI-O, we need to set up the Stargz Store plugin, an implementation of an additional layer store plugin for CRI-O/Podman. Stargz Store provides remotely mounted eStargz layers to CRI-O/Podman.

  1. SSH into one of the worker nodes and download the latest version from the stargz-snapshotter GitHub page.

    export STARGZ_SNAPSHOTTER_VERSION=v0.18.2
    wget https://github.com/containerd/stargz-snapshotter/releases/download/${STARGZ_SNAPSHOTTER_VERSION}/stargz-snapshotter-${STARGZ_SNAPSHOTTER_VERSION}-linux-amd64.tar.gz
  2. Extract the stargz-store to /usr/local/bin.

    tar -C /usr/local/bin -xvf stargz-snapshotter-${STARGZ_SNAPSHOTTER_VERSION}-linux-amd64.tar.gz stargz-store
  3. Update the /etc/containers/storage.conf file to include additionallayerstores in the [storage.options] section.

    [storage]
    driver = "overlay"
    graphroot = "/var/lib/containers/storage"
    runroot = "/run/containers/storage"
    
    [storage.options]
    additionallayerstores = ["/var/lib/stargz-store/store:ref"]
  4. Ensure fuse is installed and loaded.

    apt-get install fuse
    modprobe fuse
  5. Enable the stargz-store service.

    wget -O /etc/systemd/system/stargz-store.service https://raw.githubusercontent.com/containerd/stargz-snapshotter/main/script/config-cri-o/etc/systemd/system/stargz-store.service
    systemctl daemon-reload
    systemctl restart stargz-store crio
  6. Confirm crio and stargz-store services are running.

    systemctl status stargz-store
    systemctl status crio
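
The node setup above can be sanity-checked with a short script before scheduling any workload. This is a sketch, not an official verification procedure: the function names has_layer_store and check_node are hypothetical, and the checks only cover what the steps above configure (the additionallayerstores entry, the fuse device, and the two services).

```shell
#!/bin/sh
set -eu

# Succeeds if the config file sets additionallayerstores to include the given entry.
has_layer_store() {
  # $1 = path to storage.conf, $2 = layer store entry to look for
  grep -E '^[[:space:]]*additionallayerstores[[:space:]]*=' "$1" | grep -F -q -- "$2"
}

# Run the node-level checks; intended to be called on the worker node itself.
check_node() {
  has_layer_store /etc/containers/storage.conf "/var/lib/stargz-store/store:ref" \
    || { echo "storage.conf: additionallayerstores entry not found"; return 1; }
  [ -e /dev/fuse ] \
    || { echo "/dev/fuse is missing; is the fuse module loaded?"; return 1; }
  for unit in stargz-store crio; do
    systemctl is-active --quiet "${unit}" \
      || { echo "${unit} is not active"; return 1; }
  done
  echo "node looks ready for lazy pulling"
}
```

The script only defines the functions, so it is safe to source anywhere. On the worker node, call check_node after sourcing it; a non-zero return identifies the first step that needs attention.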

Evaluate the performance

  1. Update the nodeName and image in the manifest below and create the test deployments:

    kubectl apply -f - <<EOF
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: test-estargz-cpu
      namespace: default
      labels:
        app: test-estargz
    spec:
      strategy:
        type: Recreate
      replicas: 1
      selector:
        matchLabels:
          app: test-estargz
      template:
        metadata:
          labels:
            app: test-estargz
        spec:
          containers:
            ## Update the image to your own container repository
          - image: iad.ocir.io/idxzjcdxqj/vllm/vllm-openai:v0.11.0-esgz
            name: test
            command: ["/bin/bash", "-c", "echo started && sleep infinity"]
          ## Replace the nodeName with the name of the node where stargz-store is configured.
          nodeName: 10.140.34.52
          tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: test-regular-cpu
      namespace: default
      labels:
        app: test-regular
    spec:
      strategy:
        type: Recreate
      replicas: 1
      selector:
        matchLabels:
          app: test-regular
      template:
        metadata:
          labels:
            app: test-regular
        spec:
          containers:
            ## Update the image to your own container repository
          - image: iad.ocir.io/idxzjcdxqj/vllm/vllm-openai:v0.11.0-regular
            name: test
            command: ["/bin/bash", "-c", "echo started && sleep infinity"]
            env:
            - name: HF_HOME
              value: "/workspace"
          ## Replace the nodeName with the name of the node where stargz-store is configured.
          nodeName: 10.140.34.52
          tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
    EOF
  2. Monitor the container start time.

    kubectl get pods -w
  3. Clean up resources.

    kubectl delete deploy test-regular-cpu
    kubectl delete deploy test-estargz-cpu
  4. For a test deployment using the GPU, update the nodeName and image in this YAML manifest and apply it using the command below:

    kubectl apply -f test-manifest-gpu.yaml
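
Watching kubectl get pods -w gives a coarse reading of the start time. As a complement, the elapsed time can be computed from the timestamps that Kubernetes records on the pod. The sketch below assumes GNU date (for parsing RFC 3339 timestamps with -d) and that the container is already in the Running state; the function names are hypothetical.

```shell
#!/bin/sh
set -eu

# Seconds between two RFC 3339 timestamps (assumes GNU date).
seconds_between() {
  # $1 = earlier timestamp, $2 = later timestamp
  start_s="$(date -u -d "$1" +%s)"
  end_s="$(date -u -d "$2" +%s)"
  echo "$((end_s - start_s))"
}

# Seconds from pod creation to container start for the first pod matching a label.
pod_start_seconds() {
  # $1 = label selector, for example app=test-estargz
  created="$(kubectl get pods -l "$1" \
    -o jsonpath='{.items[0].metadata.creationTimestamp}')"
  started="$(kubectl get pods -l "$1" \
    -o jsonpath='{.items[0].status.containerStatuses[0].state.running.startedAt}')"
  seconds_between "${created}" "${started}"
}
```

After both test deployments are running, pod_start_seconds app=test-estargz and pod_start_seconds app=test-regular print one number each, which makes the two cases easier to compare than the AGE column.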

Results

The following results are based on tests executed on a BM.GPU4.8 shape using a 30B-parameter large language model.

  1. Pod start time is 33s for the eStargz image and 4m for the regular image.

    $ kubectl get pods -w
    NAME                            READY   STATUS              RESTARTS   AGE
    test-estargz-5699988945-2zxmh   0/1     ContainerCreating   0          8s
    test-regular-55fbdf64c8-hn6hg   0/1     ContainerCreating   0          7s
    test-estargz-5699988945-2zxmh   1/1     Running             0          33s
    test-regular-55fbdf64c8-hn6hg   1/1     Running             0          4m
  2. Application starts on the pod using the eStargz image in 1m27s.

    INFO 12-11 07:59:02 [__init__.py:216] Automatically detected platform cuda.
    (APIServer pid=1) INFO 12-11 07:59:26 [api_server.py:1839] vLLM API server version 0.11.0
    (APIServer pid=1) INFO 12-11 07:59:26 [utils.py:233] non-default args: {'model_tag': 'cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit', 'max_model_len': 8192, 'enforce_eager': True}
    ...
    (APIServer pid=1) INFO 12-11 08:00:29 [launcher.py:42] Route: /metrics, Methods: GET
    (APIServer pid=1) INFO:     Started server process [1]
    (APIServer pid=1) INFO:     Waiting for application startup.
    (APIServer pid=1) INFO:     Application startup complete.
  3. Application starts on the pod using the regular image in 32s.

    INFO 12-11 08:01:19 [__init__.py:216] Automatically detected platform cuda.
    (APIServer pid=1) INFO 12-11 08:01:22 [api_server.py:1839] vLLM API server version 0.11.0
    (APIServer pid=1) INFO 12-11 08:01:22 [utils.py:233] non-default args: {'model_tag': 'cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit', 'max_model_len': 8192, 'enforce_eager': True}
    ...
    (APIServer pid=1) INFO 12-11 08:01:51 [launcher.py:42] Route: /metrics, Methods: GET
    (APIServer pid=1) INFO:     Started server process [1]
    (APIServer pid=1) INFO:     Waiting for application startup.
    (APIServer pid=1) INFO:     Application startup complete.
  4. The pod using the eStargz image is ready to serve traffic 1m22s earlier than the regular one (application startup completes at 08:00:29 versus 08:01:51 in the logs above).
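
The figures above can be recomputed from the reported durations and log timestamps. The sketch below does only that arithmetic; the helper names to_seconds and clock_to_seconds are hypothetical, and the input values are copied from the output shown in this section.

```shell
#!/bin/sh
set -eu

# Convert a kubectl-style duration such as 33s, 4m, or 1m27s to seconds.
to_seconds() {
  d="$1"; m=0; s=0
  case "$d" in
    *m*) m="${d%%m*}"; d="${d#*m}" ;;
  esac
  case "$d" in
    *s) s="${d%s}" ;;
  esac
  echo "$((m * 60 + s))"
}

# Convert an HH:MM:SS clock time to seconds since midnight.
clock_to_seconds() {
  hh="${1%%:*}"; rest="${1#*:}"; mm="${rest%%:*}"; ss="${rest#*:}"
  # Strip one leading zero so values like 08 are not parsed as octal.
  echo "$(( ${hh#0} * 3600 + ${mm#0} * 60 + ${ss#0} ))"
}

pod_esgz="$(to_seconds 33s)"
pod_regular="$(to_seconds 4m)"
app_esgz="$(to_seconds 1m27s)"
app_regular="$(to_seconds 32s)"

# Integer percentage, truncated: (240 - 33) * 100 / 240
reduction="$(( (pod_regular - pod_esgz) * 100 / pod_regular ))"
app_penalty="$(( app_esgz - app_regular ))"
ready_gap="$(( $(clock_to_seconds 08:01:51) - $(clock_to_seconds 08:00:29) ))"

echo "pod start reduction: ${reduction}%"           # prints 86%
echo "extra app startup on eStargz: ${app_penalty}s" # prints 55s
echo "eStargz ready earlier by: ${ready_gap}s"       # prints 82s (1m22s)
```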

Conclusions

Lazy image pulling with the eStargz format on OKE significantly improved pod initialization time in this test, reducing container startup from 4 minutes to 33 seconds, a reduction of about 86%. Application initialization in the eStargz container took about 55 seconds longer than with the regular image (1m27s versus 32s), presumably because file contents are fetched on demand during startup, yet the application was still ready 1m22s earlier, so the pod could serve traffic sooner. This trade-off is particularly valuable in production environments where rapid scaling, pod rescheduling, or cold starts are critical, because the reduction in image pull time outweighs the increase in application startup time. For workloads that use large container images, especially AI workloads, lazy pulling with eStargz is a practical way to accelerate deployment without changes to application code or significant infrastructure modifications.

  1. Stargz Snapshotter installation documentation
  2. Experimenting with eStargz image pulling on OpenShift
  3. nerdctl Stargz documentation

Acknowledgments
