Use OKE to Improve Data Locality for Cassandra and Spark Activity

Introduction

Apache Cassandra is a distributed, masterless database where each node owns token ranges. Apache Spark is a distributed compute engine that can use the Spark–Cassandra connector to read from Cassandra replicas. In Kubernetes, pods are scheduled without knowledge of where data lives, so data locality isn’t guaranteed.

This tutorial shows how OKE can improve locality with Kubernetes primitives: StatefulSets (stable identity for Cassandra), node labels, and affinity/anti-affinity to co-locate Spark executors with Cassandra pods—so reads are served from the same node (ideal) or, worst case, one hop to the co-located replica.
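The stack's actual manifests are not shown here, but as a minimal sketch of the co-location technique, a pod affinity rule that pulls a Spark executor onto a node already running a Cassandra pod could look like the following (the pod name, image, and label values are illustrative assumptions, not the stack's real manifests):

```shell
# Hypothetical sketch: require this pod to land on a node that already
# hosts a Cassandra pod. The "namespaces" field is needed because the
# Cassandra pods live in a different namespace than the Spark pods.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: spark-executor-example
  namespace: spark
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: cassandra
          namespaces: ["k8ssandra-operator"]
          topologyKey: kubernetes.io/hostname
  containers:
    - name: executor
      image: apache/spark:3.5.1
      command: ["sleep", "3600"]
EOF
```

The `topologyKey: kubernetes.io/hostname` is what makes this node-level co-location; a broader key (for example, a zone label) would only guarantee same-zone placement.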

Objectives

Prerequisites

Task 1: Deploy the Stack

  1. Click below to open the stack in the OCI Console:

    Deploy to Oracle Cloud

  2. Follow the guided flow to review the stack details, configure the variables, and apply the stack.

  3. After the stack completes, you will find the bastion's public IP in the output section.

    Stack output

Task 2: Connect to the Bastion and Verify the Deployment

The initial infrastructure provisioning completes within about 15 minutes, but the full setup (via cloud-init on the bastion) takes about 20 more minutes to install Helm, deploy Cassandra and Spark, and run the read job.

  1. To monitor the process, SSH into the bastion:

    ssh -i <path-to-private-key> opc@<bastion_public_ip>

  2. Run the following command to monitor the progress of the cloud-init script.

    tail -f /var/log/oke-automation.log

  3. The setup is complete when you see the three seed Cassandra values being read, followed by the Cloud-init complete message.

    Cloud-init complete
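If you prefer not to watch the log interactively, a small polling loop can block until the completion message appears (a sketch; the log path and marker text are taken from the step above):

```shell
# Poll a log file until a marker line appears, instead of tailing
# it interactively.
wait_for_marker() {
  local log="$1" marker="$2"
  until grep -q "$marker" "$log" 2>/dev/null; do
    sleep 5
  done
}
# On the bastion:
# wait_for_marker /var/log/oke-automation.log 'Cloud-init complete'
```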

Note: At this point, the cloud-init script has installed Helm, deployed Cassandra and Spark, and run the Spark read job.

  1. From the bastion VM, confirm the cluster's worker nodes:

    kubectl get nodes

  2. Confirm locality labels. Expect two nodes with spark-locality=true and data-locality=enabled.

    kubectl get nodes --show-labels | grep -E 'spark-locality|data-locality'

  3. Verify Cassandra placement:

    kubectl -n k8ssandra-operator get pods -l app.kubernetes.io/name=cassandra -o wide

  4. Verify Spark placement:

    kubectl -n spark get pods -o wide

  5. Check the Spark read Job logs. You should see the three records from testks.users and a successful run.

    kubectl -n spark logs job/spark-read-cassandra --tail=20
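Rather than eyeballing the NODE column, you can list the node names directly and compare; a small sketch using the same namespaces and label as the steps above:

```shell
# Print the nodes hosting Cassandra pods and Spark pods; overlapping
# node names indicate co-location.
kubectl -n k8ssandra-operator get pods -l app.kubernetes.io/name=cassandra \
  -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u
kubectl -n spark get pods \
  -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u
```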

Tip: Matching NODE values across Cassandra and Spark pods confirms co-location and ideal conditions for locality. For more conclusive Flow Log results, insert additional rows into testks.users using cqlsh. Larger datasets will generate more read traffic, making locality vs. non-locality effects easier to observe.
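For example, rows can be inserted from the bastion by exec-ing into a Cassandra pod with cqlsh (a sketch; the column names are placeholders for the table's real schema, and your deployment may require authentication flags):

```shell
# Pick a Cassandra pod, inspect the schema, then insert rows.
# <col1>/<col2> are placeholders for the actual columns of testks.users.
POD=$(kubectl -n k8ssandra-operator get pods -l app.kubernetes.io/name=cassandra \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n k8ssandra-operator exec "$POD" -- cqlsh -e "DESCRIBE TABLE testks.users"
kubectl -n k8ssandra-operator exec "$POD" -- cqlsh -e \
  "INSERT INTO testks.users (<col1>, <col2>) VALUES (..., ...)"
```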

The following is example output for the commands above:

nodes check

Task 3: Observe Network Effects with VCN Flow Logs

Use VCN Flow Logs to understand where Cassandra traffic flows during Spark reads. The current automation uses Flannel (VXLAN), which affects what Flow Logs can see.

What changes with the CNI

  1. Enable Flow Logs on the worker subnet.

    In the OCI Console, enable Flow Logs for the OKE worker subnet. Re-run (or wait for) the Spark read Job to generate traffic.

  2. Query Flow Logs (choose the path that matches your cluster)

If using this automation (Flannel/VXLAN): Use an advanced query similar to:

   search "<your-flow-log-OCID>"
   | where data.protocolName = 'UDP'
   | where data.destinationPort = <vxlan-port>

Replace <your-flow-log-OCID> with the actual Flow Log resource OCID and <vxlan-port> with the port used by your overlay (in this lab: 14789; see the image below).

UDP traffic
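If you need to confirm the overlay port on your own cluster, one option is to read it from Flannel's backend configuration (a sketch assuming Flannel's default ConfigMap name and namespace; older installs place it in kube-system instead):

```shell
# Print Flannel's backend config; the VXLAN "Port" field (if set) is the
# UDP port to use in the Flow Logs query. ConfigMap name and namespace
# are Flannel defaults and may differ in your cluster.
kubectl -n kube-flannel get configmap kube-flannel-cfg \
  -o jsonpath='{.data.net-conf\.json}'
```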

If your cluster uses NPN:
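With VCN-native pod networking, pods receive IP addresses from the VCN itself, so pod-to-pod traffic appears in Flow Logs without an overlay. A sketch query (placeholders retained; the field names are assumed to follow the same schema as the UDP query above):

```
search "<your-flow-log-OCID>"
| where data.sourceAddress = '<spark-executor-pod-ip>'
| where data.destinationAddress = '<cassandra-pod-ip>'
```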

Note: Flow Logs can take a few minutes to ingest new entries.

Key Considerations


Acknowledgments

More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.