Using OCI HDFS Connector

The OCI Hadoop Distributed File System (HDFS) connector lets your Apache Hadoop application read and write data to and from Object Storage.

Important

This topic applies to Big Data Service versions earlier than 3.0.4. If the version is 3.0.4 or later, use the Object Storage API Key Integration to connect to Object Storage. The Big Data Service version is displayed on the Cluster Information tab of the Cluster details page.

General Tuning Guidelines

Follow these general tuning guidelines when running Big Data Service workloads against the OCI Object Storage file system to achieve effective performance and reliability for medium and large workloads.

Read Tuning

By default, the InputStream created by a Hadoop application is a basic wrapper around the InputStream generated by the Object Storage Service (OSS) client and lacks any performance optimizations.

To enhance read speed, use the following parameters associated with read caching.

Read-Ahead Tuning Configurations

  • fs.oci.io.read.ahead=true
  • fs.oci.io.read.ahead.blocksize=6291456
  • fs.oci.io.read.ahead.blockcount=4
  • fs.oci.rename.operation.numthreads=2
Note

You can adjust the block size, thread counts, or enable or disable caching based on workload patterns and performance benchmarks. These settings are particularly useful for workloads such as distcp and Trino, where the application reads entire files sequentially.
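
These properties can also be set per job rather than cluster-wide. The following spark-shell sketch applies the read-ahead settings above through the Hadoop configuration; the values mirror the examples in the list, and the oci:// path is a placeholder.

  // spark-shell sketch: apply the read-ahead settings to this job only.
  // Hadoop caches FileSystem instances, so set these before the first oci:// access.
  sc.hadoopConfiguration.setBoolean("fs.oci.io.read.ahead", true)
  sc.hadoopConfiguration.setInt("fs.oci.io.read.ahead.blocksize", 6291456) // 6 MB per read-ahead block
  sc.hadoopConfiguration.setInt("fs.oci.io.read.ahead.blockcount", 4)
  sc.hadoopConfiguration.setInt("fs.oci.rename.operation.numthreads", 2)
  // Sequential reads (for example, scanning whole files) now benefit from read-ahead.
  val lines = sc.textFile("oci://<bucket-name>@<namespace>/")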

Parquet Metadata Caching

  • fs.oci.caching.object.parquet.enabled=true: This parameter activates the caching of Parquet footer metadata in RAM.
  • fs.oci.caching.object.parquet.spec: This parameter allows for fine-tuning the caching behavior, though its usage is optional and depends on specific workload requirements.

By enabling Parquet footer caching, you can significantly improve the read performance of Parquet files, especially when the same files are accessed multiple times or when the metadata is relatively static. This caching mechanism keeps critical metadata readily available in memory, reducing the need to repeatedly read it from the Parquet files themselves.

Parquet metadata caching is most effective when the block count is 1, as it allows for the entire metadata to be cached and accessed quickly. When the block count is greater than 1, the effectiveness of metadata caching depends on the workload and the application's reading patterns. Workloads such as distcp, which read entire files, can benefit from higher parallelism and the blockcount setting, while applications such as Spark, which read specific columns from a Parquet file, might have their own optimization strategies that make these settings less relevant.
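
As a minimal sketch, footer caching can be enabled the same way from spark-shell; only the enabled flag is set here, and the Parquet path is a hypothetical placeholder.

  // spark-shell sketch: cache Parquet footer metadata in RAM for repeated reads.
  sc.hadoopConfiguration.setBoolean("fs.oci.caching.object.parquet.enabled", true)
  // fs.oci.caching.object.parquet.spec can optionally fine-tune the cache behavior.
  val df = spark.read.parquet("oci://<bucket-name>@<namespace>/<parquet-path>/")
  // Re-reading the same files now avoids repeated footer fetches from Object Storage.
  df.select(df.columns.head).show()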

Write Tuning

The HDFS connector, by default, generates a local temporary file on disk to buffer data written to an HDFS file. See the following HDFS connector configurations:

  • fs.oci.io.write.multipart.inmemory: This configuration enables the "upload-as-you-go" process. By setting this to true, the connector starts uploading data parts to the Object Storage Service as soon as the written size reaches the predefined part size, without waiting for the entire file to be buffered on disk.
  • fs.oci.client.multipart.numthreads: This configuration specifies the number of threads to be used for multipart uploading. We recommend basing the number of threads on the expected number of chunks for a typical file. A general guideline is to set it to 2 times the number of vCPUs per executor. This configuration ensures parallel processing of data parts during the upload.
  • fs.oci.client.multipart.partsize.mb: This configuration defines the size of each data part in megabytes for multipart uploading. We recommend using a chunk size nearest to 64 MB that evenly divides the file size into equal chunks. For example, if the file size is 640 MB, a part size of 64 MB would result in 10 equal chunks.
  • fs.oci.client.multipart.minobjectsize.mb: This configuration specifies the minimum object size in megabytes for which multipart uploading is enabled. Objects smaller than the specified size are uploaded using a standard PutObject request instead of multipart uploading. This can be useful for optimizing performance for smaller files.
  • fs.oci.client.md5.numthreads: This configuration sets the number of threads for calculating MD5 hash values asynchronously. The MD5 hash is used to verify data integrity during transmission. By establishing a dedicated thread pool, the calculation of MD5 hashes can be performed concurrently with the data part uploading process, improving overall efficiency.
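
The sketch below wires these write-side properties together for a hypothetical executor with 4 vCPUs writing files of roughly 640 MB; the values are illustrative starting points, not prescriptions.

  // spark-shell sketch: in-memory multipart upload tuned for ~640 MB files
  // on executors with 4 vCPUs (hypothetical sizing).
  val hc = sc.hadoopConfiguration
  hc.setBoolean("fs.oci.io.write.multipart.inmemory", true) // upload parts as they fill; no disk buffering
  hc.setInt("fs.oci.client.multipart.numthreads", 8)        // 2 * 4 vCPUs per executor
  hc.setInt("fs.oci.client.multipart.partsize.mb", 64)      // 640 MB / 64 MB = 10 equal parts
  hc.setInt("fs.oci.client.multipart.minobjectsize.mb", 64) // below this, a plain PutObject is used
  hc.setInt("fs.oci.client.md5.numthreads", 2)              // compute MD5 hashes asynchronously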

Some tunings are based on the file size. In general, you want to write files as 32 to 128 MB chunks through multipart uploading (which is enabled by default). For explicit tuning, for a file size of X MB, use the multipart chunk size nearest to 64 MB that most evenly divides X into equal chunks. For example, if X is 640 MB, a 64 MB part size yields ten equal chunks. If X is 641 MB, a 64 MB part size yields ten 64 MB chunks plus one 1 MB chunk, so a 65 MB part size is better. As a default starting point, set the part size N in the preceding tunings to 64 (for 64 MB).

Then, the number of threads (M) should match the number of chunks in a typical file; in general, this can be 2 * vCPUs per executor. The transfer requires N * M megabytes of heap, so N and M might need to be adjusted to fit in a smaller heap.
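
To make the arithmetic concrete, here is a small hypothetical helper that follows the guidance above: it aims for parts of about 64 MB, rounds the part size so the parts stay equal, and estimates the heap the transfer needs as N * M megabytes. The 4-vCPU executor is an assumption.

  // Hypothetical helper: choose the part count that ~64 MB parts would give,
  // then round the part size up so the parts divide the file evenly.
  def pickPartSizeMb(fileSizeMb: Int): Int = {
    val parts = math.max(1, math.round(fileSizeMb / 64.0).toInt)
    math.ceil(fileSizeMb.toDouble / parts).toInt
  }

  val n = pickPartSizeMb(641) // 65 MB: ten nearly equal parts instead of a 1 MB tail
  val m = 2 * 4               // M threads = 2 * vCPUs per executor (4 vCPUs assumed)
  val heapMb = n * m          // ~520 MB of heap needed for in-memory multipart buffers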

Examples of Using the HDFS Connector from Big Data Service Clusters

You must have the required IAM policies in place to access the Object Storage buckets and other resources.

  • Hadoop
    hadoop fs -ls oci://<bucket-name>@<namespace>/
  • Spark
import org.apache.spark._
// Load the objects under the bucket root as text through the connector.
val test_prefix = sc.textFile("oci://<bucket-name>@<namespace>/")
// toDF() relies on spark.implicits._, which spark-shell imports automatically.
test_prefix.toDF().show()
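  • Spark (write)
    A hypothetical write sketch; the DataFrame contents and output path are placeholders:
    // Write back to Object Storage through the connector.
    // toDF(...) relies on spark.implicits._, imported automatically in spark-shell.
    val out = Seq(("a", 1), ("b", 2)).toDF("key", "value")
    out.write.mode("overwrite").parquet("oci://<bucket-name>@<namespace>/<output-path>/")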

References: URI Formats for HDFS Connectors