Data Reader Component

The Data Reader component is responsible for reading data of a specific format. Hence, a CsvReader is capable of reading CSV files only. Also note that data readers are intentionally kept execution-engine aware. Hence, a CSV reader might come in multiple variations for different execution engines (for example, CsvDaskReader and CsvSparkReader). This allows the component to take advantage of engine-specific APIs and features (for example, dask.delayed).

As discussed, the primary responsibility of the data reader, in simpler terms, is that given a set of location(s) where data resides, it reads the data and provides a columnar representation (dataframe) to the runner. For dataframes that support partitions, the runner will also load the data in parts.
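As an illustration, the contract can be sketched as follows (a minimal sketch in plain pandas; the class and method names here are hypothetical, not the actual ML Insights interface):

    import pandas as pd

    # Hypothetical sketch of the data reader contract: given a set of
    # locations where data resides, return a columnar representation.
    class CsvReaderSketch:
        def __init__(self, locations: list):
            self.locations = locations

        def read(self) -> pd.DataFrame:
            # Read every CSV location and hand the runner one dataframe.
            return pd.concat(pd.read_csv(loc) for loc in self.locations)

An engine-specific variant (a CsvDaskReader, say) would keep the same contract but return that engine’s dataframe type instead.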

Permissions Best Practices

Use policies to limit access to Object Storage.

A policy specifies who can access Oracle Cloud Infrastructure resources and how. For more information, see How Policies Work.

Assign each group the least privileges required to perform its responsibilities. Each policy has a verb that describes what actions the group is allowed to do. From the least amount of access to the most, the available verbs are: inspect, read, use, and manage.

Assign least privileged access to resource types in object-family (buckets and objects). For example, the inspect verb lets you check to see if a bucket exists (HeadBucket) and list the buckets in a compartment (ListBucket). The manage verb gives all permissions on the resource.

We recommend that you give DELETE permissions to a minimum set of IAM users and groups. This practice minimizes loss of data from inadvertent deletes by authorized users or from malicious actors. Only give DELETE permissions to tenancy and compartment administrators.
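For example, policy statements like the following grant broad read-only access while reserving manage (and therefore DELETE) permissions for administrators; the group and compartment names here are placeholders:

    Allow group object-readers to read objects in compartment data-projects
    Allow group storage-admins to manage object-family in compartment data-projects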

In addition to creating IAM policies, lock down access to Object Storage using features like pre-authenticated requests.

Security Zones

Using Security Zones ensures your resources in Object Storage comply with security best practices.

A security zone is associated with one or more compartments and a security zone recipe. When you create and update resources in a security zone’s compartment, Oracle Cloud Infrastructure validates these operations against the list of security zone policies in the recipe. If any policy in the recipe is violated, then the operation is denied. The following security zone policies are available for resources in Object Storage.

  deny public_buckets
  deny buckets_without_vault_key

For a list of all security zone policies, see Security Zone Policies.

To move existing resources to a compartment that is in a security zone, the resources must comply with all security zone policies in the zone’s recipe. Similarly, resources in a security zone can’t be moved to a compartment outside of the security zone because it might be less secure. See Managing Security Zones - https://docs.oracle.com/en-us/iaas/security-zone/using/managing-security-zones.htm.

Native Data Reader

These data readers can read data from various sources and provide a pandas dataframe as output. They are useful when the dataset is small, when you are quickly running experiments in your notebook on a small data sample, or when you have an existing application that uses pandas dataframes and you want to integrate ML Insights. Make sure the infrastructure on which you are running ML Insights is compatible with pandas and that this is the right option; for example, using the native data reader when running Insights in a Spark cluster isn’t the correct option. Similarly, check what engine configuration you have passed to the builder (if any), and whether it matches.

Warning

A pandas dataframe is not suitable for handling large datasets; it is recommended to use Dask or Spark in such scenarios. Also, a pandas dataframe usually takes three to five times more memory than your file size when loaded in memory (for example, a 2 GB file may need 6 to 10 GB of RAM), so make sure enough memory is available.

How to use

To use a data reader, first familiarise yourself with all the available out-of-the-box components. Then follow these steps to construct one:

  1. Import the right class that you want to use
    from mlm_insights.mlm_native.readers import CSVNativeDataReader
    
  2. Create a new object of the class by passing the right parameter to the constructor
    csv_reader = CSVNativeDataReader("<Location to your files>")
    
  3. Pass the created object to the correct builder API
    InsightsBuilder().with_reader(reader=csv_reader)
    

That’s it! Once you have passed in the other components you want to use, build the Builder as shown on the Builder page to get the runner and run it.
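Putting the three steps together, a minimal end-to-end sketch might look like the following (the InsightsBuilder import path, the build() call, and the run() call are assumptions based on the Builder page; schema and other components are omitted):

    # Import path for InsightsBuilder is an assumption.
    from mlm_insights.builder.insights_builder import InsightsBuilder
    from mlm_insights.mlm_native.readers import CSVNativeDataReader

    csv_reader = CSVNativeDataReader("<Location to your files>")

    # build() returning a runner and runner.run() are assumptions
    # based on the Builder page; add your other components first.
    runner = InsightsBuilder().with_reader(reader=csv_reader).build()
    result = runner.run()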

Note

Native out-of-the-box readers read the entire dataset into memory, as pandas dataframes do not support partitioning.

Dask Data Reader

These data readers can read data from various sources and provide a Dask dataframe as output. A Dask dataframe provides multiple advantages over pandas while keeping the same structure, including the ability to read data in partitions, lazy loading, better memory handling, and parallelization, among others. Dask also supports a wide array of use cases, so it can be safely used both locally and in production; it can scale vertically and horizontally and handle datasets far larger than memory. As usual, make sure the proper engine configuration is passed and a compatible infrastructure is used.

Hint

Dask is the execution engine recommended by the ML Monitoring Team when working on single-node infrastructure.

How to use

To use a data reader, first familiarise yourself with all the available out-of-the-box components. Then follow these steps to construct one:

  1. Import the right class that you want to use
    from mlm_insights.mlm_native.readers import CSVDaskDataReader
    
  2. Create a new object of the class by passing the right parameter to the constructor
    csv_reader = CSVDaskDataReader("<Location to your files>")
    
  3. Pass the created object to the correct builder API
    InsightsBuilder().with_reader(reader=csv_reader)
    

That’s it! Once you have passed in the other components you want to use, build the Builder as shown on the Builder page to get the runner and run it.
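The flow mirrors the native reader. As a minimal sketch (the glob pattern for matching multiple files and the InsightsBuilder import path are assumptions; with file-based partitioning, each matched file could become its own partition):

    # Import path for InsightsBuilder is an assumption.
    from mlm_insights.builder.insights_builder import InsightsBuilder
    from mlm_insights.mlm_native.readers import CSVDaskDataReader

    # A glob matching many files is an assumption based on the reader
    # accepting "a set of location(s)".
    csv_reader = CSVDaskDataReader("<Location to your files>/*.csv")
    runner = InsightsBuilder().with_reader(reader=csv_reader).build()
    result = runner.run()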

Hint

Even though the data reader is used the same way for Dask, the data is not loaded entirely into memory (unless the total volume is less than a single partition’s size). Also note that the partition strategy (row-based or file-based) may change based on the reader.
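To see what lazy, partitioned loading looks like in Dask itself (this uses Dask’s public API directly, not the ML Insights reader):

    import dask.dataframe as dd

    # Lazy: no data is read here, only metadata is inspected.
    df = dd.read_csv("<Location to your files>/*.csv", blocksize="64MB")
    print(df.npartitions)  # number of row-based partitions

    # Partitions are only loaded when a result is computed.
    print(df.head())       # reads just the first partition(s)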