Data Source Component

The Data Source component is responsible for interacting with a specific data source and returning a list of locations to be read. For example, if Insights needs to fetch data from OCI Object Storage, an ObjectStorageFileSearchDataSource is used, which returns a list of objects in a specific bucket.

| Data Source Type | Description |
| --- | --- |
| ObjectStorageFileSearchDataSource | Returns a list of file locations from OCI Object Storage based on an OCI file path string (or list of strings) and user-provided filter arguments. |
| LocalDatePrefixDataSource | Returns a list of file locations based on a user-provided offset or date range. |
| LocalFileDataSource | Returns a list of file locations based on a simple file path string, a list of strings, or a glob string. |
| OCIDatePrefixDataSource | Returns a list of file locations from OCI Object Storage based on a user-provided offset or date range. |
| OCIObjectStorageDataSource | Returns a list of file locations based on an OCI file path string, a list of OCI file path strings, or a glob string. |

ObjectStorageFileSearchDataSource

This data source returns file locations from OCI Object Storage based on an OCI file path string (or list of strings) and user-provided filter arguments. The filter options narrow down the file locations by file path prefix, file path suffix, last-modified date, a date in the file path string, or folder names containing a given string.

Following are the allowed filters:

  1. date_range filter: Get the list of files that were last modified within the given date range. Users must provide start and end dates in 'yyyy-mm-dd' format.

    filter_args = {
        "date_range": {
            "start": "<valid date string>",
            "end": "<valid date string>"
        }
    }
    
  2. prefix filter: Get the list of files whose names start with the provided prefix string. The bucket name and namespace must be provided before the prefix.

    filter_args = {
        "prefix": "<bucket_name>@<namespace>/<object_prefix>/"
    }
    
  3. suffix filter: Get the list of files whose names end with the provided suffix string.

    filter_args = {
        "suffix": "<suffix string>"
    }
    
  4. file_extension filter: Get the list of files with a given file extension, e.g. json, csv, or txt.

    filter_args = {
        "file_extension": "<file extension name>"
    }
    
  5. contains filter: Get the list of files whose names match the Python regular expression provided by the user, for example input_data.

    filter_args = {
        "contains": "<python regular expression>"
    }
    
  6. last_n_days filter: Get the list of files modified in the last N days (where N is a whole number).

    filter_args = {
        "last_n_days": "<whole number>"
    }
    
  7. partition_based_date_range filter: Get the list of files based on folder names, where a date is part of the object folder name. Users must provide the date_format for start and end. The default date_format is '.\d{4}-\d{2}-\d{2}.'

    filter_args = {
        "partition_based_date_range": {
            "start": "<valid date in the given date format>",
            "end": "<valid date in the given date format>",
            "date_format": ".\d{4}-\d{2}-\d{2}."
        }
    }
    

Users can pass any number of file locations to the Data Source component.

Note

  1. All filter arguments are optional. If no arguments are passed, the data source returns all file paths fetched from the Object Storage file locations.

  2. If an incorrect bucket location is passed, or the bucket is empty, an empty file list is returned.

  3. If the date_range argument is provided in the filter arguments, both start and end are mandatory; otherwise a DataSourceArgumentException is raised.

  4. If the provided filter arguments are invalid, a DataSourceException is raised.

  5. If no files match the date filters, the data source returns an empty file list.

  6. UTC time is used for all date-time computations and read operations in Object Storage.
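The filter semantics above can be sketched in plain Python. The `apply_filters` helper below is hypothetical and not part of the library; it only illustrates how the prefix, suffix, file_extension, contains, and date_range filters narrow a candidate file list, using made-up object names.

```python
import re
from datetime import date

def apply_filters(files, filter_args):
    """Illustrative only: files is a list of (path, last_modified_date) pairs."""
    result = []
    for path, modified in files:
        # Keep only paths that satisfy every provided filter.
        if "prefix" in filter_args and not path.startswith(filter_args["prefix"]):
            continue
        if "suffix" in filter_args and not path.endswith(filter_args["suffix"]):
            continue
        if "file_extension" in filter_args and not path.endswith("." + filter_args["file_extension"]):
            continue
        if "contains" in filter_args and not re.search(filter_args["contains"], path):
            continue
        if "date_range" in filter_args:
            start = date.fromisoformat(filter_args["date_range"]["start"])
            end = date.fromisoformat(filter_args["date_range"]["end"])
            if not (start <= modified <= end):
                continue
        result.append(path)
    return result

files = [
    ("bucket@ns/folder1/iris_input.csv", date(2024, 4, 12)),
    ("bucket@ns/folder1/iris_input.json", date(2024, 4, 12)),
    ("bucket@ns/folder2/other.csv", date(2024, 3, 1)),
]
matched = apply_filters(files, {"file_extension": "csv",
                                "date_range": {"start": "2024-04-10", "end": "2024-04-20"}})
# matched == ["bucket@ns/folder1/iris_input.csv"]
```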

How to use

To use a data source, first familiarise yourself with the available out-of-the-box components. Then follow these steps to construct one:

  1. Import the right class that you want to use
    from mlm_insights.core.data_sources import ObjectStorageFileSearchDataSource
    
  2. Create a new data source object of the class by passing the right parameters to the constructor

    filter_args = [
        {"file_extension": "csv"},
        {"prefix": "ml-insights-bucket@oci_namespace/folder1/"},
        {"suffix": "input.csv"},
        {"date_range": {
            "start": "2024-04-10",
            "end": "2024-04-20"
        }},
        {"contains": ".input.+iris_dataset."},
        {"last_n_days": 10},
        {"partition_based_date_range": {
            "start": "2024-04-20",
            "end": "2024-04-23",
            "date_format": ".\d{4}-\d{2}-\d{2}."
        }}
    ]

    base_locations = ['oci://%s@%s/%s' % (bucket_name, namespace, object_prefix_1),
                      'oci://%s@%s/%s' % (bucket_name, namespace, object_prefix_2)]

    dataSourceObject = ObjectStorageFileSearchDataSource(file_path=base_locations, storage_options=storage_options, filter_arg=filter_args)
    
  3. Pass the created data source object to a supported data reader

    csv_reader = CSVDaskDataReader(data_source=dataSourceObject)
    
  4. Pass the created data reader to the correct builder api

    InsightsBuilder().with_reader(reader=csv_reader)
    

That’s it! Once you have added the other components you want to use, build the Builder as shown in the Builder Object page to get the runner and run it.

LocalDatePrefixDataSource

This data source returns file locations based on a user-provided offset or date range. This set of locations is passed to the data reader.
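A date-range expansion of this kind can be sketched with the standard library. The `date_prefixed_locations` function and the path layout below are illustrative assumptions, not the library's internals; they only show how a base location plus a date range can yield one dated location per day.

```python
from datetime import date, timedelta

def date_prefixed_locations(base_location, start, end, file_type="csv"):
    """Hypothetical sketch: one glob-style location per day in [start, end]."""
    d = date.fromisoformat(start)
    last = date.fromisoformat(end)
    locations = []
    while d <= last:
        # Assumes a <base>/<yyyy-mm-dd>/ folder layout, which is only an example.
        locations.append(f"{base_location}/{d.isoformat()}/*.{file_type}")
        d += timedelta(days=1)
    return locations

paths = date_prefixed_locations("/data/input", "2023-03-18", "2023-03-19")
# paths == ["/data/input/2023-03-18/*.csv", "/data/input/2023-03-19/*.csv"]
```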

How to use

To use a data source, first familiarise yourself with the available out-of-the-box components. Then follow these steps to construct one:

  1. Import the right class that you want to use
    from mlm_insights.core.data_sources import LocalDatePrefixDataSource
    
  2. Create a new data source object of the class by passing the right parameters to the constructor

    data = {
        "file_type": "csv",
        "date_range": {"start": "2023-03-18", "end": "2023-03-19"}
    }
    dataSourceObject = LocalDatePrefixDataSource(base_location, **data)
    
  3. Pass the created data source object to a supported data reader

    csv_reader = CSVDaskDataReader(data_source=dataSourceObject)
    
  4. Pass the created data reader to the correct builder api

    InsightsBuilder().with_reader(reader=csv_reader)
    

That’s it! Once you have added the other components you want to use, build the Builder as shown in the Builder Object page to get the runner and run it.

LocalFileDataSource

This data source returns file locations based on a simple file path string, a list of strings, or a glob string.
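Glob expansion of local paths can be demonstrated with the standard library's `glob` module; the snippet below is a standalone illustration of what a glob string like `location/csv/*.csv` resolves to, not a view into LocalFileDataSource's implementation.

```python
import glob
import os
import tempfile

# Create a throwaway directory with a mix of file types, then expand a glob.
with tempfile.TemporaryDirectory() as root:
    for name in ("a.csv", "b.csv", "notes.txt"):
        open(os.path.join(root, name), "w").close()
    # "*.csv" matches only the CSV files, mirroring a glob-string file path.
    csv_files = sorted(glob.glob(os.path.join(root, "*.csv")))
    names = [os.path.basename(p) for p in csv_files]
# names == ["a.csv", "b.csv"]
```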

How to use

To use a data source, first familiarise yourself with the available out-of-the-box components. Then follow these steps to construct one:

  1. Import the right class that you want to use
    from mlm_insights.core.data_sources import LocalFileDataSource
    
  2. Create a new data source object of the class by passing the right parameter to the constructor
    dataSourceObject = LocalFileDataSource(file_path='location/csv/*.csv')
    
  3. Pass the created data source object to a supported data reader

    csv_reader = CSVDaskDataReader(data_source=dataSourceObject)
    
  4. Pass the created data reader to the correct builder api

    InsightsBuilder().with_reader(reader=csv_reader)
    

That’s it! Once you have added the other components you want to use, build the Builder as shown in the Builder Object page to get the runner and run it.

OCIDatePrefixDataSource

This data source returns file locations from OCI Object Storage based on a user-provided offset or date range. This set of locations is passed to the reader for reading.
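The constructor arguments shown in the steps below (bucket_name, namespace, object_prefix, date_range) compose into `oci://` locations. The helper here is a hypothetical sketch of that composition, assuming a per-day folder layout; it is not the library's actual logic.

```python
from datetime import date, timedelta

def oci_date_locations(bucket_name, namespace, object_prefix, date_range, file_type="csv"):
    """Illustrative: oci://<bucket>@<namespace>/<prefix>/<yyyy-mm-dd>/*.<ext> per day."""
    d = date.fromisoformat(date_range["start"])
    last = date.fromisoformat(date_range["end"])
    out = []
    while d <= last:
        out.append(f"oci://{bucket_name}@{namespace}/{object_prefix}/{d.isoformat()}/*.{file_type}")
        d += timedelta(days=1)
    return out

locations = oci_date_locations("mlm", "mlm", "mlm",
                               {"start": "2023-03-18", "end": "2023-03-19"})
# locations[0] == "oci://mlm@mlm/mlm/2023-03-18/*.csv"
```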

How to use

To use a data source, first familiarise yourself with the available out-of-the-box components. Then follow these steps to construct one:

  1. Import the right class that you want to use
    from mlm_insights.core.data_sources import OCIDatePrefixDataSource
    
  2. Create a new data source object of the class by passing the right parameter to the constructor
    data = {
        "bucket_name": "mlm",
        "namespace": "mlm",
        "object_prefix": "mlm",
        "file_type": "csv",
        "date_range": {"start": "2023-03-18", "end": "2023-03-19"}
    }
    dataSourceObject = OCIDatePrefixDataSource(**data)
    
  3. Pass the created data source object to a supported data reader

    csv_reader = CSVDaskDataReader(data_source=dataSourceObject)
    
  4. Pass the created data reader to the correct builder api

    InsightsBuilder().with_reader(reader=csv_reader)
    

That’s it! Once you have added the other components you want to use, build the Builder as shown in the Builder Object page to get the runner and run it.

OCIObjectStorageDataSource

This data source returns file locations based on an OCI file path string, a list of OCI file path strings, or a glob string.
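Glob matching against object names can be illustrated with `fnmatch` from the standard library; this is a standalone sketch of how a glob such as `csv/*.csv` selects from listed object keys, with made-up object names, not the data source's actual listing code.

```python
import fnmatch

# Hypothetical object keys as they might be listed from a bucket.
objects = [
    "csv/train.csv",
    "csv/test.csv",
    "json/train.json",
]

# fnmatch applies shell-style glob rules: "*" matches any run of characters.
matched = [o for o in objects if fnmatch.fnmatch(o, "csv/*.csv")]
# matched == ["csv/train.csv", "csv/test.csv"]
```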

How to use

To use a data source, first familiarise yourself with the available out-of-the-box components. Then follow these steps to construct one:

  1. Import the right class that you want to use
    from mlm_insights.core.data_sources import OCIObjectStorageDataSource
    
  2. Create a new data source object of the class by passing the right parameter to the constructor
    dataSourceObject = OCIObjectStorageDataSource(file_path='oci://location/csv/*.csv')
    
  3. Pass the created data source object to a supported data reader

    csv_reader = CSVDaskDataReader(data_source=dataSourceObject)
    
  4. Pass the created data reader to the correct builder api

    InsightsBuilder().with_reader(reader=csv_reader)
    

That’s it! Once you have added the other components you want to use, build the Builder as shown in the Builder Object page to get the runner and run it.

Permissions Best Practices

Use policies to limit access to Object Storage.

A policy specifies who can access Oracle Cloud Infrastructure resources and how. For more information, see How Policies Work.

Assign each group the least privileges required to perform its responsibilities. Each policy has a verb that describes the actions the group is allowed to perform. From the least amount of access to the most, the available verbs are: inspect, read, use, and manage.

Assign least privileged access to resource types in object-family (buckets and objects). For example, the inspect verb lets you check to see if a bucket exists (HeadBucket) and list the buckets in a compartment (ListBucket). The manage verb gives all permissions on the resource.
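For example, a least-privilege policy for a group that only reads input data could look like the following (the group and compartment names are placeholders, not values from this document):

```text
Allow group DataScienceReaders to read objects in compartment ml-insights-compartment
Allow group DataScienceReaders to inspect buckets in compartment ml-insights-compartment
```

This grants object reads and bucket listing without any write or delete access.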

We recommend that you give DELETE permissions to a minimum set of IAM users and groups. This practice minimizes loss of data from inadvertent deletes by authorized users or from malicious actors. Only give DELETE permissions to tenancy and compartment administrators.

In addition to creating IAM policies, lock down access to Object Storage using features like pre-authenticated requests.

Security Zones

Using Security Zones ensures your resources in Object Storage comply with security best practices.

A security zone is associated with one or more compartments and a security zone recipe. When you create and update resources in a security zone’s compartment, Oracle Cloud Infrastructure validates these operations against the list of security zone policies in the recipe. If any policy in the recipe is violated, then the operation is denied. The following security zone policies are available for resources in Object Storage.

  - deny public_buckets
  - deny buckets_without_vault_key

For a list of all security zone policies, see Security Zone Policies.

To move existing resources to a compartment that is in a security zone, the resources must comply with all security zone policies in the zone’s recipe. Similarly, resources in a security zone can’t be moved to a compartment outside of the security zone because it might be less secure. See Managing Security Zones - https://docs.oracle.com/en-us/iaas/security-zone/using/managing-security-zones.htm.