mlm_insights.core.data_sources package
Submodules
mlm_insights.core.data_sources.data_source module
- class mlm_insights.core.data_sources.data_source.DataSource(type: str, **kwargs: Any)
- Bases: ABC
- This interface encapsulates the file path and file type. It can also be used to implement special functionality that takes parameters and forms the list of file paths to be read by the readers.
- For example, if the current date is needed to read a folder named with today's date, a data source can be used for that purpose.
- It is an optional component to implement and use. It can be omitted if the customer explicitly passes the file paths or glob expressions to be read by the readers.
- fetch(filename: str, **kwargs: Any) -> Any
  - Fetches the contents of the file from the underlying data source.
  - Parameters
    - filename:
      - The canonical file path for which the client has to fetch the raw content.
    - kwargs:
      - Extra keyword arguments.
  - Returns
    - Any:
      - The raw content of the file, in the format accepted by the underlying engine's read method. By default, returns the file path.
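The date-based use case described above could be sketched roughly as follows. Because this page does not list the abstract methods of `DataSource`, the sketch does not subclass the real interface. `SketchDataSource`, its `locations()` method, `TodayFolderDataSource` and the `today` parameter are all hypothetical names introduced for illustration only.

```python
from abc import ABC, abstractmethod
from datetime import date
from typing import Any, List, Optional


class SketchDataSource(ABC):
    """Stand-in for DataSource; this is NOT the library class."""

    def __init__(self, type: str, **kwargs: Any) -> None:
        self.type = type

    @abstractmethod
    def locations(self) -> List[str]:
        """Hypothetical hook: return the paths a reader should read."""

    def fetch(self, filename: str, **kwargs: Any) -> Any:
        # Mirrors the documented default: hand back the file path itself.
        return filename


class TodayFolderDataSource(SketchDataSource):
    """Hypothetical data source pointing at a folder named after a date."""

    def __init__(self, base_location: str, file_type: str,
                 today: Optional[date] = None, **kwargs: Any) -> None:
        super().__init__(type="today_folder", **kwargs)
        self.base_location = base_location.rstrip("/")
        self.file_type = file_type
        # The date can be injected so that the result is reproducible.
        self.today = today or date.today()

    def locations(self) -> List[str]:
        folder = self.today.isoformat()  # YYYY-MM-DD
        return [f"{self.base_location}/{folder}/*.{self.file_type}"]


ds = TodayFolderDataSource("data/input", "csv", today=date(2023, 3, 19))
print(ds.locations())
```

Injecting `today` rather than always calling `date.today()` is a design choice of the sketch: it keeps the computed location deterministic in tests.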
mlm_insights.core.data_sources.local_date_prefix_data_source module
- class mlm_insights.core.data_sources.local_date_prefix_data_source.LocalDatePrefixDataSource(base_location: str, file_type: str, offset: int = -1, date_range: Dict[Any, Any] = {}, **kwargs: Any)
- Bases: DataSource
- This class implements the out-of-the-box (OOB) data source for retrieving file locations based on a user-provided offset or date range. This set of locations is passed to the reader for reading.
- The user needs to provide only one of the two, either offset or date range, to calculate the folder location. If both are provided, the date range takes precedence over the offset.
- Configuration
  - base_location: str
    - The prefix to the folder location.
  - file_type: str
    - File format of the input data files, e.g. csv, jsonl.
  - date_range: Dict[str, str]
    - The range of dates to be used for folder locations, with 'start' and 'end' as keys, e.g. {'start': '2023-03-18', 'end': '2023-03-19'}
    - Either date range or offset must be provided by the user.
  - offset: int, default=-1
    - Number of days before the current date, used to calculate the date to pick up data from. For example, offset=1 for yesterday and offset=2 for two days ago.
- Returns
  - List[str]:
    - List of file locations.
- Example code

    # For using date_range
    data = {
        "file_type": "csv",
        "date_range": {"start": "2023-03-18", "end": "2023-03-19"}
    }
    ds = LocalDatePrefixDataSource(base_location, **data)
    csv_reader = CSVDaskDataReader(data_source=ds)
    # Returns 2 data locations:
    # ['<base_location>/2023-03-18/*.csv', '<base_location>/2023-03-19/*.csv']
    actual_df = csv_reader.read(None)  # Reads from the data locations

    # For using offset
    data = {
        "file_type": "csv",
        "offset": 1
    }
    ds = LocalDatePrefixDataSource(base_location, **data)
    csv_reader = CSVDaskDataReader(data_source=ds)
    # Returns 1 data location, given today's date is 2023-03-19:
    # ['<base_location>/2023-03-18/*.csv']
    actual_df = csv_reader.read(None)  # Reads from the data locations
- mlm_insights.core.data_sources.local_date_prefix_data_source.validate(base_location: str, offset: int, date_range: Dict[Any, Any], file_type: str) -> None
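To make the offset and date-range rules above concrete, here is a minimal standard-library sketch of how such locations could be computed. It is not the library's implementation. The helper name `date_prefix_locations` and the `today` parameter are hypothetical. The sketch also assumes that a negative offset means "not provided" and that offset=0 would mean today; the inclusive end date follows the documented example.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def date_prefix_locations(base_location: str, file_type: str, offset: int = -1,
                          date_range: Optional[Dict[str, str]] = None,
                          today: Optional[date] = None) -> List[str]:
    """Hypothetical helper: build one glob pattern per date folder."""
    today = today or date.today()
    if date_range:
        # The date range takes precedence over the offset when both are given.
        start = date.fromisoformat(date_range["start"])
        end = date.fromisoformat(date_range["end"])
        if end < start:
            raise ValueError("'end' must not be earlier than 'start'")
        # Both 'start' and 'end' are included.
        days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    elif offset >= 0:
        # Assumption: a negative offset (the default) means "not provided".
        days = [today - timedelta(days=offset)]
    else:
        raise ValueError("provide either offset or date_range")
    base = base_location.rstrip("/")
    return [f"{base}/{d.isoformat()}/*.{file_type}" for d in days]


by_range = date_prefix_locations(
    "data", "csv", date_range={"start": "2023-03-18", "end": "2023-03-19"})
by_offset = date_prefix_locations("data", "csv", offset=1, today=date(2023, 3, 19))
print(by_range)
print(by_offset)
```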
mlm_insights.core.data_sources.local_file_data_source module
- class mlm_insights.core.data_sources.local_file_data_source.LocalFileDataSource(file_path: List[str] | str = '', **kwargs: Any)
- Bases: DataSource
- This class implements the out-of-the-box (OOB) data source for retrieving file locations based on a single file path string, a list of file path strings, or a glob string.
- Configuration
  - file_path: Union[List[str], str]
    - A file path string, a list of file path strings, or a glob string.
- Returns
  - List[str]:
    - List of files present at the file path on the local file system.
- Example code

    ds = LocalFileDataSource(file_path='location/csv/*.csv')
    csv_reader = CSVDaskDataReader(data_source=ds)
    # Data source will return a list of csv files within the folder location/csv/
    actual_df = csv_reader.read(None)  # Reads all the files returned by the LocalFileDataSource
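As a rough illustration of what resolving a path string, a list of strings, or a glob string into a list of local files can look like, the sketch below uses Python's standard `glob` module. The helper name `expand_local_paths` is hypothetical, and the library may resolve paths differently (for example, it may hand glob patterns to the reader unexpanded).

```python
import glob
import os
import tempfile
from typing import List, Union


def expand_local_paths(file_path: Union[List[str], str]) -> List[str]:
    """Hypothetical helper: expand paths or glob patterns into file paths."""
    patterns = [file_path] if isinstance(file_path, str) else list(file_path)
    matches: List[str] = []
    for pattern in patterns:
        # glob.glob returns an empty list when nothing matches the pattern.
        matches.extend(sorted(glob.glob(pattern)))
    return matches


# Create a throwaway folder with two csv files and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.csv", "b.csv", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = expand_local_paths(os.path.join(tmp, "*.csv"))
    print([os.path.basename(p) for p in found])
```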
mlm_insights.core.data_sources.oci_date_prefix_data_source module
- class mlm_insights.core.data_sources.oci_date_prefix_data_source.OCIDatePrefixDataSource(bucket_name: str, namespace: str, file_type: str, object_prefix: str, offset: int = -1, date_range: Dict[Any, Any] = {}, storage_options: Dict[str, Any] = {}, **kwargs: Any)
- Bases: DataSource
- This class implements the out-of-the-box (OOB) data source for retrieving file locations from OCI Object Storage based on a user-provided offset or date range. This set of locations is passed to the reader for reading.
- The user needs to provide only one of the two, either offset or date range, to calculate the folder location. If both are provided, the date range takes precedence over the offset.
- Configuration
  - bucket_name: str
    - Name of the bucket.
  - namespace: str
    - OCI namespace of the bucket location.
  - object_prefix: str
    - Folder path of the data relative to the bucket location. Cannot be empty.
  - file_type: str
    - File format of the input data files, e.g. csv, jsonl.
  - date_range: Dict[str, str]
    - The range of dates to be used for folder locations, with 'start' and 'end' as keys, e.g. {'start': '2023-03-18', 'end': '2023-03-19'}
    - Either date range or offset must be provided by the user.
  - offset: int, default=-1
    - Number of days before the current date, used to calculate the date to pick up data from. For example, offset=1 for yesterday and offset=2 for two days ago.
  - storage_options: Dict[str, Any]
    - Authentication options passed to the underlying ocifs client.
- Returns
  - List[str]:
    - List of OCI Object Storage file locations.
- Example code

    # For using date_range
    data = {
        "bucket_name": "mlm",
        "namespace": "mlm",
        "object_prefix": "mlm",
        "file_type": "csv",
        "date_range": {"start": "2023-03-18", "end": "2023-03-19"}
    }
    ds = OCIDatePrefixDataSource(**data)
    csv_reader = CSVDaskDataReader(data_source=ds)
    # Returns 2 data locations:
    # ['oci://mlm@mlm/mlm/2023-03-18/*.csv', 'oci://mlm@mlm/mlm/2023-03-19/*.csv']
    actual_df = csv_reader.read(None)  # Reads from the data locations

    # For using offset
    data = {
        "bucket_name": "mlm",
        "namespace": "mlm",
        "object_prefix": "mlm",
        "file_type": "csv",
        "offset": 1
    }
    ds = OCIDatePrefixDataSource(**data)
    csv_reader = CSVDaskDataReader(data_source=ds)
    # Returns 1 data location, given today's date is 2023-03-19:
    # ['oci://mlm@mlm/mlm/2023-03-18/*.csv']
    actual_df = csv_reader.read(None)  # Reads from the data locations
- mlm_insights.core.data_sources.oci_date_prefix_data_source.validate(bucket_name: str, namespace: str, object_prefix: str, offset: int, date_range: Dict[Any, Any], file_type: str) -> None
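The behaviour of `validate` is not described on this page, so the following is only a sketch of the kind of checks the configuration notes above suggest, together with the `oci://<bucket>@<namespace>/<prefix>/<date>/*.<file_type>` location format seen in the example. Both function names are hypothetical. The non-empty checks on `bucket_name`, `namespace` and `file_type` are assumptions; only the `object_prefix` rule is stated in the documentation.

```python
from typing import Any, Dict, List


def check_oci_date_prefix_args(bucket_name: str, namespace: str, object_prefix: str,
                               offset: int, date_range: Dict[Any, Any],
                               file_type: str) -> None:
    """Hypothetical checks; the library's validate() may behave differently."""
    required = (("bucket_name", bucket_name), ("namespace", namespace),
                ("object_prefix", object_prefix), ("file_type", file_type))
    for name, value in required:
        if not value:
            raise ValueError(f"{name} cannot be empty")
    # Assumption: a negative offset means that no offset was provided.
    if not date_range and offset < 0:
        raise ValueError("provide either offset or date_range")
    if date_range and not {"start", "end"} <= set(date_range):
        raise ValueError("date_range needs 'start' and 'end' keys")


def oci_locations(bucket_name: str, namespace: str, object_prefix: str,
                  file_type: str, days: List[str]) -> List[str]:
    """Format one oci:// glob pattern per date string (YYYY-MM-DD)."""
    prefix = object_prefix.strip("/")
    return [f"oci://{bucket_name}@{namespace}/{prefix}/{day}/*.{file_type}"
            for day in days]


date_range = {"start": "2023-03-18", "end": "2023-03-19"}
check_oci_date_prefix_args("mlm", "mlm", "mlm", -1, date_range, "csv")
locations = oci_locations("mlm", "mlm", "mlm", "csv", ["2023-03-18", "2023-03-19"])
print(locations)
```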
mlm_insights.core.data_sources.oci_object_storage_data_source module
- class mlm_insights.core.data_sources.oci_object_storage_data_source.OCIObjectStorageDataSource(file_path: List[str] | str = '', storage_options: Dict[str, Any] = {}, **kwargs: Any)
- Bases: DataSource
- This class implements the out-of-the-box (OOB) data source for retrieving file locations based on an OCI file path string, a list of OCI file path strings, or a glob string.
- Configuration
  - file_path: Union[List[str], str]
    - An OCI file path string, a list of OCI file path strings, or a glob string.
- Returns
  - List[str]:
    - List of files present at the file path in OCI Object Storage.
- Example code

    ds = OCIObjectStorageDataSource(file_path='oci://location/csv/*.csv')
    csv_reader = CSVDaskDataReader(data_source=ds)
    # Data source will return a list of csv files within the OCI Object Storage location: oci://location/csv/
    actual_df = csv_reader.read(None)  # Reads all the files returned by the OCIObjectStorageDataSource
- mlm_insights.core.data_sources.oci_object_storage_data_source.validate(file_path: List[str] | str) -> None
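Since `file_path` accepts either a single string or a list of strings, a caller-side normalisation step might look like the sketch below. What the module's `validate` actually checks is not documented here. The helper name `normalise_oci_paths`, the rejection of empty paths, and the `oci://` prefix check are assumptions made for illustration.

```python
from typing import List, Union


def normalise_oci_paths(file_path: Union[List[str], str]) -> List[str]:
    """Hypothetical helper: return file_path as a list of oci:// strings."""
    paths = [file_path] if isinstance(file_path, str) else list(file_path)
    # Assumption: an empty string or an empty list is treated as an error.
    if not paths or not all(paths):
        raise ValueError("file_path cannot be empty")
    # Assumption: every path must use the oci:// scheme shown in the example.
    invalid = [p for p in paths if not p.startswith("oci://")]
    if invalid:
        raise ValueError(f"not OCI Object Storage paths: {invalid}")
    return paths


single = normalise_oci_paths("oci://location/csv/*.csv")
several = normalise_oci_paths(["oci://location/csv/a.csv", "oci://location/csv/b.csv"])
print(single)
print(several)
```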