Create a Cache Snapshot

You can create a cache snapshot of a Dataset from this page.
To create a cache snapshot, follow these steps:
  1. Navigate to the Dataset Summary page.
  2. Click next to the corresponding Dataset and select View.
    The Overview page is displayed.
  3. Click the header and select Create a cache snapshot.
  4. Enter the Snapshot Name (alphanumeric characters, underscores, and hyphens; maximum 30 characters) and click Create.
    This screen also allows you to cache a snapshot of the current state of the data. Cached snapshots can be accessed later in Model Pipelines without recompilation or re-reading from the data sources.
    To use the dataset in a model pipeline or data pipeline, the actual data is fetched using the Cache option. For example, to take the data from the dataset on the As of Date, create the data frame and provide the name to cache. Only Cache pulls the data from the dataset.
    Caching speeds up processing when you have millions of records and want to reuse intermediate data. For example, if you have 1 million records but need only 10,000 of them, sample those 10,000 entries once and cache the result. This increases the speed of processing and validation.
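The sampling idea above can be sketched in a few lines. This is a minimal illustration in plain Python with simulated record data; in the real workflow you would sample from the dataframe fetched from the dataset:

```python
import random

# Simulate a dataset of 1 million records (in practice this would be
# the dataframe fetched from the dataset).
records = range(1_000_000)

# Sample 10,000 entries once; downstream processing and validation
# then run against this small sample instead of the full dataset.
random.seed(42)  # fixed seed so the sample is reproducible
sample = random.sample(records, k=10_000)

print(len(sample))  # 10000
```

Caching the sampled result once, rather than re-reading and re-filtering the full dataset in every pipeline run, is what makes the intermediate data cheap to reuse.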
    • Once the metadata is created, the original data can be cached. A snapshot of the actual data in the dataset at the current time can be stored and referenced by a tag name.
    • The caching location on the datastudio server is $DS_HOME/work/ftpshare/mmg/workspace_name/dscode/tag. The dataset is saved as a parquet file named dscode_tag.parquet.
    • When executing any API from a notebook, the workspace must be attached.
    • Caching can be performed in two ways:
      • From the UI, immediately after saving the metadata.
      • From the APIs: fetch (create) a new snapshot/dataframe of the dataset using the 'Fetch New Snapshot of dataset' API, and then cache it manually using the 'Caching Data Frame' API.
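The cache location convention described above can be sketched as follows. This is a minimal illustration using only the standard library; workspace_name, dscode, and tag are placeholder values, and $DS_HOME is assumed to be set in the environment (a fallback path is used here only for the sketch):

```python
import os

# Placeholder values for illustration; the real values come from the
# workspace, dataset code, and snapshot tag used in the UI or API.
ds_home = os.environ.get("DS_HOME", "/opt/datastudio")
workspace_name = "sandbox_ws"
dscode = "DS001"
tag = "baseline"

# $DS_HOME/work/ftpshare/mmg/workspace_name/dscode/tag/dscode_tag.parquet
snapshot_path = os.path.join(
    ds_home, "work", "ftpshare", "mmg",
    workspace_name, dscode, tag, f"{dscode}_{tag}.parquet",
)

print(snapshot_path)
```

Because the tag is part of both the directory and the file name, each snapshot of a dataset is stored independently and can be looked up later by its tag.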
    The following table provides information on errors and troubleshooting procedures in case of dataset failure.

    Table 8-6 Errors and troubleshooting procedures

    Error: ModuleNotFoundError: No module named '_bz2'
    Troubleshooting procedure: Install the package 'libbz2-dev' before building Python.

    Error: ModuleNotFoundError: No module named '_sqlite3'
    Troubleshooting procedure: Install the package 'libsqlite3-dev' before building Python.

    Error: Python-env-health-check fails if the pandas version is less than 1.4.1.
    Troubleshooting procedure: Use pandas version 1.4.1 or higher. NOTE: The modin dataframe library is supported only from pandas version 1.4.1 onward. You can switch between modin[dask] and pandas as the underlying dataframe library if pandas 1.4.1 is installed.

    Error: "Not a valid file" error while profiling Hive Data sources.
    Troubleshooting procedure: Copy the required files kbank.keytab, krb5.conf, and hive-jdbc-driver.jar into the $DS_HOME/conf folder of datastudio.
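The pandas version requirement behind the health check can be expressed as a simple comparison. The following is an illustrative sketch in plain Python, not the product's actual health-check code; the helper names are made up:

```python
def version_tuple(version: str) -> tuple:
    """Parse a dotted version string such as '1.4.1' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def pandas_supports_modin(pandas_version: str) -> bool:
    """modin is supported only from pandas version 1.4.1 and higher."""
    return version_tuple(pandas_version) >= (1, 4, 1)

print(pandas_supports_modin("1.3.5"))  # False: health check would fail
print(pandas_supports_modin("1.4.1"))  # True: modin[dask] can be used
```

Tuple comparison handles multi-digit components correctly (for example, "1.10.0" compares higher than "1.4.1"), which a plain string comparison would get wrong.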