ADS Release Notes

January 13, 2021

ADS

A full distribution of this release of ADS is found in the General Machine Learning for CPU and GPU environments. The Classic environments include the previous release of ADS. A distribution of ADS without AutoML and MLX is found in the remaining environments.

  • DatasetFactory can now download files before opening them in memory by using the .download() method.

  • Added support for archive files when creating Data Flow applications and runs.

  • Support was added for loading Avro format data into ADS.

  • Changed model serialization to use ONNX by default when possible on supported models.

  • Added ADSTuner, a framework- and model-agnostic hyperparameter optimizer. See the adstuner.ipynb notebook for examples of how to use this feature.

  • Corrected the up_sample() method in get_recommendations() so that it does not fail when all features are categorical. Up-sampling is possible for datasets containing continuous and categorical features.

  • Resolved issues with serializing ndarray objects into JSON.

  • A table of all of the ADS notebook examples can be found in our service documentation: Oracle Cloud Infrastructure Data Science

  • Changed set_documentation_mode to false by default

  • Added unit-tests related to the dataset helper

  • Fixed _check_object_exists to handle situations where the object storage bucket has more than 1000 objects.

  • Added the overwrite_script option to the create_app() method to allow a user to overwrite a pre-existing file.

  • Added support for newer fsspec versions

  • Added support for the C library Snappy

  • Fixed issue with uploading model provenance data due to inconsistency with OCI interface.

  • Resolved issue with multiple versions of Cryptography being installed when installing fbprophet
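
The fix for buckets with more than 1000 objects, listed above, is a pagination concern: when a listing is returned in bounded pages, an existence check has to follow the continuation token until the listing is exhausted. The sketch below illustrates that loop only. It is not the ADS implementation of _check_object_exists; list_page, PAGE_SIZE, and the token format are hypothetical stand-ins for a real object storage client.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical page size, chosen to match the 1000-object threshold in the note.
PAGE_SIZE = 1000

# A page is (object names, token for the next page or None when finished).
Page = Tuple[List[str], Optional[str]]


def object_exists(name: str, list_page: Callable[[Optional[str]], Page]) -> bool:
    """Return True if `name` appears in any page of the listing."""
    token: Optional[str] = None
    while True:
        names, token = list_page(token)
        if name in names:
            return True
        if token is None:  # no more pages to fetch
            return False


def make_fake_bucket(names: List[str]) -> Callable[[Optional[str]], Page]:
    """Build an in-memory stand-in for a paginated bucket listing."""
    ordered = sorted(names)

    def list_page(token: Optional[str]) -> Page:
        start = int(token) if token is not None else 0
        end = start + PAGE_SIZE
        next_token = str(end) if end < len(ordered) else None
        return ordered[start:end], next_token

    return list_page


bucket = make_fake_bucket([f"obj-{i:05d}" for i in range(2500)])
print(object_exists("obj-02499", bucket))  # located on the third page
print(object_exists("missing", bucket))
```

A check that stops after the first page would wrongly report the first object as missing, because it sits beyond the first 1000 entries.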

AutoML

AutoML is distributed in the General Machine Learning and Data Exploration condapacks.

AutoML is upgraded to AutoML v0.5.4 and the changes include:

  • Support for ONNX. AutoML models can now be serialized using ONNX by calling the to_onnx() API on the AutoML estimator.

  • Pre-processing has been overhauled to use sklearn pipelines to allow serialization using ONNX. Numerical, categorical, and text columns are supported for ONNX serialization. Datetime and time series columns are not supported.

  • Torch-based deep learning models, TorchMLPClassifier and TorchMLPRegressor, have been added.

  • GPU support for XGBoost and torch-based models has been added. This is disabled by default, and can be enabled by passing 'gpu_id': 'auto' in engine_opts in the constructor. ONNX serialization for GPUs has not been tested.

  • Adaptive sampling’s learning curve has been smoothed. This allows adaptive sampling to converge faster on some datasets.

  • Improvements to ranking performance in feature selection were added. Feature selection is now much faster on large datasets.

  • The default execution engine for AutoML has been switched to Dask. You can still use Python multiprocessing by passing engine='local', engine_opts={'n_jobs': -1} to init().

  • GaussianNB has been enabled in the interface by default.

  • The AdaBoostClassifier has been disabled in the pipeline interface by default. The ONNX converter for AdaBoost should not be used.

  • The issue ValueError: Found unknown categories during transform has been fixed.

  • You can manually specify a hyperparameter search space to AutoML through a new parameter added to the pipeline. This allows you to freeze some hyperparameters or to expose further ones for tuning.

  • New API: Refit an AutoML pipeline to another dataset. This is primarily used to handle updated training data, where you train the pipeline once, and refit it on newer data.

  • AutoML no longer closes a user specified Dask cluster.

  • AutoML properly cleans up any existing futures on the Dask cluster at the end of fit.

  • Switched to using Pandas dataframes internally instead of NumPy arrays, avoiding needless conversions.

  • PyTorch is now an optional dependency. If PyTorch is installed, AutoML automatically considers multilayer perceptrons in its search. If PyTorch is not found, deep learning models are ignored.

  • Updated the pipeline interface to include train(), which runs all the pipeline stages but does not do a final fitting of the model (the fit() API should be used if final fit is needed).

  • Updated the pipeline interface to include refit(), which allows you to refit the pipeline to an updated dataset without re-running the full pipeline. This should be used by advanced users only. For best results, we recommend that you re-run the full pipeline when the dataset changes.
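
The optional PyTorch dependency noted in the list above follows a common pattern: probe for the package at runtime and only consider the models that depend on it when the probe succeeds. A minimal sketch of that pattern follows. It is not AutoML's internal code; has_module(), candidate_models(), and the way the model list is assembled are illustrative assumptions, and the model names are simply borrowed from these notes.

```python
import importlib.util
from typing import List

# Illustrative subsets only; the real search space is larger.
BASE_MODELS = ["GaussianNB", "XGBoost", "LightGBM"]
TORCH_MODELS = ["TorchMLPClassifier", "TorchMLPRegressor"]


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def candidate_models(dl_package: str = "torch") -> List[str]:
    """Hypothetical helper: add deep learning models only if the package exists."""
    models = list(BASE_MODELS)
    if has_module(dl_package):
        models.extend(TORCH_MODELS)
    return models


print(candidate_models())
```

Probing with find_spec avoids paying the import cost of a large package just to learn whether it is present.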

MLX

MLX is distributed in the General Machine Learning condapacks.

MLX is upgraded to MLX v1.0.15 and the changes include:

  • Updated the explanation descriptions to use a base64 representation of the static plots. This obviates the need for creating a mlx_static directory.

  • Replaced boolean indexing with integer indexing when slicing Pandas DataFrames. After updating to Pandas >= 1.1.0, boolean indexing caused some issues, which integer indexing addresses.

  • Fixed MLX related import warnings.

  • Corrected an issue with ALE when the target values are strings.

  • Removed the dependency on Paramiko.
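
Embedding a static plot as base64, as the first item in the list above describes, means the image bytes travel inside the explanation text itself instead of living in a separate directory. The following sketch shows the general technique with the standard library. It makes no claim about how MLX builds its descriptions; png_to_img_tag() and the placeholder bytes are assumptions for illustration.

```python
import base64


def png_to_img_tag(png_bytes: bytes, alt: str = "plot") -> str:
    """Return an HTML <img> tag with the PNG embedded as a data URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f'<img alt="{alt}" src="data:image/png;base64,{encoded}">'


# Placeholder bytes standing in for a rendered plot (PNG signature only).
fake_png = b"\x89PNG\r\n\x1a\n"
tag = png_to_img_tag(fake_png)
print(tag)
```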

August 11, 2020

ADS

  • Support was added to use Resource principals as an authentication mechanism for ADS.

  • Support was added to MLX for an additional model explanation diagnostic, Accumulated Local Effects (ALEs).

  • Support was added to MLX for “What-if” scenarios in model explainability.

  • Improvements were made to the correlation heatmap calculations in show_in_notebook().

  • Improvements were made to the model artifact.

Bug Fixes

  • Data Flow applications inherit the compartment assignment of the client. Runs inherit from applications by default. Compartment OCIDs can also be specified independently at the client, application, and run levels.

  • The Data Flow log link for logs pulled from an application loaded into the notebook session is fixed.

  • Progress bars now complete fully (in ADSModel.prepare() and prepare_generic_model()).

  • BaselineModel is now significantly faster and can be opted out of.
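
The compartment inheritance rule in the first fix above (client, then application, then run, each level able to override) can be modelled as a simple fallback chain. The sketch below is a toy model of that rule only. The Client, Application, and Run classes, their fields, and the OCID strings are hypothetical and are not the ADS Data Flow classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Client:
    compartment_id: str


@dataclass
class Application:
    client: Client
    compartment_id: Optional[str] = None  # None means "inherit from the client"

    def effective_compartment(self) -> str:
        return self.compartment_id or self.client.compartment_id


@dataclass
class Run:
    application: Application
    compartment_id: Optional[str] = None  # None means "inherit from the application"

    def effective_compartment(self) -> str:
        return self.compartment_id or self.application.effective_compartment()


client = Client("ocid1.compartment.oc1..client")  # placeholder OCIDs
app = Application(client)
run_default = Run(app)
run_override = Run(app, compartment_id="ocid1.compartment.oc1..run")

print(run_default.effective_compartment())
print(run_override.effective_compartment())
```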

AutoML

No changes.

MLX

MLX is upgraded to MLX v1.0.10 and the changes include:

  • Added support to specify the mlx_static root path (used for ALE summary).

  • Added support for making mlx_static directory hidden (for example, <path>/.mlx_static/).

  • Fixed issue with the boolean features in ALE.

June 9, 2020

ADS

Numerous bug fixes including:

  • Support for Data Flow applications and runs outside of a notebook session compartment. Support for specific object storage logs and script buckets at the application and run levels.

  • ADS detects small shapes and gives warnings for AutoML execution.

  • Removal of triggers in the Oracle Cloud Infrastructure Functions func.yaml file.

  • DatasetFactory.open() incorrectly yielding a classification dataset for a continuous target was fixed.

  • LabelEncoder producing the wrong results for category and object columns was fixed.

  • An untrusted notebook issue when running model explanation visualizations was fixed.

  • A warning about adaptive sampling requiring at least 1000 datapoints was added.

  • A dtype cast from float to integer in DatasetFactory.open("csv") was added.

  • An option to specify the bucket of Data Flow logs when you create the application was added.

AutoML

AutoML is upgraded to 0.4.2 and the changes include:

  • Reduced parallelization on low compute hardware.

  • Support for passing in a custom logger object in automl.init(logger=).

  • Support for datetime columns. AutoML should automatically infer datetime columns based on the Pandas dataframe, and perform feature engineering on them. This can also be forced by using the col_types argument in pipeline.fit(). The supported types are: ['categorical', 'numerical', 'datetime']

MLX

MLX is upgraded to MLX 1.0.7 and the changes include:

  • Updated the feature distributions in the PDP/ICE plots (performance improvement).

  • All distributions are now shown as PMFs. Categorical features show the category frequency and continuous features are computed using a NumPy histogram (with ‘auto’). They are also separate sub-plots, which are interactive.

  • Classification PDP: The y-axis for continuous features is now auto-scaled (not fixed to 0-1).

  • 1-feature PDP/ICE: The x-axis for continuous features now shows the entire feature distribution, whereas the plot may show a subset depending on the partial_range parameter. For example, partial_range=[0.2, 0.8] computes the PDP between the 20th and 80th percentiles: the plot shows the full distribution on the x-axis, but the line charts are only drawn between the specified percentiles.

  • 2-feature PDP: The plot x and y axes are now auto-set to match the partial_range specified by the user. This ensures that the heatmap fills the entire plot by default. However, the entire feature distribution can be viewed by zooming out or clicking Autoscale in plotly.

  • Support for plotting scatter plots using WebGL (show_in_notebook(..., use_webgl=True)) was added.

  • The side issues that were causing the MLX Visualization Omitted warnings in JupyterLab were fixed.
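
The partial_range behavior described in the list above for 1-feature PDP/ICE plots comes down to two percentile cut-offs: the x-axis spans the full feature range, while the PDP grid is restricted to values between the lower and upper percentiles. A rough standard-library sketch follows, assuming linear interpolation between sorted values (MLX may compute percentiles differently); percentile() and pdp_grid() are hypothetical helpers, not MLX functions.

```python
from typing import List, Sequence, Tuple


def percentile(values: Sequence[float], q: float) -> float:
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def pdp_grid(values: Sequence[float],
             partial_range: Tuple[float, float],
             num_points: int = 5) -> List[float]:
    """Return an evenly spaced grid between the two percentile cut-offs."""
    lo = percentile(values, partial_range[0])
    hi = percentile(values, partial_range[1])
    step = (hi - lo) / (num_points - 1)
    return [lo + i * step for i in range(num_points)]


feature = [float(v) for v in range(101)]      # toy feature with values 0..100
axis_limits = (min(feature), max(feature))    # full distribution on the x-axis
grid = pdp_grid(feature, (0.2, 0.8))          # PDP evaluated between the cut-offs

print(axis_limits)
print(grid)
```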

April 30, 2020

Environment Updates

  • The home folder is now backed by block volume. You can now save all your files to the /home/datascience folder and they will persist when you deactivate and activate your sessions. The block_storage folder no longer exists. The Oracle Cloud Infrastructure keys can be saved directly to the ~/.oci folder, and no symbolic links are required.

Note that the ads-examples folder in the home folder is a symbolic link to the /opt/notebooks/ads-examples folder. Any changes made in ads-examples are not saved if you deactivate a notebook.

  • Each new notebook that is launched has a pre-populated accordion-style cell containing useful tips.

Useful Tips Image

The following packages were added:

  • fdk = 0.1.12

  • pandas-datareader = 0.8.1

  • py-cpuinfo = 5.0

ADS

  • ADS integration with the Oracle Cloud Infrastructure Data Flow service provides a more efficient and convenient way to launch a Spark application and run Spark jobs.

  • In show_in_notebook(), “head” has been removed from the accordion and replaced with dataset “warnings”.

  • get_recommendations() is deprecated and replaced with suggest_recommendations(), which returns a pandas dataframe with all the recommendations and suggested code to implement each action.

  • A progress indication of Autonomous Data Warehouse reads has been added.

Notebooks

  • A new notebook is included in the ads-examples folder to demonstrate ADS and DataFlow Integration.

  • A new notebook is included in the ads-examples folder which demonstrates advanced custom scoring functions within AutoML by implementing custom class weights.

  • New version of the notebook example for deployment to Functions and API Gateway: Now using cloud shell.

  • Significant improvements were made to existing ADS Notebooks.

AutoML

AutoML updated to version 0.4.1 from 0.3.1:

  • More consistent handling of stratification and random state.

  • Fixed a bug that caused LightGBM and XGBoost to crash on AMD shapes.

  • Unified proxy models across all stages of the AutoML pipeline, ensuring leaderboard rankings are consistent.

  • Removed the visual option from the interface.

  • The default tuning metric for both binary and multi-class classification has been changed to neg_log_loss.

  • Fixed a bug in AutoML XGBoost where the predicted probabilities were sometimes NaN.

  • Fixed several corner case issues in Hyperparameter Optimization.

MLX

MLX updated to version 1.0.3 from 1.0.0:

  • Added support for specifying the ‘average’ parameter in sklearn metrics by <metric>_<average>, for example F1_avg.

  • Fixed an issue with the detailed scatter plot visualizations and cutoff feature/axis names.

  • Fixed an issue with the balanced sampling in the Global Feature Permutation Importance explainer.

  • Updated the supported scoring metrics in MLX. The PermutationImportance explainer now supports a large number of classification and regression metrics. Also, many of the metric names were changed.

  • Updated LIME and PermutationImportance explainer descriptions.

  • Fixed an issue where sklearn.pipeline wasn’t imported.

  • Fixed deprecated asscalar warnings.

March 18, 2020

ADW access performance has been improved significantly

Major improvements were made to the performance of the ADW dataset loader. Your data is now loaded much faster, depending on your environment.

Change to DatasetFactory.open() with ADW

DatasetFactory.open() with format='sql' no longer requires the index_col to be specified. This was confusing, since “index” means something very different in databases. Additionally, the table parameter may now be either a table or a SQL expression.

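from ads.dataset.factory import DatasetFactory

# connection_string must already hold your ADW connection details
ds = DatasetFactory.open(
    connection_string,
    format='sql',
    table="""
        SELECT *
        FROM sh.times
        WHERE rownum <= 30
    """
)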

No longer automatically starts an H2O cluster

ADS no longer instantiates an H2O cluster on behalf of the user. Instead you need to import h2o on your own and then start your own cluster.

Preloaded Jupyter extensions

JupyterLab now supports these extensions:

  • Bokeh

  • Plotly

  • Vega

  • GeoJSON

  • FASTA

  • Variable Inspector

  • VDOM

Profiling Dask APIs

With support for the Bokeh extension, you can now profile Dask operations and visualize profiler output. For more details, see Dask ResourceProfiler.

You can use the ads.common.analyzer.resource_analyze decorator to visualize the CPU and memory utilization of operations.

During execution, it records the following information for each timestep:

  • Time in seconds since the epoch

  • Memory usage in MB

  • % CPU usage

Example:

from ads.common.analyzer import resource_analyze
from ads.dataset.dataset_browser import DatasetBrowser


@resource_analyze
def fetch_data():
    # Load the sklearn wine dataset and mark "target" as the target column
    sklearn = DatasetBrowser.sklearn()
    wine_ds = sklearn.open('wine').set_target("target")
    return wine_ds


fetch_data()

The output shows two lines, one for total CPU percentage used by all the workers, and one for total memory used.

Dask Upgrade

Dask is updated to version 2.10.1 with support for Oracle Cloud Infrastructure Object Storage. The 2.10.1 version provides better performance than the older version.