{ "cells": [ { "cell_type": "markdown", "id": "81880ec8", "metadata": {}, "source": [ "\n", "

# Building a Recommender using AutoMLx\n", "
by the Oracle AutoMLx Team

\n", "\n", "***" ] }, { "cell_type": "markdown", "id": "ed4786de", "metadata": {}, "source": [ "Recommendation Demo Notebook.\n", "\n", "Copyright © 2025, Oracle and/or its affiliates.\n", "\n", "Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/" ] }, { "cell_type": "markdown", "id": "96fd4d93", "metadata": {}, "source": [ "# Overview of this Notebook\n", "\n", "In this notebook, we build a recommender with the Oracle AutoMLx tool on the Movielens 100k dataset to predict the item that each user will most likely watch next, based on their rating history.\n", "We explore the options that Oracle AutoMLx provides for controlling the training process, and finally evaluate the models it trains. Depending on the machine, running this notebook can take a few minutes.\n", "\n", "---\n", "## Prerequisites:\n", "\n", " - Experience level: Novice (Python and Machine Learning)\n", " - Professional experience: Some industry experience\n", "---\n", "\n", "## Business Use:\n", "\n", "Data analytics and modeling problems using Machine Learning (ML) are becoming popular and often rely on data science expertise to build accurate ML models. Such modeling tasks primarily involve the following steps:\n", "- Preprocess the dataset (clean, impute, engineer features, normalize).\n", "- Pick an appropriate model for the given dataset and prediction task.\n", "- Tune the chosen model’s hyperparameters for the given dataset.\n", "\n", "All of these steps are time-consuming and rely heavily on data science expertise. To make the problem harder, the best feature subset, model, and hyperparameter choice vary widely with the dataset and the prediction task. Hence, there is no one-size-fits-all solution to achieve reasonably good model performance. 
Using a simple Python API, AutoML can quickly jump-start the data science process with an accurately tuned model and appropriate features for a given prediction task.\n", "\n", "## Table of Contents\n", "\n", "- Setup\n", "- Load the Movielens 100k dataset\n", "    - Define the column types\n", "    - Splitting the dataset\n", "- AutoML\n", "    - Create an Instance of AutoMLx\n", "    - Train a Model using AutoMLx\n", "    - Generate recommendations\n", "    - Analyze the AutoMLx optimization process\n", "        - Algorithm Selection\n", "        - Hyperparameter Tuning\n", "    - Advanced AutoMLx Configuration\n", "    - Use a custom validation set\n", "    - Final evaluation of the best model\n", "\n", "\n", "## Setup\n", "\n", "Basic setup for the Notebook." ] }, { "cell_type": "code", "execution_count": 1, "id": "f21246b4", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:34:49.524994Z", "iopub.status.busy": "2025-04-25T10:34:49.524508Z", "iopub.status.idle": "2025-04-25T10:34:56.014433Z", "shell.execute_reply": "2025-04-25T10:34:56.013264Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:34:51,937] [automlx.backend] Overwriting ray session directory to /tmp/1frrc9a7/ray, which will be deleted at engine shutdown. 
If you wish to retain ray logs, provide _temp_dir in ray_setup dict of engine_opts when initializing the AutoMLx engine.\n" ] } ], "source": [ "\n", "\n", "import datetime\n", "import logging\n", "import os\n", "import time\n", "import urllib\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from automlx import AutoRecommender, init\n", "\n", "# Settings for plots\n", "plt.rcParams[\"figure.figsize\"] = [10, 7]\n", "plt.rcParams[\"font.size\"] = 15\n", "\n", "# Silence unnecessary warnings\n", "logging.getLogger(\"sanerec.autotuning.parameter\").setLevel(logging.ERROR)\n", "\n", "# Initialize the parallelization engine of AutoMLx\n", "init(engine='ray', engine_opts={\"ray_setup\": {\"log_to_driver\": False}})" ] }, { "cell_type": "markdown", "id": "60de0bc8", "metadata": {}, "source": [ "\n", "## Load Movielens 100k data\n", "The Movielens 100k dataset is one of the most common public datasets for movie recommendation. It contains 100k ratings from about 1k users on 1.6k movies, some user demographic information, and additional movie characteristics. For more information about this dataset, you can visit the [Movielens website](https://grouplens.org/datasets/movielens/100k/).\n", "\n", "In this demo, we use the ratings to train a movie recommendation model, and rely on AutoMLx to find the model and hyperparameters that give the best recommendation accuracy.\n", "We therefore start by retrieving and loading the ratings data of the Movielens 100k dataset.\n", "To make this notebook lighter and quicker, we also subsample the ratings in the dataset, keeping only 50% of them." 
] }, { "cell_type": "code", "execution_count": 2, "id": "10d71715", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:34:56.019370Z", "iopub.status.busy": "2025-04-25T10:34:56.018522Z", "iopub.status.idle": "2025-04-25T10:35:01.532354Z", "shell.execute_reply": "2025-04-25T10:35:01.531578Z" }, "lines_to_next_cell": 2 }, "outputs": [], "source": [ "\n", "\n", "get_ipython().system(' wget https://files.grouplens.org/datasets/movielens/ml-100k/u.data --no-check-certificate -q -O ./ml100k_interactions.tsv')" ] }, { "cell_type": "code", "execution_count": 3, "id": "d5f67d34", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:35:01.534915Z", "iopub.status.busy": "2025-04-25T10:35:01.534278Z", "iopub.status.idle": "2025-04-25T10:35:01.580798Z", "shell.execute_reply": "2025-04-25T10:35:01.580270Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
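<table>
<tr><th></th><th>user_id</th><th>movie_id</th><th>rating</th><th>timestamp</th></tr>
<tr><th>43660</th><td>508</td><td>185</td><td>5</td><td>883777430</td></tr>
<tr><th>87278</th><td>518</td><td>742</td><td>5</td><td>876823804</td></tr>
<tr><th>14317</th><td>178</td><td>28</td><td>5</td><td>882826806</td></tr>
<tr><th>81932</th><td>899</td><td>291</td><td>4</td><td>884122279</td></tr>
<tr><th>95321</th><td>115</td><td>117</td><td>4</td><td>881171009</td></tr>
</table>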
\n", "
" ], "text/plain": [ " user_id movie_id rating timestamp\n", "43660 508 185 5 883777430\n", "87278 518 742 5 876823804\n", "14317 178 28 5 882826806\n", "81932 899 291 4 884122279\n", "95321 115 117 4 881171009" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "\n", "dataset = pd.read_csv(\n", "    \"./ml100k_interactions.tsv\",\n", "    sep=\"\\t\",\n", "    names=[\"user_id\", \"movie_id\", \"rating\", \"timestamp\"],\n", ").sample(frac=0.5, random_state=1)\n", "\n", "dataset.head(5)" ] }, { "cell_type": "markdown", "id": "ff0c9b09", "metadata": {}, "source": [ "To be used for the recommendation task, the data must have a timestamp column, which is used to infer the temporal order of the samples. The timestamp column must also be set as the index of the dataframes used in our AutoML pipelines.\n", "\n", "Movielens has a `timestamp` column that records when each rating was given, so we set it as the index of our dataframe." ] }, { "cell_type": "code", "execution_count": 4, "id": "812d547d", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:35:01.582716Z", "iopub.status.busy": "2025-04-25T10:35:01.582278Z", "iopub.status.idle": "2025-04-25T10:35:01.588223Z", "shell.execute_reply": "2025-04-25T10:35:01.587772Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
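<table>
<tr><th></th><th>user_id</th><th>movie_id</th><th>rating</th></tr>
<tr><th>timestamp</th><th></th><th></th><th></th></tr>
<tr><th>883777430</th><td>508</td><td>185</td><td>5</td></tr>
<tr><th>876823804</th><td>518</td><td>742</td><td>5</td></tr>
<tr><th>882826806</th><td>178</td><td>28</td><td>5</td></tr>
<tr><th>884122279</th><td>899</td><td>291</td><td>4</td></tr>
<tr><th>881171009</th><td>115</td><td>117</td><td>4</td></tr>
</table>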
\n", "
" ], "text/plain": [ " user_id movie_id rating\n", "timestamp \n", "883777430 508 185 5\n", "876823804 518 742 5\n", "882826806 178 28 5\n", "884122279 899 291 4\n", "881171009 115 117 4" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "\n", "dataset = dataset.set_index(\"timestamp\")\n", "dataset.head(5)" ] }, { "cell_type": "markdown", "id": "b963f4dc", "metadata": {}, "source": [ "\n", "### Define types of columns in the dataframe\n", "\n", "The recommendation task requires defining the two main entities involved in the recommendation:\n", "- the `recommendation`, which represents the entity type that is going to be recommended;\n", "- the `recommendation_subject`, which represents the entity type that receives the recommendation.\n", "\n", "For this reason, AutoML needs to know which columns in the dataset refer to these two concepts and, in particular, which two columns contain their unique identifiers.\n", "\n", "In our demo, we want to recommend movies (`recommendation`), identified by the `movie_id` column, to users (`recommendation_subject`), identified by the `user_id` column. We declare this binding in a Python dictionary that we will reuse throughout the demo." 
] }, { "cell_type": "code", "execution_count": 5, "id": "cfc2901b", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:35:01.590086Z", "iopub.status.busy": "2025-04-25T10:35:01.589606Z", "iopub.status.idle": "2025-04-25T10:35:01.592388Z", "shell.execute_reply": "2025-04-25T10:35:01.591922Z" } }, "outputs": [], "source": [ "\n", "\n", "col_types = {\"movie_id\": \"recommendation\", \"user_id\": \"recommendation_subject\"}" ] }, { "cell_type": "markdown", "id": "b46f96e5", "metadata": {}, "source": [ "\n", "## Splitting the dataset\n", "\n", "We split the dataset into training and test datasets using a leave-last-out technique.\n", "The training set will be used to create a Machine Learning model using Oracle AutoMLx, and the test set will be used to evaluate the model's performance on unseen data.\n", "\n", "The leave-last-out splitting technique keeps in the test set only the last data sample, as determined by its timestamp, for each `recommendation_subject` (a user in this case). All other samples form the training set. This corresponds to the common next-item recommendation use case: given the history of a `recommendation_subject` in the training set, we want to predict what should be recommended next to that subject, and check whether it matches the actual sample in the test set." 
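, "

The leave-last-out rule described above can be sketched with plain pandas on a toy dataframe. This is only an illustration of the idea, not the implementation behind `AutoRecommender.train_test_split`; the toy data and the names `toy`, `toy_train`, and `toy_test` are made up for this example.

```python
import pandas as pd

# Toy interactions indexed by timestamp, mirroring the layout used in this notebook.
toy = pd.DataFrame(
    {
        'user_id': [1, 1, 1, 2, 2],
        'movie_id': [10, 11, 12, 10, 13],
        'rating': [4, 5, 3, 2, 5],
    },
    index=pd.Index([100, 200, 300, 150, 250], name='timestamp'),
)

# Order the samples by time (mergesort is a stable sort).
ordered = toy.sort_index(kind='mergesort')

# A row is the last one of its user if no later row has the same user_id.
is_last = ~ordered.duplicated(subset='user_id', keep='last')

toy_test = ordered[is_last]    # one row per user: the most recent interaction
toy_train = ordered[~is_last]  # everything that happened before it

print(toy_test)
print(toy_train)
```

Each user contributes exactly one row to `toy_test` (their most recent interaction), and all earlier interactions stay in `toy_train`.
"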
] }, { "cell_type": "code", "execution_count": 6, "id": "278bbc69", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:35:01.594188Z", "iopub.status.busy": "2025-04-25T10:35:01.593720Z", "iopub.status.idle": "2025-04-25T10:35:06.946927Z", "shell.execute_reply": "2025-04-25T10:35:06.946304Z" } }, "outputs": [], "source": [ "\n", "\n", "training_data, test_data = AutoRecommender.train_test_split(data=dataset, col_types=col_types)" ] }, { "cell_type": "markdown", "id": "16104052", "metadata": {}, "source": [ "\n", "# AutoML\n", "\n", "\n", "## Create an instance of Oracle AutoMLx\n", "\n", "The Oracle AutoMLx solution provides a pipeline that automatically finds a tuned model given a prediction task and a training dataset. In particular, it allows finding a tuned model for any supervised prediction task, for example, classification or regression where the target can be binary, categorical or real-valued.\n", "\n", "In this demo we want a model that performs a recommendation task, so we create a pipeline of type `AutoRecommender`, and we configure it with default parameters. You can find the complete list of all the available parameters and their meaning in our documentation." ] }, { "cell_type": "code", "execution_count": 7, "id": "c041b433", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:35:06.949707Z", "iopub.status.busy": "2025-04-25T10:35:06.948792Z", "iopub.status.idle": "2025-04-25T10:35:06.952588Z", "shell.execute_reply": "2025-04-25T10:35:06.952107Z" } }, "outputs": [], "source": [ "\n", "\n", "automl_pipeline = AutoRecommender().configure()" ] }, { "cell_type": "markdown", "id": "0974f27a", "metadata": {}, "source": [ "\n", "## Train a model using AutoMLx\n", "\n", "The training data is passed to the `fit()` function which executes the model selection and hyperparameter tuning steps." 
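, "

As a rough illustration of what the model selection step decides, candidate algorithms can be ranked by their validation score and the top `n` kept for tuning. This is a simplification and not AutoMLx's actual implementation; the helper name `select_top_algorithms` is hypothetical, and the scores are the hit rates reported for the Model Selection trials in the summary later in this notebook.

```python
def select_top_algorithms(scores, n_algos_tuned=1):
    # Rank candidate algorithms by validation score, best first, and keep the top n.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_algos_tuned]


# Hit rates of the four Model Selection trials shown in the trials summary.
validation_scores = {
    'ItemKNNRecommender': 0.0882,
    'AlsRecommender': 0.0691,
    'TRexxRecommender': 0.0531,
    'BprRecommender': 0.0372,
}

selected = select_top_algorithms(validation_scores, n_algos_tuned=1)
print(selected)
```

Only the algorithms kept by this step go on to hyperparameter tuning, which is why the tuning trials in the summary all belong to a single algorithm in the default run.
"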
] }, { "cell_type": "code", "execution_count": 8, "id": "8b379a85", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:35:06.954439Z", "iopub.status.busy": "2025-04-25T10:35:06.953936Z", "iopub.status.idle": "2025-04-25T10:36:14.308745Z", "shell.execute_reply": "2025-04-25T10:36:14.308200Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:35:07,292] [automlx.interface] Dataset shape: (49055,3)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:35:07,362] [automlx.process] Running Model Generation\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:35:07,402] [automlx.process] Model Generation completed.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:35:07,446] [automlx.model_selection] Running Model Selection\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:35:56,451] [automlx.model_selection] Model Selection completed - Took 49.005 sec - Selected models: [['ItemKNNRecommender']]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:35:56,513] [automlx.trials] Running Model Tuning for ['ItemKNNRecommender']\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:12,913] [automlx.trials] Best parameters for ItemKNNRecommender: {'n_recommendations': 10, 'num_of_neighbors': 506, 'bias': 0.0001, 'hist_len': 10, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:12,915] [automlx.trials] Model Tuning completed. 
Took: 16.401 secs\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:13,337] [automlx.interface] Re-fitting pipeline\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:13,344] [automlx.final_fit] Skipping updating parameter seed, already fixed by FinalFit_19cf1d36-6\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:14,240] [automlx.interface] AutoMLx completed.\n" ] } ], "source": [ "\n", "\n", "automl_pipeline = automl_pipeline.fit(data=training_data, col_types=col_types)" ] }, { "cell_type": "markdown", "id": "4a4d8214", "metadata": {}, "source": [ "\n", "## Generate recommendations\n", "\n", "Once the AutoML pipeline is completed, we predict 5 recommendations for a random user in the dataset." ] }, { "cell_type": "code", "execution_count": 9, "id": "e31a99f9", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:36:14.310959Z", "iopub.status.busy": "2025-04-25T10:36:14.310411Z", "iopub.status.idle": "2025-04-25T10:36:14.371178Z", "shell.execute_reply": "2025-04-25T10:36:14.370685Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
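<table>
<tr><th></th><th>user_id</th><th>movie_id</th><th>score</th></tr>
<tr><th>0</th><td>628</td><td>330</td><td>15.370814</td></tr>
<tr><th>1</th><td>628</td><td>286</td><td>15.029380</td></tr>
<tr><th>2</th><td>628</td><td>258</td><td>14.119169</td></tr>
<tr><th>3</th><td>628</td><td>272</td><td>13.703196</td></tr>
<tr><th>4</th><td>628</td><td>313</td><td>13.564656</td></tr>
</table>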
\n", "
" ], "text/plain": [ " user_id movie_id score\n", "0 628 330 15.370814\n", "1 628 286 15.029380\n", "2 628 258 14.119169\n", "3 628 272 13.703196\n", "4 628 313 13.564656" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "\n", "recommendation_subjects = test_data.sample(1)[['user_id']]\n", "automl_pipeline.predict(subjects=recommendation_subjects, n_recommendations=5)" ] }, { "cell_type": "markdown", "id": "801fc125", "metadata": {}, "source": [ "\n", "## Analyze the AutoMLx optimization process\n", "\n", "During the Oracle AutoMLx process for recommendation, a summary of the optimization process is logged, containing:\n", "- Information about the training data.\n", "- Information about the AutoMLx Pipeline, such as:\n", " - Selected algorithm that was the best choice for this data;\n", " - Selected hyperparameters for the selected algorithm.\n", "\n", "AutoMLx provides a `print_summary` API to output all the different trials performed." ] }, { "cell_type": "code", "execution_count": 10, "id": "d7fa1f8f", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:36:14.373161Z", "iopub.status.busy": "2025-04-25T10:36:14.372643Z", "iopub.status.idle": "2025-04-25T10:36:14.387085Z", "shell.execute_reply": "2025-04-25T10:36:14.386548Z" } }, "outputs": [ { "data": { "text/html": [ "
General Summary
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
None
None
ManualSplit(Shuffle=False, Seed=7)
SanerecMetric
ItemKNNRecommender
{'n_recommendations': 10, 'num_of_neighbors': 506, 'bias': 0.0001, 'hist_len': 10, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}
25.2.1
3.9.21 (main, Dec 11 2024, 16:24:11) \\n[GCC 11.2.0]
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Trials Summary
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table>
<tr><th>Step</th><th># Samples</th><th># Features</th><th>Algorithm</th><th>Hyperparameters</th><th>Score (SanerecMetric)</th><th>All Metrics</th><th>Runtime (Seconds)</th><th>Memory Usage (GB)</th><th>Finished</th></tr>
<tr><td>Model Selection</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 100, 'bias': 25, 'hist_len': 20, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0882</td><td>{'hr': 0.08820403825717323}</td><td>1.0319</td><td>0.7063</td><td>Fri Apr 25 03:35:20 2025</td></tr>
<tr><td>Model Selection</td><td>48114</td><td>2</td><td>AlsRecommender</td><td>{'n_recommendations': 10, 'iterations': 10, 'factors': 16, 'regularization': 0.01, 'cache_users_states': True}</td><td>0.0691</td><td>{'hr': 0.06907545164718384}</td><td>4.5845</td><td>0.7055</td><td>Fri Apr 25 03:35:19 2025</td></tr>
<tr><td>Model Selection</td><td>48114</td><td>2</td><td>TRexxRecommender</td><td>{'n_recommendations': 10, 'embedding_dim': 32, 'sequence_length': 5, 'num_sampled': 100, 'dropout_rate': 0.2, 'num_blocks': 2, 'num_head': 4, 'l2_reg_embedding': 1e-06, 'dnn_activation': 'tanh', 'optimizer_name': 'lazyadam', 'optimizer_learning_rate': 0.01, 'future_blinding': False, 'embeddings_on_cpu': False, 'cache_users_states': False, 'negative_sampling_method': CandidateSamplingMethod.UNIFORM_CANDIDATE_SAMPLING, 'epochs': 10, 'batch_size': 512, 'verbose': 1, 'augment_data': True, 'early_stopping_patience': -1}</td><td>0.0531</td><td>{'hr': 0.053134962805526036}</td><td>35.1017</td><td>1.1860</td><td>Fri Apr 25 03:35:56 2025</td></tr>
<tr><td>Model Selection</td><td>48114</td><td>2</td><td>BprRecommender</td><td>{'n_recommendations': 10, 'iterations': 10, 'factors': 16, 'regularization': 0.01, 'cache_users_states': True}</td><td>0.0372</td><td>{'hr': 0.03719447396386823}</td><td>0.4148</td><td>0.7046</td><td>Fri Apr 25 03:35:21 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 505, 'bias': 0.0001, 'hist_len': 10, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0999</td><td>{'hr': 0.09989373007438895}</td><td>1.3955</td><td>0.6848</td><td>Fri Apr 25 03:36:10 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 506, 'bias': 0.0001, 'hist_len': 10, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0999</td><td>{'hr': 0.09989373007438895}</td><td>1.0910</td><td>0.6842</td><td>Fri Apr 25 03:36:12 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 506, 'bias': 0.0001, 'hist_len': 10, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0999</td><td>{'hr': 0.09989373007438895}</td><td>1.3582</td><td>0.6869</td><td>Fri Apr 25 03:36:09 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 28.25660795027468, 'hist_len': 10, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0956</td><td>{'hr': 0.09564293304994687}</td><td>1.4462</td><td>0.6762</td><td>Fri Apr 25 03:36:08 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 28.26160794927468, 'hist_len': 10, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0956</td><td>{'hr': 0.09564293304994687}</td><td>1.1324</td><td>0.6785</td><td>Fri Apr 25 03:36:10 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 28.26160794927468, 'hist_len': 10, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0956</td><td>{'hr': 0.09564293304994687}</td><td>1.3547</td><td>0.6758</td><td>Fri Apr 25 03:36:08 2025</td></tr>
<tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 0.0001, 'hist_len': 20, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.084</td><td>{'hr': 0.08395324123273114}</td><td>1.3191</td><td>0.6742</td><td>Fri Apr 25 03:36:08 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 0.0001, 'hist_len': 21, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.084</td><td>{'hr': 0.08395324123273114}</td><td>1.3325</td><td>0.6755</td><td>Fri Apr 25 03:36:08 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 505, 'bias': 25, 'hist_len': 20, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0829</td><td>{'hr': 0.08289054197662062}</td><td>1.1355</td><td>1.1758</td><td>Fri Apr 25 03:36:01 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 506, 'bias': 25, 'hist_len': 20, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0829</td><td>{'hr': 0.08289054197662062}</td><td>1.3934</td><td>1.1758</td><td>Fri Apr 25 03:36:02 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 752, 'bias': 25, 'hist_len': 20, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0818</td><td>{'hr': 0.0818278427205101}</td><td>1.4294</td><td>1.1812</td><td>Fri Apr 25 03:36:05 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 753, 'bias': 25, 'hist_len': 20, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0818</td><td>{'hr': 0.0818278427205101}</td><td>1.2525</td><td>1.1812</td><td>Fri Apr 25 03:36:04 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 0.0001, 'hist_len': 132, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0797</td><td>{'hr': 0.07970244420828905}</td><td>1.4449</td><td>0.6758</td><td>Fri Apr 25 03:36:09 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 0.0001, 'hist_len': 133, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0797</td><td>{'hr': 0.07970244420828905}</td><td>1.3044</td><td>0.6773</td><td>Fri Apr 25 03:36:08 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 0.0001, 'hist_len': 255, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0797</td><td>{'hr': 0.07970244420828905}</td><td>1.4471</td><td>0.6760</td><td>Fri Apr 25 03:36:09 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 0.0001, 'hist_len': 256, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0797</td><td>{'hr': 0.07970244420828905}</td><td>1.4831</td><td>0.6730</td><td>Fri Apr 25 03:36:09 2025</td></tr>
</table>
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "\n", "automl_pipeline.print_summary()" ] }, { "cell_type": "markdown", "id": "30691ab3", "metadata": {}, "source": [ "We also provide the capability to visualize the results of each stage of the AutoMLx pipeline.\n", "\n", "\n", "### Algorithm Selection\n", "\n", "The plot below shows the scores predicted by Algorithm Selection for each algorithm. The horizontal line shows the average score across all algorithms. Algorithms below the line are colored turquoise, whereas those with a score higher than the mean are colored teal. The selected algorithm is in orange." ] }, { "cell_type": "code", "execution_count": 11, "id": "faccd1b4", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:36:14.388999Z", "iopub.status.busy": "2025-04-25T10:36:14.388508Z", "iopub.status.idle": "2025-04-25T10:36:14.578365Z", "shell.execute_reply": "2025-04-25T10:36:14.577825Z" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "\n", "def plot_model_selection_scores(_pipeline):\n", "    # Each trial is a row in a dataframe with the columns:\n", "    # Algorithm, Number of Samples, Number of Features, Hyperparameters, Score, Runtime, Memory Usage, Step\n", "    # Copy the filtered rows so the in-place cleanup below does not act on a slice of the summary\n", "    trials = _pipeline.completed_trials_summary_[\n", "        _pipeline.completed_trials_summary_[\"Step\"].str.contains(\"Model Selection\")\n", "    ].copy()\n", "    name_of_score_column = f\"Score ({_pipeline._inferred_score_metric[0].name})\"\n", "    trials.replace([np.inf, -np.inf], np.nan, inplace=True)\n", "    trials.dropna(subset=[name_of_score_column], inplace=True)\n", "    scores = trials[name_of_score_column].tolist()\n", "    models = trials[\"Algorithm\"].tolist()\n", "\n", "    y_margin = 0.10 * (max(scores) - min(scores))\n", "    s = pd.Series(scores, index=models).sort_values(ascending=False)\n", "\n", "    # Orange for the selected algorithm, teal for scores at or above the mean, turquoise otherwise\n", "    colors = []\n", "    for f in s.keys():\n", "        if f.strip() == _pipeline.selected_model_.strip():\n", "            colors.append(\"orange\")\n", "        elif s[f] >= s.mean():\n", "            colors.append(\"teal\")\n", "        else:\n", "            colors.append(\"turquoise\")\n", "\n", "    fig, ax = plt.subplots(1)\n", "    ax.set_title(\"Algorithm Selection Trials\")\n", "    ax.set_ylim(min(scores) - y_margin, max(scores) + y_margin)\n", "    ax.set_ylabel(\"Hit Rate\")\n", "    s.plot.bar(ax=ax, color=colors, edgecolor=\"black\")\n", "    ax.axhline(y=s.mean(), color=\"black\", linewidth=0.5)\n", "    plt.show()\n", "\n", "plot_model_selection_scores(automl_pipeline)" ] }, { "cell_type": "markdown", "id": "b1bd701b", "metadata": {}, "source": [ "\n", "### Hyperparameter Tuning\n", "\n", "Hyperparameter Tuning is the last stage of the AutoMLx pipeline, and focuses on improving the chosen algorithm's score on the dataset. We use a novel iterative algorithm to search across many hyperparameter dimensions, and converge automatically when optimal hyperparameters are identified. 
Each trial represents a particular hyperparameter configuration for the selected model.\n" ] }, { "cell_type": "code", "execution_count": 12, "id": "efc97132", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:36:14.580268Z", "iopub.status.busy": "2025-04-25T10:36:14.579872Z", "iopub.status.idle": "2025-04-25T10:36:14.768618Z", "shell.execute_reply": "2025-04-25T10:36:14.768077Z" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "\n", "def plot_hp_tuning_scores(_pipeline):\n", "    # Each trial is a row in a dataframe with the columns:\n", "    # Algorithm, Number of Samples, Number of Features, Hyperparameters, Score, Runtime, Memory Usage, Step\n", "    # Copy the filtered rows so the in-place cleanup below does not act on a slice of the summary\n", "    trials = _pipeline.completed_trials_summary_[\n", "        _pipeline.completed_trials_summary_[\"Step\"].str.contains(\"Model Tuning\")\n", "    ].copy()\n", "    name_of_score_column = f\"Score ({_pipeline._inferred_score_metric[0].name})\"\n", "    trials.replace([np.inf, -np.inf], np.nan, inplace=True)\n", "    trials.dropna(subset=[name_of_score_column], inplace=True)\n", "    trials.drop(trials[trials[\"Finished\"] == -1].index, inplace=True)\n", "    # Convert the completion time strings to epoch seconds so trials can be ordered in time\n", "    trials[\"Finished\"] = trials[\"Finished\"].apply(\n", "        lambda x: time.mktime(datetime.datetime.strptime(x, \"%a %b %d %H:%M:%S %Y\").timetuple())\n", "    )\n", "    trials.sort_values(by=[\"Finished\"], ascending=True, inplace=True)\n", "    scores = trials[name_of_score_column].tolist()\n", "    # Running maximum: the best score found up to and including each trial\n", "    score = []\n", "    score.append(scores[0])\n", "    for i in range(1, len(scores)):\n", "        if scores[i] >= score[i - 1]:\n", "            score.append(scores[i])\n", "        else:\n", "            score.append(score[i - 1])\n", "    y_margin = 0.10 * (max(score) - min(score))\n", "    fig, ax = plt.subplots(1)\n", "    ax.set_title(\"Hyperparameter Tuning Trials\")\n", "    ax.set_xlabel(\"Iteration $n$\")\n", "    ax.set_ylabel(\"Hit Rate\")\n", "    ax.grid(color=\"g\", linestyle=\"-\", linewidth=0.1)\n", "    ax.set_ylim(min(score) - y_margin, max(score) + y_margin)\n", "    # Dotted teal line; the color is given once to avoid clashing with a format string\n", "    ax.plot(range(1, len(trials) + 1), score, linestyle=\":\", marker=\"s\", color=\"teal\", markersize=3)\n", "    plt.show()\n", "\n", "plot_hp_tuning_scores(automl_pipeline)" ] }, { "cell_type": "markdown", "id": "e419e8b2", "metadata": {}, "source": [ "\n", "## Advanced AutoMLx Configuration\n", "\n", "You can also configure the AutoRecommender pipeline with suitable parameters according to your needs." 
] }, { "cell_type": "code", "execution_count": 13, "id": "cce288b6", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:36:14.770632Z", "iopub.status.busy": "2025-04-25T10:36:14.770102Z", "iopub.status.idle": "2025-04-25T10:36:14.773855Z", "shell.execute_reply": "2025-04-25T10:36:14.773370Z" } }, "outputs": [], "source": [ "\n", "\n", "custom_pipeline = AutoRecommender().configure(\n", "    model_list=[  # Specify the models you want AutoMLx to consider\n", "        \"ItemKNNRecommender\",\n", "        \"AlsRecommender\",\n", "        \"BprRecommender\",\n", "    ],\n", "    n_algos_tuned=2,  # Choose how many models to tune\n", "    search_space={  # You can specify the hyperparameters and ranges to search for each model\n", "        \"ItemKNNRecommender\": {\"num_of_neighbors\": {\"range\": [10, 30], \"type\": \"continuous\"}}\n", "    },\n", "    max_tuning_trials=20,  # The maximum number of tuning trials. Can be an integer or a dict (max number for each model)\n", "    score_metric=\"recall\",  # Any of the available metrics; see the documentation for the list of supported values\n", ")" ] }, { "cell_type": "markdown", "id": "f68bee6c", "metadata": {}, "source": [ "\n", "## Use a custom validation set\n", "\n", "You can specify a custom validation set that you want AutoMLx to use to evaluate the quality of models and configurations." ] }, { "cell_type": "code", "execution_count": 14, "id": "a8f1a67e", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:36:14.775669Z", "iopub.status.busy": "2025-04-25T10:36:14.775187Z", "iopub.status.idle": "2025-04-25T10:36:14.947404Z", "shell.execute_reply": "2025-04-25T10:36:14.946822Z" } }, "outputs": [], "source": [ "\n", "\n", "training_data, validation_data = AutoRecommender.train_test_split(data=training_data, col_types=col_types)\n", "\n", "\n", "# We rerun the AutoML pipeline with the custom training/validation split we just created, plus some advanced settings that can be specified directly in the fit method." 
] }, { "cell_type": "code", "execution_count": 15, "id": "0787d525", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:36:14.949398Z", "iopub.status.busy": "2025-04-25T10:36:14.948881Z", "iopub.status.idle": "2025-04-25T10:36:25.228649Z", "shell.execute_reply": "2025-04-25T10:36:25.228075Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:15,105] [automlx.interface] Dataset shape: (49055,3)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:15,170] [automlx.process] Running Model Generation\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:15,215] [automlx.process] Model Generation completed.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:15,246] [automlx.model_selection] Running Model Selection\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:16,875] [automlx.model_selection] Model Selection completed - Took 1.629 sec - Selected models: [['ItemKNNRecommender', 'AlsRecommender']]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:16,953] [automlx.trials] Running Model Tuning for ['ItemKNNRecommender']\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:20,323] [automlx.trials] Best parameters for ItemKNNRecommender: {'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 0.010099998000000002, 'hist_len': 10, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:20,324] [automlx.trials] Model Tuning completed. 
Took: 3.371 secs\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:20,456] [automlx.trials] Running Model Tuning for ['AlsRecommender']\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:23,155] [automlx.trials] Best parameters for AlsRecommender: {'n_recommendations': 10, 'iterations': 10, 'factors': 16, 'regularization': 0.00044721247746457157, 'cache_users_states': True}\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:23,156] [automlx.trials] Model Tuning completed. Took: 2.700 secs\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:23,552] [automlx.interface] Re-fitting pipeline\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:23,562] [automlx.final_fit] Skipping updating parameter seed, already fixed by FinalFit_22bd3002-a\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-04-25 03:36:25,050] [automlx.interface] AutoMLx completed.\n" ] } ], "source": [ "\n", "\n", "custom_pipeline = custom_pipeline.fit(\n", " training_data,\n", " col_types,\n", " validation_data,\n", " time_budget=20, # Specify time budget in seconds\n", ")" ] }, { "cell_type": "markdown", "id": "f90f31e9", "metadata": {}, "source": [ "Now that the custom AutoML pipeline is completed, we can generate recommendations.\n", "Note that the pipeline's `recommend` method is equivalent to `predict`." ] }, { "cell_type": "code", "execution_count": 16, "id": "03976c22", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:36:25.230997Z", "iopub.status.busy": "2025-04-25T10:36:25.230474Z", "iopub.status.idle": "2025-04-25T10:36:25.290481Z", "shell.execute_reply": "2025-04-25T10:36:25.290013Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
user_id movie_id score
0 628 286 13.964525
1 628 330 13.761287
2 628 272 12.332191
3 628 331 12.210435
4 628 313 12.163628
\n", "
" ], "text/plain": [ " user_id movie_id score\n", "0 628 286 13.964525\n", "1 628 330 13.761287\n", "2 628 272 12.332191\n", "3 628 331 12.210435\n", "4 628 313 12.163628" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "\n", "custom_pipeline.recommend(subjects=recommendation_subjects, n_recommendations=5)" ] }, { "cell_type": "markdown", "id": "b3867203", "metadata": {}, "source": [ "\n", "## Final evaluation of the best model\n", "\n", "Finally, we evaluate the best model found on the test data. If no metric is specified, the pipeline computes the score with the same metric used for hyperparameter tuning, which in this case is recall, as set when the pipeline was configured.\n", "\n", "In this example, we instead ask the pipeline to evaluate using Normalized Discounted Cumulative Gain (NDCG), a common ranking metric. Our online documentation lists the available metrics and explains how they are computed." ] }, { "cell_type": "code", "execution_count": 17, "id": "a7cfe11a", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:36:25.292408Z", "iopub.status.busy": "2025-04-25T10:36:25.291898Z", "iopub.status.idle": "2025-04-25T10:36:25.822968Z", "shell.execute_reply": "2025-04-25T10:36:25.822432Z" }, "lines_to_next_cell": 2 }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "0dc5ce11a5664847a51c68144c131631", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/939 [00:00