{ "cells": [ { "cell_type": "markdown", "id": "06a5f433", "metadata": {}, "source": [ "\n", "

Building a Recommender using AutoMLx

\n", "

by the Oracle AutoMLx Team

\n", "\n", "***" ] }, { "cell_type": "markdown", "id": "ba30d145", "metadata": {}, "source": [ "Recommendation Demo Notebook.\n", "\n", "Copyright © 2025, Oracle and/or its affiliates.\n", "\n", "Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/" ] }, { "cell_type": "markdown", "id": "461f9bf8", "metadata": {}, "source": [ "# Overview of this Notebook\n", "\n", "In this notebook we will build a recommender using the Oracle AutoMLx tool for the Movielens 100k dataset to predict the next item that users will most likely watch, based on their ratings history.\n", "We explore the various options provided by the Oracle AutoMLx tool, allowing the user to control the AutoMLx training process. We finally evaluate the different models trained by AutoMLx. Depending on the machine running this notebook, it can take a few minutes to run.\n", "\n", "---\n", "## Prerequisites:\n", "\n", " - Experience level: Novice (Python and Machine Learning)\n", " - Professional experience: Some industry experience\n", "---\n", "\n", "## Business Use:\n", "\n", "Data analytics and modeling problems using Machine Learning (ML) are becoming popular and often rely on data science expertise to build accurate ML models. Such modeling tasks primarily involve the following steps:\n", "- Preprocess dataset (clean, impute, engineer features, normalize).\n", "- Pick an appropriate model for the given dataset and prediction task at hand.\n", "- Tune the chosen model’s hyperparameters for the given dataset.\n", "\n", "All of these steps are time-consuming and rely heavily on data science expertise. To make the problem harder, the best feature subset, model, and hyperparameter choice vary widely with the dataset and the prediction task. Hence, there is no one-size-fits-all solution to achieve reasonably good model performance. 
Using a simple Python API, AutoML can quickly jump-start the data science process with an accurately tuned model and appropriate features for a given prediction task.\n", "\n", "## Table of Contents\n", "\n", "- Setup\n", "- Load the Movielens 100k dataset\n", "    - Define the column types\n", "    - Splitting the dataset\n", "- AutoML\n", "    - Create an Instance of AutoMLx\n", "    - Train a Model using AutoMLx\n", "    - Generate recommendations\n", "    - Analyze the AutoMLx optimization process\n", "        - Algorithm Selection\n", "        - Hyperparameter Tuning\n", "    - Advanced AutoMLx Configuration\n", "    - Use a custom validation set\n", "    - Final evaluation of the best model\n", "\n", "\n", "## Setup\n", "\n", "Basic setup for the Notebook." ] }, { "cell_type": "code", "execution_count": 1, "id": "96b89fb7", "metadata": { "execution": { "iopub.execute_input": "2025-05-22T12:33:40.617151Z", "iopub.status.busy": "2025-05-22T12:33:40.616791Z", "iopub.status.idle": "2025-05-22T12:33:47.612467Z", "shell.execute_reply": "2025-05-22T12:33:47.611412Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:33:43,584] [automlx.backend] Overwriting ray session directory to /tmp/dd53z4r9/ray, which will be deleted at engine shutdown. 
If you wish to retain ray logs, provide _temp_dir in ray_setup dict of engine_opts when initializing the AutoMLx engine.\n" ] } ], "source": [ "\n", "\n", "import datetime\n", "import logging\n", "import os\n", "import time\n", "import urllib\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from automlx import AutoRecommender, init\n", "\n", "# Settings for plots\n", "plt.rcParams[\"figure.figsize\"] = [10, 7]\n", "plt.rcParams[\"font.size\"] = 15\n", "\n", "# Silence unnecessary warnings\n", "logging.getLogger(\"sanerec.autotuning.parameter\").setLevel(logging.ERROR)\n", "\n", "# Initialize the parallelization engine of AutoMLx\n", "init(engine='ray', engine_opts={\"ray_setup\": {\"log_to_driver\": False}})" ] }, { "cell_type": "markdown", "id": "7e230f8d", "metadata": {}, "source": [ "\n", "## Load Movielens 100k data\n", "The Movielens 100k dataset is one of the most common public datasets for movie recommendation. It contains 100k ratings from about 1k users on 1.6k movies, some user demographic information, and additional movie characteristics. For more information about this dataset, you can visit the [Movielens website](https://grouplens.org/datasets/movielens/100k/).\n", "\n", "In this demo, we use the ratings to train a movie recommendation model, relying on AutoMLx to find the recommendation model and hyperparameters that give the best recommendation accuracy.\n", "We therefore start by retrieving and loading the ratings data of the Movielens 100k dataset.\n", "To make this notebook lighter and quicker, we also subsample the ratings in the dataset, keeping only 50%." 
] }, { "cell_type": "code", "execution_count": 2, "id": "cf715830", "metadata": { "execution": { "iopub.execute_input": "2025-05-22T12:33:47.615459Z", "iopub.status.busy": "2025-05-22T12:33:47.614822Z", "iopub.status.idle": "2025-05-22T12:33:48.124233Z", "shell.execute_reply": "2025-05-22T12:33:48.123469Z" }, "lines_to_next_cell": 2 }, "outputs": [], "source": [ "\n", "\n", "get_ipython().system(' wget https://files.grouplens.org/datasets/movielens/ml-100k/u.data --no-check-certificate -q -O ./ml100k_interactions.tsv')" ] }, { "cell_type": "code", "execution_count": 3, "id": "156d2b43", "metadata": { "execution": { "iopub.execute_input": "2025-05-22T12:33:48.126752Z", "iopub.status.busy": "2025-05-22T12:33:48.126373Z", "iopub.status.idle": "2025-05-22T12:33:48.171028Z", "shell.execute_reply": "2025-05-22T12:33:48.170413Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
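<table>
<thead><tr><th></th><th>user_id</th><th>movie_id</th><th>rating</th><th>timestamp</th></tr></thead>
<tbody>
<tr><th>43660</th><td>508</td><td>185</td><td>5</td><td>883777430</td></tr>
<tr><th>87278</th><td>518</td><td>742</td><td>5</td><td>876823804</td></tr>
<tr><th>14317</th><td>178</td><td>28</td><td>5</td><td>882826806</td></tr>
<tr><th>81932</th><td>899</td><td>291</td><td>4</td><td>884122279</td></tr>
<tr><th>95321</th><td>115</td><td>117</td><td>4</td><td>881171009</td></tr>
</tbody>
</table>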
\n", "
" ], "text/plain": [ " user_id movie_id rating timestamp\n", "43660 508 185 5 883777430\n", "87278 518 742 5 876823804\n", "14317 178 28 5 882826806\n", "81932 899 291 4 884122279\n", "95321 115 117 4 881171009" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "\n", "dataset = pd.read_csv(\n", "    \"./ml100k_interactions.tsv\",\n", "    sep=\"\\t\",\n", "    names=[\"user_id\", \"movie_id\", \"rating\", \"timestamp\"],\n", ").sample(frac=0.5, random_state=1)\n", "\n", "dataset.head(5)" ] }, { "cell_type": "markdown", "id": "56678702", "metadata": {}, "source": [ "To be used for the recommendation task, the data must have a timestamp column, which is used to infer the temporal order of the samples. The timestamp column must also be set as the index of the dataframes used in our AutoML pipelines.\n", "\n", "Movielens has a `timestamp` column that records the time when each rating was given, so we set it as the index of our dataframe." ] }, { "cell_type": "code", "execution_count": 4, "id": "7100daf2", "metadata": { "execution": { "iopub.execute_input": "2025-05-22T12:33:48.173343Z", "iopub.status.busy": "2025-05-22T12:33:48.172820Z", "iopub.status.idle": "2025-05-22T12:33:48.180698Z", "shell.execute_reply": "2025-05-22T12:33:48.180078Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
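<table>
<thead>
<tr><th></th><th>user_id</th><th>movie_id</th><th>rating</th></tr>
<tr><th>timestamp</th><th></th><th></th><th></th></tr>
</thead>
<tbody>
<tr><th>883777430</th><td>508</td><td>185</td><td>5</td></tr>
<tr><th>876823804</th><td>518</td><td>742</td><td>5</td></tr>
<tr><th>882826806</th><td>178</td><td>28</td><td>5</td></tr>
<tr><th>884122279</th><td>899</td><td>291</td><td>4</td></tr>
<tr><th>881171009</th><td>115</td><td>117</td><td>4</td></tr>
</tbody>
</table>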
\n", "
" ], "text/plain": [ " user_id movie_id rating\n", "timestamp \n", "883777430 508 185 5\n", "876823804 518 742 5\n", "882826806 178 28 5\n", "884122279 899 291 4\n", "881171009 115 117 4" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "\n", "dataset = dataset.set_index(\"timestamp\")\n", "dataset.head(5)" ] }, { "cell_type": "markdown", "id": "fb61014c", "metadata": {}, "source": [ "\n", "### Define types of columns in the dataframe\n", "\n", "The recommendation task requires defining the two main entities involved in the recommendation:\n", "- the `recommendation`, which represents the entity type that is going to be recommended;\n", "- the `recommendation_subject`, which represents the entity type that receives the recommendation.\n", "\n", "For this reason, AutoML needs to know which columns in the dataset refer to these two concepts, and, in particular, which two columns contain their unique identifiers.\n", "\n", "In our demo we want to recommend movies (`recommendation`), identified by the `movie_id` column, to users (`recommendation_subject`), identified by the `user_id` column. We declare this binding in a Python dictionary that we will reuse throughout the demo." 
] }, { "cell_type": "code", "execution_count": 5, "id": "571620a1", "metadata": { "execution": { "iopub.execute_input": "2025-05-22T12:33:48.182829Z", "iopub.status.busy": "2025-05-22T12:33:48.182348Z", "iopub.status.idle": "2025-05-22T12:33:48.185562Z", "shell.execute_reply": "2025-05-22T12:33:48.184956Z" } }, "outputs": [], "source": [ "\n", "\n", "col_types = {\"movie_id\": \"recommendation\", \"user_id\": \"recommendation_subject\"}" ] }, { "cell_type": "markdown", "id": "1f64d908", "metadata": {}, "source": [ "\n", "## Splitting the dataset\n", "\n", "We split the dataset into training and test datasets using a leave-last-out technique.\n", "The training set will be used to create a Machine Learning model using Oracle AutoMLx, and the test set will be used to evaluate the model's performance on unseen data.\n", "\n", "The leave-last-out splitting technique keeps in the test set only the last data sample, as determined by its timestamp, for each `recommendation_subject` (a user in this case). All the other samples form the training set. This corresponds to the common next-item recommendation use case: given the full history of a `recommendation_subject` in the training set, we want to predict what should be recommended next to the same subject, and check whether it matches the actual sample in the test set." 
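, "

The leave-last-out idea described above can be sketched in plain pandas. This is only an illustration under stated assumptions, not the implementation behind `AutoRecommender.train_test_split`: the helper names `leave_last_out_split` and `hit_rate` and the toy data are hypothetical, and the hit-rate definition (fraction of subjects whose held-out item appears among their recommendations) is one common convention that may differ from the metric AutoMLx reports.

```python
import pandas as pd


def leave_last_out_split(df, subject_col):
    # Hypothetical helper, not an AutoMLx API. The index is assumed to hold
    # the timestamps; the latest row of each subject goes to the test set.
    ordered = df.sort_index(kind='mergesort')  # stable sort keeps tie order
    is_last = ~ordered.duplicated(subset=subject_col, keep='last')
    return ordered[~is_last], ordered[is_last]


def hit_rate(recommended, held_out):
    # recommended: dict mapping a subject to its list of recommended items
    # held_out: dict mapping a subject to its single held-out item
    hits = sum(held_out[s] in recommended.get(s, []) for s in held_out)
    return hits / len(held_out)


# Toy interactions: user 1 has three ratings, user 2 has two.
toy = pd.DataFrame(
    {'user_id': [1, 1, 1, 2, 2], 'movie_id': [10, 11, 12, 10, 13]},
    index=pd.Index([100, 200, 300, 150, 250], name='timestamp'),
)
toy_train, toy_test = leave_last_out_split(toy, 'user_id')
print(toy_test)
```

Each subject contributes exactly one row to the test set, so a model is scored on whether it can anticipate that single next interaction from the history left in the training set.
"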
] }, { "cell_type": "code", "execution_count": 6, "id": "1af22d1c", "metadata": { "execution": { "iopub.execute_input": "2025-05-22T12:33:48.187640Z", "iopub.status.busy": "2025-05-22T12:33:48.187082Z", "iopub.status.idle": "2025-05-22T12:33:53.424425Z", "shell.execute_reply": "2025-05-22T12:33:53.423689Z" } }, "outputs": [], "source": [ "\n", "\n", "training_data, test_data = AutoRecommender.train_test_split(data=dataset, col_types=col_types)" ] }, { "cell_type": "markdown", "id": "f45bfd0f", "metadata": {}, "source": [ "\n", "# AutoML\n", "\n", "\n", "## Create an instance of Oracle AutoMLx\n", "\n", "The Oracle AutoMLx solution provides a pipeline that automatically finds a tuned model given a prediction task and a training dataset. In particular, it allows finding a tuned model for any supervised prediction task, for example, classification or regression where the target can be binary, categorical or real-valued.\n", "\n", "In this demo we want a model that performs a recommendation task, so we create a pipeline of type `AutoRecommender`, and we configure it with default parameters. You can find the complete list of all the available parameters and their meaning in our documentation." ] }, { "cell_type": "code", "execution_count": 7, "id": "3c249aac", "metadata": { "execution": { "iopub.execute_input": "2025-05-22T12:33:53.427665Z", "iopub.status.busy": "2025-05-22T12:33:53.426586Z", "iopub.status.idle": "2025-05-22T12:33:53.430815Z", "shell.execute_reply": "2025-05-22T12:33:53.430189Z" } }, "outputs": [], "source": [ "\n", "\n", "automl_pipeline = AutoRecommender().configure()" ] }, { "cell_type": "markdown", "id": "67f55f1c", "metadata": {}, "source": [ "\n", "## Train a model using AutoMLx\n", "\n", "The training data is passed to the `fit()` function which executes the model selection and hyperparameter tuning steps." 
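, "

To make the two stages concrete, here is a toy sketch of a select-then-tune loop. Everything in it is an assumption made for illustration: `select_then_tune`, `toy_score` and the candidate dictionary are hypothetical names, the scores are made up, and the random search merely stands in for the tuning strategy that AutoMLx really uses.

```python
import random


def select_then_tune(candidates, score_fn, n_trials=20, seed=0):
    # Stage 1, model selection: score each candidate with its default
    # hyperparameters and keep the best one.
    name = max(candidates, key=lambda n: score_fn(n, candidates[n]['defaults']))
    best_params = dict(candidates[name]['defaults'])
    best_score = score_fn(name, best_params)

    # Stage 2, tuning: plain random search over integer ranges.
    # This is a stand-in, NOT the search strategy AutoMLx uses.
    rng = random.Random(seed)
    for _ in range(n_trials):
        sampled = {k: rng.randint(lo, hi) for k, (lo, hi) in candidates[name]['ranges'].items()}
        trial = {**candidates[name]['defaults'], **sampled}
        trial_score = score_fn(name, trial)
        if trial_score > best_score:
            best_params, best_score = trial, trial_score
    return name, best_params, best_score


def toy_score(name, params):
    # Made-up validation score: 'knn' peaks at k=30, 'als' is flat.
    if name == 'knn':
        return 1.0 - abs(params['k'] - 30) / 100
    return 0.5


candidates = {
    'knn': {'defaults': {'k': 10}, 'ranges': {'k': (1, 100)}},
    'als': {'defaults': {'factors': 16}, 'ranges': {'factors': (8, 64)}},
}
best_name, best_params, best_score = select_then_tune(candidates, toy_score)
print(best_name, best_params, best_score)
```

Only the winner of the first stage is tuned here, which mirrors the default run below where a single algorithm is selected before tuning starts.
"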
] }, { "cell_type": "code", "execution_count": 8, "id": "3addf298", "metadata": { "execution": { "iopub.execute_input": "2025-05-22T12:33:53.432821Z", "iopub.status.busy": "2025-05-22T12:33:53.432369Z", "iopub.status.idle": "2025-05-22T12:35:03.218983Z", "shell.execute_reply": "2025-05-22T12:35:03.218376Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:33:53,772] [automlx.interface] Dataset shape: (49055,3)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:33:53,843] [automlx.process] Running Model Generation\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:33:53,883] [automlx.process] Model Generation completed.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:33:53,928] [automlx.model_selection] Running Model Selection\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:34:44,101] [automlx.model_selection] Model Selection completed - Took 50.173 sec - Selected models: [['ItemKNNRecommender']]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:34:44,143] [automlx.trials] Running Model Tuning for ['ItemKNNRecommender']\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:01,774] [automlx.trials] Best parameters for ItemKNNRecommender: {'n_recommendations': 10, 'num_of_neighbors': 506, 'bias': 0.0001, 'hist_len': 10, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:01,777] [automlx.trials] Model Tuning completed. 
Took: 17.634 secs\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:02,270] [automlx.interface] Re-fitting pipeline\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:02,277] [automlx.final_fit] Skipping updating parameter seed, already fixed by FinalFit_29765b3f-3\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:03,130] [automlx.interface] AutoMLx completed.\n" ] } ], "source": [ "\n", "\n", "automl_pipeline = automl_pipeline.fit(data=training_data, col_types=col_types)" ] }, { "cell_type": "markdown", "id": "0bedd242", "metadata": {}, "source": [ "\n", "## Generate recommendations\n", "\n", "Once the AutoML pipeline is completed, we predict 5 recommendations for a random user in the dataset." ] }, { "cell_type": "code", "execution_count": 9, "id": "dadc63df", "metadata": { "execution": { "iopub.execute_input": "2025-05-22T12:35:03.221248Z", "iopub.status.busy": "2025-05-22T12:35:03.220741Z", "iopub.status.idle": "2025-05-22T12:35:03.280214Z", "shell.execute_reply": "2025-05-22T12:35:03.279719Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
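<table>
<thead><tr><th></th><th>user_id</th><th>movie_id</th><th>score</th></tr></thead>
<tbody>
<tr><th>0</th><td>550</td><td>121</td><td>28.218944</td></tr>
<tr><th>1</th><td>550</td><td>7</td><td>25.928565</td></tr>
<tr><th>2</th><td>550</td><td>25</td><td>25.786326</td></tr>
<tr><th>3</th><td>550</td><td>117</td><td>25.769397</td></tr>
<tr><th>4</th><td>550</td><td>742</td><td>24.947235</td></tr>
</tbody>
</table>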
\n", "
" ], "text/plain": [ " user_id movie_id score\n", "0 550 121 28.218944\n", "1 550 7 25.928565\n", "2 550 25 25.786326\n", "3 550 117 25.769397\n", "4 550 742 24.947235" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "\n", "recommendation_subjects = test_data.sample(1)[['user_id']]\n", "automl_pipeline.predict(subjects=recommendation_subjects, n_recommendations=5)" ] }, { "cell_type": "markdown", "id": "a57e8e74", "metadata": {}, "source": [ "\n", "## Analyze the AutoMLx optimization process\n", "\n", "During the Oracle AutoMLx process for recommendation, a summary of the optimization process is logged, containing:\n", "- Information about the training data.\n", "- Information about the AutoMLx Pipeline, such as:\n", " - Selected algorithm that was the best choice for this data;\n", " - Selected hyperparameters for the selected algorithm.\n", "\n", "AutoMLx provides a `print_summary` API to output all the different trials performed." ] }, { "cell_type": "code", "execution_count": 10, "id": "1e3db309", "metadata": { "execution": { "iopub.execute_input": "2025-05-22T12:35:03.281812Z", "iopub.status.busy": "2025-05-22T12:35:03.281626Z", "iopub.status.idle": "2025-05-22T12:35:03.294289Z", "shell.execute_reply": "2025-05-22T12:35:03.293794Z" } }, "outputs": [ { "data": { "text/html": [ "
General Summary
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
(48114, 4)
(941, 4)
ManualSplit(Shuffle=False, Seed=7)
SanerecMetric
ItemKNNRecommender
{'n_recommendations': 10, 'num_of_neighbors': 506, 'bias': 0.0001, 'hist_len': 10, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}
24.4.1
3.9.21 (main, Dec 11 2024, 16:24:11) \\n[GCC 11.2.0]
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Trials Summary
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table>
<thead><tr><th>Step</th><th># Samples</th><th># Features</th><th>Algorithm</th><th>Hyperparameters</th><th>Score (SanerecMetric)</th><th>Runtime (Seconds)</th><th>Memory Usage (GB)</th><th>Finished</th></tr></thead>
<tbody>
<tr><td>Model Selection</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 100, 'bias': 25, 'hist_len': 20, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0882</td><td>1.0751</td><td>1.2225</td><td>Thu May 22 05:34:43 2025</td></tr>
<tr><td>Model Selection</td><td>48114</td><td>2</td><td>AlsRecommender</td><td>{'n_recommendations': 10, 'iterations': 10, 'factors': 16, 'regularization': 0.01, 'cache_users_states': True}</td><td>0.0765</td><td>4.7178</td><td>0.6711</td><td>Thu May 22 05:34:05 2025</td></tr>
<tr><td>Model Selection</td><td>48114</td><td>2</td><td>BprRecommender</td><td>{'n_recommendations': 10, 'iterations': 10, 'factors': 16, 'regularization': 0.01, 'cache_users_states': True}</td><td>0.0436</td><td>0.4336</td><td>1.2218</td><td>Thu May 22 05:34:42 2025</td></tr>
<tr><td>Model Selection</td><td>48114</td><td>2</td><td>TRexxRecommender</td><td>{'n_recommendations': 10, 'embedding_dim': 32, 'sequence_length': 5, 'num_sampled': 100, 'dropout_rate': 0.2, 'num_blocks': 2, 'num_head': 4, 'l2_reg_embedding': 1e-06, 'dnn_activation': 'tanh', 'optimizer_name': 'lazyadam', 'optimizer_learning_rate': 0.01, 'future_blinding': False, 'embeddings_on_cpu': False, 'cache_users_states': False, 'negative_sampling_method': CandidateSamplingMethod.UNIFORM_CANDIDATE_SAMPLING, 'epochs': 10, 'batch_size': 512, 'verbose': 1, 'augment_data': True, 'early_stopping_patience': -1}</td><td>0.0393</td><td>36.8191</td><td>1.2217</td><td>Thu May 22 05:34:42 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 505, 'bias': 0.0001, 'hist_len': 10, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0999</td><td>1.2556</td><td>0.6840</td><td>Thu May 22 05:34:58 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 506, 'bias': 0.0001, 'hist_len': 10, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0999</td><td>1.1297</td><td>0.6867</td><td>Thu May 22 05:35:01 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 506, 'bias': 0.0001, 'hist_len': 10, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0999</td><td>1.3568</td><td>0.6848</td><td>Thu May 22 05:34:58 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 28.25660795027468, 'hist_len': 10, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0956</td><td>1.3845</td><td>0.6717</td><td>Thu May 22 05:34:57 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 28.26160794927468, 'hist_len': 10, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0956</td><td>1.1049</td><td>0.6777</td><td>Thu May 22 05:34:58 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 28.26160794927468, 'hist_len': 10, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0956</td><td>1.4664</td><td>0.6764</td><td>Thu May 22 05:34:57 2025</td></tr>
<tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 0.0001, 'hist_len': 20, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.084</td><td>1.3718</td><td>0.6718</td><td>Thu May 22 05:34:57 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 0.0001, 'hist_len': 21, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.084</td><td>1.5254</td><td>0.6763</td><td>Thu May 22 05:34:57 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 505, 'bias': 25, 'hist_len': 20, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0829</td><td>1.3930</td><td>1.2117</td><td>Thu May 22 05:34:49 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 506, 'bias': 25, 'hist_len': 20, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0829</td><td>1.3707</td><td>1.2117</td><td>Thu May 22 05:34:50 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 752, 'bias': 25, 'hist_len': 20, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0818</td><td>1.2519</td><td>1.2170</td><td>Thu May 22 05:34:52 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 753, 'bias': 25, 'hist_len': 20, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0818</td><td>1.2913</td><td>1.2170</td><td>Thu May 22 05:34:53 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 0.0001, 'hist_len': 132, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0797</td><td>1.4683</td><td>0.6743</td><td>Thu May 22 05:34:57 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 0.0001, 'hist_len': 133, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0797</td><td>1.3012</td><td>0.6728</td><td>Thu May 22 05:34:57 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 0.0001, 'hist_len': 255, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0797</td><td>1.3523</td><td>0.6761</td><td>Thu May 22 05:34:57 2025</td></tr>
<tr><td>Model Tuning</td><td>48114</td><td>2</td><td>ItemKNNRecommender</td><td>{'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 0.0001, 'hist_len': 256, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}</td><td>0.0797</td><td>1.4640</td><td>0.6722</td><td>Thu May 22 05:34:57 2025</td></tr>
</tbody>
</table>
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "\n", "automl_pipeline.print_summary()" ] }, { "cell_type": "markdown", "id": "fbef7aa7", "metadata": {}, "source": [ "We also provide the capability to visualize the results of each stage of the AutoMLx pipeline.\n", "\n", "\n", "### Algorithm Selection\n", "\n", "The plot below shows the scores predicted by Algorithm Selection for each algorithm. The horizontal line shows the average score across all algorithms. Algorithms below the line are colored turquoise, whereas those with a score higher than the mean are colored teal. The selected algorithm is in orange." ] }, { "cell_type": "code", "execution_count": 11, "id": "01d0e2ba", "metadata": { "execution": { "iopub.execute_input": "2025-05-22T12:35:03.296171Z", "iopub.status.busy": "2025-05-22T12:35:03.295673Z", "iopub.status.idle": "2025-05-22T12:35:03.489821Z", "shell.execute_reply": "2025-05-22T12:35:03.489286Z" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "\n", "def plot_model_selection_scores(_pipeline):\n", " # Each trial is a row in a dataframe that contains\n", " # Algorithm, Number of Samples, Number of Features, Hyperparameters, Score, Runtime, Memory Usage, Step as features\n", " trials = _pipeline.completed_trials_summary_[\n", " _pipeline.completed_trials_summary_[\"Step\"].str.contains(\"Model Selection\")\n", " ]\n", " name_of_score_column = f\"Score ({_pipeline._inferred_score_metric[0].name})\"\n", " trials.replace([np.inf, -np.inf], np.nan, inplace=True)\n", " trials.dropna(subset=[name_of_score_column], inplace=True)\n", " scores = trials[name_of_score_column].tolist()\n", " models = trials[\"Algorithm\"].tolist()\n", " \n", " y_margin = 0.10 * (max(scores) - min(scores))\n", " s = pd.Series(scores, index=models).sort_values(ascending=False)\n", " \n", " colors = []\n", " for f in s.keys():\n", " if f.strip() == _pipeline.selected_model_.strip():\n", " colors.append(\"orange\")\n", " elif s[f] >= s.mean():\n", " colors.append(\"teal\")\n", " else:\n", " colors.append(\"turquoise\")\n", " \n", " fig, ax = plt.subplots(1)\n", " ax.set_title(\"Algorithm Selection Trials\")\n", " ax.set_ylim(min(scores) - y_margin, max(scores) + y_margin)\n", " ax.set_ylabel(\"Hit Rate\")\n", " s.plot.bar(ax=ax, color=colors, edgecolor=\"black\")\n", " ax.axhline(y=s.mean(), color=\"black\", linewidth=0.5)\n", " plt.show()\n", "\n", "plot_model_selection_scores(automl_pipeline)" ] }, { "cell_type": "markdown", "id": "9c8a8dce", "metadata": {}, "source": [ "\n", "### Hyperparameter Tuning\n", "\n", "Hyperparameter Tuning is the last stage of the AutoMLx pipeline, and focuses on improving the chosen algorithm's score on the dataset. We use a novel iterative algorithm to search across many hyperparameter dimensions, and converge automatically when optimal hyperparameters are identified. 
Each trial represents a particular hyperparameter configuration for the selected model.\n" ] }, { "cell_type": "code", "execution_count": 12, "id": "25e765b8", "metadata": { "execution": { "iopub.execute_input": "2025-05-22T12:35:03.491793Z", "iopub.status.busy": "2025-05-22T12:35:03.491324Z", "iopub.status.idle": "2025-05-22T12:35:03.684496Z", "shell.execute_reply": "2025-05-22T12:35:03.683929Z" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "\n", "def plot_hp_tuning_scores(_pipeline):\n", "    # Each trial is a row in a dataframe that contains\n", "    # Algorithm, Number of Samples, Number of Features, Hyperparameters, Score, Runtime, Memory Usage, Step as features\n", "    # Work on a copy so the in-place cleaning below does not act on a slice of the pipeline's summary\n", "    trials = _pipeline.completed_trials_summary_[\n", "        _pipeline.completed_trials_summary_[\"Step\"].str.contains(\"Model Tuning\")\n", "    ].copy()\n", "    name_of_score_column = f\"Score ({_pipeline._inferred_score_metric[0].name})\"\n", "    trials.replace([np.inf, -np.inf], np.nan, inplace=True)\n", "    trials.dropna(subset=[name_of_score_column], inplace=True)\n", "    trials.drop(trials[trials[\"Finished\"] == -1].index, inplace=True)\n", "    trials[\"Finished\"] = trials[\"Finished\"].apply(\n", "        lambda x: time.mktime(datetime.datetime.strptime(x, \"%a %b %d %H:%M:%S %Y\").timetuple())\n", "    )\n", "    trials.sort_values(by=[\"Finished\"], ascending=True, inplace=True)\n", "    scores = trials[name_of_score_column].tolist()\n", "    # Running best: the best score seen up to and including each trial\n", "    score = []\n", "    score.append(scores[0])\n", "    for i in range(1, len(scores)):\n", "        if scores[i] >= score[i - 1]:\n", "            score.append(scores[i])\n", "        else:\n", "            score.append(score[i - 1])\n", "    y_margin = 0.10 * (max(score) - min(score))\n", "    fig, ax = plt.subplots(1)\n", "    ax.set_title(\"Hyperparameter Tuning Trials\")\n", "    ax.set_xlabel(\"Iteration $n$\")\n", "    ax.set_ylabel(\"Hit Rate\")\n", "    ax.grid(color=\"g\", linestyle=\"-\", linewidth=0.1)\n", "    ax.set_ylim(min(score) - y_margin, max(score) + y_margin)\n", "    # Set the color only once; combining a \"k:\" format string with color= makes matplotlib warn\n", "    ax.plot(range(1, len(trials) + 1), score, linestyle=\":\", marker=\"s\", color=\"teal\", markersize=3)\n", "    plt.show()\n", "\n", "plot_hp_tuning_scores(automl_pipeline)" ] }, { "cell_type": "markdown", "id": "fbfbbc5c", "metadata": {}, "source": [ "\n", "## Advanced AutoMLx Configuration\n", "\n", "You can also configure the AutoRecommender pipeline with suitable parameters according to your needs." 
] }, { "cell_type": "code", "execution_count": 13, "id": "18ea54f1", "metadata": { "execution": { "iopub.execute_input": "2025-05-22T12:35:03.686682Z", "iopub.status.busy": "2025-05-22T12:35:03.686117Z", "iopub.status.idle": "2025-05-22T12:35:03.689956Z", "shell.execute_reply": "2025-05-22T12:35:03.689457Z" } }, "outputs": [], "source": [ "\n", "\n", "custom_pipeline = AutoRecommender().configure(\n", "    model_list=[  # Specify the models you want AutoMLx to consider\n", "        \"ItemKNNRecommender\",\n", "        \"AlsRecommender\",\n", "        \"BprRecommender\",\n", "    ],\n", "    n_algos_tuned=2,  # Choose how many models to tune\n", "    search_space={  # You can specify the hyperparameters and ranges to search for each model\n", "        \"ItemKNNRecommender\": {\"num_of_neighbors\": {\"range\": [10, 30], \"type\": \"continuous\"}}\n", "    },\n", "    max_tuning_trials=20,  # The maximum number of tuning trials. Can be an integer or a dict (maximum number for each model)\n", "    score_metric=\"recall\",  # Any of the available metrics; see the documentation for the list of supported values\n", ")" ] }, { "cell_type": "markdown", "id": "63367d59", "metadata": {}, "source": [ "\n", "## Use a custom validation set\n", "\n", "You can specify a custom validation set that you want AutoMLx to use to evaluate the quality of models and configurations." ] }, { "cell_type": "code", "execution_count": 14, "id": "f6adc660", "metadata": { "execution": { "iopub.execute_input": "2025-05-22T12:35:03.691824Z", "iopub.status.busy": "2025-05-22T12:35:03.691399Z", "iopub.status.idle": "2025-05-22T12:35:03.860679Z", "shell.execute_reply": "2025-05-22T12:35:03.860061Z" } }, "outputs": [], "source": [ "\n", "\n", "training_data, validation_data = AutoRecommender.train_test_split(data=training_data, col_types=col_types)\n", "\n", "\n", "# We run the AutoML pipeline again with the custom training/validation split we just created, and with some advanced settings that can be specified directly in the fit method." 
] }, { "cell_type": "code", "execution_count": 15, "id": "211aa25d", "metadata": { "execution": { "iopub.execute_input": "2025-05-22T12:35:03.862935Z", "iopub.status.busy": "2025-05-22T12:35:03.862401Z", "iopub.status.idle": "2025-05-22T12:35:14.199358Z", "shell.execute_reply": "2025-05-22T12:35:14.198664Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:04,023] [automlx.interface] Dataset shape: (49055,3)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:04,090] [automlx.process] Running Model Generation\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:04,135] [automlx.process] Model Generation completed.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:04,166] [automlx.model_selection] Running Model Selection\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:05,776] [automlx.model_selection] Model Selection completed - Took 1.610 sec - Selected models: [['ItemKNNRecommender', 'AlsRecommender']]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:05,845] [automlx.trials] Running Model Tuning for ['ItemKNNRecommender']\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:09,216] [automlx.trials] Best parameters for ItemKNNRecommender: {'n_recommendations': 10, 'num_of_neighbors': 10, 'bias': 0.010099998000000002, 'hist_len': 10, 'reciprocal_ranking': False, 'normalize_scores': False, 'cache_users_states': True}\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:09,218] [automlx.trials] Model Tuning completed. 
Took: 3.374 secs\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:09,391] [automlx.trials] Running Model Tuning for ['AlsRecommender']\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:12,002] [automlx.trials] Best parameters for AlsRecommender: {'n_recommendations': 10, 'iterations': 10, 'factors': 16, 'regularization': 0.01, 'cache_users_states': True}\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:12,004] [automlx.trials] Model Tuning completed. Took: 2.614 secs\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:12,464] [automlx.interface] Re-fitting pipeline\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:12,475] [automlx.final_fit] Skipping updating parameter seed, already fixed by FinalFit_836245be-e\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2025-05-22 05:35:14,026] [automlx.interface] AutoMLx completed.\n" ] } ], "source": [ "\n", "\n", "custom_pipeline = custom_pipeline.fit(\n", " training_data,\n", " col_types,\n", " validation_data,\n", " time_budget=20, # Specify time budget in seconds\n", ")" ] }, { "cell_type": "markdown", "id": "592b6260", "metadata": {}, "source": [ "Now that the custom AutoML pipeline is completed, we can generate recommendations.\n", "Note that the pipeline's `recommend` method is equivalent to `predict`." ] }, { "cell_type": "code", "execution_count": 16, "id": "2717cd7f", "metadata": { "execution": { "iopub.execute_input": "2025-05-22T12:35:14.202111Z", "iopub.status.busy": "2025-05-22T12:35:14.201551Z", "iopub.status.idle": "2025-05-22T12:35:14.261614Z", "shell.execute_reply": "2025-05-22T12:35:14.261091Z" } }, "outputs": [ { "data": { "text/html": [ "
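<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>user_id</th>\n", "      <th>movie_id</th>\n", "      <th>score</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>550</td>\n", "      <td>121</td>\n", "      <td>27.870747</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>550</td>\n", "      <td>25</td>\n", "      <td>25.221432</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>550</td>\n", "      <td>100</td>\n", "      <td>23.984905</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>550</td>\n", "      <td>7</td>\n", "      <td>23.856528</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>550</td>\n", "      <td>118</td>\n", "      <td>22.179071</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>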
" ], "text/plain": [ " user_id movie_id score\n", "0 550 121 27.870747\n", "1 550 25 25.221432\n", "2 550 100 23.984905\n", "3 550 7 23.856528\n", "4 550 118 22.179071" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "\n", "custom_pipeline.recommend(subjects=recommendation_subjects, n_recommendations=5)" ] }, { "cell_type": "markdown", "id": "050513f5", "metadata": {}, "source": [ "\n", "## Final evaluation of the best model\n", "\n", "Finally, we evaluate the best model found on the test data we have. If no metric is specified, the pipeline computes the score using the same metric used to run the Hyperparameter Tuning, which in this case is the Recall, as we defined at pipeline creation.\n", "\n", "In this example, instead, we ask the pipeline to perform the evaluation using Normalized Discounted Cumulative Gain (NDCG), a common ranking metric. Our online documentation provides the list of the available metrics and how they are computed." ] }, { "cell_type": "code", "execution_count": 17, "id": "82b47a8e", "metadata": { "execution": { "iopub.execute_input": "2025-05-22T12:35:14.263708Z", "iopub.status.busy": "2025-05-22T12:35:14.263214Z", "iopub.status.idle": "2025-05-22T12:35:14.842348Z", "shell.execute_reply": "2025-05-22T12:35:14.841681Z" }, "lines_to_next_cell": 2 }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "718f5b84003740a3bde9ac2ef7aa878c", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/939 [00:00