Chapter 3: Understand the Machine Learning Model

This chapter contains information to help you better understand a machine learning model. It is organized as follows.

Creating a Model
Interpreting the Training Results
Interpreting the Prediction Results
Understanding the Event Model

Creating a Model

Let's review information you should know when creating a model.

What’s the necessary amount of shipment history for training a model?

It depends. It is generally recommended to use at least one year's worth of shipment history, especially if there are seasonal patterns that you would like the model to capture. However, some customers have seen decent results with just several months of data. You should run a few tests to find the optimal setting.

How do I tune a model?

There are a number of ways to improve model accuracy.

Tune logic config parameters, such as “OUTLIER SCALING FACTOR” and “PERCENTAGE OF SHIPMENTS IN THE TRAINING DATA SET”.
Adjust shipment history. Make sure the shipment saved query provides sufficient historical data. If the shipment count per lane or per lane geohash is lower than the threshold specified in the relevant logic config parameters, the pipeline skips those lanes and geohashes. You should view the input data distribution in analytics to validate your selection.
Subset the historical data to create scenarios. Each scenario trains its own model. A well-defined scenario typically produces more tailored predictions. However, if a scenario is defined too narrowly, it may not have enough qualified training data to build a model on. You should test out a few scenario settings.
Set up outlier rules based on your business knowledge. A more stringent outlier rule improves model accuracy; however, such a model may not generalize well when you need to predict for an outlier.
Configure included/excluded columns. If some input data columns are empty or contain only a single value, they should not matter to the model and are safe to remove. Correlation between columns is also an important consideration.
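
The two data checks mentioned above (lanes with too few shipments, and columns that are empty or single-valued) can be sketched in a few lines of Python. This is only an illustration, not the pipeline's own logic: the function names, the field names ("lane", "mode", "carrier", "notes"), and the min_count threshold are all hypothetical stand-ins for your own data and logic config values.

```python
from collections import Counter

def lanes_below_threshold(shipments, min_count):
    """Return lanes whose shipment count is below min_count, sorted by name."""
    counts = Counter(s["lane"] for s in shipments)
    return sorted(lane for lane, n in counts.items() if n < min_count)

def uninformative_columns(rows):
    """Return columns that are empty or hold only one distinct non-empty value."""
    columns = sorted({key for row in rows for key in row})
    flagged = []
    for col in columns:
        # Treat None and empty strings as "no value".
        values = {row.get(col) for row in rows} - {None, ""}
        if len(values) <= 1:
            flagged.append(col)
    return flagged

# Hypothetical sample of shipment history rows.
shipments = [
    {"lane": "CHI-DAL", "mode": "TL", "carrier": "A", "notes": ""},
    {"lane": "CHI-DAL", "mode": "TL", "carrier": "B", "notes": ""},
    {"lane": "CHI-DAL", "mode": "TL", "carrier": "A", "notes": ""},
    {"lane": "NYC-BOS", "mode": "TL", "carrier": "B", "notes": ""},
]

print(lanes_below_threshold(shipments, min_count=3))  # ['NYC-BOS']
print(uninformative_columns(shipments))               # ['mode', 'notes']
```

A check like this is a quick way to see which lanes are likely to be skipped and which columns are candidates for exclusion before you start a training run.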

Interpreting the Training Results

Here is some information that will help you interpret the training results.

What’s the definition of “Model Accuracy”? How do I interpret it?

“Model Accuracy” is defined as “1 - Mean Absolute Percent Error (MAPE)”, or simply “1 - error rate”.

How is “Error” defined?

Here, error is defined as the absolute deviation of the predicted transit time from the actual transit time. For example, if the actual transit time is 2 days and the predicted transit time is either 1.5 days or 2.5 days, the error is 0.5 days in both cases and the percentage error is 25%.

How is “Mean Absolute Percent Error (MAPE)” defined?

First, the individual percentage error is calculated by dividing the absolute prediction error by the actual transit time.

Then an average is taken of all individual percentage errors to derive the “Mean Absolute Percent Error (MAPE)”.
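
The definitions above can be written out as a short Python sketch. The function names are illustrative only and are not part of the product; the sketch assumes every actual transit time is greater than zero.

```python
def mape(actual, predicted):
    """Mean Absolute Percent Error, returned as a fraction (0.25 means 25%)."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    # Individual percentage error: absolute prediction error / actual transit time.
    pct_errors = [abs(p - a) / a for a, p in zip(actual, predicted)]
    # MAPE: the average of the individual percentage errors.
    return sum(pct_errors) / len(pct_errors)

def model_accuracy(actual, predicted):
    """Model Accuracy = 1 - MAPE."""
    return 1.0 - mape(actual, predicted)

# Two shipments with an actual transit time of 2 days,
# predicted as 1.5 days and 2.5 days: each is off by 0.5 days (25%).
actual_days = [2.0, 2.0]
predicted_days = [1.5, 2.5]

print(mape(actual_days, predicted_days))            # 0.25
print(model_accuracy(actual_days, predicted_days))  # 0.75
```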

Which dataset is the accuracy evaluated on?

The model is trained on a subset of the input data, called the “training set”. It is then evaluated on a held-out “test set” that the model has never seen before. The model’s prediction accuracy on the test set is reported as the “Model Accuracy” of the model.

Often, the larger the dataset, the tighter the range between the “Confidence Low Value” and “Confidence High Value”.
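
To make the training set / test set idea concrete, here is a minimal sketch of a random hold-out split in Python. It is a generic illustration, not a description of how the pipeline actually splits your data: the function name, the train_fraction parameter, and the seed are assumptions made for the example.

```python
import random

def split_train_test(shipments, train_fraction=0.8, seed=0):
    """Shuffle the shipments and split them into (training set, test set)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(shipments)             # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Ten hypothetical shipment IDs, with 80% used for training.
train_set, test_set = split_train_test(list(range(10)), train_fraction=0.8)
print(len(train_set), len(test_set))  # 8 2
```

The model would be fitted on train_set only, and its accuracy would then be measured on test_set, which it has not seen during training.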

Interpreting the Prediction Results

This section contains information that will help you interpret the prediction results.

What does the “Model Accuracy” of the prediction mean?

For the Planned ETA Model, the prediction “Model Accuracy” is the lane-level accuracy of the model.

For the Event-based ETA Model, the prediction “Model Accuracy” grid displays the model-level, lane-level, and lane geohash-level accuracy.

How do I interpret the “Predictions” as well as “Prediction Low Value” and “Prediction High Value”?

The “Predictions” value is the model-predicted mean transit time for the shipment.

Additionally, the “Prediction Low Value” and “Prediction High Value” tell you that the transit time should fall within that range with 95% likelihood.

Understanding the Event Model

This section contains information that will help you understand the event model.

What is a geohash?

A geohash is a public-domain geocode system that encodes a geographic location into a short string of letters and digits.
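
If you are curious how the encoding works, the following Python sketch implements the standard public geohash algorithm: it repeatedly halves the longitude and latitude ranges, records one bit per halving, and maps every five bits to one base-32 character. The function name and the default precision are choices made for this example, not product settings.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=5):
    """Encode a latitude/longitude pair as a geohash of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)

print(geohash_encode(42.6, -5.6, precision=5))  # ezs42
```

A shorter geohash is always a prefix of a longer one for the same point, so truncating the string gives a larger cell that contains the original location.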

How is geohash used in prediction?

Tracking events are grouped by proximity and assigned geohash codes. The goal is to reduce the large volume of events while preserving the patterns of the route, and to extract statistics at the geohash level that provide additional features to the model.
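
As a rough illustration of this grouping, the Python sketch below buckets tracking events by a geohash prefix and computes simple per-cell statistics. The event fields ("geohash", "hours_since_pickup"), the prefix length, and the statistics chosen are all assumptions for the example; the statistics the model actually uses are not listed here.

```python
from collections import defaultdict
from statistics import mean

def summarize_by_geohash(events, prefix_length=5):
    """Group events by geohash prefix and compute per-cell statistics."""
    groups = defaultdict(list)
    for event in events:
        cell = event["geohash"][:prefix_length]  # nearby events share a prefix
        groups[cell].append(event["hours_since_pickup"])
    return {
        cell: {"event_count": len(hours), "mean_hours_since_pickup": mean(hours)}
        for cell, hours in groups.items()
    }

# Three hypothetical tracking events; the first two fall in the same cell.
events = [
    {"geohash": "ezs42x", "hours_since_pickup": 1.0},
    {"geohash": "ezs42y", "hours_since_pickup": 3.0},
    {"geohash": "ezs48b", "hours_since_pickup": 10.0},
]

summary = summarize_by_geohash(events)
print(summary["ezs42"])  # {'event_count': 2, 'mean_hours_since_pickup': 2.0}
```

Three raw events are reduced to two geohash cells here, which is the same downsizing idea applied to a much larger event volume.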