Configuring Machine Learning: Transit Time Prediction

This chapter contains information to help you better understand a machine learning model and is organized as follows.

Creating a Model
Interpreting the Training Results
Interpreting the Prediction Results
Understanding the Event Model

Creating a Planned Model and Event Based Model

Let's review information you should know when creating a model.

What’s the necessary amount of shipment history for training a model?

It depends, but the minimum is several months of data. It is generally recommended to use at least one year’s worth of shipment history, especially if there are seasonal patterns that you would like the model to capture. However, some customers have seen decent results with just several months of data. You should run a few tests to find the optimal setting.

How do I tune a model?

There are a number of ways to improve model accuracy. On each shipment, ensure that you have the following:

The actual departure at the first stop/source location and actual arrival at the final stop/destination location (SOURCE_ACTUAL_DEPARTURE / DEST_ACTUAL_ARRIVAL of the W_ML_SHIPMENT_F table)
Planned departure at the first stop and planned arrival at the final stop (SOURCE_PLANNED_DEPARTURE / DEST_PLANNED_ARRIVAL of the W_ML_SHIPMENT_F table)
Lane identifiers, as determined by the “SOURCE/DESTINATION LANE HIERARCHY” logic parameter. By default the hierarchy is city, so SOURCE_CITY / DEST_CITY are required. If users change it to location, SOURCE_LOCATION_GID / DEST_LOCATION_GID are required instead. These values are used to identify the lane.
Enough shipments per lane. The “MINIMUM NUMBER OF SHIPMENTS PER LANE” logic parameter sets how many shipments a lane needs in order to be considered in the training model; the default minimum is 3.
Predictions can only be made for lanes that were seen in training.
Latitude and longitude at source/destination location (SOURCE_LAT / SOURCE_LON / DEST_LAT / DEST_LON in W_ML_SHIPMENT_F).
Latitude and longitude of events (LATITUDE / LONGITUDE in W_ML_EVENT_DETAIL_F) and event date (EVENTDATE in W_ML_EVENT_F) are required.
Enough shipments per geohash. The “MINIMUM NUMBER OF SHIPMENTS PER GEOHASH” logic parameter sets how many shipments in the same geographic area are needed for the geohash to be considered in the training model; the default minimum is 3.
Tune logic config parameters, such as “OUTLIER SCALING FACTOR” and “PERCENTAGE OF SHIPMENTS IN THE TRAINING DATA SET”.
Adjust shipment history. Make sure the shipment saved query provides sufficient historical data. If the shipment count per lane and/or per lane geohash is lower than the threshold specified on the relevant logic config parameters, the pipeline will skip these lanes and geohashes. You should view the input data distribution in analytics to validate your selection.
Subset the historical data to create scenarios. Each scenario trains its own model. A well-defined scenario typically produces more tailored predictions. However, if a scenario is defined too narrowly, it may not have enough qualified training data to build a model on. You should test out a few scenario settings.
Set up outlier rules based on your business knowledge. A more stringent outlier rule improves model accuracy; however, such a model may not generalize well when you need to predict for an outlier.
Configure included/excluded columns. If some input data columns are empty or contain only a single value, they should not matter to the model and are safe to remove. Correlation between columns is also an important consideration.
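
The lane threshold described above can be pictured with a short sketch. This is not OTM's implementation. It assumes shipments are plain dictionaries keyed by the SOURCE_CITY and DEST_CITY columns, and the function name and the min_shipments argument are hypothetical stand-ins for the “MINIMUM NUMBER OF SHIPMENTS PER LANE” logic parameter.

```python
from collections import Counter


def lanes_meeting_threshold(shipments, min_shipments=3):
    """Return the (source, destination) lanes that have enough shipments.

    Hypothetical helper: it mirrors the idea of the
    "MINIMUM NUMBER OF SHIPMENTS PER LANE" parameter, not OTM's actual code.
    """
    counts = Counter((s["SOURCE_CITY"], s["DEST_CITY"]) for s in shipments)
    return {lane for lane, n in counts.items() if n >= min_shipments}


# Illustrative history: three shipments on one lane, two on another.
history = (
    [{"SOURCE_CITY": "CHICAGO", "DEST_CITY": "DALLAS"}] * 3
    + [{"SOURCE_CITY": "CHICAGO", "DEST_CITY": "DENVER"}] * 2
)

# With the default threshold of 3, only the first lane would be kept.
trainable_lanes = lanes_meeting_threshold(history)
```

Running a count like this against your own shipment history before training may help show which lanes would be skipped.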

Events have to be valid:
- The following events are enabled for ML purposes: EST_DELIVERY, DELIVERY APPOINTMENT TIME, ENROUTE LOCATION UPDATE ACTUAL, PICKUP, CARGO_PROBLEM, DEPARTED, EST_ARRIVAL, ENTERED_CUSTOMS, EST_DEPARTURE, ACTUAL_PICKUP, DELIVERED, DELAYED, READY_FOR_DELIVERY, DELIVERY_PROBLEM, ARRIVED, PICKUP APPOINTMENT TIME. Users can also create a custom Event Group and add the attribute “IS_LML_EVENT=Y”. When loading data to analytics, OTM looks for this flag and loads the event only if it is an LML event.
- The event date must fall between the source actual departure date and the destination actual arrival date.
- The event's speed must be under the maximum speed of the corresponding mode type, which can be specified in the logic parameters.
Shipments have to be valid:
- The shipment must have at least 3 valid events.
- The “MAXIMUM EVENTS FILTER RATIO” logic parameter determines whether OTM discards the shipment based on the ratio of its valid events.
- The duration of a shipment must be between 1 hour and 20 days for the TL mode type.
- The distance between source and destination location is at least 1 km.
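
Some of the validity rules above can be sketched as a simple filter. This is an illustration under stated assumptions, not OTM's logic: the function names are hypothetical, events are dictionaries with an EVENTDATE key, the range checks are treated as inclusive, and the speed check and the “MAXIMUM EVENTS FILTER RATIO” rule are left out because their exact calculation is not described here.

```python
from datetime import datetime, timedelta


def valid_events(events, departure, arrival):
    """Keep events dated between actual departure and actual arrival."""
    return [e for e in events if departure <= e["EVENTDATE"] <= arrival]


def shipment_is_valid(events, departure, arrival, distance_km,
                      min_events=3,
                      min_duration=timedelta(hours=1),
                      max_duration=timedelta(days=20),
                      min_distance_km=1.0):
    """Apply the event-count, duration and distance rules (TL defaults)."""
    kept = valid_events(events, departure, arrival)
    duration = arrival - departure
    return (len(kept) >= min_events
            and min_duration <= duration <= max_duration
            and distance_km >= min_distance_km)


# Illustrative shipment: a two-day transit with four tracking events.
departure = datetime(2024, 1, 1, 8, 0)
arrival = datetime(2024, 1, 3, 8, 0)
events = [
    {"EVENTDATE": datetime(2024, 1, 1, 12, 0)},
    {"EVENTDATE": datetime(2024, 1, 2, 12, 0)},
    {"EVENTDATE": datetime(2024, 1, 3, 6, 0)},
    {"EVENTDATE": datetime(2024, 1, 4, 6, 0)},  # after arrival, so dropped
]
ok = shipment_is_valid(events, departure, arrival, distance_km=1500.0)
```

Here three of the four events fall inside the transit window, so the shipment passes the event-count rule.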

Interpreting the Training Results

Here is some information that will help you interpret the training results.

What’s the definition of “Model Accuracy”? How do I interpret it?

“Model Accuracy” is defined as “1 - Mean Absolute Percent Error (MAPE)”, or simply “1 - error rate”.

How is “Error” defined?

Error is defined as the absolute deviation of the predicted transit time from the actual transit time. For example, if the actual transit time is 2 days and the predicted transit time is either 1.5 days or 2.5 days, the error is 0.5 days in both cases and the percentage error is 25% (0.5 / 2).

How is “Mean Absolute Percent Error (MAPE)” defined?

First, the individual percentage error is calculated by dividing the absolute prediction error by actual transit time.

Then an average is taken of all individual percentage errors to derive the “Mean Absolute Percent Error (MAPE)”.
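
The two steps above can be written out as a small calculation. This is a minimal sketch of the definitions given in this section; the function names and the sample transit times are illustrative only.

```python
def mean_absolute_percent_error(actual, predicted):
    """Average of |predicted - actual| / actual over all shipments."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    errors = [abs(p - a) / a for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


def model_accuracy(actual, predicted):
    """Model Accuracy = 1 - MAPE."""
    return 1.0 - mean_absolute_percent_error(actual, predicted)


# Transit times in days: two predictions are off by 0.5 days, one is exact.
actual_days = [2.0, 2.0, 4.0]
predicted_days = [1.5, 2.5, 4.0]

mape = mean_absolute_percent_error(actual_days, predicted_days)
accuracy = model_accuracy(actual_days, predicted_days)
```

The individual percentage errors are 25%, 25% and 0%, so the MAPE is about 16.7% and the Model Accuracy is about 0.83.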

Which dataset is the accuracy evaluated on?

The model is trained on a subset of the input data, called the “training set”. It is then evaluated on a held-out “test set” that the model has never seen before. The model’s prediction accuracy on the test set is reported as the “Model Accuracy” of the model.
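
A held-out split of this kind can be sketched as follows. How OTM actually selects the two sets is not described here, so the random shuffle, the fixed seed and the 0.8 fraction are assumptions; train_fraction is a hypothetical stand-in for the “PERCENTAGE OF SHIPMENTS IN THE TRAINING DATA SET” logic parameter.

```python
import random


def split_train_test(shipments, train_fraction=0.8, seed=0):
    """Shuffle a copy of the shipments and cut it into two disjoint sets."""
    shuffled = list(shipments)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten illustrative shipment identifiers.
shipment_ids = [f"SHIPMENT_{i}" for i in range(10)]
train_set, test_set = split_train_test(shipment_ids)
```

The model would be fitted on train_set only, and the accuracy calculation would then be run on test_set.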

Often, the larger the dataset, the tighter the range between the “Confidence Low Value” and “Confidence High Value”.

Interpreting the Prediction Results

This section contains information that will help you interpret the prediction results.

What does the “Training Model Accuracy” of the prediction mean?

For the Planned ETA Model, the prediction results screen displays “Training Model Accuracy”. This is the accuracy of the trained model as calculated during training, and it is displayed as a decimal.

For the Event-based ETA Model, the prediction results screen displays metrics such as "Lane Accuracy", "Geohash Code", "Geohash Lane Accuracy", and “Training Model Accuracy”.

Lane Accuracy: The accuracy, calculated during training, for the lane of the shipment being predicted.
Geohash Code: Represents the geographic coordinates of the last event in the shipment.
Geohash Lane Accuracy: The accuracy, calculated during training, for the lane-geohash combination of the shipment being predicted.
Training Model Accuracy: Same as defined for the Planned ETA Model.

How do I interpret the “Predictions” as well as “Prediction Low Value” and “Prediction High Value”?

The “Predictions” value is the model’s predicted mean transit time for the shipment.

Additionally, the “Prediction Low Value” and “Prediction High Value” tell you that the transit time should fall within that range with a 95% likelihood.
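
One way to sanity-check this range on your side, once actual transit times are known, is to measure how often they fell inside it. The helper below is a hypothetical sketch, not an OTM feature, and the sample values are made up for illustration.

```python
def interval_coverage(actuals, lows, highs):
    """Fraction of shipments whose actual transit time fell within the range."""
    if not actuals or not (len(actuals) == len(lows) == len(highs)):
        raise ValueError("inputs must be non-empty and equal in length")
    hits = sum(1 for a, lo, hi in zip(actuals, lows, highs) if lo <= a <= hi)
    return hits / len(actuals)


# Transit times in days for four completed shipments.
actual_days = [2.0, 3.5, 1.0, 6.0]
low_values = [1.5, 3.0, 1.2, 4.0]
high_values = [2.5, 4.0, 2.0, 7.0]

coverage = interval_coverage(actual_days, low_values, high_values)
```

In this sample three of the four shipments land inside their range. Over a large number of shipments, a coverage close to 0.95 would be consistent with the stated 95% likelihood.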

Understanding the Event Model

This section contains information that will help you understand the event model.

What is a geohash?

A geohash is a public domain geocode system that encodes a geographic location into a short string of letters and digits.
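
To make the encoding concrete, here is a minimal sketch of the standard public geohash algorithm: it repeatedly halves the longitude and latitude ranges, records one bit per halving, and maps every five bits to a base-32 character. The function name and the default precision are illustrative choices, and the precision OTM uses is not stated here.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


code = geohash_encode(57.64911, 10.40744)  # "u4pruy"
```

Nearby points usually share a common prefix, and a shorter code covers a larger area.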

How is geohash used in prediction?

Tracking events are grouped by proximity and assigned geohash codes. The goal is to reduce a large volume of events while preserving the patterns of the route, and to extract statistics at the geohash level that provide additional features to the model.
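
The grouping idea can be sketched as follows. The statistics OTM actually extracts are not listed here, so the two shown (distinct shipment count and mean hours remaining to arrival) are assumptions, as are the field names geohash, shipment_id and hours_to_arrival. The min_shipments argument is a hypothetical stand-in for the “MINIMUM NUMBER OF SHIPMENTS PER GEOHASH” logic parameter.

```python
from collections import defaultdict
from statistics import mean


def geohash_statistics(events, min_shipments=3):
    """Summarize events per geohash cell, skipping sparsely visited cells."""
    shipments_by_cell = defaultdict(set)
    hours_by_cell = defaultdict(list)
    for e in events:
        shipments_by_cell[e["geohash"]].add(e["shipment_id"])
        hours_by_cell[e["geohash"]].append(e["hours_to_arrival"])
    return {
        cell: {
            "shipments": len(ids),
            "mean_hours_to_arrival": mean(hours_by_cell[cell]),
        }
        for cell, ids in shipments_by_cell.items()
        if len(ids) >= min_shipments
    }


# Illustrative events: cell "9v6k" is seen by three shipments, "9vg4" by one.
tracking_events = [
    {"geohash": "9v6k", "shipment_id": "S1", "hours_to_arrival": 10.0},
    {"geohash": "9v6k", "shipment_id": "S2", "hours_to_arrival": 12.0},
    {"geohash": "9v6k", "shipment_id": "S3", "hours_to_arrival": 14.0},
    {"geohash": "9vg4", "shipment_id": "S1", "hours_to_arrival": 3.0},
]
cell_stats = geohash_statistics(tracking_events)
```

Many raw events collapse into one row per cell, which is the kind of reduction described above.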