25 Neural Network
Learn about the Neural Network algorithm, which supports the regression and classification machine learning techniques.
25.1 About Neural Network
Neural Networks in Oracle Machine Learning are inspired by biological neural networks and are designed for complex tasks like classification and regression.
In machine learning, an artificial neural network is an algorithm inspired by biological neural networks and is used to estimate or approximate functions that depend on a large number of generally unknown inputs. An artificial neural network is composed of a large number of interconnected neurons that exchange messages with each other to solve specific problems. The neurons learn from examples: the weights of the connections among them are tuned during the learning process. The Neural Network algorithm is capable of solving a wide variety of tasks such as computer vision, speech recognition, and various complex business problems.
25.1.1 Neurons and Activation Functions
Neurons process inputs through weighted sums and activation functions like Sigmoid and Rectified Linear Units (ReLU).
A neuron takes one or more inputs, each with its own weight, and produces an output that depends on those inputs. The output is computed by multiplying each input by its weight, adding up the products, and feeding the sum into the activation function.
A Sigmoid function is the most common choice of activation function, but other non-linear functions, piecewise linear functions, or step functions are also used.
The Rectified Linear Units function (NNET_ACTIVATIONS_RELU) is a commonly used activation function that addresses the vanishing gradient problem for larger neural networks.
The following are some examples of activation functions:
- Logistic Sigmoid function
- Linear function
- Tanh function
- Arctan function
- Bipolar sigmoid function
- Rectified Linear Units
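The listed activation functions, and the way a neuron applies one to its weighted sum, can be sketched in a few lines of Python. This is only an illustration using common textbook definitions. The function names are hypothetical, the bipolar sigmoid is assumed to be the logistic sigmoid rescaled to the range (-1, 1), and none of this is taken from the Oracle implementation.

```python
import math


def logistic_sigmoid(x):
    # Numerically stable form: never exponentiates a large positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def linear(x):
    return x


def tanh(x):
    return math.tanh(x)


def arctan(x):
    return math.atan(x)


def bipolar_sigmoid(x):
    # Assumed definition: logistic sigmoid rescaled from (0, 1) to (-1, 1).
    return 2.0 * logistic_sigmoid(x) - 1.0


def relu(x):
    # Rectified Linear Unit: passes positive values, clips the rest to zero.
    return max(0.0, x)


def neuron_output(inputs, weights, activation):
    # Weighted sum of the inputs, fed into the activation function.
    weighted_sum = sum(w * v for w, v in zip(weights, inputs))
    return activation(weighted_sum)
```

For example, `neuron_output([1.0, 2.0], [0.5, -0.25], relu)` computes the weighted sum 0.5 - 0.5 and then applies ReLU to it.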
25.1.2 Loss or Cost Function
A loss function or cost function is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event.
An optimization problem seeks to minimize a loss function. The form of the loss function is chosen based on the nature of the problem and its mathematical requirements.
The following are the different loss functions for different scenarios:
- Binary classification: binary cross entropy loss function.
- Multi-class classification: multi-class cross entropy loss function.
- Regression: squared error function.
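The three loss functions can be written out as a minimal Python sketch. The function names are hypothetical, and averaging over the examples (rather than summing, or halving the squared error) is an assumption made here. The source does not state the exact scaling that Oracle Machine Learning uses.

```python
import math

EPS = 1e-12  # keeps log() away from zero


def binary_cross_entropy(y_true, p_pred):
    # y_true holds 0/1 labels, p_pred the predicted probability of class 1.
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, EPS), 1.0 - EPS)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def multiclass_cross_entropy(y_true, p_pred):
    # y_true holds class indices, p_pred one probability vector per example.
    total = 0.0
    for k, probs in zip(y_true, p_pred):
        total += -math.log(max(probs[k], EPS))
    return total / len(y_true)


def squared_error(y_true, y_pred):
    # Mean of the squared differences between targets and predictions.
    return sum((y - yhat) ** 2 for y, yhat in zip(y_true, y_pred)) / len(y_true)
```

Each function returns a single real number, which is the "cost" that the optimizer then tries to minimize.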
25.1.3 Forward-Backward Propagation
Forward propagation computes loss values, while backward propagation computes the gradients that are used to update the weights and minimize the loss function.
Forward propagation computes the value of the loss function: each layer takes the weighted sum of the previous layer's neuron values and applies its activation function. Backward propagation calculates the gradient of the loss function with respect to all the weights in the network. The weights are initialized with a set of random numbers uniformly distributed within a region specified by the user (by setting the weight boundaries), or within a region defined by the number of nodes in the adjacent layers (data driven). The gradients are fed to an optimization method, which in turn uses them to update the weights in an attempt to minimize the loss function.
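A minimal sketch of these steps, assuming the smallest possible network: a single sigmoid neuron trained with binary cross entropy. All names are hypothetical, the weight bounds are arbitrary illustration values, and the plain gradient step stands in for whatever solver is actually used.

```python
import math
import random


def init_weights(n, lower, upper, rng):
    # Random weights, uniformly distributed within user-supplied bounds.
    return [rng.uniform(lower, upper) for _ in range(n)]


def forward(weights, x, y):
    # Weighted sum, sigmoid activation, then the binary cross entropy loss.
    z = sum(w * v for w, v in zip(weights, x))
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return p, loss


def backward(weights, x, y):
    # For sigmoid + cross entropy, d(loss)/d(w_i) simplifies to (p - y) * x_i.
    p, _ = forward(weights, x, y)
    return [(p - y) * v for v in x]


def gradient_step(weights, grad, learning_rate):
    # The optimizer consumes the gradient to update the weights.
    return [w - learning_rate * g for w, g in zip(weights, grad)]


rng = random.Random(0)
w = init_weights(3, -0.5, 0.5, rng)
x, y = [1.0, 2.0, -1.0], 1
loss_before = forward(w, x, y)[1]
w_next = gradient_step(w, backward(w, x, y), 0.1)
loss_after = forward(w_next, x, y)[1]
```

After one step `loss_after` is lower than `loss_before`, which is the behavior the forward and backward passes are meant to produce on every iteration.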
25.1.4 Optimization Solvers
An optimization solver is a function that searches for the extreme value (maximum or minimum) of the loss (cost) function. Neural Network uses the L-BFGS and Adam solvers.
Oracle Machine Learning implements the Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) solver, combined with line search, and the Adam solver.
Limited-memory Broyden–Fletcher–Goldfarb–Shanno Solver
L-BFGS is a Quasi-Newton method. This method uses rank-two updates specified by gradient evaluations to approximate a Hessian matrix, and it needs only a limited amount of memory. L-BFGS is used to find the descent direction, and line search is used to find the appropriate step size. The number of historical copies kept in the L-BFGS solver is defined by the LBFGS_HISTORY_DEPTH solver setting. When the number of iterations is smaller than the history depth, the Hessian computed by L-BFGS reflects every update made so far. When the number of iterations is larger than the history depth, the oldest updates are discarded and the Hessian is approximated from the most recent ones only. Therefore, the history depth should be neither too small nor too large, to avoid making the computation too slow. Typically, the value is between 3 and 10.
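The interplay of descent direction, line search, and history depth can be illustrated with a textbook L-BFGS loop. This is a sketch under stated assumptions, not the Oracle solver. It uses the standard two-loop recursion with the identity as the initial inverse Hessian and a simple backtracking line search. The function names and the `history_depth` parameter are hypothetical stand-ins for the role that LBFGS_HISTORY_DEPTH plays.

```python
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lbfgs_direction(grad, history):
    # Two-loop recursion: applies the implicit inverse Hessian to the gradient.
    q = list(grad)
    alphas = []
    for s, y in reversed(history):            # newest pair first
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        alphas.append((a, rho))
        q = [qi - a * yi for qi, yi in zip(q, y)]
    r = q                                     # identity as initial inverse Hessian
    for (s, y), (a, rho) in zip(history, reversed(alphas)):  # oldest pair first
        b = rho * dot(y, r)
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return [-ri for ri in r]                  # descent direction


def lbfgs_minimize(f, grad_f, x0, history_depth=5, max_iter=100, grad_tol=1e-8):
    x = list(x0)
    g = grad_f(x)
    history = deque(maxlen=history_depth)     # oldest pairs drop out automatically
    for _ in range(max_iter):
        if max(abs(gi) for gi in g) < grad_tol:   # infinity norm of the gradient
            break
        d = lbfgs_direction(g, history)
        fx, slope, step = f(x), dot(g, d), 1.0
        # Backtracking line search: halve the step until the loss drops enough.
        while (step > 1e-16 and
               f([xi + step * di for xi, di in zip(x, d)]) > fx + 1e-4 * step * slope):
            step *= 0.5
        x_new = [xi + step * di for xi, di in zip(x, d)]
        g_new = grad_f(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(s, y) > 1e-10 * dot(y, y):     # keep only well-conditioned pairs
            history.append((s, y))
        x, g = x_new, g_new
    return x
```

A usage example on a small quadratic: `lbfgs_minimize(lambda v: (v[0] - 1) ** 2 + 10 * (v[1] + 2) ** 2, lambda v: [2 * (v[0] - 1), 20 * (v[1] + 2)], [0.0, 0.0])` moves toward the minimum at (1, -2).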
Adam Solver
Adam is an extension to stochastic gradient descent that uses mini-batch optimization. The L-BFGS solver may be the more stable solver, whereas the Adam solver can make progress faster by seeing less data. Adam is computationally efficient, has low memory requirements, and is well-suited for problems that are large in terms of data, parameters, or both.
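As a rough sketch of the mini-batch idea, the following Python code applies the standard published Adam update rule to a one-parameter regression. The function name, the default values, and the parameter names (`batch_rows`, `alpha`, `beta1`, `beta2`) are assumptions chosen for illustration. They are not a statement about the Oracle solver's settings or defaults.

```python
import math
import random


def adam_fit_slope(xs, ys, batch_rows=5, alpha=0.05, beta1=0.9, beta2=0.999,
                   eps=1e-8, epochs=200, seed=0):
    # Fits y ~ w * x by minimizing the mean of 0.5 * (w*x - y)^2 with Adam.
    rng = random.Random(seed)
    rows = list(zip(xs, ys))
    w, m, v, t = 0.0, 0.0, 0.0, 0
    for _ in range(epochs):
        rng.shuffle(rows)
        for start in range(0, len(rows), batch_rows):
            batch = rows[start:start + batch_rows]
            # The gradient is estimated from the mini-batch only.
            g = sum((w * x - y) * x for x, y in batch) / len(batch)
            t += 1
            m = beta1 * m + (1 - beta1) * g        # running mean of gradients
            v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
            m_hat = m / (1 - beta1 ** t)           # bias correction
            v_hat = v / (1 - beta2 ** t)
            w -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w
```

Each update touches only `batch_rows` rows, which is why this style of solver can start making progress before it has seen the whole data set.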
25.1.5 Regularization
Regularization techniques, such as L1 norms, L2 norms, and held-aside, prevent overfitting and improve model generalization.
Regularization refers to a process of introducing additional information to solve an ill-posed problem or to prevent over-fitting. Ill-posedness or over-fitting can occur when a statistical model describes random errors or noise instead of the underlying relationship. Typical regularization techniques include L1-norm regularization, L2-norm regularization, and held-aside.
Held-aside is usually used for large training data sets, whereas L1-norm regularization and L2-norm regularization are mostly used for small training data sets.
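L1-norm and L2-norm regularization both add a weight-dependent penalty to the loss. The sketch below uses the usual textbook forms. The function names and the `lam` strength parameter are hypothetical, and the exact scaling used by Oracle Machine Learning is not given in the source.

```python
def l1_penalty(weights, lam):
    # L1 norm: sum of absolute weight values, scaled by the strength lam.
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    # Squared L2 norm: sum of squared weight values, scaled by lam.
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam, kind):
    # The optimizer minimizes the data loss plus the chosen penalty.
    if kind == "l1":
        return data_loss + l1_penalty(weights, lam)
    if kind == "l2":
        return data_loss + l2_penalty(weights, lam)
    raise ValueError("kind must be 'l1' or 'l2'")
```

Because the penalty grows with the size of the weights, the optimizer is discouraged from fitting noise with large weight values.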
25.1.6 Convergence Check
Convergence checks ensure optimization processes reach optimal solutions, stopping when performance criteria are met.
In the L-BFGS solver, the convergence criteria include the maximum number of iterations, the infinity norm of the gradient, and the relative error tolerance. For held-aside regularization, the convergence criteria check the loss function value of the test data set, as well as the best model learned so far. The training is terminated when the model becomes worse for a specific number of iterations (specified by NNET_HELDASIDE_MAX_FAIL), or the loss function is close to zero, or the relative error on test data is less than the tolerance.
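The held-aside stopping rule can be sketched as a small function over the sequence of held-aside loss values. Several details here are assumptions: "becomes worse" is read as "fails to beat the best loss seen so far on consecutive iterations", the `max_fail` argument stands in for NNET_HELDASIDE_MAX_FAIL, and the relative error criterion is omitted because the source does not define it precisely.

```python
def held_aside_stop(heldaside_losses, max_fail, loss_tol=1e-8):
    # Returns (iteration at which training stops, iteration of the best model).
    best_loss = float("inf")
    best_iter = -1
    fails = 0
    for i, loss in enumerate(heldaside_losses):
        if loss < best_loss:
            # New best model: remember it and reset the failure counter.
            best_loss, best_iter, fails = loss, i, 0
        else:
            fails += 1
        if loss <= loss_tol or fails >= max_fail:
            return i, best_iter
    return len(heldaside_losses) - 1, best_iter
```

With `max_fail=3` and the losses `[1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]`, training stops at iteration 5 and the model from iteration 2 is the one kept.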
25.1.7 LBFGS_SCALE_HESSIAN
The LBFGS_SCALE_HESSIAN setting improves optimization by adjusting initial inverse Hessian approximations.
The setting adjusts the initial approximation of the inverse Hessian at the beginning of each iteration. If the value is LBFGS_SCALE_HESSIAN_ENABLE, then the initial inverse Hessian is approximated with Oren-Luenberger scaling. If it is LBFGS_SCALE_HESSIAN_DISABLE, the identity matrix is used as the initial approximation of the inverse Hessian at the beginning of each iteration.
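To make the two options concrete, the sketch below computes the scalar that multiplies the identity matrix to form the initial inverse Hessian. It assumes the scaling takes the commonly cited form gamma = (s·y) / (y·y), where s is the latest change in the weights and y the latest change in the gradient. The source does not spell out the formula, so treat both the formula and the hypothetical function name as assumptions.

```python
def initial_inverse_hessian_scale(s, y, scale_hessian=True):
    # Returns gamma, so that the initial inverse Hessian is gamma * identity.
    if not scale_hessian:
        return 1.0                      # plain identity, no scaling
    sy = sum(si * yi for si, yi in zip(s, y))
    yy = sum(yi * yi for yi in y)
    return sy / yy                      # assumed Oren-Luenberger style scaling
```

For the one-dimensional function f(x) = 2x², a step of s = 2 changes the gradient by y = 8, and the function returns 0.25, which is exactly the inverse of the true second derivative 4.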
25.2 Data Preparation for Neural Network
Neural Network algorithms normalize numeric data, convert categorical data into binary attributes, and handle missing values automatically.
The algorithm automatically "explodes" categorical data into a set of binary attributes, one per category value. Oracle Machine Learning algorithms automatically handle missing values, so separate missing value treatment is not necessary.
The algorithm automatically replaces missing categorical values with the mode and missing numerical values with the mean. The algorithm requires normalized numeric input and uses z-score normalization for this purpose. The normalization is applied only to two-dimensional numeric columns (not to nested columns). Normalization places the values of numeric attributes on the same scale and prevents attributes with a large original scale from biasing the solution. For nested columns, Neural Network scales the numeric values by the maximum absolute value seen in the corresponding columns.
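The preparation steps described above can be mimicked in a few lines of Python to show their effect on the data. This is an illustrative sketch only. The function names are hypothetical, `None` stands in for a missing value, and the use of the population standard deviation in the z-score is an assumption.

```python
import math


def z_score(values):
    # Missing values (None) are replaced with the mean, then standardized.
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    std = math.sqrt(sum((v - mean) ** 2 for v in present) / len(present)) or 1.0
    return [((mean if v is None else v) - mean) / std for v in values]


def max_abs_scale(values):
    # Scaling by the maximum absolute value, as described for nested columns.
    biggest = max(abs(v) for v in values) or 1.0
    return [v / biggest for v in values]


def explode(values):
    # Missing categories are replaced with the mode, then each category
    # becomes its own binary attribute.
    present = [v for v in values if v is not None]
    mode = max(sorted(set(present)), key=present.count)
    filled = [mode if v is None else v for v in values]
    categories = sorted(set(filled))
    return categories, [[1 if v == c else 0 for c in categories] for v in filled]
```

For instance, `explode(["a", None, "b", "a"])` fills the gap with the mode `"a"` and returns two binary attributes, one for `"a"` and one for `"b"`.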
25.3 Neural Network Algorithm Configuration
Configure Neural Network algorithms by specifying nodes per layer and activation functions to optimize performance.
Specify Nodes Per Layer
INSERT INTO SETTINGS_TABLE (setting_name, setting_value) VALUES
('NNET_NODES_PER_LAYER', '2,3');
Specify Activation Functions Per Layer
The NNET_ACTIVATIONS setting specifies the activation functions for the hidden layers.
See Also:
DBMS_DATA_MINING - Algorithm Settings: Neural Network for a listing and explanation of the available model settings.
Note:
The term hyperparameter is also interchangeably used for model setting.
25.4 Scoring with Neural Network
Score data with Neural Networks using standard regression and classification scoring functions.
Scoring with Neural Network is the same as with any other classification or regression algorithm. The following functions are supported: PREDICTION, PREDICTION_PROBABILITY, PREDICTION_COST, PREDICTION_SET, and PREDICTION_DETAILS.