I am currently working on building a neural network completely from scratch using base R. This post shows some of my progress thus far: a basic multilayer perceptron with full feed-forward and backpropagation capabilities. It uses basic Stochastic Gradient Descent (SGD) to optimize the parameters (weights & biases). In this post, it performs a simple classification task on some simulated data.
When the entire project is complete, I hope to have a fully usable
system of neural network functions complete with different optimization
methods (such as Adam), hyperparameters (like learning rate decay and
momentum), and out-of-sample prediction capabilities. I’ve been writing
up the process in a long-form blog/tutorial which I plan to post for
anyone who is interested.
Learning how each component of a neural network works and then building each one in code has been an enlightening experience, and it has really deepened my understanding of neural networks and their capabilities. It has also
been a unique challenge to build the network using base R, which I
believe is worthwhile since it prevents me from using pre-existing
packages as quick fixes to complicated problems (and also avoids any
dependency issues). At the end of the day, the calculations performed by
the computer are fairly basic dot products and matrix multiplications,
so advanced packages aren’t really necessary.
This project is inspired by the “Neural
Networks from Scratch in Python” book, which does an excellent job
explaining the intuition, math, and implementation of each step in a
neural network. I cannot recommend it enough.
The function below simulates data in a spiral form with 3 classes, as seen in the plot below. This provides us with a basic multi-class classification task, where we want the neural network to learn the decision boundaries that effectively separate each class. The function is inspired by this Stanford course.
## A function to simulate spiral data
sim_data_fn = function(
N = 100, # number of points per class
D = 2, # dimensionality (number of features)
K = 3, # number of classes
random_order = FALSE
){
set.seed(123)
X = data.frame() # data matrix (each row = single sample)
y = data.frame() # class labels
for (j in (1:K)){
r = seq(0.05,1,length.out = N) # radius
t = seq((j-1)*4.7,j*4.7, length.out = N) + rnorm(N, sd = 0.3) # theta
Xtemp = data.frame(x =r*sin(t) , y = r*cos(t))
ytemp = data.frame(matrix(j, N, 1))
X = rbind(X, Xtemp)
y = rbind(y, ytemp)
}
spiral_data = cbind(X,y)
colnames(spiral_data) = c(colnames(X), 'label')
# Want randomly ordered labels?
if (random_order==TRUE) {spiral_data$label = sample(1:3, size = N*K,
replace = TRUE)}
return(spiral_data)
}
spiral_data = sim_data_fn(N=100) #100 obs per class, total of 300 samples
plot(spiral_data$x, spiral_data$y, col = spiral_data$label)
All of the functions used here were built from scratch using R's base package. This is everything we need to perform a basic forward pass, backward pass, and parameter optimization.
Included are functions to:
- Initialize random parameters (weights & biases) for each layer
(hidden and output layers)
- Create a densely connected layer with a specified number of neurons.
- Apply ReLU or Softmax Activation Functions to each neuron.
- Calculate loss using categorical cross-entropy.
- Backpropagate through each step to determine the gradients that indicate how to change every weight and bias.
- Apply the gradients to the current weights and biases using an
optimizer (Stochastic Gradient Descent).
# Initialize Parameters for a Layer----
init_params = function(n_inputs = "# of features", n_neurons){
weights = matrix(data = (0.1 * rnorm(n = n_inputs*n_neurons)),
nrow = n_inputs, ncol = n_neurons)
  #Number of weights = number of inputs * number of neurons,
  #since every connection between a previous neuron (or input feature) and
  #a current neuron has an associated weight
biases = matrix(data = 0, nrow = 1, ncol = n_neurons)
#Number of biases = the number of neurons in a layer
#saves:
list("weights"=weights, "biases"=biases)
}
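#A quick sanity check (demo only; the demo_params name is made up for illustration):
#a layer taking 2 inputs into 4 neurons should get a 2 x 4 weight matrix
#and a 1 x 4 bias matrix
demo_params = init_params(n_inputs = 2, n_neurons = 4)
dim(demo_params$weights) #should be 2 4 (n_inputs x n_neurons)
dim(demo_params$biases) #should be 1 4 (1 x n_neurons)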
# Create a Dense Layer ----
layer_dense = list(
## FORWARD PASS
forward = function(
inputs,
n_neurons,
parameters) {#begin fn
    n_inputs = ncol(inputs)
    #determine number of inputs per sample from the dims of the input matrix;
    #should equal the # of features (i.e., columns) in a sample (i.e., a row)
    weights = parameters$weights
    biases = parameters$biases
    #biases have shape 1 x n_neurons (initialized as zeros in init_params)
    #Forward Pass: matrix product plus each neuron's bias,
    #with the bias row recycled across all samples
    wx = inputs%*%weights
    output = wx + biases[col(wx)]
#SAVE:
list("output" = output, #for forward pass
"inputs" = inputs, "weights"= weights, "biases" = biases) #for backprop
},
## BACKWARD PASS
backward = function(inputs, weights, dvalues){
#Gradients on parameters
dweights = t(inputs)%*%dvalues
dbiases = colSums(dvalues)
#Gradient on values
dinputs = dvalues%*%t(weights)
#save:
list("dinputs"=dinputs,
"dweights"=dweights,
"dbiases"=dbiases)
}
)
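#Demo (not part of the network; demo_* names are made up for illustration):
#pass 5 two-feature samples through the 4-neuron layer defined by demo_params above;
#the output should be a 5 x 4 matrix (samples x neurons)
demo_inputs = matrix(rnorm(10), nrow = 5, ncol = 2)
demo_dense = layer_dense$forward(inputs = demo_inputs,
                                 n_neurons = 4,
                                 parameters = demo_params)
dim(demo_dense$output) #should be 5 4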
### Activation Functions ----
## ReLU
activation_ReLU = list(
#FORWARD PASS
  forward = function(input_layer){
    #element-wise max(0, x); pmax keeps the dims of the input matrix,
    #so the output shape matches the input shape
    output = pmax(input_layer, 0)
    #Function saves:
    list("output" = output, "inputs" = input_layer)
  },
#BACKWARD PASS
backward = function(inputs, dvalues){
dinputs = dvalues
dinputs[inputs <= 0] = 0
#save:
list("dinputs"=dinputs)
}
)
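#Demo (not part of the network; demo_relu is made up for illustration):
#ReLU zeroes out negative entries and keeps the matrix shape
demo_relu = activation_ReLU$forward(matrix(c(-1, 2, 0.5, -3), nrow = 2))
demo_relu$output #negative entries become 0; positive entries pass through unchanged
dim(demo_relu$output) #should be 2 2, the same shape as the input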
## SoftMax
activation_Softmax = list(
#FORWARD PASS
  forward = function(inputs){
    #scale inputs for numerical stability:
    #subtract each sample's (row's) maximum so the largest value passed to exp() is 0
    max_value = apply(X = inputs, MARGIN = 1, FUN = max)
    scaled_inputs = inputs - max_value #recycling subtracts row i's max from every entry in row i
    # exponentiate
    exp_values = exp(scaled_inputs)
    # normalize each row so its probabilities sum to 1
    norm_base = rowSums(exp_values)
    probabilities = exp_values/norm_base #recycling divides row i by row i's sum
    return(probabilities)
  },
#BACKWARD PASS
  backward = function(softmax_output, dvalues){
    #*Single-sample version; the network below uses the combined
    #Softmax/Cross-Entropy backward step instead*
    #flatten the output and the incoming gradient
    flat_output = as.vector(softmax_output)
    flat_dvalues = as.vector(dvalues)
    #calculate jacobian matrix of the softmax output
    jacobian_matrix = diag(flat_output) - flat_output%*%t(flat_output)
    #calculate sample-wise gradient
    dinputs = jacobian_matrix%*%flat_dvalues
    return(dinputs)
  }
)
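#Demo (not part of the network; demo_* names are made up for illustration):
#softmax turns each row of raw scores into probabilities
demo_scores = matrix(c(2, 1, 0.1,
                       0, 0, 0), nrow = 2, byrow = TRUE)
demo_probs = activation_Softmax$forward(demo_scores)
rowSums(demo_probs) #each row should sum to 1
demo_probs #within a row, the largest score receives the largest probability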
### Loss ----
Categorical_CrossEntropy = list(
#FORWARD PASS
  forward = function(y_pred = "softmax output", y_true = "targets"){
    #DETECT NUMBER OF SAMPLES
    samples = length(y_true)
    #CLIP PREDICTIONS TO AVOID A -Inf ERROR FROM log(0)
    y_pred_clipped = ifelse(y_pred <= 1e-7, 1e-7,
                            ifelse(y_pred >= (1-1e-7), (1-1e-7), y_pred))
    #DETERMINE IF Y_TRUE IS ONE-HOT-ENCODED AND SELECT CORRESPONDING CONFIDENCES
    if (nrow(t(y_true)) == 1) {
      #y_true is a single vector of labels (i.e., sparse)
      confidences = y_pred_clipped[cbind(1:samples, y_true)]
    } else if (nrow(y_true) > 1) {
      #y_true is one-hot encoded
      confidences = rowSums(y_pred_clipped*y_true)
    } else stop("error indexing the predicted class confidences")
    #CALC LOSS FOR EACH SAMPLE (ROW)
    neg_log_likelihoods = -log(confidences)
    return(neg_log_likelihoods)
  },
#BACKWARD PASS
  backward = function(y_true, dvalues){
    #number of samples (rows of the prediction/gradient matrix)
    samples = nrow(dvalues)
    #number of labels/classes (columns)
    labels = ncol(dvalues)
    #if labels are sparse (a vector of class indices), turn them into a one-hot encoded matrix
    if (nrow(t(y_true)) == 1) {
      y_true = diag(labels)[y_true, ]
    }
    #calculate gradient
    dinputs = -y_true/dvalues
    #normalize gradient
    dinputs = dinputs/samples
    return(dinputs)
  }
)
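#Demo (not part of the network; demo_preds is made up for illustration):
#the loss is small when the predicted probability of the true class is high,
#and large when it is low
demo_preds = matrix(c(0.7, 0.2, 0.1,
                      0.1, 0.2, 0.7), nrow = 2, byrow = TRUE)
Categorical_CrossEntropy$forward(y_pred = demo_preds, y_true = c(1, 1))
#first sample (confident & correct): -log(0.7), roughly 0.36
#second sample (confident & wrong): -log(0.1), roughly 2.30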
#Softmax Activation X Cross-Entropy Loss Combination ----
# combine for faster backprop step (derivatives simplify nicely)
activation_loss_SoftmaxCrossEntropy = list(
#FORWARD PASS
forward = function(inputs, y_true){
#output layer's activation function
softmax_out = activation_Softmax$forward(inputs)
#calculate loss
loss = Categorical_CrossEntropy$forward(softmax_out, y_true)
#function saves:
list("softmax_output"=softmax_out, "loss"=loss)
},
#BACKWARD PASS
backward = function(dvalues, y_true){
#Detect number of samples
if (is.vector(dvalues)) { #if one sample
samples = 1
} else if (is.array(dvalues)) { #else if multiple samples
samples = nrow(dvalues)
    } else stop("error checking shape of inputs")
#Reverse One-Hot Encoding
    #if labels are one-hot encoded, turn them back into discrete class labels
    ##helper function: each sample's class is the column that holds the 1
    anti_ohe = function(y_true){
      max.col(y_true)
    }
##check & modify
y_true = if(is.array(y_true)){ #if one-hot encoded
#change to sparse
anti_ohe(y_true)
} else y_true
#Calculate gradient
#Copy so we can modify
dinputs = dvalues
#Calculate gradient
#index the prediction array with the sample number and its
#true value index, subtracting 1 from these values. Requires discrete,
#not one-hot encoded, true labels (explaining the need for the above step)
dinputs[cbind(1:samples, y_true)] = dinputs[cbind(1:samples, y_true)] - 1
#Normalize gradient
dinputs = dinputs/samples
#save desired outputs
list("dinputs" = dinputs, "samples" = samples, "y_true" = y_true)
}
)
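#Demo (not part of the network; demo_softmax_out is made up for illustration):
#for a single sample, the combined backward step reduces to
#(softmax output - one-hot true vector) / number of samples
demo_softmax_out = matrix(c(0.7, 0.2, 0.1), nrow = 1)
activation_loss_SoftmaxCrossEntropy$backward(dvalues = demo_softmax_out,
                                             y_true = 1)$dinputs
#expected: -0.3, 0.2, 0.1 (i.e., c(0.7, 0.2, 0.1) - c(1, 0, 0), divided by 1)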
# Optimizers----
## Basic Stochastic Gradient Descent (SGD)
optimizer_SGD = list(
update_params = function(layer_forward, layer_backward,
learning_rate = 1){ #default value
#current params
current_weights = layer_forward$weights
current_biases = layer_forward$biases
#gradients
weight_gradients = layer_backward$dweights
bias_gradients = layer_backward$dbiases
#update
weights = current_weights - learning_rate*weight_gradients
biases = current_biases - learning_rate*bias_gradients
#save:
list("weights" = weights, "biases" = biases)
}
)
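#Demo (not part of the network; demo_* objects are made up for illustration):
#a single SGD step moves each parameter against its gradient, scaled by the learning rate
demo_fwd = list(weights = matrix(c(0.5, -0.2), 1, 2), biases = matrix(0, 1, 2))
demo_bwd = list(dweights = matrix(c(0.1, -0.4), 1, 2), dbiases = matrix(c(0.05, 0), 1, 2))
optimizer_SGD$update_params(layer_forward = demo_fwd, layer_backward = demo_bwd,
                            learning_rate = 1)
#expected weights: 0.4, 0.2; expected biases: -0.05, 0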
Here I build a network to predict the classes (1, 2, or 3) of the
spiral data based on the x and y coordinates, which represent the
“features” in this dataset. Parameters are initialized randomly and then
I use a for-loop to iterate through the forward and backward pass over
2500 epochs.
At the bottom of the loop, the optimizer applies the calculated gradients to the current weights and biases, and the layer parameters are overwritten with the newly updated values before the next epoch begins.
set.seed(1) ##set seed for results replication
#Data: use x and y coordinates as features
spiral_X = spiral_data[,c("x","y")]
spiral_X = as.matrix(spiral_X) ##convert to matrix
#Initialize Weights & Biases Outside of Loop
l1_params = init_params(n_inputs = 2, ## ncol(spiral_X) would also work
n_neurons = 64) ## = to desired # neurons in layer
l2_params = init_params(n_inputs = 64, ## = to n_neurons in prior layer
n_neurons = 3) ## = to desired # neurons in layer
# EPOCH LOOP
tot_epochs = 2500L
for (epoch in 1:tot_epochs) {
#Forward-pass
layer1 = layer_dense$forward(inputs = spiral_X, ##feed in data
n_neurons = 64, ##specify desired neurons
parameters = l1_params) ##provide layer parameters
layer1_relu = activation_ReLU$forward(input_layer = layer1$output) ##apply ReLU
layer2 = layer_dense$forward(inputs = layer1_relu$output,##feed in ReLU output
n_neurons = 3, ##3 neurons - 1 for each class
parameters = l2_params) ##provide params
output =
activation_loss_SoftmaxCrossEntropy$forward(inputs = layer2$output,
y_true = spiral_data$label)
##uses a combined softmax/crossentropy fn to apply a softmax activation function and calculate loss.
#metrics:
LOSS = mean(output$loss) ##mean loss across all samples
PRED = max.col(output$softmax_output, ties.method = "random")
##extract predictions based on category with highest predicted probability
ACC = mean(PRED==spiral_data$label) ##calculate accuracy based on portion of correctly labeled samples
#Backward-pass
loss_softm_b = activation_loss_SoftmaxCrossEntropy$backward( ##backprop softmax fn
dvalues = output$softmax_output,
y_true = spiral_data$label)
l2_b = layer_dense$backward(inputs = layer2$inputs, ##backprop layer2, pass in softmax derivatives (chain rule)
weights = layer2$weights,
dvalues = loss_softm_b$dinputs)
l1_relu_b = activation_ReLU$backward(inputs = layer1_relu$inputs, ##backprop layer1's ReLU, pass in layer2 derivatives
dvalues = l2_b$dinputs)
l1_b = layer_dense$backward(inputs = layer1$inputs, ##backprop layer1, pass in layer1 ReLU derivatives
weights = layer1$weights,
dvalues = l1_relu_b$dinputs)
#Optimize parameters with learning rate of 0.5
l1_optim_params = optimizer_SGD$update_params(layer_forward = layer1,
layer_backward = l1_b,
learning_rate = 0.5)
l2_optim_params = optimizer_SGD$update_params(layer_forward = layer2,
layer_backward = l2_b,
learning_rate = 0.5)
#Update weights & biases with newly calculated params
l1_params = l1_optim_params
l2_params = l2_optim_params
#loop repeats with new weights & biases...
#---
#Status Report:
if (epoch == 1){print(c("epoch","loss","accuracy"))}
if (epoch %in% seq(0,tot_epochs,by=100)){
print(c( epoch,
LOSS,
ACC) )}
#Save Final Metrics:
if (epoch==tot_epochs){
out = list("loss"=LOSS,
"accuracy"=ACC,
"predictions"=PRED)}
}#end loop
## [1] "epoch" "loss" "accuracy"
## [1] 100.0000000 1.1481168 0.4666667
## [1] 200.0000000 1.0666849 0.5633333
## [1] 300.0000000 0.8856428 0.6766667
## [1] 400.0000000 0.7638010 0.6966667
## [1] 500.0000000 0.6494212 0.7433333
## [1] 600.0000000 0.6265188 0.7633333
## [1] 700.0000000 1.3210822 0.8266667
## [1] 800.0000000 1.7219226 0.8533333
## [1] 900.000000 1.715416 0.840000
## [1] 1000.0000000 0.9714231 0.9066667
## [1] 1100.0000000 0.1680926 0.8733333
## [1] 1200.0000000 0.2213610 0.8433333
## [1] 1300.0000000 0.3538507 0.9333333
## [1] 1400.0000000 0.7161815 0.9766667
## [1] 1500.00000 10.46614 0.49000
## [1] 1600.0000000 1.9601171 0.9066667
## [1] 1700.0000000 1.3836203 0.9466667
## [1] 1800.0000000 1.1736429 0.9666667
## [1] 1900.0000000 0.7299922 0.9833333
## [1] 2000.0000000 0.6426703 0.9933333
## [1] 2100.0000000 0.5317172 0.9966667
## [1] 2200.0000000 0.4946968 0.9966667
## [1] 2300.0000000 0.5027338 0.9966667
## [1] 2400.0000000 0.4728830 0.9966667
## [1] 2500.0000000 0.4751791 0.9966667
The neural net is able to learn the classes quite effectively: as it minimizes the loss, the accuracy of its predictions steadily increases. Of course, memorizing the training data is not an especially hard task for a model. To truly test this network we would need to evaluate it on held-out test data, something I plan to do when my full neural net package is complete (a minimal sketch of such a prediction step appears at the end of this post). For now, I am happy with what this program is capable of.
Final Metrics
#Loss:
out$loss
## [1] 0.4751791
#Accuracy:
out$accuracy
## [1] 0.9966667
#Predictions
cbind(predicted_label = out$predictions, actual_label = spiral_data$label)
## predicted_label actual_label
## [1,] 1 1
## [2,] 1 1
## [3,] 1 1
## [4,] 1 1
## [5,] 1 1
## [6,] 1 1
## [7,] 1 1
## [8,] 1 1
## [9,] 1 1
## [10,] 1 1
## [11,] 1 1
## [12,] 1 1
## [13,] 1 1
## [14,] 1 1
## [15,] 1 1
## [16,] 1 1
## [17,] 1 1
## [18,] 1 1
## [19,] 1 1
## [20,] 1 1
## [21,] 1 1
## [22,] 1 1
## [23,] 1 1
## [24,] 1 1
## [25,] 1 1
## [26,] 1 1
## [27,] 1 1
## [28,] 1 1
## [29,] 1 1
## [30,] 1 1
## [31,] 1 1
## [32,] 1 1
## [33,] 1 1
## [34,] 1 1
## [35,] 1 1
## [36,] 1 1
## [37,] 1 1
## [38,] 1 1
## [39,] 1 1
## [40,] 1 1
## [41,] 1 1
## [42,] 1 1
## [43,] 1 1
## [44,] 1 1
## [45,] 1 1
## [46,] 1 1
## [47,] 1 1
## [48,] 1 1
## [49,] 1 1
## [50,] 1 1
## [51,] 1 1
## [52,] 1 1
## [53,] 1 1
## [54,] 1 1
## [55,] 1 1
## [56,] 1 1
## [57,] 1 1
## [58,] 1 1
## [59,] 1 1
## [60,] 1 1
## [61,] 1 1
## [62,] 1 1
## [63,] 1 1
## [64,] 1 1
## [65,] 1 1
## [66,] 1 1
## [67,] 1 1
## [68,] 1 1
## [69,] 1 1
## [70,] 1 1
## [71,] 1 1
## [72,] 1 1
## [73,] 1 1
## [74,] 1 1
## [75,] 1 1
## [76,] 1 1
## [77,] 1 1
## [78,] 1 1
## [79,] 1 1
## [80,] 1 1
## [81,] 1 1
## [82,] 1 1
## [83,] 1 1
## [84,] 1 1
## [85,] 1 1
## [86,] 1 1
## [87,] 1 1
## [88,] 1 1
## [89,] 1 1
## [90,] 1 1
## [91,] 1 1
## [92,] 1 1
## [93,] 1 1
## [94,] 1 1
## [95,] 1 1
## [96,] 1 1
## [97,] 1 1
## [98,] 1 1
## [99,] 1 1
## [100,] 1 1
## [101,] 2 2
## [102,] 2 2
## [103,] 2 2
## [104,] 2 2
## [105,] 2 2
## [106,] 2 2
## [107,] 2 2
## [108,] 3 2
## [109,] 2 2
## [110,] 2 2
## [111,] 2 2
## [112,] 2 2
## [113,] 2 2
## [114,] 2 2
## [115,] 2 2
## [116,] 2 2
## [117,] 2 2
## [118,] 2 2
## [119,] 2 2
## [120,] 2 2
## [121,] 2 2
## [122,] 2 2
## [123,] 2 2
## [124,] 2 2
## [125,] 2 2
## [126,] 2 2
## [127,] 2 2
## [128,] 2 2
## [129,] 2 2
## [130,] 2 2
## [131,] 2 2
## [132,] 2 2
## [133,] 2 2
## [134,] 2 2
## [135,] 2 2
## [136,] 2 2
## [137,] 2 2
## [138,] 2 2
## [139,] 2 2
## [140,] 2 2
## [141,] 2 2
## [142,] 2 2
## [143,] 2 2
## [144,] 2 2
## [145,] 2 2
## [146,] 2 2
## [147,] 2 2
## [148,] 2 2
## [149,] 2 2
## [150,] 2 2
## [151,] 2 2
## [152,] 2 2
## [153,] 2 2
## [154,] 2 2
## [155,] 2 2
## [156,] 2 2
## [157,] 2 2
## [158,] 2 2
## [159,] 2 2
## [160,] 2 2
## [161,] 2 2
## [162,] 2 2
## [163,] 2 2
## [164,] 2 2
## [165,] 2 2
## [166,] 2 2
## [167,] 2 2
## [168,] 2 2
## [169,] 2 2
## [170,] 2 2
## [171,] 2 2
## [172,] 2 2
## [173,] 2 2
## [174,] 2 2
## [175,] 2 2
## [176,] 2 2
## [177,] 2 2
## [178,] 2 2
## [179,] 2 2
## [180,] 2 2
## [181,] 2 2
## [182,] 2 2
## [183,] 2 2
## [184,] 2 2
## [185,] 2 2
## [186,] 2 2
## [187,] 2 2
## [188,] 2 2
## [189,] 2 2
## [190,] 2 2
## [191,] 2 2
## [192,] 2 2
## [193,] 2 2
## [194,] 2 2
## [195,] 2 2
## [196,] 2 2
## [197,] 2 2
## [198,] 2 2
## [199,] 2 2
## [200,] 2 2
## [201,] 3 3
## [202,] 3 3
## [203,] 3 3
## [204,] 3 3
## [205,] 3 3
## [206,] 3 3
## [207,] 3 3
## [208,] 3 3
## [209,] 3 3
## [210,] 3 3
## [211,] 3 3
## [212,] 3 3
## [213,] 3 3
## [214,] 3 3
## [215,] 3 3
## [216,] 3 3
## [217,] 3 3
## [218,] 3 3
## [219,] 3 3
## [220,] 3 3
## [221,] 3 3
## [222,] 3 3
## [223,] 3 3
## [224,] 3 3
## [225,] 3 3
## [226,] 3 3
## [227,] 3 3
## [228,] 3 3
## [229,] 3 3
## [230,] 3 3
## [231,] 3 3
## [232,] 3 3
## [233,] 3 3
## [234,] 3 3
## [235,] 3 3
## [236,] 3 3
## [237,] 3 3
## [238,] 3 3
## [239,] 3 3
## [240,] 3 3
## [241,] 3 3
## [242,] 3 3
## [243,] 3 3
## [244,] 3 3
## [245,] 3 3
## [246,] 3 3
## [247,] 3 3
## [248,] 3 3
## [249,] 3 3
## [250,] 3 3
## [251,] 3 3
## [252,] 3 3
## [253,] 3 3
## [254,] 3 3
## [255,] 3 3
## [256,] 3 3
## [257,] 3 3
## [258,] 3 3
## [259,] 3 3
## [260,] 3 3
## [261,] 3 3
## [262,] 3 3
## [263,] 3 3
## [264,] 3 3
## [265,] 3 3
## [266,] 3 3
## [267,] 3 3
## [268,] 3 3
## [269,] 3 3
## [270,] 3 3
## [271,] 3 3
## [272,] 3 3
## [273,] 3 3
## [274,] 3 3
## [275,] 3 3
## [276,] 3 3
## [277,] 3 3
## [278,] 3 3
## [279,] 3 3
## [280,] 3 3
## [281,] 3 3
## [282,] 3 3
## [283,] 3 3
## [284,] 3 3
## [285,] 3 3
## [286,] 3 3
## [287,] 3 3
## [288,] 3 3
## [289,] 3 3
## [290,] 3 3
## [291,] 3 3
## [292,] 3 3
## [293,] 3 3
## [294,] 3 3
## [295,] 3 3
## [296,] 3 3
## [297,] 3 3
## [298,] 3 3
## [299,] 3 3
## [300,] 3 3
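As a preview of the out-of-sample prediction step mentioned above, here is a minimal sketch of how the trained parameters could be used to classify new points. The coordinates below are made up purely for illustration; the sketch simply reuses the layer functions and the final l1_params/l2_params from the training loop.
#Minimal sketch: classify a few new (made-up) points with the trained parameters
new_X = matrix(c( 0.10,  0.20,
                 -0.50,  0.40,
                  0.60, -0.30),
               ncol = 2, byrow = TRUE)
new_l1 = layer_dense$forward(inputs = new_X, n_neurons = 64, parameters = l1_params)
new_relu = activation_ReLU$forward(input_layer = new_l1$output)
new_l2 = layer_dense$forward(inputs = new_relu$output, n_neurons = 3, parameters = l2_params)
new_probs = activation_Softmax$forward(new_l2$output)
max.col(new_probs, ties.method = "random") #predicted class (1, 2, or 3) for each new point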