The Basics of a Neural Network Through the Eyes of a (Student) Mathematician

In this post, I’ll explain the fundamentals of a neural network in a way that a high school maths student, even with almost no computer science background, would be able to understand it as: an optimization problem.

So what is a neural network?

A neural network is a multivariable function that returns a multivariable answer.
To understand how this works, let’s start with a simple explanation.

In order to create the neural network, we need to identify a problem to solve. For machine learning models, the starting point is recognizing handwritten digits from the MNIST dataset, all numbers between 0 and 9.

All images are 28x28 pixel drawings with black backgrounds with grayscale drawings of numbers on top, so to make an input for our function, we’ll turn this 28x28 matrix of numbers into a 1x784 vector (meaning 1 row matrix), stacking each of the 28 rows next to each other:

        ______________________[
        |       28 columns
        |   [0,0,0,0,0,....0],
        |   [0,0,0,0,0,....0],
28 rows |   ...
        |   [0,0,0,0,0,....0]
        |_____________________]

        ||
        \/

        _____________________[
        |     784 columns
1 row   |   0,0,0,0,0,....0  
        |____________________]

In the eyes of the neural network, as long as the 28x28 matrix is mapped to this 1x784 matrix in the exact same way every time, these 2 matrices are just two forms of the same thing. One version is for the human to understand it as a grayscale image, and the other version is the format used by the neural network for calculations.

As for the calculations, how does the neural network arrive at its answer?

Neural Network Parts

In this part, I’ll explain the basic parts we put into a neural network for it to work.

Neural networks work by sequentially activating layers of artificial neurons to varying degrees. A final answer is reached after the network passes results through the last layer. A particular neuron performs 3 steps:

Compute a weighted sum of the values of all neurons it receives numbers from. A given neuron determines how to weight each of the neurons inputting into it to prioritize some neurons’ values over others.
Apply a bias to the weighted sum calculated which is just a positive or negative number added to the result. Biases allow neurons that output values that are really high to reduce their output value, or allow neurons that don’t output really high values to increase their base value.
Apply an activation function to the weighted sum + bias value. Activation functions tell how a given sum+bias combination can transmit a value to the next set of neurons. For example, the sigmoid function turns the real number line into values from -1 to 1, with more extreme values coming closer to -1 and 1. For this model, we’ll use the Relu activation function, which turns all negative sums into the number 0 but leaves positive numbers alone.

This is repeated for every layer of neurons one layer at a time until the final layer outputs its answer, which in this case gives a 1x10 matrix. The values of the 10 cells are tied to the number the model thinks is being shown.

Expected values
0 => [1,0,0,0,0,0,0,0,0,0]
1 => [0,1,0,0,0,0,0,0,0,0]
2 => [0,0,1,0,0,0,0,0,0,0]
...
9 => [0,0,0,0,0,0,0,0,0,1]


What we get when we start running a untrained version of the network:

model => [0.456, 0.675, 0.986, 0.67...]

This jumbled mess of values is the beginning of turning this computer science question into a math one.

But How Is This Optimization?

In calculus classes, students learn that optimization problems are problems where derivatives are used to minimize or maximize the value of a single number through derivatives. In order to turn our training session into an optimization problem, we need to turn 10 numbers into 1. Our answer: the error function.

Error(expected_matrix, output_matrix) =
   squarederror = 0
   for each cell in the same position in both matrices a:
   add (output_matrix[a] - expected_matrix[0])^2
   return squarederror

This error function should be as close to 0 as possible (in our case, above 1), but it starts out very high, since the model is made of random values for its biases and weights. To know how to modify the neural network we can take the gradient of the function Error(expected_value, neuralnetwork(neuron_values, input_matrix)) with respect to the values of the neurons in the current iteration of the neurons’ values. The gradient will return a matrix that gives the values needed to adjust the current neurons, but using these values directly will increase the error, not reduce it! Simply flip the signs to go in the direction that makes the error go down.

Game Plan to Make Our Neural Network

Before I show you how to make a neural network in python, let’s go over the key steps to this process (when understood, you can translate these steps to other coding languages):

Get your dataset. For this project, I’ll be using the MNIST dataset as shown above, but this will work for any dataset that you can translate into arrays that your neural network can read. You’ll also need a way to translate the answers in the dataset (given as integers) into arrays to use in the error function. I showed a method two sections ago.
Make your starting neural network. You’ll not only need to know the size of your input matrix and output matrix, but also decide how many layers and neurons per layer that you want.
Write the function that takes in your input matrix and neuron values to make the output matrix.
Write the function that trains your neural network. This has three parts: get the gradient matrices for a set number of randomly chosen inputs, take the average of those matrices to get one that approximates the best possible shift, and then makes the following adjustment.
Test the model! It is good practice to separate training data from testing data to see how the model does with something new.

Making the Neural Network in Python

Before we begin, remember this exercise is merely a exercise in seeing how neural networks learn, and there are FAR better models that use libraries like pytorch to train faster, but I feel that seeing the inner workings of a basic model makes it easier to understand much more complicated concepts in that same field.

To begin, import the following libraries:

import numpy as np
import math
import csv
import random as r
import numdifftools as nd

Getting the dataset:

labels = []
data = []
with open('./sample_data/mnist_train_small.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in spamreader:
      unfiltered_data = row[0].split(",")
      unfiltered_data = [float(i) for i in unfiltered_data]
      label = [0 for i in range(10)]
      label[int(unfiltered_data.pop(0))] = 1
      labels.append(label)
      data.append(unfiltered_data)

This turns the csv into easy to use arrays that we can input into our model.

Making the starting neural network:

#creating the brain
layers = 2 # all layers have the same number of neurons
neurons = 2

input_size = 28**2 #size of the images from MNIST
output_size = 10 # 0 - 9

relu = lambda x: x if x>0 else 0 #our activation function

# Turns our 1d array into a many dimensional array that we need for our network
def _1d_brain(x): 
  global index
  index = -1
  #print(index)
  def next():
    global index
    index = index + 1
    #print(x[index])
    return x[index]

  brain = []
  brain.append([[[next() for i in range(input_size)],next()] for i in range(neurons)])
  #print("reach")
  for i in range(1,layers):
    brain.append([])
    for j in range(neurons):
      brain[i].append([[next() for i in range(neurons)],next()])
  #print("reach")
  brain.append([[[next() for i in range(neurons)],next()] for i in range(output_size)])
  return brain

# Turns our many dimensional array into a 1d array
def flatten(data): 
  flat_list = []
  for item in data:
    if isinstance(item, (list, tuple)):
      flat_list.extend(flatten(item))
    else:
      flat_list.append(item)
  return flat_list


# matrix multiplication for a single cell, this can definitely be optimized
# by using once operation to return an entire matrix
def matmul(a,b):
  return sum([a[i]*b[i] for i in range(len(a))]) if len(a) == len(b) else None


# This is how we will get a result from our neural network

#count = 0
def runbrain(brain, input):
  #global count
  #count+=1
  #if count%10000 == 0:
    #print(count)
  isfirst = True
  for layer in range(len(brain)-1):
    if isfirst:
      #print(len(input))
      output = [relu(matmul(input,neuron[0]) + neuron[1]) for neuron in brain[layer]]
      isfirst = False
    else:
      output = [relu(matmul(output,neuron[0]) + neuron[1]) for neuron in brain[layer]]
  return [relu(matmul(output,neuron[0]) + neuron[1]) for neuron in brain[-1]]

# error function
def error(output, expected):
  #print(output)
  #print(expected)
  output = [i/sum(output) for i in output]
  return sum([(output[i]-expected[i])**2 for i in range(len(output))])

# 2 functions in 1
def run1dbrainerror(brain,index):
   return error(runbrain(_1d_brain(brain.tolist()), data[index]),labels[index])


# make the brain
brain =  _1d_brain([r.uniform(-1,1) for i in range((input_size+1)*neurons + neurons*(neurons+1)*(layers-1) + (neurons+1)*output_size)])
print(brain)

All of this is the work that allows us to run a neural network, now we just need to train it:

learning_rate = 0.1
for i in range(10):
  sum_grad = [0 for i in range(len(flatten(brain)))]
  for j in range(10):
    input = r.randint(0,len(data)-1)
    print(error(runbrain(brain,data[input]),labels[input]))
    print(i,j)
    def fun(brain):
      return run1dbrainerror(brain,input)
    grad2 = nd.Gradient(fun)(flatten(brain))
    sum_grad = [sum_grad[k] + grad2[k] for k in range(len(grad2))]
  sum_grad = [i/10 for i in sum_grad]
  flatbrain = flatten(brain)
  for i in range(len(flatbrain)):
    flatbrain[i] -= learning_rate*sum_grad[i]
  brain = _1d_brain(flatbrain)
  print(brain)

nd.Gradient runs our neural network a couple thousand times (because of the 1000s of variables in the neural network) to find the gradient at a particular point, so it’s important that our runbrain function runs as fast as possible. If I were to reprogram this, I would use some linear algebra tricks to perform a single matrix multiplication and addition per layer and speed this up.

But this is it! Now you have made (or copied, it’s okay as long as you understand it) your own little neural network that tries to recognize digits. At the end of this post, I’ve made my own Github repository so you can see this neural network in action!