An example of how Neural Networks are excellent function approximators.
Why neural networks are so good and why we should care.
Neural networks need no introduction. Their popularity has risen steadily over the past decade, as they have outperformed other machine learning models and achieved impressive feats in domains of human intelligence such as speech, language and vision. Their magnum opus, however, has been powering the transformer architecture (Vaswani et al., 2017) that gave rise to LLMs such as GPT (Brown et al., 2020) and diffusion models such as Stable Diffusion (Rombach et al., 2022). But a fundamental question, and an important one at that, remains: why are neural networks so good, and why do they work so well?
As it turns out, one major reason, and the motivation behind this article, is the ability of a neural network to approximate any arbitrary function.
A function, mathematically, is an expression, rule, or law that defines a relationship between one variable (the independent variable) and another (the dependent variable). One such example is addition. Addition is a fairly simple function to comprehend: it takes a given set of numerical inputs and gives, as the output, the sum of those inputs.
1 + 1 = 2
In the above case, the relationship between the inputs (1, 1) and the output 2 is addition. It is deterministic, meaning that it always produces the same output (1 + 1 will always be 2). We represent this in software using the + operator, and in doing so we tell the program what to do to get the output of 1 + 1.
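In code, this rule-based approach is a one-liner; we spell out the operation ourselves rather than learn it:

# The explicit rule: we tell the program exactly what to do
def add(x, y):
    return x + y

print(add(1, 1))  # always 2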
But what if, instead of telling the program what to do to add two numbers, we taught it how to add using a neural network? Let's explore how.
Consider a simple program made up of a neural network. Our aim is to teach it how to add two numbers without explicitly telling it how to. As our training dataset, we will use 1,000 samples, each containing two numbers; our labels will be the sum of those two numbers.
# Imports needed for the data preparation
import numpy as np
import torch

X_train = np.random.rand(1000, 2).astype(np.float32)  # 1000 samples of 2 numbers each
y_train = np.sum(X_train, axis=1, keepdims=True).astype(np.float32)  # Sum of the two numbers

class Dataset(torch.utils.data.Dataset):
    def __init__(self, X_train, y_train):
        super().__init__()
        self.X_train = X_train
        self.y_train = y_train

    def __len__(self):
        return len(self.X_train)

    def __getitem__(self, idx):
        return torch.FloatTensor(self.X_train[idx]), self.y_train[idx]

dataset = Dataset(X_train, y_train)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=200, shuffle=True)
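As a quick sanity check (not part of training), we can peek at a single batch from the dataloader defined above to see the shapes the model will receive:

X_batch, y_batch = next(iter(dataloader))
print(X_batch.shape)  # torch.Size([200, 2]): a batch of input pairs
print(y_batch.shape)  # torch.Size([200, 1]): their sums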
Our objective is simple: we give the model two numerical inputs and the summed output as the label to be predicted, and let it figure out the relationship between the two. We iterate through this process, showing it many different sets of numbers and their summed outputs, and with each iteration it improves, a process we call training.
The code to implement the training is as follows:
# We import the necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

# We define our model
class AdditionModel(nn.Module):
    def __init__(self):
        super(AdditionModel, self).__init__()
        self.linear = nn.Linear(2, 1)  # A single linear unit: two inputs, one output

    def forward(self, x):
        return self.linear(x)

model = AdditionModel()

# We specify our loss metric and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.1)

# We will collect the losses to visualize later
losses = []
epochs = []
num_epochs = 100  # The number of passes over the full dataset

# We define our training loop
for epoch in range(num_epochs):
    total_loss = 0
    for batch in dataloader:
        X_train_tensor, y_train_tensor = batch
        outputs = model(X_train_tensor)            # Forward pass: predict the sums
        loss = criterion(outputs, y_train_tensor)  # Compare predictions with the true sums
        total_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()    # Backpropagate the error
        optimizer.step()   # Update the weights
    total_loss /= len(dataloader)
    losses.append(total_loss)
    epochs.append(epoch)
    if epoch % 10 == 0:
        print(f"Epoch {epoch+1}, Loss: {total_loss}")
Epoch 1, Loss: 0.03447453267872334
Epoch 11, Loss: 7.138949513318948e-05
Epoch 21, Loss: 7.14224569264843e-07
Epoch 31, Loss: 1.628855395008344e-09
Epoch 41, Loss: 1.501077821070007e-11
Epoch 51, Loss: 2.109467643423256e-13
Epoch 61, Loss: 1.1469381251216857e-14
Epoch 71, Loss: 1.1102229998097382e-19
Epoch 81, Loss: 0.0
Epoch 91, Loss: 0.0
The model starts off poorly, but after 100 rounds of training it becomes remarkably good at predicting the sum of two numbers.
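Since we collected the per-epoch losses, we can also plot the training curve using the matplotlib import from the training code; a minimal sketch:

plt.plot(epochs, losses)
plt.xlabel("Epoch")
plt.ylabel("Mean training loss (MSE)")
plt.title("Training loss of the addition model")
plt.show()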
If we test the model:
def test(a, b):
    X_test = torch.FloatTensor([a, b])
    y_pred = model(X_test)
    return y_pred.detach().numpy()[0]

test(9, 90)
# Output
99.0
Our program just learnt addition!
So we trained a fancy neural network to do something we could have done with a simple + operator. Why?
Well, the benefits of what we have just done become apparent when we extrapolate. It might not be very useful or practical to train a model to learn a function we already know, like adding 9 and 90; but it starts to make more sense to follow the same process to learn one we don't know, like a function that maps an image to its description (Karpathy & Fei-Fei, 2015), or one that takes in a word and predicts the next word; a hypothetical sketch of such a setup follows below.
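The recipe for those harder problems stays the same; only the model body and the data change. As a purely hypothetical sketch (the class name, layer sizes, and feature counts below are illustrative and not part of the code above), a slightly deeper network with a nonlinearity can represent far more complicated input-to-output relationships than our single linear unit:

import torch.nn as nn

# Hypothetical sketch: a small nonlinear network for a function we cannot write down by hand.
# in_features, out_features and hidden are illustrative placeholders.
class GenericApproximator(nn.Module):
    def __init__(self, in_features, out_features, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),   # project the inputs into a hidden representation
            nn.ReLU(),                        # the nonlinearity lets the model fit curves, not just lines
            nn.Linear(hidden, out_features),  # map the hidden representation to the output
        )

    def forward(self, x):
        return self.net(x)

Training it would follow exactly the same loop as before: feed inputs, compare predictions against labels with a loss, backpropagate, and update the weights.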
The premise is this: for every input or set of inputs, there is a function that maps it to the output, and a neural network can learn that function given enough examples. In our addition example, the function was known, but in the real world it might not be. We might want to learn a function that maps financial transactions to fraudulent behaviour, or one that maps audio to its transcription. What we have to do is structure the neural network to take in the input (say, audio, as in the latter example) and output words. These will be random and senseless at first, but many examples and iterations later, the model becomes remarkably good. Just as in our addition example, the model starts off with a random initialization of its weights:
# Model weights before training
print(list(model.parameters())[0][0].detach().numpy())
# output
array([0.34894478, 0.05129533], dtype=float32)
and, after training, learns the transformation below:
# Model weights after training
print(list(model.parameters())[0][0].detach().numpy())
# output
array([1., 1.], dtype=float32)
Putting our linear algebra hats on, we realize that the transformation [1, 1] applied to any (1 x 2) vector amounts to adding its two entries. So given the numbers 9 and 90:
[9, 90] · [1, 1] = (9 × 1) + (90 × 1) = 99
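We can confirm this tiny dot product with numpy:

import numpy as np

print(np.dot([9, 90], [1, 1]))  # 99

(The trained layer also has a bias term; since the loss reaches zero with weights of exactly [1, 1], that bias has effectively been driven to zero, so the weights alone account for the addition.)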
In conclusion, the power of neural networks lies in their ability to learn virtually any function, from simple ones that we know and can describe to complex ones that are hard to describe, making them a powerful and general-purpose learning tool.
REFERENCES
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877–1901.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684–10695).
Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3128–3137).