The Modified National Institute of Standards and Technology (MNIST) dataset is one of the most widely used datasets in machine learning thanks to its availability, manageable size, and the fact that it is relatively easy to solve. Because of these factors, I picked this problem for my first foray into what I hope will be a long list of deep learning projects.

Traditionally, the MNIST dataset includes 60,000 28x28 images of handwritten digits, each with an accompanying label, for training, as well as 10,000 unlabeled images for testing. However, the dataset I found here had 42,000 training images and 28,000 test images. This should still be plenty.

After downloading the dataset we can see that the training data is split into 10 directories, each named after the label of the images it contains. Using the Python Imaging Library (PIL) and the os module we can load each training image and its label into memory.


import numpy as np
from os import listdir
import PIL.Image

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F


use_cuda = torch.cuda.is_available()
device = torch.device('cuda:0' if use_cuda else 'cpu')

data_path = 'data/mnist'
trn_data_path = f'{data_path}/train'
test_data_path = f'{data_path}/test'


# Hold out roughly 20% of the training images for validation
temp_train_data = []
temp_val_data = []
for label in listdir(trn_data_path):
    for jpg in listdir(trn_data_path+'/'+label):
        file_path = trn_data_path + '/' + label + '/' + jpg
        img = PIL.Image.open(file_path)
        # Flatten the 28x28 image and scale pixel values into the range [0.01, 1.0]
        data_arr = np.asarray(img).flatten() / 255.0 * 0.99 + 0.01
        data_arr = torch.FloatTensor([data_arr]).to(device)
        # One-hot style target: 0.99 for the correct digit, 0.01 everywhere else
        label_arr = np.zeros(10) + 0.01
        label_arr[int(label)] = 0.99
        label_arr = torch.FloatTensor([label_arr]).to(device)
        if np.random.choice(100) > 19:
            temp_train_data.append([data_arr, label_arr])
        else:
            temp_val_data.append([data_arr, label_arr])
          

Now temp_train_data and temp_val_data contain all of our training images and their labels. By allocating ~20% of our training data to temp_val_data we are following a machine learning best practice that lets us check whether our model generalizes well or is over-fitting: the validation data is set aside and the model never trains on it. Before we can start training, though, we need to concatenate these lists of per-image tensors into single data and label tensors that we can feed to the model.


# Concatenate the per-image tensors into (N, 784) data tensors
# and (N, 10) label tensors
train_data = torch.cat([x[0] for x in temp_train_data])
train_labels = torch.cat([x[1] for x in temp_train_data])

validation_data = torch.cat([x[0] for x in temp_val_data])
validation_labels = torch.cat([x[1] for x in temp_val_data])

del temp_train_data
del temp_val_data
          

We concatenate each list of tensors together, giving us two data tensors and two label tensors.
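
As a quick sanity check we can print the shapes of the tensors we just built. The row counts shown in the comments are only illustrative, since the 80/20 split is random, but the second dimension should always be 784 for the data and 10 for the labels.


print(train_data.shape, train_labels.shape)            # e.g. torch.Size([33600, 784]) and torch.Size([33600, 10])
print(validation_data.shape, validation_labels.shape)  # e.g. torch.Size([8400, 784]) and torch.Size([8400, 10])
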

Now that we have our data set up properly we can define the model we'd like to train using PyTorch.


class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.l1 = nn.Linear(784, 300)
        self.l2 = nn.Linear(300, 10)
        
    def forward(self, x):
        x = F.relu(self.l1(x))
        x = F.dropout(x, p=0.6, training=self.training)
        x = torch.sigmoid(self.l2(x))
        return x
          

We'll use a simple shallow network that consists of four layers: an input layer, a hidden layer, a dropout layer, and an output layer. The input layer accepts a tensor of size 1x784, a flattened version of our 28x28 images. The dropout layer helps with generalization by randomly zeroing ~60% of the hidden layer's activations, but only during training. The output layer condenses the 300 hidden activations down to a 1x10 array whose largest value represents the digit the model thinks is correct.
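
To see the shapes involved, here is a minimal sketch of a forward pass through an untrained instance of the network. The random input is purely illustrative and simply stands in for one flattened image.


net = Network().to(device)
dummy = torch.rand(1, 784).to(device)  # stands in for one flattened 28x28 image

net.train()
print(net(dummy).shape)  # torch.Size([1, 10]); dropout is active in training mode

net.eval()
print(net(dummy).shape)  # torch.Size([1, 10]); dropout is disabled in evaluation mode
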

Now that we have our model defined we just need to set up our training loop.


def train(train_data, train_labels, validation_data, validation_labels, batch_size):
    # Note: batch_size here is the number of batches the data is split into,
    # not the number of samples per batch.
    train_loss = []
    val_loss = []
    train_acc = 0
    val_acc = 0

    # Training pass
    model.train()
    for bs in range(batch_size):
        batch_start = int(len(train_data)/batch_size*bs)
        batch_end = int(len(train_data)/batch_size*(bs+1))
        prediction = model(train_data[batch_start:batch_end])
        actual = train_labels[batch_start:batch_end]
        loss = F.mse_loss(prediction, actual)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_acc += (prediction.argmax(1) == actual.argmax(1)).cpu().detach().numpy().sum()
        train_loss.append(loss.cpu().detach().numpy())
    train_loss = np.asarray(train_loss).mean()
    train_acc = train_acc / len(train_data)
    print('Epoch {} training loss: {}'.format(epoch, train_loss))
    print('Epoch {} training accuracy: {}'.format(epoch, train_acc))

    # Validation pass: dropout disabled and no gradients tracked
    model.eval()
    with torch.no_grad():
        for bs in range(batch_size):
            batch_start = int(len(validation_data)/batch_size*bs)
            batch_end = int(len(validation_data)/batch_size*(bs+1))
            prediction = model(validation_data[batch_start:batch_end])
            actual = validation_labels[batch_start:batch_end]
            loss = F.mse_loss(prediction, actual)
            val_acc += (prediction.argmax(1) == actual.argmax(1)).cpu().detach().numpy().sum()
            val_loss.append(loss.cpu().detach().numpy())
    val_loss = np.asarray(val_loss).mean()
    val_acc = val_acc / len(validation_data)
    print('Epoch {} validation loss: {}'.format(epoch, val_loss))
    print('Epoch {} validation accuracy: {}\n'.format(epoch, val_acc))
          

The function defined above divides our training and validation data into the number of batches specified by the batch_size argument and calculates the loss for each batch using mean squared error, which penalizes larger errors more harshly. After backpropagation the training loop prints the loss averaged over all batches and the overall accuracy for the epoch.
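
As a small illustration of how mean squared error behaves, here is a purely hypothetical example comparing two candidate outputs against a one-hot style target for the digit 3. Because the errors are squared, the prediction that puts high confidence on the wrong digit is penalized far more than the one that is only slightly off.


# Hypothetical target and two candidate predictions for the digit 3
target = torch.FloatTensor([[0.01, 0.01, 0.01, 0.99, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]])
good   = torch.FloatTensor([[0.05, 0.02, 0.03, 0.90, 0.04, 0.02, 0.01, 0.03, 0.02, 0.01]])
bad    = torch.FloatTensor([[0.05, 0.02, 0.03, 0.20, 0.04, 0.02, 0.01, 0.90, 0.02, 0.01]])

print(F.mse_loss(good, target))  # small loss: close to the target everywhere
print(F.mse_loss(bad, target))   # much larger loss: the big errors on digits 3 and 7 are squared
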

All that's left to do now is to train.


model = Network().to(device)
for epoch in range(100):
    # Manually step the learning rate down as training progresses
    LR = 0.001
    if epoch >= 15 and epoch < 30:
        LR = 0.0001
    elif epoch >= 40 and epoch < 60:
        LR = 0.00001
    elif epoch >= 60:
        LR = 0.000005
    optimizer = optim.Adam(model.parameters(), lr=LR)
    train(train_data, train_labels, validation_data, validation_labels, batch_size=500)
          

After 100 epochs I saw ~97% validation accuracy.


Epoch 99 training loss: 0.004727411083877087
Epoch 99 training accuracy: 0.9775373304400384
Epoch 99 validation loss: 0.005019115284085274
Epoch 99 validation accuracy: 0.9723252382661403
          

You can find a slightly more fleshed-out version of this project on my GitHub, where we further increase accuracy by augmenting the dataset with some simple image manipulation.
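
As a taste of what that augmentation might look like, here is a minimal sketch, not the exact code from the repository, that creates slightly rotated copies of a PIL image so each copy can be appended to the training set with the original label.


# Hypothetical augmentation helper: rotate an image a few degrees in each direction
# and preprocess the copies the same way as the original training images.
def augment(img, label_arr, angles=(-10, 10)):
    samples = []
    for angle in angles:
        rotated = img.rotate(angle)  # small rotation; empty corners are filled with black
        data_arr = np.asarray(rotated).flatten() / 255.0 * 0.99 + 0.01
        samples.append([torch.FloatTensor([data_arr]).to(device), label_arr])
    return samples
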