MNIST image recognition using Deep Feed Forward Network


Deep Feed Forward Neural Network is one of the types of Artificial Neural Networks, and it is also able to classify computer images. In order to feed pixel data (RGB, greyscale or another format) into the neural net, one can map every pixel to a network input. That means every pixel becomes a feature. It may sound scary and highly inefficient to feed, let's say, a 28 by 28 pixel image, which is 784 features, to learn from. However, neural networks can learn from pixel data successfully and classify unseen data. We are going to prove this.
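As a tiny illustration in plain Scala (independent of any library, with made-up pixel values), flattening a 28 x 28 image into a 784-element feature vector is a one-liner:

// A hypothetical 28 x 28 greyscale image: each value is a pixel intensity in 0..255
val image: Array[Array[Int]] = Array.fill(28, 28)(0)

// Every pixel becomes one input feature: 28 * 28 = 784 features
val features: Array[Int] = image.flatten
assert(features.length == 784)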

Please note that there are other types of networks which are more efficient at image classification, such as Convolutional Neural Networks, but we are going to talk about those next time.

Dataset

(Image: MNIST examples, from Wikipedia)

The MNIST dataset is the "Hello, World!" dataset of Deep Learning. It consists of thousands of grey-scale images which represent hand-written digits from 0 to 9, so 10 labels. This dataset is used by many researchers in the field to evaluate their discoveries and test them on a well-known dataset. However, the MNIST dataset should not be treated as a panacea. There are other public image datasets, such as ImageNet, which are more advanced as they contain many more objects than just hand-written digits. Nevertheless, MNIST made an important contribution to the history of Deep Learning and still helps people to learn this field by playing with it.

Loading Data

The MNIST dataset can be taken from Yann LeCun's web-site: http://yann.lecun.com/exdb/mnist/. If it is unavailable, you can easily find a copy of this dataset in numerous GitHub repositories, since it is not big in size (for example here). I have downloaded the following 4 archives and put them into the images folder.

9.5M train-images-idx3-ubyte.gz
28K  train-labels-idx1-ubyte.gz

1.6M t10k-images-idx3-ubyte.gz
4.4K t10k-labels-idx1-ubyte.gz

The first two files are the training dataset. The bigger file contains images and the smaller one contains labels. There are 60000 training images and labels for them. The next two files are for model testing and follow the same concept (images, labels). There are 10000 testing images and labels.

In order to load these files into memory we need to follow the MNIST file format specification. For each file we need to:

  1. Read the first magic number and compare it with the expected MNIST number, which is:
val LabelFileMagicNumber = 2049
val ImageFileMagicNumber = 2051
  2. Read the next number to get the number of images (or labels)
  3. For the image file, read the next two numbers to get the number of rows and columns per image
  4. Read images and labels in a loop based on these counts

We are going to implement MNIST classification on top of the existing mini-library for Deep Learning. Here is how we can load the MNIST dataset:

import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

import java.io.{DataInputStream, BufferedInputStream, FileInputStream}
import java.nio.file.{Files, Path}
import java.util.zip.GZIPInputStream

def loadDataset[T: ClassTag](images: Path, labels: Path)(
    using n: Numeric[T]
  ): (Tensor[T], Tensor[T]) =

  val imageStream = new GZIPInputStream(Files.newInputStream(images))
  val imageInputStream = new DataInputStream(new BufferedInputStream(imageStream))
  val magicNumber = imageInputStream.readInt()
  
  assert(magicNumber == ImageFileMagicNumber,
    s"Images magic number is incorrect, expected $ImageFileMagicNumber, but was $magicNumber")

  val numberOfImages = imageInputStream.readInt()
  val (nRows, nCols) = (imageInputStream.readInt(), imageInputStream.readInt())

  val labelStream = new GZIPInputStream(Files.newInputStream(labels))
  val labelInputStream = new DataInputStream(new BufferedInputStream(labelStream))  
  val labelMagicNumber = labelInputStream.readInt()

  assert(labelMagicNumber == LabelFileMagicNumber,
    s"Labels magic number is incorrect, expected $LabelFileMagicNumber, but was $labelMagicNumber")

  val numberOfLabels = labelInputStream.readInt()

  assert(numberOfImages == numberOfLabels)
  
  val labelsTensor = labelInputStream.readAllBytes.map(l => n.fromInt(l)).as1D

  val singleImageSize = nRows * nCols
  val imageArray = ArrayBuffer.empty[Array[T]]

  for i <- (0 until numberOfImages) do
    val image = (0 until singleImageSize)
      .map(_ => n.fromInt(imageInputStream.readUnsignedByte())).toArray
    imageArray += image

  (imageArray.toArray.as2D, labelsTensor)

Preparing data

Before we construct a neural network to train it on MNIST dataset, we need to transform it a bit.

Feature normalisation

In order to be more efficient when learning weights, we need to scale the X data into the [0, 1] range. We know that every image is encoded as a 28 x 28 matrix of pixels. If we print one of the images to the console, with a line break after every 28th element, it will look like this:

0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0  84 185 159 151  60  36   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0 222 254 254 254 254 241 198 198 198 198 198 198 198 198 170  52   0   0   0   0   0   0
0   0   0   0   0   0  67 114  72 114 163 227 254 225 254 254 254 250 229 254 254 140   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0  17  66  14  67  67  67  59  21 236 254 106   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  83 253 209  18   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  22 233 255  83   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 129 254 238  44   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  59 249 254  62   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 133 254 187   5   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   9 205 248  58   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 126 254 182   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0  75 251 240  57   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0  19 221 254 166   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   3 203 254 219  35   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0  38 254 254  77   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0  31 224 254 115   1   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0 133 254 254  52   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0  61 242 254 254  52   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0 121 254 254 219  40   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0 121 254 207  18   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

The output above corresponds to the digit "7".

Data in the 0-255 numeric range will explode our network gradients if we do not apply any optimization technique to the gradient or weight values. The easiest way is to scale the input data.

First we load the data using the previously defined function, with one more case class as a wrapper:

val dataset = MnistLoader.loadData[Double]("images")

where dataset is wrapped into a case class:

case class MnistDataset[T: Numeric](
  trainImage: Tensor[T], 
  trainLabels: Tensor[T], 
  testImages: Tensor[T], 
  testLabels: Tensor[T])

Then we simply divide every value by 255, which gives us data in the [0, 1] range:

val xData = dataset.trainImage.map(_ / 255d)  

Target Encoding

Our model is going to predict one label over a multi-class dataset. In order to make our neural network predict something, we need to encode the label tensor with a One-Hot encoder, so that every scalar label becomes a vector of zeros with a single 1. The index of the 1 corresponds to the digit that this label stores.

MNIST labels are currently a vector of numbers, where each number is the label of a hand-written digit. For example:

[7,5,0,1]

Once we one-hot encode it, it will look like:

[0,0,0,0,0,0,0,1,0,0]
[0,0,0,0,0,1,0,0,0,0]
[1,0,0,0,0,0,0,0,0,0]
[0,1,0,0,0,0,0,0,0,0]
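For illustration, here is a minimal standalone sketch of one-hot encoding in plain Scala, independent of the mini-library's OneHotEncoder used below:

def oneHot(label: Int, numClasses: Int = 10): Array[Double] =
  Array.tabulate(numClasses)(i => if i == label then 1.0 else 0.0)

val labels = List(7, 5, 0, 1)
val encoded = labels.map(oneHot(_))
// encoded.head is Array(0, 0, 0, 0, 0, 0, 0, 1, 0, 0) as Doubles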

We can reuse the OneHotEncoder implemented earlier:

val encoder = OneHotEncoder(classes = (0 to 9).map(i => (i.toDouble, i.toDouble)).toMap)
val yData = encoder.transform(dataset.trainLabels.as1D)

Common preparation

Let's wrap both transformations into one function:

def prepareData(x: Tensor[Double], y: Tensor[Double]) =
  val xData = x.map(_ / 255d) // normalize to [0,1] range
  val yData = encoder.transform(y.as1D)
  (xData, yData)

Now we can call it like this:

val (xTrain, yTrain) = prepareData(dataset.trainImage, dataset.trainLabels)

Model construction

Our model is going to be designed/trained with:

  • nodes: 784 x 100 x 10
  • activation: ReLU, Softmax
  • loss: cross-entropy
  • accuracy: via argmax
  • initialisation: Kaiming
  • optimizer: Adam

val ann = Sequential[Double, Adam, HeNormal](
  crossEntropy,
  learningRate = 0.001,
  metrics = List(accuracy),
  batchSize = 128,
  gradientClipping = clipByValue(5.0d)
)
  .add(Dense(relu, 100))      
  .add(Dense(softmax, 10))

The Adam optimizer gets better results on MNIST data, so we stick with it rather than with standard Stochastic Gradient Descent.

Activations

We have already seen the ReLU activation function, but let's recall its definition:

def relu[T: ClassTag](x: Tensor[T])(using n: Numeric[T]): Tensor[T] =
  x.map(t => max(n.zero, t))

An important note here: it is applied element-wise, i.e. to every element of the z matrix in the layer.
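As a quick illustration in plain Scala (using arrays instead of the library's Tensor type), relu simply zeroes out negative entries:

val z = Array(Array(-1.0, 2.0), Array(3.0, -4.0))
val activated = z.map(_.map(v => math.max(0.0, v)))
// activated: Array(Array(0.0, 2.0), Array(3.0, 0.0))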

However, the softmax activation function is applied across the nodes of a layer to get probabilities which sum up to 1. This activation function is a typical choice for multi-class problems. When we feed an input data sample into the network, we want to get an output vector with a probability for each class.

Coming back to the MNIST target, the representation below shows that the most likely target value is the digit "4", because the highest probability, 0.5, is at index 4, and the whole vector sums up to 1:

scala> List(0.01, 0.2, 0.1, 0.1, 0.5, 0.01, 0.02, 0.03, 0.01, 0.02).sum
val res0: Double = 1.0

This is how we can implement softmax:

val toleration = castFromTo[Double, T](0.4E-15d)

def softmax(x: Tensor[T]): Tensor[T] =       
  val applied = x.mapRow { row =>
    val max = row.max        
    val expNorm = row.map(v => exp(v - max))         
    val sum = expNorm.sum        
    expNorm.map(_ / sum)
  }
  
  // rest is an extra defence against numeric overflow
  val appliedSum = applied.sumCols.map( v => 
    if v.abs - toleration > n.one 
    then v 
    else n.one
  )
  val totalSum = appliedSum.sumRows.as0D.data      
  assert(totalSum == x.length, 
    s"Softmax distribution sum is not equal to 1 at some activation, but\n${appliedSum}")
  
  applied

It is obviously more complicated than relu. This is what the above code is doing:

  1. For each row (Array[T]) of the x tensor, we find the max value and subtract it from each value of this row to get stable values in the vector. The reason to subtract the max is to avoid numeric overflow.
  2. Apply the exponent to each value right after the max subtraction.
  3. Sum the exponents.
  4. Finally, divide each value of the exponent vector by that sum.
  5. Additionally, we raise an error if the sum of the individual values in a vector is not equal to 1. Such a situation may happen due to numeric overflow. If it happens, then we may end up with an exploding gradient (and as a result a bad training outcome). However, we tolerate a numeric difference of 0.4E-15, i.e. the sum should be no more than 1.0000000000000004.
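To make these steps concrete, here is a standalone sketch of the same computation on a plain Array[Double] (no Tensor type and no overflow checks):

def softmaxRow(row: Array[Double]): Array[Double] =
  val max = row.max                          // 1. stabilise by subtracting the max
  val exps = row.map(v => math.exp(v - max)) // 2. exponentiate
  val sum = exps.sum                         // 3. sum of exponents
  exps.map(_ / sum)                          // 4. normalise to probabilities

val probs = softmaxRow(Array(1.0, 2.0, 3.0))
// probs ~ Array(0.09, 0.245, 0.665) and probs.sum ~ 1.0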

In order to perform back-propagation with gradient descent we also need the softmax derivative. This is the simplest version of the softmax derivative:

def derivative(x: Tensor[T]): Tensor[T] =       
  val sm = softmax(x)      
  sm.multiply(n.one - sm) // element-wise multiplication, NOT dot product

Loss function

Cross-entropy can then be used to calculate the difference between the two probability distributions and is a typical choice for multi-class classification. It can be written in code as:

def crossEntropy[T: ClassTag: Numeric](y: T, yHat: T): T = 
  y * log(yHat)

It will return some value as the difference. Example of input vectors:

some random yHat = [0.1, 0.1, 0, 0.8, ...] - it will be of length 10 in our MNIST case
        actual y = [0, 0, 0, 1, ........ ] - length 10
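As a standalone sketch in plain Scala (not the mini-library's crossEntropy), the loss for a single sample is the negative sum of y * log(yHat) over the classes; with the vectors above, only the index where y is 1 contributes:

def crossEntropyLoss(y: Array[Double], yHat: Array[Double]): Double =
  val eps = 1e-15 // guard against log(0)
  val perClass = y.zip(yHat).map((yi, pi) => yi * math.log(pi + eps))
  -perClass.sum

val y    = Array(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
val yHat = Array(0.1, 0.1, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
val loss = crossEntropyLoss(y, yHat) // ~ 0.223, i.e. -log(0.8)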

To get an idea why cross-entropy is a useful loss function for our problem, please have a look at this blog-post.

Accuracy

Before we calculate the number of correct predictions, we need to not just compare the y and yHat vectors, but first find the index of the max element in each of them.

So we need to help the existing algorithm extract the label value, i.e. the predicted digit, from the yHat vector. A function called argmax can be used for this task:

def accuracyMnist[T: ClassTag: Ordering](using n: Numeric[T]) = new Metric[T]:
  val name = "accuracy"
  
  def matches(actual: Tensor[T], predicted: Tensor[T]): Int =      
    val predictedArgMax = predicted.argMax      
    actual.argMax.equalRows(predictedArgMax)

val accuracy = accuracyMnist[Double]

Accuracy is a Metric type-class that has a matches method which returns the number of correct predictions.

The argMax itself is a generic tensor function:

def argMax[T: ClassTag](t: Tensor[T])(using n: Numeric[T]) =
  def maxIndex(a: Array[T]) = 
    n.fromInt(a.indices.maxBy(a))

  t match
    case Tensor2D(data) => Tensor1D(data.map(maxIndex))
    case Tensor1D(data) => Tensor0D(maxIndex(data))
    case Tensor0D(_) => t
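For intuition, here is what argmax and the accuracy count boil down to on plain arrays (a standalone sketch, not the mini-library's Tensor types):

val predictedProbs = Array(0.01, 0.2, 0.1, 0.1, 0.5, 0.01, 0.02, 0.03, 0.01, 0.02)
val predictedLabel = predictedProbs.indices.maxBy(predictedProbs(_)) // 4

// counting matches between actual and predicted labels in a small batch
val actualLabels    = Array(4, 7, 1)
val predictedLabels = Array(4, 7, 3)
val correct = actualLabels.zip(predictedLabels).count((a, p) => a == p) // 2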

Weight Initialisation

The weight initialisation approach is an important factor in Deep Learning: it helps model training converge faster and avoids vanishing or exploding gradients.

Kaiming (He) weight initialisation helps to address the above problems, so let's use it as well:

import scala.util.Random

given [T: ClassTag: Numeric]: ParamsInitializer[T, HeNormal] with
  val rnd = new Random()

  def gen(length: Int): T =
    castFromTo[Double, T]{
      val v = rnd.nextGaussian + 0.001d // value shift is optional
      v * math.sqrt(2d / length.toDouble)
    }

  override def weights(rows: Int, cols: Int): Tensor[T] =
    Tensor2D(Array.fill(rows)(Array.fill[T](cols)(gen(rows))))

  override def biases(length: Int): Tensor[T] = 
    zeros(length)

We initialise biases to zeros. Weight matrices are initialised using a random generator with normal distribution. Every random number is then multiplied by sqrt(2 / n), where n is the number of input nodes of the particular layer.
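As a standalone sketch of this idea in plain Scala (ignoring the ParamsInitializer type-class), generating one Kaiming-initialised weight matrix for a layer with a given number of input nodes could look like this:

import scala.util.Random

def heNormalWeights(rows: Int, cols: Int): Array[Array[Double]] =
  val rnd = new Random()
  val scale = math.sqrt(2.0 / rows) // rows = number of input nodes of this layer
  Array.fill(rows, cols)(rnd.nextGaussian() * scale)

// e.g. the 784 x 100 weight matrix of the first hidden layer
val w1 = heNormalWeights(784, 100)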

Model Training

Now we are ready to start training process.

val model = ann.train(xTrain, yTrain, epochs = 15, shuffle = true)
 epoch: 1/15, avg. loss: 0.04434336993179046, metrics: [accuracy: 0.8785666666666667]
 epoch: 2/15, avg. loss: 0.024939809896450383, metrics: [accuracy: 0.9350166666666667]
 epoch: 3/15, avg. loss: 0.02028075875579972, metrics: [accuracy: 0.9478833333333333]
 epoch: 4/15, avg. loss: 0.017196840063260558, metrics: [accuracy: 0.9560833333333333]
 epoch: 5/15, avg. loss: 0.01491209973340988, metrics: [accuracy: 0.9625666666666667]
 epoch: 6/15, avg. loss: 0.01350024657628137, metrics: [accuracy: 0.9671833333333333]
 epoch: 7/15, avg. loss: 0.01222168129663168, metrics: [accuracy: 0.9699]
 epoch: 8/15, avg. loss: 0.011222418180870991, metrics: [accuracy: 0.9729833333333333]
 epoch: 9/15, avg. loss: 0.010388172803460627, metrics: [accuracy: 0.9752833333333333]
 epoch: 10/15, avg. loss: 0.009549474708521941, metrics: [accuracy: 0.97765]
 epoch: 11/15, avg. loss: 0.008920235294999721, metrics: [accuracy: 0.9787]
 epoch: 12/15, avg. loss: 0.008214811390229967, metrics: [accuracy: 0.9806833333333334]
 epoch: 13/15, avg. loss: 0.0077112882811408694, metrics: [accuracy: 0.9824]
 epoch: 14/15, avg. loss: 0.0071559669134910325, metrics: [accuracy: 0.98405]
 epoch: 15/15, avg. loss: 0.006797865863855411, metrics: [accuracy: 0.9848]

We have achieved quite good accuracy during training: 98.4% correct predictions, i.e. a 1.6% error rate.

Model Testing

val (xTest, yTest) = prepareData(dataset.testImages, dataset.testLabels)
val testPredicted = model(xTest)
val value = accuracy(yTest, testPredicted)
println(s"test accuracy = $value")

Accuracy on test data is quite close to the train accuracy:

test accuracy = 0.9721

We can also try to run a single test on the first image from the test dataset:

val singleTestImage = dataset.testImages.as2D.data.head
val label = dataset.testLabels.as1D.data.head // this must be "7" 
val predicted = model(singleTestImage.as2D).argMax.as0D.data  

assert(label == predicted, 
  s"Predicted label is not equal to expected '$label' label, but was '$predicted'")

println(s"predicted = $predicted")
predicted = 7.0

Summary

We have seen that even one hidden layer is able to classify the MNIST dataset with a quite low error rate. Key takeaways when classifying images are:

  • make sure that the gradient does not explode or vanish. For that, we can use proper weight initialisation, gradient clipping by value or norm, or other weight normalisation techniques during training. Also scale or normalise the input data
  • use one-hot encoding for your target variable in case of multi-class classification
  • in case of single-label prediction, use the argmax function
  • use softmax activation at the last layer to distribute probabilities across classes.

Try a Convolutional Neural Network as the next step in image classification.

Links

  1. Source code of mini-library: deep-learning-scala
  2. MNIST Handwritten digits classification from scratch using Python Numpy
  3. softmax.py from eliben/deep-learning-samples