MNIST image recognition using Deep Feed Forward Network
A Deep Feed Forward Neural Network is one of the types of Artificial Neural Networks, and it is also able to classify computer images. In order to feed pixel data into the neural net, whether in RGB, greyscale or another format, we can map every pixel to a network input. That means every pixel becomes a feature. It may sound scary and highly inefficient to feed, let's say, a 28 by 28 pixel image, which is 784 features to learn from. However, neural networks can learn from pixel data successfully and classify unseen data. We are going to prove this.
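To make this concrete, here is a minimal sketch of the idea using plain Scala arrays (not the library's tensor types): a 28 x 28 image is flattened row by row into a single vector of 784 input features.

// a 28 x 28 greyscale image as a matrix of pixel intensities (0-255)
val image: Array[Array[Int]] = Array.fill(28, 28)(0)

// flatten it row by row: every pixel becomes one input feature of the network
val features: Array[Int] = image.flatten

assert(features.length == 28 * 28) // 784 inputs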
Please note, there are other types of networks which are more efficient at image classification, such as the Convolutional Neural Network, but we are going to talk about that next time.
Dataset
MNIST dataset is a "Hello, World!" dataset in the field of Deep Learning. It consists of thousands of grey-scaled images which represent hand-written digits from 0 to 9, so 10 labels. This dataset is used by many researches in the field to evaluate their discoveries and test that on well-known dataset. However, MNIST dataset should not be a panacea. There are other public datasets with images like ImageNet, AlexNet, etc., which are more advanced as they have more objects than just hand-written digits. Nevertheless, MNIST made important contribution to the history of Deep Learning and still helps people to learn this field by playing with this dataset.
Loading Data
The MNIST dataset can be taken from Yann LeCun's web-site: http://yann.lecun.com/exdb/mnist/. If it is unavailable, you can easily find a copy of this dataset in numerous GitHub repositories, since it is not big in size (for example here). I have downloaded the following 4 archives and put them into the folder images:
9.5M train-images-idx3-ubyte.gz
28K train-labels-idx1-ubyte.gz
1.6M t10k-images-idx3-ubyte.gz
4.4K t10k-labels-idx1-ubyte.gz
The first two files are the training dataset: the bigger file contains the images and the smaller one the labels. There are 60,000 training images and labels for them. The next two files are for model testing and follow the same concept (images, labels). There are 10,000 test images and labels.
In order to load these files into memory we need to follow the MNIST file format specification. For each file we need to:
- Read the first magic number and compare it with the expected MNIST value, which is:
val LabelFileMagicNumber = 2049
val ImageFileMagicNumber = 2051
- Read the next number, which is the number of images (or labels)
- For the image file, read the next two numbers for the number of rows and columns
- Read the images and labels in a loop based on those counts
We are going to implement MNIST classification on top of the existing mini-library for Deep Learning. Here is how we can load the MNIST dataset:
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag
import java.io.{DataInputStream, BufferedInputStream, FileInputStream}
import java.nio.file.{Files, Path}
import java.util.zip.GZIPInputStream
def loadDataset[T: ClassTag](images: Path, labels: Path)(
    using n: Numeric[T]
): (Tensor[T], Tensor[T]) =
  val imageStream = new GZIPInputStream(Files.newInputStream(images))
  val imageInputStream = new DataInputStream(new BufferedInputStream(imageStream))
  val magicNumber = imageInputStream.readInt()
  assert(magicNumber == ImageFileMagicNumber,
    s"Images magic number is incorrect, expected $ImageFileMagicNumber, but was $magicNumber")

  val numberOfImages = imageInputStream.readInt()
  val (nRows, nCols) = (imageInputStream.readInt(), imageInputStream.readInt())

  val labelStream = new GZIPInputStream(Files.newInputStream(labels))
  val labelInputStream = new DataInputStream(new BufferedInputStream(labelStream))
  val labelMagicNumber = labelInputStream.readInt()
  assert(labelMagicNumber == LabelFileMagicNumber,
    s"Labels magic number is incorrect, expected $LabelFileMagicNumber, but was $labelMagicNumber")

  val numberOfLabels = labelInputStream.readInt()
  assert(numberOfImages == numberOfLabels)

  val labelsTensor = labelInputStream.readAllBytes.map(l => n.fromInt(l)).as1D

  val singleImageSize = nRows * nCols
  val imageArray = ArrayBuffer.empty[Array[T]]
  for i <- (0 until numberOfImages) do
    val image = (0 until singleImageSize)
      .map(_ => n.fromInt(imageInputStream.readUnsignedByte())).toArray
    imageArray += image

  (imageArray.toArray.as2D, labelsTensor)
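For reference, this is roughly how the function could be called for the training archives listed above (a usage sketch; the MnistLoader.loadData helper used later presumably wraps calls like this for both the train and test files):

val (trainImages, trainLabels) = loadDataset[Double](
  Path.of("images/train-images-idx3-ubyte.gz"),
  Path.of("images/train-labels-idx1-ubyte.gz")
)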
Preparing data
Before we construct a neural network and train it on the MNIST dataset, we need to transform the data a bit.
Feature normalisation
In order to learn the weights more efficiently, we need to scale the X data into the [0, 1] range. We know that every image is encoded as a 28 x 28 matrix of pixels. If we print one image to the console with a line break after every 28th element, it will look like this:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 84 185 159 151 60 36 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 222 254 254 254 254 241 198 198 198 198 198 198 198 198 170 52 0 0 0 0 0 0
0 0 0 0 0 0 67 114 72 114 163 227 254 225 254 254 254 250 229 254 254 140 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 17 66 14 67 67 67 59 21 236 254 106 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 83 253 209 18 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 22 233 255 83 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 129 254 238 44 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 59 249 254 62 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 133 254 187 5 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 205 248 58 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 126 254 182 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 75 251 240 57 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 19 221 254 166 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 3 203 254 219 35 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 38 254 254 77 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 31 224 254 115 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 133 254 254 52 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 61 242 254 254 52 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 121 254 254 219 40 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 121 254 207 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The output above corresponds to the digit "7".
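Such a printout can be produced with a small helper like the one below (a sketch, assuming the image row is a plain Array[Double] of 784 pixel values):

def printImage(pixels: Array[Double]): Unit =
  // break the flattened 784-pixel vector into 28 rows of 28 values
  pixels.grouped(28).foreach(row => println(row.map(_.toInt).mkString(" ")))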
Data in the 0-255 numeric range can explode our network's gradients if we do not apply any optimisation technique to the gradient or weight values. The easiest way to avoid that is to scale the input data.
First we load the data using the previously defined function, with one more case class as a wrapper:
val dataset = MnistLoader.loadData[Double]("images")
where dataset is wrapped into a case class:
case class MnistDataset[T: Numeric](
trainImage: Tensor[T],
trainLabels: Tensor[T],
testImages: Tensor[T],
testLabels: Tensor[T])
Then we simply divide every value by 255, which gives us data in the [0, 1] range:
val xData = dataset.trainImage.map(_ / 255d)
Target Encoding
Our model is going to predict one label over a multi-class dataset. In order to make our neural network predict something meaningful, we need to encode the label tensor with a one-hot encoder, so that every scalar label becomes a vector of zeros with a single 1. The index of that 1 corresponds to the digit the label stores.
The MNIST labels are currently a vector of numbers, where each number is a label for a hand-written digit. For example:
[7,5,0,1]
Once we one-hot encode it, it will look like:
[0,0,0,0,0,0,0,1,0,0]
[0,0,0,0,0,1,0,0,0,0]
[1,0,0,0,0,0,0,0,0,0]
[0,1,0,0,0,0,0,0,0,0]
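Conceptually, the encoding is as simple as this sketch on plain arrays (the OneHotEncoder below does the same on tensors):

// one-hot encode a single digit label into a vector of 10 classes
def oneHot(label: Int, numClasses: Int = 10): Array[Double] =
  Array.tabulate(numClasses)(i => if i == label then 1d else 0d)

oneHot(7) // [0,0,0,0,0,0,0,1,0,0]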
We can reuse the OneHotEncoder implemented earlier:
val encoder = OneHotEncoder(classes = (0 to 9).map(i => (i.toDouble, i.toDouble)).toMap)
val yData = encoder.transform(dataset.trainLabels.as1D)
Common preparation
Let's wrap both transformations into one function:
def prepareData(x: Tensor[Double], y: Tensor[Double]) =
val xData = x.map(_ / 255d) // normalize to [0,1] range
val yData = encoder.transform(y.as1D)
(xData, yData)
Now we can call it like this:
val (xTrain, yTrain) = prepareData(dataset.trainImage, dataset.trainLabels)
Model construction
Our model is going to be designed/trained with:
- nodes: 784 x 100 x 10
- activation: ReLU, Softmax
- loss: cross-entropy
- accuracy: via argmax
- initialisation: Kaiming
- optimizer: Adam
val ann = Sequential[Double, Adam, HeNormal](
crossEntropy,
learningRate = 0.001,
metrics = List(accuracy),
batchSize = 128,
gradientClipping = clipByValue(5.0d)
)
.add(Dense(relu, 100))
.add(Dense(softmax, 10))
The Adam optimizer gets better results on MNIST data, so we stick with it rather than with standard Stochastic Gradient Descent.
Activations
We have already seen the ReLU activation function, but let's recall its definition:
def relu[T: ClassTag](x: Tensor[T])(using n: Numeric[T]): Tensor[T] =
x.map(t => max(n.zero, t))
An important note here is that it is applied element-wise, i.e. to every element of the z matrix in the layer.
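For example, on a plain array of raw layer outputs, negative values are simply zeroed out (an illustration with scala.math rather than the tensor version above):

val z = Array(-1.5, 0.0, 2.3, -0.7)
val activated = z.map(v => math.max(0d, v)) // Array(0.0, 0.0, 2.3, 0.0)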
However, the softmax activation function is applied across the nodes of the layer to produce probabilities which sum up to 1. This activation function is a typical choice for multi-class problems. When we feed an input sample into the network, we want to get an output vector with a probability for each class.
Coming back to the MNIST target, the representation below shows that the most likely target value is the digit "4", because the highest probability, 0.5, is at index 4.
scala> List(0.01, 0.2, 0.1, 0.1, 0.5, 0.01, 0.02, 0.03, 0.01, 0.02).sum
val res0: Double = 1.0
This is how we can implement softmax:
val toleration = castFromTo[Double, T](0.4E-15d)
def softmax(x: Tensor[T]): Tensor[T] =
val applied = x.mapRow { row =>
val max = row.max
val expNorm = row.map(v => exp(v - max))
val sum = expNorm.sum
expNorm.map(_ / sum)
}
// rest is an extra defence against numeric overflow
val appliedSum = applied.sumCols.map( v =>
if v.abs - toleration > n.one
then v
else n.one
)
val totalSum = appliedSum.sumRows.as0D.data
assert(totalSum == x.length,
s"Softmax distribution sum is not equal to 1 at some activation, but\n${appliedSum}")
applied
It is obviously more complicated than relu. This is what the above code is doing:
- For each row: Array[T] of the x tensor we find the max value and subtract it from each value of this row to get stable values in the vector. The reason to subtract the max is to avoid numeric overflow.
- Apply the exponent to each value right after the max subtraction.
- Sum up the exponents.
- Finally, divide each exponent by that sum.
- Additionally, we raise an error if the sum of the values in a row is not equal to 1. Such a situation may happen due to numeric overflow. If it happens, we may end up with an exploding gradient (and, as a result, a bad training outcome). However, we tolerate a numeric difference of 0.4E-15d, i.e. a row sum of no more than 1.0000000000000004.
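Stripped of the tensor plumbing and the overflow check, the per-row computation boils down to the following sketch on a plain array:

// numerically stable softmax for a single row: subtract the row max before exponentiation
def softmaxRow(row: Array[Double]): Array[Double] =
  val max = row.max
  val exps = row.map(v => math.exp(v - max))
  val sum = exps.sum
  exps.map(_ / sum)

softmaxRow(Array(1.0, 2.0, 3.0)).sum // ~1.0 (up to floating-point error)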
In order to perform back-propagation with gradient descent we also need the softmax derivative. This is the simplest version of it:
def derivative(x: Tensor[T]): Tensor[T] =
val sm = softmax(x)
sm.multiply(n.one - sm) // element-wise multiplication, NOT dot product
Loss function
Cross-entropy can then be used to calculate the difference between the two probability distributions and is a typical choice for multi-class classification. It can be written in code as:
def crossEntropy[T: ClassTag: Numeric](y: T, yHat: T): T =
y * log(yHat)
It returns a value that expresses the difference between the two. Example of input vectors:
some random yHat = [0.1, 0.1, 0, 0.8, ...] - it will be length of 10 in our MNIST case
actual y = [0, 0, 0, 1, ........ ] - length 10
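The scalar function above is applied per element; the conventional per-sample loss is the negative sum of these terms over all classes, so only the term at the index of the true class contributes. A minimal sketch of that idea on plain arrays (not the library's aggregation code):

// cross-entropy for one sample: -(sum of y_i * log(yHat_i)) over all classes
def crossEntropyLoss(y: Array[Double], yHat: Array[Double]): Double =
  -y.zip(yHat).map { case (yi, pi) => if yi == 0d then 0d else yi * math.log(pi) }.sum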
To get an idea why cross-entropy is a useful loss function for our problem, please have a look at this blog post.
Accuracy
Before we count the number of correct predictions, we cannot simply compare the y and yHat vectors: we first need to find the index of the max element in each of them. In other words, we need to help the existing algorithm to extract from the yHat vector the value of the label, i.e. the predicted digit. A function called argmax can be used for this task:
def accuracyMnist[T: ClassTag: Ordering](using n: Numeric[T]) = new Metric[T]:
val name = "accuracy"
def matches(actual: Tensor[T], predicted: Tensor[T]): Int =
val predictedArgMax = predicted.argMax
actual.argMax.equalRows(predictedArgMax)
val accuracy = accuracyMnist[Double]
Accuracy is an instance of the Metric type-class, whose matches method returns the number of correct predictions. The argMax itself is a generic tensor function:
def argMax[T: ClassTag](t: Tensor[T])(using n: Numeric[T]) =
def maxIndex(a: Array[T]) =
n.fromInt(a.indices.maxBy(a))
t match
case Tensor2D(data) => Tensor1D(data.map(maxIndex))
case Tensor1D(data) => Tensor0D(maxIndex(data))
case Tensor0D(_) => t
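For example, for the softmax output shown earlier, the index of the largest probability is 4, so the predicted digit is "4". On a plain array the same idea looks like this:

val probabilities = Array(0.01, 0.2, 0.1, 0.1, 0.5, 0.01, 0.02, 0.03, 0.01, 0.02)
val predictedDigit = probabilities.indices.maxBy(i => probabilities(i)) // 4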
Weight Initialisation
The weight initialisation approach is an important factor in Deep Learning: it helps model training converge faster and can even prevent vanishing or exploding gradients. Kaiming weight initialisation helps to address these problems, so let's use it as well:
given [T: ClassTag: Numeric]: ParamsInitializer[T, HeNormal] with
  val rnd = new Random()

  def gen(length: Int): T =
    castFromTo[Double, T] {
      val v = rnd.nextGaussian + 0.001d // value shift is optional
      v * math.sqrt(2d / length.toDouble)
    }

  override def weights(rows: Int, cols: Int): Tensor[T] =
    Tensor2D(Array.fill(rows)(Array.fill[T](cols)(gen(rows))))

  override def biases(length: Int): Tensor[T] =
    zeros(length)
We initialise biases to zeros. Weight matrices are initialised using a random generator with normal distribution. Every random number is then multiplied by sqrt(2 / n), where n is the number of input nodes of that particular layer.
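As a quick sanity check (not library code): for the first hidden layer n is 784, so the scaling factor is small and the initial weights end up close to zero.

val n = 784                 // input nodes of the first hidden layer
val std = math.sqrt(2d / n) // ≈ 0.0505, so initial weights are small values around zero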
Model Training
Now we are ready to start the training process.
val model = ann.train(xTrain, yTrain, epochs = 15, shuffle = true)
epoch: 1/15, avg. loss: 0.04434336993179046, metrics: [accuracy: 0.8785666666666667]
epoch: 2/15, avg. loss: 0.024939809896450383, metrics: [accuracy: 0.9350166666666667]
epoch: 3/15, avg. loss: 0.02028075875579972, metrics: [accuracy: 0.9478833333333333]
epoch: 4/15, avg. loss: 0.017196840063260558, metrics: [accuracy: 0.9560833333333333]
epoch: 5/15, avg. loss: 0.01491209973340988, metrics: [accuracy: 0.9625666666666667]
epoch: 6/15, avg. loss: 0.01350024657628137, metrics: [accuracy: 0.9671833333333333]
epoch: 7/15, avg. loss: 0.01222168129663168, metrics: [accuracy: 0.9699]
epoch: 8/15, avg. loss: 0.011222418180870991, metrics: [accuracy: 0.9729833333333333]
epoch: 9/15, avg. loss: 0.010388172803460627, metrics: [accuracy: 0.9752833333333333]
epoch: 10/15, avg. loss: 0.009549474708521941, metrics: [accuracy: 0.97765]
epoch: 11/15, avg. loss: 0.008920235294999721, metrics: [accuracy: 0.9787]
epoch: 12/15, avg. loss: 0.008214811390229967, metrics: [accuracy: 0.9806833333333334]
epoch: 13/15, avg. loss: 0.0077112882811408694, metrics: [accuracy: 0.9824]
epoch: 14/15, avg. loss: 0.0071559669134910325, metrics: [accuracy: 0.98405]
epoch: 15/15, avg. loss: 0.006797865863855411, metrics: [accuracy: 0.9848]
We have achieved quite good accuracy on training: 98.4% correct predictions, which is a 1.6% error rate.
Model Testing
val (xTest, yTest) = prepareData(dataset.testImages, dataset.testLabels)
val testPredicted = model(xTest)
val value = accuracy(yTest, testPredicted)
println(s"test accuracy = $value")
Accuracy on test data is quite close to the train accuracy:
test accuracy = 0.9721
We can also try to run a single test on the first image from the test dataset:
val singleTestImage = dataset.testImages.as2D.data.head
val label = dataset.testLabels.as1D.data.head // this must be "7"
val predicted = model(singleTestImage.as2D).argMax.as0D.data
assert(label == predicted,
s"Predicted label is not equal to expected '$label' label, but was '$predicted'")
println(s"predicted = $predicted")
predicted = 7.0
Summary
We have seen that even a single hidden layer is able to classify the MNIST dataset with quite a low error rate. Key takeaways when classifying images are:
- make sure the gradient is not going to explode or vanish. For that we can use proper weight initialisation, gradient clipping by value or norm, or other weight normalisation during training. Also scale or normalise the input data
- use one-hot encoding for your target variable in case of multi-class classification
- in case of single-label prediction, use the argmax function
- use softmax activation at the last layer to distribute probabilities across the classes.
Try a Convolutional Neural Network as the next step in image classification.