# TensorFlow Scala - Linear Regression via ANN

TensorFlow Scala is a strongly-typed Scala API for TensorFlow core C++ library developed by Anthony Platanios. This library integrates with native TensorFlow library via JNI, so no intermediate official/non-official Java libraries are used.

In this article we will implement "multiple linear regression" from the previous article about Customer Churn using TensorFlow.

# Setup Project

I am going to use SBT, but you can also use any other Scala-aware build tool.

New SBT project configuration:

```
lazy val tensorFlowScalaVer = "0.5.10"
lazy val root = (project in file("."))
.settings(
name := "tensorflow-scala-example",
libraryDependencies ++= Seq(
"org.platanios" %% "tensorflow-data" % tensorFlowScalaVer,
"org.platanios" %% "tensorflow" % tensorFlowScalaVer classifier "darwin"
)
)
```

In order to add `TensorFlow Scala`

to existing project, just add two library dependencies from above configuration. `tensorflow-data`

module is optional, but we are going to use it
in this article as well.

**Important:** I am using OSX, so my classifier is `darwin`

. In case you use Linux or Windows, please **change classifier** value to currently available classifiers for those platforms, check it here to be sure: https://repo1.maven.org/maven2/org/platanios/tensorflow_2.13/0.5.10/.

# Tensor API

Before start implementing an Artificial Neural Network with TensorFlow, let's briefly look at how we can load data into a tensor.

We would need to create matrices, so we can use Scala collections and map Arrays to Tensors and put them as rows into another Tensor. For example:

```
import org.platanios.tensorflow.api._
val t1 = Tensor(Tensor(1, 4, 0), Tensor(2, 3, 5))
println(t1.summarize())
```

will print:

```
Tensor[Int, [2, 3]]
[[1, 4, 0],
[2, 3, 5]]
```

# Data Preparation

`TensorFlow Scala`

also has data API to load datasets. We will use that API
partly. Unfortunately, current documentation does not give an idea how to use Data API, so that we will use partly. Most of data preparation code I took from previous article.

## Custom Code

```
type Matrix[T] = Array[Array[T]]
private def createEncoders[T: Numeric: ClassTag](data: Matrix[String])
: Matrix[String] => Matrix[T] = {
val encoder = LabelEncoder.fit[String](TextLoader.column(data, 2))
val hotEncoder = OneHotEncoder.fit[String, T](TextLoader.column(data, 1))
val label = t => encoder.transform(t, 2)
val hot = t => hotEncoder.transform(t, 1)
val typeTransform = (t: Matrix[String]) => transform[T](t)
label andThen hot andThen typeTransform
}
def loadData() = {
// loading data from CSV file into memory
val loader = TextLoader(Path.of("data/Churn_Modelling.csv")).load()
val data = loader.cols[String](3, -1)
val encoders = createEncoders[Float](data)
val numericData = encoders(data)
val scaler = StandardScaler[Float]().fit(numericData)
// create data transformation as custom code
val prepareData = (t: Matrix[String]) => {
val numericData = encoders(t)
scaler.transform(numericData)
}
// transform features
val xMatrix = prepareData(data)
val yVector = loader.col[Float](-1)
```

## Tensor API

Continuation of `loadData`

:

```
import org.platanios.tensorflow.data.utilities.UniformSplit
// Wrap arrays into Tensors and set Shapes
val xData = xMatrix.map(a => Tensor(a.toSeq)).toSeq
val x = Tensor(xData).reshape(Shape(-1, features))
val y = Tensor(yVector.toSeq).reshape(Shape(-1, targets))
// use library API to split data for train and test sets
val split = UniformSplit(x.shape(0), None)
val (trainIndices, testIndices) = split(trainPortion = 0.8f)
val xTrain = x.gather[Int](trainIndices, axis = 0)
val yTrain = y.gather[Int](trainIndices, axis = 0)
val xTest = x.gather[Int](testIndices, axis = 0)
val yTest = y.gather[Int](testIndices, axis = 0)
(xTrain, yTrain, xTest, yTest, prepareData)
}
```

As per comments above, we prepare training and test data and return it as 4 different Tensor objects. Also, we return a function as fifth element of the tuple for one more application (read further).

Full code of data preparation that includes:

- selecting specific CSV file columns
- normalising numeric columns
- encoding categorical columns as one-hot
- encoding label-like columns

see here:

# Model Assembly

We are going to learn on 12 features to predict one target.

```
val features = 12
val targets = 1
val batchSize = 100
val input = tf.learn.Input(FLOAT32, Shape(-1, features))
val trainInput = tf.learn.Input(FLOAT32, Shape(-1, targets))
```

`batchSize`

will be used in a couple of places of TensorFlow API.

In order to construct all 12 x 6 x 6 x 1 network below:

we use the following API:

```
val layer =
tf.learn.Linear[Float]("Layer_0/Linear", 6) >>
tf.learn.ReLU[Float]("Layer_0/ReLU") >>
tf.learn.Linear[Float]("Layer_1/Linear", 6) >>
tf.learn.ReLU[Float]("Layer_1/ReLU") >>
tf.learn.Linear[Float]("OutputLayer/Linear", 1) >>
tf.learn.Sigmoid[Float]("OutputLayer/Sigmoid")
```

`layer`

state is a composition of fully-connected layers with its own activation function. We specify `String`

name for each layer that will be eventually used in TensorFlow graph.

```
val loss = tf.learn.L2Loss[Float, Float]("Loss/L2Loss")
val optimizer = tf.train.Adam()
```

We are going to use `L2`

, which is a `half least square error`

as a loss function.
And Adaptive Moment Estimation (Adam) as weights optimization algorithm.

Finally, we pass above values to construct simple supervised model.

```
val model = tf.learn.Model.simpleSupervised(
input = input,
trainInput = trainInput,
layer = layer,
loss = loss,
optimizer = optimizer,
clipGradients = ClipGradientsByGlobalNorm(5.0f)
)
```

You can also create unsupervised model with `.unsupervised`

method, if needed.

As we can see, model construction is highly declarative as the entire `TensorFlow Scala`

library API.

# Estimator

Another abstraction in TensorFlow we are going to use is Estimator. It is used to train, predict, evaluate and export for serving the TensorFlow models. In some other libraries, an estimator is usually a model abstraction itself. Estimator provides nice separation from the input data and actual model to train.

## Dataset

Before we construct an estimator we need to wrap the input data into a Dataset. As I mentioned before, the `TensorFlow Scala`

provides a data package to
load some data formats in streaming way / lazily. This is recommended way to use this library that allows us to iterate over the data in streaming fashion, so the full dataset does not need to fit into memory. However, our current example dataset is quite small. so that we used custom code to load and transform the data before we start any learning. Now we have to wrap Tensors into Datasets:

```
val (xTrain, yTrain, xTest, yTest, dataTransformer) = loadData()
val trainFeatures = tf.data.datasetFromTensorSlices(xTrain)
val trainLabels = tf.data.datasetFromTensorSlices(yTrain)
val testFeatures = tf.data.datasetFromTensorSlices(xTest)
val testLabels = tf.data.datasetFromTensorSlices(yTest)
val trainData =
trainFeatures
.zip(trainLabels)
.repeat()
.shuffle(1000)
.batch(batchSize)
.prefetch(10)
val evalTrainData = trainFeatures.zip(trainLabels).batch(batchSize).prefetch(10)
val evalTestData = testFeatures.zip(testLabels).batch(batchSize).prefetch(10)
```

## Output

Above code creates training dataset as combination of features and labels `outputs`

.
Core TensorFlow library has an idea of Output abstraction. The `Output`

is a symbolic handle that represents a tensor value produced by an `Operation`

. In other words, it is future state of a Tensor once particular operand is applied to that tensor. It does not hold the values of that operation
output, but instead provides a means of computing those values in a TensorFlow `Session`

, which is another TensorFlow abstraction. The session is created automatically by the Estimator. One can also construct TensorFlow session manually, we are not going to do this in this article.

## Training Metrics

One of the place where we work with `Output`

s directly in this example, is
training metric configuration. For binary classification we need to transform predicted
values by the model to `0`

and `1`

.

```
val accMetric = tf.metrics.MapMetric(
(v: (Output[Float], (Output[Float], Output[Float]))) => {
val (predicted, (_, actual)) = v
val positives = predicted > 0.5f
val shape = Shape(batchSize, positives.shape(1))
val binary = tf
.select(
positives,
tf.fill(shape)(1f),
tf.fill(shape)(0f)
)
(binary, actual)
},
tf.metrics.Accuracy("Accuracy")
)
```

I think to transform predicted values to binary values, i.e. `1`

and `0`

can be done more efficient than filling `true`

boolean values with `1`

and `false`

values with `0`

using `tf.select`

function, but I could not find another way.

We will use above accuracy metric in `estimator`

construction.

## Construct Estimator

```
val summariesDir = Paths.get("temp/ann")
val estimator = tf.learn.InMemoryEstimator(
model,
tf.learn.Configuration(Some(summariesDir)),
tf.learn.StopCriteria(maxSteps = Some(100000)),
Set(
tf.learn.LossLogger(trigger = tf.learn.StepHookTrigger(100)),
tf.learn.Evaluator(
summaryDir = summariesDir,
log = true,
datasets = Seq(("Train", () => evalTrainData), ("Test", () => evalTestData)),
metrics = Seq(accMetric),
trigger = tf.learn.StepHookTrigger(1000),
name = "Evaluator"
),
tf.learn.StepRateLogger(
log = false,
summaryDir = summariesDir,
trigger = tf.learn.StepHookTrigger(100)
),
tf.learn.SummarySaver(summariesDir, tf.learn.StepHookTrigger(100)),
tf.learn.CheckpointSaver(summariesDir, tf.learn.StepHookTrigger(1000))
),
tensorBoardConfig =
tf.learn.TensorBoardConfig(summariesDir, reloadInterval = 1)
)
```

Above code configures:

- Logging in training loop:
- log to summary directory "temp/ann"
- store checkpoint at every 1000 step to summary directory
- log at every 100 steps
- log loss value at every 100 steps

- Evaluate metrics:
- calculate accuracy metric (and any other specified metrics in Seq) at every 1000 step
- use data specified in Evaluator datasets

- Take data for Tensorboard from summary directory
- Stop after
`100 000`

step unless overridden by`.train`

method

**Note:** estimator works with `step`

notion rather than with `epoch`

. In order to calculate number of desired training steps, you can divide a *number training records* on *batch size*. In our case we have 8000 training records / 100 batch size = 80 steps. This is one epoch, i.e. one full training cycle on available dataset. In order to repeat training on the same model parameters 100 times, i.e. 100 epochs instead of 1 epoch we need 80 * 100 = 8000 steps. So if we set `10 000`

steps we ask for 125 epochs since 2000 steps is 25 epochs.

# Training

```
estimator.train(
() => trainData,
tf.learn.StopCriteria(maxSteps = Some(10000))
)
```

Finally we are starting the training loop. We pass the same `trainData`

as we used
for metric evaluation. However, we could use different datasets for training and metric evaluations. We override `maxSteps`

with `10 000`

. As we have only 10k rows in CSV file, we do not need initial `100 000`

steps for training. TensorFlow

Once we run `train`

method, we can see the following output in console:

```
.....
2021-02-12 19:07:15.308 [run-main-9] INFO Learn / Hooks / Evaluation - Step 10000 Evaluator:
2021-02-12 19:07:15.308 [run-main-9] INFO Learn / Hooks / Evaluation - ╔═══════╤════════════╗
2021-02-12 19:07:15.308 [run-main-9] INFO Learn / Hooks / Evaluation - ║ │ Accuracy ║
2021-02-12 19:07:15.308 [run-main-9] INFO Learn / Hooks / Evaluation - ╟───────┼────────────╢
2021-02-12 19:07:15.369 [run-main-9] INFO Learn / Hooks / Evaluation - ║ Train │ 0,8494 ║
2021-02-12 19:07:15.386 [run-main-9] INFO Learn / Hooks / Evaluation - ║ Test │ 0,8367 ║
2021-02-12 19:07:15.391 [run-main-9] INFO Learn / Hooks / Evaluation - ╚═══════╧════════════╝
```

there will be `11`

logging statements for intermediate accuracy value, so I copied only the last summary.

`train`

method returns `Unit`

, so it mutates state of the estimator, so that you can use
it further for model inference.

## Single test

```
val example = TextLoader(
"n/a,n/a,n/a,600,France,Male,40,3,60000,2,1,1,50000,n/a"
).cols[String](3, -1)
val testExample = Tensor(dataTransformer(example).map(Tensor(_)).toSeq)
.reshape(Shape(-1, features))
val prediction = estimator.infer(() => testExample)
println(s"Customer exited ? ${prediction.scalar > 0.5f}")
```

We use `dataTransformer`

function one more time for converting raw single data record into
numeric format that our model can understand and return a target value for it:

```
Customer exited ? false
```

`false`

is expected value for that simple example.

## Batch test

We can also submit data batch to infer a target value for each record.

```
println(s"Train accuracy = ${accuracy(xTrain, yTrain)}")
println(s"Test accuracy = ${accuracy(xTest, yTest)}")
```

We are going to calculate the accuracy metric manually based on known labels for train and test datasets:

```
def accuracy(input: Tensor[Float], labels: Tensor[Float]): Float = {
val predictions = estimator.infer(() => input.toFloat).toArray
val correct = predictions
.map(v => if (v > 0.5f) 1f else 0f)
.zip(labels.toFloat.toArray)
.foldLeft(0f) { case (acc, (yHat, y)) => if (yHat == y) acc + 1 else acc }
correct / predictions.length
}
```

```
Train accuracy = 0.867875
Test accuracy = 0.8605
```

Although we used again the same data for checking accuracy, however one can take new / unseen data to check the accuracy on just trained or loaded from checkpoint estimator.

# Tensorboard

Tensorboard is an additional tool from TensorFlow main framework. It can be installed via `pip`

tool:

```
pip install tensorboard
```

We enable Tensorboard as part of Estimator configuration. Every time we run `training`

cycle for an estimator with Tensorboard configured, we get the following console message:

```
sbt:tensorflow-scala-example> run
[info] running MultipleLR
2021-02-12 21:09:04.933 [run-main-c] INFO Learn / Hooks / TensorBoard - Launching TensorBoard in 'localhost:6006' for log directory '..../tensorflow-ann/temp/ann'
```

TensorFlow starts a web-app at `localhost`

on port `6006`

and using data from the the log directory that we configured at `estimator`

level.

Log directory accumulates TensorFlow logs between training cycles, so that if we run training cycle again and again we can see that estimator variables (graph state) is restored from that logging folder. Eventually, our model loss and accuracy metric values are going to be stable, i.e. not improving anymore.

# Summary

`TensorFlow Scala`

is a fantastic library that mimics most of `TensorFlow`

core library and Python API.
Although current library is missing some documentations, one can always use official TensorFlow documentation web-site to get an idea of the Scala API.

Implemented ANN in `TensorFlow Scala`

shows that one can use Scala to train Deep Learning models easily. Training program in Scala are going to be quite declarative and statically type-checked which eliminates lots of mistakes. Library API also allows to extend most of the abstractions, which is very important for real life use cases.

# Source Code

Full source code as SBT project can be found here: