scala

ONNX Model format in Flink

Alexey Novakov published on

7 min, 1352 words

The most popular ecosystem for training ML models these days is Python with C/C++-based libraries. Examples of model training range from a Logistic Regression algorithm in the scikit-learn package to more advanced neural network algorithms offered by TensorFlow, PyTorch and others. There are lots of tools and libraries in the Python world to facilitate model training and serving.

In order to bring models trained in Keras or scikit-learn into a Flink application, we can use cross-platform file formats such as ONNX or, in the past, PMML. These formats also come with language runtimes. In the Flink case, we select the JVM SDK to run inference logic inside the Flink job.

Let's look at an example of how to train Logistic Regression in Python using Keras and then use the trained model in ONNX format inside Flink.



Read More

Using Scala 3 with Apache Flink

Alexey Novakov published on

4 min, 625 words


If you have come here, then you probably know that the Scala API of the current Apache Flink version, 1.16, still depends on Scala 2.12. However, the good news is that the previous Flink release, 1.15, introduced an important change for Scala users, which allows us to use our own Scala version via the Flink Java API. That means users can now use Scala 2.13 or 3 to write Flink jobs. Earlier, Flink had Scala 2.12 as a mandatory library on the application classpath, so one could not use a newer version due to version conflicts.



The official Flink Scala API will still be available in the future, but will probably be deprecated at some point. When that will happen is unclear for now. The Flink community decided not to engage with newer Scala versions and to make Flink Scala-free in terms of the user's choice of Scala version. Whether this is good or bad for Scala users in general, we are going to see in the near future. In one respect this choice is definitely good: it unlocks Flink for newer Scala versions.

Read More

Spark API Languages

Alexey Novakov published on

7 min, 1244 words



Context

As part of my role as Data Architect at work, I often deal with AWS data services that run Apache Spark jobs, such as EMR and Glue ETL. At the very beginning, the team needed to choose a Spark-supported programming language and start writing our main data processing jobs.

Before we dive further into the choice of language, let's quickly recall what Spark on EMR is. Glue ETL is skipped in this blog post.

Apache Spark is one of the main components of AWS EMR, which keeps EMR a meaningful service for Big Data teams. The AWS EMR team builds its own Spark distribution to integrate it seamlessly with other EMR applications. Even though Amazon builds its own Spark, it keeps the same version as the open source Spark release. All features of Apache Spark are available in EMR Spark. EMR allows running a Spark application in an EMR cluster via a step type called “Spark Application”.

Read More

CDC with Delta Lake Streaming

Alexey Novakov published on

5 min, 969 words



Change Data Capture (CDC) is a popular technique for replicating data from OLTP to OLAP data stores. CDC tools usually integrate with the transaction logs of relational databases and are thus mainly dedicated to replicating all possible data changes from relational databases. NoSQL databases usually come with built-in CDC for any possible data change (insert, update, delete), for example AWS DynamoDB Streams.

In this blog post, we will look at the Delta Lake table format, which supports a "merge" operation. This operation is useful when we need to update replicated data in a Data Lake.
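To make the idea of a merge concrete, here is a minimal sketch of its semantics using plain Scala collections. It does not use the Delta Lake API: the `Change` types, the `merge` function and the `Map`-based "table" are hypothetical names chosen for illustration, assuming each record is identified by an integer key.

```scala
object MergeSketch {
  // Hypothetical CDC events: an insert or update is modelled as an upsert.
  sealed trait Change
  final case class Upsert(id: Int, value: String) extends Change
  final case class Delete(id: Int) extends Change

  // Apply the changes in order to a keyed snapshot of the target table.
  // Matching keys are overwritten, new keys are added, deleted keys are dropped.
  def merge(target: Map[Int, String], changes: Seq[Change]): Map[Int, String] =
    changes.foldLeft(target) {
      case (acc, Upsert(id, value)) => acc.updated(id, value)
      case (acc, Delete(id))        => acc - id
    }
}
```

Applying an update for key 2, an insert for key 3 and a delete for key 1 to a table holding keys 1 and 2 leaves keys 2 and 3, with key 2 carrying the new value.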

Read More

Decision Tree from scratch

Alexey Novakov published on

5 min, 817 words

Cropped view of one of the regions in the middle of the tree we will build later

The Decision Tree classifier is one of the simplest algorithms to implement from scratch. One of the benefits of this algorithm is that it can be trained without spending too much effort on data preparation, and it is fast compared to more complex algorithms like Neural Networks. In this blog post we are going to implement the CART algorithm, which stands for Classification and Regression Trees. There are many other algorithms in the decision tree space, but we will not describe them in this blog post.

Data science practitioners often use decision tree algorithms as a baseline to compare against more advanced algorithms. Although a decision tree is fast to train, its accuracy is usually lower than that of other algorithms, such as deep feed forward networks or something more advanced, on the same dataset. However, you do not always need high accuracy, so CART and decision tree ensemble algorithms may be enough to solve a particular problem.
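As a small taste of what CART involves, the sketch below computes Gini impurity, the split criterion commonly used by CART for classification. This is an assumption about the criterion, since the excerpt above does not name one, and `GiniSketch`, `gini` and `splitImpurity` are hypothetical names.

```scala
object GiniSketch {
  // Gini impurity of a set of class labels: 1 - sum of squared class proportions.
  // 0.0 means the set is pure (a single class).
  def gini[A](labels: Seq[A]): Double =
    if (labels.isEmpty) 0.0
    else {
      val n = labels.size.toDouble
      val sumOfSquares = labels
        .groupBy(identity)
        .valuesIterator
        .map { group =>
          val p = group.size / n
          p * p
        }
        .sum
      1.0 - sumOfSquares
    }

  // Impurity of a candidate split: the size-weighted average of both sides.
  // A tree builder would pick the split that minimises this value.
  def splitImpurity[A](left: Seq[A], right: Seq[A]): Double = {
    val n = (left.size + right.size).toDouble
    if (n == 0) 0.0
    else (left.size / n) * gini(left) + (right.size / n) * gini(right)
  }
}
```

A perfectly mixed two-class set has impurity 0.5, while a split that separates the two classes completely has impurity 0.0.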

Read More

Kubernetes Operator in Scala for Kerberos Keytab Management

Alexey Novakov published on

8 min, 1452 words


Kubernetes has built-in controllers to handle its native resources, such as

  • Pod
  • Service
  • Deployment
  • etc.

What if you want a completely new resource type that describes some new abstraction in a clear and concise way? Such a new resource would describe in one single type everything that would otherwise require 5-10 separate native Kubernetes resources.

Read More

Convolutional Neural Network in Scala

Alexey Novakov published on

7 min, 1255 words

Last time we used an ANN to train a Deep Learning model for image recognition using the MNIST dataset. This time we are going to look at a more advanced network called a Convolutional Neural Network, or CNN for short.

CNN is designed to tackle the image recognition problem. However, it can be used for more than image recognition. As we have seen last time, an ANN using just hidden layers can learn quite well on MNIST. However, for real-life use cases we need higher accuracy. The main idea of CNN is to learn how to recognise objects in their different shapes and positions using specific features of the image data. The goal of CNN is better model regularisation by using convolution and pooling operations.
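The two operations named above can be sketched in a few lines of plain Scala. This is only an illustration under simplifying assumptions (a single channel, stride 1, no padding, non-overlapping pooling windows); `ConvSketch`, `convolve` and `maxPool` are hypothetical names, not the code from the full post.

```scala
object ConvSketch {
  type Matrix = Array[Array[Double]]

  // "Valid" 2D convolution as commonly used in CNNs (strictly, cross-correlation):
  // slide the kernel over the image and sum the element-wise products.
  def convolve(image: Matrix, kernel: Matrix): Matrix = {
    val kh = kernel.length
    val kw = kernel.head.length
    val outH = image.length - kh + 1
    val outW = image.head.length - kw + 1
    Array.tabulate(outH, outW) { (i, j) =>
      (for {
        di <- 0 until kh
        dj <- 0 until kw
      } yield image(i + di)(j + dj) * kernel(di)(dj)).sum
    }
  }

  // Max pooling: keep the largest value of every size x size window.
  def maxPool(input: Matrix, size: Int): Matrix =
    Array.tabulate(input.length / size, input.head.length / size) { (i, j) =>
      (for {
        di <- 0 until size
        dj <- 0 until size
      } yield input(i * size + di)(j * size + dj)).max
    }
}
```

Convolving a 3x3 image with a 2x2 kernel yields a 2x2 feature map, and pooling that map with a 2x2 window reduces it to a single value.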

Read More

MNIST image recognition using Deep Feed Forward Network

Alexey Novakov published on

8 min, 1593 words

A Deep Feed Forward Neural Network is one type of Artificial Neural Network, and it is also able to classify images. In order to feed pixel data in RGB, greyscale or another format into the neural net, one can map every pixel to a network input. That means every pixel becomes a feature. It may sound scary and highly inefficient to feed in, let's say, an image 28 pixels high and 28 pixels wide, which gives 784 features to learn from. However, neural networks can learn from the pixel data successfully and classify unseen data. We are going to demonstrate this.

Please note that there are other types of networks that are more efficient at image classification, such as the Convolutional Neural Network, but we are going to talk about that next time.
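Mapping every pixel to a network input can be sketched as follows. The example assumes a greyscale image stored as rows of integer intensities in the range 0 to 255; scaling to [0, 1] is a common preprocessing choice rather than something stated above, and `FlattenSketch` and `flatten` are hypothetical names.

```scala
object FlattenSketch {
  // Flatten a height x width greyscale image into one feature vector,
  // row by row, scaling each pixel intensity from 0..255 to 0.0..1.0.
  def flatten(image: Array[Array[Int]]): Array[Double] =
    image.flatten.map(_ / 255.0)
}
```

For a 28 by 28 image this produces a vector of 784 values, one per network input.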

Dataset

Wikipedia MnistExamples

Read More