
ONNX Model format in Flink

Alexey Novakov published on

7 min, 1352 words

The most popular ecosystem for training ML models these days is Python with C/C++ based libraries. An example of model training is a Logistic Regression algorithm from the scikit-learn package, or more advanced neural network algorithms offered by TensorFlow, PyTorch and others. There are lots of tools and libraries in the Python world to facilitate training and model serving.

In order to bring models trained in Keras or scikit-learn into a Flink application, we can use cross-platform file formats such as ONNX or, in the past, PMML. These formats also come with language runtimes. In Flink's case, we select a JVM SDK to run the inference logic inside the Flink job.

Let's look at an example of how to train a Logistic Regression model in Python using Keras and then use the trained model in ONNX format inside Flink.
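Whichever runtime loads the ONNX file later, a logistic regression model boils down to a sigmoid over a linear combination of features. Here is a minimal, dependency-free Python sketch of that inference step — the weights and inputs are made-up placeholders, not values from a real trained model:

```python
import math

def predict(weights, bias, features):
    """Logistic regression inference: sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients; a real model would learn these in Keras/scikit-learn.
weights = [0.8, -1.2]
bias = 0.1
print(predict(weights, bias, [0.0, 0.0]))  # sigmoid(0.1), roughly 0.525
```

This is exactly the computation the exported ONNX graph encodes, which is why a JVM runtime can reproduce the Python-trained model's predictions inside Flink.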



Read More

Spark API Languages

Alexey Novakov published on

7 min, 1244 words



Context

As part of my role as Data Architect at work, I often deal with AWS data services for running Apache Spark jobs, such as EMR and Glue ETL. At the very beginning, the team needed to choose a Spark-supported programming language and start writing our main jobs for data processing.

Before we dive further into the language choice, let's quickly recall what Spark on EMR is. Glue ETL is skipped in this blog post.

Apache Spark is one of the main components of AWS EMR, which keeps EMR a meaningful service for Big Data teams. The AWS EMR team builds its own Spark distribution to integrate it seamlessly with other EMR applications. Even though Amazon builds its own Spark, it keeps the version equal to the open-source release of Spark, so all features of Apache Spark are available in EMR Spark. EMR allows running a Spark application in an EMR cluster via a step type called “Spark Application”.

Read More