MLOps pipeline using Kedro & MLflow

Elyadata · 6 min read · Jun 22, 2023

Kedro

What is Kedro?

Kedro is an open-source Python framework for creating reproducible, sustainable, and modular data science code. It borrows best practices from software engineering and applies them to machine learning code; these include modularity, separation of concerns, and version control.

  • Open-source Python framework: Kedro is free to use and maintained by its community.
  • Reproducible code: with Kedro, you create data pipelines that can be reused across different data sources.
  • Sustainable code: Kedro’s structure makes the code easy to maintain and lets a team work collaboratively on the same pipeline.
  • Modular code: functions can be reused in different parts of a pipeline and across pipelines.

Installation guideline

Kedro is installed from PyPI:

pip install kedro

To verify the installation, run this command:

kedro info

If all goes well, you will see the Kedro ASCII art logo, followed by the installed version.

Creating a project with Kedro interactively

To create a Kedro project, run the following command:

kedro new

Next, you will have to answer three questions:

  1. project_name: the name of the project (use a readable name, with words separated by underscores);
  2. repo_name: the project folder name (you can leave it blank, and Kedro will use the project name);
  3. python_package: the name of the Python package (can also be left blank).

Creating a new project from a configuration file

You can create a new project from a configuration file if you prefer. The file must contain:

output_dir: ~/code
project_name: Get Started
repo_name: get-started
python_package: get_started

To create the new project:

kedro new --config config.yml

Creating a pipeline

To generate a new pipeline template, run:

kedro pipeline create data_processing
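
This generates a pipelines/data_processing folder containing, among other files, an empty nodes.py and pipeline.py. As an illustration of what goes into them, here is a minimal sketch; the function and dataset names are hypothetical, not part of the template:

# src/<python_package>/pipelines/data_processing/nodes.py
import pandas as pd

def clean_data(raw_data: pd.DataFrame) -> pd.DataFrame:
    # Illustrative node: drop rows with missing values
    return raw_data.dropna()

# src/<python_package>/pipelines/data_processing/pipeline.py
from kedro.pipeline import Pipeline, node
from .nodes import clean_data

def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(
                func=clean_data,          # the function to run
                inputs="raw_data",        # dataset name from the Data Catalog
                outputs="clean_data",     # dataset name for the result
                name="clean_data_node",   # node name shown in logs
            ),
        ]
    )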

MLflow

What is MLflow?

MLflow is a platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models. MLflow offers a set of lightweight APIs that can be used with any existing machine learning application or library (TensorFlow, PyTorch, XGBoost, etc), wherever you currently run ML code (e.g. in notebooks, standalone applications, or the cloud).

MLflow’s current components are:

  • MLflow Tracking: An API to log parameters, code, and results in machine learning experiments and compare them using an interactive UI.
  • MLflow Projects: A code packaging format for reproducible runs using Conda and Docker, so you can share your ML code with others.
  • MLflow Models: A model packaging format and tools that let you easily deploy the same model (from any ML library) to batch and real-time scoring on platforms such as Docker, Apache Spark, Azure ML, and AWS SageMaker.
  • MLflow Model Registry: A centralized model store, set of APIs, and UI, to collaboratively manage the full lifecycle of MLflow Models.

Installation

pip install mlflow

Setting a tracking URI

MLflow runs can be recorded to local files, to an SQLAlchemy compatible database, or remotely to a tracking server.

By default, the MLflow Python API logs runs locally to files in an mlruns directory wherever you ran your program.

mlflow.set_tracking_uri() accepts:
- a local file path (specified as file:/my/local/dir);
- a database URI encoded as <dialect>+<driver>://<username>:<password>@<host>:<port>/<database>;
- an HTTP(S) address (such as https://my-server:5000) of a server hosting an MLflow tracking server.
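
For instance, assuming a local SQLite file or a hypothetical remote server address:

import mlflow

# Record runs in a local SQLite database instead of the default ./mlruns folder
mlflow.set_tracking_uri("sqlite:///mlflow.db")

# Or point to a remote tracking server (hypothetical address)
# mlflow.set_tracking_uri("http://my-server:5000")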

Creating an experiment

mlflow.create_experiment()

creates a new experiment and returns its ID.

Runs can be launched under the experiment by passing the experiment ID to

mlflow.start_run()

Creating & ending a run

mlflow.start_run()
mlflow.end_run()

mlflow.start_run() starts a new run, and mlflow.end_run() ends the currently active one. The logging functions described below also start a run automatically if none is active.
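
Putting the experiment and run APIs together, a minimal sketch (the experiment and run names are made up) could look like this:

import mlflow

# Create an experiment and launch a run under it
experiment_id = mlflow.create_experiment("handwriting-recognition")

run = mlflow.start_run(experiment_id=experiment_id, run_name="baseline")
# ... training code ...
mlflow.end_run()

# Equivalent, and more idiomatic: the context manager ends the run automatically
with mlflow.start_run(experiment_id=experiment_id, run_name="baseline-2"):
    pass  # training code goes here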

Logging parameters

With MLflow, you can log a single key-value parameter or multiple parameters at once in the currently active run. The key and value are both strings.

mlflow.log_param()
mlflow.log_params()
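
For example (the parameter names and values are illustrative):

import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", 5e-5)              # a single parameter
    mlflow.log_params({"epochs": 10, "batch_size": 8})   # several parameters at once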

Logging metrics

You can also log key-value metrics, and MLflow keeps the full history of values for each metric.

The metric value must always be a number.

mlflow.log_metric()
mlflow.log_metrics()
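
For example, logging the same key several times builds up its history (the values here are illustrative):

import mlflow

with mlflow.start_run():
    mlflow.log_metric("val_loss", 0.42, step=1)             # a single metric
    mlflow.log_metric("val_loss", 0.35, step=2)             # same key again: MLflow keeps the history
    mlflow.log_metrics({"cer": 0.12, "wer": 0.21}, step=2)  # several metrics at once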

Logging artifacts

mlflow.log_artifact()
mlflow.log_artifacts()

These log a local file or directory as an artifact, optionally taking an artifact_path to place it within the run’s artifact URI.
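
For example (the file and directory paths are hypothetical):

import mlflow

with mlflow.start_run():
    mlflow.log_artifact("outputs/confusion_matrix.png")                  # a single file
    mlflow.log_artifacts("outputs/checkpoints", artifact_path="model")   # a whole directory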

Kedro-MLflow pipeline

The kedro-mlflow plugin

kedro-mlflow is a Kedro plugin for lightweight and portable integration of MLflow capabilities inside Kedro projects. It enforces Kedro principles to make MLflow usage as production-ready as possible.

Kedro-mlflow installation

Kedro-mlflow is only compatible with kedro>=0.16.0 and mlflow>=1.0.0.

If you have a project created with an older version of Kedro, see the kedro-mlflow migration guide.

Install from PyPI

kedro-mlflow is available on PyPI, so you can install it with pip:

pip install kedro-mlflow

Install from sources

You can also install the package directly from GitHub:

pip install git+https://github.com/Galileo-Galilei/kedro-mlflow.git

Check the installation

Type kedro info in a terminal to check the installation. If it has succeeded, you should see the following ASCII art:

 _            _
| | _____  __| |_ __ ___
| |/ / _ \/ _` | '__/ _ \
|   <  __/ (_| | | | (_) |
|_|\_\___|\__,_|_|  \___/
v0.16.<x>

kedro allows teams to create analytics
projects. It is developed as part of
the Kedro initiative at QuantumBlack.

Installed plugins:
kedro_mlflow: 0.11.3 (hooks:global,project)

Using the plugin

After the installation, the plugin can be activated using:

kedro mlflow init

If successful, you’ll see the following message: ‘conf/local/mlflow.yml’ successfully updated.

The conf/local folder is updated and now contains an mlflow.yml file.

Optional: if you have configured your own MLflow server, you can specify the tracking URI in mlflow.yml, as in the sketch below.
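
A trimmed-down mlflow.yml might then look like this (the exact keys can vary between kedro-mlflow versions, and the server address is hypothetical):

server:
  mlflow_tracking_uri: http://my-mlflow-server:5000  # defaults to null, i.e. the local mlruns folder
tracking:
  experiment:
    name: pytorch_pipeline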

Project example with Kedro & MLflow

Project Components

In this section, we present an example of a Kedro project that contains three main pipelines:

  • Data Preparation: the goal of this pipeline is to extract the annotations (the annotated text) from Label Studio and find the corresponding images.
  • Data Science: this pipeline trains a TrOCR model on the data extracted by the Data Preparation pipeline, and then evaluates the model on the evaluation dataset.
  • Data Versioning: in order to capture the version of our models in Git commits, we use Data Version Control (DVC) to store the model from each experiment in an Azure Blob Storage container.

Data Preparation Pipeline

  • The data_preparation/nodes.py file contains the extract_data_from_label_studio node and other helper functions. This node extracts the annotated data (filename and annotated text) from Label Studio and saves it to a file.
  • In data_preparation/pipeline.py, we create the data preparation pipeline by assembling the nodes and specifying, for each one, the function it wraps, its inputs and outputs, and its name (see the sketch after this list).
  • The data preparation pipeline settings are defined in the base/parameters/data_prepara.yml file. We specify the Label Studio connection parameters (token and URL) and the data extraction parameters (such as the number of tasks to extract and the path of the file in which the data will be stored).
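
As an illustration, the assembly in data_preparation/pipeline.py might look roughly like this (the dataset and parameter names are hypothetical):

from kedro.pipeline import Pipeline, node
from .nodes import extract_data_from_label_studio

def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(
                func=extract_data_from_label_studio,
                inputs=["params:label_studio_url", "params:label_studio_token", "params:output_path"],
                outputs="annotated_data",
                name="extract_data_from_label_studio",
            ),
        ]
    )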

Data Science Pipeline

  • The data_science/nodes.py file contains the training and evaluation code of the TrOCR model. We also use MLflow to log the parameters, metrics, and artifacts of the run (a simplified sketch follows this list).
  • In data_science/pipeline.py, we specify the inputs and outputs of each node (the functions defined in nodes.py).
  • The parameters of the data science pipeline can be found in parameters/data_science.yml.
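
To give an idea of how the MLflow calls fit into a node, here is a heavily simplified sketch of a training node; the real TrOCR training loop is replaced by a stub, and all names are illustrative:

import mlflow

def train_and_log(params: dict) -> None:
    # Log the hyperparameters of the run
    mlflow.log_params({"epochs": params["epochs"], "learning_rate": params["learning_rate"]})
    for epoch in range(params["epochs"]):
        train_loss = 1.0 / (epoch + 1)  # stand-in for the real epoch loss
        mlflow.log_metric("train_loss", train_loss, step=epoch)
    # After training, the saved checkpoints could be logged as artifacts:
    # mlflow.log_artifacts("data/06_models/trocr", artifact_path="model")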

Data Versioning Pipeline

As in the previous pipelines, the nodes of the data versioning pipeline are defined in data_versioning/nodes.py. They define the remote configurer interface, the DVC Azure remote configuration, and the DVC and Git clients (a simplified sketch follows). Finally, pytorch_pipeline/pipeline_registry.py is used to register the created pipelines.
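
A versioning node along these lines could simply shell out to the DVC and Git command-line clients; the sketch below is a hypothetical illustration, not the project’s actual code:

import subprocess

def version_model(model_path: str, message: str) -> None:
    # Track the model directory with DVC and push it to the configured remote (e.g. Azure Blob Storage)
    subprocess.run(["dvc", "add", model_path], check=True)
    subprocess.run(["dvc", "push"], check=True)
    # Commit the generated .dvc pointer file so the model version is captured in Git
    subprocess.run(["git", "add", f"{model_path}.dvc"], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)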

Running the project

kedro run

You can also run a specific pipeline:

kedro run --pipeline my_pipeline

If you run kedro run without the --pipeline option, it runs the default pipeline from the dictionary returned by register_pipelines().

Launching the MLflow UI:

kedro mlflow ui

References

https://mlflow.org/docs/
