MLOps Blog

ML Experiment Tracking: What It Is, Why It Matters, and How to Implement It

10 min
3rd April, 2024

Let me share a story that I’ve heard too many times.


So far, we have been doing everything manually and sort of ad hoc.

Some people are using this tool, some people are using that one; it’s all over the place.

We don’t have anything standardized.

But we run many projects, the team is growing, and we are scaling pretty fast.

So we run into a lot of problems. How was the model trained? On what data? What parameters did we use for different versions? How can we reproduce them?

We just feel the need to get our experiments under control.

– an unfortunate Data Scientist

The truth is, when you develop ML models, you will run lots of experiments.

And those experiments may:

  • use different models and model hyperparameters,
  • use different training or evaluation data,
  • run different code (including that one small change you wanted to test the other day),
  • or run the same code in a different environment (not knowing which PyTorch or TensorFlow version was installed).

As a result, each of these experiments can produce completely different evaluation metrics. 

Keeping track of all that information becomes really difficult really quickly. Especially if you want to organize and compare many experiments and feel confident that you selected the best models to go into production.

This is where experiment tracking comes in. 

What is ML experiment tracking?

Experiment tracking is the process of saving all experiment-related information that you care about for every experiment you run. What this “information you care about” is will strongly depend on your project.

Generally, this so-called experiment metadata may include:

  • Any scripts used for running the experiment
  • Environment configuration files
  • Information about the data used for training and evaluation (e.g., dataset statistics and versions)
  • Model and training parameter configurations
  • ML evaluation metrics
  • Model weights
  • Performance visualizations (e.g., a confusion matrix or ROC curve)
  • Example predictions on the validation set (common in computer vision)

Of course, you want to have this information available after the experiment has finished. But, ideally, you’d like to see some of it already as your experiment is running. 

Why?

Because for some experiments, you can see (almost) right away that there is no way they will get you better results. Instead of letting them run (which might take days or weeks), you are better off simply stopping them and trying something different.
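As a minimal sketch of that logic, assuming a hypothetical evaluate_epoch function standing in for your own training-plus-evaluation step: watch the metric you are logging and stop the run once it stops improving.

import random

def evaluate_epoch(epoch):
    # Hypothetical stand-in for training one epoch and computing a validation
    # metric; in a real project, this is also where you would log the value.
    return 0.70 + random.random() * 0.05

best_score, stale_epochs, patience = float("-inf"), 0, 5
for epoch in range(100):
    score = evaluate_epoch(epoch)
    if score > best_score:
        best_score, stale_epochs = score, 0
    else:
        stale_epochs += 1
    if stale_epochs >= patience:
        print(f"No improvement for {patience} epochs - stopping at epoch {epoch}")
        break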

To be able to collect, store, and analyze all the data, you need an experiment tracking system in place. Such a system will typically have three components: 

  • Experiment database (neptune.ai servers on the visual below): A place where your logged experiment metadata is stored and can be queried.
  • Client library: A collection of methods that help you log metadata right from your training scripts and query the experiment database.
  • Experiment dashboard (neptune.ai web app on the visual below): A visual interface to your experiment database where you can see your experiment metadata.
Experiment tracking system architecture (based on neptune.ai example)

Of course, you can implement each component in many different ways, but the general picture will be very similar. 
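To make the division of labor concrete, here is a toy sketch (not how neptune.ai or any other tool is actually implemented): a minimal “client library” that writes each run’s metadata into a local directory playing the role of the experiment database. A dashboard would then simply be a UI that reads from the same store.

import json
import time
import uuid
from pathlib import Path

class ToyExperimentClient:
    """Toy client library: logs metadata as JSON files into a local
    'experiment database' directory. A real system would talk to a
    server API that also backs a web dashboard."""

    def __init__(self, repo_dir="experiment_db"):
        self.run_id = uuid.uuid4().hex[:8]
        self.run_dir = Path(repo_dir) / self.run_id
        self.run_dir.mkdir(parents=True, exist_ok=True)
        self.metadata = {"created_at": time.time()}

    def log(self, key, value):
        # Store a metadata field and persist immediately, so partial
        # results survive a crashed training job.
        self.metadata[key] = value
        (self.run_dir / "metadata.json").write_text(json.dumps(self.metadata, indent=2))

# Usage: the training script only ever talks to the client.
run = ToyExperimentClient()
run.log("params/learning_rate", 0.001)
run.log("metrics/val_accuracy", 0.83)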

Wait, so isn’t experiment tracking like MLOps or something?

ML experiment tracking vs MLOps

MLOps deals with every part of the ML project lifecycle: from developing models and scheduling distributed training jobs, through managing model serving and monitoring the quality of models in production, to re-training those models when needed.

Experiment tracking (also referred to as experiment logging) is part of MLOps, focused on supporting iterative model development, the part of the ML project lifecycle where you try many things to get your model performance to the level you need. Experiment tracking is closely intertwined with other aspects of MLOps, such as data and model versioning.

MLOps cycle and ML experiment tracking

Experiment tracking is useful even if your models don’t make it to production (yet). In many research-focused projects, you might never even get there. But especially in these projects, having all the metadata about every experiment you run and the ability to analyze it is important.

Ok, if you are a bit like me, you may be thinking:

Cool, so I know what experiment tracking is. …but why should I care?

Let me explain. 

LLMs: from experiment tracking to prompt tracking

Here’s what our CEO has to say:

If you look at the “jobs to be done” of an experiment tracker, it goes way beyond experimenting. It’s not just about research. When you’re building models, you want to understand what’s happening, you want to understand the building process, you want to debug it, you want to compare it with other experiments. In this way, you can understand whether the model you’re building is going in the right direction or not. You want to version it so you have some level of reproducibility, some way to share a particular model for feedback – and you want to be able to hand the model over to an Ops team.

When I think about prompt engineering, that’s quite a different way of building models. I’m not even sure that we should be calling it “engineering” in the sense of a building process because the model is stateless. For the latest models like GPT-4, fine-tuning is not (yet) available. So what you’re left with is crafting prompts. And you can configure agents and build the prompts in a sequential way using different models. So, yes, it is engineering.

When we talk about experiment tracking, we’re talking about the building phase and figuring out how a model works. In that spirit, I definitely see support for prompt visualizations and chain visualizations on our roadmap, as well as integration with LangChain. But this is just the beginning! I think that to really support teams that are building Large Language Models and using them in production, we’ll have to support and invent new methods to validate prompts.

Listen to the full conversation with our founding CEO Piotr Niedźwiedź about experiment tracking and prompt engineering for LLMs on episode 168 of the MLOps Community podcast.

Why does ML experiment tracking matter?  

Building a tool for ML practitioners has one huge benefit. You get to talk to a lot of them. 

And after talking to hundreds of people who track their experiments in Neptune, I identified four ways experiment tracking can improve your workflow.

4 reasons why ML experiment tracking matters

All of your ML experiments and models are organized in a single place

There are many ways to run your ML experiments or model training jobs:

  • Personal laptop
  • PC at work
  • A dedicated instance in the cloud
  • University cluster
  • Kaggle kernel or Google Colab
  • and many more 


Sometimes, you just want to test something quickly and run an experiment in a notebook. Sometimes, you need to spin up a distributed hyperparameter tuning job. 

Either way, over the course of a project (especially when several people are working on it), you can end up with experiment results scattered across multiple machines.

With an experiment tracking system, all of your experiment results are logged to one experiment repository by design. Keeping all of your experiment metadata in a single place, regardless of where you run them, makes your experimentation process so much easier to manage.

[experiment tracking system] allows us to keep all of our experiments organized in a single space. Being able to see my team’s work results any time I need makes it effortless to track progress and enables easier coordination. – Michael Ulin, VP of Machine Learning at Zesty.ai

A centralized experiment repository makes it easy to:

  • Search and filter experiments to find the information you need quickly
  • Compare metrics and parameters between experiments with no additional work
  • Drill down and see what exactly it was that you tried (code, data versions, architectures) 
  • Reproduce or re-run experiments when you need to
  • Access experiment metadata even when you don’t have access to the server where you ran them
All metadata in a single place with an experiment tracker (example in neptune.ai)

Additionally, you can sleep peacefully knowing that all the ideas you tried are safely stored, and you can always go back to them later. 

Compare ML experiments, analyze results, and debug model training with little extra work

Whether you are debugging training runs, looking for improvement ideas, or auditing your current best models, comparing experiments is important.

But when you don’t have any experiment tracking system in place,

  • the way you log things can change, 
  • you may forget to log something important,
  • and you’re likely to lose some information accidentally. 

In those situations, something as simple as comparing and analyzing experiments can get difficult or even impossible.

With an experiment tracking system, your experiments are stored in a single place, and you consistently follow the same protocol for logging them. Experiment analyses and comparisons can go as deep as you like, and you can focus on improving your models instead of worrying about data storage.

Tracking and comparing different approaches has noticeably boosted our productivity, allowing us to focus more on the experiments [and] develop new, good practices within our team. – Tomasz Grygiel, Data Scientist at idenTT

Proper experiment tracking makes it easy to:

  • Compare parameters and metrics between experiments
  • Overlay learning curves of different training runs
  • Group and compare experiments based on data versions or parameter values
  • Compare confusion matrices, ROC curves, and other performance charts
  • Compare the best/worst predictions on test or validation sets
  • View code diffs (and/or notebook diffs) for model, feature engineering, and training code
  • Look at hardware consumption during training runs for various models
  • Look at prediction explanations like feature importance, SHAP, or LIME
  • Compare rich-format artifacts like video or audio
  • 
and compare anything else you logged
Comparison features for ML experiment tracking (example in neptune.ai)

Modern experiment tracking tools will give you many, if not all, of those comparison features (almost) for free. Some tools even go as far as to automatically find suitable experiments to compare to and identify for you which parameters have the biggest impact on model performance.

When you have all the pieces in one place, you can gain new insights and ideas just by looking at all the metadata you logged. That is especially true when you are not working alone. 

Speaking of which



Improve collaboration: see what everyone is doing, share ML experiment results easily, and access experiment data programmatically   

When you are part of a team, and many people are running experiments, having one source of truth for your entire team is really important.

[An experiment tracking system] makes it easy to share results with my teammates. I’m sending them a link and telling what to look at, or I’m building a view on the experiments dashboard. I don’t need to generate it by myself, and everyone in my team has access to it. – Maciej Bartczak, Research Lead at Banacha Street

Experiment tracking lets you organize and compare not only your past experiments but also see what everyone else was trying and how that worked out. 

Collaboration features for ML experiment tracking (example in neptune.ai)

Sharing results becomes easier, too. 

Modern experiment tracking tools let you share your work by sending a link to a particular experiment or dashboard view. You don’t have to send screenshots or “have a quick meeting” to explain what is going on in your experiment. It saves a ton of time and energy.

For example, here is a link to an experiment comparison dashboard I did months ago. Pretty easy, right?

Apart from sharing things you see in a web UI, most experiment tracking setups let you access experiment metadata programmatically. This comes in handy when your experiments and models go from experimentation to production. For example, you can connect your experiment tracking tool to a CI/CD framework like GitHub Actions and integrate ML experimentation into your teams’ workflow. A visual comparison between the models on branches `main` and `develop` (and a way to explore details) adds another sanity check before you update your production model.
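As a rough sketch of such a check, assuming neptune’s project-level query API (fetch_runs_table) and hypothetical field names like metrics/val_accuracy and git/branch (other trackers expose similar query methods under different names):

import neptune

# Pull run metadata into a pandas DataFrame inside a CI job.
project = neptune.init_project(project="my-workspace/my-project", mode="read-only")
runs_df = project.fetch_runs_table(
    columns=["sys/id", "metrics/val_accuracy", "git/branch"]  # hypothetical field names
).to_pandas()

# Hypothetical sanity check: the best candidate on `develop` should not be
# worse than the best model on `main`.
best = runs_df.groupby("git/branch")["metrics/val_accuracy"].max()
if best.get("develop", 0.0) < best.get("main", 0.0):
    raise SystemExit("Candidate model underperforms production - failing the check.")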

See your ML runs live: manage ML experiments from anywhere at any time

When you are training a model on your local computer, you can see what is going on whenever you like. But if your experiment is running on a remote server at work, university, or in the cloud, it may not be as easy to see what the learning curve looks like or discover that the training job crashed.

Experiment tracking systems solve this problem. While it’s a big security no-no to allow remote access to all of your data and servers, letting people see only their experiment’s metadata is usually fine.

When you can easily compare the currently running experiment to previous runs, you can decide whether it makes sense to continue. Why waste those precious GPU hours on something that is not converging? You will also quickly notice if your cloud training job has crashed, and you can close it (or fix the bug and re-run).

Speaking of GPUs and failed jobs, some experiment tracking tools monitor training and log hardware consumption, helping you see whether you are using your resources efficiently.

Real-time monitoring feature for ML experiment tracking (example in neptune.ai)

For example, looking at GPU consumption over time can help you spot that your data loaders are not working correctly or that your multi-GPU setup is actually using just one of the GPUs (which has happened to me more times than I’d like to admit).
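If your tracking tool doesn’t collect hardware metrics for you, a small sampling loop is enough to get started. This sketch uses NVIDIA’s management library via pynvml; log_metric is a hypothetical stand-in for your tracker’s logging call, and in practice you would run the loop in a background thread alongside training.

import time
import pynvml  # pip install nvidia-ml-py

def log_metric(name, value):
    # Hypothetical stand-in for your experiment tracker's logging call.
    print(f"{name}: {value}")

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

for _ in range(12):  # sample for roughly one minute
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    log_metric("monitoring/gpu_utilization_percent", util.gpu)
    log_metric("monitoring/gpu_memory_utilization_percent", util.memory)
    time.sleep(5)

pynvml.nvmlShutdown()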

Without the information I have in [Neptune’s] monitoring section, I wouldn’t know that my experiments are running 10 times slower than they could. – Michał Kardas, ML Researcher at TensorCell

ML experiment tracking best practices

So far, we’ve covered what machine learning experiment tracking is and why it matters.

Now it’s time to get into details.

What you should keep track of in any ML experiment

As I mentioned initially, what information you may want to track ultimately depends on the project’s characteristics.

But there are some things that you should keep track of regardless of the project you are working on. Those are:

  • Code: Preprocessing, training, and evaluation scripts, notebooks used for feature engineering, and other utilities. And, of course, all the code needed to run (and re-run) the experiment.
  • Environment: The easiest way to keep track of the environment is to save environment configuration files like `Dockerfile` (Docker), `requirements.txt` (pip), `pyproject.toml` (e.g., hatch and poetry), or `conda.yml` (conda). You can also save built Docker images on Docker Hub or your own container registry, but I find saving configuration files easier.
  • Data: Saving data versions (as a hash or the locations of immutable data resources) makes it easy to see what your model was trained on. You can also use modern data versioning tools like DVC (and save the .dvc files to your experiment tracking tool).
  • Parameters: Saving your experiment run’s configuration is crucial. Be especially careful when you pass parameters via the command line (e.g., through argparse, click, or hydra), as this is a place where you can easily forget to track important information (I have some horror stories to share). You may want to take a look at this article about various approaches to tracking hyperparameters.
  • Metrics: Logging evaluation metrics on train, validation, and test sets for every run is pretty obvious. But different frameworks do it differently, so you may want to check out this in-depth article on tracking ML model metrics.

Keeping track of those things will let you reproduce experiments, do basic debugging, and understand what happened at a high level.
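Here is a minimal sketch of what logging these essentials could look like. The run dictionary stands in for your tracker’s client object, and the file paths and values are placeholders.

import hashlib
import subprocess
from pathlib import Path

run = {}  # stand-in for your experiment tracker's run/client object

# Code: pin the exact commit the experiment ran from.
run["code/git_commit"] = subprocess.check_output(
    ["git", "rev-parse", "HEAD"], text=True
).strip()

# Environment: keep the dependency file with the run.
run["env/requirements"] = Path("requirements.txt").read_text()

# Data: a content hash makes the training data version explicit.
run["data/train_md5"] = hashlib.md5(Path("data/train.csv").read_bytes()).hexdigest()

# Parameters and metrics: plain key-value pairs.
run["params"] = {"learning_rate": 1e-3, "batch_size": 64, "epochs": 20}
run["metrics/val_accuracy"] = 0.87  # placeholder value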

That said, you can always log more things to gain even more insights. As long as you keep the data you track in a nice structure, it doesn’t hurt to collect information, even if you don’t know if it might be relevant later. After all, most metadata is just numbers and strings that don’t take up much space.

What else you could keep track of 

Let’s look at some additional things you may want to keep track of when working on a specific type of project.

Below are some of my recommendations for various ML project types. 

Machine Learning

  • Model weights
  • Evaluation charts (ROC curves, Confusion matrix)
  • Prediction distributions
Logging different model metadata for experiment tracking (example in neptune.ai)

Deep Learning

  • Model checkpoints (both during and after training)
  • Gradient norms (to control for vanishing or exploding gradient problems; see the sketch after this list)
  • Best/worst predictions on the validation and test set after training
  • Hardware resources: handy for debugging data loaders and multi-GPU setups
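Here is a minimal PyTorch sketch of the checkpoint and gradient-norm items above, using a dummy model and random batches purely for illustration:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # dummy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(3):
    for step in range(5):
        x, y = torch.randn(32, 10), torch.randn(32, 1)  # dummy batch
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        # clip_grad_norm_ returns the total gradient norm before clipping,
        # a convenient value to log for spotting vanishing/exploding gradients.
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1e9)
        print(f"epoch {epoch} step {step} loss {loss.item():.4f} grad_norm {grad_norm.item():.4f}")
        optimizer.step()
    torch.save(model.state_dict(), f"checkpoint_epoch_{epoch}.pt")  # per-epoch checkpoint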

Computer Vision

  • Model predictions after every epoch (labels, overlaid masks, or bounding boxes)

Natural Language Processing and Large Language Models

  • Inference time
  • Prompts (in the case of generative LLMs)
  • Specific evaluation metrics (e.g., ROUGE for text summarization or BLEU for translation between languages; see the sketch after this list)
  • Embedding size and dimensions, type of tokenizer, and number of attention heads (when training transformer models from scratch)
  • Feature importance, attention-based, or example-based explanations (see this overview for specific algorithms and more ideas)
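For example, a corpus-level BLEU score is a single number per run that is easy to log and compare. A small sketch with the sacrebleu package (the sentences are made up for illustration):

import sacrebleu  # pip install sacrebleu

hypotheses = ["the cat sat on the mat", "there is a dog in the garden"]
references = [["the cat is sitting on the mat", "a dog is in the garden"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")  # log this value alongside the run's parameters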

Structured Data

  • Input data snapshot (`.head()` on DataFrames if you are using pandas)
  • Feature importance (e.g., permutation importance; see the sketch after this list)
  • Prediction explanations like SHAP or partial dependence plots (they are all available in DALEX)
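A quick sketch of computing permutation importance with scikit-learn on a held-out split, so the scores can be logged as run metadata (the dataset is a built-in toy example):

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=42)

# Sort features by mean importance; in practice, log this table to your tracker.
for name, score in sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {score:.4f}")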

Reinforcement Learning

  • Episode return and episode length
  • Total environment steps, wall time, steps per second
  • Value and policy function losses
  • Aggregate statistics over multiple environments and/or runs

Hyperparameter Optimization

  • Run score: the metric you are optimizing after every iteration
  • Run parameters: parameter configuration tried at each iteration
  • Best parameters: best parameters so far and overall best parameters after all runs have concluded
  • Parameter comparison charts: there are various visualizations that you may want to log during or after training, like a parallel coordinates plot or a slice plot (they are all available in Optuna, by the way); see the sketch below
Hyperparameter optimization features for experiment tracking (example in neptune.ai)
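Here is a small Optuna sketch of the items above: the study object already keeps run scores, run parameters, and the best result, and the callback marks the place where you would forward them to your experiment tracker. The objective is a toy function, not a real training run.

import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("max_depth", 2, 10)
    # Toy score; in practice, train and evaluate a model with these parameters.
    return 1.0 / (1.0 + abs(lr - 1e-3)) + 0.01 * depth

def log_trial(study, trial):
    # Hypothetical tracker call: log each run's parameters, score, and best-so-far.
    print(f"trial {trial.number}: params={trial.params}, value={trial.value:.4f}, "
          f"best so far={study.best_value:.4f}")

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20, callbacks=[log_trial])
print("Best parameters:", study.best_params)
# Parameter comparison charts, e.g.: optuna.visualization.plot_parallel_coordinate(study)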

How to set up machine learning experiment tracking

OK, those are nice guidelines, but how do you actually implement experiment tracking in your machine learning project?

There are (at least) a few options. The most popular are:

  • Spreadsheets and naming conventions
  • Versioning everything in a Git repository
  • Using modern experiment tracking tools

Let’s talk about those now.

You can use spreadsheets and naming conventions (but please don’t)

A common approach to experiment tracking is to create one giant spreadsheet where you put all of the information you can (metrics, parameters, etc.) and a directory structure where things are named in a certain way. Those names usually end up being really long and intricate, like ‘model_v1_lr01_batchsize64_no_preprocessing_result_accuracy082.h5’.

Whenever you run an experiment, you look at the results and copy them to the spreadsheet.

What is wrong with that?

To be honest, in some situations, it can be just enough to solve your experiment tracking needs. It may not be the best solution, but it is quick and straightforward.


But things can fall apart really quickly. There are (at least) a few major reasons why tracking experiments in spreadsheets doesn’t work for most people:

  • You have to remember to track them. Things get messy if something doesn’t happen automatically, especially with more people involved.
  • You have to ensure that you or your team will not accidentally overwrite things in the spreadsheet. Spreadsheets are not easy to version, so if this happens, you are in trouble. 
  • You have to remember to use the naming conventions. If someone on your team messes this up, tracking down the experiment artifacts (model weights, performance charts) is painful.
  • You have to independently back up your artifact directories and keep them in sync with the spreadsheet. Even if you set up an automatic workflow that gets triggered regularly, there will inevitably come a time when it breaks.
  • When your spreadsheet grows, it becomes less and less usable. Searching for things and comparing hundreds of experiments in a spreadsheet (especially if you have multiple people who want to use it simultaneously) is not a great experience.

You can version ML experiment metadata files on GitHub

Another option is to version all of your experiment metadata in a GitHub repository. 

When running your experiment, you can commit metrics, parameters, charts, and whatever you want to keep track of to a repository. You can set up post-commit hooks that automatically create or update files (configs, charts, etc.) after your experiment finishes.

It can work in some setups, but:

  • Git wasn’t built for comparing machine learning artifacts and experiment metadata. It’s built for versioning and storing text files. Neither binary artifacts like image files nor structured, relational data are handled well.
  • You cannot compare more than two experiments at a time. Like most version control systems for code, Git was designed for comparing two commits. If you want to compare metrics and learning curves of multiple experiments, you are out of luck.
  • Organizing many experiments is difficult (if not outright impossible). You can have branches where you try out new ideas or a separate branch for each experiment. But the more experiments you run, the less usable it becomes. (And you’ll have to make sure everyone follows whatever branching convention you come up with.)
  • You will not be able to monitor your experiments live. You can only save information after your experiment finishes.

Maybe you could build your own ML experiment tracker?

If a spreadsheet relies too much on discipline and will quickly grow to an unmanageable size, and a Git repository is just not the right kind of data store, how about spinning up a database and writing a slim Python client?

It’s certainly not the worst idea, and many experiment-tracking and machine-learning management solutions – including our own – started this way.

At a minimum, you’ll need the following components (a rough sketch follows the list):

  • A database to keep your metadata. A natural choice is a schema-free database like MongoDB or CouchDB that allows you to store and query arbitrary JSON documents.
  • A place to store artifacts like model snapshots or plots. A blob storage bucket, a network drive, or a good old FTP server will probably do.
  • A client to integrate into your experiment code. A few lines of Python that push metadata and files to your central repositories will suffice initially.
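For illustration, a rough sketch of that slim client: one JSON document per run pushed into MongoDB with pymongo. The connection string, database, and field names are made up, and a real implementation would add error handling, authentication, and artifact uploads.

from datetime import datetime, timezone
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")
runs = client["experiment_db"]["runs"]  # one document per run

run_doc = {
    "created_at": datetime.now(timezone.utc),
    "params": {"learning_rate": 1e-3, "batch_size": 64},
    "metrics": {"val_accuracy": 0.87},
    # Large artifacts live in blob storage; only the reference is stored here.
    "artifacts": {"model_weights": "s3://my-bucket/runs/42/model.pt"},
}
run_id = runs.insert_one(run_doc).inserted_id

# Querying later, e.g., the five best runs by validation accuracy:
best_runs = runs.find().sort("metrics.val_accuracy", -1).limit(5)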

But then things start to get complicated rather quickly. How will you retrieve and analyze the metadata? Is your team content with pulling data into notebooks and generating their plots themselves? Or do you need to set up a dashboard and develop a web frontend? What about live tracking?

I certainly share the enthusiasm for conceptualizing and creating ML experiment-tracking tools – after all, it’s my job these days – but I doubt the effort is worth it for most teams. As we’ll see in the next sections, plenty of excellent tools are available.

You can use a modern experiment tracking tool

Instead of trying to adjust generic tools to work for machine learning experiments or developing your own platform, you could just use one of the solutions built specifically for tracking, organizing, and comparing experiments. 

Within the first few tens of runs, I realized how complete the tracking was – not just one or two numbers, but also the exact state of the code, the best-quality model snapshot stored to the cloud, the ability to quickly add notes on a particular experiment. My old methods were such a mess by comparison. – Edward Dixon, Data Scientist at Intel

These tools have slightly different interfaces, but they usually work in a similar way:

Step 1

Connect to the tool by adding a snippet to your training code.

For example:

import neptune

run = neptune.init_run() # initialize a new run

Step 2

Specify what you want to log (or use an ML framework integration that does it for you):

from neptune.types import File

run["accuracy"] = evaluate_accuracy(model, test_data)
for prediction_image in worst_predictions:
    run["worst predictions"].append(
        File.as_image(prediction_image)
    )

Step 3

Run your experiment as you normally would:

python train.py

And that’s it!

Your experiment is logged to a central experiment database and displayed in a dashboard, where you can search, compare, and drill down to whatever information you need.

Machine learning experiment tracking with neptune.ai

Today, there are several tools for machine learning experiment tracking optimized for different contexts, and I would strongly recommend using one. They are designed to treat machine learning experiments as first-class citizens.

One important decision to make is whether you want to use a software-as-a-service offering or host an open source tool yourself. Self-hosting an open source experiment tracker has a few key advantages:

  • Most open source experiment tracking tools provide the interfaces you need to create plugins and integrations. This might be an essential selection criterion if you’re working with somewhat esoteric data storage systems or compute infrastructure. 
  • You’re not tied to any vendor or cloud provider. If you want, you can take your machine learning experiment tracker and move to a different cloud provider. There is no need to try to migrate data between incompatible platforms. If an open source project is no longer maintained, you can keep developing it – or at least keep the lights on as long as you need to prepare your migration on your own schedule.
  • Your data and artifacts never have to leave your premises. For the vast majority of businesses and even government agencies, the protections that contracts and legal agreements provide are sufficient to allow them to store their data on third-party clouds. But if your data must under no circumstances leave your building, self-hosting is your only option.

But let’s face it: Everyone who’s ever self-hosted tools knows how long it takes to get things working right and has experienced how simple day-to-day system maintenance can become a bottomless time sink. And let’s not forget the hassle of keeping up with breaking changes and security fixes.

It’s no surprise that many data science teams are looking for a fully managed experiment tracking platform. Key benefits of paying someone else to run an experiment tracker on your behalf include:

  • You don’t need to worry about infrastructure, scaling, and updates. Someone else takes care of the burdensome maintenance work, and you’ll never lose valuable data because your server runs out of storage space mid-experiment.
  • Vendors have a lot more experience with machine learning experiment tracking than any single machine learning team could ever accumulate. Here at Neptune, we’ve worked with hundreds of customers, constantly learning about new edge cases and continuously discovering new ways to optimize experiment tracking.
  • Data scientists can focus on creating and optimizing machine learning models. When you adopt a managed experiment tracking platform, you’re not only leaving the software engineering and maintenance to the specialists but also getting access to dedicated support.

Next steps

Machine learning experiment tracking is, first and foremost, a practice, not just a tool or a logging method. It will take some time to really understand and implement:

  • what to keep track of for your project,
  • how to use that information to improve future experiments,
  • how to improve your teams’ unique workflow with it,
  • and when to even use experiment tracking.

Hopefully, after reading this article, you have a good idea of how to start tracking and how it can improve your (or your teams’) machine learning workflow. 


FAQ

  • How can you keep track of machine learning experiments?

    Just like there are many different kinds of machine learning models, there are numerous approaches and tools for tracking machine learning experiments.

    Data science teams often start out by using spreadsheets to record the parameters and results of experiments. Some might use a Git repository to keep this data instead and utilize their CI/CD system to automatically add the outcomes of new experiments.

    However, all these hand-crafted solutions tend to be brittle and are prone to fail as the number of experiments grows. This is where dedicated machine learning experiment tracking tools shine. They provide convenient integration with the model training code and offer a wide range of features to visualize and compare experiments.

  • What are the phases of a machine learning experiment?

    A machine learning experiment consists of three phases:

    1. Defining the input parameters, such as which data samples to use, the model’s configuration, the number of training iterations, and the learning rate.
    2. Training the model. This can take anywhere from seconds on a laptop to days on a dedicated GPU cluster.
    3. Evaluating the model. Using test data that the model has not seen during training, its performance can be assessed through metrics like accuracy.

    For experiments to be worthwhile, it is paramount to keep records of both the input parameters and the performance metrics. Only then will you be able to compare different models, identify avenues for performance improvements, and ultimately determine the optimal parameters.

  • How do you log machine learning experiments?

    To systematically and consistently log machine learning experiments, all data recording should happen directly from the code controlling the experiment.

    Most experiment tracking tools have a client library to import into your training and evaluation scripts. Then, you can submit parameter values or even image files and model checkpoints by calling a function. The client sends the data to the experiment tracking tool’s API on your behalf.

  • What is machine learning metadata?

    Machine learning metadata encompasses everything that’s not the model itself but is related to its creation and lifecycle. There is no clear definition of the term, but loosely speaking, it refers to any information you or someone else might want to know about a machine learning model.

    For example, machine learning metadata includes any scripts used for training a model, information about training, evaluation, test data, and the specific training parameter configuration. Metadata can also comprise dataset statistics and evaluation metrics, visualizations (like a confusion matrix or ROC curve), and environment configuration files.

  • How does an experiment tracking tool differ from a model registry?

    A model registry stores and organizes trained models. While it will contain some metadata along with a model (such as the time a model was uploaded, its version, or a name), this data is only used to catalog the model and make it accessible to downstream users.

    An experiment tracking tool is focused on recording and analyzing information about the machine learning model’s training process. While it might also allow for storing model artifacts, the main purpose is to keep track of extensive information collected during training. Often, an experiment tracking tool provides features to visualize this information and compare different runs of an experiment.

    For a more in-depth discussion of this topic, have a look at our Experiment Tracking vs Machine Learning Model Management vs MLOps article.
