Published 2 months ago.

Hidden Technical Debt in Machine Learning Systems

As a machine learning student, I used to treat machine learning problems ‘kagglewise’: taking ‘quick wins’ by building and tuning models in notebooks, then showing off the best-performing experiment on social media, or wherever it could make my friends jealous. However, as my thesis work went on, the term ‘MLOps’, which stands for Machine Learning Operations, came up at least as often as model, hyper-parameters and other terms that I thought of as the core elements of machine learning systems. What the food is going on? Why should machine learning teams care so much about MLOps? Today we will discuss the paper Hidden Technical Debt in Machine Learning Systems by Google, which addresses the practical risks hiding in real-world ML systems. Although it was published at NIPS 6 years ago, it arguably makes even more sense to study it today, given how much the machine learning industry has grown since then. So, what are we waiting for?


What is Technical Debt?

Technical debt is a concept in software development that reflects the implied cost of additional rework caused by choosing an easier solution early on instead of a more maintainable approach that would take longer to develop. Just like financial debt, technical debt compounds: if earlier debts are not paid off in time, the accumulated cost bottlenecks the development cycle. Traditional ways of paying off technical debt include refactoring code, reducing dependencies, improving documentation, and so on.


Technical Debt in ML Systems

ML systems have a special capacity for incurring technical debt, because they have all of the maintenance problems of traditional code, plus an additional set of ML-specific issues. The paper categorizes them into seven areas:

    1. Eroded Boundaries: Traditional software keeps modules isolated through strict abstraction boundaries, but these are difficult to enforce and monitor in ML systems, because the intended behavior is not prescribed in code; we rely on the model to learn it by itself. In other words, changing anything changes everything.
    2. Data Dependencies: ML systems are built upon large-scale data dependency chains. However, unlike code dependencies in traditional software engineering, which can be detected by compilers or linkers, data dependencies lack such tooling, which makes ML systems hard to untangle.
    3. Feedback Loops: Live ML systems often end up influencing their own behavior if they update over time; this is most commonly seen in recommendation systems. Such feedback loops are difficult to detect.
    4. Anti-Patterns: Only a tiny fraction of the code in many ML systems is actually devoted to learning or prediction. The rest covers configuration, data pipelines, version control, system monitoring and so on. Anti-patterns such as Glue Code and Pipeline Jungles occur more often than we would normally expect.
    5. Configuration Debt: ML systems have a wide range of configurable options, including feature choice, data selection, tooling versions etc., which gives rise to the famous reproducibility problem. Configurations are hard to monitor and modify, or even to reason about.
    6. Changes in the External World: ML systems often interact directly with the external world. This close interaction makes it a great challenge to adapt not only quickly but, more importantly, correctly to a world that changes fast and unpredictably.
    7. Others: Additional areas where ML-related technical debt may accrue include data testing (for example, data quality inspection); reproducibility, as mentioned before; automated process management; and cultural debt, which requires teams to combine both research and engineering mindsets.
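
To get a feeling for the first item, "changing anything changes everything", here is a minimal sketch that is not taken from the paper. It assumes a toy linear model with two strongly correlated features; the data and the helper names (dot, fit_one, fit_two) are made up for illustration. It fits least-squares weights with and without the second feature and compares the weight on the first one.

```python
import random

def dot(a, b):
    """Sum of element-wise products of two equally long lists."""
    return sum(x * y for x, y in zip(a, b))

def fit_one(x, y):
    """Least-squares weight for a single feature (no intercept)."""
    return dot(x, y) / dot(x, x)

def fit_two(x1, x2, y):
    """Least-squares weights for two features via the 2x2 normal equations."""
    s11, s12, s22 = dot(x1, x1), dot(x1, x2), dot(x2, x2)
    t1, t2 = dot(x1, y), dot(x2, y)
    det = s11 * s22 - s12 * s12
    return (t1 * s22 - t2 * s12) / det, (t2 * s11 - t1 * s12) / det

random.seed(0)
n = 1000
x1 = [random.gauss(0, 1) for _ in range(n)]
# Hypothetical second feature: almost a copy of x1 plus a little noise.
x2 = [a + 0.1 * random.gauss(0, 1) for a in x1]
# Toy target that depends equally on both features.
y = [a + b for a, b in zip(x1, x2)]

w1_full, w2_full = fit_two(x1, x2, y)   # model that sees both features
w1_reduced = fit_one(x1, y)             # same data, x2 removed

print(f"with x2:    w1 = {w1_full:.3f}, w2 = {w2_full:.3f}")
print(f"without x2: w1 = {w1_reduced:.3f}")
```

With both features the weights come out close to 1 each. Once x2 is removed, the weight on x1 moves to roughly 2, because x1 has to absorb the signal that x2 used to carry. Nothing about x1 itself changed, yet its role in the model did, which is the kind of entanglement that makes module boundaries hard to hold in ML systems.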

Measuring and Paying Off ML Technical Debt

With all that said, what matters is measuring the hidden technical debt in ML systems effectively and efficiently, and then paying it off. The approach will mostly depend on the organization, but there are some common, important questions to think about while developing ML systems. For example: How transitive are the data and module dependencies in the current design? How wise is it to improve accuracy at the cost of increased model complexity? How quickly can new members of the team be brought up to speed? And so on…
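
The dependency question can be made a bit more concrete. The sketch below is my own illustration, not something from the paper: it assumes the data dependencies of a system have been written down as a plain dictionary (the names ranking_model, click_features and so on are hypothetical), and it walks that graph to list everything a model depends on, directly or indirectly.

```python
from collections import deque

def transitive_dependencies(graph, node):
    """Return every upstream dependency of `node`, direct or indirect.

    `graph` maps a name to the list of names it directly depends on.
    """
    seen = set()
    queue = deque(graph.get(node, ()))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already visited, also guards against cycles
        seen.add(dep)
        queue.extend(graph.get(dep, ()))
    return seen

# Hypothetical dependency graph for illustration only.
graph = {
    "ranking_model": ["click_features", "user_embedding"],
    "click_features": ["click_logs"],
    "user_embedding": ["profile_table", "click_logs"],
}

deps = transitive_dependencies(graph, "ranking_model")
print(sorted(deps))
```

The model lists only two direct inputs, but the traversal also surfaces click_logs and profile_table. The size of that full set is one rough signal of how far a change in a single upstream source can travel.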


Lastly, I will quote my favorite part of this paper: “Paying down ML-related technical debt requires a specific commitment, which can often only be achieved by a shift in team culture. Recognizing, prioritizing, and rewarding this effort is important for the long term health of successful ML teams.” I hope this takeaway can benefit all of us machine learning enthusiasts, whether you are developing your own ML systems, maintaining existing architectures or even just thinking out loud!

Published by Tianzong Wang 2 months ago.