ML models are unlike traditional algorithms in that most need to be retrained frequently to deal with distribution shift, so monitoring that your model still accurately reflects the real-world distribution is a key component. If you rely on ML doing a good job and need to continuously improve your models, you end up caring a lot about how much retraining costs, how your ongoing labeling operations are going, and how quickly you can go from newly-labeled data --> retrained model --> evaluated model --> deployed model.
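As a rough sketch of what that monitoring might look like, here's a minimal drift check using the population stability index (PSI) -- one common choice among several -- comparing a feature's training distribution against live traffic. All names and thresholds here are illustrative, not a real monitoring stack:

```python
import numpy as np

def psi(expected, observed, bins=10):
    """Population Stability Index between a reference sample (e.g. the
    training data) and a live sample. Common rule of thumb: PSI > 0.2
    suggests a meaningful distribution shift worth investigating."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    o_frac = np.histogram(observed, bins=edges)[0] / len(observed)
    # Clip to avoid log(0) / divide-by-zero on empty buckets
    e_frac = np.clip(e_frac, 1e-6, None)
    o_frac = np.clip(o_frac, 1e-6, None)
    return float(np.sum((o_frac - e_frac) * np.log(o_frac / e_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)     # distribution at training time
same = rng.normal(0.0, 1.0, 10_000)      # live traffic, no shift
shifted = rng.normal(0.8, 1.0, 10_000)   # live traffic, shifted mean

print(round(psi(train, same), 3))     # near zero: no shift flagged
print(round(psi(train, shifted), 3))  # well above 0.2: shift flagged
```

In a real deployment this runs per feature (and per model-output distribution) on a schedule, and a breach triggers the retraining loop described above.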
Furthermore, ML orgs often have to make labeling and training efficient by sharing resources & strategically triaging only the most impactful experiments / picking the most impactful things to label.
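As one hedged illustration of "picking the most impactful things to label": a minimal margin-based uncertainty-sampling triage, which spends the labeling budget on the items the current model is least sure about. The function name and inputs are hypothetical; real systems layer business rules and dedup on top:

```python
import numpy as np

def triage_for_labeling(probs, budget):
    """Given predicted class probabilities for unlabeled items, return
    indices of the `budget` items with the smallest margin between the
    top two classes -- i.e. the ones the model is least sure about."""
    probs = np.asarray(probs)
    top2 = np.sort(probs, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]   # small margin = uncertain
    return np.argsort(margin)[:budget]

probs = [
    [0.98, 0.01, 0.01],  # confident -> low labeling priority
    [0.40, 0.35, 0.25],  # very uncertain -> label first
    [0.55, 0.44, 0.01],  # borderline
]
print(triage_for_labeling(probs, budget=2))  # -> [1 2]
```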
There's also the pace of progress in the field, and often organizations will have dedicated data science / ML folks who want to run experiments to improve the models upon each retraining. Unlocking that kind of rapid prototyping takes quite a lot of infra.
If you have a lot of models and you're re-building the end-to-end stack each time, you end up with a ton of wasted work. A lot of this stuff is also pretty specialized. It's a bit like asking all your engineers to set up their own web server, proxy, monitoring / alerting, maybe a load balancer, etc etc. Plenty of people know how to do it, and for small companies doing 1 or 2 ML models maybe that's fine. But for corporations at scale it makes no sense to do it like that. And when you're at scale, a small team probably works on the load balancers vs the base infra vs the software platform vs the features. ML works the same way.
Some real-world examples that show how much infra ML-heavy organizations need:
- Say you want to build a _motorcycle detector for a self-driving car_. You need to build a data extraction pipeline that processes images, gets the segmented objects, sends each object for labeling, then when you have all your labeled data, you need to split it into test/train/validation (and make sure you use the same splits as everyone else building the car's software), then you need to have these piles of images integrated with additional information needed for training (e.g. how fast the object was moving, what time of day), you need to upsample/downsample certain cases (e.g. maybe you need to upsample examples of motorcycles at night), then train a net (locally, or in the cloud, or maybe as part of N experiments to tune, on shared infrastructure that M people are using where jobs need to be prioritized), then evaluate (do you need to build infra to harvest important metrics, like how well your model performs when the car is driving on a slope? in fog?), then optimize for onboard (are you going to run it on CPU? GPU? Accelerator? Do you optimize it using TensorRT? Your own quantization infra? Distillation?), then deploy (and monitor -- is the model eating too much memory during inference? Being run too many times? Not doing the right thing?). Okay, your model works -- are you sure it will keep working when motorcycles look different 10 years or even just 1 year from now?
- Say you want to build a _spam detector for your social media website_. You do everything above, build and deploy your model to the cloud, and suddenly you realize it's not working: a new spam campaign has occurred that your model can't account for. You need to add more labeled data, but how much and where are you going to get it? After adding it, what does your overall data look like? Adding it didn't help your net as much as you expected -- why? The model-level eval looked improved, but combined with the rule system it runs alongside, the end-to-end result got worse. Crap, how to debug? Okay, finally working -- how stale is the data in your model after 1 year? Did we regress on something when we solved the spam campaign? You have a computational budget for how big the net can get, because it's used in real time to judge spamminess of posts on a major website -- maybe you care about what hardware your model is running on and how to best optimize for that hardware. Maybe you use cloud TPUs, where large batch sizes help you to scale. Maybe you use Graphcore or something that thrives on small batch size. What if you started on one, moved to the other, and suddenly your net isn't working as well? What if you upgrade from an RTX 2080 Ti -> RTX 3080 Ti and see that your net has a prediction regression? Do you have infra to detect these regressions? Over time, when your data got an order of magnitude bigger, you noticed that your net's hyperparams were no longer optimal. You needed to increase your learning rate, or decrease it. Did you notice this issue, and do you have the infra to do that tuning quickly? You notice your labeling budget is too small to label everything flagged as spam. How do you decide which things are most worthwhile to label?
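To make the prediction-regression question above concrete, here's a minimal sketch of a cross-hardware/runtime check: run the same inputs through both stacks, then compare the logits for numeric drift and, more importantly, label flips. The function and thresholds are illustrative, not a real tool:

```python
import numpy as np

def prediction_regression(ref_outputs, new_outputs, atol=1e-3, max_flips=0.001):
    """Compare logits from a reference runtime/hardware against a new one.
    Flags both raw numeric drift and label flips: cases where the argmax
    (the actual prediction served to users) changed."""
    ref = np.asarray(ref_outputs)
    new = np.asarray(new_outputs)
    max_diff = float(np.max(np.abs(ref - new)))
    flip_rate = float(np.mean(ref.argmax(axis=1) != new.argmax(axis=1)))
    return {
        "max_abs_diff": max_diff,
        "label_flip_rate": flip_rate,
        "ok": max_diff <= atol and flip_rate <= max_flips,
    }

rng = np.random.default_rng(0)
ref = rng.normal(size=(1000, 5))                    # logits from the old stack
print(prediction_regression(ref, ref)["ok"])        # identical outputs: True
print(prediction_regression(ref, ref + 0.5)["ok"])  # large numeric drift: False
```

In practice you'd run this on a pinned "golden" input set as part of CI whenever the hardware, driver, or inference runtime changes.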
You have to build infra for all of this. MLOps is needed every step of the way. It's not that different from needing SREs and cloud infra engineers to run your cloud services & organizations.
Labeling infra alone is a big enough market for companies like Hive and Scale to build billion-dollar businesses.
Is there really a difference between "labelling data" and assigning properties to "traditional" inputs (e.g. assigning tax codes, classifying new products, filing cases, managing customer data, ...)?
Is there something fundamentally unique about sharing and monitoring data during ML training as opposed to say feedback loops between trading algorithms and profit or production planning, logistics and market response?
Or, to address your examples: wouldn't the same issues as in your motorcycle detector arise with any other software implementation? Hardware constraints, runtime limitations, and runtime requirements are in no way unique to ML, after all.
The same applies to your spam detector example. The same questions arise with any other software. It's all just constraints versus benefit, data quality, monitoring loops, infrastructure, and cost.
I honestly don't see anything that's truly unique to ML here.
The part that is described as "model training" in ML is just done manually by developers and expressed as iterations in engineering. I would therefore think that the skillset is very much transferable and much of the apparent novelty is just traditional software engineering and management practices hidden behind ML jargon.
> I honestly don't see anything that's truly unique to ML here.
- The workloads are specific (lots of offline batch processing, accelerator-powered offline training, then speed/power/resource-constrained inference)
- The hardware is specific (ML accelerators are for ML -- you don't really use TPUs for anything else, do you?)
- Debugging is specific (ML-specific tools like XLA)
- Labeling is specific (e.g. labeling audio, video, 3D points requires specific tooling)
If what you're saying is, "ML engineering sounds like engineering" that's obvious and was never a point in contention. OP's comment was "a couple of motivated engineers can make ML work" and my point is -- kind of, but at scale you need a lot of very specific things which are best done by specialized folks.
That there are billion-dollar ML infra companies, as well as companies with ML infra teams that are hundreds of people, means that folks are finding it worthwhile to have, say, a team that works only on deploying nets efficiently. Or a team that only builds labeling tools. Or a team that only builds model evaluation tools. My ramble was mainly to illustrate just how many sub-problems there are in ML and why ML infra is rightfully a big business -- there's a reason companies that use a lot of ML don't just have 2-3 randos building everything end-to-end for each model.
> The part that is described as "model training" in ML is just done manually by developers and expressed as iterations in engineering. I would therefore think that the skillset is very much transferable and much of the apparent novelty is just traditional software engineering and management practices hidden behind ML jargon.
Yeah ML engineering is engineering, so plenty of skills transfer between ML engineering <-> other engineering. But if you want to go from other engineering -> ML engineering, you do have to learn ML-specific things that I would not dismiss as "novelized software engineering" or just "jargon."