May 16, 2022

5 Reasons Why AI Fails To Scale

It is no secret that a company's first ML projects tend to fail or deliver substandard results. What is less commonly known is that AI work does not follow the usual economics of scale: instead of costs, time to value, and risk of failure decreasing as operations grow, with ML projects the opposite happens.

This is a fairly robust finding. AI teams scale up in terms of customers served, models and pipelines managed, data volume, and the breadth of infrastructure and tooling, yet the standard rules of scaling do not apply: AI teams become less productive as they grow.

So why aren’t AI systems able to scale like other operations?

To answer this, we sat down with 60+ real-world teams of ML practitioners to understand the problems they encountered when their companies tried to operationalize and scale their AI initiatives. We distilled their pain points, combined them with our own experience, and arrived at five main categories:

1 — Organizational issues arise when building and expanding the AI team. Hiring managers are well aware of the scarcity and high cost of ML talent, and the undefined, overlapping skillsets prevalent in common AI team structures only exacerbate these inefficiencies. Case in point: the average search for an ML engineer takes over six months.

2 — Infrastructure orchestration is rarely done effectively and rarely benefits from scale. Deep cross-domain expertise is needed to deliver resource-intensive tasks, and the handoff between infrastructure experts and ML practitioners does not scale readily across siloed organizations. Among the 50%+ of ML projects that fail, infrastructure is cited as the leading cause of failure.

3 — Automation of core machine learning operations is rarely effective. Most of the ML practitioners we have spoken with agree: AutoML products are generally of limited value, from inefficient hyperparameter optimization to messy experiment management and poor UX. These issues compound the technology's limitations in tuning pipelines. Teams instead tend to manually stitch together open-source tools, which again leaves core operations largely unautomated.

4 — Deployment productivity faces bottlenecks around core ML pipeline tasks. Chaotic pipeline jungles held together by time-intensive, messy glue code are all too pervasive. This approach also lacks repeatability, forcing AI teams to reinvent a suboptimal wheel with each experiment. Teams that feel comfortable with a few simple pipelines deployed in an ad hoc manner suddenly find themselves overwhelmed when their product begins to catch on with customers. By not formalizing their deployment processes sooner, AI teams end up with dramatically longer lead times and ever-increasing technical debt; a minimal sketch of what a formalized pipeline definition can look like follows this list.

5 — Business uncertainty makes it hard for leaders to commit to AI initiatives and deliver value. Sometimes the problem is as simple as setting expectations about the R&D nature of ML as opposed to standard software development. Other times, time, cost, and value-accretive outcomes are simply hard to predict. The result is that time to value is longer than decision-makers anticipate, and even successful outcomes can be difficult to explain.
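
To make point 4 concrete, here is a minimal, hypothetical sketch of what formalizing a pipeline can mean in practice: the steps, their order, and their parameters are declared once and reused across experiments rather than re-stitched with one-off glue code each time. The `Pipeline` and `Step` classes and the toy steps below are illustrative assumptions, not any particular tool's API.

```python
# A minimal, hypothetical sketch of formalizing an ML pipeline as a reusable
# definition instead of ad hoc glue code. The classes and toy steps are
# illustrative assumptions, not a specific product's API.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    name: str
    run: Callable[[Dict[str, Any]], Dict[str, Any]]  # reads and returns context entries


@dataclass
class Pipeline:
    name: str
    steps: List[Step] = field(default_factory=list)

    def run(self, params: Dict[str, Any]) -> Dict[str, Any]:
        """Run every step in order, threading a shared context through them."""
        context: Dict[str, Any] = {"params": params}
        for step in self.steps:
            print(f"[{self.name}] running step: {step.name}")
            context.update(step.run(context))
        return context


# Toy step implementations; in practice these would load real data,
# train a real model, and push artifacts to a registry.
def ingest(ctx: Dict[str, Any]) -> Dict[str, Any]:
    return {"data": list(range(10))}

def train(ctx: Dict[str, Any]) -> Dict[str, Any]:
    lr = ctx["params"].get("learning_rate", 0.1)
    return {"model": {"learning_rate": lr, "n_samples": len(ctx["data"])}}

def evaluate(ctx: Dict[str, Any]) -> Dict[str, Any]:
    return {"metrics": {"accuracy": 0.9}}  # placeholder metric


# The same declared pipeline is reused for every experiment; only params change.
pipeline = Pipeline(
    name="churn-model",
    steps=[Step("ingest", ingest), Step("train", train), Step("evaluate", evaluate)],
)

if __name__ == "__main__":
    result = pipeline.run({"learning_rate": 0.05})
    print(result["metrics"])
```

Even a structure this lightweight makes runs repeatable and gives the team a natural place to attach logging, versioning, and deployment automation as the workload grows.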

We believe the solution to these problems lies in the discipline of MLOps. In order to scale AI, businesses first need to operationalize their teams, processes and tools in an integrated, cohesive manner. MLOps seeks to establish best practices and tools to facilitate rapid, safe, and efficient operationalization of AI. Point efficiencies, such as training speed and collaboration, compound in a well-managed MLOps system to enable scaling efficiencies. And when these efficiencies hit their stride, the value of the AI team grows exponentially.

In this blog series, we will explore the reasons why AI fails to scale in depth and discuss the benefits of building a future-proof MLOps platform. We will offer the insights we have gleaned from our team of award-winning researchers as well as the many ML Practitioners we have interviewed. Stay tuned for deeper insights into problem areas and the relevant solutions that can help your team #ScaleMLOps.
