Demis Hassabis: Agents & AGI

Demis Hassabis gave one of the most insightful conversations this year for ML practitioners, covering agents, AGI, and the next scientific breakthrough. Here are the key takeaways:
- Today's foundation-model stack is not a dead end, but it still needs better continual learning, long-term reasoning, memory, consistency, and introspection before agents can become reliable fire-and-forget systems.
- Agents are still early, and much of the near-term value is likely to come from human-in-the-loop workflows, fast distilled models, multimodal systems, local/edge deployment, and specialized tools orchestrated by general models rather than one giant monolith.
- For ML teams, the AlphaFold pattern is especially relevant: the highest-impact opportunities are domains with massive combinatorial search spaces, clear objective functions, and either strong data or simulators, such as drug discovery, materials, biology, and other deep-tech areas.
- He could not finish without a prediction on AGI; surprisingly, he is putting his money on 2030 as the year it arrives. That sounds like what someone who runs an AI lab would say, so this is the one point you should take with a pinch of salt.
---
Google has just published 50+ MCP servers to support programmatic agentic workflows, with integrations that ship with governance and observability by design. It is great to see this move helping teams replace fragile, bespoke tool integrations with managed MCP endpoints across Cloud infrastructure, databases, analytics, storage, Workspace, Maps, security, payments, and developer docs. For production ML practitioners, the interesting part is less that agents can call tools and more that cloud providers are clearly betting on infrastructure where agents are the target user. It is also interesting to see the maturity of the ecosystem, in this case IAM Deny policies, Agent Registry discovery, Model Armor for prompt-injection/data-exfiltration defense, OTel tracing, and Cloud Audit Logs. This seems a likely pattern for the rest of the cloud providers to follow, and it will become more interesting as agentic systems start providing automation higher up the stack, into operations, monitoring, and debugging.
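Whatever the provider, MCP tool invocations are plain JSON-RPC 2.0 messages under the hood. A minimal sketch of the request an agent runtime sends to a server, assuming a hypothetical `bigquery.run_query` tool (the actual Google endpoints and tool names will differ):

```python
import json

def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request (MCP uses JSON-RPC 2.0 framing)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Hypothetical tool name and arguments, for illustration only.
req = make_tool_call(1, "bigquery.run_query", {"sql": "SELECT 1"})
print(json.dumps(req, indent=2))
```

The interesting part of the announcement is everything wrapped around this call — identity, registry discovery, prompt-injection defense, tracing, and audit logs — rather than the call itself.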
---
Netflix built a centralized ML serving platform that handles over 1M requests/sec and thousands of models, covering preprocessing, feature computation, post-processing, and optional learned components. Some impressive architectural decisions: they built a custom routing system called Switchboard that helped them improve ML velocity by routing requests using metadata in the request body; however, as you can imagine, this quickly became a critical-path dependency, added latency costs (e.g. request parsing), and made tenant/request-origin isolation harder. To address this, Netflix is introducing their new "Lightbulb" design, which instead promotes minimal request context into routing metadata: Envoy proxies perform the actual routing from headers, while model-specific parameters stay in the request body. For teams building ML platforms, the takeaway is that serving abstractions should decouple product clients from model/version/shard churn, but the routing layer itself must eventually become lightweight, cacheable, failure-tolerant, and close to the networking substrate rather than a monolithic proxy in every request path.
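The header-versus-body distinction is the crux. A toy sketch of the pattern (hypothetical header names and backends, not Netflix's actual Lightbulb implementation): the router picks a backend from a couple of cheap header lookups and never parses the request body.

```python
# Static routing table, standing in for whatever dynamic registry a real
# proxy layer would consult. Keys are (model, version) pairs.
ROUTING_TABLE = {
    ("recs-ranker", "v42"): "http://recs-ranker-v42.internal:8080",
    ("recs-ranker", "v43"): "http://recs-ranker-v43.internal:8080",
}

def route(headers: dict) -> str:
    """Pick a backend from headers alone; the body stays opaque to the router."""
    key = (headers.get("x-model-name"), headers.get("x-model-version"))
    try:
        return ROUTING_TABLE[key]
    except KeyError:
        raise LookupError(f"no backend registered for {key}")

backend = route({"x-model-name": "recs-ranker", "x-model-version": "v43"})
```

Because the routing decision is a pure function of a few headers, it can be pushed down into a generic proxy like Envoy, cached, and kept off the critical path of body deserialization.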
---
PyTorch Lightning Supply Chain Attack

PyTorch Lightning versions 2.6.2 and 2.6.3 were compromised for a 42-minute window on April 30, after attackers obtained PyPI publishing credentials and uploaded tampered builds. This is an important reminder that security is absolutely key, especially in our ML stack; one compromised package release can turn everyday training jobs, notebooks, and CI pipelines into credential-exfiltration paths. The malicious packages executed on import, spawned a background thread, installed Bun, ran an obfuscated JavaScript payload, and targeted cloud credentials, browser-stored secrets, env files, and GitHub tokens. For ML teams, the operational lesson is critical: if either version was installed and imported on developer machines, notebooks, training jobs, or CI/CD runners, treat those environments as compromised, downgrade to lightning==2.6.1, rotate exposed secrets, audit outbound network activity, and review build logs/artifacts. More broadly, this incident shows why MLOps needs MLSecOps controls at the packaging boundary, including pinned and verified dependencies, isolated CI secrets, least-privilege cloud credentials, egress monitoring, artifact provenance, and fast incident playbooks for trusted ML libraries.
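A quick triage sketch for the first step — checking whether an environment has one of the tampered releases installed. This only flags the version; an affected environment still needs secret rotation and an egress audit regardless:

```python
from importlib import metadata

# The two releases published during the compromised window.
COMPROMISED = {"2.6.2", "2.6.3"}

def is_compromised(version: str) -> bool:
    """True if this lightning release falls inside the tampered window."""
    return version in COMPROMISED

def check_environment() -> bool:
    """Check the currently installed lightning distribution, if any."""
    try:
        return is_compromised(metadata.version("lightning"))
    except metadata.PackageNotFoundError:
        return False  # lightning is not installed in this environment
```

Running a check like this across developer machines and CI runners is the easy part; the pinned-and-hashed dependency controls mentioned above are what prevent the next window from landing at all.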
---
It is quite interesting that MLflow has recently been leading the charge on ML telemetry and traces, setting the standard for how multimodal tracing can work at scale for images, audio, PDFs, and other files. OpenTelemetry-based tracing is growing beyond purely JSON payloads: binary artifacts are extracted from spans, stored in an existing artifact store, and replaced in the trace database with lightweight references, so queries stay fast and storage does not explode. The practical win for production ML practitioners is much better debugging of vision, audio, document, and image-generation workflows; namely an integrated experience where images render inline, audio can be played, PDFs can be viewed, and custom files can be attached manually via MLflow’s Attachment API. It's really reassuring to see how observability tools for MLOps are maturing at a fast pace, especially as ML systems become a critical foundation for society and industry.
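The offloading pattern itself is simple to sketch. This is an illustrative toy (hypothetical helpers, not MLflow's actual internals): binary payloads are moved out of the span into a content-addressed artifact store, and the span keeps only a small reference.

```python
import hashlib

ARTIFACT_STORE = {}  # stand-in for real blob storage (S3, GCS, ...)

def offload_binary(span: dict, field: str) -> dict:
    """Replace a bytes field on a span with a lightweight artifact reference."""
    payload = span[field]
    if isinstance(payload, bytes):
        key = hashlib.sha256(payload).hexdigest()
        ARTIFACT_STORE[key] = payload  # upload once, content-addressed
        span[field] = {"$artifact_ref": key, "bytes": len(payload)}
    return span

span = {"name": "ocr_step", "input_pdf": b"%PDF-1.7 ..."}
offload_binary(span, "input_pdf")
# The trace database now stores only the small reference; the PDF bytes
# live in the artifact store and can be fetched lazily for rendering.
```

Keeping spans reference-only is what lets trace queries stay fast while the UI still renders the original image, audio, or PDF on demand.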
---
Upcoming MLOps Events

The MLOps ecosystem continues to grow at breakneck speed, making it ever harder for us as practitioners to stay up to date with relevant developments. A fantastic way to keep on top of relevant resources is through the great community and events that the MLOps and Production ML ecosystem offers. This is why we have started curating a list of upcoming events in the space, outlined below.
Events we are speaking at this year:
Other relevant events:
In case you missed our talks, check our recordings below:
---
Check out the fast-growing ecosystem of production ML tools & frameworks at the GitHub repository, which has reached over 20,000 ⭐ GitHub stars. We are currently looking for more libraries to add - if you know of any that are not listed, please let us know or feel free to open a PR. Here are a few featured open source libraries that we maintain:
- KAOS - K8s Agent Orchestration Service for managing the KAOS in large-scale distributed agentic systems.
- Kompute - Blazing fast, lightweight and mobile-enabled GPU compute framework optimized for advanced data processing use cases.
- Production ML Tools - A curated list of tools to deploy, monitor and optimize machine learning systems at scale.
- AI Policy List - A mature list that maps the ecosystem of artificial intelligence guidelines, principles, codes of ethics, standards, regulation and beyond.
- Agentic Systems Tools - A new list that aims to map the emerging ecosystem of agentic systems with tools and frameworks for scaling this domain.
Please do support some of our open source projects by sharing, contributing or adding a star ⭐
---
The Institute for Ethical AI & Machine Learning is a European research centre that carries out world-class research into responsible machine learning.
---
You received this email because you are registered with The Institute for Ethical AI & Machine Learning's newsletter "The Machine Learning Engineer".
© 2023 The Institute for Ethical AI & Machine Learning