Subscribe to the Machine Learning Engineer Newsletter

Receive curated articles, tutorials and blog posts from experienced Machine Learning professionals.

Thanks to everyone for your support on our announcement last week releasing the K8s Agent Orchestration System (KAOS)!

If you want to help us keep the momentum going, please do reshare, open an issue, and/or give the repo a star ⭐

github.com/axsaucedo/kaos 🔥

Issue #371 🤖

Thank you for being part of the 70,000+ ML professionals and enthusiasts who receive weekly articles & tutorials on Machine Learning & MLOps 🤖 You can join the newsletter at https://bit.ly/state-of-ml-2025 ⭐

If you like the content, please support the newsletter by sharing it with your friends via ✉️ Email, 🐦 Twitter, 💼 LinkedIn and 📕 Facebook!

This week in ML Engineering:

Releasing the KAOS Framework
OpenAI Scaling Postgres to 800m Users
FastMCP 3.0 Released
Anthropic's Original Take-home Test
7 Deadly Sins of Eng Productivity
Open Source ML Frameworks
Awesome AI Guidelines to check out this week
+ more 🚀

Releasing the KAOS Framework

Thank you all for your support during last week's release announcement of the K8s Agent Orchestration System (KAOS), which helps manage distributed agentic systems at scale 🚀 The KAOS Framework addresses some of the pains of taking multi-agent / multi-tool / multi-model systems to hundreds or thousands of services! It started as an experiment to build agentic copilots, and has progressed into a fun endeavour building distributed systems for A2A, MCP servers, and model inference! The initial release comes with a few key features, including: 1) a Golang control plane to manage agentic CRDs; 2) a Python data plane that implements A2A, memory, and tool / model management; 3) a React UI for CRUD and debugging; and 4) a robust CI/CD setup with KIND, pytest, Ginkgo, etc. I have to say I am impressed by the level of abstraction that it is possible to reach with agentic copilots when covering frameworks and domains I have experience in; a blog post will follow on this topic specifically! In the meantime, do check out the repo, docs and examples to try it out. If you have any feedback or run into problems, please do submit an issue!

OpenAI Scaling Postgres to 800m Users

OpenAI shared this week how they scaled their database to 800 million users, and there are some interesting surprises: OpenAI has had to deal with explosive growth over the last few years, and it is interesting to see how they have been able to scale to millions of queries per second, albeit at a clear tech debt cost. Surprisingly, they are running a single primary PostgreSQL server for writes, with a set of 50 geo-distributed read replicas; the main blocker to scaling writes seems to be the migration cost, as hundreds of internal applications would require migrations. On the optimization side, it seems they have implemented the usual suspects: offloading reads to replicas, migrating shardable write-heavy workloads to sharded stores like Azure Cosmos DB, and reducing unnecessary writes. Write-ups like this are great, as they provide an uncommon glimpse into hypergrowth scaleups and the tech debt costs that come from having to scale at such a massive speed (and the tradeoffs required to maintain that speed).
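
To make the "single primary, many read replicas" idea concrete, here is a minimal sketch of read/write routing. This is not how OpenAI implements it; the ReplicaRouter class, the server names and the list of write keywords are all hypothetical, and a real setup would also need to deal with replication lag, transactions and failover.

```python
import itertools


class ReplicaRouter:
    """Toy router: one primary takes writes, replicas share reads round-robin.

    Hypothetical illustration only; not taken from the OpenAI write-up.
    """

    # Statement keywords treated as writes in this sketch (not exhaustive).
    WRITE_KEYWORDS = ("insert", "update", "delete", "create", "alter", "drop")

    def __init__(self, primary, replicas):
        if not replicas:
            raise ValueError("at least one replica is required")
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql):
        """Return the name of the server that should run this statement."""
        parts = sql.split(None, 1)
        first_word = parts[0].lower() if parts else ""
        if first_word in self.WRITE_KEYWORDS:
            return self.primary
        # Everything else is treated as a read and spread across replicas.
        return next(self._replicas)


router = ReplicaRouter("primary", ["replica-eu", "replica-us"])
print(router.route("SELECT * FROM users WHERE id = 1"))  # replica-eu
print(router.route("UPDATE users SET name = 'a'"))       # primary
print(router.route("SELECT 1"))                          # replica-us
```

The point of the sketch is that every write still lands on one server, which is why reducing unnecessary writes and moving shardable write-heavy workloads elsewhere matter so much in this architecture.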

FastMCP 3.0 Released

The agentic stack continues to evolve at lightning speed, this past week with the release of FastMCP 3.0: It is quite interesting to see the fast iterations from these projects in near-real time, in this case taking various features that have evolved organically and integrating them into cohesive / standardised components. The main changes seem to be the three composable primitives of "components" (tools/resources/prompts), "providers" (OpenAPI, remote MCP servers, other FastMCP servers), and "transforms" (pipeline middleware that renames, namespaces, filters by tag/version, or reshapes schemas). Quite interesting that sometimes it seems we are going full circle with concepts like ETL or learnings from neighboring areas like MLOps. It is however really great to see improvements on production features like native OpenTelemetry tracing, background tasks for long-running work, tool timeouts, pagination, and connection pings. Looking forward to seeing how this project evolves as the MCP ecosystem matures, as we are also yet to see the first iterations since the protocol joined the Linux Foundation.
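
To illustrate the "transforms" idea of pipeline middleware that filters and namespaces components, here is a small plain-Python sketch. It deliberately does not use the FastMCP library or its API; the Tool dataclass and the namespace, filter_by_tag and pipeline helpers are hypothetical names made up for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Tool:
    """Hypothetical stand-in for a component exposed by a provider."""
    name: str
    tags: frozenset = frozenset()


def namespace(prefix):
    """Transform that prefixes every tool name, e.g. 'search' -> 'docs_search'."""
    def apply(tools):
        return [replace(t, name=f"{prefix}_{t.name}") for t in tools]
    return apply


def filter_by_tag(tag):
    """Transform that keeps only the tools carrying the given tag."""
    def apply(tools):
        return [t for t in tools if tag in t.tags]
    return apply


def pipeline(*transforms):
    """Compose transforms so they run left to right over a list of tools."""
    def apply(tools):
        for transform in transforms:
            tools = transform(tools)
        return tools
    return apply


provider_tools = [
    Tool("search", frozenset({"public"})),
    Tool("delete_all", frozenset({"admin"})),
]
expose = pipeline(filter_by_tag("public"), namespace("docs"))
print([t.name for t in expose(provider_tools)])  # ['docs_search']
```

Each transform is just a function from a list of tools to a list of tools, which is what makes them composable in the ETL-like way mentioned above.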

Anthropic's Original Take-home Test

Anthropic has publicly released their recruitment take-home, and it has a lot of quite interesting learnings for both individual contributors and hiring managers: What I liked the most when I came across this was that (unsurprisingly) Anthropic set up their take-home test with the explicit assumption that candidates would use agentic coding to try to solve it. The test seems to consist of a Python script that simulates a custom computer, and your job is to rewrite the kernel so it computes the exact same outputs while using far fewer "simulated cycles", which forces you to really think about performance. This has some great foundations, as it likely touches on optimizing memory access, reusing intermediate results, batching work, etc. It is also great that the score gives clear and visible feedback to the candidate, and it is quite funny to see the disclaimer that many submissions have basically "cheated" by just changing the tests. Technical interviews have been changing drastically, and we will need to find ways to adapt to this strange new world, so seeing these types of examples provides quite a lot of useful ideas.
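
As a rough feel for what "same outputs, fewer simulated cycles" means, here is a toy sketch. It is not Anthropic's simulator or test; the Machine class, its cycle costs and both kernels are invented for illustration, and the real take-home is far more involved.

```python
class Machine:
    """Toy machine that charges cycles per operation (costs are made up)."""

    LOAD_COST = 10  # memory loads are expensive in this sketch
    ALU_COST = 1    # arithmetic is cheap

    def __init__(self, memory):
        self.memory = list(memory)
        self.cycles = 0

    def load(self, addr):
        self.cycles += self.LOAD_COST
        return self.memory[addr]

    def alu(self, op, a, b):
        self.cycles += self.ALU_COST
        return a * b if op == "mul" else a + b


def naive_kernel(m, n):
    """Computes x*x + x per element, reloading the same address three times."""
    return [
        m.alu("add", m.alu("mul", m.load(i), m.load(i)), m.load(i))
        for i in range(n)
    ]


def optimized_kernel(m, n):
    """Same result, but each value is loaded once and then reused."""
    out = []
    for i in range(n):
        x = m.load(i)
        out.append(m.alu("add", m.alu("mul", x, x), x))
    return out


data = [1, 2, 3, 4]
slow, fast = Machine(data), Machine(data)
assert naive_kernel(slow, len(data)) == optimized_kernel(fast, len(data))
print(slow.cycles, fast.cycles)  # 128 48
```

Note that the check compares outputs between the two kernels rather than editing the expected values, which is exactly the line the "cheating" submissions crossed.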

7 Deadly Sins of Eng Productivity

Here are the 7 deadly sins of engineering productivity: 1) Context switching; 2) Task hopping; 3) The urgency illusion; 4) Parkinson's Law (work expands to fill the time available for its completion); 5) The Zeigarnik effect (aka the mental RAM leak); 6) Decision fatigue; and 7) Brooks's Law (adding people to a late project makes it later). This is a pretty good list! I am actually interested to see how 1 and 2 will evolve with agentic coding; however, I do believe these will still need to be key foundations.

Upcoming MLOps Events

The MLOps ecosystem continues to grow at breakneck speed, making it ever harder for us as practitioners to stay up to date with relevant developments. A fantastic way to keep on top of relevant resources is through the great community and events that the MLOps and Production ML ecosystem offers. This is the reason why we have started curating a list of upcoming events in the space, which are outlined below.

Events we are speaking at this year:

eTail Europe - March @ Berlin
World Summit AI Europe - September @ Amsterdam

Other relevant events:

KubeCon Europe - March @ Amsterdam
PyData Berlin - April @ Frankfurt
Databricks Summit - June @ San Francisco
World Developer Congress - July @ Berlin
EuroPython 2026 - July @ Prague
EuroSciPy 2026 - July @ Krakow
Code.Talks 2026 - Nov @ Hamburg
MLOps World 2026 - Nov @ Austin

In case you missed our talks, check our recordings below:

The State of AI in 2025 - WeAreDevelopers 2025
Prod Generative AI in 2024 - KubeCon AI Day 2025
The State of AI in 2024 - WeAreDevelopers 2024
Responsible AI Workshop Keynote - NeurIPS 2021
Practical Guide to ML Explainability - PyCon London
ML Monitoring: Outliers, Drift, XAI - PyCon Keynote
Metadata for E2E MLOps - KubeCon NA 2022
ML Performance Evaluation at Scale - KubeCon Eur 2021
Industry Strength LLMs - PyData Global 2022
ML Security Workshop Keynote - NeurIPS 2022

Open Source MLOps Tools

Check out the fast-growing ecosystem of production ML tools & frameworks at the GitHub repository, which has reached over 20,000 ⭐ GitHub stars. We are currently looking for more libraries to add - if you know of any that are not listed, please let us know or feel free to open a PR. Here are a few featured open source libraries that we maintain:

KAOS - K8s Agent Orchestration System for managing the KAOS in large-scale distributed agentic systems.
Kompute - Blazing fast, lightweight and mobile phone-enabled GPU compute framework optimized for advanced data processing use cases.
Production ML Tools - A curated list of tools to deploy, monitor and optimize machine learning systems at scale.
AI Policy List - A mature list that maps the ecosystem of artificial intelligence guidelines, principles, codes of ethics, standards, regulation and beyond.
Agentic Systems Tools - A new list that aims to map the emerging ecosystem of agentic systems with tools and frameworks for scaling this domain.

Please do support some of our open source projects by sharing, contributing or adding a star ⭐

About us

The Institute for Ethical AI & Machine Learning is a European research centre that carries out world-class research into responsible machine learning.

Check out our website

✉️ Email, 🐦 Twitter, 💼 Linkedin
