Engineering Like It's 2007

Let’s take a trip back to 2007! This engineering talk from YouTube is a master class in scaling and learning, and its lessons are as valuable today as they were back then. It shows a lean and mean team iterating through bottlenecks across web serving, video delivery, thumbnails, databases, hardware, OS tuning, caching and sharding. The most useful lesson for ML engineering teams is that scale is rarely solved by one abstraction: YouTube kept the system simple enough to rewrite under pressure, used caching at multiple layers, and reasoned from first principles at every layer. They followed what is now the standard scalability playbook: moving hot traffic through CDNs, tuning commodity hardware (maybe less common now), and eventually replacing replication tricks with database partitioning. For ML practitioners, the parallel is model serving, feature stores, vector databases, observability pipelines and agent systems, which are now presenting analogous challenges that we have to solve at lightning speed.
Netflix is sharing their playbook for massive scale LLM fine-tuning infrastructure: Netflix moved away from a few fine-tuning scripts to a managed framework that supports SFT, DPO, RL, distillation, checkpointing, MFU tracking, Hugging Face-compatible model/tokenizer flows, and distributed orchestration. It is interesting that Netflix has also chosen the usual suspects such as Ray, PyTorch and vLLM, together with custom tooling (e.g. Netflix’s internal Mako platform). The key takeaway for production ML practitioners is that post-training is challenging across every layer, including Data, Model, Compute and Workflow. Netflix reports up to 4.7x effective token throughput from asynchronous on-the-fly sequence packing, which is a great example of the direction GenAI infrastructure is taking.
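To make the sequence packing idea concrete, here is a minimal sketch of why packing helps: short sequences are grouped into fixed-length rows so fewer tokens are wasted on padding. This is an assumption-laden illustration and not Netflix's implementation; it uses a simple offline greedy first-fit-decreasing heuristic rather than asynchronous on-the-fly packing, and the names `pack_sequences` and `token_utilisation` are hypothetical.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sequence indices into rows holding at most max_len tokens each.

    Greedy first-fit-decreasing: longest sequences are placed first, each into
    the first row that still has room. Illustrative only, not Netflix's method.
    """
    rows: List[List[int]] = []  # sequence indices assigned to each packed row
    free: List[int] = []        # remaining token capacity of each packed row
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sequence {i} has {n} tokens, more than max_len={max_len}")
        for r, room in enumerate(free):
            if n <= room:
                rows[r].append(i)
                free[r] -= n
                break
        else:
            # No existing row has room, so open a new one.
            rows.append([i])
            free.append(max_len - n)
    return rows


def token_utilisation(lengths: List[int], num_rows: int, max_len: int) -> float:
    """Fraction of the padded batch that holds real (non-padding) tokens."""
    return sum(lengths) / (num_rows * max_len)
```

For six sequences of lengths 700, 300, 512, 512, 100 and 900 with `max_len=1024`, this packs into 3 rows instead of 6, so the share of real tokens per batch roughly doubles. A real trainer would also need attention masks that stop packed sequences from attending to each other.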
OpenAI & Anthropic have found Market Fit - this is an interesting opinion piece from Simon Willison: It makes a good case that coding agents may be the first real product-market fit moment for frontier AI labs, because enterprises are now being charged close to raw API-token economics for daily developer workflows. OpenAI Codex and Anthropic Claude Code have shifted from heavily subsidised subscriptions to usage-based pricing (unfortunately for us, the subsidies are indeed ending). Organisations will need to establish much tighter cost observability, usage governance, ROI measurement, procurement discipline, and platform controls around coding agents, just as they already do for serving and inference workloads. This helps explain why tools like LiteLLM or OpenRouter are becoming so popular, even if at the beginning it wasn't clear why an extra abstraction layer was needed on top of a simple vendor API (e.g. surprisingly enough, many vendors still do not offer spend caps).
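As a rough illustration of the kind of spend cap a gateway layer can enforce, here is a minimal sketch of a per-team budget guard. Everything in it is hypothetical: the `SpendCap` class, the per-1k-token prices and the team names are assumptions for illustration, and it is not the API of LiteLLM, OpenRouter or any vendor.

```python
from dataclasses import dataclass, field
from typing import Dict


class BudgetExceeded(RuntimeError):
    """Raised when a request would push a team over its spend cap."""


@dataclass
class SpendCap:
    limit_usd: float            # hypothetical cap applied to each team
    usd_per_1k_input: float     # hypothetical input-token price
    usd_per_1k_output: float    # hypothetical output-token price
    spent_usd: Dict[str, float] = field(default_factory=dict)

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Cost in USD of one request at the configured token prices."""
        return (
            input_tokens * self.usd_per_1k_input
            + output_tokens * self.usd_per_1k_output
        ) / 1000

    def record(self, team: str, input_tokens: int, output_tokens: int) -> float:
        """Record usage for a team and return its remaining budget in USD.

        Raises BudgetExceeded, without recording anything, if the request
        would take the team over the cap.
        """
        new_total = self.spent_usd.get(team, 0.0) + self.cost(input_tokens, output_tokens)
        if new_total > self.limit_usd:
            raise BudgetExceeded(f"{team} would exceed its cap of ${self.limit_usd:.2f}")
        self.spent_usd[team] = new_total
        return self.limit_usd - new_total
```

In practice the check would sit in front of every model call, with token counts taken from the provider's usage metadata, so that a runaway agent loop fails fast instead of showing up on the invoice.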
Reviving PapersWithCode.co

I remember that when Papers With Code originally came out in 2018 it was a major breakthrough; after the Meta acquisition the project slowed and then stopped, but it seems there is an attempt to revive it! Hugging Face has started reviving it as paperswithcode.co, with support from some AI agents to help with parsing papers, auto-linking GitHub repos, project pages and artifacts, categorizing, and generating leaderboards. The new site already brings back the familiar discovery workflow around trending papers, SOTA browsing, methods and domains, while adding support for star-velocity trends, citation counts, external non-arXiv papers, multiple repos per paper, benchmark harness reports, and Hugging Face-native storage/login integration. As research moves faster (whether AI slop or otherwise), a well maintained discovery layer is still a basic need, so it's great to see projects like this. Hopefully it will continue growing.
An interesting release of a new massive open text-to-image dataset: The MONET dataset. It is great to see this, as image generation depends not only on quality data but also on other key resources like benchmarks, which can only help teams accelerate in this field. Better curation, filtering, captions, and provenance can make quite a difference. This dataset consists of 104.9M curated image–text pairs distilled from 2.9B raw pairs, with safety filtering, domain filtering, exact/near duplicate removal, multi-VLM re-captioning, embeddings, object/face annotations, hashes, NSFW/watermark scores, and pre-encoded SANA-VAE latents for faster latent-diffusion training.
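For readers less familiar with the exact/near duplicate removal step, here is a minimal sketch of the general idea. It is not the MONET pipeline: exact duplicates are dropped by SHA-256 of the raw bytes, and near duplicates by Hamming distance between perceptual hashes that are assumed to be precomputed as integers. The function names and the `max_distance` threshold are hypothetical.

```python
import hashlib
from typing import Iterable, List, Tuple


def drop_exact_duplicates(items: Iterable[Tuple[bytes, str]]) -> List[Tuple[bytes, str]]:
    """Keep the first (image_bytes, caption) pair for each distinct image."""
    seen = set()
    kept = []
    for image_bytes, caption in items:
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((image_bytes, caption))
    return kept


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two integer hashes differ."""
    return bin(a ^ b).count("1")


def drop_near_duplicates(hashes: List[int], max_distance: int) -> List[int]:
    """Return indices of hashes to keep.

    An item is dropped if its hash is within max_distance bits of an item
    already kept. This is O(n^2) and only suitable for small samples; at the
    scale of billions of pairs an approximate index would be needed.
    """
    kept: List[int] = []
    for i, h in enumerate(hashes):
        if all(hamming(h, hashes[j]) > max_distance for j in kept):
            kept.append(i)
    return kept
```

The two passes are complementary: byte-level hashing is cheap and catches re-uploads of the same file, while a perceptual hash distance catches resized or re-encoded copies that byte-level hashing misses.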
Upcoming MLOps Events

The MLOps ecosystem continues to grow at breakneck speed, making it ever harder for us as practitioners to stay up to date with relevant developments. A fantastic way to keep on top of relevant resources is through the great community and events that the MLOps and Production ML ecosystem offers. This is why we have started curating a list of upcoming events in the space, which are outlined below.
Events we are speaking at this year:
Other relevant events:
In case you missed our talks, check our recordings below:
Check out the fast-growing ecosystem of production ML tools & frameworks at the GitHub repository, which has reached over 20,000 ⭐ GitHub stars. We are currently looking for more libraries to add - if you know of any that are not listed, please let us know or feel free to open a PR. Here are a few featured open source libraries that we maintain:
- SARC - Provides wrappers for popular agentic frameworks to enable guardrails and constraints that are enforced through the flow.
- KAOS - K8s Agent Orchestration Service for managing the KAOS in large-scale distributed agentic systems.
- Kompute - Blazing fast, lightweight and mobile phone-enabled GPU compute framework optimized for advanced data processing use cases.
- Production ML Tools - A curated list of tools to deploy, monitor and optimize machine learning systems at scale.
- AI Policy List - A mature list that maps the ecosystem of artificial intelligence guidelines, principles, codes of ethics, standards, regulation and beyond.
- Agentic Systems Tools - A new list that aims to map the emerging ecosystem of agentic systems with tools and frameworks for scaling this domain.

Please do support some of our open source projects by sharing, contributing or adding a star ⭐
The Institute for Ethical AI & Machine Learning is a European research centre that carries out world-class research into responsible machine learning.
You received this email because you are registered with The Institute for Ethical AI & Machine Learning's newsletter "The Machine Learning Engineer".
© 2023 The Institute for Ethical AI & Machine Learning