Subscribe to the Machine Learning Engineer Newsletter

Receive curated articles, tutorials and blog posts from experienced Machine Learning professionals.

Issue #261 🤖 
If you like the content, please support the newsletter by sharing it with your friends via ✉️ Email, 🐦 Twitter, 💼 LinkedIn and 📕 Facebook! If you've come across this newsletter, you can join for free at
This week in Machine Learning:
Thank you for being one of over 50,000 ML professionals and enthusiasts who receive weekly articles & tutorials on Machine Learning & MLOps 🤖 You can join the newsletter for free at
If you are a Machine Learning practitioner looking for an interesting opportunity, I'm currently hiring for several roles, including Applied Science Manager, Applied Scientist, Analytics Team Lead, and Customer Analyst - do check them out, and feel free to share broadly!
A recent NeurIPS competition challenged Meta, Google, Microsoft & other tech companies on vector search & ANN algorithms over massive billion-point datasets: The "Billion-Scale Approximate Nearest Neighbor Search" challenge, sponsored at NeurIPS 2021, was the first competition to benchmark Approximate Nearest Neighbor (ANN) search algorithms at billion-scale, assessing their performance, accuracy, and hardware cost across six billion-point datasets. The results highlight the need for efficient ANN solutions in large-scale data environments, which is particularly relevant for production machine learning practitioners dealing with high-dimensional nearest neighbor search.
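To give a flavour of the approximate-search idea behind these systems, here is a minimal, purely illustrative sketch of random-hyperplane locality-sensitive hashing in plain Python. All names are hypothetical; real billion-scale systems (e.g. the FAISS- and DiskANN-style indexes from the competition) use far more sophisticated quantization and graph structures.

```python
import random

random.seed(0)
DIM, NUM_PLANES = 8, 6

# Random hyperplanes: each vector is hashed to one bit per plane,
# recording which side of that hyperplane it falls on.
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PLANES)]

def lsh_key(vec):
    """Hash a vector to a short binary signature (its bucket id)."""
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0)
                 for plane in planes)

# Build the index: bucket every vector by its signature.
data = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(1000)]
index = {}
for i, vec in enumerate(data):
    index.setdefault(lsh_key(vec), []).append(i)

def query(vec, k=3):
    """Search only the query's bucket, then rank exactly within it."""
    candidates = index.get(lsh_key(vec), [])
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return sorted(candidates, key=lambda i: dist(data[i], vec))[:k]

# Querying with a stored vector returns that exact vector as the top hit,
# while inspecting only one bucket instead of all 1000 points.
print(query(data[0], k=3))
```

The trade-off this sketches is exactly what the competition measures at scale: recall (nearby points hashed to other buckets are missed) against the cost of scanning far fewer candidates.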
The Mozilla Innovation Project introduces local AI with their new private AI-model product "MemoryCache": An interesting experimental initiative by Mozilla aimed at enhancing personal AI models by integrating locally-saved browser data, focusing on privacy and individualized experiences. It includes a Firefox extension for saving web pages, a script for processing saved data with privateGPT, and an optional PDF saving feature for readability. Currently in a sandbox stage, it is being tested on specific hardware and software configurations, which has highlighted challenges in balancing personalization against the risk of over-generalization in AI responses. This project offers valuable insights for machine learning practitioners interested in personalized & private AI development.
A new (freely available!) book on Deep Learning foundations authored by experts and endorsed by renowned figures like Hinton, LeCun, Bengio and others: The new book "Deep Learning: Foundations and Concepts" is a comprehensive resource tailored for both newcomers and seasoned machine learning practitioners. It delves into the core concepts of deep learning, offering a structured approach suitable for academic courses and self-study. The book emphasizes practical applications over abstract theory, providing a blend of textual explanations, diagrams, and mathematical formulations. An online copy is also freely available if you want to read it at no cost.
Google's recent initiative to leverage machine learning to optimize machine learning: The team at Google highlights advancements in using machine learning to optimize ML compilers, which is key to improving the efficiency of ML models on hardware. The introduction of the "TpuGraphs" dataset marks a significant step in this direction, offering a large-scale resource for developing learned cost models, particularly for Google's TPUs. The post also discusses innovative techniques like Graph Segment Training for managing large graphs, and insights from a Kaggle competition that revealed novel approaches like graph pruning and cross-configuration attention. These developments are particularly relevant for ML practitioners focused on optimizing model performance and efficiency.
As we increasingly interact with billion-scale datasets, we need new paradigms to extract insights at scale - this is where probabilistic data structures come in, and this is one of the best articles out there on the topic: The article builds strong intuition for efficient methods of analyzing large-scale datasets using probabilistic data structures, providing a deep dive into each of the key structures with practical case studies: Linear Counters, LogLog Counters, Count-Min Sketches, Count-Mean-Min Sketches, Stream-Summary, and Bloom Filters. It discusses their applications in various scenarios, such as tracking unique website visitors or monitoring IP traffic, and highlights their adaptability for complex queries. This approach is particularly relevant for machine learning practitioners dealing with big data, as it provides a means to optimize system performance in data-intensive tasks.
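As a taste of how compact these structures are, here is a minimal Count-Min Sketch sketched in plain Python. The hashing scheme (salted built-in `hash`) is a simplification for illustration; production implementations use pairwise-independent hash families and tuned width/depth parameters.

```python
import random

random.seed(42)

class CountMinSketch:
    """Approximate frequency counts in sub-linear space.

    Counts are never under-estimated: hash collisions can only
    inflate them, and a wider/deeper table reduces the error.
    """
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        # One salt per row so each row hashes items differently.
        self.salts = [random.getrandbits(32) for _ in range(depth)]

    def _cells(self, item):
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, item)) % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Minimum across rows: the least collision-inflated counter.
        return min(self.table[row][col] for row, col in self._cells(item))

# E.g. counting requests per IP without storing every address.
sketch = CountMinSketch()
for _ in range(100):
    sketch.add("10.0.0.1")
sketch.add("10.0.0.2")
print(sketch.estimate("10.0.0.1"))  # at least 100; exact unless rows collide
```

Note the space trade-off: the sketch holds 4 × 256 counters regardless of how many distinct IPs flow through it, which is what makes this family of structures viable at billion-event scale.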
Upcoming MLOps Events
The MLOps ecosystem continues to grow at break-neck speed, making it ever harder for us as practitioners to stay up to date with relevant developments. A fantastic way to keep on top of relevant resources is through the great community and events that the MLOps and Production ML ecosystem offers. This is the reason why we have started curating a list of upcoming events in the space, which are outlined below.
Relevant upcoming MLOps conferences:
Open Source MLOps Tools
Check out the fast-growing ecosystem of production ML tools & frameworks at the GitHub repository, which has reached over 10,000 ⭐ GitHub stars. We are currently looking for more libraries to add - if you know of any that are not listed, please let us know or feel free to add a PR. Four featured libraries in the GPU acceleration space are outlined below.
  • Kompute - Blazing fast, lightweight and mobile-enabled GPU compute framework optimized for advanced data processing use cases.
  • CuPy - An implementation of NumPy-compatible multi-dimensional array on CUDA. CuPy consists of the core multi-dimensional array class, cupy.ndarray, and many functions on it.
  • Jax - Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
  • CuDF - Built based on the Apache Arrow columnar memory format, cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
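A nice property shared by CuPy (and to a large extent cuDF vs. pandas) is API compatibility with its CPU counterpart. As a rough sketch: assuming a CUDA machine with `cupy` installed, array code like the following can typically switch backends just by swapping the import, since `cupy.ndarray` mirrors the `numpy.ndarray` API.

```python
import numpy as np  # on a CUDA machine, swap for: import cupy as np

def normalize_rows(x):
    """Scale each row of a 2-D array to unit L2 norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / norms

# The same function body runs on CPU (NumPy) or GPU (CuPy).
x = np.arange(12, dtype=np.float64).reshape(3, 4)
unit = normalize_rows(x)
print(np.allclose(np.linalg.norm(unit, axis=1), 1.0))  # True
```

Not every NumPy function is covered by CuPy, and device-to-host transfers have real costs, so treat the drop-in swap as a starting point rather than a guarantee.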
If you know of any open source and open community events that are not listed do give us a heads up so we can add them!
As AI systems become more prevalent in society, we face bigger and tougher societal challenges. We have seen a large number of resources that aim to tackle these challenges in the form of AI guidelines, principles, ethics frameworks, etc., but there are now so many that they are hard to navigate. Because of this, we started an open source initiative that maps the ecosystem to make it simpler to explore. You can find multiple principles in the repo - some examples include the following:
  • MLSecOps Top 10 Vulnerabilities - This is an initiative that aims to further the field of machine learning security by identifying the top 10 most common vulnerabilities in the machine learning lifecycle, as well as best practices.
  • AI & Machine Learning 8 principles for Responsible ML - The Institute for Ethical AI & Machine Learning has put together 8 principles for responsible machine learning that are to be adopted by individuals and delivery teams designing, building and operating machine learning systems.
  • An Evaluation of Guidelines - The Ethics of Ethics; A research paper that analyses multiple Ethics principles.
  • ACM's Code of Ethics and Professional Conduct - This is the code of ethics that was put together in 1992 by the Association for Computing Machinery and updated in 2018.
If you know of any guidelines that are not in the "Awesome AI Guidelines" list, please do give us a heads up or feel free to add a pull request!
© 2023 The Institute for Ethical AI & Machine Learning