Subscribe to the Machine Learning Engineer Newsletter

Receive curated articles, tutorials and blog posts from experienced Machine Learning professionals.

Issue #16
🥳🎉 We've reached 1000! 🎈🎆
To celebrate 1000 subscribers, we built a script that converts the newsletter into an AI-generated 🎵audio format🎶. To broaden options and avoid the stereotypical Siri-like AI voice, we've made it available in several different voice variations:
This week in Issue #16:
Real time machine learning with big data streams, test driven development in AI, a visual ML landscape, detecting outliers and anomalies, a checklist to debug neural nets, methods to evaluate model performance, open source libraries for distributed computation, upcoming AI conferences, new Machine Learning jobs and more 🚀.
Support the ML Engineer!
Forward the email, or share the online version on 🐦 Twitter, 💼 LinkedIn and 📕 Facebook!
If you would like to suggest articles, ideas, papers, libraries, jobs, events or provide feedback, just hit reply or send us an email! We have received a lot of great suggestions in the past — thank you very much for everyone's support!
Ververica posted a very insightful article on how ING built their cutting-edge infrastructure to perform fraud detection in real time using machine learning with Flink on top of Kafka streams. In this post they talk about their goals, which require support for a range of ML models, flexibility across environments, and multi-tenancy. The article goes into a good amount of detail on how they leveraged streaming with Kafka and stream processing with Flink to achieve their three goals. Streaming is certainly opening a lot of exciting opportunities in business intelligence with a lot of potential. One quirky example is the Financial Times' recent video where they show the music produced by yield curves.
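ING's actual pipeline (Flink jobs consuming Kafka topics) is well beyond a snippet, but the core idea of scoring events as they arrive can be sketched in plain Python. The sliding-window z-score below is a hypothetical stand-in for a real fraud model, not ING's method:

```python
from collections import deque
from statistics import mean, stdev

def score_stream(amounts, window=50, threshold=3.0):
    """Flag transaction amounts that deviate strongly from a sliding
    window of recent values — a toy stand-in for a streaming ML model."""
    recent = deque(maxlen=window)
    flags = []
    for amount in amounts:
        # Need at least two prior points with some spread to compute a z-score
        if len(recent) >= 2 and stdev(recent) > 0:
            z = (amount - mean(recent)) / stdev(recent)
            flags.append(abs(z) > threshold)
        else:
            flags.append(False)
        recent.append(amount)
    return flags
```

In a real deployment, each event would arrive from a Kafka topic and the windowed state would live inside a Flink operator; here the generator loop plays both roles.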
It is exciting to see best practices from software engineering make their way into data science and vice versa. This accessible article introduces the benefits of Unit Testing and Test Driven Development (AKA TDD) in machine learning. TDD is a proven approach to developing better and more robust systems: tests are written before development of the core system begins, and the system is built incrementally against them. This post proposes how testing approaches allow ML models to be more robust by reinforcing them against unstable data, underfitting and beyond. For the curious ones, here is an article that provides a deep dive on the fundamentals of TDD in software.
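The test-first loop applies directly to the small preprocessing functions that surround a model. A minimal sketch (the `normalize` helper is hypothetical, not from the article):

```python
def normalize(values):
    """Scale a list of numbers into the [0, 1] range.
    Written test-first: the asserts below were drafted before this body."""
    lo, hi = min(values), max(values)
    if hi == lo:                        # degenerate constant input — a case
        return [0.0 for _ in values]    # the tests forced us to handle
    return [(v - lo) / (hi - lo) for v in values]

# TDD-style tests, written before the implementation existed
assert normalize([0, 5, 10]) == [0.0, 0.5, 1.0]
assert normalize([3, 3, 3]) == [0.0, 0.0, 0.0]       # unstable data guard
assert all(0.0 <= v <= 1.0 for v in normalize([-2, 7, 4]))
```

The second assert is the kind of "unstable data" case the article argues testing surfaces early: a naive implementation divides by zero on constant input.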
Paco Nathan has put together a great blog post + article where he outlines 50 or so of the most popular Python libraries and frameworks used in Data Science. The tools are categorised and clustered based on their functionality. This list covers fundamentals like package management, application frameworks, data access, data representation and more.
Here are 5 ways to detect outliers and anomalies which senior AWS tech consultant Will Badr argues every data scientist should know. As he points out, outliers are data points that don't belong to a certain population — abnormal observations that we may want to detect. These could be things like CPU spikes in DevOps, fraudulent transactions in finance, etc. The 5 approaches the article covers are: 1) standard deviation, 2) boxplots, 3) DBSCAN clustering, 4) Isolation Forests, and 5) Robust Random Cut Forests.
CometML Product Lead Cecelia Shao has put together a great checklist for debugging neural networks. As she outlines, this is a set of tangible steps you can take to identify and fix issues with training, generalisation and optimisation for machine learning models. This is a great resource that boils debugging down into 5 steps: 1) start simple, 2) confirm your loss, 3) check intermediate outputs and connections, 4) diagnose parameters, and 5) track your work.
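Step 2, "confirm your loss", has a concrete sanity check worth illustrating: with random weights a classifier should predict roughly uniform probabilities, so its initial cross-entropy should sit near ln(num_classes). A sketch simulating an untrained softmax head (the simulation setup is ours, not from the checklist):

```python
import math
import random

num_classes = 10
expected = math.log(num_classes)   # ≈ 2.303 for a 10-class problem

# Simulate near-uniform softmax outputs from a freshly initialised net
random.seed(0)
losses = []
for _ in range(1000):
    logits = [random.gauss(0, 0.01) for _ in range(num_classes)]
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    true_label = random.randrange(num_classes)
    losses.append(-math.log(probs[true_label]))

initial_loss = sum(losses) / len(losses)
```

If the first logged loss of your own model is far from this value, suspect a wrong loss function, bad initialisation, or mismatched labels before touching anything else.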
Our friends from QuiltData have put together a great article on how to use Yellowbrick to evaluate Keras machine learning models. Yellowbrick is a Swiss Army knife for model evaluation, providing advanced visualisations to assess model performance. In this post they provide insights on how to evaluate instances of both classification and regression, and they even provide the code to wrap your Keras model so it can be used with this (and other) libraries. If you are curious for more ways you can evaluate models, you can check out our machine learning operations list, which contains a great overview of the current tools available in production machine learning.
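Under the hood, Yellowbrick's regression visualisers are built on top of familiar metrics like R². Computing it by hand (our own toy numbers, not from the article) makes clear what the plots summarise:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination — the headline score displayed on
    regression evaluation plots. 1.0 is a perfect fit; 0.0 means the
    model does no better than predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy targets and predictions from a hypothetical regressor
y_true = [3, 5, 7, 9]
y_pred = [2.8, 5.1, 7.2, 8.9]
```

A residuals plot then simply charts `t - p` for each pair against the prediction, which is why systematic curvature in that plot signals a misspecified model even when R² looks healthy.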
MLOps - Featured Open Source Libraries
We are excited to see the Awesome MLOps list reaching 400 stars now! Thanks to everyone for your support! This week's edition is focused on new libraries on computation distribution frameworks which fall on our Responsible ML Principle #4. The four featured libraries this week are:
  • Hadoop Open Platform-as-a-service (HOPS) - A multi-tenant open source framework with a RESTful API for data science on Hadoop. It supports Spark and TensorFlow/Keras, is Python-first, and provides many additional features
  • PyWren - Answers the question of the "cloud button" for Python function execution: a framework that abstracts AWS Lambda so data scientists can execute arbitrary Python functions at scale
  • Horovod - Uber's distributed training framework for TensorFlow, Keras, and PyTorch
  • Dask - Distributed parallel processing framework for Pandas and NumPy computations 
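All four libraries automate variants of the same map-reduce pattern: partition the data, compute partial results in parallel, and combine them. A stdlib-only sketch of that pattern (using threads as a stand-in for the processes or cluster workers Dask and PyWren manage; the `parallel_sum_squares` helper is our own illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_sum_squares(chunk):
    """The 'map' step: a partial result computed on one partition."""
    return sum(x * x for x in chunk)

def parallel_sum_squares(data, workers=4):
    """Partition the data, map over chunks in parallel, then reduce.
    Dask/PyWren apply this same shape across processes or machines."""
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(chunk_sum_squares, chunks))
```

Threads give no CPU speed-up in Python for this workload; the point is the structure — swapping the executor for a process pool, a Lambda invocation, or a Dask cluster changes where the chunks run, not the program's shape.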
If you know of any libraries that are not in the "Awesome MLOps" list, please do give us a heads up or feel free to add a pull request
We feature conferences that have core ML tracks (primarily in Europe for now) to help our community stay up to date with great events coming up.
Technical Conferences
  • DataFest19 [11/03/2019] - Two week festival of Data Innovation hosted across Scotland, UK.
  • AI Conference Beijing [18/06/2019] - O'Reilly's signature applied AI conference in Asia in Beijing, China.
  • Data Natives [21/11/2019] - Data conference in Berlin, Germany.
  • ODSC Europe [19/11/2019] - The Open Data Science Conference in London, UK.
Business Conferences
  • World Summit AI Americas [10/04/2019] - Large scale AI summit in Montreal, Canada.
    • Come join our panel on AI Ethics and Tools.
  • Big Data LDN 2019 [13/11/2019] - Conference for strategy and tech on big data in London, UK.
We showcase Machine Learning Engineering jobs (primarily in London for now) to help our community stay up to date with great opportunities that come up. It seems that the demand for data scientists continues to rise!
Junior Opportunities
Mid-level Opportunities
Leadership Opportunities
© 2018 The Institute for Ethical AI & Machine Learning