Subscribe to the Machine Learning Engineer Newsletter

Receive curated articles, tutorials and blog posts from experienced Machine Learning professionals.

Issue #61
This week in Issue #61:
Forward the email, or share the online version on 🐦 Twitter, 💼 LinkedIn and 📕 Facebook!
If you would like to suggest articles, ideas, papers, libraries, jobs or events, or provide feedback, just hit reply or send us an email! We have received a lot of great suggestions in the past, so thank you very much to everyone for your support!
In recent years, the field of natural language processing (NLP) has seen rapid growth in quality and usability, which has helped to drive business adoption of artificial intelligence solutions. Microsoft has put together a great resource with best practices for NLP in the form of Jupyter notebooks and utility functions.
The Data Exchange Podcast dives into conversation with Sijie Guo on how Apache Pulsar is able to handle both queuing and streaming, and both online and offline applications. In this episode they cover the role of messaging in modern data applications and platforms, queuing implementations, streaming applications, and a status update on Apache Pulsar.
Machine Learning Mastery sheds light on the topic of imbalanced classification in machine learning, specifically on why this challenge is so difficult to tackle. In this tutorial they cover the challenges of severely skewed class distributions, the costs of misclassification, the properties that can be imbalanced, and a framework for developing an intuition for how different dataset properties compound the modelling difficulty.
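To make the skewed-distribution problem concrete, here is a minimal sketch (not taken from the tutorial) in which a model that always predicts the majority class scores very high accuracy while never finding a single minority example. The dataset, the 1:99 class ratio and the helper names are all assumptions made for illustration.

```python
from collections import Counter


def majority_baseline(labels):
    """Predict the most common label for every example."""
    most_common_label, _ = Counter(labels).most_common(1)[0]
    return [most_common_label] * len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive examples that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical dataset: 990 negatives and 10 positives (a 1:99 skew).
y_true = [0] * 990 + [1] * 10
y_pred = majority_baseline(y_true)

print("accuracy:", accuracy(y_true, y_pred))
print("recall on minority class:", recall(y_true, y_pred))
```

Accuracy alone hides the failure here, which is one reason metrics such as recall, precision or cost-weighted scores tend to be preferred when the classes are this skewed.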
An interesting article that proposes using ML to clean data at scale (for training more ML). The article breaks down the challenge of data cleaning and covers a fascinating academic open source project called HoloClean, which aims to tackle it, together with a breakdown of the techniques and next steps.
Larger models are difficult to train because of cost, time, and the difficulty of code integration. Microsoft is releasing an open-source library called DeepSpeed, which aims to improve scale, speed, cost, and usability, unlocking the ability to train models at massive scale.
The topic for this week's featured production machine learning libraries is ETL and Batch Processing. We are currently looking for more libraries to add, so if you know of any that are not listed, please let us know or feel free to open a PR. The four featured libraries this week are:
  • Apache Airflow - Data Pipeline framework built in Python, including scheduler, DAG definition and a UI for visualisation
  • Argo Workflows - Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD (Custom Resource Definition).
  • Luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs, handling dependency resolution, workflow management, visualisation, etc
  • Genie - Job orchestration engine to interface and trigger the execution of jobs from Hadoop-based systems
If you know of any libraries that are not in the "Awesome MLOps" list, please do give us a heads up or feel free to open a pull request.
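The libraries above all share one core idea: a batch pipeline is a graph of tasks, and the engine runs each task only after its upstream dependencies have finished. The sketch below illustrates that idea with the Python standard library only (`graphlib`, Python 3.9+). It is not the API of Airflow, Argo Workflows, Luigi or Genie, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run each task after all of its upstream tasks.

    tasks:        maps a task name to a callable that takes a dict of upstream results
    dependencies: maps a task name to the set of task names it depends on
    Returns the execution order and the result of every task.
    """
    # static_order() yields every task after all of its dependencies.
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
    return order, results


# Hypothetical three-step ETL job: extract -> transform -> load.
tasks = {
    "extract": lambda upstream: [1, 2, 3],
    "transform": lambda upstream: [x * 10 for x in upstream["extract"]],
    "load": lambda upstream: sum(upstream["transform"]),
}
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

order, results = run_pipeline(tasks, dependencies)
print(order)
print(results["load"])
```

Real orchestrators add scheduling, retries, parallel execution and a UI on top of this ordering step, so treat the sketch as a mental model and not as a replacement for any of them.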
As AI systems become more prevalent in society, we face bigger and tougher societal challenges. We have seen a large number of resources that aim to tackle these challenges in the form of AI guidelines, principles, ethics frameworks, etc. However, there are so many resources that the space is hard to navigate. Because of this, we started an open source initiative that aims to map the ecosystem and make it simpler to navigate. Each week we showcase three resources from our list so you can check them out. This week's resources are:
If you know of any guidelines that are not in the "Awesome AI Guidelines" list, please do give us a heads up or feel free to open a pull request.
© 2018 The Institute for Ethical AI & Machine Learning