The Institute for Ethical AI & Machine Learning

THE ML ENGINEER 🤖

Issue #61

This week in Issue #61:

Microsoft's NLP Recipes
Messaging & Data Ingestion with Pulsar
Why Imbalanced ML is so hard
AI for Data Cleaning at Scale
Training Models with 1b+ Params
Featured OSS Production ML Libraries
Awesome AI Guidelines to check out this week
+ more 🚀

Forward the email, or share the online version on 🐦 Twitter, 💼 Linkedin and 📕 Facebook!

If you would like to suggest articles, ideas, papers, libraries, jobs or events, or to provide feedback, just hit reply or email us at a@ethical.institute! We have received a lot of great suggestions in the past, thank you very much for everyone's support!

Microsoft's NLP Recipes

In recent years, the field of natural language processing (NLP) has seen rapid growth in quality and usability, which has helped drive business adoption of artificial intelligence solutions. Microsoft has put together a great resource with best practices for NLP in the form of Jupyter notebooks and utility functions.

Messaging & Data Ingestion++

The Data Exchange Podcast talks with Sijie Guo about how Apache Pulsar is able to handle both queuing and streaming, and both online and offline applications. In this episode they cover the role of messaging in modern data applications and platforms, queuing implementations, streaming applications, and a status update on Apache Pulsar.

Why Imbalanced ML is so hard

Machine Learning Mastery sheds light on the topic of imbalanced classification in machine learning, specifically on why this challenge is so difficult to tackle. In this tutorial they cover the challenges of severely skewed class distributions, the costs of misclassification, the properties that can be imbalanced, and a framework for developing an intuition for how different dataset properties compound the modelling difficulty.
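To make the point about skewed classes and misclassification costs concrete, here is a minimal sketch in plain Python. The dataset (990 negatives, 10 positives) and the cost figures are hypothetical values chosen for illustration, not numbers from the tutorial.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that the model actually found."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total

# Hypothetical, severely skewed dataset: 1% positive class.
y_true = [0] * 990 + [1] * 10

# A "model" that always predicts the majority class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)

# Hypothetical costs: missing a positive is far worse than a false alarm.
COST_FALSE_NEGATIVE = 100
COST_FALSE_POSITIVE = 1
false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
false_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
total_cost = (false_negatives * COST_FALSE_NEGATIVE
              + false_positives * COST_FALSE_POSITIVE)

print(f"accuracy={acc:.2f} recall={rec:.2f} cost={total_cost}")
```

The majority-class predictor looks strong on accuracy while finding none of the minority cases, which is why accuracy alone can be misleading on imbalanced data.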

AI for Data Cleaning at Scale

An interesting article that proposes using ML to clean data at scale (for training more ML). The article breaks down the challenge of data cleaning and covers a fascinating academic open source project called HoloClean, which aims to tackle it, together with a breakdown of the techniques and next steps.

Training Models with 1b+ Params

Larger models are difficult to train because of cost, time, and the difficulty of code integration. Microsoft is releasing an open-source library called DeepSpeed, which aims to improve scale, speed, cost, and usability, unlocking the ability to train models at massive scale.

OSS: ETL & Batch Processing

The topic for this week's featured production machine learning libraries is ETL and Batch Processing. We are currently looking for more libraries to add - if you know of any that are not listed, please let us know or feel free to add a PR. The four featured libraries this week are:

Apache Airflow - Data Pipeline framework built in Python, including scheduler, DAG definition and a UI for visualisation
Argo Workflows - Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD (Custom Resource Definition).
Luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs, handling dependency resolution, workflow management, visualisation, etc.
Genie - Job orchestration engine to interface and trigger the execution of jobs from Hadoop-based systems
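As a rough illustration of the dependency resolution these tools perform, the sketch below orders the tasks of a small pipeline using Python's standard-library graphlib module (Python 3.9+). It does not use the Airflow, Argo, Luigi or Genie APIs, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "train": {"clean"},
    "report": {"clean", "train"},
}

def execution_order(dependencies):
    """Return an order in which every task runs after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())

order = execution_order(pipeline)
print(order)
```

A real scheduler adds retries, parallel execution and monitoring on top of this ordering step, but the underlying idea of running a task only once its upstream tasks have finished is the same.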

If you know of any libraries that are not in the "Awesome MLOps" list, please do give us a heads up or feel free to add a pull request!

OSS: Awesome AI Guidelines

As AI systems become more prevalent in society, we face bigger and tougher societal challenges. We have seen a large number of resources that aim to tackle these challenges in the form of AI Guidelines, Principles, Ethics Frameworks, etc. However, there are so many resources that they are hard to navigate. Because of this we started an Open Source initiative that aims to map the ecosystem and make it simpler to navigate. We showcase a few resources from our list each week so you can check them out. This week's resources are:

IEEE's Ethically Aligned Design - A Vision for Prioritizing Human Wellbeing with Artificial Intelligence and Autonomous Systems that encourages technologists to prioritize ethical considerations in the creation of autonomous and intelligent technologies.
Montréal Declaration for a responsible development of artificial intelligence - Ethical principles and values that promote the fundamental interests of people and groups, created as an initiative by Université de Montréal.
PWC's Responsible AI - PWC has put together a survey and a set of principles that abstract some of the key areas they've identified for responsible AI.
Singapore Data Protection Govt Commission's AI Governance Principles - The Singapore government's Personal Data Protection Commission has put together a set of guiding principles for data protection and human involvement in automated systems, accompanied by a report that breaks down the guiding principles and motivations.

If you know of any guidelines that are not in the "Awesome AI Guidelines" list, please do give us a heads up or feel free to add a pull request!