The Institute for Ethical AI & Machine Learning

THE ML ENGINEER 🤖
Issue #20

This week in Issue #20:

An R book for programmers, real time hate-speech detection, the datalake layer, data discovery engines, DAWN ML infrastructure, serverless code reuse, data streaming libraries, upcoming ML conferences, data science / ML engineering jobs and more 🚀.

Support the ML Engineer!

Forward the email, or share the online version on 🐦 Twitter, 💼 LinkedIn and 📕 Facebook!

If you would like to suggest articles, ideas, papers, libraries, jobs or events, or to provide feedback, just hit reply or send us an email at a@ethical.institute! We have received a lot of great suggestions in the past. Thank you very much for everyone's support!

An R book for programmers

An incredible contribution to the world by Greg Wilson, who has released a fully open source book that dives into the world of R in a programmer-friendly way. "Speak not of madness, oh you who count from zero" is how the book begins, before venturing into an in-depth overview of R concepts, including hello worlds, indexing, control flow, munging data, error evaluation and (much, much) more. Check it out, and do contribute (even if it's with spelling corrections)!

Real time ML with Kafka & Spark

Last week we dived into the full potential of real-time machine learning, with real-time analysis for hate speech and offensive language detection on Reddit comments using Kafka and Spark Streaming. The reveal.js slides of the presentation are live, and the code will be open sourced in multiple sub-projects this week, containing a dockerized deployment (of Kafka, Spark, Zeppelin, Grafana + Prometheus), as well as the code to perform the text analysis. This is a very exciting project that extends our Explainable AI (XAI) workstream.

Delta Lake & data lake layer

Databricks has open sourced their Delta Lake project. Delta Lake is a storage layer that brings reliability to your data lakes built on HDFS and cloud storage by providing ACID transactions through optimistic concurrency control between writes and snapshot isolation for consistent reads during writes. Delta Lake also provides built-in data versioning for easy rollbacks and reproducing reports. It is great to see that leading players in the data engineering & data science space keep contributing to the open source community.
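To make the idea of optimistic concurrency control with versioned snapshots more concrete, here is a minimal in-memory sketch. It is not Delta Lake's API, and it does not reproduce Delta Lake's actual conflict-detection rules: `ToyVersionedTable`, `CommitConflict` and every method name are hypothetical, and the conflict rule (any commit based on a stale version is rejected) is a deliberate simplification.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


class CommitConflict(Exception):
    """Raised when a writer tries to commit on top of a stale version."""


@dataclass
class ToyVersionedTable:
    # One immutable snapshot per version; version 0 is the empty table.
    _versions: List[Tuple[dict, ...]] = field(default_factory=lambda: [()])

    def latest_version(self) -> int:
        return len(self._versions) - 1

    def read(self, version: Optional[int] = None) -> Tuple[dict, ...]:
        # Readers always see a complete snapshot, never a half-written one.
        if version is None:
            version = self.latest_version()
        return self._versions[version]

    def commit(self, read_version: int, new_rows: List[dict]) -> int:
        # Optimistic check: succeed only if nobody committed since we read.
        if read_version != self.latest_version():
            raise CommitConflict(
                f"read v{read_version}, but table is at v{self.latest_version()}"
            )
        snapshot = self._versions[read_version] + tuple(dict(r) for r in new_rows)
        self._versions.append(snapshot)
        return self.latest_version()


table = ToyVersionedTable()
start = table.latest_version()

# First writer commits against the version it read.
v1 = table.commit(start, [{"id": 1}])

# Second writer still holds the old version, so its commit is rejected.
try:
    table.commit(start, [{"id": 2}])
    conflict = False
except CommitConflict:
    conflict = True

# The usual remedy is to re-read the latest version and retry.
v2 = table.commit(table.latest_version(), [{"id": 2}])

# Older versions stay readable, which is what makes rollbacks possible.
old_snapshot = table.read(v1)
```

Keeping every version as an immutable snapshot is what lets the same structure serve both purposes mentioned above: readers are isolated from in-flight writes, and earlier states remain available for rollbacks or for reproducing a report.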

Amundsen data discovery engine

Lyft has been tackling the data exploration challenge, and has now taken it to the next level by open sourcing their internal data discovery application "Amundsen". This platform allows data scientists to explore metadata from the datasets to identify relevant sources to use across projects. This is quite an interesting space in ML Operations, with several tools like Apache Atlas also attempting to address the issue (more focused on the HDFS world). Much like "Atlas", the code name Amundsen is a nod to exploration: it comes from the Norwegian explorer Roald Amundsen, who led the first expedition to reach the South Pole in 1911.

DAWN Machine Learning Tools

Interesting paper released by Stanford researchers proposing the combination of multiple ML tools to form the "DAWN Infrastructure", which aims to enable "anyone with domain expertise to build their own production-quality ML products". They combine systems like Snorkel, DeepDive, modelQA, ModelSnap, and several others to propose a setup that could provide the infrastructure required to achieve this.

Serverless operator reuse

When using serverless frameworks like AWS Lambda, it is often necessary to handle state through infrastructure like databases. This requires data to be retrieved and stored (the E and L of ETL). Given that the sources we interact with are often the same across functions, code reuse becomes more important. This post proposes a few options to access multiple different sources with reusable code, as well as best practices to follow when using AWS Lambda.
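One way to picture this kind of reuse is a small shared interface that every data source implements, so the function body stays the same regardless of which sources it talks to. The sketch below is only an illustration and not the approach from the linked post: `Source`, `InMemorySource`, `register_source` and the event fields `from`, `to` and `query` are all hypothetical. The only AWS-specific assumption is the standard `handler(event, context)` shape of a Python Lambda function.

```python
from typing import Dict, Iterable, List, Protocol


class Source(Protocol):
    """Hypothetical shared interface for anything we extract from or load into."""

    def extract(self, query: dict) -> List[dict]: ...

    def load(self, rows: Iterable[dict]) -> int: ...


class InMemorySource:
    """Stand-in for a real database client, used here to keep the sketch runnable."""

    def __init__(self) -> None:
        self.rows: List[dict] = []

    def extract(self, query: dict) -> List[dict]:
        # Return rows whose fields match every key/value pair in the query.
        return [r for r in self.rows if all(r.get(k) == v for k, v in query.items())]

    def load(self, rows: Iterable[dict]) -> int:
        rows = list(rows)
        self.rows.extend(rows)
        return len(rows)


# Registry shared by every function that imports this module.
SOURCES: Dict[str, Source] = {}


def register_source(name: str, source: Source) -> None:
    SOURCES[name] = source


def handler(event: dict, context: object) -> dict:
    # The same handler works for any pair of registered sources.
    src = SOURCES[event["from"]]
    dst = SOURCES[event["to"]]
    rows = src.extract(event.get("query", {}))
    written = dst.load(rows)
    return {"extracted": len(rows), "loaded": written}


orders = InMemorySource()
orders.load([{"id": 1, "status": "paid"}, {"id": 2, "status": "open"}])
warehouse = InMemorySource()
register_source("orders", orders)
register_source("warehouse", warehouse)

result = handler({"from": "orders", "to": "warehouse", "query": {"status": "paid"}}, None)
```

In a real deployment the in-memory class would be replaced by thin wrappers around actual clients, and the shared module would typically be packaged once and imported by each function rather than copied between them.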

MLOps = Featured OS Libraries

The theme for this week's featured ML libraries is real-time machine learning with data streaming pipelines, which falls under our Responsible ML Principle #4. This week we want to dive deeper and feature some smaller libraries in this space. The four featured libraries on data stream processing this week are:

Apache Flink - Open source stream processing framework with powerful stream and batch processing capabilities.
Faust - Streaming library built on top of Python’s asyncio library using the async Kafka client, inspired by the Kafka Streams library.
Kafka Streams - Kafka client library for building applications and microservices where the input and output are stored in Kafka clusters.
Spark Streaming - Micro-batch processing for streams using the Apache Spark framework as a backend, supporting stateful exactly-once semantics.
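For readers new to the micro-batch model mentioned above, the following plain-Python sketch shows the general idea: an unbounded stream is cut into small batches, and state is carried forward from one batch to the next. It does not use any of the listed libraries or their APIs. The function names are hypothetical, and batching by message count (rather than by time interval, as stream processors typically do) is a simplification to keep the example deterministic.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut a stream of messages into consecutive batches of at most batch_size."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def running_word_counts(stream: Iterable[str], batch_size: int) -> Iterator[Dict[str, int]]:
    """Yield the cumulative word counts after each micro-batch is processed."""
    state: Counter = Counter()  # state that survives across batches
    for batch in micro_batches(stream, batch_size):
        for message in batch:
            state.update(message.lower().split())
        yield dict(state)


messages = ["a b", "b c", "c"]
snapshots = list(running_word_counts(messages, batch_size=2))
```

With three messages and a batch size of two, the stream is processed as two batches, and each snapshot reflects everything seen up to the end of that batch.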

If you know of any libraries that are not in the "Awesome MLOps" list, please do give us a heads up or feel free to open a pull request!

MLConf = Conferences & Events

We feature conferences that have core ML tracks (primarily in Europe for now) to help our community stay up to date with great events coming up.

Technical Conferences

DataFest19 [11/03/2019] - Two week festival of Data Innovation hosted across Scotland, UK.

PyCon + PyData Florence [02/05/2019] - Python X comes this year with a PyData focus in Florence, Italy.

AI Conference Beijing [18/06/2019] - O'Reilly's signature applied AI conference in Asia in Beijing, China.

RAAIS 2019 [28/06/2019] - The Research and Applied AI Summit in London, UK.

Data Natives [21/11/2019] - Data conference in Berlin, Germany.

ODSC Europe [19/11/2019] - The Open Data Science Conference in London, UK.

spaCy IRL [05/07/2019] - spaCy NLP's first F2F conference in Berlin, Germany.

Business Conferences

World Summit AI Americas [10/04/2019] - Large scale AI summit in Montreal, Canada.
- Come join our panel on AI Ethics and Tools.

AI Expo Global [19/04/2019] - Global conference on artificial intelligence in London, UK.
- Come join us at our talk on AI orchestration at scale.

Predictive Analytics World [18/11/2019] - Conference for Business AI in Berlin, Germany.

Big Data LDN 2019 [13/11/2019] - Conference for strategy and tech on big data in London, UK.

MLJobs = Jobs & Careers

We showcase Machine Learning Engineering jobs (primarily in London for now) to help our community stay up to date with great opportunities that come up. It seems that the demand for data scientists continues to rise!

Leadership Opportunities

Algorithmia is hiring for a VP of Engineering in Seattle, USA
Fractal Labs is hiring for a VP of Engineering in London
Distributed is hiring for a VP of Engineering in London
FactMata is hiring for a Head of Machine Learning in London
Brainpool.ai is hiring for a Head of Machine Learning in London, UK
Cytora is hiring for a Data Science Director in London

Mid-level Opportunities

Proportunity is hiring for a Senior Machine Learning Engineer in London
Twitter is hiring for a Senior Machine Learning Engineer in London
Atlas ML is hiring for a Lead NLP Engineer in London
StreetBees is hiring for a Senior Data Scientist in London
Expedia is hiring for a Principal Data Scientist in London
QuantumBlack is hiring for a Senior Machine Learning Engineer in London
Tractable is hiring for a Senior Deep Learning Engineer

Junior Opportunities

Seldon is hiring for a Machine Learning / Data Engineer in London
Migacore is hiring for a Machine Learning Engineer in London
CloudNC is hiring for a Machine Learning Engineer in London
Babylon Health is hiring for a Machine Learning Engineer in London
Chattermill is hiring for a Machine Learning Engineer in London