THE ML ENGINEER 🤖
Issue #20
 
 
This week in Issue #20:
An R book for programmers, real-time hate speech detection, the data lake layer, data discovery engines, DAWN ML infrastructure, serverless code reuse, data streaming libraries, upcoming ML conferences, data science / ML engineering jobs and more 🚀.
 
Support the ML Engineer!
Forward the email, or share the online version on 🐦 Twitter, 💼 LinkedIn and 📕 Facebook!
 
If you would like to suggest articles, ideas, papers, libraries, jobs or events, or to provide feedback, just hit reply or email us at a@ethical.institute! We have received a lot of great suggestions in the past. Thank you all very much for your support!
 
 
 
Greg Wilson has made an incredible contribution to the world by releasing a fully open source book that dives into R in a programmer-friendly way. "Speak not of madness, oh you who count from zero" is how the book begins, before venturing into an in-depth overview of R concepts, including hello worlds, indexing, control flow, munging data, error evaluation and (much, much) more. Check it out, and do contribute (even if it's with spelling corrections)!
 
 
Last week we explored the full potential of real-time machine learning by analysing Reddit comments for hate speech and offensive language using Kafka and Spark Streaming. The reveal.js slides of the presentation are live, and the code will be open sourced in multiple sub-projects this week, containing a dockerized deployment (of Kafka, Spark, Zeppelin, Grafana + Prometheus), as well as the code to perform the text analysis. This is a very exciting project that extends our Explainable AI (XAI) workstream.
 
 
Databricks has open sourced their Delta Lake project. Delta Lake is a storage layer that brings reliability to your data lakes built on HDFS and cloud storage by providing ACID transactions through optimistic concurrency control between writes and snapshot isolation for consistent reads during writes. Delta Lake also provides built-in data versioning for easy rollbacks and reproducing reports. It is great to see that leading players in the data engineering & data science space keep contributing to the open source community.
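To make the idea of optimistic concurrency control and snapshot reads a little more concrete, here is a minimal sketch in plain Python. It is a toy, not the Delta Lake API: the `ToyVersionedTable` class, its methods and the `ConflictError` exception are hypothetical names invented for illustration, and the conflict check (reject any commit based on a stale version) is a simplification of what a real storage layer does.

```python
class ConflictError(Exception):
    """Raised when another writer committed after our snapshot was taken."""


class ToyVersionedTable:
    """Hypothetical in-memory table that keeps every committed version."""

    def __init__(self):
        # Version 0 is the empty table; each commit appends a new immutable snapshot.
        self._versions = [()]

    @property
    def latest_version(self):
        return len(self._versions) - 1

    def read(self, version=None):
        """Return an immutable snapshot; defaults to the latest version."""
        if version is None:
            version = self.latest_version
        return self._versions[version]

    def commit(self, read_version, new_rows):
        """Append rows only if nobody has committed since read_version."""
        if read_version != self.latest_version:
            raise ConflictError(
                f"snapshot v{read_version} is stale, table is at v{self.latest_version}"
            )
        self._versions.append(self._versions[-1] + tuple(new_rows))
        return self.latest_version


table = ToyVersionedTable()
snapshot = table.latest_version          # two writers both start from v0
table.commit(snapshot, ["row-a"])        # the first writer wins
try:
    table.commit(snapshot, ["row-b"])    # the second writer holds a stale snapshot
except ConflictError:
    # Optimistic strategy: re-read the latest version and retry.
    table.commit(table.latest_version, ["row-b"])
```

Because older snapshots are never mutated, `table.read(1)` still returns the table as it was after the first commit, which is the property that makes rollbacks and reproducible reports possible.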
 
 
Lyft has been tackling the data exploration challenge, and has now taken it to the next level by open sourcing their internal data discovery application "Amundsen". This platform allows data scientists to explore dataset metadata to identify relevant sources to use across projects. This is quite an interesting space in ML Operations, with several tools like Apache Atlas also attempting to address the issue (more focused on the HDFS world). Much like "Atlas", the code name evokes exploration: Amundsen is named after the Norwegian explorer Roald Amundsen, who led the first expedition to reach the South Pole in 1911.
 
 
Stanford researchers have released an interesting paper proposing the combination of multiple ML tools to form the "DAWN Infrastructure", which aims to enable "anyone with domain expertise to build their own production-quality ML products". They combine systems like Snorkel, DeepDive, ModelQA, ModelSnap, and several others to propose a setup that could provide the infrastructure required to achieve this.
 
 
When using serverless frameworks like AWS Lambda, it is often necessary to handle state through infrastructure like databases. This requires data to be retrieved and stored (the E and L from ETL). Given that the sources we interact with are often the same across functions, code reuse becomes more important. This post proposes a few options to access multiple different sources with reusable code, as well as best practices to follow when using AWS Lambda.
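One possible shape for this kind of reuse, sketched in Python, is to put every source behind a small shared interface and build handlers from it. This is an illustration under assumptions rather than the approach of the linked post: `Source`, `InMemorySource` and `make_handler` are hypothetical names, and the in-memory class stands in for a real database or object store client. Only the `(event, context)` handler signature comes from AWS Lambda itself.

```python
from typing import Callable, Iterable, List, Protocol


class Source(Protocol):
    """Minimal interface every data source is assumed to implement."""

    def extract(self) -> List[dict]: ...

    def load(self, rows: Iterable[dict]) -> int: ...


class InMemorySource:
    """Hypothetical stand-in for a database or object store client."""

    def __init__(self, rows: Iterable[dict] = ()):
        self.rows = list(rows)

    def extract(self) -> List[dict]:
        # Return a copy so callers cannot mutate the stored rows.
        return list(self.rows)

    def load(self, rows: Iterable[dict]) -> int:
        rows = list(rows)
        self.rows.extend(rows)
        return len(rows)


def make_handler(source: Source, sink: Source, transform: Callable[[dict], dict]):
    """Build a Lambda-style handler that moves rows from source to sink."""

    def handler(event, context):
        rows = [transform(row) for row in source.extract()]
        return {"loaded": sink.load(rows)}

    return handler


# The same factory can be reused for any pair of sources.
orders = InMemorySource([{"amount": 1}, {"amount": 2}])
warehouse = InMemorySource()
handler = make_handler(orders, warehouse, lambda row: {"amount": row["amount"] * 100})
result = handler({}, None)
```

Keeping the source-specific code behind one interface means each new function only has to supply its own transform.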
 
 
 
 
MLOps = Featured OS Libraries
The theme for this week's featured ML libraries is real-time machine learning with data streaming pipelines, which falls under our Responsible ML Principle #4. This week we want to dive deeper into this space. The four featured libraries on data stream processing are:
 
  • Apache Flink - Open source stream processing framework with powerful stream and batch processing capabilities.
  • Faust - Stream processing library built on top of Python's asyncio with an async Kafka client, inspired by Kafka Streams.
  • Kafka Streams - Kafka client library for building applications and microservices where the input and output are stored in Kafka clusters.
  • Spark Streaming - Micro-batch processing for streams using the Apache Spark framework as a backend, supporting stateful exactly-once semantics.
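For readers new to the micro-batch model mentioned in the list above, the following plain-Python sketch shows the core idea: group an unbounded stream into small batches and update running state once per batch. It uses no streaming library at all; `micro_batches` and `running_word_counts` are hypothetical helper names, and the sketch makes no attempt at the fault tolerance or exactly-once guarantees that the real frameworks provide.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield lists of up to batch_size items until the stream is exhausted."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_word_counts(lines, batch_size=2):
    """Stateful processing: emit the cumulative word counts after each batch."""
    totals = Counter()
    for batch in micro_batches(lines, batch_size):
        for line in batch:
            totals.update(line.lower().split())
        # Emit a copy so later batches do not mutate earlier results.
        yield dict(totals)


comments = ["Great post", "great library", "Thanks"]
snapshots = list(running_word_counts(comments, batch_size=2))
```

Here `snapshots` holds one cumulative count per batch, so the state after the second batch includes everything seen in the first.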
 
If you know of any libraries that are not in the "Awesome MLOps" list, please do give us a heads up or feel free to open a pull request.
 
 
 
We feature conferences that have core ML tracks (primarily in Europe for now) to help our community stay up to date with great events coming up.
 
Technical Conferences
 
  • DataFest19 [11/03/2019] - Two-week festival of data innovation hosted across Scotland, UK.
 
 
  • AI Conference Beijing [18/06/2019] - O'Reilly's signature applied AI conference in Asia in Beijing, China.
 
 
  • Data Natives [21/11/2019] - Data conference in Berlin, Germany.
 
  • ODSC Europe [19/11/2019] - The Open Data Science Conference in London, UK.
 
 
 
Business Conferences
 
  • World Summit AI Americas [10/04/2019] - Large-scale AI summit in Montreal, Canada.
    • Come join our panel on AI Ethics and Tools.
 
 
 
  • Big Data LDN 2019 [13/11/2019] - Conference for strategy and tech on big data in London, UK.
 
 
 
We showcase Machine Learning Engineering jobs (primarily in London for now) to help our community stay up to date with great opportunities that come up. It seems that the demand for data scientists continues to rise!
 
Leadership Opportunities
 
Mid-level Opportunities
 
Junior Opportunities
 
 
 
 
 
© 2018 The Institute for Ethical AI & Machine Learning