Subscribe to the Machine Learning Engineer Newsletter

Receive curated articles, tutorials and blog posts from experienced Machine Learning professionals.

Issue #2

This week in Issue #2:
Towards Pandas 1.0 features and deprecations, 50 popular matplotlib visualisations, papers with code, an intro to Kubeflow, style-based GANs, debugging your deep learning NLP with TextBugger, data versioning libraries and more!
Support the ML Engineer!
Forward the email, or share the online version on 🐦 Twitter, 💼 LinkedIn and 📕 Facebook!
A summary of Marc Garcia's PyData talk on the roadmap towards Pandas 1.0. It covers current developments around method chaining, Apache Arrow and extension arrays, as well as the deprecations of several features (such as inplace).
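To make the method chaining versus inplace point concrete, here is a minimal sketch. The toy DataFrame and its column names are made up for illustration and do not come from the talk; the pandas calls themselves (dropna, rename, assign, drop, reset_index) are standard.

```python
import pandas as pd

# Hypothetical toy data, used only to contrast the two styles.
raw = pd.DataFrame(
    {
        "city": ["London", "Paris", None, "Berlin"],
        "temp_f": [59.0, 68.0, 50.0, 77.0],
    }
)

# In-place style: every step mutates the same object.
inplace_df = raw.copy()
inplace_df.dropna(inplace=True)
inplace_df.rename(columns={"temp_f": "temp_c"}, inplace=True)
inplace_df["temp_c"] = (inplace_df["temp_c"] - 32) * 5 / 9
inplace_df.reset_index(drop=True, inplace=True)

# Chained style: every method returns a new DataFrame, so the steps
# compose into one expression and `raw` is left untouched.
chained_df = (
    raw
    .dropna()
    .assign(temp_c=lambda df: (df["temp_f"] - 32) * 5 / 9)
    .drop(columns="temp_f")
    .reset_index(drop=True)
)
```

Both versions end up with the same city and temp_c columns. The chained one reads top to bottom as a pipeline and needs no defensive copy.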
A great overview of some of the most useful visualisations in matplotlib (with the respective code), and a handy reference as well. I can't share this article without also giving a shoutout to some of the "matplotlib-on-steroids" libraries that help data scientists simplify visualisations through high-level interfaces - these include seaborn, ggplot and bokeh.
An excellent website that provides exactly what its name suggests - published papers that have been released with accompanying code. We applaud the AtlasML team not only for such an awesome contribution, but also for helping raise the bar in machine learning research.
Kubeflow is an open source, Kubernetes-native platform for developing, orchestrating, deploying and running scalable and portable machine learning workloads. This is a great overview of a very interesting tool that tackles the challenge of production orchestration in machine learning. Great companies and projects like SeldonIO, TensorRT and Pachyderm currently use it - you can learn more on the project's website.
NVIDIA researchers deliver another mind-blowing video + paper on GANs. This time their focus is on style-based guided generation. More specifically, "an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes... [enabling] intuitive, scale-specific control of the synthesis."
Adversarial methods for generating test data (or even training data) for machine learning systems are completely fascinating. A new paper titled "TEXTBUGGER: Generating Adversarial Text Against Real-world Applications" proposes a framework for generating text that can be used for adversarial attacks on deep learning NLP models. There is quite a lot of interesting work in this space. Another paper, released last year, used a similar approach for "Deceiving Google’s Perspective API Built for Detecting Toxic Comments".
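To give a feel for what such adversarial text looks like, here is a simplified sketch of character-level "bugs" (inserting a space, deleting a character, swapping neighbours, swapping in a look-alike character). It is not the paper's implementation: the LOOKALIKES table, the function names and the random choice of bug are assumptions made for illustration. As I understand the paper, the real framework also ranks words by their importance to the target model and picks the most damaging perturbation, and this sketch skips both steps.

```python
from __future__ import annotations

import random

# Assumed, illustrative look-alike map; not taken from the paper.
LOOKALIKES = {"a": "@", "e": "3", "i": "1", "l": "1", "o": "0", "s": "$"}


def insert_space(word: str, rng: random.Random) -> str:
    """Insert a space at a random position strictly inside the word."""
    if len(word) < 2:
        return word
    pos = rng.randrange(1, len(word))
    return word[:pos] + " " + word[pos:]


def delete_char(word: str, rng: random.Random) -> str:
    """Delete one inner character, keeping the first and last ones."""
    if len(word) < 3:
        return word
    pos = rng.randrange(1, len(word) - 1)
    return word[:pos] + word[pos + 1:]


def swap_chars(word: str, rng: random.Random) -> str:
    """Swap two adjacent inner characters, keeping the first and last ones."""
    if len(word) < 4:
        return word
    pos = rng.randrange(1, len(word) - 2)
    chars = list(word)
    chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
    return "".join(chars)


def substitute_lookalike(word: str, rng: random.Random) -> str:
    """Replace one character with a visually similar one, if any applies."""
    candidates = [i for i, ch in enumerate(word) if ch.lower() in LOOKALIKES]
    if not candidates:
        return word
    pos = rng.choice(candidates)
    return word[:pos] + LOOKALIKES[word[pos].lower()] + word[pos + 1:]


BUGS = [insert_space, delete_char, swap_chars, substitute_lookalike]


def perturb(sentence: str, target_words: set[str], seed: int = 0) -> str:
    """Apply one randomly chosen bug to every word listed in target_words."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        if word.lower() in target_words:
            word = rng.choice(BUGS)(word, rng)
        out.append(word)
    return " ".join(out)
```

Calling perturb("you are stupid", {"stupid"}) leaves the first two words alone and garbles the third in a way a human can still read, which is the kind of input a brittle toxicity classifier may misjudge.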
MLOps = ML Operations
In this edition we want to highlight a few awesome libraries that tackle the reproducible operations challenge, primarily around data versioning. The libraries we're showcasing this week are:
  • Data Version Control (DVC) - A Git-like tool that works alongside Git to provide version management of models, data and code.
  • QuiltData - Versioning, reproducibility and deployment of data and models.
  • Pachyderm - Open source distributed processing framework built on Kubernetes, focused mainly on dynamic building of production machine learning pipelines.
  • ModelDB - Framework that tracks all the steps in your ML code so you know which version of your model obtained which accuracy, and can then visualise and query it via the UI.
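To illustrate the core idea behind data versioning, here is a minimal, standard-library-only sketch of content-addressed snapshots: the large file goes into a cache keyed by its hash, and a small pointer file (the thing you would commit to Git) records which hash belongs to which path. This is an assumption about the general approach, not a description of how any of the libraries above is implemented; the function names, the ".pointer.json" suffix and the cache layout are all hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a pointer file."""
    digest = file_digest(data_file)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": data_file.name, "sha256": digest}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file that a pointer file refers to."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / meta["sha256"], target)
    return target


# Small demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    cache = workdir / "cache"
    data = workdir / "train.csv"

    data.write_text("a,b\n1,2\n")        # version 1 of the dataset
    pointer_v1 = snapshot(data, cache).read_text()

    data.write_text("a,b\n1,2\n3,4\n")   # version 2 overwrites the file
    snapshot(data, cache)

    # Roll back: put the v1 pointer in place and restore from the cache.
    pointer_path = workdir / "train.csv.pointer.json"
    pointer_path.write_text(pointer_v1)
    restored_text = restore(pointer_path, cache).read_text()
    cached_versions = len(list(cache.iterdir()))
```

Because the pointer file is tiny and changes whenever the data changes, checking out an old commit plus a restore step is enough to get back the exact dataset a model was trained on.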
If you know of any libraries that are not in the "Awesome MLOps" list, please give us a heads up or feel free to open a pull request.