THE ML ENGINEER 🤖
Issue #60
If you would like to suggest articles, ideas, papers, libraries, jobs, or events, or to provide feedback, just hit reply or send us an email at a@ethical.institute! We have received a lot of great suggestions in the past. Thank you very much for everyone's support!
Production machine learning systems bring fundamentally different challenges from those in traditional software engineering. Last week in our talk at FOSDEM 2020 we presented a practical CI/CD framework for running production machine learning at massive scale. In this talk we define the concept of MLOps, cover some of the challenges that production machine learning brings to the table, and walk through a hands-on example using Seldon Core and Jenkins X to build machine learning pipelines that can scale to hundreds of models.
The lifecycle of a machine learning model only begins when it's deployed. Degrading performance is a big challenge: it takes the right processes and infrastructure to monitor a deployed model so that drift is caught before skewed predictions have a business impact.
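As a rough illustration of what such monitoring can look like, here is a minimal sketch of a drift check that compares live feature values against a reference sample using the two-sample Kolmogorov-Smirnov statistic. The function names and the 0.3 alert threshold are our own assumptions for illustration and are not taken from the linked article.

```python
from bisect import bisect_right


def ks_statistic(reference, live):
    """Largest gap between the empirical CDFs of two samples (0.0 to 1.0)."""
    ref_sorted = sorted(reference)
    live_sorted = sorted(live)
    max_gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in ref_sorted + live_sorted:
        cdf_ref = bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_live = bisect_right(live_sorted, x) / len(live_sorted)
        max_gap = max(max_gap, abs(cdf_ref - cdf_live))
    return max_gap


def has_drifted(reference, live, threshold=0.3):
    """Flag drift when the KS statistic exceeds a hypothetical threshold."""
    return ks_statistic(reference, live) > threshold
```

In practice the threshold would be tuned per feature, and a check like this would run on a schedule against recent production traffic.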
Machine learning interpretability is key in high-risk use cases. There are a large number of techniques available, each with its own tradeoffs, and it's important to make sure these tradeoffs are understood. This Kaggle Kernel gives a high-level overview of the importance of machine learning interpretability, together with hands-on examples of permutation importance, partial dependence plots and SHAP.
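To give a feel for the first of these techniques, here is a minimal, library-free sketch of permutation importance: shuffle one feature column at a time and measure how much the model's error increases. The toy model, data and function names below are assumptions made for illustration and are not taken from the Kaggle Kernel.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            permuted = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical toy model that only ever looks at feature 0.
def toy_model(row):
    return 2.0 * row[0]


X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [2.0 * row[0] for row in X]
scores = permutation_importance(toy_model, X, y)
```

Because the toy model ignores feature 1, shuffling it leaves the error unchanged, while shuffling feature 0 makes the error grow. That gap is the signal permutation importance relies on.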
In this episode of the Data Exchange, Chief Scientist Ben Lorica speaks with David Talby, co-creator of Spark NLP, an open source, highly scalable, production grade natural language processing (NLP) library. Spark NLP has become one of the more popular NLP libraries and is available on PyPI, Conda, Maven, and Spark Packages. With recent advances in research in large-scale natural language models, there is strong interest in domain specific natural language applications - in this podcast they dive into some of these.
Wayfair has a huge catalog of over 14 million items spanning very broad categories. However, the sheer size of its product catalog also makes it hard for customers to find the perfect item among all of the possible options. In this post Wayfair introduces its new Bayesian system, which was developed to (1) identify such items and (2) present them to customers.
The topic for this week's featured production machine learning libraries is ETL and Batch Processing. We are currently looking for more libraries to add. If you know of any that are not listed, please let us know or feel free to open a PR. The four featured libraries this week are:
- Apache Airflow - Data Pipeline framework built in Python, including scheduler, DAG definition and a UI for visualisation
- Argo Workflows - Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD (Custom Resource Definition).
- Luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs, handling dependency resolution, workflow management, visualisation, etc
- Genie - Job orchestration engine to interface and trigger the execution of jobs from Hadoop-based systems
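What these tools have in common is that they run tasks in dependency order. The sketch below illustrates that idea using only the Python standard library (`graphlib`, Python 3.9+). It does not use the API of Airflow, Argo Workflows, Luigi or Genie, and the task names and `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run each task after its upstream tasks, passing their results along."""
    # dependencies maps a task name to the list of tasks it depends on.
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        upstream = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = tasks[name](*upstream)
    return order, results


# Hypothetical three-step ETL job.
tasks = {
    "extract": lambda: [1, 2, 3],
    "transform": lambda rows: [r * 10 for r in rows],
    "load": lambda rows: sum(rows),
}
dependencies = {"transform": ["extract"], "load": ["transform"]}

order, results = run_pipeline(tasks, dependencies)
```

Real orchestrators add scheduling, retries, distributed execution and a UI on top of this core ordering step.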
As AI systems become more prevalent in society, we face bigger and tougher societal challenges. We have seen a large number of resources that aim to tackle these challenges in the form of AI Guidelines, Principles, Ethics Frameworks, etc. However, there are so many resources that they are hard to navigate. Because of this we started an Open Source initiative that aims to map the ecosystem to make it simpler to navigate. We will be showcasing three resources from our list every week so you can check them out. This week's resources are:
© 2018 The Institute for Ethical AI & Machine Learning