When designing a machine learning pipeline for a new problem, each stage entails design choices: which strategy should be used to handle missing data? How should categorical features be encoded? How many trees should a random forest have, or would logistic regression work better? Testing many of these combinations by hand is tedious and error-prone.
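To get a feel for how quickly these choices multiply, here is a minimal sketch that enumerates a hypothetical configuration grid (the stage names and option values are illustrative, not a real AutoML search space):

```python
from itertools import product

# Illustrative design choices at each pipeline stage.
imputation = ["mean", "median", "drop_rows"]          # how to handle missing data
encoding = ["one_hot", "ordinal", "target"]           # how to transform categoricals
models = [("random_forest", {"n_estimators": n}) for n in (10, 100, 500)]
models += [("logistic_regression", {})]               # or a different model entirely

# Every combination of choices is one candidate pipeline to evaluate.
configurations = list(product(imputation, encoding, models))
print(len(configurations))  # prints 36
```

Even this toy grid with three small stages already yields 36 candidate pipelines; real search spaces with continuous hyperparameters are far larger, which is why exhaustive manual testing does not scale.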
Automated machine learning (AutoML) aims to help both novices and experts by searching over these options and parameters in a smart way, with a model that learns from past experience. To properly warm-start this search, we need to run extensive experiments on a large collection of datasets that is truly representative of the data space. That’s why we are building a benchmarking framework and curating a repository of real-life datasets.