In the world of machine learning (ML), there are a few processes critical to anyone working in the space. The first is making sure the data used in machine learning is clean. This topic gets discussed a lot, so, while it is very important, let's skip it and move on to the next step. Assuming a clean and complete dataset already exists, the next step is to start training models.
The larger the dataset, the longer it takes to train a model. This seems obvious, but it is worth stating because it is an often overlooked yet critically significant detail. The reason it matters so much is that a model doesn't get trained just once; training models is a highly repetitive task. One of the goals of training is to arrive at an optimal set of hyperparameters that yields the best result on a known dataset, and then to verify them against a holdout dataset. What are hyperparameters in machine learning? Hyperparameters typically refer to the parameters that control the learning process (e.g., its speed and quality) of a model, and there are a lot of them, which leads to a large number of permutations to test.
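To see how quickly those permutations add up, here is a minimal sketch of a hyperparameter grid for XGBoost. The specific parameter names are real XGBoost options, but the candidate values are illustrative choices, not a recommended search space:

```python
import itertools

# A few common XGBoost hyperparameters with candidate values
# (illustrative choices, not a recommended search space).
param_grid = {
    "max_depth": [4, 6, 8],             # tree depth: model capacity
    "learning_rate": [0.01, 0.1, 0.3],  # shrinkage: speed vs. quality
    "subsample": [0.5, 0.8, 1.0],       # fraction of rows sampled per tree
    "n_estimators": [100, 500, 1000],   # number of boosting rounds
}

# Every combination is one candidate model to train and evaluate.
combos = list(itertools.product(*param_grid.values()))
print(len(combos))  # 3 * 3 * 3 * 3 = 81 training runs for even this tiny grid
```

Four parameters with three values each already means 81 full training runs; real searches cover more parameters and more values.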
Service-Level Time Windows
These repetitions quickly add up to a lot of wall-clock time and compute cycles. As time marches on and new models are created, there are often service-level time windows that must be adhered to in order to put a new model into a production environment.
Take a real-world example from a financial services use case. In this situation, the entire dataset was about 1.5TB in size. Training on the full dataset was nearly impossible, so a 40GB sample was created instead. At that size, it took about 2 hours to train one model with the popular XGBoost library in Python. In isolation, that number might not seem so bad, but keep in mind that one iteration is never enough and that a single model is seldom used in isolation. Most production deployments consist of an ensemble of models (i.e., a group of models working together).
When taking into account the variety of different models needed to accomplish a task and the total iterations to arrive at an acceptable model-prediction performance, the total pipeline training time can grow into weeks. And when you consider iterations on that pipeline over months or years, it becomes quite easy to understand why there is such a strong desire to train and arrive at the most optimized hyperparameters as quickly as possible.
Improving the Training Pipeline
The big question most people have is: How can we improve this type of training pipeline? Within the Python ecosystem, the most widely used libraries are Pandas, Scikit-learn, and XGBoost. The first change would be to add a scaling framework such as Dask to the solution. This provides parallelization and enables performance optimization by utilizing more compute resources simultaneously. The next change would be to add a GPU to the mix. Pandas and Scikit-learn can be swapped for the open source RAPIDS libraries cuDF and cuML, respectively, and XGBoost can already run on a GPU and with Dask. When coupled with an open source framework such as Optuna, which integrates with Dask, model training and testing pipelines can be sped up, enabling more iterations and yielding better results in less time.
Increasing Speed, Reducing Cost
Running this same pipeline on GPUs with the same dataset mentioned previously achieved a 100x speedup and easily surpassed a 90% cost reduction for the same training. More iterations lead to better optimization of hyperparameters, which results in better performance of the model.
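Putting the two numbers from this example together shows what a 100x speedup means in practice for the roughly 2-hour CPU training run described earlier:

```python
baseline_hours = 2.0   # one CPU training run, per the earlier example
speedup = 100          # reported GPU speedup for the same pipeline

gpu_seconds = baseline_hours * 3600 / speedup
print(gpu_seconds)  # 72.0: each 2-hour run drops to about 72 seconds
```

At roughly 72 seconds per run, a hyperparameter search that once took days of wall-clock time fits into a working session.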
Another benefit that comes with this kind of speed increase and cost reduction is that it gets people thinking about training over larger datasets, because with the original approach, the time to complete training on such large datasets usually pushes the total pipeline time into an unacceptable timeframe. When training over larger datasets is combined with more training iterations, a typical expectation is to yield higher-accuracy models. In many industries, such as financial services, improving accuracy by just a couple of tenths of a percent can yield millions of dollars in additional revenue.
While this is all important from a pipeline perspective, the most important detail is that ML teams do not need to reinvent any technology to achieve these benefits. These tools are readily available, battle-tested, integrated to work together, and actively used at the largest companies in the world.