
Challenges of Machine Learning Systems in Production

2022-06-27 11:04:00 【Steven Devin】


Machine learning and deep learning have become very popular in the past few years, but most online material and classroom teaching focuses on building and tuning models.

In actual production, however, machine learning engineers are responsible for more than building and maintaining models. They also need solid software engineering skills.

Most companies have only started using machine learning, or developing related systems, in the past few years.

Few companies develop and run machine learning systems at large scale.

Running these systems in production often brings challenges. This article discusses some of them in depth.

1. Organize the machine learning experiment process

Machine learning development is an iterative process.

You need to try various combinations of data, learning algorithms, and model parameters, and track the impact of each change on prediction performance.

Over time, this iterative experimentation may produce thousands of training runs and model versions.

This makes it difficult to keep track of the best-performing model and to reproduce the configuration that produced it.

As in traditional software engineering, a model is rarely developed by one person over its whole lifetime.

Team turnover, changing objectives, new datasets, and new features are all common.

Therefore, we should expect experimentation to continue long after the model is first built.

Comparing current experimental results with past ones to identify opportunities for further improvement becomes increasingly difficult.

This calls for a system that tracks experiment metadata and the effect of different parameters on prediction performance.
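To make the idea of tracking experiment metadata concrete, here is a minimal sketch in Python. The names `Run` and `ExperimentLog` and the JSON-lines file layout are assumptions made for illustration. They are not taken from any particular tracking tool, and a real team would more likely adopt an existing one.

```python
import json
import time
from dataclasses import dataclass, field, asdict
from pathlib import Path


@dataclass
class Run:
    """One training run: what was tried and how it scored."""
    params: dict
    metrics: dict
    dataset: str = "unknown"
    timestamp: float = field(default_factory=time.time)


class ExperimentLog:
    """Append-only JSON-lines log of runs (hypothetical helper, not a real library)."""

    def __init__(self, path):
        self.path = Path(path)

    def record(self, run):
        # One JSON object per line keeps the log easy to append to and to diff.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(run)) + "\n")

    def runs(self):
        # A log that has never been written to simply has no runs.
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [Run(**json.loads(line)) for line in f if line.strip()]

    def best(self, metric, higher_is_better=True):
        # Only consider runs that actually reported the requested metric.
        candidates = [r for r in self.runs() if metric in r.metrics]
        if not candidates:
            return None
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r.metrics[metric])
```

After each training job calls `record(...)`, a call such as `log.best("val_accuracy")` returns the run whose parameters and dataset label are needed to reproduce the best model, even if the person who trained it has left the team.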

2. Debug and train models

When you train a model in an interactive programming environment such as a Jupyter notebook, debugging the training job is straightforward.

You run the code manually, and if a training error occurs, the notebook displays the exception and stack trace.

If training succeeds, you can plot learning curves and other metrics to diagnose whether the model is overfitting or suffering from vanishing gradients.

But when training runs as an automated batch job on a fixed schedule, debugging becomes difficult.

A scheduler can rerun a failed training job, but unless you write a custom solution, it cannot easily check for overfitting or vanishing gradients.

Data science teams aim to deploy more and more models, and as the number of models in this process grows, the problem only gets worse.
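The kind of custom check mentioned above could look like the following sketch. It assumes the training job already records per-epoch training and validation losses and per-layer gradient norms. The function names, the `patience` of 3 epochs, and the `1e-7` threshold are illustrative assumptions, not established defaults.

```python
def check_overfitting(train_losses, val_losses, patience=3):
    """Flag overfitting: validation loss rose for `patience` consecutive
    epochs while training loss kept falling."""
    if len(train_losses) != len(val_losses):
        raise ValueError("loss histories must have the same length")
    if len(val_losses) <= patience:
        return False  # not enough history to judge
    recent = range(len(val_losses) - patience, len(val_losses))
    val_rising = all(val_losses[i] > val_losses[i - 1] for i in recent)
    train_falling = all(train_losses[i] < train_losses[i - 1] for i in recent)
    return val_rising and train_falling


def check_vanishing_gradients(grad_norms_by_layer, threshold=1e-7):
    """Return the names of layers whose latest gradient norm is below `threshold`."""
    return [name for name, norms in grad_norms_by_layer.items()
            if norms and norms[-1] < threshold]


def diagnose(train_losses, val_losses, grad_norms_by_layer):
    """Collect human-readable warnings that a batch job can log or alert on."""
    warnings = []
    if check_overfitting(train_losses, val_losses):
        warnings.append(
            "possible overfitting: validation loss rising while training loss falls")
    dead = check_vanishing_gradients(grad_norms_by_layer)
    if dead:
        warnings.append("possible vanishing gradients in: " + ", ".join(sorted(dead)))
    return warnings
```

A scheduled job could call `diagnose(...)` at the end of training and fail the run or send an alert when the returned list is not empty, which replaces the visual inspection a notebook user would do by hand.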

3. Deploy the model to the production environment

A machine learning model only starts adding value to the company once its intended users can actually use it.

Taking a trained ML model and making its predictions available to users or other systems is called "deployment".

Deployment is completely different from conventional machine learning tasks such as feature engineering, model selection, or model evaluation.

As a result, data scientists and ML engineers who lack a software engineering or DevOps background may not know much about it.

When deciding how to deploy a machine learning model, there are several factors to consider:

How often predictions should be generated.
Whether predictions should be generated for a single instance or for a batch of instances at a time.
The number of applications that access the model.
The latency requirements of those applications.
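
As a rough illustration of how these four factors might feed a deployment decision, the sketch below writes them down as a small data structure and maps them to a coarse serving pattern. The class and function names, the three pattern labels, and the one-second cut-off are all assumptions made for this example, not rules from any platform.

```python
from dataclasses import dataclass


@dataclass
class DeploymentRequirements:
    predictions_per_day: int   # how often predictions are generated
    batch_input: bool          # True if many instances are scored at once
    client_applications: int   # number of applications calling the model
    max_latency_ms: float      # strictest latency requirement among clients


def suggest_serving_mode(req, interactive_ms=1000.0):
    """Map requirements to a coarse serving pattern.

    The `interactive_ms` cut-off is an illustrative assumption, not an industry rule.
    """
    if req.max_latency_ms <= interactive_ms:
        # Someone is waiting on the answer: serve behind an online endpoint.
        return "online endpoint"
    if req.batch_input:
        # Nobody is waiting and inputs arrive in bulk: score them on a schedule.
        return "scheduled batch job"
    # Individual requests that tolerate delay can be queued and processed later.
    return "asynchronous queue"


def summarize(req):
    """One-line summary combining all four factors."""
    mode = suggest_serving_mode(req)
    sharing = ("shared service" if req.client_applications > 1
               else "embedded in one application")
    return f"{mode}, {sharing}, about {req.predictions_per_day} predictions/day"
```

For instance, a model called by three applications with a 100 ms latency budget comes out as an online endpoint run as a shared service, while a nightly scoring job with no latency pressure comes out as a scheduled batch job.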

4. Scale machine learning services

Once models are deployed to an endpoint, they can begin providing value to users.

But model endpoints may soon face heavier workloads.

For example, if the company starts serving more users, the increased demand may degrade the quality of your machine learning service.

ML models hosted as API endpoints need to respond to such changes in demand.

When requests increase, the number of compute instances serving the model should increase. When the workload decreases, instances should be removed so that you do not pay for unused capacity.
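The scaling rule described above can be sketched as a simple target-tracking calculation. This assumes you know the current request rate and roughly how many requests per second one instance can handle. The function name, the parameter names, and the default bounds are hypothetical, and managed platforms implement their own, more elaborate policies.

```python
import math


def desired_instances(requests_per_second, target_rps_per_instance,
                      min_instances=1, max_instances=10):
    """Target-tracking rule: enough instances to keep each one near its target load."""
    if target_rps_per_instance <= 0:
        raise ValueError("target_rps_per_instance must be positive")
    if min_instances > max_instances:
        raise ValueError("min_instances must not exceed max_instances")
    # Round up so the fleet is never sized below the load it has to carry.
    needed = math.ceil(requests_per_second / target_rps_per_instance)
    # Clamp: keep a floor for availability and a ceiling for cost control.
    return max(min_instances, min(max_instances, needed))
```

With a target of 100 requests per second per instance, a load of 450 requests per second yields 5 instances, an idle endpoint falls back to the minimum of 1, and a spike to 5000 requests per second is capped at the maximum of 10.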

5. Model monitoring

This is where the real work begins.

Once the model has been deployed, it is important to monitor its predictions and any anomalies.

At this stage the model must be monitored continuously in order to detect and correct deviations in model quality, such as data drift.

Detecting these deviations early and proactively lets you take corrective action, such as retraining the model, auditing upstream systems, or fixing data quality problems, without having to monitor the model manually or build additional tools.
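One common way to quantify data drift for a single numeric feature is the Population Stability Index (PSI), which compares the binned distribution of live data with that of the training data. The sketch below is a minimal standard-library version. The bin count, the `eps` smoothing value, and the 0.2 alert threshold are assumptions to tune for your own data, and PSI is only one of several possible drift measures.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    # Fall back to width 1.0 when every reference value is identical.
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = int((v - lo) / width)
            counts[max(0, min(bins - 1, idx))] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def drifted(reference, current, threshold=0.2):
    """True if the PSI exceeds `threshold` (an assumed, commonly quoted cut-off)."""
    return psi(reference, current) > threshold
```

A monitoring job could run `drifted(training_values, recent_values)` for each input feature on a schedule and trigger retraining or a data quality review when it returns `True`.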

In closing:

Sometimes the right choice is to avoid building your own machine learning infrastructure.

Take advantage of the wide range of open source tools and platforms, and focus on building models that provide differentiated value.
