Challenges of machine learning systems in production
2022-06-27 11:04:00 【Steven Devin】
Machine learning and deep learning have become very popular over the past few years.
However, most online material and classroom teaching focuses on building and tuning models.
In real production settings, machine learning engineers are responsible for more than building and maintaining models; they also need solid software engineering skills.
Most companies have only started using machine learning, or developing related systems, in the past few years,
and few companies develop and run machine learning systems at large scale.
Running such systems brings recurring challenges, and this article discusses some of them in depth.
1. Organizing machine learning experiments
Developing a machine learning model is an iterative process.
You need to experiment with different combinations of data, learning algorithms, and model parameters,
and track how each change affects prediction performance.
Over time, this iterative experimentation can produce thousands of training runs and model versions.
That makes it hard to keep track of the best-performing model and to reproduce its configuration.
As in traditional software engineering, a model is rarely developed by one person over its whole lifetime.
Team turnover, changing objectives, new datasets, and new features are all common.
We should therefore expect experimentation to continue long after the model is first built.
Comparing current results with past results to identify opportunities for further improvement becomes increasingly difficult,
so a system is needed to track experiment metadata and the effect of different parameters on prediction performance.
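As a rough illustration of such a tracking system, the sketch below appends each training run to a JSON-lines file and looks up the best run for a given metric. The names `Run`, `ExperimentLog`, and every field are hypothetical, chosen for this example only; a real team would more likely adopt an existing experiment-tracking tool.

```python
import json
import time
from dataclasses import dataclass, field, asdict
from pathlib import Path


@dataclass
class Run:
    """Metadata for one training run (field names are illustrative)."""
    run_id: str
    dataset: str
    algorithm: str
    params: dict
    metrics: dict = field(default_factory=dict)
    started_at: float = field(default_factory=time.time)


class ExperimentLog:
    """Append-only log of runs, stored as one JSON object per line."""

    def __init__(self, path):
        self.path = Path(path)

    def record(self, run):
        # Appending keeps earlier runs intact, so history is never overwritten.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(run)) + "\n")

    def runs(self):
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def best(self, metric, higher_is_better=True):
        # Only runs that actually reported the metric are comparable.
        candidates = [r for r in self.runs() if metric in r["metrics"]]
        if not candidates:
            return None
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r["metrics"][metric])
```

Because every record carries the dataset, algorithm, and parameters, the configuration of the best run can be read back directly from `best("accuracy")["params"]` rather than reconstructed from memory.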
2. Debugging and training models
When you train a model in an interactive environment such as a Jupyter notebook, debugging the training job is straightforward.
You run the code manually, and if training fails, the notebook displays the exception and stack trace.
If training succeeds, you can plot learning curves and other metrics
and use them to diagnose problems such as overfitting or vanishing gradients.
But when models are trained on a fixed schedule by automated batch jobs, debugging becomes much harder.
A scheduler can rerun a failed training job, but unless you write a custom solution, it cannot easily check for overfitting or vanishing gradients.
Data science teams aim to deploy more and more models, and as the number of models going through this process grows, the problem only gets worse.
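One possible shape for such a custom check is sketched below. It assumes the batch job logs per-epoch training loss, validation loss, and gradient norms; the function name `diagnose`, the `patience` window, and the `grad_floor` threshold are all illustrative choices, and the rules are simple heuristics rather than a definitive test.

```python
def diagnose(train_loss, val_loss, grad_norms, patience=3, grad_floor=1e-7):
    """Return warnings derived from per-epoch metrics of one training job."""
    warnings = []

    # Overfitting heuristic: over the last `patience` epoch-to-epoch steps,
    # validation loss rose every time while training loss kept falling.
    if len(val_loss) > patience and len(train_loss) > patience:
        recent_val = val_loss[-(patience + 1):]
        recent_train = train_loss[-(patience + 1):]
        val_rising = all(b > a for a, b in zip(recent_val, recent_val[1:]))
        train_falling = all(b < a for a, b in zip(recent_train, recent_train[1:]))
        if val_rising and train_falling:
            warnings.append("possible overfitting")

    # Vanishing-gradient heuristic: the largest recent gradient norm is
    # below a small floor, so the weights are barely being updated.
    if grad_norms and max(grad_norms[-patience:]) < grad_floor:
        warnings.append("possible vanishing gradients")

    return warnings
```

A scheduled job could call this at the end of training and raise an alert when the returned list is non-empty, instead of relying on someone to look at the curves.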
3. Deploying the model to production
A machine learning model only starts adding value to the company once its users can actually use it.
The process of taking a trained ML model and making its predictions available to users or other systems is called "deployment".
Deployment is completely different from conventional machine learning tasks such as feature engineering, model selection, or model evaluation.
As a result, data scientists and ML engineers without a software engineering or DevOps background may not be familiar with it.
When deciding how to deploy a machine learning model, there are several factors to consider:
- How often predictions should be generated.
- Whether predictions should be generated for a single instance or for a batch of instances at a time.
- How many applications will access the model.
- The latency requirements of those applications.
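To make the single-instance versus batch question and the latency question concrete, here is a minimal sketch of a wrapper that serves both request styles and records how long each call took. `toy_model` is a stand-in for a trained model, and `ModelService` and `latency_budget_ms` are hypothetical names for this example, not part of any particular serving framework.

```python
import time


def toy_model(batch):
    """Stand-in for a trained model: scores each row by its mean."""
    return [sum(row) / len(row) for row in batch]


class ModelService:
    """Serves one model to single-instance and batch callers."""

    def __init__(self, model, latency_budget_ms):
        self.model = model
        self.latency_budget_ms = latency_budget_ms
        self.last_latency_ms = None
        self.budget_exceeded = False

    def predict_batch(self, rows):
        # Time the model call so latency can be compared with the budget.
        start = time.perf_counter()
        predictions = self.model(list(rows))
        self.last_latency_ms = (time.perf_counter() - start) * 1000.0
        self.budget_exceeded = self.last_latency_ms > self.latency_budget_ms
        return predictions

    def predict_one(self, row):
        # A single instance is handled as a batch of size one.
        return self.predict_batch([row])[0]
```

For example, `ModelService(toy_model, latency_budget_ms=50.0).predict_one([1.0, 3.0])` returns `2.0`. Routing single requests through the batch path keeps one code path to test, at the cost of some per-request overhead that matters more when latency budgets are tight.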
4. Scaling machine learning services
Once models have been deployed to endpoints, they can begin to provide value to users.
But those endpoints may soon face heavier workloads.
For example, if the company starts serving more users, the increased demand may degrade the quality of its machine learning services.
ML models hosted as API endpoints need to respond to such changes in demand.
When requests increase, the number of compute instances serving the model should increase; when the workload decreases, instances should be removed so that you do not pay for unused capacity.
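The scaling rule described above can be sketched as a small function that converts the current request rate into a target instance count. The function name, the per-instance capacity figure, and the minimum and maximum bounds are assumptions for illustration; managed platforms expose their own autoscaling policies.

```python
import math


def desired_instances(requests_per_second, capacity_per_instance,
                      min_instances=1, max_instances=20):
    """Target number of instances for the current load.

    capacity_per_instance is the request rate one instance can sustain
    while still meeting its latency target (an assumed, measured value).
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed = math.ceil(requests_per_second / capacity_per_instance)
    # Clamp so the service never scales to zero or beyond the cost ceiling.
    return max(min_instances, min(max_instances, needed))
```

With a capacity of 100 requests per second per instance, a load of 450 requests per second gives a target of 5 instances, and the target falls back to the minimum when traffic stops.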
5. Model monitoring
This is where the real work begins.
Once a model has been deployed, it is just as important to monitor its predictions and any anomalies.
The model must be monitored continuously at this stage so that deviations in model quality, such as data drift, are detected and corrected.
Detecting these deviations early lets you take corrective action, such as retraining the model, auditing upstream systems, or fixing data quality problems, ideally without having to monitor models manually or build additional tooling.
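As one hedged example of a drift check, the sketch below computes the Population Stability Index (PSI) between a reference sample of a feature (for instance, from training data) and a recent production sample. PSI is a technique chosen here for illustration and is not named in the article; the function name `psi`, the bin count, and the `eps` smoothing value are assumptions.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index of `current` relative to `reference`."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the reference range fall into the edge bins.
            idx = int((v - lo) / width)
            counts[max(0, min(bins - 1, idx))] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

A PSI of zero means the binned distributions match. A commonly quoted rule of thumb treats values above roughly 0.2 as a shift worth investigating, though a suitable threshold depends on the feature and should be validated on your own data.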
In closing:
Sometimes the best choice is to avoid building your own machine learning infrastructure.
Take advantage of the wide range of open source tools and platforms, and focus on building models that provide differentiated value.
