Introduction to distributed learning and federated learning
2022-07-23 06:17:00 【deephub】
In this article, we discuss the main principles of distributed learning and federated learning and how they work. We start with a simple single-machine example, then extend it to distributed stochastic gradient descent (D-SGD), and finally to federated learning (FL).

Centralized learning (single machine)
Consider a simple example: we want to learn the linear relationship between height and weight. We have 100 height-and-weight data points and want to train a linear model that predicts a person's weight from their height, with parameters w = [a, b], as follows:
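In symbols, the model implied by the description above can be written as follows (x denotes height, y denotes weight; the notation is ours, not the article's figure):

```latex
% Linear model with parameters w = [a, b]: predict weight from height
\hat{y} = b \cdot x + a
```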

How do we find w? We use gradient descent (GD): starting from a random w, we repeatedly move in the direction opposite to the gradient of the error, minimizing the model's error over the 100 data points.
For example, set a = 0 and b = 2 and evaluate our model at a data point, as shown below:

The equation above clearly does not hold, because 2 * 1.70 + 0 is not equal to 72. Our goal is to find an a and b that make this equation hold, so we need to compute the model's error at all 100 data points:
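Writing the prediction error at data point i as e_i (with x_i the height and y_i the weight), and plugging in the example point above, we get the following sketch:

```latex
% Error at one data point, e.g. the person who is 1.70 m tall and weighs 72 kg
e_i = y_i - (b \cdot x_i + a), \qquad e = 72 - (2 \cdot 1.70 + 0) = 68.6
```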

The goal is to find a model whose error at every data point is zero, and we treat negative errors as just as bad as positive ones. The total error is therefore defined as the average of the squared errors over all data points, as shown below:
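With the per-point error e_i defined above, the loss over the 100 data points reads:

```latex
% Mean squared error over all 100 data points
F(a, b) = \frac{1}{100} \sum_{i=1}^{100} e_i^2
        = \frac{1}{100} \sum_{i=1}^{100} \left( y_i - (b \cdot x_i + a) \right)^2
```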

The key point about this total error, or loss function, is that it is an average over all data points: the loss is computed by averaging the per-point errors, so every data point contributes equally to it.
To find the optimal values of a and b, we need to compute the gradient of the loss at the current value of b (and likewise for a) and update the value as follows:
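A sketch of the update rule for b (the same form applies to a), where lambda is the learning rate explained just below:

```latex
% Gradient descent update for b
b_{\text{new}} = b_{\text{old}} - \lambda \, \frac{\partial F}{\partial b}\bigg|_{b = b_{\text{old}}}
```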

Here lambda is the learning rate; the picture below illustrates the descent.

To compute the gradient of F, we first need to write F out in full:

Now we are ready to compute the gradient of F with respect to b:
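Differentiating F term by term gives the following (a sketch consistent with the loss written above):

```latex
% Gradient of the loss with respect to b: an average of per-point gradients
\frac{\partial F}{\partial b}
  = \frac{1}{100} \sum_{i=1}^{100} \frac{\partial e_i^2}{\partial b}
  = \frac{1}{100} \sum_{i=1}^{100} -2\, x_i \left( y_i - (b \cdot x_i + a) \right)
```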

The gradient is the average of the per-data-point error gradients! Using the symbols defined above, we can complete the gradient descent update rule as follows:

The true gradient of the loss function is computed by averaging the error gradients over all data points, and the new b then replaces the previous b, until the total error is small enough. This is an iterative process: repeating it many times finds the best values of a and b.
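To make the procedure concrete, here is a minimal Python sketch of batch gradient descent for this height-to-weight model. The synthetic data, learning rate, and iteration count are illustrative assumptions, not values from the article:

```python
import numpy as np

# Toy data: 100 (height in m, weight in kg) pairs -- synthetic, for illustration only
rng = np.random.default_rng(0)
heights = rng.uniform(1.5, 1.9, size=100)
weights = 45.0 * heights - 5.0 + rng.normal(0.0, 2.0, size=100)

a, b = 0.0, 2.0   # initial model w = [a, b]
lr = 0.1          # learning rate (lambda)

for _ in range(5000):
    preds = b * heights + a                 # model predictions
    errors = weights - preds                # per-point errors
    # Gradients of the mean squared error, averaged over all data points
    grad_b = np.mean(-2.0 * heights * errors)
    grad_a = np.mean(-2.0 * errors)
    # Gradient descent update
    b -= lr * grad_b
    a -= lr * grad_a

loss = np.mean((weights - (b * heights + a)) ** 2)
print(f"learned a={a:.2f}, b={b:.2f}, loss={loss:.3f}")
```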
Stochastic gradient descent (SGD)
We computed the gradient of F by averaging the gradients of all 100 data points. What if we estimate it using only 20 data points?

This is called mini-batch stochastic gradient descent: only a subset of the data is used to compute the gradient.
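In symbols, the mini-batch estimate replaces the average over all 100 points with an average over a randomly chosen subset B of 20 points (a sketch of the estimator described above):

```latex
% Mini-batch gradient estimate over a random subset B with |B| = 20
\frac{\partial F}{\partial b}
  \approx \frac{1}{20} \sum_{i \in B} \frac{\partial e_i^2}{\partial b}
  = \frac{1}{20} \sum_{i \in B} -2\, x_i \left( y_i - (b \cdot x_i + a) \right)
```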
Distributed stochastic gradient descent (D-SGD)
Let's look at the gradient computation from another angle.

If we rewrite the gradient from the formula above and split it into a sum of two parts, each part has its own meaning: the first part is the average gradient of the first 50 data points, and the second part is the average gradient of the last 50 data points.
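In symbols, the split described above looks like this (a sketch using the same notation as before):

```latex
% Full-data gradient as the average of two half-data (per-client) gradients
\frac{\partial F}{\partial b}
  = \frac{1}{100} \sum_{i=1}^{100} \frac{\partial e_i^2}{\partial b}
  = \frac{1}{2}\left[ \frac{1}{50} \sum_{i=1}^{50} \frac{\partial e_i^2}{\partial b}
                    + \frac{1}{50} \sum_{i=51}^{100} \frac{\partial e_i^2}{\partial b} \right]
```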

This means we don't need to put all 100 data points in one place (the same server)! We can split the data into two parts, compute the gradient of each part separately, and then average the two gradients to obtain the gradient over the whole data set. This is the main idea of D-SGD.
Now we have distributed SGD with two clients.

As shown above, in D-SGD both clients start from the same b, and each client uses its own 50 data points to compute a local gradient. Each client then sends its local gradient to the server, which acts as a coordinator. The coordinator averages the two gradients to obtain the gradient over the whole data, i.e., the global gradient. The server returns this global gradient to both clients, and each client uses it to update its b value, i.e., its model. The new value of b is the same for every client: since the global gradient is the same, the newly computed b must be the same. This process is shown in the figure below.

Steps 1 (compute the local gradient) through 4 (download the global gradient) are iterated until a predefined error level is reached. In this example we use only two clients, but the scheme extends to many clients.
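Below is a minimal Python simulation of this two-client D-SGD loop. The synthetic data, learning rate, and number of rounds are illustrative assumptions; the sketch only demonstrates the gradient-averaging protocol of steps 1 to 4:

```python
import numpy as np

# Synthetic height/weight data, split 50/50 across two clients (illustrative only)
rng = np.random.default_rng(0)
heights = rng.uniform(1.5, 1.9, size=100)
weights = 45.0 * heights - 5.0 + rng.normal(0.0, 2.0, size=100)
clients = [(heights[:50], weights[:50]), (heights[50:], weights[50:])]

def local_gradient(x, y, a, b):
    """Step 1: each client computes the gradient on its own 50 points."""
    errors = y - (b * x + a)
    return np.mean(-2.0 * errors), np.mean(-2.0 * x * errors)  # (grad_a, grad_b)

a, b, lr = 0.0, 2.0, 0.1
for _ in range(5000):
    # Step 2: clients send their local gradients to the coordinator
    local_grads = [local_gradient(x, y, a, b) for x, y in clients]
    # Step 3: the coordinator averages them into the global gradient
    grad_a = np.mean([g[0] for g in local_grads])
    grad_b = np.mean([g[1] for g in local_grads])
    # Step 4: every client downloads the global gradient and applies the same update
    a -= lr * grad_a
    b -= lr * grad_b

print(f"global model after D-SGD: a={a:.2f}, b={b:.2f}")
```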
Note that we use the local gradients to obtain the global gradient!
Federated learning (FL)
What happens if each client uses its local gradient to update its own local model, or in our case its own b, as shown below?

In this scenario, each client ends up with a different value of b, as shown in the figure above; we call these the local models.

If we do that, each local model has its own updated parameter b, which means there is no need to send local gradients. Instead, the parameters (or intermediate results) of each local model are sent to the server, which averages them to obtain a global model. This is the main idea of federated learning.

An FL system optimizes a global machine learning (ML) model by repeating the following process:
i) each client device computes locally on its own data to improve the global model w;
ii) it then sends its locally updated model to the FL server for aggregation;
iii) the FL server aggregates the received local models to produce an improved global model;
iv) the server sends the updated global model back to the client devices, which use it for the next round of computation.
This process iterates until the model reaches a predefined accuracy level, as shown in the figure below.
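Here is a minimal FedAvg-style Python simulation of steps i) to iv). The synthetic data, the number of local steps, and the learning rate are illustrative assumptions; the point is that clients exchange model parameters [a, b] rather than gradients:

```python
import numpy as np

# Synthetic height/weight data, split across two clients (illustrative only)
rng = np.random.default_rng(0)
heights = rng.uniform(1.5, 1.9, size=100)
weights = 45.0 * heights - 5.0 + rng.normal(0.0, 2.0, size=100)
clients = [(heights[:50], weights[:50]), (heights[50:], weights[50:])]

def local_update(x, y, a, b, lr=0.1, local_steps=10):
    """Steps i)-ii): start from the global model, run several local GD steps,
    and return the locally updated parameters (not gradients)."""
    for _ in range(local_steps):
        errors = y - (b * x + a)
        a -= lr * np.mean(-2.0 * errors)
        b -= lr * np.mean(-2.0 * x * errors)
    return a, b

a, b = 0.0, 2.0            # initial global model
for _ in range(500):
    # Each client trains locally on its own data
    local_models = [local_update(x, y, a, b) for x, y in clients]
    # Step iii): the server averages the local models into a new global model
    a = np.mean([m[0] for m in local_models])
    b = np.mean([m[1] for m in local_models])
    # Step iv): the new global model (a, b) goes back to the clients for the next round

print(f"global model after FedAvg: a={a:.2f}, b={b:.2f}")
```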

Federated learning vs. distributed SGD
In FL, model weights are exchanged, whereas in D-SGD only gradients are exchanged. In the example we discussed, only one local step of gradient descent is performed before sending the update; in that case FL is equivalent to distributed SGD. If multiple local steps are performed, FL needs to send model weights. The convergence analysis of FL in its general form (multiple local steps) is different from the D-SGD analysis we did, but the principle is the same.
The D-SGD algorithm described in this article (centralized D-SGD) and the FL algorithm (FedAvg) are each just one of many D-SGD and FL algorithms.
Why is federated learning useful?
The main reason we need FL is privacy: we don't want to hand private raw data to any server that trains machine learning models. So we need a machine learning algorithm that can be trained without sending raw data off the client device, and that is exactly what federated learning provides. For example, Google uses FL to improve its keyboard application (Gboard). There are other reasons FL is useful in different applications; for example, FL lets the system use local compute, such as on mobile devices, reducing the load on the server.
The challenges of federated learning
The challenges of FL can be divided into two categories. The first concerns data preparation before the FL process runs: the key problem is that we cannot access the raw data, or even the devices in the FL system. How do we design models or assess the data without accessing the devices?
The second category concerns problems while the FL process is running. The clients participating in an FL system have limited resources, so their ability to send or process ML models is constrained. In the example in this article the only parameter is b, and transmitting the full set of parameters is feasible, but if the model is large, for example BERT, transferring several gigabytes of data between client and server is simply not practical.
Summary
Federated learning is a new topic built on the distributed learning framework that tries to solve the privacy problem of training ML models in real applications. In this article we have only scratched the surface of these systems; if you want to learn more, you can search for related articles yourself or wait for our follow-up posts.
https://avoid.overfit.cn/post/ea6d50f42f904c97b4fa299be0c389b5
Author: Mahdi Beitollahi