Introduction to distributed learning and federated learning
2022-07-23 06:17:00 【deephub】
In this article, we discuss the main principles of distributed learning and federated learning and how they work. We start with a simple single-machine example, then extend it to distributed stochastic gradient descent (D-SGD), and finally to federated learning (FL).

Centralized learning (single machine)
Consider a simple example: we want to learn the linear relationship between height and weight. We have 100 height-and-weight data points and want to train a linear model that predicts a person's weight from their height, with parameters w = [a, b], as follows:
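In symbols, the model implied by the description above can be written as follows (x denotes height, y denotes weight; the notation is ours, not the article's figure):

```latex
% Linear model with parameters w = [a, b]: predict weight from height
\hat{y} = b \cdot x + a
```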

How do we find w? We use gradient descent (GD): starting from a random w, we repeatedly move in the direction opposite to the gradient of the error, minimizing the model's error over the 100 data points.
For example, set a = 0 and b = 2 and evaluate our model at a data point, as shown below:

The equation above clearly does not hold, because 2 * 1.70 + 0 is not equal to 72. Our goal is to find an a and b that make this equation hold, so we need to compute the model's error at all 100 data points:
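Writing the prediction error at data point i as e_i (with x_i the height and y_i the weight), and plugging in the example point above, we get the following sketch:

```latex
% Error at one data point, e.g. the person who is 1.70 m tall and weighs 72 kg
e_i = y_i - (b \cdot x_i + a), \qquad e = 72 - (2 \cdot 1.70 + 0) = 68.6
```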

The goal is to find a model whose error at every data point is zero, and we treat negative errors as just as bad as positive ones. The total error is therefore defined as the average of the squared errors over all data points, as shown below:
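With the per-point error e_i defined above, the loss over the 100 data points reads:

```latex
% Mean squared error over all 100 data points
F(a, b) = \frac{1}{100} \sum_{i=1}^{100} e_i^2
        = \frac{1}{100} \sum_{i=1}^{100} \left( y_i - (b \cdot x_i + a) \right)^2
```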

The key point about this total error, or loss function, is that it is an average over all data points: the loss is computed by averaging the per-point errors, so every data point contributes equally to it.
To find the optimal values of a and b, we need to compute the gradient of the loss at the current value of b (and likewise for a) and update the value as follows:
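A sketch of the update rule for b (the same form applies to a), where lambda is the learning rate explained just below:

```latex
% Gradient descent update for b
b_{\text{new}} = b_{\text{old}} - \lambda \, \frac{\partial F}{\partial b}\bigg|_{b = b_{\text{old}}}
```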

Here lambda is the learning rate; the picture below illustrates the descent.

To compute the gradient of F, we first need to write F out in full:

Now we are ready to compute the gradient of F with respect to b:
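Differentiating F term by term gives the following (a sketch consistent with the loss written above):

```latex
% Gradient of the loss with respect to b: an average of per-point gradients
\frac{\partial F}{\partial b}
  = \frac{1}{100} \sum_{i=1}^{100} \frac{\partial e_i^2}{\partial b}
  = \frac{1}{100} \sum_{i=1}^{100} -2\, x_i \left( y_i - (b \cdot x_i + a) \right)
```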

The gradient is the average of the per-data-point error gradients! Using the symbols defined above, we can complete the gradient descent update rule as follows:

The true gradient of the loss function is computed by averaging the error gradients over all data points, and the new b then replaces the previous b, until the total error is small enough. This is an iterative process: repeating it many times finds the best values of a and b.
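To make the procedure concrete, here is a minimal Python sketch of batch gradient descent for this height-to-weight model. The synthetic data, learning rate, and iteration count are illustrative assumptions, not values from the article:

```python
import numpy as np

# Toy data: 100 (height in m, weight in kg) pairs -- synthetic, for illustration only
rng = np.random.default_rng(0)
heights = rng.uniform(1.5, 1.9, size=100)
weights = 45.0 * heights - 5.0 + rng.normal(0.0, 2.0, size=100)

a, b = 0.0, 2.0   # initial model w = [a, b]
lr = 0.1          # learning rate (lambda)

for _ in range(5000):
    preds = b * heights + a                 # model predictions
    errors = weights - preds                # per-point errors
    # Gradients of the mean squared error, averaged over all data points
    grad_b = np.mean(-2.0 * heights * errors)
    grad_a = np.mean(-2.0 * errors)
    # Gradient descent update
    b -= lr * grad_b
    a -= lr * grad_a

loss = np.mean((weights - (b * heights + a)) ** 2)
print(f"learned a={a:.2f}, b={b:.2f}, loss={loss:.3f}")
```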
Stochastic gradient descent (SGD)
We computed the gradient of F by averaging the gradients of all 100 data points. What if we estimate it using only 20 data points?

This is called mini-batch stochastic gradient descent: only a subset of the data is used to compute the gradient.
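In symbols, the mini-batch estimate replaces the average over all 100 points with an average over a randomly chosen subset B of 20 points (a sketch of the estimator described above):

```latex
% Mini-batch gradient estimate over a random subset B with |B| = 20
\frac{\partial F}{\partial b}
  \approx \frac{1}{20} \sum_{i \in B} \frac{\partial e_i^2}{\partial b}
  = \frac{1}{20} \sum_{i \in B} -2\, x_i \left( y_i - (b \cdot x_i + a) \right)
```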
Distributed stochastic gradient descent (D-SGD)
Let's look at the gradient computation from another angle.

If we rewrite the gradient from the formula above and split it into a sum of two parts, each part has its own meaning: the first part is the average gradient of the first 50 data points, and the second part is the average gradient of the last 50 data points.
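In symbols, the split described above looks like this (a sketch using the same notation as before):

```latex
% Full-data gradient as the average of two half-data (per-client) gradients
\frac{\partial F}{\partial b}
  = \frac{1}{100} \sum_{i=1}^{100} \frac{\partial e_i^2}{\partial b}
  = \frac{1}{2}\left[ \frac{1}{50} \sum_{i=1}^{50} \frac{\partial e_i^2}{\partial b}
                    + \frac{1}{50} \sum_{i=51}^{100} \frac{\partial e_i^2}{\partial b} \right]
```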

This means we don't need to put all 100 data points in one place (the same server)! We can split the data into two parts, compute the gradient of each part separately, and then average the two gradients to obtain the gradient over the whole data set. This is the main idea of D-SGD.
Now we have distributed SGD with two clients.

As shown above, in D-SGD both clients start from the same b, and each client uses its own 50 data points to compute a local gradient. Each client then sends its local gradient to the server, which acts as a coordinator. The coordinator averages the two gradients to obtain the gradient over the whole data, i.e., the global gradient. The server returns this global gradient to both clients, and each client uses it to update its b value, i.e., its model. The new value of b is the same for every client: since the global gradient is the same, the newly computed b must be the same. This process is shown in the figure below.

Steps 1 (compute the local gradient) through 4 (download the global gradient) are iterated until a predefined error level is reached. In this example we use only two clients, but the scheme extends to many clients.
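Below is a minimal Python simulation of this two-client D-SGD loop. The synthetic data, learning rate, and number of rounds are illustrative assumptions; the sketch only demonstrates the gradient-averaging protocol of steps 1 to 4:

```python
import numpy as np

# Synthetic height/weight data, split 50/50 across two clients (illustrative only)
rng = np.random.default_rng(0)
heights = rng.uniform(1.5, 1.9, size=100)
weights = 45.0 * heights - 5.0 + rng.normal(0.0, 2.0, size=100)
clients = [(heights[:50], weights[:50]), (heights[50:], weights[50:])]

def local_gradient(x, y, a, b):
    """Step 1: each client computes the gradient on its own 50 points."""
    errors = y - (b * x + a)
    return np.mean(-2.0 * errors), np.mean(-2.0 * x * errors)  # (grad_a, grad_b)

a, b, lr = 0.0, 2.0, 0.1
for _ in range(5000):
    # Step 2: clients send their local gradients to the coordinator
    local_grads = [local_gradient(x, y, a, b) for x, y in clients]
    # Step 3: the coordinator averages them into the global gradient
    grad_a = np.mean([g[0] for g in local_grads])
    grad_b = np.mean([g[1] for g in local_grads])
    # Step 4: every client downloads the global gradient and applies the same update
    a -= lr * grad_a
    b -= lr * grad_b

print(f"global model after D-SGD: a={a:.2f}, b={b:.2f}")
```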
Note that we use the local gradients to obtain the global gradient!
Federated learning (FL)
What happens if each client uses its local gradient to update its own local model, or in our case its own b, as shown below?

In this scenario, each client ends up with a different value of b, as shown in the figure above; we call these the local models.

If we do that, each local model has its own updated parameter b, which means there is no need to send local gradients. Instead, the parameters (or intermediate results) of each local model are sent to the server, which averages them to obtain a global model. This is the main idea of federated learning.

An FL system optimizes a global machine learning (ML) model by repeating the following process:
i) each client device computes locally on its own data to improve the global model w;
ii) it then sends its locally updated model to the FL server for aggregation;
iii) the FL server aggregates the received local models to produce an improved global model;
iv) the server sends the updated global model back to the client devices, which use it for the next round of computation.
This process iterates until the model reaches a predefined accuracy level, as shown in the figure below.
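Here is a minimal FedAvg-style Python simulation of steps i) to iv). The synthetic data, the number of local steps, and the learning rate are illustrative assumptions; the point is that clients exchange model parameters [a, b] rather than gradients:

```python
import numpy as np

# Synthetic height/weight data, split across two clients (illustrative only)
rng = np.random.default_rng(0)
heights = rng.uniform(1.5, 1.9, size=100)
weights = 45.0 * heights - 5.0 + rng.normal(0.0, 2.0, size=100)
clients = [(heights[:50], weights[:50]), (heights[50:], weights[50:])]

def local_update(x, y, a, b, lr=0.1, local_steps=10):
    """Steps i)-ii): start from the global model, run several local GD steps,
    and return the locally updated parameters (not gradients)."""
    for _ in range(local_steps):
        errors = y - (b * x + a)
        a -= lr * np.mean(-2.0 * errors)
        b -= lr * np.mean(-2.0 * x * errors)
    return a, b

a, b = 0.0, 2.0            # initial global model
for _ in range(500):
    # Each client trains locally on its own data
    local_models = [local_update(x, y, a, b) for x, y in clients]
    # Step iii): the server averages the local models into a new global model
    a = np.mean([m[0] for m in local_models])
    b = np.mean([m[1] for m in local_models])
    # Step iv): the new global model (a, b) goes back to the clients for the next round

print(f"global model after FedAvg: a={a:.2f}, b={b:.2f}")
```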

Federated learning vs. distributed SGD
In FL, model weights are exchanged, whereas in D-SGD only gradients are exchanged. In the example we discussed, only one local step of gradient descent is performed before sending the update; in that case FL is equivalent to distributed SGD. If multiple local steps are performed, FL needs to send model weights. The convergence analysis of FL in its general form (multiple local steps) is different from the D-SGD analysis we did, but the principle is the same.
The D-SGD algorithm described in this article (centralized D-SGD) and the FL algorithm (FedAvg) are each just one of many D-SGD and FL algorithms.
Why is federated learning useful?
The main reason we need FL is privacy: we don't want to hand private raw data to any server that trains machine learning models. So we need a machine learning algorithm that can be trained without sending raw data off the client device, and that is exactly what federated learning provides. For example, Google uses FL to improve its keyboard application (Gboard). There are other reasons FL is useful in different applications; for example, FL lets the system use local compute, such as on mobile devices, reducing the load on the server.
The challenges of federated learning
The challenges of FL can be divided into two categories. The first concerns data preparation before the FL process runs: the key problem is that we cannot access the raw data, or even the devices in the FL system. How do we design models or assess the data without accessing the devices?
The second category concerns problems while the FL process is running. The clients participating in an FL system have limited resources, so their ability to send or process ML models is constrained. In the example in this article the only parameter is b, and transmitting the full set of parameters is feasible, but if the model is large, for example BERT, transferring several gigabytes of data between client and server is simply not practical.
Summary
Federated learning is a new topic built on the distributed learning framework that tries to solve the privacy problem of training ML models in real applications. In this article we have only scratched the surface of these systems; if you want to learn more, you can search for related articles yourself or wait for our follow-up posts.
https://avoid.overfit.cn/post/ea6d50f42f904c97b4fa299be0c389b5
Author: Mahdi Beitollahi