Getting started quickly with federated learning: the practice of PowerFL, Tencent's self-developed federated learning platform

2022-06-26 15:59:00 【Tencent big data official】

Introduction: Over the past 10 years, machine learning has developed rapidly within artificial intelligence, and one of its key fuels is the large volume of data accumulated by human society. However, although the overall volume of data keeps growing quickly, most of it is scattered across different companies or departments, leaving it severely isolated and fragmented. For this reason, organizations have a strong desire to cooperate on data, but privacy and security considerations make it challenging to do so in a compliant way.

The data silos formed in this way seriously hinder the parties from jointly building artificial intelligence models, so a new mechanism is urgently needed. Federated learning emerged in response. With this technology, the exchange of model information between organizations is carefully designed and encrypted throughout, so that user privacy and data security are protected and no organization can infer the private data of any other, while the goal of joint modeling is still achieved.

PowerFL is a federated learning platform developed in-house by Tencent TEG. It has begun to be deployed in business scenarios such as financial cloud and joint advertising modeling, and has achieved preliminary results. PowerFL uses technology to build a bridge between data distributed across different departments and teams, making it possible to use that data jointly while protecting data privacy. This article introduces the overall technical architecture of PowerFL from three angles (platform framework, deployment view, and network topology), and then describes how to deploy PowerFL with one click and how to define a federated task flow and submit federated tasks.

Introduction to the PowerFL platform framework
PowerFL deployment view and key components
PowerFL network topology
Rapid deployment of PowerFL
- Prepare the k8s cluster
- Prepare the YARN cluster (optional)
- Prepare the installation client
- One-click deployment of PowerFL
Submitting a federated task through flow-server
- Federated task orchestration and scheduling flow
Summary

Introduction to the PowerFL platform framework

From the perspective of the platform framework, PowerFL builds the technology and ecosystem of federated learning on the following five levels, from the bottom up:

Computing and data resources: PowerFL supports two mainstream computing resource scheduling engines, YARN and K8S. All service components are deployed as containers on a K8S cluster, which greatly simplifies deployment, reduces O&M cost, and makes fault tolerance and scaling of services straightforward. All computing components are scheduled through a YARN cluster, which ensures stability and fault tolerance of computation while providing parallel acceleration for large-scale machine learning tasks. In addition, PowerFL supports pulling data from multiple data sources, including TDW, Ceph, COS, and HDFS.
Computing framework: On top of the computing and data resources, PowerFL implements a computing framework for federated learning algorithms. Compared with traditional machine learning frameworks, it focuses on the most common difficulties in the practice of federated learning algorithms and applications: 1) Security encryption: PowerFL implements a range of common homomorphic, symmetric, and asymmetric encryption algorithms (including Paillier, RSA asymmetric encryption, and others); 2) Distributed computing: built on the high-performance distributed machine learning framework Spark on Angel, PowerFL makes it easy to implement efficient distributed federated learning algorithms; 3) Cross-network communication: PowerFL provides a set of multi-party cross-network transmission interfaces backed by message queue components, achieving stable, reliable, high-performance cross-network transmission while ensuring data security; 4) TEE/SGX support: in addition to ensuring data security through software, PowerFL supports encrypting and computing on data inside an Enclave via TEE/SGX, using hardware to greatly improve algorithm performance.
Algorithm protocols: Based on the computing framework above, PowerFL implements common federated algorithm protocols for different scenarios: 1) for analysis scenarios, PowerFL supports joint query and two-party / multi-party sample alignment; 2) for modeling scenarios, PowerFL supports federated feature engineering (including feature selection, feature filtering, and feature transformation), federated training (including logistic regression, GBDT, DNN, etc.), and federated prediction.
Product interaction: From the end user's point of view, PowerFL, as a federated learning application product, supports initiating federated tasks through a REST API. It also lets all modeling participants work together in a joint workspace, building and configuring federated task flows by dragging and dropping algorithm components, and managing users, resources, configuration, and tasks.
Application scenarios: With the federated learning infrastructure above in place, PowerFL can address, safely and in compliance, the "data silo" problem caused by data isolation and fragmentation in application scenarios such as financial risk control, advertising recommendation, user profiling, and joint query, enabling AI and big data applications that comply with privacy regulations.
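To make the "security encryption" item more concrete, the following is a toy sketch of the additive homomorphism that Paillier provides: two ciphertexts can be combined so that the result decrypts to the sum of the plaintexts. It is not PowerFL's implementation. The function names and the tiny primes are hypothetical and chosen only for readability; real deployments use keys of thousands of bits and a vetted cryptography library.

```python
import math
import random


def paillier_keygen(p, q):
    """Build a toy Paillier key pair from two small primes (illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1                                          # common simple choice of g
    # mu is the modular inverse of L(g^lam mod n^2), where L(x) = (x - 1) // n.
    # pow(x, -1, n) needs Python 3.8 or newer.
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def encrypt(pub, m):
    """Encrypt an integer m with 0 <= m < n under the public key."""
    n, g = pub
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:        # the random blinding factor must be a unit mod n
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(pub, priv, c):
    """Recover the plaintext from ciphertext c with the private key."""
    n, _ = pub
    lam, mu = priv
    return (((pow(c, lam, n * n) - 1) // n) * mu) % n


def add_encrypted(pub, c1, c2):
    """Multiplying ciphertexts mod n^2 adds the underlying plaintexts mod n."""
    n, _ = pub
    return (c1 * c2) % (n * n)


pub, priv = paillier_keygen(17, 19)   # hypothetical toy primes, n = 323
total = add_encrypted(pub, encrypt(pub, 12), encrypt(pub, 30))
print(decrypt(pub, priv, total))      # the party holding the private key sees only the sum
```

In a federated setting, a property like this lets one party aggregate encrypted intermediate values from another party without seeing the individual values.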

PowerFL deployment view and key components

From the deployment point of view, PowerFL consists of a service layer and a computing layer:

The service layer is built on a K8S cluster. Taking advantage of its resource scheduling capability, elastic scaling mechanism, and stable fault tolerance, PowerFL's resident services are deployed on service nodes as containers. These resident service components include:
- Message middleware: responsible for event-driven communication between all services and computing components, algorithm synchronization between computing components, and asynchronous transfer of encrypted data between them.
- Task flow engine: responsible for scheduling the federated task flow on a single party's side. It launches execution nodes as containers in the order defined in advance by the task flow; computing tasks either run inside the execution node or are submitted from it to the YARN cluster.
- Task panel: collects the key performance indicators of each algorithm component in the task flow, per iteration or for the final model output, such as AUC, accuracy, K-S, and feature importance.
- Multi-party federated scheduling engine: responsible for scheduling and synchronizing federated tasks among multiple parties, and provides a set of APIs for creating federated task flows and for starting, ending, pausing, deleting, and querying the status of tasks.
The computing layer is built on a YARN cluster and makes full use of the Spark big data ecosystem. It is responsible for the distributed computation of PowerFL's algorithm components at runtime. Computing tasks are actually initiated by task nodes in the service layer, which request resources from the YARN cluster to run PowerFL's federated operators. The Spark on Angel computing framework provides high parallelism and good performance for the algorithms.
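The task panel reports indicators such as AUC and K-S. As a rough illustration of what these two numbers measure for a binary classifier, here is a small, deliberately unoptimized sketch. The function name is hypothetical and the code is not taken from PowerFL; it assumes labels are 0 or 1 and that higher scores mean "more likely positive".

```python
def auc_and_ks(labels, scores):
    """Return (AUC, K-S) for 0/1 labels and real-valued scores."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")

    # AUC: probability that a random positive is scored above a random negative
    # (ties count as half), computed here by brute-force pair comparison.
    wins = 0.0
    for yi, si in zip(labels, scores):
        if yi != 1:
            continue
        for yj, sj in zip(labels, scores):
            if yj == 1:
                continue
            if si > sj:
                wins += 1.0
            elif si == sj:
                wins += 0.5
    auc = wins / (pos * neg)

    # K-S: largest gap between true-positive rate and false-positive rate
    # over all candidate thresholds.
    ks = 0.0
    for threshold in sorted(set(scores)):
        tpr = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold) / pos
        fpr = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold) / neg
        ks = max(ks, abs(tpr - fpr))
    return auc, ks


print(auc_and_ks([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```

The quadratic pair comparison is fine for a handful of samples; a production metric would sort once and sweep thresholds instead.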

PowerFL network topology

From the perspective of intranet users, PowerFL exposes service paths to intranet users through k8s ingress:

Visit http://domain/ to access the task panel, where you can see the running status of the current task, key logs produced while the task runs, key performance indicators, and so on;
Visit http://domain/pipelines to access the task flow engine and view the running stage of the task on this side;
Visit http://domain/flow-server to access the REST API of flow-server.

Computing tasks (such as the driver and executors of a Spark job) communicate with service-layer components through the message middleware;

The intermediate encrypted results produced during task execution, and the task status information that needs to be synchronized, are synchronized across networks through each party's own message middleware.

Rapid deployment of PowerFL

With an overall understanding of PowerFL's platform framework, deployment view, and network topology, the following describes how to deploy PowerFL quickly. As mentioned above, PowerFL is divided into service-layer components and a computing layer, built on a k8s cluster and a YARN cluster respectively. Before deploying PowerFL, you need to prepare these two cluster environments (if your computing tasks do not require a distributed environment, you do not need to prepare the YARN cluster).

Before proceeding, obtain the latest version of the installation package and unzip it.

Prepare the k8s cluster

Machine configuration requirements:

Number of machines: 1 or more
Hardware: 16 GB+ memory, 4+ CPUs, 100 GB+ disk
OS and version: CentOS 7.0 recommended
Docker 1.8+

For a test environment, refer to the documentation to install Minikube. The VM driver can be kvm2 (Linux), or hyperkit or VirtualBox (macOS). Create k8s through Minikube:

minikube start --memory=8192 --cpus=4

To install k8s in a production environment, refer to the official k8s documentation on deploying a k8s cluster in production. Options include:

Install with kubeadm.
Install with kops.
Install with KRIB.
Install with Kubespray.
You can also use Tencent Cloud's TKE.

If k8s needs to be installed offline, refer to offline-k8s-deploy in the installation package directory.

Prepare the YARN cluster (optional)

You can install the YARN cluster with Apache Ambari. After installation, prepare the Hadoop configuration files and place them in the hadoop-config directory:

core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml

Then import these configuration files into the k8s cluster. The imported configuration is named hadoop here:

kubectl -n power-fl-[partyId] create configmap hadoop --from-file=./hadoop-config
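Before creating the configmap, it can help to confirm that all four Hadoop configuration files listed above are actually present in hadoop-config. The following is a small, optional pre-flight check; the function name is hypothetical and it is not part of the PowerFL installation package.

```python
from pathlib import Path

# File names taken from the list of Hadoop configuration files above.
REQUIRED_HADOOP_FILES = (
    "core-site.xml",
    "hdfs-site.xml",
    "mapred-site.xml",
    "yarn-site.xml",
)


def missing_hadoop_files(config_dir):
    """Return the required Hadoop config files that are absent from config_dir."""
    directory = Path(config_dir)
    return [name for name in REQUIRED_HADOOP_FILES if not (directory / name).is_file()]


if __name__ == "__main__":
    missing = missing_hadoop_files("./hadoop-config")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All Hadoop configuration files are in place.")
```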

Prepare the installation client

Install jq and envsubst

# On Ubuntu
sudo apt-get install jq envsubst
# On CentOS
sudo yum install jq envsubst

Install helm 3.0+. Refer to the official Helm installation documentation; in short, you can proceed as follows:
- On macOS, install with Homebrew:
```
brew install helm
```
- On Windows, install with Chocolatey:
```
choco install kubernetes-helm
```
- Install from the command line with the installer script:
```
curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash
```

One-click deployment of PowerFL

The following operations are performed on the client machine, with the root of FL-deploy as the working directory:

cd FL-deploy

First installation

Prepare the kubectl configuration file:
```
mkdir kube
cp ~/.kube/config ./kube
```

Copy the environment configuration template file and set the corresponding environment variables:

cp _powerfl_env.sh ./powerfl_env.sh
vim ./powerfl_env.sh

#!/bin/bash
# Set this party's id; the ids of all participating parties must be unique
export PARTY_ID=10000
# Set the domain name through which this party accesses PowerFL services
export DOMAIN=powerfl-10000.com
# Intranet address of MQ, used for access from the internal Hadoop cluster
export INTERNAL_MQ_HOST=xx.xx.xx.xx
# Intranet TCP port of MQ
export INTERNAL_MQ_TCP_PORT="xxxx"
# HTTP port of MQ-proxy exposed to the Internet (with the default k8s configuration, the port range is 30000-32767)
export EXPOSE_MQ_HTTP_PORT="xxxx"
# TCP port of MQ exposed to the Internet (with the default k8s configuration, the port range is 30000-32767)
export EXPOSE_MQ_TCP_PORT="xxxx"
# Components that do not need to be installed, separated by spaces
export DISABLED_COMPONENTS=""
# Ids of the other parties, separated by spaces
export OTHER_PARTIES="20000"
# MQ configuration of the other parties. For example, with OTHER_PARTIES="20000 30000"
# you need to configure separately
#     PARTY_MQ_HTTP_URL_20000 and PARTY_MQ_PROXY_URL_20000
# and PARTY_MQ_HTTP_URL_30000 and PARTY_MQ_PROXY_URL_30000
export PARTY_MQ_HTTP_URL_20000=yy.yy.yy.yy:yyyy
export PARTY_MQ_PROXY_URL_20000=yy.yy.yy.yy:yyyy
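Several constraints in the environment file above are easy to get wrong by hand: party ids must be unique, the exposed ports must fall in 30000-32767 under the default k8s configuration, and every id in OTHER_PARTIES needs its own pair of MQ URLs. The sketch below checks those rules on a plain dictionary of variables. It is a hypothetical helper, not part of deploy.sh, and it assumes the default port range applies to your cluster.

```python
def check_powerfl_env(env):
    """Return a list of problems found in a powerfl_env.sh-style mapping of variables."""
    problems = []
    party_id = env.get("PARTY_ID", "")
    others = env.get("OTHER_PARTIES", "").split()

    if not party_id:
        problems.append("PARTY_ID is not set")
    if party_id in others:
        problems.append(f"PARTY_ID {party_id} also appears in OTHER_PARTIES")
    if len(set(others)) != len(others):
        problems.append("OTHER_PARTIES contains duplicate ids")

    # Port range stated in the comments of the environment file (default k8s configuration).
    for key in ("EXPOSE_MQ_HTTP_PORT", "EXPOSE_MQ_TCP_PORT"):
        value = env.get(key, "")
        if not value.isdigit() or not 30000 <= int(value) <= 32767:
            problems.append(f"{key}={value!r} is outside 30000-32767")

    # Each other party needs both an HTTP URL and a proxy URL.
    for other in others:
        for prefix in ("PARTY_MQ_HTTP_URL_", "PARTY_MQ_PROXY_URL_"):
            if not env.get(prefix + other):
                problems.append(f"{prefix}{other} is missing")
    return problems


example = {
    "PARTY_ID": "10000",
    "OTHER_PARTIES": "20000 30000",
    "EXPOSE_MQ_HTTP_PORT": "30001",
    "EXPOSE_MQ_TCP_PORT": "8080",            # deliberately out of range
    "PARTY_MQ_HTTP_URL_20000": "10.0.0.2:30001",
    "PARTY_MQ_PROXY_URL_20000": "10.0.0.2:30002",
}
for problem in check_powerfl_env(example):
    print(problem)
```

To check a real file, the same function could be fed `os.environ` after sourcing powerfl_env.sh in the calling shell.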

Run the installation script:
```
./deploy.sh setup
```

Installing multiple parties at the same time

If you need to install multiple parties on the same k8s cluster, you can make several copies of _powerfl_env.sh as the environment configuration files of the different parties, and specify the file when running the deployment script (if none is specified, powerfl_env.sh in the current directory is used as the environment configuration file by default, as shown above):

cp _powerfl_env.sh ./powerfl-10000.sh
vim ./powerfl-10000.sh   # modify PARTY_ID and other settings
cp _powerfl_env.sh ./powerfl-20000.sh
vim ./powerfl-20000.sh   # modify PARTY_ID and other settings
# Deploy party 10000, specifying the configuration file powerfl-10000.sh
./deploy.sh setup ./powerfl-10000.sh
# Deploy party 20000, specifying the configuration file powerfl-20000.sh
./deploy.sh setup ./powerfl-20000.sh

Updating the system

If you need to modify system configuration, edit the corresponding environment configuration file and component configuration files, then run:

./deploy.sh upgrade
# If a configuration file is specified
./deploy.sh upgrade ./powerfl-10000.sh

Uninstalling the system

Note: this operation deletes all PowerFL data and cannot be undone:

./deploy.sh cleanup
# If a configuration file is specified
./deploy.sh cleanup ./powerfl-10000.sh

Submitting a federated task through flow-server

After installing PowerFL, you can write a task flow and a task parameter configuration file in the specified DSL and submit a federated task to flow-server. Before introducing the specific usage, let us first look at PowerFL's federated task orchestration and scheduling flow.

Federated task orchestration and scheduling flow

1) Write a task flow file pipeline.yaml and import the pipeline into flow-server:

curl --request POST 'http://{domain}/flow-server/pipelines' --form '[email protected]'

If the import succeeds, the id of the pipeline is returned. For the pipeline just imported, write the task configuration parameter file job.yaml (its DSL is described later) and submit the task to flow-server:

curl --request POST 'http://{domain}/flow-server/pipelines/{pipeline_id}/jobs' --form '[email protected]_parameters.yaml'

2) After receiving the submitted task request, flow-server generates a random id for the task, injects the global configuration to build the DSL that the task flow engine uses for scheduling, and submits the task flow to the engine.

3) Based on this DSL file, the task flow engine requests resources from the K8S cluster in the node order defined by the task flow and launches the runtime container of the corresponding node;

4 and 5) After the runtime container starts, it uses the injected environment variables to request resources from the YARN cluster, starts the driver and executors, launches the specific algorithm flow, and performs the parallel computing tasks. At this point, the algorithm task on this side has been started and completed.

6) Meanwhile, after receiving the submitted task request, flow-server sends the task's configuration file information to the local message middleware;

7) The local message middleware synchronizes across networks, sending the task configuration file to the message middleware of the other parties;

8) The flow-server of each other party listens on the task submission topic of its message middleware and receives the request to start a new federated task;

9) The subsequent process is the same as 3), 4), and 5).
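The steps above can be pictured with a small in-memory simulation: the initiating flow-server assigns a job id, starts its local task flow, and publishes the job configuration to its message middleware, which forwards it to the other party, whose flow-server then starts its own flow. All class and method names are hypothetical stand-ins, and the sketch assumes the same job id is shared by all parties; real components talk over the network and run containers and YARN jobs instead of appending to a list.

```python
import uuid


class MessageMiddleware:
    """In-memory stand-in for one party's message middleware."""

    def __init__(self):
        self.peers = []        # middleware instances of the other parties
        self.listeners = []    # callbacks subscribed to the task submission topic

    def publish(self, job):
        # Step 7: synchronize the job configuration to every other party.
        for peer in self.peers:
            peer.deliver(job)

    def deliver(self, job):
        # Step 8: the remote flow-server listens on the submission topic.
        for callback in self.listeners:
            callback(job)


class FlowServer:
    """In-memory stand-in for one party's flow-server."""

    def __init__(self, party_id, middleware):
        self.party_id = party_id
        self.middleware = middleware
        self.started = []      # job ids whose local task flow was started
        middleware.listeners.append(self.start_local_flow)

    def submit(self, job_config):
        # Step 2: assign a random id to the task.
        job = dict(job_config, job_id=uuid.uuid4().hex)
        self.start_local_flow(job)
        # Step 6: hand the configuration to the local message middleware.
        self.middleware.publish(job)
        return job["job_id"]

    def start_local_flow(self, job):
        # Steps 3 to 5 would launch runtime containers and YARN jobs here.
        self.started.append(job["job_id"])


mq_guest, mq_host = MessageMiddleware(), MessageMiddleware()
mq_guest.peers.append(mq_host)
mq_host.peers.append(mq_guest)
guest = FlowServer("10000", mq_guest)
host = FlowServer("20000", mq_host)

job_id = guest.submit({"parties": ["10000=guest", "20000=host"]})
print(job_id in guest.started, job_id in host.started)
```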

The format of the task configuration parameter file is as follows:

parties: [ "10000=guest", "20000=host" ]
common-args:
  spark-master-name: local[*]
  runtime-image: power_fl/runtime:develop
parties-args:
  10000:
    hadoop-config: hadoop
    hadoop-user-name: root
    hdfs-libs-path: hdfs:///fl-runtime-libs
    spark-submit-args: ""
    input: /opt/spark-app/fl-runtime/data/a9a.guest.head
    output: /tmp/a9a.guest.output
  20000:
    hadoop-config: hadoop
    hadoop-user-name: root
    hdfs-libs-path: hdfs:///fl-runtime-libs
    spark-submit-args: ""
    input: /opt/spark-app/fl-runtime/data/a9a.host.head
    output: /data/a9a.host.output

The file above consists of three main parts:

parties: specifies the participants of the federated task as an array; each entry has the form partyId=role and gives a participant's id and its role in this task.
common-args: specifies the parameters shared by all parties. The configurable parameters must be consistent with those specified in spec.arguments.parameters in pipeline.yaml.
parties-args: specifies the parameters of each party, including Hadoop configuration information, task algorithm parameter configuration, and so on.
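As a quick consistency check on such a file, the sketch below splits each partyId=role entry and reports parties that have no section under parties-args. It works on an already parsed dictionary (for example, the result of loading the YAML with a parser of your choice); the function names are hypothetical and this is not how flow-server itself validates jobs.

```python
def parse_parties(parties):
    """Turn ["10000=guest", "20000=host"] into {"10000": "guest", "20000": "host"}."""
    roles = {}
    for entry in parties:
        party_id, sep, role = entry.partition("=")
        if not sep or not party_id or not role:
            raise ValueError(f"expected partyId=role, got {entry!r}")
        if party_id in roles:
            raise ValueError(f"party {party_id} is listed more than once")
        roles[party_id] = role
    return roles


def missing_party_args(job):
    """Return the ids listed in parties that have no entry under parties-args."""
    roles = parse_parties(job["parties"])
    # A YAML parser typically reads the keys 10000 and 20000 as integers, so normalize to str.
    configured = {str(key) for key in job.get("parties-args", {})}
    return sorted(set(roles) - configured)


job = {
    "parties": ["10000=guest", "20000=host"],
    "parties-args": {10000: {"input": "/opt/spark-app/fl-runtime/data/a9a.guest.head"}},
}
print(parse_parties(job["parties"]))
print(missing_party_args(job))
```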

Summary

PowerFL builds up the technology and ecosystem of federated learning from the bottom up across five levels: computing and data resources, computing framework, algorithm protocols, product interaction, and application scenarios. The whole system is built on a k8s cluster in a cloud-native way, makes full use of the big data ecosystem of the YARN cluster, and relies on Spark on Angel to provide high-performance distributed computing for federated tasks. This article first introduced the overall architecture of PowerFL, including the technology stack, key components, and network topology, and then described how to deploy PowerFL with one click and how to define a federated task flow and submit federated tasks. We hope this article helps you get started quickly, learn more about this privacy-preserving approach to machine learning modeling, and apply it in more fields such as e-commerce, finance, healthcare, education, and urban computing.