Quickly get started with federated learning: the practice of Tencent's self-developed federated learning platform PowerFL

2022-06-26 15:59:00 Tencent big data official

Introduction: Over the past 10 years, machine learning has developed rapidly within artificial intelligence, and one of its key fuels is the large amount of data accumulated by human society. However, although the overall volume of data is growing quickly, most of it is scattered across different companies or departments, leaving data severely isolated and fragmented. For that reason, organizations have a strong desire to cooperate on data, but data privacy and security considerations make compliant data cooperation very challenging.

The data silos formed for these reasons seriously hinder the parties from cooperating on data to build artificial intelligence models, so a new mechanism is urgently needed. Federated learning emerged in response. With this technology, the exchange of model information between organizations is carefully designed and encrypted while user privacy and data security are protected, so that no organization can infer the private data of any other organization, yet joint modeling is still achieved.

PowerFL is a federated learning platform developed in-house by Tencent TEG. It has already begun to land in business scenarios such as financial cloud and joint advertising modeling, and has achieved preliminary results. Technically, PowerFL builds a bridge between data distributed across different departments and teams, making it possible to combine data while protecting data privacy. This article first introduces the overall technical architecture of PowerFL from the perspectives of platform framework, deployment view and network topology, and then describes how to deploy PowerFL with one click and how to define a federated task flow to submit federated tasks.

  • Introduction to the PowerFL platform framework
  • PowerFL deployment view and key components
  • PowerFL network topology
  • Rapid deployment of PowerFL
    • Prepare the k8s cluster
    • Prepare the YARN cluster (optional)
    • Prepare the installation client
    • One-click deployment of PowerFL
  • Submit a federated task through flow-server
    • Federated task orchestration and scheduling flow
  • Summary

Introduction to the PowerFL platform framework

From the perspective of the platform framework, PowerFL builds the technology and ecosystem of federated learning on the following five levels, from bottom to top:

  1. Computing and data resources: PowerFL supports two mainstream computing resource scheduling engines, YARN and K8S. All service components are deployed as containers on the K8S cluster, which greatly simplifies deployment, reduces O&M costs, and makes fault tolerance and scaling of services easy. All computing components are scheduled through the YARN cluster, which ensures the stability and fault tolerance of computation while enabling parallel acceleration of large-scale machine learning tasks. In addition, PowerFL supports pulling data from multiple data sources, including TDW, Ceph, COS and HDFS.
  2. Computing framework: On top of the computing and data resources, PowerFL implements a computing framework for federated learning algorithms. Compared with traditional machine learning frameworks, it focuses on the most common difficulties in the practice of federated learning algorithms and applications: 1) Security and encryption: PowerFL implements a variety of common homomorphic, symmetric and asymmetric encryption algorithms (including Paillier and RSA); 2) Distributed computing: based on the high-performance distributed machine learning framework Spark on Angel, PowerFL makes it easy to implement efficient distributed federated learning algorithms; 3) Cross-network communication: PowerFL provides a set of multi-party cross-network transmission interfaces, backed by message queue components, which achieve stable, reliable and high-performance cross-network transmission while ensuring data security; 4) TEE/SGX support: in addition to protecting data in software, PowerFL also supports encrypting and computing on data inside an enclave through TEE/SGX, which greatly improves algorithm performance by means of hardware.
  3. Algorithm protocols: Based on the computing framework above, PowerFL implements common federated algorithm protocols for different scenarios: 1) for analysis scenarios, PowerFL supports federated query and two-party/multi-party sample alignment; 2) for modeling scenarios, PowerFL supports federated feature engineering (including feature selection, feature filtering and feature transformation), federated training (including logistic regression, GBDT, DNN, etc.) and federated prediction.
  4. Product interaction: From the end user's point of view, PowerFL, as a federated learning application product, supports submitting federated tasks through a REST API. It also lets all modeling parties work together in a joint workspace, building and configuring federated task flows by dragging and dropping algorithm components, and managing users, resources, configuration and tasks.
  5. Application scenarios: With the federated learning infrastructure above in place, PowerFL can address, under the premise of security and compliance, the "data silo" problem caused by data isolation and fragmentation in application scenarios such as financial risk control, advertising recommendation, audience profiling and federated query, truly enabling AI and big data applications that comply with privacy regulations.

PowerFL deployment view and key components

From the deployment point of view, PowerFL consists of a service layer and a computing layer:

  • The service layer is built on top of the K8S cluster. Taking advantage of its resource scheduling capability, its scale-out and scale-in mechanisms and its stable fault tolerance, PowerFL's resident services are deployed on the service nodes as containers. These resident service components include:
    • Message middleware: responsible for event driving between all services and computing components, and for algorithm synchronization and asynchronous communication of encrypted data between computing components.
    • Task flow engine: responsible for scheduling the federated task flow on one party's side. It launches execution nodes as containers in the order defined in advance by the task flow. A computing task either runs inside the execution node, or the execution node submits it to the YARN cluster.
    • Task panel: collects the key performance indicators of each iteration, or of the final model output, of every algorithm component in the task flow, such as AUC, accuracy, K-S and feature importance.
    • Multi-party federated scheduling engine: responsible for scheduling and synchronizing federated tasks among multiple parties. It provides a set of APIs for creating federated task flows and for starting, ending, pausing, deleting and querying the status of tasks.
  • The computing layer is built on top of the YARN cluster and makes full use of the Spark big data ecosystem. It is responsible for the distributed computation of PowerFL's algorithm components at runtime. Computing tasks are actually initiated by the task nodes of the service layer, which request resources from the YARN cluster to run PowerFL's federated operators. The Spark on Angel computing framework provides high parallelism and good performance for the algorithms.

PowerFL network topology

From the perspective of intranet users, PowerFL exposes service paths through the k8s ingress:

  • Visiting http://domain/ opens the task panel, which shows the running status of the current task, key logs produced while the task runs, and key performance indicators;
  • Visiting http://domain/pipelines opens the task flow engine, where you can view the running stages of the task on this side;
  • Visiting http://domain/flow-server gives access to the REST API of flow-server.

Computing tasks (such as the Spark driver and executors) communicate with the service-layer components through the message middleware.

Intermediate encrypted results produced during task execution, and task status information that needs to be synchronized, are then synchronized across the network through each party's message middleware.

Rapid deployment of PowerFL

Now that you have an overall understanding of PowerFL's platform framework, deployment view and network topology, the following describes how to deploy PowerFL quickly. As mentioned above, PowerFL is divided into service-layer components and a computing layer, built on a k8s cluster and a YARN cluster respectively. Before deploying PowerFL, you need to prepare these two cluster environments (if your computing tasks do not require a distributed environment, you do not need to prepare the YARN cluster).

Before performing the following steps, obtain the latest version of the installation package and unzip it.

Prepare the k8s cluster

Machine configuration requirements:

  • Number of machines: 1+
  • Hardware: 16 GB+ memory, 4+ CPU cores, 100 GB+ disk
  • OS and version: CentOS 7.0 recommended
  • Docker 1.8+

For a test environment, refer to the documentation to install Minikube. The VM driver can be kvm2 (Linux), or hyperkit or VirtualBox (macOS). Then create the k8s cluster with Minikube:

minikube start --memory=8192 --cpus=4

To install k8s in a production environment, refer to the official k8s documentation on deploying a production cluster. Options include:

  • Install with kubeadm.
  • Install with kops.
  • Install with KRIB.
  • Install with Kubespray.
  • Use Tencent Cloud's TKE.

If you need to install k8s offline, follow offline-k8s-deploy under the installation package directory.

Prepare the YARN cluster (optional)

You can install the YARN cluster with Apache Ambari. After installation, collect the Hadoop configuration files and put them in the hadoop-config directory:

core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml

Then import these configuration files into the k8s cluster. The configuration imported here is named hadoop:

kubectl -n power-fl-[partyId] create configmap hadoop --from-file=./hadoop-config
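Before creating the configmap, it can help to confirm that the directory really contains the four files listed above. The following is a minimal sketch, not part of the PowerFL installation package; the function name check_hadoop_config is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PowerFL): verify that a directory
# contains the four Hadoop configuration files listed above.
check_hadoop_config() {
  local dir="$1" missing=0 f
  for f in core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  # Exit status 0 only when all four files were found.
  return "$missing"
}
```

One possible use is to gate the import on the check, for example running check_hadoop_config ./hadoop-config and only then the kubectl command above.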

Prepare the installation client

  1. Install jq and envsubst (envsubst is shipped in the gettext packages)
    # On Ubuntu
    sudo apt-get install jq gettext-base
    # On CentOS
    sudo yum install jq gettext
  2. Install Helm 3.0+. See the official Helm installation documentation; in short:
    • On macOS, install with Homebrew:
      brew install helm
    • On Windows, install with Chocolatey:
      choco install kubernetes-helm
    • Or install with the script from the command line:
      curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash

One-click deployment of PowerFL

The following operations are performed on the client machine, from the root directory of FL-deploy:

cd FL-deploy

First-time installation

  1. Prepare the kubectl configuration file:

    mkdir kube
    cp ~/.kube/config ./kube
  2. Copy the environment configuration template file and set the corresponding environment variables:

    cp _powerfl_env.sh ./powerfl_env.sh
    vim ./powerfl_env.sh

    #!/bin/bash
    # Set this party's id; the ids of all participating parties must be unique
    export PARTY_ID=10000
    # Set the domain name through which this party accesses PowerFL services
    export DOMAIN=powerfl-10000.com
    # Intranet address of the MQ, used by the internal Hadoop cluster
    export INTERNAL_MQ_HOST=xx.xx.xx.xx
    # Intranet TCP port of the MQ
    export INTERNAL_MQ_TCP_PORT="xxxx"
    # HTTP port of the MQ-proxy exposed to the public network (with the default k8s configuration, the port range is 30000-32767)
    export EXPOSE_MQ_HTTP_PORT="xxxx"
    # TCP port of the MQ exposed to the public network (with the default k8s configuration, the port range is 30000-32767)
    export EXPOSE_MQ_TCP_PORT="xxxx"
    # Components that do not need to be installed, separated by spaces
    export DISABLED_COMPONENTS=""
    # ids of the other parties, separated by spaces
    export OTHER_PARTIES="20000"
    # MQ configuration of the other parties. For example, with OTHER_PARTIES="20000 30000"
    # you need to configure separately
    #   PARTY_MQ_HTTP_URL_20000 and PARTY_MQ_PROXY_URL_20000
    #   and PARTY_MQ_HTTP_URL_30000 and PARTY_MQ_PROXY_URL_30000
    export PARTY_MQ_HTTP_URL_20000=yy.yy.yy.yy:yyyy
    export PARTY_MQ_PROXY_URL_20000=yy.yy.yy.yy:yyyy
  3. Run the installation script:

    ./deploy.sh setup
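The comments in the environment file state that, with the default k8s configuration, the exposed ports must fall in the range 30000-32767. A small pre-flight check can catch a wrong value before the installation script runs. This is only a sketch under that stated range; is_nodeport and check_exposed_ports are hypothetical names, and nothing here is part of deploy.sh.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of deploy.sh): succeed only if the
# argument is an integer inside the default k8s port range 30000-32767.
is_nodeport() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric, e.g. the "xxxx" placeholder
  esac
  [ "$1" -ge 30000 ] && [ "$1" -le 32767 ]
}

# Check the two exposed ports of an environment file that was already sourced.
check_exposed_ports() {
  local name value rc=0
  for name in EXPOSE_MQ_HTTP_PORT EXPOSE_MQ_TCP_PORT; do
    eval "value=\${$name:-}"   # read the variable whose name is in $name
    if ! is_nodeport "$value"; then
      echo "$name='$value' is not in 30000-32767" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

A possible workflow is to source ./powerfl_env.sh, call check_exposed_ports, and run ./deploy.sh setup only if the check succeeds. If your cluster uses a non-default port range, adjust the bounds accordingly.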

Install multiple parties at the same time

If you need to install multiple parties on the same k8s cluster, make several copies of _powerfl_env.sh, one environment configuration file per party, and pass the file to the deployment script (if none is specified, powerfl_env.sh in the current directory is used by default, as shown above):

cp _powerfl_env.sh ./powerfl-10000.sh
vim ./powerfl-10000.sh   # modify PARTY_ID and other settings
cp _powerfl_env.sh ./powerfl-20000.sh
vim ./powerfl-20000.sh   # modify PARTY_ID and other settings
# Deploy party 10000 with the configuration file powerfl-10000.sh
./deploy.sh setup ./powerfl-10000.sh
# Deploy party 20000 with the configuration file powerfl-20000.sh
./deploy.sh setup ./powerfl-20000.sh
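When several parties are installed, the copy-and-edit step for PARTY_ID can be scripted. The sketch below assumes the template contains a line of the form "export PARTY_ID=...", as in the sample environment file; make_party_env is a hypothetical helper, not part of FL-deploy.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of FL-deploy): copy the environment template
# and set PARTY_ID in the copy. Assumes the template has a line that starts
# with "export PARTY_ID=".
make_party_env() {
  local template="$1" party_id="$2" out="$3"
  sed "s/^export PARTY_ID=.*/export PARTY_ID=${party_id}/" "$template" > "$out"
}
```

For example, make_party_env _powerfl_env.sh 20000 ./powerfl-20000.sh writes a copy for party 20000. Only PARTY_ID is changed; DOMAIN, the ports and the other parties' MQ settings still have to be edited by hand for each party.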

Update the system

If you need to change system configuration, modify the corresponding environment configuration file and component configuration files, then run:

./deploy.sh upgrade
# If you use a specific configuration file
./deploy.sh upgrade ./powerfl-10000.sh

Uninstall the system

Note: this operation deletes all PowerFL data and cannot be undone:

./deploy.sh cleanup
# If you use a specific configuration file
./deploy.sh cleanup ./powerfl-10000.sh

Submit a federated task through flow-server

After PowerFL is installed, you can write a task flow file and a task parameter configuration file in the specified DSL and submit a federated task to flow-server. Before introducing the specific usage, let's first look at PowerFL's federated task orchestration and scheduling flow.

Federated task orchestration and scheduling flow

1) Write a task flow file pipeline.yaml and import the pipeline into flow-server:

curl --request POST 'http://{domain}/flow-server/pipelines' --form '[email protected]'

If the import succeeds, the pipeline's id is returned. For the pipeline just imported, write the task configuration parameter file job.yaml (its DSL is described later) and submit the task to flow-server:

curl --request POST 'http://{domain}/flow-server/pipelines/{pipeline_id}/jobs' --form '[email protected]_parameters.yaml'

2) After receiving the submitted task request, flow-server randomly generates an id for the task, injects the global configuration to build the DSL that the task flow engine uses for scheduling, and submits the task flow to the engine.

3) Based on this DSL file, the task flow engine requests resources from the K8S cluster in the node order defined by the task flow and launches the runtime container of the corresponding node;

4 and 5) After the runtime container starts, it requests resources from the YARN cluster according to the injected environment variables, starts the driver and executors, launches the specific algorithm flow and performs the parallel computing task. This completes the startup of the algorithm task on this side.

6) Meanwhile, after receiving the submitted task request, flow-server sends the task's configuration file information to the local message middleware;

7) The local message middleware synchronizes across networks, sending the task configuration file to the message middleware of the other parties;

8) The flow-server of each other party listens on the task submission topic of its message middleware and receives the request to start a new federated task;

9) The subsequent process is the same as 3), 4) and 5).
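Step 1) above can be wrapped in two small functions. This is a sketch only: the endpoint paths come from the curl commands in step 1), but the multipart form field name is not recoverable from those commands, so it is passed in as an argument. All function names are hypothetical, and the response is printed as-is because its exact format is not described here.

```shell
#!/usr/bin/env bash
# Sketch of step 1). Endpoint paths follow the curl commands in the article;
# the form field name is an assumption supplied by the caller.

pipelines_url() { printf 'http://%s/flow-server/pipelines' "$1"; }
jobs_url()      { printf 'http://%s/flow-server/pipelines/%s/jobs' "$1" "$2"; }

# import_pipeline DOMAIN FORM_FIELD PIPELINE_FILE
# Uploads the task flow file and prints the raw response (which carries the id).
import_pipeline() {
  curl --fail --silent --show-error --request POST "$(pipelines_url "$1")" \
       --form "$2=@$3"
}

# submit_job DOMAIN PIPELINE_ID FORM_FIELD JOB_FILE
# Uploads the task parameter file for an already imported pipeline.
submit_job() {
  curl --fail --silent --show-error --request POST "$(jobs_url "$1" "$2")" \
       --form "$3=@$4"
}
```

The --fail flag makes curl return a non-zero exit status on an HTTP error, so a failed import stops a script before it tries to submit a job against a pipeline id that was never created.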

The format of the task configuration parameter file is as follows :

parties: [ "10000=guest", "20000=host" ]
common-args:
  spark-master-name: local[*]
  runtime-image: power_fl/runtime:develop
parties-args:
  10000:
    hadoop-config: hadoop
    hadoop-user-name: root
    hdfs-libs-path: hdfs:///fl-runtime-libs
    spark-submit-args: ""
    input: /opt/spark-app/fl-runtime/data/a9a.guest.head
    output: /tmp/a9a.guest.output
  20000:
    hadoop-config: hadoop
    hadoop-user-name: root
    hdfs-libs-path: hdfs:///fl-runtime-libs
    spark-submit-args: ""
    input: /opt/spark-app/fl-runtime/data/a9a.host.head
    output: /data/a9a.host.output

The file above consists of three main parts:

  1. parties: lists the parties of the federated task as an array; each entry has the form partyId=role and gives a party's id and its role in this task.
  2. common-args: parameters shared by all parties. The configurable parameters must be consistent with spec.arguments.parameters specified in pipeline.yaml.
  3. parties-args: parameters specific to each party, including Hadoop configuration information, task algorithm parameter configuration, and so on.
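The partyId=role form of each parties entry is easy to split in a script, for instance to check a job file before submitting it. The sketch below uses only POSIX parameter expansion; the function names are hypothetical, and the uniqueness check reflects the requirement that all party ids be unique.

```shell
#!/usr/bin/env bash
# Split a parties entry of the form "partyId=role".
party_id()   { printf '%s' "${1%%=*}"; }   # text before the first "="
party_role() { printf '%s' "${1#*=}"; }    # text after the first "="

# Succeed only if every given entry has a distinct party id.
parties_unique() {
  local seen=" " entry id
  for entry in "$@"; do
    id="$(party_id "$entry")"
    case "$seen" in
      *" $id "*) echo "duplicate party id: $id" >&2; return 1 ;;
    esac
    seen="$seen$id "
  done
}
```

With the sample file, party_role "10000=guest" yields guest and party_role "20000=host" yields host, and parties_unique "10000=guest" "20000=host" succeeds.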

Summary

PowerFL consolidates the technology and ecosystem of federated learning from bottom to top on five levels: computing and data resources, computing framework, algorithm protocols, product interaction and application scenarios. The whole system is built in a cloud-native way on top of a k8s cluster, makes full use of the big data ecosystem of the YARN cluster, and relies on Spark on Angel for high-performance distributed computation of federated tasks. This article first introduced the overall architecture of PowerFL, including the technology stack, key components and network topology, and then described how to deploy PowerFL with one click and how to define a federated task flow to submit federated tasks. We hope it helps you get started quickly, learn more about this new privacy-preserving machine learning modeling mechanism, and apply it in more fields such as e-commerce, finance, healthcare, education and urban computing.

Copyright notice
This article was created by [Tencent big data official]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/02/202202170506472608.html