当前位置:网站首页>P2pdb white paper

P2pdb white paper

2022-06-24 14:05:00 InfoQ

Introduce

P2PDB From KK Research and development of the group's offline cash register project , Early adoption of Raft agreement +Sqlite  Implement a lightweight distributed database , As for Raft In depth study of the protocol , Find out Raft The protocol has great defects for offline scenarios ( The number of failed nodes is greater than 50%, The whole cluster doesn't work ), Then it opened the way to further study the distributed database protocol , A year later, it was found that based on Merkel's directed acyclic graph +CRDT Implemented logical clock
merker-CRDT
Decentralization of final consistency can be achieved 、 Distributed 、 Point to point database , It also provides full support for offline scenarios .

P2PDB  Is a point-to-point database , This means that each peer has its own specific database instance . Databases are automatically replicated between peers , This generates the latest view of the database when any peer updates . in other words , The database is pulled to the client .

This means that each application contains the full database they are using . With client - Server model , This in turn changes data modeling , client - The server model typically uses a large database for all data : stay  P2PDB  in , It should be stored according to the access rights of the data 、“ Partition ” or “ Fragmentation ”. for example , In applications like Weibo , Tweets will not be saved in global pages written by millions of users at the same time “ Tweets ” In the database , Instead, each user has its own microblog database . To subscribe to other users' tweets , Just focus on the topic (topic)

1、 Offline priority application

Offline first web Browser running applications , The data generated by the client can be stored in p2pdb in , Let's take a real scenario :  In the physical industry , Offline physical stores need multiple cashier systems , At present, the common solution is to use the online cash register system , Data exchange through centralized cloud servers , When the network is poor 、 Even when the network is disconnected , It will directly affect the income of offline physical stores , As a result, offline physical stores cannot operate normally , A better solution would be to use offline storage , Report the data after the network is restored , How to ensure the consistency of data when offline ,  Use raft The protocol can be used to build an offline storage network through the LAN in the case of offline , But it's similar Raft Paxos  Strong consistency protocols have natural defects , When more than half of the nodes are unavailable , The cluster will not work , The cashier system of offline physical stores is often turned on and off 、 In the low peak period of business, the excess cash registers will be shut down , Therefore, a more effective solution is needed .


2、 Distributed edge caching

In the age of Globalization , We often visit various countries 、 Website data of provinces , Because the server nodes are distributed in different parts of the world , Lead to slow access , Such as accessing the U.S. server from the mainland , With high latency . Even though CDNS Vendors provide distributed file cloud storage ,cdns High cost , Not every website can afford to pay ,IPFS The problem of optimization arises , however IPFS Medium filecoin  It just solves the problem of file storage , There is still no feasible scheme for the database storage which requires fast data exchange .

3、dapp Decentralized application scenarios

dapp  The application of both de Sinochem , Also known as distributed applications , It's supposed to turn on web3.0 Time , DApp Decentralized operation through network nodes . It can run on the user's personal device , such as : mobile phone 、 Personal computers and other terminal equipment ,DApp Running on Peer-to-Peer Networks . Independent of the central server , There is no need for a dedicated communication server to deliver messages , There is no need for a central database to record data . The data is stored in the user's personal space , It could be a cell phone , It could be a personal cloud disk ,p2pdb  The design naturally conforms to the decentralized standard , It's going to be dapp  Great choice for application data storage .

4、 Multi person online collaboration

When the client application is used for multi person online collaboration . It is necessary to solve the problem of concurrency conflicts , be based on merkle-CRDT The tamper proof log of , Automatically solve the problem of final data consistency .

5、 The Internet of things

Physical network application access p2pdb Realize edge storage , Deploy a subscription node in the cloud , It can automatically collect IOT data of all edge nodes , Data peer-to-peer synchronization realized at the database level , It saves a lot of trouble at the application level .


Two 、 Meet the conditions

To solve the above industry problems , The following goals need to be achieved
  • 1、 Low latency : When the network is stable , It can synchronize data of 100 nodes in seconds
  • 2、 On a large scale : Support thousands of nodes to run
  • 3、 High performance : Single machine query qps 1 All the above 、 write in qps 5 More than a thousand
  • 4、 De centralization : Peer to peer network
  • 5、 Secure storage : Verifiable data structures
  • 5、 Transport security : Public private key encryption and decryption
  • 6、 The ultimate consistency mechanism
  • 7、 Historical logs cannot be tampered with
  • 8、 Consensus mechanism
  • 9、 Free open source
  • 10、 Easy upgrade and bug recovery
  • 11、 Community autonomy
  • 12、 Support mainstream programming languages


3、 ... and 、 Summary of technical principles

CRDT  Conflict free replication data types

Because direct transmission can be used for online and offline communication , Therefore, there must be a way to maintain the consistency and order of all messages , Especially in conversations with multiple participants . for example , If  Alice  and  Bob  In a chat group with several others , And they all lost their Internet connection by taking the subway , Then they can still use  BLE  Communicate with each other in the same conversation , To create a parallel version of this conversation . When they go back online ,BLE  Version and  Internet  Versions will have to be merged . therefore , It is necessary to use an algorithm to ensure that all peers have exactly the same sort message list after synchronization .
The solution to this problem is to replicate data types without conflicts  (CRDT), It's a data structure , Allow consistent ordering of messages on distributed systems ,CRDT  Provides optimistic replication and strong final consistency , Make sure that once you synchronize , Each peer will have the same version of the message list .
Each message is linked to its parent , This is the last message sent in the conversation by one of the peers connected at this time . There will be problems when the online and offline versions of the conversation are synchronized : Some messages are linked to the same parent , The linked list becomes a directed acyclic graph  . 
For more details
.


null


This leads to the creation of several parallel branches that eventually need to be merged . P2PDB  By using Lamport Clock  To achieve this : Each message will contain a Lamport Clock, The merged list will be sorted according to its value .

Lamport  The clock

Lamport  A clock is a structure with two fields : An identity public key and a counter , This counter is for the relevant user / Each message published by identity is incremented  
For more details
.
// golang
type lamportClock struct {
 time int
 id crypto.PublicKey
}

The comparison function is simple , It first checks the difference between the counter values , without , It checks the dictionary gap between the identity public keys , Knowing that a given identity cannot publish two message values with the same counter .



// golangfunc 
compareClock(a, b lamportClock) int {
 dist := a.time - b.time
 if dist == 0 {
 dist = comparePubKey(a.id, b.id) // Returns lexicographic distance
 }

 return dist
}


merkle-clock

Merkle-clock  Of M  It's a Merkle-DAG, Each node represents an event , Put it another way , An event in a given system , We can do it here DAG Find a node that represents it in , And allows us to sort it by comparing it with other events .DAG Is to merge other nodes through some simple rules DAG To build , The new event is added as the new root node ( The parent node of an existing node ), It should be noted that ,Merkle-clock There may be multiple roots at a given time .
for example , Given  Mα and  Mβ(α  and  β  It's these  DAG  A single root in ):
  • If  α = β  No action required , Because they are the same  DAG.
  • Otherwise, if  α ∈ Mβ, We reserve the  Mβ As our new clock , because  Mα The history of is already a part of it . under these circumstances , We said  Mα < Mβ.
  • Otherwise, if  β ∈ Mα, We reserve for the same reason  Mα. Here   Under different circumstances , We said  Mβ< Mα.
  • otherwise , By keeping two  DAG  Merge the two clocks as is , therefore   There are two root nodes , These nodes are made up of  α  and  β  quote . Please note that ,Mα and  Mβ May not intersect at all , May not intersect , It depends on whether they share   Enjoy some deeper nodes . If we want to record a new event , We can   To create a new root  γ, It has two children ,α  and  β.

We can already see , By identifying a  MerkleClock  Whether included in another  MerkleClock  in , We are  Merkle-Clocks  The concept of order is introduced between . In the same way , We have a concept of order between nodes in each clock , Because an earlier event will always be a descendant of a later event . Besides , We also introduced a method of merging according to their comparison Merkle-Clocks  Methods . Generated  Merkle-Clock  Always include from two  Merkle-Clock  Cause and effect information . This ultimately means that each copy is stored in  Merkle-Clock  The causality information in will converge to the same Merkle-Clock.Merkle-Clocks  The causal sequence provided is used to construct the Merkle-DAG  Embedded when , And is often overlooked as something very intuitive . However , It is important to formalize how we define  Merkle-Clocks  The order between , And prove that causal information is maintained when they are synchronized and merged .


Peer-to-peer networks

Peer to peer network is an idea of network structure . It is the same as the dominant client in the current network / The server (Client/Server) structure ( That is to say WWW The structure adopted ) An essential difference between , There is no central node in the whole network structure ( Or central server ). stay P2P In structure , Every node (peer) Most of them have information consumers at the same time 、 Information provider and information communication . In terms of calculation mode ,P2P Breaking the traditional Client/Server (C/S) Pattern , Every node in the network has the same status . Each node acts as a server , Provide services for other nodes , At the same time, they also enjoy the services provided by other nodes .

To put it simply ,P2P It's about connecting people directly , Let people interact directly through the Internet .P2P Make it easy to communicate on the Internet 、 More direct sharing and interaction , Truly eliminate middlemen . P2P Another important feature is to change the status of the Internet Centered on the Ethernet website 、 Return “ Decentralization ”, And give the power back to the user .  Peer to peer network is a successful extension of the concept of distribution , It allocates the traditional server burden to every node in the network , Each node will undertake limited storage and computing tasks , The more nodes join the network , The more resources a node contributes , The higher the quality of service .

Peer to peer networks can be applied to  Internet  Relatively powerful computers on the edge ( personal computer ), Perform more advanced tasks than client based computing tasks . modern PC With a very fast processor 、 Massive memory and huge hard disk , While performing routine computing tasks ( such as : Browse email and  Web) when , The potential of these devices cannot be fully realized . new PC It's easy to act as both a client and a server for many types of applications ( Counterpart ).



null

P2P The characteristics of network technology are reflected in the following aspects :
  • De centralization
The resources and services in the network are distributed on all nodes , Information transmission and service implementation are directly carried out between nodes , There is no need for the intervention of intermediate links and servers , Avoid possible bottlenecks .P2P The basic characteristics of decentralization , Brings its scalability 、 Robustness and other advantages .

  • Extensibility
stay P2P In the network , With the addition of users , Not only has the demand for services increased , The overall resources and service capabilities of the system are also expanding synchronously , It can always easily meet the needs of users . Theoretically, its scalability is almost infinite . for example : In the traditional way FTP In the file download mode of , When the number of download users increases , The download speed will get slower and slower , However P2P The Internet is the opposite , The more users you join ,P2P The more resources available on the Internet , The faster you download .


  • Robustness,
P2P Architecture is inherently attack resistant 、 Advantages of high fault tolerance . Because services are distributed among nodes , The destruction of some nodes or networks has little impact on other parts .P2P Generally, the network can automatically adjust the overall topology when some nodes fail , Keep other nodes connected .P2P Networks are usually self-organized , And allow nodes to join and leave freely .

  • High cost performance
The performance advantage is P2P An important reason for widespread concern . With the development of hardware technology , The computing, storage capacity and network bandwidth of personal computers are growing rapidly according to Moore's theorem . use P2P The architecture can make effective use of a large number of common nodes scattered in the Internet , Distribute computing tasks or storage data to all nodes . Use the idle computing power or storage space , Achieve the purpose of high-performance computing and mass storage . at present ,P2P The application in this field is mostly in academic research , Once the technology is mature , It can be popularized in the industrial field , Can save many enterprises the cost of purchasing large servers .

  • Privacy protection
stay P2P In the network , Because the transmission of information is distributed among the nodes without going through a centralized link , The possibility of users' privacy information being eavesdropped and leaked is greatly reduced . Besides , At the moment Internet The privacy problem mainly adopts relay forwarding technology , Thus, the participants of communication are hidden in many network entities . In some traditional anonymous communication systems , Implementing this mechanism depends on some relay server nodes . And in the P2P in , All participants can provide relay forwarding function , Therefore, the flexibility and reliability of anonymous communication are greatly improved , Can provide users with better privacy protection .

Proof of version  VOS

Proof of vension , similar pos、 An algorithm for storing proofs , The node with the most complete storage version can pack blocks , according to IPFS Protocol storage to different server nodes worldwide , To achieve the purpose of permanent data storage

The log cannot be tampered with

Tamper proof means , Change log generated each time , Broadcast to all nodes , Every ten minutes, it will be packaged into a block log and uploaded to the blockchain , At this time, the log content cannot be tampered with , Refer to Ethereum 、 Bitcoin protocol


Technical terminology

1、 Node consensus

Node consensus refers to , Historical data changes cannot be tampered with 、 The nodes will reach a consensus every ten minutes , When the consensus is completed, the change log will be permanently saved , Each node will save a complete change log , At the same time, a data snapshot will be generated , For fast data recovery .

2、 Proof of version

Proof of version refers to the occurrence of consensus , The node with the most complete version certificate is preferred as the node for data consistency verification , The remaining nodes will perform data proofreading according to the node with the most complete data version , Finally, data snapshots and historical log blocks are generated

3、 Light node

Follow Merkel's directed acyclic graph structure , For edge nodes , Just download the latest 10 Minutes ago .

4、 Data storage snapshots and logs

Store change logs ( similar mysql binlog journal , Every ten minutes ), Data snapshot ( Full data , For fast recovery and backup )

5、 The private network

P2PDB Not just for the public network , At the same time, it also allows enterprises to P2PDB  Set up a private network , The private network has the same high partition fault tolerance as the public network 、 Usability 、 But security 、 It will be more controllable .

6、 Public private key

The public-private key provides the security of data during transmission , The node encrypts the data according to the issued public key before data transmission , Only the user who knows the private key can solve the secret text .

The reference paper of the white paper

  • Lamport time clock 
    https://lamport.azurewebsites.net/pubs/time-clocks.pdf
  • CRDT tech report: 
    https://hal.inria.fr/file/index/docid/555588/filename/techreport.pdf
  • CRDT :
    https://hal.inria.fr/inria-00555588/document
  • Peer-to-peer networks  
    https://baike.baidu.com/item/%E5%AF%B9%E7%AD%89%E7%BD%91%E7%BB%9C/5482934?fr=aladdin
  • merker-CRDT 
    https://research.protocol.ai/blog/2019/a-new-lab-for-resilient-networks-research/PL-TechRep-merkleCRDT-v0.1-Dec30.pdf


原网站

版权声明
本文为[InfoQ]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/175/202206241237173650.html