Learn about the Spark projects on Nebula Graph
2022-07-25 08:55:00 【InfoQ】
The three Spark sub-projects of Nebula Graph
- Nebula Spark Connector is a Spark library that lets Spark applications read graph data from, and write graph data to, Nebula Graph in DataFrame form.
- Nebula Exchange is built on top of the Nebula Spark Connector. It is a Spark library and, at the same time, a JAR application that can be executed directly via spark-submit. Its design goal is to exchange data between different data sources and Nebula Graph (for the open source version this is one-way: writing in; for the enterprise edition it is bidirectional). Nebula Exchange supports many types of data sources, such as MySQL, Neo4j, PostgreSQL, ClickHouse, Hive, etc. Besides writing directly into Nebula Graph, it can also generate SST files and ingest them into Nebula Graph, so that computing power outside the Nebula Graph cluster helps with the underlying sorting.
- Nebula Algorithm is built on top of the Nebula Spark Connector and GraphX. It is also both a Spark library and a Spark application, used to run common graph algorithms (PageRank, LPA, etc.) on graphs stored in Nebula Graph.
Nebula Spark Connector
- Code: https://github.com/vesoft-inc/nebula-spark-connector
- Documentation: https://docs.nebula-graph.io/3.1.0/nebula-spark-connector/
- JAR package: https://repo1.maven.org/maven2/com/vesoft/nebula-spark-connector/
- Code example: example
Nebula Graph Spark Reader
To read vertex data tagged player, for example, we specify the tag with withLabel("player"), choose the properties to return with withReturnCols(List("name", "age")), and then call spark.read.nebula(...).loadVerticesToDF():
def readVertex(spark: SparkSession): Unit = {
LOG.info("start to read nebula vertices")
val config =
NebulaConnectionConfig
.builder()
.withMetaAddress("metad0:9559,metad1:9559,metad2:9559")
.withConenctionRetry(2)
.build()
val nebulaReadVertexConfig: ReadNebulaConfig = ReadNebulaConfig
.builder()
.withSpace("basketballplayer")
.withLabel("player")
.withNoColumn(false)
.withReturnCols(List("name", "age"))
.withLimit(10)
.withPartitionNum(10)
.build()
val vertex = spark.read.nebula(config, nebulaReadVertexConfig).loadVerticesToDF()
vertex.printSchema()
vertex.show(20)
println("vertex count: " + vertex.count())
}
Getting started with the Nebula Spark Connector
Spin up the environment
# Install Core with Spark Connector, Nebula Algorithm, Nebula Exchange
curl -fsSL nebula-up.siwei.io/all-in-one.sh | bash -s -- v3 spark
# Connect to nebula with console
~/.nebula-up/console.sh
# Execute queries, for example
~/.nebula-up/console.sh -e "SHOW HOSTS"
# Load the sample dataset
~/.nebula-up/load-basketballplayer-dataset.sh
# Wait a minute or so
# Make a graph query against the sample dataset
~/.nebula-up/console.sh -e 'USE basketballplayer; FIND ALL PATH FROM "player100" TO "team204" OVER * WHERE follow.degree is EMPTY or follow.degree >=0 YIELD path AS p;'
Enter the Spark environment
docker exec -it spark_master_1 bash
Optionally, install mvn inside the container if you want to compile there:
docker exec -it spark_master_1 bash
# in the container shell
export MAVEN_VERSION=3.5.4
export MAVEN_HOME=/usr/lib/mvn
export PATH=$MAVEN_HOME/bin:$PATH
wget http://archive.apache.org/dist/maven/maven-3/$MAVEN_VERSION/binaries/apache-maven-$MAVEN_VERSION-bin.tar.gz && \
tar -zxvf apache-maven-$MAVEN_VERSION-bin.tar.gz && \
rm apache-maven-$MAVEN_VERSION-bin.tar.gz && \
mv apache-maven-$MAVEN_VERSION /usr/lib/mvn
Run the Spark Connector example
Option 1 (recommended): via PySpark
- Enter the PySpark shell
~/.nebula-up/nebula-pyspark.sh
- Call the Nebula Spark Connector reader
# call Nebula Spark Connector Reader
df = spark.read.format(
"com.vesoft.nebula.connector.NebulaDataSource").option(
"type", "vertex").option(
"spaceName", "basketballplayer").option(
"label", "player").option(
"returnCols", "name,age").option(
"metaAddress", "metad0:9559").option(
"partitionNumber", 1).load()
# show the dataframe with limit of 2
df.show(n=2)
- Example of the returned result
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.5
/_/
Using Python version 2.7.16 (default, Jan 14 2020 07:22:06)
SparkSession available as 'spark'.
>>> df = spark.read.format(
... "com.vesoft.nebula.connector.NebulaDataSource").option(
... "type", "vertex").option(
... "spaceName", "basketballplayer").option(
... "label", "player").option(
... "returnCols", "name,age").option(
... "metaAddress", "metad0:9559").option(
... "partitionNumber", 1).load()
>>> df.show(n=2)
+---------+--------------+---+
|_vertexId| name|age|
+---------+--------------+---+
|player105| Danny Green| 31|
|player109|Tiago Splitter| 34|
+---------+--------------+---+
only showing top 2 rows
Option 2: compile and submit the example JAR package
- First clone the Spark Connector repository along with its example code (see its README.md), then compile:
cd ~/.nebula-up/nebula-up/spark
git clone https://github.com/vesoft-inc/nebula-spark-connector.git
docker exec -it spark_master_1 bash
cd /root/nebula-spark-connector
- Replace the code of the sample project
echo > example/src/main/scala/com/vesoft/nebula/examples/connector/NebulaSparkReaderExample.scala
vi example/src/main/scala/com/vesoft/nebula/examples/connector/NebulaSparkReaderExample.scala
- Paste the following code, which reads vertices and edges from the basketballplayer graph we loaded earlier, calling readVertex and readEdges respectively.
package com.vesoft.nebula.examples.connector
import com.facebook.thrift.protocol.TCompactProtocol
import com.vesoft.nebula.connector.connector.NebulaDataFrameReader
import com.vesoft.nebula.connector.{NebulaConnectionConfig, ReadNebulaConfig}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.slf4j.LoggerFactory
object NebulaSparkReaderExample {
private val LOG = LoggerFactory.getLogger(this.getClass)
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf
sparkConf
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.registerKryoClasses(Array[Class[_]](classOf[TCompactProtocol]))
val spark = SparkSession
.builder()
.master("local")
.config(sparkConf)
.getOrCreate()
readVertex(spark)
readEdges(spark)
spark.close()
sys.exit()
}
def readVertex(spark: SparkSession): Unit = {
LOG.info("start to read nebula vertices")
val config =
NebulaConnectionConfig
.builder()
.withMetaAddress("metad0:9559,metad1:9559,metad2:9559")
.withConenctionRetry(2)
.build()
val nebulaReadVertexConfig: ReadNebulaConfig = ReadNebulaConfig
.builder()
.withSpace("basketballplayer")
.withLabel("player")
.withNoColumn(false)
.withReturnCols(List("name", "age"))
.withLimit(10)
.withPartitionNum(10)
.build()
val vertex = spark.read.nebula(config, nebulaReadVertexConfig).loadVerticesToDF()
vertex.printSchema()
vertex.show(20)
println("vertex count: " + vertex.count())
}
def readEdges(spark: SparkSession): Unit = {
LOG.info("start to read nebula edges")
val config =
NebulaConnectionConfig
.builder()
.withMetaAddress("metad0:9559,metad1:9559,metad2:9559")
.withTimeout(6000)
.withConenctionRetry(2)
.build()
val nebulaReadEdgeConfig: ReadNebulaConfig = ReadNebulaConfig
.builder()
.withSpace("basketballplayer")
.withLabel("follow")
.withNoColumn(false)
.withReturnCols(List("degree"))
.withLimit(10)
.withPartitionNum(10)
.build()
val edge = spark.read.nebula(config, nebulaReadEdgeConfig).loadEdgesToDF()
edge.printSchema()
edge.show(20)
println("edge count: " + edge.count())
}
}
- Then package it into a JAR:
/usr/lib/mvn/bin/mvn install -Dgpg.skip -Dmaven.javadoc.skip=true -Dmaven.test.skip=true
- Finally, submit it to Spark for execution:
cd example
/spark/bin/spark-submit --master "local" \
--class com.vesoft.nebula.examples.connector.NebulaSparkReaderExample \
--driver-memory 4g target/example-3.0-SNAPSHOT.jar
# exit the Spark container
exit
- On success, we get the following result:
22/04/19 07:29:34 INFO DAGScheduler: Job 1 finished: show at NebulaSparkReaderExample.scala:57, took 0.199310 s
+---------+------------------+---+
|_vertexId| name|age|
+---------+------------------+---+
|player105| Danny Green| 31|
|player109| Tiago Splitter| 34|
|player111| David West| 38|
|player118| Russell Westbrook| 30|
|player143|Kristaps Porzingis| 23|
|player114| Tracy McGrady| 39|
|player150| Luka Doncic| 20|
|player103| Rudy Gay| 32|
|player113| Dejounte Murray| 29|
|player121| Chris Paul| 33|
|player128| Carmelo Anthony| 34|
|player130| Joel Embiid| 25|
|player136| Steve Nash| 45|
|player108| Boris Diaw| 36|
|player122| DeAndre Jordan| 30|
|player123| Ricky Rubio| 28|
|player139| Marc Gasol| 34|
|player142| Klay Thompson| 29|
|player145| JaVale McGee| 31|
|player102| LaMarcus Aldridge| 33|
+---------+------------------+---+
only showing top 20 rows
22/04/19 07:29:36 INFO DAGScheduler: Job 4 finished: show at NebulaSparkReaderExample.scala:82, took 0.135543 s
+---------+---------+-----+------+
| _srcId| _dstId|_rank|degree|
+---------+---------+-----+------+
|player105|player100| 0| 70|
|player105|player104| 0| 83|
|player105|player116| 0| 80|
|player109|player100| 0| 80|
|player109|player125| 0| 90|
|player118|player120| 0| 90|
|player118|player131| 0| 90|
|player143|player150| 0| 90|
|player114|player103| 0| 90|
|player114|player115| 0| 90|
|player114|player140| 0| 90|
|player150|player120| 0| 80|
|player150|player137| 0| 90|
|player150|player143| 0| 90|
|player103|player102| 0| 70|
|player113|player100| 0| 99|
|player113|player101| 0| 99|
|player113|player104| 0| 99|
|player113|player105| 0| 99|
|player113|player106| 0| 99|
+---------+---------+-----+------+
only showing top 20 rows
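Before moving on, note that the connector supports the write direction as well as reading. The snippet below is a minimal write-path sketch (not from the original post), based on the WriteNebulaVertexConfig builder documented in the nebula-spark-connector repository; the DataFrame df, its id column, and the batch size are assumptions for illustration, and the addresses match the nebula-up environment used above.
import com.vesoft.nebula.connector.connector.NebulaDataFrameWriter
import com.vesoft.nebula.connector.{NebulaConnectionConfig, WriteNebulaVertexConfig}
import org.apache.spark.sql.DataFrame

// Sketch only: assumes `df` has an `id` column plus the tag's property
// columns (e.g. name, age); space/tag names reuse the sample dataset.
def writeVertex(df: DataFrame): Unit = {
  val config = NebulaConnectionConfig
    .builder()
    .withMetaAddress("metad0:9559,metad1:9559,metad2:9559")
    .withGraphAddress("graphd:9669") // writes also go through graphd
    .withConenctionRetry(2)
    .build()
  val writeVertexConfig: WriteNebulaVertexConfig = WriteNebulaVertexConfig
    .builder()
    .withSpace("basketballplayer")
    .withTag("player")
    .withVidField("id") // which DataFrame column holds the vertex ID
    .withUser("root")
    .withPasswd("nebula")
    .withBatch(512)
    .build()
  df.write.nebula(config, writeVertexConfig).writeVertices()
}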
Nebula Exchange
- Code: https://github.com/vesoft-inc/nebula-exchange/
- Documentation: https://docs.nebula-graph.com.cn/3.1.0/nebula-exchange/about-exchange/ex-ug-what-is-exchange/
- JAR package: https://github.com/vesoft-inc/nebula-exchange/releases
- Configuration example: exchange-common/src/test/resources/application.conf
- First, create a configuration file so that Exchange knows how to fetch and write the data
- Then invoke the Exchange JAR package with that configuration file
Try Exchange with one click
Run it first and take a look
~/.nebula-up/nebula-exchange-example.sh
A closer look at the details
In this example, Exchange reads a CSV data source like the one below and writes it into Nebula Graph:
player800,"Foo Bar",23
player801,"Another Name",21
- We can take a look inside the Spark environment
docker exec -it spark_master_1 bash
cd /root
- Here you can see exchange.conf, the configuration file we specified when submitting the Exchange task
exchange.conf is in HOCON format: the .nebula section describes the Nebula Graph cluster, and the .tags section describes how the required fields of our data source (a CSV file here) map onto the vertices.
{
# Spark relation config
spark: {
app: {
name: Nebula Exchange
}
master:local
driver: {
cores: 1
maxResultSize: 1G
}
executor: {
memory: 1G
}
cores:{
max: 16
}
}
# Nebula Graph relation config
nebula: {
address:{
graph:["graphd:9669"]
meta:["metad0:9559", "metad1:9559", "metad2:9559"]
}
user: root
pswd: nebula
space: basketballplayer
# parameters for SST import, not required
path:{
local:"/tmp"
remote:"/sst"
hdfs.namenode: "hdfs://localhost:9000"
}
# nebula client connection parameters
connection {
# socket connect & execute timeout, unit: millisecond
timeout: 30000
}
error: {
# max number of failures, if the number of failures is bigger than max, then exit the application.
max: 32
# failed import job will be recorded in output path
output: /tmp/errors
}
# use google's RateLimiter to limit the requests send to NebulaGraph
rate: {
# the stable throughput of RateLimiter
limit: 1024
# Acquires a permit from RateLimiter, unit: MILLISECONDS
# if it can't be obtained within the specified timeout, then give up the request.
timeout: 1000
}
}
# Processing tags
# There are tag config examples for different dataSources.
tags: [
# HDFS csv
# Import mode is client, just change type.sink to sst if you want to use client import mode.
{
name: player
type: {
source: csv
sink: client
}
path: "file:///root/player.csv"
# if your csv file has no header, then use _c0,_c1,_c2,.. to indicate fields
fields: [_c1, _c2]
nebula.fields: [name, age]
vertex: {
field:_c0
}
separator: ","
header: false
batch: 256
partition: 32
}
]
}
- We can see that the CSV data source and the configuration file are in the same directory:
bash-5.0# ls -l
total 24
drwxrwxr-x 2 1000 1000 4096 Jun 1 04:26 download
-rw-rw-r-- 1 1000 1000 1908 Jun 1 04:23 exchange.conf
-rw-rw-r-- 1 1000 1000 2593 Jun 1 04:23 hadoop.env
drwxrwxr-x 7 1000 1000 4096 Jun 6 03:27 nebula-spark-connector
-rw-rw-r-- 1 1000 1000 51 Jun 1 04:23 player.csv
- Then we can manually submit this Exchange task again:
/spark/bin/spark-submit --master local \
--class com.vesoft.nebula.exchange.Exchange download/nebula-exchange.jar \
-c exchange.conf
- Partial output:
22/06/06 03:56:26 INFO Exchange$: Processing Tag player
22/06/06 03:56:26 INFO Exchange$: field keys: _c1, _c2
22/06/06 03:56:26 INFO Exchange$: nebula keys: name, age
22/06/06 03:56:26 INFO Exchange$: Loading CSV files from file:///root/player.csv
...
22/06/06 03:56:41 INFO Exchange$: import for tag player cost time: 3.35 s
22/06/06 03:56:41 INFO Exchange$: Client-Import: batchSuccess.player: 2
22/06/06 03:56:41 INFO Exchange$: Client-Import: batchFailure.player: 0
...
Nebula Algorithm
- Code: https://github.com/vesoft-inc/nebula-algorithm
- Documentation: https://docs.nebula-graph.com.cn/3.1.0/nebula-algorithm/
- JAR package: https://repo1.maven.org/maven2/com/vesoft/nebula-algorithm/
- Example code: example/src/main/scala/com/vesoft/nebula/algorithm
Submitting a task via spark-submit
- Load the LiveJournal dataset
~/.nebula-up/load-LiveJournal-dataset.sh
- Run the PageRank algorithm on the LiveJournal graph and output the results to a CSV file
~/.nebula-up/nebula-algo-pagerank-example.sh
- Check the output :
docker exec -it spark_master_1 bash
head /output/part*000.csv
_id,pagerank
637100,0.9268620883822242
108150,1.1855749056722755
957460,0.923720299211093
257320,0.9967932799358413
Interpreting the configuration file
.data specifies that the source is nebula, meaning the graph data is fetched from the cluster, and that the output sink is csv, written to a local file.
data: {
# data source. optional of nebula,csv,json
source: nebula
# data sink, means the algorithm result will be write into this sink. optional of nebula,csv,text
sink: csv
# if your algorithm needs weight
hasWeight: false
}
.nebula.read specifies how data is read from the Nebula Graph cluster; here, all edges of type follow are read as the whole graph.
nebula: {
# algo's data source from Nebula. If data.source is nebula, then this nebula.read config can be valid.
read: {
# Nebula metad server address, multiple addresses are split by English comma
metaAddress: "metad0:9559"
# Nebula space
space: livejournal
# Nebula edge types, multiple labels means that data from multiple edges will union together
labels: ["follow"]
# Nebula edge property name for each edge type, this property will be as weight col for algorithm.
# Make sure the weightCols are corresponding to labels.
weightCols: []
}
}
.algorithm configures which algorithm to call, together with that algorithm's parameters.
algorithm: {
executeAlgo: pagerank
# PageRank parameter
pagerank: {
maxIter: 10
resetProb: 0.15 # default 0.15
}
}
Calling Nebula Algorithm as a library inside Spark
- This gives more control over the algorithm's output format / custom features (a minimal sketch is given after this list)
- It can also handle graphs whose vertex IDs are non-numeric; see here
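The snippet below is a minimal sketch of this library-call style (not from the original post), assuming the PageRankAlgo / PRConfig API found in the nebula-algorithm repository; the tiny in-memory edge DataFrame is purely illustrative, and in practice it could come from the Nebula Spark Connector reader (loadEdgesToDF) shown earlier.
import org.apache.spark.sql.{DataFrame, SparkSession}
import com.vesoft.nebula.algorithm.config.PRConfig
import com.vesoft.nebula.algorithm.lib.PageRankAlgo

object PageRankAsLibraryExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").getOrCreate()
    import spark.implicits._

    // Assumption: an edge list whose first two columns are numeric src/dst IDs.
    val edges: DataFrame =
      Seq((1L, 2L), (2L, 3L), (3L, 1L), (1L, 3L)).toDF("src", "dst")

    // Same parameters as the config file above: maxIter = 10, resetProb = 0.15
    val prConfig = new PRConfig(10, 0.15)

    // false = run unweighted PageRank; the result is a DataFrame of vertex
    // IDs and their pagerank scores that we can post-process freely.
    val ranks = PageRankAlgo.apply(spark, edges, prConfig, false)
    ranks.show()

    spark.close()
  }
}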