
Real-time computing framework: Spark cluster setup and introduction case

2022-06-24 00:39:00 A cicada smiles

I. Spark Overview

1. Spark Introduction

Spark is designed for large-scale data processing: a fast, general-purpose, and scalable cluster computing engine based on in-memory computing. It implements an efficient DAG execution engine and can process data flows efficiently in memory, offering a significant improvement over MapReduce.

2. Runtime Architecture

Driver

The Driver runs the main() function of a Spark Application and creates the SparkContext. The SparkContext is responsible for communicating with the ClusterManager, applying for resources, allocating tasks, monitoring, and so on.

ClusterManager

The ClusterManager is responsible for applying for and managing the resources needed to run the application on the WorkerNodes. It allows computation to scale efficiently from a single node to thousands of nodes. Currently supported cluster managers include Spark's native ClusterManager, Apache Mesos, and Hadoop YARN.

Executor

An Executor is a process of an Application running on a WorkerNode. It runs Tasks on that worker node and is responsible for storing data in memory or on disk. Each Application has its own independent set of Executors, and tasks are independent of each other.
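To make these roles concrete, here is a minimal Java sketch (not taken from the project source) of the Driver side: the application builds a SparkConf that points at the cluster manager, creating the SparkContext registers the application so the cluster manager can launch Executors on the WorkerNodes, and a submitted job is split into Tasks that run inside those Executors. The standalone master URL spark://hop01:7077 is an assumption, combining the hop01 master node configured later in this article with the default standalone master port.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class DriverDemo {
    public static void main(String[] args) {
        // Driver side: describe the application and where the ClusterManager lives.
        // spark://hop01:7077 is an assumed standalone master address.
        SparkConf conf = new SparkConf()
                .setAppName("DriverDemo")
                .setMaster("spark://hop01:7077");

        // Creating the SparkContext registers the application with the
        // ClusterManager, which allocates Executors on the WorkerNodes.
        JavaSparkContext sc = new JavaSparkContext(conf);

        // A job submitted through the context is split into Tasks and executed
        // inside the Executors; the result is returned to the Driver.
        long count = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).count();
        System.out.println("count = " + count);

        sc.stop();
    }
}

For local development, the master can instead be set to local[*], as the word-count example in section III does.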

II. Deployment Environment

1. Scala Environment

Unpack the installation package:

[root@hop01 opt]# tar -zxvf scala-2.12.2.tgz
[root@hop01 opt]# mv scala-2.12.2 scala2.12

Configure environment variables:

[root@hop01 opt]# vim /etc/profile

export SCALA_HOME=/opt/scala2.12
export PATH=$PATH:$SCALA_HOME/bin

[root@hop01 opt]# source /etc/profile

Check the version:

[root@hop01 opt]# scala -version

The Scala environment needs to be deployed on every node that runs Spark services.

2. Spark Base Environment

Unpack the installation package:

[root@hop01 opt]# tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz
[root@hop01 opt]# mv spark-2.1.1-bin-hadoop2.7 spark2.1

Configure environment variables:

[root@hop01 opt]# vim /etc/profile

export SPARK_HOME=/opt/spark2.1
export PATH=$PATH:$SPARK_HOME/bin

[root@hop01 opt]# source /etc/profile

Check the version (the spark-shell startup banner shows it):

[root@hop01 opt]# spark-shell

3. Spark Cluster Configuration

Configure the worker nodes:

[root@hop01 opt]# cd /opt/spark2.1/conf/
[root@hop01 conf]# cp slaves.template slaves
[root@hop01 conf]# vim slaves

hop01
hop02
hop03

Environment configuration:

[root@hop01 conf]# cp spark-env.sh.template spark-env.sh
[root@hop01 conf]# vim spark-env.sh

export JAVA_HOME=/opt/jdk1.8
export SCALA_HOME=/opt/scala2.12
export SPARK_MASTER_IP=hop01
export SPARK_LOCAL_IP=<IP of the node being configured>
export SPARK_WORKER_MEMORY=1g
export HADOOP_CONF_DIR=/opt/hadoop2.7/etc/hadoop

Note the SPARK_LOCAL_IP setting: each node must use its own IP address here.

4. Starting Spark

Spark depends on the Hadoop environment, so start Hadoop first.

Start: /opt/spark2.1/sbin/start-all.sh
Stop:  /opt/spark2.1/sbin/stop-all.sh

Here the master node starts two processes, Master and Worker; the other nodes start only a Worker process.

5. Accessing the Spark Cluster

The default web UI port is 8080.

http://hop01:8080/

A basic example job:

[root@hop01 spark2.1]# cd /opt/spark2.1/
[root@hop01 spark2.1]# bin/spark-submit --class org.apache.spark.examples.SparkPi --master local examples/jars/spark-examples_2.11-2.1.1.jar

Output: Pi is roughly 3.1455357276786384
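The bundled SparkPi example estimates Pi with a Monte Carlo method: random points are scattered over a square, and the fraction that falls inside the inscribed circle approximates Pi/4. The sketch below is a simplified, hypothetical Java version of that idea rather than the example's actual source; the class name and sample count are my own.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.ArrayList;
import java.util.List;

public class LocalPi {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LocalPi").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        int slices = 2;                   // number of partitions
        int n = 100000 * slices;          // total number of random samples
        List<Integer> samples = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            samples.add(i);
        }

        // Throw n random points into the square [-1, 1] x [-1, 1] and count
        // how many land inside the unit circle; that fraction is about Pi/4.
        long hits = sc.parallelize(samples, slices).filter(i -> {
            double x = Math.random() * 2 - 1;
            double y = Math.random() * 2 - 1;
            return x * x + y * y <= 1;
        }).count();

        System.out.println("Pi is roughly " + 4.0 * hits / n);
        sc.stop();
    }
}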

III. Development Case

1. Core Dependencies

The project depends on Spark 2.1.1:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.1</version>
</dependency>

Add the Scala compilation plugin:

<plugin>
    <groupId>net.alchim31.maven</groupId>
    <artifactId>scala-maven-plugin</artifactId>
    <version>3.2.2</version>
    <executions>
        <execution>
            <goals>
                <goal>compile</goal>
                <goal>testCompile</goal>
            </goals>
        </execution>
    </executions>
</plugin>

2. Example Code

Read the file at the specified location and output word-count statistics for its contents.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import scala.Tuple2;

import java.io.Serializable;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

@RestController
public class WordWeb implements Serializable {

    @GetMapping("/word/web")
    public String getWeb() {

        // 1. Create the Spark configuration object
        SparkConf sparkConf = new SparkConf().setAppName("LocalCount")
                                             .setMaster("local[*]");

        // 2. Create the SparkContext object
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        sc.setLogLevel("WARN");

        // 3. Read the test file
        JavaRDD<String> lineRdd = sc.textFile("/var/spark/test/word.txt");

        // 4. Split each line into words
        JavaRDD<String> wordsRdd = lineRdd.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterator<String> call(String line) throws Exception {
                String[] words = line.split(",");
                return Arrays.asList(words).iterator();
            }
        });

        // 5. Mark each word with an initial count of 1
        JavaPairRDD<String, Integer> wordAndOneRdd = wordsRdd.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<>(word, 1);
            }
        });

        // 6. Count the occurrences of each word
        JavaPairRDD<String, Integer> wordAndCountRdd = wordAndOneRdd.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer count1, Integer count2) throws Exception {
                return count1 + count2;
            }
        });

        // 7. Sort by key (the word)
        JavaPairRDD<String, Integer> sortedRdd = wordAndCountRdd.sortByKey();
        List<Tuple2<String, Integer>> finalResult = sortedRdd.collect();

        // 8. Print the results
        for (Tuple2<String, Integer> tuple2 : finalResult) {
            System.out.println(tuple2._1 + " ===> " + tuple2._2);
        }

        // 9. Save the statistics
        sortedRdd.saveAsTextFile("/var/spark/output");
        sc.stop();
        return "success";
    }
}
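Since the project targets JDK 1.8 (see the JAVA_HOME setting in spark-env.sh above), the same pipeline can also be written with lambdas, which is the more idiomatic style for the Java RDD API. The following is a minimal standalone sketch of that variant, reusing the input and output paths from the example above:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class LambdaWordCount {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("LambdaWordCount").setMaster("local[*]"));
        sc.setLogLevel("WARN");

        // Same word-count pipeline as above, written with Java 8 lambdas.
        JavaPairRDD<String, Integer> counts = sc.textFile("/var/spark/test/word.txt")
                .flatMap(line -> Arrays.asList(line.split(",")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum)
                .sortByKey();

        counts.collect().forEach(t -> System.out.println(t._1 + " ===> " + t._2));

        // Note: saveAsTextFile fails if the output directory already exists.
        counts.saveAsTextFile("/var/spark/output");
        sc.stop();
    }
}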

Package and run the application, request the /word/web endpoint, and then view the file output:

[root@hop01 output]# vim /var/spark/output/part-00000

IV. Source Code Address

GitHub address:
https://github.com/cicadasmile/big-data-parent
GitEE address:
https://gitee.com/cicadasmile/big-data-parent


