Optimizing Spark Project Packaging in Practice
2022-06-24 06:39:00 [Angryshark_128]
Problem
When developing Spark projects in Scala/Java, you frequently build, package, and upload the project. Because the Spark dependencies bundled into the project are large, every packaging round during remote development and debugging costs a lot of time, so the process is worth optimizing.
Solutions
Option 1: upload the full jar once, then update individual class files incrementally
POM configuration (Maven)
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- ........ -->
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.4</version>
        <scope>test</scope>
    </dependency>
</dependencies>
<!-- build configuration -->
<build>
    <resources>
        <resource>
            <directory>src/main/resources</directory>
        </resource>
    </resources>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <configuration>
                <recompileMode>incremental</recompileMode>
            </configuration>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.4.1</version>
            <configuration>
                <!-- get all project dependencies -->
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <!-- bind to the packaging phase -->
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Packaging with the configuration above produces two jars: *-1.0-SNAPSHOT.jar and *-1.0-SNAPSHOT-jar-with-dependencies.jar. The latter is self-contained and can be run on its own, but because it bundles many unneeded dependencies, even a trivial project ends up at one to two hundred MB.
Background: a jar is simply an ordinary zip archive; unpack it and you find the bundled dependency jars, the compiled class files, static resource files, and so on. In other words, each time we change code and repackage, only a few class files or resources actually change, so subsequent updates only need to replace the recompiled class files.
Example:
Take a trivial sparktest project: packaging it produces sparktest-1.0-SNAPSHOT.jar and sparktest-1.0-SNAPSHOT-jar-with-dependencies.jar.

sparktest-1.0-SNAPSHOT-jar-with-dependencies.jar is the standalone executable jar; upload it to the server and it can be run directly. Opening it with any unzip tool reveals its directory structure.

The App*.class files are the compiled output of the main source file.

After modifying App.scala, recompile; the new App*.class files appear under target/classes.

Upload the updated class files to the same directory on the server as the jar, then merge them into the jar:
jar uvf sparktest-1.0-SNAPSHOT-jar-with-dependencies.jar App*.class
Note: if the class file is not at the jar root, recreate the same package directory structure locally before updating, e.g.
jar uvf sparktest-1.0-SNAPSHOT-jar-with-dependencies.jar com/example/App*.class
Option 2: upload dependencies and the project jar separately, then update only the project jar
POM configuration (Maven)
<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- ...... -->
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.4</version>
        <scope>test</scope>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-dependency-plugin</artifactId>
            <executions>
                <execution>
                    <id>copy-dependencies</id>
                    <phase>package</phase>
                    <goals>
                        <goal>copy-dependencies</goal>
                    </goals>
                    <configuration>
                        <outputDirectory>target/lib</outputDirectory>
                        <excludeTransitive>false</excludeTransitive>
                        <stripVersion>true</stripVersion>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <!-- Scala build plugin -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.3.1</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                    <configuration>
                        <args>
                            <arg>-dependencyfile</arg>
                            <arg>${project.build.directory}/.scala_dependencies</arg>
                        </args>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <!-- Jar plugin: packages the project's own classes only; dependencies are not bundled -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest>
                        <!-- <addClasspath>true</addClasspath> -->
                        <mainClass>com.oidd.App</mainClass>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
    </plugins>
</build>
After packaging, you get a slim project jar plus a lib directory.

Upload the jar and the lib folder to the server. Later updates only need the project jar replaced; when invoking spark-submit, pass the jars under lib/ via --jars.
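Note that --jars expects a comma-separated list rather than a shell glob, so one approach (an assumption about your layout, with stand-in file names below) is to join the lib/ jars first:

```shell
# Join every jar under lib/ into the comma-separated list --jars expects.
# Directory contents here are stand-ins for the real dependency jars.
workdir=$(mktemp -d)
cd "$workdir"
mkdir lib
touch lib/a.jar lib/b.jar
DEPS=$(echo lib/*.jar | tr ' ' ',')
echo "$DEPS"    # lib/a.jar,lib/b.jar
# spark-submit --class com.oidd.App --jars "$DEPS" sparktest-1.0-SNAPSHOT.jar
```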