How to make saveAsTextFile generate a single file in Spark
2022-07-25 15:16:00 【The south wind knows what I mean】
Project scenario:
- Generally speaking, saveAsTextFile produces one file per task: part-00000 through part-0000n, where n is the number of tasks, i.e., the number of partitions in the final stage. Is there a way to end up with a single file instead of hundreds? There is.
A common misconception
Calling coalesce(1, true).saveAsTextFile() on an RDD collects the data into a single partition before the save action runs. With only one partition, Spark naturally launches only one task to perform the save, so only one file is produced. Alternatively, you can call repartition(1), which is just a wrapper around coalesce with the second parameter (shuffle) set to true. Is it really that simple?
Obviously not. You can do this, but the cost is huge. Spark processes large volumes of data in parallel; forcing everything into a single partition at the end inevitably generates heavy disk and network I/O, and the memory of the single node that executes the final reduce is put under great pressure. The Spark job will slow down, or even die. This is a common thinking trap for Spark beginners: you need to shift from single-node, single-threaded reasoning to thinking in terms of multiple processes on multiple nodes, and get used to the fact that a multi-node cluster naturally produces multiple output files.
Besides, saveAsTextFile requires that the output directory not exist beforehand; otherwise it throws an error. It is therefore best to check for (and, if necessary, delete) the directory before saving, as in the sketch below.
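For example, a minimal sketch using the Hadoop FileSystem API (the path /output is a hypothetical placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical output path; replace with your own.
val outputPath = new Path("/output")

// Obtain a FileSystem handle (inside a Spark program, prefer
// FileSystem.get(sc.hadoopConfiguration)).
val fs = FileSystem.get(new Configuration())

// saveAsTextFile fails if the target directory already exists,
// so delete it first; the second argument enables recursive delete.
if (fs.exists(outputPath)) {
  fs.delete(outputPath, true)
}

// Now it is safe to call rdd.saveAsTextFile("/output").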
After running a Spark job and saving the result with saveAsTextFile, listing the output with hadoop fs -ls /output shows a long series of part files, possibly thousands. The reason: Spark splits the data into many partitions, and each partition writes its own data as a part-xxxxx file.
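You can check in advance how many part files to expect; a quick sketch, assuming your RDD is named data:

// The number of output files equals the number of partitions.
println(data.getNumPartitions)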
If you want the result saved as a single file, you can (a complete sketch follows this list):
- collect the result first, or
- data.coalesce(1, true).saveAsTextFile("hdfs://output"), or
- data.repartition(1).saveAsTextFile("hdfs://output") // repartition(1) is just a wrapper for coalesce() with the shuffle argument set to true
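Putting this together, a minimal self-contained sketch (the application name and the input/output paths are hypothetical placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object SingleFileOutput {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SingleFileOutput"))

    val data = sc.textFile("hdfs:///input") // hypothetical input path

    // Shrink to a single partition so Spark writes exactly one part file.
    // repartition(1) would be equivalent to coalesce(1, shuffle = true).
    data.coalesce(1, shuffle = true)
      .saveAsTextFile("hdfs:///output") // hypothetical output path

    sc.stop()
  }
}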
- Problem: if your data is large and hard to fit in a single machine's memory, the operations above can cause that machine to run out of memory (OOM). The reason is that they first gather the RDD partitions distributed across the cluster onto a single host, and only then write to disk (HDFS).
Solution:
A safer approach is to merge on HDFS itself. If you have already saved many part files, you can getmerge the whole output directory, combining multiple HDFS files into one local file:
hadoop fs -getmerge /hdfs/output /local/file.txt
Alternatively:
hadoop fs -cat /hdfs/output/part-r-* > /local/file.txt
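If you would rather do the merge programmatically from the driver, here is a sketch assuming Hadoop 2.x, where FileUtil.copyMerge is still available (it was removed in Hadoop 3; the paths are hypothetical placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)

// Merge every part file under /hdfs/output into one HDFS file.
FileUtil.copyMerge(
  fs, new Path("/hdfs/output"),     // source directory
  fs, new Path("/hdfs/merged.txt"), // destination file
  false,                            // deleteSource: keep the part files
  conf,
  null)                             // addString: no separator between files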