How to get a single output file from saveAsTextFile in Spark
2022-07-24 01:34:00 【The south wind knows what I mean】
Project scenario:

Generally speaking, saveAsTextFile generates as many files as there are tasks: part-00000 through part-0000n, where the number of files equals the number of tasks, i.e. the number of partitions in the final stage. Is there a way to end up with just one file instead of hundreds? There is.
A common misconception

Calling coalesce(1, true).saveAsTextFile() on the RDD collects the computed data into a single partition before performing the save. With only one partition, Spark naturally runs only one task for the save, so only one file is produced. Alternatively, you can call repartition(1), which is really just a wrapper around coalesce with the shuffle argument set to true. Is it really that simple?
Obviously not. You can do this, but the cost is huge. Spark handles large volumes of data and executes in parallel; forcing everything into one partition at the end inevitably produces heavy disk I/O and network I/O, and severely stresses the memory of the single node that executes the final reduce. The program becomes slow, or even dies. This is a common thinking trap for Spark beginners: you have to move away from a single-node mindset, understand the program in terms of many processes on many nodes, and get used to the fact that a multi-node cluster naturally produces multiple output files.
Besides, saveAsTextFile requires that the target directory does not exist beforehand; otherwise it reports an error. It is therefore best to check whether the directory exists (and remove it) before saving.
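On a real cluster this pre-save check would go through the Hadoop FileSystem API (exists/delete); as a rough local-filesystem analogue of the same pattern (plain Python, not Spark's API, and the function name is illustrative), it might look like:

```python
import shutil
from pathlib import Path

def prepare_output_dir(path: str) -> None:
    """Remove the output directory if it already exists, mimicking the
    exists-then-delete check you would do against HDFS before saving."""
    out = Path(path)
    if out.exists():
        shutil.rmtree(out)  # saveAsTextFile would fail if this were left in place

# Example: a leftover directory from a previous run is cleared first
Path("output").mkdir(exist_ok=True)
prepare_output_dir("output")
print(Path("output").exists())  # False: safe to save now
```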
After finishing a Spark job and saving the result with saveAsTextFile, running hadoop fs -ls /output reveals a long series of part files, possibly thousands of them.

Reason: Spark splits the data into many partitions, and each partition saves its own data as a part-xxxxx file.
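To see why the file count equals the partition count, here is a local-filesystem sketch (plain Python, not Spark; the round-robin split is just for illustration) of each partition writing its own part-xxxxx file:

```python
import os

def save_as_text_file(records, num_partitions, out_dir):
    """Mimic the saveAsTextFile output layout: each partition writes
    its own slice of the data to a separate part-xxxxx file."""
    os.makedirs(out_dir, exist_ok=True)
    for i in range(num_partitions):
        # Partition i holds every num_partitions-th record (round robin)
        part = [r for idx, r in enumerate(records) if idx % num_partitions == i]
        with open(os.path.join(out_dir, f"part-{i:05d}"), "w") as f:
            f.writelines(line + "\n" for line in part)

save_as_text_file(["a", "b", "c", "d"], 3, "demo_out")
print(sorted(os.listdir("demo_out")))  # ['part-00000', 'part-00001', 'part-00002']
```

One file per partition: merging them afterwards, rather than shrinking to one partition, is what the rest of this article is about.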
If you want to save a single copy, you can:

First collect() the result to the driver, or:

data.coalesce(1, true).saveAsTextFile("hdfs://output")

// Or repartition(1), which is just a wrapper around coalesce()
// with the shuffle argument set to true:
data.repartition(1).saveAsTextFile("hdfs://output")
Problem:

If the data is large, it may not fit in the memory of a single machine, and the operations above can therefore cause an out-of-memory (OOM) error. The reason is that they pull the data distributed across every machine onto a single host: all of the RDD's partitions are merged onto one node and held there before being written to disk (HDFS).
Solution:

A safer approach is to let HDFS do the merging. If the job has already produced many part files, you can merge the whole output directory with getmerge, which combines multiple HDFS files into a single local file:

hadoop fs -getmerge /hdfs/output /local/file.txt

Alternatively:

hadoop fs -cat /hdfs/output/part-r-* > /local/file.txt
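Conceptually, getmerge just concatenates the part files in name order. A local Python equivalent (operating on local part files rather than HDFS, with an illustrative function name) shows the idea:

```python
import glob

def getmerge(src_dir, dest_file):
    """Concatenate all part-* files in sorted (name) order into one
    local file, like `hadoop fs -getmerge` does for an HDFS directory."""
    with open(dest_file, "wb") as out:
        for part in sorted(glob.glob(f"{src_dir}/part-*")):
            with open(part, "rb") as f:
                out.write(f.read())
```

Because the merge happens after the parallel job has finished, no single executor ever has to hold the whole dataset in memory, which is why this is safer than repartition(1).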