How to make saveAsTextFile generate a single file in Spark
2022-07-24 01:34:00 【The south wind knows what I mean】
Project scenario:
In general, saveAsTextFile writes one file per task, named part-00000 through part-0000n, where the number of files equals the number of tasks, that is, the number of partitions in the last stage. Is there a way to end up with a single file instead of hundreds? There is.
Approach:
Call coalesce(1, true).saveAsTextFile() on the RDD. This gathers the computed data into a single partition before the save action runs. With one partition, Spark runs only one task for the save, so only one file is produced. Alternatively, call repartition(1), which is a wrapper around coalesce with the second argument (shuffle) set to true.
Is it really that simple? No. It works, but the cost is high. Spark processes large volumes of data in parallel, and forcing everything into one partition at the end causes heavy disk and network I/O and puts severe pressure on the memory of the node that runs the final task. The job becomes slow and may even fail. This is a common trap for Spark beginners: the single-node way of thinking has to give way to a multi-node, multi-process one, and you need to get used to the fact that a multi-node cluster naturally produces multiple files.
In addition, saveAsTextFile requires that the target directory does not already exist; otherwise it reports an error. It is therefore best to check in the program whether the directory exists before saving.
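The existence check described above can be sketched with the Hadoop FileSystem API. This is a minimal sketch under stated assumptions, not the author's code: the object name SaveWithCheck, the helper saveFresh, and the path /tmp/spark-output-demo are hypothetical, Spark is assumed to run in local mode, and deleting the old output is assumed to be acceptable.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SaveWithCheck {
  // Hypothetical helper: remove a stale output directory, then save.
  def saveFresh(sc: SparkContext, rdd: RDD[String], output: String): Unit = {
    val path = new Path(output)
    // Resolve the file system from the path itself (HDFS, local, ...).
    val fs = path.getFileSystem(sc.hadoopConfiguration)
    if (fs.exists(path)) {
      // Assumption: overwriting old results is acceptable.
      // The second argument requests a recursive delete.
      fs.delete(path, true)
    }
    rdd.saveAsTextFile(output)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SaveWithCheck").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq("a", "b", "c"), numSlices = 3)
      // A second run does not fail, because the old directory is removed first.
      saveFresh(sc, lines, "/tmp/spark-output-demo")
    } finally {
      sc.stop()
    }
  }
}
```

If old results must be kept, the delete branch could instead abort the job or pick a new directory name; which is appropriate depends on the pipeline.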
When a Spark job finishes and I save the result with saveAsTextFile, running hadoop fs -ls /output afterwards shows a long series of part files, sometimes thousands.
Reason:
When Spark runs, the data is split into many partitions, and each partition saves its own data as a separate part-xxxxx file.
If you want to save the result as a single file, you can:
call collect first, which brings all the data to the driver,
or use data.coalesce(1, true).saveAsTextFile(),
or
data.repartition(1).saveAsTextFile() // repartition(1) is just a wrapper for coalesce() with the shuffle argument set to true
data.repartition(1).saveAsTextFile("HDFS://OUTPUT")
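The two single-partition variants listed above can be tried side by side in a small local job. This is only a sketch: the object name SingleFileOutput, the sample data, and the two /tmp output directories are hypothetical, and both directories are assumed not to exist before the run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SingleFileOutput {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SingleFileOutput").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      // Sample data spread over 8 partitions; saved as-is it would give 8 part files.
      val data = sc.parallelize(1 to 1000, numSlices = 8).map(_.toString)
      println(s"partitions before: ${data.getNumPartitions}")

      // Variant 1: coalesce to one partition with a shuffle.
      val viaCoalesce = data.coalesce(1, shuffle = true)
      // Variant 2: repartition(1), which delegates to coalesce with shuffle = true.
      val viaRepartition = data.repartition(1)
      println(s"partitions after coalesce: ${viaCoalesce.getNumPartitions}")
      println(s"partitions after repartition: ${viaRepartition.getNumPartitions}")

      // Each save runs a single task, so each directory holds one part file.
      viaCoalesce.saveAsTextFile("/tmp/single-file-coalesce")
      viaRepartition.saveAsTextFile("/tmp/single-file-repartition")
    } finally {
      sc.stop()
    }
  }
}
```

On a small data set like this one the single partition is harmless; the memory and I/O concerns only show up when the final partition has to hold a large result.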
Problem:
If the data is large and does not fit in the memory of a single machine, the operations above may cause that machine to run out of memory (OOM).
The reason is that they gather the RDD partitions distributed across the cluster onto a single machine before the data is written to disk (HDFS).
Solution:
A safer option is to let HDFS do the merging.
If you have already saved many part files, you can run getmerge on the output directory to merge the files on HDFS into one local file:
hadoop fs -getmerge /hdfs/output /local/file.txt
Alternatively:
hadoop fs -cat /hdfs/output/part-* > /local/file.txt
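The same merge can also be done from code through the Hadoop FileSystem API, which may be convenient at the end of a job. The sketch below is not what getmerge does internally; it is a hand-written equivalent under stated assumptions: the object name MergeParts and the helper mergeParts are hypothetical, the paths reuse the placeholders from the commands above, and the file:// prefix is an assumption to force the destination onto the local file system.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IOUtils

object MergeParts {
  // Hypothetical helper: concatenate the part files in srcDir into dstFile, in name order.
  def mergeParts(conf: Configuration, srcDir: String, dstFile: String): Unit = {
    val src = new Path(srcDir)
    val dst = new Path(dstFile)
    val srcFs = src.getFileSystem(conf)
    val dstFs = dst.getFileSystem(conf)

    // Keep only the part-* files (skips markers such as _SUCCESS) and sort them by name.
    val parts = srcFs.listStatus(src)
      .filter(s => s.isFile && s.getPath.getName.startsWith("part-"))
      .map(_.getPath)
      .sortBy(_.getName)

    // Overwrite the destination if it already exists.
    val out = dstFs.create(dst, true)
    try {
      parts.foreach { p =>
        val in = srcFs.open(p)
        // Stream through a small buffer; the output stream stays open between files.
        try IOUtils.copyBytes(in, out, 4096, false)
        finally in.close()
      }
    } finally {
      out.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Placeholder paths taken from the shell commands above.
    mergeParts(new Configuration(), "/hdfs/output", "file:///local/file.txt")
  }
}
```

Because the bytes are streamed file by file, no machine has to hold the whole data set in memory, which is the point of merging after the save rather than before it.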