How to get a single output file from saveAsTextFile in Spark
2022-07-24 01:34:00 【The south wind knows what I mean】
Project scenario:

Generally speaking, saveAsTextFile generates as many files as there are tasks: part-00000 through part-0000n, where the number of files equals the number of tasks, i.e. the number of partitions in the final stage. Is there a way to end up with just one file instead of hundreds? There is.
A common misconception

Calling coalesce(1, true).saveAsTextFile() on the RDD collects the computed data into a single partition before performing the save. With only one partition, Spark naturally runs only one task for the save, so only one file is produced. Alternatively, you can call repartition(1), which is really just a wrapper around coalesce with the shuffle argument set to true. Is it really that simple?
Obviously not. You can do this, but the cost is huge. Spark handles large volumes of data and executes in parallel; forcing everything into one partition at the end inevitably produces heavy disk I/O and network I/O, and severely stresses the memory of the single node that executes the final reduce. The program becomes slow, or even dies. This is a common thinking trap for Spark beginners: you have to move away from a single-node mindset, understand the program in terms of many processes on many nodes, and get used to the fact that a multi-node cluster naturally produces multiple output files.
Besides, saveAsTextFile requires that the target directory does not exist beforehand; otherwise it reports an error. It is therefore best to check whether the directory exists (and remove it) before saving.
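On a real cluster this pre-save check would go through the Hadoop FileSystem API (exists/delete); as a rough local-filesystem analogue of the same pattern (plain Python, not Spark's API, and the function name is illustrative), it might look like:

```python
import shutil
from pathlib import Path

def prepare_output_dir(path: str) -> None:
    """Remove the output directory if it already exists, mimicking the
    exists-then-delete check you would do against HDFS before saving."""
    out = Path(path)
    if out.exists():
        shutil.rmtree(out)  # saveAsTextFile would fail if this were left in place

# Example: a leftover directory from a previous run is cleared first
Path("output").mkdir(exist_ok=True)
prepare_output_dir("output")
print(Path("output").exists())  # False: safe to save now
```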
After finishing a Spark job and saving the result with saveAsTextFile, running hadoop fs -ls /output reveals a long series of part files, possibly thousands of them.

Reason: Spark splits the data into many partitions, and each partition saves its own data as a part-xxxxx file.
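To see why the file count equals the partition count, here is a local-filesystem sketch (plain Python, not Spark; the round-robin split is just for illustration) of each partition writing its own part-xxxxx file:

```python
import os

def save_as_text_file(records, num_partitions, out_dir):
    """Mimic the saveAsTextFile output layout: each partition writes
    its own slice of the data to a separate part-xxxxx file."""
    os.makedirs(out_dir, exist_ok=True)
    for i in range(num_partitions):
        # Partition i holds every num_partitions-th record (round robin)
        part = [r for idx, r in enumerate(records) if idx % num_partitions == i]
        with open(os.path.join(out_dir, f"part-{i:05d}"), "w") as f:
            f.writelines(line + "\n" for line in part)

save_as_text_file(["a", "b", "c", "d"], 3, "demo_out")
print(sorted(os.listdir("demo_out")))  # ['part-00000', 'part-00001', 'part-00002']
```

One file per partition: merging them afterwards, rather than shrinking to one partition, is what the rest of this article is about.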
If you want to save a single copy, you can:

First collect() the result to the driver, or:

data.coalesce(1, true).saveAsTextFile("hdfs://output")

// Or repartition(1), which is just a wrapper around coalesce()
// with the shuffle argument set to true:
data.repartition(1).saveAsTextFile("hdfs://output")
Problem:

If the data is large, it may not fit in the memory of a single machine, and the operations above can therefore cause an out-of-memory (OOM) error. The reason is that they pull the data distributed across every machine onto a single host: all of the RDD's partitions are merged onto one node and held there before being written to disk (HDFS).
Solution:

A safer approach is to let HDFS do the merging. If the job has already produced many part files, you can merge the whole output directory with getmerge, which combines multiple HDFS files into a single local file:

hadoop fs -getmerge /hdfs/output /local/file.txt

Alternatively:

hadoop fs -cat /hdfs/output/part-r-* > /local/file.txt
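Conceptually, getmerge just concatenates the part files in name order. A local Python equivalent (operating on local part files rather than HDFS, with an illustrative function name) shows the idea:

```python
import glob

def getmerge(src_dir, dest_file):
    """Concatenate all part-* files in sorted (name) order into one
    local file, like `hadoop fs -getmerge` does for an HDFS directory."""
    with open(dest_file, "wb") as out:
        for part in sorted(glob.glob(f"{src_dir}/part-*")):
            with open(part, "rb") as f:
                out.write(f.read())
```

Because the merge happens after the parallel job has finished, no single executor ever has to hold the whole dataset in memory, which is why this is safer than repartition(1).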