How to make saveAsTextFile produce a single output file in Spark
2022-07-24 01:34:00 【The south wind knows what I mean】
Project scenario:
- Generally speaking, saveAsTextFile writes one file per task, named part-00000 up to part-0000n, so the number of files equals the number of tasks, that is, the number of partitions in the last stage. Is there a way to end up with a single file instead of hundreds? There is.
Misconception
Calling coalesce(1, true).saveAsTextFile() on the RDD gathers the computed data into a single partition before the save action runs. With only one partition, Spark runs only one task for the save, so only one file is produced. Alternatively, you can call repartition(1), which is a wrapper around coalesce with the second parameter (shuffle) set to true. Is it really that simple?
Obviously not. This works, but the cost is high. Spark processes large volumes of data in parallel, and forcing everything into one partition at the end causes heavy disk I/O and network I/O and puts severe memory pressure on the node that runs the final task. The job becomes slow and may even fail. This is a common trap for Spark beginners: we need to drop the single-node, single-process way of thinking, reason about the program as many processes running on many nodes, and get used to the fact that a multi-node cluster naturally produces multiple files.
In addition, saveAsTextFile requires that the target directory does not already exist; otherwise it reports an error. It is therefore best to check in the program whether the directory exists before saving.
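One way to do the existence check described above is to shell out to the hadoop command line before starting the save. The sketch below is only an illustration: the helper names hdfs_path_exists and safe_output_path are hypothetical, it assumes the hadoop command is on the PATH, and it relies on hadoop fs -test -e exiting with status 0 when the path exists.

```python
import subprocess


def hdfs_path_exists(path, run=subprocess.run):
    """Return True if `path` exists on HDFS.

    Runs `hadoop fs -test -e <path>`, which exits with status 0 when the
    path exists. The `run` parameter is a hypothetical hook so the function
    can be exercised without a Hadoop installation.
    """
    result = run(["hadoop", "fs", "-test", "-e", path])
    return result.returncode == 0


def safe_output_path(path, run=subprocess.run):
    """Fail early, before any Spark work is done, if the output path exists."""
    if hdfs_path_exists(path, run=run):
        raise FileExistsError(f"output path already exists: {path}")
    return path
```

Calling safe_output_path("/output") before saveAsTextFile turns a late Spark failure into an immediate, readable error. Whether to fail, delete the old directory, or pick a new name is a design choice left to the program.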
When a Spark program finishes and I save the results with saveAsTextFile, running hadoop fs -ls /output afterwards shows a long series of part files, sometimes thousands.
Reason:
When Spark runs, the data is split into many partitions, and each partition saves its own data as a part-xxxxx file.
If you want to save the result as a single file, you can:
call collect first (which brings all the data back to the driver),
or use data.coalesce(1, true).saveAsTextFile()
Or
data.repartition(1).saveAsTextFile() // You can also use repartition(1), which is just a wrapper for coalesce() with the shuffle argument set to true.
data.repartition(1).saveAsTextFile("HDFS://OUTPUT")
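For readers using PySpark, the same call can be sketched as follows. This is a minimal illustration, not a recommended pattern: the function name save_as_single_file is hypothetical, and rdd is assumed to be a pyspark RDD, whose coalesce method takes the partition count and a shuffle flag.

```python
def save_as_single_file(rdd, output_path):
    """Write `rdd` as a single part file under `output_path`.

    `rdd` is assumed to be a pyspark RDD. coalesce(1, shuffle=True) is the
    PySpark counterpart of coalesce(1, true); rdd.repartition(1) would have
    the same effect.
    """
    # All records are shuffled into one partition, so the save runs as a
    # single task and writes a single part file. That one task has to
    # process the entire dataset, so this only suits small results.
    rdd.coalesce(1, shuffle=True).saveAsTextFile(output_path)
```

With a running SparkContext sc, a call might look like save_as_single_file(sc.parallelize(["a", "b", "c"], 4), "/tmp/single_file_output"), where the path is a placeholder and must not exist yet.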
- Problem:
If the data is large and does not fit in a single machine's memory, the operations above may cause that machine to run out of memory (OOM).
The reason is that they gather the RDD partitions distributed across the machines onto a single machine and only then write them to disk (HDFS).
Solution:
A safer approach is given below: let the HDFS tools merge the files on disk.
If you have already saved many part files, you can run getmerge on the output directory to merge the files on HDFS into one local file:
hadoop fs -getmerge /hdfs/output /local/file.txt
This also works:
hadoop fs -cat /hdfs/output/part-* > /local/file.txt
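If the part files have already been copied to a local directory, the same kind of merge can be done without Hadoop. The sketch below is a local analogue of the commands above, not what getmerge runs internally; the function name merge_part_files and the default pattern part-* are assumptions. It streams each file in turn, so the merged data never has to fit in memory.

```python
import glob
import os
import shutil


def merge_part_files(part_dir, merged_path, pattern="part-*"):
    """Concatenate local part files into one file and return how many were merged.

    Files are taken in sorted name order (part-00000, part-00001, ...), which
    matches partition order because the numeric suffix is zero-padded.
    """
    part_paths = sorted(glob.glob(os.path.join(part_dir, pattern)))
    with open(merged_path, "wb") as merged:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                # copyfileobj copies in chunks instead of reading the whole file.
                shutil.copyfileobj(part, merged)
    return len(part_paths)
```

Marker files such as _SUCCESS do not match the pattern and are skipped.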