August 24, 2020: What are small files? What problems do many small files cause? How can they be solved? (big data)

2020-11-06 21:50:00 【Fuda Dajia architect's daily question】

Fogo's answer 2020-08-24：
Know the answer
1. Small files:
A small file is one whose size is significantly smaller than the HDFS block size (64 MB by default; 128 MB by default in Hadoop 2.x).

2. The small-file problem:
The small-file problem in HDFS:
(1) Every file, directory, and block in HDFS is represented as an object (metadata) in the NameNode's memory, so the total is limited by the NameNode's physical memory. Each metadata object takes about 150 bytes. With 10 million small files, each occupying one block, there are about 20 million objects (one per file plus one per block), so the NameNode needs about 3 GB of memory. Storing 100 million files would need about 30 GB. Keeping 100 million small files is clearly inadvisable.
(2) Handling small files is not a design goal of Hadoop. HDFS is designed for streaming access to large data sets (TB scale), so storing a large number of small files in it is inefficient. Reading many small files causes a large number of seeks and constant hopping from DataNode to DataNode to retrieve each file. This is an inefficient access pattern and seriously hurts performance.
(3) Processing a large number of small files is much slower than processing large files of the same total size. Each small file occupies a task slot, and starting a task is expensive, so most of the time can go to starting and releasing tasks rather than doing useful work.
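
The memory estimate in point (1) can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 150-byte figure comes from the text, while the function names and the assumption that each small file costs one file object plus one block object are illustrative choices, not measurements from a real NameNode.

```python
# Rough NameNode memory estimate based on the ~150 bytes per object rule of thumb.
BYTES_PER_OBJECT = 150  # figure quoted in the text; real usage varies by version


def namenode_bytes(num_files, blocks_per_file=1):
    """Estimate NameNode memory for files that each occupy `blocks_per_file` blocks."""
    # Assumption: one metadata object for the file plus one per block.
    objects_per_file = 1 + blocks_per_file
    return num_files * objects_per_file * BYTES_PER_OBJECT


def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9


print(to_gb(namenode_bytes(10_000_000)))   # 3.0
print(to_gb(namenode_bytes(100_000_000)))  # 30.0
```

Under these assumptions the cost grows linearly with the number of files, regardless of how little data each file holds.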

The small-file problem in MapReduce:
A Map task typically processes only one block of input (one input split) at a time. If the files are very small and there are many of them, each Map task handles very little input, a large number of Map tasks are created, and every Map task adds its own bookkeeping overhead.
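
To make the effect on task count concrete, the sketch below counts Map tasks under a simplified model: one task per block-sized split, with splits never spanning files. The function name and the model are assumptions for illustration; real input formats apply additional rules when computing splits.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2.x default block size


def count_map_tasks(file_sizes, block_size=BLOCK_SIZE):
    """Count map tasks, assuming one task per block-sized split and no split spanning files."""
    # Even a tiny (or empty) file is counted as one task in this simplified model.
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)


small_files = [100 * 1024] * 10_000  # 10,000 files of 100 KB each
one_big_file = [sum(small_files)]    # the same bytes stored as a single file

print(count_map_tasks(small_files))   # 10000
print(count_map_tasks(one_big_file))  # 8
```

The same amount of data needs thousands of tasks when stored as small files and only a handful when stored as one large file, which is where the per-task overhead adds up.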

Why there are so many small files
Large numbers of small files arise in at least two scenarios:
(1) The small files are all pieces of one large logical file. HDFS only began to support appending to files in version 2.x, so before that a common way to save unbounded files (for example, log files) was to write the data to HDFS in chunks.
(2) The files themselves are small. For example, in a large image corpus each image is a separate file, and there is no natural way to merge them into one big file.
Solutions
The two scenarios need different solutions:
(1) In the first case, the file is made up of many records, so you can call the HDFS sync() method (in combination with append) to produce a large file at regular intervals. Alternatively, you can write a MapReduce program that merges the small files.
(2) In the second case, you need some kind of container to group the files. Hadoop offers several options:
① Use HAR files. Hadoop Archives (HAR files) were introduced into HDFS in version 0.18.0 to ease the pressure that large numbers of small files put on NameNode memory. HAR files work by building a layered file system on top of HDFS. A HAR file is created with the hadoop archive command, which actually runs a MapReduce job to pack the small files into a small number of HDFS files. For the client, nothing changes when using a HAR file system: all the original files remain visible and accessible (just use a har:// URL instead of an hdfs:// URL), but the number of files in HDFS drops.
② Use SequenceFile storage, with the file name as the key and the file contents as the value. This works very well in practice. For example, given 10,000 small files of 100 KB each, you can write a program that merges them into a single SequenceFile and then process that SequenceFile in a streaming fashion (either directly or with MapReduce).
③ Use HBase. If you produce a lot of small files, the right type of storage depends on the access pattern. HBase stores its data in MapFiles (indexed SequenceFiles), which makes it a good choice if you need MapReduce-style streaming analysis with occasional random lookups.
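
The container idea behind option ② can be sketched in a few lines: many small files are packed into one blob as (file name, contents) records and later streamed back out. This is not the real SequenceFile binary format or the Hadoop API. The record layout (two 4-byte big-endian lengths followed by the key and value bytes) and the function names pack and unpack are assumptions made purely for illustration.

```python
import io
import struct

HEADER = ">II"  # key length and value length, each a 4-byte big-endian unsigned int
HEADER_SIZE = struct.calcsize(HEADER)


def pack(files):
    """Pack a {file name: bytes} mapping into one length-prefixed blob."""
    out = io.BytesIO()
    for name, content in files.items():
        key = name.encode("utf-8")
        out.write(struct.pack(HEADER, len(key), len(content)))
        out.write(key)
        out.write(content)
    return out.getvalue()


def unpack(blob):
    """Stream (file name, contents) records back out in the order they were written."""
    offset = 0
    while offset < len(blob):
        key_len, val_len = struct.unpack_from(HEADER, blob, offset)
        offset += HEADER_SIZE
        name = blob[offset:offset + key_len].decode("utf-8")
        offset += key_len
        content = blob[offset:offset + val_len]
        offset += val_len
        yield name, content


small_files = {"img_0001.jpg": b"\xff\xd8first", "img_0002.jpg": b"\xff\xd8second"}
blob = pack(small_files)
for name, content in unpack(blob):
    print(name, len(content))
```

The storage system then sees one large file instead of many small ones, while the key still identifies each original file. A reader can process the records sequentially without opening each file separately.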

Copyright notice
This article was written by [Fuda Dajia architect's daily question]. Please include a link to the original when reposting. Thank you.
