August 24, 2020: What are small files? What problems do lots of small files cause? How can the small file problem be solved? (Big data)
2020-11-06 21:50:00 【Fuda Dajia architect's daily question】
Fogo's answer 2020-08-24:
Answer from Zhihu
1. Small files:
Small files are files that are significantly smaller than the HDFS block size (64 MB by default; 128 MB by default in Hadoop 2.x).
2. The small file problem:
The small file problem in HDFS:
(1) Every file, directory, and data block in HDFS is represented as an object (metadata) in the NameNode's memory, so the total is limited by the NameNode's physical memory. Each metadata object takes about 150 bytes. With 10 million small files, each occupying one block, there are about 20 million objects (one per file plus one per block), so the NameNode needs about 3 GB of memory. Storing 100 million files would need about 30 GB, so keeping 100 million small files is clearly not advisable.
(2) Handling small files is not a design goal of Hadoop. HDFS is designed for streaming access to large data sets (TB scale), so storing a large number of small files in HDFS is inefficient. Accessing many small files causes a large number of seeks and constant hopping from DataNode to DataNode to retrieve each small file. This is an ineffective access pattern and it seriously hurts performance.
(3) Processing a large number of small files is much slower than processing large files of the same total size. Each small file occupies a task slot, and starting a task takes a lot of time, so most of the time can end up being spent starting and releasing tasks.
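The memory arithmetic in point (1) can be reproduced with a short sketch. It assumes the roughly 150 bytes per metadata object quoted above and counts one file object plus one block object per file; the function name and the choice to ignore directory objects are illustrative assumptions, and the result changes with how many objects are counted per file.

```python
# Rough NameNode memory estimate for small files.
# Assumption: ~150 bytes per metadata object (figure quoted in the text).
# Assumption: each file contributes 1 file object + 1 object per block;
# directory objects are ignored for simplicity.
BYTES_PER_OBJECT = 150


def namenode_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Return the estimated NameNode memory, in bytes, for num_files files."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


for n in (10_000_000, 100_000_000):
    print(f"{n:,} files -> about {namenode_bytes(n) / 1e9:.0f} GB")
```

Under these assumptions the estimate grows linearly with the number of files, which is why the file count, not the data volume, is what exhausts the NameNode.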
The small file problem in MapReduce:
Map tasks typically process one block of input at a time. If the files are very small and numerous, each map task handles very little input data and a great many map tasks are created, each adding its own bookkeeping overhead.
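To see how quickly the task count grows, here is a minimal sketch that counts map tasks under a simplified model: one task per block, and a file never shares a task with another file. The 128 MB block size and the helper name `map_tasks` are assumptions for illustration; real input formats apply extra rules when computing splits, so treat the numbers as approximate.

```python
import math

# Assumption: 128 MB block size, and one map task per block of each file.
BLOCK_SIZE = 128 * 1024 * 1024


def map_tasks(file_sizes, block_size=BLOCK_SIZE):
    """Count map tasks for files with the given (positive) sizes in bytes."""
    return sum(math.ceil(size / block_size) for size in file_sizes)


one_large = [1024 ** 3]               # a single 1 GB file
many_small = [100 * 1024] * 10_000    # 10,000 files of 100 KB (about 1 GB in total)

print(map_tasks(one_large), "tasks for one large file")
print(map_tasks(many_small), "tasks for many small files")
```

Roughly the same amount of data needs 8 tasks in the first case and 10,000 in the second, so the per-task startup cost dominates.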
Why there are so many small files
A large number of small files is produced in at least two scenarios:
(1) The small files are all pieces of one large logical file. HDFS only began to support appending to files in version 2.x, so before that a common way to save unbounded files (for example, log files) was to write the data to HDFS in chunks.
(2) The files are inherently small. For example, in a large image corpus each image is a separate file, and there is no natural way to merge them into one big file.
Solution
These two cases need different solutions:
(1) In the first case, where a file is made up of many records, you can call the HDFS sync() method (in combination with append) to produce one large file at regular intervals. Alternatively, you can write a MapReduce program to merge the small files.
(2) In the second case, you need some kind of container to group the files. Hadoop offers several options:
① Use HAR files. Hadoop Archives (HAR files) were introduced into HDFS in version 0.18.0 to relieve the NameNode memory pressure caused by large numbers of small files. A HAR file works by building a layered file system on top of HDFS. It is created with the hadoop archive command, which actually runs a MapReduce job to pack the small files into a small number of HDFS files. For the client, nothing changes when using a HAR file system: all the original files remain visible and accessible (using a har:// URL instead of an hdfs:// URL), but the number of files in HDFS goes down.
② Use SequenceFile storage, with the file name as the key and the file contents as the value. This works very well in practice. For example, given 10,000 small files of 100 KB each, you can write a program that merges them into a single SequenceFile, and then process that SequenceFile as a stream (directly or with MapReduce).
③ Use HBase. If you produce a lot of small files, a different type of storage may be more appropriate depending on the access pattern. HBase stores data in MapFiles (indexed SequenceFiles), and it is a good choice if you need MapReduce-style streaming analysis together with random access.
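The container idea behind these options (many small files stored as key/value records inside one large file) can be sketched in a few lines. This is a toy illustration only: the length-prefixed record layout, the `pack` and `unpack` helpers, and the file names are all hypothetical, and it is neither the real SequenceFile or HAR on-disk format nor a Hadoop API.

```python
import io
import struct
from typing import BinaryIO, Iterable, Iterator, Tuple

# Hypothetical record layout: 4-byte key length, 4-byte value length
# (big-endian), then the key bytes (file name) and value bytes (contents).
HEADER = ">II"
HEADER_SIZE = struct.calcsize(HEADER)


def pack(records: Iterable[Tuple[str, bytes]], out: BinaryIO) -> None:
    """Append each (file name, contents) pair to one container stream."""
    for name, data in records:
        key = name.encode("utf-8")
        out.write(struct.pack(HEADER, len(key), len(data)))
        out.write(key)
        out.write(data)


def unpack(src: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Stream (file name, contents) pairs back out of the container."""
    while True:
        header = src.read(HEADER_SIZE)
        if not header:
            return  # clean end of the container
        if len(header) < HEADER_SIZE:
            raise ValueError("truncated record header")
        key_len, value_len = struct.unpack(HEADER, header)
        key = src.read(key_len).decode("utf-8")
        value = src.read(value_len)
        yield key, value


# 10,000 tiny in-memory "files" (made-up names and contents).
small_files = [(f"img_{i:05d}.jpg", bytes([i % 256]) * 100) for i in range(10_000)]

container = io.BytesIO()
pack(small_files, container)
container.seek(0)
restored = list(unpack(container))
print(len(restored), "records read back from one container")
```

The point of the design is that the storage layer sees a single large file, so the per-file metadata cost is paid once, while a reader can still iterate over the original files sequentially by name.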
Copyright notice
This article was written by [Fuda Dajia architect's daily question]. Please include a link to the original when reposting. Thank you.