Data Lake: Introduction to Delta Lake
2022-07-23 11:01:00 【YoungerChina】
1. What is Delta Lake?
Delta Lake is an open-source storage framework from Databricks for building lakehouse architectures. It works with compute engines such as Spark, Flink, Hive, PrestoDB, and Trino. As an open-format storage layer, it unifies batch and stream processing and gives the lakehouse architecture reliability, security, and high performance.

2. Key features
Key features of Delta Lake:
- ACID transactions: using different isolation levels, Delta Lake supports concurrent reads and writes from multiple pipelines;
- Data versioning: Delta Lake uses snapshots and similar mechanisms to manage and audit versions of both data and metadata, which enables time-travel queries against a historical version as well as rolling back to one;
- Open file format: Delta Lake stores data in Parquet, which gives it efficient compression among other benefits;
- Unified batch and streaming: Delta Lake supports both batch and streaming reads and writes;
- Schema evolution: Delta Lake lets users merge or overwrite the schema to accommodate changes in data structure over time;
- Rich DML: Delta Lake supports Upsert, Delete, and Merge to cover different scenarios, such as CDC;
- File structure: one major difference between a lake table and an ordinary Hive table is that the lake table's metadata is self-managed and stored in the file system.
2. File organization
The following figure shows the file structure of a Delta Lake table:

A Delta Lake table's files fall into two main parts:
1. The _delta_log directory: stores all metadata of the Delta Lake table, where:
  - Each operation on the table is called a commit. Commits cover data operations (Insert/Update/Delete/Merge) and metadata operations (adding a column, changing table configuration). Every commit produces a new JSON log file that records the actions the commit applied to the table, such as adding files, removing files, or updating metadata;
  - By default, every 10 commits are automatically consolidated into a Parquet checkpoint file, which speeds up metadata parsing and allows historical metadata files to be cleaned up periodically;
2. Data directories/files: everything outside the _delta_log directory is the files that actually store the table data. Two points to note:
  - A partitioned Delta Lake table organizes its data the same way as an ordinary Hive table: the partition columns and their values form part of the data file path;
  - Not every visible data file is valid. Delta Lake organizes a table as snapshots, and the set of valid data files for the latest snapshot is tracked in the _delta_log metadata.
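The layout described above can be explored with a few lines of plain Python. This is only a rough sketch, not Delta Lake's own reader: the helper names (`classify_log_file`, `read_commit`, `partition_values`) and the sample table are hypothetical, and the name pattern covers only single-file checkpoints. It assumes the usual naming in which log files carry a zero-padded 20-digit version number and each line of a commit file is one JSON action.

```python
import json
import re
import tempfile
from pathlib import Path

# Log files are named with a zero-padded 20-digit version number.
LOG_NAME = re.compile(r"^(\d{20})\.(json|checkpoint\.parquet)$")


def classify_log_file(name):
    """Return (version, kind) for a _delta_log file name, or None if it is neither."""
    match = LOG_NAME.match(name)
    if match is None:
        return None
    kind = "commit" if match.group(2) == "json" else "checkpoint"
    return int(match.group(1)), kind


def read_commit(path):
    """Parse one commit file: each non-empty line is a JSON object holding one action."""
    actions = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            record = json.loads(line)
            # The single top-level key names the action type, e.g. "add" or "remove".
            action_type, payload = next(iter(record.items()))
            actions.append((action_type, payload))
    return actions


def partition_values(data_path):
    """Extract Hive-style partition values from the directory part of a data file path."""
    parts = Path(data_path).parts[:-1]
    return dict(part.split("=", 1) for part in parts if "=" in part)


# Build a tiny, hypothetical table directory to exercise the helpers.
table_dir = Path(tempfile.mkdtemp())
log_dir = table_dir / "_delta_log"
log_dir.mkdir()
commit_file = log_dir / f"{0:020d}.json"
commit_file.write_text(
    "\n".join(
        json.dumps(action)
        for action in [
            {"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}},
            {"add": {"path": "date=2022-07-23/part-00000.parquet", "dataChange": True}},
        ]
    )
)

demo_actions = read_commit(commit_file)
```

Here `demo_actions` holds the two actions of commit 0, and passing the `add` action's path to `partition_values` recovers the partition column `date` from the path, which mirrors the Hive-style layout mentioned above.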
3. Metadata mechanism
Delta Lake manages multiple versions of a table through snapshots and supports time-travel queries against historical versions. Whether you query the latest snapshot or a historical one, the snapshot's metadata must first be parsed. It mainly consists of:
- The read and write protocol versions of the Delta Lake table (Protocol);
- The table's schema and configuration (Metadata);
- The list of valid data files, described by a set of added files (AddFile) and removed files (RemoveFile).
To speed up loading of a specific snapshot, Delta Lake first looks for the most recent checkpoint file whose version is less than or equal to the requested version, then combines it with the log files from that checkpoint up to the requested version to derive the metadata.
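The loading procedure can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Delta Lake's actual implementation: `load_snapshot` is a hypothetical helper, and commits and checkpoints are plain in-memory dictionaries standing in for the JSON log files and Parquet checkpoint files.

```python
def load_snapshot(version, commits, checkpoints):
    """Rebuild the table state at `version`.

    commits:     {version: [action, ...]}, each action a one-key dict such as
                 {"add": {"path": ...}} (hypothetical in-memory form of the JSON logs)
    checkpoints: {version: state}, a dict with "protocol", "metadata", "files"
                 (hypothetical stand-in for a Parquet checkpoint)
    """
    # 1. Find the newest checkpoint at or before the requested version.
    usable = [v for v in checkpoints if v <= version]
    if usable:
        start = max(usable)
        base = checkpoints[start]
        state = {
            "protocol": base["protocol"],
            "metadata": base["metadata"],
            "files": set(base["files"]),
        }
        first_commit = start + 1
    else:
        state = {"protocol": None, "metadata": None, "files": set()}
        first_commit = 0

    # 2. Replay the commits after the checkpoint, in version order.
    for v in range(first_commit, version + 1):
        for action in commits[v]:
            if "protocol" in action:
                state["protocol"] = action["protocol"]
            elif "metaData" in action:
                state["metadata"] = action["metaData"]
            elif "add" in action:
                state["files"].add(action["add"]["path"])
            elif "remove" in action:
                state["files"].discard(action["remove"]["path"])
    return state


# Hypothetical history: version 2 rewrites a.parquet into c.parquet.
commits = {
    0: [
        {"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}},
        {"metaData": {"partitionColumns": []}},
        {"add": {"path": "a.parquet"}},
    ],
    1: [{"add": {"path": "b.parquet"}}],
    2: [{"remove": {"path": "a.parquet"}}, {"add": {"path": "c.parquet"}}],
    3: [{"add": {"path": "d.parquet"}}],
}
# A checkpoint summarising versions 0..2, placed early only to keep the example small.
checkpoints = {
    2: {
        "protocol": {"minReaderVersion": 1, "minWriterVersion": 2},
        "metadata": {"partitionColumns": []},
        "files": {"b.parquet", "c.parquet"},
    }
}

latest = load_snapshot(3, commits, checkpoints)  # starts from the checkpoint at version 2
older = load_snapshot(1, commits, checkpoints)   # time travel: no usable checkpoint, replay from 0
```

Loading version 3 only replays one commit on top of the checkpoint, while loading version 1 has to replay the log from the beginning. The file `a.parquet` may still exist on disk after version 2, but it is no longer part of the latest snapshot, which is why not every visible data file is valid.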