
Overview of data model design methods

2022-07-24 00:20:00 000X000

Most analysts in a company do data analysis tied to their own line of business (which requires large volumes of data) and serve the business departments' operations through reports. Before a data middle platform is built, however, analysts often find that no ready-made data is available to them, and they have to clean and process raw data and compute the metrics themselves.

Because most of them do not come from a technical background, the SQL they write tends to be of poor quality; I have even seen queries nested more than five levels deep. Such SQL consumes a great deal of resources, can congest the queue, and affects other warehouse tasks, which annoys the data developers. The data developers then demand that the analysts' access to raw data be revoked, and the analysts in turn complain that the data warehouse is incomplete, that nothing they want is there, and that a single request often takes a week or even half a month. The conflict between analysts and data developers starts here.

The root of this conflict is that the data model cannot be reused. Data development is chimney-style (siloed): every new requirement is computed from the raw data all over again, which naturally takes time. To resolve the conflict, we need to work out what our data model should look like.

01 What is a good data model design?


Consider a set of figures. The two tables below are built from the lineage information provided by the metadata center. They give statistics for the tasks running on the big data platform and for ad-hoc analysis queries, respectively.

Table 1:

Table 2:

The figure below shows the layered architecture of the data warehouse, as a quick reminder of how the data model is layered:

In Table 1, 2,547 tables have no identifiable layer. That is more than 40% of the 6,049 tables in total, and none of them can be reused.

Now focus on the read tasks for tables whose layer is identified. The numbers of tasks reading ODS:DWD:DWS:ADS are 1072:545:187:433, so tasks that read the ODS layer directly account for 47.9% of the four-layer total. This shows that a large number of tasks process raw data directly and that the intermediate models are poorly reused.

In Table 2, among the queries on tables whose layer is identified, the numbers of queries hitting ODS:DWD:DWS:ADS are 892:1008:152:305. That is, 37.8% of queries hit raw data in the ODS layer directly, which shows that the DWD, DWS, and ADS layers are seriously underbuilt, ADS and DWS in particular. The lower the layer a query hits, the more data it scans, the longer it takes, and the more resources it consumes, so the satisfaction of data users drops.

Finally, breaking down the 704 ODS tables that are read, we find that 382 of them have downstream outputs in DWS or ADS. ADS alone accounts for 323 tables, about 45.9% of these ODS tables, which shows that a large number of ODS tables are being used directly for deep aggregation processing.
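
As a check on the percentages quoted above, the shares can be recomputed from the layer counts. This is a minimal sketch: only the counts come from the text, while the dictionary and function names are illustrative.

```python
# Counts per layer as quoted in the text (Table 1 read tasks, Table 2 queries).
read_tasks = {"ODS": 1072, "DWD": 545, "DWS": 187, "ADS": 433}
adhoc_queries = {"ODS": 892, "DWD": 1008, "DWS": 152, "ADS": 305}


def layer_share(counts, layer):
    """Return the percentage of all counted items that fall on `layer`."""
    return 100 * counts[layer] / sum(counts.values())


ods_task_share = layer_share(read_tasks, "ODS")      # share of tasks reading ODS
ods_query_share = layer_share(adhoc_queries, "ODS")  # share of queries hitting ODS

print(f"tasks reading ODS directly: {ods_task_share:.1f}%")
print(f"queries hitting ODS directly: {ods_query_share:.1f}%")
```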

From the analysis above, we seem to have found what an ideal data warehouse model design should have: "the data model is reusable, complete, and standardized."

02 How to measure completeness?


DWD layer completeness: the best way to measure whether the DWD layer is complete is to look at how many ODS tables are referenced by the DWS/ADS/DM layers. The more ODS tables are referenced by layers above DWD, the more tasks are doing deep aggregation directly on raw data without accumulating detail data that can be reused, and the more duplicated development there is in data cleaning, formatting, and integration. I therefore propose the cross-layer reference rate as the metric for DWD completeness.

Cross-layer reference rate: the proportion of ODS tables that are directly referenced by the DWS/ADS/DM layers, among all ODS tables (only active tables are counted).

The lower the cross-layer reference rate, the better. In the data middle platform's model design specification, cross-layer references are not allowed: ODS data may only be referenced by DWD.

DWS/ADS/DM layer completeness: the completeness of the summary data depends mainly on how many query requirements it can satisfy directly (that is, it is measured by the share of queries that use summary-layer data). If the summary data cannot satisfy a requirement, data users have to fall back on detail data, or even raw data.

Aggregate data query ratio: the proportion of all queries that hit the DWS/ADS/DM layers.

To be clear, unlike the cross-layer reference rate, the aggregate query ratio can never reach 100%. But the higher it is, the more complete the upper layers are. For data users, queries become faster and cheaper, and the data is easier to use.
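
The two completeness metrics can be sketched from lineage edges and a query log. This is an illustration under stated assumptions, not the metadata center's actual interface: the table names are hypothetical, and the layer is assumed to be the prefix before the first underscore.

```python
UPPER_LAYERS = {"DWS", "ADS", "DM"}


def layer_of(table):
    """Assumed convention: the layer is the prefix before the first underscore."""
    return table.split("_", 1)[0].upper()


def cross_layer_reference_rate(active_ods_tables, lineage_edges):
    """Share of active ODS tables read directly by a DWS/ADS/DM table."""
    active = set(active_ods_tables)
    crossed = {src for src, dst in lineage_edges
               if src in active and layer_of(dst) in UPPER_LAYERS}
    return len(crossed) / len(active) if active else 0.0


def aggregate_query_ratio(queried_tables):
    """Share of queries whose target table is in DWS/ADS/DM."""
    if not queried_tables:
        return 0.0
    hits = sum(1 for t in queried_tables if layer_of(t) in UPPER_LAYERS)
    return hits / len(queried_tables)


# Hypothetical lineage edges: (upstream table, downstream table).
edges = [
    ("ods_trade_order", "dwd_trade_order_df"),
    ("ods_trade_order", "ads_trade_gmv_di"),       # skips DWD: a cross-layer reference
    ("ods_warehous_stock", "dwd_warehouse_stock_df"),
]
rate = cross_layer_reference_rate(["ods_trade_order", "ods_warehous_stock"], edges)

# Hypothetical query log: one entry per query, naming the table it hit.
ratio = aggregate_query_ratio(
    ["ods_trade_order", "dws_trade_user_1d", "ads_trade_gmv_di", "dwd_trade_order_df"]
)
print(rate, ratio)
```

In this toy data one of the two active ODS tables is read directly by an ADS table, and two of the four queries hit a summary layer.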

03 How to measure reusability?


The core of data middle platform model design is the pursuit of model reuse and sharing. In the data lineage graph of the metadata center, you can see that a poorly designed model is a single line from bottom to top, while an ideal model design should be an interwoven, fan-out structure.

We use the model reference coefficient as the metric for how reusable the data middle platform's models are. The higher the reference coefficient, the better the reusability of the warehouse.

Model reference coefficient: for a model that is read, the average number of downstream models it directly produces.

For example, if one DWD table is referenced by five DWS tables, the reference coefficient of that DWD table is 5. Averaging the reference coefficients of all DWD tables (that have downstream tables) gives the average model reference coefficient of the DWD layer. As a rule of thumb, below 2 is poor and above 3 is fairly good (empirical values).
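
The reference coefficient can be computed from the same kind of lineage edges. The sketch below assumes lineage is available as (upstream, downstream) pairs and that a table's layer can be read from its name prefix; the table names are hypothetical.

```python
from collections import defaultdict


def reference_coefficients(lineage_edges):
    """Map each upstream table to the number of distinct direct downstream tables."""
    downstream = defaultdict(set)
    for src, dst in lineage_edges:
        downstream[src].add(dst)
    return {table: len(targets) for table, targets in downstream.items()}


def average_coefficient(lineage_edges, layer_prefix):
    """Average coefficient over tables of one layer that have downstream tables."""
    coeffs = [n for table, n in reference_coefficients(lineage_edges).items()
              if table.startswith(layer_prefix)]
    return sum(coeffs) / len(coeffs) if coeffs else 0.0


# One DWD table feeds five DWS tables, another feeds only one.
edges = [("dwd_trade_order_df", f"dws_trade_summary_{i}") for i in range(5)]
edges.append(("dwd_user_profile_df", "dws_trade_summary_0"))
edges.append(("ods_trade_order", "dwd_trade_order_df"))  # ignored for the DWD average

dwd_average = average_coefficient(edges, "dwd_")
print(dwd_average)
```

Tables with no downstream tables never appear as an upstream in the edges, so they are excluded from the average, matching the "with downstream tables" condition in the definition.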

04 How to measure the degree of standardization?


In Table 1, more than 40% of the tables have no layer information, which is clearly not standardized at the model design level. Besides checking whether a table has a layer, we also need to check whether it belongs to a subject domain (for example, the transaction domain). If it belongs to no subject domain, the table is hard to find and therefore cannot be reused.

Next, look at the table names. Take a table named stock as an example. When you see this table, do you know which subject domain or business process it belongs to? Is it a full-data table or a daily incremental one? All in all, the name tells you too little. A standardized table name should include the subject domain, the layer, and whether the table is a full snapshot or incremental.

In addition, if the user ID is named UserID in table A and ID in table B, users are left wondering whether the two are the same thing. So the same field must be named consistently across models.
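
Inconsistent field names of the kind described above can be surfaced mechanically once each column is tagged with the business concept it holds. This is a minimal sketch: the (table, column, concept) tuples stand in for a field dictionary and are an assumption, not something the article specifies.

```python
from collections import defaultdict


def inconsistent_field_names(columns):
    """Return concepts that appear under more than one column name.

    `columns` is an iterable of (table, column_name, concept) tuples.
    """
    names = defaultdict(set)
    for _table, column, concept in columns:
        names[concept].add(column)
    return {concept: sorted(found) for concept, found in names.items()
            if len(found) > 1}


# Hypothetical column metadata for tables A and B.
columns = [
    ("A", "UserID", "user_id"),
    ("B", "ID", "user_id"),
    ("A", "order_id", "order_id"),
    ("B", "order_id", "order_id"),
]
conflicts = inconsistent_field_names(columns)
print(conflicts)
```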

Experience and suggestions

1. Use these metrics to assess the current state of your own data warehouse.

2. Then draw up targeted improvement plans, for example eliminating non-standardly named tables and raising the share of tables covered by a subject domain to above 90%.

3. After a period of model refactoring and optimization, use these metrics again to test whether things have really improved.

How much does model refactoring help data construction? Are there quantitative metrics to measure it? With the knowledge above, both questions can be answered well.

05 How to go from chimney-style small data warehouses to a shared data middle platform?

The essence of building a data middle platform is building the enterprise's public data layer: merging the scattered, chimney-style, messy small data warehouses into one shareable, reusable data middle platform.

First: take over the ODS layer and control the source

ODS is the first stop for business data entering the data middle platform and the source of all data processing. Only by controlling the source can we prevent a duplicate data system from emerging.

The data middle platform team must have clear responsibilities and take over the ODS layer completely, starting from access to the business systems' source databases. This ensures that data generated by the business systems enters the warehouse and that only one copy is kept in the middle platform. It can be achieved by agreeing with the business systems' database administrators that only the middle platform team's accounts may synchronize data.

The data in an ODS table must match the source table in structure and record count, with a high degree of fidelity (lossless). ODS tables are named in the form ODS_<business system database name>_<business system table name>, for example ods_warehous_stock, where warehous is the business system database name and stock is the table name under that database.
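
The ODS naming rule and the "same structure, same record count" requirement can be expressed as two small helpers. This is a sketch only: the function names are hypothetical, and a real check would read the column lists and row counts from the source database and the warehouse.

```python
def ods_table_name(source_db, source_table):
    """Build an ODS table name as ods_<source database>_<source table>."""
    return f"ods_{source_db}_{source_table}".lower()


def ods_matches_source(source_columns, source_row_count, ods_columns, ods_row_count):
    """True when the column lists (names, in order) and the row counts agree."""
    return list(source_columns) == list(ods_columns) and source_row_count == ods_row_count


name = ods_table_name("warehous", "stock")
same = ods_matches_source(["sku_id", "qty"], 1200, ["sku_id", "qty"], 1200)
lost_rows = ods_matches_source(["sku_id", "qty"], 1200, ["sku_id", "qty"], 1187)
print(name, same, lost_rows)
```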

Second: divide subject domains and build the bus matrix

A subject domain is an abstract collection of business processes. That may sound a little abstract, but a business process is simply an indivisible behavioral event in the enterprise's operations. In warehouse management, for example, inbound, outbound, shipping, and signing for receipt are all business processes, and the subject domain abstracted from them is the warehousing domain.

Subject domains should cover all business requirements as far as possible, remain relatively stable, and still be extensible (adding a new subject domain should not affect tables in the domains already divided).

Once the subject domains are divided, we start building the bus matrix, identifying the analysis dimensions of the business processes under each subject domain. For example:
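
A bus matrix can be sketched as a mapping from business process to its analysis dimensions. The processes below are the warehousing examples from the text; the dimensions attached to them are hypothetical and only illustrate the shape of the matrix.

```python
# Rows: business processes of the warehousing domain.
# Values: analysis dimensions (hypothetical) that apply to each process.
bus_matrix = {
    "inbound": {"date", "product", "warehouse", "supplier"},
    "outbound": {"date", "product", "warehouse"},
    "shipping": {"date", "product", "warehouse", "carrier", "region"},
    "sign_for": {"date", "product", "carrier", "region"},
}


def shared_dimensions(matrix):
    """Dimensions used by more than one business process, sorted by name."""
    counts = {}
    for dims in matrix.values():
        for dim in dims:
            counts[dim] = counts.get(dim, 0) + 1
    return sorted(dim for dim, n in counts.items() if n > 1)


shared = shared_dimensions(bus_matrix)
print(shared)
```

Dimensions that show up in several rows are the ones that most need a single shared definition across processes.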

Third: build conformed dimensions

The after-sales team's complaint ticket count has a region analysis dimension, and the delivery team's delivery delays also have a region analysis dimension. Suppose you want to analyze the increase in complaints caused by delivery delays, but the two region dimensions contain inconsistent values: some regions then cannot be analyzed at all. So we build globally conformed dimensions and make sure each dimension table is stored only once.
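
The region problem can be made concrete by comparing the values the two teams use. A minimal sketch with made-up region names: any value that appears on only one side cannot be joined and drops out of the analysis.

```python
def unmatched_dimension_values(left_values, right_values):
    """Return (values only on the left, values only on the right), each sorted."""
    left, right = set(left_values), set(right_values)
    return sorted(left - right), sorted(right - left)


# Hypothetical region values used by the two teams.
complaint_regions = ["East China", "North China", "South"]
delay_regions = ["East China", "North China", "South China"]

only_in_complaints, only_in_delays = unmatched_dimension_values(
    complaint_regions, delay_regions
)
print(only_in_complaints, only_in_delays)
```

With a single conformed region dimension table, both result lists would be empty.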

The biggest problem in unifying dimensions is the integration of dimension attributes (if the dimension is product, then product category, product brand, product size, and so on are its dimension attributes). Should all dimension attributes be integrated into one large dimension table? Not necessarily. Here are a few suggestions.

1. Split public dimension attributes and unique dimension attributes into two dimension tables. A self-operated platform usually hosts some third-party merchants, but very few, so most products in fact have no store attribute. In this case it is not recommended to put store attributes and the other product dimension attributes, such as product category and brand, into one dimension table.

2. Split dimension attributes whose output times differ greatly into separate dimension tables. For example, if some attributes are produced at 2 a.m. and others at 6 a.m., they can be split into two dimension tables so that the core dimension table is produced as early as possible.

3. For the stability of dimension table output, split frequently updated attributes from slowly changing ones, and frequently accessed attributes from rarely accessed ones.

For standardized dimension table naming, I suggest the form "dim_<subject domain>_<description>_<table partitioning rule>". Table partitioning can be understood this way: a table that stores tens of billions of rows is too big, so it needs to be cut into many small partitions, one per day or per week, and a new partition is generated each time the task is scheduled. Common partitioning rules:

Fourth: integrate fact tables

Fact table integration follows one basic principle: the statistical granularity must be consistent, and data of different granularity must not appear in the same fact table. Consider an example:

Before the data middle platform was built, the supply chain department, the warehousing department, and the marketing department each had some duplicated fact tables. We need to remove the duplication and integrate the tables by subject domain, here the transaction domain and the warehousing domain.

Both the warehousing department and the supply chain department have inventory detail tables. The warehousing department's granularity is product plus warehouse, while the supply chain department's is product only, so in principle the two tables cannot be merged and should exist separately.

The marketing department and the supply chain department each have an order detail table. Both are at order-level granularity and both belong to the order business process under the transaction domain, so they can be merged into one fact table.
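
The merge rule in this example can be written as a simple check: two fact tables are candidates for merging only if they share the subject domain, the business process, and the grain. The dictionaries below are a hypothetical way of recording that metadata, and the process and key names are illustrative.

```python
def can_merge(fact_a, fact_b):
    """True when both fact tables share domain, business process, and grain."""
    same_process = (fact_a["domain"], fact_a["process"]) == (fact_b["domain"], fact_b["process"])
    same_grain = frozenset(fact_a["grain"]) == frozenset(fact_b["grain"])
    return same_process and same_grain


# Inventory details: the grain differs (product + warehouse vs product only).
warehouse_stock = {"domain": "warehousing", "process": "inventory",
                   "grain": ["product", "warehouse"]}
supply_chain_stock = {"domain": "warehousing", "process": "inventory",
                      "grain": ["product"]}

# Order details: both are at order level in the transaction domain.
marketing_orders = {"domain": "transaction", "process": "order", "grain": ["order"]}
supply_chain_orders = {"domain": "transaction", "process": "order", "grain": ["order"]}

print(can_merge(warehouse_stock, supply_chain_stock))     # grain mismatch
print(can_merge(marketing_orders, supply_chain_orders))   # same process and grain
```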

In addition, we should fill in the incomplete data. For tasks that reference ODS directly to produce DWS/ADS/DM tables, use lineage to find the task list and take the tasks apart one by one. Where an ODS table has no corresponding DWD table, create one; where one already exists, migrate the task to use the DWD table.

DWD/DWS/ADS/DM tables are best named in the form "[layer][subject][sub-subject][content description][table partitioning rule]".

Fifth: model development

After model design comes model development. Some points to watch:

1. Every task must have its task dependencies strictly configured. Without them, when an upstream task fails to produce data normally, the downstream task is still scheduled and computes on wrong data, which wastes resources and makes troubleshooting harder;

2. Temporary tables created in a task should be dropped before the task ends. Otherwise a large number of temporary tables accumulate and take up space;

3. Task names should match table names, which makes them easy to find and associate;

4. Lifecycle management: for ODS and DWD, generally keep as much historical data as possible; for DWS/ADS/DM, set a lifecycle of 7 to 30 days;

5. DWD tables are suited to compressed storage; lzo compression can be used.
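
The first point, strict task dependencies, can be illustrated with a small scheduler that runs tasks in dependency order and skips anything downstream of a failure. This is a sketch, not how any particular scheduler works: it uses `graphlib.TopologicalSorter` from the Python standard library (3.9+), and the task names and the `run_task` callback are hypothetical.

```python
from graphlib import TopologicalSorter


def run_in_dependency_order(dependencies, run_task):
    """Run tasks upstream-first; skip a task if any upstream did not succeed.

    `dependencies` maps a task name to the set of its upstream task names.
    `run_task(name)` returns True on success and False on failure.
    """
    status = {}
    for task in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(task, ())
        if all(status.get(u) == "ok" for u in upstream):
            status[task] = "ok" if run_task(task) else "failed"
        else:
            status[task] = "skipped"  # never compute on missing or wrong data
    return status


# Task names follow the table names, as point 3 recommends.
deps = {
    "dwd_trade_order_df": {"ods_trade_order"},
    "dws_trade_user_1d": {"dwd_trade_order_df"},
}
# Simulate a failure in the DWD task.
status = run_in_dependency_order(deps, lambda name: name != "dwd_trade_order_df")
print(status)
```

Because the DWS task declares its dependency, the simulated DWD failure causes it to be skipped rather than run on bad data.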

Sixth: application migration

The last step is application migration. The core of this step is data comparison: make sure the data is fully consistent, then migrate the application and delete the old tables.
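
The data comparison before migration can be sketched as an order-independent comparison of the rows in the old and new tables. The example assumes the rows fit in memory as tuples; on large tables the same idea would be applied to row counts and aggregates computed in the warehouse.

```python
from collections import Counter


def tables_match(old_rows, new_rows):
    """True when both tables hold the same rows, ignoring row order."""
    return Counter(old_rows) == Counter(new_rows)


def diff_rows(old_rows, new_rows):
    """Return (rows missing from the new table, rows only in the new table)."""
    old, new = Counter(old_rows), Counter(new_rows)
    return list((old - new).elements()), list((new - old).elements())


old_table = [(1, "a"), (2, "b")]
new_table_ok = [(2, "b"), (1, "a")]
new_table_bad = [(1, "a"), (2, "c")]

print(tables_match(old_table, new_table_ok))
missing, extra = diff_rows(old_table, new_table_bad)
print(missing, extra)
```

Using `Counter` rather than `set` keeps duplicate rows significant, so a table that lost one of two identical rows is still reported as different.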

All in all, a data middle platform is not built in one bite. Its construction tends to snowball: with each application migrated, the middle platform's data grows richer and its value greater.

06 Warehouse modeling tool EasyDesign


Carrying out the steps above requires the support of a good tool. To standardize data model design, we developed EasyDesign, a model design product that brings these processes under systematic management. EasyDesign's design ideas and features:

NetEase Youshu:

https://bigdata.163yun.com/product/easydesign

EasyDesign is built on top of the metadata center. It calls the metadata center's data lineage API and, combined with the warehouse model design metrics, produces the model design measurements.

EasyDesign manages all models by subject domain, business process, and layer.

It also provides management of dimensions, measures, and the base field dictionary, along with control of the model design approval process.

07 Summary


This article covered the model design of the data middle platform. Start by establishing design goals. Then, through a series of steps, gradually regulate the scattered, messy, chimney-style small data warehouses into a reusable, shared data middle platform, and finally manage it systematically through a product. To close, a few points again:

1. Completeness, reusability, and standardization form the measurement system for middle platform model design and can help you assess your warehouse design.

2. Dimension design is the soul of dimensional modeling and the foundation of data middle platform model design. Its core is building conformed dimensions.

3. The statistical granularity of a fact table must be consistent; data of different granularity must not appear in the same fact table.

Building a data middle platform often takes half a year to a year or more. But once it is built, the improvement in R&D efficiency is very clear. In NetEase's e-commerce business, the average delivery time for data requirements dropped from one week to three days after the middle platform was built, which sped up demand response and strengthened the support that data gives the business.

Reflection:

When a data middle platform is actually rolled out, the data team must both build the public data layer that forms the middle platform and bear heavy pressure from new requirements. Moreover, requirements usually take priority over building the public data layer, so the progress of middle platform construction is hard to guarantee.

What solutions do you have for this problem?

For example:

1、Satisfy the requirements first (survive), then develop the public data layer (build a better future).

2、Get the support of senior leadership to obtain more R&D resources.

3、While satisfying business requirements, iterate on and optimize the public data layer according to those requirements.

4、Over time, more and more day-to-day business requirements can be met with the public data layer (the middle platform).

5、Day-to-day requirement development and public data layer construction reinforce each other in a cycle.

In addition, to keep middle platform construction moving, you can try setting up a dedicated team whose clear goal is building the middle platform: refactoring and integrating models and sorting out metrics. These people do not take on business requirements, which keeps day-to-day requirements from disrupting middle platform construction. Set reasonable KPIs and KPI weights to give middle platform construction sufficient momentum.

Copyright notice
This article was written by [000X000]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/205/202207240018085461.html