Overview of Data Model Design Methods

2022-07-24 00:20:00 【000X000】

Analysts at most companies do data analysis in the context of their business (which requires large amounts of data) and serve the business departments' operations through reports. Before a data middle platform is built, however, analysts often find that there is no reusable data available to them, so they have to clean, process, and compute metrics from raw data themselves.

Because most of them come from non-technical backgrounds, the SQL they write tends to be of poor quality; I have even seen queries nested more than 5 levels deep. Such SQL consumes a great deal of resources, can congest the queue, affects other data warehouse tasks, and leaves data developers dissatisfied. The data developers then demand that the analysts' access to raw data be revoked, and the analysts in turn complain that the data warehouse is incomplete, that nothing they need is there, and that a single requirement often takes a week or even half a month. The conflict between analysts and data developers starts here.

The root of this conflict is that the data models cannot be reused. Data development is chimney-style: every new requirement is computed from the raw data all over again, which naturally takes time. To resolve the conflict, we need to work out what our data model should look like.

01 What is a good data model design?

Consider a set of data. The two tables below are built from the lineage information provided by the metadata center. They give statistics for the tasks running on the big data platform and for the analysis (ad-hoc) queries, respectively.

Table 1:

Table 2:

The figure below shows the layered architecture of the data warehouse, as a reminder of how the data model layers are designed:

In Table 1, 2547 tables have no recognized layer. That is more than 40% of the 6049 tables in total, and these tables basically cannot be reused.

Now focus on the read tasks for tables whose layer is recognized. The numbers of tasks reading ODS : DWD : DWS : ADS are 1072 : 545 : 187 : 433. Tasks that read the ODS layer directly account for 47.9% of the four-layer total, which shows that a large number of tasks process raw data directly and that the intermediate models are poorly reused.

In Table 2, among the queries whose layer is recognized, the numbers of queries hitting ODS : DWD : DWS : ADS are 892 : 1008 : 152 : 305. 37.8% of queries hit raw ODS data directly, which shows that the DWD, DWS, and ADS layers are seriously underbuilt, ADS and DWS in particular. The lower the layer a query reads from, the more data it scans, the longer it runs, and the more resources it consumes, so the people using the data are less satisfied.

Finally, a further breakdown of the 704 ODS tables that are read shows that 382 of them have downstream outputs in DWS or ADS. ADS alone accounts for 323 tables, or 45.8% of these ODS tables, which shows that a large number of ODS tables are being put through deep processing directly.

From the analysis above, we seem to have found the qualities that an ideal data warehouse model design should have: "the data model is reusable, complete, and standardized".
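The percentages quoted for Tables 1 and 2 are simple shares of the per-layer counts. A minimal sketch that reproduces them from the numbers in the text (the function and variable names are illustrative only):

```python
# Per-layer counts copied from the figures quoted in the text.
read_tasks = {"ODS": 1072, "DWD": 545, "DWS": 187, "ADS": 433}
query_hits = {"ODS": 892, "DWD": 1008, "DWS": 152, "ADS": 305}


def layer_share(counts: dict, layer: str) -> float:
    """Return the percentage of the total count that falls on `layer`."""
    total = sum(counts.values())
    return 100 * counts[layer] / total


ods_task_share = layer_share(read_tasks, "ODS")   # tasks reading ODS directly
ods_query_share = layer_share(query_hits, "ODS")  # queries hitting ODS directly
print(f"{ods_task_share:.1f}% of tasks, {ods_query_share:.1f}% of queries")
```

Run as written, this prints `47.9% of tasks, 37.8% of queries`, matching the shares given above.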

02 How to measure completeness?

DWD layer completeness: to judge whether the DWD layer is complete, the best approach is to look at how many ODS-layer tables are referenced by the DWS/ADS/DM layers. The more the layers above DWD reference ODS, the more tasks are doing deep aggregation directly on raw data. No detail data is being accumulated, nothing can be reused, and data cleaning, formatting, and integration are developed repeatedly. I therefore propose the cross-layer reference rate as the metric for DWD completeness.

Cross-layer reference rate: the proportion of ODS-layer tables that are referenced directly by the DWS/ADS/DM layers, out of all ODS-layer tables (only active tables are counted).

The lower the cross-layer reference rate, the better. The data middle platform model design specification does not allow cross-layer references: ODS-layer data may be referenced only by DWD.
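As a rough illustration, the cross-layer reference rate can be computed from lineage edges. The sketch below assumes the lineage is available as (upstream, downstream) table pairs and that the layer can be read from the table-name prefix; the table names and the edge list are made up for the example.

```python
# Hypothetical lineage edges as (upstream_table, downstream_table) pairs.
# Reading the layer from the table-name prefix is an assumption of this sketch.
EDGES = [
    ("ods_trade_order", "dwd_trade_order_detail"),
    ("ods_trade_order", "ads_trade_daily_report"),  # cross-layer reference
    ("ods_warehous_stock", "dwd_warehouse_stock"),
    ("ods_user_info", "dws_user_profile"),          # cross-layer reference
    ("ods_pay_log", "dwd_pay_detail"),
]
ACTIVE_ODS_TABLES = {
    "ods_trade_order", "ods_warehous_stock", "ods_user_info", "ods_pay_log",
}
SUMMARY_LAYERS = {"dws", "ads", "dm"}


def layer_of(table: str) -> str:
    """Return the layer prefix of a table name, e.g. 'ods' for 'ods_pay_log'."""
    return table.split("_", 1)[0].lower()


def cross_layer_reference_rate(edges, active_ods_tables) -> float:
    """Share of active ODS tables referenced directly by DWS/ADS/DM tables."""
    if not active_ods_tables:
        return 0.0
    crossed = {
        up for up, down in edges
        if up in active_ods_tables and layer_of(down) in SUMMARY_LAYERS
    }
    return len(crossed) / len(active_ods_tables)


rate = cross_layer_reference_rate(EDGES, ACTIVE_ODS_TABLES)
print(f"cross-layer reference rate: {rate:.0%}")
```

In this made-up lineage, two of the four active ODS tables are read directly by a summary-layer table, so the rate is 50%; under the specification described above, the target is 0.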

DWS/ADS/DM layer completeness: to assess how complete the summary data is, look mainly at how many query requirements the summary data can satisfy directly (in other words, measure it by the proportion of queries that use summary-layer data). If the summary data cannot meet a requirement, the people using the data have to fall back on detail data, or even raw data.

Summary-data query ratio: the proportion of all queries that are queries against the DWS/ADS/DM layers.

To be clear, unlike the cross-layer reference rate, the summary-data query ratio can never reach 100%. The higher it is, though, the more complete the upper-layer data is, and for the people using the data, queries get faster and cheaper and the warehouse becomes more pleasant to use.
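Applied to the query counts quoted for Table 2, the summary-data query ratio works out as follows. This is a sketch under two assumptions: only queries with a recognized layer are counted, and the DM layer contributes nothing because no DM figure is given in the text.

```python
# Query hits per layer, as quoted for Table 2 (no DM figure is given).
query_hits = {"ODS": 892, "DWD": 1008, "DWS": 152, "ADS": 305}
SUMMARY_LAYERS = ("DWS", "ADS", "DM")


def summary_query_ratio(hits: dict) -> float:
    """Share of all queries that hit the DWS/ADS/DM layers."""
    total = sum(hits.values())
    if total == 0:
        return 0.0
    summary = sum(hits.get(layer, 0) for layer in SUMMARY_LAYERS)
    return summary / total


ratio = summary_query_ratio(query_hits)
print(f"summary-data query ratio: {ratio:.1%}")
```

With these numbers, 457 of 2357 queries hit the summary layers, which is roughly 19.4%.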

03 How to measure reusability?

The core of data middle platform model design is the pursuit of model reuse and sharing. In the data lineage graph of the metadata center, a poorly designed model shows up as straight lines running from bottom to top, whereas an ideal model design should be an interwoven, fanned-out structure.

The model reference coefficient is used as the metric for how well the data middle platform models are reused. The higher the reference coefficient, the better the reusability of the data warehouse.

Model reference coefficient: for a model that is read, the average number of downstream models it directly produces.

For example, a DWD table that is referenced by 5 DWS tables has a reference coefficient of 5. Averaging the reference coefficients of all DWD tables (those that have downstream tables) gives the average model reference coefficient of the DWD layer. In general, a value below 2 is relatively poor and a value above 3 is relatively good (these are empirical values).
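The definition above can be sketched in a few lines. The lineage edges and table names below are hypothetical, and identifying the layer by table-name prefix is an assumption of the example.

```python
from collections import defaultdict

# Hypothetical lineage edges as (upstream_table, downstream_table) pairs.
EDGES = [
    ("dwd_trade_order_detail", "dws_trade_order_1d"),
    ("dwd_trade_order_detail", "dws_trade_user_1d"),
    ("dwd_trade_order_detail", "dws_trade_shop_1d"),
    ("dwd_trade_order_detail", "dws_trade_item_1d"),
    ("dwd_trade_order_detail", "dws_trade_region_1d"),
    ("dwd_pay_detail", "dws_pay_user_1d"),
]


def reference_coefficients(edges, layer_prefix: str) -> dict:
    """Map each table of the given layer to its number of distinct direct downstream tables."""
    downstream = defaultdict(set)
    for up, down in edges:
        if up.startswith(layer_prefix + "_"):
            downstream[up].add(down)
    return {table: len(children) for table, children in downstream.items()}


def average_reference_coefficient(edges, layer_prefix: str) -> float:
    """Average over tables that have at least one downstream table."""
    coefficients = reference_coefficients(edges, layer_prefix)
    if not coefficients:
        return 0.0
    return sum(coefficients.values()) / len(coefficients)


dwd_coefficients = reference_coefficients(EDGES, "dwd")
dwd_average = average_reference_coefficient(EDGES, "dwd")
print(dwd_coefficients, dwd_average)
```

Here one DWD table feeds 5 DWS tables and the other feeds 1, so the layer average is 3.0, right at the threshold the text calls relatively good.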

04 How to measure the degree of standardization?

In Table 1, more than 40% of the tables have no layer information, which is clearly non-standard from a model design point of view. Besides checking whether a table is assigned to a layer, also check whether it belongs to a subject domain (for example, the transaction domain). A table that belongs to no subject domain is hard to find and therefore cannot be reused.

Next, look at how the table is named. Take the name stock as an example. When you see this table, can you tell which subject domain and business process it belongs to? Is it a table of full data or of daily incremental data? All in all, this table name conveys far too little. A table name should include the subject domain, the layer, and whether the table is a full snapshot or incremental.

In addition, if the user ID is called UserID in table A and ID in table B, the people using the data will be left wondering whether the two are the same thing. The same field must therefore be named consistently across different models.

Experience and suggestions:

1. Use these metrics to assess the current state of your own data warehouse.

2. Then draw up targeted improvement plans, for example eliminating the tables with non-standard names and raising the proportion of tables covered by a subject domain to 90% or more.

3. After a period of model refactoring and optimization, take these metrics again to check whether things have really improved.

How much does model refactoring help data construction? Are there quantitative metrics to measure it? With the material above, both questions can be answered well.

05 How to go from chimney-style small data warehouses to a shared data middle platform?

The essence of building a data middle platform is building the enterprise's public data layer: merging the scattered, chimney-style, messy small data warehouses into one shareable, reusable data middle platform.

First: take over the ODS layer and control the source

ODS is the first stop for business data entering the data middle platform and the source of all data processing. Only by controlling the source can the emergence of a duplicate data system be prevented.

The data middle platform team must make responsibilities clear and take over the ODS-layer data completely. Start from the access permissions on the business systems' source databases, and make sure that once data has been produced by a business system and has entered the data warehouse, only one copy of it is maintained in the middle platform. This can be agreed with the business systems' database administrators: only the middle platform team's accounts are allowed to synchronize data.

The data in an ODS-layer table must match the data source in table structure and in number of records, so that it is highly faithful and lossless. ODS-layer tables are named in the form ODS_<business system database name>_<business system table name>, for example ods_warehous_stock, where warehous is the business system database name and stock is the name of the table in that database.
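The ODS naming rule and the "same structure, same record count" requirement can both be checked mechanically. The sketch below is only illustrative: the helper names and the metadata dictionaries (with `columns` and `row_count` keys) are assumptions, not part of any particular platform.

```python
def ods_table_name(database: str, table: str) -> str:
    """Build an ODS table name as ods_<database>_<table>, in lower case."""
    return f"ods_{database}_{table}".lower()


def matches_source(source: dict, ods: dict) -> bool:
    """True when the ODS copy has the same columns and record count as the source."""
    return (
        source["columns"] == ods["columns"]
        and source["row_count"] == ods["row_count"]
    )


# Hypothetical metadata for a source table and its ODS copy.
source_meta = {"columns": ["id", "sku", "qty"], "row_count": 1200}
ods_meta = {"columns": ["id", "sku", "qty"], "row_count": 1200}

name = ods_table_name("warehous", "stock")
print(name, matches_source(source_meta, ods_meta))
```

The example reproduces the name ods_warehous_stock from the text. A real check would read the column lists and counts from the source database and the warehouse rather than from literals.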

Second: divide the subject domains and build the bus matrix

A subject domain is an abstract collection of business processes. That may sound a little abstract, but a business process is simply an indivisible behavioral event in the enterprise's business activities. In warehouse management, for example, inbound, outbound, shipping, and sign-for are all business processes, and the subject domain abstracted from them is the warehousing domain.

Subject domains should cover all business requirements as far as possible, remain relatively stable, and also be extensible to some degree (adding a new subject domain must not affect the tables of the subject domains that have already been divided).

Once the subject domains have been divided, we start building the bus matrix, identifying the analysis dimensions of the business processes under each subject domain. For example:
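One way to picture a bus matrix is as a mapping from business processes to the analysis dimensions they use. The sketch below takes the four warehousing processes named earlier; the dimensions assigned to each process are invented for illustration and are not taken from the original figure.

```python
# Bus matrix for the warehousing domain: business process -> analysis dimensions.
# The dimension sets are hypothetical.
BUS_MATRIX = {
    "inbound": {"date", "product", "warehouse", "supplier"},
    "outbound": {"date", "product", "warehouse"},
    "shipping": {"date", "product", "warehouse", "carrier", "region"},
    "sign_for": {"date", "product", "carrier", "region"},
}


def shared_dimensions(matrix: dict) -> set:
    """Dimensions used by every business process in the matrix."""
    return set.intersection(*matrix.values())


def processes_using(matrix: dict, dimension: str) -> list:
    """Business processes that are analysed by the given dimension."""
    return sorted(p for p, dims in matrix.items() if dimension in dims)


common = shared_dimensions(BUS_MATRIX)
region_processes = processes_using(BUS_MATRIX, "region")
print(sorted(common), region_processes)
```

Dimensions that appear in many rows of the matrix are the natural candidates for shared dimension tables.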

Third: build conformed dimensions

The after-sales team's complaint ticket volume has a region analysis dimension, and the delivery team's delivery delays also have a region analysis dimension. Suppose you want to analyze the increase in complaint tickets caused by delivery delays, but the two region dimensions do not contain the same values. In the end, some regions cannot be analyzed. So we need to build globally conformed dimensions and make sure that each dimension table is stored only once.
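The region problem described above can be detected by comparing the values of the two dimensions. A minimal sketch, with made-up region values for the two teams:

```python
# Hypothetical region values used by each team's own dimension table.
after_sales_regions = {"North China", "East China", "South China"}
delivery_regions = {"North China", "East China", "Southwest"}


def conformance_gaps(left: set, right: set) -> dict:
    """Values that exist on only one side and therefore cannot be joined."""
    return {
        "only_left": sorted(left - right),
        "only_right": sorted(right - left),
    }


gaps = conformance_gaps(after_sales_regions, delivery_regions)
joinable = sorted(after_sales_regions & delivery_regions)
print(gaps, joinable)
```

Only the regions in `joinable` can be analysed across both teams; any value listed in `gaps` is a region that drops out of the analysis until both teams read from one shared dimension table.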

The biggest problem in unifying dimensions is the integration of dimension attributes (if the dimension is product, then product category, product brand, product size, and so on are its attributes, which we call dimension attributes). Should all dimension attributes be integrated into one big dimension table? Not necessarily. Here are a few suggestions.

1. Split common dimension attributes and specific dimension attributes into two dimension tables. A self-operated platform usually also has some third-party merchants, but only a few of them. Most products in fact have no store attribute. In this situation it is not recommended to design the store attributes and the other product attributes, such as product category and brand, into one dimension table.

2. Split dimension attributes whose output times differ widely into separate dimension tables. For example, if some dimension attributes are produced at 2 a.m. and others at 6 a.m., the 2 a.m. and 6 a.m. attributes can be split into two dimension tables so that the core dimension table is produced as early as possible.

3. For the sake of stable dimension table output, split frequently updated attributes from slowly changing ones, and frequently accessed attributes from rarely accessed ones.

For standardized naming of dimension tables, the suggested form is "dim_<subject domain>_<description>_<partition rule>". The partition rule can be understood this way: a table that stores hundreds of billions of rows is very large, so it has to be cut into many small partitions. Every day or every week, as the task is scheduled, a new partition is generated. Common partition rules are therefore time-based, following the time range that queries use.

Fourth: fact table integration

Fact table integration follows one basic principle: the statistical granularity must be consistent. Data of different statistical granularity must not appear in the same fact table. Consider an example:

Before the data middle platform was built, the supply chain department, the warehousing department, and the marketing department each had some duplicate fact tables. We need to remove the duplication and integrate the tables by subject domain, such as the transaction domain and the warehousing domain.

The warehousing department and the supply chain department both have an inventory detail table. The statistical granularity of the warehousing department's table is product plus warehouse, while that of the supply chain department's table is product only. In principle, therefore, the two tables cannot be merged and should exist independently.

The marketing department and the supply chain department each have an order detail table. Both have order-level statistical granularity and both belong to the order business process under the transaction domain, so they can be merged into one fact table.
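The merge rule in the two examples above (same subject domain, same business process, same statistical granularity) can be written as a small check. The class, field names, and table names below are hypothetical; only the granularities come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactTable:
    name: str
    subject_domain: str
    business_process: str
    grain: frozenset  # columns that define one row, i.e. the statistical granularity


def can_merge(a: FactTable, b: FactTable) -> bool:
    """Two fact tables may be merged only if domain, process, and grain all match."""
    return (
        a.subject_domain == b.subject_domain
        and a.business_process == b.business_process
        and a.grain == b.grain
    )


warehouse_stock = FactTable(
    "warehouse_stock_detail", "warehousing", "inventory",
    frozenset({"product", "warehouse"}),
)
supply_stock = FactTable(
    "supply_stock_detail", "warehousing", "inventory", frozenset({"product"}),
)
marketing_orders = FactTable(
    "marketing_order_detail", "transaction", "order", frozenset({"order"}),
)
supply_orders = FactTable(
    "supply_order_detail", "transaction", "order", frozenset({"order"}),
)

print(can_merge(warehouse_stock, supply_stock))   # different grain
print(can_merge(marketing_orders, supply_orders)) # same domain, process, grain
```

The two inventory tables differ in grain, so the check rejects the merge, while the two order tables pass it.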

In addition, we should also fill in the data that is incomplete. For DWS/ADS/DM-layer tasks that are produced by referencing the ODS layer directly, use lineage to find the list of tasks and take them apart one by one. Where an ODS table has no corresponding DWD table, a DWD table should be created. Where one already exists, the task should be migrated to use the DWD-layer table.

The naming rule suggested for DWD/DWS/ADS/DM tables is "[layer][subject][sub-subject][content description][partition rule]".

Fifth: model development

After model design comes the model development stage. A few points to watch:

1. Every task must have its task dependencies strictly configured. Without them, a downstream task may be scheduled even though its upstream task has not produced data normally. It then runs on wrong data, wastes resources, and makes troubleshooting more complicated;

2. Temporary tables created in a task should be deleted before the task ends. If they are not, a large number of temporary tables will pile up and take up space;

3. The task name should match the table name, which makes tasks easy to find and to associate with tables;

4. Lifecycle management: for ODS and DWD, generally keep as much historical data as possible; for DWS/ADS/DM, a lifecycle needs to be set, ranging from 7 to 30 days;

5. DWD-layer tables should be stored in compressed form; lzo compression can be used.
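Point 1 lends itself to an automated check: every table a task reads that is produced by another task should also appear in its declared dependencies. The sketch below assumes tasks are named after their output tables (point 3) and uses a made-up task list; `graphlib` is in the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter

# Hypothetical task definitions, keyed by task name (same as the output table).
TASKS = {
    "ods_trade_order": {"reads": set(), "depends_on": set()},
    "dwd_trade_order_detail": {
        "reads": {"ods_trade_order"},
        "depends_on": {"ods_trade_order"},
    },
    "dws_trade_order_1d": {
        "reads": {"dwd_trade_order_detail"},
        "depends_on": set(),  # dependency was forgotten
    },
}


def missing_dependencies(tasks: dict) -> dict:
    """For each task, the tables it reads that a task produces but that are not declared."""
    problems = {}
    for name, spec in tasks.items():
        produced_inputs = {t for t in spec["reads"] if t in tasks}
        missing = produced_inputs - spec["depends_on"]
        if missing:
            problems[name] = missing
    return problems


def run_order(tasks: dict) -> list:
    """A valid execution order derived from what each task actually reads."""
    graph = {
        name: {t for t in spec["reads"] if t in tasks}
        for name, spec in tasks.items()
    }
    return list(TopologicalSorter(graph).static_order())


problems = missing_dependencies(TASKS)
order = run_order(TASKS)
print(problems, order)
```

The check flags `dws_trade_order_1d`, which reads a DWD table without declaring the dependency, and the topological order shows the sequence the scheduler should enforce.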

Sixth: application migration

The last step is application migration. The key to this process is data comparison: first make sure the data is completely consistent, then migrate the application and delete the old tables.
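One simple way to compare an old table with its replacement is to compare row counts plus an order-independent checksum. The sketch below works on in-memory rows purely for illustration; in practice the same idea would run as queries against both tables, and the function names are assumptions.

```python
import hashlib


def table_fingerprint(rows) -> tuple:
    """Return (row_count, checksum); the checksum ignores row order."""
    count = 0
    checksum = 0
    for row in rows:
        row_hash = hashlib.sha256(repr(tuple(row)).encode("utf-8")).digest()
        # Summing (mod 2**64) keeps the result independent of row order
        # while still counting duplicate rows.
        checksum = (checksum + int.from_bytes(row_hash[:8], "big")) % 2**64
        count += 1
    return count, checksum


def safe_to_migrate(old_rows, new_rows) -> bool:
    """True when both tables have the same fingerprint."""
    return table_fingerprint(old_rows) == table_fingerprint(new_rows)


old_table = [(1, "sku-a", 10), (2, "sku-b", 20)]
new_table = [(2, "sku-b", 20), (1, "sku-a", 10)]    # same rows, different order
broken_table = [(1, "sku-a", 10), (2, "sku-b", 21)]  # one value differs

print(safe_to_migrate(old_table, new_table))
print(safe_to_migrate(old_table, broken_table))
```

A matching fingerprint is a strong hint rather than proof of equality, so it is best combined with spot checks on key metrics before the old table is deleted.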

In short, a data middle platform cannot be built in one go. Its construction tends to snowball: as applications are migrated one after another, the data in the middle platform becomes richer and the value it delivers keeps growing.

06 Data warehouse modeling tool: EasyDesign

Carrying out the steps above requires the support of a good tool. To standardize data model design, the model design product EasyDesign was developed so that these processes can be managed systematically. The design ideas and features of EasyDesign:

NetEase Youshu (网易有数):

https://bigdata.163yun.com/product/easydesign

EasyDesign is built on top of the metadata center. It calls the metadata center's data lineage interface through an API and, combined with the data warehouse model design metrics, produces measurements of the model design.

EasyDesign manages all models by subject domain, business process, and layer.

It also provides management of dimensions, metrics, and a base field dictionary, together with control over the model design approval process.

07 Summary

This article has looked at model design for the data middle platform. First establish the design goals. Then, through a series of steps, gradually standardize the scattered, messy, chimney-style small data warehouses into a reusable, shared data middle platform. Finally, achieve systematic management through a product. To close, a few points are worth emphasizing again:

1. Completeness, reusability, and standardization make up the measurement system for middle platform model design, and they can help you evaluate your data warehouse design.

2. Dimension design is the soul of dimensional modeling and the foundation of middle platform model design. The core of dimension design is building conformed dimensions.

3. The statistical granularity of a fact table must be consistent. Data of different statistical granularity must not appear in the same fact table.

Building a data middle platform often takes six months to a year or even longer, but once it is built, the improvement in R&D efficiency is very obvious. In NetEase's e-commerce business, compared with before the middle platform was built, the average delivery time for data requirements was shortened from one week to 3 days. Requirements are answered faster, and data supports the enterprise better.

Food for thought:

When a data middle platform is actually rolled out, the data team not only has to build the public data layer and form the middle platform, it also has to bear heavy pressure from new requirements. Moreover, requirements often take priority over building the public data layer, so the progress of middle platform construction is hard to guarantee.

What solutions do you have for this problem?

For example:

1. Meet the requirements first (survive), then develop the public data layer (build a better future).

2. Win the support of senior leadership in order to obtain more R&D resources.

3. While meeting business requirements, iterate on and optimize the public data layer according to those requirements.

4. As time goes on, more and more day-to-day business requirements can be met by the public data layer (the middle platform).

5. The development of day-to-day business requirements and the construction of the public data layer form a mutually reinforcing cycle.

In addition, to keep middle platform construction moving, you can try setting up a dedicated team whose explicit goal is middle platform construction: refactoring and integrating models and sorting out metrics. These people do not take on business requirements, which keeps day-to-day business requirements from interfering with the data team's middle platform work. Set reasonable KPIs and KPI weights to give middle platform construction enough momentum.

Copyright notice
This article was written by [000X000]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/205/202207240018085461.html