
Introduction to data platform

2022-06-24 08:40:00 An unreliable programmer

Goals

  1. Provide stable and reliable data for the various business platforms.
  2. Provide a general-purpose data processing pipeline.
  3. Produce data sets that are subject-oriented, integrated, and time-variant, while the information itself stays relatively stable.
  4. Integrate historical data from multiple data sources for fine-grained processing and multidimensional analysis.
  5. Put simply, it is a process of reading data -> producing data -> delivering data.

Key concepts

ETL

ETL is short for Extraction-Transformation-Loading: data extraction, transformation, and loading. ETL pulls data from distributed, heterogeneous sources, such as relational databases and flat data files, into a temporary staging layer, where it is cleaned, transformed, and integrated. The result is then loaded into a data warehouse or data mart, where it becomes the foundation for online analytical processing and data mining. ETL is the most important part of a BI project: it usually takes about one third of the whole project's time, and the quality of the ETL design directly affects whether the BI project succeeds or fails. ETL is also a long-term process. Only by continually finding and fixing problems can we make it run more efficiently and supply accurate data for the later stages of the project.
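To make the three stages concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the sample sources (`CSV_SOURCE`, `DB_SOURCE`), the `orders` table, and the use of an in-memory SQLite database as a stand-in for the warehouse are not part of the platform described here.

```python
import csv
import io
import sqlite3

# Hypothetical heterogeneous sources: a flat CSV file and rows from a relational system.
CSV_SOURCE = "order_id,amount,city\n1, 10.50 ,beijing\n2,,shanghai\n3,7.25,Beijing\n"
DB_SOURCE = [{"order_id": 4, "amount": "3.00", "city": "SHANGHAI"}]


def extract():
    """Extract: read raw rows from both sources into one staging list."""
    rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    rows.extend(DB_SOURCE)
    return rows


def transform(rows):
    """Transform: drop rows without an amount, fix types, normalize city casing."""
    cleaned = []
    for row in rows:
        amount = str(row["amount"]).strip()
        if not amount:
            continue  # cleaning step: skip incomplete records
        city = row["city"].strip().lower()
        cleaned.append((int(row["order_id"]), float(amount), city))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the (stand-in) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, city TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run extract -> transform -> load and return the warehouse connection."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract()))
    return conn


if __name__ == "__main__":
    warehouse = run_etl()
    query = "SELECT city, SUM(amount) FROM orders GROUP BY city ORDER BY city"
    for city, total in warehouse.execute(query):
        print(city, total)
```

Keeping the three stages as separate functions means each one can later be scheduled and monitored as its own task.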

Data warehouse

A data warehouse (Data Warehouse, abbreviated DW or DWH) is a strategic collection of data that supports decision-making at every level of the enterprise, across all types of data. It is a single data store created for analytical reporting and decision support. For businesses that need business intelligence, it provides guidance for improving business processes and for monitoring and controlling time, cost, and quality.

Problems to be solved at present

  1. A task scheduling and monitoring platform is needed to manage, schedule, and monitor the series of scripts that read, produce, and deliver data.
  2. An API platform is needed to serve ad hoc queries on some of the data.
  3. A data synchronization platform is needed to sync the produced data to each business system.
  4. A data inspection platform is needed to control the quality of the delivered data.
  5. A BI data display platform is needed to clearly present the dimensions of data that different roles care about.
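As a rough illustration of what the data inspection in item 4 could check before delivery, the sketch below validates required fields and key uniqueness. The function name `check_quality` and the sample field names are hypothetical; a real inspection platform would cover many more rules.

```python
def check_quality(rows, required, unique_key):
    """Return a list of human-readable problems found in the rows.

    rows:       list of dicts, one per record about to be delivered
    required:   field names that must be present and non-empty
    unique_key: field name whose value must not repeat across rows
    """
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        # Completeness check: every required field must have a value.
        for field in required:
            if row.get(field) in (None, ""):
                problems.append(f"row {i}: missing {field}")
        # Uniqueness check: the key must not have appeared in an earlier row.
        key = row.get(unique_key)
        if key in seen:
            problems.append(f"row {i}: duplicate {unique_key}={key}")
        seen.add(key)
    return problems


if __name__ == "__main__":
    sample = [{"id": 1, "amount": 5}, {"id": 1, "amount": None}]
    for problem in check_quality(sample, required=["id", "amount"], unique_key="id"):
        print(problem)
```

An empty result means the batch passed these two checks and can be delivered.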

Solution

  1. Use Airflow to build the ETL system, that is, to orchestrate and schedule the series of scripts for data collection, cleaning, summarization, aggregation, and pre-computation of multi-dimensional metrics. Airflow also provides task monitoring and a web UI that visualizes task dependencies.
  2. Use DataX for data synchronization.
  3. Use Lumen to build the API platform.
  4. The data inspection platform and the BI display are out of scope for phase one.
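The idea behind task dependencies in item 1 can be sketched with Python's standard library alone (`graphlib`, Python 3.9+). This is not Airflow's API. The task names in `PIPELINE` and the helpers `execution_order` and `run_pipeline` are hypothetical, chosen only to mirror the collect, clean, aggregate, pre-compute, and sync steps described above.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "collect": set(),
    "clean": {"collect"},
    "aggregate": {"clean"},
    "precompute_metrics": {"aggregate"},
    "sync_to_business": {"precompute_metrics"},
}


def execution_order(pipeline):
    """Return task names ordered so every task comes after its dependencies."""
    return list(TopologicalSorter(pipeline).static_order())


def run_pipeline(pipeline, actions):
    """Run each task's callable in dependency order and record its status.

    A task is skipped when any of its dependencies did not succeed, which is
    the kind of state a monitoring view would display.
    """
    status = {}
    for name in execution_order(pipeline):
        if any(status.get(dep) != "success" for dep in pipeline[name]):
            status[name] = "skipped"
            continue
        try:
            actions[name]()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status


if __name__ == "__main__":
    actions = {name: (lambda: None) for name in PIPELINE}
    print(execution_order(PIPELINE))
    print(run_pipeline(PIPELINE, actions))
```

A scheduler such as Airflow adds what this sketch leaves out: time-based triggering, retries, logs, and the web UI.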

Technology stack

Airflow (Python), Lumen, PostgreSQL, DataX, Elasticsearch
Later, depending on data volume, we will add offline computing on a distributed Spark cluster, HDFS storage, stream computing, Hive, and so on.

Ideal state

Later, log analysis can be connected to the ETL system to analyze user behavior, build user profiles, and improve system security.
Daily, weekly, and annual performance reports, along with other data displays and summaries, can be served with lower latency while reducing the load on the business systems.
ERP data is collected and analyzed to give leadership a reference for decision-making.
APP logs are summarized and analyzed to give product design and operations factual data to work from.
At the same time, big data analysis remains manageable as data volume grows rapidly.

“Rome was not built in a day”
