
Data stack technology sharing: how to use data stack for data collection?

2022-06-24 15:18:00 【Data stack dtinsight】

Data Stack is a cloud-native, one-stop data middle platform PaaS. We have an interesting open source project on GitHub and Gitee: FlinkX. FlinkX is a Flink-based data synchronization tool that unifies batch and stream processing. It can collect static data as well as data that changes in real time, making it a global, heterogeneous, batch-stream integrated data synchronization engine. If you like it, please give us a star!

GitHub open source project: https://github.com/DTStack/flinkx

Gitee open source project: https://gitee.com/dtstack_dev_0/flinkx

I. Where to collect data

We talk about big data, we use big data, and we expect data to generate value. The premise for all of this is that we first need to have data.

When we talk about a data middle platform, we talk about "store", "connect", and "use". The first is "store": the data has to be put into the middle platform (in the past, into a data warehouse). On the basis of "store", we sort out data from different sources and in different formats and break down data silos one by one, so that data resources become data assets. Data applications are then built for the user's specific scenarios.

Data does not appear out of thin air. Kangaroo Cloud's Data Stack provides both offline and real-time data synchronization to help users efficiently collect scattered data resources and bring them together. It performs "global" data collection with tooling and lays the foundation for building the data platform.

II. How to collect data

1. Offline data synchronization

A visually configured data synchronization task is shown in the figure below:

FlinkX, the data synchronization tool of Data Stack, acts as a "bridge" between different storage systems and is a core basic function of the data middle platform. It supports data from a variety of heterogeneous storage systems (MySQL, SQL Server, Oracle, etc.), and its plug-in architecture can accommodate new data source requirements at any time. It is built on Flink's distributed architecture and supports large-volume, high-concurrency synchronization, with better performance and stability than single-node synchronization.

The solution meets synchronization requirements at minute (5-minute), hourly, daily, and other granularities.

The data synchronization interface of Kangaroo Cloud's Data Stack is shown in the figure below:

The data synchronization module, FlinkX, is a pipeline for exchanging data between storage units. To mine and compute on large-scale datasets in the data middle platform, the usual approach is to transfer the data into the middle platform before a task runs and to transfer the results to an external storage unit (for example, MySQL) when the task finishes.

The role of data integration is shown in the figure below:

The features of the data synchronization module include the following:

1) Rich data source support

The data synchronization module supports reading from and writing to data sources such as MySQL, Oracle, SQLServer, PostgreSQL, DB2, HDFS (Textfile/Parquet/ORC), Hive, HBase, FTP, ElasticSearch, MaxCompute, Redis, MongoDB, and CarbonData. You only need to configure the connection information of the data source (for example, the JDBC URL, user name, and password of an Oracle database) and then configure the corresponding data synchronization task.

2) Distributed system architecture

The data synchronization module adopts a distributed system architecture in which multiple nodes read and write data concurrently, which greatly improves synchronization throughput. Compared with open source data synchronization solutions such as Sqoop and Kettle, it offers higher data throughput and more complete supporting features.

3) Wizard / custom configuration modes

Wizard mode:

Wizard mode is convenient and simple: with visual field mapping, a synchronization task can be configured quickly. Creating and configuring a synchronization task in wizard mode mainly involves selecting the source database and source table, selecting the target database and target table, configuring the field mapping, and configuring the synchronization speed.

Script mode:

Script mode is versatile and efficient, allows deep tuning, and supports all data sources. The configuration is completed by writing a JSON script.
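To make script mode more concrete, here is a minimal sketch that assembles a reader/writer job description in Python and serializes it to JSON. The nesting (a job with one reader, one writer, and a speed setting) and every plugin name, parameter key, path, and credential below are illustrative assumptions, not a verified schema; the FlinkX documentation is the authority on what each plugin actually accepts.

```python
import json


def build_sync_job(reader_name, reader_params, writer_name, writer_params, channels=1):
    """Assemble a reader/writer job description as a plain dictionary."""
    return {
        "job": {
            "content": [
                {
                    "reader": {"name": reader_name, "parameter": reader_params},
                    "writer": {"name": writer_name, "parameter": writer_params},
                }
            ],
            # Assumed setting: number of concurrent channels for the job.
            "setting": {"speed": {"channel": channels}},
        }
    }


# Hypothetical source: a MySQL table read over JDBC.
reader_params = {
    "username": "demo_user",
    "password": "******",
    "column": ["id", "name"],
    "connection": [
        {"jdbcUrl": ["jdbc:mysql://localhost:3306/demo"], "table": ["orders"]}
    ],
}

# Hypothetical target: ORC files on HDFS.
writer_params = {
    "path": "/tmp/demo/orders",
    "fileType": "orc",
    "column": [
        {"name": "id", "type": "bigint"},
        {"name": "name", "type": "string"},
    ],
}

job = build_sync_job("mysqlreader", reader_params, "hdfswriter", writer_params, channels=2)
job_json = json.dumps(job, indent=2)
print(job_json)
```

Generating the script from code rather than writing it by hand is only one option; its benefit is that the same template can be reused across many tables.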

4) Scheduling and dependency configuration

In real data production, data synchronization tasks are usually the first and the last tasks in a data processing pipeline, responsible respectively for "extracting data from business systems" and "writing out the result data". [Offline Computing - Development Kit] supports configuring dependencies on synchronization tasks, which constrains the execution order of synchronization tasks and other tasks.

Data synchronization tasks are usually run periodically: once a day, once a week, once an hour, or once every few minutes (5 minutes). [Offline Computing - Development Kit] supports configuring a schedule period for synchronization tasks so that they run at regular intervals. For details on scheduling and dependency configuration, see the section "Data development: building data analysis logic".
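The effect of dependency configuration can be illustrated with a small sketch: given which tasks depend on which, derive an order in which every task runs after its upstream tasks. The task names are hypothetical, and this uses Python's standard `graphlib` module (Python 3.9+) purely as an illustration; it is not how the Development Kit schedules tasks internally.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),                # synchronization task: read from the business system
    "clean_orders": {"extract_orders"},     # processing task
    "aggregate_sales": {"clean_orders"},    # processing task
    "export_report": {"aggregate_sales"},   # synchronization task: write out the result data
}


def execution_order(deps):
    """Return the tasks in an order where every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

The first and last entries are the synchronization tasks, matching the "extract first, write out last" pattern described above.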

5) Full / incremental synchronization

When reading data from a business system, incremental synchronization is usually needed to minimize the impact on that system. When the source table has a field recording the data change time, [Offline Computing - Development Kit] supports incremental data synchronization for relational databases; users only need to enter the corresponding data filter statement.
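The idea behind a change-time filter can be sketched as follows. This example uses an in-memory SQLite database so that it runs anywhere; the table name, column names, and timestamps are made up, and a real task would express the same condition as the filter statement entered in the synchronization configuration.

```python
import sqlite3


def fetch_increment(conn, table, time_column, last_sync):
    """Return the rows modified strictly after the previous synchronization point."""
    # Table and column names are interpolated, so they must come from trusted configuration.
    query = (
        f"SELECT id, name, {time_column} FROM {table} "
        f"WHERE {time_column} > ? ORDER BY {time_column}"
    )
    return conn.execute(query, (last_sync,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, name TEXT, modified_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "a", "2022-06-23 09:00:00"),
        (2, "b", "2022-06-24 10:30:00"),
        (3, "c", "2022-06-24 14:00:00"),
    ],
)

last_sync = "2022-06-24 00:00:00"
rows = fetch_increment(conn, "orders", "modified_at", last_sync)
# The newest change time seen becomes the starting point for the next run.
new_watermark = rows[-1][2] if rows else last_sync
print(rows, new_watermark)
```

Only rows changed after the previous run are read, which is what keeps the load on the business database low.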

6) Whole-database synchronization

Whole-database synchronization is a shortcut that improves efficiency and reduces cost for users. It can quickly upload all tables in a MySQL database to the data platform, saving a lot of effort. Suppose the database has 100 tables: you might otherwise have to configure a data synchronization task 100 times, but with whole-database upload it can be done in one pass (this requires the table design of the database to be highly standardized).

In the whole-database synchronization configuration, users can select the tables to synchronize in bulk and configure full or incremental mode, synchronization batches, and other settings. Custom table names and field type configuration are also supported, which adds a high degree of flexibility on top of the convenience.
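As a rough sketch of what whole-database synchronization automates, the following lists the tables in a database and derives one task description per selected table. SQLite stands in for MySQL so that the example is self-contained, and the task fields and the `ods_` target prefix are hypothetical.

```python
import sqlite3


def plan_whole_database_sync(conn, selected=None, target_prefix="ods_"):
    """Build one task description per table, optionally restricted to `selected`."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    if selected is not None:
        names = [name for name in names if name in selected]
    return [
        {"source_table": name, "target_table": target_prefix + name, "mode": "full"}
        for name in names
    ]


conn = sqlite3.connect(":memory:")
for name in ("customers", "orders", "products"):
    conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, name TEXT)")

tasks = plan_whole_database_sync(conn, selected={"customers", "orders"})
for task in tasks:
    print(task)
```

Deriving target names mechanically only works when source tables follow consistent conventions, which is why the table design needs to be standardized.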

7) Sharded databases and tables (MySQL) and FTP multi-path synchronization

The data synchronization module supports synchronization from relational databases that are split across multiple databases and tables. Users only need to select the multiple tables and databases on the page (all of the tables must have the same structure).

In addition to sharded relational databases, a single task can also read multiple files from multiple FTP paths, which reduces repetitive task configuration work.
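The requirement that all shards share one structure is what allows a single task to treat them as one logical table. A minimal sketch, again with SQLite standing in for MySQL and with hypothetical shard names:

```python
import sqlite3


def read_shards(conn, tables, columns):
    """Yield the rows of every shard in turn, tagged with the shard they came from."""
    column_list = ", ".join(columns)
    for table in tables:
        # The same column list is used for every shard, so the structures must match.
        for row in conn.execute(f"SELECT {column_list} FROM {table} ORDER BY id"):
            yield (table, *row)


conn = sqlite3.connect(":memory:")
shards = {"orders_0": [(1, "a"), (3, "c")], "orders_1": [(2, "b")]}
for shard, rows in shards.items():
    conn.execute(f"CREATE TABLE {shard} (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(f"INSERT INTO {shard} VALUES (?, ?)", rows)

merged = list(read_shards(conn, ["orders_0", "orders_1"], ["id", "name"]))
print(merged)
```

The same pattern applies to reading several files from several FTP paths: one task iterates over a list of sources that share a format.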

8) Synchronization speed control

During initial synchronization, there is often a large amount of historical data to move into the middle platform, so data reading needs to be sped up. When the business database is under heavy load, read and write speed needs to be reduced to relieve pressure on the database.

The data synchronization module supports speed control by setting an upper limit on the synchronization rate. This parameter should be tuned to the hardware configuration and the data volume, and users choose the value according to their business requirements.
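One simple way to enforce an upper limit on the synchronization rate is to pace the sender against a schedule. The class below is an illustrative sketch under that assumption, not the mechanism FlinkX uses; the clock and sleep functions are injectable so that the behavior can be checked without waiting.

```python
import time


class RateLimiter:
    """Cap throughput at `max_records_per_second` by sleeping when ahead of schedule."""

    def __init__(self, max_records_per_second, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(max_records_per_second)
        self.clock = clock
        self.sleep = sleep
        self.start = clock()
        self.sent = 0

    def acquire(self, count=1):
        """Account for `count` records and wait until the schedule permits them."""
        self.sent += count
        # Earliest moment at which `sent` records are allowed under the cap.
        earliest = self.start + self.sent / self.rate
        delay = earliest - self.clock()
        if delay > 0:
            self.sleep(delay)
        return max(delay, 0.0)


# Three batches of 100 records at a cap of 10,000 records per second.
limiter = RateLimiter(max_records_per_second=10_000)
for batch in ([0] * 100, [0] * 100, [0] * 100):
    limiter.acquire(len(batch))
print("records accounted for:", limiter.sent)
```

Raising the cap suits the initial load of historical data; lowering it protects a busy business database.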

2. Real-time data synchronization

The real-time data stream synchronization architecture is shown in the figure above and is explained as follows:

1) Oracle and SQLServer data sources: users need to purchase and deploy the OGG real-time capture tool themselves to capture Oracle redo log data in real time. The data is then sent to Kafka through visual configuration in the Data Stack DTinsightStream product, where it is archived or consumed in real time.

2) MySQL data sources: the Data Stack DTinsightStream product has the Canal data capture tool integrated, which captures MySQL binlog data in real time. The data is sent directly to Kafka through visual configuration, where it is archived or consumed in real time.

3) Log data sources: the real-time log collection module of the Data Stack DTinsightStream product is built on the jLogstash component (with distributed adaptations compared with open source jLogstash) and can be scheduled as a distributed job on YARN. The data is sent directly to Kafka through visual configuration, where it is archived or consumed in real time.

The real-time collection module is convenient and flexible to configure on the web. Like offline data synchronization tasks, it supports two configuration modes: wizard and script. Taking MySQL real-time collection as an example, users only need to configure the data source, the tables, and some filter conditions on the page.

Beyond configuration, while a real-time collection task is running, the system can also monitor its input and output data in real time and raise alerts.
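The table and filter settings of a real-time collection task can be pictured as a function applied to each change event before it is sent to Kafka. The event layout below is a made-up simplification, not Canal's or OGG's actual message format, and the sketch stops at producing a key/value pair rather than calling a Kafka client.

```python
import json


def to_message(event, tables, operations=("INSERT", "UPDATE", "DELETE")):
    """Return a (key, value) pair of bytes for a change event, or None if filtered out."""
    if event["table"] not in tables or event["type"] not in operations:
        return None
    key = f'{event["schema"]}.{event["table"]}'.encode("utf-8")
    value = json.dumps(event, sort_keys=True).encode("utf-8")
    return key, value


# Hypothetical change event for a row update.
event = {
    "schema": "shop",
    "table": "orders",
    "type": "UPDATE",
    "data": {"id": 7, "status": "paid"},
}

message = to_message(event, tables={"orders"})
skipped = to_message({**event, "table": "audit_log"}, tables={"orders"})
print(message, skipped)
```

Using the schema and table name as the message key is one possible design choice; it keeps changes to the same table together when a topic is partitioned by key.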

