DTStack technology sharing: how to use DTStack for data collection
2022-06-24 15:18:00 【DTStack DTinsight】
DTStack is a cloud-native, one-stop data middle platform PaaS. We maintain an open source project on GitHub and Gitee called FlinkX. FlinkX is a unified batch and stream data synchronization tool based on Flink. It can collect static data as well as data that changes in real time, and it serves as a global, heterogeneous, batch-stream integrated data synchronization engine. If you like it, please give us a star!
GitHub open source project: https://github.com/DTStack/flinkx
Gitee open source project: https://gitee.com/dtstack_dev_0/flinkx
I. Where to collect data
When we talk about big data, use big data, and expect data to generate value, the premise is that we have the data in the first place.
A data middle platform is often summarized as "store", "connect", and "use". "Store" comes first: the data has to be placed in the middle platform, where previously it went into a data warehouse. On top of "store", we organize data from different sources and in different formats and break down the data silos one by one, so that data resources become data assets. We then build data applications for each user's specific scenario.
Data does not appear out of thin air. Kangaroo Cloud's DTStack provides both offline and real-time data synchronization and collection, helping users gather scattered data resources efficiently and bring them together. As a tool, it performs "global" data collection and lays the foundation for building a data platform.
II. How to collect data
1. Offline data synchronization and collection
A visually configured data synchronization task is shown in the figure below:
FlinkX, DTStack's data synchronization tool, acts as a "bridge" between different storage systems and is a basic core function of the data middle platform. It supports a variety of heterogeneous storage systems (MySQL, SQL Server, Oracle, etc.), and its plug-in architecture makes it possible to add support for new data sources as needed. It is built on Flink's distributed architecture, so it supports large-volume, highly concurrent synchronization, with better performance and stability than single-node synchronization.
The solution meets synchronization requirements at the minute (5 minutes), hour, and day levels.
The data synchronization interface of Kangaroo Cloud's DTStack is shown in the figure below:
The FlinkX data synchronization module is a pipeline for exchanging data between storage units. To mine and compute over large datasets in the data middle platform, the usual approach is to transfer the data into the platform before a task runs and, when the task finishes, transfer the results to an external storage unit (for example, MySQL).
The role of data integration is shown in the figure below:
The data synchronization module has the following features:
1) Rich data source support
The data synchronization module can read data from or write data to data sources such as MySQL, Oracle, SQLServer, PostgreSQL, DB2, HDFS (Textfile/Parquet/ORC), Hive, HBase, FTP, ElasticSearch, MaxCompute, Redis, MongoDB, and CarbonData. You only need to configure the connection information for the data source (for example, the JDBC URL, user name, and password of an Oracle database) and then configure the corresponding data synchronization task.
2) Distributed system architecture
The data synchronization module uses a distributed architecture in which multiple nodes read and write data concurrently, which greatly improves synchronization throughput. Compared with open source data synchronization solutions such as Sqoop and Kettle, it offers higher throughput and more complete supporting features.
3) Wizard / custom configuration modes
Wizard mode:
Wizard mode is convenient and simple: with visual field mapping, a synchronization task can be configured quickly. Creating and configuring a task in wizard mode mainly involves selecting the source database and source table, selecting the target database and target table, configuring the field mapping, and configuring the synchronization speed.
Script mode:
Script mode is powerful and efficient: it allows deep tuning and supports all data sources. The configuration is completed by writing a JSON script.
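To make the idea of a JSON script more concrete, here is a minimal sketch in Python that assembles a job description and prints it as JSON. The article does not show the actual script format, so the plugin names (`mysqlreader`, `hdfswriter`), the key names, and the helper `build_sync_job` are all assumptions modeled on a generic reader/writer layout. Check the FlinkX documentation for the version you run before relying on any of them.

```python
import json


def build_sync_job(jdbc_url, table, columns, hdfs_path, channels=1):
    """Assemble a reader/writer job description as a plain dict.

    All key and plugin names below are illustrative assumptions,
    not a confirmed FlinkX schema.
    """
    return {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "mysqlreader",  # assumed plugin name
                        "parameter": {
                            "username": "<user>",      # placeholder
                            "password": "<password>",  # placeholder
                            "connection": [
                                {"jdbcUrl": [jdbc_url], "table": [table]}
                            ],
                            "column": columns,
                        },
                    },
                    "writer": {
                        "name": "hdfswriter",  # assumed plugin name
                        "parameter": {
                            "path": hdfs_path,
                            "fileType": "orc",
                            "column": columns,
                        },
                    },
                }
            ],
            # Assumed setting: number of parallel channels for the job.
            "setting": {"speed": {"channel": channels}},
        }
    }


if __name__ == "__main__":
    job = build_sync_job(
        "jdbc:mysql://localhost:3306/shop",  # hypothetical source database
        "orders",
        ["id", "amount", "modify_time"],
        "/warehouse/ods/orders",             # hypothetical target path
    )
    print(json.dumps(job, indent=2))
```

Building the script from a function rather than by hand keeps the source and target column lists in step, which is the kind of detail that wizard mode otherwise handles through visual field mapping.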
4) Scheduling and dependency configuration
In real data production, data synchronization tasks are usually the first and last tasks in a data processing pipeline, responsible for "extracting data from business systems" and "writing out result data" respectively. 【Offline computing - Development Kit】 supports configuring dependencies for synchronization tasks, which constrains the execution order of synchronization tasks and other tasks.
Data synchronization tasks usually run periodically, for example once a day, once a week, or once every hour or minute (5 minutes). 【Offline computing - Development Kit】 supports configuring a schedule period for synchronization tasks so that they run on a regular basis. For details of scheduling and dependency configuration, see the "Data development: building data analysis logic" section.
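The effect of dependency configuration can be sketched as a topological ordering of tasks. The following Python example is only an illustration of that idea, not DTStack's scheduler. The task names are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """Return task names in an order that respects the dependencies.

    `dependencies` maps each task name to the set of tasks that must
    finish before it starts. Raises graphlib.CycleError if the
    dependencies form a loop.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Hypothetical pipeline: a sync task first, a sync task last.
    pipeline = {
        "extract_orders": set(),              # read from the business system
        "clean_orders": {"extract_orders"},   # processing inside the platform
        "export_report": {"clean_orders"},    # write results back out
    }
    print(execution_order(pipeline))

    try:
        execution_order({"a": {"b"}, "b": {"a"}})
    except CycleError:
        print("circular dependency rejected")
```

A periodic schedule then amounts to running this ordered list once per configured period.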
5) Full / incremental synchronization
When reading data from a business system, incremental synchronization is usually needed to minimize the impact on that system. When the source table has a field that records the data change time, 【Offline computing - Development Kit】 supports incremental synchronization for relational databases. Users only need to enter the corresponding data filtering statement.
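As an illustration of such a filtering statement, the sketch below builds a filter that selects only rows changed during the previous day. The column name `modify_time` and both helper functions are hypothetical, and the exact filter syntax the product accepts is not given in this article.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def daily_window(run_time):
    """Return (start, end) covering the full day before `run_time`."""
    end = datetime(run_time.year, run_time.month, run_time.day)
    return end - timedelta(days=1), end


def incremental_filter(column, window_start, window_end):
    """Build a half-open time filter: start <= column < end.

    `column` must be a trusted identifier; it is interpolated as-is.
    """
    return (
        f"{column} >= '{window_start.strftime(TIME_FORMAT)}' "
        f"AND {column} < '{window_end.strftime(TIME_FORMAT)}'"
    )


if __name__ == "__main__":
    start, end = daily_window(datetime(2022, 6, 24, 15, 18))
    print(incremental_filter("modify_time", start, end))
```

The half-open interval means consecutive daily runs neither overlap nor leave a gap at midnight.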
6) Whole-database synchronization
Whole-database synchronization is a shortcut that improves efficiency and reduces cost for users: it can quickly upload all tables in a MySQL database to the data platform and saves a lot of effort. If the database has 100 tables, you might otherwise have to configure a data synchronization task 100 times. With whole-database upload it can be done in one pass (this requires the database's table design to be highly standardized).
In the whole-database synchronization configuration, users can select the tables to synchronize in bulk and configure settings such as full or incremental mode and synchronization batches. Custom table names and field types are also supported, which keeps the feature flexible as well as convenient.
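Conceptually, whole-database synchronization expands one configuration into one task per table. The sketch below shows that expansion with per-table overrides. The function name, the `ods_` target prefix, and the dictionary keys are assumptions for illustration only.

```python
def plan_whole_database_sync(tables, mode="full", target_prefix="ods_", overrides=None):
    """Expand a table list into one sync plan per table.

    `overrides` maps a table name to settings that replace the defaults,
    for example a custom target table name or incremental mode.
    """
    overrides = overrides or {}
    plans = []
    for table in tables:
        plan = {
            "source_table": table,
            "target_table": f"{target_prefix}{table}",
            "mode": mode,
        }
        plan.update(overrides.get(table, {}))
        plans.append(plan)
    return plans


if __name__ == "__main__":
    plans = plan_whole_database_sync(
        ["users", "orders"],
        overrides={"orders": {"mode": "incremental"}},
    )
    for plan in plans:
        print(plan)
```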
7) Sharded databases and tables (MySQL), FTP multi-path synchronization
The data synchronization module supports synchronization for relational databases that use sharded databases and tables. Users only need to select multiple tables and multiple databases on the page (all tables must have the same structure).
Besides sharded relational databases, a single task can also read multiple files from multiple FTP paths, which reduces repetitive task configuration.
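The requirement that every sharded table has the same structure can be checked before a task is configured. The sketch below lists shard names and reports any table whose columns differ from the first one. The naming pattern `<table>_<n>` and both helpers are hypothetical.

```python
def expand_shards(databases, table_prefix, shard_count):
    """List fully qualified shard names such as 'db0.orders_0'."""
    return [
        f"{database}.{table_prefix}_{index}"
        for database in databases
        for index in range(shard_count)
    ]


def tables_with_different_structure(schemas):
    """Return the tables whose column list differs from the first table.

    `schemas` maps a table name to a list of (column, type) pairs.
    """
    items = list(schemas.items())
    if not items:
        return []
    _, reference = items[0]
    return [name for name, columns in items[1:] if columns != reference]


if __name__ == "__main__":
    print(expand_shards(["db0", "db1"], "orders", 2))
    schemas = {
        "db0.orders_0": [("id", "bigint"), ("amount", "decimal")],
        "db0.orders_1": [("id", "bigint"), ("amount", "decimal")],
        "db1.orders_0": [("id", "bigint"), ("amount", "varchar")],
    }
    print(tables_with_different_structure(schemas))
```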
8) Synchronization speed control
During initial synchronization, there is often a large amount of historical data to move into the middle platform, so data reading needs to be sped up. When the business database is under heavy load, read and write speed needs to be reduced to relieve the pressure on the database.
The data synchronization module supports speed control by setting an upper limit on the synchronization rate. This parameter should be tuned according to hardware configuration and data volume, and users choose the value according to business requirements.
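One simple way to think about a rate upper limit is as a pause inserted after each chunk so that the average throughput stays under the cap. The sketch below illustrates that calculation only. It is not how FlinkX implements speed control, and the class and method names are hypothetical.

```python
class RateLimiter:
    """Keep the average transfer rate at or below a byte-per-second cap."""

    def __init__(self, max_bytes_per_second):
        if max_bytes_per_second <= 0:
            raise ValueError("max_bytes_per_second must be positive")
        self.rate = max_bytes_per_second
        self.sent = 0

    def delay_after(self, chunk_bytes, elapsed_seconds):
        """Record a chunk and return how long to pause, in seconds.

        `elapsed_seconds` is the time since the transfer started.
        The pause is zero when the transfer is already under the cap.
        """
        self.sent += chunk_bytes
        earliest_allowed = self.sent / self.rate
        return max(0.0, earliest_allowed - elapsed_seconds)


if __name__ == "__main__":
    limiter = RateLimiter(max_bytes_per_second=1000)
    # 500 bytes after 0.1 s is ahead of the cap, so a pause is needed.
    print(limiter.delay_after(500, elapsed_seconds=0.1))
    # 1000 bytes in total after 2.0 s is under the cap, so no pause.
    print(limiter.delay_after(500, elapsed_seconds=2.0))
```

A caller would pass the returned value to `time.sleep`. Raising the cap suits the initial historical load, and lowering it protects a busy business database.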
2. Real-time data synchronization and collection
The figure above shows the real-time data stream synchronization architecture, explained as follows:
1) Oracle and SQLServer data sources: users need to purchase and deploy the OGG real-time collection tool themselves to capture Oracle redo log data in real time. The data is then written to Kafka through visual configuration in DTStack's DTinsightStream product, where it is archived or consumed in real time.
2) MySQL data source: DTStack's DTinsightStream product integrates the Canal data collection tool, which captures MySQL binlog data in real time. The data is written directly to Kafka through visual configuration, where it is archived or consumed in real time.
3) Log data sources: the real-time log collection module of DTStack's DTinsightStream product is built on the jLogstash component (modified to run in a distributed fashion, unlike open source jLogstash). It can schedule resources in a distributed manner on YARN and writes data directly to Kafka through visual configuration, where it is archived or consumed in real time.
The real-time collection module is convenient and flexible to configure on the web page. As with offline data synchronization tasks, it supports two configuration modes, wizard and script. Taking MySQL real-time collection as an example, users only need to configure the data source, tables, and some filter conditions on the page.
Beyond configuration, while a real-time collection task is running, the system can also monitor the input and output data in real time and raise alerts.
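To show what consuming such change data in real time might look like, the sketch below applies binlog-style change events to an in-memory copy of a table. The article does not specify the message layout that reaches Kafka, so the fields used here (`type`, `pkNames`, `data`, `isDdl`) are an assumption modeled on Canal's flat JSON message. The Kafka consumer itself is left out so that the example runs without a broker.

```python
import json


def apply_change_event(table_state, message):
    """Apply one change event to `table_state` and return it.

    `table_state` maps a primary-key tuple to the latest row.
    The assumed message fields are illustrative, not a confirmed format.
    Note: an UPDATE that changes the primary key is not handled here.
    """
    event = json.loads(message)
    if event.get("isDdl"):
        return table_state  # schema changes are ignored in this sketch
    pk_names = event["pkNames"]
    for row in event["data"]:
        key = tuple(row[name] for name in pk_names)
        if event["type"] == "DELETE":
            table_state.pop(key, None)
        elif event["type"] in ("INSERT", "UPDATE"):
            table_state[key] = row
    return table_state


if __name__ == "__main__":
    state = {}
    messages = [
        '{"type": "INSERT", "pkNames": ["id"], "data": [{"id": "1", "status": "new"}]}',
        '{"type": "UPDATE", "pkNames": ["id"], "data": [{"id": "1", "status": "paid"}]}',
        '{"type": "INSERT", "pkNames": ["id"], "data": [{"id": "2", "status": "new"}]}',
        '{"type": "DELETE", "pkNames": ["id"], "data": [{"id": "2", "status": "new"}]}',
    ]
    for message in messages:
        apply_change_event(state, message)
    print(state)
```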