Tool comparison:
Kettle (a conventional ETL tool)
Characteristics: written in pure Java.
Advantages: runs on Windows, Linux, and Unix; data extraction is efficient and stable; its Spoon component offers a large number of steps, so complex business logic can be developed, and both full and incremental synchronization are easy to implement.
Disadvantages: it runs on a schedule, so real-time performance is poor.
Components:
Spoon: a graphical interface for designing ETL data transformations.
Pan: runs the transformations designed in Spoon in batch mode.
Chef: jobs (a job has state, so you can monitor whether it has executed, how fast it is executing, and so on).
Kitchen: runs the jobs designed in Chef in batch mode.
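Because Kettle work is usually triggered by a scheduler, a saved transformation or job typically ends up being launched from the command line through Pan or Kitchen. The following is a minimal Python sketch of assembling that command. The install directory, file paths, and parameter names are hypothetical, and the `-file`, `-level`, and `-param:` options should be checked against the Kettle version in use.

```python
from pathlib import PurePosixPath

# Assumption: Kettle (PDI) is unpacked here; adjust to the real install path.
KETTLE_HOME = PurePosixPath("/opt/data-integration")


def build_kettle_command(artifact, params=None, log_level="Basic"):
    """Return the argv list that runs a .ktr with Pan or a .kjb with Kitchen."""
    artifact = PurePosixPath(artifact)
    if artifact.suffix == ".ktr":      # transformation -> Pan
        launcher = "pan.sh"
    elif artifact.suffix == ".kjb":    # job -> Kitchen
        launcher = "kitchen.sh"
    else:
        raise ValueError(f"not a Kettle transformation or job: {artifact}")

    argv = [str(KETTLE_HOME / launcher), f"-file={artifact}", f"-level={log_level}"]
    # Named parameters are passed as -param:NAME=VALUE, in a stable order.
    for name, value in sorted((params or {}).items()):
        argv.append(f"-param:{name}={value}")
    return argv


if __name__ == "__main__":
    # Hypothetical job file and parameter; printed here rather than executed.
    print(" ".join(build_kettle_command("/etl/daily_sync.kjb", {"RUN_DATE": "2024-01-01"})))
```

The resulting list can be handed to `subprocess.run` or written into a cron entry, which is where the scheduled (non-real-time) behaviour comes from.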
Sqoop (less commonly used)
Characteristics: mainly used to transfer data between HDFS and relational databases.
DataX (an offline data synchronization tool used by Alibaba, open source)
Characteristics: synchronizes data between different types of data sources (including relational databases, distributed file systems, and so on).
Advantages: simple to operate, with only two steps: first create the job's configuration file, then start the job from that configuration file.
Disadvantages: no built-in support for incremental updates, although incremental synchronization can be achieved by writing your own shell script or by similar means.
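One common way to work around the missing incremental support is to regenerate the job configuration before each run with a time-window filter. The sketch below uses Python rather than shell and assumes a MySQL source read by the `mysqlreader` plug-in, whose `where` parameter filters rows, with `streamwriter` printing the result. The table, column names, JDBC URL, and credentials are all hypothetical placeholders.

```python
import json
from datetime import datetime


def build_incremental_job(last_sync, now, table="orders", time_column="updated_at"):
    """Build a DataX job dict that reads only rows changed in [last_sync, now)."""
    fmt = "%Y-%m-%d %H:%M:%S"
    where = (
        f"{time_column} >= '{last_sync.strftime(fmt)}' "
        f"AND {time_column} < '{now.strftime(fmt)}'"
    )
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "username": "etl_user",          # hypothetical credentials
            "password": "change-me",
            "column": ["id", "amount", time_column],
            "where": where,                  # the incremental filter
            "connection": [
                {"table": [table], "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/shop"]}
            ],
        },
    }
    # streamwriter just prints rows; swap in a real writer plug-in as needed.
    writer = {"name": "streamwriter", "parameter": {"print": True}}
    return {
        "job": {
            "setting": {"speed": {"channel": 1}},
            "content": [{"reader": reader, "writer": writer}],
        }
    }


def write_job_file(job, path):
    """Step one of the two-step workflow: create the job configuration file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(job, fh, indent=2)


if __name__ == "__main__":
    job = build_incremental_job(datetime(2024, 1, 1), datetime(2024, 1, 2))
    write_job_file(job, "incremental_job.json")
```

Step two is then to start the job, typically with `python datax.py incremental_job.json` from the DataX `bin` directory, and to record `now` as the next run's `last_sync` only after the job succeeds.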
Components:
Job: a data synchronization job.
Splitter: the job-splitting module, which decomposes a large task into multiple small tasks that run concurrently.
Sub-job: one of the small synchronization tasks.
Reader (Loader): the data input module, which runs the small tasks produced by the split and loads data from the source into DataX.
Storage: Reader and Writer exchange data through Storage.
Writer (Dumper): the data output module, which moves data from DataX into the destination.
Internally, the DataX framework uses techniques such as double-buffered queues and thread-pool encapsulation to handle high-speed data exchange, and it exposes a simple interface for interacting with plug-ins. Plug-ins come in two types, Reader and Writer. Because they are built on the interface the framework provides, the plug-ins you need are easy to develop.
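The division of labour among Splitter, Reader, Storage, and Writer can be illustrated with a toy Python model. This is not DataX's actual code: a single bounded `queue.Queue` stands in for the double-buffered queue, a `ThreadPoolExecutor` stands in for the framework's thread pool, and all function names are made up for illustration.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel telling the writer that all readers have finished


def split(rows, parts):
    """Splitter: cut one job into at most `parts` roughly equal sub-jobs."""
    size = max(1, -(-len(rows) // parts))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def reader(sub_job, storage):
    """Reader: load every record of one sub-job into the shared storage."""
    for record in sub_job:
        storage.put(record)  # blocks while the buffer is full


def writer(storage, sink):
    """Writer: drain storage into the destination until the sentinel arrives."""
    while True:
        record = storage.get()
        if record is _DONE:
            break
        sink.append(record)


def run_job(rows, parts=4, capacity=8):
    """Job: split, read concurrently, and write everything to a list sink."""
    storage = queue.Queue(maxsize=capacity)  # bounded buffer between the two sides
    sink = []
    with ThreadPoolExecutor(max_workers=parts + 1) as pool:
        writer_future = pool.submit(writer, storage, sink)
        reader_futures = [pool.submit(reader, sj, storage) for sj in split(rows, parts)]
        for future in reader_futures:
            future.result()      # wait for every reader, re-raising any error
        storage.put(_DONE)
        writer_future.result()
    return sink


if __name__ == "__main__":
    copied = run_job(list(range(100)))
    print(len(copied), "records copied")
```

Records arrive in the sink in whatever order the reader threads produce them, so the sketch guarantees completeness, not ordering.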
StreamSets (currently widely used)
Characteristics: lightweight, with a powerful engine; supports real-time streaming data extraction; developers can easily build both batch and streaming data flows with very little code.
Components:
Data Collector: routes and processes data.
Pipeline: