The Oracle database needs no introduction, and neither does Kafka. Kafka is a high-throughput distributed messaging system that has been adopted by hundreds of companies to manage their real-time data. Companies use Kafka for many applications (real-time stream processing, data synchronization, messaging, and more). But building ETL pipelines around Kafka was cumbersome until recently, when Kafka Connect arrived; it integrates source and target systems seamlessly through connectors known as source and sink connectors.
Kafka Connect is designed to make it easier to build large-scale, real-time data pipelines by standardizing how you move data into and out of Kafka. You can use Kafka connectors to read from or write to external systems, manage data flow, and scale the system—all without writing new code. Kafka Connect manages all the common problems in connecting with other systems (scalability, fault tolerance, configuration, and management), allowing each connector to focus only on how to best copy data between its target system and Kafka. Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency. An export connector can deliver data from Kafka topics into secondary indexes like Elasticsearch or into batch systems such as Hadoop for offline analysis.
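To make that concrete, here is a minimal sketch of a standalone Kafka Connect worker configuration; the broker address, converters, and file path below are illustrative assumptions, not values taken from my demo:

```properties
# worker.properties -- minimal standalone Kafka Connect worker (illustrative values)
bootstrap.servers=localhost:9092

# Converters control how records are serialized into and out of Kafka
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter

# Standalone mode keeps source offsets in a local file
offset.storage.file.filename=/tmp/connect.offsets
```

A standalone worker is then started with something like `connect-standalone worker.properties my-connector.properties`, where the second file describes the source or sink connector itself.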
Further, Kafka not only provides a streaming data pipeline; it can also transform data in flight using its built-in transformations and perform complex aggregations using Kafka Streams. Kafka Streams supports event aggregation, complex joins, and persistent queries on top of a stream.
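As a quick illustration of the Streams API (not part of the demo itself), the Java sketch below counts change events per key; the topic names, application id, and store name are hypothetical placeholders:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class UserEventCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "user-event-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Read change events from a source topic (name is illustrative)
        KStream<String, String> events = builder.stream("test-jdbc-users");

        // Aggregate: count events per key; the result is a continuously updated KTable
        KTable<String, Long> counts = events
                .groupByKey()
                .count(Materialized.as("user-event-count-store"));

        // Write the running counts back out to another topic
        counts.toStream().to("user-event-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```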
All of this was presented at my recent Oracle Code One Bengaluru event on 15-Mar-2019, and the presentation slides are available here.
You can download the entire demo code from my GitHub repository here.
In this video and presentation, I explore Kafka Connect as an ETL tool that captures database changes from source to target, and show how a Kafka topic is used to deliver the change data capture (CDC) stream to the target systems. On the source side I have used the JDBC Source connector, on the target side the HDFS Sink connector, with Kafka Connect running in standalone mode (rough sketches of both connector configurations appear below).
- Use the repository https://github.com/geeksinsights/vagrant-ansible-kafkaconnect-etl-oracle
- It will create a Vagrant box and an Oracle database (follow the instructions)
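For reference, the JDBC source connector configuration looks roughly like the outline below; the connection URL, credentials, and column names are placeholders rather than the exact values in the repository:

```properties
# oracle-jdbc-source.properties -- rough outline of the JDBC source connector
# (connection details and column names are placeholders, not the exact demo values)
name=oracle-jdbc-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1

connection.url=jdbc:oracle:thin:@//localhost:1521/ORCL
connection.user=scott
connection.password=tiger

# Pull only the users table; topic name becomes topic.prefix + table name => test-jdbc-users
table.whitelist=users
topic.prefix=test-jdbc-

# Capture inserts and updates using an incrementing ID plus a last-modified timestamp column
mode=timestamp+incrementing
incrementing.column.name=ID
timestamp.column.name=MODIFIED
```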
The videos will help you understand the following:
- Creating a table on the source side in an Oracle Database 11g instance (start.sh will do this)
- How that table is populated into the Hive metastore and from there into HDFS partitions (start.sh will do this)
- Insert/update changes being captured using the ID and modified columns of my table via the JDBC source connector (watch the video)
- The sink connector receiving data from the Kafka topic test-jdbc-users (watch the video; a sample sink configuration follows this list)
- ALTER TABLE changes being captured through the Schema Registry and reflected in Hive, so the structure is propagated from source to target every time (watch the video)
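To give an idea of the target side, the HDFS sink connector configuration is along these lines; the HDFS and Hive metastore URLs and the database name are placeholder assumptions, and a backward-compatible schema setting is what lets the source schema changes flow through:

```properties
# hdfs-sink.properties -- rough outline of the HDFS sink connector
# (URLs and database name are placeholders, not the exact demo values)
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1

# Consume the topic written by the JDBC source connector
topics=test-jdbc-users

hdfs.url=hdfs://localhost:8020
flush.size=3

# Register the data in the Hive metastore so it shows up as partitioned Hive/HDFS data
hive.integration=true
hive.metastore.uris=thrift://localhost:9083
hive.database=default

# Allow the schema to evolve (e.g. after ALTER TABLE on the source)
schema.compatibility=BACKWARD
```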
And here is the YouTube playlist for your viewing:
[embedyt] https://www.youtube.com/embed?listType=playlist&list=PLR6rN4cTV4BTxVQS-uL7NE6htS-pyLsRV&layout=gallery[/embedyt]
Hope you like it.
-Thanks
Suresh
Follow Me!!!