ETL with Kafka and Oracle

Oracle Database needs little introduction, and neither does Kafka. Kafka is a high-throughput distributed messaging system that hundreds of companies have adopted to manage their real-time data, using it for applications such as real-time stream processing, data synchronization, and messaging. Building ETL pipelines with Kafka used to be cumbersome, but that changed recently with Kafka Connect, which can seamlessly integrate source and target data through connectors known as Source and Sink connectors.

Kafka Connect is designed to make it easier to build large-scale, real-time data pipelines by standardizing how you move data into and out of Kafka. You can use Kafka connectors to read from or write to external systems, manage data flow, and scale the system, all without writing new code. Kafka Connect manages all the common problems in connecting with other systems (scalability, fault tolerance, configuration, and management), allowing each connector to focus only on how to best copy data between its target system and Kafka. Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency. An export connector can deliver data from Kafka topics into secondary indexes like Elasticsearch or into batch systems such as Hadoop for offline analysis.
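To give a feel for the "no new code" point, here is a minimal sketch of a standalone Connect worker configuration. The broker address, converter choice and file path below are assumptions for illustration, not the exact settings used in the demo repository.

    # connect-standalone.properties -- minimal worker configuration (example values)
    bootstrap.servers=localhost:9092
    # How the worker serializes record keys and values on Kafka topics
    key.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter
    # In standalone mode, source offsets are tracked in a local file
    offset.storage.file.filename=/tmp/connect.offsets
    offset.flush.interval.ms=10000

A connector is then started by passing its own properties file to the worker, for example: connect-standalone connect-standalone.properties my-connector.properties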

Further, Kafka does not only provide streaming data pipelines: Kafka Connect ships with built-in transformations, and Kafka Streams supports complex aggregations. Kafka Streams can perform event aggregation and complex joins, and it can maintain persistent, queryable views on top of a stream.
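As a sketch of the kind of aggregation Kafka Streams can do, the snippet below counts change events per key from the CDC topic and materializes the running counts in a queryable state store. The topic, store and application names are hypothetical, and string keys/values are assumed for brevity (the actual demo uses Avro with the Schema Registry).

    // Minimal Kafka Streams sketch: running count of change events per key.
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.Produced;

    public class UserChangeCounts {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "user-change-counts");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();

            // Read the CDC topic written by the source connector; assumes records carry a key
            // (the JDBC source connector may need a transformation to set one).
            KStream<String, String> changes = builder.stream("test-jdbc-users");

            // Aggregate: count change events per key, materialized as a queryable
            // state store, then write the running counts to an output topic.
            KTable<String, Long> counts = changes
                    .groupByKey()
                    .count(Materialized.as("user-change-counts-store"));

            counts.toStream().to("user-change-counts",
                    Produced.with(Serdes.String(), Serdes.Long()));

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }

The state store ("user-change-counts-store") is what makes the aggregation queryable on top of the stream: other code can look up the latest count per key without replaying the topic.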

All of this was presented at the recent Oracle Code One Bengaluru event on 15-Mar-2019, and the presentation slides are available here.

You can download the entire demo code from my GitHub repository here.

In this video and presentation, I explore Kafka Connect as an ETL tool that captures database changes from source to target, and show how a Kafka topic is used to deliver the CDC events to the target systems. On the source side I have used the JDBC Source connector, on the target side the HDFS Sink connector, with Kafka Connect running in standalone mode.
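As a rough sketch, the two connector configurations look like the following. The connection details, table, column and topic names here are example values, not necessarily what the demo repository uses.

    # source-jdbc-users.properties -- JDBC source connector (example values)
    name=source-jdbc-users
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:oracle:thin:@localhost:1521/ORCL
    connection.user=scott
    connection.password=tiger
    table.whitelist=USERS
    # Capture new and changed rows using an incrementing id plus a last-modified timestamp
    mode=timestamp+incrementing
    incrementing.column.name=ID
    timestamp.column.name=MODIFIED
    topic.prefix=test-jdbc-
    poll.interval.ms=5000

    # sink-hdfs-users.properties -- HDFS sink connector (example values)
    name=sink-hdfs-users
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    topics=test-jdbc-users
    hdfs.url=hdfs://localhost:8020
    flush.size=100
    # Keep the Hive metastore in sync so the landed data shows up as Hive partitions
    hive.integration=true
    hive.metastore.uris=thrift://localhost:9083
    schema.compatibility=BACKWARD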

  1. Use the repository https://github.com/geeksinsights/vagrant-ansible-kafkaconnect-etl-oracle
  2. It will create the Vagrant box and an Oracle database (follow the instructions in the repository)

The videos will help you understand the steps below:

  1. A table is created on the source side in an Oracle Database 11g (start.sh does this)
  2. The table is registered in the Hive metastore and from there lands in HDFS partitions (start.sh does this)
  3. Inserts and updates are captured via the ID and MODIFIED columns of my table by the JDBC source connector (watch the video)
  4. The sink connector receives the data from the Kafka topic test-jdbc-users (watch the video)
  5. An ALTER TABLE is captured through the Schema Registry, which propagates the structure from source to target every time (watch the video; see the SQL sketch after this list)
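For orientation, here is a hypothetical SQL sketch of the source-side activity the connectors react to. The real table definition lives in the repository's start.sh and will differ; the names below are illustrative only.

    -- Source table with an incrementing key and a last-modified timestamp (hypothetical names)
    CREATE TABLE users (
      id        NUMBER PRIMARY KEY,
      name      VARCHAR2(100),
      modified  TIMESTAMP DEFAULT SYSTIMESTAMP NOT NULL
    );

    -- New and changed rows are picked up by the JDBC source connector on its next poll
    INSERT INTO users (id, name, modified) VALUES (1, 'alice', SYSTIMESTAMP);
    UPDATE users SET name = 'alicia', modified = SYSTIMESTAMP WHERE id = 1;
    COMMIT;

    -- A DDL change registers a new schema version in the Schema Registry,
    -- and the HDFS sink evolves the Hive table structure to match
    ALTER TABLE users ADD (email VARCHAR2(255));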

 

And here is the YouTube playlist for your viewing:

https://www.youtube.com/embed?listType=playlist&list=PLR6rN4cTV4BTxVQS-uL7NE6htS-pyLsRV&layout=gallery

Hope you like it.

-Thanks

Suresh
