# Aim of this project

This project reads from a Kafka stream of events published to the `events` topic. The messages are expected to be valid JSON and to contain the fields `uid` (the customer id as a String) and `ts` (a UNIX timestamp) at the top level of the JSON document.
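
A minimal event that satisfies this contract could look like the following (the values are purely illustrative):

```json
{"uid": "customer-42", "ts": 1693526400}
```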

The script in this project reads the JSON messages from the `events` topic and counts the number of unique customer ids (the `uid` field) per minute, with the minute derived from the event time in the `ts` field. The resulting counts are printed to the console and also published to a new topic called `user_count`.


# How to run it

## Requirements

- Docker
- Scala SDK

## Start Kafka environment

1. Open a shell in the project root and navigate to the folder `src/main/docker`.
2. Run `docker-compose up -d` to download (on the first run) and start the Kafka environment.
3. Once it is up, log into the broker container with `docker exec -it broker bash`.
4. Inside the broker container, create the topics required for this job by running the following commands (a quick way to verify them is shown after this list):

   ```bash
   kafka-topics \
     --bootstrap-server localhost:9092 \
     --topic events \
     --create

   kafka-topics \
     --bootstrap-server localhost:9092 \
     --topic user_count \
     --create
   ```
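
Optionally, you can check that both topics were created by listing the topics on the broker:

```bash
kafka-topics \
  --bootstrap-server localhost:9092 \
  --list
```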

## Set up the Kafka job for the customer count

1. Open your IDE and import the project.
2. Configure the Scala SDK, if needed.
3. Run the script in the `CountEventsBasic` object.
4. The code will keep running and write the counted results to the `user_count` topic and to the console; the counting logic is sketched below.
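
The heart of the job is a per-minute distinct count. The following is a minimal, self-contained sketch of that logic in plain Scala; the names (`Event`, `minuteBucket`, `countUniqueUsers`) are hypothetical and not taken from the `CountEventsBasic` source, which additionally wires this logic to Kafka consumers and producers:

```scala
object UniqueUsersPerMinute {

  // One incoming event: customer id plus UNIX timestamp (seconds).
  final case class Event(uid: String, ts: Long)

  // Truncate a UNIX timestamp to the start of its minute.
  def minuteBucket(ts: Long): Long = ts - (ts % 60)

  // Group events by minute and count the distinct uids in each group.
  def countUniqueUsers(events: Seq[Event]): Map[Long, Int] =
    events
      .groupBy(e => minuteBucket(e.ts))
      .view
      .mapValues(_.map(_.uid).distinct.size)
      .toMap

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      Event("a", 1693526400L),
      Event("b", 1693526410L),
      Event("a", 1693526420L), // duplicate uid within the same minute
      Event("c", 1693526465L)  // falls into the next minute
    )
    countUniqueUsers(sample).toSeq.sortBy(_._1).foreach {
      case (minute, n) => println(s"$minute -> $n unique users")
    }
  }
}
```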

## Check the output in the new Kafka topic

1. Use the shell that is still logged in to the broker, or log in to the container again with `docker exec -it broker bash`.
2. Start a consumer on the topic with the following command (see the note on the output after this list):

   ```bash
   kafka-console-consumer \
     --bootstrap-server localhost:9092 \
     --topic user_count \
     --from-beginning \
     --formatter kafka.tools.DefaultMessageFormatter \
     --property print.key=true \
     --property print.value=true
   ```
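
With `print.key=true` and `print.value=true`, each consumed record is printed as a key/value pair separated by a tab. The exact layout depends on how `CountEventsBasic` serializes its results; if, for instance, the job keyed each record by the minute and used the count as the value, a line might look like `1693526400	3` (purely illustrative).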

# How this project can be continued in the future

This project provides a very basic counting algorithm that only works on a single node. If a larger amount of data is pushed to the incoming topic, it would probably be better to use a more sophisticated solution that supports parallel counting with multiple consumers in a cluster, polling from a bigger Kafka cluster with multiple brokers.
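
As a first step in that direction, Kafka's consumer groups already let several instances of a job split the partitions of the `events` topic between them. A minimal configuration sketch follows (the group id `user-count-workers` is made up, and the `kafka-clients` library is assumed to be on the classpath):

```scala
import java.util.Properties
import org.apache.kafka.clients.consumer.ConsumerConfig

object ParallelConsumerConfig {
  // All instances sharing this group.id form one consumer group; Kafka assigns
  // each instance a disjoint subset of the topic's partitions, so the polling
  // load is split across them automatically.
  val props: Properties = {
    val p = new Properties()
    p.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    p.put(ConsumerConfig.GROUP_ID_CONFIG, "user-count-workers")
    p.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringDeserializer")
    p.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringDeserializer")
    p
  }
}
```

Note that per-minute distinct counts cannot simply be summed across instances, since the same `uid` may show up in several partitions; exact results would require merging partial sets of ids or partitioning the `events` topic by `uid`.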