We are building a search bar that lets people do fuzzy search on different Konnect entities (services, routes, nodes). You're in charge of creating the backend ingest that powers that search, built on top of a CDC stream generated by Debezium.
We have provided a JSONL file containing some sample events that can be used to simulate the input stream.
Below are the tasks we want you to complete.
- develop a program that ingests the sample CDC events into a Kafka topic
- develop a program that persists the data from Kafka into OpenSearch
Run
docker-compose up -d
to start the Kafka cluster and its companion services (Kafka-UI and OpenSearch).
The cluster is accessible at localhost:9092 from the host, or at kafka:29092 for services running inside the container network.
You can also access Kafka-UI at localhost:8080 to examine the ingested Kafka messages.
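To make the wiring concrete, here is a minimal sketch of the producer side in Java. It assumes the program runs on the host (hence localhost:9092), reads stream.jsonl from the working directory, and uses a single placeholder topic name (cdc-events); the per-entity topic routing is described further below, and the class name is illustrative.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

public class ProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Use kafka:29092 instead when running inside the container network.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each line of stream.jsonl is one CDC event; forward it as the message value.
            Files.lines(Paths.get("stream.jsonl"))
                 .forEach(line -> producer.send(new ProducerRecord<>("cdc-events", line)));
            producer.flush();
        }
    }
}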
OpenSearch is accessible at localhost:9200 from the host, or at opensearch-node:9200 for services running inside the container network.
You can validate that OpenSearch is working by running sample queries:
Insert
curl -X PUT localhost:9200/cdc/_doc/1 -H "Content-Type: application/json" -d '{"foo": "bar"}'
{"_index":"cdc","_id":"1","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}%
Search
curl localhost:9200/cdc/_search | python -m json.tool
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "cdc",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "foo": "bar"
        }
      }
    ]
  }
}
Run
docker-compose down
to tear down all the services.
stream.jsonl contains the CDC events that need to be ingested.
docker-compose.yaml contains the skeleton services to help you get started.
Note - depending on your Docker version, the command is either "docker-compose" or "docker compose". Older Docker versions need the separate docker-compose binary, whereas newer versions integrate docker compose into the Docker CLI.
Also, multiple Kafka topics are used for the different Konnect entities, since different entities have different schemas.
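One way to route events to per-entity topics is to derive the entity type from the CDC key, as in the sketch below. The key shape (c/_global/o/&lt;entity&gt;/&lt;id&gt;) is inferred from the sample events, and the topic names are placeholders.

public class TopicRouter {
    // Assumed key shape from the sample stream: "c/_global/o/<entity>/<id>".
    static String topicFor(String cdcKey) {
        String[] parts = cdcKey.split("/");
        String entity = parts.length >= 4 ? parts[3] : "unknown";
        switch (entity) {
            case "service": return "cdc.service";
            case "route":   return "cdc.route";
            case "node":    return "cdc.node";
            default:        return "cdc.other"; // entities we do not index could also be skipped
        }
    }

    public static void main(String[] args) {
        // Prints "cdc.service" for a sample service key.
        System.out.println(topicFor("c/_global/o/service/1234"));
    }
}

To run the exercise end to end: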
docker compose up -d
mvn install
mvn exec:java -Dexec.mainClass="org.konnect.IngestExerciseProducer"
mvn exec:java -Dexec.mainClass="org.konnect.IngestExerciseConsumer"
docker compose down
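Some sample queries against the cdc index are shown below.
Search all documents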
curl --location 'localhost:9200/cdc/_search?pretty=null&size=100' \
--header 'Content-Type: application/json'
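Wildcard search on the host field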
curl --location --request GET 'http://localhost:9200/cdc/_search?pretty=null&size=100' \
--header 'Content-Type: application/json' \
--data '{
"query": {
"wildcard": {
"host": {
"value": "*cypress*"
}
}
}
}'
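Fuzzy search on the konnect_entity field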
curl --location --request GET 'http://localhost:9200/cdc/_search?pretty=null&size=100' \
--header 'Content-Type: application/json' \
--data '{
"query": {
"match": {
"konnect_entity": {
"query": "node",
"fuzziness": "AUTO"
}
}
}
}'
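Delete the index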
curl --location --request DELETE 'localhost:9200/cdc' \
--header 'Content-Type: application/json'
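Possible improvements: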
- We could produce messages to Kafka in batches instead of one by one.
- Update Docker Compose to also run the producer and consumer as services.
- Which fields should be searchable in the OpenSearch schema? Right now we rely on OpenSearch's default mappings to make fields searchable, but we could define explicit mappings for the index instead.
- Add unit tests.
- On the consumer side, we keep a map of (entity id, updated_at of the last processed event) to handle out-of-order updates; a sketch follows this list. Ideally this state would live in a distributed key-value store such as Redis.
- We could use a Spring Kafka consumer, which supports automatic retry with backoff.
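Below is a minimal sketch of that out-of-order guard, assuming a single consumer thread and an in-memory map keyed by entity id; the class and method names are illustrative, and the same comparison would work against Redis.

import java.util.HashMap;
import java.util.Map;

public class OutOfOrderGuard {
    // entity id -> updated_at of the last event applied to OpenSearch
    private final Map<String, Long> lastApplied = new HashMap<>();

    // Returns true if the event is newer than anything already applied for this entity.
    public boolean shouldApply(String entityId, long updatedAt) {
        Long previous = lastApplied.get(entityId);
        if (previous != null && previous >= updatedAt) {
            return false; // stale or duplicate event: skip the OpenSearch write
        }
        lastApplied.put(entityId, updatedAt);
        return true;
    }
}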
Notes on the sample stream (which powers the fuzzy search on Konnect entities: services, routes, nodes): it contains create and update events for the different entity types; there are no delete events in the sample.
- total events - 726
- 8 events for cluster, with keys like c/_global/o/cluster/f24150e5-4781-4d70-9350-aaa7700ee9c3 (the GUID at the end is the cluster id). Clusters have both create and update events; for example, the cluster with GUID 4c75f4f6-ca71-44a9-80ca-e96f6c412b24 has both.
- 15 events for service - NEED TO CONSIDER
- 554 events for node - NEED TO CONSIDER
- 6 events for node-status - DON'T THINK THIS IS NEEDED, BUT DOUBLE CHECK
- 2 events for sni
- 2 events for target
- 14 events for route - NEED TO CONSIDER
- 86 events for store_event
- 1 event for key
- 5 events for vault
- 13 events for hash
- 5 events for upstream
- 2 events for composite-status
- 5 events for consumer_group
- 8 events for consumer
Count of events to be considered (service + node + route) = 15 + 554 + 14 = 583
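A sketch of the consumer-side write path that follows from these counts: only service, node and route events are indexed, and the GUID at the end of the CDC key becomes the OpenSearch document id, so a later update simply overwrites the earlier create. The JDK HTTP client is used here only for illustration; the index name, key parsing, and method names are assumptions.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Set;

public class OpenSearchWriter {
    private static final Set<String> CONSIDERED = Set.of("service", "node", "route");
    private final HttpClient http = HttpClient.newHttpClient();

    // cdcKey looks like "c/_global/o/<entity>/<guid>" in the sample stream.
    public void maybeIndex(String cdcKey, String documentJson) throws Exception {
        String[] parts = cdcKey.split("/");
        if (parts.length < 5 || !CONSIDERED.contains(parts[3])) {
            return; // entity types we do not make searchable
        }
        String entityId = parts[4];
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/cdc/_doc/" + entityId))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(documentJson))
                .build();
        http.send(request, HttpResponse.BodyHandlers.ofString());
    }
}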