
How to Build a Distributed Big Data Pipeline Using Kafka, Cassandra, and Jupyter Lab with Docker plus Faker API

This is the solution for Tutorial 4. You are recommended to follow the step-by-step guide in class for better understanding.

You can use the resources in this repository to deploy an end-to-end data pipeline on your local computer using Dockerized Kafka (data streaming), Cassandra (NoSQL database), and Jupyter Lab (data analysis and visualization).

This is based on the repo https://github.com/salcaino/sfucmpt733/tree/main/foobar-kafka, with substantial changes and bug fixes. Tested on Windows 10.

Tutorial videos:

https://youtu.be/_lJWsgOoOjM

https://youtu.be/TSJ_9ykhU1g

https://youtu.be/qQ7krtlZ7As

Quickstart instructions

You need to apply for API keys to use this project. Approval can take several days. Sample API keys are provided, but they may be blocked if too many users are running this.

Twitter Developer API: https://developer.twitter.com/en/apply-for-access

OpenWeatherMap API: https://openweathermap.org/api 

After obtaining the API keys, please update the files "twitter-producer/twitter_service.cfg" and "owm-producer/openweathermap_service.cfg" accordingly.
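The exact section and key names are defined by the cfg files in this repo, so the snippet below is only a hypothetical sketch of the usual INI-style layout; open the actual files and replace the placeholder values:

[openweathermap_api]
# hypothetical section and key names; check the real openweathermap_service.cfg
access_token = YOUR_OWM_API_KEY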

Create Docker networks

$ docker network create kafka-network                         # create a new docker network for kafka cluster (zookeeper, broker, kafka-manager services, and kafka connect sink services)
$ docker network create cassandra-network                     # create a new docker network for Cassandra (Kafka Connect will join this network in addition to kafka-network)
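You can confirm that both networks were created:

$ docker network ls    # both kafka-network and cassandra-network should be listed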

Starting Cassandra

Cassandra is set up to run the keyspace and schema creation scripts on first startup, so it is ready to use.

$ docker-compose -f cassandra/docker-compose.yml up -d
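Cassandra can take a minute to initialize. You can check that the node is ready before starting the rest of the stack (the container is named cassandra, as used later in this guide):

$ docker exec cassandra nodetool status    # the node should report UN (Up/Normal)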

Starting Kafka on Docker

$ docker-compose -f kafka/docker-compose.yml up -d            # start single zookeeper, broker, kafka-manager and kafka-connect services
$ docker ps -a                                                # sanity check to make sure services are up: kafka_broker_1, kafka-manager, zookeeper, kafka-connect service

Note: Kafka-Manager front end is available at http://localhost:9000

You can use it to create a cluster and view the topics streaming in Kafka.

IMPORTANT: There is a bug that I don't know how to fix yet. You have to manually go to the CLI of the "kafka-connect" container and run the command below to start the Cassandra sinks.

./start-and-wait.sh
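If you prefer the terminal over Docker Desktop, you can do this with docker exec, assuming the container is named kafka-connect as listed by docker ps above:

$ docker exec -it kafka-connect bash    # open a shell inside the kafka-connect container
./start-and-wait.sh                     # then launch the Cassandra sinks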

Starting Producers

$ docker-compose -f faker-producer/docker-compose.yml up -d   # start the producer that generates fake data
$ docker-compose -f owm-producer/docker-compose.yml up -d     # start the producer that retrieves open weather map data
$ docker-compose -f twitter-producer/docker-compose.yml up    # start the twitter producer (runs attached; see the note below)

There is a known issue with reading the tweets; the bug fix will be released in Tweepy 4.0 (details here: tweepy/tweepy#237). Therefore, the Twitter producer runs attached so you can monitor its log, see the data flowing, and restart the service if it stops.
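For the detached producers, you can follow their logs without knowing the container names, since docker-compose resolves them from the yml file:

$ docker-compose -f faker-producer/docker-compose.yml logs -f    # follow the faker producer's output
$ docker-compose -f owm-producer/docker-compose.yml logs -f      # follow the weather producer's output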

Starting Twitter classifier (plus Weather and Faker consumer)

There is another catch: we cannot build the Docker image for the consumer directly with its docker-compose.yml (this works for all other yml files, just not this one). So we have to go inside the folder "consumers" and build the image manually with this command:

$ docker build -t twitterconsumer .        # build the consumer image

Then go back up one level with "cd .." and start the consumers:

$ docker-compose -f consumers/docker-compose.yml up       # start the consumers

Check that data is arriving to Cassandra

First, log into the Cassandra container with the following command, or open a new CLI from Docker Desktop if you use that.

$ docker exec -it cassandra bash

Once logged in, bring up cqlsh with this command and query the twitterdata, weatherreport, and fakerdata tables like this:

$ cqlsh --cqlversion=3.4.4 127.0.0.1 #make sure you use the correct cqlversion

cqlsh> use kafkapipeline; #keyspace name

cqlsh:kafkapipeline> select * from twitterdata;

cqlsh:kafkapipeline> select * from weatherreport;

cqlsh:kafkapipeline> select * from fakerdata;
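If the select output is too noisy, a quick row count is enough to confirm that new data keeps arriving (count(*) scans the whole table, so only use it on small tutorial datasets like these):

cqlsh:kafkapipeline> select count(*) from weatherreport;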

And that's it! You should see records coming into Cassandra. Feel free to play around with it by bringing containers down and then up again to see the magic of fault tolerance!
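For example, you can stop the consumers for a while and bring them back up to watch them drain the messages that accumulated in Kafka while they were down:

$ docker-compose -f consumers/docker-compose.yml down    # stop the consumers; the producers keep writing to Kafka
$ docker-compose -f consumers/docker-compose.yml up      # restart them and watch the backlog get consumed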

Visualization

Run the following command, then go to http://localhost:8888 and run the visualization notebook:

$ docker-compose -f data-vis/docker-compose.yml up -d
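If Jupyter Lab asks for a token when you open http://localhost:8888, it is usually printed in the container's startup log (assuming the image keeps token authentication enabled):

$ docker-compose -f data-vis/docker-compose.yml logs | grep token    # look for the login token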

Or you can try the Dash version:

$ docker-compose -f dashboard/docker-compose.yml up -d

Teardown

To stop all running pipeline services:

$ docker-compose -f data-vis/docker-compose.yml down # stop visualization node

$ docker-compose -f consumers/docker-compose.yml down          # stop the consumers

$ docker-compose -f owm-producer/docker-compose.yml down       # stop open weather map producer

$ docker-compose -f faker-producer/docker-compose.yml down       # stop faker producer

$ docker-compose -f twitter-producer/docker-compose.yml down   # stop twitter producer

$ docker-compose -f kafka/docker-compose.yml down              # stop zookeeper, broker, kafka-manager and kafka-connect services

$ docker-compose -f cassandra/docker-compose.yml down          # stop Cassandra

To remove the Docker networks:

$ docker network rm kafka-network
$ docker network rm cassandra-network

To remove resources in Docker

$ docker container prune # remove stopped containers (already stopped by docker-compose down)
$ docker volume prune # remove all dangling volumes (deletes all data from your Kafka and Cassandra)
$ docker image prune -a # remove all images (useful when rebuilding images from scratch)
$ docker builder prune # remove all build cache (layers will be pulled again on the next build)
$ docker system prune -a # basically remove everything