Parse data using Kafka stream with Yarn

This is a distributed streaming application that reads covid19 data using Python, and pipes it to Kafka streams. Jobs are run using Yarn, Samza and Zookeeper on Hadoop 2.0

RUNNING THE CODE

1.Prerequisites

a) make sure JAVA_HOME is set in .bashrc.

echo $JAVA_HOME

b) Make sure you have write persmission in /home directory or wherever you are cloning the repo.

git clone https://github.com/Shivansh1805/KWHY.git

2.Give execution permission to the below mentioned scripts

chmod +x /home/path/to/KWHY/deploy-script/grid
chmod +x /home/path/to/KWHY/deploy/bin/run-job.sh
chmod +x /home/path/to/KWHY/deploy/bin/run-class.sh

If you face permission related problem with any script, just make it executable by above mentioned commands.

3.Build Samza distribution artifacts in deploy directory

mvn clean install

4.Start the grid script in bootstrap mode(It will automatically install and starts zookeeper, kafka, yarn)

Install and Start yarn, zookeeper, kafka

deploy-script/grid bootstrap

To start/stop zookeeper,yarn,kafka by a single command


deploy-script/grid start all
deploy-script/grid stop all

To start/stop zookeeper,yarn,kafka separately

deploy-script/grid start/stop kafka
deploy-script/grid start/stop yarn
deploy-script/grid start/stop zookeeper

Note - If you face some error while starting kafka just remove the kafka-logs directory from /tmp/ and start it again.

5. Update path accoridng to your local system

Update path in both files present in src/resources directory

yarn.package.path=/home/path/to/KWHY/deploy/deploy.tar.gz

6.Run the Samza deployment script with correct paths and attributes

/home/path/to/KWHY/deploy/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=/home/path/to/KWHY/deploy/conf/covid-parser.properties

Open browser and go to local yarn resource manager: http://localhost:8088/cluster to check status

7.Create consumers on two different terminals after submitting job to yarn .

/home/path/to/KWHY/deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic district-data

/home/path/to/KWHY/deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic state-data

8.Start kafka stream

Pull data from covid19 API using python script and pipe it to kafka producer and bind to 'district-data' & 'state-data' topic Go to producer directory and run the below command.

python3 covid-stream.py

If you encounter error due to impropoer yarn shut down, simply delete /tmp/kafka-logs directory

9. See the streamed results on both the terminals where you launched the consumers.

NOTE

Before pushing the code

rm -rf deploy
mvn clean

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
conf		conf
deploy-script		deploy-script
producer		producer
src/main		src/main
.gitignore		.gitignore
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Parse data using Kafka stream with Yarn

RUNNING THE CODE

1.Prerequisites

2.Give execution permission to the below mentioned scripts

3.Build Samza distribution artifacts in deploy directory

4.Start the grid script in bootstrap mode(It will automatically install and starts zookeeper, kafka, yarn)

5. Update path accoridng to your local system

6.Run the Samza deployment script with correct paths and attributes

7.Create consumers on two different terminals after submitting job to yarn .

8.Start kafka stream

9. See the streamed results on both the terminals where you launched the consumers.

NOTE

About

Releases

Packages

Contributors 2

Languages

Shivansh1805/Kafka-Covid-Stream

Folders and files

Latest commit

History

Repository files navigation

Parse data using Kafka stream with Yarn

RUNNING THE CODE

1.Prerequisites

2.Give execution permission to the below mentioned scripts

3.Build Samza distribution artifacts in deploy directory

4.Start the grid script in bootstrap mode(It will automatically install and starts zookeeper, kafka, yarn)

5. Update path accoridng to your local system

6.Run the Samza deployment script with correct paths and attributes

7.Create consumers on two different terminals after submitting job to yarn .

8.Start kafka stream

9. See the streamed results on both the terminals where you launched the consumers.

NOTE

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages