Map/Reduce on IMDB dataset

This project uses MapReduce to find all the actors (or actresses) who were also directors of the same movie title.
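As a rough local illustration of the matching rule, and not the project's actual mapper and reducer classes, the sketch below joins two tiny sample files with awk. The sample ids, the temporary file names, and both column layouts (tconst,nconst for actors and tconst, a tab, then a comma-separated director list for crew) are assumptions made for this example only. It compares each id against the individual entries of the director list, so an actor id such as nm1 does not match a director id such as nm10.

```shell
# Hypothetical sample data; the real input files may use a different layout.
workdir="$(mktemp -d)"

# Assumed layout: tconst,nconst (one row per actor or actress in a title)
printf '%s\n' 'tt001,nm1' 'tt001,nm2' 'tt002,nm1' > "$workdir/actors.csv"

# Assumed layout: tconst<TAB>directors (comma-separated list of nconst ids)
printf 'tt001\tnm2,nm3\ntt002\tnm10\n' > "$workdir/crew.tsv"

# Usage: actor_directors ACTORS_CSV CREW_TSV
# Loads the director list of every title, then prints each (title, person)
# pair where the actor id is an exact element of that title's director list.
actor_directors() {
  awk '
    FNR == NR { split($0, f, "\t"); directors[f[1]] = f[2]; next }
    {
      split($0, a, ",")
      n = split(directors[a[1]], d, ",")
      for (i = 1; i <= n; i++) if (d[i] == a[2]) print a[1] "\t" a[2]
    }
  ' "$2" "$1"
}

result="$(actor_directors "$workdir/actors.csv" "$workdir/crew.tsv")"
echo "$result"   # prints: tt001<TAB>nm2

rm -rf "$workdir"
```

Only tt001/nm2 is reported: nm1 acted in tt002, but that title's director is nm10, which is a different person.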

How to run this project:

Prerequisites

Copy all the input files into the folder "inputImdb" inside the IMDB-3mapper2reducer folder.
The input files should be named as follows:
1. imdb00-title-actors.csv
2. title.basics.tsv
3. title.crew.tsv
JDK 1.8 is required.
Download Hadoop for Windows or for Linux.

Setting up and running Hadoop on Linux is easier; if you don't have Linux, try using WSL.
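Before starting Hadoop, the presence of the three input files listed above can be checked with a small helper. This is only a sketch: the function name check_inputs is made up for this example, and the folder to check is whatever location you copied the files to.

```shell
# Usage: check_inputs DIR
# Returns 0 only if every expected input file exists in DIR;
# otherwise reports each missing file on stderr and returns 1.
check_inputs() {
  dir="$1"
  missing=0
  for name in imdb00-title-actors.csv title.basics.tsv title.crew.tsv; do
    if [ ! -f "$dir/$name" ]; then
      echo "missing: $dir/$name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_inputs inputImdb && echo "all input files found"` prints the confirmation only when all three files are present.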

Running the project on windows-wsl2

I keep my bashrc file in my home directory. To start a single-node Hadoop cluster, run the commands below:

source ~/.bashrc
sudo service ssh start
sbin/start-dfs.sh
sbin/start-yarn.sh

After setting up Hadoop and all the environment variables, check that the Hadoop web UI is up and running. The default port is 50070 (Hadoop 3.x uses 9870 instead): http://localhost:50070/dfshealth.html#tab-overview
Place all the input files in the folder 'inputMapReduce' and upload the folder to HDFS:

$HADOOP_HOME/bin/hdfs dfs -copyFromLocal <folder_location>/inputMapReduce /

Run the project with the following command:

$HADOOP_HOME/bin/hadoop jar <project_folder>/target/NewProjectHadoop-0.0.1-SNAPSHOT.jar com.uta.MapperReducerMain /inputMapReduce/title.basics.tsv /inputMapReduce/imdb00-title-actors.csv /inputMapReduce/title.crew.tsv /mapreduce_output

Note: The output folder may already exist in HDFS. If it does, delete it first and then run the command above. The folder can be deleted from HDFS using this command:

$HADOOP_HOME/bin/hdfs dfs -rm -r /mapreduce_output
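If you rerun the job often, the delete step can be guarded so that it only runs when the folder is actually there. The sketch below relies on `hdfs dfs -test -d`, which exits with status 0 when the path is an existing directory. The function name and the HDFS_CMD override are assumptions introduced for this example (the override only exists so the function can be tried without a running cluster).

```shell
# Usage: remove_output_if_exists HDFS_PATH
# Deletes the HDFS directory only when it exists.
# HDFS_CMD is a hypothetical override; by default the hdfs binary
# under $HADOOP_HOME is used, as in the commands above.
remove_output_if_exists() {
  hdfs_cmd="${HDFS_CMD:-$HADOOP_HOME/bin/hdfs}"
  if "$hdfs_cmd" dfs -test -d "$1"; then
    "$hdfs_cmd" dfs -rm -r "$1"
  else
    echo "nothing to delete: $1"
  fi
}
```

With a running cluster, `remove_output_if_exists /mapreduce_output` can be called right before the job command.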

After the command completes, the output can be viewed using this command:

$HADOOP_HOME/bin/hdfs dfs -cat /mapreduce_output/part-r-00000

To copy the output files to a local folder, use:

$HADOOP_HOME/bin/hdfs dfs -get /mapreduce_output/* output-distr

This is an academic project. To verify the mapper-reducer result, compare it with the output of the query below, run on a database whose tables hold the same dataset as the input files:

SELECT *
FROM imdb00.title_basics b
INNER JOIN imdb00.TITLE_PRINCIPALS a
    ON a.TCONST = b.TCONST
    AND a.category IN ('actor', 'actress')
    AND a.NCONST <> '\N'
INNER JOIN imdb00.title_crew c
    ON a.TCONST = c.TCONST
    AND c.directors <> '\N'
    AND c.directors LIKE '%' || a.NCONST || '%' -- check this as dir are comma separated
WHERE b.Titletype = 'movie'
    AND b.startYear BETWEEN '1950' AND '1960'
ORDER BY a.TCONST ASC;
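One way to do the comparison, assuming the query result has been exported to a text file with the same columns and separators as the MapReduce output, is to sort both files and diff them so that row order does not matter. The function name and the export file name used below are hypothetical.

```shell
# Usage: compare_results MAPREDUCE_FILE DATABASE_EXPORT
# Sorts both files and diffs them. Prints the rows that appear in only
# one of the files and returns non-zero when the files differ.
compare_results() {
  mr_sorted="$(mktemp)"
  db_sorted="$(mktemp)"
  sort "$1" > "$mr_sorted"
  sort "$2" > "$db_sorted"
  status=0
  diff "$mr_sorted" "$db_sorted" || status=$?
  rm -f "$mr_sorted" "$db_sorted"
  return "$status"
}
```

For example, `compare_results output-distr/part-r-00000 db_export.tsv` prints nothing and returns 0 when both results contain the same rows. Here db_export.tsv stands for whatever file you exported the query result to.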
