This project uses MapReduce to find similar images in large datasets.
Our approaches and results are described in the white paper, which you can find in the Paper folder or by following this link.
You need Java to compile the JAR. To run the code, you need your own AWS account with S3 and EMR access. You also need a stable internet connection to check the status of your clusters.
To use this project, follow these steps:
- Clone this project
- Make sure you have Maven installed. If you are using IntelliJ, it is bundled by default.
- Build the JAR by running the "mvn package" command in the project directory.
- Create your cluster from AWS EMR Control Panel
- Once your cluster is up and running, go to your S3 bucket and upload your JAR.
- Open the AWS EMR Control Panel again, add a new step, select the freshly uploaded JAR, and add arguments like the following.
Example: GistCompare s3://com-rosettahub-default-xxxxx/MapReduce/input/ s3://com-rosettahub-default-xxxxx/output 20000 com-rosettahub-default-xxxxx
- Once the step has finished, the final output file should be in the MapReduce folder, where you can review the similarities.
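If you prefer the command line to the EMR Control Panel, the build, upload, and step-submission steps above could be sketched roughly as follows. This is only a sketch under assumptions, not the project's own tooling: it assumes the AWS CLI is installed and configured, and the JAR path and the cluster ID are hypothetical placeholders that you must replace with your own values. The step arguments are copied from the example above. By default the script only prints the commands; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/sh
# Dry-run sketch of the build-and-submit steps.
# Nothing is executed unless DRY_RUN=0 is set in the environment.
set -eu

BUCKET="com-rosettahub-default-xxxxx"   # bucket name from the example arguments
JAR_PATH="target/your-project.jar"      # hypothetical: the JAR produced by "mvn package"
CLUSTER_ID="j-XXXXXXXXXXXXX"            # hypothetical: ID of your running EMR cluster
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

JAR_NAME=$(basename "$JAR_PATH")

# Same arguments as the example step, joined with commas for the EMR CLI.
STEP_ARGS="GistCompare,s3://$BUCKET/MapReduce/input/,s3://$BUCKET/output,20000,$BUCKET"

# 1. Build the JAR.
run mvn package

# 2. Upload the JAR to the S3 bucket.
run aws s3 cp "$JAR_PATH" "s3://$BUCKET/$JAR_NAME"

# 3. Add a custom JAR step to the running cluster.
run aws emr add-steps --cluster-id "$CLUSTER_ID" \
  --steps "Type=CUSTOM_JAR,Name=GistCompare,ActionOnFailure=CONTINUE,Jar=s3://$BUCKET/$JAR_NAME,Args=[$STEP_ARGS]"
```

Running the script as-is prints the three commands so you can check the paths and arguments before submitting anything to AWS.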
- MapReduce - Main framework
- Hadoop - Used for HDFS and general MapReduce Framework
- Maven - Dependency Management
- Serhan Gürsoy - Architecture Engineer - Github
- Ege Yosunkaya - Architecture Engineer - Github
- Ömer Faruk Karakaya - Architecture Engineer - Github
- Musab Erayman - Architecture Engineer - Github
See also the list of contributors who participated in this project.