Final Project - April 30th, 2017
Jason Hunsberger, Victoria Baker, Dominic Delmolino
- Scan Server (AWS Ubuntu)
- Storm Server (AWS Amazon Linux)
- Hadoop Cluster Namenode (single-node) (AWS Amazon Linux)
- Kafka Service (from aiven.io)
- PostgreSQL Service (AWS RDS PostgreSQL)
Place all servers in the same AWS security group
Kafka version 0.10.2
http://aiven.io - Create a Kafka service
- Use Connection parameters from your Kafka service (host, port)
- Download your CA certificate, access key and access certificate
- Allowed IP addresses 0.0.0.0/0
- Add topic "zmap_scans" (a quick connectivity check using these settings is sketched below)
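Before wiring the full pipeline together, it is worth confirming that the service accepts SSL connections with the downloaded credentials. The following is a minimal kafka-python sketch (not part of the project code); the bootstrap host:port and certificate paths are placeholders for your own service values:

from kafka import KafkaProducer

# placeholder host:port from the Aiven connection parameters
producer = KafkaProducer(
    bootstrap_servers='your-kafka-host.aivencloud.com:12345',
    security_protocol='SSL',
    ssl_cafile='ca.pem',            # the three files downloaded from Aiven
    ssl_certfile='service.cert',
    ssl_keyfile='service.key')
producer.send('zmap_scans', b'connectivity test')
producer.flush()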
- m3.large
- Amazon Linux
- ec2-user
Storm topology available under /home/ec2-user/finaldag
- Navigate to /home/ec2-user
- Create a .pgpass file with your PostgreSQL hostname:port:database:username:password (a sample line is shown after this list)
- Database should be zmapdb
- Username should be iro_admin
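A sample .pgpass line (the endpoint and password are placeholders; substitute your own RDS values):

your-rds-endpoint.us-east-1.rds.amazonaws.com:5432:zmapdb:iro_admin:your_password

The file must be private (chmod 600 ~/.pgpass) or psql will ignore it.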
- Navigate to /etc
- Use sudo to edit the hosts file
- Add an entry for the reported hostname of the Hadoop cluster namenode (example below)
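An example entry (the IP address and hostname are placeholders; use the IP the Storm server can reach and the hostname the namenode actually reports):

10.0.0.12   ip-10-0-0-12.ec2.internal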
- Navigate to /home/ec2-user/finalcerts
- Copy ca.pem, service.cert and service.key files from the Aiven Kafka service to this directory
- Navigate to /home/ec2-user/finalpsql
- Edit country-load-script.sh file to point at your PostgreSQL database
- Run the country-load-script.sh file
- Navigate to /home/ec2-user/finalpsql
- Edit the analysis_queries.sh file to point at your PostgreSQL database (the psql connection options are sketched after these steps)
- Run the analysis_queries.sh file to see summary results
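Both scripts connect with psql, so the part to adjust is the connection itself. The options take roughly this shape, with a placeholder endpoint (the password comes from the .pgpass file created earlier):

psql -h your-rds-endpoint.us-east-1.rds.amazonaws.com -p 5432 -U iro_admin -d zmapdb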
- Navigate to /home/ec2-user/finaldag/src/bolts directory
- Edit the WebHDFS.py
- Replace the IP address in the InsecureClient call with the public address of your Hadoop namenode (the call has the shape sketched below)
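The call has roughly the shape below (a sketch only; the address is a placeholder, and user iro is an assumption based on the /user/iro HDFS directory created during cluster setup):

from hdfs import InsecureClient
# 50070 is the namenode WebHDFS/HTTP port opened in the security group
client = InsecureClient('http://your-namenode-public-address:50070', user='iro')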
$ java -version
java version "1.7.0_131"
OpenJDK Runtime Environment (amzn-2.6.9.0.71.amzn1-x86_64 u131-b00)
OpenJDK 64-Bit Server VM (build 24.131-b00, mixed mode)
$ lein version
Leiningen 2.7.1 on Java 1.7.0_131 OpenJDK 64-Bit Server VM
$ storm version
0.9.3
$ sparse --version
sparse 2.1.4
$ python --version
Python 2.7.12
- hdfs (2.0.16)
- psycopg2 (2.7.1)
- kafka-python (1.3.3)
- streamparse (2.1.4)
- AWS RDS PostgreSQL
- db.m3.xlarge
- Production instance
- PostgreSQL 9.5.4
- 1 TB Provisioned IOPS (SSD) storage
- IOPS 3000
- dbname zmapdb
- Username iro_admin
- Security group includes inbound traffic from Storm server security group
- AMI w205_cp_hadoop_jhvbdd_final ami-def191c8
- m3.large
- Amazon Linux
- ec2-user
- 2 x 500GB (st1 volume type)
- Open ports 50070, 50075 and 8080 -- allow inbound from Storm server
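The two attached volumes need filesystems and mount points. If /data_01 and /data_02 do not already exist, create them before running the mount commands below (all of these commands require root or sudo):

mkdir -p /data_01 /data_02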
mkfs /dev/xvdf
mkfs /dev/xvdg
mount /dev/xvdf /data_01
mount /dev/xvdg /data_02
$ java -version
java version "1.7.0_131"
OpenJDK Runtime Environment (amzn-2.6.9.0.71.amzn1-x86_64 u131-b00)
OpenJDK 64-Bit Server VM (build 24.131-b00, mixed mode)
$ hadoop version
Hadoop 2.7.1
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a
Compiled by jenkins on 2015-06-29T06:04Z
Compiled with protoc 2.5.0
From source with checksum fc0a1a23fc1868e4d5ee7fa2b28a58a
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-2.7.1.jar
export JAVA_HOME=/usr
export HADOOP_LOG_DIR=/data_01/logs
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/data_01/temp</value>
</property>
</configuration>
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/data_01/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/data_01/data,/data_02/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
$ hadoop fs -mkdir -p /user/iro
$ hadoop fs -chown iro:hadoop /user/iro
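To confirm the directory and its ownership:

$ hadoop fs -ls /user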
- AWS c3.large instance
sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade
sudo apt-get install python3-pip
sudo ln -s /usr/bin/python3 /usr/local/bin/python
sudo ln -s /usr/bin/pip3 /usr/local/bin/pip
sudo apt-get install unzip
sudo apt-get install nginx
pip install kafka-python
sudo apt-get install git gcc make libpcap-dev
git clone git://github.com/zmap/zmap.git
cd zmap/
sudo apt-get install build-essential cmake libgmp3-dev libpcap-dev gengetopt byacc flex
sudo apt-get install git pkg-config libjson0-dev
git clone https://github.com/json-c/json-c.git
cd json-c/
sudo apt-get install autoconf
sudo apt-get install automake
sudo apt-get install libtool
sh autogen.sh
./configure
make -j 4
sudo make install
make check
sudo apt-get install libunistring-dev
cd ..    # back to the zmap source directory
cmake -DWITH_JSON=ON -DENABLE_DEVELOPMENT=OFF .
make
sudo make install
sudo vim /etc/nginx/nginx.conf
Enable (uncomment) the line: server_tokens off;
sudo service nginx reload
sudo add-apt-repository ppa:certbot/certbot
sudo apt-get update
sudo apt-get install certbot
sudo certbot certonly --webroot -w /var/www/html -d internet-observatory.io
sudo certbot renew --dry-run
sudo openssl dhparam -out /etc/ssl/certs/dhparam.pem 2048
sudo vim default    (the nginx site configuration, typically /etc/nginx/sites-available/default)
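The HTTPS server block in that file ends up looking roughly like this sketch (the certificate paths assume certbot's default layout for this domain, plus the dhparam file generated above; adjust to match the actual configuration):

server {
    listen 443 ssl;
    server_name internet-observatory.io;
    ssl_certificate     /etc/letsencrypt/live/internet-observatory.io/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/internet-observatory.io/privkey.pem;
    ssl_dhparam         /etc/ssl/certs/dhparam.pem;
    root /var/www/html;
}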
sudo nginx -t
NOTE: The default configuration will scan approximately 8 million IP addresses on port 80 (HTTP).
- Log in to the Storm server as the "ec2-user" user
- Navigate to /home/ec2-user/finaldag directory
- At the unix prompt, run: sparse run --option "supervisor.worker.timeout.secs=3600"
- Log in to the Scan server as the "ubuntu" user
- Confirm /data is mounted; if it is not, mount it with
sudo mount -t ext4 /dev/xvdc /data
- Run
python3 zmap_scan_script.py
- After the scan completes on the scan server (confirmed by a message to stdout), use Ctrl-C to stop the scanner script.
Validate the data by running SQL queries against the PostgreSQL database
Check the Hadoop filesystem for the stream files
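For example, on the namenode (assuming the topology writes under the /user/iro directory created earlier):

$ hadoop fs -ls /user/iro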