Learning how to tame Big Data with Hadoop and related technologies
- Hadoop
- MapReduce
- Pig
- Spark
- Spark SQL
- Using MLlib in Spark 2.0
- Hive
- Sqoop
- HBase
- Cassandra
- MongoDB
- Hadoop is an open source software platform for distributed storage and distributed processing of very large datasets on computer clusters built from commodity hardware
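The distributed processing model at Hadoop's core can be sketched locally in plain Python. This is a minimal, single-machine illustration of the map → shuffle/sort → reduce flow (no Hadoop required; the function names are illustrative, not part of any Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for each word in a line
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: sum all counts emitted for a single word
    return (word, sum(counts))

def map_reduce(lines):
    # Shuffle/sort: group mapped pairs by key, as Hadoop does between phases
    mapped = sorted(pair for line in lines for pair in mapper(line))
    return [reducer(word, (count for _, count in group))
            for word, group in groupby(mapped, key=itemgetter(0))]

print(map_reduce(["the quick brown fox", "the lazy dog"]))
# → [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

On a real cluster, the map and reduce functions run in parallel across many machines and the framework handles the shuffle; the logic per key is the same.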
- Why Hadoop?
- Data is too big
- Vertical scaling isn't an option
- Disk seek times
- Hardware failures
- Processing times
- Horizontal scaling is linear
- You can do much more than just batch processing
- Download VirtualBox from https://www.virtualbox.org/
- Download an image of Hadoop to run on VirtualBox
- The HDP (Hortonworks Data Platform) 2.5 Sandbox is preferred because it boots up faster than newer versions
	- Download from https://hortonworks.com/downloads/#sandbox
- Import the image into VirtualBox
- Once it boots up, you will have a CentOS instance with Hadoop up and running
- You can use the CLI; it also has a browser interface
- Ambari is available to easily navigate and manage the different systems running on Hadoop
- Go to http://localhost:8888
- Launch the dashboard and log in to Ambari
- Username: maria_dev
- Password: maria_dev
- Troubleshooting:
- Enable virtualization in your BIOS
- Disable Hyper-V acceleration in Windows