This tutorial shows how to conduct data analyses with SMV (Spark Modularized View), a framework for developing large-scale data applications on Spark. API docs can be found here. After completing the tutorial, users should be able to build a data analytics project with the SMV framework.
The tutorial basics will mainly cover the following topics:
First things first. We need to make sure we have all necessary tools installed and the environment set up.
Once the environment is set up, we can start doing some interesting work. Data scientists and business analysts familiar with traditional analytic tools such as SQL or SAS will naturally ask how to process data and conduct analyses in SMV. The following examples use the employment data in the SmvTraining project. The sample file in the data directory was extracted directly from US employment data:
$ wget http://www2.census.gov/econ2012/CB/sector00/CB1200CZ11.zip
$ unzip CB1200CZ11.zip
$ mv CB1200CZ11.dat CB1200CZ11.csv
More info can be found on the US Census site.
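Before loading the file into SMV, it can help to sanity-check the raw data. The snippet below is a minimal sketch in plain Python (not SMV) that peeks at a delimited extract; the sample rows, the column names, and the assumption that fields are pipe-delimited are illustrative only and should be confirmed against the actual Census file layout.

```python
# Minimal sanity check for a delimited text file. The sample below is made up
# for illustration; in practice you would read the first lines of the real
# CB1200CZ11.csv and verify its delimiter and header against the Census docs.
sample = """GEO_ID|ZIPCODE|EMP|PAYANN
8610000US00601|00601|1407|22831
8610000US00602|00602|8045|159059
"""

lines = sample.strip().splitlines()
header = lines[0].split("|")
rows = [line.split("|") for line in lines[1:]]

print("columns:", header)
print("row count:", len(rows))
# Every data row should have the same number of fields as the header.
assert all(len(r) == len(header) for r in rows)
```

A quick check like this catches delimiter or header surprises before they surface as confusing errors deeper in a Spark job.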
Now we will show how convenient and efficient data analysis can be with SMV.
- Profile Input Data
- Identify Insights from Data
- Advanced Analytics
- Quality Control
- Smv Exercise 1: Employment Data
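To make the first topic, profiling input data, concrete: SMV generates data profiles on Spark (its EDD reports), but the underlying idea can be sketched in a few lines of plain Python. The records, column names, and `profile` helper below are made up for illustration and are not SMV API.

```python
# Toy column profile: count, nulls, and min/max for numeric columns.
# This only illustrates the idea behind data profiling; SMV's own EDD
# reports are much richer and run on Spark at scale.
records = [
    {"zipcode": "00601", "emp": 1407},
    {"zipcode": "00602", "emp": 8045},
    {"zipcode": "00603", "emp": None},
]

def profile(rows, col):
    values = [r[col] for r in rows]
    non_null = [v for v in values if v is not None]
    stats = {"count": len(values), "nulls": len(values) - len(non_null)}
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        stats["min"] = min(non_null)
        stats["max"] = max(non_null)
    return stats

print(profile(records, "emp"))
# {'count': 3, 'nulls': 1, 'min': 1407, 'max': 8045}
```

Profiles like this, computed for every column, are usually the first step in spotting bad loads, unexpected nulls, and out-of-range values.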
SMV offers a modularized computation framework in which the scalability and reusability of both data and code are expected to scale the development team and reduce the development time of complicated, large-scale projects. This tutorial mainly helps users become familiar with building a project with SMV; users are encouraged to follow the latest development of the SMV project and check the corresponding API docs for detailed help.
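The modularity described above is the core design: each computation step is a named module that declares its dependencies, so intermediate results can be cached and reused, and team members can develop modules independently. The sketch below mimics that pattern in plain Python with a made-up `Module` class; SMV's real `SmvModule` abstraction differs in detail, so treat every name here as hypothetical.

```python
# Toy dependency-driven module framework, illustrating the modularized-view
# idea only. The class and method names are hypothetical, not SMV's API.
class Module:
    _cache = {}

    def requires(self):
        """Return the list of upstream module classes this module depends on."""
        return []

    def run(self, inputs):
        """Compute this module's result from the resolved upstream results."""
        raise NotImplementedError

    def resolve(self):
        """Resolve dependencies recursively, caching so each module runs once."""
        key = type(self).__name__
        if key not in Module._cache:
            inputs = [dep().resolve() for dep in self.requires()]
            Module._cache[key] = self.run(inputs)
        return Module._cache[key]

class RawEmployment(Module):
    def run(self, inputs):
        # Stand-in for reading the employment CSV; values are made up.
        return [{"zip": "00601", "emp": 1407}, {"zip": "00602", "emp": 8045}]

class TotalEmployment(Module):
    def requires(self):
        return [RawEmployment]

    def run(self, inputs):
        (raw,) = inputs
        return sum(r["emp"] for r in raw)

print(TotalEmployment().resolve())  # 9452
```

Because `RawEmployment` is cached after its first resolution, any number of downstream modules can depend on it without recomputing the input, which is the reuse property the paragraph above describes.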