Introduction to Spark:
- PySpark is a library that lets you run Python applications using Apache Spark's capabilities; in other words, PySpark is the Python API for Spark.
- Spark is not a programming language:
  a. You can write Spark applications using Java, Scala, R, and Python.
  b. PySpark allows you to write Python-based data processing applications that execute in parallel on a distributed cluster (see the example below).
Apache Spark is an analytical processing engine for powerful, large-scale distributed data processing and machine learning applications.
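As a first taste of what this looks like in practice, here is a minimal sketch (assuming PySpark is already installed) that creates a SparkSession and runs a simple map operation over a distributed dataset:

```python
from pyspark.sql import SparkSession

# Build (or reuse) a local SparkSession; "local[*]" uses all available cores.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("IntroExample") \
    .getOrCreate()

# Distribute a small dataset across the workers and square each element in parallel.
rdd = spark.sparkContext.parallelize(range(10))
squares = rdd.map(lambda x: x * x).collect()
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

spark.stop()
```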
Basic setup for PySpark on Ubuntu for distributed machine learning. Prerequisites:
- An Ubuntu system
- Access to a terminal or command line
- A user with sudo or root privileges
- Apache Spark
- Java
- PySpark
- FindSpark
- SQL
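Once the prerequisites are in place, a short sanity check like the sketch below confirms that PySpark, FindSpark, and Spark SQL are all working. This assumes the SPARK_HOME environment variable points at your Apache Spark installation, which findspark uses to make PySpark importable:

```python
import findspark
findspark.init()  # locate Spark via SPARK_HOME and add PySpark to sys.path

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SetupCheck").getOrCreate()
print("Spark version:", spark.version)

# Spark SQL is available as soon as the session exists.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.createOrReplaceTempView("demo")
spark.sql("SELECT COUNT(*) AS n FROM demo").show()

spark.stop()
```

If the version prints and the query returns a count of 2, the environment is ready for distributed machine learning workloads.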