This repository provides a scaffold for Apache Spark applications written in Scala. It includes a structured project setup with essential configurations, making it easier to kickstart Spark-based data processing projects.
This setup is built and tested with:
- macOS (Apple M3 Pro)
- Java 21
- Scala 2.13.16
- sbt 1.9.8
- Spark 4.0.0
The scaffold provides:
- Pre-configured sbt build system
- Sample Spark job implementation that converts Parquet to CSV (a sketch follows this list)
- Organized project structure
- Logging setup using SLF4J
- Easy-to-run example with SparkSession
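As a quick illustration of the sample job's shape, here is a minimal sketch of a Parquet-to-CSV conversion. The object name, paths, and options are illustrative, not the scaffold's actual code:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of a Parquet-to-CSV job; names and options are
// illustrative, not the scaffold's actual implementation.
object ParquetToCsvSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-to-csv-sketch")
      .master("local[*]")
      .getOrCreate()

    // Read all Parquet files under the input directory.
    val df = spark.read.parquet("data")

    // Write the same rows back out as CSV with a header row.
    df.write.option("header", "true").mode("overwrite").csv("output")

    spark.stop()
  }
}
```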
Clone the repository and navigate to the project directory:
```bash
git clone https://github.com/Nasruddin/spark-scala-scaffold.git
cd spark-scala-scaffold
```
Compile and package the application using SBT:
```bash
sbt clean compile
sbt package
```
Run the application locally using SBT:
sbt "run --input-path data --output-path output"
Or submit the packaged (thin) JAR to a Spark cluster. Because the Typesafe Config dependency is not bundled into the thin JAR, download it and pass it via `--jars`:

```bash
wget https://repo1.maven.org/maven2/com/typesafe/config/1.4.3/config-1.4.3.jar
sbt clean package
spark-submit --class com.example.sparktutorial.SparkExampleMain \
  --master "local[*]" \
  --jars config-1.4.3.jar \
  target/scala-2.13/sparktutorial_2.13-1.0.jar \
  --input-path data --output-path output
```
Or build and submit a packaged fat JAR (this requires the sbt-assembly plugin; see the note after the commands):

```bash
sbt clean assembly
spark-submit --class com.example.sparktutorial.SparkExampleMain \
  --master "local[*]" \
  target/scala-2.13/sparktutorial-assembly-1.0.jar \
  --input-path data --output-path output
```
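If the sbt-assembly plugin is not already configured, a typical setup looks like the following. The plugin version and the merge strategy shown here are assumptions, not taken from this repository; check the plugin's releases for the current version:

```scala
// project/plugins.sbt -- plugin version is an assumption; use the latest release.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")
```

```scala
// build.sbt -- a common (illustrative) merge strategy for Spark fat jars,
// discarding duplicate META-INF entries that otherwise break assembly.
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}
```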
The project is organized as follows:

```
spark-scala-scaffold/
├── src/
│   └── main/
│       ├── scala/com/example/sparktutorial/
│       │   ├── Analysis.scala
│       │   ├── Configuration.scala
│       │   ├── package.scala
│       │   └── SparkExampleMain.scala
│       └── resources/
│           └── application.conf
├── build.sbt
├── README.md
├── project/
└── target/
```
- Analysis.scala - Contains the main transformation logic for the Spark application.
- Configuration.scala - Handles the application configuration using Typesafe Config.
- package.scala - Contains utility functions and implicit values for the Spark session (a sketch follows this list).
- SparkExampleMain.scala - Entry point for running Spark examples.
- build.sbt - Project dependencies and build configuration.
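The exact contents of package.scala are specific to this repository, but a package object that exposes a shared implicit SparkSession typically looks something like this illustrative sketch:

```scala
package com.example

import org.apache.spark.sql.SparkSession

// Illustrative sketch only; the scaffold's actual package.scala may differ.
package object sparktutorial {
  // A shared, lazily created SparkSession available implicitly to the
  // rest of the package.
  implicit lazy val spark: SparkSession = SparkSession.builder()
    .appName("Spark Scala Basic Setup App")
    .master("local[*]")
    .getOrCreate()
}
```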
The configuration is managed using Typesafe Config. The configuration file application.conf should be placed in the src/main/resources directory.
```hocon
default {
  appName = "Spark Scala Basic Setup App"

  spark {
    settings {
      spark.master = "local[*]"
      spark.app.name = ${default.appName}
    }
  }
}
```
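A sketch of how Configuration.scala might read this file with Typesafe Config and feed every key under `default.spark.settings` into a SparkSession builder (the repository's actual code may differ):

```scala
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession
import scala.jdk.CollectionConverters._

// Illustrative sketch: load application.conf from the classpath and apply
// each configured key/value pair to the SparkSession builder.
object ConfigSketch {
  def buildSession(): SparkSession = {
    val settings = ConfigFactory.load().getConfig("default.spark.settings")
    settings.entrySet().asScala.foldLeft(SparkSession.builder()) { (builder, entry) =>
      builder.config(entry.getKey, entry.getValue.unwrapped().toString)
    }.getOrCreate()
  }
}
```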
Add the following dependencies to your build.sbt file:
```scala
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "4.0.0" % Provided,
  "org.apache.spark" %% "spark-sql" % "4.0.0" % Provided,
  "com.typesafe" % "config" % "1.4.3"
)
```
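For context, a minimal build.sbt around those dependencies might look like the sketch below. The name, version, and Scala version are inferred from the jar paths and environment listed earlier in this README, not copied from the repository:

```scala
// Illustrative minimal build.sbt; values are inferred, not authoritative.
ThisBuild / scalaVersion := "2.13.16"
ThisBuild / version      := "1.0"

lazy val root = (project in file("."))
  .settings(
    // Produces target/scala-2.13/sparktutorial_2.13-1.0.jar via `sbt package`.
    name := "sparktutorial",
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "4.0.0" % Provided,
      "org.apache.spark" %% "spark-sql" % "4.0.0" % Provided,
      "com.typesafe" % "config" % "1.4.3"
    )
  )
```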
Contributions are welcome! Feel free to submit issues or pull requests.