Releasing 2.13.0
EnricoMi committed Nov 4, 2024
1 parent 9844aa8 commit 02b636d
Showing 6 changed files with 25 additions and 25 deletions.
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -3,7 +3,7 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [UNRELEASED] - YYYY-MM-DD
## [2.13.0] - 2024-11-04

### Fixes
- Support diff for Spark Connect implemented via PySpark Dataset API (#251)
4 changes: 2 additions & 2 deletions PYSPARK-DEPS.md
@@ -76,9 +76,9 @@ docker compose -f docker-compose.yml up -d

Run the `example.py` Spark application on the example cluster:
```shell
docker exec spark-master spark-submit --master spark://master:7077 --packages uk.co.gresearch.spark:spark-extension_2.12:2.12.0-3.5 /example/example.py
docker exec spark-master spark-submit --master spark://master:7077 --packages uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.5 /example/example.py
```
The `--packages uk.co.gresearch.spark:spark-extension_2.12:2.12.0-3.5` argument
The `--packages uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.5` argument
tells `spark-submit` to add the `spark-extension` Maven package to the Spark job.

Alternatively, install the `pyspark-extension` PyPI package via `pip install` and remove the `--packages` argument from `spark-submit`:
20 changes: 10 additions & 10 deletions README.md
@@ -198,7 +198,7 @@ The package version has the following semantics: `spark-extension_{SCALA_COMPAT_
Add this line to your `build.sbt` file:

```sbt
libraryDependencies += "uk.co.gresearch.spark" %% "spark-extension" % "2.12.0-3.5"
libraryDependencies += "uk.co.gresearch.spark" %% "spark-extension" % "2.13.0-3.5"
```

### Maven
@@ -209,7 +209,7 @@ Add this dependency to your `pom.xml` file:
<dependency>
<groupId>uk.co.gresearch.spark</groupId>
<artifactId>spark-extension_2.12</artifactId>
<version>2.12.0-3.5</version>
<version>2.13.0-3.5</version>
</dependency>
```

@@ -219,7 +219,7 @@ Add this dependency to your `build.gradle` file:

```groovy
dependencies {
implementation "uk.co.gresearch.spark:spark-extension_2.12:2.12.0-3.5"
implementation "uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.5"
}
```

@@ -228,7 +228,7 @@ dependencies {
Submit your Spark app with the Spark Extension dependency (version ≥1.1.0) as follows:

```shell script
spark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.12.0-3.5 [jar]
spark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.5 [jar]
```

Note: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your Spark version.
@@ -238,7 +238,7 @@ Note: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depe
Launch a Spark Shell with the Spark Extension dependency (version ≥1.1.0) as follows:

```shell script
spark-shell --packages uk.co.gresearch.spark:spark-extension_2.12:2.12.0-3.5
spark-shell --packages uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.5
```

Note: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your Spark Shell version.
@@ -254,7 +254,7 @@ from pyspark.sql import SparkSession

spark = SparkSession \
.builder \
.config("spark.jars.packages", "uk.co.gresearch.spark:spark-extension_2.12:2.12.0-3.5") \
.config("spark.jars.packages", "uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.5") \
.getOrCreate()
```

@@ -265,7 +265,7 @@ Note: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depe
Launch the Python Spark REPL with the Spark Extension dependency (version ≥1.1.0) as follows:

```shell script
pyspark --packages uk.co.gresearch.spark:spark-extension_2.12:2.12.0-3.5
pyspark --packages uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.5
```

Note: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your PySpark version.
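The notes in this README ask the reader to pick the Scala and Spark versions by hand. A minimal sketch of how the coordinates shown in these hunks fit together is given below. The helper names `spark_compat_version`, `maven_coordinates` and `pypi_requirement` are hypothetical and not part of the library; the string patterns are inferred from the examples in this diff (`2.13.0-3.5` for Maven, `2.13.0.3.5` for pip).

```python
def spark_compat_version(spark_version: str) -> str:
    """Reduce a full Spark version such as '3.5.1' to its 'major.minor' part."""
    return ".".join(spark_version.split(".")[:2])


def maven_coordinates(package_version: str,
                      spark_version: str,
                      scala_compat_version: str = "2.12") -> str:
    """Build the value passed to --packages or spark.jars.packages.

    Hypothetical helper: the pattern follows the README examples,
    e.g. uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.5
    """
    return (
        f"uk.co.gresearch.spark:spark-extension_{scala_compat_version}:"
        f"{package_version}-{spark_compat_version(spark_version)}"
    )


def pypi_requirement(package_version: str, spark_version: str) -> str:
    """Build the pip requirement; the PyPI version uses a dot instead of a hyphen."""
    return (
        f"pyspark-extension=={package_version}."
        f"{spark_compat_version(spark_version)}"
    )


if __name__ == "__main__":
    # prints uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.5
    print(maven_coordinates("2.13.0", "3.5.1"))
    # prints pyspark-extension==2.13.0.3.5
    print(pypi_requirement("2.13.0", "3.5.1"))
```

In a real job the Spark version would typically come from the installed PySpark (for example `pyspark.__version__`); it is passed in as a plain string here so the sketch runs without a Spark installation.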
@@ -275,7 +275,7 @@ Note: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depe
Run your Python scripts that use PySpark via `spark-submit`:

```shell script
spark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.12.0-3.5 [script.py]
spark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.5 [script.py]
```

Note: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your Spark version.
@@ -289,7 +289,7 @@ Running your Python application on a Spark cluster will still require one of the
to add the Scala package to the Spark environment.

```shell script
pip install pyspark-extension==2.12.0.3.5
pip install pyspark-extension==2.13.0.3.5
```

Note: Pick the right Spark version (here 3.5) depending on your PySpark version.
@@ -299,7 +299,7 @@ Note: Pick the right Spark version (here 3.5) depending on your PySpark version.
There are plenty of [Data Science notebooks](https://datasciencenotebook.org/) around. To use this library,
add **a jar dependency** to your notebook using these **Maven coordinates**:

uk.co.gresearch.spark:spark-extension_2.12:2.12.0-3.5
uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.5

Or [download the jar](https://mvnrepository.com/artifact/uk.co.gresearch.spark/spark-extension) and place it
on a filesystem where it is accessible by the notebook, and reference that jar file directly.
2 changes: 1 addition & 1 deletion pom.xml
@@ -2,7 +2,7 @@
<modelVersion>4.0.0</modelVersion>
<groupId>uk.co.gresearch.spark</groupId>
<artifactId>spark-extension_2.13</artifactId>
<version>2.13.0-3.5-SNAPSHOT</version>
<version>2.13.0-3.5</version>
<name>Spark Extension</name>
<description>A library that provides useful extensions to Apache Spark.</description>
<inceptionYear>2020</inceptionYear>
20 changes: 10 additions & 10 deletions python/README.md
@@ -2,20 +2,20 @@

This project provides extensions to the [Apache Spark project](https://spark.apache.org/) in Scala and Python:

**[Diff](https://github.com/G-Research/spark-extension/blob/v2.12.0/DIFF.md):** A `diff` transformation and application for `Dataset`s that computes the differences between
**[Diff](https://github.com/G-Research/spark-extension/blob/v2.13.0/DIFF.md):** A `diff` transformation and application for `Dataset`s that computes the differences between
two datasets, i.e. which rows to _add_, _delete_ or _change_ to get from one dataset to the other.

**[Histogram](https://github.com/G-Research/spark-extension/blob/v2.12.0/HISTOGRAM.md):** A `histogram` transformation that computes the histogram DataFrame for a value column.
**[Histogram](https://github.com/G-Research/spark-extension/blob/v2.13.0/HISTOGRAM.md):** A `histogram` transformation that computes the histogram DataFrame for a value column.

**[Global Row Number](https://github.com/G-Research/spark-extension/blob/v2.12.0/ROW_NUMBER.md):** A `withRowNumbers` transformation that provides the global row number w.r.t.
**[Global Row Number](https://github.com/G-Research/spark-extension/blob/v2.13.0/ROW_NUMBER.md):** A `withRowNumbers` transformation that provides the global row number w.r.t.
the current order of the Dataset, or any given order. In contrast to the existing SQL function `row_number`, which
requires a window spec, this transformation provides the row number across the entire Dataset without scaling problems.

**[Inspect Parquet files](https://github.com/G-Research/spark-extension/blob/v2.12.0/PARQUET.md):** The structure of Parquet files (the metadata, not the data stored in Parquet) can be inspected similar to [parquet-tools](https://pypi.org/project/parquet-tools/)
**[Inspect Parquet files](https://github.com/G-Research/spark-extension/blob/v2.13.0/PARQUET.md):** The structure of Parquet files (the metadata, not the data stored in Parquet) can be inspected similar to [parquet-tools](https://pypi.org/project/parquet-tools/)
or [parquet-cli](https://pypi.org/project/parquet-cli/) by reading from a simple Spark data source.
This simplifies identifying why some Parquet files cannot be split by Spark into scalable partitions.

**[Install Python packages into PySpark job](https://github.com/G-Research/spark-extension/blob/v2.12.0/PYSPARK-DEPS.md):** Install Python dependencies via PIP or Poetry programmatically into your running PySpark job (PySpark ≥ 3.1.0):
**[Install Python packages into PySpark job](https://github.com/G-Research/spark-extension/blob/v2.13.0/PYSPARK-DEPS.md):** Install Python dependencies via PIP or Poetry programmatically into your running PySpark job (PySpark ≥ 3.1.0):

```python
# noinspection PyUnresolvedReferences
@@ -94,7 +94,7 @@ Running your Python application on a Spark cluster will still require one of the
to add the Scala package to the Spark environment.

```shell script
pip install pyspark-extension==2.12.0.3.4
pip install pyspark-extension==2.13.0.3.4
```

Note: Pick the right Spark version (here 3.4) depending on your PySpark version.
@@ -108,7 +108,7 @@ from pyspark.sql import SparkSession

spark = SparkSession \
.builder \
.config("spark.jars.packages", "uk.co.gresearch.spark:spark-extension_2.12:2.12.0-3.4") \
.config("spark.jars.packages", "uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.4") \
.getOrCreate()
```

@@ -119,7 +119,7 @@ Note: Pick the right Scala version (here 2.12) and Spark version (here 3.4) depe
Launch the Python Spark REPL with the Spark Extension dependency (version ≥1.1.0) as follows:

```shell script
pyspark --packages uk.co.gresearch.spark:spark-extension_2.12:2.12.0-3.4
pyspark --packages uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.4
```

Note: Pick the right Scala version (here 2.12) and Spark version (here 3.4) depending on your PySpark version.
@@ -129,7 +129,7 @@ Note: Pick the right Scala version (here 2.12) and Spark version (here 3.4) depe
Run your Python scripts that use PySpark via `spark-submit`:

```shell script
spark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.12.0-3.4 [script.py]
spark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.4 [script.py]
```

Note: Pick the right Scala version (here 2.12) and Spark version (here 3.4) depending on your Spark version.
@@ -139,7 +139,7 @@ Note: Pick the right Scala version (here 2.12) and Spark version (here 3.4) depe
There are plenty of [Data Science notebooks](https://datasciencenotebook.org/) around. To use this library,
add **a jar dependency** to your notebook using these **Maven coordinates**:

uk.co.gresearch.spark:spark-extension_2.12:2.12.0-3.4
uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.4

Or [download the jar](https://mvnrepository.com/artifact/uk.co.gresearch.spark/spark-extension) and place it
on a filesystem where it is accessible by the notebook, and reference that jar file directly.
2 changes: 1 addition & 1 deletion python/setup.py
@@ -17,7 +17,7 @@
from pathlib import Path
from setuptools import setup

jar_version = '2.13.0-3.5-SNAPSHOT'
jar_version = '2.13.0-3.5'
scala_version = '2.13.8'
scala_compat_version = '.'.join(scala_version.split('.')[:2])
spark_compat_version = jar_version.split('-')[1]
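
To see why dropping the `-SNAPSHOT` suffix does not affect the derived values, the two expressions from this hunk can be run on both version strings. This is only an illustrative sketch: the wrapper function `derive_versions` is hypothetical and does not exist in `setup.py`, though the two expressions inside it are copied from the hunk.

```python
def derive_versions(jar_version: str, scala_version: str) -> dict:
    """Apply the two derivations shown in the setup.py hunk (hypothetical wrapper)."""
    # '2.13.8' -> '2.13': keep only major.minor of the Scala version
    scala_compat_version = '.'.join(scala_version.split('.')[:2])
    # '2.13.0-3.5' -> '3.5': second hyphen-separated field is the Spark compat version
    spark_compat_version = jar_version.split('-')[1]
    return {
        'scala_compat_version': scala_compat_version,
        'spark_compat_version': spark_compat_version,
    }


release = derive_versions('2.13.0-3.5', '2.13.8')
snapshot = derive_versions('2.13.0-3.5-SNAPSHOT', '2.13.8')

# prints {'scala_compat_version': '2.13', 'spark_compat_version': '3.5'}
print(release)
# prints True: the '-SNAPSHOT' suffix lands in a third field that is never read
print(snapshot == release)
```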
