The Koalas project makes data scientists more productive when interacting with big data by implementing the pandas DataFrame API on top of Apache Spark.
pandas is the de facto standard (single-node) dataframe implementation in Python, while Spark is the de facto standard for big data processing. With this package, data scientists can:
- Be immediately productive with Spark, with no learning curve, if you are already familiar with pandas.
- Have a single codebase that works both with pandas (tests, smaller datasets) and with Spark (distributed datasets).
This project is currently in beta and is rapidly evolving, with a weekly release cadence. We would love to have you try it and give us feedback, through our mailing lists or GitHub issues.
- cmake for building pyarrow
- Spark 2.4. Some older versions of Spark may work too but they are not officially supported.
- A recent version of pandas. It is officially developed against 0.23+ but some other versions may work too.
- Python 3.5+ if you want to use type hints in UDFs. Work is ongoing to also support Python 2.
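The minimum versions listed above can be checked before installing. The following is a minimal sketch that assumes plain dotted version strings such as "0.23.4"; `version_tuple` and `meets_minimum` are hypothetical helper names for illustration and are not part of Koalas.

```python
import re
import sys


def version_tuple(text):
    """Turn a dotted version string such as '0.23.4' into a tuple of ints.

    Parsing stops at the first piece that does not start with a digit.
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


# pandas: developed against 0.23+
print(meets_minimum("0.23.4", "0.23"))  # True
print(meets_minimum("0.22.0", "0.23"))  # False

# Python: 3.5+ is needed for type hints in UDFs
python_ok = sys.version_info[:2] >= (3, 5)
print(python_ok)
```

In practice the installed version string would come from the package itself, for example `pandas.__version__`.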
Koalas is available at the Python package index:
pip install koalas
If this fails to install the pyarrow dependency, you may want to try installing with Python 3.6.x, as pip install pyarrow does not work out of the box for 3.7 (see apache/arrow#1125).
After installing the package, you can import the package:
import databricks.koalas as ks
Now you can turn a pandas DataFrame into a Koalas DataFrame that is API-compliant with the former:
import pandas as pd
pdf = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':['a','b','b']})
# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)
# Rename the columns
df.columns = ['x', 'y', 'z1']
# Do some operations in place:
df['x2'] = df.x * df.x
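Because the operations above are ordinary pandas-style calls, a helper can be written once and tested against plain pandas. The sketch below runs with pandas only; `add_square` is a hypothetical function name, and passing a Koalas DataFrame to it assumes Koalas supports the same `copy` and column-assignment calls.

```python
import pandas as pd


def add_square(df, column="x"):
    """Return a copy of df with an extra '<column>2' column holding the square."""
    out = df.copy()
    out[column + "2"] = out[column] * out[column]
    return out


pdf = pd.DataFrame({"x": range(3), "y": ["a", "b", "b"]})
result = add_square(pdf)
print(result["x2"].tolist())  # [0, 1, 4]

# Assumption: with Koalas and Spark installed, the same helper could be
# called on a distributed frame, e.g. add_square(ks.from_pandas(pdf)).
```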
Project docs are published here: https://koalas.readthedocs.io
We use Google Groups for our mailing list: https://groups.google.com/forum/#!forum/koalas-dev
We recommend setting up a Conda environment for development:
conda create --name koalas-dev-env python=3.6
conda activate koalas-dev-env
conda install -c conda-forge pyspark=2.4
conda install -c conda-forge --yes --file requirements-dev.txt
pip install -e . # installs koalas from current checkout
Once set up, make sure you switch to koalas-dev-env before development:
conda activate koalas-dev-env
There is a script ./dev/pytest which is exactly the same as pytest but with some default settings to run Koalas tests easily.
To run all the tests, similar to our CI pipeline:
# Run all unittest and doctest
./dev/pytest
To run a specific test file:
# Run unittest
./dev/pytest -k test_dataframe.py
# Run doctest
./dev/pytest -k series.py --doctest-modules databricks
To run a specific doctest/unittest:
# Run unittest
./dev/pytest -k "DataFrameTest and test_Dataframe"
# Run doctest
./dev/pytest -k DataFrame.corr --doctest-modules databricks
Note that -k is used for simplicity, although it takes an expression. You can use --verbose to check which tests the filter selects. See pytest --help for more details.
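The invocations above all follow the same pattern, which can be sketched as a small command builder. This is only an illustration: `build_pytest_command` is a hypothetical helper, not part of the repository, and it only assembles the argument list without running anything.

```python
import shlex


def build_pytest_command(keyword=None, doctest_package=None, verbose=False):
    """Assemble the argument list for the ./dev/pytest wrapper script."""
    cmd = ["./dev/pytest"]
    if keyword:
        # -k takes an expression, so "A and B" style filters are allowed
        cmd += ["-k", keyword]
    if doctest_package:
        cmd += ["--doctest-modules", doctest_package]
    if verbose:
        cmd.append("--verbose")
    return cmd


unit_cmd = build_pytest_command("DataFrameTest and test_Dataframe")
doctest_cmd = build_pytest_command("DataFrame.corr", doctest_package="databricks")

# shlex.join quotes arguments that contain spaces
print(shlex.join(unit_cmd))
print(shlex.join(doctest_cmd))
```

The resulting list could be passed to `subprocess.run` from the repository root.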
To build documentation via Sphinx:
cd docs && make clean html
It generates HTML files under the docs/build/html directory. Open docs/build/html/index.html to check that the documentation is built properly.
We follow PEP 8 with one exception: lines can be up to 100 characters in length, not 79.
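The 100-character limit is easy to check mechanically. Below is a minimal sketch of such a check; `find_long_lines` is a hypothetical helper written for illustration and is not the tool the project's CI uses.

```python
MAX_LINE_LENGTH = 100  # project limit, instead of the PEP 8 default of 79


def find_long_lines(source, limit=MAX_LINE_LENGTH):
    """Return (line_number, length) pairs for every line longer than limit."""
    return [
        (number, len(line))
        for number, line in enumerate(source.splitlines(), start=1)
        if len(line) > limit
    ]


sample = "a" * 100 + "\n" + "b" * 101
print(find_long_lines(sample))  # [(2, 101)]
```

Style checkers such as pycodestyle accept a `--max-line-length` option that serves the same purpose.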
Only project maintainers can do the following.
Step 1. Make sure the version is set correctly in databricks/koalas/version.py.
Step 2. Make sure the build is green.
Step 3. Create a new release on GitHub. Tag it with the same version as in setup.py, prefixed with "v". If the version is "0.1.0", tag the commit as "v0.1.0".
Step 4. Upload the package to PyPI:
rm -rf dist/koalas*
python setup.py bdist_wheel
export package_version=$(python setup.py --version)
echo $package_version
python3 -m pip install --user --upgrade twine
python3 -m twine upload --repository-url https://test.pypi.org/legacy/ dist/koalas-$package_version-py3-none-any.whl
python3 -m twine upload --repository-url https://upload.pypi.org/legacy/ dist/koalas-$package_version-py3-none-any.whl
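The naming conventions used in the release steps can be written down as two small functions. This is a sketch for illustration only; `release_tag` and `wheel_path` are hypothetical names, and the wheel file name pattern is taken from the upload commands above.

```python
def release_tag(version):
    """Return the Git tag for a version, e.g. '0.1.0' -> 'v0.1.0'."""
    return "v" + version


def wheel_path(version):
    """Return the path of the wheel that the upload commands expect."""
    return "dist/koalas-{}-py3-none-any.whl".format(version)


print(release_tag("0.1.0"))  # v0.1.0
print(wheel_path("0.1.0"))   # dist/koalas-0.1.0-py3-none-any.whl
```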
This project is currently in beta and is rapidly evolving. We plan to do weekly releases at this stage. You should expect the following differences:
- Some functions may be missing (see the Contributions section).
- Some behavior may be different, in particular in the treatment of nulls: pandas uses Not a Number (NaN) special constants to indicate missing values, while Spark has a special flag on each value to indicate missing values. We would love to hear from you if you come across any discrepancies.
- Because Spark is lazy in nature, some operations like creating new columns only get performed when Spark needs to print or write the dataframe.
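The last two differences can be illustrated in plain Python. The sketch below is an analogy only: it does not use pandas or Spark, `None` stands in for Spark's per-value missing flag, and `LazyColumn` is a hypothetical class that mimics deferred evaluation.

```python
import math

# --- NaN versus a separate "missing" flag ---
nan = float("nan")
print(nan == nan)       # False: NaN is a float value that is unequal to itself
print(math.isnan(nan))  # True

nan_total = sum([1.0, nan, 3.0])   # NaN propagates through arithmetic
print(math.isnan(nan_total))       # True

values = [1.0, None, 3.0]          # None plays the role of the missing flag
present = [v for v in values if v is not None]
print(sum(present))                # 4.0


# --- Lazy evaluation ---
class LazyColumn:
    """Holds a computation and runs it only when the result is requested."""

    def __init__(self, compute):
        self._compute = compute
        self.evaluations = 0

    def collect(self):
        self.evaluations += 1
        return self._compute()


squares = LazyColumn(lambda: [x * x for x in range(3)])
before = squares.evaluations   # 0: defining the column did no work
result = squares.collect()     # the computation runs here
print(before, result)          # 0 [0, 1, 4]
```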
If you are already familiar with pandas and want to leverage Spark for big data, we recommend using Koalas. If you are learning Spark from the ground up, we recommend you start with PySpark's API.
File a GitHub issue: https://github.com/databricks/koalas/issues
Databricks customers are also welcome to file a support ticket to request a new feature.
Different projects have different focuses. Spark is already deployed in virtually every organization, and often is the primary interface to the massive amount of data stored in data lakes. Koalas was inspired by Dask, and aims to make the transition from pandas to Spark easy for data scientists.
Please create a GitHub issue if your favorite function is not yet supported.
Make sure the name also reflects precisely which function you want to implement, such as DataFrame.fillna or Series.dropna. If an open issue already exists and you want to add missing parameters, consider contributing to that issue instead.
We also document all the functions that are not yet supported in the missing directory. In most cases, it is very easy to add new functions by simply wrapping the existing pandas or Spark functions. Pull requests welcome!
Two reasons:
- We want a venue in which we can rapidly iterate and make new releases. The overhead of making a release as a separate project is minuscule (on the order of minutes). A release on Spark takes a lot longer (on the order of days).
- Koalas takes a different approach that might contradict Spark's API design principles, and those principles cannot be changed lightly given the large user base of Spark. A new, separate project provides an opportunity for us to experiment with new design principles.
Databricks Runtime for Machine Learning has the right versions of dependencies set up already, so you just need to install Koalas from PyPI when creating a cluster.
For the regular Databricks Runtime, you will need to upgrade pandas and NumPy to the required versions.
In the future, we will package Koalas out-of-the-box in both the regular Databricks Runtime and Databricks Runtime for Machine Learning.