Python utils for Spark

This repo contains Python utility code for Spark.

Module sqltodf

Runs a simple HQL statement through Hive-enabled Spark SQL and returns the result set as a pandas DataFrame.

Notes

Spark module imports (pyspark / pyspark.sql) are not required in your own code, as the module handles them.
The code currently runs in Spark 'local' mode, so complex SQL (any type of join, for example) is not supported.
The resulting pandas DataFrame is held in memory, so the table must be small enough to fit.
Driver memory is currently configured at 10 GB.
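
For reference, the settings described in these notes could be expressed roughly as below. This is only a sketch, not the module's actual code: the names `local_spark_settings`, `build_session` and `query_to_pandas` are hypothetical, the `local[*]` master string is an assumption (the notes only say 'local' mode), and it assumes PySpark 2.x or later with Hive support available.

```python
def local_spark_settings(driver_memory="10g"):
    """Return the Spark settings described in the notes as a plain dict."""
    return {
        # Assumption: use all local cores; plain "local" would use one thread.
        "spark.master": "local[*]",
        # Driver memory only takes effect if set before the driver JVM starts.
        "spark.driver.memory": driver_memory,
    }


def build_session(app_name="sqltodf-sketch"):
    """Create a Hive-enabled SparkSession from the settings above."""
    # Imported lazily so the settings helper works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in local_spark_settings().items():
        builder = builder.config(key, value)
    return builder.enableHiveSupport().getOrCreate()


def query_to_pandas(sql):
    """Run `sql` and collect the whole result set onto the driver."""
    spark = build_session()
    # toPandas() pulls every row into driver memory, hence the size caveat.
    return spark.sql(sql).toPandas()
```

Keeping the settings in a plain dict makes the driver memory value easy to change in one place, for example `local_spark_settings("4g")` on a smaller machine.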

Usage

Include this module somewhere on your PYTHONPATH.

    export PYTHONPATH=$PYTHONPATH:<path to>/sqltodf
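
To confirm that the path change took effect, a quick check along these lines may help. It is a minimal sketch for Python 3.4 or later using only the standard library; the helper names `is_importable` and `report` are hypothetical and not part of this repo.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if a top-level module can be located on sys.path."""
    return importlib.util.find_spec(module_name) is not None


def report(module_name="sqltodf"):
    """Print whether the module is visible, listing sys.path when it is not."""
    if is_importable(module_name):
        print(module_name + " is importable")
    else:
        print(module_name + " not found; sys.path entries:")
        for entry in sys.path:
            print("  " + entry)
```

Calling `report()` in the same shell session where `PYTHONPATH` was exported shows whether the interpreter can see the module before you try the example.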

Example

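    # pandas and numpy are imported for any follow-on analysis of df
    import pandas as pd
    import numpy as np
    from sqltodf import Factory

    # Look up the 'Spark' implementation from the factory
    cls = Factory.get('Spark')
    # Run the query and collect the result as a pandas DataFrame
    df = cls.SqlToPandas(sql='Select * from testtable')
    # info() prints its own report and returns None, so it is not wrapped in print
    df.info(memory_usage='deep')
    print(df.head())
    # do stuff with df.....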

Output:

    <class 'pandas.core.frame.DataFrame'>
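
Because the result set is held entirely in memory, one option is to cap the number of rows before running the query. The helper below is a hypothetical sketch, not part of the module: `with_row_limit` and its default of 1000 rows are assumptions, and it only handles a simple trailing `LIMIT n` clause.

```python
import re


def with_row_limit(sql, max_rows=1000):
    """Append a LIMIT clause unless the statement already ends with one."""
    # Drop surrounding whitespace and any trailing semicolon.
    statement = sql.strip().rstrip(";").rstrip()
    if re.search(r"\blimit\s+\d+$", statement, flags=re.IGNORECASE):
        return statement  # the caller already bounded the result
    return "{} LIMIT {}".format(statement, int(max_rows))
```

It could then be combined with the example call, for instance `cls.SqlToPandas(sql=with_row_limit('Select * from testtable', 500))`.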
