SQLCell

Bringing together the power of SQL, Python, and JavaScript. Run raw, multi-threaded SQL in an IPython Notebook while concurrently executing Python cells, plus a ton of other features.

See the tutorials here:

Part 1: The Basics

Part 2: Beyond The Basics

[screenshot: SQLCell preview]

Contents


Most recent feature
Installation and Configuration
Features
 i. Parameters
 ii. Flags
 iii. Pass python variables to SQL
 iv. psql metacommands
 v. Multi-threading
 vi. Buttons
 vii. Inline editing
 viii. Easy-to-read Query Plan Table
 ix. Easy-to-read Query Plan Graph
 x. Alter column name and type via the UI
 xi. Notifications
 xii. pg_dump support
 xiii. Switch Engines (To be documented...)  
In development
 i. Built-in PostGIS preview (inspired by postgis-preview)
 ii. MSSQL support.  
To dos
 i. Add UI elements to perform basic, common database tasks, such as adding columns, creating tables, etc.
 ii. Need to confirm install process is smooth on non-Mac computers.
 iii. Add support for MySQL commands.
 iv. Add modifiers and constraints via the UI.

Open issues can be found here.

Most recent feature

added 12/04/2016

SQLCell now offers the option to view the Query Plan as a Sankey graph built with D3.js.

[screenshot: query plan Sankey graph]

Installation and configuration

Just clone the repo and cp the sqlcell_app.py file to Jupyter's startup directory (on my computer, the directory is ~/.ipython/profile_default/startup, but may be different depending on your OS and version of IPython/Jupyter):

$ cd .ipython/profile_default/startup # or wherever your startup directory is
$ git clone https://github.com/tmthyjames/SQLCell.git
$ cp SQLCell/sqlcell_app.py sqlcell_app.py # place sqlcell_app.py in the startup folder so it will get executed

Then, in the engine_config.py file, define your connection variables. If you don't add them to engine_config.py, you'll have to pass a connection string to the ENGINE parameter every time you use %%sql, like so:

In [2]: %%sql ENGINE='postgresql://username:password@host:port/database'
        SELECT * FROM table;

To save the engines:

%%sql --declare_engines new
LOCAL=postgresql://username:password@localhost:5432/
DEV=postgresql://username:password@random.domain.com/

See more about this option in the Declare Engines section

Now you are ready to ditch pgAdmin or whatever SQL interface you use. Continue reading to see all the available options, like writing results to a CSV, using SQLAlchemy named parameters, and more.

Features

Parameters

Available parameters:
DB: Determines which database to query. This parameter is required on the first run; after that, SQLCell remembers which database was chosen. To change databases, use this parameter again. Default: the last-specified database.
PATH: Writes the results of the query to a CSV (can also be done through the UI). No default.
MAKE_GLOBAL: Assigns the results of the query to the variable you pass to it. If this parameter is specified but RAW is not, the results will be a pandas DataFrame. If RAW is set to True, the results will be the raw RowProxy returned from the database. No default.
RAW: Determines whether the data will be of type DataFrame or RowProxy. Default: False.
DISPLAY: Determines whether to render the results as a table. Best used in conjunction with the MAKE_GLOBAL parameter, since rendering a table in a busy workflow can be cumbersome.
ENGINE: Specifies which host and database to connect to. Default: the connection specified in the engine_config.py file. If engine_config.py is not configured, the ENGINE parameter is required.
TRANSACTION_BLOCK: Determines whether the query is executed inside a transaction block. Set this to False when creating a database, dropping a database, running VACUUM ANALYZE, or executing any other statement that cannot run inside a transaction block. Default: True.
EDIT: Enables inline editing. To use this, your query must reference exactly one table, and that table must have a primary key. Default: False.
NOTIFY: Determines whether notifications are shown for finished queries. Default: True.

Examples of how to use these are below.

DB Parameter

After adding your connection details to engine_config.py, run your first query with the DB argument:

In [3]: %%sql DB=bls
        SELECT * 
        FROM la_unemployment
        LIMIT 3

|   | series_id            | year | period | value | footnote_codes |
|---|----------------------|------|--------|-------|----------------|
| 1 | LASST470000000000003 | 1976 | M01    | 6.2   | None           |
| 2 | LASST470000000000003 | 1976 | M02    | 6.1   | None           |
| 3 | LASST470000000000003 | 1976 | M03    | 6.0   | None           |

For the rest of the session, you won't have to use the DB argument unless you want to change databases. And the last-used DB will be persisted even after you shut down Jupyter and start it back up next time.

In [4]: %%sql
        SELECT * 
        FROM avg_price LIMIT 3

|   | series_id     | year | period | value |
|---|---------------|------|--------|-------|
| 1 | APU0000701111 | 1980 | M01    | 0.203 |
| 2 | APU0000701111 | 1980 | M02    | 0.205 |
| 3 | APU0000701111 | 1980 | M03    | 0.211 |

To switch databases, just invoke the DB argument again with a different database:

In [5]: %%sql DB=sports
        SELECT * 
        FROM nba LIMIT 3

|   | dateof     | team | opp | pts | fg | fg_att | ft | ft_att | fg3 | fg3_att | off_rebounds | def_rebounds | asst | blks | fouls | stls | turnovers |
|---|------------|------|-----|-----|----|--------|----|--------|-----|---------|--------------|--------------|------|------|-------|------|-----------|
| 1 | 2015-10-27 | DET  | ATL | 106 | 37 | 96     | 20 | 26     | 12  | 29      | 23           | 36           | 23   | 3    | 15    | 5    | 15        |
| 2 | 2015-10-27 | ATL  | DET | 94  | 37 | 82     | 12 | 15     | 8   | 27      | 7            | 33           | 22   | 4    | 25    | 9    | 15        |
| 3 | 2015-10-27 | CLE  | CHI | 95  | 38 | 94     | 10 | 17     | 9   | 29      | 11           | 39           | 26   | 7    | 21    | 5    | 11        |

PATH Parameter

To write the data to a CSV, use the PATH argument:

In [6]: %%sql PATH='/<path>/<to>/<file>.csv'
        SELECT * 
        FROM nba LIMIT 3

|   | dateof     | team | opp | pts | fg | fg_att | ft | ft_att | fg3 | fg3_att | off_rebounds | def_rebounds | asst | blks | fouls | stls | turnovers |
|---|------------|------|-----|-----|----|--------|----|--------|-----|---------|--------------|--------------|------|------|-------|------|-----------|
| 1 | 2015-10-27 | DET  | ATL | 106 | 37 | 96     | 20 | 26     | 12  | 29      | 23           | 36           | 23   | 3    | 15    | 5    | 15        |
| 2 | 2015-10-27 | ATL  | DET | 94  | 37 | 82     | 12 | 15     | 8   | 27      | 7            | 33           | 22   | 4    | 25    | 9    | 15        |
| 3 | 2015-10-27 | CLE  | CHI | 95  | 38 | 94     | 10 | 17     | 9   | 29      | 11           | 39           | 26   | 7    | 21    | 5    | 11        |
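
Since the output is a plain CSV on disk, you can read it straight back into pandas (using the same placeholder path):

In [7]: import pandas as pd
        pd.read_csv('/<path>/<to>/<file>.csv').head()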

MAKE_GLOBAL Parameter

And my favorite: you can assign the DataFrame to a variable using the MAKE_GLOBAL argument:

In [9]: %%sql MAKE_GLOBAL=WHATEVER_NAME_YOU_WANT DB=bls
        SELECT * 
        FROM la_unemployment
        WHERE year = 1976
            AND period = 'M01'
        LIMIT 3

|   | series_id            | year | period | value     | footnote_codes |
|---|----------------------|------|--------|-----------|----------------|
| 1 | LASST470000000000003 | 1976 | M01    | 6.2       | None           |
| 2 | LASST470000000000004 | 1976 | M01    | 111152.0  | None           |
| 3 | LASST470000000000005 | 1976 | M01    | 1691780.0 | None           |

And call the variable:

In [10]: WHATEVER_NAME_YOU_WANT

|   | series_id            | year | period | value     | footnote_codes |
|---|----------------------|------|--------|-----------|----------------|
| 1 | LASST470000000000003 | 1976 | M01    | 6.2       | None           |
| 2 | LASST470000000000004 | 1976 | M01    | 111152.0  | None           |
| 3 | LASST470000000000005 | 1976 | M01    | 1691780.0 | None           |
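
Because the variable is an ordinary pandas DataFrame (when RAW isn't set), the usual pandas API applies to it; for example:

In [11]: WHATEVER_NAME_YOU_WANT['value'].mean()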

RAW Parameter

You can also return the raw RowProxy from SQLAlchemy by setting the RAW argument to True and using the MAKE_GLOBAL argument.

In [10]: %%sql MAKE_GLOBAL=data RAW=True
         SELECT * 
         FROM la_unemployment
         LIMIT 3

In [11]: data
         [(u'LASST470000000000003', 1976, u'M01', 6.2, None),
          (u'LASST470000000000003', 1976, u'M02', 6.1, None),
          (u'LASST470000000000003', 1976, u'M03', 6.0, None)]
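
The raw rows behave like tuples, so you can still build a DataFrame from them yourself if you need one later. A minimal sketch, with the column names taken from the tables above:

In [12]: import pandas as pd
         df = pd.DataFrame(data, columns=['series_id', 'year', 'period',
                                          'value', 'footnote_codes'])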

DISPLAY Parameter

Query the data without rendering the table (useful if the result set is prohibitively large and displaying the table breaks things) by setting the DISPLAY parameter to False. It makes sense to use this parameter in conjunction with the MAKE_GLOBAL parameter so the data is passed to the variable but the table isn't rendered:

In [10]: %%sql MAKE_GLOBAL=data DISPLAY=False
         SELECT * 
         FROM la_unemployment
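
No table is rendered, but the variable is populated as usual:

In [11]: data.shape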

ENGINE Parameter

The ENGINE parameter accepts any connection string and creates a connection based on that.

In [10]: %%sql ENGINE='postgresql://username:password@host:port/DB'
         SELECT * 
         FROM la_unemployment
         LIMIT 3

TRANSACTION_BLOCK Parameter

Some SQL statements (VACUUM, CREATE <db>, DROP <db>, etc.) must be executed outside of a transaction block, which is done by setting the connection's isolation level to 0.

In [10]: %%sql TRANSACTION_BLOCK=False
         VACUUM ANALYZE <table_name>
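
For reference, this amounts to running the statement in autocommit mode. A rough equivalent in plain SQLAlchemy (a sketch, not SQLCell's actual code; the connection string is assumed):

In [11]: from sqlalchemy import create_engine, text
         engine = create_engine('postgresql://username:password@host:port/DB')  # assumed connection string
         with engine.connect() as conn:
             # AUTOCOMMIT lifts the implicit transaction block that
             # VACUUM and CREATE/DROP DATABASE cannot run inside
             conn.execution_options(isolation_level='AUTOCOMMIT') \
                 .execute(text('VACUUM ANALYZE la_unemployment'))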

EDIT Parameter

Enables inline editing.

In [10]: %%sql EDIT=True
         SELECT * 
         FROM la_unemployment
         LIMIT 3

Will display a table where the cells can be clicked on and edited.

NOTIFY Parameter

In [11]: %%sql NOTIFY=False
         SELECT * FROM la_unemployment LIMIT 1

Will disable notifications for the remainder of your Jupyter session. To re-enable notifications, just set NOTIFY=True.

Flags


declare_engines: Makes it easy to add engines to the engine_config.py file.
pg_dump: Run pg_dump commands from your Jupyter Notebook.

Declare Engines

All engines should be in this format: name=connection_string. For example:

%%sql --declare_engines new
LOCAL=postgresql://username:password@localhost:5432/
DEV=postgresql://username:password@random.domain.com/

Will create the ENGINES object with only LOCAL and DEV in it. "LOCAL" will be the text that goes on the button, like the following:

[screenshot: declare_engines new]

To append new engines to an existing ENGINES object:

%%sql --declare_engines append
LOCAL_test=postgresql://username:password@localhost:5432/
DEV_test=postgresql://username:password@random.domain.com/

[screenshot: declare_engines append]

Where I previously had the engines LOCAL, DEV, and PROD, I now also have LOCAL_test and DEV_test.

pg_dump support

In[17]: %%sql --pg_dump
        -t nba sports --schema-only

Will output the following:

SET statement_timeout = 0;
SET lock_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET row_security = off;
SET search_path = public, pg_catalog;
SET default_tablespace = '';
SET default_with_oids = false;
--
-- Name: nba; Type: TABLE; Schema: public; Owner: postgres
--
CREATE TABLE nba (
    dateof date,
    team character varying(5),
    opp character varying(5),
    pts bigint,
    fg integer,
    fg_att integer,
    ft integer,
    ft_att integer,
    fg3 integer,
    fg3_att integer,
    off_rebounds integer,
    def_rebounds integer,
    asst integer,
    blks integer,
    fouls integer,
    stls integer,
    turnovers integer
);
ALTER TABLE nba OWNER TO postgres;

Pass Python variables to SQL

To pass Python variables to your queries, just do the following.

In [7]: # define your parameters in a python cell
        year = '1976'
        period = 'M01'
        series_id = ('LASST470000000000005', 'LASST470000000000004', 'LASST470000000000003')

Now in a %%sql cell:

In [8]: %%sql DB=bls
        SELECT * 
        FROM la_unemployment
        WHERE year = %(year)s
            AND period = %(period)s AND series_id IN %(series_id)s
        LIMIT 3

You can also use a colon to indicate your variables:

In [8]: %%sql DB=bls
        SELECT * 
        FROM la_unemployment
        WHERE year = :year
            AND period = :period AND series_id IN :series_id
        LIMIT 3

Both output the following table:

|   | series_id            | year | period | value     | footnote_codes |
|---|----------------------|------|--------|-----------|----------------|
| 1 | LASST470000000000003 | 1976 | M01    | 6.2       | None           |
| 2 | LASST470000000000004 | 1976 | M01    | 111152.0  | None           |
| 3 | LASST470000000000005 | 1976 | M01    | 1691780.0 | None           |
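
These are standard pyformat and SQLAlchemy named parameters under the hood. If you ever need the same query outside SQLCell, a rough psycopg2 equivalent looks like this (a sketch, assuming the bls connection details from earlier; psycopg2 adapts the Python tuple for the IN clause):

In [9]: import psycopg2
        conn = psycopg2.connect('postgresql://username:password@host:port/bls')  # assumed connection string
        with conn, conn.cursor() as cur:
            cur.execute("""
                SELECT * FROM la_unemployment
                WHERE year = %(year)s AND period = %(period)s
                    AND series_id IN %(series_id)s
                LIMIT 3
            """, {'year': year, 'period': period, 'series_id': series_id})
            rows = cur.fetchall()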

psql metacommands

In [1]: %%sql DB=bls
        \dp

|   | Schema | Name            | Type  | Access privileges | Column privileges | Policies |
|---|--------|-----------------|-------|-------------------|-------------------|----------|
| 1 | public | avg_price       | table | nan               | nan               | nan      |
| 2 | public | la_unemployment | table | nan               | nan               | nan      |
| 3 | public | tu_atus         | table | nan               | nan               | nan      |

In [2]: %%sql
        \d avg_price

|   | Column    | Type                  | Modifiers |
|---|-----------|-----------------------|-----------|
| 1 | series_id | character varying(17) | nan       |
| 2 | year      | integer               | nan       |
| 3 | period    | character varying(3)  | nan       |
| 4 | value     | real                  | nan       |

In [3]: %%sql DB=sports
        \COPY public.nba (dateof, team, opp, pts, fouls) to '/<path>/<to>/<file>.csv'
	
Out[3]: COPY 3092

Multi-threading

All queries are executed on their own thread, so you can run as many queries as your box will allow while concurrently executing Python code.
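
To illustrate the general pattern (a sketch only, not SQLCell's actual internals):

In [12]: import threading
         from sqlalchemy import create_engine, text
         engine = create_engine('postgresql://username:password@host:port/bls')  # assumed connection string
         results = {}
         def run_query():
             # the query runs off the main thread
             with engine.connect() as conn:
                 results['rows'] = conn.execute(
                     text('SELECT * FROM la_unemployment LIMIT 3')).fetchall()
         t = threading.Thread(target=run_query)
         t.start()
         # ...the notebook keeps executing Python while the query runs...
         t.join()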

Buttons

[screenshot: button toolbar]

Buttons include:
• Viewing the Query Plan as a D3.js Sankey graph
• Running EXPLAIN ANALYZE on your query
• Executing the query
• Executing the query and returning the SQLAlchemy results in a variable
• Saving the results to a TSV
• Stopping the query
• Switching between user-defined engines (button group on the right; see Declare Engines for instructions on how to define engines)

Inline editing

Set the EDIT parameter to True to enable inline editing. As long as you are querying a single table and that table has a primary key, you can edit it through the UI.

[screenshot: inline editing]

Easy-to-read Query Plan table

This includes a heatmap-like color scale to indicate problem spots in your query.

[screenshot: query plan table]

Easy-to-read Query Plan graph

A Sankey graph, built with D3.js, that uses a heatmap-like color scale to indicate problem spots in your query.

[screenshot: query plan Sankey graph]

Alter Column Name and Type Inline

To edit the column info via the UI, use the \d <table-name> metacommand and the EDIT parameter.

In[15]: %%sql DB=sports EDIT=True
        \d nba

[screenshot: alter column UI]

Notifications

SQLCell now includes "Growl"-like, Bootstrap-styled notifications using mouse0270's awesome bootstrap-notify. The entire query is shown in a scrollable pre tag, and clicking the notification focuses the window on that query's results.

[screenshot: notifications]

And that's it.

Enjoy and contribute.
