Skip to content

Latest commit

 

History

History
277 lines (156 loc) · 12.1 KB

README.rst

File metadata and controls

277 lines (156 loc) · 12.1 KB

From Spreadsheets to DataFrames: Escaping Excel Hell with Python


Pycon 2021 Tutorial Video [YouTube] - May 12, 2021

Other Presentations:

STL Python Presentation [YouTube]

Chicago Python Users Group [YouTube]

Details

A spreadsheet is a wonderful invention and an excellent tool for certain jobs. All too often, however, spreadsheets are called upon to perform tasks that are beyond their capabilities. It’s like the old saying, 'If the only tool you have is a hammer, every problem looks like a nail.' However, some problems are better addressed with a screwdriver, with glue, or with a Swiss Army Knife.

Python is described by some in the programming world as the Swiss Army Knife of programming languages because of its unrivaled versatility and flexibility in use. This allows its users to solve complex problems relatively easily compared with other programming languages and is one of the reasons why Python has become increasingly popular over time.

In this tutorial, we’ll briefly discuss spreadsheets, signs that you might be living in “Excel Hell”, and then we’ll spend the rest of the time learning how to escape it using Python.

In the first section, we’ll extend on what spreadsheet users already know about cells, rows, columns, and formulas, and map them to their Python equivalent, such as variables, lists, dictionaries, and functions. At the end of this section, we’ll do an interactive exercise and learn how we can perform a simple calculation, similar to one you might do in Excel, but instead using Python.

In the second section, we’ll discuss (and attempt) how we can perform more complex tasks including web scraping, data processing, analysis, and visualization, by utilizing a few popular 3rd party libraries used including Requests, Pandas, Flask, Matplotlib, and others.

In the last section, we’ll round out our discussion with a few important concepts in data management, including concept of tidy data, building a data pipeline, and a few strategies (and packages) to use when approaching various data problems, including demo using Apache Airflow.

Slides

Intro [Slides]

Excel to Python [Slides]

Python Libraries & Resources [Slides]

Data Management [Slides]

Tutorial Code

Section 1 - Python Fundamentals for an Excel User

01 basics_but_important_stuff.ipynb

02 files_lists_dictionaries.ipynb

Section 1 - Challenges

challenge_1.py

challenge_1_answer.py

challenge_2.py

challenge_2_answer.py

challenge_3.py

challenge_3_answer.py

Section 2 - Real-World Python Example for an Excel User

01-real-world-example.py

02-real-world-example-refactored.py

Section 2 - Challenge

section2_challenge.rst

Section 3 - Best Practices in Python & Data for an Excel User

Data Management [Slides]

07-airflow

STL Python - Talk Code

01-basics.ipynb

02-webscraping.ipynb

03-tidy-data.ipynb

04-pandas.ipynb

05-data-analysis.ipynb

06-data-visualizations.ipynb

STL Python - Folders

  • 01-basics - examples used in presentation
  • 02-webscraping - program that clicks through a calendar (written in javascript) and exports csv files
  • 02-selenium-safari - program that logins to website, scrapes html from javascript generated page, cleans html, and exports to pdf files
  • 02-webscrape-celery - example of that uses a message queue, and celery to download list of urls
  • 04-other-analysis - examples of different quantitative notebooks
  • 05-other-visualizations - examples of different data visualization tools
  • 06-flask - different flask examples
  • 07-airflow - example that uses airflow to download and store stock prices

Quick Start Guides

Install Anaconda & Pycharm

  • Anaconda = manages your Python environments
  • Pycharm = code editor

Install Git - Allows you to git clone/download Github Projects'

Setup Environment & Run Example (Windows):

$ git clone https://github.com/ryansmccoy/spreadsheets-to-dataframes.git
$ cd spreadsheets-to-dataframes
$ conda create -n spreadsheets-to-dataframes python=3.8 pandas scipy numpy lxml jupyter matplotlib -y
$ activate spreadsheets-to-dataframes
$ pip install -r requirements_dev.txt

Setup Environment & Run Example (Linux):

$ git clone https://github.com/ryansmccoy/spreadsheets-to-dataframes.git
$ cd spreadsheets-to-dataframes
$ conda create -n spreadsheets-to-dataframes python=3.8 pandas scipy numpy lxml jupyter matplotlib -y
$ source activate spreadsheets-to-dataframes
$ pip install -r requirements_dev.txt

Running Jupyter Notebooks:

Navigate to spreadsheet-to-dataframe directory/folder:

$ activate spreadsheets-to-dataframes
$ jupyter notebook

(Optional) Install Docker to Run Airflow Example

https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html

Python Books & Videos:

(Book) Python Crash Course, 2nd Edition

(Book) Introducing Python: Modern Computing in Simple Packages

(Book) Learning Python, 5th Edition

(Book) Automate the Boring Stuff with Python, 2nd Edition: Practical Programming for Total Beginners

(Book) Think Python: How to Think Like a Computer Scientist

(Book) The Quick Python Book (Book)

(Book) Serious Python: Black-Belt Advice on Deployment, Scalability, Testing, and More

(Github) A Whirlwind Tour of Python

(Github) Python Data Science Handbook

(Github) Introduction to Python

Cookiecutter:

$ pip install cookiecutter

Resources:

https://github.com/cookiecutter/cookiecutter

https://github.com/audreyfeldroy/cookiecutter-pypackage

https://towardsdatascience.com/cookiecutter-creating-custom-reusable-project-templates-fc85c8627b07

Requests

$ pip install requests

Resources:

https://python.readthedocs.io/en/stable/library/stdtypes.html

https://realpython.com/python-requests/

Have you mastered Requests? Then you should check out multithreading, concurrency, asyncio, message queues, parallelism.

https://yasoob.me/2019/05/29/speedingw-up-python-code-using-multithreading/

https://www.toptal.com/python/beginners-guide-to-concurrency-and-parallelism-in-python

https://creativedata.stream/multi-threading-api-requests-in-python/

https://levelup.gitconnected.com/asynchronous-tasks-in-python-with-celery-rabbitmq-redis-480f6e506d76

https://tests4geeks.com/blog/python-celery-rabbitmq-tutorial/

https://codeburst.io/automated-web-scraping-with-python-and-celery-ac02a4a9ce51

https://github.com/ryansmccoy/zmq-high-speed-subs

Pandas

$ pip install pandas

Resources:

Dealing With Data

Pandas Cookbook

brandon-rhodespycon-pandas-tutorial

Python pandas Q&A video series

Master Data Analysis with Python

Have you mastered Pandas? Then you check out Dask and Spark.

https://dask.org/

https://spark.apache.org/docs/latest/api/python/

Visualization:

$ pip install matplotlib

Resources:

https://github.com/fasouto/awesome-dataviz

https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html

https://www.toptal.com/designers/data-visualization/data-visualization-tools

https://realpython.com/pandas-plot-python/

Have you mastered Matplotlilb? Then you should checkout Javascript, D3, React, Tableau

Flask:

$ pip install flask

Resources:

https://www.fullstackpython.com/flask.html

https://blog.miguelgrinberg.com/

Have you mastered Flask? Then you should checkout FastAPI, Javascript, Node, React