A visitor to New York City asked a passerby for directions to the city's famous classical music venue:
Visitor: Excuse me, how do I get to Carnegie Hall?
Passerby: Practice, practice, practice!
Online Class WebEx: http://umbc.webex.com/meet/jaywang
Statistical Analysis, Data Visualization, and Python Programming All in One Place with Hands-on Practice.
Python for Data Analysis by Wes McKinney, Second Edition
- Use Interactive Python Shell (Interpreter)
  - python.org
    - No registration is required
- Run Python Scripts through Command Line
  - pythonanywhere.com
    - Requires account registration
    - Requires basic Linux familiarity
- Use Jupyter Notebooks
  - Google Colab
    - Plus: Seamless integration with GitHub
  - Kaggle.com
    - Plus: Data science community with public datasets and notebooks for learning
  - notebooks.ai
    - Plus: Access to a Linux terminal to run Python scripts and Python web apps
  - Microsoft Azure Notebook
https://m.facebook.com/story.php?story_fbid=2596321403939440&id=1815765878661667
- Syntax (the most basic programming requirements)
- Idiom (for example, using .join for string concatenation)
- Design Patterns (best practices and approaches to common problems)
- Architecture (overall project structure)
Most books and courses teach levels 1 and 2 and rarely touch on levels 3 and 4.
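To make the difference between the syntax level and the idiom level concrete, here is a minimal sketch that builds the same string twice. The variable names are illustrative assumptions, not part of the course material.

```python
words = ["practice", "practice", "practice"]

# Syntax level: it works, but the separator logic is written by hand.
result_loop = ""
for i, word in enumerate(words):
    if i > 0:
        result_loop += ", "
    result_loop += word

# Idiom level: str.join inserts the separator between items for us.
result_join = ", ".join(words)

print(result_loop)
print(result_join)
```

Both versions produce the same string. The second is what an experienced Python programmer would normally write.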
https://blog.newrelic.com/engineering/python-programming-styles/
- Imperative
- Procedural
- Object-oriented
- Functional
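As a rough illustration of these four styles, the sketch below adds up the same list of numbers in each of them. The names (`sum_procedural`, `Accumulator`, and so on) are hypothetical and chosen only for this example.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# Imperative: step-by-step statements that update a variable in place.
total_imperative = 0
for n in numbers:
    total_imperative += n


# Procedural: the same steps packaged into a reusable function.
def sum_procedural(values):
    total = 0
    for value in values:
        total += value
    return total


# Object-oriented: the data and the behavior live together in a class.
class Accumulator:
    def __init__(self, values):
        self.values = list(values)

    def total(self):
        return sum(self.values)


# Functional: fold the values with a function, without mutating any state.
total_functional = reduce(lambda acc, n: acc + n, numbers, 0)

print(total_imperative, sum_procedural(numbers),
      Accumulator(numbers).total(), total_functional)
```

All four approaches give the same answer. In practice, Python programs often mix these styles rather than committing to one.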
- Written Code - Solely your responsibility. Make sure it is clean, correct, and commented (the 3C rule).
- Source Data - Primary data is your responsibility. You have no control over secondary data, so be careful in selecting and cleansing it.
- Existing Libraries - You have no control over existing libraries and algorithms, so be careful in selecting and using them.
- Interpretation of Results - Be careful to distinguish what is objective from what is subjective, and what the data exhibit from what experts know.
The one thing you have absolute control over is the code you write. Make sure you don't write bad code (complicated, incorrect, and undocumented), so-called spaghetti code.
Wikipedia's Definition of Spaghetti Code:
"Spaghetti code is a pejorative phrase for unstructured and difficult-to-maintain source code. Spaghetti code can be caused by several factors, such as volatile project requirements, lack of programming style rules, and insufficient ability or experience."
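As one possible reading of the 3C rule (clean, correct, commented), here is a small sketch. The function `mean_of_valid` is a hypothetical example, not code from the course.

```python
def mean_of_valid(values):
    """Return the mean of the non-missing values, or None if there are none."""
    # Clean: one small function with one job and a descriptive name.
    valid = [v for v in values if v is not None]

    # Correct: handle the empty case explicitly instead of dividing by zero.
    if not valid:
        return None

    return sum(valid) / len(valid)


# Commented: the docstring and comments above state the intent,
# and these quick checks document the expected behavior.
print(mean_of_valid([1, 2, None, 3]))
print(mean_of_valid([None, None]))
```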
The name Jupyter refers to the three core programming languages it supports:
- Julia
- Python
- R
Julia and R are popular for statistical analysis and data science. Python is a general-purpose programming language that is also popular in data science and is well suited to all kinds of development.
- Define Problem and Ask Questions
- Define Data Source and Elements
- Tidy up Data (normalize "messy" data so that it is "tidy")
- Summarize Data (summarize/tabulate, descriptive statistics)
- Visualize Data (static and interactive)
- Interpret and Communicate Results
Check out this paper for data tidying.
All six steps must be guided by domain knowledge, principles, and purposes.
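The tidy and summarize steps can be sketched with pandas as follows. The data set, column names, and values are made up for illustration. The sketch assumes pandas is installed.

```python
import pandas as pd

# Hypothetical "messy" table: one column per quarter (wide format).
messy = pd.DataFrame({
    "region": ["East", "West"],
    "Q1": [10, 20],
    "Q2": [30, 40],
})

# Tidy up: reshape so that each row is one observation
# (region, quarter, sales) and each column is one variable.
tidy = messy.melt(id_vars=["region"], var_name="quarter", value_name="sales")

# Summarize: total sales per region.
summary = tidy.groupby("region")["sales"].sum()

print(tidy)
print(summary)
```

From here, a call such as `summary.plot.bar()` would be one way to move on to the visualization step, assuming matplotlib is also available.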
- One Variable
  - Categorical Variable (frequency table)
    - Bar Chart/Pareto Chart (sorted)
    - Pie Chart (avoid it when there are too many categories)
  - Numerical Variable (discrete or continuous, frequency distribution)
    - Boxplot
    - Histogram
    - Line Chart
    - Area Chart
  - Textual Variable/Data
    - Wordcloud
- Multiple Variables
  - Two categorical variables (contingency table, pivot table)
    - Stacked Bar Chart
    - Grouped Bar Chart
  - Two numerical variables
    - Scatter Plot
    - Bubble Chart (scatter plot with dot size varying by a third numerical variable)
    - Motion Chart (interactive scatter plot showing the trend over a third, time-based variable such as quarter or month)
    - Scatter Plot with varying colors and shapes of dots/marks reflecting additional categorical variables (dimensions)
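The tables behind these charts can be sketched with the standard library alone. The survey records below are invented for illustration. A frequency table feeds a bar or Pareto chart, and a contingency table feeds a stacked or grouped bar chart.

```python
from collections import Counter

# Hypothetical survey records: (major, level) pairs.
records = [
    ("IS", "UG"), ("CS", "UG"), ("IS", "Grad"),
    ("Math", "UG"), ("IS", "UG"), ("CS", "Grad"),
]

# One categorical variable: frequency table sorted by count,
# which is the order a Pareto chart would use.
major_counts = Counter(major for major, _ in records)
frequency_table = major_counts.most_common()

# Two categorical variables: contingency table keyed by (major, level).
# A Counter returns 0 for combinations that never occur.
contingency = Counter(records)

for major, count in frequency_table:
    print(f"{major:<6}{count}")
print(contingency[("IS", "UG")], contingency[("Math", "Grad")])
```

In pandas, `Series.value_counts()` and `pd.crosstab()` produce the same kinds of tables directly from a DataFrame.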
- I put practice above PowerPoint. I don't focus on syntax and mechanics.
- I share with students how I learn. I show them how to learn on their own so that they become life-long learners without relying on teachers or gurus.
- I offer neither fish nor fishing gear. I show students where and how to find them.
- The Internet is the biggest fish pond and Google is the best fishing gear.
- Curiosity may kill a cat, but it sure makes a student.
- Interactive Python Tutorial
- W3Schools Python Tutorial
- Practical Data Science for Journalists and Everyone Else
- Markdown Cheatsheet
- GitHub Flavored Markdown
- AP Statistics Tutorial
- Practice Python
- Python Exercises, Practice, Solution
- 4-hour Beginner's Python for Data Science Training Video
- Free Book: Python Data Science Handbook by Jake VanderPlas
- A Visual Intro to NumPy and Data Representation
- 10 Minutes to Pandas
- A Gentle Visual Intro to Data Analysis in Python Using Pandas
- Summarising, Aggregating, and Grouping data in Python Pandas
- Visualizing Pandas' Pivoting and Reshaping Functions
- Data Visualization with Python
- From Data to Viz
- Data Visualization Best Practices
- Introduction to Exploratory Data Analysis in Python.ipynb
- Object-Oriented Programming in Python vs Java
- Ask, Acquire, Analyze, Apply, Announce, Assess
- A First Course in Data Science