This is a concise demonstration of essential pandas operations for data manipulation and analysis.
This tutorial covers:
- reading data (CSV and Parquet formats) and some performance considerations,
- accessing data and some best practices,
- filtering data and a performance comparison,
- column operations such as adding new columns, naming, and datatype conversions,
- data export options,
- merging (join types) and concatenating data,
- handling null values,
- aggregating data (groupby operations), and
- some advanced functionality (shift operations, ranking systems, cumulative operations, ...).
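To give a feel for the aggregation and advanced topics listed above, here is a minimal sketch on a made-up table. The DataFrame, its column names (`day`, `coffee_type`, `units_sold`), and its values are assumptions for illustration only; they are not taken from the tutorial's datasets.

```python
import pandas as pd

# Hypothetical stand-in for daily coffee sales; names and values are invented.
sales = pd.DataFrame({
    "day": ["Mon", "Mon", "Tue", "Tue", "Wed", "Wed"],
    "coffee_type": ["Espresso", "Latte"] * 3,
    "units_sold": [25, 15, 30, 20, 35, 25],
})

# Aggregation: total units per coffee type.
totals = sales.groupby("coffee_type")["units_sold"].sum()

# Shift: the previous day's units for the same coffee type (NaN on the first day).
sales["prev_units"] = sales.groupby("coffee_type")["units_sold"].shift(1)

# Cumulative operation: running total per coffee type.
sales["running_total"] = sales.groupby("coffee_type")["units_sold"].cumsum()

# Ranking: 1 is the best-selling row; ties share a rank with no gaps.
sales["sales_rank"] = sales["units_sold"].rank(ascending=False, method="dense")

print(totals)
print(sales)
```

Grouping before `shift` and `cumsum` keeps each coffee type's history separate, so one type's values never leak into another's.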
The tutorial uses three main datasets:
- Coffee Sales Data: Daily coffee shop sales with different coffee types
- Biographical Data: Olympic athletes' biographical information
- Country Codes: NOC (National Olympic Committee) to country name mappings
All datasets are loaded directly from GitHub repositories, so no local files are required.
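`pd.read_csv` accepts a URL in place of a file path, which is why no local files are needed. The sketch below reads from an in-memory string so that it runs offline; in the notebook the first argument would be the dataset's URL, which is not reproduced here. The column names and values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a remote file.
csv_text = (
    "day,coffee_type,units_sold\n"
    "Mon,Espresso,25\n"
    "Mon,Latte,15\n"
    "Tue,Espresso,30\n"
)

# Reading only the needed columns and fixing dtypes up front can reduce
# memory use and parsing work on larger files.
coffee = pd.read_csv(
    io.StringIO(csv_text),  # a URL string would go here instead
    usecols=["coffee_type", "units_sold"],
    dtype={"coffee_type": "category", "units_sold": "int32"},
)

print(coffee.dtypes)
```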
Download the pandas_comprehensive_tutorial.ipynb file. Go to Google Colab and upload it.
Read the instructions from setup.md
- Use vectorized operations over `.apply()` when possible
- Prefer `.loc` for explicit, readable code
- Choose appropriate data types for memory efficiency
- Use `pd.cut()` for binning operations instead of nested conditions
- Use `.copy()` when creating DataFrame variants, so the variant is an independent object rather than another name bound to the same data
- Handle null values explicitly
- Use descriptive column names
- Prefer explicit over implicit operations
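Several of these practices can be sketched in one short example. The data, column names (`athlete`, `height_cm`, `weight_kg`), and bin edges below are assumptions chosen for illustration, not values from the tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical athlete data; one height is deliberately missing.
df = pd.DataFrame({
    "athlete": ["A", "B", "C", "D"],
    "height_cm": [158.0, 172.0, np.nan, 191.0],
    "weight_kg": [52.0, 68.0, 80.0, 95.0],
})

# Vectorized column arithmetic instead of df.apply(lambda row: ..., axis=1).
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# .loc with an explicit row mask and column list, then .copy() so that
# edits to `tall` never touch `df`.
tall = df.loc[df["height_cm"] > 170, ["athlete", "height_cm"]].copy()
tall["height_m"] = tall["height_cm"] / 100

# pd.cut for binning instead of nested if/else conditions.
df["height_class"] = pd.cut(
    df["height_cm"],
    bins=[0, 165, 180, 250],
    labels=["short", "medium", "tall"],
)

# Handle nulls explicitly: record which heights are known.
df["height_known"] = df["height_cm"].notna()

# A repeated-label column can often be stored as a category to save memory.
df["athlete"] = df["athlete"].astype("category")

print(df)
print(tall)
```

The missing height propagates as `NaN` through the BMI calculation and the binning, which is why the explicit `height_known` flag is useful downstream.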
This project is open source and available under the MIT License.
- Data sources from Keith Galli's pandas tutorial repository
- Pandas development team for creating this amazing library
- The open-source community for continuous improvements
Happy Data Engineering! 🐼📊