The Webpage of This Repository: Tools in Data Science
Data Science Center, Shahid Beheshti University
In this repository, we introduce videos, slides, notebooks, and papers about some of
the important tools in data science, as well as tools for writing and sharing your projects.
- Command Line
- Anaconda
- Integrated Development Environment (IDE)
- Markdown
- Git
- Docker
- Programming Languages
- Python Libraries for Data Science
- Files
- Introduction to the Command Line by Kevin Markham
- Cheat Sheet: Windows
- Cheat Sheet: Linux & Mac
Additional Reading:
- Book: Data Science at the Command Line
- Blog: 12 Essential Command Line Tools for Data Scientists by Matthew Mayo
Anaconda Distribution: With over 6 million users, the open source Anaconda Distribution is the fastest and easiest way to do Python and R data science and machine learning on Linux, Windows, and Mac OS X. It's the industry standard for developing, testing, and training on a single machine.
- Installation
- Blog: Managing Environments
- Blog: Kernels for Different Environments
- Getting Started with Conda
- Why You Need Python Environments and How to Manage them with Conda by Gergely Szerovay
Additional Reading:
- Stop Installing Tensorflow Using pip for Performance Sake! by Michael Nguyen
- Using Pip in a Conda Environment by Jonathan Helmus
- Conda Commands (Create Virtual Environments for Python with Conda) by LipingY
- Conda Cheat Sheet
Python IDEs and Code Editors (Guide) by Jon Fincher
- IDE: An IDE (or Integrated Development Environment) is a program dedicated to software development. As the name implies, IDEs integrate several tools specifically designed for software development. These tools usually include:
- An editor designed to handle code (with, for example, syntax highlighting and auto-completion)
- Build, execution, and debugging tools
- Some form of source control
- Most IDEs support many different programming languages and contain many more features. They can, therefore, be large and take time to download and install. You may also need advanced knowledge to use them properly.
- Top Python IDEs For Data Science (My Recommendation):
- Welcome to Colaboratory!
Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud. With Colaboratory you can write and execute code, save and share your analyses, and access powerful computing resources, all for free from your browser.
- Overview of Colaboratory
- Guide to Markdown
- Unrar, Unzip, Rar, Zip in Gdrive
- Importing Libraries and Installing Dependencies
- Saving and Loading Notebooks in GitHub
- Interactive Forms
- Interactive Widgets
- TensorFlow 2 in Colab
- Loading Data: Drive, Sheets, and Google Cloud Storage
- Charts: Visualizing Data
- Getting Started with BigQuery
Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. IPython also provides a rich architecture for interactive computing in multiple programming languages.
- Jupyter Notebook for Beginners: A Tutorial by Benjamin Pryke
- Advanced Jupyter Notebooks: A Tutorial by Benjamin Pryke
- 28 Jupyter Notebook Tips, Tricks, and Shortcuts by Josh Devlin
- Import-Ipynb: The code within import_ipynb.py defines a “notebook loader” that allows you to “import” other ipynb files into your current ipynb file.
Additional Reading:
- Six Easy Ways to Run Your Jupyter Notebook in the Cloud
- IPython and Shell Commands
- IPython: Beyond Normal Python
- Jupyter/IPython Notebook Quick Start Guide
- Built-in Magic Commands
- Defining Custom Magics
- Introducing IPython: Differences between line-oriented and cell-oriented magic functions
- Paper: Ten Simple Rules for Reproducible Research in Jupyter Notebooks
- Talk: Jupyter (Formerly IPython Notebook) by Finn Arup Nielsen
- Awesome JupyterLab by Hai Nguyen Mau
- JupyterLab Extension
Additional Reading:
- Jupyter Lab extensions for Data Scientist by Alexander Osipenko
- Install: The R kernel in Jupyter Lab by Rich Pauloo
- Advantages of Using R Notebooks For Data Analysis Instead of Jupyter Notebooks by Max Woolf
- R Notebook by Yihui Xie, J. J. Allaire, and Garrett Grolemund
Markdown is a lightweight markup language that you can use to add formatting elements to plaintext documents. Created by John Gruber in 2004, Markdown is now one of the world’s most popular markup languages.
- Blog: Learn Markdown Online
- Cheat Sheet: Markdown Syntax
- Blog: Markdown Tables Generator
Additional Reading:
- Getting Started with R Markdown
- Word to Markdown
- Complete List of GitHub Markdown Emoji Markup by Rafael Xavier de Souza
- R Markdown
- R Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, and Garrett Grolemund
- R Markdown Cheat Sheet - RStudio
Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency. Git is easy to learn and has a tiny footprint with lightning fast performance.
- Resources to Learn Git
- Pro Git is an excellent book to learn Git.
Additional Reading:
- Git Workflow
- Git Immersion looks promising to practice Git.
- Git Quick Reference for Beginners
- Learn Git from Git Handbook
- Practice Git Online
- Learn Git from Scratch in LabEx
- Git Book
- Git Tutorial by Jae Woo Lee and Stephen A. Edwards
- Git Tools - Reset Demystified
- An Introduction to Git by Politecnico di Torino
- GIT for Beginners by Anthony Baire
- A Quick Guide to Git and Version Control by Jay Johnson
- Tutorial: Introduction to Git and Github
- To understand how to contribute on GitHub, first learn about forks and pull requests.
Docker provides a simple and powerful developer experience, workflows and collaboration for creating applications.
- Blog: Learn to Build and Deploy Your Distributed Applications Easily to The Cloud With Docker
- Blog: Docker for Beginners by Prakhar Srivastav
- Blog: Docker Tutorial: Get Going From Scratch by Eric Goebelbecker
- Blog: Dockerfile Tutorial by Example - Basics and Best Practices by Márk Takács
- Blog: Docker – COPY Instruction by Raunak Jain
- Blog: What is the WORKDIR Command in Docker? by Edpresso Team
- Blog: Docker – WORKDIR Instruction by Raunak Jain
- Blog: Guide to Docker Volumes
- Slide: Introduction to Docker by Jérôme Petazzoni
- Slide: Docker and Deployment by Joel Spolsky
You can learn Python via SoloLearn, a great website for getting started with coding. It offers easy-to-follow lessons, interspersed with quizzes to help you retain what you are learning. We also recommend the following references:
- Book: Think Python: How to Think Like a Computer Scientist, Second Edition, by Allen B. Downey (free PDF book)
Additional Reading:
- Python3 Tutorial: This online Python course was created and is maintained by Bernd Klein, an experienced Python trainer who gives training classes all over the world. It is an interesting introduction to Python for beginners and intermediate learners, with lots of examples and exercises.
- Regular Expressions for Data Scientists by Alex Yang
- [Python Generators](https://www.dataquest.io/blog/python-generators-tutorial/) by Christian Pascual
- Real Python Tutorials
- Book: How to Think Like a Computer Scientist by Brad Miller and David Ranum
- Book: Python for Everybody by Charles R. Severance (free PDF book)
- Book: Practical Programming (2nd edition): An Introduction to Computer Science Using Python 3 by Paul Gries, Jennifer Campbell, and Jason Montojo, 2013.
- Book: The Python Tutorial (available from the Python website)
- Learn Python from LabEx
- Useful Functions, Tutorials, and Other Python-Related Things
- Python Debugging with pdb
- How to Use the Python Debugger by Zygmunt
- Errors and Debugging by Jake VanderPlas
- Python Data Science Tutorials by Ujjwal Karn
- Methods and Attributes in Python:
- Modules and Packages in Python:
- Python Modules and Packages – An Introduction by John Sturtz
- Packages in Python
- if \_\_name\_\_ == "\_\_main\_\_" by Peter Lynch
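The idiom covered by the last resource above can be sketched as follows. This is a minimal illustration, not taken from the linked article; the `mean` and `main` functions are hypothetical. Code under the guard runs only when the file is executed as a script, not when it is imported as a module.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def main():
    # Hypothetical entry point: runs only when the file is executed directly.
    print(mean([2, 4, 9]))


# __name__ is "__main__" when this file is run as a script,
# and the module's own name when it is imported from another file.
if __name__ == "__main__":
    main()
```

With this layout, another module can import `mean` without triggering the `print` call.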
- Argparse: The argparse module makes it easy to write user-friendly command-line interfaces.
- How to use sys.argv in Python by Vanshika Goyal
- Python Argparse Cookbook by Marcus Kazmierczak
- How to Build Command Line Interfaces in Python With Argparse by Davide Mastromatteo
- Python, Argparse, and Command Line Arguments by Adrian Rosebrock
- Python argparse (ArgumentParser) Examples for Beginners
- A Simple Guide To Command Line Arguments With ArgParse by Sam Starkman
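As a rough illustration of what the argparse resources above cover, here is a minimal sketch of a command-line interface. The program, its `numbers` argument, and its `--stat` option are made up for this example. An explicit argument list is passed to `parse_args` so the snippet also runs inside a notebook, where `sys.argv` holds the kernel's own arguments.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical 'summarize numbers' tool."""
    parser = argparse.ArgumentParser(description="Summarize a list of numbers.")
    # One or more positional values, converted to float by argparse.
    parser.add_argument("numbers", type=float, nargs="+", help="values to summarize")
    # An optional flag restricted to a fixed set of choices.
    parser.add_argument("--stat", choices=("mean", "max"), default="mean",
                        help="which summary statistic to compute")
    return parser


def summarize(args):
    """Compute the statistic selected on the command line."""
    if args.stat == "max":
        return max(args.numbers)
    return sum(args.numbers) / len(args.numbers)


# Equivalent to running: python summarize.py 1 2 6 --stat mean
args = build_parser().parse_args(["1", "2", "6", "--stat", "mean"])
result = summarize(args)
print(result)
```

In a real script you would call `parse_args()` with no arguments, which reads from `sys.argv`, and argparse would generate the `--help` text from the `help` strings.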
- Warning Control: Warning messages are typically issued in situations where it is useful to alert the user of some condition in a program, where that condition (normally) doesn’t warrant raising an exception and terminating the program.
- Warnings in Python by Rituraj Saha
- How to Use Python Warnings Framework?
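The following sketch shows one way the standard `warnings` module can be used for the situation described above: alerting the user without raising an exception. The `parse_ratio` helper and its clamping rule are hypothetical and exist only for illustration.

```python
import warnings


def parse_ratio(value):
    """Hypothetical helper: clamp a ratio to at most 1.0, warning instead of raising."""
    if value > 1.0:
        # stacklevel=2 attributes the warning to the caller of parse_ratio.
        warnings.warn("ratio above 1.0 was clamped", UserWarning, stacklevel=2)
        return 1.0
    return value


# Record warnings instead of printing them, e.g. to inspect them in a test.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clamped = parse_ratio(1.7)

# Temporarily silence a category of warnings inside a block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    silent = parse_ratio(2.0)

print(clamped, len(caught), silent)
```

The program keeps running in both cases; the filters only control whether and how the message is reported.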
- PDB Module: This module defines an interactive source code debugger for Python programs.
- Debugging Python Applications with the PDB Module by Muhammad Junaid Khalid
- Pickle Module: This module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
- The Python pickle Module: How to Persist Objects in Python by Davide Mastromatteo
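A minimal sketch of the pickling round trip described above, using only the standard library. The `record` dictionary is made-up illustrative data.

```python
import pickle

# Made-up object hierarchy: a dict containing a string, a list, and an int.
record = {"model": "baseline", "weights": [0.1, 0.2, 0.7], "epochs": 3}

# "Pickling": convert the object hierarchy into a byte stream.
payload = pickle.dumps(record)

# "Unpickling": rebuild an equal, but separate, object from the bytes.
restored = pickle.loads(payload)

print(type(payload).__name__, restored == record, restored is record)
```

To persist to disk, `pickle.dump(obj, file)` and `pickle.load(file)` work the same way on a file opened in binary mode. Only unpickle data you trust, since loading a pickle can execute arbitrary code.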
- Online Book: R for Data Science
Additional Reading:
- Online Book: YaRrr! The Pirate’s Guide to R by Nathaniel D. Phillips
- Online Book: Efficient R programming by Colin Gillespie and Robin Lovelace
- R Tutorial for Beginners: Learning R Programming
- R-Statistics by Selva Prabhakaran
- Command Line: The Workspace
- Blog: Awesome R
- Blog: Quick List of Useful R Packages by Garrett Grolemund
- Getting used to R, RStudio, and R Markdown by Chester Ismay
- Formulas in R Tutorial by Karlijn Willems
- Google's R Style Guide
- R Data Science Tutorials by Ujjwal Karn
- Data Science Wars: Choosing R or Python for Data Analysis? An Infographic
- R Coding Style Guide by Iegor Rudnytskyi
- Base R (Cheat Sheet)
- R Programming Cheat Sheet by Arianne Colton and Sean Chen
- Simplify Your Code with %>% (Pipes) by UC Business Analytics R Programming Guide
- Pipes by Garrett Grolemund and Hadley Wickham
- Book: Machine Learning Mastery With R by Jason Brownlee
- Blog: Caret Package by Max Kuhn
- Blog: An Introduction to Machine Learning with R by Laurent Gatto
- Cheat Sheet: Caret Package by Max Kuhn
- Blog: Caret Package – A Practical Guide to Machine Learning in R
- NoteBook: Principles of Machine Learning R
- Blog: Practical Machine Learning Course Notes by Xing Su
If you want to solve interesting problems to practice Python or R, we recommend the following sites:
- Blog: Project Euler!
- Blog: Codeforces
SQL is a domain-specific language for managing data in databases.
- SQL Fundamentals by Srini Kadamati
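To give a feel for SQL from within Python, here is a small sketch using the standard library's `sqlite3` module and an in-memory database. The `scores` table and its rows are made up for illustration.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Define a table and insert a few made-up rows.
conn.execute("CREATE TABLE scores (student TEXT, course TEXT, grade REAL)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("Sara", "stats", 18.0), ("Omid", "stats", 16.0), ("Sara", "ml", 19.0)],
)

# Query: average grade per course, sorted by course name.
rows = conn.execute(
    "SELECT course, AVG(grade) FROM scores GROUP BY course ORDER BY course"
).fetchall()
conn.close()

print(rows)
```

The `?` placeholders let the database driver handle quoting, which is safer than building SQL strings by hand.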
Python continues to take leading positions in solving data science tasks and challenges. KDnuggets introduced 20 Python libraries for data science. The following table was adapted from Applied Machine Learning and Deep Learning, created by Cuixian Chen. Here are five of the most important libraries:
| Topic | Resources |
|---|---|
| Python | Overview [Word], Tutorial [PDF] [Code] |
| Numpy | [PDF] [Code], User Guide [Link], Quickstart [Link], Reference [Link], Practice Numpy in LabEx [Link], Cheatsheet [Link] |
| Matplotlib | [PDF] [Code], Example [Link], Tutorials [Link], Reference [Link], Practice Matplotlib in LabEx [Link], Cheatsheet [Link] |
| Pandas | [Code], 10 Min to Pandas [Link], Cookbook [Link], Tutorials [Link], Reference [Link], Practice Pandas in LabEx [Link], Cheatsheet [Link] |
| Seaborn | Statistical Data Visualization [Link], Example [Link], Tutorials [Link], Reference [Link], Cheatsheet [Link] |
| Scikit | Scikit Learn [Link], Scikit Image [Link], Scikit Tutorial #1 [Code], Scikit Tutorial #2 [Code], Cheatsheet [Link] |
NumPy is the fundamental package for scientific computing with Python. Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
Additional Reading:
- Data Science iPython NoteBooks by Donne Martin
- Python Numpy Tutorial by Justin Johnson
- Python for Data Analysis
- Tutorial: An Essential Guide to Numpy for Machine Learning in Python by Siddharth Dikshit
- NoteBook: How Fast are NumPy Operations Compared to Regular Python Math? by Tirthajyoti Sarkar
- Blog: Data Science with Python: Turn your Conditional Loops to Numpy Vectors by Tirthajyoti Sarkar
- Blog: One Simple Trick for Speeding up your Python Code with Numpy by George Seif
- Exercises: Practice Numpy in LabEx
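The points made in the NumPy description above (fast array arithmetic, multi-dimensional containers, and user-defined data types) can be sketched briefly. This assumes NumPy is installed; the values and the field names in `person_dtype` are made up for illustration.

```python
import numpy as np

# Vectorized arithmetic: one expression operates on every element, no Python loop.
celsius = np.array([0.0, 21.5, 37.0, 100.0])
fahrenheit = celsius * 9.0 / 5.0 + 32.0

# A 2-D array is a multi-dimensional container with a shape and a dtype.
grid = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
column_sums = grid.sum(axis=0)      # sum down each column

# A structured dtype (hypothetical fields) stores record-like data in one array.
person_dtype = np.dtype([("name", "U10"), ("age", "i4")])
people = np.array([("Ada", 36), ("Alan", 41)], dtype=person_dtype)
mean_age = people["age"].mean()

print(fahrenheit, column_sums, mean_age)
```

Several of the resources listed above compare this vectorized style with plain Python loops in more detail.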
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
- Data Science iPython NoteBooks by Donne Martin
- Using Excel with Pandas by Harish Garg
- Using pandas with Large Data Sets by Josh Devlin
- IO Tools (text, CSV, HDF5, …)
Additional Reading:
- Pandas (Faster Data Science Education by Kaggle) by Aleksey Bilogur
- Python for Data Analysis
- 10 Minutes to Pandas
- Tutorial: Visualizing Machine Learning One Concept at a Time by Jay Alammar
- Tutorial: Best practices with pandas (Video Series)
- Tutorial: 9 New Pandas Updates That Will Save You Time
- Blog: Difference between Pandas VS NumPy by Vansh Gaur
- Exercises: Practice Pandas in LabEx
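As a small taste of the data structures mentioned in the Pandas description above, the sketch below builds a `DataFrame`, adds a derived column, and aggregates by group. It assumes pandas is installed; the `sales` data is made up for illustration.

```python
import pandas as pd

# Made-up tabular data: one row per sale.
sales = pd.DataFrame({
    "city": ["Tehran", "Shiraz", "Tehran", "Tabriz"],
    "units": [3, 5, 2, 4],
    "price": [10.0, 8.0, 10.0, 12.5],
})

# Column arithmetic is element-wise, as in NumPy.
sales["revenue"] = sales["units"] * sales["price"]

# Split-apply-combine: total revenue per city (group keys are sorted by default).
revenue_by_city = sales.groupby("city")["revenue"].sum()

print(revenue_by_city)
```

The result is a `Series` indexed by city, which can be plotted, joined to other tables, or written out with the IO tools listed above.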
Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
- Data Science iPython NoteBooks by Donne Martin
Additional Reading:
- Scipy Lecture Notes
- How to Generate FiveThirtyEight Graphs in Python by Alex Olteanu
- Top 50 Matplotlib Visualizations – The Master Plots (with Full Python Code)
- Exercises: Practice Matplotlib in LabEx
Scikit-Learn provides simple and efficient tools for data mining and data analysis. It is built on NumPy, SciPy, and Matplotlib.
- Data Science iPython NoteBooks by Donne Martin
SciPy (pronounced "Sigh Pie") is open-source software for mathematics, science, and engineering. It includes modules for statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ODE solvers, and more.
Additional Reading:
- Data Science iPython NoteBooks by Donne Martin
PyMC3 allows you to write down models using an intuitive syntax to describe a data generating process.