This course teaches you how to build pipelines to import data kept in common storage formats. You’ll use pandas, a major Python library for analytics, to get data from a variety of sources, from spreadsheets of survey responses, to a database of public service requests, to an API for a popular review site. Along the way, you’ll learn how to fine-tune imports to get only what you need and to address issues like incorrect data types. Finally, you’ll assemble a custom dataset from a mix of sources.
- `read_csv()` and `read_excel()`: setting data types, choosing which data to load, and handling missing data and errors
- Key parameters and functions: `usecols`, `nrows`, `skiprows`, `names`, `dtype`, `na_values`, `.isna()`, `sheet_name`, `parse_dates` (for standard datetime formats), `pd.to_datetime()` (to parse non-standard date formats); see the sketch after this list
- Putting multiple spreadsheets together (iterating through the dictionary of dataframes returned when `sheet_name=None`)
- Setting custom true/false values
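A minimal sketch of these import options. The file names (`survey.csv`, `survey.xlsx`), column names, and date format are placeholders, not course data:

```python
import pandas as pd

# Fine-tune a CSV import: select columns and rows, set dtypes, flag custom
# missing-value markers, and parse a standard datetime column on load.
survey = pd.read_csv(
    "survey.csv",                     # placeholder file name
    usecols=["id", "age", "created", "updated", "opted_in"],
    nrows=1000,                       # read only the first 1,000 data rows
    dtype={"age": "Int64"},           # nullable integer dtype
    na_values={"age": ["N/A", " "]},  # extra strings to treat as missing
    parse_dates=["created"],          # handles standard datetime formats
    true_values=["Yes"],              # custom true/false markers
    false_values=["No"],
)
print(survey.isna().sum())            # count missing values per column

# Non-standard date formats (e.g. "02092024") need pd.to_datetime() after loading.
survey["updated"] = pd.to_datetime(survey["updated"], format="%m%d%Y")

# Combine multiple worksheets: sheet_name=None returns a dict of dataframes,
# one per sheet, which you can iterate over and stack.
sheets = pd.read_excel("survey.xlsx", sheet_name=None)
all_responses = pd.concat(sheets.values(), ignore_index=True)
```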
- `read_sql()` and `sqlalchemy`: querying relational databases with SQL
- `SELECT`, `WHERE`, aggregate functions, and joins; installing SQLite and importing a `.db` file into it
- sqlalchemy's `create_engine()` makes an engine that handles database connections; see the sketch after this list
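A minimal sketch of the pattern, assuming a local SQLite file named `service_requests.db` with `requests` and `boroughs` tables (all names here are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# create_engine() builds an engine that manages connections to the database.
engine = create_engine("sqlite:///service_requests.db")

# Filter, aggregate, and join directly in SQL, then load the result.
query = """
SELECT b.name AS borough,
       COUNT(*) AS n_requests
  FROM requests AS r
  JOIN boroughs AS b
    ON r.borough_id = b.id
 WHERE r.status = 'open'
 GROUP BY b.name
"""
open_by_borough = pd.read_sql(query, engine)
```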
- `read_json()`, `json_normalize()`, and `requests`: working with APIs and nested JSON
- Appending and merging datasets
- `requests.get()` (fetch data from an API), `.json()` (extract the JSON payload), `.append()`, `.merge()`; see the sketch after this list
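A minimal sketch of the API-to-dataframe flow. The URL, parameters, headers, and field names are placeholders standing in for a real review-site API, and `pd.concat()` stands in for `DataFrame.append()`, which was removed in pandas 2.0:

```python
import pandas as pd
import requests

def fetch_page(offset):
    """Fetch one page of results from a placeholder API and flatten it."""
    resp = requests.get(
        "https://api.example.com/businesses/search",   # placeholder URL
        params={"location": "Chicago", "term": "coffee", "offset": offset},
        headers={"Authorization": "Bearer YOUR_API_KEY"},
    )
    data = resp.json()                          # extract the JSON payload
    # Flatten nested records; nested keys become col_subcol columns.
    return pd.json_normalize(data["businesses"], sep="_")

page1 = fetch_page(0)
page2 = fetch_page(20)

# Append pages of results; pd.concat() replaces DataFrame.append().
all_pages = pd.concat([page1, page2], ignore_index=True)

# Merge with another dataset on a shared key.
ratings = pd.DataFrame({"id": ["b1", "b2"], "rating": [4.5, 4.0]})
combined = all_pages.merge(ratings, on="id", how="left")
```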