Data analysis increasingly involves mining data from the Internet and handling big datasets. However, students often lack the knowledge and experience required to take full advantage of the data opportunities offered by the Internet and social media. This course guides students through their first steps in data mining. The course offers case studies and exercises in a friendly class environment. Students will learn (by doing) how to collect and handle web data in their future work. The course covers the primary skills required to access web data confidently.
Personal motivation to learn data mining is the only hard requirement for this course: passive, listening-only, and credit-oriented attendance styles are discouraged and incompatible with effective and durable learning.
Students are expected to have some working knowledge of the R language (key notions will be refreshed in class). Some basic statistics and programming skills (e.g., one previous course in statistics using R) are recommended to reduce the overall workload. Students are also expected to possess basic computer skills (e.g., using keyboard shortcuts, handling files and folders, and basic knowledge of the shell [short intro]).
This course requires you to install and configure several software tools. Please allocate approximately 2 hours before our first class to complete the following eight steps carefully. If you need support, open an issue on GitHub as described after the list.
- Register at Perusall.com with your university email and familiarize yourself with the Perusall interface;
- Install or upgrade R and RStudio, and then install the “tidyverse” packages following this guide;
- Register a GitHub account using your personal email and a professional username (guide);
- Install Git (guide);
- Configure Git by adding your GitHub chosen username and email (guide);
- Generate and set a Personal Access Token (PAT) for the HTTPS protocol (guide; in short, in RStudio, execute usethis::create_github_token(), then copy the token, and finally set it by running gitcreds::gitcreds_set() and pasting it there);
- Create a test repository on GitHub (guide; pick HTTPS);
- Clone the test repo to your computer using RStudio (guide; if you managed to clone, you should be all right even without making a local change and pushing it).
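The Git-related steps above can also be run from the R console. The following is a minimal sketch, assuming the usethis and gitcreds packages are available from CRAN; the username, email, and repository name are placeholders, and usethis::create_from_github() is shown only as a possible alternative to the RStudio cloning dialog described in the guide, not as the prescribed method for this course.

```r
# Sketch of the Git/GitHub setup steps, run interactively in the RStudio console.
# Assumption: usethis and gitcreds are the helper packages used for this.
install.packages(c("usethis", "gitcreds"))

# Configure Git with your GitHub username and email (placeholder values).
usethis::use_git_config(
  user.name  = "your-github-username",
  user.email = "you@example.com"
)

# Open GitHub in the browser to generate a Personal Access Token (PAT),
# then copy the token from the page.
usethis::create_github_token()

# Store the PAT: paste the copied token when prompted.
gitcreds::gitcreds_set()

# Print a situation report to check that Git and the PAT are set up.
usethis::git_sitrep()

# Optional alternative to cloning through the RStudio dialog.
# "your-github-username/test-repo" is a hypothetical repository name.
# usethis::create_from_github("your-github-username/test-repo", fork = FALSE)
```

If usethis::git_sitrep() reports your name, email, and a discovered token, the configuration steps are most likely complete.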
To open an issue:
- From the GitHub Repository: Find the 'Issues' tab.
- Create a New Issue: Click on 'New Issue' to start writing your request for help.
- Write a Clear Title: Summarize your problem briefly in the title, like 'Trouble Installing RStudio'.
- Describe Your Problem:
  - Be Detailed: Explain what issue you're facing, including any error messages you see. Being specific helps me understand your problem better.
  - Steps You Took: List what you did step by step until you encountered the problem. You should try to fix the problem on your own before writing an issue.
  - Include Screenshots: If you can, add pictures of your screen showing the error messages or where you got stuck. You can drag and drop images into the GitHub issue.
  - Your Computer's Details: Share information about your computer, such as the operating system and the version of the software you're trying to install.
- Submit Your Issue: Once you've filled in all the details, click the 'Submit new issue' button. I'll look into it and get back to you with help or a solution.
The course is structured in three blocks:
- An introductory block covers the essential knowledge for working with big data (notions of R programming, developing reproducible code, reporting in automated notebooks, version control, and Git/GitHub; secondary datasets for social science research & MySQL).
- A data access block focuses on web scraping and related tools (introduction to regular expressions, HTML, XML, and JSON data structures).
- A third block introduces more advanced data access concepts, such as API interaction, and allows students to practice with live coding sessions in class.
By the end of the course, active participants will:
- Gain proficiency in data analysis, learning to analyze data efficiently and reproducibly. [Data analysis]
- Understand and critically re-assess data-related issues arising in applied research problems with big data. [Data literacy]
- Learn how to develop and debug complex code throughout the data analysis cycle (mining, tidying, analyzing, reporting). [Programming and statistical skills]
- Develop feasible big data research designs. [Research and analytical skills]
- Handle unstructured text and unfold the tidying process to obtain structured data. [Text mining]