-
Notifications
You must be signed in to change notification settings - Fork 7
/
01_concepts.Rmd
executable file
·107 lines (72 loc) · 4.67 KB
/
01_concepts.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
# Concepts
## What is a Relational Database?
A **relational database** is a collection of tables (organized in rows and columns of data) that are related to each other in some way.
Database tables are analogous to CSV files, spreadsheets in Excel, or data frames in programming languages like R or Python.
Ideally each table can be connected to another table by a column that is present
in both tables. That column may have different numbers of observations in each
table, but the values will match up. This column is called a **key**. For
example, a key commonly used on campus is your student or employee ID number.
Let's look at an example dataset of fictional student data with data about courses, grades, and employment. Can we say anything about the relationship between course grades and employment based on this data?
**Table: Student**
|ID|Name|
|:--|:--|
|123|Jane Smith|
|456|Maria Martinez|
|789|Paul Jones|
**Table: Courses**
|ID|Course|Grade|
|:--|:--|:--|
|123|Calculus|A-|
|456|Calculus|A|
|789|Calculus|C+|
|123|Data Science|A-|
|456|Data Science|B|
|789|Data Science|B-|
**Table: Employment**
|ID|Position|Employer|HoursPerWeek|
|:--|:--|:--|:--|
|123|Student Assistant|University Research Lab|5|
|456|Customer Service|Alumni Center|5|
|456|Research Assistant|University Research Lab|15|
|789|Student Assistant|University Research Lab|10|
|789|Stock Room|Medical Supplier|20|
|789|Customer Service|Alumi Center|15|
## What is SQL?
SQL stands for **structured query language**. SQL is a programming language that allows you to request (query) information from a database using a standard set of keywords. You can pronounce SQL as "ess cue ell" or "sequel".
### What kinds of questions can SQL answer?
SQL excels at extracting and combining information from large datasets. Some questions you might ask with SQL include:
* How many items are there in my data with a specific label?
* What are the unique values in a given column?
* Which records (rows) relate to a specific time period in my data?
## What is a Relational Database Management System?
A **relational database management system** (RDBMS) is a software system that creates, updates, and manages relational databases as well as managing user's access to the database. There are many different systems available. For instance, you may have heard of MySQL, Postgres, and Microsoft SQL Server.
For this workshop, we'll use [SQLite][sqlite], which is a simple and widely-used RDBMS. It runs on Windows, MacOS X, and Linux with no setup necessary!
![](images/sqlite-logo.png)
<!-- https://en.wikipedia.org/wiki/SQLite#/media/File:SQLite370.svg -->
[sqlite]: https://sqlite.org/
Every RDBMS has its own implementation or "dialect" of SQL. In other words, the set of SQL keywords supported differs slightly from one RDBMS to another, and sometimes queries have to be written differently, but the basics are the same. Details about the supported keywords for a given RDBMS can be found in that system's documentation. How does this impact you working with any given SQL RDBMS? When you're searching for help with syntax, include the version of SQL you're working with. The keywords covered in this workshop are supported by most systems.
Some RDBMS allow you to add functions with extensions. For example, the PostGIS extension adds keywords to PostgreSQL to all you to work with location information to do spatial analysis.
## Advantages & Disadvantages of SQL
SQL has major advantages in several areas important to researchers:
* Efficiency
+ Write a few lines of code rather than lots of manual data manipulation
+ SQL is meant for data manipulation
* Reproducibility
+ save queries as a record of your workflow
+ re-run code with updates
* Work with large amounts of data
+ Typically faster to run a process in a database than in a spreadsheet
+ Store lots of data (compare with Excel's row limits)
* Data management
+ One database file is the equivalent to many, many spreadsheet files (like csvs or xlsx files)
+ Write a query instead of making a new files or tabs
What does SQL not do well?
* Most RDBMS do not visualize data, however, you can connect your database to visualization tools to perform these kinds of tasks seamlessly.
* The SQL language is designed for data querying not data analysis. If you want
to run statistics on your data you can connect to your database from a
programming language like R or python, or from statistical software.
* SQL assumes you work with tabular data. If your data is another type - for example graph data or tree data - you might want to explore other database types.
<!--
* You can manage multiple data sources
- With an RDBMS, you have access to more than one database
-->