w2
caalo committed Aug 14, 2024
1 parent d8e7c80 commit 5d6c88b
Showing 2 changed files with 146 additions and 19 deletions.
34 changes: 22 additions & 12 deletions 01-intro-to-computing.Rmd
@@ -46,15 +46,15 @@ Let's open up the KRAS analysis in Google Colab. If you are taking this course w

Today, we will pay close attention to:

- Python Console (Execution): Open it via View -\> Executed code history. You give it one line of Python code, and the console executes that single line of code; you give it a single piece of instruction, and it executes it for you.
- Python Console ("Executions"): Open it via View -\> Executed code history. You give it one line of Python code, and the console executes that single line of code; you give it a single piece of instruction, and it executes it for you.

- Notebook: in the central panel of the website, you will see Python code interspersed with word document text. This is called a Python Notebook (other similar services include Jupyter Notebook, IPython Notebook), which has chunks of plain text *and* Python code, and it helps us better understand the code we are writing.

- Variable Environment: Open it by clicking on the "{x}" button on the left-hand panel. Often, your code will store information in the Variable Environment, so that information can be reused. For instance, we often load in data and store it in the Variable Environment, and use it throughout the rest of our Python code.

The first thing we will do is see the different ways we can run Python code. You can do the following:

1. Type something, such as `2+2`, into the Python Console (Execution) and press Enter. The Python Console will run it and give you an output.
1. Type something, such as `2+2`, into the Python Console (Execution) and click the arrow button. The Python Console will run it and give you an output.
2. Look through the Python Notebook, and when you see a chunk of Python Code, click the arrow button. It will copy the Python code chunk to the Python Console and run all of it. You will likely see variables created in the Variables panel as you load in and manipulate data.
3. Run every single Python code chunk via Runtime -\> Run all.

@@ -101,7 +101,7 @@ add(18, 21)
add(18, add(21, 65))
```

Remember that the Python language is supposed to help us understand what we are writing in code easily, lending itself to *readable* code. Therefore, it is sometimes useful to come up with operations that are easier to read. (Because the `add()` function isn't typically used, it is not automatically available, so we used the import statement to load it in.)
Remember that the Python language is supposed to help us understand what we are writing in code easily, lending itself to *readable* code. Therefore, it is sometimes useful to come up with operations that are easier to read. (Most functions in Python are stored in collections called **modules**, which need to be loaded before use. The `import` statement gives us access to the functions in the module `operator`.)
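
As a minimal sketch of the two common ways to load a function from a module, here using the standard-library `operator` module mentioned above (the particular numbers are just the ones from the earlier example):

```python
# Option 1: import the whole module, then reach into it with the dot.
import operator

# Option 2: import just the one function we want by name.
from operator import add

# Both spellings refer to the same function.
print(operator.add(18, 21))
print(add(18, add(21, 65)))  # the inner add() is evaluated first
```

Either form works; importing the single name keeps the calls shorter, while importing the module makes it clearer where `add` comes from.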

### Data types

@@ -120,7 +120,7 @@ A nice way to summarize this first grammar structure is using the function machi

Here are some aspects of this schema to pay attention to:

- A programmer should not need to know how the function is implemented in order to use it - this emphasizes abstraction and modular thinking, a foundation in any programming language.
- A programmer should not need to know how the function or operation is implemented in order to use it - this emphasizes abstraction and modular thinking, a foundation in any programming language.

- A function can have different kinds of inputs and outputs - it doesn't need to be numbers. In the `len()` function, the input is a String, and the output is an Integer. We will see increasingly complex functions with all sorts of different inputs and outputs.

@@ -140,11 +140,11 @@ If you enter this in the Console, you will see that in the Variable Environment,
>
> Bind variable to the left of `=` to the resulting value.
>
> The variable is stored in the Variable Environment.
> The variable is stored in the **Variable Environment**.
The Variable Environment is where all the variables are stored, and a variable can be used in an expression anytime once it is defined. Each variable name refers to only one value at a time.

The variable is stored in the working memory of your computer, Random Access Memory (RAM). This is temporary memory storage on the computer that can be accessed quickly. Typically a personal computer has 8, 16, or 32 gigabytes of RAM. When we work with large datasets, if you assign a variable a value larger than the available RAM, it will not work. More on this later.
The variable is stored in the working memory of your computer, Random Access Memory (RAM). This is temporary memory storage on the computer that can be accessed quickly. Typically a personal computer has 8, 16, or 32 gigabytes of RAM.

Look, now `x` can be reused downstream:

@@ -159,28 +159,30 @@ It is quite common for programmers to not know what data type a variable is whil
type(y)
```

We should give useful variable names so that we know what to expect! Consider `num_sales` instead of `y`.
We should give useful variable names so that we know what to expect! If you are working with sales data, consider `num_sales` instead of `y`.
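
As a small sketch tying together assignment, reuse, and `type()` (the variable names and values here are invented for illustration and are not part of the course data):

```python
# The expression on the right of = is evaluated first, then bound to the name.
num_sales = 18 + 21

# Once defined, the variable can be reused in later expressions.
doubled_sales = num_sales * 2

# type() reports the data type of the value the variable refers to.
print(type(num_sales))
print(doubled_sales)
```

A descriptive name such as `num_sales` already hints that the value is a count, so a reader expects an Integer before ever calling `type()`.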

## Grammar Structure 3: Evaluation of Functions

Let's look at functions a little bit more formally: A function has a **function name**, **arguments**, and **returns** a data type.

### Execution rule for functions:

> Evaluate the function by its arguments, and if the arguments are functions or contain operations, evaluate those functions or operations first.
> Evaluate the function by its arguments, if there are any, and if the arguments are functions or contain operations, evaluate those functions or operations first.
>
> The output of functions is called the **returned value**.
Often, we will use multiple functions, in a nested way, or use parentheses to change the order of operations. Being able to read nested operations, nested functions, and parentheses is very important. Think about what Python is going to do step-by-step in the line of code below:
Often, we will use multiple functions in a nested way, and it is important to understand how the Python console determines the order of operations. We can also use parentheses to change the order of operations. Think about what Python is going to do step-by-step in the lines of code below:

```{python}
(len("hello") + 4) * 2
max(len("hello"), 4)
(len("pumpkin") - 8) * 2
```
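
One way to check your reasoning is to break a nested expression into its steps by hand. The sketch below does this for the first expression; the intermediate names `step1`, `step2`, and `step3` are made up for illustration:

```python
# Tracing (len("hello") + 4) * 2 one step at a time.
step1 = len("hello")  # innermost function call is evaluated first
step2 = step1 + 4     # then the operation inside the parentheses
step3 = step2 * 2     # finally the outer multiplication

# The step-by-step result matches the one-line expression.
print(step3 == (len("hello") + 4) * 2)
```

The same approach works for `max(len("hello"), 4)`: the argument `len("hello")` is evaluated before `max()` compares the two values.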

If we don't know how to use a function, such as `pow()` we can ask for help:
If we don't know how to use a function, such as `pow()`, we can ask for help:

```
?pow
pow?
pow(base, exp, mod=None)
Equivalent to base**exp with 2 arguments or base**exp % mod with 3 arguments
@@ -211,6 +213,14 @@ And there is an operational equivalent:
2 ** 3
```

We will mostly look at functions with input arguments and return values in this course, but not all functions need to have input arguments or a return value. Here are some varieties of functions to stretch your function horizons.

| Function call | What it takes in | What it does | Returns |
|---------------|---------------|----------------------------|---------------|
| `pow(a, b)` | integer `a`, integer `b` | Raises `a` to the `b`th power. | Integer |
| `print(x)` | any data type `x` | Prints out the value of `x` to the console. | None |
| `datetime.now()` | Nothing | Gets the current date and time. | `datetime` object |
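
The three kinds of functions in the table can be tried out directly. This is a minimal sketch; the names `power`, `printed`, and `now` are made up for illustration. Note that `datetime.now()` lives in the standard-library `datetime` module, so it needs an import, and it returns a `datetime` object rather than plain text:

```python
from datetime import datetime

# Takes two inputs and returns a value.
power = pow(2, 3)

# Takes an input, prints it, and returns None.
printed = print("hello")

# Takes no inputs, but still returns a value.
now = datetime.now()

print(power)
print(printed is None)
print(type(now).__name__)
```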

## Tips on writing your first code

`Computer = powerful + stupid`
131 changes: 124 additions & 7 deletions 02-data-structures.Rmd
@@ -8,7 +8,7 @@ In our second lesson, we start to look at two **data structures**, **Lists** and

## Lists

In the first exercise, you started to explore **data structures**, which store information about data types. You played around with **lists**, which are ordered collections of data types and data structures. Each *element* of a list contains a data type or another data structure, and there is no limit on how big a list can be.
In the first exercise, you started to explore **data structures**, which store information about data types. You explored **lists**, which are ordered collections of data types or data structures. Each *element* of a list contains a data type or another data structure.

We can now store a vast amount of information in a list, and assign it to a single variable. Even more, we can use operations and functions on a list, modifying many elements within the list at once! This makes analyzing data much more scalable and less repetitive.

@@ -38,13 +38,19 @@ With subsetting, you can modify elements of a list or use the element of a list

### Subsetting multiple elements of lists

Suppose you want to access *multiple* elements of a list, such as accessing the first three elements of `chrNum`. You would use the slice operator, which specifies the index number to start and the index of the item to stop at *without including it in the slice.*
Suppose you want to access multiple elements of a list, such as accessing the first three elements of `chrNum`. You would use the **slice** operator, which specifies:

- the index number to start

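- the index number of the last element you want, *plus one* (the element at the stop index itself is not included in the slice).

To access the first three elements of `chrNum`, note that the first element's index number is 0 and the third element's index number is 2; adding 1 to the latter gives a stop index of 3: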

```{python}
chrNum[0:3]
```

If you want to access the second and third element of `chrNum`:
If you want to access the second and third elements of `chrNum`:

```{python}
chrNum[1:3]
@@ -58,7 +64,7 @@ chrNum[3:len(chrNum)]

where `len(chrNum)` is the length of the list.
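
To see the start and stop rules on concrete values, here is a minimal sketch using a made-up list, `example_list`, standing in for `chrNum` (the values are invented for illustration):

```python
# Hypothetical five-element list; indices run from 0 to 4.
example_list = [2, 3, 1, 2, 2]

# Start at index 0, stop before index 3: the first three elements.
first_three = example_list[0:3]

# Start at index 1, stop before index 3: the second and third elements.
second_and_third = example_list[1:3]

# Start at index 3 and run through the end of the list.
from_fourth_on = example_list[3:len(example_list)]

print(first_three)
print(second_and_third)
print(from_fourth_on)
```

In each case the number of elements returned equals the stop index minus the start index.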

When the start or stop index is missing, it implies that you are subsetting starting from the beginning of the list or subsetting to the end of the list, respectively:
When the start or stop index is omitted, it implies that you are subsetting starting from the beginning of the list or subsetting to the end of the list, respectively:

```{python}
chrNum[:3]
@@ -69,8 +75,119 @@ More discussion of list slicing can be found [here](https://stackoverflow.com/qu

## Objects in Python

Object functions, object properties
The list data structure has an organization and functionality that metaphorically represents a pen-and-paper list in our physical world. Like a physical object, we have examined:

- What does it contain (in terms of data)?

- What can it do (in terms of operations and functions)?

And if it "makes sense" to us, then it is well-designed.

The list data structure we have been working with is an example of an **Object**. The definition of an object allows us to ask the questions above: what does it contain, and what can it do. It is an organizational tool for a collection of data and functions that we can relate to. Formally, an object contains the following:

- **Value** that holds the essential data for the object.

- **Attributes** that store additional data for the object.

- Functions called **Methods** that can be used on the object.

This organizing structure on an object applies to pretty much all Python data types and data structures.

Let's see how this applies to the list:

- Value: the contents of the list, such as `[2, 3, 4].`

- **Attributes** that store additional values: Not relevant for lists.

- **Methods** that can be used on the object: `chrNum.count(2)` counts the number of instances 2 appears as an element of `chrNum`.

Object methods are functions that do something with the object they are called on. You should think about `chrNum.count(2)` as a function that takes in `chrNum` and `2` as inputs. If you want to use the count function on list `mixedList`, you would use `mixedList.count(x)`.

| Function method | What it takes in | What it does | Returns |
|----------------|----------------|-------------------------------------|------------------|
| `chrNum.count(x)` | list `chrNum`, data type `x` | Counts the number of times `x` appears as an element of `chrNum`. | Integer |
| `chrNum.append(x)` | list `chrNum`, data type `x` | Appends `x` to the end of `chrNum`. | None (but `chrNum` is modified!) |
| `chrNum.sort()` | list `chrNum` | Sorts `chrNum` in ascending order. | None (but `chrNum` is modified!) |
| `chrNum.reverse()` | list `chrNum` | Reverses the order of `chrNum`. | None (but `chrNum` is modified!) |
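
The difference between a method that returns a value and one that modifies the list in place is easy to miss. The sketch below uses a made-up list, `example_list`, in place of `chrNum` (values invented for illustration):

```python
# Hypothetical list standing in for chrNum.
example_list = [2, 3, 1, 2, 2]

# count() returns an Integer and leaves the list unchanged.
n_twos = example_list.count(2)

# append() returns None; the list itself gains an element.
append_result = example_list.append(5)

# sort() and reverse() also return None and change the list in place.
example_list.sort()
sorted_copy = list(example_list)  # snapshot of the list after sorting
example_list.reverse()

print(n_twos)
print(append_result)
print(sorted_copy)
print(example_list)
```

Because these methods return None, writing something like `example_list = example_list.sort()` would replace the list with None, which is a common early mistake.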

## Dataframes

A Dataframe is a two-dimensional data structure that stores data like a spreadsheet does.

The Dataframe data structure is found within a Python module called "Pandas". A Python module is an organized collection of functions and data structures. The `import` statement below gives us permission to access the "Pandas" module via the variable `pd`.

To load in a Dataframe from existing spreadsheet data, we use the function `pd.read_csv()`:

```{python}
import pandas as pd
metadata = pd.read_csv("classroom_data/metadata.csv")
type(metadata)
```

There is a similar function `pd.read_excel()` for loading in Excel spreadsheets.

Let's investigate the Dataframe as an object:

- What does a Dataframe contain (in terms of data)?

- What can a Dataframe do (in terms of operations and functions)?

### What does a Dataframe contain (in terms of data)?

We first take a look at the contents:

```{python}
metadata
```

It looks like there are 1864 rows and 30 columns in this Dataframe, and displaying it shows only some of the data.

We can look at specific columns by looking at **attributes** via the dot operation. We can also look at the columns via the bracket operation.

```{python}
metadata.ModelID
metadata['ModelID']
```

The names of all columns are stored as an attribute, which can be accessed via the dot operation.

```{python}
metadata.columns
```

The numbers of rows and columns are also stored together as a single attribute:

```{python}
metadata.shape
```
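
If you do not have `metadata.csv` at hand, the same attributes can be explored on a tiny Dataframe built by hand. This is a sketch under stated assumptions: `toy` and its values are made up for illustration, and only the `ModelID` column name is borrowed from the example above.

```python
import pandas as pd

# Hypothetical toy Dataframe with 3 rows and 2 columns.
toy = pd.DataFrame({
    "ModelID": ["A-1", "A-2", "A-3"],
    "Age": [60, 36, 72],
})

# shape is a pair: (number of rows, number of columns).
print(toy.shape)

# columns holds the column names.
print(list(toy.columns))

# Dot access and bracket access refer to the same column.
print(toy.ModelID.equals(toy["ModelID"]))
```

Bracket access works for any column name, while dot access only works when the name is a valid Python identifier and does not clash with an existing attribute or method.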

### What can a Dataframe do (in terms of operations and functions)?

We can use the `head()` and `tail()` functions to look at the first few rows and last few rows of `metadata`, respectively:

```{python}
metadata.head()
metadata.tail()
```

Both of these functions (without input arguments) are considered **methods**: they are functions that do something with the Dataframe they are called on. You should think about `metadata.head()` as a function that takes in `metadata` as an input. If you had another Dataframe called `my_data` and wanted to use the same function, you would write `my_data.head()`.

#### Subsetting Dataframes

Perhaps the most important operation you can do with Dataframes is subsetting them. There are two ways to do it. The first way is to subset by numerical indices, exactly as we did for lists. You will use the `iloc` attribute and bracket operations, and you give two slices: one for the rows, and one for the columns.

Subset the first 5 rows, and first two columns:

```{python}
metadata.iloc[:5, :2]
```

If we want a custom selection that is not sequential, we can use a list of integers. Subset the last 5 rows, and the columns at index positions 1, 10, and 21 (that is, the 2nd, 11th, and 22nd columns, since indexing starts at 0):

```{python}
metadata.iloc[-5:, [1, 10, 21]]
```

Pandas Dataframes
This is a great way to start thinking about subsetting your dataframes for analysis, but this way of subsetting can lead to some inconsistencies in the long run. For instance, suppose your collaborator added a new cell line to the metadata and changed the order of the columns. Then your code to subset the last 5 rows and those columns will give you a different answer once the spreadsheet is changed.
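
The fragility described above can be demonstrated on a small example. This is a sketch under stated assumptions: the Dataframe `toy` and its values are invented for illustration, and the column reordering is simulated with `iloc` rather than by editing a real spreadsheet.

```python
import pandas as pd

# Hypothetical toy Dataframe standing in for the metadata.
toy = pd.DataFrame({
    "ModelID": ["A-1", "A-2", "A-3"],
    "Age": [60, 36, 72],
    "Sex": ["Female", "Male", "Female"],
})

# Last 2 rows, columns at positions 0 and 1.
before = toy.iloc[-2:, [0, 1]]

# Simulate a collaborator reordering the columns.
reordered = toy.iloc[:, [1, 2, 0]]

# The exact same positional code now picks different columns.
after = reordered.iloc[-2:, [0, 1]]

print(list(before.columns))
print(list(after.columns))
```

The code did not change between the two calls, only the column order did, which is why positional subsetting is best kept for quick exploration.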

Subsetting Dataframes
The second way is to subset by the column name, and this is much more preferred in data analysis practice. You will learn about it next week!
