|
16 | 16 | "source": [
|
17 | 17 | "## Learning Objectives\n",
|
18 | 18 | "In this lesson, you will:\n",
|
19 |
| - "<br>\n", |
20 | 19 | "* Learn the structure of Python functions\n",
|
21 | 20 | "* Call a function\n",
|
22 | 21 | "* Utilize bioinformatics functions from Biopython\n",
|
23 | 22 | "\n",
|
24 | 23 | "## Prerequisites\n",
|
25 | 24 | "\n",
|
26 |
| - "- Some experience in Python or\n", |
27 |
| - "- Tutorials 1, 2, and 3\n", |
| 25 | + "- Submodule 1 - Tutorial 1: Python Overview\n", |
| 26 | + "- Submodule 1 - Tutorial 2: Variables\n", |
| 27 | + "- Submodule 1 - Tutorial 3: Data Structures\n", |
28 | 28 | "\n",
|
29 | 29 | "## Getting Started\n",
|
30 | 30 | "Run the code box below to import the required libraries"
|
|
51 | 51 | "source": [
|
52 | 52 | "## Functions\n",
|
53 | 53 | "\n",
|
54 |
| - "In Python, a **function** is a block of reusable code that performs a specific task. It takes input, processes it, and optionally returns a value. It only runs when you \"call\" it \n", |
| 54 | + "In Python, a **function** is a block of reusable code that performs a specific task. It takes input, processes it, and optionally returns a value. It only runs when you \"call\" it.\n", |
55 | 55 | "\n",
|
56 |
| - "Lets try to use a very common bioinformatics task to illustrate the strcture of FUNCTIONS in python. \n", |
57 |
| - "<br>\n", |
58 |
| - "They are created with the key word def (you are DEFining the function). The function may be named any way you want and is logical to you.\n", |
59 |
| - "<br>\n", |
60 |
| - "Parentheses surround the variables that will be provided by the user of the function. Within the function, many different jobs can be done, including calculations and perhaps returning a value to you, such as you've already seen with the len(string) function that returns to the console the length of the string \"sent\" to it in the parentheses.\n", |
61 |
| - "<br>\n", |
62 |
| - "In the Python code box below we define a function, \"Count_base\", which needs two pieces of information: some kind of sequence and the base to be counted in the sequence list. Calling that function will give back (return) a number that represents the count of the base you ask for in the sequence you provide. (count is a built-in function in Python that can be used with strings).\n", |
| 56 | + "Let’s try using a common bioinformatics task to illustrate the structure of **functions** in Python.\n", |
| 57 | + "\n", |
| 58 | + "They are created with the keyword `def` (you are **defining** the function). The function may be named in any way that makes logical sense to you.\n", |
| 59 | + "\n", |
| 60 | + "Parentheses surround the variables that will be provided by the user of the function. Within the function, many different tasks can be performed, including calculations, and possibly returning a value—such as what you've already seen with the `len(string)` function, which returns the length of the string passed to it.\n", |
| 61 | + "\n", |
| 62 | + "In the Python code box below, we define a function called `count_base`, which needs two pieces of information: a sequence and the base to be counted in that sequence. Calling this function will return a number representing how many times that base appears in the sequence you provide. (`count` is a built-in Python string method.)\n", |
63 | 63 | "\n",
|
64 |
| - "The last line uses the keyword 'return' to tell Python to print to the console the \"result\" of the tasks that it runs.\n", |
| 64 | + "The last line uses the keyword `return` to hand the result of the function’s operation back to the caller; the notebook displays that value when the call is the last expression in a code cell.\n", |
65 | 65 | "\n",
|
66 |
| - "<div class=\"alert alert-block alert-info\"> <b>Tip:</b> Try changing the base or making the \"base\" multiple letters (e.g, \"aaa\") and running the Python code box again.</a>. </div>" |
| 66 | + "<div class=\"alert alert-block alert-info\"> <b>Tip:</b> Try changing the base or making the \"base\" multiple letters (e.g., \"aaa\") and running the Python code box again.</div>\n" |
67 | 67 | ]
|
68 | 68 | },
|
69 | 69 | {
|
70 | 70 | "cell_type": "code",
|
71 |
| - "execution_count": null, |
| 71 | + "execution_count": 1, |
72 | 72 | "id": "ec61f862-6e53-4c16-af46-10fd39b76baa",
|
73 |
| - "metadata": {}, |
74 |
| - "outputs": [], |
| 73 | + "metadata": { |
| 74 | + "execution": { |
| 75 | + "iopub.execute_input": "2025-06-09T16:46:54.120437Z", |
| 76 | + "iopub.status.busy": "2025-06-09T16:46:54.120181Z", |
| 77 | + "iopub.status.idle": "2025-06-09T16:46:54.127575Z", |
| 78 | + "shell.execute_reply": "2025-06-09T16:46:54.126993Z", |
| 79 | + "shell.execute_reply.started": "2025-06-09T16:46:54.120417Z" |
| 80 | + } |
| 81 | + }, |
| 82 | + "outputs": [ |
| 83 | + { |
| 84 | + "data": { |
| 85 | + "text/plain": [ |
| 86 | + "17" |
| 87 | + ] |
| 88 | + }, |
| 89 | + "execution_count": 1, |
| 90 | + "metadata": {}, |
| 91 | + "output_type": "execute_result" |
| 92 | + } |
| 93 | + ], |
75 | 94 | "source": [
|
76 | 95 | "def count_base(dna, base): #the function is named count_base and takes 2 inputs- a sequence string and the letter to look for\n",
|
77 | 96 | " return dna.count(base)\n",
|
|
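The difference between returning a value and printing it can be sketched as follows. This is a minimal illustration, assuming a short made-up sequence (`seq` is hypothetical, not the sequence used in the notebook cell):

```python
def count_base(dna, base):
    # str.count returns the number of non-overlapping occurrences of base in dna
    return dna.count(base)

seq = "atggcgtagg"  # hypothetical example sequence, not the notebook's data

# The returned value is handed back to the caller and can be stored;
# nothing is printed at this point.
g_count = count_base(seq, "g")

# Printing is a separate, explicit step.
print(g_count)
print(f"In seq, g={g_count}")
```

Because the function returns a number rather than printing it, the result can be reused in later calculations, which is what the tip about returning "g=17" builds on.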
85 | 104 | "id": "dbefcfe0-4bc9-4d9e-bb4d-02e8831e2525",
|
86 | 105 | "metadata": {},
|
87 | 106 | "source": [
|
88 |
| - "<div class=\"alert alert-block alert-info\"> <b>Tip:</b> Try this: Instead of just returning the number, edit the function so that it returns \"g=17\" or \"In seq, g=17\"</a>. </div>" |
| 107 | + "<div class=\"alert alert-block alert-info\"> <b>Tip:</b> Instead of just returning the number, edit the function so that it returns \"g=17\" or \"In seq, g=17\".</div>" |
89 | 108 | ]
|
90 | 109 | },
|
91 | 110 | {
|
|
147 | 166 | "id": "66a5efbb-8c5d-4fd8-a273-d247d4ad99bc",
|
148 | 167 | "metadata": {},
|
149 | 168 | "source": [
|
150 |
| - "Functions can also call another function, though to use these routinely you will need to learn to save these. (see ___ in tutorial). For now, lets write another function that calls our count_base function to calculate the GC %" |
| 169 | + "Functions can also call other functions, though to reuse them routinely you will need to learn how to save them. For now, let's write another function that calls our `count_base` function to calculate the GC%." |
151 | 170 | ]
|
152 | 171 | },
|
153 | 172 | {
|
|
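One way a function can call another to compute GC% is sketched below. The name `gc_percent` and the handling of empty or uppercase input are assumptions for illustration; the notebook's own cell may be written differently.

```python
def count_base(dna, base):
    # Count occurrences of a single base (or any substring) in the sequence
    return dna.count(base)

def gc_percent(dna):
    # Hypothetical helper: percentage of bases that are g or c
    dna = dna.lower()          # treat "G" and "g" the same
    if not dna:                # avoid dividing by zero on an empty sequence
        return 0.0
    gc = count_base(dna, "g") + count_base(dna, "c")
    return 100 * gc / len(dna)

print(gc_percent("atgc"))
```

Reusing `count_base` inside `gc_percent` keeps the counting logic in one place, so a change to how bases are counted only has to be made once.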
178 | 197 | "id": "9a75092f-c6c4-4835-9c1b-2ef56344d809",
|
179 | 198 | "metadata": {},
|
180 | 199 | "source": [
|
181 |
| - "**Can you write your own tool that calculates the percentage of the time of all guanines that are found as the pairing AG?**" |
| 200 | + "Can you write your own tool that calculates what percentage of all guanines occur in the pair AG?" |
182 | 201 | ]
|
183 | 202 | },
|
184 | 203 | {
|
|
194 | 213 | "id": "b9065064-ed5d-4090-9312-278c0c40ab82",
|
195 | 214 | "metadata": {},
|
196 | 215 | "source": [
|
197 |
| - "There are many other ways that we might want to manipulate, align, and evaluate bioinformatic sets (e.g., FASTA sequences, both DNA and protein.) Fortunately, many of these standard functions have already been written and are freely available to everyone in Biopython: \"A set of python tools for computational molecular biology.\" (biopython.org) \n", |
| 216 | + "There are many ways we might want to manipulate, align, and evaluate bioinformatic data sets—such as FASTA sequences, both DNA and protein. Fortunately, many standard functions for these tasks have already been written and are freely available through **Biopython**: *“A set of Python tools for computational molecular biology.”* (biopython.org) \n", |
198 | 217 | "<br>\n",
|
199 |
| - "We will start here using the tools developed for \"sequence input and output\" (SeqIO). \n", |
| 218 | + "We will begin by using tools developed for **sequence input and output** (`SeqIO`). \n", |
200 | 219 | "<br>\n",
|
201 |
| - "We import that set of functions and tools (\"objects\") from the whole Biopython toolset with the following syntax: \n", |
| 220 | + "We import that specific set of functions and tools (also called \"objects\") from the full Biopython toolkit using the following syntax: \n", |
202 | 221 | "<br>\n",
|
203 |
| - "from Bio import SeqIO\n", |
| 222 | + "`from Bio import SeqIO` \n", |
204 | 223 | "<br>\n",
|
205 |
| - "Here we'll use that to look at a provided file (glut_human.fasta) that contains 4 different protein FASTA sequences, something that would be rather challenging for a novice Python programmer without the BioPython tools." |
| 224 | + "We’ll use this to examine a provided file, **`glut_human.fasta`**, which contains four different protein FASTA sequences. Analyzing a file like this manually would be quite challenging for a novice Python programmer—Biopython makes it much easier.\n" |
206 | 225 | ]
|
207 | 226 | },
|
208 | 227 | {
|
|
226 | 245 | "id": "77555b0a-dcfc-448a-b30d-47e644c19c31",
|
227 | 246 | "metadata": {},
|
228 | 247 | "source": [
|
229 |
| - "You should see the 4 different protein identifiers in the file-- in this case with PDB ID numbers. \n", |
| 248 | + "You should see the 4 different protein identifiers in the file, in this case with PDB ID numbers.  \n", |
230 | 249 | "\n",
|
231 | 250 | "There is a lot of information besides just the ID in each of these records, but it is not yet convenient to access the pieces. However, we can load all of that information into a single variable, here called `record_glut`, which is a Python list. "
|
232 | 251 | ]
|
|
288 | 307 | "metadata": {},
|
289 | 308 | "source": [
|
290 | 309 | "We wrote a function above (count_base) that can now come in handy to determine how many of a given amino acid are present in that sequence. Although we conceived of it as a nucleotide counter, the mini program accepts whatever information we submit to it. \n",
|
291 |
| - "\n", |
| 310 | + "```\n", |
292 | 311 | "def count_base(dna, base):\n",
|
293 | 312 | " return dna.count(base)\n",
|
294 |
| - "\n", |
| 313 | + "```\n", |
295 | 314 | "We can send it the FASTA sequence of the GLUT protein and count an amino acid, rather than a base. The function takes any sequence and will count the letter you give it in quotes. This helps us to see how these functions \"think\" about the material you provide to them."
|
296 | 315 | ]
|
297 | 316 | },
|
|
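The same function applied to a protein sequence can be sketched like this. The string `protein` is a short made-up fragment used only for illustration, not a sequence from `glut_human.fasta`:

```python
def count_base(dna, base):
    # The parameter is named dna, but any string with a count method works
    return dna.count(base)

protein = "MKLLVLGLLA"  # hypothetical amino acid string, for illustration only

# Count leucine (one-letter code "L") in the protein string
leucines = count_base(protein, "L")
print(leucines)

# Matching is case-sensitive, so a lowercase "l" is not found here
print(count_base(protein, "l"))
```

The function does not check what kind of sequence it receives; it only counts matching characters, which is why the letter case of the query has to match the case used in the sequence.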
395 | 414 | "id": "9bcdcc97-f4dc-411e-aca7-6ea24ca0c272",
|
396 | 415 | "metadata": {},
|
397 | 416 | "source": [
|
398 |
| - "Fetching records from NCBI using BioPython\r\n", |
399 |
| - "The public databases of bioinformatics data have designed ways to access their extensive files without having to go through the GUI interfaces. We can collect the data to use in our analyses or comparisons in bioinformatics tasks.\r\n", |
400 |
| - "\r\n", |
401 |
| - "In biopython, the modules for doing so are found in Entrez. We must import those as well as the sequence input/output tools (SeqIO) to read and parse these complex files.\r\n", |
402 |
| - "\r\n", |
403 |
| - "The commands below will fetch and parse the specified genbank refseq file for the human insulin receptor 2 protein.\r\n", |
404 |
| - "\r\n", |
405 |
| - "(You can learn more about the different database names and how to use efetch from this book chapter: https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch)\r\n", |
406 |
| - "\r\n", |
407 |
| - "To run, you'll need to give a proper email address in to Entrez\r\n", |
408 |
| - "\r\n", |
409 |
| - "The data is read in (by convention) with the name \"handle\" but any variable name can be used. After reading in the information, one always closes the connection = handle.close()" |
| 417 | + "Fetching Records from NCBI Using Biopython\n", |
| 418 | + "\n", |
| 419 | + "The public databases of bioinformatics data have built-in ways to access their extensive files **programmatically**, without needing to use graphical user interfaces (GUIs). This allows us to efficiently collect data for analysis and comparison in bioinformatics tasks. \n", |
| 420 | + "<br>\n", |
| 421 | + "In Biopython, the modules for accessing these databases are found in **Entrez**. We must import both `Entrez` (for data fetching) and `SeqIO` (for reading and parsing sequence files). \n", |
| 422 | + "<br>\n", |
| 423 | + "The commands below will **fetch and parse** a GenBank RefSeq file for the human *insulin receptor 2* protein. \n", |
| 424 | + "<br>\n", |
| 425 | + "🔗 You can learn more about database names and how to use `efetch` from this [NCBI book chapter](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch). \n", |
| 426 | + "<br>\n", |
| 427 | + "**Note:** To run these commands, you must provide a valid email address to `Entrez`. \n", |
| 428 | + "<br>\n", |
| 429 | + "The data is conventionally read into a variable named `handle`, but any valid variable name can be used. Once the data is read, **you must close the connection** with: \n", |
| 430 | + "`handle.close()`" |
410 | 432 | ]
|
411 | 433 | },
|
412 | 434 | {
|
|
436 | 458 | "id": "9f35c7af-0412-413b-9d3d-636ce7c9eaec",
|
437 | 459 | "metadata": {},
|
438 | 460 | "source": [
|
439 |
| - "Reading this genbank file creates a variable of class SeqRecord (i.e, a \"Sequence record\") which is a bit like a \"list\" but it contains an ID, a sequence, and other identifying information. We can find out what \"things\" are in the file by asking for a directory of all the features.\n", |
| 461 | + "Reading this GenBank file creates a variable of **class `SeqRecord`** (i.e., a *sequence record*), which behaves somewhat like a list—but also includes useful attributes such as an ID, a sequence, and other identifying information. \n", |
| 462 | + "<br>\n", |
| 463 | + "We can explore what components are included in the file by requesting a **directory of all the available attributes** using the `dir()` function.\n", |
440 | 464 | "\n",
|
441 |
| - "We'll look at just a few parts of the sequence record in this less, so this will show just the attributes we are most likely to need. Look at what is in various aspects of the file by writing:\n", |
| 465 | + "In this lesson, we'll focus on just a few key parts of the `SeqRecord`, highlighting the attributes most commonly used in bioinformatics workflows. \n", |
| 466 | + "<br>\n", |
| 467 | + "For example, to access the description field of the record, you can write:\n", |
442 | 468 | "\n",
|
| 469 | + "```python\n", |
443 | 470 | "humInsR2.description\n",
|
444 |
| - "for example. Any of the terms from the directory can be added after the . but only a few end up being useful to us." |
| 471 | + "```\n", |
| 472 | + "You can replace `.description` with any other attribute listed in the output of `dir(humInsR2)`, although only a few are typically useful for common tasks." |
445 | 473 | ]
|
446 | 474 | },
|
447 | 475 | {
|
|