Skip to content

BVBRC_GenomeDataDownloader 是一個 Python 腳本,用於從 BV-BRC 的 FTP 網站上批量下載和處理不同病症的基因組資料,並將它們轉換為 Roary 的輸入格式。

Notifications You must be signed in to change notification settings

CNH3148/BVBRC_GenomeDataDownloader

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 

Repository files navigation

BVBRC_GenomeDataDownloader

A. 簡介

「BVBRC_GenomeDataDownloader」是我替這個 Python script 所取的名字,因為最初的目的就是從 BV-BRC 的 FTP 網站上下載大量基因組資料,而這始終是這份腳本的主要功能。而在後來的編寫過程中,為了能將下載的資料以合適的格式輸入 Roary,新增了一些功能:比對兩個資料夾中的內容,以固定規則尋找並修改文件內容,對應檔名的檔案合併,檔案的移動刪除,以及回報下載錯誤等。

由於著重批量處理檔案的下載、讀寫、移動,腳本中透過 os 或 subprocess 套件呼叫的一些外部指令,是基於 Ubuntu 的 Bash shell 來撰寫的,如果要在 Windows 執行的話需要修改一下。

我是在 jupyter notebook 上編輯的,所以在腳本中會發現許多 # %% (用來標示 .ipynb 中的 cells ),那是把檔案轉成 .py 時遺留下來的記號。我沒有加上 shebang,主要是曾經發生過 shebang 和設想的位置不同,結果不能執行的問題;而且搞不好會有人想在特殊的環境下執行 (我自己就是在 conda 的環境中測試的)。

這是我第一次寫 Python script,完成它的過程讓我感到非常高興,也想要把它分享給有需要的人。我在腳本中寫了許多註解,既是為了自己不要忘記代碼的意義,也是為了讓其他人能夠更容易理解。我知道這個腳本還有很多可以改進的地方,例如效率、錯誤處理、跨平台適用性等,但我目前的專業知識還不足以做到這些,只能說它能夠勉強達到我想要的功能。其實我也曾經想過用 Bash 指令來實現同樣的目的,但我對 Bash 不太熟悉,而且我也很好奇 Python 的能力,所以我最後還是選擇用 Python 來寫這個腳本,雖然過程中遇到了不少困難,但也學到了不少東西。

B. 流程概述

這個 Python 腳本的目的是從 FTP 網址下載和處理 genome 數據。它的主要步驟如下:

  1. 導入 csv、os、re 和 subprocess 模組。
  2. 要求用戶輸入一個 CSV 檔案的絕對路徑,該檔案包含了不同病症的名稱和相關的 genome IDs。
  3. 讀取 CSV 檔案的第一行,獲取所有病症的名稱。
  4. 定義 FTP 網址和要下載的檔案格式(gff 和 fna)。
  5. 對於每一種病症,執行以下操作:
    • 創建一個以病症名稱命名的資料夾,並切換到該資料夾。
    • 讀取 CSV 檔案中該病症對應的所有 genome IDs,並將它們寫入到一個文本檔案中。
    • 創建兩個子資料夾,分別用於存放 gff 和 fna 檔案。
    • 使用 wget 命令從 FTP 網址下載每一個 genome ID 的 gff 和 fna 檔案,並將它們保存到相應的子資料夾中。
    • 如果下載過程中有任何錯誤,則將出錯的 genome ID 寫入到另一個文本檔案中。
    • 確認並取得成功下載的所有 genome IDs。
    • 對於每一個成功下載的 genome ID,將其 gff 和 fna 檔案合併,並將合併後的檔案移動到病症資料夾中。
    • 刪除 gff 和 fna 子資料夾。
    • 切換到上一層目錄,準備處理下一種病症。

請注意,這個腳本需要在能夠執行 wget 命令的環境中運行,並且需要有讀取和寫入檔案的權限。

C. 測試執行

我準備了一份大小適當的測試用 genome_id_list.csv,格式如下;第一行 (first row) 是各個分類的標題 (heading),標題下方即屬於該分類的基因組 ID (PATRIC ID)。

asymptomatic,bacteremia,UTI,sepsis
573.31029,573.14940,1263871.10,573.12478
573.31030,573.14941,573.12456,573.12681
573.31031,573.14942,573.12482,573.12682
573.32133,573.14949,573.12763,573.12688
573.32134,573.14950,333333333,573.12689
         ,222222222,573.12764,
111111111,444444444,555555555,
573.56218,573.57234,72407.2108,573.56350
72407.2099,573.57332,573.56491,573.57262

建議將我的腳本和這個 CSV 檔放在同一個目錄底下執行。上方範例中的 111111111 ~ 555555555 是偽 ID,用來測試無法下載檔案時腳本的反應;同樣的道理,我也加入了兩行「空 ID」在末兩行,他們是真實的 ID,但會從 FTP site 下載到一個空的檔案。另外,我的腳本用到了幾個預設模組,一般來說應該是不用額外安裝:csv, os, re, subprocess

執行腳本之後,將會為每個標題 (heading) 建立一個資料夾,所有指定下載的文件將會分類放在該目錄底下;如果完整執行整個腳本的話,在每個以標題為名的資料夾中將會包含:(1) 該分類所有符合 Roary 輸入格式的 .gff 文件 (2) genome_id_list.csv 中該分類所有的 id_list.txt (3) 如果有下載錯誤,所產生的 error_id_list.txt

自訂操作

如果單純想要從 BV-BRC 批量下載檔案的話,可以把腳本的 (3)、(4-1)、(4-2) 部分註解掉。

取得檔案的 Bash 指令及 FTP 網址格式為:

wget ftp://anonymous@ftp.bvbrc.org/genomes/<573.12688>/<573.12688>.<RefSeq.frn>

其中 <573.12688> 代表基因組 ID,可自行替換成要下載的 ID;<RefSeq.frn> 則用來指定要下載的檔案格式。根據我用 FileZilla 檢視的結果,每一個 genome ID 可能有以下幾種檔案類型可供下載:

  1. .fna (nucleic acid contigs;定序組裝的結果)
  2. .PATRIC.faa (轉譯後的氨基酸序列)
  3. .PATRIC.features.tab
  4. .PATRIC.ffn (蛋白質的核酸序列)
  5. .PATRIC.frn (rRNA 的核酸序列)
  6. .PATRIC.gff (不包含序列資料的核酸註解)
  7. .PATRIC.pathway.tab
  8. .PATRIC.spgene.tab
  9. .PATRIC.subsystem.tab
  10. .RefSeq.faa
  11. .RefSeq.features.tab
  12. .RefSeq.ffn
  13. .RefSeq.frn
  14. .RefSeq.gff

上方副檔名後的括號標註是我自己的理解,可能有誤請小心使用。

D. 待解決問題 / 問題修復

正如上述,這個腳本不過還算堪用,有一些明顯的問題可能被陸續發現:

1. FTP site 存在「空」檔案

目前的程式碼針對「問題檔案」,只處理下載失敗的話會產生一份文件來告訴使用者哪些 ID 沒有成功下載資料。但是我在後續的分析中發現,有些時候會發生一種問題,使的檔案內容只有這樣:

##gff-version 3
#Genome: 72407.2102|Klebsiella pneumoniae subsp. pneumoniae CS00078
#Date:08/11/2023

##FASTA

問題源自雖然能夠從 FTP site 下載到特定檔案,但實際上該文件內容是空的,只包含一些 metadata,沒有任何基因體註解也沒有序列資料。而我的腳本目前沒有針對這個問題,去把這些「空殼文件」找出來。

Feb. 24th, 2024 (fixed)

針對上述問題,我先在腳本的第 (2) 部分新增了一個段落,檢查若檔案的內容不足 6 行,就視為無效檔案,將其刪除;並將 genome ID 寫入 empty-ids.txt

原因是根據我的觀察,引起這類問題的檔案特性相當一致:GFF 檔案只有前三行 META DATA,沒有任何 feature 的資訊;FNA 檔案中完全沒有任何內容。而一般正常的檔案,似乎都是幾百行幾千行起跳的。

2. 追蹤檔案下載的狀況

雖然我有紀錄下載錯誤的 IDs,以及空檔案的 IDs 到一個特定的文件中。但是它們分散在各個 Disease 的資料夾中,並且沒有生成一個報表呈現整體輸入了多少 IDs,又有多少 IDs 發生問題,以及問題的類型。

換句話說,我需要讓使用者更容易理解並追蹤整體檔案下載的狀況。

Feb. 24th, 2024 (fixed)

新增了腳本的第 (5) 段,這段程式碼主要在處理一些檔案操作和數據統計。以下是詳細的步驟:

  1. 使用 subprocess.run 呼叫系統命令 mv 來移動特定的檔案。這些檔案包含了在下載 GFF 和 FNA 檔案時發生錯誤或產生空檔案的 ID 清單。

  2. 使用 open 函數開啟這些已移動的檔案,並使用 readlines 方法讀取所有行。然後,利用 len 函數計算出檔案中的行數,這個數字也就是下載失敗或產生空檔案的 ID 數量。

  3. 最後,將這些統計數據寫入一個名為 statistics.csv 的 CSV 檔案中。這個檔案會記錄每種疾病的下載狀況,包括查詢 ID 的數量、成功合併的 ID 數量、GFF 和 FNA 下載失敗的 ID 數量,以及 GFF 和 FNA 空檔案的 ID 數量。

請注意,這段程式碼中的 diseaserstrip_idsids_ready_to_merge 變數並未在第 (5) 段中定義,它們是在程式碼的其他部分被定義的。disease 代表的是疾病的名稱,rstrip_ids 是所有需要查詢的 ID,而 ids_ready_to_merge 是成功合併的 ID。


English introduction generated by GPT

Introduction

"BVBRC_GenomeDataDownloader" is the name I gave to this Python script, because the original purpose was to download a large amount of genome data from the BV-BRC FTP site, and this is still the main function of this script. In the later writing process, I added some features to be able to input the downloaded data in a suitable format for Roary, such as comparing the contents of two folders, finding and modifying the file content with a fixed rule, merging files with corresponding names, moving and deleting files, and reporting download errors.

Since it focuses on batch processing of file downloading, reading, writing, and moving, some external commands called by the os or subprocess packages in the script are written based on the Ubuntu Bash shell, and need to be modified if you want to run it on Windows.

I edited it on jupyter notebook, so you will find many # %% (used to mark cells in .ipynb) in the script, which are the symbols left over when the file is converted to .py. I didn't add shebang, mainly because I had encountered the problem that shebang and the expected location were different, and the result could not be executed; and maybe some people would want to run it in a special environment (I tested it in the conda environment myself).

This is the first time I wrote a Python script, and I was very happy to complete it, and I also wanted to share it with those who need it. I wrote a lot of comments in the script, both to prevent myself from forgetting the meaning of the code, and to make it easier for others to understand. I know that there are still many areas for improvement in this script, such as efficiency, error handling, cross-platform applicability, etc., but my current professional knowledge is not enough to do these, I can only say that it can barely achieve what I want. In fact, I also thought about using Bash commands to achieve the same purpose, but I am not very familiar with Bash, and I am also curious about Python's capabilities, so I finally chose to write this script in Python, although I encountered a lot of difficulties in the process, but I also learned a lot.

Process Overview

The purpose of this Python script is to download and process genome data from the FTP address. The main steps are as follows:

  1. Import csv, os, re and subprocess modules.
  2. Ask the user to enter an absolute path of a CSV file, which contains the names of different diseases and related genome IDs.
  3. Read the first line of the CSV file and get the names of all diseases.
  4. Define the FTP address and the file format to download (gff and fna).
  5. For each disease, perform the following operations:
    • Create a folder named after the disease name and switch to that folder.
    • Read all the genome IDs corresponding to the disease from the CSV file and write them to a text file.
    • Create two subfolders, one for storing gff and fna files respectively.
    • Use the wget command to download the gff and fna files of each genome ID from the FTP address and save them to the corresponding subfolders.
    • If there are any errors during the download process, write the erroneous genome ID to another text file.
    • Confirm and obtain all the successfully downloaded genome IDs.
    • For each successfully downloaded genome ID, merge its gff and fna files and move the merged file to the disease folder.
    • Delete the gff and fna subfolders.
    • Switch to the previous directory and prepare to process the next disease.

Please note that this script needs to run in an environment that can execute the wget command, and needs to have the permission to read and write files.

Test Execution

I prepared a test genome_id_list.csv of appropriate size, the format is as follows; the first row is the heading of each category, and the genome ID (PATRIC ID) belonging to that category is below the heading.

asymptomatic,bacteremia,UTI,sepsis
573.31029,573.14940,1263871.10,573.12478
573.31030,573.14941,573.12456,573.12681
573.31031,573.14942,573.12482,573.12682
573.32133,573.14949,573.12763,573.12688
573.32134,573.14950,333333333,573.12689
         ,222222222,573.12764,
111111111,444444444,555555555,

It is recommended to put my script and this CSV file in the same directory for execution. The 111111111 ~ 555555555 in the above example are fake IDs, used to test the script's response when files cannot be downloaded. (My script uses a few default modules, which should not require additional installation: csv, os, re, subprocess)

After executing the script, a folder will be created for each heading, and all the specified downloaded files will be classified and placed under that directory; if the entire script is executed completely, each folder named after the heading will contain: (1) All .gff files in that category that meet the Roary input format (2) id_list.txt of all categories in genome_id_list.csv (3) error_id_list.txt generated if there are download errors.

Custom Operations

If you just want to download files from BV-BRC in batches, you can comment out parts (3), (4-1), and (4-2) of the script.

The Bash command and FTP address format for obtaining the file are:

wget ftp://anonymous@ftp.bvbrc.org/genomes/<573.12688>/<573.12688>.<RefSeq.frn>

Where <573.12688> represents the genome ID, which can be replaced with the ID to be downloaded; <RefSeq.frn> is used to specify the file format to download. According to the results I checked with FileZilla, each genome ID may have the following types of files available for download:

  1. .fna (nucleic acid contigs; the result of sequencing assembly)
  2. .PATRIC.faa (translated amino acid sequence)
  3. .PATRIC.features.tab
  4. .PATRIC.ffn (nucleic acid sequence of proteins)
  5. .PATRIC.frn (nucleic acid sequence of rRNA)
  6. .PATRIC.gff (nucleic acid annotation without sequence data)
  7. .PATRIC.pathway.tab
  8. .PATRIC.spgene.tab
  9. .PATRIC.subsystem.tab
  10. .RefSeq.faa
  11. .RefSeq.features.tab
  12. .RefSeq.ffn
  13. .RefSeq.frn
  14. .RefSeq.gff

The parentheses after the suffix are my own understanding, which may be wrong, please use with caution.

Pending Issues

As mentioned above, this script is barely usable, and some obvious problems may be discovered one after another:

  1. The current code only deals with “problem files” by generating a file to tell the user which IDs have not successfully downloaded the data. But in the subsequent analysis, I found that sometimes a problem would occur: that is, although a specific file can be downloaded from the FTP site, the actual content of the file is empty, only containing some metadata, without any genome annotation or sequence data. And my script does not currently find these “empty shell files”.

About

BVBRC_GenomeDataDownloader 是一個 Python 腳本,用於從 BV-BRC 的 FTP 網站上批量下載和處理不同病症的基因組資料,並將它們轉換為 Roary 的輸入格式。

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages