Notes week1 (v49)
Nov 28th, Tue, First Meeting Summary
Project aim: Understanding transcriptional changes in HCC progression.
Initial timeline setting:
- Week49 (Nov 27th - Dec 3rd): Preparation (processing) of the data
  - Identifying replicates
  - Combining transcripts into genes
- Week50 (Dec 4th - Dec 10th): Analysis of the data
  - Differential expression analysis
  - Exploring the result
  - (optional) Generating co-expression network
- Week51 (Dec 11th - Dec 17th): Validation of the results
  - Previous studies
  - Human Protein Atlas
  - Co-expression network
- Week2 (Jan 8th - Jan 10th): Concluding the work
  - Summarizing results
  - Making the poster
Additional notes:
- The replicates are already indicated in the "metadata_patient_code.txt" file, in which H → Healthy, S → Steatosis, N → NASH, C → HCC. Samples with the same NASH_IDENTIFIER are technical replicates sequenced at different read depths, so we will need to sum their data. The last three samples (U001, X001, X002) are of poor quality and should be removed.
- We need to combine the transcripts into identified genes. Example transcript-to-gene mappings can be found in the Ensembl human GTF file.
- Differential expression analysis can be done quickly with DESeq. After that we can compare the results with previous studies, e.g. the TCSBN network database. Depending on the quality of the results, we can make PCA plots or build a network based on our own findings, and analyze the results in our own way (e.g. if we choose the top 100 genes in each of the four conditions, will we see differences in the gene categories and/or co-expression stages?).
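The two preparation steps in the notes above (summing technical replicates and combining transcripts into genes) could be sketched roughly as follows. This is only a minimal illustration on toy data: the sample IDs S1 to S3, the identifiers id_1 and id_2, the transcript and gene names, and the assumption that the metadata has one row per sample with a NASH_IDENTIFIER column are all hypothetical, since the actual file layout is not described here. It also assumes that U001, X001 and X002 appear as column names in the count table.

```python
import pandas as pd

# Hypothetical toy counts: rows are transcripts, columns are sample IDs.
counts = pd.DataFrame(
    {"S1": [10, 0, 5], "S2": [20, 1, 5], "S3": [7, 3, 2], "U001": [1, 1, 1]},
    index=["tx1", "tx2", "tx3"],
)

# Hypothetical metadata: one row per sample (column layout is an assumption).
metadata = pd.DataFrame(
    {
        "sample": ["S1", "S2", "S3", "U001"],
        "NASH_IDENTIFIER": ["id_1", "id_1", "id_2", "id_3"],
    }
).set_index("sample")

# Hypothetical transcript-to-gene mapping, e.g. as extracted from a GTF file.
tx2gene = pd.Series({"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"})

POOR_QUALITY = ["U001", "X001", "X002"]  # samples named in the notes


def collapse_replicates(counts, metadata, drop):
    """Drop unwanted samples, then sum samples sharing a NASH_IDENTIFIER."""
    keep = [s for s in counts.columns if s not in drop]
    groups = metadata.loc[keep, "NASH_IDENTIFIER"]
    # Transpose so samples are rows, group by identifier, sum, transpose back.
    return counts[keep].T.groupby(groups).sum().T


def sum_transcripts_to_genes(counts, tx2gene):
    """Sum the counts of all transcripts that map to the same gene."""
    return counts.groupby(tx2gene).sum()


collapsed = collapse_replicates(counts, metadata, POOR_QUALITY)
gene_counts = sum_transcripts_to_genes(collapsed, tx2gene)
print(gene_counts)
```

Summing is used here because the notes propose it for replicates of different read depth; whether the median or mean would be preferable is still an open item.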
Nov 29th: The columns in the data files are separated by a single space.
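A space-separated table like this can be read with pandas. The snippet below is a small sketch on made-up content: the header row, the gene IDs and the sample names are assumptions, and the real files may not have a header line at all.

```python
import io

import pandas as pd

# Hypothetical file content; the real files may be laid out differently.
raw_text = "gene_id S1 S2\nGENE_A 10 20\nGENE_B 0 1\n"

# sep=" " splits on single spaces; index_col=0 uses the first column as row labels.
table = pd.read_csv(io.StringIO(raw_text), sep=" ", index_col=0)
print(table)
```

For a real file, the `io.StringIO(...)` argument would be replaced by the file path.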
Nov 30th: Second Group Meeting Notes:
- The transcripts have been preliminarily mapped to Ensembl reference genes.
- Use the exonic gene length to calculate TPM, which is directly comparable between samples. Genes with TPM < 1 are considered not expressed.
- The first filter should be based on the protein-coding genes (refer to the HPA No.6).
- Then do differential expression analysis. (Sequencing platform: Illumina HiSeq)
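As a rough illustration of the TPM step in these notes, the sketch below converts raw counts to TPM using gene lengths and then applies the cut-off of 1. The toy counts, the gene lengths, and the choice to keep a gene if it reaches TPM >= 1 in at least one sample are assumptions made for the example, not decisions recorded in the meeting.

```python
import pandas as pd


def counts_to_tpm(counts, lengths_bp):
    """Convert a genes x samples count table to TPM.

    lengths_bp: exonic gene length in base pairs, indexed like counts.
    """
    # Reads per kilobase: normalise each gene by its length.
    rate = counts.div(lengths_bp / 1000, axis=0)
    # Scale each sample so that its values sum to one million.
    return rate.div(rate.sum(axis=0), axis=1) * 1e6


# Hypothetical toy data.
counts = pd.DataFrame(
    {"s1": [100, 200, 0], "s2": [50, 50, 0]},
    index=["g1", "g2", "g3"],
)
lengths = pd.Series([1000, 2000, 500], index=counts.index)

tpm = counts_to_tpm(counts, lengths)

# Assumed rule: keep genes with TPM >= 1 in at least one sample.
expressed = tpm[(tpm >= 1).any(axis=1)]
print(expressed)
```

Because every sample's TPM values sum to the same total, the values can be compared across samples, which is the property the notes rely on.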
To-do list (by end of this week):
- Combine metadata and raw counts → patient code vs gene codes.
- Look up whether to sum up or take the median/mean for the samples that were sequenced multiple times (deeper coverage).
- Get information regarding the workflow so far, up until we get the raw count.
- Upload reference regarding the TPM value cut-off.
- The details of how sequencing and mapping were done need to be further understood.
Questions this week:
- 1128-1: What do we know about the initial samples? Do we have any information about which (or how many) are biological/technical replicates, or shall we figure them out by ourselves?
- 1128-2: What does "cleaning up data" mean in detail?
- 1129-1: For files like "raw_count_data.txt" and "metadata.txt", what does each line look like (as a string)? Are the two columns separated by a space or something else?