updated readme

KasperSkytte · Sep 29, 2023 · 42d13bf
1 parent e199ba4
commit 42d13bf
Showing 1 changed file with 50 additions and 48 deletions.
diff --git a/README.md b/README.md
@@ -1,40 +1,21 @@
 # ASMC-prediction
-!THIS README NEEDS AN UPDATE!
-Predicting Activated Sludge Microbial Communities based on time series of continuous sludge samples by using deep learning, mainly LSTM and IDEC for pre-clustering
-
-## IDEC
-Everything in the 'idec/' folder is from:\
-https://github.com/XifengGuo/IDEC-toy
-
-IDEC is from the paper:\
-Xifeng Guo, Long Gao, Xinwang Liu, Jianping Yin.
-[Improved Deep Embedded Clustering with Local Structure Preservation](https://xifengguo.github.io/papers/IJCAI17-IDEC.pdf). IJCAI 2017.
+Predicting Activated Sludge Microbial Communities based on time series of continuous sludge samples by using graph neural network models.
 
 ## Requirements
 ### Data
-Data files needed in the 'data' directory to run main.py:\
-Abundance data:                              (e.g. aalborg_west_ASV.csv)\
-Metadata about the samples:                  (e.g. metadata_filtered.csv)\
-function info from MiDAS Field Guide:   (e.g. MiDAS_Metadata.csv)
+The required data must be in the typical amplicon data format: an abundance table with one row per ASV/OTU, a taxonomy table, and sample metadata. The sample metadata must contain a variable with sampling dates. If the data can be loaded successfully using the [ampvis2](https://kasperskytte.github.io/ampvis2/) R package, everything should "just run" as long as there is enough data.
 
-### Python packages
-Install all necessary Python packages with:
-pip install -r requirements.txt
-
-One of the requirements is 'TensorFlow 2' which is currently supported on Python 3.6-3.8 (https://www.tensorflow.org/install/)
+### Required Python and R packages
+Install the required Python packages with `pipenv` based on the lock file, and similarly use `renv` for the R packages. For GPU support, ensure you have a version of TensorFlow that matches your NVIDIA drivers and CUDA version.
 
 ## Usage
-Some settings/parameters can be tweaked in the file 'config.json' including which data files to use (see the list below).
-If changing 'metadata_file' or 'functions_file' options in 'config.json' it can be necessary to also set 'force_preprocessing' to 'true' for the next run. This will force recalculation of some of the preprocessing steps using the new data.\
-The program can be run with:\
-python ./main.py
-
-## Docker
-Pull image with `docker pull kasperskytte/asmc-prediction` (append `-{version}` to pull a specific and locked version based on specific GitHub tags) or build from this repository with `docker build -t kasperskytte/asmc-prediction .`. The image does not contain any scripts, it's simply to contain the software and dependencies used (exact versions, tested).
+Run the wrapper script `run.bash`. It first runs `reformat.R` to sort, filter, and format the data, look up known genus-level functions on [midasfieldguide.org](https://midasfieldguide.org), etc., and then runs `main.py`, which starts model training and evaluation.
 
-Then run with:
+### Docker (recommended)
+This image has all required tools installed and tested together, and it has been used to produce the results for the paper.
+Pull the image with `docker pull ghcr.io/kasperskytte/asmc-prediction:main` or build it from scratch from this repository with `docker build -t ghcr.io/kasperskytte/asmc-prediction:main .`. The image does not contain any scripts; it only provides the software and dependencies used (exact, tested versions). Ideally, use [development containers](https://code.visualstudio.com/docs/devcontainers/tutorial) with VS Code. Otherwise, run it through Docker:
 ```
-docker run -it --rm -v "${PWD}":/tf -u $(id -u):$(id -g) kasperskytte/asmc-prediction python main.py
+docker run -it --rm -v "${PWD}":/tf -u $(id -u):$(id -g) ghcr.io/kasperskytte/asmc-prediction:main run.bash
 
 ```
 
@@ -45,30 +26,51 @@ sudo apt-get update
 sudo apt-get install docker.io nvidia-container-toolkit
 ```
 
-before starting the container. Remember to restart the docker daemon for the changes to take effect with `sudo systemctl restart dockerd`. If this doesn't work follow the guidelines at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#install-guide. The container has been developed using CUDA version 11.4.
+before starting the container. Remember to restart the Docker daemon with `sudo systemctl restart docker` for the changes to take effect. If this doesn't work, follow the guidelines at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#install-guide. The container is based on CUDA version 11.4, but you can adapt it to other versions.
 
-## Explanations of the options in config.json:
-| Parameter                     | Description |
-| ---                           | ---         |
-| abund_file                     | Name of the abundance data file. |
-| metadata_file                 | Name of the metadata file. |
-| results_dir                   | Path to a directory where results are saved. |
-| functions          | Which functions to use. |
-| force_preprocessing           | If 'true', forces preprocessing. Otherwise tries to skip some preprocessing steps which are only necessary to run when the data files changes. |
-| only_pos_func                 | If 'true', only uses taxa with a positive value in at least one function. |
-| max_zeros_pct       | Discards taxa which have an abundance of 0 in more than 'max_zeros_pct'\*100 percent of the samples. |
-| num_features          | Number of taxa used for the prediction. |
-| max_epochs_lstm               | The maximum number of epochs used for training the LSTM. |
-| window_size                   | The size of the windows used for the LSTM i.e. how many samples that are used to predict the following sample. |
-| num_clusters_idec                | Number of IDEC clusters |
-| tolerance_idec                | The training of the IDEC model stops if less than 'tolerance_idec'\*100 percent taxa change cluster each iteration. |
-| splits                        | How to partition the data into training, validation, and testing sets. Must sum to <= 1 |
-
-splits: make sure the test data set has at least the same number of samples as the window size
+## Options in config.json:
+| Parameter | Default value | Description |
+| --- | --- | --- |
+| abund_file    | "data/datasets/Damhusåen-C/ASVtable.csv" |  CSV/text file with abundance data (OTU/ASVs in rows, samples in columns) |
+| taxonomy_file | "data/datasets/Damhusåen-C/taxonomy.csv" |  File with taxonomy for each OTU/ASV (Kingdom->Species) |
+| metadata_file | "data/metadata.csv" |  Sample metadata (Sample IDs must be in the first column) |
+| results_dir   | "results" |  Folder with all output and logs |
+| metadata_date_col | "Date" |  Name of the column in the metadata that contains the sampling dates |
+| tax_level | "OTU" |  Taxonomic level at which to aggregate OTU/ASVs (Only works and makes sense at OTU/ASV level) |
+| tax_add   | ["Species", "Genus"] |  Additional taxonomy levels to add to plot titles |
+| functions | ["AOB", "NOB", "PAO", "GAO", "Filamentous"] |  Array of metabolic functions to use for pre-clustering |
+| only_pos_func | false |  If true only keeps a taxon if it's assigned to at least one function according to midasfieldguide.org |
+| pseudo_zero   | 0.01 | Pseudo zero |
+| max_zeros_pct | 0.60 | Discard taxa whose abundance is pseudo-zero in more than this fraction of samples (0.60 = 60%) |
+| top_n_taxa    |  200 | Number of most abundant taxa to use from the dataset |
+| num_features  |  200 |   |
+| num_per_group |  5 | Max number of taxa per group |
+| iterations    |  10 | Max iterations of model training before continuing |
+| max_epochs_lstm   |  200 | Max number of epochs when using LSTM |
+| window_size   |  10 | How many samples are used as input for predictions |
+| predict_timestamp |  10 | How many samples into the future to predict for each moving window |
+| num_clusters_idec |  10 | How many IDEC clusters to create (should be automatic though) |
+| tolerance_idec    |  0.001 | Stop IDEC model training if not improving more than this tolerance |
+| transform |  "divmean" | Data transformation to use. One of "divmean", "normalize", "standardize", "none" |
+| cluster_idec  |  false | Whether to create IDEC clusters and perform model training+testing |
+| cluster_func  |  false | Whether to create function clusters and perform model training+testing |
+| cluster_abund |  true | Whether to create ranked abundance clusters and perform model training+testing |
+| cluster_graph |  true | Whether to create graph clusters and perform model training+testing |
+| smoothing_factor |  4 | Data smoothing factor |
+| splits | [0.80, 0.05, 0.15] | Fractions used to split the data into training, validation, and test sets |
 
 vscode extensions:
 R
 quarto
 jupyter
 python
-(pylance)
+(pylance)
+
+
+## IDEC
+Everything in the 'idec/' folder is from:\
+https://github.com/XifengGuo/IDEC-toy
+
+IDEC is from the paper:\
+Xifeng Guo, Long Gao, Xinwang Liu, Jianping Yin.
+[Improved Deep Embedded Clustering with Local Structure Preservation](https://xifengguo.github.io/papers/IJCAI17-IDEC.pdf). IJCAI 2017.
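
The `splits` and `window_size` options in the config.json table interact: the earlier README text noted that the fractions must sum to at most 1 and that the test set needs at least as many samples as the window size. The following is a minimal sketch of such a sanity check, not code from this repository. The function name `split_sizes` and the rounding rule (flooring each fraction) are assumptions; `main.py` may compute the split differently.

```python
import math


def split_sizes(n_samples, splits, window_size):
    """Return (train, val, test) sample counts for the given split fractions.

    Hypothetical helper: flooring each fraction is an assumption,
    not necessarily how main.py partitions the data.
    """
    if len(splits) != 3:
        raise ValueError("splits must hold three fractions: train, val, test")
    if any(f < 0 for f in splits):
        raise ValueError("split fractions must be non-negative")
    if sum(splits) > 1 + 1e-9:
        raise ValueError("split fractions must sum to at most 1")

    # Small epsilon guards against floating-point error, e.g. 29.999999... -> 30
    sizes = tuple(math.floor(n_samples * f + 1e-9) for f in splits)

    # Rule from the earlier README text: test set >= window size
    if sizes[2] < window_size:
        raise ValueError(
            f"test set has {sizes[2]} samples, fewer than window_size={window_size}"
        )
    return sizes


if __name__ == "__main__":
    # Default values from the options table, with a made-up sample count of 200
    print(split_sizes(200, [0.80, 0.05, 0.15], window_size=10))
```

With the default `splits` and `window_size`, a dataset of 50 samples would fail this check, because 15% of 50 samples leaves only 7 test samples for a window of 10.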