2. Methods
Depending on the data set type, different methods are used to explore the underlying information.
Archived data is uploaded as a BytesIO stream. That stream is then written to a file that represents a 1-to-1 copy of the originally uploaded data set. The archive is then extracted into the Data/uploaded/omic_name
folder for further use. The next step, shared among all archived data sets, is the addition of temporality. If filenames start with D or W, the user is presented with an input box where a start date should be entered. After the start date is entered, every file name is converted to the ISO 8601
format (yyyy-mm-dd). If filenames already follow this format, this step is skipped. If an error occurs during this process, an error message is shown to the user.
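The filename-to-date conversion described above can be sketched as follows. This is only an illustration, not MOVIS's actual code; in particular, the assumption that the number after D or W counts days or weeks offset from the entered start date is ours:

```python
from datetime import date, timedelta

def filenames_to_iso_dates(filenames, start_date):
    """Convert 'D<n>'/'W<n>'-style file names to ISO 8601 dates.

    Hypothetical sketch: assumes 'D' means days and 'W' means weeks
    offset from the user-supplied start date.
    """
    dates = []
    for name in filenames:
        stem = name.rsplit(".", 1)[0]  # drop the extension
        prefix, number = stem[0].upper(), int(stem[1:])
        if prefix == "D":
            offset = timedelta(days=number)
        elif prefix == "W":
            offset = timedelta(weeks=number)
        else:
            raise ValueError(f"Unrecognized filename pattern: {name}")
        dates.append((start_date + offset).isoformat())  # yyyy-mm-dd
    return dates
```

Files already named in yyyy-mm-dd form would simply skip this conversion.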
If an archived data set contains FASTA files (extensions .fa or .faa), the user is presented with an option for creating an additional data set with physico-chemical properties. These properties are different for genomics and proteomics data, as can be seen below:
- Genomics: Hydrogen bond, Stacking energy, Solvation (see paper Physico-chemical fingerprinting of RNA genes)
- Proteomics: Molecular weight, Gravy, Aromaticity, Instability index, Isoelectric point, Secondary structure fraction of helix, Secondary structure fraction of turn, Secondary structure fraction of sheet, Electricity, Fraction aliphatic, Fraction uncharged polar, Fraction polar, Fraction hydrophobic, Fraction positive, Fraction sulfur, Fraction negative, Fraction amide, Fraction alcohol
Since each FASTA file can contain multiple sequences, these physico-chemical values are averaged so that each file has a fixed set of physico-chemical values. When these values have been calculated for all files, temporality is added to the newly created data set using the file names.
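As an illustration of the per-file averaging, here is a minimal sketch for a single property, GRAVY (grand average of hydropathy, computed from the Kyte-Doolittle hydropathy scale). MOVIS computes the full set of properties listed above, but the averaging idea is the same:

```python
# Kyte-Doolittle hydropathy values per amino acid.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
      "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
      "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
      "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def gravy(seq):
    """Grand average of hydropathy of one protein sequence."""
    return sum(KD[aa] for aa in seq) / len(seq)

def file_gravy(sequences):
    """Average GRAVY over all sequences in one FASTA file, so each
    file collapses to a single value for this property."""
    return sum(gravy(s) for s in sequences) / len(sequences)
```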
The next step of preprocessing FASTA files is embedding those files (and the sequences within them) into 100-dimensional vectors. The vector length was chosen empirically. A Word2Vec model is used to embed varying-length sequences into fixed-size vectors. Embedding FASTA files for the first time trains the model on those sequences and then embeds them, which can take a long time depending on the number of sequences and FASTA files. After the first run, both the model weights and the embeddings are cached, so there are no unnecessary re-runs. The vector corresponding to one FASTA file is the average vector of all sequences contained in that file, which makes MOVIS robust to the number of sequences in each FASTA file. When the whole archived data set (holding N FASTA files) is processed, the end result is an N×101-dimensional matrix containing one 100-dimensional vector per FASTA file plus one additional dimension holding the temporal information of that file.
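The averaging of sequence vectors into one vector per file can be sketched as follows. This is not MOVIS's actual implementation: the trained Word2Vec vocabulary is replaced here by a plain dict of k-mer vectors, and the 3-mer tokenization is an assumption of this sketch:

```python
import numpy as np

def embed_file(sequences, kmer_vectors, dim=100, timestamp=0.0):
    """Average fixed-size k-mer vectors into one vector per FASTA file.

    Hypothetical sketch: `kmer_vectors` stands in for a trained
    Word2Vec vocabulary; here it is just a dict of 3-mer -> vector.
    """
    def seq_vector(seq):
        kmers = [seq[i:i + 3] for i in range(len(seq) - 2)]
        vecs = [kmer_vectors[k] for k in kmers if k in kmer_vectors]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    # One FASTA file -> average of its sequence vectors,
    # plus one temporal dimension -> a (dim + 1)-vector.
    file_vec = np.mean([seq_vector(s) for s in sequences], axis=0)
    return np.append(file_vec, timestamp)
```

Stacking N such vectors yields the N×101 matrix described above.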
As stated earlier in this Wiki, KEGG files must have a specific header in order to be processed correctly. If that is not the case, an error message is shown, and all computations are stopped. After importing a valid KEGG annotation file, a count matrix is created across all KEGG file names. If there are N KEGG files, the resulting matrix is N×(M+1), where M is the number of distinct KO annotations present in the whole archive. The one additional dimension is the temporal dimension computed from the file names. Values inside the matrix represent the number of occurrences of a specific KO annotation inside one KEGG file.
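The count-matrix construction can be sketched with a plain `Counter`. The row layout (temporal information from the file name plus one count per distinct KO) follows the description above; the data structures are simplified for illustration:

```python
from collections import Counter

def kegg_count_matrix(annotations_per_file):
    """Build an N x (M+1)-style count structure from KO lists.

    `annotations_per_file` maps a (date-converted) file name to the
    list of KO identifiers found in that KEGG annotation file.
    """
    # M = all distinct KO annotations across the whole archive.
    all_kos = sorted({ko for kos in annotations_per_file.values() for ko in kos})
    rows = []
    for name, kos in annotations_per_file.items():
        counts = Counter(kos)
        # One row: temporal info from the file name + per-KO counts.
        rows.append([name] + [counts.get(ko, 0) for ko in all_kos])
    return all_kos, rows
```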
Beware: the resulting matrix created from KEGG annotation files can be large.
MOVIS handles these files by saving only the product information from them. That means that every annotation file must have a part where the products (i.e., the functions) of the corresponding FASTA file sequences are stated. We intend to extend the functionality of MOVIS to use even more meaningful data from these files. When the products have been imported for each BIN file, they are sorted in descending order by the number of occurrences of each product within that BIN file. After that, only the top 10 products are saved for each BIN file, which makes it easier to visualize those products through time. The result is an N×(10+1) matrix, where N is the number of BIN files, 10 columns represent the top 10 products, and 1 is an additional temporal column. Each cell of the matrix holds the number of occurrences of a specific product inside one BIN file.
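The selection of the most frequent products per BIN file maps directly onto `Counter.most_common`; a minimal sketch:

```python
from collections import Counter

def top_products(products, k=10):
    """Keep only the k most frequent products (functions) of one
    BIN annotation file, sorted by descending occurrence count."""
    return Counter(products).most_common(k)
```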
The processing of depth-of-coverage files is in the early stages. However, support for these files gives a user an overall impression of the depth-of-coverage values over sampling time. After importing a valid depth-of-coverage file, MOVIS reduces the whole file to one single value by averaging all values in that file. After repeating that process for all N files, MOVIS creates an N×2 matrix where one column contains temporal information and the other contains averaged depth-of-coverage values. Lastly, MOVIS creates two data sets: one that contains the statistical summary of the depth-of-coverage values and one that contains the outliers. The statistical summary consists of the first quartile Q1, the third quartile Q3, the interquartile range IQR, the lower limit, the upper limit, and the mean. Outliers are values that are either higher than the upper limit or lower than the lower limit.
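The summary-and-outlier step can be sketched as follows, assuming the conventional 1.5 × IQR fences for the lower and upper limits (the exact factor MOVIS uses is not stated above):

```python
import numpy as np

def coverage_summary(values):
    """Summary statistics and outliers for averaged depth-of-coverage
    values, using the conventional 1.5 * IQR fences (an assumption)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    stats = {"Q1": q1, "Q3": q3, "IQR": iqr,
             "lower limit": lower, "upper limit": upper,
             "mean": values.mean()}
    # Outliers: values outside the [lower, upper] fences.
    outliers = values[(values < lower) | (values > upper)]
    return stats, outliers
```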
Unknown values are removed immediately after importing a tabular data set. More information about the reasoning behind this decision can be found in section 1 of this Wiki. Since certain tabular data sets can contain special characters within their column names, MOVIS fixes those columns by replacing each invalid character with a corresponding valid one.
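The column-name fix can be sketched with a single regular expression. The exact character set MOVIS treats as invalid is an assumption here; this sketch keeps letters, digits, and underscores:

```python
import re

def sanitize_column(name, replacement="_"):
    """Replace special characters in a column name with a valid one.

    Assumption of this sketch: anything outside [0-9A-Za-z_] is
    invalid and becomes the replacement character.
    """
    return re.sub(r"[^0-9A-Za-z_]", replacement, name).strip("_")
```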
If a multi-tabular data set is uploaded, the previous procedure is applied to each data set individually. Then, since every data set must have the same set of columns (features), a check is run to validate this. If the check succeeds, all data sets are merged into one, with one additional column named Type that contains the file name of the corresponding data set. The resulting data set is then treated as a new data set, so all of the following procedures apply to all tabular data sets.
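The column check and merge can be sketched as follows; tables are represented here as plain dicts of column → values, a simplification of the data frames MOVIS works with:

```python
def merge_tabular(datasets):
    """Merge several tabular data sets that share the same columns,
    adding a Type column holding the originating file name.

    `datasets` maps a file name to a dict of column -> list of values.
    """
    columns = None
    merged = {}
    for filename, table in datasets.items():
        if columns is None:
            columns = set(table)
            merged = {col: [] for col in columns}
            merged["Type"] = []
        elif set(table) != columns:
            # The validation check: every data set must share the columns.
            raise ValueError(f"{filename} has a different set of columns")
        n_rows = len(next(iter(table.values())))
        for col in columns:
            merged[col].extend(table[col])
        merged["Type"].extend([filename] * n_rows)
    return merged
```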
The next step MOVIS takes in curating the uploaded data set is to fix any value that uses a comma (,) instead of a period (.) as the decimal separator. After that, MOVIS locates the temporal column in the data set and saves the list of all other columns for further processing. If the temporal column cannot be located, an error is shown to the user.
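Both steps can be sketched briefly. The matched decimal pattern and the candidate names for the temporal column are illustrative assumptions, not MOVIS's actual heuristics:

```python
import re

def fix_decimal_commas(value):
    """Turn '3,14'-style strings into floats. The matched pattern
    (digits, one comma, digits) is an assumption of this sketch."""
    if isinstance(value, str) and re.fullmatch(r"-?\d+,\d+", value):
        return float(value.replace(",", "."))
    return value

def find_temporal_column(columns):
    """Locate the temporal column by name; the candidate names here
    are illustrative only."""
    for candidate in ("DateTime", "Date", "Time"):
        if candidate in columns:
            return candidate
    raise ValueError("Temporal column not found")
```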
The last step in the data set curation is the additional modification of the uploaded data set. MOVIS currently supports the following types of modifications:
- Saving only a certain time interval and removing all rows outside of the chosen interval.
- Removing one or more rows from a tabular data set.
- Removing one or more columns from a tabular data set.
After this step, a data set is considered curated and ready for visualization tasks.
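The first modification above, keeping only a chosen time interval, can be sketched as a simple inclusive filter; rows are represented as dicts, and the temporal key name is an assumption of this sketch:

```python
from datetime import date

def filter_interval(rows, start, end, temporal_key="DateTime"):
    """Keep only rows whose temporal value falls inside the chosen
    interval (inclusive); all other rows are removed."""
    return [row for row in rows if start <= row[temporal_key] <= end]
```

Removing rows or columns works analogously by dropping the selected entries.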
After the embedding process, the user is presented with an option to cluster the FASTA representations, i.e., the FASTA files. The user is first presented with an 'Elbow rule' visualization of the underlying data, which reveals a rough estimate of the best number of clusters. For the clustering step, we implemented two clustering methods, and we intend to integrate more in the future. We chose the K-means and OPTICS algorithms first because they cover two different inductive principles: K-means is distance-based, while OPTICS is density-based. Since these two algorithms have the same goal but different strategies, the user is presented with a slider to choose either the number k for the K-means algorithm or the minimum number of samples for the OPTICS algorithm. Since we use the implementations from the scikit-learn library, further information on the meaning of these numbers can be found on its official website. Because MOVIS recalculates everything after every change of the input, a user can easily see clustering results for different choices of clustering options. After the clustering process, dimensionality-reduction options are provided to the user for easier data visualization. MOVIS currently supports three dimensionality-reduction methods: PCA, MDS, and t-SNE. After choosing one of these options, an interactive visualization of the data is presented.
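The 'Elbow rule' idea, plotting within-cluster sum of squares (inertia) against k and looking for the bend, can be sketched as follows. MOVIS itself uses scikit-learn's KMeans and OPTICS; this self-contained miniature of Lloyd's algorithm is only for illustration:

```python
import numpy as np

def kmeans_inertia(X, k, n_iter=50, seed=0):
    """Tiny k-means (Lloyd's algorithm) returning the inertia,
    i.e., the within-cluster sum of squared distances."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def elbow_curve(X, k_max):
    """Inertia for k = 1..k_max; the 'elbow' suggests a good k."""
    return [kmeans_inertia(X, k) for k in range(1, k_max + 1)]
```

The inertia drops sharply until k reaches the natural number of clusters, then flattens; that bend is what the elbow visualization shows the user.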
The analysis step for the KEGG annotation files consists of clustering the instances that form the KEGG matrix created in the previous (2.1. Processing) step. The clustering process is exactly the same as described in the subsection above (for the FASTA files).
Currently, the analysis of the BIN files is only supported by the visual analytics of the product feature of those files. One visualization provided by MOVIS is specifically suitable for the analysis of these files - 'Top 10 share through time'. This visualization allows users to examine shares of different products over time. We plan to integrate more analytical possibilities for exploring BIN files in the future.
The analysis step of the Depth-of-coverage files consists of inspecting and visualizing the summary data set created in the previous (2.1. Processing) step. One visualization provided by MOVIS is specifically suitable for analyzing these files - 'Whisker plot'. This visualization is only presented when working with Depth-of-coverage files. We plan to integrate more analytical possibilities for exploring Depth-of-coverage files in the future and especially support taxa linking.
Tabular data sets are visualized with various supported visualizations depending on the feature type.
Currently, MOVIS supports nine different visualizations of time-series data sets:
- Correlation heatmap,
- Time heatmap,
- Multiple features parallel chart,
- Scatter-plot matrix,
- Scatter plot,
- Two features plot,
- Feature through time,
- Whisker plot,
- Top 10 share through time.
The user can use any of the visualizations for any data set type without restrictions. However, some basic data knowledge is advised in order to choose appropriate visualizations for certain data set types.
The currently available visualizations are presented below. The scatter plot is not presented as a separate figure, since it is included in the Scatter-plot matrix visualization.
Correlation heatmap. The screenshot shows the correlation of Physico-chemical data of Example 1 using a Correlation heatmap chart. The tooltip allows a closer inspection of values for two features of interest.
Time heatmap. The screenshot shows the temperature feature of Physico-chemical data of Example 1 using a Time heatmap chart. The tooltip allows a closer inspection of values for a cell of interest.
Multiple features parallel chart. Suppose a data set has an ordinal or nominal feature differentiating data set elements. In that case, this chart can help a user find patterns that might be missed with some other visualization techniques. This screenshot depicts Ci, Se, and Type features of the Metabolomics data set from Example 1. Interactive elements like Span and Zoom are present and allow a more granular look at specific points that might be occluded on the default view.
Scatter-plot matrix. The screenshot depicts multiple Physico-chemical features (Inflow pH, Oxygen concentration, Nitrate concentration) of Example 1 in the scatter-plot matrix chart. The Temperature feature is also encoded using a blue color scale. We can see an outlier in 4 charts - there is a cluster of low pH values present only in June 2011. A tooltip gives the exact values of each feature for the point of interest.
Two features scatter plot. The two features scatter plot is used here to depict how the Gravy feature changes as the Molecular weight changes for the additionally calculated Metaproteomics data from Example 1. The tooltip shows the respective values of those features, alongside the temporal information of the specific point.
Feature through time. When a user wants to see how one feature changed through time, Feature through time chart is the way to go. Here we can see how the Hydrogen bond feature from the additional calculated Metagenomics data from Example 1 changed through time. A vertical ruler allows precise positioning on the line chart. A tooltip is also available and gives an exact value of the feature of interest at the selected position.
Whisker plot. The screenshot presents averaged depth-of-coverage values of the Metagenomics data set from Example 1. A tooltip is also present to show detailed information on every single whisker plot over time. We can see a clear periodic pattern created by the third quartile and upper limit values.
Top 10 share through time. This chart is a composite chart consisting of a linked line chart and a stacked bar chart. A user can select a specific period of interest on the upper part of the chart, which will be reflected on the lower part of the chart. Here we can see product (function) information of BINS annotation genomics data from Example 1, from 23rd of July to 19th of August. There is an apparent abundance of hypothetical protein function in that period for the collected samples.