NNet Tool
The NeuralNet Remora allows users to generate training and test sets from a set of labeled events, save a trained network, and classify novel signals with that network.
Warnings:
- This remora is currently configured to process vector data (e.g. waveforms, spectra, feature distributions), not matrices (e.g. images).
- This remora requires Matlab's Deep Learning Toolbox. If you do not have this toolbox, consider trying our standalone Triton implementation.
This tool is designed for acousticians who are not experts in machine learning but wish to use basic deep neural networks as classifiers for signals of interest. These signals must be represented as vectors. One or more of three possible features are expected: spectra, waveforms, and inter-click (or inter-detection) interval distributions. (However, it is possible to pass in whatever features you want, as long as the vectors have the expected names; more on that below.)
It is configured for two possible modes:
- Detection-level learning and classification: the tool will label individual detections.
- Bin-level learning and classification: the tool will label averages of multiple detections. The process is similar, but additional logic keeps track of the original detections used in each average.
The easiest way to begin with this remora is to work with output from the Cluster Tool remora, which is automatically configured as expected. The Cluster Tool option "save individual cluster files for classifier training" outputs a directory (e.g. "MyTypes") containing one folder per signal category, for example:
```
MyTypes\CuviersBeakedWhale
       \GervaisBeakedWhale
       \SpermWhale
       \RissosDolphin
       \Echosounder
       \Boat
```
If you have obtained a folder structure with output files from the Cluster Tool Remora, you can proceed directly to the first menu item:
Remora > NeuralNet > Make Train & Test Sets > From Clusters
This window will help you partition your data into balanced, independent training, test, and validation sets according to specified percentages. The "bout gap" parameter enforces temporal separation between training and test examples: "bouts" are defined as independent detection events separated by a minimum number of minutes without detections, and training and test examples cannot be pulled from the same bout (see the sketch below). You will need to provide the base directory containing your species/signal-specific folders, as well as the filename patterns (wildcards) to be matched when searching for detection-level and bin-level files. If you only have one or the other, you can ignore the unused wildcard. In general, generating detection-level sets is much slower than bin-level, because there are many more detections than bins.
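For intuition, here is a minimal sketch of how detections can be grouped into bouts by a minimum gap. The variable names and the 30-minute gap are illustrative, not the tool's actual implementation:

```matlab
% Sketch: assign each detection to a bout, where a new bout starts
% whenever the gap since the previous detection exceeds the bout gap.
gapMin = 30;                               % bout gap in minutes (illustrative)
t      = sort(trainTimes);                 % detection times as Matlab datenums
gaps   = diff(t) * 24 * 60;                % gaps between detections, in minutes
boutID = [1; 1 + cumsum(gaps > gapMin)];   % bout index for each detection
% Whole bouts (not individual detections) are then assigned to the
% training, test, or validation set.
```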
Max Training Set Size: This parameter determines how many examples of each type to include in the training set. Neural networks are typically trained on "balanced" datasets, where every class has the same number of examples. You can imagine that if you trained on 100,000 examples of dolphin clicks and 1,000 examples of beaked whale clicks, the network would learn that dolphin clicks are 100 times more likely and might just label everything as a dolphin. If one or more of your classes is smaller than Max Training Set Size, examples will be re-sampled from the small class(es) to achieve a balanced dataset. This is not ideal and can lead to brittle classifiers, so consider the size of your smallest class when picking this number.
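Conceptually, padding a small class by resampling with replacement looks like this minimal sketch (variable names and sizes are illustrative, not the tool's code):

```matlab
% Sketch: pad a small class up to the target training set size by
% resampling its rows with replacement.
targetN  = 5000;                       % Max Training Set Size
nAvail   = size(classSpectra, 1);      % examples available in this class
idx      = randi(nAvail, targetN, 1);  % sample row indices with replacement
balanced = classSpectra(idx, :);       % targetN x P balanced class
```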
Add Noise: The tool can add noise to training examples to boost variability when sample sizes are small; however, this may reduce classification accuracy on novel data later on, so this option should be used with caution.
Details are given in tooltips if you hover your cursor over the various field names.
Additional info for those trying to input data from a nonstandard pipeline
Detection-level input files are expected to contain the following variables:
- trainTimes: Event times as an Nx1 vector of Matlab datenumbers, where N is the number of detections.
- trainMSN: NxM matrix of detection waveform envelopes, where M is the length of the waveform envelopes. Use zero padding/truncation if needed so all envelopes have the same length.
- trainMSP: NxP matrix of detection spectra, where P is the number of spectral bins.
- TPWSname: Name of the original data file. This is only used for tracking where the data originated.
If you are manually creating these files, you can put whatever you want into these vectors; no signal processing will be done on them at this stage. This is the easiest way to get your own data into this pipeline without using the Cluster Tool.
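For example, a hand-built detection-level input file could be created like this minimal sketch. All sizes, values, and file names are illustrative placeholders; only the four variable names must match the specification above:

```matlab
N = 500;    % number of detections
M = 200;    % waveform envelope length (zero-pad/truncate to keep uniform)
P = 100;    % number of spectral bins

trainTimes = now + sort(rand(N,1));     % Nx1 Matlab datenums (use real event times)
trainMSN   = rand(N, M);                % NxM waveform envelopes (use real data)
trainMSP   = rand(N, P);                % NxP spectra (use real data)
TPWSname   = 'MyDeployment_TPWS1.mat';  % original file name, for provenance only

save('MyLabel_det.mat', 'trainTimes', 'trainMSN', 'trainMSP', 'TPWSname');
```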
Bin-level input files are expected to be constructed by the Cluster Tool and have a more complex structure. Bin-level inputs can also be created without clustering from labeled TPWS files. This functionality is not yet available in a GUI, but an example script to get you started can be found here:
Triton ▸ Remoras ▸ NNet ▸ funs ▸ nn_fn_TPWS_input.m
Step 1 creates three files, one each for training, testing, and validation. The tool will try to pre-populate the file names from the previous step in this window, unless you have closed the session since running step 1.
![TrainNet](https://user-images.githubusercontent.com/4645150/133722227-4eff1b78-6a60-462f-b670-46bccc3b8553.png)
The network designs created with this tool are very simple, but generally effective. The network consists of one input layer and one output layer, with N fully-connected hidden layers in between; all hidden layers are the same width. We recommend using 4 hidden layers in Matlab versions 2019 and prior, because in our experience networks with more than 4 layers fail to learn. In recent Matlab versions (2020+), more hidden layers are effective due to changes in how networks are initialized, and deeper networks may succeed. It is standard practice to use layer sizes that are a power of 2 (e.g. 128, 256, 512...); however, in practice we have not noticed efficiency differences between powers of 2 and nearby round numbers (e.g. 250 vs. 256).
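For reference, a network of the kind described above can be assembled in the Deep Learning Toolbox roughly as follows. This is a sketch of the general design, not the tool's exact construction; all sizes are illustrative, and featureInputLayer requires a relatively recent Matlab release:

```matlab
inputSize  = 100;   % length of each input vector
hiddenSize = 128;   % width of every hidden layer
numHidden  = 4;     % number of hidden layers
numClasses = 6;     % number of signal categories

layers = featureInputLayer(inputSize);
for k = 1:numHidden
    layers = [layers
        fullyConnectedLayer(hiddenSize)   % all hidden layers the same width
        reluLayer
        dropoutLayer(0.5)];               % 50% dropout is standard
end
layers = [layers
    fullyConnectedLayer(numClasses)       % one output per class
    softmaxLayer
    classificationLayer];
```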
When choosing your parameters, consider the following:
- Wider and deeper networks store more information, however they may also be prone to overtraining.
- Larger networks are more computationally intensive to train.
- You will probably want to adjust your hidden layer width to be similar to or slightly larger than your input vector size.
- Dropout is a parameter that controls overtraining and improves the ability of your model to generalize to new data. 50% is standard.
- If you have 1000 training examples, a batch size of 10, and run 20 epochs, then the neural network will take 100 (1000/10) steps in each epoch, and will run until it stops learning or completes 20 epochs. In each epoch, all training examples are fed in, but the order is shuffled. A common batch size is the square root of your training set size, but this is a highly debated parameter (see the sketch after this list).
- If the network performance on the validation set starts to deteriorate (a sign of overtraining), the network will quit training before completing all epochs.
- There are many other tunable parameters and alternative configurations in deep nets which are not adjustable through the GUI, but can be modified in the underlying scripts if you are Matlab savvy.
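As a rough illustration of how these parameters map onto a Deep Learning Toolbox training run (XTrain/YTrain/XVal/YVal are placeholder names, layers is the array sketched above, and this is not the tool's exact code):

```matlab
% XTrain: N x inputSize numeric array; YTrain: N x 1 categorical labels.
opts = trainingOptions('adam', ...
    'MaxEpochs',          20, ...
    'MiniBatchSize',      32, ...               % ~sqrt(1000) for 1000 examples
    'Shuffle',            'every-epoch', ...    % reshuffle example order each epoch
    'ValidationData',     {XVal, YVal}, ...
    'ValidationPatience', 5, ...                % stop early if validation worsens
    'Plots',              'training-progress'); % the real-time training GUI
net = trainNetwork(XTrain, YTrain, layers, opts);
```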
Training produces several output files:
- A .txt diary file. This file can be opened with any text editor and gives you valuable information on the technical details of your network design and parameterization, details on how the network learned with each iteration/epoch, and final overall accuracy and confusion when evaluated on the test set.
- *trainedNetwork_det.mat or *trainedNetwork_bin.mat: This file holds the trained network structure, which can be reloaded into Matlab when you want to classify new data.
- *evalScores_det.mat or *evalScores_bin.mat: These files contain additional details about what you ran and how the network performed on the test/evaluation data. In particular, the softmax probability scores assigned to each detection are stored here for further investigation, as are the labels applied. The network parameters printed in the diary are also stored in this file.
- *TrainingProgress.png: This is an image of the final state of the training GUI, a Matlab built-in tool which appears and updates in real time during network training.
- *ConfusionTrain.png & *ConfusionTest.png: These confusion matrices can help you understand where your network is working well or failing to distinguish classes. Training confusion should not be reported as a final result, but can be useful for troubleshooting. Test (a.k.a. evaluation) confusion is more important, because it applies to data that the network did not see during training. Be aware that performance on subsequent datasets (e.g. if you apply this network to an entirely new dataset) may differ substantially.
- *Classification_accuracy.png: This plot shows how classification accuracy relates to the probability score output by the network. In general, probability scores tell you something about the likelihood, according to the network, that a detection is correctly classified. For instance, a detection labeled "beaked whale" with a score near 1 is more likely to be correctly labeled than one with a score near 0. In practice, the minimum probability you will see equals 1 divided by the number of training classes. By removing labels with poor probability, classification accuracy can improve, but some detections that don't match any class well will end up unlabeled. All of this said, probability scores are not a substitute for classifier confidence; please read up on this elsewhere for a more thorough treatment of this important caveat.
- *Training_Data.png, *Test_Data.png, *Classified_Test.png, & *Misclassified_Test.png: These plots show what the training and test data looked like as vectors concatenated into matrices for each species. They allow you to compare the true labels to how the neural network classified the data, to help you understand what issues could be addressed to improve performance. The Misclassified_Test image highlights the detections that were misclassified. For instance, if a spectrum is plotted under "Sperm whale", then it was labeled as sperm whale by the network but should have been labeled as something else according to the "true" labels for the test set. If a panel is empty, there were no misclassifications in that category.
These plots take some getting used to. In the example below, the network was trained on spectra and waveform envelopes. The bottom half of each subplot shows the concatenated spectra, and the top shows the concatenated waveform envelopes. Values have been normalized for plotting purposes.
![NNet_TestClassifications](https://user-images.githubusercontent.com/4645150/133850676-0409ab66-e3ea-412d-bd0f-4f54867c2561.png)
![NNet_Missclassifications](https://user-images.githubusercontent.com/4645150/133854163-3e89349a-fea0-42fa-a40b-c6c34a638ba6.png)
Left: Test classifications, Right: Misclassifications.
Known Issue: These plots do not yet track the edges of your vectors when using multiple inputs, and normalization may fail, resulting in bad plots with problematic colormapping. This does not affect actual performance.
Now that you have a trained network, you are ready to classify new data. If you have trained a network on detections, then this tool expects TPWS files as input. If you have trained it on bins, then it expects output from Cluster Bins. The tool also requires you to specify the location of the trained network file.
![Classify_gui](https://user-images.githubusercontent.com/4645150/133860017-0340ae67-92ac-4e14-9c16-0c2ddf8639c6.png)
Tip: Watch out for differences between the shape of the data you used to train the network and the shape of any new data you want to classify. For instance, if you trained the network on 100-point spectra, it will not run on inputs of any other size. Differences in bandpass filters, bin size, etc. will cause problems. Make sure that the same processing was applied to the training data and to all subsequent data being classified.
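One quick sanity check is to compare the new data against the network's input layer before classifying. This is a sketch, assuming a loaded network net with a feature input layer and new spectra newMSP; the names are illustrative:

```matlab
expected = net.Layers(1).InputSize;   % input length the network was trained on
actual   = size(newMSP, 2);           % length of the new input vectors
assert(actual == expected, ...
    'Input length %d does not match the network input size %d.', ...
    actual, expected);
```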
The only post-classification option currently implemented is exporting labels. This step outputs ID1.mat files for use with detEdit or other subsequent analyses. Files contain a matrix called "zID" consisting of 3 columns: [detection time, detection label, probability score].
![NNet_options](https://user-images.githubusercontent.com/4645150/133722327-51917cb3-3360-4b8b-9ed4-119cc3d72606.png)
Please note: In the bin-level case, these files will contain detection-level labels for all detections which were included in labeled bins.
For example: A 5-minute time window might contain 5,000 clicks. Of those, 4,500 were good quality and formed a cluster, while 500 were poor quality and were not included in the cluster. The neural network labels the cluster as "Cuvier's beaked whale". In this case, zID will contain 4,500 time stamps and labels for the clicks included in the cluster. They will all be assigned the same probability, which is actually the bin label probability. If you use this ID file in conjunction with the original TPWS files, you can use the set difference to understand which detections were not labeled:
```matlab
% MTT holds the detection times from the original TPWS file;
% zID(:,1) holds the times of the labeled detections.
[timeUnlabeled, idxUnlabeled] = setdiff(MTT, zID(:,1));
```
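Similarly, if you want to keep only confidently labeled detections, the zID matrix can be thresholded on its probability column. The file name and the 0.9 cutoff below are illustrative:

```matlab
load('myFile_ID1.mat', 'zID');   % columns: [time, label, probability]
keep      = zID(:,3) >= 0.9;     % keep only high-probability labels
zIDstrict = zID(keep, :);
```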