Separate sequences into inliers and outliers, cluster the inliers, and generate a novel prototypical sequence for each cluster.
Consider the following scenario: a process generates a set of sequences, and each sequence is encoded as a string of characters. An undisclosed number of distinct processes are at work, so it should be possible to group the sequences into clusters of similar sequences. In addition, some sequences have been generated by another, unrelated process and form outliers. Each instance is therefore either an inlier or an outlier.
- Cluster the inliers into an appropriate number of groups.
- Generate a novel prototypical sequence for each cluster, i.e. a sequence that is the most representative of that cluster. The prototypical sequence must be novel: it must not be one of the provided sequences.
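One possible way to approach the two tasks above is sketched below. It is only an illustration under stated assumptions, not a prescribed solution: it assumes that Levenshtein (edit) distance is a sensible similarity measure for the sequences, uses a DBSCAN-style density clustering so that the number of clusters does not have to be fixed in advance, and builds the prototype by perturbing the cluster medoid with a single edit. The function names and the parameters `eps` and `min_pts` are hypothetical and would need tuning on the real data.

```python
from collections import deque


def edit_distance(a, b):
    """Levenshtein distance, computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def cluster(seqs, eps=2, min_pts=3):
    """DBSCAN-style labels: cluster ids 0..k-1, or -1 for outliers.

    eps and min_pts are assumed values, not taken from the task.
    """
    n = len(seqs)
    # Each point counts as its own neighbour (distance 0).
    neighbours = [[j for j in range(n)
                   if edit_distance(seqs[i], seqs[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    next_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1          # provisional outlier
            continue
        labels[i] = next_id
        queue = deque(neighbours[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = next_id  # border point joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = next_id
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])
        next_id += 1
    return labels


def prototype(members, provided):
    """Single-edit variant of the medoid that is not in `provided`
    and has the lowest total edit distance to the cluster members."""
    def cost(s):
        return sum(edit_distance(s, m) for m in members)

    medoid = min(members, key=cost)
    alphabet = sorted(set("".join(members)))
    candidates = set()
    for i in range(len(medoid) + 1):
        for ch in alphabet:
            candidates.add(medoid[:i] + ch + medoid[i:])           # insertion
            if i < len(medoid):
                candidates.add(medoid[:i] + ch + medoid[i + 1:])   # substitution
        if i < len(medoid):
            candidates.add(medoid[:i] + medoid[i + 1:])            # deletion
    candidates -= set(provided)     # enforce novelty
    candidates.discard("")
    return min(sorted(candidates), key=cost)


# Made-up toy data: two groups of similar strings and one unrelated string.
seqs = ["ACGTACGT", "ACGTACGA", "ACGTACGG", "ACGAACGT",
        "CACACACACACACA",
        "TTTTGGGG", "TTTTGGGC", "TTTAGGGG", "TTTTGGGA"]
labels = cluster(seqs)
print(labels)  # -> [0, 0, 0, 0, -1, 1, 1, 1, 1]
for cid in sorted(set(labels) - {-1}):
    members = [s for s, l in zip(seqs, labels) if l == cid]
    print(cid, prototype(members, seqs))
```

The pairwise distance computation is quadratic in the number of sequences, which is acceptable for a small file but would need an index or a cheaper distance for large inputs. A multiple-alignment consensus would be an alternative way to build the prototype; the single-edit medoid variant is used here only because it is easy to verify that the result is novel.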
A text file, test.txt, is provided. It contains a random mixture of inlier and outlier sequences in no particular order. Each row of this file contains an integer identifier for the sequence and the sequence itself.
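Reading the file could look roughly like the sketch below. The task only states that each row holds an integer identifier and a sequence; the separator is not specified, so this example assumes the two fields are separated by whitespace. The helper name `parse_rows` is hypothetical.

```python
def parse_rows(lines):
    """Return (ids, seqs) from rows shaped like '<integer id> <sequence>'.

    Assumption: identifier and sequence are separated by whitespace.
    Blank rows are skipped.
    """
    ids, seqs = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ident, seq = line.split(maxsplit=1)
        ids.append(int(ident))
        seqs.append(seq)
    return ids, seqs


# Usage with the provided file (left as a comment so the sketch runs
# without the file being present):
#     with open("test.txt") as fh:
#         ids, seqs = parse_rows(fh)
ids, seqs = parse_rows(["12 ACGT\n", "\n", "7\tTTGA\n"])
print(ids, seqs)  # -> [12, 7] ['ACGT', 'TTGA']
```

If the real file uses a different delimiter, such as a comma, only the `split` call would need to change.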
- For each sequence, print its identifier together with either the ID of the cluster it belongs to or a note that it is an outlier.
- Print one novel prototypical sequence for each cluster you have found.
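The two output requirements above could be met with something like the following sketch. The exact output layout is not specified in the task, so the tab-separated format, the use of `-1` as the outlier label, and the function name `format_report` are all assumptions made for illustration.

```python
def format_report(ids, labels, prototypes):
    """Build the output lines: one per sequence, then one per cluster.

    labels[i] is the cluster id of ids[i], or -1 for an outlier (assumed
    convention). prototypes maps cluster id -> prototypical sequence.
    """
    lines = []
    for ident, label in zip(ids, labels):
        status = "outlier" if label == -1 else f"cluster {label}"
        lines.append(f"{ident}\t{status}")
    for cid in sorted(prototypes):
        lines.append(f"prototype for cluster {cid}: {prototypes[cid]}")
    return lines


# Made-up example values, not results from test.txt.
print("\n".join(format_report([7, 3, 9], [0, -1, 0], {0: "ACGTACG"})))
```

Returning the lines instead of printing them directly keeps the formatting easy to check separately from the clustering itself.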