CogAlg
======

I am designing this algorithm for comprehensive hierarchical clustering, from pixels to eternity. It stems from the definition of intelligence as the ability to predict from prior / adjacent inputs, which is basically tracing connections in segmented graphs, including much-ballyhooed reasoning and planning. Any prediction is an interactive projection of known patterns, hence the primary process must be pattern discovery (AKA unsupervised learning: an obfuscating negation-first term). This perspective is not novel: pattern recognition is a main focus of ML and the core of any IQ test. The problem I have with statistical ML is the process: it ignores crucial positional info, so the resulting patterns are effectively centroid-based.

Pattern recognition is the default mode in Neural Nets, but they work indirectly, in a very coarse statistical fashion. A basic NN, such as a [multi-layer perceptron](https://towardsdatascience.com/what-the-hell-is-perceptron-626217814f53) or [KAN](https://towardsdatascience.com/kolmogorov-arnold-networks-kan-e317b1b4d075), performs lossy stochastic chain-rule curve fitting. Each node outputs a normalized sum of weighted inputs, then adjusts the weights in proportion to modulated similarity between input and output. In Deep Learning, this adjustment is mediated by backprop of decomposed error (inverse similarity) from the output layer. In Hebbian Learning, it's a more direct adjustment by local output/input coincidence: a binary version of their similarity. The logic is basically the same as in centroid-based clustering, but non-linear and fuzzy (fully connected in MLP), with a vector of centroids and multi-layer summation / credit distribution in backprop.

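To make the node-level logic above concrete, here is a minimal sketch in plain Python (all names are mine, not from CogAlg or any NN library): a node outputs a normalized weighted sum, and its weights are adjusted in proportion to a similarity signal, either local output/input coincidence (Hebbian-style) or a backpropagated error for a single output node.

```python
import math

def node_output(inputs, weights, bias=0.0):
    # weighted sum of inputs, squashed to a normalized range
    s = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-s))  # logistic normalization

def hebbian_update(inputs, weights, output, lr=0.01):
    # local adjustment: weight grows with input/output coincidence
    return [w + lr * i * output for i, w in zip(inputs, weights)]

def error_update(inputs, weights, output, target, lr=0.01):
    # backprop-style adjustment for a single output node: error (inverse
    # similarity of output to target) scaled by the logistic derivative
    grad = (target - output) * output * (1.0 - output)
    return [w + lr * grad * i for i, w in zip(inputs, weights)]

inputs, weights = [0.2, 0.9, 0.4], [0.1, -0.3, 0.5]
out = node_output(inputs, weights)
print(hebbian_update(inputs, weights, out))
print(error_update(inputs, weights, out, target=1.0))
```
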
Modern ANNs combine such vertical training with lateral cross-correlation within the input vector. CNN filters are designed to converge on edge detection in the initial layers. Edge detection means computing a lateral gradient, by weighted pixel cross-comparison within kernels. Graph NNs embed lateral edges, representing similarity or/and difference between nodes, also produced by their cross-comparison. Popular [transformers](https://www.quantamagazine.org/researchers-glimpse-how-ai-gets-so-good-at-language-processing-20220414/) can be seen as a [variation of Graph NN](https://towardsdatascience.com/transformers-are-graph-neural-networks-bca9f75412aa). Their first step is self-attention: computing dot products between query and key vectors within the context window of an input. This is a form of cross-comparison, because the dot product serves as a measure of similarity, just an unprincipled one.

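Both forms of cross-comparison can be illustrated with a short NumPy sketch (a toy example of mine, not actual CNN or transformer internals): pixel cross-comparison within a 3x3 kernel yields a lateral gradient, and a scaled dot product between token vectors yields pairwise similarity, as in self-attention.

```python
import numpy as np

def kernel_gradient(image, y, x):
    # lateral cross-comparison within a 3x3 kernel:
    # differences between neighbors and the central pixel
    patch = image[y-1:y+2, x-1:x+2].astype(float)
    diffs = patch - patch[1, 1]
    gy = diffs[2, 1] - diffs[0, 1]   # vertical gradient component
    gx = diffs[1, 2] - diffs[1, 0]   # horizontal gradient component
    return np.hypot(gy, gx)          # gradient magnitude: a variance measure

def attention_similarity(queries, keys):
    # cross-comparison by dot product, as in self-attention:
    # entry [i, j] scores similarity of token i's query to token j's key
    return queries @ keys.T / np.sqrt(queries.shape[1])

image = np.random.randint(0, 255, (5, 5))
print(kernel_gradient(image, 2, 2))

tokens = np.random.rand(4, 8)        # 4 tokens, 8-dim embeddings
print(attention_similarity(tokens, tokens))
```
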
So the basic operation in both trained CNN and self-attention is what I call cross-comparison, but the former selects for variance and the latter for similarity. I think the difference is due to the relative rarity of each in the respective target data: mostly low gradients in raw images and sparse similarities in compressed text. This rarity or surprise determines the information content of the input. But almost all text ultimately describes generalized images and the objects therein, so there should be a gradual transition between the two. In my scheme, higher-level cross-comparison computes both variance and similarity, for differential clustering.

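In this scheme, an elementary cross-comparison can return both measures at once. The sketch below is a guess at the simplest form, assuming match is the shared magnitude (min) of the comparands and variance is their signed difference; the actual definitions in the project may differ.

```python
def cross_comp(a, b):
    # elementary comparison of two scalars:
    # difference quantifies variance, min quantifies shared magnitude (match)
    d = b - a           # variance between comparands
    m = min(a, b)       # similarity: the magnitude they have in common
    return m, d

pixels = [10, 12, 11, 40, 42]
links = [cross_comp(p, q) for p, q in zip(pixels, pixels[1:])]
# [(10, 2), (11, -1), (11, 29), (40, 2)]
```
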
GNN, transformers, and Hinton's [Capsule Networks](https://medium.com/ai%C2%B3-theory-practice-business/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b) all have positional embeddings, as I use explicit coordinates. But they are still trained through destructive backprop: randomized summation first, meaningful output-to-template comparison last. This primary summation degrades the resolution of the whole learning process, exponentially with the number of layers. Hence, a ridiculous number of backprop cycles is needed to fit hidden layers into generalized representations of the input. Most practitioners agree that this process is not very smart; the noise-worship alone is the definition of stupidity. It's just a low-hanging fruit for terminally lazy evolution, and for slightly more disciplined human coding. It's also easy to parallelize, which is crucial for glacially slow cell-based biology.

Graceful conditional degradation requires the reversed sequence: first cross-comp of original inputs, then summing them into match-defined clusters. That's lateral [connectivity-based clustering](https://en.wikipedia.org/wiki/Cluster_analysis#Connectivity-based_clustering_(hierarchical_clustering)), vs. vertical statistical fitting in NN. This cross-comp and clustering is recursively hierarchical, forming patterns of patterns and so on. Initial connectivity is in space-time, but feedback will reorder input along all sufficiently predictive derived dimensions (eigenvectors). This is similar to [spectral clustering](https://en.wikipedia.org/wiki/Spectral_clustering), but the last step is still connectivity clustering, in a new frame of reference. Feedback will only adjust hyperparameters to filter future inputs: no top-down training, just bottom-up learning.

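A toy version of this reversed sequence, under my own simplifications (1-D inputs, match defined as min, a single average-match filter): cross-comp adjacent inputs first, then sum inputs connected by above-average match into clusters.

```python
def connectivity_clusters(values):
    # 1) lateral cross-comparison of adjacent inputs
    links = [(i, i + 1, min(values[i], values[i + 1]))   # match = shared magnitude
             for i in range(len(values) - 1)]
    # 2) keep links whose match exceeds the average match (feedback-adjusted filter)
    ave = sum(m for _, _, m in links) / len(links)
    strong = [(i, j) for i, j, m in links if m > ave]
    # 3) sum connected inputs into clusters (simple union-find)
    parent = list(range(len(values)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in strong:
        parent[find(i)] = find(j)
    clusters = {}
    for i, v in enumerate(values):
        clusters.setdefault(find(i), []).append(v)
    return list(clusters.values())

print(connectivity_clusters([10, 12, 11, 40, 42, 41, 9]))
# [[10], [12], [11], [40, 42, 41], [9]]
```
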
Connectivity likely represents local interactions, which may form both similarity clusters and their high-variance boundaries. Such boundaries reflect the stability (resilience to external impact) of the core similarity cluster. The most basic example is image contours, which are initially more informative than flat areas. Cross-similarity is not likely to continue immediately beyond such contours, so they also represent the "separability" of the core cluster. Thus, the next cross-comp should be discontinuous and on a higher composition level: between previously formed complemented cluster + contour representations. This is much more expensive: the clusters are complex and compared over greater distances (more combinations).

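One way to hold such complemented clusters, purely illustrative (the class and field names are my own, not the project's): a similarity core paired with its high-variance contour, with their summed values used to decide whether the much more expensive, discontinuous higher-level cross-comp is warranted.

```python
from dataclasses import dataclass, field

@dataclass
class CompCluster:
    core: list = field(default_factory=list)     # nodes clustered by cross-similarity
    contour: list = field(default_factory=list)  # boundary nodes, high cross-variance
    core_match: float = 0.0                      # summed similarity inside the core
    contour_var: float = 0.0                     # summed variance along the contour

    def eval_higher_comp(self, ave):
        # select for well-defined clusters: combined core similarity and
        # contour variance must exceed a (feedback-adjusted) average filter
        return self.core_match + self.contour_var > ave

blob = CompCluster(core=[10, 11, 12], contour=[40, 42], core_match=33.0, contour_var=60.0)
print(blob.eval_higher_comp(ave=50.0))  # True: worth the costlier, discontinuous cross-comp
```
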
So higher cross-comp should be selective for high core_similarity + contour_variance (borrowed from average local cross-similarity). And it should be global: complemented clusters are inherently discontinuous. That means centroid clustering, and the next connectivity clustering level will cross-comp the resulting centroids (exemplars). These two clustering phases should alternate hierarchically (toy sketch below):
 - connectivity clustering as a generative phase, forming new derivatives and structured composition levels,
 - centroid clustering as a compressive phase, reducing multiple similar comparands to a single exemplar.

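A minimal sketch of the compressive phase (toy 1-D features, naive k-means-style updates; an illustration of the idea, not the project's code): previously formed connectivity clusters, each summarized by one feature, are reduced to a few centroids, and only those exemplars would be cross-compared at the next connectivity level.

```python
def centroid_clusters(exemplar_feats, k=2, iters=10):
    # compressive phase: reduce many similar comparands to k exemplars
    centroids = exemplar_feats[:k]                       # naive initialization
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for f in exemplar_feats:
            nearest = min(range(k), key=lambda c: abs(f - centroids[c]))
            groups[nearest].append(f)                    # global assignment by similarity
        centroids = [sum(g) / len(g) if g else centroids[c]
                     for c, g in enumerate(groups)]      # update exemplars
    return centroids, groups

# features summarizing previously formed connectivity clusters (toy numbers):
feats = [3.1, 2.9, 3.0, 8.8, 9.2, 9.0]
centroids, groups = centroid_clusters(feats)
print(centroids)  # the next connectivity level would cross-comp only these exemplars
```

In the full scheme, these exemplars would feed the next generative (connectivity) phase, so the two phases keep alternating up the hierarchy.
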
While connectivity clustering is among the oldest methods in ML, I believe my scheme is uniquely scalable in the complexity of discoverable patterns:
- Links are valued by both similarity and variance between the nodes.