Research Proposal

  • Fergus Steel
  • 2542391s
  • Dr. Paul Siebert

Research Outline

  • Brief: Investigating the Ability of CNNs to Count Visual Concepts
  • Research Direction: Investigating the encapsulation of visual concepts in Capsule Networks and the resulting effect on class-agnostic object counting

Conceptual Overview - What is involved in this research? / Research Justification - What is the rationale for this research?

Related Work: Object Counting, Convolutional Neural Networks

Object Counting is one domain of vision research where humans are remarkably capable. The human brain instinctively incorporates visual cues such as object density and group size to count objects at a glance. Additionally, the human brain is able to subitize, rapidly and accurately producing a count of smaller groups of objects (Revkin et al., 2008).

In Computer Vision, however, these instinctive systems do not exist, so they must either be emulated or replaced with alternative techniques. Fan et al. (2022) discuss the two methods typically used in crowd counting, a sub-domain of object counting research:

  • Detection-based CNNs, where CNNs are trained on an image dataset annotated with bounding boxes, training a model to detect, localise and therefore count each instance of the target object in the scene.
  • Regression-based CNNs, where CNNs are trained on a dataset of point-annotated images to directly estimate the count or, more typically, the density map: a normalised heatmap that, when integrated, yields the count of objects in the image (see the sketch below). Density-map estimation is a computationally simpler task and performs better in more complex scenes, e.g. high density, heavy occlusion, sparse scenes and complex backgrounds (Gao et al., 2020).
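
For concreteness, the following is a minimal sketch (not taken from any of the cited papers) of how a ground-truth density map is typically built from point annotations and then integrated to recover the count; the kernel width `sigma` and the image size are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map_from_points(points, height, width, sigma=4.0):
    """Build a ground-truth density map from point annotations.

    Each annotated object contributes a unit-mass Gaussian, so integrating
    (summing) the map recovers the number of annotated objects.
    """
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        density[int(y), int(x)] += 1.0            # unit impulse at each object centre
    return gaussian_filter(density, sigma=sigma)  # smoothing preserves total mass

# Three annotated objects -> the density map integrates to ~3.
points = [(10, 12), (40, 35), (70, 20)]
dmap = density_map_from_points(points, height=64, width=96)
print(f"count from density map: {dmap.sum():.2f}")
```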

Convolutional Neural Networks are able to achieve state-of-the-art results in object counting thanks to their excellent performance as non-linear function approximators. One such implementation is outlined in de Arruda et al. (2022), which uses a three-stage model for density-map estimation. A typical Convolutional Neural Network first extracts the features of an input image; these features are passed into a Pyramid Pooling Module, which constructs a hierarchical feature map over different spatial resolutions; this is then passed to a Multi Stage Sigma module that uses multiple Gaussian distributions (selecting from multiple variance parameters) to create a high-quality density map from which the object count is extracted. This architecture was originally trained and tested on a new UAV imagery dataset, and was then also trained on popular car counting datasets (CARPK and PUCPR+) to show its generalisability. In both cases, the model achieved state-of-the-art performance in object counting, demonstrating the viability of density map estimation as a method for object counting.
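
The following is a schematic PyTorch sketch of this kind of three-stage pipeline, not the authors' exact architecture: the stand-in backbone, the pooling bin sizes and the simple regression head replacing the Multi Stage Sigma module are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pool features at several spatial resolutions and fuse them."""
    def __init__(self, channels, bins=(1, 2, 4, 8)):
        super().__init__()
        self.bins = bins
        self.reduce = nn.ModuleList(
            [nn.Conv2d(channels, channels // len(bins), kernel_size=1) for _ in bins]
        )

    def forward(self, x):
        h, w = x.shape[2:]
        fused = [x]
        for bin_size, conv in zip(self.bins, self.reduce):
            p = F.adaptive_avg_pool2d(x, bin_size)              # coarse spatial context
            p = F.interpolate(conv(p), size=(h, w),
                              mode="bilinear", align_corners=False)
            fused.append(p)
        return torch.cat(fused, dim=1)

class DensityCounter(nn.Module):
    """Backbone -> pyramid pooling -> single-channel density map."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(                          # stand-in feature extractor
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.ppm = PyramidPooling(64)
        self.head = nn.Conv2d(64 * 2, 1, kernel_size=1)         # density regression head

    def forward(self, x):
        return F.relu(self.head(self.ppm(self.backbone(x))))   # density is non-negative

model = DensityCounter()
density = model(torch.randn(1, 3, 128, 128))
print(density.sum().item())  # predicted count = integral of the density map
```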

Class-Agnostic Counting (Exemplar-based Few Shot Counting and GAN-Based Zero-Shot Counting) & Capsule Networks

The issue with the architecture above, and with other CNN-based density-map estimation counters, is that they are trained on a fixed dataset: when presented at test time with an input image that contains a visual concept outside the training set's distribution, they are unable to "generalise" and count it successfully. Additionally, retraining these networks on a new dataset typically requires millions of annotations across thousands of images, demanding considerable labour to create the datasets and further computational power during training (Ranjan et al., 2021).

Class-agnostic counting, originally proposed in Lu et al. (2018), aims to solve the above issues by framing object counting as a "matching" problem, in which patches from the input image, known as exemplars, are used to count occurrences of a given object in the image. This is done by encoding both the exemplar and the input image into feature maps, broadcasting the exemplar feature map to the size of the input feature map (in that paper, H/8 × W/8 × 512), and then learning a function that produces the density map from these two maps; a minimal sketch of this formulation follows below.
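
The sketch below illustrates this matching formulation; it is not Lu et al.'s exact network, and the encoder, channel sizes and the `ExemplarMatchingCounter` name are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExemplarMatchingCounter(nn.Module):
    """Counting as matching: compare an exemplar's features with the image's
    feature map and regress a density map from their combination."""
    def __init__(self, channels=64):
        super().__init__()
        self.encoder = nn.Sequential(                  # shared feature encoder
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.matcher = nn.Sequential(                  # learned "similarity" -> density
            nn.Conv2d(channels * 2, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, kernel_size=1),
        )

    def forward(self, image, exemplar):
        img_feat = self.encoder(image)                                   # B x C x H' x W'
        ex_feat = self.encoder(exemplar).mean(dim=(2, 3), keepdim=True)  # B x C x 1 x 1
        ex_feat = ex_feat.expand_as(img_feat)          # broadcast exemplar features
        fused = torch.cat([img_feat, ex_feat], dim=1)
        return F.relu(self.matcher(fused))             # predicted density map

model = ExemplarMatchingCounter()
image = torch.randn(1, 3, 256, 256)
exemplar = torch.randn(1, 3, 64, 64)   # patch cropped around one example object
density = model(image, exemplar)
print(density.sum().item())            # estimated count of the exemplar's class
```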

This idea has been further extended in multiple ways in recent literature, for example to improve the accuracy of the predicted density map or to allow exemplar bounding boxes to be generated by a Generative Adversarial Network.

What is wrong with Convolutional Neural Networks?

Geoffrey Hinton, in his talk "What is wrong with Convolutional Neural Networks?", describes the pooling layers used in convolutional networks, which downsample surrounding information to collate denser feature maps, as a "disaster". This is because pooling operations gradually suppress spatial context, meaning that CNNs fail to encode spatial information such as the orientation, position, pose, size and relative geometry of their inputs. The most famous example is that a CNN will often incorrectly recognise a face whose features have had their locations swapped as a proper face.
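
A small numerical illustration of this loss of spatial context: two feature maps containing the same responses in different spatial arrangements produce identical outputs after 2x2 max-pooling, so the arrangement information is discarded.

```python
import numpy as np

# Two 4x4 "feature maps" with the same responses in different positions.
a = np.array([[9, 0, 0, 0],
              [0, 0, 0, 7],
              [0, 5, 0, 0],
              [0, 0, 3, 0]])
b = np.array([[0, 9, 0, 7],
              [0, 0, 0, 0],
              [5, 0, 0, 3],
              [0, 0, 0, 0]])

def max_pool_2x2(x):
    # Take the maximum over each non-overlapping 2x2 block.
    return x.reshape(2, 2, 2, 2).max(axis=(1, 3))

print(max_pool_2x2(a))                                   # [[9 7] [5 3]]
print(max_pool_2x2(b))                                   # [[9 7] [5 3]]
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True: positions lost
```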

These issues are captured by the properties of invariance and equivariance. An invariant property of an object is not altered when some transformation is applied to the object; for instance, when an object is moved, its perimeter and size do not change. An equivariant property changes predictably; for example, the centre of an object is still its centre after the object is rotated. The spatial information of an object is an equivariant property, since it should change predictably under a transformation, yet pooling layers mean this information is lost.
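
Stated formally, for a transformation T applied to an input x and a function f (such as a layer or a whole network), with T' the corresponding transformation of the output:

```latex
\text{invariance:}\quad f(T(x)) = f(x)
\qquad\qquad
\text{equivariance:}\quad f(T(x)) = T'\bigl(f(x)\bigr)
```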

From a biological perspective, pooling layers contradict how we understand our own vision to work. CNNs "see" objects in a way that has no intrinsic coordinate frame, discarding spatial information and losing further information during processing (Sabour et al., 2017).

What are Capsule Networks and how do they address these problems?

Capsule Networks in their modern form were introduced by Geoffrey Hinton in 2017 and are a neural network architecture that differs from traditional networks. Capsules replace the neurons in the network by representing their outputs as vectors rather than scalars. The length of a capsule's vector acts as a "vote", expressing how confident the capsule is that the entity it represents is present. The contents of the vector, i.e. its orientation, capture the entity's properties, such as its spatial arrangement. Higher-layer capsules aggregate the input "votes" of lower-level capsules, so a part-to-whole hierarchy is formed, making for a more accurate object representation. Through the "routing-by-agreement" algorithm, relationships are established between capsules so that they collaborate to identify objects; this amounts to high-dimensional coincidence filtering, since the likelihood of multiple capsules agreeing by coincidence is very small.
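
A minimal sketch of the squash non-linearity and the dynamic routing procedure from Sabour et al. (2017) is given below; the toy shapes in the usage example are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    """Keep a capsule vector's orientation but map its length into [0, 1),
    so that the length can act as a confidence 'vote'."""
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + eps)

def routing_by_agreement(u_hat, num_iters=3):
    """Dynamic routing: lower-level capsule predictions u_hat
    (shape: in_caps x out_caps x out_dim) are combined, and coupling
    coefficients are reinforced where a prediction agrees with the output."""
    in_caps, out_caps, _ = u_hat.shape
    b = torch.zeros(in_caps, out_caps)            # routing logits
    for _ in range(num_iters):
        c = F.softmax(b, dim=1)                   # coupling coefficients per input capsule
        s = (c.unsqueeze(-1) * u_hat).sum(dim=0)  # weighted sum over input capsules
        v = squash(s)                             # out_caps x out_dim output capsules
        b = b + (u_hat * v.unsqueeze(0)).sum(-1)  # agreement between prediction and output
    return v

# Toy usage: 6 lower-level capsules voting for 3 higher-level capsules of dimension 8.
u_hat = torch.randn(6, 3, 8)
v = routing_by_agreement(u_hat)
print(v.norm(dim=-1))  # vector lengths ~ confidence that each entity is present
```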

Why are Capsule Networks potentially suitable architectures for Class-Agnostic counting?

Capsule Networks offer a variety of benefits that could theoretically aid the performance of class-agnostic counting networks. In these networks, exemplar images are represented as feature maps that are compared, via some sort of "similarity" module, with the feature map of the input image in order to count occurrences. If Capsule Networks are used in place of typical convolutional methods to represent the exemplar images, the benefits in object recognition offered by capsules can potentially be exploited. The challenge comes from designing a regression head that relates the exemplar capsules to occurrences of those objects in the input image in order to build the density map; a hypothetical sketch follows below. This can be done in several ways, but avoiding the "crowding problem" that affects capsule networks is crucial: capsules "vote" for their confidence that an object is present at a location in the image, which means that, much like in human vision where stand-alone concepts are easier to recognise, capsule networks can struggle to identify nearby identical objects.
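
Purely as an illustration of this direction, the sketch below compares a hypothetical exemplar capsule with per-location image capsules using orientation agreement weighted by capsule length; all names and shapes here are assumptions, and a learned regression head would still be needed to turn such a similarity map into a density map.

```python
import torch
import torch.nn.functional as F

def capsule_similarity_map(image_caps, exemplar_cap):
    """Hypothetical similarity module for capsule-based exemplar matching.

    image_caps:   H x W x D tensor of per-location capsule vectors for the image
    exemplar_cap: D-dimensional capsule vector summarising the exemplar patch

    Returns an H x W map that is high where a location's capsule agrees in
    orientation with the exemplar and is confidently active (long vector).
    """
    h, w, d = image_caps.shape
    flat = image_caps.reshape(-1, d)
    # Orientation agreement: cosine similarity with the exemplar capsule.
    agreement = F.cosine_similarity(flat, exemplar_cap.expand_as(flat), dim=-1)
    # Presence: the capsule length acts as the "vote" that an entity is there.
    presence = flat.norm(dim=-1)
    return (agreement.clamp(min=0) * presence).reshape(h, w)

# Toy usage with random capsules.
image_caps = torch.randn(32, 32, 16)
exemplar_cap = torch.randn(16)
sim_map = capsule_similarity_map(image_caps, exemplar_cap)
print(sim_map.shape)  # torch.Size([32, 32])
```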

Bibliography

  • de Arruda, M. dos Santos, et al. (2022). Counting and locating high-density objects using convolutional neural network. Expert Systems with Applications, 195, 116555.
  • Fan, Z., Zhang, H., Zhang, Z., Lu, G., Zhang, Y., & Wang, Y. (2022). A survey of crowd counting and density estimation based on convolutional neural network. Neurocomputing, 472, 224-251.
  • Gao, G., Gao, J., Liu, Q., Wang, Q., & Wang, Y. (2020). CNN-based density estimation and crowd counting: A survey. arXiv preprint arXiv:2003.12783.
  • Ranjan, V., Sharma, U., Nguyen, T., & Hoai, M. (2021). Learning To Count Everything. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3394-3403.
  • Revkin, S. K., Piazza, M., Izard, V., Cohen, L., & Dehaene, S. (2008). Does subitizing reflect numerical estimation? Psychological Science, 19(6), 607-614. https://doi.org/10.1111/j.1467-9280.2008.02130.x