Paper Matrix

{"id":11,"url":"https://pm.philipcastiglione.com/papers/11.json","title":"Convolutional architectures are cortex-aligned de novo","read":true,"authors":"Atlas Kazemian, Eric Elmoznino, Michael F. Bonner","year":2023,"auto_summary":"The paper \"Convolutional architectures are cortex-aligned de novo\" by Atlas Kazemian, Eric Elmoznino, and Michael F. Bonner explores the emergence of cortex-aligned representations in deep neural networks (DNNs) without extensive pre-training. The study challenges the prevailing hypothesis that large-scale pre-training is the primary factor for similarities between neural networks and biological vision. Instead, it highlights the role of architectural inductive biases, particularly in convolutional architectures, in promoting cortex-aligned representations with minimal training.\n\nKey findings include:\n\n1. **Architectural Inductive Biases:** The study demonstrates that convolutional architectures, even when untrained, can predict image representations in the visual cortices of monkeys and humans. This is attributed to two key manipulations: spatial compression and feature expansion, which are inherent in convolutional networks.\n\n2. **Dimensionality Expansion:** The research shows that scaling the number of random features in convolutional networks leads to significant performance gains in predicting cortical responses. This effect is specific to convolutional architectures and is not observed in fully connected or transformer architectures.\n\n3. **Critical Architectural Components:** The study identifies that nonlinear activation functions and spatial locality of convolutional filters are critical for the performance of convolutional networks. Removing these components results in a significant drop in encoding performance.\n\n4. **Comparison with Pre-trained Networks:** The largest untrained convolutional model approaches the performance of pre-trained networks like AlexNet in predicting monkey visual cortex responses. However, there is a larger performance gap in human data, suggesting differences in semantic processing or feedback mechanisms between species.\n\n5. **Emergent Properties and Visualization:** The paper shows that even without pre-training, convolutional networks can form semantically meaningful clusters of images, indicating that these architectures naturally organize images into interpretable representations.\n\n6. **Implications for Neuroscience and AI:** The findings suggest that the architectural constraints of convolutional networks are closely aligned with biological vision, allowing for the emergence of cortical representations without extensive training. This challenges the notion that diverse architectures can equally predict cortical representations and emphasizes the unique suitability of convolutional architectures.\n\nOverall, the study provides a new perspective on the role of architectural biases in neural networks and their ability to model visual cortex representations without relying on large-scale pre-training. 
# Notes

This is not about image classification performance; it's about predicting the brain's visual system behaviour/representations.

DNNs have become the dominant method for accurately modelling human vision ("performance in predicting image-evoked responses in the ventral visual stream"; they have "become the leading theoretical models of visual computation in the brain").

It is an open question what factors lead to these representational similarities.

A prominent theory suggests cortex-aligned representations emerge in DNNs when their constraints and objectives match those of biological vision.

Other recent work counters this by testing DNNs with varied architectures and objectives that do not match the constraints paradigm. That work suggests pre-training on massive datasets is the key.

This work counters even that, using untrained networks to suggest that simply including convolutional layers is sufficient for significant representational similarity. Fully connected and transformer-based architectures do not show this pattern.

> Scaling the deeper layers of a convolutional network yielded striking performance gains—the best model even approached the performance of a classic pre-trained network in the monkey data … the performance of the convolutional architecture depended critically on spatial locality and nonlinear activation functions, again demonstrating that the benefits of scaling were highly architecture-dependent.

> there appears to be a stronger emphasis on abstract, semantic information in the human fMRI data than in monkey electrophysiology data

It's tempting to conclude that human representations are semantically richer, but this could be an artefact of the methodology/data.

> This visualization shows that despite having no pre-training for classification, the CNN forms many intuitively interpretable clusters of images, including clusters related to sports, vehicles, food, animals, and people (Figure 4). This suggests that there is a remarkable degree to which images naturally organize into semantically meaningful clusters in the representational space of high-dimensional untrained CNNs
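A hedged sketch of how one might probe for such clusters: compress (here, simulated) untrained-CNN features with PCA and run k-means; with real images one would then inspect each cluster's members for semantic themes. The feature matrix below is a random stand-in, not the paper's data.

```python
# Hypothetical cluster probe in the spirit of the paper's Figure 4.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 16384))  # stand-in for untrained-CNN features

# Reduce the high-dimensional random features, then cluster.
feats_pca = PCA(n_components=50, random_state=0).fit_transform(feats)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(feats_pca)

# With real images, view each cluster's members and check for themes
# (sports, vehicles, food, animals, people).
for k in range(10):
    members = np.flatnonzero(labels == k)[:5]
    print(f"cluster {k}: image indices {members.tolist()}")
```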
The essential components:

- expanding conv layers (larger is better, but only to a point)
- nonlinear activation functions
- spatial locality of the convolutional filters

> Encoding performance can be strong even when classification accuracy is weak

So training is essential for image classification, but NOT for predicting visual cortex activation patterns.

Other work has shown surprising feature representations in randomly initialized convolutional networks:

> combined convolution and pooling architecture can yield spatial frequency selectivity and translation invariance using only random filters

> random, untrained CNN has an inductive bias to localize objects … this phenomenon may be driven by the nature of images, where the background is relatively texture-less compared to the foreground objects, increasing the chance that the background will be deactivated by ReLU.

> face-selective units arise within the network in the absence of supervised training

The paper is not suggesting the expansion model literally describes the computational process of visual processing in the brain; rather, it offers a discovery process for brain-relevant representations.

# Questions

What is spatial locality in CNN filters?

# Takeaways

Expansion then compression (e.g. train a large network, then use dimensionality reduction like PCA) leads to a much better network than one of the same size trained without this process (sketched below).

An *untrained* network approached the performance of trained AlexNet in predicting image-evoked responses in monkey visual cortex (with a larger gap remaining on human data), though not in image classification on out-of-dataset images. Specifically, this holds for high-dimensional CNNs with spatially local filters and nonlinear activation functions. The architecture alone is sufficient for significant cortex alignment.

Massive training can *overcome* architectures that are less de-novo aligned.
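A minimal sketch of the expansion-then-compression idea from the first takeaway. To keep it self-contained it uses untrained random conv features rather than a trained network (in the spirit of the paper); sizes and names are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch: expand into a wide random conv feature space, then
# PCA-compress to a small budget, versus building the small space directly.
# Illustrates the procedure only; random placeholder images will not
# reproduce the performance effect described in the takeaway.
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

torch.manual_seed(0)

def random_conv_features(images, width):
    """Features from one untrained conv layer with `width` random channels."""
    layer = nn.Sequential(
        nn.Conv2d(3, width, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(2),  # spatial compression to 2x2
    )
    with torch.no_grad():
        return layer(images).flatten(1).numpy()

images = torch.randn(64, 3, 16, 16)  # placeholder image batch

# Expand: 2048 random channels -> 8192-dim features, then compress to 32.
wide = random_conv_features(images, width=2048)        # (64, 8192)
compressed = PCA(n_components=32).fit_transform(wide)  # (64, 32)

# Baseline of the same final dimensionality, built without expansion.
narrow = random_conv_features(images, width=8)         # (64, 32)

print(compressed.shape, narrow.shape)  # both (64, 32)
```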