{"id":10,"url":"https://pm.philipcastiglione.com/papers/10.json","title":"SemanticCMC: Contrastive Learning of Meaningful Object Associations from Temporal Co-occurrence Patterns in Naturalistic Movies","read":true,"authors":"Cliona O’Doherty, Rhodri Cusack","year":2023,"auto_summary":"The paper \"SemanticCMC: Contrastive Learning of Meaningful Object Associations from Temporal Co-occurrence Patterns in Naturalistic Movies\" by Cliona O’Doherty and Rhodri Cusack explores self-supervised learning in computer vision, specifically focusing on how deep neural networks (DNNs) can learn semantic structures from temporal co-occurrence patterns in naturalistic movies. The authors propose SemanticCMC, a variant of Contrastive Multiview Coding (CMC), which is trained on a dataset of naturalistic movie frames to capture meaningful object associations over time.\n\nThe study addresses the limitations of supervised DNNs, which rely heavily on curated datasets and often lack the ability to understand the relational structure of concepts. The authors argue that self-supervised models, which do not require labeled data, could potentially emulate human-like learning by leveraging the context in which objects appear. This approach is inspired by human vision, which perceives objects not only by their features but also by their relationships and contexts.\n\nThe researchers use a dataset constructed from feature-length films, capturing frames at intervals to preserve temporal co-occurrence patterns. They hypothesize that these patterns can provide a signal for learning semantic structures, akin to how distributional semantic models in natural language processing learn word meanings from textual context.\n\nTheir experiments involve training CMC on this movie dataset with different temporal lags between frames, comparing the semantic quality of the learned representations to those from networks trained on traditional, perceptual tasks. 
They employ Representational Similarity Analysis (RSA) to evaluate how well the network's representations capture semantic relationships, using WordNet similarity scores as a benchmark.\n\nThe results indicate that networks trained with an intermediate temporal lag (around 60 seconds) capture more semantic information than those trained on purely perceptual tasks or with very short or long lags. This finding suggests that there is an optimal temporal window for learning meaningful object associations.\n\nImportantly, the study highlights a key insight: high object-level classification accuracy does not necessarily imply that a network has captured meaningful semantic concepts. Networks trained on perceptual tasks may perform well on classification benchmarks but fail to understand the relational structure of objects.\n\nThe paper concludes that leveraging temporal co-occurrence patterns in naturalistic datasets can enhance the semantic quality of visual representations in self-supervised learning, offering a pathway towards more robust and human-like computer vision models. The authors acknowledge limitations, such as the focus on a single self-supervised framework and the use of AlexNet, suggesting future work to explore other architectures and concurrent learning of object recognition and semantic structure.","notes":{"id":10,"name":"notes","body":"\u003ch1\u003eNotes\u003c/h1\u003e\u003cdiv\u003e\u003cbr\u003eCMC: Contrastive Multiview Coding\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003eDNNs trained on highly curated labelled datasets perform well on classification tasks, but do not transfer as well to real-world vision. The learning mechanisms themselves have received limited research attention.\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003eSelf-supervision might lead to models that perform better in these real-world environments. 
These models might also be used to provide insights about learning in the brain.\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003eDNNs rely on statistical regularities of local image features. Human vision incorporates more global structure between objects in the world. Pre-verbal infants begin this learning.\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003eFully understanding something involves understanding its relation to other things.\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003eOur computer vision models generally lack these higher-level semantic concepts.\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003eSome image datasets do a better job of showing objects with their relations, but static image datasets do not provide temporal information.\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003eTraining self-supervised models on video data (to include temporal information) improves learning.\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003eThis study used the Contrastive Multiview Coding (CMC) model and paired video frames separated by a fixed time gap. 
“Temporal coherence.” This paper uses longer-range lags than have previously been used, to test for slower associations between co-occurrence patterns of objects in the visual environment and any semantic structure embedded in them.\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003eLong-range associations were used because short-range associations are expected to be dominated by perceptual features (not high-level semantic associations).\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003eNaturalistic data was used (rather than curated image datasets) in order to preserve the real-world semantic associations between objects.\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003ePerceptual similarity decays faster than temporal co-occurrence correlations.\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003eThe CMC network was trained on the new dataset (AlexNet used as an encoder). It was not itself capable of classification.\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003eActually, prior knowledge of object features was useful for learning semantic structure (fine-tuning on top of a pre-trained network).\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003ch1\u003eQuestions\u003c/h1\u003e\u003cdiv\u003e\u003cbr\u003eUsing AlexNet as an encoder - what is this? Are we using some layer (the second last?) of AlexNet to encode (compress?) 
image data into something, then using that something as the input of the CMC network?\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003ch1\u003eTakeaways\u003c/h1\u003e\u003cdiv\u003e\u003cbr\u003eImage classification DNNs trained on curated labelled datasets that perform well on classification benchmarks rely on purely perceptual information, and struggle to capture high-level semantic information, such as relations between objects.\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003eIncorporating temporal information into a self-supervised model can capture some high-level semantic information during training. Specifically, in addition to the below:\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cblockquote\u003e\u003cbr\u003eThis suggests that semantically related objects hold superficial visual similarities\u003cbr\u003e\u003cbr\u003e\u003c/blockquote\u003e\u003cdiv\u003e\u003cbr\u003e(Huh, that’s kind of interesting)\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003e\u003c/div\u003e","record_type":"Paper","record_id":10,"created_at":"2024-12-10T04:22:27.556Z","updated_at":"2024-12-10T04:24:17.059Z"},"created_at":"2024-12-10T04:22:16.860Z","updated_at":"2024-12-10T04:24:17.060Z"}