{"id":17,"url":"https://pm.philipcastiglione.com/papers/17.json","title":"Scaling Monosemanticity - TODO","read":false,"authors":"","year":null,"auto_summary":"","notes":{"id":17,"name":"notes","body":"\u003ch1\u003eNotes\u003c/h1\u003e\u003cdiv\u003e\u003cbr\u003eLooking at a mid-level layer in Claude 3 Sonnet, many features are interpretable, abstract and monosemantic.\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003eThese can also be steered (enhanced or deemphasized) to impact the behaviour of the model.\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cblockquote\u003e\u003cstrong\u003e\u003cbr\u003eKey Results\u003cbr\u003e\u003c/strong\u003e\u003cbr\u003e\u003cul\u003e\u003cli\u003eSparse autoencoders (SAEs) produce interpretable features for large models.\u003c/li\u003e\u003cli\u003eScaling laws can be \u003ca href=\"https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#scaling-scaling-laws\"\u003eused to guide the training\u003c/a\u003e of sparse autoencoders.\u003c/li\u003e\u003cli\u003eThe resulting features are highly abstract: multilingual, multimodal, and generalizing between concrete and abstract references.\u003c/li\u003e\u003cli\u003eThere \u003ca href=\"https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#feature-survey-completeness\"\u003eappears to be a systematic relationship\u003c/a\u003e between the frequency of concepts and the dictionary size needed to resolve features for them.\u003c/li\u003e\u003cli\u003eFeatures can be used to steer large models (\u003cem\u003esee e.g.\u003c/em\u003e \u003ca href=\"https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#assessing-tour-influence\"\u003eInfluence on Behavior\u003c/a\u003e). 
This extends prior work on steering models using other methods (see \u003ca href=\"https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#related-work-steering\"\u003eRelated Work\u003c/a\u003e).\u003c/li\u003e\u003cli\u003eWe observe features related to a broad range of safety concerns, including \u003ca href=\"https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#safety-relevant-deception\"\u003edeception\u003c/a\u003e, \u003ca href=\"https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#safety-relevant-sycophancy\"\u003esycophancy\u003c/a\u003e, \u003ca href=\"https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#safety-relevant-bias\"\u003ebias\u003c/a\u003e, and \u003ca href=\"https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#safety-relevant-criminal\"\u003edangerous content\u003c/a\u003e.\u003c/li\u003e\u003c/ul\u003e\u003c/blockquote\u003e\u003cdiv\u003e\u003cbr\u003eThis paper scaled up the application of sparse autoencoders to decompose and discover features present in a given layer of Claude 3 Sonnet (instead of a toy model, as in previous work). This would have been computationally expensive.\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003eLarger networks (more parameters) can support more specific parameters. They will have a larger number of highly specific parameters than smaller networks, which must overload each parameter with more concepts. 
Larger networks will contain more information overall, but also more low-value parameters, each of which carries relatively little information.\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003eLarger networks need smaller learning rates (at least, eventually), though this obviously requires a larger compute budget.\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003ch1\u003eQuestions\u003c/h1\u003e\u003cdiv\u003e\u003cbr\u003eWhat is dictionary learning?\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cul\u003e\u003cli\u003e\u003cstrong\u003ealgorithms that seek to decompose data into a weighted sum of sparsely active components\u003c/strong\u003e\u003c/li\u003e\u003c/ul\u003e\u003cdiv\u003e\u003cbr\u003eWhat is a sparse autoencoder?\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cul\u003e\u003cli\u003e\u003cstrong\u003ea specific approximation of dictionary learning\u003c/strong\u003e\u003c/li\u003e\u003cli\u003e\u003cstrong\u003eOur SAE consists of two layers. The first layer (“encoder”) maps the activity to a higher-dimensional layer via a learned linear transformation followed by a ReLU nonlinearity. We refer to the units of this high-dimensional layer as “features.” The second layer (“decoder”) attempts to reconstruct the model activations via a linear transformation of the feature activations. 
The model is trained to minimize a combination of (1) reconstruction error and (2) an L1 regularization penalty on the feature activations, which incentivizes sparsity.\u003c/strong\u003e\u003c/li\u003e\u003c/ul\u003e\u003cdiv\u003e\u003cbr\u003eWhat is the linear representation hypothesis?\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cul\u003e\u003cli\u003e\u003cstrong\u003eAt a high level, the linear representation hypothesis suggests that neural networks represent meaningful concepts – referred to as \u003c/strong\u003e\u003cstrong\u003e\u003cem\u003efeatures\u003c/em\u003e\u003c/strong\u003e\u003cstrong\u003e – as directions in their activation spaces\u003c/strong\u003e\u003c/li\u003e\u003c/ul\u003e\u003cdiv\u003e\u003cbr\u003eIs this right:\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cul\u003e\u003cli\u003eThe superposition hypothesis suggests that, with N vectors, considering combinations of two (or more) of them massively increases the number of available representations.\u0026nbsp;\u003cul\u003e\u003cli\u003e\u003cstrong\u003eThe superposition hypothesis accepts the idea of linear representations and further hypothesizes that neural networks use the existence of almost-orthogonal directions in high-dimensional spaces to represent more features than there are dimensions.\u003c/strong\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/li\u003e\u003c/ul\u003e\u003ch1\u003eTakeaways\u003c/h1\u003e\u003cdiv\u003e\u003cbr\u003eWIP\u003cbr\u003e\u003cbr\u003e\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003e\u003c/div\u003e","record_type":"Paper","record_id":17,"created_at":"2024-12-10T05:08:02.919Z","updated_at":"2024-12-10T05:10:14.789Z"},"created_at":"2024-12-10T05:07:19.021Z","updated_at":"2024-12-10T05:10:14.790Z"}