Anthropocentric Bias in Language Model Evaluation
Raphaël Millière & Charles Rathkopf·Computational Linguistics·Journal Paper
Debates about the capacities of current AI systems, such as language models, are starkly polarized: some dismiss them as mere statistical pattern-matchers while others herald them as genuinely intelligent. This polarization reveals a methodological gap in how we evaluate these systems. In my research, I tackle both first-order questions about whether AI systems can be ascribed specific capacities (like syntactic competence or analogical reasoning) and second-order questions about how we should assess these capacities in the first place. Standard benchmarks in the AI industry often lack construct validity and are easy to game. My work proposes adapting best practices from cognitive science to design rigorous behavioral experiments with proper controls, as well as interventional experiments that provide insight into the causal mechanisms responsible for behavior.
LLMs as Models for Analogical Reasoning
Finds that while advanced language models can match human performance on novel analogical reasoning tasks requiring flexible re-representation of semantic information, they exhibit different patterns of behavior in response to task variations and semantic distractors, suggesting they may use different underlying mechanisms than humans.
Anthropocentric Bias in Language Model Evaluation
Identifies two types of anthropocentric bias in evaluating large language models' cognitive capacities – overlooking auxiliary factors impeding performance despite competence (Type-I) and dismissing non-human-like competent strategies (Type-II) – and proposes mitigating these biases through an empirically-driven, iterative approach combining behavioral experiments with mechanistic studies.
Language Models as Models of Language
Critically examines the potential contributions of modern language models to theoretical linguistics and debates about linguistic competence and acquisition, particularly by challenging learnability claims about syntax and providing evidence that hierarchical syntactic knowledge can emerge from exposure to linguistic data without built-in syntactic constraints.
Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models
Introduces BIG-bench, a diverse and challenging benchmark of over 200 tasks for evaluating large language models, finding that model performance improves with scale but remains far below human level. Note: I co-designed the 'conceptual combination' task, which tests language models' ability to grasp novel combinations of concepts, including made-up words.
Artificial neural networks are often described as inscrutable black boxes. The emerging field of mechanistic interpretability aims to reverse-engineer these networks by uncovering the internal causal structures that generate their behavior. This approach seeks to identify both the features encoded in activation patterns and the algorithms implemented by specific circuits within the networks. Despite recent progress in mechanistic interpretability, the field still lacks robust conceptual foundations and methodological consensus. My AI2050 fellowship project, funded by Schmidt Sciences, aims to bridge this gap by drawing on the philosophy of science and causation. In particular, it addresses the risk of interpretability illusions – compelling but misleading explanations for the inner workings of neural networks.
How Do Transformers Learn Variable Binding in Symbolic Programs?
Shows that Transformers can learn to perform variable binding in symbolic programs by developing a systematic mechanism that leverages the residual stream as addressable memory and specialized attention heads for information routing.
Interventionist Methods for Interpreting Deep Neural Networks
Reviews interventionist methods for interpreting neural networks, arguing that such approaches offer more rigorous insights into the causal mechanisms underlying model behavior than purely correlational or behavioral methods.
A Philosophical Introduction to Language Models – Part I: Continuity With Classic Debates
Discusses new philosophical issues raised by research on language models – including how mechanistic interpretability methods reveal that language models can implement general algorithms rather than merely memorizing patterns in their training data.
As AI systems become more capable, we need to ensure they are safe, reliable, and aligned with human values. The main method to align the behavior of language models with desirable norms such as helpfulness, harmlessness, and honesty involves fine-tuning them based on human preferences. I argue that this approach is fundamentally shallow and vulnerable to adversarial manipulation that exploits conflicts between the norms of alignment – for example, where being helpful conflicts with avoiding harm. While humans can navigate such conflicts through explicit deliberation that weighs the contextual relevance of competing norms, language models currently lack a robust capacity for normative reasoning. By bridging technical research on alignment methods with insights from moral philosophy and psychology, I aim to understand why AI systems remain vulnerable to blatant adversarial attacks, and how we can develop less superficial alignment strategies.
Normative Conflicts and Shallow AI Alignment
Argues that current alignment strategies for language models are fundamentally inadequate because they reinforce shallow behavioral dispositions, leaving the models vulnerable to the exploitation of conflicts between norms like helpfulness, honesty, and harmlessness.
The Alignment Problem in Context
Reviews current strategies to align the behavior of language models with desirable norms, and investigates why they remain vulnerable to adversarial attacks that elicit potentially harmful outputs.
Adversarial Attacks on Image Generation With Made-Up Words
Introduces two novel adversarial attacks on text-guided image generation models using made-up words, which can be used to bypass content filters and generate problematic images.
In previous work, I investigated the nature and scope of conscious self-representation in ordinary experience as well as in specific conditions. I developed a pluralist account that distinguishes between several modes of self-representation across conscious thoughts, bodily experiences, and perceptual states – each of which can be disrupted either separately or jointly in anomalous cases, including psychopathologies and drug-induced states. I also argued against the long-standing claim that self-consciousness is constitutive of consciousness, which is either trivially true on a deflationary interpretation or unsupported on an inflationary interpretation. One upshot of my research is that it is both conceptually and nomologically possible to be conscious without being conscious of oneself in any way.
Constitutive Self-Consciousness
Argues that the claim that consciousness constitutively involves self-consciousness is either trivial on a deflationary interpretation or insufficiently supported on an inflationary interpretation.
Argues that subjective reports of conscious experiences lacking self-consciousness can be credible under certain conditions and do not necessarily conflict with subjects' abilities to recall and report such experiences as their own.
Distinguishes several forms of self-consciousness, showing through empirical evidence that each of them can be independently absent in certain conscious states, and further argues that there exist 'totally selfless' states of consciousness in which all of them are concurrently missing.
Raphaël Millière & Charles Rathkopf·Computational Linguistics·Journal Paper
Raphaël Millière·The Oxford Handbook of the Philosophy of Linguistics·Book Chapter
Eric Mandelbaum & Raphaël Millière·The Stanford Encyclopedia of Philosophy·Encyclopedia Entry
Raphaël Millière·Australasian Journal of Philosophy·Journal Paper
Raphaël Millière & Cameron Buckner·Neurocognitive Foundations of Mind·Book Chapter
Raphaël Millière·Philosophical Studies·Journal Paper
Raphaël Millière·Open Encyclopedia of Cognitive Science·Encyclopedia Entry
Dimitri Coelho Mollo & Raphaël Millière·arXiv·Preprint/Other
Sam Musker, Alex Duchnowski, Raphaël Millière & Ellie Pavlick·Journal of Memory and Language·Journal Paper
Yiwei Wu, Atticus Geiger & Raphaël Millière·Forty-Second International Conference on Machine Learning·Conference Paper
Raphaël Millière·Philosophical Perspectives on Psychedelic Psychiatry·Book Chapter
Raphaël Millière & Cameron Buckner·arXiv·Preprint/Other
Raphaël Millière & Cameron Buckner·arXiv·Preprint/Other
Raphaël Millière·WIREs Cognitive Science·Journal Paper
Safoora Yousefi, Leo Betthauser, Hosein Hasanbeig, Raphaël Millière & Ida Momennejad·arXiv·Preprint/Other
Raphaël Millière·arXiv·Preprint/Other
Aarohi Srivastava, Abhinav Rastogi & Abhishek Rao et al.·Transactions on Machine Learning Research·Journal Paper
Raphaël Millière·arXiv·Preprint/Other
Raphaël Millière·Synthese·Journal Paper
Raphaël Millière·The Routledge Handbook of Bodily Awareness·Book Chapter
Raphaël Millière & Albert Newen·Erkenntnis·Journal Paper
Martin Fortier-Davy & Raphaël Millière·Neuroscience of Consciousness·Journal Paper
Raphaël Millière·Philosophy and the Mind Sciences·Journal Paper
Raphaël Millière & Thomas Metzinger·Philosophy and the Mind Sciences·Journal Paper
Raphaël Millière·Thesis·Thesis
Raphaël Millière·Journal of Consciousness Studies·Journal Paper
Christopher Timmermann, Leor Roseman & Michael Schartner et al.·Scientific Reports·Journal Paper
Raphaël Millière, Robin L. Carhart-Harris, Leor Roseman, Fynn-Mathis Trautwein & Aviva Berkovich-Ohana·Frontiers in Psychology·Journal Paper
Raphaël Millière·Frontiers in Human Neuroscience·Journal Paper
Raphaël Millière·Form(s) and Modes of Being: The Ontology of Roman Ingarden·Book Chapter