Research

Research Topics

Capacities and Limitations of AI Systems

Debates about the capacities of current AI systems, such as language models, are starkly polarized: some dismiss them as mere statistical pattern-matchers while others herald them as genuinely intelligent. This polarization reveals a methodological gap in how we evaluate these systems. In my research, I tackle both first-order questions about whether AI systems can be ascribed specific capacities (like syntactic competence or analogical reasoning) and second-order questions about how we should assess these capacities in the first place. Standard benchmarks in the AI industry often lack construct validity and are easy to game. My work proposes adapting best practices from cognitive science to design rigorous behavioral experiments with proper controls, as well as interventional experiments that provide insight into the causal mechanisms responsible for behavior.

Research Questions

How can we design evaluations that reliably distinguish between superficial heuristics and genuine cognitive capacities in AI systems?
Is there a double dissociation between performance and competence in AI systems analogous to that observed in human cognition?
To what extent are current AI models, particularly large language models, capable of genuine reasoning rather than sophisticated pattern matching based on statistical correlations?
To what extent can neural network architectures implement forms of systematic cognition previously thought to require symbolic processing, and how can we empirically test these capabilities?

Selected Works

LLMs as Models for Analogical Reasoning

Finds that while advanced language models can match human performance on novel analogical reasoning tasks requiring flexible re-representation of semantic information, they exhibit different patterns of behavior in response to task variations and semantic distractors, suggesting they may use different underlying mechanisms than humans.

Anthropocentric Bias in Language Model Evaluation

Identifies two types of anthropocentric bias in evaluating large language models' cognitive capacities – overlooking auxiliary factors impeding performance despite competence (Type-I) and dismissing non-human-like competent strategies (Type-II) – and proposes mitigating these biases through an empirically-driven, iterative approach combining behavioral experiments with mechanistic studies.

Language Models as Models of Language

Critically examines the potential contributions of modern language models to theoretical linguistics and debates about linguistic competence and acquisition, particularly by challenging learnability claims about syntax and providing evidence that hierarchical syntactic knowledge can emerge from exposure to linguistic data without built-in syntactic constraints.

Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models

Introduces BIG-bench, a diverse and challenging benchmark of over 200 tasks for evaluating large language models, finding that model performance improves with scale but remains far below human-level performance. Note: I co-designed the 'conceptual combination' task, which tests language models' ability to grasp novel combinations of concepts, including made-up words.

Foundations of Interpretable AI

Artificial neural networks are often described as inscrutable black boxes. The emerging field of mechanistic interpretability aims to reverse-engineer these networks by uncovering the internal causal structures that generate their behavior. This approach seeks to identify both the features encoded in activation patterns and the algorithms implemented by specific circuits within the networks. Despite recent progress in mechanistic interpretability, the field still lacks robust conceptual foundations and methodological consensus. My AI2050 fellowship project, funded by Schmidt Sciences, aims to bridge this gap by drawing from the philosophy of science and causation. In particular, it addresses the risk of interpretability illusions – compelling but misleading explanations for the inner workings of neural networks.

Research Questions

What does it mean for neural networks to be 'interpretable,' and what are the criteria for adequate explanations of their behavior?
How can causal intervention techniques, as opposed to purely behavioral methods, provide deeper insights into the information processing mechanisms of deep neural networks?
Can interpretability methods yield illusory explanations of how neural networks process information?
What kind of functional primitives can bridge the explanatory gap between low-level neural mechanisms and high-level capabilities?

Selected Works

How Do Transformers Learn Variable Binding in Symbolic Programs?

Shows that Transformers can learn to perform variable binding in symbolic programs by developing a systematic mechanism that leverages the residual stream as addressable memory and specialized attention heads for information routing.

Interventionist Methods for Interpreting Deep Neural Networks

Reviews interventionist methods for interpreting neural networks, arguing that such approaches offer more rigorous insights into the causal mechanisms underlying model behavior than purely correlational or behavioral methods.

A Philosophical Introduction to Language Models – Part I: Continuity With Classic Debates

Discusses new philosophical issues raised by research on language models – including how mechanistic interpretability methods reveal that language models can implement general algorithms rather than merely memorizing patterns in their training data.

AI Safety and Alignment

As AI systems become more capable, we need to ensure they are safe, reliable, and aligned with human values. The main method for aligning the behavior of language models with desirable norms such as helpfulness, harmlessness, and honesty involves fine-tuning them based on human preferences. I argue that this approach is fundamentally shallow and vulnerable to adversarial manipulation that exploits conflicts between the norms of alignment – for example, where being helpful conflicts with avoiding harm. While humans can navigate such conflicts through explicit deliberation that weighs the contextual relevance of competing norms, language models currently lack a robust capacity for normative reasoning. By bridging technical research on alignment methods with insights from moral philosophy and psychology, I aim to understand why AI systems remain vulnerable to blatant adversarial attacks, and how we can develop less superficial alignment strategies.

Research Questions

How do conflicts between the norms of alignment create exploitable vulnerabilities in language models fine-tuned to respect these norms?
How can we systematically evaluate an AI system's robustness against different types of normative conflicts?
What would genuine normative deliberation look like in AI systems, and how could it be implemented in practice?
Can insights from moral philosophy and value pluralism help create AI systems capable of contextual ethical reasoning resilient to adversarial manipulation?

Selected Works

Normative Conflicts and Shallow AI Alignment

Argues that current alignment strategies for language models are fundamentally inadequate because they reinforce shallow behavioral dispositions, leaving models vulnerable to the exploitation of conflicts between norms like helpfulness, honesty, and harmlessness.

The Alignment Problem in Context

Reviews current strategies to align the behavior of language models with desirable norms, and investigates why they remain vulnerable to adversarial attacks that elicit potentially harmful outputs.

Adversarial Attacks on Image Generation With Made-Up Words

Introduces two novel adversarial attacks on text-guided image generation models using made-up words, which can be used to bypass content filters and generate problematic images.

Consciousness and Self-Consciousness

In previous work, I investigated the nature and scope of conscious self-representation in ordinary experience as well as in specific conditions. I developed a pluralist account that distinguishes between several modes of self-representation across conscious thoughts, bodily experiences, and perceptual states – each of which can be disrupted either separately or jointly in anomalous cases, including psychopathologies and drug-induced states. I also argued against the long-standing claim that self-consciousness is constitutive of consciousness: this claim is either trivially true on a deflationary interpretation or unsupported on an inflationary interpretation. One upshot of my research is that it is both conceptually and nomologically possible to be conscious without being conscious of oneself in any way.

Research Questions

Is self-consciousness a necessary component of all conscious experience, or are 'selfless' states of consciousness genuinely possible?
What are the different varieties or dimensions of self-consciousness, and how can they be independently disrupted or modulated?
How can the study of altered states of consciousness (e.g., in psychopathologies and drug-induced states) inform our understanding of the nature of self-representation and ordinary experience?
What is the relationship between memory, self-representation, and the first-person perspective in reporting past conscious states?

Selected Works

Constitutive Self-Consciousness

Argues that the claim that consciousness constitutively involves self-consciousness is either trivial on a deflationary interpretation or insufficiently supported on an inflationary interpretation.

Selfless Memories

Argues that subjective reports of conscious experiences lacking self-consciousness can be credible under certain conditions and do not necessarily conflict with subjects' abilities to recall and report such experiences as their own.

The Varieties of Selflessness

Distinguishes several forms of self-consciousness, showing through empirical evidence that each of them can be independently absent in certain conscious states, and further argues that there exist 'totally selfless' states of consciousness in which all of them are concurrently missing.

Research Outputs

2026

Anthropocentric Bias in Language Model Evaluation

Raphaël Millière & Charles Rathkopf · Computational Linguistics · Journal Paper

Evaluating the cognitive capacities of large language models (LLMs) requires overcoming not only anthropomorphic but also anthropocentric biases. This article identifies two types of anthropocentric bias that have been neglected: overlooking how auxiliary factors can impede LLM performance despite competence (auxiliary oversight), and dismissing LLM mechanistic strategies that differ from those of humans as not genuinely competent (mechanistic chauvinism). Mitigating these biases requires an empirical, iterative approach to mapping cognitive tasks to LLM-specific capacities and mechanisms, achieved by supplementing behavioral experiments with mechanistic studies.

Language Models as Models of Language

Raphaël Millière · The Oxford Handbook of the Philosophy of Linguistics · Book Chapter

This chapter critically examines the potential contributions of modern language models to theoretical linguistics. Despite their focus on engineering goals, these models' ability to acquire sophisticated linguistic knowledge from mere exposure to data warrants a careful reassessment of their relevance to linguistic theory. I review a growing body of empirical evidence suggesting that language models can learn hierarchical syntactic structure and exhibit sensitivity to various linguistic phenomena, even when trained on developmentally plausible amounts of data. While the competence/performance distinction has been invoked to dismiss the relevance of such models to linguistic theory, I argue that this assessment may be premature. By carefully controlling learning conditions and making use of causal intervention methods, experiments with language models can potentially constrain hypotheses about language acquisition and competence. I conclude that closer collaboration between theoretical linguists and computational researchers could yield valuable insights, particularly in advancing debates about linguistic nativism.

2025

Associationist Theories of Thought

Constitutive Self-Consciousness

Interventionist Methods for Interpreting Deep Neural Networks

Normative Conflicts and Shallow AI Alignment

Transformers

The Vector Grounding Problem

LLMs as Models for Analogical Reasoning

How Do Transformers Learn Variable Binding in Symbolic Programs?

2024

Drug-Induced Body Disownership

A Philosophical Introduction to Language Models – Part II: The Way Forward

A Philosophical Introduction to Language Models – Part I: Continuity With Classic Debates

Philosophy of Cognitive Science in the Age of Deep Learning

Decoding In-Context Learning: Neuroscience-inspired Analysis of Representations in Large Language Models

2023

The Alignment Problem in Context

Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models

2022

Adversarial Attacks on Image Generation With Made-Up Words

Deep Learning and Synthetic Media

Drug-Induced Alterations of Bodily Awareness

Selfless Memories

2020

The Multi-Dimensional Approach to Drug-Induced States: A Commentary on Bayne and Carter's "Dimensions of Consciousness and the Psychedelic State"

The Varieties of Selflessness

Radical Disruptions of Self-Consciousness

Self in Mind: A Pluralist Account of Self-Consciousness

2019

Are There Degrees of Self-Consciousness?

Neural Correlates of the DMT Experience Assessed with Multivariate EEG

2018

Psychedelics, Meditation and Self-Consciousness

2017

Looking For The Self: Phenomenology, Neurophysiology and Philosophical Significance of Drug-induced Ego Dissolution

2016

Ingarden's Combinatorial Analysis of The Realism-Idealism Controversy