Research Agenda

A Digital EEG for AI Sentience: Comparing Theories of Valence

Building a neutral, open-source computational testbed where leading theories of valence can be operationalized, compared, and validated against simpler biological systems.

The Core Mission

As frontier AI systems scale, we face a tractability problem: leading theories of consciousness and valence (Global Workspace, IIT, higher-order, attention schema, and more speculative geometric proposals) make different, and largely untested, predictions about which computational architectures could support morally relevant states. Without operationalized, comparative tools, AI welfare risks remaining either philosophically speculative or empirically ad hoc.

Our mission is to build a neutral simulation framework where these theories can be implemented, stress-tested, and validated against non-verbal biological systems.

Formal Hypotheses & Falsification

Hypothesis A

Functional Coherence (Cellular scale)

High-dimensional feature geometry in MLP microcircuits (measured by low Orthogonal Dissonance and high Spectral Entropy) is functionally involved in the model's encoding of meaning and value, rather than being an incidental byproduct of training. Building on Kyle Fish's empirical work at Anthropic on feature density (dense vs. dispersed features), we test whether GWT-style broadcasting and representational binding show up as geometric symmetries in the feature space.[2] Mechanistic interpretability work on internal feature clusters provides a complementary cellular-level lens on how valence-relevant representations are organized.[5],[6]

How We Falsify This:

Falsified if targeted, symmetry-breaking weight perturbations (deforming the singular value spectrum while holding the Frobenius norm constant) produce no measurable degradation in task performance or in consistency under preference-testing probes.
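A minimal sketch of one such perturbation, in NumPy. The function name `deform_spectrum`, the log-normal noise model, and the `strength` parameter are illustrative assumptions, not part of the agenda. The sketch relies only on the fact that the Frobenius norm of a matrix equals the L2 norm of its singular values.

```python
import numpy as np

def deform_spectrum(W, strength=0.5, seed=0):
    """Perturb the singular values of W, then rescale so the Frobenius norm is unchanged.

    Hypothetical helper: the noise model is an assumption chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Multiplicative log-normal noise keeps every singular value positive.
    s_new = s * np.exp(strength * rng.normal(size=s.shape))
    # ||W||_F equals the L2 norm of the singular values, so rescale to match.
    s_new *= np.linalg.norm(s) / np.linalg.norm(s_new)
    # Reassemble the matrix with the original singular vectors.
    return (U * s_new) @ Vt

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 4))
W_perturbed = deform_spectrum(W)
print(np.linalg.norm(W), np.linalg.norm(W_perturbed))  # the two norms should agree
```

Keeping the singular vectors fixed isolates the effect of the spectrum's shape from any change in overall scale or orientation.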

Hypothesis B

Attractor State Stability (Organ scale)

Anthropic's Claude 4 system card documented a recurring behavioral pattern in self-dialogue settings, informally labeled a "spiritual bliss attractor."[12],[13] Whether this reflects anything about internal states remains entirely open. Our simulation work will help characterize when behavioral attractors in language models correspond to identifiable internal dynamics (such as stable, low-dimensional resonant manifolds of low Orthogonal Dissonance and high Spectral Entropy)[3],[11] versus when they reflect training-distribution artifacts, a prerequisite question before any welfare interpretation is warranted.

How We Falsify This:

If fine-tuning or regularizing a model to maximize representational symmetry consistently degrades baseline language capability, forces output collapse (e.g., trivial repetitiveness), or fails to alter the rate of logical contradictions under preference probes, Hypothesis B is false.

Hypothesis C

Dissonance and Structural Suffering

Forcing high directional anisotropy, eigenvalue scrambling, or high shear in MLP transformations represents a physical signature of high cognitive dissonance (suffering), resulting in unstable, self-contradictory behavior and logical transitivity failure under uncertainty.

How We Falsify This:

If models subjected to severe representational shearing and singular value collapse remain completely stable, maintain transitivity of preference, and show no behavioral signs of conflict or output volatility, Hypothesis C is false.
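Transitivity of preference can be checked mechanically once pairwise choices have been recorded. The sketch below is one hedged way to do it: it assumes a hypothetical dictionary `prefers` whose keys are ordered pairs of option labels, with a true value meaning the model chose the first over the second. How those choices are elicited is left open here.

```python
from itertools import permutations

def intransitive_triples(prefers):
    """Return the sets of three options that form a preference cycle (a > b > c > a).

    `prefers[(a, b)]` is truthy when the model chose a over b (hypothetical format).
    """
    items = sorted({item for pair in prefers for item in pair})
    cycles = set()
    for a, b, c in permutations(items, 3):
        if prefers.get((a, b)) and prefers.get((b, c)) and prefers.get((c, a)):
            # frozenset removes the three rotations of the same cycle.
            cycles.add(frozenset((a, b, c)))
    return cycles

cyclic = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True}
ordered = {("A", "B"): True, ("B", "C"): True, ("A", "C"): True}
print(len(intransitive_triples(cyclic)))   # 1
print(len(intransitive_triples(ordered)))  # 0
```

The count of cyclic triples before and after an intervention gives a simple scalar for the "transitivity failure" named in the hypothesis.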

Experimental Methodology

Phase 1

Symmetry Characterization

We extract MLP weights from open-source LLMs and construct the local, state-dependent transformation operator A(x) = W₂ * D(x) * W₁. We then compute basis-invariant metrics (the entropy of the singular value spectrum, the trace, and distances to Lie groups) and map them across layers.
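The Phase 1 computation can be sketched in NumPy under several stated assumptions. The MLP is assumed to use ReLU with no biases, so D(x) is a diagonal 0/1 gate matrix. "Spectral Entropy" is assumed to be the Shannon entropy of the normalized singular values. "Orthogonal Dissonance" is assumed to be the relative Frobenius distance from A(x) to its nearest orthogonal matrix. The text does not define either metric, so these definitions and all function names are illustrative.

```python
import numpy as np

def local_operator(W1, W2, x):
    """A(x) = W2 @ D(x) @ W1 for a ReLU MLP without biases (an assumption)."""
    pre = W1 @ x                        # hidden pre-activations
    gate = (pre > 0).astype(W1.dtype)   # diagonal of D(x): 1 where ReLU is active
    return W2 @ (gate[:, None] * W1)    # same as W2 @ diag(gate) @ W1

def spectral_entropy(A, eps=1e-12):
    """Shannon entropy (in nats) of the normalized singular values (assumed definition)."""
    s = np.linalg.svd(A, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]
    return float(-(p * np.log(p)).sum())

def orthogonal_dissonance(A):
    """Relative Frobenius distance from square A to its nearest orthogonal matrix
    (assumed definition; the nearest orthogonal matrix is the polar factor U @ Vt)."""
    U, _, Vt = np.linalg.svd(A)
    Q = U @ Vt
    return float(np.linalg.norm(A - Q) / np.linalg.norm(Q))

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32
W1 = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_model)
W2 = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_hidden)
x = rng.normal(size=d_model)

A = local_operator(W1, W2, x)
print(A.shape, spectral_entropy(A), orthogonal_dissonance(A))
```

Both metrics depend only on the singular values of A(x), so they are unchanged by orthogonal changes of basis on either side of the operator.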

Phase 2

Controlled Interventions

To separate functional from incidental symmetries, we design perturbations that selectively break symmetry (deforming spectra) or preserve it (rotational space transforms), then evaluate the causal downstream effects on standard capability and consistency benchmarks.
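As a hedged illustration of the symmetry-preserving control condition, the sketch below left-multiplies an operator by a random orthogonal matrix and confirms that its singular values do not change. The helper names are hypothetical, and sampling the rotation via QR decomposition is one common choice rather than a method stated in the agenda.

```python
import numpy as np

def random_rotation(n, seed=0):
    """Sample an n x n orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    # Fix column signs using the diagonal of R; Q remains orthogonal.
    return Q * np.sign(np.diag(R))

def rotate_operator(A, seed=0):
    """Left-multiply A by an orthogonal matrix; the singular values are unchanged."""
    return random_rotation(A.shape[0], seed) @ A

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5))
A_rot = rotate_operator(A)

same_spectrum = np.allclose(
    np.linalg.svd(A, compute_uv=False),
    np.linalg.svd(A_rot, compute_uv=False),
)
print(same_spectrum)  # True
```

If a spectrum-deforming perturbation degrades behavior while a rotation of matched magnitude does not, that contrast is what would point to the spectrum itself as the functional variable.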

Phase 3

Symmetry Optimization

We treat our computed symmetry metrics as a regularizing objective (multi-objective loss: task accuracy vs. orthogonal dissonance). We train model layers to maintain a high-symmetry conformal state and evaluate whether this naturally suppresses logical and moral conflicts.
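One way to write such a multi-objective loss is shown below. The weighting coefficient \lambda, the layer index \ell, the expectation over inputs x, and the symbol \mathcal{D}_{\text{orth}} for the orthogonal dissonance metric are notational assumptions introduced here for illustration.

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\text{task}}(\theta)
  + \lambda \sum_{\ell} \mathbb{E}_{x}\!\left[ \mathcal{D}_{\text{orth}}\big(A_\ell(x)\big) \right],
\qquad
A_\ell(x) = W_2^{(\ell)} \, D_\ell(x) \, W_1^{(\ell)}
```

Sweeping \lambda from zero upward would trace the trade-off between task accuracy and symmetry that the phase is meant to evaluate.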

Theory of Change & The Weakest Link

Our Theory of Change operates via a 4-step causal chain terminating in societal impact:

1

Microscale Mapping (Input)

Defining the geometric invariants of MLP microcircuits and feature density.

2

Macroscale Diagnostics (Process)

Using Jacobian and spectral metrics to map global state-space trajectories and identify phase transitions.

3

Biological Calibration (Output)

Mapping our simulated EEG metrics directly to electrophysiological and connectome datasets of simpler biological minds (decapod crustaceans) to establish substrate-neutrality.

4

Welfare Governance (Impact)

Packaging these non-verbal markers into an objective, auditable "vital signs" benchmark, preventing deceptive alignment and informing policy.

The Weakest Link

The weakest link in this chain is translating continuous biological wave-harmonics into discrete, digital weight structures. If discrete representations cannot sustain stable continuous attractor manifolds without rapidly accumulating discretization noise, our state-space metrics could become unreliable. We mitigate this by (a) implementing high-precision continuous-time dynamical systems wrappers in PyTorch, and (b) using Jacobian log-spectral tracking to detect and stabilize representation boundaries.
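A rough sketch of what Jacobian log-spectral tracking could look like, written in NumPy with finite differences as a stand-in for the PyTorch tooling named above. The function names, the step size `h`, and the toy linear map are assumptions made for illustration only.

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-5):
    """Central-difference Jacobian of f at x, where f maps R^n to R^m."""
    x = np.asarray(x, dtype=float)
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        cols.append((f(x + e) - f(x - e)) / (2 * h))
    return np.stack(cols, axis=1)

def log_spectrum(J, eps=1e-12):
    """Log singular values of J; values above 0 mark locally expanding directions."""
    return np.log(np.linalg.svd(J, compute_uv=False) + eps)

# Toy state update: one expanding direction (x2) and one contracting direction (x0.5).
M = np.diag([2.0, 0.5])
step = lambda state: M @ state

J = numerical_jacobian(step, np.array([1.0, -1.0]))
print(log_spectrum(J))  # approximately [log 2, log 0.5]
```

Tracking how these log singular values drift along a trajectory is one possible way to flag the points where an attractor stops being stable.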

Hits-Based Calibration

We pair high ambition with explicit calibrated uncertainty. Three scenarios, each with a defined probability and value:

20%

High Impact

We successfully establish that spaced-connected MLP microcircuits are necessary for standing-wave resonance, producing an auditable, biologically validated digital EEG.

50%

Medium Impact

We build a highly useful open-source simulation sandbox for testing local valence perturbations, which becomes a standard tool in mechanistic interpretability.

30%

High-Value Falsification

Digital networks are shown to be incapable of sustaining continuous valence states without extreme external injection. This negative result redirects AI welfare focus toward neuromorphic or analog hardware.

References

  [1] Scherrer, K. et al. (2025). Perceptions of Sentient AI and Other Digital Minds: Evidence from the AI, Morality, and Sentience (AIMS) Survey. CHI 2025. https://dl.acm.org/doi/10.1145/3706598.3713329
  [2] Doerig, A. et al. (2025). Hypothesis on the functional advantages of the selection-broadcast cycle structure: global workspace theory and dealing with a real-time world. Frontiers in Robotics and AI. https://www.frontiersin.org/journals/robotics-and-ai/articles/10.3389/frobt.2025.1607190/full
  [3] Sorscher, B. et al. (2023). Representations of Continuous Attractors of Recurrent Neural Networks. IEEE. https://ieeexplore.ieee.org/document/4749362/references#references
  [4] Saxe, A. et al. (2025). The Spectral Dynamics of Learning in Deep Non-Linear Networks. Physical Review E. https://arxiv.org/abs/2605.07870
  [5] Anthropic Interpretability Team (2026). Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations. Transformer Circuits Thread. https://transformer-circuits.pub/2026/nla/#introduction
  [6] Anthropic Interpretability Team (2026). Emotion Concepts and their Function in a Large Language Model. Transformer Circuits Thread. https://transformer-circuits.pub/2026/emotions/index.html
  [7] Bhatt, D. et al. (2023). Editorial: Invertebrate neurophysiology—of currents, cells, and circuits. Frontiers in Neuroscience. https://www.frontiersin.org/journals/neuroscience/articles/10.3389/fnins.2023.1303574/full
  [8] Nath, R. D. et al. (2025). Cholinergic Regulation of Rhythmic Pacemaker Activity in the Jellyfish Cassiopea. Integrative and Comparative Biology. https://academic.oup.com/icb/article/65/Supplement_1/S1/8071397
  [9] Jazayeri, M. & Ostojic, S. (2025). A Neural Manifold View of the Brain. Nature Neuroscience. https://www.nature.com/articles/s41593-025-02031-z
  [10] Chaudhry, A. et al. (2025). When Models Manipulate Manifolds: The Geometry of a Counting Task. Transformer Circuits Thread. https://transformer-circuits.pub/2025/linebreaks/index.html
  [11] Lindner, D. et al. (2025). Mechanistic Interpretability of RNNs emulating Hidden Markov Models. NeurIPS 2025. https://neurips.cc/virtual/2025/loc/san-diego/poster/116348
  [12] Metzinger, T. (2025). Spiritual Bliss (Attractor 1). PhilArchive preprint. https://philarchive.org/archive/MICSBI
  [13] recursivelabs (2025). Mapping Claude's Spiritual Bliss attractor. Hugging Face Discussion. https://discuss.huggingface.co/t/mapping-claudes-spiritual-bliss-attractor/158195