Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity

Shogo Noguchi1,*,a Taketo Akama1,* Tai Nakamura1,a Shun Minamikawa1 Natalia Polouliakh1

1Sony Computer Science Laboratories, Inc., Tokyo, Japan

*Equal contribution, corresponding authors: noguchishogo1@gmail.com
a Work conducted when working as a research assistant

EEG representation learning should be guided by neural encoding.

Conceptual overview of PredANN++
Concept. PredANN++ aligns EEG with neurobiologically distinct teacher signals—acoustic structure and predictive structure—and then exploits their complementarity through ensembling.

Abstract

During music listening, cortical activity encodes both acoustic and expectation-related information. Prior work has shown that ANN representations resemble cortical representations and can serve as supervisory signals for EEG recognition. Here we show that distinguishing acoustic and expectation-related ANN representations as teacher targets improves EEG-based music identification. Models pretrained to predict either representation outperform non-pretrained baselines, and combining them yields complementary gains that exceed strong seed ensembles formed by varying random initializations. These findings show that teacher representation type shapes downstream performance and that representation learning can be guided by neural encoding. This work points toward advances in predictive music cognition and neural decoding. Our expectation representation, computed directly from raw signals without manual labels, reflects predictive structure beyond onset or pitch, enabling investigation of multilayer predictive encoding across diverse stimuli. Its scalability to large, diverse datasets further suggests potential for developing general-purpose EEG models grounded in cortical encoding principles.

Method Overview

PredANN++ neural network architecture
Architecture. PredANN++ pretrains an EEG encoder with masked prediction of discretized teacher tokens while jointly stabilizing learning with an auxiliary Song ID objective.
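The objective described above can be sketched as a masked token cross-entropy plus an auxiliary Song ID cross-entropy. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the uniformly random frame masking, and the `aux_weight` coefficient are all assumptions made for the example. Only the 50% masking ratio is taken from the checkpoint table on this page.

```python
import math
import random

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -log_softmax(logits)[target]

def sample_mask(num_frames, mask_ratio, rng):
    """Choose which frames are hidden (assumption: uniform random frames)."""
    num_masked = max(1, round(num_frames * mask_ratio))
    masked = set(rng.sample(range(num_frames), num_masked))
    return [i in masked for i in range(num_frames)]

def pretraining_loss(token_logits, teacher_tokens, mask,
                     song_logits, song_id, aux_weight=1.0):
    """Masked teacher-token loss plus a weighted Song ID loss.

    aux_weight is a hypothetical coefficient; the page does not state one.
    """
    masked_losses = [
        cross_entropy(logits, token)
        for logits, token, is_masked in zip(token_logits, teacher_tokens, mask)
        if is_masked
    ]
    token_loss = sum(masked_losses) / len(masked_losses)
    song_loss = cross_entropy(song_logits, song_id)
    return token_loss + aux_weight * song_loss

# Toy example: 8 frames, 128 teacher tokens, 10 songs, untrained (uniform) logits.
rng = random.Random(0)
mask = sample_mask(num_frames=8, mask_ratio=0.5, rng=rng)
token_logits = [[0.0] * 128 for _ in range(8)]
loss = pretraining_loss(token_logits, [3] * 8, mask, [0.0] * 10, song_id=2)
```

With uniform logits the loss equals log(128) + log(10), the chance-level value for a 128-token vocabulary and 10 songs, which is a convenient sanity check before training.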
ANN representation calculation and discretization
Teacher construction. Audio is converted into complementary targets: MuQ acoustic embeddings and MusicGen-derived Surprisal / Entropy, then discretized into temporally aligned supervision for EEG pretraining.
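As a rough illustration of the two expectation-related quantities, the sketch below computes surprisal and entropy from one vector of next-token logits. It assumes the standard information-theoretic definitions (in nats); how PredANN++ aggregates these values across MusicGen codebooks or frames is not stated here, so that step is left out. The function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def surprisal_and_entropy(logits, observed_token):
    """Return (surprisal, entropy) in nats for one prediction step.

    surprisal = -log p(observed token)
    entropy   = -sum_k p(k) * log p(k)
    """
    log_probs = log_softmax(logits)
    surprisal = -log_probs[observed_token]
    entropy = -sum(math.exp(lp) * lp for lp in log_probs)
    return surprisal, entropy

# A confident, correct prediction: low surprisal and low entropy.
confident = surprisal_and_entropy([8.0, 0.0, 0.0, 0.0], observed_token=0)

# A flat prediction over 4 tokens: both values equal log(4).
uncertain = surprisal_and_entropy([0.0, 0.0, 0.0, 0.0], observed_token=0)
```

Entropy depends only on the predicted distribution, while surprisal also depends on which token actually occurred, which is one reason the two signals can carry different information.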

Results

A compact summary of the main quantitative findings and the key figures supporting the paper’s claims.

| Category | Model / Ensemble | Accuracy | Notes |
| --- | --- | --- | --- |
| Baseline (Full-scratch) | Seed 0 / 1 / 42 (mean) | 0.823 (0.832 / 0.809 / 0.827) | Transformer EEG encoder trained from scratch. |
| Single model (teacher pretrain) | Acoustic (MuQ) | 0.859 | Best single model. |
| Single model (teacher pretrain) | Surprisal (ctx16) | 0.855 | Expectation-related representation. |
| Single model (teacher pretrain) | Entropy (ctx16) | 0.850 | Expectation-related representation. |
| 2-model ensemble | Acoustic + Surprisal | 0.881 | Probability averaging. |
| 2-model ensemble | Acoustic + Entropy | 0.879 | Probability averaging. |
| 2-model ensemble | Surprisal + Entropy | 0.880 | Probability averaging. |
| 3-model ensemble | Acoustic + Surprisal + Entropy | 0.887 | Best representation-diversity ensemble. |
| Seed ensemble | Seeds 0 + 1 + 42 | 0.878 | Initialization diversity baseline. |

Experiments were conducted on the NMED-T dataset.

Single model comparison
Teacher choice matters. Every teacher-guided model beats full-scratch training, and MuQ gives the strongest single-model performance.
Representation ensemble vs seed ensemble
Diversity should be representational, not random. Ensembling models trained on distinct teacher representations outperforms equally sized multi-seed baselines.
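The ensembles in the results table are formed by probability averaging. A minimal sketch of that operation follows, assuming each model outputs a probability vector over Song IDs for the same EEG segment. The function names and the toy numbers are illustrative only and are not results from the paper.

```python
def average_probabilities(model_probs):
    """Average class probabilities across models (one list per model)."""
    num_models = len(model_probs)
    num_classes = len(model_probs[0])
    return [
        sum(probs[c] for probs in model_probs) / num_models
        for c in range(num_classes)
    ]

def predict(model_probs):
    """Return the class index with the highest averaged probability."""
    averaged = average_probabilities(model_probs)
    return max(range(len(averaged)), key=averaged.__getitem__)

# Toy 3-class example with made-up probabilities for one EEG segment.
acoustic = [0.5, 0.4, 0.1]   # alone, this model would pick class 0
surprisal = [0.2, 0.7, 0.1]
entropy = [0.3, 0.6, 0.1]

ensemble_choice = predict([acoustic, surprisal, entropy])
```

In this toy case the acoustic model alone picks class 0, but the averaged distribution favors class 1. Averaging helps most when the models make different errors, which is consistent with the page's point about representational diversity.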

Surprisal / Entropy Visualization

Visualization of the Surprisal / Entropy Q128 (128-bin quantized) sequences. The page loads three demo tracks and synchronized Surprisal / Entropy data with the 16-second context setting used in the main experiments.

Timeline sync is aligned to segment_start_s, with a hard upper bound of 240 sec (shorter tracks end at their natural duration).

Discretization method: continuous Surprisal and Entropy sequences were pooled across all songs (NMED-T 10 songs + 3 demo tracks), then feature-wise 128-bin equal-frequency quantile binning was applied. Each frame value was converted to a discrete label in the range 0–127 using the resulting bin boundaries.
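The binning step can be sketched as follows. This is a simplified, standard-library version of equal-frequency (quantile) binning, applied to one feature at a time (Surprisal and Entropy would each get their own boundaries). The function names and the exact boundary convention are assumptions; the authors' code may treat ties and edge values differently.

```python
import bisect

def quantile_boundaries(pooled_values, num_bins=128):
    """Compute num_bins - 1 interior boundaries from values pooled over all songs."""
    ordered = sorted(pooled_values)
    n = len(ordered)
    return [ordered[min(n - 1, (k * n) // num_bins)] for k in range(1, num_bins)]

def discretize(value, boundaries):
    """Map a continuous value to a label in 0 .. len(boundaries)."""
    return bisect.bisect_right(boundaries, value)

# Toy data: 1280 distinct values, so each of the 128 bins receives 10 of them.
pooled = [float(v) for v in range(1280)]
boundaries = quantile_boundaries(pooled, num_bins=128)
labels = [discretize(v, boundaries) for v in pooled]
```

Because the boundaries come from the pooled distribution, every label is used about equally often across the whole corpus, even though an individual song may concentrate in a subset of bins.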

Why are surprisal and entropy elevated at the start of each track?

The elevated surprisal and entropy values at the beginning of tracks arise from the context-window design. For each segment, logits are computed using a fixed 16-second lookback window. At song onset, this window extends beyond the start of the audio, so the missing left context is padded with special_token_id, a reserved sentinel token used for invalid or masked positions. As a result, early segments are conditioned partly on synthetic padding rather than real audio history, making the model's predictions broader and less confident. This increases entropy, which reflects distributional uncertainty, and also increases surprisal, because the true tokens receive lower probability. The effect lasts for about 13 seconds (window_sec − segment_sec), because segments starting before 13 s still require left padding. Thus, the early elevation is a systematic artifact of the padding scheme rather than a property of the music itself.
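The 13-second figure follows from simple window arithmetic, sketched below. The layout is an assumption inferred from the explanation above: the 16-second window is taken to end at the end of the current segment, and segment_sec = 3 is inferred from window_sec − segment_sec = 13 rather than stated directly. The function name is hypothetical.

```python
def left_padding_sec(segment_start_s, window_sec=16.0, segment_sec=3.0):
    """Seconds of the context window that fall before the start of the audio.

    Assumes the window ends at segment_start_s + segment_sec and spans
    window_sec seconds backwards from there.
    """
    window_start_s = segment_start_s + segment_sec - window_sec
    return max(0.0, -window_start_s)

# Padding shrinks linearly and disappears once the segment starts at 13 s.
padding_by_start = {start: left_padding_sec(start) for start in (0, 5, 13, 20)}
```

Under these assumptions, a segment starting at 0 s has 13 s of padded context, one starting at 5 s has 8 s, and segments starting at 13 s or later have none.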

The playback position slider jumps to any point in the song. Drag the red cursor for fine control. Use the zoom and timeline scroll sliders to move through the Surprisal/Entropy curve timeline.
Demo tracks are excerpted from the MTG-Jamendo Dataset. Audio files follow each track's Creative Commons license. Detailed attribution is listed in audio_licenses.txt.

Demo (Gradio)

This section shows a demo video of the demo.py Gradio interface running. For instructions on how to launch the demo locally and run EEG → Song-ID inference, see the Quick Demo (Gradio UI) section in the GitHub README.

Pretrained Checkpoints

| Model | Target representation | Context | Pretraining | Finetuning | Link |
| --- | --- | --- | --- | --- | --- |
| PredANN++ Acoustic | MuQ-Embedding | 16s | 10k epochs (50% masking) | 3.5k epochs | Shogo-Noguchi/PredANNpp-Acoustic |
| PredANN++ Entropy | MusicGen Entropy | 16s | 10k epochs (50% masking) | 3.5k epochs | Shogo-Noguchi/PredANNpp-Entropy-ctx16 |
| PredANN++ Surprisal | MusicGen Surprisal | 16s | 10k epochs (50% masking) | 3.5k epochs | Shogo-Noguchi/PredANNpp-Surprisal-ctx16 |

How to Cite

BibTeX
@misc{noguchi2026expectationacousticneuralnetwork,
  title={Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity},
  author={Shogo Noguchi and Taketo Akama and Tai Nakamura and Shun Minamikawa and Natalia Polouliakh},
  year={2026},
  eprint={2603.03190},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.03190},
}