USC @ ICASSP 2026

Venice Tang | May 13, 2026 

USC researchers brought 13 papers to ICASSP this year, highlighted by a plenary talk from Shrikanth Narayanan

Shrikanth Narayanan delivering a plenary speech at ICASSP 2026 (Photo Credit: Kai-Wei Chang)

Last week, USC researchers had a strong presence at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), with 13 papers accepted and USC University Professor and Vice President Shrikanth (Shri) Narayanan selected as one of the conference’s plenary speakers.

Held May 4-8 in Barcelona, Spain, ICASSP is a global forum shaping the future of signal-driven machine intelligence and showcasing the latest findings in the signal processing field.

This year, USC had 13 papers accepted at the conference from faculty and students across USC Viterbi’s Ming Hsieh Department of Electrical and Computer Engineering and the USC Mark and Mary Stevens School of Computing and Artificial Intelligence. These include eight papers from the Signal Analysis and Interpretation Laboratory (SAIL) led by Narayanan, University Professor and USC’s inaugural vice president for presidential initiatives; two led by Antonio Ortega, Dean’s Professor of Electrical and Computer Engineering and Professor of Electrical and Computer Engineering; and three led by Urbashi Mitra and her students. Mitra is the Gordon S. Marshall Chair in Engineering and Professor of Electrical and Computer Engineering and Computer Science, with joint appointments at the Ming Hsieh Department of Electrical and Computer Engineering and the Thomas Lord Department of Computer Science.

USC’s papers focused on multimodal signal intelligence for human communication, mental health biomarkers, trustworthy speech technologies, causal inference with interventions for non-linear models, performance bounds for wireless channel estimation and compression for learning.

“Pairwise Distortion Distribution For Compression And Quantization,” a paper co-authored by Jeongmin Chae, a PhD student advised by Mitra, explored theoretical foundations for compression and quantization in learning systems. The paper grew out of her Summer 2025 internship at Nokia Bell Labs, where Chae placed third among more than 150 interns in Nokia Bell Labs’ Outstanding Intern Competition.

ICASSP 2026 in Barcelona drew a record-breaking 11,000 research submissions and more than 4,000 attendees. The conference is known for its academic rigor, and this year’s acceptance rate dropped to 40% from 45% last year, making it more selective.

Narayanan was invited to deliver a plenary talk on May 8, closing out the conference. His talk, “Inside Out and Outside In: Illuminating Human Communication Through Multimodal Signals in Health and Breakdown,” explored the complex interplay of neurocognitive, physiological and behavioral processes that define human interaction.

In his plenary talk, Narayanan highlighted how multimodal signals, including speech, facial expressions, physiology and neural activity, can make internal mental states observable through external behaviors. Framing communication as a “Human Communication Stack,” he described how human interaction is shaped by both biology and environmental influences, while emphasizing the coordinated neurocognitive, sensory and social processes involved in communication.

Narayanan also showcased recent advances from USC studies in real-time MRI speech imaging, articulatory modeling and trustworthy AI systems. His presentation featured new datasets and frameworks developed at USC for speech articulation analysis, mental health biomarkers and inclusive speech technologies, including research applications in autism spectrum disorder, depression and cancer rehabilitation. 

“Human-centered signal processing” has always been at the core of Narayanan’s work, and throughout the talk he emphasized the importance of building human-centered machine intelligence systems that are inclusive, privacy-conscious and designed to support human health and communication.

USC-Affiliated Papers

(USC authors bolded)

Learning to Intervene: Optimized Soft Intervention Selection for Causal Discovery

Chen Peng, Sambit Mishra, Urbashi Mitra

Abstract: Understanding causal structure is fundamental for scientific discovery, yet most existing methods rely on observational data or passively use available interventional datasets. While these approaches have advanced the field, they do not actively target the most uncertain or poorly represented parts of the structure. We address this challenge by optimizing intervention design under soft interventions and integrating it into causal subgraph learning. A key ingredient is a utility function that balances reconstruction accuracy, exploration, and diversity: it highlights poorly estimated subgraphs, revisits under-explored distributions, and promotes diverse coverage to distinguish correlation from causation. Experiments show that optimized intervention selection enables more efficient and reliable structure learning than purely observational approaches and random-intervention baselines, underscoring the importance of principled intervention design in advancing causal discovery.

Cramér-Rao Bounds on Sparse-Diffuse Channel Estimation

Lei Lyu, Maxime Ferreira Da Costa, Urbashi Mitra

Abstract: Lower bounds on the estimation of a hybrid sparse-diffuse channel and parameters of the channel are provided. The Hybrid Atomic-Least-Squares (HALS) algorithm was designed to jointly estimate the sparse and diffuse components with a combined atomic and ℓ2 regularization for this hybrid, mixed-channel model. The Cramér-Rao Bound analysis in this paper focuses on the estimation of the channel parameters, resulting in a bound on the aggregate channel. Numerical results via simulations on synthetic data validate the efficacy of the HALS estimation strategy and show the improved predictive ability of the CRB analysis for the performance of HALS versus previously considered bounds.

Pairwise Distortion Distribution For Compression And Quantization

Carl Nuzman, Jeongmin Chae, Kursat Mestav, Urbashi Mitra

Abstract: Choosing the latent dimension is a critical challenge in data compression, where conventional hyperparameter optimization is costly and ad hoc. Intrinsic dimension provides a good alternative, which connects information-theoretic notions such as Renyi information dimension (RID) to rate-distortion dimension and random codebook compression. In this work, we introduce the pairwise distortion distribution (PDD) and its asymptotic limit, pairwise dimension (PD), which generalize the correlation integral to general distortion measures. We show that PD bounds RID, and PDD provides a lower bound for random codebook quantization performance, which can provide practical guidance for compression, quantization, and dimensionality reduction applications.

Wrapper-Aware Rate-Distortion Optimization In Feature Coding For Machines

Samuel Fernandez Menduina, Hyomin Choi, Fabien Racape, Eduardo Pavez, Antonio Ortega

Abstract: Feature coding for machines (FCM) is a lossy compression paradigm for split-inference. The transmitter encodes the outputs of the first part of a neural network before sending them to the receiver for completing the inference. Practical FCM methods “sandwich” a traditional codec between pre- and post-processing neural networks, called wrappers, to make features easier to compress using video codecs. Since traditional codecs are non-differentiable, the wrappers are trained using a proxy codec, which is later replaced by a standard codec after training. These codecs perform rate-distortion optimization (RDO) based on the sum of squared errors (SSE). Because the RDO does not consider the post-processing wrapper, the inner codec can invest bits in preserving information that the post-processing later discards. In this paper, we modify the bit-allocation in the inner codec via a wrapper-aware weighted SSE metric. To make wrapper-aware RDO (WA-RDO) practical for FCM, we propose: 1) temporal reuse of weights across a group of pictures and 2) fixed, architecture- and task-dependent weights trained offline. Under MPEG test conditions, our methods implemented on HEVC match the VVC-based FCM state-of-the-art, effectively bridging a codec generation gap with minimal runtime overhead relative to SSE-RDO HEVC.

Digraph Signal Processing Via Polar Decomposition

Feng Ji, Antonio Ortega

Abstract: We develop a signal processing framework for directed graphs. Unlike in the undirected setting, the shift operator of a directed graph generally lacks an orthogonal eigenbasis, making it difficult to define a Fourier transform with interpretable components. Leveraging the polar decomposition, we obtain two complementary eigendecompositions and introduce a graph Fourier transform that jointly encodes spectral responses in the corresponding eigenbases. This leads to a natural definition of convolution, which is lossless (recovering the original graph operator), and reduces to classical graph signal processing in the undirected case. Numerical experiments demonstrate applications of the framework to denoising.

voice2mode: Phonation Mode Classification in Singing using Self-Supervised Speech Models

Aju Ani Justus, Ruchit Agrawal, Sudarsana Reddy Kadiri, Shrikanth Narayanan

Abstract: We present voice2mode, a method for classification of four singing phonation modes (breathy, neutral (modal), flow, and pressed) using embeddings extracted from large self-supervised speech models. Prior work on singing phonation has relied on handcrafted signal features or task-specific neural nets; this work evaluates the transferability of speech foundation models to singing phonation classification. voice2mode extracts layer-wise representations from HuBERT and two wav2vec2 variants, applies global temporal pooling, and classifies the pooled embeddings with lightweight classifiers (SVM, XGBoost). Experiments on a publicly available soprano dataset (763 sustained vowel recordings, four labels) show that foundation-model features substantially outperform conventional spectral baselines (spectrogram, mel-spectrogram, MFCC). HuBERT embeddings obtained from early layers yield the best result (≈95.7% accuracy with SVM), an absolute improvement of ≈12–15% over the best traditional baseline. We also show layer-wise behaviour: lower layers, which retain acoustic/phonetic detail, are more effective than top layers specialized for Automatic Speech Recognition (ASR).

Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis 

Thanathai Lertpetchpun, Yoonjeong Lee, Thanapat Trachu, Jihwan Lee, Tiantian Feng, Dani Byrd, Shrikanth Narayanan

Abstract: Many spoken languages, including English, exhibit wide variation in dialects and accents, making accent control an important capability for flexible text-to-speech (TTS) models. Current TTS systems typically generate accented speech by conditioning on speaker embeddings associated with specific accents. While effective, this approach offers limited interpretability and controllability, as embeddings also encode traits such as timbre and emotion. In this study, we analyze the interaction between speaker embeddings and linguistically motivated phonological rules in accented speech synthesis. Using American and British English as a case study, we implement rules for flapping, rhoticity, and vowel correspondences. We propose the phoneme shift rate (PSR), a novel metric quantifying how strongly embeddings preserve or override rule-based transformations. Experiments show that combining rules with embeddings yields more authentic accents, while embeddings can attenuate or overwrite rules, revealing entanglement between accent and speaker identity. Our findings highlight rules as a lever for accent control and a framework for evaluating disentanglement in speech generation.

A Point Process Model Of Skin Conductance Responses In A Stroop Task For Predicting Depression And Suicidal Ideation

Kleanthis Avramidis, Myzelle Hughes, Idan A Blank, Dani Byrd, Assal Habibi, Takfarinas Medani, Richard M Leahy, Shrikanth Narayanan

Abstract: Accurate identification of mental health biomarkers can enable earlier detection and objective assessment of compromised mental well-being. In this study, we analyze electrodermal activity recorded during an Emotional Stroop task to capture sympathetic arousal dynamics associated with depression and suicidal ideation. We model the timing of skin conductance responses as a point process whose conditional intensity is modulated by task-based covariates, including stimulus valence, reaction time, and response accuracy. The resulting subject-specific parameter vector serves as input to a machine learning classifier for distinguishing individuals with and without depression. Our results show that the model parameters encode meaningful physiological differences associated with depressive symptomatology and yield superior classification performance compared to conventional feature extraction methods.

Encoding Emotion Through Self-Supervised Eye Movement Reconstruction

Marcus Ma, Jordan Prescott, Emily Zhou, Tiantian Feng, Kleanthis Avramidis, Gabor Mihaly Toth, Shrikanth Narayanan

Abstract: The relationship between emotional expression and eye movement is well-documented, with literature establishing gaze patterns are reliable indicators of emotion. However, most studies utilize specialized, high-resolution eye-tracking equipment, limiting the potential reach of findings. We investigate how eye movement can be used to predict multimodal markers of emotional expression from naturalistic, low-resolution videos. We utilize a collection of video interviews from the USC Shoah Foundation’s Visual History Archive with Holocaust survivors as they recount their experiences in the Auschwitz concentration camp. Inspired by pretraining methods on language models, we develop a novel gaze detection model that uses self-supervised eye movement reconstruction that can effectively leverage unlabeled video. We use this model’s encoder embeddings to fine-tune models on two downstream tasks related to emotional expression. The first is aligning eye movement with directional emotion estimates from speech. The second task is using eye gaze as a predictor of three momentary manifestations of emotional behaviors: laughing, crying/sobbing, and sighing. We find our new model is predictive of emotion outcomes and observe a positive correlation between pretraining performance and emotion processing performance for both experiments. We conclude self-supervised eye movement reconstruction is an effective method for encoding the affective signal they carry.

Interpretable Modeling Of Articulatory Temporal Dynamics From Real-Time MRI For Phoneme Recognition

Jay Park, Hong Nguyen, Sean Foley, Jihwan Lee, Yoonjeong Lee, Dani Byrd, Shrikanth Narayanan

Abstract: Real-time Magnetic Resonance Imaging (rtMRI) visualizes vocal tract action, offering a comprehensive window into speech articulation. However, its signals are high dimensional and noisy, hindering interpretation. We investigate compact representations of spatiotemporal articulatory dynamics for phoneme recognition from midsagittal vocal tract rtMRI videos. We compare three feature types: (1) raw video, (2) optical flow, and (3) six linguistically-relevant regions of interest (ROIs) for articulator movements. We evaluate models trained independently on each representation, as well as multi-feature combinations. Results show that multi-feature models consistently outperform single-feature baselines, with the lowest phoneme error rate (PER) of 0.34 obtained by combining ROI and raw video. Temporal fidelity experiments demonstrate a reliance on fine-grained articulatory dynamics, while ROI ablation studies reveal strong contributions from tongue and lips. Our findings highlight how rtMRI-derived features provide accuracy and interpretability, and establish strategies for leveraging articulatory data in speech processing.

A long-form single-speaker real-time MRI speech dataset and benchmark

Sean Foley, Jihwan Lee, Kevin Huang, Xuan Shi, Yoonjeong Lee, Louis Goldstein, Shrikanth Narayanan

Abstract: We release the USC Long Single-Speaker (LSS) dataset containing real-time MRI video of the vocal tract dynamics and simultaneous audio obtained during speech production. This unique dataset contains roughly one hour of video and audio data from a single native speaker of American English, making it one of the longer publicly available single-speaker datasets of real-time MRI speech data. Along with the articulatory and acoustic raw data, we release derived representations of the data that are suitable for a range of downstream tasks. This includes video cropped to the vocal tract region, sentence-level splits of the data, restored and denoised audio, and regions-of-interest timeseries. We also benchmark this dataset on articulatory synthesis and phoneme recognition tasks, providing baseline performance for these tasks on this dataset which future research can aim to improve upon.

Towards Six-Dimensional Articulatory Speech Encoding

Jihwan Lee, Sean Foley, Thanathai Lertpetchpun, Kevin Huang, Yoonjeong Lee, Tiantian Feng, Louis Goldstein, Dani Byrd, Shrikanth Narayanan

Abstract: We propose ARTI-6, a compact six-dimensional articulatory speech encoding framework derived from real-time MRI data that captures crucial vocal tract regions including the velum, tongue root, and larynx. ARTI-6 consists of three components: (1) a six-dimensional articulatory feature set representing key regions of the vocal tract; (2) an articulatory inversion model, which predicts articulatory features from speech acoustics leveraging speech foundation models, achieving a prediction correlation of 0.87; and (3) an articulatory synthesis model, which reconstructs intelligible speech directly from articulatory features, showing that even a low-dimensional representation can generate natural-sounding speech. Together, ARTI-6 provides an interpretable, computationally efficient, and physiologically grounded framework for advancing articulatory inversion, synthesis, and broader speech technology applications.

VoxGuard: Evaluating user and attribute privacy in speech via Membership Inference Attacks

Efthymios Tsaprazlis, Thanathai Lertpetchpun, Tiantian Feng, Sai Praneeth Karimireddy, Shrikanth Narayanan

Abstract: Voice anonymization aims to conceal speaker identity and attributes while preserving intelligibility, but current evaluations rely almost exclusively on Equal Error Rate (EER) that obscures whether adversaries can mount high-precision attacks. We argue that privacy should instead be evaluated in the low false-positive rate (FPR) regime, where even a small number of successful identifications constitutes a meaningful breach. To this end, we introduce VoxGuard, a framework grounded in differential privacy and membership inference that formalizes two complementary notions: User Privacy, preventing speaker re-identification, and Attribute Privacy, protecting sensitive traits such as gender and accent. Across synthetic and real datasets, we find that informed adversaries, especially those using fine-tuned models and max-similarity scoring, achieve orders-of-magnitude stronger attacks at low-FPR despite similar EER. For attributes, we show that simple transparent attacks recover gender and accent with near-perfect accuracy even after anonymization. Our results demonstrate that EER substantially underestimates leakage, highlighting the need for low-FPR evaluation, and recommend VoxGuard as a benchmark for evaluating privacy leakage.

Published on May 13th, 2026

Last updated on May 13th, 2026

This article may feature some AI-assisted content for clarity, consistency, and to help explore complex scientific concepts with greater depth and creative range.