Researchers at USC are presenting their breakthrough discoveries at the Conference on Neural Information Processing Systems (NeurIPS) 2024 from December 10-15 in Vancouver, BC. More than 40 papers with USC affiliation were accepted this year to NeurIPS, one of the largest and most high-impact conferences on machine learning and artificial intelligence research.
Topics include training machine learning models to interpret 3D environments from 2D observations in autonomous driving scenes, developing linguistics-based audio deepfake detection models, as well as investigating methods to reduce algorithmic bias in medical diagnostics and job screening.
Accepted papers with USC affiliation (USC authors in bold)
(SPOTLIGHT papers are in the top 2% of papers submitted)
Pre-trained Large Language Models Use Fourier Features to Compute Addition
Authors: Tianyi Zhou, Deqing Fu, Vatsal Sharan, Robin Jia
Abstract: Pre-trained large language models (LLMs) exhibit impressive mathematical reasoning capabilities, yet how they compute basic arithmetic, such as addition, remains unclear. This paper shows that pre-trained LLMs add numbers using Fourier features — dimensions in the hidden state that represent numbers via a set of features sparse in the frequency domain. Within the model, MLP and attention layers use Fourier features in complementary ways: MLP layers primarily approximate the magnitude of the answer using low-frequency features, while attention layers primarily perform modular addition (e.g., computing whether the answer is even or odd) using high-frequency features. Pre-training is crucial for this mechanism: models trained from scratch to add numbers only exploit low-frequency features, leading to lower accuracy. Introducing pre-trained token embeddings to a randomly initialized model rescues its performance. Overall, our analysis demonstrates that appropriate pre-trained representations (e.g., Fourier features) can unlock the ability of Transformers to learn precise mechanisms for algorithmic tasks.
Transformers Learn to Achieve Second-Order Convergence Rates for In-Context Linear Regression
Authors: Deqing Fu, Tian-qi Chen, Robin Jia, Vatsal Sharan
Abstract: Transformers excel at in-context learning (ICL) — learning from demonstrations without parameter updates — but how they do so remains a mystery. Recent work suggests that Transformers may internally run Gradient Descent (GD), a first-order optimization method, to perform ICL. In this paper, we instead demonstrate that Transformers learn to approximate second-order optimization methods for ICL. For in-context linear regression, Transformers share a similar convergence rate as Iterative Newton’s Method, both exponentially faster than GD. Empirically, predictions from successive Transformer layers closely match different iterations of Newton’s Method linearly, with each middle layer roughly computing 3 iterations; thus, Transformers and Newton’s method converge at roughly the same rate. In contrast, Gradient Descent converges exponentially more slowly. We also show that Transformers can learn in-context on ill-conditioned data, a setting where Gradient Descent struggles but Iterative Newton succeeds. Finally, to corroborate our empirical findings, we prove that Transformers can implement k iterations of Newton’s method with k+(1) layers.
NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing
Authors: Ting-Hsuan Chen, Jiewen Chan, Hau-Shiang Shiu, Shih Han Yen, Changhan Yeh, Yu-Lun Liu
Abstract: We propose a video editing framework, NaRCan, which integrates a hybrid deformation field and diffusion prior to generate high-quality natural canonical images to represent the input video. Our approach utilizes homography to model global motion and employs multi-layer perceptrons (MLPs) to capture local residual deformations, enhancing the model’s ability to handle complex video dynamics. By introducing a diffusion prior from the early stages of training, our model ensures that the generated images retain a high-quality natural appearance, making the produced canonical images suitable for various downstream tasks in video editing, a capability not achieved by current canonical-based methods. Furthermore, we incorporate low-rank adaptation (LoRA) fine-tuning and introduce a noise and diffusion prior update scheduling technique that accelerates the training process by 14 times. Extensive experimental results show that our method outperforms existing approaches in various video editing tasks and produces coherent and high-quality edited video sequences. See our project page for video results at https://koi953215.github.io/NaRCan_page/.
Capturing the denoising effect of PCA via compression ratio
Authors: Chandra Sekhar Mukherjee, Nikhil Deorkar, Jiapeng Zhang
Abstract: Principal component analysis (PCA) is one of the most fundamental tools in machine learning with broad use as a dimensionality reduction and denoising tool. In the later setting, while PCA is known to be effective at subspace recovery and is proven to aid clustering algorithms in some specific settings, its improvement of noisy data is still not well quantified in general. In this paper, we propose a novel metric called compression ratio to capture the effect of PCA on high-dimensional noisy data. We show that, for data with underlying community structure, PCA significantly reduces the distance of data points belonging to the same community while reducing inter-community distance relatively mildly. We explain this phenomenon through both theoretical proofs and experiments on real-world data. Building on this new metric, we design a straightforward algorithm that could be used to detect outliers. Roughly speaking, we argue that points that have a lower variance of compression ratio do not share a common signal with others (hence could be considered outliers). We provide theoretical justification for this simple outlier detection algorithm and use simulations to demonstrate that our method is competitive with popular outlier detection tools. Finally, we run experiments on real-world high-dimension noisy data (single-cell RNA-seq) to show that removing points from these datasets via our outlier detection method improves the accuracy of clustering algorithms. Our method is very competitive with popular outlier detection tools in this task.
MultiOOD: Scaling Out-of-Distribution Detection for Multiple Modalities (SPOTLIGHT)
Authors: Hao Dong, Yue Zhao, Eleni Chatzi, Olga Fink
Abstract: Detecting out-of-distribution (OOD) samples is important for deploying machine learning models in safety-critical applications such as autonomous driving and robot-assisted surgery. Existing research has mainly focused on unimodal scenarios on image data. However, real-world applications are inherently multimodal, which makes it essential to leverage information from multiple modalities to enhance the efficacy of OOD detection. To establish a foundation for more realistic Multimodal OOD Detection, we introduce the first-of-its-kind benchmark, MultiOOD, characterized by diverse dataset sizes and varying modality combinations. We first evaluate existing unimodal OOD detection algorithms on MultiOOD, observing that the mere inclusion of additional modalities yields substantial improvements. This underscores the importance of utilizing multiple modalities for OOD detection. Based on the observation of Modality Prediction Discrepancy between in-distribution (ID) and OOD data, and its strong correlation with OOD performance, we propose the Agree-to-Disagree (A2D) algorithm to encourage such discrepancy during training. Moreover, we introduce a novel outlier synthesis method, NP-Mix, which explores broader feature spaces by leveraging the information from nearest neighbor classes and complements A2D to strengthen OOD detection performance. Extensive experiments on MultiOOD demonstrate that training with A2D and NP-Mix improves existing OOD detection algorithms by a large margin. Our source code and MultiOOD benchmark are available at https://github.com/donghao51/MultiOOD.
ALPINE: Unveiling The Planning Capability of Autoregressive Learning in Language Models
Authors: Siwei Wang, Yifei Shen, Shi Feng, Haoran Sun, Shang-Hua Teng, Wei Chen
Abstract: Planning is a crucial element of both human intelligence and contemporary large language models (LLMs). In this paper, we initiate a theoretical investigation into the emergence of planning capabilities in Transformer-based LLMs via their next-word prediction mechanisms. We model planning as a network path-finding task, where the objective is to generate a valid path from a specified source node to a designated target node. Our mathematical characterization shows that Transformer architectures can execute path-finding by embedding the adjacency and reachability matrices within their weights. Furthermore, our theoretical analysis of gradient-based learning dynamics reveals that LLMs can learn both the adjacency and a limited form of the reachability matrices. These theoretical insights are then validated through experiments, which demonstrate that Transformer architectures indeed learn the adjacency and an incomplete reachability matrices, consistent with our theoretical predictions. When applying our methodology to the real-world planning benchmark Blocksworld, our observations remain consistent. Additionally, our analyses uncover a fundamental limitation of current Transformer architectures in path-finding: these architectures cannot identify reachability relationships through transitivity, which leads to failures in generating paths when concatenation is required. These findings provide new insights into how the internal mechanisms of autoregressive learning facilitate intelligent planning and deepen our understanding of how future LLMs might achieve more advanced and general planning-and-reasoning capabilities across diverse applications.
ImOV3D: Learning Open Vocabulary Point Clouds 3D Object Detection from Only 2D Images
Authors: Timing Yang, Yuanliang Ju, Li Yi
Abstract: Open-vocabulary 3D object detection (OV-3Det) aims to generalize beyond the limited number of base categories labeled during the training phase. The biggest bottleneck is the scarcity of annotated 3D data, whereas 2D image datasets are abundant and richly annotated. Consequently, it is intuitive to leverage the wealth of annotations in 2D images to alleviate the inherent data scarcity in OV-3Det. In this paper, we push the task setup to its limits by exploring the potential of using solely 2D images to learn OV-3Det. The major challenges for this setup is the modality gap between training images and testing point clouds, which prevents effective integration of 2D knowledge into OV-3Det. To address this challenge, we propose a novel framework ImOV3D to leverage pseudo multimodal representation containing both images and point clouds (PC) to close the modality gap. The key of ImOV3D lies in flexible modality conversion where 2D images can be lifted into 3D using monocular depth estimation and can also be derived from 3D scenes through rendering. This allows unifying both training images and testing point clouds into a common image-PC representation, encompassing a wealth of 2D semantic information and also incorporating the depth and structural characteristics of 3D spatial data. We carefully conduct such conversion to minimize the domain gap between training and test cases. Extensive experiments on two benchmark datasets, SUNRGBD and ScanNet, show that ImOV3D significantly outperforms existing methods, even in the absence of ground truth 3D training data. With the inclusion of a minimal amount of real 3D data for fine-tuning, the performance also significantly surpasses previous state-of-the-art. Codes and pre-trained models are released on the https://github.com/yangtiming/ImOV3D.
e-COP: Episodic Constrained Optimization of Policies
Authors: Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Sahil Singla
Abstracts: In this paper, we present the 𝚎-𝙲𝙾𝙿 algorithm, the first policy optimization algorithm for constrained Reinforcement Learning (RL) in episodic (finite horizon) settings. Such formulations are applicable when there are separate sets of optimization criteria and constraints on a system’s behavior. We approach this problem by first establishing a policy difference lemma for the episodic setting, which provides the theoretical foundation for the algorithm. Then, we propose to combine a set of established and novel solution ideas to yield the 𝚎-𝙲𝙾𝙿 algorithm that is easy to implement and numerically stable, and provide a theoretical guarantee on optimality under certain scaling assumptions. Through extensive empirical analysis using benchmarks in the Safety Gym suite, we show that our algorithm has similar or better performance than SoTA (non-episodic) algorithms adapted for the episodic setting. The scalability of the algorithm opens the door to its application in safety-constrained Reinforcement Learning from Human Feedback for Large Language or Diffusion Models.
Authors: Letian Wang, Seung Wook Kim, Jiawei Yang, Cunjun Yu, Boris Ivanovic, Steven L. Waslander, Yue Wang, Sanja Fidler, Marco Pavone, Peter Karkus
Abstract: We propose DistillNeRF, a self-supervised learning framework addressing the challenge of understanding 3D environments from limited 2D observations in outdoor autonomous driving scenes. Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs with limited view overlap, and is trained self-supervised with differentiable rendering to reconstruct RGB, depth, or feature images. Our first insight is to exploit per-scene optimized Neural Radiance Fields (NeRFs) by generating dense depth and virtual camera targets from them, which helps our model to learn enhanced 3D geometry from sparse non-overlapping image inputs. Second, to learn a semantically rich 3D representation, we propose distilling features from pre-trained 2D foundation models, such as CLIP or DINOv2, thereby enabling various downstream tasks without the need for costly 3D human annotations. To leverage these two insights, we introduce a novel model architecture with a two-stage lift-splat-shoot encoder and a parameterized sparse hierarchical voxel representation. Experimental results on the NuScenes and Waymo NOTR datasets demonstrate that DistillNeRF significantly outperforms existing comparable state-of-the-art self-supervised methods for scene reconstruction, novel view synthesis, and depth estimation; and it allows for competitive zero-shot 3D semantic occupancy prediction, as well as open-world scene understanding through distilled foundation model features. Demos and code will be available at https://distillnerf.github.io/.
Motion Graph Unleashed: A Novel Approach to Video Prediction
Authors: Yiqi Zhong, Luming Liang, Bohan Tang, Ilya Zharkov, Ulrich Neumann
Abstract: We introduce motion graph, a novel approach to the video prediction problem, which predicts future video frames from limited past data. The motion graph transforms patches of video frames into interconnected graph nodes, to comprehensively describe the spatial-temporal relationships among them. This representation overcomes the limitations of existing motion representations such as image differences, optical flow, and motion matrix that either fall short in capturing complex motion patterns or suffer from excessive memory consumption. We further present a video prediction pipeline empowered by motion graph, exhibiting substantial performance improvements and cost reductions. Experiments on various datasets, including UCF Sports, KITTI and Cityscapes, highlight the strong representative ability of motion graph. Especially on UCF Sports, our method matches and outperforms the SOTA methods with a significant reduction in model size by 78% and a substantial decrease in GPU memory utilization by 47%.
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
Authors: Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, LILI YU, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou
Abstract: The quadratic complexity and weak length extrapolation of Transformers limits their ability to scale to long sequences, and while sub-quadratic solutions like linear attention and state space models exist, they empirically underperform Transformers in pretraining efficiency and downstream task accuracy. We introduce Megalodon, a neural architecture for efficient sequence modeling with unlimited context length. Megalodon inherits the architecture of Mega (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability and stability, including complex exponential moving average (CEMA), timestep normalization layer, normalized attention mechanism and pre-norm with two-hop residual configuration. In a controlled head-to-head comparison with Llama2, Megalodon achieves better efficiency than Transformer in the scale of 7 billion parameters and 2 trillion training tokens. Megalodon reaches a training loss of 1.70, landing mid-way between Llama2-7B (1.75) and 13B (1.67). Code: https://github.com/XuezheMax/megalodon.
Pre-training Differentially Private Models with Limited Public Data
Authors: Zhiqi Bu, Xinwei Zhang, Sheng Zha, Mingyi Hong, George Karypis
Abstract: The superior performance of large foundation models relies on the use of massive amounts of high-quality data, which often contain sensitive, private and copyrighted material that requires formal protection. While differential privacy (DP) is a prominent method to gauge the degree of security provided to the models, its application is commonly limited to the model fine-tuning stage, due to the performance degradation when applying DP during the pre-training stage. Consequently, DP is yet not capable of protecting a substantial portion of the data used during the initial pre-training process. In this work, we first provide a theoretical understanding of the efficacy of DP training by analyzing the per-iteration loss improvement. We make a key observation that DP optimizers’ performance degradation can be significantly mitigated by the use of limited public data, which leads to a novel DP continual pre-training strategy. Empirically, using only 10\% of public data, our strategy can achieve DP accuracy of 41.5\% on ImageNet-21k (with ϵ=8), as well as non-DP accuracy of 55.7\% and and 60.0\% on downstream tasks Places365 and iNaturalist-2021, respectively, on par with state-of-the-art standard pre-training and substantially outperforming existing DP pre-trained models. Our DP pre-trained models are released in fastDP library (https://github.com/awslabs/fast-differential-privacy/releases/tag/v2.1)
Axioms for AI Alignment from Human Feedback (SPOTLIGHT)
Authors: Luise Ge, Daniel Halpern, Evi Micha, Ariel D. Procaccia, Itai Shapira, Yevgeniy Vorobeychik, Junlin Wu
Abstract: In the context of reinforcement learning from human feedback (RLHF), the reward function is generally derived from maximum likelihood estimation of a random utility model based on pairwise comparisons made by humans. The problem of learning a reward function is one of preference aggregation that, we argue, largely falls within the scope of social choice theory. From this perspective, we can evaluate different aggregation methods via established axioms, examining whether these methods meet or fail well-known standards. We demonstrate that both the Bradley-Terry-Luce Model and its broad generalizations fail to meet basic axioms. In response, we develop novel rules for learning reward functions with strong axiomatic guarantees. A key innovation from the standpoint of social choice is that our problem has a linear structure, which greatly restricts the space of feasible rules and leads to a new paradigm that we call linear social choice.
Transductive Learning is Compact
Authors: Julian Asilis, Siddartha Devic, Shaddin Dughmi, Vatsal Sharan, Shang-Hua Teng
Abstract: We demonstrate a compactness result holding broadly across supervised learning with a general class of loss functions: Any hypothesis class H is learnable with transductive sample complexity m precisely when all of its finite projections are learnable with sample complexity m. We prove that this exact form of compactness holds for realizable and agnostic learning with respect to any proper metric loss function (e.g., any norm on ℝd) and any continuous loss on a compact space (e.g., cross-entropy, squared loss). For realizable learning with improper metric losses, we show that exact compactness of sample complexity can fail, and provide matching upper and lower bounds of a factor of 2 on the extent to which such sample complexities can differ. We conjecture that larger gaps are possible for the agnostic case. Furthermore, invoking the equivalence between sample complexities in the PAC and transductive models (up to lower order factors, in the realizable case) permits us to directly port our results to the PAC model, revealing an almost-exact form of compactness holding broadly in PAC learning.
DOPPLER: Differentially Private Optimizers with Low-pass Filter for Privacy Noise Reduction
Authors: Xinwei Zhang, Zhiqi Bu, Mingyi Hong, Meisam Razaviyayn
Abstract: Privacy is a growing concern in modern deep-learning systems and applications. Differentially private (DP) training prevents the leakage of sensitive information in the collected training data from the trained machine learning models. DP optimizers, including DP stochastic gradient descent (DPSGD) and its variants, privatize the training procedure by gradient clipping and DP noise injection. However, in practice, DP models trained using DPSGD and its variants often suffer from significant model performance degradation. Such degradation prevents the application of DP optimization in many key tasks, such as foundation model pretraining. In this paper, we provide a novel signal processing perspective to the design and analysis of DP optimizers. We show that a “frequency domain” operation called low-pass filtering can be used to effectively reduce the impact of DP noise. More specifically, by defining the “frequency domain” for both the gradient and differential privacy (DP) noise, we have developed a new component, called DOPPLER. This component is designed for DP algorithms and works by effectively amplifying the gradient while suppressing DP noise within this frequency domain. As a result, it maintains privacy guarantees and enhances the quality of the DP-protected model. Our experiments show that the proposed DP optimizers with a low-pass filter outperform their counterparts without the filter by 3%-10% in test accuracy on various models and datasets. Both theoretical and practical evidence suggest that the DOPPLER is effective in closing the gap between DP and non-DP training.
Conformal Classification with Equalized Coverage for Adaptively Selected Groups
Authors: Yanfei Zhou, Matteo Sesia
Abstract: This paper introduces a conformal inference method to evaluate uncertainty in classification by generating prediction sets with valid coverage conditional on adaptively chosen features. These features are carefully selected to reflect potential model limitations or biases. This can be useful to find a practical compromise between efficiency – by providing informative predictions – and algorithmic fairness – by ensuring equalized coverage for the most sensitive groups. We demonstrate the validity and effectiveness of this method on simulated and real data sets.
Contextual Multinomial Logit Bandits with General Value Functions
Authors: Mengxiao Zhang, Haipeng Luo
Abstract: Contextual multinomial logit (MNL) bandits capture many real-world assortment recommendation problems such as online retailing/advertising. However, prior work has only considered (generalized) linear value functions, which greatly limits its applicability. Motivated by this fact, in this work, we consider contextual MNL bandits with a general value function class that contains the ground truth, borrowing ideas from a recent trend of studies on contextual bandits. Specifically, we consider both the stochastic and the adversarial settings, and propose a suite of algorithms, each with different computation-regret trade-off. When applied to the linear case, our results not only are the first ones with no dependence on a certain problem-dependent constant that can be exponentially large, but also enjoy other advantages such as computational efficiency, dimension-free regret bounds, or the ability to handle completely adversarial contexts and rewards.
Authors: Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, Pengfei Liu
Abstract: Recent research has made significant strides in aligning large language models (LLMs) with helpfulness and harmlessness. In this paper, we argue for the importance of alignment for honesty, ensuring that LLMs proactively refuse to answer questions when they lack knowledge, while still not being overly conservative. However, a pivotal aspect of alignment for honesty involves discerning an LLM’s knowledge boundaries, which demands comprehensive solutions in terms of metric development, benchmark creation, and training methodologies. We address these challenges by first establishing a precise problem definition and defining “honesty” inspired by the Analects of Confucius. This serves as a cornerstone for developing metrics that effectively measure an LLM’s honesty by quantifying its progress post-alignment. Furthermore, we introduce a flexible training framework which is further instantiated by several efficient fine-tuning techniques that emphasize honesty without sacrificing performance on other tasks. Our extensive experiments reveal that these aligned models show a marked increase in honesty, as indicated by our proposed metrics. We open-source all relevant resources to facilitate future research at https://github.com/GAIR-NLP/alignment-for-honesty.
Data Acquisition via Experimental Design for Data Markets
Authors: Charles Lu, Baihe Huang, Sai Praneeth Karimireddy, Praneeth Vepakomma, Michael Jordan, Ramesh Raskar
Abstract: The acquisition of training data is crucial for machine learning applications. Data markets can increase the supply of data, particularly in data-scarce domains such as healthcare, by incentivizing potential data providers to join the market. A major challenge for a data buyer in such a market is choosing the most valuable data points from a data seller. Unlike prior work in data valuation, which assumes centralized data access, we propose a federated approach to the data acquisition problem that is inspired by linear experimental design. Our proposed data acquisition method achieves lower prediction error without requiring labeled validation data and can be optimized in a fast and federated procedure. The key insight of our work is that a method that directly estimates the benefit of acquiring data for test set prediction is particularly compatible with a decentralized market setting.
Optimal Multiclass U-Calibration Error and Beyond
Authors: Haipeng Luo, Spandan Senapati, Vatsal Sharan
Abstract: We consider the problem of online multiclass U-calibration, where a forecaster aims to make sequential distributional predictions over K classes with low U-calibration error, that is, low regret with respect to all bounded proper losses simultaneously. Kleinberg et al. (2023) developed an algorithm with U-calibration error O(K√T) after T rounds and raised the open question of what the optimal bound is. We resolve this question by showing that the optimal U-calibration error is Θ(√KT) – we start with a simple observation that the Follow-the-Perturbed-Leader algorithm of Daskalakis and Syrgkanis (2016) achieves this upper bound, followed by a matching lower bound constructed with a specific proper loss (which, as a side result, also proves the optimality of the algorithm of Daskalakis and Syrgkanis (2016) in the context of online learning against an adversary with finite choices). We also strengthen our results under natural assumptions on the loss functions, including Θ(logT) U-calibration error for Lipschitz proper losses, O(logT) U-calibration error for a certain class of decomposable proper losses, U-calibration error bounds for proper losses with a low covering number, and others.
When is Multicalibration Post-Processing Necessary?
Authors: Dutch Hansen, Siddartha Devic, Preetum Nakkiran, Vatsal Sharan
Abstract: Calibration is a well-studied property of predictors which guarantees meaningful uncertainty estimates. Multicalibration is a related notion – originating in algorithmic fairness – which requires predictors to be simultaneously calibrated over a potentially complex and overlapping collection of protected subpopulations (such as groups defined by ethnicity, race, or income). We conduct the first comprehensive study evaluating the usefulness of multicalibration post-processing across a broad set of tabular, image, and language datasets for models spanning from simple decision trees to 90 million parameter fine-tuned LLMs. Our findings can be summarized as follows: (1) models which are calibrated out of the box tend to be relatively multicalibrated without any additional post-processing; (2) multicalibration post-processing can help inherently uncalibrated models and large vision and language models; and (3) traditional calibration measures may sometimes provide multicalibration implicitly. More generally, we also distill many independent observations which may be useful for practical and effective applications of multicalibration post-processing in real-world contexts. We also release a python package implementing multicalibration algorithms, available via ‘pip install multicalibration’.
DynaMITE-RL: A Dynamic Model for Improved Temporal Meta-Reinforcement Learning
Authors: Anthony Liang, Guy Tennenholtz, ChihWei Hsu, Yinlam Chow, Erdem Biyik, Craig Boutilier
Abstract: We introduce DynaMITE-RL, a meta-reinforcement learning (meta-RL) approach to approximate inference in environments where the latent state evolves at varying rates. We model episode sessions – parts of the episode where the latent state is fixed – and propose three key modifications to existing meta-RL methods: consistency of latent information within sessions, session masking, and prior latent conditioning. We demonstrate the importance of these modifications in various domains, ranging from discrete Gridworld environments to continuous-control and simulated robot assistive tasks, demonstrating that DynaMITE-RL significantly outperforms state-of-the-art baselines in sample efficiency and inference returns.
A Structure-Aware Framework for Learning Device Placements on Computation Graphs
Authors: Shukai Duan, Heng Ping, Nikos Kanakaris, Xiongye Xiao, Panagiotis Kyriakis, Nesreen K. Ahmed, Peiyu Zhang, Guixiang Ma, Mihai Capotă, Shahin Nazarian, Theodore L. Willke, Paul Bogdan
Abstract: Existing approaches for device placement ignore the topological features of computation graphs and rely mostly on heuristic methods for graph partitioning. At the same time, they either follow a grouper-placer or an encoder-placer architecture, which requires understanding the interaction structure between code operations. To bridge the gap between encoder-placer and grouper-placer techniques, we propose a novel framework for the task of device placement, relying on smaller computation graphs extracted from the OpenVINO toolkit using reinforcement learning. The framework consists of five steps, including graph coarsening, node representation learning and policy optimization. It facilitates end-to-end training and takes into consideration the directed and acyclic nature of the computation graphs. We also propose a model variant, inspired by graph parsing networks and complex network analysis, enabling graph representation learning and personalized graph partitioning jointly, using an unspecified number of groups. To train the entire framework, we utilize reinforcement learning techniques by employing the execution time of the suggested device placements to formulate the reward. We demonstrate the flexibility and effectiveness of our approach through multiple experiments with three benchmark models, namely Inception-V3, ResNet, and BERT. The robustness of the proposed framework is also highlighted through an ablation study. The suggested placements improve the inference speed for the benchmark models by up to 58.2% over CPU execution and by up to 60.24% compared to other commonly used baselines.
Provably Efficient Interactive-Grounded Learning with Personalized Reward
Authors: Mengxiao Zhang, Yuheng Zhang, Haipeng Luo, Paul Mineiro
Abstract: Interactive-Grounded Learning (IGL) [Xie et al., 2021] is a powerful framework in which a learner aims at maximizing unobservable rewards through interacting with an environment and observing reward-dependent feedback on the taken actions. To deal with personalized rewards that are ubiquitous in applications such as recommendation systems, Maghakian et al. [2022] study a version of IGL with context-dependent feedback, but their algorithm does not come with theoretical guarantees. In this work, we consider the same problem and provide the first provably efficient algorithms with sublinear regret under realizability. Our analysis reveals that the step-function estimator of prior work can deviate uncontrollably due to finite-sample effects. Our solution is a novel Lipschitz reward estimator which underestimates the true reward and enjoys favorable generalization performances. Building on this estimator, we propose two algorithms, one based on explore-then-exploit and the other based on inverse-gap weighting. We apply IGL to learning from image feedback and learning from text feedback, which are reward-free settings that arise in practice. Experimental results showcase the importance of using our Lipschitz reward estimator and the overall effectiveness of our algorithms.
No-Regret Learning for Fair Multi-Agent Social Welfare Optimization
Authors: Mengxiao Zhang, Ramiro Deo-Campo Vuong, Haipeng Luo
Abstract: We consider the problem of online multi-agent Nash social welfare (NSW) maximization. While previous works of Hossain et al. [2021], Jones et al. [2023] study similar problems in stochastic multi-agent multi-armed bandits and show that √T-regret is possible after T rounds, their fairness measure is the product of all agents’ rewards, instead of their NSW (that is, their geometric mean). Given the fundamental role of NSW in the fairness literature, it is more than natural to ask whether no-regret fair learning with NSW as the objective is possible. In this work, we provide a complete answer to this question in various settings. Specifically, in stochastic N-agent K-armed bandits, we develop an algorithm with (K2NTN−1N) regret and prove that the dependence on T is tight, making it a sharp contrast to the √T-regret bounds of Hossain et al. [2021], Jones et al. [2023]. We then consider a more challenging version of the problem with adversarial rewards. Somewhat surprisingly, despite NSW being a concave function, we prove that no algorithm can achieve sublinear regret. To circumvent such negative results, we further consider a setting with full-information feedback and design two algorithms with √T-regret: the first one has no dependence on N at all and is applicable to not just NSW but a broad class of welfare functions, while the second one has better dependence on K and is preferable when N is small. Finally, we also show that logarithmic regret is possible whenever there exists one agent who is indifferent about different arms.
Your Diffusion Model is Secretly a Noise Classifier and Benefits from Contrastive Training
Authors: Yunshu Wu, Yingtao Luo, Xianghao Kong, Evangelos E. Papalexakis, Greg Ver Steeg
Abstract: Diffusion models learn to denoise data and the trained denoiser is then used to generate new samples from the data distribution. In this paper, we revisit the diffusion sampling process and identify a fundamental cause of sample quality degradation: the denoiser is poorly estimated in regions that are far Outside Of the training Distribution (OOD), and the sampling process inevitably evaluates in these OOD regions. This can become problematic for all sampling methods, especially when we move to parallel sampling which requires us to initialize and update the entire sample trajectory of dynamics in parallel, leading to many OOD evaluations. To address this problem, we introduce a new self-supervised training objective that differentiates the levels of noise added to a sample, leading to improved OOD denoising performance. The approach is based on our observation that diffusion models implicitly define a log-likelihood ratio that distinguishes distributions with different amounts of noise, and this expression depends on denoiser performance outside the standard training distribution. We show by diverse experiments that the proposed contrastive diffusion training is effective for both sequential and parallel settings, and it improves the performance and speed of parallel samplers significantly.
Active Sequential Posterior Estimation for Sample-Efficient Simulation-Based Inference
Authors: Sam Griesemer, Defu Cao, Zijun Cui, Carolina Osorio, Yan Liu
Abstract: Computer simulations have long presented the exciting possibility of scientific insight into complex real-world processes. Despite the power of modern computing, however, it remains challenging to systematically perform inference under simulation models. This has led to the rise of simulation-based inference (SBI), a class of machine learning-enabled techniques for approaching inverse problems with stochastic simulators. Many such methods, however, require large numbers of simulation samples and face difficulty scaling to high-dimensional settings, often making inference prohibitive under resource-intensive simulators. To mitigate these drawbacks, we introduce active sequential neural posterior estimation (ASNPE). ASNPE brings an active learning scheme into the inference loop to estimate the utility of simulation parameter candidates to the underlying probabilistic model. The proposed acquisition scheme is easily integrated into existing posterior estimation pipelines, allowing for improved sample efficiency with low computational overhead. We further demonstrate the effectiveness of the proposed method in the travel demand calibration setting, a high-dimensional inverse problem commonly requiring computationally expensive traffic simulators. Our method outperforms well-tuned benchmarks and state-of-the-art posterior estimation methods on a large-scale real-world traffic network, as well as demonstrates a performance advantage over non-active counterparts on a suite of SBI benchmark environments.
Enabling Adaptive Agent Training in Open-Ended Simulators by Targeting Diversity
Authors: Robby Costales, Stefanos Nikolaidis
Abstract: The wider application of end-to-end learning methods to embodied decision-making domains remains bottlenecked by their reliance on a superabundance of training data representative of the target domain. Meta-reinforcement learning (meta-RL) approaches abandon the aim of zero-shot generalization – the goal of standard reinforcement learning (RL) – in favor of few-shot adaptation, and thus hold promise for bridging larger generalization gaps. While learning this meta-level adaptive behavior still requires substantial data, efficient environment simulators approaching real-world complexity are growing in prevalence. Even so, hand-designing sufficiently diverse and numerous simulated training tasks for these complex domains is prohibitively labor-intensive. Domain randomization (DR) and procedural generation (PG), offered as solutions to this problem, require simulators to possess carefully-defined parameters which directly translate to meaningful task diversity – a similarly prohibitive assumption. In this work, we present DIVA, an evolutionary approach for generating diverse training tasks in such complex, open-ended simulators. Like unsupervised environment design (UED) methods, DIVA can be applied to arbitrary parameterizations, but can additionally incorporate realistically-available domain knowledge – thus inheriting the flexibility and generality of UED, and the supervised structure embedded in well-designed simulators exploited by DR and PG. Our empirical results showcase DIVA’s unique ability to overcome complex parameterizations and successfully train adaptive agent behavior, far outperforming competitive baselines from prior literature. These findings highlight the potential of such semi-supervised environment design (SSED) approaches, of which DIVA is the first humble constituent, to enable training in realistic simulated domains, and produce more robust and capable adaptive agents.
How Does Variance Shape the Regret in Contextual Bandits?
Authors: Zeyu Jia, Jian Qian, Alexander Rakhlin, Chen-Yu Wei
Abstract: We consider realizable contextual bandits with general function approximation, investigating how small reward variance can lead to better-than-minimax regret bounds. Unlike in minimax bounds, we show that the eluder dimension d elu − a complexity measure of the function class − plays a crucial role in variance-dependent bounds. Read the full abstract.
SELF-DISCOVER: Large Language Models Self-Compose Reasoning Structures
Authors: Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V Le, Ed H. Chi, Denny Zhou, Swaroop Mishra, Steven Zheng
Abstract: We introduce SELF-DISCOVER, a general framework for LLMs to self-discover the task-intrinsic reasoning structures to tackle complex reasoning problems that are challenging for typical prompting methods. Core to the framework is a self-discovery process where LLMs select multiple atomic reasoning modules such as critical thinking and step-by-step thinking, and compose them into an explicit reasoning structure for LLMs to follow during decoding. SELF-DISCOVER substantially improves GPT-4 and PaLM 2’s performance on challenging reasoning benchmarks such as BigBench-Hard, grounded agent reasoning, and MATH, by as much as 32% compared to Chain of Thought (CoT). Furthermore, SELF-DISCOVER outperforms inference-intensive methods such as CoT-Self-Consistency by more than 20%, while requiring 10-40x fewer inference compute. Finally, we show that the self-discovered reasoning structures are universally applicable across model families: from PaLM 2-L to GPT-4, and from GPT-4 to Llama2, and share commonalities with human reasoning patterns.
Beating Adversarial Low-Rank MDPs with Unknown Transition and Bandit Feedback
Authors: Haolin Liu, Zakaria Mhammedi, Chen-Yu Wei, Julian Zimmert
Abstract: We consider regret minimization in low-rank MDPs with fixed transition and adversarial losses. Previous work has investigated this problem under either full-information loss feedback with unknown transitions (Zhao et al., 2024), or bandit loss feedback with known transitions (Foster et al., 2022). First, we improve the poly(d,A,H)T5/6 regret bound of Zhao et al. (2024) to poly(d,A,H)T2/3 for the full-information unknown transition setting, where d is the rank of the transitions, A is the number of actions, H is the horizon length, and T is the number of episodes. Next, we initiate the study on the setting with bandit loss feedback and unknown transitions. Assuming that the loss has a linear structure, we propose both model-based and model-free algorithms achieving poly(d,A,H)T2/3 regret, though they are computationally inefficient. We also propose oracle-efficient model-free algorithms with poly(d,A,H)T4/5 regret. We show that the linear structure is necessary for the bandit case—without structure on the reward function, the regret has to scale polynomially with the number of states. This is contrary to the full-information case (Zhao et al., 2024), where the regret can be independent of the number of states even for unstructured reward functions.
Decision-Focused Learning with Directional Gradients
Authors: Michael Huang, Vishal Gupta
Abstract: We propose a novel family of decision-aware surrogate losses, called Perturbation Gradient (PG) losses, for the predict-then-optimize framework. The key idea is to connect the expected downstream decision loss with the directional derivative of a particular plug-in objective, and then approximate this derivative using zeroth order gradient techniques. Unlike the original decision loss which is typically piecewise constant and discontinuous, our new PG losses is a Lipschitz continuous, difference of concave functions that can be optimized using off-the-shelf gradient-based methods. Most importantly, unlike existing surrogate losses, the approximation error of our PG losses vanishes as the number of samples grows. Hence, optimizing our surrogate loss yields a best-in-class policy asymptotically, even in misspecified settings. This is the first such result in misspecified settings, and we provide numerical evidence confirming our PG losses substantively outperform existing proposals when the underlying model is misspecified.
Authors: Lei Shi, Waverly Wei, Jingshen Wang
Abstract: Covariate-adjusted response-adaptive randomization (CARA) designs are gaining increasing attention. These designs combine the advantages of randomized experiments with the ability to adaptively revise treatment allocations based on data collected across multiple stages, enhancing estimation efficiency. Yet, CARA designs often assume that primary outcomes are immediately observable, which is not the case in many clinical scenarios where there is a delay in observing primary outcomes. This assumption can lead to significant missingness and inefficient estimation of treatment effects. To tackle this practical challenge, we propose a CARA experimental strategy integrating delayed primary outcomes with immediately observed surrogate outcomes. Surrogate outcomes are intermediate clinical outcomes that are predictive or correlated with the primary outcome of interest. Our design goal is to improve the estimation efficiency of the average treatment effect (ATE) of the primary outcome utilizing surrogate outcomes. From a methodological perspective, our approach offers two benefits: First, we accommodate arm and covariates-dependent delay mechanisms without imposing any parametric modeling assumptions on the distribution of outcomes. Second, when primary outcomes are not fully observed, surrogate outcomes can guide the adaptive treatment allocation rule. From a theoretical standpoint, we prove the semiparametric efficiency bound of estimating ATE under delayed primary outcomes while incorporating surrogate outcomes. We show that the ATE estimator under our proposed design strategy attains this semiparametric efficiency bound and achieves asymptotic normality. Through theoretical investigations and a synthetic HIV study, we show that our design is more efficient than the design without incorporating any surrogate information.
Dynamic Subgroup Identification in Covariate-adjusted Response-adaptive Randomization Experiments
Authors: Yanping Li, Jingshen Wang, Waverly Wei
Abstract: Identifying subgroups with differential responses to treatment is pivotal in randomized clinical trials, as tailoring treatments to specific subgroups can advance personalized medicine. Upon trial completion, identifying best-performing subgroups–those with the most beneficial treatment effects–is crucial for optimizing resource allocation or mitigating adverse treatment effects. However, traditional clinical trials are not customized for the goal of identifying best-performing subgroups because they typically pre-define subgroups at the beginning of the trial and adhere to a fixed subgroup treatment allocation rule, leading to inefficient use of experimental efforts. While some adaptive experimental strategies exist for the identification of the single best subgroup, they commonly do not enable the identification of the best set of subgroups. To address these challenges, we propose a dynamic subgroup identification covariate-adjusted response-adaptive randomization (CARA) design strategy with the following key features: (i) Our approach is an adaptive experimental strategy that allows the dynamic identification of the best subgroups and the revision of treatment allocation towards the goal of correctly identifying the best subgroups based on collected experimental data. (ii) Our design handles ties between subgroups effectively, merging those with similar treatment effects to maximize experimental efficiency. In the theoretical investigations, we demonstrate that our design has a higher probability of correctly identifying the best set of subgroups compared to conventional designs. Additionally, we prove the statistical validity of our estimator for the best subgroup treatment effect, demonstrating its asymptotic normality and semiparametric efficiency. Finally, we validate our design using synthetic data from a clinical trial on cirrhosis.
Fast Last-Iterate Convergence of Learning in Games Requires Forgetful Algorithms
Authors: Yang Cai, Gabriele Farina, Julien Grand-Clément, Christian Kroer, Chung-Wei Lee, Haipeng Luo, Weiqiang Zheng
Abstract: Self-play via online learning is one of the premier ways to solve large-scale two-player zero-sum games, both in theory and practice. Particularly popular algorithms include optimistic multiplicative weights update (OMWU) and optimistic gradient-descent-ascent (OGDA). While both algorithms enjoy O(1/T) ergodic convergence to Nash equilibrium in two-player zero-sum games, OMWU offers several advantages including logarithmic dependence on the size of the payoff matrix and O˜(1/T) convergence to coarse correlated equilibria even in general-sum games. However, in terms of last-iterate convergence in two-player zero-sum games, an increasingly popular topic in this area, OGDA guarantees that the duality gap shrinks at a rate of O(1/T), while the best existing last-iterate convergence for OMWU depends on some game-dependent constant that could be arbitrarily large. This begs the question: is this potentially slow last-iterate convergence an inherent disadvantage of OMWU, or is the current analysis too loose? Somewhat surprisingly, we show that the former is true. More generally, we prove that a broad class of algorithms that do not forget the past quickly all suffer the same issue: for any arbitrarily small δ>0, there exists a 2×2 matrix game such that the algorithm admits a constant duality gap even after 1/δ rounds. This class of algorithms includes OMWU and other standard optimistic follow-the-regularized-leader algorithms.
SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection
Authors: Yi Zhu, Surya Koppisetti, Trang Tran, Gaurav Bharaj
Abstract: Audio deepfake detection (ADD) is crucial to combat the misuse of speech synthesized from generative AI models. Existing ADD models suffer from generalization issues, with a large performance discrepancy between in-domain and out-of-domain data. Moreover, the black-box nature of existing models limits their use in real-world scenarios, where explanations are required for model decisions. To alleviate these issues, we introduce a new ADD model that explicitly uses the StyleLInguistics Mismatch (SLIM) in fake speech to separate them from real speech. SLIM first employs self-supervised pretraining on only real samples to learn the style-linguistics dependency in the real class. The learned features are then used in complement with standard pretrained acoustic features (e.g., Wav2vec) to learn a classifier on the real and fake classes. When the feature encoders are frozen, SLIM outperforms benchmark methods on out-of-domain datasets while achieving competitive results on in-domain data. The features learned by SLIM allow us to quantify the (mis)match between style and linguistic content in a sample, hence facilitating an explanation of the model decision.
AutoMix: Automatically Mixing Language Models
Authors: Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, Mausam
Abstract: Large language models (LLMs) are now available from cloud API providers in various sizes and configurations. While this diversity offers a broad spectrum of choices, effectively leveraging the options to optimize computational cost and performance remains challenging. In this work, we present Automix, an approach that strategically routes queries to larger LMs, based on the approximate correctness of outputs from a smaller LM. Central to Automix are two key technical contributions. First, it has a few-shot self-verification mechanism, which estimates the reliability of its own outputs without requiring extensive training. Second, given that self-verification can be noisy, it employs a POMDP based router that can effectively select an appropriately sized model, based on answer confidence. Experiments across five language models and five challenging datasets show that Automix consistently surpasses strong baselines, reducing computational cost by over 50% for comparable performance.
Spectral Learning of Shared Dynamics Between Generalized-Linear Processes
Authors: Lucine L Oganesian, Omid G. Sani, Maryam Shanechi
Abstract: Across various science and engineering applications, there often arises a need to predict the dynamics of one data stream from another. Further, these data streams may have different statistical properties. Studying the dynamical relationship between such processes, especially for the purpose of predicting one from the other, requires accounting for their distinct statistics while also dissociating their shared dynamical subspace. Existing analytical modeling approaches, however, do not address both of these needs. Here we propose a path forward by deriving a novel analytical multi-step subspace identification algorithm that can learn a model for a primary generalized-linear process (called “predictor”), while also dissociating the dynamics shared with a secondary process. We demonstrate a specific application of our approach for modeling discrete Poisson point-processes activity, while finding the dynamics shared with continuous Gaussian processes. In simulations, we show that our algorithm accurately prioritizes identification of shared dynamics. Further, we also demonstrate that the method can additionally model the disjoint dynamics that exist only in the predictor Poisson data stream, if desired. Similarly, we apply our algorithm on a biological dataset to learn models of dynamics in Poisson neural population spiking streams that predict dynamics in movement streams. Compared with existing Poisson subspace identification methods, models learned with our method decoded movements better and with lower-dimensional latent states. Lastly, we discuss regimes in which our assumptions might not be met and provide recommendations and possible future directions of investigation.
Proportional Fairness in Non-Centroid Clustering
Authors: Ioannis Caragiannis, Evi Micha, Nisarg Shah
Abstract: We revisit the recently developed framework of proportionally fair clustering, where the goal is to provide group fairness guarantees that become stronger for groups of data points (agents) that are large and cohesive. Prior work applies this framework to centroid clustering, where the loss of an agent is its distance to the centroid assigned to its cluster. We expand the framework to non-centroid clustering, where the loss of an agent is a function of the other agents in its cluster, by adapting two proportional fairness criteria – the core and its relaxation, fully justified representation (FJR) – to this setting. We show that the core can be approximated only under structured loss functions, and even then, the best approximation we are able to establish, using an adaptation of the GreedyCapture algorithm developed for centroid clustering [Chen et al., 2019; Micha and Shah, 2020], is unappealing for a natural loss function. In contrast, we design a new (inefficient) algorithm, GreedyCohesiveClustering, which achieves the relaxation FJR exactly under arbitrary loss functions, and show that the efficient GreedyCapture algorithm achieves a constant approximation of FJR. We also design an efficient auditing algorithm, which estimates the FJR approximation of any given clustering solution up to a constant factor. Our experiments on real data suggest that traditional clustering algorithms are highly unfair, whereas GreedyCapture is considerably fairer and incurs only a modest loss in common clustering objectives.
On Tractable -Equilibria in Non-Concave Games
Authors: Yang Cai, Constantinos Costis Daskalakis, Haipeng Luo, Chen-Yu Wei, Weiqiang Zheng
Abstract: While Online Gradient Descent and other no-regret learning procedures are known to efficiently converge to a coarse correlated equilibrium in games where each agent’s utility is concave in their own strategy, this is not the case when utilities are non-concave — a common scenario in machine learning applications involving strategies parameterized by deep neural networks, or when agents’ utilities are computed by neural networks, or both. Non-concave games introduce significant game-theoretic and optimization challenges: (i) Nash equilibria may not exist; (ii) local Nash equilibria, though existing, are intractable; and (iii) mixed Nash, correlated, and coarse correlated equilibria generally have infinite support and are intractable. To sidestep these challenges, we revisit the classical solution concept of Φ-equilibria introduced by Greenwald and Jafari [2003], which is guaranteed to exist for an arbitrary set of strategy modifications Φ even in non-concave games [Stoltz and Lugosi, 2007]. However, the tractability of Φ-equilibria in such games remains elusive. In this paper, we initiate the study of tractable Φ-equilibria in non-concave games and examine several natural families of strategy modifications. We show that when Φ is finite, there exists an efficient uncoupled learning algorithm that converges to the corresponding Φ-equilibria. Additionally, we explore cases where Φ is infinite but consists of local modifications, showing that Online Gradient Descent can efficiently approximate Φ-equilibria in non-trivial regimes.
USCILab3D: A Large-scale, Long-term, Semantically Annotated Outdoor Dataset
Authors: Kiran Lekkala, Henghui Bao, Peixu Cai, Wei Zer Lim, Chen Liu, Laurent Itti
Abstract: In this paper, we introduce the \textbf{USCILab3D dataset}, a large-scale, annotated outdoor dataset designed for versatile applications across multiple domains, including computer vision, robotics, and machine learning. The dataset was acquired using a mobile robot equipped with 5 cameras and a 32-beam, 360 scanning LIDAR. The robot was teleoperated, over the course of a year and under a variety of weather and lighting conditions, through a rich variety of paths within the USC campus (229 acres = ∼ 92.7 hectares). The raw data was annotated using state-of-the-art large foundation models, and processed to provide multi-view imagery, 3D reconstructions, semantically-annotated images and point clouds (267 semantic categories), and text descriptions of images and objects within. The dataset also offers a diverse array of complex analyses using pose-stamping and trajectory data. In sum, the dataset offers 1.4M point clouds and 10M images (∼ 6 TB of data). Despite covering a narrower geographical scope compared to a whole-city dataset, our dataset prioritizes intricate intersections along with denser multi-view scene images and semantic point clouds, enabling more precise 3D labelling and facilitating a broader spectrum of 3D vision tasks. For data, code and more details, please visit our website.
Published on December 5th, 2024
Last updated on December 10th, 2024