arXiv daily

Machine Learning (cs.LG)

Thu, 01 Jun 2023

Other arXiv digests in this category:Thu, 14 Sep 2023; Wed, 13 Sep 2023; Tue, 12 Sep 2023; Mon, 11 Sep 2023; Fri, 08 Sep 2023; Tue, 05 Sep 2023; Fri, 01 Sep 2023; Thu, 31 Aug 2023; Wed, 30 Aug 2023; Tue, 29 Aug 2023; Mon, 28 Aug 2023; Fri, 25 Aug 2023; Thu, 24 Aug 2023; Wed, 23 Aug 2023; Tue, 22 Aug 2023; Mon, 21 Aug 2023; Fri, 18 Aug 2023; Thu, 17 Aug 2023; Wed, 16 Aug 2023; Tue, 15 Aug 2023; Mon, 14 Aug 2023; Fri, 11 Aug 2023; Thu, 10 Aug 2023; Wed, 09 Aug 2023; Tue, 08 Aug 2023; Mon, 07 Aug 2023; Fri, 04 Aug 2023; Thu, 03 Aug 2023; Wed, 02 Aug 2023; Tue, 01 Aug 2023; Mon, 31 Jul 2023; Fri, 28 Jul 2023; Thu, 27 Jul 2023; Wed, 26 Jul 2023; Tue, 25 Jul 2023; Mon, 24 Jul 2023; Fri, 21 Jul 2023; Thu, 20 Jul 2023; Wed, 19 Jul 2023; Tue, 18 Jul 2023; Mon, 17 Jul 2023; Fri, 14 Jul 2023; Thu, 13 Jul 2023; Wed, 12 Jul 2023; Tue, 11 Jul 2023; Mon, 10 Jul 2023; Fri, 07 Jul 2023; Thu, 06 Jul 2023; Wed, 05 Jul 2023; Tue, 04 Jul 2023; Mon, 03 Jul 2023; Fri, 30 Jun 2023; Thu, 29 Jun 2023; Wed, 28 Jun 2023; Tue, 27 Jun 2023; Mon, 26 Jun 2023; Fri, 23 Jun 2023; Thu, 22 Jun 2023; Wed, 21 Jun 2023; Tue, 20 Jun 2023; Fri, 16 Jun 2023; Thu, 15 Jun 2023; Tue, 13 Jun 2023; Mon, 12 Jun 2023; Fri, 09 Jun 2023; Thu, 08 Jun 2023; Wed, 07 Jun 2023; Tue, 06 Jun 2023; Mon, 05 Jun 2023; Fri, 02 Jun 2023; Wed, 31 May 2023; Tue, 30 May 2023; Mon, 29 May 2023; Fri, 26 May 2023; Thu, 25 May 2023; Wed, 24 May 2023; Tue, 23 May 2023; Mon, 22 May 2023; Fri, 19 May 2023; Thu, 18 May 2023; Wed, 17 May 2023; Tue, 16 May 2023; Mon, 15 May 2023; Fri, 12 May 2023; Thu, 11 May 2023; Wed, 10 May 2023; Tue, 09 May 2023; Mon, 08 May 2023; Fri, 05 May 2023; Thu, 04 May 2023; Wed, 03 May 2023; Tue, 02 May 2023; Mon, 01 May 2023; Fri, 28 Apr 2023; Thu, 27 Apr 2023; Wed, 26 Apr 2023; Tue, 25 Apr 2023; Mon, 24 Apr 2023; Fri, 21 Apr 2023; Thu, 20 Apr 2023; Wed, 19 Apr 2023; Tue, 18 Apr 2023; Mon, 17 Apr 2023; Fri, 14 Apr 2023; Thu, 13 Apr 2023; Wed, 12 Apr 2023; Tue, 11 Apr 2023; Mon, 10 Apr 2023
1.Improving Energy Conserving Descent for Machine Learning: Theory and Practice

Authors:G. Bruno De Luca, Alice Gatti, Eva Silverstein

Abstract: We develop the theory of Energy Conserving Descent (ECD) and introduce ECDSep, a gradient-based optimization algorithm able to tackle convex and non-convex optimization problems. The method is based on the novel ECD framework of optimization as physical evolution of a suitable chaotic energy-conserving dynamical system, enabling analytic control of the distribution of results - dominated at low loss - even for generic high-dimensional problems with no symmetries. Compared to previous realizations of this idea, we exploit the theoretical control to improve both the dynamics and chaos-inducing elements, enhancing performance while simplifying the hyper-parameter tuning of the optimization algorithm targeted to different classes of problems. We empirically compare with popular optimization methods such as SGD, Adam and AdamW on a wide range of machine learning problems, finding competitive or improved performance compared to the best among them on each task. We identify limitations in our analysis pointing to possibilities for additional improvements.

2.Regularizing Towards Soft Equivariance Under Mixed Symmetries

Authors:Hyunsu Kim, Hyungi Lee, Hongseok Yang, Juho Lee

Abstract: Datasets often have their intrinsic symmetries, and particular deep-learning models called equivariant or invariant models have been developed to exploit these symmetries. However, if some or all of these symmetries are only approximate, which frequently happens in practice, these models may be suboptimal due to the architectural restrictions imposed on them. We tackle this issue of approximate symmetries in a setup where symmetries are mixed, i.e., they are symmetries of not single but multiple different types and the degree of approximation varies across these types. Instead of proposing a new architectural restriction as in most of the previous approaches, we present a regularizer-based method for building a model for a dataset with mixed approximate symmetries. The key component of our method is what we call equivariance regularizer for a given type of symmetries, which measures how much a model is equivariant with respect to the symmetries of the type. Our method is trained with these regularizers, one per each symmetry type, and the strength of the regularizers is automatically tuned during training, leading to the discovery of the approximation levels of some candidate symmetry types without explicit supervision. Using synthetic function approximation and motion forecasting tasks, we demonstrate that our method achieves better accuracy than prior approaches while discovering the approximate symmetry levels correctly.

3.Learning Gaussian Mixture Representations for Tensor Time Series Forecasting

Authors:Jiewen Deng, Jinliang Deng, Renhe Jiang, Xuan Song

Abstract: Tensor time series (TTS) data, a generalization of one-dimensional time series on a high-dimensional space, is ubiquitous in real-world scenarios, especially in monitoring systems involving multi-source spatio-temporal data (e.g., transportation demands and air pollutants). Compared to modeling time series or multivariate time series, which has received much attention and achieved tremendous progress in recent years, tensor time series has been paid less effort. Properly coping with the tensor time series is a much more challenging task, due to its high-dimensional and complex inner structure. In this paper, we develop a novel TTS forecasting framework, which seeks to individually model each heterogeneity component implied in the time, the location, and the source variables. We name this framework as GMRL, short for Gaussian Mixture Representation Learning. Experiment results on two real-world TTS datasets verify the superiority of our approach compared with the state-of-the-art baselines.

4.Coneheads: Hierarchy Aware Attention

Authors:Albert Tseng, Tao Yu, Toni J. B. Liu, Christopher De Sa

Abstract: Attention networks such as transformers have achieved state-of-the-art performance in many domains. These networks rely heavily on the dot product attention operator, which computes the similarity between two points by taking their inner product. However, the inner product does not explicitly model the complex structural properties of real world datasets, such as hierarchies between data points. To remedy this, we introduce cone attention, a drop-in replacement for dot product attention based on hyperbolic entailment cones. Cone attention associates two points by the depth of their lowest common ancestor in a hierarchy defined by hyperbolic cones, which intuitively measures the divergence of two points and gives a hierarchy aware similarity score. We test cone attention on a wide variety of models and tasks and show that it improves task-level performance over dot product attention and other baselines, and is able to match dot-product attention with significantly fewer parameters. Our results suggest that cone attention is an effective way to capture hierarchical relationships when calculating attention.

5.Faster Robust Tensor Power Method for Arbitrary Order

Authors:Yichuan Deng, Zhao Song, Junze Yin

Abstract: Tensor decomposition is a fundamental method used in various areas to deal with high-dimensional data. \emph{Tensor power method} (TPM) is one of the widely-used techniques in the decomposition of tensors. This paper presents a novel tensor power method for decomposing arbitrary order tensors, which overcomes limitations of existing approaches that are often restricted to lower-order (less than $3$) tensors or require strong assumptions about the underlying data structure. We apply sketching method, and we are able to achieve the running time of $\widetilde{O}(n^{p-1})$, on the power $p$ and dimension $n$ tensor. We provide a detailed analysis for any $p$-th order tensor, which is never given in previous works.

6.Out-of-distribution forgetting: vulnerability of continual learning to intra-class distribution shift

Authors:Liangxuan Guo, Yang Chen, Shan Yu

Abstract: Continual learning (CL) is an important technique to allow artificial neural networks to work in open environments. CL enables a system to learn new tasks without severe interference to its performance on old tasks, i.e., overcome the problems of catastrophic forgetting. In joint learning, it is well known that the out-of-distribution (OOD) problem caused by intentional attacks or environmental perturbations will severely impair the ability of networks to generalize. In this work, we reported a special form of catastrophic forgetting raised by the OOD problem in continual learning settings, and we named it out-of-distribution forgetting (OODF). In continual image classification tasks, we found that for a given category, introducing an intra-class distribution shift significantly impaired the recognition accuracy of CL methods for that category during subsequent learning. Interestingly, this phenomenon is special for CL as the same level of distribution shift had only negligible effects in the joint learning scenario. We verified that CL methods without dedicating subnetworks for individual tasks are all vulnerable to OODF. Moreover, OODF does not depend on any specific way of shifting the distribution, suggesting it is a risk for CL in a wide range of circumstances. Taken together, our work identified an under-attended risk during CL, highlighting the importance of developing approaches that can overcome OODF.

7.MindBigData 2023 MNIST-8B The 8 billion datapoints Multimodal Dataset of Brain Signals

Authors:David Vivancos

Abstract: MindBigData 2023 MNIST-8B is the largest, to date (June 1st 2023), brain signals open dataset created for Machine Learning, based on EEG signals from a single subject captured using a custom 128 channels device, replicating the full 70,000 digits from Yaan LeCun et all MNIST dataset. The brain signals were captured while the subject was watching the pixels of the original digits one by one on a screen and listening at the same time to the spoken number 0 to 9 from the real label. The data, collection procedures, hardware and software created are described in detail, background extra information and other related datasets can be found at our previous paper MindBigData 2022: A Large Dataset of Brain Signals.

8.Reconstructing Graph Diffusion History from a Single Snapshot

Authors:Ruizhong Qiu, Dingsu Wang, Lei Ying, H. Vincent Poor, Yifang Zhang, Hanghang Tong

Abstract: Diffusion on graphs is ubiquitous with numerous high-impact applications. In these applications, complete diffusion histories play an essential role in terms of identifying dynamical patterns, reflecting on precaution actions, and forecasting intervention effects. Despite their importance, complete diffusion histories are rarely available and are highly challenging to reconstruct due to ill-posedness, explosive search space, and scarcity of training data. To date, few methods exist for diffusion history reconstruction. They are exclusively based on the maximum likelihood estimation (MLE) formulation and require to know true diffusion parameters. In this paper, we study an even harder problem, namely reconstructing Diffusion history from A single SnapsHot} (DASH), where we seek to reconstruct the history from only the final snapshot without knowing true diffusion parameters. We start with theoretical analyses that reveal a fundamental limitation of the MLE formulation. We prove: (a) estimation error of diffusion parameters is unavoidable due to NP-hardness of diffusion parameter estimation, and (b) the MLE formulation is sensitive to estimation error of diffusion parameters. To overcome the inherent limitation of the MLE formulation, we propose a novel barycenter formulation: finding the barycenter of the posterior distribution of histories, which is provably stable against the estimation error of diffusion parameters. We further develop an effective solver named DIffusion hiTting Times with Optimal proposal (DITTO) by reducing the problem to estimating posterior expected hitting times via the Metropolis--Hastings Markov chain Monte Carlo method (M--H MCMC) and employing an unsupervised graph neural network to learn an optimal proposal to accelerate the convergence of M--H MCMC. We conduct extensive experiments to demonstrate the efficacy of the proposed method.

9.The Risks of Recourse in Binary Classification

Authors:Hidde Fokkema, Damien Garreau, Tim van Erven

Abstract: Algorithmic recourse provides explanations that help users overturn an unfavorable decision by a machine learning system. But so far very little attention has been paid to whether providing recourse is beneficial or not. We introduce an abstract learning-theoretic framework that compares the risks (i.e. expected losses) for classification with and without algorithmic recourse. This allows us to answer the question of when providing recourse is beneficial or harmful at the population level. Surprisingly, we find that there are many plausible scenarios in which providing recourse turns out to be harmful, because it pushes users to regions of higher class uncertainty and therefore leads to more mistakes. We further study whether the party deploying the classifier has an incentive to strategize in anticipation of having to provide recourse, and we find that sometimes they do, to the detriment of their users. Providing algorithmic recourse may therefore also be harmful at the systemic level. We confirm our theoretical findings in experiments on simulated and real-world data. All in all, we conclude that the current concept of algorithmic recourse is not reliably beneficial, and therefore requires rethinking.

10.A New PHO-rmula for Improved Performance of Semi-Structured Networks

Authors:David Rügamer

Abstract: Recent advances to combine structured regression models and deep neural networks for better interpretability, more expressiveness, and statistically valid uncertainty quantification demonstrate the versatility of semi-structured neural networks (SSNs). We show that techniques to properly identify the contributions of the different model components in SSNs, however, lead to suboptimal network estimation, slower convergence, and degenerated or erroneous predictions. In order to solve these problems while preserving favorable model properties, we propose a non-invasive post-hoc orthogonalization (PHO) that guarantees identifiability of model components and provides better estimation and prediction quality. Our theoretical findings are supported by numerical experiments, a benchmark comparison as well as a real-world application to COVID-19 infections.

11.Neuronal Cell Type Classification using Deep Learning

Authors:Ofek Ophir, Orit Shefi, Ofir Lindenbaum

Abstract: The brain is likely the most complex organ, given the variety of functions it controls, the number of cells it comprises, and their corresponding diversity. Studying and identifying neurons, the brain's primary building blocks, is a crucial milestone and essential for understanding brain function in health and disease. Recent developments in machine learning have provided advanced abilities for classifying neurons. However, these methods remain black boxes with no explainability and reasoning. This paper aims to provide a robust and explainable deep-learning framework to classify neurons based on their electrophysiological activity. Our analysis is performed on data provided by the Allen Cell Types database containing a survey of biological features derived from single-cell recordings of mice and humans. First, we classify neuronal cell types of mice data to identify excitatory and inhibitory neurons. Then, neurons are categorized to their broad types in humans using domain adaptation from mice data. Lastly, neurons are classified into sub-types based on transgenic mouse lines using deep neural networks in an explainable fashion. We show state-of-the-art results in a dendrite-type classification of excitatory vs. inhibitory neurons and transgenic mouse lines classification. The model is also inherently interpretable, revealing the correlations between neuronal types and their electrophysiological properties.

12.A Call for Standardization and Validation of Text Style Transfer Evaluation

Authors:Phil Ostheimer, Mayank Nagda, Marius Kloft, Sophie Fellenz

Abstract: Text Style Transfer (TST) evaluation is, in practice, inconsistent. Therefore, we conduct a meta-analysis on human and automated TST evaluation and experimentation that thoroughly examines existing literature in the field. The meta-analysis reveals a substantial standardization gap in human and automated evaluation. In addition, we also find a validation gap: only few automated metrics have been validated using human experiments. To this end, we thoroughly scrutinize both the standardization and validation gap and reveal the resulting pitfalls. This work also paves the way to close the standardization and validation gap in TST evaluation by calling out requirements to be met by future research.

13.Chain-Of-Thought Prompting Under Streaming Batch: A Case Study

Authors:Yuxin Tang

Abstract: Recently, Large Language Models (LLMs) have demonstrated remarkable capabilities. Chain-of-Thought (CoT) has been proposed as a way of assisting LLMs in performing complex reasoning. However, developing effective prompts can be a challenging and labor-intensive task. Many studies come out of some way to automatically construct CoT from test data. Most of them assume that all test data is visible before testing and only select a small subset to generate rationales, which is an unrealistic assumption. In this paper, we present a case study on how to construct and optimize chain-of-thought prompting using batch data in streaming settings.

14.Hinge-Wasserstein: Mitigating Overconfidence in Regression by Classification

Authors:Ziliang Xiong, Abdelrahman Eldesokey, Joakim Johnander, Bastian Wandt, Per-Erik Forssen

Abstract: Modern deep neural networks are prone to being overconfident despite their drastically improved performance. In ambiguous or even unpredictable real-world scenarios, this overconfidence can pose a major risk to the safety of applications. For regression tasks, the regression-by-classification approach has the potential to alleviate these ambiguities by instead predicting a discrete probability density over the desired output. However, a density estimator still tends to be overconfident when trained with the common NLL loss. To mitigate the overconfidence problem, we propose a loss function, hinge-Wasserstein, based on the Wasserstein Distance. This loss significantly improves the quality of both aleatoric and epistemic uncertainty, compared to previous work. We demonstrate the capabilities of the new loss on a synthetic dataset, where both types of uncertainty are controlled separately. Moreover, as a demonstration for real-world scenarios, we evaluate our approach on the benchmark dataset Horizon Lines in the Wild. On this benchmark, using the hinge-Wasserstein loss reduces the Area Under Sparsification Error (AUSE) for horizon parameters slope and offset, by 30.47% and 65.00%, respectively.

15.TorchRL: A data-driven decision-making library for PyTorch

Authors:Albert Bou, Matteo Bettini, Sebastian Dittert, Vikash Kumar, Shagun Sodhani, Xiaomeng Yang, Gianni De Fabritiis, Vincent Moens

Abstract: Striking a balance between integration and modularity is crucial for a machine learning library to be versatile and user-friendly, especially in handling decision and control tasks that involve large development teams and complex, real-world data, and environments. To address this issue, we propose TorchRL, a generalistic control library for PyTorch that provides well-integrated, yet standalone components. With a versatile and robust primitive design, TorchRL facilitates streamlined algorithm development across the many branches of Reinforcement Learning (RL) and control. We introduce a new PyTorch primitive, TensorDict, as a flexible data carrier that empowers the integration of the library's components while preserving their modularity. Hence replay buffers, datasets, distributed data collectors, environments, transforms and objectives can be effortlessly used in isolation or combined. We provide a detailed description of the building blocks, supporting code examples and an extensive overview of the library across domains and tasks. Finally, we show comparative benchmarks to demonstrate its computational efficiency. TorchRL fosters long-term support and is publicly available on GitHub for greater reproducibility and collaboration within the research community. The code is opensourced on https://github.com/pytorch/rl.

16.Does Black-box Attribute Inference Attacks on Graph Neural Networks Constitute Privacy Risk?

Authors:Iyiola E. Olatunji, Anmar Hizber, Oliver Sihlovec, Megha Khosla

Abstract: Graph neural networks (GNNs) have shown promising results on real-life datasets and applications, including healthcare, finance, and education. However, recent studies have shown that GNNs are highly vulnerable to attacks such as membership inference attack and link reconstruction attack. Surprisingly, attribute inference attacks has received little attention. In this paper, we initiate the first investigation into attribute inference attack where an attacker aims to infer the sensitive user attributes based on her public or non-sensitive attributes. We ask the question whether black-box attribute inference attack constitutes a significant privacy risk for graph-structured data and their corresponding GNN model. We take a systematic approach to launch the attacks by varying the adversarial knowledge and assumptions. Our findings reveal that when an attacker has black-box access to the target model, GNNs generally do not reveal significantly more information compared to missing value estimation techniques. Code is available.

17.Anomaly Detection with Variance Stabilized Density Estimation

Authors:Amit Rozner, Barak Battash, Henry Li, Lior Wolf, Ofir Lindenbaum

Abstract: Density estimation based anomaly detection schemes typically model anomalies as examples that reside in low-density regions. We propose a modified density estimation problem and demonstrate its effectiveness for anomaly detection. Specifically, we assume the density function of normal samples is uniform in some compact domain. This assumption implies the density function is more stable (with lower variance) around normal samples than anomalies. We first corroborate this assumption empirically using a wide range of real-world data. Then, we design a variance stabilized density estimation problem for maximizing the likelihood of the observed samples while minimizing the variance of the density around normal samples. We introduce an ensemble of autoregressive models to learn the variance stabilized distribution. Finally, we perform an extensive benchmark with 52 datasets demonstrating that our method leads to state-of-the-art results while alleviating the need for data-specific hyperparameter tuning.

18.Causal Imitability Under Context-Specific Independence Relations

Authors:Fateme Jamshidi, Sina Akbari, Negar Kiyavash

Abstract: Drawbacks of ignoring the causal mechanisms when performing imitation learning have recently been acknowledged. Several approaches both to assess the feasibility of imitation and to circumvent causal confounding and causal misspecifications have been proposed in the literature. However, the potential benefits of the incorporation of additional information about the underlying causal structure are left unexplored. An example of such overlooked information is context-specific independence (CSI), i.e., independence that holds only in certain contexts. We consider the problem of causal imitation learning when CSI relations are known. We prove that the decision problem pertaining to the feasibility of imitation in this setting is NP-hard. Further, we provide a necessary graphical criterion for imitation learning under CSI and show that under a structural assumption, this criterion is also sufficient. Finally, we propose a sound algorithmic approach for causal imitation learning which takes both CSI relations and data into account.

19.Evaluating the "Learning on Graphs" Conference Experience

Authors:Bastian Rieck, Corinna Coupette

Abstract: With machine learning conferences growing ever larger, and reviewing processes becoming increasingly elaborate, more data-driven insights into their workings are required. In this report, we present the results of a survey accompanying the first "Learning on Graphs" (LoG) Conference. The survey was directed to evaluate the submission and review process from different perspectives, including authors, reviewers, and area chairs alike.

20.Rotating Features for Object Discovery

Authors:Sindy Löwe, Phillip Lippe, Francesco Locatello, Max Welling

Abstract: The binding problem in human cognition, concerning how the brain represents and connects objects within a fixed network of neural connections, remains a subject of intense debate. Most machine learning efforts addressing this issue in an unsupervised setting have focused on slot-based methods, which may be limiting due to their discrete nature and difficulty to express uncertainty. Recently, the Complex AutoEncoder was proposed as an alternative that learns continuous and distributed object-centric representations. However, it is only applicable to simple toy data. In this paper, we present Rotating Features, a generalization of complex-valued features to higher dimensions, and a new evaluation procedure for extracting objects from distributed representations. Additionally, we show the applicability of our approach to pre-trained features. Together, these advancements enable us to scale distributed object-centric representations from simple toy to real-world data. We believe this work advances a new paradigm for addressing the binding problem in machine learning and has the potential to inspire further innovation in the field.

21.Safe Offline Reinforcement Learning with Real-Time Budget Constraints

Authors:Qian Lin, Bo Tang, Zifan Wu, Chao Yu, Shangqin Mao, Qianlong Xie, Xingxing Wang, Dong Wang

Abstract: Aiming at promoting the safe real-world deployment of Reinforcement Learning (RL), research on safe RL has made significant progress in recent years. However, most existing works in the literature still focus on the online setting where risky violations of the safety budget are likely to be incurred during training. Besides, in many real-world applications, the learned policy is required to respond to dynamically determined safety budgets (i.e., constraint threshold) in real time. In this paper, we target at the above real-time budget constraint problem under the offline setting, and propose Trajectory-based REal-time Budget Inference (TREBI) as a novel solution that approaches this problem from the perspective of trajectory distribution. Theoretically, we prove an error bound of the estimation on the episodic reward and cost under the offline setting and thus provide a performance guarantee for TREBI. Empirical results on a wide range of simulation tasks and a real-world large-scale advertising application demonstrate the capability of TREBI in solving real-time budget constraint problems under offline settings.

22.FACT: Federated Adversarial Cross Training

Authors:Stefan Schrod, Jonas Lippl, Andreas Schäfer, Michael Altenbuchinger

Abstract: Federated Learning (FL) facilitates distributed model development to aggregate multiple confidential data sources. The information transfer among clients can be compromised by distributional differences, i.e., by non-i.i.d. data. A particularly challenging scenario is the federated model adaptation to a target client without access to annotated data. We propose Federated Adversarial Cross Training (FACT), which uses the implicit domain differences between source clients to identify domain shifts in the target domain. In each round of FL, FACT cross initializes a pair of source clients to generate domain specialized representations which are then used as a direct adversary to learn a domain invariant data representation. We empirically show that FACT outperforms state-of-the-art federated, non-federated and source-free domain adaptation models on three popular multi-source-single-target benchmarks, and state-of-the-art Unsupervised Domain Adaptation (UDA) models on single-source-single-target experiments. We further study FACT's behavior with respect to communication restrictions and the number of participating clients.

23.OTW: Optimal Transport Warping for Time Series

Authors:Fabian Latorre, Chenghao Liu, Doyen Sahoo, Steven C. H. Hoi

Abstract: Dynamic Time Warping (DTW) has become the pragmatic choice for measuring distance between time series. However, it suffers from unavoidable quadratic time complexity when the optimal alignment matrix needs to be computed exactly. This hinders its use in deep learning architectures, where layers involving DTW computations cause severe bottlenecks. To alleviate these issues, we introduce a new metric for time series data based on the Optimal Transport (OT) framework, called Optimal Transport Warping (OTW). OTW enjoys linear time/space complexity, is differentiable and can be parallelized. OTW enjoys a moderate sensitivity to time and shape distortions, making it ideal for time series. We show the efficacy and efficiency of OTW on 1-Nearest Neighbor Classification and Hierarchical Clustering, as well as in the case of using OTW instead of DTW in Deep Learning architectures.

24.Identifiability and Generalizability in Constrained Inverse Reinforcement Learning

Authors:Andreas Schlaginhaufen, Maryam Kamgarpour

Abstract: Two main challenges in Reinforcement Learning (RL) are designing appropriate reward functions and ensuring the safety of the learned policy. To address these challenges, we present a theoretical framework for Inverse Reinforcement Learning (IRL) in constrained Markov decision processes. From a convex-analytic perspective, we extend prior results on reward identifiability and generalizability to both the constrained setting and a more general class of regularizations. In particular, we show that identifiability up to potential shaping (Cao et al., 2021) is a consequence of entropy regularization and may generally no longer hold for other regularizations or in the presence of safety constraints. We also show that to ensure generalizability to new transition laws and constraints, the true reward must be identified up to a constant. Additionally, we derive a finite sample guarantee for the suboptimality of the learned rewards, and validate our results in a gridworld environment.

25.TriSig: Assessing the statistical significance of triclusters

Authors:Leonardo Alexandre, Rafael S. Costa, Rui Henriques

Abstract: Tensor data analysis allows researchers to uncover novel patterns and relationships that cannot be obtained from matrix data alone. The information inferred from the patterns provides valuable insights into disease progression, bioproduction processes, weather fluctuations, and group dynamics. However, spurious and redundant patterns hamper this process. This work aims at proposing a statistical frame to assess the probability of patterns in tensor data to deviate from null expectations, extending well-established principles for assessing the statistical significance of patterns in matrix data. A comprehensive discussion on binomial testing for false positive discoveries is entailed at the light of: variable dependencies, temporal dependencies and misalignments, and \textit{p}-value corrections under the Benjamini-Hochberg procedure. Results gathered from the application of state-of-the-art triclustering algorithms over distinct real-world case studies in biochemical and biotechnological domains confer validity to the proposed statistical frame while revealing vulnerabilities of some triclustering searches. The proposed assessment can be incorporated into existing triclustering algorithms to mitigate false positive/spurious discoveries and further prune the search space, reducing their computational complexity. Availability: The code is freely available at https://github.com/JupitersMight/TriSig under the MIT license.

26.Learning Prescriptive ReLU Networks

Authors:Wei Sun, Asterios Tsiourvas

Abstract: We study the problem of learning optimal policy from a set of discrete treatment options using observational data. We propose a piecewise linear neural network model that can balance strong prescriptive performance and interpretability, which we refer to as the prescriptive ReLU network, or P-ReLU. We show analytically that this model (i) partitions the input space into disjoint polyhedra, where all instances that belong to the same partition receive the same treatment, and (ii) can be converted into an equivalent prescriptive tree with hyperplane splits for interpretability. We demonstrate the flexibility of the P-ReLU network as constraints can be easily incorporated with minor modifications to the architecture. Through experiments, we validate the superior prescriptive accuracy of P-ReLU against competing benchmarks. Lastly, we present examples of interpretable prescriptive trees extracted from trained P-ReLUs using a real-world dataset, for both the unconstrained and constrained scenarios.

27.Normalization Enhances Generalization in Visual Reinforcement Learning

Authors:Lu Li, Jiafei Lyu, Guozheng Ma, Zilin Wang, Zhenjie Yang, Xiu Li, Zhiheng Li

Abstract: Recent advances in visual reinforcement learning (RL) have led to impressive success in handling complex tasks. However, these methods have demonstrated limited generalization capability to visual disturbances, which poses a significant challenge for their real-world application and adaptability. Though normalization techniques have demonstrated huge success in supervised and unsupervised learning, their applications in visual RL are still scarce. In this paper, we explore the potential benefits of integrating normalization into visual RL methods with respect to generalization performance. We find that, perhaps surprisingly, incorporating suitable normalization techniques is sufficient to enhance the generalization capabilities, without any additional special design. We utilize the combination of two normalization techniques, CrossNorm and SelfNorm, for generalizable visual RL. Extensive experiments are conducted on DMControl Generalization Benchmark and CARLA to validate the effectiveness of our method. We show that our method significantly improves generalization capability while only marginally affecting sample efficiency. In particular, when integrated with DrQ-v2, our method enhances the test performance of DrQ-v2 on CARLA across various scenarios, from 14% of the training performance to 97%.

28.Balanced Training of Energy-Based Models with Adaptive Flow Sampling

Authors:Louis Grenioux, Éric Moulines, Marylou Gabrié

Abstract: Energy-based models (EBMs) are versatile density estimation models that directly parameterize an unnormalized log density. Although very flexible, EBMs lack a specified normalization constant of the model, making the likelihood of the model computationally intractable. Several approximate samplers and variational inference techniques have been proposed to estimate the likelihood gradients for training. These techniques have shown promising results in generating samples, but little attention has been paid to the statistical accuracy of the estimated density, such as determining the relative importance of different classes in a dataset. In this work, we propose a new maximum likelihood training algorithm for EBMs that uses a different type of generative model, normalizing flows (NF), which have recently been proposed to facilitate sampling. Our method fits an NF to an EBM during training so that an NF-assisted sampling scheme provides an accurate gradient for the EBMs at all times, ultimately leading to a fast sampler for generating new data.

29.Adversarial Robustness in Unsupervised Machine Learning: A Systematic Review

Authors:Mathias Lundteigen Mohus, Jinyue Li

Abstract: As the adoption of machine learning models increases, ensuring robust models against adversarial attacks is increasingly important. With unsupervised machine learning gaining more attention, ensuring it is robust against attacks is vital. This paper conducts a systematic literature review on the robustness of unsupervised learning, collecting 86 papers. Our results show that most research focuses on privacy attacks, which have effective defenses; however, many attacks lack effective and general defensive measures. Based on the results, we formulate a model on the properties of an attack on unsupervised learning, contributing to future research by providing a model to use.

30.Prediction of Post-Operative Renal and Pulmonary Complication Using Transformers

Authors:Reza Shirkavand, Fei Zhang, Heng Huang

Abstract: Postoperative complications pose a significant challenge in the healthcare industry, resulting in elevated healthcare expenses and prolonged hospital stays, and in rare instances, patient mortality. To improve patient outcomes and reduce healthcare costs, healthcare providers rely on various perioperative risk scores to guide clinical decisions and prioritize care. In recent years, machine learning techniques have shown promise in predicting postoperative complications and fatality, with deep learning models achieving remarkable success in healthcare applications. However, research on the application of deep learning models to intra-operative anesthesia management data is limited. In this paper, we evaluate the performance of transformer-based models in predicting postoperative acute renal failure, postoperative pulmonary complications, and postoperative in-hospital mortality. We compare our method's performance with state-of-the-art tabular data prediction models, including gradient boosting trees and sequential attention models, on a clinical dataset. Our results demonstrate that transformer-based models can achieve superior performance in predicting postoperative complications and outperform traditional machine learning models. This work highlights the potential of deep learning techniques, specifically transformer-based models, in revolutionizing the healthcare industry's approach to postoperative care.

31.Spreads in Effective Learning Rates: The Perils of Batch Normalization During Early Training

Authors:Christian H. X. Ali Mehmeti-Göpel, Michael Wand

Abstract: Excursions in gradient magnitude pose a persistent challenge when training deep networks. In this paper, we study the early training phases of deep normalized ReLU networks, accounting for the induced scale invariance by examining effective learning rates (LRs). Starting with the well-known fact that batch normalization (BN) leads to exponentially exploding gradients at initialization, we develop an ODE-based model to describe early training dynamics. Our model predicts that in the gradient flow, effective LRs will eventually equalize, aligning with empirical findings on warm-up training. Using large LRs is analogous to applying an explicit solver to a stiff non-linear ODE, causing overshooting and vanishing gradients in lower layers after the first step. Achieving overall balance demands careful tuning of LRs, depth, and (optionally) momentum. Our model predicts the formation of spreads in effective LRs, consistent with empirical measurements. Moreover, we observe that large spreads in effective LRs result in training issues concerning accuracy, indicating the importance of controlling these dynamics. To further support a causal relationship, we implement a simple scheduling scheme prescribing uniform effective LRs across layers and confirm accuracy benefits.

32.Renormalized Graph Neural Networks

Authors:Francesco Caso, Giovanni Trappolini, Andrea Bacciu, Pietro Liò, Fabrizio Silvestri

Abstract: Graph Neural Networks (GNNs) have become essential for studying complex data, particularly when represented as graphs. Their value is underpinned by their ability to reflect the intricacies of numerous areas, ranging from social to biological networks. GNNs can grapple with non-linear behaviors, emerging patterns, and complex connections; these are also typical characteristics of complex systems. The renormalization group (RG) theory has emerged as the language for studying complex systems. It is recognized as the preferred lens through which to study complex systems, offering a framework that can untangle their intricate dynamics. Despite the clear benefits of integrating RG theory with GNNs, no existing methods have ventured into this promising territory. This paper proposes a new approach that applies RG theory to devise a novel graph rewiring to improve GNNs' performance on graph-related tasks. We support our proposal with extensive experiments on standard benchmarks and baselines. The results demonstrate the effectiveness of our method and its potential to remedy the current limitations of GNNs. Finally, this paper marks the beginning of a new research direction. This path combines the theoretical foundations of RG, the magnifying glass of complex systems, with the structural capabilities of GNNs. By doing so, we aim to enhance the potential of GNNs in modeling and unraveling the complexities inherent in diverse systems.

33.A Uniform Confidence Phenomenon in Deep Learning and its Implications for Calibration

Authors:Muthu Chidambaram, Rong Ge

Abstract: Despite the impressive generalization capabilities of deep neural networks, they have been repeatedly shown to poorly estimate their predictive uncertainty - in other words, they are frequently overconfident when they are wrong. Fixing this issue is known as model calibration, and has consequently received much attention in the form of modified training schemes and post-training calibration procedures. In this work, we present a significant hurdle to the calibration of modern models: deep neural networks have large neighborhoods of almost certain confidence around their training points. We demonstrate in our experiments that this phenomenon consistently arises (in the context of image classification) across many model and dataset pairs. Furthermore, we prove that when this phenomenon holds, for a large class of data distributions with overlaps between classes, it is not possible to obtain a model that is asymptotically better than random (with respect to calibration) even after applying the standard post-training calibration technique of temperature scaling. On the other hand, we also prove that it is possible to circumvent this defect by changing the training process to use a modified loss based on the Mixup data augmentation technique.

34.Going Deeper with Spectral Embeddings

Authors:Vivien Cabannes

Abstract: To make sense of millions of raw data and represent them efficiently, practitioners rely on representation learning. Recently, deep connections have been shown between these approaches and the spectral decompositions of some underlying operators. Historically, explicit spectral embeddings were built from graphs constructed on top of the data. In contrast, we propose two new methods to build spectral embeddings: one based on functional analysis principles and kernel methods, which leads to algorithms with theoretical guarantees, and the other based on deep networks trained to optimize principled variational losses, which yield practically efficient algorithms. Furthermore, we provide a new sampling algorithm that leverages learned representations to generate new samples in a single step.

35.Efficient Failure Pattern Identification of Predictive Algorithms

Authors:Bao Nguyen, Viet Anh Nguyen

Abstract: Given a (machine learning) classifier and a collection of unlabeled data, how can we efficiently identify misclassification patterns presented in this dataset? To address this problem, we propose a human-machine collaborative framework that consists of a team of human annotators and a sequential recommendation algorithm. The recommendation algorithm is conceptualized as a stochastic sampler that, in each round, queries the annotators a subset of samples for their true labels and obtains the feedback information on whether the samples are misclassified. The sampling mechanism needs to balance between discovering new patterns of misclassification (exploration) and confirming the potential patterns of classification (exploitation). We construct a determinantal point process, whose intensity balances the exploration-exploitation trade-off through the weighted update of the posterior at each round to form the generator of the stochastic sampler. The numerical results empirically demonstrate the competitive performance of our framework on multiple datasets at various signal-to-noise ratios.

36.An End-to-End Time Series Model for Simultaneous Imputation and Forecast

Authors:Trang H. Tran, Lam M. Nguyen, Kyongmin Yeo, Nam Nguyen, Dzung Phan, Roman Vaculin, Jayant Kalagnanam

Abstract: Time series forecasting using historical data has been an interesting and challenging topic, especially when the data is corrupted by missing values. In many industrial problem, it is important to learn the inference function between the auxiliary observations and target variables as it provides additional knowledge when the data is not fully observed. We develop an end-to-end time series model that aims to learn the such inference relation and make a multiple-step ahead forecast. Our framework trains jointly two neural networks, one to learn the feature-wise correlations and the other for the modeling of temporal behaviors. Our model is capable of simultaneously imputing the missing entries and making a multiple-step ahead prediction. The experiments show good overall performance of our framework over existing methods in both imputation and forecasting tasks.

37.Understanding Augmentation-based Self-Supervised Representation Learning via RKHS Approximation

Authors:Runtian Zhai, Bingbin Liu, Andrej Risteski, Zico Kolter, Pradeep Ravikumar

Abstract: Good data augmentation is one of the key factors that lead to the empirical success of self-supervised representation learning such as contrastive learning and masked language modeling, yet theoretical understanding of its role in learning good representations remains limited. Recent work has built the connection between self-supervised learning and approximating the top eigenspace of a graph Laplacian operator. Learning a linear probe on top of such features can naturally be connected to RKHS regression. In this work, we use this insight to perform a statistical analysis of augmentation-based pretraining. We start from the isometry property, a key geometric characterization of the target function given by the augmentation. Our first main theorem provides, for an arbitrary encoder, near tight bounds for both the estimation error incurred by fitting the linear probe on top of the encoder, and the approximation error entailed by the fitness of the RKHS the encoder learns. Our second main theorem specifically addresses the case where the encoder extracts the top-d eigenspace of a Monte-Carlo approximation of the underlying kernel with the finite pretraining samples. Our analysis completely disentangles the effects of the model and the augmentation. A key ingredient in our analysis is the augmentation complexity, which we use to quantitatively compare different augmentations and analyze their impact on downstream performance on synthetic and real datasets.

38.Initial Guessing Bias: How Untrained Networks Favor Some Classes

Authors:Emanuele Francazi, Aurelien Lucchi, Marco Baity-Jesi

Abstract: The initial state of neural networks plays a central role in conditioning the subsequent training dynamics. In the context of classification problems, we provide a theoretical analysis demonstrating that the structure of a neural network can condition the model to assign all predictions to the same class, even before the beginning of training, and in the absence of explicit biases. We show that the presence of this phenomenon, which we call "Initial Guessing Bias" (IGB), depends on architectural choices such as activation functions, max-pooling layers, and network depth. Our analysis of IGB has practical consequences, in that it guides architecture selection and initialization. We also highlight theoretical consequences, such as the breakdown of node-permutation symmetry, the violation of self-averaging, the validity of some mean-field approximations, and the non-trivial differences arising with depth.

39.In or Out? Fixing ImageNet Out-of-Distribution Detection Evaluation

Authors:Julian Bitterwolf, Maximilian Müller, Matthias Hein

Abstract: Out-of-distribution (OOD) detection is the problem of identifying inputs which are unrelated to the in-distribution task. The OOD detection performance when the in-distribution (ID) is ImageNet-1K is commonly being tested on a small range of test OOD datasets. We find that most of the currently used test OOD datasets, including datasets from the open set recognition (OSR) literature, have severe issues: In some cases more than 50$\%$ of the dataset contains objects belonging to one of the ID classes. These erroneous samples heavily distort the evaluation of OOD detectors. As a solution, we introduce with NINCO a novel test OOD dataset, each sample checked to be ID free, which with its fine-grained range of OOD classes allows for a detailed analysis of an OOD detector's strengths and failure modes, particularly when paired with a number of synthetic "OOD unit-tests". We provide detailed evaluations across a large set of architectures and OOD detection methods on NINCO and the unit-tests, revealing new insights about model weaknesses and the effects of pretraining on OOD detection performance. We provide code and data at https://github.com/j-cb/NINCO.

40.What model does MuZero learn?

Authors:Jinke He, Thomas M. Moerland, Frans A. Oliehoek

Abstract: Model-based reinforcement learning has drawn considerable interest in recent years, given its promise to improve sample efficiency. Moreover, when using deep-learned models, it is potentially possible to learn compact models from complex sensor data. However, the effectiveness of these learned models, particularly their capacity to plan, i.e., to improve the current policy, remains unclear. In this work, we study MuZero, a well-known deep model-based reinforcement learning algorithm, and explore how far it achieves its learning objective of a value-equivalent model and how useful the learned models are for policy improvement. Amongst various other insights, we conclude that the model learned by MuZero cannot effectively generalize to evaluate unseen policies, which limits the extent to which we can additionally improve the current policy by planning with the model.

41.Non-stationary Reinforcement Learning under General Function Approximation

Authors:Songtao Feng, Ming Yin, Ruiquan Huang, Yu-Xiang Wang, Jing Yang, Yingbin Liang

Abstract: General function approximation is a powerful tool to handle large state and action spaces in a broad range of reinforcement learning (RL) scenarios. However, theoretical understanding of non-stationary MDPs with general function approximation is still limited. In this paper, we make the first such an attempt. We first propose a new complexity metric called dynamic Bellman Eluder (DBE) dimension for non-stationary MDPs, which subsumes majority of existing tractable RL problems in static MDPs as well as non-stationary MDPs. Based on the proposed complexity metric, we propose a novel confidence-set based model-free algorithm called SW-OPEA, which features a sliding window mechanism and a new confidence set design for non-stationary MDPs. We then establish an upper bound on the dynamic regret for the proposed algorithm, and show that SW-OPEA is provably efficient as long as the variation budget is not significantly large. We further demonstrate via examples of non-stationary linear and tabular MDPs that our algorithm performs better in small variation budget scenario than the existing UCB-type algorithms. To the best of our knowledge, this is the first dynamic regret analysis in non-stationary MDPs with general function approximation.

42.IQL-TD-MPC: Implicit Q-Learning for Hierarchical Model Predictive Control

Authors:Rohan Chitnis, Yingchen Xu, Bobak Hashemi, Lucas Lehnert, Urun Dogan, Zheqing Zhu, Olivier Delalleau

Abstract: Model-based reinforcement learning (RL) has shown great promise due to its sample efficiency, but still struggles with long-horizon sparse-reward tasks, especially in offline settings where the agent learns from a fixed dataset. We hypothesize that model-based RL agents struggle in these environments due to a lack of long-term planning capabilities, and that planning in a temporally abstract model of the environment can alleviate this issue. In this paper, we make two key contributions: 1) we introduce an offline model-based RL algorithm, IQL-TD-MPC, that extends the state-of-the-art Temporal Difference Learning for Model Predictive Control (TD-MPC) with Implicit Q-Learning (IQL); 2) we propose to use IQL-TD-MPC as a Manager in a hierarchical setting with any off-the-shelf offline RL algorithm as a Worker. More specifically, we pre-train a temporally abstract IQL-TD-MPC Manager to predict "intent embeddings", which roughly correspond to subgoals, via planning. We empirically show that augmenting state representations with intent embeddings generated by an IQL-TD-MPC manager significantly improves off-the-shelf offline RL agents' performance on some of the most challenging D4RL benchmark tasks. For instance, the offline RL algorithms AWAC, TD3-BC, DT, and CQL all get zero or near-zero normalized evaluation scores on the medium and large antmaze tasks, while our modification gives an average score over 40.

43.Is novelty predictable?

Authors:Clara Fannjiang, Jennifer Listgarten

Abstract: Machine learning-based design has gained traction in the sciences, most notably in the design of small molecules, materials, and proteins, with societal implications spanning drug development and manufacturing, plastic degradation, and carbon sequestration. When designing objects to achieve novel property values with machine learning, one faces a fundamental challenge: how to push past the frontier of current knowledge, distilled from the training data into the model, in a manner that rationally controls the risk of failure. If one trusts learned models too much in extrapolation, one is likely to design rubbish. In contrast, if one does not extrapolate, one cannot find novelty. Herein, we ponder how one might strike a useful balance between these two extremes. We focus in particular on designing proteins with novel property values, although much of our discussion addresses machine learning-based design more broadly.

44.Quantifying Deep Learning Model Uncertainty in Conformal Prediction

Authors:Hamed Karimi, Reza Samavi

Abstract: Precise estimation of predictive uncertainty in deep neural networks is a critical requirement for reliable decision-making in machine learning and statistical modeling, particularly in the context of medical AI. Conformal Prediction (CP) has emerged as a promising framework for representing the model uncertainty by providing well-calibrated confidence levels for individual predictions. However, the quantification of model uncertainty in conformal prediction remains an active research area, yet to be fully addressed. In this paper, we explore state-of-the-art CP methodologies and their theoretical foundations. We propose a probabilistic approach in quantifying the model uncertainty derived from the produced prediction sets in conformal prediction and provide certified boundaries for the computed uncertainty. By doing so, we allow model uncertainty measured by CP to be compared by other uncertainty quantification methods such as Bayesian (e.g., MC-Dropout and DeepEnsemble) and Evidential approaches.

45.Domain Generalization for Domain-Linked Classes

Authors:Kimathi Kaai, Saad Hossain, Sirisha Rambhatla

Abstract: Domain generalization (DG) focuses on transferring domain-invariant knowledge from multiple source domains (available at train time) to an, a priori, unseen target domain(s). This requires a class to be expressed in multiple domains for the learning algorithm to break the spurious correlations between domain and class. However, in the real-world, classes may often be domain-linked, i.e. expressed only in a specific domain, which leads to extremely poor generalization performance for these classes. In this work, we aim to learn generalizable representations for these domain-linked classes by transferring domain-invariant knowledge from classes expressed in multiple source domains (domain-shared classes). To this end, we introduce this task to the community and propose a Fair and cONtrastive feature-space regularization algorithm for Domain-linked DG, FOND. Rigorous and reproducible experiments with baselines across popular DG tasks demonstrate our method and its variants' ability to accomplish state-of-the-art DG results for domain-linked classes. We also provide practical insights on data conditions that increase domain-linked class generalizability to tackle real-world data scarcity.

46.SpotTarget: Rethinking the Effect of Target Edges for Link Prediction in Graph Neural Networks

Authors:Jing Zhu, Yuhang Zhou, Vassilis N. Ioannidis, Shengyi Qian, Wei Ai, Xiang Song, Danai Koutra

Abstract: Graph Neural Networks (GNNs) have demonstrated promising outcomes across various tasks, including node classification and link prediction. Despite their remarkable success in various high-impact applications, we have identified three common pitfalls in message passing for link prediction. Particularly, in prevalent GNN frameworks (e.g., DGL and PyTorch-Geometric), the target edges (i.e., the edges being predicted) consistently exist as message passing edges in the graph during training. Consequently, this results in overfitting and distribution shift, both of which adversely impact the generalizability to test the target edges. Additionally, during test time, the failure to exclude the test target edges leads to implicit test leakage caused by neighborhood aggregation. In this paper, we analyze these three pitfalls and investigate the impact of including or excluding target edges on the performance of nodes with varying degrees during training and test phases. Our theoretical and empirical analysis demonstrates that low-degree nodes are more susceptible to these pitfalls. These pitfalls can have detrimental consequences when GNNs are implemented in production systems. To systematically address these pitfalls, we propose SpotTarget, an effective and efficient GNN training framework. During training, SpotTarget leverages our insight regarding low-degree nodes and excludes train target edges connected to at least one low-degree node. During test time, it emulates real-world scenarios of GNN usage in production and excludes all test target edges. Our experiments conducted on diverse real-world datasets, demonstrate that SpotTarget significantly enhances GNNs, achieving up to a 15x increase in accuracy in sparse graphs. Furthermore, SpotTarget consistently and dramatically improves the performance for low-degree nodes in dense graphs.

47.Better Private Linear Regression Through Better Private Feature Selection

Authors:Travis Dick, Jennifer Gillenwater, Matthew Joseph

Abstract: Existing work on differentially private linear regression typically assumes that end users can precisely set data bounds or algorithmic hyperparameters. End users often struggle to meet these requirements without directly examining the data (and violating privacy). Recent work has attempted to develop solutions that shift these burdens from users to algorithms, but they struggle to provide utility as the feature dimension grows. This work extends these algorithms to higher-dimensional problems by introducing a differentially private feature selection method based on Kendall rank correlation. We prove a utility guarantee for the setting where features are normally distributed and conduct experiments across 25 datasets. We find that adding this private feature selection step before regression significantly broadens the applicability of ``plug-and-play'' private linear regression algorithms at little additional cost to privacy, computation, or decision-making by the end user.

48.CS4ML: A general framework for active learning with arbitrary data based on Christoffel functions

Authors:Ben Adcock, Juan M. Cardenas, Nick Dexter

Abstract: We introduce a general framework for active learning in regression problems. Our framework extends the standard setup by allowing for general types of data, rather than merely pointwise samples of the target function. This generalization covers many cases of practical interest, such as data acquired in transform domains (e.g., Fourier data), vector-valued data (e.g., gradient-augmented data), data acquired along continuous curves, and, multimodal data (i.e., combinations of different types of measurements). Our framework considers random sampling according to a finite number of sampling measures and arbitrary nonlinear approximation spaces (model classes). We introduce the concept of generalized Christoffel functions and show how these can be used to optimize the sampling measures. We prove that this leads to near-optimal sample complexity in various important cases. This paper focuses on applications in scientific computing, where active learning is often desirable, since it is usually expensive to generate data. We demonstrate the efficacy of our framework for gradient-augmented learning with polynomials, Magnetic Resonance Imaging (MRI) using generative models and adaptive sampling for solving PDEs using Physics-Informed Neural Networks (PINNs).

49.Exposing Attention Glitches with Flip-Flop Language Modeling

Authors:Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang

Abstract: Why do large language models sometimes output factual inaccuracies and exhibit erroneous reasoning? The brittleness of these models, particularly when executing long chains of reasoning, currently seems to be an inevitable price to pay for their advanced capabilities of coherently synthesizing knowledge, pragmatics, and abstract thought. Towards making sense of this fundamentally unsolved problem, this work identifies and analyzes the phenomenon of attention glitches, in which the Transformer architecture's inductive biases intermittently fail to capture robust reasoning. To isolate the issue, we introduce flip-flop language modeling (FFLM), a parametric family of synthetic benchmarks designed to probe the extrapolative behavior of neural language models. This simple generative task requires a model to copy binary symbols over long-range dependencies, ignoring the tokens in between. We find that Transformer FFLMs suffer from a long tail of sporadic reasoning errors, some of which we can eliminate using various regularization techniques. Our preliminary mechanistic analyses show why the remaining errors may be very difficult to diagnose and resolve. We hypothesize that attention glitches account for (some of) the closed-domain hallucinations in natural LLMs.

50.Improving and Benchmarking Offline Reinforcement Learning Algorithms

Authors:Bingyi Kang, Xiao Ma, Yirui Wang, Yang Yue, Shuicheng Yan

Abstract: Recently, Offline Reinforcement Learning (RL) has achieved remarkable progress with the emergence of various algorithms and datasets. However, these methods usually focus on algorithmic advancements, ignoring that many low-level implementation choices considerably influence or even drive the final performance. As a result, it becomes hard to attribute the progress in Offline RL as these choices are not sufficiently discussed and aligned in the literature. In addition, papers focusing on a dataset (e.g., D4RL) often ignore algorithms proposed on another dataset (e.g., RL Unplugged), causing isolation among the algorithms, which might slow down the overall progress. Therefore, this work aims to bridge the gaps caused by low-level choices and datasets. To this end, we empirically investigate 20 implementation choices using three representative algorithms (i.e., CQL, CRR, and IQL) and present a guidebook for choosing implementations. Following the guidebook, we find two variants CRR+ and CQL+ , achieving new state-of-the-art on D4RL. Moreover, we benchmark eight popular offline RL algorithms across datasets under unified training and evaluation framework. The findings are inspiring: the success of a learning paradigm severely depends on the data distribution, and some previous conclusions are biased by the dataset used. Our code is available at https://github.com/sail-sg/offbench.

51.Active Reinforcement Learning under Limited Visual Observability

Authors:Jinghuan Shang, Michael S. Ryoo

Abstract: In this work, we investigate Active Reinforcement Learning (Active-RL), where an embodied agent simultaneously learns action policy for the task while also controlling its visual observations in partially observable environments. We denote the former as motor policy and the latter as sensory policy. For example, humans solve real world tasks by hand manipulation (motor policy) together with eye movements (sensory policy). Active-RL poses challenges on coordinating two policies given their mutual influence. We propose SUGARL, Sensorimotor Understanding Guided Active Reinforcement Learning, a framework that models motor and sensory policies separately, but jointly learns them using with an intrinsic sensorimotor reward. This learnable reward is assigned by sensorimotor reward module, incentivizes the sensory policy to select observations that are optimal to infer its own motor action, inspired by the sensorimotor stage of humans. Through a series of experiments, we show the effectiveness of our method across a range of observability conditions and its adaptability to existed RL algorithms. The sensory policies learned through our method are observed to exhibit effective active vision strategies.