Aliasing in Convnets: A Frame-Theoretic Perspective

By: Daniel Haider, Vincent Lostanlen, Martin Ehler, Nicki Holighaus, Peter Balazs

Using a stride in a convolutional layer inherently introduces aliasing, which has implications for numerical stability and statistical generalization. While techniques such as parametrization via paraunitary systems have been used to promote orthogonal convolution and thus ensure Parseval stability, a general analysis of aliasing and its effects on stability has not been carried out in this context. In this article, we adapt a frame-theoretic approach to describe aliasing in convolutional layers with 1D kernels, leading to practical estimates for stability bounds and characterizations of Parseval stability that are tailored to take short kernel sizes into account. From this, we derive two computationally very efficient optimization objectives that promote Parseval stability by systematically suppressing aliasing. Finally, for layers with random kernels, we derive closed-form expressions for the expected value and variance of the terms that describe the aliasing effects, revealing fundamental insights into aliasing behavior at initialization.
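To make the core mechanism concrete, here is a minimal numpy sketch (our illustration, not the authors' code) showing that a stride-s convolution is a full convolution followed by downsampling, and that downsampling folds DFT bands onto each other, which is exactly the aliasing being analyzed:

```python
# Minimal sketch: stride-s convolution = full convolution + decimation,
# and decimation sums s shifted copies of the spectrum (aliasing).
import numpy as np

rng = np.random.default_rng(0)
N, s = 64, 2                            # signal length, stride
x = rng.standard_normal(N)
k = rng.standard_normal(5)              # short 1D kernel, as in the paper

full = np.convolve(x, k, mode="same")   # stride-1 response
strided = full[::s]                     # stride-s response

# DFT identity: Y[m] = (1/s) * sum_{j<s} X[m + j*N/s] for m < N/s.
# The s spectral copies overlap and add -- this folding is the aliasing.
F = np.fft.fft(full)
folded = F.reshape(s, N // s).sum(axis=0) / s
assert np.allclose(np.fft.fft(strided), folded)
```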
Differential Mamba

By: Nadav Schneider, Itamar Zimerman, Eliya Nachmani

Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications. To address this, we introduce a novel differential mechanism for Mamba, empirically validated on language modeling benchmarks, demonstrating improved retrieval capabilities and superior performance over vanilla Mamba. Finally, we conduct extensive ablation studies and empirical analyses to justify our design choices and provide evidence that our approach effectively mitigates the overallocation problem in Mamba-based models. Our code is publicly available.
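For intuition, here is a toy numpy rendering of the differential design as introduced for Transformers (the paper shows a naive port of this to Mamba is insufficient and designs a Mamba-specific mechanism instead):

```python
# Toy differential attention (the Transformer-side idea, not the paper's
# Mamba mechanism): two parallel attention maps are subtracted so that
# attention mass placed on irrelevant context tends to cancel out.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """out = (softmax(q1 k1^T) - lam * softmax(q2 k2^T)) v; lam is learned."""
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))
    return (a1 - lam * a2) @ v

rng = np.random.default_rng(0)
T, d = 8, 16
q1, k1, q2, k2 = (rng.standard_normal((T, d)) for _ in range(4))
v = rng.standard_normal((T, d))
print(differential_attention(q1, k1, q2, k2, v).shape)  # (8, 16)
```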
QuEst: Enhancing Estimates of Quantile-Based Distributional Measures Using Model Predictions

By: Zhun Deng, Thomas P Zollo, Benjamin Eyre, Amogh Inamdar, David Madras, Richard Zemel

As machine learning models grow increasingly competent, their predictions can supplement scarce or expensive data in various important domains. In support of this paradigm, algorithms have emerged to combine a small amount of high-fidelity observed data with a much larger set of imputed model outputs to estimate some quantity of interest. Yet current hybrid-inference tools target only means or single quantiles, limiting their applicability in many critical domains and use cases. We present QuEst, a principled framework to merge observed and imputed data to deliver point estimates and rigorous confidence intervals for a wide family of quantile-based distributional measures. QuEst covers a range of measures, from tail risk (CVaR) to population segments such as quartiles, that are central to fields such as economics, sociology, education, medicine, and more. We extend QuEst to multidimensional metrics, and introduce an additional optimization technique to further reduce variance in this and other hybrid estimators. We demonstrate the utility of our framework through experiments in economic modeling, opinion polling, and language model auto-evaluation.
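To illustrate the flavor of such hybrid estimators, here is a minimal rectified-CDF quantile estimate in the spirit of prediction-powered inference (our simplification; QuEst's estimators and confidence intervals are more involved):

```python
# Sketch: estimate the CDF from abundant model predictions, rectify it with
# the bias measured on the small labeled set, then invert for a quantile.
import numpy as np

def hybrid_quantile(y_obs, yhat_obs, yhat_big, q):
    grid = np.sort(np.concatenate([y_obs, yhat_big]))
    cdf = (np.mean(yhat_big[:, None] <= grid, axis=0)    # predictions, big pool
           + np.mean(y_obs[:, None] <= grid, axis=0)     # labels, small sample
           - np.mean(yhat_obs[:, None] <= grid, axis=0)) # paired predictions
    cdf = np.maximum.accumulate(np.clip(cdf, 0.0, 1.0))  # enforce a valid CDF
    idx = min(np.searchsorted(cdf, q), len(grid) - 1)
    return grid[idx]

rng = np.random.default_rng(0)
y = rng.normal(0, 1, 200)                # small labeled sample
yhat = y + rng.normal(0.3, 0.2, 200)     # biased model predictions on it
yhat_big = rng.normal(0.3, 1.0, 5000)    # predictions on the unlabeled pool
print(hybrid_quantile(y, yhat, yhat_big, 0.9))  # near the true 0.9-quantile
```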
Replacing thinking with tool usage enables reasoning in small language models

By: Corrado Rainone, Tim Bakker, Roland Memisevic

Recent advances have established a new machine learning paradigm based on scaling up compute at inference time as well as at training time. In that line of work, a combination of Supervised Fine-Tuning (SFT) on synthetic demonstrations and Reinforcement Learning with Verifiable Rewards (RLVR) is used for training Large Language Models to expend extra compute during inference in the form of "thoughts" expressed in natural language. In this paper, we propose to instead format these tokens as a multi-turn interaction trace with a stateful tool. At each turn, the new state of the tool is appended to the context of the model, whose job is to generate the tokens necessary to control the tool via a custom DSL. We benchmark this approach on the problem of repairing malfunctioning Python code, and show that this constrained setup allows for faster sampling of experience and a denser reward signal, enabling even models of up to 3B parameters to learn how to proficiently expend additional compute on the task.
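The interaction loop can be sketched as follows (our schematic; the tool, the DSL, and all names here are illustrative placeholders, not the paper's implementation):

```python
# Schematic multi-turn loop: model emits DSL commands -> stateful tool
# applies them -> new tool state (e.g., code + test output) is appended
# to the model's context for the next turn.
def repair_loop(model, tool, broken_code, max_turns=8):
    state = tool.reset(broken_code)              # tool holds the code state
    context = [state]
    reward = 0.0
    for _ in range(max_turns):
        commands = model.generate(context)       # tokens in a control DSL
        state, reward, done = tool.step(commands)  # e.g. apply edit, run tests
        context.append(state)                    # new state enters the context
        if done:                                 # e.g. all tests pass
            break
    return state, reward
```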
Meta-Learning Transformers to Improve In-Context Generalization

By: Lorenzo Braccaioli, Anna Vettoruzzo, Prabhant Singh, Joaquin Vanschoren, Mohamed-Rafik Bouguelia, Nicola Conci

In-context learning enables transformer models to generalize to new tasks based solely on input prompts, without any need for weight updates. However, existing training paradigms typically rely on large, unstructured datasets that are costly to store, difficult to evaluate for quality and balance, and pose privacy and ethical concerns due to the inclusion of sensitive information. Motivated by these limitations and risks, we propose an alternative training strategy where we leverage a collection of multiple, small-scale, and domain-specific datasets. We empirically demonstrate that the increased quality and diversity of such data improve the generalization abilities of in-context learners beyond their training domain, while achieving comparable performance to models trained on a single large-scale dataset. We investigate this paradigm by leveraging meta-learning to train an in-context learner on the Meta-Album collection under several settings. First, we evaluate performance in a controlled environment, where the test domain is completely excluded from the training knowledge. Second, we explore the robustness of these models to forgetting in a continual scenario where the information is accessible for a limited time. Finally, we explore the more challenging unsupervised scenario. Our findings demonstrate that transformers still generalize for in-context prediction when trained on a curated dataset collection while offering advantages in modularity and replaceability.
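Training on such a collection reduces to episodic sampling; a minimal sketch (our illustration; the dataset format and the Meta-Album loading code are assumptions):

```python
# Sketch: each training step draws one small domain-specific dataset and
# builds a support/query episode; the transformer sees the support set
# in-context and predicts the query labels.
import random

def sample_episode(datasets, n_support=8, n_query=8):
    ds = random.choice(datasets)                 # one small dataset
    items = random.sample(ds, n_support + n_query)
    return items[:n_support], items[n_support:]

# toy "datasets": lists of (x, y) pairs standing in for Meta-Album tasks
toy = [[(i, i % 2) for i in range(50)], [(i, i % 5) for i in range(50)]]
support, query = sample_episode(toy)
```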
Train-before-Test Harmonizes Language Model Rankings

By: Guanhua Zhang, Ricardo Dominguez-Olmedo, Moritz Hardt

Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. Recent work attributed ranking disagreement to the phenomenon of training on the test task: As released, different models exhibit a different level of preparation for any given test task. A candidate solution to the problem is train-before-test: Give each model the same benchmark-specific finetuning before evaluation. Our primary contribution is a broad empirical evaluation of train-before-test across 24 benchmarks and 61 models. We show that train-before-test significantly improves ranking agreement consistently across all benchmarks. Whereas rankings have little external validity to start with, they enjoy a significant degree of external validity when applying train-before-test: Model rankings transfer gracefully from one benchmark to another. Even within the same model family, train-before-test reduces strong ranking disagreement to near-perfect agreement. In addition, train-before-test reduces the model-score matrix to essentially rank one, revealing new insights into the latent factors of benchmark performance. Our work supports the recommendation to make train-before-test a default component of LLM benchmarking.
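The protocol itself is simple; a schematic (our paraphrase; `finetune` and `evaluate` are placeholders, not the paper's code):

```python
# Train-before-test: every model receives the same benchmark-specific
# finetuning budget before being scored, and only then is ranked.
def rank_models(models, benchmark):
    scores = {}
    for m in models:
        m_ft = finetune(m, benchmark.train_split)   # identical budget for all
        scores[m.name] = evaluate(m_ft, benchmark.test_split)
    return sorted(scores, key=scores.get, reverse=True)
```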
Beyond Scaling Curves: Internal Dynamics of Neural Networks Through the NTK Lens

By: Konstantin Nikolaou, Sven Krippendorf, Samuel Tovey, Christian Holm

Scaling laws offer valuable insights into the relationship between neural network performance and computational cost, yet their underlying mechanisms remain poorly understood. In this work, we empirically analyze how neural networks behave under data and model scaling through the lens of the neural tangent kernel (NTK). This analysis establishes a link between performance scaling and the internal dynamics of neural networks. Our findings on standard vision tasks show that similar performance scaling exponents can occur even though the internal model dynamics show opposite behavior. This demonstrates that performance scaling alone is insufficient for understanding the underlying mechanisms of neural networks. We also address a previously unresolved issue in neural scaling: how convergence to the infinite-width limit affects scaling behavior in finite-width models. To this end, we investigate how feature learning is lost as the model width increases and quantify the transition between kernel-driven and feature-driven scaling regimes. We identify the maximum model width that supports feature learning, which, in our setups, we find to be more than ten times smaller than typical large language model widths.
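As background, the empirical NTK is the Gram matrix of parameter gradients, K(x, x') = ∇θf(x)·∇θf(x'). A self-contained toy computation for a one-hidden-layer network (our example, not the paper's code):

```python
# Empirical NTK for f(x) = a^T tanh(W x), with the Jacobian written out
# analytically for this tiny model. Tracking how K changes during training
# is one standard probe of feature learning vs. the frozen-kernel regime.
import numpy as np

def ntk(x1, x2, W, a):
    def jac(x):
        h = np.tanh(W @ x)
        dW = np.outer(a * (1 - h**2), x)        # df/dW
        return np.concatenate([dW.ravel(), h])  # df/d(W, a), flattened
    return jac(x1) @ jac(x2)                    # K(x1, x2) = <grad, grad>

rng = np.random.default_rng(0)
width, dim = 512, 10
W = rng.standard_normal((width, dim)) / np.sqrt(dim)
a = rng.standard_normal(width) / np.sqrt(width)
x1, x2 = rng.standard_normal(dim), rng.standard_normal(dim)
print(ntk(x1, x2, W, a))
```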
Cascade: Token-Sharded Private LLM Inference

By: Rahul Thomas, Louai Zahran, Erica Choi, Akilesh Potti, Micah Goldblum, Arka Pal

As LLMs continue to increase in parameter size, the computational resources required to run them are available to fewer parties. Therefore, third-party inference services -- where LLMs are hosted by third parties with significant computational resources -- are becoming increasingly popular. However, third-party inference raises critical concerns about user data privacy. To mitigate these risks, privacy researchers have developed provably secure schemes for third-party inference, such as Secure Multi-Party Computation (SMPC). However, SMPC protocols have significant computational and communication overhead, and do not scale to large models. In this work, we propose a new multi-party inference protocol, Cascade, that avoids these punitive costs by leveraging sharding in the sequence dimension to maintain privacy, trading off cryptographic privacy guarantees for increased performance and scalability. We demonstrate that Cascade is resistant to a generalization of a recent attack that is highly effective against other statistical privacy schemes, and that it is further resistant to learning-based attacks. As Cascade is orders of magnitude faster than existing schemes, our findings offer practical solutions for secure deployment of modern state-of-the-art LLMs.
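The core idea of sequence-dimension sharding fits in a few lines (our toy illustration; what each party computes and the full security argument are in the paper):

```python
# Toy sequence sharding: party p only ever sees the token positions
# congruent to p mod n_parties, so no single party holds the full prompt.
def shard_tokens(token_ids, n_parties):
    return [token_ids[p::n_parties] for p in range(n_parties)]

print(shard_tokens(list(range(10)), 3))
# [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```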
MvHo-IB: Multi-View Higher-Order Information Bottleneck for Brain Disorder Diagnosis

By: Kunyu Zhang, Qiang Li, Shujian Yu

Recent evidence suggests that modeling higher-order interactions (HOIs) in functional magnetic resonance imaging (fMRI) data can enhance the diagnostic accuracy of machine learning systems. However, effectively extracting and utilizing HOIs remains a significant challenge. In this work, we propose MvHo-IB, a novel multi-view learning framework that integrates both pairwise interactions and HOIs for diagnostic decision-making, while automatically compressing task-irrelevant redundant information. MvHo-IB introduces several key innovations: (1) a principled method that combines O-information from information theory with a matrix-based Renyi alpha-order entropy estimator to quantify and extract HOIs, (2) a purpose-built Brain3DCNN encoder to effectively utilize these interactions, and (3) a new multi-view learning information bottleneck objective to enhance representation learning. Experiments on three benchmark fMRI datasets demonstrate that MvHo-IB achieves state-of-the-art performance, significantly outperforming previous methods, including recent hypergraph-based techniques. The implementation of MvHo-IB is available at https://github.com/zky04/MvHo-IB.
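For reference, the O-information that innovation (1) builds on is standardly defined as follows (textbook form from the information-theory literature, not quoted from the paper):

```latex
% O-information of n variables (Rosas et al. 2019): positive values indicate
% redundancy-dominated interactions, negative values synergy-dominated ones.
\Omega(X_1,\dots,X_n) \;=\; (n-2)\,H(X_1,\dots,X_n)
  \;+\; \sum_{i=1}^{n} \bigl[ H(X_i) - H(X_1,\dots,X_{i-1},X_{i+1},\dots,X_n) \bigr]
```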
Replicable Distribution Testing

By: Ilias Diakonikolas, Jingyi Gao, Daniel Kane, Sihan Liu, Christopher Ye

We initiate a systematic investigation of distribution testing in the framework of algorithmic replicability. Specifically, given independent samples from a collection of probability distributions, the goal is to characterize the sample complexity of replicably testing natural properties of the underlying distributions. On the algorithmic front, we develop new replicable algorithms for testing closeness and independence of discrete distributions. On the lower bound front, we develop a new methodology for proving sample complexity lower bounds for replicable testing that may be of broader interest. As an application of our technique, we establish near-optimal sample complexity lower bounds for replicable uniformity testing -- answering an open question from prior work -- and closeness testing.
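For context, the standard notion of replicability used in this line of work (due to Impagliazzo et al.; our paraphrase) requires that an algorithm with shared internal randomness returns the same output on two independent samples with high probability:

```latex
% rho-replicability: over two independent samples S, S' and shared internal
% randomness r, the algorithm A almost always produces the same answer.
\Pr_{S, S' \sim D^n,\; r}\bigl[ A(S; r) = A(S'; r) \bigr] \;\ge\; 1 - \rho
```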
In-Training Multicalibrated Survival Analysis for Healthcare via Constrained Optimization

By: Thiti Suttaket, Stanley Kok

Survival analysis is an important problem in healthcare because it models the relationship between an individual's covariates and the onset time of an event of interest (e.g., death). It is important for survival models to be well-calibrated (i.e., for their predicted probabilities to be close to ground-truth probabilities) because badly calibrated systems can result in erroneous clinical decisions. Existing survival models are typically calibrated at the population level only, and thus run the risk of being poorly calibrated for one or more minority subpopulations. We propose a model called GRADUATE that achieves multicalibration by ensuring that all subpopulations, not just the overall population, are well-calibrated. GRADUATE frames multicalibration as a constrained optimization problem, and optimizes both calibration and discrimination in-training to achieve a good balance between them. We mathematically prove that the optimization method used yields a solution that is both near-optimal and feasible with high probability. Empirical comparisons against state-of-the-art baselines on real-world clinical datasets demonstrate GRADUATE's efficacy. In a detailed analysis, we elucidate the shortcomings of the baselines vis-a-vis GRADUATE's strengths.
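One common way to state the multicalibration requirement (our paraphrase; GRADUATE's exact constraint set may differ):

```latex
% The predictor f must be nearly calibrated on every subpopulation g in a
% class G, at every prediction level v -- not just on average.
\bigl| \, \mathbb{E}\bigl[ Y - f(X) \;\big|\; X \in g,\; f(X) \approx v \bigr] \bigr| \;\le\; \alpha
\qquad \text{for all } g \in \mathcal{G} \text{ and levels } v
```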
High-Order Deep Meta-Learning with Category-Theoretic Interpretation

By: David H. Mguni

We introduce a new hierarchical deep learning framework for recursive higher-order meta-learning that enables neural networks (NNs) to construct, solve, and generalise across hierarchies of tasks. Central to this approach is a generative mechanism that creates \emph{virtual tasks} -- synthetic problem instances designed to enable the meta-learner to learn \emph{soft constraints} and unknown generalisable rules across related tasks. Crucially, this enables the framework to generate its own informative, task-grounded datasets, thereby freeing machine learning (ML) training from the limitations of relying entirely on human-generated data. By actively exploring the virtual point landscape and seeking out tasks lower-level learners find difficult, the meta-learner iteratively refines constraint regions. This enhances inductive biases, regularises the adaptation process, and produces novel, unanticipated tasks and constraints required for generalisation. Each meta-level of the hierarchy corresponds to a progressively abstracted generalisation of problems solved at lower levels, enabling a structured and interpretable learning progression. By interpreting meta-learners as category-theoretic \emph{functors} that generate and condition a hierarchy of subordinate learners, we establish a compositional structure that supports abstraction and knowledge transfer across progressively generalised tasks. The category-theoretic perspective unifies existing meta-learning models and reveals how learning processes can be transformed and compared through functorial relationships, while offering practical design principles for structuring meta-learning. We speculate this architecture may underpin the next generation of NNs capable of autonomously generating novel, instructive tasks and their solutions, thereby advancing ML towards general artificial intelligence.
LLM-Driven Treatment Effect Estimation Under Inference Time Text Confounding

By: Yuchen Ma, Dennis Frauen, Jonas Schweisthal, Stefan Feuerriegel

Estimating treatment effects is crucial for personalized decision-making in medicine, but this task faces unique challenges in clinical practice. At training time, models for estimating treatment effects are typically trained on well-structured medical datasets that contain detailed patient information. However, at inference time, predictions are often made using textual descriptions (e.g., descriptions with self-reported symptoms), which are incomplete representations of the original patient information. In this work, we make three contributions. (1) We show that the discrepancy between the data available during training time and inference time can lead to biased estimates of treatment effects. We formalize this issue as an inference time text confounding problem, where confounders are fully observed during training time but only partially available through text at inference time. (2) To address this problem, we propose a novel framework for estimating treatment effects that explicitly accounts for inference time text confounding. Our framework leverages large language models together with a custom doubly robust learner to mitigate biases caused by the inference time text confounding. (3) Through a series of experiments, we demonstrate the effectiveness of our framework in real-world applications. less
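For background, doubly robust learners build on the standard AIPW pseudo-outcome (textbook form; the paper's custom learner adapts this to the inference time text confounding setting):

```latex
% Doubly robust pseudo-outcome: unbiased for the treatment effect if either
% the outcome models \hat{\mu}_a or the propensity \hat{e} is correct.
\hat{\tau}_{\mathrm{DR}}(x) \;=\; \hat{\mu}_1(x) - \hat{\mu}_0(x)
  \;+\; \frac{A\,\bigl(Y - \hat{\mu}_1(x)\bigr)}{\hat{e}(x)}
  \;-\; \frac{(1-A)\,\bigl(Y - \hat{\mu}_0(x)\bigr)}{1 - \hat{e}(x)}
```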
Fast and Simplex: 2-Simplicial Attention in Triton

By: Aurko Roy, Timothy Chou, Sai Surya Duvvuri, Sijia Chen, Jiecao Yu, Xiaodong Wang, Manzil Zaheer, Rohan Anil

Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count together. However, these scaling laws assume an infinite supply of data and apply primarily in compute-bound settings. As modern large language models increasingly rely on massive internet-scale datasets, the assumption that they are compute-bound is becoming less valid. This shift highlights the need for architectures that prioritize token efficiency. In this work, we investigate the use of the 2-simplicial Transformer, an architecture that generalizes standard dot-product attention to trilinear functions through an efficient Triton kernel implementation. We demonstrate that the 2-simplicial Transformer achieves better token efficiency than standard Transformers: for a fixed token budget, similarly sized models outperform their dot-product counterparts on tasks involving mathematics, coding, reasoning, and logic. We quantify these gains by demonstrating that 2-simplicial attention changes the exponent in the scaling laws for knowledge and reasoning tasks compared to dot-product attention.
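Schematically, the generalization replaces bilinear attention scores with trilinear ones (our rendering of the idea; the paper gives the exact form and its Triton implementation):

```latex
% Dot-product attention scores are bilinear in (q_i, k_j); 2-simplicial
% scores are trilinear, letting each query attend to pairs (j, j') jointly.
s_{ij} = \sum_{d} q_{id}\, k_{jd}
\qquad \longrightarrow \qquad
s_{ijj'} = \sum_{d} q_{id}\, k_{jd}\, k'_{j'd}
```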
Understanding and Improving Length Generalization in Recurrent Models

By: Ricardo Buitrago Ruiz, Albert Gu

Recently, recurrent models such as state space models and linear attention have become popular due to their linear complexity in the sequence length. Thanks to their recurrent nature, in principle they can process arbitrarily long sequences, but their performance sometimes drops considerably beyond their training context lengths, i.e., they fail to length generalize. In this work, we provide comprehensive empirical and theoretical analysis to support the unexplored states hypothesis, which posits that models fail to length generalize when during training they are only exposed to a limited subset of the distribution of all attainable states (i.e., states that would be attained if the recurrence were applied to long sequences). Furthermore, we investigate simple training interventions that aim to increase the coverage of the states that the model is trained on, e.g. by initializing the state with Gaussian noise or with the final state of a different input sequence. With only 500 post-training steps ($\sim 0.1\%$ of the pre-training budget), these interventions enable length generalization for sequences that are orders of magnitude longer than the training context (e.g. $2k\longrightarrow 128k$) and show improved performance in long context tasks, thus presenting a simple and efficient way to enable robust length generalization in general recurrent models.
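The two interventions mentioned above are simple enough to sketch (our paraphrase; the recurrent architecture and batching details are placeholders, not the paper's code):

```python
# Sketch of state-coverage interventions: instead of always starting the
# recurrence from zeros, start from noise or from the final state of a
# different sequence, exposing training to otherwise-unexplored states.
import numpy as np

def initial_state(batch, d_state, mode="zeros", prev_final_states=None,
                  rng=np.random.default_rng(0)):
    if mode == "noise":        # Gaussian-noise initialization
        return rng.standard_normal((batch, d_state))
    if mode == "carryover":    # final states of *different* input sequences
        return prev_final_states[rng.permutation(batch)]
    return np.zeros((batch, d_state))
```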
ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning

By: Ruiyang Zhou, Shuozhe Li, Amy Zhang, Liu Leqi

Recent advances in large language models have been driven by reinforcement learning (RL)-style post-training, which improves reasoning by optimizing model outputs based on reward or preference signals. GRPO-style approaches implement this by using self-generated samples labeled by an outcome-based verifier. However, these methods depend heavily on the model's initial ability to produce positive samples. They primarily refine what the model already knows (distribution sharpening) rather than enabling the model to solve problems where it initially fails. This limitation is especially problematic in early-stage RL training and on challenging reasoning tasks, where positive samples are unlikely to be generated. To unlock reasoning ability in such settings, the model must explore new reasoning trajectories beyond its current output distribution. Such exploration requires access to sufficiently good positive samples to guide the learning. While expert demonstrations seem like a natural solution, we find that they are often ineffective in RL post-training. Instead, we identify two key properties of effective positive samples: they should (1) be likely under the current policy, and (2) increase the model's likelihood of predicting the correct answer. Based on these insights, we propose $\textbf{Self-Explanation Policy Optimization (ExPO)}$, a simple and modular framework that generates such samples by conditioning on the ground-truth answer. ExPO enables efficient exploration and guides the model to produce reasoning trajectories more aligned with its policy than expert-written CoTs, while ensuring higher quality than its own (incorrect) samples. Experiments show that ExPO improves both learning efficiency and final performance on reasoning benchmarks, surpassing expert-demonstration-based methods in challenging settings such as MATH level-5, where the model initially struggles the most.
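The generation step can be sketched as follows (our paraphrase; the prompt format and model API are illustrative placeholders, not the paper's implementation):

```python
# Sketch: conditioning on the ground-truth answer lets the *current* policy
# produce positive reasoning traces even on problems it would fail unaided,
# satisfying properties (1) and (2) above.
def self_explanation(model, question, answer):
    prompt = (f"Question: {question}\n"
              f"The correct answer is {answer}. "
              f"Explain step by step how to arrive at it.")
    trace = model.generate(prompt)   # stays close to the current policy
    return trace                     # used as a positive sample for RL
```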
Revisiting Learning Rate Control

By: Micha Henheik, Theresa Eimer, Marius Lindauer

The learning rate is one of the most important hyperparameters in deep learning, and how to control it is an active area within both AutoML and deep learning research. Approaches for learning rate control span from classic optimization to online scheduling based on gradient statistics. This paper compares paradigms to assess the current state of learning rate control. We find that methods from multi-fidelity hyperparameter optimization, fixed-hyperparameter schedules, and hyperparameter-free learning often perform very well on selected deep learning tasks but are not reliable across settings. This highlights the need for algorithm selection methods in learning rate control, which have been neglected so far by both the AutoML and deep learning communities. We also observe a trend of hyperparameter optimization approaches becoming less effective as models and tasks grow in complexity, even when combined with multi-fidelity approaches for more expensive training runs. A focus on more relevant test tasks and new promising directions like finetunable methods and meta-learning will enable the AutoML community to significantly strengthen its impact on this crucial factor in deep learning.