VCoT-Bench evaluates LLMs on Rust program verification via chain-of-thought reasoning, testing logical deduction abilities beyond binary pass/fail.
Method for reliable uncertainty quantification in Vision-Language-Action models by shifting focus to safety-critical moments in robotic control.
Dataset for detecting human situational awareness gaps in remote human-robot teaming through multimodal sensor data.
PowerFlow applies principled distribution matching to unsupervised reinforcement learning from LLM internal feedback without external supervision.
Theoretical study of computational and statistical hardness in computing calibration distance for probabilistic predictor evaluation.
Research on estimating causal representations from multi-domain data using empirical Bayes methods.
TARo enables frozen LLMs to perform structured reasoning at inference time through token-level adaptive routing, avoiding expensive post-training alignment.
Statistical framework for quantifying reliability of results from data analysis pipelines using selective inference techniques.
Unsupervised discovery of transition-structure concepts in text via temporal co-occurrence patterns using contrastive learning on a large corpus.
Adaptive context allocation method for LLM long-context inference using uncertainty-triggered token-level budgeting to address attention dilution.
Vision-language model method for temporal out-of-distribution detection and domain generalization in open-world settings using adaptive pattern matching.
Analysis of how standard LLM decoding strategies (top-k, nucleus sampling) exclude contextually appropriate but statistically rare tokens, in contrast to human language production.
Theoretical analysis of linear denoisers for noisy data, studying performance in the proportional regime without a known covariance.
Analyzes optimal satisficing regret bounds for nonstationary K-armed bandits with piecewise-stationary segments.
ICE-Guard detects spurious feature reliance in LLM decision-making through intervention consistency testing on demographic, authority, and framing features.
Addresses sim-to-real transfer for vision-language-action models in robotics by generating diverse 3D simulation worlds for RL fine-tuning.
iSatCR optimizes onboard computing and routing for LEO satellite data processing using graph neural networks to reduce ground transmission bottlenecks.
Studies the construction of compressed decision-sufficient datasets for linear programs with unknown cost vectors using decision-relevant dimension theory.
CausalVAD applies causal intervention to de-confound end-to-end autonomous driving models, addressing dataset bias and improving reliability.
ICE framework evaluates explanation faithfulness in LLMs via randomization tests with multiple intervention operators, distinguishing genuine faithfulness from chance.
WarPGNN applies physics-aware graph neural networks for efficient thermal warpage analysis in chiplet-package systems, replacing costly numerical simulations.
DRESS graph fingerprinting achieves unique fingerprints across 51,718 non-isomorphic strongly regular graphs using single-deletion vertex operations.
i-SDT combines predictive modeling and multi-class attack discrimination for detecting and responding to cyber-physical system attacks without full shutdowns.
SwiftGS enables rapid 3D satellite surface reconstruction via meta-learned Gaussian primitives predicted in a single forward pass for environmental monitoring.
Theoretical analysis comparing NUTS-mul and NUTS-BPS variants for Bayesian sampling with convergence guarantees under Gaussian targets.
Applies AI to repurpose single-lead ECG from Holter devices for sleep phenotyping, linking cardiovascular monitoring to sleep assessment.
Memento-Skills introduces an LLM agent that autonomously designs and improves task-specific agents through continual learning with stateful prompts and reusable skills.
Studies consistency and convergence rates of Recursive Rank Matching for computing Wasserstein distance surrogates in the small-discrepancy regime.
Dual-IFM develops an interpretable-by-design foundation model for retinal fundus image analysis using self-supervised learning with local and global interpretability.
Proposes variational guidance for autonomous aerial vehicle trajectory learning to address credit assignment and training instability in sparse-reward RL settings.
BeamAgent combines LLMs with wireless beamforming optimization through decoupled intent parsing and alternating optimization, separating LLM reasoning from numerical computation.
RewardFlow proposes topology-aware reward propagation on state graphs for RL-enhanced LLM agents, addressing sparse reward limitations without expensive dedicated reward models.
Machine-learning interatomic potential workflow for gas-surface scattering dynamics simulation on graphite.
Physics-informed diffusion model for radio map construction using few-shot learning with manifold alignment.
Introduces PromptHub for visual in-context learning using locality-aware fusion of multiple visual demonstrations with alignment and concentration mechanisms.
Proposes an evaluation framework beyond accuracy for human-AI collaborative decision-making, addressing miscalibrated reliance and team effectiveness.
Analyzes contextual bandits with single-index reward models where arms represent stable decisions and covariates evolve under the bandit policy.
Studies the shape of the entropy trajectory in chain-of-thought reasoning to predict LLM correctness without additional inference, testing on GSM8K with Qwen2.5-7B.
Proposes a unified taxonomy with 11 dimensions for categorizing deep learning approaches to multivariate time series anomaly detection.
Compares the OmniAnomaly stochastic recurrent model with PCA-based methods for multivariate time series anomaly detection using standardized evaluation protocols.
CRAFT method for aligning diffusion models through fine-tuning, addressing limitations of SFT and DPO-style preference optimization approaches.
Uses Stochastic Gumbel AlphaZero to evaluate difficulty in Tetris Block Puzzle variants, extending prior game-evaluation methods.
Online resource allocation algorithm with endogenous costs that model competitive interactions between modules.
Hypothesis-Conditioned Query Rewriting improves RAG systems by rewriting queries to prioritize decision-relevant evidence over topical relevance.
Lightweight cryptographic framework for verifiable AI inference enabling clients to verify model outputs without rerunning computation.
SEM method for post-hoc debiasing of CLIP via sparse embedding modulation to remove social and spurious biases.
Neural network approach to autoregressive time series estimation using backpropagation while preserving interpretability.
SAVeS framework steers safety judgments in Vision-Language Models through semantic cues and textual/visual interventions.
FedTrident defends federated learning-based road classification against poisoning attacks from malicious participants.
Studies how uncertainty estimation scales with sampling in reasoning language models using self-consistency and verbalized confidence.