Foundation World Models for Agents that Learn, Verify, and Adapt Reliably Beyond Static Environments
Vision for foundation world models enabling autonomous agents to learn, verify, and adapt reliably in open, non-static environments.
Multi-agent workflow system that translates jailbreak papers into executable modules for unified benchmarking of LLM robustness techniques.
Model-agnostic interpretable debiasing method for vision-language models to mitigate unintended social bias in black-box reasoning processes.
RewardUQ: Framework for uncertainty quantification in reward models used to align LLMs with human preferences, reducing annotation costs.
Data-driven optimization pipeline for scheduling and caching hundreds of LLM adapters in distributed serving to maximize GPU throughput.
Quant Experts: Post-training quantization method for vision-language models using mixture of experts for token-aware adaptive error reconstruction.
Empirical study evaluating whether reasoning capabilities universally improve LLM performance across sentiment analysis tasks of varying complexity.
ACWI framework for dynamically balancing intrinsic and extrinsic rewards in sparse-reward reinforcement learning through adaptive scaling.
Preference packing technique for efficient batch training of LLMs during preference optimization (RLHF), improving resource utilization.
DiffusionHarmonizer method combining neural reconstruction with diffusion models to enhance photorealism in robotic simulation environments.
ARGUS framework studying how narrative features in argumentative texts influence persuasion using corpus analysis and modeling.
Formal verification approach for vision-language models drafting radiology reports to ensure logical consistency in clinical reasoning.
Systematic evaluation of LLM machine translation for Ancient Greek technical prose, showing terminology rarity predicts translation failures.
Unsupervised temporal segmentation method for surgical video analysis using optimal transport, questioning necessity of large-scale pre-training.
CoME: Mobile agent architecture with four expert modules for hybrid reasoning including screen understanding, planning, and action execution.
ArgLLM-App: Interactive web system implementing argumentative reasoning agents with LLMs for explainable binary decision-making tasks.
TASC framework for accelerating small language models through task-adaptive sequence compression and vocabulary enrichment during fine-tuning.
Pre-training method for vision encoders (DINO) to improve cross-modal feature alignment between RGB images and depth maps.
Study of resilient decision-making strategies for agents under uncertainty and disturbances that could disrupt intended actions.
Federated learning approach for anomaly detection across heterogeneous IoT devices while preserving privacy without centralized data collection.
Method for training reasoning models to follow instructions within their reasoning traces, preventing unintended leakage of private information when AI agents process sensitive user data.
SafeGen-LLM framework enhances safety in robotic task planning by combining LLMs with safety constraints, addressing generalization challenges in classical and RL-based planning methods.
FaultXformer: Transformer-based model for fault detection and localization in electrical distribution systems using PMU data.
TREC 2025 DRAGUN track resources for evaluating RAG systems that help readers assess news trustworthiness with attributed reports.
Exploration of recurrent architectures with growing memory as subquadratic alternatives to Transformers for sequence modeling.
LoRA-Pre method reducing memory overhead in optimizers like Adam via low-rank approximation of momentum states.
CUDA Agent system using large-scale agentic RL to generate optimized GPU kernels, bridging the gap between LLMs and compiler-based systems.
Study comparing standard multi-turn prompting with user-turn-only prompting to determine if LLMs benefit from their own prior responses.
Research on offline-to-online multi-agent reinforcement learning with offline value function memory and sequential exploration strategies.
CowPilot framework enabling autonomous and human-agent collaborative web navigation with preference modeling and human oversight.
Method using language models to improve message passing in heterophilic graph neural networks by leveraging semantic node text.
MLE-Live framework for evaluating LLM agents in ML engineering that engage with research communities through knowledge sharing and communication.
Position paper examining opportunities and limitations of integrating LLMs into agent-based social simulations from a computational social science perspective.
Multi-agent system for clinical diagnosis that accumulates self-learned clinical knowledge across agent interactions for improved LLM performance.
Analysis of failure modes in multi-agent workflows built on low-code orchestration platforms, examining propagation across heterogeneous nodes.
RE-PO framework for robust LLM alignment that handles noisy preference data and unreliable annotations in RLHF-style training.
MITS algorithm using pointwise mutual information to improve tree search reasoning in LLMs with better step quality assessment.
Method for reducing hallucinations in multimodal LLMs by reallocating attention across layers to balance perception and reasoning.
AutoSpec framework for automatically refining logical specifications in reinforcement learning through exploration-guided search strategies.
Agentic framework orchestrating specialized tools for automated radiology reporting, combining vision-language models with multi-step reasoning.
Research on whether LLMs can mediate online conflicts by fostering empathy and constructive dialogue beyond content moderation.
Evaluation of visual UI design factors influencing web agent decision-making and task performance.
Real-time alignment technique for RLHF reward models to prevent overoptimization and preserve their ability to capture human intent.
Study on whether Large Reasoning Models know when to stop thinking, addressing redundancy in long chains-of-thought.
Training method for Large Reasoning Models using adaptive reflection and length penalties to reduce unnecessary token consumption.
ForesightSafety Bench evaluates frontier risks in autonomous AI with unpredictable and difficult-to-control behaviors.
IntentCUA framework for computer-use agents with intent-aligned planning and multi-agent coordination over long horizons.
Layered execution structures for tool orchestration in agentic systems with reflective error correction mechanisms.
Aletheia AI agent solved 6/10 FirstProof mathematics challenges autonomously using Gemini 3 Deep Think reasoning.
Risk-based fuzzy ethical decision-making framework with principle-level explainability and pluralistic validation.