SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
SimBench provides the first standardized benchmark for evaluating how faithfully LLMs simulate human behaviors across diverse tasks and metrics.
AtlasKV enables RAG systems to integrate billion-scale knowledge graphs efficiently in limited VRAM by avoiding expensive external retrieval modules.
Proposes DistDF for time-series forecasting using Wasserstein alignment to handle autocorrelated label sequences better than standard approaches.
Method to automatically extract and explain what features human feedback data encodes when training language models, addressing unpredictability in RLHF approaches.
Analysis of multilingual reasoning gaps in reasoning language models, showing deficits stem from language understanding failures in low-resource languages.
Method for interpreting LLM reasoning by resampling multiple chain-of-thought branches to measure causal influence and underlying computation.
LLM-guided decompilation framework using context to improve re-executability of decompiled binaries for security analysis.
Multimodal diffusion approach for robot learning from expert trajectories, modeling interactions between observations, actions, and rewards.
SynthAgent: Framework for web agent adaptation using synthetic data generation with quality filtering to handle hallucinations and trajectory noise.
GroupRank: Efficient passage reranking paradigm using LLMs with groupwise ranking to balance efficiency and accuracy.
LiveCLKTBench: Benchmark pipeline for reliably measuring cross-lingual knowledge transfer in multilingual LLMs with time-sensitive queries.
Framework for process-centric evaluation of agentic software systems, analyzing execution trajectories and reasoning beyond outcome metrics.
Theoretical framework for sparse dictionary learning in neural networks, analyzing piecewise biconvexity and spurious minima in mechanistic interpretability.
WisPaper: AI agent system for academic paper discovery and organization, addressing semantic search and workflow fragmentation challenges.
Multimodal expert fusion approach for interpretable Alzheimer's disease diagnosis from neuroimaging data.
VPR-AttLLM: Framework using LLM semantic reasoning to improve geo-localization of crowdsourced flood imagery.
Method for multi-subject image generation with distinction capability, integrating composition and distinction in subject-driven synthesis.
Multimodal RAG system enhanced with knowledge graphs for audio-visual retrieval, extending LLM capabilities to multimodal domains.
Study on imitation learning for autonomous driving, addressing the gap between privileged expert demonstrations and sensor-limited student observations in simulation.
Research on variance-aware tree policies for Monte Carlo Tree Search, improving upon UCB-based methods used in AlphaZero-style algorithms.
CricBench: Benchmark for evaluating LLMs on multilingual cricket analytics and domain-specific Text-to-SQL tasks.
Survey of Brazilian K-12 teachers' perceptions of AI in education, examining AI literacy and adoption across 346 educators.
Examines whether training small proxy models reliably guides data curation decisions for full-scale frontier AI model pretraining.
Disco-RAG improves retrieval-augmented generation by capturing discourse structure and synthesizing knowledge from dispersed evidence.
Enhanced-FQL(λ): Reinforcement learning framework with fuzzy eligibility traces and interpretable fuzzy rules for continuous control.
Defensive poisoning technique merges triggers to remove backdoors in instruction-tuned LLMs vulnerable to data poisoning attacks.
HAERAE-Vision benchmark with 653 real-world underspecified visual questions reveals vision-language model limitations with informal queries.
SODACER: Safe reinforcement learning framework with dual-buffer adaptive clustering for nonlinear system control.
GanitLLM: Bengali mathematical reasoning model with difficulty-aware curriculum-based GRPO training pipeline.
Coverage-enhanced latent actions framework for controlling multimodal conversational agents with reinforcement learning.
Game-theoretic analysis of how expanding AI agent capabilities affects strategic interaction in bargaining, negotiation, and persuasion.
EZ-MIA: Training-free membership inference attack against fine-tuned language models to audit privacy risks from data memorization.
Deep learning model for drug response prediction combining chemical substructures with cellular pathway states using differential attention.
Cross-modal domain adaptation approach transferring image dataset knowledge to LiDAR for synthetic training data generation.
Analyzes parallelism and generation order in Masked Diffusion Language Models across 8 models and 58 benchmarks.
Uses persona-based evaluation with LLMs to support inclusive cycling infrastructure design by simulating diverse user experiences.
Systematic analysis of demographic bias in LLM-generated targeted messaging across GPT-4o, Llama-3.3, and Mistral-Large models.
MERMAID: Multi-agent system for fact-checking using LLMs with memory-enhanced retrieval and iterative reasoning to assess veracity of claims.
Agent memory system that goes beyond RAG to address agent-specific needs, using decoupling and aggregation for bounded, coherent dialogue retrieval.
Unified framework explaining LLM steering methods (fine-tuning, LoRA, activation interventions) as dynamic weight updates from control signals.
El Agente Estructural: Multimodal agent for autonomous molecular geometry generation and manipulation using natural language and vision.
Fake-HR1: Hybrid-reasoning model for synthetic image detection balancing chain-of-thought reasoning with computational efficiency.
Study of electromagnetic fault injection attacks on embedded deep learning models, analyzing the influence of number representations on resilience.
AdvSynGNN: Resilient graph neural network architecture addressing structural noise and non-homophilous topologies via adversarial synthesis.
SubQuad: Pipeline for immune repertoire analysis combining subquadratic retrieval with GPU-accelerated affinity kernels and multimodal fusion.
UBio-MolFM: Universal molecular foundation model framework for bio-system simulation bridging quantum accuracy and biological scale.
Pyramid MoA: Hierarchical mixture-of-agents architecture with decision-theoretic router for cost-optimized anytime LLM inference.
Gome: Agent for machine learning engineering using gradient-based optimization instead of tree search, scaling LLM-based reasoning.
SteerEval: Benchmark for evaluating LLM controllability across language features, sentiment, and personality at multiple specification levels.
Physics-informed surrogate model for ferroelectric NAND retention analysis reducing computational cost from day-scale to second-scale.