Proposes ACE-Bench, an agent evaluation benchmark with unified grid-based planning tasks, lightweight environments, and configurable difficulty/horizon control.
Introduces Claw-Eval, an end-to-end evaluation suite for autonomous agents addressing trajectory-opaque grading, safety, and interaction modality coverage.
Theoretical analysis of contextuality in quantum information systems as an external bookkeeping cost under classical simulation.
Proposes Web Retrieval-Aware Chunking (W-RAC) for efficient RAG document chunking to balance retrieval quality, latency, and cost on web-scale content.
Proposes Task-Driven Alignment (TDA-RC) for improving reasoning chains in LLMs by bridging logical gaps between CoT and multi-round thought paradigms.
Evaluates bidirectional training objectives (MLM, masked attention) to mitigate the reversal curse in autoregressive language models.
Introduces Inclusion-of-Thoughts (IoT), a strategy to reduce LLM instability on multiple-choice questions by filtering irrelevant distractors.
Proposes the SUMMIR framework for ranking sports insights extracted by LLMs, addressing hallucinations with a 7,900-article dataset across four sports.
Evaluates four open-source PDF-to-Markdown conversion frameworks (Docling, MinerU, Marker, DeepSeek OCR) for RAG document preprocessing impact on QA accuracy.
Studies how to design information retrieval systems for LLM agents versus humans, proposing learning-to-rank methods for agent trajectories.
Analysis of how generative AI enables social engineering fraud and trust manipulation attacks in financial crime scenarios.
Surveys the transition from heuristic-based to generative synthesis methods for automatic video trailer generation using LLMs and diffusion models.
Opinion piece on environmental and computational costs of scaling LLM agents and implications for planetary boundaries.
Self-supervised foundation model (CalM) trained on neuronal calcium traces for transfer learning across neuroscience tasks.
Proposes MG²-RAG, a multi-granularity graph approach for retrieval-augmented generation in multimodal LLMs to improve cross-modal reasoning without costly text translation.
Independent evaluation of Claude Code's auto mode permission system for AI coding agents, testing security gates on ambiguous authorization scenarios.
Introduces Squeez, a method for pruning tool outputs in coding agents by identifying minimal relevant evidence blocks; includes an 11,477-example benchmark derived from SWE-bench.
CURE enables privacy-preserving unlearning in LLM-based recommendation systems using circuit-aware techniques for removing user data.
Cactus improves speculative sampling for LLM decoding by relaxing strict distribution matching to allow acceptable variations like top-k sampling.
Prune-Quantize-Distill pipeline for neural network compression optimizing wall-clock inference time rather than parameter count or FLOPs.
Analysis of implicit architectural decisions made by AI coding agents, identifying five mechanisms and six prompt-architecture coupling patterns.
FreakOut-LLM framework investigates whether emotionally charged prompts compromise safety alignment in ten LLMs using psychological stimuli.
Comparative evaluation of embedding-based and generative models for document classification, showing that Vision-Language Models with CoT achieve 82% zero-shot accuracy.
PRIME enables multimodal self-supervised pretraining for cancer prognosis with missing modalities by combining histopathology, gene expression, and reports.
Case study of closed-loop software development system managing backlog via deterministic pipeline with Jira integration and safety constraints.
Studies learning from weak supervision under distribution shift in CRISPR-Cas13d experiments where guide efficacy is indirectly inferred.
EduIllustrate benchmark evaluates LLMs on generating multimodal educational content combining accurate diagrams with step-by-step explanations.
BDATP framework for audio-visual navigation using binaural attention and action prediction to improve generalization in unseen 3D environments.
YMIR dataset and CNN model for classifying five Yemeni music genres, addressing underrepresentation of non-Western music in MIR research.
Comparative analysis of key-value cache management strategies for efficient LLM inference under different model sizes and context lengths.
Proposes training LLM coding agents on five atomic coding skills (localization, editing, testing, reproduction, review) for improved generalization.
StarVLA provides a modular open-source codebase for building vision-language-action embodied agents with standardized evaluation protocols.
Phase-Associative Memory is a recurrent sequence model using complex-valued representations achieving competitive perplexity on WikiText-103.
ID-Sim proposes an identity-focused similarity metric for vision models to improve evaluation of personalized image generation tasks.
PCA-Triage is a streaming algorithm for adaptive sensor sampling in IoT networks using principal component analysis to manage bandwidth constraints.
Study evaluating LLM sensitivity to prompt phrasing in medical question answering, showing inconsistent responses despite identical underlying evidence.
DynLMC generates synthetic multivariate time series with time-varying correlations and cross-channel dependencies for training foundation models.
Presents AutoLALA, an open-source tool for analyzing data locality in loop programs for HPC and AI workloads.
Applies differential privacy techniques to privacy-preserving graph learning on additive manufacturing sensor data.
Introduces Nidus, a governance runtime that uses Claude, Gemini, and Codex to mechanize the V-model for AI-assisted software delivery.
Proposes OmniScore, a set of deterministic evaluation metrics for multilingual text generation as an alternative to LLM judges.
Audits code-editing benchmarks for LLMs, finding flaws in existing evaluation methods for instructed code modification.
Applies diffusion models to medical imaging, generating paired mammogram views for cancer screening datasets.
Presents a Decision Pre-Trained Transformer for in-context reinforcement learning, enabling scalable training of generalist agents.
Proposes CRAB, a method for mitigating popularity bias in generative recommendation systems via codebook rebalancing.
Presents π², a pipeline for curating reasoning data from structured sources to improve LLM long-context reasoning.
Studies vision-language models learning from grounded video data, finding text-only bias in video benchmarks.
Models prior authorization policy retrieval as an MDP for adaptive decision-making in healthcare insurance.
Studies how reasoning evolves in language models through fine-tuning and RL, via chess task performance.
EffiPair improves the runtime and memory efficiency of LLM-generated code via Relative Contrastive Feedback, without model fine-tuning.
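Background for the Cactus entry above: the standard speculative-sampling acceptance rule that it relaxes is well established and can be sketched in a few lines. This is a minimal illustration of the generic accept/reject test, not Cactus's own method; the function name and scalar-probability interface are assumptions for the sketch.

```python
import random

def accept_draft_token(p_target: float, p_draft: float) -> bool:
    """Standard speculative-sampling accept/reject test for one draft token.

    p_target: target model's probability of the drafted token
    p_draft:  draft model's probability of the same token (must be > 0)

    Accepts with probability min(1, p_target / p_draft). On rejection,
    the caller resamples from the residual distribution
    max(0, p_target - p_draft) (normalized), which makes the combined
    procedure sample exactly from the target distribution.
    """
    return random.random() < min(1.0, p_target / p_draft)
```

Methods like the one summarized above loosen this exact-match guarantee (e.g. tolerating top-k-style deviations) to accept more draft tokens per step.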