C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts
Benchmark dataset for detecting AI-generated Chinese text with evaluation across multiple LLM architectures.
Deep learning method for uncertainty quantification in clinical radiotherapy segmentation using budget-aware constraints.
RL approach for training physics reasoning models on simulators to address lack of large-scale QA datasets in physics domain.
Evaluation of LLM causal reasoning capabilities using real-world complex texts with implicit causal relationships.
Benchmark evaluating VLMs' strategic reasoning abilities in multi-agent environments with multimodal observations.
Three-stage pipeline for disambiguation-centric finetuning of enterprise tool-calling LLMs to reduce errors with near-duplicate tools.
Multi-agent LLM system for automated academic poster generation from papers incorporating design and aesthetic principles.
Benchmark and framework for evaluating LLM-driven persuasive dialogue for health behavior change in insulin delivery adoption.
GUI agent framework for multi-step e-commerce risk management handling stateful interactions with dynamic web content.
Interactive learning approach enabling LLMs to improve reasoning through multi-agent interactions during inference without re-execution.
Reward learning method deriving progress estimation signals from passive videos for robotics RL tasks without manual reward engineering.
RL method for improving reasoning in diffusion-based language models using denoising process rewards instead of outcome-only rewards.
Multi-agent LLM system for iterative narrative script refinement using divide-and-conquer approach to improve long-form creative content generation.
RL framework for e-commerce search relevance using stepwise reward optimization to improve LLM-based query-product matching beyond SFT/DPO limitations.
Graph-coarsening strategy for Capacitated Vehicle Routing Problem with time windows using multilevel aggregation and quantum/classical solvers for large-scale logistics optimization.
MGA, a memory-driven GUI agent, reduces context overload and architectural redundancy by managing sequential trajectory history for improved long-horizon end-to-end automation.
Audits MedCalc-Bench clinical labels using physician-in-the-loop stewardship to assess reliability of LLM-synthesized reference labels in ML benchmarks.
PRISM framework disentangles SFT and RL training data via gradient concentration to diagnose learning needs and optimize data allocation for LLM agent training.
AgencyBench evaluates LLM-based autonomous agents on long-horizon real-world scenarios with 1M-token context windows, enabling scalable automated evaluation without human-in-the-loop.
Risk Awareness Injection method calibrates vision-language models against multimodal jailbreak attacks without fine-tuning or token manipulation, preserving model utility.
ANCHOR framework generates high-quality synthetic training data for GUI agents by trajectory expansion from seed demonstrations to create diverse, goal-consistent interaction data.
Constrained Assumption-Based Argumentation (CABA) extends ABA frameworks beyond propositional atoms to support variable-based arguments for structured argumentation.
AI agent system for pharmaceutical drug asset scouting across global non-English channels to identify novel drug development opportunities via multi-source intelligence.
FlexMS benchmark framework for evaluating deep learning mass spectrum prediction tools in metabolomics for drug discovery and molecular property identification.
Nano-EmoX proposes a three-level cognitive hierarchy (perception, understanding, interaction) for unified multimodal emotional intelligence in language models with empathy capabilities.
Diagnostic framework for LLM agent memory systems comparing write strategies, retrieval methods, and utilization behavior to identify performance bottlenecks across memory components.
Analyzes whether AI systems fail similarly to humans using error alignment metrics on out-of-distribution data to assess cognitive similarity and decision-making strategies.
NormCoRe framework studies how norms emerge in multi-agent AI systems through deliberation and negotiation using replication-by-translation methodology for fairness-sensitive domains.
dTRPO algorithm reduces trajectory probability calculation costs for policy optimization of diffusion-based LLMs, enabling scaled offline RL training for preference alignment.
Method for tracking internal states of LLMs across conversations using self-report-inspired techniques for safety, interpretability, and model welfare without white-box compression.
Manifesto proposing Agentic Business Process Management (APM) framework extending BPM to govern autonomous agents executing organizational processes with agent-oriented abstractions.
Maximum entropy methods for generating synthetic populations matching multi-way constraints from aggregate statistics, applied to microsimulation and privacy-preserving data release.
Large-scale empirical study analyzing 2,000+ publications on reinforcement learning environments, proposing a taxonomy of RL environment evolution and technological trends.
Examines ethical front-end design choices in conversational AI systems, focusing on user interaction and representation rather than backend algorithmic issues.
AIRA_2 addresses three bottlenecks in AI research agents: synchronous GPU execution, generalization gaps, and fixed LLM operator limitations through improved architectural design.
AutoMS is a multi-agent neuro-symbolic framework using LLMs as semantic navigators for evolutionary search in inverse microstructure design, addressing topology optimization challenges.
CoEvoSkills framework enables LLM agents to self-evolve structured multi-file skill artifacts through co-evolutionary verification without manual authoring.
Deep RL framework optimizes land-use allocation in Lake Malawi Basin to maximize ecosystem service value with ecological constraints.
Position paper on failure modes in agentic IR systems, analyzing error cascades in multi-step reason-act-observe workflows despite linguistic fluency.
Hierarchical multi-agent RL framework for reconfigurable intelligent surfaces removes need for channel state information estimation.
SCMAPR uses self-correcting multi-agent prompt refinement to improve text-to-video generation in complex scenarios with ambiguous prompts.
CuraLight combines LLM-centered control with debate-guided data curation for interpretable and generalizable traffic signal control systems.
EmoMAS uses Bayesian multi-agent framework with small language models for emotionally-aware negotiation in privacy-sensitive edge deployments.
EVGeoQA benchmark evaluates LLM reasoning on dynamic geo-spatial exploration with multi-objective planning and compound constraints.
Rhizome OS-1 is a semi-autonomous operating system deploying multi-modal AI agents as computational and medicinal chemists for drug discovery.
Pre-registered study demonstrating AI safety measures can cause harmful outputs in medical domains, with contextual prompting changing model behavior.
SEARL framework enables self-evolving agents through joint optimization of policy and tool graph memory, reducing reliance on large-scale LLMs.
HiL-Bench evaluates whether coding agents know when to request help with incomplete specifications, exposing judgment gaps in frontier models.
Empirical study of how LLM agents coordinate in multi-agent games, distinguishing baseline action similarity from strategic algorithmic monoculture.
AXIL derives an exact instance attribution method for gradient boosting machines, expressing predictions as weighted sums of training targets.
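
The "weighted sums of training targets" idea in the last entry can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the AXIL authors' implementation: it assumes squared-error loss, one-split stumps on a single feature, and in-sample predictions only. The helper names (fit_stump, leaf_matrix, boost_with_attribution) are hypothetical. Each boosting round replaces residuals by their leaf averages, which is a linear operation on the targets, so a weight matrix K with F = K y can be tracked alongside the ordinary predictions.

```python
def fit_stump(x, r):
    """Pick the split threshold that minimises squared error on residuals r."""
    best_t, best_sse = None, float("inf")
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t


def leaf_matrix(x, t):
    """L[i][j] = 1/|leaf(i)| if samples i and j share a leaf, else 0."""
    n = len(x)
    side = [xi <= t for xi in x]
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        members = [j for j in range(n) if side[j] == side[i]]
        for j in members:
            L[i][j] = 1.0 / len(members)
    return L


def boost_with_attribution(x, y, rounds=5, lr=0.5):
    """Least-squares boosting that also tracks K such that F = K y."""
    n = len(y)
    F = [sum(y) / n] * n                      # initial prediction: mean target
    K = [[1.0 / n] * n for _ in range(n)]     # same thing written as weights
    for _ in range(rounds):
        r = [yi - fi for yi, fi in zip(y, F)]
        L = leaf_matrix(x, fit_stump(x, r))
        # Ordinary update: add the shrunken leaf means of the residuals.
        F = [F[i] + lr * sum(L[i][j] * r[j] for j in range(n)) for i in range(n)]
        # Same update expressed on the weights: K <- K + lr * L (I - K).
        ImK = [[(1.0 if i == j else 0.0) - K[i][j] for j in range(n)] for i in range(n)]
        K = [[K[i][j] + lr * sum(L[i][k] * ImK[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
    return F, K


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 0.9, 1.1, 3.0, 3.2, 2.9]
F, K = boost_with_attribution(x, y)
recon = [sum(K[i][j] * y[j] for j in range(len(y))) for i in range(len(y))]
```

In this sketch, recon matches F up to floating-point error, and each row of K sums to 1, so row i can be read as how much each training target contributes to the prediction for sample i.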