TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems
TFRBench: Benchmark for evaluating reasoning capabilities of time-series forecasting systems beyond numerical accuracy metrics.
Using LLMs as judges to evaluate lightweight segmentation models for drone-based power line inspection under distribution shift.
Domain-invariant neurons approach for cross-domain knowledge transfer to boost LLM reasoning in expertise-scarce specialized domains.
Empirical study on using cross-domain demonstrations to improve in-context learning when expert annotations in target domain are scarce.
HYVE framework for LLMs to better process machine data (logs, metrics, traces) through hybrid structured/unstructured representations.
CODESTRUCT: LLM-based code agents using structured AST action spaces instead of text matching for reliable code editing and repository interaction.
Research on multi-agent pathfinding algorithms handling non-unit edge costs and continuous-time actions for real-world robotic/logistics scenarios.
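As generic background for the non-unit-cost setting (this is standard single-agent shortest-path machinery, not the surveyed multi-agent algorithms): once edge costs are non-unit, plain BFS no longer finds optimal paths and a priority-queue search such as Dijkstra's algorithm is needed. A minimal sketch, with the graph structure purely illustrative:

```python
import heapq

def dijkstra(graph, start, goal):
    """Shortest path with non-unit edge costs.
    `graph` maps node -> list of (neighbor, cost) pairs."""
    pq = [(0.0, start, [start])]   # (cost so far, node, path taken)
    seen = set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, w in graph.get(node, []):
            if nbr not in seen:
                heapq.heappush(pq, (cost + w, nbr, path + [nbr]))
    return float("inf"), []

# Toy weighted graph: the direct A->C edge (4.0) loses to A->B->C (2.5).
grid = {"A": [("B", 1.5), ("C", 4.0)],
        "B": [("C", 1.0), ("D", 5.0)],
        "C": [("D", 1.2)]}
print(dijkstra(grid, "A", "D"))  # -> (3.7, ['A', 'B', 'C', 'D'])
```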
PRISM-MCTS learning approach using reasoning trajectories with metacognitive reflection, inspired by reasoning models like OpenAI o1, for efficient reasoning in low-resource NLP settings.
Automated framework using locally-deployed LLMs to audit hospital discharge summaries at scale, enforcing transition-of-care documentation requirements for patient safety.
Adaptive serverless resource management framework using slot-survival prediction and event-driven architecture to optimize cold start latency and utilization.
OntoTKGE model for temporal knowledge graph extrapolation leveraging ontological knowledge to handle sparse historical interactions and enable behavioral pattern inheritance.
GMRL-BD algorithm using bias-diffusion and multi-agent RL to detect the topic boundaries beyond which LLM answers cannot be reliably trusted, identifying untrustworthy domains.
Auditable Agents framework establishing accountability, auditability, and auditing definitions for LLM agents with external effects, addressing post-deployment answerability.
SCMAPR stage-wise multi-agent refinement framework for complex scenario text-to-video generation that refines and self-corrects ambiguous prompts through agent collaboration.
Thinking Diffusion method adding reasoning penalization and guidance to diffusion-based multimodal LLMs, combining Chain-of-Thought reasoning with parallel generation capabilities.
OmniDiagram unified framework for code generation across diverse diagram types and languages using visual interrogation reward for alignment with visual specifications.
UniCreative approach using reference-free reinforcement learning to balance long-form coherence and short-form expressiveness in LLM-based creative writing generation.
Market-Bench comprehensive benchmark evaluating LLM capabilities in economically relevant tasks via configurable multi-agent supply chain model with LLM retailer agents.
ActivityEditor dual-LLM-agent framework for zero-shot cross-regional human trajectory generation, synthesizing physically valid mobility patterns without region-specific historical data.
Analysis of 12,007 rank-invariant pseudo-Boolean landscapes introducing stronger notion of rank landscape equivalence under translation and rotation symmetries.
Echo memory framework for multimodal LLM agents enabling transfer of reusable knowledge across Minecraft tasks by decomposing experience into five interpretable dimensions.
SignalClaw framework using LLMs as evolutionary skill generators to synthesize interpretable traffic signal control strategies balancing effectiveness and explainability.
Introduces Tree Decision Diagrams generalizing OBDD for Boolean function representation with improved succinctness and tractable operations like model counting and conditioning.
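For background on why model counting is called tractable here (this illustrates the standard bottom-up count on ordered decision diagrams, not the paper's Tree Decision Diagram construction): on an ordered diagram, the number of satisfying assignments follows from one recursive pass, multiplying by 2 for each variable a path skips. A minimal sketch with a hypothetical node encoding:

```python
# Hypothetical minimal ordered decision diagram: an internal node is a
# tuple (var_index, low_child, high_child); terminals are True / False.
# Variables are ordered 0..N_VARS-1 along every path.
N_VARS = 3

def model_count(node, level=0):
    """Count satisfying assignments of the function rooted at `node`,
    given that variables `level..N_VARS-1` are still unassigned."""
    if node is True:
        return 2 ** (N_VARS - level)   # all remaining assignments satisfy
    if node is False:
        return 0
    var, lo, hi = node
    skipped = var - level              # variables untested on this edge
    return (2 ** skipped) * (model_count(lo, var + 1) + model_count(hi, var + 1))

# f(x0, x1, x2) = x0 AND x2; x1 is skipped, so its 2 values are free.
f = (0, False, (2, False, True))
print(model_count(f))  # -> 2
```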
Neurosymbolic approach combining LLMs with Logic Tensor Networks for auditable offer validation in regulated procurement, ensuring factually correct and legally verifiable decisions.
COSMO-Agent tool-augmented RL framework teaching LLMs to bridge CAD-CAE gap by translating simulation feedback into valid geometric edits for iterative industrial design optimization.
ResearchEVO framework for automated scientific discovery using LLMs to conduct undirected experimentation and generate explanations, instantiating discover-then-explain paradigm computationally.
Research on LLM-as-a-Judge showing, via counterfactual design and eye-tracking, that both humans and LLMs favor content labeled as human-authored over identical content labeled as AI-generated.
Philosophical critique of behavioral evaluation paradigms for AI systems and proposal for cognitive assessment methods.
PECKER algorithm for efficient machine unlearning in diffusion models with directed gradient updates.
CuraLight framework combining RL and LLMs for traffic signal control with debate-guided data curation.
LudoBench benchmark evaluating LLM strategic reasoning in Ludo board game with 480 handcrafted scenarios.
Quality-aware mixture of experts for multimodal sentiment analysis robust to noise and modality missingness.
Unlearn-and-Reinvent pipeline testing whether LLMs can rediscover foundational algorithms after those algorithms are removed via unlearning.
Study on cultural evolution showing minimal social learning can transmit higher-level representations without inference.
Hierarchical RL framework (STEP-HRL) for LLM agents using step-level transitions to reduce computational cost and history length.
Vision-language model critic for automated iterative refinement of frontend code generation with visual feedback loops.
Open-source framework for autonomous LLM agents conducting deep learning experiments with hypothesis formation, training, and iterative refinement.
Diagnostic framework determining when LLMs are necessary for contextual multi-armed bandits with text and numerical context.
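For context on the non-LLM side of such a diagnostic, here is a generic epsilon-greedy linear contextual bandit over hashed text features plus a numeric feature; all names and the featurization are illustrative assumptions, not the framework's method:

```python
import random

random.seed(0)

def featurize(text, num_bins=8):
    """Toy text context: hash words into a fixed-size count vector."""
    vec = [0.0] * num_bins
    for word in text.split():
        vec[hash(word) % num_bins] += 1.0
    return vec

class EpsilonGreedyBandit:
    """Linear-score epsilon-greedy bandit over mixed text/numeric context."""
    def __init__(self, n_arms, dim, epsilon=0.1, lr=0.05):
        self.epsilon = epsilon
        self.lr = lr
        self.weights = [[0.0] * dim for _ in range(n_arms)]

    def select(self, context):
        if random.random() < self.epsilon:
            return random.randrange(len(self.weights))   # explore
        scores = [sum(w * x for w, x in zip(ws, context)) for ws in self.weights]
        return scores.index(max(scores))                 # exploit

    def update(self, arm, context, reward):
        # SGD step toward the observed reward, for the chosen arm only
        pred = sum(w * x for w, x in zip(self.weights[arm], context))
        err = reward - pred
        for i, x in enumerate(context):
            self.weights[arm][i] += self.lr * err * x

bandit = EpsilonGreedyBandit(n_arms=2, dim=9)
for _ in range(500):
    ctx = featurize("urgent refund request") + [1.0]  # text + numeric feature
    arm = bandit.select(ctx)
    reward = 1.0 if arm == 1 else 0.0                 # arm 1 is always better here
    bandit.update(arm, ctx, reward)
```

After training, the learned score for arm 1 exceeds arm 0 on this context, so greedy selection switches to the better arm.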
JTON format, JSON superset with Zen Grid encoding for token-efficient structured data processing in LLMs.
Joint knowledge base completion and question answering using combined large and small language models.
KV cache compression technique for multimodal LLM inference, reducing memory overhead and latency with hybrid compression strategy.
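As a toy illustration of the general eviction idea behind KV-cache compression (a generic attention-based heuristic, not the paper's hybrid strategy; all names are illustrative):

```python
def prune_kv_cache(keys, values, attn_weights, keep=4):
    """Toy KV-cache pruning: keep the `keep` cached entries that received
    the most cumulative attention, preserving their original order."""
    ranked = sorted(range(len(keys)), key=lambda i: attn_weights[i], reverse=True)
    kept = sorted(ranked[:keep])   # restore positional order for the kept entries
    return [keys[i] for i in kept], [values[i] for i in kept]

keys   = ["k0", "k1", "k2", "k3", "k4", "k5"]
values = ["v0", "v1", "v2", "v3", "v4", "v5"]
attn   = [0.30, 0.05, 0.20, 0.02, 0.25, 0.18]   # cumulative attention per entry
k, v = prune_kv_cache(keys, values, attn, keep=4)
print(k)  # -> ['k0', 'k2', 'k4', 'k5']
```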
Architecture for value-driven LLM agents addressing behavioral rigidity through context-value-action design.
Foundation model enabling single GPT-based agent to perform across diverse multi-agent reinforcement learning tasks and environments.
Research agent framework for generating trustworthy reports with confidence estimation and calibration mechanisms.
Multi-objective preference alignment for LLMs using Pareto-lenient consensus to handle diverse human values in model training.
AI agents for retail supply chain operations, automating demand forecasting, procurement, and inventory replenishment in supermarket chains.
Proposes epistemic blinding, an inference-time auditing protocol to separate memorized priors from data-driven inference in LLM-assisted agentic analysis systems.
Investigates instruction-following mechanisms in LLMs through diagnostic probing, finding evidence for compositional skill deployment over universal mechanism.
Proposes ACE-Bench, agent evaluation benchmark with unified grid-based planning tasks, lightweight environments, and configurable difficulty/horizon control.
Introduces Claw-Eval, an end-to-end evaluation suite for autonomous agents addressing trajectory-opaque grading, safety, and interaction modality coverage.