Model Workspace Protocol (MWP) simplifies agentic AI orchestration using folder structures for sequential workflows, reducing engineering overhead compared to multi-agent frameworks.
Enhances OpenVLA vision-language-action models with synthetic instruction augmentation to improve zero-shot performance in new environments for embodied AI tasks.
POaaS optimizes prompts for on-device small language models through minimal edits, reducing hallucinations and improving accuracy without requiring lengthy structured instructions.
Context alignment pre-processor enhances LLM dialogue coherence by resolving contextual misalignment when users omit premises, simplify references, or shift context during interactions.
ARISE uses hierarchical reinforcement learning to improve mathematical reasoning in LLMs by developing reusable strategies that accumulate during training rather than treating problems in isolation.
VIGIL deploys edge-resident AI agents for enterprise IT support, performing diagnosis, knowledge retrieval, and policy-governed remediation on user devices with consent and observability.
NeuronSpark: 0.9B-parameter spiking neural network language model using state-space dynamics and surrogate gradients without Transformer distillation.
SQL-ASTRA: agentic reinforcement learning framework for text-to-SQL using column-set matching and trajectory aggregation for credit assignment.
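The summary does not spell out SQL-ASTRA's column-set matching signal; as a hedged illustration only, one plausible generic instantiation is an F1 score between the column sets referenced by the predicted and gold queries, usable as a dense reward (the function name `column_set_reward` is hypothetical, not from the paper):

```python
# Hypothetical sketch: a column-set F1 reward for text-to-SQL RL.
# Generic instantiation for illustration, not SQL-ASTRA's actual reward.

def column_set_reward(pred_cols, gold_cols):
    """F1 between columns referenced by a predicted query and the gold query."""
    pred, gold = set(pred_cols), set(gold_cols)
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(column_set_reward(["user.id", "user.name"], ["user.id", "order.total"]))  # → 0.5
```

Unlike binary execution-match rewards, such a set-overlap score gives partial credit, which is what makes credit assignment over long generation trajectories tractable.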
Data contamination audit reveals that public LLM benchmarks may have leaked into training data, calling claims of superhuman performance into question.
Framework for safe LLM-based IoT agents using dual-stage intent analysis to prevent hallucination and reduce interaction overhead.
MOSAIC: modular control token approach for context-dependent safety alignment in LLMs across applications and regions.
Adaptive theory of mind framework for LLM-based multi-agent coordination, aligning agents' reasoning depth about others' mental states.
NeSy-Route: neuro-symbolic benchmark for constrained route planning in remote sensing, evaluating perception, reasoning, and planning in MLLMs.
Machine-learning approach for predicting and reasoning over high-dimensional discrete event sequences derived from vehicle diagnostic trouble codes.
FactorEngine framework for automated discovery of interpretable alpha factors from market data, combining symbolic and neural approaches for quantitative investment.
Empirical analysis showing that negative-only feedback training for LLMs matches or exceeds standard RLHF, with theoretical foundations explored through a "via negativa" framework.
Introduces Option Query Language (OQL) domain-specific intermediate representation for translating natural language into executable financial option strategies.
Studies how visual distractions undermine moral reasoning in vision-language models, identifying gaps in multimodal safety techniques.
TRUST-SQL uses reinforcement learning for text-to-SQL over unknown database schemas, where agents actively identify relevant tables from massive metadata.
RetailBench evaluates long-horizon autonomous decision-making of LLM agents in realistic dynamic retail environments with stochastic conditions.
Hybrid-evidential deductive reasoning approach for open-vocabulary multimodal emotion recognition using MLLMs.
Causal evaluation protocol measuring whether intermediate structures (rubrics, checklists) causally determine LLM outputs or merely accompany them.
Multimodal LLM (ExpressMind) for expressway operation, applying cognitive intelligence to transportation systems beyond rule-based approaches.
Investigates customization approaches for smaller open-source LLMs to improve domain-specific code generation without relying on large proprietary models.
Proposes guardrails for LLM-enabled robots allocating scarce assistance across multiple users with conflicting values and unpredictable LLM behavior.
BenchPreS evaluates whether memory-based LLM personalization appropriately suppresses user preferences in context-sensitive communication settings.
V-DyKnow benchmark evaluates how vision-language models handle time-sensitive knowledge that becomes outdated after training.
Framework for runtime governance of LLM-based AI agents, balancing task completion with legal and reputational costs through execution-path monitoring.
Analyzes AI reasoning about geopolitical conflicts using a temporally grounded case study of a 2026 Middle East conflict postdating model training cutoffs.
Integrates constraint propagation into dynamic programming to bridge gap between state-based and constraint-based paradigms for combinatorial problems.
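The paper's algorithm is not detailed in this summary; a minimal generic sketch of the idea, assuming a knapsack-style DP where each state is bound-checked by propagating a value constraint before expansion (all names here are illustrative):

```python
# Generic sketch: dynamic programming with constraint-propagation pruning.
# States that provably cannot satisfy a minimum-value constraint, even with
# all remaining items, are not expanded.

def dp_with_pruning(items, capacity, min_value):
    """Max value with total weight <= capacity; prunes states whose
    optimistic bound (value + best possible remaining value) < min_value."""
    suffix = [0] * (len(items) + 1)
    for i in range(len(items) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + items[i][1]   # optimistic value bound

    best = {0: 0}                                  # weight -> best value so far
    for i, (w, v) in enumerate(items):
        nxt = dict(best)
        for weight, value in best.items():
            if value + suffix[i] < min_value:      # propagated bound: prune
                continue
            if weight + w <= capacity:
                key = weight + w
                if nxt.get(key, -1) < value + v:
                    nxt[key] = value + v
        best = nxt
    return max(best.values())

print(dp_with_pruning([(2, 3), (3, 4), (4, 5)], capacity=5, min_value=0))  # → 7
```

The DP table supplies the state-based view; the bound check is the constraint-based view, shrinking the state space before expansion rather than after.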
Pipeline for developing norm-compliant reinforcement learning agents, inspired by the Pinocchio story, addressing safe integration of AI into society.
Fine-tuning LLMs on journal publication decisions to enable models to assess scientific merit and predict promising research directions.
Mobile app teaching digital literacy and prebunking misinformation tactics through interactive challenges in nine languages.
Code LLM series (7B-40B) using code-flow multi-stage training paradigm to capture dynamic software logic evolution.
Investigation of how user personalization and mental health disclosure affect harmful behavior in tool-using LLM agents.
Benchmark for evaluating continual learning in biomedical NLP across task-diverse datasets with robustness and efficiency metrics.
Study of reproducibility in AI coding agents, showing agent-to-agent variation produces nonstandard errors in empirical results.
Two-stage RL framework training multimodal agents for anticipatory reasoning and long-term planning in multi-step tasks.
Pipeline integrating forecasting models and ML regressors with inventory optimization, evaluated on M5 Walmart dataset.
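The coupling between forecasting and inventory decisions can be sketched generically (this assumes a newsvendor-style objective, which the summary does not confirm; `newsvendor_order` is an illustrative name):

```python
# Generic sketch: feed a demand forecast's sample distribution into the
# newsvendor rule, ordering at the critical quantile set by the ratio of
# underage (lost-sale) to overage (holding) cost.

def newsvendor_order(demand_samples, underage_cost, overage_cost):
    """Order quantity = critical-ratio quantile of forecast demand samples."""
    ratio = underage_cost / (underage_cost + overage_cost)
    s = sorted(demand_samples)
    idx = min(int(ratio * len(s)), len(s) - 1)
    return s[idx]

# Asymmetric costs push the order above the median forecast.
print(newsvendor_order([80, 90, 100, 110, 120], underage_cost=3, overage_cost=1))  # → 110
```

The point of such a pipeline is that the forecaster's full predictive distribution, not just its mean, drives the stocking decision.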
Evaluation of conformal factuality as reliability guarantee for RAG-based LLMs with novel metrics and robustness analysis.
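The paper's novel metrics are not described here; as background, the standard split-conformal calibration that "conformal factuality" builds on can be sketched as follows (per-claim nonconformity scores and the filtering step are assumptions of this sketch):

```python
# Generic sketch of split conformal calibration: pick a threshold on
# calibration-set nonconformity scores so that retained claims are
# correct with probability >= 1 - alpha (finite-sample corrected).
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)   # rank of corrected quantile
    return sorted(cal_scores)[k - 1]

def filter_claims(claims, scores, tau):
    """Keep only claims whose nonconformity score is within the threshold."""
    return [c for c, s in zip(claims, scores) if s <= tau]

tau = conformal_threshold([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], alpha=0.2)
print(tau)  # → 0.8
```

The guarantee is marginal over exchangeable calibration and test data, which is precisely why robustness under distribution shift, as the paper evaluates, is the interesting question.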
Large-scale multimodal surgical dataset and foundation models for cross-procedure generalization in surgical AI tasks.
Study of cultural bias in LLMs and prompt-based methods to improve cultural alignment for policy and decision-making tasks.
RL environment where LLM agents learn to generate professional presentations through research, planning, and tool use with multi-component reward system.
Method for training LLM agents to leverage rich environment feedback through reflective experience and post-training, improving long-horizon planning.
Benchmark evaluating audio-visual social interactivity capabilities of omni-modal LLMs in dynamic dialogue settings.
RL framework using Soft Actor-Critic to learn adaptive ray sampling policies for efficient neural radiance field rendering.
Multimodal AI search framework combining vector search, hybrid retrieval, and reasoning for pharmaceutical data across text, images, audio, and video.
Evaluation of VLMs (GPT-4V, Gemini, Claude, LLaVA) for navigation assistance tasks for people with vision impairments.
Framework extending RLHF using multi-dimensional rubric-based rewards instead of scalar signals for RL training.
Inference-time governance approach for LLMs using adaptive prompt routing to enable social alignment without retraining.
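Adaptive prompt routing can be illustrated with a minimal sketch, assuming a keyword router over hypothetical domains (a deployed system would classify context with a model rather than keywords):

```python
# Illustrative sketch, not the paper's system: route each request to a
# context-appropriate system prompt at inference time, so alignment
# behavior adapts without retraining the underlying model.

PROMPTS = {
    "medical": "Answer cautiously and recommend consulting a professional.",
    "general": "Answer helpfully and concisely.",
}

def route(query):
    """Toy keyword-based router selecting a governing system prompt."""
    medical = any(w in query.lower() for w in ("symptom", "dose", "diagnos"))
    return PROMPTS["medical" if medical else "general"]

print(route("What dose of ibuprofen is safe?"))
```

Because governance lives entirely in the routing layer, policies can be updated per region or application by editing the prompt table, not the weights.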