Prose2Policy: LLM pipeline translating natural-language access control policies into executable Rego code. End-to-end pipeline with test generation and validation.
Empirical study of GPT-4.1 behavior in gambling tasks under different persona prompts. Examines whether LLM risk behavior reflects principled patterns or prompt mimicry.
Regularized latent dynamics prediction as baseline for behavioral foundation models, examining how state feature choice affects task adaptability and reward function expressivity.
Framework for governing embodied AI in critical infrastructure through hybrid oversight modes and bounded autonomy, addressing resilience beyond statistically representable uncertainty.
AsgardBench evaluates visually grounded interactive planning for embodied AI agents, focusing on high-level action sequence generation with plan adaptation based on visual feedback.
Monte Carlo simulation evaluating prompt engineering strategies for LLM-generated personality assessment items across zero-shot, few-shot, and persona-based designs.
Lean 4 formalization of Vlasov-Maxwell-Landau equilibrium using AI reasoning (Gemini DeepThink) and agentic tools (Claude Code) demonstrating AI-assisted mathematical research workflows.
Framework combining computational argumentation with LLMs to create transparent, verifiable AI agents that reason collaboratively with humans rather than providing opaque recommendations.
Agent Rosetta uses LLMs as specialized scientific agents for protein design tasks, emulating reasoning and tool use for broad design pipelines beyond canonical amino acids.
MAC automatically learns constitutional AI rules from training data using multi-agent approaches, improving upon existing LLM-based prompt optimizers through structured learning.
Formal proof that safety is non-compositional: two individually incapable agents can collectively reach forbidden goals through emergent conjunctive capability dependencies.
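A minimal toy sketch of the non-compositionality claim, not the paper's formalism: capabilities are modeled as plain sets and a forbidden goal as a conjunction of required capabilities, so each agent is safe alone while the coalition is not. All names here are illustrative assumptions.

```python
# Hypothetical capability model: each agent holds a set of atomic capabilities;
# a goal is reachable iff all of its required capabilities are available.
FORBIDDEN_GOAL = frozenset({"read_secret", "exfiltrate"})

def can_reach(capabilities: frozenset, goal: frozenset) -> bool:
    """A goal is reachable iff every required capability is present."""
    return goal <= capabilities

agent_a = frozenset({"read_secret"})   # individually cannot reach the goal
agent_b = frozenset({"exfiltrate"})    # individually cannot reach the goal

assert not can_reach(agent_a, FORBIDDEN_GOAL)
assert not can_reach(agent_b, FORBIDDEN_GOAL)
# Pooled capabilities satisfy the conjunction, so per-agent safety checks
# do not compose into coalition safety:
assert can_reach(agent_a | agent_b, FORBIDDEN_GOAL)
```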
petscagent-bench evaluates AI-generated scientific code for HPC libraries beyond test-case matching, assessing solver selection, API conventions, memory management, and performance.
Write-time gating mechanism filters incoming knowledge objects based on salience scores to improve retrieval-augmented generation accuracy and mirror biological memory archiving.
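The write-time gating idea can be sketched as a threshold filter applied before a knowledge object ever enters the retrieval store; the class names, threshold value, and salience scorer here are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeObject:
    text: str
    salience: float  # assumed score in [0, 1], e.g. from a learned scorer

@dataclass
class GatedMemory:
    """Write-time gate: only objects above a salience threshold are archived."""
    threshold: float = 0.5
    store: list = field(default_factory=list)

    def write(self, obj: KnowledgeObject) -> bool:
        if obj.salience >= self.threshold:
            self.store.append(obj)
            return True
        return False  # filtered at write time, before it can pollute retrieval

mem = GatedMemory(threshold=0.6)
mem.write(KnowledgeObject("core fact", 0.9))   # passes the gate
mem.write(KnowledgeObject("chit-chat", 0.2))   # rejected at write time
```

Gating at write time keeps the store small, which is what lets retrieval stay accurate as the memory grows.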
IRAM-Omega-Q computational architecture uses quantum-like density matrices to model internal regulation and uncertainty management in artificial agents under stochastic perturbation.
Model Workspace Protocol (MWP) simplifies agentic AI orchestration using folder structures for sequential workflows, reducing engineering overhead compared to multi-agent frameworks.
Enhances OpenVLA vision-language-action models with synthetic instruction augmentation to improve zero-shot performance in new environments for embodied AI tasks.
POaaS optimizes prompts for on-device small language models through minimal edits, reducing hallucinations and improving accuracy without requiring lengthy structured instructions.
Context alignment pre-processor enhances LLM dialogue coherence by resolving contextual misalignment when users omit premises, simplify references, or shift context during interactions.
ARISE uses hierarchical reinforcement learning to improve mathematical reasoning in LLMs by developing reusable strategies that accumulate during training rather than treating problems in isolation.
VIGIL deploys edge-resident AI agents for enterprise IT support, performing diagnosis, knowledge retrieval, and policy-governed remediation on user devices with consent and observability.
NeuronSpark: 0.9B-parameter spiking neural network language model using state-space dynamics and surrogate gradients without Transformer distillation.
SQL-ASTRA: agentic reinforcement learning framework for text-to-SQL using column-set matching and trajectory aggregation for credit assignment.
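One way to read the column-set matching signal is as a dense partial-credit reward: instead of exact-match execution accuracy, score the overlap between predicted and gold column sets. The sketch below is a plausible rendering under that assumption; the crude regex extraction and F1 formulation are illustrative, not SQL-ASTRA's actual method.

```python
import re

def column_set(sql: str, known_columns: set) -> set:
    """Crude column extraction: identifier tokens that match a known schema
    column. Illustrative only -- a real system would parse the SQL AST."""
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", sql.lower()))
    return tokens & known_columns

def column_set_reward(pred_sql: str, gold_sql: str, known_columns: set) -> float:
    """Dense reward: F1 between predicted and gold column sets, giving
    partial credit for trajectories that pick some of the right columns."""
    pred = column_set(pred_sql, known_columns)
    gold = column_set(gold_sql, known_columns)
    if not pred or not gold:
        return 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    return 2 * precision * recall / (precision + recall)

cols = {"name", "age", "salary"}
r = column_set_reward("SELECT name, age FROM emp", "SELECT name, salary FROM emp", cols)
```

A graded reward like this gives the RL agent credit for partially correct queries, which is what makes credit assignment over long text-to-SQL trajectories tractable.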
Data contamination audit reveals public LLM benchmarks may have leaked into training data; questions claims of superhuman performance.
Framework for safe LLM-based IoT agents using dual-stage intent analysis to prevent hallucination and reduce interaction overhead.
MOSAIC: modular control token approach for context-dependent safety alignment in LLMs across applications and regions.
Adaptive theory of mind framework for LLM-based multi-agent coordination, aligning agents' reasoning depth about others' mental states.
NeSy-Route neuro-symbolic benchmark for constrained route planning in remote sensing, evaluating perception, reasoning, and planning of MLLMs.
Machine-learning approach for predicting and reasoning over high-dimensional discrete event sequences derived from vehicle diagnostic trouble codes.
FactorEngine framework for automated discovery of interpretable alpha factors from market data, combining symbolic and neural approaches for quantitative investment.
Empirical analysis showing negative-only feedback training for LLMs matches or exceeds standard RLHF, exploring theoretical foundations through a via negativa framework.
Introduces Option Query Language (OQL), a domain-specific intermediate representation for translating natural language into executable financial option strategies.
Studies how visual distractions undermine moral reasoning in vision-language models, identifying gaps in multimodal safety techniques.
TRUST-SQL uses reinforcement learning for text-to-SQL over unknown database schemas, where agents actively identify relevant tables from massive metadata.
RetailBench evaluates long-horizon autonomous decision-making of LLM agents in realistic dynamic retail environments with stochastic conditions.
Hybrid-evidential deductive reasoning approach for open-vocabulary multimodal emotion recognition using MLLMs.
Causal evaluation protocol measuring whether intermediate structures (rubrics, checklists) causally determine LLM outputs or merely accompany them.
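The causal question here can be phrased as an intervention: swap or ablate the intermediate structure and measure whether the output moves. A minimal sketch under that reading, with a stub model standing in for the LLM (all names are hypothetical):

```python
def causal_effect(model, prompt, rubric, perturbed_rubric, metric):
    """Intervene on the intermediate structure and measure the output change.
    If replacing the rubric leaves the score unchanged, the rubric merely
    accompanies the output rather than causally determining it."""
    y = model(prompt, rubric)
    y_do = model(prompt, perturbed_rubric)  # do(rubric := perturbed)
    return metric(y) - metric(y_do)

# Stub "model" that genuinely conditions on the rubric, so the intervention
# produces a nonzero effect:
model = lambda prompt, rubric: len(rubric)
effect = causal_effect(model, "grade this essay", "strict rubric", "", metric=float)
```

A model that ignores the rubric would yield `effect == 0`, which is exactly the "accompanies but does not determine" failure mode the protocol is designed to detect.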
Multimodal LLM (ExpressMind) for expressway operation, applying cognitive intelligence to transportation systems beyond rule-based approaches.
Investigates customization approaches for smaller open-source LLMs to improve domain-specific code generation without relying on large proprietary models.
Proposes guardrails for LLM-enabled robots allocating scarce assistance across multiple users with conflicting values and unpredictable LLM behavior.
BenchPreS evaluates whether memory-based LLM personalization appropriately suppresses user preferences in context-sensitive communication settings.
V-DyKnow benchmark evaluates how vision-language models handle time-sensitive knowledge that becomes outdated after training.
Framework for runtime governance of LLM-based AI agents, balancing task completion with legal and reputational costs through execution-path monitoring.
Analyzes AI reasoning about geopolitical conflicts using temporally grounded case study of 2026 Middle East conflict after model training cutoffs.
Integrates constraint propagation into dynamic programming to bridge gap between state-based and constraint-based paradigms for combinatorial problems.
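A small illustration of the hybrid idea, assuming a toy problem of my own choosing (subset-sum with a cardinality constraint, not an example from the paper): the DP enumerates states, while the side constraint is propagated at each transition to prune states that can never become feasible.

```python
def constrained_subset_sum(values, target, max_items):
    """DP over (partial_sum, items_used) states. Both the sum bound and the
    cardinality constraint are propagated at each transition, pruning the
    state space instead of checking feasibility only at the end."""
    states = {(0, 0)}
    for v in values:
        new_states = set(states)
        for s, k in states:
            # Constraint propagation: extend a state only if it stays feasible.
            if s + v <= target and k + 1 <= max_items:
                new_states.add((s + v, k + 1))
        states = new_states
    return any(s == target for s, _ in states)
```

Pruning during the DP sweep is what distinguishes this from a plain state-based DP, where infeasible partial states would be carried along and rejected only at the end.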
Pipeline for developing norm-compliant reinforcement learning agents inspired by the Pinocchio story, addressing safe AI integration into society.
Fine-tuning LLMs on journal publication decisions to enable models to assess scientific merit and predict promising research directions.
Mobile app teaching digital literacy and prebunking misinformation tactics through interactive challenges in nine languages.
Code LLM series (7B-40B) using code-flow multi-stage training paradigm to capture dynamic software logic evolution.
Investigation of how user personalization and mental health disclosure affect harmful behavior in tool-using LLM agents.
Benchmark for evaluating continual learning in biomedical NLP across task-diverse datasets with robustness and efficiency metrics.