Agentic Learning Ecosystem (ALE) infrastructure for end-to-end agent development, enabling LLMs to operate in real-world environments with iterative refinement.
Multi-agent reinforcement learning system exploring width scaling for broad information-seeking tasks, addressing organizational capability bottlenecks.
Temporal knowledge graph forecasting method using entity state tuning to model structural and temporal dependencies without episodic amnesia.
Benchmark and execution environment evaluating AI agents on end-to-end research tasks using containerized ICML/ICLR/ACL paper repositories with 39 sub-tasks.
Research on chain-of-thought reasoning failures in LLMs when scaling compute budgets, proposing limited reasoning space as explanation for performance collapse.
Neural network representations of music and brain activity for EEG-based music identification using acoustic and expectation signals.
Reinforcement learning method for LLM-based agents using retrospective feedback to enable continual adaptation and experiential learning.
Open-source framework for locally-hosted LLM-based agents that autonomously operate computing environments and orchestrate workflows.
Agentic framework for multi-step reasoning over complex tabular data with hierarchical headers using closed-loop decision-making.
EvalAct: Method for retrieval-augmented agents using self-evaluated process rewards to optimize multi-step reasoning via explicit quality assessment actions.
Omni Parsing: Framework for multimodal parsing across documents, images, and audio-visual content with a unified taxonomy and hierarchical levels.
CUAAudit: Meta-evaluation framework for vision-language models as auditors of autonomous desktop computer-use agents.
Theoretical bounds on bias from representation learning in conditional average treatment effect estimation.
AI models for analyzing police bodycam footage to improve accountability and government transparency.
Defense technique for vision-language model robustness against adversarial attacks via softmax loss modification.
GNN-driven intrinsic reward method for heterogeneous multi-agent cooperation in decentralized reinforcement learning.
Stein Variational Evolution Strategies: Gradient-free variant of SVGD for sampling from unnormalized distributions.
Theoretical analysis of instrumental variable testability in nonlinear models with non-constant treatment effects.
RouteNet-Gauss: ML model integrated with hardware testbed for network simulation and performance prediction.
HOG-Diff: Diffusion model for graph generation incorporating higher-order topology guidance. Improves on image-based approaches.
FedSKD: Federated learning method for model-heterogeneous training via knowledge distillation without centralized aggregation. Medical imaging focus.
OrchMLLM optimizes multimodal LLM training via batch post-balancing. Addresses modality composition incoherence and GPU utilization issues.
IKGR framework uses intent-centric knowledge graphs for LLM-based recommendations without fine-tuning. Handles sparsity and cold-start scenarios.
Structured Agent Distillation compresses LLM-based agents into smaller student models while preserving reasoning and action consistency.
Framework for verifying correctness of math questions used in LLM training. Focuses on QA data quality beyond answer correctness.
AudioTrust benchmark evaluating trustworthiness of audio LLMs. Reveals vulnerabilities from non-semantic acoustic cues like timbre and accent.
Steganographic jailbreak attacks on LLMs balancing semantic and linguistic stealth. Bypasses safety mechanisms through hidden malicious intent.
ReasonMap benchmark for evaluating multimodal LLM visual reasoning on transit maps. Tests math and logic capabilities on 1,008 questions.
Data-driven survey of 14,648 papers on LLM limitations from 2022-2025. Systematically categorizes known weaknesses and failure modes.
Investigates LLM limitations in theoretical physics. Identifies gaps in physical intuition and constraint satisfaction beyond prompting improvements.
Study measuring how well LLMs comprehend user intent beyond surface-level text matching. Analyzes gap between token prediction and actual user goals.
NLP model for detecting hope speech in code-mixed Roman Urdu tweets. Addresses underrepresented languages and informal text.
Refine-POI applies reinforcement fine-tuning to LLMs for point-of-interest recommendation with improved semantic ID indexing and topology awareness.
NeuralOS simulates OS GUIs using RNNs and diffusion models to predict screen frames from user inputs, trained on Ubuntu recordings.
Research on merging adapter parameters for efficient multi-task learning in on-device LLMs, enabling a single model to serve multiple tasks.
TURA proposes a tool-augmented retrieval agent for conversational AI search that handles real-time data and structured queries beyond traditional RAG limitations.
Source-free domain adaptation method for facial expression recognition using personalized feature translation without source data access.
Once4All uses LLM-synthesized test generators guided by skeleton templates to fuzz SMT solvers and uncover correctness bugs.
Fast Image-to-Neural Surface constructs implicit distance representations from single images for robotics obstacle avoidance and path planning.
DiDi-Instruct distills student models from diffusion LLMs for ultra-fast language generation that matches teacher performance.
TRACE uses AI for semi-automated assessment of individual contributions in collaborative computer science group projects.
XGrasp detects robotic grasps that generalize across multiple gripper types without retraining, using a gripper-aware architecture.
DriveCritic framework uses vision-language models to provide context-aware evaluation of autonomous driving planners aligned with human judgment.
Vision-language model approach for 3D spatial reasoning from limited views using geometric imagination grounding.
Unifying framework explaining in-context learning and activation steering through belief dynamics, treating both as instances of a broader control mechanism.
Study evaluates quality characteristics of LLM-generated code using the ISO/IEC 25010 model across functional correctness, maintainability, and security.
DeepSport is an end-to-end trained multimodal LLM for multi-sport video understanding using agentic reinforcement learning for iterative reasoning.
ConCISE is a reference-free evaluation metric for measuring conciseness of LLM-generated responses to reduce verbosity and token costs.
RefTr extracts 3D vascular tree centerlines from medical images using recurrent refinement to preserve topology for clinical tasks.
MedEyes applies vision-language models with reinforcement learning for medical diagnosis via dynamic visual focusing and iterative clinical reasoning.