EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts
EvolveTool-Bench: diagnostic benchmark for evaluating quality of LLM-generated tool libraries as software artifacts in engineering workflows.
Research on out-of-distribution anomaly where deep models assign higher density to simple OOD data than to in-distribution test data.
Domain adaptation framework for brain metastases segmentation across multiple medical institutions with different imaging protocols.
Transfer learning approach for trajectory prediction in autonomous driving handling domain shift across different regional driving patterns.
System enabling humanoid robot navigation in unseen environments using diffusion models trained on 5 hours of human walking video without robot data.
Privacy attack method using gradient-induced feature drift to infer membership in LLM training data without relying on output probabilities.
Analysis of neuron polysemy in neural networks, decomposing superposition metrics to separate lexical overlap from concept compression.
Method to mitigate object hallucination in Vision-Language Models through visual grounding using logit boosting without retraining.
Study examining how personality traits moderate effectiveness of AI-driven conversational coaching in workplace negotiation scenarios.
Technique to reduce LLM code generation latency by executing code incrementally as tokens are generated, eliminating idle waiting periods.
Vision-Language foundation model for chest X-ray interpretation providing explicit reasoning about visual evidence and diagnostic predictions.
Theoretical analysis of generalization bounds for overparameterized neural networks using distance from initialization as an explanatory factor.
Multimodal LLM approach for e-commerce product understanding that captures fine-grained attributes through reasoning-aware representation learning.
Self-supervised learning framework using masked autoencoders for 3D medical imaging, addressing domain shift from natural image pretraining.
Deep learning approach for animal activity recognition from wearable sensors, optimizing sampling rates and addressing class-specific classification accuracy.
Agentic framework combining Vision-Language Models with iterative reasoning for zero-shot 3D visual grounding from natural language descriptions.
Method for optimizing rubrics used in synthetic data generation for LLM fine-tuning, leveraging influence-guided selection in knowledge-intensive domains.
Mamba-based neural network for dental diagnosis from X-rays, unifying tooth detection, caries segmentation, anomaly detection, and developmental staging.
Multi-agent LLM system for housing consultation decisions, combining reasoning, constraint handling, and factuality guarantees beyond simple ranking.
Unified neural architecture framework studying scaling laws across attention-based, TokenMixer, and factorization-machine recommendation systems.
Method using model cascades to optimize LLM inference costs in semantic SQL queries by routing rows through fast/expensive models based on confidence.
Research on autonomous web agents navigating browser-based websites by leveraging internal APIs instead of DOM inspection, addressing architectural mismatches in agent design.
Reinforcement learning technique adding hints to overcome advantage collapse in group relative policy optimization.
Black-box security tool for detecting exploitable third-party vulnerabilities in web applications.
Study of tradeoffs between parametric knowledge in LLM pretraining and non-parametric knowledge from retrieval.
Multi-agent optimization framework addressing non-stationarity through active shared perception of agent policies.
Educational framework for assessing Scratch programming skills using fuzzy clustering aligned with CEFR levels.
Memory-efficient LLM training via truncated SVD factorization of weight matrices on consumer hardware.
Transformer-based framework for predicting immunotherapy response using biomarkers in small medical datasets.
Token pruning framework for vision-language models using attention dual form perspective without retraining.
Security analysis of backdoor attacks on language models using continuous latent reasoning without token output.
LLM pretraining at exascale using Aurora supercomputer with Mula-1B model and Optimus training library.
End-to-end autonomous driving model using 3D geometry instead of language descriptions for planning.
Bayesian inference framework for multi-dimensional emotion understanding accounting for dependencies among emotions.
Language agents with learnable adaptation policies that optimize test-time learning instead of using fixed hand-crafted policies.
Mixture-of-Experts architecture for actor-level stance detection in geopolitical text classification.
PixelPrune: adaptive visual token reduction for vision-language models using predictive coding for document and GUI tasks.
Dataset and analysis of autonomous coding agents' contributions in real-world projects, examining code quality and team dynamics over time.
Training-free canonical correlation analysis method for improving efficiency of pretrained image encoder representations.
DANCEMATCH framework for motion-based dance retrieval using quantized structure-preserving representations.
WARP: method for repairing adversarial vulnerabilities in transformer NLP models with provable inner-layer repair guarantees.
Reinforcement learning with flow-based policies and distributional RL for trajectory optimization in multi-solution problems.
Dignified Peer framework addressing evasive and sycophantic behavior in aligned LLMs through anti-sycophancy and empathy.
MyPhoneBench: evaluation framework measuring privacy-compliant behavior in mobile phone-use agents during task execution.
Multimodal pipeline analyzing state-funded news coverage of Israel-Hamas war on YouTube Shorts.
Egocentric world simulator generating interaction videos with persistent 3D scene state updates for embodied AI.
Query-conditioned keyframe sampling approach for long-form video understanding with multimodal LLMs using evidential reasoning.
OrgAgent: hierarchical multi-agent framework organizing LLM-based agents into governance, execution, and compliance layers for complex reasoning.
Transfer learning algorithms for nonparametric Bayesian network structure learning under limited data.
Method for fast probing of LLM downstream performance during training using metrics correlated with performance beyond training loss.