CausalARC: Abstract Reasoning with Causal World Models
CausalARC testbed for evaluating AI reasoning on abstract tasks with limited data and distribution shift using causal world models.
Bayesian evaluation framework replacing Pass@k metric for more stable and reliable LLM reasoning performance assessment.
Theoretical framework for measuring conflicts in random permutation sets using order-dependent uncertainty fusion.
SynBullying dataset uses multiple LLMs to generate synthetic conversational data for cyberbullying detection research.
AgroCoT benchmark evaluates reasoning capabilities of vision-language models for agricultural applications like crop monitoring and pest detection.
Memory Bear system applies cognitive science principles to address LLM memory limitations, hallucinations, and context window constraints.
VR-based discrete-event simulator for school security evaluation using behavioral data.
Certification protocol ensuring consistent semantic understanding between agents using stimulus-meaning model and empirical testing.
Data-centric framework learning optimal verbalization for converting user interaction logs into natural language for LLM-based recommendation systems.
Variable isolation study examining prompt architecture layers enabling LLMs to solve reasoning benchmarks like the car wash problem.
CIRCLE lifecycle framework bridging gap between AI model metrics and real-world deployment outcomes through six-stage evaluation.
AI4S-SDS system combining LLM agents with sparse MCTS and differentiable physics for automated chemical solvent design.
CliqueFlowmer approach for computational materials discovery using neural networks for offline optimization of material properties.
MEMO framework reducing variance in multi-turn multi-agent LLM game evaluations through memory augmentation and context optimization.
MedMASLab unified framework and benchmark for multimodal medical multi-agent systems with standardized integration and cross-specialty evaluation.
SoLA framework for reversible lifelong model editing in LLMs using semantic routing with LoRA modules to prevent knowledge forgetting.
Method to reduce overthinking and underthinking in Large Reasoning Models through balanced token allocation for efficient inference.
VTC-Bench evaluating multimodal LLM agents on complex visual tool composition, addressing limitations in existing tool-use benchmarks.
Hybrid scalar-verbal RL approach for emotional support dialogue systems using user reactions as learning signals instead of expert-defined rewards.
AsgardBench benchmark for evaluating visually-grounded interactive planning and plan adaptation based on visual observations.
Formal proof that safety is non-compositional when combining agents with conjunctive capability dependencies.
ARISE hierarchical RL framework for mathematical reasoning in LLMs that learns reusable strategies across problem instances.
Machine learning approach for predicting and discovering error patterns in vehicle diagnostic trouble codes using temporal sequence analysis.
Study of nonstandard errors in AI coding agents: 150 Claude agents deployed on market analysis tasks show agent-to-agent variation in analytical choices.
IET framework for attributing multi-agent system outputs to specific agents without execution logs, enabling accountability in agent interactions.
SQLBench benchmark for evaluating Text-to-SQL capabilities of LLMs across sub-tasks, addressing gaps in prompt templates and performance assessment.
Graph learning model for drug-drug interaction prediction addressing generalization and robustness in extreme cases.
Deep learning framework mitigating perception latency in vision-based lane-keeping for autonomous vehicles using imitation learning.
Experimental study measuring how partisan biases in LLMs influence human political opinions and decision-making.
Structured transformer approach for offline model-based optimization combining reinforcement learning and generative modeling for design problems.
Framework addressing limitations of contrastive distillation for 3D representation learning by capturing modality-specific features.
Autoregressive transformer approach for component-based colored SVG generation from text descriptions.
Dataset for hierarchical KPI extraction from earnings filings using iXBRL structured financial documents.
LLM-based index advisor for database optimization using in-context learning to iteratively refine index recommendations.
Equilibrium finding algorithms in polymatrix games under differential privacy constraints with hardness results.
Survey of AI-based detection and mitigation methods for DDoS attacks with taxonomy of attack categories.
Ensemble of language models for automated tumor group classification from unstructured pathology reports in cancer registries.
Federated learning system balancing privacy-utility tradeoffs with incentive mechanisms and heterogeneous resource accommodation across organizations.
Tractable description logic with categorical semantics for biomedical ontologies supporting negative knowledge representation.
Weighted gradient-based adversarial attacks on 3D point cloud classifiers with improved imperceptibility through point-wise perturbation adjustment.
Online fair allocation of indivisible goods with sequential arrival, analyzing fairness guarantees with access to future information.
Rank-based uniformity test for detecting undisclosed substitutions or quantization of black-box LLM APIs without access to model weights.
Statistical methods for fairness testing in algorithmic decision-making systems accounting for sampling error and demographic subgroups.
Systematic review combined with Monte Carlo simulation examining student perceptions of GenAI tools and educational outcomes.
Pipeline for converting RGB-D scans into compact 3D virtual replicas with physically-based rendering and interaction support.
arXiv paper: surrogate ML model for predicting heat transfer in impinging jet arrays, using neural networks to accelerate CFD.
arXiv paper: agent-based LLM approach for automating the conversion of free-form clinical notes into HL7 FHIR structured data.
arXiv paper: 2D-guided cross-modal fusion method for LiDAR-camera alignment in 3D detection for autonomous vehicles.
Automated page image classification system for historical document digitization handling diverse content types, layouts, and handwritten/printed text.
Unsupervised deep learning approach for inverse problems in computed tomography combining deep image prior and unrolled optimization.