EnterpriseBench CoreCraft: Training Generalizable Agents on High-Fidelity RL Environments
CoreCraft: enterprise RL environment for training generalizable AI agents in customer support simulation with 2,500+ entities and 23 tools.
Benchmark evaluating physical safety risks of LLMs controlling robotic systems, categorizing drone-related threats and harms.
Knowledge distillation pipeline to compress the DUSt3R foundation model for faster 3D reconstruction and visual localization.
Operator-learning surrogate model combining PointNet and DeepONet for nonlinear field prediction on complex geometries.
Meta-RL approach using skill decomposition with improved robustness to noisy offline demonstrations for long-horizon tasks.
Rex: reversible exponential Runge-Kutta solvers for neural differential equations in generative models.
Self-organizing maps combined with vision transformers to improve ViT performance on small datasets.
Certified backdoor defense for DNNs using sample-specific smoothing noise against training data poisoning attacks.
ReplaceMe: training-free depth pruning method that replaces transformer blocks with linear operations for model compression.
U-Net CNN with attention mechanisms for COVID-19 lung segmentation in CT scans.
Survey demystifying graph neural network concepts: oversmoothing, oversquashing, heterophily, and long-range dependencies.
Research on KL-regularized policy gradient algorithms for LLM reasoning, comparing forward/reverse KL regularization designs.
Systematic literature review of explanation user interfaces for black-box AI systems and XAI techniques.
FinTagging benchmark for evaluating LLMs on financial information extraction and hierarchical GAAP concept classification.
Automated web app testing system using LLMs and screen transition graphs for test case generation.
Study on persona-driven prompting of LLMs to simulate voting behavior in European Parliament, analyzing progressive bias mitigation.
Strict Subgoal Execution method improves long-horizon planning in hierarchical reinforcement learning through reliable subgoal feasibility.
MCIF benchmark for evaluating multimodal LLMs on crosslingual instruction-following with long-form inputs from scientific talks.
CareerPooler: generative AI system for career exploration that uses a pool-table metaphor simulation instead of a linear chat interface.
Discrete optimal transport voice conversion method shown as effective black-box adversarial attack on audio anti-spoofing systems.
Deep learning approach for predicting electronic-structure Hamiltonians of materials with improved generalization.
CoSpaDi: training-free compression method for LLMs using sparse dictionary learning instead of rigid low-rank approximations.
First watermarking scheme designed for diffusion language models that generate tokens in arbitrary order rather than sequentially.
Inference-time search algorithm guides diffusion model sampling using side information for improved image reconstruction.
Prompt optimization framework extended to multimodal LLMs, optimizing visual and textual prompts jointly for improved performance.
pi-Flow modifies flow-based generative models to predict network-free policies for efficient few-step image generation.
VERA-MH automated evaluation framework for assessing safety of AI chatbots in mental health contexts using LLM-based agents.
LRT-Diffusion applies risk-aware sampling to diffusion policies for offline reinforcement learning with statistical hypothesis testing.
VeriStruct framework uses LLMs to automate verification of data structure modules in the Verus verification language.
Semi-Supervised Preference Optimization reduces labeled feedback requirements for aligning language models with human preferences.
PREPO framework improves data efficiency of reinforcement learning for LLMs by leveraging intrinsic data properties during training.
Evaluation of fine-tuned BERT vs LLM prompting for text classification on South Slavic languages, a less-resourced language group.
Deep learning method for segmenting retinal blood vessels using temporal information from Doppler holography imaging.
Wireless foundation models extended to process multiple modalities for improved task performance across varying operating conditions.
Empathetic Cascading Networks: multi-stage prompting framework that reduces social biases in LLMs through perspective adoption and emotional resonance stages.
Transformer-based joint source-channel coding for non-uniformly distributed HARQ-ACK bits in wireless communications using deep learning.
Local explanation method using MARS and N-ball sampling for generating high-fidelity explanations of black-box model predictions.
Framework for semantic segmentation using hierarchy-aware methods to detect stratified tooth layers in dental imaging.
Reveals lexical and positional biases in post-hoc feature attribution methods like Integrated Gradients, affecting explanation quality for language models.
Block-Recurrent Hypothesis characterizes Vision Transformer depth as block-recurrent structure, providing mechanistic understanding of ViT computations.
Framework integrating Theory of Mind into robots for inferring human mental states to enhance explainability and predictability in human-robot interaction.
CODA extends slot attention with register tokens and contrastive alignment to improve object-centric learning using pretrained diffusion models.
Symphonym: neural embedding system that maps names across scripts into a unified phonetic space for cross-script and cross-language name matching.
Mixed-methods audit examining alignment between student preferences and AI system capabilities for collaborative academic tasks in CS education.
Temporal graph pattern machine for learning transferable representations from dynamic networks by modeling evolving patterns without restrictive assumptions.
Theoretical analysis proving fixed-budget and fixed-confidence best-arm identification settings have equivalent sample complexity up to logarithmic factors.
Di3PO improves preference tuning of text-to-image diffusion models using diptych diffusion and DPO for efficient training pair generation.
Analysis of DARPA's AIxCC competition for autonomous cyber reasoning systems leveraging LLMs to discover vulnerabilities in open-source software.
Framework for jointly optimizing data mixture and model architecture configurations during LLM training through co-optimization rather than sequential approaches.
Diffusion-guided pretraining approach for brain graph foundation models using improved augmentation methods for connectome data.