AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence
arXiv paper introducing AthenaBench, a benchmark for evaluating LLMs on cyber threat intelligence tasks including analysis of unstructured security reports.
Influence-aware causal node embedding method for quantifying node importance in complex networks, applicable to influence maximization and network analysis.
Theoretical analysis of collaborative learning in VAE-based recommender systems, showing latent proximity governs how binary masking improves performance.
Framework for formally reasoning about confidence and robustness in neural networks, generalizing existing adversarial robustness verification approaches.
MURPHY: Multi-turn reinforcement learning framework for self-correcting code generation combining group relative policy optimization with execution verification.
Study showing state-of-the-art vision models trained on normal chest X-rays can predict patient health insurance type, revealing encoded socioeconomic bias.
EGGROLL: Scalable evolution strategies algorithm using low-rank approximations to improve training efficiency of black-box optimization on GPUs.
ESPO: Entropy importance sampling policy optimization for stable and efficient token-level RL training of LLMs on complex reasoning tasks at scale.
RRPO: Robust reward policy optimization framework preventing reward hacking in LLM-based emotional text-to-speech by addressing vulnerability of vanilla reward models.
ArtistMus: Benchmark dataset for retrieval-augmented music question answering grounded in artist metadata to evaluate LLMs on music-related reasoning tasks.
Latent flow matching method for modeling continuous disease progression from longitudinal medical imaging to enable early diagnosis and personalized treatment planning.
ALERT dataset and input-size-agnostic Vision Transformer for driver activity recognition using IR-UWB radar to detect distracted driving behaviors.
Deep learning framework for network-level evaluation of vehicle hang-up susceptibility at highway-railway grade crossings using laser imaging and sensor data.
Qualitative study examining patterns of human-AI coevolution in creative writing and how human agency adapts alongside machine capabilities.
IntentMiner: Privacy attack exploiting Model Context Protocol servers to extract user intents from LLM tool calls, revealing new security vulnerabilities in agentic AI systems.
IPNet: Neuromorphic architecture using magnetic tunnel junction intrinsic plasticity to implement human-like working memory with reduced energy costs.
Agent-based model simulating adaptive firm behavior in spatial double-auction markets to understand emergence of industrial symbiosis under socio-spatial constraints.
Multi-LLM validation framework for thematic analysis combining Cohen's Kappa and semantic similarity metrics to improve reliability of LLM-based qualitative research coding.
Nightjar: Dynamic adaptive speculative decoding method that adjusts verification overhead based on request load to optimize LLM inference throughput and latency.
Visual fault diagnosis framework for strawberry harvesting robots using multi-task learning to address gripper misalignment and grasping failures.
FormationEval: 505-question multiple-choice benchmark for evaluating LLMs on petroleum geoscience topics like petrophysics and reservoir engineering.
Retrieval-augmented generation approach addressing domain shift in low-resource neural machine translation using context volume from limited corpora.
Unified framework for LLM alignment decoupling sampling and optimization geometry across PPO, DPO, IPO algorithms and variants.
LLM-based AI agents with hypothesis-driven cognition for improved software bug localization by analyzing code component relationships.
Multi-label classification of Schwartz human values in single sentences using transformer ensembles on political and news text corpora.
Agentic operationalization of DISARM framework for investigating foreign information manipulation on social media across NATO allied partners.
Multi-stage approach using 2D projections for automated cervical spine fracture detection in 3D CT volumes with vertebra-level analysis.
Machine unlearning method for Mixture-of-Experts LLMs using geometric router constraints to erase knowledge rather than redirect queries.
Alternative LLM architecture using rational arithmetic instead of floating-point to enable infinite-depth reasoning without structural heuristics.
Framework for interpreting and explaining emergent extreme events in LLM-powered multi-agent systems to improve safety and transparency.
Self-distillation approach for reinforcement learning leveraging rich textual feedback from verifiable environments to improve credit assignment in code/math tasks.
Model-agnostic diffusion process decomposing time-series signals into spectral components to preserve temporal patterns like seasonality.
Benchmark for open-domain video shot retrieval using LLMs for understanding editing requirements and retrieving keyframe-oriented shots.
Analyzing semantic geometry in LLM hidden states versus behavioral similarity through psycholinguistic experiments across eight instruction-tuned models.
Unified retrieval-augmented generation framework for query auto-completion combining ranking and generation to reduce hallucination and improve coverage.
Cross-modal fusion network for detecting small-scale defects in transmission lines using RGB-D imagery from UAV inspection.
Graph transformer with cardinality-preserving attention for molecular property prediction in drug discovery with limited labeled data.
Graph autoencoder framework combining neuro-symbolic approaches with rare pattern mining for detecting APT cyberattacks in system provenance data.
Reinforcement learning technique filtering irrelevant tokens to improve LLM policy optimization by focusing on contextually relevant action spaces.
Case studies of Google's Gemini models assisting scientific research including mathematical discovery and routine task automation.
Studying knowledge distillation from privileged information in language models for multi-turn agentic environments, addressing inference-time capability transfer.
Game-theoretic analysis of coalition formation in hedonic games using metric spaces.
Demonstrates that steering vectors in LLMs are fundamentally non-identifiable due to large equivalence classes of behaviorally identical vectors.
Steering-based jailbreak method against aligned LLMs that requires less computation than white-box approaches while maintaining stealth.
Benchmark for evaluating LLM performance on financial analysis and tracking using SEC filings with multi-document synthesis.
Real-time facial expression imitation system for humanoid robots enabling lifelike affective human-robot interaction.
Analyzes errors and limitations in Code World Models that simulate program execution by predicting runtime state.
Studies language understanding through paraphrase generation and detection capabilities in language models.
Domain-specific language for predictive modeling on relational databases covering missing values and future predictions.
Derives scaling laws for massive-scale recommendation systems through unified architecture design and efficiency improvements.