The Rise of AI Pentesting Agents: A Technical Analysis (2026)
Technical analysis of the evolution of AI pentesting agents, from PentestGPT to autonomous agents like PentAGI and XBOW.
Essay on bug bounty trends in 2026. Discusses AI agent effectiveness for vulnerability discovery and program management challenges.
Apache 2.0 open standard for governing AI agent payment requests. Policy engine with 12 configurable checks for payment authorization.
Open-source tax software built and maintained by autonomous AI agents. Uses IRS publications as source, applies self-improving agent loops.
Tool for multi-LLM code review consensus. Aggregates feedback from multiple models to identify blind spots and improve code quality assessment.
Essay on LLM-based knowledge management limitations. Discusses problems with AI-generated note synthesis and cognitive organization.
Agent skill implementation for token compression. Reduces output tokens by ~47% while maintaining readability.
Security report on 1.4M AI-driven API test executions. Maps vulnerabilities to OWASP Top 10 using agentic testing.
Cloudflare expands access to OpenAI's frontier models via Agent Cloud platform, enabling enterprises to deploy AI agents for customer support, system updates, and report generation.
Benchmark evaluating humor alignment across frontier LLMs using Cards Against Humanity gameplay, analyzing model performance vs human baseline on comedic response selection.
InstrAction pretraining framework for video foundation models to improve action recognition in instructional videos by addressing static bias in temporal understanding.
Deep learning method for cardiac MRI imaging using phase-sensitive inversion recovery to reduce acquisition time and motion artifacts in late gadolinium enhancement scans.
eBandit uses eBPF and multi-armed bandit reinforcement learning in the Linux kernel for adaptive video bitrate selection with improved network signal visibility.
Evaluates cultural alignment of LLMs across 14 language-culture pairs using multilingual story moral generation task and dataset.
Investigates opportunities for resource-constrained AI research using obsolete yet capable discarded models from AI production cycles.
Workshop report on designing reinforcement learning environments for autonomous cyber defense applications.
SenBen large-scale scene graph benchmark for explainable content moderation with visual grounding and sensitivity annotations.
HiFloat4 low-precision floating-point format for efficient 4-bit LLM pre-training on Ascend NPU hardware.
Dictionary-aligned concept control method for safeguarding multimodal LLMs against malicious queries at inference time.
Constraint-satisfaction-based retrieval system for matching patient profiles to clinical trials with high recall and precision.
Empirical study on how humans allocate responsibility in AI-human hybrid workflows using AI-assisted lending experiments.
AudioGuard framework for comprehensive audio safety protection including voice impersonation, speaker attributes, and compositional harms.
MedFormer-UR transformer with uncertainty quantification for safe medical image classification in clinical settings.
Survival-oriented benchmark for temporal student dropout risk modeling using Open University Learning Analytics Dataset.
Temporal survival modeling framework for predicting student dropout using LMS engagement data and administrative records.
Re-examines capacity gap in chain-of-thought distillation, finding student models often outperform teacher distillation baselines.
HTNav framework for aerial vision-and-language navigation combining visual perception with language instructions in urban environments.
HM-Bench benchmark evaluates multimodal LLMs on hyperspectral remote sensing image understanding tasks.
Analysis of causal inference methods applied to graph representation learning and their limitations with graph-structured data.
ADRUwAMS deep learning model with attention mechanisms for automated brain tumor glioma segmentation in medical imaging.
Ge2mS-T improves energy efficiency in Spiking Vision Transformers through multi-dimensional grouping and optimized training methods.
UDG dataset with 300K samples for training defect/anomaly generation models with improved generalization across defect categories.
RAG systems should optimize for utility (task completion) rather than topical relevance when retrieving documents for LLMs.
MuTSE: Human-in-the-loop evaluator tool for systematically comparing LLM text simplification outputs across different prompting strategies and architectures.
WOMBET: Framework for reinforcement learning that generates and transfers experience data between source and target robotic tasks for sample efficiency.
Aligned Agents, Biased Swarm: Empirical study measuring how multi-agent system topologies and feedback loops amplify bias in emergent behaviors.
Litmus ReAgent: Benchmark and agentic system for evaluating multilingual LLM performance prediction across 1,500 questions spanning six tasks and five evidence scenarios.
Neighbourhood Transformer: Graph neural network architecture using switchable attention to handle heterophilic graph learning where dissimilar nodes are frequently connected.
PerMix-RLVR: Training method for aligning LLM personas with reward models while preserving output diversity, avoiding inference-time computation overhead.
PinpointQA dataset and benchmark for evaluating small object localization and spatial reasoning in video MLLMs.
ASTRA: adaptive semantic tree reasoning architecture for LLM-based complex table question answering.
Survey and construction of linguistically-informed representations for English as a second/foreign language.
Named entity identification and anonymization system for cybercrime datasets using speech-to-text and image processing.
Regime-conditional retrieval with transferable router for two-hop question answering with theoretical foundations.
Noise-aware in-context learning approach to mitigate hallucinations in auditory large language models.
ImageProtector uses visual prompt injection to prevent multimodal LLMs from analyzing protected images.
Vision-language models for image geolocation with structured geographic reasoning and autonomous self-evolution.
CONDESION-BENCH evaluates LLM decision-making with compositional action spaces and conditional feasibility constraints.
U-Cast: simple probabilistic weather forecasting using standard U-Net architecture achieving frontier performance.
Watt Counts: open-access energy consumption benchmark for LLM inference across 50 models and 10 GPU architectures.