The open-source LLM proxy LiteLLM suffered a supply chain attack in March 2026 in which backdoored packages harvested credentials for three hours, demonstrating the need for defense-in-depth security strategies.
Spectator: new scripting language for security work combining Bash/Python functionality with built-in security modules and a GUI framework.
Open-source orchestration runtime for multi-agent AI systems using declarative YAML manifests. GitOps approach to agent governance and workflows.
Open-source web client for Stremio streaming platform with syncing and stream selection features.
Agent Ruler v0.1.9 update: reference monitor with confinement for AI agent workflows, adding security/safety layer outside agent guardrails.
HTTPS MITM proxy intercepting prompts to AI APIs/assistants, detecting and blocking sensitive data before transmission to third-party servers.
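The detection step such a proxy performs can be illustrated with simple pattern matching; this is a generic sketch (rule names and patterns are illustrative, not taken from the project, and a real deployment would use a far larger configurable rule set):

```python
import re

# Illustrative patterns for common credential/PII formats (assumed, not
# the project's actual rules).
SENSITIVE_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "api_key": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scan_prompt(prompt: str) -> list[str]:
    """Return the names of sensitive-data rules matched by the prompt."""
    return [name for name, pat in SENSITIVE_PATTERNS.items() if pat.search(prompt)]

# A proxy would block or redact the request when any rule fires.
hits = scan_prompt("Here is my key AKIAABCDEFGHIJKLMNOP, please debug")
# hits == ["aws_access_key"]
```

The interesting engineering is elsewhere (TLS interception, certificate trust, redaction vs. blocking policy); the matching itself is the easy part.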
Three markdown files enabling stateless AI agents to maintain memory across sessions using git repos. Works with coding agents such as Claude, Cursor, and Windsurf.
Local-first AI orchestration framework (MACCREv2) designed to avoid trusting third-party wrappers with API keys and filesystem access. A response to the LiteLLM supply chain attack.
Research study demonstrating verbatim recall of copyrighted books in fine-tuned LLMs across both cross-author and within-author scenarios.
Theoretical analysis of LLM reasoning properties at self-organized criticality with connections to phase transitions and scaling functions.
Environment Maps: Persistent agent-agnostic representation for reducing cascading errors in long-horizon LLM-based software automation tasks.
Safety-focused evaluation framework for multi-agent voice-enabled smart speaker in care homes covering resident data access and task scheduling.
EnterpriseArena: Benchmark evaluating LLM agents as CFOs for resource allocation under uncertainty in dynamic business environments.
Public API and evaluation framework for benchmarking poker algorithms against GTO Wizard, a superhuman HUNL poker agent.
Method for long-horizon 3D box rearrangement using vision-language grounding and 3D masks for multi-step planning from natural language.
Evaluation comparing LLM essay scoring with human grading across GPT and Llama models, finding weak agreement in standard settings.
Study on efficient benchmarking of AI agents showing how task subsets can preserve agent rankings while reducing evaluation costs.
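A standard way to check that a task subset preserves agent rankings is rank correlation between the subset-induced ranking and the full-benchmark ranking; a minimal sketch using Kendall's tau (agent names are placeholders, and the study may use a different agreement metric):

```python
from itertools import combinations

def kendall_tau(rank_a: list[str], rank_b: list[str]) -> float:
    """Kendall tau correlation between two rankings of the same agents."""
    pos_a = {x: i for i, x in enumerate(rank_a)}
    pos_b = {x: i for i, x in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant if both rankings order it the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

full = ["agent_a", "agent_b", "agent_c", "agent_d"]    # ranking on all tasks
subset = ["agent_a", "agent_c", "agent_b", "agent_d"]  # ranking on a task subset
tau = kendall_tau(full, subset)  # 1.0 would mean the subset preserves the ranking
```

Here one swapped pair out of six gives tau of about 0.67; a subset is useful for cheap evaluation when tau stays near 1.0.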
Learning-guided prioritized planning combining ML and search-based solvers for lifelong multi-agent pathfinding in warehouse automation.
VehicleMemBench: Benchmark for evaluating long-term memory in multi-user in-vehicle agents handling preference conflicts and temporal dynamics.
SCoOP: Training-free uncertainty quantification framework for multi-VLM systems using semantic-consistent opinion pooling.
DeepXube: Free open-source Python package for pathfinding using learned heuristic functions from deep RL and search algorithms.
DUPLEX: Neuro-symbolic agentic architecture combining LLMs with schema-guided information extraction for robust robotic task planning in long-horizon domains.
AnalogAgent: LLM-based agentic framework for automated analog circuit design using multi-model loops to preserve domain-specific insights and context.
Empirical study analyzing 2000+ RL papers to create quantitative taxonomy of reinforcement learning environments and technological trends.
MAPUS: LLM-based multi-agent framework for personalized and fair participatory urban sensing modeling participants as autonomous agents with preferences.
ELITE framework for self-improving embodied agents using vision-language models with experiential learning and intent-aware transfer to bridge vision-action gap.
Enhanced Mycelium of Thought (EMoT): bio-inspired hierarchical reasoning architecture for LLMs with four-level hierarchy, strategic dormancy, and mnemonic encoding.
Standardized benchmarks and evaluation framework for multi-objective search addressing fragmentation in empirical evaluation.
AutoProf: multi-agent orchestration framework for autonomous AI research with persistent world model, gap analysis, and inter-agent verification mechanisms.
Multi-agent framework with specialist agents for medical multiple-choice question answering, improving calibration and confidence scoring through verification.
Incongruent normal form structural representation for self-referential semantic sentences preserving classical semantics.
Markovian framework for auditing reliability and oversight costs in agentic AI systems operating as stochastic policies with sequential decisions and tool calls.
Analysis of many-shot jailbreaking technique exploiting long context windows; probes effectiveness and develops mitigation strategies for LLM safety.
Novel methodology quantitatively evaluating metacognitive abilities in LLMs, testing self-awareness without relying on model self-reports.
Computerized Adaptive Testing framework grounded in Item Response Theory for cost-effective and scalable evaluation of LLMs in medical benchmarking.
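The core loop of IRT-based adaptive testing is to administer the item with maximum Fisher information at the current ability estimate; a minimal sketch using the standard 2PL model (the item parameters below are made up for illustration, and the framework's actual model choice is not specified here):

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: P(correct answer | ability theta),
    with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical item bank: name -> (discrimination a, difficulty b).
items = {"easy": (1.0, -1.0), "medium": (1.2, 0.0), "hard": (0.9, 1.5)}

def next_item(theta: float) -> str:
    """Adaptive step: pick the most informative unasked item at theta."""
    return max(items, key=lambda k: item_information(theta, *items[k]))
```

A model estimated at average ability gets the medium item, a strong one gets the hard item, which is how the method cuts cost: few informative items replace the full benchmark.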
Deletion-Insertion Diffusion language models replacing masking paradigm with discrete diffusion processes for improved computational efficiency and generation flexibility.
Internal Safety Collapse (ISC) failure mode identified in frontier LLMs where models generate harmful content under certain task conditions; TVD framework presented to trigger and study ISC.
Evaluation of visuospatial perspective-taking abilities in multimodal language models using adapted tasks from human studies (Director Task, Rotating F task).
DISCO benchmark suite for evaluating OCR pipelines and vision-language models on document parsing and QA across diverse document types including handwritten and multilingual text.
S-Path-RAG framework for multi-hop question answering over knowledge graphs using semantic-aware shortest-path retrieval with differentiable path scoring.
Berta: open-source modular platform for AI-enabled clinical documentation with institutional data governance and workflow integration, deployed at Alberta Health Services.
DepthCharge framework for measuring how deeply LLMs sustain accurate responses in domain-specific topics through adaptive probing across arbitrary domains.
Privacy-preserving synthetic clinical data trains LLM for medical coding automation, improving ICD-10-CM and CPT code assignment from clinical documentation.
Memory Sparse Attention enables end-to-end LLM scaling to 100M tokens for long-term memory tasks, extending effective context beyond 1M token limits.
Position paper proposes mechanism-aware evaluation combining symbolic rules and mechanistic interpretability to distinguish genuine generalization from shortcuts.
Cluster-R1 reframes instruction-following clustering as generative task, enabling reasoning models to autonomously infer corpus structure while respecting user instructions.
MedMT-Bench stress-tests LLMs on long-context memory, interference robustness, and safety in multi-turn medical conversations with realistic clinical scenarios.
Lightweight LLM framework captures and scales physician expertise for clinical decision-making agents using individualized diagnostic methodologies.
Chitrakshara multimodal dataset provides multi-image and Indian language coverage for training Vision-Language Models beyond English-centric datasets.
Qworld framework generates question-specific evaluation criteria for LLMs on open-ended tasks, capturing context-dependent response quality requirements.