RobotArena ∞: Scalable Robot Benchmarking via Real-to-Sim Translation
RobotArena ∞: scalable robot benchmarking via real-to-sim translation. Enables rigorous evaluation of robot policies across diverse tasks and environments.
Verifying LLM inference to detect model weight exfiltration via steganography. Defends inference servers against model theft and anomalous behavior.
Portfolio optimization under stochastic dominance constraints with S-shaped utilities. Investigates first and second-order dominance constraints.
AnatomiX: anatomy-aware multimodal LLM for chest X-ray interpretation. Improves spatial reasoning and anatomical understanding in medical imaging.
Study of bioelectrical properties for malignancy detection. Systematic review of 535 datasets on cellular bioelectric parameters across frequencies.
FARM framework for malware family classification under concept drift. Uses triplet autoencoder for few-shot adaptation to covariate and label drift.
LatentChem: latent reasoning interface for chemical LLMs. Decouples chemical computation from discrete tokens to improve efficiency and performance in chemical reasoning.
FastLSQ framework for solving PDEs using Fourier features with analytical derivatives. Achieves high accuracy on 1-6D problems without autodiff.
Pyramid MoA: probabilistic framework for cost-optimized LLM inference via cascading and routing. Balances inference cost and reasoning capability for large language models.
Enhancements to projection pursuit tree classifier with visual diagnostic methods for high-dimensional classification. Addresses limitations in multi-class settings.
IROSA: framework combining foundation models with imitation learning for robot skill adaptation via natural language. LLM application to robotics.
Disentangled Safety Hypothesis: mechanistic study of LLM safety showing decoupling between harmfulness detection and refusal. ML interpretability research.
Benchmark evaluating frontier AI models on multi-step cyber attack scenarios. Agent capability measurement across extended action sequences.
Agentic framework for multimodal query processing with adaptive tool orchestration across text/image/audio/video. Research on agent coordination and tool selection.
Proof-Carrying Materials: falsifiable safety certificates for machine-learned interatomic potentials. ML research on reliability guarantees for scientific models.
Codex Security: AI agent for code security that analyzes repository architecture and trust boundaries before validating findings with humans.
AI-as-Code approach for agent factories.
Open-source AgentFactory orchestrates fleet of coding agents (Claude, Codex, Spring AI) through automated pipeline for issue resolution and code shipping.
Open-source framework for personal AI agents running entirely on-device with efficiency-aware evaluations and learning loop using local trace data.
NPM package enabling free OpenAI API access via ChatGPT OAuth tokens. Creates localhost proxy to ChatGPT backend API with Vercel AI SDK provider support.
AI automation tool to summarize Datadog monitoring alerts and escalate issues, reducing manual dashboard review.
Multi-threaded Redis replacement in Rust (5.6x faster, 1MB Docker image) with drop-in compatibility and concurrent architecture.
LessWrong editor UI update with Lexical framework and WYSIWYG improvements.
Video about sewage facility becoming bird sanctuary. Off-topic.
Discussion of mental fatigue and workflow challenges when working with LLMs like Claude and Codex, and recovery strategies.
Multi-agent workflow orchestration system supporting Gemini, Qwen, Claude with role-based agents, background execution, and visual workflow editing.
Report on Iranian drone strikes against AWS data centers in UAE used for AI infrastructure.
GitHub Action detecting LLM output drift in CI/CD by replaying workflows and diffing outputs to prevent silent model changes reaching production.
CLI tool for managing ETL transformation pipelines with artifact versioning and SQLite provenance tracking.
Anecdotal story about data scientist using AI and ChatGPT to develop cancer vaccine for dog.
Dashboard for real-time observability into Claude Code sessions, tracking costs, tool usage, and subagent execution without code changes.
Hypergraph data structure implementation in Zig language with research community modeling example.
ByteDance delays Seedance 2.0 video generation model launch due to copyright disputes with Hollywood studios.
Security middleware for autonomous AI agents that risk-scores actions, detects injection attacks, and catches behavioral drift across multi-turn interactions.
Open-source SDK for building autonomous AI agents that execute cross-chain financial operations with cryptographic guarantees and trusted execution environments.
Multi-agent coordination system using Claude Code, Discord webhooks, and timer-based polling. Production autonomous workflows with real-time notifications.
Timezone converter tool for Claude API usage promotion (Mar 2026). Minor LLM-adjacent utility.
Academic citation metric (CiteIQ) weighted by author position and research integrity. Not AI/ML focused.
Overview of layered security architecture for AI agents, emphasizing secure human identity verification and token-based authorization.
Quell is a local security layer that intercepts prompts to AI IDEs, redacting secrets before they reach cloud models, storing values in OS keychain.
ARISE framework enables LLM agents to synthesize their own tools at runtime when they encounter task gaps, adapting without pre-crafted tool libraries.
clifast tool converts TypeScript/JavaScript functions into CLI packages with optimized help text for LLM navigation, reducing token usage versus MCP.
LearnFork tool for branching AI chat conversations in learning contexts. Source provides minimal details.
LiveAuth system providing Proof-of-Work and Lightning Network authentication for AI agents, replacing CAPTCHAs and API keys.
Critical perspective on AI agent hype, questioning whether agents are necessary or overused in current implementations.
Opsmeter tool for cost attribution and budget control in LLM applications, breaking down spending by endpoint, tenant, user, and model.
Caliber scans codebases to auto-generate tailored AI agent skills, configs, and recommended MCPs matching project stack and best practices.
Free tool for analyzing and comparing AI product costs across 9 LLM providers before implementation to identify optimal architecture.
Blog post on using 'cupcake' prompt technique to detect AI hallucinations.
Analysis of LLM inconsistency when prompted repeatedly on same question, showing tendency to contradict prior responses.