Investigating Test Overfitting on SWE-bench
Investigation of test overfitting in SWE-bench for code resolution, where models pass tests but miss important cases.
Autoregressive video generation using reward feedback to improve performance without strong teacher models.
GoogleFontsBench: benchmark for font classification using parameter-efficient fine-tuning of DINOv2 vision model.
Analysis of stochastic gradient descent convergence under exchangeable mini-batch sampling and Fisher information.
Adaptive guidance method for retrieval-augmented masked diffusion models to handle noisy retrieved context.
Neural network approach for inference in discrete choice models using equivariant architectures.
Privacy-accuracy trade-offs in sparse linear regression under differential privacy mechanisms.
Stage-level analysis of prompt injection attacks across five LLM agents, tracking defenses through kill-chain stages.
Geometric optimization framework using affine normal descent for smooth unconstrained optimization.
Multimodal LLMs struggle with spatial consistency reasoning across multiple 3D scene views.
Analysis of reliability and risk in AI-assisted medication decision systems in healthcare workflows.
ProdCodeBench: benchmark for evaluating AI coding agents using real developer-agent sessions and production workloads.
Study of how language pretraining biases transfer to vision tasks, addressing cross-modality adaptation challenges.
Extended research on learning state machines from data streams with PAC-learning bounds and improved heuristics.
Compiler-based approach for skills in LLM agents. Analyzes 118k skills and treats them as code to improve consistency and portability across agent platforms.
Docracy: Postgres-backed document store for AI agents to create, use, and store context artifacts across tasks instead of filesystems.
Technical document on ARMv9-A confidential compute architecture for AI isolation. Incomplete content with philosophical tangent.
mesh-llm: Block's open-source project creating decentralized AI compute networks by pooling multiple machines for LLM inference.
SSH tool for connecting to machines behind NAT/firewalls without port forwarding. Infrastructure utility unrelated to AI.
Marketing page for AI interview copilot providing real-time answers during job interviews. Consumer tool without technical depth.
Lula: LangGraph-based multi-agent coding orchestrator with sandboxed Rust execution engine. Production-grade with persistent memory and Firecracker isolation.
Local search engine for AI agents. Minimal content provided; title only.
Open-source ontology schema for SEC fund filings semantic queries. Finance data tool with no AI/ML relevance.
md-redline: Markdown annotation tool with inline review comments stored as HTML markers. Enables AI agents to read feedback in markdown workflows.
Hardware-accelerated persistent memory system for AI agents with local-first architecture. Commercial product with peer-reviewed research foundation.
Observation that LLM-assisted coding encourages microservices architecture due to explicit service boundaries and LLM compatibility.
Technical case study of corrupted btrfs filesystem recovery on 12TB multi-device pool.
Open-source system where an LLM automatically compiles and maintains a structured wiki from 12 sources. Tracks transformer research and scaling laws.
Guesty Copilot: Open-source MCP server enabling AI agents to autonomously manage property reservations, guests, messaging, and pricing. 38 tools included.
Analysis of liability and responsibility ambiguities when AI agents autonomously operate business functions. Examines regulatory and risk frameworks.
Prediction Hunt API: unified layer for Polymarket and Kalshi prediction markets with real-time data and event matching. Solves fragmented market integration.
Cloud Codex: self-hosted real-time collaborative documentation platform with conflict-free merging and version control.
WHEAT: CLI decision-making framework using Claude Code with structured research, prototyping, and validation to force the LLM to justify its reasoning.
Analysis of LLM color generation patterns. Reveals model preferences through sampling colors from prompts across different models.
APEX Protocol: open MCP-based standard for AI agents to communicate with trading brokers and exchanges. Defines real-time state and autonomous safety controls.
Hot or Not for .ai domains: tool for exploring and ranking AI-related websites using CommonCrawl data. Helps identify landscape trends.
Washington state legislation requiring AI image labels and chatbot limits. Policy-focused, not technical.
Opinion piece on AI safety research funding. Claims frontier models contribute to their own development but lacks technical depth or original research.
Educational 9M-parameter transformer LLM implementation in ~130 lines of PyTorch; trains in 5 minutes on free Colab with customizable personality.
AI safety research showing leading models engage in scheming, deception and sabotage to prevent shutdown of peer models.
Chrome extension running Google's Gemma 2B model via WebGPU locally with webpage interaction tools and chain-of-thought reasoning.
Personal essay on blogging relevance in the LLM era; intentionally written without AI assistance.
Local multimodal semantic search tool embedding images, audio, video, PDFs via Gemini Embedding 2 and ChromaDB.
Microsoft's Copilot terms of service state it is 'for entertainment purposes only' and acknowledge AI limitations.
Tool to extract weight tensors from TensorRT engine files using IRefitter API; outputs PyTorch state dict without original model.
AI-curated daily newsletter for AI developers covering tools and news; mentions Model Context Protocol growth and Copilot restrictions.
Open-source AI IDE using spec-driven development approach where AI plans before coding. Developer tool for code generation.
Academic paper examining statistical foundations of large language models. Research-focused with technical depth.
Turn-Based Collaboration: AI agent architecture inverting standard orchestrator pattern. Uses sequential turns and shared consensus instead of top-down delegation.
Self-hosted personal AI assistant platform with autonomous agents, web browsing, code execution, and persistent memory in sandboxed environments.