public · MIT · applied ai

Sentinel

AI code review with hybrid retrieval and a deterministic eval harness.

Eval fixtures: 98 (hand-curated PR ground truth)
Categories scored: 4 (security · bug · perf · style)
CI gate: −5% F1 (any category regression fails the build)

The problem

The market is full of "AI code review" wrappers around a prompt. The hard part is not generating text; it is knowing whether the system catches real issues without fooling yourself. Sentinel separates the production pipeline, a deterministic scorer, and a curated eval set so quality is measurable, not vibed.

Architecture

GitHub webhook (HMAC verified, X-GitHub-Delivery idempotent)
FastAPI service
hybrid retrieval over PR history: BM25 for exact identifiers + pgvector dense embeddings, fused via RRF
structured Pydantic v2 review with cost guardrails (daily budget + per-PR cap + circuit breaker)
deterministic scorer over 98 fixtures yielding per-category P/R/F1
CI gate that fails the build if any category's F1 regresses by more than 5%
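
The last two stages of the pipeline above can be sketched as follows. This is a minimal illustration, not Sentinel's actual scorer: the `Finding` shape, the exact-match rule on `(fixture, category, file, line)`, and reading the 5% threshold as an absolute drop in F1 are all assumptions made for the example.

```python
from dataclasses import dataclass

CATEGORIES = ("security", "bug", "perf", "style")
MAX_F1_DROP = 0.05  # assumption: the 5% gate is an absolute drop in F1


@dataclass(frozen=True)
class Finding:
    """Hypothetical label shape: one issue at one location in one fixture."""
    fixture_id: str
    category: str
    file: str
    line: int


def score(expected: set[Finding], predicted: set[Finding]) -> dict[str, dict[str, float]]:
    """Per-category precision, recall and F1 using exact set matching.

    No model is involved, so the same inputs always give the same report.
    """
    report: dict[str, dict[str, float]] = {}
    for cat in CATEGORIES:
        exp = {f for f in expected if f.category == cat}
        pred = {f for f in predicted if f.category == cat}
        tp = len(exp & pred)  # findings that match a ground-truth label exactly
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(exp) if exp else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[cat] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def regressions(
    baseline: dict[str, dict[str, float]],
    current: dict[str, dict[str, float]],
    max_drop: float = MAX_F1_DROP,
) -> list[str]:
    """Return the categories whose F1 fell by more than max_drop."""
    return [
        cat
        for cat in CATEGORIES
        if baseline[cat]["f1"] - current[cat]["f1"] > max_drop
    ]
```

In CI, a wrapper script could call `regressions(...)` against a stored baseline report and exit non-zero when the returned list is not empty, which is enough to fail the build.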

Key decisions

| Decision | Choice | Why |
| --- | --- | --- |
| Retrieval | Hybrid BM25 + dense (pgvector) with RRF | BM25 catches exact identifiers and rare tokens; dense catches semantic similarity. Fusion outperforms either alone on real PRs. |
| Evaluation | Hand-readable fixtures + deterministic scorer (not LLM-as-judge) | LLM-as-judge has circular validity. 98 realistic fixtures with explicit labels give a stable, reproducible score per category. |
| Cost control | Daily budgets + per-PR caps + circuit breaker | Production AI needs financial guardrails: one misconfigured repo should not drain the budget. |
| Structured output | Pydantic v2 with JSON mode | Type-safe review comments enable automated scoring and consistent GitHub annotations. |
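
To make the retrieval decision concrete, here is a minimal sketch of reciprocal rank fusion (RRF) over two ranked lists of PR identifiers. The function name, the identifiers, and the constant `k = 60` (a commonly used default for RRF) are assumptions for illustration; the source does not say how Sentinel configures its fusion.

```python
from collections import defaultdict

RRF_K = 60  # assumption: a commonly used RRF smoothing constant


def rrf_fuse(rankings: list[list[str]], k: int = RRF_K) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Only ranks are used, so BM25 scores and vector
    distances never need to be put on a common scale.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from the two retrievers, best match first.
bm25_hits = ["pr-12", "pr-7", "pr-3"]
dense_hits = ["pr-7", "pr-9", "pr-12"]
fused = rrf_fuse([bm25_hits, dense_hits])
```

Documents that both retrievers rank well (`pr-7` and `pr-12` here) rise above documents found by only one of them, which is the behavior the table's "Why" column describes.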
Python · FastAPI · Next.js · PostgreSQL · pgvector · BM25 · Docker