public · MIT · applied ai

Sentinel

AI code review with hybrid retrieval and a deterministic eval harness.

Eval fixtures: 98 (hand-curated PR ground truth)

Categories scored: 4 (security · bug · perf · style)

CI gate: −5% F1 (any category regression fails the build)

The problem

The market is full of "AI code review" wrappers around a prompt. The hard part is not generating text — it is knowing whether the system catches real issues without fooling yourself. Sentinel separates the production pipeline, a deterministic scorer, and a curated eval set so quality is measurable, not vibed.

Architecture

GitHub webhook (HMAC verified, X-GitHub-Delivery idempotent)→
FastAPI service→
hybrid retrieval over PR history: BM25 for exact identifiers + pgvector dense embeddings, fused via RRF→
structured Pydantic v2 review with cost guardrails (daily budget + per-PR cap + circuit breaker)→
deterministic scorer over 98 fixtures yielding per-category P/R/F1→
CI gate that fails the build if any category's F1 regresses by more than 5%.
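The last two stages of the pipeline can be sketched as follows. This is a minimal illustration, not Sentinel's actual scorer: the fixture shape (sets of `(category, issue_id)` labels), the function names, and the reading of the 5% threshold as an absolute drop in F1 are all assumptions made for the example.

```python
from collections import Counter

CATEGORIES = ("security", "bug", "perf", "style")


def prf1(tp, fp, fn):
    """Return (precision, recall, F1); a zero denominator yields 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def score(fixtures):
    """Score (expected, predicted) pairs of label sets, per category.

    Each label is a hypothetical (category, issue_id) tuple. Matching is
    exact set membership, so the same inputs always give the same score.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for expected, predicted in fixtures:
        for category, _ in expected & predicted:
            tp[category] += 1  # issue labeled and reported
        for category, _ in predicted - expected:
            fp[category] += 1  # reported but not in ground truth
        for category, _ in expected - predicted:
            fn[category] += 1  # in ground truth but missed
    return {c: prf1(tp[c], fp[c], fn[c]) for c in CATEGORIES}


def regressions(baseline_f1, current_f1, max_drop=0.05):
    """List categories whose F1 fell by more than max_drop (assumed absolute)."""
    return [c for c in CATEGORIES if baseline_f1[c] - current_f1[c] > max_drop]
```

A CI job could call `regressions` with the stored baseline and the fresh scores, then exit non-zero when the returned list is non-empty.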

Key decisions

DECISION | CHOICE | WHY

Retrieval | Hybrid BM25 + dense (pgvector) with RRF | BM25 catches exact identifiers and rare tokens; dense catches semantic similarity. Fusion outperforms either alone on real PRs.

Evaluation | Hand-readable fixtures + deterministic scorer (not LLM-as-judge) | LLM-as-judge has circular validity. 98 realistic fixtures with explicit labels give a stable, reproducible score per category.

Cost control | Daily budgets + per-PR caps + circuit breaker | Production AI needs financial guardrails: one misconfigured repo should not drain the budget.

Structured output | Pydantic v2 with JSON mode | Type-safe review comments enable automated scoring and consistent GitHub annotations.
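The retrieval decision above relies on reciprocal rank fusion (RRF) to merge the BM25 and dense rankings. A minimal sketch of RRF follows. The function name and document IDs are hypothetical, and `k=60` is the constant commonly used for RRF, not a value confirmed for Sentinel.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document ID so the output
    is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical PR IDs: one list from BM25, one from the dense index.
bm25_hits = ["pr_12", "pr_7", "pr_3"]
dense_hits = ["pr_7", "pr_9", "pr_12"]
fused = rrf_fuse([bm25_hits, dense_hits])
```

Because RRF uses only ranks, the BM25 and cosine-similarity scores never need to be normalized onto a common scale before fusing.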

Python · FastAPI · Next.js · PostgreSQL · pgvector · BM25 · Docker