Productization writeup

How I’d productize an LLM eval layer

A writeup for the FrictionLens eval and observability surface. ~2,000 words. Last updated 2026-07-12.

TL;DR

FrictionLens is an AI-powered app review analyzer that scores user reviews across five sentiment dimensions (love, frustration, loyalty, momentum, word-of-mouth) and synthesizes them into a Vibe Report. The eval layer documented here measures how well the model’s dimension scores agree with hand-labeled ground truth, surfaces real per-call cost and latency from production traffic, and tracks prompt versions so regressions are visible at a glance. The headline metric: Spearman correlation against a 30-review golden set, computed per dimension, persisted to the database, and rendered on a public dashboard at /eval.

Why this matters

AI PM hiring is hot and “I’ve used ChatGPT” is the entry-level baseline. Hands-on intuition about LLM evals — what to measure, when correlation matters more than accuracy, why golden sets need to include sarcasm — is rare among early-career applicants because most production AI work happens behind closed doors. A public eval surface that runs against a real product, with real telemetry, is the cheapest defensible signal you can build. This is not a research project; it’s a portfolio artifact a hiring manager can click through in 90 seconds.

The product today

FrictionLens takes a CSV of app reviews (or auto-pulls them from the App Store / Play Store) and produces a Vibe Report: aggregate dimension scores, a 0–100 Vibe Score, top friction features, churn drivers with verbatim quotes, and prioritized action items. The AI pipeline uses Gemini 2.5 Flash by default via the Vercel AI SDK. Reviews are classified into three tiers to minimize cost: trivial (rating-only), short (rule-based sentiment), and complex (sent to Gemini). The five dimensions and their scoring rubrics are defined in the system prompt at src/lib/ai/prompts.ts. Sample report: /demo.
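The three-tier routing can be pictured with a small sketch. The word-count cutoff, the review shape, and the function name below are assumptions made for illustration; the writeup does not state the actual thresholds.

```typescript
// Hypothetical sketch of three-tier routing. The cutoff is an assumed value,
// not the one FrictionLens uses.
type Tier = "trivial" | "short" | "complex";

interface Review {
  rating: number; // star rating, 1 to 5
  text?: string;  // may be missing for rating-only reviews
}

const SHORT_REVIEW_MAX_WORDS = 8; // assumed cutoff for the rule-based tier

function classifyTier(review: Review): Tier {
  const words = (review.text ?? "")
    .trim()
    .split(/\s+/)
    .filter((w) => w.length > 0);
  if (words.length === 0) return "trivial";                   // rating only, no text to analyze
  if (words.length <= SHORT_REVIEW_MAX_WORDS) return "short"; // cheap rule-based sentiment
  return "complex";                                           // worth an LLM call
}

console.log(classifyTier({ rating: 5 }));
console.log(classifyTier({ rating: 2, text: "Crashes on launch" }));
```

Only the last tier costs tokens, which is the reason to route before calling the model.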

What the eval layer measures and why

Spearman correlation, per dimension

Spearman rank correlation, not Pearson. The dimension scores are 0–10 ordinal-ish numbers: a “love score” of 9 vs 10 isn’t necessarily a meaningful linear gap, but a 9 should rank higher than a 7. Spearman is unaffected by monotonic non-linear distortions. If the model consistently scores 1–2 points lower than human labelers on the negative end only, rank order survives; Pearson would penalize that uneven bias, Spearman would not. For an LLM whose absolute output magnitudes drift across versions but whose ranking should remain stable, Spearman is the metric that actually tells you whether the model still “understands” the dimension.

I report one Spearman per dimension because aggregating into a single score hides the failure modes. Frustration scoring is much easier than momentum scoring — the former has clear lexical signals, the latter requires temporal reasoning about product trajectory. Reporting them together would obscure the dimension where I’d want a PM to push back on the model.
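A minimal sketch of the metric, assuming tie-averaged ranks (scores on a 0–10 scale tie often). The function names and sample numbers are illustrative and are not taken from the repo's harness. It would be run once per dimension.

```typescript
/** Assign 1-based ranks; tied values share the average of their positions. */
function averageRanks(values: number[]): number[] {
  const order = values
    .map((value, index) => ({ value, index }))
    .sort((a, b) => a.value - b.value);
  const ranks = new Array<number>(values.length).fill(0);
  let i = 0;
  while (i < order.length) {
    let j = i;
    // Extend j across the run of values tied with position i.
    while (j + 1 < order.length && order[j + 1]!.value === order[i]!.value) j++;
    const shared = (i + j) / 2 + 1; // average of 1-based positions i+1 .. j+1
    for (let k = i; k <= j; k++) ranks[order[k]!.index] = shared;
    i = j + 1;
  }
  return ranks;
}

/** Pearson correlation; returns NaN when either input has zero variance. */
function pearson(x: number[], y: number[]): number {
  if (x.length !== y.length || x.length < 2) {
    throw new Error("need two equal-length arrays with n >= 2");
  }
  const mean = (v: number[]) => v.reduce((s, a) => s + a, 0) / v.length;
  const mx = mean(x);
  const my = mean(y);
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < x.length; i++) {
    const dx = x[i]! - mx;
    const dy = y[i]! - my;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

/** Spearman is Pearson computed on the ranks. */
function spearman(x: number[], y: number[]): number {
  return pearson(averageRanks(x), averageRanks(y));
}

// Made-up data: the model scores the low end 1 to 2 points below the human
// labels but keeps every review in the same order.
const humanScores = [1, 2, 3, 5, 7, 9, 10];
const modelScores = [0, 0.5, 1, 4, 7, 9, 10];
console.log("spearman:", spearman(humanScores, modelScores));
console.log("pearson: ", pearson(humanScores, modelScores));
```

On this data the ranks are identical, so Spearman stays at its maximum while Pearson falls below it. That is the behavior the metric choice relies on.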

Mean absolute error, per dimension

Spearman tells you rank fidelity. MAE tells you magnitude fidelity. Together they answer the two distinct questions a PM should ask: “is the model ordering things correctly?” and “are the model’s scores on the same scale as the human labels?” An MAE of 1.0 on a 0–10 scale means the model is, on average, 1 point away from a human labeler; an MAE of 3.0 means the absolute scores can’t be trusted even if Spearman looks fine.
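A short sketch of MAE on made-up numbers. The example is built so that the ordering is perfect while every score sits 2 points low, the case where rank correlation alone would look flawless.

```typescript
/** Mean absolute error between model scores and human labels. */
function meanAbsoluteError(predicted: number[], labeled: number[]): number {
  if (predicted.length !== labeled.length || predicted.length === 0) {
    throw new Error("need two non-empty arrays of equal length");
  }
  let total = 0;
  for (let i = 0; i < predicted.length; i++) {
    total += Math.abs(predicted[i]! - labeled[i]!);
  }
  return total / predicted.length;
}

// Illustrative labels, not rows from the golden set.
const labeledScores = [2, 4, 6, 8, 10];
const shiftedScores = labeledScores.map((s) => s - 2); // same order, 2 points low
console.log(meanAbsoluteError(shiftedScores, labeledScores)); // 2
```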

Churn risk as classification, not regression

The model also emits a discrete churn_risk label (Critical / High / Medium / Low). For discrete outputs, exact-match accuracy is the headline metric. I deliberately did not map the labels to numbers and score them with a continuous metric, because the buckets are not equidistant: “Critical” vs “High” is a much more important distinction than “Medium” vs “Low,” and a regression metric would treat the two gaps as equal. Exact match has a blind spot of its own (every miss counts the same), so it is best read next to a breakdown of which labels get confused with which.
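A sketch of exact-match scoring for the four labels, with a tally of (expected, predicted) pairs added so that a Critical-to-Low miss does not disappear into one accuracy number. The helper names and sample labels are illustrative.

```typescript
const CHURN_LABELS = ["Critical", "High", "Medium", "Low"] as const;
type ChurnRisk = (typeof CHURN_LABELS)[number];

/** Fraction of reviews where the predicted label equals the hand label. */
function exactMatchAccuracy(predicted: ChurnRisk[], expected: ChurnRisk[]): number {
  if (predicted.length !== expected.length || expected.length === 0) {
    throw new Error("need two non-empty arrays of equal length");
  }
  let hits = 0;
  expected.forEach((label, i) => {
    if (predicted[i] === label) hits++;
  });
  return hits / expected.length;
}

/** Count "expected->predicted" pairs so specific confusions stay visible. */
function confusionCounts(predicted: ChurnRisk[], expected: ChurnRisk[]): Record<string, number> {
  if (predicted.length !== expected.length) {
    throw new Error("arrays must have equal length");
  }
  const counts: Record<string, number> = {};
  expected.forEach((label, i) => {
    const key = `${label}->${predicted[i]}`;
    counts[key] = (counts[key] ?? 0) + 1;
  });
  return counts;
}

// Made-up labels: one Critical review is predicted as Low.
const expectedRisk: ChurnRisk[] = ["Critical", "High", "Medium", "Low"];
const predictedRisk: ChurnRisk[] = ["Low", "High", "Medium", "Low"];
console.log(exactMatchAccuracy(predictedRisk, expectedRisk)); // 0.75
console.log(confusionCounts(predictedRisk, expectedRisk));
```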

Golden set design

30 hand-labeled reviews live in src/lib/ai/eval/golden-set.json. The distribution is intentional:

6 strongly positive (clean signal, high ratings, advocacy language)
6 strongly negative (data loss, broken core features, support failures)
8 mixed / ambivalent (love + frustration coexisting, conditional praise, sarcasm)
5 churn-signaling (explicit cancellation, competitor mentions, ultimatums)
5 feature-specific complaints (overall positive, one feature blocking conversion)

The mixed bucket is where the model gets things wrong. That’s the point. A golden set where every Spearman lands at 0.95 is either too easy or has label leakage. I built this set knowing the model would fumble sarcasm (“Oh great, another update that ‘simplifies’ the workflow”) and undersell mixed reviews where the user’s frustration is bundled with affection.

In production I would tier the golden set: 50 hand-labeled reviews for primary signal plus ~500 weakly-labeled reviews (derived from star ratings with rule-based filters) for distribution coverage. The labor cost of hand-labeling rises linearly; weak supervision lets you get distribution coverage cheaply at the cost of label noise.

Cost and latency, per call

Every Gemini invocation goes through instrumentedGenerateText in src/lib/ai/instrument.ts, which extracts usage.inputTokens, usage.outputTokens, and finishReason from the SDK result, computes cost from static pricing constants, and writes one row to model_calls. The dashboard aggregates these into 30-day p50/p95/p99 latency per prompt and total spend per model. Telemetry is fire-and-forget — a database hiccup will never kill a user-facing analysis.
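The wrapper pattern described above might look roughly like the sketch below. It is not the repo's instrumentedGenerateText: the row shape, the persist callback, and the prices are placeholders, and only inputTokens, outputTokens, and finishReason come from the text.

```typescript
// Hypothetical result shape; only the three fields named in the text are modeled.
interface CallResult {
  usage: { inputTokens?: number; outputTokens?: number };
  finishReason: string;
}

interface ModelCallRow {
  promptId: string;
  model: string;
  latencyMs: number;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  finishReason: string;
  status: "ok" | "error";
}

// Placeholder prices per million tokens. These are NOT real Gemini prices.
const PRICE_PER_MILLION_USD = { input: 1.0, output: 2.0 };

function computeCostUsd(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens * PRICE_PER_MILLION_USD.input +
      outputTokens * PRICE_PER_MILLION_USD.output) / 1e6
  );
}

async function instrumentedCall<T extends CallResult>(
  meta: { promptId: string; model: string },
  call: () => Promise<T>,
  persist: (row: ModelCallRow) => Promise<void>,
): Promise<T> {
  const startedAt = Date.now();
  const record = (row: ModelCallRow): void => {
    // Fire-and-forget: never awaited, and failures are logged, not rethrown.
    try {
      void persist(row).catch((err) => console.error("telemetry write failed", err));
    } catch (err) {
      console.error("telemetry write failed", err); // persist threw synchronously
    }
  };
  try {
    const result = await call();
    const inputTokens = result.usage.inputTokens ?? 0;
    const outputTokens = result.usage.outputTokens ?? 0;
    record({
      ...meta,
      latencyMs: Date.now() - startedAt,
      inputTokens,
      outputTokens,
      costUsd: computeCostUsd(inputTokens, outputTokens),
      finishReason: result.finishReason,
      status: "ok",
    });
    return result;
  } catch (error) {
    record({
      ...meta,
      latencyMs: Date.now() - startedAt,
      inputTokens: 0,
      outputTokens: 0,
      costUsd: 0,
      finishReason: "error",
      status: "error",
    });
    throw error; // the caller still sees the model failure
  }
}
```

The design choice to note is that the telemetry promise is never awaited. A slow or failing write costs a log line, not a failed analysis.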

What I built vs. what I’d build next

Built

Per-call instrumentation persisted to model_calls (latency, tokens, cost, finish_reason, prompt_id, status)
Prompt versioning via a PROMPTS registry; every call records its prompt_id
30-review hand-labeled golden set with intentional distribution
CLI eval harness (npm run eval) computing Spearman + MAE per dimension and churn-risk exact-match
Public /eval dashboard with KPIs, cost breakdown, latency bars, eval scorecard, and recent traces
Public-safe aggregate views that exclude user identifiers

Next 30 days

Drift detection — compare today’s dimension distribution to last week’s at the per-prompt-version level; alert on >2σ shifts
Per-user cost surface — currently aggregate-only on the public dashboard; per-account spend visibility lives in a private dashboard
Prompt regression gate — block deploys if Spearman drops more than 0.1 on any dimension vs the last green eval run
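The regression gate in the last item could be as small as the sketch below. The dimension keys, the scorecard shape, and the sample numbers are assumptions; only the 0.1 threshold comes from the text.

```typescript
// Hypothetical key names for the five dimensions.
type Dimension = "love" | "frustration" | "loyalty" | "momentum" | "word_of_mouth";
type Scorecard = Record<Dimension, number>; // Spearman per dimension

const MAX_SPEARMAN_DROP = 0.1; // threshold stated in the roadmap item

/** Dimensions whose Spearman fell by more than maxDrop since the last green run. */
function regressedDimensions(
  current: Scorecard,
  lastGreen: Scorecard,
  maxDrop: number = MAX_SPEARMAN_DROP,
): Dimension[] {
  return (Object.keys(lastGreen) as Dimension[]).filter(
    (dim) => lastGreen[dim] - current[dim] > maxDrop,
  );
}

// Made-up scorecards for illustration only.
const lastGreenRun: Scorecard = {
  love: 0.85, frustration: 0.88, loyalty: 0.74, momentum: 0.62, word_of_mouth: 0.79,
};
const candidateRun: Scorecard = {
  love: 0.86, frustration: 0.85, loyalty: 0.73, momentum: 0.45, word_of_mouth: 0.8,
};
console.log(regressedDimensions(candidateRun, lastGreenRun));
```

A CI step would fail the build when the returned list is non-empty. Improvements and small dips pass, so the gate only fires on drops larger than the threshold.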

Next 90 days

LLM-as-judge with calibrated rubric — second model rates the first model’s output against the rubric; expensive but unlocks evaluating new dimensions without growing the hand-labeled set
Prompt A/B routing — send a fraction of traffic to a candidate prompt version, accumulate parallel eval rows, surface the deltas
Multi-model ensemble for high-stakes dimensions — for churn_risk specifically, vote across 2–3 model families; cost is justified because misclassifying “Critical” as “Low” is asymmetric

Case study: debugging topic drift in production

The sections above are what I’d say in an interview. This section is what actually happened when I shipped continuous monitoring — weekly re-analysis of a tracked app, with measured period-over-period deltas — and the deltas came out as noise. I’m including it because debugging an LLM system in production is the skill the eval layer exists to support, and this incident used every part of it.

The symptom. Two analyses of the same app, minutes apart, on a ~99% identical review corpus, shared 1 of ~10 friction topic names. The delta matcher dutifully reported nearly every topic as simultaneously “new” and “resolved.” Week-over-week deltas built on that would be measuring naming drift, not product change.

Hypothesis 1: the matcher is too strict. The topic matcher uses word-set Jaccard similarity, and “free trial” vs “free trials” scores 1/3 — under the 0.5 match threshold. I shipped plural folding in the shared tokenizer with regression tests for the exact live miss. Matched topics went from 0 to 1. Real bug, not the root cause.
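The arithmetic behind that miss can be reproduced with a small sketch of word-set Jaccard. The tokenizer and the plural-folding rule below are simplified stand-ins, not the shipped code; the 0.5 threshold is the one quoted above.

```typescript
const MATCH_THRESHOLD = 0.5;

/** Naive plural folding: strip a trailing "s" from longer words, but not "ss". */
function foldPlural(word: string): string {
  return word.length > 3 && word.endsWith("s") && !word.endsWith("ss")
    ? word.slice(0, -1)
    : word;
}

function tokenize(name: string, fold: boolean): Set<string> {
  const words = name
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0);
  return new Set(fold ? words.map(foldPlural) : words);
}

/** |A intersect B| / |A union B|; defined as 0 when both sets are empty. */
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach((w) => {
    if (b.has(w)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

function topicsMatch(a: string, b: string, fold: boolean): boolean {
  return jaccard(tokenize(a, fold), tokenize(b, fold)) >= MATCH_THRESHOLD;
}

console.log(jaccard(tokenize("free trial", false), tokenize("free trials", false)));
console.log(topicsMatch("free trial", "free trials", true));
console.log(topicsMatch("customer service", "customer support", true));
```

Without folding, the two names share one word out of three distinct words, so the score is 1/3. With folding they become the same set. The last line shows why folding alone could not fix the drift: synonyms such as "service" and "support" still score 1/3.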

Hypothesis 2: the model needs anchoring. Feed the baseline run’s topic names into the report prompt and instruct the model to reuse them verbatim. Shipped it. Zero effect. Moved the topic list to the top of the prompt (it had been sitting after ~16k tokens of review summaries), repeated the rule at the decision point, hardened the wording. Still zero — the model wrote “customer service” when the supplied list said “customer support.” Before iterating a third time, I verified the anchor list was actually present in the live prompt by reproducing the exact database fetch and prompt composition locally against production data. It was. The prompt was fine; the model was ignoring it.

Root cause: nobody had set the temperature. Every call ran at the provider default (~1.0). At that temperature the report call doesn’t extract “the top 10 friction topics” — it samples 10 from a long tail of candidates, differently every run. No prompt instruction survives that. The tell, in hindsight, was in the data: near-disjoint topic sets across near-identical corpora is a variance signature, not an instruction-following failure. One line — temperature: 0 — did more than two prompt versions combined: matched topics went to 3, then 5, with verbatim name reuse.
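A toy illustration of why temperature matters here. It is not the provider's decoder, and the logits are invented: the point is only that with a flat long tail of candidates, temperature near 1.0 leaves the top candidate with a small share of the probability mass, while lower temperatures concentrate it.

```typescript
/** Softmax over logits at a given temperature; temperature 0 is treated as greedy argmax. */
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (logits.length === 0) throw new Error("logits must be non-empty");
  if (temperature < 0) throw new Error("temperature must be >= 0");
  const maxLogit = Math.max(...logits);
  if (temperature === 0) {
    const winner = logits.indexOf(maxLogit);
    return logits.map((_, i) => (i === winner ? 1 : 0));
  }
  // Subtracting the max keeps the exponentials numerically stable.
  const weights = logits.map((l) => Math.exp((l - maxLogit) / temperature));
  const total = weights.reduce((s, w) => s + w, 0);
  return weights.map((w) => w / total);
}

// Invented logits: one mildly preferred candidate and a long tail of near-equals.
const candidateLogits = [2.0, 1.8, 1.7, 1.6, 1.5, 1.5, 1.4, 1.4, 1.3, 1.3];
for (const t of [1.0, 0.3, 0]) {
  const top = softmaxWithTemperature(candidateLogits, t)[0]!;
  console.log(`T=${t}: P(top candidate) = ${top.toFixed(3)}`);
}
```

Real topic extraction samples whole token sequences, so this understates the effect. It still shows why no instruction in the prompt can pin down a choice that the sampler makes afterwards.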

The aftershock. Determinism exposed a latent contract violation. The anchoring instruction said “reuse prior names verbatim”; the output schema hard-rejects single-word feature names as insufficiently specific; the baseline contained “privacy.” Obeying the prompt guaranteed failing the schema — and because validation was all-or-nothing, one bad item rejected the entire report, which silently degraded the analysis to a non-AI fallback. Every scheduled run inherited the poisoned baseline. Two fixes: anchor candidates are now filtered through the same validator the schema uses before they reach the prompt, and a salvage pass parses the raw model text on validation failure, drops only the individually-invalid items, and re-validates the rest — with the drop counts logged instead of swallowed.
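Both fixes can be sketched in a few lines. This is a simplification under stated assumptions: the real report is a larger object validated by a schema library, whereas here the raw model text is assumed to be a JSON array of topics and the validator is a stand-in for the single-word rule.

```typescript
// Hypothetical item shape for illustration.
interface FrictionTopic {
  name: string;
  mentions: number;
}

/** Stand-in for the schema rule described above: single-word names are rejected. */
function isValidTopicName(name: string): boolean {
  return name.trim().split(/\s+/).filter((w) => w.length > 0).length >= 2;
}

/** Fix 1: filter anchor candidates with the same rule the output schema enforces. */
function filterAnchors(baselineNames: string[]): string[] {
  return baselineNames.filter(isValidTopicName);
}

/**
 * Fix 2: salvage. Keep the individually valid items, drop the rest, and
 * report how many were dropped. Throws if the text is not JSON at all,
 * since there is nothing to salvage in that case.
 */
function salvageTopics(rawModelText: string): { kept: FrictionTopic[]; dropped: number } {
  const parsed: unknown = JSON.parse(rawModelText);
  if (!Array.isArray(parsed)) return { kept: [], dropped: 0 };
  const kept: FrictionTopic[] = [];
  let dropped = 0;
  parsed.forEach((item: unknown) => {
    const candidate = item as Partial<FrictionTopic> | null;
    if (
      candidate !== null &&
      typeof candidate === "object" &&
      typeof candidate.name === "string" &&
      typeof candidate.mentions === "number" &&
      isValidTopicName(candidate.name)
    ) {
      kept.push({ name: candidate.name, mentions: candidate.mentions });
    } else {
      dropped++;
    }
  });
  if (dropped > 0) console.log(`salvage: dropped ${dropped} invalid topic(s)`);
  return { kept, dropped };
}

console.log(filterAnchors(["privacy", "free trial", "customer support"]));
console.log(salvageTopics('[{"name":"privacy","mentions":4},{"name":"free trial","mentions":9}]'));
```

Sharing one validator between the prompt side and the output side is what keeps the two halves of the contract from drifting apart again.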

What generalizes

Measure before prompt-engineering. Two prompt iterations were spent on a sampling problem. Distribution-level symptoms (disjoint outputs across identical inputs) point at decoding parameters, not instructions.
Extraction tasks default to temperature 0. As a side effect, npm run eval wasn’t reproducible run-to-run until this landed — the eval layer itself was measuring sampling noise.
The prompt and the output schema are one contract. If the prompt can instruct the model into schema-invalid output, one of them is lying. Validate anything you feed into an instruction against the rules you’ll enforce on the way out.
All-or-nothing validation plus a silent fallback is how quality rots invisibly. The degraded runs still “succeeded” — completed analyses, sent emails — with quietly worse content. Salvage made the failure partial; logging made it visible.

What I’d kill

These look good in roadmaps and bad in production. Stated explicitly so a reader can see the negative space.

A UI for non-engineers to edit prompts. Every team I’ve talked to that built this regretted it within a quarter. Prompts encode product semantics; “anyone can edit” means “no one owns the regressions.”
Per-call PII redaction in v1. Reviews are already public-facing data; building a redaction layer for v1 would optimize for a threat model that doesn’t exist yet.
Real-time streaming dashboard. Daily aggregates are sufficient for the questions this layer answers. “Realtime” adds infrastructure for zero hiring signal.
Cost optimization automation. The system can identify expensive prompts; routing decisions should stay human-in-the-loop until the model behavior is well-characterized.

What I’d measure in production

Product	Reliability	Eval
Vibe Reports / week	p95 end-to-end latency	Spearman per dim (rolling 7d)
Conversion to paid from sample	Error rate per prompt_id	MAE drift vs. last green eval
Time from upload to insight	Cost per Vibe Report (USD)	Churn-risk confusion matrix

Tradeoffs and known gaps

n=30 golden set is small. Confidence intervals on Spearman are wide. A change in the third decimal between eval runs is noise. I treat differences below 0.1 as suggestive at best, and only differences above 0.2 as significant.
Pricing is static. Constants in src/lib/ai/pricing.ts reflect public Gemini pricing as of 2026-05-13. In production these would come from a config service. A drift here would silently mis-cost every analysis; surfacing the snapshot date on the dashboard is mitigation, not prevention.
Latency includes network. Measured from Vercel function start to SDK resolve, not pure inference time. Real model latency is somewhat lower; the displayed number is what users actually experience.
Eval and production share the same prompt. The harness calls the same analyzeReview that runs in production, so the eval rows are not independent samples — they share the prompt version they’re evaluating. This is fine because the goal is “did this prompt land well against ground truth,” not “is this prompt robust to perturbation.”
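To put a number on how wide the intervals are at n=30, here is a sketch using the Fisher z-transform. The standard error 1/sqrt(n - 3) is the textbook approximation for Pearson and is only approximate for Spearman (some references inflate it slightly), and the 0.8 below is an illustrative value, not a measured result.

```typescript
/** Approximate confidence interval for a correlation via the Fisher z-transform. */
function fisherInterval(
  rho: number,
  n: number,
  zCritical: number = 1.96, // two-sided 95%
): { lower: number; upper: number } {
  if (n <= 3) throw new Error("need n > 3");
  if (Math.abs(rho) >= 1) throw new Error("rho must be strictly between -1 and 1");
  const z = Math.atanh(rho);            // map the correlation to z-space
  const se = 1 / Math.sqrt(n - 3);      // approximate standard error in z-space
  return {
    lower: Math.tanh(z - zCritical * se), // map the bounds back to correlations
    upper: Math.tanh(z + zCritical * se),
  };
}

const interval = fisherInterval(0.8, 30); // illustrative rho, golden-set n
console.log(interval.lower.toFixed(2), interval.upper.toFixed(2));
```

For an observed correlation of 0.8 on 30 reviews, the interval runs from roughly 0.6 to 0.9, which is why small run-to-run differences should not be read as signal.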

Stack and code

Framework: Next.js 16 (App Router) on Vercel
AI: Gemini 2.5 Flash via Vercel AI SDK v6 + @ai-sdk/google
Data: Supabase (Postgres + RLS)
Source: github.com/chetanjon/FrictionLens

Run the eval against the current prompt yourself: npm run eval. New rows appear on /eval after the next ISR refresh (every 5 minutes).