Productization writeup

How I’d productize an LLM eval layer

A writeup for the FrictionLens eval and observability surface. ~1,400 words. Last updated 2026-05-13.

TL;DR

FrictionLens is an AI-powered app review analyzer that scores user reviews across five sentiment dimensions (love, frustration, loyalty, momentum, word-of-mouth) and synthesizes them into a Vibe Report. The eval layer documented here measures how well the model’s dimension scores agree with hand-labeled ground truth, surfaces real per-call cost and latency from production traffic, and tracks prompt versions so regressions are visible at a glance. The headline metric: Spearman correlation against a 30-review golden set, computed per dimension, persisted to the database, and rendered on a public dashboard at /eval.

Why this matters

AI PM hiring is hot and “I’ve used ChatGPT” is the entry-level baseline. Hands-on intuition about LLM evals — what to measure, when correlation matters more than accuracy, why golden sets need to include sarcasm — is rare among early-career applicants because most production AI work happens behind closed doors. A public eval surface that runs against a real product, with real telemetry, is the cheapest defensible signal you can build. This is not a research project; it’s a portfolio artifact a hiring manager can click through in 90 seconds.

The product today

FrictionLens takes a CSV of app reviews (or auto-pulls them from the App Store / Play Store) and produces a Vibe Report: aggregate dimension scores, a 0–100 Vibe Score, top friction features, churn drivers with verbatim quotes, and prioritized action items. The AI pipeline uses Gemini 2.5 Flash by default via the Vercel AI SDK. Reviews are classified into three tiers to minimize cost: trivial (rating-only), short (rule-based sentiment), and complex (sent to Gemini). The five dimensions and their scoring rubrics are defined in the system prompt at src/lib/ai/prompts.ts. Sample report: /demo.

What the eval layer measures and why

Spearman correlation, per dimension

Spearman rank correlation, not Pearson. The dimension scores are 0–10 ordinal-ish numbers: a “love score” of 9 vs 10 isn’t necessarily a meaningful linear gap, but a 9 should rank higher than a 7. Spearman is invariant under any monotonic mapping, linear or not. If the model scores 1–2 points lower than human labelers only on the negative end, the relationship bends but rank order survives; Pearson would penalize that bend, Spearman would not. (A uniform offset across the whole scale would leave both unchanged.) For an LLM whose absolute output magnitudes drift across versions but whose ranking should remain stable, Spearman is the metric that actually tells you whether the model still “understands” the dimension.

I report one Spearman per dimension because aggregating into a single score hides the failure modes. Frustration scoring is much easier than momentum scoring — the former has clear lexical signals, the latter requires temporal reasoning about product trajectory. Reporting them together would obscure the dimension where I’d want a PM to push back on the model.
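A minimal sketch of the per-dimension computation, assuming scores arrive as two parallel arrays (human labels and model outputs). The function names and the sample numbers are illustrative, not taken from the FrictionLens harness. Ties get average ranks, and Spearman is then Pearson computed on those ranks.

```typescript
// Average ranks (1-based); tied values share the mean of their positions.
function averageRanks(values: number[]): number[] {
  const order = values.map((v, i) => ({ v, i })).sort((a, b) => a.v - b.v);
  const ranks = new Array<number>(values.length);
  let start = 0;
  while (start < order.length) {
    let end = start;
    while (end + 1 < order.length && order[end + 1].v === order[start].v) end++;
    const shared = (start + end) / 2 + 1; // mean of positions start+1 .. end+1
    for (let k = start; k <= end; k++) ranks[order[k].i] = shared;
    start = end + 1;
  }
  return ranks;
}

// Plain Pearson correlation; returns NaN if either input is constant.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mx = x.reduce((s, v) => s + v, 0) / n;
  const my = y.reduce((s, v) => s + v, 0) / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - mx;
    const dy = y[i] - my;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Spearman = Pearson on the rank-transformed inputs.
function spearman(x: number[], y: number[]): number {
  if (x.length !== y.length || x.length < 2) {
    throw new Error("spearman needs two arrays of equal length >= 2");
  }
  return pearson(averageRanks(x), averageRanks(y));
}

// Made-up scores: the model compresses the scale but keeps the order.
const humanScores = [1, 3, 5, 7, 9];
const modelScores = [3, 5.2, 6.7, 7.9, 9];
const rankAgreement = spearman(humanScores, modelScores);
const linearAgreement = pearson(humanScores, modelScores);
```

Here the rank agreement is perfect while the linear agreement falls slightly short of 1, which is the gap between the two metrics that the argument above relies on.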

Mean absolute error, per dimension

Spearman tells you rank fidelity. MAE tells you magnitude fidelity. Together they answer the two distinct questions a PM should ask: “is the model ordering things correctly?” and “are the model’s scores on the same scale as a human’s?” An MAE of 1.0 on a 0–10 scale means the model is within 1 point of a human labeler on average; an MAE of 3.0 means the absolute scores are off by nearly a third of the scale and can’t be read at face value, even if Spearman looks fine.
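As a small illustration of why the two metrics are reported together, here is a sketch of MAE on made-up scores where the model preserves order perfectly but sits two points low everywhere. The function name and the data are assumptions for the example.

```typescript
// Mean absolute error between predicted and reference scores.
function meanAbsoluteError(pred: number[], truth: number[]): number {
  if (pred.length !== truth.length || pred.length === 0) {
    throw new Error("MAE needs two non-empty arrays of equal length");
  }
  let total = 0;
  for (let i = 0; i < pred.length; i++) total += Math.abs(pred[i] - truth[i]);
  return total / pred.length;
}

// Illustrative data: same ordering, constant 2-point underestimate.
const labeled = [2, 4, 6, 8, 10];
const predicted = labeled.map((score) => score - 2);
const mae = meanAbsoluteError(predicted, labeled);
```

Rank order is intact in this example, so a rank metric would look perfect, yet every score is off by two points. Only MAE surfaces that.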

Churn risk as classification, not regression

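The model also emits a discrete churn_risk label (Critical / High / Medium / Low). For discrete outputs, exact-match accuracy is the headline metric. I deliberately did not use a continuous metric because the buckets are not equidistant: “Critical” vs “High” is a much more important distinction than “Medium” vs “Low,” and a regression metric over label indices would treat every one-step miss as equal. Exact match has its own blind spot, since it counts every miss the same regardless of which labels were confused, so it is best read alongside a churn-risk confusion matrix.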
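A sketch of exact-match accuracy plus a confusion matrix for the four labels, under the assumption that predictions and ground truth are parallel arrays. The type and function names are hypothetical, and the sample labels are invented.

```typescript
type ChurnRisk = "Critical" | "High" | "Medium" | "Low";
const RISK_LABELS: ChurnRisk[] = ["Critical", "High", "Medium", "Low"];

// Fraction of predictions that match the hand label exactly.
function exactMatchAccuracy(pred: ChurnRisk[], truth: ChurnRisk[]): number {
  if (pred.length !== truth.length || pred.length === 0) {
    throw new Error("need two non-empty arrays of equal length");
  }
  let hits = 0;
  for (let i = 0; i < pred.length; i++) if (pred[i] === truth[i]) hits++;
  return hits / pred.length;
}

// matrix[truth][predicted] = count, so off-diagonal cells show which
// confusions happen (for example Critical labeled as Low).
function confusionMatrix(
  pred: ChurnRisk[],
  truth: ChurnRisk[],
): Record<ChurnRisk, Record<ChurnRisk, number>> {
  const matrix = {} as Record<ChurnRisk, Record<ChurnRisk, number>>;
  for (const t of RISK_LABELS) {
    matrix[t] = { Critical: 0, High: 0, Medium: 0, Low: 0 };
  }
  for (let i = 0; i < pred.length; i++) matrix[truth[i]][pred[i]]++;
  return matrix;
}

// Invented sample: one Critical review is misread as Low.
const truthLabels: ChurnRisk[] = ["Critical", "Critical", "High", "Low"];
const predLabels: ChurnRisk[] = ["Critical", "Low", "High", "Low"];
const churnAccuracy = exactMatchAccuracy(predLabels, truthLabels);
const churnMatrix = confusionMatrix(predLabels, truthLabels);
```

Accuracy alone reports three out of four correct here. The matrix cell for truth Critical and prediction Low is what shows that the single miss is the costly kind.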

Golden set design

30 hand-labeled reviews live in src/lib/ai/eval/golden-set.json. The distribution is intentional:

  • 6 strongly positive (clean signal, high ratings, advocacy language)
  • 6 strongly negative (data loss, broken core features, support failures)
  • 8 mixed / ambivalent (love + frustration coexisting, conditional praise, sarcasm)
  • 5 churn-signaling (explicit cancellation, competitor mentions, ultimatums)
  • 5 feature-specific complaints (overall positive, one feature blocking conversion)

The mixed bucket is where the model gets things wrong. That’s the point. A golden set where every Spearman lands at 0.95 is either too easy or has label leakage. I built this set knowing the model would fumble sarcasm (“Oh great, another update that ‘simplifies’ the workflow”) and undersell mixed reviews where the user’s frustration is bundled with affection.
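Because the distribution is intentional, it is cheap to guard against accidental drift when reviews are added or relabeled. The sketch below assumes each golden-set entry carries a bucket field. The bucket names are hypothetical and the real shape of golden-set.json may differ; only the target counts come from the list above.

```typescript
// Hypothetical bucket keys; the actual JSON may name these differently.
type Bucket = "positive" | "negative" | "mixed" | "churn" | "feature";

// Target counts from the documented distribution (6 + 6 + 8 + 5 + 5 = 30).
const TARGET_COUNTS: Record<Bucket, number> = {
  positive: 6,
  negative: 6,
  mixed: 8,
  churn: 5,
  feature: 5,
};

// Returns one message per bucket whose count differs from the target.
function bucketMismatches(reviews: { bucket: Bucket }[]): string[] {
  const counts: Record<Bucket, number> = {
    positive: 0,
    negative: 0,
    mixed: 0,
    churn: 0,
    feature: 0,
  };
  for (const review of reviews) counts[review.bucket]++;
  return (Object.keys(TARGET_COUNTS) as Bucket[])
    .filter((b) => counts[b] !== TARGET_COUNTS[b])
    .map((b) => `${b}: expected ${TARGET_COUNTS[b]}, found ${counts[b]}`);
}
```

An empty result means the set still matches the intended mix. Running a check like this before each eval run keeps the mixed bucket from being quietly diluted.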

In production I would tier the golden set: 50 hand-labeled reviews for primary signal plus ~500 weakly-labeled reviews (derived from star ratings with rule-based filters) for distribution coverage. The labor cost of hand-labeling rises linearly; weak supervision lets you get distribution coverage cheaply at the cost of label noise.

Cost and latency, per call

Every Gemini invocation goes through instrumentedGenerateText in src/lib/ai/instrument.ts, which extracts usage.inputTokens, usage.outputTokens, and finishReason from the SDK result, computes cost from static pricing constants, and writes one row to model_calls. The dashboard aggregates these into 30-day p50/p95/p99 latency per prompt and total spend per model. Telemetry is fire-and-forget — a database hiccup will never kill a user-facing analysis.
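The flow described above can be sketched as a generic wrapper. This is not the code in src/lib/ai/instrument.ts: the wrapper name, row shape, and injected writeRow sink are assumptions, and the prices are placeholders, not Gemini's actual pricing. Only the usage.inputTokens, usage.outputTokens, and finishReason field names come from the text.

```typescript
// Fields read from the SDK result, as named in the paragraph above.
interface CallResult {
  usage: { inputTokens?: number; outputTokens?: number };
  finishReason: string;
}

// Hypothetical row shape for the model_calls table.
interface ModelCallRow {
  prompt_id: string;
  model: string;
  latency_ms: number;
  input_tokens: number;
  output_tokens: number;
  cost_usd: number | null;
  finish_reason: string;
  status: "ok" | "error";
}

// PLACEHOLDER prices in USD per 1M tokens. Not real Gemini pricing.
const PRICING: Record<string, { inputPerM: number; outputPerM: number }> = {
  "example-model": { inputPerM: 1, outputPerM: 2 },
};

// Returns null for an unknown model rather than silently reporting $0.
function computeCostUsd(model: string, inTok: number, outTok: number): number | null {
  const price = PRICING[model];
  if (!price) return null;
  return (inTok * price.inputPerM + outTok * price.outputPerM) / 1_000_000;
}

async function instrumentedCall<T extends CallResult>(
  meta: { promptId: string; model: string },
  call: () => Promise<T>,
  writeRow: (row: ModelCallRow) => Promise<void>,
): Promise<T> {
  const started = Date.now();
  // Fire-and-forget: the write is never awaited and its errors are swallowed,
  // so a telemetry failure cannot fail the user-facing call.
  const record = (row: ModelCallRow): void => {
    void Promise.resolve().then(() => writeRow(row)).catch(() => {});
  };
  const base = { prompt_id: meta.promptId, model: meta.model };
  try {
    const result = await call();
    const inTok = result.usage.inputTokens ?? 0;
    const outTok = result.usage.outputTokens ?? 0;
    record({
      ...base,
      latency_ms: Date.now() - started,
      input_tokens: inTok,
      output_tokens: outTok,
      cost_usd: computeCostUsd(meta.model, inTok, outTok),
      finish_reason: result.finishReason,
      status: "ok",
    });
    return result;
  } catch (err) {
    record({
      ...base,
      latency_ms: Date.now() - started,
      input_tokens: 0,
      output_tokens: 0,
      cost_usd: null,
      finish_reason: "error",
      status: "error",
    });
    throw err; // the caller still sees the original failure
  }
}
```

Passing the model call and the database sink as parameters keeps the wrapper testable without a network or a database. The real implementation may instead call the SDK and Supabase directly.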

What I built vs. what I’d build next

Built

  • Per-call instrumentation persisted to model_calls (latency, tokens, cost, finish_reason, prompt_id, status)
  • Prompt versioning via a PROMPTS registry; every call records its prompt_id
  • 30-review hand-labeled golden set with intentional distribution
  • CLI eval harness (npm run eval) computing Spearman + MAE per dimension and churn-risk exact-match
  • Public /eval dashboard with KPIs, cost breakdown, latency bars, eval scorecard, and recent traces
  • Public-safe aggregate views that exclude user identifiers

Next 30 days

  • Drift detection — compare today’s dimension distribution to last week’s at the per-prompt-version level; alert on >2σ shifts
  • Per-user cost surface: the public dashboard is aggregate-only today; per-account spend visibility would live in a private dashboard
  • Prompt regression gate — block deploys if Spearman drops more than 0.1 on any dimension vs the last green eval run
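Two of the items above reduce to small pure functions. The sketch below shows one plausible reading of each. The names are hypothetical, and the drift check assumes "last week" is summarized as one mean score per day, which is only one of several ways to define a 2σ shift.

```typescript
// dimension name -> Spearman from an eval run
type DimensionScores = Record<string, number>;

// Regression gate: dimensions whose Spearman dropped by more than maxDrop
// relative to the last green run. A non-empty result would block the deploy.
function regressedDimensions(
  current: DimensionScores,
  lastGreen: DimensionScores,
  maxDrop = 0.1,
): string[] {
  return Object.keys(lastGreen).filter((dim) => {
    const now = current[dim];
    // A dimension missing from the current run counts as a regression.
    return now === undefined || lastGreen[dim] - now > maxDrop;
  });
}

// Drift check: how many sample standard deviations today's mean score sits
// from the mean of the baseline daily means. |result| > 2 would alert.
function driftInSigmas(todayMean: number, baselineDailyMeans: number[]): number {
  const n = baselineDailyMeans.length;
  if (n < 2) throw new Error("need at least two baseline days");
  const mu = baselineDailyMeans.reduce((s, v) => s + v, 0) / n;
  const variance =
    baselineDailyMeans.reduce((s, v) => s + (v - mu) * (v - mu), 0) / (n - 1);
  return (todayMean - mu) / Math.sqrt(variance);
}
```

For example, with a last green run of love 0.85 and momentum 0.60 and a current run of 0.83 and 0.45, only momentum trips the gate. Both functions would run per prompt version so that a prompt change and a traffic shift are not confused with each other.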

Next 90 days

  • LLM-as-judge with calibrated rubric — second model rates the first model’s output against the rubric; expensive but unlocks evaluating new dimensions without growing the hand-labeled set
  • Prompt A/B routing — send a fraction of traffic to a candidate prompt version, accumulate parallel eval rows, surface the deltas
  • Multi-model ensemble for high-stakes dimensions — for churn_risk specifically, vote across 2–3 model families; cost is justified because misclassifying “Critical” as “Low” is asymmetric
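For the A/B routing item, one common approach is a deterministic hash split, so the same analysis always lands on the same prompt version and reruns stay comparable. The sketch below is an assumption about how that could work, not a description of planned code; the function names, routing key, and prompt ids are hypothetical.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units. Good enough for bucketing;
// not a cryptographic hash.
function hash32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Routes a stable key (for example an analysis id) to one prompt version.
// candidateFraction is the approximate share of traffic sent to the candidate.
function pickPromptId(
  routingKey: string,
  candidateFraction: number,
  stablePromptId: string,
  candidatePromptId: string,
): string {
  const bucket = hash32(routingKey) / 0x100000000; // value in [0, 1)
  return bucket < candidateFraction ? candidatePromptId : stablePromptId;
}
```

Because every call already records its prompt_id, rows from both arms can be compared afterwards without extra bookkeeping. The realized split only approximates candidateFraction and depends on how evenly the keys hash.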

What I’d kill

These look good in roadmaps and bad in production. Stated explicitly so a reader can see the negative space.

  • A UI for non-engineers to edit prompts. Every team I’ve talked to that built this regretted it within a quarter. Prompts encode product semantics; “anyone can edit” means “no one owns the regressions.”
  • Per-call PII redaction in v1. Reviews are already public-facing data; building a redaction layer for v1 would optimize for a threat model that doesn’t exist yet.
  • Real-time streaming dashboard. Daily aggregates are sufficient for the questions this layer answers. “Realtime” adds infrastructure for zero hiring signal.
  • Cost optimization automation. The system can identify expensive prompts; routing decisions should stay human-in-the-loop until the model behavior is well-characterized.

What I’d measure in production

Product                        | Reliability                | Eval
Vibe Reports / week            | p95 end-to-end latency     | Spearman per dim (rolling 7d)
Conversion to paid from sample | Error rate per prompt_id   | MAE drift vs. last green eval
Time from upload to insight    | Cost per Vibe Report (USD) | Churn-risk confusion matrix

Tradeoffs and known gaps

  • n=30 golden set is small. Confidence intervals on Spearman are wide. A change in the third decimal between eval runs is noise. I treat differences below 0.1 as suggestive, above 0.2 as significant.
  • Pricing is static. Constants in src/lib/ai/pricing.ts reflect public Gemini pricing as of 2026-05-13. In production these would come from a config service. A drift here would silently mis-cost every analysis; surfacing the snapshot date on the dashboard is mitigation, not prevention.
  • Latency includes network. Measured from Vercel function start to SDK resolve, not pure inference time. Real model latency is somewhat lower; the displayed number is what users actually experience.
  • Eval and production share the same prompt. The harness calls the same analyzeReview that runs in production, so the eval rows are not independent samples — they share the prompt version they’re evaluating. This is fine because the goal is “did this prompt land well against ground truth,” not “is this prompt robust to perturbation.”
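To put a number on how wide those intervals are at n=30, here is a sketch using the Fisher z-transform with the commonly used 1.06/(n−3) variance approximation for Spearman. It is an approximation rather than an exact interval, and the function name is made up for this example.

```typescript
// Approximate 95% confidence interval for a Spearman correlation.
// Uses Fisher's z-transform with variance ~ 1.06 / (n - 3).
function spearmanCi95(rho: number, n: number): [number, number] {
  if (n <= 3) throw new Error("need n > 3");
  if (Math.abs(rho) >= 1) throw new Error("rho must be strictly between -1 and 1");
  const z = Math.atanh(rho);
  const halfWidth = 1.96 * Math.sqrt(1.06 / (n - 3));
  return [Math.tanh(z - halfWidth), Math.tanh(z + halfWidth)];
}

// Example: an observed Spearman of 0.8 on the 30-review golden set.
const [ciLow, ciHigh] = spearmanCi95(0.8, 30);
```

For an observed 0.8 at n=30 the interval runs from roughly 0.61 to 0.90, which is why small run-to-run differences should not be over-read.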

Stack and code

  • Framework: Next.js 16 (App Router) on Vercel
  • AI: Gemini 2.5 Flash via Vercel AI SDK v6 + @ai-sdk/google
  • Data: Supabase (Postgres + RLS)
  • Source: github.com/chetanjon/FrictionLens

Run the eval against the current prompt yourself: npm run eval. New rows appear on /eval after the next ISR refresh (every 5 minutes).