When to Use

Use whenever "looks/feels right" is the success criterion and there's no cheap numeric metric: animation easing/timing, zoom/camera feel, color grade, layout/spacing, design params, render/encoder settings, prompt params. This is the automated counterpart to lookdev, for when there's no human to sit in the loop.

Source: connerkward/lookdev-auto-skill (MIT).

Visual eval loop — let a vision/video model tune what only an eye can judge

When the target is "does this LOOK/FEEL right" (not a number you can minimize), a vision model (image) or video-understanding model (motion/timing) can be the judge in a tight optimize loop. Worked reference: the screenstudio-alternative skill (iteration.py), which tuned zoom-animation feel via fal-ai/video-understanding.

The loop

Render N labeled variants into ONE artifact. Vary the parameter(s) across a small spread. Annotate each variant's params ON the artifact (burn the label in: "A · 2.2Hz · ζ0.5"). Images → a labeled grid/contact sheet. Video/motion → a labeled sequence (label card or burned-in overlay before/over each clip) so the model can compare temporally.
One model call, structured output. Send the single artifact with an explicit rubric (define what "good" means — and what "too much"/"too little" look like). Ask for per-variant ratings + concrete suggested new values as JSON: {"ratings":{"A":n,...},"best_so_far":"X","suggest":[[p1,p2],...]}.
Coarse → fine. Round 1 = wide spread to locate the region. Round 2 = render the model's suggestions (+ carry the current best) into one artifact; ask it to pick the single best. Usually converges in 2 rounds.
Stop when sufficient — best rates high and suggestions cluster. Apply the winner.
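
The four steps above can be sketched as a small driver. This is a minimal illustration, not the code from iteration.py: `render` and `judge` are hypothetical callables you would supply (one builds the single labeled artifact, the other makes the one model call and returns the parsed JSON), and the two-parameter `(Hz, ζ)` tuple simply mirrors the label example. Stopping is simplified here to a rating threshold; the "suggestions cluster" half of the criterion is left out.

```python
from typing import Callable, Dict, List, Tuple

Params = Tuple[float, float]   # illustrative only: (frequency_hz, damping_ratio)
Labeled = Dict[str, Params]    # label -> params, e.g. {"A": (2.2, 0.5)}


def label_variants(variants: List[Params]) -> Labeled:
    """Assign letter labels A, B, C, ... in order."""
    return {chr(ord("A") + i): p for i, p in enumerate(variants)}


def burn_in_text(label: str, params: Params) -> str:
    """Text to burn onto the artifact next to a variant."""
    hz, zeta = params
    return f"{label} · {hz:g}Hz · ζ{zeta:g}"


def build_prompt(rubric: str, labels: List[str]) -> str:
    """Explicit rubric plus the JSON shape the judge must return."""
    schema = '{"ratings":{"A":n,...},"best_so_far":"X","suggest":[[p1,p2],...]}'
    return (
        f"Rubric: {rubric}\n"
        f"Rate variants {', '.join(labels)} from 1 to 10 against the rubric.\n"
        f"Return ONLY JSON in this shape: {schema}"
    )


def tune(
    initial: List[Params],
    render: Callable[[Labeled], str],   # hypothetical: returns the artifact path
    judge: Callable[[str, str], dict],  # hypothetical: (artifact, prompt) -> parsed JSON
    rubric: str,
    max_rounds: int = 2,
    good_enough: int = 9,               # assumed threshold for "rates high"
) -> Params:
    if not initial:
        raise ValueError("need at least one variant")
    variants = list(initial)
    best = variants[0]
    for _ in range(max_rounds):
        labeled = label_variants(variants)
        artifact = render(labeled)      # ONE artifact holding every variant
        reply = judge(artifact, build_prompt(rubric, list(labeled)))  # ONE call
        best_label = reply["best_so_far"]
        best = labeled[best_label]
        if reply["ratings"][best_label] >= good_enough:
            break
        # Next round: carry the current best, then add the model's suggestions.
        suggested = [tuple(s) for s in reply["suggest"]]
        variants = [best] + [s for s in suggested if s != best]
    return best
```

Passing `render` and `judge` in as arguments keeps the loop independent of any particular renderer or model endpoint, so it can be exercised with stubs before any real render or inference is paid for.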

Token / quality / step reductions (do these)

One artifact per round, not one call per variant. The biggest saver — a 6-variant round is 1 upload + 1 inference, not 6. Montage/grid beats a loop of single calls.
Burn params onto the artifact. The model sees label+result together → no separate "variant A used X" context to carry → fewer tokens, fewer mistakes.
Structured JSON out + parse. No re-asking, no free-text wrangling. Prompt "return ONLY JSON"; regex the first {...}.
Short representative sample. Tune on a 3-5s clip / one frame / one component, not the whole asset. Cheaper render, smaller upload, faster inference. Apply the found params to the full render once.
Cap variants at ~5-6. More doesn't improve the model's discrimination and multiplies render + token cost. Wide-but-sparse round 1, narrow round 2.
Calibration anchors. Include one deliberately-bad and one safe-default variant as fixed anchors each round — gives the model a reference scale and exposes when its "best" is worse than the safe default (catch a bad recommendation early).
Independent rubric, stated up front. Define "good" concretely in the prompt (smooth, subtle settle, not bouncy, not sluggish). Don't ask "which do you like" — that lets it echo your framing. A held-out criterion keeps the judge honest (see verify-outputs-rule: the check must be independent of what you tuned).
Reuse renders across rounds. Carry the round-1 winner's clip into round 2 instead of re-rendering it.
Early-exit. If the round-1 top variant rates ≥9/10 and the suggestions are within a small delta of each other, skip round 2.
Cheapest judge that can see the failure. Frames fed through an image VLM can judge spatial things (layout, color, crop); reach for a true video model only when the thing being judged is temporal (easing, timing, motion smoothness), since those are invisible in stills.
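
Three of the items above (structured JSON out, calibration anchors, early-exit) reduce to a few lines of post-processing on the model's reply. The sketch below is one possible shape, under stated assumptions: the function names are made up for illustration, the 10% relative spread used for "small delta" is a guess rather than a value from the source, and instead of a pure regex it finds the first `{` and lets the standard library's `json.JSONDecoder.raw_decode` locate the matching end so nested braces are handled.

```python
import json


def extract_json(reply: str) -> dict:
    """Parse the first JSON object in a reply that may be wrapped in prose or fences."""
    start = reply.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model reply")
    obj, _end = json.JSONDecoder().raw_decode(reply, start)
    return obj


def anchors_look_sane(result: dict, bad_label: str, safe_label: str) -> bool:
    """True if the deliberately-bad anchor rates below the safe default
    and the model's pick is at least as good as the safe default."""
    ratings = result["ratings"]
    best = result["best_so_far"]
    return ratings[bad_label] < ratings[safe_label] <= ratings[best]


def should_skip_round_two(result: dict, top_threshold: int = 9,
                          max_delta: float = 0.10) -> bool:
    """Early-exit: top rating is high and the suggestions cluster.

    max_delta is an assumed relative spread per parameter, not a sourced value.
    """
    suggestions = result["suggest"]
    if not suggestions or max(result["ratings"].values()) < top_threshold:
        return False
    for column in zip(*suggestions):          # one column per parameter
        spread = max(column) - min(column)
        scale = max(abs(v) for v in column) or 1.0
        if spread / scale > max_delta:
            return False
    return True
```

A failed `anchors_look_sane` check is a cue to distrust that round's ratings altogether, not just its winner.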

When NOT to use it

A real numeric metric exists and correlates with quality → optimize that directly; don't pay a model per step.
The judgment is subjective-to-the-user (their taste, brand) → show them the variants and let them pick; a model's "best" isn't their best. (This is why the screen-studio spring auto-tune was dropped — the model's pick didn't match the owner's eye.)
One or two variants → just look yourself.

Caveats (learned)

The model's pick is an opinion, not ground truth — anchor it, and sanity-check the winner against the safe default yourself before committing.
Vision/video models perceive gross differences well, fine ones poorly — keep variant spacing perceptible; near-identical variants get noise-rated.
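
One way to keep variant spacing perceptible is to space round-1 values geometrically and drop any suggestion that sits too close to one already kept. This is a sketch under assumptions: both helper names are hypothetical, the parameter is taken to be strictly positive, and the 15% minimum step is a placeholder to tune per parameter, not a measured perceptual threshold.

```python
from typing import List


def geometric_spread(low: float, high: float, count: int) -> List[float]:
    """Wide-but-sparse values from low to high with a constant ratio between steps."""
    if count < 2 or low <= 0 or high <= low:
        raise ValueError("need count >= 2 and 0 < low < high")
    ratio = (high / low) ** (1 / (count - 1))
    return [low * ratio ** i for i in range(count)]


def drop_near_duplicates(values: List[float], min_ratio: float = 1.15) -> List[float]:
    """Keep only values at least min_ratio apart (assumed threshold, positive values only)."""
    kept: List[float] = []
    for v in sorted(values):
        if not kept or v / kept[-1] >= min_ratio:
            kept.append(v)
    return kept
```

For example, `drop_near_duplicates([2.0, 2.05, 2.5, 2.6, 4.0])` keeps 2.0, 2.5 and 4.0, removing the two values the judge would likely rate as noise against their neighbors.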

Limitations

Model ratings are probabilistic aesthetic judgments, not objective truth; keep a human review step for brand-critical or subjective work.
Automated rounds can become expensive or slow when renders are heavy or many variants are explored.
This skill needs screenshots, frames, or clips that expose the quality difference; it is weak for subtle motion, audio, copy nuance, or user-preference calls.
