When to Use
Use whenever "looks/feels right" is the success criterion and there's no cheap numeric metric: animation easing/timing, zoom/camera feel, color grade, layout/spacing, design params, render/encoder settings, prompt params. This is the automated counterpart to lookdev, for when there's no human to sit in the loop.
Source: connerkward/lookdev-auto-skill (MIT).
Visual eval loop — let a vision/video model tune what only an eye can judge
When the target is "does this LOOK/FEEL right" (not a number you can minimize), a
vision model (image) or video-understanding model (motion/timing) can be the judge in
a tight optimization loop. Worked reference: the screenstudio-alternative skill (iteration.py),
which tuned zoom-animation feel via fal-ai/video-understanding.
The loop
- Render N labeled variants into ONE artifact. Vary the parameter(s) across a small spread. Annotate each variant's params ON the artifact (burn the label in: "A · 2.2Hz · ζ0.5"). Images → a labeled grid/contact sheet. Video/motion → a labeled sequence (label card or burned-in overlay before/over each clip) so the model can compare temporally.
- One model call, structured output. Send the single artifact with an explicit rubric (define what "good" means, and what "too much"/"too little" look like). Ask for per-variant ratings + concrete suggested new values as JSON: {"ratings":{"A":n,...},"best_so_far":"X","suggest":[[p1,p2],...]}.
- Coarse → fine. Round 1 = wide spread to locate the region. Round 2 = render the model's suggestions (+ carry the current best) into one artifact; ask it to pick the single best. Usually converges in 2 rounds.
- Stop when sufficient — best rates high and suggestions cluster. Apply the winner.
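The loop above can be sketched in a few functions. This is a minimal sketch, not the code from iteration.py: `render` and `judge` are hypothetical callables you supply (one builds the single labeled artifact, the other makes the one model call and returns its text reply), and the parameter names `freq_hz` and `zeta` are placeholders. It assumes the reply is bare JSON in the shape given above.

```python
import json
from typing import Callable

Variant = dict  # parameter name -> value, e.g. {"freq_hz": 2.2, "zeta": 0.5}
Render = Callable[[dict, dict], str]  # (params by label, captions by label) -> artifact path
Judge = Callable[[str, str], str]     # (artifact path, prompt) -> model reply text

def label_variants(variants: list) -> dict:
    """Assign A, B, C... to the variants in order."""
    return {chr(ord("A") + i): v for i, v in enumerate(variants)}

def burn_in_text(label: str, variant: Variant) -> str:
    """Caption to burn onto the artifact, e.g. 'A · freq_hz=2.2 · zeta=0.5'."""
    return " · ".join([label] + [f"{k}={v:g}" for k, v in variant.items()])

def build_prompt(rubric: str, labeled: dict) -> str:
    """State the rubric up front and ask for the structured reply."""
    names = list(next(iter(labeled.values())))
    return "\n".join([
        f"Rubric: {rubric}",
        f"Variants {', '.join(labeled)} are labeled on the artifact.",
        f"Rate each 1-10 and suggest new values for [{', '.join(names)}].",
        'Return ONLY JSON: {"ratings":{"A":n,...},"best_so_far":"X",'
        '"suggest":[[p1,p2],...]}',
    ])

def run_round(variants: list, rubric: str, render: Render, judge: Judge):
    """One artifact, one model call. Returns (best params, its rating, suggestions)."""
    labeled = label_variants(variants)
    captions = {k: burn_in_text(k, v) for k, v in labeled.items()}
    artifact = render(labeled, captions)
    reply = json.loads(judge(artifact, build_prompt(rubric, labeled)))
    best_label = reply["best_so_far"]
    best = labeled[best_label]
    # Suggestions arrive as bare value lists; map them back onto parameter names.
    suggestions = [dict(zip(best, values)) for values in reply["suggest"]]
    return best, reply["ratings"][best_label], suggestions

def tune(initial: list, rubric: str, render: Render, judge: Judge, rounds: int = 2):
    """Coarse -> fine: each round renders the current best plus the suggestions."""
    variants, best = initial, None
    for _ in range(rounds):
        best, _score, suggestions = run_round(variants, rubric, render, judge)
        variants = [best] + suggestions  # carry the current best forward
    return best
```

A real `render` would draw each caption onto its clip or grid cell before the upload; passing the captions separately keeps the label text and the rendered result in sync.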
Token / quality / step reductions (do these)
- One artifact per round, not one call per variant. The biggest saver — a 6-variant round is 1 upload + 1 inference, not 6. Montage/grid beats a loop of single calls.
- Burn params onto the artifact. The model sees label+result together → no separate "variant A used X" context to carry → fewer tokens, fewer mistakes.
- Structured JSON out + parse. No re-asking, no free-text wrangling. Prompt "return ONLY JSON"; regex the first {...}.
- Short representative sample. Tune on a 3-5s clip / one frame / one component, not the whole asset. Cheaper render, smaller upload, faster inference. Apply the found params to the full render once.
- Cap variants at ~5-6. More doesn't improve the model's discrimination and multiplies render + token cost. Wide-but-sparse round 1, narrow round 2.
- Calibration anchors. Include one deliberately-bad and one safe-default variant as fixed anchors each round — gives the model a reference scale and exposes when its "best" is worse than the safe default (catch a bad recommendation early).
- Independent rubric, stated up front. Define "good" concretely in the prompt (smooth, subtle settle, not bouncy, not sluggish). Don't ask "which do you like" — that lets it echo your framing. A held-out criterion keeps the judge honest (see verify-outputs-rule: the check must be independent of what you tuned).
- Reuse renders across rounds. Carry the round-1 winner's clip into round 2 instead of re-rendering it.
- Early-exit. If the round-1 top rating is ≥9/10 and the suggestions are within a small delta of each other, skip round 2.
- Cheapest judge that can see the failure. Frames passed through an image VLM can judge spatial things (layout, color, crop); reach for a true video model only when the thing being judged is temporal (easing, timing, motion smoothness), since those are invisible in stills.
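Three of the reductions above (parse the JSON, check the anchors, exit early) fit in a few lines. The following is a sketch under stated assumptions: the function names, the 9/10 threshold default, and the 10% relative delta are illustrative choices, not values from the source. For extraction it swaps the regex for `json.JSONDecoder.raw_decode` starting at the first `{`, because the reply contains nested braces that a simple pattern can cut short.

```python
import json

def extract_json(reply: str) -> dict:
    """Parse the first JSON object in a reply, ignoring any text around it."""
    start = reply.find("{")
    if start == -1:
        raise ValueError("no JSON object in reply")
    obj, _end = json.JSONDecoder().raw_decode(reply, start)
    return obj

def anchors_ok(ratings: dict, best: str, bad_anchor: str, safe_anchor: str) -> bool:
    """The deliberately-bad anchor must rate below the safe default,
    and the pick must rate at least as high as the safe default."""
    return ratings[bad_anchor] < ratings[safe_anchor] <= ratings[best]

def can_skip_round_2(ratings: dict, suggestions: list,
                     min_top: float = 9, max_delta: float = 0.1) -> bool:
    """Early exit: top rating is high and suggestions cluster per parameter."""
    if not suggestions or max(ratings.values()) < min_top:
        return False
    for column in zip(*suggestions):  # one column per parameter
        scale = max(abs(v) for v in column) or 1.0
        if (max(column) - min(column)) / scale > max_delta:
            return False
    return True
```

If `anchors_ok` returns False, the round's ratings are suspect and the safer move is to keep the safe default rather than apply the model's pick.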
When NOT to use it
- A real numeric metric exists and correlates with quality → optimize that directly; don't pay a model per step.
- The judgment is subjective-to-the-user (their taste, brand) → show them the variants and let them pick; a model's "best" isn't their best. (This is why the screen-studio spring auto-tune was dropped — the model's pick didn't match the owner's eye.)
- One or two variants → just look yourself.
Caveats (learned)
- The model's pick is an opinion, not ground truth — anchor it, and sanity-check the winner against the safe default yourself before committing.
- Vision/video models perceive gross differences well, fine ones poorly — keep variant spacing perceptible; near-identical variants get noise-rated.
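One way to keep variant spacing perceptible is to generate the spread with a constant ratio between neighbours and flag pairs that sit too close. This is a sketch under assumptions: geometric spacing and the 1.15 minimum ratio are illustrative choices, not values from the source, and the right threshold depends on the parameter being tuned.

```python
def geometric_spread(lo: float, hi: float, n: int) -> list:
    """n values from lo to hi with a constant ratio between neighbours."""
    if n < 2 or lo <= 0 or hi <= lo:
        raise ValueError("need n >= 2 and 0 < lo < hi")
    ratio = (hi / lo) ** (1 / (n - 1))
    return [lo * ratio**i for i in range(n)]

def too_close(values: list, min_ratio: float = 1.15) -> list:
    """Adjacent pairs (after sorting) whose ratio is under min_ratio.
    Assumes positive values; such pairs are likely to be noise-rated."""
    ordered = sorted(values)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b / a < min_ratio]
```

For example, `too_close([2.0, 2.1, 3.0])` flags the pair `(2.0, 2.1)`; dropping one of the two or widening the gap before rendering avoids spending a slot on a variant the judge cannot tell apart.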
Limitations
- Model ratings are probabilistic aesthetic judgments, not objective truth; keep a human review step for brand-critical or subjective work.
- Automated rounds can become expensive or slow when renders are heavy or many variants are explored.
- This skill needs screenshots, frames, or clips that expose the quality difference; it is weak for subtle motion, audio, copy nuance, or user-preference calls.