YouTube Notetaker

When to Use

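Use when the user wants to turn a YouTube talk into study notes: slide snapshots, a timestamped transcript, and editable notes in a local markdown library.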

Source: dair-ai/dair-academy-plugins (MIT).

Build a personal library of YouTube talks you study with. Each video becomes one plain markdown file: slide snapshots at their timestamps, a full timestamped transcript, and editable notes. A small bundled server renders the library as an interactive deep-dive in the browser. No database, no cloud service. Everything is files on disk you fully own.

Architecture (read this first)

The markdown library is the single source of truth. The artifact is a thin HTML shell that fetches from the server and writes notes back. Never hardcode video data into the HTML.

Library: a plain folder, set by VIDEO_LIBRARY_DIR (default ~/video-deepdives/).
- One markdown file per video, filename slug = YouTube id (e.g. RtywqDFBYnQ.md).
- Frontmatter holds video metadata + a slides array.
- Body holds the full transcript as [HH:MM:SS] text lines.
- _media/ holds slide images, namespaced per video as <youtube_id>-slide-NN.jpg to avoid collisions between videos.
Server: scripts/serve.py, a single stdlib + PyYAML file. Start it with:
```
python3 scripts/serve.py --dir ~/video-deepdives --port 8000
```
It serves the artifact at / and a small API the artifact talks to:
- GET /api/video-deepdives (front page fetches this) lists every video.
- GET /api/video-deepdives/<id> returns one video {meta, body}.
- GET /api/video-deepdives/_media/<file> serves a slide image.
- PATCH /api/video-deepdives/<id> with {fields:{slides:[...]}} writes notes back.
- It picks up new videos automatically the moment a markdown file exists. Adding a video means writing a markdown file + media; you almost never touch the HTML.
- The /api/video-deepdives URL namespace is local to the bundled server.
Artifact: reference/artifact.html, served by serve.py at /. A clean reference copy; only rewrite it if the user wants a UI change. For new videos, leave it alone.
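
The read side of the API above can be sketched with the standard library alone. The helper names (`api_url`, `fetch_json`) and the base URL are illustrative assumptions, and the sketch assumes the responses are JSON; only the paths come from the list above.

```python
import json
import urllib.request

API_PREFIX = "/api/video-deepdives"  # URL namespace of the bundled server


def api_url(base, video_id=None):
    """Return the collection URL, or one video's URL when video_id is given."""
    url = base.rstrip("/") + API_PREFIX
    return url if video_id is None else f"{url}/{video_id}"


def fetch_json(url, timeout=5.0):
    """GET a URL and decode a JSON response (needs serve.py to be running)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With the server running, `fetch_json(api_url("http://127.0.0.1:8000"))` would fetch the list of videos, and passing a video id to `api_url` would target a single item.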

Requirements

yt-dlp and ffmpeg on PATH (download + frame/scene extraction).

Python 3 with Pillow (contact sheet) and PyYAML (markdown file + server).

pip install yt-dlp pillow pyyaml      # ffmpeg via your package manager
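
A quick preflight check can catch a missing dependency before the pipeline starts. This is a minimal sketch, not one of the bundled scripts; `missing_requirements` is a hypothetical helper name.

```python
import importlib.util
import shutil


def missing_requirements():
    """Return the names of required tools and modules that cannot be found."""
    missing = [tool for tool in ("yt-dlp", "ffmpeg") if shutil.which(tool) is None]
    # Import names differ from pip names: pillow -> PIL, pyyaml -> yaml.
    for module in ("PIL", "yaml"):
        if importlib.util.find_spec(module) is None:
            missing.append(module)
    return missing
```

An empty list means everything named in this section was found.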

Adding a video: the pipeline

All helper scripts are in scripts/. Work in a scratch dir (e.g. /tmp/ytnote-<id>/), then copy final assets into the library. Set VIDEO_LIBRARY_DIR once per shell if you don't want the default. Do not use em dashes (—) or arrows (→) in notes/titles.

1. Resolve the id and check embeddability

scripts/setup.sh "<youtube_url_or_id>"

Prints the 11-char YTID, the scratch dir, the target library path, and whether YouTube embedding is allowed (oembed 200) or blocked (oembed 401, e.g. some university talks). If blocked, inline playback won't work but the artifact degrades gracefully to an "open at this moment on YouTube" link, so proceed normally.
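
For illustration, id resolution and the embeddability probe might look like the sketch below. It is not the contents of setup.sh: `extract_ytid` and `oembed_url` are hypothetical helpers, and the set of URL shapes handled (watch, youtu.be, embed, shorts, live) is an assumption.

```python
import re
import urllib.parse

YTID_RE = re.compile(r"^[A-Za-z0-9_-]{11}$")


def extract_ytid(url_or_id):
    """Return the 11-char id from a bare id or a common YouTube URL, else None."""
    text = url_or_id.strip()
    if YTID_RE.match(text):
        return text
    parsed = urllib.parse.urlparse(text)
    host = (parsed.hostname or "").lower()
    candidate = None
    if host == "youtu.be":
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host == "youtube.com" or host.endswith(".youtube.com"):
        if parsed.path == "/watch":
            candidate = urllib.parse.parse_qs(parsed.query).get("v", [None])[0]
        else:
            # Paths such as /embed/<id>, /shorts/<id>, /live/<id>
            parts = [p for p in parsed.path.split("/") if p]
            if len(parts) == 2 and parts[0] in ("embed", "shorts", "live"):
                candidate = parts[1]
    return candidate if candidate and YTID_RE.match(candidate) else None


def oembed_url(ytid):
    """URL of YouTube's oEmbed endpoint; request it and inspect the HTTP status."""
    watch = f"https://www.youtube.com/watch?v={ytid}"
    query = urllib.parse.urlencode({"url": watch, "format": "json"})
    return "https://www.youtube.com/oembed?" + query
```

Requesting the URL from `oembed_url` and reading the status code gives the 200 versus 401 distinction described above.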

2. Download video + subtitles

scripts/download.sh "<YTID>" /tmp/ytnote-<YTID>

Uses yt-dlp to grab the video (≤720p is plenty for slide frames) and the best available subtitles (manual if present, else auto-captions) as .vtt. Also fetches title/uploader.
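
As a rough guide to what such a download involves, the sketch below builds a yt-dlp argument list. It is an assumption about a reasonable invocation, not a copy of download.sh; the format selector, the `en` subtitle language, and the `video.%(ext)s` output template are illustrative choices.

```python
def ytdlp_command(ytid, out_dir, lang="en"):
    """Build a yt-dlp argv for a <=720p download plus .vtt subtitles."""
    url = f"https://www.youtube.com/watch?v={ytid}"
    return [
        "yt-dlp",
        "-f", "bestvideo[height<=720]+bestaudio/best[height<=720]",
        "--merge-output-format", "mp4",
        "--write-subs",       # manual subtitles when the video has them
        "--write-auto-subs",  # auto-captions as the fallback
        "--sub-langs", lang,
        "--sub-format", "vtt",
        "-o", f"{out_dir}/video.%(ext)s",
        url,
    ]
```

Passing the list to `subprocess.run(cmd, check=True)` would run the download without any shell quoting concerns.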

3. Detect candidate slide timestamps

scripts/detect_slides.sh /tmp/ytnote-<YTID>/video.mp4 /tmp/ytnote-<YTID>

Runs ffmpeg scene detection (select='gt(scene,0.3)') and writes scene_times.txt (seconds). 0.3 is a good default; lower it (0.2) for subtle slide decks, raise it (0.4) for busy video.
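
One common way to turn that scene filter into timestamps is to chain ffmpeg's `showinfo` filter and read `pts_time` from its log. The sketch below assumes that approach; detect_slides.sh may work differently, and both helper names are hypothetical.

```python
import re

PTS_RE = re.compile(r"pts_time:\s*([0-9]+(?:\.[0-9]+)?)")


def scene_command(video_path, threshold=0.3):
    """ffmpeg argv that logs one showinfo line per frame passing the scene filter."""
    vf = f"select='gt(scene,{threshold})',showinfo"
    return ["ffmpeg", "-hide_banner", "-i", video_path, "-vf", vf, "-an", "-f", "null", "-"]


def parse_scene_times(ffmpeg_log):
    """Pull pts_time values (seconds) out of showinfo log text, in order."""
    return [float(m.group(1)) for m in PTS_RE.finditer(ffmpeg_log)]
```

ffmpeg writes the showinfo lines to stderr, so a caller would capture stderr and hand that text to `parse_scene_times`.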

4. Build a contact sheet and CURATE

python3 scripts/contact_sheet.py /tmp/ytnote-<YTID>/video.mp4 /tmp/ytnote-<YTID>/scene_times.txt /tmp/ytnote-<YTID>/contact.jpg

Read contact.jpg (labeled with index + timestamp). This is the human-judgment step: keep frames that are real content slides; drop talking-head shots, transitions, duplicates, and blurry mid-animation frames. Save the kept timestamps (seconds) to /tmp/ytnote-<YTID>/keep.txt, one per line. A typical talk yields 15-25 slides.
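
Curation itself is a visual judgment, but the bookkeeping around it can be scripted. The sketch below is a hypothetical helper, not part of scripts/: it sorts the timestamps you chose, drops any that fall within a minimum gap of the previous one (2 seconds here, an assumed value that guards against keeping the same slide twice), and renders the text for keep.txt.

```python
def keep_file_text(timestamps, min_gap=2.0):
    """Return keep.txt content: sorted seconds, one per line, near-duplicates dropped."""
    kept = []
    for t in sorted(timestamps):
        # Keep a timestamp only if it is far enough from the last one kept.
        if not kept or t - kept[-1] >= min_gap:
            kept.append(t)
    return "".join(f"{t:.3f}\n" for t in kept)
```

Writing the returned string to /tmp/ytnote-<YTID>/keep.txt gives one timestamp in seconds per line.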

5. Extract the curated slides at full quality and install to _media

python3 scripts/extract_slides.py <YTID> /tmp/ytnote-<YTID>/video.mp4 /tmp/ytnote-<YTID>/keep.txt > /tmp/ytnote-<YTID>/slides.json

Extracts each kept timestamp at 1280px wide, JPEG, and copies them into $VIDEO_LIBRARY_DIR/_media/ as <YTID>-slide-01.jpg, -02.jpg, … (numbered in time order). Progress goes to stderr; a clean slides.json scaffold prints to stdout, so redirect it to a file as shown, then fill in title and note.

Tip: talks are often a slide + speaker-cam composite, and speakers flip back and forth, so the same slide appears at several timestamps. Keep the cleanest instance of each, and re-anchor each slide's t to where it is actually discussed in the transcript (better "play from here" UX).
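
To make the scaffold and the re-anchoring tip concrete, here is a minimal sketch of the entries involved. The field names follow the frontmatter reference in this document; the helper names are hypothetical, and the label format for talks longer than an hour (minutes simply exceed 59) is an assumption.

```python
def mmss(t):
    """Display label for a time in seconds, truncated to whole seconds."""
    total = int(t)
    return f"{total // 60:02d}:{total % 60:02d}"


def slide_entry(ytid, idx, t):
    """One scaffold entry; title and note are left empty for the author to fill in."""
    return {
        "idx": idx,
        "t": t,
        "mmss": mmss(t),
        "title": "",
        "note": "",
        "img": f"/api/video-deepdives/_media/{ytid}-slide-{idx:02d}.jpg",
    }


def scaffold(ytid, keep_times):
    """Number slides in time order, matching the -01, -02, ... image names."""
    return [slide_entry(ytid, i, t) for i, t in enumerate(sorted(keep_times), start=1)]


def reanchor(slide, new_t):
    """Return a copy whose t and label point at where the slide is discussed."""
    return {**slide, "t": new_t, "mmss": mmss(new_t)}
```

Re-anchoring changes only `t` and `mmss`; the image path still points at the cleanest captured frame.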

6. Build the transcript

python3 scripts/vtt_to_transcript.py /tmp/ytnote-<YTID>/*.vtt /tmp/ytnote-<YTID>/transcript.txt

Parses the VTT into clean, de-duplicated [HH:MM:SS] text lines (YouTube auto-captions repeat rolling text; the script collapses it). This becomes the markdown body.
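
The collapsing of rolling captions can be sketched as follows. This is a simplified illustration, not the logic of vtt_to_transcript.py: it strips inline tags and skips any caption line identical to the last one emitted, which handles the usual pattern where each cue repeats the previous line.

```python
import html
import re

CUE_RE = re.compile(r"^(?:(\d{2,}):)?(\d{2}):(\d{2})\.\d{3}\s+-->")
TAG_RE = re.compile(r"<[^>]+>")  # inline timing tags and <c> styling tags


def vtt_to_lines(vtt_text):
    """Convert WebVTT text into de-duplicated '[HH:MM:SS] text' lines."""
    out, last_text, stamp = [], None, None
    for raw in vtt_text.splitlines():
        if raw == "":
            stamp = None  # an empty line ends the current cue
            continue
        line = raw.strip()
        match = CUE_RE.match(line)
        if match:
            hours, minutes, seconds = match.groups()
            stamp = f"[{hours or '00'}:{minutes}:{seconds}]"
            continue
        if stamp is None:
            continue  # header, NOTE block, or cue identifier
        text = html.unescape(TAG_RE.sub("", line)).strip()
        if text and text != last_text:
            out.append(f"{stamp} {text}")
            last_text = text
    return out
```

Each output line carries the start time of the cue in which its text first appeared.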

7. Write notes and assemble the markdown file

For each kept slide, write a 1-3 sentence note grounded in the transcript around that timestamp (don't invent claims). Then assemble:

python3 scripts/write_library_item.py \
  --id <YTID> \
  --title "Talk title" \
  --speaker "Name, Role, Org" \
  --tags tag1,tag2,tag3 \
  --slides /tmp/ytnote-<YTID>/slides.json \
  --transcript /tmp/ytnote-<YTID>/transcript.txt

Writes $VIDEO_LIBRARY_DIR/<YTID>.md with correct frontmatter + body.

8. Serve and verify (always do this)

python3 scripts/serve.py --dir "$VIDEO_LIBRARY_DIR" --port 8000 &
scripts/verify.sh <YTID>                 # defaults to http://127.0.0.1:8000

verify.sh curls the collection list, the item, the first slide image, and the artifact, asserting HTTP 200 and that the new id appears in the index. Then open http://127.0.0.1:8000/#/<YTID> in a browser to confirm slides + transcript + notes render.
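
The same checks can be expressed in Python when curl is not handy. This sketch is not verify.sh itself: the helper names are hypothetical, and it assumes the first slide image is named <YTID>-slide-01.jpg.

```python
import urllib.error
import urllib.request


def verification_urls(base, ytid):
    """The four URLs worth checking after adding a video."""
    base = base.rstrip("/")
    api = f"{base}/api/video-deepdives"
    return {
        "collection": api,
        "item": f"{api}/{ytid}",
        "first_slide": f"{api}/_media/{ytid}-slide-01.jpg",
        "artifact": f"{base}/",
    }


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET, including error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
```

With the server running, every value from `verification_urls` should give 200 from `http_status`.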

Markdown file shape (reference)

---
id: RtywqDFBYnQ
title: Memory and dreaming for self-learning agents
youtube_id: RtywqDFBYnQ
speaker: Mahesh, Product Manager, Platform team at Anthropic
source_url: https://www.youtube.com/watch?v=RtywqDFBYnQ
slide_count: 19
created: '2026-05-25'
tags: [anthropic, memory, agents]
slides:
- idx: 1
  t: 55.7                 # seconds (float ok), used for seeking
  mmss: 00:55             # display label
  title: Agent primitives have evolved
  note: One to three sentences grounded in the transcript at this timestamp.
  img: /api/video-deepdives/_media/RtywqDFBYnQ-slide-01.jpg
# ... more slides
---
## Transcript
[00:00:08] Hello, everyone...
[00:00:11] ...

Notes:

idx can be sparse/non-contiguous; the artifact sorts slides by t, so ordering is by timestamp, not idx.
img is always a /api/video-deepdives/_media/<file> URL (served by serve.py), never base64.
Slide note is what the user edits in the UI; PATCH writes the whole slides array back.
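
A note edit and its write-back can be sketched as below. The payload shape {fields:{slides:[...]}} and the PATCH path come from this document; the helper names and the Content-Type header are assumptions.

```python
import json
import urllib.request


def with_note(slides, idx, note):
    """Return a new slides list where the slide with this idx has the new note."""
    return [{**s, "note": note} if s["idx"] == idx else dict(s) for s in slides]


def patch_request(base, ytid, slides):
    """Build (but do not send) the PATCH that writes the whole slides array back."""
    body = json.dumps({"fields": {"slides": slides}}).encode("utf-8")
    url = base.rstrip("/") + f"/api/video-deepdives/{ytid}"
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


def display_order(slides):
    """Order slides the way the artifact does: by timestamp, not by idx."""
    return sorted(slides, key=lambda s: s["t"])
```

Sending the request with `urllib.request.urlopen(req)` would replace the stored slides array, so the list passed in should always be the complete set.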

Gotchas

Embedding disabled (oembed 401): inline player is blocked by the video owner. Not a bug; the artifact shows an "open at this moment on YouTube" link instead. Mention it to the user.
Image collisions: always namespace media <YTID>-slide-NN.jpg. Never reuse bare slide-NN.jpg for a new video.
Auto-caption noise: rolling YouTube captions duplicate text across cues; use the provided VTT parser, don't dump raw VTT into the body.
Don't touch existing videos when adding a new one. Each video is an independent file.
Server not picking up a video: confirm the .md file is directly inside --dir (not a subfolder) and the filename is <YTID>.md.
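
The last gotcha lends itself to a small diagnostic. The following is a hypothetical helper, not part of the bundled scripts, that reports why a video file might not be visible to the server.

```python
import re
from pathlib import Path

YTID_RE = re.compile(r"^[A-Za-z0-9_-]{11}$")


def library_problems(library_dir, ytid):
    """Return human-readable reasons the server might not list this video."""
    root = Path(library_dir).expanduser()
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not YTID_RE.match(ytid):
        problems.append(f"{ytid!r} is not an 11-char YouTube id")
    if not (root / f"{ytid}.md").is_file():
        # Look for the file one or more levels down, which the server would miss.
        nested = [p for p in root.rglob(f"{ytid}.md") if p.parent != root]
        if nested:
            problems.append(f"{ytid}.md is in a subfolder: {nested[0].parent}")
        else:
            problems.append(f"{ytid}.md not found in {root}")
    return problems
```

An empty list means the file is directly inside the library folder under the expected name.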

What makes this portable

No orchestrator / no database. Storage is a plain folder of markdown + images.
One env var (VIDEO_LIBRARY_DIR) controls where the library lives.
One small server file (serve.py, stdlib + PyYAML) renders everything and handles note write-back. Drop it anywhere Python runs.
The markdown files are portable: readable in Obsidian or any editor, and the frontmatter is standard YAML.
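
As a small demonstration of that portability, a library file can be taken apart with nothing but the standard library. This sketch assumes the file shape shown in the reference section; the helper names are hypothetical, and the frontmatter is returned as raw YAML text for any YAML parser to load.

```python
import re

LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s*(.*)$")


def split_frontmatter(markdown_text):
    """Return (frontmatter_yaml, body) for a file that starts with a '---' block."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", markdown_text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", markdown_text  # no closing delimiter: treat everything as body


def transcript_entries(body):
    """Return (seconds, text) for each '[HH:MM:SS] text' line in the body."""
    entries = []
    for line in body.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            hours, minutes, seconds, text = match.groups()
            entries.append((int(hours) * 3600 + int(minutes) * 60 + int(seconds), text))
    return entries
```

Lines that do not start with a timestamp, such as the `## Transcript` heading, are skipped.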

Limitations

Requires a local setup: yt-dlp and ffmpeg on PATH, plus Python 3 with Pillow and PyYAML.
Does not authorize destructive, production, paid, or external-message actions without explicit user approval.
Validate generated artifacts or recommendations against the user's real sources before treating them as final.
