YouTube Notetaker
When to Use
Use when the user asks to take notes on a YouTube talk or to add a video to their local video library.
Source: dair-ai/dair-academy-plugins (MIT).
Build a personal library of YouTube talks you study with. Each video becomes one plain markdown file: slide snapshots at their timestamps, a full timestamped transcript, and editable notes. A small bundled server renders the library as an interactive deep-dive in the browser. No database, no cloud service. Everything is files on disk you fully own.
Architecture (read this first)
The markdown library is the single source of truth. The artifact is a thin HTML shell that fetches from the server and writes notes back. Never hardcode video data into the HTML.
- Library: a plain folder, set by VIDEO_LIBRARY_DIR (default ~/video-deepdives/).
  - One markdown file per video, filename slug = YouTube id (e.g. RtywqDFBYnQ.md).
  - Frontmatter holds video metadata + a slides array.
  - Body holds the full transcript as [HH:MM:SS] text lines.
  - _media/ holds slide images, namespaced per video as <youtube_id>-slide-NN.jpg to avoid collisions between videos.
- Server: scripts/serve.py, a single stdlib + PyYAML file. Start it with:
    python3 scripts/serve.py --dir ~/video-deepdives --port 8000
  It serves the artifact at / and a small API the artifact talks to:
  - GET /api/video-deepdives (front page fetches this) lists every video.
  - GET /api/video-deepdives/<id> returns one video {meta, body}.
  - GET /api/video-deepdives/_media/<file> serves a slide image.
  - PATCH /api/video-deepdives/<id> with {fields:{slides:[...]}} writes notes back.
  - It picks up new videos automatically the moment a markdown file exists. Adding a video means writing a markdown file + media; you almost never touch the HTML.
  - The /api/video-deepdives URL namespace is local to the bundled server.
- Artifact: reference/artifact.html, served by serve.py at /. A clean reference copy; only rewrite it if the user wants a UI change. For new videos, leave it alone.
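As a rough illustration of the write-back contract described above, the sketch below builds the PATCH request the artifact would send and shows how a single video could be fetched. It is not taken from the bundled code: the helper names, the Content-Type header, and the base URL are assumptions, and only the routes and the {fields:{slides:[...]}} body shape come from this document.

```python
import json
import urllib.request

BASE = "http://127.0.0.1:8000"  # assumption: serve.py is running locally on port 8000


def build_patch_request(video_id, slides, base=BASE):
    """Build (but do not send) the PATCH request that writes the slides array back."""
    payload = json.dumps({"fields": {"slides": slides}}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/api/video-deepdives/{video_id}",
        data=payload,
        method="PATCH",
        # Assumption: the server accepts a JSON body; the header is illustrative.
        headers={"Content-Type": "application/json"},
    )


def fetch_video(video_id, base=BASE):
    """GET one video; this document describes the response as {meta, body}."""
    with urllib.request.urlopen(f"{base}/api/video-deepdives/{video_id}") as resp:
        return json.load(resp)


req = build_patch_request("RtywqDFBYnQ", [{"idx": 1, "t": 55.7, "note": "Edited note."}])
print(req.get_method(), req.full_url)
```

Because PATCH writes the whole slides array, a client would normally fetch the video first, edit one note in the returned slides, and send the full list back.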
Requirements
- yt-dlp and ffmpeg on PATH (download + frame/scene extraction).
- Python 3 with Pillow (contact sheet) and PyYAML (markdown file + server).

pip install yt-dlp pillow pyyaml  # ffmpeg via your package manager
Adding a video — the pipeline
All helper scripts are in scripts/. Work in a scratch dir (e.g. /tmp/ytnote-<id>/), then
copy final assets into the library. Set VIDEO_LIBRARY_DIR once per shell if you don't want the
default. Do not use em dashes (—) or arrows (→) in notes/titles.
1. Resolve the id and check embeddability
scripts/setup.sh "<youtube_url_or_id>"
Prints the 11-char YTID, the scratch dir, the target library path, and whether YouTube
embedding is allowed (oembed 200) or blocked (oembed 401, e.g. some university talks).
If blocked, inline playback won't work but the artifact degrades gracefully to an "open at this
moment on YouTube" link, so proceed normally.
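For readers curious what the id resolution might involve, here is a minimal sketch in Python. It is not the contents of setup.sh: the function names are hypothetical and the set of URL shapes handled is an assumption. The oEmbed endpoint is YouTube's public one, and the 200 vs 401 interpretation is the one stated above.

```python
import re
from urllib.parse import parse_qs, quote, urlparse

YTID_RE = re.compile(r"^[A-Za-z0-9_-]{11}$")


def extract_ytid(url_or_id):
    """Return the 11-character video id from a bare id or a common YouTube URL."""
    candidate = url_or_id.strip()
    if YTID_RE.match(candidate):
        return candidate
    parsed = urlparse(candidate)
    query = parse_qs(parsed.query)
    if parsed.netloc.lower().endswith("youtu.be"):
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif "v" in query:
        candidate = query["v"][0]
    else:
        # Assumption: paths such as /embed/<id> or /shorts/<id> end with the id.
        parts = [p for p in parsed.path.split("/") if p]
        candidate = parts[-1] if parts else ""
    if not YTID_RE.match(candidate):
        raise ValueError(f"could not find a YouTube id in {url_or_id!r}")
    return candidate


def oembed_url(ytid):
    """URL to probe: HTTP 200 means embedding is allowed, 401 means blocked."""
    watch = f"https://www.youtube.com/watch?v={ytid}"
    return "https://www.youtube.com/oembed?format=json&url=" + quote(watch, safe="")


print(extract_ytid("https://youtu.be/RtywqDFBYnQ?si=abc"))
print(oembed_url("RtywqDFBYnQ"))
```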
2. Download video + subtitles
scripts/download.sh "<YTID>" /tmp/ytnote-<YTID>
Uses yt-dlp to grab the video (≤720p is plenty for slide frames) and the best available
subtitles (manual if present, else auto-captions) as .vtt. Also fetches title/uploader.
3. Detect candidate slide timestamps
scripts/detect_slides.sh /tmp/ytnote-<YTID>/video.mp4 /tmp/ytnote-<YTID>
Runs ffmpeg scene detection (select='gt(scene,0.3)') and writes scene_times.txt (seconds).
0.3 is a good default; lower it (0.2) for subtle slide decks, raise it (0.4) for busy video.
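The scene-detection step can be sketched roughly as follows. This is an illustration rather than the contents of detect_slides.sh: it assumes the script pairs the select filter with ffmpeg's showinfo filter and reads pts_time values from stderr, which is one common way to recover scene-change timestamps. The function names are hypothetical.

```python
import re
import subprocess

PTS_RE = re.compile(r"pts_time:\s*([0-9]+(?:\.[0-9]+)?)")


def scene_command(video_path, threshold=0.3):
    """ffmpeg argv: keep frames whose scene score exceeds threshold, log them via showinfo."""
    return [
        "ffmpeg", "-hide_banner", "-i", video_path,
        "-vf", f"select='gt(scene,{threshold})',showinfo",
        "-an", "-f", "null", "-",
    ]


def parse_scene_times(ffmpeg_stderr):
    """Pull the pts_time of every frame that showinfo logged, in seconds."""
    return [float(m.group(1)) for m in PTS_RE.finditer(ffmpeg_stderr)]


def detect(video_path, threshold=0.3):
    """Run ffmpeg and return candidate slide timestamps (requires ffmpeg on PATH)."""
    proc = subprocess.run(scene_command(video_path, threshold),
                          capture_output=True, text=True)
    return parse_scene_times(proc.stderr)


sample = "[Parsed_showinfo_1 @ 0x1] n:   0 pts:   5015 pts_time:55.72 duration: 1\n"
print(parse_scene_times(sample))
```

Lowering the threshold makes the select expression pass more frames, which is why 0.2 helps on subtle decks and 0.4 trims noise on busy video.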
4. Build a contact sheet and CURATE
python3 scripts/contact_sheet.py /tmp/ytnote-<YTID>/video.mp4 /tmp/ytnote-<YTID>/scene_times.txt /tmp/ytnote-<YTID>/contact.jpg
Read contact.jpg (labeled with index + timestamp). This is the human-judgment step: keep
frames that are real content slides; drop talking-head shots, transitions, duplicates, and
blurry mid-animation frames. Save the kept timestamps (seconds) to /tmp/ytnote-<YTID>/keep.txt,
one per line. A typical talk yields 15-25 slides.
5. Extract the curated slides at full quality and install to _media
python3 scripts/extract_slides.py <YTID> /tmp/ytnote-<YTID>/video.mp4 /tmp/ytnote-<YTID>/keep.txt > /tmp/ytnote-<YTID>/slides.json
Extracts each kept timestamp at 1280px wide, JPEG, and copies them into
$VIDEO_LIBRARY_DIR/_media/ as <YTID>-slide-01.jpg, -02.jpg, … (numbered in time order).
Progress goes to stderr; a clean slides.json scaffold prints to stdout, so redirect it to a
file as shown, then fill in title and note.
Tip: talks are often a slide + speaker-cam composite, and speakers flip back and forth, so the
same slide appears at several timestamps. Keep the cleanest instance of each, and re-anchor each
slide's t to where it is actually discussed in the transcript (better "play from here" UX).
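The naming and scaffold conventions from this step can be illustrated with a short sketch. It is not extract_slides.py itself: the function names are hypothetical, the empty title and note placeholders are an assumption about what the scaffold contains, and the mmss label here simply lets minutes run past 59 because the document does not say how hour-long talks are labeled.

```python
import json


def mmss(seconds):
    """Display label such as 00:55 (assumption: minutes are not wrapped at 60)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def slide_scaffold(ytid, keep_times):
    """Number slides in time order and namespace each image with the video id."""
    slides = []
    for idx, t in enumerate(sorted(keep_times), start=1):
        name = f"{ytid}-slide-{idx:02d}.jpg"
        slides.append({
            "idx": idx,
            "t": t,
            "mmss": mmss(t),
            "title": "",  # to be filled in by hand
            "note": "",   # to be filled in by hand
            "img": f"/api/video-deepdives/_media/{name}",
        })
    return slides


print(json.dumps(slide_scaffold("RtywqDFBYnQ", [130.0, 55.7]), indent=2))
```

Re-anchoring a slide later only means editing its t (and mmss); the image filename stays the same.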
6. Build the transcript
python3 scripts/vtt_to_transcript.py /tmp/ytnote-<YTID>/*.vtt /tmp/ytnote-<YTID>/transcript.txt
Parses the VTT into clean, de-duplicated [HH:MM:SS] text lines (YouTube auto-captions repeat
rolling text; the script collapses it). This becomes the markdown body.
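To make the rolling-caption problem concrete, here is a simplified sketch of one way to collapse it. It is not vtt_to_transcript.py: it assumes cue timings use the HH:MM:SS.mmm form and that cues carry no identifier lines, strips inline tags, and drops any text line identical to the last one emitted.

```python
import re

CUE_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.\d{3}\s+-->")
TAG_RE = re.compile(r"<[^>]+>")


def vtt_to_lines(vtt_text):
    """Return de-duplicated '[HH:MM:SS] text' lines from a VTT string."""
    out = []
    stamp = None
    last_text = None
    for raw in vtt_text.splitlines():
        match = CUE_RE.match(raw)
        if match:
            stamp = "[{}:{}:{}]".format(*match.groups())
            continue
        if stamp is None:
            continue  # header lines before the first cue (WEBVTT, Kind:, Language:)
        text = TAG_RE.sub("", raw).strip()
        if not text or text == last_text:
            continue  # blank line, or rolling repeat of the previous caption line
        out.append(f"{stamp} {text}")
        last_text = text
    return out


sample = "\n".join([
    "WEBVTT",
    "",
    "00:00:08.000 --> 00:00:10.000 align:start position:0%",
    "Hello<00:00:08.400><c> everyone</c>",
    "",
    "00:00:10.000 --> 00:00:12.000 align:start position:0%",
    "Hello everyone",
    "welcome<00:00:10.500><c> to the talk</c>",
])
print("\n".join(vtt_to_lines(sample)))
```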
7. Write notes and assemble the markdown file
For each kept slide, write a 1-3 sentence note grounded in the transcript around that timestamp
(don't invent claims). Then assemble:
python3 scripts/write_library_item.py \
  --id <YTID> \
  --title "Talk title" \
  --speaker "Name, Role, Org" \
  --tags tag1,tag2,tag3 \
  --slides /tmp/ytnote-<YTID>/slides.json \
  --transcript /tmp/ytnote-<YTID>/transcript.txt
Writes $VIDEO_LIBRARY_DIR/<YTID>.md with correct frontmatter + body.
8. Serve and verify (always do this)
python3 scripts/serve.py --dir "$VIDEO_LIBRARY_DIR" --port 8000 &
scripts/verify.sh <YTID> # defaults to http://127.0.0.1:8000
verify.sh curls the collection list, the item, the first slide image, and the artifact,
asserting HTTP 200 and that the new id appears in the index. Then open
http://127.0.0.1:8000/#/<YTID> in a browser to confirm slides + transcript + notes render.
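The checks that verify.sh is described as making could be approximated in Python as below. This is a sketch under assumptions, not a port of the script: the function names are hypothetical, and because the index response shape is not documented here, the id check is a plain substring match on the raw response text.

```python
import urllib.request

DEFAULT_BASE = "http://127.0.0.1:8000"


def verify_targets(ytid, base=DEFAULT_BASE):
    """The four URLs to probe: collection list, item, first slide image, artifact."""
    return [
        f"{base}/api/video-deepdives",
        f"{base}/api/video-deepdives/{ytid}",
        f"{base}/api/video-deepdives/_media/{ytid}-slide-01.jpg",
        f"{base}/",
    ]


def verify(ytid, base=DEFAULT_BASE):
    """Raise if any target is not HTTP 200 or the id is missing from the index."""
    index_url = verify_targets(ytid, base)[0]
    for url in verify_targets(ytid, base):
        # urlopen raises HTTPError for 4xx/5xx, so reaching the body means success.
        with urllib.request.urlopen(url) as resp:
            if resp.status != 200:
                raise RuntimeError(f"{url} returned {resp.status}")
            if url == index_url and ytid not in resp.read().decode("utf-8"):
                raise RuntimeError(f"{ytid} not found in the index")


for target in verify_targets("RtywqDFBYnQ"):
    print(target)
```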
Markdown file shape (reference)
---
id: RtywqDFBYnQ
title: Memory and dreaming for self-learning agents
youtube_id: RtywqDFBYnQ
speaker: Mahesh, Product Manager, Platform team at Anthropic
source_url: https://www.youtube.com/watch?v=RtywqDFBYnQ
slide_count: 19
created: '2026-05-25'
tags: [anthropic, memory, agents]
slides:
  - idx: 1
    t: 55.7          # seconds (float ok), used for seeking
    mmss: 00:55      # display label
    title: Agent primitives have evolved
    note: One to three sentences grounded in the transcript at this timestamp.
    img: /api/video-deepdives/_media/RtywqDFBYnQ-slide-01.jpg
  # ... more slides
---
## Transcript
[00:00:08] Hello, everyone...
[00:00:11] ...
Notes:
- idx can be sparse/non-contiguous; the artifact sorts slides by t, so ordering is by timestamp, not idx.
- img is always a /api/video-deepdives/_media/<file> URL (served by serve.py), never base64.
- Slide note is what the user edits in the UI; PATCH writes the whole slides array back.
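A small sketch of how a reader of these files might separate frontmatter from transcript and order slides the way the notes above describe. The helper names are hypothetical and this is not the server's code; the returned frontmatter text would still need a YAML parser such as PyYAML's yaml.safe_load to become a dict.

```python
def split_frontmatter(text):
    """Return (yaml_text, body) for a file that opens with a --- fenced block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("file does not start with a frontmatter fence")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    raise ValueError("frontmatter fence is never closed")


def ordered_slides(slides):
    """Order by timestamp t, so sparse or out-of-order idx values do not matter."""
    return sorted(slides, key=lambda slide: slide["t"])


doc = "---\nid: RtywqDFBYnQ\n---\n## Transcript\n[00:00:08] Hello, everyone..."
front, body = split_frontmatter(doc)
print(front)
print(body)
print(ordered_slides([{"idx": 7, "t": 90.0}, {"idx": 12, "t": 55.7}]))
```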
Gotchas
- Embedding disabled (oembed 401): inline player is blocked by the video owner. Not a bug; the artifact shows an "open at this moment on YouTube" link instead. Mention it to the user.
- Image collisions: always namespace media <YTID>-slide-NN.jpg. Never reuse bare slide-NN.jpg for a new video.
- Auto-caption noise: rolling YouTube captions duplicate text across cues; use the provided VTT parser, don't dump raw VTT into the body.
- Don't touch existing videos when adding a new one. Each video is an independent file.
- Server not picking up a video: confirm the .md file is directly inside --dir (not a subfolder) and the filename is <YTID>.md.
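The last gotcha can be checked mechanically. The sketch below is a hypothetical helper, not part of the bundled scripts; it assumes a YouTube id is 11 characters drawn from letters, digits, underscore, and hyphen, and reports markdown files that sit in a subfolder or are not named <YTID>.md.

```python
import re
from pathlib import Path

YTID_RE = re.compile(r"^[A-Za-z0-9_-]{11}$")


def find_misplaced(library_dir):
    """List (path, reason) pairs for .md files the server would likely not pick up."""
    root = Path(library_dir).expanduser()
    problems = []
    for path in sorted(root.rglob("*.md")):
        if path.parent != root:
            problems.append((path, "inside a subfolder, not directly in --dir"))
        elif not YTID_RE.match(path.stem):
            problems.append((path, "filename is not <YTID>.md"))
    return problems


for path, reason in find_misplaced("~/video-deepdives"):
    print(f"{path}: {reason}")
```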
What makes this portable
- No orchestrator / no database. Storage is a plain folder of markdown + images.
- One env var (VIDEO_LIBRARY_DIR) controls where the library lives.
- One small server file (serve.py, stdlib + PyYAML) renders everything and handles note write-back. Drop it anywhere Python runs.
- The markdown files are portable: readable in Obsidian or any editor, and the frontmatter is standard YAML.
Limitations
- Requires the upstream tool, account, API key, or local setup when the workflow names one.
- Does not authorize destructive, production, paid, or external-message actions without explicit user approval.
- Validate generated artifacts or recommendations against the user's real sources before treating them as final.