Most “turn your footage into a short” tools are cloud SaaS: you upload your raw clips, a server transcribes and renders them, and you pay per minute. I wanted the opposite: a pipeline that lives entirely on my own machine, uses only open-source tooling, captures only what I tell it to, and never sends a frame anywhere. This is that pipeline.

It takes raw recorded footage and produces a finished vertical short-form video framed correctly for TikTok or Instagram Reels, with word-level captions burned in and platform-aware layout. It runs on macOS / Apple Silicon with no cloud and no paid APIs, and is designed to be driven from a single CLI over an input folder.

[capture] -> [ingest+align] -> [transcribe] -> [render] -> output.mp4
   |              |                  |              |
 3 files     synced trio        words.json    final vertical

The same synced footage can be re-rendered for a different platform without re-capturing or re-transcribing.

Two input shapes

The pipeline supports two ways of producing a video, sharing everything downstream:

  • screen + camera (two-zone): a captured application window fills the top zone, a talking-head camera crop-fills the zone below, and captions sit on the seam between them, with optional punch-in zooms on the screen content.
  • headshot-only (full-bleed): a single talking-head clip (e.g. an iPhone selfie) fills the whole canvas; at key beats the entire frame cuts to a full-frame branded scene while the audio and captions keep running.

On top of either layout sits a garnish layer: word-pop captions, badge overlays, and full-frame inserts (title / number / brand-logo scenes), all authored from the transcript.

Capture: one native tool, one clock

The capture stage was the hard part, and the design went through a reversal worth documenting.

The deliverable should only ever contain camera audio. The talking-head mic is the single source of truth for sound, and system audio is never captured. That constraint is what breaks the obvious architecture.

The original plan was two parallel processes: a screen-capture CLI plus ffmpeg recording the camera, aligned afterwards. The only reliable way to auto-align two independent recordings is to cross-correlate a shared audio signal present in both files. With system audio excluded, that shared signal would have to be the mic recorded into both. But the screen-capture CLI I’d chosen (scap / SwiftCapture) writes digital silence to its audio track, even when recording alone. No shared signal means no auto-sync. Research confirmed the well-trodden path: every production tool (OBS, ScreenFlow, and friends) captures screen and camera inside one session, on one clock. Two independent OS processes simply can’t be frame-synced reliably without a shared signal.
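To make the failure concrete, here is a toy sketch of cross-correlation alignment in plain Python. It is not code from the pipeline: best_lag and the sample lists are hypothetical, and a real implementation would work on decoded audio with an FFT rather than a brute-force loop. It shows that a shared signal gives a clear peak at the true lag, while a silent track gives no peak at all.

```python
def best_lag(a, b, max_lag):
    """Return the lag (in samples) at which b best matches a, or None if nothing correlates."""
    best, best_score = None, 0.0
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                score += x * b[j]
        # Only a strictly positive correlation counts as a match.
        if score > best_score:
            best, best_score = lag, score
    return best


# Hypothetical samples: a short transient, the same transient 3 samples later,
# and a track of digital silence.
click = [0.0, 0.0, 1.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
delayed = [0.0] * 3 + click[:-3]
silent = [0.0] * len(click)

print(best_lag(click, delayed, 5))
print(best_lag(click, silent, 5))
```

With the delayed copy the function finds the 3-sample shift; with the silent track every score is zero, so there is nothing to lock onto.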

So v1 capture is a single native Swift tool (vidcap): a ScreenCaptureKit stream for the window writes screen.mov (no audio), and an AVFoundation session for the camera + mic writes camera.mov (muxed). The two files each start at their own t=0, but because they share a host clock, vidcap measures the start gap precisely (the camera warms up a second or two after the screen) and records it as offset_seconds so the next stage can realign them. It ships with an interactive window/mic picker, saved defaults, and a keep-alive for static screens.

Ingest, transcribe, tighten

Ingest realigns the two sources by offset_seconds, trims them to a common length, normalises both to 30fps CFR / yuv420p, and extracts a 16 kHz mono voice.wav. It’s stdlib-only Python driving ffmpeg, and idempotent. (Headshot-only clips skip ingest: an iPhone portrait clip is already 9:16 with a rotation flag, so it only needs the rotation applied and a scale to 1080x1920, with no cropping.)
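A minimal sketch of how the realign, trim, and normalise steps could map onto ffmpeg argument lists. The function name, the output file names under work/, and the sign convention for offset_seconds (positive means the camera started later than the screen) are assumptions for illustration, not the pipeline's actual code. The ffmpeg flags themselves are standard.

```python
def ingest_commands(offset_seconds, common_length, out_dir="work"):
    """Build ffmpeg argument lists for one screen + camera pair.

    Assumption: a positive offset_seconds means the camera started that long
    after the screen, so the screen loses its head and the camera starts at 0.
    """
    screen_skip = max(offset_seconds, 0.0)
    camera_skip = max(-offset_seconds, 0.0)
    length = f"{common_length:.3f}"
    normalise = ["-vf", "fps=30", "-pix_fmt", "yuv420p"]  # 30fps CFR / yuv420p

    def cmd(skip, src, *rest):
        # -ss before -i seeks the input; -t limits the output duration.
        return ["ffmpeg", "-y", "-ss", f"{skip:.3f}", "-i", src, "-t", length, *rest]

    return [
        cmd(screen_skip, "screen.mov", *normalise, "-an", f"{out_dir}/screen.mp4"),
        cmd(camera_skip, "camera.mov", *normalise, f"{out_dir}/camera.mp4"),
        # 16 kHz mono PCM for the transcriber.
        cmd(camera_skip, "camera.mov", "-vn", "-ac", "1", "-ar", "16000",
            "-c:a", "pcm_s16le", f"{out_dir}/voice.wav"),
    ]


for argv in ingest_commands(offset_seconds=1.5, common_length=60.0):
    print(" ".join(argv))
```

Building the argument lists separately from running them keeps the step easy to inspect, and each list can be handed to subprocess.run once the outputs are known to be missing.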

Transcribe runs faster-whisper (large-v3 on CPU with int8, since faster-whisper has no Metal backend) to produce words.json: a flat list of {word, start, end} with times relative to the aligned t=0. Word-level timing is what makes the pop-captions and transcript-driven garnish possible.
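As a rough sketch of that step: faster-whisper yields segments that each carry a list of words, which then get flattened into the words.json shape. The function names, the whitespace stripping, and the rounding are my assumptions here. The stand-in segments at the bottom only mimic the library's output so the flattening can be tried without loading the model.

```python
import json
from types import SimpleNamespace


def flatten_words(segments):
    """Flatten segment.words into a flat list of {word, start, end}."""
    return [
        {"word": w.word.strip(), "start": round(w.start, 3), "end": round(w.end, 3)}
        for seg in segments
        for w in (seg.words or [])
    ]


def transcribe(wav_path):
    # Imported lazily so flatten_words can be tried without the model installed.
    from faster_whisper import WhisperModel

    model = WhisperModel("large-v3", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(wav_path, word_timestamps=True)
    return flatten_words(segments)


# Stand-in segments shaped like the library's output (hypothetical values).
fake = [
    SimpleNamespace(words=[
        SimpleNamespace(word=" Hello", start=0.0, end=0.42),
        SimpleNamespace(word=" there", start=0.5, end=0.9),
    ]),
    SimpleNamespace(words=None),  # a segment with no word timings
]
print(json.dumps(flatten_words(fake), indent=2))
```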

Tighten is an optional pass for talking-head footage, which carries a lot of dead air (one test clip was around 24 to 30% silence). It caps long pauses (a 3.7s gap becomes about 0.35s, so the dead air goes but a breath stays) by cutting every normalised source identically with a trim-and-concat. The select filter is not an option here: it drops video frames but leaves the audio uncut, so the two desync. Tighten then remaps every word time onto the new contiguous timeline so downstream authoring stays aligned.
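The pause capping and word remapping can be sketched in a few lines. This is an illustration under assumptions, not the pipeline's code: the tighten function name is hypothetical, and I've assumed the surviving breath is kept immediately after the previous word rather than split around the cut. The returned keep spans are what a trim-and-concat would then apply to every source.

```python
def tighten(words, max_pause=0.35):
    """Cap gaps between words at max_pause.

    Returns (keep, remapped): the spans of the original timeline to keep,
    and the words with times shifted onto the tightened timeline.
    """
    keep, remapped = [], []
    removed = 0.0     # total time cut so far
    span_start = 0.0  # start of the span currently being kept
    prev = None
    for cur in words:
        if prev is not None:
            gap = cur["start"] - prev["end"]
            if gap > max_pause:
                # Keep a short breath after the previous word, drop the rest.
                keep.append((round(span_start, 3), round(prev["end"] + max_pause, 3)))
                span_start = cur["start"]
                removed += gap - max_pause
        remapped.append({
            **cur,
            "start": round(cur["start"] - removed, 3),
            "end": round(cur["end"] - removed, 3),
        })
        prev = cur
    if words:
        keep.append((round(span_start, 3), round(words[-1]["end"], 3)))
    return keep, remapped


# Hypothetical two-word transcript with a 3.7s pause between the words.
words = [
    {"word": "So", "start": 0.0, "end": 0.4},
    {"word": "anyway", "start": 4.1, "end": 4.6},
]
keep, new_words = tighten(words)
print(keep)
print(new_words)
```

In this example the second word moves from 4.1s to 0.75s, leaving exactly the 0.35s breath after the first word, and the kept spans add up to the new end time of the last word.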

Render: Remotion, platform-aware

Rendering is a Remotion project: React components composited to video. platforms.ts is the single source of truth for each target’s canvas size and safe-area zones (top nav, right rail, bottom meta), so captions and content land where the platform UI won’t cover them. The garnish layer is all React: spring-popped per-word captions (captionStyle: "pop"), a screenFrame glow card around the screen zone, transcript-driven Overlays, and full-frame Inserts for title/number/brand scenes. A style-library/ holds machine-readable looks, brand definitions, and the rules that pick a style, so a finished aesthetic can be applied rather than hand-tuned each time.
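The safe-area idea is simple enough to sketch outside React. The real platforms.ts is TypeScript. The version below is a Python illustration to stay consistent with the other sketches, and every inset value in it is a placeholder rather than a calibrated number. The Platform class and its field names are hypothetical; only the 1080x1920 canvas and the three zone names come from the pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    width: int
    height: int
    top_nav: int      # px covered by the top navigation
    right_rail: int   # px covered by the right-hand button rail
    bottom_meta: int  # px covered by the bottom caption/username block

    def safe_rect(self):
        """Return (x, y, w, h) of the area no platform UI covers."""
        return (
            0,
            self.top_nav,
            self.width - self.right_rail,
            self.height - self.top_nav - self.bottom_meta,
        )


# PLACEHOLDER insets for illustration only; not measured from real devices.
PLATFORMS = {
    "tiktok": Platform(1080, 1920, top_nav=150, right_rail=140, bottom_meta=400),
    "reels": Platform(1080, 1920, top_nav=200, right_rail=120, bottom_meta=350),
}

for name, platform in PLATFORMS.items():
    print(name, platform.safe_rect())
```

Keeping the insets in one table is what lets the same synced footage be re-rendered for another target by changing a single key.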

One sharp edge worth recording: Remotion can only load local media from its public/ directory via staticFile() (absolute paths and symlinks don’t work), so the render step copies the real media in before building props and rendering.
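A sketch of that copy-then-build-props step, assuming a per-job subfolder under public/ and a hypothetical prop name (cameraSrc). The helper name stage_media is also mine. The demo runs against a temporary directory with a stand-in file so it can be tried without a Remotion project.

```python
import json
import shutil
import tempfile
from pathlib import Path


def stage_media(sources, public_dir, job="job"):
    """Copy real files into public/<job>/ and return props with paths relative to public/."""
    dest = Path(public_dir) / job
    dest.mkdir(parents=True, exist_ok=True)
    props = {}
    for key, src in sources.items():
        src = Path(src)
        shutil.copy2(src, dest / src.name)  # a real copy, not a symlink
        props[key] = f"{job}/{src.name}"    # the string later passed to staticFile()
    return props


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw" / "camera.mp4"
    raw.parent.mkdir()
    raw.write_bytes(b"stand-in for real media")
    public = Path(tmp) / "public"
    props = stage_media({"cameraSrc": raw}, public)
    staged_ok = (public / "job" / "camera.mp4").is_file()
    print(json.dumps(props))
```

The props only ever hold paths relative to public/, so the composition never sees an absolute path.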

Build philosophy

The whole thing is deliberately a set of small, inspectable scripts rather than a framework, built phase by phase against explicit checkpoints. Constraints were non-negotiable from the start: open-source / self-hostable tooling only, no paid SaaS in the runtime path, macOS-first, and everything ultimately driven by a single CLI over an input folder. There’s intentionally no orchestration or agent layer in v1.

Status

Phases 1 to 5 and 7, plus the garnish layer, are built and validated end to end (a two-zone counted demo and a full-bleed headshot with overlays and brand inserts). Still open: calibrating the exact TikTok/Reels safe-area insets against real device screenshots, and Phase 8, wiring the single pipeline run command so the manual copy-media, build-props, render dance becomes one invocation, with the style-library auto-applied.

See it in action

Here’s a short produced with the pipeline:

@berwickgeek