Free text-to-speech on your Mac that's actually good

Sun, 28 Jun 2026 13:15:49 +1000

Text to speech (TTS) is a brilliant service for those who would otherwise be literally speechless. Lost Voice Guy on Britain’s Got Talent rather humorously demonstrated the challenge.

But TTS has broad use. DIY personal podcasts. Read some text when on a walk or in the car. Share a personal update with someone who is more of a listener than a reader.

The good news is you don’t need to pay for a service like ElevenLabs to get quite good TTS. At least not on a Mac. (Don’t ask me about Windows; that’s for someone else to untangle.)

Apple ships Premium Neural text-to-speech voices with every modern Mac. Almost nobody talks about them. Most people reaching for ElevenLabs don’t know they’re paying for something already sitting on their machine — running locally, no API key, no usage limit.

What Apple actually ships

Modern Macs on Apple Silicon (macOS Ventura and above) include a tier called Premium voices - Neural TTS built on the same engine Apple uses for Siri. These are not the robot voices from 2012.

The non-American English voices worth knowing:

Lee (Premium) - Australian male
Matilda (Premium) - Australian female
Jamie (Premium) - British male
Moira (Enhanced) - Irish female
Tessa - South African female

To download Premium voices: open System Settings, search “spoken”, open Read & Speak and click System Voice. The dropdown shows installed voices; clicking Customise reveals everything available to download. On Apple Silicon the Premium tier is available; on Intel Macs the ceiling is Enhanced, which is a noticeable step down.
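Once the voices are downloaded, you can check what is installed from the terminal: the say command prints its voice list when given ? as the voice name. Here is a rough Python sketch that parses that listing and keeps the non-American English voices. The line layout (name, locale, then a # sample phrase) is my assumption about the output format, and parse_voices and non_american_english are names made up for illustration.

```python
import re

# Assumed layout of each line printed by `say -v '?'`:
#   Lee (Premium)       en_AU    # Hello! My name is Lee.
VOICE_LINE = re.compile(r"^(?P<name>.+?)\s+(?P<locale>[a-z]{2,3}_\w+)\s+#\s*(?P<sample>.*)$")


def parse_voices(listing):
    """Turn the raw voice listing into a list of dicts with name, locale and tier."""
    voices = []
    for line in listing.splitlines():
        match = VOICE_LINE.match(line)
        if not match:
            continue  # skip anything that does not look like a voice line
        name = match.group("name")
        if name.endswith("(Premium)"):
            tier = "Premium"
        elif name.endswith("(Enhanced)"):
            tier = "Enhanced"
        else:
            tier = "Standard"
        voices.append({"name": name, "locale": match.group("locale"), "tier": tier})
    return voices


def non_american_english(voices):
    """Keep English voices whose locale is anything other than en_US."""
    return [v for v in voices if v["locale"].startswith("en_") and v["locale"] != "en_US"]
```

On a Mac you would feed it the real listing, for example subprocess.run(["say", "-v", "?"], capture_output=True, text=True).stdout.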

The subscription trap this replaces

ElevenLabs - the category leader - charges $5/month for a Starter plan and $22/month for Creator. Both are designed for the moment you forget to cancel.

The quality gap that justified those prices has closed for a specific use case: reading documents aloud. Apple’s Premium voices are genuinely competitive for narration, podcast-style reading and accessibility work. Not for voice cloning, not for expressive performance, not for hyper-realistic output in a specific accent you’ve trained. For turning a document into audio? The free version is fine.

Building a Markdown-to-MP3 pipeline

The macOS speech engine alone is useful but limited. For document narration - turning a 2,000-word research note or blog post into an MP3 - you need to handle three things the basic approach ignores.

Strip the Markdown. Bold markers, image syntax, link URLs and code blocks all get spoken verbatim otherwise. The right preprocessing step removes formatting tags, keeps the display text from links, drops images entirely and strips YAML frontmatter before the voice ever sees the text.
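A minimal sketch of that preprocessing step in Python, using regular expressions rather than a real Markdown parser, so it will miss edge cases such as nested emphasis. The function name strip_markdown is made up for illustration. It deliberately leaves the # header markers in place, on the assumption that a later step uses them to decide pause lengths.

```python
import re


def strip_markdown(text):
    """Return text with Markdown formatting removed (a rough, regex-based pass)."""
    # YAML frontmatter: a leading block fenced by '---' lines.
    text = re.sub(r"\A---\n.*?\n---\n", "", text, flags=re.DOTALL)
    # Fenced code blocks are dropped entirely.
    text = re.sub(r"^```.*?^```[ \t]*$", "", text, flags=re.DOTALL | re.MULTILINE)
    # Images go first so the link rule does not keep their alt text.
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)
    # Links keep only their display text.
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
    # Inline code keeps its contents, minus the backticks.
    text = re.sub(r"`([^`]*)`", r"\1", text)
    # Bold (** or __) and then italic (* or _) markers around a run of text.
    text = re.sub(r"(\*\*|__)(.+?)\1", r"\2", text)
    text = re.sub(r"(?<!\w)([*_])(.+?)\1(?!\w)", r"\2", text)
    # Collapse the blank-line runs left behind by removed blocks.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

The order matters: images have to go before links, otherwise the link rule would turn an image into its alt text and the voice would read it out.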

Add structural pauses. A sentence break inside a paragraph sounds different to the end of a section. I use 600ms of silence between paragraphs, 1000ms after H2 and H3 headers and 1500ms after H1 headers. These are configurable. Without them, the output sounds like a single unbroken stream regardless of how the document is structured.
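Here is one way the pause logic could look in Python, a sketch rather than my exact code. It walks the text line by line, treats # lines as headers and blank lines as paragraph breaks, and attaches the pause that should follow each segment. Segment, PAUSE_MS and split_segments are illustrative names, and treating H4 and deeper like H3 is an assumption.

```python
import re
from dataclasses import dataclass

# Pause lengths in milliseconds, taken from the figures above; adjust to taste.
PAUSE_MS = {"h1": 1500, "h2": 1000, "h3": 1000, "paragraph": 600}


@dataclass
class Segment:
    text: str      # what the voice should read
    pause_ms: int  # silence to insert after this segment


def split_segments(text, pauses=None):
    """Split text at headers and blank lines, pairing each segment with a pause."""
    pauses = pauses or PAUSE_MS
    segments = []
    paragraph = []

    def flush():
        # Emit the paragraph collected so far, if any.
        if paragraph:
            segments.append(Segment(" ".join(paragraph), pauses["paragraph"]))
            paragraph.clear()

    for line in text.splitlines():
        header = re.match(r"^(#{1,6})\s+(.*\S)\s*$", line)
        if header:
            flush()
            level = min(len(header.group(1)), 3)  # assumption: H4+ pauses like H3
            segments.append(Segment(header.group(2), pauses[f"h{level}"]))
        elif line.strip():
            paragraph.append(line.strip())
        else:
            flush()
    flush()
    return segments
```

Passing a different pauses dict is what makes the durations configurable without touching the splitter.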

Normalise the audio before concatenating. Premium Neural voices output at a higher sample rate than the silence clips you generate. Skip the normalisation step and you get audible glitches at every segment join. Convert everything to a consistent format first, then concatenate.
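As a sketch of what "a consistent format" can mean in practice, the Python helpers below build two ffmpeg command lines: one that resamples a speech clip and one that generates silence in the same format. The target of 44.1 kHz mono 16-bit WAV is an assumption, not a requirement, and the helper names are made up. What matters is that both kinds of clip use the same settings.

```python
SAMPLE_RATE = 44100  # assumed common target rate for every clip
CHANNELS = 1         # assumed mono output


def normalise_cmd(src, dst):
    """ffmpeg arguments that convert one clip to the shared WAV format."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-ar", str(SAMPLE_RATE),  # resample
        "-ac", str(CHANNELS),     # force channel count
        "-c:a", "pcm_s16le",      # 16-bit PCM
        dst,
    ]


def silence_cmd(duration_ms, dst):
    """ffmpeg arguments that generate silence in the same format as the speech."""
    layout = "mono" if CHANNELS == 1 else "stereo"
    return [
        "ffmpeg", "-y",
        "-f", "lavfi", "-i", f"anullsrc=r={SAMPLE_RATE}:cl={layout}",
        "-t", f"{duration_ms / 1000:.3f}",  # duration in seconds
        "-c:a", "pcm_s16le",
        dst,
    ]
```

Each list can be handed to subprocess.run(cmd, check=True). Building the commands as lists avoids shell quoting problems with voice names and file paths.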

The pipeline is straightforward once those three things are solved. Split the document into segments at headers and blank lines. Generate speech for each segment. Insert silence between them. Combine everything and encode to MP3 using ffmpeg, which is free and open source.
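The remaining steps can be sketched in Python as three small helpers plus the final encode. This assumes each segment has been written to a text file and that silence clips are named by duration (silence_600.wav and so on); all function and file names are illustrative. The say flags used here are -v for the voice, -f for the input file and -o for the output file.

```python
def say_cmd(text_file, voice, dst_aiff):
    """Arguments for macOS say: read text_file in the given voice into an AIFF file."""
    return ["say", "-v", voice, "-f", text_file, "-o", dst_aiff]


def playlist(clips):
    """clips is a list of (speech_wav, pause_ms); returns clip paths in playback order."""
    order = []
    for index, (wav, pause_ms) in enumerate(clips):
        order.append(wav)
        if index < len(clips) - 1:  # no trailing silence after the last segment
            order.append(f"silence_{pause_ms}.wav")
    return order


def concat_list(paths):
    """Contents of the list file read by ffmpeg's concat demuxer."""
    lines = []
    for path in paths:
        escaped = path.replace("'", "'\\''")  # escape single quotes inside paths
        lines.append(f"file '{escaped}'\n")
    return "".join(lines)


def encode_cmd(list_file, dst_mp3):
    """ffmpeg arguments that join the listed clips and encode them as MP3."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", list_file,
        "-c:a", "libmp3lame", "-q:a", "2",  # VBR quality 2; an assumed setting
        dst_mp3,
    ]
```

In use, you would run say_cmd for each segment, normalise the results, write concat_list(playlist(...)) to a file and then run encode_cmd on it.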

The whole thing generates at roughly 12% of real time - a 10-minute read takes about 70 seconds to produce.
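To estimate the wait for a longer document, the arithmetic is just audio length times that ratio. A tiny Python helper, with the 0.12 ratio taken from my own runs and likely to differ on other machines:

```python
def estimated_generation_seconds(audio_minutes, ratio=0.12):
    """Rough generation time in seconds for a given length of finished audio."""
    return round(audio_minutes * 60 * ratio)
```

For a 10-minute read this gives 72 seconds, which is the "about 70 seconds" above.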

What I built on top of this

I integrated this into my personal dashboard as a page that accepts pasted text or a dropped .md file. Eight voices are listed with accent, tier (Premium / Enhanced / Standard) and a preview button per voice so you can hear the accent before committing to a 10-minute generation run.

Total ongoing cost: zero.

If you want to build the same thing, here’s the prompt to give Claude:

Build me a local web dashboard page for converting text and Markdown files to MP3 using macOS’s built-in text-to-speech. It should accept pasted text or a dropped .md or .txt file. Before speaking, strip all Markdown formatting: remove YAML frontmatter, code blocks and bold and italic markers, convert links to their plain display text and drop images entirely. Split the document at headers and paragraph breaks and insert silence between segments, with longer pauses after major headers than between paragraphs - make the durations adjustable. Show a voice panel listing non-American English macOS voices (Lee (Premium), Matilda (Premium), Jamie (Premium), Moira (Enhanced), Karen, Daniel, Moira, Tessa) with accent, tier and a per-voice preview button. Generate an MP3 and offer it for download. The backend is a Bun server that shells out to the macOS say command for speech and ffmpeg for audio normalisation, silence generation and MP3 encoding. POST /tts/generate accepts JSON with the text, voice name and pause durations and returns audio/mpeg. GET /tts/preview?voice=… returns a short cached sample. Assume Apple Silicon Mac, Bun runtime, ffmpeg installed via Homebrew.

That prompt, pasted into Claude with a connected working folder, gets you a working version in one session.

What it won’t do

Be clear-eyed about the limits. macOS Premium voices can’t clone a specific voice, can’t deliver expressive emotional range and can’t produce the hyper-realistic output ElevenLabs’ top tier achieves. If you need those things, pay for them.

What you don’t need to pay for is basic high-quality narration of your own documents. That’s already on your machine. It’s been there for years. Set it up once, run it locally and stop subscribing to things you don’t need.
