Turn your Mac into a powerful transcription machine using the same AI model that powers voice transcription in OpenAI's ChatGPT. With just a few Terminal commands, you can convert audio and video files into accurate text in minutes.
If you've never touched Terminal before, don't worry — setting up Whisper on macOS Sequoia 15 is easier than it looks and worth it. Whether you're working with YouTube videos, interviews, lectures, or voice notes, Whisper can handle all the heavy lifting.
Whisper is a free, open-source speech-to-text neural network from OpenAI that runs entirely on your machine — no internet required after setup. Once you get it going, it's fast, secure, and dead simple — and it can chew through just about any audio or video format you throw at it. And it's the perfect tool if you're sick of glitchy web-based transcription services, expensive Mac apps, and clunky browser extensions with limitations, such as file size caps, watermarks, ads, or lousy accuracy.
Yes, it lives in the Terminal — that black box of mystery most folks avoid. But here's the thing: if you can copy and paste, you can run Whisper. Once it's installed, transcribing a file is literally one line. There is no bloated interface, no uploading and waiting, and no monthly fee.
And if you're not ready to mess with the command line? You still have options. There are Mac apps like MacWhisper and Whisper Transcription that give you a drag-and-drop interface powered by Whisper under the hood. Browser-based services like Whisper demo on Hugging Face make it even easier — though you'll usually trade some privacy and flexibility for convenience. However, the command-line version is still the most powerful and flexible way to use Whisper, and it's the official implementation maintained by OpenAI. If you want complete control, this is the version you want.
Or you can skip all of it and just send ChatGPT the file via its web or desktop app — it can transcribe or translate it for you using Whisper.
So if you're tired of jumping through hoops just to get clean transcripts — whether you're a student, podcaster, journalist, or just someone trying to archive your Zoom calls — it's time to take five minutes and set up something that just works. Let's dive in.
Requirements
Through the instructions below, you'll be installing and using the following tools:
Whisper command-line tool from OpenAI: The core transcription engine that converts speech to text.
FFmpeg: Required for Whisper to open, convert, and process audio and video files.
Python 3.10 or later: The programming language Whisper is written in.
Homebrew: A package manager that makes installing Whisper, FFmpeg, and Python easy.
To run these tools successfully, you'll need:
A Mac running macOS Monterey 12.3 or later: Preferably macOS Sequoia 15 or later on an Apple Silicon chip for faster performance.
At least 8 GB of RAM and some free disk space: Larger Whisper models can use a lot of memory — especially on long files — but smaller models work fine on most setups.
Terminal app: Preinstalled on macOS — you'll use it to enter the setup and transcription commands.
Setting up Whisper on macOS
Follow these steps to install everything you need and start transcribing files. If you already have Homebrew, Python, and FFmpeg installed, it's still worth checking those steps out to ensure everything is up to date.
Open Terminal on your Mac
Terminal is the command-line app built into macOS — it’s how you’ll install and run Whisper. You don’t need to know how to code, just how to paste in commands. To open Terminal, press Command + Space, type “Terminal,” and hit Return. You can also find it in the Utilities folder in your Applications directory or in the Other folder in Launchpad.

Install or update Homebrew
Homebrew is a package manager for macOS — like the App Store but for powerful command-line tools. It makes it easy to install everything Whisper needs behind the scenes.
If you don't have Homebrew installed, paste this command and press Return:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
This command may look intimidating, but here's what it all means:
/bin/bash -c
tells your Mac to run a command in the Bash shell. The part in quotes —
"$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
— uses curl (a tool that fetches files from the internet) to download Homebrew's official installer script from GitHub. Here's what those flags mean:
-f = fail silently on errors.
-s = run silently (don't show the download progress bar).
-S = show errors if any occur (used with -s).
-L = follow redirects automatically.
The whole script installs Homebrew safely and automatically.
If Homebrew is already installed, update it by running:
brew update
Install Python 3.10 (or newer)
Python is the programming language Whisper is written in. Apple includes an older version on macOS, but Whisper needs a newer one to run properly. Homebrew makes it easy to install the correct version.
Whisper requires Python 3.10 or above. Install it with:
brew install python
If you already have Python installed but aren't sure if it's the right version, check it with:
python3 --version
If it's older than 3.10, you can upgrade it with:
brew upgrade python
You're good to go once you're on Python 3.10 or newer.
Install FFmpeg
FFmpeg is a tool for processing audio and video files. It helps Whisper handle all kinds of media formats, such as MP3, MP4, M4A, and WAV. Without FFmpeg, Whisper can't read or convert your files.
To install it using Homebrew:
brew install ffmpeg
If you already have FFmpeg installed, make sure it's up to date:
brew upgrade ffmpeg
You can verify that FFmpeg is working by running:
ffmpeg -version
If it prints version info, you're good.
Install Whisper via pip
Pip is Python's built-in package manager — it's how you install Python apps like Whisper. You'll use pip to download and install Whisper directly from OpenAI's GitHub repository.
First, make sure pip is up to date:
pip3 install --upgrade pip
Then install Whisper:
pip3 install git+https://github.com/openai/whisper.git
If either command stops with an "externally-managed-environment" error (newer Homebrew Python versions block system-wide pip installs), create and activate a virtual environment first — run python3 -m venv ~/whisper-env, then source ~/whisper-env/bin/activate — and rerun the two commands above.
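To confirm the install worked, ask Whisper for its help text — if it prints usage information, the command-line tool is on your PATH and ready to use:

```shell
# Should print Whisper's usage text and full option list
whisper --help
```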
Run a transcription with Whisper
Once Whisper is installed, you can transcribe audio and video files (MP3, MP4, M4A, WAV, and more) using a single command. It supports a range of pretrained models, from lightweight and fast to large and highly accurate.
Audio files are transcribed much faster than video files, so you may want to extract the audio from your videos and use that with Whisper instead — especially when working with a larger model. On a Mac, you can quickly export audio from a video file using QuickTime Player.
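If you'd rather stay in Terminal, FFmpeg (installed earlier) can do the same extraction in one line. A minimal sketch — the filenames are placeholders, and -c:a copy assumes the video's audio track is in an MP4-friendly codec such as AAC:

```shell
# Pull the audio track out of a video without re-encoding it.
# -vn drops the video stream; -c:a copy keeps the original audio codec.
ffmpeg -i your_video.mp4 -vn -c:a copy your_audio.m4a
```

If the copied track won't open, re-encode it instead by dropping -c:a copy (for example, ffmpeg -i your_video.mp4 -vn your_audio.mp3).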
Basic usage (auto-detects language)
The --model tiny option runs the fastest and uses the least memory, while the --model large option offers the best accuracy but requires significantly more RAM and takes longer to process.
whisper your_file.mp4 --model tiny
whisper your_file.mp4 --model base
whisper your_file.mp4 --model small
whisper your_file.mp4 --model medium
whisper your_file.mp4 --model large
Specify language for faster, more accurate results
If you know your file is in English, you can specify it using --language en or --language English:
whisper your_file.mp4 --language English --model tiny
whisper your_file.mp4 --language English --model base
whisper your_file.mp4 --language English --model small
whisper your_file.mp4 --language English --model medium
whisper your_file.mp4 --language English --model large
When using one of the commands above, the output will print directly in the same Terminal window.

By default, Whisper also saves transcription files in the same directory as the original media file: .txt (plain transcript), .srt (the standard subtitle format used by most video players and editors), .vtt (the Web Video Text Tracks format used for HTML5 video, YouTube, etc.), plus .tsv and .json. Add a flag like --output_format txt to keep just one of those formats, or --task translate to translate foreign-language audio into English.
For example, the following transcribes the file in English and outputs it to a .txt document in the same directory.
whisper your_file.mp4 --language en --model small --output_format txt
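If you'd rather not clutter the media folder, the --output_dir flag (listed in the help output below) sends the results elsewhere. A variant of the command above — the ~/Transcripts path is just an example:

```shell
# Save the .txt transcript to a dedicated folder instead of
# the folder you run the command from
whisper your_file.mp4 --language en --model small --output_format txt --output_dir ~/Transcripts
```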
To generate English subtitles for a foreign-language video, the following command translates the audio and saves transcription files in every supported format in the same folder as your video or audio file.
whisper your_file.mp4 --task translate --model medium
Want just the English subtitle file (like .srt) and not the other formats? Run:
whisper your_file.mp4 --task translate --output_format srt
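Transcribing a whole folder is just a shell loop around the same command. A sketch, assuming the folder holds English-language .mp3 files — adjust the glob, language, and model to taste:

```shell
# Transcribe every .mp3 in the current folder to a .txt transcript
for f in *.mp3; do
  whisper "$f" --language en --model base --output_format txt
done
```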
To see all available options:
whisper --help
Final thoughts
Whisper in Terminal isn't just a transcription tool — it's a secret weapon for creators, journalists, students, and anyone who deals with spoken content. The setup process might feel a bit technical the first time, but once it's up and running, it's incredibly simple to use.
That said, Whisper models run locally and can be slow, depending on your Mac's hardware. If you work with large files and want faster results, stick to the tiny or base models. If you need higher accuracy and don't mind the extra processing time, go for medium or large.
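Not sure which model your Mac can comfortably handle? One quick way to decide is to time a couple of models on a short sample clip before committing an hour-long file (sample.mp3 is a placeholder):

```shell
# Compare how long each model takes on the same short clip
time whisper sample.mp3 --language en --model tiny
time whisper sample.mp3 --language en --model base
```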
Full list of Whisper arguments and options
If you want to explore everything Whisper can do — including output formats, language support, and advanced flags — you can run whisper --help in Terminal. Here's the complete list of available options for quick reference:
usage: whisper [-h] [--model MODEL] [--model_dir MODEL_DIR] [--device DEVICE]
[--output_dir OUTPUT_DIR]
[--output_format {txt,vtt,srt,tsv,json,all}]
[--verbose VERBOSE] [--task {transcribe,translate}]
[--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}]
[--temperature TEMPERATURE] [--best_of BEST_OF]
[--beam_size BEAM_SIZE] [--patience PATIENCE]
[--length_penalty LENGTH_PENALTY]
[--suppress_tokens SUPPRESS_TOKENS]
[--initial_prompt INITIAL_PROMPT]
[--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT]
[--fp16 FP16]
[--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK]
[--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD]
[--logprob_threshold LOGPROB_THRESHOLD]
[--no_speech_threshold NO_SPEECH_THRESHOLD]
[--word_timestamps WORD_TIMESTAMPS]
[--prepend_punctuations PREPEND_PUNCTUATIONS]
[--append_punctuations APPEND_PUNCTUATIONS]
[--highlight_words HIGHLIGHT_WORDS]
[--max_line_width MAX_LINE_WIDTH]
[--max_line_count MAX_LINE_COUNT]
[--max_words_per_line MAX_WORDS_PER_LINE] [--threads THREADS]
[--clip_timestamps CLIP_TIMESTAMPS]
[--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD]
audio [audio ...]
positional arguments:
audio audio file(s) to transcribe
options:
-h, --help show this help message and exit
--model MODEL name of the Whisper model to use (default: turbo)
--model_dir MODEL_DIR
the path to save model files; uses ~/.cache/whisper by
default (default: None)
--device DEVICE device to use for PyTorch inference (default: cpu)
--output_dir OUTPUT_DIR, -o OUTPUT_DIR
directory to save the outputs (default: .)
--output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}
format of the output file; if not specified, all
available formats will be produced (default: all)
--verbose VERBOSE whether to print out the progress and debug messages
(default: True)
--task {transcribe,translate}
whether to perform X->X speech recognition
('transcribe') or X->English translation ('translate')
(default: transcribe)
--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}
language spoken in the audio, specify None to perform
language detection (default: None)
--temperature TEMPERATURE
temperature to use for sampling (default: 0)
--best_of BEST_OF number of candidates when sampling with non-zero
temperature (default: 5)
--beam_size BEAM_SIZE
number of beams in beam search, only applicable when
temperature is zero (default: 5)
--patience PATIENCE optional patience value to use in beam decoding, as in
https://arxiv.org/abs/2204.05424, the default (1.0) is
equivalent to conventional beam search (default: None)
--length_penalty LENGTH_PENALTY
optional token length penalty coefficient (alpha) as
in https://arxiv.org/abs/1609.08144, uses simple
length normalization by default (default: None)
--suppress_tokens SUPPRESS_TOKENS
comma-separated list of token ids to suppress during
sampling; '-1' will suppress most special characters
except common punctuations (default: -1)
--initial_prompt INITIAL_PROMPT
optional text to provide as a prompt for the first
window. (default: None)
--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT
if True, provide the previous output of the model as a
prompt for the next window; disabling may make the
text inconsistent across windows, but the model
becomes less prone to getting stuck in a failure loop
(default: True)
--fp16 FP16 whether to perform inference in fp16; True by default
(default: True)
--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK
temperature to increase when falling back when the
decoding fails to meet either of the thresholds below
(default: 0.2)
--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD
if the gzip compression ratio is higher than this
value, treat the decoding as failed (default: 2.4)
--logprob_threshold LOGPROB_THRESHOLD
if the average log probability is lower than this
value, treat the decoding as failed (default: -1.0)
--no_speech_threshold NO_SPEECH_THRESHOLD
if the probability of the <|nospeech|> token is higher
than this value AND the decoding has failed due to
`logprob_threshold`, consider the segment as silence
(default: 0.6)
--word_timestamps WORD_TIMESTAMPS
(experimental) extract word-level timestamps and
refine the results based on them (default: False)
--prepend_punctuations PREPEND_PUNCTUATIONS
if word_timestamps is True, merge these punctuation
symbols with the next word (default: "'“¿([{-)
--append_punctuations APPEND_PUNCTUATIONS
if word_timestamps is True, merge these punctuation
symbols with the previous word (default:
"'.。,,!!??::”)]}、)
--highlight_words HIGHLIGHT_WORDS
(requires --word_timestamps True) underline each word
as it is spoken in srt and vtt (default: False)
--max_line_width MAX_LINE_WIDTH
(requires --word_timestamps True) the maximum number
of characters in a line before breaking the line
(default: None)
--max_line_count MAX_LINE_COUNT
(requires --word_timestamps True) the maximum number
of lines in a segment (default: None)
--max_words_per_line MAX_WORDS_PER_LINE
(requires --word_timestamps True, no effect with
--max_line_width) the maximum number of words in a
segment (default: None)
--threads THREADS number of threads used by torch for CPU inference;
supercedes MKL_NUM_THREADS/OMP_NUM_THREADS (default:
0)
--clip_timestamps CLIP_TIMESTAMPS
comma-separated list start,end,start,end,...
timestamps (in seconds) of clips to process, where the
last end timestamp defaults to the end of the file
(default: 0)
--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD
(requires --word_timestamps True) skip silent periods
longer than this threshold (in seconds) when a
possible hallucination is detected (default: None)
Cover photo, screenshots, and GIFs by Gadget Hacks.