Voice: the highest-bandwidth input you have
Typing is slow. Talking to your agent is two to three times faster, and on a phone it is the only humane option. We use Whisper for speech-to-text and FFmpeg for the audio pipeline.
Two flavors of Whisper
- whisper.cpp — pure C++, runs on CPU, 100% local, free
- OpenAI Whisper API — cloud, fast, paid, slightly more accurate on noisy audio
For privacy and zero ongoing cost, start with whisper.cpp. We will fall back to the API for hard cases.
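One way to picture the local-first, API-second policy is a small wrapper that tries the local transcriber and only calls the cloud one when the local result is unusable. This is a sketch, not a prescribed design: `local` and `cloud` are hypothetical callables you would supply, and treating an empty transcript or a crash as a "hard case" is an assumption, since the text does not define the term.

```python
def transcribe_with_fallback(wav_path, local, cloud, min_chars=1):
    """Return (text, source), preferring the local transcriber.

    `local` and `cloud` are hypothetical callables that take a file path
    and return a transcript string. A "hard case" is assumed to mean the
    local transcriber raised or produced fewer than `min_chars` characters.
    """
    try:
        text = local(wav_path).strip()
    except Exception:
        # Broad catch on purpose: any local failure triggers the fallback.
        text = ""

    if len(text) >= min_chars:
        return text, "local"

    # Local result was unusable, so pay for the cloud call.
    return cloud(wav_path).strip(), "cloud"
```

Returning the source alongside the text makes it easy to log how often the paid path is actually used.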
Install whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp ~/Code/whisper
cd ~/Code/whisper && make
./models/download-ggml-model.sh base.en
Wire it into OpenClaw
Add a Skill or Tool entry that:
- Records audio from the mic with FFmpeg into a temp .wav
- Pipes the file through whisper.cpp to produce text
- Sends the text into OpenClaw as if it were a typed message
Example FFmpeg recorder line (macOS):
ffmpeg -f avfoundation -i ":0" -t 30 -ar 16000 -ac 1 /tmp/voice.wav
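The three steps above can be sketched as a small Python wrapper around the two command-line tools. Several things here are assumptions rather than facts from this guide: the whisper.cpp binary path (recent builds name it `whisper-cli`, older ones produce `./main`), the added `-y` flag so FFmpeg overwrites the previous recording instead of prompting, and the `send` callable, which is a hypothetical stand-in for however your Skill or Tool hands text to OpenClaw.

```python
import subprocess
from pathlib import Path

# Assumed locations; adjust to match your checkout and build output.
WHISPER_DIR = Path.home() / "Code" / "whisper"
WHISPER_BIN = WHISPER_DIR / "build" / "bin" / "whisper-cli"  # assumption: may be ./main on older builds
MODEL_PATH = WHISPER_DIR / "models" / "ggml-base.en.bin"


def record_cmd(wav_path, seconds=30, device=":0"):
    """Build the FFmpeg recorder command (macOS avfoundation, 16 kHz mono)."""
    return [
        "ffmpeg", "-y",  # -y: overwrite the previous temp file without asking
        "-f", "avfoundation", "-i", device,
        "-t", str(seconds),
        "-ar", "16000", "-ac", "1",
        str(wav_path),
    ]


def whisper_cmd(wav_path, model_path=MODEL_PATH, binary=WHISPER_BIN):
    """Build the whisper.cpp command; -nt drops timestamps from the output."""
    return [str(binary), "-m", str(model_path), "-f", str(wav_path), "-nt"]


def clean_transcript(stdout):
    """Collapse whisper.cpp's line-per-segment output into one message."""
    lines = [line.strip() for line in stdout.splitlines()]
    return " ".join(line for line in lines if line)


def voice_to_text(wav_path="/tmp/voice.wav", seconds=30):
    """Record from the mic, transcribe locally, and return the text."""
    subprocess.run(record_cmd(wav_path, seconds), check=True)
    result = subprocess.run(
        whisper_cmd(wav_path), check=True, capture_output=True, text=True
    )
    return clean_transcript(result.stdout)


def handle_voice(send, seconds=30):
    """`send` is a hypothetical callable that delivers text to OpenClaw."""
    send(voice_to_text(seconds=seconds))
```

Keeping the command builders separate from the `subprocess` calls means you can print and inspect the exact commands before running anything against the mic.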
When to choose which model
- tiny.en: keyword grade, fastest
- base.en: daily driver, ~1× realtime on a modern laptop
- small.en / medium.en: meeting transcription, slower but more accurate
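If the Skill or Tool handles more than one kind of audio, the model choice can live in a small lookup instead of being hard-coded. The sketch below is illustrative only: the use-case labels are invented for this example, and the `models/ggml-<name>.bin` layout is assumed to match what the download script produced during install.

```python
# Hypothetical use-case labels mapped to the model names listed above.
MODEL_FOR_USE = {
    "keywords": "tiny.en",
    "dictation": "base.en",
    "meeting": "small.en",
}


def model_file(use_case, models_dir="models"):
    """Return the model file path for a use case, defaulting to base.en."""
    name = MODEL_FOR_USE.get(use_case, "base.en")
    # Assumed naming: the download script saves models as ggml-<name>.bin.
    return f"{models_dir}/ggml-{name}.bin"
```

Defaulting to base.en keeps unknown use cases on the daily-driver model rather than failing.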
Try it
Record yourself dictating a one-paragraph task. Confirm OpenClaw receives the transcribed text and acts on it.