CliffsNotes your voicemails with WebSockets, Twilio, and OpenAI

October 2, 2024

While ngrok started as the way to get a local service online in a single line, it’s evolved to be much more than that (say, a universal gateway used by 80% of the Cloud 100!). Today, let’s hearken back to why ngrok was created: testing your local services against things outside localhost.

Just before the first session of Office Hours went live, I got a fantastic question:

How can I set up secure WebSockets with ngrok to use the Twilio [Media Stream] API?

First off, let’s clarify one thing about WebSockets, since we get questions about it now and then: ngrok not only supports WebSockets (WS) via default HTTP tunnels, but also secures them by doing all the messy work around TLS certificates and termination for you.

You can securely expose your service to Twilio’s API (or any other) for development and testing, and when you’re ready to deliver to prod, you don’t need to change your configuration—add a custom domain and a few Traffic Policy rules if you need extra security or rate limiting.

Here’s a quick walkthrough of the demo I wanted to do live.

An ngrok experiment: summarizing phone calls with an API

Actually, that’s a lie.

I was going to showcase one of Twilio’s Media Streams Demos, but when it came to publishing a version of that here, it suddenly felt lacking. I needed to spice things up a bit.

Taking inspiration from those quickstarts, especially the ones that connect phone audio to a text summarization service, I decided to modernize them and go one step further: create an API service that Twilio can POST to and then stream call audio to over WebSockets, so the audio can be transcribed and summarized with the help of OpenAI.

Here’s what I came up with.

require('dotenv').config();
const express = require('express');
const http = require('http');
const WebSocket = require('ws');
const path = require('path');
const twilio = require('twilio');
const { OpenAI, toFile } = require('openai');
const TwilioMediaStreamSaveAudioFile = require('twilio-media-stream-save-audio-file');
const fs = require('fs').promises;

// One HTTP server handles both the Express routes and the WebSocket upgrades.
const app = express();
const server = http.createServer(app);
const wss = new WebSocket.Server({ server });

const PORT = process.env.PORT || 3000;
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const mediaStreamSaver = new TwilioMediaStreamSaveAudioFile({ saveLocation: `${__dirname}/temp` });

// Twilio opens a WebSocket per call and sends JSON messages tagged with an event name.
wss.on('connection', (ws) => {
  ws.on('message', async (message) => {
    const { event, media } = JSON.parse(message);
    switch (event) {
      case 'start':
        console.log('Call connected. Starting media stream...');
        mediaStreamSaver.twilioStreamStart();
        break;
      case 'media':
        // Each media message carries a base64-encoded chunk of call audio.
        mediaStreamSaver.twilioStreamMedia(media.payload);
        break;
      case 'stop':
        console.log('Call ended.');
        mediaStreamSaver.twilioStreamStop();
        try {
          const transcription = await transcribeAudio();
          const summary = await summarizeText(transcription);
          console.log('Transcription:', transcription);
          console.log('Summary:', summary);
          ws.send(JSON.stringify({ transcription, summary }));
        } catch (error) {
          console.error('Error processing audio:', error);
          ws.send(JSON.stringify({ error: error.message }));
        }
        break;
    }
  });
});

// Send the saved .wav file to Whisper and return the transcript as plain text.
async function transcribeAudio() {
  const audioData = await fs.readFile(mediaStreamSaver.wstream.path);
  const file = await toFile(audioData, path.basename(mediaStreamSaver.wstream.path), { type: 'audio/wav' });
  return openai.audio.transcriptions.create({
    file: file,
    model: 'whisper-1',
    response_format: 'text',
  });
}

// Ask the chat model for a one-sentence summary of the transcript.
async function summarizeText(transcription) {
  const response = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      { role: "system", content: "You are a helpful assistant that summarizes text as succinctly as possible." },
      { role: "user", content: `Please summarize the following text in a single sentence: ${transcription}` }
    ],
  });
  return response.choices[0].message.content;
}

app.use(express.urlencoded({ extended: false }));

// Twilio POSTs here when the call connects; the TwiML reply tells it to
// stream the call audio back to this same host over wss://.
app.post('/twiml', twilio.webhook({validate: false}), (req, res) => {
  const twiml = new twilio.twiml.VoiceResponse();
  twiml.start().stream({ url: `wss://${req.headers.host}/message` });
  twiml.say('Please start speaking.');
  twiml.pause({ length: 30 });
  res.type('text/xml').send(twiml.toString());
});

server.listen(PORT, () => {
  console.log(`Server is running on port ${PORT}`);
});

I would’ve loved to comment every line of this API so it’s completely self-explanatory, but in short, an Express server responds to POST requests to the /twiml route (TwiML being the Twilio Markup Language) with instructions that tell Twilio how to handle the phone call and to open a WS stream back to the same host. The very handy twilio-media-stream-save-audio-file project captures and decodes Twilio’s streaming audio (mysteriously encoded in… MULAW?), then saves it as a local .wav file. That file then goes to OpenAI’s Whisper model for transcription, and the transcript goes to gpt-3.5-turbo through the chat completions API for summarization.
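For the curious, here is a rough sketch of what decoding that MULAW audio involves. Twilio Media Streams send 8 kHz mu-law audio as base64 inside each media message, and the function below is the standard G.711 mu-law expansion to 16-bit PCM. It is not necessarily how twilio-media-stream-save-audio-file does it internally, and the helper names decodeMuLawSample and decodeMediaPayload are made up for illustration.

```javascript
// Sketch only: standard G.711 mu-law expansion, using nothing but Node built-ins.
// decodeMuLawSample and decodeMediaPayload are hypothetical helper names.

function decodeMuLawSample(byte) {
  const u = ~byte & 0xff;            // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;             // top bit marks a negative sample
  const exponent = (u >> 4) & 0x07;  // 3-bit segment number
  const mantissa = u & 0x0f;         // 4-bit step within the segment
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

function decodeMediaPayload(base64Payload) {
  const bytes = Buffer.from(base64Payload, 'base64');
  const pcm = new Int16Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) {
    pcm[i] = decodeMuLawSample(bytes[i]);
  }
  return pcm;
}

// A trimmed-down stand-in for a Twilio "media" message (real ones carry more fields).
const message = JSON.stringify({
  event: 'media',
  media: { payload: Buffer.from([0xff, 0x80, 0x00]).toString('base64') },
});

const { media } = JSON.parse(message);
const samples = decodeMediaPayload(media.payload);
console.log(Array.from(samples)); // [ 0, 32124, -32124 ]
```

The three test bytes are silence, the loudest positive sample, and the loudest negative sample, which makes the sign and scaling easy to eyeball.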

ngrok operates as an API-gateway-in-development, tunneling Twilio’s traffic to my localhost.
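Because Twilio reaches the server through ngrok, the Host header on the /twiml request is the public ngrok hostname, which is why the handler can build its wss:// stream URL straight from req.headers.host. The sketch below assembles roughly the XML the twilio helper library produces for that route, without the library, so you can see what Twilio actually receives. The function names and the example.ngrok.app hostname are placeholders I made up.

```javascript
// Sketch only: hand-built approximation of the TwiML returned by the /twiml route.
// buildStreamTwiml, escapeXmlAttr, and the hostname below are hypothetical.

function escapeXmlAttr(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

function buildStreamTwiml(host) {
  // Same URL shape the Express handler uses: wss://<public host>/message
  const streamUrl = `wss://${host}/message`;
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<Response>',
    '<Start>',
    `<Stream url="${escapeXmlAttr(streamUrl)}"/>`,
    '</Start>',
    '<Say>Please start speaking.</Say>',
    '<Pause length="30"/>',
    '</Response>',
  ].join('');
}

const twiml = buildStreamTwiml('example.ngrok.app');
console.log(twiml);
```

Nothing in the server needs to know the ngrok domain ahead of time, which is part of why the same code can move between hostnames without edits.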

Whew.

Here’s what I used to set the project up:

Installed and configured the Twilio CLI.
Found and purchased a Twilio phone number with:
1. twilio api:core:available-phone-numbers:local:list --country-code="US" --voice-enabled --properties="phoneNumber"
2. twilio api:core:incoming-phone-numbers:create --phone-number="+123456789"
Set up a .env file with an OPENAI_API_KEY and TWILIO_AUTH_TOKEN.
Ran npm install to get dependencies.
Created a /temp directory for storing Twilio streams with mkdir temp.
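One note on that TWILIO_AUTH_TOKEN: the server above runs the webhook middleware with validate: false, so the token isn't used to check incoming requests. If you wanted to verify that a POST really came from Twilio, the scheme Twilio documents is an HMAC-SHA1 over the full request URL plus the sorted POST parameters, compared against the X-Twilio-Signature header. Here is a minimal sketch of that check using only Node's crypto module. The function names, token, URL, and parameters are all invented for the example, and in practice you would likely lean on the twilio package's own validation instead.

```javascript
// Sketch only: Twilio-style request signature check with Node's crypto module.
// All names and sample values here are hypothetical.
const crypto = require('crypto');

function computeTwilioSignature(authToken, url, params) {
  // Append each POST parameter as key + value, in sorted key order, to the URL.
  const data = Object.keys(params)
    .sort()
    .reduce((acc, key) => acc + key + params[key], url);
  return crypto.createHmac('sha1', authToken).update(data, 'utf8').digest('base64');
}

function isValidTwilioRequest(authToken, signature, url, params) {
  const expected = Buffer.from(computeTwilioSignature(authToken, url, params));
  const received = Buffer.from(signature || '');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && crypto.timingSafeEqual(expected, received);
}

// Made-up values standing in for a real request.
const token = 'not-a-real-token';
const url = 'https://example.ngrok.app/twiml';
const params = { To: '+15550000001', From: '+15550000002', CallSid: 'CA-hypothetical' };

const signature = computeTwilioSignature(token, url, params);
console.log(isValidTwilioRequest(token, signature, url, params));                       // true
console.log(isValidTwilioRequest(token, signature, url, { ...params, From: '+1555' })); // false
```

The URL that gets signed is the public one Twilio called, so behind ngrok that means the ngrok hostname rather than localhost.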

And the demo itself:

Started the Express+WS server with node server.js.
Started an ngrok agent with ngrok http 3000 --url twilio.{NGROK_DOMAIN}.
Ran the Twilio CLI to generate a phone call to my phone number: twilio api:core:calls:create --from="+123456789" --to="{MY_PHONE_NUMBER_PLEASE_DONT_ASK}" --url="https://twilio.{NGROK_DOMAIN}/twiml"
Talked about… the weather.
Watched the transcription and summary land in the console once the call ended.
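If you'd rather not place a phone call every time you tweak the message handling, you can replay the shape of a call by hand. Twilio's stream sends a connected message, then start, a run of media messages, and finally stop. The sketch below feeds a fake sequence like that through a small tracker that mirrors the switch statement in the server. The createStreamTracker name is mine, the messages are trimmed to the fields the server reads, and the 160-byte chunk size is an assumption (20 ms of 8 kHz audio).

```javascript
// Sketch only: replaying a fake Twilio Media Streams message sequence.
// createStreamTracker is a hypothetical helper; real messages carry more fields.

function createStreamTracker() {
  const state = { started: false, stopped: false, chunks: 0, bytes: 0 };
  return {
    state,
    handle(rawMessage) {
      const { event, media } = JSON.parse(rawMessage);
      switch (event) {
        case 'start':
          state.started = true;
          break;
        case 'media':
          state.chunks += 1;
          state.bytes += Buffer.from(media.payload, 'base64').length;
          break;
        case 'stop':
          state.stopped = true;
          break;
        default:
          // 'connected' and any other events are ignored, as in the server.
          break;
      }
      return state;
    },
  };
}

// Assumed chunk size: 160 bytes of silence per media message.
const chunk = Buffer.alloc(160, 0xff).toString('base64');
const fakeCall = [
  { event: 'connected' },
  { event: 'start' },
  { event: 'media', media: { payload: chunk } },
  { event: 'media', media: { payload: chunk } },
  { event: 'stop' },
].map((msg) => JSON.stringify(msg));

const tracker = createStreamTracker();
fakeCall.forEach((raw) => tracker.handle(raw));
console.log(tracker.state); // { started: true, stopped: true, chunks: 2, bytes: 320 }
```

The same list of fake messages could be sent to the local server with a WebSocket client, though the transcription step would then need real audio to say anything interesting.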

Here was the result:

Call connected. Starting media stream...
Call ended.

Transcription: It's September 27th today, and here in Tucson, Arizona, 
it's still over 100 degrees. I think it's 105 today, and it's supposed 
to be, you know, it's almost the end of September. It's almost October. 
It's fall. The average high for this time of year is supposed to be in 
the low 90s, I think. So we're talking 10 degrees plus where it's 
supposed to be. It's just totally unfair. That's all.

Summary: In Tucson, Arizona on September 27th, the temperature is 
unseasonably hot at over 100 degrees, which is about 10 degrees higher 
than the typical average for this time of year.

Crude? Yes. A great example of using ngrok to secure WS from a local webserver and make them accessible to a public API or service like Twilio? Absolutely.

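And in ngrok’s Traffic Inspector, you can also see the handshake itself: the request arrives as ordinary HTTP carrying a Connection: Upgrade header, the connection switches over to WS, and ngrok handles TLS for the whole exchange, securing your WS implementation without any extra configuration.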
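If you want to know what to look for in that upgrade exchange, the WebSocket spec (RFC 6455) has the client send a random Sec-WebSocket-Key and the server answer with a Sec-WebSocket-Accept derived from it. The ws package does this for you, so the snippet below is only a sketch for reading the headers in an inspector. The computeAcceptKey name is mine, and the sample key is the one from the RFC.

```javascript
// Sketch only: how a server derives Sec-WebSocket-Accept during the upgrade handshake.
// computeAcceptKey is a hypothetical helper name; the GUID is fixed by RFC 6455.
const crypto = require('crypto');

const WEBSOCKET_GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11';

function computeAcceptKey(secWebSocketKey) {
  // SHA-1 of the client's key concatenated with the fixed GUID, base64-encoded.
  return crypto
    .createHash('sha1')
    .update(secWebSocketKey + WEBSOCKET_GUID)
    .digest('base64');
}

// Sample key taken from the RFC 6455 handshake example.
const acceptKey = computeAcceptKey('dGhlIHNhbXBsZSBub25jZQ==');
console.log(acceptKey); // s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

A 101 Switching Protocols response with a matching accept value is the sign that the upgrade succeeded.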

ngrok makes the push from dev to prod easy, too

This project was a perfect example of the use case that got ngrok started more than 10 years ago: exposing local services (WebSockets and beyond) to public webhooks and APIs. Imagine how painful this would have been if I’d had to push my WS server to a production system after every change.

ngrok undoubtedly sped up the pace of my development process, but it’s also expanded far beyond that founding use case of webhook testing and tunneling to localhost. I could just as quickly and easily go live in production without a single change to how I use ngrok. Maybe I’d just add some Traffic Policy magic with request variables and CEL?

These are the kinds of use cases and paths to prod we’ll continue exploring in the next session of Office Hours. I’d love for you to join us! When you register for the next livestream, please ask your question in advance—these chats are yours to shape, and I can only craft demos like these if I know my goalposts ahead of time.

In the meantime, if you’ve been meaning to start developing a new app alongside Twilio or any other external API, give ngrok a try with a free account. The need to expose local services certainly hasn’t gone away, and probably never will. Even after all these years, ngrok remains your app’s simplest and most secure front door.

Joel Hans

Joel Hans is a Senior Developer Educator. Away from blog posts and demo apps, you might find him mountain biking, writing fiction, or digging holes in his yard.