A modular framework for building VoIP-Agent applications.
Dialog is an orchestration layer for VoIP-Agent applications. Two common VoIP-Agent models exist today: the Speech-to-Speech (S2S) model and the Speech-to-Text with Text-to-Speech (STT–TTS) model.
The S2S model converts spoken input into spoken output, while the STT–TTS model first converts speech into text, which is processed by an Agent; the Agent’s textual response is then converted back into speech. Both approaches involve tradeoffs.
Dialog adopts the STT–TTS model. It orchestrates communication between the VoIP, STT, TTS, and Agent modules. The framework provides concrete implementations of VoIP, STT, and TTS modules, along with abstract Agent classes designed for subclassing.
- Simple, extensible, modular framework
- Concrete implementations for VoIP, STT, and TTS, plus abstract Agent classes for extension
- Multithreaded deployments
- Event-driven architecture
- Isolated state — modules exchange objects but never share references
NB: Dialog is a well-architected, production-grade implementation; however, it is still undergoing active refactoring. Prior to 1.0.0, public interfaces may change on minor version increments, and commit messages will be minimal.
- Installation
- Usage
- Examples
- Architecture
- Implementations
- Custom Implementations
- Multithreading
- API
- Troubleshooting
- Alternatives
- Support
These instructions describe how to clone the Dialog repository and build the package.
git clone https://github.com/faranalytics/dialog.git
cd dialog
npm install && npm update
You can use the `clean:build` script in order to do a clean build.
npm run clean:build
Alternatively, you can use the `watch` script in order to watch and build the package. This will build the package each time you make a change to a file in `./src`. If you use the `watch` script, you will need to open a new terminal in order to build and run your application.
npm run watch
npm install <path-to-the-dialog-repository> --save
You should now be able to import Dialog artifacts into your package.
When a call is initiated, a `Gateway` (e.g., a Twilio `Gateway`) emits a `voip` event. The `voip` handler is called with a `VoIP` instance as its single argument. The `VoIP` instance handles the WebSocket connection that is set on it by the `Gateway`. In the `voip` handler, an instance of an `Agent` is constructed by passing a `VoIP`, `STT`, and `TTS` implementation into its constructor. The agent is started by calling its `activate` method. The `activate` method of the `Agent` instance connects the interfaces that comprise the application.
An important characteristic of the architecture is that a new instance of each participant in a Dialog application — `VoIP`, `STT`, `TTS`, and `Agent` — is created for every call. This allows each instance to maintain state specific to its call.
Excerpted from `src/main.ts`.
...
const gateway = new TwilioGateway({
httpServer,
webSocketServer,
webhookURL: new URL(WEBHOOK_URL),
authToken: TWILIO_AUTH_TOKEN,
accountSid: TWILIO_ACCOUNT_SID
});
gateway.on("voip", (voip: TwilioVoIP) => {
const agent = new TwilioVoIPOpenAIAgent({
voip: voip,
stt: new DeepgramSTT({ apiKey: DEEPGRAM_API_KEY, liveSchema: DEEPGRAM_LIVE_SCHEMA }),
tts: new CartesiaTTS({ apiKey: CARTESIA_API_KEY, speechOptions: CARTESIA_SPEECH_OPTIONS }),
apiKey: OPENAI_API_KEY,
system: OPENAI_SYSTEM_MESSAGE,
greeting: OPENAI_GREETING_MESSAGE,
model: OPENAI_MODEL,
twilioAccountSid: TWILIO_ACCOUNT_SID,
twilioAuthToken: TWILIO_AUTH_TOKEN
});
agent.activate();
});
...
Example implementations are provided in the examples subpackages.
In the Custom Twilio VoIP + OpenAI Agent example you will create a simple hypothetical Agent that prepends its messages with a timestamp and manages its conversation history.
In the Twilio VoIP (Worker Thread Bridge) example you will use a worker thread bridge in order to run each call session and Agent instance in a worker thread.
In the minimal Twilio VoIP + OpenAI Agent (Deepgram STT + Cartesia TTS) example you will subclass the provided abstract Agent implementation and implement the abstract `Agent.inference` method.
The following instructions apply to all the examples.
Each example includes a `.env.template` file with the variables required to construct the respective participants:
TWILIO_ACCOUNT_SID
TWILIO_AUTH_TOKEN
DEEPGRAM_API_KEY
ELEVEN_LABS_API_KEY
CARTESIA_API_KEY
OPENAI_API_KEY
KEY_FILE
CERT_FILE
HOST_NAME
PORT
WEBHOOK_URL
Copy the template to `.env` and fill in your own values. Do not commit real secrets.
The examples use simple HTTPS and WSS servers. Set `KEY_FILE` and `CERT_FILE` to the absolute paths of your TLS private key and certificate files on your system.
Each component of a Dialog orchestration, including the User(s), the Agent and its LLM(s), the STT model, the TTS model, and the VoIP implementation, is a participant.
The User participant is typically the human(s) who initiated an incoming call or answered an outgoing call. A User may also be another Agent.
The Agent participant is essential to assembling the external LLM and the VoIP, STT, and TTS implementations into a working whole. Dialog, as the orchestration layer, does not provide a concrete Agent implementation. Instead, you are provided with an interface and an abstract class that you can implement or subclass with your custom tool-calling logic. For example, an Agent will decide when to transfer a call; if the LLM determines the User's intent is to be transferred, the Agent can carry out this intent by calling the `VoIP.transferTo` method — or it could circumvent the provided call-transfer facilities entirely and make a direct call to the VoIP provider's API (e.g., Twilio, Telnyx, etc.). The point here is that very few architectural constraints should be imposed on the Agent; this ensures the extensibility of the architecture.
The STT participant transcribes the User speech into text. The STT emits utterance and VAD events that may be consumed by the Agent.
The TTS participant synthesizes the text received from the Agent and/or LLM. The TTS emits message events that may be consumed by the Agent.
The VoIP participant handles the incoming call, transcriptions, recordings, and streams audio into the STT.
Dialog favors simplicity and accessibility over feature richness. Its architecture should meet all the requirements of a typical VoIP-Agent application where many Users interact with a set of Agents. Although Dialog doesn't presently support concepts like "rooms", the simplicity and extensibility of its architecture should lend to even more advanced implementations.
Each participant in a Dialog orchestration must not directly mutate the state of another participant. Participants may emit messages and consume the messages of other participants, and they may hold references to each other; however, the mutation of an object held by one participant should never directly mutate the state of an object held by another participant. This is an important characteristic of Dialog participants — they exhibit isolated state: modules exchange objects but never share references. For example, a VoIP participant may emit a `Metadata` object that contains information about a given incoming call and is consumed by other participants; however, a subsequent mutation of the VoIP's `Metadata` must not mutate the `Metadata` held by another participant.
This strict separation of concerns ensures that participant state remains predictable and easy for a human to reason about. Likewise, the architecture is expected to be easy for LLMs to consume, as the LLM's attention can be focused on the pattern that is exhibited by the relevant participant.
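The isolated-state rule can be illustrated with a small, self-contained sketch. The `VoIPLike` class and `Metadata` shape below are hypothetical stand-ins, not part of the Dialog API; the point is that a participant emits a copy of its state, so later mutations never leak into a consumer.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical metadata shape for illustration only.
interface Metadata {
  callSid: string;
  from: string;
}

class VoIPLike extends EventEmitter {
  private metadata: Metadata = { callSid: "CA123", from: "+15551234567" };

  public emitMetadata(): void {
    // Emit a copy so consumers never share a reference with this participant.
    this.emit("metadata", structuredClone(this.metadata));
  }

  public mutate(): void {
    this.metadata.from = "+15557654321";
  }
}

const voip = new VoIPLike();
let received: Metadata | undefined;
voip.on("metadata", (m: Metadata) => { received = m; });
voip.emitMetadata();
voip.mutate();
// The consumer's copy is unaffected by the VoIP's later mutation.
console.log(received?.from); // "+15551234567"
```

Had `emitMetadata` emitted `this.metadata` directly, the consumer's object would have silently changed when the VoIP mutated its own state.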
+-------------+ audio (base64) +------------+ transcripts +----------+ text +-------------+
| Twilio | ------------------>| STT | -------------------> | Agent | -------> | TTS |
| VoIP | --metadata/events--| (Deepgram | --metadata/events--> | (OpenAI) | | (11Labs or |
| (WS in/out) | | or OpenAI) | | | | Cartesia) |
+-------------+ +------------+ +----------+ +-------------+
^ v
+----------------------------------------------------------------------------------------------+
audio (base64)
Dialog provides example implementations for each of the artifacts that comprise a VoIP-Agent application. You can use a packaged implementation as-is, subclass it, or implement your own. If you choose to implement a custom participant, you can use one of the provided participant interfaces.
- Twilio request validation
- Recording status
- Transcript status
- Speech interruption
An implementation similar to Twilio is planned. A placeholder exists under `src/implementations/voip/telnyx/`.
- Voice activity detection (VAD) events
- Voice activity detection (VAD) events
- Semantic VAD
- Configurable voice
- Configurable voice
- An abstract Agent implementation is provided that uses the OpenAI API.
Dialog provides concrete `VoIP`, `STT`, and `TTS` implementations and an abstract `Agent` implementation. You can use a provided implementation as-is, subclass it, or choose an interface and implement your own. If you plan to implement your own `VoIP`, `STT`, `Agent`, or `TTS`, interfaces are provided for each participant of the application.
A custom `Agent` implementation will allow you to facilitate tool calling, conversation history, and other nuances.
You can extend the provided `OpenAIAgent` class, as in the example below, or just implement the `Agent` interface. The straightforward `openai_agent.ts` implementation can be used as a guide.
This hypothetical custom `Agent` implementation adds a timestamp to each user message and maintains conversation history.
import { once } from "node:events";
import { randomUUID } from "node:crypto";
import {
log,
Message,
OpenAIAgent,
OpenAIAgentOptions,
TwilioMetadata,
TwilioVoIP,
OpenAIConversationHistory,
} from "@farar/dialog";
export interface TwilioCustomAgentOptions
extends OpenAIAgentOptions<TwilioVoIP> {
twilioAccountSid: string;
twilioAuthToken: string;
system?: string;
greeting?: string;
}
export class TwilioCustomAgent extends OpenAIAgent<TwilioVoIP> {
protected metadata?: TwilioMetadata;
protected twilioAccountSid: string;
protected twilioAuthToken: string;
protected history: OpenAIConversationHistory;
protected transcript: unknown[];
protected system: string;
protected greeting: string;
constructor(options: TwilioCustomAgentOptions) {
super(options);
this.twilioAccountSid = options.twilioAccountSid;
this.twilioAuthToken = options.twilioAuthToken;
this.transcript = [];
this.system = options.system ?? "";
this.greeting = options.greeting ?? "";
if (this.system) {
this.history = [
{
role: "system",
content: this.system,
},
];
} else {
this.history = [];
}
}
public inference = async (message: Message): Promise<void> => {
try {
const content = `${new Date().toISOString()}\n${message.data}`;
log.notice(`User message: ${content}`);
this.history.push({ role: "user", content });
const stream = await this.openAI.chat.completions.create({
model: this.model,
messages: this.history,
temperature: 1,
stream: true,
});
const assistantMessage = await this.dispatchStream(message.uuid, stream);
log.notice(`Assistant message: ${assistantMessage} `);
this.history.push({ role: "assistant", content: assistantMessage });
} catch (err) {
this.dispose(err);
}
};
public updateMetadata = (metadata: TwilioMetadata): void => {
if (!this.metadata) {
this.metadata = metadata;
} else {
this.metadata = { ...this.metadata, ...metadata };
}
};
public activate = (): void => {
super.activate();
this.voip.on("streaming_started", this.dispatchInitialMessage);
this.voip.on("streaming_started", this.startDisposal);
this.voip.on("metadata", this.updateMetadata);
};
public deactivate = (): void => {
super.deactivate();
this.voip.off("streaming_started", this.dispatchInitialMessage);
this.voip.off("streaming_started", this.startDisposal);
this.voip.off("metadata", this.updateMetadata);
};
public dispatchInitialMessage = (): void => {
const uuid = randomUUID();
this.activeMessages.add(uuid);
this.history.push({ role: "assistant", content: this.greeting });
this.dispatchMessage(
{ uuid: uuid, data: this.greeting, done: true },
false
).catch(this.dispose);
};
protected startDisposal = (): void => {
void (async () => {
try {
await once(this.voip, "streaming_stopped");
this.dispose();
} catch (err) {
log.error(err);
}
})();
};
}
Dialog provides a simple multithreading implementation you can use. An example is provided that demonstrates a multithreaded deployment.
A `Worker` is spun up for each call. VoIP events are propagated over a `MessageChannel` using the Port Agent RPC-like facility. This approach ensures that any peculiarity that takes place in handling one call will not interfere with other concurrent calls. Another notable aspect of this approach is that it permits hot changes to the Agent (and the STT and TTS) code without interrupting calls that are already underway — new calls will pick up changes each time a `Worker` is spun up.
In the excerpt below, a `TwilioVoIPWorker` is instantiated on each call.
Excerpted from `./src/main.ts`.
const gateway = new TwilioGateway({
httpServer,
webSocketServer,
webhookURL: new URL(WEBHOOK_URL),
authToken: TWILIO_AUTH_TOKEN,
accountSid: TWILIO_ACCOUNT_SID,
requestSizeLimit: 1e6,
});
gateway.on("voip", (voip: TwilioVoIP) => {
new TwilioVoIPWorker({ voip, worker: new Worker("./dist/worker.js") });
});
Over in `worker.js` the Agent is instantiated, as usual, except using a `TwilioVoIPProxy` instance that implements the `VoIP` interface.
Excerpted from `./src/worker.ts`.
const voip = new TwilioVoIPProxy();
const agent = new Agent({
voip: voip,
stt: new DeepgramSTT({
apiKey: DEEPGRAM_API_KEY,
liveSchema: DEEPGRAM_LIVE_SCHEMA,
}),
tts: new CartesiaTTS({
apiKey: CARTESIA_API_KEY,
speechOptions: CARTESIA_SPEECH_OPTIONS,
}),
apiKey: OPENAI_API_KEY,
system: OPENAI_SYSTEM_MESSAGE,
greeting: OPENAI_GREETING_MESSAGE,
model: OPENAI_MODEL,
twilioAccountSid: TWILIO_ACCOUNT_SID,
twilioAuthToken: TWILIO_AUTH_TOKEN,
});
agent.activate();
Dialog provides building blocks to create real‑time, voice‑driven agents that integrate telephony (VoIP), speech‑to‑text (STT), text‑to‑speech (TTS), and LLM agents. It includes interfaces, utility classes, and concrete implementations for Twilio VoIP, Deepgram STT, OpenAI Realtime STT, ElevenLabs TTS, Cartesia TTS, and an OpenAI‑based agent.
The API is organized by component. You can mix and match implementations by wiring them through the provided interfaces.
The logging utilities are thin wrappers around `streams-logger` for structured, backpressure-aware logging.
- `log` `<Logger>` An initialized Logger pipeline emitting to the console via the included `formatter` and `consoleHandler`.
- `formatter` `<Formatter<unknown, string>>` Formats log records into human-readable strings.
- `consoleHandler` `<ConsoleHandler<string>>` A console sink with level set to DEBUG.
- `SyslogLevel` `<enum>` The syslog-style levels exported from `streams-logger`.
Use these exports in order to emit structured logs across the library. See `streams-logger` for details on usage and configuration.
- `options` `<StreamBufferOptions>`
  - `bufferSizeLimit` `<number>` Optionally specify a maximum buffer size in bytes. Default: `1e6`
  - `writableOptions` `<stream.WritableOptions>` Optional Node.js stream options; use to customize `highWaterMark`, etc.
Use a `StreamBuffer` in order to buffer incoming stream chunks into a single in-memory `Buffer` with an upper bound. If the buffer exceeds the limit, an error is emitted.
public `streamBuffer.buffer` `<Buffer>` The accumulated buffer contents.
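The pattern can be sketched as follows. `BoundedBuffer` below is a simplified, self-contained stand-in for illustration, not the actual `StreamBuffer` implementation: a writable sink accumulates chunks into one `Buffer` and fails the write once the limit would be exceeded.

```typescript
import { Writable } from "node:stream";

// Simplified sketch of a bounded in-memory buffer sink (illustrative only).
class BoundedBuffer extends Writable {
  public buffer: Buffer = Buffer.alloc(0);

  constructor(private bufferSizeLimit: number = 1e6) {
    super();
  }

  public _write(chunk: Buffer, _enc: string, done: (err?: Error) => void): void {
    if (this.buffer.length + chunk.length > this.bufferSizeLimit) {
      // Fail the write; the stream will emit an "error" event.
      done(new Error("Buffer size limit exceeded."));
      return;
    }
    this.buffer = Buffer.concat([this.buffer, chunk]);
    done();
  }
}

const sink = new BoundedBuffer(16);
sink.write(Buffer.from("hello, "));
sink.write(Buffer.from("world"));
console.log(sink.buffer.toString()); // "hello, world"
```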
- `options` `<RequestBufferOptions>`
  - `req` `<http.IncomingMessage>` The HTTP request to read from.
  - `bufferSizeLimit` `<number>` Optionally specify a maximum body size in bytes. Default: `1e6`
Use a `RequestBuffer` in order to read and bound the body of an `IncomingMessage` into a string.
public `requestBuffer.body()`
Returns: `<Promise<string>>`
Read, buffer, and return the entire request body as a UTF-8 string. Emits `error` if the size limit is exceeded or the underlying stream errors.
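The behavior can be sketched like this. The `readBody` function below is a simplified stand-in that reads from a generic `Readable` rather than an actual `http.IncomingMessage`; the real `RequestBuffer` may differ in detail.

```typescript
import { Readable } from "node:stream";

// Read a stream fully into a UTF-8 string, enforcing a byte limit (sketch).
async function readBody(stream: Readable, bufferSizeLimit = 1e6): Promise<string> {
  const chunks: Buffer[] = [];
  let size = 0;
  for await (const chunk of stream) {
    const buf = Buffer.from(chunk);
    size += buf.length;
    if (size > bufferSizeLimit) {
      throw new Error("Request body size limit exceeded.");
    }
    chunks.push(buf);
  }
  return Buffer.concat(chunks).toString("utf-8");
}

// A form-encoded body arriving in two chunks, as Twilio webhooks might.
const body = await readBody(Readable.from(["AccountSid=AC123", "&CallSid=CA456"]));
console.log(body); // "AccountSid=AC123&CallSid=CA456"
```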
- `options` `<MutexOptions>`
  - `queueSizeLimit` `<number>` A hard limit imposed on all mark queues. `mutex.call` will throw if this limit is exceeded.
Use a `Mutex` in order to serialize asynchronous calls by key.
public `mutex.call(mark, fn, ...args)`
- `mark` `<string>` A key identifying the critical section.
- `fn` `<(...args: unknown[]) => Promise<unknown>>` An async function to execute exclusively per key.
- `...args` `<unknown[]>` Arguments forwarded to `fn`.
Returns: `<Promise<unknown>>`
Acquire the mutex for `mark`, invoke `fn`, and release the mutex, even on error.
public `mutex.acquire(mark)`
- `mark` `<string>` A key identifying the critical section.
Returns: `<Promise<void>>`
Wait until the mutex for `mark` is available and acquire it.
public `mutex.release(mark)`
- `mark` `<string>` A key identifying the critical section.
Returns: `<void>`
Release a previously acquired mutex for `mark`. Throws if called without a corresponding acquire.
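The serialization behavior can be illustrated with a minimal keyed mutex. `KeyedMutex` below is a simplified sketch of the pattern only; the real `Mutex` also forwards `...args` and enforces `queueSizeLimit`.

```typescript
// Minimal keyed-mutex sketch: calls sharing a mark run one at a time, in order.
class KeyedMutex {
  private tails = new Map<string, Promise<unknown>>();

  public async call<T>(mark: string, fn: () => Promise<T>): Promise<T> {
    const previous = this.tails.get(mark) ?? Promise.resolve();
    // Run after the prior holder completes, even if it rejected.
    const next = previous.then(fn, fn);
    this.tails.set(mark, next.catch(() => undefined));
    return next;
  }
}

const mutex = new KeyedMutex();
const order: number[] = [];
await Promise.all([
  mutex.call("call-sid", async () => { order.push(1); }),
  mutex.call("call-sid", async () => { order.push(2); }),
]);
console.log(order); // [1, 2]
```

Keying by mark (e.g., a call SID) lets unrelated calls proceed concurrently while serializing work within each call.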
These interfaces define the contracts between VoIP, STT, TTS, and Agent components.
- `uuid` `<UUID>` A unique identifier for correlation across components.
- `data` `<DataT>` The payload: audio (base64) or text, depending on the context.
- `done` `<boolean>` Whether the message is complete (end of stream/utterance).
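For illustration, a streamed reply can be represented as a series of `Message`-shaped chunks correlated by a single UUID, with the final chunk carrying `done: true`. The interface below is a local sketch based on the fields described above.

```typescript
import { randomUUID } from "node:crypto";

// Local sketch of the Message shape described above.
interface Message<DataT = string> {
  uuid: string;
  data: DataT;
  done: boolean;
}

// A two-chunk assistant reply correlated by one UUID.
const uuid = randomUUID();
const chunks: Message[] = [
  { uuid, data: "Hello, ", done: false },
  { uuid, data: "world.", done: true },
];
console.log(chunks.filter((m) => m.done).length); // 1
```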
- `inference` `(message: Message) => Promise<void>` Implement the main inference loop for a message.
- `activate` `() => void` Begin wiring events between components.
- `deactivate` `() => void` Remove event wiring.
Extends: `EventEmitter<STTEvents>`
Events (STTEvents):
- `"message"`: `[Message]` Emitted when a finalized transcription is available.
- `"vad"`: `[]` Emitted on voice activity boundary events (start/stop cues).
- `"error"`: `[unknown]` Emitted on errors.
Methods:
- `post` `(media: Message) => void` Post audio media into the recognizer (typically base64 payloads).
- `dispose` `() => void` Dispose resources and listeners.
Extends: `EventEmitter<TTSEvents>`
Events (TTSEvents):
- `"message"`: `[Message]` Emitted with encoded audio output chunks, and a terminal chunk with `done: true`.
- `"error"`: `[unknown]` Emitted on errors.
Methods:
- `post` `(message: Message) => void` Post text to synthesize. When `done` is `true`, the provider should flush and emit the terminal chunk.
- `abort` `(uuid: UUID) => void` Cancel a previously posted message stream.
- `dispose` `() => void` Dispose resources and listeners.
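A minimal in-memory stand-in can illustrate this contract. `EchoTTS` below is hypothetical and only mimics the event shape; a real provider synthesizes audio over a network connection.

```typescript
import { EventEmitter } from "node:events";

interface Message { uuid: string; data: string; done: boolean; }

// Hypothetical TTS: post() text in, "message" chunks out, terminal done: true.
class EchoTTS extends EventEmitter {
  public post(message: Message): void {
    // A real provider would emit synthesized audio; the text is echoed as base64.
    this.emit("message", {
      uuid: message.uuid,
      data: Buffer.from(message.data).toString("base64"),
      done: false,
    });
    if (message.done) {
      // Flush: emit the terminal chunk for this context.
      this.emit("message", { uuid: message.uuid, data: "", done: true });
    }
  }
  public abort(_uuid: string): void { /* cancel provider-side state */ }
  public dispose(): void { this.removeAllListeners(); }
}

const tts = new EchoTTS();
const out: Message[] = [];
tts.on("message", (m: Message) => out.push(m));
tts.post({ uuid: "u1", data: "Hi", done: true });
console.log(out.length); // 2
console.log(out[1].done); // true
```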
Extends: `EventEmitter<VoIPEvents<MetadataT, TranscriptT>>`
Events (VoIPEvents):
- `"metadata"`: `[MetadataT]` Emitted for call/session metadata updates.
- `"message"`: `[Message]` Emitted for inbound audio media frames (base64 payloads).
- `"message_dispatched"`: `[UUID]` Emitted when a downstream consumer has finished dispatching a message identified by the UUID.
- `"transcript"`: `[TranscriptT]` Emitted for transcription webhook updates, when supported.
- `"recording_url"`: `[string]` Emitted with a URL for completed recordings, when supported.
- `"streaming_started"`: `[]` Emitted when the media stream starts.
- `"streaming_stopped"`: `[]` Emitted when the media stream ends.
- `"error"`: `[unknown]` Emitted on errors.
Methods:
- `post` `(message: Message) => void` Post synthesized audio back to the call/session.
- `abort` `(uuid: UUID) => void` Cancel an in-flight TTS dispatch and clear provider state if needed.
- `hangup` `() => void` Terminate the call/session, when supported by the provider.
- `transferTo` `(tel: string) => void` Transfer the call to the specified telephone number, when supported.
- `dispose` `() => void` Dispose resources and listeners.
Twilio implementations provide inbound call handling, WebSocket media streaming, call control, recording, and transcription via Twilio.
- `options` `<TwilioGatewayOptions>`
  - `httpServer` `<http.Server>` An HTTP/HTTPS server for Twilio webhooks.
  - `webSocketServer` `<ws.Server>` A WebSocket server to receive Twilio Media Streams.
  - `webhookURL` `<URL>` The public webhook URL for the voice webhook (full origin and path).
  - `accountSid` `<string>` Twilio Account SID.
  - `authToken` `<string>` Twilio Auth Token.
  - `recordingStatusURL` `<URL>` Optional recording status callback URL. If omitted, a unique URL on the same origin is generated.
  - `transcriptStatusURL` `<URL>` Optional transcription status callback URL. If omitted, a unique URL on the same origin is generated.
  - `requestSizeLimit` `<number>` Optional limit (bytes) for inbound webhook bodies. Default: `1e6`
Use a `TwilioGateway` in order to accept Twilio voice webhooks, validate signatures, respond with a TwiML Connect `<Stream>` response, and manage the associated WebSocket connection and callbacks. On each new call, a `TwilioVoIP` instance is created and emitted.
Events:
- `"voip"`: `[TwilioVoIP]` Emitted when a new call is established and its `TwilioVoIP` instance is ready.
- `options` `<{ webSocket: ws.WebSocket, twilioGateway: TwilioGateway, callSidToTwilioVoIP: Map<string, TwilioVoIP> }>`
Use a `WebSocketListener` in order to translate Twilio Media Stream messages into `VoIP` events for the associated `TwilioVoIP` instance. This class is managed by `TwilioGateway` and not typically constructed directly.
public `webSocketListener.webSocket` `<ws.WebSocket>` The underlying WebSocket connection.
public `webSocketListener.startMessage` `<StartWebSocketMessage | undefined>` The initial "start" message, when received.
- `options` `<TwilioVoIPOptions>`
  - `metadata` `<TwilioMetadata>` Initial call/stream metadata.
  - `accountSid` `<string>` Twilio Account SID.
  - `authToken` `<string>` Twilio Auth Token.
  - `recordingStatusURL` `<URL>` Recording status callback URL.
  - `transcriptStatusURL` `<URL>` Transcription status callback URL.
Use a `TwilioVoIP` in order to send synthesized audio back to Twilio, emit inbound media frames, and control the call (transfer, hangup, recording, and transcription).
public `twilioVoIP.post(message)`
- `message` `<Message>` Post base64-encoded audio media back to Twilio over the Media Stream. When `done` is `true`, a marker is sent to allow downstream dispatch tracking.
Returns: `<void>`
public `twilioVoIP.abort(uuid)`
- `uuid` `<UUID>` A message UUID to cancel. Sends a cancel marker and clears state; when no active messages remain, a `clear` control message is sent.
Returns: `<void>`
public `twilioVoIP.transferTo(tel)`
- `tel` `<string>` A destination telephone number in E.164 format.
Returns: `<void>`
Transfer the active call to `tel` using TwiML.
public `twilioVoIP.hangup()`
Returns: `<void>`
End the active call using TwiML.
public `twilioVoIP.startTranscript()`
Returns: `<Promise<void>>`
Start Twilio call transcription (Deepgram engine) with `both_tracks`.
public `twilioVoIP.startRecording()`
Returns: `<Promise<void>>`
Begin dual-channel call recording with status callbacks.
public `twilioVoIP.stopRecording()`
Returns: `<Promise<void>>`
Stop the in-progress recording when applicable.
public `twilioVoIP.removeRecording()`
Returns: `<Promise<void>>`
Remove the last recording via the Twilio API.
public `twilioVoIP.dispose()`
Returns: `<void>`
Close the media WebSocket and clean up listener maps.
Helper types and type guards for Twilio webhook and Media Stream payloads.
- `Body` `<Record<string, string | string[] | undefined>>` A generic Twilio form-encoded body map.
- `CallMetadata` Extends `Body` with required Twilio voice webhook fields.
- `isCallMetadata(message)` Returns: `<message is CallMetadata>`
- `RecordingStatus` Extends `Body` with Twilio recording status fields.
- `isRecordingStatus(message)` Returns: `<message is RecordingStatus>`
- `TranscriptStatus` Extends `Body` with Twilio transcription status fields.
- `isTranscriptStatus(message)` Returns: `<message is TranscriptStatus>`
- `WebSocketMessage` `{ event: "start" | "media" | "stop" | "mark" }`
- `StartWebSocketMessage`, `MediaWebSocketMessage`, `StopWebSocketMessage`, `MarkWebSocketMessage` Specific Twilio Media Stream messages.
- `isStartWebSocketMessage` / `isMediaWebSocketMessage` / `isStopWebSocketMessage` / `isMarkWebSocketMessage` Type guards for the above.
- `TwilioMetadata` `Partial<StartWebSocketMessage> & Partial<CallMetadata>` A merged, partial metadata shape for convenience.
- `options` `<OpenAIAgentOptions<VoIPT>>`
  - `voip` `<VoIPT>` The telephony transport.
  - `stt` `<STT>` The speech-to-text provider.
  - `tts` `<TTS>` The text-to-speech provider.
  - `apiKey` `<string>` OpenAI API key.
  - `model` `<string>` OpenAI Chat Completions model identifier.
  - `queueSizeLimit` `<number>` A queueSizeLimit to be passed to the implementation's `Mutex` constructor.
Use an `OpenAIAgent` as a base class in order to build streaming, interruptible LLM agents that connect STT input, TTS output, and a VoIP transport. Subclasses implement `inference` to call OpenAI APIs and stream back responses.
public (abstract) `openAIAgent.inference(message)`
- `message` `<Message>` A transcribed user message to process.
Returns: `<Promise<void>>`
Implement this to call OpenAI and generate/stream the assistant's reply.
public `openAIAgent.post(message)`
- `message` `<Message>` Push a user message into the agent. Ignored if `message.data` is empty. The message UUID is tracked for cancellation.
Returns: `<void>`
public `openAIAgent.dispatchStream(uuid, stream, allowInterrupt?)`
- `uuid` `<UUID>` The message correlation identifier.
- `stream` `<Stream<OpenAI.Chat.Completions.ChatCompletionChunk>>` The OpenAI streaming iterator.
- `allowInterrupt` `<boolean>` Whether to allow VAD-driven interruption. Default: `true`
Returns: `<Promise<string>>`
Stream assistant tokens to TTS. When `allowInterrupt` is `false`, waits for a downstream `"message_dispatched"` before returning.
public `openAIAgent.dispatchMessage(message, allowInterrupt?)`
- `message` `<Message>` A pre-composed assistant message to play via TTS.
- `allowInterrupt` `<boolean>` Whether to allow VAD-driven interruption. Default: `true`
Returns: `<Promise<string>>`
Dispatch a complete assistant message to TTS with optional interruption handling.
public `openAIAgent.abort()`
Returns: `<void>`
Abort all active messages that are not currently being dispatched; cancels TTS and instructs the VoIP transport to clear state.
public `openAIAgent.dispose(err?)`
- `err` `<unknown>` Optional error to log.
Returns: `<void>`
Abort any in-flight OpenAI stream and dispose TTS, STT, and VoIP transports.
public `openAIAgent.setTTS(tts)`
- `tts` `<TTS>` Replacement TTS implementation.
Returns: `<void>`
Swap the current TTS implementation, updating event wiring.
public `openAIAgent.setSTT(stt)`
- `stt` `<STT>` Replacement STT implementation.
Returns: `<void>`
Swap the current STT implementation, updating event wiring.
public `openAIAgent.activate()`
Returns: `<void>`
Wire up `voip` → `stt` (media), `stt` → `agent` (messages, vad), and `tts` → `voip` (audio). Also subscribes to error and dispatch events.
public `openAIAgent.deactivate()`
Returns: `<void>`
Remove event wiring.
- `options` `<TwilioVoIPOpenAIAgentOptions>` Extends `OpenAIAgentOptions<TwilioVoIP>`
  - `twilioAccountSid` `<string>` Twilio Account SID used for authenticated media fetch.
  - `twilioAuthToken` `<string>` Twilio Auth Token used for authenticated media fetch.
  - `system` `<string>` Optional system prompt for conversation history. Default: `""`
  - `greeting` `<string>` Optional initial assistant greeting. Default: `""`
Use a `TwilioVoIPOpenAIAgent` in order to run an OpenAI-driven assistant over a Twilio call. It records the call, starts transcription, streams a greeting on connect, collects conversation history, and disposes once recording and transcription are complete.
public `twilioVoIPOpenAIAgent.updateMetadata(metadata)`
- `metadata` `<TwilioMetadata>` Merge updated Twilio metadata.
Returns: `<void>`
public `twilioVoIPOpenAIAgent.activate()`
Returns: `<void>`
Extends `OpenAIAgent.activate()` by wiring Twilio-specific events (stream start/stop, recording, transcript) and dispatching the initial greeting.
public `twilioVoIPOpenAIAgent.deactivate()`
Returns: `<void>`
Remove Twilio-specific wiring in addition to base wiring.
- `options` `<DeepgramSTTOptions>`
  - `apiKey` `<string>` Deepgram API key.
  - `liveSchema` `<LiveSchema>` Deepgram live connection options.
  - `queueSizeLimit` `<number>` A queueSizeLimit to be passed to the implementation's `Mutex` constructor.
Use a `DeepgramSTT` in order to stream audio to Deepgram Live and emit final transcripts. Emits `vad` on speech boundary messages. Automatically reconnects when needed.
public `deepgramSTT.post(message)`
- `message` `<Message>` Base64-encoded (PCM/Telephony) audio chunk.
Returns: `<void>`
public `deepgramSTT.dispose()`
Returns: `<void>`
Close the underlying connection and remove listeners.
- `options` `<OpenAISTTOptions>`
  - `apiKey` `<string>` OpenAI API key.
  - `session` `<Session>` Realtime transcription session configuration.
  - `queueSizeLimit` `<number>` A queueSizeLimit to be passed to the implementation's `Mutex` constructor.
Use an `OpenAISTT` in order to stream audio to OpenAI Realtime STT and emit `message` on completed transcriptions and `vad` on speech boundary events.
public `openaiSTT.post(message)`
- `message` `<Message>` Base64-encoded audio chunk.
Returns: `<void>`
public `openaiSTT.dispose()`
Returns: `<void>`
Close the WebSocket and remove listeners.
- `options` `<ElevenlabsTTSOptions>`
  - `voiceId` `<string>` Optional voice identifier. Default: `"JBFqnCBsd6RMkjVDRZzb"`
  - `apiKey` `<string>` ElevenLabs API key.
  - `headers` `<Record<string, string>>` Optional additional headers.
  - `url` `<string>` Optional override URL for the WebSocket endpoint.
  - `queryParameters` `<Record<string, string>>` Optional query parameters appended to the endpoint.
  - `timeout` `<number>` Optional timeout in milliseconds to wait for finalization when `done` is set. If the timeout elapses, a terminal empty chunk is emitted. Default: `undefined`
  - `queueSizeLimit` `<number>` A queueSizeLimit to be passed to the implementation's `Mutex` constructor.
Use an `ElevenlabsTTS` in order to stream synthesized audio back as it's generated. Supports message contexts (UUIDs), incremental text updates, flushing on `done`, and cancellation.
public `elevenlabsTTS.post(message)`
- `message` `<Message>` Assistant text to synthesize. When `done` is `true`, the current context is closed and finalization is awaited (with optional timeout).
Returns: `<void>`
public `elevenlabsTTS.abort(uuid)`
- `uuid` `<UUID>` The context to cancel; sends a flush and close if initialized.
Returns: `<void>`
public `elevenlabsTTS.dispose()`
Returns: `<void>`
Close the WebSocket.
- `options` `<CartesiaTTSOptions>`
  - `apiKey` `<string>` Cartesia API key.
  - `speechOptions` `<Record<string, unknown>>` Provider options merged into each request.
  - `url` `<string>` Optional override URL for the WebSocket endpoint. Default: `"wss://api.cartesia.ai/tts/websocket"`
  - `headers` `<Record<string, string>>` Optional additional headers merged with required headers.
  - `timeout` `<number>` Optional timeout in milliseconds to wait for finalization when `done` is set. If the timeout elapses, a terminal empty chunk is emitted. Default: `undefined`
  - `queueSizeLimit` `<number>` A queueSizeLimit to be passed to the implementation's `Mutex` constructor.
Use a `CartesiaTTS` in order to stream synthesized audio chunks for a given context UUID. Supports cancellation and optional finalization timeouts.
public `cartesiaTTS.post(message)`
- `message` `<Message>` Assistant text to synthesize; when `done` is `true`, the provider is instructed to flush and complete the context.
Returns: `<void>`
public `cartesiaTTS.abort(uuid)`
- `uuid` `<UUID>` The context to cancel.
Returns: `<void>`
public `cartesiaTTS.dispose()`
Returns: `<void>`
Close the WebSocket and remove listeners.
The following classes enable running VoIP handling in a worker thread using the `port_agent` library.
- `options` `<TwilioVoIPWorkerOptions>`
  - `worker` `<Worker>` The target worker thread to communicate with.
  - `voip` `<TwilioVoIP>` The local `TwilioVoIP` instance whose events and methods will be bridged.
Use a `TwilioVoIPWorker` in order to expose `TwilioVoIP` events and actions to a worker thread. It forwards VoIP events to the worker and registers callables that invoke the corresponding `TwilioVoIP` methods.
Use a `TwilioVoIPProxy` in order to consume VoIP events and call VoIP methods from inside a worker thread. It mirrors the `VoIP` interface and delegates the work to a host `TwilioVoIP` via the `port_agent` channel.
public `twilioVoIPProxy.post(message)`
- `message` `<Message>` Post synthesized audio.
Returns: `<void>`
public `twilioVoIPProxy.abort(uuid)`
- `uuid` `<UUID>` The context to cancel.
Returns: `<void>`
public `twilioVoIPProxy.hangup()`
Returns: `<void>`
public `twilioVoIPProxy.transferTo(tel)`
- `tel` `<string>` A destination telephone number in E.164 format.
Returns: `<void>`
public `twilioVoIPProxy.startRecording()`
Returns: `<Promise<void>>`
public `twilioVoIPProxy.stopRecording()`
Returns: `<Promise<void>>`
public `twilioVoIPProxy.startTranscript()`
Returns: `<Promise<void>>`
public `twilioVoIPProxy.dispose()`
Returns: `<void>`
Helper types for configuring OpenAI Realtime STT sessions and message discrimination.
public `Session` `<object>`
- `input_audio_format` `<"pcm16" | "g711_ulaw" | "g711_alaw">`
- `input_audio_noise_reduction` `{ type: "near_field" | "far_field" }` Optional noise reduction.
- `input_audio_transcription` `{ model: "gpt-4o-transcribe" | "gpt-4o-mini-transcribe", prompt?: string, language?: string }`
- `turn_detection` `{ type: "semantic_vad" | "server_vad", threshold?: number, prefix_padding_ms?: number, silence_duration_ms?: number, eagerness?: "low" | "medium" | "high" | "auto" }`
Discriminated unions for WebSocket messages are also provided with type guards: `WebSocketMessage` and `isCompletedWebSocketMessage`, `isSpeechStartedWebSocketMessage`, `isConversationItemCreatedWebSocketMessage`.
public `OpenAIConversationHistory` `<{ role: "system" | "assistant" | "user" | "developer", content: string }[]>` A conversation history array suitable for OpenAI chat APIs.
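For example, a history of this shape might be assembled as a call progresses. The types below are a local sketch of the documented shape, and the message contents are illustrative.

```typescript
// Local sketch of the OpenAIConversationHistory entry shape.
type Role = "system" | "assistant" | "user" | "developer";
interface HistoryEntry { role: Role; content: string; }

const history: HistoryEntry[] = [];
history.push({ role: "system", content: "You are a helpful voice assistant." });
history.push({ role: "assistant", content: "Hello! How can I help you today?" });
history.push({ role: "user", content: "What are your hours?" });
console.log(history.map((e) => e.role).join(",")); // "system,assistant,user"
```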
- Error loading key/cert: verify `KEY_FILE` and `CERT_FILE` paths are absolute and readable by the node process. Use self-signed certs for testing only.
- Client won't connect: ensure you are serving HTTPS on the `HOST_NAME`:`PORT` configured in `.env` and that your firewall allows inbound traffic.
- 403 Forbidden on webhook or upgrade: signature validation failed. Ensure `WEBHOOK_URL` matches exactly what Twilio calls (scheme, host, port, path). Do not modify the body before validation. Confirm `TWILIO_AUTH_TOKEN` is correct.
- 400 Bad Request: missing `x-twilio-signature` header or unsupported content type. Twilio must POST `application/x-www-form-urlencoded`.
- 404 Not Found on upgrade: the request path must match the internally generated WebSocket URL returned in the TwiML `<Connect><Stream>` response. Ensure your public host/port is reachable and consistent.
- Check that STT and TTS providers use compatible formats for your telephony path (e.g., mulaw/8kHz for G.711). The examples use `g711_ulaw` (OpenAI) or `mulaw`/8000 (Deepgram, Cartesia).
- ElevenLabs and Cartesia require a terminal `done: true` message to flush the final audio chunk. Ensure your agent dispatches it.
- Request body limit: `TwilioGateway` supports `requestSizeLimit`. If you see size-related errors, increase it to accommodate your environment.
- WebSocket frame limit: Twilio media frames should be small; if you see `WebSocket message too large`, ensure the upstream is sending properly sized frames.
- Verify network stability and TLS correctness on your VPS.
- For TTS timeouts (Cartesia/ElevenLabs), consider using the `timeout` options so the system emits a terminal empty chunk and unblocks.
There are a lot of great VoIP-Agent orchestration implementations out there. This is a selection of recommended implementations to consider.
If you have a feature request or run into any issues, feel free to submit an issue or start a discussion. You’re also welcome to reach out directly to one of the authors.