Version: 0.23.0

Gemini Live Integration

info

This tutorial requires a working Fishjam backend. If you haven't set one up yet, please check the Backend Quick Start.

This guide demonstrates how to build a real-time speech-to-speech agent using Fishjam and Google's Multimodal Live API. By connecting these two services, you can create a low-latency voice assistant that listens to peers in a room, responds in a natural voice, and provides a real-time transcription of the conversation.

Overview

The implementation acts as a bridge between two real-time streams:

  1. Fishjam ➡️ Gemini: The Agent receives audio from the room and forwards it to Google GenAI.
  2. Gemini ➡️ Fishjam: The Agent receives audio generated by Gemini and plays it back into the room.

To ensure these streams connect without audio glitches (garbled voice, wrong pitch), the audio sample rates must match between services.
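
Concretely, the Gemini Live API consumes 16-bit PCM audio at 16 kHz and produces audio at 24 kHz. The constants below are purely illustrative; the SDK presets used later in this guide carry these values for you.

// Illustrative only - GeminiIntegration's presets (shown later) encapsulate this.
const GEMINI_INPUT_SAMPLE_RATE_HZ = 16_000;  // audio sent from Fishjam to Gemini
const GEMINI_OUTPUT_SAMPLE_RATE_HZ = 24_000; // audio generated by Gemini for the room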

Prerequisites

You will need:

  • Fishjam Server Credentials: fishjamId and managementToken. You can get them at fishjam.io/app.
  • Google Gemini API Key: Obtainable from Google AI Studio.
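
The code samples in this guide read these credentials from environment variables (FISHJAM_ID, FISHJAM_TOKEN, and GOOGLE_API_KEY). Below is a minimal sketch of loading and validating them; the helper name is illustrative.

// Minimal sketch: fail fast if a required credential is missing.
// Variable names match the code samples below; the helper is illustrative.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

const fishjamId = requireEnv('FISHJAM_ID');
const managementToken = requireEnv('FISHJAM_TOKEN');
const googleApiKey = requireEnv('GOOGLE_API_KEY');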

Installation

Since the Google integration is optional, its dependencies need to be installed separately.

Ensure you have the Google GenAI SDK installed alongside the Fishjam server SDK:

npm install @fishjam-cloud/js-server-sdk @google/genai

Implementation

Step 1: Initialize Clients

We provide a helper factory to initialize the Google GenAI client.

import { FishjamClient } from '@fishjam-cloud/js-server-sdk';
import GeminiIntegration from '@fishjam-cloud/js-server-sdk/gemini';

const fishjamClient = new FishjamClient({
  fishjamId: process.env.FISHJAM_ID!,
  managementToken: process.env.FISHJAM_TOKEN!,
});

const genAi = GeminiIntegration.createClient({
  // Pass standard Google client options here
  apiKey: process.env.GOOGLE_API_KEY!,
});

Step 2: Configure the Agent

Create a Fishjam agent configured to match the audio format that the Google client expects (16kHz and 24kHz on Gemini input and output, respectively).

import GeminiIntegration from '@fishjam-cloud/js-server-sdk/gemini';

const room = await fishjamClient.createRoom();

const { agent } = await fishjamClient.createAgent(room.id, {
  subscribeMode: 'auto',
  // Use our preset to match the required audio format (16kHz)
  output: GeminiIntegration.geminiInputAudioSettings,
});

Step 3: Connect the Streams

Encoding

Fishjam handles raw bytes, while the Google GenAI SDK expects Base64-encoded strings. Ensure you convert between them correctly, as shown below.
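
Concretely, both directions of the conversion use Node's Buffer, exactly as the full listing below does:

// Stand-in for a chunk of raw PCM audio received from Fishjam.
const rawPcmBytes = new Uint8Array([0, 1, 2, 3]);

// Fishjam -> Gemini: raw bytes to a Base64 string
const base64Audio = Buffer.from(rawPcmBytes).toString('base64');

// Gemini -> Fishjam: Base64 string back to raw bytes
const pcmBytes = Buffer.from(base64Audio, 'base64');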

Now we set up the callbacks. We need to forward incoming Fishjam audio to Google, and incoming Google audio back to Fishjam.

import { Modality } from '@google/genai';
import GeminiIntegration from '@fishjam-cloud/js-server-sdk/gemini';

const GEMINI_MODEL = 'gemini-2.5-flash-native-audio-preview-12-2025';

// Use our preset to match the required audio format (24kHz)
const agentTrack = agent.createTrack(GeminiIntegration.geminiOutputAudioSettings);

const session = await genAi.live.connect({
  model: GEMINI_MODEL,
  config: { responseModalities: [Modality.AUDIO] },
  callbacks: {
    // Google -> Fishjam
    onmessage: (msg) => {
      if (msg.data) {
        const pcmData = Buffer.from(msg.data, 'base64');
        agent.sendData(agentTrack.id, pcmData);
      }
      if (msg.serverContent?.interrupted) {
        console.log('Agent was interrupted by user.');
        // Clears the buffer on the Fishjam media server
        agent.interruptTrack(agentTrack.id);
      }
    },
  },
});

// Fishjam -> Google
agent.on('trackData', ({ data }) => {
  session.sendRealtimeInput({
    audio: {
      mimeType: GeminiIntegration.inputMimeType,
      data: Buffer.from(data).toString('base64'),
    },
  });
});
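
Finally, when the conversation ends you will typically want to tear both sides down. The sketch below reuses the variables from the previous snippets; session.close() and fishjamClient.deleteRoom() are assumptions about the respective SDKs, so check their references for the exact cleanup methods.

// Rough teardown sketch reusing `session`, `fishjamClient`, and `room` from above.
// Method names are assumptions - verify them against the SDK references.
async function shutdown() {
  // Stop the Gemini Live session so no further audio is generated.
  session.close();

  // Removing the room also disconnects the agent on the Fishjam side.
  await fishjamClient.deleteRoom(room.id);
}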