Skip to content

rishitank/SpeakMCP

Β 
Β 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

277 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

SpeakMCP

🎀 AI-powered voice assistant with MCP integration - A fork of Whispo that transforms your voice into intelligent actions with advanced speech recognition, LLM processing, and Model Context Protocol (MCP) tool execution.

License: AGPL-3.0 Electron TypeScript React

🎬 Preview

speakmcp-vid.mp4

πŸš€ Quick Start

Download

Cross-Platform Support: macOS (Apple Silicon & Intel), Windows (x64), Linux (x64)

πŸ“₯ Download Latest Release

Basic Usage

Voice Recording:

  1. Hold Ctrl key to start recording your voice
  2. Release Ctrl to stop recording and transcribe
  3. Text is automatically inserted into your active application

MCP Agent Mode:

  1. Hold Ctrl+Alt to start recording for agent mode
  2. Release Ctrl+Alt to process with MCP tools
  3. Watch real-time progress as the agent executes tools
  4. Results are automatically inserted or displayed

Text Input:

  • Press Ctrl+T to open text input mode for direct typing

✨ Features

🎀 Voice & Speech

  • Voice-to-Text: Hold Ctrl to record, release to transcribe
  • Toggle Voice Dictation: Press Fn key to start/stop recording (configurable)
  • Multi-Language Support: 30+ languages including Spanish, French, German, Chinese, Japanese, Arabic, Hindi
  • Text-to-Speech (TTS): AI-generated speech with 50+ voices across OpenAI, Groq, and Gemini
  • Auto-Play TTS: Automatic speech playback for seamless conversations

πŸ€– AI Agent & MCP

  • MCP Agent Mode: Hold Ctrl+Alt for intelligent tool execution with real-time progress
  • MCP Integration: Connect to any MCP-compatible tools and services
  • OAuth 2.1 Support: Secure authentication for MCP servers with deep link integration
  • Tool Management: Per-server tool toggles and approval prompts
  • Conversation Continuity: Context preservation across agent interactions

πŸ› οΈ Platform & Performance

  • Cross-Platform: macOS, Windows, and Linux support with native builds
  • Rate Limit Handling: Exponential backoff retry for API rate limits (429 errors)
  • Model Selection: Choose specific models for OpenAI, Groq, and Gemini providers
  • Debug Modes: Comprehensive logging for LLM calls, tool execution, and TTS
  • Universal Integration: Works with any text-input application

🎨 User Experience

  • Text Input: Press Ctrl+T for direct text input mode
  • Dark/Light Themes: Toggle between dark and light modes
  • Resizable Panels: Drag-to-resize interface components
  • Kill Switch: Emergency stop for agent operations (Escape key)
  • Conversation Management: Full conversation history with tool call visualization

πŸ—οΈ Architecture

Built with modern technologies for cross-platform performance:

  • Electron: Main process for system integration, MCP orchestration, and TTS processing
  • React + TypeScript: Modern UI with real-time progress tracking and conversation management
  • Rust: High-performance keyboard monitoring and text injection across platforms
  • MCP Client: Full Model Context Protocol implementation with OAuth 2.1 support
  • Multi-Provider AI: OpenAI, Groq, and Gemini integration for speech, text, and TTS

πŸ› οΈ Development

Prerequisites: Node.js 18+, pnpm, Rust toolchain

# Setup
git clone https://github.com/aj47/SpeakMCP.git
cd SpeakMCP
pnpm install
pnpm build-rs  # Build Rust binary for your platform
pnpm dev       # Start development server

# Platform-specific builds
pnpm build        # Production build for current platform
pnpm build:mac    # macOS build (Apple Silicon + Intel)
pnpm build:win    # Windows build (x64)
pnpm build:linux  # Linux build (x64)

# Testing
pnpm test         # Run test suite
pnpm test:tts     # Test TTS functionality

βš™οΈ Configuration

AI Providers: OpenAI, Groq, Google Gemini

  • Configure API keys and custom base URLs in settings
  • Select specific models for each provider
  • Multi-language speech recognition support
  • TTS with 50+ voices across providers

MCP Servers: Configure tools in mcpServers JSON format:

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["@modelcontextprotocol/server-filesystem", "/path"]
    },
    "web-search": {
      "command": "npx",
      "args": ["@modelcontextprotocol/server-web-search"],
      "env": {"BRAVE_API_KEY": "your-key"}
    }
  }
}

Keyboard Shortcuts:

  • Hold Ctrl: Voice recording (traditional mode)
  • Fn Key: Toggle voice dictation (press once to start/stop)
  • Hold Ctrl+Alt: MCP agent mode
  • Ctrl+T: Text input mode
  • Escape: Cancel/kill switch for operations

πŸ€– MCP Agent Mode

MCP (Model Context Protocol) enables AI assistants to connect to external tools. SpeakMCP implements a full MCP client with advanced capabilities.

Enhanced Features:

  • Intelligent Tool Selection: Automatically determines which tools to use
  • Real-time Progress: Visual feedback with TTS narration during execution
  • Conversation Continuity: Context preservation across multi-turn interactions
  • OAuth 2.1 Integration: Secure authentication for MCP servers
  • Rate Limit Handling: Automatic retry with exponential backoff
  • Kill Switch: Emergency stop functionality with Escape key
  • Tool Management: Per-server tool toggles and approval prompts

Example commands:

  • "Create a new project folder and add a README"
  • "Search for latest AI news and summarize the top 3 articles"
  • "Send a message to the team about today's progress"
  • "Analyze this codebase and suggest improvements"

πŸ†• What's New

Recent Major Features:

🎡 Text-to-Speech (TTS) Integration

  • 50+ AI Voices: OpenAI (6 voices), Groq (23 voices), Gemini (30+ voices)
  • Auto-Play: Seamless conversation flow with automatic speech playback
  • Smart Preprocessing: Converts code blocks, URLs, and markdown to natural speech
  • Multi-Language: Support for 30+ languages with native pronunciation

πŸ–₯️ Cross-Platform Support

  • Windows Build: Full Windows compatibility with native builds
  • Enhanced macOS: Apple Silicon and Intel support
  • Linux Ready: Complete Linux build pipeline

πŸŽ›οΈ Enhanced Voice Controls

  • Toggle Voice Dictation: Press Fn key to start/stop recording
  • Multi-Language Recognition: 30+ languages with automatic detection
  • Configurable Hotkeys: Customize keyboard shortcuts for all functions

πŸ”§ Reliability & Performance

  • Rate Limit Handling: Automatic retry with exponential backoff for API limits
  • OAuth 2.1: Secure authentication for MCP servers with deep link integration
  • Kill Switch: Emergency stop functionality for all operations
  • Model Selection: Choose specific AI models for each provider

πŸ› Debug Mode

For development and troubleshooting, SpeakMCP includes comprehensive debug logging:

# Enable all debug modes
pnpm dev d               # Shortest option
pnpm dev debug-all       # Readable format

# Enable specific modes
pnpm dev debug-llm       # LLM calls and responses
pnpm dev debug-tools     # MCP tool execution
pnpm dev debug-tts       # Text-to-speech debugging

🀝 Contributing

We welcome contributions! Fork the repo, create a feature branch, and open a Pull Request.

πŸ’¬ Get help on Discord | 🌐 More info at techfren.net

πŸ“„ License

This project is licensed under the AGPL-3.0 License.

πŸ™ Acknowledgments

  • Whispo - This project is a fork of Whispo, the original AI voice assistant
  • OpenAI for Whisper speech recognition and GPT models
  • Anthropic for Claude and MCP protocol development
  • Model Context Protocol for the extensible tool integration standard
  • Electron for cross-platform desktop framework
  • React for the user interface
  • Rust for system-level integration
  • Groq for fast inference capabilities
  • Google for Gemini models

Made with ❀️ by the SpeakMCP team

About

Voice Powered Agent Delegation

Resources

License

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • TypeScript 94.9%
  • CSS 1.4%
  • PowerShell 1.4%
  • JavaScript 1.3%
  • Rust 0.5%
  • HTML 0.3%
  • Shell 0.2%