diff --git a/.gitignore b/.gitignore
index 87df5a1..b4848fb 100644
--- a/.gitignore
+++ b/.gitignore
@@ -22,3 +22,4 @@ share/python-wheels/
 .installed.cfg
 *.egg
 MANIFEST
+CLAUDE.md
diff --git a/README.md b/README.md
index 47d3eb2..3a7e4b2 100644
--- a/README.md
+++ b/README.md
@@ -1,27 +1,91 @@
-# speech_compass
+# SpeechCompass: Enhancing Mobile Captioning with Diarization and Directional Guidance via Multi-Microphone Localization
 
-This repository contains the code accompanying the publication
-**SpeechCompass: Enhancing Mobile Captioning with Diarization and Directional
-Guidance via Multi-Microphone Localization**,
-published in CHI, 2025. (https://arxiv.org/abs/2502.08848)
+[![CHI 2025 Best Paper](https://img.shields.io/badge/CHI%202025-Best%20Paper%20Award-gold)](https://dl.acm.org/doi/10.1145/3706598.3713631)
+[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
+[![arXiv](https://img.shields.io/badge/arXiv-2502.08848-b31b1b.svg)](https://arxiv.org/abs/2502.08848)
 
-## Installation
+[Paper (PDF)](https://arxiv.org/pdf/2502.08848) | [ACM Digital Library](https://dl.acm.org/doi/10.1145/3706598.3713631) | [Project Page](https://www.olwal.com/speechcompass) | [Google Research Blog](https://research.google/blog/making-group-conversations-more-accessible-with-sound-localization/)
 
-Setting up the whole system requires multiple steps and custom hardware. Refer
-to details in doc folder for each step:
+Artem Dementyev*, Dimitri Kanevsky, Samuel J. Yang, Mathieu Parvaix, Chiong Lai, Alex Olwal*
 
-1) Custom hardware. The microphone phone-case was custom designed.
+Official code release for **SpeechCompass: Enhancing Mobile Captioning with Diarization and
+Directional Guidance via Multi-Microphone Localization**, published at CHI 2025.
 
-2) Firmware. The phone-case microcontroller needs to be flashed with firmware
+[Video 4:24](https://www.youtube.com/watch?v=crWXO5T5jaQ) | [Presentation 9:30](https://www.youtube.com/watch?v=cOnMxClQZ4g)
 
-2) DSP algorithms. The core processing algorithms were developed in light-weight
-C to be platform agnostic. They can be tested separately.
+![SpeechCompass teaser](docs/images/speech_compass_teaser.jpg)
 
-3) Android application. The app was developed in Android studio
+*First and last author contributed equally to this work
+
+## Overview
+
+Mobile speech-to-text apps have a fundamental limitation in group conversations: they
+transcribe everything into a single undifferentiated stream, making it hard to follow who
+said what. SpeechCompass addresses this by adding a spatial dimension — using multiple
+microphones to localize speakers in real time and overlay directional guidance on live
+captions.
+
+The system is designed with accessibility in mind, particularly for people who are hard of
+hearing. Rather than relying on machine learning approaches that require video, speaker
+embeddings, or high compute, SpeechCompass uses classical DSP (GCC-PHAT + kernel density
+estimation) that runs on a low-power embedded microcontroller with low latency and no voice
+data retention.
+
+![App](docs/images/app.jpg)
+
+### Visualizations
+
+The Android app offers multiple ways to display speaker direction alongside captions:
+
+- **Colored text** — each speaker gets a distinct color
+- **Directional arrows and glyphs** — indicate where speech is coming from
+- **Radar minimap** — a persistent spatial overview of active speakers
+- **Edge indicators** — subtle screen-edge cues for peripheral awareness
+- **Speech suppression** — filter out speech from a specific direction
+
+### Performance
+
+- **Localization accuracy:** 11°–22° average error at normal conversational volume (60–65 dB),
+  comparable to human localization ability
+- **Diarization:** 4-microphone configuration achieves 23–35% relative improvement in
+  Diarization Error Rate (DER) over a 3-microphone setup across varying SNR conditions
+
+### User Research
+
+A survey of 263 frequent captioning users identified speaker distinction as the most
+significant unmet need. In a follow-up prototype study with 8 frequent users, colored text
+and directional arrows were the preferred visualizations, and all participants agreed that
+directional guidance was valuable for group conversations.
+
+## System
+
+![System diagram](docs/images/system_diagram.png)
+
+SpeechCompass combines a custom hardware phone case with lightweight on-device processing:
+
+- A **4-microphone phone case** sends audio to an STM32 L5 microcontroller, which runs
+  GCC-PHAT localization and streams azimuth angles to the phone over USB
+- The **Android app** uses the phone's built-in microphone for speech recognition (ASR)
+  and receives speaker direction from the case — keeping voice data local and processing
+  costs low
+- The **DSP algorithms** are written in portable C11 and can also run on phones with
+  2+ built-in microphones, providing 180° localization without additional hardware
+
+## Repository Structure
+
+| Component | Description |
+|-----------|-------------|
+| [`hardware/README.md`](hardware/README.md) | PCB schematics for the custom 4-microphone phone case |
+| [`firmware/README.md`](firmware/README.md) | STM32 L5 firmware (GCC-PHAT localization → USB output) |
+| [`dsp/README.md`](dsp/README.md) | Platform-agnostic C localization and beamforming algorithms, with Bazel unit tests |
+| [`android/README.md`](android/README.md) | Android Studio app (ASR + directional visualization) |
+
+Each component can be used independently — in particular, the DSP algorithms can be built
+and tested with Bazel without any hardware.
 
 ## Citing this work
 
-```
+```bibtex
 @inproceedings{dementyev2025speechcompass,
   title={SpeechCompass: Enhancing Mobile Captioning with Diarization and Directional Guidance via Multi-Microphone Localization},
   author={Dementyev, Artem and Kanevsky, Dimitri and Yang, Samuel and Parvaix, Mathieu and Lai, Chiong and Olwal, Alex},
@@ -30,6 +94,24 @@ C to be platform agnostic. They can be tested separately.
 }
 ```
 
+## Related Work
+
+SpeechCompass builds on **LiveLocalizer** (UIST 2023), which first demonstrated
+microphone-array localization augmenting mobile speech-to-text. The same hardware can run
+the SpeechCompass firmware.
+
+> Dementyev, A., Kanevsky, D., Yang, S., Parvaix, M., Lai, C., and Olwal, A.
+> "LiveLocalizer: Augmenting Mobile Speech-to-Text with Microphone Arrays, Optimized
+> Localization and Beamforming." *UIST 2023 Adjunct*, San Francisco, CA.
+> [ACM DL](https://dl.acm.org/doi/10.1145/3586182.3615789)
+
+## Acknowledgments
+
+We thank Sagar Savla, Dmitrii Votintcev, Pascal Getreuer, Richard Lyon, Alex Huang, Shao-Fu Shih,
+Chet Gnegy, Shaun Kane, James Landay, Malcolm Slaney, Meredith Morris, Carson Lau,
+Ngan Nguyen, Mei Lu, Don Barnett, Ryan Geraghty, and Sanjay Batra for their contributions
+and support.
+
 ## License and disclaimer
 
 Copyright 2025 Google LLC
diff --git a/android/README.md b/android/README.md
new file mode 100644
index 0000000..63573da
--- /dev/null
+++ b/android/README.md
@@ -0,0 +1,34 @@
+# Android Application
+
+The app runs on the phone. It uses the phone's built-in microphone for speech recognition
+(ASR) and receives azimuth angle data from the SpeechCompass phone case over USB.
+Visualizations are built with the [Processing for Android](https://android.processing.org/)
+framework.
+
+![App screenshot](https://github.com/google-deepmind/speech_compass/blob/main/docs/images/app.jpg)
+
+## Quickest way: install the pre-built APK
+
+If you don't need to modify the app, sideload the pre-built APK via ADB:
+
+```
+adb install path/to/speechcompass.apk
+```
+
+The APK is available on
+[Google Drive](https://drive.google.com/file/d/15mf4d6tlzD6GbkNFa18XGd1UUCz8RhcP/view?usp=drive_link&resourcekey=0-Whdp8aFD-M6qDvHfQQJZww).
+Connect the phone to your PC before running the command.
+
+> The app may stop working on newer Android versions due to API changes.
+
+## Building from source
+
+1. Install the latest [Android Studio](https://developer.android.com/studio).
+
+2. Download the [zipped Android Studio project](TODO) and unzip it.
+
+3. Open Android Studio and import the project (**File → Open**).
+
+4. Build the project (**Build → Make Project**).
+
+5. Connect the phone over USB and click **Run** to install and launch.
diff --git a/docs/algorithms/index.md b/docs/algorithms/index.md
deleted file mode 100644
index 9618e33..0000000
--- a/docs/algorithms/index.md
+++ /dev/null
@@ -1,28 +0,0 @@
-# Localization and Beamforming algorithms
-
-## Localization
-
-The localization algorithms are in the /dsp subfolder. We made lightweight
-localization algorithms in C. The low-level implementation allows the algorithms
-to be ported to different embedded platforms. Our localization algorithm is
-based on generalized cross-correlation with phase transform (GCC-PHAT) [1] and
-statistical estimation of source location.
-
-We used a slightly modified GCC-PHAT approach to calculate the cross correlation
-between microphone pairs. In our case, we used normalization to the power of
--0.3. Also, while most localizers use SPR (Steered Power Response), we used an
-ad-hoc lightweight statistical estimation based on Kernel Density Estimation.
-
-## Beamforming
-
-We have two classical beamformer implementations: Delay-and-Sum (DAS) and
-Filter-and-Sum (FAS) located in the /beam subfolder. The beamformer takes four
-channels and outputs one beamformer channel. We ended up focusing on the
-localization in the SpeechCompass paper, so the firmware doesn't run
-beamforming.
-
-### References
-
-[1] Knapp, C. H. and G.C. Carter, “The Generalized Correlation Method for
-Estimation of Time Delay.” IEEE Transactions on Acoustics, Speech and Signal
-Processing. Vol. ASSP-24, No. 4, Aug 1976.
diff --git a/docs/android/index.md b/docs/android/index.md
deleted file mode 100644
index a482688..0000000
--- a/docs/android/index.md
+++ /dev/null
@@ -1,35 +0,0 @@
-# Android application
-
-This application runs on the phone, and receives the data from the SpeechCompass
-phone case over the USB. We used the
-[Processing](https://android.processing.org/) framework for visualizations.
-
-## Simplest way to run the app
-
-If no debugging or development is need, loading the app
-[APK](https://drive.google.com/file/d/15mf4d6tlzD6GbkNFa18XGd1UUCz8RhcP/view?usp=drive_link&resourcekey=0-Whdp8aFD-M6qDvHfQQJZww)
-over [Android Debug Bridge](https://developer.android.com/tools/adb) (adb) is
-the easiest way. To do so connect the phone to the PC, open the terminal and
-load the app with the command line:
-
-```adb install path_to_app```
-
-The app might stop running on the newer version of Android.
-
-## Building the Android application
-
-Building the app is more involved, especially for first time users.
-
-1) Download and install latest
-[Android studio](https://developer.android.com/studio).
-
-2) Download the
-[zipped](TODO)
-Android Studio project for SpeechCompass.
-
-3) Import the project to Android Studio.
-
-4) Build the project.
-
-5) Connect the phone over USB and load the application by clicking the Run
-button in Android Studio.
diff --git a/docs/firmware/index.md b/docs/firmware/index.md
deleted file mode 100644
index cf1d272..0000000
--- a/docs/firmware/index.md
+++ /dev/null
@@ -1,51 +0,0 @@
-# Firmware
-
-The firmware runs on a low-power microcontroller (STM32 L5). It gets the raw
-microphone data, runs lightweight localization and signal processing algorithms
-and outputs results to the USB. Loading firmware on the MCU will need a cable
-and an ST-LINK programmer. The steps assume some previous experience with STM32.
-
-## Compiling the firmware
-
-We used [STM32Cube IDE](https://www.st.com/software/stm32cube-ide) for firmware
-development. It provides all the convenient tools for embedded ARM development.
-We used the STM32CubeMX to create the project template and import the necessary
-drivers. The most convenient way to access the code is to compile the code using
-STM32Cube IDE as follows:
-
-1) Install the STM32 CUBE IDE and ST-LINK toolchain.
-
-2) Download the
-[zipped project](https://drive.google.com/file/d/1aSLFQMz3HJg2O-bxhoN2yHJ5k2ODyI81/view?usp=sharing&resourcekey=0-FB9BwKRDcssJl4RME0ycYQ)
-and unzip it.
-
-3) Import the project into STM32Cube IDE. Click on File -> Import -> Existing
-Projects into Workspace, and select the project folder.
-
-4) Build the project. The console should show no errors.
-
-## Loading the firmware
-
-Loading and debugging the firmware on the microcontroller is more involved as it
-requires a programmer and a specific connector.
-
-1) Get an
-[ST-LINK programmer](https://www.mouser.com/ProductDetail/STMicroelectronics/STLINK-V3MINIE?qs=MyNHzdoqoQKcLQe5Jawcgw%3D%3D)
-and a special
-[connector/cable](https://www.tag-connect.com/product/tc2030-ctx-stdc14-for-use-with-stm32-processors-with-stlink-v3).
-We used such a connector to reduce physical footprint.
-
-2) Plug in a USB cable for board power. The programmer doesn't provide power for
-the board.
-
-3) Open the STM32Cube project and compile. Alternatively, this can be done
-without STM32Cube IDE by flashing the compiled binary file with the code. This
-can be done over a terminal, but still needs ST-LINK drivers installed.
-
-4) Connect and hold the connector to the board and upload by clicking the debug
-button. If doing this the first time, the programmer might need to be
-configured.
-
-5) Open a serial terminal (e.g, Arduino IDE) on a PC connected to the board over
-USB. Make sure the correct port is selected. The baud rate doesn't matter. You
-should see angles coming in and printing.
diff --git a/docs/hardware/index.md b/docs/hardware/index.md
deleted file mode 100644
index f6643ae..0000000
--- a/docs/hardware/index.md
+++ /dev/null
@@ -1,35 +0,0 @@
-# SpeechCompass hardware design
-
-The hardware is composed of two PCBs: the main board with the microcontroller
-and flexible PCB connecting all the microphones together.
-
-
-![Phone case](https://github.com/google-deepmind/speech_compass/blob/main/docs/images/electronics.jpg)
-
-## Main PCB
-
-The main PCB is a motherboard that has the STM32 microcontroller and I/O ports.
-The board includes an audio codec that provides headphone output. There is a
-Bluetooth module as well, but we are not using it. With Bluetooth and the
-battery, the system does not need to be tethered to the phone.
-[Schematic pdf](https://github.com/google-deepmind/speech_compass/blob/main/docs/hardware/main_board_schematic.pdf)
-
-## Flex PCB
-
-Flexible PCB is mainly a cable to connect the microphones to the main board. The
-surface mount microphones were soldered to the flex PCB.
-[Schematic pdf](https://github.com/google-deepmind/speech_compass/blob/main/docs/hardware/flex_pcb_schematic.pdf)
-
-## Old version (LiveLocalizer)
-
-Our initial version of the phone case had one rigid board for everything. (See
-UIST demo [proceedings](https://dl.acm.org/doi/10.1145/3586182.3615789) for
-details). It is more bulky but it is simpler to build and uses microphones on a
-breakout boards. It can run the same firmware.
-
-![Phone case](https://github.com/google-deepmind/speech_compass/blob/main/docs/images/livelocalizer.png)
-
-## Firmware
-
-The firmware runs on the microcontroller. Mainly it runs the localization
-algorithm and sends the data to the phone
diff --git a/docs/index.md b/docs/index.md
deleted file mode 100644
index 5f2bc03..0000000
--- a/docs/index.md
+++ /dev/null
@@ -1,41 +0,0 @@
-# SpeechCompass
-
-(This is not an officially supported Google product.)
-
-SpeechCompass is a real-time, multi-microphone speech localization,
-visualization, and diarization platform. We believe that adding a spatial
-dimension to sound understanding can greatly improve the usability of audio
-interfaces. For more details see our publication in
-[CHI'25](https://arxiv.org/pdf/2502.08848)
-
-
-![Phone case](images/speech_compass_teaser.jpg)
-
-## Multi microphone phone case design
-
-To allow experimentation, we designed a custom hardware phone case with embedded
-four microphones. The localization data is sent from the phone case to the phone
-over USB.
-
-![Phone case](images/phone_case.jpg)
-
-## Lightweight localization and beamforming
-
-We implement localization and beamforming algorithms capable of running in
-real-time on low-power microcontroller.
-
-![Phone case](images/system_diagram.png)
-
-## Android visualization application
-
-The ASR and visualizations runs as an app on the phone. It actually uses phone
-microphone for the ASR and receives the sound direction from the phone case over
-USB. ![Phone case](images/app.jpg)
-
-
-## Documentation
-
-* [Hardware](https://github.com/google-deepmind/speech_compass/blob/main/docs/hardware/index.md)
-* [Firmware](https://github.com/google-deepmind/speech_compass/blob/main/docs/firmware/index.md)
-* [Android application](https://github.com/google-deepmind/speech_compass/blob/main/docs/android/index.md)
-* [DSP algorithms](https://github.com/google-deepmind/speech_compass/blob/main/docs/algorithms/index.md)
diff --git a/dsp/README.md b/dsp/README.md
new file mode 100644
index 0000000..b5d9143
--- /dev/null
+++ b/dsp/README.md
@@ -0,0 +1,54 @@
+# DSP Algorithms
+
+Lightweight, platform-agnostic C (C11) implementations of the localization and beamforming
+algorithms. Designed to run on low-power microcontrollers but fully testable on desktop
+with Bazel.
+
+## Localization (`dsp/`)
+
+Localization is based on Generalized Cross-Correlation with Phase Transform (GCC-PHAT) [1].
+
+- **`gcc_phat.c/.h`** — Frequency-domain cross-correlation with partial phase normalization
+  (exponent −0.3). Operates on a single microphone pair.
+- **`tdoa.c/.h`** — Extracts Time Difference of Arrival (TDOA) from GCC-PHAT peaks and
+  converts delays to azimuth angles. Uses ARM CMSIS DSP for FFTs on embedded targets.
+- **`angle_estimation.c/.h`** — Aggregates TDOA measurements from all mic pairs into a
+  single azimuth estimate (0–359°) using histogram accumulation and Kernel Density
+  Estimation (KDE) with Gaussian or Pearson Type II kernels.
+
+Unlike most localizers that use Steered Power Response (SPR), we use a lightweight
+statistical KDE approach that is well-suited to real-time embedded constraints.
+
+## Beamforming (`beam/`)
+
+Two classical beamformer implementations are included. The SpeechCompass firmware uses
+localization only (not beamforming), but these are provided for completeness.
+
+- **`beam/das_beamformer.c/.h`** — Time-domain Delay-and-Sum beamformer; supports 2- and
+  4-microphone circular arrays; stateful ring buffer.
+- **`beam/fas_beamformer.c/.h`** — Frequency-domain Filter-and-Sum beamformer with complex
+  steering weights for circular arrays.
+
+## Building and testing
+
+Build rules are defined in `defs.bzl` using Bazel wrapper rules (`c_binary`, `c_library`,
+`c_test`) that enforce C11 and a consistent warning set.
+
+```bash
+# Run all tests
+bazel test //...
+
+# Run a specific test
+bazel test //test:angle_estimation_test
+bazel test //test:gcc_phat_test
+bazel test //test:das_beamformer_test
+bazel test //test:fas_beamformer_test
+```
+
+Tests use the `CHECK()` assertion macro from `utility/logging.h`.
+
+## References
+
+[1] Knapp, C. H. and G.C. Carter, "The Generalized Correlation Method for Estimation of
+Time Delay." *IEEE Transactions on Acoustics, Speech and Signal Processing*, Vol. ASSP-24,
+No. 4, Aug 1976.
diff --git a/firmware/README.md b/firmware/README.md
new file mode 100644
index 0000000..67367ca
--- /dev/null
+++ b/firmware/README.md
@@ -0,0 +1,39 @@
+# Firmware
+
+The firmware runs on the STM32 L5 microcontroller (ARM Cortex-M33). It reads raw audio
+from the four microphones, runs the GCC-PHAT localization algorithm, and streams azimuth
+angle estimates to the phone over USB.
+
+The firmware source is provided as an STM32CubeIDE project. Loading it onto the MCU
+requires an ST-LINK programmer.
+
+## Compiling
+
+1. Install [STM32Cube IDE](https://www.st.com/software/stm32cube-ide) and the ST-LINK
+   toolchain.
+
+2. Download the
+   [zipped project](https://drive.google.com/file/d/1aSLFQMz3HJg2O-bxhoN2yHJ5k2ODyI81/view?usp=sharing&resourcekey=0-FB9BwKRDcssJl4RME0ycYQ)
+   and unzip it.
+
+3. Import into STM32Cube IDE: **File → Import → Existing Projects into Workspace** and
+   select the project folder.
+
+4. Build the project. The console should show no errors.
+
+## Flashing
+
+Flashing requires a programmer and a tag-connect cable:
+
+- [ST-LINK V3 Mini programmer](https://www.mouser.com/ProductDetail/STMicroelectronics/STLINK-V3MINIE?qs=MyNHzdoqoQKcLQe5Jawcgw%3D%3D)
+- [Tag-Connect TC2030-CTX-STDC14 cable](https://www.tag-connect.com/product/tc2030-ctx-stdc14-for-use-with-stm32-processors-with-stlink-v3) (compact footprint)
+
+1. Connect a USB cable for board power (the programmer does not supply power).
+2. Hold the tag-connect cable against the board's programming header.
+3. In STM32Cube IDE, click the debug/flash button. On first use, configure the programmer
+   if prompted.
+4. To verify: open a serial terminal (e.g., Arduino IDE serial monitor) on the USB port.
+   You should see angle values printing continuously. Baud rate does not matter.
+
+> **Note:** Flashing can also be done without the IDE by using ST-LINK command-line tools
+> to flash a pre-compiled `.hex` or `.bin` binary directly.
diff --git a/hardware/README.md b/hardware/README.md
new file mode 100644
index 0000000..eb4ed2a
--- /dev/null
+++ b/hardware/README.md
@@ -0,0 +1,27 @@
+# Hardware
+
+The SpeechCompass phone case consists of two PCBs.
+
+![Electronics](https://github.com/google-deepmind/speech_compass/blob/main/docs/images/electronics.jpg)
+
+## Main PCB
+
+The main board hosts the STM32 L5 microcontroller, an audio codec (headphone output),
+and a Bluetooth module (currently unused). With a battery added, the system can operate
+untethered from the phone.
+
+[Schematic (PDF)](main_board_schematic.pdf)
+
+## Flex PCB
+
+The flexible PCB routes the four surface-mount microphones back to the main board.
+
+[Schematic (PDF)](flex_pcb_schematic.pdf)
+
+## Earlier version: LiveLocalizer
+
+The original prototype used a single rigid PCB — bulkier but simpler to build, with
+microphones on breakout boards. It runs the same firmware. See the
+[UIST 2023 demo paper](https://dl.acm.org/doi/10.1145/3586182.3615789) for details.
+
+![LiveLocalizer](https://github.com/google-deepmind/speech_compass/blob/main/docs/images/livelocalizer.png)
diff --git a/docs/hardware/flex_pcb_schematic.pdf b/hardware/flex_pcb_schematic.pdf
similarity index 100%
rename from docs/hardware/flex_pcb_schematic.pdf
rename to hardware/flex_pcb_schematic.pdf
diff --git a/docs/hardware/main_board_schematic.pdf b/hardware/main_board_schematic.pdf
similarity index 100%
rename from docs/hardware/main_board_schematic.pdf
rename to hardware/main_board_schematic.pdf