The Whisper app facilitates conversation where one of the participants (called the Whisperer) has difficulty speaking but not hearing, and the other participants (called Listeners) can speak to the Whisperer. The Whisperer “speaks” by typing, and the Listeners “hear” that typing in real time.
The Whisper app uses a simple entity model to represent devices, users, and conversations. It uses a simple data protocol to connect a Whisperer with Listeners. iOS and macOS devices running the Whisper app can establish conversations over Bluetooth LE without any internet connection, but for conversations with browser-based or remote Listeners the app relies on the whisper server to provide coordination.
Whisper assigns each device that can whisper or listen a device ID (which is a UUID). iOS/macOS devices create their own device IDs, retain them in sandbox storage, and communicate them to the whisper server. Browsers are assigned device IDs by the whisper server and retain them in a long-lived cookie.
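On the browser side, this scheme amounts to reading a server-assigned ID back out of the cookie string. A minimal sketch follows; the cookie name `whisperDeviceId` and the helper function are assumptions for illustration, not the app's actual implementation.

```typescript
// Minimal sketch: recover a server-assigned device ID from a raw cookie
// string ("name=value; name2=value2"). The cookie name is hypothetical.
function storedDeviceId(cookieHeader: string): string | null {
  const match = cookieHeader.match(/(?:^|;\s*)whisperDeviceId=([^;]+)/);
  return match ? match[1] : null;
}
```

If no ID is found, the browser would ask the whisper server to assign one and store it back in a long-lived cookie.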
A whisper user is represented by a long-lived profile entity, identified by a UUID. A whisper profile stores the user’s chosen username, the list of conversations the user has created as the Whisperer, and the list of conversations the user has participated in as a Listener. Whenever a device ID is generated, a new profile entity is generated along with it, and that profile starts out as the user profile in use on that device. However, an existing device can elect to use the profile shared from another device, so that multiple devices can be used by the same user. When different devices use the same profile, they cooperate with the whisper server to ensure that any profile update made on one device is also made on the others.
A whisper conversation is represented by a long-lived conversation entity, identified by a UUID. Each conversation is created by a Whisperer. That Whisperer then controls which Listeners are allowed to join the conversation, and their profile IDs are kept in the conversation. Whenever a user profile is created, it contains a default conversation in which the user can whisper, but the user can create new conversations and switch among them at will.
The whisper protocol consists of packets, typically very small, sent from Whisperer to Listener or vice versa. (There is no Listener-to-Listener communication.) Each packet is a UTF-8-encoded string in three parts:
- a decimal integer packet type (called the offset for historical reasons)
- a vertical bar '|' dividing the offset from the packet data
- the packet data itself
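The wire format above can be sketched in a few lines; the type and function names here are illustrative, not the app's actual API.

```typescript
// A packet is "<offset>|<data>", where offset is a decimal integer.
interface Packet {
  offset: number; // >= 0 for text packets, < 0 for control packets
  data: string;
}

function encodePacket(p: Packet): string {
  return `${p.offset}|${p.data}`;
}

function parsePacket(raw: string): Packet {
  // Split at the first '|' only: the packet data may itself contain '|'.
  const bar = raw.indexOf("|");
  if (bar < 0) throw new Error("malformed packet: missing '|'");
  const offset = parseInt(raw.slice(0, bar), 10);
  if (Number.isNaN(offset)) throw new Error("malformed packet: bad offset");
  return { offset, data: raw.slice(bar + 1) };
}
```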
There are two kinds of packets, text packets and control packets:
- Text packets are used to send changes in the Whisperer’s live text. They have offsets of 0 or greater, and their packet data is text. The offset indicates the position in the live text past which the packet data replaces the existing text. If a Listener receives an offset smaller than the length of the live text, they can assume the Whisperer has revised earlier text. If a Listener receives an offset larger than the length of the live text, they can assume they’ve missed a packet and call for a re-read of the live text, suspending incremental text packet processing until the full data is received.
- Control packets have offsets less than 0, and the interpretation of their packet data depends on their offset. Some carry content changes, such as shifting the live text to past text when the Whisperer hits return. Others are used for connection control, such as authorization handshakes and flow control.
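The incremental-update rule for text packets can be sketched as follows; the function name and return shape are illustrative, not the app's actual implementation.

```typescript
interface TextUpdate {
  text: string;        // the new live text (unchanged if a resync is needed)
  needResync: boolean; // true when a packet was missed and a full re-read is required
}

// Apply one text packet (offset >= 0) to the current live text.
function applyTextPacket(liveText: string, offset: number, data: string): TextUpdate {
  if (offset > liveText.length) {
    // Offset beyond the live text: a packet was missed; ask for a full re-read.
    return { text: liveText, needResync: true };
  }
  // Keep everything up to the offset, replace everything past it.
  return { text: liveText.slice(0, offset) + data, needResync: false };
}
```

For example, appending uses an offset equal to the current length, while a smaller offset rewrites the tail of the live text.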
The whisper protocol is designed to work over a transport layer that provides:
- peer-to-peer, point-to-point or broadcast, sequenced delivery of packets, and
- two independent, simultaneous channels, each with its own authentication.
The app currently supports two transport layers:
- Bluetooth LE connections (for local iOS/macOS Listeners), and
- the Ably pub/sub realtime infrastructure (for remote or browser-based Listeners).
A single conversation can utilize both transports, with some Listeners on Bluetooth and others on Ably.
Establishment and teardown of connections between a Whisperer and Listeners takes place on one channel (called the control channel), and then content transfer from Whisperer to Listeners takes place on another channel (called the content channel). The separate authentication of the two channels is used to prevent Listeners who aren’t authorized from eavesdropping on a conversation.
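One way to picture the separate authorization is a gate the Whisperer controls on the content channel, while the control channel stays open to all participants. The class and method names below are hypothetical, not the app's actual code.

```typescript
// Sketch: everyone can reach the control channel, but only Listeners the
// Whisperer has authorized may subscribe to the content channel.
class ContentChannelGate {
  private authorized = new Set<string>(); // profile IDs allowed on content

  authorize(profileId: string): void { this.authorized.add(profileId); }
  deauthorize(profileId: string): void { this.authorized.delete(profileId); }

  // Called when a participant tries to subscribe to the content channel.
  canSubscribe(profileId: string): boolean {
    return this.authorized.has(profileId);
  }
}
```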
The canonical sequence for establishing a new conversation between a Whisperer and a Listener goes as follows:
1. The Listener sends the Whisperer a listen offer packet on the control channel. This packet contains the Listener’s profile ID but reveals nothing about the Listener’s username.
2. The Whisperer sends the Listener a whisper offer packet on the control channel. This contains the ID and name of the conversation being offered as well as the Whisperer’s profile ID and username.
3. The Listener decides based on the offer information whether they want to participate in the conversation. If they do, they respond with a listen request packet, giving their profile ID and name.
4. The Whisperer sees the listen request packet and decides based on the request information whether they want to allow the Listener into the conversation:
    1. If so, the Whisperer authorizes the Listener on the content channel and sends a listen authorization packet.
    2. If not, the Whisperer sends a listen deauthorization packet.
5. If the Listener receives a listen authorization packet, they connect to the content channel and send a joining message on the control channel.
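The Listener’s side of this handshake can be sketched as a small state machine. The state and packet-kind names here are illustrative; the real protocol identifies control packets by their negative offsets.

```typescript
type ListenerState =
  | "awaitingOffer"  // listen offer sent (step 1), waiting for a whisper offer
  | "deciding"       // whisper offer received, user deciding (step 3)
  | "awaitingAnswer" // listen request sent, waiting on the Whisperer (step 4)
  | "joined"         // authorized: connect to content channel, send joining (step 5)
  | "refused";       // deauthorized

function onControlPacket(state: ListenerState, kind: string): ListenerState {
  if (state === "awaitingOffer" && kind === "whisperOffer") return "deciding";
  if (state === "awaitingAnswer" && kind === "listenAuth") return "joined";
  if (state === "awaitingAnswer" && kind === "listenDeauth") return "refused";
  // Re-join shortcut: a recognized Listener can be authorized immediately.
  if (state === "awaitingOffer" && kind === "listenAuth") return "joined";
  return state; // ignore anything unexpected
}
```

The last transition models the abbreviated re-join sequence described below, where the Whisperer skips straight to authorization.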
Because a Whisperer can recognize an existing listener from their listen offer packet, the canonical sequence for a Listener re-joining a conversation to which they have previously been a party is just steps 1, 4.1, and 5 from the above sequence.
Whenever a Whisperer drops from a conversation, they send a dropping packet to let the Listeners know, and vice versa (Listeners who drop send a dropping packet to the Whisperer).
The implementation of whispering/listening is broken into two layers: a transport layer that handles making/breaking connections and sending protocol messages between conversation participants, and an application layer that understands the semantics of the messages being exchanged. As mentioned above, the transport layer may either be broadcast based (all packets seen by all participants) or it may be point-to-point (all packets go from one participant to a specific other participant).
All content packets, and almost all control packets, are both initiated at the application layer by the sender and then processed at the application layer by the receiver. There are three exceptions:
- The listen offer packet is initiated from the transport layer on the Listener side whenever it starts up (broadcast) or connects to a new Whisperer (point-to-point). This packet announces the presence of the new Listener and is passed to the whisper application layer. Without this packet, Listeners who arrive mid-conversation would not be noticed by the Whisperer.
- The whisper offer packet is initiated from the broadcast transport layer on the whisperer side when it first connects to the control channel. Without this packet, late-arriving whisperers would not be aware of (broadcast-based) listeners who had connected earlier.
- The leave conversation packet. This packet is sent by the transport layer to warn other participants that they are disconnecting from the conversation. The disconnection may be motivated either by the application layer (e.g., if the user quits) or by the transport layer (e.g., if there is a transport error and the connection must be torn down). Whenever a participant’s transport receives a leave conversation packet from a participant that is already known to the application layer, it tells the application layer of the departure.
The whisper server coordinates and secures all internet-based interactions between devices, including both internet-based conversations and the synchronization of profiles. Whisper device IDs are tracked on the whisper server, where they are associated with all of the other entities used in the app’s operation on that device.
All code and textual materials in this repository are copyright 2023 Daniel C. Brotsky, and licensed under the GNU Afero General Public License v3, which is reproduced in full in the LICENSE file.
The icon assets in this repository come from the Noun Project, and are licensed via subscription by Daniel Brotsky for use in this application (see details here).
The sound assets in this repository come from Pixabay, and are licensed by Daniel Brotsky for use in this application (see details here).
Daniel Brotsky gratefully acknowledges the following content creators whose materials are used in this application:
- Air Horn Icon by SuperNdre via the Noun Project.
- Air Horn Sound by goose278 from Pixabay.
- Bicycle Bell Icon by DinosoftLab via the Noun Project.
- Bicycle Bell Sound by Yin Yang Jake007 from Pixabay.
- Bicycle Horn Icon by Berkah Icon via the Noun Project.
- Bicycle Horn Sound by AntumDeluge from Pixabay.
- Checkmark by Ganesha via the Noun Project.
- Decrease Font Size Icon by Yeong Rong Kim via the Noun Project.
- Increase Font Size Icon by Yeong Rong Kim via the Noun Project.
- Quote Dialog by Viktor Vorobyev via the Noun Project.
- Record Voice Over Icon by Justin Blake via the Noun Project.
- Repeat Icon by Julie Collard via the Noun Project.
- Voice Over Off Icon by Justin Blake via the Noun Project.
- Whisper Speech Bubble Icon by Lucas Helle via the Noun Project.
- Uncheck by Annisa via the Noun Project.
Daniel Brotsky is also grateful for the example application ColorStripe, designed by @artemnovichkov as described in his blog post. Some of the code in ColorStripe, especially the use of Combine to connect Core Bluetooth managers with ViewModels, has been used in Whisper, as permitted by the MIT license under which ColorStripe was released.