Programming Assignment 6b - Speech

In this assignment, you will convert sampled sentences into audio files using text-to-speech, and then transcribe the audio files back into text using speech-to-text.

Cloning the Assignment

Select the folder in which you'd like to clone the assignment, and run the following commands in the command line to clone it.

cd folder/to/put/assignment
git clone https://github.com/cs124/pa6b-speech.git

Using a Text Editor

Like PA6a, this assignment is not run on Jupyter Notebooks. This means that you may have to use a text editor (e.g., Visual Studio Code). For those of you who may not be familiar with this, please refer to this guide to get started, or come to office hours!

Environment Setup

PA6b requires the Cartesia package. There are two ways to set up your environment:

  1. Activate the cs124 environment you previously created, then install the required package:
conda activate cs124
pip install cartesia
  2. Create a new environment called cs124_pa6b:
conda env create -f environment_pa6b.yml
conda activate cs124_pa6b

Getting Your Cartesia API Key

In this assignment, you'll use the Cartesia API for both text-to-speech and speech-to-text. An API key is like a password that allows your code to access Cartesia's services. Here's how to get yours:

Step 1: Create a Cartesia Account

  1. Go to https://play.cartesia.ai/sign-up
  2. Sign up for a free account

Step 2: Generate Your API Key

  1. Once logged in, look at the left menu under PLATFORM
  2. Click on API Keys
  3. Click the + New button at the top right
  4. Add a description for your key (e.g., cs124_pa6b)
  5. Click Create
  6. Copy your API key (it will look like sk_car_...)

Step 3: Set Your API Key

Before running any scripts, set your API key as an environment variable in your terminal:

export CARTESIA_API_KEY=your_api_key_here

Replace your_api_key_here with the actual key you copied from the dashboard.

Important: Cartesia's free tier API comes with a rate limit. While this rate limit is more than enough to complete this assignment, please be mindful of overuse.
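Because the key is read from the environment, a common failure mode is running a script in a terminal where the export was never done. A quick way to fail fast with a clear message is a check like the following (a sketch; `require_api_key` is a hypothetical helper, not part of the provided scripts):

```python
import os

def require_api_key(name: str = "CARTESIA_API_KEY") -> str:
    """Return the API key from the environment, or fail with a helpful message.

    Hypothetical helper for illustration: the provided scripts may handle
    this differently, but the environment variable name matches the export
    command above.
    """
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(
            f"{name} is not set. Run `export {name}=your_api_key_here` "
            "in the same terminal before running the scripts."
        )
    return key
```

Note that `export` only affects the current terminal session, so you'll need to re-run it if you open a new window.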

Part 1: Convert Sampled Sentences into Audio Files

We have provided a default sampled_sentences.json file for you to use. If you'd like, you can optionally replace it with the sentences you sampled from your trained model in PA6a.

You should use the Cartesia API to convert your sampled sentences into audio files. You can do so by running the tts.py script. This script will automatically save the audio file as a wav file called sampled_sentences_speech.wav.

python tts.py
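The script reads its input from sampled_sentences.json. If you want to inspect the file or swap in your own PA6a sentences, a minimal loader might look like this (a sketch assuming the file holds a JSON array of sentence strings; check the provided file for its actual structure, and note that `load_sentences` is a hypothetical helper, not part of tts.py):

```python
import json

def load_sentences(path: str = "sampled_sentences.json") -> list[str]:
    """Load sampled sentences, assuming the file holds a JSON array of strings.

    Hypothetical helper for illustration; the actual JSON layout may differ,
    so inspect sampled_sentences.json before relying on this.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Keep non-empty strings only, stripping stray whitespace.
    return [s.strip() for s in data if isinstance(s, str) and s.strip()]
```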

Optional: Customize the Voice

You can browse and try out different voices on the Cartesia Playground. To use a different voice, you can replace the voice id in tts.py with the ID of the voice you'd like to use:

  1. Go to the Cartesia Voice Library
  2. Browse and listen to different voices
  3. Click on a voice you like
  4. On the right side, click the three dots (⋮) next to the voice
  5. Click Copy ID from the menu
  6. Open tts.py and find line 15:
    "id": "694f9389-aac1-45b6-b726-9d9369183238",
  7. Replace the ID with your chosen voice's ID (paste the ID you copied)

Part 2: Transcribe the Audio Files into Text

Next, you should use the Cartesia API to transcribe the audio file back into text. You can do so by running the speech_to_text.py script. This script will automatically save the transcription to a text file called sampled_sentences_speech.txt.

python speech_to_text.py

You will include the transcribed file sampled_sentences_speech.txt in your submission; it is part of the grading. You are also encouraged to record yourself reading the sampled sentences in a noisy environment, transcribe that recording back into text, and compare the quality of the transcription with the original sampled sentences.

Part 3: Error Analysis

Now that you've run the TTS→STT pipeline, analyze what happened to your text by answering the questions in responses.py. Each question is defined as a function, where you should fill in your answer in the response strings.

Question 1: Error Classification

Compare your original sampled_sentences.json to the final sampled_sentences_speech.txt. Identify and describe three distinct types of errors or information loss you observe. For each type, provide a specific example from your results and explain what aspect of the TTS (text-to-speech) or STT (speech-to-text) process likely caused it.
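To make this comparison systematic rather than eyeball-only, you can compute a word error rate (WER) between the original sentence and its transcription. A minimal sketch follows; note that splitting on whitespace leaves punctuation attached to words, so you may want to normalize case and punctuation first depending on which errors you're trying to isolate:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length.

    Minimal sketch for error analysis; lowercases but does not strip
    punctuation, so "cat." and "cat" count as different words.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over word sequences via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

A WER of 0.0 means the transcription matched word for word; 1/3 means one word in three was substituted, inserted, or deleted.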

Question 2: Model Bias and Training Data

Based on the errors you observed, what can you infer about the training data or design priorities of the speech-to-text model? Consider: What types of content does it handle well vs. poorly? What trade-offs might the model designers have made?

Question 3: Formatting as Information

Your original text likely had formatting elements (capitalization, line breaks, punctuation, etc.) that disappeared in the final transcription. Choose one formatting element that was lost and explain: (a) Why current speech systems can't preserve it, and (b) What would be required to preserve it in a future system.

Part 4: Ethics of Automated Transcription & Accessibility

In Part 2, you transcribed audio into text. While tools like YouTube's auto-captions provide a level of instant accessibility, they have been the subject of significant legal debate. Organizations like the National Association of the Deaf (NAD) have successfully sued major institutions because automated captions often fail the equitable access standard required by laws like the Americans with Disabilities Act (ADA).

Answer the following questions in responses.py.

Question 4: The "Good Enough" Threshold

Disability law often reduces accommodation quality to a single metric like word error rate, yet this ignores dimensions like subtitle timing, speaker identification, and tonal cues, all things human captioners naturally attend to. What does it mean to collapse something as multidimensional as communication into a single number in the name of access? How do you view the relationship, or the difference, between meeting legal requirements and providing equal access?

Question 5: The Curb-Cut Effect

High-quality transcription is a classic example of universal design: a feature built for disability that benefits a much broader population. Beyond the D/deaf and hard-of-hearing communities, identify other groups who rely on these transcriptions. Do you rely on captions yourself? How does noticeably worse AI transcription impact these groups' ability to engage with content?

Part 5: Stress-Testing ASR with Dialects and Dysfluency

In this section, you will intentionally test the limits of the transcription model by providing it with speech that deviates from standard American or British English.

The Task:

  1. Source or Record Audio: Find or record a short audio clip (10-15 seconds) featuring a strong regional dialect, a non-native English accent, or speech containing natural dysfluencies (stutters, long pauses, or "um/uh" fillers).

    • Keep your clip to 10-15 seconds. If your recording is longer, trim it before using it. On macOS, you can open the file in QuickTime Player, go to Edit > Trim, and drag the handles to select the portion you want. On Windows, you can use the built-in Video Editor (search "Video Editor" in the Start menu) or the Photos app to trim.
    • The Cartesia API accepts many common audio formats: .wav, .mp3, .m4a, .mp4, .mov, .flac, .ogg, .webm. So if you record on your phone (which typically saves as .m4a) or screen-record on your Mac (which saves as .mov), you can use that file directly without converting.
    • Some ideas for where to find clips:
      • Record on your phone: Use the Voice Memos app (iPhone) or Sound Recorder (Android), then transfer the file to your computer (e.g., via AirDrop, email, or Google Drive).
      • Record on your computer: On macOS, open QuickTime Player > File > New Audio Recording. On Windows, open the Sound Recorder app.
      • Capture audio from a video (e.g., YouTube):
        • macOS: Press Cmd + Shift + 5. In the toolbar that appears at the bottom of your screen, click either Record Entire Screen or Record Selected Portion. Click Options and under Microphone, select your microphone or built-in audio to make sure audio is captured. Click Record, then play the video. When done, press Cmd + Ctrl + Esc to stop recording. The file will be saved to your Desktop as a .mov file.
        • Windows: Press Win + G to open the Xbox Game Bar. In the Capture widget, click the microphone icon to make sure audio recording is enabled. Click the Record button (the circle icon) to start recording, then play the video. When done, click the Stop button (the square icon) in the floating recording toolbar. The file will be saved to Videos > Captures as an .mp4 file.
        • The Cartesia API accepts .mov and .mp4 directly, so no conversion is needed.
  2. Move your audio file into the assignment folder: The test_transcription.py script looks for your audio file relative to where it lives, so your clip needs to be in the same folder. You can do this by either:

    • Dragging the file into the assignment folder using Finder (macOS) or File Explorer (Windows)
    • Or using the terminal:
      cp ~/Desktop/my_clip.mov .
      
      This copies a file called my_clip.mov from your Desktop into the current directory. Replace ~/Desktop/my_clip.mov with wherever your file actually is (e.g., ~/Downloads/recording.m4a). Make sure your terminal is in the assignment directory first — you can check with pwd and navigate there with cd.
  3. Run Transcription: Use the test_transcription.py script to transcribe your audio clip.

    • Open test_transcription.py and update line 7 to the filename of your audio clip. For example, if your file is called my_clip.mov, change:
      audio_filepath = "your_audio_clip.wav"
      to:
      audio_filepath = "my_clip.mov"
      The filename must match exactly (including the extension) and the file must be in the same folder as the script.
    • Run the script:
    python test_transcription.py
    
    • The transcription will be saved to test_transcription_output.txt in the same folder.
  4. Manual Comparison: Listen to your audio clip and manually transcribe it yourself -- write down exactly what was said, including any pauses, stutters, and non-standard pronunciations. You'll compare this to the AI transcription in the questions below.

Now answer the following questions in responses.py based on your observations:

Question 6: Compare and Contrast

Given your manual transcription above, note every instance where the AI model hallucinated a word, skipped a phrase, or sanitized the speech by removing pauses and stutters. What do you notice about the model's failures?

In your response, include:

  • Your manual transcription
  • The AI transcription (from test_transcription_output.txt)
  • Your analysis
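One way to surface these instances systematically is a word-level diff between your manual transcript and the AI output, using Python's standard difflib. The sketch below (with `word_diff` as a hypothetical helper) marks words that appear only in your manual transcript with "-" (likely skipped or sanitized) and words that appear only in the AI output with "+" (possible hallucinations):

```python
import difflib

def word_diff(manual: str, ai: str) -> list[str]:
    """Word-level differences between a manual and an AI transcript.

    Hypothetical helper for illustration: "-" marks words only in the
    manual transcript (skipped/sanitized by the model), "+" marks words
    only in the AI output (possible hallucinations).
    """
    out = []
    for tok in difflib.ndiff(manual.split(), ai.split()):
        if tok.startswith(("- ", "+ ")):
            out.append(tok)
    return out
```

For example, if your manual transcript is "um I uh went home" and the AI output is "I went home", the diff shows the dropped fillers, which is exactly the sanitization Question 6 asks about.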

Question 7: Whose Voice Gets Transcribed?

These systems have real stakes — an automated transcription used in a job interview, a 911 call, or a courtroom can fail certain speakers entirely.

Research shows that ASR model performance varies significantly across different speakers. Do you find these patterns of performance variation in your data, or does your clip tell a different story? Does what you observed correspond with your own experience using voice systems like Siri, Google Assistant, customer service phone trees, or your car's navigation?

What does either result reveal about the assumptions baked into these systems about what counts as the default way of speaking?

Part 6: Zip and Submit

Run bash create_assignment_zip.sh to zip your submission and submit the zip file to Gradescope.

To recap, the submission zip should include the following files:

  • sampled_sentences_speech.wav: the audio file generated from your sampled sentences
  • sampled_sentences_speech.txt: the transcribed text of the audio file
  • test_transcription_output.txt: the transcribed text of your dialect/dysfluency test clip
  • responses.py: your answers to the error analysis questions

Note: Because your submission includes a large audio file, the upload and grading on Gradescope may take longer than usual. This is normal -- please be patient and wait for it to complete.
