diff --git a/chapters/zh-CN/_toctree.yml b/chapters/zh-CN/_toctree.yml
index eab9891..edc35ad 100644
--- a/chapters/zh-CN/_toctree.yml
+++ b/chapters/zh-CN/_toctree.yml
@@ -100,29 +100,22 @@
     title: 实战练习
   - local: chapter6/supplemental_reading
     title: 补充阅读
-#
-#- title: 第7单元：音频到音频合成(ATA)
-#  sections:
-#  - local: chapter7/introduction
-#    title: 单元简介
-#  - local: chapter7/tasks
-#    title: 音频到音频合成（ATA）任务实例
-#  - local: chapter7/choosing_dataset
-#    title: 数据集选择
-#  - local: chapter7/preprocessing
-#    title: 数据加载和预处理
-#  - local: chapter7/evaluation
-#    title: 音频到音频合成（ATA）的评价指标
-#  - local: chapter7/fine-tuning
-#    title: 模型微调
-#  - local: chapter7/quiz
-#    title: 习题
-#    quiz: 7
-#  - local: chapter7/hands_on
-#    title: 实战练习
-#  - local: chapter7/supplemental_reading
-#    title: 补充阅读
-#
+
+- title: 第7单元：整合实战
+  sections:
+  - local: chapter7/introduction
+    title: 单元简介
+  - local: chapter7/speech-to-speech
+    title: 语音到语音翻译
+  - local: chapter7/voice-assistant
+    title: 构建语音助手
+  - local: chapter7/transcribe-meeting
+    title: 会议转录
+  - local: chapter7/hands_on
+    title: 实战练习
+  - local: chapter7/supplemental_reading
+    title: 补充阅读
+
 - title: 第8单元：结束线
   sections:
   - local: chapter8/introduction
diff --git a/chapters/zh-CN/chapter7/hands_on.mdx b/chapters/zh-CN/chapter7/hands_on.mdx
new file mode 100644
index 0000000..6d4e9a2
--- /dev/null
+++ b/chapters/zh-CN/chapter7/hands_on.mdx
@@ -0,0 +1,20 @@
+# 实战练习
+
+在本单元中，我们整合了前六个单元学到的内容，构建了三个集成音频应用。正如你所体验到的，借助本课程掌握的基础技能，构建复杂一点的音频工具完全是可以实现的。
+
+本次实践任务将基于本单元中的一个应用，并对其进行一些多语言扩展🌍。你的目标是从本单元第一节的[级联式语音翻译Gradio示例](https://huggingface.co/spaces/course-demos/speech-to-speech-translation)出发，修改它以支持**非英语**目标语言的语音翻译。也就是说，示例程序应能将语言X的语音输入，翻译成语言Y的语音输出，且Y不能是英语。你可以通过点击[此处复制](https://huggingface.co/spaces/course-demos/speech-to-speech-translation?duplicate=true)将模板克隆到你在Hugging Face上的命名空间下。无需使用GPU加速器——免费的CPU服务就已足够🤗。不过请确保你的示例项目设置为**公开**，这样我们才能访问并进行评估。
+
+关于如何更新语音翻译函数以实现多语言翻译的技巧可参考[语音到语音翻译](speech-to-speech)一节。按照该说明，你应该可以将示例程序更新为支持从语言X语音到语言Y文本的翻译任务，这已完成一半目标！
+
+要将语言Y的文本合成成语言Y的语音（即多语言语音合成），你需要使用一个多语言TTS模型检查点。为此，你可以使用上一个实践练习中自己微调的SpeechT5模型，或者使用一个预训练的多语言TTS检查点。有两个推荐选项：一个是[sanchit-gandhi/speecht5\_tts\_vox\_nl](https://huggingface.co/sanchit-gandhi/speecht5_tts_vox_nl)，它是在[VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli)数据集的荷兰语子集上微调的SpeechT5模型；另一个是MMS TTS检查点（详见[语音合成的预训练模型](../chapter6/pre-trained_models)一节）。
+
+<Tip>
+
+在我们的测试中，对于荷兰语（Dutch），MMS TTS检查点效果优于微调后的SpeechT5模型。但你可能会发现自己微调的模型在某些语言上表现更佳。如果你决定使用MMS TTS检查点，需要修改demo的<a href="https://huggingface.co/spaces/course-demos/speech-to-speech-translation/blob/a03175878f522df7445290d5508bfb5c5178f787/requirements.txt#L2">requirements.txt</a>文件，以安装该分支的<code>transformers</code>：
+<p><code>git+https://github.com/hollance/transformers.git@6900e8ba6532162a8613d2270ec2286c3f58f57b</code></p>
+
+</Tip>
+
+你的程序应接收一个音频文件作为输入，并输出一个音频文件作为结果，其函数接口需匹配模板demo中的[`speech_to_speech_translation`](https://huggingface.co/spaces/course-demos/speech-to-speech-translation/blob/3946ba6705a6632a63de8672ac52a482ab74b3fc/app.py#L35)。因此，我们建议你保留主函数`speech_to_speech_translation`不变，仅根据需要更新[`translate`](https://huggingface.co/spaces/course-demos/speech-to-speech-translation/blob/a03175878f522df7445290d5508bfb5c5178f787/app.py#L24)和[`synthesise`](https://huggingface.co/spaces/course-demos/speech-to-speech-translation/blob/a03175878f522df7445290d5508bfb5c5178f787/app.py#L29)两个函数。
+
+构建好Gradio demo后，你可以提交它以供评估。访问Space [audio-course-u7-assessment](https://huggingface.co/spaces/huggingface-course/audio-course-u7-assessment)，并在提示时提供你项目的repository id。该Space会自动发送一个音频样本到你的demo，并检测返回的音频是否为非英语语种。如果通过测试，你的名字旁边会在[总进度页面](https://huggingface.co/spaces/MariaK/Check-my-progress-Audio-Course)上显示一个绿色对勾✅。
diff --git a/chapters/zh-CN/chapter7/introduction.mdx b/chapters/zh-CN/chapter7/introduction.mdx
new file mode 100644
index 0000000..9f87680
--- /dev/null
+++ b/chapters/zh-CN/chapter7/introduction.mdx
@@ -0,0 +1,11 @@
+# 第7单元：整合实战 🪢
+
+恭喜你来到第7单元🥳！现在你距离完成整个课程只差最后几步了，也即将掌握构建完整音频机器学习应用所需的核心技能。从理解角度来看，你已经掌握了音频领域的关键知识点：我们已经系统学习了音频数据处理、音频分类、语音识别以及语音合成等核心主题及其背后的理论知识。本单元的目标是帮助你**将这些内容整合起来**：既然你已经分别了解了每一类任务的原理和实践方法，现在我们将探索如何将它们组合在一起，构建一些真实世界的应用。
+
+## 你将学到什么，构建什么
+
+在本单元中，我们将学习以下三个主题：
+
+* [语音到语音翻译](speech-to-speech)：将一种语言的语音翻译为另一种语言的语音
+* [构建语音助手](voice-assistant)：开发一个类似 Alexa 或 Siri 的语音助手
+* [会议转录](transcribe-meeting)：将会议内容转录成文本，并标注每位说话者的发言时间和内容
diff --git a/chapters/zh-CN/chapter7/speech-to-speech.mdx b/chapters/zh-CN/chapter7/speech-to-speech.mdx
new file mode 100644
index 0000000..b1d54cb
--- /dev/null
+++ b/chapters/zh-CN/chapter7/speech-to-speech.mdx
@@ -0,0 +1,211 @@
+# 语音到语音翻译
+
+语音到语音翻译（Speech-to-speech translation，简称STST或S2ST）是一项相对较新的语音语言处理任务，其目标是将一种语言的语音翻译成**另一种**语言的语音：
+
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/s2st.png" alt="Diagram of speech to speech translation">
+</div>
+
+STST可以被视为传统机器翻译（MT）任务的扩展：不同之处在于，我们翻译的不再是**文本**，而是**语音**。STST在多语言交流领域具有广泛应用，它可以帮助不同语言的使用者通过语音自然沟通。
+
+想象一下，当你需要与讲不同语言的人沟通时，不必先将你想表达的内容写下来再翻译成目标语言，而是可以直接开口说话，然后由STST系统将你的语音翻译为目标语言的语音。对方也可以通过该系统以语音方式进行回应。这种交互方式相比基于文本的翻译更加自然流畅。
+
+在本节中，我们将探索一种**级联式（cascaded）**STST方法，整合你在第5单元（语音识别）和第6单元（语音合成）中学到的知识。我们将先使用一个**语音翻译（ST）**系统，将源语音直接翻译为目标语言的文本，然后使用**文本转语音（TTS）**系统，将翻译后的文本合成为语音：
+
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/s2st_cascaded.png" alt="Diagram of cascaded speech to speech translation">
+</div>
+
+我们也可以采用三阶段的方法：首先使用自动语音识别（ASR）系统将源语音转写为相同语言的文本，然后通过机器翻译（MT）将该文本翻译为目标语言，最后再使用TTS将翻译后的文本合成为语音。不过，增加组件数量会导致**误差传播（error propagation）**问题，即某个阶段出现的错误会传递给后续模型；同时也会增加推理时延，因为需要顺序调用多个模型。
+
+虽然这种级联方法看起来较为简单，但却能构建出效果非常好的STST系统。实际上，早期许多商用STST产品（如[Google翻译](https://ai.googleblog.com/2019/05/introducing-translatotron-end-to-end.html)）就是基于ASR + MT + TTS的三阶段级联方案实现的。该方法还具有良好的数据效率和计算效率，因为可以直接组合已有的语音识别与语音合成系统，无需额外训练STST模型。
+
+在本单元接下来的内容中，我们将聚焦构建一个可将任意语言X的语音翻译为英语语音的STST系统。尽管我们聚焦X→英语的翻译方向，但你可以将相同方法扩展至任意X→Y的语言组合，我们在后文也会提供相应的提示。我们会将STST拆解为两个核心子任务：语音翻译（ST）与文本转语音（TTS），最后将两者整合，并通过Gradio构建一个演示界面来展示整个系统的效果。
+
+## 语音翻译（Speech Translation）
+
+我们将使用Whisper模型构建语音翻译系统，因为它支持将来自96种语言的语音翻译为英文。具体来说，我们会加载[Whisper Base](https://huggingface.co/openai/whisper-base)模型，该模型拥有7400万参数。虽然它并不是性能最强的版本（[Whisper Large](https://huggingface.co/openai/whisper-large-v2) 的参数量是它的20多倍），但考虑到我们需要级联两个自回归模型（ST + TTS），因此希望每个模型都能尽快完成推理，以保持整体响应速度合理：
+
+```python
+import torch
+from transformers import pipeline
+
+device = "cuda:0" if torch.cuda.is_available() else "cpu"
+pipe = pipeline(
+    "automatic-speech-recognition", model="openai/whisper-base", device=device
+)
+```
+
+太好了！接下来我们加载一段非英语的语音样本来测试STST系统。这里我们选用[VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli)数据集中意大利语（`it`）验证集的第一个样本：
+
+```python
+from datasets import load_dataset
+
+dataset = load_dataset("facebook/voxpopuli", "it", split="validation", streaming=True)
+sample = next(iter(dataset))
+```
+
+你可以在Hub的数据集页面试听该样本：[facebook/voxpopuli/viewer](https://huggingface.co/datasets/facebook/voxpopuli/viewer/it/validation?row=0)
+
+也可以通过Jupyter Notebook的音频功能直接播放：
+
+```python
+from IPython.display import Audio
+
+Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])
+```
+
+现在我们定义一个函数，接收音频输入并返回翻译后的文本。请注意，我们需要通过参数设置任务为 `"translate"`，以确保Whisper执行语音翻译而非语音识别：
+
+```python
+def translate(audio):
+    outputs = pipe(audio, max_new_tokens=256, generate_kwargs={"task": "translate"})
+    return outputs["text"]
+```
+
+<Tip>
+
+Whisper也可以通过一些技巧实现将语音从任意语言X翻译成任意语言Y。只需将任务设置为`"transcribe"`，并通过`"language"`参数指定目标语言，例如若要翻译成西班牙语，可设置为：
+
+`generate_kwargs={"task": "transcribe", "language": "es"}`
+
+</Tip>
+
+太棒了！我们快速检查一下模型是否能输出合理的结果：
+
+```python
+translate(sample["audio"].copy())
+```
+```
+' psychological and social. I think that it is a very important step in the construction of a juridical space of freedom, circulation and protection of rights.'
+```
+
+没问题！如果我们将其与原始文本进行对比：
+
+```python
+sample["raw_text"]
+```
+```
+'Penso che questo sia un passo in avanti importante nella costruzione di uno spazio giuridico di libertà di circolazione e di protezione dei diritti per le persone in Europa.'
+```
+
+可以看到，翻译内容大致一致（你可以用 Google 翻译自行验证）。唯一的差别在于开头出现了一些额外的词汇，那是说话者上一句话的尾部。
+
+至此，我们已完成级联STST流水线的第一步，并实际运用了第5单元中学到的Whisper模型用于语音识别和翻译的技能。如果你想回顾相关内容，可以重新阅读[第5单元的预训练模型章节](../chapter5/asr_models)。
+
+## 文本转语音（Text-to-speech）
+
+接下来是STST系统的第二部分：将英文文本转换为英文语音。我们将使用预训练的[SpeechT5 TTS](https://huggingface.co/microsoft/speecht5_tts)模型来进行语音合成。目前🤗 Transformers暂未提供用于TTS的 `pipeline`，所以我们需要手动使用模型。不过没关系，在第 6 单元中你已经掌握了推理方法，完全可以胜任！
+
+首先，我们从预训练检查点中加载SpeechT5的处理器、模型和声码器（vocoder）：
+
+```python
+from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
+
+processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
+
+model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
+vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
+```
+
+<Tip>
+
+这里使用的是专门为英语TTS训练的SpeechT5检查点。如果你想翻译成其他语言的语音，可以更换为在目标语言上微调的SpeechT5模型，或者使用针对目标语言预训练的MMS TTS检查点。
+
+</Tip>
+
+和Whisper一样，我们可以将SpeechT5模型和vocoder部署到GPU加速设备上（如可用）：
+
+```python
+model.to(device)
+vocoder.to(device)
+```
+
+太好了！现在加载说话人嵌入（speaker embeddings）：
+
+```python
+embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
+speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
+```
+
+然后我们可以编写一个函数，接收文本作为输入，并生成对应的语音。首先用处理器对文本进行预处理（分词），获取输入ID；然后将输入ID和说话人嵌入都放到加速器设备上（如可用），并传入SpeechT5模型；最后将生成的语音移回CPU，便于在Jupyter Notebook中播放：
+
+```python
+def synthesise(text):
+    inputs = processor(text=text, return_tensors="pt")
+    speech = model.generate_speech(
+        inputs["input_ids"].to(device), speaker_embeddings.to(device), vocoder=vocoder
+    )
+    return speech.cpu()
+```
+
+我们用一条示例文本测试一下效果：
+
+```python
+speech = synthesise("Hey there! This is a test!")
+
+Audio(speech, rate=16000)
+```
+
+听起来不错！接下来进入最激动人心的部分——将整个流程串联起来。
+
+## 构建一个STST演示应用
+
+在使用[Gradio](https://gradio.app)构建我们的STST系统演示之前，我们先做一个简单的合理性检查（sanity check），确保两个模型可以无缝衔接：输入一段音频，输出翻译后的音频。我们会将前两个小节中定义的函数组合在一起：首先输入源语音，获取翻译后的文本，然后将文本合成为语音。最后，我们将合成后的语音转换为`int16`数组，这是Gradio所期望的输出音频格式。具体步骤如下：先将浮点音频数组乘以`int16`的最大值，缩放到`int16`的动态范围，再将其从浮点类型转换为目标类型（`int16`）：
+
+```python
+import numpy as np
+
+target_dtype = np.int16
+max_range = np.iinfo(target_dtype).max
+
+
+def speech_to_speech_translation(audio):
+    translated_text = translate(audio)
+    synthesised_speech = synthesise(translated_text)
+    synthesised_speech = (synthesised_speech.numpy() * max_range).astype(np.int16)
+    return 16000, synthesised_speech
+```
+
+我们检查一下这个组合函数是否能正常运行：
+
+```python
+sampling_rate, synthesised_speech = speech_to_speech_translation(sample["audio"])
+
+Audio(synthesised_speech, rate=sampling_rate)
+```
+
+完美！接下来我们将这个函数封装成一个Gradio应用，可以使用麦克风输入或上传音频文件进行测试，并播放模型输出：
+
+```python
+import gradio as gr
+
+demo = gr.Blocks()
+
+mic_translate = gr.Interface(
+    fn=speech_to_speech_translation,
+    inputs=gr.Audio(source="microphone", type="filepath"),
+    outputs=gr.Audio(label="Generated Speech", type="numpy"),
+)
+
+file_translate = gr.Interface(
+    fn=speech_to_speech_translation,
+    inputs=gr.Audio(source="upload", type="filepath"),
+    outputs=gr.Audio(label="Generated Speech", type="numpy"),
+)
+
+with demo:
+    gr.TabbedInterface([mic_translate, file_translate], ["Microphone", "Audio File"])
+
+demo.launch(debug=True)
+```
+
+这将启动一个Gradio演示程序，效果类似于Hugging Face Space上运行的版本：
+
+<iframe src="https://course-demos-speech-to-speech-translation.hf.space" frameBorder="0" height="450" title="Gradio app" class="container p-0 flex-grow space-iframe" allow="accelerometer; ambient-light-sensor; autoplay; battery; camera; document-domain; encrypted-media; fullscreen; geolocation; gyroscope; layout-animations; legacy-image-formats; magnetometer; microphone; midi; oversized-images; payment; picture-in-picture; publickey-credentials-get; sync-xhr; usb; vr ; wake-lock; xr-spatial-tracking" sandbox="allow-forms allow-modals allow-popups allow-popups-to-escape-sandbox allow-same-origin allow-scripts allow-downloads"></iframe>
+
+你可以[复制](https://huggingface.co/spaces/course-demos/speech-to-speech-translation?duplicate=true)这个演示，并对其进行修改，例如使用不同的Whisper模型检查点、不同的TTS模型，或放宽“输出为英语语音”的限制，按照提示将其翻译成你选择的目标语言！
+
+## 展望未来
+
+尽管级联系统是一种在计算和数据上都很高效的STST构建方式，但它存在前文提到的误差传播和延迟累积问题。近年来的研究探索了一种**直接**（direct）的STST方法，它不再预测中间文本表示，而是直接从源语音映射到目标语音。这类系统还能够在目标语音中保留源说话人的说话特征（例如韵律、音高和语调），使输出更自然。如果你对此类系统感兴趣，可以参考[补充阅读](supplemental_reading)章节中的相关资料。
diff --git a/chapters/zh-CN/chapter7/supplemental_reading.mdx b/chapters/zh-CN/chapter7/supplemental_reading.mdx
new file mode 100644
index 0000000..cb7d79c
--- /dev/null
+++ b/chapters/zh-CN/chapter7/supplemental_reading.mdx
@@ -0,0 +1,17 @@
+# 补充阅读
+
+本单元整合了前面各单元中的多个组件，介绍了语音到语音翻译（speech-to-speech translation）、语音助手（voice assistants）以及说话人分离（speaker diarization）等新任务。为了方便阅读，以下拓展资料按这三个任务分类整理：
+
+语音到语音翻译：
+* [Meta AI：使用离散单元实现STST](https://ai.facebook.com/blog/advancing-direct-speech-to-speech-modeling-with-discrete-units/)：一种基于编码器-解码器模型的端到端STST方法
+* [Meta AI：闽南语直接语音翻译](https://ai.facebook.com/blog/ai-translation-hokkien/)：基于编码器-两阶段解码器模型的STST方法，支持低资源语言
+* [Google：利用无监督与弱监督数据改进STST](https://arxiv.org/abs/2203.13339)：提出使用无监督与弱监督数据训练直接STST模型的方法，并对Transformer架构作出改进
+* [Google：Translatotron-2](https://google-research.github.io/lingvo-lab/translatotron2/)：支持保留说话人音色的端到端语音翻译系统
+
+语音助手：
+* [Amazon：精准唤醒词检测](https://www.amazon.science/publications/accurate-detection-of-wake-word-start-and-end-using-a-cnn)：一种低延迟、适用于设备端应用的唤醒词检测方法
+* [Google：RNN-Transducer架构](https://arxiv.org/pdf/1811.06621.pdf)：对CTC架构进行改进以支持流式设备端语音识别
+
+会议转录：
+* [Hervé Bredin：pyannote.audio技术报告](https://huggingface.co/pyannote/speaker-diarization/blob/main/technical_report_2.1.pdf)：介绍`pyannote.audio`说话人分离流水线背后的核心技术原理
+* [Max Bain等：Whisper X](https://arxiv.org/pdf/2303.00747.pdf)：一种结合Whisper模型实现高精度词级时间戳的方法
diff --git a/chapters/zh-CN/chapter7/transcribe-meeting.mdx b/chapters/zh-CN/chapter7/transcribe-meeting.mdx
new file mode 100644
index 0000000..52cd5e5
--- /dev/null
+++ b/chapters/zh-CN/chapter7/transcribe-meeting.mdx
@@ -0,0 +1,198 @@
+# 会议转录
+
+在本节中，我们将使用Whisper模型为两位或多位说话者之间的对话或会议生成转录内容。然后，我们会结合一个**说话人分离**模型，用于预测“谁在什么时候说话”。通过将Whisper转录的时间戳与说话人分离模型的时间戳进行匹配，我们可以生成一份端到端的会议转录，为每位说话者准确标注起止时间。这就是你在网络上常见的会议转录服务（如 [Otter.ai](https://otter.ai)）的基础版本。
+
+<div class="flex justify-center">
+     <img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/diarization_transcription.png">
+ </div>
+
+## 说话人分离
+
+说话人分离（speaker diarization）是指从未标注的音频中预测“谁在什么时候说话”。这样一来，我们可以为每段发言预测开始与结束时间，精确对应每位说话者的说话时机。
+
+🤗 Transformers目前尚未在库中内置说话人分离模型，但Hub上已有可直接使用的检查点。本节中我们将使用[pyannote.audio](https://github.com/pyannote/pyannote-audio)提供的预训练模型。首先，我们需要安装相关工具包：
+
+```bash
+pip install --upgrade pyannote.audio
+```
+
+太棒了！该模型的权重托管在Hugging Face Hub上。访问前需要同意说话人分离模型的使用条款：[pyannote/speaker-diarization](https://huggingface.co/pyannote/speaker-diarization)，随后还需同意分割模型的使用条款：[pyannote/segmentation](https://huggingface.co/pyannote/segmentation)。
+
+完成之后，我们就可以在本地加载预训练的说话人分离流水线：
+
+```python
+from pyannote.audio import Pipeline
+
+diarization_pipeline = Pipeline.from_pretrained(
+    "pyannote/speaker-diarization@2.1", use_auth_token=True
+)
+```
+
+我们来试试用一个示例音频测试该流水线。为此，我们将加载[LibriSpeech ASR](https://huggingface.co/datasets/librispeech_asr)数据集中一个由两位说话者拼接而成的音频样本：
+
+```python
+from datasets import load_dataset
+
+concatenated_librispeech = load_dataset(
+    "sanchit-gandhi/concatenated_librispeech", split="train", streaming=True
+)
+sample = next(iter(concatenated_librispeech))
+```
+
+我们可以播放这段音频来听听效果：
+
+```python
+from IPython.display import Audio
+
+Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])
+```
+
+很酷！我们可以清晰地听出有两位说话者，在大约15秒的位置发生了转换。现在我们将这段音频传入分离模型，获取每位说话者的起止时间。请注意，pyannote.audio要求输入为形状为(channels, seq_len)的PyTorch张量，因此在运行模型前需要进行转换：
+
+```python
+import torch
+
+input_tensor = torch.from_numpy(sample["audio"]["array"][None, :]).float()
+outputs = diarization_pipeline(
+    {"waveform": input_tensor, "sample_rate": sample["audio"]["sampling_rate"]}
+)
+
+outputs.for_json()["content"]
+```
+
+```text
+[{'segment': {'start': 0.4978125, 'end': 14.520937500000002},
+  'track': 'B',
+  'label': 'SPEAKER_01'},
+ {'segment': {'start': 15.364687500000002, 'end': 21.3721875},
+  'track': 'A',
+  'label': 'SPEAKER_00'}]
+```
+
+效果相当不错！我们可以看到，模型预测第一位说话者的发言持续至14.5秒左右，第二位说话者则从15.4秒开始。接下来我们要进行语音转录了！
+
+## 语音转录
+
+这是本单元中第三次使用Whisper模型来完成语音转录任务。本节我们加载[Whisper Base](https://huggingface.co/openai/whisper-base)检查点，该模型体积小巧，能够在保持合理转录精度的同时提供不错的推理速度。和之前一样，你也可以自由选择其他Hub上的语音识别模型，比如Wav2Vec2、MMS ASR或其他Whisper模型：
+
+```python
+from transformers import pipeline
+
+asr_pipeline = pipeline(
+    "automatic-speech-recognition",
+    model="openai/whisper-base",
+)
+```
+
+现在我们来获取示例音频的转录内容，同时返回每段音频的起止时间戳。你还记得在第5单元中，为了启用Whisper的时间戳预测功能，我们需要传入参数`return_timestamps=True`：
+
+```python
+asr_pipeline(
+    sample["audio"].copy(),
+    generate_kwargs={"max_new_tokens": 256},
+    return_timestamps=True,
+)
+```
+
+```text
+{
+    "text": " The second and importance is as follows. Sovereignty may be defined to be the right of making laws. In France, the king really exercises a portion of the sovereign power, since the laws have no weight. He was in a favored state of mind, owing to the blight his wife's action threatened to cast upon his entire future.",
+    "chunks": [
+        {"timestamp": (0.0, 3.56), "text": " The second and importance is as follows."},
+        {
+            "timestamp": (3.56, 7.84),
+            "text": " Sovereignty may be defined to be the right of making laws.",
+        },
+        {
+            "timestamp": (7.84, 13.88),
+            "text": " In France, the king really exercises a portion of the sovereign power, since the laws have",
+        },
+        {"timestamp": (13.88, 15.48), "text": " no weight."},
+        {
+            "timestamp": (15.48, 19.44),
+            "text": " He was in a favored state of mind, owing to the blight his wife's action threatened to",
+        },
+        {"timestamp": (19.44, 21.28), "text": " cast upon his entire future."},
+    ],
+}
+```
+
+好了！我们看到每段转录文本都带有开始和结束时间戳，而说话人在15.48秒处发生了切换。现在我们就可以将这个转录结果与说话人分离模型的时间戳结合起来，得到最终的会议转录了。
+
+## Speechbox
+
+为了生成最终的转录结果，我们需要将说话人分离模型的时间戳与Whisper模型的时间戳对齐。说话人分离模型预测第一位说话者在14.5秒结束，第二位在15.4秒开始。而Whisper模型预测的片段边界分别为13.88、15.48和19.44秒。由于两个模型的时间戳并不完全匹配，我们需要找到最接近14.5和15.4秒的文本片段边界，并据此按说话人进行切分。具体来说，我们通过最小化两组时间戳之间的绝对距离，来找出最合适的对齐方式。
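+
+下面用上文的数值给出这一对齐思路的最小草图。它只是示意，并非Speechbox的实际实现；`closest_boundary`和`split_by_speaker`都是为说明而假设的函数名，并且这里假设以第二位说话者的开始时间（约15.36秒）作为切分目标：
+
+```python
+def closest_boundary(target, boundaries):
+    """返回boundaries中与target绝对距离最小的时间戳。"""
+    return min(boundaries, key=lambda t: abs(t - target))
+
+
+def split_by_speaker(chunks, change_time):
+    """在最接近change_time的片段结束边界处，把转录片段分成前后两部分。"""
+    ends = [end for (_, end), _ in chunks]
+    cut = closest_boundary(change_time, ends)
+    first = [text for (_, end), text in chunks if end <= cut]
+    second = [text for (_, end), text in chunks if end > cut]
+    return cut, "".join(first), "".join(second)
+
+
+# Whisper片段的时间戳与文本，取自上文的输出（部分文本以...删节）
+chunks = [
+    ((0.0, 3.56), " The second and importance is as follows."),
+    ((3.56, 7.84), " Sovereignty may be defined to be the right of making laws."),
+    ((7.84, 13.88), " In France, the king really exercises ..."),
+    ((13.88, 15.48), " no weight."),
+    ((15.48, 19.44), " He was in a favored state of mind ..."),
+    ((19.44, 21.28), " cast upon his entire future."),
+]
+
+# 假设：以说话人分离模型给出的第二位说话者开始时间作为切分目标
+cut, first_speaker_text, second_speaker_text = split_by_speaker(chunks, 15.36)
+print(cut)  # 15.48
+```
+
+在这个例子里，最接近15.36秒的片段边界是15.48秒，这与本节后面组合流水线输出中的说话人切换时间一致。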
+
+幸运的是，我们可以使用🤗 Speechbox工具包来完成这个对齐过程。首先，从主分支安装`speechbox`：
+
+```bash
+pip install git+https://github.com/huggingface/speechbox
+```
+
+接着，我们通过传入ASR模型和说话人分离模型，实例化组合流水线[`ASRDiarizationPipeline`](https://github.com/huggingface/speechbox/tree/main#asr-with-speaker-diarization)：
+
+```python
+from speechbox import ASRDiarizationPipeline
+
+pipeline = ASRDiarizationPipeline(
+    asr_pipeline=asr_pipeline, diarization_pipeline=diarization_pipeline
+)
+```
+
+<Tip>
+
+你也可以直接使用Hub上的模型ID调用预训练版本的<code>ASRDiarizationPipeline</code>：
+<p><code>pipeline = ASRDiarizationPipeline.from_pretrained("openai/whisper-base")</code></p>
+
+</Tip>
+
+将音频输入该组合流水线，看看输出结果如何：
+
+```python
+pipeline(sample["audio"].copy())
+```
+
+```text
+[{'speaker': 'SPEAKER_01',
+  'text': ' The second and importance is as follows. Sovereignty may be defined to be the right of making laws. In France, the king really exercises a portion of the sovereign power, since the laws have no weight.',
+  'timestamp': (0.0, 15.48)},
+ {'speaker': 'SPEAKER_00',
+  'text': " He was in a favored state of mind, owing to the blight his wife's action threatened to cast upon his entire future.",
+  'timestamp': (15.48, 21.28)}]
+```
+
+太棒了！第一位说话者的发言时间为0到15.48秒，第二位说话者为15.48到21.28秒，并分别对应各自的转录文本。
+
+我们可以通过定义两个辅助函数来让时间戳的格式更美观一些。第一个函数将时间戳元组格式化为字符串，并保留指定的小数位数。第二个函数将说话人ID、时间戳和文本组合成一行，并将不同说话人分行展示，方便阅读：
+
+```python
+def tuple_to_string(start_end_tuple, ndigits=1):
+    return str((round(start_end_tuple[0], ndigits), round(start_end_tuple[1], ndigits)))
+
+
+def format_as_transcription(raw_segments):
+    return "\n\n".join(
+        [
+            chunk["speaker"] + " " + tuple_to_string(chunk["timestamp"]) + chunk["text"]
+            for chunk in raw_segments
+        ]
+    )
+```
+
+我们再次运行流水线，并使用刚刚定义的格式化函数美化输出：
+
+```python
+outputs = pipeline(sample["audio"].copy())
+
+format_as_transcription(outputs)
+```
+
+```text
+SPEAKER_01 (0.0, 15.5) The second and importance is as follows. Sovereignty may be defined to be the right of making laws.
+In France, the king really exercises a portion of the sovereign power, since the laws have no weight.
+
+SPEAKER_00 (15.5, 21.3) He was in a favored state of mind, owing to the blight his wife's action threatened to cast upon
+his entire future.
+```
+
+就这样！我们已经成功完成了音频的说话人分离与文本转录，并返回了按说话人分段的完整转录内容。用于对齐时间戳的最小距离算法虽然简单，但在实践中效果很好。如果你想进一步探索更复杂的时间戳合并方法，不妨从`ASRDiarizationPipeline`的源码入手：[speechbox/diarize.py](https://github.com/huggingface/speechbox/blob/96d2d1a180252d92263f862a1cd25a48860f1aed/src/speechbox/diarize.py#L12)。
diff --git a/chapters/zh-CN/chapter7/voice-assistant.mdx b/chapters/zh-CN/chapter7/voice-assistant.mdx
new file mode 100644
index 0000000..4bd340b
--- /dev/null
+++ b/chapters/zh-CN/chapter7/voice-assistant.mdx
@@ -0,0 +1,374 @@
+# 构建语音助手
+
+在本节中，我们将组合三个之前已经实践过的模型，构建一个端到端的语音助手，名为**Marvin** 🤖。就像Amazon的Alexa或Apple的Siri一样，Marvin是一个虚拟语音助手，会响应特定的“唤醒词”，然后监听用户的语音提问，并以语音作答。
+
+我们可以将语音助手的流程拆解为四个阶段，每个阶段都对应一个独立的模型：
+
+<div class="flex justify-center">
+     <img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/voice_assistant.png">
+ </div>
+
+### 1. 唤醒词检测
+
+语音助手会持续监听设备麦克风接收到的音频，但只有在用户说出特定的“唤醒词”或“触发词”时才会启动。
+
+唤醒词检测任务由一个小型的本地音频分类模型完成，它远小于语音识别模型，通常只有几百万个参数，而语音识别模型则可能有几亿个参数。因此，这个小模型可以持续运行而不会消耗太多电量。只有检测到唤醒词时，才会加载更大的语音识别模型，使用完后又会立即关闭。
+
+### 2. 语音转文本
+
+下一步是将用户的语音问题转录为文本。实际上，由于音频文件体积较大，将其从本地传输到云端的速度较慢，因此更高效的方式是在本地设备上使用自动语音识别（ASR）模型进行转录。虽然本地模型可能比云端模型更小、精度略低，但由于推理速度快，可以实现近实时的转录体验，说话的同时就能看到文字结果。
+
+我们现在已经非常熟悉语音识别流程了，所以这一步轻而易举！
+
+### 3. 语言模型生成回复
+
+识别出用户的问题后，我们就需要生成一个回答。完成这一步最好的模型是**大型语言模型（LLM）**，它们能够理解文本语义并生成合适的自然语言回复。
+
+由于用户的提问很短（只包含少量文本token），而语言模型很大（拥有数十亿参数），最有效的方式是将文本提问发送到云端的语言模型，生成回答后再返回给本地设备。
+
+### 4. 语音合成
+
+最后一步，我们使用文本到语音（TTS）模型将文本回复合成为语音。这通常在本地执行，但也可以在云端运行TTS模型，生成音频后再传回设备。
+
+这一过程我们也已经多次完成，相信你已经驾轻就熟了！
+
+<Tip>
+
+下面的部分需要使用麦克风录制语音。由于Google Colab不支持麦克风功能，建议在本地运行本节内容，可以使用CPU，也可以使用本地GPU（如果有）。我们选用的模型检查点都足够小，即使在CPU上也能保持良好的性能。
+
+</Tip>
+
+## 唤醒词检测
+
+语音助手流程的第一步是检测用户是否说出了唤醒词。为此，我们需要一个合适的预训练模型！还记得我们在[音频分类的预训练模型](../chapter4/classification_models)一节中提到的[Speech Commands](https://huggingface.co/datasets/speech_commands)数据集吗？这个数据集包含多种简单指令词的语音样本（如`"up"`、`"down"`、`"yes"`和`"no"`），还有一个`"silence"`标签用于分类无语音片段，常用于评估音频分类模型的表现。你可以花一点时间在Hub上通过数据集预览器试听这些样本，重新熟悉一下Speech Commands数据集：[datasets viewer](https://huggingface.co/datasets/speech_commands/viewer/v0.01/train)。
+
+我们可以选用一个在Speech Commands数据集上预训练好的音频分类模型，并从中选择一个简单的指令词作为唤醒词。在15个以上的可选指令词中，只要模型对我们指定的唤醒词预测概率最高，就可以基本确定用户说出了唤醒词。
+
+首先，前往Hugging Face Hub的“Models”（模型）页面：[https://huggingface.co/models](https://huggingface.co/models)
+
+你将看到所有托管在Hugging Face Hub上的模型，默认按过去30天下载量排序：
+
+<div class="flex justify-center">
+     <img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/all_models.png">
+ </div>
+
+在左侧，你可以使用筛选选项按任务、库、数据集等进行过滤。向下滚动，在音频任务列表中选择“Audio Classification”（音频分类）：
+
+<div class="flex justify-center">
+     <img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/by_audio_classification.png">
+ </div>
+
+现在我们看到的是Hub上500多个音频分类模型。我们可以进一步通过数据集进行筛选。点击“Datasets”标签页，在搜索框中输入“speech\_commands”。当你开始输入时，会出现`speech_commands`的选项，点击它即可筛选出所有在该数据集上微调的音频分类模型：
+
+<div class="flex justify-center">
+     <img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/by_speech_commands.png">
+ </div>
+
+太好了！我们可以看到目前有6个适用于该数据集和任务的预训练模型（如果你是在较晚时间阅读本课程，也许会有更多模型）。你可能已经认出了第一个模型，也就是我们在第4单元中用过的[Audio Spectrogram Transformer](https://huggingface.co/MIT/ast-finetuned-speech-commands-v2)。我们将在唤醒词检测任务中继续使用这个模型。
+
+使用`pipeline`类加载该模型如下所示：
+
+```python
+from transformers import pipeline
+import torch
+
+device = "cuda:0" if torch.cuda.is_available() else "cpu"
+
+classifier = pipeline(
+    "audio-classification", model="MIT/ast-finetuned-speech-commands-v2", device=device
+)
+```
+
+我们可以通过查看模型配置中的`id2label`属性来确认该模型的分类标签：
+
+```python
+classifier.model.config.id2label
+```
+
+可以看到，该模型在35个类别标签上训练，包含一些简单的指令词（如前面提到的）以及一些物品名称（如 `"bed"`、`"house"`、`"cat"`）。其中标签编号27对应的是名称**"marvin"**：
+
+```python
+classifier.model.config.id2label[27]
+```
+
+```
+'marvin'
+```
+
+太好了！我们可以将这个名字作为语音助手的唤醒词，就像Amazon的Alexa或Apple的Siri一样。如果模型将`"marvin"`预测为概率最高的类别，我们就可以基本确定用户说出了唤醒词。
+
+接下来，我们需要定义一个函数，用于持续监听麦克风输入，并将音频实时传入分类模型进行推理。为此，我们将使用🤗 Transformers提供的辅助函数[`ffmpeg_microphone_live`](https://github.com/huggingface/transformers/blob/fb78769b9c053876ed7ae152ee995b0439a4462a/src/transformers/pipelines/audio_utils.py#L98)。
+
+该函数会将指定长度（`chunk_length_s`）的小段音频传入模型，并使用滑动窗口方式平滑拼接每段音频，窗口步长为`chunk_length_s / 6`。为了降低初始延迟，它还会在未达到一个完整音频段时提前以`stream_chunk_s`长度的音频段开始传入模型。
+
+`ffmpeg_microphone_live`返回的是一个*生成器*对象，每次生成一段音频输入，我们可以将其传入`pipeline`模型进行推理。模型会返回每段音频的预测结果，我们可以根据每段的预测标签及其概率来判断是否听到了唤醒词。
+
+我们采用一个非常简单的判断标准：如果预测得分最高的标签就是唤醒词，并且该标签的概率超过了设定阈值（`prob_threshold`），则认定唤醒词被说出。通过设定概率阈值，我们可以降低模型因背景噪声等引发的误判。如果你愿意，还可以进一步通过[*熵值（entropy）*](https://en.wikipedia.org/wiki/Entropy_(information_theory))或不确定性等指标来改进判断方式。
+
+```python
+from transformers.pipelines.audio_utils import ffmpeg_microphone_live
+
+
+def launch_fn(
+    wake_word="marvin",
+    prob_threshold=0.5,
+    chunk_length_s=2.0,
+    stream_chunk_s=0.25,
+    debug=False,
+):
+    if wake_word not in classifier.model.config.label2id.keys():
+        raise ValueError(
+            f"Wake word {wake_word} not in set of valid class labels, pick a wake word in the set {classifier.model.config.label2id.keys()}."
+        )
+
+    sampling_rate = classifier.feature_extractor.sampling_rate
+
+    mic = ffmpeg_microphone_live(
+        sampling_rate=sampling_rate,
+        chunk_length_s=chunk_length_s,
+        stream_chunk_s=stream_chunk_s,
+    )
+
+    print("Listening for wake word...")
+    for prediction in classifier(mic):
+        prediction = prediction[0]
+        if debug:
+            print(prediction)
+        if prediction["label"] == wake_word:
+            if prediction["score"] > prob_threshold:
+                return True
+```
+
+我们可以尝试运行这个函数来看看效果。设置`debug=True`以输出每段音频的预测结果。先让模型运行几秒钟，观察在没有语音输入的情况下的预测结果，然后清晰地说出唤醒词`"marvin"`，你会看到该标签的概率骤升至接近1：
+
+```python
+launch_fn(debug=True)
+```
+
+```text
+Listening for wake word...
+{'score': 0.055326107889413834, 'label': 'one'}
+{'score': 0.05999856814742088, 'label': 'off'}
+{'score': 0.1282748430967331, 'label': 'five'}
+{'score': 0.07310110330581665, 'label': 'follow'}
+{'score': 0.06634809821844101, 'label': 'follow'}
+{'score': 0.05992642417550087, 'label': 'tree'}
+{'score': 0.05992642417550087, 'label': 'tree'}
+{'score': 0.999913215637207, 'label': 'marvin'}
+```
+
+太棒了！正如预期，前几秒的预测几乎都是随机的，因为没有语音输入，模型的置信度很低。当我们说出唤醒词时，模型会高置信度地预测`"marvin"`，并立即终止循环，这就表明唤醒词已被检测到，ASR系统可以启动了！
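+
+如果想尝试前面提到的基于熵的判断，下面是一个最小草图。它假设我们已经拿到了所有标签的完整概率分布（`launch_fn`中只使用了得分最高的一项）；`is_confident_wake_word`及其`max_entropy`阈值都是为说明而假设的，具体数值需要根据实际情况调整：
+
+```python
+import math
+
+
+def entropy(probs):
+    """计算概率分布的香农熵（以自然对数为底）。"""
+    return -sum(p * math.log(p) for p in probs if p > 0)
+
+
+def is_confident_wake_word(
+    scores, wake_word="marvin", prob_threshold=0.5, max_entropy=1.0
+):
+    """scores是{标签: 概率}字典；仅当最高分标签是唤醒词、
+    其概率超过阈值且整个分布的熵足够低时返回True。"""
+    top_label = max(scores, key=scores.get)
+    return (
+        top_label == wake_word
+        and scores[top_label] > prob_threshold
+        and entropy(scores.values()) < max_entropy
+    )
+
+
+# 概率集中在唤醒词上：通过
+confident = is_confident_wake_word({"marvin": 0.97, "one": 0.02, "off": 0.01})
+# 唤醒词得分过半，但其余概率分散、熵较高：不通过
+uncertain = is_confident_wake_word(
+    {"marvin": 0.55, "one": 0.15, "off": 0.15, "five": 0.15}
+)
+print(confident, uncertain)  # True False
+```
+
+与只看最高分相比，熵还考虑了其余类别上的概率分布，因此可以过滤掉模型“犹豫不决”时的预测。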
+
+## 语音转录
+
+我们依然使用Whisper模型来进行语音转录。这里我们选择的是[Whisper Base English](https://huggingface.co/openai/whisper-base.en)检查点，它模型小巧，能够在保证一定准确率的同时实现较快的推理速度。我们还会使用一些技巧，通过控制音频输入方式，使模型接近实时进行转录。当然，你也可以根据需要在[🤗 Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&library=transformers&sort=trending)上选择其他语音识别模型，例如Wav2Vec2、MMS ASR或其他Whisper检查点：
+
+```python
+transcriber = pipeline(
+    "automatic-speech-recognition", model="openai/whisper-base.en", device=device
+)
+```
+
+<Tip>
+
+如果你使用的是GPU，也可以将模型替换为更大的[Whisper Small English](https://huggingface.co/openai/whisper-small.en)检查点，它在延迟控制范围内提供了更好的转录准确率。只需将模型ID更改为`"openai/whisper-small.en"`。
+
+</Tip>
+
+我们现在可以定义一个函数，从麦克风录制语音并进行转录。借助`ffmpeg_microphone_live`工具函数，我们可以控制语音识别的“实时程度”。较小的`stream_chunk_s`值会使语音被划分成更小片段，从而更实时地进行推理，但这也会带来上下文减少的问题，影响模型准确率。
+
+此外，我们还需要检测用户**停止**说话的时机，以终止录音。为了简化处理，我们默认在录音达到`chunk_length_s`（默认为5秒）后自动终止。你也可以尝试使用[语音活动检测模型（Voice Activity Detection，VAD）](https://huggingface.co/models?pipeline_tag=voice-activity-detection&sort=trending)来精确判断用户是否仍在说话。
+
+```python
+import sys
+
+
+def transcribe(chunk_length_s=5.0, stream_chunk_s=1.0):
+    sampling_rate = transcriber.feature_extractor.sampling_rate
+
+    mic = ffmpeg_microphone_live(
+        sampling_rate=sampling_rate,
+        chunk_length_s=chunk_length_s,
+        stream_chunk_s=stream_chunk_s,
+    )
+
+    print("Start speaking...")
+    for item in transcriber(mic, generate_kwargs={"max_new_tokens": 128}):
+        sys.stdout.write("\033[K")
+        print(item["text"], end="\r")
+        if not item["partial"][0]:
+            break
+
+    return item["text"]
+```
+
+运行这个函数看看效果吧！麦克风开启后开始说话，你将看到模型实时转录的文本结果显示在屏幕上：
+
+```python
+transcribe()
+```
+
+```text
+Start speaking...
+ Hey, this is a test with the whisper model.
+```
+
+很不错！你可以根据自己的说话情况调整`chunk_length_s`的值（如果觉得说话时间不够就加长，如果说完后还要等待很久就缩短），也可以修改`stream_chunk_s`参数以优化实时效果。只需将它们作为参数传入`transcribe`函数即可。
+
+## 调用语言模型生成回答
+
+有了转录文本后，我们需要生成一个有意义的回答。为此，我们可以调用部署在云端的LLM（大型语言模型）。我们将通过Hugging Face Hub上的[Inference API](https://huggingface.co/inference-api)来调用模型。
+
+首先，前往Hugging Face Hub。我们推荐使用[🤗 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)页面来查找高性能的LLM，并筛选包含"instruct"的模型，这些模型经过指令微调，更适合用作问答助手：
+
+<div class="flex justify-center">
+     <img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/llm_leaderboard.png">
+ </div>
+
+我们将使用[tiiuae/falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct)模型，这是由[TII](https://www.tii.ae/)提供的70亿参数解码器式语言模型，经过聊天与指令数据集的微调。你也可以选择其他开启了“Hosted inference API”的模型，查看模型卡右侧是否有相关小组件即可：
+
+<div class="flex justify-center">
+     <img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/inference_api.png">
+ </div>
+
+Inference API允许我们从本地向云端模型发送HTTP请求，并返回JSON格式的回答。你只需提供Hugging Face账号中的访问令牌（token）和要调用的模型ID即可：
+
+```python
+from huggingface_hub import HfFolder
+import requests
+
+
+def query(text, model_id="tiiuae/falcon-7b-instruct"):
+    api_url = f"https://api-inference.huggingface.co/models/{model_id}"
+    headers = {"Authorization": f"Bearer {HfFolder().get_token()}"}
+    payload = {"inputs": text}
+
+    print(f"Querying...: {text}")
+    response = requests.post(api_url, headers=headers, json=payload)
+    return response.json()[0]["generated_text"][len(text) + 1 :]
+```
+
+试试看这个示例输入吧：
+
+```python
+query("What does Hugging Face do?")
+```
+
+```
+'Hugging Face is a company that provides natural language processing and machine learning tools for developers. They'
+```
+
+你会注意到，通过Inference API推理的响应速度非常快：本地只需发送少量的文本token到云端模型，通信开销很低；模型在GPU加速器上运行，推理迅速；生成的结果再返回到本地，通信开销同样很低。
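+
+`query`最后一行的切片假设返回的`generated_text`以原始提示开头，后面跟一个分隔字符。下面的草图用一个手工构造的响应（并非真实的API输出）来演示这一步，并在假设不成立时原样返回文本；`strip_prompt`是为说明而假设的辅助函数：
+
+```python
+def strip_prompt(generated_text, prompt):
+    """如果生成文本以提示开头，就去掉提示及其后的一个分隔字符；否则原样返回。"""
+    if generated_text.startswith(prompt):
+        return generated_text[len(prompt) + 1 :]
+    return generated_text
+
+
+# 手工构造的响应，仅模仿query函数所解析的结构
+mock_response = [
+    {"generated_text": "What does Hugging Face do?\nHugging Face is a company."}
+]
+
+prompt = "What does Hugging Face do?"
+answer = strip_prompt(mock_response[0]["generated_text"], prompt)
+print(answer)  # Hugging Face is a company.
+```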
+
+## 语音合成
+
+现在我们准备好生成最终的语音回答了！我们依然使用微软的[SpeechT5 TTS](https://huggingface.co/microsoft/speecht5_tts)模型进行英文语音合成。当然，你也可以根据需要选择任何其他TTS模型。我们加载处理器和模型：
+
+```python
+from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
+
+processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
+
+model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts").to(device)
+vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device)
+```
+
+再加载说话人嵌入向量：
+
+```python
+from datasets import load_dataset
+
+embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
+speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
+```
+
+我们将复用前面[语音到语音翻译](speech-to-speech)一节中定义的`synthesise`函数：
+
+```python
+def synthesise(text):
+    inputs = processor(text=text, return_tensors="pt")
+    speech = model.generate_speech(
+        inputs["input_ids"].to(device), speaker_embeddings.to(device), vocoder=vocoder
+    )
+    return speech.cpu()
+```
+
+最后来验证一下是否能正常工作：
+
+```python
+from IPython.display import Audio
+
+audio = synthesise(
+    "Hugging Face is a company that provides natural language processing and machine learning tools for developers."
+)
+
+Audio(audio, rate=16000)
+```
+
+干得漂亮👍！
+
+## Marvin 🤖
+
+现在我们已经为语音助手流程的四个阶段分别定义了函数，接下来只需将它们组合起来即可实现端到端语音助手。我们会依次调用唤醒词检测（`launch_fn`）、语音转录、LLM 查询以及语音合成四个阶段。
+
+```python
+launch_fn()
+transcription = transcribe()
+response = query(transcription)
+audio = synthesise(response)
+
+Audio(audio, rate=16000, autoplay=True)
+```
+
+试试一些提示语吧！下面是几个示例：
+* *What is the hottest country in the world?*（世界上最热的国家是哪个？）
+* *How do Transformer models work?*（Transformer 模型是如何工作的？）
+* *Do you know Spanish?*（你会说西班牙语吗？）
+
+至此，我们就完成了端到端语音助手的构建，全部由你在本课程中掌握的🤗音频工具打造，最后再加上一点LLM的魔法。当然，还有不少可以改进的地方。首先，当前使用的音频分类模型包含35个不同的标签，我们其实可以改用一个更小巧、轻量的二分类模型，只判断是否说出了唤醒词。其次，我们是预先加载所有模型，并让它们持续驻留在设备上运行。为了节省设备资源，更理想的方式是：在需要时再加载对应模型，用完后立即释放。最后，我们的转录函数中缺乏语音活动检测（VAD），只能基于固定时间窗口进行转录，这样在某些情况下可能太短，而在另一些情况下又显得太长。
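+
+针对“按需加载、用完释放”这一点，下面是一个纯Python的最小草图。`LazyModel`是为说明而假设的辅助类，示例中用一个假的加载函数代替真实模型；在实际使用中，加载函数可以换成创建`pipeline`的代码：
+
+```python
+class LazyModel:
+    """首次调用时才加载模型，之后可以显式释放（为说明而假设的辅助类）。"""
+
+    def __init__(self, loader):
+        self._loader = loader  # 无参数的可调用对象，返回加载好的模型
+        self._model = None
+
+    @property
+    def loaded(self):
+        return self._model is not None
+
+    def __call__(self, *args, **kwargs):
+        if self._model is None:
+            self._model = self._loader()
+        return self._model(*args, **kwargs)
+
+    def release(self):
+        """丢弃对模型的引用，让垃圾回收器可以回收它占用的内存。"""
+        self._model = None
+
+
+# 用一个假的“模型”代替真实的pipeline，仅用于演示加载与释放的时机
+lazy_transcriber = LazyModel(lambda: (lambda audio: f"transcript of {audio}"))
+
+print(lazy_transcriber.loaded)  # False
+print(lazy_transcriber("question.wav"))  # transcript of question.wav
+print(lazy_transcriber.loaded)  # True
+lazy_transcriber.release()
+print(lazy_transcriber.loaded)  # False
+```
+
+这样，较大的语音识别模型只会在检测到唤醒词之后才被加载，回答完成后即可释放。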
+
+## 万物皆可语音助手🪄
+
+目前我们实现了语音助手Marvin的语音输出功能。最后，我们将展示如何将其推广到生成文本、音频和图像等多模态内容。
+
+我们将使用[Transformers Agents](https://huggingface.co/docs/transformers/transformers_agents)来构建这个通用助手。Transformers Agents 提供了一个自然语言 API，基于🤗 Transformers和Diffusers库。它通过精心设计的提示词，让LLM理解用户意图，并结合一组精选工具生成多模态输出。
+
+我们先初始化一个Agent。目前Transformers Agents支持[三种 LLM](https://huggingface.co/docs/transformers/transformers_agents#quickstart)，其中两个是Hugging Face Hub上的开源模型，第三个是来自OpenAI的模型，需要OpenAI API密钥。这里我们使用免费开源的[Bigcode Starcoder](https://huggingface.co/bigcode/starcoder)模型：
+
+```python
+from transformers import HfAgent
+
+agent = HfAgent(
+    url_endpoint="https://api-inference.huggingface.co/models/bigcode/starcoder"
+)
+```
+
+调用`agent.run`并传入文本提示，即可开始生成。比如我们让它画一只猫🐈：
+
+```python
+agent.run("Generate an image of a cat")
+```
+
+<div class="flex justify-center">
+     <img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/generated_cat.png">
+ </div>
+
+<Tip>
+
+首次调用时模型权重将被下载，时间取决于Hugging Face Hub的下载速度。
+
+</Tip>
+
+就是这么简单！Agent会自动理解提示语，并调用[Stable Diffusion](https://huggingface.co/docs/diffusers/using-diffusers/conditional_image_generation)生成图像，整个过程无需我们手动加载模型、编写函数或执行代码。
+
+我们现在可以将之前Marvin中的LLM查询与语音合成步骤用Transformers Agent替代，因为Agent会自动完成这两个任务：
+
+```python
+launch_fn()
+transcription = transcribe()
+agent.run(transcription)
+```
+
+现在可以试试对着麦克风说：“生成一只猫的图像”，看看系统效果如何。如果你提的是常规问答问题，Agent会返回文本答案；如果提示中包含图像或语音要求，它会生成相应的多模态内容。例如你可以说：“画一只猫，为它添加标题，并朗读这个标题”。
+
+尽管相比第一版Marvin 🤖，Agent的功能更灵活，但在常规语音助手场景下，其表现可能不如前者。如果想提升性能，可以使用更强大的LLM（如OpenAI模型），或定义一组专门面向语音助手任务的[自定义工具](https://huggingface.co/docs/transformers/transformers_agents#custom-tools)。