From 9f72205e1b2862f2c62485638188f8d35d76280b Mon Sep 17 00:00:00 2001 From: zefang-liu Date: Tue, 29 Jul 2025 15:16:43 -0700 Subject: [PATCH 1/7] Copy the chapter 4 --- chapters/zh-CN/_toctree.yml | 28 +- .../zh-CN/chapter4/classification_models.mdx | 320 ++++++++++ chapters/zh-CN/chapter4/demo.mdx | 44 ++ chapters/zh-CN/chapter4/fine-tuning.mdx | 579 ++++++++++++++++++ chapters/zh-CN/chapter4/hands_on.mdx | 38 ++ chapters/zh-CN/chapter4/introduction.mdx | 22 + 6 files changed, 1017 insertions(+), 14 deletions(-) create mode 100644 chapters/zh-CN/chapter4/classification_models.mdx create mode 100644 chapters/zh-CN/chapter4/demo.mdx create mode 100644 chapters/zh-CN/chapter4/fine-tuning.mdx create mode 100644 chapters/zh-CN/chapter4/hands_on.mdx create mode 100644 chapters/zh-CN/chapter4/introduction.mdx diff --git a/chapters/zh-CN/_toctree.yml b/chapters/zh-CN/_toctree.yml index eab9891a..2c3c6274 100644 --- a/chapters/zh-CN/_toctree.yml +++ b/chapters/zh-CN/_toctree.yml @@ -52,19 +52,19 @@ - local: chapter3/supplemental_reading title: 补充阅读 -#- title: 第4单元:构建音乐风格分类器 -# sections: -# - local: chapter4/introduction -# title: 单元简介 -# - local: chapter4/classification_models -# title: 音频分类的预训练模型 -# - local: chapter4/fine-tuning -# title: 针对音乐分类进行微调 -# - local: chapter4/demo -# title: 使用Gradio构建demo -# - local: chapter4/hands_on -# title: 实战练习 -# +- title: 第4单元:构建音乐风格分类器 + sections: + - local: chapter4/introduction + title: 单元简介 + - local: chapter4/classification_models + title: 音频分类的预训练模型 + - local: chapter4/fine-tuning + title: 针对音乐分类进行微调 + - local: chapter4/demo + title: 使用Gradio构建demo + - local: chapter4/hands_on + title: 实战练习 + - title: 第5单元:自动语音识别 (ASR) sections: - local: chapter5/introduction @@ -84,7 +84,7 @@ - local: chapter5/supplemental_reading title: 补充阅读 -- title: 第六单元:从文本到语音 +- title: 第6单元:从文本到语音 sections: - local: chapter6/introduction title: 单元简介 diff --git a/chapters/zh-CN/chapter4/classification_models.mdx b/chapters/zh-CN/chapter4/classification_models.mdx new file mode 100644 index 00000000..782b8a98 --- /dev/null +++ b/chapters/zh-CN/chapter4/classification_models.mdx @@ -0,0 +1,320 @@ +# Pre-trained models and datasets for audio classification + +The Hugging Face Hub is home to over 500 pre-trained models for audio classification. In this section, we'll go through +some of the most common audio classification tasks and suggest appropriate pre-trained models for each. Using the `pipeline()` +class, switching between models and tasks is straightforward - once you know how to use `pipeline()` for one model, you'll +be able to use it for any model on the Hub no code changes! This makes experimenting with the `pipeline()` class extremely +fast, allowing you to quickly select the best pre-trained model for your needs. + +Before we jump into the various audio classification problems, let's quickly recap the transformer architectures typically +used. The standard audio classification architecture is motivated by the nature of the task; we want to transform a sequence +of audio inputs (i.e. our input audio array) into a single class label prediction. Encoder-only models first map the input +audio sequence into a sequence of hidden-state representations by passing the inputs through a transformer block. The +sequence of hidden-state representations is then mapped to a class label output by taking the mean over the hidden-states, +and passing the resulting vector through a linear classification layer. Hence, there is a preference for _encoder-only_ +models for audio classification. 
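
To make that recipe concrete, here is a minimal sketch of the pooling-and-classification step, with the encoder output replaced
by random numbers and purely illustrative shapes (not those of any particular checkpoint):

```python
import torch
import torch.nn as nn

# Stand-in for the encoder output: one 768-dim hidden state per audio frame
hidden_states = torch.randn(1, 250, 768)  # (batch, time steps, hidden size)

# Classification head: mean-pool over the time dimension, then project to logits
classification_head = nn.Linear(768, 10)  # e.g. 10 candidate classes

pooled = hidden_states.mean(dim=1)  # (batch, hidden size)
logits = classification_head(pooled)  # (batch, num classes)
predicted_class_id = logits.argmax(dim=-1)
```

In practice you rarely write this head yourself: the audio classification classes in 🤗 Transformers (and the `pipeline()`
wrappers around them) attach a classification head to the pre-trained encoder for you, with the exact pooling details varying
between checkpoints.
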
+ +Decoder-only models introduce unnecessary complexity to the task, since they assume that the outputs can also be a _sequence_ +of predictions (rather than a single class label prediction), and so generate multiple outputs. Therefore, they have slower +inference speed and tend not to be used. Encoder-decoder models are largely omitted for the same reason. These architecture +choices are analogous to those in NLP, where encoder-only models such as [BERT](https://huggingface.co/blog/bert-101) +are favoured for sequence classification tasks, and decoder-only models such as GPT reserved for sequence generation tasks. + +Now that we've recapped the standard transformer architecture for audio classification, let's jump into the different +subsets of audio classification and cover the most popular models! + +## 🤗 Transformers Installation + +At the time of writing, the latest updates required for audio classification pipeline are only on the `main` version of +the 🤗 Transformers repository, rather than the latest PyPi version. To make sure we have these updates locally, we'll +install Transformers from the `main` branch with the following command: + +``` +pip install git+https://github.com/huggingface/transformers +``` + +## Keyword Spotting + +Keyword spotting (KWS) is the task of identifying a keyword in a spoken utterance. The set of possible keywords forms the +set of predicted class labels. Hence, to use a pre-trained keyword spotting model, you should ensure that your keywords +match those that the model was pre-trained on. Below, we'll introduce two datasets and models for keyword spotting. + +### Minds-14 + +Let's go ahead and use the same [MINDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset that you have explored +in the previous unit. If you recall, MINDS-14 contains recordings of people asking an e-banking system questions in several +languages and dialects, and has the `intent_class` for each recording. We can classify the recordings by intent of the call. + +```python +from datasets import load_dataset + +minds = load_dataset("PolyAI/minds14", name="en-AU", split="train") +``` + +We'll load the checkpoint [`"anton-l/xtreme_s_xlsr_300m_minds14"`](https://huggingface.co/anton-l/xtreme_s_xlsr_300m_minds14), +which is an XLS-R model fine-tuned on MINDS-14 for approximately 50 epochs. It achieves 90% accuracy over all languages +from MINDS-14 on the evaluation set. + +```python +from transformers import pipeline + +classifier = pipeline( + "audio-classification", + model="anton-l/xtreme_s_xlsr_300m_minds14", +) +``` + +Finally, we can pass a sample to the classification pipeline to make a prediction: +```python +classifier(minds[0]["audio"]) +``` +**Output:** +``` +[ + {"score": 0.9631525278091431, "label": "pay_bill"}, + {"score": 0.02819698303937912, "label": "freeze"}, + {"score": 0.0032787492964416742, "label": "card_issues"}, + {"score": 0.0019414445850998163, "label": "abroad"}, + {"score": 0.0008378693601116538, "label": "high_value_payment"}, +] +``` + +Great! We've identified that the intent of the call was paying a bill, with probability 96%. You can imagine this kind of +keyword spotting system being used as the first stage of an automated call centre, where we want to categorise incoming +customer calls based on their query and offer them contextualised support accordingly. + +### Speech Commands + +Speech Commands is a dataset of spoken words designed to evaluate audio classification models on simple command words. 
The dataset consists of 15 classes of keywords, a class for silence, and an unknown class to capture false positives.
The 15 keywords are single words that would typically be used in on-device settings to control basic tasks or launch
other processes.

A similar model is running continuously on your mobile phone. Here, instead of having single command words, we have
'wake words' specific to your device, such as "Hey Google" or "Hey Siri". When the audio classification model detects
these wake words, it triggers your phone to start listening to the microphone and transcribe your speech using a speech
recognition model.

The audio classification model is much smaller and lighter than the speech recognition model, often only several million
parameters compared to several hundred million for speech recognition. Thus, it can be run continuously on your device
without draining your battery! Only when the wake word is detected is the larger speech recognition model launched, and
afterwards it is shut down again. We'll cover transformer models for speech recognition in the next Unit, so by the end
of the course you should have the tools you need to build your own voice-activated assistant!

As with any dataset on the Hugging Face Hub, we can get a feel for the kind of audio data it contains without downloading
it or committing it to memory. After heading to the [Speech Commands dataset card](https://huggingface.co/datasets/speech_commands)
on the Hub, we can use the Dataset Viewer to scroll through the first 100 samples of the dataset, listening to the audio
files and checking any other metadata information:

<div class="flex justify-center">
+ Diagram of datasets viewer. +
+ +The Dataset Preview is a brilliant way of experiencing audio datasets before committing to using them. You can pick any +dataset on the Hub, scroll through the samples and listen to the audio for the different subsets and splits, gauging whether +it's the right dataset for your needs. Once you've selected a dataset, it's trivial to load the data so that you can start +using it. + +Let's do exactly that and load a sample of the Speech Commands dataset using streaming mode: + +```python +speech_commands = load_dataset( + "speech_commands", "v0.02", split="validation", streaming=True +) +sample = next(iter(speech_commands)) +``` + +We'll load an official [Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer) +checkpoint fine-tuned on the Speech Commands dataset, under the namespace [`"MIT/ast-finetuned-speech-commands-v2"`](https://huggingface.co/MIT/ast-finetuned-speech-commands-v2): + +```python +classifier = pipeline( + "audio-classification", model="MIT/ast-finetuned-speech-commands-v2" +) +classifier(sample["audio"].copy()) +``` +**Output:** +``` +[{'score': 0.9999892711639404, 'label': 'backward'}, + {'score': 1.7504888774055871e-06, 'label': 'happy'}, + {'score': 6.703040185129794e-07, 'label': 'follow'}, + {'score': 5.805884484288981e-07, 'label': 'stop'}, + {'score': 5.614546694232558e-07, 'label': 'up'}] +``` + +Cool! Looks like the example contains the word "backward" with high probability. We can take a listen to the sample +and verify this is correct: +``` +from IPython.display import Audio + +Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"]) +``` + +Now, you might be wondering how we've selected these pre-trained models to show you in these audio classification examples. +The truth is, finding pre-trained models for your dataset and task is very straightforward! The first thing we need to do +is head to the Hugging Face Hub and click on the "Models" tab: https://huggingface.co/models + +This is going to bring up all the models on the Hugging Face Hub, sorted by downloads in the past 30 days: + +
+ +
+ +You'll notice on the left-hand side that we have a selection of tabs that we can select to filter models by task, library, +dataset, etc. Scroll down and select the task "Audio Classification" from the list of audio tasks: + +
+ +
+ +We're now presented with the sub-set of 500+ audio classification models on the Hub. To further refine this selection, we +can filter models by dataset. Click on the tab "Datasets", and in the search box type "speech_commands". As you begin typing, +you'll see the selection for `speech_commands` appear underneath the search tab. You can click this button to filter all +audio classification models to those fine-tuned on the Speech Commands dataset: + +
+ +
+ +Great! We see that we have 6 pre-trained models available to us for this specific dataset and task. You'll recognise the +first of these models as the Audio Spectrogram Transformer checkpoint that we used in the previous example. This process +of filtering models on the Hub is exactly how we went about selecting the checkpoint to show you! + +## Language Identification + +Language identification (LID) is the task of identifying the language spoken in an audio sample from a list of candidate +languages. LID can form an important part in many speech pipelines. For example, given an audio sample in an unknown language, +an LID model can be used to categorise the language(s) spoken in the audio sample, and then select an appropriate speech +recognition model trained on that language to transcribe the audio. + +### FLEURS + +FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) is a dataset for evaluating speech recognition +systems in 102 languages, including many that are classified as 'low-resource'. Take a look at the FLEURS dataset +card on the Hub and explore the different languages that are present: [google/fleurs](https://huggingface.co/datasets/google/fleurs). +Can you find your native tongue here? If not, what's the most closely related language? + +Let's load up a sample from the validation split of the FLEURS dataset using streaming mode: + +```python +fleurs = load_dataset("google/fleurs", "all", split="validation", streaming=True) +sample = next(iter(fleurs)) +``` + +Great! Now we can load our audio classification model. For this, we'll use a version of [Whisper](https://arxiv.org/pdf/2212.04356.pdf) +fine-tuned on the FLEURS dataset, which is currently the most performant LID model on the Hub: + +```python +classifier = pipeline( + "audio-classification", model="sanchit-gandhi/whisper-medium-fleurs-lang-id" +) +``` + +We can then pass the audio through our classifier and generate a prediction: +```python +classifier(sample["audio"]) +``` +**Output:** +``` +[{'score': 0.9999330043792725, 'label': 'Afrikaans'}, + {'score': 7.093023668858223e-06, 'label': 'Northern-Sotho'}, + {'score': 4.269149485480739e-06, 'label': 'Icelandic'}, + {'score': 3.2661141631251667e-06, 'label': 'Danish'}, + {'score': 3.2580724109720904e-06, 'label': 'Cantonese Chinese'}] +``` + +We can see that the model predicted the audio was in Afrikaans with extremely high probability (near 1). The FLEURS dataset +contains audio data from a wide range of languages - we can see that possible class labels include Northern-Sotho, Icelandic, +Danish and Cantonese Chinese amongst others. You can find the full list of languages on the dataset card here: [google/fleurs](https://huggingface.co/datasets/google/fleurs). + +Over to you! What other checkpoints can you find for FLEURS LID on the Hub? What transformer models are they using under-the-hood? + +## Zero-Shot Audio Classification + +In the traditional paradigm for audio classification, the model predicts a class label from a _pre-defined_ set of +possible classes. This poses a barrier to using pre-trained models for audio classification, since the label set of the +pre-trained model must match that of the downstream task. For the previous example of LID, the model must predict one of +the 102 langauge classes on which it was trained. If the downstream task actually requires 110 languages, the model would +not be able to predict 8 of the 110 languages, and so would require re-training to achieve full coverage. 
This limits the
effectiveness of transfer learning for audio classification tasks.

Zero-shot audio classification is a method for taking a pre-trained audio classification model trained on a set of labelled
examples and enabling it to classify new examples from previously unseen classes. Let's take a look at how we
can achieve this!

Currently, 🤗 Transformers supports one kind of model for zero-shot audio classification: the [CLAP model](https://huggingface.co/docs/transformers/model_doc/clap).
CLAP is a transformer-based model that takes both audio and text as inputs, and computes the _similarity_ between the two.
If we pass a text input that strongly correlates with an audio input, we'll get a high similarity score. Conversely, passing
a text input that is completely unrelated to the audio input will return a low similarity.

We can use this similarity prediction for zero-shot audio classification by passing one audio input to the model and
multiple candidate labels. The model will return a similarity score for each of the candidate labels, and we can pick the
one that has the highest score as our prediction.

Let's take an example where we use one audio input from the [Environmental Sound Classification (ESC)](https://huggingface.co/datasets/ashraq/esc50)
dataset:

```python
dataset = load_dataset("ashraq/esc50", split="train", streaming=True)
audio_sample = next(iter(dataset))["audio"]["array"]
```

We then define our candidate labels, which form the set of possible classification labels. The model will return a
classification probability for each of the labels we define. This means we need to know _a-priori_ the set of possible
labels in our classification problem, such that the correct label is contained within the set and is thus assigned a
valid probability score. Note that we can either pass the full set of labels to the model, or a hand-selected subset
that we believe contains the correct label. Passing the full set of labels is more exhaustive, but comes
at the expense of lower classification accuracy since the classification space is larger; a hand-selected subset gives
higher accuracy, provided the correct label is contained in our chosen subset of labels:

```python
candidate_labels = ["Sound of a dog", "Sound of vacuum cleaner"]
```

We can run both through the model to find the candidate label that is _most similar_ to the audio input:

```python
classifier = pipeline(
    task="zero-shot-audio-classification", model="laion/clap-htsat-unfused"
)
classifier(audio_sample, candidate_labels=candidate_labels)
```
**Output:**
```
[{'score': 0.9997242093086243, 'label': 'Sound of a dog'}, {'score': 0.0002758323971647769, 'label': 'Sound of vacuum cleaner'}]
```

Alright! The model seems pretty confident we have the sound of a dog - it predicts it with 99.97% probability, so we'll
take that as our prediction. Let's confirm whether we were right by listening to the audio sample (don't turn up your
volume too high or else you might get a jump!):

```python
Audio(audio_sample, rate=16000)
```

Perfect! We have the sound of a dog barking 🐕, which aligns with the model's prediction. Have a play with different audio
samples and different candidate labels - can you define a set of labels that give good generalisation across the ESC
dataset? Hint: think about where you could find information on the possible sounds in ESC and construct your labels accordingly!
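
One possible starting point for that hint is sketched below. It assumes the `ashraq/esc50` metadata exposes a `category`
column for each example (worth confirming in the dataset viewer) and builds one text prompt per category seen in a small
streamed slice of the data:

```python
# Hypothetical sketch: collect category names from a slice of the streamed
# dataset and turn each one into a text prompt for the zero-shot classifier
categories = set()
for example in dataset.take(100):
    categories.add(example["category"])

candidate_labels = [
    f"Sound of a {category.replace('_', ' ')}" for category in sorted(categories)
]
classifier(audio_sample, candidate_labels=candidate_labels)
```
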
+ +You might be wondering why we don't use the zero-shot audio classification pipeline for **all** audio classification tasks? +It seems as though we can make predictions for any audio classification problem by defining appropriate class labels _a-priori_, +thus bypassing the constraint that our classification task needs to match the labels that the model was pre-trained on. +This comes down to the nature of the CLAP model used in the zero-shot pipeline: CLAP is pre-trained on _generic_ audio +classification data, similar to the environmental sounds in the ESC dataset, rather than specifically speech data, like +we had in the LID task. If you gave it speech in English and speech in Spanish, CLAP would know that both examples were +speech data 🗣️ But it wouldn't be able to differentiate between the languages in the same way a dedicated LID model is +able to. + +## What next? + +We've covered a number of different audio classification tasks and presented the most relevant datasets and models that +you can download from the Hugging Face Hub and use in just several lines of code using the `pipeline()` class. These tasks +included keyword spotting, language identification and zero-shot audio classification. + +But what if we want to do something **new**? We've worked extensively on speech processing tasks, but this is only one +aspect of audio classification. Another popular field of audio processing involves **music**. While music has inherently +different features to speech, many of the same principles that we've learnt about already can be applied to music. + +In the following section, we'll go through a step-by-step guide on how you can fine-tune a transformer model with 🤗 +Transformers on the task of music classification. By the end of it, you'll have a fine-tuned checkpoint that you can plug +into the `pipeline()` class, enabling you to classify songs in exactly the same way that we've classified speech here! diff --git a/chapters/zh-CN/chapter4/demo.mdx b/chapters/zh-CN/chapter4/demo.mdx new file mode 100644 index 00000000..0510e0d6 --- /dev/null +++ b/chapters/zh-CN/chapter4/demo.mdx @@ -0,0 +1,44 @@ +# Build a demo with Gradio + +In this final section on audio classification, we'll build a [Gradio](https://gradio.app) demo to showcase the music +classification model that we just trained on the [GTZAN](https://huggingface.co/datasets/marsyas/gtzan) dataset. The first +thing to do is load up the fine-tuned checkpoint using the `pipeline()` class - this is very familiar now from the section +on [pre-trained models](classification_models). You can change the `model_id` to the namespace of your fine-tuned model +on the Hugging Face Hub: + +```python +from transformers import pipeline + +model_id = "sanchit-gandhi/distilhubert-finetuned-gtzan" +pipe = pipeline("audio-classification", model=model_id) +``` + +Secondly, we'll define a function that takes the filepath for an audio input and passes it through the pipeline. Here, +the pipeline automatically takes care of loading the audio file, resampling it to the correct sampling rate, and running +inference with the model. 
We take the models predictions of `preds` and format them as a dictionary object to be displayed on the +output: + +```python +def classify_audio(filepath): + preds = pipe(filepath) + outputs = {} + for p in preds: + outputs[p["label"]] = p["score"] + return outputs +``` + +Finally, we launch the Gradio demo using the function we've just defined: + +```python +import gradio as gr + +demo = gr.Interface( + fn=classify_audio, inputs=gr.Audio(type="filepath"), outputs=gr.outputs.Label() +) +demo.launch(debug=True) +``` + +This will launch a Gradio demo similar to the one running on the Hugging Face Space: + + + diff --git a/chapters/zh-CN/chapter4/fine-tuning.mdx b/chapters/zh-CN/chapter4/fine-tuning.mdx new file mode 100644 index 00000000..496e4c7e --- /dev/null +++ b/chapters/zh-CN/chapter4/fine-tuning.mdx @@ -0,0 +1,579 @@ +# Fine-tuning a model for music classification + +In this section, we'll present a step-by-step guide on fine-tuning an encoder-only transformer model for music classification. +We'll use a lightweight model for this demonstration and fairly small dataset, meaning the code is runnable end-to-end +on any consumer grade GPU, including the T4 16GB GPU provided in the Google Colab free tier. The section includes various +tips that you can try should you have a smaller GPU and encounter memory issues along the way. + + +## The Dataset + +To train our model, we'll use the [GTZAN](https://huggingface.co/datasets/marsyas/gtzan) dataset, which is a popular +dataset of 1,000 songs for music genre classification. Each song is a 30-second clip from one of 10 genres of music, +spanning disco to metal. We can get the audio files and their corresponding labels from the Hugging Face Hub with the +`load_dataset()` function from 🤗 Datasets: + +```python +from datasets import load_dataset + +gtzan = load_dataset("marsyas/gtzan", "all") +gtzan +``` + +**Output:** +```out +Dataset({ + features: ['file', 'audio', 'genre'], + num_rows: 999 +}) +``` + + + +One of the recordings in GTZAN is corrupted, so it's been removed from the dataset. That's why we have 999 examples +instead of 1,000. + + + + +GTZAN doesn't provide a predefined validation set, so we'll have to create one ourselves. The dataset is balanced across +genres, so we can use the `train_test_split()` method to quickly create a 90/10 split as follows: + +```python +gtzan = gtzan["train"].train_test_split(seed=42, shuffle=True, test_size=0.1) +gtzan +``` + +**Output:** +```out +DatasetDict({ + train: Dataset({ + features: ['file', 'audio', 'genre'], + num_rows: 899 + }) + test: Dataset({ + features: ['file', 'audio', 'genre'], + num_rows: 100 + }) +}) +``` + +Great, now that we've got our training and validation sets, let's take a look at one of the audio files: + +```python +gtzan["train"][0] +``` + +**Output:** +```out +{ + "file": "~/.cache/huggingface/datasets/downloads/extracted/fa06ce46130d3467683100aca945d6deafb642315765a784456e1d81c94715a8/genres/pop/pop.00098.wav", + "audio": { + "path": "~/.cache/huggingface/datasets/downloads/extracted/fa06ce46130d3467683100aca945d6deafb642315765a784456e1d81c94715a8/genres/pop/pop.00098.wav", + "array": array( + [ + 0.10720825, + 0.16122437, + 0.28585815, + ..., + -0.22924805, + -0.20629883, + -0.11334229, + ], + dtype=float32, + ), + "sampling_rate": 22050, + }, + "genre": 7, +} +``` + +As we saw in [Unit 1](../chapter1/audio_data), the audio files are represented as 1-dimensional NumPy arrays, +where the value of the array represents the amplitude at that timestep. 
For these songs, the sampling rate is 22,050 Hz, +meaning there are 22,050 amplitude values sampled per second. We'll have to keep this in mind when using a pretrained model +with a different sampling rate, converting the sampling rates ourselves to ensure they match. We can also see the genre +is represented as an integer, or _class label_, which is the format the model will make it's predictions in. Let's use the +`int2str()` method of the `genre` feature to map these integers to human-readable names: + +```python +id2label_fn = gtzan["train"].features["genre"].int2str +id2label_fn(gtzan["train"][0]["genre"]) +``` + +**Output:** +```out +'pop' +``` + +This label looks correct, since it matches the filename of the audio file. Let's now listen to a few more examples by +using Gradio to create a simple interface with the `Blocks` API: + +```python +import gradio as gr + + +def generate_audio(): + example = gtzan["train"].shuffle()[0] + audio = example["audio"] + return ( + audio["sampling_rate"], + audio["array"], + ), id2label_fn(example["genre"]) + + +with gr.Blocks() as demo: + with gr.Column(): + for _ in range(4): + audio, label = generate_audio() + output = gr.Audio(audio, label=label) + +demo.launch(debug=True) +``` + + + +From these samples we can certainly hear the difference between genres, but can a transformer do this too? Let's train a +model to find out! First, we'll need to find a suitable pretrained model for this task. Let's see how we can do that. + +## Picking a pretrained model for audio classification + +To get started, let's pick a suitable pretrained model for audio classification. In this domain, pretraining is typically +carried out on large amounts of unlabeled audio data, using datasets like [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) +and [Voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli). The best way to find these models on the Hugging +Face Hub is to use the "Audio Classification" filter, as described in the previous section. Although models like Wav2Vec2 and +HuBERT are very popular, we'll use a model called _DistilHuBERT_. This is a much smaller (or _distilled_) version of the [HuBERT](https://huggingface.co/docs/transformers/model_doc/hubert) +model, which trains around 73% faster, yet preserves most of the performance. + + + +## From audio to machine learning features + +## Preprocessing the data + +Similar to tokenization in NLP, audio and speech models require the input to be encoded in a format that the model +can process. In 🤗 Transformers, the conversion from audio to the input format is handled by the _feature extractor_ of +the model. Similar to tokenizers, 🤗 Transformers provides a convenient `AutoFeatureExtractor` class that can automatically +select the correct feature extractor for a given model. To see how we can process our audio files, let's begin by instantiating +the feature extractor for DistilHuBERT from the pre-trained checkpoint: + +```python +from transformers import AutoFeatureExtractor + +model_id = "ntu-spml/distilhubert" +feature_extractor = AutoFeatureExtractor.from_pretrained( + model_id, do_normalize=True, return_attention_mask=True +) +``` + +Since the sampling rate of the model and the dataset are different, we'll have to resample the audio file to 16,000 +Hz before passing it to the feature extractor. 
We can do this by first obtaining the model's sample rate from the feature +extractor: + +```python +sampling_rate = feature_extractor.sampling_rate +sampling_rate +``` + +**Output:** +```out +16000 +``` + +Next, we resample the dataset using the `cast_column()` method and `Audio` feature from 🤗 Datasets: + +```python +from datasets import Audio + +gtzan = gtzan.cast_column("audio", Audio(sampling_rate=sampling_rate)) +``` + +We can now check the first sample of the train-split of our dataset to verify that it is indeed at 16,000 Hz. 🤗 Datasets +will resample the audio file _on-the-fly_ when we load each audio sample: + +```python +gtzan["train"][0] +``` + +**Output:** +```out +{ + "file": "~/.cache/huggingface/datasets/downloads/extracted/fa06ce46130d3467683100aca945d6deafb642315765a784456e1d81c94715a8/genres/pop/pop.00098.wav", + "audio": { + "path": "~/.cache/huggingface/datasets/downloads/extracted/fa06ce46130d3467683100aca945d6deafb642315765a784456e1d81c94715a8/genres/pop/pop.00098.wav", + "array": array( + [ + 0.0873509, + 0.20183384, + 0.4790867, + ..., + -0.18743178, + -0.23294401, + -0.13517427, + ], + dtype=float32, + ), + "sampling_rate": 16000, + }, + "genre": 7, +} +``` + +Great! We can see that the sampling rate has been downsampled to 16kHz. The array values are also different, as we've +now only got approximately one amplitude value for every 1.5 that we had before. + +A defining feature of Wav2Vec2 and HuBERT like models is that they accept a float array corresponding to the raw waveform +of the speech signal as an input. This is in contrast to other models, like Whisper, where we pre-process the raw audio waveform +to spectrogram format. + +We mentioned that the audio data is represented as a 1-dimensional array, so it's already in the right format to be read +by the model (a set of continuous inputs at discrete time steps). So, what exactly does the feature extractor do? + +Well, the audio data is in the right format, but we've imposed no restrictions on the values it can take. For our model to +work optimally, we want to keep all the inputs within the same dynamic range. This is going to make sure we get a similar +range of activations and gradients for our samples, helping with stability and convergence during training. + +To do this, we _normalise_ our audio data, by rescaling each sample to zero mean and unit variance, a process called +_feature scaling_. It's exactly this feature normalisation that our feature extractor performs! + +We can take a look at the feature extractor in operation by applying it to our first audio sample. First, let's compute +the mean and variance of our raw audio data: + +```python +import numpy as np + +sample = gtzan["train"][0]["audio"] + +print(f"Mean: {np.mean(sample['array']):.3}, Variance: {np.var(sample['array']):.3}") +``` + +**Output:** +```out +Mean: 0.000185, Variance: 0.0493 +``` + +We can see that the mean is close to zero already, but the variance is closer to 0.05. If the variance for the sample was +larger, it could cause our model problems, since the dynamic range of the audio data would be very small and thus difficult to +separate. 
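
For intuition, the rescaling described above is just a per-example z-score. A rough sketch of that step on its own looks
like this (the real feature extractor also handles padding, truncation and the attention mask for us, and the small
constant here is only for numerical stability):

```python
# Rescale the raw waveform to zero mean and unit variance by hand
array = sample["array"]
rescaled = (array - np.mean(array)) / np.sqrt(np.var(array) + 1e-7)

print(f"Mean: {np.mean(rescaled):.3}, Variance: {np.var(rescaled):.3}")
```
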
Let's apply the feature extractor and see what the outputs look like: + +```python +inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"]) + +print(f"inputs keys: {list(inputs.keys())}") + +print( + f"Mean: {np.mean(inputs['input_values']):.3}, Variance: {np.var(inputs['input_values']):.3}" +) +``` + +**Output:** +```out +inputs keys: ['input_values', 'attention_mask'] +Mean: -4.53e-09, Variance: 1.0 +``` + +Alright! Our feature extractor returns a dictionary of two arrays: `input_values` and `attention_mask`. The `input_values` +are the preprocessed audio inputs that we'd pass to the HuBERT model. The [`attention_mask`](https://huggingface.co/docs/transformers/glossary#attention-mask) +is used when we process a _batch_ of audio inputs at once - it is used to tell the model where we have padded inputs of +different lengths. + +We can see that the mean value is now very much closer to zero, and the variance bang-on one! This is exactly the form we +want our audio samples in prior to feeding them to the HuBERT model. + + + +Note how we've passed the sampling rate of our audio data to our feature extractor. This is good practice, as the feature +extractor performs a check under-the-hood to make sure the sampling rate of our audio data matches the sampling rate +expected by the model. If the sampling rate of our audio data did not match the sampling rate of our model, we'd need to +up-sample or down-sample the audio data to the correct sampling rate. + + + +Great, so now we know how to process our resampled audio files, the last thing to do is define a function that we can +apply to all the examples in the dataset. Since we expect the audio clips to be 30 seconds in length, we'll also +truncate any longer clips by using the `max_length` and `truncation` arguments of the feature extractor as follows: + + +```python +max_duration = 30.0 + + +def preprocess_function(examples): + audio_arrays = [x["array"] for x in examples["audio"]] + inputs = feature_extractor( + audio_arrays, + sampling_rate=feature_extractor.sampling_rate, + max_length=int(feature_extractor.sampling_rate * max_duration), + truncation=True, + return_attention_mask=True, + ) + return inputs +``` + +With this function defined, we can now apply it to the dataset using the [`map()`](https://huggingface.co/docs/datasets/v2.14.0/en/package_reference/main_classes#datasets.Dataset.map) +method. The `.map()` method supports working with batches of examples, which we'll enable by setting `batched=True`. +The default batch size is 1000, but we'll reduce it to 100 to ensure the peak RAM stays within a sensible range for +Google Colab's free tier: + + + +```python +gtzan_encoded = gtzan.map( + preprocess_function, + remove_columns=["audio", "file"], + batched=True, + batch_size=100, + num_proc=1, +) +gtzan_encoded +``` + +**Output:** +```out +DatasetDict({ + train: Dataset({ + features: ['genre', 'input_values','attention_mask'], + num_rows: 899 + }) + test: Dataset({ + features: ['genre', 'input_values','attention_mask'], + num_rows: 100 + }) +}) +``` + + + If you exhaust your device's RAM executing the above code, you can adjust the batch parameters to reduce the peak + RAM usage. In particular, the following two arguments can be modified: + * `batch_size`: defaults to 1000, but set to 100 above. Try reducing by a factor of 2 again to 50 + * `writer_batch_size`: defaults to 1000. 
Try reducing it to 500, and if that doesn't work, then reduce it by a factor of 2 again to 250 + + + +To simplify the training, we've removed the `audio` and `file` columns from the dataset. The `input_values` column contains +the encoded audio files, the `attention_mask` a binary mask of 0/1 values that indicate where we have padded the audio input, +and the `genre` column contains the corresponding labels (or targets). To enable the `Trainer` to process the class labels, +we need to rename the `genre` column to `label`: + +```python +gtzan_encoded = gtzan_encoded.rename_column("genre", "label") +``` + +Finally, we need to obtain the label mappings from the dataset. This mapping will take us from integer ids (e.g. `7`) to +human-readable class labels (e.g. `"pop"`) and back again. In doing so, we can convert our model's integer id prediction +into human-readable format, enabling us to use the model in any downstream application. We can do this by using the `int2str()` +method as follows: + +```python +id2label = { + str(i): id2label_fn(i) + for i in range(len(gtzan_encoded["train"].features["label"].names)) +} +label2id = {v: k for k, v in id2label.items()} + +id2label["7"] +``` + +```out +'pop' +``` + +OK, we've now got a dataset that's ready for training! Let's take a look at how we can train a model on this dataset. + + +## Fine-tuning the model + +To fine-tune the model, we'll use the `Trainer` class from 🤗 Transformers. As we've seen in other chapters, the `Trainer` +is a high-level API that is designed to handle the most common training scenarios. In this case, we'll use the `Trainer` +to fine-tune the model on GTZAN. To do this, we'll first need to load a model for this task. We can do this by using the +`AutoModelForAudioClassification` class, which will automatically add the appropriate classification head to our pretrained +DistilHuBERT model. Let's go ahead and instantiate the model: + +```python +from transformers import AutoModelForAudioClassification + +num_labels = len(id2label) + +model = AutoModelForAudioClassification.from_pretrained( + model_id, + num_labels=num_labels, + label2id=label2id, + id2label=id2label, +) +``` + +We strongly advise you to upload model checkpoints directly the [Hugging Face Hub](https://huggingface.co/) while training. +The Hub provides: +- Integrated version control: you can be sure that no model checkpoint is lost during training. +- Tensorboard logs: track important metrics over the course of training. +- Model cards: document what a model does and its intended use cases. +- Community: an easy way to share and collaborate with the community! 🤗 + +Linking the notebook to the Hub is straightforward - it simply requires entering your Hub authentication token when prompted. 
+Find your Hub authentication token [here](https://huggingface.co/settings/tokens): + +```python +from huggingface_hub import notebook_login + +notebook_login() +``` + +**Output:** +```bash +Login successful +Your token has been saved to /root/.huggingface/token +``` + +The next step is to define the training arguments, including the batch size, gradient accumulation steps, number of +training epochs and learning rate: + +```python +from transformers import TrainingArguments + +model_name = model_id.split("/")[-1] +batch_size = 8 +gradient_accumulation_steps = 1 +num_train_epochs = 10 + +training_args = TrainingArguments( + f"{model_name}-finetuned-gtzan", + evaluation_strategy="epoch", + save_strategy="epoch", + learning_rate=5e-5, + per_device_train_batch_size=batch_size, + gradient_accumulation_steps=gradient_accumulation_steps, + per_device_eval_batch_size=batch_size, + num_train_epochs=num_train_epochs, + warmup_ratio=0.1, + logging_steps=5, + load_best_model_at_end=True, + metric_for_best_model="accuracy", + fp16=True, + push_to_hub=True, +) +``` + + + + Here we have set `push_to_hub=True` to enable automatic upload of our fine-tuned checkpoints during training. Should you + not wish for your checkpoints to be uploaded to the Hub, you can set this to `False`. + + + +The last thing we need to do is define the metrics. Since the dataset is balanced, we'll use accuracy as our metric and +load it using the 🤗 Evaluate library: + +```python +import evaluate +import numpy as np + +metric = evaluate.load("accuracy") + + +def compute_metrics(eval_pred): + """Computes accuracy on a batch of predictions""" + predictions = np.argmax(eval_pred.predictions, axis=1) + return metric.compute(predictions=predictions, references=eval_pred.label_ids) +``` + +We've now got all the pieces! Let's instantiate the `Trainer` and train the model: + +```python +from transformers import Trainer + +trainer = Trainer( + model, + training_args, + train_dataset=gtzan_encoded["train"], + eval_dataset=gtzan_encoded["test"], + tokenizer=feature_extractor, + compute_metrics=compute_metrics, +) + +trainer.train() +``` + + + +Depending on your GPU, it is possible that you will encounter a CUDA `"out-of-memory"` error when you start training. +In this case, you can reduce the `batch_size` incrementally by factors of 2 and employ [`gradient_accumulation_steps`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.gradient_accumulation_steps) +to compensate. + + + +**Output:** +```out +| Training Loss | Epoch | Step | Validation Loss | Accuracy | +|:-------------:|:-----:|:----:|:---------------:|:--------:| +| 1.7297 | 1.0 | 113 | 1.8011 | 0.44 | +| 1.24 | 2.0 | 226 | 1.3045 | 0.64 | +| 0.9805 | 3.0 | 339 | 0.9888 | 0.7 | +| 0.6853 | 4.0 | 452 | 0.7508 | 0.79 | +| 0.4502 | 5.0 | 565 | 0.6224 | 0.81 | +| 0.3015 | 6.0 | 678 | 0.5411 | 0.83 | +| 0.2244 | 7.0 | 791 | 0.6293 | 0.78 | +| 0.3108 | 8.0 | 904 | 0.5857 | 0.81 | +| 0.1644 | 9.0 | 1017 | 0.5355 | 0.83 | +| 0.1198 | 10.0 | 1130 | 0.5716 | 0.82 | +``` + +Training will take approximately 1 hour depending on your GPU or the one allocated to the Google Colab. Our best +evaluation accuracy is 83% - not bad for just 10 epochs with 899 examples of training data! We could certainly improve +upon this result by training for more epochs, using regularisation techniques such as _dropout_, or sub-diving each +audio example from 30s into 15s segments to use a more efficient data pre-processing strategy. 
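
If you'd like to experiment with that last idea - sub-dividing each clip into shorter segments - one possible (untested)
starting point is sketched below. It keeps only full 15-second windows (the first and last of each clip), re-expands the
`genre` labels to match the new rows, and you would still need to rename `genre` to `label` afterwards, exactly as before:

```python
segment_duration = 15.0
segment_length = int(feature_extractor.sampling_rate * segment_duration)


def preprocess_segments(examples):
    audio_arrays, labels = [], []
    for audio, label in zip(examples["audio"], examples["genre"]):
        array = audio["array"]
        if len(array) < segment_length:
            continue  # skip clips shorter than a single segment
        # take the first and last 15 seconds of each clip (they overlap slightly
        # for clips that are just under 30 seconds long)
        for segment in (array[:segment_length], array[-segment_length:]):
            audio_arrays.append(segment)
            labels.append(label)
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=segment_length,
        truncation=True,
        return_attention_mask=True,
    )
    inputs["genre"] = labels  # keep the labels aligned with the new rows
    return inputs


gtzan_segmented = gtzan.map(
    preprocess_segments,
    remove_columns=["audio", "file"],
    batched=True,
    batch_size=100,
)
```
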
+ +The big question is how this compares to other music classification systems 🤔 +For that, we can view the [autoevaluate leaderboard](https://huggingface.co/spaces/autoevaluate/leaderboards?dataset=marsyas%2Fgtzan&only_verified=0&task=audio-classification&config=all&split=train&metric=accuracy), +a leaderboard that categorises models by language and dataset, and subsequently ranks them according to their accuracy. + +We can automatically submit our checkpoint to the leaderboard when we push the training results to the Hub - we simply have +to set the appropriate key-word arguments (kwargs). You can change these values to match your dataset, language and model name +accordingly: + +```python +kwargs = { + "dataset_tags": "marsyas/gtzan", + "dataset": "GTZAN", + "model_name": f"{model_name}-finetuned-gtzan", + "finetuned_from": model_id, + "tasks": "audio-classification", +} +``` + +The training results can now be uploaded to the Hub. To do so, execute the `.push_to_hub` command: + +```python +trainer.push_to_hub(**kwargs) +``` + +This will save the training logs and model weights under `"your-username/distilhubert-finetuned-gtzan"`. For this example, +check out the upload at [`"sanchit-gandhi/distilhubert-finetuned-gtzan"`](https://huggingface.co/sanchit-gandhi/distilhubert-finetuned-gtzan). + +## Share Model + +You can now share this model with anyone using the link on the Hub. They can load it with the identifier `"your-username/distilhubert-finetuned-gtzan"` +directly into the `pipeline()` class. For instance, to load the fine-tuned checkpoint [`"sanchit-gandhi/distilhubert-finetuned-gtzan"`](https://huggingface.co/sanchit-gandhi/distilhubert-finetuned-gtzan): + +```python +from transformers import pipeline + +pipe = pipeline( + "audio-classification", model="sanchit-gandhi/distilhubert-finetuned-gtzan" +) +``` + +## Conclusion + +In this section, we've covered a step-by-step guide for fine-tuning the DistilHuBERT model for music classification. While +we focussed on the task of music classification and the GTZAN dataset, the steps presented here apply more generally to any +audio classification task - the same script can be used for spoken language audio classification tasks like keyword spotting +or language identification. You just need to swap out the dataset for one that corresponds to your task of interest! If +you're interested in fine-tuning other Hugging Face Hub models for audio classification, we encourage you to check out the +other [examples](https://github.com/huggingface/transformers/tree/main/examples/pytorch/audio-classification) in the 🤗 +Transformers repository. + +In the next section, we'll take the model that you just fine-tuned and build a music classification demo that you can share +on the Hugging Face Hub. diff --git a/chapters/zh-CN/chapter4/hands_on.mdx b/chapters/zh-CN/chapter4/hands_on.mdx new file mode 100644 index 00000000..5e349ec6 --- /dev/null +++ b/chapters/zh-CN/chapter4/hands_on.mdx @@ -0,0 +1,38 @@ +# Hands-on exercise + +It's time to get your hands on some Audio models and apply what you have learned so far. +This exercise is one of the four hands-on exercises required to qualify for a course completion certificate. + +Here are the instructions. +In this unit, we demonstrated how to fine-tune a Hubert model on `marsyas/gtzan` dataset for music classification. Our example achieved 83% accuracy. +Your task is to improve upon this accuracy metric. 
+ +Feel free to choose any model on the [🤗 Hub](https://huggingface.co/models) that you think is suitable for audio classification, +and use the exact same dataset [`marsyas/gtzan`](https://huggingface.co/datasets/marsyas/gtzan) to build your own classifier. + +Your goal is to achieve 87% accuracy on this dataset with your classifier. You can choose the exact same model, and play with the training hyperparameters, +or pick an entirely different model - it's up to you! + +For your result to count towards your certificate, don't forget to push your model to Hub as was shown in this unit with +the following `**kwargs` at the end of the training: + +```python +kwargs = { + "dataset_tags": "marsyas/gtzan", + "dataset": "GTZAN", + "model_name": f"{model_name}-finetuned-gtzan", + "finetuned_from": model_id, + "tasks": "audio-classification", +} + +trainer.push_to_hub(**kwargs) +``` + +Here are some additional resources that you may find helpful when working on this exercise: +* [Audio classification task guide in Transformers documentation](https://huggingface.co/docs/transformers/tasks/audio_classification) +* [Hubert model documentation](https://huggingface.co/docs/transformers/model_doc/hubert) +* [M-CTC-T model documentation](https://huggingface.co/docs/transformers/model_doc/mctct) +* [Audio Spectrogram Transformer documentation](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer) +* [Wav2Vec2 documentation](https://huggingface.co/docs/transformers/model_doc/wav2vec2) + +Feel free to build a demo of your model, and share it on Discord! If you have questions, post them in the #audio-study-group channel. diff --git a/chapters/zh-CN/chapter4/introduction.mdx b/chapters/zh-CN/chapter4/introduction.mdx new file mode 100644 index 00000000..5735cb37 --- /dev/null +++ b/chapters/zh-CN/chapter4/introduction.mdx @@ -0,0 +1,22 @@ +# Unit 4. Build a music genre classifier + +## What you'll learn and what you'll build + +Audio classification is one of the most common applications of transformers in audio and speech processing. Like other +classification tasks in machine learning, this task involves assigning one or more labels to an audio recording based on +its content. For example, in the case of speech, we might want to detect when wake words like "Hey Siri" are spoken, or +infer a key word like "temperature" from a spoken query like "What is the weather today?". Environmental sounds +provide another example, where we might want to automatically distinguish between sounds such as "car horn", "siren", +"dog barking", etc. + +In this section, we'll look at how pre-trained audio transformers can be applied to a range of audio classification tasks. +We'll then fine-tune a transformer model on the task of music classification, classifying songs into genres like "pop" and +"rock". This is an important part of music streaming platforms like [Spotify](https://en.wikipedia.org/wiki/Spotify), which +recommend songs that are similar to the ones the user is listening to. 
+ +By the end of this section, you'll know how to: + +* Find suitable pre-trained models for audio classification tasks +* Use the 🤗 Datasets library and the Hugging Face Hub to select audio classification datasets +* Fine-tune a pretrained model to classify songs by genre +* Build a Gradio demo that lets you classify your own songs From fa4ee3a5a1d081c3d1b74d7ac65a741e77ba3d0d Mon Sep 17 00:00:00 2001 From: zefang-liu Date: Tue, 29 Jul 2025 16:36:26 -0700 Subject: [PATCH 2/7] Translate the zh-cn chapter 4 introduction --- chapters/zh-CN/chapter4/introduction.mdx | 26 ++++++++---------------- 1 file changed, 9 insertions(+), 17 deletions(-) diff --git a/chapters/zh-CN/chapter4/introduction.mdx b/chapters/zh-CN/chapter4/introduction.mdx index 5735cb37..a8a8c126 100644 --- a/chapters/zh-CN/chapter4/introduction.mdx +++ b/chapters/zh-CN/chapter4/introduction.mdx @@ -1,22 +1,14 @@ -# Unit 4. Build a music genre classifier +# 第4单元:构建音乐风格分类器 -## What you'll learn and what you'll build +## 学习目标与项目构建 -Audio classification is one of the most common applications of transformers in audio and speech processing. Like other -classification tasks in machine learning, this task involves assigning one or more labels to an audio recording based on -its content. For example, in the case of speech, we might want to detect when wake words like "Hey Siri" are spoken, or -infer a key word like "temperature" from a spoken query like "What is the weather today?". Environmental sounds -provide another example, where we might want to automatically distinguish between sounds such as "car horn", "siren", -"dog barking", etc. +音频分类是Transformer在音频和语音处理中的常见应用之一。与机器学习中的其他分类任务类似,该任务的目标是根据音频内容为其分配一个或多个标签。例如,在语音场景中,我们可能希望检测唤醒词(如“Hey Siri”)何时被说出,或从诸如“今天天气怎么样?”的语句中识别出关键词“天气”。在环境声音的应用中,我们则可能希望自动区分诸如“汽车喇叭”、“警报声”或“狗叫声”等不同的声音类型。 -In this section, we'll look at how pre-trained audio transformers can be applied to a range of audio classification tasks. -We'll then fine-tune a transformer model on the task of music classification, classifying songs into genres like "pop" and -"rock". This is an important part of music streaming platforms like [Spotify](https://en.wikipedia.org/wiki/Spotify), which -recommend songs that are similar to the ones the user is listening to. 
+在本节中,我们将了解如何使用预训练的音频Transformer来处理各种音频分类任务。随后,我们将把模型微调到音乐分类任务上,将歌曲分类为“流行”、“摇滚”等不同的音乐风格。这类功能在音乐流媒体平台(如[Spotify](https://en.wikipedia.org/wiki/Spotify))中十分关键,可用于推荐与用户当前正在听的歌曲相似的内容。 -By the end of this section, you'll know how to: +完成本节后,你将掌握以下内容: -* Find suitable pre-trained models for audio classification tasks -* Use the 🤗 Datasets library and the Hugging Face Hub to select audio classification datasets -* Fine-tune a pretrained model to classify songs by genre -* Build a Gradio demo that lets you classify your own songs +* 如何寻找适用于音频分类任务的预训练模型 +* 如何使用🤗 Datasets库和Hugging Face Hub获取音频分类数据集 +* 如何微调预训练模型,实现按音乐风格对歌曲进行分类 +* 如何使用Gradio构建一个可以对你上传的歌曲进行分类的应用demo From a7ac1b11669b2b9160830baa7ad490f51f6e2f2c Mon Sep 17 00:00:00 2001 From: zefang-liu Date: Tue, 29 Jul 2025 17:25:35 -0700 Subject: [PATCH 3/7] Translate the zh-cn chapter 4 classification models --- .../zh-CN/chapter4/classification_models.mdx | 202 +++++------------- 1 file changed, 53 insertions(+), 149 deletions(-) diff --git a/chapters/zh-CN/chapter4/classification_models.mdx b/chapters/zh-CN/chapter4/classification_models.mdx index 782b8a98..1bcbd3a6 100644 --- a/chapters/zh-CN/chapter4/classification_models.mdx +++ b/chapters/zh-CN/chapter4/classification_models.mdx @@ -1,49 +1,28 @@ -# Pre-trained models and datasets for audio classification +# 音频分类的预训练模型 -The Hugging Face Hub is home to over 500 pre-trained models for audio classification. In this section, we'll go through -some of the most common audio classification tasks and suggest appropriate pre-trained models for each. Using the `pipeline()` -class, switching between models and tasks is straightforward - once you know how to use `pipeline()` for one model, you'll -be able to use it for any model on the Hub no code changes! This makes experimenting with the `pipeline()` class extremely -fast, allowing you to quickly select the best pre-trained model for your needs. +Hugging Face Hub上托管着超过500个用于音频分类的预训练模型。在本节中,我们将介绍几种常见的音频分类任务,并为每种任务推荐合适的预训练模型。借助`pipeline()`类,切换不同的模型和任务变得非常简单。一旦掌握了如何为一个模型使用`pipeline()`,你就可以在Hub上对任意模型复用,无需修改代码!这让基于`pipeline()`的实验流程变得非常高效,使你能够快速选出最适合自己需求的预训练模型。 -Before we jump into the various audio classification problems, let's quickly recap the transformer architectures typically -used. The standard audio classification architecture is motivated by the nature of the task; we want to transform a sequence -of audio inputs (i.e. our input audio array) into a single class label prediction. Encoder-only models first map the input -audio sequence into a sequence of hidden-state representations by passing the inputs through a transformer block. The -sequence of hidden-state representations is then mapped to a class label output by taking the mean over the hidden-states, -and passing the resulting vector through a linear classification layer. Hence, there is a preference for _encoder-only_ -models for audio classification. +在开始讲解各类音频分类任务之前,我们先快速回顾一下在音频分类中常用的Transformer架构。标准的音频分类架构是基于任务本质设计的:我们希望将一段音频输入序列(即音频数组)转换为一个类别标签。编码器模型(encoder-only)首先将音频输入序列通过Transformer模块映射为一系列隐藏状态表示,然后对这些隐藏状态取平均,并将结果输入线性分类层,最终输出一个类别标签。因此,音频分类任务通常偏好使用**仅包含编码器的模型**。 -Decoder-only models introduce unnecessary complexity to the task, since they assume that the outputs can also be a _sequence_ -of predictions (rather than a single class label prediction), and so generate multiple outputs. Therefore, they have slower -inference speed and tend not to be used. Encoder-decoder models are largely omitted for the same reason. 
These architecture -choices are analogous to those in NLP, where encoder-only models such as [BERT](https://huggingface.co/blog/bert-101) -are favoured for sequence classification tasks, and decoder-only models such as GPT reserved for sequence generation tasks. +仅含解码器的模型(decoder-only)会为任务引入不必要的复杂度,因为它们假设输出是一个**序列**(而不仅是单一类别),因此会生成多个输出。这导致推理速度变慢,不适合用在音频分类中。同样地,编码器-解码器模型(encoder-decoder)也因为类似原因很少使用。这种架构选择与NLP中的任务是类似的:像[BERT](https://huggingface.co/blog/bert-101)这样的编码器模型适用于序列分类任务,而像GPT这样的解码器模型则用于文本生成任务。 -Now that we've recapped the standard transformer architecture for audio classification, let's jump into the different -subsets of audio classification and cover the most popular models! +现在我们已经回顾了音频分类中常用的Transformer架构,让我们开始介绍音频分类的几种子任务,以及最受欢迎的模型! -## 🤗 Transformers Installation +## 🤗Transformers安装说明 -At the time of writing, the latest updates required for audio classification pipeline are only on the `main` version of -the 🤗 Transformers repository, rather than the latest PyPi version. To make sure we have these updates locally, we'll -install Transformers from the `main` branch with the following command: +截至目前,音频分类所需的一些最新更新仅包含在🤗Transformers仓库的`main`分支中,还未发布到PyPi。为了确保本地环境具备这些更新,我们需要通过以下命令从`main`分支安装Transformers: ``` pip install git+https://github.com/huggingface/transformers ``` -## Keyword Spotting +## 关键词识别 -Keyword spotting (KWS) is the task of identifying a keyword in a spoken utterance. The set of possible keywords forms the -set of predicted class labels. Hence, to use a pre-trained keyword spotting model, you should ensure that your keywords -match those that the model was pre-trained on. Below, we'll introduce two datasets and models for keyword spotting. +关键词识别(Keyword Spotting,KWS)是指从语音中识别出特定关键词。可识别的关键词集合即为分类任务的标签集合。因此,在使用预训练的关键词识别模型时,你应确保目标关键词与模型的预训练标签匹配。下面我们将介绍两个关键词识别的数据集和模型。 ### Minds-14 -Let's go ahead and use the same [MINDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset that you have explored -in the previous unit. If you recall, MINDS-14 contains recordings of people asking an e-banking system questions in several -languages and dialects, and has the `intent_class` for each recording. We can classify the recordings by intent of the call. +我们将使用上一单元中提到的[MINDS-14](https://huggingface.co/datasets/PolyAI/minds14)数据集。如果你还记得,MINDS-14包含用户向电子银行系统提问的语音数据,涵盖多种语言和方言,并为每条录音提供了`intent_class`标签,用于标注通话意图。 ```python from datasets import load_dataset @@ -51,9 +30,7 @@ from datasets import load_dataset minds = load_dataset("PolyAI/minds14", name="en-AU", split="train") ``` -We'll load the checkpoint [`"anton-l/xtreme_s_xlsr_300m_minds14"`](https://huggingface.co/anton-l/xtreme_s_xlsr_300m_minds14), -which is an XLS-R model fine-tuned on MINDS-14 for approximately 50 epochs. It achieves 90% accuracy over all languages -from MINDS-14 on the evaluation set. +我们将加载模型[`anton-l/xtreme_s_xlsr_300m_minds14`](https://huggingface.co/anton-l/xtreme_s_xlsr_300m_minds14)。这是一个基于MINDS-14微调约50轮的XLS-R模型,在所有语言上的评估准确率约为90%。 ```python from transformers import pipeline @@ -64,11 +41,11 @@ classifier = pipeline( ) ``` -Finally, we can pass a sample to the classification pipeline to make a prediction: +最后,我们将一条样本输入到分类pipeline中进行预测: ```python classifier(minds[0]["audio"]) ``` -**Output:** +**输出:** ``` [ {"score": 0.9631525278091431, "label": "pay_bill"}, @@ -79,43 +56,25 @@ classifier(minds[0]["audio"]) ] ``` -Great! We've identified that the intent of the call was paying a bill, with probability 96%. 
You can imagine this kind of
-keyword spotting system being used as the first stage of an automated call centre, where we want to categorise incoming
-customer calls based on their query and offer them contextualised support accordingly.
+太棒了!我们识别出该通话的意图是“支付账单”,预测概率为96%。你可以想象,这种关键词识别系统可以作为自动化呼叫中心的第一道流程,根据用户的请求对通话进行分类,并提供相应的上下文支持。
 
 ### Speech Commands
 
-Speech Commands is a dataset of spoken words designed to evaluate audio classification models on simple command words.
-The dataset consists of 15 classes of keywords, a class for silence, and an unknown class to include the false positive.
-The 15 keywords are single words that would typically be used in on-device settings to control basic tasks or launch
-other processes.
+Speech Commands是一个口语单词数据集,用于评估音频分类模型在简单命令词上的表现。数据集包含15个关键词类别、1个静音类别,以及1个用于涵盖误报(false positive)的未知类别。这15个关键词都是单个词语,通常用于端侧设备上,以控制基础任务或启动其他进程。
 
-A similar model is running continuously on your mobile phone. Here, instead of having single command words, we have
-'wake words' specific to your device, such as "Hey Google" or "Hey Siri". When the audio classification model detects
-these wake words, it triggers your phone to start listening to the microphone and transcribe your speech using a speech
-recognition model.
+你的手机上也在持续运行着一个类似的模型。只不过这里识别的不是单个命令词,而是与你的设备绑定的“唤醒词”,例如"Hey Google"或"Hey Siri"。当音频分类模型检测到这些唤醒词后,它会触发手机开始监听麦克风,并通过语音识别模型转写你的语音。
 
-The audio classification model is much smaller and lighter than the speech recognition model, often only several millions
-of parameters compared to several hundred millions for speech recognition. Thus, it can be run continuously on your device
-without draining your battery! Only when the wake word is detected is the larger speech recognition model launched, and
-afterwards it is shut down again. We'll cover transformer models for speech recognition in the next Unit, so by the end
-of the course you should have the tools you need to build your own voice activated assistant!
+音频分类模型通常比语音识别模型小得多也轻量得多,参数量往往只有几百万,而语音识别模型可能有数亿参数。因此,音频分类模型可以在设备上持续运行,而不至于耗尽电量。只有在检测到唤醒词时,较大的语音识别模型才会被启动,用完之后再次关闭。我们将在下一单元介绍用于语音识别的Transformer模型;学完本课程后,你就会具备构建自己的语音唤醒助手所需的工具!
 
-As with any dataset on the Hugging Face Hub, we can get a feel for the kind of audio data it has present without downloading
-or committing it memory. After heading to the [Speech Commands' dataset card](https://huggingface.co/datasets/speech_commands)
-on the Hub, we can use the Dataset Viewer to scroll through the first 100 samples of the dataset, listening to the audio
-files and checking any other metadata information:
+像Hugging Face Hub上的任何数据集一样,我们无需下载或占用内存,就能先大致了解它包含的音频数据。前往Hub上的[Speech Commands数据集卡片](https://huggingface.co/datasets/speech_commands),可以使用Dataset Viewer浏览该数据集的前100个样本,收听音频文件并查看其他元数据:
 
<div class="flex justify-center">
Diagram of datasets viewer.
-The Dataset Preview is a brilliant way of experiencing audio datasets before committing to using them. You can pick any -dataset on the Hub, scroll through the samples and listen to the audio for the different subsets and splits, gauging whether -it's the right dataset for your needs. Once you've selected a dataset, it's trivial to load the data so that you can start -using it. +Dataset Preview是体验音频数据集的绝佳方式,你可以在Hub上选取任意数据集,滚动查看不同子集与切分的样本并收听音频,从而评估它是否适合你的需求。一旦选定数据集,加载数据并开始使用就非常简单。 -Let's do exactly that and load a sample of the Speech Commands dataset using streaming mode: +我们就来实际操作一下,使用流式模式加载Speech Commands数据集的一个样本: ```python speech_commands = load_dataset( @@ -124,8 +83,7 @@ speech_commands = load_dataset( sample = next(iter(speech_commands)) ``` -We'll load an official [Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer) -checkpoint fine-tuned on the Speech Commands dataset, under the namespace [`"MIT/ast-finetuned-speech-commands-v2"`](https://huggingface.co/MIT/ast-finetuned-speech-commands-v2): +下面加载一个在Speech Commands上微调的官方[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)模型文件,模型名称为`"MIT/ast-finetuned-speech-commands-v2"`: ```python classifier = pipeline( @@ -133,7 +91,7 @@ classifier = pipeline( ) classifier(sample["audio"].copy()) ``` -**Output:** +**输出:** ``` [{'score': 0.9999892711639404, 'label': 'backward'}, {'score': 1.7504888774055871e-06, 'label': 'happy'}, @@ -142,67 +100,51 @@ classifier(sample["audio"].copy()) {'score': 5.614546694232558e-07, 'label': 'up'}] ``` -Cool! Looks like the example contains the word "backward" with high probability. We can take a listen to the sample -and verify this is correct: +很好!看起来该样本以极高概率包含单词“backward”。我们可以播放这段音频来验证: ``` from IPython.display import Audio Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"]) ``` -Now, you might be wondering how we've selected these pre-trained models to show you in these audio classification examples. -The truth is, finding pre-trained models for your dataset and task is very straightforward! The first thing we need to do -is head to the Hugging Face Hub and click on the "Models" tab: https://huggingface.co/models +你或许会好奇我们是如何为这些示例选择预训练模型的。事实上,为你的数据集与任务寻找预训练模型非常直接。首先前往Hugging Face Hub的“Models”标签页:[https://huggingface.co/models](https://huggingface.co/models) -This is going to bring up all the models on the Hugging Face Hub, sorted by downloads in the past 30 days: +这会展示Hub上的全部模型,并按最近30天的下载量排序:
-You'll notice on the left-hand side that we have a selection of tabs that we can select to filter models by task, library, -dataset, etc. Scroll down and select the task "Audio Classification" from the list of audio tasks: +你会注意到左侧有诸多筛选选项,可以按任务、库、数据集等过滤模型。向下滚动,在音频任务中选择“Audio Classification”:
-We're now presented with the sub-set of 500+ audio classification models on the Hub. To further refine this selection, we -can filter models by dataset. Click on the tab "Datasets", and in the search box type "speech_commands". As you begin typing, -you'll see the selection for `speech_commands` appear underneath the search tab. You can click this button to filter all -audio classification models to those fine-tuned on the Speech Commands dataset: +现在看到的是Hub上500多个音频分类模型的子集。为了进一步缩小范围,可以按数据集筛选。点击“Datasets”,在搜索框中输入“speech\_commands”。当你开始输入时,`speech_commands`选项会出现在搜索框下方。点击它即可将模型过滤为在Speech Commands上微调过的音频分类模型:
-Great! We see that we have 6 pre-trained models available to us for this specific dataset and task. You'll recognise the -first of these models as the Audio Spectrogram Transformer checkpoint that we used in the previous example. This process -of filtering models on the Hub is exactly how we went about selecting the checkpoint to show you! +很好!我们可以看到针对该数据集与任务共有6个可用的预训练模型。你会认出其中第一个正是我们刚刚使用的Audio Spectrogram Transformer的模型文件。我们挑选模型的过程就是在Hub上这样逐步过滤得到的。 -## Language Identification +## 语言识别 -Language identification (LID) is the task of identifying the language spoken in an audio sample from a list of candidate -languages. LID can form an important part in many speech pipelines. For example, given an audio sample in an unknown language, -an LID model can be used to categorise the language(s) spoken in the audio sample, and then select an appropriate speech -recognition model trained on that language to transcribe the audio. +语言识别(Language Identification,LID)是从音频样本中在一组候选语言里判断其所说语言的任务。LID常作为许多语音处理流程中的重要组成部分。例如,给定一段未知语言的音频,LID模型可以先判断其中的语言,然后选择在该语言上训练的语音识别模型来转写音频。 ### FLEURS -FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) is a dataset for evaluating speech recognition -systems in 102 languages, including many that are classified as 'low-resource'. Take a look at the FLEURS dataset -card on the Hub and explore the different languages that are present: [google/fleurs](https://huggingface.co/datasets/google/fleurs). -Can you find your native tongue here? If not, what's the most closely related language? +FLEURS(Few-shot Learning Evaluation of Universal Representations of Speech)是一个用于评估语音识别系统的数据集,覆盖102种语言,其中包含大量被视为“低资源”的语言。前往Hub上的数据集卡片查看并探索这些语言:[google/fleurs](https://huggingface.co/datasets/google/fleurs)。你能找到你的母语吗?如果没有,哪一种是最接近的语言? -Let's load up a sample from the validation split of the FLEURS dataset using streaming mode: +我们使用流式模式加载FLEURS数据集验证集的一个样本: ```python fleurs = load_dataset("google/fleurs", "all", split="validation", streaming=True) sample = next(iter(fleurs)) ``` -Great! Now we can load our audio classification model. For this, we'll use a version of [Whisper](https://arxiv.org/pdf/2212.04356.pdf) -fine-tuned on the FLEURS dataset, which is currently the most performant LID model on the Hub: +太棒了!现在我们可以加载音频分类模型了。这里我们使用一个在FLEURS上微调的[Whisper](https://arxiv.org/pdf/2212.04356.pdf)版本,该模型目前是Hub上表现最好的LID模型之一 ```python classifier = pipeline( @@ -210,11 +152,11 @@ classifier = pipeline( ) ``` -We can then pass the audio through our classifier and generate a prediction: +将音频传入分类器得到预测: ```python classifier(sample["audio"]) ``` -**Output:** +**输出:** ``` [{'score': 0.9999330043792725, 'label': 'Afrikaans'}, {'score': 7.093023668858223e-06, 'label': 'Northern-Sotho'}, @@ -223,55 +165,34 @@ classifier(sample["audio"]) {'score': 3.2580724109720904e-06, 'label': 'Cantonese Chinese'}] ``` -We can see that the model predicted the audio was in Afrikaans with extremely high probability (near 1). The FLEURS dataset -contains audio data from a wide range of languages - we can see that possible class labels include Northern-Sotho, Icelandic, -Danish and Cantonese Chinese amongst others. You can find the full list of languages on the dataset card here: [google/fleurs](https://huggingface.co/datasets/google/fleurs). +可以看到模型以极高概率(接近1)判断这段音频为阿非利卡语(Afrikaans)。FLEURS覆盖了非常广泛的语言集合,可能的类别标签包括北索托语(Northern-Sotho)、冰岛语(Icelandic)、丹麦语(Danish)、粤语(Cantonese Chinese)等。你可以在这里查看完整的语言列表:[google/fleurs](https://huggingface.co/datasets/google/fleurs)。 -Over to you! 
What other checkpoints can you find for FLEURS LID on the Hub? What transformer models are they using under-the-hood? +试试看吧!你还能在Hub上找到哪些FLEURS的LID模型文件?它们底层使用了哪些Transformer模型? -## Zero-Shot Audio Classification +## 零样本音频分类(Zero-Shot Audio Classification) -In the traditional paradigm for audio classification, the model predicts a class label from a _pre-defined_ set of -possible classes. This poses a barrier to using pre-trained models for audio classification, since the label set of the -pre-trained model must match that of the downstream task. For the previous example of LID, the model must predict one of -the 102 langauge classes on which it was trained. If the downstream task actually requires 110 languages, the model would -not be able to predict 8 of the 110 languages, and so would require re-training to achieve full coverage. This limits the -effectiveness of transfer learning for audio classification tasks. +在传统音频分类范式中,模型从一组**预先定义**的类别中进行预测。这对迁移使用预训练模型带来限制,因为预训练模型的标签集合必须与下游任务匹配。以LID为例,模型必须在其训练用到的102种语言里选择其一。如果下游任务需要110种语言,那么其中8种将无法被预测,必须重新训练才能覆盖完整标签空间,从而限制了迁移学习的效果。 -Zero-shot audio classification is a method for taking a pre-trained audio classification model trained on a set of labelled -examples and enabling it to be able to classify new examples from previously unseen classes. Let's take a look at how we -can achieve this! +零样本音频分类的思路是,将一个在带标注样本上训练好的音频分类模型扩展为能够对**未见过的类别**进行分类。我们来看看如何实现这一点! -Currently, 🤗 Transformers supports one kind of model for zero-shot audio classification: the [CLAP model](https://huggingface.co/docs/transformers/model_doc/clap). -CLAP is a transformer-based model that takes both audio and text as inputs, and computes the _similarity_ between the two. -If we pass a text input that strongly correlates with an audio input, we'll get a high similarity score. Conversely, passing -a text input that is completely unrelated to the audio input will return a low similarity. +目前,🤗 Transformers支持一种用于零样本音频分类的模型:[CLAP模型](https://huggingface.co/docs/transformers/model_doc/clap)。CLAP是一个同时接收音频与文本输入的Transformer模型,用于计算二者之间的**相似度**。如果输入的文本与音频语义强相关,相似度会高;如果完全无关,相似度会低。 -We can use this similarity prediction for zero-shot audio classification by passing one audio input to the model and -multiple candidate labels. The model will return a similarity score for each of the candidate labels, and we can pick the -one that has the highest score as our prediction. +我们可以通过向模型输入一段音频和多个候选标签文本来实现零样本分类。模型会为每个候选标签返回一个相似度分数,我们选择分数最高者作为预测结果。 -Let's take an example where we use one audio input from the [Environmental Speech Challenge (ESC)](https://huggingface.co/datasets/ashraq/esc50) -dataset: +来看一个示例,使用[Environmental Speech Challenge(ESC)](https://huggingface.co/datasets/ashraq/esc50)数据集的一段音频: ```python dataset = load_dataset("ashraq/esc50", split="train", streaming=True) audio_sample = next(iter(dataset))["audio"]["array"] ``` -We then define our candidate labels, which form the set of possible classification labels. The model will return a -classification probability for each of the labels we define. This means we need to know _a-priori_ the set of possible -labels in our classification problem, such that the correct label is contained within the set and is thus assigned a -valid probability score. Note that we can either pass the full set of labels to the model, or a hand-selected subset -that we believe contains the correct label. 
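+在动手定义候选标签之前,不妨先看看这段音频在数据集中自带的类别标注,方便稍后核对零样本预测是否正确。下面是一个小示意;其中的`category`字段名是基于ESC-50数据集常见列名的假设,若实际列名不同请自行调整:
+
+```python
+# 重新取一条流式样本,读取它自带的类别标注
+first_example = next(iter(dataset))
+print(first_example["category"])
+```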
Passing the full set of labels is going to be more exhaustive, but comes -at the expense of lower classification accuracy since the classification space is larger (provided the correct label is -our chosen subset of labels): +接下来定义候选标签,它们构成可能的分类集合。模型会为我们定义的每个标签返回一个分类概率。这意味着我们需要**先验**地知道任务的可能标签集合,使正确标签包含在其中并能被赋予有效概率。注意可以传入完整标签集合,或我们认为包含正确标签的一个子集。传入完整集合更全面,但由于分类空间更大,通常会降低分类准确率;若我们有合理的先验,使用子集往往更实用: ```python candidate_labels = ["Sound of a dog", "Sound of vacuum cleaner"] ``` -We can run both through the model to find the candidate label that is _most similar_ to the audio input: +将两者一并输入模型,找出与音频输入**最相似**的候选标签: ```python classifier = pipeline( @@ -279,42 +200,25 @@ classifier = pipeline( ) classifier(audio_sample, candidate_labels=candidate_labels) ``` -**Output:** +**输出:** ``` [{'score': 0.9997242093086243, 'label': 'Sound of a dog'}, {'score': 0.0002758323971647769, 'label': 'Sound of vacuum cleaner'}] ``` -Alright! The model seems pretty confident we have the sound of a dog - it predicts it with 99.96% probability, so we'll -take that as our prediction. Let's confirm whether we were right by listening to the audio sample (don't turn up your -volume too high or else you might get a jump!): +很好!模型几乎可以确定这是“狗的声音”,概率为99.96%,我们就采纳这个预测。播放音频验证一下(别把音量开太大,以免被突然的声音吓到!): ```python Audio(audio_sample, rate=16000) ``` -Perfect! We have the sound of a dog barking 🐕, which aligns with the model's prediction. Have a play with different audio -samples and different candidate labels - can you define a set of labels that give good generalisation across the ESC -dataset? Hint: think about where you could find information on the possible sounds in ESC and construct your labels accordingly! +完美!正是狗叫声🐕,与模型预测一致。你可以尝试使用不同的音频样本和候选标签进行测试。你能给出一组在ESC数据集上具有良好泛化能力的标签集合吗?提示:思考一下可以从哪里找到ESC的可能声音类别,据此来构造你的标签! -You might be wondering why we don't use the zero-shot audio classification pipeline for **all** audio classification tasks? -It seems as though we can make predictions for any audio classification problem by defining appropriate class labels _a-priori_, -thus bypassing the constraint that our classification task needs to match the labels that the model was pre-trained on. -This comes down to the nature of the CLAP model used in the zero-shot pipeline: CLAP is pre-trained on _generic_ audio -classification data, similar to the environmental sounds in the ESC dataset, rather than specifically speech data, like -we had in the LID task. If you gave it speech in English and speech in Spanish, CLAP would know that both examples were -speech data 🗣️ But it wouldn't be able to differentiate between the languages in the same way a dedicated LID model is -able to. +你可能会问,为什么不把零样本音频分类用于**所有**音频分类任务?看起来只要先验地定义合适的标签集合,就能对任意任务进行预测,从而绕过“下游任务标签必须匹配预训练标签”的限制。关键在于零样本流程所用的CLAP模型:CLAP是在**通用**音频分类数据上预训练的,类似ESC中的环境声音,而不是专门的语音数据,如LID任务中那样。如果给它英语与西班牙语的语音,CLAP能识别它们都是语音🗣️,但无法像专门的LID模型那样区分语言。 -## What next? +## 接下来做什么? -We've covered a number of different audio classification tasks and presented the most relevant datasets and models that -you can download from the Hugging Face Hub and use in just several lines of code using the `pipeline()` class. These tasks -included keyword spotting, language identification and zero-shot audio classification. +本节我们覆盖了多种音频分类任务,并介绍了可以从Hugging Face Hub下载、只需少量代码就能通过`pipeline()`使用的相关数据集与模型。这些任务包括关键词识别、语言识别以及零样本音频分类。 -But what if we want to do something **new**? We've worked extensively on speech processing tasks, but this is only one -aspect of audio classification. 
Another popular field of audio processing involves **music**. While music has inherently -different features to speech, many of the same principles that we've learnt about already can be applied to music. +但如果我们想做点**新的**呢?我们已经广泛涉及语音处理任务,但这只是音频分类的一个方面。另一个热门方向是**音乐**。虽然音乐与语音有本质差异,但许多我们已经掌握的原则同样适用于音乐。 -In the following section, we'll go through a step-by-step guide on how you can fine-tune a transformer model with 🤗 -Transformers on the task of music classification. By the end of it, you'll have a fine-tuned checkpoint that you can plug -into the `pipeline()` class, enabling you to classify songs in exactly the same way that we've classified speech here! +接下来的章节将手把手带你使用🤗Transformers在音乐分类任务上微调一个Transformer模型。完成后,你将得到一个可直接接入`pipeline()`的微调checkpoint,以与本节中处理语音相同的方式来分类歌曲! From 44ae47790b15465c15858194c0b05ac0f727c0e3 Mon Sep 17 00:00:00 2001 From: zefang-liu Date: Tue, 29 Jul 2025 20:28:42 -0700 Subject: [PATCH 4/7] Translate the zh-cn chapter 4 fine-tuning --- .../zh-CN/chapter4/classification_models.mdx | 6 +- chapters/zh-CN/chapter4/fine-tuning.mdx | 230 ++++++------------ 2 files changed, 76 insertions(+), 160 deletions(-) diff --git a/chapters/zh-CN/chapter4/classification_models.mdx b/chapters/zh-CN/chapter4/classification_models.mdx index 1bcbd3a6..ca17f7a6 100644 --- a/chapters/zh-CN/chapter4/classification_models.mdx +++ b/chapters/zh-CN/chapter4/classification_models.mdx @@ -8,9 +8,9 @@ Hugging Face Hub上托管着超过500个用于音频分类的预训练模型。 现在我们已经回顾了音频分类中常用的Transformer架构,让我们开始介绍音频分类的几种子任务,以及最受欢迎的模型! -## 🤗Transformers安装说明 +## 🤗 Transformers安装说明 -截至目前,音频分类所需的一些最新更新仅包含在🤗Transformers仓库的`main`分支中,还未发布到PyPi。为了确保本地环境具备这些更新,我们需要通过以下命令从`main`分支安装Transformers: +截至目前,音频分类所需的一些最新更新仅包含在🤗 Transformers仓库的`main`分支中,还未发布到PyPi。为了确保本地环境具备这些更新,我们需要通过以下命令从`main`分支安装Transformers: ``` pip install git+https://github.com/huggingface/transformers @@ -221,4 +221,4 @@ Audio(audio_sample, rate=16000) 但如果我们想做点**新的**呢?我们已经广泛涉及语音处理任务,但这只是音频分类的一个方面。另一个热门方向是**音乐**。虽然音乐与语音有本质差异,但许多我们已经掌握的原则同样适用于音乐。 -接下来的章节将手把手带你使用🤗Transformers在音乐分类任务上微调一个Transformer模型。完成后,你将得到一个可直接接入`pipeline()`的微调checkpoint,以与本节中处理语音相同的方式来分类歌曲! +接下来的章节将手把手带你使用🤗 Transformers在音乐分类任务上微调一个Transformer模型。完成后,你将得到一个可直接接入`pipeline()`的微调checkpoint,以与本节中处理语音相同的方式来分类歌曲! diff --git a/chapters/zh-CN/chapter4/fine-tuning.mdx b/chapters/zh-CN/chapter4/fine-tuning.mdx index 496e4c7e..c52402a1 100644 --- a/chapters/zh-CN/chapter4/fine-tuning.mdx +++ b/chapters/zh-CN/chapter4/fine-tuning.mdx @@ -1,17 +1,10 @@ -# Fine-tuning a model for music classification +# 针对音乐分类进行微调 -In this section, we'll present a step-by-step guide on fine-tuning an encoder-only transformer model for music classification. -We'll use a lightweight model for this demonstration and fairly small dataset, meaning the code is runnable end-to-end -on any consumer grade GPU, including the T4 16GB GPU provided in the Google Colab free tier. The section includes various -tips that you can try should you have a smaller GPU and encounter memory issues along the way. +在本节中,我们将逐步演示如何微调一个仅包含编码器的Transformer模型以完成音乐分类任务。我们将使用一个轻量级模型和一个相对较小的数据集,这意味着你可以在任何消费级GPU上完整运行整个代码,包括Google Colab免费版中提供的T4 16GB GPU。本节还包含了一些实用建议,供你在显存有限的设备上出现内存问题时参考。 +## 数据集 -## The Dataset - -To train our model, we'll use the [GTZAN](https://huggingface.co/datasets/marsyas/gtzan) dataset, which is a popular -dataset of 1,000 songs for music genre classification. Each song is a 30-second clip from one of 10 genres of music, -spanning disco to metal. 
We can get the audio files and their corresponding labels from the Hugging Face Hub with the -`load_dataset()` function from 🤗 Datasets: +为了训练我们的模型,我们将使用[GTZAN](https://huggingface.co/datasets/marsyas/gtzan)数据集,这是一个广泛使用的音乐风格分类数据集,包含1000首歌曲。每首歌曲都是30秒的片段,涵盖从Disco到金属在内的10种音乐风格。我们可以通过🤗 Datasets中的`load_dataset()`函数从Hugging Face Hub加载这些音频文件及其对应标签: ```python from datasets import load_dataset @@ -20,7 +13,7 @@ gtzan = load_dataset("marsyas/gtzan", "all") gtzan ``` -**Output:** +**输出:** ```out Dataset({ features: ['file', 'audio', 'genre'], @@ -30,21 +23,18 @@ Dataset({ -One of the recordings in GTZAN is corrupted, so it's been removed from the dataset. That's why we have 999 examples -instead of 1,000. +GTZAN中的一个录音文件已损坏,因此在数据集中被移除。所以总样本数是999而不是1000。 - -GTZAN doesn't provide a predefined validation set, so we'll have to create one ourselves. The dataset is balanced across -genres, so we can use the `train_test_split()` method to quickly create a 90/10 split as follows: +GTZAN没有预定义的验证集,因此我们需要自己创建一个。由于数据集中每种风格的样本数量是均衡的,我们可以使用`train_test_split()`方法快速创建一个90/10的训练/验证划分: ```python gtzan = gtzan["train"].train_test_split(seed=42, shuffle=True, test_size=0.1) gtzan ``` -**Output:** +**输出:** ```out DatasetDict({ train: Dataset({ @@ -58,13 +48,13 @@ DatasetDict({ }) ``` -Great, now that we've got our training and validation sets, let's take a look at one of the audio files: +很好,现在我们已经有了训练集和验证集,来看一个音频样本的具体内容: ```python gtzan["train"][0] ``` -**Output:** +**输出:** ```out { "file": "~/.cache/huggingface/datasets/downloads/extracted/fa06ce46130d3467683100aca945d6deafb642315765a784456e1d81c94715a8/genres/pop/pop.00098.wav", @@ -88,25 +78,19 @@ gtzan["train"][0] } ``` -As we saw in [Unit 1](../chapter1/audio_data), the audio files are represented as 1-dimensional NumPy arrays, -where the value of the array represents the amplitude at that timestep. For these songs, the sampling rate is 22,050 Hz, -meaning there are 22,050 amplitude values sampled per second. We'll have to keep this in mind when using a pretrained model -with a different sampling rate, converting the sampling rates ourselves to ensure they match. We can also see the genre -is represented as an integer, or _class label_, which is the format the model will make it's predictions in. Let's use the -`int2str()` method of the `genre` feature to map these integers to human-readable names: +正如我们在[第1单元](../chapter1/audio_data)中看到的,音频文件以一维NumPy数组的形式表示,数组中的每个值代表音频在某个时间点的振幅。该数据集的采样率是22,050 Hz,意味着每秒有22,050个采样值。在使用采样率不同的预训练模型时,我们需要注意进行采样率转换,确保输入格式匹配。此外,`genre` 以整数形式表示,是一种**类别标签(class label)**,也正是模型预测时所采用的输出格式。我们可以使用 `genre` 特征的 `int2str()` 方法将这些整数标签映射为可读的类别名称: ```python id2label_fn = gtzan["train"].features["genre"].int2str id2label_fn(gtzan["train"][0]["genre"]) ``` -**Output:** +**输出:** ```out 'pop' ``` -This label looks correct, since it matches the filename of the audio file. Let's now listen to a few more examples by -using Gradio to create a simple interface with the `Blocks` API: +这个标签看起来没问题,与音频文件名一致。接下来我们使用 Gradio 的 `Blocks` API 构建一个简单的界面,来试听更多样本: ```python import gradio as gr @@ -132,29 +116,19 @@ demo.launch(debug=True) -From these samples we can certainly hear the difference between genres, but can a transformer do this too? Let's train a -model to find out! First, we'll need to find a suitable pretrained model for this task. Let's see how we can do that. 
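+顺带一提,前面提到GTZAN中各风格的样本数量是均衡的,我们也可以用几行代码自行验证一下。下面是一个简单的示意,沿用上文的`gtzan`和`id2label_fn`:
+
+```python
+from collections import Counter
+
+# 统计训练集中每种音乐风格的样本数量
+genre_counts = Counter(id2label_fn(genre) for genre in gtzan["train"]["genre"])
+print(genre_counts)
+```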
+从这些样本中,我们确实能听出风格的不同,那Transformer模型是否也能做到呢?让我们训练一个模型来验证一下。第一步是选一个合适的预训练模型。一起来看看该怎么做吧。 -## Picking a pretrained model for audio classification +## 选择音频分类的预训练模型 -To get started, let's pick a suitable pretrained model for audio classification. In this domain, pretraining is typically -carried out on large amounts of unlabeled audio data, using datasets like [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -and [Voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli). The best way to find these models on the Hugging -Face Hub is to use the "Audio Classification" filter, as described in the previous section. Although models like Wav2Vec2 and -HuBERT are very popular, we'll use a model called _DistilHuBERT_. This is a much smaller (or _distilled_) version of the [HuBERT](https://huggingface.co/docs/transformers/model_doc/hubert) -model, which trains around 73% faster, yet preserves most of the performance. +首先,我们需要为音频分类任务选一个合适的预训练模型。在这个领域,预训练通常是在大量未标注音频数据上完成的,比如[LibriSpeech](https://huggingface.co/datasets/librispeech_asr)和[Voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli)。最方便的方式是在Hugging Face Hub上使用“Audio Classification”任务筛选器(上一节已介绍)。虽然Wav2Vec2和HuBERT非常流行,但我们将使用一个名为**DistilHuBERT**的模型。它是[HuBERT](https://huggingface.co/docs/transformers/model_doc/hubert)的轻量版,训练速度提升约73%,但在性能上几乎没有损失。 -## From audio to machine learning features +## 从音频到机器学习特征 -## Preprocessing the data +## 数据预处理 -Similar to tokenization in NLP, audio and speech models require the input to be encoded in a format that the model -can process. In 🤗 Transformers, the conversion from audio to the input format is handled by the _feature extractor_ of -the model. Similar to tokenizers, 🤗 Transformers provides a convenient `AutoFeatureExtractor` class that can automatically -select the correct feature extractor for a given model. To see how we can process our audio files, let's begin by instantiating -the feature extractor for DistilHuBERT from the pre-trained checkpoint: +与自然语言处理中的分词类似,音频和语音模型也需要将输入编码为模型可处理的格式。在🤗 Transformers中,这种从原始音频到模型输入格式的转换由模型的**特征提取器(feature extractor)**完成。与分词器(Tokenizer)类似,🤗 Transformers提供了`AutoFeatureExtractor`类,它可以根据所用模型自动选择正确的特征提取器。为了了解我们该如何处理音频文件,我们先从预训练的DistilHuBERT模型中初始化特征提取器开始: ```python from transformers import AutoFeatureExtractor @@ -165,21 +139,19 @@ feature_extractor = AutoFeatureExtractor.from_pretrained( ) ``` -Since the sampling rate of the model and the dataset are different, we'll have to resample the audio file to 16,000 -Hz before passing it to the feature extractor. We can do this by first obtaining the model's sample rate from the feature -extractor: +由于模型和数据集的采样率不同,我们需要先将音频重新采样至16,000Hz,然后再传入特征提取器。首先可以通过特征提取器来获取模型所需的采样率: ```python sampling_rate = feature_extractor.sampling_rate sampling_rate ``` -**Output:** +**输出:** ```out 16000 ``` -Next, we resample the dataset using the `cast_column()` method and `Audio` feature from 🤗 Datasets: +接着,我们使用🤗 Datasets提供的`Audio`特征和`cast_column()`方法重新采样整个数据集: ```python from datasets import Audio @@ -187,14 +159,13 @@ from datasets import Audio gtzan = gtzan.cast_column("audio", Audio(sampling_rate=sampling_rate)) ``` -We can now check the first sample of the train-split of our dataset to verify that it is indeed at 16,000 Hz. 
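+作为一个快速检查,我们可以算一下重采样后音频数组的长度和对应的时长:约30秒的片段在16,000 Hz下约有16,000 × 30 = 480,000个采样点,而在原始的22,050 Hz下约有661,500个。下面是一个示意,沿用上文的`gtzan`:
+
+```python
+# 读取重采样后的第一条样本,检查采样点数与时长
+audio = gtzan["train"][0]["audio"]
+duration = len(audio["array"]) / audio["sampling_rate"]
+print(f"采样点数:{len(audio['array'])},时长约 {duration:.1f} 秒")
+```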
🤗 Datasets -will resample the audio file _on-the-fly_ when we load each audio sample: +我们现在可以检查训练集中的第一个样本是否已被正确重采样为16kHz。🤗 Datasets会在加载每个样本时**动态重采样**: ```python gtzan["train"][0] ``` -**Output:** +**输出:** ```out { "file": "~/.cache/huggingface/datasets/downloads/extracted/fa06ce46130d3467683100aca945d6deafb642315765a784456e1d81c94715a8/genres/pop/pop.00098.wav", @@ -218,25 +189,17 @@ gtzan["train"][0] } ``` -Great! We can see that the sampling rate has been downsampled to 16kHz. The array values are also different, as we've -now only got approximately one amplitude value for every 1.5 that we had before. +太好了!我们可以看到采样率已经被下采样为16kHz。由于采样点减少,数组中的振幅值也发生了变化,现在大约每1.5个原始采样点只保留了1个。 -A defining feature of Wav2Vec2 and HuBERT like models is that they accept a float array corresponding to the raw waveform -of the speech signal as an input. This is in contrast to other models, like Whisper, where we pre-process the raw audio waveform -to spectrogram format. +Wav2Vec2和HuBERT等模型的一个显著特点是,它们直接接受表示语音信号原始波形的float类型数组作为输入。这与其他模型(如Whisper)不同,后者在输入前需要将原始音频波形预处理为声谱图格式。 -We mentioned that the audio data is represented as a 1-dimensional array, so it's already in the right format to be read -by the model (a set of continuous inputs at discrete time steps). So, what exactly does the feature extractor do? +我们提到过,音频数据是用一维数组表示的,因此它已经具备模型所需的格式,也就是在离散时间步上的一组连续输入值。那么,特征提取器的作用是什么呢? -Well, the audio data is in the right format, but we've imposed no restrictions on the values it can take. For our model to -work optimally, we want to keep all the inputs within the same dynamic range. This is going to make sure we get a similar -range of activations and gradients for our samples, helping with stability and convergence during training. +虽然音频数据的结构正确,但它的数值范围没有任何限制。为了使模型运行更稳定,我们希望所有输入的动态范围保持一致。这可以确保不同样本在训练时具有相似的激活值和梯度,有助于模型收敛并提升训练稳定性。 -To do this, we _normalise_ our audio data, by rescaling each sample to zero mean and unit variance, a process called -_feature scaling_. It's exactly this feature normalisation that our feature extractor performs! +为此,我们需要对音频数据进行归一化处理,即将每个样本缩放为零均值、单位方差的形式,这一过程称为特征缩放(feature scaling)。这正是特征提取器的功能所在。 -We can take a look at the feature extractor in operation by applying it to our first audio sample. First, let's compute -the mean and variance of our raw audio data: +我们可以将特征提取器应用到第一个样本上,来看它的具体表现。首先,计算原始音频数据的均值和方差: ```python import numpy as np @@ -246,14 +209,12 @@ sample = gtzan["train"][0]["audio"] print(f"Mean: {np.mean(sample['array']):.3}, Variance: {np.var(sample['array']):.3}") ``` -**Output:** +**输出:** ```out Mean: 0.000185, Variance: 0.0493 ``` -We can see that the mean is close to zero already, but the variance is closer to 0.05. If the variance for the sample was -larger, it could cause our model problems, since the dynamic range of the audio data would be very small and thus difficult to -separate. Let's apply the feature extractor and see what the outputs look like: +我们可以看到,均值已经非常接近0,但方差约为0.05。如果样本的方差更大,可能会导致模型出现问题,因为这意味着音频数据的动态范围很小,从而使不同特征难以区分。接下来我们使用特征提取器,看看处理后的输出结果: ```python inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"]) @@ -265,33 +226,23 @@ print( ) ``` -**Output:** +**输出:** ```out inputs keys: ['input_values', 'attention_mask'] Mean: -4.53e-09, Variance: 1.0 ``` -Alright! Our feature extractor returns a dictionary of two arrays: `input_values` and `attention_mask`. The `input_values` -are the preprocessed audio inputs that we'd pass to the HuBERT model. 
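+我们也可以手动做一次“减去均值、除以标准差”的归一化,验证它与特征提取器的输出基本一致。下面是一个示意,其中加上的1e-7只是一个用来避免除零的小常数:
+
+```python
+# 手动实现零均值、单位方差的特征缩放
+manual = (sample["array"] - np.mean(sample["array"])) / np.sqrt(np.var(sample["array"]) + 1e-7)
+print(f"Mean: {np.mean(manual):.3}, Variance: {np.var(manual):.3}")
+```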
The [`attention_mask`](https://huggingface.co/docs/transformers/glossary#attention-mask) -is used when we process a _batch_ of audio inputs at once - it is used to tell the model where we have padded inputs of -different lengths. +很好!特征提取器返回了一个包含两个键的字典:`input_values`和`attention_mask`。其中,`input_values`是我们将传递给HuBERT模型的预处理音频输入;[`attention_mask`](https://huggingface.co/docs/transformers/glossary#attention-mask)则用于**批量**处理音频输入时,指示哪些位置是由于输入长度不一而添加的填充。 -We can see that the mean value is now very much closer to zero, and the variance bang-on one! This is exactly the form we -want our audio samples in prior to feeding them to the HuBERT model. +我们可以看到,现在的均值已经非常接近0,方差则恰好为1!这正是我们在将音频样本输入HuBERT模型之前所期望的数据格式。 -Note how we've passed the sampling rate of our audio data to our feature extractor. This is good practice, as the feature -extractor performs a check under-the-hood to make sure the sampling rate of our audio data matches the sampling rate -expected by the model. If the sampling rate of our audio data did not match the sampling rate of our model, we'd need to -up-sample or down-sample the audio data to the correct sampling rate. +注意我们将音频数据的采样率传给了特征提取器。这是一个好习惯,因为特征提取器会在底层进行检查,确保音频的采样率与模型所期望的一致。如果两者不一致,我们就需要自行进行上采样或下采样处理。 -Great, so now we know how to process our resampled audio files, the last thing to do is define a function that we can -apply to all the examples in the dataset. Since we expect the audio clips to be 30 seconds in length, we'll also -truncate any longer clips by using the `max_length` and `truncation` arguments of the feature extractor as follows: - +太棒了!现在我们已经掌握了如何处理重采样后的音频数据,接下来只需定义一个函数,将这一处理过程应用到数据集中所有样本上。由于我们预期每段音频时长为30秒,对于更长的片段,可以使用特征提取器的`max_length`和`truncation`参数进行截断处理: ```python max_duration = 30.0 @@ -309,10 +260,7 @@ def preprocess_function(examples): return inputs ``` -With this function defined, we can now apply it to the dataset using the [`map()`](https://huggingface.co/docs/datasets/v2.14.0/en/package_reference/main_classes#datasets.Dataset.map) -method. The `.map()` method supports working with batches of examples, which we'll enable by setting `batched=True`. -The default batch size is 1000, but we'll reduce it to 100 to ensure the peak RAM stays within a sensible range for -Google Colab's free tier: +定义好函数后,我们可以使用[`.map()`方法](https://huggingface.co/docs/datasets/v2.14.0/en/package_reference/main_classes#datasets.Dataset.map)将其应用到整个数据集。`.map()`支持批量处理,我们通过设置`batched=True`来启用这一功能。默认的批处理大小为1000,但为了确保在Google Colab免费版中内存占用保持在合理范围内,我们将其减小为100: