
Commit 8b3587f

Merge pull request lovemefan#86 from lovemefan/develop
merge Develop
2 parents 77d7a68 + b4d0ca8 commit 8b3587f

3 files changed: +166 -65 lines changed

README-EN.md

Lines changed: 71 additions & 29 deletions
Original file line number | Diff line number | Diff line change
@@ -65,6 +65,26 @@ python scripts/convert-pt-to-gguf.py \
6565
```
6666

6767
### Non-Streaming Speech Recognition (Silero-VAD + SenseVoice)
68+
69+
#### Parameter Description
70+
Only the following parameters are currently supported:
71+
```bash
72+
usage: ./bin/sense-voice-main [options] file.wav
73+
74+
options:
75+
-t N, --threads N [4 ] Number of decoding threads
76+
-l LANG, --language LANG [auto ] Language code ('auto' for detection), supports [`zh`, `en`, `yue`, `ja`, `ko`]
77+
-m FNAME, --model FNAME [models/sense-voice-small-q4_k.gguf] Path to GGUF model
78+
-f FNAME, --file FNAME [ ] Path to WAV file (only supports 16kHz)
79+
--min_speech_duration_ms [250 ] VAD parameter: minimum speech length in ms
80+
--max_speech_duration_ms [15000 ] VAD parameter: maximum speech length in ms
81+
--min_silence_duration_ms [100 ] VAD parameter: minimum silence length in ms
82+
-ng, --no-gpu [false ] Disable GPU
83+
-fa, --flash-attn [false ] Enable flash attention decoding
84+
-itn, --use-itn [false ] Use inverse text normalization (includes punctuation)
85+
-prefix, --use-prefix [false ] Output extra info: language, emotion, event, itn
86+
```
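The options above can also be assembled from a script. The following is a minimal Python sketch and not part of SenseVoice.cpp: `build_command` and its keyword names are hypothetical, and it only emits flags that appear in the usage text. It uses the long forms `--use-itn`, `--use-prefix`, and `--no-gpu`, and passes the WAV file as the trailing positional argument.

```python
import shlex
from typing import List

# Language codes listed in the usage text.
SUPPORTED_LANGUAGES = {"auto", "zh", "en", "yue", "ja", "ko"}


def build_command(
    wav_path: str,
    model: str = "models/sense-voice-small-q4_k.gguf",
    threads: int = 4,
    language: str = "auto",
    use_itn: bool = False,
    use_prefix: bool = False,
    no_gpu: bool = False,
    binary: str = "./bin/sense-voice-main",
) -> List[str]:
    """Build an argv list for sense-voice-main (hypothetical helper)."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language}")
    cmd = [binary, "-m", model, "-t", str(threads), "-l", language]
    if use_itn:
        cmd.append("--use-itn")
    if use_prefix:
        cmd.append("--use-prefix")
    if no_gpu:
        cmd.append("--no-gpu")
    cmd.append(wav_path)  # positional file.wav, as in the usage line
    return cmd


if __name__ == "__main__":
    # Print a shell-quoted command line; nothing is executed here.
    print(shlex.join(build_command("en.wav", threads=1, use_itn=True)))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` once the binary has been built.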
87+
6888
```bash
6989

7090
git clone https://github.com/lovemefan/SenseVoice.cpp
@@ -80,37 +100,59 @@ cmake -DCMAKE_BUILD_TYPE=Release .. && make -j 8
80100

81101
### Output
82102

83-
Currently using the sense-voice-f16 model for output:
103+
Example output on MacBook M1 using the sense-voice-q4_k model:
84104

85105
```
86-
$./bin/sense-voice-main -m /data/code/SenseVoice.cpp/scripts/resources/gguf-fp16-sense-voice.bin /data/code/SenseVoice.cpp/scripts/resources/SenseVoiceSmall/example/asr_example_zh.wav -t 4
87-
88-
sense_voice_small_init_from_file_with_params_no_state: loading model from '/data/code/SenseVoice.cpp/scripts/resources/gguf-fp16-sense-voice-small.bin'
89-
sense_voice_model_load: version: 3
90-
sense_voice_model_load: alignment: 32
91-
sense_voice_model_load: data offset: 444480
92-
sense_voice_model_load: loading model
93-
sense_voice_model_load: n_vocab = 25055
94-
sense_voice_model_load: n_encoder_hidden_state = 512
95-
sense_voice_model_load: n_encoder_linear_units = 2048
96-
sense_voice_model_load: n_encoder_attention_heads = 4
97-
sense_voice_model_load: n_encoder_layers = 50
98-
sense_voice_model_load: n_mels = 80
99-
sense_voice_model_load: ftype = 1
100-
sense_voice_model_load: vocab[25055] loaded
101-
sense_voice_model_load: CPU total size = 468.98 MB
102-
sense_voice_model_load: n_tensors: 1197
103-
sense_voice_model_load: load SenseVoiceSmall takes 0.213000 second
104-
sense_voice_init_state: compute buffer (encoder) = 50.40 MB
105-
sense_voice_init_state: compute buffer (decoder) = 13.72 MB
106-
107-
system_info: n_threads = 4 / 256 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0
108-
109-
main: processing audio (88747 samples, 5.54669 sec) , 4 threads, 1 processors, lang = auto...
110-
111-
sense_voice_pcm_to_feature_with_state: calculate fbank and cmvn takes 7.207 ms
112-
<|zh|><|NEUTRAL|><|Speech|><|withitn|>欢迎大家来体验达摩院推出的语音识别模型。
113-
sense_voice_full_with_state: decoder audio use 1.011289 s, rtf is 0.182323.
106+
$ ./bin/sense-voice-main -m /Users/Code/cpp-project/SenseVoice.cpp/scripts/resources/SenseVoiceGGUF/sense-voice-small-q4_k.gguf /Users/Downloads/en.wav -t 1 -l auto -itn -prefix
107+
108+
sense_voice_small_init_from_file_with_params_no_state: loading model from '/Users/Code/cpp-project/SenseVoice.cpp/scripts/resources/SenseVoiceGGUF/sense-voice-small-q4_k.gguf'
109+
sense_voice_init_with_params_no_state: use gpu = 1
110+
sense_voice_init_with_params_no_state: flash attn = 0
111+
sense_voice_init_with_params_no_state: gpu_device = 0
112+
sense_voice_init_with_params_no_state: devices = 3
113+
sense_voice_init_with_params_no_state: backends = 3
114+
sense_voice_model_load: version: 3
115+
sense_voice_model_load: alignment: 32
116+
sense_voice_model_load: data offset: 423680
117+
sense_voice_model_load: loading model
118+
sense_voice_model_load: n_vocab = 25055
119+
sense_voice_model_load: n_encoder_hidden_state = 512
120+
sense_voice_model_load: n_encoder_linear_units = 2048
121+
sense_voice_model_load: n_encoder_attention_heads = 4
122+
sense_voice_model_load: n_encoder_layers = 50
123+
sense_voice_model_load: n_mels = 80
124+
sense_voice_model_load: ftype = 12
125+
sense_voice_model_load: vocab[25055] loaded
126+
sense_voice_default_buffer_type: using device Metal (Apple M1 Pro)
127+
sense_voice_model_load: Metal total size = 181.86 MB
128+
sense_voice_model_load: n_tensors: 1212
129+
sense_voice_model_load: load SenseVoiceSmall takes 0.338000 second
130+
sense_voice_backend_init_gpu: using Metal backend
131+
ggml_metal_init: allocating
132+
ggml_metal_init: found device: Apple M1 Pro
133+
ggml_metal_init: picking default device: Apple M1 Pro
134+
ggml_metal_init: using embedded metal library
135+
ggml_metal_init: GPU name: Apple M1 Pro
136+
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
137+
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
138+
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
139+
...
140+
sense_voice_backend_init: using BLAS backend
141+
sense_voice_backend_init: using CPU backend
142+
sense_voice_init_state: kv pad size = 3.67 MB
143+
sense_voice_init_state: compute buffer (encoder) = 3.09 MB
144+
sense_voice_init_state: compute buffer (encoder) = 17.53 MB
145+
sense_voice_init_state: compute buffer (decoder) = 7.99 MB
146+
147+
system_info: n_threads = 1 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0
148+
149+
main: processing audio (114816 samples, 7.17600 sec) , 1 threads, 1 processors, lang = auto...
150+
151+
[1.12-3.42] <|en|><|NEUTRAL|><|Speech|><|withitn|>The tribal chief then called for the boy.
152+
[3.87-6.53] <|en|><|NEUTRAL|><|Speech|><|withitn|>And presented him with 50 pieces of gold.
153+
154+
main: decoder audio use 0.135743 s, rtf is 0.018916.
155+
114156
```
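The last log line reports a real-time factor (RTF). The numbers in the log are consistent with the audio duration being the sample count divided by the 16 kHz sample rate, and the RTF being decoding time divided by that duration. This reading is inferred from the logged values, not from the source code; the function names below are hypothetical.

```python
SAMPLE_RATE_HZ = 16000  # the tool only accepts 16 kHz WAV input


def audio_seconds(n_samples: int, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Duration of the audio in seconds."""
    return n_samples / sample_rate


def real_time_factor(decode_seconds: float, n_samples: int,
                     sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Decoding time divided by audio duration (lower is faster)."""
    return decode_seconds / audio_seconds(n_samples, sample_rate)


if __name__ == "__main__":
    # Values taken from the log above: 114816 samples, 0.135743 s decoding.
    print(f"{audio_seconds(114816):.5f}")               # 7.17600
    print(f"{real_time_factor(0.135743, 114816):.6f}")  # 0.018916
```

An RTF of about 0.019 means the 7.176 s clip was decoded roughly 50 times faster than real time in this run.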
115157

116158
### Streaming Speech Recognition

README.md

Lines changed: 89 additions & 30 deletions
Original file line number | Diff line number | Diff line change
@@ -59,6 +59,27 @@ python scripts/convert-pt-to-gguf.py \
5959
```
6060

6161
### Non-Streaming Speech Recognition (silero-vad + SenseVoice)
62+
63+
#### Parameter Description
64+
65+
Only the parameters listed below are supported; unlisted parameters are not supported yet:
66+
```bash
67+
usage: ./bin/sense-voice-main [options] file.wav
68+
69+
options:
70+
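-t N, --threads N [4 ] Number of decoding threads
71+
-l LANG, --language LANG [auto ] Language code ('auto' for detection), supports [`zh`, `en`, `yue`, `ja`, `ko`] (Chinese, English, Cantonese, Japanese, Korean)
72+
-m FNAME, --model FNAME [models/sense-voice-small-q4_k.gguf] Path to GGUF model
73+
-f FNAME, --file FNAME [ ] Path to WAV file (currently only 16 kHz audio is supported)
74+
--min_speech_duration_ms [250 ] VAD parameter: minimum length of a cut audio segment, in ms
75+
--max_speech_duration_ms [15000 ] VAD parameter: maximum length of a cut audio segment, in ms
76+
--min_silence_duration_ms [100 ] VAD parameter: minimum silence length
77+
-ng, --no-gpu [false ] Disable GPU
78+
-fa, --flash-attn [false ] Enable flash attention decoding
79+
-itn, --use-itn [false ] Use inverse text normalization (includes punctuation)
80+
-prefix, --use-prefix [false ] Output language, emotion, event, and whether ITN was applied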
81+
```
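Because `-f` only accepts 16 kHz WAV input, it can help to check a file before passing it to the tool. The following is a minimal sketch using Python's standard `wave` module, which reads PCM WAV files only; the helper names are hypothetical and not part of SenseVoice.cpp.

```python
import wave

REQUIRED_RATE_HZ = 16000  # sample rate stated in the option description


def wav_sample_rate(path: str) -> int:
    """Return the sample rate stored in the WAV header."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate()


def is_supported_wav(path: str) -> bool:
    """True if the file is a readable WAV file sampled at 16 kHz."""
    try:
        return wav_sample_rate(path) == REQUIRED_RATE_HZ
    except (wave.Error, EOFError):
        # Not a PCM WAV file that the wave module can parse.
        return False
```

Files at other sample rates would need to be resampled first with an audio tool of your choice.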
82+
#### Usage
6283
```bash
6384

6485
git clone https://github.com/lovemefan/SenseVoice.cpp
@@ -74,40 +95,78 @@ cmake -DCMAKE_BUILD_TYPE=Release .. && make -j 8
7495

7596
### Output
7697

77-
Currently using the sense-voice-f16 model for output:
98+
Example output on a MacBook M1 using the sense-voice-q4_k model:
7899

79100
```
80-
$./bin/sense-voice-main -m /data/code/SenseVoice.cpp/scripts/resources/gguf-fp16-sense-voice.bin /data/code/SenseVoice.cpp/scripts/resources/SenseVoiceSmall/example/asr_example_zh.wav -t 4
81-
82-
sense_voice_small_init_from_file_with_params_no_state: loading model from '/data/code/SenseVoice.cpp/scripts/resources/gguf-fp16-sense-voice-small.bin'
83-
sense_voice_model_load: version: 3
84-
sense_voice_model_load: alignment: 32
85-
sense_voice_model_load: data offset: 444480
86-
sense_voice_model_load: loading model
87-
sense_voice_model_load: n_vocab = 25055
88-
sense_voice_model_load: n_encoder_hidden_state = 512
89-
sense_voice_model_load: n_encoder_linear_units = 2048
90-
sense_voice_model_load: n_encoder_attention_heads = 4
91-
sense_voice_model_load: n_encoder_layers = 50
92-
sense_voice_model_load: n_mels = 80
93-
sense_voice_model_load: ftype = 1
94-
sense_voice_model_load: vocab[25055] loaded
95-
sense_voice_model_load: CPU total size = 468.98 MB
96-
sense_voice_model_load: n_tensors: 1197
97-
sense_voice_model_load: load SenseVoiceSmall takes 0.213000 second
98-
sense_voice_init_state: compute buffer (encoder) = 50.40 MB
99-
sense_voice_init_state: compute buffer (decoder) = 13.72 MB
100-
101-
system_info: n_threads = 4 / 256 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0
102-
103-
main: processing audio (88747 samples, 5.54669 sec) , 4 threads, 1 processors, lang = auto...
104-
105-
sense_voice_pcm_to_feature_with_state: calculate fbank and cmvn takes 7.207 ms
106-
<|zh|><|NEUTRAL|><|Speech|><|withitn|>欢迎大家来体验达摩院推出的语音识别模型。
107-
sense_voice_full_with_state: decoder audio use 1.011289 s, rtf is 0.182323.
101+
$ ./bin/sense-voice-main -m /Users/Code/cpp-project/SenseVoice.cpp/scripts/resources/SenseVoiceGGUF/sense-voice-small-q4_k.gguf /Users/Downloads/asr_example_zh.wav -t 1 -l auto -itn -prefix
102+
103+
sense_voice_small_init_from_file_with_params_no_state: loading model from '/Users/Code/cpp-project/SenseVoice.cpp/scripts/resources/SenseVoiceGGUF/sense-voice-small-q4_k.gguf'
104+
sense_voice_init_with_params_no_state: use gpu = 1
105+
sense_voice_init_with_params_no_state: flash attn = 0
106+
sense_voice_init_with_params_no_state: gpu_device = 0
107+
sense_voice_init_with_params_no_state: devices = 3
108+
sense_voice_init_with_params_no_state: backends = 3
109+
sense_voice_model_load: version: 3
110+
sense_voice_model_load: alignment: 32
111+
sense_voice_model_load: data offset: 423680
112+
sense_voice_model_load: loading model
113+
sense_voice_model_load: n_vocab = 25055
114+
sense_voice_model_load: n_encoder_hidden_state = 512
115+
sense_voice_model_load: n_encoder_linear_units = 2048
116+
sense_voice_model_load: n_encoder_attention_heads = 4
117+
sense_voice_model_load: n_encoder_layers = 50
118+
sense_voice_model_load: n_mels = 80
119+
sense_voice_model_load: ftype = 12
120+
sense_voice_model_load: vocab[25055] loaded
121+
sense_voice_default_buffer_type: using device Metal (Apple M1 Pro)
122+
sense_voice_model_load: Metal total size = 181.86 MB
123+
sense_voice_model_load: n_tensors: 1212
124+
sense_voice_model_load: load SenseVoiceSmall takes 0.338000 second
125+
sense_voice_backend_init_gpu: using Metal backend
126+
ggml_metal_init: allocating
127+
ggml_metal_init: found device: Apple M1 Pro
128+
ggml_metal_init: picking default device: Apple M1 Pro
129+
ggml_metal_init: using embedded metal library
130+
ggml_metal_init: GPU name: Apple M1 Pro
131+
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
132+
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
133+
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
134+
...
135+
sense_voice_backend_init: using BLAS backend
136+
sense_voice_backend_init: using CPU backend
137+
sense_voice_init_state: kv pad size = 3.67 MB
138+
sense_voice_init_state: compute buffer (encoder) = 3.09 MB
139+
sense_voice_init_state: compute buffer (encoder) = 17.53 MB
140+
sense_voice_init_state: compute buffer (decoder) = 7.99 MB
141+
142+
system_info: n_threads = 1 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0
143+
144+
main: processing audio (88747 samples, 5.54669 sec) , 1 threads, 1 processors, lang = auto...
145+
146+
[0.96-5.18] <|zh|><|NEUTRAL|><|Speech|><|withitn|>欢迎大家来体验达摩院推出的语音识别模型。
147+
148+
main: decoder audio use 0.103725 s, rtf is 0.018700.
108149
```
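With `-itn -prefix`, each recognized segment in the log above is printed as a `[start-end]` time range, a run of `<|...|>` tags (language, emotion, event, ITN), and the text. The following is a minimal sketch of parsing such lines in Python. The line format is inferred from this sample output only, and `parse_segment` is a hypothetical helper, not part of SenseVoice.cpp.

```python
import re
from typing import Dict, Optional

# "[0.96-5.18] <|zh|><|NEUTRAL|><|Speech|><|withitn|>text"
SEGMENT_RE = re.compile(
    r"^\[(?P<start>\d+(?:\.\d+)?)-(?P<end>\d+(?:\.\d+)?)\]\s*"
    r"(?P<tags>(?:<\|[^|]*\|>)*)(?P<text>.*)$"
)
TAG_RE = re.compile(r"<\|([^|]*)\|>")


def parse_segment(line: str) -> Optional[Dict[str, object]]:
    """Parse one result line; return None for other log lines."""
    match = SEGMENT_RE.match(line.strip())
    if match is None:
        return None
    return {
        "start": float(match["start"]),  # segment start, seconds
        "end": float(match["end"]),      # segment end, seconds
        "tags": TAG_RE.findall(match["tags"]),
        "text": match["text"],
    }
```

Lines that do not begin with a time range, such as the `sense_voice_model_load:` messages, return `None`, so they can be filtered out when reading the tool's output.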
109150
### Streaming Speech Recognition
110-
151+
The streaming VAD is implemented with signal processing, unlike the non-streaming VAD, which uses a model.
152+
```bash
153+
usage: ./bin/sense-voice-stream [options]
154+
155+
options:
156+
-t N, --threads N [4 ] [SenseVoice] Number of decoding threads
157+
--chunk_size [100 ] VAD chunk size (in ms)
158+
-mmc --min-mute-chunks [10 ] Minimum number of chunks in a mute segment
159+
-mnc --max-nomute-chunks [80 ] Maximum number of non-mute chunks
160+
--use-vad [false ] Whether to use VAD
161+
--use-prefix [false ] Whether to output SenseVoice extra info (language, emotion, event, whether ITN was applied)
162+
-c ID, --capture ID [-1 ] [Device] capture device ID
163+
-l LANG, --language LANG [auto ] [SenseVoice] Language code ('auto' for detection), supports [`zh`, `en`, `yue`, `ja`, `ko`] (Chinese, English, Cantonese, Japanese, Korean)
164+
-m FNAME, --model FNAME [models/sense-voice-small-q4_k.gguf] [SenseVoice] Path to model
165+
-ng, --no-gpu [false ] Disable GPU
166+
-fa, --flash-attn [false ] Enable flash attention decoding
167+
-itn, --use-itn [false ] Use inverse text normalization (includes punctuation)
168+
169+
```
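The README does not describe the streaming VAD algorithm beyond saying it is based on signal processing. As an illustration only, the sketch below shows one common signal-processing approach, an energy threshold over fixed-size chunks. Everything in it is an assumption: the threshold value, the function names, and the reading of `--min-mute-chunks` (consecutive quiet chunks that end a segment) and `--max-nomute-chunks` (chunk count at which a segment is cut) are not taken from the SenseVoice.cpp source.

```python
from typing import List, Sequence, Tuple


def chunk_energies(samples: Sequence[float], sample_rate: int = 16000,
                   chunk_ms: int = 100) -> List[float]:
    """Mean-square energy of each chunk (100 ms = 1600 samples at 16 kHz)."""
    size = sample_rate * chunk_ms // 1000
    energies = []
    for i in range(0, len(samples), size):
        chunk = samples[i:i + size]
        energies.append(sum(s * s for s in chunk) / len(chunk))
    return energies


def segment_chunks(energies: Sequence[float], threshold: float = 0.01,
                   min_mute_chunks: int = 10,
                   max_nomute_chunks: int = 80) -> List[Tuple[int, int]]:
    """Return half-open (start_chunk, end_chunk) spans judged to be speech."""
    segments = []
    start = None   # first chunk of the segment being built
    mute_run = 0   # consecutive quiet chunks inside that segment
    for i, energy in enumerate(energies):
        voiced = energy >= threshold  # assumed, hypothetical threshold
        if start is None:
            if voiced:
                start, mute_run = i, 0
            continue
        mute_run = 0 if voiced else mute_run + 1
        if mute_run >= min_mute_chunks:
            # Enough silence: close the segment at the first quiet chunk.
            segments.append((start, i - mute_run + 1))
            start = None
        elif i - start + 1 >= max_nomute_chunks:
            # Segment too long: cut it here.
            segments.append((start, i + 1))
            start = None
    if start is not None:
        segments.append((start, len(energies) - mute_run))
    return segments
```

With the default values shown in the usage text (100 ms chunks, 10 and 80 chunks), this sketch would end a segment after about 1 s of silence and cap segments at about 8 s.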
111170

112171
```bash
113172
sudo apt install libsdl2-dev
