Description
System Info
I want to run supervised fine-tuning (SFT) on Mistral-v0.3 with my own chat template.
Following this comment, I replaced some of the [control_n] tokens with the special tokens my chat template needs.
However, the tokens were appended as new vocabulary entries instead, so the vocabulary size increased.
Is there any way to replace existing vocabulary entries in place?
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
tokenizer.json
{
"version": "1.0",
"truncation": null,
"padding": null,
"added_tokens": [
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
{
"id": 10,
"content": "<|system|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 11,
"content": "<|user|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 12,
"content": "<|assistant|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 13,
"content": "<|eot|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tokenizer_config.json
{
"add_bos_token": true,
"add_eos_token": false,
"add_prefix_space": true,
"added_tokens_decoder": {
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"10": {
"content": "<|system|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"11": {
"content": "<|user|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"12": {
"content": "<|assistant|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"13": {
"content": "<|eot|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
}
test code
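from pprint import pprint
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_dir)  # model_dir: directory with the edited tokenizer files
pprint(tokenizer.added_tokens_decoder)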
output
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
768: AddedToken("[control_766]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
769: AddedToken("[control_767]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
770: AddedToken("[control_768]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
32768: AddedToken("<|system|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
32769: AddedToken("<|user|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
32770: AddedToken("<|assistant|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
32771: AddedToken("<|eot|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)}
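Mistral-v0.3's original vocabulary has 32768 entries (IDs 0 to 32767), so the four tokens were appended right after it instead of taking IDs 10 to 13. One possible cause, which is an assumption and not something confirmed here, is that the `model.vocab` section of tokenizer.json still maps those IDs to the old [control_n] strings. Below is a minimal sketch of a check for that, run on a tiny stand-in dict rather than the real file. `find_vocab_mismatches` is a hypothetical helper, and the name `[control_8]` for ID 10 is inferred from the offset visible in the output (768 is `[control_766]`).

```python
def find_vocab_mismatches(tok_json):
    """List added tokens whose (content, id) pair disagrees with model.vocab."""
    vocab = tok_json["model"]["vocab"]
    mismatches = []
    for entry in tok_json["added_tokens"]:
        vocab_id = vocab.get(entry["content"])  # None if the string is not in the vocab
        if vocab_id != entry["id"]:
            mismatches.append((entry["id"], entry["content"], vocab_id))
    return mismatches


# Stand-in for the edited tokenizer.json (assumed layout: "added_tokens" + "model.vocab").
example = {
    "added_tokens": [
        {"id": 0, "content": "<unk>"},
        {"id": 10, "content": "<|system|>"},  # renamed by hand
    ],
    "model": {"vocab": {"<unk>": 0, "[control_8]": 10}},  # old name still present
}

print(find_vocab_mismatches(example))
# -> [(10, '<|system|>', None)]
```

An entry reported with `None` means the renamed string is missing from `model.vocab`, which would be consistent with it being treated as a brand-new token.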
Expected behavior
It should be possible to replace the [control_n] tokens with arbitrary tokens, keeping their original IDs and leaving the vocabulary size unchanged.
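A possible workaround, sketched under the assumption that the loader resolves each added token by looking its string up in the `model.vocab` section of tokenizer.json: rename the token in both `added_tokens` and `model.vocab` so the two sections agree. `rename_tokens` is a hypothetical helper, the dict is a tiny stand-in for the real file, and the old name `[control_8]` for ID 10 is an inferred example. I have not verified this against the real Mistral-v0.3 files.

```python
import copy


def rename_tokens(tok_json, mapping):
    """Return a copy of tok_json with tokens renamed in added_tokens and model.vocab."""
    out = copy.deepcopy(tok_json)
    vocab = out["model"]["vocab"]
    # Refuse to rename onto a string that already has its own ID.
    for new in mapping.values():
        if new in vocab:
            raise ValueError(f"{new!r} already exists in the vocabulary")
    for entry in out["added_tokens"]:
        entry["content"] = mapping.get(entry["content"], entry["content"])
    # Rebuild the vocab under the new names; every ID is kept as it was.
    out["model"]["vocab"] = {mapping.get(tok, tok): idx for tok, idx in vocab.items()}
    return out


example = {
    "added_tokens": [
        {"id": 0, "content": "<unk>"},
        {"id": 10, "content": "<|system|>"},  # already renamed by hand
    ],
    "model": {"vocab": {"<unk>": 0, "[control_8]": 10}},
}

fixed = rename_tokens(example, {"[control_8]": "<|system|>"})
print(fixed["model"]["vocab"])
# -> {'<unk>': 0, '<|system|>': 10}
```

With the real file, the same function could be applied to the result of `json.load` and written back with `json.dump`; the names in `added_tokens_decoder` in tokenizer_config.json would need to match as well.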