diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index ea62fd545882..b105f9414bb4 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -329,11 +329,8 @@ By default, slow tests are skipped but you can set the `RUN_SLOW` environment va `yes` to run them. This will download many gigabytes of models so make sure you have enough disk space, a good internet connection or a lot of patience! - - -Remember to specify a *path to a subfolder or a test file* to run the test. Otherwise, you'll run all the tests in the `tests` or `examples` folder, which will take a very long time! - - +> [!WARNING] +> Remember to specify a *path to a subfolder or a test file* to run the test. Otherwise, you'll run all the tests in the `tests` or `examples` folder, which will take a very long time! ```bash RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model diff --git a/docs/source/ar/autoclass_tutorial.md b/docs/source/ar/autoclass_tutorial.md index 9c7709e2d172..7fc7c306cb2a 100644 --- a/docs/source/ar/autoclass_tutorial.md +++ b/docs/source/ar/autoclass_tutorial.md @@ -131,13 +131,10 @@ >>> model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` - - -بالنسبة لنماذج PyTorch، تستخدم طريقة `from_pretrained()` `torch.load()` التي تستخدم داخليًا `pickle` والتي يُعرف أنها غير آمنة. بشكل عام، لا تقم مطلقًا بتحميل نموذج قد يكون مصدره مصدرًا غير موثوق به، أو قد يكون تم العبث به. يتم تخفيف هذا الخطر الأمني جزئيًا للنماذج العامة المستضافة على Hub Hugging Face، والتي يتم [فحصها بحثًا عن البرامج الضارة](https://huggingface.co/docs/hub/security-malware) في كل ارتكاب. راجع [توثيق Hub](https://huggingface.co/docs/hub/security) للحصول على أفضل الممارسات مثل [التحقق من التوقيع](https://huggingface.co/docs/hub/security-gpg#signing-commits-with-gpg) باستخدام GPG. - -لا تتأثر نقاط تفتيش TensorFlow و Flax، ويمكن تحميلها داخل بنيات PyTorch باستخدام `from_tf` و `from_flax` kwargs لطريقة `from_pretrained` للتحايل على هذه المشكلة. - - +> [!WARNING] +> بالنسبة لنماذج PyTorch، تستخدم طريقة `from_pretrained()` `torch.load()` التي تستخدم داخليًا `pickle` والتي يُعرف أنها غير آمنة. بشكل عام، لا تقم مطلقًا بتحميل نموذج قد يكون مصدره مصدرًا غير موثوق به، أو قد يكون تم العبث به. يتم تخفيف هذا الخطر الأمني جزئيًا للنماذج العامة المستضافة على Hub Hugging Face، والتي يتم [فحصها بحثًا عن البرامج الضارة](https://huggingface.co/docs/hub/security-malware) في كل ارتكاب. راجع [توثيق Hub](https://huggingface.co/docs/hub/security) للحصول على أفضل الممارسات مثل [التحقق من التوقيع](https://huggingface.co/docs/hub/security-gpg#signing-commits-with-gpg) باستخدام GPG. +> +> لا تتأثر نقاط تفتيش TensorFlow و Flax، ويمكن تحميلها داخل بنيات PyTorch باستخدام `from_tf` و `from_flax` kwargs لطريقة `from_pretrained` للتحايل على هذه المشكلة. بشكل عام، نوصي باستخدام فئة `AutoTokenizer` وفئة `AutoModelFor` لتحميل مثيلات مُدربة مسبقًا من النماذج. سيساعدك هذا في تحميل البنية الصحيحة في كل مرة. في البرنامج التعليمي التالي، تعرف على كيفية استخدام المحلل اللغوي ومعالج الصور ومستخرج الميزات والمعالج الذي تم تحميله حديثًا لمعالجة مجموعة بيانات للضبط الدقيق. diff --git a/docs/source/ar/chat_templating.md b/docs/source/ar/chat_templating.md index 0e05e1fedcbc..3f9879ed0e4e 100644 --- a/docs/source/ar/chat_templating.md +++ b/docs/source/ar/chat_templating.md @@ -230,11 +230,10 @@ The sun. من هنا، استمر في التدريب كما تفعل مع مهمة نمذجة اللغة القياسية، باستخدام عمود `formatted_chat`. - -بشكل افتراضي ، تضيف بعض *tokenizers* رموزًا خاصة مثل `` و `` إلى النص الذي تقوم بتقسيمه إلى رموز. 
يجب أن تتضمن قوالب المحادثة بالفعل جميع الرموز الخاصة التي تحتاجها ، وبالتالي فإن الرموز الخاصة الإضافية ستكون غالبًا غير صحيحة أو مُكررة ، مما سيؤثر سلبًا على أداء النموذج . - -لذلك ، إذا قمت بتنسيق النص باستخدام `apply_chat_template(tokenize=False)` ، فيجب تعيين المعامل `add_special_tokens=False` عندما تقوم بتقسيم ذلك النص إلى رموز لاحقًا . إذا كنت تستخدم `apply_chat_template(tokenize=True)` ، فلن تحتاج إلى القلق بشأن ذلك ! - +> [!TIP] +> بشكل افتراضي ، تضيف بعض *tokenizers* رموزًا خاصة مثل `` و `` إلى النص الذي تقوم بتقسيمه إلى رموز. يجب أن تتضمن قوالب المحادثة بالفعل جميع الرموز الخاصة التي تحتاجها ، وبالتالي فإن الرموز الخاصة الإضافية ستكون غالبًا غير صحيحة أو مُكررة ، مما سيؤثر سلبًا على أداء النموذج . +> +> لذلك ، إذا قمت بتنسيق النص باستخدام `apply_chat_template(tokenize=False)` ، فيجب تعيين المعامل `add_special_tokens=False` عندما تقوم بتقسيم ذلك النص إلى رموز لاحقًا . إذا كنت تستخدم `apply_chat_template(tokenize=True)` ، فلن تحتاج إلى القلق بشأن ذلك ! ## متقدّم: مدخلات إضافية لِقوالب الدردشة @@ -361,9 +360,8 @@ print(tokenizer.decode(out[0][len(inputs["input_ids"][0]):])) The current temperature in Paris, France is 22.0 ° Celsius.<|im_end|> ``` - -لا تستخدم جميع نماذج استخدام الأدوات جميع ميزات استدعاء الأدوات الموضحة أعلاه. يستخدم البعض معرفات استدعاء الأدوات، بينما يستخدم البعض الآخر ببساطة اسم الدالة ويقارن استدعاءات الأدوات بالنتائج باستخدام الترتيب، وهناك عدة نماذج لا تستخدم أيًا منهما ولا تصدر سوى استدعاء أداة واحد في كل مرة لتجنب الارتباك. إذا كنت تريد أن يكون رمزك متوافقًا مع أكبر عدد ممكن من النماذج، فإننا نوصي بهيكلة استدعاءات الأدوات الخاصة بك كما هو موضح هنا، وإعادة نتائج الأدوات بالترتيب الذي أصدرها النموذج. يجب أن تتعامل قوالب الدردشة على كل نموذج مع الباقي. - +> [!TIP] +> لا تستخدم جميع نماذج استخدام الأدوات جميع ميزات استدعاء الأدوات الموضحة أعلاه. يستخدم البعض معرفات استدعاء الأدوات، بينما يستخدم البعض الآخر ببساطة اسم الدالة ويقارن استدعاءات الأدوات بالنتائج باستخدام الترتيب، وهناك عدة نماذج لا تستخدم أيًا منهما ولا تصدر سوى استدعاء أداة واحد في كل مرة لتجنب الارتباك. إذا كنت تريد أن يكون رمزك متوافقًا مع أكبر عدد ممكن من النماذج، فإننا نوصي بهيكلة استدعاءات الأدوات الخاصة بك كما هو موضح هنا، وإعادة نتائج الأدوات بالترتيب الذي أصدرها النموذج. يجب أن تتعامل قوالب الدردشة على كل نموذج مع الباقي. ### فهم مخططات الأدوات @@ -514,9 +512,8 @@ print(gen_text) إن مُدخل documents للتوليد القائم على الاسترجاع غير مدعوم على نطاق واسع، والعديد من النماذج لديها قوالب دردشة تتجاهل هذا المُدخل ببساطة. للتحقق مما إذا كان النموذج يدعم مُدخل `documents`، يمكنك قراءة بطاقة النموذج الخاصة به، أو `print(tokenizer.chat_template)` لمعرفة ما إذا كان مفتاح `documents` مستخدمًا في أي مكان. - -ومع ذلك، فإن أحد فئات النماذج التي تدعمه هي [Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-08-2024) و [Command-R+](https://huggingface.co/CohereForAI/c4ai-command-r-pluse-08-2024) من Cohere، من خلال قالب الدردشة rag الخاص بهم. يمكنك رؤية أمثلة إضافية على التوليد باستخدام هذه الميزة في بطاقات النموذج الخاصة بهم. - +> [!TIP] +> ومع ذلك، فإن أحد فئات النماذج التي تدعمه هي [Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-08-2024) و [Command-R+](https://huggingface.co/CohereForAI/c4ai-command-r-pluse-08-2024) من Cohere، من خلال قالب الدردشة rag الخاص بهم. يمكنك رؤية أمثلة إضافية على التوليد باستخدام هذه الميزة في بطاقات النموذج الخاصة بهم. ## متقدم: كيف تعمل قوالب الدردشة؟ يتم تخزين قالب الدردشة للنموذج في الخاصية `tokenizer.chat_template`. إذا لم يتم تعيين قالب دردشة، فسيتم استخدام القالب الافتراضي لفئة النموذج هذه بدلاً من ذلك. 
دعونا نلقي نظرة على قالب دردشة `Zephyr`، ولكن لاحظ أن هذا القالب مُبسّط قليلاً عن القالب الفعلي! @@ -587,9 +584,8 @@ tokenizer.push_to_hub("model_name") # تحميل القالب الجديد إل يتم استدعاء الدالة [`~PreTrainedTokenizer.apply_chat_template`] الذي نستخدم قالب الدردشة الخاص بك بواسطة فئة [`TextGenerationPipeline`] لذلك بمجرد تعيين قالب الدردشة الصحيح، سيصبح نموذجك متوافقًا تلقائيًا مع [`TextGenerationPipeline`]. - -إذا كنت تُجري ضبطًا دقيقًا لنموذج للدردشة، بالإضافة إلى تعيين قالب دردشة، فربما يجب عليك إضافة أي رموز تحكم دردشة جديدة كرموز خاصة في المجزىء اللغوي. لا يتم تقسيم الرموز الخاصة أبدًا، مما يضمن معالجة رموز التحكم الخاصة بك دائمًا كرموز فردية بدلاً من تجزئتها إلى أجزاء. يجب عليك أيضًا تعيين خاصية `eos_token` للمجزىء اللغوي إلى الرمز الذي يُشير إلى نهاية توليدات المساعد في قالبك. سيضمن هذا أن أدوات توليد النصوص يمكنها تحديد وقت إيقاف توليد النص بشكل صحيح. - +> [!TIP] +> إذا كنت تُجري ضبطًا دقيقًا لنموذج للدردشة، بالإضافة إلى تعيين قالب دردشة، فربما يجب عليك إضافة أي رموز تحكم دردشة جديدة كرموز خاصة في المجزىء اللغوي. لا يتم تقسيم الرموز الخاصة أبدًا، مما يضمن معالجة رموز التحكم الخاصة بك دائمًا كرموز فردية بدلاً من تجزئتها إلى أجزاء. يجب عليك أيضًا تعيين خاصية `eos_token` للمجزىء اللغوي إلى الرمز الذي يُشير إلى نهاية توليدات المساعد في قالبك. سيضمن هذا أن أدوات توليد النصوص يمكنها تحديد وقت إيقاف توليد النص بشكل صحيح. ### لماذا تحتوي بعض النماذج على قوالب متعددة؟ تستخدم بعض النماذج قوالب مختلفة لحالات استخدام مختلفة. على سبيل المثال، قد تستخدم قالبًا واحدًا للدردشة العادية وآخر لاستخدام الأدوات، أو التوليد القائم على الاسترجاع. في هذه الحالات، تكون `tokenizer.chat_template` قاموسًا. يمكن أن يتسبب هذا في بعض الارتباك، وحيثما أمكن، نوصي باستخدام قالب واحد لجميع حالات الاستخدام. يمكنك استخدام عبارات Jinja مثل `if tools is defined` وتعريفات `{% macro %}` لتضمين مسارات تعليمات برمجية متعددة بسهولة في قالب واحد. @@ -640,10 +636,8 @@ I'm doing great!<|im_end|> ## متقدم: نصائح لكتابة القوالب - -أسهل طريقة للبدء في كتابة قوالب Jinja هي إلقاء نظرة على بعض القوالب الموجودة. يمكنك استخدام `print(tokenizer.chat_template)` لأي نموذج دردشة لمعرفة القالب الذي يستخدمه. بشكل عام، تحتوي النماذج التي تدعم استخدام الأدوات على قوالب أكثر تعقيدًا بكثير من النماذج الأخرى - لذلك عندما تبدأ للتو، فمن المحتمل أنها مثال سيئ للتعلم منه! يمكنك أيضًا إلقاء نظرة على [وثائق Jinja](https://jinja.palletsprojects.com/en/3.1.x/templates/#synopsis) للحصول على تفاصيل حول تنسيق Jinja العام وتركيبه. - - +> [!TIP] +> أسهل طريقة للبدء في كتابة قوالب Jinja هي إلقاء نظرة على بعض القوالب الموجودة. يمكنك استخدام `print(tokenizer.chat_template)` لأي نموذج دردشة لمعرفة القالب الذي يستخدمه. بشكل عام، تحتوي النماذج التي تدعم استخدام الأدوات على قوالب أكثر تعقيدًا بكثير من النماذج الأخرى - لذلك عندما تبدأ للتو، فمن المحتمل أنها مثال سيئ للتعلم منه! يمكنك أيضًا إلقاء نظرة على [وثائق Jinja](https://jinja.palletsprojects.com/en/3.1.x/templates/#synopsis) للحصول على تفاصيل حول تنسيق Jinja العام وتركيبه. تُطابق قوالب Jinja في `transformers` قوالب Jinja في أي مكان آخر. الشيء الرئيسي الذي يجب معرفته هو أن سجل الدردشة سيكون متاحًا داخل قالبك كمتغير يسمى `messages`. ستتمكن من الوصول إلى `messages` في قالبك تمامًا كما يمكنك في Python، مما يعني أنه يمكنك التكرار خلاله باستخدام `{% for message in messages %}` أو الوصول إلى رسائل فردية باستخدام `{{ messages[0] }}`، على سبيل المثال. @@ -680,11 +674,8 @@ I'm doing great!<|im_end|> - **الرموز الخاصة** مثل `bos_token` و `eos_token`. يتم استخراجها من `tokenizer.special_tokens_map`. ستختلف الرموز الدقيقة المتاحة داخل كل قالب اعتمادًا على المجزىء اللغوي الأصلي. 
- - -يمكنك في الواقع تمرير أي `kwarg` إلى `apply_chat_template`، وستكون متاحة داخل القالب كمتغير. بشكل عام، نوصي بمحاولة الالتزام بالمتغيرات الأساسية المذكورة أعلاه، لأن ذلك سيجعل نموذجك أكثر صعوبة في الاستخدام إذا كان على المستخدمين كتابة تعليمات برمجية مخصصة لتمرير `kwargs` خاصة بالنموذج. ومع ذلك، فنحن نُدرك أن هذا المجال يتحرك بسرعة، لذلك إذا كانت لديك حالة استخدام جديدة لا تتناسب مع واجهة برمجة التطبيقات الأساسية، فلا تتردد في استخدام `kwarg` معامل جديد لها! إذا أصبح `kwarg` المعامل الجديد شائعًا، فقد نقوم بترقيته إلى واجهة برمجة التطبيقات الأساسية وإنشاء وتوثيق الخاص به. - - +> [!TIP] +> يمكنك في الواقع تمرير أي `kwarg` إلى `apply_chat_template`، وستكون متاحة داخل القالب كمتغير. بشكل عام، نوصي بمحاولة الالتزام بالمتغيرات الأساسية المذكورة أعلاه، لأن ذلك سيجعل نموذجك أكثر صعوبة في الاستخدام إذا كان على المستخدمين كتابة تعليمات برمجية مخصصة لتمرير `kwargs` خاصة بالنموذج. ومع ذلك، فنحن نُدرك أن هذا المجال يتحرك بسرعة، لذلك إذا كانت لديك حالة استخدام جديدة لا تتناسب مع واجهة برمجة التطبيقات الأساسية، فلا تتردد في استخدام `kwarg` معامل جديد لها! إذا أصبح `kwarg` المعامل الجديد شائعًا، فقد نقوم بترقيته إلى واجهة برمجة التطبيقات الأساسية وإنشاء وتوثيق الخاص به. ### دوال قابلة للاستدعاء diff --git a/docs/source/ar/conversations.md b/docs/source/ar/conversations.md index c3e320375dcd..6fff180031fb 100644 --- a/docs/source/ar/conversations.md +++ b/docs/source/ar/conversations.md @@ -188,11 +188,8 @@ pipe = pipeline("text-generation", "meta-llama/Meta-Llama-3-8B-Instruct", device ### اعتبارات الأداء - - -للحصول على دليل أكثر شمولاً حول أداء نموذج اللغة والتحسين، راجع [تحسين استدلال LLM](./llm_optims). - - +> [!TIP] +> للحصول على دليل أكثر شمولاً حول أداء نموذج اللغة والتحسين، راجع [تحسين استدلال LLM](./llm_optims). كقاعدة عامة، ستكون نماذج المحادثة الأكبر حجمًا أبطأ في توليد النصوص بالإضافة إلى احتياجها لذاكرة أكبرة. من الممكن أن تكون أكثر تحديدًا بشأن هذا: إن توليد النص من نموذج دردشة أمر غير عادي في أنه يخضع لقيود **سعة الذاكرة** بدلاً من قوة الحوسبة، لأن كل معلمة نشطة يجب قراءتها من الذاكرة لكل رمز ينشئه النموذج. وهذا يعني أن عدد الرموز في الثانية التي يمكنك توليدها من نموذج الدردشة يتناسب بشكل عام مع إجمالي حجم الذاكرة التي بوجد بها ا، مقسومًا على حجم النموذج. diff --git a/docs/source/ar/create_a_model.md b/docs/source/ar/create_a_model.md index a2b49696f04b..79ff487e3496 100644 --- a/docs/source/ar/create_a_model.md +++ b/docs/source/ar/create_a_model.md @@ -72,9 +72,8 @@ DistilBertConfig { >>> my_config = DistilBertConfig.from_pretrained("./your_model_save_path/config.json") ``` - -يمكنك أيضًا حفظ ملف التكوين كقاموس أو حتى كفرق بين خصائص التكوين المُعدّلة والخصائص التكوين الافتراضية! راجع وثائق [التكوين](main_classes/configuration) لمزيد من التفاصيل. - +> [!TIP] +> يمكنك أيضًا حفظ ملف التكوين كقاموس أو حتى كفرق بين خصائص التكوين المُعدّلة والخصائص التكوين الافتراضية! راجع وثائق [التكوين](main_classes/configuration) لمزيد من التفاصيل. ## النموذج @@ -133,11 +132,8 @@ DistilBertConfig { يدعم كلا النوعين من المجزئات طرقًا شائعة مثل الترميز وفك الترميز، وإضافة رموز جديدة، وإدارة الرموز الخاصة. - - -لا يدعم كل نموذج مجزئ النصوص سريع. الق نظرة على هذا [جدول](index#supported-frameworks) للتحقق مما إذا كان النموذج يحتوي على دعم مجزئ النصوص سريع. - - +> [!WARNING] +> لا يدعم كل نموذج مجزئ النصوص سريع. الق نظرة على هذا [جدول](index#supported-frameworks) للتحقق مما إذا كان النموذج يحتوي على دعم مجزئ النصوص سريع. 
إذا دربت مجزئ النصوص خاص بك، فيمكنك إنشاء واحد من *قاموسك*:``` @@ -163,9 +159,8 @@ DistilBertConfig { >>> fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert/distilbert-base-uncased") ``` - -افتراضيًا، سيحاول [`AutoTokenizer`] تحميل مجزئ نصوص سريع. يمكنك تعطيل هذا السلوك عن طريق تعيين `use_fast=False` في `from_pretrained`. - +> [!TIP] +> افتراضيًا، سيحاول [`AutoTokenizer`] تحميل مجزئ نصوص سريع. يمكنك تعطيل هذا السلوك عن طريق تعيين `use_fast=False` في `from_pretrained`. ## معالج الصور @@ -197,11 +192,8 @@ ViTImageProcessor { } ``` - - -إذا كنت لا تبحث عن أي تخصيص، فما عليك سوى استخدام طريقة `from_pretrained` لتحميل معلمات معالج الصور الافتراضية للنموذج. - - +> [!TIP] +> إذا كنت لا تبحث عن أي تخصيص، فما عليك سوى استخدام طريقة `from_pretrained` لتحميل معلمات معالج الصور الافتراضية للنموذج. عدل أيًا من معلمات [`ViTImageProcessor`] لإنشاء معالج الصور المخصص الخاص بك: @@ -334,9 +326,8 @@ Wav2Vec2FeatureExtractor { } ``` - -إذا لم تكن بحاجة لأي تخصيص، فاستخدم فقط طريقة `from_pretrained` لتحميل معلمات مستخرج الميزات الافتراضية للنموذج. - +> [!TIP] +> إذا لم تكن بحاجة لأي تخصيص، فاستخدم فقط طريقة `from_pretrained` لتحميل معلمات مستخرج الميزات الافتراضية للنموذج. قم بتعديل أي من معلمات [`Wav2Vec2FeatureExtractor`] لإنشاء مستخرج ميزات مخصص: diff --git a/docs/source/ar/custom_models.md b/docs/source/ar/custom_models.md index d46df9cb7298..be5480808b79 100644 --- a/docs/source/ar/custom_models.md +++ b/docs/source/ar/custom_models.md @@ -11,11 +11,8 @@ لنبدأ بكتابة إعدادات النموذج. إعدادات النموذج هو كائنٌ يحتوي على جميع المعلومات اللازمة لبنائه. كما سنرى لاحقًا، يتطلب النموذج كائن `config` لتهيئته، لذا يجب أن يكون هذا الكائن كاملاً. - - -تتبع النماذج في مكتبة `transformers` اتفاقية قبول كائن `config` في دالة `__init__` الخاصة بها، ثم تمرر كائن `config` بالكامل إلى الطبقات الفرعية في النموذج، بدلاً من تقسيمه إلى معامﻻت متعددة. يؤدي كتابة نموذجك بهذا الأسلوب إلى كود أبسط مع "مصدر حقيقة" واضح لأي فرط معلمات، كما يسهل إعادة استخدام الكود من نماذج أخرى في `transformers`. - - +> [!TIP] +> تتبع النماذج في مكتبة `transformers` اتفاقية قبول كائن `config` في دالة `__init__` الخاصة بها، ثم تمرر كائن `config` بالكامل إلى الطبقات الفرعية في النموذج، بدلاً من تقسيمه إلى معامﻻت متعددة. يؤدي كتابة نموذجك بهذا الأسلوب إلى كود أبسط مع "مصدر حقيقة" واضح لأي فرط معلمات، كما يسهل إعادة استخدام الكود من نماذج أخرى في `transformers`. في مثالنا، سنعدّل بعض الوسائط في فئة ResNet التي قد نرغب في ضبطها. ستعطينا التكوينات المختلفة أنواع ResNets المختلفة الممكنة. سنقوم بتخزين هذه الوسائط بعد التحقق من صحته. @@ -154,11 +151,8 @@ class ResnetModelForImageClassification(PreTrainedModel): ``` في كلتا الحالتين، لاحظ كيف نرث من `PreTrainedModel` ونستدعي مُهيئ الفئة الرئيسية باستخدام `config` (كما تفعل عند إنشاء وحدة `torch.nn.Module` عادية). ليس من الضروري تعريف `config_class` إلا إذا كنت ترغب في تسجيل نموذجك مع الفئات التلقائية (راجع القسم الأخير). - - -إذا كان نموذجك مشابهًا جدًا لنموذج داخل المكتبة، فيمكنك إعادة استخدام نفس التكوين مثل هذا النموذج. - - +> [!TIP] +> إذا كان نموذجك مشابهًا جدًا لنموذج داخل المكتبة، فيمكنك إعادة استخدام نفس التكوين مثل هذا النموذج. يمكن لنموذجك أن يعيد أي شيء تريده، ولكن إعادة قاموس مثلما فعلنا لـ `ResnetModelForImageClassification`، مع تضمين الخسارة عند تمرير العلامات، سيجعل نموذجك قابلًا للاستخدام مباشرة داخل فئة [`Trainer`]. يعد استخدام تنسيق إخراج آخر أمرًا جيدًا طالما أنك تخطط لاستخدام حلقة تدريب خاصة بك أو مكتبة أخرى للتدريب. 
@@ -204,11 +198,8 @@ AutoModelForImageClassification.register(ResnetConfig, ResnetModelForImageClassi ## إرسال الكود إلى Hub - - -هذا API تجريبي وقد يكون له بعض التغييرات الطفيفة في الإصدارات القادمة. - - +> [!WARNING] +> هذا API تجريبي وقد يكون له بعض التغييرات الطفيفة في الإصدارات القادمة. أولاً، تأكد من تعريف نموذجك بالكامل في ملف `.py`. يمكن أن يعتمد على الاستيراد النسبي لملفات أخرى طالما أن جميع الملفات موجودة في نفس الدليل (لا ندعم الوحدات الفرعية لهذه الميزة حتى الآن). في مثالنا، سنحدد ملف `modeling_resnet.py` وملف `configuration_resnet.py` في مجلد باسم "resnet_model" في دليل العمل الحالي. يحتوي ملف التكوين على كود لـ `ResnetConfig` ويحتوي ملف النمذجة على كود لـ `ResnetModel` و`ResnetModelForImageClassification`. @@ -222,12 +213,9 @@ AutoModelForImageClassification.register(ResnetConfig, ResnetModelForImageClassi يمكن أن يكون ملف `__init__.py` فارغًا، فهو موجود فقط حتى يتمكن Python من اكتشاف أن `resnet_model` يمكن استخدامه كموديل. - - -إذا كنت تقوم بنسخ ملفات النمذجة من المكتبة، فسوف تحتاج إلى استبدال جميع الواردات النسبية في أعلى الملف -لاستيرادها من حزمة `transformers`. - - +> [!WARNING] +> إذا كنت تقوم بنسخ ملفات النمذجة من المكتبة، فسوف تحتاج إلى استبدال جميع الواردات النسبية في أعلى الملف +> لاستيرادها من حزمة `transformers`. لاحظ أنه يمكنك إعادة استخدام (أو توسيع) تكوين/نموذج موجود. @@ -251,21 +239,18 @@ ResnetModelForImageClassification.register_for_auto_class("AutoModelForImageClas [`AutoConfig`]) ولكن الأمر يختلف بالنسبة للنماذج. قد يكون نموذجك المخصص مناسبًا للعديد من المهام المختلفة، لذلك يجب تحديد أي من الفئات التلقائية هو الصحيح لنموذجك. - - -استخدم `register_for_auto_class()` إذا كنت تريد نسخ ملفات الكود. إذا كنت تفضل استخدام الكود على Hub من مستودع آخر، -فلا تحتاج إلى استدعائه. في الحالات التي يوجد فيها أكثر من فئة تلقائية واحدة، يمكنك تعديل ملف `config.json` مباشرة باستخدام -الهيكل التالي: - -```json -"auto_map": { - "AutoConfig": "--", - "AutoModel": "--", - "AutoModelFor": "--", -}, -``` - - +> [!TIP] +> استخدم `register_for_auto_class()` إذا كنت تريد نسخ ملفات الكود. إذا كنت تفضل استخدام الكود على Hub من مستودع آخر، +> فلا تحتاج إلى استدعائه. في الحالات التي يوجد فيها أكثر من فئة تلقائية واحدة، يمكنك تعديل ملف `config.json` مباشرة باستخدام +> الهيكل التالي: +> +> ```json +> "auto_map": { +> "AutoConfig": "--", +> "AutoModel": "--", +> "AutoModelFor": "--", +> }, +> ``` بعد ذلك، دعنا نقوم بإنشاء التكوين والنماذج كما فعلنا من قبل: diff --git a/docs/source/ar/glossary.md b/docs/source/ar/glossary.md index b1c59a68c399..8b6c79705a96 100644 --- a/docs/source/ar/glossary.md +++ b/docs/source/ar/glossary.md @@ -245,11 +245,8 @@ نماذج اكتشاف الأجسام: ([DetrForObjectDetection]) يتوقع النموذج قائمة من القواميس تحتوي على مفتاح class_labels و boxes حيث تتوافق كل قيمة من المجموعة مع الملصق المتوقع وعدد المربعات المحيطة بكل صورة فردية. نماذج التعرف التلقائي على الكلام: ([Wav2Vec2ForCTC]) يتوقع النموذج مصفوفة ذات بعد (batch_size, target_length) حيث تتوافق كل قيمة مع الملصق المتوقع لكل رمز فردي. - - -قد تختلف تسميات كل نموذج، لذا تأكد دائمًا من مراجعة وثائق كل نموذج للحصول على معلومات حول التسميات الخاصة به. - - +> [!TIP] +> قد تختلف تسميات كل نموذج، لذا تأكد دائمًا من مراجعة وثائق كل نموذج للحصول على معلومات حول التسميات الخاصة به. لا تقبل النماذج الأساسية ([`BertModel`]) الملصقات ، لأنها نماذج المحول الأساسية، والتي تقوم ببساطة بإخراج الميزات. 
### نماذج اللغة الكبيرة large language models (LLM) diff --git a/docs/source/ar/installation.md b/docs/source/ar/installation.md index d3bd4c655b60..4703f188dce0 100644 --- a/docs/source/ar/installation.md +++ b/docs/source/ar/installation.md @@ -48,17 +48,14 @@ pip install 'transformers[torch]' pip install 'transformers[tf-cpu]' ``` - - -لمستخدمي M1 / ARM - -ستحتاج إلى تثبيت ما يلي قبل تثبيت TensorFLow 2.0 -```bash -brew install cmake -brew install pkg-config -``` - - +> [!WARNING] +> لمستخدمي M1 / ARM +> +> ستحتاج إلى تثبيت ما يلي قبل تثبيت TensorFLow 2.0 +> ```bash +> brew install cmake +> brew install pkg-config +> ``` 🤗 Transformers وFlax: @@ -117,11 +114,8 @@ pip install -e . ستقوم هذه الأوامر بربط المجلد الذي قمت باستنساخ المستودع فيه بمسارات مكتبة Python. بمعنى آخر، سيبحث Python داخل المجلد الذي قمت باستنساخه بالإضافة إلى المسارات المعتادة للمكتبات. على سبيل المثال، إذا تم تثبيت حزم Python الخاصة بك عادةً في `~/anaconda3/envs/main/lib/python3.7/site-packages/`, فسيقوم Python أيضًا بالبحث في المجلد الذي قمت باستنساخه: `~/transformers/`. - - -يجب عليك الاحتفاظ بمجلد `transformers` إذا كنت تريد الاستمرار في استخدام المكتبة. - - +> [!WARNING] +> يجب عليك الاحتفاظ بمجلد `transformers` إذا كنت تريد الاستمرار في استخدام المكتبة. الآن يمكنك تحديث المستنسخ الخاص بك بسهولة إلى أحدث إصدار من 🤗 Transformers باستخدام الأمر التالي: @@ -148,21 +142,15 @@ conda install conda-forge::transformers 2. متغير البيئة: `HF_HOME`. 3. متغير البيئة: `XDG_CACHE_HOME` + `/huggingface`. - - -سيستخدم 🤗 Transformers متغيرات البيئة `PYTORCH_TRANSFORMERS_CACHE` أو `PYTORCH_PRETRAINED_BERT_CACHE` إذا كنت قادمًا من إصدار سابق من هذه المكتبة وقمت بتعيين متغيرات البيئة هذه، ما لم تحدد متغير البيئة `TRANSFORMERS_CACHE`. - - +> [!TIP] +> سيستخدم 🤗 Transformers متغيرات البيئة `PYTORCH_TRANSFORMERS_CACHE` أو `PYTORCH_PRETRAINED_BERT_CACHE` إذا كنت قادمًا من إصدار سابق من هذه المكتبة وقمت بتعيين متغيرات البيئة هذه، ما لم تحدد متغير البيئة `TRANSFORMERS_CACHE`. ## الوضع دون اتصال بالإنترنت قم بتشغيل 🤗 Transformers في بيئة محمية بجدار حماية أو غير متصلة باستخدام الملفات المخزنة مؤقتًا محليًا عن طريق تعيين متغير البيئة `HF_HUB_OFFLINE=1`. - - -أضف [🤗 Datasets](https://huggingface.co/docs/datasets/) إلى سير عمل التدريب غير المتصل باستخدام متغير البيئة `HF_DATASETS_OFFLINE=1`. - - +> [!TIP] +> أضف [🤗 Datasets](https://huggingface.co/docs/datasets/) إلى سير عمل التدريب غير المتصل باستخدام متغير البيئة `HF_DATASETS_OFFLINE=1`. ```bash HF_DATASETS_OFFLINE=1 HF_HUB_OFFLINE=1 \ @@ -239,8 +227,5 @@ model = T5Model.from_pretrained("./path/to/local/directory", local_files_only=Tr >>> config = AutoConfig.from_pretrained("./your/path/bigscience_t0/config.json") ``` - - -راجع قسم [كيفية تنزيل الملفات من Hub](https://huggingface.co/docs/hub/how-to-downstream) لمزيد من التفاصيل حول تنزيل الملفات المخزنة على Hub. - - +> [!TIP] +> راجع قسم [كيفية تنزيل الملفات من Hub](https://huggingface.co/docs/hub/how-to-downstream) لمزيد من التفاصيل حول تنزيل الملفات المخزنة على Hub. diff --git a/docs/source/ar/llm_tutorial.md b/docs/source/ar/llm_tutorial.md index 264797a982b9..096f89e2f73f 100644 --- a/docs/source/ar/llm_tutorial.md +++ b/docs/source/ar/llm_tutorial.md @@ -51,11 +51,8 @@ pip install transformers bitsandbytes>=0.39.0 -q دعنا نتحدث عن الكود! - - -إذا كنت مهتمًا بالاستخدام الأساسي لـ LLM، فإن واجهة [`Pipeline`](pipeline_tutorial) عالية المستوى هي نقطة انطلاق رائعة. ومع ذلك، غالبًا ما تتطلب LLMs ميزات متقدمة مثل التكميم والتحكم الدقيق في خطوة اختيار الرمز، والتي يتم تنفيذها بشكل أفضل من خلال [`~generation.GenerationMixin.generate`]. 
التوليد التلقائي باستخدام LLMs يستهلك الكثير من المواردد ويجب تنفيذه على وحدة معالجة الرسومات للحصول على أداء كافٍ. - - +> [!TIP] +> إذا كنت مهتمًا بالاستخدام الأساسي لـ LLM، فإن واجهة [`Pipeline`](pipeline_tutorial) عالية المستوى هي نقطة انطلاق رائعة. ومع ذلك، غالبًا ما تتطلب LLMs ميزات متقدمة مثل التكميم والتحكم الدقيق في خطوة اختيار الرمز، والتي يتم تنفيذها بشكل أفضل من خلال [`~generation.GenerationMixin.generate`]. التوليد التلقائي باستخدام LLMs يستهلك الكثير من المواردد ويجب تنفيذه على وحدة معالجة الرسومات للحصول على أداء كافٍ. أولاً، تحتاج إلى تحميل النموذج. diff --git a/docs/source/ar/llm_tutorial_optimization.md b/docs/source/ar/llm_tutorial_optimization.md index 400c17f735c5..50011533ffa5 100644 --- a/docs/source/ar/llm_tutorial_optimization.md +++ b/docs/source/ar/llm_tutorial_optimization.md @@ -673,11 +673,8 @@ length of key-value cache 24 > يجب *دائمًا* استخدام ذاكرة التخزين المؤقت للمفاتيح والقيم حيث يؤدي ذلك إلى نتائج متطابقة وزيادة كبيرة في السرعة لتسلسلات الإدخال الأطول. ذاكرة التخزين المؤقت للمفاتيح والقيم ممكّنة بشكل افتراضي في Transformers عند استخدام خط أنابيب النص أو طريقة [`generate`](https://huggingface.co/docs/transformers/main_classes/text_generation). - - -لاحظ أنه على الرغم من نصيحتنا باستخدام ذاكرة التخزين المؤقت للمفاتيح والقيم، فقد يكون إخراج نموذج اللغة الكبيرة مختلفًا قليلاً عند استخدامها. هذه خاصية نوى ضرب المصفوفة نفسها - يمكنك قراءة المزيد عنها [هنا](https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535). - - +> [!WARNING] +> لاحظ أنه على الرغم من نصيحتنا باستخدام ذاكرة التخزين المؤقت للمفاتيح والقيم، فقد يكون إخراج نموذج اللغة الكبيرة مختلفًا قليلاً عند استخدامها. هذه خاصية نوى ضرب المصفوفة نفسها - يمكنك قراءة المزيد عنها [هنا](https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535). #### 3.2.1 محادثة متعددة الجولات diff --git a/docs/source/ar/model_memory_anatomy.md b/docs/source/ar/model_memory_anatomy.md index db3473e5e02a..5e8a2bec86e3 100644 --- a/docs/source/ar/model_memory_anatomy.md +++ b/docs/source/ar/model_memory_anatomy.md @@ -116,11 +116,8 @@ default_args = { } ``` - - - إذا كنت تخطط لتشغيل عدة تجارب، من أجل مسح الذاكرة بشكل صحيح بين التجارب، قم بإعادة تشغيل نواة Python بين التجارب. - - +> [!TIP] +> إذا كنت تخطط لتشغيل عدة تجارب، من أجل مسح الذاكرة بشكل صحيح بين التجارب، قم بإعادة تشغيل نواة Python بين التجارب. ## استخدام الذاكرة في التدريب الأساسي diff --git a/docs/source/ar/model_sharing.md b/docs/source/ar/model_sharing.md index b81173b15a29..e3b9955c85aa 100644 --- a/docs/source/ar/model_sharing.md +++ b/docs/source/ar/model_sharing.md @@ -12,11 +12,8 @@ frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen> - - -لمشاركة نموذج مع المجتمع، تحتاج إلى حساب على [huggingface.co](https://huggingface.co/join). يمكنك أيضًا الانضمام إلى منظمة موجودة أو إنشاء منظمة جديدة. - - +> [!TIP] +> لمشاركة نموذج مع المجتمع، تحتاج إلى حساب على [huggingface.co](https://huggingface.co/join). يمكنك أيضًا الانضمام إلى منظمة موجودة أو إنشاء منظمة جديدة. ## ميزات المستودع diff --git a/docs/source/ar/peft.md b/docs/source/ar/peft.md index 892d70eb5f6d..bbea8f310b2d 100644 --- a/docs/source/ar/peft.md +++ b/docs/source/ar/peft.md @@ -51,11 +51,8 @@ peft_model_id = "ybelkada/opt-350m-lora" model = AutoModelForCausalLM.from_pretrained(peft_model_id) ``` - - -يمكنك تحميل محول PEFT باستخدام فئة `AutoModelFor` أو فئة النموذج الأساسي مثل `OPTForCausalLM` أو `LlamaForCausalLM`. 
- - +> [!TIP] +> يمكنك تحميل محول PEFT باستخدام فئة `AutoModelFor` أو فئة النموذج الأساسي مثل `OPTForCausalLM` أو `LlamaForCausalLM`. يمكنك أيضًا تحميل محول PEFT عن طريق استدعاء طريقة `load_adapter`: @@ -162,11 +159,8 @@ output = model.generate(**inputs) يدعم محول PEFT فئة [`Trainer`] بحيث يمكنك تدريب محول لحالتك الاستخدام المحددة. فهو يتطلب فقط إضافة بضع سطور أخرى من التعليمات البرمجية. على سبيل المثال، لتدريب محول LoRA: - - -إذا لم تكن معتادًا على ضبط نموذج دقيق باستخدام [`Trainer`، فراجع البرنامج التعليمي](training) لضبط نموذج مُدرب مسبقًا. - - +> [!TIP] +> إذا لم تكن معتادًا على ضبط نموذج دقيق باستخدام [`Trainer`، فراجع البرنامج التعليمي](training) لضبط نموذج مُدرب مسبقًا. 1. حدد تكوين المحول باستخدام نوع المهمة والمعاملات الزائدة (راجع [`~peft.LoraConfig`] لمزيد من التفاصيل حول وظيفة هذه المعلمات). diff --git a/docs/source/ar/pipeline_tutorial.md b/docs/source/ar/pipeline_tutorial.md index 4f71ebb95fa6..2478e3295573 100644 --- a/docs/source/ar/pipeline_tutorial.md +++ b/docs/source/ar/pipeline_tutorial.md @@ -6,11 +6,8 @@ * استخدم مُجزّئ أو نموذجًا محددًا. * استخدم [`pipeline`] للمهام الصوتية والبصرية والمتعددة الوسائط. - - -اطلع على وثائق [`pipeline`] للحصول على القائمة كاملة بالمهام المدعومة والمعلمات المتاحة. - - +> [!TIP] +> اطلع على وثائق [`pipeline`] للحصول على القائمة كاملة بالمهام المدعومة والمعلمات المتاحة. ## استخدام الأنابيب @@ -189,9 +186,8 @@ for out in pipe(KeyDataset(dataset, "audio")): ## استخدام خطوط الأنابيب لخادم ويب - -إن إنشاء محرك استدلال هو موضوع معقد يستحق صفحته الخاصة. - +> [!TIP] +> إن إنشاء محرك استدلال هو موضوع معقد يستحق صفحته الخاصة. [Link](./pipeline_webserver) @@ -251,16 +247,13 @@ for out in pipe(KeyDataset(dataset, "audio")): [{'score': 0.425, 'answer': 'us-001', 'start': 16, 'end': 16}] ``` - - -لتشغيل المثال أعلاه، تحتاج إلى تثبيت [`pytesseract`](https://pypi.org/project/pytesseract/) بالإضافة إلى 🤗 Transformers: - -```bash -sudo apt install -y tesseract-ocr -pip install pytesseract -``` - - +> [!TIP] +> لتشغيل المثال أعلاه، تحتاج إلى تثبيت [`pytesseract`](https://pypi.org/project/pytesseract/) بالإضافة إلى 🤗 Transformers: +> +> ```bash +> sudo apt install -y tesseract-ocr +> pip install pytesseract +> ``` ## استخدام `pipeline` على نماذج كبيرة مع 🤗 `accelerate`: diff --git a/docs/source/ar/pipeline_webserver.md b/docs/source/ar/pipeline_webserver.md index 2a19e84d1632..8f5cbfd4bc20 100644 --- a/docs/source/ar/pipeline_webserver.md +++ b/docs/source/ar/pipeline_webserver.md @@ -1,11 +1,8 @@ # استخدام قنوات المعالجة لخادم ويب - - -يُعدّ إنشاء محرك استدلال أمرًا معقدًا، ويعتمد الحل "الأفضل" على مساحة مشكلتك. هل تستخدم وحدة المعالجة المركزية أم وحدة معالجة الرسومات؟ هل تريد أقل زمن وصول، أم أعلى معدل نقل، أم دعمًا للعديد من النماذج، أم مجرد تحقيق أقصى تحسين نموذج محدد؟ -توجد طرق عديدة لمعالجة هذا الموضوع، لذلك ما سنقدمه هو إعداد افتراضي جيد للبدء به قد لا يكون بالضرورة هو الحل الأمثل لك.``` - - +> [!TIP] +> يُعدّ إنشاء محرك استدلال أمرًا معقدًا، ويعتمد الحل "الأفضل" على مساحة مشكلتك. هل تستخدم وحدة المعالجة المركزية أم وحدة معالجة الرسومات؟ هل تريد أقل زمن وصول، أم أعلى معدل نقل، أم دعمًا للعديد من النماذج، أم مجرد تحقيق أقصى تحسين نموذج محدد؟ +> توجد طرق عديدة لمعالجة هذا الموضوع، لذلك ما سنقدمه هو إعداد افتراضي جيد للبدء به قد لا يكون بالضرورة هو الحل الأمثل لك.``` الشيء الرئيسي الذي يجب فهمه هو أننا يمكن أن نستخدم مؤشرًا، تمامًا كما تفعل [على مجموعة بيانات](pipeline_tutorial#using-pipelines-on-a-dataset)، نظرًا لأن خادم الويب هو أساسًا نظام ينتظر الطلبات ويعالجها عند استلامها. 
@@ -71,11 +68,8 @@ curl -X POST -d "test [MASK]" http://localhost:8000/ المهم حقًا هو أننا نقوم بتحميل النموذج **مرة واحدة** فقط، لذلك لا توجد نسخ من النموذج على خادم الويب. بهذه الطريقة، لا يتم استخدام ذاكرة الوصول العشوائي غير الضرورية. تسمح آلية وضع قائمة الانتظار بالقيام بأشياء متقدمة مثل تجميع بعض العناصر قبل الاستدلال لاستخدام معالجة الدفعات الديناميكية: - - -تم كتابة نموذج الكود البرمجى أدناه بشكل مقصود مثل كود وهمي للقراءة. لا تقم بتشغيله دون التحقق مما إذا كان منطقيًا لموارد النظام الخاص بك! - - +> [!WARNING] +> تم كتابة نموذج الكود البرمجى أدناه بشكل مقصود مثل كود وهمي للقراءة. لا تقم بتشغيله دون التحقق مما إذا كان منطقيًا لموارد النظام الخاص بك! ```py (string, rq) = await q.get() diff --git a/docs/source/ar/preprocessing.md b/docs/source/ar/preprocessing.md index 1418c69fd7a3..287e8885a99c 100644 --- a/docs/source/ar/preprocessing.md +++ b/docs/source/ar/preprocessing.md @@ -9,11 +9,8 @@ * تستخدم مدخلات الصورة [ImageProcessor](./main_classes/image_processor) لتحويل الصور إلى موترات. * تستخدم مدخلات متعددة الوسائط [معالجًا](./main_classes/processors) لدمج مُجزّئ الرموز ومستخرج الميزات أو معالج الصور. - - -`AutoProcessor` **يعمل دائمًا** ويختار تلقائيًا الفئة الصحيحة للنموذج الذي تستخدمه، سواء كنت تستخدم مُجزّئ رموز أو معالج صور أو مستخرج ميزات أو معالجًا. - - +> [!TIP] +> `AutoProcessor` **يعمل دائمًا** ويختار تلقائيًا الفئة الصحيحة للنموذج الذي تستخدمه، سواء كنت تستخدم مُجزّئ رموز أو معالج صور أو مستخرج ميزات أو معالجًا. قبل البدء، قم بتثبيت 🤗 Datasets حتى تتمكن من تحميل بعض مجموعات البيانات لتجربتها: @@ -27,11 +24,8 @@ pip install datasets أداة المعالجة المسبقة الرئيسية للبيانات النصية هي [مُجزّئ اللغوي](main_classes/tokenizer). يقوم مُجزّئ اللغوي بتقسيم النص إلى "أجزاء لغوية" (tokens) وفقًا لمجموعة من القواعد. يتم تحويل الأجزاء اللغوية إلى أرقام ثم إلى منسوجات، والتي تصبح مدخلات للنموذج. يقوم المجزئ اللغوي بإضافة أي مدخلات إضافية يحتاجها النموذج. - - -إذا كنت تخطط لاستخدام نموذج مُدرب مسبقًا، فمن المهم استخدامالمجزئ اللغوي المقترن بنفس ذلك النموذج. يضمن ذلك تقسيم النص بنفس الطريقة التي تم بها تقسيم النصوص ما قبل التدريب، واستخدام نفس القاموس الذي يربط بين الأجزاء اللغوية وأرقامها ( يُشار إليها عادةً باسم المفردات *vocab*) أثناء التدريب المسبق. - - +> [!TIP] +> إذا كنت تخطط لاستخدام نموذج مُدرب مسبقًا، فمن المهم استخدامالمجزئ اللغوي المقترن بنفس ذلك النموذج. يضمن ذلك تقسيم النص بنفس الطريقة التي تم بها تقسيم النصوص ما قبل التدريب، واستخدام نفس القاموس الذي يربط بين الأجزاء اللغوية وأرقامها ( يُشار إليها عادةً باسم المفردات *vocab*) أثناء التدريب المسبق. ابدأ بتحميل المُجزّئ اللغوي مُدرب مسبقًا باستخدام طريقة [`AutoTokenizer.from_pretrained`]. يقوم هذا بتنزيل المفردات *vocab* الذي تم تدريب النموذج عليه: @@ -140,11 +134,8 @@ pip install datasets [1، 1، 1، 1، 1، 1، 1، 0، 0، 0، 0، 0، 0، 0، 0، 0]]} ``` - - -تحقق من دليل المفاهيم [Padding and truncation](./pad_truncation) لمعرفة المزيد حول معامﻻت الحشو و البتر المختلفة. - - +> [!TIP] +> تحقق من دليل المفاهيم [Padding and truncation](./pad_truncation) لمعرفة المزيد حول معامﻻت الحشو و البتر المختلفة. ### بناء الموترات Build tensors @@ -172,14 +163,11 @@ pip install datasets [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])} ``` - - -تدعم خطوط الأنابيب المختلفة معامل مُجزِّئ الرموز(tokenizer) بشكل مختلف في طريقة `()__call__` الخاصة بها. -و خطوط الأنابيب `text-2-text-generation` تدعم فقط `truncation`. -و خطوط الأنابيب `text-generation` تدعم `max_length` و`truncation` و`padding` و`add_special_tokens`. -أما في خطوط الأنابيب `fill-mask`، يمكن تمرير معامل مُجزِّئ الرموز (tokenizer) في المتغير `tokenizer_kwargs` (قاموس). 
- - +> [!TIP] +> تدعم خطوط الأنابيب المختلفة معامل مُجزِّئ الرموز(tokenizer) بشكل مختلف في طريقة `()__call__` الخاصة بها. +> و خطوط الأنابيب `text-2-text-generation` تدعم فقط `truncation`. +> و خطوط الأنابيب `text-generation` تدعم `max_length` و`truncation` و`padding` و`add_special_tokens`. +> أما في خطوط الأنابيب `fill-mask`، يمكن تمرير معامل مُجزِّئ الرموز (tokenizer) في المتغير `tokenizer_kwargs` (قاموس). ## الصوت Audio @@ -291,24 +279,18 @@ pip install datasets بالنسبة لمهام رؤية الحاسوبية، ستحتاج إلى معالج صور [image processor](main_classes/image_processor) لإعداد مجموعة البيانات الخاصة بك لتناسب النموذج. تتكون معالجة الصور المسبقة من عدة خطوات لتحويل الصور إلى الشكل الذي يتوقعه النموذج. وتشمل هذه الخطوات، على سبيل المثال لا الحصر، تغيير الحجم والتطبيع وتصحيح قناة الألوان وتحويل الصور إلى موترات(tensors). - - -عادة ما تتبع معالجة الصور المسبقة شكلاً من أشكال زيادة البيانات (التضخيم). كلا العمليتين، معالجة الصور المسبقة وزيادة الصور تغيران بيانات الصورة، ولكنها تخدم أغراضًا مختلفة: - -*زيادة البيانات: تغيير الصور عن طريق زيادة الصور بطريقة يمكن أن تساعد في منع الإفراط في التعميم وزيادة متانة النموذج. يمكنك أن تكون مبدعًا في كيفية زيادة بياناتك - ضبط السطوع والألوان، واالقص، والدوران، تغيير الحجم، التكبير، إلخ. ومع ذلك، كن حذرًا من عدم تغيير معنى الصور بزياداتك. -*معالجة الصور المسبقة: تضمن معالجة الصور اتتطابق الصور مع تنسيق الإدخال المتوقع للنموذج. عند ضبط نموذج رؤية حاسوبية بدقة، يجب معالجة الصور بالضبط كما كانت عند تدريب النموذج في البداية. - -يمكنك استخدام أي مكتبة تريدها لزيادة بيانات الصور. لمعالجة الصور المسبقة، استخدم `ImageProcessor` المرتبط بالنموذج. - - +> [!TIP] +> عادة ما تتبع معالجة الصور المسبقة شكلاً من أشكال زيادة البيانات (التضخيم). كلا العمليتين، معالجة الصور المسبقة وزيادة الصور تغيران بيانات الصورة، ولكنها تخدم أغراضًا مختلفة: +> +> *زيادة البيانات: تغيير الصور عن طريق زيادة الصور بطريقة يمكن أن تساعد في منع الإفراط في التعميم وزيادة متانة النموذج. يمكنك أن تكون مبدعًا في كيفية زيادة بياناتك - ضبط السطوع والألوان، واالقص، والدوران، تغيير الحجم، التكبير، إلخ. ومع ذلك، كن حذرًا من عدم تغيير معنى الصور بزياداتك. +> *معالجة الصور المسبقة: تضمن معالجة الصور اتتطابق الصور مع تنسيق الإدخال المتوقع للنموذج. عند ضبط نموذج رؤية حاسوبية بدقة، يجب معالجة الصور بالضبط كما كانت عند تدريب النموذج في البداية. +> +> يمكنك استخدام أي مكتبة تريدها لزيادة بيانات الصور. لمعالجة الصور المسبقة، استخدم `ImageProcessor` المرتبط بالنموذج. قم بتحميل مجموعة بيانات [food101](https://huggingface.co/datasets/food101) (راجع دليل 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub) لمزيد من التفاصيل حول كيفية تحميل مجموعة بيانات) لمعرفة كيف يمكنك استخدام معالج الصور مع مجموعات بيانات رؤية الحاسب: - - -استخدم معامل `split` من 🤗 Datasets لتحميل عينة صغيرة فقط من مجموعة التدريب نظرًا لحجم البيانات كبيرة جدًا! - - +> [!TIP] +> استخدم معامل `split` من 🤗 Datasets لتحميل عينة صغيرة فقط من مجموعة التدريب نظرًا لحجم البيانات كبيرة جدًا! ```py >>> from datasets import load_dataset @@ -362,15 +344,13 @@ pip install datasets ... return examples ``` - - -في المثال أعلاه، قمنا بتعيين `do_resize=False` لأننا قمنا بالفعل بتغيير حجم الصور في تحويل زيادة الصور، -واستفدنا من خاصية `size` من `image_processor` المناسب. إذا لم تقم بتغيير حجم الصور أثناء زيادة الصور، -فاترك هذا المعلمة. بشكل افتراضي، ستتعامل `ImageProcessor` مع تغيير الحجم. - -إذا كنت ترغب في تطبيع الصور كجزء من تحويل زيادة الصور، فاستخدم قيم `image_processor.image_mean`، -و `image_processor.image_std`. 
- +> [!TIP] +> في المثال أعلاه، قمنا بتعيين `do_resize=False` لأننا قمنا بالفعل بتغيير حجم الصور في تحويل زيادة الصور، +> واستفدنا من خاصية `size` من `image_processor` المناسب. إذا لم تقم بتغيير حجم الصور أثناء زيادة الصور، +> فاترك هذا المعلمة. بشكل افتراضي، ستتعامل `ImageProcessor` مع تغيير الحجم. +> +> إذا كنت ترغب في تطبيع الصور كجزء من تحويل زيادة الصور، فاستخدم قيم `image_processor.image_mean`، +> و `image_processor.image_std`. 3. ثم استخدم 🤗 Datasets[`~datasets.Dataset.set_transform`] لتطبيق التحولات أثناء التنقل: ```py @@ -397,13 +377,10 @@ pip install datasets - - -بالنسبة للمهام مثل الكشف عن الأشياء، والتجزئة الدلالية، والتجزئة المثالية، والتجزئة الشاملة، يوفر `ImageProcessor` -تقوم هذه الطرق بتحويل النواتج الأولية للنموذج إلى تنبؤات ذات معنى مثل مربعات الحدود، -أو خرائط التجزئة. - - +> [!TIP] +> بالنسبة للمهام مثل الكشف عن الأشياء، والتجزئة الدلالية، والتجزئة المثالية، والتجزئة الشاملة، يوفر `ImageProcessor` +> تقوم هذه الطرق بتحويل النواتج الأولية للنموذج إلى تنبؤات ذات معنى مثل مربعات الحدود، +> أو خرائط التجزئة. ### الحشو Pad diff --git a/docs/source/ar/quicktour.md b/docs/source/ar/quicktour.md index 55466e0a1563..cb995bc11f12 100644 --- a/docs/source/ar/quicktour.md +++ b/docs/source/ar/quicktour.md @@ -23,11 +23,8 @@ pip install torch يمثل [`pipeline`] أسهل وأسرع طريقة لاستخدام نموذج مُدرب مسبقًا للاستنتاج. يمكنك استخدام [`pipeline`] جاهزًا للعديد من المهام عبر طرق مختلفة، والتي يظهر بعضها في الجدول أدناه: - - -للاطلاع على القائمة الكاملة للمهام المتاحة، راجع [مرجع واجهة برمجة التطبيقات الخاصة بخط الأنابيب](./main_classes/pipelines). - - +> [!TIP] +> للاطلاع على القائمة الكاملة للمهام المتاحة، راجع [مرجع واجهة برمجة التطبيقات الخاصة بخط الأنابيب](./main_classes/pipelines).
@@ -179,11 +176,8 @@ label: NEGATIVE, with score: 0.5309 ... ) ``` - - -اطلع على [الدليل التمهيدي للمعالجة المسبقة](./preprocessing) للحصول على مزيد من التفاصيل حول المعالجة، وكيفية استخدام [`AutoImageProcessor`] و [`AutoFeatureExtractor`] و [`AutoProcessor`] لمعالجة الصور والصوت والإدخالات متعددة الوسائط. - - +> [!TIP] +> اطلع على [الدليل التمهيدي للمعالجة المسبقة](./preprocessing) للحصول على مزيد من التفاصيل حول المعالجة، وكيفية استخدام [`AutoImageProcessor`] و [`AutoFeatureExtractor`] و [`AutoProcessor`] لمعالجة الصور والصوت والإدخالات متعددة الوسائط. ### AutoModel @@ -196,11 +190,8 @@ label: NEGATIVE, with score: 0.5309 >>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name) ``` - - -راجع [ملخص المهمة](./task_summary) للاطلاع على المهام التي تدعمها فئة [`AutoModel`]. - - +> [!TIP] +> راجع [ملخص المهمة](./task_summary) للاطلاع على المهام التي تدعمها فئة [`AutoModel`]. الآن قم بتمرير دفعة المدخلات المُعالجة مسبقًا مباشرة إلى النموذج. عليك فقط فك تعبئة القاموس عن طريق إضافة `**`: @@ -223,11 +214,8 @@ tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725], [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=) ``` - - -تخرج جميع نماذج 🤗 Transformers (PyTorch أو TensorFlow) المصفوفات *قبل* دالة التنشيط النهائية (مثل softmax) لأن دالة التنشيط النهائية غالبًا ما تكون مدمجة مع دالة الخسارة. نواتج النموذج عبارة عن فئات بيانات خاصة، لذلك يتم استكمال سماتها تلقائيًا في IDE. وتتصرف مخرجات النموذج مثل زوج مرتب أو قاموس (يمكنك الفهرسة باستخدام عدد صحيح ، شريحة، أو سلسلة)، وفي هذه الحالة، يتم تجاهل السمات التي تساوي None. - - +> [!TIP] +> تخرج جميع نماذج 🤗 Transformers (PyTorch أو TensorFlow) المصفوفات *قبل* دالة التنشيط النهائية (مثل softmax) لأن دالة التنشيط النهائية غالبًا ما تكون مدمجة مع دالة الخسارة. نواتج النموذج عبارة عن فئات بيانات خاصة، لذلك يتم استكمال سماتها تلقائيًا في IDE. وتتصرف مخرجات النموذج مثل زوج مرتب أو قاموس (يمكنك الفهرسة باستخدام عدد صحيح ، شريحة، أو سلسلة)، وفي هذه الحالة، يتم تجاهل السمات التي تساوي None. ### حفظ النموذج @@ -363,11 +351,8 @@ tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725], >>> trainer.train() # doctest: +SKIP ``` - - -بالنسبة للمهام - مثل الترجمة أو التلخيص - التي تستخدم نموذج تسلسل إلى تسلسل، استخدم فئات [`Seq2SeqTrainer`] و [`Seq2SeqTrainingArguments`] بدلاً من ذلك. - - +> [!TIP] +> بالنسبة للمهام - مثل الترجمة أو التلخيص - التي تستخدم نموذج تسلسل إلى تسلسل، استخدم فئات [`Seq2SeqTrainer`] و [`Seq2SeqTrainingArguments`] بدلاً من ذلك. يمكنك تخصيص سلوك حلقة التدريب عن طريق إنشاء فئة فرعية من الطرق داخل [`Trainer`]. يسمح لك ذلك بتخصيص ميزات مثل دالة الخسارة، والمحسن، والمجدول. راجع مرجع [`Trainer`] للتعرف على الطرق التي يمكن إنشاء فئات فرعية منها. diff --git a/docs/source/ar/serialization.md b/docs/source/ar/serialization.md index 6f437dea0681..e53901cdc6a1 100644 --- a/docs/source/ar/serialization.md +++ b/docs/source/ar/serialization.md @@ -114,11 +114,8 @@ optimum-cli export onnx --model keras-io/transformers-qa distilbert_base_cased_s ### تصدير نموذج باستخدام `transformers.onnx` - - -لم يعد يتم دعم `transformers.onnx` يُرجى تصدير النماذج باستخدام 🤗 Optimum كما هو موضح أعلاه. سيتم إزالة هذا القسم في الإصدارات القادمة. - - +> [!WARNING] +> لم يعد يتم دعم `transformers.onnx` يُرجى تصدير النماذج باستخدام 🤗 Optimum كما هو موضح أعلاه. سيتم إزالة هذا القسم في الإصدارات القادمة. 
لتصدير نموذج 🤗 Transformers إلى ONNX باستخدام `transformers.onnx`، ثبّت التبعيات الإضافية: diff --git a/docs/source/ar/tasks/language_modeling.md b/docs/source/ar/tasks/language_modeling.md index 4b6bb31692a7..682caa980a19 100644 --- a/docs/source/ar/tasks/language_modeling.md +++ b/docs/source/ar/tasks/language_modeling.md @@ -27,11 +27,8 @@ rendered properly in your Markdown viewer. 1. ضبط دقيق [DistilRoBERTa](https://huggingface.co/distilbert/distilroberta-base) على مجموعة فرعية [r/askscience](https://www.reddit.com/r/askscience/) من مجموعة بيانات [ELI5](https://huggingface.co/datasets/eli5). 2. استخدام النموذج المدرب الخاص بك للاستنتاج. - - -لرؤية جميع العمارات ونقاط التحقق المتوافقة مع هذه المهمة، نوصي بالتحقق من [task-page](https://huggingface.co/tasks/text-generation) - - +> [!TIP] +> لرؤية جميع العمارات ونقاط التحقق المتوافقة مع هذه المهمة، نوصي بالتحقق من [task-page](https://huggingface.co/tasks/text-generation) قبل أن تبدأ، تأكد من تثبيت جميع المكتبات الضرورية: @@ -195,11 +192,8 @@ pip install transformers datasets evaluate ## التدريب (Train) - - -إذا لم تكن على دراية بتدريب نموذج باستخدام [`Trainer`], اطلع على [البرنامج التعليمي الأساسي](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> إذا لم تكن على دراية بتدريب نموذج باستخدام [`Trainer`], اطلع على [البرنامج التعليمي الأساسي](../training#train-with-pytorch-trainer)! أنت جاهز الآن لبدء تدريب نموذجك! قم بتحميل DistilGPT2 باستخدام [`AutoModelForCausalLM`]: @@ -252,13 +246,10 @@ Perplexity: 49.61 >>> trainer.push_to_hub() ``` - - -للحصول على مثال أكثر تعمقًا حول كيفية تدريب نموذج للنمذجة اللغوية السببية، اطلع على الدفتر المقابل -[دفتر PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) -أو [دفتر TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb). - - +> [!TIP] +> للحصول على مثال أكثر تعمقًا حول كيفية تدريب نموذج للنمذجة اللغوية السببية، اطلع على الدفتر المقابل +> [دفتر PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) +> أو [دفتر TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb). ## الاستدلال (Inference) diff --git a/docs/source/ar/tasks/masked_language_modeling.md b/docs/source/ar/tasks/masked_language_modeling.md index 846614b4b177..daa6b3269775 100644 --- a/docs/source/ar/tasks/masked_language_modeling.md +++ b/docs/source/ar/tasks/masked_language_modeling.md @@ -24,11 +24,8 @@ rendered properly in your Markdown viewer. 1. تكييف [DistilRoBERTa](https://huggingface.co/distilbert/distilroberta-base) على مجموعة فرعية [r/askscience](https://www.reddit.com/r/askscience/) من مجموعة بيانات [ELI5](https://huggingface.co/datasets/eli5). 2. استخدام نموذج المدرب الخاص بك للاستدلال. - - -لمعرفة جميع البنى والنسخ المتوافقة مع هذه المهمة، نوصي بالتحقق من [صفحة المهمة](https://huggingface.co/tasks/fill-mask) - - +> [!TIP] +> لمعرفة جميع البنى والنسخ المتوافقة مع هذه المهمة، نوصي بالتحقق من [صفحة المهمة](https://huggingface.co/tasks/fill-mask) قبل أن تبدأ، تأكد من تثبيت جميع المكتبات الضرورية: @@ -189,11 +186,8 @@ pip install transformers datasets evaluate ## التدريب (Train) - - -إذا لم تكن على دراية بتعديل نموذج باستخدام [`Trainer`], ألق نظرة على الدليل الأساسي [هنا](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> إذا لم تكن على دراية بتعديل نموذج باستخدام [`Trainer`], ألق نظرة على الدليل الأساسي [هنا](../training#train-with-pytorch-trainer)! 
أنت مستعد الآن لبدء تدريب نموذجك! قم بتحميل DistilRoBERTa باستخدام [`AutoModelForMaskedLM`]: @@ -248,13 +242,10 @@ Perplexity: 8.76 >>> trainer.push_to_hub() ``` - - -لمثال أكثر تفصيلاً حول كيفية تعديل نموذج للنمذجة اللغوية المقنعة، ألق نظرة على الدفتر المقابل -[دفتر PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) -أو [دفتر TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb). - - +> [!TIP] +> لمثال أكثر تفصيلاً حول كيفية تعديل نموذج للنمذجة اللغوية المقنعة، ألق نظرة على الدفتر المقابل +> [دفتر PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) +> أو [دفتر TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb). ## الاستدلال diff --git a/docs/source/ar/tasks/multiple_choice.md b/docs/source/ar/tasks/multiple_choice.md index cdfe0b8caf6c..69b4fa8cd1fc 100644 --- a/docs/source/ar/tasks/multiple_choice.md +++ b/docs/source/ar/tasks/multiple_choice.md @@ -183,11 +183,8 @@ tokenized_swag = swag.map(preprocess_function, batched=True) ## التدريب (Train) - - -إذا لم تكن معتادًا على ضبط نموذج باستخدام [`Trainer`], فراجع الدرس الأساسي [هنا](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> إذا لم تكن معتادًا على ضبط نموذج باستخدام [`Trainer`], فراجع الدرس الأساسي [هنا](../training#train-with-pytorch-trainer)! أنت جاهز لبدء تدريب نموذجك الآن! قم بتحميل BERT باستخدام [`AutoModelForMultipleChoice`]: @@ -236,12 +233,9 @@ tokenized_swag = swag.map(preprocess_function, batched=True) >>> trainer.push_to_hub() ``` - - -للحصول على مثال أكثر تعمقًا حول كيفية ضبط نموذج للاختيار من متعدد، ألق نظرة على [دفتر ملاحظات PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb) -أو [دفتر ملاحظات TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb) المقابل. - - +> [!TIP] +> للحصول على مثال أكثر تعمقًا حول كيفية ضبط نموذج للاختيار من متعدد، ألق نظرة على [دفتر ملاحظات PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb) +> أو [دفتر ملاحظات TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb) المقابل. ## الاستدلال (Inference) diff --git a/docs/source/ar/tasks/question_answering.md b/docs/source/ar/tasks/question_answering.md index b0f00c9316b3..ebc54f30e636 100644 --- a/docs/source/ar/tasks/question_answering.md +++ b/docs/source/ar/tasks/question_answering.md @@ -30,11 +30,8 @@ rendered properly in your Markdown viewer. 1. ضبط [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) على مجموعة بيانات [SQuAD](https://huggingface.co/datasets/squad) للإجابة على الأسئلة الاستخراجية. 2. استخدام النموذج المضبوط للاستدلال. - - -لمشاهدة جميع الهياكل والنسخ المتوافقة مع هذه المهمة، نوصي بالرجوع إلى [صفحة المهمة](https://huggingface.co/tasks/question-answering) - - +> [!TIP] +> لمشاهدة جميع الهياكل والنسخ المتوافقة مع هذه المهمة، نوصي بالرجوع إلى [صفحة المهمة](https://huggingface.co/tasks/question-answering) قبل البدء، تأكد من تثبيت جميع المكتبات الضرورية: @@ -177,11 +174,8 @@ pip install transformers datasets evaluate ## التدريب (Train) - - -إذا لم تكن معتادًا على ضبط نموذج باستخدام [`Trainer`], ألق نظرة على البرنامج التعليمي الأساسي [هنا](../training#train-with-pytorch-trainer)! 
- - +> [!TIP] +> إذا لم تكن معتادًا على ضبط نموذج باستخدام [`Trainer`], ألق نظرة على البرنامج التعليمي الأساسي [هنا](../training#train-with-pytorch-trainer)! أنت جاهز لبدء تدريب نموذجك الآن! قم بتحميل DistilBERT باستخدام [`AutoModelForQuestionAnswering`]: @@ -228,12 +222,9 @@ pip install transformers datasets evaluate ``` - - -للحصول على مثال أكثر تعمقًا حول كيفية ضبط نموذج للإجابة على الأسئلة، ألق نظرة على [دفتر ملاحظات PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb) المقابل -أو [دفتر ملاحظات TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb). - - +> [!TIP] +> للحصول على مثال أكثر تعمقًا حول كيفية ضبط نموذج للإجابة على الأسئلة، ألق نظرة على [دفتر ملاحظات PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb) المقابل +> أو [دفتر ملاحظات TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb). ## التقييم (Evaluate) diff --git a/docs/source/ar/tasks/sequence_classification.md b/docs/source/ar/tasks/sequence_classification.md index d8e6cb29bad5..e7a9d319d3c8 100644 --- a/docs/source/ar/tasks/sequence_classification.md +++ b/docs/source/ar/tasks/sequence_classification.md @@ -22,11 +22,8 @@ rendered properly in your Markdown viewer. 1. ضبط [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) على مجموعة بيانات [IMDb](https://huggingface.co/datasets/imdb) لتحديد ما إذا كانت مراجعة الفيلم إيجابية أو سلبية. 2. استخدام نموذج الضبط الدقيق للتنبؤ. - - -لرؤية جميع البنى ونقاط التحقق المتوافقة مع هذه المهمة، نوصي بالتحقق من [صفحة المهمة](https://huggingface.co/tasks/text-classification). - - +> [!TIP] +> لرؤية جميع البنى ونقاط التحقق المتوافقة مع هذه المهمة، نوصي بالتحقق من [صفحة المهمة](https://huggingface.co/tasks/text-classification). قبل أن تبدأ، تأكد من تثبيت جميع المكتبات الضرورية: @@ -131,11 +128,8 @@ tokenized_imdb = imdb.map(preprocess_function, batched=True) >>> label2id = {"NEGATIVE": 0, "POSITIVE": 1} ``` - - -إذا لم تكن على دراية بضبط نموذج دقيق باستخدام [`Trainer`], فالق نظرة على البرنامج التعليمي الأساسي [هنا](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> إذا لم تكن على دراية بضبط نموذج دقيق باستخدام [`Trainer`], فالق نظرة على البرنامج التعليمي الأساسي [هنا](../training#train-with-pytorch-trainer)! أنت مستعد الآن لبدء تدريب نموذجك! قم بتحميل DistilBERT مع [`AutoModelForSequenceClassification`] جنبًا إلى جنب مع عدد التصنيفات المتوقعة، وتصنيفات الخرائط: @@ -180,11 +174,8 @@ tokenized_imdb = imdb.map(preprocess_function, batched=True) >>> trainer.train() ``` - - -يستخدم [`Trainer`] الحشو الديناميكي افتراضيًا عند تمرير `tokenizer` إليه. في هذه الحالة، لا تحتاج لتحديد مُجمِّع البيانات صراحةً. - - +> [!TIP] +> يستخدم [`Trainer`] الحشو الديناميكي افتراضيًا عند تمرير `tokenizer` إليه. في هذه الحالة، لا تحتاج لتحديد مُجمِّع البيانات صراحةً. 
بعد اكتمال التدريب، شارك نموذجك على Hub باستخدام الطريقة [`~transformers.Trainer.push_to_hub`] ليستخدمه الجميع: @@ -192,13 +183,10 @@ tokenized_imdb = imdb.map(preprocess_function, batched=True) >>> trainer.push_to_hub() ``` - - -للحصول على مثال أكثر عمقًا حول كيفية ضبط نموذج لتصنيف النصوص، قم بالاطلاع على الدفتر المقابل -[دفتر PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb) -أو [دفتر TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb). - - +> [!TIP] +> للحصول على مثال أكثر عمقًا حول كيفية ضبط نموذج لتصنيف النصوص، قم بالاطلاع على الدفتر المقابل +> [دفتر PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb) +> أو [دفتر TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb). ## الاستدلال(Inference) diff --git a/docs/source/ar/tasks/summarization.md b/docs/source/ar/tasks/summarization.md index 760b6d370d17..7acf60a675e3 100644 --- a/docs/source/ar/tasks/summarization.md +++ b/docs/source/ar/tasks/summarization.md @@ -30,11 +30,8 @@ rendered properly in your Markdown viewer. 1. ضبط دقيق [T5](https://huggingface.co/google-t5/t5-small) على مجموعة فرعية من مشاريع قوانين ولاية كاليفورنيا من مجموعة بيانات [BillSum](https://huggingface.co/datasets/billsum) للتلخيص التجريدي. 2. استخدام النموذج المضبوط بدقة للتنبؤ. - - -لمشاهدة جميع البنى ونقاط التفتيش المتوافقة مع هذه المهمة، نوصي بالتحقق من [صفحة المهمة](https://huggingface.co/tasks/summarization) - - +> [!TIP] +> لمشاهدة جميع البنى ونقاط التفتيش المتوافقة مع هذه المهمة، نوصي بالتحقق من [صفحة المهمة](https://huggingface.co/tasks/summarization) قبل البدء، تأكد من تثبيت جميع المكتبات الضرورية: @@ -159,11 +156,8 @@ pip install transformers datasets evaluate rouge_score ## التدريب (Train) - - -إذا لم تكن معتادًا على ضبط نموذج باستخدام [`Trainer`]، فألق نظرة على البرنامج التعليمي الأساسي [هنا](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> إذا لم تكن معتادًا على ضبط نموذج باستخدام [`Trainer`]، فألق نظرة على البرنامج التعليمي الأساسي [هنا](../training#train-with-pytorch-trainer)! أنت جاهز لبدء تدريب نموذجك الآن! قم بتحميل T5 باستخدام [`AutoModelForSeq2SeqLM`]: @@ -213,12 +207,9 @@ pip install transformers datasets evaluate rouge_score >>> trainer.push_to_hub() ``` - - -للحصول على مثال أكثر تعمقًا حول كيفية ضبط نموذج للتجميع، ألقِ نظرة على [دفتر ملاحظات PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb) -أو [دفتر ملاحظات TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb) المقابل. - - +> [!TIP] +> للحصول على مثال أكثر تعمقًا حول كيفية ضبط نموذج للتجميع، ألقِ نظرة على [دفتر ملاحظات PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb) +> أو [دفتر ملاحظات TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb) المقابل. ## الاستدلال (Inference) diff --git a/docs/source/ar/tasks/token_classification.md b/docs/source/ar/tasks/token_classification.md index b3d353527962..b061ed987251 100644 --- a/docs/source/ar/tasks/token_classification.md +++ b/docs/source/ar/tasks/token_classification.md @@ -22,11 +22,8 @@ 1. 
ضبط [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) على مجموعة بيانات [WNUT 17](https://huggingface.co/datasets/wnut_17) للكشف عن كيانات جديدة. 2. استخدام نموذجك المضبوط بدقة للاستدلال. - - -للاطلاع جميع البنى والنقاط المتوافقة مع هذه المهمة، نوصي بالرجوع من [صفحة المهمة](https://huggingface.co/tasks/token-classification). - - +> [!TIP] +> للاطلاع جميع البنى والنقاط المتوافقة مع هذه المهمة، نوصي بالرجوع من [صفحة المهمة](https://huggingface.co/tasks/token-classification). قبل أن تبدأ، تأكد من تثبيت جميع المكتبات الضرورية: @@ -235,11 +232,8 @@ pip install transformers datasets evaluate seqeval ... } ``` - - -إذا لم تكن على دراية بتعديل نموذج باستخدام [`Trainer`], ألق نظرة على الدليل التعليمي الأساسي [هنا](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> إذا لم تكن على دراية بتعديل نموذج باستخدام [`Trainer`], ألق نظرة على الدليل التعليمي الأساسي [هنا](../training#train-with-pytorch-trainer)! أنت مستعد الآن لبدء تدريب نموذجك! قم بتحميل DistilBERT مع [`AutoModelForTokenClassification`] إلى جانب عدد التصنيفات المتوقعة، وخريطة التسميات: @@ -290,13 +284,10 @@ pip install transformers datasets evaluate seqeval >>> trainer.push_to_hub() ``` - - -للحصول على مثال أكثر تفصيلاً حول كيفية تعديل نموذج لتصنيف الرموز، ألق نظرة على الدفتر المقابل -[دفتر PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb) -أو [دفتر TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb). - - +> [!TIP] +> للحصول على مثال أكثر تفصيلاً حول كيفية تعديل نموذج لتصنيف الرموز، ألق نظرة على الدفتر المقابل +> [دفتر PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb) +> أو [دفتر TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb). ## الاستدلال(Inference) diff --git a/docs/source/ar/tasks/translation.md b/docs/source/ar/tasks/translation.md index e2beb45acb59..0d8a33bd7dde 100644 --- a/docs/source/ar/tasks/translation.md +++ b/docs/source/ar/tasks/translation.md @@ -27,11 +27,8 @@ rendered properly in your Markdown viewer. 1. ضبط دقيق لنموذج [T5](https://huggingface.co/google-t5/t5-small) على المجموعة الفرعية الإنجليزية-الفرنسية من مجموعة بيانات [OPUS Books](https://huggingface.co/datasets/opus_books) لترجمة النص الإنجليزي إلى الفرنسية. 2. استخدام النموذج المضبوط بدقة للاستدلال. - - -لمشاهدة جميع البنى والنسخ المتوافقة مع هذه المهمة، نوصي بالتحقق من [صفحة المهمة](https://huggingface.co/tasks/translation). - - +> [!TIP] +> لمشاهدة جميع البنى والنسخ المتوافقة مع هذه المهمة، نوصي بالتحقق من [صفحة المهمة](https://huggingface.co/tasks/translation). قبل البدء، تأكد من تثبيت جميع المكتبات الضرورية: @@ -166,11 +163,8 @@ pip install transformers datasets evaluate sacrebleu ## التدريب (Train) - - -إذا لم تكن معتادًا على ضبط دقيق نموذج باستخدام [`Trainer`], فألقِ نظرة على البرنامج التعليمي الأساسي [هنا](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> إذا لم تكن معتادًا على ضبط دقيق نموذج باستخدام [`Trainer`], فألقِ نظرة على البرنامج التعليمي الأساسي [هنا](../training#train-with-pytorch-trainer)! أنت جاهز لبدء تدريب نموذجك الآن! 
حمّل T5 باستخدام [`AutoModelForSeq2SeqLM`]: @@ -220,12 +214,9 @@ pip install transformers datasets evaluate sacrebleu >>> trainer.push_to_hub() ``` - - -للحصول على مثال أكثر تعمقًا لكيفية ضبط نموذج للترجمة، ألق نظرة على [دفتر ملاحظات PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb) المقابل -أو [دفتر ملاحظات TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb). - - +> [!TIP] +> للحصول على مثال أكثر تعمقًا لكيفية ضبط نموذج للترجمة، ألق نظرة على [دفتر ملاحظات PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb) المقابل +> أو [دفتر ملاحظات TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb). ## الاستدلال (Inference) diff --git a/docs/source/ar/tasks_explained.md b/docs/source/ar/tasks_explained.md index dbc50192b4e3..6355f491a3ed 100644 --- a/docs/source/ar/tasks_explained.md +++ b/docs/source/ar/tasks_explained.md @@ -13,11 +13,8 @@ - [GPT2](model_doc/gpt2) لمهام NLP مثل توليد النصوص التي تستخدم فك تشفير - [BART](model_doc/bart) لمهام NLP مثل الملخص والترجمة التي تستخدم ترميز-فك تشفير - - -قبل المتابعة، من الجيد أن يكون لديك بعض المعرفة الأساسية بهيكلية المحولات (Transformer Architecture) الأصلية. إن معرفة كيفية عمل المُشفّرات (Encoders) والمُفكّكات (Decoders) وآلية الانتباه (Attention Mechanism) سوف تساعدك في فهم كيفية عمل نماذج Transformer المختلفة. إذا كنت مبتدئًا أو بحاجة إلى مراجعة، فراجع [دورتنا](https://huggingface.co/course/chapter1/4؟fw=pt) لمزيد من المعلومات! - - +> [!TIP] +> قبل المتابعة، من الجيد أن يكون لديك بعض المعرفة الأساسية بهيكلية المحولات (Transformer Architecture) الأصلية. إن معرفة كيفية عمل المُشفّرات (Encoders) والمُفكّكات (Decoders) وآلية الانتباه (Attention Mechanism) سوف تساعدك في فهم كيفية عمل نماذج Transformer المختلفة. إذا كنت مبتدئًا أو بحاجة إلى مراجعة، فراجع [دورتنا](https://huggingface.co/course/chapter1/4؟fw=pt) لمزيد من المعلومات! ## الكلام والصوت (Speech and audio) @@ -57,11 +54,8 @@ 1. قم بتقسيم الصورة إلى تسلسل من الرقع ومعالجتها بالتوازي باستخدام مُحوّل Transformer. 2. استخدم شبكة عصبية تلافيفية CNN) حديثة، مثل [ConvNeXT](model_doc/convnext)، والتي تعتمد على الطبقات التلافيفية ولكنها تعتمد تصميمات حديثة للشبكات. - - -يقوم النهج الثالث بمزج المحولات مع التلافيف (على سبيل المثال، [Convolutional Vision Transformer](model_doc/cvt) أو [LeViT](model_doc/levit)). لن نناقشها لأنها تجمع ببساطة بين النهجين اللذين نستعرضهما هنا. - - +> [!TIP] +> يقوم النهج الثالث بمزج المحولات مع التلافيف (على سبيل المثال، [Convolutional Vision Transformer](model_doc/cvt) أو [LeViT](model_doc/levit)). لن نناقشها لأنها تجمع ببساطة بين النهجين اللذين نستعرضهما هنا. يتم استخدام ViT و ConvNeXT بشكل شائع لتصنيف الصور، ولكن بالنسبة لمهام الرؤية الأخرى مثل اكتشاف الكائنات والتجزئة وتقدير العمق، سنلقي نظرة على DETR و Mask2Former و GLPN، على التوالي؛ فهذه النماذج هي الأنسب لتلك المهام. @@ -91,11 +85,8 @@ #### الشبكات العصبية التلافيفية (CNN) - - -يشرح هذا القسم بإيجاز الالتفافات، ولكن سيكون من المفيد أن يكون لديك فهم مسبق لكيفية تغيير شكل الصورة وحجمها. إذا كنت غير معتاد على الالتفافات، تحقق من [فصل الشبكات العصبية التلافيفية](https://github.com/fastai/fastbook/blob/master/13_convolutions.ipynb) من كتاب fastai! - - +> [!TIP] +> يشرح هذا القسم بإيجاز الالتفافات، ولكن سيكون من المفيد أن يكون لديك فهم مسبق لكيفية تغيير شكل الصورة وحجمها. 
إذا كنت غير معتاد على الالتفافات، تحقق من [فصل الشبكات العصبية التلافيفية](https://github.com/fastai/fastbook/blob/master/13_convolutions.ipynb) من كتاب fastai! [ConvNeXT](model_doc/convnext) هو بنية CNN تعتمد تصاميم الشبكات الجديدة والحديثة لتحسين الأداء. ومع ذلك، لا تزال الالتفافات هي جوهر النموذج. من منظور عام، [الالتفاف](glossary#convolution) هو عملية حيث يتم ضرب مصفوفة أصغر (*نواة*) بمقطع صغير من وحدات بكسل الصورة. يحسب بعض الميزات منه، مثل نسيج معين أو انحناء خط. ثم ينزلق إلى النافذة التالية من البكسلات؛ المسافة التي تقطعها الالتفاف تسمى *الخطوة*. @@ -214,11 +205,8 @@ هل أنت مستعد لتجربة الإجابة على الأسئلة؟ راجع [دليل الإجابة على الأسئلة](tasks/question_answering) الشامل الخاص بنا لمعرفة كيفية ضبط نموذج DistilBERT واستخدامه في الاستدلال! - - -💡 لاحظ مدى سهولة استخدام BERT لمهام مختلفة بمجرد تدريبه مسبقًا. كل ما تحتاج إليه هو إضافة رأس محدد إلى النموذج المسبق التدريب للتلاعب بالحالات المخفية إلى الإخراج المطلوب! - - +> [!TIP] +> 💡 لاحظ مدى سهولة استخدام BERT لمهام مختلفة بمجرد تدريبه مسبقًا. كل ما تحتاج إليه هو إضافة رأس محدد إلى النموذج المسبق التدريب للتلاعب بالحالات المخفية إلى الإخراج المطلوب! ### توليد النصوص (Text generation) @@ -236,11 +224,8 @@ هل أنت مستعد لتجربة توليد النصوص؟ تحقق من دليل [دليل نمذجة اللغة السببية](tasks/language_modeling#causal- الشامل الخاص بنا لمعرفة كيفية ضبط نموذج DistilGPT-2 واستخدامه للاستنتاج! - - -للحصول على مزيد من المعلومات حول توليد النص، راجع دليل [استراتيجيات توليد النصوص](generation_strategies)! - - +> [!TIP] +> للحصول على مزيد من المعلومات حول توليد النص، راجع دليل [استراتيجيات توليد النصوص](generation_strategies)! ### التلخيص (Summarization) @@ -256,11 +241,8 @@ هل أنت مستعد لتجربة التلخيص؟ تحقق من دليل التلخيص الشامل الخاص بنا لمعرفة كيفية ضبط نموذج T5 واستخدامه للاستنتاج! - - -للحصول على مزيد من المعلومات حول توليد النص، راجع دليل استراتيجيات توليد النص! - - +> [!TIP] +> للحصول على مزيد من المعلومات حول توليد النص، راجع دليل استراتيجيات توليد النص! ### الترجمة (Translation) @@ -272,8 +254,5 @@ هل أنت مستعد لتجربة الترجمة؟ تحقق من دليل الترجمة الشامل الخاص بنا لمعرفة كيفية ضبط نموذج T5 واستخدامه للاستنتاج! - - - **للحصول على مزيد من المعلومات حول توليد النصوص، راجع دليل [استراتيجيات توليد النصوص](generation_strategies)!** - - +> [!TIP] +> **للحصول على مزيد من المعلومات حول توليد النصوص، راجع دليل [استراتيجيات توليد النصوص](generation_strategies)!** diff --git a/docs/source/ar/torchscript.md b/docs/source/ar/torchscript.md index bf0bc0dde04b..766543e99d97 100644 --- a/docs/source/ar/torchscript.md +++ b/docs/source/ar/torchscript.md @@ -1,10 +1,7 @@ # التصدير إلى TorchScript - - -هذه هي بداية تجاربنا مع TorchScript ولا زلنا نستكشف قدراته مع نماذج المدخلات المتغيرة الحجم. إنه مجال اهتمامنا وسنعمق تحليلنا في الإصدارات القادمة، مع المزيد من الأمثلة البرمجية، وتنفيذ أكثر مرونة، ومقاييس مقارنة بين الأكواد القائمة على Python مع أكواد TorchScript المُجمّعة. - - +> [!TIP] +> هذه هي بداية تجاربنا مع TorchScript ولا زلنا نستكشف قدراته مع نماذج المدخلات المتغيرة الحجم. إنه مجال اهتمامنا وسنعمق تحليلنا في الإصدارات القادمة، مع المزيد من الأمثلة البرمجية، وتنفيذ أكثر مرونة، ومقاييس مقارنة بين الأكواد القائمة على Python مع أكواد TorchScript المُجمّعة. وفقًا لـ [وثائق TorchScript](https://pytorch.org/docs/stable/jit.html): diff --git a/docs/source/ar/trainer.md b/docs/source/ar/trainer.md index 1784d76a4ecb..9d02c334ed37 100644 --- a/docs/source/ar/trainer.md +++ b/docs/source/ar/trainer.md @@ -2,15 +2,12 @@ تُتيح وحدة [`Trainer`] حلقة تدريب وتقييم متكاملة لنماذج PyTorch المطبقة في مكتبة Transformers. 
تحتاج فقط إلى تمرير المكونات الضرورية للتدريب (النموذج، والمجزىء النصى، ومجموعة البيانات، دالة التقييم، معلمات التدريب الفائقة، إلخ)، وستتولى فئة [`Trainer`] الباقي. هذا يُسهّل بدء التدريب بشكل أسرع دون كتابة حلقة التدريب الخاصة بك يدويًا. ولكن في الوقت نفسه، فإن [`Trainer`] قابل للتخصيص بدرجة كبيرة ويوفر العديد من خيارات التدريب حتى تتمكن من تخصيصه وفقًا لاحتياجات التدريب الخاصة بك بدقة. - - -بالإضافة إلى فئة [`Trainer`], توفر مكتبة Transformers أيضًا فئة [`Seq2SeqTrainer`] للمهام التسلسلية مثل الترجمة أو التلخيص. هناك أيضًا فئة [`~trl.SFTTrainer`] من مكتبة [TRL](https://hf.co/docs/trl) التي تغلّف فئة [`Trainer`] وهي مُحُسَّنة لتدريب نماذج اللغة مثل Llama-2 وMistral باستخدام تقنيات التوليد اللغوي. كما يدعم [`~trl.SFTTrainer`] ميزات مثل حزم التسلسلات، وLoRA، والقياس الكمي، وDeepSpeed مما يُمكّن من التدريب بكفاءة على نماذج ضخمة الحجم. - -
- -لا تتردد في الاطلاع على [مرجع API](./main_classes/trainer) لهذه الفئات الأخرى من النوع [`Trainer`] لمعرفة المزيد حول متى يتم استخدام كل منها. بشكل عام، [`Trainer`] هو الخيار الأكثر تنوعًا ومناسبًا لمجموعة واسعة من المهام. تم تصميم [`Seq2SeqTrainer`] للمهام التسلسلية ، و [`~trl.SFTTrainer`] مُصمم لتدريب نماذج اللغة الكبيرة. - -
+> [!TIP] +> بالإضافة إلى فئة [`Trainer`], توفر مكتبة Transformers أيضًا فئة [`Seq2SeqTrainer`] للمهام التسلسلية مثل الترجمة أو التلخيص. هناك أيضًا فئة [`~trl.SFTTrainer`] من مكتبة [TRL](https://hf.co/docs/trl) التي تغلّف فئة [`Trainer`] وهي مُحُسَّنة لتدريب نماذج اللغة مثل Llama-2 وMistral باستخدام تقنيات التوليد اللغوي. كما يدعم [`~trl.SFTTrainer`] ميزات مثل حزم التسلسلات، وLoRA، والقياس الكمي، وDeepSpeed مما يُمكّن من التدريب بكفاءة على نماذج ضخمة الحجم. +> +>
+> +> لا تتردد في الاطلاع على [مرجع API](./main_classes/trainer) لهذه الفئات الأخرى من النوع [`Trainer`] لمعرفة المزيد حول متى يتم استخدام كل منها. بشكل عام، [`Trainer`] هو الخيار الأكثر تنوعًا ومناسبًا لمجموعة واسعة من المهام. تم تصميم [`Seq2SeqTrainer`] للمهام التسلسلية ، و [`~trl.SFTTrainer`] مُصمم لتدريب نماذج اللغة الكبيرة. قبل البدء، تأكد من تثبيت مكتبة [Accelerate](https://hf.co/docs/accelerate) - وهي مكتبة تُمكّن تشغيل تدريب PyTorch في بيئات مُوزعة. @@ -164,21 +161,15 @@ trainer = Trainer( ## تسجيل الأحداث (Logging) - - -راجع مرجع [API](./main_classes/logging) للتسجيل للحصول على مزيد من المعلومات حول مستويات التسجيل المختلفة للأحداث. - - +> [!TIP] +> راجع مرجع [API](./main_classes/logging) للتسجيل للحصول على مزيد من المعلومات حول مستويات التسجيل المختلفة للأحداث. يتم تعيين [`Trainer`] إلى `logging.INFO` افتراضيًا والذي يُبلغ عن الأخطاء والتحذيرات ومعلومات أساسية أخرى. يتم تعيين نسخة [`Trainer`] - في البيئات الموزعة - إلى `logging.WARNING` والتي يُبلغ فقط عن الأخطاء والتحذيرات. يمكنك تغيير مستوى تسجيل الأحداث باستخدام معاملي [`log_level`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.log_level) و [`log_level_replica`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.log_level_replica) في [`TrainingArguments`]. لتهيئة إعداد مُستوى تسجيل اﻷحداث لكل عقدة، استخدم معامل [`log_on_each_node`](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments.log_on_each_node) لتحديد ما إذا كان سيتم استخدام مُستوى السجل على كل عقدة أو فقط على العقدة الرئيسية. - - -يحدد [`Trainer`] مُستوى التسجيل بشكل مُنفصل لكل عقدة في طريقة [`Trainer.__init__`]، لذا فقد ترغب في التفكير في تعيين هذا الإعداد في وقت سابق إذا كنت تستخدم وظائف Transformers الأخرى قبل إنشاء كائن [`Trainer`]. - - +> [!TIP] +> يحدد [`Trainer`] مُستوى التسجيل بشكل مُنفصل لكل عقدة في طريقة [`Trainer.__init__`]، لذا فقد ترغب في التفكير في تعيين هذا الإعداد في وقت سابق إذا كنت تستخدم وظائف Transformers الأخرى قبل إنشاء كائن [`Trainer`]. على سبيل المثال، لتعيين التعليمات البرمجية والوحدات النمطية الرئيسية الخاصة بك لاستخدام نفس مُستوى التسجيل وفقًا لكل عقدة: @@ -382,11 +373,8 @@ trainer.train() تم تقديم مُحسِّنات LOMO في [التدريب على المعلمات الكاملة لنماذج اللغة الكبيرة باستخدام موارد محدودة](https://hf.co/papers/2306.09782) و [AdaLomo: تحسين ذاكرة منخفضة بمعدل تعلم متكيف](https://hf.co/papers/2310.10195). يتكون كلاهما من طريقة فعالة لضبط المعلمات الكاملة. تدمج محسنات LOMO حساب الاشتقاق وتحديث المعلمات في خطوة واحدة لتقليل استخدام الذاكرة. محسنات LOMO المدعومة هي `"lomo"` و `"adalomo"`. أولاً قم بتثبيت LOMO من pypi `pip install lomo-optim` أو قم بتثبيته من المصدر باستخدام `pip install git+https://github.com/OpenLMLab/LOMO.git`. - - -وفقًا للمؤلفين، يوصى باستخدام `AdaLomo` بدون `grad_norm` للحصول على أداء أفضل وسرعة أعلى. - - +> [!TIP] +> وفقًا للمؤلفين، يوصى باستخدام `AdaLomo` بدون `grad_norm` للحصول على أداء أفضل وسرعة أعلى. فيما يلي نص برمجي بسيط يوضح كيفية ضبط نموذج [google/gemma-2b](https://huggingface.co/google/gemma-2b) على مجموعة بيانات IMDB في الدقة الكاملة: @@ -411,9 +399,8 @@ trainer.train() ### مُحسِّن GrokAdamW تم تصميم مُحسِّن GrokAdamW لتعزيز أداء التدريب واستقراره، خاصةً للنماذج التي تستفيد من دوال إشارة `grokking`. لاستخدام `GrokAdamW`، قم أولاً بتثبيت حزمة المُحسِّن باستخدام `pip install grokadamw`. - -يُعد GrokAdamW مفيدًا بشكل خاص للنماذج التي تتطلب تقنيات تحسين مُتقدمة لتحقيق أداء واستقرار أفضل. 
- +> [!TIP] +> يُعد GrokAdamW مفيدًا بشكل خاص للنماذج التي تتطلب تقنيات تحسين مُتقدمة لتحقيق أداء واستقرار أفضل. فيما يلي نص برمجى بسيط لشرح كيفية ضبط [google/gemma-2b](https://huggingface.co/google/gemma-2b) بدقة على مجموعة بيانات IMDB باستخدام مُحسِّن GrokAdamW: ```python @@ -482,11 +469,8 @@ trainer.train() يتم تشغيل فئة [`Trainer`] بواسطة [تسريع](https://hf.co/docs/accelerate)، وهي مكتبة لتدريب نماذج PyTorch بسهولة في بيئات موزعة مع دعم عمليات التكامل مثل [FullyShardedDataParallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) و [DeepSpeed](https://www.deepspeed.ai/). - - -تعرف على المزيد حول استراتيجيات تجزئة FSDP، وتفريغ وحدة المعالجة المركزية (CPU)، والمزيد مع [`Trainer`] في [دليل Fully Sharded Data Parallel](fsdp). - - +> [!TIP] +> تعرف على المزيد حول استراتيجيات تجزئة FSDP، وتفريغ وحدة المعالجة المركزية (CPU)، والمزيد مع [`Trainer`] في [دليل Fully Sharded Data Parallel](fsdp). لاستخدام Accelerate مع [`Trainer`]]، قم بتشغيل الأمر [`accelerate.config`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-config) لإعداد التدريب لبيئة التدريب الخاصة بك. نشئ هذا الأمر ملف `config_file.yaml` الذي سيتم استخدامه عند تشغيل نص للتدريب البرمجى. على سبيل المثال، بعض تكوينات المثال التي يمكنك إعدادها هي: diff --git a/docs/source/ar/training.md b/docs/source/ar/training.md index c509b27a3317..ddc783b8874d 100644 --- a/docs/source/ar/training.md +++ b/docs/source/ar/training.md @@ -72,11 +72,8 @@ >>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` - - -سترى تحذيرًا بشأن بعض أوزان النموذج المُدرب مسبقًا لن تُستخدم وبعض الأوزان الأخرى ستُبدء بشكل عشوائي. لا تقلق، هذا أمر طبيعي تمامًا! يتم التخلص من رأس النموذج المُدرب مسبقًا لشبكة BERT، ويتم استبداله برأس تصنيف يُبدء بشكل عشوائي. سوف تقوم بضبط الرأس الجديد للنموذج بدقة على مهمة تصنيف التسلسلات الخاصة بك، مما ينقل المعرفة من النموذج المُدرب مسبقًا إليه. - - +> [!TIP] +> سترى تحذيرًا بشأن بعض أوزان النموذج المُدرب مسبقًا لن تُستخدم وبعض الأوزان الأخرى ستُبدء بشكل عشوائي. لا تقلق، هذا أمر طبيعي تمامًا! يتم التخلص من رأس النموذج المُدرب مسبقًا لشبكة BERT، ويتم استبداله برأس تصنيف يُبدء بشكل عشوائي. سوف تقوم بضبط الرأس الجديد للنموذج بدقة على مهمة تصنيف التسلسلات الخاصة بك، مما ينقل المعرفة من النموذج المُدرب مسبقًا إليه. ### اختيار أحسن العوامل والمتغيرات للتدريب (Training hyperparameters) @@ -230,11 +227,8 @@ torch.cuda.empty_cache() >>> model.to(device) ``` - - -احصل على وصول مجاني إلى وحدة معالجة رسومات سحابية إذا لم يكن لديك واحدة مع دفتر ملاحظات مستضاف مثل [Colaboratory](https://colab.research.google.com/) أو [SageMaker StudioLab](https://studiolab.sagemaker.aws/). - - +> [!TIP] +> احصل على وصول مجاني إلى وحدة معالجة رسومات سحابية إذا لم يكن لديك واحدة مع دفتر ملاحظات مستضاف مثل [Colaboratory](https://colab.research.google.com/) أو [SageMaker StudioLab](https://studiolab.sagemaker.aws/). رائع، الآن أنت مستعد للتدريب! 🥳 diff --git a/docs/source/ar/troubleshooting.md b/docs/source/ar/troubleshooting.md index 7874a9fad133..5adf9efead65 100644 --- a/docs/source/ar/troubleshooting.md +++ b/docs/source/ar/troubleshooting.md @@ -39,9 +39,8 @@ CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 11.17 GiB total capacit - حاول استخدام [`gradient_accumulation_steps`](main_classes/trainer#transformers.TrainingArguments.gradient_accumulation_steps) في [`TrainingArguments`] لزيادة حجم الدُفعة بشكل فعال. - -راجع دليل [الأداء](performance) لمزيد من التفاصيل حول تقنيات توفير الذاكرة. 
- +> [!TIP] +> راجع دليل [الأداء](performance) لمزيد من التفاصيل حول تقنيات توفير الذاكرة. ## عدم القدرة على تحميل نموذج TensorFlow محفوظ @@ -138,9 +137,8 @@ tensor([[-0.1008, -0.4061]], grad_fn=) يجب عليك في معظم الوقت توفير `attention_mask` للنموذج لتجاهل رموز الحشو لتجنب هذا الخطأ الصامت. الآن يتطابق مُخرجات التسلسل الثاني مع مُخرجاته الفعلية: - -بشكل افتراضي، ينشئ مجزىء النصوص `attention_mask` لك استنادًا إلى إعدادات المجزىء المحدد. - +> [!TIP] +> بشكل افتراضي، ينشئ مجزىء النصوص `attention_mask` لك استنادًا إلى إعدادات المجزىء المحدد. ```python >>> attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1], [1, 0, 0, 0, 0, 0]]) diff --git a/docs/source/de/add_new_model.md b/docs/source/de/add_new_model.md index e5ef4234319f..fd5c40d05257 100644 --- a/docs/source/de/add_new_model.md +++ b/docs/source/de/add_new_model.md @@ -748,11 +748,8 @@ Tests erfolgreich sind, führen Sie RUN_SLOW=1 pytest -sv tests/models/brand_new_bert/test_modeling_brand_new_bert.py::BrandNewBertModelIntegrationTests ``` - - -Falls Sie Windows verwenden, sollten Sie `RUN_SLOW=1` durch `SET RUN_SLOW=1` ersetzen. - - +> [!TIP] +> Falls Sie Windows verwenden, sollten Sie `RUN_SLOW=1` durch `SET RUN_SLOW=1` ersetzen. Zweitens sollten alle Funktionen, die speziell für *brand_new_bert* sind, zusätzlich in einem separaten Test getestet werden unter `BrandNewBertModelTester`/`BrandNewBertModelTest`. Dieser Teil wird oft vergessen, ist aber in zweierlei Hinsicht äußerst nützlich diff --git a/docs/source/de/autoclass_tutorial.md b/docs/source/de/autoclass_tutorial.md index 94fabccb25fd..dc20878d8958 100644 --- a/docs/source/de/autoclass_tutorial.md +++ b/docs/source/de/autoclass_tutorial.md @@ -18,11 +18,8 @@ rendered properly in your Markdown viewer. Bei so vielen verschiedenen Transformator-Architekturen kann es eine Herausforderung sein, eine für Ihren Checkpoint zu erstellen. Als Teil der 🤗 Transformers Kernphilosophie, die Bibliothek leicht, einfach und flexibel nutzbar zu machen, leitet eine `AutoClass` automatisch die richtige Architektur aus einem gegebenen Checkpoint ab und lädt sie. Mit der Methode `from_pretrained()` kann man schnell ein vortrainiertes Modell für eine beliebige Architektur laden, so dass man keine Zeit und Ressourcen aufwenden muss, um ein Modell von Grund auf zu trainieren. Die Erstellung dieser Art von Checkpoint-agnostischem Code bedeutet, dass Ihr Code, wenn er für einen Checkpoint funktioniert, auch mit einem anderen Checkpoint funktionieren wird - solange er für eine ähnliche Aufgabe trainiert wurde - selbst wenn die Architektur unterschiedlich ist. - - -Denken Sie daran, dass sich die Architektur auf das Skelett des Modells bezieht und die Checkpoints die Gewichte für eine bestimmte Architektur sind. Zum Beispiel ist [BERT](https://huggingface.co/google-bert/bert-base-uncased) eine Architektur, während `google-bert/bert-base-uncased` ein Checkpoint ist. Modell ist ein allgemeiner Begriff, der entweder Architektur oder Prüfpunkt bedeuten kann. - - +> [!TIP] +> Denken Sie daran, dass sich die Architektur auf das Skelett des Modells bezieht und die Checkpoints die Gewichte für eine bestimmte Architektur sind. Zum Beispiel ist [BERT](https://huggingface.co/google-bert/bert-base-uncased) eine Architektur, während `google-bert/bert-base-uncased` ein Checkpoint ist. Modell ist ein allgemeiner Begriff, der entweder Architektur oder Prüfpunkt bedeuten kann. 
In dieser Anleitung lernen Sie, wie man: @@ -97,12 +94,9 @@ Sie können denselben Prüfpunkt problemlos wiederverwenden, um eine Architektur >>> model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` - - -Für PyTorch-Modelle verwendet die Methode `from_pretrained()` `torch.load()`, die intern `pickle` verwendet und als unsicher bekannt ist. Generell sollte man niemals ein Modell laden, das aus einer nicht vertrauenswürdigen Quelle stammen könnte, oder das manipuliert worden sein könnte. Dieses Sicherheitsrisiko wird für öffentliche Modelle, die auf dem Hugging Face Hub gehostet werden, teilweise gemildert, da diese bei jeder Übertragung [auf Malware](https://huggingface.co/docs/hub/security-malware) gescannt werden. Siehe die [Hub-Dokumentation](https://huggingface.co/docs/hub/security) für Best Practices wie [signierte Commit-Verifizierung](https://huggingface.co/docs/hub/security-gpg#signing-commits-with-gpg) mit GPG. - -TensorFlow- und Flax-Checkpoints sind nicht betroffen und können in PyTorch-Architekturen mit den Kwargs `from_tf` und `from_flax` für die Methode `from_pretrained` geladen werden, um dieses Problem zu umgehen. - - +> [!WARNING] +> Für PyTorch-Modelle verwendet die Methode `from_pretrained()` `torch.load()`, die intern `pickle` verwendet und als unsicher bekannt ist. Generell sollte man niemals ein Modell laden, das aus einer nicht vertrauenswürdigen Quelle stammen könnte, oder das manipuliert worden sein könnte. Dieses Sicherheitsrisiko wird für öffentliche Modelle, die auf dem Hugging Face Hub gehostet werden, teilweise gemildert, da diese bei jeder Übertragung [auf Malware](https://huggingface.co/docs/hub/security-malware) gescannt werden. Siehe die [Hub-Dokumentation](https://huggingface.co/docs/hub/security) für Best Practices wie [signierte Commit-Verifizierung](https://huggingface.co/docs/hub/security-gpg#signing-commits-with-gpg) mit GPG. +> +> TensorFlow- und Flax-Checkpoints sind nicht betroffen und können in PyTorch-Architekturen mit den Kwargs `from_tf` und `from_flax` für die Methode `from_pretrained` geladen werden, um dieses Problem zu umgehen. Im Allgemeinen empfehlen wir die Verwendung der Klasse "AutoTokenizer" und der Klasse "AutoModelFor", um trainierte Instanzen von Modellen zu laden. Dadurch wird sichergestellt, dass Sie jedes Mal die richtige Architektur laden. Im nächsten [Tutorial] (Vorverarbeitung) erfahren Sie, wie Sie Ihren neu geladenen Tokenizer, Feature Extractor und Prozessor verwenden, um einen Datensatz für die Feinabstimmung vorzuverarbeiten. diff --git a/docs/source/de/contributing.md b/docs/source/de/contributing.md index f7fc3d1359c3..a4d9d56c4b39 100644 --- a/docs/source/de/contributing.md +++ b/docs/source/de/contributing.md @@ -269,11 +269,8 @@ Sie können auch eine kleinere Anzahl an Tests angeben, um nur die Funktion, an Standardmäßig werden langsame Tests übersprungen, aber Sie können die Umgebungsvariable `RUN_SLOW` auf `yes` setzen, um sie auszuführen. Dies wird den Download vieler Gigabyte an Modellen starten - stellen Sie also sicher, dass Sie sowohl genügend Festplattenspeicher als auch eine gute Internetverbindung oder die nötige Geduld haben! - - -Vergessen Sie nicht, einen *Pfad zu einem Unterordner oder einer Testdatei* anzugeben, um den Test auszuführen. Sonst führen Sie alle Tests im `tests` oder `examples` Ordner aus, was sehr lange dauern wird! - - +> [!WARNING] +> Vergessen Sie nicht, einen *Pfad zu einem Unterordner oder einer Testdatei* anzugeben, um den Test auszuführen. 
Sonst führen Sie alle Tests im `tests` oder `examples` Ordner aus, was sehr lange dauern wird! ```bash RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model diff --git a/docs/source/de/installation.md b/docs/source/de/installation.md index 44b6f1ed981e..bf1317998954 100644 --- a/docs/source/de/installation.md +++ b/docs/source/de/installation.md @@ -121,11 +121,8 @@ pip install -e . Diese Befehle verknüpfen den Ordner, in den Sie das Repository geklont haben, mit den Pfaden Ihrer Python-Bibliotheken. Python wird nun in dem Ordner suchen, in den Sie geklont haben, zusätzlich zu den normalen Bibliothekspfaden. Wenn zum Beispiel Ihre Python-Pakete normalerweise in `~/anaconda3/envs/main/lib/python3.7/site-packages/` installiert sind, wird Python auch den Ordner durchsuchen, in den Sie geklont haben: `~/transformers/`. - - -Sie müssen den Ordner `transformers` behalten, wenn Sie die Bibliothek weiter verwenden wollen. - - +> [!WARNING] +> Sie müssen den Ordner `transformers` behalten, wenn Sie die Bibliothek weiter verwenden wollen. Jetzt können Sie Ihren Klon mit dem folgenden Befehl ganz einfach auf die neueste Version von 🤗 Transformers aktualisieren: @@ -154,21 +151,15 @@ Vorgefertigte Modelle werden heruntergeladen und lokal zwischengespeichert unter 3. Shell-Umgebungsvariable: `XDG_CACHE_HOME` + `/huggingface`. - - -Transformers verwendet die Shell-Umgebungsvariablen `PYTORCH_TRANSFORMERS_CACHE` oder `PYTORCH_PRETRAINED_BERT_CACHE`, wenn Sie von einer früheren Iteration dieser Bibliothek kommen und diese Umgebungsvariablen gesetzt haben, sofern Sie nicht die Shell-Umgebungsvariable `TRANSFORMERS_CACHE` angeben. - - +> [!TIP] +> Transformers verwendet die Shell-Umgebungsvariablen `PYTORCH_TRANSFORMERS_CACHE` oder `PYTORCH_PRETRAINED_BERT_CACHE`, wenn Sie von einer früheren Iteration dieser Bibliothek kommen und diese Umgebungsvariablen gesetzt haben, sofern Sie nicht die Shell-Umgebungsvariable `TRANSFORMERS_CACHE` angeben. ## Offline Modus Transformers ist in der Lage, in einer Firewall- oder Offline-Umgebung zu laufen, indem es nur lokale Dateien verwendet. Setzen Sie die Umgebungsvariable `HF_HUB_OFFLINE=1`, um dieses Verhalten zu aktivieren. - - -Fügen sie [🤗 Datasets](https://huggingface.co/docs/datasets/) zu Ihrem Offline-Trainingsworkflow hinzufügen, indem Sie die Umgebungsvariable `HF_DATASETS_OFFLINE=1` setzen. - - +> [!TIP] +> Fügen sie [🤗 Datasets](https://huggingface.co/docs/datasets/) zu Ihrem Offline-Trainingsworkflow hinzufügen, indem Sie die Umgebungsvariable `HF_DATASETS_OFFLINE=1` setzen. So würden Sie beispielsweise ein Programm in einem normalen Netzwerk mit einer Firewall für externe Instanzen mit dem folgenden Befehl ausführen: @@ -243,8 +234,5 @@ Sobald Ihre Datei heruntergeladen und lokal zwischengespeichert ist, geben Sie d >>> config = AutoConfig.from_pretrained("./your/path/bigscience_t0/config.json") ``` - - -Weitere Informationen zum Herunterladen von Dateien, die auf dem Hub gespeichert sind, finden Sie im Abschnitt [Wie man Dateien vom Hub herunterlädt](https://huggingface.co/docs/hub/how-to-downstream). - - +> [!TIP] +> Weitere Informationen zum Herunterladen von Dateien, die auf dem Hub gespeichert sind, finden Sie im Abschnitt [Wie man Dateien vom Hub herunterlädt](https://huggingface.co/docs/hub/how-to-downstream). 
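The `de/installation.md` hunks above describe offline mode (`HF_HUB_OFFLINE=1`) and loading a configuration from a file that was downloaded ahead of time. As a minimal sketch of that workflow — the repo id `bigscience/T0_3B` is an illustrative assumption, not part of this changeset:

```python
from huggingface_hub import hf_hub_download
from transformers import AutoConfig

# While online: download a single file from the Hub; the local cache path is returned.
# The repo id below is an illustrative assumption.
config_path = hf_hub_download(repo_id="bigscience/T0_3B", filename="config.json")

# Later, e.g. with HF_HUB_OFFLINE=1 exported in the shell, load it from the local path.
config = AutoConfig.from_pretrained(config_path)
print(config.model_type)
```
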
diff --git a/docs/source/de/llm_tutorial.md b/docs/source/de/llm_tutorial.md index ea4a96632cb1..28edca535e6d 100644 --- a/docs/source/de/llm_tutorial.md +++ b/docs/source/de/llm_tutorial.md @@ -68,11 +68,8 @@ Damit sich Ihr Modell so verhält, wie Sie es für Ihre Aufgabe erwarten, müsse Lassen Sie uns über Code sprechen! - - -Wenn Sie an der grundlegenden Verwendung von LLMs interessiert sind, ist unsere High-Level-Schnittstelle [`Pipeline`](pipeline_tutorial) ein guter Ausgangspunkt. LLMs erfordern jedoch oft fortgeschrittene Funktionen wie Quantisierung und Feinsteuerung des Token-Auswahlschritts, was am besten über [`~generation.GenerationMixin.generate`] erfolgt. Die autoregressive Generierung mit LLMs ist ebenfalls ressourcenintensiv und sollte für einen angemessenen Durchsatz auf einer GPU ausgeführt werden. - - +> [!TIP] +> Wenn Sie an der grundlegenden Verwendung von LLMs interessiert sind, ist unsere High-Level-Schnittstelle [`Pipeline`](pipeline_tutorial) ein guter Ausgangspunkt. LLMs erfordern jedoch oft fortgeschrittene Funktionen wie Quantisierung und Feinsteuerung des Token-Auswahlschritts, was am besten über [`~generation.GenerationMixin.generate`] erfolgt. Die autoregressive Generierung mit LLMs ist ebenfalls ressourcenintensiv und sollte für einen angemessenen Durchsatz auf einer GPU ausgeführt werden. Zunächst müssen Sie das Modell laden. diff --git a/docs/source/de/model_sharing.md b/docs/source/de/model_sharing.md index 6bfc444ae50b..02330e8d6efc 100644 --- a/docs/source/de/model_sharing.md +++ b/docs/source/de/model_sharing.md @@ -27,11 +27,8 @@ In diesem Tutorial lernen Sie zwei Methoden kennen, wie Sie ein trainiertes oder frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen> - - -Um ein Modell mit der Öffentlichkeit zu teilen, benötigen Sie ein Konto auf [huggingface.co](https://huggingface.co/join). Sie können auch einer bestehenden Organisation beitreten oder eine neue Organisation gründen. - - +> [!TIP] +> Um ein Modell mit der Öffentlichkeit zu teilen, benötigen Sie ein Konto auf [huggingface.co](https://huggingface.co/join). Sie können auch einer bestehenden Organisation beitreten oder eine neue Organisation gründen. ## Repository-Funktionen diff --git a/docs/source/de/peft.md b/docs/source/de/peft.md index f43d227a9a6e..5f88eadad822 100644 --- a/docs/source/de/peft.md +++ b/docs/source/de/peft.md @@ -63,11 +63,8 @@ peft_model_id = "ybelkada/opt-350m-lora" model = AutoModelForCausalLM.from_pretrained(peft_model_id) ``` - - -Sie können einen PEFT-Adapter entweder mit einer `AutoModelFor`-Klasse oder der Basismodellklasse wie `OPTForCausalLM` oder `LlamaForCausalLM` laden. - - +> [!TIP] +> Sie können einen PEFT-Adapter entweder mit einer `AutoModelFor`-Klasse oder der Basismodellklasse wie `OPTForCausalLM` oder `LlamaForCausalLM` laden. Sie können einen PEFT-Adapter auch laden, indem Sie die Methode `load_adapter` aufrufen: @@ -168,11 +165,8 @@ output = model.generate(**inputs) PEFT-Adapter werden von der Klasse [`Trainer`] unterstützt, so dass Sie einen Adapter für Ihren speziellen Anwendungsfall trainieren können. Dazu müssen Sie nur ein paar weitere Codezeilen hinzufügen. Zum Beispiel, um einen LoRA-Adapter zu trainieren: - - -Wenn Sie mit der Feinabstimmung eines Modells mit [`Trainer`] noch nicht vertraut sind, werfen Sie einen Blick auf das Tutorial [Feinabstimmung eines vortrainierten Modells](Training). 
- - +> [!TIP] +> Wenn Sie mit der Feinabstimmung eines Modells mit [`Trainer`] noch nicht vertraut sind, werfen Sie einen Blick auf das Tutorial [Feinabstimmung eines vortrainierten Modells](Training). 1. Definieren Sie Ihre Adapterkonfiguration mit dem Aufgabentyp und den Hyperparametern (siehe [`~peft.LoraConfig`] für weitere Details darüber, was die Hyperparameter tun). diff --git a/docs/source/de/pipeline_tutorial.md b/docs/source/de/pipeline_tutorial.md index 5106af9b2faf..cf86358cb9a1 100644 --- a/docs/source/de/pipeline_tutorial.md +++ b/docs/source/de/pipeline_tutorial.md @@ -22,11 +22,8 @@ Die [`pipeline`] macht es einfach, jedes beliebige Modell aus dem [Hub](https:// * Einen bestimmten Tokenizer oder ein bestimmtes Modell zu verwenden. * Eine [`pipeline`] für Audio-, Vision- und multimodale Aufgaben zu verwenden. - - -Eine vollständige Liste der unterstützten Aufgaben und verfügbaren Parameter finden Sie in der [`pipeline`]-Dokumentation. - - +> [!TIP] +> Eine vollständige Liste der unterstützten Aufgaben und verfügbaren Parameter finden Sie in der [`pipeline`]-Dokumentation. ## Verwendung von Pipelines diff --git a/docs/source/de/pr_checks.md b/docs/source/de/pr_checks.md index ee2bbf489b8e..9bef34d6ddfe 100644 --- a/docs/source/de/pr_checks.md +++ b/docs/source/de/pr_checks.md @@ -147,11 +147,8 @@ Zusätzliche Prüfungen betreffen PRs, die neue Modelle hinzufügen, vor allem, Da die Transformers-Bibliothek in Bezug auf den Modellcode sehr eigenwillig ist und jedes Modell vollständig in einer einzigen Datei implementiert sein sollte, ohne sich auf andere Modelle zu stützen, haben wir einen Mechanismus hinzugefügt, der überprüft, ob eine Kopie des Codes einer Ebene eines bestimmten Modells mit dem Original übereinstimmt. Auf diese Weise können wir bei einer Fehlerbehebung alle anderen betroffenen Modelle sehen und entscheiden, ob wir die Änderung weitergeben oder die Kopie zerstören. - - -Wenn eine Datei eine vollständige Kopie einer anderen Datei ist, sollten Sie sie in der Konstante `FULL_COPIES` von `utils/check_copies.py` registrieren. - - +> [!TIP] +> Wenn eine Datei eine vollständige Kopie einer anderen Datei ist, sollten Sie sie in der Konstante `FULL_COPIES` von `utils/check_copies.py` registrieren. Dieser Mechanismus stützt sich auf Kommentare der Form `# Kopiert von xxx`. Das `xxx` sollte den gesamten Pfad zu der Klasse der Funktion enthalten, die darunter kopiert wird. Zum Beispiel ist `RobertaSelfOutput` eine direkte Kopie der Klasse `BertSelfOutput`. Sie können also [hier](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L289) sehen, dass sie einen Kommentar hat: @@ -181,11 +178,8 @@ Sie können mehrere Muster durch ein Komma getrennt hinzufügen. Zum Beispiel is Wenn die Reihenfolge eine Rolle spielt (weil eine der Ersetzungen mit einer vorherigen in Konflikt geraten könnte), werden die Ersetzungen von links nach rechts ausgeführt. - - -Wenn die Ersetzungen die Formatierung ändern (wenn Sie z.B. einen kurzen Namen durch einen sehr langen Namen ersetzen), wird die Kopie nach Anwendung des automatischen Formats überprüft. - - +> [!TIP] +> Wenn die Ersetzungen die Formatierung ändern (wenn Sie z.B. einen kurzen Namen durch einen sehr langen Namen ersetzen), wird die Kopie nach Anwendung des automatischen Formats überprüft. 
Eine andere Möglichkeit, wenn es sich bei den Mustern nur um verschiedene Umschreibungen derselben Ersetzung handelt (mit einer groß- und einer kleingeschriebenen Variante), besteht darin, die Option `all-casing` hinzuzufügen. [Hier](https://github.com/huggingface/transformers/blob/15082a9dc6950ecae63a0d3e5060b2fc7f15050a/src/transformers/models/mobilebert/modeling_mobilebert.py#L1237) ist ein Beispiel in `MobileBertForSequenceClassification` mit dem Kommentar: diff --git a/docs/source/de/preprocessing.md b/docs/source/de/preprocessing.md index baae623d6988..5751ac966497 100644 --- a/docs/source/de/preprocessing.md +++ b/docs/source/de/preprocessing.md @@ -30,11 +30,8 @@ Bevor Sie Ihre Daten in einem Modell verwenden können, müssen die Daten in ein Das wichtigste Werkzeug zur Verarbeitung von Textdaten ist ein [Tokenizer](main_classes/tokenizer). Ein Tokenizer zerlegt Text zunächst nach einer Reihe von Regeln in *Token*. Die Token werden in Zahlen umgewandelt, die zum Aufbau von Tensoren als Eingabe für ein Modell verwendet werden. Alle zusätzlichen Eingaben, die ein Modell benötigt, werden ebenfalls vom Tokenizer hinzugefügt. - - -Wenn Sie ein vortrainiertes Modell verwenden möchten, ist es wichtig, den zugehörigen vortrainierten Tokenizer zu verwenden. Dadurch wird sichergestellt, dass der Text auf die gleiche Weise aufgeteilt wird wie das Pretraining-Korpus und die gleichen entsprechenden Token-zu-Index (in der Regel als *vocab* bezeichnet) während des Pretrainings verwendet werden. - - +> [!TIP] +> Wenn Sie ein vortrainiertes Modell verwenden möchten, ist es wichtig, den zugehörigen vortrainierten Tokenizer zu verwenden. Dadurch wird sichergestellt, dass der Text auf die gleiche Weise aufgeteilt wird wie das Pretraining-Korpus und die gleichen entsprechenden Token-zu-Index (in der Regel als *vocab* bezeichnet) während des Pretrainings verwendet werden. Laden Sie einen vortrainierten Tokenizer mit der Klasse [AutoTokenizer], um schnell loszulegen. Damit wird das *vocab* heruntergeladen, das verwendet wird, wenn ein Modell vortrainiert wird. diff --git a/docs/source/de/quicktour.md b/docs/source/de/quicktour.md index 024c9fe8b3c6..1ac68a8f2f91 100644 --- a/docs/source/de/quicktour.md +++ b/docs/source/de/quicktour.md @@ -20,12 +20,9 @@ rendered properly in your Markdown viewer. Mit 🤗 Transformers können Sie sofort loslegen! Verwenden Sie die [`pipeline`] für schnelle Inferenz und laden Sie schnell ein vortrainiertes Modell und einen Tokenizer mit einer [AutoClass](./model_doc/auto), um Ihre Text-, Bild- oder Audioaufgabe zu lösen. - - -Alle in der Dokumentation vorgestellten Codebeispiele haben oben links einen Umschalter für PyTorch und TensorFlow. Wenn -nicht, wird erwartet, dass der Code für beide Backends ohne Änderungen funktioniert. - - +> [!TIP] +> Alle in der Dokumentation vorgestellten Codebeispiele haben oben links einen Umschalter für PyTorch und TensorFlow. Wenn +> nicht, wird erwartet, dass der Code für beide Backends ohne Änderungen funktioniert. ## Pipeline @@ -54,11 +51,8 @@ Die [`pipeline`] unterstützt viele gängige Aufgaben: * Audioklassifizierung: Zuweisung eines Labels zu einem bestimmten Audiosegment. * Automatische Spracherkennung (ASR): Transkription von Audiodaten in Text. - - -Für mehr Details über die [`pipeline`] und assoziierte Aufgaben, schauen Sie in die Dokumentation [hier](./main_classes/pipelines). - - +> [!TIP] +> Für mehr Details über die [`pipeline`] und assoziierte Aufgaben, schauen Sie in die Dokumentation [hier](./main_classes/pipelines). 
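The `de/quicktour.md` hunk above lists the tasks a [`pipeline`] supports. A minimal sketch of that task-based pattern is shown below; the task string selects a default checkpoint, and the input sentence is only an example:

```python
from transformers import pipeline

# Minimal sketch of the task-based pipeline pattern described above.
# The task string picks a default model; pass `model=...` to pin a specific checkpoint.
classifier = pipeline("sentiment-analysis")
result = classifier("We are very happy to show you the 🤗 Transformers library.")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': ...}]
```
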
### Verwendung der Pipeline @@ -226,11 +220,8 @@ Lesen Sie das Tutorial [preprocessing](./preprocessing) für weitere Details zur >>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name) ``` - - -In der [Aufgabenzusammenfassung](./task_summary) steht, welche [AutoModel]-Klasse für welche Aufgabe zu verwenden ist. - - +> [!TIP] +> In der [Aufgabenzusammenfassung](./task_summary) steht, welche [AutoModel]-Klasse für welche Aufgabe zu verwenden ist. Jetzt können Sie Ihren vorverarbeiteten Stapel von Eingaben direkt an das Modell übergeben. Sie müssen nur das Wörterbuch entpacken, indem Sie `**` hinzufügen: @@ -249,21 +240,15 @@ tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725], [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=) ``` - - -Alle 🤗 Transformers-Modelle (PyTorch oder TensorFlow) geben die Tensoren *vor* der endgültigen Aktivierungsfunktion -Funktion (wie Softmax) aus, da die endgültige Aktivierungsfunktion oft mit dem Verlusten verschmolzen ist. - - +> [!TIP] +> Alle 🤗 Transformers-Modelle (PyTorch oder TensorFlow) geben die Tensoren *vor* der endgültigen Aktivierungsfunktion +> Funktion (wie Softmax) aus, da die endgültige Aktivierungsfunktion oft mit dem Verlusten verschmolzen ist. Modelle sind ein standardmäßiges [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) oder ein [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model), sodass Sie sie in Ihrer üblichen Trainingsschleife verwenden können. Um jedoch die Dinge einfacher zu machen, bietet 🤗 Transformers eine [`Trainer`]-Klasse für PyTorch, die Funktionalität für verteiltes Training, gemischte Präzision und mehr bietet. Für TensorFlow können Sie die Methode `fit` aus [Keras](https://keras.io/) verwenden. Siehe das [training tutorial](./training) für weitere Details. - - -Transformers-Modellausgaben sind spezielle Datenklassen, so dass ihre Attribute in einer IDE automatisch vervollständigt werden. -Die Modellausgänge verhalten sich auch wie ein Tupel oder ein Wörterbuch (z.B. können Sie mit einem Integer, einem Slice oder einem String indexieren), wobei die Attribute, die "None" sind, ignoriert werden. - - +> [!TIP] +> Transformers-Modellausgaben sind spezielle Datenklassen, so dass ihre Attribute in einer IDE automatisch vervollständigt werden. +> Die Modellausgänge verhalten sich auch wie ein Tupel oder ein Wörterbuch (z.B. können Sie mit einem Integer, einem Slice oder einem String indexieren), wobei die Attribute, die "None" sind, ignoriert werden. ### Modell speichern diff --git a/docs/source/de/testing.md b/docs/source/de/testing.md index 07be15f31ece..3912fd636d2f 100644 --- a/docs/source/de/testing.md +++ b/docs/source/de/testing.md @@ -323,17 +323,11 @@ Und führen Sie dann jeden Test mehrmals durch (standardmäßig 50): pytest --flake-finder --flake-runs=5 tests/test_failing_test.py ``` - +> [!TIP] +> Dieses Plugin funktioniert nicht mit dem `-n` Flag von `pytest-xdist`. -Dieses Plugin funktioniert nicht mit dem `-n` Flag von `pytest-xdist`. - - - - - -Es gibt noch ein anderes Plugin `pytest-repeat`, aber es funktioniert nicht mit `unittest`. - - +> [!TIP] +> Es gibt noch ein anderes Plugin `pytest-repeat`, aber es funktioniert nicht mit `unittest`. #### Run tests in a random order @@ -802,20 +796,14 @@ keine Daten dort hinterlassen haben. - `after=True`: das temporäre Verzeichnis wird immer am Ende des Tests gelöscht. - `after=False`: das temporäre Verzeichnis wird am Ende des Tests immer beibehalten. 
- - -Um das Äquivalent von `rm -r` sicher ausführen zu können, sind nur Unterverzeichnisse des Projektarchivs checkout erlaubt, wenn -ein explizites `tmp_dir` verwendet wird, so dass nicht versehentlich ein `/tmp` oder ein ähnlich wichtiger Teil des Dateisystems vernichtet wird. -d.h. geben Sie bitte immer Pfade an, die mit `./` beginnen. - - - - - -Jeder Test kann mehrere temporäre Verzeichnisse registrieren, die alle automatisch entfernt werden, sofern nicht anders gewünscht. -anders. +> [!TIP] +> Um das Äquivalent von `rm -r` sicher ausführen zu können, sind nur Unterverzeichnisse des Projektarchivs checkout erlaubt, wenn +> ein explizites `tmp_dir` verwendet wird, so dass nicht versehentlich ein `/tmp` oder ein ähnlich wichtiger Teil des Dateisystems vernichtet wird. +> d.h. geben Sie bitte immer Pfade an, die mit `./` beginnen. - +> [!TIP] +> Jeder Test kann mehrere temporäre Verzeichnisse registrieren, die alle automatisch entfernt werden, sofern nicht anders gewünscht. +> anders. ### Temporäre Überschreibung von sys.path diff --git a/docs/source/de/training.md b/docs/source/de/training.md index 92051d5d1a58..ccbbaf43f9b7 100644 --- a/docs/source/de/training.md +++ b/docs/source/de/training.md @@ -87,12 +87,9 @@ Beginnen Sie mit dem Laden Ihres Modells und geben Sie die Anzahl der erwarteten >>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` - - -Es wird eine Warnung angezeigt, dass einige der trainierten Parameter nicht verwendet werden und einige Parameter zufällig -initialisiert werden. Machen Sie sich keine Sorgen, das ist völlig normal! Der vorher trainierte Kopf des BERT-Modells wird verworfen und durch einen zufällig initialisierten Klassifikationskopf ersetzt. Sie werden diesen neuen Modellkopf in Ihrer Sequenzklassifizierungsaufgabe feinabstimmen, indem Sie das Wissen des vortrainierten Modells auf ihn übertragen. - - +> [!TIP] +> Es wird eine Warnung angezeigt, dass einige der trainierten Parameter nicht verwendet werden und einige Parameter zufällig +> initialisiert werden. Machen Sie sich keine Sorgen, das ist völlig normal! Der vorher trainierte Kopf des BERT-Modells wird verworfen und durch einen zufällig initialisierten Klassifikationskopf ersetzt. Sie werden diesen neuen Modellkopf in Ihrer Sequenzklassifizierungsaufgabe feinabstimmen, indem Sie das Wissen des vortrainierten Modells auf ihn übertragen. ### Hyperparameter für das Training @@ -248,11 +245,8 @@ Geben Sie schließlich `device` an, um einen Grafikprozessor zu verwenden, wenn >>> model.to(device) ``` - - -Holen Sie sich mit einem gehosteten Notebook wie [Colaboratory](https://colab.research.google.com/) oder [SageMaker StudioLab](https://studiolab.sagemaker.aws/) kostenlosen Zugang zu einem Cloud-GPU, wenn Sie noch keinen haben. - - +> [!TIP] +> Holen Sie sich mit einem gehosteten Notebook wie [Colaboratory](https://colab.research.google.com/) oder [SageMaker StudioLab](https://studiolab.sagemaker.aws/) kostenlosen Zugang zu einem Cloud-GPU, wenn Sie noch keinen haben. Großartig, Sie sind bereit für das Training! 
🥳 diff --git a/docs/source/en/glossary.md b/docs/source/en/glossary.md index 1c8d8ebc2146..b23387fde9b1 100644 --- a/docs/source/en/glossary.md +++ b/docs/source/en/glossary.md @@ -296,12 +296,9 @@ These labels are different according to the model head, for example: - For automatic speech recognition models, ([`Wav2Vec2ForCTC`]), the model expects a tensor of dimension `(batch_size, target_length)` with each value corresponding to the expected label of each individual token. - - -Each model's labels may be different, so be sure to always check the documentation of each model for more information -about their specific labels! - - +> [!TIP] +> Each model's labels may be different, so be sure to always check the documentation of each model for more information +> about their specific labels! The base models ([`BertModel`]) do not accept labels, as these are the base transformer models, simply outputting features. diff --git a/docs/source/en/llm_tutorial_optimization.md b/docs/source/en/llm_tutorial_optimization.md index d3095055472c..3b6cd6617fea 100644 --- a/docs/source/en/llm_tutorial_optimization.md +++ b/docs/source/en/llm_tutorial_optimization.md @@ -680,11 +680,8 @@ Using the key-value cache has two advantages: > One should *always* make use of the key-value cache as it leads to identical results and a significant speed-up for longer input sequences. Transformers has the key-value cache enabled by default when making use of the text pipeline or the [`generate` method](https://huggingface.co/docs/transformers/main_classes/text_generation). We have an entire guide dedicated to caches [here](./kv_cache). - - -Note that, despite our advice to use key-value caches, your LLM output may be slightly different when you use them. This is a property of the matrix multiplication kernels themselves -- you can read more about it [here](https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535). - - +> [!WARNING] +> Note that, despite our advice to use key-value caches, your LLM output may be slightly different when you use them. This is a property of the matrix multiplication kernels themselves -- you can read more about it [here](https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535). #### 3.2.1 Multi-round conversation diff --git a/docs/source/en/main_classes/deepspeed.md b/docs/source/en/main_classes/deepspeed.md index b04949229da4..b53f6a838a3a 100644 --- a/docs/source/en/main_classes/deepspeed.md +++ b/docs/source/en/main_classes/deepspeed.md @@ -20,11 +20,8 @@ rendered properly in your Markdown viewer. However, if you want to use DeepSpeed without the [`Trainer`], Transformers provides a [`HfDeepSpeedConfig`] class. - - -Learn more about using DeepSpeed with [`Trainer`] in the [DeepSpeed](../deepspeed) guide. - - +> [!TIP] +> Learn more about using DeepSpeed with [`Trainer`] in the [DeepSpeed](../deepspeed) guide. ## HfDeepSpeedConfig diff --git a/docs/source/en/main_classes/output.md b/docs/source/en/main_classes/output.md index 8a9ae879fb19..7d03e6e6481a 100644 --- a/docs/source/en/main_classes/output.md +++ b/docs/source/en/main_classes/output.md @@ -40,12 +40,9 @@ an optional `attentions` attribute. Here we have the `loss` since we passed alon `hidden_states` and `attentions` because we didn't pass `output_hidden_states=True` or `output_attentions=True`. - - -When passing `output_hidden_states=True` you may expect the `outputs.hidden_states[-1]` to match `outputs.last_hidden_state` exactly. -However, this is not always the case. 
Some models apply normalization or subsequent process to the last hidden state when it's returned. - - +> [!TIP] +> When passing `output_hidden_states=True` you may expect the `outputs.hidden_states[-1]` to match `outputs.last_hidden_state` exactly. +> However, this is not always the case. Some models apply normalization or subsequent process to the last hidden state when it's returned. You can access each attribute as you would usually do, and if that attribute has not been returned by the model, you will get `None`. Here for instance `outputs.loss` is the loss computed by the model, and `outputs.attentions` is diff --git a/docs/source/en/main_classes/pipelines.md b/docs/source/en/main_classes/pipelines.md index 2a63deeba378..4608a9c906a9 100644 --- a/docs/source/en/main_classes/pipelines.md +++ b/docs/source/en/main_classes/pipelines.md @@ -125,14 +125,11 @@ for out in pipe(KeyDataset(dataset, "text"), batch_size=8, truncation="only_firs # as batches to the model ``` - - -However, this is not automatically a win for performance. It can be either a 10x speedup or 5x slowdown depending -on hardware, data and the actual model being used. - -Example where it's mostly a speedup: - - +> [!WARNING] +> However, this is not automatically a win for performance. It can be either a 10x speedup or 5x slowdown depending +> on hardware, data and the actual model being used. +> +> Example where it's mostly a speedup: ```python from transformers import pipeline diff --git a/docs/source/en/main_classes/quantization.md b/docs/source/en/main_classes/quantization.md index e1f4940103c2..4496a53a980e 100755 --- a/docs/source/en/main_classes/quantization.md +++ b/docs/source/en/main_classes/quantization.md @@ -20,11 +20,8 @@ Quantization techniques reduce memory and computational costs by representing we Quantization techniques that aren't supported in Transformers can be added with the [`HfQuantizer`] class. - - -Learn how to quantize models in the [Quantization](../quantization) guide. - - +> [!TIP] +> Learn how to quantize models in the [Quantization](../quantization) guide. ## QuantoConfig diff --git a/docs/source/en/main_classes/trainer.md b/docs/source/en/main_classes/trainer.md index 21ba9ed935e2..6410f7cb9444 100644 --- a/docs/source/en/main_classes/trainer.md +++ b/docs/source/en/main_classes/trainer.md @@ -20,17 +20,14 @@ The [`Trainer`] class provides an API for feature-complete training in PyTorch, [`Seq2SeqTrainer`] and [`Seq2SeqTrainingArguments`] inherit from the [`Trainer`] and [`TrainingArguments`] classes and they're adapted for training models for sequence-to-sequence tasks such as summarization or translation. - - -The [`Trainer`] class is optimized for 🤗 Transformers models and can have surprising behaviors -when used with other models. When using it with your own model, make sure: - -- your model always return tuples or subclasses of [`~utils.ModelOutput`] -- your model can compute the loss if a `labels` argument is provided and that loss is returned as the first - element of the tuple (if your model returns tuples) -- your model can accept multiple label arguments (use `label_names` in [`TrainingArguments`] to indicate their name to the [`Trainer`]) but none of them should be named `"label"` - - +> [!WARNING] +> The [`Trainer`] class is optimized for 🤗 Transformers models and can have surprising behaviors +> when used with other models. 
When using it with your own model, make sure: +> +> - your model always return tuples or subclasses of [`~utils.ModelOutput`] +> - your model can compute the loss if a `labels` argument is provided and that loss is returned as the first +> element of the tuple (if your model returns tuples) +> - your model can accept multiple label arguments (use `label_names` in [`TrainingArguments`] to indicate their name to the [`Trainer`]) but none of them should be named `"label"` ## Trainer[[api-reference]] diff --git a/docs/source/en/model_doc/auto.md b/docs/source/en/model_doc/auto.md index c1db5e2541a6..3dbb97d90d60 100644 --- a/docs/source/en/model_doc/auto.md +++ b/docs/source/en/model_doc/auto.md @@ -46,16 +46,13 @@ AutoModel.register(NewModelConfig, NewModel) You will then be able to use the auto classes like you would usually do! - - -If your `NewModelConfig` is a subclass of [`~transformers.PretrainedConfig`], make sure its -`model_type` attribute is set to the same key you use when registering the config (here `"new-model"`). - -Likewise, if your `NewModel` is a subclass of [`PreTrainedModel`], make sure its -`config_class` attribute is set to the same class you use when registering the model (here -`NewModelConfig`). - - +> [!WARNING] +> If your `NewModelConfig` is a subclass of [`~transformers.PretrainedConfig`], make sure its +> `model_type` attribute is set to the same key you use when registering the config (here `"new-model"`). +> +> Likewise, if your `NewModel` is a subclass of [`PreTrainedModel`], make sure its +> `config_class` attribute is set to the same class you use when registering the model (here +> `NewModelConfig`). ## AutoConfig diff --git a/docs/source/en/model_doc/bert-japanese.md b/docs/source/en/model_doc/bert-japanese.md index 6599efa73e08..a38a532e566f 100644 --- a/docs/source/en/model_doc/bert-japanese.md +++ b/docs/source/en/model_doc/bert-japanese.md @@ -74,12 +74,9 @@ Example of using a model with Character tokenization: This model was contributed by [cl-tohoku](https://huggingface.co/cl-tohoku). - - -This implementation is the same as BERT, except for tokenization method. Refer to [BERT documentation](bert) for -API reference information. - - +> [!TIP] +> This implementation is the same as BERT, except for tokenization method. Refer to [BERT documentation](bert) for +> API reference information. ## BertJapaneseTokenizer diff --git a/docs/source/en/model_doc/bort.md b/docs/source/en/model_doc/bort.md index 159a5027f03f..fa0dc8bebcb2 100644 --- a/docs/source/en/model_doc/bort.md +++ b/docs/source/en/model_doc/bort.md @@ -21,14 +21,11 @@ rendered properly in your Markdown viewer. PyTorch
- - -This model is in maintenance mode only, we do not accept any new PRs changing its code. - -If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. -You can do so by running the following command: `pip install -U transformers==4.30.0`. - - +> [!WARNING] +> This model is in maintenance mode only, we do not accept any new PRs changing its code. +> +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. +> You can do so by running the following command: `pip install -U transformers==4.30.0`. ## Overview diff --git a/docs/source/en/model_doc/chameleon.md b/docs/source/en/model_doc/chameleon.md index dc573faa1112..8db9fc235889 100644 --- a/docs/source/en/model_doc/chameleon.md +++ b/docs/source/en/model_doc/chameleon.md @@ -134,13 +134,10 @@ processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokeniza The model can be loaded in 8 or 4 bits, greatly reducing the memory requirements while maintaining the performance of the original model. First make sure to install bitsandbytes, `pip install bitsandbytes` and to have access to a GPU/accelerator that is supported by the library. - - -bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend). - -We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links. - - +> [!TIP] +> bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend). +> +> We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links. Simply change the snippet above with: diff --git a/docs/source/en/model_doc/cpm.md b/docs/source/en/model_doc/cpm.md index 275f5629db13..398b171fb4ad 100644 --- a/docs/source/en/model_doc/cpm.md +++ b/docs/source/en/model_doc/cpm.md @@ -42,12 +42,9 @@ NLP tasks in the settings of few-shot (even zero-shot) learning.* This model was contributed by [canwenxu](https://huggingface.co/canwenxu). The original implementation can be found here: https://github.com/TsinghuaAI/CPM-Generate - - -CPM's architecture is the same as GPT-2, except for tokenization method. Refer to [GPT-2 documentation](gpt2) for -API reference information. - - +> [!TIP] +> CPM's architecture is the same as GPT-2, except for tokenization method. Refer to [GPT-2 documentation](gpt2) for +> API reference information. 
## CpmTokenizer diff --git a/docs/source/en/model_doc/deplot.md b/docs/source/en/model_doc/deplot.md index 5a7d4d12dcd6..f97291871f01 100644 --- a/docs/source/en/model_doc/deplot.md +++ b/docs/source/en/model_doc/deplot.md @@ -64,8 +64,5 @@ optimizer = Adafactor(self.parameters(), scale_parameter=False, relative_step=Fa scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=40000) ``` - - -DePlot is a model trained using `Pix2Struct` architecture. For API reference, see [`Pix2Struct` documentation](pix2struct). - - +> [!TIP] +> DePlot is a model trained using `Pix2Struct` architecture. For API reference, see [`Pix2Struct` documentation](pix2struct). diff --git a/docs/source/en/model_doc/deta.md b/docs/source/en/model_doc/deta.md index 0dda1c891791..b3e5a9fbaa54 100644 --- a/docs/source/en/model_doc/deta.md +++ b/docs/source/en/model_doc/deta.md @@ -21,13 +21,10 @@ rendered properly in your Markdown viewer. PyTorch - - -This model is in maintenance mode only, we don't accept any new PRs changing its code. -If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. -You can do so by running the following command: `pip install -U transformers==4.40.2`. - - +> [!WARNING] +> This model is in maintenance mode only, we don't accept any new PRs changing its code. +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. +> You can do so by running the following command: `pip install -U transformers==4.40.2`. ## Overview diff --git a/docs/source/en/model_doc/dialogpt.md b/docs/source/en/model_doc/dialogpt.md index 34b33d48e500..ce97d69a389c 100644 --- a/docs/source/en/model_doc/dialogpt.md +++ b/docs/source/en/model_doc/dialogpt.md @@ -54,8 +54,5 @@ follow the OpenAI GPT-2 to model a multiturn dialogue session as a long text and modeling. We first concatenate all dialog turns within a dialogue session into a long text x_1,..., x_N (N is the sequence length), ended by the end-of-text token.* For more information please confer to the original paper. - - -DialoGPT's architecture is based on the GPT2 model, refer to [GPT2's documentation page](gpt2) for API reference and examples. - - +> [!TIP] +> DialoGPT's architecture is based on the GPT2 model, refer to [GPT2's documentation page](gpt2) for API reference and examples. diff --git a/docs/source/en/model_doc/efficientformer.md b/docs/source/en/model_doc/efficientformer.md index f25460976d0f..375f1eb6bdf4 100644 --- a/docs/source/en/model_doc/efficientformer.md +++ b/docs/source/en/model_doc/efficientformer.md @@ -21,13 +21,10 @@ rendered properly in your Markdown viewer. PyTorch - - -This model is in maintenance mode only, we don't accept any new PRs changing its code. -If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. -You can do so by running the following command: `pip install -U transformers==4.40.2`. - - +> [!WARNING] +> This model is in maintenance mode only, we don't accept any new PRs changing its code. +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. +> You can do so by running the following command: `pip install -U transformers==4.40.2`. 
## Overview diff --git a/docs/source/en/model_doc/ernie_m.md b/docs/source/en/model_doc/ernie_m.md index e044614e7644..60eb2d3784b6 100644 --- a/docs/source/en/model_doc/ernie_m.md +++ b/docs/source/en/model_doc/ernie_m.md @@ -21,13 +21,10 @@ rendered properly in your Markdown viewer. PyTorch - - -This model is in maintenance mode only, we don't accept any new PRs changing its code. -If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. -You can do so by running the following command: `pip install -U transformers==4.40.2`. - - +> [!WARNING] +> This model is in maintenance mode only, we don't accept any new PRs changing its code. +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. +> You can do so by running the following command: `pip install -U transformers==4.40.2`. ## Overview diff --git a/docs/source/en/model_doc/flan-t5.md b/docs/source/en/model_doc/flan-t5.md index e7b2473eaab4..429003aef4bb 100644 --- a/docs/source/en/model_doc/flan-t5.md +++ b/docs/source/en/model_doc/flan-t5.md @@ -55,8 +55,5 @@ Google has released the following variants: The original checkpoints can be found [here](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints). - - -Refer to [T5's documentation page](t5) for all API reference, code examples and notebooks. For more details regarding training and evaluation of the FLAN-T5, refer to the model card. - - +> [!TIP] +> Refer to [T5's documentation page](t5) for all API reference, code examples and notebooks. For more details regarding training and evaluation of the FLAN-T5, refer to the model card. diff --git a/docs/source/en/model_doc/flan-ul2.md b/docs/source/en/model_doc/flan-ul2.md index b4cbac713a38..4b07997b816e 100644 --- a/docs/source/en/model_doc/flan-ul2.md +++ b/docs/source/en/model_doc/flan-ul2.md @@ -51,8 +51,5 @@ The model is pretty heavy (~40GB in half precision) so if you just want to run t ['In a large skillet, brown the ground beef and onion over medium heat. Add the garlic'] ``` - - -Refer to [T5's documentation page](t5) for API reference, tips, code examples and notebooks. - - +> [!TIP] +> Refer to [T5's documentation page](t5) for API reference, tips, code examples and notebooks. diff --git a/docs/source/en/model_doc/fuyu.md b/docs/source/en/model_doc/fuyu.md index 34202b022f7e..3153d65bb8e3 100644 --- a/docs/source/en/model_doc/fuyu.md +++ b/docs/source/en/model_doc/fuyu.md @@ -29,16 +29,13 @@ The authors introduced Fuyu-8B, a decoder-only multimodal model based on the cla By treating image tokens like text tokens and using a special image-newline character, the model knows when an image line ends. Image positional embeddings are removed. This avoids the need for different training phases for various image resolutions. With 8 billion parameters and licensed under CC-BY-NC, Fuyu-8B is notable for its ability to handle both text and images, its impressive context size of 16K, and its overall performance. - - -The `Fuyu` models were trained using `bfloat16`, but the original inference uses `float16` The checkpoints uploaded on the hub use `dtype = 'float16'` which will be -used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`. - -The `dtype` of the online weights is mostly irrelevant, unless you are using `dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", dtype = "auto")`. 
The reason is that the model will first be downloaded ( using the `dtype` of the checkpoints online) then it will be cast to the default `dtype` of `torch` (becomes `torch.float32`). Users should specify the `dtype` they want, and if they don't it will be `torch.float32`. - -Finetuning the model in `float16` is not recommended and known to produce `nan`, as such the model should be fine-tuned in `bfloat16`. - - +> [!WARNING] +> The `Fuyu` models were trained using `bfloat16`, but the original inference uses `float16` The checkpoints uploaded on the hub use `dtype = 'float16'` which will be +> used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`. +> +> The `dtype` of the online weights is mostly irrelevant, unless you are using `dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", dtype = "auto")`. The reason is that the model will first be downloaded ( using the `dtype` of the checkpoints online) then it will be cast to the default `dtype` of `torch` (becomes `torch.float32`). Users should specify the `dtype` they want, and if they don't it will be `torch.float32`. +> +> Finetuning the model in `float16` is not recommended and known to produce `nan`, as such the model should be fine-tuned in `bfloat16`. Tips: diff --git a/docs/source/en/model_doc/glpn.md b/docs/source/en/model_doc/glpn.md index 810e00e00e56..0f311365c954 100644 --- a/docs/source/en/model_doc/glpn.md +++ b/docs/source/en/model_doc/glpn.md @@ -21,12 +21,9 @@ rendered properly in your Markdown viewer. PyTorch - - -This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight -breaking changes to fix it in the future. If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title). - - +> [!TIP] +> This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight +> breaking changes to fix it in the future. If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title). ## Overview diff --git a/docs/source/en/model_doc/gpt-sw3.md b/docs/source/en/model_doc/gpt-sw3.md index 984e485380a0..d00edcdd36a7 100644 --- a/docs/source/en/model_doc/gpt-sw3.md +++ b/docs/source/en/model_doc/gpt-sw3.md @@ -59,14 +59,11 @@ Träd är fina för att de är färgstarka. Men ibland är det fint - [Token classification task guide](../tasks/token_classification) - [Causal language modeling task guide](../tasks/language_modeling) - - -The implementation uses the `GPT2Model` coupled with our `GPTSw3Tokenizer`. Refer to [GPT2Model documentation](gpt2) -for API reference and examples. - -Note that sentencepiece is required to use our tokenizer and can be installed with `pip install transformers[sentencepiece]` or `pip install sentencepiece` - - +> [!TIP] +> The implementation uses the `GPT2Model` coupled with our `GPTSw3Tokenizer`. Refer to [GPT2Model documentation](gpt2) +> for API reference and examples. 
+> +> Note that sentencepiece is required to use our tokenizer and can be installed with `pip install transformers[sentencepiece]` or `pip install sentencepiece` ## GPTSw3Tokenizer diff --git a/docs/source/en/model_doc/gptsan-japanese.md b/docs/source/en/model_doc/gptsan-japanese.md index fc83f4846e04..5a472c7c8fac 100644 --- a/docs/source/en/model_doc/gptsan-japanese.md +++ b/docs/source/en/model_doc/gptsan-japanese.md @@ -21,13 +21,10 @@ rendered properly in your Markdown viewer. PyTorch - - -This model is in maintenance mode only, we don't accept any new PRs changing its code. -If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. -You can do so by running the following command: `pip install -U transformers==4.40.2`. - - +> [!WARNING] +> This model is in maintenance mode only, we don't accept any new PRs changing its code. +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. +> You can do so by running the following command: `pip install -U transformers==4.40.2`. ## Overview diff --git a/docs/source/en/model_doc/graphormer.md b/docs/source/en/model_doc/graphormer.md index 851f52df09f4..81dc83f866d9 100644 --- a/docs/source/en/model_doc/graphormer.md +++ b/docs/source/en/model_doc/graphormer.md @@ -19,13 +19,10 @@ rendered properly in your Markdown viewer. PyTorch - - -This model is in maintenance mode only, we don't accept any new PRs changing its code. -If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. -You can do so by running the following command: `pip install -U transformers==4.40.2`. - - +> [!WARNING] +> This model is in maintenance mode only, we don't accept any new PRs changing its code. +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. +> You can do so by running the following command: `pip install -U transformers==4.40.2`. ## Overview diff --git a/docs/source/en/model_doc/herbert.md b/docs/source/en/model_doc/herbert.md index aa6a4bf96adf..1b2835bc07ea 100644 --- a/docs/source/en/model_doc/herbert.md +++ b/docs/source/en/model_doc/herbert.md @@ -64,12 +64,9 @@ This model was contributed by [rmroczkowski](https://huggingface.co/rmroczkowski >>> model = AutoModel.from_pretrained("allegro/herbert-klej-cased-v1") ``` - - -Herbert implementation is the same as `BERT` except for the tokenization method. Refer to [BERT documentation](bert) -for API reference and examples. - - +> [!TIP] +> Herbert implementation is the same as `BERT` except for the tokenization method. Refer to [BERT documentation](bert) +> for API reference and examples. ## HerbertTokenizer diff --git a/docs/source/en/model_doc/idefics.md b/docs/source/en/model_doc/idefics.md index fdb6e5de4659..693a3bde3c08 100644 --- a/docs/source/en/model_doc/idefics.md +++ b/docs/source/en/model_doc/idefics.md @@ -34,13 +34,10 @@ The abstract from the paper is the following: This model was contributed by [HuggingFaceM4](https://huggingface.co/HuggingFaceM4). The original code can be found [here](). (TODO: don't have a public link yet). - - -IDEFICS modeling code in Transformers is for finetuning and inferencing the pre-trained IDEFICS models. 
- -To train a new IDEFICS model from scratch use the m4 codebase (a link will be provided once it's made public) - - +> [!WARNING] +> IDEFICS modeling code in Transformers is for finetuning and inferencing the pre-trained IDEFICS models. +> +> To train a new IDEFICS model from scratch use the m4 codebase (a link will be provided once it's made public) ## IdeficsConfig diff --git a/docs/source/en/model_doc/jukebox.md b/docs/source/en/model_doc/jukebox.md index 385eeb560e50..cc5f659679fd 100644 --- a/docs/source/en/model_doc/jukebox.md +++ b/docs/source/en/model_doc/jukebox.md @@ -20,13 +20,10 @@ rendered properly in your Markdown viewer. PyTorch - - -This model is in maintenance mode only, we don't accept any new PRs changing its code. -If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. -You can do so by running the following command: `pip install -U transformers==4.40.2`. - - +> [!WARNING] +> This model is in maintenance mode only, we don't accept any new PRs changing its code. +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. +> You can do so by running the following command: `pip install -U transformers==4.40.2`. ## Overview diff --git a/docs/source/en/model_doc/layoutlmv3.md b/docs/source/en/model_doc/layoutlmv3.md index b9964fa3f86c..362ef271c472 100644 --- a/docs/source/en/model_doc/layoutlmv3.md +++ b/docs/source/en/model_doc/layoutlmv3.md @@ -46,11 +46,8 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LayoutLMv3. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. - - -LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2 resources you can adapt for LayoutLMv3 tasks. For these notebooks, take care to use [`LayoutLMv2Processor`] instead when preparing data for the model! - - +> [!TIP] +> LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2 resources you can adapt for LayoutLMv3 tasks. For these notebooks, take care to use [`LayoutLMv2Processor`] instead when preparing data for the model! - Demo notebooks for LayoutLMv3 can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3). - Demo scripts can be found [here](https://github.com/huggingface/transformers-research-projects/tree/main/layoutlmv3). diff --git a/docs/source/en/model_doc/layoutxlm.md b/docs/source/en/model_doc/layoutxlm.md index 19051f55b683..83997e362286 100644 --- a/docs/source/en/model_doc/layoutxlm.md +++ b/docs/source/en/model_doc/layoutxlm.md @@ -65,10 +65,8 @@ Similar to LayoutLMv2, you can use [`LayoutXLMProcessor`] (which internally appl [`LayoutXLMTokenizer`]/[`LayoutXLMTokenizerFast`] in sequence) to prepare all data for the model. - - -As LayoutXLM's architecture is equivalent to that of LayoutLMv2, one can refer to [LayoutLMv2's documentation page](layoutlmv2) for all tips, code examples and notebooks. - +> [!TIP] +> As LayoutXLM's architecture is equivalent to that of LayoutLMv2, one can refer to [LayoutLMv2's documentation page](layoutlmv2) for all tips, code examples and notebooks. 
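As a concrete illustration of the processing pipeline described above, here is a minimal sketch of preparing a single document image with [`LayoutXLMProcessor`]. It assumes the `microsoft/layoutxlm-base` checkpoint, a local `document.png`, and that Tesseract/`pytesseract` are installed so the default `apply_ocr=True` path can extract words and boxes:

```python
from PIL import Image
from transformers import LayoutXLMProcessor

processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base")

# the processor runs OCR, tokenizes the recognized words, and preprocesses the image in one call
image = Image.open("document.png").convert("RGB")
encoding = processor(image, return_tensors="pt")
print(encoding.keys())  # token ids, attention mask, bounding boxes, and the processed image tensor
```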
## LayoutXLMTokenizer diff --git a/docs/source/en/model_doc/llama3.md b/docs/source/en/model_doc/llama3.md index 4f98d9c778a5..6592c03e5ded 100644 --- a/docs/source/en/model_doc/llama3.md +++ b/docs/source/en/model_doc/llama3.md @@ -45,16 +45,13 @@ The original code of the authors can be found [here](https://github.com/meta-lla ## Usage tips - - -The `Llama3` models were trained using `bfloat16`, but the original inference uses `float16`. The checkpoints uploaded on the Hub use `dtype = 'float16'`, which will be -used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`. - -The `dtype` of the online weights is mostly irrelevant unless you are using `dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", dtype = "auto")`. The reason is that the model will first be downloaded ( using the `dtype` of the checkpoints online), then it will be casted to the default `dtype` of `torch` (becomes `torch.float32`), and finally, if there is a `dtype` or `torch_dtype` provided in the config, it will be used. - -Training the model in `float16` is not recommended and is known to produce `nan`; as such, the model should be trained in `bfloat16`. - - +> [!WARNING] +> The `Llama3` models were trained using `bfloat16`, but the original inference uses `float16`. The checkpoints uploaded on the Hub use `dtype = 'float16'`, which will be +> used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`. +> +> The `dtype` of the online weights is mostly irrelevant unless you are using `dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", dtype = "auto")`. The reason is that the model will first be downloaded ( using the `dtype` of the checkpoints online), then it will be casted to the default `dtype` of `torch` (becomes `torch.float32`), and finally, if there is a `dtype` or `torch_dtype` provided in the config, it will be used. +> +> Training the model in `float16` is not recommended and is known to produce `nan`; as such, the model should be trained in `bfloat16`. Tips: diff --git a/docs/source/en/model_doc/llava_next_video.md b/docs/source/en/model_doc/llava_next_video.md index 61aa7e1ffc51..4811b4b4ecf0 100644 --- a/docs/source/en/model_doc/llava_next_video.md +++ b/docs/source/en/model_doc/llava_next_video.md @@ -48,11 +48,8 @@ The original code can be found [here](https://github.com/LLaVA-VL/LLaVA-NeXT/tre - We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Simply make sure to call `processor.tokenizer.padding_side = "left"` before generating. - - -- Llava-Next uses different number of patches for images and thus has to pad the inputs inside modeling code, aside from the padding done when processing the inputs. The default setting is "left-padding" if model is in `eval()` mode, otherwise "right-padding". - - +> [!WARNING] +> - Llava-Next uses different number of patches for images and thus has to pad the inputs inside modeling code, aside from the padding done when processing the inputs. The default setting is "left-padding" if model is in `eval()` mode, otherwise "right-padding". > [!NOTE] > LLaVA models after release v4.46 will raise warnings about adding `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}` and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. 
It is strongly recommended to add the attributes to the processor if you own the model checkpoint, or open a PR if it is not owned by you. @@ -193,13 +190,10 @@ The model can be loaded in lower bits, significantly reducing memory burden whil First, make sure to install bitsandbytes by running `pip install bitsandbytes` and to have access to a GPU/accelerator that is supported by the library. - - -bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend). - -We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links. - - +> [!TIP] +> bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend). +> +> We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links. Then simply load the quantized model by adding [`BitsAndBytesConfig`](../main_classes/quantization#transformers.BitsAndBytesConfig) as shown below: diff --git a/docs/source/en/model_doc/llava_onevision.md b/docs/source/en/model_doc/llava_onevision.md index 08bc075495b0..921a74f19ca8 100644 --- a/docs/source/en/model_doc/llava_onevision.md +++ b/docs/source/en/model_doc/llava_onevision.md @@ -48,11 +48,8 @@ Tips: - We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Simply make sure to call `processor.tokenizer.padding_side = "left"` before generating. - - -- Llava-OneVision uses different number of patches for images and thus has to pad the inputs inside modeling code, aside from the padding done when processing the inputs. The default setting is "left-padding" if model is in `eval()` mode, otherwise "right-padding". - - +> [!WARNING] +> - Llava-OneVision uses different number of patches for images and thus has to pad the inputs inside modeling code, aside from the padding done when processing the inputs. The default setting is "left-padding" if model is in `eval()` mode, otherwise "right-padding". ### Formatting Prompts with Chat Templates @@ -253,13 +250,10 @@ processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spac The model can be loaded in 8 or 4 bits, greatly reducing the memory requirements while maintaining the performance of the original model. First make sure to install bitsandbytes, `pip install bitsandbytes` and make sure to have access to a GPU/accelerator that is supported by the library. - - -bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. 
For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend). - -We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links. - - +> [!TIP] +> bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend). +> +> We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links. Simply change the snippet above with: diff --git a/docs/source/en/model_doc/madlad-400.md b/docs/source/en/model_doc/madlad-400.md index ca8891960232..06116785a85c 100644 --- a/docs/source/en/model_doc/madlad-400.md +++ b/docs/source/en/model_doc/madlad-400.md @@ -66,8 +66,5 @@ Google has released the following variants: The original checkpoints can be found [here](https://github.com/google-research/google-research/tree/master/madlad_400). - - -Refer to [T5's documentation page](t5) for all API references, code examples, and notebooks. For more details regarding training and evaluation of the MADLAD-400, refer to the model card. - - +> [!TIP] +> Refer to [T5's documentation page](t5) for all API references, code examples, and notebooks. For more details regarding training and evaluation of the MADLAD-400, refer to the model card. diff --git a/docs/source/en/model_doc/maskformer.md b/docs/source/en/model_doc/maskformer.md index aed2dcfa6c40..8b65686a8297 100644 --- a/docs/source/en/model_doc/maskformer.md +++ b/docs/source/en/model_doc/maskformer.md @@ -21,12 +21,9 @@ rendered properly in your Markdown viewer. PyTorch - - -This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight -breaking changes to fix it in the future. If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title). - - +> [!TIP] +> This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight +> breaking changes to fix it in the future. If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title). ## Overview diff --git a/docs/source/en/model_doc/matcha.md b/docs/source/en/model_doc/matcha.md index a5b2689dcb5d..6bbb7ed25c5a 100644 --- a/docs/source/en/model_doc/matcha.md +++ b/docs/source/en/model_doc/matcha.md @@ -75,8 +75,5 @@ optimizer = Adafactor(self.parameters(), scale_parameter=False, relative_step=Fa scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=40000) ``` - - -MatCha is a model that is trained using `Pix2Struct` architecture. You can find more information about `Pix2Struct` in the [Pix2Struct documentation](https://huggingface.co/docs/transformers/main/en/model_doc/pix2struct). - - +> [!TIP] +> MatCha is a model that is trained using `Pix2Struct` architecture. 
You can find more information about `Pix2Struct` in the [Pix2Struct documentation](https://huggingface.co/docs/transformers/main/en/model_doc/pix2struct). diff --git a/docs/source/en/model_doc/mctct.md b/docs/source/en/model_doc/mctct.md index c766b1a825d6..ee3a9ec994d5 100644 --- a/docs/source/en/model_doc/mctct.md +++ b/docs/source/en/model_doc/mctct.md @@ -21,14 +21,11 @@ rendered properly in your Markdown viewer. PyTorch - - -This model is in maintenance mode only, so we won't accept any new PRs changing its code. - -If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. -You can do so by running the following command: `pip install -U transformers==4.30.0`. - - +> [!WARNING] +> This model is in maintenance mode only, so we won't accept any new PRs changing its code. +> +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. +> You can do so by running the following command: `pip install -U transformers==4.30.0`. ## Overview diff --git a/docs/source/en/model_doc/mega.md b/docs/source/en/model_doc/mega.md index d6580427778a..29021420be6a 100644 --- a/docs/source/en/model_doc/mega.md +++ b/docs/source/en/model_doc/mega.md @@ -21,13 +21,10 @@ rendered properly in your Markdown viewer. PyTorch - - -This model is in maintenance mode only, we don't accept any new PRs changing its code. -If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. -You can do so by running the following command: `pip install -U transformers==4.40.2`. - - +> [!WARNING] +> This model is in maintenance mode only, we don't accept any new PRs changing its code. +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. +> You can do so by running the following command: `pip install -U transformers==4.40.2`. ## Overview diff --git a/docs/source/en/model_doc/megatron_gpt2.md b/docs/source/en/model_doc/megatron_gpt2.md index e0f93b08ae52..40a9aa0302a5 100644 --- a/docs/source/en/model_doc/megatron_gpt2.md +++ b/docs/source/en/model_doc/megatron_gpt2.md @@ -74,9 +74,6 @@ The following command allows you to do the conversion. We assume that the folder python3 $PATH_TO_TRANSFORMERS/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py megatron_gpt2_345m_v0_0.zip ``` - - - MegatronGPT2 architecture is the same as OpenAI GPT-2 . Refer to [GPT-2 documentation](gpt2) for information on - configuration classes and their parameters. - - +> [!TIP] +> MegatronGPT2 architecture is the same as OpenAI GPT-2 . Refer to [GPT-2 documentation](gpt2) for information on +> configuration classes and their parameters. diff --git a/docs/source/en/model_doc/mllama.md b/docs/source/en/model_doc/mllama.md index a0fc5db41cfe..621f0b748440 100644 --- a/docs/source/en/model_doc/mllama.md +++ b/docs/source/en/model_doc/mllama.md @@ -35,22 +35,19 @@ The [Llama 3.2-Vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-ed - The text passed to the processor should have the `"<|image|>"` tokens where the images should be inserted. - The processor has its own `apply_chat_template` method to convert chat messages to text that can then be passed as text to the processor. If you're using `transformers>=4.49.0`, you can also get a vectorized output from `apply_chat_template`. See the **Usage Examples** below for more details on how to use it. 
- - -Mllama has an extra token used as a placeholder for image positions in the text. It means that input ids and an input embedding layer will have an extra token. But since the weights for input and output embeddings are not tied, the `lm_head` layer has one less token and will fail if you want to calculate loss on image tokens or apply some logit processors. In case you are training, make sure to mask out special `"<|image|>"` tokens in the `labels` as the model should not be trained on predicting them. - -Otherwise if you see CUDA-side index errors when generating, use the below code to expand the `lm_head` by one more token. - -```python -old_embeddings = model.get_output_embeddings() - -num_tokens = model.vocab_size + 1 -resized_embeddings = model._get_resized_lm_head(old_embeddings, new_num_tokens=num_tokens, mean_resizing=True) -resized_embeddings.requires_grad_(old_embeddings.weight.requires_grad) -model.set_output_embeddings(resized_embeddings) -``` - - +> [!WARNING] +> Mllama has an extra token used as a placeholder for image positions in the text. It means that input ids and an input embedding layer will have an extra token. But since the weights for input and output embeddings are not tied, the `lm_head` layer has one less token and will fail if you want to calculate loss on image tokens or apply some logit processors. In case you are training, make sure to mask out special `"<|image|>"` tokens in the `labels` as the model should not be trained on predicting them. +> +> Otherwise if you see CUDA-side index errors when generating, use the below code to expand the `lm_head` by one more token. +> +> ```python +> old_embeddings = model.get_output_embeddings() +> +> num_tokens = model.vocab_size + 1 +> resized_embeddings = model._get_resized_lm_head(old_embeddings, new_num_tokens=num_tokens, mean_resizing=True) +> resized_embeddings.requires_grad_(old_embeddings.weight.requires_grad) +> model.set_output_embeddings(resized_embeddings) +> ``` ## Usage Example diff --git a/docs/source/en/model_doc/mluke.md b/docs/source/en/model_doc/mluke.md index f9310d6c22f9..34a6d8d3708a 100644 --- a/docs/source/en/model_doc/mluke.md +++ b/docs/source/en/model_doc/mluke.md @@ -62,12 +62,9 @@ from transformers import MLukeTokenizer tokenizer = MLukeTokenizer.from_pretrained("studio-ousia/mluke-base") ``` - - -As mLUKE's architecture is equivalent to that of LUKE, one can refer to [LUKE's documentation page](luke) for all -tips, code examples and notebooks. - - +> [!TIP] +> As mLUKE's architecture is equivalent to that of LUKE, one can refer to [LUKE's documentation page](luke) for all +> tips, code examples and notebooks. 
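Following the LUKE-style API the tip points to, here is a minimal sketch of entity-aware encoding with [`MLukeTokenizer`]; the sentence and character spans are illustrative, and the `studio-ousia/mluke-base` checkpoint is assumed to load with the LUKE model class:

```python
from transformers import LukeModel, MLukeTokenizer

tokenizer = MLukeTokenizer.from_pretrained("studio-ousia/mluke-base")
model = LukeModel.from_pretrained("studio-ousia/mluke-base")

text = "Beyoncé lives in Los Angeles."
entity_spans = [(0, 7), (17, 28)]  # character spans of "Beyoncé" and "Los Angeles"

inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)         # contextualized word token representations
print(outputs.entity_last_hidden_state.shape)  # one representation per entity span
```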
## MLukeTokenizer diff --git a/docs/source/en/model_doc/mms.md b/docs/source/en/model_doc/mms.md index 171beaf440d1..883537837e43 100644 --- a/docs/source/en/model_doc/mms.md +++ b/docs/source/en/model_doc/mms.md @@ -71,18 +71,15 @@ processor = AutoProcessor.from_pretrained(model_id, target_lang=target_lang) model = Wav2Vec2ForCTC.from_pretrained(model_id, target_lang=target_lang, ignore_mismatched_sizes=True) ``` - - -You can safely ignore a warning such as: - -```text -Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/mms-1b-all and are newly initialized because the shapes did not match: -- lm_head.bias: found shape torch.Size([154]) in the checkpoint and torch.Size([314]) in the model instantiated -- lm_head.weight: found shape torch.Size([154, 1280]) in the checkpoint and torch.Size([314, 1280]) in the model instantiated -You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. -``` - - +> [!TIP] +> You can safely ignore a warning such as: +> +> ```text +> Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/mms-1b-all and are newly initialized because the shapes did not match: +> - lm_head.bias: found shape torch.Size([154]) in the checkpoint and torch.Size([314]) in the model instantiated +> - lm_head.weight: found shape torch.Size([154, 1280]) in the checkpoint and torch.Size([314, 1280]) in the model instantiated +> You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. +> ``` If you want to use the ASR pipeline, you can load your chosen target language as such: @@ -386,10 +383,8 @@ processor.id2label.values() Pretrained models are available for two different sizes - [300M](https://huggingface.co/facebook/mms-300m) , [1Bil](https://huggingface.co/facebook/mms-1b). - - -The MMS for ASR architecture is based on the Wav2Vec2 model, refer to [Wav2Vec2's documentation page](wav2vec2) for further -details on how to finetune with models for various downstream tasks. - -MMS-TTS uses the same model architecture as VITS, refer to [VITS's documentation page](vits) for API reference. - +> [!TIP] +> The MMS for ASR architecture is based on the Wav2Vec2 model, refer to [Wav2Vec2's documentation page](wav2vec2) for further +> details on how to finetune with models for various downstream tasks. +> +> MMS-TTS uses the same model architecture as VITS, refer to [VITS's documentation page](vits) for API reference. diff --git a/docs/source/en/model_doc/moshi.md b/docs/source/en/model_doc/moshi.md index 885623b26e52..45f4a341f7ab 100644 --- a/docs/source/en/model_doc/moshi.md +++ b/docs/source/en/model_doc/moshi.md @@ -75,11 +75,8 @@ This implementation has two main aims: 1. quickly test model generation by simplifying the original API 2. simplify training. A training guide will come soon, but user contributions are welcomed! - - -It is designed for intermediate use. We strongly recommend using the original [implementation](https://github.com/kyutai-labs/moshi) to infer the model in real-time streaming. - - +> [!TIP] +> It is designed for intermediate use. We strongly recommend using the original [implementation](https://github.com/kyutai-labs/moshi) to infer the model in real-time streaming. **1. Model generation** @@ -98,13 +95,10 @@ You can dynamically use the 3 inputs depending on what you want to test: 1. 
Simply check the model response to a user prompt - in that case, `input_ids` can be filled with pad tokens and `user_input_values` can be a zero tensor of the same shape as the user prompt. 2. Test more complex behaviour - in that case, you must be careful about how the input tokens are synchronized with the audios. - - -The original model synchronizes text with audio by padding the text in between each token enunciation. - -To follow the example of the following image, `"Hello, I'm Moshi"` could be transformed to `"Hello,<pad><pad> I'm Moshi"`. - - +> [!TIP] +> The original model synchronizes text with audio by padding the text in between each token enunciation. +> +> To follow the example of the following image, `"Hello, I'm Moshi"` could be transformed to `"Hello,<pad><pad> I'm Moshi"`.
diff --git a/docs/source/en/model_doc/nat.md b/docs/source/en/model_doc/nat.md index 36662173f2f4..bdfcd78fdf20 100644 --- a/docs/source/en/model_doc/nat.md +++ b/docs/source/en/model_doc/nat.md @@ -21,13 +21,10 @@ rendered properly in your Markdown viewer. PyTorch
- - -This model is in maintenance mode only, we don't accept any new PRs changing its code. -If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. -You can do so by running the following command: `pip install -U transformers==4.40.2`. - - +> [!WARNING] +> This model is in maintenance mode only, we don't accept any new PRs changing its code. +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. +> You can do so by running the following command: `pip install -U transformers==4.40.2`. ## Overview diff --git a/docs/source/en/model_doc/nezha.md b/docs/source/en/model_doc/nezha.md index 37687fc25df5..a007b09ca18d 100644 --- a/docs/source/en/model_doc/nezha.md +++ b/docs/source/en/model_doc/nezha.md @@ -21,13 +21,10 @@ rendered properly in your Markdown viewer. PyTorch - - -This model is in maintenance mode only, we don't accept any new PRs changing its code. -If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. -You can do so by running the following command: `pip install -U transformers==4.40.2`. - - +> [!WARNING] +> This model is in maintenance mode only, we don't accept any new PRs changing its code. +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. +> You can do so by running the following command: `pip install -U transformers==4.40.2`. ## Overview diff --git a/docs/source/en/model_doc/nougat.md b/docs/source/en/model_doc/nougat.md index f07cc42f4a2d..3576816c3469 100644 --- a/docs/source/en/model_doc/nougat.md +++ b/docs/source/en/model_doc/nougat.md @@ -94,11 +94,8 @@ into a single instance to both extract the input features and decode the predict See the [model hub](https://huggingface.co/models?filter=nougat) to look for Nougat checkpoints. - - -The model is identical to [Donut](donut) in terms of architecture. - - +> [!TIP] +> The model is identical to [Donut](donut) in terms of architecture. ## NougatImageProcessor diff --git a/docs/source/en/model_doc/open-llama.md b/docs/source/en/model_doc/open-llama.md index 38954cd315d0..ea44ed4a7b4f 100644 --- a/docs/source/en/model_doc/open-llama.md +++ b/docs/source/en/model_doc/open-llama.md @@ -21,20 +21,14 @@ rendered properly in your Markdown viewer. PyTorch - - -This model is in maintenance mode only, we don't accept any new PRs changing its code. - -If you run into any issues running this model, please reinstall the last version that supported this model: v4.31.0. -You can do so by running the following command: `pip install -U transformers==4.31.0`. - - - - - -This model differs from the [OpenLLaMA models](https://huggingface.co/models?search=openllama) on the Hugging Face Hub, which primarily use the [LLaMA](llama) architecture. - - +> [!WARNING] +> This model is in maintenance mode only, we don't accept any new PRs changing its code. +> +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.31.0. +> You can do so by running the following command: `pip install -U transformers==4.31.0`. + +> [!WARNING] +> This model differs from the [OpenLLaMA models](https://huggingface.co/models?search=openllama) on the Hugging Face Hub, which primarily use the [LLaMA](llama) architecture. 
## Overview diff --git a/docs/source/en/model_doc/owlv2.md b/docs/source/en/model_doc/owlv2.md index 675dc1c9c0d5..ffba8f2d44db 100644 --- a/docs/source/en/model_doc/owlv2.md +++ b/docs/source/en/model_doc/owlv2.md @@ -80,12 +80,9 @@ Detected a photo of a cat with confidence 0.665 at location [6.75, 51.96, 326.62 - A demo notebook on using OWLv2 for zero- and one-shot (image-guided) object detection can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/OWLv2). - [Zero-shot object detection task guide](../tasks/zero_shot_object_detection) - - -The architecture of OWLv2 is identical to [OWL-ViT](owlvit), however the object detection head now also includes an objectness classifier, which predicts the (query-agnostic) likelihood that a predicted box contains an object (as opposed to background). The objectness score can be used to rank or filter predictions independently of text queries. -Usage of OWLv2 is identical to [OWL-ViT](owlvit) with a new, updated image processor ([`Owlv2ImageProcessor`]). - - +> [!TIP] +> The architecture of OWLv2 is identical to [OWL-ViT](owlvit), however the object detection head now also includes an objectness classifier, which predicts the (query-agnostic) likelihood that a predicted box contains an object (as opposed to background). The objectness score can be used to rank or filter predictions independently of text queries. +> Usage of OWLv2 is identical to [OWL-ViT](owlvit) with a new, updated image processor ([`Owlv2ImageProcessor`]). ## Owlv2Config diff --git a/docs/source/en/model_doc/perceiver.md b/docs/source/en/model_doc/perceiver.md index 5414daf0f1a1..8f3ae54ead5d 100644 --- a/docs/source/en/model_doc/perceiver.md +++ b/docs/source/en/model_doc/perceiver.md @@ -86,11 +86,8 @@ alt="drawing" width="600"/> This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/deepmind/deepmind-research/tree/master/perceiver). - - -Perceiver does **not** work with `torch.nn.DataParallel` due to a bug in PyTorch, see [issue #36035](https://github.com/pytorch/pytorch/issues/36035) - - +> [!WARNING] +> Perceiver does **not** work with `torch.nn.DataParallel` due to a bug in PyTorch, see [issue #36035](https://github.com/pytorch/pytorch/issues/36035) ## Resources diff --git a/docs/source/en/model_doc/persimmon.md b/docs/source/en/model_doc/persimmon.md index 854eaee835df..2c1d5df3bfad 100644 --- a/docs/source/en/model_doc/persimmon.md +++ b/docs/source/en/model_doc/persimmon.md @@ -36,16 +36,13 @@ The original code can be found [here](https://github.com/persimmon-ai-labs/adept ## Usage tips - - -The `Persimmon` models were trained using `bfloat16`, but the original inference uses `float16` The checkpoints uploaded on the hub use `dtype = 'float16'` which will be -used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`. - -The `dtype` of the online weights is mostly irrelevant, unless you are using `dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", dtype = "auto")`. The reason is that the model will first be downloaded ( using the `dtype` of the checkpoints online) then it will be cast to the default `dtype` of `torch` (becomes `torch.float32`). Users should specify the `dtype` they want, and if they don't it will be `torch.float32`. - -Finetuning the model in `float16` is not recommended and known to produce `nan`, as such the model should be fine-tuned in `bfloat16`. 
- - +> [!WARNING] +> The `Persimmon` models were trained using `bfloat16`, but the original inference uses `float16` The checkpoints uploaded on the hub use `dtype = 'float16'` which will be +> used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`. +> +> The `dtype` of the online weights is mostly irrelevant, unless you are using `dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", dtype = "auto")`. The reason is that the model will first be downloaded ( using the `dtype` of the checkpoints online) then it will be cast to the default `dtype` of `torch` (becomes `torch.float32`). Users should specify the `dtype` they want, and if they don't it will be `torch.float32`. +> +> Finetuning the model in `float16` is not recommended and known to produce `nan`, as such the model should be fine-tuned in `bfloat16`. Tips: diff --git a/docs/source/en/model_doc/phi3.md b/docs/source/en/model_doc/phi3.md index 9a045e6f184d..10dc5952f83b 100644 --- a/docs/source/en/model_doc/phi3.md +++ b/docs/source/en/model_doc/phi3.md @@ -43,15 +43,12 @@ The original code for Phi-3 can be found [here](https://huggingface.co/microsoft ## How to use Phi-3 - - -Phi-3 has been integrated in the development version (4.40.0.dev) of `transformers`. Until the official version is released through `pip`, ensure that you are doing one of the following: - -* When loading the model, ensure that `trust_remote_code=True` is passed as an argument of the `from_pretrained()` function. - -* Update your local `transformers` to the development version: `pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers`. The previous command is an alternative to cloning and installing from the source. - - +> [!WARNING] +> Phi-3 has been integrated in the development version (4.40.0.dev) of `transformers`. Until the official version is released through `pip`, ensure that you are doing one of the following: +> +> * When loading the model, ensure that `trust_remote_code=True` is passed as an argument of the `from_pretrained()` function. +> +> * Update your local `transformers` to the development version: `pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers`. The previous command is an alternative to cloning and installing from the source. ```python >>> from transformers import AutoModelForCausalLM, AutoTokenizer diff --git a/docs/source/en/model_doc/phimoe.md b/docs/source/en/model_doc/phimoe.md index 7394e26b5b98..3e53f4e0a130 100644 --- a/docs/source/en/model_doc/phimoe.md +++ b/docs/source/en/model_doc/phimoe.md @@ -42,24 +42,21 @@ The original code for PhiMoE can be found [here](https://huggingface.co/microsof ## How to use PhiMoE - - -Phi-3.5-MoE-instruct has been integrated in the development version (4.44.2.dev) of `transformers`. Until the official version is released through `pip`, ensure that you are doing the following: - -* When loading the model, ensure that `trust_remote_code=True` is passed as an argument of the `from_pretrained()` function. - -The current `transformers` version can be verified with: `pip list | grep transformers`. - -Examples of required packages: - -```bash -flash_attn==2.5.8 -torch==2.3.1 -accelerate==0.31.0 -transformers==4.43.0 -``` - - +> [!WARNING] +> Phi-3.5-MoE-instruct has been integrated in the development version (4.44.2.dev) of `transformers`. 
Until the official version is released through `pip`, ensure that you are doing the following: +> +> * When loading the model, ensure that `trust_remote_code=True` is passed as an argument of the `from_pretrained()` function. +> +> The current `transformers` version can be verified with: `pip list | grep transformers`. +> +> Examples of required packages: +> +> ```bash +> flash_attn==2.5.8 +> torch==2.3.1 +> accelerate==0.31.0 +> transformers==4.43.0 +> ``` ```python import torch diff --git a/docs/source/en/model_doc/phobert.md b/docs/source/en/model_doc/phobert.md index f4c64fde7184..e327ed6be100 100644 --- a/docs/source/en/model_doc/phobert.md +++ b/docs/source/en/model_doc/phobert.md @@ -53,12 +53,9 @@ This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The o ... features = phobert(input_ids) # Models outputs are now tuples ``` - - -PhoBERT implementation is the same as BERT, except for tokenization. Refer to [BERT documentation](bert) for information on -configuration classes and their parameters. PhoBERT-specific tokenizer is documented below. - - +> [!TIP] +> PhoBERT implementation is the same as BERT, except for tokenization. Refer to [BERT documentation](bert) for information on +> configuration classes and their parameters. PhoBERT-specific tokenizer is documented below. ## PhobertTokenizer diff --git a/docs/source/en/model_doc/qdqbert.md b/docs/source/en/model_doc/qdqbert.md index b791b4b2afe6..9e3882542600 100644 --- a/docs/source/en/model_doc/qdqbert.md +++ b/docs/source/en/model_doc/qdqbert.md @@ -21,13 +21,10 @@ rendered properly in your Markdown viewer. PyTorch - - -This model is in maintenance mode only, we don't accept any new PRs changing its code. -If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. -You can do so by running the following command: `pip install -U transformers==4.40.2`. - - +> [!WARNING] +> This model is in maintenance mode only, we don't accept any new PRs changing its code. +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. +> You can do so by running the following command: `pip install -U transformers==4.40.2`. ## Overview diff --git a/docs/source/en/model_doc/realm.md b/docs/source/en/model_doc/realm.md index da3d1c140f4c..806697af852c 100644 --- a/docs/source/en/model_doc/realm.md +++ b/docs/source/en/model_doc/realm.md @@ -21,13 +21,10 @@ rendered properly in your Markdown viewer. PyTorch - - -This model is in maintenance mode only, we don't accept any new PRs changing its code. -If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. -You can do so by running the following command: `pip install -U transformers==4.40.2`. - - +> [!WARNING] +> This model is in maintenance mode only, we don't accept any new PRs changing its code. +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. +> You can do so by running the following command: `pip install -U transformers==4.40.2`. ## Overview diff --git a/docs/source/en/model_doc/retribert.md b/docs/source/en/model_doc/retribert.md index 829fed24215f..5f03337ae123 100644 --- a/docs/source/en/model_doc/retribert.md +++ b/docs/source/en/model_doc/retribert.md @@ -21,14 +21,11 @@ rendered properly in your Markdown viewer. PyTorch - - -This model is in maintenance mode only, so we won't accept any new PRs changing its code. 
- -If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. -You can do so by running the following command: `pip install -U transformers==4.30.0`. - - +> [!WARNING] +> This model is in maintenance mode only, so we won't accept any new PRs changing its code. +> +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. +> You can do so by running the following command: `pip install -U transformers==4.30.0`. ## Overview diff --git a/docs/source/en/model_doc/speech_to_text_2.md b/docs/source/en/model_doc/speech_to_text_2.md index a3d836455b19..c6fab2e7bd99 100644 --- a/docs/source/en/model_doc/speech_to_text_2.md +++ b/docs/source/en/model_doc/speech_to_text_2.md @@ -17,13 +17,10 @@ rendered properly in your Markdown viewer. # Speech2Text2 - - - This model is in maintenance mode only, we don't accept any new PRs changing its code. - If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. - You can do so by running the following command: `pip install -U transformers==4.40.2`. - - + > [!WARNING] + > This model is in maintenance mode only, we don't accept any new PRs changing its code. + > If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. + > You can do so by running the following command: `pip install -U transformers==4.40.2`. ## Overview diff --git a/docs/source/en/model_doc/t5v1.1.md b/docs/source/en/model_doc/t5v1.1.md index 62787d5f9d62..f700f6fe0f7b 100644 --- a/docs/source/en/model_doc/t5v1.1.md +++ b/docs/source/en/model_doc/t5v1.1.md @@ -68,8 +68,5 @@ Google has released the following variants: - [google/t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl). - - -Refer to [T5's documentation page](t5) for all API reference, tips, code examples and notebooks. - - +> [!TIP] +> Refer to [T5's documentation page](t5) for all API reference, tips, code examples and notebooks. diff --git a/docs/source/en/model_doc/tapex.md b/docs/source/en/model_doc/tapex.md index 606d8940c4ed..2066e5d19e7d 100644 --- a/docs/source/en/model_doc/tapex.md +++ b/docs/source/en/model_doc/tapex.md @@ -21,14 +21,11 @@ rendered properly in your Markdown viewer. PyTorch - - -This model is in maintenance mode only, we don't accept any new PRs changing its code. - -If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. -You can do so by running the following command: `pip install -U transformers==4.30.0`. - - +> [!WARNING] +> This model is in maintenance mode only, we don't accept any new PRs changing its code. +> +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. +> You can do so by running the following command: `pip install -U transformers==4.30.0`. ## Overview @@ -141,12 +138,9 @@ benchmark for table fact checking (it achieves 84% accuracy). The code example b Refused ``` - - -TAPEX architecture is the same as BART, except for tokenization. Refer to [BART documentation](bart) for information on -configuration classes and their parameters. TAPEX-specific tokenizer is documented below. - - +> [!TIP] +> TAPEX architecture is the same as BART, except for tokenization. Refer to [BART documentation](bart) for information on +> configuration classes and their parameters. TAPEX-specific tokenizer is documented below. 
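To make the tokenization difference concrete, here is a minimal sketch of table question answering with the TAPEX tokenizer and a BART-style model; the `microsoft/tapex-base` checkpoint and the toy table are illustrative:

```python
import pandas as pd
from transformers import BartForConditionalGeneration, TapexTokenizer

tokenizer = TapexTokenizer.from_pretrained("microsoft/tapex-base")
model = BartForConditionalGeneration.from_pretrained("microsoft/tapex-base")

data = {"year": [1896, 2008, 2012], "city": ["athens", "beijing", "london"]}
table = pd.DataFrame.from_dict(data)
query = "In which year did beijing host the Olympic Games?"

# the tokenizer flattens the table and concatenates it with the query
encoding = tokenizer(table=table, query=query, return_tensors="pt")
outputs = model.generate(**encoding)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```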
## TapexTokenizer diff --git a/docs/source/en/model_doc/trajectory_transformer.md b/docs/source/en/model_doc/trajectory_transformer.md index fba51b181157..83066ec08cc4 100644 --- a/docs/source/en/model_doc/trajectory_transformer.md +++ b/docs/source/en/model_doc/trajectory_transformer.md @@ -21,14 +21,11 @@ rendered properly in your Markdown viewer. PyTorch - - -This model is in maintenance mode only, so we won't accept any new PRs changing its code. - -If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. -You can do so by running the following command: `pip install -U transformers==4.30.0`. - - +> [!WARNING] +> This model is in maintenance mode only, so we won't accept any new PRs changing its code. +> +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. +> You can do so by running the following command: `pip install -U transformers==4.30.0`. ## Overview diff --git a/docs/source/en/model_doc/transfo-xl.md b/docs/source/en/model_doc/transfo-xl.md index 0bd1b0f57e1d..346416fbc275 100644 --- a/docs/source/en/model_doc/transfo-xl.md +++ b/docs/source/en/model_doc/transfo-xl.md @@ -21,34 +21,31 @@ rendered properly in your Markdown viewer. PyTorch - - -This model is in maintenance mode only, so we won't accept any new PRs changing its code. This model was deprecated due to security issues linked to `pickle.load`. - -We recommend switching to more recent models for improved security. - -In case you would still like to use `TransfoXL` in your experiments, we recommend using the [Hub checkpoint](https://huggingface.co/transfo-xl/transfo-xl-wt103) with a specific revision to ensure you are downloading safe files from the Hub. - -You will need to set the environment variable `TRUST_REMOTE_CODE` to `True` in order to allow the -usage of `pickle.load()`: - -```python -import os -from transformers import TransfoXLTokenizer, TransfoXLLMHeadModel - -os.environ["TRUST_REMOTE_CODE"] = "True" - -checkpoint = 'transfo-xl/transfo-xl-wt103' -revision = '40a186da79458c9f9de846edfaea79c412137f97' - -tokenizer = TransfoXLTokenizer.from_pretrained(checkpoint, revision=revision) -model = TransfoXLLMHeadModel.from_pretrained(checkpoint, revision=revision) -``` - -If you run into any issues running this model, please reinstall the last version that supported this model: v4.35.0. -You can do so by running the following command: `pip install -U transformers==4.35.0`. - - +> [!WARNING] +> This model is in maintenance mode only, so we won't accept any new PRs changing its code. This model was deprecated due to security issues linked to `pickle.load`. +> +> We recommend switching to more recent models for improved security. +> +> In case you would still like to use `TransfoXL` in your experiments, we recommend using the [Hub checkpoint](https://huggingface.co/transfo-xl/transfo-xl-wt103) with a specific revision to ensure you are downloading safe files from the Hub. 
+> +> You will need to set the environment variable `TRUST_REMOTE_CODE` to `True` in order to allow the +> usage of `pickle.load()`: +> +> ```python +> import os +> from transformers import TransfoXLTokenizer, TransfoXLLMHeadModel +> +> os.environ["TRUST_REMOTE_CODE"] = "True" +> +> checkpoint = 'transfo-xl/transfo-xl-wt103' +> revision = '40a186da79458c9f9de846edfaea79c412137f97' +> +> tokenizer = TransfoXLTokenizer.from_pretrained(checkpoint, revision=revision) +> model = TransfoXLLMHeadModel.from_pretrained(checkpoint, revision=revision) +> ``` +> +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.35.0. +> You can do so by running the following command: `pip install -U transformers==4.35.0`. - - -This model is in maintenance mode only, we don't accept any new PRs changing its code. -If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. -You can do so by running the following command: `pip install -U transformers==4.40.2`. - - +> [!WARNING] +> This model is in maintenance mode only, we don't accept any new PRs changing its code. +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. +> You can do so by running the following command: `pip install -U transformers==4.40.2`. ## Overview diff --git a/docs/source/en/model_doc/ul2.md b/docs/source/en/model_doc/ul2.md index b936bf78eb6c..55d375760c54 100644 --- a/docs/source/en/model_doc/ul2.md +++ b/docs/source/en/model_doc/ul2.md @@ -37,8 +37,5 @@ This model was contributed by [DanielHesslow](https://huggingface.co/Seledorn). - UL2 has the same architecture as [T5v1.1](t5v1.1) but uses the Gated-SiLU activation function instead of Gated-GELU. - The authors release checkpoints of one architecture which can be seen [here](https://huggingface.co/google/ul2) - - -As UL2 has the same architecture as T5v1.1, refer to [T5's documentation page](t5) for API reference, tips, code examples and notebooks. - - +> [!TIP] +> As UL2 has the same architecture as T5v1.1, refer to [T5's documentation page](t5) for API reference, tips, code examples and notebooks. diff --git a/docs/source/en/model_doc/umt5.md b/docs/source/en/model_doc/umt5.md index 784cc9974df1..d93a598d67f0 100644 --- a/docs/source/en/model_doc/umt5.md +++ b/docs/source/en/model_doc/umt5.md @@ -67,10 +67,8 @@ The conversion script is also different because the model was saved in t5x's lat ['nyone who drink a alcohol A A. This I'] ``` - - -Refer to [T5's documentation page](t5) for more tips, code examples and notebooks. - +> [!TIP] +> Refer to [T5's documentation page](t5) for more tips, code examples and notebooks. ## UMT5Config diff --git a/docs/source/en/model_doc/van.md b/docs/source/en/model_doc/van.md index 0a4ded430211..70c475762eeb 100644 --- a/docs/source/en/model_doc/van.md +++ b/docs/source/en/model_doc/van.md @@ -21,14 +21,11 @@ rendered properly in your Markdown viewer. PyTorch - - -This model is in maintenance mode only, we don't accept any new PRs changing its code. - -If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. -You can do so by running the following command: `pip install -U transformers==4.30.0`. - - +> [!WARNING] +> This model is in maintenance mode only, we don't accept any new PRs changing its code. 
+> +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. +> You can do so by running the following command: `pip install -U transformers==4.30.0`. ## Overview diff --git a/docs/source/en/model_doc/video_llava.md b/docs/source/en/model_doc/video_llava.md index 2e1bf19abdc6..731703f11d3f 100644 --- a/docs/source/en/model_doc/video_llava.md +++ b/docs/source/en/model_doc/video_llava.md @@ -151,13 +151,10 @@ The model can be loaded in lower bits, significantly reducing memory burden whil First make sure to install bitsandbytes by running `pip install bitsandbytes` and to have access to a GPU/accelerator that is supported by the library. - - -bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend). - -We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links. - - +> [!TIP] +> bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend). +> +> We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links. Load the quantized model by simply adding [`BitsAndBytesConfig`](../main_classes/quantization#transformers.BitsAndBytesConfig) as shown below: diff --git a/docs/source/en/model_doc/vit_hybrid.md b/docs/source/en/model_doc/vit_hybrid.md index c10d1c489b76..950021d7ec30 100644 --- a/docs/source/en/model_doc/vit_hybrid.md +++ b/docs/source/en/model_doc/vit_hybrid.md @@ -22,13 +22,10 @@ rendered properly in your Markdown viewer. SDPA - - -This model is in maintenance mode only, we don't accept any new PRs changing its code. -If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. -You can do so by running the following command: `pip install -U transformers==4.40.2`. - - +> [!WARNING] +> This model is in maintenance mode only, we don't accept any new PRs changing its code. +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. +> You can do so by running the following command: `pip install -U transformers==4.40.2`. ## Overview diff --git a/docs/source/en/model_doc/vitmatte.md b/docs/source/en/model_doc/vitmatte.md index 0584df8e67a5..ba70f8d879d9 100644 --- a/docs/source/en/model_doc/vitmatte.md +++ b/docs/source/en/model_doc/vitmatte.md @@ -40,10 +40,8 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - A demo notebook regarding inference with [`VitMatteForImageMatting`], including background replacement, can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/ViTMatte). - - -The model expects both the image and trimap (concatenated) as input. 
Use [`ViTMatteImageProcessor`] for this purpose. - +> [!TIP] +> The model expects both the image and trimap (concatenated) as input. Use [`ViTMatteImageProcessor`] for this purpose. ## VitMatteConfig diff --git a/docs/source/en/model_doc/wav2vec2_phoneme.md b/docs/source/en/model_doc/wav2vec2_phoneme.md index 206ea048c023..14216fb2a0a9 100644 --- a/docs/source/en/model_doc/wav2vec2_phoneme.md +++ b/docs/source/en/model_doc/wav2vec2_phoneme.md @@ -53,12 +53,9 @@ The original code can be found [here](https://github.com/pytorch/fairseq/tree/ma - By default, the model outputs a sequence of phonemes. In order to transform the phonemes to a sequence of words one should make use of a dictionary and language model. - - -Wav2Vec2Phoneme's architecture is based on the Wav2Vec2 model, for API reference, check out [`Wav2Vec2`](wav2vec2)'s documentation page -except for the tokenizer. - - +> [!TIP] +> Wav2Vec2Phoneme's architecture is based on the Wav2Vec2 model, for API reference, check out [`Wav2Vec2`](wav2vec2)'s documentation page +> except for the tokenizer. ## Wav2Vec2PhonemeCTCTokenizer diff --git a/docs/source/en/model_doc/xlm-prophetnet.md b/docs/source/en/model_doc/xlm-prophetnet.md index fbf47d8c422a..bc94c3887ce9 100644 --- a/docs/source/en/model_doc/xlm-prophetnet.md +++ b/docs/source/en/model_doc/xlm-prophetnet.md @@ -21,13 +21,10 @@ rendered properly in your Markdown viewer. PyTorch - - -This model is in maintenance mode only, we don't accept any new PRs changing its code. -If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. -You can do so by running the following command: `pip install -U transformers==4.40.2`. - - +> [!WARNING] +> This model is in maintenance mode only, we don't accept any new PRs changing its code. +> If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. +> You can do so by running the following command: `pip install -U transformers==4.40.2`.
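The ViTMatte note above pairs an RGB image with a trimap and feeds both to the processor; here is a minimal sketch of that preprocessing step, assuming the `hustvl/vitmatte-small-composition-1k` checkpoint and local `photo.png`/`trimap.png` files as stand-in inputs:

```python
# Sketch only: the processor concatenates the image and trimap for the matting model.
from PIL import Image
from transformers import VitMatteImageProcessor, VitMatteForImageMatting

checkpoint = "hustvl/vitmatte-small-composition-1k"  # assumed checkpoint
processor = VitMatteImageProcessor.from_pretrained(checkpoint)
model = VitMatteForImageMatting.from_pretrained(checkpoint)

image = Image.open("photo.png").convert("RGB")   # stand-in input image
trimap = Image.open("trimap.png").convert("L")   # stand-in trimap (foreground/background/unknown)

inputs = processor(images=image, trimaps=trimap, return_tensors="pt")
alphas = model(**inputs).alphas  # predicted alpha matte
```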
diff --git a/docs/source/en/model_doc/xlm-roberta.md b/docs/source/en/model_doc/xlm-roberta.md index 0e9867636892..5fc878fe7082 100644 --- a/docs/source/en/model_doc/xlm-roberta.md +++ b/docs/source/en/model_doc/xlm-roberta.md @@ -167,10 +167,8 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - A blog post on how to [Deploy Serverless XLM RoBERTa on AWS Lambda](https://www.philschmid.de/multilingual-serverless-xlm-roberta-with-huggingface). - - -This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples as well as the information relative to the inputs and outputs. - +> [!TIP] +> This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples as well as the information relative to the inputs and outputs. ## XLMRobertaConfig diff --git a/docs/source/en/model_doc/xlm-v.md b/docs/source/en/model_doc/xlm-v.md index 671827615f27..59e4508ee757 100644 --- a/docs/source/en/model_doc/xlm-v.md +++ b/docs/source/en/model_doc/xlm-v.md @@ -51,7 +51,5 @@ The experiments repository can be found [here](https://github.com/stefan-it/xlm- A XLM-V (base size) model is available under the [`facebook/xlm-v-base`](https://huggingface.co/facebook/xlm-v-base) identifier. - - -XLM-V architecture is the same as XLM-RoBERTa, refer to [XLM-RoBERTa documentation](xlm-roberta) for API reference, and examples. - +> [!TIP] +> XLM-V architecture is the same as XLM-RoBERTa, refer to [XLM-RoBERTa documentation](xlm-roberta) for API reference, and examples. diff --git a/docs/source/en/model_doc/xls_r.md b/docs/source/en/model_doc/xls_r.md index 643aba73230f..5de19c3e1af0 100644 --- a/docs/source/en/model_doc/xls_r.md +++ b/docs/source/en/model_doc/xls_r.md @@ -49,8 +49,5 @@ The original code can be found [here](https://github.com/pytorch/fairseq/tree/ma - XLS-R model was trained using connectionist temporal classification (CTC) so the model output has to be decoded using [`Wav2Vec2CTCTokenizer`]. - - -XLS-R's architecture is based on the Wav2Vec2 model, refer to [Wav2Vec2's documentation page](wav2vec2) for API reference. - - +> [!TIP] +> XLS-R's architecture is based on the Wav2Vec2 model, refer to [Wav2Vec2's documentation page](wav2vec2) for API reference. diff --git a/docs/source/en/model_doc/xlsr_wav2vec2.md b/docs/source/en/model_doc/xlsr_wav2vec2.md index a97b0a3eff77..0c77990dd8e0 100644 --- a/docs/source/en/model_doc/xlsr_wav2vec2.md +++ b/docs/source/en/model_doc/xlsr_wav2vec2.md @@ -49,8 +49,5 @@ Note: Meta (FAIR) released a new version of [Wav2Vec2-BERT 2.0](https://huggingf - XLSR-Wav2Vec2 model was trained using connectionist temporal classification (CTC) so the model output has to be decoded using [`Wav2Vec2CTCTokenizer`]. - - -XLSR-Wav2Vec2's architecture is based on the Wav2Vec2 model, so one can refer to [Wav2Vec2's documentation page](wav2vec2). - - +> [!TIP] +> XLSR-Wav2Vec2's architecture is based on the Wav2Vec2 model, so one can refer to [Wav2Vec2's documentation page](wav2vec2). diff --git a/docs/source/en/model_memory_anatomy.md b/docs/source/en/model_memory_anatomy.md index f0a215b05c1b..ce466e8cdbd7 100644 --- a/docs/source/en/model_memory_anatomy.md +++ b/docs/source/en/model_memory_anatomy.md @@ -149,12 +149,9 @@ default_args = { } ``` - - - If you plan to run multiple experiments, in order to properly clear the memory between experiments, restart the Python - kernel between experiments. 
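The XLS-R and XLSR-Wav2Vec2 notes above point out that CTC outputs have to be decoded with [`Wav2Vec2CTCTokenizer`]; a small sketch of greedy CTC decoding, assuming a fine-tuned checkpoint such as `facebook/wav2vec2-base-960h` and a silent one-second clip as a stand-in input:

```python
import numpy as np
import torch
from transformers import AutoProcessor, AutoModelForCTC

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")  # assumed checkpoint
model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech = np.zeros(16000, dtype=np.float32)  # stand-in for 1 second of 16 kHz audio
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)   # greedy decoding over CTC logits
print(processor.batch_decode(predicted_ids))   # tokenizer collapses repeats and removes blanks
```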
- - +> [!TIP] +> If you plan to run multiple experiments, in order to properly clear the memory between experiments, restart the Python +> kernel between experiments. ## Memory utilization at vanilla training diff --git a/docs/source/en/pr_checks.md b/docs/source/en/pr_checks.md index 5fdbbbab05bc..8999494a4309 100644 --- a/docs/source/en/pr_checks.md +++ b/docs/source/en/pr_checks.md @@ -148,11 +148,8 @@ Additional checks concern PRs that add new models, mainly that: Since the Transformers library is very opinionated with respect to model code, and each model should fully be implemented in a single file without relying on other models, we have added a mechanism that checks whether a copy of the code of a layer of a given model stays consistent with the original. This way, when there is a bug fix, we can see all other impacted models and choose to trickle down the modification or break the copy. - - -If a file is a full copy of another file, you should register it in the constant `FULL_COPIES` of `utils/check_copies.py`. - - +> [!TIP] +> If a file is a full copy of another file, you should register it in the constant `FULL_COPIES` of `utils/check_copies.py`. This mechanism relies on comments of the form `# Copied from xxx`. The `xxx` should contain the whole path to the class of function which is being copied below. For instance, `RobertaSelfOutput` is a direct copy of the `BertSelfOutput` class, so you can see [here](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L289) it has a comment: @@ -182,11 +179,8 @@ You can add several patterns separated by a comma. For instance here `CamemberFo If the order matters (because one of the replacements might conflict with a previous one), the replacements are executed from left to right. - - -If the replacements change the formatting (if you replace a short name by a very long name for instance), the copy is checked after applying the auto-formatter. - - +> [!TIP] +> If the replacements change the formatting (if you replace a short name by a very long name for instance), the copy is checked after applying the auto-formatter. Another way when the patterns are just different casings of the same replacement (with an uppercased and a lowercased variants) is just to add the option `all-casing`. [Here](https://github.com/huggingface/transformers/blob/15082a9dc6950ecae63a0d3e5060b2fc7f15050a/src/transformers/models/mobilebert/modeling_mobilebert.py#L1237) is an example in `MobileBertForSequenceClassification` with the comment: diff --git a/docs/source/en/tasks/asr.md b/docs/source/en/tasks/asr.md index 33dc3fc518e6..2e4d8b39c0d6 100644 --- a/docs/source/en/tasks/asr.md +++ b/docs/source/en/tasks/asr.md @@ -27,11 +27,8 @@ This guide will show you how to: 1. Fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset to transcribe audio to text. 2. Use your fine-tuned model for inference. 
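The consistency check described in the PR checks section above keys off `# Copied from` comments placed directly above a class or function; a small illustrative sketch of the two comment forms it mentions (class bodies are placeholders, not real implementations):

```python
import torch.nn as nn

# Direct copy: the checker verifies this class stays identical to the original.
# Copied from transformers.models.bert.modeling_bert.BertSelfOutput
class RobertaSelfOutput(nn.Module):
    ...

# Copy with a replacement pattern: occurrences of "Bert" are rewritten, and the
# all-casing option also covers the lowercase/uppercase variants.
# Copied from transformers.models.bert.modeling_bert.BertForSequenceClassification with Bert->MobileBert all-casing
class MobileBertForSequenceClassification(nn.Module):
    ...
```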
- - -To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/automatic-speech-recognition) - - +> [!TIP] +> To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/automatic-speech-recognition) Before you begin, make sure you have all the necessary libraries installed: @@ -228,11 +225,8 @@ Your `compute_metrics` function is ready to go now, and you'll return to it when ## Train - - -If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! You are now ready to start training your model! Load Wav2Vec2 with [`AutoModelForCTC`]. Specify the reduction to apply with the `ctc_loss_reduction` parameter. It is often better to use the average instead of the default summation: @@ -293,11 +287,8 @@ Once training is completed, share your model to the Hub with the [`~transformers >>> trainer.push_to_hub() ``` - - -For a more in-depth example of how to fine-tune a model for automatic speech recognition, take a look at this blog [post](https://huggingface.co/blog/fine-tune-wav2vec2-english) for English ASR and this [post](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) for multilingual ASR. - - +> [!TIP] +> For a more in-depth example of how to fine-tune a model for automatic speech recognition, take a look at this blog [post](https://huggingface.co/blog/fine-tune-wav2vec2-english) for English ASR and this [post](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) for multilingual ASR. ## Inference @@ -324,11 +315,8 @@ The simplest way to try out your fine-tuned model for inference is to use it in {'text': 'I WOUD LIKE O SET UP JOINT ACOUNT WTH Y PARTNER'} ``` - - -The transcription is decent, but it could be better! Try finetuning your model on more examples to get even better results! - - +> [!TIP] +> The transcription is decent, but it could be better! Try finetuning your model on more examples to get even better results! You can also manually replicate the results of the `pipeline` if you'd like: diff --git a/docs/source/en/tasks/audio_classification.md b/docs/source/en/tasks/audio_classification.md index 250b980be190..d5439c726702 100644 --- a/docs/source/en/tasks/audio_classification.md +++ b/docs/source/en/tasks/audio_classification.md @@ -27,11 +27,8 @@ This guide will show you how to: 1. Fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset to classify speaker intent. 2. Use your fine-tuned model for inference. - - -To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/audio-classification) - - +> [!TIP] +> To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/audio-classification) Before you begin, make sure you have all the necessary libraries installed: @@ -187,11 +184,8 @@ Your `compute_metrics` function is ready to go now, and you'll return to it when ## Train - - -If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! 
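A minimal sketch of the loading step the ASR guide above describes, choosing the mean CTC loss reduction over the default summation (checkpoint name as used in that guide):

```python
from transformers import AutoModelForCTC

model = AutoModelForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    ctc_loss_reduction="mean",  # average the CTC loss instead of summing it
)
```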
- - +> [!TIP] +> If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! You're ready to start training your model now! Load Wav2Vec2 with [`AutoModelForAudioClassification`] along with the number of expected labels, and the label mappings: @@ -245,11 +239,8 @@ Once training is completed, share your model to the Hub with the [`~transformers >>> trainer.push_to_hub() ``` - - -For a more in-depth example of how to fine-tune a model for audio classification, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb). - - +> [!TIP] +> For a more in-depth example of how to fine-tune a model for audio classification, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb). ## Inference diff --git a/docs/source/en/tasks/document_question_answering.md b/docs/source/en/tasks/document_question_answering.md index 2c729f76adcb..e4b129fb84d4 100644 --- a/docs/source/en/tasks/document_question_answering.md +++ b/docs/source/en/tasks/document_question_answering.md @@ -28,11 +28,8 @@ This guide illustrates how to: - Fine-tune [LayoutLMv2](../model_doc/layoutlmv2) on the [DocVQA dataset](https://huggingface.co/datasets/nielsr/docvqa_1200_examples_donut). - Use your fine-tuned model for inference. - - -To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/image-to-text) - - +> [!TIP] +> To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/image-to-text) LayoutLMv2 solves the document question-answering task by adding a question-answering head on top of the final hidden states of the tokens, to predict the positions of the start and end tokens of the diff --git a/docs/source/en/tasks/idefics.md b/docs/source/en/tasks/idefics.md index b03c7bccd9c2..462390650508 100644 --- a/docs/source/en/tasks/idefics.md +++ b/docs/source/en/tasks/idefics.md @@ -54,9 +54,8 @@ Before you begin, make sure you have all the necessary libraries installed. pip install -q bitsandbytes sentencepiece accelerate transformers ``` - -To run the following examples with a non-quantized version of the model checkpoint you will need at least 20GB of GPU memory. - +> [!TIP] +> To run the following examples with a non-quantized version of the model checkpoint you will need at least 20GB of GPU memory. ## Loading the model @@ -141,13 +140,11 @@ As image input to the model, you can use either an image object (`PIL.Image`) or A puppy in a flower bed ``` - - -It is a good idea to include the `bad_words_ids` in the call to `generate` to avoid errors arising when increasing -the `max_new_tokens`: the model will want to generate a new `` or `` token when there -is no image being generated by the model. -You can set it on-the-fly as in this guide, or store in the `GenerationConfig` as described in the [Text generation strategies](../generation_strategies) guide. - +> [!TIP] +> It is a good idea to include the `bad_words_ids` in the call to `generate` to avoid errors arising when increasing +> the `max_new_tokens`: the model will want to generate a new `` or `` token when there +> is no image being generated by the model. 
+> You can set it on-the-fly as in this guide, or store in the `GenerationConfig` as described in the [Text generation strategies](../generation_strategies) guide. ## Prompted image captioning @@ -332,12 +329,10 @@ The little girl ran Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Halloween story about a ghost. - - -For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help -you significantly improve the quality of the generated output. Check out [Text generation strategies](../generation_strategies) -to learn more. - +> [!TIP] +> For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help +> you significantly improve the quality of the generated output. Check out [Text generation strategies](../generation_strategies) +> to learn more. ## Running inference in batch mode diff --git a/docs/source/en/tasks/image_captioning.md b/docs/source/en/tasks/image_captioning.md index 4b4b3ba5fa36..11ba87d86094 100644 --- a/docs/source/en/tasks/image_captioning.md +++ b/docs/source/en/tasks/image_captioning.md @@ -65,11 +65,8 @@ DatasetDict({ The dataset has two features, `image` and `text`. - - -Many image captioning datasets contain multiple captions per image. In those cases, a common strategy is to randomly sample a caption amongst the available ones during training. - - +> [!TIP] +> Many image captioning datasets contain multiple captions per image. In those cases, a common strategy is to randomly sample a caption amongst the available ones during training. Split the dataset's train split into a train and test set with the [`~datasets.Dataset.train_test_split`] method: diff --git a/docs/source/en/tasks/image_classification.md b/docs/source/en/tasks/image_classification.md index 4754a91bd482..a2551637310b 100644 --- a/docs/source/en/tasks/image_classification.md +++ b/docs/source/en/tasks/image_classification.md @@ -29,11 +29,8 @@ This guide illustrates how to: 1. Fine-tune [ViT](../model_doc/vit) on the [Food-101](https://huggingface.co/datasets/food101) dataset to classify a food item in an image. 2. Use your fine-tuned model for inference. - - -To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/image-classification) - - +> [!TIP] +> To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/image-classification) Before you begin, make sure you have all the necessary libraries installed: @@ -175,11 +172,8 @@ Your `compute_metrics` function is ready to go now, and you'll return to it when ## Train - - -If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! You're ready to start training your model now! Load ViT with [`AutoModelForImageClassification`]. 
Specify the number of labels along with the number of expected labels, and the label mappings: @@ -237,11 +231,8 @@ Once training is completed, share your model to the Hub with the [`~transformers >>> trainer.push_to_hub() ``` - - -For a more in-depth example of how to finetune a model for image classification, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). - - +> [!TIP] +> For a more in-depth example of how to finetune a model for image classification, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). ## Inference diff --git a/docs/source/en/tasks/language_modeling.md b/docs/source/en/tasks/language_modeling.md index 6ff73af98206..8c096f0a0afc 100644 --- a/docs/source/en/tasks/language_modeling.md +++ b/docs/source/en/tasks/language_modeling.md @@ -32,11 +32,8 @@ This guide will show you how to: 1. Finetune [DistilGPT2](https://huggingface.co/distilbert/distilgpt2) on the [r/askscience](https://www.reddit.com/r/askscience/) subset of the [ELI5](https://huggingface.co/datasets/dany0407/eli5_category) dataset. 2. Use your finetuned model for inference. - - -To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/text-generation) - - +> [!TIP] +> To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/text-generation) Before you begin, make sure you have all the necessary libraries installed: @@ -200,11 +197,8 @@ Use the end-of-sequence token as the padding token and set `mlm=False`. This wil ## Train - - -If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the [basic tutorial](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the [basic tutorial](../training#train-with-pytorch-trainer)! You're ready to start training your model now! Load DistilGPT2 with [`AutoModelForCausalLM`]: @@ -257,12 +251,9 @@ Then share your model to the Hub with the [`~transformers.Trainer.push_to_hub`] >>> trainer.push_to_hub() ``` - - -For a more in-depth example of how to finetune a model for causal language modeling, take a look at the corresponding -[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb). - - +> [!TIP] +> For a more in-depth example of how to finetune a model for causal language modeling, take a look at the corresponding +> [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb). ## Inference diff --git a/docs/source/en/tasks/masked_language_modeling.md b/docs/source/en/tasks/masked_language_modeling.md index 619374f91dae..fd7e951371dd 100644 --- a/docs/source/en/tasks/masked_language_modeling.md +++ b/docs/source/en/tasks/masked_language_modeling.md @@ -29,11 +29,8 @@ This guide will show you how to: 1. Finetune [DistilRoBERTa](https://huggingface.co/distilbert/distilroberta-base) on the [r/askscience](https://www.reddit.com/r/askscience/) subset of the [ELI5](https://huggingface.co/datasets/eli5) dataset. 2. Use your finetuned model for inference. 
- - -To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/fill-mask) - - +> [!TIP] +> To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/fill-mask) Before you begin, make sure you have all the necessary libraries installed: @@ -193,11 +190,8 @@ Use the end-of-sequence token as the padding token and specify `mlm_probability` ## Train - - -If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! You're ready to start training your model now! Load DistilRoBERTa with [`AutoModelForMaskedLM`]: @@ -251,12 +245,9 @@ Then share your model to the Hub with the [`~transformers.Trainer.push_to_hub`] >>> trainer.push_to_hub() ``` - - -For a more in-depth example of how to finetune a model for masked language modeling, take a look at the corresponding -[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb). - - +> [!TIP] +> For a more in-depth example of how to finetune a model for masked language modeling, take a look at the corresponding +> [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb). ## Inference diff --git a/docs/source/en/tasks/monocular_depth_estimation.md b/docs/source/en/tasks/monocular_depth_estimation.md index aef9bd22c4d3..31478eab1b01 100644 --- a/docs/source/en/tasks/monocular_depth_estimation.md +++ b/docs/source/en/tasks/monocular_depth_estimation.md @@ -33,11 +33,8 @@ There are two main depth estimation categories: In this guide, we will see how to infer with [Depth Anything V2](https://huggingface.co/depth-anything/Depth-Anything-V2-Large), a state-of-the-art zero-shot relative depth estimation model, and [ZoeDepth](https://huggingface.co/docs/transformers/main/en/model_doc/zoedepth), an absolute depth estimation model. - - -Check the [Depth Estimation](https://huggingface.co/tasks/depth-estimation) task page to view all compatible architectures and checkpoints. - - +> [!TIP] +> Check the [Depth Estimation](https://huggingface.co/tasks/depth-estimation) task page to view all compatible architectures and checkpoints. Before we begin, we need to install the latest version of Transformers: @@ -141,18 +138,17 @@ Let's post-process the results to remove any padding and resize the depth map to >>> depth = Image.fromarray(depth.astype("uint8")) ``` - -

-In the original implementation ZoeDepth model performs inference on both the original and flipped images and averages out the results. The `post_process_depth_estimation` function can handle this for us by passing the flipped outputs to the optional `outputs_flipped` argument:
-
->>> with torch.no_grad():
-...     outputs = model(pixel_values)
-...     outputs_flipped = model(pixel_values=torch.flip(inputs.pixel_values, dims=[3]))
->>> post_processed_output = image_processor.post_process_depth_estimation(
-...     outputs,
-...     source_sizes=[(image.height, image.width)],
-...     outputs_flipped=outputs_flipped,
-... )
-
+> [!TIP]
+> In the original implementation ZoeDepth model performs inference on both the original and flipped images and averages out the results. The `post_process_depth_estimation` function can handle this for us by passing the flipped outputs to the optional `outputs_flipped` argument:
+>
+> >>> with torch.no_grad():
+> ...     outputs = model(pixel_values)
+> ...     outputs_flipped = model(pixel_values=torch.flip(inputs.pixel_values, dims=[3]))
+> >>> post_processed_output = image_processor.post_process_depth_estimation(
+> ...     outputs,
+> ...     source_sizes=[(image.height, image.width)],
+> ...     outputs_flipped=outputs_flipped,
+> ... )
+>
Depth estimation visualization diff --git a/docs/source/en/tasks/multiple_choice.md b/docs/source/en/tasks/multiple_choice.md index d35f108ecce5..9d4227d0eb38 100644 --- a/docs/source/en/tasks/multiple_choice.md +++ b/docs/source/en/tasks/multiple_choice.md @@ -145,11 +145,8 @@ Your `compute_metrics` function is ready to go now, and you'll return to it when ## Train - - -If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! You're ready to start training your model now! Load BERT with [`AutoModelForMultipleChoice`]: @@ -198,12 +195,9 @@ Once training is completed, share your model to the Hub with the [`~transformers >>> trainer.push_to_hub() ``` - - -For a more in-depth example of how to finetune a model for multiple choice, take a look at the corresponding -[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb). - - +> [!TIP] +> For a more in-depth example of how to finetune a model for multiple choice, take a look at the corresponding +> [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb). ## Inference diff --git a/docs/source/en/tasks/object_detection.md b/docs/source/en/tasks/object_detection.md index ef2a86190bbc..ec8d84a40331 100644 --- a/docs/source/en/tasks/object_detection.md +++ b/docs/source/en/tasks/object_detection.md @@ -32,11 +32,8 @@ In this guide, you will learn how to: dataset. 2. Use your finetuned model for inference. - - -To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/object-detection) - - +> [!TIP] +> To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/object-detection) Before you begin, make sure you have all the necessary libraries installed: diff --git a/docs/source/en/tasks/question_answering.md b/docs/source/en/tasks/question_answering.md index 905c2aeb3ee2..57b776dc8a74 100644 --- a/docs/source/en/tasks/question_answering.md +++ b/docs/source/en/tasks/question_answering.md @@ -30,11 +30,8 @@ This guide will show you how to: 1. Finetune [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) on the [SQuAD](https://huggingface.co/datasets/squad) dataset for extractive question answering. 2. Use your finetuned model for inference. - - -To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/question-answering) - - +> [!TIP] +> To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/question-answering) Before you begin, make sure you have all the necessary libraries installed: @@ -175,11 +172,8 @@ Now create a batch of examples using [`DefaultDataCollator`]. Unlike other data ## Train - - -If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! 
You're ready to start training your model now! Load DistilBERT with [`AutoModelForQuestionAnswering`]: @@ -225,12 +219,9 @@ Once training is completed, share your model to the Hub with the [`~transformers >>> trainer.push_to_hub() ``` - - -For a more in-depth example of how to finetune a model for question answering, take a look at the corresponding -[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb). - - +> [!TIP] +> For a more in-depth example of how to finetune a model for question answering, take a look at the corresponding +> [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb). ## Evaluate diff --git a/docs/source/en/tasks/semantic_segmentation.md b/docs/source/en/tasks/semantic_segmentation.md index de88a0af6866..e05270ebea00 100644 --- a/docs/source/en/tasks/semantic_segmentation.md +++ b/docs/source/en/tasks/semantic_segmentation.md @@ -206,11 +206,8 @@ We will now: 1. Finetune [SegFormer](https://huggingface.co/docs/transformers/main/en/model_doc/segformer#segformer) on the [SceneParse150](https://huggingface.co/datasets/scene_parse_150) dataset. 2. Use your fine-tuned model for inference. - - -To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/image-segmentation) - - +> [!TIP] +> To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/image-segmentation) ### Load SceneParse150 dataset @@ -403,11 +400,8 @@ Your `compute_metrics` function is ready to go now, and you'll return to it when ### Train - - -If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)! - - +> [!TIP] +> If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)! You're ready to start training your model now! Load SegFormer with [`AutoModelForSemanticSegmentation`], and pass the model the mapping between label ids and label classes: diff --git a/docs/source/en/tasks/sequence_classification.md b/docs/source/en/tasks/sequence_classification.md index 686871a56229..7bad4cd1e7a6 100644 --- a/docs/source/en/tasks/sequence_classification.md +++ b/docs/source/en/tasks/sequence_classification.md @@ -27,11 +27,8 @@ This guide will show you how to: 1. Finetune [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) on the [IMDb](https://huggingface.co/datasets/imdb) dataset to determine whether a movie review is positive or negative. 2. Use your finetuned model for inference. - - -To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/text-classification). - - +> [!TIP] +> To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/text-classification). Before you begin, make sure you have all the necessary libraries installed: @@ -136,11 +133,8 @@ Before you start training your model, create a map of the expected ids to their >>> label2id = {"NEGATIVE": 0, "POSITIVE": 1} ``` - - -If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! 
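A short sketch of the model-loading step the text classification guide above leads into, reusing the `id2label`/`label2id` mappings it defines:

```python
from transformers import AutoModelForSequenceClassification

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased",
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)
```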
- - +> [!TIP] +> If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! You're ready to start training your model now! Load DistilBERT with [`AutoModelForSequenceClassification`] along with the number of expected labels, and the label mappings: @@ -185,11 +179,8 @@ At this point, only three steps remain: >>> trainer.train() ``` - - -[`Trainer`] applies dynamic padding by default when you pass `tokenizer` to it. In this case, you don't need to specify a data collator explicitly. - - +> [!TIP] +> [`Trainer`] applies dynamic padding by default when you pass `tokenizer` to it. In this case, you don't need to specify a data collator explicitly. Once training is completed, share your model to the Hub with the [`~transformers.Trainer.push_to_hub`] method so everyone can use your model: @@ -197,12 +188,9 @@ Once training is completed, share your model to the Hub with the [`~transformers >>> trainer.push_to_hub() ``` - - -For a more in-depth example of how to finetune a model for text classification, take a look at the corresponding -[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb). - - +> [!TIP] +> For a more in-depth example of how to finetune a model for text classification, take a look at the corresponding +> [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb). ## Inference diff --git a/docs/source/en/tasks/summarization.md b/docs/source/en/tasks/summarization.md index b2f2beebc806..3fcad66f416d 100644 --- a/docs/source/en/tasks/summarization.md +++ b/docs/source/en/tasks/summarization.md @@ -30,11 +30,8 @@ This guide will show you how to: 1. Finetune [T5](https://huggingface.co/google-t5/t5-small) on the California state bill subset of the [BillSum](https://huggingface.co/datasets/billsum) dataset for abstractive summarization. 2. Use your finetuned model for inference. - - -To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/summarization) - - +> [!TIP] +> To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/summarization) Before you begin, make sure you have all the necessary libraries installed: @@ -159,11 +156,8 @@ Your `compute_metrics` function is ready to go now, and you'll return to it when ## Train - - -If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! You're ready to start training your model now! Load T5 with [`AutoModelForSeq2SeqLM`]: @@ -213,12 +207,9 @@ Once training is completed, share your model to the Hub with the [`~transformers >>> trainer.push_to_hub() ``` - - -For a more in-depth example of how to finetune a model for summarization, take a look at the corresponding -[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb). 
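As a rough sketch of the seq2seq setup the summarization guide above works toward, here is generation with the base `google-t5/t5-small` checkpoint and the `summarize:` prefix that T5 expects (the input text is illustrative):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")

text = "summarize: The state legislature passed a bill requiring utilities to report outages within 24 hours."
inputs = tokenizer(text, return_tensors="pt")

summary_ids = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```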
- - +> [!TIP] +> For a more in-depth example of how to finetune a model for summarization, take a look at the corresponding +> [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb). ## Inference diff --git a/docs/source/en/tasks/text-to-speech.md b/docs/source/en/tasks/text-to-speech.md index 28153cd1160e..c7e5e41ed3c5 100644 --- a/docs/source/en/tasks/text-to-speech.md +++ b/docs/source/en/tasks/text-to-speech.md @@ -64,21 +64,18 @@ Install 🤗Transformers from source as not all the SpeechT5 features have been pip install git+https://github.com/huggingface/transformers.git ``` - - -To follow this guide you will need a GPU. If you're working in a notebook, run the following line to check if a GPU is available: - -```bash -!nvidia-smi -``` - -or alternatively for AMD GPUs: - -```bash -!rocm-smi -``` - - +> [!TIP] +> To follow this guide you will need a GPU. If you're working in a notebook, run the following line to check if a GPU is available: +> +> ```bash +> !nvidia-smi +> ``` +> +> or alternatively for AMD GPUs: +> +> ```bash +> !rocm-smi +> ``` We encourage you to log in to your Hugging Face account to upload and share your model with the community. When prompted, enter your token to log in: diff --git a/docs/source/en/tasks/token_classification.md b/docs/source/en/tasks/token_classification.md index 5096298affd1..7f8462d39202 100644 --- a/docs/source/en/tasks/token_classification.md +++ b/docs/source/en/tasks/token_classification.md @@ -27,11 +27,8 @@ This guide will show you how to: 1. Finetune [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) on the [WNUT 17](https://huggingface.co/datasets/wnut_17) dataset to detect new entities. 2. Use your finetuned model for inference. - - -To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/token-classification). - - +> [!TIP] +> To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/token-classification). Before you begin, make sure you have all the necessary libraries installed: @@ -242,11 +239,8 @@ Before you start training your model, create a map of the expected ids to their ... } ``` - - -If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! You're ready to start training your model now! Load DistilBERT with [`AutoModelForTokenClassification`] along with the number of expected labels, and the label mappings: @@ -297,12 +291,9 @@ Once training is completed, share your model to the Hub with the [`~transformers >>> trainer.push_to_hub() ``` - - -For a more in-depth example of how to finetune a model for token classification, take a look at the corresponding -[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb). - - +> [!TIP] +> For a more in-depth example of how to finetune a model for token classification, take a look at the corresponding +> [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb). 
## Inference diff --git a/docs/source/en/tasks/translation.md b/docs/source/en/tasks/translation.md index 69246a8c17a9..7f82e45d088d 100644 --- a/docs/source/en/tasks/translation.md +++ b/docs/source/en/tasks/translation.md @@ -27,11 +27,8 @@ This guide will show you how to: 1. Finetune [T5](https://huggingface.co/google-t5/t5-small) on the English-French subset of the [OPUS Books](https://huggingface.co/datasets/opus_books) dataset to translate English text to French. 2. Use your finetuned model for inference. - - -To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/translation). - - +> [!TIP] +> To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/translation). Before you begin, make sure you have all the necessary libraries installed: @@ -167,11 +164,8 @@ Your `compute_metrics` function is ready to go now, and you'll return to it when ## Train - - -If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! You're ready to start training your model now! Load T5 with [`AutoModelForSeq2SeqLM`]: @@ -221,12 +215,9 @@ Once training is completed, share your model to the Hub with the [`~transformers >>> trainer.push_to_hub() ``` - - -For a more in-depth example of how to finetune a model for translation, take a look at the corresponding -[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb). - - +> [!TIP] +> For a more in-depth example of how to finetune a model for translation, take a look at the corresponding +> [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb). ## Inference diff --git a/docs/source/en/tasks/video_classification.md b/docs/source/en/tasks/video_classification.md index bae638bd84ed..d9ff9dc38dc1 100644 --- a/docs/source/en/tasks/video_classification.md +++ b/docs/source/en/tasks/video_classification.md @@ -25,11 +25,8 @@ This guide will show you how to: 1. Fine-tune [VideoMAE](https://huggingface.co/docs/transformers/main/en/model_doc/videomae) on a subset of the [UCF101](https://www.crcv.ucf.edu/data/UCF101.php) dataset. 2. Use your fine-tuned model for inference. - - -To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/video-classification). - - +> [!TIP] +> To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/video-classification). Before you begin, make sure you have all the necessary libraries installed: diff --git a/docs/source/en/testing.md b/docs/source/en/testing.md index 01658aa2beb7..b48aa124e224 100644 --- a/docs/source/en/testing.md +++ b/docs/source/en/testing.md @@ -313,17 +313,11 @@ And then run every test multiple times (50 by default): pytest --flake-finder --flake-runs=5 tests/test_failing_test.py ``` - +> [!TIP] +> This plugin doesn't work with `-n` flag from `pytest-xdist`. -This plugin doesn't work with `-n` flag from `pytest-xdist`. - - - - - -There is another plugin `pytest-repeat`, but it doesn't work with `unittest`. 
- - +> [!TIP] +> There is another plugin `pytest-repeat`, but it doesn't work with `unittest`. #### Run tests in a random order @@ -809,20 +803,14 @@ leave any data in there. - `after=True`: the temporary dir will always be deleted at the end of the test. - `after=False`: the temporary dir will always be left intact at the end of the test. - - -In order to run the equivalent of `rm -r` safely, only subdirs of the project repository checkout are allowed if -an explicit `tmp_dir` is used, so that by mistake no `/tmp` or similar important part of the filesystem will -get nuked. i.e. please always pass paths that start with `./`. - - - - - -Each test can register multiple temporary directories and they all will get auto-removed, unless requested -otherwise. +> [!TIP] +> In order to run the equivalent of `rm -r` safely, only subdirs of the project repository checkout are allowed if +> an explicit `tmp_dir` is used, so that by mistake no `/tmp` or similar important part of the filesystem will +> get nuked. i.e. please always pass paths that start with `./`. - +> [!TIP] +> Each test can register multiple temporary directories and they all will get auto-removed, unless requested +> otherwise. ### Temporary sys.path override diff --git a/docs/source/en/troubleshooting.md b/docs/source/en/troubleshooting.md index 0cc5829d2e8d..e59d37630783 100644 --- a/docs/source/en/troubleshooting.md +++ b/docs/source/en/troubleshooting.md @@ -58,11 +58,8 @@ Here are some potential solutions you can try to lessen memory use: - Reduce the [`per_device_train_batch_size`](main_classes/trainer#transformers.TrainingArguments.per_device_train_batch_size) value in [`TrainingArguments`]. - Try using [`gradient_accumulation_steps`](main_classes/trainer#transformers.TrainingArguments.gradient_accumulation_steps) in [`TrainingArguments`] to effectively increase overall batch size. - - -Refer to the Performance [guide](performance) for more details about memory-saving techniques. - - +> [!TIP] +> Refer to the Performance [guide](performance) for more details about memory-saving techniques. ## ImportError @@ -136,11 +133,8 @@ tensor([[-0.1008, -0.4061]], grad_fn=) Most of the time, you should provide an `attention_mask` to your model to ignore the padding tokens to avoid this silent error. Now the output of the second sequence matches its actual output: - - -By default, the tokenizer creates an `attention_mask` for you based on your specific tokenizer's defaults. - - +> [!TIP] +> By default, the tokenizer creates an `attention_mask` for you based on your specific tokenizer's defaults. ```py >>> attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1], [1, 0, 0, 0, 0, 0]]) diff --git a/docs/source/es/autoclass_tutorial.md b/docs/source/es/autoclass_tutorial.md index 67c0911dde9e..4a0099e37348 100644 --- a/docs/source/es/autoclass_tutorial.md +++ b/docs/source/es/autoclass_tutorial.md @@ -18,11 +18,8 @@ rendered properly in your Markdown viewer. Con tantas arquitecturas diferentes de Transformer puede ser retador crear una para tu checkpoint. Como parte de la filosofía central de 🤗 Transformers para hacer que la biblioteca sea fácil, simple y flexible de usar; una `AutoClass` automáticamente infiere y carga la arquitectura correcta desde un checkpoint dado. El método `from_pretrained` te permite cargar rápidamente un modelo preentrenado para cualquier arquitectura, por lo que no tendrás que dedicar tiempo y recursos para entrenar uno desde cero. 
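A small sketch of the memory-saving suggestions from the troubleshooting section above: shrink the per-device batch size and compensate with gradient accumulation so the effective batch size stays the same (values are illustrative):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,   # reduced from e.g. 16 to fit in GPU memory
    gradient_accumulation_steps=4,   # 4 x 4 = effective batch size of 16
)
```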
Producir este tipo de código con checkpoint implica que si funciona con uno, funcionará también con otro (siempre que haya sido entrenado para una tarea similar) incluso si la arquitectura es distinta. - - -Recuerda, la arquitectura se refiere al esqueleto del modelo y los checkpoints son los pesos para una arquitectura dada. Por ejemplo, [BERT](https://huggingface.co/google-bert/bert-base-uncased) es una arquitectura, mientras que `google-bert/bert-base-uncased` es un checkpoint. Modelo es un término general que puede significar una arquitectura o un checkpoint. - - +> [!TIP] +> Recuerda, la arquitectura se refiere al esqueleto del modelo y los checkpoints son los pesos para una arquitectura dada. Por ejemplo, [BERT](https://huggingface.co/google-bert/bert-base-uncased) es una arquitectura, mientras que `google-bert/bert-base-uncased` es un checkpoint. Modelo es un término general que puede significar una arquitectura o un checkpoint. En este tutorial, aprenderás a: diff --git a/docs/source/es/chat_templating.md b/docs/source/es/chat_templating.md index e287c2137435..37fd68024241 100644 --- a/docs/source/es/chat_templating.md +++ b/docs/source/es/chat_templating.md @@ -301,11 +301,8 @@ tokenizer.push_to_hub("model_name") # Upload your new template to the Hub! El método [`~PreTrainedTokenizer.apply_chat_template`], que utiliza tu plantilla de chat, es llamado por la clase [`TextGenerationPipeline`], así que una vez que configures la plantilla de chat correcta, tu modelo se volverá automáticamente compatible con [`TextGenerationPipeline`]. - - -Si estás ajustando finamente un modelo para chat, además de establecer una plantilla de chat, probablemente deberías agregar cualquier nuevo token de control de chat como los tokens especiales en el tokenizador. Los tokens especiales nunca se dividen, asegurando que tus tokens de control siempre se manejen como tokens únicos en lugar de ser tokenizados en piezas. También deberías establecer el atributo `eos_token` del tokenizador con el token que marca el final de las generaciones del asistente en tu plantilla. Esto asegurará que las herramientas de generación de texto puedan determinar correctamente cuándo detener la generación de texto. - - +> [!TIP] +> Si estás ajustando finamente un modelo para chat, además de establecer una plantilla de chat, probablemente deberías agregar cualquier nuevo token de control de chat como los tokens especiales en el tokenizador. Los tokens especiales nunca se dividen, asegurando que tus tokens de control siempre se manejen como tokens únicos en lugar de ser tokenizados en piezas. También deberías establecer el atributo `eos_token` del tokenizador con el token que marca el final de las generaciones del asistente en tu plantilla. Esto asegurará que las herramientas de generación de texto puedan determinar correctamente cuándo detener la generación de texto. ### ¿Qué plantilla debería usar? diff --git a/docs/source/es/create_a_model.md b/docs/source/es/create_a_model.md index 4463952f4846..7e157b9167f3 100644 --- a/docs/source/es/create_a_model.md +++ b/docs/source/es/create_a_model.md @@ -101,11 +101,8 @@ Para volver a usar el archivo de configuración, puedes cargarlo usando [`~Pretr >>> my_config = DistilBertConfig.from_pretrained("./your_model_save_path/my_config.json") ``` - - -También puedes guardar los archivos de configuración como un diccionario; o incluso guardar solo la diferencia entre tu archivo personalizado y la configuración por defecto. 
Consulta la [documentación sobre configuración](main_classes/configuration) para ver más detalles. - - +> [!TIP] +> También puedes guardar los archivos de configuración como un diccionario; o incluso guardar solo la diferencia entre tu archivo personalizado y la configuración por defecto. Consulta la [documentación sobre configuración](main_classes/configuration) para ver más detalles. ## Modelo @@ -167,11 +164,8 @@ La ultima clase base que debes conocer antes de usar un modelo con datos textual Ambos *tokenizers* son compatibles con los métodos comunes, como los de encodificación y decodificación, los métodos para añadir tokens y aquellos que manejan tokens especiales. - - -No todos los modelos son compatibles con un *tokenizer* rápido. Échale un vistazo a esta [tabla](index#supported-frameworks) para comprobar si un modelo específico es compatible con un *tokenizer* rápido. - - +> [!WARNING] +> No todos los modelos son compatibles con un *tokenizer* rápido. Échale un vistazo a esta [tabla](index#supported-frameworks) para comprobar si un modelo específico es compatible con un *tokenizer* rápido. Si has entrenado tu propio *tokenizer*, puedes crear uno desde tu archivo de “vocabulario”: @@ -199,12 +193,8 @@ Crea un *tokenizer* rápido con la clase [`DistilBertTokenizerFast`]: >>> fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert/distilbert-base-uncased") ``` - - -Por defecto, el [`AutoTokenizer`] intentará cargar un *tokenizer* rápido. Puedes desactivar este comportamiento cambiando el parámetro `use_fast=False` de `from_pretrained`. - - - +> [!TIP] +> Por defecto, el [`AutoTokenizer`] intentará cargar un *tokenizer* rápido. Puedes desactivar este comportamiento cambiando el parámetro `use_fast=False` de `from_pretrained`. ## Extractor de Características @@ -236,11 +226,8 @@ ViTFeatureExtractor { } ``` - - -Si no estás buscando ninguna personalización en específico, usa el método `from_pretrained` para cargar los parámetros del extractor de características por defecto del modelo. - - +> [!TIP] +> Si no estás buscando ninguna personalización en específico, usa el método `from_pretrained` para cargar los parámetros del extractor de características por defecto del modelo. Puedes modificar cualquier parámetro de [`ViTFeatureExtractor`] para crear tu extractor de características personalizado: diff --git a/docs/source/es/custom_models.md b/docs/source/es/custom_models.md index fec50e4e7a18..57b7384a5328 100644 --- a/docs/source/es/custom_models.md +++ b/docs/source/es/custom_models.md @@ -182,11 +182,8 @@ En ambos casos, observa cómo heredamos de `PreTrainedModel` y llamamos a la ini (un poco como cuando escribes `torch.nn.Module`). La línea que establece `config_class` no es obligatoria, a menos que quieras registrar tu modelo con las clases automáticas (consulta la última sección). - - -Si tu modelo es muy similar a un modelo dentro de la biblioteca, puedes reutilizar la misma configuración de ese modelo. - - +> [!TIP] +> Si tu modelo es muy similar a un modelo dentro de la biblioteca, puedes reutilizar la misma configuración de ese modelo. Puedes hacer que tu modelo devuelva lo que quieras, pero devolver un diccionario como lo hicimos para `ResnetModelForImageClassification`, con el `loss` incluido cuando se pasan las etiquetas, hará que tu modelo se pueda @@ -219,11 +216,8 @@ se guarda el código del modelo. ## Enviar el código al _Hub_ - - -Esta _API_ es experimental y puede tener algunos cambios leves en las próximas versiones. 
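A quick sketch of the fast-versus-slow tokenizer behaviour described above: [`AutoTokenizer`] prefers the fast (Rust-backed) tokenizer when one exists, and `use_fast=False` forces the Python implementation:

```python
from transformers import AutoTokenizer

fast_tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
slow_tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased", use_fast=False)

print(type(fast_tokenizer).__name__)  # DistilBertTokenizerFast
print(type(slow_tokenizer).__name__)  # DistilBertTokenizer
```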
- - +> [!WARNING] +> Esta _API_ es experimental y puede tener algunos cambios leves en las próximas versiones. Primero, asegúrate de que tu modelo esté completamente definido en un archivo `.py`. Puedes basarte en importaciones relativas a otros archivos, siempre que todos los archivos estén en el mismo directorio (aún no admitimos submódulos @@ -242,12 +236,9 @@ contiene el código de `ResnetConfig` y el archivo del modelo contiene el códig El `__init__.py` puede estar vacío, solo está ahí para que Python detecte que `resnet_model` se puede usar como un módulo. - - -Si copias archivos del modelo desde la biblioteca, deberás reemplazar todas las importaciones relativas en la parte superior -del archivo para importarlos desde el paquete `transformers`. - - +> [!WARNING] +> Si copias archivos del modelo desde la biblioteca, deberás reemplazar todas las importaciones relativas en la parte superior +> del archivo para importarlos desde el paquete `transformers`. Ten en cuenta que puedes reutilizar (o subclasificar) una configuración o modelo existente. diff --git a/docs/source/es/debugging.md b/docs/source/es/debugging.md index 313566753052..b271a9d4b250 100644 --- a/docs/source/es/debugging.md +++ b/docs/source/es/debugging.md @@ -46,23 +46,14 @@ Esto mostrará mucha información de debug relacionada con NCCL, que luego puede ## Detección de Underflow y Overflow - +> [!TIP] +> Esta función está disponible actualmente sólo para PyTorch. -Esta función está disponible actualmente sólo para PyTorch. +> [!TIP] +> Para el entrenamiento multi-GPU, requiere DDP (`torch.distributed.launch`). - - - - -Para el entrenamiento multi-GPU, requiere DDP (`torch.distributed.launch`). - - - - - -Esta función puede utilizarse con cualquier modelo basado en `nn.Module`. - - +> [!TIP] +> Esta función puede utilizarse con cualquier modelo basado en `nn.Module`. Si empiezas a obtener `loss=NaN` o el modelo muestra algún otro comportamiento anormal debido a `inf` o `nan` en activations o weights hay que descubrir dónde se produce el primer underflow o overflow y qué lo ha provocado. Por suerte diff --git a/docs/source/es/glossary.md b/docs/source/es/glossary.md index 3debcdbd3545..33ef61c71ca2 100644 --- a/docs/source/es/glossary.md +++ b/docs/source/es/glossary.md @@ -261,11 +261,8 @@ Estas etiquetas son diferentes según la cabecera del modelo, por ejemplo: - Para modelos de detección de objetos ([`DetrForObjectDetection`]), el modelo espera una lista de diccionarios con claves `class_labels` y `boxes` donde cada valor del lote corresponde a la etiqueta esperada y el número de cajas delimitadoras de cada imagen individual. - Para modelos de reconocimiento automático de voz ([`Wav2Vec2ForCTC`]), el modelo espera un tensor de dimensión `(batch_size, target_length)` con cada valor correspondiente a la etiqueta esperada de cada token individual. - - -Las etiquetas de cada modelo pueden ser diferentes, así que asegúrate siempre de revisar la documentación de cada modelo para obtener más información sobre sus etiquetas específicas. - - +> [!TIP] +> Las etiquetas de cada modelo pueden ser diferentes, así que asegúrate siempre de revisar la documentación de cada modelo para obtener más información sobre sus etiquetas específicas. Los modelos base ([`BertModel`]) no aceptan etiquetas, ya que estos son los modelos base de transformadores, que simplemente generan características. 
diff --git a/docs/source/es/installation.md b/docs/source/es/installation.md index 714c3b195ebc..5472f549a5fa 100644 --- a/docs/source/es/installation.md +++ b/docs/source/es/installation.md @@ -114,11 +114,8 @@ pip install -e . Éstos comandos van a ligar el directorio desde donde clonamos el repositorio al path de las bibliotecas de Python. Python ahora buscará dentro de la carpeta que clonaste además de los paths normales de la biblioteca. Por ejemplo, si los paquetes de Python se encuentran instalados en `~/anaconda3/envs/main/lib/python3.7/site-packages/`, Python también buscará en el directorio desde donde clonamos el repositorio `~/transformers/`. - - -Debes mantener el directorio `transformers` si deseas seguir usando la biblioteca. - - +> [!WARNING] +> Debes mantener el directorio `transformers` si deseas seguir usando la biblioteca. Puedes actualizar tu copia local a la última versión de 🤗 Transformers con el siguiente comando: @@ -145,22 +142,16 @@ Los modelos preentrenados se descargan y almacenan en caché localmente en: `~/. 2. Variable de entorno del shell:`HF_HOME` + `transformers/`. 3. Variable de entorno del shell: `XDG_CACHE_HOME` + `/huggingface/transformers`. - - -🤗 Transformers usará las variables de entorno de shell `PYTORCH_TRANSFORMERS_CACHE` o `PYTORCH_PRETRAINED_BERT_CACHE` si viene de una iteración anterior de la biblioteca y ha configurado esas variables de entorno, a menos que especifiques la variable de entorno de shell `TRANSFORMERS_CACHE`. - - +> [!TIP] +> 🤗 Transformers usará las variables de entorno de shell `PYTORCH_TRANSFORMERS_CACHE` o `PYTORCH_PRETRAINED_BERT_CACHE` si viene de una iteración anterior de la biblioteca y ha configurado esas variables de entorno, a menos que especifiques la variable de entorno de shell `TRANSFORMERS_CACHE`. ## Modo Offline 🤗 Transformers puede ejecutarse en un entorno con firewall o fuera de línea (offline) usando solo archivos locales. Configura la variable de entorno `HF_HUB_OFFLINE=1` para habilitar este comportamiento. - - -Puedes añadir [🤗 Datasets](https://huggingface.co/docs/datasets/) al flujo de entrenamiento offline declarando la variable de entorno `HF_DATASETS_OFFLINE=1`. - - +> [!TIP] +> Puedes añadir [🤗 Datasets](https://huggingface.co/docs/datasets/) al flujo de entrenamiento offline declarando la variable de entorno `HF_DATASETS_OFFLINE=1`. Por ejemplo, normalmente ejecutarías un programa en una red normal con firewall para instancias externas con el siguiente comando: @@ -235,8 +226,5 @@ Una vez que el archivo se descargue y se almacene en caché localmente, especifi >>> config = AutoConfig.from_pretrained("./your/path/bigscience_t0/config.json") ``` - - -Para más detalles sobre cómo descargar archivos almacenados en el Hub consulta la sección [How to download files from the Hub](https://huggingface.co/docs/hub/how-to-downstream). - - +> [!TIP] +> Para más detalles sobre cómo descargar archivos almacenados en el Hub consulta la sección [How to download files from the Hub](https://huggingface.co/docs/hub/how-to-downstream). diff --git a/docs/source/es/model_memory_anatomy.md b/docs/source/es/model_memory_anatomy.md index 54609a1c1e79..59a7c3939ab5 100644 --- a/docs/source/es/model_memory_anatomy.md +++ b/docs/source/es/model_memory_anatomy.md @@ -136,11 +136,8 @@ default_args = { } ``` - - -Si planeas ejecutar varias pruebas, reinicie el kernel de Python entre cada prueba para borrar correctamente la memoria. 
- - +> [!TIP] +> Si planeas ejecutar varias pruebas, reinicie el kernel de Python entre cada prueba para borrar correctamente la memoria. ## Utilización de la memoria en el entrenamiento diff --git a/docs/source/es/model_sharing.md b/docs/source/es/model_sharing.md index aef87578da31..c5d58830b825 100644 --- a/docs/source/es/model_sharing.md +++ b/docs/source/es/model_sharing.md @@ -27,11 +27,8 @@ En este tutorial aprenderás dos métodos para compartir un modelo trained o fin frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen> - - -Para compartir un modelo con la comunidad necesitas una cuenta en [huggingface.co](https://huggingface.co/join). También puedes unirte a una organización existente o crear una nueva. - - +> [!TIP] +> Para compartir un modelo con la comunidad necesitas una cuenta en [huggingface.co](https://huggingface.co/join). También puedes unirte a una organización existente o crear una nueva. ## Características de los repositorios diff --git a/docs/source/es/pipeline_tutorial.md b/docs/source/es/pipeline_tutorial.md index 149ae9d9a257..dbc5f424fc24 100644 --- a/docs/source/es/pipeline_tutorial.md +++ b/docs/source/es/pipeline_tutorial.md @@ -22,11 +22,8 @@ Un [`pipeline`] simplifica el uso de cualquier modelo del [Hub](https://huggingf * Utilizar un tokenizador o modelo específico. * Utilizar un [`pipeline`] para tareas de audio y visión. - - -Echa un vistazo a la documentación de [`pipeline`] para obtener una lista completa de tareas admitidas. - - +> [!TIP] +> Echa un vistazo a la documentación de [`pipeline`] para obtener una lista completa de tareas admitidas. ## Uso del pipeline @@ -198,9 +195,8 @@ for out in pipe(KeyDataset(dataset, "audio")): ## Uso de pipelines para un servidor web - -Crear un motor de inferencia es un tema complejo que merece su propia página. - +> [!TIP] +> Crear un motor de inferencia es un tema complejo que merece su propia página. [Link](./pipeline_webserver) @@ -260,16 +256,13 @@ Por ejemplo, si usas esta [imagen de factura](https://huggingface.co/spaces/impi [{'score': 0.425, 'answer': 'us-001', 'start': 16, 'end': 16}] ``` - - -Para ejecutar el ejemplo anterior, debe tener instalado [`pytesseract`](https://pypi.org/project/pytesseract/) además de 🤗 Transformers: - -```bash -sudo apt install -y tesseract-ocr -pip install pytesseract -``` - - +> [!TIP] +> Para ejecutar el ejemplo anterior, debe tener instalado [`pytesseract`](https://pypi.org/project/pytesseract/) además de 🤗 Transformers: +> +> ```bash +> sudo apt install -y tesseract-ocr +> pip install pytesseract +> ``` ## Uso de `pipeline` en modelos grandes con 🤗 `accelerate`: diff --git a/docs/source/es/pipeline_webserver.md b/docs/source/es/pipeline_webserver.md index e268daabcbe8..4fb8ca904daa 100644 --- a/docs/source/es/pipeline_webserver.md +++ b/docs/source/es/pipeline_webserver.md @@ -4,9 +4,8 @@ rendered properly in your Markdown viewer. # Uso de un flujo de trabajo para un servidor web - -Crear un motor de inferencia es un tema complejo, y la "mejor" solución probablemente dependerá de tu caso de uso. ¿Estás en CPU o en GPU? ¿Quieres la latencia más baja, el rendimiento más alto, soporte para muchos modelos o simplemente optimizar altamente un modelo específico? Hay muchas formas de abordar este tema, así que lo que vamos a presentar es un buen valor predeterminado para comenzar, que no necesariamente será la solución más óptima para ti. 
- +> [!TIP] +> Crear un motor de inferencia es un tema complejo, y la "mejor" solución probablemente dependerá de tu caso de uso. ¿Estás en CPU o en GPU? ¿Quieres la latencia más baja, el rendimiento más alto, soporte para muchos modelos o simplemente optimizar altamente un modelo específico? Hay muchas formas de abordar este tema, así que lo que vamos a presentar es un buen valor predeterminado para comenzar, que no necesariamente será la solución más óptima para ti. Lo fundamental para entender es que podemos usar un iterador, tal como [en un conjunto de datos](pipeline_tutorial#uso-de-pipelines-en-un-conjunto-de-datos), ya que un servidor web es básicamente un sistema que espera solicitudes y las trata a medida que llegan. @@ -71,12 +70,9 @@ curl -X POST -d "test [MASK]" http://localhost:8000/ Lo realmente importante es cargar el modelo solo **una vez**, de modo que no haya copias del modelo en el servidor web. De esta manera, no se utiliza RAM innecesariamente. Luego, el mecanismo de queuing (colas) te permite hacer cosas sofisticadas como acumular algunos elementos antes de inferir para usar el agrupamiento dinámico: - - -El ejemplo de código a continuación está escrito intencionalmente como pseudocódigo para facilitar la lectura. -¡No lo ejecutes sin verificar si tiene sentido para los recursos de tu sistema! - - +> [!WARNING] +> El ejemplo de código a continuación está escrito intencionalmente como pseudocódigo para facilitar la lectura. +> ¡No lo ejecutes sin verificar si tiene sentido para los recursos de tu sistema! ```py (string, rq) = await q.get() diff --git a/docs/source/es/preprocessing.md b/docs/source/es/preprocessing.md index 8486d6a0687a..d2882593106d 100644 --- a/docs/source/es/preprocessing.md +++ b/docs/source/es/preprocessing.md @@ -30,11 +30,8 @@ Antes de que puedas utilizar los datos en un modelo, debes procesarlos en un for La principal herramienta para procesar datos textuales es un [tokenizador](main_classes/tokenizer). Un tokenizador comienza dividiendo el texto en *tokens* según un conjunto de reglas. Los tokens se convierten en números, que se utilizan para construir tensores como entrada a un modelo. El tokenizador también añade cualquier entrada adicional que requiera el modelo. - - -Si tienes previsto utilizar un modelo pre-entrenado, es importante que utilices el tokenizador pre-entrenado asociado. Esto te asegura que el texto se divide de la misma manera que el corpus de pre-entrenamiento y utiliza el mismo índice de tokens correspondiente (usualmente referido como el *vocab*) durante el pre-entrenamiento. - - +> [!TIP] +> Si tienes previsto utilizar un modelo pre-entrenado, es importante que utilices el tokenizador pre-entrenado asociado. Esto te asegura que el texto se divide de la misma manera que el corpus de pre-entrenamiento y utiliza el mismo índice de tokens correspondiente (usualmente referido como el *vocab*) durante el pre-entrenamiento. Comienza rápidamente cargando un tokenizador pre-entrenado con la clase [`AutoTokenizer`]. Esto descarga el *vocab* utilizado cuando un modelo es pre-entrenado. diff --git a/docs/source/es/quicktour.md b/docs/source/es/quicktour.md index 3599df38950a..8ab40a68566d 100644 --- a/docs/source/es/quicktour.md +++ b/docs/source/es/quicktour.md @@ -20,12 +20,9 @@ rendered properly in your Markdown viewer. ¡Entra en marcha con los 🤗 Transformers! 
Comienza usando [`pipeline`] para una inferencia veloz, carga un modelo preentrenado y un tokenizador con una [AutoClass](./model_doc/auto) para resolver tu tarea de texto, visión o audio. - - -Todos los ejemplos de código presentados en la documentación tienen un botón arriba a la derecha para elegir si quieres ocultar o mostrar el código en Pytorch o TensorFlow. -Si no fuese así, se espera que el código funcione para ambos backends sin ningún cambio. - - +> [!TIP] +> Todos los ejemplos de código presentados en la documentación tienen un botón arriba a la derecha para elegir si quieres ocultar o mostrar el código en Pytorch o TensorFlow. +> Si no fuese así, se espera que el código funcione para ambos backends sin ningún cambio. ## Pipeline @@ -54,11 +51,8 @@ El [`pipeline`] soporta muchas tareas comunes listas para usar: * Clasificación de Audios (Audio Classification, en inglés): asigna una etiqueta a un segmento de audio. * Reconocimiento de Voz Automático (Automatic Speech Recognition o ASR, en inglés): transcribe datos de audio a un texto. - - -Para más detalles acerca del [`pipeline`] y tareas asociadas, consulta la documentación [aquí](./main_classes/pipelines). - - +> [!TIP] +> Para más detalles acerca del [`pipeline`] y tareas asociadas, consulta la documentación [aquí](./main_classes/pipelines). ### Uso del Pipeline @@ -223,11 +217,8 @@ Lee el tutorial de [preprocessing](./preprocessing) para más detalles acerca de >>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name) ``` - - -Ve el [task summary](./task_summary) para revisar qué clase del [`AutoModel`] deberías usar para cada tarea. - - +> [!TIP] +> Ve el [task summary](./task_summary) para revisar qué clase del [`AutoModel`] deberías usar para cada tarea. Ahora puedes pasar tu lote (batch) preprocesado de inputs directamente al modelo. Solo tienes que desempacar el diccionario añadiendo `**`: @@ -246,21 +237,15 @@ tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725], [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=) ``` - - -Todos los modelos de 🤗 Transformers (PyTorch o TensorFlow) producirán los tensores *antes* de la función de activación -final (como softmax) porque la función de activación final es comúnmente fusionada con la pérdida. - - +> [!TIP] +> Todos los modelos de 🤗 Transformers (PyTorch o TensorFlow) producirán los tensores *antes* de la función de activación +> final (como softmax) porque la función de activación final es comúnmente fusionada con la pérdida. Los modelos son [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) o [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) estándares así que podrás usarlos en tu training loop usual. Sin embargo, para facilitar las cosas, 🤗 Transformers provee una clase [`Trainer`] para PyTorch que añade funcionalidades para entrenamiento distribuido, precición mixta, y más. Para TensorFlow, puedes usar el método `fit` desde [Keras](https://keras.io/). Consulta el [tutorial de entrenamiento](./training) para más detalles. - - -Los outputs del modelo de 🤗 Transformers son dataclasses especiales por lo que sus atributos pueden ser completados en un IDE. -Los outputs del modelo también se comportan como tuplas o diccionarios (e.g., puedes indexar con un entero, un slice o una cadena) en cuyo caso los atributos que son `None` son ignorados. - - +> [!TIP] +> Los outputs del modelo de 🤗 Transformers son dataclasses especiales por lo que sus atributos pueden ser completados en un IDE. 
+> Los outputs del modelo también se comportan como tuplas o diccionarios (e.g., puedes indexar con un entero, un slice o una cadena) en cuyo caso los atributos que son `None` son ignorados. ### Guarda un modelo diff --git a/docs/source/es/serialization.md b/docs/source/es/serialization.md index 9c29ed6f0406..d424daad2472 100644 --- a/docs/source/es/serialization.md +++ b/docs/source/es/serialization.md @@ -263,13 +263,10 @@ Ten en cuenta que, en este caso, los nombres de salida del modelo ajustado son ` que vimos anteriormente con el checkpoint `distilbert/distilbert-base-uncased`. Esto es de esperarse ya que el modelo ajustado tiene un cabezal de clasificación secuencial. - - -Las características que tienen un sufijo 'with-past' (por ejemplo, 'causal-lm-with-past') corresponden a topologías -de modelo con estados ocultos precalculados (clave y valores en los bloques de atención) que se pueden usar para una -decodificación autorregresiva más rápida. - - +> [!TIP] +> Las características que tienen un sufijo 'with-past' (por ejemplo, 'causal-lm-with-past') corresponden a topologías +> de modelo con estados ocultos precalculados (clave y valores en los bloques de atención) que se pueden usar para una +> decodificación autorregresiva más rápida. ### Exportar un modelo para una arquitectura no compatible @@ -293,12 +290,9 @@ de las que debe heredar, según el tipo de arquitectura del modelo que quieras e * Modelos basados en el _Decoder_ inherente de [`~onnx.config.OnnxConfigWithPast`] * Modelos _Encoder-decoder_ inherente de [`~onnx.config.OnnxSeq2SeqConfigWithPast`] - - -Una buena manera de implementar una configuración personalizada en ONNX es observar la implementación -existente en el archivo `configuration_.py` de una arquitectura similar. - - +> [!TIP] +> Una buena manera de implementar una configuración personalizada en ONNX es observar la implementación +> existente en el archivo `configuration_.py` de una arquitectura similar. Dado que DistilBERT es un modelo de tipo _encoder_, su configuración se hereda de `OnnxConfig`: @@ -324,14 +318,11 @@ Para DistilBERT, podemos ver que se requieren dos entradas: `input_ids` y `atten Estas entradas tienen la misma forma de `(batch_size, sequence_length)`, es por lo que vemos los mismos ejes utilizados en la configuración. - - -Observa que la propiedad `inputs` para `DistilBertOnnxConfig` devuelve un `OrderedDict`. -Esto nos asegura que las entradas coincidan con su posición relativa dentro del método -`PreTrainedModel.forward()` al rastrear el grafo. Recomendamos usar un `OrderedDict` -para las propiedades `inputs` y `outputs` al implementar configuraciones ONNX personalizadas. - - +> [!TIP] +> Observa que la propiedad `inputs` para `DistilBertOnnxConfig` devuelve un `OrderedDict`. +> Esto nos asegura que las entradas coincidan con su posición relativa dentro del método +> `PreTrainedModel.forward()` al rastrear el grafo. Recomendamos usar un `OrderedDict` +> para las propiedades `inputs` y `outputs` al implementar configuraciones ONNX personalizadas. Una vez que hayas implementado una configuración ONNX, puedes crear una instancia proporcionando la configuración del modelo base de la siguiente manera: @@ -376,13 +367,10 @@ exportar DistilBERT con un cabezal de clasificación de secuencias, podríamos u OrderedDict([('logits', {0: 'batch'})]) ``` - - -Todas las propiedades base y métodos asociados con [`~onnx.config.OnnxConfig`] y las -otras clases de configuración se pueden sobreescribir si es necesario. 
-Consulte [`BartOnnxConfig`] para ver un ejemplo avanzado. - - +> [!TIP] +> Todas las propiedades base y métodos asociados con [`~onnx.config.OnnxConfig`] y las +> otras clases de configuración se pueden sobreescribir si es necesario. +> Consulte [`BartOnnxConfig`] para ver un ejemplo avanzado. #### Exportar el modelo @@ -415,15 +403,12 @@ Una vez exportado el modelo, puedes probar que el modelo está bien formado de l >>> onnx.checker.check_model(onnx_model) ``` - - -Si tu modelo tiene más de 2GB, verás que se crean muchos archivos adicionales durante la exportación. -Esto es _esperado_ porque ONNX usa [Búferes de protocolo](https://developers.google.com/protocol-buffers/) -para almacenar el modelo y éstos tienen un límite de tamaño de 2 GB. Consulta la -[documentación de ONNX](https://github.com/onnx/onnx/blob/master/docs/ExternalData.md) para obtener -instrucciones sobre cómo cargar modelos con datos externos. - - +> [!TIP] +> Si tu modelo tiene más de 2GB, verás que se crean muchos archivos adicionales durante la exportación. +> Esto es _esperado_ porque ONNX usa [Búferes de protocolo](https://developers.google.com/protocol-buffers/) +> para almacenar el modelo y éstos tienen un límite de tamaño de 2 GB. Consulta la +> [documentación de ONNX](https://github.com/onnx/onnx/blob/master/docs/ExternalData.md) para obtener +> instrucciones sobre cómo cargar modelos con datos externos. #### Validar los resultados del modelo @@ -457,14 +442,11 @@ y así tener una idea de lo que necesito. ## TorchScript - - -Este es el comienzo de nuestros experimentos con TorchScript y todavía estamos explorando sus capacidades con modelos de -tamaño de entrada variable. Es un tema de interés y profundizaremos nuestro análisis en las próximas -versiones, con más ejemplos de código, una implementación más flexible y puntos de referencia que comparen códigos -basados en Python con TorchScript compilado. - - +> [!TIP] +> Este es el comienzo de nuestros experimentos con TorchScript y todavía estamos explorando sus capacidades con modelos de +> tamaño de entrada variable. Es un tema de interés y profundizaremos nuestro análisis en las próximas +> versiones, con más ejemplos de código, una implementación más flexible y puntos de referencia que comparen códigos +> basados en Python con TorchScript compilado. Según la documentación de PyTorch: "TorchScript es una forma de crear modelos serializables y optimizables a partir del código de PyTorch". Los dos módulos de Pytorch [JIT y TRACE](https://pytorch.org/docs/stable/jit.html) permiten al diff --git a/docs/source/es/tasks/asr.md b/docs/source/es/tasks/asr.md index 30c880d1f189..a008690d34a9 100644 --- a/docs/source/es/tasks/asr.md +++ b/docs/source/es/tasks/asr.md @@ -25,11 +25,8 @@ En esta guía te mostraremos como: 1. Hacer fine-tuning al modelo [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) con el dataset [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) para transcribir audio a texto. 2. Usar tu modelo ajustado para tareas de inferencia. - - -Revisa la [página de la tarea](https://huggingface.co/tasks/automatic-speech-recognition) de reconocimiento automático del habla para acceder a más información sobre los modelos, datasets y métricas asociados. - - +> [!TIP] +> Revisa la [página de la tarea](https://huggingface.co/tasks/automatic-speech-recognition) de reconocimiento automático del habla para acceder a más información sobre los modelos, datasets y métricas asociados. 
Antes de comenzar, asegúrate de haber instalado todas las librerías necesarias: @@ -224,11 +221,8 @@ Ahora tu función `compute_metrics` (computar métricas) está lista y podrás u ## Entrenamiento - - -Si no tienes experiencia haciéndole fine-tuning a un modelo con el [`Trainer`], ¡échale un vistazo al tutorial básico [aquí](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> Si no tienes experiencia haciéndole fine-tuning a un modelo con el [`Trainer`], ¡échale un vistazo al tutorial básico [aquí](../training#train-with-pytorch-trainer)! ¡Ya puedes empezar a entrenar tu modelo! Para ello, carga Wav2Vec2 con [`AutoModelForCTC`]. Especifica la reducción que quieres aplicar con el parámetro `ctc_loss_reduction`. A menudo, es mejor usar el promedio en lugar de la sumatoria que se hace por defecto. @@ -288,11 +282,8 @@ Una vez que el entrenamiento haya sido completado, comparte tu modelo en el Hub >>> trainer.push_to_hub() ``` - - -Para ver un ejemplo más detallado de cómo hacerle fine-tuning a un modelo para reconocimiento automático del habla, échale un vistazo a esta [entrada de blog](https://huggingface.co/blog/fine-tune-wav2vec2-english) para ASR en inglés y a esta [entrada](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) para ASR multilingüe. - - +> [!TIP] +> Para ver un ejemplo más detallado de cómo hacerle fine-tuning a un modelo para reconocimiento automático del habla, échale un vistazo a esta [entrada de blog](https://huggingface.co/blog/fine-tune-wav2vec2-english) para ASR en inglés y a esta [entrada](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) para ASR multilingüe. ## Inferencia @@ -319,11 +310,8 @@ La manera más simple de probar tu modelo para hacer inferencia es usarlo en un {'text': 'I WOUD LIKE O SET UP JOINT ACOUNT WTH Y PARTNER'} ``` - - -La transcripción es decente, pero podría ser mejor. ¡Intenta hacerle fine-tuning a tu modelo con más ejemplos para obtener resultados aún mejores! - - +> [!TIP] +> La transcripción es decente, pero podría ser mejor. ¡Intenta hacerle fine-tuning a tu modelo con más ejemplos para obtener resultados aún mejores! También puedes replicar de forma manual los resultados del `pipeline` si lo deseas: diff --git a/docs/source/es/tasks/audio_classification.md b/docs/source/es/tasks/audio_classification.md index 3b0446143262..39267d07cd74 100644 --- a/docs/source/es/tasks/audio_classification.md +++ b/docs/source/es/tasks/audio_classification.md @@ -28,11 +28,8 @@ En esta guía te mostraremos como: 2. Usar tu modelo ajustado para tareas de inferencia. - - -Consulta la [página de la tarea](https://huggingface.co/tasks/audio-classification) de clasificación de audio para acceder a más información sobre los modelos, datasets, y métricas asociados. - - +> [!TIP] +> Consulta la [página de la tarea](https://huggingface.co/tasks/audio-classification) de clasificación de audio para acceder a más información sobre los modelos, datasets, y métricas asociados. Antes de comenzar, asegúrate de haber instalado todas las librerías necesarias: @@ -187,11 +184,8 @@ Ahora tu función `compute_metrics` (computar métricas) está lista y podrás u ## Entrenamiento - - -¡Si no tienes experiencia haciéndo *fine-tuning* a un modelo con el [`Trainer`], échale un vistazo al tutorial básico [aquí](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> ¡Si no tienes experiencia haciéndo *fine-tuning* a un modelo con el [`Trainer`], échale un vistazo al tutorial básico [aquí](../training#train-with-pytorch-trainer)! ¡Ya puedes empezar a entrenar tu modelo! 
Carga Wav2Vec2 con [`AutoModelForAudioClassification`] junto con el especifica el número de etiquetas, y pasa al modelo los *mappings* entre el número entero de etiqueta y la clase de etiqueta. @@ -245,11 +239,8 @@ Una vez que el entrenamiento haya sido completado, comparte tu modelo en el Hub >>> trainer.push_to_hub() ``` - - -Para ver un ejemplo más detallado de comó hacerle fine-tuning a un modelo para clasificación, échale un vistazo al correspondiente [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb). - - +> [!TIP] +> Para ver un ejemplo más detallado de comó hacerle fine-tuning a un modelo para clasificación, échale un vistazo al correspondiente [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb). ## Inference diff --git a/docs/source/es/tasks/image_captioning.md b/docs/source/es/tasks/image_captioning.md index 620dcec1bfbd..29d3c6c015bb 100644 --- a/docs/source/es/tasks/image_captioning.md +++ b/docs/source/es/tasks/image_captioning.md @@ -64,11 +64,8 @@ DatasetDict({ El conjunto de datos tiene dos características, `image` y `text`. - - -Muchos conjuntos de datos de subtítulos de imágenes contienen múltiples subtítulos por imagen. En esos casos, una estrategia común es muestrear aleatoriamente un subtítulo entre los disponibles durante el entrenamiento. - - +> [!TIP] +> Muchos conjuntos de datos de subtítulos de imágenes contienen múltiples subtítulos por imagen. En esos casos, una estrategia común es muestrear aleatoriamente un subtítulo entre los disponibles durante el entrenamiento. Divide el conjunto de entrenamiento del conjunto de datos en un conjunto de entrenamiento y de prueba con el método [`~datasets.Dataset.train_test_split`]: diff --git a/docs/source/es/tasks/image_classification.md b/docs/source/es/tasks/image_classification.md index 1bea46884202..b35a8b308c13 100644 --- a/docs/source/es/tasks/image_classification.md +++ b/docs/source/es/tasks/image_classification.md @@ -22,11 +22,8 @@ La clasificación de imágenes asigna una etiqueta o clase a una imagen. A difer Esta guía te mostrará como hacer fine-tune al [ViT](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/vit) en el dataset [Food-101](https://huggingface.co/datasets/food101) para clasificar un alimento en una imagen. - - -Consulta la [página de la tarea](https://huggingface.co/tasks/audio-classification) de clasificación de imágenes para obtener más información sobre sus modelos, datasets y métricas asociadas. - - +> [!TIP] +> Consulta la [página de la tarea](https://huggingface.co/tasks/audio-classification) de clasificación de imágenes para obtener más información sobre sus modelos, datasets y métricas asociadas. ## Carga el dataset Food-101 @@ -127,11 +124,8 @@ Carga ViT con [`AutoModelForImageClassification`]. Especifica el número de labe ... ) ``` - - -Si no estás familiarizado con el fine-tuning de un modelo con el [`Trainer`], echa un vistazo al tutorial básico [aquí](../training#finetune-with-trainer)! - - +> [!TIP] +> Si no estás familiarizado con el fine-tuning de un modelo con el [`Trainer`], echa un vistazo al tutorial básico [aquí](../training#finetune-with-trainer)! 
Al llegar a este punto, solo quedan tres pasos: @@ -166,8 +160,5 @@ Al llegar a este punto, solo quedan tres pasos: >>> trainer.train() ``` - - -Para ver un ejemplo más a profundidad de cómo hacer fine-tune a un modelo para clasificación de imágenes, echa un vistazo al correspondiente [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). - - +> [!TIP] +> Para ver un ejemplo más a profundidad de cómo hacer fine-tune a un modelo para clasificación de imágenes, echa un vistazo al correspondiente [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). diff --git a/docs/source/es/tasks/language_modeling.md b/docs/source/es/tasks/language_modeling.md index b5937cdb13cf..74412207412c 100644 --- a/docs/source/es/tasks/language_modeling.md +++ b/docs/source/es/tasks/language_modeling.md @@ -28,11 +28,8 @@ El modelado de lenguaje por enmascaramiento predice un token enmascarado en una Esta guía te mostrará cómo realizar fine-tuning [DistilGPT2](https://huggingface.co/distilbert/distilgpt2) para modelos de lenguaje causales y [DistilRoBERTa](https://huggingface.co/distilbert/distilroberta-base) para modelos de lenguaje por enmascaramiento en el [r/askscience](https://www.reddit.com/r/askscience/) subdataset [ELI5](https://huggingface.co/datasets/eli5). - - -Mira la [página de tarea](https://huggingface.co/tasks/text-generation) para generación de texto y la [página de tarea](https://huggingface.co/tasks/fill-mask) para modelos de lenguajes por enmascaramiento para obtener más información sobre los modelos, datasets, y métricas asociadas. - - +> [!TIP] +> Mira la [página de tarea](https://huggingface.co/tasks/text-generation) para generación de texto y la [página de tarea](https://huggingface.co/tasks/fill-mask) para modelos de lenguajes por enmascaramiento para obtener más información sobre los modelos, datasets, y métricas asociadas. ## Carga el dataset ELI5 @@ -192,11 +189,8 @@ Carga DistilGPT2 con [`AutoModelForCausalLM`]: >>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2") ``` - - -Si no estás familiarizado con el proceso de realizar fine-tuning sobre un modelo con [`Trainer`], considera el tutorial básico [aquí](../training#finetune-with-trainer)! - - +> [!TIP] +> Si no estás familiarizado con el proceso de realizar fine-tuning sobre un modelo con [`Trainer`], considera el tutorial básico [aquí](../training#finetune-with-trainer)! A este punto, solo faltan tres pasos: @@ -237,11 +231,8 @@ Carga DistilRoBERTa con [`AutoModelForMaskedlM`]: >>> model = AutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base") ``` - - -Si no estás familiarizado con el proceso de realizar fine-tuning sobre un modelo con [`Trainer`], considera el tutorial básico [aquí](../training#finetune-with-trainer)! - - +> [!TIP] +> Si no estás familiarizado con el proceso de realizar fine-tuning sobre un modelo con [`Trainer`], considera el tutorial básico [aquí](../training#finetune-with-trainer)! 
A este punto, solo faltan tres pasos: @@ -269,10 +260,7 @@ A este punto, solo faltan tres pasos: >>> trainer.train() ``` - - -Para un ejemplo más profundo sobre cómo realizar el fine-tuning sobre un modelo de lenguaje causal, considera -[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) -o [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb). - - +> [!TIP] +> Para un ejemplo más profundo sobre cómo realizar el fine-tuning sobre un modelo de lenguaje causal, considera +> [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) +> o [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb). diff --git a/docs/source/es/tasks/multiple_choice.md b/docs/source/es/tasks/multiple_choice.md index d73688e36a8e..9dd31f9b0d7d 100644 --- a/docs/source/es/tasks/multiple_choice.md +++ b/docs/source/es/tasks/multiple_choice.md @@ -110,11 +110,8 @@ Carga el modelo BERT con [`AutoModelForMultipleChoice`]: >>> model = AutoModelForMultipleChoice.from_pretrained("google-bert/bert-base-uncased") ``` - - -Para familiarizarte con el fine-tuning con [`Trainer`], ¡mira el tutorial básico [aquí](../training#finetune-with-trainer)! - - +> [!TIP] +> Para familiarizarte con el fine-tuning con [`Trainer`], ¡mira el tutorial básico [aquí](../training#finetune-with-trainer)! En este punto, solo quedan tres pasos: diff --git a/docs/source/es/tasks/question_answering.md b/docs/source/es/tasks/question_answering.md index 085f381aa0f5..d11238954244 100644 --- a/docs/source/es/tasks/question_answering.md +++ b/docs/source/es/tasks/question_answering.md @@ -25,11 +25,8 @@ La respuesta a preguntas devuelve una respuesta a partir de una pregunta dada. E Esta guía te mostrará como hacer fine-tuning de [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) en el dataset [SQuAD](https://huggingface.co/datasets/squad) para responder preguntas de forma extractiva. - - -Revisa la [página de la tarea](https://huggingface.co/tasks/question-answering) de responder preguntas para tener más información sobre otras formas de responder preguntas y los modelos, datasets y métricas asociadas. - - +> [!TIP] +> Revisa la [página de la tarea](https://huggingface.co/tasks/question-answering) de responder preguntas para tener más información sobre otras formas de responder preguntas y los modelos, datasets y métricas asociadas. ## Carga el dataset SQuAD @@ -154,11 +151,8 @@ Carga el modelo DistilBERT con [`AutoModelForQuestionAnswering`]: >>> model = AutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") ``` - - -Para familiarizarte con el fine-tuning con [`Trainer`], ¡mira el tutorial básico [aquí](../training#finetune-with-trainer)! - - +> [!TIP] +> Para familiarizarte con el fine-tuning con [`Trainer`], ¡mira el tutorial básico [aquí](../training#finetune-with-trainer)! 
En este punto, solo quedan tres pasos: @@ -189,10 +183,7 @@ En este punto, solo quedan tres pasos: >>> trainer.train() ``` - - -Para un ejemplo con mayor profundidad de cómo hacer fine-tuning a un modelo para responder preguntas, échale un vistazo al -[cuaderno de PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb) o al -[cuaderno de TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb) correspondiente. - - +> [!TIP] +> Para un ejemplo con mayor profundidad de cómo hacer fine-tuning a un modelo para responder preguntas, échale un vistazo al +> [cuaderno de PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb) o al +> [cuaderno de TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb) correspondiente. diff --git a/docs/source/es/tasks/summarization.md b/docs/source/es/tasks/summarization.md index 7525ccaa41f6..3c1767bb0ded 100644 --- a/docs/source/es/tasks/summarization.md +++ b/docs/source/es/tasks/summarization.md @@ -25,11 +25,8 @@ La generación de resúmenes (summarization, en inglés) crea una versión más Esta guía te mostrará cómo puedes hacer fine-tuning del modelo [T5](https://huggingface.co/google-t5/t5-small) sobre el subset de proyectos de ley del estado de California, dentro del dataset [BillSum](https://huggingface.co/datasets/billsum) para hacer generación de resúmenes abstractiva. - - -Consulta la [página de la tarea](https://huggingface.co/tasks/summarization) de generación de resúmenes para obtener más información sobre sus modelos, datasets y métricas asociadas. - - +> [!TIP] +> Consulta la [página de la tarea](https://huggingface.co/tasks/summarization) de generación de resúmenes para obtener más información sobre sus modelos, datasets y métricas asociadas. ## Carga el dataset BillSum @@ -112,11 +109,8 @@ Carga T5 con [`AutoModelForSeq2SeqLM`]: >>> model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small") ``` - - -Para familiarizarte con el proceso para realizar fine-tuning sobre un modelo con [`Trainer`], ¡mira el tutorial básico [aquí](../training#finetune-with-trainer)! - - +> [!TIP] +> Para familiarizarte con el proceso para realizar fine-tuning sobre un modelo con [`Trainer`], ¡mira el tutorial básico [aquí](../training#finetune-with-trainer)! En este punto, solo faltan tres pasos: @@ -149,10 +143,7 @@ En este punto, solo faltan tres pasos: >>> trainer.train() ``` - - -Para un ejemplo con mayor profundidad de cómo hacer fine-tuning a un modelo para generación de resúmenes, revisa la -[notebook en PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb) -o a la [notebook en TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb). - - +> [!TIP] +> Para un ejemplo con mayor profundidad de cómo hacer fine-tuning a un modelo para generación de resúmenes, revisa la +> [notebook en PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb) +> o a la [notebook en TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb). 
diff --git a/docs/source/es/tasks_explained.md b/docs/source/es/tasks_explained.md index 69d822e82ac6..9b6f8d548e5a 100644 --- a/docs/source/es/tasks_explained.md +++ b/docs/source/es/tasks_explained.md @@ -29,11 +29,8 @@ Para explicar cómo se resuelven las tareas, caminaremos a través de lo que suc - [GPT2](https://huggingface.co/docs/transformers/model_doc/gpt2) para tareas de NLP como generación de texto que utilizan un descodificador - [BART](https://huggingface.co/docs/transformers/model_doc/bart) para tareas de NLP como resumen y traducción que utilizan un codificador-descodificador - - -Antes de continuar, es bueno tener un conocimiento básico de la arquitectura original del Transformer. Saber cómo funcionan los codificadores, decodificadores y la atención te ayudará a entender cómo funcionan los diferentes modelos de Transformer. Si estás empezando o necesitas repasar, ¡echa un vistazo a nuestro [curso](https://huggingface.co/course/chapter1/4?fw=pt) para obtener más información! - - +> [!TIP] +> Antes de continuar, es bueno tener un conocimiento básico de la arquitectura original del Transformer. Saber cómo funcionan los codificadores, decodificadores y la atención te ayudará a entender cómo funcionan los diferentes modelos de Transformer. Si estás empezando o necesitas repasar, ¡echa un vistazo a nuestro [curso](https://huggingface.co/course/chapter1/4?fw=pt) para obtener más información! ## Habla y audio @@ -74,11 +71,8 @@ Hay dos formas de abordar las tareas de visión por computadora: 1. Dividir una imagen en una secuencia de parches y procesarlos en paralelo con un Transformer. 2. Utilizar una CNN moderna, como [ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext), que se basa en capas convolucionales pero adopta diseños de redes modernas. - - -Un tercer enfoque combina Transformers con convoluciones (por ejemplo, [Convolutional Vision Transformer](https://huggingface.co/docs/transformers/model_doc/cvt) o [LeViT](https://huggingface.co/docs/transformers/model_doc/levit)). No discutiremos estos porque simplemente combinan los dos enfoques que examinamos aquí. - - +> [!TIP] +> Un tercer enfoque combina Transformers con convoluciones (por ejemplo, [Convolutional Vision Transformer](https://huggingface.co/docs/transformers/model_doc/cvt) o [LeViT](https://huggingface.co/docs/transformers/model_doc/levit)). No discutiremos estos porque simplemente combinan los dos enfoques que examinamos aquí. ViT y ConvNeXT se utilizan comúnmente para la clasificación de imágenes, pero para otras tareas de visión como la detección de objetos, la segmentación y la estimación de profundidad, veremos DETR, Mask2Former y GLPN, respectivamente; estos modelos son más adecuados para esas tareas. @@ -108,11 +102,8 @@ El cambio principal que introdujo ViT fue en cómo se alimentan las imágenes a #### CNN - - -Esta sección explica brevemente las convoluciones, pero sería útil tener un entendimiento previo de cómo cambian la forma y el tamaño de una imagen. Si no estás familiarizado con las convoluciones, ¡echa un vistazo al [capítulo de Redes Neuronales Convolucionales](https://github.com/fastai/fastbook/blob/master/13_convolutions.ipynb) del libro fastai! - - +> [!TIP] +> Esta sección explica brevemente las convoluciones, pero sería útil tener un entendimiento previo de cómo cambian la forma y el tamaño de una imagen. 
Si no estás familiarizado con las convoluciones, ¡echa un vistazo al [capítulo de Redes Neuronales Convolucionales](https://github.com/fastai/fastbook/blob/master/13_convolutions.ipynb) del libro fastai! [ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext) es una arquitectura de CNN que adopta diseños de redes nuevas y modernas para mejorar el rendimiento. Sin embargo, las convoluciones siguen siendo el núcleo del modelo. Desde una perspectiva de alto nivel, una [convolución](glossary#convolution) es una operación donde una matriz más pequeña (*kernel*) se multiplica por una pequeña ventana de píxeles de la imagen. Esta calcula algunas características de ella, como una textura particular o la curvatura de una línea. Luego, se desliza hacia la siguiente ventana de píxeles; la distancia que recorre la convolución se conoce como el *stride*. @@ -230,11 +221,8 @@ Para usar BERT en la respuesta a preguntas, añade una cabecera de clasificació ¿Listo para probar la respuesta a preguntas? ¡Consulta nuestra guía completa de [respuesta a preguntas](tasks/question_answering) para aprender cómo ajustar DistilBERT y usarlo para inferencia! - - -💡 ¡Observa lo fácil que es usar BERT para diferentes tareas una vez que ha sido preentrenado! ¡Solo necesitas añadir una cabecera específica al modelo preentrenado para manipular los estados ocultos en tu salida deseada! - - +> [!TIP] +> 💡 ¡Observa lo fácil que es usar BERT para diferentes tareas una vez que ha sido preentrenado! ¡Solo necesitas añadir una cabecera específica al modelo preentrenado para manipular los estados ocultos en tu salida deseada! ### Generación de texto @@ -252,11 +240,8 @@ El objetivo del preentrenamiento de GPT-2 se basa completamente en el [modelado ¿Listo para probar la generación de texto? ¡Consulta nuestra guía completa de [modelado de lenguaje causal](tasks/language_modeling#modelado-de-lenguaje-causal) para aprender cómo ajustar DistilGPT-2 y usarlo para inferencia! - - -Para obtener más información sobre la generación de texto, ¡consulta la guía de [estrategias de generación de texto](https://huggingface.co/docs/transformers/generation_strategies)! - - +> [!TIP] +> Para obtener más información sobre la generación de texto, ¡consulta la guía de [estrategias de generación de texto](https://huggingface.co/docs/transformers/generation_strategies)! ### Resumir @@ -272,11 +257,8 @@ Los modelos codificador-decodificador como [BART](https://huggingface.co/docs/tr ¿Listo para probar la sumarización? ¡Consulta nuestra guía completa de [Generación de resúmenes](tasks/summarization) para aprender cómo ajustar T5 y usarlo para inferencia! - - -Para obtener más información sobre la generación de texto, ¡consulta la guía de [estrategias de generación de texto](https://huggingface.co/docs/transformers/generation_strategies)! - - +> [!TIP] +> Para obtener más información sobre la generación de texto, ¡consulta la guía de [estrategias de generación de texto](https://huggingface.co/docs/transformers/generation_strategies)! ### Traducción @@ -288,8 +270,5 @@ Desde entonces, BART ha sido seguido por una versión multilingüe, mBART, desti ¿Listo para probar la traducción? ¡Consulta nuestra guía completa de [traducción](https://huggingface.co/docs/transformers/tasks/translation) para aprender cómo ajustar T5 y usarlo para inferencia! - - -Para obtener más información sobre la generación de texto, ¡consulta la guía de [estrategias de generación de texto](https://huggingface.co/docs/transformers/generation_strategies)! 
- - \ No newline at end of file +> [!TIP] +> Para obtener más información sobre la generación de texto, ¡consulta la guía de [estrategias de generación de texto](https://huggingface.co/docs/transformers/generation_strategies)! \ No newline at end of file diff --git a/docs/source/es/torchscript.md b/docs/source/es/torchscript.md index 93873fadcae8..25686ae65ecd 100644 --- a/docs/source/es/torchscript.md +++ b/docs/source/es/torchscript.md @@ -16,10 +16,8 @@ rendered properly in your Markdown viewer. # Exportar a TorchScript - -Este es el comienzo de nuestros experimentos con TorchScript y todavía estamos explorando sus capacidades con modelos de variables de entrada. Es un tema de interés para nosotros y profundizaremos en nuestro análisis en las próximas versiones, con más ejemplos de código, una implementación más flexible y comparativas de rendimiento comparando códigos basados en Python con TorchScript compilado. - - +> [!TIP] +> Este es el comienzo de nuestros experimentos con TorchScript y todavía estamos explorando sus capacidades con modelos de variables de entrada. Es un tema de interés para nosotros y profundizaremos en nuestro análisis en las próximas versiones, con más ejemplos de código, una implementación más flexible y comparativas de rendimiento comparando códigos basados en Python con TorchScript compilado. De acuerdo con la documentación de TorchScript: diff --git a/docs/source/es/trainer.md b/docs/source/es/trainer.md index 4455521f5317..900165ac2653 100644 --- a/docs/source/es/trainer.md +++ b/docs/source/es/trainer.md @@ -18,15 +18,12 @@ rendered properly in your Markdown viewer. El [`Trainer`] es un bucle completo de entrenamiento y evaluación para modelos de PyTorch implementado en la biblioteca Transformers. Solo necesitas pasarle las piezas necesarias para el entrenamiento (modelo, tokenizador, conjunto de datos, función de evaluación, hiperparámetros de entrenamiento, etc.), y la clase [`Trainer`] se encarga del resto. Esto facilita comenzar a entrenar más rápido sin tener que escribir manualmente tu propio bucle de entrenamiento. Pero al mismo tiempo, [`Trainer`] es muy personalizable y ofrece una gran cantidad de opciones de entrenamiento para que puedas adaptarlo a tus necesidades exactas de entrenamiento. - - -Además de la clase [`Trainer`], Transformers también proporciona una clase [`Seq2SeqTrainer`] para tareas de secuencia a secuencia como traducción o resumen. También está la clase [~trl.SFTTrainer] de la biblioteca [TRL](https://hf.co/docs/trl) que envuelve la clase [`Trainer`] y está optimizada para entrenar modelos de lenguaje como Llama-2 y Mistral con técnicas autoregresivas. [`~trl.SFTTrainer`] también admite funciones como el empaquetado de secuencias, LoRA, cuantización y DeepSpeed para escalar eficientemente a cualquier tamaño de modelo. - -
- -Siéntete libre de consultar [la referencia de API](./main_classes/trainer) para estas otras clases tipo [`Trainer`] para aprender más sobre cuándo usar cada una. En general, [`Trainer`] es la opción más versátil y es apropiada para una amplia gama de tareas. [`Seq2SeqTrainer`] está diseñado para tareas de secuencia a secuencia y [`~trl.SFTTrainer`] está diseñado para entrenar modelos de lenguaje. - -
+> [!TIP]
+> Además de la clase [`Trainer`], Transformers también proporciona una clase [`Seq2SeqTrainer`] para tareas de secuencia a secuencia como traducción o resumen. También está la clase [`~trl.SFTTrainer`] de la biblioteca [TRL](https://hf.co/docs/trl) que envuelve la clase [`Trainer`] y está optimizada para entrenar modelos de lenguaje como Llama-2 y Mistral con técnicas autoregresivas. [`~trl.SFTTrainer`] también admite funciones como el empaquetado de secuencias, LoRA, cuantización y DeepSpeed para escalar eficientemente a cualquier tamaño de modelo.
+>
+>
+> +> Siéntete libre de consultar [la referencia de API](./main_classes/trainer) para estas otras clases tipo [`Trainer`] para aprender más sobre cuándo usar cada una. En general, [`Trainer`] es la opción más versátil y es apropiada para una amplia gama de tareas. [`Seq2SeqTrainer`] está diseñado para tareas de secuencia a secuencia y [`~trl.SFTTrainer`] está diseñado para entrenar modelos de lenguaje. Antes de comenzar, asegúrate de tener instalado [Accelerate](https://hf.co/docs/accelerate), una biblioteca para habilitar y ejecutar el entrenamiento de PyTorch en entornos distribuidos. @@ -178,21 +175,15 @@ trainer = Trainer( ## Logging - - -Comprueba el API referencia [logging](./main_classes/logging) para mas información sobre los niveles differentes de logging. - - +> [!TIP] +> Comprueba el API referencia [logging](./main_classes/logging) para mas información sobre los niveles differentes de logging. El [`Trainer`] está configurado a `logging.INFO` de forma predeterminada el cual informa errores, advertencias y otra información basica. Un [`Trainer`] réplica - en entornos distributos - está configurado a `logging.WARNING` el cual solamente informa errores y advertencias. Puedes cambiar el nivel de logging con los parametros [`log_level`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.log_level) y [`log_level_replica`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.log_level_replica) en [`TrainingArguments`]. Para configurar el nivel de registro para cada nodo, usa el parámetro [`log_on_each_node`](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments.log_on_each_node) para determinar si deseas utilizar el nivel de registro en cada nodo o solo en el nodo principal. - - -[`Trainer`] establece el nivel de registro por separado para cada nodo en el método [`Trainer.init`], por lo que es posible que desees considerar establecer esto antes si estás utilizando otras funcionalidades de Transformers antes de crear el objeto [`Trainer`]. - - +> [!TIP] +> [`Trainer`] establece el nivel de registro por separado para cada nodo en el método [`Trainer.init`], por lo que es posible que desees considerar establecer esto antes si estás utilizando otras funcionalidades de Transformers antes de crear el objeto [`Trainer`]. Por ejemplo, para establecer que tu código principal y los módulos utilicen el mismo nivel de registro según cada nodo: @@ -253,11 +244,8 @@ NEFTune se desactiva después del entrenamiento para restaurar la capa de incrus La clase [`Trainer`] está impulsada por [Accelerate](https://hf.co/docs/accelerate), una biblioteca para entrenar fácilmente modelos de PyTorch en entornos distribuidos con soporte para integraciones como [FullyShardedDataParallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) y [DeepSpeed](https://www.deepspeed.ai/). - - -Aprende más sobre las estrategias de fragmentación FSDP, descarga de CPU y más con el [`Trainer`] en la guía [Paralela de Datos Completamente Fragmentados](fsdp). - - +> [!TIP] +> Aprende más sobre las estrategias de fragmentación FSDP, descarga de CPU y más con el [`Trainer`] en la guía [Paralela de Datos Completamente Fragmentados](fsdp). Para usar Accelerate con [`Trainer`], ejecuta el comando [`accelerate.config`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-config) para configurar el entrenamiento para tu entorno de entrenamiento. 
Este comando crea un `config_file.yaml` que se utilizará cuando inicies tu script de entrenamiento. Por ejemplo, algunas configuraciones de ejemplo que puedes configurar son: diff --git a/docs/source/es/training.md b/docs/source/es/training.md index f10f49d08997..f793b210272f 100644 --- a/docs/source/es/training.md +++ b/docs/source/es/training.md @@ -81,12 +81,9 @@ Comienza cargando tu modelo y especifica el número de labels previstas. A parti >>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` - - -Verás una advertencia acerca de que algunos de los pesos pre-entrenados no están siendo utilizados y que algunos pesos están siendo inicializados al azar. No te preocupes, esto es completamente normal. -El head/cabezal pre-entrenado del modelo BERT se descarta y se sustituye por un head de clasificación inicializado aleatoriamente. Puedes aplicar fine-tuning a este nuevo head del modelo en tu tarea de clasificación de secuencias haciendo transfer learning del modelo pre-entrenado. - - +> [!TIP] +> Verás una advertencia acerca de que algunos de los pesos pre-entrenados no están siendo utilizados y que algunos pesos están siendo inicializados al azar. No te preocupes, esto es completamente normal. +> El head/cabezal pre-entrenado del modelo BERT se descarta y se sustituye por un head de clasificación inicializado aleatoriamente. Puedes aplicar fine-tuning a este nuevo head del modelo en tu tarea de clasificación de secuencias haciendo transfer learning del modelo pre-entrenado. ### Hiperparámetros de entrenamiento @@ -166,11 +163,8 @@ El [`DefaultDataCollator`] junta los tensores en un batch para que el modelo se >>> data_collator = DefaultDataCollator(return_tensors="tf") ``` - - -[`Trainer`] utiliza [`DataCollatorWithPadding`] por defecto por lo que no es necesario especificar explícitamente un intercalador de datos (data collator, en inglés). - - +> [!TIP] +> [`Trainer`] utiliza [`DataCollatorWithPadding`] por defecto por lo que no es necesario especificar explícitamente un intercalador de datos (data collator, en inglés). A continuación, convierte los datasets tokenizados en datasets de TensorFlow con el método [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.to_tf_dataset). Especifica tus entradas en `columns` y tu etiqueta en `label_cols`: @@ -309,11 +303,8 @@ Por último, especifica el `device` o entorno de ejecución para utilizar una GP >>> model.to(device) ``` - - -Consigue acceso gratuito a una GPU en la nube si es que no tienes este recurso de forma local con un notebook alojado en [Colaboratory](https://colab.research.google.com/) o [SageMaker StudioLab](https://studiolab.sagemaker.aws/). - - +> [!TIP] +> Consigue acceso gratuito a una GPU en la nube si es que no tienes este recurso de forma local con un notebook alojado en [Colaboratory](https://colab.research.google.com/) o [SageMaker StudioLab](https://studiolab.sagemaker.aws/). Genial, ¡ahora podemos entrenar! 🥳 diff --git a/docs/source/fr/autoclass_tutorial.md b/docs/source/fr/autoclass_tutorial.md index 3eaa2946d745..8cf1fbfaa8e0 100644 --- a/docs/source/fr/autoclass_tutorial.md +++ b/docs/source/fr/autoclass_tutorial.md @@ -18,11 +18,8 @@ rendered properly in your Markdown viewer. Avec autant d'architectures Transformer différentes, il peut être difficile d'en créer une pour votre ensemble de poids (aussi appelés "weights" ou "checkpoint" en anglais). 
Dans l'idée de créer une librairie facile, simple et flexible à utiliser, 🤗 Transformers fournit une `AutoClass` qui infère et charge automatiquement l'architecture correcte à partir d'un ensemble de poids donné. La fonction `from_pretrained()` vous permet de charger rapidement un modèle pré-entraîné pour n'importe quelle architecture afin que vous n'ayez pas à consacrer du temps et des ressources à l'entraînement d'un modèle à partir de zéro. Produire un tel code indépendant d'un ensemble de poids signifie que si votre code fonctionne pour un ensemble de poids, il fonctionnera avec un autre ensemble - tant qu'il a été entraîné pour une tâche similaire - même si l'architecture est différente. - - -Rappel, l'architecture fait référence au squelette du modèle et l'ensemble de poids contient les poids pour une architecture donnée. Par exemple, [BERT](https://huggingface.co/google-bert/bert-base-uncased) est une architecture, tandis que `google-bert/bert-base-uncased` est un ensemble de poids. Le terme modèle est général et peut signifier soit architecture soit ensemble de poids. - - +> [!TIP] +> Rappel, l'architecture fait référence au squelette du modèle et l'ensemble de poids contient les poids pour une architecture donnée. Par exemple, [BERT](https://huggingface.co/google-bert/bert-base-uncased) est une architecture, tandis que `google-bert/bert-base-uncased` est un ensemble de poids. Le terme modèle est général et peut signifier soit architecture soit ensemble de poids. Dans ce tutoriel, vous apprendrez à: @@ -152,12 +149,9 @@ Réutilisez facilement le même ensemble de poids pour charger une architecture >>> model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` - - -Pour les modèles PyTorch, la fonction `from_pretrained()` utilise `torch.load()` qui utilise `pickle` en interne et est connu pour être non sécurisé. En général, ne chargez jamais un modèle qui pourrait provenir d'une source non fiable, ou qui pourrait avoir été altéré. Ce risque de sécurité est partiellement atténué pour les modèles hébergés publiquement sur le Hugging Face Hub, qui sont [scannés pour les logiciels malveillants](https://huggingface.co/docs/hub/security-malware) à chaque modification. Consultez la [documentation du Hub](https://huggingface.co/docs/hub/security) pour connaître les meilleures pratiques comme la [vérification des modifications signées](https://huggingface.co/docs/hub/security-gpg#signing-commits-with-gpg) avec GPG. - -Les points de contrôle TensorFlow et Flax ne sont pas concernés, et peuvent être chargés dans des architectures PyTorch en utilisant les arguments `from_tf` et `from_flax` de la fonction `from_pretrained` pour contourner ce problème. - - +> [!WARNING] +> Pour les modèles PyTorch, la fonction `from_pretrained()` utilise `torch.load()` qui utilise `pickle` en interne et est connu pour être non sécurisé. En général, ne chargez jamais un modèle qui pourrait provenir d'une source non fiable, ou qui pourrait avoir été altéré. Ce risque de sécurité est partiellement atténué pour les modèles hébergés publiquement sur le Hugging Face Hub, qui sont [scannés pour les logiciels malveillants](https://huggingface.co/docs/hub/security-malware) à chaque modification. Consultez la [documentation du Hub](https://huggingface.co/docs/hub/security) pour connaître les meilleures pratiques comme la [vérification des modifications signées](https://huggingface.co/docs/hub/security-gpg#signing-commits-with-gpg) avec GPG. 
+> +> Les points de contrôle TensorFlow et Flax ne sont pas concernés, et peuvent être chargés dans des architectures PyTorch en utilisant les arguments `from_tf` et `from_flax` de la fonction `from_pretrained` pour contourner ce problème. En général, nous recommandons d'utiliser les classes `AutoTokenizer` et `AutoModelFor` pour charger des instances pré-entraînées de tokenizers et modèles respectivement. Cela vous permettra de charger la bonne architecture à chaque fois. Dans le prochain [tutoriel](preprocessing), vous apprenez à utiliser un tokenizer, processeur d'image, extracteur de caractéristiques et processeur pour pré-traiter un jeu de données pour le fine-tuning. diff --git a/docs/source/fr/installation.md b/docs/source/fr/installation.md index da3e585df7ae..e421d6ba0ee6 100644 --- a/docs/source/fr/installation.md +++ b/docs/source/fr/installation.md @@ -68,18 +68,15 @@ pip install 'transformers[torch]' pip install 'transformers[tf-cpu]' ``` - - -Pour les architectures mac M1 / ARM - -Vous devez installer les outils suivants avant d'installer TensorFLow 2.0 - -```bash -brew install cmake -brew install pkg-config -``` - - +> [!WARNING] +> Pour les architectures mac M1 / ARM +> +> Vous devez installer les outils suivants avant d'installer TensorFLow 2.0 +> +> ```bash +> brew install cmake +> brew install pkg-config +> ``` 🤗 Transformers et Flax : @@ -132,11 +129,8 @@ pip install -e . Ces commandes créent des liens entre le dossier où le projet a été cloné et les chemins de vos librairies Python. Python regardera maintenant dans le dossier que vous avez cloné en plus des dossiers où sont installées vos autres librairies. Par exemple, si vos librairies Python sont installées dans `~/anaconda3/envs/main/lib/python3.7/site-packages/`, Python cherchera aussi dans le dossier où vous avez cloné : `~/transformers/`. - - -Vous devez garder le dossier `transformers` si vous voulez continuer d'utiliser la librairie. - - +> [!WARNING] +> Vous devez garder le dossier `transformers` si vous voulez continuer d'utiliser la librairie. Maintenant, vous pouvez facilement mettre à jour votre clone avec la dernière version de 🤗 Transformers en utilisant la commande suivante : @@ -163,21 +157,15 @@ Les modèles pré-entraînés sont téléchargés et mis en cache localement dan 2. Variable d'environnement : `HF_HOME`. 3. Variable d'environnement : `XDG_CACHE_HOME` + `/huggingface`. - - -🤗 Transformers utilisera les variables d'environnement `PYTORCH_TRANSFORMERS_CACHE` ou `PYTORCH_PRETRAINED_BERT_CACHE` si vous utilisez une version précédente de cette librairie et avez défini ces variables d'environnement, sauf si vous spécifiez la variable d'environnement `TRANSFORMERS_CACHE`. - - +> [!TIP] +> 🤗 Transformers utilisera les variables d'environnement `PYTORCH_TRANSFORMERS_CACHE` ou `PYTORCH_PRETRAINED_BERT_CACHE` si vous utilisez une version précédente de cette librairie et avez défini ces variables d'environnement, sauf si vous spécifiez la variable d'environnement `TRANSFORMERS_CACHE`. ## Mode hors ligne 🤗 Transformers peut fonctionner dans un environnement cloisonné ou hors ligne en n'utilisant que des fichiers locaux. Définissez la variable d'environnement `HF_HUB_OFFLINE=1` pour activer ce mode. - - -Ajoutez [🤗 Datasets](https://huggingface.co/docs/datasets/) à votre processus d'entraînement hors ligne en définissant la variable d'environnement `HF_DATASETS_OFFLINE=1`. 
- - +> [!TIP] +> Ajoutez [🤗 Datasets](https://huggingface.co/docs/datasets/) à votre processus d'entraînement hors ligne en définissant la variable d'environnement `HF_DATASETS_OFFLINE=1`. ```bash HF_DATASETS_OFFLINE=1 HF_HUB_OFFLINE=1 \ @@ -251,8 +239,5 @@ Une fois que votre fichier est téléchargé et caché localement, spécifiez so >>> config = AutoConfig.from_pretrained("./your/path/bigscience_t0/config.json") ``` - - -Consultez la section [How to download files from the Hub (Comment télécharger des fichiers depuis le Hub)](https://huggingface.co/docs/hub/how-to-downstream) pour plus de détails sur le téléchargement de fichiers stockés sur le Hub. - - +> [!TIP] +> Consultez la section [How to download files from the Hub (Comment télécharger des fichiers depuis le Hub)](https://huggingface.co/docs/hub/how-to-downstream) pour plus de détails sur le téléchargement de fichiers stockés sur le Hub. diff --git a/docs/source/fr/quicktour.md b/docs/source/fr/quicktour.md index a0cf66e76dd3..c8c20c334f77 100644 --- a/docs/source/fr/quicktour.md +++ b/docs/source/fr/quicktour.md @@ -190,11 +190,8 @@ Un tokenizer peut également accepter une liste de textes, et remplir et tronque ... ) ``` - - -Consultez le tutoriel [prétraitement](./preprocessing) pour plus de détails sur la tokenisation, et sur la manière d'utiliser un [`AutoImageProcessor`], un [`AutoFeatureExtractor`] et un [`AutoProcessor`] pour prétraiter les images, l'audio et les contenus multimodaux. - - +> [!TIP] +> Consultez le tutoriel [prétraitement](./preprocessing) pour plus de détails sur la tokenisation, et sur la manière d'utiliser un [`AutoImageProcessor`], un [`AutoFeatureExtractor`] et un [`AutoProcessor`] pour prétraiter les images, l'audio et les contenus multimodaux. ### AutoModel @@ -207,11 +204,8 @@ Consultez le tutoriel [prétraitement](./preprocessing) pour plus de détails su >>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name) ``` - - -Voir le [résumé de la tâche](./task_summary) pour vérifier si elle est prise en charge par une classe [`AutoModel`]. - - +> [!TIP] +> Voir le [résumé de la tâche](./task_summary) pour vérifier si elle est prise en charge par une classe [`AutoModel`]. Maintenant, passez votre échantillon d'entrées prétraitées directement au modèle. Il vous suffit de décompresser le dictionnaire en ajoutant `**` : @@ -230,11 +224,8 @@ tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725], [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=) ``` - - -Tous les modèles 🤗 Transformers (PyTorch ou TensorFlow) produisent les tensors *avant* la fonction d'activation finale (comme softmax) car la fonction d'activation finale est souvent fusionnée avec le calcul de la perte. Les structures produites par le modèle sont des classes de données spéciales, de sorte que leurs attributs sont autocomplétés dans un environnement de développement. Les structures produites par le modèle se comportent comme un tuple ou un dictionnaire (vous pouvez les indexer avec un entier, une tranche ou une chaîne), auquel cas les attributs qui sont None sont ignorés. - - +> [!TIP] +> Tous les modèles 🤗 Transformers (PyTorch ou TensorFlow) produisent les tensors *avant* la fonction d'activation finale (comme softmax) car la fonction d'activation finale est souvent fusionnée avec le calcul de la perte. Les structures produites par le modèle sont des classes de données spéciales, de sorte que leurs attributs sont autocomplétés dans un environnement de développement. 
Les structures produites par le modèle se comportent comme un tuple ou un dictionnaire (vous pouvez les indexer avec un entier, une tranche ou une chaîne), auquel cas les attributs qui sont None sont ignorés. ### Sauvegarder un modèle @@ -370,11 +361,8 @@ Une fois que vous êtes prêt, appelez la fonction [`~Trainer.train`] pour comme >>> trainer.train() # doctest: +SKIP ``` - - -Pour les tâches - comme la traduction ou la génération de résumé - qui utilisent un modèle séquence à séquence, utilisez plutôt les classes [`Seq2SeqTrainer`] et [`Seq2SeqTrainingArguments`]. - - +> [!TIP] +> Pour les tâches - comme la traduction ou la génération de résumé - qui utilisent un modèle séquence à séquence, utilisez plutôt les classes [`Seq2SeqTrainer`] et [`Seq2SeqTrainingArguments`]. Vous pouvez personnaliser le comportement de la boucle d'apprentissage en redéfinissant les méthodes à l'intérieur de [`Trainer`]. Cela vous permet de personnaliser des caractéristiques telles que la fonction de perte, l'optimiseur et le planificateur. Consultez la documentation de [`Trainer`] pour savoir quelles méthodes peuvent être redéfinies. diff --git a/docs/source/fr/tasks_explained.md b/docs/source/fr/tasks_explained.md index 775f8f4ff7fb..d7b462bd80bd 100644 --- a/docs/source/fr/tasks_explained.md +++ b/docs/source/fr/tasks_explained.md @@ -29,11 +29,8 @@ Voici comment différents modèles résolvent des tâches spécifiques : - [GPT2](model_doc/gpt2) pour les tâches de traitement du language naturel telles que la génération de texte utilisant un décodeur - [BART](model_doc/bart) pour les tâches de traitement du language naturel telles que le résumé de texte et la traduction utilisant un encodeur-décodeur - - -Avant de poursuivre, il est utile d'avoir quelques connaissances de base sur l'architecture des Transformers. Comprendre le fonctionnement des encodeurs, des décodeurs et du mécanisme d'attention vous aidera à saisir comment les différents modèles Transformer fonctionnent. Si vous débutez ou avez besoin d'un rappel, consultez notre [cours](https://huggingface.co/course/chapter1/4?fw=pt) pour plus d'informations ! - - +> [!TIP] +> Avant de poursuivre, il est utile d'avoir quelques connaissances de base sur l'architecture des Transformers. Comprendre le fonctionnement des encodeurs, des décodeurs et du mécanisme d'attention vous aidera à saisir comment les différents modèles Transformer fonctionnent. Si vous débutez ou avez besoin d'un rappel, consultez notre [cours](https://huggingface.co/course/chapter1/4?fw=pt) pour plus d'informations ! ## Paroles et audio @@ -74,11 +71,8 @@ Il existe deux façons d'aborder les tâches de vision par ordinateur : 1. **Diviser une image en une séquence de patches** et les traiter en parallèle avec un Transformer. 2. **Utiliser un CNN moderne**, comme [ConvNeXT](model_doc/convnext), qui repose sur des couches convolutionnelles mais adopte des conceptions de réseau modernes. - - -Une troisième approche combine les Transformers avec des convolutions (par exemple, [Convolutional Vision Transformer](model_doc/cvt) ou [LeViT](model_doc/levit)). Nous ne discuterons pas de ces approches ici, car elles mélangent simplement les deux approches que nous examinons. - - +> [!TIP] +> Une troisième approche combine les Transformers avec des convolutions (par exemple, [Convolutional Vision Transformer](model_doc/cvt) ou [LeViT](model_doc/levit)). Nous ne discuterons pas de ces approches ici, car elles mélangent simplement les deux approches que nous examinons. 
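Pour fixer les idées, voici une ébauche minimale, donnée à titre purement illustratif, d'une classification d'images avec un checkpoint ViT via [`pipeline`] ; le checkpoint et l'URL de l'image sont des choix d'exemple et non des recommandations de ce guide :

```py
from transformers import pipeline

# Checkpoint ViT public choisi à titre d'exemple
classifieur = pipeline("image-classification", model="google/vit-base-patch16-224")

# L'image peut être un chemin local ou une URL (ici une image d'exemple du jeu de données COCO)
predictions = classifieur("http://images.cocodataset.org/val2017/000000039769.jpg")
print(predictions[:3])  # les trois étiquettes les plus probables avec leur score
```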
ViT et ConvNeXT sont couramment utilisés pour la classification d'images. Pour d'autres tâches de vision par ordinateur comme la détection d'objets, la segmentation et l'estimation de la profondeur, nous examinerons respectivement DETR, Mask2Former et GLPN, qui sont mieux adaptés à ces tâches. @@ -108,11 +102,8 @@ Prêt à vous essayer à la classification d'images ? Consultez notre [guide com #### CNN - - -Cette section explique brièvement les convolutions, mais il serait utile d'avoir une compréhension préalable de la façon dont elles modifient la forme et la taille d'une image. Si vous n'êtes pas familier avec les convolutions, consultez le [chapitre sur les réseaux de neurones convolutionnels](https://github.com/fastai/fastbook/blob/master/13_convolutions.ipynb) du livre fastai ! - - +> [!TIP] +> Cette section explique brièvement les convolutions, mais il serait utile d'avoir une compréhension préalable de la façon dont elles modifient la forme et la taille d'une image. Si vous n'êtes pas familier avec les convolutions, consultez le [chapitre sur les réseaux de neurones convolutionnels](https://github.com/fastai/fastbook/blob/master/13_convolutions.ipynb) du livre fastai ! [ConvNeXT](model_doc/convnext) est une architecture CNN qui adopte des conceptions de réseau modernes pour améliorer les performances. Cependant, les convolutions restent au cœur du modèle. D'un point de vue général, une [convolution](glossary#convolution) est une opération où une matrice plus petite (*noyau*) est multipliée par une petite fenêtre de pixels de l'image. Elle calcule certaines caractéristiques à partir de cette fenêtre, comme une texture particulière ou la courbure d'une ligne. Ensuite, elle se déplace vers la fenêtre suivante de pixels ; la distance parcourue par la convolution est appelée le *stride*. @@ -229,11 +220,8 @@ Pour utiliser BERT pour la réponse aux questions, ajoutez une tête de classifi Prêt à essayer la réponse aux questions ? Consultez notre [guide complet sur la réponse aux questions](tasks/question_answering) pour découvrir comment effectuer un réglagle fin (*fine-tuning*) de DistilBERT et l'utiliser pour l'inférence ! - - -💡 Une fois BERT préentraîné, il est incroyablement facile de l’adapter à diverses tâches ! Il vous suffit d’ajouter une tête spécifique au modèle préentraîné pour transformer les états cachés en la sortie souhaitée. - - +> [!TIP] +> 💡 Une fois BERT préentraîné, il est incroyablement facile de l’adapter à diverses tâches ! Il vous suffit d’ajouter une tête spécifique au modèle préentraîné pour transformer les états cachés en la sortie souhaitée. ### Génération de texte @@ -251,11 +239,8 @@ L'objectif de préentraînement de GPT-2 est basé sur la [modélisation du lang Prêt à essayer la génération de texte ? Consultez notre [guide complet sur la modélisation du langage causale](tasks/language_modeling#causal-language-modeling) pour découvrir comment effectuer un réglagle fin (*fine-tuning*) de DistilGPT-2 et l'utiliser pour l'inférence ! - - -Pour plus d'informations sur la génération de texte, consultez le guide sur les [stratégies de génération de texte](generation_strategies) ! - - +> [!TIP] +> Pour plus d'informations sur la génération de texte, consultez le guide sur les [stratégies de génération de texte](generation_strategies) ! ### Résumé de texte @@ -271,11 +256,8 @@ Les modèles encodeur-décodeur tels que [BART](model_doc/bart) et [T5](model_do Prêt à essayer le résumé ? 
Consultez notre [guide complet sur le résumé](tasks/summarization) pour apprendre à effectuer un réglage fin (*fine-tuning*) de T5 et l'utiliser pour l'inférence ! - - -Pour plus d'informations sur la génération de texte, consultez le guide sur les [stratégies de génération de texte](generation_strategies) ! - - +> [!TIP] +> Pour plus d'informations sur la génération de texte, consultez le guide sur les [stratégies de génération de texte](generation_strategies) ! ### Traduction @@ -287,8 +269,5 @@ BART a été suivi par une version multilingue, mBART, qui est spécifiquement c Prêt à essayer la traduction ? Consultez notre [guide complet sur la traduction](tasks/translation) pour apprendre à affiner T5 et l'utiliser pour l'inférence ! - - -Pour plus d'informations sur la génération de texte, consultez le guide sur les [stratégies de génération de texte](generation_strategies) ! - - +> [!TIP] +> Pour plus d'informations sur la génération de texte, consultez le guide sur les [stratégies de génération de texte](generation_strategies) ! diff --git a/docs/source/fr/tutoriel_pipeline.md b/docs/source/fr/tutoriel_pipeline.md index e5028c9d0bda..1f6a66a07455 100644 --- a/docs/source/fr/tutoriel_pipeline.md +++ b/docs/source/fr/tutoriel_pipeline.md @@ -10,11 +10,8 @@ L'objet [`pipeline`] rend simple l'utilisation de n'importe quel modèle du [Hub * Utiliser un tokenizer ou modèle spécifique. * Utiliser un [`pipeline`] pour des tâches audio, de vision et multimodales. - - -Consultez la documentation du [`pipeline`] pour une liste complète des tâches prises en charge et des paramètres disponibles. - - +> [!TIP] +> Consultez la documentation du [`pipeline`] pour une liste complète des tâches prises en charge et des paramètres disponibles. ## Utilisation du pipeline @@ -186,9 +183,8 @@ for out in pipe(KeyDataset(dataset, "audio")): ## Utilisation des pipelines pour un serveur web - -Créer un moteur d'inférence est un sujet complexe qui mérite sa propre page. - +> [!TIP] +> Créer un moteur d'inférence est un sujet complexe qui mérite sa propre page. [Lien](./pipeline_webserver) @@ -250,16 +246,13 @@ Par exemple, si vous utilisez cette [image de facture](https://huggingface.co/sp [{'score': 0.425, 'answer': 'us-001', 'start': 16, 'end': 16}] ``` - - -Pour exécuter l'exemple ci-dessus, vous devez avoir [`pytesseract`](https://pypi.org/project/pytesseract/) installé en plus de 🤗 Transformers : - -```bash -sudo apt install -y tesseract-ocr -pip install pytesseract -``` - - +> [!TIP] +> Pour exécuter l'exemple ci-dessus, vous devez avoir [`pytesseract`](https://pypi.org/project/pytesseract/) installé en plus de 🤗 Transformers : +> +> ```bash +> sudo apt install -y tesseract-ocr +> pip install pytesseract +> ``` ## Utilisation de `pipeline` sur de grands modèles avec 🤗 `accelerate` : diff --git a/docs/source/hi/pipeline_tutorial.md b/docs/source/hi/pipeline_tutorial.md index d2103643d509..96e3c597d56d 100644 --- a/docs/source/hi/pipeline_tutorial.md +++ b/docs/source/hi/pipeline_tutorial.md @@ -22,11 +22,8 @@ rendered properly in your Markdown viewer. 
* एक विशिष्ट टोकननाइज़र या मॉडल का उपयोग करें। * ऑडियो, विज़न और मल्टीमॉडल कार्यों के लिए [`pipeline`] का उपयोग करें। - - -समर्थित कार्यों और उपलब्ध मापदंडों की पूरी सूची के लिए [`pipeline`] दस्तावेज़ पर एक नज़र डालें। - - +> [!TIP] +> समर्थित कार्यों और उपलब्ध मापदंडों की पूरी सूची के लिए [`pipeline`] दस्तावेज़ पर एक नज़र डालें। ## पाइपलाइन का उपयोग @@ -216,10 +213,9 @@ for out in pipe(KeyDataset(dataset, "audio")): ## वेबसर्वर के लिए पाइपलाइनों का उपयोग करना - -एक अनुमान इंजन बनाना एक जटिल विषय है जो अपने आप में उपयुक्त है -पृष्ठ। - +> [!TIP] +> एक अनुमान इंजन बनाना एक जटिल विषय है जो अपने आप में उपयुक्त है +> पृष्ठ। [Link](./pipeline_webserver) @@ -279,16 +275,13 @@ NLP कार्यों के लिए [`pipeline`] का उपयोग [{'score': 0.425, 'answer': 'us-001', 'start': 16, 'end': 16}] ``` - - -ऊपर दिए गए उदाहरण को चलाने के लिए आपको 🤗 ट्रांसफॉर्मर के अलावा [`pytesseract`](https://pypi.org/project/pytesseract/) इंस्टॉल करना होगा: - -```bash -sudo apt install -y tesseract-ocr -pip install pytesseract -``` - - +> [!TIP] +> ऊपर दिए गए उदाहरण को चलाने के लिए आपको 🤗 ट्रांसफॉर्मर के अलावा [`pytesseract`](https://pypi.org/project/pytesseract/) इंस्टॉल करना होगा: +> +> ```bash +> sudo apt install -y tesseract-ocr +> pip install pytesseract +> ``` ## 🤗 `त्वरण` के साथ बड़े मॉडलों पर `pipeline` का उपयोग करना: diff --git a/docs/source/it/add_new_model.md b/docs/source/it/add_new_model.md index 1cd5da18c645..79b0c95efc96 100644 --- a/docs/source/it/add_new_model.md +++ b/docs/source/it/add_new_model.md @@ -644,11 +644,8 @@ voi dovrete solo completarlo. Una volta che questi tests sono OK, provate: RUN_SLOW=1 pytest -sv tests/test_modeling_brand_new_bert.py::BrandNewBertModelIntegrationTests ``` - - -Nel caso siate su Windows, sostituite `RUN_SLOW=1` con `SET RUN_SLOW=1` - - +> [!TIP] +> Nel caso siate su Windows, sostituite `RUN_SLOW=1` con `SET RUN_SLOW=1` Di seguito, tutte le features che sono utili e necessarire per *brand_new_bert* devono essere testate in test separati, contenuti in `BrandNewBertModelTester`/ `BrandNewBertModelTest`. spesso la gente si scorda questi test, ma ricordate che sono utili per: diff --git a/docs/source/it/autoclass_tutorial.md b/docs/source/it/autoclass_tutorial.md index 74587ef53c19..a4c16859d416 100644 --- a/docs/source/it/autoclass_tutorial.md +++ b/docs/source/it/autoclass_tutorial.md @@ -18,11 +18,8 @@ rendered properly in your Markdown viewer. Con così tante architetture Transformer differenti, può essere sfidante crearne una per il tuo checkpoint. Come parte della filosofia centrale di 🤗 Transformers per rendere la libreria facile, semplice e flessibile da utilizzare, una `AutoClass` inferisce e carica automaticamente l'architettura corretta da un dato checkpoint. Il metodo `from_pretrained` ti permette di caricare velocemente un modello pre-allenato per qualsiasi architettura, così non devi utilizzare tempo e risorse per allenare un modello da zero. Produrre questo codice agnostico ai checkpoint significa che se il tuo codice funziona per un checkpoint, funzionerà anche per un altro checkpoint, purché sia stato allenato per un compito simile, anche se l'architettura è differente. - - -Ricorda, con architettura ci si riferisce allo scheletro del modello e con checkpoint ai pesi di una determinata architettura. Per esempio, [BERT](https://huggingface.co/google-bert/bert-base-uncased) è un'architettura, mentre `google-bert/bert-base-uncased` è un checkpoint. Modello è un termine generale che può significare sia architettura che checkpoint. 
- - +> [!TIP] +> Ricorda, con architettura ci si riferisce allo scheletro del modello e con checkpoint ai pesi di una determinata architettura. Per esempio, [BERT](https://huggingface.co/google-bert/bert-base-uncased) è un'architettura, mentre `google-bert/bert-base-uncased` è un checkpoint. Modello è un termine generale che può significare sia architettura che checkpoint. In questo tutorial, imparerai a: diff --git a/docs/source/it/big_models.md b/docs/source/it/big_models.md index 6a5c346dec89..90fca0bb7404 100644 --- a/docs/source/it/big_models.md +++ b/docs/source/it/big_models.md @@ -25,11 +25,8 @@ in PyTorch è: I passi 1 e 2 una versione completa del modello in memoria, in molti casi non è un problema, ma se il modello inizia a pesare diversi GigaBytes, queste due copie possono sturare la nostra RAM. Ancora peggio, se stai usando `torch.distributed` per seguire l'addestramento (training) in distribuito, ogni processo caricherà il modello preaddestrato e memorizzerà queste due copie nella RAM. - - -Nota che il modello creato casualmente è inizializzato con tensori "vuoti", che occupano spazio in memoria ma senza riempirlo (quindi i valori casuali sono quelli che si trovavano in questa porzione di memoria in un determinato momento). L'inizializzazione casuale che segue la distribuzione appropriata per il tipo di modello/parametri istanziato (come la distribuzione normale per le istanze) è eseguito solo dopo il passaggio 3 sui pesi non inizializzati, per essere più rapido possibile! - - +> [!TIP] +> Nota che il modello creato casualmente è inizializzato con tensori "vuoti", che occupano spazio in memoria ma senza riempirlo (quindi i valori casuali sono quelli che si trovavano in questa porzione di memoria in un determinato momento). L'inizializzazione casuale che segue la distribuzione appropriata per il tipo di modello/parametri istanziato (come la distribuzione normale per le istanze) è eseguito solo dopo il passaggio 3 sui pesi non inizializzati, per essere più rapido possibile! In questa guida, esploreremo le soluzioni che Transformers offre per affrontare questo problema. C'è da tenere in conto che questa è un'area in cui si sta attualmente sviluppando, quindi le API spiegate qui possono variare velocemente in futuro. diff --git a/docs/source/it/create_a_model.md b/docs/source/it/create_a_model.md index 174083e73e67..f2662c45f50a 100644 --- a/docs/source/it/create_a_model.md +++ b/docs/source/it/create_a_model.md @@ -101,11 +101,8 @@ Per riutilizzare la configurazione del file, caricalo con [`~PretrainedConfig.fr >>> my_config = DistilBertConfig.from_pretrained("./your_model_save_path/my_config.json") ``` - - -Puoi anche salvare il file di configurazione come dizionario oppure come la differenza tra gli attributi della tua configurazione personalizzata e gli attributi della configurazione predefinita! Guarda la documentazione [configuration](main_classes/configuration) per più dettagli. - - +> [!TIP] +> Puoi anche salvare il file di configurazione come dizionario oppure come la differenza tra gli attributi della tua configurazione personalizzata e gli attributi della configurazione predefinita! Guarda la documentazione [configuration](main_classes/configuration) per più dettagli. ## Modello @@ -163,11 +160,8 @@ L'ultima classe base di cui hai bisogno prima di utilizzare un modello per i dat Entrambi i tokenizer supportano metodi comuni come la codifica e la decodifica, l'aggiunta di nuovi token e la gestione di token speciali. 
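A titolo puramente illustrativo, ecco uno schizzo minimale di questi metodi comuni; il testo e il token aggiunto sono solo esempi ipotetici:

```py
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

# Codifica: dal testo agli id dei token (inclusi i token speciali [CLS] e [SEP])
ids = tokenizer.encode("Do not meddle in the affairs of wizards")

# Decodifica: dagli id di nuovo al testo
testo = tokenizer.decode(ids)

# Aggiunta di nuovi token al vocabolario (token puramente ipotetico)
tokenizer.add_tokens(["<nuovo_token>"])
```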
- - -Non tutti i modelli supportano un tokenizer veloce. Dai un'occhiata a questo [tabella](index#supported-frameworks) per verificare se un modello ha il supporto per tokenizer veloce. - - +> [!WARNING] +> Non tutti i modelli supportano un tokenizer veloce. Dai un'occhiata a questo [tabella](index#supported-frameworks) per verificare se un modello ha il supporto per tokenizer veloce. Se hai addestrato il tuo tokenizer, puoi crearne uno dal tuo file *vocabolario*: @@ -193,11 +187,8 @@ Crea un tokenizer veloce con la classe [`DistilBertTokenizerFast`]: >>> fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert/distilbert-base-uncased") ``` - - -Per l'impostazione predefinita, [`AutoTokenizer`] proverà a caricare un tokenizer veloce. Puoi disabilitare questo comportamento impostando `use_fast=False` in `from_pretrained`. - - +> [!TIP] +> Per l'impostazione predefinita, [`AutoTokenizer`] proverà a caricare un tokenizer veloce. Puoi disabilitare questo comportamento impostando `use_fast=False` in `from_pretrained`. ## Estrattore Di Feature @@ -229,11 +220,8 @@ ViTFeatureExtractor { } ``` - - -Se non stai cercando alcuna personalizzazione, usa il metodo `from_pretrained` per caricare i parametri di default dell'estrattore di caratteristiche di un modello. - - +> [!TIP] +> Se non stai cercando alcuna personalizzazione, usa il metodo `from_pretrained` per caricare i parametri di default dell'estrattore di caratteristiche di un modello. Modifica uno qualsiasi dei parametri [`ViTFeatureExtractor`] per creare il tuo estrattore di caratteristiche personalizzato: diff --git a/docs/source/it/custom_models.md b/docs/source/it/custom_models.md index 5f3d4cade007..0a7718f8291a 100644 --- a/docs/source/it/custom_models.md +++ b/docs/source/it/custom_models.md @@ -183,11 +183,8 @@ Nota come, in entrambi i casi, ereditiamo da `PreTrainedModel` e chiamiamo l'ini con il metodo `config` (un po' come quando scrivi un normale `torch.nn.Module`). La riga che imposta la `config_class` non è obbligatoria, a meno che tu non voglia registrare il modello con le classi Auto (vedi l'ultima sezione). - - -Se il tuo modello è molto simile a un modello all'interno della libreria, puoi ri-usare la stessa configurazione di quel modello. - - +> [!TIP] +> Se il tuo modello è molto simile a un modello all'interno della libreria, puoi ri-usare la stessa configurazione di quel modello. Puoi fare in modo che il tuo modello restituisca in output qualunque cosa tu voglia, ma far restituire un dizionario come abbiamo fatto per `ResnetModelForImageClassification`, con la funzione di perdita inclusa quando vengono passate le labels, @@ -220,11 +217,8 @@ il codice del modello venga salvato. ## Inviare il codice all'Hub - - -Questa API è sperimentale e potrebbe avere alcuni cambiamenti nei prossimi rilasci. - - +> [!WARNING] +> Questa API è sperimentale e potrebbe avere alcuni cambiamenti nei prossimi rilasci. Innanzitutto, assicurati che il tuo modello sia completamente definito in un file `.py`. Può sfruttare import relativi ad altri file, purchè questi siano nella stessa directory (non supportiamo ancora sotto-moduli per questa funzionalità). @@ -242,12 +236,9 @@ contiene il codice di `ResnetModel` e `ResnetModelForImageClassification`. Il file `__init__.py` può essere vuoto, serve solo perchè Python capisca che `resnet_model` può essere utilizzato come un modulo. 
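A titolo puramente illustrativo, ecco uno schizzo minimale e ipotetico del pattern appena descritto: una configurazione e un modello personalizzati, definibili in un file `.py` autonomo, che restituiscono un dizionario con la perdita quando vengono passate le labels (nomi di classi e attributi sono invenzioni di esempio):

```py
import torch
from transformers import PretrainedConfig, PreTrainedModel


class MiaConfig(PretrainedConfig):
    model_type = "mio-modello"  # identificatore di esempio, puramente ipotetico

    def __init__(self, hidden_size=64, num_classi=2, **kwargs):
        self.hidden_size = hidden_size
        self.num_classi = num_classi
        super().__init__(**kwargs)


class MioModelloPerClassificazione(PreTrainedModel):
    config_class = MiaConfig  # facoltativo: serve solo per registrare il modello con le classi Auto

    def __init__(self, config):
        super().__init__(config)
        self.classificatore = torch.nn.Linear(config.hidden_size, config.num_classi)

    def forward(self, features, labels=None):
        logits = self.classificatore(features)
        if labels is not None:
            # Restituire un dizionario con la chiave "loss" rende il modello usabile direttamente con `Trainer`
            loss = torch.nn.functional.cross_entropy(logits, labels)
            return {"loss": loss, "logits": logits}
        return {"logits": logits}
```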
- - -Se stai copiando i file relativi alla modellazione della libreria, dovrai sostituire tutti gli import relativi in cima al file con import del - pacchetto `transformers`. - - +> [!WARNING] +> Se stai copiando i file relativi alla modellazione della libreria, dovrai sostituire tutti gli import relativi in cima al file con import del +> pacchetto `transformers`. Nota che puoi ri-utilizzare (o usare come sottoclassi) un modello/configurazione esistente. diff --git a/docs/source/it/debugging.md b/docs/source/it/debugging.md index 5c1dab51bd11..fde61136ada3 100644 --- a/docs/source/it/debugging.md +++ b/docs/source/it/debugging.md @@ -46,23 +46,14 @@ In questo modo si scaricano molte informazioni di debug relative a NCCL, che puo ## Rilevamento di Underflow e Overflow - +> [!TIP] +> Questa funzionalità al momento è disponibile solo per PyTorch. -Questa funzionalità al momento è disponibile solo per PyTorch. +> [!TIP] +> Per addestramento multi-GPU richiede DDP (`torch.distributed.launch`). - - - - -Per addestramento multi-GPU richiede DDP (`torch.distributed.launch`). - - - - - -Questa funzionalità può essere usata con modelli basati su `nn.Module`. - - +> [!TIP] +> Questa funzionalità può essere usata con modelli basati su `nn.Module`. Se inizi a ottenere `loss=NaN` o il modello presenta qualche altro comportamento anomalo a causa di valori `inf` o `nan` in attivazioni o nei pesi, è necessario scoprire dove si verifica il primo underflow o overflow e cosa lo ha determinato. Fortunatamente diff --git a/docs/source/it/installation.md b/docs/source/it/installation.md index a4f444c1eb0c..a98eecb8ef66 100644 --- a/docs/source/it/installation.md +++ b/docs/source/it/installation.md @@ -113,11 +113,8 @@ pip install -e . Questi comandi collegheranno la cartella in cui è stato clonato il repository e i path delle librerie Python. Python guarderà ora all'interno della cartella clonata, oltre ai normali path delle librerie. Per esempio, se i tuoi pacchetti Python sono installati tipicamente in `~/anaconda3/envs/main/lib/python3.7/site-packages/`, Python cercherà anche nella cartella clonata: `~/transformers/`. - - -Devi tenere la cartella `transformers` se vuoi continuare ad utilizzare la libreria. - - +> [!WARNING] +> Devi tenere la cartella `transformers` se vuoi continuare ad utilizzare la libreria. Ora puoi facilmente aggiornare il tuo clone all'ultima versione di 🤗 Transformers con il seguente comando: @@ -144,21 +141,15 @@ I modelli pre-allenati sono scaricati e memorizzati localmente nella cache in: ` 2. Variabile d'ambiente della shell: `HF_HOME` + `transformers/`. 3. Variabile d'ambiente della shell: `XDG_CACHE_HOME` + `/huggingface/transformers`. - - -🤗 Transformers utilizzerà le variabili d'ambiente della shell `PYTORCH_TRANSFORMERS_CACHE` o `PYTORCH_PRETRAINED_BERT_CACHE` se si proviene da un'iterazione precedente di questa libreria e sono state impostate queste variabili d'ambiente, a meno che non si specifichi la variabile d'ambiente della shell `TRANSFORMERS_CACHE`. - - +> [!TIP] +> 🤗 Transformers utilizzerà le variabili d'ambiente della shell `PYTORCH_TRANSFORMERS_CACHE` o `PYTORCH_PRETRAINED_BERT_CACHE` se si proviene da un'iterazione precedente di questa libreria e sono state impostate queste variabili d'ambiente, a meno che non si specifichi la variabile d'ambiente della shell `TRANSFORMERS_CACHE`. ## Modalità Offline 🤗 Transformers può essere eseguita in un ambiente firewalled o offline utilizzando solo file locali. 
Imposta la variabile d'ambiente `HF_HUB_OFFLINE=1` per abilitare questo comportamento. - - -Aggiungi [🤗 Datasets](https://huggingface.co/docs/datasets/) al tuo flusso di lavoro offline di training impostando la variabile d'ambiente `HF_DATASETS_OFFLINE=1`. - - +> [!TIP] +> Aggiungi [🤗 Datasets](https://huggingface.co/docs/datasets/) al tuo flusso di lavoro offline di training impostando la variabile d'ambiente `HF_DATASETS_OFFLINE=1`. Ad esempio, in genere si esegue un programma su una rete normale, protetta da firewall per le istanze esterne, con il seguente comando: @@ -232,8 +223,5 @@ Una volta che il tuo file è scaricato e salvato in cache localmente, specifica >>> config = AutoConfig.from_pretrained("./il/tuo/path/bigscience_t0/config.json") ``` - - -Fai riferimento alla sezione [How to download files from the Hub](https://huggingface.co/docs/hub/how-to-downstream) per avere maggiori dettagli su come scaricare modelli presenti sull Hub. - - +> [!TIP] +> Fai riferimento alla sezione [How to download files from the Hub](https://huggingface.co/docs/hub/how-to-downstream) per avere maggiori dettagli su come scaricare modelli presenti sull Hub. diff --git a/docs/source/it/model_sharing.md b/docs/source/it/model_sharing.md index ce06ade1fe2c..ab84f22861f9 100644 --- a/docs/source/it/model_sharing.md +++ b/docs/source/it/model_sharing.md @@ -27,11 +27,8 @@ In questo tutorial, imparerai due metodi per la condivisione di un modello train frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen> - - -Per condividere un modello con la community, hai bisogno di un account su [huggingface.co](https://huggingface.co/join). Puoi anche unirti ad un'organizzazione esistente o crearne una nuova. - - +> [!TIP] +> Per condividere un modello con la community, hai bisogno di un account su [huggingface.co](https://huggingface.co/join). Puoi anche unirti ad un'organizzazione esistente o crearne una nuova. ## Caratteristiche dei repository diff --git a/docs/source/it/perf_infer_cpu.md b/docs/source/it/perf_infer_cpu.md index 5bf48e4737d9..008757c52e95 100644 --- a/docs/source/it/perf_infer_cpu.md +++ b/docs/source/it/perf_infer_cpu.md @@ -42,14 +42,11 @@ I rilasci di IPEX seguono PyTorch, verifica i vari approcci per [IPEX installati Per abilitare JIT-mode in Trainer per evaluation e prediction, devi aggiungere `jit_mode_eval` negli argomenti di Trainer. - - -per PyTorch >= 1.14.0. JIT-mode potrebe giovare a qualsiasi modello di prediction e evaluaion visto che il dict input è supportato in jit.trace - -per PyTorch < 1.14.0. JIT-mode potrebbe giovare ai modelli il cui ordine dei parametri corrisponde all'ordine delle tuple in ingresso in jit.trace, come i modelli per question-answering. -Nel caso in cui l'ordine dei parametri seguenti non corrisponda all'ordine delle tuple in ingresso in jit.trace, come nei modelli di text-classification, jit.trace fallirà e lo cattureremo con una eccezione al fine di renderlo un fallback. Il logging è usato per notificare gli utenti. - - +> [!WARNING] +> per PyTorch >= 1.14.0. JIT-mode potrebe giovare a qualsiasi modello di prediction e evaluaion visto che il dict input è supportato in jit.trace +> +> per PyTorch < 1.14.0. JIT-mode potrebbe giovare ai modelli il cui ordine dei parametri corrisponde all'ordine delle tuple in ingresso in jit.trace, come i modelli per question-answering. 
+> Nel caso in cui l'ordine dei parametri seguenti non corrisponda all'ordine delle tuple in ingresso in jit.trace, come nei modelli di text-classification, jit.trace fallirà e lo cattureremo con una eccezione al fine di renderlo un fallback. Il logging è usato per notificare gli utenti. Trovi un esempo con caso d'uso in [Transformers question-answering](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) diff --git a/docs/source/it/perf_infer_gpu_many.md b/docs/source/it/perf_infer_gpu_many.md index b78cb34e1d6d..e5de88430d1c 100644 --- a/docs/source/it/perf_infer_gpu_many.md +++ b/docs/source/it/perf_infer_gpu_many.md @@ -17,11 +17,8 @@ rendered properly in your Markdown viewer. Questo documento contiene informazioni su come fare inferenza in maniera efficiente su GPU multiple. - - -Nota: Un setup con GPU multiple può utilizzare la maggior parte delle strategie descritte nella [sezione con GPU singola](./perf_infer_gpu_one). Tuttavia, è necessario conoscere delle tecniche semplici che possono essere utilizzate per un risultato migliore. - - +> [!TIP] +> Nota: Un setup con GPU multiple può utilizzare la maggior parte delle strategie descritte nella [sezione con GPU singola](./perf_infer_gpu_one). Tuttavia, è necessario conoscere delle tecniche semplici che possono essere utilizzate per un risultato migliore. ## `BetterTransformer` per inferenza più rapida diff --git a/docs/source/it/perf_infer_gpu_one.md b/docs/source/it/perf_infer_gpu_one.md index 5339d72d4c9d..a18345f7188c 100644 --- a/docs/source/it/perf_infer_gpu_one.md +++ b/docs/source/it/perf_infer_gpu_one.md @@ -23,11 +23,8 @@ Abbiamo recentemente integrato `BetterTransformer` per velocizzare l'inferenza s ## Integrazione di `bitsandbytes` per Int8 mixed-precision matrix decomposition - - -Nota che questa funzione può essere utilizzata anche nelle configurazioni multi GPU. - - +> [!TIP] +> Nota che questa funzione può essere utilizzata anche nelle configurazioni multi GPU. Dal paper [`LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale`](https://huggingface.co/papers/2208.07339), noi supportiamo l'integrazione di Hugging Face per tutti i modelli dell'Hub con poche righe di codice. Il metodo `nn.Linear` riduce la dimensione di 2 per i pesi `float16` e `bfloat16` e di 4 per i pesi `float32`, con un impatto quasi nullo sulla qualità, operando sugli outlier in half-precision. diff --git a/docs/source/it/perf_train_cpu_many.md b/docs/source/it/perf_train_cpu_many.md index c1f8833829ac..a88075382fcb 100644 --- a/docs/source/it/perf_train_cpu_many.md +++ b/docs/source/it/perf_train_cpu_many.md @@ -45,12 +45,9 @@ dove `{pytorch_version}` deve essere la tua versione di PyTorch, per l'stanza 1. Verifica altri approcci per [oneccl_bind_pt installation](https://github.com/intel/torch-ccl). Le versioni di oneCCL e PyTorch devono combaciare. 
- - -oneccl_bindings_for_pytorch 1.12.0 prebuilt wheel does not work with PyTorch 1.12.1 (it is for PyTorch 1.12.0) -PyTorch 1.12.1 should work with oneccl_bindings_for_pytorch 1.12.100 - - +> [!WARNING] +> oneccl_bindings_for_pytorch 1.12.0 prebuilt wheel does not work with PyTorch 1.12.1 (it is for PyTorch 1.12.0) +> PyTorch 1.12.1 should work with oneccl_bindings_for_pytorch 1.12.100 ## Intel® MPI library diff --git a/docs/source/it/perf_train_special.md b/docs/source/it/perf_train_special.md index 23ba034e8e2d..dc197a604c3c 100644 --- a/docs/source/it/perf_train_special.md +++ b/docs/source/it/perf_train_special.md @@ -16,10 +16,7 @@ rendered properly in your Markdown viewer. # Addestramento su Hardware Specializzato - - - Nota: Molte delle strategie introdotte nella [sezione sulla GPU singola](perf_train_gpu_one) (come mixed precision training o gradient accumulation) e [sezione multi-GPU](perf_train_gpu_many) sono generiche e applicabili all'addestramento di modelli in generale quindi assicurati di dargli un'occhiata prima di immergerti in questa sezione. - - +> [!TIP] +> Nota: Molte delle strategie introdotte nella [sezione sulla GPU singola](perf_train_gpu_one) (come mixed precision training o gradient accumulation) e [sezione multi-GPU](perf_train_gpu_many) sono generiche e applicabili all'addestramento di modelli in generale quindi assicurati di dargli un'occhiata prima di immergerti in questa sezione. Questo documento sarà presto completato con informazioni su come effettuare la formazione su hardware specializzato. diff --git a/docs/source/it/perf_train_tpu.md b/docs/source/it/perf_train_tpu.md index 663f83c499cb..819ea0556034 100644 --- a/docs/source/it/perf_train_tpu.md +++ b/docs/source/it/perf_train_tpu.md @@ -15,10 +15,7 @@ rendered properly in your Markdown viewer. # Addestramento su TPU - - - Nota: Molte delle strategie introdotte nella [sezione sulla GPU singola](perf_train_gpu_one) (come mixed precision training o gradient accumulation) e [sezione multi-GPU](perf_train_gpu_many) sono generiche e applicabili all'addestramento di modelli in generale quindi assicurati di dargli un'occhiata prima di immergerti in questa sezione. - - +> [!TIP] +> Nota: Molte delle strategie introdotte nella [sezione sulla GPU singola](perf_train_gpu_one) (come mixed precision training o gradient accumulation) e [sezione multi-GPU](perf_train_gpu_many) sono generiche e applicabili all'addestramento di modelli in generale quindi assicurati di dargli un'occhiata prima di immergerti in questa sezione. Questo documento sarà presto completato con informazioni su come effettuare la formazione su TPU. diff --git a/docs/source/it/pipeline_tutorial.md b/docs/source/it/pipeline_tutorial.md index 87f3166623b0..01903ae7b332 100644 --- a/docs/source/it/pipeline_tutorial.md +++ b/docs/source/it/pipeline_tutorial.md @@ -22,11 +22,8 @@ La [`pipeline`] rende semplice usare qualsiasi modello dal [Model Hub](https://h * Usare uno specifico tokenizer o modello. * Usare una [`pipeline`] per compiti che riguardano audio e video. - - -Dai un'occhiata alla documentazione di [`pipeline`] per una lista completa dei compiti supportati. - - +> [!TIP] +> Dai un'occhiata alla documentazione di [`pipeline`] per una lista completa dei compiti supportati. 
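A titolo puramente illustrativo, prima di entrare nel dettaglio, ecco uno schizzo minimale di una [`pipeline`] con un modello specifico; il checkpoint indicato è solo un esempio, non una scelta prescritta da questa guida:

```py
from transformers import pipeline

# Checkpoint indicato solo a titolo di esempio
classificatore = pipeline(
    "text-classification",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
)
print(classificatore("We are very happy to show you the 🤗 Transformers library."))
```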
## Utilizzo della Pipeline diff --git a/docs/source/it/preprocessing.md b/docs/source/it/preprocessing.md index 6d7bc5b2e3df..723116bb431c 100644 --- a/docs/source/it/preprocessing.md +++ b/docs/source/it/preprocessing.md @@ -30,11 +30,8 @@ Prima di poter usare i dati in un modello, bisogna processarli in un formato acc Lo strumento principale per processare dati testuali è un [tokenizer](main_classes/tokenizer). Un tokenizer inizia separando il testo in *tokens* secondo una serie di regole. I tokens sono convertiti in numeri, questi vengono utilizzati per costruire i tensori di input del modello. Anche altri input addizionali se richiesti dal modello vengono aggiunti dal tokenizer. - - -Se stai pensando si utilizzare un modello preaddestrato, è importante utilizzare il tokenizer preaddestrato associato. Questo assicura che il testo sia separato allo stesso modo che nel corpus usato per l'addestramento, e venga usata la stessa mappatura tokens-to-index (solitamente indicato come il *vocabolario*) come nel preaddestramento. - - +> [!TIP] +> Se stai pensando si utilizzare un modello preaddestrato, è importante utilizzare il tokenizer preaddestrato associato. Questo assicura che il testo sia separato allo stesso modo che nel corpus usato per l'addestramento, e venga usata la stessa mappatura tokens-to-index (solitamente indicato come il *vocabolario*) come nel preaddestramento. Iniziamo subito caricando un tokenizer preaddestrato con la classe [`AutoTokenizer`]. Questo scarica il *vocabolario* usato quando il modello è stato preaddestrato. diff --git a/docs/source/it/quicktour.md b/docs/source/it/quicktour.md index 06295d10275d..6d3404054c7e 100644 --- a/docs/source/it/quicktour.md +++ b/docs/source/it/quicktour.md @@ -20,12 +20,9 @@ rendered properly in your Markdown viewer. Entra in azione con 🤗 Transformers! Inizia utilizzando [`pipeline`] per un'inferenza veloce, carica un modello pre-allenato e un tokenizer con una [AutoClass](./model_doc/auto) per risolvere i tuoi compiti legati a testo, immagini o audio. - - -Tutti gli esempi di codice presenti in questa documentazione hanno un pulsante in alto a sinistra che permette di selezionare tra PyTorch e TensorFlow. Se -questo non è presente, ci si aspetta che il codice funzioni per entrambi i backend senza alcun cambiamento. - - +> [!TIP] +> Tutti gli esempi di codice presenti in questa documentazione hanno un pulsante in alto a sinistra che permette di selezionare tra PyTorch e TensorFlow. Se +> questo non è presente, ci si aspetta che il codice funzioni per entrambi i backend senza alcun cambiamento. ## Pipeline @@ -54,11 +51,8 @@ La [`pipeline`] supporta molti compiti comuni: * Classificazione di Audio (Audio Classification, in inglese): assegna un'etichetta ad un segmento di audio dato. * Riconoscimento Vocale Automatico (Automatic Speech Recognition o ASR, in inglese): trascrive il contenuto di un audio dato in un testo. - - -Per maggiori dettagli legati alla [`pipeline`] e ai compiti ad essa associati, fai riferimento alla documentazione [qui](./main_classes/pipelines). - - +> [!TIP] +> Per maggiori dettagli legati alla [`pipeline`] e ai compiti ad essa associati, fai riferimento alla documentazione [qui](./main_classes/pipelines). 
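A titolo puramente illustrativo, ecco uno schizzo minimale per uno dei compiti elencati sopra, il riconoscimento vocale automatico; il checkpoint e il percorso del file audio sono ipotesi di esempio:

```py
from transformers import pipeline

# Checkpoint di esempio per il riconoscimento vocale automatico
riconoscitore = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")

# "audio_di_esempio.flac" è un percorso ipotetico a un file audio locale
risultato = riconoscitore("audio_di_esempio.flac")
print(risultato["text"])
```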
### Utilizzo della Pipeline @@ -230,11 +224,8 @@ Leggi il tutorial sul [preprocessing](./preprocessing) per maggiori dettagli sul >>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name) ``` - - -Guarda il [task summary](./task_summary) per sapere quale classe di [`AutoModel`] utilizzare per quale compito. - - +> [!TIP] +> Guarda il [task summary](./task_summary) per sapere quale classe di [`AutoModel`] utilizzare per quale compito. Ora puoi passare il tuo lotto di input pre-processati direttamente al modello. Devi solo spacchettare il dizionario aggiungendo `**`: @@ -253,21 +244,15 @@ tensor([[0.0041, 0.0037, 0.0203, 0.2005, 0.7713], [0.3766, 0.3292, 0.1832, 0.0558, 0.0552]], grad_fn=) ``` - - -Tutti i modelli di 🤗 Transformers (PyTorch e TensorFlow) restituiscono i tensori *prima* della funzione finale -di attivazione (come la softmax) perché la funzione di attivazione finale viene spesso unita a quella di perdita. - - +> [!TIP] +> Tutti i modelli di 🤗 Transformers (PyTorch e TensorFlow) restituiscono i tensori *prima* della funzione finale +> di attivazione (come la softmax) perché la funzione di attivazione finale viene spesso unita a quella di perdita. I modelli sono [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) o [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) standard così puoi utilizzarli all'interno del tuo training loop usuale. Tuttavia, per rendere le cose più semplici, 🤗 Transformers fornisce una classe [`Trainer`] per PyTorch che aggiunge delle funzionalità per l'allenamento distribuito, precisione mista, e altro ancora. Per TensorFlow, puoi utilizzare il metodo `fit` di [Keras](https://keras.io/). Fai riferimento al [tutorial per il training](./training) per maggiori dettagli. - - -Gli output del modello di 🤗 Transformers sono delle dataclasses speciali in modo che i loro attributi vengano auto-completati all'interno di un IDE. -Gli output del modello si comportano anche come una tupla o un dizionario (ad esempio, puoi indicizzare con un intero, una slice o una stringa) nel qual caso gli attributi che sono `None` vengono ignorati. - - +> [!TIP] +> Gli output del modello di 🤗 Transformers sono delle dataclasses speciali in modo che i loro attributi vengano auto-completati all'interno di un IDE. +> Gli output del modello si comportano anche come una tupla o un dizionario (ad esempio, puoi indicizzare con un intero, una slice o una stringa) nel qual caso gli attributi che sono `None` vengono ignorati. ### Salva un modello diff --git a/docs/source/it/serialization.md b/docs/source/it/serialization.md index 53e16d927eb9..50e2c3e0b78a 100644 --- a/docs/source/it/serialization.md +++ b/docs/source/it/serialization.md @@ -251,13 +251,10 @@ Puoi notare che in questo caso, i nomi di output del modello ottimizzato sono checkpoint `distilbert/distilbert-base-uncased` precedente. Questo è previsto dal modello ottimizato visto che ha una testa di e. - - -Le caratteristiche che hanno un suffisso `wtih-past` (ad es. `causal-lm-with-past`) -corrispondono a topologie di modello con stati nascosti precalcolati (chiave e valori -nei blocchi di attenzione) che possono essere utilizzati per la decodifica autoregressiva veloce. - - +> [!TIP] +> Le caratteristiche che hanno un suffisso `wtih-past` (ad es. `causal-lm-with-past`) +> corrispondono a topologie di modello con stati nascosti precalcolati (chiave e valori +> nei blocchi di attenzione) che possono essere utilizzati per la decodifica autoregressiva veloce. 
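A titolo puramente illustrativo, una volta esportato il modello come descritto in questa pagina, è possibile eseguirlo ad esempio con ONNX Runtime; in questo schizzo il percorso `onnx/model.onnx` e il nome dell'output `last_hidden_state` sono ipotesi basate sull'esportazione di default di DistilBERT:

```py
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

# Ipotesi: il checkpoint è già stato esportato in onnx/model.onnx come descritto in questa pagina
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
session = InferenceSession("onnx/model.onnx")

inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")
outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
```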
### Esportazione di un modello per un'architettura non supportata @@ -282,12 +279,9 @@ del modello che desideri esportare: * I modelli basati su decoder ereditano da [`~onnx.config.OnnxConfigWithPast`] * I modelli encoder-decoder ereditano da[`~onnx.config.OnnxSeq2SeqConfigWithPast`] - - -Un buon modo per implementare una configurazione ONNX personalizzata è guardare l'implementazione -esistente nel file `configuration_.py` di un'architettura simile. - - +> [!TIP] +> Un buon modo per implementare una configurazione ONNX personalizzata è guardare l'implementazione +> esistente nel file `configuration_.py` di un'architettura simile. Poiché DistilBERT è un modello basato su encoder, la sua configurazione eredita da `OnnxConfig`: @@ -315,15 +309,12 @@ due input: `input_ids` e `attention_mask`. Questi inputs hanno la stessa forma d `(batch_size, sequence_length)` per questo motivo vediamo gli stessi assi usati nella configurazione. - - -Puoi notare che la proprietà `inputs` per `DistilBertOnnxConfig` restituisce un -`OrdinatoDict`. Ciò garantisce che gli input corrispondano alla loro posizione -relativa all'interno del metodo `PreTrainedModel.forward()` durante il tracciamento del grafico. -Raccomandiamo di usare un `OrderedDict` per le proprietà `inputs` e `outputs` -quando si implementano configurazioni ONNX personalizzate. - - +> [!TIP] +> Puoi notare che la proprietà `inputs` per `DistilBertOnnxConfig` restituisce un +> `OrdinatoDict`. Ciò garantisce che gli input corrispondano alla loro posizione +> relativa all'interno del metodo `PreTrainedModel.forward()` durante il tracciamento del grafico. +> Raccomandiamo di usare un `OrderedDict` per le proprietà `inputs` e `outputs` +> quando si implementano configurazioni ONNX personalizzate. Dopo aver implementato una configurazione ONNX, è possibile istanziarla fornendo alla configurazione del modello base come segue: @@ -369,13 +360,10 @@ usare: OrderedDict([('logits', {0: 'batch'})]) ``` - - -Tutte le proprietà e i metodi di base associati a [`~onnx.config.OnnxConfig`] e le -altre classi di configurazione possono essere sovrascritte se necessario. Guarda -[`BartOnnxConfig`] per un esempio avanzato. - - +> [!TIP] +> Tutte le proprietà e i metodi di base associati a [`~onnx.config.OnnxConfig`] e le +> altre classi di configurazione possono essere sovrascritte se necessario. Guarda +> [`BartOnnxConfig`] per un esempio avanzato. #### Esportazione del modello @@ -409,16 +397,13 @@ formato come segue: >>> onnx.checker.check_model(onnx_model) ``` - - -Se il tuo modello è più largo di 2 GB, vedrai che molti file aggiuntivi sono -creati durante l'esportazione. Questo è _previsto_ perché ONNX utilizza [Protocol -Buffer](https://developers.google.com/protocol-buffers/) per memorizzare il modello e -questi hanno un limite di dimensione 2 GB. Vedi la [Documentazione -ONNX](https://github.com/onnx/onnx/blob/master/docs/ExternalData.md) -per istruzioni su come caricare modelli con dati esterni. - - +> [!TIP] +> Se il tuo modello è più largo di 2 GB, vedrai che molti file aggiuntivi sono +> creati durante l'esportazione. Questo è _previsto_ perché ONNX utilizza [Protocol +> Buffer](https://developers.google.com/protocol-buffers/) per memorizzare il modello e +> questi hanno un limite di dimensione 2 GB. Vedi la [Documentazione +> ONNX](https://github.com/onnx/onnx/blob/master/docs/ExternalData.md) +> per istruzioni su come caricare modelli con dati esterni. #### Convalida degli output del modello @@ -456,14 +441,11 @@ avere un'idea di cosa è coinvolto. 
## TorchScript - - -Questo è l'inizio dei nostri esperimenti con TorchScript e stiamo ancora esplorando le sue capacità con -modelli con variable-input-size. È una nostra priorità e approfondiremo le nostre analisi nelle prossime versioni, -con più esempi di codici, un'implementazione più flessibile e benchmark che confrontano i codici basati su Python con quelli compilati con -TorchScript. - - +> [!TIP] +> Questo è l'inizio dei nostri esperimenti con TorchScript e stiamo ancora esplorando le sue capacità con +> modelli con variable-input-size. È una nostra priorità e approfondiremo le nostre analisi nelle prossime versioni, +> con più esempi di codici, un'implementazione più flessibile e benchmark che confrontano i codici basati su Python con quelli compilati con +> TorchScript. Secondo la documentazione di Pytorch: "TorchScript è un modo per creare modelli serializzabili e ottimizzabili da codice Pytorch". I due moduli di Pytorch [JIT e TRACE](https://pytorch.org/docs/stable/jit.html) consentono allo sviluppatore di esportare diff --git a/docs/source/it/training.md b/docs/source/it/training.md index 76cd41afc56d..d8d6c88257a5 100644 --- a/docs/source/it/training.md +++ b/docs/source/it/training.md @@ -81,11 +81,8 @@ Inizia caricando il tuo modello e specificando il numero di etichette (labels) a >>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` - - -Potresti vedere un warning dato che alcuni dei pesi pre-addestrati non sono stati utilizzati e altri pesi sono stati inizializzati casualmente. Non preoccuparti, è completamente normale! L'head pre-addestrata del modello BERT viene scartata e rimpiazzata da una classification head inizializzata casualmente. Farai il fine-tuning di questa nuova head del modello sul tuo compito di classificazione, trasferendogli la conoscenza del modello pre-addestrato. - - +> [!TIP] +> Potresti vedere un warning dato che alcuni dei pesi pre-addestrati non sono stati utilizzati e altri pesi sono stati inizializzati casualmente. Non preoccuparti, è completamente normale! L'head pre-addestrata del modello BERT viene scartata e rimpiazzata da una classification head inizializzata casualmente. Farai il fine-tuning di questa nuova head del modello sul tuo compito di classificazione, trasferendogli la conoscenza del modello pre-addestrato. ### Iperparametri per il training @@ -241,11 +238,8 @@ Infine specifica come `device` da usare una GPU se ne hai una. Altrimenti, l'add >>> model.to(device) ``` - - -Ottieni l'accesso gratuito a una GPU sul cloud se non ne possiedi una usando un notebook sul web come [Colaboratory](https://colab.research.google.com/) o [SageMaker StudioLab](https://studiolab.sagemaker.aws/). - - +> [!TIP] +> Ottieni l'accesso gratuito a una GPU sul cloud se non ne possiedi una usando un notebook sul web come [Colaboratory](https://colab.research.google.com/) o [SageMaker StudioLab](https://studiolab.sagemaker.aws/). Ottimo, adesso possiamo addestrare! 
🥳 diff --git a/docs/source/ja/add_new_model.md b/docs/source/ja/add_new_model.md index 000e4fd85924..cbf2cd25743d 100644 --- a/docs/source/ja/add_new_model.md +++ b/docs/source/ja/add_new_model.md @@ -628,11 +628,8 @@ pytest tests/models/brand_new_bert/test_modeling_brand_new_bert.py RUN_SLOW=1 pytest -sv tests/models/brand_new_bert/test_modeling_brand_new_bert.py::BrandNewBertModelIntegrationTests ``` - - -Windowsを使用している場合、`RUN_SLOW=1`を`SET RUN_SLOW=1`に置き換えてください。 - - +> [!TIP] +> Windowsを使用している場合、`RUN_SLOW=1`を`SET RUN_SLOW=1`に置き換えてください。 次に、*brand_new_bert*に特有のすべての特徴は、別個のテスト内で追加されるべきです。 `BrandNewBertModelTester`/`BrandNewBertModelTest`の下に。この部分はよく忘れられますが、2つの点で非常に役立ちます: diff --git a/docs/source/ja/autoclass_tutorial.md b/docs/source/ja/autoclass_tutorial.md index 6b5c552cd7b6..1d9c866d8fc2 100644 --- a/docs/source/ja/autoclass_tutorial.md +++ b/docs/source/ja/autoclass_tutorial.md @@ -23,13 +23,10 @@ http://www.apache.org/licenses/LICENSE-2.0 この種のチェックポイントに依存しないコードを生成することは、 コードが1つのチェックポイントで動作すれば、アーキテクチャが異なっていても、同じタスクに向けてトレーニングされた場合は別のチェックポイントでも動作することを意味します。 - - -アーキテクチャはモデルの骨格を指し、チェックポイントは特定のアーキテクチャの重みです。 -たとえば、[BERT](https://huggingface.co/google-bert/bert-base-uncased)はアーキテクチャであり、`google-bert/bert-base-uncased`はチェックポイントです。 -モデルはアーキテクチャまたはチェックポイントのどちらを指す一般的な用語です。 - - +> [!TIP] +> アーキテクチャはモデルの骨格を指し、チェックポイントは特定のアーキテクチャの重みです。 +> たとえば、[BERT](https://huggingface.co/google-bert/bert-base-uncased)はアーキテクチャであり、`google-bert/bert-base-uncased`はチェックポイントです。 +> モデルはアーキテクチャまたはチェックポイントのどちらを指す一般的な用語です。 このチュートリアルでは、以下を学習します: @@ -119,16 +116,13 @@ http://www.apache.org/licenses/LICENSE-2.0 >>> model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` - - -PyTorchモデルの場合、 `from_pretrained()`メソッドは内部で`torch.load()`を使用し、内部的には`pickle`を使用しており、セキュリティの問題が知られています。 -一般的には、信頼性のないソースから取得した可能性があるモデルや改ざんされた可能性のあるモデルをロードしないでください。 -このセキュリティリスクは、`Hugging Face Hub`でホストされている公開モデルに対して部分的に緩和されており、各コミットでマルウェアのスキャンが行われています。 -GPGを使用した署名済みコミットの検証などのベストプラクティスについては、Hubのドキュメンテーションを参照してください。 - -TensorFlowおよびFlaxのチェックポイントには影響がなく、`from_pretrained`メソッドの`from_tf`および`from_flax`引数を使用してPyTorchアーキテクチャ内でロードできます。 - - +> [!WARNING] +> PyTorchモデルの場合、 `from_pretrained()`メソッドは内部で`torch.load()`を使用し、内部的には`pickle`を使用しており、セキュリティの問題が知られています。 +> 一般的には、信頼性のないソースから取得した可能性があるモデルや改ざんされた可能性のあるモデルをロードしないでください。 +> このセキュリティリスクは、`Hugging Face Hub`でホストされている公開モデルに対して部分的に緩和されており、各コミットでマルウェアのスキャンが行われています。 +> GPGを使用した署名済みコミットの検証などのベストプラクティスについては、Hubのドキュメンテーションを参照してください。 +> +> TensorFlowおよびFlaxのチェックポイントには影響がなく、`from_pretrained`メソッドの`from_tf`および`from_flax`引数を使用してPyTorchアーキテクチャ内でロードできます。 一般的に、事前学習済みモデルのインスタンスをロードするために`AutoTokenizer`クラスと`AutoModelFor`クラスの使用をお勧めします。 これにより、常に正しいアーキテクチャをロードできます。 diff --git a/docs/source/ja/big_models.md b/docs/source/ja/big_models.md index 78852dc4374c..69892d5a6f95 100644 --- a/docs/source/ja/big_models.md +++ b/docs/source/ja/big_models.md @@ -24,11 +24,8 @@ rendered properly in your Markdown viewer. ステップ1と2の両方がメモリにモデルの完全なバージョンを必要とし、ほとんどの場合は問題ありませんが、モデルのサイズが数ギガバイトになると、これらの2つのコピーをRAMから排除することができなくなる可能性があります。さらに悪いことに、分散トレーニングを実行するために`torch.distributed`を使用している場合、各プロセスは事前学習済みモデルをロードし、これらの2つのコピーをRAMに保存します。 - - -ランダムに作成されたモデルは、メモリ内に「空の」テンソルで初期化されます。これらのランダムな値は、メモリの特定のチャンクにあったものを使用します(したがって、ランダムな値はその時点でのメモリチャンク内の値です)。モデル/パラメータの種類に適した分布(たとえば、正規分布)に従うランダムな初期化は、ステップ3で初期化されていない重みに対して、できるだけ高速に実行されます! - - +> [!TIP] +> ランダムに作成されたモデルは、メモリ内に「空の」テンソルで初期化されます。これらのランダムな値は、メモリの特定のチャンクにあったものを使用します(したがって、ランダムな値はその時点でのメモリチャンク内の値です)。モデル/パラメータの種類に適した分布(たとえば、正規分布)に従うランダムな初期化は、ステップ3で初期化されていない重みに対して、できるだけ高速に実行されます! 
このガイドでは、Transformersがこの問題に対処するために提供するソリューションを探ります。なお、これは現在も開発が進行中の分野であり、将来、ここで説明されているAPIがわずかに変更される可能性があることに注意してください。 diff --git a/docs/source/ja/create_a_model.md b/docs/source/ja/create_a_model.md index d708070c3daf..3987673ef6d0 100644 --- a/docs/source/ja/create_a_model.md +++ b/docs/source/ja/create_a_model.md @@ -104,11 +104,8 @@ Once you are satisfied with your model configuration, you can save it with [`Pre >>> my_config = DistilBertConfig.from_pretrained("./your_model_save_path/config.json") ``` - - -カスタム構成ファイルを辞書として保存することも、カスタム構成属性とデフォルトの構成属性の違いだけを保存することもできます!詳細については[configuration](main_classes/configuration)のドキュメンテーションをご覧ください。 - - +> [!TIP] +> カスタム構成ファイルを辞書として保存することも、カスタム構成属性とデフォルトの構成属性の違いだけを保存することもできます!詳細については[configuration](main_classes/configuration)のドキュメンテーションをご覧ください。 ## Model @@ -179,12 +176,9 @@ Once you are satisfied with your model configuration, you can save it with [`Pre 両方のトークナイザは、エンコードとデコード、新しいトークンの追加、特別なトークンの管理など、共通のメソッドをサポートしています。 - - -すべてのモデルが高速なトークナイザをサポートしているわけではありません。 -モデルが高速なトークナイザをサポートしているかどうかを確認するには、この[表](index#supported-frameworks)をご覧ください。 - - +> [!WARNING] +> すべてのモデルが高速なトークナイザをサポートしているわけではありません。 +> モデルが高速なトークナイザをサポートしているかどうかを確認するには、この[表](index#supported-frameworks)をご覧ください。 独自のトークナイザをトレーニングした場合、*ボキャブラリー*ファイルからトークナイザを作成できます。 @@ -213,11 +207,8 @@ Once you are satisfied with your model configuration, you can save it with [`Pre >>> fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert/distilbert-base-uncased") ``` - - -デフォルトでは、[`AutoTokenizer`]は高速なトークナイザを読み込もうとします。`from_pretrained`内で`use_fast=False`を設定することで、この動作を無効にすることができます。 - - +> [!TIP] +> デフォルトでは、[`AutoTokenizer`]は高速なトークナイザを読み込もうとします。`from_pretrained`内で`use_fast=False`を設定することで、この動作を無効にすることができます。 ## Image Processor @@ -250,11 +241,8 @@ ViTImageProcessor { } ``` - - -カスタマイズを必要としない場合、モデルのデフォルトの画像プロセッサパラメータをロードするには、単純に`from_pretrained`メソッドを使用してください。 - - +> [!TIP] +> カスタマイズを必要としない場合、モデルのデフォルトの画像プロセッサパラメータをロードするには、単純に`from_pretrained`メソッドを使用してください。 [`ViTImageProcessor`]のパラメータを変更して、カスタムの画像プロセッサを作成できます: @@ -306,11 +294,8 @@ Wav2Vec2FeatureExtractor { } ``` - - -カスタマイズを行わない場合、モデルのデフォルトの特徴抽出器パラメーターをロードするには、単に `from_pretrained` メソッドを使用してください。 - - +> [!TIP] +> カスタマイズを行わない場合、モデルのデフォルトの特徴抽出器パラメーターをロードするには、単に `from_pretrained` メソッドを使用してください。 [`Wav2Vec2FeatureExtractor`] のパラメーターを変更して、カスタム特徴抽出器を作成できます: diff --git a/docs/source/ja/custom_models.md b/docs/source/ja/custom_models.md index 737f5fd36d03..02767d9d9952 100644 --- a/docs/source/ja/custom_models.md +++ b/docs/source/ja/custom_models.md @@ -169,11 +169,8 @@ class ResnetModelForImageClassification(PreTrainedModel): 両方の場合、`PreTrainedModel`から継承し、`config`を使用してスーパークラスの初期化を呼び出します(通常の`torch.nn.Module`を書くときのような感じです)。 `config_class`を設定する行は必須ではありませんが、(最後のセクションを参照)、モデルを自動クラスに登録したい場合に使用できます。 - - -モデルがライブラリ内のモデルと非常に似ている場合、このモデルと同じ構成を再利用できます。 - - +> [!TIP] +> モデルがライブラリ内のモデルと非常に似ている場合、このモデルと同じ構成を再利用できます。 モデルが返す内容は何でも構いませんが、ラベルが渡されるときに損失を含む辞書を返す(`ResnetModelForImageClassification`のように行ったもの)と、 モデルを[`Trainer`]クラス内で直接使用できるようになります。独自のトレーニングループまたは他のライブラリを使用する予定である限り、 @@ -207,11 +204,8 @@ resnet50d.model.load_state_dict(pretrained_model.state_dict()) ## Sending the code to the Hub - - -このAPIは実験的であり、次のリリースでわずかな変更があるかもしれません。 - - +> [!WARNING] +> このAPIは実験的であり、次のリリースでわずかな変更があるかもしれません。 まず、モデルが`.py`ファイルに完全に定義されていることを確認してください。 ファイルは相対インポートを他のファイルに依存できますが、すべてのファイルが同じディレクトリにある限り(まだこの機能ではサブモジュールはサポートしていません)、問題ありません。 @@ -229,11 +223,8 @@ resnet50d.model.load_state_dict(pretrained_model.state_dict()) `__init__.py`は空であっても問題ありません。Pythonが`resnet_model`をモジュールとして検出できるようにするために存在します。 - - 
-ライブラリからモデリングファイルをコピーする場合、ファイルの先頭にあるすべての相対インポートを`transformers`パッケージからインポートに置き換える必要があります。 - - +> [!WARNING] +> ライブラリからモデリングファイルをコピーする場合、ファイルの先頭にあるすべての相対インポートを`transformers`パッケージからインポートに置き換える必要があります。 既存の設定やモデルを再利用(またはサブクラス化)できることに注意してください。 diff --git a/docs/source/ja/generation_strategies.md b/docs/source/ja/generation_strategies.md index 45eec30c0765..80552839b8c6 100644 --- a/docs/source/ja/generation_strategies.md +++ b/docs/source/ja/generation_strategies.md @@ -120,11 +120,8 @@ GenerationConfig { `generate()` は、その `streamer` 入力を介してストリーミングをサポートしています。`streamer` 入力は、次のメソッドを持つクラスのインスタンスと互換性があります:`put()` と `end()`。内部的には、`put()` は新しいトークンをプッシュするために使用され、`end()` はテキスト生成の終了をフラグ付けするために使用されます。 - - -ストリーマークラスのAPIはまだ開発中であり、将来変更される可能性があります。 - - +> [!WARNING] +> ストリーマークラスのAPIはまだ開発中であり、将来変更される可能性があります。 実際には、さまざまな目的に対して独自のストリーミングクラスを作成できます!また、使用できる基本的なストリーミングクラスも用意されています。例えば、[`TextStreamer`] クラスを使用して、`generate()` の出力を画面に単語ごとにストリームすることができます: diff --git a/docs/source/ja/glossary.md b/docs/source/ja/glossary.md index 775bffdd0c69..60b575bc250a 100644 --- a/docs/source/ja/glossary.md +++ b/docs/source/ja/glossary.md @@ -260,11 +260,8 @@ The encoded versions have different lengths: - 物体検出モデルの場合([`DetrForObjectDetection`])、モデルは各個々の画像の予測ラベルと境界ボックスの数に対応する `class_labels` と `boxes` キーを持つ辞書のリストを期待します。 - 自動音声認識モデルの場合([`Wav2Vec2ForCTC`])、モデルは次元が `(batch_size, target_length)` のテンソルを期待し、各値が各個々のトークンの予測ラベルに対応します。 - - -各モデルのラベルは異なる場合があるため、常に各モデルのドキュメントを確認して、それらの特定のラベルに関する詳細情報を確認してください! - - +> [!TIP] +> 各モデルのラベルは異なる場合があるため、常に各モデルのドキュメントを確認して、それらの特定のラベルに関する詳細情報を確認してください! ベースモデル([`BertModel`])はラベルを受け入れません。これらはベースのトランスフォーマーモデルであり、単に特徴を出力します。 diff --git a/docs/source/ja/installation.md b/docs/source/ja/installation.md index cb4919caf505..774c359c221f 100644 --- a/docs/source/ja/installation.md +++ b/docs/source/ja/installation.md @@ -118,11 +118,8 @@ pip install -e . 上記のコマンドは、レポジトリをクローンしたフォルダとPythonのライブラリをパスをリンクします。Pythonは通常のライブラリパスに加えて、あなたがクローンしたフォルダの中も見るようになります。例えば、Pythonパッケージが通常、`~/anaconda3/envs/main/lib/python3.7/site-packages/`にインストールされている場合、Pythonはクローンしたフォルダも検索するようになります: `~/transformers/`. - - -ライブラリーを使い続けたい場合は、transformersフォルダーを保持しつづける必要があります。 - - +> [!WARNING] +> ライブラリーを使い続けたい場合は、transformersフォルダーを保持しつづける必要があります。 これで、次のコマンドで簡単にクローンを🤗 Transformersの最新版に更新できます: @@ -149,21 +146,15 @@ conda install conda-forge::transformers 2. シェル環境変数: `HF_HOME`. 3. シェル環境変数: `XDG_CACHE_HOME` + `/huggingface`. 
- - -もし、以前のバージョンのライブラリを使用していた人で、`PYTORCH_TRANSFORMERS_CACHE`または`PYTORCH_PRETRAINED_BERT_CACHE`を設定していた場合、シェル環境変数`TRANSFORMERS_CACHE`を指定しない限り🤗 Transformersはこれらのシェル環境変数を使用します。 - - +> [!TIP] +> もし、以前のバージョンのライブラリを使用していた人で、`PYTORCH_TRANSFORMERS_CACHE`または`PYTORCH_PRETRAINED_BERT_CACHE`を設定していた場合、シェル環境変数`TRANSFORMERS_CACHE`を指定しない限り🤗 Transformersはこれらのシェル環境変数を使用します。 ## オフラインモード 🤗 Transformersはローカルファイルのみを使用することでファイアウォールやオフラインの環境でも動作させることができます。この動作を有効にするためには、環境変数`HF_HUB_OFFLINE=1`を設定します。 - - -環境変数`HF_DATASETS_OFFLINE=1`を設定し、オフライントレーニングワークフローに[🤗 Datasets](https://huggingface.co/docs/datasets/)を追加します。 - - +> [!TIP] +> 環境変数`HF_DATASETS_OFFLINE=1`を設定し、オフライントレーニングワークフローに[🤗 Datasets](https://huggingface.co/docs/datasets/)を追加します。 例えば、外部インスタンスに対してファイアウォールで保護された通常のネットワーク上でプログラムを実行する場合、通常以下のようなコマンドで実行することになります: @@ -237,8 +228,5 @@ python examples/pytorch/translation/run_translation.py --model_name_or_path goog >>> config = AutoConfig.from_pretrained("./your/path/bigscience_t0/config.json") ``` - - -Hubに保存されているファイルをダウンロードする方法の詳細については、[How to download files from the Hub](https://huggingface.co/docs/hub/how-to-downstream)セクションを参照してください。 - - +> [!TIP] +> Hubに保存されているファイルをダウンロードする方法の詳細については、[How to download files from the Hub](https://huggingface.co/docs/hub/how-to-downstream)セクションを参照してください。 diff --git a/docs/source/ja/llm_tutorial.md b/docs/source/ja/llm_tutorial.md index 671382f8c138..5a1acc4cc7b1 100644 --- a/docs/source/ja/llm_tutorial.md +++ b/docs/source/ja/llm_tutorial.md @@ -69,11 +69,8 @@ LLM(Language Model)による自己回帰生成の重要な側面の1つは コードについて話しましょう! - - -基本的なLLMの使用に興味がある場合、高レベルの [`Pipeline`](pipeline_tutorial) インターフェースが良い出発点です。ただし、LLMはしばしば量子化やトークン選択ステップの細かい制御などの高度な機能が必要であり、これは [`~generation.GenerationMixin.generate`] を介して最良に行われます。LLMとの自己回帰生成はリソースが多く必要であり、適切なスループットのためにGPUで実行する必要があります。 - - +> [!TIP] +> 基本的なLLMの使用に興味がある場合、高レベルの [`Pipeline`](pipeline_tutorial) インターフェースが良い出発点です。ただし、LLMはしばしば量子化やトークン選択ステップの細かい制御などの高度な機能が必要であり、これは [`~generation.GenerationMixin.generate`] を介して最良に行われます。LLMとの自己回帰生成はリソースが多く必要であり、適切なスループットのためにGPUで実行する必要があります。 まず、モデルを読み込む必要があります。 diff --git a/docs/source/ja/main_classes/deepspeed.md b/docs/source/ja/main_classes/deepspeed.md index affb6c0a724c..0995db184dcb 100644 --- a/docs/source/ja/main_classes/deepspeed.md +++ b/docs/source/ja/main_classes/deepspeed.md @@ -589,11 +589,8 @@ TrainingArguments(..., deepspeed=ds_config_dict) ### Shared Configuration - - -このセクションは必読です - - +> [!WARNING] +> このセクションは必読です [`Trainer`] と DeepSpeed の両方が正しく機能するには、いくつかの設定値が必要です。 したがって、検出が困難なエラーにつながる可能性のある定義の競合を防ぐために、それらを構成することにしました。 @@ -1458,15 +1455,12 @@ bf16 は fp32 と同じダイナミック レンジを備えているため、 } ``` - - -`deepspeed==0.6.0`の時点では、bf16 サポートは新しく実験的なものです。 - -bf16 が有効な状態で [勾配累積](#gradient-accumulation) を使用する場合は、bf16 で勾配が累積されることに注意する必要があります。この形式の精度が低いため、これは希望どおりではない可能性があります。損失のある蓄積につながります。 - -この問題を修正し、より高精度の `dtype` (fp16 または fp32) を使用するオプションを提供するための作業が行われています。 - - +> [!TIP] +> `deepspeed==0.6.0`の時点では、bf16 サポートは新しく実験的なものです。 +> +> bf16 が有効な状態で [勾配累積](#gradient-accumulation) を使用する場合は、bf16 で勾配が累積されることに注意する必要があります。この形式の精度が低いため、これは希望どおりではない可能性があります。損失のある蓄積につながります。 +> +> この問題を修正し、より高精度の `dtype` (fp16 または fp32) を使用するオプションを提供するための作業が行われています。 ### NCCL Collectives @@ -1625,15 +1619,11 @@ trainer.deepspeed.save_checkpoint(checkpoint_dir) fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir) ``` - - -`load_state_dict_from_zero_checkpoint` が実行されると、`model` はもはや使用できなくなることに注意してください。 -同じアプリケーションの DeepSpeed コンテキスト。つまり、deepspeed エンジンを再初期化する必要があります。 -`model.load_state_dict(state_dict)` はそこからすべての DeepSpeed 
マジックを削除します。したがって、これは最後にのみ実行してください -トレーニングの様子。 - - - +> [!TIP] +> `load_state_dict_from_zero_checkpoint` が実行されると、`model` はもはや使用できなくなることに注意してください。 +> 同じアプリケーションの DeepSpeed コンテキスト。つまり、deepspeed エンジンを再初期化する必要があります。 +> `model.load_state_dict(state_dict)` はそこからすべての DeepSpeed マジックを削除します。したがって、これは最後にのみ実行してください +> トレーニングの様子。 もちろん、class:*~transformers.Trainer* を使用する必要はなく、上記の例を独自のものに調整することができます。 トレーナー。 diff --git a/docs/source/ja/main_classes/output.md b/docs/source/ja/main_classes/output.md index b42ed844c65d..b732a6f12247 100644 --- a/docs/source/ja/main_classes/output.md +++ b/docs/source/ja/main_classes/output.md @@ -40,12 +40,9 @@ outputs = model(**inputs, labels=labels) `output_hidden_states=True`や`output_attentions=True`を渡していないので、`hidden_states`と`attentions`はない。 `output_attentions=True`を渡さなかったからだ。 - - -`output_hidden_states=True`を渡すと、`outputs.hidden_states[-1]`が `outputs.last_hidden_states` と正確に一致することを期待するかもしれない。 -しかし、必ずしもそうなるとは限りません。モデルによっては、最後に隠された状態が返されたときに、正規化やその後の処理を適用するものもあります。 - - +> [!TIP] +> `output_hidden_states=True`を渡すと、`outputs.hidden_states[-1]`が `outputs.last_hidden_states` と正確に一致することを期待するかもしれない。 +> しかし、必ずしもそうなるとは限りません。モデルによっては、最後に隠された状態が返されたときに、正規化やその後の処理を適用するものもあります。 通常と同じように各属性にアクセスできます。その属性がモデルから返されなかった場合は、 diff --git a/docs/source/ja/main_classes/pipelines.md b/docs/source/ja/main_classes/pipelines.md index 3980becebbde..c32b805d3b9d 100644 --- a/docs/source/ja/main_classes/pipelines.md +++ b/docs/source/ja/main_classes/pipelines.md @@ -128,16 +128,11 @@ for out in pipe(KeyDataset(dataset, "text"), batch_size=8, truncation="only_firs # as batches to the model ``` - - - -ただし、これによってパフォーマンスが自動的に向上するわけではありません。状況に応じて、10 倍の高速化または 5 倍の低速化のいずれかになります。 -ハードウェア、データ、使用されている実際のモデルについて。 - -主に高速化である例: - - - +> [!WARNING] +> ただし、これによってパフォーマンスが自動的に向上するわけではありません。状況に応じて、10 倍の高速化または 5 倍の低速化のいずれかになります。 +> ハードウェア、データ、使用されている実際のモデルについて。 +> +> 主に高速化である例: ```python from transformers import pipeline diff --git a/docs/source/ja/main_classes/quantization.md b/docs/source/ja/main_classes/quantization.md index d0f0d2c8ae32..960ff466e86b 100644 --- a/docs/source/ja/main_classes/quantization.md +++ b/docs/source/ja/main_classes/quantization.md @@ -87,9 +87,8 @@ model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quanti ディスク オフロードはサポートされていないことに注意してください。さらに、データセットが原因でメモリが不足している場合は、`from_pretained` で `max_memory` を渡す必要がある場合があります。 `device_map`と`max_memory`の詳細については、この [ガイド](https://huggingface.co/docs/accelerate/usage_guides/big_modeling#designing-a-device-map) を参照してください。 - -GPTQ 量子化は、現時点ではテキスト モデルでのみ機能します。さらに、量子化プロセスはハードウェアによっては長時間かかる場合があります (NVIDIA A100 を使用した場合、175B モデル = 4 gpu 時間)。モデルの GPTQ 量子化バージョンが存在しない場合は、ハブで確認してください。そうでない場合は、github で要求を送信できます。 - +> [!WARNING] +> GPTQ 量子化は、現時点ではテキスト モデルでのみ機能します。さらに、量子化プロセスはハードウェアによっては長時間かかる場合があります (NVIDIA A100 を使用した場合、175B モデル = 4 gpu 時間)。モデルの GPTQ 量子化バージョンが存在しない場合は、ハブで確認してください。そうでない場合は、github で要求を送信できます。 ### Push quantized model to 🤗 Hub @@ -233,11 +232,8 @@ tokenizer = AutoTokenizer.from_pretrained(model_id) model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_4bit=True) ``` - - -モデルが 4 ビットでロードされると、現時点では量子化された重みをハブにプッシュすることはできないことに注意してください。 4 ビットの重みはまだサポートされていないため、トレーニングできないことにも注意してください。ただし、4 ビット モデルを使用して追加のパラメーターをトレーニングすることもできます。これについては次のセクションで説明します。 - - +> [!WARNING] +> モデルが 4 ビットでロードされると、現時点では量子化された重みをハブにプッシュすることはできないことに注意してください。 4 ビットの重みはまだサポートされていないため、トレーニングできないことにも注意してください。ただし、4 ビット モデルを使用して追加のパラメーターをトレーニングすることもできます。これについては次のセクションで説明します。 ### Load a large model in 8bit @@ -263,11 +259,9 @@ 
print(model.get_memory_footprint()) この統合により、大きなモデルを小さなデバイスにロードし、問題なく実行できるようになりました。 - -モデルが 8 ビットでロードされると、最新の `transformers`と`bitsandbytes`を使用する場合を除き、量子化された重みをハブにプッシュすることは現在不可能であることに注意してください。 8 ビットの重みはまだサポートされていないため、トレーニングできないことにも注意してください。ただし、8 ビット モデルを使用して追加のパラメーターをトレーニングすることもできます。これについては次のセクションで説明します。 -また、`device_map` はオプションですが、利用可能なリソース上でモデルを効率的にディスパッチするため、推論には `device_map = 'auto'` を設定することが推奨されます。 - - +> [!WARNING] +> モデルが 8 ビットでロードされると、最新の `transformers`と`bitsandbytes`を使用する場合を除き、量子化された重みをハブにプッシュすることは現在不可能であることに注意してください。 8 ビットの重みはまだサポートされていないため、トレーニングできないことにも注意してください。ただし、8 ビット モデルを使用して追加のパラメーターをトレーニングすることもできます。これについては次のセクションで説明します。 +> また、`device_map` はオプションですが、利用可能なリソース上でモデルを効率的にディスパッチするため、推論には `device_map = 'auto'` を設定することが推奨されます。 #### Advanced use cases @@ -329,11 +323,8 @@ tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m") model.push_to_hub("bloom-560m-8bit") ``` - - -大規模なモデルでは、ハブ上で 8 ビット モデルをプッシュすることが強く推奨されます。これにより、コミュニティはメモリ フットプリントの削減と、たとえば Google Colab での大規模なモデルの読み込みによる恩恵を受けることができます。 - - +> [!WARNING] +> 大規模なモデルでは、ハブ上で 8 ビット モデルをプッシュすることが強く推奨されます。これにより、コミュニティはメモリ フットプリントの削減と、たとえば Google Colab での大規模なモデルの読み込みによる恩恵を受けることができます。 ### Load a quantized model from the 🤗 Hub diff --git a/docs/source/ja/main_classes/trainer.md b/docs/source/ja/main_classes/trainer.md index e6e6e28d308b..6f458e0f4839 100644 --- a/docs/source/ja/main_classes/trainer.md +++ b/docs/source/ja/main_classes/trainer.md @@ -39,17 +39,14 @@ rendered properly in your Markdown viewer. - **evaluate** -- 評価ループを実行し、メトリクスを返します。 - **predict** -- テスト セットの予測 (ラベルが使用可能な場合はメトリクスも含む) を返します。 - - -[`Trainer`] クラスは 🤗 Transformers モデル用に最適化されており、驚くべき動作をする可能性があります -他の機種で使用する場合。独自のモデルで使用する場合は、次の点を確認してください。 - -- モデルは常に [`~utils.ModelOutput`] のタプルまたはサブクラスを返します。 -- `labels` 引数が指定され、その損失が最初の値として返される場合、モデルは損失を計算できます。 - タプルの要素 (モデルがタプルを返す場合) -- モデルは複数のラベル引数を受け入れることができます ([`TrainingArguments`] で `label_names` を使用して、その名前を [`Trainer`] に示します) が、それらのいずれにも `"label"` という名前を付ける必要はありません。 - - +> [!WARNING] +> [`Trainer`] クラスは 🤗 Transformers モデル用に最適化されており、驚くべき動作をする可能性があります +> 他の機種で使用する場合。独自のモデルで使用する場合は、次の点を確認してください。 +> +> - モデルは常に [`~utils.ModelOutput`] のタプルまたはサブクラスを返します。 +> - `labels` 引数が指定され、その損失が最初の値として返される場合、モデルは損失を計算できます。 +> タプルの要素 (モデルがタプルを返す場合) +> - モデルは複数のラベル引数を受け入れることができます ([`TrainingArguments`] で `label_names` を使用して、その名前を [`Trainer`] に示します) が、それらのいずれにも `"label"` という名前を付ける必要はありません。 以下は、加重損失を使用するように [`Trainer`] をカスタマイズする方法の例です (不均衡なトレーニング セットがある場合に役立ちます)。 diff --git a/docs/source/ja/model_doc/altclip.md b/docs/source/ja/model_doc/altclip.md index fe721d29bfe5..cffb85def9b4 100644 --- a/docs/source/ja/model_doc/altclip.md +++ b/docs/source/ja/model_doc/altclip.md @@ -56,11 +56,8 @@ Transformerエンコーダーに画像を与えるには、各画像を固定サ >>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities ``` - - -このモデルは`CLIPModel`をベースにしており、オリジナルの[CLIP](clip)と同じように使用してください。 - - +> [!TIP] +> このモデルは`CLIPModel`をベースにしており、オリジナルの[CLIP](clip)と同じように使用してください。 ## AltCLIPConfig diff --git a/docs/source/ja/model_doc/auto.md b/docs/source/ja/model_doc/auto.md index 1a36d2c9bb12..8ae526f6cdbe 100644 --- a/docs/source/ja/model_doc/auto.md +++ b/docs/source/ja/model_doc/auto.md @@ -41,13 +41,10 @@ AutoModel.register(NewModelConfig, NewModel) その後、通常どおりauto classesを使用することができるようになります! 
- - -あなたの`NewModelConfig`が[`~transformers.PretrainedConfig`]のサブクラスである場合、その`model_type`属性がコンフィグを登録するときに使用するキー(ここでは`"new-model"`)と同じに設定されていることを確認してください。 - -同様に、あなたの`NewModel`が[`PreTrainedModel`]のサブクラスである場合、その`config_class`属性がモデルを登録する際に使用するクラス(ここでは`NewModelConfig`)と同じに設定されていることを確認してください。 - - +> [!WARNING] +> あなたの`NewModelConfig`が[`~transformers.PretrainedConfig`]のサブクラスである場合、その`model_type`属性がコンフィグを登録するときに使用するキー(ここでは`"new-model"`)と同じに設定されていることを確認してください。 +> +> 同様に、あなたの`NewModel`が[`PreTrainedModel`]のサブクラスである場合、その`config_class`属性がモデルを登録する際に使用するクラス(ここでは`NewModelConfig`)と同じに設定されていることを確認してください。 ## AutoConfig diff --git a/docs/source/ja/model_doc/barthez.md b/docs/source/ja/model_doc/barthez.md index 5668772c2636..e92de630d66f 100644 --- a/docs/source/ja/model_doc/barthez.md +++ b/docs/source/ja/model_doc/barthez.md @@ -38,12 +38,9 @@ BARTHez のコーパス上で多言語 BART を事前訓練し、結果として このモデルは [moussakam](https://huggingface.co/moussakam) によって寄稿されました。著者のコードは[ここ](https://github.com/moussaKam/BARThez)にあります。 - - -BARThez の実装は、トークン化を除いて BART と同じです。詳細については、[BART ドキュメント](bart) を参照してください。 -構成クラスとそのパラメータ。 BARThez 固有のトークナイザーについては以下に記載されています。 - - +> [!TIP] +> BARThez の実装は、トークン化を除いて BART と同じです。詳細については、[BART ドキュメント](bart) を参照してください。 +> 構成クラスとそのパラメータ。 BARThez 固有のトークナイザーについては以下に記載されています。 ### Resources diff --git a/docs/source/ja/model_doc/bert-japanese.md b/docs/source/ja/model_doc/bert-japanese.md index 86cce741aac6..b2e2cc804a86 100644 --- a/docs/source/ja/model_doc/bert-japanese.md +++ b/docs/source/ja/model_doc/bert-japanese.md @@ -68,11 +68,8 @@ MeCab および WordPiece トークン化でモデルを使用する例: >>> outputs = bertjapanese(**inputs) ``` - - -- この実装はトークン化方法を除いて BERT と同じです。その他の使用例については、[BERT のドキュメント](bert) を参照してください。 - - +> [!TIP] +> - この実装はトークン化方法を除いて BERT と同じです。その他の使用例については、[BERT のドキュメント](bert) を参照してください。 このモデルは[cl-tohaku](https://huggingface.co/cl-tohaku)から提供されました。 diff --git a/docs/source/ja/model_doc/bertweet.md b/docs/source/ja/model_doc/bertweet.md index 3a5dddbf04cc..d71176dc53b9 100644 --- a/docs/source/ja/model_doc/bertweet.md +++ b/docs/source/ja/model_doc/bertweet.md @@ -54,12 +54,9 @@ al.、2019)。実験では、BERTweet が強力なベースラインである >>> # from transformers import TFAutoModel >>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base") ``` - - -この実装は、トークン化方法を除いて BERT と同じです。詳細については、[BERT ドキュメント](bert) を参照してください。 -API リファレンス情報。 - - +> [!TIP] +> この実装は、トークン化方法を除いて BERT と同じです。詳細については、[BERT ドキュメント](bert) を参照してください。 +> API リファレンス情報。 このモデルは [dqnguyen](https://huggingface.co/dqnguyen) によって提供されました。元のコードは [ここ](https://github.com/VinAIResearch/BERTweet) にあります。 diff --git a/docs/source/ja/model_doc/bort.md b/docs/source/ja/model_doc/bort.md index 185187219e74..75eed5825fe7 100644 --- a/docs/source/ja/model_doc/bort.md +++ b/docs/source/ja/model_doc/bort.md @@ -16,14 +16,11 @@ rendered properly in your Markdown viewer. 
# BORT - - -このモデルはメンテナンス モードのみであり、コードを変更する新しい PR は受け付けられません。 - -このモデルの実行中に問題が発生した場合は、このモデルをサポートしていた最後のバージョン (v4.30.0) を再インストールしてください。 -これを行うには、コマンド `pip install -U Transformers==4.30.0` を実行します。 - - +> [!WARNING] +> このモデルはメンテナンス モードのみであり、コードを変更する新しい PR は受け付けられません。 +> +> このモデルの実行中に問題が発生した場合は、このモデルをサポートしていた最後のバージョン (v4.30.0) を再インストールしてください。 +> これを行うには、コマンド `pip install -U Transformers==4.30.0` を実行します。 ## Overview diff --git a/docs/source/ja/model_doc/byt5.md b/docs/source/ja/model_doc/byt5.md index 83f7f0b4ac57..11b3b72e6525 100644 --- a/docs/source/ja/model_doc/byt5.md +++ b/docs/source/ja/model_doc/byt5.md @@ -40,12 +40,9 @@ T5 アーキテクチャに基づいた事前トレーニング済みのバイ このモデルは、[patrickvonplaten](https://huggingface.co/patrickvonplaten) によって提供されました。元のコードは次のとおりです [ここ](https://github.com/google-research/byt5) にあります。 - - -ByT5 のアーキテクチャは T5v1.1 モデルに基づいています。API リファレンスについては、[T5v1.1 のドキュメント ページ](t5v1.1) を参照してください。彼らは -モデルの入力を準備する方法が異なるだけです。以下のコード例を参照してください。 - - +> [!TIP] +> ByT5 のアーキテクチャは T5v1.1 モデルに基づいています。API リファレンスについては、[T5v1.1 のドキュメント ページ](t5v1.1) を参照してください。彼らは +> モデルの入力を準備する方法が異なるだけです。以下のコード例を参照してください。 ByT5 は教師なしで事前トレーニングされているため、単一タスク中にタスク プレフィックスを使用する利点はありません。 微調整。マルチタスクの微調整を行う場合は、プレフィックスを使用する必要があります。 diff --git a/docs/source/ja/model_doc/camembert.md b/docs/source/ja/model_doc/camembert.md index ee33721102e1..a2daff7735fe 100644 --- a/docs/source/ja/model_doc/camembert.md +++ b/docs/source/ja/model_doc/camembert.md @@ -37,12 +37,9 @@ Bi-direction Encoders for Transformers (BERT) のフランス語版である Cam このモデルは [camembert](https://huggingface.co/camembert) によって提供されました。元のコードは [ここ](https://camembert-model.fr/) にあります。 - - -この実装はRoBERTaと同じです。使用例については[RoBERTaのドキュメント](roberta)も参照してください。 -入力と出力に関する情報として。 - - +> [!TIP] +> この実装はRoBERTaと同じです。使用例については[RoBERTaのドキュメント](roberta)も参照してください。 +> 入力と出力に関する情報として。 ## Resources diff --git a/docs/source/ja/model_doc/code_llama.md b/docs/source/ja/model_doc/code_llama.md index 1f7ee3051bfc..346bd5028c5a 100644 --- a/docs/source/ja/model_doc/code_llama.md +++ b/docs/source/ja/model_doc/code_llama.md @@ -29,17 +29,14 @@ Code Llama モデルはによって [Code Llama: Open Foundation Models for Code ## Usage tips and examples - - -Code Llama のベースとなる`Llama2`ファミリー モデルは、`bfloat16`を使用してトレーニングされましたが、元の推論では`float16`を使用します。さまざまな精度を見てみましょう。 - -* `float32`: モデルの初期化に関する PyTorch の規約では、モデルの重みがどの `dtype` で格納されたかに関係なく、モデルを `float32` にロードします。 「transformers」も、PyTorch との一貫性を保つためにこの規則に従っています。これはデフォルトで選択されます。 `AutoModel` API でストレージの重み付けタイプを使用してチェックポイントのロードをキャストする場合は、`dtype="auto"` を指定する必要があります。 `model = AutoModelForCausalLM.from_pretrained("path", dtype = "auto")`。 -* `bfloat16`: コード Llama はこの精度でトレーニングされているため、さらなるトレーニングや微調整に使用することをお勧めします。 -* `float16`: この精度を使用して推論を実行することをお勧めします。通常は `bfloat16` より高速であり、評価メトリクスには `bfloat16` と比べて明らかな低下が見られないためです。 bfloat16 を使用して推論を実行することもできます。微調整後、float16 と bfloat16 の両方で推論結果を確認することをお勧めします。 - -上で述べたように、モデルを初期化するときに `dtype="auto"` を使用しない限り、ストレージの重みの `dtype` はほとんど無関係です。その理由は、モデルが最初にダウンロードされ (オンラインのチェックポイントの `dtype` を使用)、次に `torch` のデフォルトの `dtype` にキャストされるためです (`torch.float32` になります)。指定された `dtype` がある場合は、代わりにそれが使用されます。 - - +> [!WARNING] +> Code Llama のベースとなる`Llama2`ファミリー モデルは、`bfloat16`を使用してトレーニングされましたが、元の推論では`float16`を使用します。さまざまな精度を見てみましょう。 +> +> * `float32`: モデルの初期化に関する PyTorch の規約では、モデルの重みがどの `dtype` で格納されたかに関係なく、モデルを `float32` にロードします。 「transformers」も、PyTorch との一貫性を保つためにこの規則に従っています。これはデフォルトで選択されます。 `AutoModel` API でストレージの重み付けタイプを使用してチェックポイントのロードをキャストする場合は、`dtype="auto"` を指定する必要があります。 `model = AutoModelForCausalLM.from_pretrained("path", dtype = "auto")`。 +> * `bfloat16`: コード Llama 
はこの精度でトレーニングされているため、さらなるトレーニングや微調整に使用することをお勧めします。 +> * `float16`: この精度を使用して推論を実行することをお勧めします。通常は `bfloat16` より高速であり、評価メトリクスには `bfloat16` と比べて明らかな低下が見られないためです。 bfloat16 を使用して推論を実行することもできます。微調整後、float16 と bfloat16 の両方で推論結果を確認することをお勧めします。 +> +> 上で述べたように、モデルを初期化するときに `dtype="auto"` を使用しない限り、ストレージの重みの `dtype` はほとんど無関係です。その理由は、モデルが最初にダウンロードされ (オンラインのチェックポイントの `dtype` を使用)、次に `torch` のデフォルトの `dtype` にキャストされるためです (`torch.float32` になります)。指定された `dtype` がある場合は、代わりにそれが使用されます。 チップ: - 充填タスクはすぐにサポートされます。入力を埋めたい場所には `tokenizer.fill_token` を使用する必要があります。 @@ -102,11 +99,9 @@ def remove_non_ascii(s: str) -> str: LLaMA トークナイザーは、[sentencepiece](https://github.com/google/sentencepiece) に基づく BPE モデルです。センテンスピースの癖の 1 つは、シーケンスをデコードするときに、最初のトークンが単語の先頭 (例: 「Banana」) である場合、トークナイザーは文字列の先頭にプレフィックス スペースを追加しないことです。 - - -コード Llama は、`Llama2` モデルと同じアーキテクチャを持っています。API リファレンスについては、[Llama2 のドキュメント ページ](llama2) を参照してください。 -以下の Code Llama トークナイザーのリファレンスを見つけてください。 - +> [!TIP] +> コード Llama は、`Llama2` モデルと同じアーキテクチャを持っています。API リファレンスについては、[Llama2 のドキュメント ページ](llama2) を参照してください。 +> 以下の Code Llama トークナイザーのリファレンスを見つけてください。 ## CodeLlamaTokenizer diff --git a/docs/source/ja/model_doc/cpm.md b/docs/source/ja/model_doc/cpm.md index e10e8af7751b..31c9a29f3fea 100644 --- a/docs/source/ja/model_doc/cpm.md +++ b/docs/source/ja/model_doc/cpm.md @@ -38,12 +38,9 @@ GPT-3 の言語は主に英語であり、パラメーターは公開されて ここ: https://github.com/TsinghuaAI/CPM-Generate - - -CPM のアーキテクチャは、トークン化方法を除いて GPT-2 と同じです。詳細については、[GPT-2 ドキュメント](openai-community/gpt2) を参照してください。 -API リファレンス情報。 - - +> [!TIP] +> CPM のアーキテクチャは、トークン化方法を除いて GPT-2 と同じです。詳細については、[GPT-2 ドキュメント](openai-community/gpt2) を参照してください。 +> API リファレンス情報。 ## CpmTokenizer diff --git a/docs/source/ja/model_doc/deplot.md b/docs/source/ja/model_doc/deplot.md index 5e125fa9b52e..7f8704ad5720 100644 --- a/docs/source/ja/model_doc/deplot.md +++ b/docs/source/ja/model_doc/deplot.md @@ -58,8 +58,5 @@ optimizer = Adafactor(self.parameters(), scale_parameter=False, relative_step=Fa scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=40000) ``` - - -DePlot は、`Pix2Struct`アーキテクチャを使用してトレーニングされたモデルです。 API リファレンスについては、[`Pix2Struct` ドキュメント](pix2struct) を参照してください。 - - \ No newline at end of file +> [!TIP] +> DePlot は、`Pix2Struct`アーキテクチャを使用してトレーニングされたモデルです。 API リファレンスについては、[`Pix2Struct` ドキュメント](pix2struct) を参照してください。 \ No newline at end of file diff --git a/docs/source/ja/model_doc/dialogpt.md b/docs/source/ja/model_doc/dialogpt.md index 80e54237854b..0ca6e27665ce 100644 --- a/docs/source/ja/model_doc/dialogpt.md +++ b/docs/source/ja/model_doc/dialogpt.md @@ -50,8 +50,5 @@ OpenAI GPT-2に従って、マルチターン対話セッションを長いテ モデリング。まず、ダイアログ セッション内のすべてのダイアログ ターンを長いテキスト x_1,..., x_N に連結します (N は * 詳細については、元の論文を参照してください。 - - -DialoGPT のアーキテクチャは GPT2 モデルに基づいています。API リファレンスと例については、[GPT2 のドキュメント ページ](openai-community/gpt2) を参照してください。 - - +> [!TIP] +> DialoGPT のアーキテクチャは GPT2 モデルに基づいています。API リファレンスと例については、[GPT2 のドキュメント ページ](openai-community/gpt2) を参照してください。 diff --git a/docs/source/ja/model_memory_anatomy.md b/docs/source/ja/model_memory_anatomy.md index c8b7c90a9468..a4949281cec4 100644 --- a/docs/source/ja/model_memory_anatomy.md +++ b/docs/source/ja/model_memory_anatomy.md @@ -143,11 +143,8 @@ default_args = { } ``` - - -複数の実験を実行する予定がある場合、実験間でメモリを適切にクリアするために、実験の間に Python カーネルを再起動してください。 - - +> [!TIP] +> 複数の実験を実行する予定がある場合、実験間でメモリを適切にクリアするために、実験の間に Python カーネルを再起動してください。 ## Memory utilization at vanilla training diff --git a/docs/source/ja/model_sharing.md b/docs/source/ja/model_sharing.md index 4a282ee6134e..7d462a5b563a 
100644 --- a/docs/source/ja/model_sharing.md +++ b/docs/source/ja/model_sharing.md @@ -27,11 +27,8 @@ specific language governing permissions and limitations under the License. frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen> - - -コミュニティとモデルを共有するには、[huggingface.co](https://huggingface.co/join)でアカウントが必要です。既存の組織に参加したり、新しい組織を作成したりすることもできます。 - - +> [!TIP] +> コミュニティとモデルを共有するには、[huggingface.co](https://huggingface.co/join)でアカウントが必要です。既存の組織に参加したり、新しい組織を作成したりすることもできます。 ## Repository Features diff --git a/docs/source/ja/peft.md b/docs/source/ja/peft.md index 77dd7be86c72..269277ebe97d 100644 --- a/docs/source/ja/peft.md +++ b/docs/source/ja/peft.md @@ -66,11 +66,8 @@ peft_model_id = "ybelkada/opt-350m-lora" model = AutoModelForCausalLM.from_pretrained(peft_model_id) ``` - - -PEFTアダプターを`AutoModelFor`クラスまたは基本モデルクラス(`OPTForCausalLM`または`LlamaForCausalLM`など)で読み込むことができます。 - - +> [!TIP] +> PEFTアダプターを`AutoModelFor`クラスまたは基本モデルクラス(`OPTForCausalLM`または`LlamaForCausalLM`など)で読み込むことができます。 また、`load_adapter`メソッドを呼び出すことで、PEFTアダプターを読み込むこともできます: @@ -176,11 +173,8 @@ output = model.generate(**inputs) PEFTアダプターは[`Trainer`]クラスでサポートされており、特定のユースケースに対してアダプターをトレーニングすることができます。数行のコードを追加するだけで済みます。たとえば、LoRAアダプターをトレーニングする場合: - - -[`Trainer`]を使用したモデルの微調整に慣れていない場合は、[事前トレーニング済みモデルの微調整](training)チュートリアルをご覧ください。 - - +> [!TIP] +> [`Trainer`]を使用したモデルの微調整に慣れていない場合は、[事前トレーニング済みモデルの微調整](training)チュートリアルをご覧ください。 1. タスクタイプとハイパーパラメータに対するアダプターの構成を定義します(ハイパーパラメータの詳細については[`~peft.LoraConfig`]を参照してください)。 diff --git a/docs/source/ja/perf_infer_cpu.md b/docs/source/ja/perf_infer_cpu.md index d23ae65f309f..40051f2ddc48 100644 --- a/docs/source/ja/perf_infer_cpu.md +++ b/docs/source/ja/perf_infer_cpu.md @@ -40,13 +40,10 @@ IPEXのリリースはPyTorchに従っています。[IPEXのインストール ### Usage of JIT-mode Trainerで評価または予測のためにJITモードを有効にするには、ユーザーはTrainerコマンド引数に`jit_mode_eval`を追加する必要があります。 - - -PyTorch >= 1.14.0の場合、jitモードはjit.traceでdict入力がサポートされているため、予測と評価に任意のモデルに利益をもたらす可能性があります。 - -PyTorch < 1.14.0の場合、jitモードはforwardパラメーターの順序がjit.traceのタプル入力の順序と一致するモデルに利益をもたらす可能性があります(質問応答モデルなど)。jit.traceがタプル入力の順序と一致しない場合、テキスト分類モデルなど、jit.traceは失敗し、これをフォールバックさせるために例外でキャッチしています。ログはユーザーに通知するために使用されます。 - - +> [!WARNING] +> PyTorch >= 1.14.0の場合、jitモードはjit.traceでdict入力がサポートされているため、予測と評価に任意のモデルに利益をもたらす可能性があります。 +> +> PyTorch < 1.14.0の場合、jitモードはforwardパラメーターの順序がjit.traceのタプル入力の順序と一致するモデルに利益をもたらす可能性があります(質問応答モデルなど)。jit.traceがタプル入力の順序と一致しない場合、テキスト分類モデルなど、jit.traceは失敗し、これをフォールバックさせるために例外でキャッチしています。ログはユーザーに通知するために使用されます。 [Transformers質問応答の使用例](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering)を参考にしてください。 diff --git a/docs/source/ja/perf_infer_gpu_many.md b/docs/source/ja/perf_infer_gpu_many.md index 6a71c1094494..e9d632133e07 100644 --- a/docs/source/ja/perf_infer_gpu_many.md +++ b/docs/source/ja/perf_infer_gpu_many.md @@ -16,11 +16,8 @@ rendered properly in your Markdown viewer. 
# Efficient Inference on a Multiple GPUs この文書には、複数のGPUで効率的に推論を行う方法に関する情報が含まれています。 - - -注意: 複数のGPUセットアップは、[単一のGPUセクション](./perf_infer_gpu_one)で説明されているほとんどの戦略を使用できます。ただし、より良い使用法のために使用できる簡単なテクニックについても認識しておく必要があります。 - - +> [!TIP] +> 注意: 複数のGPUセットアップは、[単一のGPUセクション](./perf_infer_gpu_one)で説明されているほとんどの戦略を使用できます。ただし、より良い使用法のために使用できる簡単なテクニックについても認識しておく必要があります。 ## Flash Attention 2 @@ -31,11 +28,8 @@ Flash Attention 2の統合は、複数のGPUセットアップでも機能しま [BetterTransformer](https://huggingface.co/docs/optimum/bettertransformer/overview)は、🤗 TransformersモデルをPyTorchネイティブの高速実行パスを使用するように変換し、その下でFlash Attentionなどの最適化されたカーネルを呼び出します。 BetterTransformerは、テキスト、画像、音声モデルの単一GPUおよび複数GPUでの高速推論もサポートしています。 - - -Flash Attentionは、fp16またはbf16 dtypeを使用しているモデルにのみ使用できます。BetterTransformerを使用する前に、モデルを適切なdtypeにキャストしてください。 - - +> [!TIP] +> Flash Attentionは、fp16またはbf16 dtypeを使用しているモデルにのみ使用できます。BetterTransformerを使用する前に、モデルを適切なdtypeにキャストしてください。 ### Decoder models diff --git a/docs/source/ja/perf_infer_gpu_one.md b/docs/source/ja/perf_infer_gpu_one.md index 478cd382cd2c..78707b335a53 100644 --- a/docs/source/ja/perf_infer_gpu_one.md +++ b/docs/source/ja/perf_infer_gpu_one.md @@ -19,11 +19,8 @@ rendered properly in your Markdown viewer. ## Flash Attention 2 - - -この機能は実験的であり、将来のバージョンで大幅に変更される可能性があります。たとえば、Flash Attention 2 APIは近い将来`BetterTransformer` APIに移行するかもしれません。 - - +> [!TIP] +> この機能は実験的であり、将来のバージョンで大幅に変更される可能性があります。たとえば、Flash Attention 2 APIは近い将来`BetterTransformer` APIに移行するかもしれません。 Flash Attention 2は、トランスフォーマーベースのモデルのトレーニングと推論速度を大幅に高速化できます。Flash Attention 2は、Tri Dao氏によって[公式のFlash Attentionリポジトリ](https://github.com/Dao-AILab/flash-attention)で導入されました。Flash Attentionに関する科学論文は[こちら](https://huggingface.co/papers/2205.14135)で見ることができます。 @@ -36,11 +33,8 @@ Flash Attention 2を正しくインストールするには、上記のリポジ さらに多くのモデルにFlash Attention 2のサポートを追加することをGitHubで提案することもでき、変更を統合するためにプルリクエストを開くこともできます。サポートされているモデルは、パディングトークンを使用してトレーニングを含む、推論とトレーニングに使用できます(現在の`BetterTransformer` APIではサポートされていない)。 - - -Flash Attention 2は、モデルのdtypeが`fp16`または`bf16`の場合にのみ使用でき、NVIDIA-GPUデバイスでのみ実行されます。この機能を使用する前に、モデルを適切なdtypeにキャストし、サポートされているデバイスにロードしてください。 - - +> [!TIP] +> Flash Attention 2は、モデルのdtypeが`fp16`または`bf16`の場合にのみ使用でき、NVIDIA-GPUデバイスでのみ実行されます。この機能を使用する前に、モデルを適切なdtypeにキャストし、サポートされているデバイスにロードしてください。 ### Quick usage @@ -170,11 +164,8 @@ model.add_adapter(lora_config) BetterTransformerは、テキスト、画像、およびオーディオモデルの単一およびマルチGPUでの高速な推論をサポートしています。 - - -Flash Attentionは、fp16またはbf16のdtypeを使用するモデルにのみ使用できます。BetterTransformerを使用する前に、モデルを適切なdtypeにキャストしてください。 - - +> [!TIP] +> Flash Attentionは、fp16またはbf16のdtypeを使用するモデルにのみ使用できます。BetterTransformerを使用する前に、モデルを適切なdtypeにキャストしてください。 ### Encoder models @@ -264,13 +255,10 @@ You can install `bitsandbytes` and benefit from easy model compression on GPUs. `bitsandbytes`をインストールし、GPUで簡単なモデルの圧縮を利用できます。FP4量子化を使用すると、ネイティブのフルプレシジョンバージョンと比較してモデルサイズを最大8倍削減できることが期待できます。以下を確認して、どのように始めるかをご覧ください。 - - -Note that this feature can also be used in a multi GPU setup. - -この機能は、マルチGPUセットアップでも使用できることに注意してください。 - - +> [!TIP] +> Note that this feature can also be used in a multi GPU setup. 
+> +> この機能は、マルチGPUセットアップでも使用できることに注意してください。 ### Requirements [[requirements-for-fp4-mixedprecision-inference]] @@ -326,11 +314,8 @@ model_4bit = AutoModelForCausalLM.from_pretrained( ## `bitsandbytes` integration for Int8 mixed-precision matrix decomposition - - -この機能は、マルチGPU環境でも使用できます。 - - +> [!TIP] +> この機能は、マルチGPU環境でも使用できます。 論文[`LLM.int8():スケーラブルなTransformer向けの8ビット行列乗算`](https://huggingface.co/papers/2208.07339)によれば、Hugging Face統合がHub内のすべてのモデルでわずか数行のコードでサポートされています。このメソッドは、半精度(`float16`および`bfloat16`)の重みの場合に`nn.Linear`サイズを2倍、単精度(`float32`)の重みの場合は4倍に縮小し、外れ値に対してほとんど影響を与えません。 diff --git a/docs/source/ja/perf_train_cpu_many.md b/docs/source/ja/perf_train_cpu_many.md index 26da32f57725..92a52b3fd4a2 100644 --- a/docs/source/ja/perf_train_cpu_many.md +++ b/docs/source/ja/perf_train_cpu_many.md @@ -46,20 +46,14 @@ where `{pytorch_version}` should be your PyTorch version, for instance 1.13.0. Check more approaches for [oneccl_bind_pt installation](https://github.com/intel/torch-ccl). Versions of oneCCL and PyTorch must match. - - -oneccl_bindings_for_pytorch 1.12.0 prebuilt wheel does not work with PyTorch 1.12.1 (it is for PyTorch 1.12.0) -PyTorch 1.12.1 should work with oneccl_bindings_for_pytorch 1.12.100 - - +> [!WARNING] +> oneccl_bindings_for_pytorch 1.12.0 prebuilt wheel does not work with PyTorch 1.12.1 (it is for PyTorch 1.12.0) +> PyTorch 1.12.1 should work with oneccl_bindings_for_pytorch 1.12.100 `{pytorch_version}` は、あなたのPyTorchのバージョン(例:1.13.0)に置き換える必要があります。重要なのは、oneCCLとPyTorchのバージョンが一致していることです。[oneccl_bind_ptのインストール](https://github.com/intel/torch-ccl)に関するさらなるアプローチを確認できます。 - - -`oneccl_bindings_for_pytorch`の1.12.0プリビルトホイールはPyTorch 1.12.1と互換性がありません(これはPyTorch 1.12.0用です)。PyTorch 1.12.1を使用する場合は、`oneccl_bindings_for_pytorch`バージョン1.12.100を使用する必要があります。 - - +> [!WARNING] +> `oneccl_bindings_for_pytorch`の1.12.0プリビルトホイールはPyTorch 1.12.1と互換性がありません(これはPyTorch 1.12.0用です)。PyTorch 1.12.1を使用する場合は、`oneccl_bindings_for_pytorch`バージョン1.12.100を使用する必要があります。 ## Intel® MPI library diff --git a/docs/source/ja/perf_train_gpu_many.md b/docs/source/ja/perf_train_gpu_many.md index 6721ba69a925..80cf3507eeaa 100644 --- a/docs/source/ja/perf_train_gpu_many.md +++ b/docs/source/ja/perf_train_gpu_many.md @@ -17,11 +17,8 @@ rendered properly in your Markdown viewer. 単一のGPUでのトレーニングが遅すぎる場合や、モデルの重みが単一のGPUのメモリに収まらない場合、複数のGPUを使用したセットアップが必要となります。単一のGPUから複数のGPUへの切り替えには、ワークロードを分散するためのある種の並列処理が必要です。データ、テンソル、またはパイプラインの並列処理など、さまざまな並列処理技術があります。ただし、すべてに適した一つの解決策は存在せず、最適な設定は使用するハードウェアに依存します。この記事は、おそらく他のフレームワークにも適用される主要な概念に焦点を当てつつ、PyTorchベースの実装に焦点を当てています。 - - -**注意**: [単一GPUセクション](perf_train_gpu_one) で紹介された多くの戦略(混合精度トレーニングや勾配蓄積など)は一般的であり、モデルのトレーニングに一般的に適用されます。したがって、マルチGPUやCPUトレーニングなどの次のセクションに入る前に、それを確認してください。 - - +> [!TIP] +> **注意**: [単一GPUセクション](perf_train_gpu_one) で紹介された多くの戦略(混合精度トレーニングや勾配蓄積など)は一般的であり、モデルのトレーニングに一般的に適用されます。したがって、マルチGPUやCPUトレーニングなどの次のセクションに入る前に、それを確認してください。 まず、さまざまな1D並列処理技術とその利点および欠点について詳しく説明し、それらを2Dおよび3D並列処理に組み合わせてさらに高速なトレーニングを実現し、より大きなモデルをサポートする方法を検討します。さまざまな他の強力な代替手法も紹介されます。 diff --git a/docs/source/ja/perf_train_gpu_one.md b/docs/source/ja/perf_train_gpu_one.md index 1a82ebb60d0a..292878b1ccad 100644 --- a/docs/source/ja/perf_train_gpu_one.md +++ b/docs/source/ja/perf_train_gpu_one.md @@ -17,11 +17,8 @@ rendered properly in your Markdown viewer. 
このガイドでは、メモリの利用効率を最適化し、トレーニングを高速化することで、モデルのトレーニング効率を向上させるために使用できる実用的なテクニックを紹介します。トレーニング中にGPUがどのように利用されるかを理解したい場合は、最初に「[モデルトレーニングの解剖学](model_memory_anatomy)」のコンセプトガイドを参照してください。このガイドは実用的なテクニックに焦点を当てています。 - - -複数のGPUを搭載したマシンにアクセスできる場合、これらのアプローチは依然として有効です。さらに、[マルチGPUセクション](perf_train_gpu_many)で説明されている追加の方法を活用できます。 - - +> [!TIP] +> 複数のGPUを搭載したマシンにアクセスできる場合、これらのアプローチは依然として有効です。さらに、[マルチGPUセクション](perf_train_gpu_many)で説明されている追加の方法を活用できます。 大規模なモデルをトレーニングする際、同時に考慮すべき2つの側面があります: @@ -46,11 +43,8 @@ rendered properly in your Markdown viewer. | [DeepSpeed Zero](#deepspeed-zero) | No | Yes | | [torch.compile](#using-torchcompile) | Yes | No | - - -**注意**: 小さなモデルと大きなバッチサイズを使用する場合、メモリの節約が行われますが、大きなモデルと小さなバッチサイズを使用する場合、メモリの使用量が増加します。 - - +> [!TIP] +> **注意**: 小さなモデルと大きなバッチサイズを使用する場合、メモリの節約が行われますが、大きなモデルと小さなバッチサイズを使用する場合、メモリの使用量が増加します。 これらのテクニックは、[`Trainer`]でモデルをトレーニングしている場合や、純粋なPyTorchループを記述している場合の両方で利用できます。詳細な最適化の設定については、🤗 Accelerateを使用して[これらの最適化を設定できます](#using--accelerate)。 @@ -108,11 +102,8 @@ training_args = TrainingArguments( 代替手段として、🤗 Accelerateを使用することもできます - 🤗 Accelerateの例は[このガイドのさらに後ろにあります](#using--accelerate)。 - - -勾配チェックポイントを使用することでメモリ効率が向上する場合がありますが、トレーニング速度は約20%遅くなることに注意してください。 - - +> [!TIP] +> 勾配チェックポイントを使用することでメモリ効率が向上する場合がありますが、トレーニング速度は約20%遅くなることに注意してください。 ## Mixed precision training @@ -169,11 +160,8 @@ torch.backends.cudnn.allow_tf32 = True TrainingArguments(tf32=True, **default_args) ``` - - -tf32は`tensor.to(dtype=torch.tf32)`を介して直接アクセスできません。これは内部のCUDAデータ型です。tf32データ型を使用するには、`torch>=1.7`が必要です。 - - +> [!TIP] +> tf32は`tensor.to(dtype=torch.tf32)`を介して直接アクセスできません。これは内部のCUDAデータ型です。tf32データ型を使用するには、`torch>=1.7`が必要です。 tf32と他の精度に関する詳細な情報については、以下のベンチマークを参照してください: [RTX-3090](https://github.com/huggingface/transformers/issues/14608#issuecomment-1004390803)および @@ -426,12 +414,9 @@ model = model.to_bettertransformer() 変換後、通常通りモデルをトレーニングしてください。 - - -PyTorchネイティブの`scaled_dot_product_attention`演算子は、`attention_mask`が提供されていない場合にのみFlash Attentionにディスパッチできます。 - -デフォルトでは、トレーニングモードでBetterTransformer統合はマスクサポートを削除し、バッチトレーニングにパディングマスクが必要ないトレーニングにしか使用できません。これは、例えばマスク言語モデリングや因果言語モデリングのような、バッチトレーニングにパディングマスクが不要なトレーニングの場合に該当します。BetterTransformerはパディングマスクが必要なタスクに対するモデルの微調整には適していません。 - - +> [!WARNING] +> PyTorchネイティブの`scaled_dot_product_attention`演算子は、`attention_mask`が提供されていない場合にのみFlash Attentionにディスパッチできます。 +> +> デフォルトでは、トレーニングモードでBetterTransformer統合はマスクサポートを削除し、バッチトレーニングにパディングマスクが必要ないトレーニングにしか使用できません。これは、例えばマスク言語モデリングや因果言語モデリングのような、バッチトレーニングにパディングマスクが不要なトレーニングの場合に該当します。BetterTransformerはパディングマスクが必要なタスクに対するモデルの微調整には適していません。 SDPAを使用したアクセラレーションとメモリの節約について詳しく知りたい場合は、この[ブログ記事](https://pytorch.org/blog/out-of-the-box-acceleration/)をチェックしてください。 diff --git a/docs/source/ja/perf_train_special.md b/docs/source/ja/perf_train_special.md index 080ff66f4cf5..45481dc03cd8 100644 --- a/docs/source/ja/perf_train_special.md +++ b/docs/source/ja/perf_train_special.md @@ -15,10 +15,7 @@ rendered properly in your Markdown viewer. 
# Training on Specialized Hardware - - -注意: [単一GPUセクション](perf_train_gpu_one)で紹介されたほとんどの戦略(混合精度トレーニングや勾配蓄積など)および[マルチGPUセクション](perf_train_gpu_many)は一般的なトレーニングモデルに適用される汎用的なものですので、このセクションに入る前にそれを確認してください。 - - +> [!TIP] +> 注意: [単一GPUセクション](perf_train_gpu_one)で紹介されたほとんどの戦略(混合精度トレーニングや勾配蓄積など)および[マルチGPUセクション](perf_train_gpu_many)は一般的なトレーニングモデルに適用される汎用的なものですので、このセクションに入る前にそれを確認してください。 このドキュメントは、専用ハードウェアでトレーニングする方法に関する情報を近日中に追加予定です。 diff --git a/docs/source/ja/perf_train_tpu.md b/docs/source/ja/perf_train_tpu.md index aadd588ae84d..c00b747fce92 100644 --- a/docs/source/ja/perf_train_tpu.md +++ b/docs/source/ja/perf_train_tpu.md @@ -15,10 +15,7 @@ rendered properly in your Markdown viewer. # Training on TPUs - - - 注意: [シングルGPUセクション](perf_train_gpu_one)で紹介されているほとんどの戦略(混合精度トレーニングや勾配蓄積など)および[マルチGPUセクション](perf_train_gpu_many)は一般的なモデルのトレーニングに適用できますので、このセクションに入る前にそれを確認してください。 - - +> [!TIP] +> 注意: [シングルGPUセクション](perf_train_gpu_one)で紹介されているほとんどの戦略(混合精度トレーニングや勾配蓄積など)および[マルチGPUセクション](perf_train_gpu_many)は一般的なモデルのトレーニングに適用できますので、このセクションに入る前にそれを確認してください。 このドキュメントは、TPUでのトレーニング方法に関する情報をまもなく追加いたします。 diff --git a/docs/source/ja/pipeline_tutorial.md b/docs/source/ja/pipeline_tutorial.md index c580d30a7c00..db60b5ea2cc4 100644 --- a/docs/source/ja/pipeline_tutorial.md +++ b/docs/source/ja/pipeline_tutorial.md @@ -24,11 +24,8 @@ specific language governing permissions and limitations under the License. - 特定のトークナイザやモデルの使用方法。 - オーディオ、ビジョン、マルチモーダルタスクのための[`pipeline`]の使用方法。 - - -サポートされているタスクと利用可能なパラメータの完全な一覧については、[`pipeline`]のドキュメンテーションをご覧ください。 - - +> [!TIP] +> サポートされているタスクと利用可能なパラメータの完全な一覧については、[`pipeline`]のドキュメンテーションをご覧ください。 ## Pipeline usage @@ -192,9 +189,8 @@ for out in pipe(KeyDataset(dataset, "audio")): ## Using pipelines for a webserver - -推論エンジンを作成することは複雑なトピックで、独自のページが必要です。 - +> [!TIP] +> 推論エンジンを作成することは複雑なトピックで、独自のページが必要です。 [リンク](./pipeline_webserver) @@ -255,16 +251,13 @@ for out in pipe(KeyDataset(dataset, "audio")): [{'score': 0.425, 'answer': 'us-001', 'start': 16, 'end': 16}] ``` - - -上記の例を実行するには、🤗 Transformersに加えて [`pytesseract`](https://pypi.org/project/pytesseract/) がインストールされている必要があります。 - -```bash -sudo apt install -y tesseract-ocr -pip install pytesseract -``` - - +> [!TIP] +> 上記の例を実行するには、🤗 Transformersに加えて [`pytesseract`](https://pypi.org/project/pytesseract/) がインストールされている必要があります。 +> +> ```bash +> sudo apt install -y tesseract-ocr +> pip install pytesseract +> ``` ## Using `pipeline` on large models with 🤗 `accelerate`: diff --git a/docs/source/ja/pipeline_webserver.md b/docs/source/ja/pipeline_webserver.md index 3b35a01490d4..c62919bda7a1 100644 --- a/docs/source/ja/pipeline_webserver.md +++ b/docs/source/ja/pipeline_webserver.md @@ -4,10 +4,9 @@ rendered properly in your Markdown viewer. # Webサーバー用のパイプラインの使用 - -推論エンジンの作成は複雑なトピックであり、"最適な"ソリューションはおそらく問題の領域に依存するでしょう。CPUまたはGPUを使用していますか?最低のレイテンシ、最高のスループット、多くのモデルのサポート、または特定のモデルの高度な最適化を望んでいますか? -このトピックに取り組むための多くの方法があり、私たちが紹介するのは、おそらく最適なソリューションではないかもしれないが、始めるための良いデフォルトです。 - +> [!TIP] +> 推論エンジンの作成は複雑なトピックであり、"最適な"ソリューションはおそらく問題の領域に依存するでしょう。CPUまたはGPUを使用していますか?最低のレイテンシ、最高のスループット、多くのモデルのサポート、または特定のモデルの高度な最適化を望んでいますか? 
+> このトピックに取り組むための多くの方法があり、私たちが紹介するのは、おそらく最適なソリューションではないかもしれないが、始めるための良いデフォルトです。 重要なことは、Webサーバーはリクエストを待機し、受信したように扱うシステムであるため、[データセット](pipeline_tutorial#using-pipelines-on-a-dataset)のように、イテレータを使用できることです。 @@ -75,11 +74,8 @@ curl -X POST -d "test [MASK]" http://localhost:8000/ 本当に重要なのは、モデルを**一度だけ**ロードすることです。これにより、ウェブサーバー上にモデルのコピーがないため、不必要なRAMが使用されなくなります。 その後、キューイングメカニズムを使用して、動的バッチ処理を行うなど、いくつかのアイテムを蓄積してから推論を行うなど、高度な処理を行うことができます: - - -以下のコードサンプルは、可読性のために擬似コードのように書かれています。システムリソースに合理的かどうかを確認せずに実行しないでください! - - +> [!WARNING] +> 以下のコードサンプルは、可読性のために擬似コードのように書かれています。システムリソースに合理的かどうかを確認せずに実行しないでください! ```py diff --git a/docs/source/ja/pr_checks.md b/docs/source/ja/pr_checks.md index dc8450b52502..3d7d29453c52 100644 --- a/docs/source/ja/pr_checks.md +++ b/docs/source/ja/pr_checks.md @@ -156,11 +156,8 @@ make fix-copies Transformersライブラリは、モデルコードに関して非常に意見があるため、各モデルは他のモデルに依存せずに完全に1つのファイルに実装する必要があります。したがって、特定のモデルのコードのコピーが元のコードと一貫しているかどうかを確認する仕組みを追加しました。これにより、バグ修正がある場合、他の影響を受けるモデルをすべて確認し、変更を伝達するかコピーを破棄するかを選択できます。 - - -ファイルが別のファイルの完全なコピーである場合、それを`utils/check_copies.py`の`FULL_COPIES`定数に登録する必要があります。 - - +> [!TIP] +> ファイルが別のファイルの完全なコピーである場合、それを`utils/check_copies.py`の`FULL_COPIES`定数に登録する必要があります。 この仕組みは、`# Copied from xxx`という形式のコメントに依存しています。`xxx`は、コピーされているクラスまたは関数の完全なパスを含む必要があります。例えば、`RobertaSelfOutput`は`BertSelfOutput`クラスの直接のコピーですので、[こちら](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L289)にコメントがあります。 @@ -188,11 +185,8 @@ Transformersライブラリは、モデルコードに関して非常に意見 もし順序が重要な場合(以前の置換と競合する可能性があるため)、置換は左から右に実行されます。 - - -もし置換がフォーマットを変更する場合(たとえば、短い名前を非常に長い名前に置き換える場合など)、自動フォーマッタを適用した後にコピーが確認されます。 - - +> [!TIP] +> もし置換がフォーマットを変更する場合(たとえば、短い名前を非常に長い名前に置き換える場合など)、自動フォーマッタを適用した後にコピーが確認されます。 パターンが同じ置換の異なるケース(大文字と小文字のバリアントがある)の場合、オプションとして `all-casing` を追加するだけの別の方法もあります。[こちら](https://github.com/huggingface/transformers/blob/15082a9dc6950ecae63a0d3e5060b2fc7f15050a/src/transformers/models/mobilebert/modeling_mobilebert.py#L1237)は、`MobileBertForSequenceClassification` 内の例で、コメントがついています。 diff --git a/docs/source/ja/preprocessing.md b/docs/source/ja/preprocessing.md index cb1129a8355e..9ab9d1660cdb 100644 --- a/docs/source/ja/preprocessing.md +++ b/docs/source/ja/preprocessing.md @@ -29,12 +29,9 @@ Markdownビューアーで正しく表示されないことがあります。 * 画像入力の場合、[ImageProcessor](./main_classes/image)を使用して画像をテンソルに変換する方法。 * マルチモーダル入力の場合、[Processor](./main_classes/processors)を使用してトークナイザと特徴抽出器または画像プロセッサを組み合わせる方法。 - - -`AutoProcessor`は常に動作し、使用するモデルに適切なクラスを自動的に選択します。 -トークナイザ、画像プロセッサ、特徴抽出器、またはプロセッサを使用しているかにかかわらず、動作します。 - - +> [!TIP] +> `AutoProcessor`は常に動作し、使用するモデルに適切なクラスを自動的に選択します。 +> トークナイザ、画像プロセッサ、特徴抽出器、またはプロセッサを使用しているかにかかわらず、動作します。 始める前に、🤗 Datasetsをインストールして、いくつかのデータセットを試すことができるようにしてください: @@ -48,11 +45,8 @@ pip install datasets テキストデータの前処理に使用する主要なツールは、[トークナイザ](main_classes/tokenizer)です。トークナイザは、一連のルールに従ってテキストを*トークン*に分割します。トークンは数値に変換され、その後テンソルに変換され、モデルの入力となります。モデルが必要とする追加の入力は、トークナイザによって追加されます。 - - -事前学習済みモデルを使用する予定の場合、関連する事前学習済みトークナイザを使用することが重要です。これにより、テキストが事前学習コーパスと同じ方法で分割され、事前学習中に通常*ボキャブ*として参照される対応するトークンインデックスを使用します。 - - +> [!TIP] +> 事前学習済みモデルを使用する予定の場合、関連する事前学習済みトークナイザを使用することが重要です。これにより、テキストが事前学習コーパスと同じ方法で分割され、事前学習中に通常*ボキャブ*として参照される対応するトークンインデックスを使用します。 [`AutoTokenizer.from_pretrained`]メソッドを使用して事前学習済みトークナイザをロードして、開始しましょう。これにより、モデルが事前学習された*ボキャブ*がダウンロードされます: @@ -162,11 +156,8 @@ pip install datasets [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]} ``` - - -異なるパディングと切り詰めの引数について詳しくは、[パディングと切り詰め](./pad_truncation)のコンセプトガイドをご覧ください。 - - +> [!TIP] +> 
異なるパディングと切り詰めの引数について詳しくは、[パディングと切り詰め](./pad_truncation)のコンセプトガイドをご覧ください。 ### Build tensors @@ -310,24 +301,18 @@ pip install datasets コンピュータビジョンタスクでは、モデル用にデータセットを準備するための[画像プロセッサ](main_classes/image_processor)が必要です。 画像の前処理には、画像をモデルが期待する入力形式に変換するためのいくつかのステップが含まれています。これらのステップには、リサイズ、正規化、カラーチャネルの補正、および画像をテンソルに変換するなどが含まれます。 - - -画像の前処理は、通常、画像の増強の形式に従います。画像の前処理と画像の増強の両方は画像データを変換しますが、異なる目的があります: - -* 画像の増強は、過学習を防ぎ、モデルの堅牢性を向上させるのに役立つ方法で画像を変更します。データを増強する方法は無限で、明るさや色の調整、クロップ、回転、リサイズ、ズームなど、様々な方法があります。ただし、増強操作によって画像の意味が変わらないように注意する必要があります。 -* 画像の前処理は、画像がモデルの期待する入力形式と一致することを保証します。コンピュータビジョンモデルをファインチューニングする場合、画像はモデルが最初にトレーニングされたときとまったく同じ方法で前処理する必要があります。 - -画像の増強には任意のライブラリを使用できます。画像の前処理には、モデルに関連付けられた`ImageProcessor`を使用します。 - - +> [!TIP] +> 画像の前処理は、通常、画像の増強の形式に従います。画像の前処理と画像の増強の両方は画像データを変換しますが、異なる目的があります: +> +> * 画像の増強は、過学習を防ぎ、モデルの堅牢性を向上させるのに役立つ方法で画像を変更します。データを増強する方法は無限で、明るさや色の調整、クロップ、回転、リサイズ、ズームなど、様々な方法があります。ただし、増強操作によって画像の意味が変わらないように注意する必要があります。 +> * 画像の前処理は、画像がモデルの期待する入力形式と一致することを保証します。コンピュータビジョンモデルをファインチューニングする場合、画像はモデルが最初にトレーニングされたときとまったく同じ方法で前処理する必要があります。 +> +> 画像の増強には任意のライブラリを使用できます。画像の前処理には、モデルに関連付けられた`ImageProcessor`を使用します。 コンピュータビジョンのデータセットで画像プロセッサを使用する方法を示すために、[food101](https://huggingface.co/datasets/food101)データセットをロードします(データセットのロード方法の詳細については🤗[Datasetsチュートリアル](https://huggingface.co/docs/datasets/load_hub)を参照): - - -データセットがかなり大きいため、🤗 Datasetsの`split`パラメータを使用してトレーニングデータの小さなサンプルのみをロードします! - - +> [!TIP] +> データセットがかなり大きいため、🤗 Datasetsの`split`パラメータを使用してトレーニングデータの小さなサンプルのみをロードします! ```python >>> from datasets import load_dataset @@ -382,14 +367,12 @@ AutoImageProcessorを[`AutoImageProcessor.from_pretrained`]を使用してロー ... return examples ``` - - -上記の例では、画像のサイズ変更を既に画像増強変換で行っているため、`do_resize=False`を設定しました。 -適切な `image_processor` からの `size` 属性を活用しています。画像増強中に画像のサイズ変更を行わない場合は、このパラメータを省略してください。 -デフォルトでは、`ImageProcessor` がサイズ変更を処理します。 - -画像を増強変換の一部として正規化したい場合は、`image_processor.image_mean` と `image_processor.image_std` の値を使用してください。 - +> [!TIP] +> 上記の例では、画像のサイズ変更を既に画像増強変換で行っているため、`do_resize=False`を設定しました。 +> 適切な `image_processor` からの `size` 属性を活用しています。画像増強中に画像のサイズ変更を行わない場合は、このパラメータを省略してください。 +> デフォルトでは、`ImageProcessor` がサイズ変更を処理します。 +> +> 画像を増強変換の一部として正規化したい場合は、`image_processor.image_mean` と `image_processor.image_std` の値を使用してください。 3. 次に、🤗 Datasetsの[`set_transform`](https://huggingface.co/docs/datasets/process#format-transform)を使用して、変換をリアルタイムで適用します: @@ -417,12 +400,9 @@ AutoImageProcessorを[`AutoImageProcessor.from_pretrained`]を使用してロー
- - -オブジェクト検出、意味セグメンテーション、インスタンスセグメンテーション、およびパノプティックセグメンテーションなどのタスクの場合、`ImageProcessor`は -ポスト処理メソッドを提供します。これらのメソッドは、モデルの生の出力を境界ボックスやセグメンテーションマップなどの意味のある予測に変換します。 - - +> [!TIP] +> オブジェクト検出、意味セグメンテーション、インスタンスセグメンテーション、およびパノプティックセグメンテーションなどのタスクの場合、`ImageProcessor`は +> ポスト処理メソッドを提供します。これらのメソッドは、モデルの生の出力を境界ボックスやセグメンテーションマップなどの意味のある予測に変換します。 ### Pad diff --git a/docs/source/ja/quicktour.md b/docs/source/ja/quicktour.md index 44a154a614c5..916134e69bd0 100644 --- a/docs/source/ja/quicktour.md +++ b/docs/source/ja/quicktour.md @@ -44,11 +44,8 @@ pip install torch [`pipeline`] を使用することで、さまざまなモダリティにわたる多くのタスクに対して即座に使用できます。 いくつかのタスクは以下の表に示されています: - - -使用可能なタスクの完全な一覧については、[pipeline API リファレンス](./main_classes/pipelines)を確認してください。 - - +> [!TIP] +> 使用可能なタスクの完全な一覧については、[pipeline API リファレンス](./main_classes/pipelines)を確認してください。 | **タスク** | **説明** | **モダリティ** | **パイプライン識別子** | |------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------|-----------------------------------------------| @@ -209,11 +206,8 @@ Pass your text to the tokenizer: ... ) ``` - - -[前処理](./preprocessing)チュートリアルをご覧いただき、トークナイゼーションの詳細や、[`AutoImageProcessor`]、[`AutoFeatureExtractor`]、[`AutoProcessor`]を使用して画像、オーディオ、およびマルチモーダル入力を前処理する方法について詳しく説明されているページもご覧ください。 - - +> [!TIP] +> [前処理](./preprocessing)チュートリアルをご覧いただき、トークナイゼーションの詳細や、[`AutoImageProcessor`]、[`AutoFeatureExtractor`]、[`AutoProcessor`]を使用して画像、オーディオ、およびマルチモーダル入力を前処理する方法について詳しく説明されているページもご覧ください。 ### AutoModel @@ -229,11 +223,8 @@ Pass your text to the tokenizer: >>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name) ``` - - -[`AutoModel`]クラスでサポートされているタスクに関する詳細については、[タスクの概要](./task_summary)を参照してください。 - - +> [!TIP] +> [`AutoModel`]クラスでサポートされているタスクに関する詳細については、[タスクの概要](./task_summary)を参照してください。 今、前処理済みのバッチを直接モデルに渡します。辞書を展開するだけで、`**`を追加する必要があります: @@ -253,14 +244,11 @@ tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725], ``` - - -🤗 Transformersのすべてのモデル(PyTorchまたはTensorFlow)は、最終的な活性化関数(softmaxなど)*前*のテンソルを出力します。 -最終的な活性化関数は、しばしば損失と結合されているためです。モデルの出力は特別なデータクラスであり、その属性はIDEで自動補完されます。 -モデルの出力は、タプルまたは辞書のように動作します(整数、スライス、または文字列でインデックスを付けることができます)。 -この場合、Noneである属性は無視されます。 - - +> [!TIP] +> 🤗 Transformersのすべてのモデル(PyTorchまたはTensorFlow)は、最終的な活性化関数(softmaxなど)*前*のテンソルを出力します。 +> 最終的な活性化関数は、しばしば損失と結合されているためです。モデルの出力は特別なデータクラスであり、その属性はIDEで自動補完されます。 +> モデルの出力は、タプルまたは辞書のように動作します(整数、スライス、または文字列でインデックスを付けることができます)。 +> この場合、Noneである属性は無視されます。 ### Save a Model @@ -401,11 +389,8 @@ tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725], >>> trainer.train() # doctest: +SKIP ``` - - -翻訳や要約など、シーケンス間モデルを使用するタスクには、代わりに[`Seq2SeqTrainer`]と[`Seq2SeqTrainingArguments`]クラスを使用してください。 - - +> [!TIP] +> 翻訳や要約など、シーケンス間モデルを使用するタスクには、代わりに[`Seq2SeqTrainer`]と[`Seq2SeqTrainingArguments`]クラスを使用してください。 [`Trainer`]内のメソッドをサブクラス化することで、トレーニングループの動作をカスタマイズできます。これにより、損失関数、オプティマイザ、スケジューラなどの機能をカスタマイズできます。サブクラス化できるメソッドの一覧については、[`Trainer`]リファレンスをご覧ください。 diff --git a/docs/source/ja/serialization.md b/docs/source/ja/serialization.md index 3e9d81180de0..9970aa729421 100644 --- a/docs/source/ja/serialization.md +++ b/docs/source/ja/serialization.md @@ -131,11 +131,8 @@ CLIの代わりに、🤗 TransformersモデルをONNXにプログラム的に ### Exporting a model with `transformers.onnx` - - -`transformers.onnx`はもはやメンテナンスされていないため、モデルを上記で説明したように🤗 Optimumでエクスポートしてください。このセクションは将来のバージョンで削除されます。 - - +> [!WARNING] +> `transformers.onnx`はもはやメンテナンスされていないため、モデルを上記で説明したように🤗 Optimumでエクスポートしてください。このセクションは将来のバージョンで削除されます。 🤗 TransformersモデルをONNXにエクスポートするには、追加の依存関係をインストールしてください: diff --git 
a/docs/source/ja/tasks/asr.md b/docs/source/ja/tasks/asr.md index 4ccb31667423..78516e2e3b8e 100644 --- a/docs/source/ja/tasks/asr.md +++ b/docs/source/ja/tasks/asr.md @@ -27,11 +27,8 @@ rendered properly in your Markdown viewer. 1. [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) データセットの [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) を微調整して、音声をテキストに書き起こします。 2. 微調整したモデルを推論に使用します。 - - -このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/automatic-speech-recognition) を確認することをお勧めします。 - - +> [!TIP] +> このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/automatic-speech-recognition) を確認することをお勧めします。 始める前に、必要なライブラリがすべてインストールされていることを確認してください。 @@ -228,11 +225,8 @@ MInDS-14 データセットのサンプリング レートは 8000kHz です ( ## Train - - -[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 - - +> [!TIP] +> [`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForCTC`] で Wav2Vec2 をロードします。 `ctc_loss_reduction` パラメータで適用する削減を指定します。多くの場合、デフォルトの合計ではなく平均を使用する方が適切です。 @@ -294,11 +288,8 @@ MInDS-14 データセットのサンプリング レートは 8000kHz です ( ``` - - -自動音声認識用にモデルを微調整する方法のより詳細な例については、英語 ASR および英語のこのブログ [投稿](https://huggingface.co/blog/fine-tune-wav2vec2-english) を参照してください。多言語 ASR については、この [投稿](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) を参照してください。 - - +> [!TIP] +> 自動音声認識用にモデルを微調整する方法のより詳細な例については、英語 ASR および英語のこのブログ [投稿](https://huggingface.co/blog/fine-tune-wav2vec2-english) を参照してください。多言語 ASR については、この [投稿](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) を参照してください。 ## Inference @@ -325,11 +316,8 @@ MInDS-14 データセットのサンプリング レートは 8000kHz です ( {'text': 'I WOUD LIKE O SET UP JOINT ACOUNT WTH Y PARTNER'} ``` - - -転写はまあまあですが、もっと良くなる可能性があります。さらに良い結果を得るには、より多くの例でモデルを微調整してみてください。 - - +> [!TIP] +> 転写はまあまあですが、もっと良くなる可能性があります。さらに良い結果を得るには、より多くの例でモデルを微調整してみてください。 必要に応じて、「パイプライン」の結果を手動で複製することもできます。 diff --git a/docs/source/ja/tasks/audio_classification.md b/docs/source/ja/tasks/audio_classification.md index d37485cbe226..dac94b99229e 100644 --- a/docs/source/ja/tasks/audio_classification.md +++ b/docs/source/ja/tasks/audio_classification.md @@ -28,11 +28,8 @@ rendered properly in your Markdown viewer. 1. [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) データセットで [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) を微調整して話者の意図を分類します。 2. 微調整したモデルを推論に使用します。 - - -このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/audio-classification) を確認することをお勧めします。 - - +> [!TIP] +> このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/audio-classification) を確認することをお勧めします。 ```bash pip install transformers datasets evaluate @@ -186,11 +183,8 @@ MInDS-14 データセットのサンプリング レートは 8khz です (こ ## Train - - -[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[こちら](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 - - +> [!TIP] +> [`Trainer`] を使用したモデルの微調整に慣れていない場合は、[こちら](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForAudioClassification`] を使用して、予期されるラベルの数とラベル マッピングを使用して Wav2Vec2 を読み込みます。 @@ -244,11 +238,8 @@ MInDS-14 データセットのサンプリング レートは 8khz です (こ >>> trainer.push_to_hub() ``` - - -音声分類用のモデルを微調整する方法の詳細な例については、対応する [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb). 
- - +> [!TIP] +> 音声分類用のモデルを微調整する方法の詳細な例については、対応する [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb). ## Inference diff --git a/docs/source/ja/tasks/document_question_answering.md b/docs/source/ja/tasks/document_question_answering.md index f07cc6dff28e..4d1f5fe3dab8 100644 --- a/docs/source/ja/tasks/document_question_answering.md +++ b/docs/source/ja/tasks/document_question_answering.md @@ -28,11 +28,8 @@ rendered properly in your Markdown viewer. - [DocVQA データセット](https://huggingface.co/datasets/nielsr/docvqa_1200_examples_donut) の [LayoutLMv2](../model_doc/layoutlmv2) を微調整します。 - 微調整されたモデルを推論に使用します。 - - -このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/image-to-text) を確認することをお勧めします。 - - +> [!TIP] +> このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/image-to-text) を確認することをお勧めします。 LayoutLMv2 は、最後の非表示のヘッダーの上に質問応答ヘッドを追加することで、ドキュメントの質問応答タスクを解決します。 トークンの状態を調べて、トークンの開始トークンと終了トークンの位置を予測します。 diff --git a/docs/source/ja/tasks/idefics.md b/docs/source/ja/tasks/idefics.md index 83ed1278496e..48ea538de512 100644 --- a/docs/source/ja/tasks/idefics.md +++ b/docs/source/ja/tasks/idefics.md @@ -54,9 +54,8 @@ DeepMind によって最初に開発された最先端の視覚言語モデル pip install -q bitsandbytes sentencepiece accelerate transformers ``` - -量子化されていないバージョンのモデル チェックポイントを使用して次の例を実行するには、少なくとも 20GB の GPU メモリが必要です。 - +> [!TIP] +> 量子化されていないバージョンのモデル チェックポイントを使用して次の例を実行するには、少なくとも 20GB の GPU メモリが必要です。 ## Loading the model @@ -144,13 +143,11 @@ BOS (Beginning-of-sequence) トークンによりキャプションが作成さ A puppy in a flower bed ``` - - -増加時に発生するエラーを避けるために、`generate`の呼び出しに`bad_words_ids`を含めることをお勧めします。 -`max_new_tokens`: モデルは、新しい `` または `` トークンを生成する必要があります。 -モデルによって画像が生成されていません。 -このガイドのようにオンザフライで設定することも、[テキスト生成戦略](../generation_strategies) ガイドで説明されているように `GenerationConfig` に保存することもできます。 - +> [!TIP] +> 増加時に発生するエラーを避けるために、`generate`の呼び出しに`bad_words_ids`を含めることをお勧めします。 +> `max_new_tokens`: モデルは、新しい `` または `` トークンを生成する必要があります。 +> モデルによって画像が生成されていません。 +> このガイドのようにオンザフライで設定することも、[テキスト生成戦略](../generation_strategies) ガイドで説明されているように `GenerationConfig` に保存することもできます。 ## Prompted image captioning @@ -335,13 +332,10 @@ The little girl ran IDEFICS は玄関先にあるカボチャに気づき、幽霊に関する不気味なハロウィーンの話をしたようです。 - - -このような長い出力の場合、テキスト生成戦略を微調整すると大きなメリットが得られます。これは役に立ちます -生成される出力の品質が大幅に向上します。 [テキスト生成戦略](../generation_strategies) を確認してください。 -詳しく知ることができ。 - - +> [!TIP] +> このような長い出力の場合、テキスト生成戦略を微調整すると大きなメリットが得られます。これは役に立ちます +> 生成される出力の品質が大幅に向上します。 [テキスト生成戦略](../generation_strategies) を確認してください。 +> 詳しく知ることができ。 ## Running inference in batch mode diff --git a/docs/source/ja/tasks/image_captioning.md b/docs/source/ja/tasks/image_captioning.md index 7649947b2c64..bc74049d1cf6 100644 --- a/docs/source/ja/tasks/image_captioning.md +++ b/docs/source/ja/tasks/image_captioning.md @@ -64,11 +64,8 @@ DatasetDict({ データセットには `image`と`text`の 2 つの機能があります。 - - -多くの画像キャプション データセットには、画像ごとに複数のキャプションが含まれています。このような場合、一般的な戦略は、トレーニング中に利用可能なキャプションの中からランダムにキャプションをサンプリングすることです。 - - +> [!TIP] +> 多くの画像キャプション データセットには、画像ごとに複数のキャプションが含まれています。このような場合、一般的な戦略は、トレーニング中に利用可能なキャプションの中からランダムにキャプションをサンプリングすることです。 [`~datasets.Dataset.train_test_split`] メソッドを使用して、データセットのトレイン スプリットをトレイン セットとテスト セットに分割します。 diff --git a/docs/source/ja/tasks/image_classification.md b/docs/source/ja/tasks/image_classification.md index 32c30dcff7c8..ac2dde3ade97 100644 --- a/docs/source/ja/tasks/image_classification.md +++ b/docs/source/ja/tasks/image_classification.md @@ -30,11 +30,8 @@ rendered properly in your Markdown viewer. 1. 
[Food-101](https://huggingface.co/datasets/food101) データセットの [ViT](model_doc/vit) を微調整して、画像内の食品を分類します。 2. 微調整したモデルを推論に使用します。 - - -このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/image-classification) を確認することをお勧めします。 - - +> [!TIP] +> このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/image-classification) を確認することをお勧めします。 始める前に、必要なライブラリがすべてインストールされていることを確認してください。 @@ -180,11 +177,8 @@ Datasets、🤗 データセット ライブラリから Food-101 データセ ## Train - - -[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[こちら](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 - - +> [!TIP] +> [`Trainer`] を使用したモデルの微調整に慣れていない場合は、[こちら](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForImageClassification`] を使用して ViT をロードします。ラベルの数と予想されるラベルの数、およびラベル マッピングを指定します。 @@ -242,11 +236,8 @@ Datasets、🤗 データセット ライブラリから Food-101 データセ >>> trainer.push_to_hub() ``` - - -画像分類用のモデルを微調整する方法の詳細な例については、対応する [PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb) - - +> [!TIP] +> 画像分類用のモデルを微調整する方法の詳細な例については、対応する [PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb) ## Inference diff --git a/docs/source/ja/tasks/language_modeling.md b/docs/source/ja/tasks/language_modeling.md index d72ebb6a1046..537f22c6ffbe 100644 --- a/docs/source/ja/tasks/language_modeling.md +++ b/docs/source/ja/tasks/language_modeling.md @@ -35,11 +35,8 @@ rendered properly in your Markdown viewer. 1. [ELI5](https:/) の [r/askscience](https://www.reddit.com/r/askscience/) サブセットで [DistilGPT2](https://huggingface.co/distilbert/distilgpt2) を微調整します。 /huggingface.co/datasets/eli5) データセット。 2. 微調整したモデルを推論に使用します。 - - -このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/text-generation) を確認することをお勧めします。u - - +> [!TIP] +> このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/text-generation) を確認することをお勧めします。u 始める前に、必要なライブラリがすべてインストールされていることを確認してください。 @@ -199,11 +196,8 @@ Apply the `group_texts` function over the entire dataset: ## Train - - -[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[基本チュートリアル](../training#train-with-pytorch-trainer) を参照してください。 - - +> [!TIP] +> [`Trainer`] を使用したモデルの微調整に慣れていない場合は、[基本チュートリアル](../training#train-with-pytorch-trainer) を参照してください。 これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForCausalLM`] を使用して DistilGPT2 をロードします。 @@ -256,13 +250,10 @@ Perplexity: 49.61 >>> trainer.push_to_hub() ``` - - -因果言語モデリング用にモデルを微調整する方法のより詳細な例については、対応するドキュメントを参照してください。 -[PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) -または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb)。 - - +> [!TIP] +> 因果言語モデリング用にモデルを微調整する方法のより詳細な例については、対応するドキュメントを参照してください。 +> [PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) +> または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb)。 ## Inference diff --git a/docs/source/ja/tasks/masked_language_modeling.md b/docs/source/ja/tasks/masked_language_modeling.md index ff4107edb808..181a6b801f5e 100644 --- a/docs/source/ja/tasks/masked_language_modeling.md +++ b/docs/source/ja/tasks/masked_language_modeling.md @@ -29,11 +29,8 @@ rendered properly in your Markdown viewer. 1. 
[ELI5](https://huggingface.co/distilbert/distilroberta-base) の [r/askscience](https://www.reddit.com/r/askscience/) サブセットで [DistilRoBERTa](https://huggingface.co/distilbert/distilroberta-base) を微調整します。 ://huggingface.co/datasets/eli5) データセット。 2. 微調整したモデルを推論に使用します。 - - -このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/fill-mask) を確認することをお勧めします。 - - +> [!TIP] +> このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/fill-mask) を確認することをお勧めします。 始める前に、必要なライブラリがすべてインストールされていることを確認してください。 @@ -185,11 +182,8 @@ pip install transformers datasets evaluate ## Train - - -[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 - - +> [!TIP] +> [`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForMaskedLM`] を使用して DistilRoBERTa をロードします。 @@ -244,13 +238,10 @@ Perplexity: 8.76 ``` - - -マスクされた言語モデリング用にモデルを微調整する方法のより詳細な例については、対応するドキュメントを参照してください。 -[PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) -または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb)。 - - +> [!TIP] +> マスクされた言語モデリング用にモデルを微調整する方法のより詳細な例については、対応するドキュメントを参照してください。 +> [PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) +> または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb)。 ## Inference diff --git a/docs/source/ja/tasks/monocular_depth_estimation.md b/docs/source/ja/tasks/monocular_depth_estimation.md index e7a3a994a60e..d15f90f22e31 100644 --- a/docs/source/ja/tasks/monocular_depth_estimation.md +++ b/docs/source/ja/tasks/monocular_depth_estimation.md @@ -25,11 +25,8 @@ rendered properly in your Markdown viewer. 
シーンとそれに対応する深度情報(照明条件などの要因の影響を受ける可能性があります) オクルージョンとテクスチャ。 - - -このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/depth-estimation) を確認することをお勧めします。 - - +> [!TIP] +> このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/depth-estimation) を確認することをお勧めします。 このガイドでは、次の方法を学びます。 diff --git a/docs/source/ja/tasks/multiple_choice.md b/docs/source/ja/tasks/multiple_choice.md index d92ff913d606..f9b601c4b9aa 100644 --- a/docs/source/ja/tasks/multiple_choice.md +++ b/docs/source/ja/tasks/multiple_choice.md @@ -145,11 +145,8 @@ tokenized_swag = swag.map(preprocess_function, batched=True) ## Train - - -[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 - - +> [!TIP] +> [`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForMultipleChoice`] を使用して BERT をロードします。 @@ -199,13 +196,10 @@ tokenized_swag = swag.map(preprocess_function, batched=True) ``` - - -複数選択用にモデルを微調整する方法の詳細な例については、対応するセクションを参照してください。 -[PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb) -または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb)。 - - +> [!TIP] +> 複数選択用にモデルを微調整する方法の詳細な例については、対応するセクションを参照してください。 +> [PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb) +> または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb)。 # Inference diff --git a/docs/source/ja/tasks/object_detection.md b/docs/source/ja/tasks/object_detection.md index 1b6c03ee1032..5b3ea78fb6c5 100644 --- a/docs/source/ja/tasks/object_detection.md +++ b/docs/source/ja/tasks/object_detection.md @@ -32,11 +32,8 @@ rendered properly in your Markdown viewer. データセット。 2. 微調整したモデルを推論に使用します。 - - -このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/object-detection) を確認することをお勧めします。 - - +> [!TIP] +> このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/object-detection) を確認することをお勧めします。 始める前に、必要なライブラリがすべてインストールされていることを確認してください。 diff --git a/docs/source/ja/tasks/prompting.md b/docs/source/ja/tasks/prompting.md index ffafc7f9156f..0e4782628ae9 100644 --- a/docs/source/ja/tasks/prompting.md +++ b/docs/source/ja/tasks/prompting.md @@ -40,19 +40,16 @@ Falcon、LLaMA などの大規模言語モデルは、事前にトレーニン - [高度なプロンプト テクニック: 数回のプロンプトと思考の連鎖](#advanced-prompting-techniques) - [プロンプトを表示する代わりに微調整する場合](#prompting-vs-fine-tuning) - - -迅速なエンジニアリングは、LLM 出力最適化プロセスの一部にすぎません。もう 1 つの重要な要素は、 -最適なテキスト生成戦略。 LLM が生成時に後続の各トークンを選択する方法をカスタマイズできます。 -トレーニング可能なパラメータを一切変更せずにテキストを作成します。テキスト生成パラメータを微調整することで、 -生成されたテキストに繰り返しが含まれているため、より一貫性があり人間らしい響きになります。 -テキスト生成戦略とパラメーターはこのガイドの範囲外ですが、これらのトピックについて詳しくは、次のトピックを参照してください。 -次のガイド: - -* [LLM による生成](../llm_tutorial) -* [テキスト生成戦略](../generation_strategies) - - +> [!TIP] +> 迅速なエンジニアリングは、LLM 出力最適化プロセスの一部にすぎません。もう 1 つの重要な要素は、 +> 最適なテキスト生成戦略。 LLM が生成時に後続の各トークンを選択する方法をカスタマイズできます。 +> トレーニング可能なパラメータを一切変更せずにテキストを作成します。テキスト生成パラメータを微調整することで、 +> 生成されたテキストに繰り返しが含まれているため、より一貫性があり人間らしい響きになります。 +> テキスト生成戦略とパラメーターはこのガイドの範囲外ですが、これらのトピックについて詳しくは、次のトピックを参照してください。 +> 次のガイド: +> +> * [LLM による生成](../llm_tutorial) +> * [テキスト生成戦略](../generation_strategies) ## Basics of prompting @@ -133,12 +130,9 @@ pip install -q transformers accelerate ... 
) ``` - - -Falcon モデルは `bfloat16` データ型を使用してトレーニングされたため、同じものを使用することをお勧めします。これには、最近の -CUDA のバージョンに準拠しており、最新のカードで最適に動作します。 - - +> [!TIP] +> Falcon モデルは `bfloat16` データ型を使用してトレーニングされたため、同じものを使用することをお勧めします。これには、最近の +> CUDA のバージョンに準拠しており、最新のカードで最適に動作します。 パイプライン経由でモデルをロードしたので、プロンプトを使用して NLP タスクを解決する方法を見てみましょう。 @@ -171,13 +165,10 @@ Positive その結果、出力には、手順で提供したリストの分類ラベルが含まれており、それは正しいラベルです。 - - -プロンプトに加えて、`max_new_tokens`パラメータを渡していることに気づくかもしれません。トークンの数を制御します。 -モデルが生成します。これは、学習できる多くのテキスト生成パラメーターの 1 つです。 -[テキスト生成戦略](../generation_strategies) ガイドを参照してください。 - - +> [!TIP] +> プロンプトに加えて、`max_new_tokens`パラメータを渡していることに気づくかもしれません。トークンの数を制御します。 +> モデルが生成します。これは、学習できる多くのテキスト生成パラメーターの 1 つです。 +> [テキスト生成戦略](../generation_strategies) ガイドを参照してください。 #### Named Entity Recognition diff --git a/docs/source/ja/tasks/question_answering.md b/docs/source/ja/tasks/question_answering.md index a12205b5cd39..b6c5cd8c2770 100644 --- a/docs/source/ja/tasks/question_answering.md +++ b/docs/source/ja/tasks/question_answering.md @@ -44,13 +44,10 @@ >>> trainer.push_to_hub() ``` - - -質問応答用のモデルを微調整する方法の詳細な例については、対応するドキュメントを参照してください。 -[PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb) -または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb)。 - - +> [!TIP] +> 質問応答用のモデルを微調整する方法の詳細な例については、対応するドキュメントを参照してください。 +> [PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb) +> または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb)。 ## Evaluate diff --git a/docs/source/ja/tasks/semantic_segmentation.md b/docs/source/ja/tasks/semantic_segmentation.md index 4a1a141ba4ef..b82e4dc78168 100644 --- a/docs/source/ja/tasks/semantic_segmentation.md +++ b/docs/source/ja/tasks/semantic_segmentation.md @@ -27,11 +27,8 @@ rendered properly in your Markdown viewer. 1. [SceneParse150](https://huggingface.co/datasets/scene_parse_150) データセットの [SegFormer](https://huggingface.co/docs/transformers/main/en/model_doc/segformer#segformer) を微調整します。 2. 微調整したモデルを推論に使用します。 - - -このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/image-segmentation) を確認することをお勧めします。 - - +> [!TIP] +> このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/image-segmentation) を確認することをお勧めします。 始める前に、必要なライブラリがすべてインストールされていることを確認してください。 @@ -188,11 +185,8 @@ pip install -q datasets transformers evaluate これで`compute_metrics`関数の準備が整いました。トレーニングをセットアップするときにこの関数に戻ります。 ## Train - - -[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#finetune-with-trainer) の基本的なチュートリアルをご覧ください。 - - +> [!TIP] +> [`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#finetune-with-trainer) の基本的なチュートリアルをご覧ください。 これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForSemanticSegmentation`] を使用して SegFormer をロードし、ラベル ID とラベル クラス間のマッピングをモデルに渡します。 diff --git a/docs/source/ja/tasks/summarization.md b/docs/source/ja/tasks/summarization.md index c62583fdb281..52a8a998806f 100644 --- a/docs/source/ja/tasks/summarization.md +++ b/docs/source/ja/tasks/summarization.md @@ -30,11 +30,8 @@ rendered properly in your Markdown viewer. 1. 抽象的な要約のために、[BillSum](https://huggingface.co/datasets/billsum) データセットのカリフォルニア州請求書サブセットで [T5](https://huggingface.co/google-t5/t5-small) を微調整します。 2. 
微調整したモデルを推論に使用します。 - - -このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/summarization) を確認することをお勧めします。 - - +> [!TIP] +> このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/summarization) を確認することをお勧めします。 始める前に、必要なライブラリがすべてインストールされていることを確認してください。 @@ -160,12 +157,8 @@ pip install transformers datasets evaluate rouge_score ## Train - - - -[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[こちら](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 - - +> [!TIP] +> [`Trainer`] を使用したモデルの微調整に慣れていない場合は、[こちら](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForSeq2SeqLM`] を使用して T5 をロードします。 @@ -216,13 +209,10 @@ pip install transformers datasets evaluate rouge_score >>> trainer.push_to_hub() ``` - - -要約用にモデルを微調整する方法のより詳細な例については、対応するセクションを参照してください。 -[PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb) -または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb)。 - - +> [!TIP] +> 要約用にモデルを微調整する方法のより詳細な例については、対応するセクションを参照してください。 +> [PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb) +> または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb)。 ## Inference diff --git a/docs/source/ja/tasks/text-to-speech.md b/docs/source/ja/tasks/text-to-speech.md index 1c22dfd71a7b..59819da72b7a 100644 --- a/docs/source/ja/tasks/text-to-speech.md +++ b/docs/source/ja/tasks/text-to-speech.md @@ -66,15 +66,12 @@ SpeechT5 のすべての機能がまだ正式リリースにマージされて pip install git+https://github.com/huggingface/transformers.git ``` - - -このガイドに従うには、GPU が必要です。ノートブックで作業している場合は、次の行を実行して GPU が利用可能かどうかを確認します。 - -```bash -!nvidia-smi -``` - - +> [!TIP] +> このガイドに従うには、GPU が必要です。ノートブックで作業している場合は、次の行を実行して GPU が利用可能かどうかを確認します。 +> +> ```bash +> !nvidia-smi +> ``` Hugging Face アカウントにログインして、モデルをアップロードしてコミュニティと共有することをお勧めします。プロンプトが表示されたら、トークンを入力してログインします。 diff --git a/docs/source/ja/tasks/token_classification.md b/docs/source/ja/tasks/token_classification.md index f8dbd9740176..4d850358c5ac 100644 --- a/docs/source/ja/tasks/token_classification.md +++ b/docs/source/ja/tasks/token_classification.md @@ -27,11 +27,8 @@ rendered properly in your Markdown viewer. 1. [WNUT 17](https://huggingface.co/datasets/wnut_17) データセットで [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) を微調整して、新しいエンティティを検出します。 2. 微調整されたモデルを推論に使用します。 - - -このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/token-classification) を確認することをお勧めします。 - - +> [!TIP] +> このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/token-classification) を確認することをお勧めします。 始める前に、必要なライブラリがすべてインストールされていることを確認してください。 @@ -240,11 +237,8 @@ pip install transformers datasets evaluate seqeval ... 
} ``` - - -[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 - - +> [!TIP] +> [`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForTokenClassification`] を使用して、予期されるラベルの数とラベル マッピングを指定して DistilBERT を読み込みます。 @@ -295,14 +289,10 @@ pip install transformers datasets evaluate seqeval >>> trainer.push_to_hub() ``` - - -トークン分類のモデルを微調整する方法のより詳細な例については、対応するセクションを参照してください。 -[PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb) -または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb)。 - - - +> [!TIP] +> トークン分類のモデルを微調整する方法のより詳細な例については、対応するセクションを参照してください。 +> [PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb) +> または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb)。 ## Inference diff --git a/docs/source/ja/tasks/translation.md b/docs/source/ja/tasks/translation.md index e7ce04d47c1a..714e87594e5a 100644 --- a/docs/source/ja/tasks/translation.md +++ b/docs/source/ja/tasks/translation.md @@ -27,11 +27,8 @@ rendered properly in your Markdown viewer. 1. [OPUS Books](https://huggingface.co/datasets/opus_books) データセットの英語-フランス語サブセットの [T5](https://huggingface.co/google-t5/t5-small) を微調整して、英語のテキストを次の形式に翻訳します。フランス語。 2. 微調整されたモデルを推論に使用します。 - - -このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/translation) を確認することをお勧めします。 - - +> [!TIP] +> このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/translation) を確認することをお勧めします。 始める前に、必要なライブラリがすべてインストールされていることを確認してください。 @@ -165,11 +162,8 @@ pip install transformers datasets evaluate sacrebleu ## Train - - -[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 - - +> [!TIP] +> [`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForSeq2SeqLM`] を使用して T5 をロードします。 @@ -220,13 +214,10 @@ pip install transformers datasets evaluate sacrebleu >>> trainer.push_to_hub() ``` - - -翻訳用にモデルを微調整する方法の詳細な例については、対応するドキュメントを参照してください。 -[PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb) -または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb)。 - - +> [!TIP] +> 翻訳用にモデルを微調整する方法の詳細な例については、対応するドキュメントを参照してください。 +> [PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb) +> または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb)。 ## Inference diff --git a/docs/source/ja/tasks/video_classification.md b/docs/source/ja/tasks/video_classification.md index e7e7803c9408..5657df3dbd33 100644 --- a/docs/source/ja/tasks/video_classification.md +++ b/docs/source/ja/tasks/video_classification.md @@ -26,11 +26,8 @@ rendered properly in your Markdown viewer. 1. [UCF101](https://www.crcv.ucf.edu/) のサブセットで [VideoMAE](https://huggingface.co/docs/transformers/main/en/model_doc/videomae) を微調整します。 data/UCF101.php) データセット。 2. 
微調整したモデルを推論に使用します。 - - -このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/video-classification) を確認することをお勧めします。 - - +> [!TIP] +> このタスクと互換性のあるすべてのアーキテクチャとチェックポイントを確認するには、[タスクページ](https://huggingface.co/tasks/video-classification) を確認することをお勧めします。 始める前に、必要なライブラリがすべてインストールされていることを確認してください。 ```bash diff --git a/docs/source/ja/tasks_explained.md b/docs/source/ja/tasks_explained.md index f619c2b220bd..06411a101b02 100644 --- a/docs/source/ja/tasks_explained.md +++ b/docs/source/ja/tasks_explained.md @@ -29,11 +29,8 @@ rendered properly in your Markdown viewer. - [GPT2](model_doc/gpt2):デコーダを使用するテキスト生成などのNLPタスク向け - [BART](model_doc/bart):エンコーダ-デコーダを使用する要約および翻訳などのNLPタスク向け - - -さらに進む前に、元のTransformerアーキテクチャの基本的な知識を持つと良いです。エンコーダ、デコーダ、および注意力がどのように動作するかを知っておくと、異なるTransformerモデルがどのように動作するかを理解するのに役立ちます。始めているか、リフレッシュが必要な場合は、詳細な情報については当社の[コース](https://huggingface.co/course/chapter1/4?fw=pt)をチェックしてください! - - +> [!TIP] +> さらに進む前に、元のTransformerアーキテクチャの基本的な知識を持つと良いです。エンコーダ、デコーダ、および注意力がどのように動作するかを知っておくと、異なるTransformerモデルがどのように動作するかを理解するのに役立ちます。始めているか、リフレッシュが必要な場合は、詳細な情報については当社の[コース](https://huggingface.co/course/chapter1/4?fw=pt)をチェックしてください! ## Speech and audio @@ -74,11 +71,8 @@ rendered properly in your Markdown viewer. 1. 画像をパッチのシーケンスに分割し、Transformerを使用して並列に処理します。 2. [ConvNeXT](model_doc/convnext)などのモダンなCNNを使用します。これらは畳み込み層を使用しますが、モダンなネットワーク設計を採用しています。 - - -サードアプローチでは、Transformerと畳み込みを組み合わせたものもあります(例:[Convolutional Vision Transformer](model_doc/cvt)または[LeViT](model_doc/levit))。これらについては議論しませんが、これらはここで調べる2つのアプローチを組み合わせています。 - - +> [!TIP] +> サードアプローチでは、Transformerと畳み込みを組み合わせたものもあります(例:[Convolutional Vision Transformer](model_doc/cvt)または[LeViT](model_doc/levit))。これらについては議論しませんが、これらはここで調べる2つのアプローチを組み合わせています。 ViTとConvNeXTは画像分類によく使用されますが、オブジェクト検出、セグメンテーション、深度推定などの他のビジョンタスクに対しては、DETR、Mask2Former、GLPNなどが適しています。 @@ -109,11 +103,8 @@ ViTが導入した主な変更点は、画像をTransformerに供給する方法 #### CNN - - -このセクションでは畳み込みについて簡単に説明していますが、画像の形状とサイズがどのように変化するかを事前に理解していると役立ちます。畳み込みに慣れていない場合は、fastaiの書籍から[Convolution Neural Networks chapter](https://github.com/fastai/fastbook/blob/master/13_convolutions.ipynb)をチェックしてみてください! - - +> [!TIP] +> このセクションでは畳み込みについて簡単に説明していますが、画像の形状とサイズがどのように変化するかを事前に理解していると役立ちます。畳み込みに慣れていない場合は、fastaiの書籍から[Convolution Neural Networks chapter](https://github.com/fastai/fastbook/blob/master/13_convolutions.ipynb)をチェックしてみてください! [ConvNeXT](model_doc/convnext)は、性能を向上させるために新しいモダンなネットワーク設計を採用したCNNアーキテクチャです。ただし、畳み込みはモデルの中核にまだあります。高レベルから見た場合、[畳み込み(convolution)](glossary#convolution)は、小さな行列(*カーネル*)が画像のピクセルの小さなウィンドウに乗算される操作です。それは特定のテクスチャや線の曲率などの特徴を計算します。その後、次のピクセルのウィンドウに移動します。畳み込みが移動する距離は*ストライド*として知られています。 @@ -235,11 +226,8 @@ BERTを質問応答に使用するには、ベースのBERTモデルの上にス 質問応答を試してみる準備はできましたか?DistilBERTを微調整し、推論に使用する方法を学ぶために、完全な[質問応答ガイド](tasks/question_answering)をチェックしてみてください! - - -💡 注意してください。一度事前トレーニングが完了したBERTを使用してさまざまなタスクに簡単に適用できることに注目してください。必要なのは、事前トレーニング済みモデルに特定のヘッドを追加して、隠れた状態を所望の出力に変換することだけです! - - +> [!TIP] +> 💡 注意してください。一度事前トレーニングが完了したBERTを使用してさまざまなタスクに簡単に適用できることに注目してください。必要なのは、事前トレーニング済みモデルに特定のヘッドを追加して、隠れた状態を所望の出力に変換することだけです! ### Text generation @@ -257,11 +245,8 @@ GPT-2の事前トレーニングの目標は完全に[因果言語モデリン テキスト生成を試してみる準備はできましたか?DistilGPT-2を微調整し、推論に使用する方法を学ぶために、完全な[因果言語モデリングガイド](tasks/language_modeling#causal-language-modeling)をチェックしてみてください! - - -テキスト生成に関する詳細は、[テキスト生成戦略](generation_strategies)ガイドをチェックしてみてください! - - +> [!TIP] +> テキスト生成に関する詳細は、[テキスト生成戦略](generation_strategies)ガイドをチェックしてみてください! 
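A minimal sketch of the causal text generation described above, assuming `distilgpt2` purely as an example checkpoint and a placeholder prompt; any causal language model checkpoint works the same way through the pipeline API:

```python
from transformers import pipeline

# Causal text generation with a small GPT-2 variant (example checkpoint only).
generator = pipeline("text-generation", model="distilgpt2")

result = generator("Hugging Face Transformers makes it easy to", max_new_tokens=30)
print(result[0]["generated_text"])
```

For finer control over decoding (sampling, beam search, repetition penalties), the same call accepts the generation parameters covered in the text generation strategies guide linked above.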
### Summarization @@ -278,11 +263,8 @@ GPT-2の事前トレーニングの目標は完全に[因果言語モデリン 要約を試す準備はできましたか?T5を微調整して推論に使用する方法を学ぶために、完全な[要約ガイド](tasks/summarization)をご覧ください! - - -テキスト生成に関する詳細は、[テキスト生成戦略](generation_strategies)ガイドをチェックしてみてください! - - +> [!TIP] +> テキスト生成に関する詳細は、[テキスト生成戦略](generation_strategies)ガイドをチェックしてみてください! ### Translation @@ -294,8 +276,5 @@ BARTは、ソース言語をターゲット言語にデコードできるよう 翻訳を試す準備はできましたか?T5を微調整して推論に使用する方法を学ぶために、完全な[翻訳ガイド](tasks/summarization)をご覧ください! - - -テキスト生成に関する詳細は、[テキスト生成戦略](generation_strategies)ガイドをチェックしてみてください! - - +> [!TIP] +> テキスト生成に関する詳細は、[テキスト生成戦略](generation_strategies)ガイドをチェックしてみてください! diff --git a/docs/source/ja/testing.md b/docs/source/ja/testing.md index 5425861a1d19..2735b2ca58b0 100644 --- a/docs/source/ja/testing.md +++ b/docs/source/ja/testing.md @@ -298,18 +298,11 @@ pip install pytest-flakefinder pytest --flake-finder --flake-runs=5 tests/test_failing_test.py ``` - +> [!TIP] +> このプラグインは、`pytest-xdist` の `-n` フラグでは動作しません。 -このプラグインは、`pytest-xdist` の `-n` フラグでは動作しません。 - - - - - - -別のプラグイン `pytest-repeat` もありますが、これは `unittest` では動作しません。 - - +> [!TIP] +> 別のプラグイン `pytest-repeat` もありますが、これは `unittest` では動作しません。 #### Run tests in a random order @@ -758,17 +751,11 @@ def test_whatever(self): - `after=True`:テストの終了時に常に一時ディレクトリが削除されます。 - `after=False`:テストの終了時に常に一時ディレクトリはそのままになります。 - - -`rm -r`の相当を安全に実行するために、明示的な `tmp_dir` が使用される場合、プロジェクトリポジトリのチェックアウトのサブディレクトリのみが許可されます。誤って `/tmp` などのファイルシステムの重要な部分が削除されないように、常に `./` から始まるパスを渡してください。 - - - - - -各テストは複数の一時ディレクトリを登録でき、要求がない限りすべて自動で削除されます。 +> [!TIP] +> `rm -r`の相当を安全に実行するために、明示的な `tmp_dir` が使用される場合、プロジェクトリポジトリのチェックアウトのサブディレクトリのみが許可されます。誤って `/tmp` などのファイルシステムの重要な部分が削除されないように、常に `./` から始まるパスを渡してください。 - +> [!TIP] +> 各テストは複数の一時ディレクトリを登録でき、要求がない限りすべて自動で削除されます。 ### Temporary sys.path override diff --git a/docs/source/ja/torchscript.md b/docs/source/ja/torchscript.md index 27d64a625c8c..81fdca34b9b7 100644 --- a/docs/source/ja/torchscript.md +++ b/docs/source/ja/torchscript.md @@ -16,11 +16,8 @@ rendered properly in your Markdown viewer. # Export to TorchScript - - -これはTorchScriptを使用した実験の最初であり、可変入力サイズのモデルに対するその能力をまだ探求中です。これは私たちの関心の焦点であり、今後のリリースでは、より柔軟な実装や、PythonベースのコードとコンパイルされたTorchScriptを比較するベンチマークを含む、より多くのコード例で詳細な分析を行います。 - - +> [!TIP] +> これはTorchScriptを使用した実験の最初であり、可変入力サイズのモデルに対するその能力をまだ探求中です。これは私たちの関心の焦点であり、今後のリリースでは、より柔軟な実装や、PythonベースのコードとコンパイルされたTorchScriptを比較するベンチマークを含む、より多くのコード例で詳細な分析を行います。 [TorchScriptのドキュメント](https://pytorch.org/docs/stable/jit.html)によれば: diff --git a/docs/source/ja/training.md b/docs/source/ja/training.md index b90f2a1f53ed..519a2ee0a222 100644 --- a/docs/source/ja/training.md +++ b/docs/source/ja/training.md @@ -92,12 +92,9 @@ rendered properly in your Markdown viewer. >>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` - - -一部の事前学習済みの重みが使用されず、一部の重みがランダムに初期化された警告が表示されることがあります。心配しないでください、これは完全に正常です! -BERTモデルの事前学習済みのヘッドは破棄され、ランダムに初期化された分類ヘッドで置き換えられます。この新しいモデルヘッドをシーケンス分類タスクでファインチューニングし、事前学習モデルの知識をそれに転送します。 - - +> [!TIP] +> 一部の事前学習済みの重みが使用されず、一部の重みがランダムに初期化された警告が表示されることがあります。心配しないでください、これは完全に正常です! 
+> BERTモデルの事前学習済みのヘッドは破棄され、ランダムに初期化された分類ヘッドで置き換えられます。この新しいモデルヘッドをシーケンス分類タスクでファインチューニングし、事前学習モデルの知識をそれに転送します。 ### Training Hyperparameters @@ -255,11 +252,8 @@ PyTorchから[`AdamW`](https://pytorch.org/docs/stable/generated/torch.optim.Ada >>> model.to(device) ``` - - -クラウドGPUが利用できない場合、[Colaboratory](https://colab.research.google.com/)や[SageMaker StudioLab](https://studiolab.sagemaker.aws/)などのホストされたノートブックを使用して無料でGPUにアクセスできます。 - - +> [!TIP] +> クラウドGPUが利用できない場合、[Colaboratory](https://colab.research.google.com/)や[SageMaker StudioLab](https://studiolab.sagemaker.aws/)などのホストされたノートブックを使用して無料でGPUにアクセスできます。 さて、トレーニングの準備が整いました! 🥳 diff --git a/docs/source/ja/troubleshooting.md b/docs/source/ja/troubleshooting.md index b13b5993171a..fb1274159220 100644 --- a/docs/source/ja/troubleshooting.md +++ b/docs/source/ja/troubleshooting.md @@ -55,11 +55,8 @@ Please try again or make sure your Internet connection is on. - [`TrainingArguments`]の中で [`per_device_train_batch_size`](main_classes/trainer#transformers.TrainingArguments.per_device_train_batch_size) の値を減らす。 - [`TrainingArguments`]の中で [`gradient_accumulation_steps`](main_classes/trainer#transformers.TrainingArguments.gradient_accumulation_steps) を使用して、全体的なバッチサイズを効果的に増やすことを試す。 - - -メモリ節約のテクニックについての詳細は、[ガイド](performance)を参照してください。 - - +> [!TIP] +> メモリ節約のテクニックについての詳細は、[ガイド](performance)を参照してください。 ## Unable to load a saved TensorFlow model @@ -158,11 +155,8 @@ tensor([[-0.1008, -0.4061]], grad_fn=) 大抵の場合、モデルには `attention_mask` を提供して、パディングトークンを無視し、このような無音のエラーを回避する必要があります。これにより、2番目のシーケンスの出力が実際の出力と一致するようになります。 - - -デフォルトでは、トークナイザは、トークナイザのデフォルトに基づいて `attention_mask` を自動で作成します。 - - +> [!TIP] +> デフォルトでは、トークナイザは、トークナイザのデフォルトに基づいて `attention_mask` を自動で作成します。 ```py >>> attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1], [1, 0, 0, 0, 0, 0]]) diff --git a/docs/source/ko/add_new_model.md b/docs/source/ko/add_new_model.md index e30c2dc9f0d2..8581c9dce135 100644 --- a/docs/source/ko/add_new_model.md +++ b/docs/source/ko/add_new_model.md @@ -529,11 +529,8 @@ pytest tests/models/brand_new_bert/test_modeling_brand_new_bert.py RUN_SLOW=1 pytest -sv tests/models/brand_new_bert/test_modeling_brand_new_bert.py::BrandNewBertModelIntegrationTests ``` - - -Windows를 사용하는 경우 `RUN_SLOW=1`을 `SET RUN_SLOW=1`로 바꿔야 합니다. - - +> [!TIP] +> Windows를 사용하는 경우 `RUN_SLOW=1`을 `SET RUN_SLOW=1`로 바꿔야 합니다. 둘째로, *brand_new_bert*에 특화된 모든 기능도 별도의 테스트에서 추가로 테스트해야 합니다. 이 부분은 종종 잊히는데, 두 가지 측면에서 굉장히 유용합니다. diff --git a/docs/source/ko/chat_templating.md b/docs/source/ko/chat_templating.md index 922b7d885659..8bd934382ac5 100644 --- a/docs/source/ko/chat_templating.md +++ b/docs/source/ko/chat_templating.md @@ -212,10 +212,9 @@ The sun. 여기서부터는 일반적인 언어 모델 작업과 같이 `formatted_chat` 열을 사용하여 훈련을 계속하면 됩니다. - -`apply_chat_template(tokenize=False)`로 텍스트를 형식화한 다음 별도의 단계에서 토큰화하는 경우, `add_special_tokens=False` 인수를 설정해야 합니다. `apply_chat_template(tokenize=True)`를 사용하는 경우에는 이 문제를 걱정할 필요가 없습니다! -기본적으로 일부 토크나이저는 토큰화할 때 `` 및 ``와 같은 특별 토큰을 추가합니다. 채팅 템플릿은 항상 필요한 모든 특별 토큰을 포함해야 하므로, 기본 `add_special_tokens=True`로 추가적인 특별 토큰을 추가하면 잘못되거나 중복되는 특별 토큰을 생성하여 모델 성능이 저하될 수 있습니다. - +> [!TIP] +> `apply_chat_template(tokenize=False)`로 텍스트를 형식화한 다음 별도의 단계에서 토큰화하는 경우, `add_special_tokens=False` 인수를 설정해야 합니다. `apply_chat_template(tokenize=True)`를 사용하는 경우에는 이 문제를 걱정할 필요가 없습니다! +> 기본적으로 일부 토크나이저는 토큰화할 때 `` 및 ``와 같은 특별 토큰을 추가합니다. 채팅 템플릿은 항상 필요한 모든 특별 토큰을 포함해야 하므로, 기본 `add_special_tokens=True`로 추가적인 특별 토큰을 추가하면 잘못되거나 중복되는 특별 토큰을 생성하여 모델 성능이 저하될 수 있습니다. 
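A minimal sketch of the two tokenization paths described in the tip above, assuming `HuggingFaceH4/zephyr-7b-beta` only as an example of a checkpoint that ships a chat template:

```python
from transformers import AutoTokenizer

# Example chat-model checkpoint; any tokenizer with a chat template behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "Doing great, thanks!"},
]

# Path 1: format first, tokenize later. The template already inserts every special token
# it needs, so disable add_special_tokens to avoid duplicating them.
formatted = tokenizer.apply_chat_template(chat, tokenize=False)
encoding = tokenizer(formatted, add_special_tokens=False)

# Path 2: let apply_chat_template tokenize directly; no extra flag is needed here.
token_ids = tokenizer.apply_chat_template(chat, tokenize=True)
```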
## 고급: 채팅 템플릿에 추가 입력 사용[[advanced-extra-inputs-to-chat-templates]] @@ -375,9 +374,8 @@ The current temperature in Paris, France is 22.0 ° Celsius.<|im_end|> 이것은 더미 도구와 단일 호출을 사용한 간단한 데모였지만, 동일한 기술을 사용하여 여러 실제 도구와 더 긴 대화를 처리할 수 있습니다. 이를 통해 실시간 정보, 계산 도구 또는 대규모 데이터베이스에 접근하여 대화형 에이전트의 기능을 확장할 수 있습니다. - -위에서 보여준 도구 호출 기능은 모든 모델에서 사용되는 것은 아닙니다. 일부 모델은 도구 호출 ID를 사용하고, 일부는 함수 이름만 사용하여 결과와 도구 호출을 순서에 따라 매칭하며, 혼동을 피하기 위해 한 번에 하나의 도구 호출만 발행하는 모델도 있습니다. 가능한 많은 모델과 호환되는 코드를 원한다면, 여기에 보여준 것처럼 도구 호출을 구성하고, 모델이 발행한 순서대로 도구 결과를 반환하는 것을 권장합니다. 각 모델의 채팅 템플릿이 나머지 작업을 처리할 것입니다. - +> [!TIP] +> 위에서 보여준 도구 호출 기능은 모든 모델에서 사용되는 것은 아닙니다. 일부 모델은 도구 호출 ID를 사용하고, 일부는 함수 이름만 사용하여 결과와 도구 호출을 순서에 따라 매칭하며, 혼동을 피하기 위해 한 번에 하나의 도구 호출만 발행하는 모델도 있습니다. 가능한 많은 모델과 호환되는 코드를 원한다면, 여기에 보여준 것처럼 도구 호출을 구성하고, 모델이 발행한 순서대로 도구 결과를 반환하는 것을 권장합니다. 각 모델의 채팅 템플릿이 나머지 작업을 처리할 것입니다. ### 도구 스키마 이해하기[[understanding-tool-schemas]] @@ -591,9 +589,8 @@ tokenizer.push_to_hub("model_name") # 새 템플릿을 허브에 업로드! 채팅 템플릿을 사용하는 [`~PreTrainedTokenizer.apply_chat_template`] 메소드는 [`TextGenerationPipeline`] 클래스에서 호출되므로, 올바른 채팅 템플릿을 설정하면 모델이 자동으로 [`TextGenerationPipeline`]과 호환됩니다. - -모델을 채팅 용도로 미세 조정하는 경우, 채팅 템플릿을 설정하는 것 외에도 새 채팅 제어 토큰을 토크나이저에 특별 토큰으로 추가하는 것이 좋습니다. 특별 토큰은 절대로 분할되지 않으므로, 제어 토큰이 여러 조각으로 토큰화되는 것을 방지합니다. 또한, 템플릿에서 어시스턴트 생성의 끝을 나타내는 토큰으로 토크나이저의 `eos_token` 속성을 설정해야 합니다. 이렇게 하면 텍스트 생성 도구가 텍스트 생성을 언제 중지해야 할지 정확히 알 수 있습니다. - +> [!TIP] +> 모델을 채팅 용도로 미세 조정하는 경우, 채팅 템플릿을 설정하는 것 외에도 새 채팅 제어 토큰을 토크나이저에 특별 토큰으로 추가하는 것이 좋습니다. 특별 토큰은 절대로 분할되지 않으므로, 제어 토큰이 여러 조각으로 토큰화되는 것을 방지합니다. 또한, 템플릿에서 어시스턴트 생성의 끝을 나타내는 토큰으로 토크나이저의 `eos_token` 속성을 설정해야 합니다. 이렇게 하면 텍스트 생성 도구가 텍스트 생성을 언제 중지해야 할지 정확히 알 수 있습니다. ### 왜 일부 모델은 여러 개의 템플릿을 가지고 있나요?[[why-do-some-models-have-multiple-templates]] diff --git a/docs/source/ko/contributing.md b/docs/source/ko/contributing.md index f005ac0a569a..e84d655061eb 100644 --- a/docs/source/ko/contributing.md +++ b/docs/source/ko/contributing.md @@ -269,11 +269,8 @@ python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classific 기본적으로 느린 테스트는 건너뛰지만 `RUN_SLOW` 환경 변수를 `yes`로 설정하여 실행할 수 있습니다. 이렇게 하면 많은 기가바이트 단위의 모델이 다운로드되므로 충분한 디스크 공간, 좋은 인터넷 연결과 많은 인내가 필요합니다! - - -테스트를 실행하려면 *하위 폴더 경로 또는 테스트 파일 경로*를 지정하세요. 그렇지 않으면 `tests` 또는 `examples` 폴더의 모든 테스트를 실행하게 되어 매우 긴 시간이 걸립니다! - - +> [!WARNING] +> 테스트를 실행하려면 *하위 폴더 경로 또는 테스트 파일 경로*를 지정하세요. 그렇지 않으면 `tests` 또는 `examples` 폴더의 모든 테스트를 실행하게 되어 매우 긴 시간이 걸립니다! ```bash RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model diff --git a/docs/source/ko/conversations.md b/docs/source/ko/conversations.md index ee61d41dd3d7..4bd3e86bf634 100644 --- a/docs/source/ko/conversations.md +++ b/docs/source/ko/conversations.md @@ -265,11 +265,8 @@ pipe = pipeline("text-generation", "meta-llama/Meta-Llama-3-8B-Instruct", device ### 성능 고려사항[[performance-considerations]] - - -언어 모델 성능과 최적화에 대한 보다 자세한 가이드는 [LLM Inference Optimization](./llm_optims)을 참고하세요. - - +> [!TIP] +> 언어 모델 성능과 최적화에 대한 보다 자세한 가이드는 [LLM Inference Optimization](./llm_optims)을 참고하세요. 일반적으로 더 큰 채팅 모델은 메모리를 더 많이 요구하고, diff --git a/docs/source/ko/custom_models.md b/docs/source/ko/custom_models.md index 1e76608b1520..86c29b8efd45 100644 --- a/docs/source/ko/custom_models.md +++ b/docs/source/ko/custom_models.md @@ -177,11 +177,8 @@ class ResnetModelForImageClassification(PreTrainedModel): 두 경우 모두 `PreTrainedModel`를 상속받고, `config`를 통해 상위 클래스 초기화를 호출하다는 점을 기억하세요 (일반적인 `torch.nn.Module`을 작성할 때와 비슷함). 모델을 auto 클래스에 등록하고 싶은 경우에는 `config_class`를 설정하는 부분이 필수입니다 (마지막 섹션 참조). 
- - -라이브러리에 존재하는 모델과 굉장히 유사하다면, 모델을 생성할 때 구성을 참조해 재사용할 수 있습니다. - - +> [!TIP] +> 라이브러리에 존재하는 모델과 굉장히 유사하다면, 모델을 생성할 때 구성을 참조해 재사용할 수 있습니다. 원하는 것을 모델이 반환하도록 할 수 있지만, `ResnetModelForImageClassification`에서 했던 것 처럼 레이블을 통과시켰을 때 손실과 함께 사전 형태로 반환하는 것이 [`Trainer`] 클래스 내에서 직접 모델을 사용하기에 유용합니다. @@ -213,11 +210,8 @@ resnet50d.model.load_state_dict(pretrained_model.state_dict()) ## Hub로 코드 업로드하기[[sending-the-code-to-the-hub]] - - -이 API는 실험적이며 다음 릴리스에서 약간의 변경 사항이 있을 수 있습니다. - - +> [!WARNING] +> 이 API는 실험적이며 다음 릴리스에서 약간의 변경 사항이 있을 수 있습니다. 먼저 모델이 `.py` 파일에 완전히 정의되어 있는지 확인하세요. 모든 파일이 동일한 작업 경로에 있기 때문에 상대경로 임포트(relative import)에 의존할 수 있습니다 (transformers에서는 이 기능에 대한 하위 모듈을 지원하지 않습니다). @@ -234,12 +228,9 @@ resnet50d.model.load_state_dict(pretrained_model.state_dict()) Python이 `resnet_model`을 모듈로 사용할 수 있도록 감지하는 목적이기 때문에 `__init__.py`는 비어 있을 수 있습니다. - - -라이브러리에서 모델링 파일을 복사하는 경우, -모든 파일 상단에 있는 상대 경로 임포트(relative import) 부분을 `transformers` 패키지에서 임포트 하도록 변경해야 합니다. - - +> [!WARNING] +> 라이브러리에서 모델링 파일을 복사하는 경우, +> 모든 파일 상단에 있는 상대 경로 임포트(relative import) 부분을 `transformers` 패키지에서 임포트 하도록 변경해야 합니다. 기존 구성이나 모델을 재사용(또는 서브 클래스화)할 수 있습니다. diff --git a/docs/source/ko/debugging.md b/docs/source/ko/debugging.md index 24b2c7b04b50..79f5ac279e0a 100644 --- a/docs/source/ko/debugging.md +++ b/docs/source/ko/debugging.md @@ -48,23 +48,14 @@ NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 to ## 언더플로 및 오버플로 감지 [[underflow-and-overflow-detection]] - +> [!TIP] +> 이 기능은 현재 PyTorch에서만 사용할 수 있습니다. -이 기능은 현재 PyTorch에서만 사용할 수 있습니다. +> [!TIP] +> 다중 GPU 훈련을 위해서는 DDP (`torch.distributed.launch`)가 필요합니다. - - - - -다중 GPU 훈련을 위해서는 DDP (`torch.distributed.launch`)가 필요합니다. - - - - - -이 기능은 `nn.Module`을 기반으로 하는 모델과 함께 사용할 수 있습니다. - - +> [!TIP] +> 이 기능은 `nn.Module`을 기반으로 하는 모델과 함께 사용할 수 있습니다. `loss=NaN`이 나타나거나 모델이 `inf` 또는 `nan`으로 인해 다른 이상한 동작을 하는 경우, 언더플로 또는 오버플로의 첫 번째 발생 위치와 그 원인을 파악해야 합니다. 다행히도 이를 자동으로 감지하는 특수 모듈을 활성화하여 쉽게 알아낼 수 있습니다. diff --git a/docs/source/ko/deepspeed.md b/docs/source/ko/deepspeed.md index d0955ee3db80..b3b7a4fab9b3 100644 --- a/docs/source/ko/deepspeed.md +++ b/docs/source/ko/deepspeed.md @@ -30,11 +30,8 @@ GPU가 제한된 환경에서 ZeRO는 최적화 메모리와 계산을 GPU에서 DeepSpeed는 PyPI 또는 Transformers에서 설치할 수 있습니다(자세한 설치 옵션은 DeepSpeed [설치 상세사항](https://www.deepspeed.ai/tutorials/advanced-install/) 또는 GitHub [README](https://github.com/deepspeedai/DeepSpeed#installation)를 참조하세요). - - -DeepSpeed를 설치하는 데 문제가 있는 경우 [DeepSpeed CUDA 설치](../debugging#deepspeed-cuda-installation) 가이드를 확인하세요. DeepSpeed에는 pip 설치 가능한 PyPI 패키지로 설치할 수 있지만, 하드웨어에 가장 잘 맞고 PyPI 배포판에서는 제공되지 않는 1비트 Adam과 같은 특정 기능을 지원하려면 [소스에서 설치하기](https://www.deepspeed.ai/tutorials/advanced-install/#install-deepspeed-from-source)를 적극 권장합니다. - - +> [!TIP] +> DeepSpeed를 설치하는 데 문제가 있는 경우 [DeepSpeed CUDA 설치](../debugging#deepspeed-cuda-installation) 가이드를 확인하세요. DeepSpeed에는 pip 설치 가능한 PyPI 패키지로 설치할 수 있지만, 하드웨어에 가장 잘 맞고 PyPI 배포판에서는 제공되지 않는 1비트 Adam과 같은 특정 기능을 지원하려면 [소스에서 설치하기](https://www.deepspeed.ai/tutorials/advanced-install/#install-deepspeed-from-source)를 적극 권장합니다. @@ -112,19 +109,16 @@ DeepSpeed를 설치하고 메모리 요구 사항을 더 잘 파악했다면 다 DeepSpeed는 트레이닝 실행 방법을 구성하는 모든 매개변수가 포함된 구성 파일을 통해 [`Trainer`] 클래스와 함께 작동합니다. 트레이닝 스크립트를 실행하면 DeepSpeed는 [`Trainer`]로부터 받은 구성을 콘솔에 기록하므로 어떤 구성이 사용되었는지 정확히 확인할 수 있습니다. - - -DeepSpeed 구성 옵션의 전체 목록은 [DeepSpeed Configuration JSON](https://www.deepspeed.ai/docs/config-json/)에서 확인할 수 있습니다. 
또한 [DeepSpeedExamples](https://github.com/deepspeedai/DeepSpeedExamples) 리포지토리 또는 기본 [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) 리포지토리에서 다양한 DeepSpeed 구성 예제에 대한 보다 실용적인 예제를 찾을 수 있습니다. 구체적인 예제를 빠르게 찾으려면 다음과 같이 하세요: - -```bash -git clone https://github.com/deepspeedai/DeepSpeedExamples -cd DeepSpeedExamples -find . -name '*json' -# Lamb 옵티마이저 샘플 찾기 -grep -i Lamb $(find . -name '*json') -``` - - +> [!TIP] +> DeepSpeed 구성 옵션의 전체 목록은 [DeepSpeed Configuration JSON](https://www.deepspeed.ai/docs/config-json/)에서 확인할 수 있습니다. 또한 [DeepSpeedExamples](https://github.com/deepspeedai/DeepSpeedExamples) 리포지토리 또는 기본 [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) 리포지토리에서 다양한 DeepSpeed 구성 예제에 대한 보다 실용적인 예제를 찾을 수 있습니다. 구체적인 예제를 빠르게 찾으려면 다음과 같이 하세요: +> +> ```bash +> git clone https://github.com/deepspeedai/DeepSpeedExamples +> cd DeepSpeedExamples +> find . -name '*json' +> # Lamb 옵티마이저 샘플 찾기 +> grep -i Lamb $(find . -name '*json') +> ``` 명령줄 인터페이스에서 트레이닝하는 경우 DeepSpeed 구성 파일은 JSON 파일의 경로로 전달되거나 노트북 설정에서 [`Trainer`]를 사용하는 경우 중첩된 `dict` 객체로 전달됩니다. @@ -168,10 +162,8 @@ DeepSpeed 구성을 수정하고 [`TrainingArguments`]를 편집할 수도 있 세 가지 구성이 있으며, 각 구성은 서로 다른 ZeRO 단계에 해당합니다. 1단계는 확장성 측면에서 그다지 눈여겨볼만하지 않으므로 이 가이드에서는 2단계와 3단계에 중점을 둡니다. `zero_optimization` 구성에는 활성화할 항목과 구성 방법에 대한 모든 옵션이 포함되어 있습니다. 각 매개변수에 대한 자세한 설명은 [DeepSpeed 구성 JSON](https://www.deepspeed.ai/docs/config-json/) 참조를 참조하세요. - -DeepSpeed는 매개변수 이름의 유효성을 검사하지 않으며 오타가 있으면 매개변수의 기본 설정으로 대체합니다. DeepSpeed 엔진 시작 로그 메시지를 보고 어떤 값을 사용할지 확인할 수 있습니다. - - +> [!WARNING] +> DeepSpeed는 매개변수 이름의 유효성을 검사하지 않으며 오타가 있으면 매개변수의 기본 설정으로 대체합니다. DeepSpeed 엔진 시작 로그 메시지를 보고 어떤 값을 사용할지 확인할 수 있습니다. [`Trainer`]는 동등한 명령줄 인수를 제공하지 않으므로 다음 구성은 DeepSpeed로 설정해야 합니다. @@ -290,11 +282,8 @@ ZeRO-3의 또 다른 고려 사항은 여러 개의 GPU를 사용하는 경우 tensor([1.0], device="cuda:0", dtype=torch.float16, requires_grad=True) ``` - - -ZeRO-3로 대규모 모델을 초기화하고 매개변수에 액세스하는 방법에 대한 자세한 내용은 [Constructing Massive Models](https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models) 및 [Gathering Parameters](https://deepspeed.readthedocs.io/en/latest/zero3.html#gathering-parameters) 가이드를 참조하세요. - - +> [!TIP] +> ZeRO-3로 대규모 모델을 초기화하고 매개변수에 액세스하는 방법에 대한 자세한 내용은 [Constructing Massive Models](https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models) 및 [Gathering Parameters](https://deepspeed.readthedocs.io/en/latest/zero3.html#gathering-parameters) 가이드를 참조하세요. @@ -396,11 +385,8 @@ ZeRO-3로 대규모 모델을 초기화하고 매개변수에 액세스하는 `offload_optimizer`를 활성화하지 않는 한 DeepSpeed와 트랜스포머 옵티마이저 및 스케줄러를 혼합하여 사용할 수 있습니다. `offload_optimizer`를 활성화하면 CPU와 GPU 구현이 모두 있는 경우 DeepSpeed가 아닌 최적화기(LAMB 제외)를 사용할 수 있습니다. - - -구성 파일의 최적화 프로그램 및 스케줄러 매개변수는 명령줄에서 설정할 수 있으므로 오류를 찾기 어렵지 않습니다. 예를 들어 학습 속도가 다른 곳에서 다른 값으로 설정된 경우 명령줄에서 이를 재정의할 수 있습니다. 최적화 프로그램 및 스케줄러 매개변수 외에도 [`Trainer`] 명령줄 인수가 DeepSpeed 구성과 일치하는지 확인해야 합니다. - - +> [!WARNING] +> 구성 파일의 최적화 프로그램 및 스케줄러 매개변수는 명령줄에서 설정할 수 있으므로 오류를 찾기 어렵지 않습니다. 예를 들어 학습 속도가 다른 곳에서 다른 값으로 설정된 경우 명령줄에서 이를 재정의할 수 있습니다. 최적화 프로그램 및 스케줄러 매개변수 외에도 [`Trainer`] 명령줄 인수가 DeepSpeed 구성과 일치하는지 확인해야 합니다. @@ -616,11 +602,8 @@ DeepSpeed는 단 하나의 GPU로도 여전히 유용합니다: 1. 일부 계산과 메모리를 CPU로 오프로드하여 더 큰 배치 크기를 사용하거나 일반적으로 맞지 않는 매우 큰 모델을 맞추기 위해 모델에 더 많은 GPU 리소스를 사용할 수 있도록 합니다. 2. 스마트 GPU 메모리 관리 시스템으로 메모리 조각화를 최소화하여 더 큰 모델과 데이터 배치에 맞출 수 있습니다. - - -단일 GPU에서 더 나은 성능을 얻으려면 [ZeRO-2](#zero-configuration) 구성 파일에서 `allgather_bucket_size` 및 `reduce_bucket_size` 값을 2e8로 설정하세요. - - +> [!TIP] +> 단일 GPU에서 더 나은 성능을 얻으려면 [ZeRO-2](#zero-configuration) 구성 파일에서 `allgather_bucket_size` 및 `reduce_bucket_size` 값을 2e8로 설정하세요. 
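A minimal sketch of the single-GPU setting mentioned above, assuming a ZeRO-2 configuration passed to [`TrainingArguments`] as a Python dict (a path to an equivalent JSON file also works); apart from the 2e8 bucket sizes, the keys shown are illustrative defaults rather than values prescribed by the guide:

```python
from transformers import TrainingArguments

# ZeRO-2 fragment with the 2e8 bucket sizes suggested for a single GPU.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "allgather_bucket_size": 2e8,
        "reduce_bucket_size": 2e8,
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="zero2-single-gpu",
    per_device_train_batch_size=4,
    deepspeed=ds_config,  # accepts a dict or a path to a JSON config file
)
```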
@@ -849,11 +832,8 @@ trainer.deepspeed.save_checkpoint(checkpoint_dir) fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir) ``` - - -`load_state_dict_from_zero_checkpoint`가 실행되면 동일한 애플리케이션의 컨텍스트에서 모델을 더 이상 DeepSpeed에서 사용할 수 없습니다. `model.load_state_dict(state_dict)`는 모든 딥스피드 마법을 제거하므로 딥스피드 엔진을 다시 초기화해야 합니다. 이 기능은 훈련이 끝날 때만 사용하세요. - - +> [!TIP] +> `load_state_dict_from_zero_checkpoint`가 실행되면 동일한 애플리케이션의 컨텍스트에서 모델을 더 이상 DeepSpeed에서 사용할 수 없습니다. `model.load_state_dict(state_dict)`는 모든 딥스피드 마법을 제거하므로 딥스피드 엔진을 다시 초기화해야 합니다. 이 기능은 훈련이 끝날 때만 사용하세요. fp32 가중치의 state_dict를 추출하여 로드할 수도 있습니다: @@ -893,11 +873,8 @@ drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/ python zero_to_fp32.py . pytorch_model.bin ``` - - -자세한 사용법은 `python zero_to_fp32.py -h`를 실행하세요. 이 스크립트에는 최종 fp32 가중치의 2배의 일반 RAM이 필요합니다. - - +> [!TIP] +> 자세한 사용법은 `python zero_to_fp32.py -h`를 실행하세요. 이 스크립트에는 최종 fp32 가중치의 2배의 일반 RAM이 필요합니다. @@ -918,11 +895,8 @@ deepspeed --num_gpus=2 your_program.py --do_eval --deepspeed ds DeepSpeed는 [`Trainer`] 클래스가 없는 트랜스포머에서도 작동합니다. 이는 [`~PreTrainedModel.from_pretrained`]를 호출할 때 ZeRO-3 매개변수를 수집하고 모델을 여러 GPU에 분할하는 작업만 처리하는 [`HfDeepSpeedConfig`]가 처리합니다. - - -모든 것이 자동으로 처리되기를 원한다면, [`Trainer`]와 함께 DeepSpeed를 사용해 보세요! [DeepSpeed 문서](https://www.deepspeed.ai/)를 참조하여 설정 파일에서 매개변수 값을 수동으로 구성해야 합니다(`"auto"` 값은 사용할 수 없음). - - +> [!TIP] +> 모든 것이 자동으로 처리되기를 원한다면, [`Trainer`]와 함께 DeepSpeed를 사용해 보세요! [DeepSpeed 문서](https://www.deepspeed.ai/)를 참조하여 설정 파일에서 매개변수 값을 수동으로 구성해야 합니다(`"auto"` 값은 사용할 수 없음). ZeRO-3를 효율적으로 배포하려면 모델 앞에 [`HfDeepSpeedConfig`] 객체를 인스턴스화하고 해당 객체를 유지해야 합니다: diff --git a/docs/source/ko/generation_strategies.md b/docs/source/ko/generation_strategies.md index c59eff4111f3..690c91b71661 100644 --- a/docs/source/ko/generation_strategies.md +++ b/docs/source/ko/generation_strategies.md @@ -119,11 +119,8 @@ GenerationConfig { `generate()` 메소드는 `streamer` 입력을 통해 스트리밍을 지원합니다. `streamer` 입력은 `put()`과 `end()` 메소드를 가진 클래스의 인스턴스와 호환됩니다. 내부적으로, `put()`은 새 토큰을 추가하는 데 사용되며, `end()`는 텍스트 생성의 끝을 표시하는 데 사용됩니다. - - -스트리머 클래스의 API는 아직 개발 중이며, 향후 변경될 수 있습니다. - - +> [!WARNING] +> 스트리머 클래스의 API는 아직 개발 중이며, 향후 변경될 수 있습니다. 실제로 다양한 목적을 위해 자체 스트리밍 클래스를 만들 수 있습니다! 또한, 기본적인 스트리밍 클래스들도 준비되어 있어 바로 사용할 수 있습니다. 예를 들어, [`TextStreamer`] 클래스를 사용하여 `generate()`의 출력을 화면에 한 단어씩 스트리밍할 수 있습니다: diff --git a/docs/source/ko/glossary.md b/docs/source/ko/glossary.md index 6f9043b1caba..c0bcdb9c05b3 100644 --- a/docs/source/ko/glossary.md +++ b/docs/source/ko/glossary.md @@ -259,11 +259,8 @@ DataParallel 방식에 대해 더 알아보려면 [여기](perf_train_gpu_many#d - 객체 탐지 모델([`DetrForObjectDetection`] 등)의 경우, 모델은 `class_labels`와 `boxes` 키를 포함하는 딕셔너리들의 리스트를 입력으로 받습니다. 배치의 각 값은 개별 이미지에 대한 예상 클래스 레이블과 바운딩 박스 정보를 나타냅니다. - 자동 음성 인식 모델([`Wav2Vec2ForCTC`] 등)의 경우 모델은 `(batch_size,target_length)` 차원의 텐서를 입력으로 받으며, 각 값은 개별 토큰에 대한 예상 레이블을 나타냅니다. - - -모델마다 요구하는 레이블 형식이 다를 수 있으므로, 각 모델의 문서를 확인하여 해당 모델에 맞는 레이블 형식을 반드시 확인하세요! - - +> [!TIP] +> 모델마다 요구하는 레이블 형식이 다를 수 있으므로, 각 모델의 문서를 확인하여 해당 모델에 맞는 레이블 형식을 반드시 확인하세요! 기본 모델([`BertModel`] 등)은 레이블을 입력으로 받지 않습니다. 이러한 모델은 단순히 특징(feature)을 출력하는 기본 트랜스포머 모델이기 때문입니다. diff --git a/docs/source/ko/installation.md b/docs/source/ko/installation.md index a744db40d291..f7511fe9e7ed 100644 --- a/docs/source/ko/installation.md +++ b/docs/source/ko/installation.md @@ -118,11 +118,8 @@ pip install -e . 위 명령은 리포지터리를 복제한 위치의 폴더와 Python 라이브러리의 경로를 연결시킵니다. Python이 일반 라이브러리 경로 외에 복제한 폴더 내부를 확인할 것입니다. 
예를 들어 Python 패키지가 일반적으로 `~/anaconda3/envs/main/lib/python3.7/site-packages/`에 설치되어 있는데, 명령을 받은 Python이 이제 복제한 폴더인 `~/transformers/`도 검색하게 됩니다. - - -라이브러리를 계속 사용하려면 `transformers` 폴더를 꼭 유지해야 합니다. - - +> [!WARNING] +> 라이브러리를 계속 사용하려면 `transformers` 폴더를 꼭 유지해야 합니다. 복제본은 최신 버전의 🤗 Transformers로 쉽게 업데이트할 수 있습니다. @@ -149,21 +146,15 @@ conda install conda-forge::transformers 2. 셸 환경 변수: `HF_HOME` 3. 셸 환경 변수: `XDG_CACHE_HOME` + `/huggingface` - - -과거 🤗 Transformers에서 쓰였던 셸 환경 변수 `PYTORCH_TRANSFORMERS_CACHE` 또는 `PYTORCH_PRETRAINED_BERT_CACHE`이 설정되있다면, 셸 환경 변수 `TRANSFORMERS_CACHE`을 지정하지 않는 한 우선 사용됩니다. - - +> [!TIP] +> 과거 🤗 Transformers에서 쓰였던 셸 환경 변수 `PYTORCH_TRANSFORMERS_CACHE` 또는 `PYTORCH_PRETRAINED_BERT_CACHE`이 설정되있다면, 셸 환경 변수 `TRANSFORMERS_CACHE`을 지정하지 않는 한 우선 사용됩니다. ## 오프라인 모드[[offline-mode]] 🤗 Transformers를 로컬 파일만 사용하도록 해서 방화벽 또는 오프라인 환경에서 실행할 수 있습니다. 활성화하려면 `HF_HUB_OFFLINE=1` 환경 변수를 설정하세요. - - -`HF_DATASETS_OFFLINE=1` 환경 변수를 설정하여 오프라인 훈련 과정에 [🤗 Datasets](https://huggingface.co/docs/datasets/)을 추가할 수 있습니다. - - +> [!TIP] +> `HF_DATASETS_OFFLINE=1` 환경 변수를 설정하여 오프라인 훈련 과정에 [🤗 Datasets](https://huggingface.co/docs/datasets/)을 추가할 수 있습니다. 예를 들어 외부 기기 사이에 방화벽을 둔 일반 네트워크에서 평소처럼 프로그램을 다음과 같이 실행할 수 있습니다. @@ -238,8 +229,5 @@ Another option for using 🤗 Transformers offline is to download the files ahea >>> config = AutoConfig.from_pretrained("./your/path/bigscience_t0/config.json") ``` - - -Hub에 저장된 파일을 다운로드하는 방법을 더 자세히 알아보려면 [Hub에서 파일 다운로드하기](https://huggingface.co/docs/hub/how-to-downstream) 섹션을 참고해주세요. - - +> [!TIP] +> Hub에 저장된 파일을 다운로드하는 방법을 더 자세히 알아보려면 [Hub에서 파일 다운로드하기](https://huggingface.co/docs/hub/how-to-downstream) 섹션을 참고해주세요. diff --git a/docs/source/ko/llm_tutorial.md b/docs/source/ko/llm_tutorial.md index d5e0bd356edd..ae2c6e392783 100644 --- a/docs/source/ko/llm_tutorial.md +++ b/docs/source/ko/llm_tutorial.md @@ -68,11 +68,8 @@ LLM과 자기회귀 생성을 함께 사용할 때 핵심적인 부분은 이 코드를 확인해봅시다! - - -기본 LLM 사용에 관심이 있다면, 우리의 [`Pipeline`](pipeline_tutorial) 인터페이스로 시작하는 것을 추천합니다. 그러나 LLM은 양자화나 토큰 선택 단계에서의 미세한 제어와 같은 고급 기능들을 종종 필요로 합니다. 이러한 작업은 [`~generation.GenerationMixin.generate`]를 통해 가장 잘 수행될 수 있습니다. LLM을 이용한 자기회귀 생성은 자원을 많이 소모하므로, 적절한 처리량을 위해 GPU에서 실행되어야 합니다. - - +> [!TIP] +> 기본 LLM 사용에 관심이 있다면, 우리의 [`Pipeline`](pipeline_tutorial) 인터페이스로 시작하는 것을 추천합니다. 그러나 LLM은 양자화나 토큰 선택 단계에서의 미세한 제어와 같은 고급 기능들을 종종 필요로 합니다. 이러한 작업은 [`~generation.GenerationMixin.generate`]를 통해 가장 잘 수행될 수 있습니다. LLM을 이용한 자기회귀 생성은 자원을 많이 소모하므로, 적절한 처리량을 위해 GPU에서 실행되어야 합니다. 먼저, 모델을 불러오세요. diff --git a/docs/source/ko/llm_tutorial_optimization.md b/docs/source/ko/llm_tutorial_optimization.md index 63c9f1db45d4..f52958856e6d 100644 --- a/docs/source/ko/llm_tutorial_optimization.md +++ b/docs/source/ko/llm_tutorial_optimization.md @@ -643,11 +643,8 @@ length of key-value cache 24 > 더 긴 입력 시퀀스에 대해 동일한 결과와 큰 속도 향상을 가져오기 때문에 키-값 캐시를 *항상* 사용해야 합니다. Transformers는 텍스트 파이프라인이나 [`generate` 메서드](https://huggingface.co/docs/transformers/main_classes/text_generation)를 사용할 때 기본적으로 키-값 캐시를 활성화합니다. - - -참고로, 키-값 캐시를 사용할 것을 권장하지만, 이를 사용할 때 LLM 출력이 약간 다를 수 있습니다. 이것은 행렬 곱셈 커널 자체의 특성 때문입니다 -- 더 자세한 내용은 [여기](https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535)에서 읽어볼 수 있습니다. - - +> [!WARNING] +> 참고로, 키-값 캐시를 사용할 것을 권장하지만, 이를 사용할 때 LLM 출력이 약간 다를 수 있습니다. 이것은 행렬 곱셈 커널 자체의 특성 때문입니다 -- 더 자세한 내용은 [여기](https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535)에서 읽어볼 수 있습니다. 
#### 3.2.1 멀티 라운드 대화 [[321-multi-round-conversation]] diff --git a/docs/source/ko/main_classes/output.md b/docs/source/ko/main_classes/output.md index c383a522a1aa..2187c4931b8f 100644 --- a/docs/source/ko/main_classes/output.md +++ b/docs/source/ko/main_classes/output.md @@ -36,12 +36,9 @@ outputs = model(**inputs, labels=labels) `outputs` 객체는 [`~modeling_outputs.SequenceClassifierOutput`]입니다. 아래 해당 클래스의 문서에서 볼 수 있듯이, `loss`(선택적), `logits`, `hidden_states`(선택적) 및 `attentions`(선택적) 항목이 있습니다. 여기에서는 `labels`를 전달했기 때문에 `loss`가 있지만 `hidden_states`와 `attentions`가 없는데, 이는 `output_hidden_states=True` 또는 `output_attentions=True`를 전달하지 않았기 때문입니다. - - -`output_hidden_states=True`를 전달할 때 `outputs.hidden_states[-1]`가 `outputs.last_hidden_state`와 정확히 일치할 것으로 예상할 수 있습니다. -하지만 항상 그런 것은 아닙니다. 일부 모델은 마지막 은닉 상태가 반환될 때 정규화를 적용하거나 다른 후속 프로세스를 적용합니다. - - +> [!TIP] +> `output_hidden_states=True`를 전달할 때 `outputs.hidden_states[-1]`가 `outputs.last_hidden_state`와 정확히 일치할 것으로 예상할 수 있습니다. +> 하지만 항상 그런 것은 아닙니다. 일부 모델은 마지막 은닉 상태가 반환될 때 정규화를 적용하거나 다른 후속 프로세스를 적용합니다. 일반적으로 사용할 때와 동일하게 각 속성들에 접근할 수 있으며, 모델이 해당 속성을 반환하지 않은 경우 `None`이 반환됩니다. 예시에서는 `outputs.loss`는 모델에서 계산한 손실이고 `outputs.attentions`는 `None`입니다. diff --git a/docs/source/ko/main_classes/pipelines.md b/docs/source/ko/main_classes/pipelines.md index fa39fe9f5def..ef00770350ee 100644 --- a/docs/source/ko/main_classes/pipelines.md +++ b/docs/source/ko/main_classes/pipelines.md @@ -116,13 +116,10 @@ for out in pipe(KeyDataset(dataset, "text"), batch_size=8, truncation="only_firs # 이전과 동일한 출력이지만, 내용을 배치로 모델에 전달합니다. ``` - - -하지만 배치 처리가 항상 성능 향상을 보장하는 것은 아닙니다. 하드웨어, 데이터, 모델에 따라 속도가 10배로 빨라질수도, 5배 느려질 수 있습니다. - -주로 속도 향상이 있는 예시: - - +> [!WARNING] +> 하지만 배치 처리가 항상 성능 향상을 보장하는 것은 아닙니다. 하드웨어, 데이터, 모델에 따라 속도가 10배로 빨라질수도, 5배 느려질 수 있습니다. +> +> 주로 속도 향상이 있는 예시: ```python from transformers import pipeline diff --git a/docs/source/ko/main_classes/quantization.md b/docs/source/ko/main_classes/quantization.md index 6f793f221074..ba9329f4ba5c 100644 --- a/docs/source/ko/main_classes/quantization.md +++ b/docs/source/ko/main_classes/quantization.md @@ -21,11 +21,8 @@ rendered properly in your Markdown viewer. 양자화 기법은 가중치와 활성화를 8비트 정수(int8)와 같은 더 낮은 정밀도의 데이터 타입으로 표현함으로써 메모리와 계산 비용을 줄입니다. 이를 통해 일반적으로는 메모리에 올릴 수 없는 더 큰 모델을 로드할 수 있고, 추론 속도를 높일 수 있습니다. Transformers는 AWQ와 GPTQ 양자화 알고리즘을 지원하며, bitsandbytes를 통해 8비트와 4비트 양자화를 지원합니다. Transformers에서 지원되지 않는 양자화 기법들은 [`HfQuantizer`] 클래스를 통해 추가될 수 있습니다. - - -모델을 양자화하는 방법은 이 [양자화](../quantization) 가이드를 통해 배울 수 있습니다. - - +> [!TIP] +> 모델을 양자화하는 방법은 이 [양자화](../quantization) 가이드를 통해 배울 수 있습니다. ## QuantoConfig[[transformers.QuantoConfig]] diff --git a/docs/source/ko/main_classes/trainer.md b/docs/source/ko/main_classes/trainer.md index 23eda74a8bd6..a51a55550657 100644 --- a/docs/source/ko/main_classes/trainer.md +++ b/docs/source/ko/main_classes/trainer.md @@ -20,15 +20,12 @@ rendered properly in your Markdown viewer. [`Seq2SeqTrainer`]와 [`Seq2SeqTrainingArguments`]는 [`Trainer`]와 [`TrainingArguments`] 클래스를 상속하며, 요약이나 번역과 같은 시퀀스-투-시퀀스 작업을 위한 모델 훈련에 적합하게 조정되어 있습니다. - - -[`Trainer`] 클래스는 🤗 Transformers 모델에 최적화되어 있으며, 다른 모델과 함께 사용될 때 예상치 못한 동작을 하게 될 수 있습니다. 자신만의 모델을 사용할 때는 다음을 확인하세요: - -- 모델은 항상 튜플이나 [`~utils.ModelOutput`]의 서브클래스를 반환해야 합니다. -- 모델은 `labels` 인자가 제공되면 손실을 계산할 수 있고, 모델이 튜플을 반환하는 경우 그 손실이 튜플의 첫 번째 요소로 반환되어야 합니다. -- 모델은 여러 개의 레이블 인자를 수용할 수 있어야 하며, [`Trainer`]에게 이름을 알리기 위해 [`TrainingArguments`]에서 `label_names`를 사용하지만, 그 중 어느 것도 `"label"`로 명명되어서는 안 됩니다. 
- - +> [!WARNING] +> [`Trainer`] 클래스는 🤗 Transformers 모델에 최적화되어 있으며, 다른 모델과 함께 사용될 때 예상치 못한 동작을 하게 될 수 있습니다. 자신만의 모델을 사용할 때는 다음을 확인하세요: +> +> - 모델은 항상 튜플이나 [`~utils.ModelOutput`]의 서브클래스를 반환해야 합니다. +> - 모델은 `labels` 인자가 제공되면 손실을 계산할 수 있고, 모델이 튜플을 반환하는 경우 그 손실이 튜플의 첫 번째 요소로 반환되어야 합니다. +> - 모델은 여러 개의 레이블 인자를 수용할 수 있어야 하며, [`Trainer`]에게 이름을 알리기 위해 [`TrainingArguments`]에서 `label_names`를 사용하지만, 그 중 어느 것도 `"label"`로 명명되어서는 안 됩니다. ## Trainer [[transformers.Trainer]] diff --git a/docs/source/ko/model_doc/altclip.md b/docs/source/ko/model_doc/altclip.md index f736ab9c5c94..214fafeb98c8 100644 --- a/docs/source/ko/model_doc/altclip.md +++ b/docs/source/ko/model_doc/altclip.md @@ -37,11 +37,8 @@ AltCLIP은 멀티모달 비전 및 언어 모델입니다. 이미지와 텍스 >>> logits_per_image = outputs.logits_per_image # 이미지-텍스트 유사도 점수 >>> probs = logits_per_image.softmax(dim=1) # 라벨 마다 확률을 얻기 위해 softmax 적용 ``` - - -이 모델은 `CLIPModel`을 기반으로 하므로, 원래 CLIP처럼 사용할 수 있습니다. - - +> [!TIP] +> 이 모델은 `CLIPModel`을 기반으로 하므로, 원래 CLIP처럼 사용할 수 있습니다. ## AltCLIPConfig diff --git a/docs/source/ko/model_doc/auto.md b/docs/source/ko/model_doc/auto.md index f928b1904553..92a63058c98b 100644 --- a/docs/source/ko/model_doc/auto.md +++ b/docs/source/ko/model_doc/auto.md @@ -42,13 +42,10 @@ AutoModel.register(NewModelConfig, NewModel) 이후에는 일반적으로 자동 클래스를 사용하는 것처럼 사용할 수 있습니다! - - -만약 `NewModelConfig`가 [`~transformers.PretrainedConfig`]의 서브클래스라면, 해당 `model_type` 속성이 등록할 때 사용하는 키(여기서는 `"new-model"`)와 동일하게 설정되어 있는지 확인하세요. - -마찬가지로, `NewModel`이 [`PreTrainedModel`]의 서브클래스라면, 해당 `config_class` 속성이 등록할 때 사용하는 클래스(여기서는 `NewModelConfig`)와 동일하게 설정되어 있는지 확인하세요. - - +> [!WARNING] +> 만약 `NewModelConfig`가 [`~transformers.PretrainedConfig`]의 서브클래스라면, 해당 `model_type` 속성이 등록할 때 사용하는 키(여기서는 `"new-model"`)와 동일하게 설정되어 있는지 확인하세요. +> +> 마찬가지로, `NewModel`이 [`PreTrainedModel`]의 서브클래스라면, 해당 `config_class` 속성이 등록할 때 사용하는 클래스(여기서는 `NewModelConfig`)와 동일하게 설정되어 있는지 확인하세요. ## AutoConfig[[transformers.AutoConfig]] diff --git a/docs/source/ko/model_doc/barthez.md b/docs/source/ko/model_doc/barthez.md index 4df8eb2cd699..e6e38e63d31a 100644 --- a/docs/source/ko/model_doc/barthez.md +++ b/docs/source/ko/model_doc/barthez.md @@ -38,12 +38,9 @@ CamemBERT 및 FlauBERT와 동등하거나 이를 능가함을 보였습니다.* 이 모델은 [moussakam](https://huggingface.co/moussakam)이 기여했습니다. 저자의 코드는 [여기](https://github.com/moussaKam/BARThez)에서 찾을 수 있습니다. - - -BARThez 구현은 🤗 BART와 동일하나, 토큰화에서 차이가 있습니다. 구성 클래스와 그 매개변수에 대한 정보는 [BART 문서](bart)를 참조하십시오. -BARThez 전용 토크나이저는 아래에 문서화되어 있습니다. - - +> [!TIP] +> BARThez 구현은 🤗 BART와 동일하나, 토큰화에서 차이가 있습니다. 구성 클래스와 그 매개변수에 대한 정보는 [BART 문서](bart)를 참조하십시오. +> BARThez 전용 토크나이저는 아래에 문서화되어 있습니다. ## 리소스 [[resources]] diff --git a/docs/source/ko/model_doc/bert-japanese.md b/docs/source/ko/model_doc/bert-japanese.md index 8c21ef355890..1e6543907f52 100644 --- a/docs/source/ko/model_doc/bert-japanese.md +++ b/docs/source/ko/model_doc/bert-japanese.md @@ -66,12 +66,9 @@ MeCab과 WordPiece 토큰화를 사용하는 모델 예시: >>> outputs = bertjapanese(**inputs) ``` - - -이는 토큰화 방법을 제외하고는 BERT와 동일합니다. API 참조 정보는 [BERT 문서](https://huggingface.co/docs/transformers/main/en/model_doc/bert)를 참조하세요. -이 모델은 [cl-tohoku](https://huggingface.co/cl-tohoku)께서 기여하였습니다. - - +> [!TIP] +> 이는 토큰화 방법을 제외하고는 BERT와 동일합니다. API 참조 정보는 [BERT 문서](https://huggingface.co/docs/transformers/main/en/model_doc/bert)를 참조하세요. +> 이 모델은 [cl-tohoku](https://huggingface.co/cl-tohoku)께서 기여하였습니다. 
## BertJapaneseTokenizer diff --git a/docs/source/ko/model_doc/bertweet.md b/docs/source/ko/model_doc/bertweet.md index 7a46087d0a8e..adcf0b76df59 100644 --- a/docs/source/ko/model_doc/bertweet.md +++ b/docs/source/ko/model_doc/bertweet.md @@ -56,11 +56,8 @@ BERTweet은 BERT-base(Devlin et al., 2019)와 동일한 아키텍처를 가지 >>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base") ``` - - -이 구현은 토큰화 방법을 제외하고는 BERT와 동일합니다. API 참조 정보는 [BERT 문서](bert) 를 참조하세요. - - +> [!TIP] +> 이 구현은 토큰화 방법을 제외하고는 BERT와 동일합니다. API 참조 정보는 [BERT 문서](bert) 를 참조하세요. ## Bertweet 토큰화(BertweetTokenizer) [[transformers.BertweetTokenizer]] diff --git a/docs/source/ko/model_doc/chameleon.md b/docs/source/ko/model_doc/chameleon.md index 0c4eca628db7..201671578eb4 100644 --- a/docs/source/ko/model_doc/chameleon.md +++ b/docs/source/ko/model_doc/chameleon.md @@ -114,13 +114,10 @@ processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokeniza 모델은 8비트 또는 4비트로 로드할 수 있으며, 이는 원본 모델의 성능을 유지하면서 메모리 요구 사항을 크게 줄여줍니다. 먼저 bitsandbytes를 설치하고(`pip install bitsandbytes`), 라이브러리가 지원하는 GPU/가속기를 사용 중인지 확인하십시오. - - -bitsandbytes는 CUDA 이외의 여러 백엔드를 지원하도록 리팩터링되고 있습니다. 현재 ROCm(AMD GPU) 및 Intel CPU 구현이 성숙 단계이며, Intel XPU는 진행 중이고 Apple Silicon 지원은 Q4/Q1에 예상됩니다. 설치 지침 및 최신 백엔드 업데이트는 [이 링크](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend)를 방문하세요. - -전체 공개 전에 버그를 식별하는 데 도움이 되는 피드백을 환영합니다! 자세한 내용과 피드백은 [이 문서](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends)를 확인하세요. - - +> [!TIP] +> bitsandbytes는 CUDA 이외의 여러 백엔드를 지원하도록 리팩터링되고 있습니다. 현재 ROCm(AMD GPU) 및 Intel CPU 구현이 성숙 단계이며, Intel XPU는 진행 중이고 Apple Silicon 지원은 Q4/Q1에 예상됩니다. 설치 지침 및 최신 백엔드 업데이트는 [이 링크](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend)를 방문하세요. +> +> 전체 공개 전에 버그를 식별하는 데 도움이 되는 피드백을 환영합니다! 자세한 내용과 피드백은 [이 문서](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends)를 확인하세요. 위의 코드 스니펫을 다음과 같이 변경하면 됩니다: diff --git a/docs/source/ko/model_doc/clip.md b/docs/source/ko/model_doc/clip.md index b62629fa0771..34395241279e 100644 --- a/docs/source/ko/model_doc/clip.md +++ b/docs/source/ko/model_doc/clip.md @@ -69,11 +69,8 @@ pip install -U flash-attn --no-build-isolation 플래시 어텐션2와 호환되는 하드웨어를 가지고 있는지 확인하세요. 이에 대한 자세한 내용은 flash-attn 리포지토리의 공식문서에서 확인할 수 있습니다. 또한 모델을 반정밀도(`torch.float16`)로 로드하는 것을 잊지 마세요. - - -작은 배치 크기를 사용할 때, 플래시 어텐션을 사용하면 모델이 느려지는 것을 느낄 수 있습니다.아래의 [플래시 어텐션과 SDPA를 사용한 예상 속도 향상](#Expected-speedups-with-Flash-Attention-and-SDPA) 섹션을 참조하여 적절한 어텐션 구현을 선택하세요. - - +> [!WARNING] +> 작은 배치 크기를 사용할 때, 플래시 어텐션을 사용하면 모델이 느려지는 것을 느낄 수 있습니다.아래의 [플래시 어텐션과 SDPA를 사용한 예상 속도 향상](#Expected-speedups-with-Flash-Attention-and-SDPA) 섹션을 참조하여 적절한 어텐션 구현을 선택하세요. 플래시 어텐션2를 사용해서 모델을 로드하고 구동하기 위해서 다음 스니펫을 참고하세요: diff --git a/docs/source/ko/model_doc/cohere.md b/docs/source/ko/model_doc/cohere.md index b53738ded860..44658856aa77 100644 --- a/docs/source/ko/model_doc/cohere.md +++ b/docs/source/ko/model_doc/cohere.md @@ -20,15 +20,13 @@ The Cohere Command-R 모델은 Cohere팀이 [Command-R: 프로덕션 규모의 ## 사용 팁[[usage-tips]] - - -Hub에 업로드된 체크포인트들은 `dtype = 'float16'`을 사용합니다. -이는 `AutoModel` API가 체크포인트를 `torch.float32`에서 `torch.float16`으로 변환하는 데 사용됩니다. - -온라인 가중치의 `dtype`은 `model = AutoModelForCausalLM.from_pretrained("path", dtype = "auto")`를 사용하여 모델을 초기화할 때 `dtype="auto"`를 사용하지 않는 한 대부분 무관합니다. 그 이유는 모델이 먼저 다운로드되고(온라인 체크포인트의 `dtype` 사용), 그 다음 `torch`의 기본 `dtype`으로 변환되며(이때 `torch.float32`가 됨), 마지막으로 config에 `dtype`이 제공된 경우 이를 사용하기 때문입니다. - -모델을 `float16`으로 훈련하는 것은 권장되지 않으며 `nan`을 생성하는 것으로 알려져 있습니다. 따라서 모델은 `bfloat16`으로 훈련해야 합니다. 
- +> [!WARNING] +> Hub에 업로드된 체크포인트들은 `dtype = 'float16'`을 사용합니다. +> 이는 `AutoModel` API가 체크포인트를 `torch.float32`에서 `torch.float16`으로 변환하는 데 사용됩니다. +> +> 온라인 가중치의 `dtype`은 `model = AutoModelForCausalLM.from_pretrained("path", dtype = "auto")`를 사용하여 모델을 초기화할 때 `dtype="auto"`를 사용하지 않는 한 대부분 무관합니다. 그 이유는 모델이 먼저 다운로드되고(온라인 체크포인트의 `dtype` 사용), 그 다음 `torch`의 기본 `dtype`으로 변환되며(이때 `torch.float32`가 됨), 마지막으로 config에 `dtype`이 제공된 경우 이를 사용하기 때문입니다. +> +> 모델을 `float16`으로 훈련하는 것은 권장되지 않으며 `nan`을 생성하는 것으로 알려져 있습니다. 따라서 모델은 `bfloat16`으로 훈련해야 합니다. 모델과 토크나이저는 다음과 같이 로드할 수 있습니다: ```python diff --git a/docs/source/ko/model_doc/graphormer.md b/docs/source/ko/model_doc/graphormer.md index 9e1a893fc5f6..377df6373e42 100644 --- a/docs/source/ko/model_doc/graphormer.md +++ b/docs/source/ko/model_doc/graphormer.md @@ -14,12 +14,9 @@ rendered properly in your Markdown viewer. # Graphormer[[graphormer]] - - -이 모델은 유지 보수 모드로만 운영되며, 코드를 변경하는 새로운 PR(Pull Request)은 받지 않습니다. -이 모델을 실행하는 데 문제가 발생한다면, 이 모델을 지원하는 마지막 버전인 v4.40.2를 다시 설치해 주세요. 다음 명령어를 실행하여 재설치할 수 있습니다: `pip install -U transformers==4.40.2`. - - +> [!WARNING] +> 이 모델은 유지 보수 모드로만 운영되며, 코드를 변경하는 새로운 PR(Pull Request)은 받지 않습니다. +> 이 모델을 실행하는 데 문제가 발생한다면, 이 모델을 지원하는 마지막 버전인 v4.40.2를 다시 설치해 주세요. 다음 명령어를 실행하여 재설치할 수 있습니다: `pip install -U transformers==4.40.2`. ## 개요[[overview]] diff --git a/docs/source/ko/model_doc/llama2.md b/docs/source/ko/model_doc/llama2.md index 6fd74861be6d..df8e4d1e8172 100644 --- a/docs/source/ko/model_doc/llama2.md +++ b/docs/source/ko/model_doc/llama2.md @@ -26,15 +26,12 @@ Llama2 모델은 Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Al [여기](https://huggingface.co/models?search=llama2)에서 모든 Llama2 모델을 확인할 수 있습니다. - - -`Llama2` 모델은 `bfloat16`을 사용하여 훈련되었지만, 원래 추론은 `float16`을 사용합니다. 허브에 업로드된 체크포인트는 `dtype = 'float16'`을 사용하며, 이는 `AutoModel` API에 의해 체크포인트를 `torch.float32`에서 `torch.float16`으로 캐스팅하는 데 사용됩니다. - -온라인 가중치의 `dtype`은 `model = AutoModelForCausalLM.from_pretrained("path", dtype = "auto")`를 사용하여 모델을 초기화할 때 `dtype="auto"`를 사용하지 않는 한 대부분 관련이 없습니다. 그 이유는 모델이 먼저 다운로드될 것이고 (온라인 체크포인트의 `dtype`을 사용하여) 그다음에 기본 `dtype`인 `torch`로 캐스팅하고(`torch.float32`가 됨), 마지막으로 구성(configuration)에서 제공된 `dtype`이 있는 경우 이를 사용하기 때문입니다. - -모델을 `float16`에서 훈련하는 것은 권장되지 않으며 `nan`을 생성하는 것으로 알려져 있습니다. 따라서 모델은 `bfloat16`에서 훈련되어야 합니다. - - +> [!WARNING] +> `Llama2` 모델은 `bfloat16`을 사용하여 훈련되었지만, 원래 추론은 `float16`을 사용합니다. 허브에 업로드된 체크포인트는 `dtype = 'float16'`을 사용하며, 이는 `AutoModel` API에 의해 체크포인트를 `torch.float32`에서 `torch.float16`으로 캐스팅하는 데 사용됩니다. +> +> 온라인 가중치의 `dtype`은 `model = AutoModelForCausalLM.from_pretrained("path", dtype = "auto")`를 사용하여 모델을 초기화할 때 `dtype="auto"`를 사용하지 않는 한 대부분 관련이 없습니다. 그 이유는 모델이 먼저 다운로드될 것이고 (온라인 체크포인트의 `dtype`을 사용하여) 그다음에 기본 `dtype`인 `torch`로 캐스팅하고(`torch.float32`가 됨), 마지막으로 구성(configuration)에서 제공된 `dtype`이 있는 경우 이를 사용하기 때문입니다. +> +> 모델을 `float16`에서 훈련하는 것은 권장되지 않으며 `nan`을 생성하는 것으로 알려져 있습니다. 따라서 모델은 `bfloat16`에서 훈련되어야 합니다. 🍯 팁: diff --git a/docs/source/ko/model_doc/llama3.md b/docs/source/ko/model_doc/llama3.md index 8cbd9cde9b66..6849b82637c7 100644 --- a/docs/source/ko/model_doc/llama3.md +++ b/docs/source/ko/model_doc/llama3.md @@ -39,15 +39,12 @@ pipeline("Hey how are you doing today?") ## 사용 팁[[usage-tips]] - - -`라마3` 모델들은 `bfloat16`를 사용하여 훈련되었지만, 원래의 추론은 `float16`을 사용합니다. Hub에 업로드된 체크포인트들은 `dtype = 'float16'`을 사용하는데, 이는 `AutoModel` API가 체크포인트를 `torch.float32`에서 `torch.float16`으로 변환하는데 이용됩니다. - - `model = AutoModelForCausalLM.from_pretrained("path", dtype = "auto")`를 사용하여 모델을 초기화할 때, 온라인 가중치의 `dtype`는 `dtype="auto"`를 사용하지 않는 한 대부분 무관합니다. 
그 이유는 모델이 먼저 다운로드되고(온라인 체크포인트의 `dtype`를 사용), 그 다음 `torch`의 `dtype`으로 변환되어(`torch.float32`가 됨), 마지막으로 config에 `dtype`이 제공된 경우 가중치가 사용되기 때문입니다. - -`float16`으로 모델을 훈련하는 것은 권장되지 않으며 `nan`을 생성하는 것으로 알려져 있습니다. 따라서 모든 모델은 `bfloat16`으로 훈련되어야 합니다. - - +> [!WARNING] +> `라마3` 모델들은 `bfloat16`를 사용하여 훈련되었지만, 원래의 추론은 `float16`을 사용합니다. Hub에 업로드된 체크포인트들은 `dtype = 'float16'`을 사용하는데, 이는 `AutoModel` API가 체크포인트를 `torch.float32`에서 `torch.float16`으로 변환하는데 이용됩니다. +> +> `model = AutoModelForCausalLM.from_pretrained("path", dtype = "auto")`를 사용하여 모델을 초기화할 때, 온라인 가중치의 `dtype`는 `dtype="auto"`를 사용하지 않는 한 대부분 무관합니다. 그 이유는 모델이 먼저 다운로드되고(온라인 체크포인트의 `dtype`를 사용), 그 다음 `torch`의 `dtype`으로 변환되어(`torch.float32`가 됨), 마지막으로 config에 `dtype`이 제공된 경우 가중치가 사용되기 때문입니다. +> +> `float16`으로 모델을 훈련하는 것은 권장되지 않으며 `nan`을 생성하는 것으로 알려져 있습니다. 따라서 모든 모델은 `bfloat16`으로 훈련되어야 합니다. 팁: diff --git a/docs/source/ko/model_doc/trajectory_transformer.md b/docs/source/ko/model_doc/trajectory_transformer.md index 9f72a6f71e6d..931be3f70281 100644 --- a/docs/source/ko/model_doc/trajectory_transformer.md +++ b/docs/source/ko/model_doc/trajectory_transformer.md @@ -16,13 +16,9 @@ rendered properly in your Markdown viewer. # 궤적 트랜스포머[[trajectory-transformer]] - - - -이 모델은 유지 보수 모드로만 운영되며, 코드를 변경하는 새로운 PR(Pull Request)은 받지 않습니다. -이 모델을 실행하는 데 문제가 발생한다면, 이 모델을 지원하는 마지막 버전인 v4.30.0를 다시 설치해 주세요. 다음 명령어를 실행하여 재설치할 수 있습니다: `pip install -U transformers==4.30.0`. - - +> [!WARNING] +> 이 모델은 유지 보수 모드로만 운영되며, 코드를 변경하는 새로운 PR(Pull Request)은 받지 않습니다. +> 이 모델을 실행하는 데 문제가 발생한다면, 이 모델을 지원하는 마지막 버전인 v4.30.0를 다시 설치해 주세요. 다음 명령어를 실행하여 재설치할 수 있습니다: `pip install -U transformers==4.30.0`. ## 개요[[overview]] diff --git a/docs/source/ko/model_memory_anatomy.md b/docs/source/ko/model_memory_anatomy.md index a729b29a7c30..cf4bd3ad1b13 100644 --- a/docs/source/ko/model_memory_anatomy.md +++ b/docs/source/ko/model_memory_anatomy.md @@ -139,11 +139,8 @@ default_args = { } ``` - - -여러 실험을 실행할 계획이라면, 실험 간에 메모리를 제대로 비우기 위해서 Python 커널을 실험 사이마다 재시작해야 합니다. - - +> [!TIP] +> 여러 실험을 실행할 계획이라면, 실험 간에 메모리를 제대로 비우기 위해서 Python 커널을 실험 사이마다 재시작해야 합니다. ## 기본 훈련에서의 메모리 활용 [[memory-utilization-at-vanilla-training]] diff --git a/docs/source/ko/model_sharing.md b/docs/source/ko/model_sharing.md index 223fb6571c1c..944bb71f6278 100644 --- a/docs/source/ko/model_sharing.md +++ b/docs/source/ko/model_sharing.md @@ -27,11 +27,8 @@ rendered properly in your Markdown viewer. frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen> - - -커뮤니티에 모델을 공유하려면, [huggingface.co](https://huggingface.co/join)에 계정이 필요합니다. 기존 조직에 가입하거나 새로 만들 수도 있습니다. - - +> [!TIP] +> 커뮤니티에 모델을 공유하려면, [huggingface.co](https://huggingface.co/join)에 계정이 필요합니다. 기존 조직에 가입하거나 새로 만들 수도 있습니다. ## 저장소 특징[[repository-features]] diff --git a/docs/source/ko/peft.md b/docs/source/ko/peft.md index 7655a2c6b587..9adc24058e93 100644 --- a/docs/source/ko/peft.md +++ b/docs/source/ko/peft.md @@ -63,11 +63,8 @@ peft_model_id = "ybelkada/opt-350m-lora" model = AutoModelForCausalLM.from_pretrained(peft_model_id) ``` - - -`AutoModelFor` 클래스나 기본 모델 클래스(예: `OPTForCausalLM` 또는 `LlamaForCausalLM`) 중 하나를 사용하여 PEFT 어댑터를 가져올 수 있습니다. - - +> [!TIP] +> `AutoModelFor` 클래스나 기본 모델 클래스(예: `OPTForCausalLM` 또는 `LlamaForCausalLM`) 중 하나를 사용하여 PEFT 어댑터를 가져올 수 있습니다. `load_adapter` 메소드를 호출하여 PEFT 어댑터를 가져올 수도 있습니다. @@ -168,11 +165,8 @@ output = model.generate(**inputs) PEFT 어댑터는 [`Trainer`] 클래스에서 지원되므로 특정 사용 사례에 맞게 어댑터를 훈련할 수 있습니다. 몇 줄의 코드를 추가하기만 하면 됩니다. 
예를 들어 LoRA 어댑터를 훈련하려면: - - -[`Trainer`]를 사용하여 모델을 미세 조정하는 것이 익숙하지 않다면 [사전훈련된 모델을 미세 조정하기](training) 튜토리얼을 확인하세요. - - +> [!TIP] +> [`Trainer`]를 사용하여 모델을 미세 조정하는 것이 익숙하지 않다면 [사전훈련된 모델을 미세 조정하기](training) 튜토리얼을 확인하세요. 1. 작업 유형 및 하이퍼파라미터를 지정하여 어댑터 구성을 정의합니다. 하이퍼파라미터에 대한 자세한 내용은 [`~peft.LoraConfig`]를 참조하세요. diff --git a/docs/source/ko/perf_infer_cpu.md b/docs/source/ko/perf_infer_cpu.md index 123e56b4f32c..c5942f148ff4 100644 --- a/docs/source/ko/perf_infer_cpu.md +++ b/docs/source/ko/perf_infer_cpu.md @@ -38,13 +38,10 @@ IPEX 배포 주기는 PyTorch를 따라서 이루어집니다. 자세한 정보 ### JIT 모드 사용법 [[usage-of-jitmode]] 평가 또는 예측을 위해 Trainer에서 JIT 모드를 사용하려면 Trainer의 명령 인수에 `jit_mode_eval`을 추가해야 합니다. - - -PyTorch의 버전이 1.14.0 이상이라면, jit 모드는 jit.trace에서 dict 입력이 지원되므로, 모든 모델의 예측과 평가가 개선될 수 있습니다. - -PyTorch의 버전이 1.14.0 미만이라면, 질의 응답 모델과 같이 forward 매개변수의 순서가 jit.trace의 튜플 입력 순서와 일치하는 모델에 득이 될 수 있습니다. 텍스트 분류 모델과 같이 forward 매개변수 순서가 jit.trace의 튜플 입력 순서와 다른 경우, jit.trace가 실패하며 예외가 발생합니다. 이때 예외상황을 사용자에게 알리기 위해 Logging이 사용됩니다. - - +> [!WARNING] +> PyTorch의 버전이 1.14.0 이상이라면, jit 모드는 jit.trace에서 dict 입력이 지원되므로, 모든 모델의 예측과 평가가 개선될 수 있습니다. +> +> PyTorch의 버전이 1.14.0 미만이라면, 질의 응답 모델과 같이 forward 매개변수의 순서가 jit.trace의 튜플 입력 순서와 일치하는 모델에 득이 될 수 있습니다. 텍스트 분류 모델과 같이 forward 매개변수 순서가 jit.trace의 튜플 입력 순서와 다른 경우, jit.trace가 실패하며 예외가 발생합니다. 이때 예외상황을 사용자에게 알리기 위해 Logging이 사용됩니다. [Transformers 질의 응답](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering)의 사용 사례 예시를 참조하세요. diff --git a/docs/source/ko/perf_infer_gpu_one.md b/docs/source/ko/perf_infer_gpu_one.md index 5936c82e07d7..76e1ecce715d 100644 --- a/docs/source/ko/perf_infer_gpu_one.md +++ b/docs/source/ko/perf_infer_gpu_one.md @@ -42,11 +42,8 @@ PyTorch 2.0부터는 어텐션 패스트패스가 인코더와 디코더 모두 `bitsandbytes`를 설치하면 GPU에서 손쉽게 모델을 압축할 수 있습니다. FP4 양자화를 사용하면 원래의 전체 정밀도 버전과 비교하여 모델 크기를 최대 8배 줄일 수 있습니다. 아래에서 시작하는 방법을 확인하세요. - - -이 기능은 다중 GPU 설정에서도 사용할 수 있습니다. - - +> [!TIP] +> 이 기능은 다중 GPU 설정에서도 사용할 수 있습니다. ### 요구 사항 [[requirements-for-fp4-mixedprecision-inference]] @@ -95,11 +92,8 @@ model_4bit = AutoModelForCausalLM.from_pretrained( ## Int8 혼합 정밀도 행렬 분해를 위한 `bitsandbytes` 통합 [[bitsandbytes-integration-for-int8-mixedprecision-matrix-decomposition]] - - -이 기능은 다중 GPU 설정에서도 사용할 수 있습니다. - - +> [!TIP] +> 이 기능은 다중 GPU 설정에서도 사용할 수 있습니다. [`LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale`](https://huggingface.co/papers/2208.07339) 논문에서 우리는 몇 줄의 코드로 Hub의 모든 모델에 대한 Hugging Face 통합을 지원합니다. 이 방법은 `float16` 및 `bfloat16` 가중치에 대해 `nn.Linear` 크기를 2배로 줄이고, `float32` 가중치에 대해 4배로 줄입니다. 이는 절반 정밀도에서 이상치를 처리함으로써 품질에 거의 영향을 미치지 않습니다. diff --git a/docs/source/ko/perf_train_cpu_many.md b/docs/source/ko/perf_train_cpu_many.md index e7a68971a7dc..1bdda2237d79 100644 --- a/docs/source/ko/perf_train_cpu_many.md +++ b/docs/source/ko/perf_train_cpu_many.md @@ -44,12 +44,9 @@ pip install oneccl_bind_pt=={pytorch_version} -f https://developer.intel.com/ipe [oneccl_bind_pt 설치](https://github.com/intel/torch-ccl)에 대한 더 많은 접근 방법을 확인해 보세요. oneCCL과 PyTorch의 버전은 일치해야 합니다. - - -oneccl_bindings_for_pytorch 1.12.0 버전의 미리 빌드된 Wheel 파일은 PyTorch 1.12.1과 호환되지 않습니다(PyTorch 1.12.0용입니다). -PyTorch 1.12.1은 oneccl_bindings_for_pytorch 1.12.10 버전과 함께 사용해야 합니다. - - +> [!WARNING] +> oneccl_bindings_for_pytorch 1.12.0 버전의 미리 빌드된 Wheel 파일은 PyTorch 1.12.1과 호환되지 않습니다(PyTorch 1.12.0용입니다). +> PyTorch 1.12.1은 oneccl_bindings_for_pytorch 1.12.10 버전과 함께 사용해야 합니다. ## Intel® MPI 라이브러리 [[intel-mpi-library]] 이 표준 기반 MPI 구현을 사용하여 Intel® 아키텍처에서 유연하고 효율적이며 확장 가능한 클러스터 메시징을 제공하세요. 이 구성 요소는 Intel® oneAPI HPC Toolkit의 일부입니다. 
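참고로, 위에서 설치한 oneCCL 바인딩을 [`Trainer`]와 함께 사용할 때는 대체로 `ddp_backend`를 `ccl`로 지정하는 형태가 됩니다. 아래는 설명을 위한 가정된 스케치이며, 인자 이름과 값은 사용하는 transformers 버전과 실행 환경(예: mpirun으로 여러 프로세스를 띄운 환경)에 따라 달라질 수 있습니다.

```python
from transformers import TrainingArguments

# 가정: 분산 CPU 학습을 위해 oneCCL(ccl) 백엔드를 사용한다고 가정한 예시 구성입니다.
training_args = TrainingArguments(
    output_dir="./outputs",
    ddp_backend="ccl",  # oneccl_bindings_for_pytorch가 설치되어 있어야 합니다
    use_cpu=True,       # GPU 대신 CPU에서 학습
)
```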
diff --git a/docs/source/ko/perf_train_gpu_many.md b/docs/source/ko/perf_train_gpu_many.md index 801e06e276e4..12f1880ea9d1 100644 --- a/docs/source/ko/perf_train_gpu_many.md +++ b/docs/source/ko/perf_train_gpu_many.md @@ -17,11 +17,8 @@ rendered properly in your Markdown viewer. 단일 GPU에서의 훈련이 너무 느리거나 모델 가중치가 단일 GPU의 메모리에 맞지 않는 경우, 다중-GPU 설정을 사용합니다. 단일 GPU에서 다중 GPU로 전환하기 위해서는 작업을 분산해야 합니다. 데이터, 텐서 또는 파이프라인과 같은 병렬화 기법을 사용하여 작업을 병렬로 처리할 수 있습니다. 그러나 이러한 설정을 모두에게 적용할 수 있는 완벽한 해결책은 없으며, 어떤 설정이 가장 적합한지는 사용하는 하드웨어에 따라 달라집니다. 이 문서는 주로 PyTorch 기반의 구현을 중심으로 설명하며, 대부분의 개념은 다른 프레임워크에도 적용될 수 있을 것으로 예상됩니다. - - - 참고: [단일 GPU 섹션](perf_train_gpu_one)에서 소개된 전략(혼합 정밀도 훈련 또는 그래디언트 누적 등)은 일반적으로 모델 훈련에 적용되며, 다중-GPU 또는 CPU 훈련과 같은 다음 섹션으로 진입하기 전에 해당 섹션을 참고하는 것이 좋습니다. - - +> [!TIP] +> 참고: [단일 GPU 섹션](perf_train_gpu_one)에서 소개된 전략(혼합 정밀도 훈련 또는 그래디언트 누적 등)은 일반적으로 모델 훈련에 적용되며, 다중-GPU 또는 CPU 훈련과 같은 다음 섹션으로 진입하기 전에 해당 섹션을 참고하는 것이 좋습니다. 먼저 1D 병렬화 기술에 대해 자세히 논의한 후, 이러한 기술을 결합하여 2D 및 3D 병렬화를 구현하여 더 빠른 훈련과 더 큰 모델을 지원하는 방법을 살펴볼 것입니다. 또한 다른 효과적인 대안 방식도 소개될 예정입니다. diff --git a/docs/source/ko/perf_train_special.md b/docs/source/ko/perf_train_special.md index 188db542f7c0..9b3e5e9d4636 100644 --- a/docs/source/ko/perf_train_special.md +++ b/docs/source/ko/perf_train_special.md @@ -17,15 +17,12 @@ rendered properly in your Markdown viewer. 이전에는 Mac에서 모델을 학습할 때 CPU만 사용할 수 있었습니다. 그러나 이제 PyTorch v1.12의 출시로 Apple의 실리콘 GPU를 사용하여 훨씬 더 빠른 성능으로 모델을 학습할 수 있게 되었습니다. 이는 Pytorch에서 Apple의 Metal Performance Shaders (MPS)를 백엔드로 통합하면서 가능해졌습니다. [MPS 백엔드](https://pytorch.org/docs/stable/notes/mps.html)는 Pytorch 연산을 Metal 세이더로 구현하고 이 모듈들을 mps 장치에서 실행할 수 있도록 지원합니다. - - -일부 Pytorch 연산들은 아직 MPS에서 지원되지 않아 오류가 발생할 수 있습니다. 이를 방지하려면 환경 변수 `PYTORCH_ENABLE_MPS_FALLBACK=1` 를 설정하여 CPU 커널을 대신 사용하도록 해야 합니다(이때 `UserWarning`이 여전히 표시될 수 있습니다). - -
- -다른 오류가 발생할 경우 [PyTorch](https://github.com/pytorch/pytorch/issues) 리포지토리에 이슈를 등록해주세요. 현재 [`Trainer`]는 MPS 백엔드만 통합하고 있습니다. - -
+> [!WARNING] +> 일부 Pytorch 연산들은 아직 MPS에서 지원되지 않아 오류가 발생할 수 있습니다. 이를 방지하려면 환경 변수 `PYTORCH_ENABLE_MPS_FALLBACK=1` 를 설정하여 CPU 커널을 대신 사용하도록 해야 합니다(이때 `UserWarning`이 여전히 표시될 수 있습니다). +> +>
+> +> 다른 오류가 발생할 경우 [PyTorch](https://github.com/pytorch/pytorch/issues) 리포지토리에 이슈를 등록해주세요. 현재 [`Trainer`]는 MPS 백엔드만 통합하고 있습니다. `mps` 장치를 이용하면 다음과 같은 이점들을 얻을 수 있습니다: diff --git a/docs/source/ko/pipeline_tutorial.md b/docs/source/ko/pipeline_tutorial.md index 2f166fc6939f..b9939d6bac1e 100644 --- a/docs/source/ko/pipeline_tutorial.md +++ b/docs/source/ko/pipeline_tutorial.md @@ -22,11 +22,8 @@ rendered properly in your Markdown viewer. * 특정 토크나이저 또는 모델을 사용하는 방법 * 언어, 컴퓨터 비전, 오디오 및 멀티모달 태스크에서 [`pipeline`]을 사용하는 방법 - - -지원하는 모든 태스크와 쓸 수 있는 매개변수를 담은 목록은 [`pipeline`] 설명서를 참고해주세요. - - +> [!TIP] +> 지원하는 모든 태스크와 쓸 수 있는 매개변수를 담은 목록은 [`pipeline`] 설명서를 참고해주세요. ## Pipeline 사용하기[[pipeline-usage]] @@ -182,9 +179,8 @@ for out in pipe(KeyDataset(dataset["audio"])): ## 웹서버에서 Pipeline 사용하기[[using-pipelines-for-a-webserver]] - -추론 엔진을 만드는 과정은 따로 페이지를 작성할만한 복잡한 주제입니다. - +> [!TIP] +> 추론 엔진을 만드는 과정은 따로 페이지를 작성할만한 복잡한 주제입니다. [Link](./pipeline_webserver) diff --git a/docs/source/ko/pipeline_webserver.md b/docs/source/ko/pipeline_webserver.md index b7d5366c57c4..8f764cdee81d 100644 --- a/docs/source/ko/pipeline_webserver.md +++ b/docs/source/ko/pipeline_webserver.md @@ -4,9 +4,8 @@ rendered properly in your Markdown viewer. # 웹 서버를 위한 파이프라인 사용하기[[using_pipelines_for_a_webserver]] - -추론 엔진을 만드는 것은 복잡한 주제이며, "최선의" 솔루션은 문제 공간에 따라 달라질 가능성이 높습니다. CPU 또는 GPU를 사용하는지에 따라 다르고 낮은 지연 시간을 원하는지, 높은 처리량을 원하는지, 다양한 모델을 지원할 수 있길 원하는지, 하나의 특정 모델을 고도로 최적화하길 원하는지 등에 따라 달라집니다. 이 주제를 해결하는 방법에는 여러 가지가 있으므로, 이 장에서 제시하는 것은 처음 시도해 보기에 좋은 출발점일 수는 있지만, 이 장을 읽는 여러분이 필요로 하는 최적의 솔루션은 아닐 수 있습니다. - +> [!TIP] +> 추론 엔진을 만드는 것은 복잡한 주제이며, "최선의" 솔루션은 문제 공간에 따라 달라질 가능성이 높습니다. CPU 또는 GPU를 사용하는지에 따라 다르고 낮은 지연 시간을 원하는지, 높은 처리량을 원하는지, 다양한 모델을 지원할 수 있길 원하는지, 하나의 특정 모델을 고도로 최적화하길 원하는지 등에 따라 달라집니다. 이 주제를 해결하는 방법에는 여러 가지가 있으므로, 이 장에서 제시하는 것은 처음 시도해 보기에 좋은 출발점일 수는 있지만, 이 장을 읽는 여러분이 필요로 하는 최적의 솔루션은 아닐 수 있습니다. 핵심적으로 이해해야 할 점은 [dataset](pipeline_tutorial#using-pipelines-on-a-dataset)를 다룰 때와 마찬가지로 반복자를 사용 가능하다는 것입니다. 왜냐하면, 웹 서버는 기본적으로 요청을 기다리고 들어오는 대로 처리하는 시스템이기 때문입니다. @@ -74,10 +73,9 @@ curl -X POST -d "test [MASK]" http://localhost:8000/ 중요한 점은 모델을 **한 번만** 가져온다는 것입니다. 따라서 웹 서버에는 모델의 사본이 없습니다. 이런 방식은 불필요한 RAM이 사용되지 않습니다. 그런 다음 큐 메커니즘을 사용하면, 다음과 같은 동적 배치를 사용하기 위해 추론 전 단계에 몇 개의 항목을 축적하는 것과 같은 멋진 작업을 할 수 있습니다: - -코드는 의도적으로 가독성을 위해 의사 코드처럼 작성되었습니다! -아래 코드를 작동시키기 전에 시스템 자원이 충분한지 확인하세요! - +> [!WARNING] +> 코드는 의도적으로 가독성을 위해 의사 코드처럼 작성되었습니다! +> 아래 코드를 작동시키기 전에 시스템 자원이 충분한지 확인하세요! ```py (string, rq) = await q.get() diff --git a/docs/source/ko/pr_checks.md b/docs/source/ko/pr_checks.md index 1d155cd1fb9d..cc4bf435d5fe 100644 --- a/docs/source/ko/pr_checks.md +++ b/docs/source/ko/pr_checks.md @@ -147,11 +147,8 @@ make fix-copies Transformers 라이브러리는 모델 코드에 대해 매우 완고하며, 각 모델은 다른 모델에 의존하지 않고 완전히 단일 파일로 구현되어야 합니다. 이렇게 하기 위해 특정 모델의 코드 복사본이 원본과 일관된 상태로 유지되는지 확인하는 메커니즘을 추가했습니다. 따라서 버그 수정이 필요한 경우 다른 모델에 영향을 주는 모든 모델을 볼 수 있으며 수정을 적용할지 수정된 사본을 삭제할지 선택할 수 있습니다. - - -파일이 다른 파일의 완전한 사본인 경우 해당 파일을 `utils/check_copies.py`의 `FULL_COPIES` 상수에 등록해야 합니다. - - +> [!TIP] +> 파일이 다른 파일의 완전한 사본인 경우 해당 파일을 `utils/check_copies.py`의 `FULL_COPIES` 상수에 등록해야 합니다. 이 메커니즘은 `# Copied from xxx` 형식의 주석을 기반으로 합니다. `xxx`에는 아래에 복사되는 클래스 또는 함수의 전체 경로가 포함되어야 합니다. 예를 들어 `RobertaSelfOutput`은 `BertSelfOutput` 클래스의 복사본입니다. 따라서 [여기](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L289)에서 주석이 있습니다: @@ -182,11 +179,8 @@ Transformers 라이브러리는 모델 코드에 대해 매우 완고하며, 각 순서가 중요한 경우(이전 수정과 충돌할 수 있는 경우) 수정은 왼쪽에서 오른쪽으로 실행됩니다. 
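아래는 이러한 교체 패턴이 어떤 형태인지 보여 주기 위한 가상의 스케치입니다. 클래스 이름과 경로는 설명을 위해 가정한 것이므로, 실제 주석 형식은 위에 링크된 모델 파일들을 기준으로 삼으세요.

```python
from torch import nn

# 가상의 예시: 쉼표로 구분된 여러 교체 패턴은 앞서 설명한 대로 왼쪽에서 오른쪽 순서로 적용됩니다.
# Copied from transformers.models.roberta.modeling_roberta.RobertaSelfOutput with Roberta->Camembert, ROBERTA->CAMEMBERT
class CamembertSelfOutput(nn.Module):
    """RobertaSelfOutput을 복사한 뒤 이름만 치환했다고 가정한 클래스 스텁입니다."""
    pass
```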
- - -새 변경이 서식을 변경하는 경우(짧은 이름을 매우 긴 이름으로 바꾸는 경우) 자동 서식 지정기를 적용한 후 복사본이 검사됩니다. - - +> [!TIP] +> 새 변경이 서식을 변경하는 경우(짧은 이름을 매우 긴 이름으로 바꾸는 경우) 자동 서식 지정기를 적용한 후 복사본이 검사됩니다. 패턴의 대소문자가 다른 경우(대문자와 소문자가 혼용된 대체 양식) `all-casing` 옵션을 추가하는 방법도 있습니다. [여기](https://github.com/huggingface/transformers/blob/15082a9dc6950ecae63a0d3e5060b2fc7f15050a/src/transformers/models/mobilebert/modeling_mobilebert.py#L1237)에서 `MobileBertForSequenceClassification`에서 사용된 예시를 볼 수 있습니다: diff --git a/docs/source/ko/quantization/awq.md b/docs/source/ko/quantization/awq.md index c8f472a95e12..436f8b424a25 100644 --- a/docs/source/ko/quantization/awq.md +++ b/docs/source/ko/quantization/awq.md @@ -16,11 +16,8 @@ rendered properly in your Markdown viewer. # AWQ [[awq]] - - -이 [노트북](https://colab.research.google.com/drive/1HzZH89yAXJaZgwJDhQj9LqSBux932BvY) 으로 AWQ 양자화를 실습해보세요 ! - - +> [!TIP] +> 이 [노트북](https://colab.research.google.com/drive/1HzZH89yAXJaZgwJDhQj9LqSBux932BvY) 으로 AWQ 양자화를 실습해보세요 ! [Activation-aware Weight Quantization (AWQ)](https://hf.co/papers/2306.00978)은 모델의 모든 가중치를 양자화하지 않고, LLM 성능에 중요한 가중치를 유지합니다. 이로써 4비트 정밀도로 모델을 실행해도 성능 저하 없이 양자화 손실을 크게 줄일 수 있습니다. @@ -83,11 +80,8 @@ model = AutoModelForCausalLM.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ", att 퓨즈된 모듈은 정확도와 성능을 개선합니다. 퓨즈된 모듈은 [Llama](https://huggingface.co/meta-llama) 아키텍처와 [Mistral](https://huggingface.co/mistralai/Mistral-7B-v0.1) 아키텍처의 AWQ모듈에 기본적으로 지원됩니다. 그러나 지원되지 않는 아키텍처에 대해서도 AWQ 모듈을 퓨즈할 수 있습니다. - - -퓨즈된 모듈은 FlashAttention-2와 같은 다른 최적화 기술과 결합할 수 없습니다. - - +> [!WARNING] +> 퓨즈된 모듈은 FlashAttention-2와 같은 다른 최적화 기술과 결합할 수 없습니다. @@ -226,8 +220,5 @@ output = model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=5 print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` - - -이 기능은 AMD GPUs에서 지원됩니다. - - +> [!WARNING] +> 이 기능은 AMD GPUs에서 지원됩니다. diff --git a/docs/source/ko/quantization/bitsandbytes.md b/docs/source/ko/quantization/bitsandbytes.md index 594423967099..2c54e299f31e 100644 --- a/docs/source/ko/quantization/bitsandbytes.md +++ b/docs/source/ko/quantization/bitsandbytes.md @@ -125,11 +125,8 @@ model_4bit.model.decoder.layers[-1].final_layer_norm.weight.dtype - - -8비트 및 4비트 가중치로 훈련하는 것은 *추가* 매개변수에 대해서만 지원됩니다. - - +> [!WARNING] +> 8비트 및 4비트 가중치로 훈련하는 것은 *추가* 매개변수에 대해서만 지원됩니다. 메모리 사용량을 확인하려면 `get_memory_footprint`를 사용하세요: @@ -147,11 +144,8 @@ model = AutoModelForCausalLM.from_pretrained("{your_username}/bloom-560m-8bit", ## 8비트 (LLM.int8() 알고리즘)[[8-bit-(llm.int8()-algorithm)]] - - -8비트 양자화에 대한 자세한 내용을 알고 싶다면 이 [블로그 포스트](https://huggingface.co/blog/hf-bitsandbytes-integration)를 참조하세요! - - +> [!TIP] +> 8비트 양자화에 대한 자세한 내용을 알고 싶다면 이 [블로그 포스트](https://huggingface.co/blog/hf-bitsandbytes-integration)를 참조하세요! 이 섹션에서는 오프로딩, 이상치 임곗값, 모듈 변환 건너뛰기 및 미세 조정과 같은 8비트 모델의 특정 기능을 살펴봅니다. @@ -235,11 +229,8 @@ model_8bit = AutoModelForCausalLM.from_pretrained( ## 4비트 (QLoRA 알고리즘)[[4-bit-(qlora-algorithm)]] - - -이 [노트북](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf)에서 4비트 양자화를 시도해보고 자세한 내용은 이 [블로그 게시물](https://huggingface.co/blog/4bit-transformers-bitsandbytes)에서 확인하세요. - - +> [!TIP] +> 이 [노트북](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf)에서 4비트 양자화를 시도해보고 자세한 내용은 이 [블로그 게시물](https://huggingface.co/blog/4bit-transformers-bitsandbytes)에서 확인하세요. 이 섹션에서는 계산 데이터 유형 변경, Normal Float 4 (NF4) 데이터 유형 사용, 중첩 양자화 사용과 같은 4비트 모델의 특정 기능 일부를 탐구합니다. 
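아래는 이 세 가지 옵션을 한 번에 지정하는 방법을 보여 주는 최소한의 스케치입니다. 체크포인트 이름(`facebook/opt-350m`)은 예시로 가정한 것이며, 실행하려면 `bitsandbytes`가 설치된 지원 환경이 필요합니다.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 데이터 유형 사용
    bnb_4bit_use_double_quant=True,         # 중첩(이중) 양자화 사용
    bnb_4bit_compute_dtype=torch.bfloat16,  # 계산 데이터 유형 변경
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=nf4_config,
    device_map="auto",
)
```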
diff --git a/docs/source/ko/quantization/gptq.md b/docs/source/ko/quantization/gptq.md index c54f09c94a33..23d936b9b1e0 100644 --- a/docs/source/ko/quantization/gptq.md +++ b/docs/source/ko/quantization/gptq.md @@ -16,11 +16,8 @@ rendered properly in your Markdown viewer. # GPTQ [[gptq]] - - -PEFT를 활용한 GPTQ 양자화를 사용해보시려면 이 [노트북](https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb)을 참고하시고, 자세한 내용은 이 [블로그 게시물](https://huggingface.co/blog/gptq-integration)에서 확인하세요! - - +> [!TIP] +> PEFT를 활용한 GPTQ 양자화를 사용해보시려면 이 [노트북](https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb)을 참고하시고, 자세한 내용은 이 [블로그 게시물](https://huggingface.co/blog/gptq-integration)에서 확인하세요! [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) 라이브러리는 GPTQ 알고리즘을 구현합니다. 이는 훈련 후 양자화 기법으로, 가중치 행렬의 각 행을 독립적으로 양자화하여 오차를 최소화하는 가중치 버전을 찾습니다. 이 가중치는 int4로 양자화되지만, 추론 중에는 실시간으로 fp16으로 복원됩니다. 이는 int4 가중치가 GPU의 전역 메모리 대신 결합된 커널에서 역양자화되기 때문에 메모리 사용량을 4배 절약할 수 있으며, 더 낮은 비트 너비를 사용함으로써 통신 시간이 줄어들어 추론 속도가 빨라질 것으로 기대할 수 있습니다. @@ -60,11 +57,8 @@ quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="aut quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", max_memory={0: "30GiB", 1: "46GiB", "cpu": "30GiB"}, quantization_config=gptq_config) ``` - - -하드웨어와 모델 매개변수량에 따라 모델을 처음부터 양자화하는 데 드는 시간이 서로 다를 수 있습니다. 예를 들어, 무료 등급의 Google Colab GPU로 비교적 가벼운 [facebook/opt-350m](https://huggingface.co/facebook/opt-350m) 모델을 양자화하는 데 약 5분이 걸리지만, NVIDIA A100으로 175B에 달하는 매개변수를 가진 모델을 양자화하는 데는 약 4시간에 달하는 시간이 걸릴 수 있습니다. 모델을 양자화하기 전에, Hub에서 해당 모델의 GPTQ 양자화 버전이 이미 존재하는지 확인하는 것이 좋습니다. - - +> [!WARNING] +> 하드웨어와 모델 매개변수량에 따라 모델을 처음부터 양자화하는 데 드는 시간이 서로 다를 수 있습니다. 예를 들어, 무료 등급의 Google Colab GPU로 비교적 가벼운 [facebook/opt-350m](https://huggingface.co/facebook/opt-350m) 모델을 양자화하는 데 약 5분이 걸리지만, NVIDIA A100으로 175B에 달하는 매개변수를 가진 모델을 양자화하는 데는 약 4시간에 달하는 시간이 걸릴 수 있습니다. 모델을 양자화하기 전에, Hub에서 해당 모델의 GPTQ 양자화 버전이 이미 존재하는지 확인하는 것이 좋습니다. 모델이 양자화되면, 모델과 토크나이저를 Hub에 푸시하여 쉽게 공유하고 접근할 수 있습니다. [`GPTQConfig`]를 저장하기 위해 [`~PreTrainedModel.push_to_hub`] 메소드를 사용하세요: @@ -104,11 +98,8 @@ gptq_config = GPTQConfig(bits=4, exllama_config={"version":2}) model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=gptq_config) ``` - - -4비트 모델만 지원되며, 양자화된 모델을 PEFT로 미세 조정하는 경우 ExLlama 커널을 비활성화할 것을 권장합니다. - - +> [!WARNING] +> 4비트 모델만 지원되며, 양자화된 모델을 PEFT로 미세 조정하는 경우 ExLlama 커널을 비활성화할 것을 권장합니다. ExLlama 커널은 전체 모델이 GPU에 있을 때만 지원됩니다. AutoGPTQ(버전 0.4.2 이상)로 CPU에서 추론을 수행하는 경우 ExLlama 커널을 비활성화해야 합니다. 이를 위해 config.json 파일의 양자화 설정에서 ExLlama 커널과 관련된 속성을 덮어써야 합니다. diff --git a/docs/source/ko/quantization/quanto.md b/docs/source/ko/quantization/quanto.md index 7eff695051d6..80044f1945b2 100644 --- a/docs/source/ko/quantization/quanto.md +++ b/docs/source/ko/quantization/quanto.md @@ -16,11 +16,8 @@ rendered properly in your Markdown viewer. # Quanto[[quanto]] - - -이 [노트북](https://colab.research.google.com/drive/16CXfVmtdQvciSh9BopZUDYcmXCDpvgrT?usp=sharing)으로 Quanto와 transformers를 사용해 보세요! - - +> [!TIP] +> 이 [노트북](https://colab.research.google.com/drive/16CXfVmtdQvciSh9BopZUDYcmXCDpvgrT?usp=sharing)으로 Quanto와 transformers를 사용해 보세요! [🤗 Quanto](https://github.com/huggingface/optimum-quanto) 라이브러리는 다목적 파이토치 양자화 툴킷입니다. 이 라이브러리에서 사용되는 양자화 방법은 선형 양자화입니다. 
Quanto는 다음과 같은 여러 가지 기능을 제공합니다: diff --git a/docs/source/ko/quicktour.md b/docs/source/ko/quicktour.md index de882503c9d8..06eea950a79d 100644 --- a/docs/source/ko/quicktour.md +++ b/docs/source/ko/quicktour.md @@ -39,11 +39,8 @@ pip install torch [`pipeline`](./main_classes/pipelines)은 사전 훈련된 모델로 추론하기에 가장 쉽고 빠른 방법입니다. [`pipeline`]은 여러 모달리티에서 다양한 과업을 쉽게 처리할 수 있으며, 아래 표에 표시된 몇 가지 과업을 기본적으로 지원합니다: - - -사용 가능한 작업의 전체 목록은 [Pipelines API 참조](./main_classes/pipelines)를 확인하세요. - - +> [!TIP] +> 사용 가능한 작업의 전체 목록은 [Pipelines API 참조](./main_classes/pipelines)를 확인하세요. | **태스크** | **설명** | **모달리티** | **파이프라인 ID** | |-----------------|----------------------------------------------------------------------|------------------|-----------------------------------------------| @@ -197,11 +194,8 @@ label: NEGATIVE, with score: 0.5309 ... ) ``` - - -[전처리](./preprocessing) 튜토리얼을 참조하시면 토큰화에 대한 자세한 설명과 함께 이미지, 오디오와 멀티모달 입력을 전처리하기 위한 [`AutoImageProcessor`]와 [`AutoFeatureExtractor`], [`AutoProcessor`]의 사용방법도 알 수 있습니다. - - +> [!TIP] +> [전처리](./preprocessing) 튜토리얼을 참조하시면 토큰화에 대한 자세한 설명과 함께 이미지, 오디오와 멀티모달 입력을 전처리하기 위한 [`AutoImageProcessor`]와 [`AutoFeatureExtractor`], [`AutoProcessor`]의 사용방법도 알 수 있습니다. ### AutoModel [[automodel]] @@ -214,11 +208,8 @@ label: NEGATIVE, with score: 0.5309 >>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name) ``` - - -[`AutoModel`] 클래스에서 지원하는 과업에 대해서는 [과업 요약](./task_summary)을 참조하세요. - - +> [!TIP] +> [`AutoModel`] 클래스에서 지원하는 과업에 대해서는 [과업 요약](./task_summary)을 참조하세요. 이제 전처리된 입력 묶음을 직접 모델에 전달해야 합니다. 아래처럼 `**`를 앞에 붙여 딕셔너리를 풀어주면 됩니다: @@ -237,11 +228,8 @@ tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725], [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=) ``` - - -모든 🤗 Transformers 모델(PyTorch 또는 TensorFlow)은 (softmax와 같은) 최종 활성화 함수 *이전에* 텐서를 출력합니다. 왜냐하면 최종 활성화 함수의 출력은 종종 손실 함수 출력과 결합되기 때문입니다. 모델 출력은 특수한 데이터 클래스이므로 IDE에서 자동 완성됩니다. 모델 출력은 튜플이나 딕셔너리처럼 동작하며 (정수, 슬라이스 또는 문자열로 인덱싱 가능), None인 속성은 무시됩니다. - - +> [!TIP] +> 모든 🤗 Transformers 모델(PyTorch 또는 TensorFlow)은 (softmax와 같은) 최종 활성화 함수 *이전에* 텐서를 출력합니다. 왜냐하면 최종 활성화 함수의 출력은 종종 손실 함수 출력과 결합되기 때문입니다. 모델 출력은 특수한 데이터 클래스이므로 IDE에서 자동 완성됩니다. 모델 출력은 튜플이나 딕셔너리처럼 동작하며 (정수, 슬라이스 또는 문자열로 인덱싱 가능), None인 속성은 무시됩니다. ### 모델 저장하기 [[save-a-model]] @@ -377,11 +365,8 @@ tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725], >>> trainer.train() # doctest: +SKIP ``` - - -번역이나 요약과 같이 시퀀스-시퀀스 모델을 사용하는 과업에는 [`Seq2SeqTrainer`] 및 [`Seq2SeqTrainingArguments`] 클래스를 사용하세요. - - +> [!TIP] +> 번역이나 요약과 같이 시퀀스-시퀀스 모델을 사용하는 과업에는 [`Seq2SeqTrainer`] 및 [`Seq2SeqTrainingArguments`] 클래스를 사용하세요. [`Trainer`] 내의 메서드를 서브클래스화하여 훈련 루프를 바꿀 수도 있습니다. 이러면 손실 함수, 옵티마이저, 스케줄러와 같은 기능 또한 바꿀 수 있게 됩니다. 변경 가능한 메소드에 대해서는 [`Trainer`] 문서를 참고하세요. diff --git a/docs/source/ko/serialization.md b/docs/source/ko/serialization.md index 2e521e2b7b4a..25b1d6cf2b32 100644 --- a/docs/source/ko/serialization.md +++ b/docs/source/ko/serialization.md @@ -126,11 +126,8 @@ CLI 대신에 `optimum.onnxruntime`을 사용하여 프로그래밍 방식으로 ### `transformers.onnx`를 사용하여 모델 내보내기 [[exporting-a-model-with-transformersonnx]] - - -`tranformers.onnx`는 더 이상 유지되지 않습니다. 위에서 설명한 대로 🤗 Optimum을 사용하여 모델을 내보내세요. 이 섹션은 향후 버전에서 제거될 예정입니다. - - +> [!WARNING] +> `tranformers.onnx`는 더 이상 유지되지 않습니다. 위에서 설명한 대로 🤗 Optimum을 사용하여 모델을 내보내세요. 이 섹션은 향후 버전에서 제거될 예정입니다. 🤗 Transformers 모델을 ONNX로 내보내려면 추가 종속성을 설치하세요: diff --git a/docs/source/ko/tasks/asr.md b/docs/source/ko/tasks/asr.md index f28dd9fbec04..6fe32bd41cef 100644 --- a/docs/source/ko/tasks/asr.md +++ b/docs/source/ko/tasks/asr.md @@ -28,11 +28,8 @@ Siri와 Alexa와 같은 가상 어시스턴트는 ASR 모델을 사용하여 일 1. 
[MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) 데이터 세트에서 [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base)를 미세 조정하여 오디오를 텍스트로 변환합니다. 2. 미세 조정한 모델을 추론에 사용합니다. - - -이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/automatic-speech-recognition)를 확인하는 것이 좋습니다. - - +> [!TIP] +> 이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/automatic-speech-recognition)를 확인하는 것이 좋습니다. 시작하기 전에 필요한 모든 라이브러리가 설치되어 있는지 확인하세요: @@ -232,11 +229,8 @@ MInDS-14 데이터 세트의 샘플링 레이트는 8000kHz이므로([데이터 ## 훈련하기[[train]] - - -[`Trainer`]로 모델을 미세 조정하는 것이 익숙하지 않다면, [여기](../training#train-with-pytorch-trainer)에서 기본 튜토리얼을 확인해보세요! - - +> [!TIP] +> [`Trainer`]로 모델을 미세 조정하는 것이 익숙하지 않다면, [여기](../training#train-with-pytorch-trainer)에서 기본 튜토리얼을 확인해보세요! 이제 모델 훈련을 시작할 준비가 되었습니다! [`AutoModelForCTC`]로 Wav2Vec2를 가져오세요. `ctc_loss_reduction` 매개변수로 CTC 손실에 적용할 축소(reduction) 방법을 지정하세요. 기본값인 합계 대신 평균을 사용하는 것이 더 좋은 경우가 많습니다: @@ -297,11 +291,8 @@ MInDS-14 데이터 세트의 샘플링 레이트는 8000kHz이므로([데이터 >>> trainer.push_to_hub() ``` - - -자동 음성 인식을 위해 모델을 미세 조정하는 더 자세한 예제는 영어 자동 음성 인식을 위한 [블로그 포스트](https://huggingface.co/blog/fine-tune-wav2vec2-english)와 다국어 자동 음성 인식을 위한 [포스트](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2)를 참조하세요. - - +> [!TIP] +> 자동 음성 인식을 위해 모델을 미세 조정하는 더 자세한 예제는 영어 자동 음성 인식을 위한 [블로그 포스트](https://huggingface.co/blog/fine-tune-wav2vec2-english)와 다국어 자동 음성 인식을 위한 [포스트](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2)를 참조하세요. ## 추론하기[[inference]] @@ -328,11 +319,8 @@ MInDS-14 데이터 세트의 샘플링 레이트는 8000kHz이므로([데이터 {'text': 'I WOUD LIKE O SET UP JOINT ACOUNT WTH Y PARTNER'} ``` - - -텍스트로 변환된 결과가 꽤 괜찮지만 더 좋을 수도 있습니다! 더 나은 결과를 얻으려면 더 많은 예제로 모델을 미세 조정하세요! - - +> [!TIP] +> 텍스트로 변환된 결과가 꽤 괜찮지만 더 좋을 수도 있습니다! 더 나은 결과를 얻으려면 더 많은 예제로 모델을 미세 조정하세요! `pipeline`의 결과를 수동으로 재현할 수도 있습니다: diff --git a/docs/source/ko/tasks/audio_classification.md b/docs/source/ko/tasks/audio_classification.md index 789d7ee88373..142f323aa692 100644 --- a/docs/source/ko/tasks/audio_classification.md +++ b/docs/source/ko/tasks/audio_classification.md @@ -27,11 +27,8 @@ rendered properly in your Markdown viewer. 1. [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) 데이터 세트를 [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base)로 미세 조정하여 화자의 의도를 분류합니다. 2. 추론에 미세 조정된 모델을 사용하세요. - - -이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/audio-classification)를 확인하는 것이 좋습니다. - - +> [!TIP] +> 이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/audio-classification)를 확인하는 것이 좋습니다. 시작하기 전에 필요한 라이브러리가 모두 설치되어 있는지 확인하세요: @@ -187,11 +184,8 @@ MinDS-14 데이터 세트의 샘플링 속도는 8khz이므로(이 정보는 [ ## 훈련[[train]] - - -[`Trainer`]로 모델을 미세 조정하는 데 익숙하지 않다면 기본 튜토리얼 [여기](../training#train-with-pytorch-trainer)을 살펴보세요! - - +> [!TIP] +> [`Trainer`]로 모델을 미세 조정하는 데 익숙하지 않다면 기본 튜토리얼 [여기](../training#train-with-pytorch-trainer)을 살펴보세요! 이제 모델 훈련을 시작할 준비가 되었습니다! [`AutoModelForAudioClassification`]을 이용해서 Wav2Vec2를 불러옵니다. 예상되는 레이블 수와 레이블 매핑을 지정합니다: @@ -246,11 +240,8 @@ MinDS-14 데이터 세트의 샘플링 속도는 8khz이므로(이 정보는 [ >>> trainer.push_to_hub() ``` - - -For a more in-depth example of how to finetune a model for audio classification, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb). 
- - +> [!TIP] +> For a more in-depth example of how to finetune a model for audio classification, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb). ## 추론[[inference]] diff --git a/docs/source/ko/tasks/document_question_answering.md b/docs/source/ko/tasks/document_question_answering.md index 6c2d04f4ee85..c15514150c2c 100644 --- a/docs/source/ko/tasks/document_question_answering.md +++ b/docs/source/ko/tasks/document_question_answering.md @@ -27,11 +27,8 @@ rendered properly in your Markdown viewer. - [DocVQA dataset](https://huggingface.co/datasets/nielsr/docvqa_1200_examples_donut)을 사용해 [LayoutLMv2](../model_doc/layoutlmv2) 미세 조정하기 - 추론을 위해 미세 조정된 모델을 사용하기 - - -이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/image-to-text)를 확인하는 것이 좋습니다. - - +> [!TIP] +> 이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/image-to-text)를 확인하는 것이 좋습니다. LayoutLMv2는 토큰의 마지막 은닉층 위에 질의 응답 헤드를 추가해 답변의 시작 토큰과 끝 토큰의 위치를 예측함으로써 문서 질의 응답 태스크를 해결합니다. 즉, 문맥이 주어졌을 때 질문에 답하는 정보를 추출하는 추출형 질의 응답(Extractive question answering)으로 문제를 처리합니다. 문맥은 OCR 엔진의 출력에서 가져오며, 여기서는 Google의 Tesseract를 사용합니다. diff --git a/docs/source/ko/tasks/idefics.md b/docs/source/ko/tasks/idefics.md index 00ce40e97607..a73feb3c6bea 100644 --- a/docs/source/ko/tasks/idefics.md +++ b/docs/source/ko/tasks/idefics.md @@ -42,9 +42,8 @@ rendered properly in your Markdown viewer. pip install -q bitsandbytes sentencepiece accelerate transformers ``` - -다음 예제를 비양자화된 버전의 모델 체크포인트로 실행하려면 최소 20GB의 GPU 메모리가 필요합니다. - +> [!TIP] +> 다음 예제를 비양자화된 버전의 모델 체크포인트로 실행하려면 최소 20GB의 GPU 메모리가 필요합니다. ## 모델 로드[[loading-the-model]] @@ -122,11 +121,9 @@ IDEFICS는 텍스트 및 이미지 프롬프트를 모두 수용합니다. 그 A puppy in a flower bed ``` - - -`max_new_tokens`의 크기를 증가시킬 때 발생할 수 있는 오류를 피하기 위해 `generate` 호출 시 `bad_words_ids`를 포함하는 것이 좋습니다. 모델로부터 생성된 이미지가 없을 때 새로운 `` 또는 `` 토큰을 생성하려고 하기 때문입니다. -이 가이드에서처럼 `bad_words_ids`를 함수 호출 시에 매개변수로 설정하거나, [텍스트 생성 전략](../generation_strategies) 가이드에 설명된 대로 `GenerationConfig`에 저장할 수도 있습니다. - +> [!TIP] +> `max_new_tokens`의 크기를 증가시킬 때 발생할 수 있는 오류를 피하기 위해 `generate` 호출 시 `bad_words_ids`를 포함하는 것이 좋습니다. 모델로부터 생성된 이미지가 없을 때 새로운 `` 또는 `` 토큰을 생성하려고 하기 때문입니다. +> 이 가이드에서처럼 `bad_words_ids`를 함수 호출 시에 매개변수로 설정하거나, [텍스트 생성 전략](../generation_strategies) 가이드에 설명된 대로 `GenerationConfig`에 저장할 수도 있습니다. ## 프롬프트 이미지 캡셔닝[[prompted-image-captioning]] @@ -302,10 +299,8 @@ The little girl ran IDEFICS가 문 앞에 있는 호박을 보고 유령에 대한 으스스한 할로윈 이야기를 만든 것 같습니다. - - -이처럼 긴 텍스트를 생성할 때는 텍스트 생성 전략을 조정하는 것이 좋습니다. 이렇게 하면 생성된 결과물의 품질을 크게 향상시킬 수 있습니다. 자세한 내용은 [텍스트 생성 전략](../generation_strategies)을 참조하세요. - +> [!TIP] +> 이처럼 긴 텍스트를 생성할 때는 텍스트 생성 전략을 조정하는 것이 좋습니다. 이렇게 하면 생성된 결과물의 품질을 크게 향상시킬 수 있습니다. 자세한 내용은 [텍스트 생성 전략](../generation_strategies)을 참조하세요. ## 배치 모드에서 추론 실행[[running-inference-in-batch-mode]] diff --git a/docs/source/ko/tasks/image_captioning.md b/docs/source/ko/tasks/image_captioning.md index c4d0f99b6170..e7b4ff36c558 100644 --- a/docs/source/ko/tasks/image_captioning.md +++ b/docs/source/ko/tasks/image_captioning.md @@ -68,12 +68,9 @@ DatasetDict({ 이 데이터세트는 `image`와 `text`라는 두 특성을 가지고 있습니다. - - -많은 이미지 캡션 데이터세트에는 이미지당 여러 개의 캡션이 포함되어 있습니다. -이러한 경우, 일반적으로 학습 중에 사용 가능한 캡션 중에서 무작위로 샘플을 추출합니다. - - +> [!TIP] +> 많은 이미지 캡션 데이터세트에는 이미지당 여러 개의 캡션이 포함되어 있습니다. +> 이러한 경우, 일반적으로 학습 중에 사용 가능한 캡션 중에서 무작위로 샘플을 추출합니다. 
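아래는 이미지당 여러 캡션이 있을 때 무작위로 하나를 고르는 전처리를 어떻게 구성할 수 있는지 보여 주는 스케치입니다. 캡션 리스트가 `text` 필드에 들어 있다는 것은 설명을 위한 가정이므로, 실제 데이터세트의 필드 이름에 맞게 조정하세요.

```python
import random

def sample_one_caption(example):
    # 가정: example["text"]가 캡션 문자열 리스트라고 가정하고 그중 하나를 무작위로 선택합니다.
    if isinstance(example["text"], list):
        example["text"] = random.choice(example["text"])
    return example

# 사용 예시(가정): ds = ds.map(sample_one_caption)
```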
[`~datasets.Dataset.train_test_split`] 메소드를 사용하여 데이터세트의 학습 분할을 학습 및 테스트 세트로 나눕니다: diff --git a/docs/source/ko/tasks/image_classification.md b/docs/source/ko/tasks/image_classification.md index 54490a6f939a..8ae3fb190687 100644 --- a/docs/source/ko/tasks/image_classification.md +++ b/docs/source/ko/tasks/image_classification.md @@ -29,11 +29,8 @@ rendered properly in your Markdown viewer. 1. [Food-101](https://huggingface.co/datasets/food101) 데이터 세트에서 [ViT](model_doc/vit)를 미세 조정하여 이미지에서 식품 항목을 분류합니다. 2. 추론을 위해 미세 조정 모델을 사용합니다. - - -이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/image-classification)를 확인하는 것이 좋습니다. - - +> [!TIP] +> 이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/image-classification)를 확인하는 것이 좋습니다. 시작하기 전에, 필요한 모든 라이브러리가 설치되어 있는지 확인하세요: @@ -176,11 +173,8 @@ Hugging Face 계정에 로그인하여 모델을 업로드하고 커뮤니티에 ## 훈련[[train]] - - -[`Trainer`]를 사용하여 모델을 미세 조정하는 방법에 익숙하지 않은 경우, [여기](../training#train-with-pytorch-trainer)에서 기본 튜토리얼을 확인하세요! - - +> [!TIP] +> [`Trainer`]를 사용하여 모델을 미세 조정하는 방법에 익숙하지 않은 경우, [여기](../training#train-with-pytorch-trainer)에서 기본 튜토리얼을 확인하세요! 이제 모델을 훈련시킬 준비가 되었습니다! [`AutoModelForImageClassification`]로 ViT를 가져옵니다. 예상되는 레이블 수, 레이블 매핑 및 레이블 수를 지정하세요: @@ -239,11 +233,8 @@ Hugging Face 계정에 로그인하여 모델을 업로드하고 커뮤니티에 ``` - - -이미지 분류를 위한 모델을 미세 조정하는 자세한 예제는 다음 [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb)을 참조하세요. - - +> [!TIP] +> 이미지 분류를 위한 모델을 미세 조정하는 자세한 예제는 다음 [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb)을 참조하세요. ## 추론[[inference]] diff --git a/docs/source/ko/tasks/language_modeling.md b/docs/source/ko/tasks/language_modeling.md index dcb665a0025a..0979d930ae48 100644 --- a/docs/source/ko/tasks/language_modeling.md +++ b/docs/source/ko/tasks/language_modeling.md @@ -32,11 +32,8 @@ rendered properly in your Markdown viewer. 1. [DistilGPT2](https://huggingface.co/distilbert/distilgpt2) 모델을 [ELI5](https://huggingface.co/datasets/eli5) 데이터 세트의 [r/askscience](https://www.reddit.com/r/askscience/) 하위 집합으로 미세 조정 2. 미세 조정된 모델을 추론에 사용 - - -이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/text-generation)를 확인하는 것이 좋습니다. - - +> [!TIP] +> 이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/text-generation)를 확인하는 것이 좋습니다. 시작하기 전에 필요한 라이브러리가 모두 설치되어 있는지 확인하세요: @@ -188,11 +185,8 @@ pip install transformers datasets evaluate ## 훈련[[train]] - - -[`Trainer`]를 사용하여 모델을 미세 조정하는 방법을 잘 모르신다면 [기본 튜토리얼](../training#train-with-pytorch-trainer)을 확인해보세요! - - +> [!TIP] +> [`Trainer`]를 사용하여 모델을 미세 조정하는 방법을 잘 모르신다면 [기본 튜토리얼](../training#train-with-pytorch-trainer)을 확인해보세요! 이제 모델을 훈련하기 준비가 되었습니다! [`AutoModelForCausalLM`]를 사용하여 DistilGPT2를 불러옵니다: @@ -244,11 +238,8 @@ Perplexity: 49.61 >>> trainer.push_to_hub() ``` - - -인과 언어 모델링을 위해 모델을 미세 조정하는 더 자세한 예제는 해당하는 [PyTorch 노트북](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) 또는 [TensorFlow 노트북](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb)을 참조하세요. - - +> [!TIP] +> 인과 언어 모델링을 위해 모델을 미세 조정하는 더 자세한 예제는 해당하는 [PyTorch 노트북](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) 또는 [TensorFlow 노트북](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb)을 참조하세요. 
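Hub에 업로드하는 대신(또는 업로드와 함께) 결과물을 로컬에만 저장하고 싶다면 아래와 같은 형태를 쓸 수 있습니다. 위 훈련 단계에서 만든 `trainer`와 `tokenizer` 객체가 그대로 있다고 가정한 간단한 스케치이며, 디렉터리 이름은 예시입니다.

```python
# 가정: 위 훈련 단계에서 만든 trainer와 tokenizer를 그대로 사용합니다.
output_dir = "./my_clm_model_local"
trainer.save_model(output_dir)          # 모델 가중치와 설정 저장
tokenizer.save_pretrained(output_dir)   # 토크나이저 파일도 함께 저장
```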
## 추론[[inference]] diff --git a/docs/source/ko/tasks/masked_language_modeling.md b/docs/source/ko/tasks/masked_language_modeling.md index 65da783f9ae8..64d6df81c62d 100644 --- a/docs/source/ko/tasks/masked_language_modeling.md +++ b/docs/source/ko/tasks/masked_language_modeling.md @@ -29,11 +29,8 @@ rendered properly in your Markdown viewer. 1. [ELI5](https://huggingface.co/datasets/eli5) 데이터 세트에서 [r/askscience](https://www.reddit.com/r/askscience/) 부분을 사용해 [DistilRoBERTa](https://huggingface.co/distilbert/distilroberta-base) 모델을 미세 조정합니다. 2. 추론 시에 직접 미세 조정한 모델을 사용합니다. - - -이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/fill-mask)를 확인하는 것이 좋습니다. - - +> [!TIP] +> 이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/fill-mask)를 확인하는 것이 좋습니다. 시작하기 전에 필요한 라이브러리가 모두 설치되어 있는지 확인하세요: @@ -191,10 +188,8 @@ Hugging Face 계정에 로그인하여 모델을 업로드하고 커뮤니티와 ## 훈련[[train]] - - -[`Trainer`]로 모델을 미세 조정하는 데 익숙하지 않다면 기본 튜토리얼 [여기](../training#train-with-pytorch-trainer)를 살펴보세요! - +> [!TIP] +> [`Trainer`]로 모델을 미세 조정하는 데 익숙하지 않다면 기본 튜토리얼 [여기](../training#train-with-pytorch-trainer)를 살펴보세요! 이제 모델 훈련을 시작할 준비가 되었습니다! [`AutoModelForMaskedLM`]를 사용해 DistilRoBERTa 모델을 가져옵니다: @@ -247,12 +242,10 @@ Perplexity: 8.76 >>> trainer.push_to_hub() ``` - - -마스킹된 언어 모델링을 위해 모델을 미세 조정하는 방법에 대한 보다 심층적인 예제는 -[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) -또는 [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb)을 참조하세요. - +> [!TIP] +> 마스킹된 언어 모델링을 위해 모델을 미세 조정하는 방법에 대한 보다 심층적인 예제는 +> [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) +> 또는 [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb)을 참조하세요. ## 추론[[inference]] diff --git a/docs/source/ko/tasks/monocular_depth_estimation.md b/docs/source/ko/tasks/monocular_depth_estimation.md index 2c640d2a86db..351f3481bba8 100644 --- a/docs/source/ko/tasks/monocular_depth_estimation.md +++ b/docs/source/ko/tasks/monocular_depth_estimation.md @@ -23,11 +23,8 @@ rendered properly in your Markdown viewer. 조명 조건, 가려짐, 텍스처와 같은 요소의 영향을 받을 수 있는 장면 내 물체와 해당 깊이 정보 간의 복잡한 관계를 모델이 이해해야 하므로 까다로운 작업입니다. - - -이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/depth-estimation)를 확인하는 것이 좋습니다. - - +> [!TIP] +> 이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/depth-estimation)를 확인하는 것이 좋습니다. 이번 가이드에서 배울 내용은 다음과 같습니다: diff --git a/docs/source/ko/tasks/multiple_choice.md b/docs/source/ko/tasks/multiple_choice.md index c8d99bc02ca1..cf7bca7d157e 100644 --- a/docs/source/ko/tasks/multiple_choice.md +++ b/docs/source/ko/tasks/multiple_choice.md @@ -144,11 +144,8 @@ tokenized_swag = swag.map(preprocess_function, batched=True) ## 훈련 하기[[train]] - - -[`Trainer`]로 모델을 미세 조정하는 데 익숙하지 않다면 기본 튜토리얼 [여기](../training#train-with-pytorch-trainer)를 살펴보세요! - - +> [!TIP] +> [`Trainer`]로 모델을 미세 조정하는 데 익숙하지 않다면 기본 튜토리얼 [여기](../training#train-with-pytorch-trainer)를 살펴보세요! 이제 모델 훈련을 시작할 준비가 되었습니다! [`AutoModelForMultipleChoice`]로 BERT를 로드합니다: @@ -198,13 +195,10 @@ tokenized_swag = swag.map(preprocess_function, batched=True) ``` - - -객관식 모델을 미세 조정하는 방법에 대한 보다 심층적인 예는 아래 문서를 참조하세요. 
-[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb) -또는 [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb). - - +> [!TIP] +> 객관식 모델을 미세 조정하는 방법에 대한 보다 심층적인 예는 아래 문서를 참조하세요. +> [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb) +> 또는 [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb). ## 추론 하기[[inference]] diff --git a/docs/source/ko/tasks/object_detection.md b/docs/source/ko/tasks/object_detection.md index 75319d93c24e..5f73ef942ef1 100644 --- a/docs/source/ko/tasks/object_detection.md +++ b/docs/source/ko/tasks/object_detection.md @@ -29,11 +29,8 @@ rendered properly in your Markdown viewer. 1. 합성곱 백본(인풋 데이터의 특성을 추출하는 합성곱 네트워크)과 인코더-디코더 트랜스포머 모델을 결합한 [DETR](https://huggingface.co/docs/transformers/model_doc/detr) 모델을 [CPPE-5](https://huggingface.co/datasets/cppe-5) 데이터 세트에 대해 미세조정 하기 2. 미세조정 한 모델을 추론에 사용하기. - - -이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/object-detection)를 확인하는 것이 좋습니다. - - +> [!TIP] +> 이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/object-detection)를 확인하는 것이 좋습니다. 시작하기 전에 필요한 모든 라이브러리가 설치되어 있는지 확인하세요: ```bash diff --git a/docs/source/ko/tasks/prompting.md b/docs/source/ko/tasks/prompting.md index dfbb0b8fa5e7..6bdaef542de1 100644 --- a/docs/source/ko/tasks/prompting.md +++ b/docs/source/ko/tasks/prompting.md @@ -32,14 +32,11 @@ Falcon, LLaMA 등의 대규모 언어 모델은 사전 훈련된 트랜스포머 - [고급 프롬프팅 기법: 퓨샷(Few-shot) 프롬프팅과 생각의 사슬(Chain-of-thought, CoT) 기법](#advanced-prompting-techniques) - [프롬프팅 대신 미세 조정을 해야 하는 경우](#prompting-vs-fine-tuning) - - -프롬프트 엔지니어링은 대규모 언어 모델 출력 최적화 과정의 일부일 뿐입니다. 또 다른 중요한 구성 요소는 최적의 텍스트 생성 전략을 선택하는 것입니다. 학습 가능한 매개변수를 수정하지 않고도 대규모 언어 모델이 텍스트를 생성하리 때 각각의 후속 토큰을 선택하는 방식을 사용자가 직접 정의할 수 있습니다. 텍스트 생성 매개변수를 조정함으로써 생성된 텍스트의 반복을 줄이고 더 일관되고 사람이 말하는 것 같은 텍스트를 만들 수 있습니다. 텍스트 생성 전략과 매개변수는 이 가이드의 범위를 벗어나지만, 다음 가이드에서 이러한 주제에 대해 자세히 알아볼 수 있습니다: - -* [대규모 언어 모델을 이용한 생성](../llm_tutorial) -* [텍스트 생성 전략](../generation_strategies) - - +> [!TIP] +> 프롬프트 엔지니어링은 대규모 언어 모델 출력 최적화 과정의 일부일 뿐입니다. 또 다른 중요한 구성 요소는 최적의 텍스트 생성 전략을 선택하는 것입니다. 학습 가능한 매개변수를 수정하지 않고도 대규모 언어 모델이 텍스트를 생성하리 때 각각의 후속 토큰을 선택하는 방식을 사용자가 직접 정의할 수 있습니다. 텍스트 생성 매개변수를 조정함으로써 생성된 텍스트의 반복을 줄이고 더 일관되고 사람이 말하는 것 같은 텍스트를 만들 수 있습니다. 텍스트 생성 전략과 매개변수는 이 가이드의 범위를 벗어나지만, 다음 가이드에서 이러한 주제에 대해 자세히 알아볼 수 있습니다: +> +> * [대규모 언어 모델을 이용한 생성](../llm_tutorial) +> * [텍스트 생성 전략](../generation_strategies) ## 프롬프팅의 기초 [[basics-of-prompting]] @@ -112,11 +109,8 @@ pip install -q transformers accelerate ... ) ``` - - -Falcon 모델은 bfloat16 데이터 타입을 사용하여 훈련되었으므로, 같은 타입을 사용하는 것을 권장합니다. 이를 위해서는 최신 버전의 CUDA가 필요하며, 최신 그래픽 카드에서 가장 잘 작동합니다. - - +> [!TIP] +> Falcon 모델은 bfloat16 데이터 타입을 사용하여 훈련되었으므로, 같은 타입을 사용하는 것을 권장합니다. 이를 위해서는 최신 버전의 CUDA가 필요하며, 최신 그래픽 카드에서 가장 잘 작동합니다. 이제 파이프라인을 통해 모델을 로드했으니, 프롬프트를 사용하여 자연어 처리 작업을 해결하는 방법을 살펴보겠습니다. @@ -145,11 +139,8 @@ Positive ``` 결과적으로, 우리가 지시사항에서 제공한 목록에서 선택된 분류 레이블이 정확하게 포함되어 생성된 것을 확인할 수 있습니다! - - -프롬프트 외에도 `max_new_tokens` 매개변수를 전달하는 것을 볼 수 있습니다. 이 매개변수는 모델이 생성할 토큰의 수를 제어하며, [텍스트 생성 전략](../generation_strategies) 가이드에서 배울 수 있는 여러 텍스트 생성 매개변수 중 하나입니다. - - +> [!TIP] +> 프롬프트 외에도 `max_new_tokens` 매개변수를 전달하는 것을 볼 수 있습니다. 이 매개변수는 모델이 생성할 토큰의 수를 제어하며, [텍스트 생성 전략](../generation_strategies) 가이드에서 배울 수 있는 여러 텍스트 생성 매개변수 중 하나입니다. 
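아래는 `max_new_tokens` 외의 생성 매개변수를 함께 전달하는 모습을 보여 주는 스케치입니다. 파이프라인 객체 이름(`generator`)과 프롬프트 문구, 매개변수 값은 설명을 위해 가정한 것이며, 어떤 값이 적절한지는 [텍스트 생성 전략](../generation_strategies) 가이드를 참고하세요.

```python
# 가정: 위에서 만든 텍스트 생성 파이프라인을 generator라는 이름으로 사용한다고 가정합니다.
outputs = generator(
    "Explain what a tokenizer does in one sentence.",
    max_new_tokens=30,   # 생성할 토큰 수 제한
    do_sample=True,      # 샘플링 기반 생성
    temperature=0.7,     # 샘플링 다양성 조절
)
print(outputs[0]["generated_text"])
```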
#### 개체명 인식 [[named-entity-recognition]] diff --git a/docs/source/ko/tasks/question_answering.md b/docs/source/ko/tasks/question_answering.md index 6e067dc38934..7fb312efd344 100644 --- a/docs/source/ko/tasks/question_answering.md +++ b/docs/source/ko/tasks/question_answering.md @@ -30,11 +30,8 @@ rendered properly in your Markdown viewer. 1. 추출적 질의 응답을 하기 위해 [SQuAD](https://huggingface.co/datasets/squad) 데이터 세트에서 [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) 미세 조정하기 2. 추론에 미세 조정된 모델 사용하기 - - -이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/question-answering)를 확인하는 것이 좋습니다. - - +> [!TIP] +> 이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/question-answering)를 확인하는 것이 좋습니다. 시작하기 전에, 필요한 라이브러리가 모두 설치되어 있는지 확인하세요: @@ -173,11 +170,8 @@ pip install transformers datasets evaluate ## 훈련[[train]] - - -[`Trainer`]를 이용해 모델을 미세 조정하는 것에 익숙하지 않다면, [여기](../training#train-with-pytorch-trainer)에서 기초 튜토리얼을 살펴보세요! - - +> [!TIP] +> [`Trainer`]를 이용해 모델을 미세 조정하는 것에 익숙하지 않다면, [여기](../training#train-with-pytorch-trainer)에서 기초 튜토리얼을 살펴보세요! 이제 모델 훈련을 시작할 준비가 되었습니다! [`AutoModelForQuestionAnswering`]으로 DistilBERT를 가져옵니다: @@ -223,11 +217,8 @@ pip install transformers datasets evaluate >>> trainer.push_to_hub() ``` - - -질의 응답을 위해 모델을 미세 조정하는 방법에 대한 더 자세한 예시는 [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb) 또는 [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb)을 참조하세요. - - +> [!TIP] +> 질의 응답을 위해 모델을 미세 조정하는 방법에 대한 더 자세한 예시는 [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb) 또는 [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb)을 참조하세요. ## 평가[[evaluate]] diff --git a/docs/source/ko/tasks/semantic_segmentation.md b/docs/source/ko/tasks/semantic_segmentation.md index 68acd8cda9ea..42d031e99de5 100644 --- a/docs/source/ko/tasks/semantic_segmentation.md +++ b/docs/source/ko/tasks/semantic_segmentation.md @@ -28,11 +28,8 @@ rendered properly in your Markdown viewer. 1. [SceneParse150](https://huggingface.co/datasets/scene_parse_150) 데이터 세트를 이용해 [SegFormer](https://huggingface.co/docs/transformers/main/en/model_doc/segformer#segformer) 미세 조정하기. 2. 미세 조정된 모델을 추론에 사용하기. - - -이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/image-segmentation)를 확인하는 것이 좋습니다. - - +> [!TIP] +> 이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/image-segmentation)를 확인하는 것이 좋습니다. 시작하기 전에 필요한 모든 라이브러리가 설치되었는지 확인하세요: @@ -186,11 +183,8 @@ pip install -q datasets transformers evaluate 이제 `compute_metrics` 함수를 사용할 준비가 되었습니다. 트레이닝을 설정할 때 이 함수로 돌아가게 됩니다. ## 학습하기[[train]] - - -만약 [`Trainer`]를 사용해 모델을 미세 조정하는 것에 익숙하지 않다면, [여기](../training#finetune-with-trainer)에서 기본 튜토리얼을 살펴보세요! - - +> [!TIP] +> 만약 [`Trainer`]를 사용해 모델을 미세 조정하는 것에 익숙하지 않다면, [여기](../training#finetune-with-trainer)에서 기본 튜토리얼을 살펴보세요! 이제 모델 학습을 시작할 준비가 되었습니다! [`AutoModelForSemanticSegmentation`]로 SegFormer를 불러오고, 모델에 레이블 ID와 레이블 클래스 간의 매핑을 전달합니다: diff --git a/docs/source/ko/tasks/sequence_classification.md b/docs/source/ko/tasks/sequence_classification.md index 9ffad8ff0b24..d96ce73863aa 100644 --- a/docs/source/ko/tasks/sequence_classification.md +++ b/docs/source/ko/tasks/sequence_classification.md @@ -27,11 +27,8 @@ rendered properly in your Markdown viewer. 1. 
[IMDb](https://huggingface.co/datasets/imdb) 데이터셋에서 [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased)를 파인 튜닝하여 영화 리뷰가 긍정적인지 부정적인지 판단합니다. 2. 추론을 위해 파인 튜닝 모델을 사용합니다. - - -이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/text-classification)를 확인하는 것이 좋습니다. - - +> [!TIP] +> 이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/text-classification)를 확인하는 것이 좋습니다. 시작하기 전에, 필요한 모든 라이브러리가 설치되어 있는지 확인하세요: @@ -136,11 +133,8 @@ tokenized_imdb = imdb.map(preprocess_function, batched=True) >>> label2id = {"NEGATIVE": 0, "POSITIVE": 1} ``` - - -[`Trainer`]를 사용하여 모델을 파인 튜닝하는 방법에 익숙하지 않은 경우, [여기](../training#train-with-pytorch-trainer)의 기본 튜토리얼을 확인하세요! - - +> [!TIP] +> [`Trainer`]를 사용하여 모델을 파인 튜닝하는 방법에 익숙하지 않은 경우, [여기](../training#train-with-pytorch-trainer)의 기본 튜토리얼을 확인하세요! 이제 모델을 훈련시킬 준비가 되었습니다! [`AutoModelForSequenceClassification`]로 DistilBERT를 가쳐오고 예상되는 레이블 수와 레이블 매핑을 지정하세요: @@ -185,11 +179,8 @@ tokenized_imdb = imdb.map(preprocess_function, batched=True) >>> trainer.train() ``` - - -[`Trainer`]는 `tokenizer`를 전달하면 기본적으로 동적 매핑을 적용합니다. 이 경우, 명시적으로 데이터 수집기를 지정할 필요가 없습니다. - - +> [!TIP] +> [`Trainer`]는 `tokenizer`를 전달하면 기본적으로 동적 매핑을 적용합니다. 이 경우, 명시적으로 데이터 수집기를 지정할 필요가 없습니다. 훈련이 완료되면, [`~transformers.Trainer.push_to_hub`] 메소드를 사용하여 모델을 Hub에 공유할 수 있습니다. @@ -197,11 +188,8 @@ tokenized_imdb = imdb.map(preprocess_function, batched=True) >>> trainer.push_to_hub() ``` - - -텍스트 분류를 위한 모델을 파인 튜닝하는 자세한 예제는 다음 [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb) 또는 [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb)를 참조하세요. - - +> [!TIP] +> 텍스트 분류를 위한 모델을 파인 튜닝하는 자세한 예제는 다음 [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb) 또는 [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb)를 참조하세요. ## 추론[[inference]] diff --git a/docs/source/ko/tasks/summarization.md b/docs/source/ko/tasks/summarization.md index 848a6cb00d00..ced46cac7a86 100644 --- a/docs/source/ko/tasks/summarization.md +++ b/docs/source/ko/tasks/summarization.md @@ -32,11 +32,8 @@ rendered properly in your Markdown viewer. 1. 생성 요약을 위한 [BillSum](https://huggingface.co/datasets/billsum) 데이터셋 중 캘리포니아 주 법안 하위 집합으로 [T5](https://huggingface.co/google-t5/t5-small)를 파인튜닝합니다. 2. 파인튜닝된 모델을 사용하여 추론합니다. - - -이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/summarization)를 확인하는 것이 좋습니다. - - +> [!TIP] +> 이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/summarization)를 확인하는 것이 좋습니다. 시작하기 전에 필요한 라이브러리가 모두 설치되어 있는지 확인하세요: @@ -167,11 +164,8 @@ Hugging Face 계정에 로그인하면 모델을 업로드하고 커뮤니티에 ## 학습[[train]] - - -모델을 [`Trainer`]로 파인튜닝 하는 것이 익숙하지 않다면, [여기](../training#train-with-pytorch-trainer)에서 기본 튜토리얼을 확인해보세요! - - +> [!TIP] +> 모델을 [`Trainer`]로 파인튜닝 하는 것이 익숙하지 않다면, [여기](../training#train-with-pytorch-trainer)에서 기본 튜토리얼을 확인해보세요! 이제 모델 학습을 시작할 준비가 되었습니다! [`AutoModelForSeq2SeqLM`]로 T5를 가져오세요: @@ -224,12 +218,9 @@ Hugging Face 계정에 로그인하면 모델을 업로드하고 커뮤니티에 >>> trainer.push_to_hub() ``` - - -요약을 위해 모델을 파인튜닝하는 방법에 대한 더 자세한 예제를 보려면 [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb) -또는 [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb)을 참고하세요. 
- - +> [!TIP] +> 요약을 위해 모델을 파인튜닝하는 방법에 대한 더 자세한 예제를 보려면 [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb) +> 또는 [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb)을 참고하세요. ## 추론[[inference]] diff --git a/docs/source/ko/tasks/token_classification.md b/docs/source/ko/tasks/token_classification.md index e4975405c3de..383d05ae9be4 100644 --- a/docs/source/ko/tasks/token_classification.md +++ b/docs/source/ko/tasks/token_classification.md @@ -27,11 +27,8 @@ rendered properly in your Markdown viewer. 1. [WNUT 17](https://huggingface.co/datasets/wnut_17) 데이터 세트에서 [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased)를 파인 튜닝하여 새로운 개체를 탐지합니다. 2. 추론을 위해 파인 튜닝 모델을 사용합니다. - - -이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/token-classification)를 확인하는 것이 좋습니다. - - +> [!TIP] +> 이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/token-classification)를 확인하는 것이 좋습니다. 시작하기 전에, 필요한 모든 라이브러리가 설치되어 있는지 확인하세요: @@ -240,11 +237,8 @@ Hugging Face 계정에 로그인하여 모델을 업로드하고 커뮤니티에 ... } ``` - - -[`Trainer`]를 사용하여 모델을 파인 튜닝하는 방법에 익숙하지 않은 경우, [여기](../training#train-with-pytorch-trainer)에서 기본 튜토리얼을 확인하세요! - - +> [!TIP] +> [`Trainer`]를 사용하여 모델을 파인 튜닝하는 방법에 익숙하지 않은 경우, [여기](../training#train-with-pytorch-trainer)에서 기본 튜토리얼을 확인하세요! 이제 모델을 훈련시킬 준비가 되었습니다! [`AutoModelForSequenceClassification`]로 DistilBERT를 가져오고 예상되는 레이블 수와 레이블 매핑을 지정하세요: @@ -295,13 +289,10 @@ Hugging Face 계정에 로그인하여 모델을 업로드하고 커뮤니티에 >>> trainer.push_to_hub() ``` - - -토큰 분류를 위한 모델을 파인 튜닝하는 자세한 예제는 다음 -[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb) -또는 [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb)를 참조하세요. - - +> [!TIP] +> 토큰 분류를 위한 모델을 파인 튜닝하는 자세한 예제는 다음 +> [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb) +> 또는 [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb)를 참조하세요. ## 추론[[inference]] diff --git a/docs/source/ko/tasks/translation.md b/docs/source/ko/tasks/translation.md index 4ecda3de384b..fd74a3951e2b 100644 --- a/docs/source/ko/tasks/translation.md +++ b/docs/source/ko/tasks/translation.md @@ -27,11 +27,8 @@ rendered properly in your Markdown viewer. 1. 영어 텍스트를 프랑스어로 번역하기 위해 [T5](https://huggingface.co/google-t5/t5-small) 모델을 OPUS Books 데이터세트의 영어-프랑스어 하위 집합으로 파인튜닝하는 방법과 2. 파인튜닝된 모델을 추론에 사용하는 방법입니다. - - -이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/translation)를 확인하는 것이 좋습니다. - - +> [!TIP] +> 이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/translation)를 확인하는 것이 좋습니다. 시작하기 전에 필요한 라이브러리가 모두 설치되어 있는지 확인하세요: @@ -167,11 +164,8 @@ pip install transformers datasets evaluate sacrebleu ## 훈련[[train]] - - -[`Trainer`]로 모델을 파인튜닝하는 방법에 익숙하지 않다면 [여기](../training#train-with-pytorch-trainer)에서 기본 튜토리얼을 살펴보시기 바랍니다! - - +> [!TIP] +> [`Trainer`]로 모델을 파인튜닝하는 방법에 익숙하지 않다면 [여기](../training#train-with-pytorch-trainer)에서 기본 튜토리얼을 살펴보시기 바랍니다! 모델을 훈련시킬 준비가 되었군요! 
[`AutoModelForSeq2SeqLM`]으로 T5를 로드하세요: @@ -221,11 +215,8 @@ pip install transformers datasets evaluate sacrebleu >>> trainer.push_to_hub() ``` - - -번역을 위해 모델을 파인튜닝하는 방법에 대한 보다 자세한 예제는 해당 [PyTorch 노트북](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb) 또는 [TensorFlow 노트북](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb)을 참조하세요. - - +> [!TIP] +> 번역을 위해 모델을 파인튜닝하는 방법에 대한 보다 자세한 예제는 해당 [PyTorch 노트북](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb) 또는 [TensorFlow 노트북](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb)을 참조하세요. ## 추론[[inference]] diff --git a/docs/source/ko/tasks/video_classification.md b/docs/source/ko/tasks/video_classification.md index d39d669f8a6f..d81a7489cfb2 100644 --- a/docs/source/ko/tasks/video_classification.md +++ b/docs/source/ko/tasks/video_classification.md @@ -26,11 +26,8 @@ rendered properly in your Markdown viewer. 1. [UCF101](https://www.crcv.ucf.edu/data/UCF101.php) 데이터 세트의 하위 집합을 통해 [VideoMAE](https://huggingface.co/docs/transformers/main/en/model_doc/videomae) 모델을 미세 조정하기. 2. 미세 조정한 모델을 추론에 사용하기. - - -이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/video-classification)를 확인하는 것이 좋습니다. - - +> [!TIP] +> 이 작업과 호환되는 모든 아키텍처와 체크포인트를 보려면 [작업 페이지](https://huggingface.co/tasks/video-classification)를 확인하는 것이 좋습니다. 시작하기 전에 필요한 모든 라이브러리가 설치되었는지 확인하세요: diff --git a/docs/source/ko/testing.md b/docs/source/ko/testing.md index 0a9e8ee47aca..e50fedde8715 100644 --- a/docs/source/ko/testing.md +++ b/docs/source/ko/testing.md @@ -323,17 +323,11 @@ pip install pytest-flakefinder pytest --flake-finder --flake-runs=5 tests/test_failing_test.py ``` - +> [!TIP] +> 이 플러그인은 `pytest-xdist`의 `-n` 플래그와 함께 작동하지 않습니다. -이 플러그인은 `pytest-xdist`의 `-n` 플래그와 함께 작동하지 않습니다. - - - - - -`pytest-repeat`라는 또 다른 플러그인도 있지만 `unittest`와 함께 작동하지 않습니다. - - +> [!TIP] +> `pytest-repeat`라는 또 다른 플러그인도 있지만 `unittest`와 함께 작동하지 않습니다. #### 테스트를 임의의 순서로 실행[[run-tests-in-a-random-order]] @@ -787,20 +781,14 @@ def test_whatever(self): - `after=True`: 테스트 종료 시 임시 디렉터리가 항상 삭제됩니다. - `after=False`: 테스트 종료 시 임시 디렉터리가 항상 그대로 유지됩니다. - - -`rm -r`에 해당하는 명령을 안전하게 실행하기 위해, -명시적인 `tmp_dir`을 사용하는 경우 프로젝트 저장소 체크 아웃의 하위 디렉터리만 허용됩니다. -따라서 실수로 `/tmp`가 아닌 중요한 파일 시스템의 일부가 삭제되지 않도록 항상 `./`로 시작하는 경로를 전달해야 합니다. - - - - - -각 테스트는 여러 개의 임시 디렉터리를 등록할 수 있으며, -별도로 요청하지 않는 한 모두 자동으로 제거됩니다. +> [!TIP] +> `rm -r`에 해당하는 명령을 안전하게 실행하기 위해, +> 명시적인 `tmp_dir`을 사용하는 경우 프로젝트 저장소 체크 아웃의 하위 디렉터리만 허용됩니다. +> 따라서 실수로 `/tmp`가 아닌 중요한 파일 시스템의 일부가 삭제되지 않도록 항상 `./`로 시작하는 경로를 전달해야 합니다. - +> [!TIP] +> 각 테스트는 여러 개의 임시 디렉터리를 등록할 수 있으며, +> 별도로 요청하지 않는 한 모두 자동으로 제거됩니다. ### 임시 sys.path 오버라이드[[temporary-sys.path-override]] diff --git a/docs/source/ko/torchscript.md b/docs/source/ko/torchscript.md index 28e198c5ec93..9162a36ad66d 100644 --- a/docs/source/ko/torchscript.md +++ b/docs/source/ko/torchscript.md @@ -16,13 +16,10 @@ rendered properly in your Markdown viewer. # TorchScript로 내보내기[[export-to-torchscript]] - - -TorchScript를 활용한 실험은 아직 초기 단계로, 가변적인 입력 크기 모델들을 통해 그 기능성을 계속 탐구하고 있습니다. -이 기능은 저희가 관심을 두고 있는 분야 중 하나이며, -앞으로 출시될 버전에서 더 많은 코드 예제, 더 유연한 구현, 그리고 Python 기반 코드와 컴파일된 TorchScript를 비교하는 벤치마크를 등을 통해 분석을 심화할 예정입니다. - - +> [!TIP] +> TorchScript를 활용한 실험은 아직 초기 단계로, 가변적인 입력 크기 모델들을 통해 그 기능성을 계속 탐구하고 있습니다. 
+> 이 기능은 저희가 관심을 두고 있는 분야 중 하나이며, +> 앞으로 출시될 버전에서 더 많은 코드 예제, 더 유연한 구현, 그리고 Python 기반 코드와 컴파일된 TorchScript를 비교하는 벤치마크를 등을 통해 분석을 심화할 예정입니다. [TorchScript 문서](https://pytorch.org/docs/stable/jit.html)에서는 이렇게 말합니다. diff --git a/docs/source/ko/trainer.md b/docs/source/ko/trainer.md index d753627c86fb..e3ac5f0d60e2 100644 --- a/docs/source/ko/trainer.md +++ b/docs/source/ko/trainer.md @@ -18,15 +18,12 @@ rendered properly in your Markdown viewer. [`Trainer`]는 Transformers 라이브러리에 구현된 PyTorch 모델을 반복하여 훈련 및 평가 과정입니다. 훈련에 필요한 요소(모델, 토크나이저, 데이터셋, 평가 함수, 훈련 하이퍼파라미터 등)만 제공하면 [`Trainer`]가 필요한 나머지 작업을 처리합니다. 이를 통해 직접 훈련 루프를 작성하지 않고도 빠르게 훈련을 시작할 수 있습니다. 또한 [`Trainer`]는 강력한 맞춤 설정과 다양한 훈련 옵션을 제공하여 사용자 맞춤 훈련이 가능합니다. - - -Transformers는 [`Trainer`] 클래스 외에도 번역이나 요약과 같은 시퀀스-투-시퀀스 작업을 위한 [`Seq2SeqTrainer`] 클래스도 제공합니다. 또한 [TRL](https://hf.co/docs/trl) 라이브러리에는 [`Trainer`] 클래스를 감싸고 Llama-2 및 Mistral과 같은 언어 모델을 자동 회귀 기법으로 훈련하는 데 최적화된 [`~trl.SFTTrainer`] 클래스 입니다. [`~trl.SFTTrainer`]는 시퀀스 패킹, LoRA, 양자화 및 DeepSpeed와 같은 기능을 지원하여 크기 상관없이 모델 효율적으로 확장할 수 있습니다. - -
- -이들 다른 [`Trainer`] 유형 클래스에 대해 더 알고 싶다면 [API 참조](./main_classes/trainer)를 확인하여 언제 어떤 클래스가 적합할지 얼마든지 확인하세요. 일반적으로 [`Trainer`]는 가장 다재다능한 옵션으로, 다양한 작업에 적합합니다. [`Seq2SeqTrainer`]는 시퀀스-투-시퀀스 작업을 위해 설계되었고, [`~trl.SFTTrainer`]는 언어 모델 훈련을 위해 설계되었습니다. - -
+> [!TIP]
+> Transformers는 [`Trainer`] 클래스 외에도 번역이나 요약과 같은 시퀀스-투-시퀀스 작업을 위한 [`Seq2SeqTrainer`] 클래스도 제공합니다. 또한 [TRL](https://hf.co/docs/trl) 라이브러리에는 [`Trainer`] 클래스를 감싸고 Llama-2 및 Mistral과 같은 언어 모델을 자동 회귀 기법으로 훈련하는 데 최적화된 [`~trl.SFTTrainer`] 클래스가 있습니다. [`~trl.SFTTrainer`]는 시퀀스 패킹, LoRA, 양자화 및 DeepSpeed와 같은 기능을 지원하여 모델 크기에 상관없이 효율적으로 확장할 수 있습니다.
+>
+>
+> +> 이들 다른 [`Trainer`] 유형 클래스에 대해 더 알고 싶다면 [API 참조](./main_classes/trainer)를 확인하여 언제 어떤 클래스가 적합할지 얼마든지 확인하세요. 일반적으로 [`Trainer`]는 가장 다재다능한 옵션으로, 다양한 작업에 적합합니다. [`Seq2SeqTrainer`]는 시퀀스-투-시퀀스 작업을 위해 설계되었고, [`~trl.SFTTrainer`]는 언어 모델 훈련을 위해 설계되었습니다. 시작하기 전에, 분산 환경에서 PyTorch 훈련과 실행을 할 수 있게 [Accelerate](https://hf.co/docs/accelerate) 라이브러리가 설치되었는지 확인하세요. @@ -182,21 +179,15 @@ trainer = Trainer( ## 로깅 [[logging]] - - -로깅 API에 대한 자세한 내용은 [로깅](./main_classes/logging) API 레퍼런스를 확인하세요. - - +> [!TIP] +> 로깅 API에 대한 자세한 내용은 [로깅](./main_classes/logging) API 레퍼런스를 확인하세요. [`Trainer`]는 기본적으로 `logging.INFO`로 설정되어 있어 오류, 경고 및 기타 기본 정보를 보고합니다. 분산 환경에서는 [`Trainer`] 복제본이 `logging.WARNING`으로 설정되어 오류와 경고만 보고합니다. [`TrainingArguments`]의 [`log_level`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.log_level) 및 [`log_level_replica`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.log_level_replica) 매개변수로 로그 레벨을 변경할 수 있습니다. 각 노드의 로그 레벨 설정을 구성하려면 [`log_on_each_node`](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments.log_on_each_node) 매개변수를 사용하여 각 노드에서 로그 레벨을 사용할지 아니면 주 노드에서만 사용할지 결정하세요. - - -[`Trainer`]는 [`Trainer.__init__`] 메소드에서 각 노드에 대해 로그 레벨을 별도로 설정하므로, 다른 Transformers 기능을 사용할 경우 [`Trainer`] 객체를 생성하기 전에 이를 미리 설정하는 것이 좋습니다. - - +> [!TIP] +> [`Trainer`]는 [`Trainer.__init__`] 메소드에서 각 노드에 대해 로그 레벨을 별도로 설정하므로, 다른 Transformers 기능을 사용할 경우 [`Trainer`] 객체를 생성하기 전에 이를 미리 설정하는 것이 좋습니다. 예를 들어, 메인 코드와 모듈을 각 노드에 따라 동일한 로그 레벨을 사용하도록 설정하려면 다음과 같이 합니다. @@ -344,11 +335,8 @@ trainer.train() LOMO 옵티마이저는 [제한된 자원으로 대형 언어 모델의 전체 매개변수 미세 조정](https://hf.co/papers/2306.09782)과 [적응형 학습률을 통한 저메모리 최적화(AdaLomo)](https://hf.co/papers/2310.10195)에서 도입되었습니다. 이들은 모두 효율적인 전체 매개변수 미세 조정 방법으로 구성되어 있습니다. 이러한 옵티마이저들은 메모리 사용량을 줄이기 위해 그레이디언트 계산과 매개변수 업데이트를 하나의 단계로 융합합니다. LOMO에서 지원되는 옵티마이저는 `"lomo"`와 `"adalomo"`입니다. 먼저 pypi에서 `pip install lomo-optim`를 통해 `lomo`를 설치하거나, GitHub 소스에서 `pip install git+https://github.com/OpenLMLab/LOMO.git`로 설치하세요. - - -저자에 따르면, `grad_norm` 없이 `AdaLomo`를 사용하는 것이 더 나은 성능과 높은 처리량을 제공한다고 합니다. - - +> [!TIP] +> 저자에 따르면, `grad_norm` 없이 `AdaLomo`를 사용하는 것이 더 나은 성능과 높은 처리량을 제공한다고 합니다. 다음은 IMDB 데이터셋에서 [google/gemma-2b](https://huggingface.co/google/gemma-2b)를 최대 정밀도로 미세 조정하는 간단한 스크립트입니다: @@ -375,11 +363,8 @@ trainer.train() [`Trainer`] 클래스는 [Accelerate](https://hf.co/docs/accelerate)로 구동되며, 이는 [FullyShardedDataParallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) 및 [DeepSpeed](https://www.deepspeed.ai/)와 같은 통합을 지원하는 분산 환경에서 PyTorch 모델을 쉽게 훈련할 수 있는 라이브러리입니다. - - -FSDP 샤딩 전략, CPU 오프로드 및 [`Trainer`]와 함께 사용할 수 있는 더 많은 기능을 알아보려면 [Fully Sharded Data Parallel](fsdp) 가이드를 확인하세요. - - +> [!TIP] +> FSDP 샤딩 전략, CPU 오프로드 및 [`Trainer`]와 함께 사용할 수 있는 더 많은 기능을 알아보려면 [Fully Sharded Data Parallel](fsdp) 가이드를 확인하세요. [`Trainer`]와 Accelerate를 사용하려면 [`accelerate.config`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-config) 명령을 실행하여 훈련 환경을 설정하세요. 이 명령은 훈련 스크립트를 실행할 때 사용할 `config_file.yaml`을 생성합니다. 예를 들어, 다음 예시는 설정할 수 있는 일부 구성 예입니다. diff --git a/docs/source/ko/training.md b/docs/source/ko/training.md index 95a7fe285d3c..8724a5e75085 100644 --- a/docs/source/ko/training.md +++ b/docs/source/ko/training.md @@ -85,12 +85,9 @@ rendered properly in your Markdown viewer. >>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` - - -사전 훈련된 가중치 중 일부가 사용되지 않고 일부 가중치가 무작위로 표시된다는 경고가 표시됩니다. -걱정마세요. 
이것은 올바른 동작입니다! 사전 학습된 BERT 모델의 헤드는 폐기되고 무작위로 초기화된 분류 헤드로 대체됩니다. 이제 사전 학습된 모델의 지식으로 시퀀스 분류 작업을 위한 새로운 모델 헤드를 미세 튜닝 합니다. - - +> [!TIP] +> 사전 훈련된 가중치 중 일부가 사용되지 않고 일부 가중치가 무작위로 표시된다는 경고가 표시됩니다. +> 걱정마세요. 이것은 올바른 동작입니다! 사전 학습된 BERT 모델의 헤드는 폐기되고 무작위로 초기화된 분류 헤드로 대체됩니다. 이제 사전 학습된 모델의 지식으로 시퀀스 분류 작업을 위한 새로운 모델 헤드를 미세 튜닝 합니다. ### 하이퍼파라미터 훈련[[training-hyperparameters]] @@ -248,11 +245,8 @@ torch.cuda.empty_cache() >>> model.to(device) ``` - - -[Colaboratory](https://colab.research.google.com/) 또는 [SageMaker StudioLab](https://studiolab.sagemaker.aws/)과 같은 호스팅 노트북이 없는 경우 클라우드 GPU에 무료로 액세스할 수 있습니다. - - +> [!TIP] +> [Colaboratory](https://colab.research.google.com/) 또는 [SageMaker StudioLab](https://studiolab.sagemaker.aws/)과 같은 호스팅 노트북이 없는 경우 클라우드 GPU에 무료로 액세스할 수 있습니다. 이제 훈련할 준비가 되었습니다! 🥳 diff --git a/docs/source/ko/troubleshooting.md b/docs/source/ko/troubleshooting.md index 263d693c23da..48d69f0c5b9d 100644 --- a/docs/source/ko/troubleshooting.md +++ b/docs/source/ko/troubleshooting.md @@ -59,11 +59,8 @@ CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 11.17 GiB total capacit - [`TrainingArguments`]의 [`per_device_train_batch_size`](main_classes/trainer#transformers.TrainingArguments.per_device_train_batch_size) 값을 줄이세요. - [`TrainingArguments`]의 [`gradient_accumulation_steps`](main_classes/trainer#transformers.TrainingArguments.gradient_accumulation_steps)은 전체 배치 크기를 효과적으로 늘리세요. - - -메모리 절약 기술에 대한 자세한 내용은 성능 [가이드](performance)를 참조하세요. - - +> [!TIP] +> 메모리 절약 기술에 대한 자세한 내용은 성능 [가이드](performance)를 참조하세요. ## 저장된 TensorFlow 모델을 가져올 수 없습니다(Unable to load a saved TensorFlow model)[[unable-to-load-a-saved-uensorFlow-model]] @@ -160,11 +157,8 @@ tensor([[-0.1008, -0.4061]], grad_fn=) 대부분의 경우 모델에 `attention_mask`를 제공하여 패딩 토큰을 무시해야 이러한 조용한 오류를 방지할 수 있습니다. 이제 두 번째 시퀀스의 출력이 실제 출력과 일치합니다: - - -일반적으로 토크나이저는 특정 토크나이저의 기본 값을 기준으로 사용자에 대한 'attention_mask'를 만듭니다. - - +> [!TIP] +> 일반적으로 토크나이저는 특정 토크나이저의 기본 값을 기준으로 사용자에 대한 'attention_mask'를 만듭니다. ```py >>> attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1], [1, 0, 0, 0, 0, 0]]) diff --git a/docs/source/pt/create_a_model.md b/docs/source/pt/create_a_model.md index 3eec2233540d..5194978bcfbe 100644 --- a/docs/source/pt/create_a_model.md +++ b/docs/source/pt/create_a_model.md @@ -101,11 +101,8 @@ Para reusar o arquivo de configurações, carregue com [`~PretrainedConfig.from_ >>> my_config = DistilBertConfig.from_pretrained("./your_model_save_path/my_config.json") ``` - - -Você pode também salvar seu arquivo de configurações como um dicionário ou até mesmo com a diferença entre as seus atributos de configuração customizados e os atributos de configuração padrões! Olhe a documentação [configuration](main_classes/configuration) para mais detalhes. - - +> [!TIP] +> Você pode também salvar seu arquivo de configurações como um dicionário ou até mesmo com a diferença entre as seus atributos de configuração customizados e os atributos de configuração padrões! Olhe a documentação [configuration](main_classes/configuration) para mais detalhes. ## Modelo @@ -163,11 +160,8 @@ A útlima classe base que você precisa antes de usar um modelo para dados textu Os dois tokenizers suporta métodos comuns como os de codificar e decodificar, adicionar novos tokens, e gerenciar tokens especiais. - - -Nem todo modelo suporta um 'fast tokenizer'. De uma olhada aqui [table](index#supported-frameworks) pra checar se um modelo suporta 'fast tokenizer'. - - +> [!WARNING] +> Nem todo modelo suporta um 'fast tokenizer'. 
De uma olhada aqui [table](index#supported-frameworks) pra checar se um modelo suporta 'fast tokenizer'. Se você treinou seu prórpio tokenizer, você pode criar um a partir do seu arquivo *vocabulary*: @@ -193,11 +187,8 @@ Criando um 'fast tokenizer' com a classe [`DistilBertTokenizerFast`]: >>> fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert/distilbert-base-uncased") ``` - - -Pos padrão, [`AutoTokenizer`] tentará carregar um 'fast tokenizer'. Você pode disabilitar esse comportamento colocando `use_fast=False` no `from_pretrained`. - - +> [!TIP] +> Pos padrão, [`AutoTokenizer`] tentará carregar um 'fast tokenizer'. Você pode disabilitar esse comportamento colocando `use_fast=False` no `from_pretrained`. ## Extrator de features @@ -229,11 +220,8 @@ ViTFeatureExtractor { } ``` - - -Se você não estiver procurando por nenhuma customização, apenas use o método `from_pretrained` para carregar parâmetros do modelo de extrator de features padrão. - - +> [!TIP] +> Se você não estiver procurando por nenhuma customização, apenas use o método `from_pretrained` para carregar parâmetros do modelo de extrator de features padrão. Modifique qualquer parâmetro dentre os [`ViTFeatureExtractor`] para criar seu extrator de features customizado. diff --git a/docs/source/pt/custom_models.md b/docs/source/pt/custom_models.md index 1866cca182e2..8c60e604a7b7 100644 --- a/docs/source/pt/custom_models.md +++ b/docs/source/pt/custom_models.md @@ -182,11 +182,8 @@ Em ambos os casos, observe como herdamos de `PreTrainedModel` e chamamos a inici (um pouco parecido quando você escreve um `torch.nn.Module`). A linha que define o `config_class` não é obrigatória, a menos que você deseje registrar seu modelo com as classes automáticas (consulte a última seção). - - -Se o seu modelo for muito semelhante a um modelo dentro da biblioteca, você poderá reutilizar a mesma configuração desse modelo. - - +> [!TIP] +> Se o seu modelo for muito semelhante a um modelo dentro da biblioteca, você poderá reutilizar a mesma configuração desse modelo. Você pode fazer com que seu modelo retorne o que você quiser,porém retornando um dicionário como fizemos para `ResnetModelForImageClassification`, com a função de perda incluída quando os rótulos são passados, vai tornar seu modelo diretamente @@ -219,11 +216,8 @@ código do modelo é salvo. ## Enviando o código para o Hub - - -Esta API é experimental e pode ter algumas pequenas alterações nas próximas versões. - - +> [!WARNING] +> Esta API é experimental e pode ter algumas pequenas alterações nas próximas versões. Primeiro, certifique-se de que seu modelo esteja totalmente definido em um arquivo `.py`. Ele pode contar com importações relativas para alguns outros arquivos desde que todos os arquivos estejam no mesmo diretório (ainda não suportamos submódulos para este recurso). Para o nosso exemplo, @@ -241,12 +235,9 @@ contém o código do `ResnetModel` e `ResnetModelForImageClassification`. O `__init__.py` pode estar vazio, apenas está lá para que o Python detecte que o `resnet_model` possa ser usado como um módulo. - - -Se estiver copiando arquivos de modelagem da biblioteca, você precisará substituir todas as importações relativas na parte superior do arquivo -para importar do pacote `transformers`. - - +> [!WARNING] +> Se estiver copiando arquivos de modelagem da biblioteca, você precisará substituir todas as importações relativas na parte superior do arquivo +> para importar do pacote `transformers`. 
Observe que você pode reutilizar (ou subclasse) uma configuração/modelo existente. diff --git a/docs/source/pt/installation.md b/docs/source/pt/installation.md index f548736589ac..ef84d08353a8 100644 --- a/docs/source/pt/installation.md +++ b/docs/source/pt/installation.md @@ -127,11 +127,8 @@ O Python agora buscará dentro dos arquivos que foram clonados além dos caminho Por exemplo, se os pacotes do Python se encontram instalados no caminho `~/anaconda3/envs/main/lib/python3.7/site-packages/`, o Python também buscará módulos no diretório onde clonamos o repositório `~/transformers/`. - - -É necessário manter o diretório `transformers` se desejas continuar usando a biblioteca. - - +> [!WARNING] +> É necessário manter o diretório `transformers` se desejas continuar usando a biblioteca. Assim, É possível atualizar sua cópia local para com a última versão do 🤗 Transformers com o seguinte comando: @@ -161,13 +158,10 @@ No Windows, este diretório pré-definido é dado por `C:\Users\username\.cache\ 2. Variável de ambiente do shell:`HF_HOME` + `transformers/`. 3. Variável de ambiente do shell: `XDG_CACHE_HOME` + `/huggingface/transformers`. - - - O 🤗 Transformers usará as variáveis de ambiente do shell `PYTORCH_TRANSFORMERS_CACHE` ou `PYTORCH_PRETRAINED_BERT_CACHE` - se estiver vindo de uma versão anterior da biblioteca que tenha configurado essas variáveis de ambiente, a menos que - você especifique a variável de ambiente do shell `TRANSFORMERS_CACHE`. - - +> [!TIP] +> O 🤗 Transformers usará as variáveis de ambiente do shell `PYTORCH_TRANSFORMERS_CACHE` ou `PYTORCH_PRETRAINED_BERT_CACHE` +> se estiver vindo de uma versão anterior da biblioteca que tenha configurado essas variáveis de ambiente, a menos que +> você especifique a variável de ambiente do shell `TRANSFORMERS_CACHE`. ## Modo Offline @@ -175,12 +169,9 @@ No Windows, este diretório pré-definido é dado por `C:\Users\username\.cache\ O 🤗 Transformers também pode ser executado num ambiente de firewall ou fora da rede (offline) usando arquivos locais. Para tal, configure a variável de ambiente de modo que `HF_HUB_OFFLINE=1`. - - -Você pode adicionar o [🤗 Datasets](https://huggingface.co/docs/datasets/) ao pipeline de treinamento offline declarando - a variável de ambiente `HF_DATASETS_OFFLINE=1`. - - +> [!TIP] +> Você pode adicionar o [🤗 Datasets](https://huggingface.co/docs/datasets/) ao pipeline de treinamento offline declarando +> a variável de ambiente `HF_DATASETS_OFFLINE=1`. Segue um exemplo de execução do programa numa rede padrão com firewall para instâncias externas, usando o seguinte comando: @@ -255,8 +246,5 @@ Depois que o arquivo for baixado e armazenado no cachê local, especifique seu c >>> config = AutoConfig.from_pretrained("./your/path/bigscience_t0/config.json") ``` - - -Para obter mais detalhes sobre como baixar arquivos armazenados no Hub, consulte a seção [How to download files from the Hub](https://huggingface.co/docs/hub/how-to-downstream). - - +> [!TIP] +> Para obter mais detalhes sobre como baixar arquivos armazenados no Hub, consulte a seção [How to download files from the Hub](https://huggingface.co/docs/hub/how-to-downstream). diff --git a/docs/source/pt/pipeline_tutorial.md b/docs/source/pt/pipeline_tutorial.md index 9c0cb3567e72..3d8526cce9b2 100644 --- a/docs/source/pt/pipeline_tutorial.md +++ b/docs/source/pt/pipeline_tutorial.md @@ -25,11 +25,8 @@ pode usar eles mesmo assim com o [pipeline]! Este tutorial te ensinará a: * Utilizar um tokenizador ou model específico. 
* Utilizar um [`pipeline`] para tarefas de áudio e visão computacional. - - - Acesse a documentação do [`pipeline`] para obter uma lista completa de tarefas possíveis. - - +> [!TIP] +> Acesse a documentação do [`pipeline`] para obter uma lista completa de tarefas possíveis. ## Uso do pipeline diff --git a/docs/source/pt/quicktour.md b/docs/source/pt/quicktour.md index 541d723fd809..8710191b7331 100644 --- a/docs/source/pt/quicktour.md +++ b/docs/source/pt/quicktour.md @@ -20,11 +20,8 @@ rendered properly in your Markdown viewer. Comece a trabalhar com 🤗 Transformers! Comece usando [`pipeline`] para rápida inferência e facilmente carregue um modelo pré-treinado e um tokenizer com [AutoClass](./model_doc/auto) para resolver tarefas de texto, visão ou áudio. - - -Todos os exemplos de código apresentados na documentação têm um botão no canto superior direito para escolher se você deseja ocultar ou mostrar o código no Pytorch ou no TensorFlow. Caso contrário, é esperado que funcione para ambos back-ends sem nenhuma alteração. - - +> [!TIP] +> Todos os exemplos de código apresentados na documentação têm um botão no canto superior direito para escolher se você deseja ocultar ou mostrar o código no Pytorch ou no TensorFlow. Caso contrário, é esperado que funcione para ambos back-ends sem nenhuma alteração. ## Pipeline @@ -53,11 +50,8 @@ A [`pipeline`] apoia diversas tarefas fora da caixa: * Classficação de áudio: legenda um trecho de áudio fornecido. * Reconhecimento de fala automático: transcreve audio em texto. - - -Para mais detalhes sobre a [`pipeline`] e tarefas associadas, siga a documentação [aqui](./main_classes/pipelines). - - +> [!TIP] +> Para mais detalhes sobre a [`pipeline`] e tarefas associadas, siga a documentação [aqui](./main_classes/pipelines). ### Uso da pipeline @@ -226,11 +220,8 @@ Leia o tutorial de [pré-processamento](./pré-processamento) para obter mais de >>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name) ``` - - -Veja o [sumário de tarefas](./task_summary) para qual classe de [`AutoModel`] usar para cada tarefa. - - +> [!TIP] +> Veja o [sumário de tarefas](./task_summary) para qual classe de [`AutoModel`] usar para cada tarefa. Agora você pode passar seu grupo de entradas pré-processadas diretamente para o modelo. Você apenas tem que descompactar o dicionário usando `**`: @@ -249,21 +240,14 @@ tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725], [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=) ``` - - -Todos os modelos de 🤗 Transformers (PyTorch ou TensorFlow) geram tensores *antes* da função de ativação final (como softmax) pois essa função algumas vezes é fundida com a perda. - - - +> [!TIP] +> Todos os modelos de 🤗 Transformers (PyTorch ou TensorFlow) geram tensores *antes* da função de ativação final (como softmax) pois essa função algumas vezes é fundida com a perda. Os modelos são um standard [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) ou um [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) para que você possa usá-los em seu loop de treinamento habitual. No entanto, para facilitar as coisas, 🤗 Transformers fornece uma classe [`Trainer`] para PyTorch que adiciona funcionalidade para treinamento distribuído, precisão mista e muito mais. Para o TensorFlow, você pode usar o método `fit` de [Keras](https://keras.io/). Consulte o [tutorial de treinamento](./training) para obter mais detalhes. 
- - -As saídas do modelo 🤗 Transformers são classes de dados especiais para que seus atributos sejam preenchidos automaticamente em um IDE. -As saídas do modelo também se comportam como uma tupla ou um dicionário (por exemplo, você pode indexar com um inteiro, uma parte ou uma string), caso em que os atributos `None` são ignorados. - - +> [!TIP] +> As saídas do modelo 🤗 Transformers são classes de dados especiais para que seus atributos sejam preenchidos automaticamente em um IDE. +> As saídas do modelo também se comportam como uma tupla ou um dicionário (por exemplo, você pode indexar com um inteiro, uma parte ou uma string), caso em que os atributos `None` são ignorados. ### Salvar um modelo diff --git a/docs/source/pt/serialization.md b/docs/source/pt/serialization.md index 9e390f07bde4..7052592e87ca 100644 --- a/docs/source/pt/serialization.md +++ b/docs/source/pt/serialization.md @@ -21,14 +21,11 @@ exporta-los para um formato serializado que pode ser carregado e executado em tempos de execução e hardware. Neste guia, mostraremos como exportar modelos 🤗 Transformers para [ONNX (Open Neural Network eXchange)](http://onnx.ai). - - -Uma vez exportado, um modelo pode ser otimizado para inferência por meio de técnicas como -quantização e poda. Se você estiver interessado em otimizar seus modelos para serem executados com -máxima eficiência, confira a biblioteca [🤗 Optimum -](https://github.com/huggingface/optimum). - - +> [!TIP] +> Uma vez exportado, um modelo pode ser otimizado para inferência por meio de técnicas como +> quantização e poda. Se você estiver interessado em otimizar seus modelos para serem executados com +> máxima eficiência, confira a biblioteca [🤗 Optimum +> ](https://github.com/huggingface/optimum). ONNX é um padrão aberto que define um conjunto comum de operadores e um formato de arquivo comum para representar modelos de aprendizado profundo em uma ampla variedade de estruturas, incluindo PyTorch e @@ -290,20 +287,14 @@ Observe que, neste caso, os nomes de saída do modelo ajustado são `logits` em vez do `last_hidden_state` que vimos com o checkpoint `distilbert/distilbert-base-uncased` mais cedo. Isso é esperado, pois o modelo ajustado (fine-tuned) possui uma cabeça de classificação de sequência. - - -Os recursos que têm um sufixo `with-pass` (como `causal-lm-with-pass`) correspondem a -classes de modelo com estados ocultos pré-computados (chave e valores nos blocos de atenção) -que pode ser usado para decodificação autorregressiva rápida. - - +> [!TIP] +> Os recursos que têm um sufixo `with-pass` (como `causal-lm-with-pass`) correspondem a +> classes de modelo com estados ocultos pré-computados (chave e valores nos blocos de atenção) +> que pode ser usado para decodificação autorregressiva rápida. - - -Para modelos do tipo `VisionEncoderDecoder`, as partes do codificador e do decodificador são -exportados separadamente como dois arquivos ONNX chamados `encoder_model.onnx` e `decoder_model.onnx` respectivamente. - - +> [!TIP] +> Para modelos do tipo `VisionEncoderDecoder`, as partes do codificador e do decodificador são +> exportados separadamente como dois arquivos ONNX chamados `encoder_model.onnx` e `decoder_model.onnx` respectivamente. 
## Exportando um modelo para uma arquitetura sem suporte @@ -326,12 +317,9 @@ você deve herdar, dependendo do tipo de arquitetura de modelo que deseja export * Modelos baseados em decodificador herdam de [`~onnx.config.OnnxConfigWithPast`] * Os modelos codificador-decodificador herdam de [`~onnx.config.OnnxSeq2SeqConfigWithPast`] - - -Uma boa maneira de implementar uma configuração ONNX personalizada é observar as -implementação no arquivo `configuration_.py` de uma arquitetura semelhante. - - +> [!TIP] +> Uma boa maneira de implementar uma configuração ONNX personalizada é observar as +> implementação no arquivo `configuration_.py` de uma arquitetura semelhante. Como o DistilBERT é um modelo baseado em codificador, sua configuração é herdada de `OnnxConfig`: @@ -358,20 +346,17 @@ dessa entrada. Para o DistilBERT, podemos ver que duas entradas são necessária `attention_mask`. Essas entradas têm a mesma forma de `(batch_size, sequence_length)` é por isso que vemos os mesmos eixos usados na configuração. - - -Notice that `inputs` property for `DistilBertOnnxConfig` returns an `OrderedDict`. This -ensures that the inputs are matched with their relative position within the -`PreTrainedModel.forward()` method when tracing the graph. We recommend using an -`OrderedDict` for the `inputs` and `outputs` properties when implementing custom ONNX -configurations. - -Observe que a propriedade `inputs` para `DistilBertOnnxConfig` retorna um `OrderedDict`. Este -garante que as entradas sejam combinadas com sua posição relativa dentro do -método `PreTrainedModel.forward()` ao traçar o grafo. Recomendamos o uso de um -`OrderedDict` para as propriedades `inputs` e `outputs` ao implementar configurações personalizadas ONNX. - - +> [!TIP] +> Notice that `inputs` property for `DistilBertOnnxConfig` returns an `OrderedDict`. This +> ensures that the inputs are matched with their relative position within the +> `PreTrainedModel.forward()` method when tracing the graph. We recommend using an +> `OrderedDict` for the `inputs` and `outputs` properties when implementing custom ONNX +> configurations. +> +> Observe que a propriedade `inputs` para `DistilBertOnnxConfig` retorna um `OrderedDict`. Este +> garante que as entradas sejam combinadas com sua posição relativa dentro do +> método `PreTrainedModel.forward()` ao traçar o grafo. Recomendamos o uso de um +> `OrderedDict` para as propriedades `inputs` e `outputs` ao implementar configurações personalizadas ONNX. Depois de implementar uma configuração ONNX, você pode instanciá-la fornecendo a configuração do modelo base da seguinte forma: @@ -416,13 +401,10 @@ de classificação, poderíamos usar: OrderedDict([('logits', {0: 'batch'})]) ``` - - -Todas as propriedades e métodos básicos associados a [`~onnx.config.OnnxConfig`] e -as outras classes de configuração podem ser substituídas se necessário. Confira [`BartOnnxConfig`] -para um exemplo avançado. - - +> [!TIP] +> Todas as propriedades e métodos básicos associados a [`~onnx.config.OnnxConfig`] e +> as outras classes de configuração podem ser substituídas se necessário. Confira [`BartOnnxConfig`] +> para um exemplo avançado. ### Exportando um modelo @@ -455,16 +437,13 @@ modelo é exportado, você pode testar se o modelo está bem formado da seguinte >>> onnx.checker.check_model(onnx_model) ``` - - -Se o seu modelo for maior que 2GB, você verá que muitos arquivos adicionais são criados -durante a exportação. 
Isso é _esperado_ porque o ONNX usa [Protocol -Buffers](https://developers.google.com/protocol-buffers/) para armazenar o modelo e estes -têm um limite de tamanho de 2GB. Veja a [ONNX -documentação](https://github.com/onnx/onnx/blob/master/docs/ExternalData.md) para -instruções sobre como carregar modelos com dados externos. - - +> [!TIP] +> Se o seu modelo for maior que 2GB, você verá que muitos arquivos adicionais são criados +> durante a exportação. Isso é _esperado_ porque o ONNX usa [Protocol +> Buffers](https://developers.google.com/protocol-buffers/) para armazenar o modelo e estes +> têm um limite de tamanho de 2GB. Veja a [ONNX +> documentação](https://github.com/onnx/onnx/blob/master/docs/ExternalData.md) para +> instruções sobre como carregar modelos com dados externos. ### Validando a saída dos modelos diff --git a/docs/source/pt/tasks/sequence_classification.md b/docs/source/pt/tasks/sequence_classification.md index 70db6310e50a..e50a58648ba4 100644 --- a/docs/source/pt/tasks/sequence_classification.md +++ b/docs/source/pt/tasks/sequence_classification.md @@ -22,11 +22,8 @@ A classificação de texto é uma tarefa comum de NLP que atribui um rótulo ou Este guia mostrará como realizar o fine-tuning do [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) no conjunto de dados [IMDb](https://huggingface.co/datasets/imdb) para determinar se a crítica de filme é positiva ou negativa. - - -Consulte a [página de tarefas de classificação de texto](https://huggingface.co/tasks/text-classification) para obter mais informações sobre outras formas de classificação de texto e seus modelos, conjuntos de dados e métricas associados. - - +> [!TIP] +> Consulte a [página de tarefas de classificação de texto](https://huggingface.co/tasks/text-classification) para obter mais informações sobre outras formas de classificação de texto e seus modelos, conjuntos de dados e métricas associados. ## Carregue o conjunto de dados IMDb @@ -94,11 +91,8 @@ Carregue o DistilBERT com [`AutoModelForSequenceClassification`] junto com o nú >>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=2) ``` - - -Se você não estiver familiarizado com o fine-tuning de um modelo com o [`Trainer`], dê uma olhada no tutorial básico [aqui](../training#finetune-with-trainer)! - - +> [!TIP] +> Se você não estiver familiarizado com o fine-tuning de um modelo com o [`Trainer`], dê uma olhada no tutorial básico [aqui](../training#finetune-with-trainer)! Nesse ponto, restam apenas três passos: @@ -128,14 +122,8 @@ Nesse ponto, restam apenas três passos: >>> trainer.train() ``` - - -O [`Trainer`] aplicará o preenchimento dinâmico por padrão quando você definir o argumento `tokenizer` dele. Nesse caso, você não precisa especificar um data collator explicitamente. - - - - - -Para obter um exemplo mais aprofundado de como executar o fine-tuning de um modelo para classificação de texto, dê uma olhada nesse [notebook utilizando PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb) ou nesse [notebook utilizando TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb). +> [!TIP] +> O [`Trainer`] aplicará o preenchimento dinâmico por padrão quando você definir o argumento `tokenizer` dele. Nesse caso, você não precisa especificar um data collator explicitamente. 
- +> [!TIP] +> Para obter um exemplo mais aprofundado de como executar o fine-tuning de um modelo para classificação de texto, dê uma olhada nesse [notebook utilizando PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb) ou nesse [notebook utilizando TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb). diff --git a/docs/source/pt/tasks/token_classification.md b/docs/source/pt/tasks/token_classification.md index 3c0ac5671589..5e7b3ebe7e67 100644 --- a/docs/source/pt/tasks/token_classification.md +++ b/docs/source/pt/tasks/token_classification.md @@ -22,11 +22,8 @@ A classificação de tokens atribui um rótulo a tokens individuais em uma frase Este guia mostrará como realizar o fine-tuning do [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) no conjunto de dados [WNUT 17](https://huggingface.co/datasets/wnut_17) para detectar novas entidades. - - -Consulte a [página de tarefas de classificação de tokens](https://huggingface.co/tasks/token-classification) para obter mais informações sobre outras formas de classificação de tokens e seus modelos, conjuntos de dados e métricas associadas. - - +> [!TIP] +> Consulte a [página de tarefas de classificação de tokens](https://huggingface.co/tasks/token-classification) para obter mais informações sobre outras formas de classificação de tokens e seus modelos, conjuntos de dados e métricas associadas. ## Carregando o conjunto de dados WNUT 17 @@ -152,11 +149,8 @@ Carregue o DistilBERT com o [`AutoModelForTokenClassification`] junto com o núm >>> model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=14) ``` - - -Se você não estiver familiarizado com o fine-tuning de um modelo com o [`Trainer`], dê uma olhada no tutorial básico [aqui](../training#finetune-with-trainer)! - - +> [!TIP] +> Se você não estiver familiarizado com o fine-tuning de um modelo com o [`Trainer`], dê uma olhada no tutorial básico [aqui](../training#finetune-with-trainer)! Nesse ponto, restam apenas três passos: @@ -187,8 +181,5 @@ Nesse ponto, restam apenas três passos: >>> trainer.train() ``` - - -Para obter um exemplo mais aprofundado de como executar o fine-tuning de um modelo para classificação de tokens, dê uma olhada nesse [notebook utilizando PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb) ou nesse [notebook utilizando TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb). - - +> [!TIP] +> Para obter um exemplo mais aprofundado de como executar o fine-tuning de um modelo para classificação de tokens, dê uma olhada nesse [notebook utilizando PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb) ou nesse [notebook utilizando TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb). 
diff --git a/docs/source/pt/training.md b/docs/source/pt/training.md index 67294baee35c..405fda63c084 100644 --- a/docs/source/pt/training.md +++ b/docs/source/pt/training.md @@ -96,15 +96,12 @@ sabemos ter 5 labels usamos o seguinte código: >>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` - - - Você verá um alerta sobre alguns pesos pré-treinados que não estão sendo utilizados e que alguns pesos estão - sendo inicializados aleatoriamente. Não se preocupe, essa mensagem é completamente normal. - O header/cabeçário pré-treinado do modelo BERT é descartado e substitui-se por um header de classificação - inicializado aleatoriamente. Assim, pode aplicar o fine-tuning a este novo header do modelo em sua tarefa - de classificação de sequências fazendo um transfer learning do modelo pré-treinado. - - +> [!TIP] +> Você verá um alerta sobre alguns pesos pré-treinados que não estão sendo utilizados e que alguns pesos estão +> sendo inicializados aleatoriamente. Não se preocupe, essa mensagem é completamente normal. +> O header/cabeçário pré-treinado do modelo BERT é descartado e substitui-se por um header de classificação +> inicializado aleatoriamente. Assim, pode aplicar o fine-tuning a este novo header do modelo em sua tarefa +> de classificação de sequências fazendo um transfer learning do modelo pré-treinado. ### Hiperparâmetros de treinamento @@ -195,12 +192,9 @@ Assegure-se de especificar os `return_tensors` para retornar os tensores do Tens >>> data_collator = DefaultDataCollator(return_tensors="tf") ``` - - - O [`Trainer`] utiliza [`DataCollatorWithPadding`] por padrão, então você não precisa especificar explicitamente um - colador de dados (data collator). - - +> [!TIP] +> O [`Trainer`] utiliza [`DataCollatorWithPadding`] por padrão, então você não precisa especificar explicitamente um +> colador de dados (data collator). Em seguida, converta os datasets tokenizados em datasets do TensorFlow com o método [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.to_tf_dataset). @@ -347,13 +341,10 @@ em uma CPU pode acabar levando várias horas em vez de minutos. >>> model.to(device) ``` - - - Se necessário, você pode obter o acesso gratuito a uma GPU na núvem por meio de um notebook no - [Colaboratory](https://colab.research.google.com/) ou [SageMaker StudioLab](https://studiolab.sagemaker.aws/) - se não tiver esse recurso de forma local. - - +> [!TIP] +> Se necessário, você pode obter o acesso gratuito a uma GPU na núvem por meio de um notebook no +> [Colaboratory](https://colab.research.google.com/) ou [SageMaker StudioLab](https://studiolab.sagemaker.aws/) +> se não tiver esse recurso de forma local. Perfeito, agora estamos prontos para começar o treinamento! 🥳 diff --git a/docs/source/zh/autoclass_tutorial.md b/docs/source/zh/autoclass_tutorial.md index 8df92f65e648..865da1364e43 100644 --- a/docs/source/zh/autoclass_tutorial.md +++ b/docs/source/zh/autoclass_tutorial.md @@ -18,12 +18,8 @@ rendered properly in your Markdown viewer. 
由于存在许多不同的Transformer架构,因此为您的checkpoint创建一个可用架构可能会具有挑战性。通过`AutoClass`可以自动推断并从给定的checkpoint加载正确的架构, 这也是🤗 Transformers易于使用、简单且灵活核心规则的重要一部分。`from_pretrained()`方法允许您快速加载任何架构的预训练模型,因此您不必花费时间和精力从头开始训练模型。生成这种与checkpoint无关的代码意味着,如果您的代码适用于一个checkpoint,它将适用于另一个checkpoint - 只要它们是为了类似的任务进行训练的 - 即使架构不同。 - - -请记住,架构指的是模型的结构,而checkpoints是给定架构的权重。例如,[BERT](https://huggingface.co/google-bert/bert-base-uncased)是一种架构,而`google-bert/bert-base-uncased`是一个checkpoint。模型是一个通用术语,可以指代架构或checkpoint。 - - - +> [!TIP] +> 请记住,架构指的是模型的结构,而checkpoints是给定架构的权重。例如,[BERT](https://huggingface.co/google-bert/bert-base-uncased)是一种架构,而`google-bert/bert-base-uncased`是一个checkpoint。模型是一个通用术语,可以指代架构或checkpoint。 在这个教程中,学习如何: @@ -114,11 +110,8 @@ rendered properly in your Markdown viewer. >>> model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` - - -对于PyTorch模型,`from_pretrained()`方法使用`torch.load()`,它内部使用已知是不安全的`pickle`。一般来说,永远不要加载来自不可信来源或可能被篡改的模型。对于托管在Hugging Face Hub上的公共模型,这种安全风险在一定程度上得到了缓解,因为每次提交都会进行[恶意软件扫描](https://huggingface.co/docs/hub/security-malware)。请参阅[Hub文档](https://huggingface.co/docs/hub/security)以了解最佳实践,例如使用GPG进行[签名提交验证](https://huggingface.co/docs/hub/security-gpg#signing-commits-with-gpg)。 - - +> [!WARNING] +> 对于PyTorch模型,`from_pretrained()`方法使用`torch.load()`,它内部使用已知是不安全的`pickle`。一般来说,永远不要加载来自不可信来源或可能被篡改的模型。对于托管在Hugging Face Hub上的公共模型,这种安全风险在一定程度上得到了缓解,因为每次提交都会进行[恶意软件扫描](https://huggingface.co/docs/hub/security-malware)。请参阅[Hub文档](https://huggingface.co/docs/hub/security)以了解最佳实践,例如使用GPG进行[签名提交验证](https://huggingface.co/docs/hub/security-gpg#signing-commits-with-gpg)。 一般来说,我们建议使用`AutoTokenizer`类和`AutoModelFor`类来加载预训练的模型实例。这样可以确保每次加载正确的架构。在下一个[教程](preprocessing)中,学习如何使用新加载的`tokenizer`, `image processor`, `feature extractor`和`processor`对数据集进行预处理以进行微调。 diff --git a/docs/source/zh/big_models.md b/docs/source/zh/big_models.md index 2215c7066182..cb3c6802526d 100644 --- a/docs/source/zh/big_models.md +++ b/docs/source/zh/big_models.md @@ -24,11 +24,8 @@ rendered properly in your Markdown viewer. 步骤1和2都需要完整版本的模型在内存中,这在大多数情况下不是问题,但如果你的模型开始达到几个GB的大小,这两个副本可能会让你超出内存的限制。更糟糕的是,如果你使用`torch.distributed`来启动分布式训练,每个进程都会加载预训练模型并将这两个副本存储在内存中。 - - -请注意,随机创建的模型使用“空”张量进行初始化,这些张量占用内存空间但不填充它(因此随机值是给定时间内该内存块中的任何内容)。在第3步之后,对未初始化的权重执行适合模型/参数种类的随机初始化(例如正态分布),以尽可能提高速度! - - +> [!TIP] +> 请注意,随机创建的模型使用“空”张量进行初始化,这些张量占用内存空间但不填充它(因此随机值是给定时间内该内存块中的任何内容)。在第3步之后,对未初始化的权重执行适合模型/参数种类的随机初始化(例如正态分布),以尽可能提高速度! 在本指南中,我们将探讨 Transformers 提供的解决方案来处理这个问题。请注意,这是一个积极开发的领域,因此这里解释的API在将来可能会略有变化。 diff --git a/docs/source/zh/contributing.md b/docs/source/zh/contributing.md index 6bada2c8a6c1..7171b2d8ba93 100644 --- a/docs/source/zh/contributing.md +++ b/docs/source/zh/contributing.md @@ -268,11 +268,8 @@ python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classific 默认情况下,会跳过时间较长的测试,但你可以将 `RUN_SLOW` 环境变量设置为 `yes` 来运行它们。这将下载以 GB 为单位的模型文件,所以确保你有足够的磁盘空间、良好的网络连接和足够的耐心! - - -记得指定一个*子文件夹的路径或测试文件*来运行测试。否则你将会运行 `tests` 或 `examples` 文件夹中的所有测试,它会花费很长时间! - - +> [!WARNING] +> 记得指定一个*子文件夹的路径或测试文件*来运行测试。否则你将会运行 `tests` 或 `examples` 文件夹中的所有测试,它会花费很长时间! 
```bash RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model diff --git a/docs/source/zh/create_a_model.md b/docs/source/zh/create_a_model.md index a90b035a5410..570b03274127 100644 --- a/docs/source/zh/create_a_model.md +++ b/docs/source/zh/create_a_model.md @@ -102,11 +102,8 @@ DistilBertConfig { >>> my_config = DistilBertConfig.from_pretrained("./your_model_save_path/config.json") ``` - - -你还可以将配置文件保存为字典,甚至只保存自定义配置属性与默认配置属性之间的差异!有关更多详细信息,请参阅 [配置](main_classes/configuration) 文档。 - - +> [!TIP] +> 你还可以将配置文件保存为字典,甚至只保存自定义配置属性与默认配置属性之间的差异!有关更多详细信息,请参阅 [配置](main_classes/configuration) 文档。 ## 模型 @@ -164,11 +161,8 @@ DistilBertConfig { 这两种分词器都支持常用的方法,如编码和解码、添加新标记以及管理特殊标记。 - - -并非每个模型都支持快速分词器。参照这张 [表格](index#supported-frameworks) 查看模型是否支持快速分词器。 - - +> [!WARNING] +> 并非每个模型都支持快速分词器。参照这张 [表格](index#supported-frameworks) 查看模型是否支持快速分词器。 如果您训练了自己的分词器,则可以从*词表*文件创建一个分词器: @@ -194,11 +188,8 @@ DistilBertConfig { >>> fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert/distilbert-base-uncased") ``` - - -默认情况下,[`AutoTokenizer`] 将尝试加载快速标记生成器。你可以通过在 `from_pretrained` 中设置 `use_fast=False` 以禁用此行为。 - - +> [!TIP] +> 默认情况下,[`AutoTokenizer`] 将尝试加载快速标记生成器。你可以通过在 `from_pretrained` 中设置 `use_fast=False` 以禁用此行为。 ## 图像处理器 @@ -230,11 +221,8 @@ ViTImageProcessor { } ``` - - -如果您不需要进行任何自定义,只需使用 `from_pretrained` 方法加载模型的默认图像处理器参数。 - - +> [!TIP] +> 如果您不需要进行任何自定义,只需使用 `from_pretrained` 方法加载模型的默认图像处理器参数。 修改任何 [`ViTImageProcessor`] 参数以创建自定义图像处理器: @@ -284,11 +272,8 @@ Wav2Vec2FeatureExtractor { } ``` - - -如果您不需要进行任何自定义,只需使用 `from_pretrained` 方法加载模型的默认特征提取器参数。 - - +> [!TIP] +> 如果您不需要进行任何自定义,只需使用 `from_pretrained` 方法加载模型的默认特征提取器参数。 修改任何 [`Wav2Vec2FeatureExtractor`] 参数以创建自定义特征提取器: diff --git a/docs/source/zh/custom_models.md b/docs/source/zh/custom_models.md index d38aaf4511f2..3cd03b4903ac 100644 --- a/docs/source/zh/custom_models.md +++ b/docs/source/zh/custom_models.md @@ -161,11 +161,8 @@ class ResnetModelForImageClassification(PreTrainedModel): 在这两种情况下,请注意我们如何继承 `PreTrainedModel` 并使用 `config` 调用了超类的初始化(有点像编写常规的torch.nn.Module)。设置 `config_class` 的那行代码不是必须的,除非你想使用自动类注册你的模型(请参阅最后一节)。 - - -如果你的模型与库中的某个模型非常相似,你可以重用与该模型相同的配置。 - - +> [!TIP] +> 如果你的模型与库中的某个模型非常相似,你可以重用与该模型相同的配置。 你可以让模型返回任何你想要的内容,但是像我们为 `ResnetModelForImageClassification` 做的那样返回一个字典,并在传递标签时包含loss,可以使你的模型能够在 [`Trainer`] 类中直接使用。只要你计划使用自己的训练循环或其他库进行训练,也可以使用其他输出格式。 @@ -190,11 +187,8 @@ resnet50d.model.load_state_dict(pretrained_model.state_dict()) ## 将代码发送到 Hub - - -此 API 是实验性的,未来的发布中可能会有一些轻微的不兼容更改。 - - +> [!WARNING] +> 此 API 是实验性的,未来的发布中可能会有一些轻微的不兼容更改。 首先,确保你的模型在一个 `.py` 文件中完全定义。只要所有文件都位于同一目录中,它就可以依赖于某些其他文件的相对导入(目前我们还不为子模块支持此功能)。对于我们的示例,我们将在当前工作目录中名为 `resnet_model` 的文件夹中定义一个 `modeling_resnet.py` 文件和一个 `configuration_resnet.py` 文件。 配置文件包含 `ResnetConfig` 的代码,模型文件包含 `ResnetModel` 和 `ResnetModelForImageClassification` 的代码。 @@ -208,11 +202,8 @@ resnet50d.model.load_state_dict(pretrained_model.state_dict()) `__init__.py` 可以为空,它的存在只是为了让 Python 检测到 `resnet_model` 可以用作模块。 - - -如果从库中复制模型文件,你需要将文件顶部的所有相对导入替换为从 `transformers` 包中的导入。 - - +> [!WARNING] +> 如果从库中复制模型文件,你需要将文件顶部的所有相对导入替换为从 `transformers` 包中的导入。 请注意,你可以重用(或子类化)现有的配置/模型。 diff --git a/docs/source/zh/debugging.md b/docs/source/zh/debugging.md index 77746a694fce..839c4431d2b3 100644 --- a/docs/source/zh/debugging.md +++ b/docs/source/zh/debugging.md @@ -48,23 +48,14 @@ NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 to ## 下溢和上溢检测 - +> [!TIP] +> 目前,此功能仅适用于PyTorch。 -目前,此功能仅适用于PyTorch。 +> [!TIP] +> 对于多GPU训练,它需要使用DDP(`torch.distributed.launch`)。 - - - 
- -对于多GPU训练,它需要使用DDP(`torch.distributed.launch`)。 - - - - - -此功能可以与任何基于`nn.Module`的模型一起使用。 - - +> [!TIP] +> 此功能可以与任何基于`nn.Module`的模型一起使用。 如果您开始发现`loss=NaN`或模型因激活值或权重中的`inf`或`nan`而出现一些异常行为,就需要发现第一个下溢或上溢发生的地方以及导致它的原因。幸运的是,您可以通过激活一个特殊模块来自动进行检测。 diff --git a/docs/source/zh/installation.md b/docs/source/zh/installation.md index 5926079f2ce9..61883a9920d0 100644 --- a/docs/source/zh/installation.md +++ b/docs/source/zh/installation.md @@ -104,11 +104,8 @@ pip install -e . 这些命令将会链接你克隆的仓库以及你的 Python 库路径。现在,Python 不仅会在正常的库路径中搜索库,也会在你克隆到的文件夹中进行查找。例如,如果你的 Python 包通常本应安装在 `~/anaconda3/envs/main/lib/python3.7/site-packages/` 目录中,在这种情况下 Python 也会搜索你克隆到的文件夹:`~/transformers/`。 - - -如果你想继续使用这个库,必须保留 `transformers` 文件夹。 - - +> [!WARNING] +> 如果你想继续使用这个库,必须保留 `transformers` 文件夹。 现在,你可以使用以下命令,将你克隆的 🤗 Transformers 库轻松更新至最新版本: @@ -135,21 +132,15 @@ conda install conda-forge::transformers 2. 环境变量 `HF_HOME`。 3. 环境变量 `XDG_CACHE_HOME` + `/huggingface`。 - - -除非你明确指定了环境变量 `TRANSFORMERS_CACHE`,🤗 Transformers 将可能会使用较早版本设置的环境变量 `PYTORCH_TRANSFORMERS_CACHE` 或 `PYTORCH_PRETRAINED_BERT_CACHE`。 - - +> [!TIP] +> 除非你明确指定了环境变量 `TRANSFORMERS_CACHE`,🤗 Transformers 将可能会使用较早版本设置的环境变量 `PYTORCH_TRANSFORMERS_CACHE` 或 `PYTORCH_PRETRAINED_BERT_CACHE`。 ## 离线模式 🤗 Transformers 可以仅使用本地文件在防火墙或离线环境中运行。设置环境变量 `HF_HUB_OFFLINE=1` 以启用该行为。 - - -通过设置环境变量 `HF_DATASETS_OFFLINE=1` 将 [🤗 Datasets](https://huggingface.co/docs/datasets/) 添加至你的离线训练工作流程中。 - - +> [!TIP] +> 通过设置环境变量 `HF_DATASETS_OFFLINE=1` 将 [🤗 Datasets](https://huggingface.co/docs/datasets/) 添加至你的离线训练工作流程中。 例如,你通常会使用以下命令对外部实例进行防火墙保护的的普通网络上运行程序: @@ -223,8 +214,5 @@ python examples/pytorch/translation/run_translation.py --model_name_or_path goog >>> config = AutoConfig.from_pretrained("./your/path/bigscience_t0/config.json") ``` - - -请参阅 [如何从 Hub 下载文件](https://huggingface.co/docs/hub/how-to-downstream) 部分,获取有关下载存储在 Hub 上文件的更多详细信息。 - - +> [!TIP] +> 请参阅 [如何从 Hub 下载文件](https://huggingface.co/docs/hub/how-to-downstream) 部分,获取有关下载存储在 Hub 上文件的更多详细信息。 diff --git a/docs/source/zh/llm_tutorial.md b/docs/source/zh/llm_tutorial.md index e9ea6470c39c..a3ea370b7aab 100644 --- a/docs/source/zh/llm_tutorial.md +++ b/docs/source/zh/llm_tutorial.md @@ -70,11 +70,8 @@ pip install transformers bitsandbytes>=0.39.0 -q 让我们谈谈代码! 
- - -如果您对基本的LLM使用感兴趣,我们高级的[`Pipeline`](pipeline_tutorial)接口是一个很好的起点。然而,LLMs通常需要像`quantization`和`token选择步骤的精细控制`等高级功能,这最好通过[`~generation.GenerationMixin.generate`]来完成。使用LLM进行自回归生成也是资源密集型的操作,应该在GPU上执行以获得足够的吞吐量。 - - +> [!TIP] +> 如果您对基本的LLM使用感兴趣,我们高级的[`Pipeline`](pipeline_tutorial)接口是一个很好的起点。然而,LLMs通常需要像`quantization`和`token选择步骤的精细控制`等高级功能,这最好通过[`~generation.GenerationMixin.generate`]来完成。使用LLM进行自回归生成也是资源密集型的操作,应该在GPU上执行以获得足够的吞吐量。 首先,您需要加载模型。 diff --git a/docs/source/zh/main_classes/deepspeed.md b/docs/source/zh/main_classes/deepspeed.md index 8319f5cad4a3..87cb41d05618 100644 --- a/docs/source/zh/main_classes/deepspeed.md +++ b/docs/source/zh/main_classes/deepspeed.md @@ -570,11 +570,8 @@ TrainingArguments(..., deepspeed=ds_config_dict) ### 共享配置 - - -这一部分是必读的。 - - +> [!WARNING] +> 这一部分是必读的。 一些配置值对于 [`Trainer`] 和 DeepSpeed 正常运行都是必需的,因此,为了防止定义冲突及导致的难以检测的错误,我们选择通过 [`Trainer`] 命令行参数配置这些值。 @@ -1360,15 +1357,12 @@ bf16具有与fp32相同的动态范围,因此不需要损失缩放。 } ``` - - -在`deepspeed==0.6.0`版本中,bf16支持是新的实验性功能。 - -如果您启用了bf16来进行[梯度累积](#gradient-accumulation),您需要意识到它会以bf16累积梯度,这可能不是您想要的,因为这种格式的低精度可能会导致lossy accumulation。 - -修复这个问题的工作正在努力进行,同时提供了使用更高精度的`dtype`(fp16或fp32)的选项。 - - +> [!TIP] +> 在`deepspeed==0.6.0`版本中,bf16支持是新的实验性功能。 +> +> 如果您启用了bf16来进行[梯度累积](#gradient-accumulation),您需要意识到它会以bf16累积梯度,这可能不是您想要的,因为这种格式的低精度可能会导致lossy accumulation。 +> +> 修复这个问题的工作正在努力进行,同时提供了使用更高精度的`dtype`(fp16或fp32)的选项。 ### NCCL集合 @@ -1519,11 +1513,8 @@ trainer.deepspeed.save_checkpoint(checkpoint_dir) fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir) ``` - - -注意,一旦运行了`load_state_dict_from_zero_checkpoint`,该模型将不再可以在相同的应用程序的DeepSpeed上下文中使用。也就是说,您需要重新初始化deepspeed引擎,因为`model.load_state_dict(state_dict)`会从其中移除所有的DeepSpeed相关点。所以您只能训练结束时这样做。 - - +> [!TIP] +> 注意,一旦运行了`load_state_dict_from_zero_checkpoint`,该模型将不再可以在相同的应用程序的DeepSpeed上下文中使用。也就是说,您需要重新初始化deepspeed引擎,因为`model.load_state_dict(state_dict)`会从其中移除所有的DeepSpeed相关点。所以您只能训练结束时这样做。 当然,您不必使用类:*~transformers.Trainer*,您可以根据你的需求调整上面的示例。 diff --git a/docs/source/zh/main_classes/output.md b/docs/source/zh/main_classes/output.md index 23af6da6fbee..af9e7be806a5 100644 --- a/docs/source/zh/main_classes/output.md +++ b/docs/source/zh/main_classes/output.md @@ -34,11 +34,8 @@ outputs = model(**inputs, labels=labels) `outputs` 对象是 [`~modeling_outputs.SequenceClassifierOutput`],如下面该类的文档中所示,它表示它有一个可选的 `loss`,一个 `logits`,一个可选的 `hidden_states` 和一个可选的 `attentions` 属性。在这里,我们有 `loss`,因为我们传递了 `labels`,但我们没有 `hidden_states` 和 `attentions`,因为我们没有传递 `output_hidden_states=True` 或 `output_attentions=True`。 - - -当传递 `output_hidden_states=True` 时,您可能希望 `outputs.hidden_states[-1]` 与 `outputs.last_hidden_states` 完全匹配。然而,这并不总是成立。一些模型在返回最后的 hidden state时对其应用归一化或其他后续处理。 - - +> [!TIP] +> 当传递 `output_hidden_states=True` 时,您可能希望 `outputs.hidden_states[-1]` 与 `outputs.last_hidden_states` 完全匹配。然而,这并不总是成立。一些模型在返回最后的 hidden state时对其应用归一化或其他后续处理。 您可以像往常一样访问每个属性,如果模型未返回该属性,您将得到 `None`。在这里,例如,`outputs.loss` 是模型计算的损失,而 `outputs.attentions` 是 `None`。 diff --git a/docs/source/zh/main_classes/pipelines.md b/docs/source/zh/main_classes/pipelines.md index bc16709d8b48..740b5f576332 100644 --- a/docs/source/zh/main_classes/pipelines.md +++ b/docs/source/zh/main_classes/pipelines.md @@ -119,13 +119,10 @@ for out in pipe(KeyDataset(dataset, "text"), batch_size=8, truncation="only_firs # as batches to the model ``` - - -然而,这并不自动意味着性能提升。它可能是一个10倍的加速或5倍的减速,具体取决于硬件、数据和实际使用的模型。 - -主要是加速的示例: - - +> [!WARNING] +> 然而,这并不自动意味着性能提升。它可能是一个10倍的加速或5倍的减速,具体取决于硬件、数据和实际使用的模型。 +> +> 主要是加速的示例: ```python from transformers import 
pipeline diff --git a/docs/source/zh/main_classes/quantization.md b/docs/source/zh/main_classes/quantization.md index 7d837aacbb45..d5140e378a2b 100644 --- a/docs/source/zh/main_classes/quantization.md +++ b/docs/source/zh/main_classes/quantization.md @@ -189,9 +189,8 @@ model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quanti 请注意,不支持磁盘卸载。此外,如果由于数据集而内存不足,您可能需要在`from_pretrained`中设置`max_memory`。查看这个[指南](https://huggingface.co/docs/accelerate/usage_guides/big_modeling#designing-a-device-map)以了解有关`device_map`和`max_memory`的更多信息。 - -目前,GPTQ量化仅适用于文本模型。此外,量化过程可能会花费很多时间,具体取决于硬件性能(175B模型在NVIDIA A100上需要4小时)。请在Hub上检查是否有模型的GPTQ量化版本。如果没有,您可以在GitHub上提交需求。 - +> [!WARNING] +> 目前,GPTQ量化仅适用于文本模型。此外,量化过程可能会花费很多时间,具体取决于硬件性能(175B模型在NVIDIA A100上需要4小时)。请在Hub上检查是否有模型的GPTQ量化版本。如果没有,您可以在GitHub上提交需求。 ### 推送量化模型到 🤗 Hub @@ -347,11 +346,8 @@ tokenizer = AutoTokenizer.from_pretrained(model_id) model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_4bit=True) ``` - - -需要注意的是,一旦模型以 4 位量化方式加载,就无法将量化后的权重推送到 Hub 上。此外,您不能训练 4 位量化权重,因为目前尚不支持此功能。但是,您可以使用 4 位量化模型来训练额外参数,这将在下一部分中介绍。 - - +> [!WARNING] +> 需要注意的是,一旦模型以 4 位量化方式加载,就无法将量化后的权重推送到 Hub 上。此外,您不能训练 4 位量化权重,因为目前尚不支持此功能。但是,您可以使用 4 位量化模型来训练额外参数,这将在下一部分中介绍。 ### 加载 8 位量化的大模型 @@ -379,14 +375,10 @@ print(model.get_memory_footprint()) 通过这种集成,我们能够在较小的设备上加载大模型并运行它们而没有任何问题。 - - -需要注意的是,一旦模型以 8 位量化方式加载,除了使用最新的 `transformers` 和 `bitsandbytes` 之外,目前尚无法将量化后的权重推送到 Hub 上。此外,您不能训练 8 位量化权重,因为目前尚不支持此功能。但是,您可以使用 8 位量化模型来训练额外参数,这将在下一部分中介绍。 - -注意,`device_map` 是可选的,但设置 `device_map = 'auto'` 更适合用于推理,因为它将更有效地调度可用资源上的模型。 - - - +> [!WARNING] +> 需要注意的是,一旦模型以 8 位量化方式加载,除了使用最新的 `transformers` 和 `bitsandbytes` 之外,目前尚无法将量化后的权重推送到 Hub 上。此外,您不能训练 8 位量化权重,因为目前尚不支持此功能。但是,您可以使用 8 位量化模型来训练额外参数,这将在下一部分中介绍。 +> +> 注意,`device_map` 是可选的,但设置 `device_map = 'auto'` 更适合用于推理,因为它将更有效地调度可用资源上的模型。 #### 高级用例 @@ -449,11 +441,8 @@ tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m") model.push_to_hub("bloom-560m-8bit") ``` - - -对大模型,强烈鼓励将 8 位量化模型推送到 Hub 上,以便让社区能够从内存占用减少和加载中受益,例如在 Google Colab 上加载大模型。 - - +> [!WARNING] +> 对大模型,强烈鼓励将 8 位量化模型推送到 Hub 上,以便让社区能够从内存占用减少和加载中受益,例如在 Google Colab 上加载大模型。 ### 从🤗 Hub加载量化模型 diff --git a/docs/source/zh/main_classes/trainer.md b/docs/source/zh/main_classes/trainer.md index 159477fe64a0..d294b9b38a51 100644 --- a/docs/source/zh/main_classes/trainer.md +++ b/docs/source/zh/main_classes/trainer.md @@ -18,11 +18,8 @@ rendered properly in your Markdown viewer. [`Trainer`] 类提供了一个 PyTorch 的 API,用于处理大多数标准用例的全功能训练。它在大多数[示例脚本](https://github.com/huggingface/transformers/tree/main/examples)中被使用。 - - -如果你想要使用自回归技术在文本数据集上微调像 Llama-2 或 Mistral 这样的语言模型,考虑使用 [`trl`](https://github.com/huggingface/trl) 的 [`~trl.SFTTrainer`]。[`~trl.SFTTrainer`] 封装了 [`Trainer`],专门针对这个特定任务进行了优化,并支持序列打包、LoRA、量化和 DeepSpeed,以有效扩展到任何模型大小。另一方面,[`Trainer`] 是一个更通用的选项,适用于更广泛的任务。 - - +> [!TIP] +> 如果你想要使用自回归技术在文本数据集上微调像 Llama-2 或 Mistral 这样的语言模型,考虑使用 [`trl`](https://github.com/huggingface/trl) 的 [`~trl.SFTTrainer`]。[`~trl.SFTTrainer`] 封装了 [`Trainer`],专门针对这个特定任务进行了优化,并支持序列打包、LoRA、量化和 DeepSpeed,以有效扩展到任何模型大小。另一方面,[`Trainer`] 是一个更通用的选项,适用于更广泛的任务。 在实例化你的 [`Trainer`] 之前,创建一个 [`TrainingArguments`],以便在训练期间访问所有定制点。 @@ -43,15 +40,12 @@ rendered properly in your Markdown viewer. 
- **evaluate** -- 运行评估循环并返回指标。 - **predict** -- 返回在测试集上的预测(如果有标签,则包括指标)。 - - -[`Trainer`] 类被优化用于 🤗 Transformers 模型,并在你在其他模型上使用时可能会有一些令人惊讶的结果。当在你自己的模型上使用时,请确保: - -- 你的模型始终返回元组或 [`~utils.ModelOutput`] 的子类。 -- 如果提供了 `labels` 参数,你的模型可以计算损失,并且损失作为元组的第一个元素返回(如果你的模型返回元组)。 -- 你的模型可以接受多个标签参数(在 [`TrainingArguments`] 中使用 `label_names` 将它们的名称指示给 [`Trainer`]),但它们中没有一个应该被命名为 `"label"`。 - - +> [!WARNING] +> [`Trainer`] 类被优化用于 🤗 Transformers 模型,并在你在其他模型上使用时可能会有一些令人惊讶的结果。当在你自己的模型上使用时,请确保: +> +> - 你的模型始终返回元组或 [`~utils.ModelOutput`] 的子类。 +> - 如果提供了 `labels` 参数,你的模型可以计算损失,并且损失作为元组的第一个元素返回(如果你的模型返回元组)。 +> - 你的模型可以接受多个标签参数(在 [`TrainingArguments`] 中使用 `label_names` 将它们的名称指示给 [`Trainer`]),但它们中没有一个应该被命名为 `"label"`。 以下是如何自定义 [`Trainer`] 以使用加权损失的示例(在训练集不平衡时很有用): diff --git a/docs/source/zh/model_sharing.md b/docs/source/zh/model_sharing.md index 26e129a0a2be..de114249f6e0 100644 --- a/docs/source/zh/model_sharing.md +++ b/docs/source/zh/model_sharing.md @@ -27,11 +27,8 @@ rendered properly in your Markdown viewer. frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen> - - -要与社区共享模型,您需要在[huggingface.co](https://huggingface.co/join)上拥有一个帐户。您还可以加入现有的组织或创建一个新的组织。 - - +> [!TIP] +> 要与社区共享模型,您需要在[huggingface.co](https://huggingface.co/join)上拥有一个帐户。您还可以加入现有的组织或创建一个新的组织。 ## 仓库功能 diff --git a/docs/source/zh/peft.md b/docs/source/zh/peft.md index f00ae5ca399d..0f3a8715c927 100644 --- a/docs/source/zh/peft.md +++ b/docs/source/zh/peft.md @@ -62,12 +62,8 @@ peft_model_id = "ybelkada/opt-350m-lora" model = AutoModelForCausalLM.from_pretrained(peft_model_id) ``` - - -你可以使用`AutoModelFor`类或基础模型类(如`OPTForCausalLM`或`LlamaForCausalLM`)来加载一个PEFT adapter。 - - - +> [!TIP] +> 你可以使用`AutoModelFor`类或基础模型类(如`OPTForCausalLM`或`LlamaForCausalLM`)来加载一个PEFT adapter。 您也可以通过`load_adapter`方法来加载 PEFT adapter。 @@ -167,11 +163,8 @@ output = model.generate(**inputs) PEFT适配器受[`Trainer`]类支持,因此您可以为您的特定用例训练适配器。它只需要添加几行代码即可。例如,要训练一个LoRA adapter: - - -如果你不熟悉如何使用[`Trainer`]微调模型,请查看[微调预训练模型](training)教程。 - - +> [!TIP] +> 如果你不熟悉如何使用[`Trainer`]微调模型,请查看[微调预训练模型](training)教程。 1. 使用任务类型和超参数定义adapter配置(参见[`~peft.LoraConfig`]以了解超参数的详细信息)。 diff --git a/docs/source/zh/perf_train_special.md b/docs/source/zh/perf_train_special.md index ee8553475679..85c5a3da3bb5 100644 --- a/docs/source/zh/perf_train_special.md +++ b/docs/source/zh/perf_train_special.md @@ -12,15 +12,12 @@ rendered properly in your Markdown viewer. 之前,在 Mac 上训练模型仅限于使用 CPU 训练。不过随着PyTorch v1.12的发布,您可以通过在 Apple Silicon 芯片的 GPU 上训练模型来显著提高性能和训练速度。这是通过将 Apple 的 Metal 性能着色器 (Metal Performance Shaders, MPS) 作为后端集成到PyTorch中实现的。[MPS后端](https://pytorch.org/docs/stable/notes/mps.html) 将 PyTorch 操作视为自定义的 Metal 着色器来实现,并将对应模块部署到`mps`设备上。 - - -某些 PyTorch 操作目前还未在 MPS 上实现,可能会抛出错误提示。可以通过设置环境变量`PYTORCH_ENABLE_MPS_FALLBACK=1`来使用CPU内核以避免这种情况发生(您仍然会看到一个`UserWarning`)。 - -
- -如果您遇到任何其他错误,请在[PyTorch库](https://github.com/pytorch/pytorch/issues)中创建一个 issue,因为[`Trainer`]类中只集成了 MPS 后端. - -
+> [!WARNING] +> 某些 PyTorch 操作目前还未在 MPS 上实现,可能会抛出错误提示。可以通过设置环境变量`PYTORCH_ENABLE_MPS_FALLBACK=1`来使用CPU内核以避免这种情况发生(您仍然会看到一个`UserWarning`)。 +> +>
+> +> 如果您遇到任何其他错误,请在[PyTorch库](https://github.com/pytorch/pytorch/issues)中创建一个 issue,因为[`Trainer`]类中只集成了 MPS 后端. 配置好`mps`设备后,您可以: diff --git a/docs/source/zh/pipeline_tutorial.md b/docs/source/zh/pipeline_tutorial.md index 7c497c6f1c65..457be00fcb99 100644 --- a/docs/source/zh/pipeline_tutorial.md +++ b/docs/source/zh/pipeline_tutorial.md @@ -22,11 +22,8 @@ rendered properly in your Markdown viewer. - 如何使用特定的`tokenizer`(分词器)或模型。 - 如何使用[`pipeline`] 进行音频、视觉和多模态任务的推理。 - - -请查看[`pipeline`]文档以获取已支持的任务和可用参数的完整列表。 - - +> [!TIP] +> 请查看[`pipeline`]文档以获取已支持的任务和可用参数的完整列表。 ## Pipeline使用 @@ -203,9 +200,8 @@ for out in pipe(KeyDataset(dataset, "audio")): ## 在Web服务器上使用pipelines - -创建推理引擎是一个复杂的主题,值得有自己的页面。 - +> [!TIP] +> 创建推理引擎是一个复杂的主题,值得有自己的页面。 [链接](./pipeline_webserver) @@ -266,17 +262,14 @@ for out in pipe(KeyDataset(dataset, "audio")): [{'score': 0.425, 'answer': 'us-001', 'start': 16, 'end': 16}] ``` - - -要运行上面的示例,除了🤗 Transformers之外,您需要安装[`pytesseract`](https://pypi.org/project/pytesseract/)。 - - -```bash -sudo apt install -y tesseract-ocr -pip install pytesseract -``` - - +> [!TIP] +> 要运行上面的示例,除了🤗 Transformers之外,您需要安装[`pytesseract`](https://pypi.org/project/pytesseract/)。 +> +> +> ```bash +> sudo apt install -y tesseract-ocr +> pip install pytesseract +> ``` ## 在大模型上使用🤗 `accelerate`和`pipeline`: diff --git a/docs/source/zh/preprocessing.md b/docs/source/zh/preprocessing.md index 252f41f214ea..f1f1b631a61b 100644 --- a/docs/source/zh/preprocessing.md +++ b/docs/source/zh/preprocessing.md @@ -25,11 +25,8 @@ rendered properly in your Markdown viewer. * 图像输入使用[图像处理器](./main_classes/image)(`ImageProcessor`)将图像转换为张量。 * 多模态输入,使用[处理器](./main_classes/processors)(`Processor`)结合了`Tokenizer`和`ImageProcessor`或`Processor`。 - - -`AutoProcessor` **始终**有效的自动选择适用于您使用的模型的正确`class`,无论您使用的是`Tokenizer`、`ImageProcessor`、`Feature extractor`还是`Processor`。 - - +> [!TIP] +> `AutoProcessor` **始终**有效的自动选择适用于您使用的模型的正确`class`,无论您使用的是`Tokenizer`、`ImageProcessor`、`Feature extractor`还是`Processor`。 在开始之前,请安装🤗 Datasets,以便您可以加载一些数据集来进行实验: @@ -44,11 +41,8 @@ pip install datasets 处理文本数据的主要工具是[Tokenizer](main_classes/tokenizer)。`Tokenizer`根据一组规则将文本拆分为`tokens`。然后将这些`tokens`转换为数字,然后转换为张量,成为模型的输入。模型所需的任何附加输入都由`Tokenizer`添加。 - - -如果您计划使用预训练模型,重要的是使用与之关联的预训练`Tokenizer`。这确保文本的拆分方式与预训练语料库相同,并在预训练期间使用相同的标记-索引的对应关系(通常称为*词汇表*-`vocab`)。 - - +> [!TIP] +> 如果您计划使用预训练模型,重要的是使用与之关联的预训练`Tokenizer`。这确保文本的拆分方式与预训练语料库相同,并在预训练期间使用相同的标记-索引的对应关系(通常称为*词汇表*-`vocab`)。 开始使用[`AutoTokenizer.from_pretrained`]方法加载一个预训练`tokenizer`。这将下载模型预训练的`vocab`: @@ -161,11 +155,8 @@ pip install datasets [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]} ``` - - -查看[填充和截断](./pad_truncation)概念指南,了解更多有关填充和截断参数的信息。 - - +> [!TIP] +> 查看[填充和截断](./pad_truncation)概念指南,了解更多有关填充和截断参数的信息。 ### 构建张量 @@ -312,24 +303,18 @@ pip install datasets 对于计算机视觉任务,您需要一个[ image processor](main_classes/image_processor)来准备数据集以供模型使用。图像预处理包括多个步骤将图像转换为模型期望输入的格式。这些步骤包括但不限于调整大小、标准化、颜色通道校正以及将图像转换为张量。 - - -图像预处理通常遵循某种形式的图像增强。图像预处理和图像增强都会改变图像数据,但它们有不同的目的: - -* 图像增强可以帮助防止过拟合并增加模型的鲁棒性。您可以在数据增强方面充分发挥创造性 - 调整亮度和颜色、裁剪、旋转、调整大小、缩放等。但要注意不要改变图像的含义。 -* 图像预处理确保图像与模型预期的输入格式匹配。在微调计算机视觉模型时,必须对图像进行与模型训练时相同的预处理。 - -您可以使用任何您喜欢的图像增强库。对于图像预处理,请使用与模型相关联的`ImageProcessor`。 - - +> [!TIP] +> 图像预处理通常遵循某种形式的图像增强。图像预处理和图像增强都会改变图像数据,但它们有不同的目的: +> +> * 图像增强可以帮助防止过拟合并增加模型的鲁棒性。您可以在数据增强方面充分发挥创造性 - 调整亮度和颜色、裁剪、旋转、调整大小、缩放等。但要注意不要改变图像的含义。 +> * 图像预处理确保图像与模型预期的输入格式匹配。在微调计算机视觉模型时,必须对图像进行与模型训练时相同的预处理。 +> +> 您可以使用任何您喜欢的图像增强库。对于图像预处理,请使用与模型相关联的`ImageProcessor`。 加载[food101](https://huggingface.co/datasets/food101)数据集(有关如何加载数据集的更多详细信息,请参阅🤗 
[Datasets教程](https://huggingface.co/docs/datasets/load_hub))以了解如何在计算机视觉数据集中使用图像处理器: - - -因为数据集相当大,请使用🤗 Datasets的`split`参数加载训练集中的少量样本! - - +> [!TIP] +> 因为数据集相当大,请使用🤗 Datasets的`split`参数加载训练集中的少量样本! ```py @@ -384,13 +369,10 @@ pip install datasets ... return examples ``` - - -在上面的示例中,我们设置`do_resize=False`,因为我们已经在图像增强转换中调整了图像的大小,并利用了适当的`image_processor`的`size`属性。如果您在图像增强期间不调整图像的大小,请将此参数排除在外。默认情况下`ImageProcessor`将处理调整大小。 - -如果希望将图像标准化步骤为图像增强的一部分,请使用`image_processor.image_mean`和`image_processor.image_std`。 - - +> [!TIP] +> 在上面的示例中,我们设置`do_resize=False`,因为我们已经在图像增强转换中调整了图像的大小,并利用了适当的`image_processor`的`size`属性。如果您在图像增强期间不调整图像的大小,请将此参数排除在外。默认情况下`ImageProcessor`将处理调整大小。 +> +> 如果希望将图像标准化步骤为图像增强的一部分,请使用`image_processor.image_mean`和`image_processor.image_std`。 3. 然后使用🤗 Datasets的[`set_transform`](https://huggingface.co/docs/datasets/process#format-transform)在运行时应用这些变换: @@ -421,11 +403,8 @@ pip install datasets
- - -对于诸如目标检测、语义分割、实例分割和全景分割等任务,`ImageProcessor`提供了训练后处理方法。这些方法将模型的原始输出转换为有意义的预测,如边界框或分割地图。 - - +> [!TIP] +> 对于诸如目标检测、语义分割、实例分割和全景分割等任务,`ImageProcessor`提供了训练后处理方法。这些方法将模型的原始输出转换为有意义的预测,如边界框或分割地图。 ### 填充 diff --git a/docs/source/zh/quicktour.md b/docs/source/zh/quicktour.md index b36a85932b25..4a6883b322bd 100644 --- a/docs/source/zh/quicktour.md +++ b/docs/source/zh/quicktour.md @@ -190,11 +190,8 @@ label: NEGATIVE, with score: 0.5309 ... ) ``` - - -查阅[预处理](./preprocessing)教程来获得有关分词的更详细的信息,以及如何使用 [`AutoFeatureExtractor`] 和 [`AutoProcessor`] 来处理图像,音频,还有多模式输入。 - - +> [!TIP] +> 查阅[预处理](./preprocessing)教程来获得有关分词的更详细的信息,以及如何使用 [`AutoFeatureExtractor`] 和 [`AutoProcessor`] 来处理图像,音频,还有多模式输入。 ### AutoModel @@ -207,11 +204,8 @@ label: NEGATIVE, with score: 0.5309 >>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name) ``` - - -通过 [任务摘要](./task_summary) 查找 [`AutoModel`] 支持的任务. - - +> [!TIP] +> 通过 [任务摘要](./task_summary) 查找 [`AutoModel`] 支持的任务. 现在可以把预处理好的输入批次直接送进模型。你只需要通过 `**` 来解包字典: @@ -230,12 +224,9 @@ tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725], [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=) ``` - - -所有 🤗 Transformers 模型(PyTorch 或 TensorFlow)在最终的激活函数(比如 softmax)*之前* 输出张量, -因为最终的激活函数常常与 loss 融合。模型的输出是特殊的数据类,所以它们的属性可以在 IDE 中被自动补全。模型的输出就像一个元组或字典(你可以通过整数、切片或字符串来索引它),在这种情况下,为 None 的属性会被忽略。 - - +> [!TIP] +> 所有 🤗 Transformers 模型(PyTorch 或 TensorFlow)在最终的激活函数(比如 softmax)*之前* 输出张量, +> 因为最终的激活函数常常与 loss 融合。模型的输出是特殊的数据类,所以它们的属性可以在 IDE 中被自动补全。模型的输出就像一个元组或字典(你可以通过整数、切片或字符串来索引它),在这种情况下,为 None 的属性会被忽略。 ### 保存模型 @@ -357,11 +348,8 @@ tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725], >>> trainer.train() # doctest: +SKIP ``` - - -对于像翻译或摘要这些使用序列到序列模型的任务,用 [`Seq2SeqTrainer`] 和 [`Seq2SeqTrainingArguments`] 来替代。 - - +> [!TIP] +> 对于像翻译或摘要这些使用序列到序列模型的任务,用 [`Seq2SeqTrainer`] 和 [`Seq2SeqTrainingArguments`] 来替代。 你可以通过子类化 [`Trainer`] 中的方法来自定义训练循环。这样你就可以自定义像损失函数,优化器和调度器这样的特性。查阅 [`Trainer`] 参考手册了解哪些方法能够被子类化。 diff --git a/docs/source/zh/serialization.md b/docs/source/zh/serialization.md index e4ff6ed290eb..b593418965bf 100644 --- a/docs/source/zh/serialization.md +++ b/docs/source/zh/serialization.md @@ -120,11 +120,8 @@ optimum-cli export onnx --model local_path --task question-answering distilbert_ ### 使用 `transformers.onnx` 导出模型 - - -`transformers.onnx` 不再进行维护,请如上所述,使用 🤗 Optimum 导出模型。这部分内容将在未来版本中删除。 - - +> [!WARNING] +> `transformers.onnx` 不再进行维护,请如上所述,使用 🤗 Optimum 导出模型。这部分内容将在未来版本中删除。 要使用 `transformers.onnx` 将 🤗 Transformers 模型导出为 ONNX,请安装额外的依赖项: diff --git a/docs/source/zh/tasks/asr.md b/docs/source/zh/tasks/asr.md index 3798640026d5..74e6ee148190 100644 --- a/docs/source/zh/tasks/asr.md +++ b/docs/source/zh/tasks/asr.md @@ -29,11 +29,8 @@ Siri 和 Alexa 这类虚拟助手使用 ASR 模型来帮助用户日常生活, [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) 进行微调,以将音频转录为文本。 2. 使用微调后的模型进行推断。 - - -如果您想查看所有与本任务兼容的架构和检查点,最好查看[任务页](https://huggingface.co/tasks/automatic-speech-recognition)。 - - +> [!TIP] +> 如果您想查看所有与本任务兼容的架构和检查点,最好查看[任务页](https://huggingface.co/tasks/automatic-speech-recognition)。 在开始之前,请确保您已安装所有必要的库: @@ -242,11 +239,8 @@ Wav2Vec2 分词器仅训练了大写字符,因此您需要确保文本与分 ## 训练 - - -如果您不熟悉使用[`Trainer`]微调模型,请查看这里的基本教程[here](../training#train-with-pytorch-trainer)! - - +> [!TIP] +> 如果您不熟悉使用[`Trainer`]微调模型,请查看这里的基本教程[here](../training#train-with-pytorch-trainer)! 
现在您已经准备好开始训练您的模型了!使用 [`AutoModelForCTC`] 加载 Wav2Vec2。 使用 `ctc_loss_reduction` 参数指定要应用的减少方式。通常最好使用平均值而不是默认的求和: @@ -310,13 +304,10 @@ Wav2Vec2 分词器仅训练了大写字符,因此您需要确保文本与分 >>> trainer.push_to_hub() ``` - - -要深入了解如何微调模型进行自动语音识别, -请查看这篇博客[文章](https://huggingface.co/blog/fine-tune-wav2vec2-english)以了解英语 ASR, -还可以参阅[这篇文章](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2)以了解多语言 ASR。 - - +> [!TIP] +> 要深入了解如何微调模型进行自动语音识别, +> 请查看这篇博客[文章](https://huggingface.co/blog/fine-tune-wav2vec2-english)以了解英语 ASR, +> 还可以参阅[这篇文章](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2)以了解多语言 ASR。 ## 推断 @@ -344,11 +335,8 @@ Wav2Vec2 分词器仅训练了大写字符,因此您需要确保文本与分 {'text': 'I WOUD LIKE O SET UP JOINT ACOUNT WTH Y PARTNER'} ``` - - -转录结果还不错,但可以更好!尝试用更多示例微调您的模型,以获得更好的结果! - - +> [!TIP] +> 转录结果还不错,但可以更好!尝试用更多示例微调您的模型,以获得更好的结果! 如果您愿意,您也可以手动复制 `pipeline` 的结果: diff --git a/docs/source/zh/torchscript.md b/docs/source/zh/torchscript.md index d3106c524180..27a7d07f3ed6 100644 --- a/docs/source/zh/torchscript.md +++ b/docs/source/zh/torchscript.md @@ -16,13 +16,10 @@ rendered properly in your Markdown viewer. # 导出为 TorchScript - - -这是开始使用 TorchScript 进行实验的起点,我们仍在探索其在变量输入大小模型中的能力。 -这是我们关注的焦点,我们将在即将发布的版本中深入分析,提供更多的代码示例、更灵活的实现以及比较 -Python 代码与编译 TorchScript 的性能基准。 - - +> [!TIP] +> 这是开始使用 TorchScript 进行实验的起点,我们仍在探索其在变量输入大小模型中的能力。 +> 这是我们关注的焦点,我们将在即将发布的版本中深入分析,提供更多的代码示例、更灵活的实现以及比较 +> Python 代码与编译 TorchScript 的性能基准。 根据 [TorchScript 文档](https://pytorch.org/docs/stable/jit.html): diff --git a/docs/source/zh/training.md b/docs/source/zh/training.md index 43243ab4cfbf..aa63240bb322 100644 --- a/docs/source/zh/training.md +++ b/docs/source/zh/training.md @@ -86,11 +86,8 @@ rendered properly in your Markdown viewer. >>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` - - -您将会看到一个警告,提到一些预训练权重未被使用,以及一些权重被随机初始化。不用担心,这是完全正常的!BERT 模型的预训练`head`被丢弃,并替换为一个随机初始化的分类`head`。您将在您的序列分类任务上微调这个新模型`head`,将预训练模型的知识转移给它。 - - +> [!TIP] +> 您将会看到一个警告,提到一些预训练权重未被使用,以及一些权重被随机初始化。不用担心,这是完全正常的!BERT 模型的预训练`head`被丢弃,并替换为一个随机初始化的分类`head`。您将在您的序列分类任务上微调这个新模型`head`,将预训练模型的知识转移给它。 ### 训练超参数 @@ -246,11 +243,8 @@ torch.cuda.empty_cache() >>> model.to(device) ``` - - -如果没有 GPU,可以通过notebook平台如 [Colaboratory](https://colab.research.google.com/) 或 [SageMaker StudioLab](https://studiolab.sagemaker.aws/) 来免费获得云端GPU使用。 - - +> [!TIP] +> 如果没有 GPU,可以通过notebook平台如 [Colaboratory](https://colab.research.google.com/) 或 [SageMaker StudioLab](https://studiolab.sagemaker.aws/) 来免费获得云端GPU使用。 现在您已经准备好训练了!🥳 diff --git a/src/transformers/configuration_utils.py b/src/transformers/configuration_utils.py index b9423a8bbf59..5f3abde4cf8b 100755 --- a/src/transformers/configuration_utils.py +++ b/src/transformers/configuration_utils.py @@ -56,12 +56,9 @@ class PretrainedConfig(PushToHubMixin): Base class for all configuration classes. Handles a few parameters common to all models' configurations as well as methods for loading/downloading/saving configurations. - - - A configuration file can be loaded and saved to disk. Loading the configuration file and using this file to - initialize a model does **not** load the model weights. It only affects the model's configuration. - - + > [!TIP] + > A configuration file can be loaded and saved to disk. Loading the configuration file and using this file to + > initialize a model does **not** load the model weights. It only affects the model's configuration. Class attributes (overridden by derived classes): @@ -89,14 +86,11 @@ class PretrainedConfig(PushToHubMixin): model. 
- **num_hidden_layers** (`int`) -- The number of blocks in the model. - - - Setting parameters for sequence generation in the model config is deprecated. For backward compatibility, loading - some of them will still be possible, but attempting to overwrite them will throw an exception -- you should set - them in a [~transformers.GenerationConfig]. Check the documentation of [~transformers.GenerationConfig] for more - information about the individual parameters. - - + > [!WARNING] + > Setting parameters for sequence generation in the model config is deprecated. For backward compatibility, loading + > some of them will still be possible, but attempting to overwrite them will throw an exception -- you should set + > them in a [~transformers.GenerationConfig]. Check the documentation of [~transformers.GenerationConfig] for more + > information about the individual parameters. Arg: name_or_path (`str`, *optional*, defaults to `""`): @@ -563,11 +557,8 @@ def from_pretrained( git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any identifier allowed by git. - - - To test a pull request you made on the Hub, you can pass `revision="refs/pr/"`. - - + > [!TIP] + > To test a pull request you made on the Hub, you can pass `revision="refs/pr/"`. return_unused_kwargs (`bool`, *optional*, defaults to `False`): If `False`, then this function returns just the final configuration object. diff --git a/src/transformers/data/data_collator.py b/src/transformers/data/data_collator.py index b125428a0550..2599907a32df 100644 --- a/src/transformers/data/data_collator.py +++ b/src/transformers/data/data_collator.py @@ -648,35 +648,32 @@ class DataCollatorForLanguageModeling(DataCollatorMixin): seed (`int`, *optional*): The seed to use for the random number generator for masking. If not provided, the global RNG will be used. - - - For best performance, this data collator should be used with a dataset having items that are dictionaries or - BatchEncoding, with the `"special_tokens_mask"` key, as returned by a [`PreTrainedTokenizer`] or a - [`PreTrainedTokenizerFast`] with the argument `return_special_tokens_mask=True`. - - - - 1. Default Behavior: - - `mask_replace_prob=0.8`, `random_replace_prob=0.1`. - - Expect 80% of masked tokens replaced with `[MASK]`, 10% replaced with random tokens, and 10% left unchanged. - - 2. All masked tokens replaced by `[MASK]`: - - `mask_replace_prob=1.0`, `random_replace_prob=0.0`. - - Expect all masked tokens to be replaced with `[MASK]`. No tokens are left unchanged or replaced with random tokens. - - 3. No `[MASK]` replacement, only random tokens: - - `mask_replace_prob=0.0`, `random_replace_prob=1.0`. - - Expect all masked tokens to be replaced with random tokens. No `[MASK]` replacements or unchanged tokens. - - 4. Balanced replacement: - - `mask_replace_prob=0.5`, `random_replace_prob=0.4`. - - Expect 50% of masked tokens replaced with `[MASK]`, 40% replaced with random tokens, and 10% left unchanged. - - Note: - The sum of `mask_replace_prob` and `random_replace_prob` must not exceed 1. If their sum is less than 1, the - remaining proportion will consist of masked tokens left unchanged. - - + > [!TIP] + > For best performance, this data collator should be used with a dataset having items that are dictionaries or + > BatchEncoding, with the `"special_tokens_mask"` key, as returned by a [`PreTrainedTokenizer`] or a + > [`PreTrainedTokenizerFast`] with the argument `return_special_tokens_mask=True`. + > + > + > + > 1. 
Default Behavior: + > - `mask_replace_prob=0.8`, `random_replace_prob=0.1`. + > - Expect 80% of masked tokens replaced with `[MASK]`, 10% replaced with random tokens, and 10% left unchanged. + > + > 2. All masked tokens replaced by `[MASK]`: + > - `mask_replace_prob=1.0`, `random_replace_prob=0.0`. + > - Expect all masked tokens to be replaced with `[MASK]`. No tokens are left unchanged or replaced with random tokens. + > + > 3. No `[MASK]` replacement, only random tokens: + > - `mask_replace_prob=0.0`, `random_replace_prob=1.0`. + > - Expect all masked tokens to be replaced with random tokens. No `[MASK]` replacements or unchanged tokens. + > + > 4. Balanced replacement: + > - `mask_replace_prob=0.5`, `random_replace_prob=0.4`. + > - Expect 50% of masked tokens replaced with `[MASK]`, 40% replaced with random tokens, and 10% left unchanged. + > + > Note: + > The sum of `mask_replace_prob` and `random_replace_prob` must not exceed 1. If their sum is less than 1, the + > remaining proportion will consist of masked tokens left unchanged. """ tokenizer: PreTrainedTokenizerBase @@ -1371,12 +1368,9 @@ class DataCollatorWithFlattening(DefaultDataCollator): - optionally returns the kwargs contained in FlashAttentionKwargs - optionally returns seq_idx indicating which sequence each token belongs to - - - Using `DataCollatorWithFlattening` will flatten the entire mini batch into single long sequence. - Make sure your attention computation is able to handle it! - - + > [!WARNING] + > Using `DataCollatorWithFlattening` will flatten the entire mini batch into single long sequence. + > Make sure your attention computation is able to handle it! """ def __init__( diff --git a/src/transformers/dynamic_module_utils.py b/src/transformers/dynamic_module_utils.py index 5b541c076f63..fa7f72f66e13 100644 --- a/src/transformers/dynamic_module_utils.py +++ b/src/transformers/dynamic_module_utils.py @@ -367,11 +367,8 @@ def get_cached_module_file( repo_type (`str`, *optional*): Specify the repo type (useful when downloading from a space for instance). - - - Passing `token=True` is required when you want to use a private model. - - + > [!TIP] + > Passing `token=True` is required when you want to use a private model. Returns: `str`: The path to the module inside the cache. @@ -509,12 +506,9 @@ def get_class_from_dynamic_module( """ Extracts a class from a module file, present in the local folder or repository of a model. - - - Calling this function will execute the code in the module file found locally or downloaded from the Hub. It should - therefore only be called on trusted repos. - - + > [!WARNING] + > Calling this function will execute the code in the module file found locally or downloaded from the Hub. It should + > therefore only be called on trusted repos. @@ -562,11 +556,8 @@ def get_class_from_dynamic_module( rest of the model. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any identifier allowed by git. - - - Passing `token=True` is required when you want to use a private model. - - + > [!TIP] + > Passing `token=True` is required when you want to use a private model. Returns: `typing.Type`: The class, dynamically imported from the module. 
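
The hunk above touches the docstring of `get_class_from_dynamic_module`, which downloads and executes custom modeling code from a repo. As a quick reference while reviewing, here is a minimal usage sketch; the repo id and the `modeling.MyBertModel` reference are illustrative placeholders, and the call should only ever be pointed at repos you trust, since the fetched code runs locally.

```python
from transformers.dynamic_module_utils import get_class_from_dynamic_module

# Placeholder repo and class reference, for illustration only. The fetched
# module is executed locally, so only use this with repos you trust.
model_class = get_class_from_dynamic_module(
    "modeling.MyBertModel",   # "<module file>.<class name>" inside the repo
    "sgugger/my-bert-model",  # repo id (or local folder) that ships modeling.py
)
```
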
diff --git a/src/transformers/feature_extraction_sequence_utils.py b/src/transformers/feature_extraction_sequence_utils.py index 1a48062cb5c1..9dc65d7e6184 100644 --- a/src/transformers/feature_extraction_sequence_utils.py +++ b/src/transformers/feature_extraction_sequence_utils.py @@ -72,13 +72,10 @@ def pad( Padding side (left/right) padding values are defined at the feature extractor level (with `self.padding_side`, `self.padding_value`) - - - If the `processed_features` passed are dictionary of numpy arrays or PyTorch tensors the - result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of - PyTorch tensors, you will lose the specific device of your tensors however. - - + > [!TIP] + > If the `processed_features` passed are dictionary of numpy arrays or PyTorch tensors the + > result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of + > PyTorch tensors, you will lose the specific device of your tensors however. Args: processed_features ([`BatchFeature`], list of [`BatchFeature`], `dict[str, list[float]]`, `dict[str, list[list[float]]` or `list[dict[str, list[float]]]`): diff --git a/src/transformers/feature_extraction_utils.py b/src/transformers/feature_extraction_utils.py index 98037256d311..3df44f8a86d6 100644 --- a/src/transformers/feature_extraction_utils.py +++ b/src/transformers/feature_extraction_utils.py @@ -290,11 +290,8 @@ def from_pretrained( identifier allowed by git. - - - To test a pull request you made on the Hub, you can pass `revision="refs/pr/"`. - - + > [!TIP] + > To test a pull request you made on the Hub, you can pass `revision="refs/pr/"`. return_unused_kwargs (`bool`, *optional*, defaults to `False`): If `False`, then this function returns just the final feature extractor object. If `True`, then this diff --git a/src/transformers/generation/configuration_utils.py b/src/transformers/generation/configuration_utils.py index 98a0d14ade1a..5c9a51e678bb 100644 --- a/src/transformers/generation/configuration_utils.py +++ b/src/transformers/generation/configuration_utils.py @@ -93,13 +93,10 @@ class GenerationConfig(PushToHubMixin): To learn more about decoding strategies refer to the [text generation strategies guide](../generation_strategies). - - - A large number of these flags control the logits or the stopping criteria of the generation. Make sure you check - the [generate-related classes](https://huggingface.co/docs/transformers/internal/generation_utils) for a full - description of the possible manipulations, as well as examples of their usage. - - + > [!TIP] + > A large number of these flags control the logits or the stopping criteria of the generation. Make sure you check + > the [generate-related classes](https://huggingface.co/docs/transformers/internal/generation_utils) for a full + > description of the possible manipulations, as well as examples of their usage. Arg: > Parameters that control the length of the output @@ -810,11 +807,8 @@ def from_pretrained( git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any identifier allowed by git. - - - To test a pull request you made on the Hub, you can pass `revision="refs/pr/"`. - - + > [!TIP] + > To test a pull request you made on the Hub, you can pass `revision="refs/pr/"`. return_unused_kwargs (`bool`, *optional*, defaults to `False`): If `False`, then this function returns just the final configuration object. 
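
Since the hunk above edits the `GenerationConfig.from_pretrained` docstring, a short sketch of the documented behaviour may help when reviewing; `openai-community/gpt2` is only an example checkpoint and the PR number in `revision` is hypothetical.

```python
from transformers import GenerationConfig

# Load a stored generation config. With return_unused_kwargs=True, kwargs the
# config does not recognize come back separately: top_k overrides the stored
# value, while foo ends up in unused_kwargs.
generation_config, unused_kwargs = GenerationConfig.from_pretrained(
    "openai-community/gpt2", top_k=1, foo=False, return_unused_kwargs=True
)

# To try the config from an open Hub pull request, pass a git revision such as
# refs/pr/1 (hypothetical PR number).
pr_config = GenerationConfig.from_pretrained("openai-community/gpt2", revision="refs/pr/1")
```
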
diff --git a/src/transformers/generation/logits_process.py b/src/transformers/generation/logits_process.py index d8f843bd59ae..4424c22075a8 100644 --- a/src/transformers/generation/logits_process.py +++ b/src/transformers/generation/logits_process.py @@ -234,12 +234,9 @@ class TemperatureLogitsWarper(LogitsProcessor): that it can control the randomness of the predicted tokens. Often used together with [`TopPLogitsWarper`] and [`TopKLogitsWarper`]. - - - Make sure that `do_sample=True` is included in the `generate` arguments otherwise the temperature value won't have - any effect. - - + > [!TIP] + > Make sure that `do_sample=True` is included in the `generate` arguments otherwise the temperature value won't have + > any effect. Args: temperature (`float`): @@ -977,13 +974,10 @@ class NoRepeatNGramLogitsProcessor(LogitsProcessor): prompt is also considered to obtain the n-grams. [Fairseq](https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345). - - - Use n-gram penalties with care. For instance, penalizing 2-grams (bigrams) in an article about the city of New York - might lead to undesirable outcomes where the city's name appears only once in the entire text. - [Reference](https://huggingface.co/blog/how-to-generate) - - + > [!TIP] + > Use n-gram penalties with care. For instance, penalizing 2-grams (bigrams) in an article about the city of New York + > might lead to undesirable outcomes where the city's name appears only once in the entire text. + > [Reference](https://huggingface.co/blog/how-to-generate) Args: ngram_size (`int`): @@ -1102,13 +1096,10 @@ class SequenceBiasLogitsProcessor(LogitsProcessor): one token, consider using beam methods (to gracefully work around partially completed sequences that have a negative bias) and applying the bias to their prefixes (to ensure the bias is applied earlier). - - - At a token-level, biasing a word is different from biasing a word with a space before it. If you want to bias - "foo" mid-sentence, you'll likely want to add a prefix space and bias " foo" instead. Check the tokenizer section - of our NLP course to find out why: https://huggingface.co/learn/nlp-course/chapter2/4?fw=pt - - + > [!TIP] + > At a token-level, biasing a word is different from biasing a word with a space before it. If you want to bias + > "foo" mid-sentence, you'll likely want to add a prefix space and bias " foo" instead. Check the tokenizer section + > of our NLP course to find out why: https://huggingface.co/learn/nlp-course/chapter2/4?fw=pt Args: sequence_bias (`list[list[Union[list[int], float]]]`): @@ -1282,15 +1273,12 @@ class NoBadWordsLogitsProcessor(SequenceBiasLogitsProcessor): """ [`LogitsProcessor`] that enforces that specified sequences will never be selected. - - - In order to get the token ids of the words that should not appear in the generated text, make sure to set - `add_prefix_space=True` when initializing the tokenizer, and use `tokenizer(bad_words, - add_special_tokens=False).input_ids`. The `add_prefix_space` argument is only supported for some slow tokenizers, - as fast tokenizers' prefixing behaviours come from `pre tokenizers`. Read more - [here](https://huggingface.co/docs/tokenizers/api/pre-tokenizers). - - + > [!TIP] + > In order to get the token ids of the words that should not appear in the generated text, make sure to set + > `add_prefix_space=True` when initializing the tokenizer, and use `tokenizer(bad_words, + > add_special_tokens=False).input_ids`. 
The `add_prefix_space` argument is only supported for some slow tokenizers, + > as fast tokenizers' prefixing behaviours come from `pre tokenizers`. Read more + > [here](https://huggingface.co/docs/tokenizers/api/pre-tokenizers). Args: bad_words_ids (`list[list[int]]`): @@ -2131,12 +2119,9 @@ class ClassifierFreeGuidanceLogitsProcessor(LogitsProcessor): See [the paper](https://huggingface.co/papers/2306.05284) for more information. - - - This logits processor is exclusively compatible with - [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen) - - + > [!WARNING] + > This logits processor is exclusively compatible with + > [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen) Args: guidance_scale (float): @@ -2190,13 +2175,10 @@ class AlternatingCodebooksLogitsProcessor(LogitsProcessor): r""" [`LogitsProcessor`] enforcing alternated generation between the two codebooks of Bark. - - - This logits processor is exclusively compatible with - [Bark](https://huggingface.co/docs/transformers/en/model_doc/bark)'s fine submodel. See the model documentation - for examples. - - + > [!WARNING] + > This logits processor is exclusively compatible with + > [Bark](https://huggingface.co/docs/transformers/en/model_doc/bark)'s fine submodel. See the model documentation + > for examples. Args: input_start_len (`int`): @@ -2352,12 +2334,9 @@ def __call__(self, input_ids, scores): class BarkEosPrioritizerLogitsProcessor(LogitsProcessor): r"""This processor ensures that the EOS token is selected if its probability is greater than the `min_eos_p`. - - - This logits processor is exclusively compatible with - [Bark](https://huggingface.co/docs/transformers/en/model_doc/bark). See the model documentation for examples. - - + > [!WARNING] + > This logits processor is exclusively compatible with + > [Bark](https://huggingface.co/docs/transformers/en/model_doc/bark). See the model documentation for examples. Args: eos_token_id (`Union[int, list[int], torch.Tensor]`): @@ -3018,12 +2997,9 @@ class DiaClassifierFreeGuidanceLogitsProcessor(LogitsProcessor): calculation, e.g. conditioned logits centered, and an additional top k selection option. - - - This logits processor is exclusively compatible with - [Dia](https://huggingface.co/docs/transformers/main/en/model_doc/dia) - - + > [!WARNING] + > This logits processor is exclusively compatible with + > [Dia](https://huggingface.co/docs/transformers/main/en/model_doc/dia) Args: guidance_scale (float): @@ -3086,12 +3062,9 @@ class DiaEOSChannelFilterLogitsProcessor(LogitsProcessor): 2. and 3. are especially important in contexts where we allow sampling to guarantee the respective tokens to be (not) sampled. - - - This logits processor is exclusively compatible with - [Dia](https://huggingface.co/docs/transformers/en/model_doc/dia). - - + > [!WARNING] + > This logits processor is exclusively compatible with + > [Dia](https://huggingface.co/docs/transformers/en/model_doc/dia). Args: num_channels (`int`): @@ -3168,12 +3141,9 @@ class DiaEOSDelayPatternLogitsProcessor(LogitsProcessor): theirs at the respective delays (s+2, s+3, s+4). Subsequent padding tokens are handled by the `EosTokenCriteria` when an EOS has been detected. - - - This logits processor is exclusively compatible with - [Dia](https://huggingface.co/docs/transformers/en/model_doc/dia). - - + > [!WARNING] + > This logits processor is exclusively compatible with + > [Dia](https://huggingface.co/docs/transformers/en/model_doc/dia). 
Args: delay_pattern (`List[int]`): diff --git a/src/transformers/generation/streamers.py b/src/transformers/generation/streamers.py index eddcfc0f9c34..3bb0de204f71 100644 --- a/src/transformers/generation/streamers.py +++ b/src/transformers/generation/streamers.py @@ -42,11 +42,8 @@ class TextStreamer(BaseStreamer): """ Simple text streamer that prints the token(s) to stdout as soon as entire words are formed. - - - The API for the streamer classes is still under development and may change in the future. - - + > [!WARNING] + > The API for the streamer classes is still under development and may change in the future. Parameters: tokenizer (`AutoTokenizer`): @@ -165,11 +162,8 @@ class TextIteratorStreamer(TextStreamer): useful for applications that benefit from accessing the generated text in a non-blocking way (e.g. in an interactive Gradio demo). - - - The API for the streamer classes is still under development and may change in the future. - - + > [!WARNING] + > The API for the streamer classes is still under development and may change in the future. Parameters: tokenizer (`AutoTokenizer`): @@ -236,11 +230,8 @@ class AsyncTextIteratorStreamer(TextStreamer): This is useful for applications that benefit from accessing the generated text asynchronously (e.g. in an interactive Gradio demo). - - - The API for the streamer classes is still under development and may change in the future. - - + > [!WARNING] + > The API for the streamer classes is still under development and may change in the future. Parameters: tokenizer (`AutoTokenizer`): diff --git a/src/transformers/generation/utils.py b/src/transformers/generation/utils.py index 156dca000176..c32f508f9025 100644 --- a/src/transformers/generation/utils.py +++ b/src/transformers/generation/utils.py @@ -2229,16 +2229,13 @@ def generate( Generates sequences of token ids for models with a language modeling head. - - - Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the - model's default generation configuration. You can override any `generation_config` by passing the corresponding - parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`. - - For an overview of generation strategies and code examples, check out the [following - guide](../generation_strategies). - - + > [!WARNING] + > Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the + > model's default generation configuration. You can override any `generation_config` by passing the corresponding + > parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`. + > + > For an overview of generation strategies and code examples, check out the [following + > guide](../generation_strategies). Parameters: inputs (`torch.Tensor` of varying shape depending on the modality, *optional*): diff --git a/src/transformers/image_processing_base.py b/src/transformers/image_processing_base.py index 8bd65e9bc3ce..0bd7dc807781 100644 --- a/src/transformers/image_processing_base.py +++ b/src/transformers/image_processing_base.py @@ -134,11 +134,8 @@ def from_pretrained( identifier allowed by git. - - - To test a pull request you made on the Hub, you can pass `revision="refs/pr/"`. - - + > [!TIP] + > To test a pull request you made on the Hub, you can pass `revision="refs/pr/"`. return_unused_kwargs (`bool`, *optional*, defaults to `False`): If `False`, then this function returns just the final image processor object. 
If `True`, then this diff --git a/src/transformers/integrations/accelerate.py b/src/transformers/integrations/accelerate.py index 9464a4a67530..48db842a5fac 100644 --- a/src/transformers/integrations/accelerate.py +++ b/src/transformers/integrations/accelerate.py @@ -54,14 +54,11 @@ def init_empty_weights(include_buffers: bool = False): tst = nn.Sequential(*[nn.Linear(10000, 10000) for _ in range(1000)]) ``` - - - Any model created under this context manager has no weights. As such you can't do something like - `model.to(some_device)` with it. To load weights inside your empty model, see [`load_checkpoint_and_dispatch`]. - Make sure to overwrite the default device_map param for [`load_checkpoint_and_dispatch`], otherwise dispatch is not - called. - - + > [!WARNING] + > Any model created under this context manager has no weights. As such you can't do something like + > `model.to(some_device)` with it. To load weights inside your empty model, see [`load_checkpoint_and_dispatch`]. + > Make sure to overwrite the default device_map param for [`load_checkpoint_and_dispatch`], otherwise dispatch is not + > called. """ with init_on_device(torch.device("meta"), include_buffers=include_buffers) as f: yield f @@ -145,12 +142,9 @@ def find_tied_parameters(model: "nn.Module", **kwargs): """ Find the tied parameters in a given model. - - - The signature accepts keyword arguments, but they are for the recursive part of this function and you should ignore - them. - - + > [!WARNING] + > The signature accepts keyword arguments, but they are for the recursive part of this function and you should ignore + > them. Args: model (`torch.nn.Module`): The model to inspect. diff --git a/src/transformers/integrations/peft.py b/src/transformers/integrations/peft.py index 4fba01df425a..3ece80cbc840 100644 --- a/src/transformers/integrations/peft.py +++ b/src/transformers/integrations/peft.py @@ -123,11 +123,8 @@ def load_adapter( git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any identifier allowed by git. - - - To test a pull request you made on the Hub, you can pass `revision="refs/pr/"`. - - + > [!TIP] + > To test a pull request you made on the Hub, you can pass `revision="refs/pr/"`. token (`str`, `optional`): Whether to use authentication token to load the remote folder. Useful to load private repositories diff --git a/src/transformers/modeling_outputs.py b/src/transformers/modeling_outputs.py index 1747f6fa477b..d9cab18e0357 100755 --- a/src/transformers/modeling_outputs.py +++ b/src/transformers/modeling_outputs.py @@ -1207,13 +1207,10 @@ class SemanticSegmenterOutput(ModelOutput): logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels, logits_height, logits_width)`): Classification scores for each pixel. - - - The logits returned do not necessarily have the same size as the `pixel_values` passed as inputs. This is - to avoid doing two interpolations and lose some quality when a user needs to resize the logits to the - original image size as post-processing. You should always check your logits shape and resize as needed. - - + > [!WARNING] + > The logits returned do not necessarily have the same size as the `pixel_values` passed as inputs. This is + > to avoid doing two interpolations and lose some quality when a user needs to resize the logits to the + > original image size as post-processing. You should always check your logits shape and resize as needed. 
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py index 1c57072a0c72..19f0c55ae4ee 100644 --- a/src/transformers/modeling_utils.py +++ b/src/transformers/modeling_utils.py @@ -3688,12 +3688,9 @@ def save_pretrained( We default it to 5GB in order for models to be able to run easily on free-tier google colab instances without CPU OOM issues. - - - If a single weight of the model is bigger than `max_shard_size`, it will be in its own checkpoint shard - which will be bigger than `max_shard_size`. - - + > [!WARNING] + > If a single weight of the model is bigger than `max_shard_size`, it will be in its own checkpoint shard + > which will be bigger than `max_shard_size`. safe_serialization (`bool`, *optional*, defaults to `True`): Whether to save the model using `safetensors` or the traditional PyTorch way (that uses `pickle`). @@ -4346,11 +4343,8 @@ def from_pretrained( git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any identifier allowed by git. - - - To test a pull request you made on the Hub, you can pass `revision="refs/pr/"`. - - + > [!TIP] + > To test a pull request you made on the Hub, you can pass `revision="refs/pr/"`. attn_implementation (`str`, *optional*): The attention implementation to use in the model (if relevant). Can be any of `"eager"` (manual implementation of the attention), `"sdpa"` (using [`F.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html)), `"flash_attention_2"` (using [Dao-AILab/flash-attention](https://github.com/Dao-AILab/flash-attention)), or `"flash_attention_3"` (using [Dao-AILab/flash-attention/hopper](https://github.com/Dao-AILab/flash-attention/tree/main/hopper)). By default, if available, SDPA will be used for torch>=2.1.1. The default is otherwise the manual `"eager"` implementation. @@ -4388,13 +4382,10 @@ def from_pretrained( 3. A string that is a valid `torch.dtype`. E.g. "float32" loads the model in `torch.float32`, "float16" loads in `torch.float16` etc. - - - For some models the `dtype` they were trained in is unknown - you may try to check the model's paper or - reach out to the authors and ask them to add this information to the model's card and to insert the - `dtype` or `torch_dtype` entry in `config.json` on the hub. - - + > [!TIP] + > For some models the `dtype` they were trained in is unknown - you may try to check the model's paper or + > reach out to the authors and ask them to add this information to the model's card and to insert the + > `dtype` or `torch_dtype` entry in `config.json` on the hub. device_map (`str` or `dict[str, Union[int, str, torch.device]]` or `int` or `torch.device`, *optional*): A map that specifies where each submodule should go. It doesn't need to be refined to each @@ -4457,12 +4448,9 @@ def from_pretrained( supplied `kwargs` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's `__init__` function. - - - Activate the special ["offline-mode"](https://huggingface.co/transformers/installation.html#offline-mode) to - use this method in a firewalled environment. 
- - + > [!TIP] + > Activate the special ["offline-mode"](https://huggingface.co/transformers/installation.html#offline-mode) to + > use this method in a firewalled environment. Examples: diff --git a/src/transformers/models/albert/tokenization_albert.py b/src/transformers/models/albert/tokenization_albert.py index 011ad689edbd..4a0f23095655 100644 --- a/src/transformers/models/albert/tokenization_albert.py +++ b/src/transformers/models/albert/tokenization_albert.py @@ -54,22 +54,16 @@ class AlbertTokenizer(PreTrainedTokenizer): bos_token (`str`, *optional*, defaults to `"[CLS]"`): The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. - - - When building a sequence using special tokens, this is not the token that is used for the beginning of - sequence. The token used is the `cls_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the beginning of + > sequence. The token used is the `cls_token`. eos_token (`str`, *optional*, defaults to `"[SEP]"`): The end of sequence token. - - - When building a sequence using special tokens, this is not the token that is used for the end of sequence. - The token used is the `sep_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the end of sequence. + > The token used is the `sep_token`. unk_token (`str`, *optional*, defaults to `""`): The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this diff --git a/src/transformers/models/albert/tokenization_albert_fast.py b/src/transformers/models/albert/tokenization_albert_fast.py index ed9add51d207..21d319d1a9d2 100644 --- a/src/transformers/models/albert/tokenization_albert_fast.py +++ b/src/transformers/models/albert/tokenization_albert_fast.py @@ -55,12 +55,9 @@ class AlbertTokenizerFast(PreTrainedTokenizerFast): bos_token (`str`, *optional*, defaults to `"[CLS]"`): The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. - - - When building a sequence using special tokens, this is not the token that is used for the beginning of - sequence. The token used is the `cls_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the beginning of + > sequence. The token used is the `cls_token`. eos_token (`str`, *optional*, defaults to `"[SEP]"`): The end of sequence token. .. note:: When building a sequence using special tokens, this is not the token diff --git a/src/transformers/models/arcee/modeling_arcee.py.bak b/src/transformers/models/arcee/modeling_arcee.py.bak new file mode 100644 index 000000000000..e288f63d71b5 --- /dev/null +++ b/src/transformers/models/arcee/modeling_arcee.py.bak @@ -0,0 +1,507 @@ +# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨 +# This file was automatically generated from src/transformers/models/arcee/modular_arcee.py. +# Do NOT edit this file manually as any edits will be overwritten by the generation of +# the file from the modular. If any change should be done, please apply the change to the +# modular_arcee.py file directly. One of our CI enforces this. +# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨 +# coding=utf-8 +# Copyright 2025 Arcee AI and the HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import Callable, Optional, Union + +import torch +from torch import nn + +from transformers.utils import auto_docstring + +from ...activations import ACT2FN +from ...cache_utils import Cache, DynamicCache +from ...generation import GenerationMixin +from ...integrations import use_kernel_forward_from_hub +from ...masking_utils import create_causal_mask +from ...modeling_layers import ( + GenericForQuestionAnswering, + GenericForSequenceClassification, + GenericForTokenClassification, + GradientCheckpointingLayer, +) +from ...modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast +from ...modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update +from ...modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel +from ...processing_utils import Unpack +from ...utils import TransformersKwargs, can_return_tuple +from ...utils.generic import check_model_inputs +from .configuration_arcee import ArceeConfig + + +class ArceeMLP(nn.Module): + def __init__(self, config): + super().__init__() + self.config = config + self.hidden_size = config.hidden_size + self.intermediate_size = config.intermediate_size + self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias) + self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=config.mlp_bias) + self.act_fn = ACT2FN[config.hidden_act] + + def forward(self, x): + return self.down_proj(self.act_fn(self.up_proj(x))) + + +@use_kernel_forward_from_hub("RMSNorm") +class ArceeRMSNorm(nn.Module): + def __init__(self, hidden_size, eps=1e-6): + """ + ArceeRMSNorm is equivalent to T5LayerNorm + """ + super().__init__() + self.weight = nn.Parameter(torch.ones(hidden_size)) + self.variance_epsilon = eps + + def forward(self, hidden_states): + input_dtype = hidden_states.dtype + hidden_states = hidden_states.to(torch.float32) + variance = hidden_states.pow(2).mean(-1, keepdim=True) + hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon) + return self.weight * hidden_states.to(input_dtype) + + def extra_repr(self): + return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}" + + +class ArceeRotaryEmbedding(nn.Module): + def __init__(self, config: ArceeConfig, device=None): + super().__init__() + # BC: "rope_type" was originally "type" + if hasattr(config, "rope_scaling") and isinstance(config.rope_scaling, dict): + self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type")) + else: + self.rope_type = "default" + self.max_seq_len_cached = config.max_position_embeddings + self.original_max_seq_len = config.max_position_embeddings + + self.config = config + self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type] + + inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device) + self.register_buffer("inv_freq", inv_freq, persistent=False) + self.original_inv_freq = self.inv_freq + + @torch.no_grad() + @dynamic_rope_update # power user: used with advanced RoPE types (e.g. 
dynamic rope) + def forward(self, x, position_ids): + inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device) + position_ids_expanded = position_ids[:, None, :].float() + + device_type = x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu" + with torch.autocast(device_type=device_type, enabled=False): # Force float32 + freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2) + emb = torch.cat((freqs, freqs), dim=-1) + cos = emb.cos() * self.attention_scaling + sin = emb.sin() * self.attention_scaling + + return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype) + + +def rotate_half(x): + """Rotates half the hidden dims of the input.""" + x1 = x[..., : x.shape[-1] // 2] + x2 = x[..., x.shape[-1] // 2 :] + return torch.cat((-x2, x1), dim=-1) + + +def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1): + """Applies Rotary Position Embedding to the query and key tensors. + + Args: + q (`torch.Tensor`): The query tensor. + k (`torch.Tensor`): The key tensor. + cos (`torch.Tensor`): The cosine part of the rotary embedding. + sin (`torch.Tensor`): The sine part of the rotary embedding. + position_ids (`torch.Tensor`, *optional*): + Deprecated and unused. + unsqueeze_dim (`int`, *optional*, defaults to 1): + The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and + sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note + that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and + k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes + cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have + the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2. + Returns: + `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding. + """ + cos = cos.unsqueeze(unsqueeze_dim) + sin = sin.unsqueeze(unsqueeze_dim) + q_embed = (q * cos) + (rotate_half(q) * sin) + k_embed = (k * cos) + (rotate_half(k) * sin) + return q_embed, k_embed + + +def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor: + """ + This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). 
The hidden states go from (batch, + num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim) + """ + batch, num_key_value_heads, slen, head_dim = hidden_states.shape + if n_rep == 1: + return hidden_states + hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim) + return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim) + + +def eager_attention_forward( + module: nn.Module, + query: torch.Tensor, + key: torch.Tensor, + value: torch.Tensor, + attention_mask: Optional[torch.Tensor], + scaling: float, + dropout: float = 0.0, + **kwargs: Unpack[TransformersKwargs], +): + key_states = repeat_kv(key, module.num_key_value_groups) + value_states = repeat_kv(value, module.num_key_value_groups) + + attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling + if attention_mask is not None: + causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] + attn_weights = attn_weights + causal_mask + + attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype) + attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training) + attn_output = torch.matmul(attn_weights, value_states) + attn_output = attn_output.transpose(1, 2).contiguous() + + return attn_output, attn_weights + + +class ArceeAttention(nn.Module): + """Multi-headed attention from 'Attention Is All You Need' paper""" + + def __init__(self, config: ArceeConfig, layer_idx: int): + super().__init__() + self.config = config + self.layer_idx = layer_idx + self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads) + self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads + self.scaling = self.head_dim**-0.5 + self.attention_dropout = config.attention_dropout + self.is_causal = True + + self.q_proj = nn.Linear( + config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias + ) + self.k_proj = nn.Linear( + config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias + ) + self.v_proj = nn.Linear( + config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias + ) + self.o_proj = nn.Linear( + config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias + ) + + def forward( + self, + hidden_states: torch.Tensor, + position_embeddings: tuple[torch.Tensor, torch.Tensor], + attention_mask: Optional[torch.Tensor], + past_key_value: Optional[Cache] = None, + cache_position: Optional[torch.LongTensor] = None, + **kwargs: Unpack[TransformersKwargs], + ) -> tuple[torch.Tensor, torch.Tensor]: + input_shape = hidden_states.shape[:-1] + hidden_shape = (*input_shape, -1, self.head_dim) + + query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2) + key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2) + value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2) + + cos, sin = position_embeddings + query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin) + + if past_key_value is not None: + # sin and cos are specific to RoPE models; cache_position needed for the static cache + cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} + key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) + + attention_interface: Callable = eager_attention_forward + if 
self.config._attn_implementation != "eager": + attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation] + + attn_output, attn_weights = attention_interface( + self, + query_states, + key_states, + value_states, + attention_mask, + dropout=0.0 if not self.training else self.attention_dropout, + scaling=self.scaling, + **kwargs, + ) + + attn_output = attn_output.reshape(*input_shape, -1).contiguous() + attn_output = self.o_proj(attn_output) + return attn_output, attn_weights + + +class ArceeDecoderLayer(GradientCheckpointingLayer): + def __init__(self, config: ArceeConfig, layer_idx: int): + super().__init__() + self.hidden_size = config.hidden_size + + self.self_attn = ArceeAttention(config=config, layer_idx=layer_idx) + + self.mlp = ArceeMLP(config) + self.input_layernorm = ArceeRMSNorm(config.hidden_size, eps=config.rms_norm_eps) + self.post_attention_layernorm = ArceeRMSNorm(config.hidden_size, eps=config.rms_norm_eps) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_value: Optional[Cache] = None, + use_cache: Optional[bool] = False, + cache_position: Optional[torch.LongTensor] = None, + position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None, # necessary, but kept here for BC + **kwargs: Unpack[TransformersKwargs], + ) -> tuple[torch.Tensor]: + residual = hidden_states + hidden_states = self.input_layernorm(hidden_states) + # Self Attention + hidden_states, _ = self.self_attn( + hidden_states=hidden_states, + attention_mask=attention_mask, + position_ids=position_ids, + past_key_value=past_key_value, + use_cache=use_cache, + cache_position=cache_position, + position_embeddings=position_embeddings, + **kwargs, + ) + hidden_states = residual + hidden_states + + # Fully Connected + residual = hidden_states + hidden_states = self.post_attention_layernorm(hidden_states) + hidden_states = self.mlp(hidden_states) + hidden_states = residual + hidden_states + return hidden_states + + +@auto_docstring +class ArceePreTrainedModel(PreTrainedModel): + config: ArceeConfig + base_model_prefix = "model" + supports_gradient_checkpointing = True + _no_split_modules = ["ArceeDecoderLayer"] + _skip_keys_device_placement = ["past_key_values"] + _supports_flash_attn = True + _supports_sdpa = True + _supports_flex_attn = True + + _can_compile_fullgraph = True + _supports_attention_backend = True + _can_record_outputs = { + "hidden_states": ArceeDecoderLayer, + "attentions": ArceeAttention, + } + + +@auto_docstring +class ArceeModel(ArceePreTrainedModel): + def __init__(self, config: ArceeConfig): + super().__init__(config) + self.padding_idx = config.pad_token_id + self.vocab_size = config.vocab_size + + self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx) + self.layers = nn.ModuleList( + [ArceeDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)] + ) + self.norm = ArceeRMSNorm(config.hidden_size, eps=config.rms_norm_eps) + self.rotary_emb = ArceeRotaryEmbedding(config=config) + self.gradient_checkpointing = False + + # Initialize weights and apply final processing + self.post_init() + + @check_model_inputs + @auto_docstring + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[Cache] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + 
cache_position: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + **kwargs: Unpack[TransformersKwargs], + ) -> BaseModelOutputWithPast: + if (input_ids is None) ^ (inputs_embeds is not None): + raise ValueError("You must specify exactly one of input_ids or inputs_embeds") + + if inputs_embeds is None: + inputs_embeds: torch.Tensor = self.embed_tokens(input_ids) + + if use_cache and past_key_values is None: + past_key_values = DynamicCache() + + if cache_position is None: + past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0 + cache_position: torch.Tensor = torch.arange( + past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device + ) + + if position_ids is None: + position_ids = cache_position.unsqueeze(0) + + causal_mask = create_causal_mask( + config=self.config, + input_embeds=inputs_embeds, + attention_mask=attention_mask, + cache_position=cache_position, + past_key_values=past_key_values, + position_ids=position_ids, + ) + + hidden_states = inputs_embeds + position_embeddings = self.rotary_emb(hidden_states, position_ids) + + for decoder_layer in self.layers[: self.config.num_hidden_layers]: + hidden_states = decoder_layer( + hidden_states, + attention_mask=causal_mask, + position_ids=position_ids, + past_key_value=past_key_values, + cache_position=cache_position, + position_embeddings=position_embeddings, + **kwargs, + ) + + hidden_states = self.norm(hidden_states) + return BaseModelOutputWithPast( + last_hidden_state=hidden_states, + past_key_values=past_key_values, + ) + + +@auto_docstring(checkpoint="arcee-ai/AFM-4.5B") +class ArceeForCausalLM(ArceePreTrainedModel, GenerationMixin): + _tied_weights_keys = ["lm_head.weight"] + _tp_plan = {"lm_head": "colwise_rep"} + _pp_plan = {"lm_head": (["hidden_states"], ["logits"])} + + def __init__(self, config): + super().__init__(config) + self.model = ArceeModel(config) + self.vocab_size = config.vocab_size + self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False) + + # Initialize weights and apply final processing + self.post_init() + + def set_decoder(self, decoder): + self.model = decoder + + def get_decoder(self): + return self.model + + @can_return_tuple + @auto_docstring + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[Cache] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + cache_position: Optional[torch.LongTensor] = None, + logits_to_keep: Union[int, torch.Tensor] = 0, + **kwargs: Unpack[TransformersKwargs], + ) -> CausalLMOutputWithPast: + r""" + Example: + + ```python + >>> from transformers import AutoTokenizer, ArceeForCausalLM + + >>> model = ArceeForCausalLM.from_pretrained("meta-arcee/Arcee-2-7b-hf") + >>> tokenizer = AutoTokenizer.from_pretrained("meta-arcee/Arcee-2-7b-hf") + + >>> prompt = "Hey, are you conscious? Can you talk to me?" + >>> inputs = tokenizer(prompt, return_tensors="pt") + + >>> # Generate + >>> generate_ids = model.generate(inputs.input_ids, max_length=30) + >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] + "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." 
+ ```""" + outputs: BaseModelOutputWithPast = self.model( + input_ids=input_ids, + attention_mask=attention_mask, + position_ids=position_ids, + past_key_values=past_key_values, + inputs_embeds=inputs_embeds, + use_cache=use_cache, + cache_position=cache_position, + **kwargs, + ) + + hidden_states = outputs.last_hidden_state + # Only compute necessary logits, and do not upcast them to float if we are not computing the loss + slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep + logits = self.lm_head(hidden_states[:, slice_indices, :]) + + loss = None + if labels is not None: + loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs) + + return CausalLMOutputWithPast( + loss=loss, + logits=logits, + past_key_values=outputs.past_key_values, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +@auto_docstring(checkpoint="arcee-ai/AFM-4.5B") +class ArceeForSequenceClassification(GenericForSequenceClassification, ArceePreTrainedModel): + pass + + +@auto_docstring(checkpoint="arcee-ai/AFM-4.5B") +class ArceeForQuestionAnswering(GenericForQuestionAnswering, ArceePreTrainedModel): + base_model_prefix = "transformer" # For BC, where `transformer` was used instead of `model` + + +@auto_docstring(checkpoint="arcee-ai/AFM-4.5B") +class ArceeForTokenClassification(GenericForTokenClassification, ArceePreTrainedModel): + pass + + +__all__ = [ + "ArceeForCausalLM", + "ArceeForQuestionAnswering", + "ArceeForSequenceClassification", + "ArceeForTokenClassification", + "ArceeModel", + "ArceePreTrainedModel", +] diff --git a/src/transformers/models/auto/feature_extraction_auto.py b/src/transformers/models/auto/feature_extraction_auto.py index 6d4c4f554d9d..b3798d8b44e6 100644 --- a/src/transformers/models/auto/feature_extraction_auto.py +++ b/src/transformers/models/auto/feature_extraction_auto.py @@ -194,11 +194,8 @@ def get_feature_extractor_config( local_files_only (`bool`, *optional*, defaults to `False`): If `True`, will only try to load the tokenizer configuration from local files. - - - Passing `token=True` is required when you want to use a private model. - - + > [!TIP] + > Passing `token=True` is required when you want to use a private model. Returns: `Dict`: The configuration of the tokenizer. @@ -322,11 +319,8 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): loaded values. Behavior concerning key/value pairs whose keys are *not* feature extractor attributes is controlled by the `return_unused_kwargs` keyword parameter. - - - Passing `token=True` is required when you want to use a private model. - - + > [!TIP] + > Passing `token=True` is required when you want to use a private model. Examples: diff --git a/src/transformers/models/auto/image_processing_auto.py b/src/transformers/models/auto/image_processing_auto.py index a272735af207..ec60b5f066a8 100644 --- a/src/transformers/models/auto/image_processing_auto.py +++ b/src/transformers/models/auto/image_processing_auto.py @@ -285,11 +285,8 @@ def get_image_processor_config( local_files_only (`bool`, *optional*, defaults to `False`): If `True`, will only try to load the image processor configuration from local files. - - - Passing `token=True` is required when you want to use a private model. - - + > [!TIP] + > Passing `token=True` is required when you want to use a private model. Returns: `Dict`: The configuration of the image processor. 
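
The hunks on either side of this point repeat the tip that `token=True` is required for private repos (for `get_image_processor_config` and `AutoImageProcessor.from_pretrained`). A minimal sketch of what that looks like in practice, where `my-org/private-vit` is a hypothetical private repo your token can access:

```python
from transformers import AutoImageProcessor

# "my-org/private-vit" is a hypothetical private repo; token=True reuses the
# credentials stored by `huggingface-cli login` (an explicit token string also works).
image_processor = AutoImageProcessor.from_pretrained("my-org/private-vit", token=True)
```
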
@@ -427,11 +424,8 @@ def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs): loaded values. Behavior concerning key/value pairs whose keys are *not* image processor attributes is controlled by the `return_unused_kwargs` keyword parameter. - - - Passing `token=True` is required when you want to use a private model. - - + > [!TIP] + > Passing `token=True` is required when you want to use a private model. Examples: diff --git a/src/transformers/models/auto/processing_auto.py b/src/transformers/models/auto/processing_auto.py index 11862a5896b9..e99f5946ffb3 100644 --- a/src/transformers/models/auto/processing_auto.py +++ b/src/transformers/models/auto/processing_auto.py @@ -251,11 +251,8 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): loaded values. Behavior concerning key/value pairs whose keys are *not* feature extractor attributes is controlled by the `return_unused_kwargs` keyword parameter. - - - Passing `token=True` is required when you want to use a private model. - - + > [!TIP] + > Passing `token=True` is required when you want to use a private model. Examples: diff --git a/src/transformers/models/auto/tokenization_auto.py b/src/transformers/models/auto/tokenization_auto.py index d0c3af490d71..aad1fd51169e 100644 --- a/src/transformers/models/auto/tokenization_auto.py +++ b/src/transformers/models/auto/tokenization_auto.py @@ -867,11 +867,8 @@ def get_tokenizer_config( In case the tokenizer config is located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here. - - - Passing `token=True` is required when you want to use a private model. - - + > [!TIP] + > Passing `token=True` is required when you want to use a private model. Returns: `dict`: The configuration of the tokenizer. diff --git a/src/transformers/models/auto/video_processing_auto.py b/src/transformers/models/auto/video_processing_auto.py index 84bbc8e6fdb1..c9670b06f918 100644 --- a/src/transformers/models/auto/video_processing_auto.py +++ b/src/transformers/models/auto/video_processing_auto.py @@ -147,11 +147,8 @@ def get_video_processor_config( local_files_only (`bool`, *optional*, defaults to `False`): If `True`, will only try to load the video processor configuration from local files. - - - Passing `token=True` is required when you want to use a private model. - - + > [!TIP] + > Passing `token=True` is required when you want to use a private model. Returns: `Dict`: The configuration of the video processor. @@ -273,11 +270,8 @@ def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs): loaded values. Behavior concerning key/value pairs whose keys are *not* video processor attributes is controlled by the `return_unused_kwargs` keyword parameter. - - - Passing `token=True` is required when you want to use a private model. - - + > [!TIP] + > Passing `token=True` is required when you want to use a private model. Examples: diff --git a/src/transformers/models/autoformer/modeling_autoformer.py b/src/transformers/models/autoformer/modeling_autoformer.py index 9e583b0b8187..bddf51e5ef13 100644 --- a/src/transformers/models/autoformer/modeling_autoformer.py +++ b/src/transformers/models/autoformer/modeling_autoformer.py @@ -1741,47 +1741,44 @@ def forward( >>> mean_prediction = outputs.sequences.mean(dim=1) ``` - - - The AutoformerForPrediction can also use static_real_features. 
To do so, set num_static_real_features in - AutoformerConfig based on number of such features in the dataset (in case of tourism_monthly dataset it - is equal to 1), initialize the model and call as shown below: - - ``` - >>> from huggingface_hub import hf_hub_download - >>> import torch - >>> from transformers import AutoformerConfig, AutoformerForPrediction - - >>> file = hf_hub_download( - ... repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset" - ... ) - >>> batch = torch.load(file) - - >>> # check number of static real features - >>> num_static_real_features = batch["static_real_features"].shape[-1] - - >>> # load configuration of pretrained model and override num_static_real_features - >>> configuration = AutoformerConfig.from_pretrained( - ... "huggingface/autoformer-tourism-monthly", - ... num_static_real_features=num_static_real_features, - ... ) - >>> # we also need to update feature_size as it is not recalculated - >>> configuration.feature_size += num_static_real_features - - >>> model = AutoformerForPrediction(configuration) - - >>> outputs = model( - ... past_values=batch["past_values"], - ... past_time_features=batch["past_time_features"], - ... past_observed_mask=batch["past_observed_mask"], - ... static_categorical_features=batch["static_categorical_features"], - ... static_real_features=batch["static_real_features"], - ... future_values=batch["future_values"], - ... future_time_features=batch["future_time_features"], - ... ) - ``` - - + > [!TIP] + > The AutoformerForPrediction can also use static_real_features. To do so, set num_static_real_features in + > AutoformerConfig based on number of such features in the dataset (in case of tourism_monthly dataset it + > is equal to 1), initialize the model and call as shown below: + > + > ``` + > >>> from huggingface_hub import hf_hub_download + > >>> import torch + > >>> from transformers import AutoformerConfig, AutoformerForPrediction + > + > >>> file = hf_hub_download( + > ... repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset" + > ... ) + > >>> batch = torch.load(file) + > + > >>> # check number of static real features + > >>> num_static_real_features = batch["static_real_features"].shape[-1] + > + > >>> # load configuration of pretrained model and override num_static_real_features + > >>> configuration = AutoformerConfig.from_pretrained( + > ... "huggingface/autoformer-tourism-monthly", + > ... num_static_real_features=num_static_real_features, + > ... ) + > >>> # we also need to update feature_size as it is not recalculated + > >>> configuration.feature_size += num_static_real_features + > + > >>> model = AutoformerForPrediction(configuration) + > + > >>> outputs = model( + > ... past_values=batch["past_values"], + > ... past_time_features=batch["past_time_features"], + > ... past_observed_mask=batch["past_observed_mask"], + > ... static_categorical_features=batch["static_categorical_features"], + > ... static_real_features=batch["static_real_features"], + > ... future_values=batch["future_values"], + > ... future_time_features=batch["future_time_features"], + > ... 
) + > ``` """ return_dict = return_dict if return_dict is not None else self.config.use_return_dict diff --git a/src/transformers/models/bart/tokenization_bart.py b/src/transformers/models/bart/tokenization_bart.py index f674afe1a412..361b3d5ef344 100644 --- a/src/transformers/models/bart/tokenization_bart.py +++ b/src/transformers/models/bart/tokenization_bart.py @@ -92,11 +92,8 @@ class BartTokenizer(PreTrainedTokenizer): You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance. - - - When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one). - - + > [!TIP] + > When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one). This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. @@ -112,22 +109,16 @@ class BartTokenizer(PreTrainedTokenizer): bos_token (`str`, *optional*, defaults to `""`): The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. - - - When building a sequence using special tokens, this is not the token that is used for the beginning of - sequence. The token used is the `cls_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the beginning of + > sequence. The token used is the `cls_token`. eos_token (`str`, *optional*, defaults to `""`): The end of sequence token. - - - When building a sequence using special tokens, this is not the token that is used for the end of sequence. - The token used is the `sep_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the end of sequence. + > The token used is the `sep_token`. sep_token (`str`, *optional*, defaults to `""`): The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for diff --git a/src/transformers/models/bart/tokenization_bart_fast.py b/src/transformers/models/bart/tokenization_bart_fast.py index 88b002f59529..9ca92e4718e5 100644 --- a/src/transformers/models/bart/tokenization_bart_fast.py +++ b/src/transformers/models/bart/tokenization_bart_fast.py @@ -54,11 +54,8 @@ class BartTokenizerFast(PreTrainedTokenizerFast): You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance. - - - When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`. - - + > [!TIP] + > When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`. This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. @@ -74,22 +71,16 @@ class BartTokenizerFast(PreTrainedTokenizerFast): bos_token (`str`, *optional*, defaults to `""`): The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. - - - When building a sequence using special tokens, this is not the token that is used for the beginning of - sequence. 
The token used is the `cls_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the beginning of + > sequence. The token used is the `cls_token`. eos_token (`str`, *optional*, defaults to `""`): The end of sequence token. - - - When building a sequence using special tokens, this is not the token that is used for the end of sequence. - The token used is the `sep_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the end of sequence. + > The token used is the `sep_token`. sep_token (`str`, *optional*, defaults to `""`): The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for diff --git a/src/transformers/models/barthez/tokenization_barthez.py b/src/transformers/models/barthez/tokenization_barthez.py index bc583e0cd5dc..b8442dcb3c1b 100644 --- a/src/transformers/models/barthez/tokenization_barthez.py +++ b/src/transformers/models/barthez/tokenization_barthez.py @@ -51,22 +51,16 @@ class BarthezTokenizer(PreTrainedTokenizer): bos_token (`str`, *optional*, defaults to `""`): The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. - - - When building a sequence using special tokens, this is not the token that is used for the beginning of - sequence. The token used is the `cls_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the beginning of + > sequence. The token used is the `cls_token`. eos_token (`str`, *optional*, defaults to `""`): The end of sequence token. - - - When building a sequence using special tokens, this is not the token that is used for the end of sequence. - The token used is the `sep_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the end of sequence. + > The token used is the `sep_token`. sep_token (`str`, *optional*, defaults to `""`): The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for diff --git a/src/transformers/models/barthez/tokenization_barthez_fast.py b/src/transformers/models/barthez/tokenization_barthez_fast.py index 64050ca8848f..ac75437cc5d0 100644 --- a/src/transformers/models/barthez/tokenization_barthez_fast.py +++ b/src/transformers/models/barthez/tokenization_barthez_fast.py @@ -51,22 +51,16 @@ class BarthezTokenizerFast(PreTrainedTokenizerFast): bos_token (`str`, *optional*, defaults to `""`): The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. - - - When building a sequence using special tokens, this is not the token that is used for the beginning of - sequence. The token used is the `cls_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the beginning of + > sequence. The token used is the `cls_token`. eos_token (`str`, *optional*, defaults to `""`): The end of sequence token. - - - When building a sequence using special tokens, this is not the token that is used for the end of sequence. - The token used is the `sep_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the end of sequence. + > The token used is the `sep_token`. sep_token (`str`, *optional*, defaults to `""`): The separator token, which is used when building a sequence from multiple sequences, e.g. 
two sequences for diff --git a/src/transformers/models/bartpho/tokenization_bartpho.py b/src/transformers/models/bartpho/tokenization_bartpho.py index 41a122bf913c..eaf46e9849e0 100644 --- a/src/transformers/models/bartpho/tokenization_bartpho.py +++ b/src/transformers/models/bartpho/tokenization_bartpho.py @@ -50,22 +50,16 @@ class BartphoTokenizer(PreTrainedTokenizer): bos_token (`str`, *optional*, defaults to `""`): The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. - - - When building a sequence using special tokens, this is not the token that is used for the beginning of - sequence. The token used is the `cls_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the beginning of + > sequence. The token used is the `cls_token`. eos_token (`str`, *optional*, defaults to `""`): The end of sequence token. - - - When building a sequence using special tokens, this is not the token that is used for the end of sequence. - The token used is the `sep_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the end of sequence. + > The token used is the `sep_token`. sep_token (`str`, *optional*, defaults to `""`): The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for diff --git a/src/transformers/models/bertweet/tokenization_bertweet.py b/src/transformers/models/bertweet/tokenization_bertweet.py index 3ce1a3182bf9..4ff42feb3ddd 100644 --- a/src/transformers/models/bertweet/tokenization_bertweet.py +++ b/src/transformers/models/bertweet/tokenization_bertweet.py @@ -68,22 +68,16 @@ class BertweetTokenizer(PreTrainedTokenizer): bos_token (`str`, *optional*, defaults to `""`): The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. - - - When building a sequence using special tokens, this is not the token that is used for the beginning of - sequence. The token used is the `cls_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the beginning of + > sequence. The token used is the `cls_token`. eos_token (`str`, *optional*, defaults to `""`): The end of sequence token. - - - When building a sequence using special tokens, this is not the token that is used for the end of sequence. - The token used is the `sep_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the end of sequence. + > The token used is the `sep_token`. sep_token (`str`, *optional*, defaults to `""`): The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for diff --git a/src/transformers/models/big_bird/tokenization_big_bird_fast.py b/src/transformers/models/big_bird/tokenization_big_bird_fast.py index 6148585a40b1..43f7ef302b3e 100644 --- a/src/transformers/models/big_bird/tokenization_big_bird_fast.py +++ b/src/transformers/models/big_bird/tokenization_big_bird_fast.py @@ -49,12 +49,9 @@ class BigBirdTokenizerFast(PreTrainedTokenizerFast): bos_token (`str`, *optional*, defaults to `""`): The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. - - - When building a sequence using special tokens, this is not the token that is used for the beginning of - sequence. The token used is the `cls_token`. 
- - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the beginning of + > sequence. The token used is the `cls_token`. eos_token (`str`, *optional*, defaults to `""`): The end of sequence token. .. note:: When building a sequence using special tokens, this is not the token diff --git a/src/transformers/models/biogpt/tokenization_biogpt.py b/src/transformers/models/biogpt/tokenization_biogpt.py index f84403ca7ddc..ace134588a87 100644 --- a/src/transformers/models/biogpt/tokenization_biogpt.py +++ b/src/transformers/models/biogpt/tokenization_biogpt.py @@ -61,22 +61,16 @@ class BioGptTokenizer(PreTrainedTokenizer): bos_token (`str`, *optional*, defaults to `""`): The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. - - - When building a sequence using special tokens, this is not the token that is used for the beginning of - sequence. The token used is the `cls_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the beginning of + > sequence. The token used is the `cls_token`. eos_token (`str`, *optional*, defaults to `""`): The end of sequence token. - - - When building a sequence using special tokens, this is not the token that is used for the end of sequence. - The token used is the `sep_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the end of sequence. + > The token used is the `sep_token`. sep_token (`str`, *optional*, defaults to `""`): The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for diff --git a/src/transformers/models/blenderbot/tokenization_blenderbot.py b/src/transformers/models/blenderbot/tokenization_blenderbot.py index 76719fa25494..ea7797c8055b 100644 --- a/src/transformers/models/blenderbot/tokenization_blenderbot.py +++ b/src/transformers/models/blenderbot/tokenization_blenderbot.py @@ -98,11 +98,8 @@ class BlenderbotTokenizer(PreTrainedTokenizer): You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance. - - - When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one). - - + > [!TIP] + > When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one). This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. @@ -118,22 +115,16 @@ class BlenderbotTokenizer(PreTrainedTokenizer): bos_token (`str`, *optional*, defaults to `""`): The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. - - - When building a sequence using special tokens, this is not the token that is used for the beginning of - sequence. The token used is the `cls_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the beginning of + > sequence. The token used is the `cls_token`. eos_token (`str`, *optional*, defaults to `""`): The end of sequence token. - - - When building a sequence using special tokens, this is not the token that is used for the end of sequence. - The token used is the `sep_token`. 
- - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the end of sequence. + > The token used is the `sep_token`. sep_token (`str`, *optional*, defaults to `""`): The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for diff --git a/src/transformers/models/blenderbot/tokenization_blenderbot_fast.py b/src/transformers/models/blenderbot/tokenization_blenderbot_fast.py index 0b84200e02d5..a82c25a41f7b 100644 --- a/src/transformers/models/blenderbot/tokenization_blenderbot_fast.py +++ b/src/transformers/models/blenderbot/tokenization_blenderbot_fast.py @@ -57,11 +57,8 @@ class BlenderbotTokenizerFast(PreTrainedTokenizerFast): You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance. - - - When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`. - - + > [!TIP] + > When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`. This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. @@ -77,22 +74,16 @@ class BlenderbotTokenizerFast(PreTrainedTokenizerFast): bos_token (`str`, *optional*, defaults to `""`): The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. - - - When building a sequence using special tokens, this is not the token that is used for the beginning of - sequence. The token used is the `cls_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the beginning of + > sequence. The token used is the `cls_token`. eos_token (`str`, *optional*, defaults to `""`): The end of sequence token. - - - When building a sequence using special tokens, this is not the token that is used for the end of sequence. - The token used is the `sep_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the end of sequence. + > The token used is the `sep_token`. sep_token (`str`, *optional*, defaults to `""`): The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for diff --git a/src/transformers/models/blip_2/modeling_blip_2.py b/src/transformers/models/blip_2/modeling_blip_2.py index cb4e36b37308..95ce425b53c0 100644 --- a/src/transformers/models/blip_2/modeling_blip_2.py +++ b/src/transformers/models/blip_2/modeling_blip_2.py @@ -1627,11 +1627,8 @@ def forward( One can optionally pass `input_ids` to the model, which serve as a text prompt, to make the language model continue the prompt. Otherwise, the language model starts generating text from the [BOS] (beginning-of-sequence) token. - - - Note that Flan-T5 checkpoints cannot be cast to float16. They are pre-trained using bfloat16. - - + > [!TIP] + > Note that Flan-T5 checkpoints cannot be cast to float16. They are pre-trained using bfloat16. 
""" ) class Blip2ForConditionalGeneration(Blip2PreTrainedModel, GenerationMixin): diff --git a/src/transformers/models/bloom/tokenization_bloom_fast.py b/src/transformers/models/bloom/tokenization_bloom_fast.py index b7a9f7449a4e..0862ee59acc6 100644 --- a/src/transformers/models/bloom/tokenization_bloom_fast.py +++ b/src/transformers/models/bloom/tokenization_bloom_fast.py @@ -49,11 +49,8 @@ class BloomTokenizerFast(PreTrainedTokenizerFast): You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer, but since the model was not pretrained this way, it might yield a decrease in performance. - - - When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`. - - + > [!TIP] + > When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`. This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. diff --git a/src/transformers/models/byt5/tokenization_byt5.py b/src/transformers/models/byt5/tokenization_byt5.py index 2a9804db1014..48b0ba74e401 100644 --- a/src/transformers/models/byt5/tokenization_byt5.py +++ b/src/transformers/models/byt5/tokenization_byt5.py @@ -35,12 +35,9 @@ class ByT5Tokenizer(PreTrainedTokenizer): eos_token (`str`, *optional*, defaults to `""`): The end of sequence token. - - - When building a sequence using special tokens, this is not the token that is used for the end of sequence. - The token used is the `sep_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the end of sequence. + > The token used is the `sep_token`. unk_token (`str`, *optional*, defaults to `""`): The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this diff --git a/src/transformers/models/camembert/tokenization_camembert.py b/src/transformers/models/camembert/tokenization_camembert.py index cd6e399f208d..8e9131f28944 100644 --- a/src/transformers/models/camembert/tokenization_camembert.py +++ b/src/transformers/models/camembert/tokenization_camembert.py @@ -49,22 +49,16 @@ class CamembertTokenizer(PreTrainedTokenizer): bos_token (`str`, *optional*, defaults to `""`): The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. - - - When building a sequence using special tokens, this is not the token that is used for the beginning of - sequence. The token used is the `cls_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the beginning of + > sequence. The token used is the `cls_token`. eos_token (`str`, *optional*, defaults to `""`): The end of sequence token. - - - When building a sequence using special tokens, this is not the token that is used for the end of sequence. - The token used is the `sep_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the end of sequence. + > The token used is the `sep_token`. sep_token (`str`, *optional*, defaults to `""`): The separator token, which is used when building a sequence from multiple sequences, e.g. 
two sequences for diff --git a/src/transformers/models/camembert/tokenization_camembert_fast.py b/src/transformers/models/camembert/tokenization_camembert_fast.py index 423058ed959a..5117bc5c5de5 100644 --- a/src/transformers/models/camembert/tokenization_camembert_fast.py +++ b/src/transformers/models/camembert/tokenization_camembert_fast.py @@ -53,22 +53,16 @@ class CamembertTokenizerFast(PreTrainedTokenizerFast): bos_token (`str`, *optional*, defaults to `""`): The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. - - - When building a sequence using special tokens, this is not the token that is used for the beginning of - sequence. The token used is the `cls_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the beginning of + > sequence. The token used is the `cls_token`. eos_token (`str`, *optional*, defaults to `""`): The end of sequence token. - - - When building a sequence using special tokens, this is not the token that is used for the end of sequence. - The token used is the `sep_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the end of sequence. + > The token used is the `sep_token`. sep_token (`str`, *optional*, defaults to `""`): The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for diff --git a/src/transformers/models/clvp/tokenization_clvp.py b/src/transformers/models/clvp/tokenization_clvp.py index 4b0b285561c5..af70a899ed5e 100644 --- a/src/transformers/models/clvp/tokenization_clvp.py +++ b/src/transformers/models/clvp/tokenization_clvp.py @@ -96,11 +96,8 @@ class ClvpTokenizer(PreTrainedTokenizer): You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance. - - - When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one). - - + > [!TIP] + > When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one). This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. diff --git a/src/transformers/models/code_llama/tokenization_code_llama.py b/src/transformers/models/code_llama/tokenization_code_llama.py index 94d1b4d65985..ecc59475c57e 100644 --- a/src/transformers/models/code_llama/tokenization_code_llama.py +++ b/src/transformers/models/code_llama/tokenization_code_llama.py @@ -68,12 +68,9 @@ class CodeLlamaTokenizer(PreTrainedTokenizer): eos_token (`str`, *optional*, defaults to `""`): The end of sequence token. - - - When building a sequence using special tokens, this is not the token that is used for the end of sequence. - The token used is the `sep_token`. - - + > [!TIP] + > When building a sequence using special tokens, this is not the token that is used for the end of sequence. + > The token used is the `sep_token`. prefix_token (`str`, *optional*, defaults to `"▁
"`):
             Prefix token used for infilling.
diff --git a/src/transformers/models/codegen/tokenization_codegen.py b/src/transformers/models/codegen/tokenization_codegen.py
index 4d08c6acd5bb..6ecc575b8530 100644
--- a/src/transformers/models/codegen/tokenization_codegen.py
+++ b/src/transformers/models/codegen/tokenization_codegen.py
@@ -99,11 +99,8 @@ class CodeGenTokenizer(PreTrainedTokenizer):
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you
     call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
 
-    
-
-    When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one).
-
-    
+    > [!TIP]
+    > When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one).
 
     This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
     this superclass for more information regarding those methods.
diff --git a/src/transformers/models/codegen/tokenization_codegen_fast.py b/src/transformers/models/codegen/tokenization_codegen_fast.py
index 72c8d66c829a..08835c3f845e 100644
--- a/src/transformers/models/codegen/tokenization_codegen_fast.py
+++ b/src/transformers/models/codegen/tokenization_codegen_fast.py
@@ -59,11 +59,8 @@ class CodeGenTokenizerFast(PreTrainedTokenizerFast):
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer, but since
     the model was not pretrained this way, it might yield a decrease in performance.
 
-    
-
-    When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.
-
-    
+    > [!TIP]
+    > When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.
 
     This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
     refer to this superclass for more information regarding those methods.
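
The tip above is easiest to see in code. A minimal sketch, assuming the `Salesforce/codegen-350M-mono` checkpoint purely as an illustration of a byte-level BPE fast tokenizer:

```python
from transformers import AutoTokenizer

# Pre-split input only works if the fast tokenizer was instantiated with add_prefix_space=True.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono", add_prefix_space=True)

# is_split_into_words=True tells the tokenizer the words are already split;
# a space is then prepended to every word, including the first one.
encoding = tokenizer(["def", "hello", "()"], is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```
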
diff --git a/src/transformers/models/cohere/tokenization_cohere_fast.py b/src/transformers/models/cohere/tokenization_cohere_fast.py
index 8072cbe7c17c..7cf9beca7237 100644
--- a/src/transformers/models/cohere/tokenization_cohere_fast.py
+++ b/src/transformers/models/cohere/tokenization_cohere_fast.py
@@ -66,11 +66,8 @@ class CohereTokenizerFast(PreTrainedTokenizerFast):
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer, but since
     the model was not pretrained this way, it might yield a decrease in performance.
 
-    
-
-    When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.
-
-    
+    > [!TIP]
+    > When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.
 
     This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
     refer to this superclass for more information regarding those methods.
diff --git a/src/transformers/models/cpm/tokenization_cpm.py b/src/transformers/models/cpm/tokenization_cpm.py
index 5ecfedd0a614..c15c9080692c 100644
--- a/src/transformers/models/cpm/tokenization_cpm.py
+++ b/src/transformers/models/cpm/tokenization_cpm.py
@@ -75,22 +75,16 @@ def __init__(
                 The beginning of sequence token that was used during pretraining. Can be used a sequence classifier
                 token.
 
-                
-
-                When building a sequence using special tokens, this is not the token that is used for the beginning of
-                sequence. The token used is the `cls_token`.
-
-                
+                > [!TIP]
+                > When building a sequence using special tokens, this is not the token that is used for the beginning of
+                > sequence. The token used is the `cls_token`.
 
             eos_token (`str`, *optional*, defaults to `"</s>"`):
                 The end of sequence token.
 
-                
-
-                When building a sequence using special tokens, this is not the token that is used for the end of
-                sequence. The token used is the `sep_token`.
-
-                
+                > [!TIP]
+                > When building a sequence using special tokens, this is not the token that is used for the end of
+                > sequence. The token used is the `sep_token`.
 
             unk_token (`str`, *optional*, defaults to `"<unk>"`):
                 The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be
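
The cls/sep-versus-bos/eos distinction these tips describe can be checked by encoding a short string and looking at which special tokens are actually appended. A small sketch, using the XLNet tokenizer (whose special-token layout `CpmTokenizer` shares) as an illustrative stand-in:

```python
from transformers import AutoTokenizer

# XLNet-style tokenizers define bos/eos tokens, but sequences built with
# special tokens are closed with <sep> and <cls> instead.
tokenizer = AutoTokenizer.from_pretrained("xlnet/xlnet-base-cased")

ids = tokenizer("Hello world")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# expected to end in ['<sep>', '<cls>'] rather than the bos/eos tokens
```
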
diff --git a/src/transformers/models/cpm/tokenization_cpm_fast.py b/src/transformers/models/cpm/tokenization_cpm_fast.py
index 3e828ca9e0b5..77ec8b781d4f 100644
--- a/src/transformers/models/cpm/tokenization_cpm_fast.py
+++ b/src/transformers/models/cpm/tokenization_cpm_fast.py
@@ -68,22 +68,16 @@ def __init__(
                 The beginning of sequence token that was used during pretraining. Can be used a sequence classifier
                 token.
 
-                
-
-                When building a sequence using special tokens, this is not the token that is used for the beginning of
-                sequence. The token used is the `cls_token`.
-
-                
+                > [!TIP]
+                > When building a sequence using special tokens, this is not the token that is used for the beginning of
+                > sequence. The token used is the `cls_token`.
 
             eos_token (`str`, *optional*, defaults to `"</s>"`):
                 The end of sequence token.
 
-                
-
-                When building a sequence using special tokens, this is not the token that is used for the end of
-                sequence. The token used is the `sep_token`.
-
-                
+                > [!TIP]
+                > When building a sequence using special tokens, this is not the token that is used for the end of
+                > sequence. The token used is the `sep_token`.
 
             unk_token (`str`, *optional*, defaults to `"<unk>"`):
                 The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be
diff --git a/src/transformers/models/csm/generation_csm.py b/src/transformers/models/csm/generation_csm.py
index cf8bc141f5d1..63a3f0c6eddf 100644
--- a/src/transformers/models/csm/generation_csm.py
+++ b/src/transformers/models/csm/generation_csm.py
@@ -356,12 +356,10 @@ def generate(
         3. Use these generated codebook tokens as `input_ids` to sample the next first codebook token using the backbone model
         4. Repeat until stopping criteria is met
 
-        
-
-        Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
-        model's default generation configuration. You can override any `generation_config` by passing the corresponding
-        parameters to generate(), e.g. `.generate(inputs, do_sample=True)`.
-        
+        > [!WARNING]
+        > Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
+        > model's default generation configuration. You can override any `generation_config` by passing the corresponding
+        > parameters to generate(), e.g. `.generate(inputs, do_sample=True)`.
 
         Parameters:
             inputs_ids (`torch.Tensor` of shape (batch_size, seq_length), *optional*):
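
The warning in this hunk applies to `generate()` across the library: per-call keyword arguments override the model's `generation_config`. A minimal sketch with a small text model standing in for CSM, since the mechanism is the same:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is used here only because it is small; any generative model behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")

# Defaults come from model.generation_config; keyword arguments override them for this call only.
output_ids = model.generate(**inputs, do_sample=True, temperature=0.7, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
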
diff --git a/src/transformers/models/deberta/tokenization_deberta.py b/src/transformers/models/deberta/tokenization_deberta.py
index 74e958c8030b..159d9261dfab 100644
--- a/src/transformers/models/deberta/tokenization_deberta.py
+++ b/src/transformers/models/deberta/tokenization_deberta.py
@@ -90,11 +90,8 @@ class DebertaTokenizer(PreTrainedTokenizer):
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you
     call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
 
-    
-
-    When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one).
-
-    
+    > [!TIP]
+    > When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one).
 
     This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
     this superclass for more information regarding those methods.
diff --git a/src/transformers/models/deberta/tokenization_deberta_fast.py b/src/transformers/models/deberta/tokenization_deberta_fast.py
index c2f2e6552d9d..5775169c91bd 100644
--- a/src/transformers/models/deberta/tokenization_deberta_fast.py
+++ b/src/transformers/models/deberta/tokenization_deberta_fast.py
@@ -49,11 +49,8 @@ class DebertaTokenizerFast(PreTrainedTokenizerFast):
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer, but since
     the model was not pretrained this way, it might yield a decrease in performance.
 
-    
-
-    When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.
-
-    
+    > [!TIP]
+    > When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.
 
     This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
     refer to this superclass for more information regarding those methods.
diff --git a/src/transformers/models/deit/modeling_deit.py b/src/transformers/models/deit/modeling_deit.py
index 8c9b7e89ecd8..74e8508dbc09 100644
--- a/src/transformers/models/deit/modeling_deit.py
+++ b/src/transformers/models/deit/modeling_deit.py
@@ -495,12 +495,9 @@ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
     custom_intro="""
     DeiT Model with a decoder on top for masked image modeling, as proposed in [SimMIM](https://huggingface.co/papers/2111.09886).
 
-    
-
-    Note that we provide a script to pre-train this model on custom data in our [examples
-    directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
-
-    
+    > [!TIP]
+    > Note that we provide a script to pre-train this model on custom data in our [examples
+    > directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
     """
 )
 class DeiTForMaskedImageModeling(DeiTPreTrainedModel):
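
For context on the tip being converted here, the masked-image-modeling call pattern looks roughly as follows; the checkpoint and the random mask are illustrative only:

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, DeiTForMaskedImageModeling

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("facebook/deit-base-distilled-patch16-224")
model = DeiTForMaskedImageModeling.from_pretrained("facebook/deit-base-distilled-patch16-224")

pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

# Randomly mask patches; a real pre-training run (see the examples directory) uses a proper masking strategy.
num_patches = (model.config.image_size // model.config.patch_size) ** 2
bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()

outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
loss, reconstruction = outputs.loss, outputs.reconstruction
```
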
diff --git a/src/transformers/models/deprecated/efficientformer/modeling_efficientformer.py b/src/transformers/models/deprecated/efficientformer/modeling_efficientformer.py
index 2167df912d87..f1901c65911b 100644
--- a/src/transformers/models/deprecated/efficientformer/modeling_efficientformer.py
+++ b/src/transformers/models/deprecated/efficientformer/modeling_efficientformer.py
@@ -704,12 +704,9 @@ class token).
     state of the [CLS] token and a linear layer on top of the final hidden state of the distillation token) e.g. for
     ImageNet.
 
-    
-
-           This model supports inference-only. Fine-tuning with distillation (i.e. with a teacher) is not yet
-           supported.
-
-    
+    > [!WARNING]
+    > This model supports inference-only. Fine-tuning with distillation (i.e. with a teacher) is not yet
+    > supported.
     """,
     EFFICIENTFORMER_START_DOCSTRING,
 )
diff --git a/src/transformers/models/deprecated/jukebox/tokenization_jukebox.py b/src/transformers/models/deprecated/jukebox/tokenization_jukebox.py
index 473d23d49565..825e550e6d8c 100644
--- a/src/transformers/models/deprecated/jukebox/tokenization_jukebox.py
+++ b/src/transformers/models/deprecated/jukebox/tokenization_jukebox.py
@@ -63,11 +63,8 @@ class JukeboxTokenizer(PreTrainedTokenizer):
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you
     call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
 
-    
-
-    If nothing is provided, the genres and the artist will either be selected randomly or set to None
-
-    
+    > [!TIP]
+    > If nothing is provided, the genres and the artist will either be selected randomly or set to None
 
     This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to:
     this superclass for more information regarding those methods.
diff --git a/src/transformers/models/deprecated/tapex/tokenization_tapex.py b/src/transformers/models/deprecated/tapex/tokenization_tapex.py
index fa74d8aa3b55..6c332e01cbc7 100644
--- a/src/transformers/models/deprecated/tapex/tokenization_tapex.py
+++ b/src/transformers/models/deprecated/tapex/tokenization_tapex.py
@@ -205,22 +205,16 @@ class TapexTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `"<s>"`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `"</s>"`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
@@ -678,11 +672,8 @@ def batch_encode_plus(
         **kwargs,
     ) -> BatchEncoding:
         """
-        
-
-        This method is deprecated, `__call__` should be used instead.
-
-        
+        > [!WARNING]
+        > This method is deprecated; `__call__` should be used instead.
         """
         # Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
         padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
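
Since `batch_encode_plus` is deprecated in favour of calling the tokenizer directly, a short sketch of the replacement usage; the checkpoint and table contents are illustrative:

```python
import pandas as pd
from transformers import TapexTokenizer

tokenizer = TapexTokenizer.from_pretrained("microsoft/tapex-base")

table = pd.DataFrame.from_dict({"year": [1896, 2012], "city": ["athens", "london"]})

# __call__ replaces the deprecated batch_encode_plus and handles padding and truncation itself.
encoding = tokenizer(table=table, query="where were the 2012 games held?", return_tensors="pt")
print(encoding["input_ids"].shape)
```
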
diff --git a/src/transformers/models/deprecated/tvlt/feature_extraction_tvlt.py b/src/transformers/models/deprecated/tvlt/feature_extraction_tvlt.py
index 3c65f4314616..66984515ceae 100644
--- a/src/transformers/models/deprecated/tvlt/feature_extraction_tvlt.py
+++ b/src/transformers/models/deprecated/tvlt/feature_extraction_tvlt.py
@@ -139,12 +139,9 @@ def __call__(
                 Whether to return the attention mask. If left to the default, will return the attention mask according
                 to the specific feature_extractor's default. [What are attention masks?](../glossary#attention-mask)
 
-                
-
-                For TvltTransformer models, `attention_mask` should always be passed for batched inference, to avoid
-                subtle bugs.
-
-                
+                > [!TIP]
+                > For TvltTransformer models, `attention_mask` should always be passed for batched inference, to avoid
+                > subtle bugs.
 
             sampling_rate (`int`, *optional*):
                 The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
diff --git a/src/transformers/models/deprecated/xlm_prophetnet/tokenization_xlm_prophetnet.py b/src/transformers/models/deprecated/xlm_prophetnet/tokenization_xlm_prophetnet.py
index 77431b13c49f..9defb6ab6cd4 100644
--- a/src/transformers/models/deprecated/xlm_prophetnet/tokenization_xlm_prophetnet.py
+++ b/src/transformers/models/deprecated/xlm_prophetnet/tokenization_xlm_prophetnet.py
@@ -54,22 +54,16 @@ class XLMProphetNetTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `"[SEP]"`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `"[SEP]"`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `"[SEP]"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/donut/modeling_donut_swin.py b/src/transformers/models/donut/modeling_donut_swin.py
index d388e386ae49..4b2ce8c9c3a5 100644
--- a/src/transformers/models/donut/modeling_donut_swin.py
+++ b/src/transformers/models/donut/modeling_donut_swin.py
@@ -919,13 +919,10 @@ def forward(
     DonutSwin Model transformer with an image classification head on top (a linear layer on top of the final hidden state of
     the [CLS] token) e.g. for ImageNet.
 
-    
-
-        Note that it's possible to fine-tune DonutSwin on higher resolution images than the ones it has been trained on, by
-        setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
-        position embeddings to the higher resolution.
-
-    
+    > [!TIP]
+    > Note that it's possible to fine-tune DonutSwin on higher resolution images than the ones it has been trained on, by
+    > setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
+    > position embeddings to the higher resolution.
     """
 )
 # Copied from transformers.models.swin.modeling_swin.SwinForImageClassification with Swin->DonutSwin,swin->donut
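
A minimal sketch of the higher-resolution tip above, run against the parent Swin classification model (from which this class is copied), with an illustrative checkpoint and a dummy image tensor:

```python
import torch
from transformers import SwinForImageClassification

# Illustrative checkpoint pre-trained at 224x224 resolution.
model = SwinForImageClassification.from_pretrained("microsoft/swin-tiny-patch4-window7-224")

# A dummy batch at a higher resolution than the pre-training size.
pixel_values = torch.randn(1, 3, 384, 384)

# interpolate_pos_encoding=True resizes any pre-trained position embeddings to the new grid on the fly.
with torch.no_grad():
    logits = model(pixel_values, interpolate_pos_encoding=True).logits
print(logits.shape)
```
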
diff --git a/src/transformers/models/dots1/modeling_dots1.py.bak b/src/transformers/models/dots1/modeling_dots1.py.bak
new file mode 100644
index 000000000000..26fdc9f76ce4
--- /dev/null
+++ b/src/transformers/models/dots1/modeling_dots1.py.bak
@@ -0,0 +1,614 @@
+#                🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
+#           This file was automatically generated from src/transformers/models/dots1/modular_dots1.py.
+#               Do NOT edit this file manually as any edits will be overwritten by the generation of
+#             the file from the modular. If any change should be done, please apply the change to the
+#                          modular_dots1.py file directly. One of our CI enforces this.
+#                🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
+# coding=utf-8
+# Copyright 2025 The rednote-hilab team and the HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import Callable, Optional, Union
+
+import torch
+import torch.nn.functional as F
+from torch import nn
+
+from ...activations import ACT2FN
+from ...cache_utils import Cache, DynamicCache
+from ...generation import GenerationMixin
+from ...integrations import use_kernel_forward_from_hub
+from ...masking_utils import create_causal_mask, create_sliding_window_causal_mask
+from ...modeling_flash_attention_utils import FlashAttentionKwargs
+from ...modeling_layers import GradientCheckpointingLayer
+from ...modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
+from ...modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update
+from ...modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
+from ...processing_utils import Unpack
+from ...utils import TransformersKwargs, auto_docstring, can_return_tuple
+from ...utils.generic import check_model_inputs
+from .configuration_dots1 import Dots1Config
+
+
+@use_kernel_forward_from_hub("RMSNorm")
+class Dots1RMSNorm(nn.Module):
+    def __init__(self, hidden_size, eps=1e-6):
+        """
+        Dots1RMSNorm is equivalent to T5LayerNorm
+        """
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(hidden_size))
+        self.variance_epsilon = eps
+
+    def forward(self, hidden_states):
+        input_dtype = hidden_states.dtype
+        hidden_states = hidden_states.to(torch.float32)
+        variance = hidden_states.pow(2).mean(-1, keepdim=True)
+        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+        return self.weight * hidden_states.to(input_dtype)
+
+    def extra_repr(self):
+        return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
+
+
+class Dots1RotaryEmbedding(nn.Module):
+    def __init__(self, config: Dots1Config, device=None):
+        super().__init__()
+        # BC: "rope_type" was originally "type"
+        if hasattr(config, "rope_scaling") and isinstance(config.rope_scaling, dict):
+            self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
+        else:
+            self.rope_type = "default"
+        self.max_seq_len_cached = config.max_position_embeddings
+        self.original_max_seq_len = config.max_position_embeddings
+
+        self.config = config
+        self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
+
+        inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        self.original_inv_freq = self.inv_freq
+
+    @torch.no_grad()
+    @dynamic_rope_update  # power user: used with advanced RoPE types (e.g. dynamic rope)
+    def forward(self, x, position_ids):
+        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
+        position_ids_expanded = position_ids[:, None, :].float()
+
+        device_type = x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu"
+        with torch.autocast(device_type=device_type, enabled=False):  # Force float32
+            freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
+            emb = torch.cat((freqs, freqs), dim=-1)
+            cos = emb.cos() * self.attention_scaling
+            sin = emb.sin() * self.attention_scaling
+
+        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
+
+
+def rotate_half(x):
+    """Rotates half the hidden dims of the input."""
+    x1 = x[..., : x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2 :]
+    return torch.cat((-x2, x1), dim=-1)
+
+
+def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
+    """Applies Rotary Position Embedding to the query and key tensors.
+
+    Args:
+        q (`torch.Tensor`): The query tensor.
+        k (`torch.Tensor`): The key tensor.
+        cos (`torch.Tensor`): The cosine part of the rotary embedding.
+        sin (`torch.Tensor`): The sine part of the rotary embedding.
+        position_ids (`torch.Tensor`, *optional*):
+            Deprecated and unused.
+        unsqueeze_dim (`int`, *optional*, defaults to 1):
+            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
+            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
+            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
+            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
+            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
+            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
+    Returns:
+        `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
+    """
+    cos = cos.unsqueeze(unsqueeze_dim)
+    sin = sin.unsqueeze(unsqueeze_dim)
+    q_embed = (q * cos) + (rotate_half(q) * sin)
+    k_embed = (k * cos) + (rotate_half(k) * sin)
+    return q_embed, k_embed
+
+
+def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+    """
+    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
+    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
+    """
+    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+    if n_rep == 1:
+        return hidden_states
+    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
+
+
+def eager_attention_forward(
+    module: nn.Module,
+    query: torch.Tensor,
+    key: torch.Tensor,
+    value: torch.Tensor,
+    attention_mask: Optional[torch.Tensor],
+    scaling: float,
+    dropout: float = 0.0,
+    **kwargs: Unpack[TransformersKwargs],
+):
+    key_states = repeat_kv(key, module.num_key_value_groups)
+    value_states = repeat_kv(value, module.num_key_value_groups)
+
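+    # attention scores have shape (batch, num_heads, q_len, kv_len); `scaling` is the 1/sqrt(head_dim) factor
+    # and the causal mask (when provided) is applied as an additive bias on masked positions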
+    attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
+    if attention_mask is not None:
+        causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
+        attn_weights = attn_weights + causal_mask
+
+    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
+    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
+    attn_output = torch.matmul(attn_weights, value_states)
+    attn_output = attn_output.transpose(1, 2).contiguous()
+
+    return attn_output, attn_weights
+
+
+class Dots1Attention(nn.Module):
+    """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+    def __init__(self, config: Dots1Config, layer_idx: int):
+        super().__init__()
+        self.config = config
+        self.layer_idx = layer_idx
+        self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
+        self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
+        self.scaling = self.head_dim**-0.5
+        self.attention_dropout = config.attention_dropout
+        self.is_causal = True
+
+        self.q_proj = nn.Linear(
+            config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
+        )
+        self.k_proj = nn.Linear(
+            config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
+        )
+        self.v_proj = nn.Linear(
+            config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
+        )
+        self.o_proj = nn.Linear(
+            config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias
+        )
+        self.q_norm = Dots1RMSNorm(self.head_dim, eps=config.rms_norm_eps)  # unlike olmo, only on the head dim!
+        self.k_norm = Dots1RMSNorm(self.head_dim, eps=config.rms_norm_eps)  # thus post q_norm does not need reshape
+        self.sliding_window = config.sliding_window if config.layer_types[layer_idx] == "sliding_attention" else None
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        position_embeddings: tuple[torch.Tensor, torch.Tensor],
+        attention_mask: Optional[torch.Tensor],
+        past_key_value: Optional[Cache] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        **kwargs: Unpack[FlashAttentionKwargs],
+    ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:
+        input_shape = hidden_states.shape[:-1]
+        hidden_shape = (*input_shape, -1, self.head_dim)
+
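+        # project, reshape to (batch, seq_len, num_heads, head_dim), then move the head dim in front of the sequence dim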
+        query_states = self.q_norm(self.q_proj(hidden_states).view(hidden_shape)).transpose(1, 2)
+        key_states = self.k_norm(self.k_proj(hidden_states).view(hidden_shape)).transpose(1, 2)
+        value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+
+        cos, sin = position_embeddings
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+
+        if past_key_value is not None:
+            # sin and cos are specific to RoPE models; cache_position needed for the static cache
+            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+        attention_interface: Callable = eager_attention_forward
+        if self.config._attn_implementation != "eager":
+            attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
+
+        attn_output, attn_weights = attention_interface(
+            self,
+            query_states,
+            key_states,
+            value_states,
+            attention_mask,
+            dropout=0.0 if not self.training else self.attention_dropout,
+            scaling=self.scaling,
+            sliding_window=self.sliding_window,  # diff with Llama
+            **kwargs,
+        )
+
+        attn_output = attn_output.reshape(*input_shape, -1).contiguous()
+        attn_output = self.o_proj(attn_output)
+        return attn_output, attn_weights
+
+
+class Dots1MLP(nn.Module):
+    def __init__(self, config, hidden_size=None, intermediate_size=None):
+        super().__init__()
+        self.config = config
+        self.hidden_size = config.hidden_size if hidden_size is None else hidden_size
+        self.intermediate_size = config.intermediate_size if intermediate_size is None else intermediate_size
+
+        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
+        self.act_fn = ACT2FN[config.hidden_act]
+
+    def forward(self, x):
+        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+        return down_proj
+
+
+class Dots1MoE(nn.Module):
+    """
+    A mixed expert module containing shared experts.
+    """
+
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.experts = nn.ModuleList(
+            [Dots1MLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(config.n_routed_experts)]
+        )
+        self.gate = Dots1TopkRouter(config)
+        self.shared_experts = Dots1MLP(
+            config=config, intermediate_size=config.moe_intermediate_size * config.n_shared_experts
+        )
+
+    def moe(self, hidden_states: torch.Tensor, topk_indices: torch.Tensor, topk_weights: torch.Tensor):
+        r"""
+        CALL FOR CONTRIBUTION! This loop has not been optimised yet: the expert weights need to be fused so that
+        we do not have to loop over every expert here (DeepSeek has 256 routed experts).
+        """
+        final_hidden_states = torch.zeros_like(hidden_states, dtype=topk_weights.dtype)
+        expert_mask = torch.nn.functional.one_hot(topk_indices, num_classes=len(self.experts))
+        expert_mask = expert_mask.permute(2, 0, 1)
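+        # expert_mask now has shape (num_experts, num_tokens, top_k):
+        # entry (e, t, k) is 1 if expert e is token t's k-th routed expert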
+
+        for expert_idx in range(len(self.experts)):
+            expert = self.experts[expert_idx]
+            mask = expert_mask[expert_idx]
+            token_indices, weight_indices = torch.where(mask)
+
+            if token_indices.numel() > 0:
+                expert_weights = topk_weights[token_indices, weight_indices]
+                expert_input = hidden_states[token_indices]
+                expert_output = expert(expert_input)
+                weighted_output = expert_output * expert_weights.unsqueeze(-1)
+                final_hidden_states.index_add_(0, token_indices, weighted_output)
+
+        # in the original deepseek, the outputs of the experts are gathered once we leave this module
+        # thus the moe module is itself an IsolatedParallel module
+        # and all experts are "local", meaning we shard but we don't gather
+        return final_hidden_states.type(hidden_states.dtype)
+
+    def forward(self, hidden_states):
+        residuals = hidden_states
+        orig_shape = hidden_states.shape
+        topk_indices, topk_weights = self.gate(hidden_states)
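+        # the router flattens the tokens internally: both outputs have shape (batch_size * seq_len, num_experts_per_tok)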
+        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
+        hidden_states = self.moe(hidden_states, topk_indices, topk_weights).view(*orig_shape)
+        hidden_states = hidden_states + self.shared_experts(residuals)
+        return hidden_states
+
+
+class Dots1TopkRouter(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.top_k = config.num_experts_per_tok
+        self.n_routed_experts = config.n_routed_experts
+        self.routed_scaling_factor = config.routed_scaling_factor
+        self.n_group = config.n_group
+        self.topk_group = config.topk_group
+        self.norm_topk_prob = config.norm_topk_prob
+
+        self.weight = nn.Parameter(torch.empty((self.n_routed_experts, config.hidden_size)))
+        self.register_buffer("e_score_correction_bias", torch.zeros(self.n_routed_experts))
+
+    @torch.no_grad()
+    def get_topk_indices(self, scores):
+        scores_for_choice = scores.view(-1, self.n_routed_experts) + self.e_score_correction_bias.unsqueeze(0)
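+        # the expert-score correction bias only influences which experts are selected;
+        # the weights returned by `forward` are gathered from the unbiased sigmoid scores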
+        group_scores = (
+            scores_for_choice.view(-1, self.n_group, self.n_routed_experts // self.n_group)
+            .topk(2, dim=-1)[0]
+            .sum(dim=-1)
+        )
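+        # each group of experts is scored by the sum of its top-2 (bias-corrected) expert scores,
+        # and only the best `topk_group` groups are kept below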
+        group_idx = torch.topk(group_scores, k=self.topk_group, dim=-1, sorted=False)[1]
+        group_mask = torch.zeros_like(group_scores)
+        group_mask.scatter_(1, group_idx, 1)
+        score_mask = (
+            group_mask.unsqueeze(-1)
+            .expand(-1, self.n_group, self.n_routed_experts // self.n_group)
+            .reshape(-1, self.n_routed_experts)
+        )
+        scores_for_choice = scores_for_choice.masked_fill(~score_mask.bool(), 0.0)
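+        # experts outside the selected groups are masked to 0 so the final top-k is effectively restricted to the chosen groups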
+        topk_indices = torch.topk(scores_for_choice, k=self.top_k, dim=-1, sorted=False)[1]
+        return topk_indices
+
+    def forward(self, hidden_states):
+        hidden_states = hidden_states.view(-1, self.config.hidden_size)
+        router_logits = F.linear(hidden_states.type(torch.float32), self.weight.type(torch.float32))
+        scores = router_logits.sigmoid()
+        topk_indices = self.get_topk_indices(scores)
+        topk_weights = scores.gather(1, topk_indices)
+        if self.norm_topk_prob:
+            denominator = topk_weights.sum(dim=-1, keepdim=True) + 1e-20
+            topk_weights /= denominator
+        topk_weights = topk_weights * self.routed_scaling_factor
+        return topk_indices, topk_weights
+
+
+class Dots1DecoderLayer(GradientCheckpointingLayer):
+    def __init__(self, config: Dots1Config, layer_idx: int):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+
+        self.self_attn = Dots1Attention(config=config, layer_idx=layer_idx)
+
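+        # the first `first_k_dense_replace` layers use a dense MLP, the remaining layers use the MoE block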
+        if layer_idx >= config.first_k_dense_replace:
+            self.mlp = Dots1MoE(config)
+        else:
+            self.mlp = Dots1MLP(config)
+
+        self.input_layernorm = Dots1RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_attention_layernorm = Dots1RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.attention_type = config.layer_types[layer_idx]
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Cache] = None,
+        use_cache: Optional[bool] = False,
+        cache_position: Optional[torch.LongTensor] = None,
+        position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,  # necessary, but kept here for BC
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> tuple[torch.Tensor]:
+        residual = hidden_states
+        hidden_states = self.input_layernorm(hidden_states)
+        # Self Attention
+        hidden_states, _ = self.self_attn(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_value=past_key_value,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            position_embeddings=position_embeddings,
+            **kwargs,
+        )
+        hidden_states = residual + hidden_states
+
+        # Fully Connected
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + hidden_states
+        return hidden_states
+
+
+@auto_docstring
+class Dots1PreTrainedModel(PreTrainedModel):
+    config: Dots1Config
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["Dots1DecoderLayer"]
+    _skip_keys_device_placement = ["past_key_values"]
+    _supports_flash_attn = True
+    _supports_sdpa = True
+    _supports_flex_attn = True
+
+    _can_compile_fullgraph = True
+    _supports_attention_backend = True
+    _can_record_outputs = {
+        "hidden_states": Dots1DecoderLayer,
+        "attentions": Dots1Attention,
+    }
+
+    def _init_weights(self, module):
+        super()._init_weights(module)
+        if isinstance(module, Dots1TopkRouter):
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+
+
+@auto_docstring
+class Dots1Model(Dots1PreTrainedModel):
+    def __init__(self, config: Dots1Config):
+        super().__init__(config)
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+        self.layers = nn.ModuleList(
+            [Dots1DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+        )
+        self.norm = Dots1RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.rotary_emb = Dots1RotaryEmbedding(config=config)
+        self.gradient_checkpointing = False
+        self.has_sliding_layers = "sliding_attention" in self.config.layer_types
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    @check_model_inputs
+    @auto_docstring
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Cache] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        use_cache: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> BaseModelOutputWithPast:
+        if (input_ids is None) ^ (inputs_embeds is not None):
+            raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+
+        if inputs_embeds is None:
+            inputs_embeds = self.embed_tokens(input_ids)
+
+        if use_cache and past_key_values is None:
+            past_key_values = DynamicCache()
+
+        if cache_position is None:
+            past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+            cache_position = torch.arange(
+                past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
+            )
+
+        if position_ids is None:
+            position_ids = cache_position.unsqueeze(0)
+
+        # It may already have been prepared by e.g. `generate`
+        if not isinstance(causal_mask_mapping := attention_mask, dict):
+            # Prepare mask arguments
+            mask_kwargs = {
+                "config": self.config,
+                "input_embeds": inputs_embeds,
+                "attention_mask": attention_mask,
+                "cache_position": cache_position,
+                "past_key_values": past_key_values,
+                "position_ids": position_ids,
+            }
+            # Create the masks
+            causal_mask_mapping = {
+                "full_attention": create_causal_mask(**mask_kwargs),
+            }
+            # The sliding window alternating layers are not always activated depending on the config
+            if self.has_sliding_layers:
+                causal_mask_mapping["sliding_attention"] = create_sliding_window_causal_mask(**mask_kwargs)
+
+        hidden_states = inputs_embeds
+
+        # create position embeddings to be shared across the decoder layers
+        position_embeddings = self.rotary_emb(hidden_states, position_ids)
+
+        for decoder_layer in self.layers[: self.config.num_hidden_layers]:
+            hidden_states = decoder_layer(
+                hidden_states,
+                attention_mask=causal_mask_mapping[decoder_layer.attention_type],
+                position_ids=position_ids,
+                past_key_value=past_key_values,
+                use_cache=use_cache,
+                cache_position=cache_position,
+                position_embeddings=position_embeddings,
+                **kwargs,
+            )
+
+        hidden_states = self.norm(hidden_states)
+        return BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=past_key_values if use_cache else None,
+        )
+
+
+@auto_docstring
+class Dots1ForCausalLM(Dots1PreTrainedModel, GenerationMixin):
+    _tied_weights_keys = ["lm_head.weight"]
+    _tp_plan = {"lm_head": "colwise_rep"}
+    _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
+
+    def __init__(self, config):
+        super().__init__(config)
+        self.model = Dots1Model(config)
+        self.vocab_size = config.vocab_size
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def set_decoder(self, decoder):
+        self.model = decoder
+
+    def get_decoder(self):
+        return self.model
+
+    @can_return_tuple
+    @auto_docstring
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Cache] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        logits_to_keep: Union[int, torch.Tensor] = 0,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> CausalLMOutputWithPast:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
+            config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+            (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+
+        Example:
+
+        ```python
+        >>> from transformers import AutoTokenizer, Dots1ForCausalLM
+
+        >>> model = Dots1ForCausalLM.from_pretrained("rednote-hilab/dots.llm1.inst")
+        >>> tokenizer = AutoTokenizer.from_pretrained("rednote-hilab/dots.llm1.inst")
+
+        >>> prompt = "Hey, are you conscious? Can you talk to me?"
+        >>> inputs = tokenizer(prompt, return_tensors="pt")
+
+        >>> # Generate
+        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
+        ```"""
+        outputs: BaseModelOutputWithPast = self.model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            **kwargs,
+        )
+
+        hidden_states = outputs.last_hidden_state
+        # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
+        slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
+        logits = self.lm_head(hidden_states[:, slice_indices, :])
+
+        loss = None
+        if labels is not None:
+            loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
+
+        return CausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=outputs.past_key_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+
+
+__all__ = ["Dots1PreTrainedModel", "Dots1Model", "Dots1ForCausalLM"]
diff --git a/src/transformers/models/encodec/modeling_encodec.py b/src/transformers/models/encodec/modeling_encodec.py
index c3c32f5bd61d..c99801e5d382 100644
--- a/src/transformers/models/encodec/modeling_encodec.py
+++ b/src/transformers/models/encodec/modeling_encodec.py
@@ -737,13 +737,10 @@ def forward(
             - 1 for tokens that are **not masked**,
             - 0 for tokens that are **masked**.
 
-            
-
-            `padding_mask` should always be passed, unless the input was truncated or not padded. This is because in
-            order to process tensors effectively, the input audio should be padded so that `input_length % stride =
-            step` with `step = chunk_length-stride`. This ensures that all chunks are of the same shape
-
-            
+            > [!WARNING]
+            > `padding_mask` should always be passed, unless the input was truncated or not padded. This is because in
+            > order to process tensors effectively, the input audio should be padded so that `input_length % stride =
+            > step` with `step = chunk_length-stride`. This ensures that all chunks are of the same shape.
         bandwidth (`float`, *optional*):
             The target bandwidth. Must be one of `config.target_bandwidths`. If `None`, uses the smallest possible
             bandwidth. bandwidth is represented as a thousandth of what it is, e.g. 6kbps bandwidth is represented as
diff --git a/src/transformers/models/flaubert/modeling_flaubert.py b/src/transformers/models/flaubert/modeling_flaubert.py
index 5812aa457cbc..2bfbc5d90307 100644
--- a/src/transformers/models/flaubert/modeling_flaubert.py
+++ b/src/transformers/models/flaubert/modeling_flaubert.py
@@ -355,12 +355,9 @@ def forward(
                 Mask for tokens at invalid position, such as query and special symbols (PAD, SEP, CLS). 1.0 means token
                 should be masked.
 
-        
-
-        One of `start_states` or `start_positions` should be not `None`. If both are set, `start_positions` overrides
-        `start_states`.
-
-        
+        > [!TIP]
+        > One of `start_states` or `start_positions` should not be `None`. If both are set, `start_positions` overrides
+        > `start_states`.
 
         Returns:
             `torch.FloatTensor`: The end logits for SQuAD.
@@ -422,12 +419,9 @@ def forward(
             cls_index (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
                 Position of the CLS token for each sentence in the batch. If `None`, takes the last token.
 
-        
-
-        One of `start_states` or `start_positions` should be not `None`. If both are set, `start_positions` overrides
-        `start_states`.
-
-        
+        > [!TIP]
+        > One of `start_states` or `start_positions` should not be `None`. If both are set, `start_positions` overrides
+        > `start_states`.
 
         Returns:
             `torch.FloatTensor`: The SQuAD 2.0 answer class.
diff --git a/src/transformers/models/flaubert/tokenization_flaubert.py b/src/transformers/models/flaubert/tokenization_flaubert.py
index dee653450eba..f3cb98697ddb 100644
--- a/src/transformers/models/flaubert/tokenization_flaubert.py
+++ b/src/transformers/models/flaubert/tokenization_flaubert.py
@@ -146,12 +146,9 @@ class FlaubertTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `"<s>"`):
             The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         sep_token (`str`, *optional*, defaults to `"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/focalnet/modeling_focalnet.py b/src/transformers/models/focalnet/modeling_focalnet.py
index 9b5d4daed70c..98e340f925bf 100644
--- a/src/transformers/models/focalnet/modeling_focalnet.py
+++ b/src/transformers/models/focalnet/modeling_focalnet.py
@@ -681,12 +681,9 @@ def forward(
 
     This follows the same implementation as in [SimMIM](https://huggingface.co/papers/2111.09886).
 
-    
-
-    Note that we provide a script to pre-train this model on custom data in our [examples
-    directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
-
-    
+    > [!TIP]
+    > Note that we provide a script to pre-train this model on custom data in our [examples
+    > directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
     """
 )
 class FocalNetForMaskedImageModeling(FocalNetPreTrainedModel):
diff --git a/src/transformers/models/fsmt/tokenization_fsmt.py b/src/transformers/models/fsmt/tokenization_fsmt.py
index 5a4446d8e90b..c30a411797e4 100644
--- a/src/transformers/models/fsmt/tokenization_fsmt.py
+++ b/src/transformers/models/fsmt/tokenization_fsmt.py
@@ -141,12 +141,9 @@ class FSMTTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `"<s>"`):
             The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         sep_token (`str`, *optional*, defaults to `"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/gpt2/tokenization_gpt2.py b/src/transformers/models/gpt2/tokenization_gpt2.py
index 608164ef2d83..e4bcaa05824b 100644
--- a/src/transformers/models/gpt2/tokenization_gpt2.py
+++ b/src/transformers/models/gpt2/tokenization_gpt2.py
@@ -93,11 +93,8 @@ class GPT2Tokenizer(PreTrainedTokenizer):
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you
     call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
 
-    
-
-    When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one).
-
-    
+    > [!TIP]
+    > When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one).
 
     This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
     this superclass for more information regarding those methods.
diff --git a/src/transformers/models/gpt2/tokenization_gpt2_fast.py b/src/transformers/models/gpt2/tokenization_gpt2_fast.py
index f81c155e8644..6f49b429e2ed 100644
--- a/src/transformers/models/gpt2/tokenization_gpt2_fast.py
+++ b/src/transformers/models/gpt2/tokenization_gpt2_fast.py
@@ -49,11 +49,8 @@ class GPT2TokenizerFast(PreTrainedTokenizerFast):
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer, but since
     the model was not pretrained this way, it might yield a decrease in performance.
 
-    
-
-    When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.
-
-    
+    > [!TIP]
+    > When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.
 
     This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
     refer to this superclass for more information regarding those methods.
diff --git a/src/transformers/models/gpt_neox/tokenization_gpt_neox_fast.py b/src/transformers/models/gpt_neox/tokenization_gpt_neox_fast.py
index a3b190a60eb1..46d93916ccc5 100644
--- a/src/transformers/models/gpt_neox/tokenization_gpt_neox_fast.py
+++ b/src/transformers/models/gpt_neox/tokenization_gpt_neox_fast.py
@@ -49,11 +49,8 @@ class GPTNeoXTokenizerFast(PreTrainedTokenizerFast):
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer, but since
     the model was not pretrained this way, it might yield a decrease in performance.
 
-    
-
-    When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.
-
-    
+    > [!TIP]
+    > When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.
 
     This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
     refer to this superclass for more information regarding those methods.
diff --git a/src/transformers/models/groupvit/modeling_groupvit.py b/src/transformers/models/groupvit/modeling_groupvit.py
index 598845750da2..a38673a31677 100644
--- a/src/transformers/models/groupvit/modeling_groupvit.py
+++ b/src/transformers/models/groupvit/modeling_groupvit.py
@@ -272,13 +272,10 @@ class GroupViTModelOutput(ModelOutput):
     segmentation_logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels, logits_height, logits_width)`):
         Classification scores for each pixel.
 
-        
-
-        The logits returned do not necessarily have the same size as the `pixel_values` passed as inputs. This is
-        to avoid doing two interpolations and lose some quality when a user needs to resize the logits to the
-        original image size as post-processing. You should always check your logits shape and resize as needed.
-
-        
+        > [!WARNING]
+        > The logits returned do not necessarily have the same size as the `pixel_values` passed as inputs. This is
+        > to avoid doing two interpolations and lose some quality when a user needs to resize the logits to the
+        > original image size as post-processing. You should always check your logits shape and resize as needed.
     text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`):
         The text embeddings obtained by applying the projection layer to the pooled output of
         [`GroupViTTextModel`].
diff --git a/src/transformers/models/hiera/modeling_hiera.py b/src/transformers/models/hiera/modeling_hiera.py
index 499c0b454600..eddb542f75c4 100644
--- a/src/transformers/models/hiera/modeling_hiera.py
+++ b/src/transformers/models/hiera/modeling_hiera.py
@@ -1080,12 +1080,9 @@ def forward(self, feature_maps: list[torch.Tensor]) -> torch.Tensor:
     custom_intro="""
     The Hiera Model transformer with the decoder on top for self-supervised pre-training.
 
-    
-
-    Note that we provide a script to pre-train this model on custom data in our [examples
-    directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
-
-    
+    > [!TIP]
+    > Note that we provide a script to pre-train this model on custom data in our [examples
+    > directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
     """
 )
 class HieraForPreTraining(HieraPreTrainedModel):
@@ -1222,13 +1219,10 @@ def forward(
     Hiera Model transformer with an image classification head on top (a linear layer on top of the final hidden state with
     average pooling) e.g. for ImageNet.
 
-    
-
-        Note that it's possible to fine-tune Hiera on higher resolution images than the ones it has been trained on, by
-        setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
-        position embeddings to the higher resolution.
-
-    
+    > [!TIP]
+    > Note that it's possible to fine-tune Hiera on higher resolution images than the ones it has been trained on, by
+    > setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
+    > position embeddings to the higher resolution.
     """
 )
 class HieraForImageClassification(HieraPreTrainedModel):
diff --git a/src/transformers/models/ijepa/modeling_ijepa.py b/src/transformers/models/ijepa/modeling_ijepa.py
index 2a15c40da4d3..bdb9906beaad 100644
--- a/src/transformers/models/ijepa/modeling_ijepa.py
+++ b/src/transformers/models/ijepa/modeling_ijepa.py
@@ -466,13 +466,10 @@ def forward(
     IJepa Model transformer with an image classification head on top (a linear layer on top of the final hidden states)
     e.g. for ImageNet.
 
-    
-
-        Note that it's possible to fine-tune IJepa on higher resolution images than the ones it has been trained on, by
-        setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
-        position embeddings to the higher resolution.
-
-    
+    > [!TIP]
+    > Note that it's possible to fine-tune IJepa on higher resolution images than the ones it has been trained on, by
+    > setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
+    > position embeddings to the higher resolution.
     """
 )
 class IJepaForImageClassification(IJepaPreTrainedModel):
diff --git a/src/transformers/models/ijepa/modular_ijepa.py b/src/transformers/models/ijepa/modular_ijepa.py
index b37bc41d13bf..9fb579660c06 100644
--- a/src/transformers/models/ijepa/modular_ijepa.py
+++ b/src/transformers/models/ijepa/modular_ijepa.py
@@ -128,13 +128,10 @@ def __init__(self, config: IJepaConfig, add_pooling_layer: bool = False, use_mas
     IJepa Model transformer with an image classification head on top (a linear layer on top of the final hidden states)
     e.g. for ImageNet.
 
-    
-
-        Note that it's possible to fine-tune IJepa on higher resolution images than the ones it has been trained on, by
-        setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
-        position embeddings to the higher resolution.
-
-    
+    > [!TIP]
+    > Note that it's possible to fine-tune IJepa on higher resolution images than the ones it has been trained on, by
+    > setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
+    > position embeddings to the higher resolution.
     """
 )
 class IJepaForImageClassification(IJepaPreTrainedModel, ViTForImageClassification):
diff --git a/src/transformers/models/layoutlmv3/tokenization_layoutlmv3.py b/src/transformers/models/layoutlmv3/tokenization_layoutlmv3.py
index fdf95a34d58d..c6afd35a251c 100644
--- a/src/transformers/models/layoutlmv3/tokenization_layoutlmv3.py
+++ b/src/transformers/models/layoutlmv3/tokenization_layoutlmv3.py
@@ -203,22 +203,16 @@ class LayoutLMv3Tokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `"<s>"`):
             The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `"</s>"`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/layoutlmv3/tokenization_layoutlmv3_fast.py b/src/transformers/models/layoutlmv3/tokenization_layoutlmv3_fast.py
index d0407638595d..7e4b93deb24e 100644
--- a/src/transformers/models/layoutlmv3/tokenization_layoutlmv3_fast.py
+++ b/src/transformers/models/layoutlmv3/tokenization_layoutlmv3_fast.py
@@ -64,22 +64,16 @@ class LayoutLMv3TokenizerFast(PreTrainedTokenizerFast):
         bos_token (`str`, *optional*, defaults to `"<s>"`):
             The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `"</s>"`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/layoutxlm/tokenization_layoutxlm.py b/src/transformers/models/layoutxlm/tokenization_layoutxlm.py
index 9c1d5c05a9f9..b39bcb3ad5f1 100644
--- a/src/transformers/models/layoutxlm/tokenization_layoutxlm.py
+++ b/src/transformers/models/layoutxlm/tokenization_layoutxlm.py
@@ -158,22 +158,16 @@ class LayoutXLMTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `"<s>"`):
             The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `"</s>"`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/layoutxlm/tokenization_layoutxlm_fast.py b/src/transformers/models/layoutxlm/tokenization_layoutxlm_fast.py
index 7b08a3aa5f0e..d793d30f6304 100644
--- a/src/transformers/models/layoutxlm/tokenization_layoutxlm_fast.py
+++ b/src/transformers/models/layoutxlm/tokenization_layoutxlm_fast.py
@@ -160,22 +160,16 @@ class LayoutXLMTokenizerFast(PreTrainedTokenizerFast):
         bos_token (`str`, *optional*, defaults to `"<s>"`):
             The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `"</s>"`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/led/tokenization_led.py b/src/transformers/models/led/tokenization_led.py
index d110ac30d969..324a67a017d4 100644
--- a/src/transformers/models/led/tokenization_led.py
+++ b/src/transformers/models/led/tokenization_led.py
@@ -96,11 +96,8 @@ class LEDTokenizer(PreTrainedTokenizer):
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you
     call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
 
-    
-
-    When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one).
-
-    
+    > [!TIP]
+    > When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one).
 
     This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
     this superclass for more information regarding those methods.
@@ -116,22 +113,16 @@ class LEDTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `"<s>"`):
             The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `"</s>"`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/led/tokenization_led_fast.py b/src/transformers/models/led/tokenization_led_fast.py
index baea10f23516..151d6d22da7c 100644
--- a/src/transformers/models/led/tokenization_led_fast.py
+++ b/src/transformers/models/led/tokenization_led_fast.py
@@ -53,11 +53,8 @@ class LEDTokenizerFast(PreTrainedTokenizerFast):
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you
     call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
 
-    
-
-    When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.
-
-    
+    > [!TIP]
+    > When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.
 
     This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
     refer to this superclass for more information regarding those methods.
@@ -73,22 +70,16 @@ class LEDTokenizerFast(PreTrainedTokenizerFast):
         bos_token (`str`, *optional*, defaults to `"<s>"`):
             The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `"</s>"`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/longformer/tokenization_longformer.py b/src/transformers/models/longformer/tokenization_longformer.py
index 104bdd7a9b99..2b1be5a69d2b 100644
--- a/src/transformers/models/longformer/tokenization_longformer.py
+++ b/src/transformers/models/longformer/tokenization_longformer.py
@@ -93,11 +93,8 @@ class LongformerTokenizer(PreTrainedTokenizer):
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you
     call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
 
-    
-
-    When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one).
-
-    
+    > [!TIP]
+    > When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one).
 
     This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
     this superclass for more information regarding those methods.
@@ -113,22 +110,16 @@ class LongformerTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `"<s>"`):
             The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `"</s>"`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/longformer/tokenization_longformer_fast.py b/src/transformers/models/longformer/tokenization_longformer_fast.py
index bde6bb55fec6..e2872c704df7 100644
--- a/src/transformers/models/longformer/tokenization_longformer_fast.py
+++ b/src/transformers/models/longformer/tokenization_longformer_fast.py
@@ -53,11 +53,8 @@ class LongformerTokenizerFast(PreTrainedTokenizerFast):
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you
     call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
 
-    
-
-    When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.
-
-    
+    > [!TIP]
+    > When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.
 
     This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
     refer to this superclass for more information regarding those methods.
@@ -73,22 +70,16 @@ class LongformerTokenizerFast(PreTrainedTokenizerFast):
         bos_token (`str`, *optional*, defaults to `"<s>"`):
             The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `"</s>"`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/luke/tokenization_luke.py b/src/transformers/models/luke/tokenization_luke.py
index 4bb19bb5ee73..4b051900f959 100644
--- a/src/transformers/models/luke/tokenization_luke.py
+++ b/src/transformers/models/luke/tokenization_luke.py
@@ -192,11 +192,8 @@ class LukeTokenizer(PreTrainedTokenizer):
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you
     call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
 
-    
-
-    When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one).
-
-    
+    > [!TIP]
+    > When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one).
 
     This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
     this superclass for more information regarding those methods. It also creates entity sequences, namely
@@ -230,22 +227,16 @@ class LukeTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `"<s>"`):
             The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `"</s>"`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/markuplm/tokenization_markuplm.py b/src/transformers/models/markuplm/tokenization_markuplm.py
index 0a6f7c3bd6a0..46c4d69620b6 100644
--- a/src/transformers/models/markuplm/tokenization_markuplm.py
+++ b/src/transformers/models/markuplm/tokenization_markuplm.py
@@ -142,22 +142,16 @@ class MarkupLMTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `"<s>"`):
             The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `"</s>"`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/markuplm/tokenization_markuplm_fast.py b/src/transformers/models/markuplm/tokenization_markuplm_fast.py
index 4033ef319ff8..bd2cc823120d 100644
--- a/src/transformers/models/markuplm/tokenization_markuplm_fast.py
+++ b/src/transformers/models/markuplm/tokenization_markuplm_fast.py
@@ -101,22 +101,16 @@ class MarkupLMTokenizerFast(PreTrainedTokenizerFast):
         bos_token (`str`, *optional*, defaults to `"<s>"`):
             The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `"</s>"`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/mluke/tokenization_mluke.py b/src/transformers/models/mluke/tokenization_mluke.py
index d63129c7b7e4..454e5e6f22b1 100644
--- a/src/transformers/models/mluke/tokenization_mluke.py
+++ b/src/transformers/models/mluke/tokenization_mluke.py
@@ -146,22 +146,16 @@ class MLukeTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/mpnet/tokenization_mpnet.py b/src/transformers/models/mpnet/tokenization_mpnet.py
index bf035cf8e4bd..21f92bb3891f 100644
--- a/src/transformers/models/mpnet/tokenization_mpnet.py
+++ b/src/transformers/models/mpnet/tokenization_mpnet.py
@@ -68,22 +68,16 @@ class MPNetTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/mpnet/tokenization_mpnet_fast.py b/src/transformers/models/mpnet/tokenization_mpnet_fast.py
index 1a470565a845..d5854fbb2c9d 100644
--- a/src/transformers/models/mpnet/tokenization_mpnet_fast.py
+++ b/src/transformers/models/mpnet/tokenization_mpnet_fast.py
@@ -46,22 +46,16 @@ class MPNetTokenizerFast(PreTrainedTokenizerFast):
         bos_token (`str`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/musicgen/modeling_musicgen.py b/src/transformers/models/musicgen/modeling_musicgen.py
index 7326ede89e71..49a20294a814 100644
--- a/src/transformers/models/musicgen/modeling_musicgen.py
+++ b/src/transformers/models/musicgen/modeling_musicgen.py
@@ -488,16 +488,13 @@ def forward(
 
             [What are input IDs?](../glossary#input-ids)
 
-            
-
-            The `input_ids` will automatically be converted from shape `(batch_size * num_codebooks,
-            target_sequence_length)` to `(batch_size, num_codebooks, target_sequence_length)` in the forward pass. If
-            you obtain audio codes from an audio encoding model, such as [`EncodecModel`], ensure that the number of
-            frames is equal to 1, and that you reshape the audio codes from `(frames, batch_size, num_codebooks,
-            target_sequence_length)` to `(batch_size * num_codebooks, target_sequence_length)` prior to passing them as
-            `input_ids`.
-
-            
+            > [!WARNING]
+            > The `input_ids` will automatically be converted from shape `(batch_size * num_codebooks,
+            > target_sequence_length)` to `(batch_size, num_codebooks, target_sequence_length)` in the forward pass. If
+            > you obtain audio codes from an audio encoding model, such as [`EncodecModel`], ensure that the number of
+            > frames is equal to 1, and that you reshape the audio codes from `(frames, batch_size, num_codebooks,
+            > target_sequence_length)` to `(batch_size * num_codebooks, target_sequence_length)` prior to passing them as
+            > `input_ids`.
         encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, encoder_sequence_length, hidden_size)`, *optional*):
             Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of
             the decoder.
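To make the reshape described in the warning above concrete, here is a minimal sketch; the tensor sizes are made up for illustration, and only the final `(batch_size * num_codebooks, target_sequence_length)` layout matters.

```python
import torch

# Illustrative sizes only: one frame of Encodec-style audio codes.
frames, batch_size, num_codebooks, seq_len = 1, 2, 4, 125
audio_codes = torch.randint(0, 2048, (frames, batch_size, num_codebooks, seq_len))

# Drop the singleton frames dimension and flatten the batch and codebook dimensions,
# giving the `(batch_size * num_codebooks, target_sequence_length)` layout the decoder
# expects as `input_ids`; the forward pass then splits it back internally.
input_ids = audio_codes[0].reshape(batch_size * num_codebooks, seq_len)
assert input_ids.shape == (batch_size * num_codebooks, seq_len)
```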
@@ -729,16 +726,13 @@ def forward(
 
             [What are input IDs?](../glossary#input-ids)
 
-            
-
-            The `input_ids` will automatically be converted from shape `(batch_size * num_codebooks,
-            target_sequence_length)` to `(batch_size, num_codebooks, target_sequence_length)` in the forward pass. If
-            you obtain audio codes from an audio encoding model, such as [`EncodecModel`], ensure that the number of
-            frames is equal to 1, and that you reshape the audio codes from `(frames, batch_size, num_codebooks,
-            target_sequence_length)` to `(batch_size * num_codebooks, target_sequence_length)` prior to passing them as
-            `input_ids`.
-
-            
+            > [!WARNING]
+            > The `input_ids` will automatically be converted from shape `(batch_size * num_codebooks,
+            > target_sequence_length)` to `(batch_size, num_codebooks, target_sequence_length)` in the forward pass. If
+            > you obtain audio codes from an audio encoding model, such as [`EncodecModel`], ensure that the number of
+            > frames is equal to 1, and that you reshape the audio codes from `(frames, batch_size, num_codebooks,
+            > target_sequence_length)` to `(batch_size * num_codebooks, target_sequence_length)` prior to passing them as
+            > `input_ids`.
         encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, encoder_sequence_length, hidden_size)`, *optional*):
             Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of
             the decoder.
@@ -848,16 +842,13 @@ def forward(
 
             [What are input IDs?](../glossary#input-ids)
 
-            
-
-            The `input_ids` will automatically be converted from shape `(batch_size * num_codebooks,
-            target_sequence_length)` to `(batch_size, num_codebooks, target_sequence_length)` in the forward pass. If
-            you obtain audio codes from an audio encoding model, such as [`EncodecModel`], ensure that the number of
-            frames is equal to 1, and that you reshape the audio codes from `(frames, batch_size, num_codebooks,
-            target_sequence_length)` to `(batch_size * num_codebooks, target_sequence_length)` prior to passing them as
-            `input_ids`.
-
-            
+            > [!WARNING]
+            > The `input_ids` will automatically be converted from shape `(batch_size * num_codebooks,
+            > target_sequence_length)` to `(batch_size, num_codebooks, target_sequence_length)` in the forward pass. If
+            > you obtain audio codes from an audio encoding model, such as [`EncodecModel`], ensure that the number of
+            > frames is equal to 1, and that you reshape the audio codes from `(frames, batch_size, num_codebooks,
+            > target_sequence_length)` to `(batch_size * num_codebooks, target_sequence_length)` prior to passing them as
+            > `input_ids`.
         encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, encoder_sequence_length, hidden_size)`, *optional*):
             Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of
             the decoder.
@@ -1077,16 +1068,13 @@ def generate(
 
         Generates sequences of token ids for models with a language modeling head.
 
-        
-
-        Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
-        model's default generation configuration. You can override any `generation_config` by passing the corresponding
-        parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
-
-        For an overview of generation strategies and code examples, check out the [following
-        guide](./generation_strategies).
-
-        
+        > [!WARNING]
+        > Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
+        > model's default generation configuration. You can override any `generation_config` by passing the corresponding
+        > parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
+        >
+        > For an overview of generation strategies and code examples, check out the [following
+        > guide](./generation_strategies).
 
         Parameters:
             inputs (`torch.Tensor` of varying shape depending on the modality, *optional*):
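A hedged sketch of overriding the default generation configuration at call time, as the warning above describes; `facebook/musicgen-small` is used purely as an illustrative checkpoint and the prompt is arbitrary.

```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(text=["lo-fi beat with a saxophone lead"], padding=True, return_tensors="pt")

# Arguments passed to generate() take precedence over model.generation_config,
# so sampling is enabled here even if the stored config defaults to greedy search.
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
```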
@@ -1664,16 +1652,13 @@ def forward(
 
             [What are decoder input IDs?](../glossary#decoder-input-ids)
 
-            
-
-            The `decoder_input_ids` will automatically be converted from shape `(batch_size * num_codebooks,
-            target_sequence_length)` to `(batch_size, num_codebooks, target_sequence_length)` in the forward pass. If
-            you obtain audio codes from an audio encoding model, such as [`EncodecModel`], ensure that the number of
-            frames is equal to 1, and that you reshape the audio codes from `(frames, batch_size, num_codebooks,
-            target_sequence_length)` to `(batch_size * num_codebooks, target_sequence_length)` prior to passing them as
-            `decoder_input_ids`.
-
-            
+            > [!WARNING]
+            > The `decoder_input_ids` will automatically be converted from shape `(batch_size * num_codebooks,
+            > target_sequence_length)` to `(batch_size, num_codebooks, target_sequence_length)` in the forward pass. If
+            > you obtain audio codes from an audio encoding model, such as [`EncodecModel`], ensure that the number of
+            > frames is equal to 1, and that you reshape the audio codes from `(frames, batch_size, num_codebooks,
+            > target_sequence_length)` to `(batch_size * num_codebooks, target_sequence_length)` prior to passing them as
+            > `decoder_input_ids`.
         decoder_attention_mask (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*):
             Default behavior: generate a tensor that ignores pad tokens in `decoder_input_ids`. Causal mask will also
             be used by default.
@@ -2116,16 +2101,13 @@ def generate(
 
         Generates sequences of token ids for models with a language modeling head.
 
-        
-
-        Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
-        model's default generation configuration. You can override any `generation_config` by passing the corresponding
-        parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
-
-        For an overview of generation strategies and code examples, check out the [following
-        guide](./generation_strategies).
-
-        
+        > [!WARNING]
+        > Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
+        > model's default generation configuration. You can override any `generation_config` by passing the corresponding
+        > parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
+        >
+        > For an overview of generation strategies and code examples, check out the [following
+        > guide](./generation_strategies).
 
         Parameters:
             inputs (`torch.Tensor` of varying shape depending on the modality, *optional*):
diff --git a/src/transformers/models/musicgen_melody/feature_extraction_musicgen_melody.py b/src/transformers/models/musicgen_melody/feature_extraction_musicgen_melody.py
index 744471bab553..28163b1960b2 100644
--- a/src/transformers/models/musicgen_melody/feature_extraction_musicgen_melody.py
+++ b/src/transformers/models/musicgen_melody/feature_extraction_musicgen_melody.py
@@ -69,12 +69,9 @@ class MusicgenMelodyFeatureExtractor(SequenceFeatureExtractor):
 
             [What are attention masks?](../glossary#attention-mask)
 
-            
-
-            For Whisper models, `attention_mask` should always be passed for batched inference, to avoid subtle
-            bugs.
-
-            
+            > [!TIP]
+            > For Whisper models, `attention_mask` should always be passed for batched inference, to avoid subtle
+            > bugs.
         stem_indices (`list[int]`, *optional*, defaults to `[3, 2]`):
             Stem channels to extract if demucs outputs are passed.
     """
@@ -219,9 +216,8 @@ def __call__(
 
                 [What are attention masks?](../glossary#attention-mask)
 
-                
-                For Musicgen Melody models, audio `attention_mask` is not necessary.
-                
+                > [!TIP]
+                > For Musicgen Melody models, audio `attention_mask` is not necessary.
 
             padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
                 Select a strategy to pad the returned sequences (according to the model's padding side and padding
diff --git a/src/transformers/models/musicgen_melody/modeling_musicgen_melody.py b/src/transformers/models/musicgen_melody/modeling_musicgen_melody.py
index cea583599ee2..e86884150d14 100644
--- a/src/transformers/models/musicgen_melody/modeling_musicgen_melody.py
+++ b/src/transformers/models/musicgen_melody/modeling_musicgen_melody.py
@@ -461,16 +461,13 @@ def forward(
 
             [What are input IDs?](../glossary#input-ids)
 
-            
-
-            The `input_ids` will automatically be converted from shape `(batch_size * num_codebooks,
-            target_sequence_length)` to `(batch_size, num_codebooks, target_sequence_length)` in the forward pass. If
-            you obtain audio codes from an audio encoding model, such as [`EncodecModel`], ensure that the number of
-            frames is equal to 1, and that you reshape the audio codes from `(frames, batch_size, num_codebooks,
-            target_sequence_length)` to `(batch_size * num_codebooks, target_sequence_length)` prior to passing them as
-            `input_ids`.
-
-            
+            > [!WARNING]
+            > The `input_ids` will automatically be converted from shape `(batch_size * num_codebooks,
+            > target_sequence_length)` to `(batch_size, num_codebooks, target_sequence_length)` in the forward pass. If
+            > you obtain audio codes from an audio encoding model, such as [`EncodecModel`], ensure that the number of
+            > frames is equal to 1, and that you reshape the audio codes from `(frames, batch_size, num_codebooks,
+            > target_sequence_length)` to `(batch_size * num_codebooks, target_sequence_length)` prior to passing them as
+            > `input_ids`.
         encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, encoder_sequence_length, hidden_size)`, *optional*):
             Sequence of hidden-states representing the concatenation of the text encoder output and the processed audio encoder output.
             Used as a conditional signal and will thus be concatenated to the projected `decoder_input_ids`.
@@ -683,16 +680,13 @@ def forward(
 
             [What are input IDs?](../glossary#input-ids)
 
-            
-
-            The `input_ids` will automatically be converted from shape `(batch_size * num_codebooks,
-            target_sequence_length)` to `(batch_size, num_codebooks, target_sequence_length)` in the forward pass. If
-            you obtain audio codes from an audio encoding model, such as [`EncodecModel`], ensure that the number of
-            frames is equal to 1, and that you reshape the audio codes from `(frames, batch_size, num_codebooks,
-            target_sequence_length)` to `(batch_size * num_codebooks, target_sequence_length)` prior to passing them as
-            `input_ids`.
-
-            
+            > [!WARNING]
+            > The `input_ids` will automatically be converted from shape `(batch_size * num_codebooks,
+            > target_sequence_length)` to `(batch_size, num_codebooks, target_sequence_length)` in the forward pass. If
+            > you obtain audio codes from an audio encoding model, such as [`EncodecModel`], ensure that the number of
+            > frames is equal to 1, and that you reshape the audio codes from `(frames, batch_size, num_codebooks,
+            > target_sequence_length)` to `(batch_size * num_codebooks, target_sequence_length)` prior to passing them as
+            > `input_ids`.
         encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, encoder_sequence_length, hidden_size)`, *optional*):
             Sequence of hidden-states representing the concatenation of the text encoder output and the processed audio encoder output.
             Used as a conditional signal and will thus be concatenated to the projected `decoder_input_ids`.
@@ -802,16 +796,13 @@ def forward(
 
             [What are input IDs?](../glossary#input-ids)
 
-            
-
-            The `input_ids` will automatically be converted from shape `(batch_size * num_codebooks,
-            target_sequence_length)` to `(batch_size, num_codebooks, target_sequence_length)` in the forward pass. If
-            you obtain audio codes from an audio encoding model, such as [`EncodecModel`], ensure that the number of
-            frames is equal to 1, and that you reshape the audio codes from `(frames, batch_size, num_codebooks,
-            target_sequence_length)` to `(batch_size * num_codebooks, target_sequence_length)` prior to passing them as
-            `input_ids`.
-
-            
+            > [!WARNING]
+            > The `input_ids` will automatically be converted from shape `(batch_size * num_codebooks,
+            > target_sequence_length)` to `(batch_size, num_codebooks, target_sequence_length)` in the forward pass. If
+            > you obtain audio codes from an audio encoding model, such as [`EncodecModel`], ensure that the number of
+            > frames is equal to 1, and that you reshape the audio codes from `(frames, batch_size, num_codebooks,
+            > target_sequence_length)` to `(batch_size * num_codebooks, target_sequence_length)` prior to passing them as
+            > `input_ids`.
         encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, encoder_sequence_length, hidden_size)`, *optional*):
             Sequence of hidden-states representing the concatenation of the text encoder output and the processed audio encoder output.
             Used as a conditional signal and will thus be concatenated to the projected `decoder_input_ids`.
@@ -1045,16 +1036,13 @@ def generate(
 
         Generates sequences of token ids for models with a language modeling head.
 
-        
-
-        Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
-        model's default generation configuration. You can override any `generation_config` by passing the corresponding
-        parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
-
-        For an overview of generation strategies and code examples, check out the [following
-        guide](./generation_strategies).
-
-        
+        > [!WARNING]
+        > Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
+        > model's default generation configuration. You can override any `generation_config` by passing the corresponding
+        > parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
+        >
+        > For an overview of generation strategies and code examples, check out the [following
+        > guide](./generation_strategies).
 
         Parameters:
             inputs (`torch.Tensor` of varying shape depending on the modality, *optional*):
@@ -1577,16 +1565,13 @@ def forward(
 
             [What are decoder input IDs?](../glossary#decoder-input-ids)
 
-            
-
-            The `decoder_input_ids` will automatically be converted from shape `(batch_size * num_codebooks,
-            target_sequence_length)` to `(batch_size, num_codebooks, target_sequence_length)` in the forward pass. If
-            you obtain audio codes from an audio encoding model, such as [`EncodecModel`], ensure that the number of
-            frames is equal to 1, and that you reshape the audio codes from `(frames, batch_size, num_codebooks,
-            target_sequence_length)` to `(batch_size * num_codebooks, target_sequence_length)` prior to passing them as
-            `decoder_input_ids`.
-
-            
+            > [!WARNING]
+            > The `decoder_input_ids` will automatically be converted from shape `(batch_size * num_codebooks,
+            > target_sequence_length)` to `(batch_size, num_codebooks, target_sequence_length)` in the forward pass. If
+            > you obtain audio codes from an audio encoding model, such as [`EncodecModel`], ensure that the number of
+            > frames is equal to 1, and that you reshape the audio codes from `(frames, batch_size, num_codebooks,
+            > target_sequence_length)` to `(batch_size * num_codebooks, target_sequence_length)` prior to passing them as
+            > `decoder_input_ids`.
         decoder_attention_mask (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*):
             Default behavior: generate a tensor that ignores pad tokens in `decoder_input_ids`. Causal mask will also
             be used by default.
@@ -2007,16 +1992,13 @@ def generate(
 
         Generates sequences of token ids for models with a language modeling head.
 
-        
-
-        Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
-        model's default generation configuration. You can override any `generation_config` by passing the corresponding
-        parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
-
-        For an overview of generation strategies and code examples, check out the [following
-        guide](./generation_strategies).
-
-        
+        > [!WARNING]
+        > Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
+        > model's default generation configuration. You can override any `generation_config` by passing the corresponding
+        > parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
+        >
+        > For an overview of generation strategies and code examples, check out the [following
+        > guide](./generation_strategies).
 
         Parameters:
             inputs (`torch.Tensor` of varying shape depending on the modality, *optional*):
diff --git a/src/transformers/models/mvp/tokenization_mvp.py b/src/transformers/models/mvp/tokenization_mvp.py
index f6039df2dc02..63a1438b1a1c 100644
--- a/src/transformers/models/mvp/tokenization_mvp.py
+++ b/src/transformers/models/mvp/tokenization_mvp.py
@@ -92,11 +92,8 @@ class MvpTokenizer(PreTrainedTokenizer):
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you
     call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
 
-    
-
-    When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one).
-
-    
+    > [!TIP]
+    > When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one).
 
     This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
     this superclass for more information regarding those methods.
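A small sketch of the whitespace behaviour described above; `roberta-base` stands in for any byte-level BPE checkpoint (MVP shares the same behaviour), and the exact token ids are checkpoint-dependent.

```python
from transformers import AutoTokenizer

default_tok = AutoTokenizer.from_pretrained("roberta-base")
prefix_tok = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)

# Without a leading space, the first word is encoded differently from " Hello",
# so the two id sequences below differ in their first content token.
print(default_tok("Hello world")["input_ids"])
print(prefix_tok("Hello world")["input_ids"])

# Pre-tokenized input requires a tokenizer instantiated with add_prefix_space=True.
print(prefix_tok(["Hello", "world"], is_split_into_words=True)["input_ids"])
```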
@@ -112,22 +109,16 @@ class MvpTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/mvp/tokenization_mvp_fast.py b/src/transformers/models/mvp/tokenization_mvp_fast.py
index ca0bc6b165f7..1adf757055b5 100644
--- a/src/transformers/models/mvp/tokenization_mvp_fast.py
+++ b/src/transformers/models/mvp/tokenization_mvp_fast.py
@@ -54,11 +54,8 @@ class MvpTokenizerFast(PreTrainedTokenizerFast):
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you
     call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
 
-    
-
-    When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.
-
-    
+    > [!TIP]
+    > When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.
 
     This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
     refer to this superclass for more information regarding those methods.
@@ -74,22 +71,16 @@ class MvpTokenizerFast(PreTrainedTokenizerFast):
         bos_token (`str`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/nllb/tokenization_nllb.py b/src/transformers/models/nllb/tokenization_nllb.py
index 4962a642bb31..cf7a29a3146e 100644
--- a/src/transformers/models/nllb/tokenization_nllb.py
+++ b/src/transformers/models/nllb/tokenization_nllb.py
@@ -64,22 +64,16 @@ class NllbTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/nllb/tokenization_nllb_fast.py b/src/transformers/models/nllb/tokenization_nllb_fast.py
index 5300b3942b5d..0cc028219c10 100644
--- a/src/transformers/models/nllb/tokenization_nllb_fast.py
+++ b/src/transformers/models/nllb/tokenization_nllb_fast.py
@@ -69,22 +69,16 @@ class NllbTokenizerFast(PreTrainedTokenizerFast):
         bos_token (`str`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/parakeet/feature_extraction_parakeet.py b/src/transformers/models/parakeet/feature_extraction_parakeet.py
index d28f1a214a21..1106c3fac2f3 100644
--- a/src/transformers/models/parakeet/feature_extraction_parakeet.py
+++ b/src/transformers/models/parakeet/feature_extraction_parakeet.py
@@ -163,12 +163,9 @@ def __call__(
 
                 [What are attention masks?](../glossary#attention-mask)
 
-                
-
-                For Parakeet models, `attention_mask` should always be passed for batched inference, to avoid subtle
-                bugs.
-
-                
+                > [!TIP]
+                > For Parakeet models, `attention_mask` should always be passed for batched inference, to avoid subtle
+                > bugs.
 
             return_tensors (`str` or [`~utils.TensorType`], *optional*):
                 If set, will return tensors instead of list of python integers. Acceptable values are:
diff --git a/src/transformers/models/pegasus/tokenization_pegasus.py b/src/transformers/models/pegasus/tokenization_pegasus.py
index b8a4a1c737d1..033d88aae927 100644
--- a/src/transformers/models/pegasus/tokenization_pegasus.py
+++ b/src/transformers/models/pegasus/tokenization_pegasus.py
@@ -51,12 +51,9 @@ class PegasusTokenizer(PreTrainedTokenizer):
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         unk_token (`str`, *optional*, defaults to `""`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
diff --git a/src/transformers/models/pegasus/tokenization_pegasus_fast.py b/src/transformers/models/pegasus/tokenization_pegasus_fast.py
index 92a37c44ff2e..57531f9db10b 100644
--- a/src/transformers/models/pegasus/tokenization_pegasus_fast.py
+++ b/src/transformers/models/pegasus/tokenization_pegasus_fast.py
@@ -53,12 +53,9 @@ class PegasusTokenizerFast(PreTrainedTokenizerFast):
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         unk_token (`str`, *optional*, defaults to `""`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
diff --git a/src/transformers/models/perceiver/modeling_perceiver.py b/src/transformers/models/perceiver/modeling_perceiver.py
index 499d01774d06..cb23b557a313 100755
--- a/src/transformers/models/perceiver/modeling_perceiver.py
+++ b/src/transformers/models/perceiver/modeling_perceiver.py
@@ -575,13 +575,10 @@ def _init_weights(self, module):
     custom_intro="""
     The Perceiver: a scalable, fully attentional architecture.
 
-    
-
-        Note that it's possible to fine-tune Perceiver on higher resolution images than the ones it has been trained on, by
-        setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
-        position embeddings to the higher resolution.
-
-    
+    > [!TIP]
+    > Note that it's possible to fine-tune Perceiver on higher resolution images than the ones it has been trained on, by
+    > setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
+    > position embeddings to the higher resolution.
     """
 )
 class PerceiverModel(PerceiverPreTrainedModel):
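A hedged sketch of the higher-resolution note above; the checkpoint name and the 384x384 input size are illustrative assumptions, and it assumes the image-classification head forwards the same `interpolate_pos_encoding` flag.

```python
import torch
from transformers import PerceiverForImageClassificationLearned

model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")

# A random image larger than the resolution the checkpoint was pretrained on.
pixel_values = torch.randn(1, 3, 384, 384)

# interpolate_pos_encoding=True resizes the learned position embeddings to match
# the larger input instead of raising a size-mismatch error.
logits = model(inputs=pixel_values, interpolate_pos_encoding=True).logits
```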
diff --git a/src/transformers/models/perceiver/tokenization_perceiver.py b/src/transformers/models/perceiver/tokenization_perceiver.py
index f17e7e99ac9d..8109d0e7e0e7 100644
--- a/src/transformers/models/perceiver/tokenization_perceiver.py
+++ b/src/transformers/models/perceiver/tokenization_perceiver.py
@@ -38,12 +38,9 @@ class PerceiverTokenizer(PreTrainedTokenizer):
         eos_token (`str`, *optional*, defaults to `"[EOS]"`):
             The end of sequence token (reserved in the vocab, but not actually used).
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         mask_token (`str`, *optional*, defaults to `"[MASK]"`):
             The MASK token, useful for masked language modeling.
diff --git a/src/transformers/models/phobert/tokenization_phobert.py b/src/transformers/models/phobert/tokenization_phobert.py
index 61ac8194b45c..624fe9501f4d 100644
--- a/src/transformers/models/phobert/tokenization_phobert.py
+++ b/src/transformers/models/phobert/tokenization_phobert.py
@@ -63,22 +63,16 @@ class PhobertTokenizer(PreTrainedTokenizer):
         bos_token (`st`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/pop2piano/modeling_pop2piano.py b/src/transformers/models/pop2piano/modeling_pop2piano.py
index 94c2a7515a44..0f8183aee655 100644
--- a/src/transformers/models/pop2piano/modeling_pop2piano.py
+++ b/src/transformers/models/pop2piano/modeling_pop2piano.py
@@ -1193,14 +1193,11 @@ def generate(
         """
         Generates token ids for midi outputs.
 
-        
-
-        Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
-        model's default generation configuration. You can override any `generation_config` by passing the corresponding
-        parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`. For an overview of generation
-        strategies and code examples, check out the [following guide](./generation_strategies).
-
-        
+        > [!WARNING]
+        > Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
+        > model's default generation configuration. You can override any `generation_config` by passing the corresponding
+        > parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`. For an overview of generation
+        > strategies and code examples, check out the [following guide](./generation_strategies).
 
         Parameters:
             input_features (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
diff --git a/src/transformers/models/reformer/modeling_reformer.py b/src/transformers/models/reformer/modeling_reformer.py
index e2cb1c4657a8..b28ecac35bbe 100755
--- a/src/transformers/models/reformer/modeling_reformer.py
+++ b/src/transformers/models/reformer/modeling_reformer.py
@@ -2402,12 +2402,9 @@ def forward(
             config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked),
             the loss is only computed for the tokens with labels
 
-            
-
-            This example uses a false checkpoint since we don't have any available pretrained model for the masked language
-            modeling task with the Reformer architecture.
-
-            
+            > [!WARNING]
+            > This example uses a dummy checkpoint since we don't have any available pretrained model for the masked language
+            > modeling task with the Reformer architecture.
 
         Example:
 
diff --git a/src/transformers/models/reformer/tokenization_reformer.py b/src/transformers/models/reformer/tokenization_reformer.py
index 458b72df4ff6..175c35a60ba8 100644
--- a/src/transformers/models/reformer/tokenization_reformer.py
+++ b/src/transformers/models/reformer/tokenization_reformer.py
@@ -48,12 +48,9 @@ class ReformerTokenizer(PreTrainedTokenizer):
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         unk_token (`str`, *optional*, defaults to `""`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
diff --git a/src/transformers/models/reformer/tokenization_reformer_fast.py b/src/transformers/models/reformer/tokenization_reformer_fast.py
index d68528de5872..40ef8a5ef783 100644
--- a/src/transformers/models/reformer/tokenization_reformer_fast.py
+++ b/src/transformers/models/reformer/tokenization_reformer_fast.py
@@ -51,12 +51,9 @@ class ReformerTokenizerFast(PreTrainedTokenizerFast):
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         unk_token (`str`, *optional*, defaults to `""`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
diff --git a/src/transformers/models/rembert/tokenization_rembert.py b/src/transformers/models/rembert/tokenization_rembert.py
index cf27a7b3bae6..ff8a21b6b850 100644
--- a/src/transformers/models/rembert/tokenization_rembert.py
+++ b/src/transformers/models/rembert/tokenization_rembert.py
@@ -45,22 +45,16 @@ class RemBertTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `"[CLS]"`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `"[SEP]"`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         unk_token (`str`, *optional*, defaults to `""`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
diff --git a/src/transformers/models/rembert/tokenization_rembert_fast.py b/src/transformers/models/rembert/tokenization_rembert_fast.py
index fb358746e6d2..52e454c31478 100644
--- a/src/transformers/models/rembert/tokenization_rembert_fast.py
+++ b/src/transformers/models/rembert/tokenization_rembert_fast.py
@@ -55,12 +55,9 @@ class RemBertTokenizerFast(PreTrainedTokenizerFast):
         bos_token (`str`, *optional*, defaults to `"[CLS]"`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `"[SEP]"`):
             The end of sequence token. .. note:: When building a sequence using special tokens, this is not the token
diff --git a/src/transformers/models/roberta/tokenization_roberta.py b/src/transformers/models/roberta/tokenization_roberta.py
index 67cdcbbf488a..394ec17f32a0 100644
--- a/src/transformers/models/roberta/tokenization_roberta.py
+++ b/src/transformers/models/roberta/tokenization_roberta.py
@@ -93,11 +93,8 @@ class RobertaTokenizer(PreTrainedTokenizer):
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you
     call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
 
-    
-
-    When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one).
-
-    
+    > [!TIP]
+    > When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one).
 
     This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
     this superclass for more information regarding those methods.
@@ -113,22 +110,16 @@ class RobertaTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
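To illustrate the recurring bos/eos note above: a minimal sketch with `roberta-base` as an illustrative checkpoint. Encoded sequences are framed by `cls_token` and `sep_token`; no separate bos/eos tokens are inserted.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoding = tokenizer("Hello world")

# The sequence is wrapped with cls_token and sep_token, not with extra bos/eos tokens
# (for RoBERTa, bos/cls and eos/sep happen to share the symbols "<s>" and "</s>").
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
print(tokens)                                    # e.g. ['<s>', 'Hello', 'Ġworld', '</s>']
print(tokenizer.cls_token, tokenizer.sep_token)  # '<s>' '</s>'
```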
diff --git a/src/transformers/models/roberta/tokenization_roberta_fast.py b/src/transformers/models/roberta/tokenization_roberta_fast.py
index d9ddcfc82d49..d23236a79983 100644
--- a/src/transformers/models/roberta/tokenization_roberta_fast.py
+++ b/src/transformers/models/roberta/tokenization_roberta_fast.py
@@ -52,11 +52,8 @@ class RobertaTokenizerFast(PreTrainedTokenizerFast):
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you
     call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
 
-    
-
-    When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.
-
-    
+    > [!TIP]
+    > When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with `add_prefix_space=True`.
 
     This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
     refer to this superclass for more information regarding those methods.
@@ -72,22 +69,16 @@ class RobertaTokenizerFast(PreTrainedTokenizerFast):
         bos_token (`str`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/seamless_m4t/feature_extraction_seamless_m4t.py b/src/transformers/models/seamless_m4t/feature_extraction_seamless_m4t.py
index d0a8a07b3a9e..f6151f7e2516 100644
--- a/src/transformers/models/seamless_m4t/feature_extraction_seamless_m4t.py
+++ b/src/transformers/models/seamless_m4t/feature_extraction_seamless_m4t.py
@@ -194,12 +194,9 @@ def __call__(
 
                 [What are attention masks?](../glossary#attention-mask)
 
-                
-
-                For SeamlessM4T models, `attention_mask` should always be passed for batched inference, to avoid subtle
-                bugs.
-
-                
+                > [!TIP]
+                > For SeamlessM4T models, `attention_mask` should always be passed for batched inference, to avoid subtle
+                > bugs.
 
             return_tensors (`str` or [`~utils.TensorType`], *optional*):
                 If set, will return tensors instead of list of python integers. Acceptable values are:
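A hedged sketch of the batched-inference note above; the checkpoint name is illustrative and the zero-filled arrays merely stand in for real audio of different lengths.

```python
import numpy as np
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/hf-seamless-m4t-medium")

# Two clips of different length, so the shorter one gets padded in the batch.
speech = [np.zeros(16000, dtype=np.float32), np.zeros(24000, dtype=np.float32)]
batch = feature_extractor(speech, sampling_rate=16000, return_attention_mask=True, return_tensors="pt")

# Pass batch["attention_mask"] to the model along with the features so that the
# padded frames are ignored during inference.
print(batch["input_features"].shape, batch["attention_mask"].shape)
```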
diff --git a/src/transformers/models/seamless_m4t/modeling_seamless_m4t.py b/src/transformers/models/seamless_m4t/modeling_seamless_m4t.py
index 9332e18856a2..07ea919e0e7d 100755
--- a/src/transformers/models/seamless_m4t/modeling_seamless_m4t.py
+++ b/src/transformers/models/seamless_m4t/modeling_seamless_m4t.py
@@ -2611,16 +2611,13 @@ def generate(
         """
         Generates sequences of token ids.
 
-        
-
-        Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
-        model's default generation configuration. You can override any `generation_config` by passing the corresponding
-        parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
-
-        For an overview of generation strategies and code examples, check out the [following
-        guide](./generation_strategies).
-
-        
+        > [!WARNING]
+        > Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
+        > model's default generation configuration. You can override any `generation_config` by passing the corresponding
+        > parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
+        >
+        > For an overview of generation strategies and code examples, check out the [following
+        > guide](./generation_strategies).
 
         Parameters:
             input_ids (`torch.Tensor` of varying shape depending on the modality, *optional*):
@@ -2870,16 +2867,13 @@ def generate(
         """
         Generates sequences of token ids.
 
-        
-
-        Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
-        model's default generation configuration. You can override any `generation_config` by passing the corresponding
-        parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
-
-        For an overview of generation strategies and code examples, check out the [following
-        guide](./generation_strategies).
-
-        
+        > [!WARNING]
+        > Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
+        > model's default generation configuration. You can override any `generation_config` by passing the corresponding
+        > parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
+        >
+        > For an overview of generation strategies and code examples, check out the [following
+        > guide](./generation_strategies).
 
         Parameters:
             input_features (`torch.FloatTensor` of shape `(batch_size, sequence_length, num_banks)`):
@@ -3137,19 +3131,16 @@ def generate(
         """
         Generates translated audio waveforms.
 
-        
-
-        This method successively calls the `.generate` function of two different sub-models. You can specify keyword
-        arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
-        that will be passed to one of them.
-
-        For example, calling `.generate(input_ids, num_beams=4, speech_do_sample=True)` will successively perform
-        beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
-
-        For an overview of generation strategies and code examples, check out the [following
-        guide](./generation_strategies).
-
-        
+        > [!TIP]
+        > This method successively calls the `.generate` function of two different sub-models. You can specify keyword
+        > arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
+        > that will be passed to one of them.
+        >
+        > For example, calling `.generate(input_ids, num_beams=4, speech_do_sample=True)` will successively perform
+        > beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
+        >
+        > For an overview of generation strategies and code examples, check out the [following
+        > guide](./generation_strategies).
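A hedged sketch of the prefixed-kwargs convention described above (the checkpoint name and language codes are illustrative assumptions):

```python
from transformers import AutoProcessor, SeamlessM4TForTextToSpeech

# Hypothetical checkpoint, for illustration only.
processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TForTextToSpeech.from_pretrained("facebook/hf-seamless-m4t-medium")

inputs = processor(text="Hello, how are you?", src_lang="eng", return_tensors="pt")

# num_beams=4 goes to both sub-models; the speech_ prefix restricts do_sample to the speech model,
# so the text decoder runs beam search while the speech decoder samples.
outputs = model.generate(**inputs, tgt_lang="fra", num_beams=4, speech_do_sample=True)
# `outputs` holds the generated waveform(s) and their lengths.
```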
 
         Args:
             input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
@@ -3461,19 +3452,16 @@ def generate(
         """
         Generates translated audio waveforms.
 
-        
-
-        This method successively calls the `.generate` function of two different sub-models. You can specify keyword
-        arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
-        that will be passed to one of them.
-
-        For example, calling `.generate(input_features, num_beams=4, speech_do_sample=True)` will successively perform
-        beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
-
-        For an overview of generation strategies and code examples, check out the [following
-        guide](./generation_strategies).
-
-        
+        > [!TIP]
+        > This method successively calls the `.generate` function of two different sub-models. You can specify keyword
+        > arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
+        > that will be passed to one of them.
+        >
+        > For example, calling `.generate(input_features, num_beams=4, speech_do_sample=True)` will successively perform
+        > beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
+        >
+        > For an overview of generation strategies and code examples, check out the [following
+        > guide](./generation_strategies).
 
         Args:
             input_features (`torch.FloatTensor` of shape `(batch_size, sequence_length, num_banks)`):
@@ -3853,19 +3841,16 @@ def generate(
         """
         Generates translated token ids and/or translated audio waveforms.
 
-        
-
-        This method successively calls the `.generate` function of two different sub-models. You can specify keyword
-        arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
-        that will be passed to one of them.
-
-        For example, calling `.generate(input_ids=input_ids, num_beams=4, speech_do_sample=True)` will successively
-        perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
-
-        For an overview of generation strategies and code examples, check out the [following
-        guide](./generation_strategies).
-
-        
+        > [!TIP]
+        > This method successively calls the `.generate` function of two different sub-models. You can specify keyword
+        > arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
+        > that will be passed to one of them.
+        >
+        > For example, calling `.generate(input_ids=input_ids, num_beams=4, speech_do_sample=True)` will successively
+        > perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
+        >
+        > For an overview of generation strategies and code examples, check out the [following
+        > guide](./generation_strategies).
 
 
         Args:
diff --git a/src/transformers/models/seamless_m4t/tokenization_seamless_m4t.py b/src/transformers/models/seamless_m4t/tokenization_seamless_m4t.py
index fd773316580c..9de5a7b5d798 100644
--- a/src/transformers/models/seamless_m4t/tokenization_seamless_m4t.py
+++ b/src/transformers/models/seamless_m4t/tokenization_seamless_m4t.py
@@ -71,22 +71,16 @@ class SeamlessM4TTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/seamless_m4t/tokenization_seamless_m4t_fast.py b/src/transformers/models/seamless_m4t/tokenization_seamless_m4t_fast.py
index 0318336332c3..6f0cd5ec8301 100644
--- a/src/transformers/models/seamless_m4t/tokenization_seamless_m4t_fast.py
+++ b/src/transformers/models/seamless_m4t/tokenization_seamless_m4t_fast.py
@@ -71,22 +71,16 @@ class SeamlessM4TTokenizerFast(PreTrainedTokenizerFast):
         bos_token (`str`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py b/src/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
index 4836416bced6..6fcfac5dfd3a 100644
--- a/src/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
+++ b/src/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
@@ -2818,16 +2818,13 @@ def generate(
         """
         Generates sequences of token ids.
 
-        
-
-        Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
-        model's default generation configuration. You can override any `generation_config` by passing the corresponding
-        parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
-
-        For an overview of generation strategies and code examples, check out the [following
-        guide](./generation_strategies).
-
-        
+        > [!WARNING]
+        > Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
+        > model's default generation configuration. You can override any `generation_config` by passing the corresponding
+        > parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
+        >
+        > For an overview of generation strategies and code examples, check out the [following
+        > guide](./generation_strategies).
 
         Parameters:
             input_ids (`torch.Tensor` of varying shape depending on the modality, *optional*):
@@ -3085,16 +3082,13 @@ def generate(
         """
         Generates sequences of token ids.
 
-        
-
-        Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
-        model's default generation configuration. You can override any `generation_config` by passing the corresponding
-        parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
-
-        For an overview of generation strategies and code examples, check out the [following
-        guide](./generation_strategies).
-
-        
+        > [!WARNING]
+        > Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
+        > model's default generation configuration. You can override any `generation_config` by passing the corresponding
+        > parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
+        >
+        > For an overview of generation strategies and code examples, check out the [following
+        > guide](./generation_strategies).
 
         Parameters:
             input_features (`torch.FloatTensor` of shape `(batch_size, sequence_length, num_banks)`):
@@ -3359,19 +3353,16 @@ def generate(
         """
         Generates translated audio waveforms.
 
-        
-
-        This method successively calls the `.generate` function of two different sub-models. You can specify keyword
-        arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
-        that will be passed to one of them.
-
-        For example, calling `.generate(input_ids, num_beams=4, speech_do_sample=True)` will successively perform
-        beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
-
-        For an overview of generation strategies and code examples, check out the [following
-        guide](./generation_strategies).
-
-        
+        > [!TIP]
+        > This method successively calls the `.generate` function of two different sub-models. You can specify keyword
+        > arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
+        > that will be passed to one of them.
+        >
+        > For example, calling `.generate(input_ids, num_beams=4, speech_do_sample=True)` will successively perform
+        > beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
+        >
+        > For an overview of generation strategies and code examples, check out the [following
+        > guide](./generation_strategies).
 
         Args:
             input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
@@ -3721,19 +3712,16 @@ def generate(
         """
         Generates translated audio waveforms.
 
-        
-
-        This method successively calls the `.generate` function of two different sub-models. You can specify keyword
-        arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
-        that will be passed to one of them.
-
-        For example, calling `.generate(input_features, num_beams=4, speech_do_sample=True)` will successively perform
-        beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
-
-        For an overview of generation strategies and code examples, check out the [following
-        guide](./generation_strategies).
-
-        
+        > [!TIP]
+        > This method successively calls the `.generate` function of two different sub-models. You can specify keyword
+        > arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
+        > that will be passed to one of them.
+        >
+        > For example, calling `.generate(input_features, num_beams=4, speech_do_sample=True)` will successively perform
+        > beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
+        >
+        > For an overview of generation strategies and code examples, check out the [following
+        > guide](./generation_strategies).
 
         Args:
             input_features (`torch.FloatTensor` of shape `(batch_size, sequence_length, num_banks)`):
@@ -4150,19 +4138,16 @@ def generate(
         """
         Generates translated token ids and/or translated audio waveforms.
 
-        
-
-        This method successively calls the `.generate` function of two different sub-models. You can specify keyword
-        arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
-        that will be passed to one of them.
-
-        For example, calling `.generate(input_ids=input_ids, num_beams=4, speech_do_sample=True)` will successively
-        perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
-
-        For an overview of generation strategies and code examples, check out the [following
-        guide](./generation_strategies).
-
-        
+        > [!TIP]
+        > This method successively calls the `.generate` function of two different sub-models. You can specify keyword
+        > arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
+        > that will be passed to one of them.
+        >
+        > For example, calling `.generate(input_ids=input_ids, num_beams=4, speech_do_sample=True)` will successively
+        > perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
+        >
+        > For an overview of generation strategies and code examples, check out the [following
+        > guide](./generation_strategies).
 
 
         Args:
diff --git a/src/transformers/models/speech_to_text/feature_extraction_speech_to_text.py b/src/transformers/models/speech_to_text/feature_extraction_speech_to_text.py
index fe6698e9ebec..b77b60ce3fde 100644
--- a/src/transformers/models/speech_to_text/feature_extraction_speech_to_text.py
+++ b/src/transformers/models/speech_to_text/feature_extraction_speech_to_text.py
@@ -219,12 +219,9 @@ def __call__(
 
                 [What are attention masks?](../glossary#attention-mask)
 
-                
-
-                For Speech2TextTransformer models, `attention_mask` should always be passed for batched inference, to
-                avoid subtle bugs.
-
-                
+                > [!TIP]
+                > For Speech2TextTransformer models, `attention_mask` should always be passed for batched inference, to
+                > avoid subtle bugs.
 
             return_tensors (`str` or [`~utils.TensorType`], *optional*):
                 If set, will return tensors instead of list of python integers. Acceptable values are:
diff --git a/src/transformers/models/swin/modeling_swin.py b/src/transformers/models/swin/modeling_swin.py
index c9fdc0d7d044..d9b2a1a0ec46 100644
--- a/src/transformers/models/swin/modeling_swin.py
+++ b/src/transformers/models/swin/modeling_swin.py
@@ -942,12 +942,9 @@ def forward(
     custom_intro="""
     Swin Model with a decoder on top for masked image modeling, as proposed in [SimMIM](https://huggingface.co/papers/2111.09886).
 
-    
-
-    Note that we provide a script to pre-train this model on custom data in our [examples
-    directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
-
-    
+    > [!TIP]
+    > Note that we provide a script to pre-train this model on custom data in our [examples
+    > directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
     """
 )
 class SwinForMaskedImageModeling(SwinPreTrainedModel):
@@ -1056,13 +1053,10 @@ def forward(
     Swin Model transformer with an image classification head on top (a linear layer on top of the final hidden state of
     the [CLS] token) e.g. for ImageNet.
 
-    
-
-        Note that it's possible to fine-tune Swin on higher resolution images than the ones it has been trained on, by
-        setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
-        position embeddings to the higher resolution.
-
-    
+    > [!TIP]
+    > Note that it's possible to fine-tune Swin on higher resolution images than the ones it has been trained on, by
+    > setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
+    > position embeddings to the higher resolution.
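A minimal sketch of that higher-resolution path (the checkpoint and the 384x384 input size are assumptions; `interpolate_pos_encoding` is the flag referenced in the note):

```python
import torch
from transformers import SwinForImageClassification

# Hypothetical checkpoint pre-trained at 224x224, used only for illustration.
model = SwinForImageClassification.from_pretrained("microsoft/swin-tiny-patch4-window7-224")

# A dummy batch at a higher resolution than the pre-training size.
pixel_values = torch.randn(1, 3, 384, 384)

# interpolate_pos_encoding=True resizes the pre-trained position embeddings to the new resolution.
logits = model(pixel_values=pixel_values, interpolate_pos_encoding=True).logits
```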
     """
 )
 class SwinForImageClassification(SwinPreTrainedModel):
diff --git a/src/transformers/models/swinv2/modeling_swinv2.py b/src/transformers/models/swinv2/modeling_swinv2.py
index 33be714f96b3..7e1c24addac8 100644
--- a/src/transformers/models/swinv2/modeling_swinv2.py
+++ b/src/transformers/models/swinv2/modeling_swinv2.py
@@ -1019,12 +1019,9 @@ def forward(
         Swinv2 Model with a decoder on top for masked image modeling, as proposed in
     [SimMIM](https://huggingface.co/papers/2111.09886).
 
-        
-
-        Note that we provide a script to pre-train this model on custom data in our [examples
-        directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
-
-        
+        > [!TIP]
+        > Note that we provide a script to pre-train this model on custom data in our [examples
+        > directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
     """
 )
 # Copied from transformers.models.swin.modeling_swin.SwinForMaskedImageModeling with swin->swinv2, base-simmim-window6-192->tiny-patch4-window8-256,SWIN->SWINV2,Swin->Swinv2,192->256
@@ -1134,13 +1131,10 @@ def forward(
     Swinv2 Model transformer with an image classification head on top (a linear layer on top of the final hidden state
     of the [CLS] token) e.g. for ImageNet.
 
-    
-
-        Note that it's possible to fine-tune SwinV2 on higher resolution images than the ones it has been trained on, by
-        setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
-        position embeddings to the higher resolution.
-
-    
+    > [!TIP]
+    > Note that it's possible to fine-tune SwinV2 on higher resolution images than the ones it has been trained on, by
+    > setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
+    > position embeddings to the higher resolution.
     """
 )
 # Copied from transformers.models.swin.modeling_swin.SwinForImageClassification with SWIN->SWINV2,Swin->Swinv2,swin->swinv2
diff --git a/src/transformers/models/t5/tokenization_t5.py b/src/transformers/models/t5/tokenization_t5.py
index 0a25271345cf..136e7ffe73f4 100644
--- a/src/transformers/models/t5/tokenization_t5.py
+++ b/src/transformers/models/t5/tokenization_t5.py
@@ -56,12 +56,9 @@ class T5Tokenizer(PreTrainedTokenizer):
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         unk_token (`str`, *optional*, defaults to `""`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
diff --git a/src/transformers/models/t5/tokenization_t5_fast.py b/src/transformers/models/t5/tokenization_t5_fast.py
index bdba1a7928c8..1e9f02dddf2c 100644
--- a/src/transformers/models/t5/tokenization_t5_fast.py
+++ b/src/transformers/models/t5/tokenization_t5_fast.py
@@ -53,12 +53,9 @@ class T5TokenizerFast(PreTrainedTokenizerFast):
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         unk_token (`str`, *optional*, defaults to `""`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
diff --git a/src/transformers/models/tapas/tokenization_tapas.py b/src/transformers/models/tapas/tokenization_tapas.py
index 7277f562a118..97ddc1ed8c86 100644
--- a/src/transformers/models/tapas/tokenization_tapas.py
+++ b/src/transformers/models/tapas/tokenization_tapas.py
@@ -653,11 +653,8 @@ def batch_encode_plus(
         """
         Prepare a table and a list of strings for the model.
 
-        
-
-        This method is deprecated, `__call__` should be used instead.
-
-        
+        > [!WARNING]
+        > This method is deprecated, `__call__` should be used instead.
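As a hedged sketch of the recommended replacement (the checkpoint and toy table are assumptions, not part of this change):

```python
import pandas as pd
from transformers import TapasTokenizer

# Hypothetical checkpoint, used only for illustration.
tokenizer = TapasTokenizer.from_pretrained("google/tapas-base-finetuned-wtq")

table = pd.DataFrame({"Actors": ["Brad Pitt", "Leonardo DiCaprio"], "Age": ["56", "45"]})
queries = ["How old is Brad Pitt?", "Which actor is 45 years old?"]

# Preferred path: call the tokenizer directly instead of the deprecated batch_encode_plus.
encoding = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="pt")
```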
 
         Args:
             table (`pd.DataFrame`):
diff --git a/src/transformers/models/udop/tokenization_udop.py b/src/transformers/models/udop/tokenization_udop.py
index a5833333e10a..35bff072e680 100644
--- a/src/transformers/models/udop/tokenization_udop.py
+++ b/src/transformers/models/udop/tokenization_udop.py
@@ -163,12 +163,9 @@ class UdopTokenizer(PreTrainedTokenizer):
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         unk_token (`str`, *optional*, defaults to `""`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
@@ -833,11 +830,8 @@ def encode_plus_boxes(
         """
         Tokenize and prepare for the model a sequence or a pair of sequences.
 
-        
-
-        This method is deprecated, `__call__` should be used instead.
-
-        
+        > [!WARNING]
+        > This method is deprecated, `__call__` should be used instead.
 
         Args:
             text (`str`, `list[str]` or (for non-fast tokenizers) `list[int]`):
diff --git a/src/transformers/models/udop/tokenization_udop_fast.py b/src/transformers/models/udop/tokenization_udop_fast.py
index 9751f5d65ddf..fbd38f936ab9 100644
--- a/src/transformers/models/udop/tokenization_udop_fast.py
+++ b/src/transformers/models/udop/tokenization_udop_fast.py
@@ -162,12 +162,9 @@ class UdopTokenizerFast(PreTrainedTokenizerFast):
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
@@ -465,11 +462,8 @@ def batch_encode_plus_boxes(
         """
         Tokenize and prepare for the model a list of sequences or a list of pairs of sequences.
 
-        
-
-        This method is deprecated, `__call__` should be used instead.
-
-        
+        > [!WARNING]
+        > This method is deprecated, `__call__` should be used instead.
 
         Args:
             batch_text_or_text_pairs (`list[str]`, `list[tuple[str, str]]`, `list[list[str]]`, `list[tuple[list[str], list[str]]]`, and for not-fast tokenizers, also `list[list[int]]`, `list[tuple[list[int], list[int]]]`):
@@ -812,11 +806,8 @@ def encode_plus_boxes(
         """
         Tokenize and prepare for the model a sequence or a pair of sequences.
 
-        
-
-        This method is deprecated, `__call__` should be used instead.
-
-        
+        > [!WARNING]
+        > This method is deprecated, `__call__` should be used instead.
 
         Args:
             text (`str`, `list[str]` or (for non-fast tokenizers) `list[int]`):
diff --git a/src/transformers/models/vit/modeling_vit.py b/src/transformers/models/vit/modeling_vit.py
index 849085bc08b1..6268ca19d4f1 100644
--- a/src/transformers/models/vit/modeling_vit.py
+++ b/src/transformers/models/vit/modeling_vit.py
@@ -499,12 +499,9 @@ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
     custom_intro="""
     ViT Model with a decoder on top for masked image modeling, as proposed in [SimMIM](https://huggingface.co/papers/2111.09886).
 
-    
-
-    Note that we provide a script to pre-train this model on custom data in our [examples
-    directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
-
-    
+    > [!TIP]
+    > Note that we provide a script to pre-train this model on custom data in our [examples
+    > directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
     """
 )
 class ViTForMaskedImageModeling(ViTPreTrainedModel):
@@ -613,13 +610,10 @@ def forward(
     ViT Model transformer with an image classification head on top (a linear layer on top of the final hidden state of
     the [CLS] token) e.g. for ImageNet.
 
-    
-
-        Note that it's possible to fine-tune ViT on higher resolution images than the ones it has been trained on, by
-        setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
-        position embeddings to the higher resolution.
-
-    
+    > [!TIP]
+    > Note that it's possible to fine-tune ViT on higher resolution images than the ones it has been trained on, by
+    > setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
+    > position embeddings to the higher resolution.
     """
 )
 class ViTForImageClassification(ViTPreTrainedModel):
diff --git a/src/transformers/models/vit_mae/modeling_vit_mae.py b/src/transformers/models/vit_mae/modeling_vit_mae.py
index 2db4df13bc95..1e4c30b4f0c8 100755
--- a/src/transformers/models/vit_mae/modeling_vit_mae.py
+++ b/src/transformers/models/vit_mae/modeling_vit_mae.py
@@ -752,12 +752,9 @@ def forward(self, hidden_states: torch.Tensor, ids_restore: torch.Tensor, interp
     custom_intro="""
     The ViTMAE Model transformer with the decoder on top for self-supervised pre-training.
 
-    
-
-    Note that we provide a script to pre-train this model on custom data in our [examples
-    directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
-
-    
+    > [!TIP]
+    > Note that we provide a script to pre-train this model on custom data in our [examples
+    > directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
     """
 )
 class ViTMAEForPreTraining(ViTMAEPreTrainedModel):
diff --git a/src/transformers/models/vivit/modeling_vivit.py b/src/transformers/models/vivit/modeling_vivit.py
index 7170d3ff7de3..f7b2810e78e0 100755
--- a/src/transformers/models/vivit/modeling_vivit.py
+++ b/src/transformers/models/vivit/modeling_vivit.py
@@ -545,13 +545,10 @@ def forward(
         ViViT Transformer model with a video classification head on top (a linear layer on top of the final hidden state of the
     [CLS] token) e.g. for Kinetics-400.
 
-        
-
-            Note that it's possible to fine-tune ViT on higher resolution images than the ones it has been trained on, by
-            setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
-            position embeddings to the higher resolution.
-
-        
+        > [!TIP]
+        > Note that it's possible to fine-tune ViViT on higher resolution images than the ones it has been trained on, by
+        > setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
+        > position embeddings to the higher resolution.
     """
 )
 class VivitForVideoClassification(VivitPreTrainedModel):
diff --git a/src/transformers/models/wav2vec2/feature_extraction_wav2vec2.py b/src/transformers/models/wav2vec2/feature_extraction_wav2vec2.py
index 3b830c314b31..e56a88548b51 100644
--- a/src/transformers/models/wav2vec2/feature_extraction_wav2vec2.py
+++ b/src/transformers/models/wav2vec2/feature_extraction_wav2vec2.py
@@ -49,18 +49,15 @@ class Wav2Vec2FeatureExtractor(SequenceFeatureExtractor):
         return_attention_mask (`bool`, *optional*, defaults to `False`):
             Whether or not [`~Wav2Vec2FeatureExtractor.__call__`] should return `attention_mask`.
 
-            
-
-            Wav2Vec2 models that have set `config.feat_extract_norm == "group"`, such as
-            [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base-960h), have **not** been trained using
-            `attention_mask`. For such models, `input_values` should simply be padded with 0 and no `attention_mask`
-            should be passed.
-
-            For Wav2Vec2 models that have set `config.feat_extract_norm == "layer"`, such as
-            [wav2vec2-lv60](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self), `attention_mask` should be
-            passed for batched inference.
-
-            """
+            > [!TIP]
+            > Wav2Vec2 models that have set `config.feat_extract_norm == "group"`, such as
+            > [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base-960h), have **not** been trained using
+            > `attention_mask`. For such models, `input_values` should simply be padded with 0 and no `attention_mask`
+            > should be passed.
+            >
+            > For Wav2Vec2 models that have set `config.feat_extract_norm == "layer"`, such as
+            > [wav2vec2-lv60](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self), `attention_mask` should be
+            > passed for batched inference.
+            """
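To make the group-norm vs. layer-norm distinction concrete, a hedged sketch using the two checkpoints linked in the note above (the dummy audio and the `AutoFeatureExtractor` entry point are assumptions):

```python
import numpy as np
from transformers import AutoFeatureExtractor

raw_speech = [np.zeros(16000, dtype=np.float32), np.zeros(8000, dtype=np.float32)]

# Layer-norm checkpoint: attention_mask is returned and should be passed to the model.
fe_layer = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
layer_batch = fe_layer(raw_speech, sampling_rate=16000, padding=True, return_tensors="pt")
# layer_batch contains input_values and attention_mask.

# Group-norm checkpoint: zero-padding only, no attention_mask expected by the model.
fe_group = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
group_batch = fe_group(raw_speech, sampling_rate=16000, padding=True, return_tensors="pt")
# group_batch typically contains only input_values.
```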
 
     model_input_names = ["input_values", "attention_mask"]
 
@@ -144,18 +141,15 @@ def __call__(
 
                 [What are attention masks?](../glossary#attention-mask)
 
-                
-
-                Wav2Vec2 models that have set `config.feat_extract_norm == "group"`, such as
-                [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base-960h), have **not** been trained using
-                `attention_mask`. For such models, `input_values` should simply be padded with 0 and no
-                `attention_mask` should be passed.
-
-                For Wav2Vec2 models that have set `config.feat_extract_norm == "layer"`, such as
-                [wav2vec2-lv60](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self), `attention_mask` should
-                be passed for batched inference.
-
-                
+                > [!TIP]
+                > Wav2Vec2 models that have set `config.feat_extract_norm == "group"`, such as
+                > [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base-960h), have **not** been trained using
+                > `attention_mask`. For such models, `input_values` should simply be padded with 0 and no
+                > `attention_mask` should be passed.
+                >
+                > For Wav2Vec2 models that have set `config.feat_extract_norm == "layer"`, such as
+                > [wav2vec2-lv60](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self), `attention_mask` should
+                > be passed for batched inference.
 
             return_tensors (`str` or [`~utils.TensorType`], *optional*):
                 If set, will return tensors instead of list of python integers. Acceptable values are:
diff --git a/src/transformers/models/wav2vec2/modeling_wav2vec2.py b/src/transformers/models/wav2vec2/modeling_wav2vec2.py
index c517c26288c1..399089257365 100755
--- a/src/transformers/models/wav2vec2/modeling_wav2vec2.py
+++ b/src/transformers/models/wav2vec2/modeling_wav2vec2.py
@@ -1168,23 +1168,17 @@ def load_adapter(self, target_lang: str, force_load=True, **kwargs):
                 git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
                 identifier allowed by git.
 
-                
-
-                To test a pull request you made on the Hub, you can pass `revision="refs/pr/"`.
-
-                
+                > [!TIP]
+                > To test a pull request you made on the Hub, you can pass `revision="refs/pr/"`.
 
             mirror (`str`, *optional*):
                 Mirror source to accelerate downloads in China. If you are from China and have an accessibility
                 problem, you can set this option to resolve it. Note that we do not guarantee the timeliness or safety.
                 Please refer to the mirror site for more information.
 
-        
-
-        Activate the special ["offline-mode"](https://huggingface.co/transformers/installation.html#offline-mode) to
-        use this method in a firewalled environment.
-
-        
+        > [!TIP]
+        > Activate the special ["offline-mode"](https://huggingface.co/transformers/installation.html#offline-mode) to
+        > use this method in a firewalled environment.
 
         Examples:
 
diff --git a/src/transformers/models/wav2vec2/tokenization_wav2vec2.py b/src/transformers/models/wav2vec2/tokenization_wav2vec2.py
index e9f9ce04b1ba..613c85c5e641 100644
--- a/src/transformers/models/wav2vec2/tokenization_wav2vec2.py
+++ b/src/transformers/models/wav2vec2/tokenization_wav2vec2.py
@@ -469,25 +469,19 @@ def batch_decode(
                 Whether or not to output character offsets. Character offsets can be used in combination with the
                 sampling rate and model downsampling rate to compute the time-stamps of transcribed characters.
 
-                
-
-                Please take a look at the Example of [`~Wav2Vec2CTCTokenizer.decode`] to better understand how to make
-                use of `output_char_offsets`. [`~Wav2Vec2CTCTokenizer.batch_decode`] works the same way with batched
-                output.
-
-                
+                > [!TIP]
+                > Please take a look at the Example of [`~Wav2Vec2CTCTokenizer.decode`] to better understand how to make
+                > use of `output_char_offsets`. [`~Wav2Vec2CTCTokenizer.batch_decode`] works the same way with batched
+                > output.
 
             output_word_offsets (`bool`, *optional*, defaults to `False`):
                 Whether or not to output word offsets. Word offsets can be used in combination with the sampling rate
                 and model downsampling rate to compute the time-stamps of transcribed words.
 
-                
-
-                Please take a look at the Example of [`~Wav2Vec2CTCTokenizer.decode`] to better understand how to make
-                use of `output_word_offsets`. [`~Wav2Vec2CTCTokenizer.batch_decode`] works the same way with batched
-                output.
-
-                
+                > [!TIP]
+                > Please take a look at the Example of [`~Wav2Vec2CTCTokenizer.decode`] to better understand how to make
+                > use of `output_word_offsets`. [`~Wav2Vec2CTCTokenizer.batch_decode`] works the same way with batched
+                > output.
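A hedged sketch of turning `output_word_offsets` into time-stamps, assuming a standard Wav2Vec2 CTC checkpoint and the small LibriSpeech dummy split commonly used in the docs:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor

# Assumed checkpoint and dataset, for illustration only.
processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = torch.argmax(logits, dim=-1)

outputs = processor.tokenizer.decode(pred_ids[0], output_word_offsets=True)
# Convert frame offsets to seconds with the model's downsampling ratio and the sampling rate.
time_per_frame = model.config.inputs_to_logits_ratio / processor.feature_extractor.sampling_rate
word_timestamps = [
    {"word": o["word"], "start": round(o["start_offset"] * time_per_frame, 2), "end": round(o["end_offset"] * time_per_frame, 2)}
    for o in outputs.word_offsets
]
```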
 
             kwargs (additional keyword arguments, *optional*):
                 Will be passed to the underlying model specific decode method.
@@ -542,21 +536,15 @@ def decode(
                 Whether or not to output character offsets. Character offsets can be used in combination with the
                 sampling rate and model downsampling rate to compute the time-stamps of transcribed characters.
 
-                
-
-                Please take a look at the example below to better understand how to make use of `output_char_offsets`.
-
-                
+                > [!TIP]
+                > Please take a look at the example below to better understand how to make use of `output_char_offsets`.
 
             output_word_offsets (`bool`, *optional*, defaults to `False`):
                 Whether or not to output word offsets. Word offsets can be used in combination with the sampling rate
                 and model downsampling rate to compute the time-stamps of transcribed words.
 
-                
-
-                Please take a look at the example below to better understand how to make use of `output_word_offsets`.
-
-                
+                > [!TIP]
+                > Please take a look at the example below to better understand how to make use of `output_word_offsets`.
 
             kwargs (additional keyword arguments, *optional*):
                 Will be passed to the underlying model specific decode method.
@@ -665,18 +653,15 @@ class Wav2Vec2Tokenizer(PreTrainedTokenizer):
         return_attention_mask (`bool`, *optional*, defaults to `False`):
             Whether or not [`~Wav2Vec2Tokenizer.__call__`] should return `attention_mask`.
 
-            
-
-            Wav2Vec2 models that have set `config.feat_extract_norm == "group"`, such as
-            [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base-960h), have **not** been trained using
-            `attention_mask`. For such models, `input_values` should simply be padded with 0 and no `attention_mask`
-            should be passed.
-
-            For Wav2Vec2 models that have set `config.feat_extract_norm == "layer"`, such as
-            [wav2vec2-lv60](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self), `attention_mask` should be
-            passed for batched inference.
-
-            
+            > [!TIP]
+            > Wav2Vec2 models that have set `config.feat_extract_norm == "group"`, such as
+            > [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base-960h), have **not** been trained using
+            > `attention_mask`. For such models, `input_values` should simply be padded with 0 and no `attention_mask`
+            > should be passed.
+            >
+            > For Wav2Vec2 models that have set `config.feat_extract_norm == "layer"`, such as
+            > [wav2vec2-lv60](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self), `attention_mask` should be
+            > passed for batched inference.
 
         **kwargs
             Additional keyword arguments passed along to [`PreTrainedTokenizer`]
diff --git a/src/transformers/models/wav2vec2_phoneme/tokenization_wav2vec2_phoneme.py b/src/transformers/models/wav2vec2_phoneme/tokenization_wav2vec2_phoneme.py
index c819e63fd6cf..8b7bff2552dd 100644
--- a/src/transformers/models/wav2vec2_phoneme/tokenization_wav2vec2_phoneme.py
+++ b/src/transformers/models/wav2vec2_phoneme/tokenization_wav2vec2_phoneme.py
@@ -468,14 +468,11 @@ def decode(
                 Whether or not to output character offsets. Character offsets can be used in combination with the
                 sampling rate and model downsampling rate to compute the time-stamps of transcribed characters.
 
-                
-
-                Please take a look at the Example of [`~models.wav2vec2.tokenization_wav2vec2.decode`] to better
-                understand how to make use of `output_word_offsets`.
-                [`~model.wav2vec2_phoneme.tokenization_wav2vec2_phoneme.batch_decode`] works the same way with
-                phonemes.
-
-                
+                > [!TIP]
+                > Please take a look at the Example of [`~models.wav2vec2.tokenization_wav2vec2.decode`] to better
+                > understand how to make use of `output_char_offsets`.
+                > [`~models.wav2vec2_phoneme.tokenization_wav2vec2_phoneme.batch_decode`] works the same way with
+                > phonemes.
 
             kwargs (additional keyword arguments, *optional*):
                 Will be passed to the underlying model specific decode method.
@@ -521,14 +518,11 @@ def batch_decode(
                 Whether or not to output character offsets. Character offsets can be used in combination with the
                 sampling rate and model downsampling rate to compute the time-stamps of transcribed characters.
 
-                
-
-                Please take a look at the Example of [`~models.wav2vec2.tokenization_wav2vec2.decode`] to better
-                understand how to make use of `output_word_offsets`.
-                [`~model.wav2vec2_phoneme.tokenization_wav2vec2_phoneme.batch_decode`] works analogous with phonemes
-                and batched output.
-
-                
+                > [!TIP]
+                > Please take a look at the Example of [`~models.wav2vec2.tokenization_wav2vec2.decode`] to better
+                > understand how to make use of `output_char_offsets`.
+                > [`~models.wav2vec2_phoneme.tokenization_wav2vec2_phoneme.batch_decode`] works analogously with phonemes
+                > and batched output.
 
             kwargs (additional keyword arguments, *optional*):
                 Will be passed to the underlying model specific decode method.
diff --git a/src/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py b/src/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py
index beb22ca86749..6a174308a2f7 100644
--- a/src/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py
+++ b/src/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py
@@ -122,16 +122,13 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
         r"""
         Instantiate a [`Wav2Vec2ProcessorWithLM`] from a pretrained Wav2Vec2 processor.
 
-        
-
-        This class method is simply calling the feature extractor's
-        [`~feature_extraction_utils.FeatureExtractionMixin.from_pretrained`], Wav2Vec2CTCTokenizer's
-        [`~tokenization_utils_base.PreTrainedTokenizerBase.from_pretrained`], and
-        [`pyctcdecode.BeamSearchDecoderCTC.load_from_hf_hub`].
-
-        Please refer to the docstrings of the methods above for more information.
-
-        
+        > [!TIP]
+        > This class method is simply calling the feature extractor's
+        > [`~feature_extraction_utils.FeatureExtractionMixin.from_pretrained`], Wav2Vec2CTCTokenizer's
+        > [`~tokenization_utils_base.PreTrainedTokenizerBase.from_pretrained`], and
+        > [`pyctcdecode.BeamSearchDecoderCTC.load_from_hf_hub`].
+        >
+        > Please refer to the docstrings of the methods above for more information.
 
         Args:
             pretrained_model_name_or_path (`str` or `os.PathLike`):
@@ -309,15 +306,12 @@ def batch_decode(
         """
         Batch decode output logits to audio transcription with language model support.
 
-        
-
-        This function makes use of Python's multiprocessing. Currently, multiprocessing is available only on Unix
-        systems (see this [issue](https://github.com/kensho-technologies/pyctcdecode/issues/65)).
-
-        If you are decoding multiple batches, consider creating a `Pool` and passing it to `batch_decode`. Otherwise,
-        `batch_decode` will be very slow since it will create a fresh `Pool` for each call. See usage example below.
-
-        
+        > [!TIP]
+        > This function makes use of Python's multiprocessing. Currently, multiprocessing is available only on Unix
+        > systems (see this [issue](https://github.com/kensho-technologies/pyctcdecode/issues/65)).
+        >
+        > If you are decoding multiple batches, consider creating a `Pool` and passing it to `batch_decode`. Otherwise,
+        > `batch_decode` will be very slow since it will create a fresh `Pool` for each call. See usage example below.
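A short, hedged sketch of the pool-reuse pattern the note points to (here `processor` is assumed to be an instantiated `Wav2Vec2ProcessorWithLM` and `logits` the acoustic model's output):

```python
from multiprocessing import get_context

# Create the pool *after* the processor so the LM is visible to the forked sub-processes,
# and reuse it across calls instead of letting batch_decode spawn a fresh pool each time.
with get_context("fork").Pool(processes=2) as pool:
    transcriptions = processor.batch_decode(logits.numpy(), pool=pool).text
```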
 
         Args:
             logits (`np.ndarray`):
@@ -327,12 +321,9 @@ def batch_decode(
                 should be instantiated *after* `Wav2Vec2ProcessorWithLM`. Otherwise, the LM won't be available to the
                 pool's sub-processes.
 
-                
-
-                Currently, only pools created with a 'fork' context can be used. If a 'spawn' pool is passed, it will
-                be ignored and sequential decoding will be used instead.
-
-                
+                > [!TIP]
+                > Currently, only pools created with a 'fork' context can be used. If a 'spawn' pool is passed, it will
+                > be ignored and sequential decoding will be used instead.
 
             num_processes (`int`, *optional*):
                 If `pool` is not set, number of processes on which the function should be parallelized over. Defaults
@@ -365,13 +356,10 @@ def batch_decode(
                 lists of floats, where the length of the outer list will correspond to the batch size and the length of
                 the inner list will correspond to the number of returned hypotheses . The value should be >= 1.
 
-                
-
-                Please take a look at the Example of [`~Wav2Vec2ProcessorWithLM.decode`] to better understand how to
-                make use of `output_word_offsets`. [`~Wav2Vec2ProcessorWithLM.batch_decode`] works the same way with
-                batched output.
-
-                
+                > [!TIP]
+                > Please take a look at the Example of [`~Wav2Vec2ProcessorWithLM.decode`] to better understand how to
+                > make use of `output_word_offsets`. [`~Wav2Vec2ProcessorWithLM.batch_decode`] works the same way with
+                > batched output.
 
         Returns:
             [`~models.wav2vec2.Wav2Vec2DecoderWithLMOutput`].
@@ -523,11 +511,8 @@ def decode(
                 of strings, `logit_score` will be a list of floats, and `lm_score` will be a list of floats, where the
                 length of these lists will correspond to the number of returned hypotheses. The value should be >= 1.
 
-                
-
-                Please take a look at the example below to better understand how to make use of `output_word_offsets`.
-
-                
+                > [!TIP]
+                > Please take a look at the example below to better understand how to make use of `output_word_offsets`.
 
         Returns:
             [`~models.wav2vec2.Wav2Vec2DecoderWithLMOutput`].
diff --git a/src/transformers/models/whisper/feature_extraction_whisper.py b/src/transformers/models/whisper/feature_extraction_whisper.py
index e11895191f95..f49204f06eff 100644
--- a/src/transformers/models/whisper/feature_extraction_whisper.py
+++ b/src/transformers/models/whisper/feature_extraction_whisper.py
@@ -226,12 +226,9 @@ def __call__(
 
                 [What are attention masks?](../glossary#attention-mask)
 
-                
-
-                For Whisper models, `attention_mask` should always be passed for batched inference, to avoid subtle
-                bugs.
-
-                
+                > [!TIP]
+                > For Whisper models, `attention_mask` should always be passed for batched inference, to avoid subtle
+                > bugs.
 
             return_tensors (`str` or [`~utils.TensorType`], *optional*):
                 If set, will return tensors instead of list of python integers. Acceptable values are:
diff --git a/src/transformers/models/whisper/generation_whisper.py b/src/transformers/models/whisper/generation_whisper.py
index 9c4f0f6e1d63..75011dd5647a 100644
--- a/src/transformers/models/whisper/generation_whisper.py
+++ b/src/transformers/models/whisper/generation_whisper.py
@@ -416,16 +416,13 @@ def generate(
         """
         Transcribes or translates log-mel input features to a sequence of auto-regressively generated token ids.
 
-        
-
-        Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
-        model's default generation configuration. You can override any `generation_config` by passing the corresponding
-        parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
-
-        For an overview of generation strategies and code examples, check out the [following
-        guide](./generation_strategies).
-
-        
+        > [!WARNING]
+        > Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
+        > model's default generation configuration. You can override any `generation_config` by passing the corresponding
+        > parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
+        >
+        > For an overview of generation strategies and code examples, check out the [following
+        > guide](./generation_strategies).
 
         Parameters:
             input_features (`torch.Tensor` of shape `(batch_size, feature_size, sequence_length)`, *optional*):
diff --git a/src/transformers/models/xglm/tokenization_xglm.py b/src/transformers/models/xglm/tokenization_xglm.py
index 9e0a8706683f..090bbaa89f1d 100644
--- a/src/transformers/models/xglm/tokenization_xglm.py
+++ b/src/transformers/models/xglm/tokenization_xglm.py
@@ -47,22 +47,16 @@ class XGLMTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/xglm/tokenization_xglm_fast.py b/src/transformers/models/xglm/tokenization_xglm_fast.py
index a9c8b3aac257..82d0b50cf3c3 100644
--- a/src/transformers/models/xglm/tokenization_xglm_fast.py
+++ b/src/transformers/models/xglm/tokenization_xglm_fast.py
@@ -48,22 +48,16 @@ class XGLMTokenizerFast(PreTrainedTokenizerFast):
         bos_token (`str`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/xlm/modeling_xlm.py b/src/transformers/models/xlm/modeling_xlm.py
index 6fd21d0490de..c5fd12212203 100755
--- a/src/transformers/models/xlm/modeling_xlm.py
+++ b/src/transformers/models/xlm/modeling_xlm.py
@@ -184,12 +184,9 @@ def forward(
                 Mask for tokens at invalid position, such as query and special symbols (PAD, SEP, CLS). 1.0 means token
                 should be masked.
 
-        
-
-        One of `start_states` or `start_positions` should be not `None`. If both are set, `start_positions` overrides
-        `start_states`.
-
-        
+        > [!TIP]
+        > One of `start_states` or `start_positions` should not be `None`. If both are set, `start_positions` overrides
+        > `start_states`.
 
         Returns:
             `torch.FloatTensor`: The end logits for SQuAD.
@@ -250,12 +247,9 @@ def forward(
             cls_index (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
                 Position of the CLS token for each sentence in the batch. If `None`, takes the last token.
 
-        
-
-        One of `start_states` or `start_positions` should be not `None`. If both are set, `start_positions` overrides
-        `start_states`.
-
-        
+        > [!TIP]
+        > One of `start_states` or `start_positions` should not be `None`. If both are set, `start_positions` overrides
+        > `start_states`.
 
         Returns:
             `torch.FloatTensor`: The SQuAD 2.0 answer class.
diff --git a/src/transformers/models/xlm/tokenization_xlm.py b/src/transformers/models/xlm/tokenization_xlm.py
index 8c4471a38436..5bf7500f53e5 100644
--- a/src/transformers/models/xlm/tokenization_xlm.py
+++ b/src/transformers/models/xlm/tokenization_xlm.py
@@ -160,12 +160,9 @@ class XLMTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py b/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py
index 149a09f5ed61..650ea33cb695 100644
--- a/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py
+++ b/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py
@@ -47,22 +47,16 @@ class XLMRobertaTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py b/src/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py
index bcdea2325fc1..0155f07cda40 100644
--- a/src/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py
+++ b/src/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py
@@ -49,22 +49,16 @@ class XLMRobertaTokenizerFast(PreTrainedTokenizerFast):
         bos_token (`str`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         sep_token (`str`, *optional*, defaults to `""`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
diff --git a/src/transformers/models/xlnet/configuration_xlnet.py b/src/transformers/models/xlnet/configuration_xlnet.py
index d32f05c875bb..7a371afd8a5b 100644
--- a/src/transformers/models/xlnet/configuration_xlnet.py
+++ b/src/transformers/models/xlnet/configuration_xlnet.py
@@ -105,16 +105,13 @@ class XLNetConfig(PretrainedConfig):
         use_mems_train (`bool`, *optional*, defaults to `False`):
             Whether or not the model should make use of the recurrent memory mechanism in train mode.
 
-            
-
-            For pretraining, it is recommended to set `use_mems_train` to `True`. For fine-tuning, it is recommended to
-            set `use_mems_train` to `False` as discussed
-            [here](https://github.com/zihangdai/xlnet/issues/41#issuecomment-505102587). If `use_mems_train` is set to
-            `True`, one has to make sure that the train batches are correctly pre-processed, *e.g.* `batch_1 = [[This
-            line is], [This is the]]` and `batch_2 = [[ the first line], [ second line]]` and that all batches are of
-            equal size.
-
-            
+            > [!TIP]
+            > For pretraining, it is recommended to set `use_mems_train` to `True`. For fine-tuning, it is recommended to
+            > set `use_mems_train` to `False` as discussed
+            > [here](https://github.com/zihangdai/xlnet/issues/41#issuecomment-505102587). If `use_mems_train` is set to
+            > `True`, one has to make sure that the train batches are correctly pre-processed, *e.g.* `batch_1 = [[This
+            > line is], [This is the]]` and `batch_2 = [[ the first line], [ second line]]` and that all batches are of
+            > equal size.
 
     Examples:
 
diff --git a/src/transformers/models/xlnet/modeling_xlnet.py b/src/transformers/models/xlnet/modeling_xlnet.py
index 48fb1b41a61f..95acab50f504 100755
--- a/src/transformers/models/xlnet/modeling_xlnet.py
+++ b/src/transformers/models/xlnet/modeling_xlnet.py
@@ -433,12 +433,9 @@ def forward(
                 Mask for tokens at invalid position, such as query and special symbols (PAD, SEP, CLS). 1.0 means token
                 should be masked.
 
-        
-
-        One of `start_states` or `start_positions` should be not `None`. If both are set, `start_positions` overrides
-        `start_states`.
-
-        
+        > [!TIP]
+        > One of `start_states` or `start_positions` should not be `None`. If both are set, `start_positions` overrides
+        > `start_states`.
 
         Returns:
             `torch.FloatTensor`: The end logits for SQuAD.
@@ -500,12 +497,9 @@ def forward(
             cls_index (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
                 Position of the CLS token for each sentence in the batch. If `None`, takes the last token.
 
-        
-
-        One of `start_states` or `start_positions` should be not `None`. If both are set, `start_positions` overrides
-        `start_states`.
-
-        
+        > [!TIP]
+        > One of `start_states` or `start_positions` should not be `None`. If both are set, `start_positions` overrides
+        > `start_states`.
 
         Returns:
             `torch.FloatTensor`: The SQuAD 2.0 answer class.
diff --git a/src/transformers/models/xlnet/tokenization_xlnet.py b/src/transformers/models/xlnet/tokenization_xlnet.py
index 9186db33d788..9477ca9ff4b6 100644
--- a/src/transformers/models/xlnet/tokenization_xlnet.py
+++ b/src/transformers/models/xlnet/tokenization_xlnet.py
@@ -60,22 +60,16 @@ class XLNetTokenizer(PreTrainedTokenizer):
         bos_token (`str`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         unk_token (`str`, *optional*, defaults to `""`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
diff --git a/src/transformers/models/xlnet/tokenization_xlnet_fast.py b/src/transformers/models/xlnet/tokenization_xlnet_fast.py
index 56cd2a50e1b2..ca6af63eb12b 100644
--- a/src/transformers/models/xlnet/tokenization_xlnet_fast.py
+++ b/src/transformers/models/xlnet/tokenization_xlnet_fast.py
@@ -65,22 +65,16 @@ class XLNetTokenizerFast(PreTrainedTokenizerFast):
         bos_token (`str`, *optional*, defaults to `""`):
             The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the beginning of
+            > sequence. The token used is the `cls_token`.
 
         eos_token (`str`, *optional*, defaults to `""`):
             The end of sequence token.
 
-            
-
-            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
-            The token used is the `sep_token`.
-
-            
+            > [!TIP]
+            > When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            > The token used is the `sep_token`.
 
         unk_token (`str`, *optional*, defaults to `""`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
diff --git a/src/transformers/pipelines/__init__.py b/src/transformers/pipelines/__init__.py
index a029bb32df03..7a55d1599246 100755
--- a/src/transformers/pipelines/__init__.py
+++ b/src/transformers/pipelines/__init__.py
@@ -540,12 +540,11 @@ def pipeline(
         - A [model](model) that generates predictions from the inputs.
         - Optional post-processing steps to refine the model's output, which can also be handled by processors.
 
-    
-    While there are such optional arguments as `tokenizer`, `feature_extractor`, `image_processor`, and `processor`,
-    they shouldn't be specified all at once. If these components are not provided, `pipeline` will try to load
-    required ones automatically. In case you want to provide these components explicitly, please refer to a
-    specific pipeline in order to get more details regarding what components are required.
-    
+    > [!TIP]
+    > Although `tokenizer`, `feature_extractor`, `image_processor`, and `processor` are optional arguments, they
+    > shouldn't all be specified at once. If these components are not provided, `pipeline` will try to load the
+    > required ones automatically. If you want to provide these components explicitly, refer to the documentation of
+    > the specific pipeline for details on which components it requires.
 
     Args:
         task (`str`):
@@ -652,11 +651,8 @@ def pipeline(
             [here](https://huggingface.co/docs/accelerate/main/en/package_reference/big_modeling#accelerate.cpu_offload)
             for more information).
 
-            
-
-            Do not use `device_map` AND `device` at the same time as they will conflict
-
-            
+            > [!WARNING]
+            > Do not use `device_map` AND `device` at the same time as they will conflict.
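+            >
+            > For example, pass one or the other, but not both (a minimal sketch; the model name and device index are illustrative):
+            >
+            > ```python
+            > >>> from transformers import pipeline
+            >
+            > >>> # either let Accelerate place the model across the available devices...
+            > >>> pipe = pipeline("text-generation", model="openai-community/gpt2", device_map="auto")
+            > >>> # ...or pin the whole pipeline to a single device yourself
+            > >>> pipe = pipeline("text-generation", model="openai-community/gpt2", device=0)
+            > ```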
 
         dtype (`str` or `torch.dtype`, *optional*):
             Sent directly as `model_kwargs` (just a simpler shortcut) to use the available precision for this model
diff --git a/src/transformers/pipelines/automatic_speech_recognition.py b/src/transformers/pipelines/automatic_speech_recognition.py
index 1f3c21526169..53e61461c90d 100644
--- a/src/transformers/pipelines/automatic_speech_recognition.py
+++ b/src/transformers/pipelines/automatic_speech_recognition.py
@@ -149,24 +149,18 @@ class AutomaticSpeechRecognitionPipeline(ChunkPipeline):
         chunk_length_s (`float`, *optional*, defaults to 0):
             The input length for in each chunk. If `chunk_length_s = 0` then chunking is disabled (default).
 
-            
-
-            For more information on how to effectively use `chunk_length_s`, please have a look at the [ASR chunking
-            blog post](https://huggingface.co/blog/asr-chunking).
-
-            
+            > [!TIP]
+            > For more information on how to effectively use `chunk_length_s`, please have a look at the [ASR chunking
+            > blog post](https://huggingface.co/blog/asr-chunking).
 
         stride_length_s (`float`, *optional*, defaults to `chunk_length_s / 6`):
             The length of stride on the left and right of each chunk. Used only with `chunk_length_s > 0`. This enables
             the model to *see* more context and infer letters better than without this context but the pipeline
             discards the stride bits at the end to make the final reconstitution as perfect as possible.
 
-            
-
-            For more information on how to effectively use `stride_length_s`, please have a look at the [ASR chunking
-            blog post](https://huggingface.co/blog/asr-chunking).
-
-            
+            > [!TIP]
+            > For more information on how to effectively use `stride_length_s`, please have a look at the [ASR chunking
+            > blog post](https://huggingface.co/blog/asr-chunking).
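+            >
+            > For example, a minimal sketch of chunked transcription (the checkpoint, file name, and chunk sizes are illustrative):
+            >
+            > ```python
+            > >>> from transformers import pipeline
+            >
+            > >>> asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
+            > >>> # split long audio into 30s chunks with 5s of context striding on each side
+            > >>> asr("long_audio.wav", chunk_length_s=30, stride_length_s=5)
+            > ```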
 
         device (Union[`int`, `torch.device`], *optional*):
             Device ordinal for CPU/GPU supports. Setting this to `None` will leverage CPU, a positive will run the
diff --git a/src/transformers/pipelines/fill_mask.py b/src/transformers/pipelines/fill_mask.py
index 11810bc2bea3..10d7f7f0a1ad 100644
--- a/src/transformers/pipelines/fill_mask.py
+++ b/src/transformers/pipelines/fill_mask.py
@@ -54,31 +54,24 @@ class FillMaskPipeline(Pipeline):
     which includes the bi-directional models in the library. See the up-to-date list of available models on
     [huggingface.co/models](https://huggingface.co/models?filter=fill-mask).
 
-    
-
-    This pipeline only works for inputs with exactly one token masked. Experimental: We added support for multiple
-    masks. The returned values are raw model output, and correspond to disjoint probabilities where one might expect
-    joint probabilities (See [discussion](https://github.com/huggingface/transformers/pull/10222)).
-
-    
-
-    
-
-    This pipeline now supports tokenizer_kwargs. For example try:
-
-    ```python
-    >>> from transformers import pipeline
-
-    >>> fill_masker = pipeline(model="google-bert/bert-base-uncased")
-    >>> tokenizer_kwargs = {"truncation": True}
-    >>> fill_masker(
-    ...     "This is a simple [MASK]. " + "...with a large amount of repeated text appended. " * 100,
-    ...     tokenizer_kwargs=tokenizer_kwargs,
-    ... )
-    ```
-
-
-    
+    > [!TIP]
+    > This pipeline only works for inputs with exactly one token masked. Experimental: We added support for multiple
+    > masks. The returned values are raw model output, and correspond to disjoint probabilities where one might expect
+    > joint probabilities (See [discussion](https://github.com/huggingface/transformers/pull/10222)).
+
+    > [!TIP]
+    > This pipeline now supports `tokenizer_kwargs`. For example, try:
+    >
+    > ```python
+    > >>> from transformers import pipeline
+    >
+    > >>> fill_masker = pipeline(model="google-bert/bert-base-uncased")
+    > >>> tokenizer_kwargs = {"truncation": True}
+    > >>> fill_masker(
+    > ...     "This is a simple [MASK]. " + "...with a large amount of repeated text appended. " * 100,
+    > ...     tokenizer_kwargs=tokenizer_kwargs,
+    > ... )
+    > ```
 
 
     """
diff --git a/src/transformers/pipelines/text_to_audio.py b/src/transformers/pipelines/text_to_audio.py
index d43695b37399..c08aab16231b 100644
--- a/src/transformers/pipelines/text_to_audio.py
+++ b/src/transformers/pipelines/text_to_audio.py
@@ -50,29 +50,26 @@ class TextToAudioPipeline(Pipeline):
 
     Learn more about the basics of using a pipeline in the [pipeline tutorial](../pipeline_tutorial)
 
-    
-
-    You can specify parameters passed to the model by using [`TextToAudioPipeline.__call__.forward_params`] or
-    [`TextToAudioPipeline.__call__.generate_kwargs`].
-
-    Example:
-
-    ```python
-    >>> from transformers import pipeline
-
-    >>> music_generator = pipeline(task="text-to-audio", model="facebook/musicgen-small")
-
-    >>> # diversify the music generation by adding randomness with a high temperature and set a maximum music length
-    >>> generate_kwargs = {
-    ...     "do_sample": True,
-    ...     "temperature": 0.7,
-    ...     "max_new_tokens": 35,
-    ... }
-
-    >>> outputs = music_generator("Techno music with high melodic riffs", generate_kwargs=generate_kwargs)
-    ```
-
-    
+    > [!TIP]
+    > You can specify parameters passed to the model by using [`TextToAudioPipeline.__call__.forward_params`] or
+    > [`TextToAudioPipeline.__call__.generate_kwargs`].
+    >
+    > Example:
+    >
+    > ```python
+    > >>> from transformers import pipeline
+    >
+    > >>> music_generator = pipeline(task="text-to-audio", model="facebook/musicgen-small")
+    >
+    > >>> # diversify the music generation by adding randomness with a high temperature and set a maximum music length
+    > >>> generate_kwargs = {
+    > ...     "do_sample": True,
+    > ...     "temperature": 0.7,
+    > ...     "max_new_tokens": 35,
+    > ... }
+    >
+    > >>> outputs = music_generator("Techno music with high melodic riffs", generate_kwargs=generate_kwargs)
+    > ```
 
     This pipeline can currently be loaded from [`pipeline`] using the following task identifiers: `"text-to-speech"` or
     `"text-to-audio"`.
diff --git a/src/transformers/pipelines/zero_shot_audio_classification.py b/src/transformers/pipelines/zero_shot_audio_classification.py
index 7d5e36e5dd08..0c74d7c940eb 100644
--- a/src/transformers/pipelines/zero_shot_audio_classification.py
+++ b/src/transformers/pipelines/zero_shot_audio_classification.py
@@ -35,11 +35,8 @@ class ZeroShotAudioClassificationPipeline(Pipeline):
     Zero shot audio classification pipeline using `ClapModel`. This pipeline predicts the class of an audio when you
     provide an audio and a set of `candidate_labels`.
 
-    
-
-    The default `hypothesis_template` is : `"This is a sound of {}."`. Make sure you update it for your usage.
-
-    
+    > [!WARNING]
+    > The default `hypothesis_template` is: `"This is a sound of {}."`. Make sure you update it for your usage.
 
     Example:
     ```python
diff --git a/src/transformers/processing_utils.py b/src/transformers/processing_utils.py
index 5f3f455662e3..a60bec885907 100644
--- a/src/transformers/processing_utils.py
+++ b/src/transformers/processing_utils.py
@@ -715,13 +715,10 @@ def save_pretrained(self, save_directory, push_to_hub: bool = False, legacy_seri
         Saves the attributes of this processor (feature extractor, tokenizer...) in the specified directory so that it
         can be reloaded using the [`~ProcessorMixin.from_pretrained`] method.
 
-        
-
-        This class method is simply calling [`~feature_extraction_utils.FeatureExtractionMixin.save_pretrained`] and
-        [`~tokenization_utils_base.PreTrainedTokenizerBase.save_pretrained`]. Please refer to the docstrings of the
-        methods above for more information.
-
-        
+        > [!TIP]
+        > This class method is simply calling [`~feature_extraction_utils.FeatureExtractionMixin.save_pretrained`] and
+        > [`~tokenization_utils_base.PreTrainedTokenizerBase.save_pretrained`]. Please refer to the docstrings of the
+        > methods above for more information.
 
         Args:
             save_directory (`str` or `os.PathLike`):
@@ -1344,15 +1341,12 @@ def from_pretrained(
         r"""
         Instantiate a processor associated with a pretrained model.
 
-        
-
-        This class method is simply calling the feature extractor
-        [`~feature_extraction_utils.FeatureExtractionMixin.from_pretrained`], image processor
-        [`~image_processing_utils.ImageProcessingMixin`] and the tokenizer
-        [`~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`] methods. Please refer to the docstrings of the
-        methods above for more information.
-
-        
+        > [!TIP]
+        > This class method is simply calling the feature extractor
+        > [`~feature_extraction_utils.FeatureExtractionMixin.from_pretrained`], image processor
+        > [`~image_processing_utils.ImageProcessingMixin.from_pretrained`] and the tokenizer
+        > [`~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`] methods. Please refer to the docstrings of the
+        > methods above for more information.
 
         Args:
             pretrained_model_name_or_path (`str` or `os.PathLike`):
diff --git a/src/transformers/tokenization_mistral_common.py b/src/transformers/tokenization_mistral_common.py
index d8ea3688efae..db85177d2d2c 100644
--- a/src/transformers/tokenization_mistral_common.py
+++ b/src/transformers/tokenization_mistral_common.py
@@ -1150,13 +1150,10 @@ def pad(
 
         Padding side (left/right) padding token ids are defined at the tokenizer level (with `self.padding_side`,
         `self.pad_token_id`).
-        
-
-        If the `encoded_inputs` passed are dictionary of numpy arrays, PyTorch tensors, the
-        result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of
-        PyTorch tensors, you will lose the specific device of your tensors however.
-
-        
+        > [!TIP]
+        > If the `encoded_inputs` passed are a dictionary of numpy arrays or PyTorch tensors, the
+        > result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of
+        > PyTorch tensors, however, you will lose the specific device of your tensors.
 
         Args:
             encoded_inputs ([`BatchEncoding`], list of [`BatchEncoding`], `Dict[str, List[int]]`, `Dict[str, List[List[int]]` or `List[Dict[str, List[int]]]`):
diff --git a/src/transformers/tokenization_utils.py b/src/transformers/tokenization_utils.py
index b89e57093152..6a919fd07211 100644
--- a/src/transformers/tokenization_utils.py
+++ b/src/transformers/tokenization_utils.py
@@ -599,12 +599,9 @@ def num_special_tokens_to_add(self, pair: bool = False) -> int:
         """
         Returns the number of added tokens when encoding a sequence with special tokens.
 
-        
-
-        This encodes a dummy input and checks the number of added tokens, and is therefore not efficient. Do not put
-        this inside your training loop.
-
-        
+        > [!TIP]
+        > This encodes a dummy input and checks the number of added tokens, and is therefore not efficient. Do not put
+        > this inside your training loop.
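+        >
+        > For example, a minimal sketch (the checkpoint is illustrative):
+        >
+        > ```python
+        > >>> from transformers import AutoTokenizer
+        >
+        > >>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
+        > >>> # compute this once, outside of any training loop
+        > >>> num_added = tokenizer.num_special_tokens_to_add(pair=False)
+        > >>> max_text_length = tokenizer.model_max_length - num_added
+        > ```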
 
         Args:
             pair (`bool`, *optional*, defaults to `False`):
diff --git a/src/transformers/tokenization_utils_base.py b/src/transformers/tokenization_utils_base.py
index 74550cb0f6ab..d3bd45776850 100644
--- a/src/transformers/tokenization_utils_base.py
+++ b/src/transformers/tokenization_utils_base.py
@@ -1855,11 +1855,8 @@ def from_pretrained(
                 `eos_token`, `unk_token`, `sep_token`, `pad_token`, `cls_token`, `mask_token`,
                 `additional_special_tokens`. See parameters in the `__init__` for more details.
 
-        
-
-        Passing `token=True` is required when you want to use a private model.
-
-        
+        > [!TIP]
+        > Passing `token=True` is required when you want to use a private model.
 
         Examples:
 
@@ -3051,11 +3048,8 @@ def encode_plus(
         """
         Tokenize and prepare for the model a sequence or a pair of sequences.
 
-        
-
-        This method is deprecated, `__call__` should be used instead.
-
-        
+        > [!WARNING]
+        > This method is deprecated; `__call__` should be used instead.
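+        >
+        > For example, a minimal sketch of the equivalent call (the checkpoint is illustrative):
+        >
+        > ```python
+        > >>> from transformers import AutoTokenizer
+        >
+        > >>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
+        > >>> # instead of tokenizer.encode_plus("Hello world"), call the tokenizer directly
+        > >>> encoding = tokenizer("Hello world", truncation=True, return_tensors="pt")
+        > ```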
 
         Args:
             text (`str`, `list[str]` or (for non-fast tokenizers) `list[int]`):
@@ -3158,11 +3152,8 @@ def batch_encode_plus(
         """
         Tokenize and prepare for the model a list of sequences or a list of pairs of sequences.
 
-        
-
-        This method is deprecated, `__call__` should be used instead.
-
-        
+        > [!WARNING]
+        > This method is deprecated; `__call__` should be used instead.
 
         Args:
             batch_text_or_text_pairs (`list[str]`, `list[tuple[str, str]]`, `list[list[str]]`, `list[tuple[list[str], list[str]]]`, and for not-fast tokenizers, also `list[list[int]]`, `list[tuple[list[int], list[int]]]`):
@@ -3261,13 +3252,10 @@ def pad(
         Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the
         text followed by a call to the `pad` method to get a padded encoding.
 
-        
-
-        If the `encoded_inputs` passed are dictionary of numpy arrays, or PyTorch tensors, the
-        result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of
-        PyTorch tensors, you will lose the specific device of your tensors however.
-
-        
+        > [!TIP]
+        > If the `encoded_inputs` passed are a dictionary of numpy arrays or PyTorch tensors, the
+        > result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of
+        > PyTorch tensors, however, you will lose the specific device of your tensors.
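+        >
+        > For example, a minimal sketch (the checkpoint is illustrative):
+        >
+        > ```python
+        > >>> from transformers import AutoTokenizer
+        >
+        > >>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
+        > >>> encodings = [tokenizer(text) for text in ["short", "a somewhat longer example"]]
+        > >>> # pad the list of encodings to a common length and get PyTorch tensors back
+        > >>> batch = tokenizer.pad(encodings, padding=True, return_tensors="pt")
+        > ```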
 
         Args:
             encoded_inputs ([`BatchEncoding`], list of [`BatchEncoding`], `dict[str, list[int]]`, `dict[str, list[list[int]]` or `list[dict[str, list[int]]]`):
diff --git a/src/transformers/tokenization_utils_fast.py b/src/transformers/tokenization_utils_fast.py
index fe4873d61b37..9bc4e6fb62dc 100644
--- a/src/transformers/tokenization_utils_fast.py
+++ b/src/transformers/tokenization_utils_fast.py
@@ -386,12 +386,9 @@ def num_special_tokens_to_add(self, pair: bool = False) -> int:
         """
         Returns the number of added tokens when encoding a sequence with special tokens.
 
-        
-
-        This encodes a dummy input and checks the number of added tokens, and is therefore not efficient. Do not put
-        this inside your training loop.
-
-        
+        > [!TIP]
+        > This encodes a dummy input and checks the number of added tokens, and is therefore not efficient. Do not put
+        > this inside your training loop.
 
         Args:
             pair (`bool`, *optional*, defaults to `False`):
diff --git a/src/transformers/trainer.py b/src/transformers/trainer.py
index 27adca9c836e..1133ce863418 100755
--- a/src/transformers/trainer.py
+++ b/src/transformers/trainer.py
@@ -311,13 +311,10 @@ class Trainer:
         model ([`PreTrainedModel`] or `torch.nn.Module`, *optional*):
             The model to train, evaluate or use for predictions. If not provided, a `model_init` must be passed.
 
-            
-
-            [`Trainer`] is optimized to work with the [`PreTrainedModel`] provided by the library. You can still use
-            your own models defined as `torch.nn.Module` as long as they work the same way as the 🤗 Transformers
-            models.
-
-            
+            > [!TIP]
+            > [`Trainer`] is optimized to work with the [`PreTrainedModel`] provided by the library. You can still use
+            > your own models defined as `torch.nn.Module` as long as they work the same way as the 🤗 Transformers
+            > models.
 
         args ([`TrainingArguments`], *optional*):
             The arguments to tweak for training. Will default to a basic instance of [`TrainingArguments`] with the
@@ -3541,14 +3538,11 @@ def hyperparameter_search(
         by `compute_objective`, which defaults to a function returning the evaluation loss when no metric is provided,
         the sum of all metrics otherwise.
 
-        
-
-        To use this method, you need to have provided a `model_init` when initializing your [`Trainer`]: we need to
-        reinitialize the model at each new run. This is incompatible with the `optimizers` argument, so you need to
-        subclass [`Trainer`] and override the method [`~Trainer.create_optimizer_and_scheduler`] for custom
-        optimizer/scheduler.
-
-        
+        > [!WARNING]
+        > To use this method, you need to have provided a `model_init` when initializing your [`Trainer`]: we need to
+        > reinitialize the model at each new run. This is incompatible with the `optimizers` argument, so you need to
+        > subclass [`Trainer`] and override the method [`~Trainer.create_optimizer_and_scheduler`] for custom
+        > optimizer/scheduler.
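+        >
+        > For example, a minimal sketch (assumes `training_args`, `train_ds` and `eval_ds` are already defined; the search space is illustrative):
+        >
+        > ```python
+        > >>> from transformers import AutoModelForSequenceClassification, Trainer
+        >
+        > >>> def model_init():
+        > ...     return AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased")
+        >
+        > >>> trainer = Trainer(model_init=model_init, args=training_args, train_dataset=train_ds, eval_dataset=eval_ds)
+        > >>> best_run = trainer.hyperparameter_search(
+        > ...     hp_space=lambda trial: {"learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)},
+        > ...     n_trials=10,
+        > ...     direction="minimize",
+        > ... )
+        > ```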
 
         Args:
             hp_space (`Callable[["optuna.Trial"], dict[str, float]]`, *optional*):
@@ -4268,17 +4262,14 @@ def evaluate(
                 evaluate on each dataset, prepending the dictionary key to the metric name. Datasets must implement the
                 `__len__` method.
 
-                
-
-                If you pass a dictionary with names of datasets as keys and datasets as values, evaluate will run
-                separate evaluations on each dataset. This can be useful to monitor how training affects other
-                datasets or simply to get a more fine-grained evaluation.
-                When used with `load_best_model_at_end`, make sure `metric_for_best_model` references exactly one
-                of the datasets. If you, for example, pass in `{"data1": data1, "data2": data2}` for two datasets
-                `data1` and `data2`, you could specify `metric_for_best_model="eval_data1_loss"` for using the
-                loss on `data1` and `metric_for_best_model="eval_data2_loss"` for the loss on `data2`.
-
-                
+                > [!TIP]
+                > If you pass a dictionary with names of datasets as keys and datasets as values, evaluate will run
+                > separate evaluations on each dataset. This can be useful to monitor how training affects other
+                > datasets or simply to get a more fine-grained evaluation.
+                > When used with `load_best_model_at_end`, make sure `metric_for_best_model` references exactly one
+                > of the datasets. If you, for example, pass in `{"data1": data1, "data2": data2}` for two datasets
+                > `data1` and `data2`, you could specify `metric_for_best_model="eval_data1_loss"` for using the
+                > loss on `data1` and `metric_for_best_model="eval_data2_loss"` for the loss on `data2`.
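+                >
+                > For example, a minimal sketch (assumes `trainer`, `data1` and `data2` are already defined):
+                >
+                > ```python
+                > >>> metrics = trainer.evaluate(eval_dataset={"data1": data1, "data2": data2})
+                > >>> # metrics now contains keys such as "eval_data1_loss" and "eval_data2_loss"
+                > ```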
 
             ignore_keys (`list[str]`, *optional*):
                 A list of keys in the output of your model (if it is a dictionary) that should be ignored when
@@ -4370,13 +4361,10 @@ def predict(
                 An optional prefix to be used as the metrics key prefix. For example the metrics "bleu" will be named
                 "test_bleu" if the prefix is "test" (default)
 
-        
-
-        If your predictions or labels have different sequence length (for instance because you're doing dynamic padding
-        in a token classification task) the predictions will be padded (on the right) to allow for concatenation into
-        one array. The padding index is -100.
-
-        
+        > [!TIP]
+        > If your predictions or labels have different sequence lengths (for instance because you're doing dynamic padding
+        > in a token classification task) the predictions will be padded (on the right) to allow for concatenation into
+        > one array. The padding index is -100.
 
         Returns: *NamedTuple* A namedtuple with the following keys:
 
diff --git a/src/transformers/trainer_callback.py b/src/transformers/trainer_callback.py
index c72bdbb70bcd..5f2e356418c4 100644
--- a/src/transformers/trainer_callback.py
+++ b/src/transformers/trainer_callback.py
@@ -38,13 +38,10 @@ class TrainerState:
     A class containing the [`Trainer`] inner state that will be saved along the model and optimizer when checkpointing
     and passed to the [`TrainerCallback`].
 
-    
-
-    In all this class, one step is to be understood as one update step. When using gradient accumulation, one update
-    step may require several forward and backward passes: if you use `gradient_accumulation_steps=n`, then one update
-    step requires going through *n* batches.
-
-    
+    > [!TIP]
+    > Throughout this class, one step is to be understood as one update step. When using gradient accumulation, one update
+    > step may require several forward and backward passes: if you use `gradient_accumulation_steps=n`, then one update
+    > step requires going through *n* batches.
 
     Args:
         epoch (`float`, *optional*):
diff --git a/src/transformers/trainer_pt_utils.py b/src/transformers/trainer_pt_utils.py
index b1cb1f551ac5..3a25228112f3 100644
--- a/src/transformers/trainer_pt_utils.py
+++ b/src/transformers/trainer_pt_utils.py
@@ -636,16 +636,13 @@ class IterableDatasetShard(IterableDataset):
     - the shard on process 0 will yield `[0, 1, 4, 5, 8, 9]` so will see batches `[0, 1]`, `[4, 5]`, `[8, 9]`
     - the shard on process 1 will yield `[2, 3, 6, 7, 10, 11]` so will see batches `[2, 3]`, `[6, 7]`, `[10, 11]`
 
-    
-
-        If your IterableDataset implements some randomization that needs to be applied the same way on all processes
-        (for instance, a shuffling), you should use a `torch.Generator` in a `generator` attribute of the `dataset` to
-        generate your random numbers and call the [`~trainer_pt_utils.IterableDatasetShard.set_epoch`] method of this
-        object. It will set the seed of this `generator` to `seed + epoch` on all processes before starting the
-        iteration. Alternatively, you can also implement a `set_epoch()` method in your iterable dataset to deal with
-        this.
-
-    
+    > [!WARNING]
+    > If your IterableDataset implements some randomization that needs to be applied the same way on all processes
+    > (for instance, a shuffling), you should use a `torch.Generator` in a `generator` attribute of the `dataset` to
+    > generate your random numbers and call the [`~trainer_pt_utils.IterableDatasetShard.set_epoch`] method of this
+    > object. It will set the seed of this `generator` to `seed + epoch` on all processes before starting the
+    > iteration. Alternatively, you can also implement a `set_epoch()` method in your iterable dataset to deal with
+    > this.
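+    >
+    > A minimal sketch of such a dataset (illustrative; not the only way to implement this):
+    >
+    > ```python
+    > import torch
+    > from torch.utils.data import IterableDataset
+    >
+    > class ShuffledStream(IterableDataset):
+    >     def __init__(self, data):
+    >         self.data = data
+    >         # IterableDatasetShard.set_epoch will seed this generator with `seed + epoch`
+    >         self.generator = torch.Generator()
+    >
+    >     def __iter__(self):
+    >         for i in torch.randperm(len(self.data), generator=self.generator):
+    >             yield self.data[int(i)]
+    > ```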
 
     Args:
         dataset (`torch.utils.data.IterableDataset`):
diff --git a/src/transformers/trainer_seq2seq.py b/src/transformers/trainer_seq2seq.py
index ca6842bc0ff3..5b918bb9004b 100644
--- a/src/transformers/trainer_seq2seq.py
+++ b/src/transformers/trainer_seq2seq.py
@@ -219,13 +219,10 @@ def predict(
             gen_kwargs:
                 Additional `generate` specific kwargs.
 
-        
-
-        If your predictions or labels have different sequence lengths (for instance because you're doing dynamic
-        padding in a token classification task) the predictions will be padded (on the right) to allow for
-        concatenation into one array. The padding index is -100.
-
-        
+        > [!TIP]
+        > If your predictions or labels have different sequence lengths (for instance because you're doing dynamic
+        > padding in a token classification task) the predictions will be padded (on the right) to allow for
+        > concatenation into one array. The padding index is -100.
 
         Returns: *NamedTuple* A namedtuple with the following keys:
 
diff --git a/src/transformers/training_args.py b/src/transformers/training_args.py
index 2abf0d5c883d..fc2270f0c785 100644
--- a/src/transformers/training_args.py
+++ b/src/transformers/training_args.py
@@ -254,12 +254,9 @@ class TrainingArguments:
         gradient_accumulation_steps (`int`, *optional*, defaults to 1):
             Number of updates steps to accumulate the gradients for, before performing a backward/update pass.
 
-            
-
-            When using gradient accumulation, one step is counted as one step with backward pass. Therefore, logging,
-            evaluation, save will be conducted every `gradient_accumulation_steps * xxx_step` training examples.
-
-            
+            > [!WARNING]
+            > When using gradient accumulation, one step is counted as one step with a backward pass. Therefore, logging,
+            > evaluation, and saving will be conducted every `gradient_accumulation_steps * xxx_step` training examples.
 
         eval_accumulation_steps (`int`, *optional*):
             Number of predictions steps to accumulate the output tensors for, before moving the results to the CPU. If
@@ -271,11 +268,8 @@ class TrainingArguments:
         torch_empty_cache_steps (`int`, *optional*):
             Number of steps to wait before calling `torch.<device>.empty_cache()`. If left unset or set to None, cache will not be emptied.
 
-            
-
-            This can help avoid CUDA out-of-memory errors by lowering peak VRAM usage at a cost of about [10% slower performance](https://github.com/huggingface/transformers/issues/31372).
-
-            
+            > [!TIP]
+            > This can help avoid CUDA out-of-memory errors by lowering peak VRAM usage at a cost of about [10% slower performance](https://github.com/huggingface/transformers/issues/31372).
 
         learning_rate (`float`, *optional*, defaults to 5e-5):
             The initial learning rate for [`AdamW`] optimizer.
@@ -333,12 +327,9 @@ class TrainingArguments:
             Whether to filter `nan` and `inf` losses for logging. If set to `True` the loss of every step that is `nan`
             or `inf` is filtered and the average loss of the current logging window is taken instead.
 
-            
-
-            `logging_nan_inf_filter` only influences the logging of loss values, it does not change the behavior the
-            gradient is computed or applied to the model.
-
-            
+            > [!TIP]
+            > `logging_nan_inf_filter` only influences the logging of loss values; it does not change how the
+            > gradient is computed or applied to the model.
 
         save_strategy (`str` or [`~trainer_utils.SaveStrategy`], *optional*, defaults to `"steps"`):
             The checkpoint save strategy to adopt during training. Possible values are:
@@ -449,12 +440,9 @@ class TrainingArguments:
             [`save_total_limit`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.save_total_limit)
             for more.
 
-            
-
-            When set to `True`, the parameters `save_strategy` needs to be the same as `eval_strategy`, and in
-            the case it is "steps", `save_steps` must be a round multiple of `eval_steps`.
-
-            
+            > [!TIP]
+            > When set to `True`, the parameter `save_strategy` needs to be the same as `eval_strategy`, and if
+            > it is "steps", `save_steps` must be a round multiple of `eval_steps`.
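+            >
+            > For example, a minimal sketch of a compatible configuration (the values are illustrative):
+            >
+            > ```python
+            > >>> from transformers import TrainingArguments
+            >
+            > >>> args = TrainingArguments(
+            > ...     output_dir="out",
+            > ...     eval_strategy="steps",
+            > ...     eval_steps=500,
+            > ...     save_strategy="steps",
+            > ...     save_steps=1000,  # a round multiple of eval_steps
+            > ...     load_best_model_at_end=True,
+            > ... )
+            > ```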
 
         metric_for_best_model (`str`, *optional*):
             Use in conjunction with `load_best_model_at_end` to specify the metric to use to compare two different
@@ -550,10 +538,9 @@ class TrainingArguments:
             evolve in the future. The value is either the location of DeepSpeed json config file (e.g.,
             `ds_config.json`) or an already loaded json file as a `dict`"
 
-            
-                If enabling any Zero-init, make sure that your model is not initialized until
-                *after* initializing the `TrainingArguments`, else it will not be applied.
-            
+            > [!WARNING]
+            > If enabling any Zero-init, make sure that your model is not initialized until
+            > *after* initializing the `TrainingArguments`, else it will not be applied.
 
         accelerator_config (`str`, `dict`, or `AcceleratorConfig`, *optional*):
             Config to be used with the internal `Accelerator` implementation. The value is either a location of
@@ -643,12 +630,9 @@ class TrainingArguments:
             will be pushed each time a save is triggered (depending on your `save_strategy`). Calling
             [`~Trainer.save_model`] will also trigger a push.
 
-            
-
-            If `output_dir` exists, it needs to be a local clone of the repository to which the [`Trainer`] will be
-            pushed.
-
-            
+            > [!WARNING]
+            > If `output_dir` exists, it needs to be a local clone of the repository to which the [`Trainer`] will be
+            > pushed.
 
         resume_from_checkpoint (`str`, *optional*):
             The path to a folder with a valid checkpoint for your model. This argument is not directly used by
@@ -2332,11 +2316,8 @@ def set_training(
         """
         A method that regroups all basic arguments linked to the training.
 
-        
-
-        Calling this method will automatically set `self.do_train` to `True`.
-
-        
+        > [!TIP]
+        > Calling this method will automatically set `self.do_train` to `True`.
 
         Args:
             learning_rate (`float`, *optional*, defaults to 5e-5):
@@ -2356,13 +2337,10 @@ def set_training(
             gradient_accumulation_steps (`int`, *optional*, defaults to 1):
                 Number of updates steps to accumulate the gradients for, before performing a backward/update pass.
 
-                
-
-                When using gradient accumulation, one step is counted as one step with backward pass. Therefore,
-                logging, evaluation, save will be conducted every `gradient_accumulation_steps * xxx_step` training
-                examples.
-
-                
+                > [!WARNING]
+                > When using gradient accumulation, one step is counted as one step with a backward pass. Therefore,
+                > logging, evaluation, and saving will be conducted every `gradient_accumulation_steps * xxx_step` training
+                > examples.
 
             seed (`int`, *optional*, defaults to 42):
                 Random seed that will be set at the beginning of training. To ensure reproducibility across runs, use
@@ -2463,11 +2441,8 @@ def set_testing(
         """
         A method that regroups all basic arguments linked to testing on a held-out dataset.
 
-        
-
-        Calling this method will automatically set `self.do_predict` to `True`.
-
-        
+        > [!TIP]
+        > Calling this method will automatically set `self.do_predict` to `True`.
 
         Args:
             batch_size (`int` *optional*, defaults to 8):
@@ -2582,12 +2557,9 @@ def set_logging(
                 Whether to filter `nan` and `inf` losses for logging. If set to `True` the loss of every step that is
                 `nan` or `inf` is filtered and the average loss of the current logging window is taken instead.
 
-                
-
-                `nan_inf_filter` only influences the logging of loss values, it does not change the behavior the
-                gradient is computed or applied to the model.
-
-                
+                > [!TIP]
+                > `nan_inf_filter` only influences the logging of loss values; it does not change how the
+                > gradient is computed or applied to the model.
 
             on_each_node (`bool`, *optional*, defaults to `True`):
                 In multinode distributed training, whether to log using `log_level` once per node, or only on the main
@@ -2630,13 +2602,10 @@ def set_push_to_hub(
         """
         A method that regroups all arguments linked to synchronizing checkpoints with the Hub.
 
-        
-
-        Calling this method will set `self.push_to_hub` to `True`, which means the `output_dir` will begin a git
-        directory synced with the repo (determined by `model_id`) and the content will be pushed each time a save is
-        triggered (depending on your `self.save_strategy`). Calling [`~Trainer.save_model`] will also trigger a push.
-
-        
+        > [!TIP]
+        > Calling this method will set `self.push_to_hub` to `True`, which means the `output_dir` will become a git
+        > directory synced with the repo (determined by `model_id`), and the content will be pushed each time a save is
+        > triggered (depending on your `self.save_strategy`). Calling [`~Trainer.save_model`] will also trigger a push.
 
         Args:
             model_id (`str`):
diff --git a/src/transformers/utils/auto_docstring.py b/src/transformers/utils/auto_docstring.py
index 9bf44c8bb426..fe5ee8fc674d 100644
--- a/src/transformers/utils/auto_docstring.py
+++ b/src/transformers/utils/auto_docstring.py
@@ -1202,13 +1202,10 @@ def add_intro_docstring(func, class_name, indent_level=0):
     if func.__name__ == "forward":
         intro_docstring = rf"""The [`{class_name}`] forward method, overrides the `__call__` special method.
 
-        
-
-        Although the recipe for forward pass needs to be defined within this function, one should call the [`Module`]
-        instance afterwards instead of this since the former takes care of running the pre and post processing steps while
-        the latter silently ignores them.
-
-        
+        > [!TIP]
+        > Although the recipe for the forward pass needs to be defined within this function, one should call the [`Module`]
+        > instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while
+        > the latter silently ignores them.
 
         """
         intro_docstring = equalize_indent(intro_docstring, indent_level + 4)
diff --git a/src/transformers/utils/doc.py b/src/transformers/utils/doc.py
index f9a787a74a13..fd0b4881774a 100644
--- a/src/transformers/utils/doc.py
+++ b/src/transformers/utils/doc.py
@@ -47,13 +47,10 @@ def docstring_decorator(fn):
         class_name = f"[`{fn.__qualname__.split('.')[0]}`]"
         intro = rf"""    The {class_name} forward method, overrides the `__call__` special method.
 
-    
-
-    Although the recipe for forward pass needs to be defined within this function, one should call the [`Module`]
-    instance afterwards instead of this since the former takes care of running the pre and post processing steps while
-    the latter silently ignores them.
-
-    
+    > [!TIP]
+    > Although the recipe for the forward pass needs to be defined within this function, one should call the [`Module`]
+    > instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while
+    > the latter silently ignores them.
 """
 
         correct_indentation = get_docstring_indentation_level(fn)
@@ -180,13 +177,10 @@ def _prepare_output_docstrings(output_type, config_class, min_indent=None, add_i
 
 
 FAKE_MODEL_DISCLAIMER = """
-    
-
-    This example uses a random model as the real ones are all very big. To get proper results, you should use
-    {real_checkpoint} instead of {fake_checkpoint}. If you get out-of-memory when loading that checkpoint, you can try
-    adding `device_map="auto"` in the `from_pretrained` call.
-
-    
+    > [!WARNING]
+    > This example uses a random model as the real ones are all very big. To get proper results, you should use
+    > {real_checkpoint} instead of {fake_checkpoint}. If you run out of memory when loading that checkpoint, you can try
+    > adding `device_map="auto"` in the `from_pretrained` call.
 """
 
 
diff --git a/src/transformers/utils/generic.py b/src/transformers/utils/generic.py
index b39ed65251b2..4e558d2629d7 100644
--- a/src/transformers/utils/generic.py
+++ b/src/transformers/utils/generic.py
@@ -249,12 +249,9 @@ class ModelOutput(OrderedDict):
     tuple) or strings (like a dictionary) that will ignore the `None` attributes. Otherwise behaves like a regular
     python dictionary.
 
-    
-
-    You can't unpack a `ModelOutput` directly. Use the [`~utils.ModelOutput.to_tuple`] method to convert it to a tuple
-    before.
-
-    
+    > [!WARNING]
+    > You can't unpack a `ModelOutput` directly. Use the [`~utils.ModelOutput.to_tuple`] method to convert it to a tuple
+    > before.
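+    >
+    > For example, a minimal sketch (assumes `model` and `inputs` are already defined):
+    >
+    > ```python
+    > >>> outputs = model(**inputs)
+    > >>> logits = outputs.logits            # attribute access works as usual
+    > >>> first, *rest = outputs.to_tuple()  # convert to a tuple before unpacking
+    > ```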
     """
 
     def __init_subclass__(cls) -> None:
diff --git a/src/transformers/utils/hub.py b/src/transformers/utils/hub.py
index dab357941b81..8189603a50ee 100644
--- a/src/transformers/utils/hub.py
+++ b/src/transformers/utils/hub.py
@@ -298,11 +298,8 @@ def cached_file(
         repo_type (`str`, *optional*):
             Specify the repo type (useful when downloading from a space for instance).
 
-    
-
-    Passing `token=True` is required when you want to use a private model.
-
-    
+    > [!TIP]
+    > Passing `token=True` is required when you want to use a private model.
 
     Returns:
         `Optional[str]`: Returns the resolved file (to the cache folder if downloaded from a repo).
@@ -386,11 +383,8 @@ def cached_files(
             passed when we are chaining several calls to various files (e.g. when loading a tokenizer or
             a pipeline). If files are cached for this commit hash, avoid calls to head and get from the cache.
 
-    
-
-    Passing `token=True` is required when you want to use a private model.
-
-    
+    > [!TIP]
+    > Passing `token=True` is required when you want to use a private model.
 
     Returns:
         `Optional[str]`: Returns the resolved file (to the cache folder if downloaded from a repo).
diff --git a/src/transformers/utils/logging.py b/src/transformers/utils/logging.py
index e383653871bf..9e5329c9b9e6 100644
--- a/src/transformers/utils/logging.py
+++ b/src/transformers/utils/logging.py
@@ -165,17 +165,14 @@ def get_verbosity() -> int:
     Returns:
         `int`: The logging level.
 
-    
-
-    🤗 Transformers has following logging levels:
-
-    - 50: `transformers.logging.CRITICAL` or `transformers.logging.FATAL`
-    - 40: `transformers.logging.ERROR`
-    - 30: `transformers.logging.WARNING` or `transformers.logging.WARN`
-    - 20: `transformers.logging.INFO`
-    - 10: `transformers.logging.DEBUG`
-
-    """
+    > [!TIP]
+    > 🤗 Transformers has the following logging levels:
+    >
+    > - 50: `transformers.logging.CRITICAL` or `transformers.logging.FATAL`
+    > - 40: `transformers.logging.ERROR`
+    > - 30: `transformers.logging.WARNING` or `transformers.logging.WARN`
+    > - 20: `transformers.logging.INFO`
+    > - 10: `transformers.logging.DEBUG`"""
 
     _configure_library_root_logger()
     return _get_library_root_logger().getEffectiveLevel()
diff --git a/src/transformers/utils/peft_utils.py b/src/transformers/utils/peft_utils.py
index e3976acf168b..09e6177aca82 100644
--- a/src/transformers/utils/peft_utils.py
+++ b/src/transformers/utils/peft_utils.py
@@ -65,11 +65,8 @@ def find_adapter_config_file(
             git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
             identifier allowed by git.
 
-            
-
-            To test a pull request you made on the Hub, you can pass `revision="refs/pr/".
-
-            
+            > [!TIP]
+            > To test a pull request you made on the Hub, you can pass `revision="refs/pr/<pr_number>"`.
 
         local_files_only (`bool`, *optional*, defaults to `False`):
             If `True`, will only try to load the tokenizer configuration from local files.
diff --git a/src/transformers/video_processing_utils.py b/src/transformers/video_processing_utils.py
index 4d0e9c58f314..db6e9e95db9e 100644
--- a/src/transformers/video_processing_utils.py
+++ b/src/transformers/video_processing_utils.py
@@ -484,11 +484,8 @@ def from_pretrained(
                 identifier allowed by git.
 
 
-                
-
-                To test a pull request you made on the Hub, you can pass `revision="refs/pr/"`.
-
-                
+                > [!TIP]
+                > To test a pull request you made on the Hub, you can pass `revision="refs/pr/<pr_number>"`.
 
             return_unused_kwargs (`bool`, *optional*, defaults to `False`):
                 If `False`, then this function returns just the final video processor object. If `True`, then this
@@ -850,11 +847,8 @@ def register_for_auto_class(cls, auto_class="AutoVideoProcessor"):
         Register this class with a given auto class. This should only be used for custom video processors as the ones
         in the library are already mapped with `AutoVideoProcessor `.
 
-        
-
-        This API is experimental and may have some slight breaking changes in the next releases.
-
-        
+        > [!WARNING]
+        > This API is experimental and may have some slight breaking changes in the next releases.
 
         Args:
             auto_class (`str` or `type`, *optional*, defaults to `"AutoVideoProcessor "`):
diff --git a/utils/deprecate_models.py b/utils/deprecate_models.py
index 8cbe319fdb65..eda05f8d0ffb 100644
--- a/utils/deprecate_models.py
+++ b/utils/deprecate_models.py
@@ -45,14 +45,11 @@ def get_last_stable_minor_release():
 def build_tip_message(last_stable_release):
     return (
         """
-
-
-This model is in maintenance mode only, we don't accept any new PRs changing its code.
-"""
-        + f"""If you run into any issues running this model, please reinstall the last version that supported this model: v{last_stable_release}.
-You can do so by running the following command: `pip install -U transformers=={last_stable_release}`.
-
-"""
+> [!WARNING]
+> This model is in maintenance mode only, we don't accept any new PRs changing its code.
+> """
+        + f"""If you run into any issues running this model, please reinstall the last version that supported this model: v{last_stable_release}.
+> You can do so by running the following command: `pip install -U transformers=={last_stable_release}`."""
     )