diff --git a/docs/source/en/cache_explanation.md b/docs/source/en/cache_explanation.md index 1a3439bc7927..124d12e985ad 100644 --- a/docs/source/en/cache_explanation.md +++ b/docs/source/en/cache_explanation.md @@ -41,13 +41,13 @@ $$ The query (`Q`), key (`K`), and value (`V`) matrices are projections from the input embeddings of shape `(b, h, T, d_head)`. -For causal attention, the mask prevents the model from attending to future tokens. Once a token is processed, its representation never changes with respect to future tokens, which means \\( K_{\text{past}} \\) and \\( V_{\text{past}} \\) can be cached and reused to compute the last token's representation. +For causal attention, the mask prevents the model from attending to future tokens. Once a token is processed, its representation never changes with respect to future tokens, which means $ K_{\text{past}} $ and $ V_{\text{past}} $ can be cached and reused to compute the last token's representation. $$ \text{Attention}(q_t, [\underbrace{k_1, k_2, \dots, k_{t-1}}_{\text{cached}}, k_{t}], [\underbrace{v_1, v_2, \dots, v_{t-1}}_{\text{cached}}, v_{t}]) $$ -At inference time, you only need the last token's query to compute the representation \\( x_t \\) that predicts the next token \\( t+1 \\). At each step, the new key and value vectors are **stored** in the cache and **appended** to the past keys and values. +At inference time, you only need the last token's query to compute the representation $ x_t $ that predicts the next token $ t+1 $. At each step, the new key and value vectors are **stored** in the cache and **appended** to the past keys and values. $$ K_{\text{cache}} \leftarrow \text{concat}(K_{\text{past}}, k_t), \quad V_{\text{cache}} \leftarrow \text{concat}(V_{\text{past}}, v_t) @@ -59,7 +59,7 @@ Refer to the table below to compare how caching improves efficiency. | without caching | with caching | |---|---| -| for each step, recompute all previous `K` and `V` | for each step, only compute current `K` and `V` +| for each step, recompute all previous `K` and `V` | for each step, only compute current `K` and `V` | | attention cost per step is **quadratic** with sequence length | attention cost per step is **linear** with sequence length (memory grows linearly, but compute/token remains low) | ## Cache class diff --git a/docs/source/en/model_doc/reformer.md b/docs/source/en/model_doc/reformer.md index c556e01ba13c..35948c4f918e 100644 --- a/docs/source/en/model_doc/reformer.md +++ b/docs/source/en/model_doc/reformer.md @@ -50,14 +50,14 @@ found [here](https://github.com/google/trax/tree/master/trax/models/reformer). Axial Positional Encodings were first implemented in Google's [trax library](https://github.com/google/trax/blob/4d99ad4965bab1deba227539758d59f0df0fef48/trax/layers/research/position_encodings.py#L29) and developed by the authors of this model's paper. In models that are treating very long input sequences, the -conventional position id encodings store an embeddings vector of size \\(d\\) being the `config.hidden_size` for -every position \\(i, \ldots, n_s\\), with \\(n_s\\) being `config.max_embedding_size`. This means that having -a sequence length of \\(n_s = 2^{19} \approx 0.5M\\) and a `config.hidden_size` of \\(d = 2^{10} \approx 1000\\) +conventional position id encodings store an embeddings vector of size $d$ being the `config.hidden_size` for +every position $i, \ldots, n_s$, with $n_s$ being `config.max_embedding_size`. 
This means that having +a sequence length of $n_s = 2^{19} \approx 0.5M$ and a `config.hidden_size` of $d = 2^{10} \approx 1000$ would result in a position encoding matrix: $$X_{i,j}, \text{ with } i \in \left[1,\ldots, d\right] \text{ and } j \in \left[1,\ldots, n_s\right]$$ -which alone has over 500M parameters to store. Axial positional encodings factorize \\(X_{i,j}\\) into two matrices: +which alone has over 500M parameters to store. Axial positional encodings factorize $X_{i,j}$ into two matrices: $$X^{1}_{i,j}, \text{ with } i \in \left[1,\ldots, d^1\right] \text{ and } j \in \left[1,\ldots, n_s^1\right]$$ @@ -76,16 +76,16 @@ X^{1}_{i, k}, & \text{if }\ i < d^1 \text{ with } k = j \mod n_s^1 \\ X^{2}_{i - d^1, l}, & \text{if } i \ge d^1 \text{ with } l = \lfloor\frac{j}{n_s^1}\rfloor \end{cases}$$ -Intuitively, this means that a position embedding vector \\(x_j \in \mathbb{R}^{d}\\) is now the composition of two -factorized embedding vectors: \\(x^1_{k, l} + x^2_{l, k}\\), where as the `config.max_embedding_size` dimension -\\(j\\) is factorized into \\(k \text{ and } l\\). This design ensures that each position embedding vector -\\(x_j\\) is unique. +Intuitively, this means that a position embedding vector $x_j \in \mathbb{R}^{d}$ is now the composition of two +factorized embedding vectors: $x^1_{k, l} + x^2_{l, k}$, where as the `config.max_embedding_size` dimension +$j$ is factorized into $k \text{ and } l$. This design ensures that each position embedding vector +$x_j$ is unique. -Using the above example again, axial position encoding with \\(d^1 = 2^9, d^2 = 2^9, n_s^1 = 2^9, n_s^2 = 2^{10}\\) -can drastically reduced the number of parameters from 500 000 000 to \\(2^{18} + 2^{19} \approx 780 000\\) parameters, this means 85% less memory usage. +Using the above example again, axial position encoding with $d^1 = 2^9, d^2 = 2^9, n_s^1 = 2^9, n_s^2 = 2^{10}$ +can drastically reduced the number of parameters from 500 000 000 to $2^{18} + 2^{19} \approx 780 000$ parameters, this means 85% less memory usage. -In practice, the parameter `config.axial_pos_embds_dim` is set to a tuple \\((d^1, d^2)\\) which sum has to be -equal to `config.hidden_size` and `config.axial_pos_shape` is set to a tuple \\((n_s^1, n_s^2)\\) which +In practice, the parameter `config.axial_pos_embds_dim` is set to a tuple $(d^1, d^2)$ which sum has to be +equal to `config.hidden_size` and `config.axial_pos_shape` is set to a tuple $(n_s^1, n_s^2)$ which product has to be equal to `config.max_embedding_size`, which during training has to be equal to the *sequence length* of the `input_ids`. @@ -107,10 +107,10 @@ neighboring chunks and `config.lsh_num_chunks_after` following neighboring chunk For more information, see the [original Paper](https://huggingface.co/papers/2001.04451) or this great [blog post](https://www.pragmatic.ml/reformer-deep-dive/). -Note that `config.num_buckets` can also be factorized into a list \\((n_{\text{buckets}}^1, -n_{\text{buckets}}^2)\\). This way instead of assigning the query key embedding vectors to one of \\((1,\ldots, -n_{\text{buckets}})\\) they are assigned to one of \\((1-1,\ldots, n_{\text{buckets}}^1-1, \ldots, -1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)\\). This is crucial for very long sequences to +Note that `config.num_buckets` can also be factorized into a list $(n_{\text{buckets}}^1, +n_{\text{buckets}}^2)$. 
This way instead of assigning the query key embedding vectors to one of $(1,\ldots, +n_{\text{buckets}})$ they are assigned to one of $(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots, +1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)$. This is crucial for very long sequences to save memory. When training a model from scratch, it is recommended to leave `config.num_buckets=None`, so that depending on the @@ -118,8 +118,8 @@ sequence length a good value for `num_buckets` is calculated on the fly. This va saved in the config and should be reused for inference. Using LSH self attention, the memory and time complexity of the query-key matmul operation can be reduced from -\\(\mathcal{O}(n_s \times n_s)\\) to \\(\mathcal{O}(n_s \times \log(n_s))\\), which usually represents the memory -and time bottleneck in a transformer model, with \\(n_s\\) being the sequence length. +$\mathcal{O}(n_s \times n_s)$ to $\mathcal{O}(n_s \times \log(n_s))$, which usually represents the memory +and time bottleneck in a transformer model, with $n_s$ being the sequence length. ### Local Self Attention @@ -129,8 +129,8 @@ the key embedding vectors in its chunk and to the key embedding vectors of `conf previous neighboring chunks and `config.local_num_chunks_after` following neighboring chunks. Using Local self attention, the memory and time complexity of the query-key matmul operation can be reduced from -\\(\mathcal{O}(n_s \times n_s)\\) to \\(\mathcal{O}(n_s \times \log(n_s))\\), which usually represents the memory -and time bottleneck in a transformer model, with \\(n_s\\) being the sequence length. +$\mathcal{O}(n_s \times n_s)$ to $\mathcal{O}(n_s \times \log(n_s))$, which usually represents the memory +and time bottleneck in a transformer model, with $n_s$ being the sequence length. ### Training diff --git a/docs/source/en/model_doc/rwkv.md b/docs/source/en/model_doc/rwkv.md index c0bd1273f615..f3c1ae7ea736 100644 --- a/docs/source/en/model_doc/rwkv.md +++ b/docs/source/en/model_doc/rwkv.md @@ -94,27 +94,27 @@ In a traditional auto-regressive Transformer, attention is written as $$O = \hbox{softmax}(QK^{T} / \sqrt{d}) V$$ -with \\(Q\\), \\(K\\) and \\(V\\) are matrices of shape `seq_len x hidden_size` named query, key and value (they are actually bigger matrices with a batch dimension and an attention head dimension but we're only interested in the last two, which is where the matrix product is taken, so for the sake of simplicity we only consider those two). The product \\(QK^{T}\\) then has shape `seq_len x seq_len` and we can take the matrix product with \\(V\\) to get the output \\(O\\) of the same shape as the others. +with $Q$, $K$ and $V$ are matrices of shape `seq_len x hidden_size` named query, key and value (they are actually bigger matrices with a batch dimension and an attention head dimension but we're only interested in the last two, which is where the matrix product is taken, so for the sake of simplicity we only consider those two). The product $QK^{T}$ then has shape `seq_len x seq_len` and we can take the matrix product with $V$ to get the output $O$ of the same shape as the others. Replacing the softmax by its value gives: $$O_{i} = \frac{\sum_{j=1}^{i} e^{Q_{i} K_{j}^{T} / \sqrt{d}} V_{j}}{\sum_{j=1}^{i} e^{Q_{i} K_{j}^{T} / \sqrt{d}}}$$ -Note that the entries in \\(QK^{T}\\) corresponding to \\(j > i\\) are masked (the sum stops at j) because the attention is not allowed to look at future tokens (only past ones). 
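
As a quick sanity check of the formula above, the snippet below (an illustrative sketch with made-up tensor sizes, not part of any modeling code) compares a causally masked softmax attention with the explicit per-row sums that stop at $j = i$:

```python
import torch

torch.manual_seed(0)
seq_len, d = 5, 8  # toy sizes, chosen only for illustration
Q, K, V = (torch.randn(seq_len, d) for _ in range(3))

# Standard masked attention: entries with j > i are set to -inf before the
# softmax, so they contribute nothing to row i.
scores = Q @ K.T / d**0.5
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
O = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ V

# The same output, written as the explicit sums that stop at j = i.
i = 3
num = sum(torch.exp(Q[i] @ K[j] / d**0.5) * V[j] for j in range(i + 1))
den = sum(torch.exp(Q[i] @ K[j] / d**0.5) for j in range(i + 1))
print(torch.allclose(O[i], num / den, atol=1e-6))  # True
```
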
+Note that the entries in $QK^{T}$ corresponding to $j > i$ are masked (the sum stops at j) because the attention is not allowed to look at future tokens (only past ones). In comparison, the RWKV attention is given by $$O_{i} = \sigma(R_{i}) \frac{\sum_{j=1}^{i} e^{W_{i-j} + K_{j}} V_{j}}{\sum_{j=1}^{i} e^{W_{i-j} + K_{j}}}$$ -where \\(R\\) is a new matrix called receptance by the author, \\(K\\) and \\(V\\) are still the key and value (\\(\sigma\\) here is the sigmoid function). \\(W\\) is a new vector that represents the position of the token and is given by +where $R$ is a new matrix called receptance by the author, $K$ and $V$ are still the key and value ($\sigma$ here is the sigmoid function). $W$ is a new vector that represents the position of the token and is given by $$W_{0} = u \hbox{ and } W_{k} = (k-1)w \hbox{ for } k \geq 1$$ -with \\(u\\) and \\(w\\) learnable parameters called in the code `time_first` and `time_decay` respectively. The numerator and denominator can both be expressed recursively. Naming them \\(N_{i}\\) and \\(D_{i}\\) we have: +with $u$ and $w$ learnable parameters called in the code `time_first` and `time_decay` respectively. The numerator and denominator can both be expressed recursively. Naming them $N_{i}$ and $D_{i}$ we have: $$N_{i} = e^{u + K_{i}} V_{i} + \hat{N}_{i} \hbox{ where } \hat{N}_{i} = e^{K_{i-1}} V_{i-1} + e^{w + K_{i-2}} V_{i-2} \cdots + e^{(i-2)w + K_{1}} V_{1}$$ -so \\(\hat{N}_{i}\\) (called `numerator_state` in the code) satisfies +so $\hat{N}_{i}$ (called `numerator_state` in the code) satisfies $$\hat{N}_{0} = 0 \hbox{ and } \hat{N}_{j+1} = e^{K_{j}} V_{j} + e^{w} \hat{N}_{j}$$ @@ -122,7 +122,7 @@ and $$D_{i} = e^{u + K_{i}} + \hat{D}_{i} \hbox{ where } \hat{D}_{i} = e^{K_{i-1}} + e^{w + K_{i-2}} \cdots + e^{(i-2)w + K_{1}}$$ -so \\(\hat{D}_{i}\\) (called `denominator_state` in the code) satisfies +so $\hat{D}_{i}$ (called `denominator_state` in the code) satisfies $$\hat{D}_{0} = 0 \hbox{ and } \hat{D}_{j+1} = e^{K_{j}} + e^{w} \hat{D}_{j}$$ @@ -130,7 +130,7 @@ The actual recurrent formula used are a tiny bit more complex, as for numerical $$\frac{e^{x_{i}}}{\sum_{j=1}^{n} e^{x_{j}}} = \frac{e^{x_{i} - M}}{\sum_{j=1}^{n} e^{x_{j} - M}}$$ -with \\(M\\) the maximum of all \\(x_{j}\\). So here on top of saving the numerator state (\\(\hat{N}\\)) and the denominator state (\\(\hat{D}\\)) we also keep track of the maximum of all terms encountered in the exponentials. So we actually use +with $M$ the maximum of all $x_{j}$. So here on top of saving the numerator state ($\hat{N}$) and the denominator state ($\hat{D}$) we also keep track of the maximum of all terms encountered in the exponentials. So we actually use $$\tilde{N}_{i} = e^{-M_{i}} \hat{N}_{i} \hbox{ and } \tilde{D}_{i} = e^{-M_{i}} \hat{D}_{i}$$ @@ -142,7 +142,7 @@ and $$\tilde{D}_{0} = 0 \hbox{ and } \tilde{D}_{j+1} = e^{K_{j} - q} + e^{w + M_{j} - q} \tilde{D}_{j} \hbox{ where } q = \max(K_{j}, w + M_{j})$$ -and \\(M_{j+1} = q\\). With those, we can then compute +and $M_{j+1} = q$. 
With those, we can then compute $$N_{i} = e^{u + K_{i} - q} V_{i} + e^{M_{i}} \tilde{N}_{i} \hbox{ where } q = \max(u + K_{i}, M_{i})$$ @@ -152,4 +152,4 @@ $$D_{i} = e^{u + K_{i} - q} + e^{M_{i}} \tilde{D}_{i} \hbox{ where } q = \max( which finally gives us -$$O_{i} = \sigma(R_{i}) \frac{N_{i}}{D_{i}}$$ \ No newline at end of file +$$O_{i} = \sigma(R_{i}) \frac{N_{i}}{D_{i}}$$ diff --git a/docs/source/en/perplexity.md b/docs/source/en/perplexity.md index 4350b444e62f..260459c7101a 100644 --- a/docs/source/en/perplexity.md +++ b/docs/source/en/perplexity.md @@ -23,11 +23,11 @@ that the metric applies specifically to classical language models (sometimes cal models) and is not well defined for masked language models like BERT (see [summary of the models](model_summary)). Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized -sequence \\(X = (x_0, x_1, \dots, x_t)\\), then the perplexity of \\(X\\) is, +sequence $X = (x_0, x_1, \dots, x_t)$, then the perplexity of $X$ is, -$$\text{PPL}(X) = \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{ [!TIP] > In symmetric quantization, Z would typically be fixed at 0. -With these parameters, a float32 value, \\(x\\). can be quantized to int8 ( \\(q\\) ) with the formula below. +With these parameters, a float32 value, $x$. can be quantized to int8 ( $q$ ) with the formula below. $$ q = round\left(\frac{x}{S} + Z\right) $$ -The int8 value, \\(q\\), can be dequantized back to approximate float32 with the formula below. +The int8 value, $q$, can be dequantized back to approximate float32 with the formula below. $$ x \approx S \cdot (q - Z) @@ -78,7 +78,7 @@ $$ dequant -During inference, computations like matrix multiplication are performed using the int8 values ( \\(q\\) ), and the result is dequantized back to float32 (often using a higher-precision accumulation type like int32 internally) before it is passed to the next layer. +During inference, computations like matrix multiplication are performed using the int8 values ( $q$ ), and the result is dequantized back to float32 (often using a higher-precision accumulation type like int32 internally) before it is passed to the next layer. ### int4 and weight packing @@ -86,7 +86,7 @@ During inference, computations like matrix multiplication are performed using th weight packing -int4 quantization further reduces the model size and memory usage (halving it compared to int8). The same affine or symmetric quantization principles apply, mapping the float32 range to the 16 possible values representable by int4 ( \\([-8, 7]\\) for signed int4). +int4 quantization further reduces the model size and memory usage (halving it compared to int8). The same affine or symmetric quantization principles apply, mapping the float32 range to the 16 possible values representable by int4 ( $[-8, 7]$ for signed int4). A key aspect of int4 quantization is **weight packing**. Since most hardware can't natively handle 4-bit data types in memory, two int4 values are typically packed together into a single int8 byte for storage and transfer. For example, the first value might occupy the lower 4 bits and the second value the upper 4 bits of the byte (`packed_byte = (val1 & 0x0F) | (val2 << 4)`). @@ -114,10 +114,10 @@ Transformers supports FP8 through specific backends like [FBGEMM](./fbgemm_fp8), ## Granularity -Quantization parameters ( \\(S\\) and \\(Z\\)) can be calculated in one of two ways. +Quantization parameters ( $S$ and $Z$) can be calculated in one of two ways. 
-- Per-Tensor: One set of \\(S\\) and \\(Z\\) for the entire tensor. Simpler, but less accurate if data values vary greatly within the tensor. -- Per-Channel (or Per-Group/Block): Separate \\(S\\) and \\(Z\\) for each channel or group. More accurate and better performance at the cost of slightly more complexity and memory. +- Per-Tensor: One set of $S$ and $Z$ for the entire tensor. Simpler, but less accurate if data values vary greatly within the tensor. +- Per-Channel (or Per-Group/Block): Separate $S$ and $Z$ for each channel or group. More accurate and better performance at the cost of slightly more complexity and memory.
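
The difference is easy to see in code. The sketch below is illustrative only, with an invented weight tensor and symmetric quantization (so $Z = 0$); it is not the implementation used by any particular backend. It computes one scale for the whole tensor versus one scale per output channel and compares the round-trip error:

```python
import torch

torch.manual_seed(0)
weight = torch.randn(4, 16)  # toy (out_channels, in_features) weight

# Per-tensor: a single scale S shared by every element.
scale = weight.abs().max() / 127
q_per_tensor = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)

# Per-channel: one scale per output channel, so an outlier in one row does not
# stretch the range used to quantize all the other rows.
scales = weight.abs().amax(dim=1, keepdim=True) / 127
q_per_channel = torch.clamp(torch.round(weight / scales), -128, 127).to(torch.int8)

# Dequantize (x ≈ S * q, since Z = 0) and compare the reconstruction error;
# the per-channel error is usually the smaller of the two.
err_tensor = (weight - q_per_tensor.float() * scale).abs().mean()
err_channel = (weight - q_per_channel.float() * scales).abs().mean()
print(f"per-tensor: {err_tensor.item():.4f}, per-channel: {err_channel.item():.4f}")
```
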
Granularities @@ -171,4 +171,4 @@ To explore quantization and related performance optimization concepts more deepl - [Introduction to Quantization cooked in 🤗 with 💗🧑‍🍳](https://huggingface.co/blog/merve/quantization) - [EfficientML.ai Lecture 5 - Quantization Part I](https://www.youtube.com/watch?v=RP23-dRVDWM) - [Making Deep Learning Go Brrrr From First Principles](https://horace.io/brrr_intro.html) -- [Accelerating Generative AI with PyTorch Part 2: LLM Optimizations](https://pytorch.org/blog/accelerating-generative-ai-2/) \ No newline at end of file +- [Accelerating Generative AI with PyTorch Part 2: LLM Optimizations](https://pytorch.org/blog/accelerating-generative-ai-2/) diff --git a/docs/source/en/tokenizer_summary.md b/docs/source/en/tokenizer_summary.md index 34bc16628cad..ba9fedceaf96 100644 --- a/docs/source/en/tokenizer_summary.md +++ b/docs/source/en/tokenizer_summary.md @@ -257,8 +257,8 @@ likely tokenization in practice, but also offers the possibility to sample a pos probabilities. Those probabilities are defined by the loss the tokenizer is trained on. Assuming that the training data consists of -the words \\(x_{1}, \dots, x_{N}\\) and that the set of all possible tokenizations for a word \\(x_{i}\\) is -defined as \\(S(x_{i})\\), then the overall loss is defined as +the words $x_{1}, \dots, x_{N}$ and that the set of all possible tokenizations for a word $x_{i}$ is +defined as $S(x_{i})$, then the overall loss is defined as $$\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )$$
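
As a rough illustration of how this loss is evaluated (a toy sketch with an invented vocabulary and probabilities, not SentencePiece's actual training code), the snippet below enumerates the tokenizations $S(x_{i})$ of each word and sums their probabilities inside the logarithm:

```python
import math

# Toy vocabulary of tokens with invented probabilities.
vocab = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.1,
         "hu": 0.1, "ug": 0.15, "gs": 0.05, "hug": 0.2}
training_words = ["hug", "hugs"]

def tokenizations(word):
    """Return every way of splitting `word` into tokens from `vocab`, i.e. S(word)."""
    if not word:
        return [[]]
    result = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            result += [[piece] + rest for rest in tokenizations(word[end:])]
    return result

def marginal_probability(word):
    # p(x) of one tokenization is the product of its token probabilities;
    # the inner sum of the loss adds p(x) over all tokenizations in S(word).
    return sum(math.prod(vocab[token] for token in tok) for tok in tokenizations(word))

# Overall loss: -sum_i log( sum_{x in S(x_i)} p(x) )
loss = -sum(math.log(marginal_probability(word)) for word in training_words)
print(loss)
```
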