From c4e0c14b355f9f448c5ff32551d4ad2f9b53e95e Mon Sep 17 00:00:00 2001
From: Bryson Jones
Date: Sun, 29 Mar 2026 08:27:24 -0700
Subject: [PATCH] fix math rendering

---
 README.md | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index f68a95a..c01d745 100644
--- a/README.md
+++ b/README.md
@@ -273,13 +273,23 @@ Calculates advantage values $A^\pi(\mathbf{o}_t, \mathbf{a}_t)$ from offline tra
 
 #### 1. Post-Training (N-Step Lookahead)
 
-* **Formula:** $$A^\pi(\mathbf{o}_t, \mathbf{a}_t) = \sum_{t'=t}^{t+N-1} r'_{t'} + V^\pi(\mathbf{o}_{t+N}) - V^\pi(\mathbf{o}_t)$$
+**Formula:**
+
+```math
+A^\pi(\mathbf{o}_t, \mathbf{a}_t) = \sum_{t'=t}^{t+N-1} r'_{t'} + V^\pi(\mathbf{o}_{t+N}) - V^\pi(\mathbf{o}_t)
+```
+
 * **Configuration:** $N = 50$
 * **Execution:** Sum rewards over the $N$-step window, add the future value $V^\pi(\mathbf{o}_{t+N})$, and subtract the current value $V^\pi(\mathbf{o}_t)$.
 
 #### 2. Pre-Training (Full Episode)
 
-* **Formula:** $$A^\pi(\mathbf{o}_t, \mathbf{a}_t) = \sum_{t'=t}^{T} r'_{t'} - V^\pi(\mathbf{o}_t)$$
+**Formula:**
+
+```math
+A^\pi(\mathbf{o}_t, \mathbf{a}_t) = \sum_{t'=t}^{T} r'_{t'} - V^\pi(\mathbf{o}_t)
+```
+
 * **Configuration:** $N = T$ (where $T$ is the terminal episode step)
 * **Execution:** Calculate the empirical return from step $t$ to the episode's end, then subtract the baseline $V^\pi(\mathbf{o}_t)$.
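
For reference, the two formulas whose rendering this patch fixes can be sketched in Python as follows. This is a minimal illustration and not code from the repository: the function names `n_step_advantage` and `full_episode_advantage`, the list-based trajectory layout (one reward and one value estimate per step, indexed from 0 to $T$), and the sample numbers are all assumptions made for the example. The README excerpt does not say what happens when the $N$-step window runs past the end of the episode, so the sketch simply rejects that case rather than guessing.

```python
from typing import Sequence


def n_step_advantage(
    rewards: Sequence[float], values: Sequence[float], t: int, n: int
) -> float:
    """N-step lookahead: sum r'_{t'} for t' in [t, t+n-1], add V(o_{t+n}), subtract V(o_t)."""
    # Hypothetical guard: the window and the bootstrap step must lie inside the trajectory.
    if n < 1 or t < 0 or t + n > len(rewards) or t + n >= len(values):
        raise ValueError("t and t + n must fall within the recorded trajectory")
    return sum(rewards[t:t + n]) + values[t + n] - values[t]


def full_episode_advantage(
    rewards: Sequence[float], values: Sequence[float], t: int
) -> float:
    """Full episode: empirical return from step t to the terminal step T, minus V(o_t)."""
    if not 0 <= t < len(rewards):
        raise ValueError("t must index a recorded step")
    return sum(rewards[t:]) - values[t]


# Made-up toy trajectory with T = 4 (steps 0..4).
rewards = [0.0, 0.0, 1.0, 0.0, 2.0]  # r'_0 .. r'_4
values = [1.0, 1.5, 2.0, 1.0, 0.5]   # V(o_0) .. V(o_4)

a_nstep = n_step_advantage(rewards, values, t=1, n=2)
a_full = full_episode_advantage(rewards, values, t=1)
print(a_nstep, a_full)
```

With the README's post-training configuration, `n` would be 50; the toy call uses `n=2` only so the arithmetic is easy to check by hand.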