2 changes: 1 addition & 1 deletion docs/papers/jrfm/03_Methodology.tex
@@ -151,7 +151,7 @@ \subsection{Multi-Phase Validation Strategy}

\subsection{LLM Configuration}

-We use OpenAI o4-mini \citep{openai2024reasoning} with temperature=1.0, max tokens=16,384, processed via Batch API (asynchronous, 100\% completion rate). The model receives a system message (``financial market analyst identifying persistent dealer gamma regimes''), a 30-day obfuscated GEX sequence with classification criteria, and outputs structured JSON with regime type, confidence (0--100), reasoning trace, and computed metrics. Total processing cost across all 2,221 evaluations was \$11.07. The complete prompt, API configuration, and output schema are reproduced verbatim in Appendix~\ref{app:prompt}.
+We use OpenAI o4-mini \citep{openai2024reasoning} via the OpenAI Batch API (asynchronous; 100\% completion rate across 2,221 requests). Reasoning models, including \texttt{o4-mini}, reject user-supplied \texttt{temperature} values and run at the default temperature of~1; our Batch submission code does not override this, and we do not set an explicit \texttt{max\_completion\_tokens} cap (the OpenAI API default for \texttt{o4-mini} applies). The model receives a single user-role message (whose first paragraph serves as the de~facto system instruction: ``financial market analyst identifying persistent dealer gamma regimes'') containing a 30-day obfuscated GEX sequence with classification criteria, and is instructed in the prompt to return a JSON object with regime type, confidence (0--100), reasoning trace, and computed metrics. The JSON parse-failure rate across the 1,307 per-window records for which raw responses are retained was 0.46\% (6 windows, treated as non-detections; see Appendix~\ref{app:prompt} for details). Total processing cost across all evaluations was \$11.07. The complete prompt, API-parameter list, and output schema are reproduced verbatim in Appendix~\ref{app:prompt}.

\subsection{Markov-Switching Benchmark}
\label{sec:methodology:benchmark}
64 changes: 41 additions & 23 deletions docs/papers/jrfm/07_Appendix_A_Prompts.tex
@@ -29,21 +29,36 @@ \subsection{Model and API Configuration}

\begin{itemize}
\item \textbf{Model:} \texttt{o4-mini}
-\item \textbf{Temperature:} 1.0 (OpenAI reasoning models require a
-  fixed temperature of 1; sampling-temperature adjustment is not
-  exposed for \texttt{o1}, \texttt{o3}, or \texttt{o4} model
-  families)
-\item \textbf{Maximum completion tokens:} 16{,}384
-\item \textbf{Response format:} JSON object (enforced via
-  \texttt{response\_format=\{"type":"json\_object"\}})
-\item \textbf{Access mode:} OpenAI Batch API, batched 1{,}000 requests
-  per submission
+\item \textbf{Temperature:} 1.0. OpenAI reasoning models
+  (\texttt{o1}, \texttt{o3}, \texttt{o4} families, and the newer
+  GPT-5 reasoning variants) reject user-supplied
+  \texttt{temperature} or \texttt{top\_p} values and run at the
+  default temperature of 1; our batch submission code does not
+  override this, so all 2{,}221 requests used temperature 1
+  implicitly.
+\item \textbf{Maximum completion tokens:} not explicitly set in the
+  Batch API request body; the OpenAI API default for
+  \texttt{o4-mini} applies.
+\item \textbf{Response format:} not enforced via the API
+  \texttt{response\_format} field; the model is instructed in
+  the prompt to return a JSON object with a specific schema. Of
+  the 1{,}307 per-window records for which raw JSON responses
+  are retained, 1{,}301 parsed cleanly (99.54\%); six failed
+  JSON parsing and are recorded with an explicit \texttt{error}
+  field and treated as non-detections.
+\item \textbf{Seed:} the OpenAI Batch API exposes a \texttt{seed}
+  parameter for best-effort reproducibility; we did not set a
+  seed in this study, so each evaluation reflects the model's
+  native sampling at temperature 1.
+\item \textbf{Access mode:} OpenAI Batch API (asynchronous; 24-hour
+  SLA per batch submission).
\end{itemize}
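For concreteness, the configuration above corresponds to a Batch API request line of roughly the following shape. This is an illustrative sketch, not the submission code used in the study; `build_request` and the prompt text are placeholders, and the point is what the body *omits*: no `temperature`, `max_completion_tokens`, `response_format`, or `seed`, so server-side defaults for `o4-mini` apply.

```python
import json

def build_request(custom_id: str, prompt: str) -> dict:
    """Illustrative Batch API request line matching the configuration
    above: only model and messages are set, so temperature,
    max_completion_tokens, response_format, and seed all fall back to
    the server-side defaults for o4-mini."""
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "o4-mini",
            # Single user-role message; no system message and no
            # sampling overrides, per the configuration above.
            "messages": [{"role": "user", "content": prompt}],
        },
    }

# One JSONL line per evaluation window in the uploaded batch file.
line = json.dumps(build_request("window_0001", "You are a financial market analyst..."))
```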

\noindent\textbf{Reproducibility note.}
-Because the reasoning models do not accept a user-supplied
-\texttt{seed} parameter and run at a fixed \texttt{temperature}, exact
-bit-identical replication of any single response is not guaranteed.
+Exact bit-identical replication of any single response is not
+guaranteed: temperature is fixed at 1 on the server, and even with a
+seed the OpenAI documentation states that determinism is best-effort
+and can shift when the server \texttt{system\_fingerprint} changes.
Reproducibility at the \emph{distributional} level is achieved by
(i)~the large sample size (N = 2{,}221 evaluations) and
(ii)~the mechanical criteria embedded in the prompt, which give the
@@ -185,7 +200,7 @@ \subsection{System Message and User Prompt}
- Average magnitude $3-5B
- 5-7 sign flips
- Example: "20 negative days, avg $4B, 6 flips"
-- Note: Borderline cases should generally be REJECTED unless other
+- **Note**: Borderline cases should generally be REJECTED unless other
factors strengthen confidence

**0-49 (Reject - Not Persistent)**
@@ -204,21 +219,19 @@ \subsection{System Message and User Prompt}

Provide your analysis in this exact JSON structure:

+```json
{
"regime_detected": true/false,
-"regime_type": "persistent_positive|persistent_negative|
-  transitional|low_conviction",
+"regime_type": "persistent_positive|persistent_negative|transitional|low_conviction",
"positive_days": <count as integer>,
"negative_days": <count as integer>,
"avg_magnitude_billions": <value as number>,
"sign_flips": <count as integer>,
"persistence_pct": <percentage as number>,
"confidence": <integer 0-100>,
-"reasoning": "Explain step-by-step why this is/isn't a persistent
-  regime. Reference specific metrics (persistence %,
-  avg magnitude, sign flips). If rejecting, state which
-  criterion failed."
+"reasoning": "Explain step-by-step why this is/isn't a persistent regime. Reference specific metrics (persistence %, avg magnitude, sign flips). If rejecting, state which criterion failed."
}
+```

**IMPORTANT**: All numeric fields (confidence, positive_days,
negative_days, sign_flips, avg_magnitude_billions, persistence_pct)
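The mechanical criteria above (day counts, sign flips, persistence share, average magnitude) are deterministic functions of the 30-day window, which is what makes distributional reproducibility plausible. A minimal sketch of how they can be computed, assuming the window is a plain list of signed daily GEX values in billions (`regime_metrics` is an illustrative name, not the paper's code):

```python
def regime_metrics(gex_billions: list[float]) -> dict:
    """Compute the deterministic fields the prompt asks the model to
    report for a window of signed daily GEX values (in billions)."""
    positive_days = sum(1 for g in gex_billions if g > 0)
    negative_days = sum(1 for g in gex_billions if g < 0)
    # Sign flips: adjacent days whose signs differ.
    sign_flips = sum(
        1 for a, b in zip(gex_billions, gex_billions[1:]) if (a > 0) != (b > 0)
    )
    # Persistence: share of days on the dominant side of zero.
    dominant = max(positive_days, negative_days)
    persistence_pct = 100.0 * dominant / len(gex_billions)
    avg_magnitude = sum(abs(g) for g in gex_billions) / len(gex_billions)
    return {
        "positive_days": positive_days,
        "negative_days": negative_days,
        "sign_flips": sign_flips,
        "persistence_pct": round(persistence_pct, 1),
        "avg_magnitude_billions": round(avg_magnitude, 2),
    }
```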
@@ -279,7 +292,12 @@ \subsection{Output Schema and Parsing}
\noindent
Parsing is performed by \texttt{src/validation/batch\_regime\_validator.py}
via a robust JSON extractor that tolerates markdown code-fence wrappers
-and minor formatting drift. Any response failing schema validation is
-flagged for manual review; across the 2{,}221 evaluations in this study,
-the schema-validation failure rate was 0\% (all responses were
-machine-parseable).
+and minor formatting drift. Across the 1{,}307 per-window records for
+which the raw responses are retained in the results YAML files (Phases
+1--4 and the Phase 2 negative-control suite), the JSON parse-failure
+rate was 0.46\% (6~windows failed to parse as valid JSON and are
+recorded with an explicit \texttt{error} field; these windows are
+treated as non-detections in all aggregate rates reported in
+Section~\ref{sec:regime}). Phase 5 multi-year per-window records were
+not retained in the published pipeline; Table~\ref{tab:phase5} reports
+only the aggregate count per year.
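A fence-tolerant extractor of the kind described above can be sketched as follows. This is a simplified illustration, not the actual `batch_regime_validator.py` implementation; the `error` sentinel value is a placeholder:

```python
import json
import re

# Matches an optional ```json ... ``` markdown wrapper around the payload.
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL)

def extract_json(raw: str) -> dict:
    """Parse a model response, tolerating markdown code-fence wrappers.
    On failure, return a record with an explicit 'error' field that is
    treated as a non-detection downstream."""
    candidate = raw.strip()
    fenced = FENCE_RE.search(candidate)
    if fenced:
        candidate = fenced.group(1)
    try:
        parsed = json.loads(candidate)
        if not isinstance(parsed, dict):
            raise ValueError("top-level JSON value is not an object")
        return parsed
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return {"regime_detected": False, "error": "json_parse_failure"}
```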
Binary file modified docs/papers/jrfm/Regan_Xie_JRFM.pdf
Binary file not shown.
Binary file modified docs/papers/jrfm/figures/fig01_obfuscation.png
Binary file modified docs/papers/jrfm/figures/fig02_regime_window.png
Binary file modified docs/papers/jrfm/figures/fig03_validation_pipeline.png
Binary file modified docs/papers/jrfm/figures/fig04_selectivity.png
Binary file modified docs/papers/jrfm/figures/fig05_gex_magnitude_distribution.png
Binary file modified docs/papers/jrfm/figures/fig06_detection_progression.png
Binary file modified docs/papers/jrfm/figures/fig09_threshold_sensitivity.png
Binary file modified docs/papers/jrfm/figures/fig10_hmm_agreement.png