iAmGiG · Apr 24, 2026
diff --git a/docs/papers/jrfm/01_Introduction.tex b/docs/papers/jrfm/01_Introduction.tex
@@ -11,7 +11,7 @@ \section{Introduction}
 
 \textbf{Single-day validation.} Applied to 242 trading days (SPY, 2024), obfuscation testing achieves 71.5\% detection of dealer hedging patterns using unbiased prompts, with 90.9\% of detections materializing in forward returns. A \textbf{raw chain validation} removing all pre-calculated metrics achieves 92.3\% detection---outperforming the GEX-assisted baseline by 30.8 percentage points---demonstrating that LLMs reconstruct dealer positioning from first principles rather than matching parametric summaries \citep{regan2025obfuscation}.
 
-\textbf{Multi-day regime detection.} Extending to 30-day windows across six years (2020--2025), the framework achieves 81.2\% detection of persistent regimes in 2024 versus 12.1\% in 2020 (69.1 percentage point separation, $\varphi = 0.672$, $p < 0.0001$), with 0\% false positives on synthetic controls. Multi-year analysis reveals gradual regime evolution tracking 0DTE adoption: detection rates rise from 3.7\% (2021) to 100\% (2024), with average GEX magnitude growing from \$3.0B to \$20.3B.
+\textbf{Multi-day regime detection.} Extending to 30-day windows across six years (2020--2025), the framework achieves 81.2\% detection of persistent regimes in 2024 (95\% CI [75.8, 86.1]\%) versus 12.1\% in 2020 (95\% CI [8.1, 16.6]\%), a 69.1 percentage point separation ($\varphi = 0.69$, Fisher's exact $p = 1.8 \times 10^{-52}$), with 0\% false positives on synthetic controls. Multi-year analysis reveals gradual regime evolution tracking 0DTE adoption: detection rates rise from 3.7\% (2021) to 100\% (2024), with average GEX magnitude growing from \$3.0B to \$20.3B.
 
 \subsection{Research Questions}
 
@@ -45,6 +45,39 @@ \subsection{Contributions}
 \item \textbf{Detection-alpha orthogonality}: Stable detection (68--74\% quarterly) persists as economic profitability collapses (Sharpe 1.8 $\rightarrow$ 0.1), establishing detected patterns as risk management signals rather than alpha generators.
 \end{enumerate}
 
+\subsection{Positioning}
+\label{sec:introduction:positioning}
+
+The contribution is primarily \textit{methodological}. We propose
+temporal obfuscation testing---and the associated WHO$\rightarrow$WHOM
+$\rightarrow$WHAT causal framework and multi-scale validation
+protocol---as a generalizable procedure for validating whether an LLM
+is reasoning from structural relationships rather than from
+memorization of training-data surface patterns. Options dealer
+gamma-exposure regime detection is chosen as the empirical demonstration
+domain because it offers three features that an LLM validation study
+requires simultaneously: mechanical constraints that are theoretically
+grounded in microstructure, a large quantitative testbed (2,221
+evaluations across six years), and sharp temporal structure (the
+pre- versus post-0DTE contrast) that a genuinely reasoning system
+should distinguish from noise.
+
+The financial-market findings reported here---the 69.1 percentage
+point 2024-versus-2020 detection gap, the 0\% false-positive rate on
+transitional and low-magnitude synthetic controls, and the gradual
+2021--2024 regime evolution tracking 0DTE adoption---are therefore
+presented as \textit{downstream evidence} that the methodology
+discriminates between persistent and fragmented market structures in
+ways consistent with known microstructure dynamics, not as novel
+claims about options market microstructure per se. Readers interested
+primarily in the financial-markets angle will find the relevant
+observations in Sections~\ref{sec:regime} and~\ref{sec:discussion};
+readers interested primarily in LLM validation methodology will find
+the generalizable contribution in
+Sections~\ref{sec:methodology} and~\ref{sec:discussion}. This framing
+is maintained consistently through the Conclusion
+(Section~\ref{sec:conclusion}).
+
 \subsection{Paper Organization}
 
 Section~\ref{sec:related} reviews related work. Section~\ref{sec:methodology} presents the unified methodology covering obfuscation testing, causal framework, and regime detection criteria. Section~\ref{sec:single_day} reports single-day validation results including raw chain analysis. Section~\ref{sec:regime} presents multi-day regime detection and market structure evolution. Section~\ref{sec:discussion} discusses implications and limitations. Section~\ref{sec:conclusion} concludes.
diff --git a/docs/papers/jrfm/03_Methodology.tex b/docs/papers/jrfm/03_Methodology.tex
@@ -151,7 +151,35 @@ \subsection{Multi-Phase Validation Strategy}
 
 \subsection{LLM Configuration}
 
-We use OpenAI o4-mini \citep{openai2024reasoning} with temperature=1.0, max tokens=16,384, processed via Batch API (asynchronous, 100\% completion rate). The model receives a system message (``financial market analyst identifying persistent dealer gamma regimes''), a 30-day obfuscated GEX sequence with classification criteria, and outputs structured JSON with regime type, confidence (0--100), reasoning trace, and computed metrics. Total processing cost across all 2,221 evaluations was \$11.07.
+We use OpenAI o4-mini \citep{openai2024reasoning} with temperature=1.0, max tokens=16,384, processed via Batch API (asynchronous, 100\% completion rate). The model receives a system message (``financial market analyst identifying persistent dealer gamma regimes''), a 30-day obfuscated GEX sequence with classification criteria, and outputs structured JSON with regime type, confidence (0--100), reasoning trace, and computed metrics. Total processing cost across all 2,221 evaluations was \$11.07. The complete prompt, API configuration, and output schema are reproduced verbatim in Appendix~\ref{app:prompt}.
+
+\subsection{Markov-Switching Benchmark}
+\label{sec:methodology:benchmark}
+
+To situate the LLM regime detector against a textbook alternative, we fit
+a two-state Markov-switching regression
+\citep{hamilton1989new,nystrup2020regime} to the daily SPY log-return
+series for each year under study using the
+\texttt{statsmodels.tsa.regime\_switching.markov\_regression.MarkovRegression}
+implementation (switching intercept, switching variance, estimated by
+the EM algorithm to convergence). This is the conventional
+\textit{volatility-regime} benchmark: a low-variance state is interpreted
+as a stable regime, a high-variance state as transitional. For each
+30-day window in our Phase~3 (2024) and Phase~4 (2020) datasets we
+compute the majority smoothed state across the 30 days and record this
+as the benchmark's \emph{detected} label, taking the low-variance state
+as the ``regime'' analogue.
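+
+As an illustrative sketch of this benchmark (the notation below is ours
+and is introduced only for exposition; in particular, the 0.5 threshold
+on smoothed probabilities and the treatment of ties are assumptions, not
+details taken from the implementation), the fitted model for the daily
+log return $r_t$ can be written as
+\[
+  r_t = \mu_{S_t} + \varepsilon_t, \qquad
+  \varepsilon_t \sim \mathcal{N}\bigl(0, \sigma^2_{S_t}\bigr), \qquad
+  \Pr(S_t = j \mid S_{t-1} = i) = p_{ij}, \quad i, j \in \{0, 1\},
+\]
+where $S_t$ is the latent state and state~0 is labelled so that
+$\sigma_0^2 < \sigma_1^2$. Writing
+$\pi_t = \Pr(S_t = 0 \mid r_1, \dots, r_T)$ for the smoothed
+low-variance probability and $W$ for the set of 30 trading days in a
+window, one way to form the window-level label is
+\[
+  \hat{d}_W = \mathbf{1}\Bigl\{ \textstyle\sum_{t \in W}
+  \mathbf{1}\{\pi_t > 0.5\} > 15 \Bigr\},
+\]
+so that a window is marked as detected when more than half of its days
+are assigned to the low-variance state.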
+
+Because the LLM explicitly targets dealer \emph{gamma} positioning rather
+than variance, we additionally fit the same two-state model, hereafter the HMM, to the daily net-GEX series
+directly (where the cached daily series is available, i.e.\ for 2024).
+This GEX-native fit is a more directly analogous benchmark: the LLM and
+the HMM are then both scoring regime structure in the same physical
+quantity, differing only in mechanism (sequence-level structural
+reasoning vs.\ parametric two-state Gaussian EM).
+
+Agreement between each benchmark and the LLM is quantified with Cohen's
+$\kappa$ on the per-window binary detection labels.
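+For reference, with $n_{ab}$ denoting the number of windows labelled
+$a \in \{0,1\}$ by the benchmark and $b \in \{0,1\}$ by the LLM, and
+$N = \sum_{a,b} n_{ab}$ (symbols introduced here for illustration), the
+standard definition is
+\[
+  \kappa = \frac{p_o - p_e}{1 - p_e}, \qquad
+  p_o = \frac{n_{00} + n_{11}}{N}, \qquad
+  p_e = \frac{n_{0\cdot}\, n_{\cdot 0} + n_{1\cdot}\, n_{\cdot 1}}{N^2},
+\]
+where $n_{a\cdot} = \sum_b n_{ab}$ and $n_{\cdot b} = \sum_a n_{ab}$ are
+the marginal counts. Values near 0 indicate agreement no better than
+chance and values near 1 indicate near-perfect agreement.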
 
 \subsection{LLM Usage Disclosure}