35 changes: 34 additions & 1 deletion docs/papers/jrfm/01_Introduction.tex
@@ -11,7 +11,7 @@ \section{Introduction}

\textbf{Single-day validation.} Applied to 242 trading days (SPY, 2024), obfuscation testing achieves 71.5\% detection of dealer hedging patterns using unbiased prompts, with 90.9\% of detections materializing in forward returns. A \textbf{raw chain validation} removing all pre-calculated metrics achieves 92.3\% detection---outperforming the GEX-assisted baseline by 30.8 percentage points---demonstrating that LLMs reconstruct dealer positioning from first principles rather than matching parametric summaries \citep{regan2025obfuscation}.

\textbf{Multi-day regime detection.} Extending to 30-day windows across six years (2020--2025), the framework achieves 81.2\% detection of persistent regimes in 2024 (95\% CI [75.8, 86.1]\%) versus 12.1\% in 2020 (95\% CI [8.1, 16.6]\%), a 69.1 percentage point separation ($\varphi = 0.69$, Fisher's exact $p = 1.8 \times 10^{-52}$), with 0\% false positives on synthetic controls. Multi-year analysis reveals gradual regime evolution tracking 0DTE adoption: detection rates rise from 3.7\% (2021) to 100\% (2024), with average GEX magnitude growing from \$3.0B to \$20.3B.
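
For reference, the $\varphi$ coefficient of a $2 \times 2$ contingency table is conventionally defined as below. The table layout is an assumption, since this section does not state it: rows index the year (2024, 2020), columns index detected versus not detected, and $a, b, c, d$ denote the four cell counts read row by row.
\[
  \varphi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}.
\]
If, as a further assumption, both years contribute the same number of windows, this reduces to $\varphi = (p_1 - p_2) / \bigl(2\sqrt{\bar{p}(1-\bar{p})}\bigr)$ with $\bar{p} = (p_1 + p_2)/2$. Substituting $p_1 = 0.812$ and $p_2 = 0.121$ gives $\varphi \approx 0.69$, in line with the reported value.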

\subsection{Research Questions}

@@ -45,6 +45,39 @@ \subsection{Contributions}
\item \textbf{Detection-alpha orthogonality}: Stable detection (68--74\% quarterly) persists as economic profitability collapses (Sharpe 1.8 $\rightarrow$ 0.1), establishing detected patterns as risk management signals rather than alpha generators.
\end{enumerate}

\subsection{Positioning}
\label{sec:introduction:positioning}

The contribution is primarily \textit{methodological}. We propose
temporal obfuscation testing---and the associated WHO$\rightarrow$WHOM
$\rightarrow$WHAT causal framework and multi-scale validation
protocol---as a generalizable procedure for validating whether an LLM
is reasoning from structural relationships rather than from
memorization of training-data surface patterns. Options dealer
gamma-exposure regime detection is chosen as the empirical demonstration
domain because it offers three features that an LLM validation study
requires simultaneously: mechanical constraints that are theoretically
grounded in microstructure, a large quantitative testbed (2,221
evaluations across six years), and sharp temporal structure (the
pre- versus post-0DTE contrast) that a genuinely reasoning system
should distinguish from noise.

The financial-market findings reported here---the 69.1 percentage
point 2024-versus-2020 detection gap, the 0\% false-positive rate on
transitional and low-magnitude synthetic controls, and the gradual
2021--2024 regime evolution tracking 0DTE adoption---are therefore
presented as \textit{downstream evidence} that the methodology
discriminates between persistent and fragmented market structures in
ways consistent with known microstructure dynamics, not as novel
claims about options market microstructure per se. Readers interested
primarily in the financial-markets angle will find the relevant
observations in Sections~\ref{sec:regime} and~\ref{sec:discussion};
readers interested primarily in LLM validation methodology will find
the generalizable contribution in
Sections~\ref{sec:methodology} and~\ref{sec:discussion}. This framing
is maintained consistently through the Conclusion
(Section~\ref{sec:conclusion}).

\subsection{Paper Organization}

Section~\ref{sec:related} reviews related work. Section~\ref{sec:methodology} presents the unified methodology covering obfuscation testing, causal framework, and regime detection criteria. Section~\ref{sec:single_day} reports single-day validation results including raw chain analysis. Section~\ref{sec:regime} presents multi-day regime detection and market structure evolution. Section~\ref{sec:discussion} discusses implications and limitations. Section~\ref{sec:conclusion} concludes.
30 changes: 29 additions & 1 deletion docs/papers/jrfm/03_Methodology.tex
@@ -151,7 +151,35 @@ \subsection{Multi-Phase Validation Strategy}

\subsection{LLM Configuration}

We use OpenAI o4-mini \citep{openai2024reasoning} with temperature=1.0 and max tokens=16,384, with requests processed via the Batch API (asynchronous, 100\% completion rate). The model receives a system message (``financial market analyst identifying persistent dealer gamma regimes'') and a 30-day obfuscated GEX sequence with classification criteria, and it returns structured JSON containing regime type, confidence (0--100), reasoning trace, and computed metrics. Total processing cost across all 2,221 evaluations was \$11.07. The complete prompt, API configuration, and output schema are reproduced verbatim in Appendix~\ref{app:prompt}.
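
As an illustration of how a structured response of this kind might be checked before analysis, the sketch below parses one JSON reply and validates the confidence range. The field names (`regime_type`, `confidence`, `reasoning`, `metrics`) and the helper `parse_regime_response` are hypothetical; the actual schema is the one given in the appendix.

```python
import json
from dataclasses import dataclass
from typing import Any

# Hypothetical field names; the real output schema is defined in the paper's appendix.
REQUIRED_FIELDS = {"regime_type", "confidence", "reasoning", "metrics"}


@dataclass(frozen=True)
class RegimeResponse:
    regime_type: str
    confidence: float          # expected to lie in [0, 100]
    reasoning: str
    metrics: dict[str, Any]


def parse_regime_response(raw: str) -> RegimeResponse:
    """Parse one JSON reply and reject malformed or out-of-range records."""
    payload = json.loads(raw)
    if not isinstance(payload, dict):
        raise ValueError("response must be a JSON object")
    missing = REQUIRED_FIELDS - set(payload)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    confidence = float(payload["confidence"])
    if not 0.0 <= confidence <= 100.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if not isinstance(payload["metrics"], dict):
        raise ValueError("metrics must be a JSON object")
    return RegimeResponse(
        regime_type=str(payload["regime_type"]),
        confidence=confidence,
        reasoning=str(payload["reasoning"]),
        metrics=payload["metrics"],
    )
```

Rejecting malformed records at parse time keeps downstream detection rates from silently absorbing replies that do not follow the schema.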

\subsection{Markov-Switching Benchmark}
\label{sec:methodology:benchmark}

To situate the LLM regime detector against a textbook alternative, we fit
a two-state Markov-switching regression
\citep{hamilton1989new,nystrup2020regime} to the daily SPY log-return
series for each year under study using the standard
\texttt{statsmodels.tsa.regime\_switching.MarkovRegression}
implementation (switching intercept, switching variance, estimated by
the standard EM algorithm to convergence). This is the conventional
\textit{volatility-regime} benchmark: a low-variance state is interpreted
as a stable regime, a high-variance state as transitional. For each
30-day window in our Phase~3 (2024) and Phase~4 (2020) datasets we
compute the majority smoothed state across the 30 days and record this
as the benchmark's \emph{detected} label, taking the low-variance state
as the ``regime'' analogue.
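
The benchmark labelling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the rolling stride of one day, the 0.5 probability threshold for assigning a day to the low-variance state, and the treatment of a 15/15 tie as "not detected" are all assumptions. The fit uses the default `statsmodels` estimation settings, which may differ from the settings used in the paper.

```python
def smoothed_low_variance_probs(returns):
    """Fit a two-state Markov-switching regression (switching intercept and
    variance) and return the smoothed P(low-variance state) for each day."""
    import numpy as np
    from statsmodels.tsa.regime_switching.markov_regression import MarkovRegression

    model = MarkovRegression(
        np.asarray(returns, dtype=float),
        k_regimes=2,
        trend="c",                 # intercept, switching by default
        switching_variance=True,
    )
    result = model.fit()           # default estimation settings (assumption)

    # Identify which regime has the smaller estimated variance.
    names = list(model.param_names)
    values = np.asarray(result.params)
    variances = [values[names.index(f"sigma2[{k}]")] for k in range(2)]
    low = int(variances[1] < variances[0])

    probs = np.asarray(result.smoothed_marginal_probabilities)
    return probs[:, low].tolist()


def window_majority_labels(low_var_probs, window=30, stride=1):
    """Label each window 'detected' when a strict majority of its days are
    assigned to the low-variance state (P > 0.5). Stride is an assumption."""
    labels = []
    for start in range(0, len(low_var_probs) - window + 1, stride):
        days = low_var_probs[start:start + window]
        low_days = sum(p > 0.5 for p in days)
        labels.append(2 * low_days > window)   # ties count as not detected
    return labels
```

Passing a daily net-GEX series instead of log returns to `smoothed_low_variance_probs` would give the GEX-native variant under the same assumptions.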

Because the LLM explicitly targets dealer \emph{gamma} positioning rather
than variance, we additionally fit the same two-state Markov-switching
model (a hidden Markov model, hereafter HMM) to the daily net-GEX series
directly (where the cached daily series is available, i.e.\ for 2024).
This GEX-native fit is a closer analogue: the LLM and
the HMM then both score regime structure in the same physical
quantity, differing only in mechanism (sequence-level structural
reasoning vs.\ parametric two-state Gaussian EM).

Agreement between each benchmark and the LLM is quantified with Cohen's
$\kappa$ on the per-window binary detection labels.
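
For binary labels, Cohen's $\kappa$ compares the observed agreement rate $p_o$ with the agreement expected by chance $p_e$, as $\kappa = (p_o - p_e)/(1 - p_e)$. A minimal standard-library sketch is given below; the function name `cohens_kappa` and the choice to return NaN when $p_e = 1$ are illustrative assumptions, not the paper's implementation.

```python
import math


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of binary labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    a = [bool(x) for x in labels_a]
    b = [bool(x) for x in labels_b]

    # Observed agreement: share of windows where both raters give the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n

    # Chance agreement from each rater's marginal positive rate.
    rate_a = sum(a) / n
    rate_b = sum(b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)

    if math.isclose(expected, 1.0):
        return float("nan")   # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)
```

For example, `cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])` has $p_o = 0.75$ and $p_e = 0.5$, giving $\kappa = 0.5$.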

\subsection{LLM Usage Disclosure}
