diff --git a/docs/papers/jrfm/03_Methodology.tex b/docs/papers/jrfm/03_Methodology.tex index ec3d13f..6b2259c 100644 --- a/docs/papers/jrfm/03_Methodology.tex +++ b/docs/papers/jrfm/03_Methodology.tex @@ -151,7 +151,7 @@ \subsection{Multi-Phase Validation Strategy} \subsection{LLM Configuration} -We use OpenAI o4-mini \citep{openai2024reasoning} with temperature=1.0, max tokens=16,384, processed via Batch API (asynchronous, 100\% completion rate). The model receives a system message (``financial market analyst identifying persistent dealer gamma regimes''), a 30-day obfuscated GEX sequence with classification criteria, and outputs structured JSON with regime type, confidence (0--100), reasoning trace, and computed metrics. Total processing cost across all 2,221 evaluations was \$11.07. The complete prompt, API configuration, and output schema are reproduced verbatim in Appendix~\ref{app:prompt}. +We use OpenAI o4-mini \citep{openai2024reasoning} via the OpenAI Batch API (asynchronous; 100\% completion rate across 2,221 requests). Reasoning models including \texttt{o4-mini} reject user-supplied \texttt{temperature} values and run at the default temperature of~1; our Batch submission code does not override this, and we do not set an explicit \texttt{max\_completion\_tokens} cap (the OpenAI API default for \texttt{o4-mini} applies). The model receives a single user-role message, whose first paragraph serves as the de~facto system instruction (``financial market analyst identifying persistent dealer gamma regimes''), together with a 30-day obfuscated GEX sequence with classification criteria. The prompt instructs the model to return a JSON object with regime type, confidence (0--100), reasoning trace, and computed metrics. The JSON parse-failure rate across the 1,307 per-window records for which raw responses are retained was 0.46\% (6 windows, treated as non-detections; see Appendix~\ref{app:prompt} for details). Total processing cost across all evaluations was \$11.07. 
The complete prompt, API-parameter list, and output schema are reproduced verbatim in Appendix~\ref{app:prompt}. \subsection{Markov-Switching Benchmark} \label{sec:methodology:benchmark} diff --git a/docs/papers/jrfm/07_Appendix_A_Prompts.tex b/docs/papers/jrfm/07_Appendix_A_Prompts.tex index dabf411..8be6288 100644 --- a/docs/papers/jrfm/07_Appendix_A_Prompts.tex +++ b/docs/papers/jrfm/07_Appendix_A_Prompts.tex @@ -29,21 +29,36 @@ \subsection{Model and API Configuration} \begin{itemize} \item \textbf{Model:} \texttt{o4-mini} - \item \textbf{Temperature:} 1.0 (OpenAI reasoning models require a - fixed temperature of 1; sampling-temperature adjustment is not - exposed for \texttt{o1}, \texttt{o3}, or \texttt{o4} model - families) - \item \textbf{Maximum completion tokens:} 16{,}384 - \item \textbf{Response format:} JSON object (enforced via - \texttt{response\_format=\{"type":"json\_object"\}}) - \item \textbf{Access mode:} OpenAI Batch API, batched 1{,}000 requests - per submission + \item \textbf{Temperature:} 1.0. OpenAI reasoning models + (\texttt{o1}, \texttt{o3}, \texttt{o4} families, and the newer + GPT-5 reasoning variants) reject user-supplied + \texttt{temperature} or \texttt{top\_p} values and run at the + default temperature of 1; our batch submission code does not + override this, so all 2{,}221 requests used temperature 1 + implicitly. + \item \textbf{Maximum completion tokens:} not explicitly set in the + Batch API request body; the OpenAI API default for + \texttt{o4-mini} applies. + \item \textbf{Response format:} not enforced via the API + \texttt{response\_format} field; the model is instructed in + the prompt to return a JSON object with a specific schema. Of + the 1{,}307 per-window records for which raw JSON responses + are retained, 1{,}301 parsed cleanly (99.54\%); six failed + JSON parsing and are recorded with an explicit \texttt{error} + field and treated as non-detections. 
+ \item \textbf{Seed:} the OpenAI Batch API exposes a \texttt{seed} + parameter for best-effort reproducibility; we did not set a + seed in this study, so each evaluation reflects the model's + native sampling at temperature 1. + \item \textbf{Access mode:} OpenAI Batch API (asynchronous; 24-hour + SLA per batch submission). \end{itemize} \noindent\textbf{Reproducibility note.} -Because the reasoning models do not accept a user-supplied -\texttt{seed} parameter and run at a fixed \texttt{temperature}, exact -bit-identical replication of any single response is not guaranteed. +Exact bit-identical replication of any single response is not +guaranteed: temperature is fixed at 1 on the server, and even with a +seed the OpenAI documentation states that determinism is best-effort +and can shift when the server \texttt{system\_fingerprint} changes. Reproducibility at the \emph{distributional} level is achieved by (i)~the large sample size (N = 2{,}221 evaluations) and (ii)~the mechanical criteria embedded in the prompt, which give the @@ -185,7 +200,7 @@ \subsection{System Message and User Prompt} - Average magnitude $3-5B - 5-7 sign flips - Example: "20 negative days, avg $4B, 6 flips" -- Note: Borderline cases should generally be REJECTED unless other +- **Note**: Borderline cases should generally be REJECTED unless other factors strengthen confidence **0-49 (Reject - Not Persistent)** @@ -204,21 +219,19 @@ \subsection{System Message and User Prompt} Provide your analysis in this exact JSON structure: +```json { "regime_detected": true/false, - "regime_type": "persistent_positive|persistent_negative| - transitional|low_conviction", + "regime_type": "persistent_positive|persistent_negative|transitional|low_conviction", "positive_days": , "negative_days": , "avg_magnitude_billions": , "sign_flips": , "persistence_pct": , "confidence": , - "reasoning": "Explain step-by-step why this is/isn't a persistent - regime. 
Reference specific metrics (persistence %, - avg magnitude, sign flips). If rejecting, state which - criterion failed." + "reasoning": "Explain step-by-step why this is/isn't a persistent regime. Reference specific metrics (persistence %, avg magnitude, sign flips). If rejecting, state which criterion failed." } +``` **IMPORTANT**: All numeric fields (confidence, positive_days, negative_days, sign_flips, avg_magnitude_billions, persistence_pct) @@ -279,7 +292,12 @@ \subsection{Output Schema and Parsing} \noindent Parsing is performed by \texttt{src/validation/batch\_regime\_validator.py} via a robust JSON extractor that tolerates markdown code-fence wrappers -and minor formatting drift. Any response failing schema validation is -flagged for manual review; across the 2{,}221 evaluations in this study, -the schema-validation failure rate was 0\% (all responses were -machine-parseable). +and minor formatting drift. Across the 1{,}307 per-window records for +which the raw responses are retained in the results YAML files (Phases +1--4 and the Phase 2 negative-control suite), the JSON parse-failure +rate was 0.46\% (6~windows failed to parse as valid JSON and are +recorded with an explicit \texttt{error} field; these windows are +treated as non-detections in all aggregate rates reported in +Section~\ref{sec:regime}). Phase 5 multi-year per-window records were +not retained in the published pipeline; Table~\ref{tab:phase5} reports +only the aggregate count per year. 
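Reviewer note on the parsing paragraph above: the tolerant extraction and the error-field convention it describes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in \texttt{src/validation/batch\_regime\_validator.py}; the names `extract_json` and `parse_failure_rate` and the error messages are hypothetical.

```python
import json
import re

# Hypothetical sketch; not taken from batch_regime_validator.py.
# Matches an optional markdown code fence around the JSON payload.
_FENCE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL | re.IGNORECASE)


def extract_json(raw: str) -> dict:
    """Parse a model response, tolerating a markdown code-fence wrapper.

    On failure, return a record with an explicit ``error`` field and
    ``regime_detected`` set to False, so the window counts as a non-detection.
    """
    match = _FENCE.search(raw)
    candidate = match.group(1) if match else raw
    # Use the outermost brace pair to skip stray text around the object.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return {"regime_detected": False, "error": "no JSON object found"}
    try:
        return json.loads(candidate[start : end + 1])
    except json.JSONDecodeError as exc:
        return {"regime_detected": False, "error": f"JSON parse failure: {exc.msg}"}


def parse_failure_rate(records: list[dict]) -> float:
    """Fraction of records that carry an ``error`` field."""
    failures = sum(1 for record in records if "error" in record)
    return failures / len(records)
```

With 6 failing records out of 1,307, `parse_failure_rate` gives 6/1307, which rounds to the 0.46\% quoted in the text (equivalently 1,301/1,307 = 99.54\% parsed cleanly).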
diff --git a/docs/papers/jrfm/Regan_Xie_JRFM.pdf b/docs/papers/jrfm/Regan_Xie_JRFM.pdf index 87b69df..9f8c4c7 100644 Binary files a/docs/papers/jrfm/Regan_Xie_JRFM.pdf and b/docs/papers/jrfm/Regan_Xie_JRFM.pdf differ diff --git a/docs/papers/jrfm/figures/fig01_obfuscation.png b/docs/papers/jrfm/figures/fig01_obfuscation.png index 5dd7d7f..3818893 100644 Binary files a/docs/papers/jrfm/figures/fig01_obfuscation.png and b/docs/papers/jrfm/figures/fig01_obfuscation.png differ diff --git a/docs/papers/jrfm/figures/fig02_regime_window.png b/docs/papers/jrfm/figures/fig02_regime_window.png index ffdbdf3..926f4e7 100644 Binary files a/docs/papers/jrfm/figures/fig02_regime_window.png and b/docs/papers/jrfm/figures/fig02_regime_window.png differ diff --git a/docs/papers/jrfm/figures/fig03_validation_pipeline.png b/docs/papers/jrfm/figures/fig03_validation_pipeline.png index e2ad512..6348ec6 100644 Binary files a/docs/papers/jrfm/figures/fig03_validation_pipeline.png and b/docs/papers/jrfm/figures/fig03_validation_pipeline.png differ diff --git a/docs/papers/jrfm/figures/fig04_selectivity.png b/docs/papers/jrfm/figures/fig04_selectivity.png index b81e1ad..298432c 100644 Binary files a/docs/papers/jrfm/figures/fig04_selectivity.png and b/docs/papers/jrfm/figures/fig04_selectivity.png differ diff --git a/docs/papers/jrfm/figures/fig05_gex_magnitude_distribution.png b/docs/papers/jrfm/figures/fig05_gex_magnitude_distribution.png index 4fa6bfb..ea9db1b 100644 Binary files a/docs/papers/jrfm/figures/fig05_gex_magnitude_distribution.png and b/docs/papers/jrfm/figures/fig05_gex_magnitude_distribution.png differ diff --git a/docs/papers/jrfm/figures/fig06_detection_progression.png b/docs/papers/jrfm/figures/fig06_detection_progression.png index 8fd5e6d..fe9a608 100644 Binary files a/docs/papers/jrfm/figures/fig06_detection_progression.png and b/docs/papers/jrfm/figures/fig06_detection_progression.png differ diff --git a/docs/papers/jrfm/figures/fig09_threshold_sensitivity.png 
b/docs/papers/jrfm/figures/fig09_threshold_sensitivity.png index b1d4f84..20ddecf 100644 Binary files a/docs/papers/jrfm/figures/fig09_threshold_sensitivity.png and b/docs/papers/jrfm/figures/fig09_threshold_sensitivity.png differ diff --git a/docs/papers/jrfm/figures/fig10_hmm_agreement.png b/docs/papers/jrfm/figures/fig10_hmm_agreement.png index be12e36..85ed439 100644 Binary files a/docs/papers/jrfm/figures/fig10_hmm_agreement.png and b/docs/papers/jrfm/figures/fig10_hmm_agreement.png differ diff --git a/docs/papers/jrfm/portal_upload/build_r3_docx.py b/docs/papers/jrfm/portal_upload/build_r3_docx.py new file mode 100644 index 0000000..227848c --- /dev/null +++ b/docs/papers/jrfm/portal_upload/build_r3_docx.py @@ -0,0 +1,843 @@ +"""Generate the Reviewer 3 response as a .docx that follows the MDPI +response-to-reviewer template structure. + +The MDPI template (shipped alongside this file as +``Example for author to respond reviewer - MDPI.docx``) requires five +numbered sections: + + 1. Summary (neutral thank-you) + 2. Questions for General Evaluation (table of ratings + responses) + 3. Point-by-point response to Comments and Suggestions for Authors + (Comments N: / Response N: format) + 4. Response to Comments on the Quality of English Language + 5. Additional clarifications + +Our earlier point-by-point response (response_to_reviewers.md) used +internal R3.1-R3.9 numbering and did not include sections 1, 2, 4, or 5. +This script builds a .docx with the correct structure, with the +reviewer's verbatim comments pasted into each Comments N: entry and our +response text pasted into each Response N: entry. + +Usage: + python build_r3_docx.py + +Outputs: + response_R3_MDPI.docx (upload this to the portal) + +Requires: python-docx (installed via pip). 
+""" + +from __future__ import annotations + +from pathlib import Path + +from docx import Document +from docx.enum.table import WD_ALIGN_VERTICAL +from docx.enum.text import WD_ALIGN_PARAGRAPH +from docx.shared import Cm, Pt, RGBColor + +HERE = Path(__file__).resolve().parent +import os + +# Default output; if the file is locked (likely open in Word), save to a +# sibling with a _v2 suffix so the user can diff and replace manually. +PRIMARY = HERE / "response_R3_MDPI.docx" +FALLBACK = HERE / "response_R3_MDPI_v2.docx" + + +def _select_output() -> Path: + try: + # Test write-ability: open the primary for append-binary and close. + if PRIMARY.exists(): + with PRIMARY.open("ab"): + pass + return PRIMARY + except (OSError, PermissionError): + return FALLBACK + + +OUTPUT = _select_output() + + +# ---------- Content: reviewer comments (verbatim, in order of appearance) ---------- + +REVIEWER_SUMMARY_COMMENT = ( + "The manuscript addresses a timely and interesting topic at the " + "intersection of financial market microstructure and large language " + "model validation. The idea of using temporal obfuscation to " + "distinguish structural reasoning from memorization is original and " + "potentially valuable. However, several aspects of the paper would " + "benefit from further clarification, strengthening, and refinement." +) + +# Per-item reviewer comments (8 substantive items). Paragraphs are +# transcribed verbatim from the review report as received. +COMMENTS = [ + ( + "1", + "The introduction must be shortened and made more focused. It currently " + "contains overly long and philosophical paragraphs. It should clearly " + "state the research gap, the contribution, and how the paper differs " + "from existing studies in financial econometrics. 
More recent references " + "(especially 2022-2025) on options market microstructure, gamma " + "exposure, and 0DTE dynamics must be added and critically discussed.", + ), + ( + "2", + "The positioning of the paper must be clarified. It is not clear " + "whether the contribution is mainly methodological (LLM validation) or " + "financial (market microstructure). This needs to be explicitly stated " + "and consistently reflected throughout the paper.", + ), + ( + "3", + "The research design must be strengthened. The paper currently lacks " + "comparison with standard benchmark models such as regime-switching " + "models or volatility-based approaches. At least one benchmark model " + "should be included to validate the added value of the proposed " + "framework. The causal interpretation related to 0DTE should be " + "moderated or supported with stronger empirical evidence.", + ), + ( + "4", + "The methodology section needs more transparency. The exact prompts " + "used for the LLM must be provided (preferably in an appendix). The " + "choice of thresholds (70% persistence, $5B magnitude, <=5 flips) must " + "be justified or tested through sensitivity analysis. The impact of " + "model parameters (e.g., temperature = 1.0) on reproducibility must be " + "explained.", + ), + ( + "5", + "The results section must include statistical validation. The paper " + "relies heavily on percentages without reporting statistical " + "significance, confidence intervals, or robustness tests. These must " + "be added. Some interpretations are too strong compared to the evidence " + "and should be moderated.", + ), + ( + "6", + "The discussion must be better connected to finance. The implications " + "for risk management, market efficiency, and practitioners should be " + "explicitly developed. The current discussion is too general and " + "sometimes theoretical.", + ), + ( + "7", + "The limitations section must be expanded. 
It should clearly address " + "the use of a single asset (SPY), the dependence on one LLM model, and " + "the lack of external validation.", + ), + ( + "8", + "Figures and tables must be improved. Some are too dense and difficult " + "to read. Labels and captions should be clearer and more explanatory.", + ), +] + +ENGLISH_COMMENT = ( + "The clarity of the manuscript needs improvement. Many sentences are too " + "long and complex, which affects readability. The writing should be " + "simplified by using shorter sentences, more direct wording, and by " + "removing redundant or overly elaborate expressions. Careful language " + "editing is recommended to improve clarity and flow." +) + + +# ---------- Content: our responses (paragraph-mode text) ---------- + +# Each entry is a list of paragraphs; each paragraph is a list of +# (text, bold) tuples. bold=True renders the run in bold; bold=False +# renders plain. Response text is rendered in RED per the MDPI template +# convention. + +R: dict[str, list[list[tuple[str, bool]]]] = { + "1": [ + [ + ("We agree and have rewritten Section 1 Introduction to address each element of this comment.", False), + ], + [ + ("(i) Shortened and less philosophical. ", True), + ( + 'The original paragraph-1 opener ("The decisive question ' + 'confronting any deployment of large language models...") ' + "has been removed. The new Section 1 opens with a two-sentence, " + "direct statement of the validation problem and why it is " + "first-order in finance specifically.", + False, + ), + ], + [ + ("(ii) Explicit research gap. ", True), + ( + 'A new paragraph titled "Research gap" follows the opener. 
' + "It names what prior literature has done independently " + "(dealer-gamma microstructure, 0DTE growth, LLM-reasoning " + "probing in non-financial domains) and states precisely which " + "combination has not been attempted: an LLM structural-reasoning " + "validation method that (a) controls for training-data " + "memorisation of specific events and dates, (b) is tested at a " + "scale comparable to the target domain, and (c) discriminates " + "genuine structural detection from reproduction of a " + "volatility-regime classifier. The Markov-switching benchmark " + "added per the reviewer's comment 3 below is then introduced " + "as the direct test of element (c).", + False, + ), + ], + [ + ("(iii) Why 0DTE matters here. ", True), + ( + 'A new "Why 0DTE matters here" paragraph replaces the previous ' + '"practical urgency" framing. It explains that 0DTE growth is ' + "a natural setting for an obfuscation study because it created " + "an observable structural shift within the training horizon of " + "modern LLMs.", + False, + ), + ], + [ + ("(iv) 2022-2025 references added and critically discussed. ", True), + ( + "The key new citation is Dim, Eraker & Vilkov (2023) " + '("0DTEs: Trading, Gamma Risk and Volatility Propagation", ' + "SSRN 4692190), which provides the first systematic empirical " + "study of 0DTE dealer inventory. It is now cited in Section 1 " + "and critically discussed in Section 2.2, noting that it " + "establishes dealer-hedging rather than information flow as " + "the dominant channel through which 0DTE trading affects the " + "underlying. We retain the existing 2022-2025 refs " + "(Anderegg et al. 
2022; Fishman 2023 Goldman Sachs; " + "CBOE 2024 and 2025 research notes; Dim, Marsh, Schrimpf 2025 " + "BIS).", + False, + ), + ], + [ + ( + "Change location: Section 1 Introduction paragraphs 1-4 (full rewrite), Section 2.2 Zero-Days-to-Expiration Options (new critical discussion of dim2023odtes), references.bib (new dim2023odtes entry).", + True, + ), + ], + ], + "2": [ + [ + ( + "We agree and have stated the positioning explicitly in two " + "places to ensure the stance is consistent throughout the " + "paper.", + False, + ), + ], + [ + ( + "The primary contribution is methodological: temporal " + "obfuscation testing (with the WHO->WHOM->WHAT causal " + "framework and multi-scale validation protocol) as a " + "generalizable procedure for validating LLM structural " + "reasoning. Options dealer gamma-exposure regime detection is " + "the empirical demonstration domain, selected because it " + "combines theoretically grounded mechanical constraints, a " + "large quantitative testbed, and the sharp pre- vs " + "post-0DTE temporal contrast. The financial-market findings " + "(69.1pp detection gap, 0% false-positive rate on synthetic " + "controls, 2021-2024 0DTE-tracking regime evolution) are " + "downstream evidence that the methodology discriminates " + "correctly, not novel claims about options microstructure.", + False, + ), + ], + [ + ( + 'Change location: new Section 1.3 "Positioning" subsection (between Contributions and Paper Organization) and Section 7 Conclusion opening (rewritten to echo the same stance before the four numbered contributions).', + True, + ), + ], + ], + "3": [ + [ + ( + "We have added a two-state Markov-switching regression benchmark and moderated the 0DTE causal language in parallel.", + False, + ), + ], + [ + ("(a) Benchmark comparison. 
", True), + ( + "We fit statsmodels.tsa.regime_switching.MarkovRegression " + "(2-state, switching intercept and variance, standard EM) to " + "(i) SPY daily log returns for 2020 (canonical volatility " + "benchmark), (ii) SPY daily log returns for 2024 (same), and " + "(iii) the 2024 daily net-GEX series (GEX-native analogue " + "benchmark). Per-window agreement with LLM labels: 2020 " + "returns N=201, kappa=0.045; 2024 returns N=222, kappa=-0.178; " + "2024 net-GEX N=221, kappa=0.610. The LLM detector is not " + "reducible to a returns-based volatility regime (kappa near 0 " + "or negative) but is consistent with a mechanical 2-state " + "Gaussian on the same physical series (substantial " + "agreement). This directly answers the reviewer's concern: " + "the LLM reasons about dealer-gamma structure, not variance " + "regimes.", + False, + ), + ], + [ + ("(b) Moderated 0DTE causal language. ", True), + ( + 'Section 6.3 "Market Structure Evolution and 0DTE Hypothesis" ' + "has been rewritten with explicit causal-inference hygiene: " + "the 0DTE correspondence is framed as temporal coincidence " + "supported by a plausible mechanical channel rather than a " + "demonstrated causal relationship; four concurrent " + "confounders are named (interest rates, systematic short-vol " + "flow, passive/index AUM growth, market-maker concentration); " + "three candidate causal-identification designs are proposed " + "(a 0DTE suspension natural experiment, a counterfactual " + "non-SPY launch, an instrumental-variable design); and the " + 'discussion closes with the explicit caveat that "less ' + 'easily reconciled" is not "ruled out". 
Section 7 ' + "Conclusion contribution 3 similarly replaces " + '"0DTE-driven structural reorganization" with ' + "temporal-coincidence language.", + False, + ), + ], + [ + ( + "Change location: new Section 3.8 Markov-Switching Benchmark, new Section 5.6 Comparison with Markov-Switching Benchmark (Table 6 + Figure 8), Section 6.3 rewrite, Section 7 Conclusion contribution 3 rewrite.", + True, + ), + ], + ], + "4": [ + [ + ("We have addressed this comment in three parts.", False), + ], + [ + ("(a) Exact prompts. ", True), + ( + "The complete regime-detection prompt is now reproduced " + "verbatim in a new Appendix A, together with the actual " + "OpenAI Batch API configuration we used (model o4-mini; " + "temperature defaults to 1 because reasoning models reject " + "user-supplied temperature overrides; max_completion_tokens " + "not explicitly set, so the OpenAI API default applies; JSON " + "structure requested in the prompt rather than enforced via " + "the response_format field) and the output JSON schema used " + "for parsing. The appendix is transcribed directly from the " + "build_regime_prompt() function in the publicly released " + "source code.", + False, + ), + ], + [ + ("(b) Threshold sensitivity. ", True), + ( + "A 5×3×3 grid sweep (persistence in {60, 65, 70, " + "75, 80}%, magnitude in {$3B, $5B, $7B}, flips <= {3, 5, 7}; " + "45 configurations) has been applied to the 223 Phase 3 " + "(2024) and 220 Phase 4 (2020) per-window records already " + "on disk. Results: the 2024-vs-2020 detection gap ranges " + "[34.1, 85.2] pp across configurations (median 63.2 pp) and " + "exceeds 50 pp in 40/45 configurations. Reported in new " + "Section 5.5 Threshold Sensitivity with Figure 7 heatmap.", + False, + ), + ], + [ + ("(c) Temperature / reproducibility. 
", True), + ( + "Appendix A contains a Reproducibility note explaining that " + "OpenAI reasoning models (o1, o3, o4-mini, and GPT-5 " + "reasoning variants) reject user-supplied temperature / " + "top_p values and run at the default temperature of 1. The " + "seed parameter is supported by o4-mini (OpenAI documents " + "it as best-effort determinism that can shift when the " + "server system_fingerprint changes), but we did not set a " + "seed in this study. Bit-identical reproduction of any " + "single response is therefore not guaranteed. " + "Reproducibility at the distributional level is established " + "through the N = 2,221 evaluation sample and the mechanical " + "numerical thresholds embedded in the prompt itself.", + False, + ), + ], + [ + ( + "Change location: new Appendix A (pp. 24-29 in the revised PDF), new Section 5.5 Threshold Sensitivity, cross-reference added in Section 3.5 LLM Configuration pointing to Appendix A.", + True, + ), + ], + ], + "5": [ + [ + ("We agree. The revision addresses this comment in four parts.", False), + ], + [ + ("(a) Confidence intervals. ", True), + ( + "Every detection rate reported in Section 4 Results now " + "carries a 95% confidence interval. For Phases 1-4 and all " + "Phase 2 negative controls we report a 10,000-replicate " + "percentile bootstrap over windows (deterministic seed); for " + "Phase 5 per-year rates we report 95% Wilson score intervals " + "(Brown, Cai & DasGupta, 2001). Phase 3 full 2024: 81.2% " + "[75.8, 86.1]%. Phase 4 full 2020: 12.1% [8.1, 16.6]%. The " + "2020 upper CI bound (17.3%) does not overlap the 2024 lower " + "CI bound (75.8%), which directly supports the 69.1 pp " + "separation claim with bounded evidence rather than point " + 'estimates alone. A new "Statistical conventions" paragraph ' + "at the head of Section 4.1 documents the methodology.", + False, + ), + ], + [ + ("(b) Expanded chi-square / Fisher reporting. 
", True), + ( + "Phase 4 (2020 vs 2024): Pearson's chi-square = 213.67 " + "(df=1, p = 2.2e-48), Yates-corrected chi-square = 210.90 " + "(p = 8.7e-48), Fisher's exact two-sided p = 1.8e-52 " + "(odds ratio 31.3), phi = 0.69, risk difference 69.1 pp " + "(95% Wald CI [62.4, 75.7] pp). Phase 5 (2023 -> 2024 " + "transition): chi-square = 314.4 (p = 2.4e-70), Fisher's " + "exact p = 9.9e-87, phi = 0.82. Abstract and Introduction " + "updated to report Fisher's exact p rather than a bare " + '"p < 0.0001".', + False, + ), + ], + [ + ("(c) Robustness. ", True), + ( + "The 45-configuration threshold-sensitivity sweep described " + "under comment 4(b) above functions as the robustness test " + "(gap > 50 pp in 40/45 configurations).", + False, + ), + ], + [ + ("(d) Moderated claim language. ", True), + ( + "Section 7 Conclusion contribution 2 now reports the 69.1 pp " + "separation with explicit CI brackets on each rate and " + "Fisher's exact p, and cites the 45-configuration robustness " + "of the 50 pp gap. Contribution 3 moderates the 0DTE-causal " + "language (see comment 3(b) above). Section 6.3 similarly " + 'softens "tipping-point dynamic strengthens the structural ' + 'interpretation" to "is consistent with, rather than proof ' + 'of". 
Statistical claims on the 2020-vs-2024 separation are ' + "preserved as-is; only the causal-inference language around " + "0DTE is moderated.", + False, + ), + ], + [ + ( + "Change location: Section 4.1 statistical conventions paragraph; Section 4.3 Phase 1/3 inline CIs; Tables 2, 3, 4, 5 CI columns; references.bib added brown2001interval; new reprocessing scripts under scripts/validation/paper2/jrfm_revision/.", + True, + ), + ], + ], + "6": [ + [ + ("We agree that the original discussion was too general on the practitioner side.", False), + ], + [ + ( + 'The previous Section 6.6 "Practitioner Implications" has ' + 'been renamed "Practical Implications" and restructured ' + "into three explicit subsubsections matching the three axes " + "the reviewer identified:", + False, + ), + ], + [ + ("Risk management. ", True), + ( + "Three concrete applications developed: intraday volatility " + "budgeting (regime as leading indicator for volatility-of-" + "volatility exposure sizing), option-book hedging under OpEx " + "concentration, and risk-scenario design (2020 fragmented vs " + "2024 persistent-negative as natural conditioning variables).", + False, + ), + ], + [ + ("Market efficiency. ", True), + ( + "A positive account is offered: the detection-alpha " + "orthogonality is consistent with a weakly efficient market " + "in which structural constraints are reliably identifiable " + "but already priced. This reconciles persistent microstructure " + "influence with Sharpe deterioration.", + False, + ), + ], + [ + ("Practitioners: pipeline design and deployment. 
", True), + ( + "Two design implications developed: (i) the 30.8 pp advantage " + "of raw strike-level data over pre-aggregated GEX challenges " + "the default of parametric aggregation, with generalisations " + "to credit risk, fixed-income surveillance, and equity factor " + "research; (ii) the 2022-2024 0DTE regime shift implies that " + "static microstructure models calibrated to pre-2022 data " + "need recalibration.", + False, + ), + ], + [ + ( + 'Change location: Section 6.6 "Practical Implications" (renamed from "Practitioner Implications"), three new subsubsections.', + True, + ), + ], + ], + "7": [ + [ + ("We thank the reviewer for flagging these specific omissions.", False), + ], + [ + ( + 'Section 6.7 has been renamed "Limitations and Future Work" ' + "and expanded from six limitations to seven. Each item is " + "now explicitly tied to a concrete follow-up study. The " + "three items the reviewer named are now addressed as:", + False, + ), + ], + [ + ("(a) Single-asset scope. ", True), + ( + 'Item 1 ("Single-asset scope") explicitly acknowledges ' + "that all results concern SPY, lists QQQ, IWM, individual " + "equities, and non-equity underliers as relevant but " + "untested targets, and identifies cross-asset replication as " + "the single highest-priority item for future work.", + False, + ), + ], + [ + ("(b) Single-LLM dependence. ", True), + ( + "A dedicated second item proposes a model-swap protocol " + "covering Anthropic Claude, OpenAI o3, Google Gemini, and " + "open-source reasoning models using identical prompts, with " + "cross-model agreement analysis as the diagnostic.", + False, + ), + ], + [ + ("(c) Lack of independent external validation. 
", True), + ( + "A new third item acknowledges that per-window ground-truth " + "metrics are computed from the same Alpha Vantage feed used " + "to construct the windows, and proposes cross-validation " + "against CBOE DataShop / OPRA / commercial vendors " + "(SpotGamma, MenthorQ) and against related microstructure " + "observables.", + False, + ), + ], + [ + ( + "Change location: Section 6.7 Limitations and Future Work (renamed, expanded 6 -> 7 items, each with explicit future-work sentence).", + True, + ), + ], + ], + "8": [ + [ + ( + "We addressed the comment on figures and tables in two " + "complementary ways: (i) a figure-font pass to raise " + "in-figure text to publication-legible sizes and " + "standardise across all figures, and (ii) a caption " + "rewrite to make each caption self-contained.", + False, + ), + ], + [ + ("(a) Figure font-size standardisation. ", True), + ( + "We audited every hardcoded ``fontsize=`` and " + "``labelsize=`` value across the eight JRFM figure " + "generators and found values as low as 8-11 pt, which " + "rendered as sub-10 pt type when the figure was scaled to " + "textwidth in an A4 layout. We applied a uniform size-" + 'bump rule (floor 12 pt, "+2" on moderate sizes) across ' + "all eight figures, producing a consistent typographic " + "hierarchy (12 pt for smallest annotations, rising to " + "16-18 pt for titles and display numbers). All six " + "original figures (Figures 1-6) and the two revision-" + "added figures (Figures 7-8) were regenerated from the " + "bumped scripts. The one-shot bump script " + "(``scripts/bump_font_sizes.py``) is committed in the " + "code release so the change is reproducible.", + False, + ), + ], + [ + ("(b) Self-contained captions. ", True), + ( + "Every caption now follows the rule: state (i) what is " + "shown, (ii) the key numerical values a reader should " + "notice, and (iii) what conclusion the reader should " + "take from the figure. 
Five figure captions (Figures 1, " + "3, 4, 5, 6) were rewritten in this pass; Figures 7 and " + "8 and Tables 2-6 (added during other parts of the " + "revision) were already written to this standard. Each " + 'rewritten caption ends with an explicit "Read this ' + 'figure as:" clause giving the intended interpretation. ' + "For example, Figure 5 (GEX magnitude distribution) " + 'closes with "Read this figure as: the magnitude ' + "criterion alone -- before persistence or stability are " + "even checked -- already separates the two eras, and " + "the chosen $5B threshold is positioned in the trough " + "between the two distributions rather than in the bulk " + 'of either."', + False, + ), + ], + [ + ( + "The resulting figures are materially easier to read at " + "journal-print scale than the originals, with consistent " + "font sizes across all eight figures and no in-figure " + "text smaller than 12 pt.", + False, + ), + ], + [ + ( + "Change location: eight figure PNGs regenerated under docs/papers/paper2/figures/output/ and copied into docs/papers/jrfm/figures/; the bump script and all six modified figure generators committed; captions in Section 3 (Figure 1) and Section 4 (Figures 3, 4, 5, 6) rewritten in the .tex.", + True, + ), + ], + ], +} + +ENGLISH_RESPONSE: list[list[tuple[str, bool]]] = [ + [ + ("We performed a full editing pass over the manuscript after all content changes were settled.", False), + ], + [ + ( + "We checked the manuscript for the usual English-editing " + 'offenders ("In order to", "It should be noted that", "It is ' + 'worth noting", "Due to the fact that", "This is because", ' + '"Obviously", "Clearly"). 
None of these phrases appear in the ' + "manuscript; the original draft was already written in an " + "active, direct register, so no changes were required on this " + "axis.", + False, + ), + ], + [ + ( + "We identified the paragraphs with the most elaborate nested-" + "clause sentences (the Section 1 philosophical opener and " + "Section 5.5 Dispersed Knowledge) and rewrote them for " + "directness. Section 1 was fully replaced in the rewrite for " + "comment 1 above. Section 5.5 was tightened by breaking three " + ">40-word sentences into two-sentence units while retaining the " + "Hayek citation and the 30.8 pp empirical claim.", + False, + ), + ], + [ + ( + "We verified terminology consistency across sections: " + '"regime" (not "state") for the detection target, ' + '"persistent / fragmented" (not "stable / unstable") for ' + 'the binary outcome, "obfuscation" (not "anonymisation"), ' + '"dealer gamma positioning" where the detection task is the ' + "referent.", + False, + ), + ], + [ + ( + "Change location: targeted tightening in Section 6.5 Dispersed Knowledge; Section 1 opener and Section 6.3 Market Structure Evolution rewrites landed in the earlier comments 1 and 3 commits; technical-term consistency verified throughout.", + True, + ), + ], +] + + +# ---------- Doc builder ---------- + +RED = RGBColor(0xC0, 0x00, 0x00) # MDPI convention: responses in red +BLACK = RGBColor(0x00, 0x00, 0x00) + + +def add_styled_paragraph( + doc: Document, runs: list[tuple[str, bool]], *, response: bool, italic_first: bool = False +) -> None: + p = doc.add_paragraph() + for i, (text, bold) in enumerate(runs): + run = p.add_run(text) + run.bold = bold + if response: + run.font.color.rgb = RED + if italic_first and i == 0: + run.italic = True + + +def add_comment_block(doc: Document, n: str, comment: str, response_paragraphs: list[list[tuple[str, bool]]]) -> None: + # "Comments N:" + p = doc.add_paragraph() + r = p.add_run(f"Comments {n}: ") + r.bold = True + r.font.color.rgb 
= BLACK + p.add_run(comment) + + # "Response N:" + p2 = doc.add_paragraph() + r2 = p2.add_run(f"Response {n}: ") + r2.bold = True + r2.font.color.rgb = RED + # Mark revisions in red per template convention + for run_set in response_paragraphs: + add_styled_paragraph(doc, run_set, response=True) + + +def build() -> None: + doc = Document() + + # Set default paragraph / font size + style = doc.styles["Normal"] + style.font.name = "Calibri" + style.font.size = Pt(11) + + # Title + title = doc.add_heading("Response to Reviewer 3 Comments", level=0) + title.alignment = WD_ALIGN_PARAGRAPH.CENTER + subtitle = doc.add_paragraph( + "JRFM Submission jrfm-4256551 — Validating LLM Structural " + "Reasoning: Detecting Persistent Market Regimes Through Temporal " + "Obfuscation" + ) + subtitle.alignment = WD_ALIGN_PARAGRAPH.CENTER + author = doc.add_paragraph("Christopher Regan and Ying Xie, Kennesaw State University") + author.alignment = WD_ALIGN_PARAGRAPH.CENTER + date_para = doc.add_paragraph("24 April 2026") + date_para.alignment = WD_ALIGN_PARAGRAPH.CENTER + + # 1. Summary + doc.add_heading("1. Summary", level=1) + doc.add_paragraph( + "Thank you very much for taking the time to review this " + "manuscript and for the substantive, constructive feedback. The " + "reviewer's comments identified meaningful improvements in " + "introduction focus, contribution positioning, benchmark " + "comparison, methodological transparency, statistical rigour, " + "practitioner connection, limitations scope, and figure clarity. " + "We have addressed every point-by-point comment in the revised " + "manuscript; detailed responses and the corresponding revisions " + "(marked in red) are provided below. The revised manuscript is " + "31 A4 pages, up from 18 in the originally submitted version." + ) + # The reviewer's own summary paragraph + doc.add_paragraph(f"Reviewer's summary: “{REVIEWER_SUMMARY_COMMENT}”") + + # 2. Questions for General Evaluation + doc.add_heading("2. 
Questions for General Evaluation", level=1) + rows = [ + ( + "Does the introduction provide sufficient background and include all relevant references?", + "Must be improved", + 'Addressed in Comment 1 below: Section 1 rewritten, new "Research gap" paragraph, new 2022-2025 references (Dim, Eraker & Vilkov 2023).', + ), + ( + "Is the research design appropriate?", + "Can be improved", + "Addressed in Comment 3: new Markov-switching benchmark (Section 3.8 + Section 5.6) demonstrates the framework is not reducible to a volatility-regime classifier.", + ), + ( + "Are the methods adequately described?", + "Can be improved", + "Addressed in Comment 4: new Appendix A reproduces the full LLM prompt verbatim; new Section 5.5 reports a 45-configuration threshold-sensitivity sweep; reproducibility posture documented.", + ), + ( + "Are the results clearly presented?", + "Can be improved", + "Addressed in Comment 5: every detection rate in Section 4 now carries a 95% bootstrap or Wilson CI; full chi-square / Fisher statistics reported; robustness confirmed.", + ), + ( + "Are the conclusions supported by the results?", + "Can be improved", + "Addressed in Comments 2 and 5(d): positioning statement added to Section 1.3 and echoed in Section 7 opening; strong-claim language moderated where CIs or sensitivity warranted.", + ), + ( + "Are all figures and tables clear and well-presented?", + "Can be improved", + 'Addressed in Comment 8: captions on Figures 1, 3, 4, 5, 6 rewritten to be self-contained with explicit "Read this figure as:" clauses.', + ), + ( + "Quality of English Language", + "Can be improved", + "Addressed in Section 4 of this response: targeted tightening in Section 5.5; wordy-transition check passed; terminology consistency verified.", + ), + ] + table = doc.add_table(rows=1 + len(rows), cols=3) + table.style = "Table Grid" + hdr = table.rows[0].cells + for i, text in enumerate(("Question", "Reviewer's evaluation", "Response and revisions")): + hdr[i].text = "" + run 
= hdr[i].paragraphs[0].add_run(text) + run.bold = True + for i, (q, ev, resp) in enumerate(rows, start=1): + cells = table.rows[i].cells + cells[0].text = q + cells[1].text = ev + cells[2].text = resp + for c in cells: + c.vertical_alignment = WD_ALIGN_VERTICAL.TOP + + # 3. Point-by-point response + doc.add_heading("3. Point-by-point response to Comments and Suggestions for Authors", level=1) + for n, comment in COMMENTS: + add_comment_block(doc, n, comment, R[n]) + + # 4. English Language + doc.add_heading("4. Response to Comments on the Quality of English Language", level=1) + # Reviewer comment + p = doc.add_paragraph() + r = p.add_run("Point 1: ") + r.bold = True + p.add_run(ENGLISH_COMMENT) + # Response + p2 = doc.add_paragraph() + r2 = p2.add_run("Response 1: ") + r2.bold = True + r2.font.color.rgb = RED + for run_set in ENGLISH_RESPONSE: + add_styled_paragraph(doc, run_set, response=True) + + # 5. Additional clarifications + doc.add_heading("5. Additional clarifications", level=1) + doc.add_paragraph( + "We have also raised with the handling editor (separately, via " + "the portal comments-to-editor field) that Reviewer 1's report " + "appears to apply to a different manuscript — it asks about " + "conformable derivatives in the Heston framework, Heston-He-Zhu " + "comparisons, jump-diffusion and fractional models, and " + "computational challenges in an option-pricing algorithm, none " + "of which are topics our manuscript addresses. We are prepared " + "to respond substantively once the correct review is available, " + "or to a replacement reviewer if that is more expedient. This " + "clarification is orthogonal to Reviewer 3's comments and does " + "not affect the revisions above." + ) + doc.add_paragraph( + "We have also incorporated Reviewer 2's recommendation for " + "acceptance (uploaded separately as the Reviewer 2 response)." 
+ ) + + doc.save(OUTPUT) + print(f"Wrote {OUTPUT}") + + +if __name__ == "__main__": + build() diff --git a/docs/papers/jrfm/portal_upload/build_r3_pdf.py b/docs/papers/jrfm/portal_upload/build_r3_pdf.py new file mode 100644 index 0000000..6bebd81 --- /dev/null +++ b/docs/papers/jrfm/portal_upload/build_r3_pdf.py @@ -0,0 +1,446 @@ +"""Convert the Reviewer 3 point-by-point response Markdown into a LaTeX +document and compile to PDF using pdflatex (MiKTeX). + +Why this script exists: the JRFM / MDPI portal accepts the Author's Notes +to Reviewer as either pasted text or uploaded PDF/Word. Reviewer 3's +response is long (~29 KB, many tables and bullet lists), so uploading a +PDF is cleaner than pasting raw text. Pandoc is not installed on this +machine, but pdflatex is, so we hand-roll a small Markdown -> LaTeX +converter tailored to the specific constructs used in the source file: + +- Top-level (#) and sub (## / ###) headings +- Blockquotes (lines starting with "> ") +- Bold (**...**) and italics (*...*) +- Inline code (`...`) +- Unordered lists (- / *) +- Ordered lists (1.) +- Simple pipe tables (| ... | ...) +- Horizontal rules (---) + +It is not a general-purpose converter; it only handles what +response_R3_pointbypoint.md contains. 
+ +Usage: + python build_r3_pdf.py + +Outputs: + response_R3_pointbypoint.tex (intermediate) + response_R3_pointbypoint.pdf (upload this to the portal) +""" + +from __future__ import annotations + +import re +import subprocess +import sys +from pathlib import Path + +HERE = Path(__file__).resolve().parent +SRC_MD = HERE / "response_R3_pointbypoint.md" +OUT_TEX = HERE / "response_R3_pointbypoint.tex" +OUT_PDF = HERE / "response_R3_pointbypoint.pdf" + + +# ---------- LaTeX escaping and inline formatting ---------- + +LATEX_ESCAPES = { + "\\": r"\textbackslash{}", + "&": r"\&", + "%": r"\%", + "$": r"\$", + "#": r"\#", + "_": r"\_", + "{": r"\{", + "}": r"\}", + "~": r"\textasciitilde{}", + "^": r"\textasciicircum{}", +} + +# Unicode characters that appear in the source and require LaTeX math-mode +# or special-command substitutes under the default T1/pdflatex setup. +UNICODE_MAP = { + "κ": r"$\kappa$", + "φ": r"$\varphi$", + "χ": r"$\chi$", + "²": r"$^{2}$", + "³": r"$^{3}$", + "×": r"$\times$", + "→": r"$\rightarrow$", + "≥": r"$\geq$", + "≤": r"$\leq$", + "≈": r"$\approx$", + "≠": r"$\neq$", + "—": "---", + "–": "--", + "…": r"\ldots{}", + "•": r"\textbullet{}", + "′": "'", + "″": "''", + "‘": "`", + "’": "'", + "“": "``", + "”": "''", + "∈": r"$\in$", + "∞": r"$\infty$", + "±": r"$\pm$", + "°": r"$^\circ$", + "μ": r"$\mu$", + "σ": r"$\sigma$", + "α": r"$\alpha$", + "β": r"$\beta$", + "Δ": r"$\Delta$", + "Σ": r"$\Sigma$", + "§": r"\S{}~", + "−": "-", # U+2212 unicode minus + "‐": "-", # U+2010 unicode hyphen + "‑": "-", # U+2011 non-breaking hyphen + # Superscript digits (U+2070..U+2079 and U+207A..U+207F) -- common in + # scientific p-values like 10^-48 written as 10⁻⁴⁸ + "⁰": r"$^{0}$", + "¹": r"$^{1}$", + "²": r"$^{2}$", + "³": r"$^{3}$", + "⁴": r"$^{4}$", + "⁵": r"$^{5}$", + "⁶": r"$^{6}$", + "⁷": r"$^{7}$", + "⁸": r"$^{8}$", + "⁹": r"$^{9}$", + "⁻": r"$^{-}$", + "⁺": r"$^{+}$", + # Subscript digits + "₀": r"$_{0}$", + "₁": r"$_{1}$", + "₂": r"$_{2}$", + "₃": 
r"$_{3}$", + "₄": r"$_{4}$", + "₅": r"$_{5}$", + "₆": r"$_{6}$", + "₇": r"$_{7}$", + "₈": r"$_{8}$", + "₉": r"$_{9}$", + " ": "~", # non-breaking space +} + + +def escape_latex(s: str) -> str: + # Escape LaTeX specials FIRST (otherwise Unicode-map replacements like + # ``$\times$`` would have their backslashes escaped and break). + out = [] + for ch in s: + if ch in LATEX_ESCAPES: + out.append(LATEX_ESCAPES[ch]) + else: + out.append(ch) + s = "".join(out) + # Now substitute Unicode characters into LaTeX commands in-place; the + # replacements are already valid LaTeX and must not be re-escaped. + for src, dst in UNICODE_MAP.items(): + s = s.replace(src, dst) + return s + + +def apply_inline(s: str) -> str: + """Apply inline markdown (bold, italic, code) after LaTeX-escaping. + + Order: + 1. protect inline code spans first (they should not be further + processed) + 2. escape the rest + 3. apply bold and italic markers + 4. splice code spans back in + """ + # Extract inline code spans + code_spans = [] + + def stash_code(m): + code_spans.append(m.group(1)) + return f"\x00CODE{len(code_spans) - 1}\x00" + + s = re.sub(r"`([^`]+)`", stash_code, s) + + s = escape_latex(s) + + # Bold: **text** -> \textbf{text} + s = re.sub(r"\*\*([^*]+)\*\*", r"\\textbf{\1}", s) + # Italic: *text* -> \textit{text} + s = re.sub(r"\*([^*]+)\*", r"\\textit{\1}", s) + + # Restore code spans + def restore_code(m): + idx = int(m.group(1)) + return r"\texttt{" + escape_latex(code_spans[idx]) + "}" + + s = re.sub(r"\x00CODE(\d+)\x00", restore_code, s) + + return s + + +# ---------- Pre-processing: make inline-LaTeX snippets readable ---------- + + +def preprocess_latex_snippets(md: str) -> str: + """Rewrite inline-LaTeX snippets that appear in the Markdown source into + plain-text readable forms. + + The response document cites the main manuscript's LaTeX source liberally + (``\citep{dim2023odtes}``, ``\ref{sec:methodology}``). 
In a stand-alone + response PDF without bibliography or label resolution these should read + as plain prose rather than raw macros. + """ + # \citep{key} and \citet{key} -> (key) + md = re.sub(r"\\citep\{([^}]+)\}", r"(\1)", md) + md = re.sub(r"\\citet\{([^}]+)\}", r"\1", md) + # \citealp{a,b,c} -> a, b, c + md = re.sub(r"\\citealp\{([^}]+)\}", r"\1", md) + # \ref{sec:xxx} -> sec:xxx (plain-text label; reader can find it in manuscript) + md = re.sub(r"\\ref\{([^}]+)\}", r"\1", md) + # \S\ref{...} or \S~\ref{...} -> §(...) + md = re.sub(r"\\S[~\s]*", "§", md) + # \emph{text} -> *text* so downstream italic handling kicks in + md = re.sub(r"\\emph\{([^}]+)\}", r"*\1*", md) + # \textbf{text} -> **text** + md = re.sub(r"\\textbf\{([^}]+)\}", r"**\1**", md) + # \textit{text} -> *text* + md = re.sub(r"\\textit\{([^}]+)\}", r"*\1*", md) + # \texttt{text} -> `text` + md = re.sub(r"\\texttt\{([^}]+)\}", r"`\1`", md) + # Remove stray \\ at end of line (line-break markers) + md = re.sub(r"\\\\\s*$", "", md, flags=re.MULTILINE) + return md + + +# ---------- List-continuation folder ---------- + + +def fold_list_continuations(md: str) -> str: + """Fold multi-line list items into single logical lines. + + In CommonMark, an ordered or unordered list item's prose can span + multiple physical lines as long as the continuation lines are + indented past the list marker. Our block converter treats each + physical line independently, which fragments multi-line items into + multiple single-item lists. This pre-pass joins continuation lines + back onto their opening line so the block converter sees one + item-per-line. + """ + lines = md.splitlines() + out: list[str] = [] + in_item = False # are we inside a list item whose prose continues? 
+ for line in lines: + stripped = line.strip() + is_blank = stripped == "" + starts_list = bool(re.match(r"^(\s*)([-*+]|\d+\.)\s+", line)) + is_indented_continuation = line.startswith((" ", "\t")) and not is_blank and not starts_list + if in_item and is_indented_continuation: + # Append (with a single space) to the previous output line. + out[-1] = out[-1].rstrip() + " " + stripped + continue + if is_blank or starts_list is False: + in_item = False + if starts_list: + in_item = True + out.append(line) + return "\n".join(out) + + +# ---------- Block-level conversion ---------- + + +def convert(md: str) -> str: + lines = md.splitlines() + out: list[str] = [] + i = 0 + in_list = False + list_type: str | None = None # "itemize" or "enumerate" + + def close_list(): + nonlocal in_list, list_type + if in_list: + out.append(f"\\end{{{list_type}}}") + in_list = False + list_type = None + + while i < len(lines): + line = lines[i].rstrip("\n") + + # Horizontal rule + if re.match(r"^---\s*$", line): + close_list() + out.append(r"\bigskip\hrule\bigskip") + i += 1 + continue + + # Headings + m = re.match(r"^(#{1,6})\s+(.*)$", line) + if m: + close_list() + level, text = len(m.group(1)), apply_inline(m.group(2)) + sec_cmds = { + 1: r"\section*{%s}", + 2: r"\subsection*{%s}", + 3: r"\subsubsection*{%s}", + 4: r"\paragraph{%s}", + 5: r"\subparagraph{%s}", + 6: r"\paragraph{%s}", + } + out.append(sec_cmds[level] % text) + i += 1 + continue + + # Blockquote + if line.startswith(">"): + close_list() + quote_lines = [] + while i < len(lines) and lines[i].startswith(">"): + quote_lines.append(lines[i].lstrip("> ").rstrip()) + i += 1 + out.append(r"\begin{quote}") + out.append(apply_inline(" ".join(quote_lines))) + out.append(r"\end{quote}") + continue + + # Table: contiguous lines starting with | + if line.startswith("|") and i + 1 < len(lines) and re.match(r"^\s*\|\s*[-:| ]+\|", lines[i + 1]): + close_list() + header_cells = [c.strip() for c in line.strip().strip("|").split("|")] + # 
skip separator row + i += 2 + body_rows = [] + while i < len(lines) and lines[i].startswith("|"): + row_cells = [c.strip() for c in lines[i].strip().strip("|").split("|")] + body_rows.append(row_cells) + i += 1 + ncols = len(header_cells) + col_spec = "|".join(["l"] * ncols) + out.append(r"\begin{table}[H]") + out.append(r"\centering") + out.append(r"\small") + out.append(r"\begin{tabular}{|" + col_spec + "|}") + out.append(r"\hline") + out.append(" & ".join(r"\textbf{" + apply_inline(c) + "}" for c in header_cells) + r" \\ \hline") + for row in body_rows: + # Pad short rows + row = row + [""] * (ncols - len(row)) + out.append(" & ".join(apply_inline(c) for c in row[:ncols]) + r" \\ \hline") + out.append(r"\end{tabular}") + out.append(r"\end{table}") + continue + + # Ordered list + m = re.match(r"^(\s*)(\d+)\.\s+(.*)$", line) + if m: + indent, _n, text = m.groups() + if not in_list or list_type != "enumerate": + close_list() + out.append(r"\begin{enumerate}") + in_list = True + list_type = "enumerate" + out.append(r"\item " + apply_inline(text)) + i += 1 + continue + + # Unordered list + m = re.match(r"^(\s*)[-*+]\s+(.*)$", line) + if m: + indent, text = m.groups() + if not in_list or list_type != "itemize": + close_list() + out.append(r"\begin{itemize}") + in_list = True + list_type = "itemize" + out.append(r"\item " + apply_inline(text)) + i += 1 + continue + + # Blank line + if line.strip() == "": + close_list() + out.append("") + i += 1 + continue + + # Regular paragraph line + close_list() + out.append(apply_inline(line)) + i += 1 + + close_list() + return "\n".join(out) + + +# ---------- Preamble and driver ---------- + +PREAMBLE = r""" +\documentclass[11pt,a4paper]{article} +\usepackage[margin=2.4cm]{geometry} +\usepackage[T1]{fontenc} +\usepackage[utf8]{inputenc} +\usepackage{lmodern} +\usepackage{parskip} +\usepackage{microtype} +\usepackage{xcolor} +\usepackage{hyperref} +\usepackage{array} +\usepackage{booktabs} +\usepackage{float} +\usepackage{textcomp} 
+\hypersetup{colorlinks=true, linkcolor=blue, urlcolor=blue} +\setcounter{secnumdepth}{0} + +\title{Response to Reviewer 3 \\ \large JRFM Submission jrfm-4256551} +\author{Christopher Regan \and Ying Xie \\ Kennesaw State University} +\date{24 April 2026} + +\begin{document} +\maketitle +""" + +POSTAMBLE = r""" +\end{document} +""" + + +def main() -> int: + if not SRC_MD.exists(): + print(f"ERROR: {SRC_MD} not found", file=sys.stderr) + return 1 + + md = SRC_MD.read_text(encoding="utf-8") + md = preprocess_latex_snippets(md) + md = fold_list_continuations(md) + body = convert(md) + tex = PREAMBLE + body + POSTAMBLE + OUT_TEX.write_text(tex, encoding="utf-8") + print(f"Wrote {OUT_TEX}") + + # Compile twice for cross-references (not strictly needed here, but safe). + for pass_num in (1, 2): + result = subprocess.run( + [ + "pdflatex", + "-interaction=nonstopmode", + "-halt-on-error", + OUT_TEX.name, + ], + cwd=HERE, + capture_output=True, + text=True, + encoding="utf-8", + errors="replace", + ) + if result.returncode != 0: + print(f"--- pdflatex stderr (pass {pass_num}) ---", file=sys.stderr) + print(result.stdout[-2000:], file=sys.stderr) + return 1 + # Clean aux/log after successful compile + for ext in (".aux", ".log", ".out", ".toc"): + p = OUT_TEX.with_suffix(ext) + if p.exists(): + p.unlink() + print(f"Wrote {OUT_PDF}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/docs/papers/jrfm/portal_upload/editor_note_R1_mismatch.md b/docs/papers/jrfm/portal_upload/editor_note_R1_mismatch.md new file mode 100644 index 0000000..9f8d006 --- /dev/null +++ b/docs/papers/jrfm/portal_upload/editor_note_R1_mismatch.md @@ -0,0 +1,46 @@ +# Note to the Editor — Reviewer 1 assignment (jrfm-4256551) + +*To be sent to the JRFM handling editor via the portal comments-to-editor field, separately from the Reviewer 1 response box.* + +--- + +Dear Editor, + +Thank you for forwarding the reports for **jrfm-4256551**. 
On review, +Reviewer 1's comments do not appear to apply to our manuscript. The +report asks about: + +1. the rigorous integration of conformable derivatives into the + classical Heston framework, +2. comparison against the Heston–He and Zhu (HZ) model, with + consideration of jump-diffusion and fractional alternatives, +3. estimation and positivity of conformable parameters, +4. computational challenges in an option-pricing algorithm, +5. practical advantages over traditional pricing approaches, and +6. two references on numerical methods for stochastic volatility models. + +Our submission — *Validating LLM Structural Reasoning: Detecting +Persistent Market Regimes Through Temporal Obfuscation* — is an +empirical LLM-validation study that uses temporal obfuscation on +gamma-exposure sequences. It does **not** propose an option-pricing +model, does **not** introduce conformable derivatives, and does **not** +compare against Heston or HZ models. None of Reviewer 1's seven +questions map to content in our manuscript, so a substantive +point-by-point reply against these comments is not feasible. + +We respectfully request clarification: was this report forwarded from a +different submission in error, or could Reviewer 1 be asked to +re-review the correct manuscript (alternatively, a replacement reviewer +could be assigned)? We are happy to respond substantively to any +review of the actual paper. + +Meanwhile, we have addressed Reviewer 2's recommendation (acceptance +with no revisions) and Reviewer 3's point-by-point comments +(uploaded separately), and have uploaded a revised manuscript +incorporating all Reviewer 3 changes. + +Thank you for your time. 
+ +Sincerely, +Christopher Regan +(on behalf of the authors) diff --git a/docs/papers/jrfm/portal_upload/response_R1_note.md b/docs/papers/jrfm/portal_upload/response_R1_note.md new file mode 100644 index 0000000..f66a2c2 --- /dev/null +++ b/docs/papers/jrfm/portal_upload/response_R1_note.md @@ -0,0 +1,13 @@ +# Reviewer 1 — Author's Notes to Reviewer (ready to paste) + +Please see my note to the Editor — we believe this review concerns a +different manuscript and have requested clarification. The seven +comments in Reviewer 1's report (rigorous integration of conformable +derivatives into the classical Heston framework, comparison against +Heston–He–Zhu, jump-diffusion and fractional models, estimation and +positivity of conformable parameters, computational challenges in an +option-pricing algorithm, and two references on numerical methods for +stochastic volatility) do not correspond to our submission, which is +an empirical LLM-validation study using temporal obfuscation on +gamma-exposure sequences. We are prepared to respond substantively +once the correct review is available. diff --git a/docs/papers/jrfm/portal_upload/response_R2_note.md b/docs/papers/jrfm/portal_upload/response_R2_note.md new file mode 100644 index 0000000..f7bf250 --- /dev/null +++ b/docs/papers/jrfm/portal_upload/response_R2_note.md @@ -0,0 +1,16 @@ +# Reviewer 2 — Author's Notes to Reviewer (ready to paste) + +**Comments 1.** In this paper, the temporal obfuscation testing as a +methodology for validating LLM structural reasoning in domain-specific +applications is presented and applying this framework to options dealer +gamma exposure (GEX) patterns, the detection is validated by using +2,221 evaluations (1,412 real windows plus 809 synthetic controls) +spanning 2020–2025. These studies have important theoretical value. I +recommend it to be published in JRFM. + +**Response 1.** We thank the reviewer for their careful reading of the +manuscript and for the positive recommendation. 
We are grateful for +the confirmation that the temporal obfuscation framework and the scale +of the validation (2,221 evaluations across the 2020–2025 period) +contribute meaningful theoretical value to the field. No changes were +requested in this review, and none have been made in response. diff --git a/docs/papers/jrfm/portal_upload/response_R3_MDPI.docx b/docs/papers/jrfm/portal_upload/response_R3_MDPI.docx new file mode 100644 index 0000000..d412db7 Binary files /dev/null and b/docs/papers/jrfm/portal_upload/response_R3_MDPI.docx differ diff --git a/docs/papers/jrfm/portal_upload/response_R3_pointbypoint.md b/docs/papers/jrfm/portal_upload/response_R3_pointbypoint.md new file mode 100644 index 0000000..b77a7a9 --- /dev/null +++ b/docs/papers/jrfm/portal_upload/response_R3_pointbypoint.md @@ -0,0 +1,585 @@ +## Reviewer 3 — Point-by-point response + +Reviewer 3 provided eight substantive comments organised into the following +groups. We address each in turn, indicating the exact manuscript location of +every change (page / section / paragraph) in the revised manuscript. + +### R3.1 — Introduction (must be improved) + +> The introduction must be shortened and made more focused. It currently +> contains overly long and philosophical paragraphs. It should clearly state +> the research gap, the contribution, and how the paper differs from +> existing studies in financial econometrics. More recent references +> (especially 2022–2025) on options market microstructure, gamma exposure, +> and 0DTE dynamics must be added and critically discussed. + +**Response:** We rewrote §1 Introduction to address each element the +reviewer asked for: + +**(i) Shortened and less philosophical.** The original paragraph-1 +opener ("The decisive question confronting any deployment of large +language models ...") has been removed. The new §1 opens with a +two-sentence, direct statement of the validation problem and why it +is first-order in finance specifically. 
+ +**(ii) Explicit research gap.** A new paragraph titled "Research gap" +(in bold) follows the opener. It names what prior literature has done +independently — dealer-gamma microstructure +\citep{ni2005stock,garleanu2009demand,anderegg2022impact,dim2023odtes,dim2025zero}, +0DTE growth and volatility impact +\citep{cboe2024zero,fishman2023gamma,cboe2025spx0dte}, and LLM-reasoning +probing in non-financial domains +\citep{wei2022chain,kojima2022large,mccoy2023embers} — and states +precisely which combination has not been attempted: an LLM +structural-reasoning validation method that (a) controls for +training-data memorisation of specific events and dates, (b) is tested +at a scale comparable to the target domain, and (c) discriminates +genuine structural detection from reproduction of a volatility-regime +classifier. The Markov-switching benchmark added per R3.3a +(§\ref{sec:regime:benchmark}) is then introduced as the direct test of +element (c). + +**(iii) Why 0DTE matters here.** A new "Why 0DTE matters here" +paragraph replaces the previous "practical urgency" framing. It +explains that 0DTE growth is a natural setting for an obfuscation study +because it created an observable structural shift \emph{within the +training horizon of modern LLMs} — so if the LLM reports 2024 as +persistent-regime and 2020 as fragmented-regime after dates/tickers are +stripped, it cannot be recalling that 2024 contained the word "0DTE". + +**(iv) 2022–2025 references added and critically discussed.** The +key addition is \citet{dim2023odtes} ("0DTEs: Trading, Gamma Risk and +Volatility Propagation", SSRN 4692190), which provides the first +systematic empirical study of 0DTE dealer inventory and is now cited in +§1 and critically discussed in §2.2 alongside \citet{dim2025zero}. 
The +new discussion notes that Dim, Eraker & Vilkov (2023) establishes +dealer-hedging rather than information flow as the dominant channel +through which 0DTE trading affects the underlying, and that this +characterisation is consistent with our multi-year empirical panel in +§4 (detection of persistent dealer-gamma regimes growing from 3.7% in +2021 to 100% in 2024–2025). We retain the existing 2022–2025 refs +already cited (Anderegg et al.\ 2022; Fishman 2023 Goldman Sachs; +CBOE 2024 and 2025 research notes; Dim, Marsh, Schrimpf 2025 BIS). + +**(v) Differentiation from financial-econometrics literature.** §2.5 +"Regime Detection in Financial Markets" and §2.7 "Research Gap" +already state the differentiation from Hamilton (1989), +Ang & Bekaert (2002), and Nystrup et al. (2018) regime-detection +traditions. The new §1 Research Gap paragraph restates this in terms +the LLM-validation reader will recognise: prior regime detection +detects regimes through \emph{statistical properties of observable +outcomes} (volatility clustering, return distributions); our +contribution detects regimes through \emph{dealer positioning +constraints} (a microstructure-grounded signal with explicit causal +interpretation) \emph{while holding the LLM accountable for its own +reasoning} via obfuscation. + +**Change location:** + +- `01_Introduction.tex`: paragraphs 1–4 of §1 fully rewritten; §1.1 + Research Questions, §1.2 Contributions, §1.3 Positioning, §1.5 Paper + Organization retained unchanged from the prior revision commits. +- `02_Related_Work.tex`: §2.2 "Zero-Days-to-Expiration Options" + expanded to include `dim2023odtes` critical discussion alongside the + existing `dim2025zero`. +- `references.bib`: new entry `dim2023odtes` (Dim, Eraker & Vilkov, + SSRN 4692190, November 2023) added in the 0DTE section. + +**Status:** done + +--- + +### R3.2 — Paper positioning + +> The positioning of the paper must be clarified. 
It is not clear whether +> the contribution is mainly methodological (LLM validation) or financial +> (market microstructure). This needs to be explicitly stated and +> consistently reflected throughout the paper. + +**Response:** We agree and have stated the positioning explicitly in +two places to ensure the stance is consistent throughout the paper. + +The primary contribution is **methodological**: temporal obfuscation +testing (with the WHO→WHOM→WHAT causal framework and multi-scale +validation protocol) as a generalizable procedure for validating LLM +structural reasoning. Options dealer gamma-exposure regime detection is +the **empirical demonstration domain** — selected because it combines +theoretically grounded mechanical constraints, a large quantitative +testbed, and the sharp pre-vs-post-0DTE temporal contrast — not because +the paper is proposing novel claims about options microstructure. The +financial-market findings (69.1pp detection gap, 0% FP rate on synthetic +controls, 2021–2024 0DTE-tracking regime evolution) are downstream +evidence that the methodology discriminates correctly, not the primary +contribution. + +**Change location:** + +- New §1.3 "Positioning" subsection (label + `sec:introduction:positioning`) between §1 Contributions and §1 Paper + Organization. Two paragraphs: first states the methodological primacy + and the rationale for GEX as the demonstration domain; second explains + that the financial findings are downstream evidence and provides a + reader-routing note for methodology-first vs finance-first readers. +- §7 Conclusion opening rewritten to echo the same stance before listing + the four contributions, so that the stance frames the closing summary. + +**Status:** done + +--- + +### R3.3 — Benchmark comparison & causal claims + +> The research design must be strengthened. The paper currently lacks +> comparison with standard benchmark models such as regime-switching models +> or volatility-based approaches. 
At least one benchmark model should be +> included to validate the added value of the proposed framework. The +> causal interpretation related to 0DTE should be moderated or supported +> with stronger empirical evidence. + +**Response (part a — benchmark): DONE.** We have added a two-state +Markov-switching regression benchmark (the textbook regime-switching +model, `statsmodels.tsa.regime_switching.markov_regression.MarkovRegression`) on the +daily SPY return series for 2020 and 2024, and additionally on the +2024 net-GEX daily panel where the cached series is available. Details +are in new §3.8 "Markov-Switching Benchmark" and new §5.6 "Comparison +with Markov-Switching Benchmark" (with Table 6 + Figure 8, +`fig10_hmm_agreement.png`). + +Three findings emerge: + +| Year | HMM input | N | LLM rate | HMM rate | Agree | Cohen's κ | +| --- | --- | --- | --- | --- | --- | --- | +| 2020 | SPY returns | 201 | 8.5% | 80.1% | 28.4% | 0.045 | +| 2024 | SPY returns | 222 | 81.1% | 87.4% | 68.5% | −0.178 | +| 2024 | Net GEX | 221 | 81.0% | 65.2% | 84.2% | 0.610 | + +1. A returns-based HMM (canonical volatility-regime benchmark) detects + a **different signal** from the LLM: κ is near zero for 2020 and + negative for 2024, so the two classifiers agree no better than chance + (and slightly worse than chance in 2024). The LLM is therefore not + reducible to a variance regime detector. +2. When the HMM is fitted directly on the daily net-GEX series (2024), + agreement with the LLM jumps to **κ = 0.61** (substantial) — the + two converge on the same windows 84.2% of the time. +3. Taken together this is evidence that the LLM's regime concept is + anchored in dealer-gamma structure specifically (where a mechanical + HMM on the same series agrees with it) rather than in any generic + variance / volatility regime (where the classical benchmark + disagrees).
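The κ column can be checked by hand. The sketch below is not the released `hmm_benchmark.py`; it is a minimal illustration that assumes only the rounded rates printed in the table (the function name `cohens_kappa_from_rates` is hypothetical), so it recovers κ only to about three decimals.

```python
def cohens_kappa_from_rates(p_a: float, p_b: float, agree: float) -> float:
    """Cohen's kappa for two binary classifiers.

    p_a, p_b : marginal positive ("regime detected") rates of each classifier
    agree    : observed share of windows on which the two labels coincide
    """
    # Agreement expected by chance if the two classifiers were independent:
    # both positive plus both negative.
    expected = p_a * p_b + (1.0 - p_a) * (1.0 - p_b)
    return (agree - expected) / (1.0 - expected)


# (LLM rate, HMM rate, raw agreement), copied from the table as rounded values.
rows = {
    "2020 SPY returns": (0.085, 0.801, 0.284),
    "2024 SPY returns": (0.811, 0.874, 0.685),
    "2024 net GEX": (0.810, 0.652, 0.842),
}

kappas = {label: cohens_kappa_from_rates(*rates) for label, rates in rows.items()}
for label, kappa in kappas.items():
    print(f"{label}: kappa = {kappa:.3f}")
```

Because the inputs are rounded percentages rather than window-level labels, small differences in the third decimal relative to the reported values are expected.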
+ +The benchmark fits and per-window analysis are produced deterministically +by `scripts/validation/paper2/jrfm_revision/hmm_benchmark.py` with +outputs at +`reports/validation/paper2_regime_windows/jrfm_revision_hmm_benchmark.yaml` +and `docs/papers/paper2/figures/output/fig10_hmm_agreement.png`. + +**Response (part b — causal language):** Moderated in the B4 commit +(see R3.5d below). §7 Conclusion contribution 3 now describes the 0DTE +correspondence as "coincides with" rather than "drove"; §6.3 softens +the "tipping-point dynamic strengthens the structural interpretation" +phrasing to "is consistent with, rather than proof of"; §6.7 +Limitations explicitly names interest-rate regime, passive-flow +concentration, and market-maker inventory as alternative +contemporaneous factors that cannot be excluded observationally. +A fuller §6.3 rewrite followed in the C2 commit. + +**Change location:** + +- §3.8 Markov-Switching Benchmark (new subsection) +- §5.6 Comparison with Markov-Switching Benchmark (new subsection, + Table 6, Figure 8) +- `scripts/validation/paper2/jrfm_revision/hmm_benchmark.py` (new) +- `docs/papers/paper2/figures/output/fig10_hmm_agreement.png` (new) + with local copy in `docs/papers/jrfm/figures/` +- §7 Conclusion + §6.3 + §6.7 moderations as described under R3.5d + +**Status:** (a) done; (b) done (moderations in §6.3 applied in B4 plus a +fuller §6.3 rewrite in the C2 commit).
§6.3 now explicitly (i) frames +the 0DTE correspondence as temporal coincidence supported by a +plausible mechanical channel rather than a demonstrated causal +relationship, (ii) enumerates four concurrent confounders (interest +rates, short-vol flow, passive/index AUM, market-maker concentration), +(iii) proposes three candidate causal-identification designs (0DTE +suspension natural experiment, counterfactual non-SPY launch, IV +design), and (iv) closes with an explicit acknowledgement that +"less easily reconciled" is not "ruled out" and that disentangling the +channels is beyond the scope of an LLM-validation paper. + +--- + +### R3.4 — Methodology transparency (prompts, thresholds, temperature) + +> The methodology section needs more transparency. The exact prompts used +> for the LLM must be provided (preferably in an appendix). The choice of +> thresholds (70% persistence, $5B magnitude, ≤5 flips) must be justified +> or tested through sensitivity analysis. The impact of model parameters +> (e.g., temperature = 1.0) on reproducibility must be explained. + +**Response:** We have addressed this comment in three parts: + +**(a) Prompts.** The complete regime-detection prompt is now reproduced +verbatim in a new Appendix A, together with the actual OpenAI Batch API +configuration we used (model `o4-mini`; temperature defaults to 1 +because reasoning models reject user-supplied temperature overrides; +`max_completion_tokens` not explicitly set, so the OpenAI API default +applies; JSON structure requested in the prompt rather than enforced +via `response_format`) and the output JSON schema used for parsing. +The appendix is transcribed directly from +`src/llm/mechanics_prompt_builder.py::build_regime_prompt()` in the +publicly released source code, so the reader has full prompt visibility +from the manuscript alone. 
+ +**(c) Temperature and reproducibility.** Appendix A contains a +Reproducibility note explaining that OpenAI reasoning models +(`o1`, `o3`, `o4-mini`, and GPT-5 reasoning variants) reject +user-supplied `temperature` / `top_p` values and run at the default +temperature of 1. The seed parameter is supported by `o4-mini` +(OpenAI documents it as best-effort determinism that can shift when +the server `system_fingerprint` changes), but we did not set a seed +in this study. Bit-identical reproduction of any single response is +therefore not guaranteed. Reproducibility at the distributional level +is established through the N = 2,221 evaluation sample and the +mechanical numerical thresholds embedded in the prompt itself, which +anchor the model on concrete criteria rather than free-form judgment. + +**(b) Threshold sensitivity — DONE.** A post-hoc sensitivity sweep has +been added as new §5.5 "Threshold Sensitivity" with Figure 7 +(`fig09_threshold_sensitivity.png`). The sweep spans a 5×3×3 grid +(persistence ∈ {60, 65, 70, 75, 80}%, magnitude ∈ {$3B, $5B, $7B}, +flips ≤ {3, 5, 7}; 45 configurations in total) applied to the 223 +Phase 3 (2024) and 220 Phase 4 (2020) per-window records already on +disk — no new LLM queries required. + +Key findings reported in §5.5: + +- The 2024-vs-2020 detection gap ranges from 34.1 to 85.2 pp across + the 45 configurations (median 63.2 pp). +- The gap exceeds 50 pp in 40 of 45 configurations. +- The five sub-50 pp cells all occur at the most permissive magnitude + threshold ($3B) combined with the strictest flip limit (≤3) — + deliberately degenerate settings. +- The persistence threshold has essentially no binding effect in this + data because 2024 regime windows saturate ≥60% persistence and 2020 + windows rarely clear any persistence bar — so choosing 60%, 70%, or + 80% produces identical detection rates. +- Magnitude is the binding threshold; flip tolerance is the secondary + lever. 
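The structure of the 45-configuration sweep can be sketched as follows. This is an illustration, not the released `threshold_sensitivity.py`: the record fields (`persistence`, `magnitude_b`, `flips`) and the toy windows are hypothetical stand-ins for the per-window records on disk, and the joint rule (all three criteria must hold) is assumed from the threshold description above.

```python
from itertools import product

# Grid from the response: 5 x 3 x 3 = 45 configurations.
PERSISTENCE = (0.60, 0.65, 0.70, 0.75, 0.80)  # minimum persistence share
MAGNITUDE_B = (3.0, 5.0, 7.0)                 # minimum magnitude, in $B
MAX_FLIPS = (3, 5, 7)                         # maximum number of sign flips


def is_detected(window, persistence, magnitude_b, max_flips):
    """A window counts as a regime only if all three criteria hold jointly."""
    return (
        window["persistence"] >= persistence
        and window["magnitude_b"] >= magnitude_b
        and window["flips"] <= max_flips
    )


def detection_rate(windows, config):
    return sum(is_detected(w, *config) for w in windows) / len(windows)


def gap_sweep(windows_2024, windows_2020):
    """Detection gap (2024 minus 2020, in percentage points) per configuration."""
    return {
        config: 100.0 * (detection_rate(windows_2024, config)
                         - detection_rate(windows_2020, config))
        for config in product(PERSISTENCE, MAGNITUDE_B, MAX_FLIPS)
    }


# Toy records only (hypothetical values, not the study's data).
toy_2024 = [
    {"persistence": 0.90, "magnitude_b": 6.0, "flips": 2},
    {"persistence": 0.85, "magnitude_b": 4.0, "flips": 4},
]
toy_2020 = [
    {"persistence": 0.50, "magnitude_b": 2.0, "flips": 9},
    {"persistence": 0.62, "magnitude_b": 3.5, "flips": 3},
]

gaps = gap_sweep(toy_2024, toy_2020)
print(len(gaps), "configurations; baseline gap:", gaps[(0.70, 5.0, 5)], "pp")
```

Because the sweep only re-applies thresholds to stored per-window metrics, it needs no new LLM queries, which is the property the response relies on.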
+ +The analysis is produced deterministically by the new +`scripts/validation/paper2/jrfm_revision/threshold_sensitivity.py` +(YAML summary at +`reports/validation/paper2_regime_windows/jrfm_revision_threshold_sensitivity.yaml`, +heatmap at `docs/papers/paper2/figures/output/fig09_threshold_sensitivity.png` +with a local copy at `docs/papers/jrfm/figures/fig09_threshold_sensitivity.png` +for LaTeX compilation). + +**Change location:** + +- New Appendix A on pp. 24–29 (parts (a) and (c) above). +- Main text §3 Methodology: brief cross-reference added to Appendix A + where prompts were previously described in prose. +- New §5.5 "Threshold Sensitivity" subsection with Figure 7 (part (b)). + +**Status:** done + +--- + +### R3.5 — Statistical rigour in results + +> The results section must include statistical validation. The paper relies +> heavily on percentages without reporting statistical significance, +> confidence intervals, or robustness tests. These must be added. Some +> interpretations are too strong compared to the evidence and should be +> moderated. + +**Response:** We agree. The revision addresses this comment in four +parts, all of which are now complete. + +**(a) Confidence intervals — DONE.** Every detection rate reported in +§4 Results now carries a 95% confidence interval. Methodology: + +- For Phases 1–4 and all Phase 2 negative controls, per-window records + are available, so we report a 10,000-replicate percentile bootstrap + over windows (deterministic seed). +- For Phase 5 per-year rates (2020–2025), where only aggregate counts + survive in the published pipeline, we report 95% Wilson score + intervals for binomial proportions, which have equivalent coverage + properties and are the standard recommendation in + \citet{brown2001interval}.
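Both interval types can be sketched with the Python standard library alone. The block below is an illustration under stated assumptions, not the released `bootstrap_detection_ci.py`: the function names, the seed value, and the percentile indexing convention are choices made here, so the bootstrap bounds may differ slightly from the reported ones.

```python
import random
from statistics import NormalDist


def wilson_interval(successes, n, level=0.95):
    """Wilson score interval for a binomial proportion (aggregate counts only)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * ((p * (1.0 - p) / n + z * z / (4.0 * n * n)) ** 0.5) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def bootstrap_interval(outcomes, replicates=10_000, level=0.95, seed=0):
    """Percentile bootstrap over per-window 0/1 detection outcomes."""
    rng = random.Random(seed)  # fixed seed makes the interval reproducible
    n = len(outcomes)
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(replicates)
    )
    lower_idx = int((1.0 - level) / 2.0 * replicates)
    upper_idx = int((1.0 + level) / 2.0 * replicates) - 1
    return rates[lower_idx], rates[upper_idx]


# Phase 5 (aggregate counts): 26/213 in 2020 and 241/241 in 2024.
wilson_2020 = wilson_interval(26, 213)
wilson_2024 = wilson_interval(241, 241)

# Phase 3 (per-window records): 181 detections out of 223 windows.
phase3_outcomes = [1] * 181 + [0] * 42
boot_phase3 = bootstrap_interval(phase3_outcomes)

print("Wilson 2020:", wilson_2020)
print("Wilson 2024:", wilson_2024)
print("Bootstrap Phase 3:", boot_phase3)
```

The Wilson interval stays informative at 241/241, where a percentile bootstrap over all-positive outcomes would collapse to a single point; that is one practical reason to prefer it when a rate sits at the boundary.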
+ +The methodology is spelled out in a new "Statistical conventions" +paragraph at the head of §4.1, and all CIs are produced deterministically +by the new reprocessing script +`scripts/validation/paper2/jrfm_revision/bootstrap_detection_ci.py` +shipped with the code release. + +Key numerical results (point estimate [95% CI], detected/N): + +| Phase | Rate (95% CI) | +| --- | --- | +| Phase 1 baseline 2024 Q1 | 71.2% [57.7, 82.7]% (37/52) | +| Phase 3 full 2024 | 81.2% [75.8, 86.1]% (181/223) | +| Phase 4 full 2020 | 12.1% [8.1, 16.6]% (27/223) | +| Phase 2b transitional 2020 | 0.0% [0.0, 1.7]% (0/223) | +| Phase 5 2020 | 12.2% [8.5, 17.3]% (26/213) | +| Phase 5 2024 | 100% [98.4, 100.0]% (241/241) | +| Phase 5 2025 | 100% [98.5, 100.0]% (245/245) | + +Critically, the 2020 and 2024 intervals do not overlap in either +comparison. In the Phase 4 vs Phase 3 contrast that underlies the 69.1pp +separation claim, the 2020 upper bound (16.6%) lies well below the 2024 +lower bound (75.8%); in Phase 5, the 2020 upper bound (17.3%) lies well +below the 2024 lower bound (98.4%). The separation claim therefore rests +on bounded evidence rather than point estimates alone. + +**(b) Expanded χ² / Fisher reporting — DONE.** Every headline +contingency now reports the full suite of statistics rather than just φ +and "p < 0.0001". Specifically: + +- §5.3 Phase 4 (2020 vs 2024, 223 each): Pearson's χ² = 213.67 (df=1, + p = 2.2×10⁻⁴⁸), Yates-corrected χ² = 210.90 (p = 8.7×10⁻⁴⁸), + Fisher's exact two-sided p = 1.8×10⁻⁵² with odds ratio 31.3, + φ = 0.69 (refined from the previously rounded 0.672), and a risk + difference of 69.1pp with a 95% Wald CI of [62.4, 75.7]pp. +- §5.4 Phase 5 (2023→2024 transition, 228 vs 241): χ² = 314.4 + (p = 2.4×10⁻⁷⁰), Fisher's exact p = 9.9×10⁻⁸⁷ (OR diverges because + all 241 2024 windows are detected), φ = 0.82 (refined from 0.783). +- Abstract and Introduction updated to report the 2020-vs-2024 + comparison with both CI brackets on each rate and Fisher's exact p + (the strongest and most defensible statistic here given the zero + cell), instead of a single "p < 0.0001". + +**(c) Threshold robustness — DONE** (see R3.4b response above).
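The Phase 4 effect sizes follow directly from the two counts in the table (181/223 in 2024, 27/223 in 2020). The sketch below is a hedged check rather than the authors' script: the helper name `two_by_two_summary` is hypothetical, and it deliberately omits the p-values, the Yates correction, and Fisher's exact test, which would normally come from a statistics library.

```python
from math import sqrt
from statistics import NormalDist


def two_by_two_summary(hits_a, n_a, hits_b, n_b, level=0.95):
    """Pearson chi-square, phi, odds ratio and Wald risk difference for a 2x2 table."""
    a, b = hits_a, n_a - hits_a  # group A: detected / not detected
    c, d = hits_b, n_b - hits_b  # group B: detected / not detected
    n = n_a + n_b

    # Shortcut formula for Pearson's chi-square on a 2x2 table (df = 1).
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    phi = sqrt(chi2 / n)

    # The odds ratio diverges when an off-diagonal product is zero.
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")

    p_a, p_b = a / n_a, c / n_b
    diff = p_a - p_b
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    se = sqrt(p_a * (1.0 - p_a) / n_a + p_b * (1.0 - p_b) / n_b)
    return {
        "chi2": chi2,
        "phi": phi,
        "odds_ratio": odds_ratio,
        "risk_difference": diff,
        "wald_ci": (diff - z * se, diff + z * se),
    }


summary = two_by_two_summary(181, 223, 27, 223)
for name, value in summary.items():
    print(name, value)
```

The same helper applied to a comparison in which every window in one group is detected returns an infinite odds ratio, matching the note above that the OR diverges for the 2023→2024 transition.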
+ +**(d) Moderated claim language — DONE.** With CIs and the 45-configuration +sensitivity sweep now in hand, we made three targeted moderations: + +- §7 Conclusion contribution 2 now reports the 69.1pp separation with + explicit CI brackets on each side and Fisher's exact p, and cites the + sensitivity sweep (gap above 50pp in 40 of 45 configurations), rather + than citing the separation as a standalone point estimate. +- §7 Conclusion contribution 3 replaces "0DTE-driven structural + reorganization" with language that identifies temporal coincidence and + explicitly acknowledges alternative contemporaneous factors (interest + rates, passive flow concentration, market-maker inventory), noting + that stronger causal evidence would require a natural experiment. +- §6.3 "Market Structure Evolution" similarly softens the + "tipping-point dynamic strengthens the structural interpretation" + phrasing to "is consistent with, rather than proof of" and + cross-references §6.7 Limitations for the causal-identification + caveat. + +These moderations make the paper's causal claims about 0DTE match the +quality of observational evidence available here; they do not weaken +the statistical claims on 2020-vs-2024 separation, which the new +χ² / Fisher / sensitivity results strengthen. + +**Change location:** + +- `04_Results.tex` §5 opening "Statistical conventions" paragraph +- `04_Results.tex` §5.1 Phase 1/3 inline rates in text +- `04_Results.tex` Table 2 (negative controls) — CI column added +- `04_Results.tex` Table 3 (Phase 4 comparison) — CIs on both rates +- `04_Results.tex` Table 5 (Phase 5) — new CI column +- `references.bib` — added `brown2001interval` for Wilson score cite +- `scripts/validation/paper2/jrfm_revision/bootstrap_detection_ci.py` — new reprocessing script + +**Status:** done — all four parts (CIs, χ²/Fisher expansion, +window/threshold robustness in §5.5, moderated claim language in §6.3 +and §7) landed across the B1, B2, B3, B4, C2 revision commits.
+ +--- + +### R3.6 — Discussion: finance connections + +> The discussion must be better connected to finance. The implications for +> risk management, market efficiency, and practitioners should be explicitly +> developed. The current discussion is too general and sometimes +> theoretical. + +**Response:** We agree that the original discussion was too general on +the practitioner side. The previous §6.6 "Practitioner Implications" +subsection has been renamed "Practical Implications" and restructured +into three explicit subsubsections exactly matching the three axes the +reviewer identified: + +**(a) Risk management.** Three concrete applications developed: +intraday volatility budgeting (regime as a leading indicator for +volatility-of-volatility exposure sizing), option-book hedging under +OpEx concentration (persistent-positive regimes amplify the OpEx +pinning dynamic), and risk-scenario design (2020 fragmented vs 2024 +persistent-negative as natural conditioning variables for stress-test +calibration). + +**(b) Market efficiency.** A new positive account is offered: the +detection-alpha orthogonality is consistent with a weakly efficient +market in which structural constraints are reliably identifiable but +already priced. This reconciles two claims often treated as +contradictory — that dealer-gamma positioning measurably influences +short-horizon price dynamics, and that systematic strategies exploiting +it deteriorate as attention accumulates — and explains why +microstructure-aware research can be genuinely informative for risk +without being informative for alpha. 
+ +**(c) Practitioners: pipeline design and model deployment.** Two +design implications are developed from the experimental results: (i) the +30.8pp advantage of raw strike-level data over pre-aggregated GEX +challenges the default of parametric aggregation in quantitative +pipelines, with generalisations to credit risk, fixed-income +surveillance, and equity factor research explicitly noted; (ii) the +2022–2024 0DTE regime shift implies that static microstructure models +calibrated to pre-2022 data need recalibration rather than drift +correction. + +**Change location:** §6.6 "Practical Implications" (renamed from +"Practitioner Implications"), with new `sec:discussion:practical` label +and three new `\subsubsection` headings corresponding to the +reviewer's three axes. The subsection expanded from one dense +paragraph (4 insights) to three structured subsubsections (~1 page). + +**Status:** done + +--- + +### R3.7 — Limitations expansion + +> The limitations section must be expanded. It should clearly address the +> use of a single asset (SPY), the dependence on one LLM model, and the +> lack of external validation. + +**Response:** We thank the reviewer for flagging these specific omissions. +We have renamed §6.7 to "Limitations and Future Work" and expanded it +from six limitations to seven, with each item now explicitly tied to a +concrete follow-up study. The three items the reviewer named are now +addressed as follows: + +**(a) Single-asset scope.** The first limitation item (now titled +"Single-asset scope") explicitly acknowledges that all results concern +SPY, lists QQQ, IWM, individual equities, and non-equity underliers as +relevant but untested targets, and identifies cross-asset replication as +the single highest-priority item for future work. A pre-registered +protocol applying the same framework to at least QQQ and one individual +equity (e.g., NVDA or AAPL) is proposed.
+ +**(b) Single-LLM dependence.** A dedicated second item ("Single-LLM +dependence") acknowledges that all 2,221 evaluations used one reasoning +model (o4-mini), so the reported detection rates are conditional on +that model's priors. We propose a model-swap protocol covering Anthropic +Claude, OpenAI o3, Google Gemini, and open-source reasoning models +using identical prompts and obfuscated sequences, with cross-model +agreement analysis as the diagnostic. + +**(c) Lack of independent external validation.** A new third item +("Lack of independent external validation") acknowledges that per-window +ground-truth metrics are computed from the same Alpha Vantage feed used +to construct the windows, and proposes cross-validation against CBOE +DataShop / OPRA / commercial vendors (SpotGamma, MenthorQ) and against +related microstructure observables (realised volatility, +implied-realised spread, opening auction imbalance). + +**Change location:** §6.7 Limitations and Future Work (p. 17 in the +revised PDF). The subsection was relabelled from "Limitations" to +"Limitations and Future Work" and expanded from 6 to 7 items. Each item +now includes an explicit future-work sentence indicating how it could +be addressed. + +**Status:** done + +--- + +### R3.8 — Figures and tables + +> Figures and tables must be improved. Some are too dense and difficult to +> read. Labels and captions should be clearer and more explanatory. + +**Response:** We made every caption in the manuscript self-contained, +following the rule that a caption should state (i) what is shown, +(ii) the key numerical values a reader should notice, and (iii) what +conclusion the reader should take from the figure. Five figure +captions (Figures 1, 3, 4, 5, 6) were rewritten to match this standard; +the figures and tables added in the earlier B1/B3/C1 commits +(Figures 7 and 8, Tables 2–6) were already written to it.
+ +Each rewritten caption ends with an explicit "Read this figure as:" +clause that tells the reader the intended interpretation. Examples: + +- **Figure 1 (Obfuscation)**: "Read this figure as: anything the LLM + correctly infers from the right-hand input must come from the + numerical structure alone, not from memorised date-specific context + in the training corpus." +- **Figure 4 (Selectivity)**: "Read this figure as: detection is not + a function of a single criterion but of all three acting jointly — + high magnitude alone or high persistence alone is not sufficient." +- **Figure 5 (GEX magnitude distribution)**: "Read this figure as: + the magnitude criterion alone — before persistence or stability are + even checked — already separates the two eras, and the chosen $5B + threshold is positioned in the trough between the two distributions + rather than in the bulk of either." +- **Figure 6 (Temporal progression)**: "Read this figure as: the LLM + regime-detection signal is not a smooth secular trend but a discrete + step-change, coincident with the maturation of the 0DTE options + market; it is not a proof of causation but is less easily reconciled + with gradual drift." + +On the reviewer's remark that "some are too dense and difficult to +read": we reviewed each figure under the density lens and concluded +that none of the eight figures currently in the JRFM manuscript are +overly dense once the captions make the intended reading explicit. The +reviewer may have been referring to Figures 7 and 8 in a prior +version (the AIAI conference version), which had a crowded 9-panel +layout; those were not carried over into the JRFM manuscript. If the +editor identifies a specific figure that still reads as too dense, we +will happily simplify it. + +**Change location:** captions in `03_Methodology.tex` (Figure 1) and +`04_Results.tex` (Figures 3, 4, 5, 6); all other figure and table +captions were already self-contained from prior revision commits. 
+ +**Status:** done + +--- + +### R3.9 — English language quality + +> The clarity of the manuscript needs improvement. Many sentences are too +> long and complex, which affects readability. The writing should be +> simplified by using shorter sentences, more direct wording, and by +> removing redundant or overly elaborate expressions. Careful language +> editing is recommended to improve clarity and flow. + +**Response:** We performed a full editing pass over the manuscript +after all content changes were settled. Summary of what was done: + +**(a) Wordy transitions and hedging tics.** We checked the manuscript +for the usual English-editing offenders ("In order to", "It should be +noted that", "It is worth noting", "Due to the fact that", "This is +because", "Obviously", "Clearly"). None of these phrases appear in +the manuscript — the original draft was already written in an active, +direct register. No changes were required on this axis. + +**(b) Long-sentence decomposition.** We identified the paragraphs +with the most elaborate nested-clause sentences (the §1 philosophical +opener and §6.5 Dispersed Knowledge were the two heaviest) and rewrote +them for directness. The §1 opener was fully replaced in the R3.1 +rewrite above (which removed roughly 120 words of philosophical prose). +§6.5 was tightened in this commit by breaking three >40-word sentences +into two-sentence units while retaining the Hayek citation and the +30.8pp empirical claim. + +**(c) Active voice where natural.** The manuscript is already +predominantly in active voice; we did not force passive-to-active +rewrites in passages where passive carries the correct emphasis (e.g., +"the framework achieves 81.2\% detection" is active; "detection was +observed at 81.2\%" would be worse).
+ +**(d) Consistency of technical terms.** We verified consistent +terminology across sections: "regime" (not "state") for the detection +target, "persistent / fragmented" (not "stable / unstable") for the +binary outcome, "dealer gamma positioning" (not "dealer gamma +exposure" in the context of the detection task), "obfuscation" +(not "anonymisation"). No ad-hoc substitutions were made. + +**Change location:** targeted tightening in §6.5 Dispersed Knowledge +(sentences broken up); §1 opener and §6.3 Market Structure Evolution +rewrites landed in the earlier D1 and C2 commits. Technical-term +consistency verified throughout. + +**Status:** done + +--- diff --git a/docs/papers/jrfm/portal_upload/response_R3_pointbypoint.pdf b/docs/papers/jrfm/portal_upload/response_R3_pointbypoint.pdf new file mode 100644 index 0000000..333ac7a Binary files /dev/null and b/docs/papers/jrfm/portal_upload/response_R3_pointbypoint.pdf differ diff --git a/docs/papers/jrfm/portal_upload/response_R3_pointbypoint.tex b/docs/papers/jrfm/portal_upload/response_R3_pointbypoint.tex new file mode 100644 index 0000000..e9a37d0 --- /dev/null +++ b/docs/papers/jrfm/portal_upload/response_R3_pointbypoint.tex @@ -0,0 +1,561 @@ + +\documentclass[11pt,a4paper]{article} +\usepackage[margin=2.4cm]{geometry} +\usepackage[T1]{fontenc} +\usepackage[utf8]{inputenc} +\usepackage{lmodern} +\usepackage{parskip} +\usepackage{microtype} +\usepackage{xcolor} +\usepackage{hyperref} +\usepackage{array} +\usepackage{booktabs} +\usepackage{float} +\usepackage{textcomp} +\hypersetup{colorlinks=true, linkcolor=blue, urlcolor=blue} +\setcounter{secnumdepth}{0} + +\title{Response to Reviewer 3 \\ \large JRFM Submission jrfm-4256551} +\author{Christopher Regan \and Ying Xie \\ Kennesaw State University} +\date{24 April 2026} + +\begin{document} +\maketitle +\subsection*{Reviewer 3 --- Point-by-point response} + +Reviewer 3 provided eight substantive comments organised into the following +groups. 
We address each in turn, indicating the exact manuscript location of +every change (page / section / paragraph) in the revised manuscript. + +\subsubsection*{R3.1 --- Introduction (must be improved)} + +\begin{quote} +The introduction must be shortened and made more focused. It currently contains overly long and philosophical paragraphs. It should clearly state the research gap, the contribution, and how the paper differs from existing studies in financial econometrics. More recent references (especially 2022--2025) on options market microstructure, gamma exposure, and 0DTE dynamics must be added and critically discussed. +\end{quote} + +\textbf{Response:} We rewrote \S{}~1 Introduction to address each element the +reviewer asked for: + +\textbf{(i) Shortened and less philosophical.} The original paragraph-1 +opener ("The decisive question confronting any deployment of large +language models ...") has been removed. The new \S{}~1 opens with a +two-sentence, direct statement of the validation problem and why it +is first-order in finance specifically. + +\textbf{(ii) Explicit research gap.} A new paragraph titled "Research gap" +(in bold) follows the opener. It names what prior literature has done +independently --- dealer-gamma microstructure +(ni2005stock,garleanu2009demand,anderegg2022impact,dim2023odtes,dim2025zero), +0DTE growth and volatility impact +(cboe2024zero,fishman2023gamma,cboe2025spx0dte), and LLM-reasoning +probing in non-financial domains +(wei2022chain,kojima2022large,mccoy2023embers) --- and states +precisely which combination has not been attempted: an LLM +structural-reasoning validation method that (a) controls for +training-data memorisation of specific events and dates, (b) is tested +at a scale comparable to the target domain, and (c) discriminates +genuine structural detection from reproduction of a volatility-regime +classifier. 
The Markov-switching benchmark added per R3.3a +(\S{}~3.8) is then introduced as the direct test of +element (c). + +\textbf{(iii) Why 0DTE matters here.} A new "Why 0DTE matters here" +paragraph replaces the previous "practical urgency" framing. It +explains that 0DTE growth is a natural setting for an obfuscation study +because it created an observable structural shift \emph{within the +training horizon of modern LLMs} --- so if the LLM reports 2024 as +persistent-regime and 2020 as fragmented-regime after dates/tickers are +stripped, it cannot be recalling that 2024 contained the word "0DTE". + +\textbf{(iv) 2022--2025 references added and critically discussed.} The +key addition is dim2023odtes ("0DTEs: Trading, Gamma Risk and +Volatility Propagation", SSRN 4692190), which provides the first +systematic empirical study of 0DTE dealer inventory and is now cited in +\S{}~1 and critically discussed in \S{}~2.2 alongside dim2025zero. The +new discussion notes that Dim, Eraker \& Vilkov (2023) establishes +dealer-hedging rather than information flow as the dominant channel +through which 0DTE trading affects the underlying, and that this +characterisation is consistent with our multi-year empirical panel in +\S{}~4 (detection of persistent dealer-gamma regimes growing from 3.7\% in +2021 to 100\% in 2024--2025). We retain the existing 2022--2025 references +already cited (Anderegg et al.\ 2022; Fishman 2023 Goldman Sachs; +CBOE 2024 and 2025 research notes; Dim, Marsh, Schrimpf 2025 BIS). + +\textbf{(v) Differentiation from financial-econometrics literature.} \S{}~2.5 +"Regime Detection in Financial Markets" and \S{}~2.7 "Research Gap" +already state the differentiation from Hamilton (1989), +Ang \& Bekaert (2002), and Nystrup et al.\ (2018) regime-detection +traditions.
The new \S{}~1 Research Gap paragraph restates this in terms +the LLM-validation reader will recognise: prior work +detects regimes through \emph{statistical properties of observable +outcomes} (volatility clustering, return distributions); our +contribution detects regimes through \emph{dealer positioning +constraints} (a microstructure-grounded signal with explicit causal +interpretation) \emph{while holding the LLM accountable for its own +reasoning} via obfuscation. + +\textbf{Change location:} + +\begin{itemize} +\item \texttt{01\_Introduction.tex}: paragraphs 1--4 of \S{}~1 fully rewritten; \S{}~1.1 Research Questions, \S{}~1.2 Contributions, \S{}~1.3 Positioning, \S{}~1.5 Paper Organization retained unchanged from the prior revision commits. +\item \texttt{02\_Related\_Work.tex}: \S{}~2.2 "Zero-Days-to-Expiration Options" expanded to include \texttt{dim2023odtes} critical discussion alongside the existing \texttt{dim2025zero}. +\item \texttt{references.bib}: new entry \texttt{dim2023odtes} (Dim, Eraker \& Vilkov, SSRN 4692190, November 2023) added in the 0DTE section. +\end{itemize} + +\textbf{Status:} done + +\bigskip\hrule\bigskip + +\subsubsection*{R3.2 --- Paper positioning} + +\begin{quote} +The positioning of the paper must be clarified. It is not clear whether the contribution is mainly methodological (LLM validation) or financial (market microstructure). This needs to be explicitly stated and consistently reflected throughout the paper. +\end{quote} + +\textbf{Response:} We agree and have stated the positioning explicitly in +two places to ensure the stance is consistent throughout the paper. + +The primary contribution is \textbf{methodological}: temporal obfuscation +testing (with the WHO$\rightarrow$WHOM$\rightarrow$WHAT causal framework and multi-scale +validation protocol) as a generalizable procedure for validating LLM +structural reasoning.
Options dealer gamma-exposure regime detection is +the \textbf{empirical demonstration domain} --- selected because it combines +theoretically grounded mechanical constraints, a large quantitative +testbed, and the sharp pre-vs-post-0DTE temporal contrast --- not because +the paper is proposing novel claims about options microstructure. The +financial-market findings (69.1pp detection gap, 0\% FP rate on synthetic +controls, 2021--2024 0DTE-tracking regime evolution) are downstream +evidence that the methodology discriminates correctly, not the primary +contribution. + +\textbf{Change location:} + +\begin{itemize} +\item New \S{}~1.3 "Positioning" subsection (label \texttt{sec:introduction:positioning}) between \S{}~1 Contributions and \S{}~1 Paper Organization. Two paragraphs: first states the methodological primacy and the rationale for GEX as the demonstration domain; second explains that the financial findings are downstream evidence and provides a reader-routing note for methodology-first vs finance-first readers. +\item \S{}~7 Conclusion opening rewritten to echo the same stance before listing the four contributions, so that the stance frames the closing summary. +\end{itemize} + +\textbf{Status:} done + +\bigskip\hrule\bigskip + +\subsubsection*{R3.3 --- Benchmark comparison \& causal claims} + +\begin{quote} +The research design must be strengthened. The paper currently lacks comparison with standard benchmark models such as regime-switching models or volatility-based approaches. At least one benchmark model should be included to validate the added value of the proposed framework. The causal interpretation related to 0DTE should be moderated or supported with stronger empirical evidence. 
+\end{quote} + +\textbf{Response (part a --- benchmark): DONE.} We have added a two-state +Markov-switching regression benchmark (the textbook regime-switching +model, \texttt{statsmodels.tsa.regime\_switching.markov\_regression.MarkovRegression}) on the +daily SPY return series for 2020 and 2024, and additionally on the +2024 net-GEX daily panel where the cached series is available. Details +are in new \S{}~3.8 "Markov-Switching Benchmark" and new \S{}~5.6 "Comparison +with Markov-Switching Benchmark" (with Table 6 + Figure 8, +\texttt{fig10\_hmm\_agreement.png}). + +Three findings emerge: + +\begin{table}[H] +\centering +\small +\begin{tabular}{|l|l|l|l|l|l|l|} +\hline +\textbf{Year} & \textbf{HMM input} & \textbf{N} & \textbf{LLM rate} & \textbf{HMM rate} & \textbf{Agree} & \textbf{Cohen's $\kappa$} \\ \hline +2020 & SPY returns & 201 & 8.5\% & 80.1\% & 28.4\% & 0.045 \\ \hline +2024 & SPY returns & 222 & 81.1\% & 87.4\% & 68.5\% & $-0.178$ \\ \hline +2024 & Net GEX & 221 & 81.0\% & 65.2\% & 84.2\% & 0.610 \\ \hline +\end{tabular} +\end{table} + +\begin{enumerate} +\item A returns-based HMM (canonical volatility-regime benchmark) detects a \textbf{different signal} from the LLM: $\kappa$ is near zero for 2020 and negative for 2024, so the two classifiers agree no better than chance (and slightly worse than chance in 2024). The LLM is therefore not reducible to a variance regime detector. +\item When the HMM is fitted directly on the daily net-GEX series (2024), agreement with the LLM jumps to \textbf{$\kappa$ = 0.61} (substantial) --- the two converge on the same windows 84.2\% of the time. +\item Taken together this is evidence that the LLM's regime concept is anchored in dealer-gamma structure specifically (where a mechanical HMM on the same series agrees with it) rather than in any generic variance / volatility regime (where the classical benchmark disagrees).
+\end{enumerate} + +The benchmark fits and per-window analysis are produced deterministically +by \texttt{scripts/validation/paper2/jrfm\_revision/hmm\_benchmark.py} with +outputs at +\texttt{reports/validation/paper2\_regime\_windows/jrfm\_revision\_hmm\_benchmark.yaml} +and \texttt{docs/papers/paper2/figures/output/fig10\_hmm\_agreement.png}. + +\textbf{Response (part b --- causal language):} Moderated in the B4 commit +(see R3.5d below). \S{}~7 Conclusion contribution 3 now describes the 0DTE +correspondence as "coincides with" rather than "drove"; \S{}~6.3 softens +the "tipping-point dynamic strengthens the structural interpretation" +phrasing to "is consistent with, rather than proof of"; \S{}~6.7 +Limitations explicitly names interest-rate regime, passive-flow +concentration, and market-maker inventory as alternative +contemporaneous factors that cannot be excluded observationally. +The deeper \S{}~6.3 revision landed in the C2 commit (see Status below). + +\textbf{Change location:} + +\begin{itemize} +\item \S{}~3.8 Markov-Switching Benchmark (new subsection) +\item \S{}~5.6 Comparison with Markov-Switching Benchmark (new subsection, Table 6, Figure 8) +\item \texttt{scripts/validation/paper2/jrfm\_revision/hmm\_benchmark.py} (new) +\item \texttt{docs/papers/paper2/figures/output/fig10\_hmm\_agreement.png} (new) with local copy in \texttt{docs/papers/jrfm/figures/} +\item \S{}~7 Conclusion + \S{}~6.3 + \S{}~6.7 moderations as described under R3.5d +\end{itemize} + +\textbf{Status:} (a) done; (b) done (moderations in \S{}~6.3 applied in B4 plus a +fuller \S{}~6.3 rewrite in the C2 commit).
\S{}~6.3 now explicitly (i) frames +the 0DTE correspondence as temporal coincidence supported by a +plausible mechanical channel rather than a demonstrated causal +relationship, (ii) enumerates four concurrent confounders (interest +rates, short-vol flow, passive/index AUM, market-maker concentration), +(iii) proposes three candidate causal-identification designs (0DTE +suspension natural experiment, counterfactual non-SPY launch, IV +design), and (iv) closes with an explicit acknowledgement that +"less easily reconciled" is not "ruled out" and that disentangling the +channels is beyond the scope of an LLM-validation paper. + +\bigskip\hrule\bigskip + +\subsubsection*{R3.4 --- Methodology transparency (prompts, thresholds, temperature)} + +\begin{quote} +The methodology section needs more transparency. The exact prompts used for the LLM must be provided (preferably in an appendix). The choice of thresholds (70\% persistence, \$5B magnitude, $\leq$5 flips) must be justified or tested through sensitivity analysis. The impact of model parameters (e.g., temperature = 1.0) on reproducibility must be explained. +\end{quote} + +\textbf{Response:} We have addressed this comment in three parts: + +\textbf{(a) Prompts.} The complete regime-detection prompt is now reproduced +verbatim in a new Appendix A, together with the actual OpenAI Batch API +configuration we used (model \texttt{o4-mini}; temperature defaults to 1 +because reasoning models reject user-supplied temperature overrides; +\texttt{max\_completion\_tokens} not explicitly set, so the OpenAI API default +applies; JSON structure requested in the prompt rather than enforced +via \texttt{response\_format}) and the output JSON schema used for parsing. +The appendix is transcribed directly from +\texttt{src/llm/mechanics\_prompt\_builder.py::build\_regime\_prompt()} in the +publicly released source code, so the reader has full prompt visibility +from the manuscript alone. 
+ +\textbf{(c) Temperature and reproducibility.} Appendix A contains a +Reproducibility note explaining that OpenAI reasoning models +(\texttt{o1}, \texttt{o3}, \texttt{o4-mini}, and GPT-5 reasoning variants) reject +user-supplied \texttt{temperature} / \texttt{top\_p} values and run at the default +temperature of 1. The seed parameter is supported by \texttt{o4-mini} +(OpenAI documents it as best-effort determinism that can shift when +the server \texttt{system\_fingerprint} changes), but we did not set a seed +in this study. Bit-identical reproduction of any single response is +therefore not guaranteed. Reproducibility at the distributional level +is established through the N = 2,221 evaluation sample and the +mechanical numerical thresholds embedded in the prompt itself, which +anchor the model on concrete criteria rather than free-form judgment. + +\textbf{(b) Threshold sensitivity --- DONE.} A post-hoc sensitivity sweep has +been added as new \S{}~5.5 "Threshold Sensitivity" with Figure 7 +(\texttt{fig09\_threshold\_sensitivity.png}). The sweep spans a 5$\times$3$\times$3 grid +(persistence $\in$ \{60, 65, 70, 75, 80\}\%, magnitude $\in$ \{\$3B, \$5B, \$7B\}, +flips $\leq$ \{3, 5, 7\}; 45 configurations in total) applied to the 223 +Phase 3 (2024) and 220 Phase 4 (2020) per-window records already on +disk --- no new LLM queries required. + +Key findings reported in \S{}~5.5: + +\begin{itemize} +\item The 2024-vs-2020 detection gap ranges from 34.1 to 85.2 pp across the 45 configurations (median 63.2 pp). +\item The gap exceeds 50 pp in 40 of 45 configurations. +\item The five sub-50 pp cells all occur at the most permissive magnitude threshold (\$3B) combined with the strictest flip limit ($\leq$3) --- deliberately degenerate settings. 
+\item The persistence threshold has essentially no binding effect in this data because 2024 regime windows saturate $\geq$60\% persistence and 2020 windows rarely clear any persistence bar --- so choosing 60\%, 70\%, or 80\% produces identical detection rates. +\item Magnitude is the binding threshold; flip tolerance is the secondary lever. +\end{itemize} + +The analysis is produced deterministically by the new +\texttt{scripts/validation/paper2/jrfm\_revision/threshold\_sensitivity.py} +(YAML summary at +\texttt{reports/validation/paper2\_regime\_windows/jrfm\_revision\_threshold\_sensitivity.yaml}, +heatmap at \texttt{docs/papers/paper2/figures/output/fig09\_threshold\_sensitivity.png} +with a local copy at \texttt{docs/papers/jrfm/figures/fig09\_threshold\_sensitivity.png} +for LaTeX compilation). + +\textbf{Change location:} + +\begin{itemize} +\item New Appendix A on pp. 24--29 (parts (a) and (c) above). +\item Main text \S{}~3 Methodology: brief cross-reference added to Appendix A where prompts were previously described in prose. +\item New \S{}~5.5 "Threshold Sensitivity" subsection with Figure 7 (part (b)). +\end{itemize} + +\textbf{Status:} done + +\bigskip\hrule\bigskip + +\subsubsection*{R3.5 --- Statistical rigour in results} + +\begin{quote} +The results section must include statistical validation. The paper relies heavily on percentages without reporting statistical significance, confidence intervals, or robustness tests. These must be added. Some interpretations are too strong compared to the evidence and should be moderated. +\end{quote} + +\textbf{Response:} We agree. The revision addresses this comment in four +parts, all of which are now complete. + +\textbf{(a) Confidence intervals --- DONE.} Every detection rate reported in +\S{}~5 Results now carries a 95\% confidence interval.
Methodology: + +\begin{itemize} +\item For Phases 1--4 and all Phase 2 negative controls, per-window records are available, so we report a 10,000-replicate percentile bootstrap over windows (deterministic seed). +\item For Phase 5 per-year rates (2020--2025), where only aggregate counts survive in the published pipeline, we report 95\% Wilson score intervals for binomial proportions, which have equivalent coverage properties and are the standard recommendation in \texttt{brown2001interval}. +\end{itemize} + +The methodology is spelled out in a new "Statistical conventions" +paragraph at the head of \S{}~5, and all CIs are produced deterministically +by the new reprocessing script +\texttt{scripts/validation/paper2/jrfm\_revision/bootstrap\_detection\_ci.py} +shipped with the code release. + +Key numerical results (point estimate [95\% CI], detections/N): + +\begin{table}[H] +\centering +\small +\begin{tabular}{|l|l|} +\hline +\textbf{Phase} & \textbf{Rate (95\% CI)} \\ \hline +Phase 1 baseline 2024 Q1 & 71.2\% [57.7, 82.7]\% (37/52) \\ \hline +Phase 3 full 2024 & 81.2\% [75.8, 86.1]\% (181/223) \\ \hline +Phase 4 full 2020 & 12.1\% [8.1, 16.6]\% (27/223) \\ \hline +Phase 2b transitional 2020 & 0.0\% [0.0, 1.7]\% (0/223) \\ \hline +Phase 5 2020 & 12.2\% [8.5, 17.3]\% (26/213) \\ \hline +Phase 5 2024 & 100\% [98.4, 100.0]\% (241/241) \\ \hline +Phase 5 2025 & 100\% [98.5, 100.0]\% (245/245) \\ \hline +\end{tabular} +\end{table} + +Critically, the 2020 and 2024 intervals do not overlap in either +comparison: for Phase 4 vs Phase 3, which underlies the 69.1pp separation +claim, the 2020 upper bound is 16.6\% and the 2024 lower bound is 75.8\%; +for Phase 5 the corresponding bounds are 17.3\% and 98.4\%. The separation +is therefore supported by bounded evidence rather than point estimates alone. + +\textbf{(b) Expanded $\chi^{2}$ / Fisher reporting --- DONE.} Every headline +contingency now reports the full suite of statistics rather than just $\varphi$ +and "$p < 0.0001$".
Specifically: + +\begin{itemize} +\item \S{}~5.3 Phase 4 (2020 vs 2024, 223 each): Pearson's $\chi^{2} = 213.67$ (df = 1, $p = 2.2\times10^{-48}$), Yates-corrected $\chi^{2} = 210.90$ ($p = 8.7\times10^{-48}$), Fisher's exact two-sided $p = 1.8\times10^{-52}$ with odds ratio 31.3, $\varphi = 0.69$ (refined from the previously rounded 0.672), and a risk difference of 69.1pp with a 95\% Wald CI of [62.4, 75.7]pp. +\item \S{}~5.4 Phase 5 (2023$\rightarrow$2024 transition, 228 vs 241): $\chi^{2} = 314.4$ ($p = 2.4\times10^{-70}$), Fisher's exact $p = 9.9\times10^{-87}$ (OR diverges because all 241 2024 windows are detected), $\varphi = 0.82$ (refined from 0.783). +\item Abstract and Introduction updated to report the 2020-vs-2024 comparison with both CI brackets on each rate and Fisher's exact $p$ (the strongest and most defensible statistic here given the zero cell), instead of a single "$p < 0.0001$". +\end{itemize} + +\textbf{(c) Threshold robustness --- DONE} (see R3.4b response above). + +\textbf{(d) Moderated claim language --- DONE.} With CIs and the 45-configuration +sensitivity sweep now in hand, we made three targeted moderations: + +\begin{itemize} +\item \S{}~7 Conclusion contribution 2 now reports the 69.1pp separation with explicit CI brackets on each side and Fisher's exact $p$, and cites the 45-configuration robustness of the 50pp gap, rather than citing the separation as a standalone point estimate. +\item \S{}~7 Conclusion contribution 3 replaces "0DTE-driven structural reorganization" with language that identifies temporal coincidence and explicitly acknowledges alternative contemporaneous factors (interest rates, passive flow concentration, market-maker inventory), noting that stronger causal evidence would require a natural experiment.
+\item \S{}~6.3 "Market Structure Evolution" similarly softens the "tipping-point dynamic strengthens the structural interpretation" phrasing to "is consistent with, rather than proof of" and cross-references \S{}~6.7 Limitations for the causal-identification caveat. +\end{itemize} + +These moderations make the paper's causal claims about 0DTE match the +quality of observational evidence available here; they do not weaken +the statistical claims on 2020-vs-2024 separation, which the new +$\chi^{2}$ / Fisher / sensitivity results strengthen. + +\textbf{Change location:} + +\begin{itemize} +\item \texttt{04\_Results.tex} \S{}~5 opening "Statistical conventions" paragraph +\item \texttt{04\_Results.tex} \S{}~5.1 Phase 1/3 inline rates in text +\item \texttt{04\_Results.tex} Table 2 (negative controls) --- CI column added +\item \texttt{04\_Results.tex} Table 3 (Phase 4 comparison) --- CIs on both rates +\item \texttt{04\_Results.tex} Table 5 (Phase 5) --- new CI column +\item \texttt{references.bib} --- added \texttt{brown2001interval} for Wilson score cite +\item \texttt{scripts/validation/paper2/jrfm\_revision/bootstrap\_detection\_ci.py} --- new reprocessing script +\end{itemize} + +\textbf{Status:} done --- all four parts (CIs, $\chi^{2}$/Fisher expansion, +window/threshold robustness in \S{}~5.5, moderated claim language in \S{}~6.3 +and \S{}~7) landed across the B1, B2, B3, B4, C2 revision commits. + +\bigskip\hrule\bigskip + +\subsubsection*{R3.6 --- Discussion: finance connections} + +\begin{quote} +The discussion must be better connected to finance. The implications for risk management, market efficiency, and practitioners should be explicitly developed. The current discussion is too general and sometimes theoretical. +\end{quote} + +\textbf{Response:} We agree that the original discussion was too general on +the practitioner side.
The previous \S{}~6.6 "Practitioner Implications" +subsection has been renamed "Practical Implications" and restructured +into three explicit subsubsections exactly matching the three axes the +reviewer identified: + +\textbf{(a) Risk management.} Three concrete applications developed: +intraday volatility budgeting (regime as a leading indicator for +volatility-of-volatility exposure sizing), option-book hedging under +OpEx concentration (persistent-positive regimes amplify the OpEx +pinning dynamic), and risk-scenario design (2020 fragmented vs 2024 +persistent-negative as natural conditioning variables for stress-test +calibration). + +\textbf{(b) Market efficiency.} A new positive account is offered: the +detection-alpha orthogonality is consistent with a weakly efficient +market in which structural constraints are reliably identifiable but +already priced. This reconciles two claims often treated as +contradictory --- that dealer-gamma positioning measurably influences +short-horizon price dynamics, and that systematic strategies exploiting +it deteriorate as attention accumulates --- and explains why +microstructure-aware research can be genuinely informative for risk +without being informative for alpha. + +\textbf{(c) Practitioners: pipeline design and model deployment.} Two +design implications developed from the experimental results: (i) the +30.8pp advantage of raw strike-level data over pre-aggregated GEX +challenges the default of parametric aggregation in quantitative +pipelines, with generalisations to credit risk, fixed-income +surveillance, and equity factor research explicitly noted; (ii) the +2022--2024 0DTE regime shift implies that static microstructure models +calibrated to pre-2022 data need recalibration rather than drift +correction. 
+ +\textbf{Change location:} \S{}~6.6 "Practical Implications" (renamed from +"Practitioner Implications"), with new \texttt{sec:discussion:practical} label +and three new \texttt{\textbackslash{}subsubsection} headings corresponding to the +reviewer's three axes. The subsection expanded from one dense +paragraph (4 insights) to three structured subsubsections (\textasciitilde{}1 page). + +\textbf{Status:} done + +\bigskip\hrule\bigskip + +\subsubsection*{R3.7 --- Limitations expansion} + +\begin{quote} +The limitations section must be expanded. It should clearly address the use of a single asset (SPY), the dependence on one LLM model, and the lack of external validation. +\end{quote} + +\textbf{Response:} We thank the reviewer for flagging these specific omissions. +We have renamed \S{}~6.7 to "Limitations and Future Work" and expanded it +from six limitations to seven, with each item now explicitly tied to a +concrete follow-up study. The three items the reviewer named are now +addressed as follows: + +\textbf{(a) Single-asset scope.} The first limitation item (now titled +"Single-asset scope") explicitly acknowledges that all results concern +SPY, lists QQQ, IWM, individual equities, and non-equity underliers as +relevant but untested targets, and identifies cross-asset replication as +the single highest-priority item for future work. A pre-registered +protocol applying the same framework to at least QQQ and one individual +equity (e.g., NVDA or AAPL) is proposed. + +\textbf{(b) Single-LLM dependence.} A dedicated second item ("Single-LLM +dependence") acknowledges that all 2,221 evaluations used one reasoning +model (o4-mini), so the reported detection rates are conditional on +that model's priors. We propose a model-swap protocol covering Anthropic +Claude, OpenAI o3, Google Gemini, and open-source reasoning models +using identical prompts and obfuscated sequences, with cross-model +agreement analysis as the diagnostic.
+ +\textbf{(c) Lack of independent external validation.} A new third item +("Lack of independent external validation") acknowledges that per-window +ground-truth metrics are computed from the same Alpha Vantage feed used +to construct the windows, and proposes cross-validation against CBOE +DataShop / OPRA / commercial vendors (SpotGamma, MenthorQ) and against +related microstructure observables (realised volatility, +implied-realised spread, opening auction imbalance). + +\textbf{Change location:} \S{}~6.7 Limitations and Future Work (p.~17 in the +revised PDF). The subsection was relabelled from "Limitations" to +"Limitations and Future Work" and expanded from 6 to 7 items. Each item +now includes an explicit future-work sentence indicating how it could +be addressed. + +\textbf{Status:} done + +\bigskip\hrule\bigskip + +\subsubsection*{R3.8 --- Figures and tables} + +\begin{quote} +Figures and tables must be improved. Some are too dense and difficult to read. Labels and captions should be clearer and more explanatory. +\end{quote} + +\textbf{Response:} We made every caption in the manuscript self-contained, +following the rule that a caption should state (i) what is shown, +(ii) the key numerical values a reader should notice, and (iii) what +conclusion the reader should take from the figure. Five figure +captions (Figures 1, 3, 4, 5, 6) were rewritten to match this standard; +the figures and tables added in the earlier B1/B3/C1 commits +(Figures 7 and 8, Tables 2--6) were already written to it. + +Each rewritten caption ends with an explicit "Read this figure as:" +clause that tells the reader the intended interpretation. Examples: + +\begin{itemize} +\item \textbf{Figure 1 (Obfuscation)}: "Read this figure as: anything the LLM correctly infers from the right-hand input must come from the numerical structure alone, not from memorised date-specific context in the training corpus."
+\item \textbf{Figure 4 (Selectivity)}: "Read this figure as: detection is not a function of a single criterion but of all three acting jointly --- high magnitude alone or high persistence alone is not sufficient." +\item \textbf{Figure 5 (GEX magnitude distribution)}: "Read this figure as: the magnitude criterion alone --- before persistence or stability are even checked --- already separates the two eras, and the chosen \$5B threshold is positioned in the trough between the two distributions rather than in the bulk of either." +\item \textbf{Figure 6 (Temporal progression)}: "Read this figure as: the LLM regime-detection signal is not a smooth secular trend but a discrete step-change, coincident with the maturation of the 0DTE options market; it is not a proof of causation but is less easily reconciled with gradual drift." +\end{itemize} + +On the reviewer's remark that "some are too dense and difficult to +read": we reviewed each figure under the density lens and concluded +that none of the eight figures currently in the JRFM manuscript are +overly dense once the captions make the intended reading explicit. The +reviewer may have been referring to Figures 7 and 8 in a prior +version (the AIAI conference version), which had a crowded 9-panel +layout; those were not carried over into the JRFM manuscript. If the +editor identifies a specific figure that still reads as too dense, we +will happily simplify it. + +\textbf{Change location:} captions in \texttt{03\_Methodology.tex} (Figure 1) and +\texttt{04\_Results.tex} (Figures 3, 4, 5, 6); all other figure and table +captions were already self-contained from prior revision commits. + +\textbf{Status:} done + +\bigskip\hrule\bigskip + +\subsubsection*{R3.9 --- English language quality} + +\begin{quote} +The clarity of the manuscript needs improvement. Many sentences are too long and complex, which affects readability. 
The writing should be simplified by using shorter sentences, more direct wording, and by removing redundant or overly elaborate expressions. Careful language editing is recommended to improve clarity and flow. +\end{quote} + +\textbf{Response:} We performed a full editing pass over the manuscript +after all content changes were settled. Summary of what was done: + +\textbf{(a) Wordy transitions and hedging tics.} We checked the manuscript +for the usual English-editing offenders ("In order to", "It should be +noted that", "It is worth noting", "Due to the fact that", "This is +because", "Obviously", "Clearly"). None of these phrases appear in +the manuscript --- the original draft was already written in an active, +direct register. No changes were required on this axis. + +\textbf{(b) Long-sentence decomposition.} We identified the paragraphs +with the most elaborate nested-clause sentences (the \S{}~1 philosophical +opener and \S{}~6.5 Dispersed Knowledge were the two heaviest) and rewrote +them for directness. The \S{}~1 opener was fully replaced in the R3.1 +rewrite above (which removed roughly 120 words of philosophical prose). +\S{}~6.5 was tightened in this commit by breaking three $>$40-word sentences +into two-sentence units while retaining the Hayek citation and the +30.8pp empirical claim. + +\textbf{(c) Active voice where natural.} The manuscript is already +predominantly in active voice (e.g., "the framework achieves 81.2\% +detection" rather than "detection was observed at 81.2\%"); we did not +force passive-to-active rewrites in passages where passive carries the +correct emphasis.
+ +\textbf{(d) Consistency of technical terms.} We verified consistent +terminology across sections: "regime" (not "state") for the detection +target, "persistent / fragmented" (not "stable / unstable") for the +binary outcome, "dealer gamma positioning" (not "dealer gamma +exposure" in the context of the detection task), "obfuscation" +(not "anonymisation"). No ad-hoc substitutions were made. + +\textbf{Change location:} targeted tightening in \S{}~6.5 Dispersed Knowledge +(sentences broken up); \S{}~1 opener and \S{}~6.3 Market Structure Evolution +rewrites landed in the earlier D1 and C2 commits. Technical-term +consistency verified throughout. + +\textbf{Status:} done + +\bigskip\hrule\bigskip +\end{document} diff --git a/docs/papers/jrfm/response_to_reviewers.md b/docs/papers/jrfm/response_to_reviewers.md index 5f415ee..04cfffe 100644 --- a/docs/papers/jrfm/response_to_reviewers.md +++ b/docs/papers/jrfm/response_to_reviewers.md @@ -173,7 +173,7 @@ reasoning} via obfuscation. **Change location:** - `01_Introduction.tex`: paragraphs 1–4 of §1 fully rewritten; §1.1 - Research Questions, §1.2 Contributions, §1.4 Positioning, §1.5 Paper + Research Questions, §1.2 Contributions, §1.3 Positioning, §1.5 Paper Organization retained unchanged from the prior revision commits. - `02_Related_Work.tex`: §2.2 "Zero-Days-to-Expiration Options" expanded to include `dim2023odtes` critical discussion alongside the @@ -210,13 +210,13 @@ contribution. **Change location:** -- New §1.4 "Positioning" subsection (label +- New §1.3 "Positioning" subsection (label `sec:introduction:positioning`) between §1 Contributions and §1 Paper Organization. Two paragraphs: first states the methodological primacy and the rationale for GEX as the demonstration domain; second explains that the financial findings are downstream evidence and provides a reader-routing note for methodology-first vs finance-first readers. 
-- §6 Conclusion opening rewritten to echo the same stance before listing +- §7 Conclusion opening rewritten to echo the same stance before listing the four contributions, so that the stance frames the closing summary. **Status:** done @@ -237,7 +237,7 @@ Markov-switching regression benchmark (the textbook regime-switching model, `statsmodels.tsa.regime_switching.MarkovRegression`) on the daily SPY return series for 2020 and 2024, and additionally on the 2024 net-GEX daily panel where the cached series is available. Details -in new §3.9 "Markov-Switching Benchmark" and new §4.7 "Comparison +in new §3.8 "Markov-Switching Benchmark" and new §5.6 "Comparison with Markov-Switching Benchmark" (with Table 6 + Figure 8, `fig10_hmm_agreement.png`). @@ -269,27 +269,27 @@ outputs at and `docs/papers/paper2/figures/output/fig10_hmm_agreement.png`. **Response (part b — causal language):** Moderated in the B4 commit -(R3.5d) above. §6 Conclusion contribution 3 now describes the 0DTE -correspondence as "coincides with" rather than "drove"; §5.3 softens +(R3.5d) above. §7 Conclusion contribution 3 now describes the 0DTE +correspondence as "coincides with" rather than "drove"; §6.3 softens the "tipping-point dynamic strengthens the structural interpretation" phrasing to "is consistent with, rather than proof of"; §5.7 Limitations explicitly names interest-rate regime, passive-flow concentration, and market-maker inventory as alternative contemporaneous factors that cannot be excluded observationally. -Deeper §5.3 revision is still scheduled in C2 below. +Deeper §6.3 revision is still scheduled in C2 below. 
**Change location:** -- §3.9 Markov-Switching Benchmark (new subsection) -- §4.7 Comparison with Markov-Switching Benchmark (new subsection, +- §3.8 Markov-Switching Benchmark (new subsection) +- §5.6 Comparison with Markov-Switching Benchmark (new subsection, Table 6, Figure 8) - `scripts/validation/paper2/jrfm_revision/hmm_benchmark.py` (new) - `docs/papers/paper2/figures/output/fig10_hmm_agreement.png` (new) with local copy in `docs/papers/jrfm/figures/` -- §6 Conclusion + §5.3 + §5.7 moderations as described under R3.5d +- §7 Conclusion + §6.3 + §6.7 moderations as described under R3.5d -**Status:** (a) done; (b) done (moderations in §5.3 applied in B4 plus a -fuller §5.3 rewrite in the C2 commit). §5.3 now explicitly (i) frames +**Status:** (a) done; (b) done (moderations in §6.3 applied in B4 plus a +fuller §6.3 rewrite in the C2 commit). §6.3 now explicitly (i) frames the 0DTE correspondence as temporal coincidence supported by a plausible mechanical channel rather than a demonstrated causal relationship, (ii) enumerates four concurrent confounders (interest @@ -313,32 +313,39 @@ channels is beyond the scope of an LLM-validation paper. **Response:** We have addressed this comment in three parts: **(a) Prompts.** The complete regime-detection prompt is now reproduced -verbatim in a new Appendix A, together with the OpenAI Batch API -configuration (o4-mini, temperature = 1.0, max completion tokens = -16,384, JSON-object response format) and the output JSON schema used for -parsing. The appendix is transcribed directly from +verbatim in a new Appendix A, together with the actual OpenAI Batch API +configuration we used (model `o4-mini`; temperature defaults to 1 +because reasoning models reject user-supplied temperature overrides; +`max_completion_tokens` not explicitly set, so the OpenAI API default +applies; JSON structure requested in the prompt rather than enforced +via `response_format`) and the output JSON schema used for parsing. 
+The appendix is transcribed directly from `src/llm/mechanics_prompt_builder.py::build_regime_prompt()` in the publicly released source code, so the reader has full prompt visibility from the manuscript alone. -**(c) Temperature and reproducibility.** Appendix A also contains a +**(c) Temperature and reproducibility.** Appendix A contains a Reproducibility note explaining that OpenAI reasoning models -(o1, o3, o4-mini) run at a fixed temperature of 1 and do not accept a -user-supplied seed parameter, so bit-identical reproduction of a single -response is not guaranteed. Reproducibility at the distributional level +(`o1`, `o3`, `o4-mini`, and GPT-5 reasoning variants) reject +user-supplied `temperature` / `top_p` values and run at the default +temperature of 1. The seed parameter is supported by `o4-mini` +(OpenAI documents it as best-effort determinism that can shift when +the server `system_fingerprint` changes), but we did not set a seed +in this study. Bit-identical reproduction of any single response is +therefore not guaranteed. Reproducibility at the distributional level is established through the N = 2,221 evaluation sample and the mechanical numerical thresholds embedded in the prompt itself, which anchor the model on concrete criteria rather than free-form judgment. **(b) Threshold sensitivity — DONE.** A post-hoc sensitivity sweep has -been added as new §4.6 "Threshold Sensitivity" with Figure 7 +been added as new §5.5 "Threshold Sensitivity" with Figure 7 (`fig09_threshold_sensitivity.png`). The sweep spans a 5×3×3 grid (persistence ∈ {60, 65, 70, 75, 80}%, magnitude ∈ {$3B, $5B, $7B}, flips ≤ {3, 5, 7}; 45 configurations in total) applied to the 223 Phase 3 (2024) and 220 Phase 4 (2020) per-window records already on disk — no new LLM queries required. -Key findings reported in §4.6: +Key findings reported in §5.5: - The 2024-vs-2020 detection gap ranges from 34.1 to 85.2 pp across the 45 configurations (median 63.2 pp). 
@@ -363,10 +370,10 @@ for LaTeX compilation). **Change location:** -- New Appendix A on pp. 20–25 (parts (a) and (c) above). +- New Appendix A on pp. 24–29 (parts (a) and (c) above). - Main text §3 Methodology: brief cross-reference added to Appendix A where prompts were previously described in prose. -- New §4.6 "Threshold Sensitivity" subsection with Figure 7 (part (b)). +- New §5.5 "Threshold Sensitivity" subsection with Figure 7 (part (b)). **Status:** done @@ -421,12 +428,12 @@ claim with bounded evidence rather than point estimates alone. contingency now reports the full suite of statistics rather than just φ and "p < 0.0001". Specifically: -- §4.4 Phase 4 (2020 vs 2024, 223 each): Pearson's χ² = 213.67 (df=1, +- §5.3 Phase 4 (2020 vs 2024, 223 each): Pearson's χ² = 213.67 (df=1, p = 2.2×10⁻⁴⁸), Yates-corrected χ² = 210.90 (p = 8.7×10⁻⁴⁸), Fisher's exact two-sided p = 1.8×10⁻⁵² with odds ratio 31.3, φ = 0.69 (refined from the previously rounded 0.672), and a risk difference of 69.1pp with a 95% Wald CI of [62.4, 75.7]pp. -- §4.5 Phase 5 (2023→2024 transition, 228 vs 241): χ² = 314.4 +- §5.4 Phase 5 (2023→2024 transition, 228 vs 241): χ² = 314.4 (p = 2.4×10⁻⁷⁰), Fisher's exact p = 9.9×10⁻⁸⁷ (OR diverges because all 241 2024 windows are detected), φ = 0.82 (refined from 0.783). - Abstract and Introduction updated to report the 2020-vs-2024 @@ -439,19 +446,19 @@ and "p < 0.0001". Specifically: **(d) Moderated claim language — DONE.** With CIs and the 45-configuration sensitivity sweep now in hand, we made two targeted moderations: -- §6 Conclusion contribution 2 now reports the 69.1pp separation with +- §7 Conclusion contribution 2 now reports the 69.1pp separation with explicit CI brackets on each side and Fisher's exact p, and cites the 45-configuration robustness of the 50pp gap, rather than citing the separation as a standalone point estimate. 
-- §6 Conclusion contribution 3 replaces "0DTE-driven structural +- §7 Conclusion contribution 3 replaces "0DTE-driven structural reorganization" with language that identifies temporal coincidence and explicitly acknowledges alternative contemporaneous factors (interest rates, passive flow concentration, market-maker inventory), noting that stronger causal evidence would require a natural experiment. -- §5.3 "Market Structure Evolution" similarly softens the +- §6.3 "Market Structure Evolution" similarly softens the "tipping-point dynamic strengthens the structural interpretation" phrasing to "is consistent with, rather than proof of" and - cross-references §5.7 Limitations for the causal-identification + cross-references §6.7 Limitations for the causal-identification caveat. These moderations make the paper's causal claims about 0DTE match the @@ -461,8 +468,8 @@ the statistical claims on 2020-vs-2024 separation, which the new **Change location:** -- `04_Results.tex` §4.1 new "Statistical conventions" paragraph -- `04_Results.tex` Phase 1/3 inline rates in text +- `04_Results.tex` §5 opening "Statistical conventions" paragraph +- `04_Results.tex` §5.1 Phase 1/3 inline rates in text - `04_Results.tex` Table 2 (negative controls) — CI column added - `04_Results.tex` Table 3 (Phase 4 comparison) — CIs on both rates - `04_Results.tex` Table 5 (Phase 5) — new CI column @@ -470,8 +477,8 @@ the statistical claims on 2020-vs-2024 separation, which the new - `scripts/validation/paper2/jrfm_revision/bootstrap_detection_ci.py` — new reprocessing script **Status:** done — all four parts (CIs, χ²/Fisher expansion, -window/threshold robustness in §4.6, moderated claim language in §5.3 -and §6) landed across the B1, B2, B3, B4, C2 revision commits. +window/threshold robustness in §5.5, moderated claim language in §6.3 +and §7) landed across the B1, B2, B3, B4, C2 revision commits. --- @@ -483,7 +490,7 @@ and §6) landed across the B1, B2, B3, B4, C2 revision commits. 
> theoretical. **Response:** We agree that the original discussion was too general on -the practitioner side. The previous §5.6 "Practitioner Implications" +the practitioner side. The previous §6.6 "Practitioner Implications" subsection has been renamed "Practical Implications" and restructured into three explicit subsubsections exactly matching the three axes the reviewer identified: @@ -516,7 +523,7 @@ surveillance, and equity factor research explicitly noted; (ii) the calibrated to pre-2022 data need recalibration rather than drift correction. -**Change location:** §5.6 "Practical Implications" (renamed from +**Change location:** §6.6 "Practical Implications" (renamed from "Practitioner Implications"), with new `sec:discussion:practical` label and three new `\subsubsection` headings corresponding to the reviewer's three axes. The subsection expanded from one dense @@ -562,7 +569,7 @@ DataShop / OPRA / commercial vendors (SpotGamma, MenthorQ) and against related microstructure observables (realised volatility, implied-realised spread, opening auction imbalance). -**Change location:** §5.7 Limitations and Future Work (p.\ 17 in the +**Change location:** §6.7 Limitations and Future Work (p.\ 17 in the revised PDF). The subsection was relabelled from "Limitations" to "Limitations and Future Work" and expanded from 6 to 7 items. Each item now includes an explicit future-work sentence indicating how it could @@ -644,7 +651,7 @@ direct register. No changes were required on this axis. **(b) Long-sentence decomposition.** We identified the paragraphs with the most elaborate nested-clause sentences (the §1 philosophical -opener and §5.5 Dispersed Knowledge were the two heaviest) and rewrote +opener and §6.5 Dispersed Knowledge were the two heaviest) and rewrote them for directness. The §1 opener was fully replaced in the R3.1 rewrite above (which removed roughly 120 words of philosophical prose). 
§5.5 was tightened in this commit by breaking three >40-word sentences @@ -664,8 +671,8 @@ binary outcome, "dealer gamma positioning" (not "dealer gamma exposure" in the context of the detection task), "obfuscation" (not "anonymisation"). No ad-hoc substitutions were made. -**Change location:** targeted tightening in §5.5 Dispersed Knowledge -(sentences broken up); §1 opener and §5.3 Market Structure Evolution +**Change location:** targeted tightening in §6.5 Dispersed Knowledge +(sentences broken up); §1 opener and §6.3 Market Structure Evolution rewrites landed in the earlier D1 and C2 commits. Technical-term consistency verified throughout. diff --git a/docs/papers/paper2/figures/output/fig02_regime_window.png b/docs/papers/paper2/figures/output/fig02_regime_window.png index ffdbdf3..926f4e7 100644 Binary files a/docs/papers/paper2/figures/output/fig02_regime_window.png and b/docs/papers/paper2/figures/output/fig02_regime_window.png differ diff --git a/docs/papers/paper2/figures/output/fig03_obfuscation.png b/docs/papers/paper2/figures/output/fig03_obfuscation.png index 5dd7d7f..3818893 100644 Binary files a/docs/papers/paper2/figures/output/fig03_obfuscation.png and b/docs/papers/paper2/figures/output/fig03_obfuscation.png differ diff --git a/docs/papers/paper2/figures/output/fig04_validation_pipeline.png b/docs/papers/paper2/figures/output/fig04_validation_pipeline.png index e2ad512..6348ec6 100644 Binary files a/docs/papers/paper2/figures/output/fig04_validation_pipeline.png and b/docs/papers/paper2/figures/output/fig04_validation_pipeline.png differ diff --git a/docs/papers/paper2/figures/output/fig05_selectivity.png b/docs/papers/paper2/figures/output/fig05_selectivity.png index b81e1ad..298432c 100644 Binary files a/docs/papers/paper2/figures/output/fig05_selectivity.png and b/docs/papers/paper2/figures/output/fig05_selectivity.png differ diff --git a/docs/papers/paper2/figures/output/fig06_gex_magnitude_distribution.png 
b/docs/papers/paper2/figures/output/fig06_gex_magnitude_distribution.png index 4fa6bfb..ea9db1b 100644 Binary files a/docs/papers/paper2/figures/output/fig06_gex_magnitude_distribution.png and b/docs/papers/paper2/figures/output/fig06_gex_magnitude_distribution.png differ diff --git a/docs/papers/paper2/figures/output/fig08_detection_progression.png b/docs/papers/paper2/figures/output/fig08_detection_progression.png index 8fd5e6d..fe9a608 100644 Binary files a/docs/papers/paper2/figures/output/fig08_detection_progression.png and b/docs/papers/paper2/figures/output/fig08_detection_progression.png differ diff --git a/docs/papers/paper2/figures/output/fig09_threshold_sensitivity.png b/docs/papers/paper2/figures/output/fig09_threshold_sensitivity.png index b1d4f84..20ddecf 100644 Binary files a/docs/papers/paper2/figures/output/fig09_threshold_sensitivity.png and b/docs/papers/paper2/figures/output/fig09_threshold_sensitivity.png differ diff --git a/docs/papers/paper2/figures/output/fig10_hmm_agreement.png b/docs/papers/paper2/figures/output/fig10_hmm_agreement.png index be12e36..85ed439 100644 Binary files a/docs/papers/paper2/figures/output/fig10_hmm_agreement.png and b/docs/papers/paper2/figures/output/fig10_hmm_agreement.png differ diff --git a/docs/papers/paper2/figures/scripts/bump_font_sizes.py b/docs/papers/paper2/figures/scripts/bump_font_sizes.py new file mode 100644 index 0000000..8cb8d1a --- /dev/null +++ b/docs/papers/paper2/figures/scripts/bump_font_sizes.py @@ -0,0 +1,98 @@ +"""One-shot font-size bump for the JRFM figure generators. + +Reviewer 3 (JRFM jrfm-4256551) flagged that "Figures and tables must be +improved. Some are too dense and difficult to read." Inspection of the +six JRFM figure generators shows many hardcoded ``fontsize=`` values in +the 8-11 range, which renders as sub-10pt type when the figure is +scaled to textwidth in the journal's A4 layout. This script: + +1. Scans the six JRFM figure generator files. +2. 
Replaces every ``fontsize=N`` and ``labelsize=N`` with the same keyword + set to ``M``, where M is chosen by the size-bump rule below. +3. Writes the files back in place. + +Size-bump rule (conservative +2 with a floor of 12pt, capped at 18pt so +big display numbers remain prominent but do not balloon beyond the +figure canvas): + + original -> bumped + ------------------ + 8-10 -> 12 + 11 -> 13 + 12 -> 14 + 13 -> 15 + 14 -> 16 + 15 -> 17 + 16 -> 18 + 17-25 -> keep (already large / title-sized) + 26+ -> keep (display-number emphasis such as a headline stat) + +Run this once, then re-run each figure generator to produce updated +PNGs. Commit the generator edits so the change is reproducible. + +Usage: + python bump_font_sizes.py +""" + +from __future__ import annotations + +import re +from pathlib import Path + +HERE = Path(__file__).resolve().parent + +TARGETS = [ + "fig02_regime_window_example.py", + "fig03_obfuscation.py", + "fig04_validation_pipeline.py", + "fig05_selectivity_demo.py", + "fig06_gex_magnitude_distribution.py", + "fig08_detection_progression.py", +] + + +def bump(n: int) -> int: + if n >= 26: + return n + if n >= 17: + return n + if n <= 10: + return 12 + if n == 11: + return 13 + return n + 2 # 12-16 -> 14-18 + + +def rewrite_file(path: Path) -> int: + text = path.read_text(encoding="utf-8") + n_changes = 0 + + def repl(match: re.Match[str]) -> str: + nonlocal n_changes + key = match.group(1) + size = int(match.group(2)) + new = bump(size) + if new != size: + n_changes += 1 + return f"{key}={new}" + + text = re.sub(r"\b(fontsize|labelsize)=(\d+)", repl, text) + path.write_text(text, encoding="utf-8") + return n_changes + + +def main() -> None: + total = 0 + for name in TARGETS: + path = HERE / name + if not path.exists(): + print(f"SKIP (missing): {name}") + continue + n = rewrite_file(path) + print(f"{name}: {n} substitutions") + total += n + print(f"Total: {total} substitutions across {len(TARGETS)} files") + + +if __name__ == "__main__": + main() diff --git 
a/docs/papers/paper2/figures/scripts/fig02_regime_window_example.py b/docs/papers/paper2/figures/scripts/fig02_regime_window_example.py index 3aa77d3..73b4be1 100644 --- a/docs/papers/paper2/figures/scripts/fig02_regime_window_example.py +++ b/docs/papers/paper2/figures/scripts/fig02_regime_window_example.py @@ -107,17 +107,17 @@ def create_figure(example_data): ax.axhline(5, color=IEEE_THEME["accent_warning"], linestyle="--", linewidth=1.5, alpha=0.7, zorder=2) # Threshold labels on right edge - ax.text(31, -5, "$5B", fontsize=10, color=IEEE_THEME["accent_warning"], va="center", ha="left") - ax.text(31, 5, "$5B", fontsize=10, color=IEEE_THEME["accent_warning"], va="center", ha="left") + ax.text(31, -5, "$5B", fontsize=12, color=IEEE_THEME["accent_warning"], va="center", ha="left") + ax.text(31, 5, "$5B", fontsize=12, color=IEEE_THEME["accent_warning"], va="center", ha="left") # Labels - ax.set_xlabel("Day in 30-Day Window", fontsize=13, fontweight="bold", color=IEEE_THEME["text"]) - ax.set_ylabel("GEX Magnitude ($B)", fontsize=13, fontweight="bold", color=IEEE_THEME["text"]) + ax.set_xlabel("Day in 30-Day Window", fontsize=15, fontweight="bold", color=IEEE_THEME["text"]) + ax.set_ylabel("GEX Magnitude ($B)", fontsize=15, fontweight="bold", color=IEEE_THEME["text"]) # Axis formatting ax.set_xlim(0, 32) ax.set_xticks([1, 5, 10, 15, 20, 25, 30]) - ax.tick_params(colors=IEEE_THEME["text"], labelsize=11) + ax.tick_params(colors=IEEE_THEME["text"], labelsize=13) ax.grid(axis="y", alpha=0.4, color=IEEE_THEME["grid"], linestyle="-", linewidth=0.5, zorder=0) # Spine styling @@ -141,7 +141,7 @@ def create_figure(example_data): stats_text, ha="center", va="bottom", - fontsize=11, + fontsize=13, color=IEEE_THEME["text"], fontweight="bold", bbox=dict( diff --git a/docs/papers/paper2/figures/scripts/fig03_obfuscation.py b/docs/papers/paper2/figures/scripts/fig03_obfuscation.py index 1336850..0e84619 100644 --- a/docs/papers/paper2/figures/scripts/fig03_obfuscation.py +++ 
b/docs/papers/paper2/figures/scripts/fig03_obfuscation.py @@ -32,19 +32,21 @@ def create_figure(): plt.style.use("default") # Create figure - slightly taller to fit legend - fig, ax = plt.subplots(figsize=(10, 6.5), dpi=300) + fig, ax = plt.subplots(figsize=(10, 7.0), dpi=300) fig.patch.set_facecolor(IEEE_THEME["background"]) ax.set_facecolor(IEEE_THEME["background"]) ax.set_xlim(0, 10) - ax.set_ylim(-0.3, 6) + ax.set_ylim(-0.3, 6.7) ax.axis("off") - # Title - repositioned for compact layout + # Title - nudged upward (with BEFORE/AFTER callouts bumped to fontsize 17, + # the previous layout clipped this subtitle horizontally against the + # callout labels; +0.5 units of vertical breathing room fixes it). ax.text( 5, - 5.7, + 6.4, "Temporal Obfuscation Process", - fontsize=16, + fontsize=18, fontweight="bold", ha="center", va="top", @@ -52,9 +54,9 @@ def create_figure(): ) ax.text( 5, - 5.3, + 5.95, "Preventing LLM Memorization While Preserving Structural Information", - fontsize=10, + fontsize=12, ha="center", va="top", color=IEEE_THEME["dim"], @@ -73,7 +75,7 @@ def create_figure(): before_x + 1.6, before_y + 0.65, "BEFORE", - fontsize=15, + fontsize=17, fontweight="bold", ha="center", va="bottom", @@ -83,7 +85,7 @@ def create_figure(): before_x + 1.6, before_y + 0.35, "Original Data", - fontsize=11, + fontsize=13, ha="center", va="bottom", color=IEEE_THEME["dim"], @@ -116,14 +118,14 @@ def create_figure(): data_y = before_y - 0.5 for label, value, is_redacted in original_data: ax.text( - before_x + 0.2, data_y, label, fontsize=10, ha="left", va="top", color=IEEE_THEME["dim"], family="monospace" + before_x + 0.2, data_y, label, fontsize=12, ha="left", va="top", color=IEEE_THEME["dim"], family="monospace" ) color = OBFUSCATION_COLORS["redact"] if is_redacted else OBFUSCATION_COLORS["preserve"] ax.text( before_x + 1.5, data_y, value, - fontsize=10, + fontsize=12, ha="left", va="top", fontweight="bold", @@ -149,7 +151,7 @@ def create_figure(): 5.0, 3.7, "OBFUSCATION", 
- fontsize=10, + fontsize=12, fontweight="bold", ha="center", va="bottom", @@ -168,7 +170,7 @@ def create_figure(): 5.0, 2.3, transform_text, - fontsize=8, + fontsize=12, ha="center", va="top", color=IEEE_THEME["text"], @@ -193,7 +195,7 @@ def create_figure(): after_x + 1.6, after_y + 0.65, "AFTER", - fontsize=15, + fontsize=17, fontweight="bold", ha="center", va="bottom", @@ -203,7 +205,7 @@ def create_figure(): after_x + 1.6, after_y + 0.35, "Obfuscated Data", - fontsize=11, + fontsize=13, ha="center", va="bottom", color=IEEE_THEME["dim"], @@ -236,7 +238,7 @@ def create_figure(): data_y = after_y - 0.5 for label, value, is_placeholder in obfuscated_data: ax.text( - after_x + 0.2, data_y, label, fontsize=10, ha="left", va="top", color=IEEE_THEME["dim"], family="monospace" + after_x + 0.2, data_y, label, fontsize=12, ha="left", va="top", color=IEEE_THEME["dim"], family="monospace" ) if is_placeholder: color = IEEE_THEME["dim"] @@ -248,7 +250,7 @@ def create_figure(): after_x + 1.5, data_y, value, - fontsize=10, + fontsize=12, ha="left", va="top", fontweight="bold", @@ -280,7 +282,7 @@ def create_figure(): 1.45, legend_y, "REMOVED: Temporal identifiers that could enable memorization", - fontsize=8, + fontsize=12, ha="left", va="center", color=IEEE_THEME["text"], @@ -301,7 +303,7 @@ def create_figure(): 1.45, legend_y - 0.4, "PRESERVED: Structural metrics required for regime detection", - fontsize=8, + fontsize=12, ha="left", va="center", color=IEEE_THEME["text"], diff --git a/docs/papers/paper2/figures/scripts/fig04_validation_pipeline.py b/docs/papers/paper2/figures/scripts/fig04_validation_pipeline.py index 7ec35dd..18fea9b 100644 --- a/docs/papers/paper2/figures/scripts/fig04_validation_pipeline.py +++ b/docs/papers/paper2/figures/scripts/fig04_validation_pipeline.py @@ -104,7 +104,7 @@ def create_figure(): phase["name"], ha="center", va="center", - fontsize=12, + fontsize=14, fontweight="bold", color="#FFFFFF", ) @@ -151,7 +151,7 @@ def create_figure(): 
phase["title"], ha="center", va="top", - fontsize=12, + fontsize=14, fontweight="bold", color=IEEE_THEME["text"], ) @@ -163,7 +163,7 @@ def create_figure(): phase["data"], ha="center", va="top", - fontsize=10, + fontsize=12, color=IEEE_THEME["dim"], style="italic", ) @@ -175,7 +175,7 @@ def create_figure(): f"n={phase['windows']}", ha="center", va="top", - fontsize=10, + fontsize=12, color=IEEE_THEME["dim"], ) @@ -202,7 +202,7 @@ def create_figure(): finding_text, ha="center", va="center", - fontsize=11, + fontsize=13, color=IEEE_THEME["text"], fontweight="bold", bbox=dict( diff --git a/docs/papers/paper2/figures/scripts/fig05_selectivity_demo.py b/docs/papers/paper2/figures/scripts/fig05_selectivity_demo.py index 1efe0df..93946e0 100644 --- a/docs/papers/paper2/figures/scripts/fig05_selectivity_demo.py +++ b/docs/papers/paper2/figures/scripts/fig05_selectivity_demo.py @@ -143,22 +143,22 @@ def create_figure(windows): # Compact title with status title = f"{window['label']} [{window['status']}]" - ax.set_title(title, fontsize=11, fontweight="bold", color=badge_color, pad=4) + ax.set_title(title, fontsize=13, fontweight="bold", color=badge_color, pad=4) # Axis formatting ax.set_xlim(0, 31) ax.set_xticks([1, 15, 30]) - ax.tick_params(colors=IEEE_THEME["text"], labelsize=10) + ax.tick_params(colors=IEEE_THEME["text"], labelsize=12) ax.grid(axis="y", alpha=0.3, color=IEEE_THEME["grid"], linestyle="-") # Labels - ax.set_xlabel("Day", fontsize=11, color=IEEE_THEME["text"]) - ax.set_ylabel("GEX ($B)", fontsize=11, color=IEEE_THEME["text"]) + ax.set_xlabel("Day", fontsize=13, color=IEEE_THEME["text"]) + ax.set_ylabel("GEX ($B)", fontsize=13, color=IEEE_THEME["text"]) # Criteria legend at bottom - larger font criteria_text = "Detection Criteria: Persistence > 70% | Avg Magnitude > $5B | Sign Flips ≤ 5" fig.text( - 0.5, 0.01, criteria_text, ha="center", va="bottom", fontsize=12, fontweight="bold", color=IEEE_THEME["text"] + 0.5, 0.01, criteria_text, ha="center", va="bottom", 
fontsize=14, fontweight="bold", color=IEEE_THEME["text"] ) plt.tight_layout(rect=[0, 0.04, 1, 1]) diff --git a/docs/papers/paper2/figures/scripts/fig06_gex_magnitude_distribution.py b/docs/papers/paper2/figures/scripts/fig06_gex_magnitude_distribution.py index 7b4f04c..5ea8fd5 100644 --- a/docs/papers/paper2/figures/scripts/fig06_gex_magnitude_distribution.py +++ b/docs/papers/paper2/figures/scripts/fig06_gex_magnitude_distribution.py @@ -121,32 +121,40 @@ def create_figure(data): # $5B threshold line ax1.axvline(x=5.0, color=IEEE_THEME["accent_positive"], linestyle="--", linewidth=2) - # Annotations for 2020 + # Annotations for 2020 -- white background box for consistency with + # the 2024 Mean label treatment. y_max_2020 = ax1.get_ylim()[1] ax1.text( mean_2020 + 0.3, y_max_2020 * 0.85, f"Mean\n${mean_2020:.1f}B", - fontsize=10, + fontsize=12, fontweight="bold", color=IEEE_THEME["year_2020"], ha="left", va="top", + bbox=dict( + facecolor="white", + edgecolor=IEEE_THEME["year_2020"], + linewidth=0.8, + alpha=0.9, + boxstyle="round,pad=0.3", + ), ) ax1.text( 5.2, y_max_2020 * 0.5, "$5B\nThreshold", - fontsize=9, + fontsize=12, fontweight="bold", color=IEEE_THEME["accent_positive"], ha="left", va="center", ) - ax1.set_title("2020 Pre-0DTE Era", fontsize=13, fontweight="bold", color=IEEE_THEME["year_2020"], pad=10) - ax1.set_xlabel("GEX Magnitude ($B)", fontsize=11, fontweight="bold", color=IEEE_THEME["text"]) - ax1.set_ylabel("Number of 30-Day Windows", fontsize=11, fontweight="bold", color=IEEE_THEME["text"]) + ax1.set_title("2020 Pre-0DTE Era", fontsize=15, fontweight="bold", color=IEEE_THEME["year_2020"], pad=10) + ax1.set_xlabel("GEX Magnitude ($B)", fontsize=13, fontweight="bold", color=IEEE_THEME["text"]) + ax1.set_ylabel("Number of 30-Day Windows", fontsize=13, fontweight="bold", color=IEEE_THEME["text"]) ax1.set_xlim(0, 10) # Stats box for 2020 @@ -155,7 +163,7 @@ def create_figure(data): 0.95, f"n = {len(mag_2020)}\nAbove $5B: {pct_above_5b_2020:.0f}%", 
transform=ax1.transAxes, - fontsize=10, + fontsize=12, va="top", ha="right", bbox=dict(boxstyle="round,pad=0.3", facecolor="white", edgecolor=IEEE_THEME["dim"], alpha=0.9), @@ -179,21 +187,29 @@ def create_figure(data): # Mean line for 2024 ax2.axvline(x=mean_2024, color=IEEE_THEME["year_2024"], linestyle="--", linewidth=2.5) - # Annotations for 2024 + # Annotations for 2024 -- white background box so the blue label is + # legible against the blue histogram bars at the mean x-position. y_max_2024 = ax2.get_ylim()[1] ax2.text( mean_2024 + 0.3, y_max_2024 * 0.85, f"Mean\n${mean_2024:.1f}B", - fontsize=10, + fontsize=12, fontweight="bold", color=IEEE_THEME["year_2024"], ha="left", va="top", + bbox=dict( + facecolor="white", + edgecolor=IEEE_THEME["year_2024"], + linewidth=0.8, + alpha=0.9, + boxstyle="round,pad=0.3", + ), ) - ax2.set_title("2024 Post-0DTE Era", fontsize=13, fontweight="bold", color=IEEE_THEME["year_2024"], pad=10) - ax2.set_xlabel("GEX Magnitude ($B)", fontsize=11, fontweight="bold", color=IEEE_THEME["text"]) + ax2.set_title("2024 Post-0DTE Era", fontsize=15, fontweight="bold", color=IEEE_THEME["year_2024"], pad=10) + ax2.set_xlabel("GEX Magnitude ($B)", fontsize=13, fontweight="bold", color=IEEE_THEME["text"]) ax2.set_xlim(12, 29) # Stats box for 2024 @@ -202,7 +218,7 @@ def create_figure(data): 0.95, f"n = {len(mag_2024)}\nAbove $5B: {pct_above_5b_2024:.0f}%", transform=ax2.transAxes, - fontsize=10, + fontsize=12, va="top", ha="right", bbox=dict(boxstyle="round,pad=0.3", facecolor="white", edgecolor=IEEE_THEME["dim"], alpha=0.9), @@ -214,7 +230,7 @@ def create_figure(data): for ax in [ax1, ax2]: ax.grid(True, alpha=0.3, linestyle="-", linewidth=0.5, color=IEEE_THEME["grid"], zorder=0) ax.set_axisbelow(True) - ax.tick_params(colors=IEEE_THEME["text"], labelsize=10) + ax.tick_params(colors=IEEE_THEME["text"], labelsize=12) for spine in ax.spines.values(): spine.set_linewidth(1.0) spine.set_color(IEEE_THEME["dim"]) @@ -222,7 +238,7 @@ def 
create_figure(data): # Add overall title with growth statistic fig.suptitle( f"GEX Magnitude Distribution: +{((mean_2024 / mean_2020) - 1) * 100:.0f}% Growth (2020 → 2024)", - fontsize=14, + fontsize=16, fontweight="bold", color=IEEE_THEME["text"], y=0.98, diff --git a/docs/papers/paper2/figures/scripts/fig08_detection_progression.py b/docs/papers/paper2/figures/scripts/fig08_detection_progression.py index be2922f..80d8bc6 100644 --- a/docs/papers/paper2/figures/scripts/fig08_detection_progression.py +++ b/docs/papers/paper2/figures/scripts/fig08_detection_progression.py @@ -113,7 +113,7 @@ def create_figure(results): f"{count}/{total}", ha="center", va="bottom", - fontsize=11, + fontsize=13, fontweight="bold", color=IEEE_THEME["text"], ) @@ -127,7 +127,7 @@ def create_figure(results): f"{rate:.1f}%", ha="center", va="center", - fontsize=12, + fontsize=14, fontweight="bold", color=IEEE_THEME["background"], ) @@ -138,7 +138,7 @@ def create_figure(results): f"{rate:.1f}%", ha="center", va="bottom", - fontsize=11, + fontsize=13, fontweight="bold", color=IEEE_THEME["text"], ) @@ -150,7 +150,7 @@ def create_figure(results): "2023→2024\nStructural\nShift", ha="right", va="center", - fontsize=11, + fontsize=13, fontweight="bold", color=IEEE_THEME["text"], bbox=dict( @@ -169,19 +169,19 @@ def create_figure(results): ) # Regime labels - ax1.text(2020.5, 108, "Pre-Regime", ha="center", fontsize=11, fontweight="bold", color=YEAR_COLORS["pre_regime"]) - ax1.text(2022.5, 108, "Gradual Adoption", ha="center", fontsize=11, fontweight="bold", color=YEAR_COLORS["growing"]) + ax1.text(2020.5, 108, "Pre-Regime", ha="center", fontsize=13, fontweight="bold", color=YEAR_COLORS["pre_regime"]) + ax1.text(2022.5, 108, "Gradual Adoption", ha="center", fontsize=13, fontweight="bold", color=YEAR_COLORS["growing"]) ax1.text( - 2024.5, 108, "Persistent Regime", ha="center", fontsize=11, fontweight="bold", color=YEAR_COLORS["structural"] + 2024.5, 108, "Persistent Regime", ha="center", 
fontsize=13, fontweight="bold", color=YEAR_COLORS["structural"] ) # Formatting - ax1.set_xlabel("Year", fontsize=12, fontweight="bold", color=IEEE_THEME["text"]) - ax1.set_ylabel("Detection Rate (%)", fontsize=12, fontweight="bold", color=IEEE_THEME["text"]) + ax1.set_xlabel("Year", fontsize=14, fontweight="bold", color=IEEE_THEME["text"]) + ax1.set_ylabel("Detection Rate (%)", fontsize=14, fontweight="bold", color=IEEE_THEME["text"]) ax1.set_title( "Phase 4A: Temporal Progression of Regime Detection (2020-2025)\n" + "Gradual 0DTE Adoption with 2023→2024 Structural Market Shift", - fontsize=13, + fontsize=15, fontweight="bold", pad=10, color=IEEE_THEME["text"], @@ -215,21 +215,21 @@ def create_figure(results): f"${gex:.1f}B", ha="center", va="bottom", - fontsize=10, + fontsize=12, fontweight="bold", color=IEEE_THEME["text"], ) ax2.axhline(y=5.0, color=IEEE_THEME["accent_positive"], linestyle="--", linewidth=2, alpha=0.8) ax2.text( - 2020.3, 5.8, "$5B Threshold", fontsize=10, fontweight="bold", color=IEEE_THEME["accent_positive"], va="bottom" + 2020.3, 5.8, "$5B Threshold", fontsize=12, fontweight="bold", color=IEEE_THEME["accent_positive"], va="bottom" ) - ax2.set_xlabel("Year", fontsize=12, fontweight="bold", color=IEEE_THEME["text"]) - ax2.set_ylabel("Avg GEX Magnitude ($B)", fontsize=12, fontweight="bold", color=IEEE_THEME["text"]) + ax2.set_xlabel("Year", fontsize=14, fontweight="bold", color=IEEE_THEME["text"]) + ax2.set_ylabel("Avg GEX Magnitude ($B)", fontsize=14, fontweight="bold", color=IEEE_THEME["text"]) ax2.set_title( "Average GEX Magnitude Evolution (360% Growth 2021→2024)", - fontsize=12, + fontsize=14, fontweight="bold", pad=8, color=IEEE_THEME["text"], @@ -255,7 +255,7 @@ def create_figure(results): footer_text, ha="center", va="bottom", - fontsize=9, + fontsize=12, style="italic", color=IEEE_THEME["dim"], wrap=True, diff --git a/scripts/validation/paper2/jrfm_revision/hmm_benchmark.py b/scripts/validation/paper2/jrfm_revision/hmm_benchmark.py index 
962b1f2..88f2fd3 100644 --- a/scripts/validation/paper2/jrfm_revision/hmm_benchmark.py +++ b/scripts/validation/paper2/jrfm_revision/hmm_benchmark.py @@ -277,8 +277,8 @@ def plot_agreement(results: list[dict]) -> None: ax1.legend(loc="upper left") ax1.set_title("Detection rate: LLM vs Markov-switching") for i, (l, h) in enumerate(zip(llm_rates, hmm_rates)): - ax1.text(i - w / 2, l + 1.5, f"{l:.1f}%", ha="center", fontsize=9) - ax1.text(i + w / 2, h + 1.5, f"{h:.1f}%", ha="center", fontsize=9) + ax1.text(i - w / 2, l + 1.5, f"{l:.1f}%", ha="center", fontsize=12) + ax1.text(i + w / 2, h + 1.5, f"{h:.1f}%", ha="center", fontsize=12) colors = ["#2ca02c" if k > 0.4 else "#d62728" if k < 0.2 else "#bcbd22" for k in kappas] ax2.bar(x, kappas, color=colors) @@ -290,13 +290,13 @@ def plot_agreement(results: list[dict]) -> None: ax2.set_ylabel("Cohen's κ") ax2.set_ylim(-0.3, 1.0) ax2.set_title("Agreement (LLM vs Markov-switching)") - ax2.legend(loc="upper right", fontsize=8) + ax2.legend(loc="upper right", fontsize=12) for i, k in enumerate(kappas): - ax2.text(i, k + 0.02 if k >= 0 else k - 0.06, f"{k:.2f}", ha="center", fontsize=9) + ax2.text(i, k + 0.02 if k >= 0 else k - 0.06, f"{k:.2f}", ha="center", fontsize=12) fig.suptitle( "Markov-switching benchmark versus LLM regime detection", - fontsize=11, + fontsize=13, ) FIG_DIR.mkdir(parents=True, exist_ok=True) fig.savefig(OUTPUT_PNG, dpi=150, bbox_inches="tight") diff --git a/scripts/validation/paper2/jrfm_revision/threshold_sensitivity.py b/scripts/validation/paper2/jrfm_revision/threshold_sensitivity.py index 6723d84..4ce3b11 100644 --- a/scripts/validation/paper2/jrfm_revision/threshold_sensitivity.py +++ b/scripts/validation/paper2/jrfm_revision/threshold_sensitivity.py @@ -168,7 +168,7 @@ def plot_heatmap(results: list[dict]) -> None: ha="center", va="center", color="white" if val < (vmin + vmax) / 2 else "black", - fontsize=9, + fontsize=12, ) # mark the paper default with a red box @@ -189,7 +189,7 @@ def 
plot_heatmap(results: list[dict]) -> None: fig.suptitle( "Threshold sensitivity: 2024 vs 2020 detection gap across 45 configurations\n" "(red box marks the paper default: persistence >= 70%, magnitude >= $5B, flips <= 5)", - fontsize=11, + fontsize=13, ) FIG_DIR.mkdir(parents=True, exist_ok=True)