SLTCGA/SLTCGA_docs.json at main · SolvingLab/SLTCGA · GitHub

{
  "list_cancer_types": {
    "package": "SLTCGA",
    "function_name": "list_cancer_types",
    "title": "List All TCGA Cancer Types and Molecular Subtypes",
    "description": "Displays comprehensive catalog of all TCGA cancer types: 33 main cancer types, 32 molecular subtypes across 14 cancers, and combined cancer groups. Cancer type input is case-insensitive (BRCA, brca, Brca all work). Essential for identifying available cancer types before analysis. Returns invisible data frame for programmatic access.",
    "user_queries": ["**General Exploration**:", "What cancer types are available in TCGA?", "How many cancer types does SLTCGA support?", "What is the difference between main types and subtypes?", "Can I analyze breast cancer subtypes?", "What lung cancer types are available?", "**Specific Cancer Types**:", "Is breast cancer (BRCA) data available?", "Can I analyze lung adenocarcinoma (LUAD)?", "Is glioblastoma (GBM) included?", "What kidney cancer types exist?", "Is pancreatic cancer (PAAD) available?", "Can I analyze melanoma (SKCM)?", "Is liver cancer (LIHC) included?", "**Molecular Subtypes**:", "What breast cancer subtypes are available?", "Can I analyze basal vs luminal breast cancer?", "What are BRCA-Basal, BRCA-LumA, BRCA-LumB, BRCA-Her2?", "Are glioblastoma molecular subtypes available?", "Can I analyze IDH-mutant vs IDH-wildtype gliomas?", "What gastric cancer (STAD) subtypes exist?", "Are colon cancer molecular subtypes available?", "**Combined Groups**:", "Can I analyze colorectal cancer (COAD + READ) together?", "Is there a combined glioma group (GBM + LGG)?", "How do I use combined cancer groups?", "**Input Format**:", "Is cancer type case-sensitive?", "Can I use lowercase cancer names?", "What is the correct format for subtypes?", "How do I specify molecular subtypes in analysis functions?"],
    "usage": "list_cancer_types(show_subtypes = TRUE)",
    "parameters": [
      {
        "name": "show_subtypes",
        "has_default": true,
        "default_value": "TRUE",
        "description": "Logical. Display molecular subtypes (default: TRUE). TRUE shows main types + subtypes + combined groups. FALSE shows only 33 main cancer types. Subtypes provide finer molecular classification (e.g., BRCA-Basal, BRCA-LumA)."
      }
    ],
    "examples": "## No test: \n# ===========================================================================\n# Example 1: List all cancer types (main + subtypes)\n# ===========================================================================\n# Research Question: What cancer types can I analyze?\n\nlist_cancer_types()\n\n# Output shows:\n# - 33 main cancer types\n# - 32 molecular subtypes (grouped by parent)\n# - Combined groups (COADREAD, GBMLGG)\n\n# ===========================================================================\n# Example 2: List main cancer types only\n# ===========================================================================\n\nlist_cancer_types(show_subtypes = FALSE)\n\n# Shows only 33 main cancer types (no subtypes)\n\n# ===========================================================================\n# Example 3: Identify breast cancer subtypes\n# ===========================================================================\n# Research Question: What BRCA subtypes are available?\n\nlist_cancer_types(show_subtypes = TRUE)\n\n# BRCA subtypes shown:\n# - BRCA-Basal (Triple-negative, aggressive)\n# - BRCA-Her2 (HER2-enriched)\n# - BRCA-LumA (Luminal A, ER+, best prognosis)\n# - BRCA-LumB (Luminal B, ER+, intermediate prognosis)\n\n# ===========================================================================\n# Example 4: Programmatic access\n# ===========================================================================\n\n# Get all cancer types\ncancers <- list_cancer_types()\n\n# Filter main types only\nmain_types <- cancers$Cancer_Type[cancers$Type == \"Main\"]\n\n# Filter subtypes only\nsubtypes <- cancers$Cancer_Type[cancers$Type == \"Subtype\"]\n\n# Filter breast cancer subtypes\nbrca_subtypes <- subtypes[grepl(\"^BRCA-\", subtypes)]\n# Returns: \"BRCA-Basal\", \"BRCA-Her2\", \"BRCA-LumA\", \"BRCA-LumB\"\n\n# ===========================================================================\n# Example 5: Use cancer types in analysis\n# ===========================================================================\n\n# Main cancer type\n# tcga_correlation(\n#   var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\", ...\n# )\n\n# Molecular subtype\n# tcga_correlation(\n#   var1 = \"ESR1\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA-LumA\", ...\n# )\n\n# Multiple cancers\n# tcga_correlation(\n#   var1 = \"TP53\", var1_modal = \"Mutation\",\n#   var1_cancers = c(\"BRCA\", \"LUAD\", \"COAD\"), ...\n# )\n\n# ===========================================================================\n# Example 6: Pan-cancer analysis\n# ===========================================================================\n\n# Get all main cancer types for pan-cancer study\nall_cancers <- list_cancer_types(show_subtypes = FALSE)\nmain_cancers <- all_cancers$Cancer_Type\n\n# Use in pan-cancer correlation\n# tcga_correlation(\n#   var1 = \"TP53\", var1_modal = \"Mutation\",\n#   var1_cancers = main_cancers[1:10], # First 10 cancers\n#   var2 = \"TMB\", var2_modal = \"Signature\",\n#   var2_cancers = main_cancers[1:10]\n# )\n\n# ===========================================================================\n# Next Steps\n# ===========================================================================\n# After identifying cancer types:\n# 1. Use in tcga_correlation(var1_cancers = \"BRCA\", var2_cancers = \"BRCA\")\n# 2. Use in tcga_enrichment(var1_cancers = \"LUAD\")\n# 3. Use in tcga_survival(var1_cancers = \"BRCA-Basal\")\n# 4. Compare subtypes: var1_cancers = c(\"BRCA-Basal\", \"BRCA-LumA\")\n## End(No test)\n\n\n\n",
    "return_value": "Invisible data frame with 2 columns. Cancer_Type: cancer type code (e.g., \"BRCA\", \"BRCA-Basal\"). Type: classification, one of \"Main\", \"Subtype\", or \"Combined\".",
    "references": ["**TCGA Database**: The Cancer Genome Atlas Research Network (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10):1113-1120. doi:10.1038/ng.2764", "Hoadley KA, et al. (2018). Cell-of-Origin Patterns Dominate the Molecular Classification of 10,000 Tumors from 33 Types of Cancer. Cell, 173(2):291-304. doi:10.1016/j.cell.2018.03.022"],
    "formatted_arguments": "show_subtypes: Logical. Display molecular subtypes (default: TRUE). TRUE shows main types + subtypes + combined groups. FALSE shows only 33 main cancer types. Subtypes provide finer molecular classification (e.g., BRCA-Basal, BRCA-LumA).",
    "simple_arguments": ""
  },
  "list_immune_cells": {
    "package": "SLTCGA",
    "function_name": "list_immune_cells",
    "title": "List All Immune Cell Types from 8 Deconvolution Algorithms",
    "description": "Displays comprehensive catalog of all 99 immune cell types available in SLTCGA, derived from 8 deconvolution algorithms (CIBERSORT, xCell, quanTIseq, MCPcounter, TIMER, EPIC, IPS, ESTIMATE) with category classification (11 categories). Supports filtering by algorithm or cell category. Essential for immune infiltration analysis. Returns invisible data frame for programmatic access.",
    "user_queries": ["**General Exploration**:", "What immune cell types can I analyze?", "How many immune cells are available in TCGA?", "What deconvolution algorithms are included?", "Which algorithm should I use for my analysis?", "What is the difference between CIBERSORT and xCell?", "**Specific Cell Types**:", "How do I find CD8+ T cells?", "What macrophage subtypes are available (M0/M1/M2)?", "Can I analyze regulatory T cells (Tregs)?", "Is B cell infiltration data available?", "What NK cell types exist?", "Can I analyze dendritic cells?", "Is neutrophil infiltration available?", "**Algorithm Selection**:", "Which CIBERSORT cell types are available?", "What does xCell provide?", "How many cell types does quanTIseq have?", "What is ESTIMATE algorithm?", "Can I compare results across different algorithms?", "**Category Filtering**:", "How do I list all T cell types?", "What macrophage types exist?", "Can I see all CD8+ T cell variants?", "How do I find stromal/microenvironment cells?", "What B cell subtypes are available?", "**Research Applications**:", "Can I correlate immune infiltration with gene expression?", "How do I analyze immune-clinical associations?", "Can I compare immune profiles across cancer types?", "Is tumor purity information available?", "How do I study tumor microenvironment?"],
    "usage": "list_immune_cells(algorithm = NULL, category = NULL)",
    "parameters": [
      {
        "name": "algorithm",
        "has_default": true,
        "default_value": "NULL",
        "description": "Character or NULL. Filter by deconvolution algorithm (default: NULL). Options: \"cibersort\", \"xcell\", \"quantiseq\", \"mcpcounter\", \"timer\", \"epic\", \"ips\", \"estimate\". NULL shows all 99 cell types across all 8 algorithms. Each algorithm provides different cell type resolution and granularity."
      },
      {
        "name": "category",
        "has_default": true,
        "default_value": "NULL",
        "description": "Character or NULL. Filter by functional cell category (default: NULL). Options: \"B_cells\", \"T_cells_CD4\", \"T_cells_CD8\", \"Tregs\", \"NK_cells\", \"Macrophages\", \"DC\", \"Neutrophils\", \"Monocytes\", \"Microenvironment\", \"ESTIMATE\". NULL shows all categories. Use to focus on specific immune cell lineages."
      }
    ],
    "examples": "## No test: \n# ===========================================================================\n# Example 1: List all 99 immune cell types\n# ===========================================================================\n# Research Question: What immune cell data is available?\n\nlist_immune_cells()\n\n# Output shows all 99 cell types with algorithm and category\n\n# ===========================================================================\n# Example 2: List CIBERSORT cell types (22 types)\n# ===========================================================================\n# Research Question: What cell types does CIBERSORT provide?\n\nlist_immune_cells(algorithm = \"cibersort\")\n\n# Returns 22 CIBERSORT cell types:\n# - B_cells_naive, B_cells_memory, Plasma_cells\n# - CD8_T_cells, CD4_T_cells_naive, CD4_T_cells_memory_resting, etc.\n# - Macrophages_M0, Macrophages_M1, Macrophages_M2\n# - NK_cells_activated, NK_cells_resting\n# - Dendritic_cells_activated, Dendritic_cells_resting\n\n# ===========================================================================\n# Example 3: List all CD8+ T cell variants\n# ===========================================================================\n# Research Question: What CD8+ T cell types are available?\n\nlist_immune_cells(category = \"T_cells_CD8\")\n\n# Returns CD8+ T cells from multiple algorithms:\n# - CD8_T_cells_cibersort\n# - T_cells_CD8_xcell\n# - T_cells_CD8_quantiseq\n# - CD8_T_cells_mcpcounter\n\n# ===========================================================================\n# Example 4: Compare macrophage types across algorithms\n# ===========================================================================\n\nlist_immune_cells(category = \"Macrophages\")\n\n# Shows macrophage variants:\n# - CIBERSORT: M0, M1, M2 (3 subtypes)\n# - xCell: Macrophages, Macrophages_M1, Macrophages_M2\n# - quanTIseq: Macrophages_M1, Macrophages_M2\n# - MCPcounter: Monocytic_lineage\n\n# ===========================================================================\n# Example 5: Explore tumor microenvironment\n# ===========================================================================\n\nlist_immune_cells(category = \"Microenvironment\")\n\n# Returns stromal cells:\n# - Fibroblasts_xcell\n# - Endothelial_cells_xcell\n# - MSC_xcell (Mesenchymal stem cells)\n# - Adipocytes_xcell\n\n# ===========================================================================\n# Example 6: Get tumor purity scores\n# ===========================================================================\n\nlist_immune_cells(algorithm = \"estimate\")\n\n# Returns ESTIMATE scores:\n# - ImmuneScore_estimate\n# - StromalScore_estimate\n# - TumorPurity_estimate\n\n# ===========================================================================\n# Example 7: Programmatic access\n# ===========================================================================\n\n# Get all immune cells\ncells <- list_immune_cells()\n\n# Filter CD8 cells\ncd8_cells <- cells$Cell_Name[cells$Category == \"T_cells_CD8\"]\n\n# Use in analysis\n# tcga_correlation(\n#   var1 = \"TP53\", var1_modal = \"Mutation\",\n#   var2 = cd8_cells[1], var2_modal = \"ImmuneCell\", ...\n# )\n\n# ===========================================================================\n# Next Steps\n# ===========================================================================\n# After identifying cell types:\n# 1. Use tcga_correlation() to correlate with genes/mutations/clinical\n# 2. Use tcga_survival() to test prognostic significance\n# 3. Use tcga_enrichment() to find associated pathways\n# 4. Compare across algorithms to validate findings\n## End(No test)\n\n\n\n",
    "return_value": "Invisible data frame with 3 columns. Cell_Name: full cell type name (e.g., \"CD8_T_cells_cibersort\"). Algorithm: deconvolution algorithm (e.g., \"cibersort\"). Category: functional category (e.g., \"T_cells_CD8\").",
    "references": ["**TCGA Database**: The Cancer Genome Atlas Research Network (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10):1113-1120. doi:10.1038/ng.2764", "Newman AM, et al. (2015). Robust enumeration of cell subsets from tissue expression profiles. Nature Methods, 12(5):453-457. (CIBERSORT)", "Aran D, et al. (2017). xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biology, 18:220. (xCell)"],
    "formatted_arguments": "algorithm: Character or NULL. Filter by deconvolution algorithm (default: NULL). Options: \"cibersort\", \"xcell\", \"quantiseq\", \"mcpcounter\", \"timer\", \"epic\", \"ips\", \"estimate\". NULL shows all 99 cell types across all 8 algorithms. Each algorithm provides different cell type resolution and granularity.\ncategory: Character or NULL. Filter by functional cell category (default: NULL). Options: \"B_cells\", \"T_cells_CD4\", \"T_cells_CD8\", \"Tregs\", \"NK_cells\", \"Macrophages\", \"DC\", \"Neutrophils\", \"Monocytes\", \"Microenvironment\", \"ESTIMATE\". NULL shows all categories. Use to focus on specific immune cell lineages.",
    "simple_arguments": ""
  },
  "list_modalities": {
    "package": "SLTCGA",
    "function_name": "list_modalities",
    "title": "List All Available Data Modalities in TCGA Database",
    "description": "Displays comprehensive overview of all 8 data modalities available in SLTCGA: 5 omics layers (RNAseq, Mutation, CNV, Methylation, miRNA), clinical data (Clinical), molecular signatures (Signature), and immune infiltration scores (ImmuneCell). Shows data types, variable counts, and descriptions. Essential for understanding data structure before using tcga_correlation(), tcga_enrichment(), or tcga_survival(). Returns invisible data frame for programmatic access.",
    "user_queries": ["**Data Exploration**:", "What data types are available in SLTCGA?", "What omics layers can I analyze?", "How many genes are covered in TCGA?", "What is the difference between RNAseq and Mutation modalities?", "Which modalities are continuous vs categorical?", "Can I analyze immune cell infiltration?", "What molecular signatures are available?", "How many clinical variables are there?", "What is the difference between omics data and clinical data?", "Are Signature and ImmuneCell original omics or derived data?", "**Method Selection**:", "Which modality should I use for gene expression analysis?", "How do I analyze DNA methylation?", "What modality contains tumor mutation burden (TMB)?", "Which modality has immune cell data?", "Can I analyze miRNA expression?", "What is the ImmuneCell modality?"],
    "usage": "list_modalities()",
    "parameters": [],
    "examples": "## No test: \n# ===========================================================================\n# Example 1: View all available modalities\n# ===========================================================================\n# Research Question: What data types can I analyze in TCGA?\n\nlist_modalities()\n\n# Output shows 8 modalities in 3 categories:\n# - Multi-Omics: RNAseq, Mutation, CNV, Methylation, miRNA\n# - Clinical: Clinical variables\n# - Derived: Signature scores, ImmuneCell infiltration\n\n# ===========================================================================\n# Example 2: Programmatic access to modality information\n# ===========================================================================\n\nmodals <- list_modalities()\n\n# Check data types\ncontinuous_modals <- modals$Modal[modals$Data_Type == \"Continuous\"]\n# Returns: \"RNAseq\", \"CNV\", \"Methylation\", \"miRNA\", \"ImmuneCell\"\n\ncategorical_modals <- modals$Modal[modals$Data_Type == \"Categorical\"]\n# Returns: \"Mutation\"\n\n# ===========================================================================\n# Next Steps\n# ===========================================================================\n# After viewing modalities:\n# 1. Use list_variables(modal = \"Clinical\") to explore clinical variables\n# 2. Use list_immune_cells() to see immune cell types\n# 3. Use list_cancer_types() to see available cancer types\n# 4. Start analysis with tcga_correlation(), tcga_enrichment(), or tcga_survival()\n## End(No test)\n\n\n\n",
    "return_value": "Invisible data frame with 4 columns. Modal: modality name (e.g., \"RNAseq\", \"Mutation\"). Description: brief description of the data type. N_Variables: approximate number of variables. Data_Type: variable type (\"Continuous\", \"Categorical\", or \"Mixed\").",
    "references": "**TCGA Database**: The Cancer Genome Atlas Research Network (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10):1113-1120. doi:10.1038/ng.2764",
    "formatted_arguments": "",
    "simple_arguments": ""
  },
  "list_variables": {
    "package": "SLTCGA",
    "function_name": "list_variables",
    "title": "List Variables for Clinical, Signature, or ImmuneCell Modalities",
    "description": "Displays available variables for Clinical (66 variables), Signature (58 variables), or ImmuneCell (99 cell types) modalities with optional pattern filtering and grouping. For RNAseq/Mutation/CNV/Methylation/miRNA, use gene symbols directly (~20,000 genes each). Essential for discovering variable names before analysis. Returns invisible named vector for programmatic access.",
    "user_queries": ["**Clinical Variables**:", "What clinical variables are available?", "How do I access patient age, gender, or race?", "What treatment information is available?", "Can I analyze tumor stage or grade?", "Is MSI status available?", "What survival endpoints are there?", "How do I find histology information?", "**Signature Variables**:", "What molecular signatures can I analyze?", "How do I access tumor mutation burden (TMB)?", "Is there an EMT score?", "What immune signatures are available?", "Can I analyze hypoxia or angiogenesis?", "Is there a stemness score?", "What metabolic signatures exist?", "How do I find DNA repair signatures?", "**ImmuneCell Variables**:", "What immune cell types are available?", "How many deconvolution algorithms are included?", "Can I analyze CD8+ T cell infiltration?", "Is macrophage infiltration data available?", "What is the difference between CIBERSORT and xCell?", "How do I find regulatory T cells (Tregs)?", "**Pattern Filtering**:", "How do I search for TMB-related variables?", "Can I filter signatures by keyword?", "How do I find all T cell types?", "Can I search for stage-related clinical variables?"],
    "usage": "list_variables(modal, pattern = NULL, show_groups = TRUE)",
    "parameters": [
      {
        "name": "modal",
        "has_default": false,
        "description": "Character. Modality type to display (required). Options: \"Clinical\", \"Signature\", \"ImmuneCell\". Note: RNAseq, Mutation, CNV, Methylation, miRNA use standard gene symbols."
      },
      {
        "name": "pattern",
        "has_default": true,
        "default_value": "NULL",
        "description": "Character or NULL. Optional regex pattern to filter variables (default: NULL). Case-insensitive matching on both variable names and aliases. Examples: \"TMB\", \"T_cells\", \"Age\", \"Stage\"."
      },
      {
        "name": "show_groups",
        "has_default": true,
        "default_value": "TRUE",
        "description": "Logical. Display variable groups/categories (default: TRUE). For Clinical: basic, treatment, outcome, histology, molecular. For Signature: immune, metabolic, pathway, stemness, clinical_scores. For ImmuneCell: algorithm-based grouping."
      }
    ],
    "examples": "## No test: \n# ===========================================================================\n# Example 1: List all clinical variables\n# ===========================================================================\n# Research Question: What clinical data can I analyze?\n\nlist_variables(modal = \"Clinical\")\n\n# Output shows 66 clinical variables grouped by category\n# Use these in tcga_correlation() or tcga_survival()\n\n# ===========================================================================\n# Example 2: List molecular signatures\n# ===========================================================================\n# Research Question: What molecular signatures are available?\n\nlist_variables(modal = \"Signature\")\n\n# Output shows 58 signatures: immune, metabolic, pathway, stemness, clinical scores\n\n# ===========================================================================\n# Example 3: Search for TMB-related signatures\n# ===========================================================================\n# Research Question: How do I find tumor mutation burden variables?\n\nlist_variables(modal = \"Signature\", pattern = \"TMB\")\n\n# Returns: TMB, TMB_NonSynonymous, TMB_Nonsilent\n\n# ===========================================================================\n# Example 4: Programmatic access to variables\n# ===========================================================================\n\n# Get all clinical variables\nclin_vars <- list_variables(modal = \"Clinical\")\n\n# Get signature names\nsig_names <- names(list_variables(modal = \"Signature\"))\n\n# Filter immune signatures\nimmune_sigs <- list_variables(modal = \"Signature\", pattern = \"immune|CYT|IFNG\")\n\n# ===========================================================================\n# Example 5: Explore immune cells (use dedicated function)\n# ===========================================================================\n\n# For ImmuneCell, use list_immune_cells() for better display\nlist_immune_cells(algorithm = \"cibersort\")\nlist_immune_cells(category = \"T_cells_CD8\")\n\n# ===========================================================================\n# Next Steps\n# ===========================================================================\n# After finding variable names:\n# 1. Use tcga_correlation(var1 = \"Age\", var1_modal = \"Clinical\", ...)\n# 2. Use tcga_correlation(var1 = \"TMB\", var1_modal = \"Signature\", ...)\n# 3. Use tcga_survival(var1 = \"Stage\", var1_modal = \"Clinical\", ...)\n## End(No test)\n\n\n\n",
    "return_value": "Named character vector (invisible): Names = user-friendly aliases, Values = full variable names. Access programmatically: vars <- list_variables(modal = \"Clinical\").",
    "references": "**TCGA Database**: The Cancer Genome Atlas Research Network (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10):1113-1120. doi:10.1038/ng.2764",
    "formatted_arguments": "modal: Character. Modality type to display (required). Options: \"Clinical\", \"Signature\", \"ImmuneCell\". Note: RNAseq, Mutation, CNV, Methylation, miRNA use standard gene symbols.\npattern: Character or NULL. Optional regex pattern to filter variables (default: NULL). Case-insensitive matching on both variable names and aliases. Examples: \"TMB\", \"T_cells\", \"Age\", \"Stage\".\nshow_groups: Logical. Display variable groups/categories (default: TRUE). For Clinical: basic, treatment, outcome, histology, molecular. For Signature: immune, metabolic, pathway, stemness, clinical_scores. For ImmuneCell: algorithm-based grouping.",
    "simple_arguments": "modal: Character. Modality type to display (required). Options: \"Clinical\", \"Signature\", \"ImmuneCell\". Note: RNAseq, Mutation, CNV, Methylation, miRNA use standard gene symbols."
  },
  "search_immune_cells": {
    "package": "SLTCGA",
    "function_name": "search_immune_cells",
    "title": "Search Immune Cell Types by Keyword",
    "description": "Performs case-insensitive keyword search across all 99 immune cell types from 8 deconvolution algorithms. Searches cell names with partial matching support. Returns matched cells with algorithm and category information. Convenient wrapper around list_immune_cells() for quick cell type discovery.",
    "user_queries": ["**Quick Search**:", "How do I quickly find CD8+ T cells?", "What macrophage types match \"M1\"?", "Can I search for B cells?", "How do I find regulatory T cells?", "What cells match \"NK\"?", "**Use Cases**:", "I know the cell type but not the exact variable name", "I want to see all variants of a cell type across algorithms", "I need to quickly check if a cell type exists", "I want to compare similar cells from different algorithms"],
    "usage": "search_immune_cells(keyword)",
    "parameters": [
      {
        "name": "keyword",
        "has_default": false,
        "description": "Character. Search keyword or pattern (required). Case-insensitive partial matching on cell names. Examples: \"CD8\", \"Macrophage\", \"B_cells\", \"Treg\", \"NK\"."
      }
    ],
    "examples": "## No test: \n# ===========================================================================\n# Example 1: Search for CD8+ T cells\n# ===========================================================================\n\nsearch_immune_cells(\"CD8\")\n\n# Returns CD8 variants:\n# - CD8_T_cells_cibersort\n# - T_cells_CD8_xcell\n# - T_cells_CD8_quantiseq\n# - CD8_T_cells_mcpcounter\n\n# ===========================================================================\n# Example 2: Search for macrophages\n# ===========================================================================\n\nsearch_immune_cells(\"Macrophage\")\n\n# Returns all macrophage types:\n# - Macrophages_M0_cibersort\n# - Macrophages_M1_cibersort\n# - Macrophages_M2_cibersort\n# - Macrophages_xcell\n# - Macrophages_M1_quantiseq\n# - Macrophages_M2_quantiseq\n\n# ===========================================================================\n# Example 3: Search for B cells\n# ===========================================================================\n\nsearch_immune_cells(\"B_cells\")\n\n# Returns B cell variants:\n# - B_cells_naive_cibersort\n# - B_cells_memory_cibersort\n# - Plasma_cells_cibersort\n# - B_cells_xcell\n\n# ===========================================================================\n# Example 4: Quick check if cell type exists\n# ===========================================================================\n\n# Check for regulatory T cells\ntreg_results <- search_immune_cells(\"Treg\")\n\nif (!is.null(treg_results)) {\n  cat(\"Found\", nrow(treg_results), \"Treg variants\\n\")\n}\n\n# ===========================================================================\n# Next Steps\n# ===========================================================================\n# After finding cell names:\n# 1. Use in tcga_correlation(var2 = \"CD8_T_cells_cibersort\", var2_modal = \"ImmuneCell\")\n# 2. Use in tcga_survival(var1 = \"Macrophages_M1_cibersort\", var1_modal = \"ImmuneCell\")\n## End(No test)\n\n\n\n",
    "return_value": "Invisible data frame with 3 columns. Cell_Name: full cell type name. Algorithm: deconvolution algorithm. Category: functional category. Returns NULL if no matches are found.",
    "references": "**TCGA Database**: The Cancer Genome Atlas Research Network (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10):1113-1120. doi:10.1038/ng.2764",
    "formatted_arguments": "keyword: Character. Search keyword or pattern (required). Case-insensitive partial matching on cell names. Examples: \"CD8\", \"Macrophage\", \"B_cells\", \"Treg\", \"NK\".",
    "simple_arguments": "keyword: Character. Search keyword or pattern (required). Case-insensitive partial matching on cell names. Examples: \"CD8\", \"Macrophage\", \"B_cells\", \"Treg\", \"NK\"."
  },
  "search_variables": {
    "package": "SLTCGA",
    "function_name": "search_variables",
    "title": "Search Variables Across Clinical, Signature, and ImmuneCell Modalities",
    "description": "Performs a case-insensitive keyword search across the Clinical (66 variables), Signature (58 variables), and ImmuneCell (99 cell types) modalities, optionally restricted to a single modality. Searches both variable names and aliases. Returns grouped results showing matches per modality. Essential for discovering relevant variables when the exact variable name is uncertain.",
    "user_queries": ["**General Search**:", "How do I find variables related to tumor mutation burden?", "Can I search for immune-related variables?", "What variables contain \"Stage\" in their name?", "How do I find all T cell types?", "Can I search for EMT-related signatures?", "Is there a variable for hypoxia?", "**Immune Cell Search**:", "How do I find CD8+ T cells?", "What macrophage variables are available?", "Can I search for B cell infiltration data?", "How do I find regulatory T cells (Tregs)?", "What NK cell types exist?", "**Clinical Search**:", "How do I find patient age variable?", "What is the variable name for tumor stage?", "Can I search for MSI status?", "How do I find treatment-related variables?", "What survival endpoints are available?", "**Signature Search**:", "How do I find immune signature scores?", "What stemness signatures exist?", "Can I search for metabolic pathway scores?", "Is there a DNA repair signature?", "How do I find angiogenesis scores?", "**Workflow Questions**:", "I want to analyze immune infiltration but don't know exact variable names", "How do I discover what variables are available for my research topic?", "Can I search across multiple modalities at once?", "How do I find alternative names for the same variable?"],
    "usage": "search_variables(keyword, modal = NULL)",
    "parameters": [
      {
        "name": "keyword",
        "has_default": false,
        "description": "Character. Search keyword or pattern (required). Case-insensitive regex matching on variable names and aliases. Examples: \"TMB\", \"T_cells\", \"Stage\", \"immune\", \"stemness\"."
      },
      {
        "name": "modal",
        "has_default": true,
        "default_value": "NULL",
        "description": "Character or NULL. Optional modality restriction (default: NULL). Options: \"Clinical\", \"Signature\", \"ImmuneCell\", or NULL for all. NULL searches across all three modalities."
      }
    ],
    "examples": "## No test: \n# ===========================================================================\n# Example 1: Search for tumor mutation burden (TMB)\n# ===========================================================================\n# Research Question: What TMB-related variables exist?\n\nsearch_variables(\"TMB\")\n\n# Returns matches in Signature:\n# - TMB\n# - TMB_NonSynonymous\n# - TMB_Nonsilent\n\n# ===========================================================================\n# Example 2: Search for T cell types across modalities\n# ===========================================================================\n# Research Question: What T cell infiltration data is available?\n\nsearch_variables(\"T_cells\")\n\n# Returns matches in ImmuneCell:\n# - CD8_T_cells_cibersort\n# - CD4_T_cells_cibersort\n# - Tregs_cibersort\n# - T_cells_CD8_xcell\n# ... (shows first 5 per modality)\n\n# ===========================================================================\n# Example 3: Search immune cells only\n# ===========================================================================\n\nsearch_variables(\"Macrophage\", modal = \"ImmuneCell\")\n\n# Returns only ImmuneCell matches:\n# - Macrophages_M0_cibersort\n# - Macrophages_M1_cibersort\n# - Macrophages_M2_cibersort\n# - Macrophages_xcell\n\n# ===========================================================================\n# Example 4: Search for stage-related clinical variables\n# ===========================================================================\n\nsearch_variables(\"Stage\", modal = \"Clinical\")\n\n# Returns Clinical matches:\n# - Stage\n# - Stage_T\n# - Stage_N\n# - Stage_M\n# - Pathologic_stage\n\n# ===========================================================================\n# Example 5: Search immune signatures\n# ===========================================================================\n\nsearch_variables(\"immune|CYT|IFNG\", modal = \"Signature\")\n\n# Returns Signature matches:\n# - CYT (Cytolytic 
activity)\n# - IFNG (Interferon gamma)\n# - TIS (T cell inflamed signature)\n\n# ===========================================================================\n# Example 6: Programmatic use\n# ===========================================================================\n\n# Get search results\nresults <- search_variables(\"EMT\")\n\n# Extract Signature matches\nemt_vars <- results$Signature\n\n# Use in analysis\n# tcga_correlation(\n#   var1 = names(emt_vars)[1], var1_modal = \"Signature\", ...\n# )\n\n# ===========================================================================\n# Next Steps\n# ===========================================================================\n# After finding variable names:\n# 1. Use list_variables() to see full variable lists\n# 2. Use found variables in tcga_correlation(), tcga_enrichment(), tcga_survival()\n# 3. For genes, use gene symbols directly with RNAseq/Mutation/CNV modality\n## End(No test)\n\n\n\n",
    "return_value": "Named list of matched variables (invisible): Clinical Named vector of Clinical matches (if any) Signature Named vector of Signature matches (if any) ImmuneCell Named vector of ImmuneCell matches (if any) Returns NULL if no matches found.",
    "references": "**TCGA Database**: The Cancer Genome Atlas Research Network (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10):1113-1120. doi:10.1038/ng.2764",
    "formatted_arguments": "keyword: Character. Search keyword or pattern (required). Case-insensitive regex matching on variable names and aliases. Examples: \"TMB\", \"T_cells\", \"Stage\", \"immune\", \"stemness\".\nmodal: Character or NULL. Optional modality restriction (default: NULL). Options: \"Clinical\", \"Signature\", \"ImmuneCell\", or NULL for all. NULL searches across all three modalities.",
    "simple_arguments": "keyword: Character. Search keyword or pattern (required). Case-insensitive regex matching on variable names and aliases. Examples: \"TMB\", \"T_cells\", \"Stage\", \"immune\", \"stemness\"."
  },
  "tcga_correlation": {
    "package": "SLTCGA",
    "function_name": "tcga_correlation",
    "title": "Genomic Correlation and Association Analysis Across Multi-Omics and Clinical Data",
    "description": "**Discovers relationships between variables** through correlation and association analysis, supporting both **intra-omics** (Gene A vs Gene B within same layer, e.g., TP53-MDM2 mRNA) and **cross-omics** (different layers, e.g., TP53 CNV vs TP53 mRNA, BRCA1 methylation vs mRNA) analyses across 8 TCGA data modalities (RNAseq, Mutation, CNV, Methylation, miRNA, Clinical, Signature, ImmuneCell), covering all **64 possible combinations (8x8 matrix)**. Automatically detects 7 scenarios based on variable types and counts, selects appropriate statistical tests (Pearson/Spearman for continuous, Wilcoxon/Kruskal-Wallis for groups, Fisher/Chi-square for categorical), and generates publication-ready visualizations. Suitable for single/multiple genes in single/multiple cancers (33 main types + 32 molecular subtypes). Returns unified structure: list(stats, plot, raw_data) .",
    "user_queries": ["**Intra-Omics Analysis** (same modality, different genes/features):", "Are TP53 and MDM2 mRNA levels correlated in breast cancer?", "Do PIK3CA, AKT1, and MTOR show coordinated expression?", "Which genes in the cell cycle pathway are co-expressed?", "Are apoptosis genes (BCL2, BAX, BAK1) coordinately regulated?", "Do DNA damage response genes correlate with each other?", "Are immune checkpoint genes (PDL1, PD1, CTLA4) co-expressed?", "Do metabolism genes show correlated expression patterns?", "**Cross-Omics Analysis** (different modalities):", "Does TP53 copy number (CNV) correlate with its mRNA expression?", "Does BRCA1 methylation silence its mRNA expression?", "Is EGFR mutation associated with EGFR mRNA changes?", "Do copy number alterations drive gene expression changes?", "Does promoter methylation suppress gene transcription?", "How does DNA-level variation (CNV/methylation) affect RNA levels?", "**Mutation-Expression Associations**:", "Do TP53 mutant tumors have lower TP53 mRNA?", "Are mutated genes expressed at different levels?", "Does PIK3CA mutation alter downstream pathway gene expression?", "How do driver mutations affect transcriptional programs?", "Are mutations associated with compensatory gene expression changes?", "**Clinical Associations**:", "Does TP53 expression vary by tumor stage or grade?", "Are gene expression levels associated with patient age?", "Do molecular features correlate with histological subtypes?", "Is tumor mutation burden (TMB) associated with clinical outcomes?", "How does gene expression relate to treatment response?", "Are mutations enriched in specific demographic groups?", "**Immune Microenvironment**:", "Does PDL1 (CD274) expression correlate with CD8+ T cell infiltration?", "Are immune checkpoint genes associated with immune cell abundance?", "How does TP53 mutation affect tumor immune infiltration?", "Do highly mutated tumors have more immune cell infiltration?", "Is TMB correlated with cytolytic 
activity (CYT score)?", "Are macrophages (M1 vs M2) associated with gene expression programs?", "Does IFNG signature correlate with T cell markers?", "**Molecular Signatures**:", "Does hypoxia score correlate with angiogenesis genes (VEGFA)?", "Is EMT score associated with metastasis-related genes?", "Does stemness score correlate with differentiation markers?", "Are proliferation signatures linked to cell cycle gene expression?", "Is DNA repair signature associated with BRCA1/BRCA2 expression?", "Does glycolysis score correlate with metabolic gene expression?", "**Mutation Co-occurrence/Exclusivity**:", "Are TP53 and PIK3CA mutations mutually exclusive?", "Do KRAS and EGFR mutations co-occur or exclude each other?", "Are BRCA1 and BRCA2 mutations mutually exclusive?", "Which mutations tend to co-occur in the same tumors?", "Are certain mutation combinations associated with clinical features?", "**Pan-Cancer Analysis**:", "Is TP53-MDM2 correlation conserved across cancer types?", "Do similar molecular alterations occur in different cancers?", "Are immune infiltration patterns similar across cancers?", "Which gene correlations are cancer-specific vs pan-cancer?", "Do mutation frequencies vary by cancer type?", "**Molecular Subtypes**:", "Is ESR1 expression different between luminal and basal breast cancer?", "Do molecular subtypes have distinct gene expression patterns?", "Are immune profiles different across cancer subtypes?", "Do mutations segregate by molecular subtype?", "**miRNA Regulation**:", "Does hsa-mir-21 negatively regulate its target gene PDCD4?", "Are miRNAs anti-correlated with their predicted targets?", "Which miRNAs are associated with oncogene/tumor suppressor expression?", "**Complex Multi-Feature Questions**:", "How do multiple DNA damage genes correlate with each other?", "Are cell cycle and apoptosis pathways coordinately regulated?", "Which immune cells correlate with checkpoint gene expression?", "Do CNV alterations in multiple genes affect 
their expression coordinately?", "**Colloquial and Alternative Phrasings**:", "Do TP53 and MDM2 go together?", "Does high TP53 mean high MDM2?", "TP53 MDM2 relationship", "Are these two genes related?", "Gene A gene B connection", "What's the link between TP53 and survival?", "Does mutation affect expression?", "Mutation expression relationship", "Copy number drives expression?", "CNV mRNA correlation", "**Abbreviations and Full Forms**:", "TMB (Tumor Mutation Burden / mutation load / mutation count)", "MSI (Microsatellite Instability / microsatellite status)", "CNV (Copy Number Variation / copy number alteration / amplification deletion)", "RNAseq (RNA sequencing / gene expression / mRNA levels / transcript levels)", "PDL1 (PD-L1 / CD274 / programmed death-ligand 1)", "EMT (Epithelial-Mesenchymal Transition / epithelial mesenchymal transition score)", "CYT (Cytolytic activity / cytolytic score / immune cytolysis)", "**Function Selection Questions**:", "Should I use tcga_correlation or tcga_survival for gene-outcome analysis?", "Which function finds gene relationships? (Answer: tcga_correlation)", "Which function tests if genes are associated? (Answer: tcga_correlation)", "How to find correlated genes? (Answer: tcga_correlation)", "Which function for gene A vs gene B? (Answer: tcga_correlation)"],
    "usage": "tcga_correlation( var1, var1_modal, var1_cancers, var2, var2_modal, var2_cancers, method = \"pearson\", use = \"pairwise.complete.obs\", p_adjust_method = \"BH\", alpha = 0.05, rnaseq_type = \"log2TPM\", cnv_type = \"SNP6_Array\", methylation_region = \"Promoter_mean\", immune_algorithm = NULL, plot_type = \"auto\" )",
    "parameters": [
      {
        "name": "var1",
        "has_default": false,
        "description": "Character vector. Variable names for first group (required). Examples: Single gene (\"TP53\"), multiple genes (c(\"TP53\", \"EGFR\")), clinical (\"Age\", \"Stage\"), signatures (\"TMB\", \"EMT_Score\"), immune cells (\"CD8_T_cells_cibersort\"), miRNA (\"hsa-mir-21\"). Number and type of variables affect scenario selection (see Details)."
      },
      {
        "name": "var1_modal",
        "has_default": false,
        "description": "Character. Data modality for var1 (required). Options: \"RNAseq\", \"Mutation\", \"CNV\", \"Methylation\", \"miRNA\", \"Clinical\", \"Signature\", \"ImmuneCell\". Determines data type: continuous (RNAseq, CNV, Methylation, miRNA, ImmuneCell, some Signature), categorical (Mutation, some Clinical/Signature), mixed (Clinical, Signature)."
      },
      {
        "name": "var1_cancers",
        "has_default": false,
        "description": "Character vector. Cancer types for var1 (required, case-insensitive). Options: 33 main types (\"BRCA\", \"LUAD\", \"COAD\"), 32 molecular subtypes (\"BRCA-Basal\", \"BRCA-LumA\"), combined groups (\"COADREAD\", \"GBMLGG\"). Examples: single cancer (\"BRCA\"), multiple cancers (c(\"BRCA\", \"LUAD\", \"COAD\")), molecular subtypes (c(\"BRCA-LumA\", \"BRCA-Basal\")). Use list_cancer_types () to view all options."
      },
      {
        "name": "var2",
        "has_default": false,
        "description": "Character vector. Variable names for second group (required). Same format as var1. Can be same genes (e.g., TP53 CNV vs TP53 mRNA) or different genes."
      },
      {
        "name": "var2_modal",
        "has_default": false,
        "description": "Character. Data modality for var2 (required). Same options as var1_modal. Can be same modality (intra-omics) or different (cross-omics)."
      },
      {
        "name": "var2_cancers",
        "has_default": false,
        "description": "Character vector. Cancer types for var2 (required). Can be identical to var1_cancers (recommended for matched analysis) or different. If identical and multiple cancers,"
      },
      {
        "name": "method",
        "has_default": true,
        "default_value": "\"pearson\"",
        "description": "Character. Correlation method for continuous variables (default: \"pearson\"). Options: \"pearson\" (parametric, assumes normality), \"spearman\" (non-parametric, rank-based), \"kendall\" (non-parametric, tau coefficient). Only affects continuous-continuous comparisons. use Character. Missing value handling for correlations (default: \"pairwise.complete.obs\"). Options: \"everything\", \"all.obs\", \"complete.obs\", \"na.or.complete\", \"pairwise.complete.obs\". Recommended: \"pairwise.complete.obs\" maximizes sample size."
      },
      {
        "name": "use",
        "has_default": true,
        "default_value": "\"pairwise.complete.obs\"",
        "description": "s smart pairing (BRCA-BRCA, LUAD-LUAD, not BRCA-LUAD)."
      },
      {
        "name": "p_adjust_method",
        "has_default": true,
        "default_value": "\"BH\"",
        "description": "Character. Multiple testing correction method (default: \"BH\"). Options: \"BH\" (Benjamini-Hochberg FDR), \"bonferroni\", \"holm\", \"hochberg\", \"hommel\", \"BY\", \"fdr\", \"none\". Applied when multiple comparisons (e.g., 10 gene pairs -> 10 tests)."
      },
      {
        "name": "alpha",
        "has_default": true,
        "default_value": "0.05",
        "description": "Numeric. Significance threshold for p-value (default: 0.05). Used for visual annotation (e.g., stars in plots). Does not filter results."
      },
      {
        "name": "rnaseq_type",
        "has_default": true,
        "default_value": "\"log2TPM\"",
        "description": "Character. RNAseq normalization method (default: \"log2TPM\"). Options: \"log2TPM\" (recommended, normalized), \"log2RSEM\" (RSEM), \"log2FPKM\" (FPKM), \"log2Counts\" (raw counts). Only used when var1_modal or var2_modal = \"RNAseq\"."
      },
      {
        "name": "cnv_type",
        "has_default": true,
        "default_value": "\"SNP6_Array\"",
        "description": "Character. CNV calling algorithm (default: \"SNP6_Array\"). Options vary by cancer, typically \"SNP6_Array\", \"WES\", \"WGS\". Check data availability. Only used when var1_modal or var2_modal = \"CNV\"."
      },
      {
        "name": "methylation_region",
        "has_default": true,
        "default_value": "\"Promoter_mean\"",
        "description": "Character. Methylation region to analyze (default: \"Promoter_mean\"). Options: \"Promoter_mean\" (TSS1500+TSS200+5UTR+1stExon), \"TSS1500\", \"TSS200\", \"5UTR\", \"1stExon\", \"Body\", \"3UTR\", \"Gene_mean\". Recommended: \"Promoter_mean\" for expression correlation. Only used when var1_modal or var2_modal = \"Methylation\"."
      },
      {
        "name": "immune_algorithm",
        "has_default": true,
        "default_value": "NULL",
        "description": "Character or NULL. Immune deconvolution algorithm filter (default: NULL for all). Options: \"cibersort\", \"xcell\", \"quantiseq\", \"mcpcounter\", \"timer\", \"epic\", \"ips\", \"estimate\", or NULL. NULL includes all 99 cell types from 8 algorithms. Specify to focus on specific algorithm. Only used when var1_modal or var2_modal = \"ImmuneCell\"."
      },
      {
        "name": "plot_type",
        "has_default": true,
        "default_value": "\"auto\"",
        "description": "Character. Plot type preference for Scenario 6 (default: \"auto\"). Options: \"auto\" (heatmap if >=8 continuous features, else boxplot), \"boxplot\" (force boxplots), \"heatmap\" (force heatmap with difference bars). Only affects Scenario 6 visualization."
      }
    ],
    "examples": "## No test: \n# ===========================================================================\n# Example 1: Gene-gene correlation (Scenario 1, Intra-omics) - TESTED 1.94 sec\n# ===========================================================================\n# Research Question: Are TP53 and MDM2 mRNA levels correlated?\n# Expected: Positive correlation (MDM2 is a TP53 transcriptional target)\n\nresult <- tcga_correlation(\n  var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\",\n  var2 = \"MDM2\", var2_modal = \"RNAseq\", var2_cancers = \"BRCA\",\n  method = \"pearson\"\n)\n\n# Return structure (unified across all scenarios)\nresult$stats # Data frame: var1_feature, var2_feature, r, p, p_adj, n, method\nresult$plot # ggplot2 scatter plot with regression line\nresult$raw_data # Data frame: 1,095 BRCA patients x features\n\n# Interpret\ncat(\"Correlation: r =\", result$stats$r[1], \"\\n\")\ncat(\"P-value:\", result$stats$p[1], \"\\n\")\ncat(\"Interpretation: Moderate positive correlation (MDM2 is TP53 target)\\n\")\n\n# ===========================================================================\n# Example 2: CNV drives expression (Scenario 1, Cross-omics) - TESTED 2.15 sec\n# ===========================================================================\n# Research Question: Does TP53 copy number correlate with its mRNA expression?\n# Expected: Positive correlation (copy number gain -> higher expression)\n\nresult <- tcga_correlation(\n  var1 = \"TP53\", var1_modal = \"CNV\", var1_cancers = \"BRCA\",\n  var2 = \"TP53\", var2_modal = \"RNAseq\", var2_cancers = \"BRCA\"\n)\n\nresult$stats # r = 0.35, p < 0.001\n# Interpretation: Copy number changes moderately drive expression\n\n# ===========================================================================\n# Example 3: Methylation silences expression (Cross-omics) - TESTED 2.48 sec\n# ===========================================================================\n# Research Question: Does BRCA1 
methylation silence its expression?\n# Expected: Negative correlation (hypermethylation -> lower expression)\n\nresult <- tcga_correlation(\n  var1 = \"BRCA1\", var1_modal = \"Methylation\", var1_cancers = \"BRCA\",\n  var2 = \"BRCA1\", var2_modal = \"RNAseq\", var2_cancers = \"BRCA\"\n)\n\nresult$stats # r = -0.38, p < 0.001\n# Interpretation: Promoter methylation negatively regulates BRCA1 expression\n\n# ===========================================================================\n# Example 4: Pan-cancer analysis (Scenario 1, Multiple cancers) - TESTED 0.38 sec\n# ===========================================================================\n# Research Question: Is TP53-MDM2 correlation conserved across cancer types?\n\nresult <- tcga_correlation(\n  var1 = \"TP53\", var1_modal = \"RNAseq\",\n  var1_cancers = c(\"BRCA\", \"LUAD\", \"COAD\"),\n  var2 = \"MDM2\", var2_modal = \"RNAseq\",\n  var2_cancers = c(\"BRCA\", \"LUAD\", \"COAD\")\n)\n\nresult$stats # 3 rows, one per cancer\nresult$plot # Lollipop plot comparing correlations across cancers\n\n# Check consistency\nall(result$stats$r > 0.3) # TRUE: consistent positive correlation\n\n# ===========================================================================\n# Example 5: 1 vs multiple genes (Scenario 2) - TESTED 1.12 sec\n# ===========================================================================\n# Research Question: How does TP53 expression correlate with DNA damage response genes?\n\nresult <- tcga_correlation(\n  var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\",\n  var2 = c(\"MDM2\", \"CDKN1A\", \"BAX\", \"GADD45A\"),\n  var2_modal = \"RNAseq\", var2_cancers = \"BRCA\"\n)\n\nresult$stats # 4 rows (one per gene pair), with adjusted p-values\nresult$plot # Lollipop plot showing all correlations\n\n# Find strongest correlation\nresult$stats[which.max(abs(result$stats$r)), ]\n\n# ===========================================================================\n# Example 6: Gene matrix (Scenario 3) 
- TESTED 1.85 sec\n# ===========================================================================\n# Research Question: How do cell cycle genes correlate with apoptosis genes?\n\nresult <- tcga_correlation(\n  var1 = c(\"CCND1\", \"CDK4\", \"CDK6\"), var1_modal = \"RNAseq\", var1_cancers = \"BRCA\",\n  var2 = c(\"BCL2\", \"BAX\", \"BAK1\"), var2_modal = \"RNAseq\", var2_cancers = \"BRCA\"\n)\n\nresult$stats # 9 rows (3x3 matrix), all pairwise correlations\nresult$plot # Dot plot matrix\n\n# Filter significant correlations\nsig_cors <- result$stats[result$stats$p_adj < 0.05, ]\n\n# ===========================================================================\n# Example 7: Expression vs mutation (Scenario 4) - TESTED 1.68 sec\n# ===========================================================================\n# Research Question: Do TP53 mutant tumors have lower TP53 mRNA?\n# Expected: Yes (truncating mutations -> nonsense-mediated decay)\n\nresult <- tcga_correlation(\n  var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\",\n  var2 = \"TP53\", var2_modal = \"Mutation\", var2_cancers = \"BRCA\"\n)\n\nresult$stats # Wilcoxon test: p < 0.001, effect_size (Cohen's d)\nresult$plot # Box plot: WildType vs Mutant\n\n# Interpretation: Mutant tumors have significantly lower TP53 expression\n\n# ===========================================================================\n# Example 8: Clinical association (Scenario 4) - TESTED 1.45 sec\n# ===========================================================================\n# Research Question: Does TP53 expression vary by tumor stage?\n\nresult <- tcga_correlation(\n  var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\",\n  var2 = \"Stage\", var2_modal = \"Clinical\", var2_cancers = \"BRCA\"\n)\n\nresult$stats # Kruskal-Wallis test (>2 stage groups)\nresult$plot # Box plot by stage\n\n# ===========================================================================\n# Example 9: Immune infiltration (Scenario 1) - TESTED 
1.92 sec\n# ===========================================================================\n# Research Question: Does PDL1 expression correlate with CD8+ T cell infiltration?\n# Expected: Positive (PDL1 upregulated in immune-inflamed tumors)\n\nresult <- tcga_correlation(\n  var1 = \"CD274\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", # CD274 = PDL1\n  var2 = \"CD8_T_cells_cibersort\", var2_modal = \"ImmuneCell\", var2_cancers = \"BRCA\"\n)\n\nresult$stats # r = 0.52, p < 0.001 (strong positive)\n# Interpretation: PDL1 high in CD8+ T cell-infiltrated tumors (adaptive immune resistance)\n\n# ===========================================================================\n# Example 10: Mutation vs immune (Scenario 6 with heatmap) - TESTED 4.23 sec\n# ===========================================================================\n# Research Question: How does TP53 mutation affect immune cell infiltration?\n\nresult <- tcga_correlation(\n  var1 = c(\n    \"CD8_T_cells_cibersort\", \"CD4_T_cells_memory_resting_cibersort\",\n    \"Macrophages_M1_cibersort\", \"Macrophages_M2_cibersort\",\n    \"B_cells_memory_cibersort\", \"NK_cells_activated_cibersort\",\n    \"Dendritic_cells_activated_cibersort\", \"Neutrophils_cibersort\"\n  ),\n  var1_modal = \"ImmuneCell\", var1_cancers = \"BRCA\",\n  var2 = \"TP53\", var2_modal = \"Mutation\", var2_cancers = \"BRCA\",\n  plot_type = \"heatmap\" # Force heatmap (auto-selected if >=8 features)\n)\n\nresult$stats # 8 Wilcoxon tests with adjusted p-values\nresult$plot # Heatmap with mean difference bars\n\n# Find cell types significantly different\nsig_cells <- result$stats[result$stats$p_adj < 0.05, ]\n\n# ===========================================================================\n# Example 11: Mutation co-occurrence (Scenario 7) - TESTED 1.15 sec\n# ===========================================================================\n# Research Question: Are TP53 and PIK3CA mutations mutually exclusive?\n\nresult <- tcga_correlation(\n  var1 = 
\"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\",\n  var2 = \"PIK3CA\", var2_modal = \"Mutation\", var2_cancers = \"BRCA\"\n)\n\nresult$stats # Fisher's exact test, odds_ratio, contingency_table\nresult$plot # Bar plot showing co-occurrence\n\n# Interpret odds ratio\n# OR > 1: co-occurrence, OR < 1: mutual exclusivity, OR = 1: independent\n\n# ===========================================================================\n# Example 12: TMB vs expression (Signature-RNAseq) - TESTED 1.58 sec\n# ===========================================================================\n# Research Question: Does tumor mutation burden correlate with immune checkpoint expression?\n\nresult <- tcga_correlation(\n  var1 = \"TMB\", var1_modal = \"Signature\", var1_cancers = \"BRCA\",\n  var2 = c(\"CD274\", \"PDCD1\", \"CTLA4\"), var2_modal = \"RNAseq\", var2_cancers = \"BRCA\"\n)\n\nresult$stats # 3 correlations (TMB vs each gene)\n# Expected: Positive correlations (high TMB -> immune activation -> checkpoint upregulation)\n\n# ===========================================================================\n# Example 13: Molecular subtypes (Special case) - TESTED 0.82 sec\n# ===========================================================================\n# Research Question: Is ESR1 expression different between luminal and basal breast cancer?\n# Expected: Yes (luminal = ER+, basal = ER-)\n\nresult <- tcga_correlation(\n  var1 = \"ESR1\", var1_modal = \"RNAseq\",\n  var1_cancers = c(\"BRCA-LumA\", \"BRCA-Basal\"),\n  var2 = \"GATA3\", var2_modal = \"RNAseq\",\n  var2_cancers = c(\"BRCA-LumA\", \"BRCA-Basal\")\n)\n\nresult$plot # Lollipop plot comparing subtypes\n# Expected: Strong positive correlation in LumA, weak in Basal\n\n# ===========================================================================\n# Example 14: miRNA-mRNA (miRNA-RNAseq) - TESTED 1.73 sec\n# ===========================================================================\n# Research Question: Does hsa-mir-21 negatively 
regulate its target gene PDCD4?\n# Expected: Negative correlation (miRNA silences target)\n\nresult <- tcga_correlation(\n  var1 = \"hsa-mir-21\", var1_modal = \"miRNA\", var1_cancers = \"BRCA\",\n  var2 = \"PDCD4\", var2_modal = \"RNAseq\", var2_cancers = \"BRCA\"\n)\n\nresult$stats # r = -0.28, p < 0.001 (moderate negative)\n# Interpretation: miR-21 upregulation associated with PDCD4 downregulation\n\n# ===========================================================================\n# Example 15: Custom analysis with raw_data\n# ===========================================================================\n# Use raw_data for custom modeling\n\nresult <- tcga_correlation(\n  var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\",\n  var2 = \"MDM2\", var2_modal = \"RNAseq\", var2_cancers = \"BRCA\"\n)\n\n# Access merged data\ndata <- result$raw_data\nhead(data) # Feature columns (referenced below as BRCA_TP53_RNAseq, BRCA_MDM2_RNAseq) plus cancer_type; confirm with colnames(data)\n\n# Custom analysis: Linear model\nmodel <- lm(BRCA_MDM2_RNAseq ~ BRCA_TP53_RNAseq, data = data)\nsummary(model)\n\n# Filter by expression level\nhigh_tp53 <- data[data$BRCA_TP53_RNAseq > median(data$BRCA_TP53_RNAseq, na.rm = TRUE), ]\n\n# ===========================================================================\n# Example 16: Common Mistakes and How to Fix Them\n# ===========================================================================\n\n# MISTAKE 1: Using different cancers for var1 and var2 (cross-cancer correlation)\n# ❌ WRONG: May not be biologically meaningful\nresult_wrong <- tcga_correlation(\n  var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\",\n  var2 = \"MDM2\", var2_modal = \"RNAseq\", var2_cancers = \"LUAD\" # Different cancer!\n)\n# This compares TP53 in breast cancer patients vs MDM2 in lung cancer patients\n\n# ✅ CORRECT: Use same cancer for meaningful within-cancer correlation\nresult_correct <- tcga_correlation(\n  var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\",\n  var2 = \"MDM2\", var2_modal = \"RNAseq\", 
var2_cancers = \"BRCA\" # Same cancer\n)\n\n# MISTAKE 2: Forgetting to specify modality correctly\n# ❌ WRONG: Using wrong modality for mutation status\n# result_wrong <- tcga_correlation(\n#   var1 = \"TP53\", var1_modal = \"RNAseq\",  # Wrong! TP53 is categorical (Mutation)\n#   var2 = \"MDM2\", var2_modal = \"RNAseq\"\n# )\n\n# ✅ CORRECT: Use Mutation modal for mutation status\nresult_correct <- tcga_correlation(\n  var1 = \"TP53\", var1_modal = \"Mutation\", # Correct modality\n  var2 = \"MDM2\", var2_modal = \"RNAseq\", var2_cancers = \"BRCA\"\n)\n\n# MISTAKE 3: Expecting high correlation for unrelated genes\nresult <- tcga_correlation(\n  var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\",\n  var2 = \"RANDOM_GENE\", var2_modal = \"RNAseq\", var2_cancers = \"BRCA\"\n)\n# If r close to 0 and p > 0.05: This is correct! Not all genes are correlated.\n# Low correlation doesn't mean analysis failed - it means no relationship exists.\n\n# ===========================================================================\n# Next Steps\n# ===========================================================================\n# After correlation analysis:\n# 1. Use tcga_enrichment() to find pathways associated with correlated genes\n# 2. Use tcga_survival() to test prognostic significance\n# 3. Validate findings in independent cancer types (pan-cancer analysis)\n# 4. Integrate multiple omics for comprehensive understanding\n## End(No test)\n\n\n\n",
    "return_value": "**Unified Return Structure**: List with 3 components (consistent across all scenarios) **Quick Access Guide** (common operations): Get statistics: result$stats View plot: print(result$plot) or just result$plot Save plot: Already auto-saved to sltcga_output/*.png Export data: write.csv(result$raw_data, \"mydata.csv\") Check sample size: nrow(result$raw_data) Filter significant: result$stats[result$stats$p_adj < 0.05, ] Get top correlation: result$stats[which.max(abs(result$stats$r)), ] Extract column names: colnames(result$raw_data) stats Data frame with statistical results (1+ rows, one per comparison): **Continuous-Continuous** (Scenarios 1-3): var1_feature, var2_feature, r, p, p_adj, n, method **Categorical-Continuous** (Scenarios 4-6): var1_feature, var2_feature, statistic, p, p_adj, n, test_method, effect_size **Categorical-Categorical** (Scenario 7): var1_feature, var2_feature, statistic, p, p_adj, n, test_method, odds_ratio, contingency_table Column names vary by scenario but always include: feature names, p-value, sample size, test method. Always a data frame (never NULL). Use result$stats to access. plot Visualization object (type varies by scenario): **ggplot2**: Scenarios 1-6 (scatter, lollipop, dot, box plots) **patchwork**: Combined ggplot2 objects (e.g., multiple boxplots) **ComplexHeatmap**: Scenario 6 with heatmap option Access: result$plot . Dimensions: attr(result$plot, \"width\") , attr(result$plot, \"height\") . Auto-saved to sltcga_output/*.png (300 DPI). Print with print(result$plot) . raw_data Data frame with merged input data: Rows = samples (patients), rownames = sample IDs Columns = analyzed features + cancer_type column Use for: Custom analysis, filtering, quality checks, downstream modeling Access: result$raw_data . Sample size: nrow(result$raw_data) .",
    "references": ["**TCGA Database**: The Cancer Genome Atlas Research Network (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10):1113-1120. doi:10.1038/ng.2764", "Hoadley KA, et al. (2018). Cell-of-Origin Patterns Dominate the Molecular Classification of 10,000 Tumors from 33 Types of Cancer. Cell, 173(2):291-304. doi:10.1016/j.cell.2018.03.022", "Sanchez-Vega F, et al. (2018). Oncogenic Signaling Pathways in The Cancer Genome Atlas. Cell, 173(2):321-337. doi:10.1016/j.cell.2018.03.035"],
    "formatted_arguments": "var1: Character vector. Variable names for first group (required). Examples: Single gene (\"TP53\"), multiple genes (c(\"TP53\", \"EGFR\")), clinical (\"Age\", \"Stage\"), signatures (\"TMB\", \"EMT_Score\"), immune cells (\"CD8_T_cells_cibersort\"), miRNA (\"hsa-mir-21\"). Number and type of variables affect scenario selection (see Details).\nvar1_modal: Character. Data modality for var1 (required). Options: \"RNAseq\", \"Mutation\", \"CNV\", \"Methylation\", \"miRNA\", \"Clinical\", \"Signature\", \"ImmuneCell\". Determines data type: continuous (RNAseq, CNV, Methylation, miRNA, ImmuneCell, some Signature), categorical (Mutation, some Clinical/Signature), mixed (Clinical, Signature).\nvar1_cancers: Character vector. Cancer types for var1 (required, case-insensitive). Options: 33 main types (\"BRCA\", \"LUAD\", \"COAD\"), 32 molecular subtypes (\"BRCA-Basal\", \"BRCA-LumA\"), combined groups (\"COADREAD\", \"GBMLGG\"). Examples: single cancer (\"BRCA\"), multiple cancers (c(\"BRCA\", \"LUAD\", \"COAD\")), molecular subtypes (c(\"BRCA-LumA\", \"BRCA-Basal\")). Use list_cancer_types() to view all options.\nvar2: Character vector. Variable names for second group (required). Same format as var1. Can be same genes (e.g., TP53 CNV vs TP53 mRNA) or different genes.\nvar2_modal: Character. Data modality for var2 (required). Same options as var1_modal. Can be same modality (intra-omics) or different (cross-omics).\nvar2_cancers: Character vector. Cancer types for var2 (required). Can be identical to var1_cancers (recommended for matched analysis) or different. If identical and multiple cancers, uses smart pairing (BRCA-BRCA, LUAD-LUAD, not BRCA-LUAD).\nmethod: Character. Correlation method for continuous variables (default: \"pearson\"). Options: \"pearson\" (parametric, assumes normality), \"spearman\" (non-parametric, rank-based), \"kendall\" (non-parametric, tau coefficient). Only affects continuous-continuous comparisons.\nuse: Character. 
Missing value handling for correlations (default: \"pairwise.complete.obs\"). Options: \"everything\", \"all.obs\", \"complete.obs\", \"na.or.complete\", \"pairwise.complete.obs\". Recommended: \"pairwise.complete.obs\" (maximizes sample size).\np_adjust_method: Character. Multiple testing correction method (default: \"BH\"). Options: \"BH\" (Benjamini-Hochberg FDR), \"bonferroni\", \"holm\", \"hochberg\", \"hommel\", \"BY\", \"fdr\", \"none\". Applied when there are multiple comparisons (e.g., 10 gene pairs -> 10 tests).\nalpha: Numeric. Significance threshold for p-value (default: 0.05). Used for visual annotation (e.g., stars in plots). Does not filter results.\nrnaseq_type: Character. RNAseq normalization method (default: \"log2TPM\"). Options: \"log2TPM\" (recommended, normalized), \"log2RSEM\" (RSEM), \"log2FPKM\" (FPKM), \"log2Counts\" (raw counts). Only used when var1_modal or var2_modal = \"RNAseq\".\ncnv_type: Character. CNV calling algorithm (default: \"SNP6_Array\"). Options vary by cancer, typically \"SNP6_Array\", \"WES\", \"WGS\". Check data availability. Only used when var1_modal or var2_modal = \"CNV\".\nmethylation_region: Character. Methylation region to analyze (default: \"Promoter_mean\"). Options: \"Promoter_mean\" (TSS1500+TSS200+5UTR+1stExon), \"TSS1500\", \"TSS200\", \"5UTR\", \"1stExon\", \"Body\", \"3UTR\", \"Gene_mean\". Recommended: \"Promoter_mean\" for expression correlation. Only used when var1_modal or var2_modal = \"Methylation\".\nimmune_algorithm: Character or NULL. Immune deconvolution algorithm filter (default: NULL for all). Options: \"cibersort\", \"xcell\", \"quantiseq\", \"mcpcounter\", \"timer\", \"epic\", \"ips\", \"estimate\", or NULL. NULL includes all 99 cell types from 8 algorithms. Specify to focus on specific algorithm. Only used when var1_modal or var2_modal = \"ImmuneCell\".\nplot_type: Character. Plot type preference for Scenario 6 (default: \"auto\"). 
Options: \"auto\" (heatmap if >=8 continuous features, else boxplot), \"boxplot\" (force boxplots), \"heatmap\" (force heatmap with difference bars). Only affects Scenario 6 visualization.",
    "simple_arguments": "var1: Character vector. Variable names for first group (required). Examples: Single gene (\"TP53\"), multiple genes (c(\"TP53\", \"EGFR\")), clinical (\"Age\", \"Stage\"), signatures (\"TMB\", \"EMT_Score\"), immune cells (\"CD8_T_cells_cibersort\"), miRNA (\"hsa-mir-21\"). Number and type of variables affect scenario selection (see Details).\nvar1_modal: Character. Data modality for var1 (required). Options: \"RNAseq\", \"Mutation\", \"CNV\", \"Methylation\", \"miRNA\", \"Clinical\", \"Signature\", \"ImmuneCell\". Determines data type: continuous (RNAseq, CNV, Methylation, miRNA, ImmuneCell, some Signature), categorical (Mutation, some Clinical/Signature), mixed (Clinical, Signature).\nvar1_cancers: Character vector. Cancer types for var1 (required, case-insensitive). Options: 33 main types (\"BRCA\", \"LUAD\", \"COAD\"), 32 molecular subtypes (\"BRCA-Basal\", \"BRCA-LumA\"), combined groups (\"COADREAD\", \"GBMLGG\"). Examples: single cancer (\"BRCA\"), multiple cancers (c(\"BRCA\", \"LUAD\", \"COAD\")), molecular subtypes (c(\"BRCA-LumA\", \"BRCA-Basal\")). Use list_cancer_types() to view all options.\nvar2: Character vector. Variable names for second group (required). Same format as var1. Can be same genes (e.g., TP53 CNV vs TP53 mRNA) or different genes.\nvar2_modal: Character. Data modality for var2 (required). Same options as var1_modal. Can be same modality (intra-omics) or different (cross-omics).\nvar2_cancers: Character vector. Cancer types for var2 (required). Can be identical to var1_cancers (recommended for matched analysis) or different. If identical and multiple cancers, uses smart pairing (BRCA-BRCA, LUAD-LUAD, not BRCA-LUAD)."
  },
  "tcga_enrichment": {
    "package": "SLTCGA",
    "function_name": "tcga_enrichment",
    "title": "Genome-Wide Scan and Pathway Enrichment Analysis with GSEA",
    "description": "**Discovers genes and pathways** affected by a query variable through two complementary workflows: (1) **genome-wide scan** compares the query against all ~20,000 genes to identify the top correlated/differentially expressed individual genes (answers: \"Which genes are affected?\"), (2) **pathway enrichment (GSEA)** tests association with biological pathways from 8 databases (MsigDB, GO, KEGG, Reactome, WikiPathways, MeSH, Disease Ontology, Enrichr) to discover functional programs (answers: \"Which pathways are activated/suppressed?\"). Supports TCGA data modalities (RNAseq, Mutation, CNV, Methylation, miRNA, Clinical, Signature, ImmuneCell). Automatically detects 8 scenarios (8-15) based on variable type (categorical: DEA-based, continuous: correlation-based) and count (single: network/paired plots, multiple: matrix/heatmap). TCGA enrichment uses RNAseq as genome-wide reference. Returns unified structure: list(stats, plot, raw_data).",
    "user_queries": ["**Mutation Effects**:", "Which genes are differentially expressed in TP53 mutant tumors?", "What pathways are disrupted by PIK3CA mutation?", "How does KRAS mutation affect transcriptional programs?", "Which genes are commonly dysregulated across multiple mutations?", "What are the downstream effects of BRCA1 mutation?", "Does EGFR mutation activate specific pathways?", "**Gene Expression Programs**:", "Which genes are co-expressed with TP53?", "What pathways are associated with high TP53 expression?", "Which genes correlate with immune checkpoint expression?", "What are the transcriptional targets of MYC?", "Which pathways are enriched in high vs low TP53 expressors?", "**Immune Microenvironment**:", "What pathways are associated with CD8+ T cell infiltration?", "Which genes correlate with immune cell abundance?", "What pathways are activated in immune-inflamed tumors?", "How does macrophage infiltration affect transcriptional programs?", "Are interferon pathways enriched with high immune infiltration?", "**Clinical Associations**:", "What pathways differ between early and advanced stage tumors?", "Which genes are associated with tumor grade?", "What pathways are enriched in specific histological subtypes?", "How does patient age affect gene expression programs?", "**Molecular Signatures**:", "What pathways are associated with high TMB?", "Which genes correlate with hypoxia score?", "What pathways are enriched in EMT-high tumors?", "Are DNA repair pathways enriched with high genomic instability?", "Which genes are associated with stemness signatures?", "**Genome-Wide Discovery**:", "How do I find genes associated with my query variable?", "What is genome-wide scan vs pathway enrichment?", "Which approach finds individual genes (genome scan)?", "How do I discover novel gene associations?", "Can I find transcriptional targets of my gene of interest?", "**Pathway Analysis**:", "What pathways are enriched in my dataset?", "How do I perform GSEA 
analysis?", "Which database should I use (MsigDB, GO, KEGG)?", "What is NES (normalized enrichment score)?", "How do I interpret GSEA results?", "What are leading edge genes?", "**Database Selection**:", "Which pathway database is most comprehensive?", "What is the difference between MsigDB collections (H, C2, C5)?", "Should I use GO, KEGG, or Reactome?", "What is MsigDB Hallmark collection?", "How do I choose between GO BP, CC, and MF?", "What is Enrichr and when should I use it?", "**Comparative Analysis**:", "How do I compare pathway enrichments across multiple mutations?", "Which mutations have similar transcriptional effects?", "Can I compare gene expression programs of related genes?", "How do I find pathways shared by multiple genes?", "What genes are commonly dysregulated by different mutations?", "**Multi-Omics Integration**:", "Can I analyze CNV effects on pathways?", "How does methylation affect pathway activity?", "Can I use miRNA to find pathway associations?", "How do I integrate different omics layers for pathway analysis?", "**Pan-Cancer Analysis**:", "Are TP53 mutation effects consistent across cancers?", "Which pathways are universally affected by specific mutations?", "How do I compare enrichment results across cancer types?", "Are pathway associations cancer-type specific?", "**Workflow Questions**:", "Should I run genome scan or pathway enrichment first?", "How do I validate enrichment findings?", "Can I use enrichment results to design experiments?", "How do I extract genes from enriched pathways?", "What do I do with raw_data output?", "**Technical Questions**:", "What is the difference between DEA and correlation?", "When are categorical vs continuous methods used?", "How does GSEA work?", "What is FDR correction and why is it important?", "How many pathways should I test?", "How do I speed up GSEA computation?", "**Colloquial and Alternative Phrasings**:", "Which genes does TP53 mutation affect?", "What pathways does this mutation mess 
with?", "Mutation affected genes", "Find genes related to my mutation", "What biological processes are involved?", "Pathway discovery mutation", "Gene set enrichment my data", "What's happening downstream of this gene?", "Which pathways are turned on or off?", "Functional analysis mutation", "**Abbreviations and Full Forms**:", "GSEA (Gene Set Enrichment Analysis / pathway enrichment / gene set analysis)", "DEA (Differential Expression Analysis / differential gene expression / DE analysis)", "GO (Gene Ontology / GO terms / GO biological process)", "KEGG (Kyoto Encyclopedia of Genes and Genomes / KEGG pathways)", "MSigDB (Molecular Signatures Database / molecular signature database)", "NES (Normalized Enrichment Score / enrichment score / ES)", "FDR (False Discovery Rate / adjusted p-value / q-value / corrected p-value)", "logFC (log Fold Change / fold change / FC / expression change)", "**Function Selection Questions**:", "Should I use tcga_enrichment or tcga_correlation for pathway analysis?", "Which function finds affected pathways? (Answer: tcga_enrichment)", "Which function for GSEA? (Answer: tcga_enrichment)", "How to find genes affected by mutation? (Answer: tcga_enrichment genome scan)", "Which function discovers biological processes? (Answer: tcga_enrichment)", "Correlation vs enrichment - which for pathway discovery? (Answer: tcga_enrichment)"],
    "usage": "tcga_enrichment( var1, var1_modal, var1_cancers, analysis_type = \"enrichment\", enrich_database = \"MsigDB\", enrich_ont = \"BP\", genome_modal = \"RNAseq\", method = \"pearson\", top_n = 50, n_workers = 6, rnaseq_type = \"log2TPM\", kegg_category = \"pathway\", msigdb_category = \"H\", hgdisease_source = \"do\", mesh_method = \"gendoo\", mesh_category = \"A\", enrichrdb_library = \"Cancer_Cell_Line_Encyclopedia\", immune_algorithm = NULL )",
    "parameters": [
      {
        "name": "var1",
        "has_default": false,
        "description": "Character vector. Query variable names (required). Examples: Single gene (\"TP53\"), multiple genes (c(\"TP53\", \"PIK3CA\")), mutation status (\"TP53\" with modal=\"Mutation\"), clinical (\"Stage\"), signature (\"TMB\"), immune cells (\"CD8_T_cells_cibersort\"), miRNA (\"hsa-mir-21\"). Variable type (continuous/categorical) determines the analysis method (correlation vs DEA). Number of variables affects scenario."
      },
      {
        "name": "var1_modal",
        "has_default": false,
        "description": "Character. Data modality for query variables (required). Options: \"RNAseq\", \"Mutation\", \"CNV\", \"Methylation\", \"miRNA\", \"Clinical\", \"Signature\", \"ImmuneCell\". Determines variable type: continuous (RNAseq, CNV, Methylation, miRNA, ImmuneCell) use correlation, categorical (Mutation, some Clinical) use differential expression analysis (DEA). **Note**: Clinical variables cannot use genome-wide scan (use tcga_correlation() instead)."
      },
      {
        "name": "var1_cancers",
        "has_default": false,
        "description": "Character vector. Cancer types for analysis (required, case-insensitive). Options: 33 main types (\"BRCA\", \"LUAD\"), 32 molecular subtypes (\"BRCA-Basal\"). Single or multiple cancers supported. Use list_cancer_types() to view all."
      },
      {
        "name": "analysis_type",
        "has_default": true,
        "default_value": "\"enrichment\"",
        "description": "Character. Analysis workflow type (default: \"enrichment\"). Options: \"genome\" (genome-wide scan -> identifies top correlated/DE genes -> network or dot plot); \"enrichment\" (pathway enrichment with GSEA -> tests pathway associations -> GSEA plots). \"genome\" discovers individual genes; \"enrichment\" discovers biological pathways/processes."
      },
      {
        "name": "enrich_database",
        "has_default": true,
        "default_value": "\"MsigDB\"",
        "description": "Character. Pathway database for GSEA (default: \"MsigDB\"). Options: \"MsigDB\" (Molecular Signatures), \"GO\" (Gene Ontology), \"KEGG\" (pathways), \"Wiki\" (WikiPathways), \"Reactome\" (reactions), \"Mesh\" (MeSH terms), \"HgDisease\" (disease ontology), \"Enrichrdb\" (Enrichr libraries). Only used when analysis_type = \"enrichment\". Different databases answer different questions (see Details)."
      },
      {
        "name": "enrich_ont",
        "has_default": true,
        "default_value": "\"BP\"",
        "description": "Character. GO sub-ontology (default: \"BP\"). Options: \"BP\" (Biological Process), \"CC\" (Cellular Component), \"MF\" (Molecular Function), \"ALL\". Only used when enrich_database = \"GO\". BP most commonly used for pathway analysis."
      },
      {
        "name": "genome_modal",
        "has_default": true,
        "default_value": "\"RNAseq\"",
        "description": "Character. Reference genome layer for comparison (default: \"RNAseq\"). Options: \"RNAseq\" only. TCGA enrichment always uses RNAseq as genome-wide reference (~20,000 genes). If other modality specified, automatically overridden to \"RNAseq\" with warning. Compares query variable against all genes' mRNA expression."
      },
      {
        "name": "method",
        "has_default": true,
        "default_value": "\"pearson\"",
        "description": "Character. Correlation method for continuous variables (default: \"pearson\"). Options: \"pearson\", \"spearman\", \"kendall\". Only used for continuous query variables (e.g., RNAseq, CNV, ImmuneCell). Categorical variables use DEA (limma-voom) instead."
      },
      {
        "name": "top_n",
        "has_default": true,
        "default_value": "50",
        "description": "Integer. Number of top results to display in plots (default: 50). Range: 10-100 recommended. For genome scan: top N correlated/DE genes shown. For GSEA: top N enriched pathways shown. Larger values give a more comprehensive view but produce cluttered plots."
      },
      {
        "name": "n_workers",
        "has_default": true,
        "default_value": "6",
        "description": "Integer. Number of parallel workers for GSEA computation (default: 6). Range: 1-12 recommended. Higher values speed up GSEA (CPU-intensive) but use more memory. Automatically capped at available cores. Use 1 for debugging, 6-8 for production."
      },
      {
        "name": "rnaseq_type",
        "has_default": true,
        "default_value": "\"log2TPM\"",
        "description": "Character. RNAseq normalization method (default: \"log2TPM\"). Options: \"log2TPM\", \"log2RSEM\", \"log2FPKM\", \"log2Counts\". Affects query variable data (if var1_modal=\"RNAseq\") and genome-wide reference. Recommended: \"log2TPM\" (normalized)."
      },
      {
        "name": "kegg_category",
        "has_default": true,
        "default_value": "\"pathway\"",
        "description": "Character. KEGG pathway category (default: \"pathway\"). Options: \"pathway\", \"module\", \"disease\", \"drug\", \"network\". Only used when enrich_database = \"KEGG\". \"pathway\" most common for functional analysis."
      },
      {
        "name": "msigdb_category",
        "has_default": true,
        "default_value": "\"H\"",
        "description": "Character. MsigDB collection (default: \"H\"). Options: \"H\" (hallmark, 50 pathways), \"C1\" (positional), \"C2\" (curated, 5,000+), \"C3\" (regulatory targets), \"C4\" (computational), \"C5\" (GO), \"C6\" (oncogenic), \"C7\" (immunologic), \"C8\" (cell type). Only used when enrich_database = \"MsigDB\". Recommended: \"H\" (focused), \"C2\" (comprehensive), \"C5\" (GO-based)."
      },
      {
        "name": "hgdisease_source",
        "has_default": true,
        "default_value": "\"do\"",
        "description": "Character. Disease ontology source (default: \"do\"). Options: \"do\" (Disease Ontology), \"nci\" (NCI Thesaurus), \"mesh\" (MeSH). Only used when enrich_database = \"HgDisease\". Links genes to disease associations."
      },
      {
        "name": "mesh_method",
        "has_default": true,
        "default_value": "\"gendoo\"",
        "description": "Character. MeSH analysis method (default: \"gendoo\"). Options: \"gendoo\", \"gene2pubmed\". Only used when enrich_database = \"Mesh\". \"gendoo\" uses literature mining, \"gene2pubmed\" uses PubMed annotations."
      },
      {
        "name": "mesh_category",
        "has_default": true,
        "default_value": "\"A\"",
        "description": "Character. MeSH category (default: \"A\"). Options: \"A\" (anatomy), \"B\" (organisms), \"C\" (diseases), \"D\" (chemicals), etc. Only used when enrich_database = \"Mesh\". Filters MeSH terms by category."
      },
      {
        "name": "enrichrdb_library",
        "has_default": true,
        "default_value": "\"Cancer_Cell_Line_Encyclopedia\"",
        "description": "Character. Enrichr library (default: \"Cancer_Cell_Line_Encyclopedia\"). Options: 100+ libraries (see Enrichr website). Popular: \"KEGG_2021_Human\", \"GO_Biological_Process_2021\", \"WikiPathway_2021_Human\", \"Reactome_2022\", \"MSigDB_Hallmark_2020\", \"Cancer_Cell_Line_Encyclopedia\". Only used when enrich_database = \"Enrichrdb\". Diverse gene set collections."
      },
      {
        "name": "immune_algorithm",
        "has_default": true,
        "default_value": "NULL",
        "description": "Character or NULL. Immune deconvolution algorithm filter (default: NULL). Options: \"cibersort\", \"xcell\", \"quantiseq\", \"mcpcounter\", \"timer\", \"epic\", \"ips\", \"estimate\", or NULL. Only used when var1_modal = \"ImmuneCell\". NULL includes all algorithms."
      }
    ],
    "examples": "## No test: \n# ===========================================================================\n# Example 1: Mutation genome scan (Scenario 8) - TESTED 10.2 sec\n# ===========================================================================\n# Research Question: Which genes are differentially expressed in TP53 mutant tumors?\n# Expected: p53 target genes (MDM2, CDKN1A, BAX) differentially expressed, typically lower in mutants\n\nresult <- tcga_enrichment(\n  var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\",\n  analysis_type = \"genome\", top_n = 50\n)\n\n# Return structure\nresult$stats # DE genes: gene, logFC, p, p_adj\nresult$plot # Network plot (TP53 at center, DE genes as nodes)\nresult$raw_data # All ~20,000 genes with statistics\n\n# Interpret\ntop_genes <- head(result$stats[order(result$stats$p_adj), ], 10)\ncat(\"Top 10 DE genes:\\n\")\nprint(top_genes[, c(\"gene\", \"logFC\", \"p_adj\")])\n\n# Look up known p53 targets\ntp53_targets <- result$stats[result$stats$gene %in% c(\"MDM2\", \"CDKN1A\", \"BAX\"), ]\n\n# ===========================================================================\n# Example 2: Mutation pathway enrichment (Scenario 9) - TESTED 18.5 sec\n# ===========================================================================\n# Research Question: What pathways are disrupted by TP53 mutation?\n# Expected: p53 pathway, apoptosis, DNA repair\n\nresult <- tcga_enrichment(\n  var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\",\n  analysis_type = \"enrichment\", enrich_database = \"MsigDB\",\n  msigdb_category = \"H\", top_n = 20\n)\n\nresult$stats # Pathways: pathway, NES, p, p_adj, leading_edge\nresult$plot # GSEA running enrichment + bar plot\n\n# Find significant pathways\nsig_pathways <- result$stats[result$stats$p_adj < 0.05, ]\ncat(\"Significant pathways:\", nrow(sig_pathways), \"\\n\")\n\n# Interpretation\n# NES > 0: pathway enriched in mutant tumors\n# NES < 0: pathway enriched in wildtype tumors\n\n# 
===========================================================================\n# Example 3: Gene expression genome scan (Scenario 12) - TESTED 6.8 sec\n# ===========================================================================\n# Research Question: Which genes are co-expressed with TP53?\n# Expected: p53 pathway members (MDM2, CDKN1A, BAX)\n\nresult <- tcga_enrichment(\n  var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\",\n  analysis_type = \"genome\", method = \"pearson\", top_n = 50\n)\n\nresult$stats # Correlated genes: gene, r, p, p_adj\nresult$plot # Network plot\n\n# Find positively correlated genes\npos_cor <- result$stats[result$stats$r > 0.3 & result$stats$p_adj < 0.05, ]\ncat(\"Positively correlated genes:\", nrow(pos_cor), \"\\n\")\n\n# ===========================================================================\n# Example 4: Gene expression pathway enrichment (Scenario 13) - TESTED 14.2 sec\n# ===========================================================================\n# Research Question: What pathways are associated with TP53 expression?\n\nresult <- tcga_enrichment(\n  var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\",\n  analysis_type = \"enrichment\", enrich_database = \"MsigDB\",\n  msigdb_category = \"H\"\n)\n\nresult$stats # Enriched pathways with NES\n# High TP53 expression enriches: p53 pathway, DNA repair, apoptosis\n\n# ===========================================================================\n# Example 5: Multiple mutations genome scan (Scenario 10) - TESTED 25.3 sec\n# ===========================================================================\n# Research Question: Compare transcriptional effects of multiple mutations\n\nresult <- tcga_enrichment(\n  var1 = c(\"TP53\", \"PIK3CA\", \"GATA3\"), var1_modal = \"Mutation\",\n  var1_cancers = \"BRCA\", analysis_type = \"genome\", top_n = 50\n)\n\nresult$stats # DE genes for each mutation\nresult$plot # Dot plot matrix (mutations x genes)\nresult$raw_data # List 
of 3 data frames (one per mutation)\n\n# Find genes dysregulated by multiple mutations\n# (requires custom analysis of raw_data)\n\n# ===========================================================================\n# Example 6: Multiple mutations pathway enrichment (Scenario 11) - TESTED 48.7 sec\n# ===========================================================================\n# Research Question: Compare pathway disruptions by different mutations\n\nresult <- tcga_enrichment(\n  var1 = c(\"TP53\", \"PIK3CA\", \"GATA3\"), var1_modal = \"Mutation\",\n  var1_cancers = \"BRCA\", analysis_type = \"enrichment\",\n  enrich_database = \"MsigDB\", msigdb_category = \"H\", top_n = 30\n)\n\nresult$stats # Pathways for each mutation\nresult$plot # Heatmap matrix (mutations x pathways, colored by NES)\n\n# Interpretation:\n# - TP53: p53 pathway, apoptosis, DNA repair\n# - PIK3CA: PI3K/AKT/mTOR, metabolism\n# - GATA3: Estrogen response, epithelial differentiation\n\n# ===========================================================================\n# Example 7: Multiple genes pathway enrichment (Scenario 15) - TESTED 42.1 sec\n# ===========================================================================\n# Research Question: Do p53 pathway members enrich similar pathways?\n\nresult <- tcga_enrichment(\n  var1 = c(\"TP53\", \"MDM2\", \"CDKN1A\"), var1_modal = \"RNAseq\",\n  var1_cancers = \"BRCA\", analysis_type = \"enrichment\",\n  enrich_database = \"MsigDB\", msigdb_category = \"H\"\n)\n\nresult$stats # Pathways for each gene\nresult$plot # Heatmap matrix showing shared enrichments\n\n# Expected: All 3 genes enrich p53 pathway, DNA repair, cell cycle\n\n# ===========================================================================\n# Example 8: GO Biological Process enrichment - TESTED 52.3 sec\n# ===========================================================================\n# Research Question: Detailed GO terms for TP53 mutation\n\nresult <- tcga_enrichment(\n  var1 = \"TP53\", 
var1_modal = \"Mutation\", var1_cancers = \"BRCA\",\n  analysis_type = \"enrichment\", enrich_database = \"GO\",\n  enrich_ont = \"BP\", top_n = 30\n)\n\nresult$stats # GO:BP terms with NES\n# More detailed than Hallmark, but more generic terms\n\n# ===========================================================================\n# Example 9: KEGG pathway enrichment - TESTED 21.5 sec\n# ===========================================================================\n# Research Question: KEGG pathways affected by PIK3CA mutation\n\nresult <- tcga_enrichment(\n  var1 = \"PIK3CA\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\",\n  analysis_type = \"enrichment\", enrich_database = \"KEGG\",\n  kegg_category = \"pathway\"\n)\n\nresult$stats # KEGG pathways\n# Expected: PI3K-Akt signaling pathway, mTOR signaling\n\n# ===========================================================================\n# Example 10: Immune cell enrichment - TESTED 15.8 sec\n# ===========================================================================\n# Research Question: What pathways are associated with CD8+ T cell infiltration?\n\nresult <- tcga_enrichment(\n  var1 = \"CD8_T_cells_cibersort\", var1_modal = \"ImmuneCell\",\n  var1_cancers = \"BRCA\", analysis_type = \"enrichment\",\n  enrich_database = \"MsigDB\", msigdb_category = \"H\"\n)\n\nresult$stats # Immune-related pathways\n# Expected: Interferon response, inflammatory response, allograft rejection\n\n# ===========================================================================\n# Example 11: TMB signature enrichment - TESTED 16.3 sec\n# ===========================================================================\n# Research Question: What pathways are associated with high TMB?\n\nresult <- tcga_enrichment(\n  var1 = \"TMB\", var1_modal = \"Signature\", var1_cancers = \"BRCA\",\n  analysis_type = \"enrichment\", enrich_database = \"MsigDB\",\n  msigdb_category = \"H\"\n)\n\nresult$stats # DNA repair and immune pathways\n# High TMB tumors: 
DNA repair deficiency + immune activation\n\n# ===========================================================================\n# Example 12: Clinical variable enrichment - TESTED 11.7 sec\n# ===========================================================================\n# Research Question: What pathways differ by tumor stage?\n# Note: Cannot use genome scan with Clinical (use enrichment only)\n\nresult <- tcga_enrichment(\n  var1 = \"Stage\", var1_modal = \"Clinical\", var1_cancers = \"BRCA\",\n  analysis_type = \"enrichment\", enrich_database = \"MsigDB\",\n  msigdb_category = \"H\"\n)\n\nresult$stats # Pathways associated with advanced stage\n# Expected: EMT, angiogenesis, metastasis-related pathways\n\n# ===========================================================================\n# Example 13: Custom analysis with raw_data\n# ===========================================================================\n# Use raw_data for custom filtering or secondary analysis\n\nresult <- tcga_enrichment(\n  var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\",\n  analysis_type = \"genome\"\n)\n\n# Access full genome-wide data\nfull_data <- result$raw_data\nhead(full_data) # All ~20,000 genes\n\n# Custom filtering\nstrong_de <- full_data[abs(full_data$logFC) > 2 & full_data$p_adj < 0.001, ]\n\n# Extract specific genes\nmy_genes <- c(\"MDM2\", \"CDKN1A\", \"BAX\", \"BCL2\", \"MYC\")\nmy_genes_data <- full_data[full_data$gene %in% my_genes, ]\n\n# Use for downstream analysis\n# - Survival analysis of top DE genes\n# - Validate in independent cancer types\n# - Build gene signatures from DE genes\n\n# ===========================================================================\n# Example 14: Pan-cancer comparison\n# ===========================================================================\n# Compare TP53 mutation effects across cancers\n\n# BRCA\nbrca_result <- tcga_enrichment(\n  var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\",\n  analysis_type = 
\"enrichment\", enrich_database = \"MsigDB\"\n)\n\n# LUAD\nluad_result <- tcga_enrichment(\n  var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"LUAD\",\n  analysis_type = \"enrichment\", enrich_database = \"MsigDB\"\n)\n\n# Compare enriched pathways\nbrca_paths <- brca_result$stats$pathway[brca_result$stats$p_adj < 0.05]\nluad_paths <- luad_result$stats$pathway[luad_result$stats$p_adj < 0.05]\n\n# Shared pathways\nshared <- intersect(brca_paths, luad_paths)\ncat(\"Shared pathways:\", length(shared), \"\\n\")\n\n# ===========================================================================\n# Example 15: Common Mistakes and How to Fix Them\n# ===========================================================================\n\n# MISTAKE 1: Using Clinical variables with genome scan\n# ❌ WRONG: Clinical variables cannot be used for genome-wide scan\n# result_wrong <- tcga_enrichment(\n#   var1 = \"Stage\", var1_modal = \"Clinical\", var1_cancers = \"BRCA\",\n#   analysis_type = \"genome\"  # Error! 
Clinical vars don't have genome-wide data\n# )\n# Error: Clinical variables cannot be used for genome-wide scans\n\n# ✅ CORRECT: Use enrichment for Clinical, or use tcga_correlation instead\nresult_correct <- tcga_enrichment(\n  var1 = \"Stage\", var1_modal = \"Clinical\", var1_cancers = \"BRCA\",\n  analysis_type = \"enrichment\" # Enrichment works with Clinical\n)\n\n# Or use tcga_correlation for Clinical vs genes\n# result_alt <- tcga_correlation(\n#   var1 = \"Stage\", var1_modal = \"Clinical\", var1_cancers = \"BRCA\",\n#   var2 = c(\"TP53\", \"MYC\", \"KRAS\"), var2_modal = \"RNAseq\", var2_cancers = \"BRCA\"\n# )\n\n# MISTAKE 2: Wrong database parameter for chosen database\n# ❌ WRONG: Specifying GO parameters when using MsigDB\n# result_wrong <- tcga_enrichment(\n#   var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\",\n#   enrich_database = \"MsigDB\",\n#   enrich_ont = \"BP\"  # This is for GO, not MsigDB!\n# )\n# Won't cause error but parameter is ignored\n\n# ✅ CORRECT: Use database-specific parameters\n# For MsigDB: use msigdb_category\nresult_msigdb <- tcga_enrichment(\n  var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\",\n  enrich_database = \"MsigDB\",\n  msigdb_category = \"H\" # Correct parameter for MsigDB\n)\n\n# For GO: use enrich_ont\nresult_go <- tcga_enrichment(\n  var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\",\n  enrich_database = \"GO\",\n  enrich_ont = \"BP\" # Correct parameter for GO\n)\n\n# MISTAKE 3: Misinterpreting NES sign\nresult <- tcga_enrichment(\n  var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\",\n  analysis_type = \"enrichment\", enrich_database = \"MsigDB\"\n)\n\n# ❌ WRONG: \"NES = 2.5 means pathway is downregulated\"\n# ✅ CORRECT: NES > 0 means pathway UPREGULATED/ACTIVATED in mutant group\n#            NES < 0 means pathway DOWNREGULATED/SUPPRESSED in mutant group\n\nsig_pathways <- result$stats[result$stats$p_adj < 0.05, ]\nupregulated <- 
sig_pathways[sig_pathways$NES > 0, ] # Activated in mutants\ndownregulated <- sig_pathways[sig_pathways$NES < 0, ] # Suppressed in mutants\n\n# MISTAKE 4: Expecting too many significant pathways\n# Not finding significant pathways (p_adj > 0.05 for all) doesn't mean:\n# - Analysis failed\n# - Something is wrong\n# It may mean: The query variable doesn't have strong pathway-level effects\n# Try: genome scan to find individual genes, or different database\n\n# ===========================================================================\n# Next Steps\n# ===========================================================================\n# After enrichment analysis:\n# 1. Validate top genes with tcga_correlation() in independent cancers\n# 2. Test prognostic value with tcga_survival()\n# 3. Explore leading edge genes from enriched pathways\n# 4. Compare across molecular subtypes\n# 5. Design targeted experiments based on pathway insights\n## End(No test)\n\n\n\n",
    "return_value": "**Unified Return Structure**: List with 3 components (consistent across scenarios) **Quick Access Guide** (common operations): Get enrichment results: result$stats View plot: print(result$plot) Save plot: Already auto-saved to sltcga_output/*.png Filter significant: result$stats[result$stats$p_adj < 0.05, ] Get top pathways: head(result$stats[order(abs(result$stats$NES), decreasing=TRUE), ], 10) Extract leading edge: result$stats$leading_edge[[1]] (for GSEA) Get upregulated: result$stats[result$stats$NES > 0 & result$stats$p_adj < 0.05, ] Get downregulated: result$stats[result$stats$NES < 0 & result$stats$p_adj < 0.05, ] Export data: write.csv(result$stats, \"enrichment_results.csv\") Access full genome: result$raw_data (all ~20,000 genes) stats Data frame with enrichment/association results (variable number of rows): **Genome scan (analysis_type=\"genome\")**: Genes correlated/associated with query Continuous query: gene, r (correlation), p, p_adj, n Categorical query: gene, logFC (fold change), p, p_adj, n **GSEA (analysis_type=\"enrichment\")**: Enriched pathways Columns: pathway, NES (normalized enrichment score), p, p_adj, size (gene count), leading_edge (core enriched genes) NES > 0: pathway upregulated/activated, NES < 0: downregulated/suppressed Always includes adjusted p-values (FDR). Multiple variables -> multiple sections in results. Use result$stats to access. Filter by p_adj < 0.05 for significant results. plot Visualization object (type varies by scenario): **Scenario 8/12**: Network plot (query variable at center, top genes as nodes) **Scenario 9/13**: Paired plots (GSEA running enrichment + bar plot) **Scenario 10/14**: Dot plot matrix (variables x genes or pathways) **Scenario 11/15**: Heatmap matrix (variables x pathways with NES colors) All ggplot2 objects. Access: result$plot . Dimensions: attr(result$plot, \"width/height\") . Auto-saved to sltcga_output/*.png (300 DPI). 
raw_data Complete genome-wide analysis results (full gene list): Contains all ~20,000 genes with statistics (not just top N shown in plots) Use for: Custom filtering, extracting specific genes, secondary analysis For GSEA: raw_data contains gene-level statistics used as GSEA input Access: result$raw_data . Can be list (multiple variables) or data frame (single variable).",
    "references": ["**TCGA Database**: The Cancer Genome Atlas Research Network (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10):1113-1120. doi:10.1038/ng.2764", "Subramanian A, et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 102(43):15545-15550. doi:10.1073/pnas.0506580102", "Liberzon A, et al. (2015). The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Systems, 1(6):417-425. doi:10.1016/j.cels.2015.12.004", "The Gene Ontology Consortium (2021). The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Research, 49(D1):D325-D334. doi:10.1093/nar/gkaa1113"],
    "formatted_arguments": "var1: Character vector. Query variable names (required). Examples: Single gene (\"TP53\"), multiple genes (c(\"TP53\", \"PIK3CA\")), mutation status (\"TP53\" with modal=\"Mutation\"), clinical (\"Stage\"), signature (\"TMB\"), immune cells (\"CD8_T_cells_cibersort\"), miRNA (\"hsa-mir-21\"). Variable type (continuous/categorical) determines analysis\nvar1_modal: Character. Data modality for query variables (required). Options: \"RNAseq\", \"Mutation\", \"CNV\", \"Methylation\", \"miRNA\", \"Clinical\", \"Signature\", \"ImmuneCell\". Determines variable type: continuous (RNAseq, CNV, Methylation, miRNA, ImmuneCell) use correlation, categorical (Mutation, some Clinical) use differential expression analysis (DEA). **Note**: Clinical variables cannot use genome-wide scan (use tcga_correlation() instead).\nvar1_cancers: Character vector. Cancer types for analysis (required, case-insensitive). Options: 33 main types (\"BRCA\", \"LUAD\"), 32 molecular subtypes (\"BRCA-Basal\"). Single or multiple cancers supported. Use list_cancer_types () to view all.\nanalysis_type: Character. Analysis workflow type (default: \"enrichment\"). Options: \"genome\": Genome-wide scan -> Identifies top correlated/DE genes -> Network or Dot plot \"enrichment\": Pathway enrichment (GSEA) -> Tests pathway associations -> GSEA plots \"genome\" discovers individual genes, \"enrichment\" discovers biological pathways/processes.\nenrich_database: Character. Pathway database for GSEA (default: \"MsigDB\"). Options: \"MsigDB\" (Molecular Signatures), \"GO\" (Gene Ontology), \"KEGG\" (pathways), \"Wiki\" (WikiPathways), \"Reactome\" (reactions), \"Mesh\" (MeSH terms), \"HgDisease\" (disease ontology), \"Enrichrdb\" (Enrichr libraries). Only used when analysis_type = \"enrichment\" . Different databases answer different questions (see Details).\nenrich_ont: Character. GO sub-ontology (default: \"BP\"). 
Options: \"BP\" (Biological Process), \"CC\" (Cellular Component), \"MF\" (Molecular Function), \"ALL\". Only used when enrich_database = \"GO\" . BP most commonly used for pathway analysis.\ngenome_modal: Character. Reference genome layer for comparison (default: \"RNAseq\"). Options: \"RNAseq\" only. TCGA enrichment always uses RNAseq as genome-wide reference (~20,000 genes). If other modality specified, automatically overridden to \"RNAseq\" with warning. Compares query variable against all genes' mRNA expression. method Character. Correlation method for continuous variables (default: \"pearson\"). Options: \"pearson\", \"spearman\", \"kendall\". Only used for continuous query variables (e.g., RNAseq, CNV, ImmuneCell). Categorical variables use DEA (limma-voom) instead.\nmethod: correlation vs DEA). Number of variables affects scenario.\ntop_n: Integer. Number of top results to display in plots (default: 50). Range: 10-100 recommended. For genome scan: top N correlated/DE genes shown. For GSEA: top N enriched pathways shown. Larger values provide comprehensive view but cluttered plots.\nn_workers: Integer. Number of parallel workers for GSEA computation (default: 6). Range: 1-12 recommended. Higher values speed up GSEA (CPU-intensive) but use more memory. Automatically capped at available cores. Use 1 for debugging, 6-8 for production.\nrnaseq_type: Character. RNAseq normalization method (default: \"log2TPM\"). Options: \"log2TPM\", \"log2RSEM\", \"log2FPKM\", \"log2Counts\". Affects query variable data (if var1_modal=\"RNAseq\") and genome-wide reference. Recommended: \"log2TPM\" (normalized).\nkegg_category: Character. KEGG pathway category (default: \"pathway\"). Options: \"pathway\", \"module\", \"disease\", \"drug\", \"network\". Only used when enrich_database = \"KEGG\" . \"pathway\" most common for functional analysis.\nmsigdb_category: Character. MsigDB collection (default: \"H\"). 
Options: \"H\" (hallmark, 50 pathways), \"C1\" (positional), \"C2\" (curated, 5,000+), \"C3\" (regulatory targets), \"C4\" (computational), \"C5\" (GO), \"C6\" (oncogenic), \"C7\" (immunologic), \"C8\" (cell type). Only used when enrich_database = \"MsigDB\" . Recommended: \"H\" (focused), \"C2\" (comprehensive), \"C5\" (GO-based).\nhgdisease_source: Character. Disease ontology source (default: \"do\"). Options: \"do\" (Disease Ontology), \"nci\" (NCI Thesaurus), \"mesh\" (MeSH). Only used when enrich_database = \"HgDisease\" . Links genes to disease associations.\nmesh_method: Character. MeSH analysis method (default: \"gendoo\"). Options: \"gendoo\", \"gene2pubmed\". Only used when enrich_database = \"Mesh\" . \"gendoo\" uses literature mining, \"gene2pubmed\" uses PubMed annotations.\nmesh_category: Character. MeSH category (default: \"A\"). Options: \"A\" (anatomy), \"B\" (organisms), \"C\" (diseases), \"D\" (chemicals), etc. Only used when enrich_database = \"Mesh\" . Filters MeSH terms by category.\nenrichrdb_library: Character. Enrichr library (default: \"Cancer_Cell_Line_Encyclopedia\"). Options: 100+ libraries (see Enrichr website). Popular: \"KEGG_2021_Human\", \"GO_Biological_Process_2021\", \"WikiPathway_2021_Human\", \"Reactome_2022\", \"MSigDB_Hallmark_2020\", \"Cancer_Cell_Line_Encyclopedia\". Only used when enrich_database = \"Enrichrdb\" . Diverse gene set collections.\nimmune_algorithm: Character or NULL. Immune deconvolution algorithm filter (default: NULL). Options: \"cibersort\", \"xcell\", \"quantiseq\", \"mcpcounter\", \"timer\", \"epic\", \"ips\", \"estimate\", or NULL. Only used when var1_modal = \"ImmuneCell\" . NULL includes all algorithms.",
    "simple_arguments": "var1: Character vector. Query variable names (required). Examples: Single gene (\"TP53\"), multiple genes (c(\"TP53\", \"PIK3CA\")), mutation status (\"TP53\" with modal=\"Mutation\"), clinical (\"Stage\"), signature (\"TMB\"), immune cells (\"CD8_T_cells_cibersort\"), miRNA (\"hsa-mir-21\"). Variable type (continuous/categorical) determines analysis\nvar1_modal: Character. Data modality for query variables (required). Options: \"RNAseq\", \"Mutation\", \"CNV\", \"Methylation\", \"miRNA\", \"Clinical\", \"Signature\", \"ImmuneCell\". Determines variable type: continuous (RNAseq, CNV, Methylation, miRNA, ImmuneCell) use correlation, categorical (Mutation, some Clinical) use differential expression analysis (DEA). **Note**: Clinical variables cannot use genome-wide scan (use tcga_correlation() instead).\nvar1_cancers: Character vector. Cancer types for analysis (required, case-insensitive). Options: 33 main types (\"BRCA\", \"LUAD\"), 32 molecular subtypes (\"BRCA-Basal\"). Single or multiple cancers supported. Use list_cancer_types () to view all."
  },
  "tcga_survival": {
    "package": "SLTCGA",
    "function_name": "tcga_survival",
    "title": "Prognostic Survival Analysis with Kaplan-Meier Curves and Cox Regression",
    "description": "**Evaluates prognostic value and predicts patient outcomes** through survival analysis (Kaplan-Meier + Cox regression) across 8 TCGA data modalities (RNAseq, Mutation, CNV, Methylation, miRNA, Clinical, Signature, ImmuneCell) with **4 survival endpoints** (OS: overall survival, DSS: disease-specific survival, PFI: progression-free interval, DFI: disease-free interval). Automatically dichotomizes continuous variables using optimal cutpoint (maximizes separation) or median/quantile, performs Kaplan-Meier analysis with log-rank test, fits Cox proportional hazards model for hazard ratios, and generates publication-ready visualizations (KM curves + Cox forest plot for single feature, forest plot for multiple features). Covers 2 scenarios (16: single variable -> KM+Cox, 17: multiple variables -> forest plot). Supports 33 main cancer types + 32 molecular subtypes. Returns unified structure: \\code{list(stats, plot, raw_data)}. ",
    "user_queries": ["**Gene Expression Prognosis**:", "Is TP53 expression prognostic for survival in breast cancer?", "Do patients with high BRCA1 expression have better survival?", "Which genes in the PI3K/AKT pathway predict survival?", "Is EGFR expression associated with survival outcomes?", "Do immune checkpoint genes (PDL1, PD1, CTLA4) predict survival?", "Are cell cycle genes prognostic in aggressive cancers?", "**Mutation Prognosis**:", "Do TP53 mutant tumors have worse survival than wildtype?", "Is PIK3CA mutation prognostic in breast cancer?", "Are KRAS mutant lung cancers associated with poor survival?", "Do patients with BRCA1/BRCA2 mutations have different survival?", "Which driver mutations predict survival outcomes?", "Are mutation combinations prognostic?", "**Clinical Factors**:", "How does tumor stage affect survival probability?", "Is patient age prognostic for survival?", "Does tumor grade predict survival outcomes?", "Is histological subtype associated with prognosis?", "Do treatment histories affect survival?", "Which clinical factors are most prognostic?", "**Immune Infiltration Prognosis**:", "Does CD8+ T cell infiltration predict better survival?", "Are B cells associated with improved prognosis?", "Do M2 macrophages predict worse outcomes?", "Is high immune infiltration prognostic?", "Which immune cell types are most prognostic?", "Does tumor-infiltrating lymphocyte (TIL) abundance predict survival?", "**Molecular Signatures**:", "Is tumor mutation burden (TMB) prognostic?", "Does hypoxia score predict survival outcomes?", "Is EMT signature associated with poor prognosis?", "Do stemness scores predict survival?", "Is cytolytic activity (CYT) score prognostic?", "Are proliferation signatures associated with outcomes?", "**Multi-Omics Integration**:", "Does CNV amplification predict survival?", "Is promoter methylation prognostic?", "Do miRNA levels predict survival outcomes?", "Which omics layer is most prognostic (RNA vs DNA vs 
epigenetic)?", "**Survival Endpoints**:", "What is the difference between OS, DSS, PFI, and DFI?", "Which survival endpoint should I use for my study?", "Is TP53 prognostic for overall survival vs disease-specific survival?", "Are results consistent across different survival endpoints?", "**Cutoff Methods**:", "Should I use optimal or median cutoff for gene expression?", "What is optimal cutpoint and how does it work?", "Does cutoff choice affect prognostic significance?", "How do I validate optimal cutoff findings?", "**Multiple Variables**:", "Which genes in my pathway are prognostic?", "How do I compare prognostic value of multiple genes?", "Which immune cells are most predictive of survival?", "Can I test multiple clinical factors together?", "How do I identify the most prognostic features?", "**Molecular Subtypes**:", "Is TP53 prognostic in luminal vs basal breast cancer?", "Do prognostic markers differ by molecular subtype?", "Are immune profiles prognostic in specific subtypes?", "Should I stratify by subtype in survival analysis?", "**Pan-Cancer Questions**:", "Is TP53 prognostic across multiple cancer types?", "Do immune markers predict survival in different cancers?", "Which genes are universally prognostic?", "Are prognostic factors cancer-type specific?", "**Statistical Interpretation**:", "What does hazard ratio mean?", "How do I interpret HR > 1 vs HR < 1?", "What is C-index and how do I interpret it?", "What is the difference between log-rank and Cox p-values?", "When is a p-value considered significant?", "What is a good C-index value?", "**Colloquial and Alternative Phrasings**:", "Does high TP53 mean longer survival?", "Do mutant tumors die faster?", "Gene expression survival connection", "Does this gene predict outcome?", "Is this a good prognostic marker?", "High vs low expression survival difference", "Mutation survival impact", "Does this feature predict death?", "Patient survival gene relationship", "Outcome prediction gene expression", 
"**Abbreviations and Full Forms**:", "OS (Overall Survival / death from any cause / overall mortality)", "DSS (Disease-Specific Survival / cancer-related death / disease-related survival)", "PFI (Progression-Free Interval / progression or death / disease progression)", "DFI (Disease-Free Interval / recurrence / relapse-free survival / RFS)", "HR (Hazard Ratio / risk ratio / relative risk)", "KM (Kaplan-Meier / survival curve / survival probability)", "Cox (Cox regression / Cox proportional hazards / Cox model)", "**Function Selection Questions**:", "Should I use tcga_survival or tcga_correlation for prognosis?", "Which function tests survival impact? (Answer: tcga_survival)", "Which function for prognostic value? (Answer: tcga_survival)", "How to test if gene predicts survival? (Answer: tcga_survival)", "Which function for KM curves? (Answer: tcga_survival)", "Gene expression and patient outcome - which function? (Answer: tcga_survival)"],
    "usage": "tcga_survival( var1, var1_modal, var1_cancers, surv_type = \"OS\", cutoff_type = \"optimal\", minprop = 0.1, percent = 0.25, palette = c(\"#ED6355\", \"#41A98E\", \"#EFA63A\", \"#3a6ea5\"), show_cindex = TRUE, rnaseq_type = \"log2TPM\", cnv_type = \"SNP6_Array\", methylation_region = \"Promoter_mean\", immune_algorithm = NULL )",
    "parameters": [
      {
        "name": "var1",
        "has_default": false,
        "description": "Character vector. Variable names for survival analysis (required). Examples: Single gene (\"TP53\"), multiple genes (c(\"TP53\", \"EGFR\", \"KRAS\")), clinical (\"Stage\", \"Age\"), signatures (\"TMB\", \"Hypoxia_Score\"), immune cells (\"CD8_T_cells_cibersort\"), miRNA (\"hsa-mir-21\"). For continuous variables (RNAseq, CNV, etc.), automatically dichotomized into High/Low groups using specified cutoff method. For categorical variables (Mutation, some Clinical), used as-is for group comparison."
      },
      {
        "name": "var1_modal",
        "has_default": false,
        "description": "Character. Data modality for survival predictors (required). Options: \"RNAseq\", \"Mutation\", \"CNV\", \"Methylation\", \"miRNA\", \"Clinical\", \"Signature\", \"ImmuneCell\". Determines variable type: continuous modalities are dichotomized, categorical used directly."
      },
      {
        "name": "var1_cancers",
        "has_default": false,
        "description": "Character vector. Cancer type for analysis (required, case-insensitive). Options: 33 main types (\"BRCA\", \"LUAD\"), 32 molecular subtypes (\"BRCA-Basal\", \"BRCA-LumA\"). **Important**: Only single cancer type supported (survival data is cancer-specific). For pan-cancer survival, run separately per cancer type and compare results. Use list_cancer_types () to view all options."
      },
      {
        "name": "surv_type",
        "has_default": true,
        "default_value": "\"OS\"",
        "description": "Character. Survival endpoint to analyze (default: \"OS\"). Options: \"OS\" (Overall Survival): Death from any cause, most commonly used \"DSS\" (Disease-Specific Survival): Death from cancer specifically \"PFI\" (Progression-Free Interval): Disease progression or death \"DFI\" (Disease-Free Interval): Cancer recurrence after complete response Different endpoints answer different clinical questions. OS is most robust (fewer missing data)."
      },
      {
        "name": "cutoff_type",
        "has_default": true,
        "default_value": "\"optimal\"",
        "description": "Character. Method to dichotomize continuous variables (default: \"optimal\"). Options: \"optimal\": Maximizes log-rank test statistic (best separation, data-driven) \"median\": 50th"
      },
      {
        "name": "minprop",
        "has_default": true,
        "default_value": "0.1",
        "description": "Numeric. Minimum proportion per group when using optimal cutoff (default: 0.1). Range: 0-0.5. Prevents extreme cutoffs creating tiny groups (e.g., 5 % vs 95%). Example: minprop = 0.1 ensures each group has >=10 % of samples. \\item percent Numeric. Percentile for \"quantile\" cutoff method (default: 0.25). Range: 0-1. Example: 0.25 = 25th percentile (bottom 25 % vs top 75%), 0.75 = 75th percentile. Only used when cutoff_type = \"quantile\" . \\item"
      },
      {
        "name": "percent",
        "has_default": true,
        "default_value": "0.25",
        "description": "ile (balanced groups, unbiased) \"mean\": Average value (can create imbalanced groups) \"quantile\": Custom percentile specified by percent parameter Recommended: \"optimal\" for discovery, \"median\" for validation/reporting."
      },
      {
        "name": "palette",
        "has_default": true,
        "default_value": "c(\"#ED6355\", \"#41A98E\", \"#EFA63A\", \"#3a6ea5\")",
        "description": "Character vector. Colors for survival curves (default: c(\"#ED6355\", \"#41A98E\", \"#EFA63A\", \"#3a6ea5\")). Provide at least 2 colors for High/Low groups. Additional colors used for multi-level categorical variables. Examples: c(\"red\", \"blue\"), c(\"#E41A1C\", \"#377EB8\"), RColorBrewer palettes. \\item"
      },
      {
        "name": "show_cindex",
        "has_default": true,
        "default_value": "TRUE",
        "description": "Logical. Display concordance index (C-index) on KM plot (default: TRUE). C-index: 0.5 = random prediction, 1.0 = perfect prediction, >0.7 = good prognostic marker. Set FALSE to hide C-index from plot. \\item"
      },
      {
        "name": "rnaseq_type",
        "has_default": true,
        "default_value": "\"log2TPM\"",
        "description": "Character. RNAseq normalization method (default: \"log2TPM\"). Options: \"log2TPM\", \"log2RSEM\", \"log2FPKM\", \"log2Counts\". Only used when var1_modal = \"RNAseq\". \\item"
      },
      {
        "name": "cnv_type",
        "has_default": true,
        "default_value": "\"SNP6_Array\"",
        "description": "Character. CNV calling algorithm (default: \"SNP6_Array\"). Options: \"SNP6_Array\", \"WES\", \"WGS\". Only used when var1_modal = \"CNV\". \\item"
      },
      {
        "name": "methylation_region",
        "has_default": true,
        "default_value": "\"Promoter_mean\"",
        "description": "Character. Methylation region (default: \"Promoter_mean\"). Options: \"Promoter_mean\", \"TSS1500\", \"TSS200\", \"5UTR\", \"1stExon\", \"Body\", \"3UTR\", \"Gene_mean\". Only used when var1_modal = \"Methylation\". \\item"
      },
      {
        "name": "immune_algorithm",
        "has_default": true,
        "default_value": "NULL",
        "description": "Character or NULL. Immune deconvolution algorithm (default: NULL for all). Options: \"cibersort\", \"xcell\", \"quantiseq\", \"mcpcounter\", \"timer\", \"epic\", \"ips\", \"estimate\", or NULL. Only used when var1_modal = \"ImmuneCell\". **Unified Return Structure**: List with 3 components (consistent across scenarios) **Quick Access Guide** (common operations): Get statistics: result$stats View KM plot: print(result$plot) Save plot: Already auto-saved to sltcga_output/*.png Check HR: result$stats$cox_hr (>1 = risk, <1 = protective) Check significance: result$stats$cox_pvalue < 0.05 Get C-index: result$stats$cox_cindex (>0.7 = good) Export data: write.csv(result$raw_data, \"survival_data.csv\") Sample size: nrow(result$raw_data) Survival time: result$raw_data$[CANCER]_[ENDPOINT]_time Event status: result$raw_data$[CANCER]_[ENDPOINT]_event stats Data frame with survival analysis results (1+ rows, one per variable): **Scenario 16 (single variable)**: variable, km_pvalue, cox_hr, cox_hr_lower, cox_hr_upper, cox_pvalue, cox_cindex **Scenario 17 (multiple variables)**: variable, hr, hr_lower, hr_upper, p_value, cindex. For multi-level categorical, multiple rows per variable. Key columns: variable : Feature name (e.g., \"TP53 (RNAseq, BRCA)\") km_pvalue : Log-rank test p-value (Scenario 16 only) cox_hr : Hazard ratio from Cox model (HR > 1: worse survival, HR < 1: better survival) cox_hr_lower , cox_hr_upper : 95 % confidence interval for HR cox_pvalue : Cox model p-value (Wald test) cox_cindex : Concordance index (0.5-1.0, >0.7 = good predictor) Always a data frame (never NULL). Use result$stats to access. plot Visualization object (type varies by scenario): **Scenario 16**: Patchwork object with KM curve (left) + Cox forest plot (right) **Scenario 17**: ggplot2 forest plot showing HRs for all variables Access: result$plot . Dimensions: attr(result$plot, \"width\") , attr(result$plot, \"height\") . Auto-saved to sltcga_output/*.png (300 DPI). 
Print with print(result$plot) . KM curves include: survival curves with confidence bands, at-risk table, log-rank p-value, C-index (if enabled). Forest plots include: HR point estimates, 95 % CI error bars, vertical line at HR=1 (null effect). raw_data Data frame with merged input and survival data: Rows = samples (patients), rownames = sample IDs Columns = analyzed features + survival columns (time, event, group assignments) Survival columns: CANCER_ENDPOINT_time (days), CANCER_ENDPOINT_event (0/1) Group columns: For continuous variables, includes dichotomized groups (High/Low) Use for: Custom survival models, covariate adjustment, sensitivity analyses, data export. Access: result$raw_data . Sample size: nrow(result$raw_data) . **Evaluates prognostic value and predicts patient outcomes** through survival analysis (Kaplan-Meier + Cox regression) across 8 TCGA data modalities (RNAseq, Mutation, CNV, Methylation, miRNA, Clinical, Signature, ImmuneCell) with **4 survival endpoints** (OS: overall survival, DSS: disease-specific survival, PFI: progression-free interval, DFI: disease-free interval). Automatically dichotomizes continuous variables using optimal cutpoint (maximizes separation) or median/quantile, performs Kaplan-Meier analysis with log-rank test, fits Cox proportional hazards model for hazard ratios, and generates publication-ready visualizations (KM curves + Cox forest plot for single feature, forest plot for multiple features). Covers 2 scenarios (16: single variable -> KM+Cox, 17: multiple variables -> forest plot). Supports 33 main cancer types + 32 molecular subtypes. Returns unified structure: list(stats, plot, raw_data) . 
**How to Interpret Results** (Step-by-Step Decision Tree): **Step 1: Check statistical significance** Cox p-value < 0.05 -> Significant prognostic factor -> Proceed to Step 2 Cox p-value >= 0.05 -> Not prognostic -> Variable does not predict survival Log-rank p-value: Supportive evidence (agreement with Cox p strengthens conclusion) **Step 2: Determine prognostic direction** HR > 1 -> RISK FACTOR (high values = worse survival, increased death risk) HR < 1 -> PROTECTIVE (high values = better survival, decreased death risk) HR = 1 -> No effect (neutral) **Step 3: Assess effect magnitude** HR > 2.0 or < 0.5 -> Strong prognostic effect -> Clinically meaningful HR 1.5-2.0 or 0.5-0.67 -> Moderate effect -> Worth validating HR 1.2-1.5 or 0.67-0.83 -> Weak effect -> May have limited clinical utility **Step 4: Check confidence interval** 95 % CI excludes 1.0 -> Robust finding (e.g., CI: 1.2-2.5) 95 % CI includes 1.0 -> Not significant (e.g., CI: 0.8-1.3) Wide CI -> Imprecise estimate, need larger sample **Step 5: Evaluate discrimination ability** C-index > 0.7 -> Good prognostic biomarker -> Clinical potential C-index 0.6-0.7 -> Moderate discrimination -> Needs multivariate model C-index 0.5-0.6 -> Weak discrimination -> Limited predictive value C-index = 0.5 -> Random prediction -> Not useful **Interpretation Templates for LLM**: **Protective factor template** (HR < 1): \"High [feature] expression is associated with [better/improved] [endpoint] (HR = [HR_value], 95 % CI: [CI_lower]-[CI_upper], p = [p_value], C-index = [c_index]) in [cancer] patients, suggesting [feature] may be a [protective/favorable prognostic] factor. Patients with high [feature] have [percentage] % [reduced/lower] risk of [death/progression].\" Example: \"High TP53 expression is associated with improved overall survival (HR = 0.78, 95 % CI: 0.62-0.98, p = 0.031, C-index = 0.56) in BRCA patients, suggesting TP53 may be a protective factor. 
Patients with high TP53 have 22 % reduced risk of death.\" **Risk factor template** (HR > 1): \" [feature] [mutation/high expression] is associated with [worse/poor] [endpoint] (HR = [HR_value], 95 % CI: [CI_lower]-[CI_upper], p = [p_value]) in [cancer] patients, suggesting [feature] may be a [risk/adverse prognostic] factor. Patients with [mutant/high] [feature] have [percentage] % [increased/higher] risk of [death/progression].\" Example: \"TP53 mutation is associated with worse overall survival (HR = 1.52, 95 % CI: 1.12-2.07, p = 0.007) in BRCA patients, suggesting TP53 mutation may be an adverse prognostic factor. Patients with mutant TP53 have 52 % increased risk of death.\" **Not prognostic template** (p >= 0.05): \" [feature] is not significantly associated with [endpoint] (HR = [HR_value], p = [p_value]) in [cancer] patients, suggesting [feature] does not predict survival outcomes in this cancer type. This may indicate [context-dependency/subtype-specific effects].\" **4 Survival Endpoints (Choose Based on Research Question)**: **OS (Overall Survival)**: Death from any cause (most robust, complete data) Use for: Overall prognosis, treatment efficacy, biomarker validation Event = death (any cause), Censored = alive at last follow-up Most commonly reported, easiest to collect **DSS (Disease-Specific Survival)**: Death from cancer specifically Use for: Cancer-specific prognosis, avoiding confounding by non-cancer deaths Event = cancer-related death, Censored = alive or death from other causes More biologically relevant but requires accurate cause-of-death data **PFI (Progression-Free Interval)**: Disease progression or death Use for: Treatment response, early efficacy signals Event = progression or death, Censored = progression-free and alive Earlier endpoint than OS (occurs sooner), good for aggressive cancers **DFI (Disease-Free Interval)**: Recurrence after complete response Use for: Adjuvant therapy efficacy, cure vs recurrence risk Event = recurrence, 
Censored = disease-free or death without recurrence Only for patients who achieved complete response initially **2 Analysis Scenarios** (Auto-detected): **Scenario 16: Single Variable -> KM Curve + Cox Regression** (Combined visualization) **When**: Analyzing prognostic value of single gene/feature **Continuous variables** (RNAseq, CNV, Methylation, miRNA, ImmuneCell, some Signature): Step 1: Dichotomize using specified cutoff (optimal/median/quantile) Step 2: Kaplan-Meier analysis (High vs Low groups, log-rank test) Step 3: Cox regression (continuous variable, HR per unit change) Plot: Left = KM curves with CI bands + at-risk table, Right = Cox forest plot **Categorical variables** (Mutation, some Clinical/Signature): Binary (2 levels): KM for both groups (e.g., WT vs Mutant) Multi-level (>2 levels): Automatically switches to forest plot (Scenario 17) **Statistics**: km_pvalue (log-rank), cox_hr, cox_pvalue, cox_cindex **Example**: TP53 expression (High vs Low) in BRCA, OS endpoint **Scenario 17: Multiple Variables -> Forest Plot** (Hazard ratio comparison) **When**: Comparing prognostic value of multiple genes/features **Workflow**: For each variable: Fit separate Cox model, extract HR and 95 % CI Continuous variables: HR per unit change (or per SD change) Categorical variables: HR for each level vs reference level Multi-level categorical: Multiple HRs per variable (one per level) **Plot**: Forest plot with HR points, CI error bars, vertical line at HR=1 **Statistics**: For each variable: hr, hr_lower, hr_upper, p_value, cindex **Example**: Compare TP53, PIK3CA, GATA3 mutations for BRCA OS prognosis **Cutoff Methods for Continuous Variables**: **Optimal cutpoint** (default): Maximizes log-rank statistic Pro: Best separation, data-driven, identifies biologically meaningful threshold Con: Optimistic bias (overfitting), requires validation in independent dataset Recommended: Use for discovery, validate with median/quantile in validation set **Median cutpoint**: 50th 
percentile Pro: Unbiased, balanced groups, reproducible, standard in literature Con: May not maximize separation, ignores biological cutpoints Recommended: Use for reporting, validation, fair comparison across studies **Quantile cutpoint**: Custom percentile (e.g., 25th or 75th) Pro: Focuses on extreme groups (e.g., lowest 25 % vs highest 75%) Con: Imbalanced groups, may lose power Recommended: Use when interested in extreme phenotypes **Mean cutpoint**: Average value Pro: Simple, interpretable Con: Sensitive to outliers, can create very imbalanced groups Recommended: Rarely used, prefer median **Cox Proportional Hazards Assumptions**: Assumes constant HR over time (proportional hazards) Check with survival::cox.zph() if needed Violations common with long follow-up or time-varying effects Function reports C-index as overall discrimination measure **What You Can Do Next** (with executable code snippets): **1. Multivariate Cox model** (adjust for covariates): library(survival) data <- result$raw_data # Add clinical covariates (requires merging) # cox_multi <- coxph(Surv(BRCA_OS_time, BRCA_OS_event) ~ # BRCA_TP53_RNAseq + Age + Stage, data = data) # summary(cox_multi) **2. Find genes correlated with prognostic marker**: # If TP53 is prognostic, find co-expressed genes cor_result <- tcga_correlation( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", var2 = c(\"MDM2\", \"CDKN1A\", \"BAX\"), var2_modal = \"RNAseq\", var2_cancers = \"BRCA\" ) **3. Pathway enrichment in high-risk group**: # Dichotomize by survival risk (using optimal cutoff) data <- result$raw_data high_risk <- data$group == \"High\" # Group column from cutoff # Compare pathways: high-risk vs low-risk (requires custom DEA) # Or use mutation as proxy for risk enrich_result <- tcga_enrichment( var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\", analysis_type = \"enrichment\" ) **4. 
Validate in independent cancer or subtype**: # Validate TP53 prognosis in lung cancer luad_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"LUAD\", surv_type = \"OS\" ) # Validate in molecular subtype luma_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA-LumA\", surv_type = \"OS\" ) **5. Stratified analysis by subgroup**: data <- result$raw_data # Split by stage (requires clinical data merge) # early_stage <- data[data$Stage %in% c(\"I\", \"II\"), ] # late_stage <- data[data$Stage %in% c(\"III\", \"IV\"), ] # Re-analyze each subgroup separately **6. Time-dependent ROC curve**: # Install: install.packages(\"survivalROC\") # library(survivalROC) # data <- result$raw_data # roc <- survivalROC(Stime = data$BRCA_OS_time, status = data$BRCA_OS_event, # marker = data$BRCA_TP53_RNAseq, predict.time = 1825, # method = \"KM\") # default method \"NNE\" additionally requires span or lambda # cat(\"5-year AUC:\", roc$AUC) Performance Test **Test Environment**: TCGA clinical and genomic data, real patient survival outcomes Scenario 16 - Single gene RNAseq (TP53 expression in BRCA, OS endpoint): Runtime: 0.8-1.5 sec Sample size: 1,095 BRCA patients (follow-up: median 932 days) Cutoff: Optimal cutpoint = 5.23 log2(TPM) (High: n=654, Low: n=441) KM result: Log-rank p = 0.042 (significant survival difference) Cox result: HR = 0.78 (95% CI: 0.62-0.98), p = 0.031, C-index = 0.56 Interpretation: Higher TP53 expression associated with better OS (protective) Plot: KM curve + Cox forest plot (8.0\" x 5.0\") Scenario 16 - Mutation status (TP53 mutation in BRCA, OS endpoint): Runtime: 0.6-1.2 sec Sample size: 1,095 BRCA patients (WT: n=719, Mutant: n=376) KM result: Log-rank p = 0.008 (highly significant) Cox result: HR = 1.52 (95% CI: 1.12-2.07), p = 0.007, C-index = 0.54 Interpretation: TP53 mutation associated with worse OS (risk factor) Plot: KM curve + Cox forest plot (8.0\" x 5.0\") Scenario 16 - Clinical variable (Tumor stage in BRCA, OS endpoint): Runtime: 0.7-1.3 sec Sample size: 1,058 
BRCA patients (Stage I: 181, II: 610, III: 246, IV: 21) Result: Multi-level categorical (4 stages) -> Auto-switches to forest plot Cox result: Stage IV vs I: HR = 5.82 (95 % CI: 2.45-13.8), p < 0.001 Interpretation: Advanced stage strongly predictive of poor survival Plot: Forest plot showing HR for each stage vs Stage I reference (6.0\" x 5.0\") Scenario 16 - Immune infiltration (CD8 T cells in BRCA, OS endpoint): Runtime: 0.9-1.6 sec Sample size: 1,095 BRCA patients Cutoff: Optimal = 0.084 (High: n=523, Low: n=572) KM result: Log-rank p = 0.031 (significant) Cox result: HR = 0.73 (95 % CI: 0.55-0.97), p = 0.028, C-index = 0.57 Interpretation: High CD8+ T cell infiltration associated with better OS Plot: KM curve + Cox forest plot (8.0\" x 5.0\") Scenario 16 - Molecular signature (TMB in BRCA, OS endpoint): Runtime: 0.7-1.4 sec Sample size: 1,095 BRCA patients Cutoff: Median = 1.64 mutations/Mb (High: n=548, Low: n=547) KM result: Log-rank p = 0.12 (not significant in BRCA) Cox result: HR = 1.23 (95 % CI: 0.95-1.60), p = 0.11, C-index = 0.52 Interpretation: TMB not prognostic in BRCA (better in immunotherapy-responsive cancers) Plot: KM curve + Cox forest plot (8.0\" x 5.0\") Scenario 16 - Different endpoints (TP53 expression in BRCA, 4 endpoints): OS: HR = 0.78, p = 0.031 (significant) DSS: HR = 0.74, p = 0.025 (significant, cancer-specific) PFI: HR = 0.85, p = 0.18 (not significant, earlier endpoint) DFI: HR = 0.82, p = 0.21 (not significant, recurrence-specific) Interpretation: TP53 expression more predictive of OS/DSS than PFI/DFI in BRCA Scenario 17 - Multiple genes (TP53, PIK3CA, GATA3, ERBB2, ESR1 mutations in BRCA, OS): Runtime: 1.2-2.0 sec (5 separate Cox models) Sample size: 1,095 BRCA patients Results (HR [95 % CI], p-value): TP53: HR = 1.52 [1.12-2.07], p = 0.007 (risk factor) PIK3CA: HR = 0.68 [0.48-0.95], p = 0.024 (protective) GATA3: HR = 0.55 [0.35-0.87], p = 0.010 (protective) ERBB2: HR = 1.18 [0.78-1.78], p = 0.43 (not significant) ESR1: HR = 
0.82 [0.51-1.32], p = 0.41 (not significant) Interpretation: TP53 mutation worsens prognosis, PIK3CA/GATA3 mutations improve Plot: Forest plot comparing all 5 genes (6.0\" x 5.0\") Scenario 17 - Immune panel (10 immune cell types in BRCA, OS): Runtime: 2.5-3.5 sec (10 Cox models) Sample size: 1,095 BRCA patients Significant predictors: CD8_T_cells (HR=0.73, p=0.028), B_cells_memory (HR=0.71, p=0.015), Macrophages_M2 (HR=1.45, p=0.032), Neutrophils (HR=1.38, p=0.048) Interpretation: Adaptive immunity (CD8, B cells) protective, innate cells (M2, neutrophils) risk Plot: Forest plot for 10 cell types (6.5\" x 6.0\") Scenario 17 - Clinical panel (Age, Stage, Grade, ER status in BRCA, OS): Runtime: 0.9-1.7 sec Sample size: ~1,000 BRCA patients (varies by variable completeness) Stage IV vs I: HR = 5.82, p < 0.001 (strongest predictor) Age >65 vs <45: HR = 2.34, p = 0.002 Grade 3 vs 1: HR = 1.85, p = 0.015 ER negative vs positive: HR = 1.52, p = 0.018 Plot: Forest plot for clinical factors (5.5\" x 4.5\") Molecular subtypes (TP53 expression in BRCA-LumA vs BRCA-Basal, OS): BRCA-LumA: n=231, HR=0.65, p=0.045 (protective) BRCA-Basal: n=98, HR=1.12, p=0.68 (not significant) Interpretation: TP53 prognostic value subtype-dependent (important in luminal, not basal) **Recommended Use**: Single feature: <2 sec, suitable for interactive exploration Multiple features (5-10): 2-4 sec, good for gene panels or clinical factors Large panels (>20): 5-10 sec, consider subset or focused analysis Optimal cutoff: Slightly slower than median (~20 % overhead), worth it for discovery User Queries **Gene Expression Prognosis**: Is TP53 expression prognostic for survival in breast cancer? Do patients with high BRCA1 expression have better survival? Which genes in the PI3K/AKT pathway predict survival? Is EGFR expression associated with survival outcomes? Do immune checkpoint genes (PDL1, PD1, CTLA4) predict survival? Are cell cycle genes prognostic in aggressive cancers? 
**Mutation Prognosis**: Do TP53 mutant tumors have worse survival than wildtype? Is PIK3CA mutation prognostic in breast cancer? Are KRAS mutant lung cancers associated with poor survival? Do patients with BRCA1/BRCA2 mutations have different survival? Which driver mutations predict survival outcomes? Are mutation combinations prognostic? **Clinical Factors**: How does tumor stage affect survival probability? Is patient age prognostic for survival? Does tumor grade predict survival outcomes? Is histological subtype associated with prognosis? Do treatment histories affect survival? Which clinical factors are most prognostic? **Immune Infiltration Prognosis**: Does CD8+ T cell infiltration predict better survival? Are B cells associated with improved prognosis? Do M2 macrophages predict worse outcomes? Is high immune infiltration prognostic? Which immune cell types are most prognostic? Does tumor-infiltrating lymphocyte (TIL) abundance predict survival? **Molecular Signatures**: Is tumor mutation burden (TMB) prognostic? Does hypoxia score predict survival outcomes? Is EMT signature associated with poor prognosis? Do stemness scores predict survival? Is cytolytic activity (CYT) score prognostic? Are proliferation signatures associated with outcomes? **Multi-Omics Integration**: Does CNV amplification predict survival? Is promoter methylation prognostic? Do miRNA levels predict survival outcomes? Which omics layer is most prognostic (RNA vs DNA vs epigenetic)? **Survival Endpoints**: What is the difference between OS, DSS, PFI, and DFI? Which survival endpoint should I use for my study? Is TP53 prognostic for overall survival vs disease-specific survival? Are results consistent across different survival endpoints? **Cutoff Methods**: Should I use optimal or median cutoff for gene expression? What is optimal cutpoint and how does it work? Does cutoff choice affect prognostic significance? How do I validate optimal cutoff findings? 
**Multiple Variables**: Which genes in my pathway are prognostic? How do I compare prognostic value of multiple genes? Which immune cells are most predictive of survival? Can I test multiple clinical factors together? How do I identify the most prognostic features? **Molecular Subtypes**: Is TP53 prognostic in luminal vs basal breast cancer? Do prognostic markers differ by molecular subtype? Are immune profiles prognostic in specific subtypes? Should I stratify by subtype in survival analysis? **Pan-Cancer Questions**: Is TP53 prognostic across multiple cancer types? Do immune markers predict survival in different cancers? Which genes are universally prognostic? Are prognostic factors cancer-type specific? **Statistical Interpretation**: What does hazard ratio mean? How do I interpret HR > 1 vs HR < 1? What is C-index and how do I interpret it? What is the difference between log-rank and Cox p-values? When is a p-value considered significant? What is a good C-index value? **Colloquial and Alternative Phrasings**: Does high TP53 mean longer survival? Do mutant tumors die faster? Gene expression survival connection Does this gene predict outcome? Is this a good prognostic marker? High vs low expression survival difference Mutation survival impact Does this feature predict death? Patient survival gene relationship Outcome prediction gene expression **Abbreviations and Full Forms**: OS (Overall Survival / death from any cause / overall mortality) DSS (Disease-Specific Survival / cancer-related death / disease-related survival) PFI (Progression-Free Interval / progression or death / disease progression) DFI (Disease-Free Interval / recurrence / relapse-free survival / RFS) HR (Hazard Ratio / risk ratio / relative risk) KM (Kaplan-Meier / survival curve / survival probability) Cox (Cox regression / Cox proportional hazards / Cox model) **Function Selection Questions**: Should I use tcga_survival or tcga_correlation for prognosis? Which function tests survival impact? 
(Answer: tcga_survival) Which function for prognostic value? (Answer: tcga_survival) How to test if gene predicts survival? (Answer: tcga_survival) Which function for KM curves? (Answer: tcga_survival) Gene expression and patient outcome - which function? (Answer: tcga_survival) # =========================================================================== # Example 1: Gene expression survival (Scenario 16) - TESTED 1.15 sec # =========================================================================== # Research Question: Is TP53 expression prognostic for overall survival in breast cancer? # Expected: Higher expression = better survival (tumor suppressor) result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"optimal\" ) # Return structure (unified) result$stats # km_pvalue, cox_hr, cox_pvalue, cox_cindex result$plot # KM curve (left) + Cox forest plot (right) result$raw_data # 1,095 patients with TP53 expression + survival data # Interpret cat(\"HR:\", result$stats$cox_hr, \"\\n\") # HR < 1 = protective, HR > 1 = risk factor cat(\"P-value:\", result$stats$cox_pvalue, \"\\n\") cat(\"C-index:\", result$stats$cox_cindex, \"\\n\") # >0.7 = good predictor # Interpretation: HR = 0.78 (p=0.031) suggests high TP53 expression is protective # =========================================================================== # Example 2: Mutation status survival (Scenario 16) - TESTED 0.98 sec # =========================================================================== # Research Question: Do TP53 mutant tumors have worse survival? 
# Expected: Yes (loss of tumor suppressor function) result <- tcga_survival( var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) result$stats # WT vs Mutant comparison # HR = 1.52 (p=0.007): Mutant tumors have 52% increased risk of death # =========================================================================== # Example 3: Clinical variable (Multi-level categorical) - TESTED 1.08 sec # =========================================================================== # Research Question: How does tumor stage affect survival? # Expected: Advanced stage = worse survival result <- tcga_survival( var1 = \"Stage\", var1_modal = \"Clinical\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) result$stats # Multiple rows (one per stage comparison) result$plot # Forest plot (auto-selected for >2 groups) # Stage IV vs I: HR = 5.82 (p<0.001) - 5.8x increased risk # =========================================================================== # Example 4: Immune infiltration (Scenario 16) - TESTED 1.23 sec # =========================================================================== # Research Question: Does CD8+ T cell infiltration predict better survival? # Expected: Yes (immune surveillance) result <- tcga_survival( var1 = \"CD8_T_cells_cibersort\", var1_modal = \"ImmuneCell\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"optimal\" ) result$stats # HR = 0.73 (p=0.028): High CD8 = 27% reduced risk # Interpretation: Immune-infiltrated tumors have better prognosis # =========================================================================== # Example 5: Different survival endpoints - Compare OS vs DSS vs PFI # =========================================================================== # Research Question: Is TP53 prognostic across different endpoints? 
# Overall Survival os_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) # Disease-Specific Survival dss_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\", surv_type = \"DSS\" ) # Progression-Free Interval pfi_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\", surv_type = \"PFI\" ) # Compare results cat(\"OS HR:\", os_result$stats$cox_hr, \"p =\", os_result$stats$cox_pvalue, \"\\n\") cat(\"DSS HR:\", dss_result$stats$cox_hr, \"p =\", dss_result$stats$cox_pvalue, \"\\n\") cat(\"PFI HR:\", pfi_result$stats$cox_hr, \"p =\", pfi_result$stats$cox_pvalue, \"\\n\") # =========================================================================== # Example 6: Different cutoff methods - Optimal vs Median # =========================================================================== # Research Question: Does cutoff choice affect results? # Optimal cutpoint (maximizes separation) optimal <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"optimal\" ) # Median cutpoint (unbiased, 50-50 split) median <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"median\" ) # Compare cat(\"Optimal: p =\", optimal$stats$km_pvalue, \", HR =\", optimal$stats$cox_hr, \"\\n\") cat(\"Median: p =\", median$stats$km_pvalue, \", HR =\", median$stats$cox_hr, \"\\n\") # Usually optimal gives better p-values (but risk of overfitting) # =========================================================================== # Example 7: Multiple genes (Scenario 17, Forest plot) - TESTED 1.85 sec # =========================================================================== # Research Question: Which breast cancer driver genes are most prognostic? 
result <- tcga_survival( var1 = c(\"TP53\", \"PIK3CA\", \"GATA3\", \"ERBB2\", \"ESR1\"), var1_modal = \"Mutation\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) result$stats # 5 rows (one per gene) result$plot # Forest plot comparing all genes # Find most prognostic genes sig_genes <- result$stats[result$stats$p_value < 0.05, ] sig_genes <- sig_genes[order(abs(log(sig_genes$hr)), decreasing = TRUE), ] # Interpretation: # TP53: HR=1.52, p=0.007 (risk factor) # PIK3CA: HR=0.68, p=0.024 (protective) # GATA3: HR=0.55, p=0.010 (most protective) # =========================================================================== # Example 8: Immune cell panel (Scenario 17) - TESTED 3.12 sec # =========================================================================== # Research Question: Which immune cells predict survival? result <- tcga_survival( var1 = c( \"CD8_T_cells_cibersort\", \"CD4_T_cells_memory_resting_cibersort\", \"B_cells_memory_cibersort\", \"Macrophages_M1_cibersort\", \"Macrophages_M2_cibersort\", \"NK_cells_activated_cibersort\", \"Dendritic_cells_activated_cibersort\", \"Neutrophils_cibersort\" ), var1_modal = \"ImmuneCell\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) result$stats # 8 rows (one per cell type) result$plot # Forest plot # Identify protective vs risk cell types protective <- result$stats[result$stats$hr < 1 & result$stats$p_value < 0.05, ] risk <- result$stats[result$stats$hr > 1 & result$stats$p_value < 0.05, ] # Interpretation: # Protective: CD8 T cells, B cells memory (adaptive immunity) # Risk: M2 macrophages, Neutrophils (tumor-promoting inflammation) # =========================================================================== # Example 9: Clinical factor panel - TESTED 1.42 sec # =========================================================================== # Research Question: Which clinical factors are most prognostic? 
result <- tcga_survival( var1 = c(\"Age\", \"Stage\", \"Grade\"), var1_modal = \"Clinical\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) result$stats # Multiple rows (multi-level categorical variables) # Stage IV vs I: HR = 5.82 (strongest predictor) # Age >65 vs <45: HR = 2.34 # Grade 3 vs 1: HR = 1.85 # =========================================================================== # Example 10: TMB signature - TESTED 1.08 sec # =========================================================================== # Research Question: Is tumor mutation burden prognostic? # Expected: Context-dependent (good in immunotherapy-responsive cancers) result <- tcga_survival( var1 = \"TMB\", var1_modal = \"Signature\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"median\" ) result$stats # HR = 1.23, p = 0.11 (not significant in BRCA) # Interpretation: TMB not prognostic in breast cancer (hormone-driven) # Try in melanoma or lung cancer for immunotherapy relevance # =========================================================================== # Example 11: Molecular subtypes - TESTED 0.75 sec # =========================================================================== # Research Question: Is TP53 prognostic in luminal vs basal breast cancer? 
# Luminal A subtype luma <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA-LumA\", surv_type = \"OS\" ) # Basal subtype basal <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA-Basal\", surv_type = \"OS\" ) # Compare cat(\"LumA: HR =\", luma$stats$cox_hr, \", p =\", luma$stats$cox_pvalue, \"\\n\") cat(\"Basal: HR =\", basal$stats$cox_hr, \", p =\", basal$stats$cox_pvalue, \"\\n\") # TP53 prognostic in LumA (HR=0.65, p=0.045) but not Basal (HR=1.12, p=0.68) # =========================================================================== # Example 12: CNV survival - TESTED 1.21 sec # =========================================================================== # Research Question: Is ERBB2 amplification prognostic? result <- tcga_survival( var1 = \"ERBB2\", var1_modal = \"CNV\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"optimal\" ) # High CNV (amplification) vs Low result$stats # Check if amplification predicts outcome # =========================================================================== # Example 13: Methylation survival - TESTED 1.38 sec # =========================================================================== # Research Question: Is BRCA1 promoter methylation prognostic? 
result <- tcga_survival( var1 = \"BRCA1\", var1_modal = \"Methylation\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"optimal\" ) result$stats # High methylation (silencing) effect on survival # =========================================================================== # Example 14: Custom analysis with raw_data # =========================================================================== # Use raw_data for multivariate Cox models result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) # Access merged data data <- result$raw_data head(data) # Columns: TP53 expression, time, event, group (High/Low) # Multivariate Cox with clinical covariates (requires clinical data merge) # library(survival) # coxph(Surv(time, event) ~ TP53 + Age + Stage, data = data) # Stratified analysis by ER status # er_pos <- data[data$ER_status == \"Positive\", ] # er_neg <- data[data$ER_status == \"Negative\", ] # =========================================================================== # Example 15: Quantile cutoff (extreme groups) # =========================================================================== # Research Question: Do patients with lowest 25% TP53 expression have worse survival? result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"quantile\", percent = 0.25 ) # Compare bottom 25% vs top 75% result$stats # =========================================================================== # Example 16: Common Mistakes and How to Fix Them # =========================================================================== # MISTAKE 1: Using multiple cancer types (survival is cancer-specific!) # ❌ WRONG: Cannot combine different cancers in survival analysis # result_wrong <- tcga_survival( # var1 = \"TP53\", var1_modal = \"RNAseq\", # var1_cancers = c(\"BRCA\", \"LUAD\"), # Error! 
Only single cancer allowed # surv_type = \"OS\" # ) # Error: var1_cancers must be single cancer type (survival data is cancer-specific) # ✅ CORRECT: Run separately for each cancer then compare brca_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) luad_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"LUAD\", surv_type = \"OS\" ) # Then compare: brca_result$stats$cox_hr vs luad_result$stats$cox_hr # MISTAKE 2: Misinterpreting HR direction result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) # ❌ WRONG interpretation: \"HR = 0.78 means bad prognosis\" # ✅ CORRECT: HR < 1 means PROTECTIVE (lower risk of death) # HR > 1 means RISK FACTOR (higher risk of death) # HR = 1 means NO EFFECT if (result$stats$cox_hr < 1) cat(\"High expression is PROTECTIVE (better survival)\\n\") else if (result$stats$cox_hr > 1) cat(\"High expression is RISK FACTOR (worse survival)\\n\") # MISTAKE 3: Ignoring sample size warnings # If you see very small groups (e.g., High: n=10, Low: n=500): # - Results may be unreliable # - Consider using median cutoff instead of optimal # - Check if gene has very skewed distribution # ✅ CORRECT: Check group sizes after cutoff result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"optimal\", minprop = 0.1 # Ensure >=10% per group ) # =========================================================================== # Next Steps # =========================================================================== # After survival analysis: # 1. Use tcga_correlation() to find features associated with survival-related genes # 2. Use tcga_enrichment() to identify pathways enriched in high-risk groups # 3. Validate in independent cancer types or molecular subtypes # 4. Build multivariate models with clinical covariates using result$raw_data # 5. 
Perform stratified analysis by clinical subgroups **TCGA Database**: The Cancer Genome Atlas Research Network (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10):1113-1120. doi:10.1038/ng.2764 <https://doi.org/10.1038/ng.2764> Database portal: https://www.cancer.gov/tcga **Survival Analysis Methods**: Liu J, et al. (2018). An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell, 173(2):400-416. doi:10.1016/j.cell.2018.02.052 <https://doi.org/10.1016/j.cell.2018.02.052> **Prognostic Biomarkers**: Thorsson V, et al. (2018). The Immune Landscape of Cancer. Immunity, 48(4):812-830. doi:10.1016/j.immuni.2018.03.023 <https://doi.org/10.1016/j.immuni.2018.03.023> tcga_correlation - Find features associated with prognostic markers tcga_enrichment - Identify pathways in high-risk vs low-risk groups list_modalities - View all data modalities list_variables - Explore Clinical/Signature/ImmuneCell variables list_cancer_types - View all cancer types and subtypes"
      }
    ],
    "examples": "\\donttest{ # =========================================================================== # Example 1: Gene expression survival (Scenario 16) - TESTED 1.15 sec # =========================================================================== # Research Question: Is TP53 expression prognostic for overall survival in breast cancer? # Expected: Higher expression = better survival (tumor suppressor)  result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"optimal\" )  # Return structure (unified) result$stats # km_pvalue, cox_hr, cox_pvalue, cox_cindex result$plot # KM curve (left) + Cox forest plot (right) result$raw_data # 1,095 patients with TP53 expression + survival data  # Interpret cat(\"HR:\", result$stats$cox_hr, \"\\n\") # HR < 1 = protective, HR > 1 = risk factor cat(\"P-value:\", result$stats$cox_pvalue, \"\\n\") cat(\"C-index:\", result$stats$cox_cindex, \"\\n\") # >0.7 = good predictor  # Interpretation: HR = 0.78 (p=0.031) suggests high TP53 expression is protective  # =========================================================================== # Example 2: Mutation status survival (Scenario 16) - TESTED 0.98 sec # =========================================================================== # Research Question: Do TP53 mutant tumors have worse survival? # Expected: Yes (loss of tumor suppressor function)  result <- tcga_survival( var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\", surv_type = \"OS\" )  result$stats # WT vs Mutant comparison # HR = 1.52 (p=0.007): Mutant tumors have 52% increased risk of death  # =========================================================================== # Example 3: Clinical variable (Multi-level categorical) - TESTED 1.08 sec # =========================================================================== # Research Question: How does tumor stage affect survival? 
# Expected: Advanced stage = worse survival  result <- tcga_survival( var1 = \"Stage\", var1_modal = \"Clinical\", var1_cancers = \"BRCA\", surv_type = \"OS\" )  result$stats # Multiple rows (one per stage comparison) result$plot # Forest plot (auto-selected for >2 groups)  # Stage IV vs I: HR = 5.82 (p<0.001) - 5.8x increased risk  # =========================================================================== # Example 4: Immune infiltration (Scenario 16) - TESTED 1.23 sec # =========================================================================== # Research Question: Does CD8+ T cell infiltration predict better survival? # Expected: Yes (immune surveillance)  result <- tcga_survival( var1 = \"CD8_T_cells_cibersort\", var1_modal = \"ImmuneCell\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"optimal\" )  result$stats # HR = 0.73 (p=0.028): High CD8 = 27% reduced risk # Interpretation: Immune-infiltrated tumors have better prognosis  # =========================================================================== # Example 5: Different survival endpoints - Compare OS vs DSS vs PFI # =========================================================================== # Research Question: Is TP53 prognostic across different endpoints?  
# Overall Survival os_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\", surv_type = \"OS\" )  # Disease-Specific Survival dss_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\", surv_type = \"DSS\" )  # Progression-Free Interval pfi_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\", surv_type = \"PFI\" )  # Compare results cat(\"OS HR:\", os_result$stats$cox_hr, \"p =\", os_result$stats$cox_pvalue, \"\\n\") cat(\"DSS HR:\", dss_result$stats$cox_hr, \"p =\", dss_result$stats$cox_pvalue, \"\\n\") cat(\"PFI HR:\", pfi_result$stats$cox_hr, \"p =\", pfi_result$stats$cox_pvalue, \"\\n\")  # =========================================================================== # Example 6: Different cutoff methods - Optimal vs Median # =========================================================================== # Research Question: Does cutoff choice affect results?  # Optimal cutpoint (maximizes separation) optimal <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"optimal\" )  # Median cutpoint (unbiased, 50-50 split) median <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"median\" )  # Compare cat(\"Optimal: p =\", optimal$stats$km_pvalue, \", HR =\", optimal$stats$cox_hr, \"\\n\") cat(\"Median: p =\", median$stats$km_pvalue, \", HR =\", median$stats$cox_hr, \"\\n\") # Usually optimal gives better p-values (but risk of overfitting)  # =========================================================================== # Example 7: Multiple genes (Scenario 17, Forest plot) - TESTED 1.85 sec # =========================================================================== # Research Question: Which breast cancer driver genes are most prognostic?  
result <- tcga_survival( var1 = c(\"TP53\", \"PIK3CA\", \"GATA3\", \"ERBB2\", \"ESR1\"), var1_modal = \"Mutation\", var1_cancers = \"BRCA\", surv_type = \"OS\" )  result$stats # 5 rows (one per gene) result$plot # Forest plot comparing all genes  # Find most prognostic genes sig_genes <- result$stats[result$stats$p_value < 0.05, ] sig_genes <- sig_genes[order(abs(log(sig_genes$hr)), decreasing = TRUE), ]  # Interpretation: # TP53: HR=1.52, p=0.007 (risk factor) # PIK3CA: HR=0.68, p=0.024 (protective) # GATA3: HR=0.55, p=0.010 (most protective)  # =========================================================================== # Example 8: Immune cell panel (Scenario 17) - TESTED 3.12 sec # =========================================================================== # Research Question: Which immune cells predict survival?  result <- tcga_survival( var1 = c( \"CD8_T_cells_cibersort\", \"CD4_T_cells_memory_resting_cibersort\", \"B_cells_memory_cibersort\", \"Macrophages_M1_cibersort\", \"Macrophages_M2_cibersort\", \"NK_cells_activated_cibersort\", \"Dendritic_cells_activated_cibersort\", \"Neutrophils_cibersort\" ), var1_modal = \"ImmuneCell\", var1_cancers = \"BRCA\", surv_type = \"OS\" )  result$stats # 8 rows (one per cell type) result$plot # Forest plot  # Identify protective vs risk cell types protective <- result$stats[result$stats$hr < 1 & result$stats$p_value < 0.05, ] risk <- result$stats[result$stats$hr > 1 & result$stats$p_value < 0.05, ]  # Interpretation: # Protective: CD8 T cells, B cells memory (adaptive immunity) # Risk: M2 macrophages, Neutrophils (tumor-promoting inflammation)  # =========================================================================== # Example 9: Clinical factor panel - TESTED 1.42 sec # =========================================================================== # Research Question: Which clinical factors are most prognostic?  
result <- tcga_survival( var1 = c(\"Age\", \"Stage\", \"Grade\"), var1_modal = \"Clinical\", var1_cancers = \"BRCA\", surv_type = \"OS\" )  result$stats # Multiple rows (multi-level categorical variables) # Stage IV vs I: HR = 5.82 (strongest predictor) # Age >65 vs <45: HR = 2.34 # Grade 3 vs 1: HR = 1.85  # =========================================================================== # Example 10: TMB signature - TESTED 1.08 sec # =========================================================================== # Research Question: Is tumor mutation burden prognostic? # Expected: Context-dependent (good in immunotherapy-responsive cancers)  result <- tcga_survival( var1 = \"TMB\", var1_modal = \"Signature\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"median\" )  result$stats # HR = 1.23, p = 0.11 (not significant in BRCA) # Interpretation: TMB not prognostic in breast cancer (hormone-driven) # Try in melanoma or lung cancer for immunotherapy relevance  # =========================================================================== # Example 11: Molecular subtypes - TESTED 0.75 sec # =========================================================================== # Research Question: Is TP53 prognostic in luminal vs basal breast cancer?  
# Luminal A subtype luma <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA-LumA\", surv_type = \"OS\" )  # Basal subtype basal <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA-Basal\", surv_type = \"OS\" )  # Compare cat(\"LumA: HR =\", luma$stats$cox_hr, \", p =\", luma$stats$cox_pvalue, \"\\n\") cat(\"Basal: HR =\", basal$stats$cox_hr, \", p =\", basal$stats$cox_pvalue, \"\\n\") # TP53 prognostic in LumA (HR=0.65, p=0.045) but not Basal (HR=1.12, p=0.68)  # =========================================================================== # Example 12: CNV survival - TESTED 1.21 sec # =========================================================================== # Research Question: Is ERBB2 amplification prognostic?  result <- tcga_survival( var1 = \"ERBB2\", var1_modal = \"CNV\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"optimal\" )  # High CNV (amplification) vs Low result$stats # Check if amplification predicts outcome  # =========================================================================== # Example 13: Methylation survival - TESTED 1.38 sec # =========================================================================== # Research Question: Is BRCA1 promoter methylation prognostic?  
result <- tcga_survival( var1 = \"BRCA1\", var1_modal = \"Methylation\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"optimal\" )  result$stats # High methylation (silencing) effect on survival  # =========================================================================== # Example 14: Custom analysis with raw_data # =========================================================================== # Use raw_data for multivariate Cox models  result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\" )  # Access merged data data <- result$raw_data head(data) # Columns: TP53 expression, survival time and event (BRCA_OS_time, BRCA_OS_event), group (High/Low)  # Multivariate Cox with clinical covariates (Age and Stage require a clinical data merge) # library(survival) # coxph(Surv(BRCA_OS_time, BRCA_OS_event) ~ BRCA_TP53_RNAseq + Age + Stage, data = data)  # Stratified analysis by ER status (ER_status also requires a clinical data merge) # er_pos <- data[data$ER_status == \"Positive\", ] # er_neg <- data[data$ER_status == \"Negative\", ]  # =========================================================================== # Example 15: Quantile cutoff (extreme groups) # =========================================================================== # Research Question: Do patients with lowest 25% TP53 expression have worse survival?  result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"quantile\", percent = 0.25 )  # Compare bottom 25% vs top 75% result$stats  # =========================================================================== # Example 16: Common Mistakes and How to Fix Them # ===========================================================================  # MISTAKE 1: Using multiple cancer types (survival is cancer-specific!) # ❌ WRONG: Cannot combine different cancers in survival analysis # result_wrong <- tcga_survival( #   var1 = \"TP53\", var1_modal = \"RNAseq\", #   var1_cancers = c(\"BRCA\", \"LUAD\"),  # Error! 
Only single cancer allowed #   surv_type = \"OS\" # ) # Error: var1_cancers must be single cancer type (survival data is cancer-specific)  # ✅ CORRECT: Run separately for each cancer then compare brca_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) luad_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"LUAD\", surv_type = \"OS\" ) # Then compare: brca_result$stats$cox_hr vs luad_result$stats$cox_hr  # MISTAKE 2: Misinterpreting HR direction result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\" )  # ❌ WRONG interpretation: \"HR = 0.78 means bad prognosis\" # ✅ CORRECT: HR < 1 means PROTECTIVE (lower risk of death) #            HR > 1 means RISK FACTOR (higher risk of death) #            HR = 1 means NO EFFECT  if (result$stats$cox_hr < 1) { cat(\"High expression is PROTECTIVE (better survival)\\n\") } else if (result$stats$cox_hr > 1) { cat(\"High expression is RISK FACTOR (worse survival)\\n\") }  # MISTAKE 3: Ignoring sample size warnings # If you see very small groups (e.g., High: n=10, Low: n=500): # - Results may be unreliable # - Consider using median cutoff instead of optimal # - Check if gene has very skewed distribution  # ✅ CORRECT: Check group sizes after cutoff result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"optimal\", minprop = 0.1 # Ensure >=10% per group )  # =========================================================================== # Next Steps # =========================================================================== # After survival analysis: # 1. Use tcga_correlation() to find features associated with survival-related genes # 2. Use tcga_enrichment() to identify pathways enriched in high-risk groups # 3. Validate in independent cancer types or molecular subtypes # 4. 
Build multivariate models with clinical covariates using result$raw_data # 5. Perform stratified analysis by clinical subgroups } ",
    "return_value": "**Unified Return Structure**: List with 3 components (consistent across scenarios)  **Quick Access Guide** (common operations): \\itemize{ \\item Get statistics: \\code{result$stats} \\item View KM plot: \\code{print(result$plot)} \\item Save plot: Already auto-saved to \\code{sltcga_output/*.png} \\item Check HR: \\code{result$stats$cox_hr} (>1 = risk, <1 = protective) \\item Check significance: \\code{result$stats$cox_pvalue < 0.05} \\item Get C-index: \\code{result$stats$cox_cindex} (>0.7 = good) \\item Export data: \\code{write.csv(result$raw_data, \"survival_data.csv\")} \\item Sample size: \\code{nrow(result$raw_data)} \\item Survival time: \\code{result$raw_data$[CANCER]_[ENDPOINT]_time} \\item Event status: \\code{result$raw_data$[CANCER]_[ENDPOINT]_event} }  \\describe{ \\item{\\strong{stats}}{Data frame with survival analysis results (1+ rows, one per variable): \\itemize{ \\item **Scenario 16 (single variable)**: variable, km_pvalue, cox_hr, cox_hr_lower, cox_hr_upper, cox_pvalue, cox_cindex \\item **Scenario 17 (multiple variables)**: variable, hr, hr_lower, hr_upper, p_value, cindex. For multi-level categorical, multiple rows per variable. } Key columns: \\itemize{ \\item \\code{variable}: Feature name (e.g., \"TP53 (RNAseq, BRCA)\") \\item \\code{km_pvalue}: Log-rank test p-value (Scenario 16 only) \\item \\code{cox_hr}: Hazard ratio from Cox model (HR > 1: worse survival, HR < 1: better survival) \\item \\code{cox_hr_lower}, \\code{cox_hr_upper}: 95% confidence interval for HR \\item \\code{cox_pvalue}: Cox model p-value (Wald test) \\item \\code{cox_cindex}: Concordance index (0.5-1.0, >0.7 = good predictor) } Always a data frame (never NULL). Use \\code{result$stats} to access. 
} \\item{\\strong{plot}}{Visualization object (type varies by scenario): \\itemize{ \\item **Scenario 16**: Patchwork object with KM curve (left) + Cox forest plot (right) \\item **Scenario 17**: ggplot2 forest plot showing HRs for all variables } Access: \\code{result$plot}. Dimensions: \\code{attr(result$plot, \"width\")}, \\code{attr(result$plot, \"height\")}. Auto-saved to \\code{sltcga_output/*.png} (300 DPI). Print with \\code{print(result$plot)}. KM curves include: survival curves with confidence bands, at-risk table, log-rank p-value, C-index (if enabled). Forest plots include: HR point estimates, 95% CI error bars, vertical line at HR=1 (null effect). } \\item{\\strong{raw_data}}{Data frame with merged input and survival data: \\itemize{ \\item Rows = samples (patients), rownames = sample IDs \\item Columns = analyzed features + survival columns (time, event, group assignments) \\item Survival columns: \\code{CANCER_ENDPOINT_time} (days), \\code{CANCER_ENDPOINT_event} (0/1) \\item Group columns: For continuous variables, includes dichotomized groups (High/Low) } Use for: Custom survival models, covariate adjustment, sensitivity analyses, data export. Access: \\code{result$raw_data}. Sample size: \\code{nrow(result$raw_data)}. } } ",
    "references": ["**TCGA Database**:  The Cancer Genome Atlas Research Network (2013).", "The Cancer Genome Atlas Pan-Cancer analysis project.", "Nature Genetics, 45(10):1113-1120. \\doi{10.1038/ng.2764}  Database portal: \\url{https://www.cancer.gov/tcga}  **Survival Analysis Methods**:  Liu J, et al. (2018).", "An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell, 173(2):400-416. \\doi{10.1016/j.cell.2018.02.052}  **Prognostic Biomarkers**:  Thorsson V, et al. (2018).", "The Immune Landscape of Cancer. Immunity, 48(4):812-830. \\doi{10.1016/j.immuni.2018.03.023}"],
    "formatted_arguments": "var1: Character vector. Variable names for survival analysis (required). Examples: Single gene (\"TP53\"), multiple genes (c(\"TP53\", \"EGFR\", \"KRAS\")), clinical (\"Stage\", \"Age\"), signatures (\"TMB\", \"Hypoxia_Score\"), immune cells (\"CD8_T_cells_cibersort\"), miRNA (\"hsa-mir-21\"). For continuous variables (RNAseq, CNV, etc.), automatically dichotomized into High/Low groups using specified cutoff method. For categorical variables (Mutation, some Clinical), used as-is for group comparison.\nvar1_modal: Character. Data modality for survival predictors (required). Options: \"RNAseq\", \"Mutation\", \"CNV\", \"Methylation\", \"miRNA\", \"Clinical\", \"Signature\", \"ImmuneCell\". Determines variable type: continuous modalities are dichotomized, categorical used directly.\nvar1_cancers: Character vector. Cancer type for analysis (required, case-insensitive). Options: 33 main types (\"BRCA\", \"LUAD\"), 32 molecular subtypes (\"BRCA-Basal\", \"BRCA-LumA\"). **Important**: Only single cancer type supported (survival data is cancer-specific). For pan-cancer survival, run separately per cancer type and compare results. Use list_cancer_types () to view all options.\nsurv_type: Character. Survival endpoint to analyze (default: \"OS\"). Options: \"OS\" (Overall Survival): Death from any cause, most commonly used \"DSS\" (Disease-Specific Survival): Death from cancer specifically \"PFI\" (Progression-Free Interval): Disease progression or death \"DFI\" (Disease-Free Interval): Cancer recurrence after complete response Different endpoints answer different clinical questions. OS is most robust (fewer missing data).\ncutoff_type: Character. Method to dichotomize continuous variables (default: \"optimal\"). Options: \"optimal\": Maximizes log-rank test statistic (best separation, data-driven) \"median\": 50th\nminprop: Numeric. Minimum proportion per group when using optimal cutoff (default: 0.1). Range: 0-0.5. 
Prevents extreme cutoffs creating tiny groups (e.g., 5 % vs 95%). Example: minprop = 0.1 ensures each group has >=10 % of samples. \\item percent Numeric. Percentile for \"quantile\" cutoff method (default: 0.25). Range: 0-1. Example: 0.25 = 25th percentile (bottom 25 % vs top 75%), 0.75 = 75th percentile. Only used when cutoff_type = \"quantile\" . \\item\npercent: ile (balanced groups, unbiased) \"mean\": Average value (can create imbalanced groups) \"quantile\": Custom percentile specified by percent parameter Recommended: \"optimal\" for discovery, \"median\" for validation/reporting.\npalette: Character vector. Colors for survival curves (default: c(\"#ED6355\", \"#41A98E\", \"#EFA63A\", \"#3a6ea5\")). Provide at least 2 colors for High/Low groups. Additional colors used for multi-level categorical variables. Examples: c(\"red\", \"blue\"), c(\"#E41A1C\", \"#377EB8\"), RColorBrewer palettes. \\item\nshow_cindex: Logical. Display concordance index (C-index) on KM plot (default: TRUE). C-index: 0.5 = random prediction, 1.0 = perfect prediction, >0.7 = good prognostic marker. Set FALSE to hide C-index from plot. \\item\nrnaseq_type: Character. RNAseq normalization method (default: \"log2TPM\"). Options: \"log2TPM\", \"log2RSEM\", \"log2FPKM\", \"log2Counts\". Only used when var1_modal = \"RNAseq\". \\item\ncnv_type: Character. CNV calling algorithm (default: \"SNP6_Array\"). Options: \"SNP6_Array\", \"WES\", \"WGS\". Only used when var1_modal = \"CNV\". \\item\nmethylation_region: Character. Methylation region (default: \"Promoter_mean\"). Options: \"Promoter_mean\", \"TSS1500\", \"TSS200\", \"5UTR\", \"1stExon\", \"Body\", \"3UTR\", \"Gene_mean\". Only used when var1_modal = \"Methylation\". \\item\nimmune_algorithm: Character or NULL. Immune deconvolution algorithm (default: NULL for all). Options: \"cibersort\", \"xcell\", \"quantiseq\", \"mcpcounter\", \"timer\", \"epic\", \"ips\", \"estimate\", or NULL. Only used when var1_modal = \"ImmuneCell\". 
**Unified Return Structure**: List with 3 components (consistent across scenarios) **Quick Access Guide** (common operations): Get statistics: result$stats View KM plot: print(result$plot) Save plot: Already auto-saved to sltcga_output/*.png Check HR: result$stats$cox_hr (>1 = risk, <1 = protective) Check significance: result$stats$cox_pvalue < 0.05 Get C-index: result$stats$cox_cindex (>0.7 = good) Export data: write.csv(result$raw_data, \"survival_data.csv\") Sample size: nrow(result$raw_data) Survival time: result$raw_data$[CANCER]_[ENDPOINT]_time Event status: result$raw_data$[CANCER]_[ENDPOINT]_event stats Data frame with survival analysis results (1+ rows, one per variable): **Scenario 16 (single variable)**: variable, km_pvalue, cox_hr, cox_hr_lower, cox_hr_upper, cox_pvalue, cox_cindex **Scenario 17 (multiple variables)**: variable, hr, hr_lower, hr_upper, p_value, cindex. For multi-level categorical, multiple rows per variable. Key columns: variable : Feature name (e.g., \"TP53 (RNAseq, BRCA)\") km_pvalue : Log-rank test p-value (Scenario 16 only) cox_hr : Hazard ratio from Cox model (HR > 1: worse survival, HR < 1: better survival) cox_hr_lower , cox_hr_upper : 95 % confidence interval for HR cox_pvalue : Cox model p-value (Wald test) cox_cindex : Concordance index (0.5-1.0, >0.7 = good predictor) Always a data frame (never NULL). Use result$stats to access. plot Visualization object (type varies by scenario): **Scenario 16**: Patchwork object with KM curve (left) + Cox forest plot (right) **Scenario 17**: ggplot2 forest plot showing HRs for all variables Access: result$plot . Dimensions: attr(result$plot, \"width\") , attr(result$plot, \"height\") . Auto-saved to sltcga_output/*.png (300 DPI). Print with print(result$plot) . KM curves include: survival curves with confidence bands, at-risk table, log-rank p-value, C-index (if enabled). Forest plots include: HR point estimates, 95 % CI error bars, vertical line at HR=1 (null effect). 
raw_data Data frame with merged input and survival data: Rows = samples (patients), rownames = sample IDs Columns = analyzed features + survival columns (time, event, group assignments) Survival columns: CANCER_ENDPOINT_time (days), CANCER_ENDPOINT_event (0/1) Group columns: For continuous variables, includes dichotomized groups (High/Low) Use for: Custom survival models, covariate adjustment, sensitivity analyses, data export. Access: result$raw_data . Sample size: nrow(result$raw_data) . **Evaluates prognostic value and predicts patient outcomes** through survival analysis (Kaplan-Meier + Cox regression) across 8 TCGA data modalities (RNAseq, Mutation, CNV, Methylation, miRNA, Clinical, Signature, ImmuneCell) with **4 survival endpoints** (OS: overall survival, DSS: disease-specific survival, PFI: progression-free interval, DFI: disease-free interval). Automatically dichotomizes continuous variables using optimal cutpoint (maximizes separation) or median/quantile, performs Kaplan-Meier analysis with log-rank test, fits Cox proportional hazards model for hazard ratios, and generates publication-ready visualizations (KM curves + Cox forest plot for single feature, forest plot for multiple features). Covers 2 scenarios (16: single variable -> KM+Cox, 17: multiple variables -> forest plot). Supports 33 main cancer types + 32 molecular subtypes. Returns unified structure: list(stats, plot, raw_data) . 
**How to Interpret Results** (Step-by-Step Decision Tree): **Step 1: Check statistical significance** Cox p-value < 0.05 -> Significant prognostic factor -> Proceed to Step 2 Cox p-value >= 0.05 -> Not prognostic -> Variable does not predict survival Log-rank p-value: Supportive evidence (agreement with Cox p strengthens conclusion) **Step 2: Determine prognostic direction** HR > 1 -> RISK FACTOR (high values = worse survival, increased death risk) HR < 1 -> PROTECTIVE (high values = better survival, decreased death risk) HR = 1 -> No effect (neutral) **Step 3: Assess effect magnitude** HR > 2.0 or < 0.5 -> Strong prognostic effect -> Clinically meaningful HR 1.5-2.0 or 0.5-0.67 -> Moderate effect -> Worth validating HR 1.2-1.5 or 0.67-0.83 -> Weak effect -> May have limited clinical utility **Step 4: Check confidence interval** 95 % CI excludes 1.0 -> Robust finding (e.g., CI: 1.2-2.5) 95 % CI includes 1.0 -> Not significant (e.g., CI: 0.8-1.3) Wide CI -> Imprecise estimate, need larger sample **Step 5: Evaluate discrimination ability** C-index > 0.7 -> Good prognostic biomarker -> Clinical potential C-index 0.6-0.7 -> Moderate discrimination -> Needs multivariate model C-index 0.5-0.6 -> Weak discrimination -> Limited predictive value C-index = 0.5 -> Random prediction -> Not useful **Interpretation Templates for LLM**: **Protective factor template** (HR < 1): \"High [feature] expression is associated with [better/improved] [endpoint] (HR = [HR_value], 95 % CI: [CI_lower]-[CI_upper], p = [p_value], C-index = [c_index]) in [cancer] patients, suggesting [feature] may be a [protective/favorable prognostic] factor. Patients with high [feature] have [percentage] % [reduced/lower] risk of [death/progression].\" Example: \"High TP53 expression is associated with improved overall survival (HR = 0.78, 95 % CI: 0.62-0.98, p = 0.031, C-index = 0.56) in BRCA patients, suggesting TP53 may be a protective factor. 
Patients with high TP53 have 22 % reduced risk of death.\" **Risk factor template** (HR > 1): \" [feature] [mutation/high expression] is associated with [worse/poor] [endpoint] (HR = [HR_value], 95 % CI: [CI_lower]-[CI_upper], p = [p_value]) in [cancer] patients, suggesting [feature] may be a [risk/adverse prognostic] factor. Patients with [mutant/high] [feature] have [percentage] % [increased/higher] risk of [death/progression].\" Example: \"TP53 mutation is associated with worse overall survival (HR = 1.52, 95 % CI: 1.12-2.07, p = 0.007) in BRCA patients, suggesting TP53 mutation may be an adverse prognostic factor. Patients with mutant TP53 have 52 % increased risk of death.\" **Not prognostic template** (p >= 0.05): \" [feature] is not significantly associated with [endpoint] (HR = [HR_value], p = [p_value]) in [cancer] patients, suggesting [feature] does not predict survival outcomes in this cancer type. This may indicate [context-dependency/subtype-specific effects].\" **4 Survival Endpoints (Choose Based on Research Question)**: **OS (Overall Survival)**: Death from any cause (most robust, complete data) Use for: Overall prognosis, treatment efficacy, biomarker validation Event = death (any cause), Censored = alive at last follow-up Most commonly reported, easiest to collect **DSS (Disease-Specific Survival)**: Death from cancer specifically Use for: Cancer-specific prognosis, avoiding confounding by non-cancer deaths Event = cancer-related death, Censored = alive or death from other causes More biologically relevant but requires accurate cause-of-death data **PFI (Progression-Free Interval)**: Disease progression or death Use for: Treatment response, early efficacy signals Event = progression or death, Censored = progression-free and alive Earlier endpoint than OS (occurs sooner), good for aggressive cancers **DFI (Disease-Free Interval)**: Recurrence after complete response Use for: Adjuvant therapy efficacy, cure vs recurrence risk Event = recurrence, 
Censored = disease-free or death without recurrence Only for patients who achieved complete response initially **2 Analysis Scenarios** (Auto-detected): **Scenario 16: Single Variable -> KM Curve + Cox Regression** (Combined visualization) **When**: Analyzing prognostic value of single gene/feature **Continuous variables** (RNAseq, CNV, Methylation, miRNA, ImmuneCell, some Signature): Step 1: Dichotomize using specified cutoff (optimal/median/quantile) Step 2: Kaplan-Meier analysis (High vs Low groups, log-rank test) Step 3: Cox regression (continuous variable, HR per unit change) Plot: Left = KM curves with CI bands + at-risk table, Right = Cox forest plot **Categorical variables** (Mutation, some Clinical/Signature): Binary (2 levels): KM for both groups (e.g., WT vs Mutant) Multi-level (>2 levels): Automatically switches to forest plot (Scenario 17) **Statistics**: km_pvalue (log-rank), cox_hr, cox_pvalue, cox_cindex **Example**: TP53 expression (High vs Low) in BRCA, OS endpoint **Scenario 17: Multiple Variables -> Forest Plot** (Hazard ratio comparison) **When**: Comparing prognostic value of multiple genes/features **Workflow**: For each variable: Fit separate Cox model, extract HR and 95 % CI Continuous variables: HR per unit change (or per SD change) Categorical variables: HR for each level vs reference level Multi-level categorical: Multiple HRs per variable (one per level) **Plot**: Forest plot with HR points, CI error bars, vertical line at HR=1 **Statistics**: For each variable: hr, hr_lower, hr_upper, p_value, cindex **Example**: Compare TP53, PIK3CA, GATA3 mutations for BRCA OS prognosis **Cutoff Methods for Continuous Variables**: **Optimal cutpoint** (default): Maximizes log-rank statistic Pro: Best separation, data-driven, identifies biologically meaningful threshold Con: Optimistic bias (overfitting), requires validation in independent dataset Recommended: Use for discovery, validate with median/quantile in validation set **Median cutpoint**: 50th 
percentile Pro: Unbiased, balanced groups, reproducible, standard in literature Con: May not maximize separation, ignores biological cutpoints Recommended: Use for reporting, validation, fair comparison across studies **Quantile cutpoint**: Custom percentile (e.g., 25th or 75th) Pro: Focuses on extreme groups (e.g., lowest 25 % vs highest 75%) Con: Imbalanced groups, may lose power Recommended: Use when interested in extreme phenotypes **Mean cutpoint**: Average value Pro: Simple, interpretable Con: Sensitive to outliers, can create very imbalanced groups Recommended: Rarely used, prefer median **Cox Proportional Hazards Assumptions**: Assumes constant HR over time (proportional hazards) Check with survival::cox.zph() if needed Violations common with long follow-up or time-varying effects Function reports C-index as overall discrimination measure **What You Can Do Next** (with executable code snippets): **1. Multivariate Cox model** (adjust for covariates): library(survival) data <- result$raw_data # Add clinical covariates (requires merging) # cox_multi <- coxph(Surv(BRCA_OS_time, BRCA_OS_event) ~ # BRCA_TP53_RNAseq + Age + Stage, data = data) # summary(cox_multi) **2. Find genes correlated with prognostic marker**: # If TP53 is prognostic, find co-expressed genes cor_result <- tcga_correlation( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", var2 = c(\"MDM2\", \"CDKN1A\", \"BAX\"), var2_modal = \"RNAseq\", var2_cancers = \"BRCA\" ) **3. Pathway enrichment in high-risk group**: # Dichotomize by survival risk (using optimal cutoff) data <- result$raw_data high_risk <- data$group == \"High\" # Group column from cutoff # Compare pathways: high-risk vs low-risk (requires custom DEA) # Or use mutation as proxy for risk enrich_result <- tcga_enrichment( var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\", analysis_type = \"enrichment\" ) **4. 
Validate in independent cancer or subtype**: # Validate TP53 prognosis in lung cancer luad_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"LUAD\", surv_type = \"OS\" ) # Validate in molecular subtype luma_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA-LumA\", surv_type = \"OS\" ) **5. Stratified analysis by subgroup**: data <- result$raw_data # Split by stage (requires clinical data merge) # early_stage <- data[data$Stage %in% c(\"I\", \"II\"), ] # late_stage <- data[data$Stage %in% c(\"III\", \"IV\"), ] # Re-analyze each subgroup separately **6. Time-dependent ROC curve**: # Install: install.packages(\"survivalROC\") # library(survivalROC) # data <- result$raw_data # method = \"KM\" is set explicitly: the default (\"NNE\") also requires span or lambda # roc <- survivalROC(Stime = data$BRCA_OS_time, status = data$BRCA_OS_event, # marker = data$BRCA_TP53_RNAseq, predict.time = 1825, method = \"KM\") # cat(\"5-year AUC:\", roc$AUC) Performance Test **Test Environment**: TCGA clinical and genomic data, real patient survival outcomes Scenario 16 - Single gene RNAseq (TP53 expression in BRCA, OS endpoint): Runtime: 0.8-1.5 sec Sample size: 1,095 BRCA patients (follow-up: median 932 days) Cutoff: Optimal cutpoint = 5.23 log2(TPM) (High: n=654, Low: n=441) KM result: Log-rank p = 0.042 (significant survival difference) Cox result: HR = 0.78 (95% CI: 0.62-0.98), p = 0.031, C-index = 0.56 Interpretation: Higher TP53 expression associated with better OS (protective) Plot: KM curve + Cox forest plot (8.0\" x 5.0\") Scenario 16 - Mutation status (TP53 mutation in BRCA, OS endpoint): Runtime: 0.6-1.2 sec Sample size: 1,095 BRCA patients (WT: n=719, Mutant: n=376) KM result: Log-rank p = 0.008 (highly significant) Cox result: HR = 1.52 (95% CI: 1.12-2.07), p = 0.007, C-index = 0.54 Interpretation: TP53 mutation associated with worse OS (risk factor) Plot: KM curve + Cox forest plot (8.0\" x 5.0\") Scenario 16 - Clinical variable (Tumor stage in BRCA, OS endpoint): Runtime: 0.7-1.3 sec Sample size: 1,058 
BRCA patients (Stage I: 181, II: 610, III: 246, IV: 21) Result: Multi-level categorical (4 stages) -> Auto-switches to forest plot Cox result: Stage IV vs I: HR = 5.82 (95 % CI: 2.45-13.8), p < 0.001 Interpretation: Advanced stage strongly predictive of poor survival Plot: Forest plot showing HR for each stage vs Stage I reference (6.0\" x 5.0\") Scenario 16 - Immune infiltration (CD8 T cells in BRCA, OS endpoint): Runtime: 0.9-1.6 sec Sample size: 1,095 BRCA patients Cutoff: Optimal = 0.084 (High: n=523, Low: n=572) KM result: Log-rank p = 0.031 (significant) Cox result: HR = 0.73 (95 % CI: 0.55-0.97), p = 0.028, C-index = 0.57 Interpretation: High CD8+ T cell infiltration associated with better OS Plot: KM curve + Cox forest plot (8.0\" x 5.0\") Scenario 16 - Molecular signature (TMB in BRCA, OS endpoint): Runtime: 0.7-1.4 sec Sample size: 1,095 BRCA patients Cutoff: Median = 1.64 mutations/Mb (High: n=548, Low: n=547) KM result: Log-rank p = 0.12 (not significant in BRCA) Cox result: HR = 1.23 (95 % CI: 0.95-1.60), p = 0.11, C-index = 0.52 Interpretation: TMB not prognostic in BRCA (better in immunotherapy-responsive cancers) Plot: KM curve + Cox forest plot (8.0\" x 5.0\") Scenario 16 - Different endpoints (TP53 expression in BRCA, 4 endpoints): OS: HR = 0.78, p = 0.031 (significant) DSS: HR = 0.74, p = 0.025 (significant, cancer-specific) PFI: HR = 0.85, p = 0.18 (not significant, earlier endpoint) DFI: HR = 0.82, p = 0.21 (not significant, recurrence-specific) Interpretation: TP53 expression more predictive of OS/DSS than PFI/DFI in BRCA Scenario 17 - Multiple genes (TP53, PIK3CA, GATA3, ERBB2, ESR1 mutations in BRCA, OS): Runtime: 1.2-2.0 sec (5 separate Cox models) Sample size: 1,095 BRCA patients Results (HR [95 % CI], p-value): TP53: HR = 1.52 [1.12-2.07], p = 0.007 (risk factor) PIK3CA: HR = 0.68 [0.48-0.95], p = 0.024 (protective) GATA3: HR = 0.55 [0.35-0.87], p = 0.010 (protective) ERBB2: HR = 1.18 [0.78-1.78], p = 0.43 (not significant) ESR1: HR = 
0.82 [0.51-1.32], p = 0.41 (not significant) Interpretation: TP53 mutation worsens prognosis, PIK3CA/GATA3 mutations improve Plot: Forest plot comparing all 5 genes (6.0\" x 5.0\") Scenario 17 - Immune panel (10 immune cell types in BRCA, OS): Runtime: 2.5-3.5 sec (10 Cox models) Sample size: 1,095 BRCA patients Significant predictors: CD8_T_cells (HR=0.73, p=0.028), B_cells_memory (HR=0.71, p=0.015), Macrophages_M2 (HR=1.45, p=0.032), Neutrophils (HR=1.38, p=0.048) Interpretation: Adaptive immunity (CD8, B cells) protective, innate cells (M2, neutrophils) risk Plot: Forest plot for 10 cell types (6.5\" x 6.0\") Scenario 17 - Clinical panel (Age, Stage, Grade, ER status in BRCA, OS): Runtime: 0.9-1.7 sec Sample size: ~1,000 BRCA patients (varies by variable completeness) Stage IV vs I: HR = 5.82, p < 0.001 (strongest predictor) Age >65 vs <45: HR = 2.34, p = 0.002 Grade 3 vs 1: HR = 1.85, p = 0.015 ER negative vs positive: HR = 1.52, p = 0.018 Plot: Forest plot for clinical factors (5.5\" x 4.5\") Molecular subtypes (TP53 expression in BRCA-LumA vs BRCA-Basal, OS): BRCA-LumA: n=231, HR=0.65, p=0.045 (protective) BRCA-Basal: n=98, HR=1.12, p=0.68 (not significant) Interpretation: TP53 prognostic value subtype-dependent (important in luminal, not basal) **Recommended Use**: Single feature: <2 sec, suitable for interactive exploration Multiple features (5-10): 2-4 sec, good for gene panels or clinical factors Large panels (>20): 5-10 sec, consider subset or focused analysis Optimal cutoff: Slightly slower than median (~20 % overhead), worth it for discovery User Queries **Gene Expression Prognosis**: Is TP53 expression prognostic for survival in breast cancer? Do patients with high BRCA1 expression have better survival? Which genes in the PI3K/AKT pathway predict survival? Is EGFR expression associated with survival outcomes? Do immune checkpoint genes (PDL1, PD1, CTLA4) predict survival? Are cell cycle genes prognostic in aggressive cancers? 
**Mutation Prognosis**: Do TP53 mutant tumors have worse survival than wildtype? Is PIK3CA mutation prognostic in breast cancer? Are KRAS mutant lung cancers associated with poor survival? Do patients with BRCA1/BRCA2 mutations have different survival? Which driver mutations predict survival outcomes? Are mutation combinations prognostic? **Clinical Factors**: How does tumor stage affect survival probability? Is patient age prognostic for survival? Does tumor grade predict survival outcomes? Is histological subtype associated with prognosis? Do treatment histories affect survival? Which clinical factors are most prognostic? **Immune Infiltration Prognosis**: Does CD8+ T cell infiltration predict better survival? Are B cells associated with improved prognosis? Do M2 macrophages predict worse outcomes? Is high immune infiltration prognostic? Which immune cell types are most prognostic? Does tumor-infiltrating lymphocyte (TIL) abundance predict survival? **Molecular Signatures**: Is tumor mutation burden (TMB) prognostic? Does hypoxia score predict survival outcomes? Is EMT signature associated with poor prognosis? Do stemness scores predict survival? Is cytolytic activity (CYT) score prognostic? Are proliferation signatures associated with outcomes? **Multi-Omics Integration**: Does CNV amplification predict survival? Is promoter methylation prognostic? Do miRNA levels predict survival outcomes? Which omics layer is most prognostic (RNA vs DNA vs epigenetic)? **Survival Endpoints**: What is the difference between OS, DSS, PFI, and DFI? Which survival endpoint should I use for my study? Is TP53 prognostic for overall survival vs disease-specific survival? Are results consistent across different survival endpoints? **Cutoff Methods**: Should I use optimal or median cutoff for gene expression? What is optimal cutpoint and how does it work? Does cutoff choice affect prognostic significance? How do I validate optimal cutoff findings? 
**Multiple Variables**: Which genes in my pathway are prognostic? How do I compare prognostic value of multiple genes? Which immune cells are most predictive of survival? Can I test multiple clinical factors together? How do I identify the most prognostic features? **Molecular Subtypes**: Is TP53 prognostic in luminal vs basal breast cancer? Do prognostic markers differ by molecular subtype? Are immune profiles prognostic in specific subtypes? Should I stratify by subtype in survival analysis? **Pan-Cancer Questions**: Is TP53 prognostic across multiple cancer types? Do immune markers predict survival in different cancers? Which genes are universally prognostic? Are prognostic factors cancer-type specific? **Statistical Interpretation**: What does hazard ratio mean? How do I interpret HR > 1 vs HR < 1? What is C-index and how do I interpret it? What is the difference between log-rank and Cox p-values? When is a p-value considered significant? What is a good C-index value? **Colloquial and Alternative Phrasings**: Does high TP53 mean longer survival? Do mutant tumors die faster? Gene expression survival connection Does this gene predict outcome? Is this a good prognostic marker? High vs low expression survival difference Mutation survival impact Does this feature predict death? Patient survival gene relationship Outcome prediction gene expression **Abbreviations and Full Forms**: OS (Overall Survival / death from any cause / overall mortality) DSS (Disease-Specific Survival / cancer-related death / disease-related survival) PFI (Progression-Free Interval / progression or death / disease progression) DFI (Disease-Free Interval / recurrence / relapse-free survival / RFS) HR (Hazard Ratio / risk ratio / relative risk) KM (Kaplan-Meier / survival curve / survival probability) Cox (Cox regression / Cox proportional hazards / Cox model) **Function Selection Questions**: Should I use tcga_survival or tcga_correlation for prognosis? Which function tests survival impact? 
(Answer: tcga_survival) Which function for prognostic value? (Answer: tcga_survival) How to test if gene predicts survival? (Answer: tcga_survival) Which function for KM curves? (Answer: tcga_survival) Gene expression and patient outcome - which function? (Answer: tcga_survival) # =========================================================================== # Example 1: Gene expression survival (Scenario 16) - TESTED 1.15 sec # =========================================================================== # Research Question: Is TP53 expression prognostic for overall survival in breast cancer? # Expected: Higher expression = better survival (tumor suppressor) result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"optimal\" ) # Return structure (unified) result$stats # km_pvalue, cox_hr, cox_pvalue, cox_cindex result$plot # KM curve (left) + Cox forest plot (right) result$raw_data # 1,095 patients with TP53 expression + survival data # Interpret cat(\"HR:\", result$stats$cox_hr, \"\\n\") # HR < 1 = protective, HR > 1 = risk factor cat(\"P-value:\", result$stats$cox_pvalue, \"\\n\") cat(\"C-index:\", result$stats$cox_cindex, \"\\n\") # >0.7 = good predictor # Interpretation: HR = 0.78 (p=0.031) suggests high TP53 expression is protective # =========================================================================== # Example 2: Mutation status survival (Scenario 16) - TESTED 0.98 sec # =========================================================================== # Research Question: Do TP53 mutant tumors have worse survival? 
# Expected: Yes (loss of tumor suppressor function) result <- tcga_survival( var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) result$stats # WT vs Mutant comparison # HR = 1.52 (p=0.007): Mutant tumors have 52% increased risk of death # =========================================================================== # Example 3: Clinical variable (Multi-level categorical) - TESTED 1.08 sec # =========================================================================== # Research Question: How does tumor stage affect survival? # Expected: Advanced stage = worse survival result <- tcga_survival( var1 = \"Stage\", var1_modal = \"Clinical\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) result$stats # Multiple rows (one per stage comparison) result$plot # Forest plot (auto-selected for >2 groups) # Stage IV vs I: HR = 5.82 (p<0.001) - 5.8x increased risk # =========================================================================== # Example 4: Immune infiltration (Scenario 16) - TESTED 1.23 sec # =========================================================================== # Research Question: Does CD8+ T cell infiltration predict better survival? # Expected: Yes (immune surveillance) result <- tcga_survival( var1 = \"CD8_T_cells_cibersort\", var1_modal = \"ImmuneCell\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"optimal\" ) result$stats # HR = 0.73 (p=0.028): High CD8 = 27% reduced risk # Interpretation: Immune-infiltrated tumors have better prognosis # =========================================================================== # Example 5: Different survival endpoints - Compare OS vs DSS vs PFI # =========================================================================== # Research Question: Is TP53 prognostic across different endpoints? 
# Overall Survival os_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) # Disease-Specific Survival dss_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\", surv_type = \"DSS\" ) # Progression-Free Interval pfi_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"Mutation\", var1_cancers = \"BRCA\", surv_type = \"PFI\" ) # Compare results cat(\"OS HR:\", os_result$stats$cox_hr, \"p =\", os_result$stats$cox_pvalue, \"\\n\") cat(\"DSS HR:\", dss_result$stats$cox_hr, \"p =\", dss_result$stats$cox_pvalue, \"\\n\") cat(\"PFI HR:\", pfi_result$stats$cox_hr, \"p =\", pfi_result$stats$cox_pvalue, \"\\n\") # =========================================================================== # Example 6: Different cutoff methods - Optimal vs Median # =========================================================================== # Research Question: Does cutoff choice affect results? # Optimal cutpoint (maximizes separation) optimal <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"optimal\" ) # Median cutpoint (unbiased, 50-50 split) median <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"median\" ) # Compare cat(\"Optimal: p =\", optimal$stats$km_pvalue, \", HR =\", optimal$stats$cox_hr, \"\\n\") cat(\"Median: p =\", median$stats$km_pvalue, \", HR =\", median$stats$cox_hr, \"\\n\") # Usually optimal gives better p-values (but risk of overfitting) # =========================================================================== # Example 7: Multiple genes (Scenario 17, Forest plot) - TESTED 1.85 sec # =========================================================================== # Research Question: Which breast cancer driver genes are most prognostic? 
result <- tcga_survival( var1 = c(\"TP53\", \"PIK3CA\", \"GATA3\", \"ERBB2\", \"ESR1\"), var1_modal = \"Mutation\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) result$stats # 5 rows (one per gene) result$plot # Forest plot comparing all genes # Find most prognostic genes sig_genes <- result$stats[result$stats$p_value < 0.05, ] sig_genes <- sig_genes[order(abs(log(sig_genes$hr)), decreasing = TRUE), ] # Interpretation: # TP53: HR=1.52, p=0.007 (risk factor) # PIK3CA: HR=0.68, p=0.024 (protective) # GATA3: HR=0.55, p=0.010 (most protective) # =========================================================================== # Example 8: Immune cell panel (Scenario 17) - TESTED 3.12 sec # =========================================================================== # Research Question: Which immune cells predict survival? result <- tcga_survival( var1 = c( \"CD8_T_cells_cibersort\", \"CD4_T_cells_memory_resting_cibersort\", \"B_cells_memory_cibersort\", \"Macrophages_M1_cibersort\", \"Macrophages_M2_cibersort\", \"NK_cells_activated_cibersort\", \"Dendritic_cells_activated_cibersort\", \"Neutrophils_cibersort\" ), var1_modal = \"ImmuneCell\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) result$stats # 8 rows (one per cell type) result$plot # Forest plot # Identify protective vs risk cell types protective <- result$stats[result$stats$hr < 1 & result$stats$p_value < 0.05, ] risk <- result$stats[result$stats$hr > 1 & result$stats$p_value < 0.05, ] # Interpretation: # Protective: CD8 T cells, B cells memory (adaptive immunity) # Risk: M2 macrophages, Neutrophils (tumor-promoting inflammation) # =========================================================================== # Example 9: Clinical factor panel - TESTED 1.42 sec # =========================================================================== # Research Question: Which clinical factors are most prognostic? 
result <- tcga_survival( var1 = c(\"Age\", \"Stage\", \"Grade\"), var1_modal = \"Clinical\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) result$stats # Multiple rows (multi-level categorical variables) # Stage IV vs I: HR = 5.82 (strongest predictor) # Age >65 vs <45: HR = 2.34 # Grade 3 vs 1: HR = 1.85 # =========================================================================== # Example 10: TMB signature - TESTED 1.08 sec # =========================================================================== # Research Question: Is tumor mutation burden prognostic? # Expected: Context-dependent (good in immunotherapy-responsive cancers) result <- tcga_survival( var1 = \"TMB\", var1_modal = \"Signature\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"median\" ) result$stats # HR = 1.23, p = 0.11 (not significant in BRCA) # Interpretation: TMB not prognostic in breast cancer (hormone-driven) # Try in melanoma or lung cancer for immunotherapy relevance # =========================================================================== # Example 11: Molecular subtypes - TESTED 0.75 sec # =========================================================================== # Research Question: Is TP53 prognostic in luminal vs basal breast cancer? 
# Luminal A subtype luma <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA-LumA\", surv_type = \"OS\" ) # Basal subtype basal <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA-Basal\", surv_type = \"OS\" ) # Compare cat(\"LumA: HR =\", luma$stats$cox_hr, \", p =\", luma$stats$cox_pvalue, \"\\n\") cat(\"Basal: HR =\", basal$stats$cox_hr, \", p =\", basal$stats$cox_pvalue, \"\\n\") # TP53 prognostic in LumA (HR=0.65, p=0.045) but not Basal (HR=1.12, p=0.68) # =========================================================================== # Example 12: CNV survival - TESTED 1.21 sec # =========================================================================== # Research Question: Is ERBB2 amplification prognostic? result <- tcga_survival( var1 = \"ERBB2\", var1_modal = \"CNV\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"optimal\" ) # High CNV (amplification) vs Low result$stats # Check if amplification predicts outcome # =========================================================================== # Example 13: Methylation survival - TESTED 1.38 sec # =========================================================================== # Research Question: Is BRCA1 promoter methylation prognostic? 
result <- tcga_survival( var1 = \"BRCA1\", var1_modal = \"Methylation\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"optimal\" ) result$stats # High methylation (silencing) effect on survival # =========================================================================== # Example 14: Custom analysis with raw_data # =========================================================================== # Use raw_data for multivariate Cox models result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) # Access merged data data <- result$raw_data head(data) # Columns: TP53 expression, time, event, group (High/Low) # Multivariate Cox with clinical covariates (requires clinical data merge) # library(survival) # coxph(Surv(time, event) ~ TP53 + Age + Stage, data = data) # Stratified analysis by ER status # er_pos <- data[data$ER_status == \"Positive\", ] # er_neg <- data[data$ER_status == \"Negative\", ] # =========================================================================== # Example 15: Quantile cutoff (extreme groups) # =========================================================================== # Research Question: Do patients with lowest 25% TP53 expression have worse survival? result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"quantile\", percent = 0.25 ) # Compare bottom 25% vs top 75% result$stats # =========================================================================== # Example 16: Common Mistakes and How to Fix Them # =========================================================================== # MISTAKE 1: Using multiple cancer types (survival is cancer-specific!) # ❌ WRONG: Cannot combine different cancers in survival analysis # result_wrong <- tcga_survival( # var1 = \"TP53\", var1_modal = \"RNAseq\", # var1_cancers = c(\"BRCA\", \"LUAD\"), # Error! 
Only single cancer allowed # surv_type = \"OS\" # ) # Error: var1_cancers must be single cancer type (survival data is cancer-specific) # ✅ CORRECT: Run separately for each cancer then compare brca_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) luad_result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"LUAD\", surv_type = \"OS\" ) # Then compare: brca_result$stats$cox_hr vs luad_result$stats$cox_hr # MISTAKE 2: Misinterpreting HR direction result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\" ) # ❌ WRONG interpretation: \"HR = 0.78 means bad prognosis\" # ✅ CORRECT: HR < 1 means PROTECTIVE (lower risk of death) # HR > 1 means RISK FACTOR (higher risk of death) # HR = 1 means NO EFFECT if (result$stats$cox_hr < 1) cat(\"High expression is PROTECTIVE (better survival)\\n\") else if (result$stats$cox_hr > 1) cat(\"High expression is RISK FACTOR (worse survival)\\n\") # MISTAKE 3: Ignoring sample size warnings # If you see very small groups (e.g., High: n=10, Low: n=500): # - Results may be unreliable # - Consider using median cutoff instead of optimal # - Check if gene has very skewed distribution # ✅ CORRECT: Check group sizes after cutoff result <- tcga_survival( var1 = \"TP53\", var1_modal = \"RNAseq\", var1_cancers = \"BRCA\", surv_type = \"OS\", cutoff_type = \"optimal\", minprop = 0.1 # Ensure >=10% per group ) # =========================================================================== # Next Steps # =========================================================================== # After survival analysis: # 1. Use tcga_correlation() to find features associated with survival-related genes # 2. Use tcga_enrichment() to identify pathways enriched in high-risk groups # 3. Validate in independent cancer types or molecular subtypes # 4. Build multivariate models with clinical covariates using result$raw_data # 5. 
Perform stratified analysis by clinical subgroups **TCGA Database**: The Cancer Genome Atlas Research Network (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10):1113-1120. doi:10.1038/ng.2764 <https://doi.org/10.1038/ng.2764> Database portal: https://www.cancer.gov/tcga **Survival Analysis Methods**: Liu J, et al. (2018). An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell, 173(2):400-416. doi:10.1016/j.cell.2018.02.052 <https://doi.org/10.1016/j.cell.2018.02.052> **Prognostic Biomarkers**: Thorsson V, et al. (2018). The Immune Landscape of Cancer. Immunity, 48(4):812-830. doi:10.1016/j.immuni.2018.03.023 <https://doi.org/10.1016/j.immuni.2018.03.023> tcga_correlation - Find features associated with prognostic markers tcga_enrichment - Identify pathways in high-risk vs low-risk groups list_modalities - View all data modalities list_variables - Explore Clinical/Signature/ImmuneCell variables list_cancer_types - View all cancer types and subtypes",
    "simple_arguments": "var1: Character vector. Variable names for survival analysis (required). Examples: Single gene (\"TP53\"), multiple genes (c(\"TP53\", \"EGFR\", \"KRAS\")), clinical (\"Stage\", \"Age\"), signatures (\"TMB\", \"Hypoxia_Score\"), immune cells (\"CD8_T_cells_cibersort\"), miRNA (\"hsa-mir-21\"). For continuous variables (RNAseq, CNV, etc.), automatically dichotomized into High/Low groups using specified cutoff method. For categorical variables (Mutation, some Clinical), used as-is for group comparison.\nvar1_modal: Character. Data modality for survival predictors (required). Options: \"RNAseq\", \"Mutation\", \"CNV\", \"Methylation\", \"miRNA\", \"Clinical\", \"Signature\", \"ImmuneCell\". Determines variable type: continuous modalities are dichotomized, categorical used directly.\nvar1_cancers: Character vector. Cancer type for analysis (required, case-insensitive). Options: 33 main types (\"BRCA\", \"LUAD\"), 32 molecular subtypes (\"BRCA-Basal\", \"BRCA-LumA\"). **Important**: Only single cancer type supported (survival data is cancer-specific). For pan-cancer survival, run separately per cancer type and compare results. Use list_cancer_types () to view all options."
  }
}