---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# RWRtoolkit

<!-- badges: start -->
<!-- badges: end -->

RWRtoolkit enables easy use of Random Walk with Restart (RWR) on multiplex networks.  Its functions extend the [RandomWalkRestartMH](https://github.com/alberto-valdeolivas/RandomWalkRestartMH) R package.  Scripts for use as command line tools are also provided.


## Installation

#### Dependencies

Installation of this R package requires R >= 4.1.0 and devtools.  If you prefer conda, you can create the base environment with `conda create --name r-RWRtoolkit -c conda-forge "r>=4.1" "r-base>=4.1" r-devtools` (`r-irkernel` is optional). You can also install devtools from within a base R environment with `install.packages("devtools")`.

##### Installation Issues

###### Unable to access Bioconductor:
You may run into an error when your R environment tries to install packages from Bioconductor:
```R
devtools::install()
Error: Unknown remote type: bioc
  cannot open URL 'https://bioconductor.org/config.yaml'
```
To confirm that this is a certificate issue, request the same URL with another library:
```R
httr::GET("https://bioconductor.org/config.yaml")
Error in curl::curl_fetch_memory(url, handle = handle) :
  SSL peer certificate or SSH remote key was not OK: [bioconductor.org] SSL certificate problem: self-signed certificate in certificate chain
```

This is an SSL error, and fixing it requires updating the certificate bundle that R uses.
To do so, run the following in your terminal:
```bash
# 1. Get a certificate if you don't have one
curl -o ~/.ssh/cert.pem https://curl.se/ca/cacert.pem

# 2. Get a bioconductor specific certificate
echo | openssl s_client -showcerts -servername bioconductor.org -connect bioconductor.org:443 2>/dev/null | openssl x509 -inform pem -outform pem > bioconductor_cert.pem

# 3. Append your bioconductor cert to your cert.pem file
cat bioconductor_cert.pem >> ~/.ssh/cert.pem

# 4. Add the cert path to your `.Renviron`
echo "CURL_CA_BUNDLE=$HOME/.ssh/cert.pem" >> ~/.Renviron
```
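
Step 4 writes to `~/.Renviron` directly. If that file already holds other settings, a small helper that updates a single variable and leaves the rest untouched may be safer. The sketch below is one way to do that; `set_renviron_var` and the scratch file are hypothetical names used for illustration and are not part of RWRtoolkit.

```bash
# Hypothetical helper: set KEY=VALUE in an .Renviron-style file while
# keeping every other entry that is already there.
set_renviron_var() {
  file="$1"; key="$2"; value="$3"
  touch "$file"
  # Copy all lines except an existing definition of the same key
  grep -v "^${key}=" "$file" > "${file}.tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "${file}.tmp"
  mv "${file}.tmp" "$file"
}

# Demonstrated on a scratch file; pass ~/.Renviron to apply it for real
demo_renviron="$(mktemp)"
printf 'R_LIBS_USER=~/R/library\n' > "$demo_renviron"
set_renviron_var "$demo_renviron" CURL_CA_BUNDLE "$HOME/.ssh/cert.pem"
set_renviron_var "$demo_renviron" CURL_CA_BUNDLE "$HOME/.ssh/cert.pem"  # safe to repeat
cat "$demo_renviron"
```

Running the helper twice leaves a single `CURL_CA_BUNDLE` line, so it can be re-run after the certificate path changes.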

In a newly restarted R environment, type:
```R
httr::set_config(httr::config(cainfo = path.expand("~/.ssh/cert.pem")))
response <- httr::GET("https://bioconductor.org/config.yaml")
print(response)
```

You ought to get an output similar to:
```
Response [https://bioconductor.org/config.yaml]
  Date: 2024-08-14 10:57
  Status: 200
  Content-Type: <unknown>
  Size: 12.6 kB
<BINARY BODY>
```

Now, `devtools::install()` ought to work.


###### devtools/r-devtools installation
It is possible you may run into issues installing `r-devtools` via conda or `devtools` via R's `install.packages()` function.

**textshaping**
This might be due to a failure in the installation of `textshaping`. `textshaping` requires the `harfbuzz` and `fribidi` libraries and finds them through the `pkg-config` command, which may be external to your environment.  Install them with the option that matches your platform (Linux/macOS recommendations taken from the ANTICONF message printed by R's `install.packages`):
  - Anaconda: conda install -c conda-forge pkg-config harfbuzz fribidi
  - deb: libharfbuzz-dev libfribidi-dev (Debian, Ubuntu, etc)
  - rpm: harfbuzz-devel fribidi-devel (Fedora, EPEL)
  - csw: libharfbuzz_dev libfribidi_dev (Solaris)
  - brew: harfbuzz fribidi (OSX)

**libgit2**
Depending on how your packages were installed, you may run into an SSL issue when attempting to install devtools. This is due to the installation of `gert`, which requires `libgit2` (installable via the [binaries](https://libgit2.org/), [conda](https://anaconda.org/conda-forge/libgit2), [homebrew](https://formulae.brew.sh/formula/libgit2), [yum](https://yum-info.contradodigital.com/view-package/epel/libgit2/), or the package manager of your choice).


#### Package Installation

You may clone this repo and install directly.  This is particularly useful if you want to use the CLI scripts or develop the package.

```
git clone https://github.com/dkainer/RWRtoolkit.git
cd RWRtoolkit
R                      # start an R session inside the repository
devtools::install()    # run this at the R prompt
```

From a clean environment this may take a while (~20 min).


#### Secondary Method (install as an R package directly)

You can install the released version of RWRtoolkit from [GitHub](https://github.com/dkainer/RWRtoolkit/) with:

``` r
devtools::install_github("dkainer/RWRtoolkit")
```

## Running RWRtoolkit

#### Loading RWRtoolkit:

RWRtoolkit can be run as either an R package or a command line tool depending on your preferences.

- **R Package:**
  Simply loading the library with the `library` function in R loads RWRtoolkit:

  ```R
  library(RWRtoolkit)
  ```

- **Command Line Tool:**
  **If you cloned the repository** from GitHub, you can access the command line scripts by navigating to the `RWRtoolkit/inst/scripts` directory.

  **If you installed the package** with `devtools::install_github`, open an R session and type:

  ```R
  library(RWRtoolkit)
  .libPaths()
  ```

  This should print a path similar to:

  ```
  /Library/Frameworks/R.framework/Versions/4.0/Resources/library/
  ```

  This is the directory in which your installed R libraries exist.

  From the above directory (hereafter referred to as `<LIBPATHS_DIRECTORY>`), the script files can be found at:

  ```
  <LIBPATHS_DIRECTORY>/RWRtoolkit/scripts
  ```

  Note: these paths differ from the layout of the GitHub repository because `devtools::install` moves everything inside the `inst` directory up to the top level of the installed package during the build/installation phase.

  From the above path, all scripts can be accessed as:

  ```bash
  Rscript <LIBPATHS_DIRECTORY>/RWRtoolkit/scripts/run_loe.R \
    --data            <LIBPATHS_DIRECTORY>/RWRtoolkit/example_data/string_interactions.Rdata \
    --seed_geneset    <LIBPATHS_DIRECTORY>/RWRtoolkit/example_data/geneset1.tsv \
    --tau             "1.0,1.0" \
    -o                ./outdir
  ```


#### Running
RWRtoolkit enables Random Walk with Restart (RWR) on homogeneous multiplex networks.  RWRtoolkit provides functions for both creating the multiplex networks and running RWR.

##### Usage Options:

The tools provided by RWRtoolkit can be used either directly in R or through command line scripts.  The R functions follow the naming convention `RWRtoolkit::RWR_func`, for example `RWRtoolkit::RWR_make_multiplex`.  View help with `?RWRtoolkit::RWR_make_multiplex`.  The command line scripts are available in `./inst/scripts` and are run with `Rscript`, for example `Rscript run_make_multiplex.R`.  Run `Rscript run_make_multiplex.R -h` to view the help.  You can run these scripts from any location, but remember to use either absolute paths or paths relative to your working directory where applicable.

##### Initial Step:

The first step in RWRtoolkit is to build the RData object that represents the multiplex network using `RWR_make_multiplex`.  This function requires an `flist` (a **f**ile **list**) input file that lists the networks used to create the multiplex object.  Each row in the flist is a triple defining the network: \{file_path, name, group\}. An example flist for a homogeneous network looks like the following (columns may be separated by any of the delimiters `,\t |;`):

|      **file_path**     |    **name**   |
|:----------------------:|:-------------:|
| /path/to/file1.txt     | PPI           |
| /path/to/file2.txt     | Co-Domain     |

At this stage you also define values for delta.  **Delta** sets the probability to change between layers at the next step. If delta = 0, the particle will always remain in the same layer after a non-restart iteration.  On the other hand, if delta = 1, the particle will always change between layers, therefore not following the specific edges of each layer.  The default is 0.5.  Note delta must be greater than 0 and less than or equal to 1.
Please note that for large networks or a large number of networks this function may take a long time.
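
To make the role of delta concrete, the sketch below prints per-step layer-transition probabilities. It assumes the common multiplex RWR convention in which a walker stays in its current layer with probability 1 - delta and splits delta evenly across the other layers. That convention and the helper name `layer_probs` are assumptions made for illustration; they are not taken from the RWRtoolkit source.

```bash
# Illustration only: layer-transition probabilities for a given delta and
# number of layers L (L >= 2), under the assumptions stated above.
layer_probs() {
  awk -v d="$1" -v L="$2" 'BEGIN {
    if (!(d + 0 > 0 && d + 0 <= 1)) { print "delta must satisfy 0 < delta <= 1"; exit 1 }
    if (L + 0 < 2)                  { print "need at least two layers"; exit 1 }
    printf "stay=%.3f switch_each=%.3f\n", 1 - d, d / (L - 1)
  }'
}

layer_probs 0.5 3    # default delta with three layers
layer_probs 1 2      # always switch in a two-layer multiplex
```

A delta of 0 is rejected here, mirroring the stated requirement that delta be greater than 0.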

This function does not return anything. Instead, it saves the relevant objects (the multiplex object *mpo*, the adjacency matrix, and the normalized adjacency matrix) to a file for use in subsequent functions.

When using the CLI script, remember that the paths in your `flist` must be either absolute or relative to the directory from which you run `scripts/run_make_multiplex.R`.
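
Because a wrong working directory is an easy mistake to make, it can help to verify the flist paths before starting a long build. The following is a minimal sketch of such a pre-flight check; `check_flist_paths` is a hypothetical helper, and it assumes a tab-separated flist without a header row.

```bash
# Hypothetical check: report flist entries whose file_path does not resolve
# from the current directory. Succeeds only if every file is found.
check_flist_paths() {
  missing=0
  while IFS="$(printf '\t')" read -r path name _rest; do
    [ -z "$path" ] && continue            # skip blank lines
    if [ ! -f "$path" ]; then
      echo "missing: $path (layer: $name)"
      missing=$((missing + 1))
    fi
  done < "$1"
  [ "$missing" -eq 0 ]
}

# Demo with scratch files: one layer file exists, the other does not
workdir="$(mktemp -d)"
touch "$workdir/layer1.tsv"
printf '%s\t%s\n' \
  "$workdir/layer1.tsv" "PPI" \
  "$workdir/layer2.tsv" "Co-Domain" > "$workdir/flist.tsv"
check_flist_paths "$workdir/flist.tsv" \
  || echo "fix the paths above before building the multiplex"
```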


## RWRtoolkit Examples

- **Running in R**
    The below code assumes an R session was initialized from within the `inst` directory of RWRtoolkit. Output will be within the `RWRtoolkit/inst` directory. (This is necessary due to the files within `flist.tsv`  having relative paths)

    ```R
    RWRtoolkit::RWR_make_multiplex(
      flist="./example_data/flist.tsv",
      delta=0.5,
      output="./RWRtoolkit_MPO_Output/myExampleNetwork.Rdata"
    )
    ```

- **Running CLI**
    If running the code from the cloned GitHub repository, the below code ought to be run from within the `inst` directory.  If running from the `devtools::install_github` method, the below code ought to be run from within the RWRtoolkit directory located at `<LIBPATHS_DIRECTORY>/RWRtoolkit`. Output will be saved to `./RWRtoolkit_MPO_Output/` under that directory.

    ```bash
    Rscript scripts/run_make_multiplex.R \
      --flist example_data/flist.tsv \
      --delta 0.25 \
      --out ./RWRtoolkit_MPO_Output/myExampleNetwork.Rdata
    ```

### Next Steps:

The choice of the next script depends on the type of analysis desired.
RWRtoolkit provides several different workflows outlined below.


#### RWR_CV.R

*RWR Cross Validation* performs K-fold cross validation on a single gene
set, finding the RWR rank of the left-out genes.  You can choose between three
modes: (1) leave-one-out (`loo`) to leave only one gene from the gene set out and
find its rank, (2) cross-validation (`kfold`) to run k-fold cross-validation
for a specified value of *k*, or (3) singletons (`singletons`) to use a single gene
as a seed and find the rank of all remaining genes.

- **Input:** Pre-calculated interaction network (using
  `RWR_make_multiplex.R`), and a single geneset.
- **Output:** Table/dataframe with the ranking of each gene in the gene set when
  left out, as well as AUPRC and AUROC curves.
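
To illustrate what a k-fold split of a gene set looks like, the sketch below assigns made-up gene IDs to folds in round-robin order. It is an illustration of the idea only: `RWR_CV` performs its own partitioning internally, and the assignment scheme and gene names here are assumptions.

```bash
# Illustration only: assign each gene in a gene set to one of k folds
k=3
geneset="$(mktemp)"
printf '%s\n' geneA geneB geneC geneD geneE geneF geneG > "$geneset"

# Round-robin assignment: line i goes to fold ((i - 1) mod k) + 1
folds="$(awk -v k="$k" '{ print ((NR - 1) % k) + 1 "\t" $1 }' "$geneset")"
echo "$folds"

# In each round one fold is left out and its genes are the ones whose rank is reported
echo "left out in fold 1:"
echo "$folds" | awk -F'\t' '$1 == 1 { print "  " $2 }'
```

With seven genes and `k=3`, fold 1 holds three genes and folds 2 and 3 hold two each.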

Examples

- **Running in R**

    ```R
    # Can be run from anywhere so long as RWRtoolkit is installed.
    extdata.dir <- system.file("example_data", package="RWRtoolkit")

    string.interactions.fp <- paste(extdata.dir, "string_interactions.Rdata", sep='/')
    geneset.path <- paste(extdata.dir, 'geneset1.tsv', sep='/')
    outdir.path <- './RWRtoolkit_CV_Output/'

    RWRtoolkit::RWR_CV(
      data = string.interactions.fp ,
      genesetPath = geneset.path,
      outdirPath = outdir.path)
    ```

- **Running CLI**
    If running the code from the cloned GitHub repository, the below code ought to be run from within the `inst` directory.  If running from the `devtools::install_github` method, the below code ought to be run from within the RWRtoolkit directory located at `<LIBPATHS_DIRECTORY>/RWRtoolkit`. Output will be saved to `./RWRtoolkit_CV_Output/` under that directory.

    ```bash
    Rscript ./scripts/run_cv.R \
      --data ./example_data/string_interactions.Rdata \
      --geneset ./example_data/geneset1.tsv \
      -o ./RWRtoolkit_CV_Output/
    ```

#### RWR_LOE.R

*RWR Lines of Evidence* has two possible functions.  Given one geneset
of seeds, rankings for all other genes in the network will be returned.
Given a second geneset of genes to be queried, rankings for just the genes
in that geneset will be returned.  This can be used to build multiple
lines of evidence from the various input networks to relate the two gene sets.

- **Input:** Pre-calculated interaction network (using
  `RWR_make_multiplex`), and one or two genesets.
- **Output:** Table/dataframe with a ranking of non-seed genes (either the rest of the genes in the network if only one input geneset is used, or just the genes in the second geneset if one is provided).


Examples

- **Running in R**

    ```R
    # Can be run from anywhere so long as RWRtoolkit is installed.
    extdata.dir <- system.file("example_data", package="RWRtoolkit")

    string.interactions.fp <- paste(extdata.dir, "string_interactions.Rdata", sep='/')
    geneset.path <- paste(extdata.dir, 'geneset1.tsv', sep='/')
    outdir.path <- './RWRtoolkitOutput_LOE/'

    RWRtoolkit::RWR_LOE(
      data= string.interactions.fp,
      seed_geneset= geneset.path,
      tau = c(1, 1, 1, 1, 1, 1, 1, 1, 1),
      outdir= outdir.path )
    ```

- **Running CLI**

    ```bash
    Rscript scripts/run_loe.R \
      --data            ./example_data/string_interactions.Rdata \
      --seed_geneset    ./example_data/geneset1.tsv \
      --tau             "1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0" \
      -o                ./RWRtoolkitOutput_LOE
    ```

#### RWR_netstats.R

*RWR Net Stats* offers a series of statistical methods for extracting metrics from networks and multiplex layers. There are multiple options within netstats:

- **Input:**
    - A multiplex object (from RWR_make_multiplex) or an flist.
    - A reference network: Optional (Depending on methods chosen)
    - A network of interest: Optional (Depending on methods chosen)
    - Network Scoring Metric: ("jaccard", "overlap", or "both")

- **Output:** In R: a list containing the tables of metrics selected by the input parameters. Files for each table can be saved by supplying an `outdir`.
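
For readers unfamiliar with the two scoring metrics, the sketch below computes the standard Jaccard and overlap coefficients on two tiny, invented edge lists. It uses the textbook definitions (intersection over union, and intersection over the smaller set); how `RWR_netstats` computes its scores internally is not shown here and may differ.

```bash
# Illustration only: Jaccard and overlap coefficients between two
# undirected edge lists (the edges are invented for the example)
net1="$(mktemp)"; net2="$(mktemp)"
printf '%s\t%s\n' A B  B C  C D > "$net1"
printf '%s\t%s\n' B A  C D  D E  E F > "$net2"

# Canonicalise each edge so that "A B" and "B A" compare equal, then de-duplicate
canon() { awk '{ if ($1 > $2) print $2 "\t" $1; else print $1 "\t" $2 }' "$1" | sort -u; }
canon "$net1" > "$net1.c"
canon "$net2" > "$net2.c"

shared="$(comm -12 "$net1.c" "$net2.c" | wc -l | tr -d ' ')"
n1="$(wc -l < "$net1.c" | tr -d ' ')"
n2="$(wc -l < "$net2.c" | tr -d ' ')"

# Jaccard = shared / union; overlap = shared / size of the smaller edge set
scores="$(awk -v s="$shared" -v a="$n1" -v b="$n2" 'BEGIN {
  m = (a < b) ? a : b
  printf "jaccard=%.3f overlap=%.3f\n", s / (a + b - s), s / m
}')"
echo "$scores"
```

Here the networks share two of five distinct edges, giving a Jaccard score of 0.400 and an overlap score of 0.667.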


Examples

- **Running in R**

    ```R
    # Can be run from anywhere so long as RWRtoolkit is installed.
    extdata.dir <- system.file("example_data", package="RWRtoolkit")

    mpo_path <- paste(extdata.dir, "string_interactions.Rdata", sep = "/")
    gold.fp <- paste(extdata.dir, "netstat/combined_score-random-gold.tsv", sep='/')
    network.fp <- paste(extdata.dir, "netstat/combined_score-random-test.tsv", sep='/')
    outdir.path <- "~/RWRtoolkitOutput/"

    RWRtoolkit::RWR_netstats(
          data = mpo_path,
          network_1 = gold.fp,
          network_2 = network.fp,
          basic_statistics = TRUE,
          scoring_metric = "both",
          pairwise_between_mpo_layer = TRUE,
          multiplex_layers_to_refnet = TRUE,
          net_to_net_similarity = TRUE,
          calculate_tau_for_mpo = TRUE,
          merged_with_all_edges = TRUE,
          merged_with_edgecounts = TRUE,
          calculate_exclusivity_for_mpo = TRUE,
          outdir = outdir.path,  # write tables to the directory defined above
          verbose = TRUE
     )

    ```

- **Running CLI**

    ```bash
    Rscript scripts/run_netstats.R \
      --data ./example_data/string_interactions.Rdata  \
      --network_1 ./example_data/netstat/combined_score-random-gold.tsv \
      --network_2 ./example_data/netstat/combined_score-random-test.tsv \
      --scoring_metric both \
      --outdir ./RWRtoolkitOutput_Netstats \
      --basic_statistics  \
      --pairwise_between_mpo_layer  \
      --multiplex_layers_to_refnet  \
      --net_to_net_similarity  \
      --calculate_tau_for_mpo  \
      --merged_with_all_edges  \
      --merged_with_edgecounts  \
      --calculate_exclusivity_for_mpo  \
      --verbose
    ```


#### RWR_shortestpaths.R

Find shortest paths between genes in gene sets. Given a single gene
set, find the shortest paths between the genes in that gene set. Given
two gene sets, find the shortest paths for pairs of genes between gene
sets.

- **Input:**
    - Pre-calculated interaction network (`data`).  The layers will be flattened into a single network to find the shortest paths.
    - A file in TSV format containing genes of interest (`source_geneset`).
    - Optional second file in TSV format containing genes of
      interest (`target_geneset`); when given, paths are found between pairs of genes drawn from the `source_geneset` and the `target_geneset`.
- **Output:** Edge list table.

Examples

- **Running in R**

    ```R
    # Can be run from anywhere so long as RWRtoolkit is installed.
    extdata.dir <- system.file("example_data", package="RWRtoolkit")

    string.interactions.fp <- paste(extdata.dir, "string_interactions.Rdata", sep='/')
    source.geneset.path <- paste(extdata.dir, 'geneset1.tsv', sep='/')
    target.geneset.path <- paste(extdata.dir, 'geneset2.tsv', sep='/')
    outdir.path <- './RWRtoolkitOutput_SP/'


    RWRtoolkit::RWR_ShortestPaths(
        data = string.interactions.fp,
        source_geneset = source.geneset.path,
        target_geneset = target.geneset.path,
        outdir = outdir.path
    )
    ```

- **Running CLI**

    ```bash
    Rscript scripts/run_shortestpaths.R \
        --data ./example_data/string_interactions.Rdata \
        --source_geneset ./example_data/geneset1.tsv \
        --target_geneset ./example_data/geneset2.tsv \
        -o ./RWRtoolkitOutput_SP/
    ```