README.md
+24 -38 (24 additions & 38 deletions)
@@ -129,25 +129,24 @@ output:
129
129
#### Mode Selection (Choose One)
130
130
131
131
-**`-a`**, **`--all`**:<br>
132
-
Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
132
+
All timestamps. Gives one folder per timestamp.
133
133
-**`-l`**, **`--last`**:<br>
134
-
Download the last version of each file snapshot. You will get one directory with a rebuild of the page. It contains the last version of each file of your specified `--range`.
134
+
Last version. Gives one folder containing the last version of each file within the specified `--range`.
135
135
-**`-f`**, **`--first`**:<br>
136
-
Download the first version of each file snapshot. You will get one directory with a rebuild of the page. It contains the first version of each file of your specified `--range`.
137
-
-**`-s`**, **`--save`**:<br>
138
-
Save a page to the Wayback Machine. (beta)
136
+
First version. Gives one folder containing the first version of each file within the specified `--range`.
139
137
140
138
#### Optional query parameters
141
139
140
+
Parameters for archive.org CDX query. No effect on snapshot download itself.
141
+
142
142
-**`-e`**, **`--explicit`**:<br>
143
-
Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
143
+
Only the explicit URL. No wildcard subdomains or paths. Use it, for example, to get root-only snapshots (`https://example.com`) or a specific file (`login.html`, `?query=this`).
144
144
145
145
-**`--limit`**`<count>`:<br>
146
-
Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
146
+
Limits the number of snapshots fetched from the archive.org CDX server. Has no effect on existing CDX files.
147
147
148
148
-**Range Selection:**<br>
149
-
Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range`, the `start` and `end` will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
Set the query range in years (`range`) or as a timestamp (`start` and/or `end`). If `range` is set, `start` and `end` are ignored. Format for timestamps: YYYYMMDDhhmmss. A timestamp can be as specific as needed (year `2019`, year+month+day `20190101`, ...).
151
150
152
151
-**`-r`**, **`--range`**:<br>
153
152
Specify the range in years for which to search and download snapshots.
@@ -157,57 +156,56 @@ output:
157
156
Timestamp to end searching.
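
The timestamp format above (YYYYMMDDhhmmss, given from the left with as much detail as needed) can be illustrated with a small Python sketch. This is not how `pywaybackup` handles timestamps internally; the helper names `is_valid_partial_timestamp` and `pad_timestamp` are hypothetical and only show which partial values fit the format and which moment each one would stand for.

```python
import re

# Allowed lengths of a left-anchored prefix of YYYYMMDDhhmmss:
# year, +month, +day, +hour, +minute, +second.
_VALID_LENGTHS = {4, 6, 8, 10, 12, 14}


def is_valid_partial_timestamp(value: str) -> bool:
    """Return True if value is digits only and has an allowed prefix length."""
    return bool(re.fullmatch(r"[0-9]+", value)) and len(value) in _VALID_LENGTHS


def pad_timestamp(value: str, end: bool = False) -> str:
    """Expand a partial timestamp to 14 digits (hypothetical helper).

    A start bound is padded to the earliest moment, an end bound to the
    latest one. The end padding uses day 31 as a simple upper bound and
    does not check the real length of the month.
    """
    if not is_valid_partial_timestamp(value):
        raise ValueError(f"not a YYYYMMDDhhmmss prefix: {value!r}")
    earliest = "00000101000000"
    latest = "99991231235959"
    template = latest if end else earliest
    # Keep the given digits and fill the rest from the template.
    return value + template[len(value):]
```

For example, `pad_timestamp("2019")` stands for the first second of 2019, and `pad_timestamp("2019", end=True)` for the last one.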
158
157
159
158
-**Filtering:**<br>
160
-
A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
161
159
162
160
-**`--filetype`**`<filetype>`:<br>
163
-
Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
161
+
Specify filetypes to download. Example: `--filetype jpg,css,js`. You can only filter by filetypes that appear explicitly in the archived path; `.html` usually does not.
164
162
165
163
-**`--statuscode`**`<statuscode>`:<br>
166
-
Specify HTTP status codes to download. Default is all statuscodes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of it's statuscode. For 404 of course this means logged errors and corresponding entries in the csv. However, you may want to get a csv that includes these negative attempts for your needs.<br>
164
+
Specify HTTP status codes to download. Example: `--statuscode 200,301`. PyWayBackup will always skip `404` and `301`.<br>
167
165
Common status codes you may want to handle/filter:
168
166
-`200` (OK)
169
-
-`301` (Moved Permanently - will redirect snapshot)
167
+
-`301` (Moved Permanently)
170
168
-`404` (Not Found - snapshot seems to be empty)
171
169
-`500` (Internal Server Error - snapshot is at least for now not available)
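
Both `--filetype` and `--statuscode` take a comma-separated value. The following Python sketch shows one way such a value could be split and checked. It is an illustration only, not the parser used by `pywaybackup`; `parse_csv_option` and `parse_statuscodes` are hypothetical names.

```python
def parse_csv_option(raw: str) -> list[str]:
    """Split a comma-separated option value such as 'jpg,css,js'."""
    # Strip whitespace around each entry and drop empty entries.
    return [part.strip() for part in raw.split(",") if part.strip()]


def parse_statuscodes(raw: str) -> set[int]:
    """Turn a value such as '200,301' into a set of HTTP status codes."""
    codes = set()
    for part in parse_csv_option(raw):
        # Accept only three-digit codes in the HTTP range 100-599.
        if not part.isdigit() or not 100 <= int(part) <= 599:
            raise ValueError(f"not an HTTP status code: {part!r}")
        codes.add(int(part))
    return codes
```

With this sketch, `parse_statuscodes("200,301")` yields the two codes from the example above, and a value such as `"200,abc"` is rejected with a `ValueError`.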
172
170
173
-
### Optional
171
+
#### Optional Behavior Manipulation
174
172
175
-
#### Behavior Manipulation
173
+
These parameters change the download behavior for snapshots.
176
174
177
175
-**`-o`**, **`--output`**:<br>
178
176
Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
179
177
180
178
-**`-m`**, **`--metadata`**<br>
181
-
Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
179
+
Folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). If you are downloading into a network share, you SHOULD set this to a local path because the SQLite locking mechanism may cause issues with network shares.
182
180
183
181
-**`--verbose`**:<br>
184
182
Increase output verbosity.
185
183
186
184
-**`--log`**<!-- `<path>` -->:<br>
187
-
Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
185
+
Saves a log file into the output-dir, named `waybackup_<sanitized_url>.log`.
188
186
189
187
-**`--progress`**:<br>
190
188
Shows a progress bar instead of the default output.
191
189
192
190
-**`--workers`**`<count>`:<br>
193
-
Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
191
+
Number of simultaneous download workers. Default is 1, safe range is about 10. Too many workers may lead to refused connections by archive.org.
194
192
195
193
-**`--no-redirect`**:<br>
196
-
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
194
+
Disables following redirects of snapshots. Can prevent timestamp-folder mismatches caused by redirects.
197
195
198
196
-**`--retry`**`<attempts>`:<br>
199
-
Specifies number of retry attempts for failed downloads.
197
+
Retry attempts for failed downloads.
200
198
201
199
-**`--delay`**`<seconds>`:<br>
202
-
Specifies delay between download requests in seconds. Default is no delay (0).
200
+
Delay between download requests in seconds. Default is no delay (0).
203
201
204
202
#### Job Handling:
205
203
206
204
-**`--reset`**:
207
-
If set, the job will be reset, and any existing `cdx`, `db`, `csv` files will be **deleted**. This allows you to start the job from scratch without considering previously downloaded data.
205
+
If set, the job will be reset, and `cdx`, `db`, `csv` files will be **deleted**. This allows you to start the job from scratch.
208
206
209
207
-**`--keep`**:
210
-
If set, all files will be kept after the job is finished. This includes the `cdx` and `db` file. Without this argument, they will be deleted if the job finished successfully.
208
+
If set, the `cdx` and `db` files will be kept after the job is finished. Otherwise they will be deleted.
211
209
212
210
<br>
213
211
<br>
@@ -218,23 +216,11 @@ output:
218
216
219
217
`pywaybackup` resumes interrupted jobs. The tool automatically continues from where it left off.
220
218
221
-
- Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
222
-
-Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
223
-
-Skips previously downloaded files to save time.
219
+
Only resumes queries if:
220
+
-existing `.cdx` and `.db` files are present in the `output dir`
221
+
-command is identical by `URL`, `mode`, and `optional query parameters`
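
The two resume conditions can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the tool's actual implementation: the functions `has_resume_files` and `job_fingerprint` are hypothetical, and how `pywaybackup` really compares commands is not described here.

```python
import hashlib
import json
from pathlib import Path


def has_resume_files(folder: str) -> bool:
    """Return True if the folder holds at least one .cdx and one .db file."""
    path = Path(folder)
    if not path.is_dir():
        return False
    return any(path.glob("*.cdx")) and any(path.glob("*.db"))


def job_fingerprint(url: str, mode: str, query_params: dict) -> str:
    """Build a stable hash from URL, mode and optional query parameters.

    Two commands with the same fingerprint would count as identical in
    this sketch; the key order of query_params does not matter.
    """
    payload = json.dumps(
        {"url": url, "mode": mode, "query": query_params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

In this sketch a job would be resumed only if `has_resume_files` is true and the fingerprint of the new command equals the one of the earlier run.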