Commit 25adb36

added missing return type; trimmed readme
1 parent 9b2ae8e commit 25adb36

2 files changed

+25
-39
lines changed

README.md

Lines changed: 24 additions & 38 deletions
Original file line number / Diff line number / Diff line change
@@ -129,25 +129,24 @@ output:
129129
#### Mode Selection (Choose One)
130130

131131
- **`-a`**, **`--all`**:<br>
132-
Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
132+
All timestamps. Gives one folder per timestamp.
133133
- **`-l`**, **`--last`**:<br>
134-
Download the last version of each file snapshot. You will get one directory with a rebuild of the page. It contains the last version of each file of your specified `--range`.
134+
Last version. Gives one folder containing the last version of each file within the specified `--range`.
135135
- **`-f`**, **`--first`**:<br>
136-
Download the first version of each file snapshot. You will get one directory with a rebuild of the page. It contains the first version of each file of your specified `--range`.
137-
- **`-s`**, **`--save`**:<br>
138-
Save a page to the Wayback Machine. (beta)
136+
First version. Gives one folder containing the first version of each file within the specified `--range`.
139137

140138
#### Optional query parameters
141139

140+
Parameters for the archive.org CDX query. They have no effect on the snapshot download itself.
141+
142142
- **`-e`**, **`--explicit`**:<br>
143-
Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
143+
Only the explicit URL. No wildcard subdomains or paths. For example, get a root-only snapshot (`https://example.com`) or a specific file (`login.html`, `?query=this`).
144144

145145
- **`--limit`** `<count>`:<br>
146-
Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
146+
Limits the number of snapshots fetched from the archive.org CDX server. Has no effect on existing CDX files.
147147

148148
- **Range Selection:**<br>
149-
Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range`, the `start` and `end` will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
150-
(year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
149+
Set the query range in years (`range`) or by timestamp (`start` and/or `end`). If `range` is set, `start` and `end` are ignored. Timestamp format: YYYYMMDDhhmmss. A timestamp can be as specific as needed (year 2019, year+month+day 20190101, ...).
151150

152151
- **`-r`**, **`--range`**:<br>
153152
Specify the range in years for which to search and download snapshots.
@@ -157,57 +156,56 @@ output:
157156
Timestamp to end searching.
158157

159158
- **Filtering:**<br>
160-
A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
161159

162160
- **`--filetype`** `<filetype>`:<br>
163-
Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
161+
Specify filetypes to download. Example: `--filetype jpg,css,js`. You can only filter by filetypes that appear explicitly in the snapshot path stored by archive.org (`.html` usually does not).
164162

165163
- **`--statuscode`** `<statuscode>`:<br>
166-
Specify HTTP status codes to download. Default is all statuscodes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of it's statuscode. For 404 of course this means logged errors and corresponding entries in the csv. However, you may want to get a csv that includes these negative attempts for your needs.<br>
164+
Specify HTTP status codes to download. Example: `--statuscode 200,301`. PyWayBackup will always skip `404` and `301`.<br>
167165
Common status codes you may want to handle/filter:
168166
- `200` (OK)
169-
- `301` (Moved Permanently - will redirect snapshot)
167+
- `301` (Moved Permanently)
170168
- `404` (Not Found - snapshot seems to be empty)
171169
- `500` (Internal Server Error - snapshot is at least for now not available)
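
The timestamp format described under Range Selection (a year, optionally extended from the left toward the full `YYYYMMDDhhmmss`) can be illustrated with a minimal sketch. The helper `split_timestamp` is hypothetical and not part of pywaybackup; it assumes that a year is the minimum and that each further field adds exactly two digits.

```python
import re

# Field names and widths of a full YYYYMMDDhhmmss timestamp.
FIELDS = [("year", 4), ("month", 2), ("day", 2),
          ("hour", 2), ("minute", 2), ("second", 2)]


def split_timestamp(ts: str) -> dict:
    """Split a full or partial timestamp into the fields it specifies."""
    # Assumption: a year is the minimum, each further field adds two digits.
    if not re.fullmatch(r"\d{4}(\d{2}){0,5}", ts):
        raise ValueError(f"not a YYYYMMDDhhmmss prefix: {ts!r}")
    parts, pos = {}, 0
    for name, width in FIELDS:
        if pos >= len(ts):
            break  # the timestamp is no more specific than this
        parts[name] = ts[pos:pos + width]
        pos += width
    return parts


print(split_timestamp("2019"))        # {'year': '2019'}
print(split_timestamp("2019010112"))  # {'year': '2019', 'month': '01', 'day': '01', 'hour': '12'}
```

The sketch only checks the shape of the string; it does not verify that the month, day, or hour values are valid calendar values.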
172170

173-
### Optional
171+
#### Optional Behavior Manipulation
174172

175-
#### Behavior Manipulation
173+
These parameters change the download behavior for snapshots.
176174

177175
- **`-o`**, **`--output`**:<br>
178176
Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
179177

180178
- **`-m`**, **`--metadata`**<br>
181-
Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
179+
Folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). If you are downloading into a network share, you SHOULD set this to a local path because the SQLite locking mechanism may cause issues on network shares.
182180

183181
- **`--verbose`**:<br>
184182
Increase output verbosity.
185183

186184
- **`--log`** <!-- `<path>` -->:<br>
187-
Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
185+
Saves a log file into the output-dir, named `waybackup_<sanitized_url>.log`.
188186

189187
- **`--progress`**:<br>
190188
Shows a progress bar instead of the default output.
191189

192190
- **`--workers`** `<count>`:<br>
193-
Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
191+
Number of simultaneous download workers. Default is 1, safe range is about 10. Too many workers may lead to refused connections by archive.org.
194192

195193
- **`--no-redirect`**:<br>
196-
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
194+
Disables following redirects of snapshots. Can prevent timestamp-folder mismatches caused by redirects.
197195

198196
- **`--retry`** `<attempts>`:<br>
199-
Specifies number of retry attempts for failed downloads.
197+
Retry attempts for failed downloads.
200198

201199
- **`--delay`** `<seconds>`:<br>
202-
Specifies delay between download requests in seconds. Default is no delay (0).
200+
Delay between download requests in seconds. Default is no delay (0).
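
How `--retry` and `--delay` could interact is sketched below. This is an illustration under stated assumptions, not pywaybackup's actual download loop: `download_all`, `fetch`, and `sleep` are hypothetical names, a failed download is assumed to raise `OSError`, and the delay is assumed to apply after every request, including retries.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def download_all(urls: Iterable[str],
                 fetch: Callable[[str], bytes],
                 retry: int = 0,
                 delay: float = 0.0,
                 sleep: Callable[[float], None] = time.sleep) -> Dict[str, Optional[bytes]]:
    """Fetch each URL once, plus up to `retry` further attempts on failure."""
    results: Dict[str, Optional[bytes]] = {}
    for url in urls:
        results[url] = None  # stays None if every attempt fails
        for _ in range(retry + 1):
            try:
                results[url] = fetch(url)
                succeeded = True
            except OSError:
                succeeded = False
            if delay > 0:
                sleep(delay)  # pause between requests (assumed to include retries)
            if succeeded:
                break
    return results
```

Passing `fetch` and `sleep` as arguments keeps the sketch independent of any HTTP library and makes it easy to exercise without network access.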
203201

204202
#### Job Handling:
205203

206204
- **`--reset`**:
207-
If set, the job will be reset, and any existing `cdx`, `db`, `csv` files will be **deleted**. This allows you to start the job from scratch without considering previously downloaded data.
205+
If set, the job will be reset, and `cdx`, `db`, `csv` files will be **deleted**. This allows you to start the job from scratch.
208206

209207
- **`--keep`**:
210-
If set, all files will be kept after the job is finished. This includes the `cdx` and `db` file. Without this argument, they will be deleted if the job finished successfully.
208+
If set, `cdx` and `db` files will be kept after the job is finished. Otherwise they will be deleted.
211209

212210
<br>
213211
<br>
@@ -218,23 +216,11 @@ output:
218216

219217
`pywaybackup` resumes interrupted jobs. The tool automatically continues from where it left off.
220218

221-
- Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
222-
- Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
223-
- Skips previously downloaded files to save time.
219+
A job is only resumed if:
220+
- `.cdx` and `.db` files exist in the `output dir`
221+
- the command is identical in `URL`, `mode`, and `optional query parameters`
224222
> **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
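
One way such an "identical command" check could be implemented is to derive a stable key from the URL, mode, and query parameters and compare it between runs. The sketch below is an assumption for illustration only; `job_key` is a hypothetical helper, and the README does not state how pywaybackup actually compares jobs.

```python
import hashlib
import json


def job_key(url: str, mode: str, query_params: dict) -> str:
    """Return a stable hex digest identifying a job by URL, mode and query parameters."""
    # sort_keys makes the key independent of the order in which parameters were given.
    payload = json.dumps({"url": url, "mode": mode, "query": query_params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


first = job_key("https://example.com", "all", {"limit": 10, "explicit": True})
second = job_key("https://example.com", "all", {"explicit": True, "limit": 10})
changed = job_key("https://example.com", "last", {"limit": 10, "explicit": True})
print(first == second)   # same job, parameters given in a different order
print(first == changed)  # mode changed, so the job would not be resumed
```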
225223
226-
#### Resetting a Job (`--reset`)
227-
228-
- Deletes `.cdx` and `.db` files and restarts the process from scratch.
229-
- Does **not** remove already downloaded files.
230-
- `waybackup -u https://example.com -a --reset`
231-
232-
#### Keeping Job Data (`--keep`)
233-
234-
- Normally, `.cdx` and `.db` files are deleted after a successful job.
235-
- `--keep` preserves them for future re-analysis or extending the query.
236-
- `waybackup -u https://example.com -a --keep`
237-
238224
<br>
239225
<br>
240226

pywaybackup/PyWayBackup.py

Lines changed: 1 addition & 1 deletion
Original file line number / Diff line number / Diff line change
@@ -409,7 +409,7 @@ def paths(self, rel: bool = False) -> dict:
409409
}
410410
return {key: (os.path.relpath(path) if rel else path) for key, path in files.items() if path and os.path.exists(path)}
411411

412-
def status(self):
412+
def status(self) -> dict:
413413
"""
414414
Return the current status of the backup process by a dictionary:
415415
{'task':, 'current':, 'total':, 'progress':}
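
A caller might consume the dictionary returned by `status()` roughly as follows. Only the four keys come from the docstring above; the helper `format_status`, the fallback values, and the sample values are assumptions made for illustration.

```python
def format_status(status: dict) -> str:
    """Render a status dictionary with the keys 'task', 'current', 'total', 'progress'."""
    # Missing keys fall back to placeholders, since the value types are not documented here.
    task = status.get("task", "unknown")
    current = status.get("current", "?")
    total = status.get("total", "?")
    progress = status.get("progress", "?")
    return f"{task}: {current}/{total} ({progress})"


# Hand-built sample; a real dictionary would come from calling status() on a backup object.
sample = {"task": "downloading", "current": 5, "total": 20, "progress": "25%"}
print(format_status(sample))  # downloading: 5/20 (25%)
```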
