# leakrfc

_An RFC for leaks_

[leak-rfc.org](https://leak-rfc.org)
`leakrfc` provides a _data standard_ and _archive storage_ for leaked data, private and public document collections. The concepts and implementations are originally inspired by [mmmeta](https://github.com/simonwoerpel/mmmeta) and [Aleph's servicelayer archive](https://github.com/alephdata/servicelayer).

`leakrfc` acts as a multi-tenant storage and retrieval mechanism for documents and their metadata. It provides a high-level interface for generating and sharing document collections and importing them into various search and analysis platforms, such as [_ICIJ Datashare_](https://datashare.icij.org/), [_Liquid Investigations_](https://github.com/liquidinvestigations/), and [_Aleph_](https://docs.aleph.occrp.org/).

It can act as a drop-in replacement for the underlying archive of [Aleph](https://docs.aleph.occrp.org/).

## Installation

Requires Python 3.11 or later.

```bash
pip install leakrfc
```
## build a dataset
`leakrfc` stores _metadata_ for each file, which then refers to the actual _source file_.
List the files in a publicly accessible source (using [`anystore`](https://github.com/investigativedata/anystore/)). The _metadata_ and _source files_ are then stored in the archive (`./data` by default). All _metadata_ and other information lives in the `ddos_patriotfront/.leakrfc` subdirectory. Files are keyed and retrievable by their checksum (default: `sha1`).
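The checksum keying described above can be illustrated with Python's standard `hashlib`. This is a sketch, not `leakrfc`'s internal code; the file name is a hypothetical example:

```python
import hashlib
from pathlib import Path

def file_key(path: str, algorithm: str = "sha1") -> str:
    """Compute the checksum a file is keyed by (default: sha1)."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        # read in chunks so large files don't fill memory
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

# hypothetical example file
Path("example.txt").write_text("hello")
print(file_key("example.txt"))
```

Any file in the archive can then be looked up by that hex digest, independent of its original path.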
Retrieve file metadata:
```bash
leakrfc -d ddos_patriotfront head "19338a97797bcc0eeb832cf7169cbbafc54ed255"
```
Retrieve the actual file blob:
```bash
leakrfc -d ddos_patriotfront get "19338a97797bcc0eeb832cf7169cbbafc54ed255" > file.pdf
```

Authorization expects an encrypted bearer token with the dataset and key lookup in the subject (token payload: `{"sub": "<dataset>/<key>"}`). Therefore, clients need to be able to create such tokens (knowing the secret key configured via `LEAKRFC_API_SECRET_KEY`) and handle dataset permissions.

Tokens should have a short expiration (via the `exp` property in the payload).
```bash
# token in Authorization header
curl -H 'Authorization: Bearer <token>' ...

# metadata only via headers
curl -I "http://localhost:5000/file"

# bytes stream of file
curl -s "http://localhost:5000/file" > /tmp/file.s
```
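Minting such a short-lived token can be sketched with Python's standard library. This builds a minimal HS256 JWT by hand; the source only says "JSON Web Tokens", so HS256 is an assumption, and in practice a library like PyJWT would be used. The dataset, key, and secret values below are hypothetical:

```python
import base64
import hashlib
import hmac
import json
import time

def make_token(dataset: str, key: str, secret: str, ttl: int = 60) -> str:
    """Mint a short-lived HS256 JWT with the dataset/key lookup in `sub`."""
    def b64(data: bytes) -> str:
        return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

    header = b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64(json.dumps({
        "sub": f"{dataset}/{key}",      # dataset and key lookup
        "exp": int(time.time()) + ttl,  # short expiration
    }).encode())
    signing_input = f"{header}.{payload}".encode()
    signature = b64(hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{signature}"

# hypothetical values
token = make_token(
    "ddos_patriotfront",
    "19338a97797bcc0eeb832cf7169cbbafc54ed255",
    "change-me",
)
print(token)
```

The resulting token would then be sent as the `Authorization: Bearer <token>` header shown above.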
## configure storage
```yaml
storage_config:
  uri: s3://my_bucket
  backend_kwargs:
    endpoint_url: https://s3.example.org
    aws_access_key_id: ${AWS_ACCESS_KEY_ID}
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
```
### pass through legacy aleph
```yaml
storage_config:
  uri: gcs://aleph_archive/
  legacy_aleph: true
  copy_over: true # subsequently merge legacy archive data into `leakrfc`
```
## layout
The _RFC_ is reflected by the following layout structure for a _Dataset_:
```bash
./archive/
  my_dataset/

    # metadata maintained by `leakrfc`
    .leakrfc/
      index.json  # generated dataset metadata served for clients
```

## api

`leakrfc` provides a simple API powered by [FastAPI](https://fastapi.tiangolo.com/) for clients to retrieve file metadata and blobs. It therefore acts as a proxy between client and archive, so that the client doesn't need to know where the actual blobs live. The API can handle authorization via [JSON Web Tokens](https://jwt.io).
## Start local api server
This is for a quick testing setup:
```bash
export LEAKRFC_URI=./data
uvicorn leakrfc.api:app
```
!!! warning

    Never run the api with `DEBUG=1` in a production application, and make sure to have a proper setup with a load balancer (e.g. nginx) doing TLS termination in front of it. Also make sure to set a strong `LEAKRFC_API_SECRET_KEY` environment variable for the token authorization.
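A strong secret can be generated with Python's standard `secrets` module. The variable name comes from the warning above; the 32-byte length is an assumed, reasonable choice, not a documented requirement:

```python
import secrets

# 32 random bytes, hex-encoded -> 64-character secret
secret_key = secrets.token_hex(32)
print(f"LEAKRFC_API_SECRET_KEY={secret_key}")
```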
## cache

For incremental processing of tasks, `leakrfc` uses a global cache to track task results. If the computed cache key for a specific task (e.g. syncing a file or extracting an archive) is already found in the cache, running the task again is skipped. This is implemented in a very granular way and applies to all kinds of operations, such as [crawl](./crawl.md), [make](./make.md) and the adapters (currently [aleph](./sync/aleph.md)).
`leakrfc` uses [anystore](https://docs.investigraph.dev/lib/anystore/cache/) for the cache implementation, so any supported backend is possible. Recommended backends are Redis or SQL, but a distributed cloud backend (such as a shared S3 bucket) can make sense, too.
By default, an in-memory cache is used, which doesn't persist.
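The skip-if-cached pattern can be sketched in plain Python. The key scheme, the task, and the in-memory dict backend here are illustrative assumptions standing in for `leakrfc`'s actual anystore-backed implementation:

```python
import hashlib
from typing import Callable

# illustrative in-memory cache backend; leakrfc uses anystore,
# which supports persistent backends (redis, sql, s3, ...)
_cache: dict[str, str] = {}

def cache_key(operation: str, dataset: str, item: str) -> str:
    """Derive a deterministic key for a task (assumed scheme)."""
    return hashlib.sha1(f"{operation}:{dataset}:{item}".encode()).hexdigest()

def run_once(operation: str, dataset: str, item: str, task: Callable[[], str]) -> str:
    """Run `task` only if its cache key is not yet present."""
    key = cache_key(operation, dataset, item)
    if key in _cache:
        return _cache[key]  # already processed: skip
    result = task()
    _cache[key] = result
    return result

calls = []
def sync_file() -> str:
    calls.append(1)
    return "synced"

run_once("sync", "ddos_patriotfront", "file.pdf", sync_file)
run_once("sync", "ddos_patriotfront", "file.pdf", sync_file)  # skipped
print(len(calls))
```

Because the key is derived only from the task's identifying inputs, re-running a whole crawl or make pipeline touches only the items that haven't been processed yet.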