Commit 8914365

Merge pull request #13 from investigativedata/develop
v0.0.4
2 parents: d35bf64 + a59c5c4


59 files changed: +3910 additions, −4412 deletions

.bumpversion.cfg

Lines changed: 3 additions & 1 deletion
@@ -1,11 +1,13 @@
 [bumpversion]
-current_version = 0.0.3
+current_version = 0.0.4
 commit = True
 tag = True
 message = 🔖 Bump version: {current_version} → {new_version}

 [bumpversion:file:VERSION]

 [bumpversion:file:pyproject.toml]
+search = version = "{current_version}"
+replace = version = "{new_version}"

 [bumpversion:file:leakrfc/__init__.py]

.github/workflows/docker.yml

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ jobs:
         uses: docker/build-push-action@v6
         with:
           context: .
-          platforms: linux/amd64,linux/arm64
+          platforms: linux/amd64 #,linux/arm64
           push: true
           tags: ${{ steps.meta.outputs.tags }}
           labels: ${{ steps.meta.outputs.labels }}

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 _wip
 archive/*
 .anystore/*
+tests/fixtures/**/.leakrfc/*
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]

.pre-commit-config.yaml

Lines changed: 1 addition & 4 deletions
@@ -69,10 +69,7 @@ repos:
       - id: rst-inline-touching-normal

   - repo: https://github.com/python-poetry/poetry
-    rev: 1.8.0
+    rev: 2.0.1
     hooks:
       - id: poetry-check
       - id: poetry-lock
-        args: ["--no-update"]
-      - id: poetry-export
-        args: ["-f", "requirements.txt", "-o", "requirements.txt"]

Makefile

Lines changed: 4 additions & 0 deletions
@@ -33,3 +33,7 @@ clean:
 	find . -name '*.pyo' -exec rm -f {} +
 	find . -name '*~' -exec rm -f {} +
 	find . -name '__pycache__' -exec rm -fr {} +
+
+documentation:
+	mkdocs build
+	aws --endpoint-url https://s3.investigativedata.org s3 sync ./site s3://docs.investigraph.dev/lib/leakrfc

NOTICE

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 LEAKRFC, (C) 2024 investigativedata.io
+LEAKRFC, (C) 2025 investigativedata.io

 This product includes software developed at investigativedata.io
 (https://investigativedata.io)

README.md

Lines changed: 7 additions & 144 deletions
@@ -1,162 +1,24 @@
 # leakrfc

-_An RFC for leaks_
+"_A RFC for leaks_"

 [leak-rfc.org](https://leak-rfc.org)

 `leakrfc` provides a _data standard_ and _archive storage_ for leaked data, private and public document collections. The concepts and implementations are originally inspired by [mmmeta](https://github.com/simonwoerpel/mmmeta) and [Aleph's servicelayer archive](https://github.com/alephdata/servicelayer).

-`leakrfc` acts as a standardized storage and retrieval mechanism for documents and their metadata. It provides an high-level interface for generating and sharing document collections and importing them into various analysis platforms, such as [_ICIJ Datashare_](https://datashare.icij.org/), [_Liquid Investigations_](https://github.com/liquidinvestigations/), and [_Aleph_](docs.aleph.occrp.org/).
+`leakrfc` acts as a multi-tenant storage and retrieval mechanism for documents and their metadata. It provides a high-level interface for generating and sharing document collections and importing them into various search and analysis platforms, such as [_ICIJ Datashare_](https://datashare.icij.org/), [_Liquid Investigations_](https://github.com/liquidinvestigations/), and [_Aleph_](https://docs.aleph.occrp.org/).

-It can act as a drop-in replacement for the underlying archive of [Aleph](https://docs.aleph.occrp.org/).
+## Installation

-## install
+Requires python 3.11 or later.

 ```bash
 pip install leakrfc
 ```

-## build a dataset
+## Documentation

-`leakrfc` stores _metadata_ for the files that then refers to the actual _source file_.
-
-List the files in a public accessible source (using [`anystore`](https://github.com/investigativedata/anystore/)):
-
-```bash
-ANYSTORE_URI="https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes/" anystore keys
-```
-
-Crawl these documents into this _dataset_:
-
-```bash
-leakrfc -d ddos_patriotfront crawl "https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes"
-```
-
-The _metadata_ and _source files_ are now stored in the archive (`./data` by default). All _metadata_ and other information lives in the `ddos_patriotfront/.leakrfc` subdirectory. Files are keyed and retrievable by their checksum (default: `sha1`).
-
-Retrieve file metadata:
-
-```bash
-leakrfc -d ddos_patriotfront head "19338a97797bcc0eeb832cf7169cbbafc54ed255"
-```
-
-Retrieve actual file blob:
-
-```bash
-leakrfc -d ddos_patriotfront get "19338a97797bcc0eeb832cf7169cbbafc54ed255" > file.pdf
-```
-
-## api
-
-### run api
-
-```bash
-export LEAKRFC_ARCHIVE__URI=./data
-uvicorn leakrfc.api:app
-```
-
-### request a file
-
-For public files:
-
-```bash
-# metadata only via headers
-curl -I "http://localhost:5000/<dataset>/<sha1>"
-
-# bytes stream of file
-curl -s "http://localhost:5000/<dataset>/<sha1>" > /tmp/file.lrfc
-```
-
-Authorization expects an encrypted bearer token with the dataset and key lookup in the subject (token payload: `{"sub": "<dataset>/<key>"}`). Therefore, clients need to be able to create such tokens (knowing the secret key) and handle dataset permissions.
-
-Tokens should have a short expiration (via `exp` property in payload).
-
-```bash
-# token in Authorization header
-curl -H 'Authorization: Bearer <token>' ...
-
-# metadata only via headers
-curl -I "http://localhost:5000/file"
-
-# bytes stream of file
-curl -s "http://localhost:5000/file" > /tmp/file.s
-```
-
-## configure storage
-
-```yaml
-storage_config:
-  uri: s3://my_bucket
-  backend_kwargs:
-    endpoint_url: https://s3.example.org
-    aws_access_key_id: ${AWS_ACCESS_KEY_ID}
-    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
-```
-
-### pass through legacy aleph
-
-```yaml
-storage_config:
-  uri: gcs://aleph_archive/
-  legacy_aleph: true
-  copy_over: true # subsequently merge legacy archive data into `leakrfc`
-```
-
-## layout
-
-The _RFC_ is reflected by the following layout structure for a _Dataset_:
-
-```bash
-./archive/
-    my_dataset/
-
-        # metadata maintained by `leakrfc`
-        .leakrfc/
-            index.json       # generated dataset metadata served for clients
-            config.yml       # dataset configuration
-            documents.csv    # document database (all metadata combined)
-            keys.csv         # hash -> uri mapping for all files
-            state/           # processing state
-                logs/
-                created_at
-                updated_at
-            entities/
-                entities.ftm.json
-            files/           # FILE METADATA STORAGE:
-                a1/b1/a1b1c1.../info.json      # - file metadata as json REQUIRED
-                a1/b1/a1b1c1.../txt            # - extracted plain text
-                a1/b1/a1b1c1.../converted.pdf  # - converted file, e.g. from .docx to .pdf for better web display
-                a1/b1/a1b1c1.../extracted/     # - extracted files from packages/archives
-                    foo.txt
-            export/
-                my_dataset.img.zst   # dump as image
-                my_dataset.leakrfc   # dump as zipfile
-
-        # actual (read-only) data
-        Arbitrary Folder/
-            Source1.pdf
-            Tables/
-                Another_File.xlsx
-```
-
-### dataset config.yml
-
-Follows the specification in [`ftmq.model.Dataset`](https://github.com/investigativedata/ftmq/blob/main/ftmq/model/dataset.py):
-
-```yaml
-name: my_dataset # also known as "foreign_id"
-title: An awesome leak
-description: >
-  Incidunt eum asperiores impedit. Nobis est dolorem et quam autem quo. Name
-  labore sequi maxime qui non voluptatum ducimus voluptas. Exercitationem enim
-  similique asperiores quod et quae maiores. Et accusantium accusantium error
-  et alias aut omnis eos. Omnis porro sit eum et.
-updated_at: 2024-09-25
-index_url: https://static.example.org/my_dataset/index.json
-# add more metadata
-
-leakrfc: # see above
-```
+[docs.investigraph.dev/lib/leakrfc](https://docs.investigraph.dev/lib/leakrfc)

 ## Development

@@ -183,6 +45,7 @@ Before creating a commit, this checks for correct code formatting (isort, black)
 ## License and Copyright

 `leakrfc`, (C) 2024 investigativedata.io
+`leakrfc`, (C) 2025 investigativedata.io

 `leakrfc` is licensed under the AGPLv3 or later license.
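The README sections removed above (now on the documentation site) describe files being keyed and retrievable by their checksum (default: `sha1`), with metadata stored under sharded paths like `a1/b1/a1b1c1.../info.json`. A minimal Python sketch of that addressing scheme — the two-level sharding and the `files/` prefix are inferred from the example layout, so treat the exact path shape as an assumption:

```python
import hashlib
from pathlib import Path


def file_key(data: bytes) -> str:
    """Content key: the sha1 hex digest of the file bytes (leakrfc's default)."""
    return hashlib.sha1(data).hexdigest()


def metadata_path(key: str) -> Path:
    """Sharded metadata location, e.g. files/a1/b1/a1b1.../info.json (assumed layout)."""
    return Path("files") / key[:2] / key[2:4] / key / "info.json"


key = file_key(b"hello leakrfc")   # 40-char hex digest
path = metadata_path(key)          # files/<aa>/<bb>/<full sha1>/info.json
```

Sharding by the first hex bytes keeps directory sizes bounded even for archives with millions of files.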

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.0.3
+0.0.4

docs/api.md

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+`leakrfc` provides a simpel api powered by [FastAPI](https://fastapi.tiangolo.com/) for clients to retrieve file metadata and blobs. It therefore acts as a proxy between client and archive, so that the client doesn't need to know where the actual blobs live. The api can handle authorization via [JSON Web Tokens](https://jwt.io).
+
+## Start local api server
+
+This is for a quick testing setup:
+
+```bash
+export LEAKRFC_URI=./data
+uvicorn leakrfc.api:app
+```
+
+!!! warning
+
+    Never run the api with `DEBUG=1` in a production application and make sure to have a proper setup with a load balancer (e.g. nginx) doing TLS termination in front of it. As well make sure to set a good `LEAKRFC_API_SECRET_KEY` environment variable for the token authorization.
+
+## Request a file
+
+For public files:
+
+```bash
+# metadata only via headers
+curl -I "http://localhost:5000/test_dataset/utf.txt"
+
+HTTP/1.1 200 OK
+date: Thu, 16 Jan 2025 08:44:59 GMT
+server: uvicorn
+content-length: 4
+content-type: application/json
+x-leakrfc-version: 0.0.3
+x-leakrfc-dataset: test_dataset
+x-leakrfc-key: utf.txt
+x-leakrfc-sha1: 5a6acf229ba576d9a40b09292595658bbb74ef56
+x-leakrfc-name: utf.txt
+x-leakrfc-size: 19
+x-mimetype: text/plain
+content-type: text/plain
+```
+
+```bash
+# bytes stream of file
+curl -s "http://localhost:5000/<dataset>/<path>" > /tmp/file.pdf
+```
+
+Authorization expects an encrypted bearer token with the dataset and key lookup in the subject (token payload: `{"sub": "<dataset>/<key>"}`). Therefore, clients need to be able to create such tokens (knowing the secret key configured via `LEAKRFC_API_SECRET_KEY`) and handle dataset permissions.
+
+Tokens should have a short expiration (via `exp` property in payload).
+
+```bash
+# token in Authorization header
+curl -H 'Authorization: Bearer <token>' ...
+
+# metadata only via headers
+curl -I "http://localhost:5000/file"
+
+# bytes stream of file
+curl -s "http://localhost:5000/file" > /tmp/file.lrfc
+```
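The bearer-token scheme the added docs describe (subject `{"sub": "<dataset>/<key>"}` plus a short `exp`) can be sketched with Python's standard library. `make_token` is a hypothetical client-side helper, not a `leakrfc` API; a real client would more likely use a JWT library such as PyJWT, signing with the shared `LEAKRFC_API_SECRET_KEY`:

```python
import base64
import hashlib
import hmac
import json
import time


def make_token(dataset: str, key: str, secret: str, ttl: int = 60) -> str:
    """Build a short-lived HS256 JWT for one dataset/key lookup (illustrative only)."""

    def b64url(data: bytes) -> str:
        # JWTs use unpadded base64url segments
        return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(
        json.dumps({"sub": f"{dataset}/{key}", "exp": int(time.time()) + ttl}).encode()
    )
    signing_input = f"{header}.{payload}".encode()
    signature = b64url(hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{signature}"


token = make_token("test_dataset", "utf.txt", "my-secret-key")
# usable as: curl -H f"Authorization: Bearer {token}" ...
```

Keeping `ttl` small limits the damage if a token leaks, which matters when the archive holds sensitive source material.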

docs/cache.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+For incremental processing of tasks, `leakrfc` uses a global cache to track task results. If a computed cache key for a specific task (e.g. sync a file, extract an archive) is already found in cache, running the task again will be skipped. This is implemented very granular and applies to all kinds of operations, such as [crawl](./crawl.md), [make](./make.md) and the adapters (currently [aleph](./sync/aleph.md))
+
+`leakrfc` is using [anystore](https://docs.investigraph.dev/lib/anystore/cache/) for the cache implementation, so any supported backend is possible. Recommended backends are redis or sql, but a distributed cloud-backend (such as a shared s3 bucket) can make sense, too.
+
+Per default, an in-memory cache is used, which doesn't persist.
+
+## Configure
+
+Via environment var:
+
+```bash
+LEAKRFC_CACHE__URI=redis://localhost
+
+# additional config
+LEAKRFC_CACHE__DEFAULT_TTL=3600 # seconds
+LEAKRFC_CACHE__BACKEND_CONFIG__REDIS_PREFIX=my-prefix
+```
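The skip-if-cached behaviour the new cache page describes can be sketched like this; `task_cache`, `cache_key` and `run_once` are illustrative names only, and the real implementation delegates to `anystore` cache backends (redis, sql, s3, ...) rather than an in-process dict:

```python
import hashlib

# toy in-memory cache standing in for an anystore backend
task_cache: dict[str, bool] = {}


def cache_key(operation: str, dataset: str, file_key: str) -> str:
    """Derive a stable key for one task on one file (hypothetical scheme)."""
    raw = f"{operation}:{dataset}:{file_key}"
    return hashlib.sha1(raw.encode()).hexdigest()


def run_once(operation: str, dataset: str, file_key: str) -> bool:
    """Run the task only if its key is not yet cached; return True if it ran."""
    key = cache_key(operation, dataset, file_key)
    if key in task_cache:
        return False  # result already recorded: skip the task
    # ... perform the actual work (sync, extract, ...) here ...
    task_cache[key] = True
    return True
```

Because the key covers operation, dataset and file, re-running a crawl only touches files that were not seen before, which is what makes the processing incremental.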
