
Commit 78fd284 (parent 033cb42): Improve documentation

2 files changed: +191 -0 lines changed

README.md

Lines changed: 186 additions & 0 deletions
# Crawler Core

## Inner workings

### Controller

The following sequence diagrams show how the controller is started from `CommonMain.main`.
To aid readability, we split the inner workings of the Controller into two diagrams. The first one depicts how `PublishBulkScanJob` is triggered, which includes the scheduling feature.
The second diagram shows how `PublishBulkScanJob` then publishes the bulk scan.

```mermaid
%%{init: { 'sequence': {'mirrorActors':false} } }%%
sequenceDiagram
    participant main as CommonMain.main
    participant cmdConfig as :ControllerCommandConfig
    participant controller as :Controller
    participant quartz as Quartz Scheduler

    activate main
    alt Parse arguments using JCommander
        main ->> cmdConfig: setVariable(...)
    end

    create participant mq as :RabbitMqOrchestrationProvider
    main ->> mq: create
    create participant mongo as :MongoPersistenceProvider
    main ->> mongo: create

    main ->> controller: start()
    deactivate main
    activate controller
    controller ->>+ cmdConfig: getCronInterval
    cmdConfig -->>- controller: interval

    controller ->>+ cmdConfig: getTargetListProvider()
    create participant targetListProvider as targetListProvider:ITargetListProvider
    cmdConfig ->> targetListProvider: create
    cmdConfig -->>- controller: targetListProvider

    controller ->>+ cmdConfig: isMonitored
    cmdConfig -->>- controller: monitored
    opt if monitored
        create participant monitor as :ProgressMonitor
        controller ->> monitor: create
    end

    controller -) quartz: scheduleJob(interval, PublishBulkScanJob.class)
    controller ->>+ quartz: start
    deactivate controller
    loop as per schedule (may repeat)
        create participant publishbulkscan as :PublishBulkScanJob
        quartz ->> publishbulkscan: create
        quartz ->>+ publishbulkscan: execute
    end
```

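The scheduling step above (`scheduleJob(interval, PublishBulkScanJob.class)` followed by `start`) can be sketched in plain Java. The crawler itself uses the Quartz Scheduler; as a minimal, self-contained stand-in this sketch uses a `ScheduledExecutorService`, and the job body and 100 ms interval are hypothetical placeholders:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SchedulerSketch {
    // Counts how many times the "PublishBulkScanJob" stand-in has fired.
    static final AtomicInteger executions = new AtomicInteger();

    // Stand-in for PublishBulkScanJob.execute(); the real job publishes a bulk scan.
    static void publishBulkScan() {
        executions.incrementAndGet();
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService quartzStandIn = Executors.newSingleThreadScheduledExecutor();
        // scheduleJob(interval, ...): fire immediately, then every 100 ms (hypothetical interval)
        quartzStandIn.scheduleAtFixedRate(SchedulerSketch::publishBulkScan, 0, 100, TimeUnit.MILLISECONDS);
        Thread.sleep(250); // let the schedule fire a few times, as in the loop in the diagram
        quartzStandIn.shutdown();
        System.out.println("job fired " + executions.get() + " times");
    }
}
```

In the real controller, Quartz additionally supports cron expressions for the interval, which is what the `getCronInterval` call in the diagram refers to.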
```mermaid
%%{init: { 'sequence': {'mirrorActors':false} } }%%
sequenceDiagram
    participant publishbulkscan as :PublishBulkScanJob.execute()
    participant mq as :RabbitMqOrchestrationProvider
    participant mongo as :MongoPersistenceProvider
    participant monitor as :ProgressMonitor

    activate publishbulkscan
    create participant bulkscan as bulkScan:BulkScan
    publishbulkscan ->> bulkscan: create
    publishbulkscan ->>+ bulkscan: setTargetsGiven(int)

    publishbulkscan ->> mongo: insertBulkScan(bulkScan)

    opt if monitored
        publishbulkscan ->>+ monitor: startMonitoringBulkScanProgress(bulkScan)
        create participant bmonitor as :BulkscanMonitor
        monitor ->> bmonitor: create
        monitor ->> mq: registerDoneNotificationConsumer(BulkScanMonitor)
    end

    create participant submitter as :JobSubmitter
    publishbulkscan ->> submitter: create

    loop for each targetString
        publishbulkscan ->>+ submitter: apply(targetString)
        %%create participant target as :ScanTarget
        %%submitter ->> target: fromTargetString(targetString)
        submitter ->> submitter: create ScanTarget and JobDescription
        alt parseable and resolvable
            submitter ->> mq: submitScanJob(jobDescription)
        else error
            submitter ->> mongo: insertScanResult(error, jobDescription)
        end
        submitter -->>- publishbulkscan: JobStatus

        alt if monitored
            mq -)+ bmonitor: consumeDoneNotification
            bmonitor ->>- bmonitor: print stats
        end
    end

    opt if monitored
        publishbulkscan -) monitor: stopMonitoringAndFinalizeBulkScan(bulkScan.get_id())
        deactivate monitor
    end
    deactivate publishbulkscan
```

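The per-target branch of `JobSubmitter.apply` above (submit on success, persist an error result otherwise) can be sketched as follows. The "parseable and resolvable" check, the string target type, and the collection-backed queue and database are simplifying assumptions, not the real `ScanTarget`/RabbitMQ/MongoDB types:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class JobSubmitterSketch {
    // Hypothetical outcome type mirroring the JobStatus returned in the diagram.
    enum JobStatus { SUBMITTED, ERROR }

    static final List<String> submittedJobs = new ArrayList<>(); // stands in for mq.submitScanJob
    static final List<String> errorResults = new ArrayList<>();  // stands in for mongo.insertScanResult(error, ...)

    // Mirrors JobSubmitter.apply(targetString): parse the target, then either
    // submit a scan job to the queue or persist an error result in the DB.
    static JobStatus apply(String targetString) {
        Optional<String> target = parseTarget(targetString);
        if (target.isPresent()) {
            submittedJobs.add(target.get());   // mq.submitScanJob(jobDescription)
            return JobStatus.SUBMITTED;
        } else {
            errorResults.add(targetString);    // mongo.insertScanResult(error, jobDescription)
            return JobStatus.ERROR;
        }
    }

    // Hypothetical "parseable and resolvable" check: reject blank targets and
    // strings with spaces. The real check parses and DNS-resolves the target.
    static Optional<String> parseTarget(String s) {
        if (s == null || s.isBlank() || s.contains(" ")) return Optional.empty();
        return Optional.of(s.trim());
    }

    public static void main(String[] args) {
        for (String t : List.of("example.com:443", "not a host", "example.org")) {
            apply(t);
        }
        System.out.println(submittedJobs.size() + " submitted, " + errorResults.size() + " error(s)");
    }
}
```

Recording failed targets as error results (rather than dropping them) is what lets the bulk scan statistics account for every target in the input list.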
### Worker

The following sequence diagrams show how the worker is started from `CommonMain.main`.
To aid readability, we split the inner workings of the Worker into two diagrams. The first one depicts how the `Worker` is triggered for each job.
The second diagram shows how each job is handled internally.

```mermaid
%%{init: { 'sequence': {'mirrorActors':false} } }%%
sequenceDiagram
    participant main as CommonMain.main
    participant cmdConfig as :WorkerCommandConfig

    activate main
    alt Parse arguments using JCommander
        main ->> cmdConfig: setVariable(...)
    end

    create participant mq as :RabbitMqOrchestrationProvider
    main ->> mq: create
    create participant mongo as :MongoPersistenceProvider
    main ->> mongo: create

    create participant worker as :Worker
    main ->> worker: create

    main ->> worker: start()
    deactivate main
    activate worker
    worker ->>- mq: registerScanJobConsumer(this::handleScanJob)

    loop for each scan job (possibly in parallel)
        mq ->>+ worker: handleScanJob(ScanJobDescription)
    end
```

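The callback registration above inverts control: after `registerScanJobConsumer(this::handleScanJob)`, the orchestration provider pushes each arriving job into the worker. A minimal stand-in, with a plain queue and inline dispatch instead of RabbitMQ's broker-driven delivery, might look like this (all names besides those in the diagram are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

public class ConsumerSketch {
    // Minimal stand-in for RabbitMqOrchestrationProvider: jobs arriving on a
    // queue are dispatched to the registered callback.
    static class OrchestrationStandIn {
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private Consumer<String> consumer;

        void registerScanJobConsumer(Consumer<String> c) { this.consumer = c; }

        void submitScanJob(String jobDescription) throws InterruptedException {
            queue.put(jobDescription);
            // RabbitMQ pushes asynchronously; here we dispatch inline for simplicity.
            consumer.accept(queue.take());
        }
    }

    static final List<String> handled = new ArrayList<>();

    // Stand-in for Worker.handleScanJob(ScanJobDescription).
    static void handleScanJob(String jobDescription) { handled.add(jobDescription); }

    public static void main(String[] args) throws Exception {
        OrchestrationStandIn mq = new OrchestrationStandIn();
        mq.registerScanJobConsumer(ConsumerSketch::handleScanJob);
        mq.submitScanJob("scan example.com");
        System.out.println("handled: " + handled);
    }
}
```

In the real worker, RabbitMQ invokes `handleScanJob` on its own consumer threads, which is why the diagram's loop is marked "possibly in parallel".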
```mermaid
%%{init: { 'sequence': {'mirrorActors':false} } }%%
sequenceDiagram
    participant mq as :RabbitMqOrchestrationProvider
    participant mongo as :MongoPersistenceProvider
    participant worker as :Worker
    participant workermanager as :BulkScanWorkerManager

    mq ->>+ worker: handleScanJob(ScanJobDescription)

    worker -)+ workermanager: handle(ScanJobDescription)

    worker -) worker: waitForScanResult
    note right of worker: Uses thread pool<br>"crawler-worker: result handler"<br>to wait for result

    workermanager ->>+ workermanager: getBulkScanWorker(ScanJobDescription)

    opt if not exists
        create participant bulkScanWorker as :BulkScanWorker
        workermanager ->> bulkScanWorker: create(bulkScanId)
    end
    workermanager -->>- workermanager: BulkScanWorker

    workermanager ->>+ bulkScanWorker: handle(scanJobDescription.getScanTarget())
    bulkScanWorker ->>+ bulkScanWorker: scan(ScanTarget)
    note right of bulkScanWorker: Uses thread pool<br>"crawler-worker: scan executor"
    note right of bulkScanWorker: scan is abstract and is implemented by the concrete project
    bulkScanWorker -->>- bulkScanWorker: result
    bulkScanWorker -->>- workermanager: result
    workermanager -->>- worker: result
    alt success
        worker ->> mongo: insertScanResult(scanResult)
    else error
        worker ->> mongo: insertScanResult(errorResult)
        note right of worker: some error handling is performed<br>to store as much info as possible in the DB
    end
    worker -) mq: notifyOfDoneScanJob(scanJobDescription)

    deactivate worker
```
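The interplay of the two thread pools and the success/error branch above can be sketched with plain `java.util.concurrent` primitives. The pool sizes, the string target/result types, the list-backed database, and the "bad" failure trigger are all illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class WorkerSketch {
    // Stands in for mongo.insertScanResult; synchronized since result-handler
    // threads write concurrently.
    static final List<String> storedResults = Collections.synchronizedList(new ArrayList<>());

    // Stand-in for the abstract BulkScanWorker.scan(ScanTarget); the concrete
    // project implements the real scan. Here, "bad" targets simulate failures.
    static String scan(String target) {
        if (target.startsWith("bad")) throw new IllegalStateException("scan failed: " + target);
        return "result(" + target + ")";
    }

    public static void main(String[] args) throws Exception {
        // "crawler-worker: scan executor" runs the scans...
        ExecutorService scanExecutor = Executors.newFixedThreadPool(2);
        // ...while "crawler-worker: result handler" waits for results and stores them.
        ExecutorService resultHandler = Executors.newFixedThreadPool(2);

        for (String target : List.of("example.com", "bad.example")) {
            Future<String> future = scanExecutor.submit(() -> scan(target));
            resultHandler.submit(() -> {
                try {
                    storedResults.add(future.get());            // insertScanResult(scanResult)
                } catch (Exception e) {
                    storedResults.add("error(" + target + ")"); // insertScanResult(errorResult)
                }
            });
        }
        resultHandler.shutdown();
        scanExecutor.shutdown();
        resultHandler.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(storedResults.size() + " results stored");
    }
}
```

Storing an error result instead of rethrowing mirrors the diagram's error branch: even a failed scan leaves a record in the database, and `notifyOfDoneScanJob` can be sent either way.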

src/main/java/de/rub/nds/crawler/data/ScanJobDescription.java

Lines changed: 5 additions & 0 deletions
```java
import java.util.Optional;
import java.util.UUID;

/**
 * Single scan job to be processed by a worker. Contains all data required to perform the scan and
 * write back the results. Also holds metadata about the bulk scan this job belongs to, the current
 * status and the delivery tag used for the orchestration.
 */
public class ScanJobDescription implements Serializable {

    private final UUID id = UUID.randomUUID();
```