Enhancement to Full Disclosure Crawler and Parsers #92

@jgwl

Taken from #74

1. seclists_crawler_raw.py

1.1 Still doesn't provide an optional flag for the save path.

Output parameter -o

Rather than always saving to the folder the script is run from, both the Crawler and the Parser could accept an optional -o parameter for the output path. For those of us who will be versioning the code, this would avoid having to move the files manually every time we download a new month, and it would make the scripts easier to drive from the command line.

Note also that the current behavior (although intuitively I see where you are going) is inconsistent between the 2 scripts, which may leave a student confused: the Crawler script saves to the folder the script is run from, while the Parser script saves to the provided input path.
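To make the suggestion concrete, here is a minimal sketch of how an optional -o flag could be wired up with argparse. The function names and the `--output` long form are hypothetical and not taken from the existing scripts; the default keeps today's Crawler behavior (save where the script is run).

```python
import argparse
from pathlib import Path


def build_parser():
    # Hypothetical CLI sketch; the real scripts may define other arguments.
    parser = argparse.ArgumentParser(
        description="Sketch of a shared -o option for Crawler and Parser"
    )
    parser.add_argument(
        "-o",
        "--output",
        type=Path,
        default=Path.cwd(),
        help="directory to save results in (default: current directory)",
    )
    return parser


def resolve_output_dir(argv=None):
    # Parse the arguments and make sure the target directory exists.
    args = build_parser().parse_args(argv)
    args.output.mkdir(parents=True, exist_ok=True)
    return args.output
```

Using the same helper in both scripts would also remove the inconsistency described above, since both would then save to the same kind of location by default.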

1.2 README.md

Should mention what the user should expect to be downloaded. Currently, it is each individual e-mail html page plus an index html page whose name format is _.raw.html. The main difference is the absence of a relative id in the index file name.

2. seclists_index_parse.py

2.1 Script help message example is incorrect (?)

-f , parse single raw file, e.g. -f ./2011_Jan_0.raw.html

From your README.md (very nicely done by the way), I assume this would be without the 0 in it? i.e. 2011_Jan.raw.html.

2.2 Lacks save path

Currently saves its output into the input path directory.

2.3 README.md should mention the possible follow-ups case

The README should mention that "possible follow-ups" are treated by the parser the same way as follow-ups, without anything marking them as "possible".

3. Add some python tests to ensure consistency across the scripts

Given that it is hard to tell from the results whether files are missing, now or in the future, it would be useful to have tests that:

  • Unit test that the number of entries in the .csv generated by seclists_index_parse.py equals the number of raw.html files minus 1 in a given month folder (the -1 accounts for the index .html file).

    • Assert the equality on a month with a possible follow-up case.
    • Assert the equality on a month without any possible follow-up case.
    • Assert the equality on a month with and without any possible follow-up case.
  • Unit test that the number of files generated by seclists_reply_parse.py equals the number of raw html files.

This should suffice to minimally check that all scripts are working consistently. Additional tests could include, for example, checking that the number of authors is correct and that the number of e-mail parents matches the expected count.
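As a rough illustration of the first test, the check could look something like the sketch below. The helper names are hypothetical, and it assumes that every downloaded page in a month folder ends in `.raw.html` (index included) and that the generated .csv starts with a header row; both assumptions should be adjusted to what the scripts actually produce.

```python
import csv
from pathlib import Path


def count_raw_html(month_dir):
    # Count every *.raw.html file in the month folder, index page included.
    return sum(
        1 for p in Path(month_dir).iterdir() if p.name.endswith(".raw.html")
    )


def count_csv_entries(csv_path, has_header=True):
    # Count non-empty rows; has_header=True is an assumption about the .csv.
    with open(csv_path, newline="", encoding="utf-8") as fh:
        rows = [row for row in csv.reader(fh) if row]
    if has_header and rows:
        return len(rows) - 1
    return len(rows)


def index_matches_raw_files(month_dir, csv_path, has_header=True):
    # The -1 accounts for the index page, which is not an e-mail entry.
    expected = count_raw_html(month_dir) - 1
    return count_csv_entries(csv_path, has_header) == expected
```

A unittest or pytest case would then only need to run the parser on a fixture month (one with and one without possible follow-ups) and assert that `index_matches_raw_files(...)` is true.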

General Notes

5. Missing requirements.txt listing the required Python libraries.

6. Parent README.md

Should probably add a parent folder containing both Crawler and Parser, with a README mentioning the existence of the 3 scripts, a one-line statement of what each of them does, and the agreed taxonomy of the file names.