Search Shakespeare's works by token, phrase, and regex across plays, acts/scenes, characters, and lines.
This repo contains:
- A Python build pipeline (`build.py`) that reads Folger TEI XML in `tei/` and emits JSON indexes.
- A static web UI in `docs/` that loads those generated JSON files.
- Python and Playwright tests for data and UI behavior.

- Python 3.8+
- Node.js + npm (for Playwright tests)
- Optional: `lxml` (listed in `requirements.txt`, but `build.py` currently uses `xml.etree.ElementTree`)

Generate or regenerate the data files:

    python3 build.py

Current `build.py` behavior is fixed to:
- Input: `tei/*.xml`
- Output root: `docs/`
- Data output: `docs/data/*.json`
- Line output: `docs/lines/*.json`

Generated files include:
- `docs/data/plays.json`
- `docs/data/chunks.json`
- `docs/data/characters.json`
- `docs/data/tokens.json`
- `docs/data/tokens2.json`
- `docs/data/tokens3.json`
- `docs/data/tokens_char.json`
- `docs/data/tokens_char2.json`
- `docs/data/tokens_char3.json`
- `docs/lines/all_lines.json`
- `docs/lines/<scene_id>.json` (one file per scene)

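After a build, a quick sanity check is to confirm that the fixed-name outputs exist. The sketch below is not part of this repo: `missing_outputs` is a hypothetical helper, and it only checks the file names listed above (the per-scene files are skipped because their names depend on the corpus).

```python
from pathlib import Path

# File names copied from the generated-files list; the checker is illustrative.
EXPECTED_DATA_FILES = [
    "plays.json", "chunks.json", "characters.json",
    "tokens.json", "tokens2.json", "tokens3.json",
    "tokens_char.json", "tokens_char2.json", "tokens_char3.json",
]


def missing_outputs(docs_root="docs"):
    """Return the expected generated files that are absent under docs_root."""
    root = Path(docs_root)
    expected = [root / "data" / name for name in EXPECTED_DATA_FILES]
    expected.append(root / "lines" / "all_lines.json")
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing generated files:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected generated files are present.")
```

Run it from the repository root after `python3 build.py`; an empty result means every fixed-name file was found.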
The build also reads optional metadata from:
- `play_metadata.json`
- `character_metadata.json`

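Because these metadata files are optional, a reader has to tolerate their absence. The following is a minimal sketch of that pattern, not the code in `build.py`: `load_optional_json` is a hypothetical name, and falling back to an empty dict is an assumption about what a sensible default looks like.

```python
import json
from pathlib import Path


def load_optional_json(path, default=None):
    """Return parsed JSON from path, or a default when the file is absent."""
    file_path = Path(path)
    if not file_path.is_file():
        # Assumption: a missing optional file means "no extra metadata".
        return {} if default is None else default
    with file_path.open(encoding="utf-8") as handle:
        return json.load(handle)


play_metadata = load_optional_json("play_metadata.json")
character_metadata = load_optional_json("character_metadata.json")
```

A malformed file still raises `json.JSONDecodeError` here, which is usually preferable to silently ignoring a typo in hand-edited metadata.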
Serve the static site from `docs/`:

    python3 -m http.server 8766 -d docs

Then open http://localhost:8766.

Run Python tests:

    python3 -m unittest test_parse_play.py tests.test_build_output -v

or

    python3 -m pytest tests test_parse_play.py -v

Run Playwright UI tests:

    npm install
    npx playwright install
    npx playwright test

The Playwright config auto-starts a local server on port 8766 from `docs/`.

- Generated JSON files under `docs/data/` and `docs/lines/` are committed in this repository.
- If you want to build a subset corpus, keep only the desired TEI files in `tei/` and run `python3 build.py`.
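
Before a subset build, it can help to list which files match the `tei/*.xml` input pattern. This is a small sketch under that assumption only: `list_tei_inputs` is a hypothetical helper and does not call into `build.py`.

```python
from pathlib import Path


def list_tei_inputs(tei_dir="tei"):
    """Return the sorted names of the *.xml files directly inside tei_dir."""
    return sorted(path.name for path in Path(tei_dir).glob("*.xml"))


if __name__ == "__main__":
    for name in list_tei_inputs():
        print(name)
```

Only files directly inside `tei/` are listed, mirroring the non-recursive `tei/*.xml` pattern stated above.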