Scripts and corpora for the authorship attribution of B. Traven's works, using the method of the "impostors"

SimoneRebora/Traven_stylometry

Traven_stylometry

Scripts and corpora for the authorship attribution of B. Traven's works, using the method of the "impostors".

Structure

The main branch of the repository contains materials for the paper Are Ret Marut and B. Traven the same person? Fine tuning the impostors method, presented at the DH2023 Conference (Paper | Slides).
The DH2022 branch contains materials for the paper Traven between the impostors. Preliminary considerations on an authorship verification case, presented at the DH2022 Conference (Paper).

Instructions

Run the bash_imposters_parallel.sh file (via: bash bash_imposters_parallel.sh), which runs all R scripts in sequence. The scripts are designed for parallel processing to speed up computation.
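As an illustration of what "in sequence" means here, the following is a minimal sketch of a runner that executes a list of scripts in order and stops at the first failure. It is not the content of bash_imposters_parallel.sh (which also handles the parallel steps); the function name run_in_sequence is hypothetical, and the dry run uses echo in place of Rscript so that it works without R installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): run each script with the
# given runner command, in order, and stop as soon as one of them fails.
run_in_sequence() {
  runner="$1"
  shift
  for script in "$@"; do
    echo "Running $script"
    "$runner" "$script" || { echo "Failed: $script" >&2; return 1; }
  done
}

# Dry run: "echo" only prints each file name instead of executing it.
# Replace "echo" with "Rscript" to actually run the scripts.
run_in_sequence echo \
  01_prepare_imposters_corpora.R \
  02_evaluate_parallel_processing.R \
  03_prepare_analysis_tables.R \
  04_imposters_analysis.R \
  05_process_results.R
```

Stopping at the first failure matters because each script depends on the output of the previous one.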

Features

Analysis features are defined in the analysis_features.csv file. You can modify them to run different analyses:

  • n_cores defines the number of cores for parallel processing (the script currently supports two to six cores)
  • n_best_imposters defines the number(s) of best impostors on which to run tests (you should separate the numbers with a space)
  • n_development_authors defines the number of authors to constitute the development set
  • unit defines the unit of analysis. You can choose between "1_words", "2_characters", "3_characters", etc. (currently, the script does not support word ngrams and runs with just one unit at a time)
  • MFU_series defines the number(s) of most frequent units (words or characters) on which to run tests (you should separate the numbers with a space)
  • culling defines the level(s) of culling with which to run tests (you should separate the numbers with a space)
  • validation_rounds defines the number of repetitions for each configuration
  • distances defines the stylometric distances to be used (you should separate the names with a space)
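
Because several features accept more than one space-separated value, the number of configurations can grow quickly. The sketch below lists every combination of three such features. It assumes that the scripts test all values against each other, which the description above suggests but does not state; the values and the function name list_configurations are hypothetical and are not taken from analysis_features.csv.

```shell
#!/usr/bin/env bash
# Hypothetical example values (not copied from analysis_features.csv).
mfu_series="100 500 1000"
culling="0 50"
distances="delta cosine"   # placeholder distance names

# Print one line per combination of the three space-separated lists.
# $1: MFU values, $2: culling levels, $3: distance names
list_configurations() {
  for mfu in $1; do
    for cull in $2; do
      for dist in $3; do
        echo "MFU=$mfu culling=$cull distance=$dist"
      done
    done
  done
}

list_configurations "$mfu_series" "$culling" "$distances"
```

With these example values the function prints 3 × 2 × 2 = 12 lines. Under the same assumption, each configuration would then be repeated validation_rounds times, so adding a value to any list multiplies the total running time.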

Scripts

  • 01_prepare_imposters_corpora.R prepares corpora by running a first stylometric analysis and selecting the authors closest to the test set
  • 02_evaluate_parallel_processing.R reads analysis features from the analysis_features.csv file and prepares instructions for parallel processing
  • 03_prepare_analysis_tables.R prepares datasets for the actual analysis, by creating Term-Document-Frequency tables for each combination of texts
  • 04_imposters_analysis.R performs the impostors analysis
  • 05_process_results.R merges the results and saves them to a Results.txt file

Corpora

Texts to be analysed are in the corpus folder:

  • Traven_Marut_corpus.RData contains the four novels by Traven and Marut on which to perform the analysis. Novels have been split into tokens, which have been reordered alphabetically (thus not allowing reconstruction of the original texts, which are still copyright protected)
  • Kolimo_metadata.csv contains metadata of the Kolimo corpus, from which development set and impostors will be extracted. The corpus itself will be downloaded by the R scripts
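
The alphabetical reordering described above removes word order but leaves word frequencies intact. The following sketch shows this on an invented sentence; it is not the procedure used to build Traven_Marut_corpus.RData. The function names tokenize and word_freq are hypothetical, and the tokenizer handles ASCII letters only.

```shell
#!/usr/bin/env bash
export LC_ALL=C   # fixed locale so that sorting is reproducible

# Lowercase the input and emit one alphabetic token per line (ASCII only).
tokenize() {
  tr '[:upper:]' '[:lower:]' | tr -cs '[:alpha:]' '\n' | sed '/^$/d'
}

# Build a "word count" table, most frequent words first, ties in A-Z order.
word_freq() {
  sort | uniq -c | awk '{print $2, $1}' | sort -k2,2nr -k1,1
}

text="The ship sails and the ship sinks"   # invented example sentence

# Frequencies from the tokens in their original order...
original=$(printf '%s\n' "$text" | tokenize | word_freq)
# ...and from the same tokens after alphabetical reordering.
reordered=$(printf '%s\n' "$text" | tokenize | sort | word_freq)

echo "$original"
if [ "$original" = "$reordered" ]; then
  echo "Frequency tables are identical"
fi
```

Both pipelines yield the same table, so analyses based on word frequencies are unaffected by the reordering, while anything that depends on the sequence of words cannot be recovered from the reordered tokens.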

Requirements

R packages: stylo, tidyverse, and class. Run the Requirements.R script to install them.
The bash script should run via command line on Unix-like systems.
