Tools for extracting information from Twitter data (JSON) and for constructing and analysing networks built from it. Some of these are *nix shell scripts, some are Windows batch files, and some are Python scripts. Mostly tested on Windows running Cygwin.
This code was written to support the analysis of social media data (specifically Twitter data), including the comparison of simultaneously collected datasets, in order to examine how variations in data collection activities affect not just the datasets themselves but also analyses based on those data.
The `data` folder holds files with the IDs of tweets used in a number of studies; `data/README.md` provides more details.
- cygwin or bash for shell scripts
- DOS command prompt for batch files
- Python:
  - matplotlib 3.0.3
  - networkx 2.2
  - numpy 1.16.2
  - python 3.7.2
  - scipy 1.2.1
  - scikit-learn 0.21.2
  - python-louvain 0.13
  - twarc 1.6.3
- `basic_tweet_corpus_stats.py file1.json file2.json ... filen.json` produces a table of stats (including a LaTeX mode).
- `compare_centralities.py` calculates centrality values for two graphs for one of four centrality types (degree, betweenness, closeness and eigenvector) and outputs the top matching x nodes to a file, as well as printing Kendall tau and Spearman similarity values (tau and rho respectively, each with p-values). A minimal sketch of the comparison idea appears after this list.
- `compare_centralities.bat` runs `compare_centralities.py` for mentions and replies and each of the four centrality types, using the JSON of the tweets and the CSV files generated by `extract_all_separately.sh`.
- `compare_centralities_longitudinally_from_tweets.py` runs a specified centrality comparison for the top x members of a particular network type built from interactions in two provided corpora. The comparisons are done overall and also over each window of a specified length, and the resulting Kendall tau and Spearman coefficients are reported, so one can see how the corpora correspond over time. It is invoked by the convenience script `run_A_vs_B_longitudinal_centrality_comparisons.sh`.
- `plot_centrality_comparisons.py` generates scatter plots for two-column CSVs of ranked lists of values.
- `centralities.py` calculates the four centrality types for a given GraphML file.
- `csv_to_weighted_digraph.py` builds a GraphML file using specified columns from a CSV file; a directed weighted graph with labeled edges is created. Creates graphs for retweets, mentions, replies and quotes.
- `build_jsocnet_graphs.bat` creates directed weighted graphs of users from the CSVs generated by `extract_all_separately.sh`, using `csv_to_weighted_digraph.py`.
- `run_hashtag_wordcount.sh` is a shell script that reads tweets from stdin (JSON) and outputs the word count of occurring lower-cased hashtags in CSV format (hashtag, count, decreasing). Use this to figure out which hashtags are widespread.
- `run_lang_wordcount.sh` is a shell script that reads tweets from stdin (JSON) and outputs the word count of `lang` property values drawn from the tweets in CSV format (language code, count, decreasing). Use this to figure out which languages are widespread. Use the `-u` option to consider user languages rather than tweet languages.
- `build_hashtag_co-mention_graph.py` creates a weighted graph of hashtags, linked when they are mentioned by the same user (linking hashtags mentioned in the same tweet is being considered); it works directly from tweets. A minimal sketch of the idea appears after this list. Visualise the results in Visone, colour and size the edges by weight, colour the nodes by Louvain clustering, and Bob's your uncle. Stress minimisation layout appears to be the best.
- `extract_tweets_by_authors.sh` filters a corpus of tweets (in JSON) to those authored by the given IDs, extracting a subset of the corpus.
- `combine_wc_csvs.py` creates a single table from multiple key/value CSVs, where the left column is the union of all keys discovered and each subsequent column holds the values (or 0) from one of the given CSV files; basically a way to combine word count lists to make building pie charts in Excel easier.
- `extract_all_separately.sh` invokes `extract_hashtags.sh`, `extract_mentions.sh`, `extract_quotes.sh`, `extract_replies.sh`, `extract_retweets.sh` and `extract_urls.sh` on a given tweet corpus (JSON), generating CSVs of the extracted information, which can then be used in tools like Visone.
- `plot_per_time_metrics.py` plots four different plots of activity over time seen in a given tweet corpus (JSON) and the CSV files generated by `extract_all_separately.sh`.
- `plot_ranked_items.py` creates a scatter plot of the rankings of common elements in the columns of a two-column CSV (with an optional header and arguments for chart labels). An option is provided to choose Mehwish Nasim's algorithm for plotting the points. E.g., `python plot_ranked_items.py -f comparisons/rapid_twarc-centrality-comparisons.csv -l "RAPID,Twarc" --header -a NASIM -o myscatterplot.png`
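To give a sense of what the centrality comparison reports, here is a minimal sketch of the idea in Python, assuming two GraphML files and degree centrality (the function name and defaults here are illustrative, not the actual `compare_centralities.py` implementation, which supports more options):

```python
# Minimal sketch of the rank-correlation comparison (illustrative only).
import networkx as nx
from scipy.stats import kendalltau, spearmanr

def compare_degree_centrality(graphml_a, graphml_b, top_x=500):
    """Compare degree centrality scores of the nodes common to two graphs."""
    g_a = nx.read_graphml(graphml_a)
    g_b = nx.read_graphml(graphml_b)

    c_a = nx.degree_centrality(g_a)
    c_b = nx.degree_centrality(g_b)

    # Constrain the comparison to nodes present in both graphs,
    # keeping the top_x ranked by the first graph's scores.
    common = sorted(set(c_a) & set(c_b), key=lambda n: -c_a[n])[:top_x]

    tau, tau_p = kendalltau([c_a[n] for n in common], [c_b[n] for n in common])
    rho, rho_p = spearmanr([c_a[n] for n in common], [c_b[n] for n in common])
    return (tau, tau_p), (rho, rho_p)
```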
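Similarly, a minimal sketch of the hashtag co-mention construction, assuming standard Twitter v1.1 tweet objects, one JSON object per line (the real `build_hashtag_co-mention_graph.py` reads the tweets itself and adds options such as `--ignore`, `--min-width` and `--dry-run`):

```python
# Minimal sketch of a hashtag co-mention graph: hashtags are linked when
# the same user has used both, weighted by how many users they share.
# (Illustrative only; assumes Twitter v1.1 JSON, one tweet per line.)
import json
from collections import defaultdict
from itertools import combinations
import networkx as nx

def build_co_mention_graph(tweets_file):
    hashtags_by_user = defaultdict(set)
    with open(tweets_file, encoding='utf-8') as f:
        for line in f:
            tweet = json.loads(line)
            user_id = tweet['user']['id_str']
            for tag in tweet.get('entities', {}).get('hashtags', []):
                hashtags_by_user[user_id].add(tag['text'].lower())

    g = nx.Graph()
    for tags in hashtags_by_user.values():
        for t1, t2 in combinations(sorted(tags), 2):
            weight = g[t1][t2]['weight'] + 1 if g.has_edge(t1, t2) else 1
            g.add_edge(t1, t2, weight=weight)
    return g
```

A call to `nx.write_graphml(g, 'hashtag_co-mentions.graphml')` then produces a file that Visone can open.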
Specifically for comparing parallel datasets, given a few corpora of tweets, e.g. `a.json`, `b.json`, and `c.json`, you can use the scripts above in the following way (`$NET_BIN` refers to the directory in which the scripts reside):
- `python basic_tweet_corpus_stats.py --latex --labels "A,B,C" a.json b.json c.json` will generate a LaTeX-formatted table of stats for each corpus, with the column titles provided by the `--labels` option.
- shell: `cat a.json | $NET_BIN/run_lang_wordcount.sh > a_lang_wc.csv` counts the `lang` property values in tweets in `a.json` (see the counting sketch after this list).
- shell: `cat a.json | $NET_BIN/run_lang_wordcount.sh -u > a_user_lang_wc.csv` counts the `lang` property values in users in `a.json`.
- shell: `cat a.json | $NET_BIN/run_hashtag_wordcount.sh > a_hashtag_wc.csv` counts the hashtags (lower-cased) in tweets in `a.json`.
- shell: `extract_all_separately.sh a.json` will generate CSVs for several different interactions: `a-mentions.csv`, `a-replies.csv`, ...
- `python plot_per_time_metrics.py -f a --label A --window 60 --y-limits auto-auto-1000-1500 -o charts --out-filebase A_outfile_prefix` will create a single figure with four plots, using `a` as the basename for the JSON and CSV files to use and `A_outfile_prefix` as the prefix for the output file. The `--y-limits` option can be used to specify the y-limit of each of the four charts to facilitate comparison between different datasets (i.e., give them all the same y-range). The window size is specified in minutes (i.e., 60 is one hour). A sketch of the windowing appears after this list.
- `python plot_tweet_counts.py -l "RAPID,Twarc" --tweets -t "Election Day" -v rapid-k.json twarc.json -o ..\images\nswelec-rapid_twarc-tweet_counts-60m.png -w 60 --tz-fix -600`
- `cat tweets.json | run_hashtag_wordcount.sh > hashtag_counts.csv`, then look at which hashtags occur most often and may clutter any hashtag graph; ignore these in the next step.
- `python build_hashtag_co-mention_graph.py --min-width 1 -v -i tweets.json -o hashtag_co-mentions.graphml --ignore "hashtag1,hashtag2,hashtag3"` will build a network and tell you how big it is. If you use the option `--dry-run` you can see the size of the graph without writing it out.
- DOS: `build_jsocnet_graphs.bat a graphs` will create `graphs/a-mentions.graphml`, `graphs/a-replies.graphml`, `graphs/a-retweets.graphml`, and `graphs/a-quotes.graphml`.
- DOS: `compare_centralities.bat 500 graphs\a graphs\b comparisons\ab 2> nul` will compare at least the top 500 ranked nodes of each graph, constrained to only those nodes common to both graphs. The `graphs\a` and `graphs\b` arguments are prefixes for the mentions, replies, quotes and retweets GraphML files in the `graphs` folder. The `comparisons\ab` argument provides a prefix for the CSV output of the comparisons, each written to its own two-column file, which can be used in the next step. The `2> nul` bit redirects errors that would otherwise pollute the output CSV.
- `python compare_centralities_longitudinally_from_tweets.py -f1 corpus1.json -f2 corpus2.json -t RETWEET -c DEGREE -w 60 | pbcopy` will compare `corpus1.json` and `corpus2.json` overall and broken down into hour-long periods (a window of 60 minutes). Alternatively, use `run_A_vs_B_longitudinal_centrality_comparisons.sh corpus1.json corpus2.json comparisons`.
- DOS: `plot_each_centrality_comparison.bat comparisons\ab charts\ab "A scores,B scores"` will generate scatter plots for each interaction and centrality combination in the files starting with `ab` in the `comparisons` folder, and the results will be written to the `charts` folder. Use the option `--algorithm NASIM` to use Mehwish Nasim's plot algorithm.
- `python plot_centrality_comparisons.py -f comparisons/ab -l "A,B" -t "A vs B centralities compared" --header -o charts/ab-centralities_compared-scatterplots.png` creates a multi-scatterplot based on the mention and reply centrality comparison across the four centrality types. Other examples:
  - `python %NET_BIN%\plot_centrality_comparisons.py -f comparisons\twarc_tweepy -l "Twarc,Tweepy" -t "Twarc vs Tweepy centralities compared" --header -o charts\twarc_tweepy-centrality_compared-scatterplots.png`
  - `python %NET_BIN%\plot_centrality_comparisons.py -f comparisons\twarc_tweepy -l "Twarc,Tweepy" -t "Twarc vs Tweepy centralities compared" --header -o charts\twarc_tweepy-centrality_compared-scatterplots-nasim.png -a NASIM`
  - `python %NET_BIN%\plot_centrality_comparisons.py -f comparisons\twarc_tweepy -l "Twarc,Tweepy" -t "Twarc vs Tweepy centralities compared" --header -o charts\twarc_tweepy-centrality_compared-scatterplots-nasim-grey.png -a NASIM -g`
  - `python %NET_BIN%\plot_centrality_comparisons.py -f comparisons\twarc_tweepy -l "Twarc,Tweepy" -t "Twarc vs Tweepy centralities compared" --header -o charts\twarc_tweepy-centrality_compared-scatterplots-grey.png -g`
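The word-count steps above boil down to tallying a property over a stream of tweets and emitting the counts in decreasing order. A minimal Python sketch, assuming one v1.1 tweet object per line on stdin (the `'und'` fallback is illustrative, not taken from the scripts):

```python
# Minimal sketch of the counting behind run_lang_wordcount.sh /
# run_hashtag_wordcount.sh: tally a property over tweets on stdin and
# emit key,count rows in decreasing order (illustrative only).
import json
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    tweet = json.loads(line)
    counts[tweet.get('lang', 'und')] += 1  # swap in hashtags or user lang as needed

for key, count in counts.most_common():
    print(f'{key},{count}')
```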
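The time-based plots and longitudinal comparisons all rely on bucketing tweets into fixed-width windows. A minimal sketch of that bucketing, assuming v1.1 `created_at` timestamps (timezone adjustment like `--tz-fix` is omitted, and the function name is illustrative):

```python
# Minimal sketch of fixed-width time windowing (e.g. -w 60 for hour-long
# buckets). Illustrative only; ignores timezone fix-ups.
import json
from collections import Counter
from datetime import datetime

TIME_FORMAT = '%a %b %d %H:%M:%S %z %Y'  # Twitter v1.1 created_at format

def windowed_tweet_counts(tweets_file, window_minutes=60):
    counts = Counter()
    with open(tweets_file, encoding='utf-8') as f:
        for line in f:
            created = datetime.strptime(json.loads(line)['created_at'], TIME_FORMAT)
            bucket = int(created.timestamp()) // (window_minutes * 60)
            counts[bucket] += 1
    return counts  # bucket index -> tweets in that window
```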
This code has been used in the following publications, and the relevant datasets can be found in the `data` folder:
- Weber, D., Nasim, M., Falzon, L., and Mitchell, L., "#ArsonEmergency and Australia's 'Black Summer': Polarisation and misinformation on social media", in Disinformation in Open Online Media (MISDOOM), Leiden, The Netherlands, 26-27 October, 2020, pp. 159-173. URL: https://doi.org/10.1007/978-3-030-61841-4_11 (arXiv: https://arxiv.org/abs/2004.00742)
- Weber, D., Nasim, M., Mitchell, L., and Falzon, L., "A method to evaluate the reliability of social media data for social network analysis", in The 2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Leiden, The Netherlands, 7-10 December, 2020, accepted. (arXiv: https://arxiv.org/abs/2010.08717)
- Weber, D., Nasim, M., Mitchell, L., and Falzon, L., "Exploring the effect of streamed social media data variations on social network analysis", Journal of Social Network Analysis and Mining, 2021, submitted. (arXiv: https://arxiv.org/abs/2103.03424)