This repository was archived by the owner on Feb 16, 2025. It is now read-only.

TeriyakiThames/Data-Science-Project


How do the affiliations of researchers influence the diversity of engineering research topics?

Data Preparation:

  1. main.py
    • Running this file will automatically run all the other files in Data Prep.
    • Input the starting folder (the root folder of the dataset), the folder to extract the data to, and the folder to store the imputed data.

  2. change_extension.py
    • This file turns the given Scopus dataset into .json files.

  3. data_extraction.py
    • This file loops through each year of the Scopus dataset and combines it into a single file while removing unnecessary data.

  4. impute_missing_value.py
    • This file imputes any missing values in the dataset.

  5. remove_duplicates.py
    • This file drops duplicate papers from the file.
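The Data Prep steps above can be sketched roughly as follows. This is a minimal sketch, not the repository's actual code: the field names (`title`, `date`), the one-JSON-file-per-year layout, and the mode-based imputation are all assumptions.

```python
import json
from pathlib import Path

import pandas as pd


def combine_years(root: Path) -> pd.DataFrame:
    """Loop over the per-year JSON files and combine them into one table
    (roughly the data_extraction.py step)."""
    records = []
    for path in sorted(root.rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            records.extend(json.load(f))  # each file holds a list of papers
    return pd.DataFrame(records)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate papers, then fill missing dates with the most common
    value (a simple stand-in for impute_missing_value.py)."""
    df = df.drop_duplicates(subset="title").copy()
    df["date"] = df["date"].fillna(df["date"].mode()[0])
    return df
```

In this sketch, deduplication runs before imputation so that duplicate rows do not skew the imputed value.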

Web Scraping:

  1. main.py
    • Use this file to run web_scraping.py

  2. web_scraping.py
    • Run main.py to start scraping from IEEE Xplore.
    • Note that the script will sometimes leave the date field null, as the date cannot be found in some documents.
    • If any other field is missing, the script will not save the record.

  3. join_json.py
    • This file is used to join the JSON files obtained from web scraping into one file.
    • You can then use the result of this function to impute the missing values using impute_missing_value.py.
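The save policy and join step described above might look like the sketch below. The field names in `REQUIRED` are hypothetical, not the scraper's actual schema; the only behaviour taken from the README is that a null date is tolerated while any other missing field skips the record.

```python
import json
from pathlib import Path

REQUIRED = ("title", "authors", "affiliations")  # hypothetical field names


def save_record(record: dict, out_dir: Path) -> bool:
    """Save a scraped paper as its own JSON file. The date field may be
    null, but a record missing any required field is not saved at all."""
    if any(not record.get(field) for field in REQUIRED):
        return False  # skip incomplete records entirely
    record.setdefault("date", None)  # date is allowed to be missing
    out_path = out_dir / f"{record['title'][:50]}.json"
    out_path.write_text(json.dumps(record), encoding="utf-8")
    return True


def join_json(in_dir: Path, out_file: Path) -> int:
    """Merge the per-paper JSON files into one list (the join_json.py step).
    Write the output outside in_dir so it is not swept up by the glob."""
    papers = [json.loads(p.read_text(encoding="utf-8"))
              for p in sorted(in_dir.glob("*.json"))]
    out_file.write_text(json.dumps(papers), encoding="utf-8")
    return len(papers)
```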

Model Training:

  1. Model.ipynb
    • This file trains the model from the data collected from Scopus and from web scraping.
    • We used Latent Dirichlet Allocation (LDA) and K-Means.

  2. combine_csv.py
    • This file is used to combine all the CSV files into a single file.
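A rough sketch of the kind of pipeline Model.ipynb describes, using scikit-learn: LDA on a bag-of-words matrix, then K-Means on the resulting topic distributions. The topic and cluster counts here are placeholders, not the values used in the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def topic_clusters(abstracts, n_topics=3, n_clusters=2):
    """Fit LDA topics on the texts, then cluster papers by topic mixture."""
    bow = CountVectorizer(stop_words="english").fit_transform(abstracts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(bow)  # shape: (n_docs, n_topics)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return doc_topics, km.fit_predict(doc_topics)
```

Clustering on the topic distributions rather than the raw bag-of-words keeps K-Means in a low-dimensional, interpretable space.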

Data Visualisation:

  1. data_visualisation.py
    • This file visualises the data from Data Prep as well as the model's fitted data.
    • The data is visualised using Streamlit.
    • There are 3 files which are used here (main_data.csv, cluster_data.csv, calculated_map_data.csv).

  2. calculate_map.py
    • This logic was originally part of data_visualisation.py, but it was too computationally heavy.
    • It was therefore split off, and its output is stored in a file to be loaded instead.
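The precompute-then-load split described above might look like the sketch below. The `country` column and the per-country aggregation are assumptions about what the map data contains, not the script's actual computation.

```python
from pathlib import Path

import pandas as pd


def precompute_map_data(papers: pd.DataFrame, out_csv: Path) -> None:
    """The heavy aggregation split out of data_visualisation.py: count
    papers per affiliation country once, offline, and save to CSV."""
    counts = (papers.groupby("country", as_index=False)
                    .size()
                    .rename(columns={"size": "paper_count"}))
    counts.to_csv(out_csv, index=False)


def load_map_data(csv_path: Path) -> pd.DataFrame:
    """Cheap load at app start-up; in the Streamlit app this call could
    sit behind @st.cache_data so it runs only once per session."""
    return pd.read_csv(csv_path)
```

This keeps the expensive step out of the interactive app: the dashboard only ever reads the small precomputed CSV.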
