To install the library, run the following commands in a terminal:
git clone https://github.com/nbdt-journal/nbdt_lib.git
pip install ./nbdt_lib
Currently the library has 5 datasets ready to use:
- arxiv: Has nearly 3.5k papers
- bioarxiv: Has 29k papers from bioRxiv
- plos_one: Has 18k papers from PLOS ONE
- medline_small: Has 105k papers from the top 200 journals in the neuroscience field
- medline_large: Has 200k papers from MEDLINE
To load a dataset, use the following code:
from nbdt.datasets import load_dataset
load_dataset(dataset_name='arxiv', destination_path='arxiv.csv', start_year=2018, end_year=2023)
- If destination_path is not specified, the dataset is returned as a pandas DataFrame, which you can assign to a variable.
- If start_year and end_year are not specified, the entire dataset is returned by default.
- All datasets contain only papers published from 2018 to 2023.
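The effect of start_year and end_year can be pictured with a small standalone sketch. It does not call nbdt; the records, the "year" field name, the helper filter_by_year, and the assumption that both bounds are inclusive are all hypothetical and used only for illustration.

```python
# Hypothetical records standing in for rows of a loaded dataset.
# The "year" field name is an assumption, not the library's actual schema.
papers = [
    {"title": "Paper A", "year": 2018},
    {"title": "Paper B", "year": 2020},
    {"title": "Paper C", "year": 2021},
    {"title": "Paper D", "year": 2023},
]


def filter_by_year(records, start_year=None, end_year=None):
    """Keep records with start_year <= year <= end_year.

    A bound left as None is not applied, so calling this with no bounds
    returns every record (mirroring "the entire dataset is returned").
    Inclusive bounds are an assumption made for this sketch.
    """
    kept = []
    for record in records:
        year = record["year"]
        if start_year is not None and year < start_year:
            continue
        if end_year is not None and year > end_year:
            continue
        kept.append(record)
    return kept


subset = filter_by_year(papers, start_year=2020, end_year=2021)
everything = filter_by_year(papers)

print([p["title"] for p in subset])
print(len(everything))
```

Leaving both bounds unset keeps all four records, while the 2020 to 2021 range keeps only the two papers in the middle.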
To update your dataset, use the following code:
from nbdt.update import update_dataset
update_dataset(dataset_name, start_date, end_date, destination_path, update)
# Examples
update_dataset(dataset_name='plos_one', start_date='06-2023', end_date = '08-2023', destination_path='plos_one_new.csv', update=True)
# 'update' and 'start_date' are not specified here, so their default values (False and 01-01-2023, respectively) are applied.
update_dataset(dataset_name='bioarxiv', end_date='08-2023', destination_path='bioarxiv_new.csv')
dataset_name: Currently only the arxiv, bioarxiv, and plos_one datasets can be updated.
start_date and end_date: Both parameters must be given in MM-YYYY format; any other format results in an error. By default, updates begin from a predefined start date for each dataset:
- arxiv: 31-05-2023
- bioarxiv: 01-01-2023
- plos_one: 31-05-2023
destination_path: The path where the output is written, for example: '/content/arxiv.csv' (Note: The file must be in CSV format)
update: By default, this parameter is False. If update=True is passed, the source dataset is written to the destination path together with the new data for the specified period.
Note: At a minimum, the parameters dataset_name, destination_path, and end_date must be specified for any data to be returned.
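The parameter rules above can be checked before calling update_dataset. The following is a minimal sketch of such a pre-check, not part of nbdt: the helper names parse_month_year and check_update_args are hypothetical, and the rule that start_date must not be later than end_date is an assumption added for illustration.

```python
import re

# Datasets the text above lists as updatable.
UPDATABLE = {"arxiv", "bioarxiv", "plos_one"}

# Strict MM-YYYY: two-digit month 01-12, a dash, then a four-digit year.
_MONTH_YEAR = re.compile(r"(0[1-9]|1[0-2])-(\d{4})")


def parse_month_year(value):
    """Return (year, month) for an 'MM-YYYY' string, else raise ValueError."""
    match = _MONTH_YEAR.fullmatch(value)
    if match is None:
        raise ValueError(f"expected MM-YYYY, got {value!r}")
    month, year = match.groups()
    return int(year), int(month)


def check_update_args(dataset_name, end_date, destination_path, start_date=None):
    """Validate arguments along the lines described above (hypothetical helper)."""
    if dataset_name not in UPDATABLE:
        raise ValueError(f"{dataset_name!r} cannot be updated")
    if not destination_path.endswith(".csv"):
        raise ValueError("destination_path must point to a CSV file")
    end = parse_month_year(end_date)
    start = parse_month_year(start_date) if start_date is not None else None
    # Assumption: a start date after the end date is treated as an error.
    if start is not None and start > end:
        raise ValueError("start_date is later than end_date")
    return start, end


print(check_update_args("plos_one", "08-2023", "plos_one_new.csv", start_date="06-2023"))
```

Returning (year, month) tuples keeps the dates directly comparable, so the ordering check is a plain tuple comparison.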