sdg-event-classification

This repository contains the data and scripts used for the classification of SDG-related events in sustainability reports. The repository includes the ground truth dataset, Python scripts for BERT fine-tuning, CatBoost and SHAP analysis, and LLM annotation.

Introduction

This repository aims to provide resources for the automated classification of events related to the Sustainable Development Goals (SDGs) in corporate sustainability reports. The provided dataset and scripts facilitate the replication of our study and enable further research in this domain.

Data Description

The annotation of the dataset was performed based on the guidelines provided in annotation_guideline.pdf.

The ground_truth_train.csv and ground_truth_train.csv datasets contains annotated events extracted from sustainability reports. The sentences are from the ESGBERT environmental_2k and social_2k datasets.

The ground_truth_features_grouped_train.csv and ground_truth_features_grouped_test.csv dataset contains the feature values grouped for each event trigger. Train and test sets are split 70/30. Below is a description of each column in the dataset:

Dataset Columns

document: The identifier for the document from which the sentence was extracted.
text: The full sentence from a sustainability report.
keyword: The SDG-related keyword(s) identified within the sentence.
event_trigger: The word or phrase indicating the occurrence of an event.
temporal_status: The temporal status of the event (past, ongoing, future).
measurability: A numerical value indicating how specific and quantifiable the event is.
event_factuality: The event factuality level of the event, indicating its likelihood.
kw_is_nsubj: A Boolean indicating if the keyword is a nominal subject.
kw_is_dobj: A Boolean indicating if the keyword is a direct object.
kw_is_pobj: A Boolean indicating if the keyword is a prepositional object.
category: The classification of the event into categories action, intention, belief, or situation.
text_kw_et: The sentence with the keyword and event trigger highlighted.
text_kw: The sentence with the keyword highlighted.
relation_time_specification: Temporal details related to the event (if any).
relation_unit: The unit of measurement related to the event (if any).

Scripts and Usage

BERT Fine-Tuning

Script: classification_bert.ipynb
Description: This script fine-tunes a BERT model on the annotated dataset to classify SDG-related events.

CatBoost and SHAP Analysis

Script: catboost_shap.ipynb
Description: This script trains a CatBoost model on the event features and uses SHAP values to interpret feature importance.

LLM Annotation

Script: llm_annotation.py
Description: This script utilizes GPT-3.5 and GPT-4 models to annotate events in the dataset.
Usage: $python llm_annotation.py

Contact

m.burghart@campus.tu-berlin.de

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
annotation_guideline		annotation_guideline
data		data
scripts		scripts
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

sdg-event-classification

Table of Contents

Introduction

Data Description

Dataset Columns

Scripts and Usage

BERT Fine-Tuning

CatBoost and SHAP Analysis

LLM Annotation

Contact

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

sdg-event-classification

Table of Contents

Introduction

Data Description

Dataset Columns

Scripts and Usage

BERT Fine-Tuning

CatBoost and SHAP Analysis

LLM Annotation

Contact

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages