This repository contains the training script for fine-tuning BART-small to abstractively summarize news text, in partial fulfillment of the requirements of the course Machine Learning 2.
News summarization is an application of automatic summarization in Natural Language Processing that generates a shorter version of a full news article while preserving its most important information. Key challenges in abstractive summarization, the branch of text summarization that generates new sentences rather than extracting existing ones, include adequately representing meaning, maintaining factual consistency, and handling temporal and causal reasoning. Despite progress, models still struggle to capture semantic nuances and to strike the right balance between completeness and conciseness.
The goal is to produce a summary that reads naturally—similar to how a human might summarize the article—while retaining the essential facts and context of the original news story. This task involves understanding the meaning of the text and generating summaries that are both informative and linguistically fluent.
The dataset can be found at this link. It contains a total of 56,240 rows, divided into three subsets: a training set of 45,000 rows, a validation set of 5,620 rows, and a test set of 5,620 rows. There are two features: document, the original news content, and summary, the reference summary of that document.
The base model used in this project is BART-small, a scaled-down version of BART with fewer attention heads. BART is widely recognized for its effectiveness in natural language generation, particularly in summarizing long-form text, making it a suitable choice for this project. Due to limited computational resources, the lightweight variant was chosen to balance performance with efficiency.
1. Tokenization of the input and output fields (max_length = 512 for both)
2. Applying DataCollatorForSeq2Seq to dynamically pad sequences within each batch
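The two preprocessing steps above can be sketched in miniature. This is not the repository's code: it uses a toy whitespace tokenizer in place of the real BART tokenizer, and `PAD_ID`, `tokenize`, and `collate` are illustrative names, but it shows the same truncate-then-pad-per-batch behavior that `max_length=512` truncation plus DataCollatorForSeq2Seq produce.

```python
# Minimal sketch of the preprocessing pipeline (illustrative only).
MAX_LENGTH = 512  # same cap applied to both document and summary fields
PAD_ID = 0        # placeholder pad token id

def tokenize(text, vocab):
    # Step 1: map words to ids and truncate to MAX_LENGTH, mirroring
    # tokenizer(..., max_length=512, truncation=True).
    ids = [vocab.setdefault(w, len(vocab) + 1) for w in text.split()]
    return ids[:MAX_LENGTH]

def collate(batch):
    # Step 2: pad every sequence to the longest one in the batch,
    # which is what DataCollatorForSeq2Seq does dynamically per batch.
    longest = max(len(seq) for seq in batch)
    return [seq + [PAD_ID] * (longest - len(seq)) for seq in batch]

vocab = {}
docs = ["markets rallied on friday", "rain expected"]
batch = collate([tokenize(d, vocab) for d in docs])
```

Padding per batch (rather than always to 512) keeps short batches small, which matters on limited hardware.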
- Initial Settings
  - Epochs = 20
  - Training Steps = Epochs * len(tokenized_dataset)
  - Warmup Steps = 0.1 * Training Steps
  - Per Device Train Batch Size = 8
  - Per Device Eval Batch Size = 10
  - Learning Rate =
- Final Settings
  - Epochs = 20
  - Training Steps = Epochs * len(tokenized_dataset)
  - Warmup Steps = 0.1 * Training Steps
  - Per Device Train Batch Size = 8
  - Per Device Eval Batch Size = 10
  - Learning Rate =
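As a worked example of the step formulas above, here is how the counts come out for the 45,000-row training split. This assumes "Training Steps = Epochs * tokenized_dataset" means one optimizer step per batch, i.e. epochs times the number of batches per epoch; that reading is an assumption, not stated in the repo.

```python
import math

epochs = 20
train_rows = 45_000            # training split size from the dataset section
per_device_train_batch = 8

# One optimizer step per batch (assumed interpretation of the formula above).
steps_per_epoch = math.ceil(train_rows / per_device_train_batch)  # 5625
training_steps = epochs * steps_per_epoch                         # 112500
warmup_steps = int(0.1 * training_steps)                          # 11250
```

With these settings, the scheduler would warm up over the first 11,250 steps, i.e. the first two epochs.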
Many of the metrics during training of the base model did not change from epoch to epoch; this was resolved during training of the fine-tuned model. However, although the fine-tuned model's metrics changed and generally improved over training, the improvements were not statistically significant. Thus, further training and fine-tuning are needed.