LoRA-Based Fine Tuning of BART-small for News Article Summarization

This repository contains the training script for fine-tuning BART-small to abstractively summarize news text, in partial fulfillment of the requirements for the course Machine Learning 2.

📃Introduction

News summarization is an application of automatic summarization in Natural Language Processing that generates a shorter version of a full news article while preserving its most important information. Abstractive summarization, the branch of text summarization that generates new sentences rather than extracting existing ones, faces key challenges: adequately representing meaning, maintaining factual consistency, and handling temporal and causal reasoning. Despite progress, models still struggle to capture semantic nuances and to strike the right balance between completeness and conciseness.

🎯Objectives

The goal is to produce a summary that reads naturally—similar to how a human might summarize the article—while retaining the essential facts and context of the original news story. This task involves understanding the meaning of the text and generating summaries that are both informative and linguistically fluent.

📚About the Dataset: therapara/summary-of-news-articles

The dataset is available on the Hugging Face Hub under the identifier therapara/summary-of-news-articles. It contains a total of 56,240 rows, divided into three subsets: a training set with 45,000 rows, a validation set with 5,620 rows, and a test set also with 5,620 rows. There are two features: document, which contains the original news article, and summary, which is the summarized version of the document column.
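
A minimal sketch of loading and inspecting the dataset with the Hugging Face `datasets` library; the split names (train/validation/test) are assumed to follow the standard Hub layout described above:

```python
from datasets import load_dataset

# Load the summarization dataset from the Hugging Face Hub.
dataset = load_dataset("therapara/summary-of-news-articles")

# Expected splits: train (45,000 rows), validation (5,620), test (5,620).
print(dataset)

# Each row has a "document" (the full article) and a "summary" (the reference summary).
print(dataset["train"][0]["summary"])
```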

🧠About the Model: BART-small

The base model used in this project is BART-small, a scaled-down version of BART with fewer attention heads. BART is widely recognized for its effectiveness in natural language generation, particularly in summarizing long-form text, making it a suitable choice for this task. Due to limited computational resources, this lightweight variant was chosen to balance performance with efficiency.
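
A minimal sketch of loading the model and tokenizer with `transformers`; the checkpoint identifier below is a placeholder, since the README does not name the exact BART-small checkpoint used:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder identifier: substitute the BART-small checkpoint actually used
# for training (not specified in this README).
checkpoint = "path-or-hub-id-of-bart-small"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
```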

⚙️Methodology

Data Preprocessing

1. Tokenization on input and output fields (max_length = 512 for both fields)

2. Applying DataCollatorForSeq2Seq to pad sequences dynamically within each batch (see the sketch below)
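
A minimal sketch of these two preprocessing steps, continuing from the loading sketches above and assuming the standard `Dataset.map` pattern:

```python
from transformers import DataCollatorForSeq2Seq

MAX_LENGTH = 512  # used for both the article and the reference summary

def preprocess(batch):
    # Tokenize the article (encoder input).
    model_inputs = tokenizer(
        batch["document"], max_length=MAX_LENGTH, truncation=True
    )
    # Tokenize the reference summary (decoder labels).
    labels = tokenizer(
        text_target=batch["summary"], max_length=MAX_LENGTH, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(
    preprocess, batched=True, remove_columns=["document", "summary"]
)

# Dynamically pad each batch to the longest sequence in that batch.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
```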

Fine-tuning with LoRA and Adam Optimizer

Initial Settings

  • Epochs = 20
  • Training Steps = Epochs × len(tokenized_dataset)
  • Warmup Steps = 0.1 × Training Steps
  • Per Device Train Batch Size = 8
  • Per Device Eval Batch Size = 10
  • Learning Rate =

Final Settings

  • Epochs = 20
  • Training Steps = Epochs × len(tokenized_dataset)
  • Warmup Steps = 0.1 × Training Steps
  • Per Device Train Batch Size = 8
  • Per Device Eval Batch Size = 10
  • Learning Rate =

A configuration sketch using these settings is shown after this list.
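
A minimal sketch of the LoRA setup and trainer wiring, assuming the PEFT library and the Hugging Face Seq2SeqTrainer. The LoRA rank, alpha, dropout, and target modules are assumptions (the README only lists the training-loop settings above), and the learning rate is left unset because it is not stated:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Assumed LoRA hyperparameters; the README does not specify them.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)

epochs = 20
training_steps = epochs * len(tokenized_dataset["train"])
warmup_steps = int(0.1 * training_steps)

training_args = Seq2SeqTrainingArguments(
    output_dir="bart-small-lora-news",
    num_train_epochs=epochs,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=10,
    warmup_steps=warmup_steps,
    # learning_rate=...,  # not stated in the README; set to the value actually used
    optim="adamw_torch",  # Adam-family optimizer, as noted above
    evaluation_strategy="epoch",
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
```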

📝Evaluation and Results

Many of the metric values did not change from one epoch to the next while training the base model; this issue was resolved during training of the LoRA fine-tuned model. However, although the fine-tuned model's metrics did change and generally improved over training, the improvements were not statistically significant. Further training and fine-tuning are therefore needed.
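
The README does not name the evaluation metrics used. ROUGE is the usual choice for summarization, so the sketch below computes ROUGE with the `evaluate` library as an assumption; passing this function as compute_metrics to the trainer above would report these scores at each evaluation step:

```python
import numpy as np
import evaluate

# ROUGE is assumed here; the specific metrics are not named in the README.
rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace the -100 padding used for labels before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)
```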
