
mmDiffBERT - Multi-lingual Diffusion Text Generation

A research project exploring diffusion-based text generation using different transformer models as alternatives to traditional autoregressive language models.

Project Structure

mmDiffBERT/
├── pyproject.toml              # Project dependencies and configuration
├── README.md                   # This file
├── wiki-mmBERT/                # mmBERT-based diffusion implementation
│   ├── finetune.py            # Training script for mmBERT
│   ├── inference.py           # mmBERT diffusion inference
│   ├── compare.py             # mmBERT vs GPT-2 comparison
│   ├── gpt2_inference.py      # GPT-2 baseline
│   ├── README.md              # mmBERT-specific documentation
│   └── model_weights/         # Trained mmBERT model
└── wiki-roberta/              # RoBERTa-based diffusion implementation (legacy)
    ├── finetune.py            # Training script for RoBERTa
    ├── inference.py           # RoBERTa diffusion inference
    ├── compare.py             # RoBERTa vs GPT-2 comparison
    ├── gpt2_inference.py      # GPT-2 baseline
    ├── README.md              # RoBERTa-specific documentation
    └── model_weights/         # Trained RoBERTa model

Overview

This project explores diffusion-based text generation as an alternative to traditional autoregressive language models like GPT-2. Instead of generating text left to right, one token at a time, this approach uses the following procedure (a minimal code sketch follows the list):

  1. Fixed prefix (first 16 tokens)
  2. Mask tokens for remaining positions
  3. Iterative denoising over multiple steps
  4. Progressive unmasking until fully denoised
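
The loop below is a minimal sketch of this procedure using a Hugging Face masked language model. It is not the project's inference.py: the model name, sequence length, number of steps, and prompt are illustrative placeholders, and the actual unmasking schedule may differ.

# Sketch of prefix-conditioned iterative denoising with a masked LM.
# Assumptions: roberta-base as a stand-in for the fine-tuned weights,
# a 64-token sequence, and 8 denoising steps.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "roberta-base"   # placeholder; the project fine-tunes its own weights
PREFIX_LEN = 16          # first 16 tokens stay fixed
SEQ_LEN = 64
NUM_STEPS = 8

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

prompt = "The history of the printing press begins in"
ids = tokenizer(prompt, return_tensors="pt").input_ids[0][:PREFIX_LEN]

# Fill everything after the prefix with mask tokens
seq = torch.full((SEQ_LEN,), tokenizer.mask_token_id)
seq[: len(ids)] = ids
masked = torch.ones(SEQ_LEN, dtype=torch.bool)
masked[: len(ids)] = False            # prefix positions are never re-predicted

with torch.no_grad():
    for step in range(NUM_STEPS):
        logits = model(seq.unsqueeze(0)).logits[0]
        conf, pred = logits.softmax(-1).max(-1)
        conf[~masked] = -1.0          # never overwrite already-revealed tokens
        # Reveal an equal share of the remaining masked positions each step,
        # keeping the most confident predictions
        k = max(1, int(masked.sum()) // (NUM_STEPS - step))
        reveal = conf.topk(k).indices
        seq[reveal] = pred[reveal]
        masked[reveal] = False

print(tokenizer.decode(seq, skip_special_tokens=True))

Each step re-runs the masked LM over the whole sequence and commits only its most confident predictions, which is what makes the generation bidirectional rather than strictly left-to-right.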

Key Features

  • Bidirectional Generation: Unlike autoregressive models, the model can attend to the full sequence context
  • Iterative Denoising: Gradually reveals text over a configurable number of steps
  • Prefix Control: First 16 tokens remain fixed, providing stable context
  • Visual Animations: Step-by-step matplotlib animations showing the generation process (see the sketch after this list)
  • Comparative Analysis: Side-by-side comparison with a GPT-2 baseline
  • Multi-Model Support: Works with different transformer backbones (mmBERT, RoBERTa)
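
As an illustration of how the step-by-step animations can be rendered, the snippet below animates a hand-written list of intermediate decoded strings with matplotlib's FuncAnimation. It is only a sketch: the strings, figure size, and output file name are made up, and the project's own animation code may differ.

# Sketch: animate intermediate denoising outputs with matplotlib.
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

# Placeholder decoded strings, one per denoising step
steps = [
    "The quick brown <mask> <mask> <mask> <mask>",
    "The quick brown fox <mask> <mask> <mask>",
    "The quick brown fox jumps <mask> <mask>",
    "The quick brown fox jumps over the",
]

fig, ax = plt.subplots(figsize=(8, 2))
ax.axis("off")
text = ax.text(0.5, 0.5, steps[0], ha="center", va="center", wrap=True)

def update(i):
    # Swap in the decoded text for denoising step i
    text.set_text(f"step {i}: {steps[i]}")
    return (text,)

anim = FuncAnimation(fig, update, frames=len(steps), interval=500, repeat=False)
anim.save("diffusion_steps.gif", writer="pillow")  # or plt.show() for interactive use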

Installation

This project uses uv for package management.

# Install dependencies
uv sync

Requirements:

  • Python >= 3.11
  • PyTorch 2.7.0+
  • Transformers 4.52.4
  • Datasets 3.6.0
  • Matplotlib 3.10.3
  • Accelerate 1.7.0
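
After uv sync, one quick way to confirm the environment matches the versions listed above is to print what is actually importable. This is just a convenience sketch, not part of the project:

# Print installed versions of the core dependencies
import torch, transformers, datasets, matplotlib, accelerate

for name, module in [
    ("PyTorch", torch),
    ("Transformers", transformers),
    ("Datasets", datasets),
    ("Matplotlib", matplotlib),
    ("Accelerate", accelerate),
]:
    print(f"{name}: {module.__version__}")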

Usage

mmBERT Diffusion (Recommended)

cd wiki-mmBERT

# Basic text generation
python inference.py "Your prompt here"

# Fine-tuning
python finetune.py

# Side-by-side comparison with GPT-2
python compare.py "Your prompt here"

RoBERTa Diffusion (Legacy)

cd wiki-roberta

# Basic text generation
python inference.py "Your prompt here"

# Fine-tuning
python finetune.py

# Side-by-side comparison with GPT-2
python compare.py "Your prompt here"
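
For context on what the comparison scripts contrast against, the snippet below is a minimal sketch of a plain autoregressive GPT-2 baseline using the transformers API. It is not the project's compare.py or gpt2_inference.py; the prompt and sampling settings are illustrative only.

# Sketch of an autoregressive GPT-2 baseline for comparison
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The history of the printing press begins in"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_length=64,                        # roughly match the diffusion sequence length
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))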

Model-Specific Documentation

  • mmBERT Implementation: See wiki-mmBERT/README.md for detailed mmBERT-specific documentation
  • RoBERTa Implementation: See wiki-roberta/README.md for detailed RoBERTa-specific documentation

Research Applications

This project is designed for research into:

  • Alternative text generation paradigms
  • Bidirectional context utilization
  • Iterative refinement approaches
  • Model comparison and evaluation

Citation

This project is based on the original RoBERTaDiffusion implementation:

@misc{robertadiffusion2024,
  title={RoBERTa Diffusion Text Generation},
  author={Nathan Barry},
  year={2024},
  url={https://github.com/nathan-barry/RoBERTaDiffusion},
  note={A research project exploring fine-tuning BERT-style models for text generation}
}

Original Repository: nathan-barry/RoBERTaDiffusion

This mmDiffBERT project extends the original work by:

  • Supporting multiple transformer models (mmBERT, RoBERTa)
  • Adding Hindi language support with Wikipedia and Sangraha datasets
  • Checking syntax and compatibility, which is the primary objective of this extension
