
mmDiffBERT - Multi-lingual Diffusion Text Generation

A research project exploring diffusion-based text generation using different transformer models as alternatives to traditional autoregressive language models.

Project Structure

mmDiffBERT/
├── pyproject.toml              # Project dependencies and configuration
├── README.md                   # This file
├── wiki-mmBERT/                # mmBERT-based diffusion implementation
│   ├── finetune.py            # Training script for mmBERT
│   ├── inference.py           # mmBERT diffusion inference
│   ├── compare.py             # mmBERT vs GPT-2 comparison
│   ├── gpt2_inference.py      # GPT-2 baseline
│   ├── README.md              # mmBERT-specific documentation
│   └── model_weights/         # Trained mmBERT model
└── wiki-roberta/              # RoBERTa-based diffusion implementation (legacy)
    ├── finetune.py            # Training script for RoBERTa
    ├── inference.py           # RoBERTa diffusion inference
    ├── compare.py             # RoBERTa vs GPT-2 comparison
    ├── gpt2_inference.py      # GPT-2 baseline
    ├── README.md              # RoBERTa-specific documentation
    └── model_weights/         # Trained RoBERTa model

Overview

This project explores diffusion-based text generation as an alternative to traditional autoregressive language models like GPT-2. Instead of generating text left to right, one token at a time, this approach uses the following procedure (a minimal code sketch follows the list):

  1. Fixed prefix (first 16 tokens)
  2. Mask tokens for remaining positions
  3. Iterative denoising over multiple steps
  4. Progressive unmasking until fully denoised
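
The loop below is a minimal sketch of this procedure using a Hugging Face masked language model. It is not the project's inference.py: the model name, sequence length, number of steps, and prompt are illustrative placeholders, and the actual unmasking schedule may differ.

# Sketch of prefix-conditioned iterative denoising with a masked LM.
# Assumptions: roberta-base as a stand-in for the fine-tuned weights,
# a 64-token sequence, and 8 denoising steps.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "roberta-base"   # placeholder; the project fine-tunes its own weights
PREFIX_LEN = 16          # first 16 tokens stay fixed
SEQ_LEN = 64
NUM_STEPS = 8

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

prompt = "The history of the printing press begins in"
ids = tokenizer(prompt, return_tensors="pt").input_ids[0][:PREFIX_LEN]

# Fill everything after the prefix with mask tokens
seq = torch.full((SEQ_LEN,), tokenizer.mask_token_id)
seq[: len(ids)] = ids
masked = torch.ones(SEQ_LEN, dtype=torch.bool)
masked[: len(ids)] = False            # prefix positions are never re-predicted

with torch.no_grad():
    for step in range(NUM_STEPS):
        logits = model(seq.unsqueeze(0)).logits[0]
        conf, pred = logits.softmax(-1).max(-1)
        conf[~masked] = -1.0          # never overwrite already-revealed tokens
        # Reveal an equal share of the remaining masked positions each step,
        # keeping the most confident predictions
        k = max(1, int(masked.sum()) // (NUM_STEPS - step))
        reveal = conf.topk(k).indices
        seq[reveal] = pred[reveal]
        masked[reveal] = False

print(tokenizer.decode(seq, skip_special_tokens=True))

Each step re-runs the masked LM over the whole sequence and commits only its most confident predictions, which is what makes the generation bidirectional rather than strictly left-to-right.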

Key Features

  • Bidirectional Generation: Unlike autoregressive models, the model can attend to the full sequence context
  • Iterative Denoising: Gradually reveals text over a configurable number of steps
  • Prefix Control: First 16 tokens remain fixed, providing stable context
  • Visual Animations: Step-by-step matplotlib animations showing the generation process (see the sketch after this list)
  • Comparative Analysis: Side-by-side comparison with a GPT-2 baseline
  • Multi-Model Support: Works with different transformer backbones (mmBERT, RoBERTa)
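
As an illustration of how the step-by-step animations can be rendered, the snippet below animates a hand-written list of intermediate decoded strings with matplotlib's FuncAnimation. It is only a sketch: the strings, figure size, and output file name are made up, and the project's own animation code may differ.

# Sketch: animate intermediate denoising outputs with matplotlib.
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

# Placeholder decoded strings, one per denoising step
steps = [
    "The quick brown <mask> <mask> <mask> <mask>",
    "The quick brown fox <mask> <mask> <mask>",
    "The quick brown fox jumps <mask> <mask>",
    "The quick brown fox jumps over the",
]

fig, ax = plt.subplots(figsize=(8, 2))
ax.axis("off")
text = ax.text(0.5, 0.5, steps[0], ha="center", va="center", wrap=True)

def update(i):
    # Swap in the decoded text for denoising step i
    text.set_text(f"step {i}: {steps[i]}")
    return (text,)

anim = FuncAnimation(fig, update, frames=len(steps), interval=500, repeat=False)
anim.save("diffusion_steps.gif", writer="pillow")  # or plt.show() for interactive use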

Installation

This project uses uv for package management.

# Install dependencies
uv sync

Requirements:

  • Python >= 3.11
  • PyTorch 2.7.0+
  • Transformers 4.52.4
  • Datasets 3.6.0
  • Matplotlib 3.10.3
  • Accelerate 1.7.0
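
After uv sync, one quick way to confirm the environment matches the versions listed above is to print what is actually importable. This is just a convenience sketch, not part of the project:

# Print installed versions of the core dependencies
import torch, transformers, datasets, matplotlib, accelerate

for name, module in [
    ("PyTorch", torch),
    ("Transformers", transformers),
    ("Datasets", datasets),
    ("Matplotlib", matplotlib),
    ("Accelerate", accelerate),
]:
    print(f"{name}: {module.__version__}")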

Usage

mmBERT Diffusion (Recommended)

cd wiki-mmBERT

# Basic text generation
python inference.py "Your prompt here"

# Fine-tuning
python finetune.py

# Side-by-side comparison with GPT-2
python compare.py "Your prompt here"

RoBERTa Diffusion (Legacy)

cd wiki-roberta

# Basic text generation
python inference.py "Your prompt here"

# Fine-tuning
python finetune.py

# Side-by-side comparison with GPT-2
python compare.py "Your prompt here"
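
For context on what the comparison scripts contrast against, the snippet below is a minimal sketch of a plain autoregressive GPT-2 baseline using the transformers API. It is not the project's compare.py or gpt2_inference.py; the prompt and sampling settings are illustrative only.

# Sketch of an autoregressive GPT-2 baseline for comparison
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The history of the printing press begins in"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_length=64,                        # roughly match the diffusion sequence length
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))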

Model-Specific Documentation

  • mmBERT Implementation: See wiki-mmBERT/README.md for detailed mmBERT-specific documentation
  • RoBERTa Implementation: See wiki-roberta/README.md for detailed RoBERTa-specific documentation

Research Applications

This project is designed for research into:

  • Alternative text generation paradigms
  • Bidirectional context utilization
  • Iterative refinement approaches
  • Model comparison and evaluation

Citation

This project is based on the original RoBERTaDiffusion implementation:

@misc{robertadiffusion2024,
  title={RoBERTa Diffusion Text Generation},
  author={Nathan Barry},
  year={2024},
  url={https://github.com/nathan-barry/RoBERTaDiffusion},
  note={A research project exploring fine-tuning BERT-style models for text generation}
}

Original Repository: nathan-barry/RoBERTaDiffusion

This mmDiffBERT project extends the original work by:

  • Supporting multiple transformer models (mmBERT, RoBERTa)
  • Adding Hindi language support with Wikipedia and Sangraha datasets
  • Checking syntax and compatibility, which is the primary objective of this extension
