jalateras

Summary

  • Added comprehensive documentation explaining how ChatCompletion training data is transformed internally
  • Created new notebook with step-by-step examples and visualizations
  • Addresses the knowledge gap about OpenAI's internal SFT processing pipeline

What Changed

Created a new Jupyter notebook Understanding_ChatCompletion_SFT_transformation.ipynb that explains:

  • How structured ChatCompletion messages are converted to linear sequences (see the sketch after this list)
  • The role of special tokens and message boundaries
  • Tokenization process and token ID conversion
  • Loss mask creation and why only assistant tokens contribute to training loss
  • Gradient flow during backpropagation
  • Practical implications for designing effective training data
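
The core idea can be sketched in a few lines of Python. The ChatML-style delimiters, the whitespace "tokenizer", and the example conversation below are illustrative stand-ins; OpenAI's actual special tokens and tokenizer are internal and not public:

```python
# Minimal sketch: flatten ChatCompletion messages into one token stream
# with a loss mask that covers only assistant tokens.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]

tokens, loss_mask = [], []
for msg in messages:
    # ChatML-style message boundaries (assumed format, for illustration only)
    rendered = f"<|im_start|>{msg['role']}\n{msg['content']}\n<|im_end|>"
    piece = rendered.split()  # toy whitespace "tokenization"
    tokens.extend(piece)
    # Only assistant tokens receive a loss; everything else is masked out.
    loss_mask.extend([int(msg["role"] == "assistant")] * len(piece))

for tok, m in zip(tokens, loss_mask):
    print(f"{m}  {tok}")
```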

Why This Matters

Users frequently ask what happens "under the hood" when they provide training data in ChatCompletion format for fine-tuning. This documentation fills that gap with a detailed technical explanation and code examples that demonstrate each transformation step.
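
As a concrete illustration of the token-ID conversion step, the open-source tiktoken library can stand in for the tokenizer (the exact tokenizer applied server-side during fine-tuning is not exposed):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Why is the sky blue?"
ids = enc.encode(text)
print(ids)                             # integer token IDs
print([enc.decode([i]) for i in ids])  # the text piece behind each ID
```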

Testing

The notebook includes runnable code examples that:

  • Demonstrate the transformation pipeline
  • Show tokenization in action
  • Explain loss calculation with concrete examples (see the PyTorch sketch after this list)
  • Provide visualization of the training process
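
For example, the loss-masking idea can be demonstrated in a few lines of PyTorch; the shapes, labels, and mask here are made up for illustration:

```python
import torch
import torch.nn.functional as F

vocab, seq_len = 100, 6
logits = torch.randn(1, seq_len, vocab)         # model outputs (batch, seq, vocab)
labels = torch.randint(0, vocab, (1, seq_len))  # next-token targets
mask = torch.tensor([[0, 0, 0, 1, 1, 1]])       # 1 = assistant token, 0 = masked

# Per-token cross-entropy, then zero out non-assistant positions.
per_token = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
loss = (per_token * mask).sum() / mask.sum()
print(loss.item())
```

In practice the same masking is often expressed by setting masked label positions to the ignore_index of the loss function (-100 by default in PyTorch's cross_entropy) rather than multiplying by an explicit mask.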

Fixes #2075

Added a comprehensive notebook demonstrating how OpenAI's fine-tuning framework
internally transforms ChatCompletion-style training data into a model-ready
format for Supervised Fine-Tuning. Covers message concatenation, tokenization,
loss masking, and training-process visualization.