This release includes:
Add attention function, model data flow, and OLMo sequential block figures (@Naeemkh, e6e54f0)
Add option for nsys profiling (@mbsabath, bf3f2a4); a profiling sketch appears after this list
Add table of parameters to the logger (@Naeemkh, 12bf09c); see the logging sketch after this list
Drop Llama Block (@Naeemkh, 22d0f45)
Drop dropout layer (@Naeemkh, 4f2775d and a11ee8a)
Add back-of-the-envelope computations (@Naeemkh, 6d83c07); see the worked example after this list
Merge OLMoSequentialBlock into OLMoBlock (@Naeemkh, fff5955)
Move flash attention settings to the config file (@Naeemkh, 197c38f); see the config sketch after this list
Add sweep generator scripts (@Naeemkh, def2931, e462a92, 1e7fb8c, 7e1c11e); see the generator sketch after this list
Drop SwiGLU activation function (@Naeemkh, dd12e48, 1d5f0dc, 7c942be)
Drop weight_tying (@Naeemkh, 544b0b6)
Drop OLMoBlockGroup (@Naeemkh, ceff8f8, ba49aa6)
Keep only PyTorch default LayerNorm (@Naeemkh, beb76cd, d988ea7)
Clean up utility code for uploading checkpoints to the cloud (@Naeemkh, f8dbc80)
Remove multi-query attention feature and related settings (@Naeemkh, 74eaf03)
Drop effective key/value heads and use the user-requested number of heads (@Naeemkh, 36f51b7)
Fix a bug with the conda environment setup (@amazloumi, e51c620, c1f3125)
Drop output multiplier (@Naeemkh, 1b3eb2b)
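
For the nsys profiling option above: the changelog does not show the repo's actual flag or wiring, so the following is a minimal sketch assuming a hypothetical `nsys_profiling` option. It uses PyTorch's `torch.cuda.profiler` start/stop hooks, which nsys can key its capture window on.

```python
# Hypothetical sketch; the option name, `warmup`, and `active` are assumptions.
# Launch under nsys so capture follows the profiler API calls:
#   nsys profile --capture-range=cudaProfilerApi --capture-range-end=stop python train.py
import torch

def train(model, optimizer, batches, nsys_profiling=False, warmup=10, active=20):
    for step, batch in enumerate(batches):
        if nsys_profiling and step == warmup:
            # Begin the nsys capture window, skipping noisy warmup steps.
            torch.cuda.profiler.start()
        loss = model(**batch).loss  # placeholder forward/loss call
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if nsys_profiling and step + 1 == warmup + active:
            torch.cuda.profiler.stop()  # end the nsys capture window
```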
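
For the parameter table above: the exact table format the logger prints is repo-specific; this is a minimal sketch of the general idea using only standard PyTorch introspection. `log_parameter_table` is a hypothetical name.

```python
import logging
import torch.nn as nn

log = logging.getLogger(__name__)

def log_parameter_table(model: nn.Module) -> None:
    """Log a simple name/shape/count table for all model parameters."""
    rows = [(name, tuple(p.shape), p.numel()) for name, p in model.named_parameters()]
    width = max(len(name) for name, _, _ in rows)
    log.info(f"{'parameter':<{width}}  shape            numel")
    for name, shape, numel in rows:
        log.info(f"{name:<{width}}  {str(shape):<15}  {numel:,}")
    log.info(f"total parameters: {sum(n for _, _, n in rows):,}")
```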
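
For the back-of-the-envelope computations above: the changelog does not say which quantities the commit estimates. One standard estimate of this kind is training compute ≈ 6 × parameters × tokens; the sketch below illustrates that rule of thumb and is not taken from the commit.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Back-of-the-envelope estimate: ~6 FLOPs per parameter per token
    (roughly 2 for the forward pass, 4 for the backward pass)."""
    return 6.0 * n_params * n_tokens

# Example: a 1B-parameter model trained on 100B tokens,
# on one GPU sustaining 100 TFLOP/s at 40% utilization.
flops = training_flops(1e9, 100e9)        # 6e20 FLOPs
days = flops / (100e12 * 0.4) / 86400     # ~174 GPU-days
print(f"{flops:.1e} FLOPs, ~{days:.0f} GPU-days")
```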
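
For the flash attention config setting above: the actual config key in this repo is not shown in the changelog. A minimal sketch, assuming a hypothetical `flash_attention` boolean and using PyTorch's `F.scaled_dot_product_attention`, which dispatches to a flash-attention kernel when eligible (fp16/bf16 inputs on CUDA, supported head dims):

```python
from dataclasses import dataclass
import torch
import torch.nn.functional as F

@dataclass
class ModelConfig:
    flash_attention: bool = True  # hypothetical key; the repo's name may differ

def attend(q, k, v, cfg: ModelConfig):
    if cfg.flash_attention:
        # Let PyTorch pick its flash kernel when the inputs qualify.
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    # Explicit causal attention fallback.
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale
    mask = torch.triu(torch.ones(scores.shape[-2:], dtype=torch.bool, device=q.device), 1)
    scores = scores.masked_fill(mask, float("-inf"))
    return scores.softmax(dim=-1) @ v
```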
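
For the sweep generator scripts above: their parameters and output format are not in the changelog; this sketch only illustrates the common pattern of emitting one config file per point in a Cartesian product of hyperparameter values. All axis names and paths here are hypothetical.

```python
import itertools
import json
from pathlib import Path

# Hypothetical sweep axes; the actual scripts' parameters are not shown.
sweep = {
    "learning_rate": [1e-4, 3e-4],
    "d_model": [512, 1024],
    "n_layers": [8, 16],
}

out_dir = Path("sweep_configs")
out_dir.mkdir(exist_ok=True)
for i, values in enumerate(itertools.product(*sweep.values())):
    cfg = dict(zip(sweep.keys(), values))
    (out_dir / f"run_{i:03d}.json").write_text(json.dumps(cfg, indent=2))
```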