
Conversation

@ericcurtin
Collaborator

So we can load these natively just like gguf

@ericcurtin ericcurtin marked this pull request as draft November 28, 2025 20:32
@ericcurtin
Collaborator Author

ericcurtin commented Nov 28, 2025

Safetensors files tend to be downloaded/pulled more often than GGUF files. This PR introduces support for both formats.

It will also help us run performance comparisons using the exact same models/files in vLLM and llama.cpp.
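For reference, the safetensors container itself is simple: an 8-byte little-endian header length, a JSON header mapping tensor names to dtype/shape/offsets, then the raw tensor data. A minimal standalone reader (illustration only, not the loader code in this PR) looks roughly like this:

```cpp
// Sketch of a standalone safetensors header reader (not the PR's loader code).
// On-disk layout of a single-shard .safetensors file:
//   [8 bytes]  little-endian uint64 N = length of the JSON header
//   [N bytes]  UTF-8 JSON: tensor name -> { "dtype", "shape", "data_offsets" }
//   [rest]     raw tensor data; offsets are relative to the end of the header
#include <cstdint>
#include <cstdio>
#include <string>

int main(int argc, char ** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s model.safetensors\n", argv[0]);
        return 1;
    }
    std::FILE * f = std::fopen(argv[1], "rb");
    if (!f) {
        std::perror("fopen");
        return 1;
    }
    uint8_t len_buf[8];
    if (std::fread(len_buf, 1, 8, f) != 8) {
        std::fprintf(stderr, "failed to read header length\n");
        return 1;
    }
    uint64_t header_len = 0;
    for (int i = 7; i >= 0; --i) {
        header_len = (header_len << 8) | len_buf[i]; // assemble little-endian u64
    }
    std::string header(header_len, '\0');
    if (std::fread(&header[0], 1, header_len, f) != header_len) {
        std::fprintf(stderr, "failed to read JSON header\n");
        return 1;
    }
    std::fclose(f);
    // A real loader would parse this JSON and then seek/mmap into the data
    // section; here we just dump it.
    std::printf("%s\n", header.c_str());
    return 0;
}
```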

@ericcurtin ericcurtin force-pushed the support-safetensors branch 2 times, most recently from a963646 to 43efc4d on November 30, 2025 12:48
@ericcurtin ericcurtin force-pushed the support-safetensors branch 3 times, most recently from 6144e45 to e5e869d on November 30, 2025 16:04
So we can load these natively just like gguf

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
@ericcurtin ericcurtin marked this pull request as ready for review November 30, 2025 16:11
@ericcurtin ericcurtin requested a review from CISC as a code owner November 30, 2025 16:11
@ericcurtin
Collaborator Author

I think this is good for review now; I did most of the testing with:

llama-server -hf HuggingFaceTB/SmolLM2-135M-Instruct

@ericcurtin
Collaborator Author

@ggerganov @CISC PTAL

@CISC
Collaborator

CISC commented Nov 30, 2025

Sooo, transformers can load GGUFs and llama.cpp will be able to load safetensors, not sure what the upshot is? Means we have to maintain the mappings (not to mention MoEs won't easily be supported), looks stressful enough for transformers...

@ericcurtin
Collaborator Author

ericcurtin commented Dec 1, 2025

Sooo, transformers can load GGUFs and llama.cpp will be able to load safetensors, not sure what the upshot is? Means we have to maintain the mappings (not to mention MoEs won't easily be supported), looks stressful enough for transformers...

This PR already has 7 positive emoji reactions (👍/🚀) in just a few days. This indicates a strong user desire to run safetensors models directly without the intermediate conversion step.

Allowing native safetensors loading drastically simplifies the workflow for new users. It removes the friction of needing convert-hf-to-gguf.py, making llama.cpp a more "drop-in" replacement for other runners like vLLM. It also saves users from duplicating models, which are often GBs in size. Being able to run the exact same file artifacts as other inference engines is useful for fair and accurate performance comparisons.

To alleviate the stress you mentioned, I am happy to volunteer to help maintain these mappings or triage issues related to the safetensors loader if that helps get this across the line. It might also be worth documenting that the safetensors path hasn't been as battle-tested here as GGUF.

@ggerganov WDYT?

@ggerganov
Member

I don't think it's worth the complexity. There are various tensor manipulations that are needed during conversion to GGUF that would be difficult to implement at this level. It would duplicate, or even multiply, the effort of maintaining the conversion. No mmap support, JSON bloat, etc.

What we could do in this direction is to think about abstracting the model loader and providing a public interface to implement it. This would allow 3rd parties to implement model loading and move this complexity outside of libllama. Haven't looked deeply, so not sure if even that makes sense.
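For illustration only, one purely hypothetical shape such an interface could take (none of these types exist in libllama today):

```cpp
// Purely hypothetical sketch -- none of these types exist in libllama today.
// The idea: libllama asks an external loader for metadata and tensor data,
// and the format-specific parsing (gguf, safetensors, ...) lives outside.
#include <cstddef>
#include <cstdint>

struct llama_ext_tensor_info {
    const char * name;     // canonical llama.cpp tensor name
    int32_t      n_dims;
    int64_t      shape[4];
    int32_t      type;     // ggml type id
};

struct llama_ext_model_loader {
    void * ctx;            // loader-private state

    // enumerate tensors and model hyperparameters
    size_t (*n_tensors)(void * ctx);
    bool   (*get_tensor_info)(void * ctx, size_t i, llama_ext_tensor_info * info);
    bool   (*get_kv_str)(void * ctx, const char * key, char * buf, size_t buf_size);

    // copy (or map) tensor i's data into dst, which holds dst_size bytes
    bool   (*read_tensor_data)(void * ctx, size_t i, void * dst, size_t dst_size);
};
```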

This PR already has 7 positive emoji reactions (👍/🚀) in just a few days. This indicates a strong user desire to run safetensors models directly without the intermediate conversion step.

I hope this is not a serious argument.

@ericcurtin
Collaborator Author

I don't think it's worth the complexity. There are various tensor manipulations that are needed during conversion to GGUF that would be difficult to implement at this level. It would duplicate, or even multiply, the effort of maintaining the conversion. No mmap support, JSON bloat, etc.

What we could do in this direction is to think about abstracting the model loader and providing a public interface to implement it. This would allow 3rd parties to implement model loading and move this complexity outside of libllama. Haven't looked deeply, so not sure if even that makes sense.

Just to be clear, do you want to open the door to running safetensors models (and other models) via the public interface you describe here?

This PR already has 7 positive emoji reactions (👍/🚀) in just a few days. This indicates a strong user desire to run safetensors models directly without the intermediate conversion step.

I hope this is not a serious argument.

My point wasn't to argue for merging based on popularity, but to highlight a clear signal of user friction. The engagement suggests that the current conversion step is a significant pain point. You feel otherwise?

@ngxson
Collaborator

ngxson commented Dec 1, 2025

Allowing native safetensors loading drastically simplifies the workflow for new users.

The counter arguments are:

  • Allowing native safetensors loading drastically complexifies the workflow for maintainers.
  • Allowing native safetensors loading drastically reduces performance for some models, especially MoE models that require aggregating 2D expert tensors into a 3D tensor.
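For illustration, the kind of aggregation the second point refers to, as a rough sketch (not llama.cpp code):

```cpp
// Illustration only (not llama.cpp code): safetensors checkpoints typically
// store one 2D weight per expert, e.g. "...experts.{i}.gate_proj.weight" of
// shape [n_ff, n_embd], while the GGUF conversion stacks them into a single
// 3D tensor [n_expert, n_ff, n_embd] so the runtime can do one batched matmul.
// Loading safetensors natively means doing this copy at load time, or paying
// for n_expert separate matmuls per layer.
#include <cstring>
#include <vector>

std::vector<float> stack_experts(const std::vector<std::vector<float>> & experts_2d,
                                 size_t n_ff, size_t n_embd) {
    const size_t n_expert = experts_2d.size();
    std::vector<float> stacked(n_expert * n_ff * n_embd);
    for (size_t e = 0; e < n_expert; ++e) {
        // each per-expert tensor is a contiguous [n_ff, n_embd] block
        std::memcpy(stacked.data() + e * n_ff * n_embd,
                    experts_2d[e].data(),
                    n_ff * n_embd * sizeof(float));
    }
    return stacked;
}
```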

If you want to discuss this feature, you should probably also tag all the maintainers who spend their effort maintaining the model loading infrastructure.

My point wasn't to argue for merging based on popularity, but to highlight a clear signal of user friction.

Most users don't understand the underlying work and the maintainers' efforts to keep things as efficient as possible. So user friction is not the only indicator here.

Obviously, throughout the discussion we mentioned twice how this can be stressful for human maintainers, and you don't seem to take this seriously.

@CISC
Collaborator

CISC commented Dec 1, 2025

A public loader interface sounds much more appealing.

@ericcurtin
Collaborator Author

ericcurtin commented Dec 1, 2025

Allowing native safetensors loading drastically simplifies the workflow for new users.

The counter arguments are:

  • Allowing native safetensors loading drastically complexifies the workflow for maintainers.
  • Allowing native safetensors loading drastically reduces performance for some models, especially MoE models that require aggregating 2D expert tensors into a 3D tensor.

If you want to discuss this feature, you should probably also tag all the maintainers who spend their effort maintaining the model loading infrastructure.

I am hesitant to mass-tag people, as I recently received complaints about causing unnecessary notifications:

#17579

As an aside, a list of who the maintainers are would be helpful; I'm not 100% sure anymore, although I have a rough idea. The CODEOWNERS file isn't fully reflective of who the maintainers are:

https://github.com/ggml-org/llama.cpp/blob/master/CODEOWNERS

For example, I am not a maintainer, but I think most, if not all, of the others in this file are; it would be nice to know. @CISC suggested something similar here:

#17492

My point wasn't to argue for merging based on popularity, but to highlight a clear signal of user friction.

Most users don't understand the underlying work and the maintainers' efforts to keep things as efficient as possible. So user friction is not the only indicator here.

I agree user friction is not the only indicator here. It's reasonable to bring it up as a discussion point, we often do.

Obviously, throughout the discussion we mentioned twice how this can be stressful for human maintainers, and you don't seem to take this seriously.

To be clear, I am taking the feedback regarding maintainer burnout and architectural complexity seriously; in fact, I directly addressed it. My intention was not to dismiss the maintenance burden, but to ensure the user experience was represented in the trade-off analysis. We are discussing a potential interface; I hope it doesn't upset people that we are having the discussion.

I have noticed increasing stress in llama.cpp over the last few weeks in particular; it's not something I noticed when I first started contributing. I feel like I'm walking on eggshells somewhat with every pull request I open, and I certainly don't want to make your jobs harder. My goal is to help improve the project, not add to the burden.

@CISC
Collaborator

CISC commented Dec 1, 2025

I have noticed increasing stress in llama.cpp over the last few weeks in particular; it's not something I noticed when I first started contributing. I feel like I'm walking on eggshells somewhat with every pull request I open, and I certainly don't want to make your jobs harder. My goal is to help improve the project, not add to the burden.

That is not a feeling anyone should have, but at the same time maintainers have to deliver messages from time to time that can seem quite harsh; often it's not really possible to do that in a universally "nice" way, certainly not in written form.

The stress you've noticed is a manifestation of the increased workload; very few people are equipped to handle that without some adverse effect. It's important to keep in mind that any perceived slight or rejection is most likely neither intentional nor personal.

At the end of the day everyone here is volunteering their time and experience as best they can; take care, be mindful, and make sure it's something you enjoy the challenge of.

@ericcurtin
Collaborator Author

ericcurtin commented Dec 1, 2025

I have noticed increasing stress in llama.cpp over the last few weeks in particular; it's not something I noticed when I first started contributing. I feel like I'm walking on eggshells somewhat with every pull request I open, and I certainly don't want to make your jobs harder. My goal is to help improve the project, not add to the burden.

That is not a feeling anyone should have, but at the same time maintainers have to deliver messages from time to time that can seem quite harsh; often it's not really possible to do that in a universally "nice" way, certainly not in written form.

The stress you've noticed is a manifestation of the increased workload; very few people are equipped to handle that without some adverse effect. It's important to keep in mind that any perceived slight or rejection is most likely neither intentional nor personal.

I believe there is a distinct difference between blunt technical feedback and personal condescension. Here are some examples:

"While your AI is not capable enough to explain this to you @ericcurtin , I will..."

This comment, from PR #17579, is condescending and attacks me personally. Note that it was a draft PR at the time, and there wasn't even an active conversation on that PR.

"Obviously... you don't seem to take this seriously."

This attacks my intent and professionalism. From this PR.

For some reason, there was even negative feedback given towards my PR in a conversation here:

antirez/linenoise#236

which is the most perplexing. Again, there hadn't even been an active conversation on that PR since January. Is every PR I open being chased?

This feedback is intentionally personal.

At the end of the day everyone here is volunteering their time and experience as best they can; take care, be mindful, and make sure it's something you enjoy the challenge of.

I am here to volunteer my time and help improve the project, just like everyone else. While I accept that stress runs high, maintaining a baseline of professional respect is essential for contributors and maintainers; it was like that when I first started contributing a year ago.

I also think there should be conversations about how maintainers scale across the codebase, rather than having maybe one or two people looking at a large number of PRs; the project has grown beyond that stage, and it feels like there are fewer maintainers than there were before in a growing project.

I am closing the PR; the conversation can continue if need be.

@ericcurtin ericcurtin closed this Dec 1, 2025
@CISC
Collaborator

CISC commented Dec 1, 2025

I believe there is a distinct difference between blunt technical feedback and personal condescension. Here are some examples:

[...]

This feedback is intentionally personal.

Ok, I agree, that was not a pleasant tone, but it was clearly made from a standpoint of exasperation. Given the context I would say it was not really directed towards you, but at the slog of reviewing AI-generated code lately. That does not excuse it though, as you are a long-time collaborator with valuable contributions.

I also think there should be conversations about how maintainers scale across the codebase, rather than having maybe one or two people looking at a large number of PRs; the project has grown beyond that stage, and it feels like there are fewer maintainers than there were before in a growing project.

Could not agree more, harder to realize though.

@bartowski1182
Contributor

This comment, from PR #17579, is condescending and attacks me personally. Note that it was a draft PR at the time, and there wasn't even an active conversation on that PR.

Counterpoint, if I may: that comment was made more than two days after you were questioned on whether the code was written by AI, and you offered no explanation other than "I mark it experimental for good reason". Since you didn't refute it (and AI-written, completely untested PRs are on the rise), ngxson understandably assumed that you had had an AI write the entire thing without reviewing the results.

It was also an active PR until shortly before the comment, and multiple days after the initial comments.

Bear in mind as well, on open source projects like this, language barriers exist. People who don't natively speak English can very often come across as rude or condescending simply by not knowing how best to express themselves, so we should all take everything as professionally as possible until given a reason not to.

I think your biggest lesson from this should be to only publish draft PRs, not mark them ready for review until they work, and to address feedback in a timely manner once a PR is open and marked for review. These people take time out of their day and development cycle to help get everyone's code in; their time is very valuable, and we should all treat it as such.

Just my 2 cents, no offense intended, just trying to examine the situation as an outsider and provide feedback/context.
