
Conversation

rgerganov (Collaborator)

Allow rpc-server to expose multiple devices from a single endpoint. Change the RPC protocol to include a device identifier where needed. Add a new API to get the device count from an RPC endpoint.

closes: #15210

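For context, a minimal sketch of what the new device-count API could look like in ggml-rpc.h (the exact name and signature here are assumptions, not necessarily what the PR adds):

// sketch only: query how many devices a single RPC endpoint exposes,
// so the client can register one ggml backend device per remote device
GGML_BACKEND_API uint32_t ggml_backend_rpc_get_device_count(const char * endpoint);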
rgerganov (Collaborator Author)

@slaren I am still working on this, but I'd appreciate some early feedback on the API changes in ggml-rpc.h

github-actions bot added the examples and ggml (changes relating to the ggml tensor library for machine learning) labels on Sep 26, 2025
rgerganov marked this pull request as ready for review on September 30, 2025 08:51
rgerganov (Collaborator Author) commented Sep 30, 2025

I am also considering changing the naming scheme for RPC devices. Currently I use RPC<X>[<host>:<port>] for both the name and the description, which doesn't look nice in various logs.

I am thinking of switching to RPC<X> for the device name and [<host>:<port>] for the device description. The X number would be a global monotonic counter, making RPC devices appear like this:

RPC0 ([localhost:50052])
RPC1 ([localhost:50053])
RPC2 ([localhost:18053])
RPC3 ([localhost:18054])
...

Thoughts?
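A minimal sketch of how the proposed scheme could be implemented (the counter and helper names below are assumptions for illustration, not code from the PR):

#include <atomic>
#include <string>

// process-wide monotonic counter shared by all RPC endpoints
static std::atomic<int> g_rpc_dev_counter{0};

// device name: "RPC<X>"
static std::string rpc_device_name() {
    return "RPC" + std::to_string(g_rpc_dev_counter.fetch_add(1));
}

// device description: "[<host>:<port>]"
static std::string rpc_device_description(const std::string & host, int port) {
    return "[" + host + ":" + std::to_string(port) + "]";
}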

};

struct rpc_msg_get_device_memory_req {
    uint32_t device;


I actually did the same optimization a while ago locally, but haven't had a chance to properly submit it. I think uint8_t is enough for the device ID; I don't think we can exceed 256 devices per endpoint. Using uint8_t would reduce the amount of transferred data.

rgerganov (Collaborator Author)

I thought about this, but I was afraid of running into alignment issues if the payload for GRAPH_COMPUTE is not a multiple of 4. I will do some tests and may reconsider.
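For illustration, one way to keep the wire format 4-byte aligned with a 1-byte device ID would be explicit padding (a sketch, not what this PR does):

struct rpc_msg_device_hdr {
    uint8_t device;   // device ID; allows up to 256 devices per endpoint
    uint8_t pad[3];   // explicit padding keeps the following payload 4-byte aligned
};
static_assert(sizeof(rpc_msg_device_hdr) == 4, "wire size must stay fixed");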

ggerganov (Member) left a comment


I am doing some testing and it seems that the change in #15966 broke Metal compatibility with the RPC backend. Since that change, running any model over RPC to a Metal machine with FA enabled results in garbage.

I think the reason is that the RPC backend uses ggml_nbytes() to get the required memory for a tensor, but the Metal backend can now allocate extra size for fleeting data:

static size_t ggml_backend_metal_buffer_type_get_alloc_size(ggml_backend_buffer_type_t buft, const ggml_tensor * tensor) {
    size_t res = ggml_nbytes(tensor);

    // some operations require additional memory for fleeting data:
    switch (tensor->op) {
        case GGML_OP_MUL_MAT_ID:
            {
                res += ggml_metal_op_mul_mat_id_extra_tpe(tensor);
                res += ggml_metal_op_mul_mat_id_extra_ids(tensor);
            } break;
        case GGML_OP_FLASH_ATTN_EXT:
            {
                if (ggml_metal_op_flash_attn_ext_use_vec(tensor)) {
                    res += ggml_metal_op_flash_attn_ext_extra_tmp(tensor);
                }
            } break;
        default:
            break;
    }

    return res;

    GGML_UNUSED(buft);
}

This is not directly related to the changes in this PR, but we should look into resolving it. I think the issue is similar to the padding that the CUDA backend does for quantized tensors.
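One possible direction, sketched under the assumption that the client sizes remote buffers via a server-side query rather than a local ggml_nbytes() (the helper name below is hypothetical; ggml_backend_buft_get_alloc_size is the public ggml-backend API):

// sketch: the server answers allocation-size queries using its local
// backend, so any backend-specific extra memory (Metal's fleeting data,
// CUDA's quantized-tensor padding) is reflected on the client
static size_t rpc_server_get_alloc_size(ggml_backend_buffer_type_t buft, ggml_tensor * tensor) {
    // dispatches to the real backend, e.g. Metal's get_alloc_size shown above
    return ggml_backend_buft_get_alloc_size(buft, tensor);
}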

    auto * reg = ggml_backend_reg_get(i);
    std::string name = ggml_backend_reg_name(reg);
-   if (name != "CPU") {
+   if (name != "CPU" && !string_starts_with(name, "RPC")) {
Member

I wonder if, instead of ad-hoc string matching, the RPC backend could expose a ggml_backend_rpc_is_name_valid(const char * name) function and move the entire naming logic inside the backend implementation. As it is now, it seems a bit of a hack that assumes a specific name syntax, and there is no way to keep it in sync with the actual naming syntax used in the backend.

Just an idea - @slaren can confirm whether this is needed or we keep it as it is.
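A sketch of what such a helper could look like, assuming the "RPC<X>" device-name scheme discussed above (illustrative only, not part of the PR):

#include <cctype>
#include <cstring>

// returns true if `name` matches the backend's "RPC<number>" device-name scheme
bool ggml_backend_rpc_is_name_valid(const char * name) {
    if (strncmp(name, "RPC", 3) != 0 || name[3] == '\0') {
        return false;
    }
    for (const char * p = name + 3; *p; ++p) {
        if (!isdigit((unsigned char) *p)) {
            return false;
        }
    }
    return true;
}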

rgerganov (Collaborator Author)

We create a separate ggml_backend_reg for each RPC server, which has the name RPC[host:port]. The tricky part is that we don't want to list all of these names in the backend column of the report, but just say "RPC". I guess we could expose an is_rpc() function from ggml_backend_reg, but I think it'd be overkill.

I have slightly improved this logic with the current APIs.

rgerganov (Collaborator Author)
> I am doing some testing and it seems that the change in #15966 broke Metal compatibility with the RPC backend. [...] I think the issue is similar to the padding that the CUDA backend does for quantized tensors.

Thanks for catching this. Let's fix it in a separate PR.

ggerganov (Member) left a comment


Wait for slaren's review before merging.

    // serialization format:
-   // | n_nodes (4 bytes) | nodes (n_nodes * sizeof(uint64_t)) | n_tensors (4 bytes) | tensors (n_tensors * sizeof(rpc_tensor)) |
+   // | device (4 bytes) | n_nodes (4 bytes) | nodes (n_nodes * sizeof(uint64_t)) | n_tensors (4 bytes) | tensors (n_tensors * sizeof(rpc_tensor)) |
    if (input.size() < sizeof(uint32_t)) {
Member

This could be refactored into a header struct with the fixed fields (device, n_nodes, n_tensors), followed by the fields with variable size (nodes and tensors).
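A minimal sketch of that refactoring (hypothetical struct name; all fixed-size fields first, followed by the variable-size payload):

struct rpc_msg_graph_compute_hdr {
    uint32_t device;     // target device on the endpoint
    uint32_t n_nodes;    // number of node IDs (uint64_t) that follow
    uint32_t n_tensors;  // number of rpc_tensor entries that follow
};
// wire format: | rpc_msg_graph_compute_hdr | n_nodes * uint64_t | n_tensors * rpc_tensor |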

Labels: examples, ggml (changes relating to the ggml tensor library for machine learning)

Development: successfully merging this pull request may close "Feature Request: Support multiple devices on a single rpc-server" (#15210)

4 participants