Gradients of the prior mean are not strictly needed from a theoretical point of view, but they are needed in the context of botorch. The overall GP is differentiable, so derivatives with respect to the inputs will still be returned, but they will be incorrect if the prior mean's computational graph is not included. That leads to bad acquisition function optimization. Supporting this might require a change to botorch.
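To make the failure mode concrete, here is a minimal pure-Python sketch. It is not botorch code. It assumes a 1-D GP with a single training point, an RBF kernel, and a hypothetical quadratic prior mean; every name in it is made up for illustration. It compares the posterior mean's gradient with and without the prior mean's contribution against a finite-difference reference.

```python
import math

# Hypothetical parametric prior mean m(x) = 0.5 * x^2 and its derivative.
def prior_mean(x):
    return 0.5 * x ** 2

def prior_mean_grad(x):
    return x

# RBF kernel k(x, z) and its derivative with respect to x.
def rbf(x, z, lengthscale=1.0):
    return math.exp(-0.5 * ((x - z) / lengthscale) ** 2)

def rbf_grad_x(x, z, lengthscale=1.0):
    return -(x - z) / lengthscale ** 2 * rbf(x, z, lengthscale)

# One training point (assumed values), so no matrix inverse is needed.
X0, Y0, NOISE = 1.0, 2.0, 1e-2
ALPHA = (Y0 - prior_mean(X0)) / (rbf(X0, X0) + NOISE)

def posterior_mean(x):
    # mu(x) = m(x) + k(x, X0) * (k(X0, X0) + noise)^-1 * (Y0 - m(X0))
    return prior_mean(x) + rbf(x, X0) * ALPHA

def grad_full(x):
    # Correct gradient: includes the prior mean's derivative.
    return prior_mean_grad(x) + rbf_grad_x(x, X0) * ALPHA

def grad_detached(x):
    # What you get if the prior mean is treated as a constant in x,
    # i.e. its computational graph is missing.
    return rbf_grad_x(x, X0) * ALPHA

def grad_fd(x, h=1e-6):
    # Central finite difference as a reference.
    return (posterior_mean(x + h) - posterior_mean(x - h)) / (2 * h)

x = 2.0
print("finite difference:", grad_fd(x))
print("full gradient:    ", grad_full(x))
print("detached gradient:", grad_detached(x))
```

In this setup the detached gradient differs from the reference by exactly the prior mean's derivative, and at `x = 2.0` it even has the opposite sign, so a gradient-based optimizer would step in the wrong direction. The same reasoning would apply to an acquisition function built on top of the posterior.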