Thank you for sharing such great work.
I'm currently including comparisons with your method in my paper. The text-to-image and depth estimation tasks were easy to evaluate using the provided code, but I could not find a way to reproduce the depth-to-image results.
Would it be possible to share the inference code or a demo for the depth-to-image task?