Hi, in your paper you state that you use the same learning rate schedule as in the paper "Attention Is All You Need", but I cannot find any implementation of it in your code. Moreover, the learning rate in your Adam optimizer is set to 0. Can you give me a hint where all of this is handled in the code?
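For reference, this is my understanding of the schedule from the paper (Eq. 3) — a minimal sketch, assuming the rate is recomputed every step and written into the optimizer, which would also explain why the Adam learning rate is initialized to 0:

```python
def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate schedule from "Attention Is All You Need" (Eq. 3):
    lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)

    The rate grows linearly for the first warmup_steps updates, then
    decays proportionally to the inverse square root of the step number.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

Is there a wrapper or training-loop hook somewhere that calls something like this and overwrites the optimizer's learning rate each step?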