As far as I know, ClariNet is an end-to-end text-to-speech model, but this implementation only accepts .npy files as input. Is there a way to synthesize speech directly from a text sentence?

Also, I'm wondering about ClariNet's capabilities: can it perform multi-speaker synthesis, or does it only improve the quality of the synthesized results?