Description
In the paper, the weights are the solution to equation (8), which minimizes the squared Frobenius norms of the weighted RFF covariance matrices for each pair of features, subject to the constraint that the weights form a probability distribution.
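For concreteness, here is my paraphrase of that problem (the notation is mine, not copied from the paper):

```latex
\hat{w} \;=\; \operatorname*{arg\,min}_{w \in \Delta_n}
  \sum_{i \neq j} \bigl\| \widehat{\Sigma}^{\,w}_{ij} \bigr\|_F^2,
\qquad
\Delta_n = \Bigl\{ w \in \mathbb{R}^n : w_k \ge 0,\ \textstyle\sum_{k=1}^n w_k = 1 \Bigr\},
```

where \(\widehat{\Sigma}^{\,w}_{ij}\) is the \(w\)-weighted cross-covariance between the RFF maps of features \(i\) and \(j\).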
In the code, the `weight_learner` function appears to solve this problem by running gradient descent on a modified objective that combines the squared Frobenius norms of the weighted RFF covariance matrices with an Lp norm of the weight vector. What is the purpose of the Lp norm on the weight vector, given that the weights are already obtained by applying softmax to logits and are therefore a probability vector?
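To make sure I'm reading the code correctly, here is a minimal sketch of the objective as I understand it, with toy stand-ins for the RFF features; the variable names, the exponent `p`, and the coefficient `lam` are my own guesses, not taken from the repo:

```python
import torch

torch.manual_seed(0)

# Toy stand-ins: in the repo these would be the RFF-mapped network features.
n, d = 128, 6                       # samples, feature dimensions
rff = torch.randn(n, d)

# Learnable logits; the sample weights are their softmax,
# so they are already a probability vector.
logits = torch.zeros(n, requires_grad=True)
optimizer = torch.optim.SGD([logits], lr=0.1)
p = 2                               # exponent of the extra Lp penalty (my guess)
lam = 1.0                           # trade-off coefficient (my guess)

for step in range(100):
    w = torch.softmax(logits, dim=0)             # w >= 0, w.sum() == 1

    # Weighted mean-centering of the features.
    mean = (w.unsqueeze(1) * rff).sum(dim=0, keepdim=True)
    centered = rff - mean

    # Weighted covariance matrix; its off-diagonal entries are the pairwise
    # cross-covariances whose squared Frobenius norm eq. (8) minimizes.
    cov = centered.t() @ (w.unsqueeze(1) * centered)
    off_diag = cov - torch.diag(torch.diag(cov))
    balance_loss = (off_diag ** 2).sum()

    # The term I'm asking about: an Lp norm of the (already normalized) weights.
    lp_penalty = torch.norm(w, p=p)

    loss = balance_loss + lam * lp_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

If this matches the intent of `weight_learner`, my question is specifically about the `lp_penalty` term.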
Does this somehow ensure that the logits don't go off to infinity? If that is the aim, why not directly regularize by the size of the logits?