I have read your paper and was impressed by it, and I'm very glad to see your source code.
While reading the source code, I noticed that the computation of TiLoss differs a bit from the description in the paper, and there are some designs in the paper that I do not understand. I would appreciate it if you could help me with these questions.
```python
def forward(self, out_val, out_exp, target):
    # err_out_exp=0
    # integer cross entropy loss
    s = out_val.type(torch.int64)
    if out_exp > -7:
        # if out_exp is big enough
        # change the base in log softmax from e to 2
        # to approximate the integer loss
        s = s*47274//(2**15)
        if out_exp >= 0:
            s = s*2**out_exp
        else:
            s = s//(2**-out_exp)
        out_max, _ = torch.max(s, dim=1)
        offset = out_max-10
        s = s-offset.view(-1, 1)
        s = torch.max(s, Int8Tensor(0).type(torch.int64))
        out_grad = 2**s-1
    else:
        # if out_exp is too small, s will be all 0
        # use another approximation: e^x = 1 + x + 0.5*x^2 + o(x^2)
        out_grad = 2**(1-2*out_exp.type(torch.int64)) + \
            s*2**(1-out_exp.type(torch.int64)) + s*s
    out_sum = out_grad.sum(1, dtype=torch.int64)
    out_grad = out_grad*(2**11)//out_sum.view(-1, 1)
    out_grad[torch.arange(out_val.size(0)), target] -= out_grad.sum(1, dtype=torch.int64)
    self.out_grad = StoShiftInt32(out_grad.type(torch.int32), 4)
    # return self.out_grad, err_out_exp
    return self.out_grad
```
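For concreteness, here is how I traced the `out_exp > -7` branch against a plain floating-point softmax. This is my own simplified re-run of the code above, not your code: the input values, shapes, and seed are made up, I ignore stochastic rounding, and I use a plain right shift in place of `StoShiftInt32`.

```python
import torch

torch.manual_seed(0)
out_val = torch.randint(-64, 64, (2, 10))            # int8-range logits
out_exp = torch.tensor(-3)                           # shared exponent: real logit = out_val * 2**out_exp
target = torch.tensor([3, 7])

# integer path, following the out_exp > -7 branch above
s = out_val.to(torch.int64) * 47274 // (2**15)       # 47274/2**15 ~= log2(e): switch e^x to 2^x
s = s // (2**-out_exp)                               # apply the (negative) shared exponent
s = s - (s.max(dim=1).values - 10).view(-1, 1)       # offset so the largest exponent is 10
s = torch.clamp(s, min=0)
grad = 2**s - 1
grad = grad * (2**11) // grad.sum(1, keepdim=True)   # ~= 2**11 * softmax(x)
grad[torch.arange(2), target] -= grad.sum(1)         # ~= 2**11 * (softmax(x) - one_hot(y))
int_grad = grad // 2**4                              # plain shift instead of StoShiftInt32(..., 4)

# floating-point reference: d(cross entropy)/d(logits) = softmax(x) - one_hot(y)
x = out_val.float() * 2.0**out_exp.item()
ref = torch.softmax(x, dim=1)
ref[torch.arange(2), target] -= 1.0

print(int_grad)                                      # roughly 2**7 * (softmax(x) - one_hot(y))
print((ref * 2**7).round().to(torch.int64))
```

The two printed tensors roughly agree in my runs, which is how I arrived at the closed forms I write below.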
I have two questions:
- The implementation details:
  - Variable `out_grad` is supposed to represent $e_i$ in the paper. But in your source code, for the condition `out_exp <= -7`, `out_grad` is $2^{1-2S_a} \cdot e_i$ (see the expansion I wrote out after this list). What is $2^{1-2S_a}$ supposed to mean?
  - Why do you multiply `out_grad` by $2^{11}$ and shift it by 4 bits at the end?
  - The equations do not match the code. For the condition `out_exp <= -7`, the code can be seen as $\frac{\frac{2^{1-2S_a} e_i 2^{11}}{C} - y_i \cdot C}{2^4}$. For the condition `out_exp > -7`, the code can be seen as $\frac{\frac{e_i 2^{11}}{C} - y_i \cdot C}{2^4}$. These differ from equation 2 and equation 3 in the paper, respectively. Can you explain this?
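To make the first point concrete, this is the algebra I worked out from the code for the `out_exp <= -7` branch (my own reading; I write $S_a$ for `out_exp` and $x_i = s_i \cdot 2^{S_a}$ for the real-valued logit, following the paper's notation):

$$
e^{x_i} \approx 1 + x_i + \tfrac{1}{2}x_i^2
\quad\Rightarrow\quad
2^{1-2S_a}\, e^{x_i} \approx 2^{1-2S_a} + s_i \cdot 2^{1-S_a} + s_i^2,
$$

which is exactly the expression `out_grad = 2**(1-2*out_exp) + s*2**(1-out_exp) + s*s` in the code. It looks to me like the factor $2^{1-2S_a}$ cancels when dividing by the row sum $C$, but then I do not see why it is introduced at all.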
- The paper details:
  - In section 3.2, there is a description of $s_w$: "Recall that the value of $s_w$ for each layer is set during initialization and remain unchanged during training." How are the initialization values of $s_w$ computed? Does this use prior knowledge?
  - I am confused by a sentence in section 3.4: "The error tensor e in (1) is computed using these effectively 12-bit values and eventually rounded stochastically back to 8 bits before being used in back propagation." Can you give a more specific description of this procedure? It seems related to the second point in question 1 (see the sketch after this list), but I don't understand the relation between them.
  - I do not understand the computation design of $\hat{x}$ when `out_exp` is greater than -7 in section 3.4; the reason is not given in the paper. Moreover, in the computation of p, there seems to be a magic number 10.
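To pin down what I mean by the relation in the second point, this is the kind of stochastic right shift I have in mind when reading "rounded stochastically back to 8 bits" together with the 4-bit shift in the code. It is a generic sketch of my own and not necessarily what `StoShiftInt32` does:

```python
import torch

def stochastic_right_shift(x: torch.Tensor, k: int) -> torch.Tensor:
    """Right-shift an integer tensor x by k bits, rounding up or down at random
    with probability proportional to the discarded low-order bits."""
    low = x & ((1 << k) - 1)           # the k bits that will be discarded
    base = x >> k                      # floor(x / 2**k), also correct for negative x
    prob = low.float() / (1 << k)      # chance of rounding up
    round_up = (torch.rand_like(prob) < prob).to(x.dtype)
    return base + round_up
```

With this reading, a value at the $2^{11}$ scale shifted right by 4 bits lands back at roughly the $2^{7}$ scale of an 8-bit tensor, which is why I suspect the two points are connected; but I am not sure this is the intended procedure.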
Thanks again for your time and effort!