
Questions about the computation of TiLoss #3

@Richardhongyu

Description


I have read your paper and was impressed by it, and I'm very glad to see the source code released.

While reading the source code, I noticed that the computation of TiLoss differs slightly from the description in the paper, and there are some design choices I do not understand. I would appreciate it if you could help me with the following questions.

    def forward(self, out_val, out_exp, target):
        # err_out_exp=0
        # integer cross entropy loss
        s=out_val.type(torch.int64)
        if out_exp >-7:
            # if out_exp is big enough
            # change the base in log softmax from e to 2
            # to approx integer loss
            s=s*47274//(2**15)
            if out_exp>=0:
                s=s*2**out_exp
            else:
                s=s//(2**-out_exp)

            out_max, _ = torch.max(s,dim=1)
            offset = out_max-10
            s=s-offset.view(-1,1)
            s=torch.max(s,Int8Tensor(0).type(torch.int64))
            out_grad = 2**s-1
        else:
            # if out_exp is too small s will be all 0
            # use another approximation: e^x = 1 + x + 0.5 x^2 + o(x^2)
            out_grad = 2**(1-2*out_exp.type(torch.int64)) + \
                s*2**(1-out_exp.type(torch.int64)) + s*s

        out_sum = out_grad.sum(1,dtype=torch.int64)

        out_grad = out_grad*(2**11)//out_sum.view(-1,1)
        out_grad[torch.arange(out_val.size(0)), target] -= out_grad.sum(1,dtype=torch.int64)
        self.out_grad = StoShiftInt32(out_grad.type(torch.int32),4)

        # return self.out_grad, err_out_exp
        return self.out_grad
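
For context, my reading of the fixed-point constant in the out_exp > -7 branch is that 47274 / 2**15 is a Q15 encoding of $\log_2 e$, which would fit the comment about changing the log-softmax base from e to 2. That is only my assumption from the numbers, but it is easy to check:

    import math

    print(47274 / 2**15)       # 1.44268798828125
    print(math.log2(math.e))   # 1.4426950408889634

    # so s = s*47274//(2**15) looks like an integer approximation of s * log2(e),
    # i.e. rescaling the logits so that 2**s approximates e**(original logit)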

I have two questions:

  1. The implementation details:
  • Variable out_grad is supposed to represent $e_i$ in the paper, but in your source code, for the out_exp <= -7 branch, out_grad is $2^{1-2S_a} \cdot e_i$ (see my derivation after this list). What is the factor $2^{1-2S_a}$ supposed to mean?

  • Why do you multiply out_grad by $2^{11}$ and then shift it right by 4 bits at the end?

  • The equations do not match the code:

    For the out_exp <= -7 branch, the code can be read as $\frac{\frac{2^{1-2S_a} e_i \cdot 2^{11}}{C} - y_i \cdot C}{2^4}$.

    For the out_exp > -7 branch, the code can be read as $\frac{\frac{e_i \cdot 2^{11}}{C} - y_i \cdot C}{2^4}$.

    These differ from equation 2 and equation 3 in the paper, respectively. Can you explain this?
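
For reference, here is the algebra behind my claim that the out_exp <= -7 branch produces $2^{1-2S_a} \cdot e_i$ rather than $e_i$. This is only my own derivation from the code, so please correct me if I am misreading it. Writing the dequantized activation as $x_i = s_i \cdot 2^{S_a}$ and using the second-order expansion $e^{x_i} \approx 1 + x_i + \tfrac{1}{2} x_i^2$, scaling everything by $2^{1-2S_a}$ gives

$$2^{1-2S_a} \left(1 + s_i \cdot 2^{S_a} + \tfrac{1}{2} s_i^2 \cdot 2^{2S_a}\right) = 2^{1-2S_a} + s_i \cdot 2^{1-S_a} + s_i^2,$$

which matches out_grad = 2**(1-2*out_exp) + s*2**(1-out_exp) + s*s in the code term by term. That is why I read the whole branch as computing $2^{1-2S_a} \cdot e_i$.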

  2. The paper details:
  • In section 3.2, there is a description of $s_w$: "Recall that the value of $s_w$ for each layer is set during initialization and remain unchanged during training." How are the initialization values of $s_w$ computed? Do they rely on prior knowledge?
  • I am confused by a sentence in section 3.4 of the paper: "The error tensor e in (1) is computed using these effectively 12-bit values and eventually rounded stochastically back to 8 bits before being used in back propagation." Can you give a more specific description of this procedure? It seems related to the second point in question 1, but I don't understand the relation between them.
  • I do not understand the design of the computation of $\hat{x}$ when out_exp is greater than -7 in section 3.4; the reasoning is not given in the paper. Moreover, the computation of p seems to involve a magic number, 10 (see the sketch after this list for what I think it does).
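
For the last point, here is a minimal sketch of what I think the offset computation and the constant 10 are doing in the out_exp > -7 branch. This is only my reading of the code, not something stated in the paper; the logits below are made up for illustration, and I use torch.zeros_like as a stand-in for the repo's Int8Tensor(0):

    import torch

    # made-up base-2 integer logits, i.e. s after the 47274//(2**15) and exponent scaling steps
    s = torch.tensor([[3, -5, 12, 7]], dtype=torch.int64)

    out_max, _ = torch.max(s, dim=1)
    offset = out_max - 10                   # shift so the largest logit becomes exactly 10
    s = s - offset.view(-1, 1)
    s = torch.max(s, torch.zeros_like(s))   # clamp logits more than 10 below the max to 0

    out_grad = 2**s - 1                     # bounded by 2**10 - 1 = 1023
    print(s)         # tensor([[ 1,  0, 10,  5]])
    print(out_grad)  # tensor([[   1,    0, 1023,   31]])

So my current guess is that 10 caps the shifted logits so that $2^{s} - 1$ fits in about 10 bits before the $2^{11}$ normalization, and that this is somehow tied to the "effectively 12-bit" statement in section 3.4. Is that the intended reading?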

Thanks again for your time and effort!
