
Questions about the computation of TiLoss #3

@Richardhongyu

Description


I have read your paper and was impressed by it, and I'm very glad to see the source code released.

While reading the source code, I noticed that the computation of TiLoss differs slightly from the description in the paper, and there are some design choices I do not understand. I would appreciate it if you could help me with the following questions.

    def forward(self, out_val, out_exp, target):
        # err_out_exp=0
        # integer cross entropy loss
        s=out_val.type(torch.int64)
        if out_exp >-7:
            # if out_exp is big enough
            # change the base in log softmax from e to 2
            # to approx integer loss
            s=s*47274//(2**15)
            if out_exp>=0:
                s=s*2**out_exp
            else:
                s=s//(2**-out_exp)

            out_max, _ = torch.max(s,dim=1)
            offset = out_max-10
            s=s-offset.view(-1,1)
            s=torch.max(s,Int8Tensor(0).type(torch.int64))
            out_grad = 2**s-1
        else:
            # if out_exp is too small s will be all 0
            # use another approximation: e^x = 1 + x + 0.5 x^2 + o(x^2)
            out_grad = 2**(1-2*out_exp.type(torch.int64)) + \
                s*2**(1-out_exp.type(torch.int64)) + s*s

        out_sum = out_grad.sum(1,dtype=torch.int64)

        out_grad = out_grad*(2**11)//out_sum.view(-1,1)
        out_grad[torch.arange(out_val.size(0)), target] -= out_grad.sum(1,dtype=torch.int64)
        self.out_grad = StoShiftInt32(out_grad.type(torch.int32),4)

        # return self.out_grad, err_out_exp
        return self.out_grad
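
For context, my reading of the fixed-point constant in the out_exp > -7 branch is that 47274 / 2**15 is a Q15 encoding of $\log_2 e$, which would fit the comment about changing the log-softmax base from e to 2. That is only my assumption from the numbers, but it is easy to check:

    import math

    print(47274 / 2**15)       # 1.44268798828125
    print(math.log2(math.e))   # 1.4426950408889634

    # so s = s*47274//(2**15) looks like an integer approximation of s * log2(e),
    # i.e. rescaling the logits so that 2**s approximates e**(original logit)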

I have two questions:

  1. The implementation details:
  • Variable out_grad is supposed to represent $e_i$ in the paper, but in your source code, for the out_exp <= -7 branch, out_grad is $2^{1-2S_a} \cdot e_i$ (see my derivation after this list). What is the factor $2^{1-2S_a}$ supposed to mean?

  • Why do you multiply out_grad by $2^{11}$ and then shift it right by 4 bits at the end?

  • The equations do not match the code:

    For the out_exp <= -7 branch, the code can be read as $\frac{\frac{2^{1-2S_a} e_i \cdot 2^{11}}{C} - y_i \cdot C}{2^4}$.

    For the out_exp > -7 branch, the code can be read as $\frac{\frac{e_i \cdot 2^{11}}{C} - y_i \cdot C}{2^4}$.

    These differ from equation 2 and equation 3 in the paper, respectively. Can you explain this?
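
For reference, here is the algebra behind my claim that the out_exp <= -7 branch produces $2^{1-2S_a} \cdot e_i$ rather than $e_i$. This is only my own derivation from the code, so please correct me if I am misreading it. Writing the dequantized activation as $x_i = s_i \cdot 2^{S_a}$ and using the second-order expansion $e^{x_i} \approx 1 + x_i + \tfrac{1}{2} x_i^2$, scaling everything by $2^{1-2S_a}$ gives

$$2^{1-2S_a} \left(1 + s_i \cdot 2^{S_a} + \tfrac{1}{2} s_i^2 \cdot 2^{2S_a}\right) = 2^{1-2S_a} + s_i \cdot 2^{1-S_a} + s_i^2,$$

which matches out_grad = 2**(1-2*out_exp) + s*2**(1-out_exp) + s*s in the code term by term. That is why I read the whole branch as computing $2^{1-2S_a} \cdot e_i$.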

  2. The paper details:
  • In section 3.2, there is a description of $s_w$: "Recall that the value of $s_w$ for each layer is set during initialization and remain unchanged during training." How are the initialization values of $s_w$ computed? Do they rely on prior knowledge?
  • I am confused by a sentence in section 3.4 of the paper: "The error tensor e in (1) is computed using these effectively 12-bit values and eventually rounded stochastically back to 8 bits before being used in back propagation." Can you give a more specific description of this procedure? It seems related to the second point in question 1, but I don't understand the relation between them.
  • I do not understand the design of the computation of $\hat{x}$ when out_exp is greater than -7 in section 3.4; the reasoning is not given in the paper. Moreover, the computation of p seems to involve a magic number, 10 (see the sketch after this list for what I think it does).
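
For the last point, here is a minimal sketch of what I think the offset computation and the constant 10 are doing in the out_exp > -7 branch. This is only my reading of the code, not something stated in the paper; the logits below are made up for illustration, and I use torch.zeros_like as a stand-in for the repo's Int8Tensor(0):

    import torch

    # made-up base-2 integer logits, i.e. s after the 47274//(2**15) and exponent scaling steps
    s = torch.tensor([[3, -5, 12, 7]], dtype=torch.int64)

    out_max, _ = torch.max(s, dim=1)
    offset = out_max - 10                   # shift so the largest logit becomes exactly 10
    s = s - offset.view(-1, 1)
    s = torch.max(s, torch.zeros_like(s))   # clamp logits more than 10 below the max to 0

    out_grad = 2**s - 1                     # bounded by 2**10 - 1 = 1023
    print(s)         # tensor([[ 1,  0, 10,  5]])
    print(out_grad)  # tensor([[   1,    0, 1023,   31]])

So my current guess is that 10 caps the shifted logits so that $2^{s} - 1$ fits in about 10 bits before the $2^{11}$ normalization, and that this is somehow tied to the "effectively 12-bit" statement in section 3.4. Is that the intended reading?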

Thanks again for your time and effort!
