Mismatch between the CLS token position in the indices array and the inverse mask array #4

@nesemenpolkov

Dear Ivan,

All the work you have done is great. However, while using your IMDBDataset code I found a few strange things. First, the array of indices and the array of inverse token mask values do not line up from the beginning of the sequence because of the CLS token (see the reference below). Second, I was surprised to find a CLS token in the second sentence as well; I may be wrong, but the original BERT uses only one CLS token, at the beginning of the sequence. Finally, calculating the length of the vocabulary each time could be very expensive. I hope these notes help you make your code better and clearer. I would be glad to take part and help fix it if you give me the opportunity.

I am looking forward to your reply,
Nesemenpolkov.

Reference:

def _create_item(self, first: typing.List[str], second: typing.List[str], target: int = 1):
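
To make the alignment point concrete, here is a minimal sketch of one way to build an item so that the indices and the inverse token mask always share the same positions. It is not the repository's implementation: `build_item`, `stoi`, `mask_prob` and `rng` are hypothetical names, and it assumes the inverse mask is True wherever a token was left unmasked. The idea is to assemble the full sequence first, with a single [CLS] at position 0, and only then derive both arrays in the same loop.

```python
import random
import typing

CLS, SEP, MASK, UNK = "[CLS]", "[SEP]", "[MASK]", "[UNK]"


def build_item(
    first: typing.List[str],
    second: typing.List[str],
    stoi: typing.Dict[str, int],  # hypothetical token -> index mapping
    mask_prob: float = 0.15,
    rng: typing.Optional[random.Random] = None,
) -> typing.Tuple[typing.List[int], typing.List[bool]]:
    """Return (indices, inverse_mask) of equal length for one sentence pair."""
    rng = rng or random.Random()
    # Assemble the whole sequence before masking: one [CLS], at position 0 only.
    tokens = [CLS] + first + [SEP] + second + [SEP]
    special = {CLS, SEP}

    indices: typing.List[int] = []
    inverse_mask: typing.List[bool] = []
    for tok in tokens:
        # Special tokens are never masked, so they cannot shift the alignment.
        masked = tok not in special and rng.random() < mask_prob
        indices.append(stoi[MASK] if masked else stoi.get(tok, stoi[UNK]))
        # Assumed convention: True means the token was left untouched.
        inverse_mask.append(not masked)
    return indices, inverse_mask
```

Because both lists are appended to in the same iteration, position i in the mask always describes position i in the indices, including the leading [CLS]. If the vocabulary size is needed (for example for random token replacement), it could be computed once in the constructor and stored instead of being recalculated for every item.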
