Sequence score includes scores for lang id token #966

@Enkidu93

In machine.py, we do something like this, so that only scores for non-special tokens are bubbled up:

        for item in zipped:
            output_ids, scores, sequence_score, attentions = cast(
                Tuple[torch.Tensor, torch.Tensor, Optional[float], Optional[torch.Tensor]], item
            )
            output_tokens: List[str] = []
            output_indices: List[int] = []
            # Keep only the positions whose token is not a special token.
            for i, output_id in enumerate(output_ids):
                id = cast(int, output_id.item())
                if id not in all_special_ids:
                    output_tokens.append(self.tokenizer.convert_ids_to_tokens(id))
                    output_indices.append(i)

            # Drop the per-token scores of special tokens; sequence_score is left as-is.
            scores = scores[output_indices]

In silnlp, we do something similar downstream in hugging_face_config.py:translate().

However, we take sequence_scores directly from the model outputs, and these sequence scores seem to include the score of the BOS token, which is close to 0. If the sequence score is length-normalized, this would presumably bias it slightly toward zero, and more so for shorter output sequences, where a single near-zero term carries more weight.
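To make the suspected bias concrete, here is a minimal sketch with made-up numbers. It assumes the sequence score is the mean per-token log-probability; the values and the `mean_log_prob` helper are hypothetical and not taken from machine.py or silnlp.

```python
from typing import List


def mean_log_prob(token_log_probs: List[float]) -> float:
    """Length-normalized score: the average per-token log-probability (assumed form)."""
    return sum(token_log_probs) / len(token_log_probs)


# Hypothetical values: every real token scores -0.5, the special token scores ~0.
SPECIAL_LOG_PROB = -0.001
short_real = [-0.5] * 4
long_real = [-0.5] * 40

# How far does one near-zero special token pull each score toward zero?
short_shift = mean_log_prob([SPECIAL_LOG_PROB] + short_real) - mean_log_prob(short_real)
long_shift = mean_log_prob([SPECIAL_LOG_PROB] + long_real) - mean_log_prob(long_real)

print(f"shift for a 4-token output:  {short_shift:.4f}")
print(f"shift for a 40-token output: {long_shift:.4f}")
```

Under these assumptions both shifts are positive, and the shift for the short output is the larger one, which is the length-dependent bias described above.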

We should confirm that these special-token scores are included in the sequence score and, if they are, update silnlp accordingly (or maybe even consider submitting an issue in transformers).
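One way to check this could be to recompute the score from the per-token log-probabilities with and without the special tokens, and see which result matches the reported sequence score. The sketch below is only an illustration: `rescore` is a hypothetical helper, the token IDs and log-probabilities are invented, and the normalization (sum of log-probabilities divided by length raised to `length_penalty`) is an assumption about how the score is computed.

```python
from typing import List, Set


def rescore(
    output_ids: List[int],
    token_log_probs: List[float],
    special_ids: Set[int],
    length_penalty: float = 1.0,
) -> float:
    """Recompute a length-normalized score, skipping tokens in special_ids.

    Assumes token_log_probs[i] is the log-probability of output_ids[i].
    """
    kept = [lp for tid, lp in zip(output_ids, token_log_probs) if tid not in special_ids]
    if not kept:
        return 0.0
    return sum(kept) / (len(kept) ** length_penalty)


# Invented example: IDs 0 and 2 stand in for special tokens.
output_ids = [0, 15, 27, 33, 2]
token_log_probs = [-0.001, -0.4, -0.6, -0.5, -0.1]

with_special = rescore(output_ids, token_log_probs, special_ids=set())
without_special = rescore(output_ids, token_log_probs, special_ids={0, 2})

print(f"including special tokens: {with_special:.4f}")
print(f"excluding special tokens: {without_special:.4f}")
```

If the reported sequence score agrees with the first value rather than the second, that would support the suspicion that special tokens are being counted.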
