Conversation
class LongformerSelfAttention(nn.Module):
-    def __init__(self, config, layer_id):
+    def __init__(self, config, layer_id, bias=True, attention_dim_scale=True):
The T5 attention module is slightly different from conventional ones: it has no bias terms, nor does it scale the attention scores by the attention head dimension before softmax. See this list for more details.
With the default options (bias=True, attention_dim_scale=True), this simply falls back to regular self-attention.
Please add your comment to the code.
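As a sketch of what the two new flags control (attn_score is a hypothetical scalar helper for illustration, not code from this PR):

```python
import math

def attn_score(q, k, bias_term=0.0, attention_dim_scale=True):
    # Hypothetical helper: attention score for one query/key pair.
    # Conventional attention scales by 1/sqrt(head_dim) before softmax;
    # T5-style attention skips both the bias and the scaling.
    score = sum(qi * ki for qi, ki in zip(q, k)) + bias_term
    if attention_dim_scale:
        score /= math.sqrt(len(q))  # len(q) stands in for head_dim
    return score

q = k = [1.0, 0.0, 1.0, 0.0]
conventional = attn_score(q, k)                         # 2 / sqrt(4) = 1.0
t5_style = attn_score(q, k, attention_dim_scale=False)  # 2.0, no scaling
```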
selected_attn_weights[selection_padding_mask_zeros[0], :, :, selection_padding_mask_zeros[1]] = -10000
# concat to attn_weights
-# (bsz, seq_len, num_heads, extra attention count + 2*window+1)
+# (bsz, seq_len, num_heads, max_num_extra_indices_per_batch + 2*window+1)
changed annotation to be consistent with related annotations below
self.attention_mode = config.attention_mode
self.autoregressive = config.autoregressive

if hasattr(config, "relative_attention_num_buckets") and layer_id == 0:
In T5, the position bias is shared across layers: the first layer computes it, then passes it on to the remaining layers.
Good catch. Please write this comment in the code for more readability.
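The sharing pattern can be sketched with toy classes (names here are illustrative; a string stands in for the real compute_bias call):

```python
class ToyAttentionLayer:
    # Only layer 0 owns the relative-attention-bias parameters; the bias it
    # computes is threaded through the layer loop so later layers reuse it.
    def __init__(self, layer_id):
        self.has_relative_attention_bias = (layer_id == 0)

    def forward(self, hidden, position_bias=None):
        if position_bias is None and self.has_relative_attention_bias:
            position_bias = "bias-from-layer-0"  # stand-in for compute_bias()
        return hidden, position_bias

layers = [ToyAttentionLayer(i) for i in range(4)]
hidden, position_bias = "x", None
for layer in layers:
    hidden, position_bias = layer.forward(hidden, position_bias)
```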
if output_attentions:
    outputs = outputs + (attn_weights,)
    if self.has_relative_attention_bias:
        outputs = outputs + (position_bias,)
This is equivalent to the old output form when self.has_relative_attention_bias=False.
return outputs
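A tiny sketch to confirm the equivalence (build_outputs is hypothetical and starts the tuple from attn_output only, abstracting away whatever else the real method packs in):

```python
def build_outputs(attn_output, attn_weights, position_bias,
                  output_attentions, has_relative_attention_bias):
    # mirrors the branch quoted above
    outputs = (attn_output,)
    if output_attentions:
        outputs = outputs + (attn_weights,)
        if has_relative_attention_bias:
            outputs = outputs + (position_bias,)
    return outputs

old_form = ("out", "weights")  # pre-PR result with output_attentions=True
new_form = build_outputs("out", "weights", "bias",
                         output_attentions=True, has_relative_attention_bias=False)
```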
def relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
I was considering moving this to longformer_encoder_decoder, but that would lead to a cyclic import, so it has to stay here.
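For readers unfamiliar with the function, the T5 bucketing scheme can be sketched for a single relative position in pure Python (a scalar sketch under the mesh-tf sign convention, not the vectorized code in this PR):

```python
import math

def bucket_one(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
    ret = 0
    n = -relative_position
    if bidirectional:
        num_buckets //= 2          # half the buckets for each direction
        if n < 0:
            ret += num_buckets
        n = abs(n)
    else:
        n = max(n, 0)
    max_exact = num_buckets // 2   # small offsets get one bucket each
    if n < max_exact:
        return ret + n
    # larger offsets share logarithmically sized bins up to max_distance
    val = max_exact + int(
        math.log(n / max_exact) / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return ret + min(val, num_buckets - 1)
```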
layer.layer[0].SelfAttention = LongformerSelfAttentionForT5(config, layer_id=i)
class LongformerT5Config(T5Config):
As you can see, we are getting many highly similar config classes as we extend to other transformer models. If you like, we can simplify this using a mixin: another mixin class would contain all the Longformer-specific settings, and LongformerT5Config would inherit from both the mixin class and T5Config.
I don't have strong feelings about this. You decide (as long as we don't change the interface of the released code)
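Sketched, the mixin idea could look like this (T5ConfigStub stands in for transformers' T5Config; attribute names follow the Longformer settings quoted in this PR):

```python
class T5ConfigStub:
    # stand-in for transformers' T5Config
    def __init__(self, **kwargs):
        self.dropout_rate = kwargs.get("dropout_rate", 0.1)

class LongformerConfigMixin:
    # all Longformer-specific settings live in one place
    def __init__(self, attention_window=512, attention_mode="sliding_chunks",
                 autoregressive=False, **kwargs):
        super().__init__(**kwargs)  # cooperative: forwards to the base config
        self.attention_window = attention_window
        self.attention_mode = attention_mode
        self.autoregressive = autoregressive

class LongformerT5ConfigSketch(LongformerConfigMixin, T5ConfigStub):
    pass

cfg = LongformerT5ConfigSketch(attention_window=1024)
```

The same mixin could then be reused for a Longformer-BART or Longformer-RoBERTa config without duplicating the fields.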
)
self.output = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
def forward(
An alternative I considered was to have this class inherit from LongformerSelfAttention, but eventually I decided not to. The interfaces of the two classes are quite different, so what we have here, i.e., making LongformerSelfAttention a member of LongformerSelfAttentionForT5, is probably less confusing than the alternative.
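The composition choice can be sketched with toy classes (these are stand-ins, not the PR's actual interfaces):

```python
class SlidingWindowAttention:
    # stand-in for LongformerSelfAttention's interface
    def forward(self, hidden_states, attention_mask=None):
        return hidden_states

class T5StyleWrapper:
    # Composition: the T5-facing class owns a sliding-window attention member
    # and adapts between the two interfaces (e.g. threading position_bias
    # in and out), rather than inheriting and overriding methods.
    def __init__(self):
        self.longformer_self_attn = SlidingWindowAttention()

    def forward(self, hidden_states, mask=None, position_bias=None):
        attn_output = self.longformer_self_attn.forward(hidden_states, attention_mask=mask)
        return (attn_output, position_bias)

out, bias = T5StyleWrapper().forward("h", position_bias="b")
```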
# (bsz, seq_len, num_heads, max_num_extra_indices_per_batch + 2*window+1)
attn_weights = torch.cat((selected_attn_weights, attn_weights), dim=-1)
if position_bias is None and self.has_relative_attention_bias:
Since the sliding window has already put the attention scores in the form [q_(i) * k_(i-w), q_(i) * k_(i-w+1), ..., q_(i) * k_(i), ..., q_(i) * k_(i+w)], the relative position is simply an arange.
please move this comment to the code.
nit: Maybe also move this block of code to a separate function
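Concretely, the window-relative positions are just an arange (pure-Python sketch):

```python
def window_relative_positions(window):
    # After the sliding-window rearrangement, query i's scores are laid out
    # against keys k_(i-w), ..., k_(i), ..., k_(i+w), so the relative
    # position along that axis is simply an arange from -w to w.
    return list(range(-window, window + 1))
```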
perm_global_position_bias = attn_weights.new_zeros(
    bsz, max_num_extra_indices_per_batch, seq_len, self.num_heads
)  # (bsz, max_num_extra_indices_per_batch, seq_len, num_heads)
if extra_attention_mask is not None:
The global position bias is a bit more complex. We first get the memory positions from extra_attention_mask_nonzeros, then compute the query positions using arange; their difference is the relative position. But this is "sparse": one vector for each global token in the batch. So we later put it back into the shape (bsz, max_num_extra_indices_per_batch, ...) using the index information from selection_padding_mask_nonzeros.
didn't review this part yet.
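The core of the computation described above, minus the padding bookkeeping, can be sketched in pure Python (variable names follow the comment; the sign convention is an assumption):

```python
def global_relative_positions(memory_positions, seq_len):
    # memory_positions: absolute positions of the global tokens, i.e. what
    # extra_attention_mask_nonzeros yields for one batch element.
    # Query positions are range(seq_len); the difference memory - query is
    # the relative position used to look up the bias bucket.
    return [[mem - q for q in range(seq_len)] for mem in memory_positions]

rel = global_relative_positions([0, 5], seq_len=8)
```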
ibeltagy left a comment
Looks great, thank you.
I left a few small comments. I didn't review the global attention part yet, will do later, maybe today.
    base_model_name_or_path="t5-small",
)
self._run_test(
    INPUT_TEXT="It begins with the Great Hungerer. It ends in utter darkeness.",
def test_outout(self):
    self._run_test(
        INPUT_TEXT="Hello world!",
        long_model_name_or_path="/net/nfs2.s2-research/haokunl/exp_files/model_artifacts/t5/longt5-small-4096",
It would be great if this test worked without the local model. One way to do so is to call create_long_model in the test to convert T5 to long, then test it. It will make the test slower but easier to run.
from longformer.diagonaled_mm_tvm import diagonaled_mm as diagonaled_mm_tvm, mask_invalid_locations
from longformer.sliding_chunks import sliding_chunks_matmul_qk, sliding_chunks_matmul_pv
from longformer.sliding_chunks import sliding_chunks_no_overlap_matmul_qk, sliding_chunks_no_overlap_matmul_pv
from longformer.sliding_chunks import (
It is fine that your dev env changed the file formatting. I know it doesn't change the code, but I will feel more comfortable if you run a small test to make sure the new code produces the same output as the previous one for Longformer.
# in T5 attention_probs_dropout_prob is dropout_rate
config.attention_probs_dropout_prob = config.dropout_rate
config.attention_window = [attention_window] * config.num_hidden_layers
config.attention_dilation = [1] * config.num_hidden_layers
When increasing the model length, we probably want to increase the number of relative position buckets as well (config.relative_attention_num_buckets).
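Folding that suggestion into the conversion might look like this (sketched on a SimpleNamespace config; the proportional scale factor is an assumption, not something the PR currently does):

```python
from types import SimpleNamespace

def to_long_config(config, attention_window, position_scale_factor):
    # mirrors the conversion quoted above, plus scaling the number of
    # relative position buckets along with the maximum length (assumption)
    config.attention_probs_dropout_prob = config.dropout_rate
    config.attention_window = [attention_window] * config.num_hidden_layers
    config.attention_dilation = [1] * config.num_hidden_layers
    config.relative_attention_num_buckets *= position_scale_factor
    return config

cfg = to_long_config(
    SimpleNamespace(dropout_rate=0.1, num_hidden_layers=6, relative_attention_num_buckets=32),
    attention_window=512,
    position_scale_factor=8,  # e.g. 512 -> 4096 tokens
)
```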
Based on @AkshitaB's work (#149), this PR extends Longformer to T5. It also adds a test to check that the Longformer T5 produces the same output as the standard T5 on short input texts, as suggested by @ibeltagy in this comment.
A quick note about code style: I'm not sure if this repo has selected a formatter previously, and I didn't find a dev-requirements.txt. So I continued to use the black formatter with my default settings; it automatically re-formats the file whenever I save it. You may notice changes like ' -> ", or a long line broken into multiple lines. I hope it doesn't bother you too much.