Transformer

Transformer?

'Attention is all you need'에서 나온 모델
기존의 seq2seq 구조인 인코더-디코더를 따르면서도 attention만으로 구현한 모델
기존의 RNN, CNN 구조의 한계를 탈피하기 위함
- CNN
  - m*m의 필터를 슬라이딩
  - local적인 것은 잘 반영하지만 맨 끝쪽에 대한 의미를 파악하기가 어려움
  - ht = f(xt, xt-k) + ... + f(xt, xt) + ... + f(xt, xt+k)
- RNN
  - 연쇄적인 연산
  - ht = f(xt, f(xt-1, ... f(x3, f(x2, f(x1)))))
  - 이전의 문장의 특징을 반영할 수 있음
  - 문장이 길어질 수록 끝에가 반영되기 어려움
    - momentum이 생겨서 dependency때문에 끝을 반영할 수 없음
- RN(Relation Network)
  - ht = f(xt, x1) + ... + f(xt, xt-1) + f(xt, xt+1) + ... + f(xt, xT)
  - xt를 제외하고 나머지를 전체 연산
  - 길이가 길어진다고 양 끝을 다 잘 반영할 수 있지만 연산을 굉장히 많이 해야 함

구조

인풋(번역할 언어) / 아웃풋(번역된 언어) 라벨을 주면 그것을 바탕으로 output probabilities를 내놓는다
인코더를 거쳐서 특정 값이 나오게 되면 디코더에서 right shifted된 라벨을 통과해 함꼐된다
- right shifted : soyoung is working at holiday (각 단어는 int)

1) encoder

input

(soyoung is working at holiday) 토큰들

input embedding

임베딩 벡터값을 취해서 넣어준다
토큰별로 끊어서 전체 사이즈는 모델사이즈가 되도록 한다
모델사이즈보다 작으면 토큰을 주게 된다 그 뒤에는 빈 토큰 (모델 사이즈 맞춰주기 위해)

position encoding

기존의 RNN/CNN이 가진 장점 중 하나 (위치정보를 저장)
토큰의 포지션 별로 특정 값을 산출. 그것을 임베딩 된 인풋/아웃풋에 더해준다

multi-head attention

self attention 기법이 사용된다
총 3개가 사용(multi-head attention)
attention의 종류
- 인풋을 어디서 받는지에 따라 source-target attention / self attention으로 나뉜다
  - source target attention
    - query, key, value의 인풋이 있을 때 쿼리는 타겟에서 받고 키, 밸류는 소스에서 받음
    - 쿼리를 다른 특정한 곳에서 받아옴
    - Q는 특정 벡터
    - K, V는 동일한 벡터
  - self attention
    - 쿼리, 키, 밸류 모두 소스에서 받음
    - 특정 벡터가 있을 때 이것을 키에도 주고 밸류에도 주고 쿼리에도 주는 것
    - 쿼리 키 밸류 모두 동일한 벡터
  - 쿼리 키 밸류로 나누는 이유?
    - 키 밸류라고 하는 특별한 역할을 나눈 다음에 하나는 키로 보내고 하나는 밸류로 보내고
    - 하나는 어탠션 밸류를 뽑아내는데 쓰이고 하나는 히든레이어를 대표하는데만 쓰게 하자 -> 결과 좋았다
    - 이것이 어텐션 네트워크. 어텐션쓰면 키 밸류 나누는 것이 보편화
- 연산 방식에 따라
  - additive attention
    - 쿼리 키를 FFN에 통과. 이것을 밸류에 통과
    - FFN(쿼리;키) * 밸류
      - FFN(쿼리;키) : attention weight
    - attention weight가 한번에 나오는 것이 아니고, 밸류까지 계산한 후에야 어텐션 웨이트가 산출
    - attention weight 나오고 -> value 곱하고 -> attention weight 나온다 (순차적)
  - dot-product attention
    - a(쿼리*(키)T)*밸류
      - a : attention weight
    - 한번에 매트리스 곱을 통해 1차적으로 attention weight가 나오게 된다
    - transformer에서 사용할 어텐션은 dot-product attention
- multi-head attention
  - 쿼리 키 밸류를 나눈 것을 linear을 통과
  - scale dot-product attention
    - scaling 값을 곱해준 것
    - softmax를 취한 다음에 v를 곱한 것
    - 하는 이유
      - Q, K가 dot-product하면 값 너무 커지니까 sqrt(dim)을 해서 작게한다
        
        Q*K = N(0, dim)이 된다
  - 나온 값들을 concat하게 되고 마지막에 linear함수를 통해 값을 내보낸다

add & norm

feed forward networks

일반적인 dense한 것을 사용

add & norm

2) decoder

output(right shifted)

output embedding

output embedding 은 한 칸을 꼭 띄워준다

positional encoding

masked multi-head attention

masked : output probability를 뽑아내기 위해서 현재의 토큰들과 이전의 토큰들이 있다고 할 때 그 다음 토큰들을 마스킹하는 과정
앞의 단어들에 대해서만 attention

add & norms

multi-head attention

encoder의 output과 함께 사용

add & norms

feed forward networks

feed forward networks : position wise (1D conv)
0보다 작으면 0 내보내고 그보다 크면 weight를 내보낸다 (ReLU와 매우 비슷)

add & norm

3) output

linear

linear mapping
이전까지는 아웃풋 임베딩 된 값들이 다 모델 사이즈 (512차원같이) 로 나왔는데
linear을 들어가면서 전체 워드 벡터의 각각의 예측이 필요하기 때문에 모델 사이즈(512)와 워드 사이즈(만개?)를 맞춰주는 과정이 필요
일반 fully connect layer
- 정보 압축
- 차원 축소 -> 속도 향상

softmax

단어 길이 벡터만큼의 softmax를 뽑아냄

output probabilities

각 단어는 s.m 확률값으로
0과 1사이의 값을 내보낸다
레이블과 비교하는데 output-label의 값을 줄이는 것이 트레이닝의 목표

encoder과 decoder는 여러개를 사용해서 다중으로 연산한다

시퀀스 모델 : 트레이닝과 테스트 방법이 차이가 있다
- 트레이닝
  - 스타트 토큰을 넣어서 디코더 통해서 아웃풋1이 나오고 이전에 준비한 라벨1이 디코더로 들어가서 아웃풋2가 나오게 된다
  - 라벨2가 들어가서 아웃풋3이 나온다
  - 아웃풋1과 라벨1, 아웃풋2와 라벨2의 차이를 줄이는 것
- 테스팅
  - 스타트 토큰이 들어가서 아웃풋이 나오면 그 아웃풋을 그 다음의 디코더의 인풋으로 넣게 된다

4) residual connection

ResNet에 적용된 기법
원래의 x와 layer를 거친 f(x)를 더함
gradient vanishing 문제 해결 (레이어를 더 깊게 쌓을 수 있게 함)

Transformer

Transformer?

구조

1) encoder

input

input embedding

position encoding

multi-head attention

add & norm

feed forward networks

add & norm

2) decoder

output(right shifted)

output embedding

positional encoding

masked multi-head attention

add & norms

multi-head attention

add & norms

feed forward networks

add & norm

3) output

linear

softmax

output probabilities

4) residual connection

reference

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!