2023-Q4-AI 14. ViT Vision Transformers


14.1. Video / Materials

Video: https://youtube.com/live/BYz6Uc-twSw?feature=share

Jamboard: https://jamboard.google.com/d/1GXHTFpnrXIVzBnAISrZBoM-r5IUx58PN70BYvK6dYbI/edit?usp=sharing

Materials:

  1. https://viso.ai/deep-learning/vision-transformer-vit/

  2. https://jacobgil.github.io/deeplearning/vision-transformer-explainability

 

Stream key: 6rr7-8zfd-g776-8c7q-fsu4

Finished code download: http://share.yellowrobot.xyz/quick/2023-12-26-6D6B4B78-1645-4CD7-BBF2-6A2B3D199AC2.zip

14.2. Implement ViT

Follow the video instructions and implement ViT ("An Image is Worth 16x16 Words") https://openreview.net/forum?id=YicbFdNTTy

Template: http://share.yellowrobot.xyz/quick/2023-4-17-5C413CBF-48AA-4BEC-9D1B-223AE9B27E77.zip

Submit the source code and a screenshot of the results.

 

14.3. Implement Token Learning

Implement token learning using the source code from 14.2 and add the "ViT Token Learner" following the paper https://arxiv.org/pdf/2106.11297.pdf. Sample source code: https://github.com/google-research/scenic/tree/main/scenic/projects/token_learner

Submit the source code and a screenshot of the results; compare the results with and without the "ViT Token Learner".

 


ViT Token learner: https://github.com/google-research/scenic/tree/main/scenic/projects/token_learner

https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/

ViViT: A Video Vision Transformer https://arxiv.org/pdf/2103.15691.pdf

 

VNT-Net: Rotational Invariant Vector Neuron Transformers https://arxiv.org/pdf/2205.09690.pdf

 

First introduced in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale https://iaml-it.github.io/posts/2021-04-28-transformers-in-vision/

 

 

Step by step code: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial15/Vision_Transformer.html

https://keras.io/examples/vision/probing_vits/

 

Good material:

https://iaml-it.github.io/posts/2021-04-28-transformers-in-vision/

https://www.pinecone.io/learn/vision-transformers/

 

 

In Vision Transformers, images are represented as sequences of patches, where every patch is flattened into a single vector by concatenating the channels of all pixels in the patch and then linearly projecting it to the desired input dimension. The number of patches increases as the resolution increases, leading to a higher memory footprint. TokenLearner can help reduce the number of patches without compromising performance.
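
To make the flatten-and-project step concrete, here is a minimal PyTorch sketch of a patch embedding layer (the class name and dimensions are illustrative, not taken from the course template):

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    # Split the image into non-overlapping patches, flatten each patch's
    # pixels across channels, then linearly project to the model dimension.
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(in_channels * patch_size * patch_size, embed_dim)

    def forward(self, x):  # x: [bs, c, h, w]
        bs, c, h, w = x.shape
        p = self.patch_size
        x = x.unfold(2, p, p).unfold(3, p, p)  # [bs, c, h/p, w/p, p, p]
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(bs, -1, c * p * p)  # [bs, n_patches, c*p*p]
        return self.proj(x)  # [bs, n_patches, embed_dim]

x = torch.randn(2, 3, 224, 224)
print(PatchEmbedding()(x).shape)  # torch.Size([2, 196, 768])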

First, divide the input image into a sequence of small patches. Then, use the TokenLearner module to adaptively generate a smaller number of tokens. Process these tokens through a series of Transformer blocks. Finally, add a classification head to obtain the output.

import torch
from tokenlearner import TokenLearnerModuleV11

tklr_v11 = TokenLearnerModuleV11(in_channels=128, num_tokens=8, num_groups=4, dropout_rate=0.)
tklr_v11.eval()  # control dropout

x = torch.ones(256, 32, 32, 128)  # [bs, h, w, c]
y2 = tklr_v11(x)
print(y2.shape)  # [256, 8, 128]
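
A hedged sketch of how those steps could fit together, with the TokenLearner inserted after the first transformer block (the encoder layers and head here are stand-ins for the 14.2 template's own modules; sizes are illustrative):

import torch
import torch.nn as nn
from tokenlearner import TokenLearnerModuleV11

bs, grid, dim, n_classes = 4, 8, 128, 10  # an 8x8 grid of patch tokens
block1 = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
block2 = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
tklr = TokenLearnerModuleV11(in_channels=dim, num_tokens=2, num_groups=2, dropout_rate=0.)
head = nn.Linear(dim, n_classes)

x = torch.randn(bs, grid * grid, dim)  # patch embeddings, [bs, seq, dim]
x = block1(x)  # the first block sees the full token sequence
x = tklr(x.reshape(bs, grid, grid, dim))  # TokenLearner takes [bs, h, w, c], returns [bs, 2, dim]
x = block2(x)  # later blocks process only the 2 learned tokens
logits = head(x.mean(dim=1))  # pool the learned tokens and classify
print(logits.shape)  # torch.Size([4, 10])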

 

 

When I ran it with the configuration we used in exercise 14.2, I did not notice any significant difference between ViT with and without TokenLearner (maybe a ~5% improvement in iteration speed at best). So I tried making the model bigger. I increased the number of transformer layers to 4 and n_patches to 14. The model reached ~99% accuracy in roughly 20 epochs at an average of 26 iterations per second. Then I added a TokenLearner with learned_tokens=2 after the first transformer layer. The model reached the same ~99% accuracy in about the same number of epochs, but at an average of 72 iterations per second, which is a ~2.75x improvement. I am not sure it is an entirely fair comparison, since the dataset may not be the most challenging, but the numbers are pleasant, so ¯\_(ツ)_/¯

Slides


 

Tips & tricks

https://theaisummer.com/transformers-computer-vision/

? Patches, overlapping

? conv vs linear patch
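
On the conv vs linear question: a Conv2d with kernel_size == stride == patch_size computes exactly the same per-patch linear projection as flatten + nn.Linear (and stride < kernel_size gives overlapping patches). A minimal sketch verifying the equivalence; the weight copying is only for the demonstration:

import torch
import torch.nn as nn

p, c, d = 16, 3, 64
linear = nn.Linear(c * p * p, d)
conv = nn.Conv2d(c, d, kernel_size=p, stride=p)  # stride < kernel_size would give overlapping patches
conv.weight.data = linear.weight.data.reshape(d, c, p, p)  # same weights in conv layout
conv.bias.data = linear.bias.data

x = torch.randn(2, c, 32, 32)  # a 2x2 grid of patches
y_conv = conv(x).flatten(2).transpose(1, 2)  # [bs, n_patches, d]

patches = x.unfold(2, p, p).unfold(3, p, p).permute(0, 2, 3, 1, 4, 5).reshape(2, -1, c * p * p)
y_linear = linear(patches)  # flatten each patch, then project

print(torch.allclose(y_conv, y_linear, atol=1e-5))  # True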

 

? multiple tokens classifying => BCE not CCE
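
One reading of this note, as a minimal sketch: if the model uses several classification tokens (e.g., one token per class, each emitting a single logit), each class becomes an independent yes/no decision, so binary cross-entropy fits, while categorical cross-entropy's softmax would force the tokens to compete:

import torch
import torch.nn.functional as F

bs, n_classes = 4, 10
token_logits = torch.randn(bs, n_classes)  # one logit per classification token (illustrative)
targets = F.one_hot(torch.randint(0, n_classes, (bs,)), n_classes).float()

loss = F.binary_cross_entropy_with_logits(token_logits, targets)  # independent sigmoid per token
print(loss)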

 
