2025-Q1-AI 14. ViT Vision Transformers

 

 

14.1. Video / Materials (🔴 May 13, 18:00, Riga, Zunda krastmala 10, room 122)

Zoom / Video afterwards: https://zoom.us/j/3167417956?pwd=Q2NoNWp2a3M2Y2hRSHBKZE1Wcml4Zz09

Whiteboard: https://www.figma.com/board/gDNrOmXUNUAhUt2IOAM7jD/2025-Q1-AI-14.-ViT-Vision-Transformers?node-id=0-1&t=kucRJ7QfbdDo8xkV-1

Materials:

  1. https://viso.ai/deep-learning/vision-transformer-vit/

  2. https://jacobgil.github.io/deeplearning/vision-transformer-explainability

 


 

Previous year's video: https://www.youtube.com/live/VKvUQP1pkcU

Finished code download: http://share.yellowrobot.xyz/quick/2023-12-26-6D6B4B78-1645-4CD7-BBF2-6A2B3D199AC2.zip

 


 

14.2. Implement ViT

Follow along with the video and implement ViT (An Image is Worth 16x16 Words) https://openreview.net/forum?id=YicbFdNTTy

Template: http://share.yellowrobot.xyz/quick/2023-4-17-5C413CBF-48AA-4BEC-9D1B-223AE9B27E77.zip

Submit the source code and a screenshot with the results.
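A minimal sketch of the building blocks the task covers (patch embedding, CLS token, positional embedding, Transformer encoder, classification head). It assumes 28x28 grayscale input such as MNIST and illustrative hyperparameters; the template organises and configures this differently, so treat it only as orientation, not as the reference solution.

import torch
import torch.nn as nn

class ViT(nn.Module):
    # Minimal ViT sketch: patchify -> embed -> +CLS token -> +positional
    # embedding -> Transformer encoder -> classify from the CLS token.
    def __init__(self, image_size=28, patch_size=4, in_channels=1,
                 dim=64, depth=4, heads=4, num_classes=10):
        super().__init__()
        n_patches = (image_size // patch_size) ** 2
        # A conv with stride == kernel size cuts the image into
        # non-overlapping patches and linearly embeds each one.
        self.patch_embed = nn.Conv2d(in_channels, dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                      # x: [B, C, H, W]
        x = self.patch_embed(x)                # [B, dim, H/p, W/p]
        x = x.flatten(2).transpose(1, 2)       # [B, n_patches, dim]
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])              # logits from the CLS token

model = ViT()
logits = model(torch.randn(8, 1, 28, 28))
print(logits.shape)  # torch.Size([8, 10])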

 

14.3. Implement a model with token learning

Implement token learning using the source code from 14.2 and add the "ViT Token Learner" following the publication https://arxiv.org/pdf/2106.11297.pdf. Sample source code: https://github.com/google-research/scenic/tree/main/scenic/projects/token_learner

Submit the source code and a screenshot with the results, and compare the results with and without the "ViT Token Learner".
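Since the task asks for a with/without comparison, here is a hedged sketch of one way to measure training throughput (iterations per second). The two encoders below are only stand-ins, run on 49 vs. 8 tokens as if a TokenLearner had reduced the sequence; plug in the real 14.2 and 14.3 models instead of them.

import time
import torch
import torch.nn as nn

def iters_per_second(model, x, n_iters=50):
    # Rough throughput measurement: forward + backward + optimizer steps per second.
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    t0 = time.time()
    for _ in range(n_iters):
        loss = model(x).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return n_iters / (time.time() - t0)

def make_encoder():
    layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=4)

full = make_encoder()     # stand-in for the plain ViT: 49 patch tokens
reduced = make_encoder()  # stand-in for ViT + TokenLearner: 8 learned tokens

print("49 tokens:", iters_per_second(full, torch.randn(32, 49, 64)))
print(" 8 tokens:", iters_per_second(reduced, torch.randn(32, 8, 64)))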

 


 

 

ViT Token learner: https://github.com/google-research/scenic/tree/main/scenic/projects/token_learner

https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/

ViViT: A Video Vision Transformer https://arxiv.org/pdf/2103.15691.pdf

 

VNT-Net: Rotational Invariant Vector Neuron Transformers https://arxiv.org/pdf/2205.09690.pdf

 

First introduced in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale https://iaml-it.github.io/posts/2021-04-28-transformers-in-vision/

 

Vision Transformers (ViT) Explained: Are They Better Than CNNs?

https://towardsdatascience.com/vision-transformers-vit-explained-are-they-better-than-cnns/


 

 


Filters of the initial linear embedding layer of ViT-L/32 (left) [3]. The first layer of filters from AlexNet (right) [6].

 

https://www.researchgate.net/figure/llustration-of-multi-head-attention-in-the-transformer-encoder-With-N-different_fig2_359005252


 

https://www.researchgate.net/figure/sion-Transformer-encoding-The-image-is-split-into-fixed-size-patches-linearly-embedded_fig1_353284955


 

https://medium.com/@gabell/encoder-decoder-models-and-transformers-5c1500c22c22


 

 

https://hossboll.medium.com/generalizing-transformers-for-processing-images-a6b5c394d0e0


 

https://www.researchgate.net/figure/Showing-a-single-image-becomes-256-image-patches-9_fig3_382995895


 

https://www.pinecone.io/learn/series/image-search/vision-transformers/


 


https://towardsdatascience.com/vision-transformers-vit-explained-are-they-better-than-cnns/


 

https://viso.ai/deep-learning/vision-transformer-vit/

Class Activation map


 

 

 

https://jacobgil.github.io/deeplearning/vision-transformer-explainability

https://github.com/jacobgil/vit-explain


 

Step by step code: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial15/Vision_Transformer.html

https://keras.io/examples/vision/probing_vits/

 

Good material:

https://iaml-it.github.io/posts/2021-04-28-transformers-in-vision/

https://www.pinecone.io/learn/vision-transformers/

 

 

In Vision Transformers, images are represented as sequences of patches, where every patch is flattened into a single vector by concatenating the channels of all pixels in the patch and then linearly projecting it to the desired input dimension. The number of patches grows with the image resolution, leading to a higher memory footprint. TokenLearner can help reduce the number of patches without compromising performance.
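A shape-level sketch of that flattening and projection, using the paper's illustrative sizes (a 224x224 RGB image, 16x16 patches, so 14x14 = 196 patches of 3*16*16 = 768 values each):

import torch
import torch.nn as nn

B, C, H, W, P, D = 2, 3, 224, 224, 16, 768
x = torch.randn(B, C, H, W)

# Cut the image into non-overlapping P x P patches and flatten each one
# into a vector of C * P * P pixel values.
patches = x.unfold(2, P, P).unfold(3, P, P)    # [B, C, H/P, W/P, P, P]
patches = patches.permute(0, 2, 3, 1, 4, 5)    # [B, H/P, W/P, C, P, P]
patches = patches.reshape(B, -1, C * P * P)    # [B, 196, 768]

# Shared linear projection to the model dimension D ("patch embedding").
proj = nn.Linear(C * P * P, D)
tokens = proj(patches)                         # [B, 196, D]
print(patches.shape, tokens.shape)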

First, divide the input image into a sequence of small patches. Then, use the TokenLearner module to adaptively generate a smaller number of tokens. Process these tokens through a series of Transformer blocks. Finally, add a classification head to obtain the output.

import torch
from tokenlearner import TokenLearnerModuleV11

tklr_v11 = TokenLearnerModuleV11(in_channels=128, num_tokens=8, num_groups=4, dropout_rate=0.)
tklr_v11.eval()  # eval mode disables dropout
x = torch.ones(256, 32, 32, 128)  # [bs, h, w, c]
y2 = tklr_v11(x)
print(y2.shape)  # [256, 8, 128]
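For reference, a rough from-scratch sketch of the same idea that works directly on a [B, N, C] token sequence, so it can be dropped between the transformer blocks of the 14.2 model. It only approximates the paper's v1.1 module (the real one uses grouped MLPs/convolutions and differs in details):

import torch
import torch.nn as nn

class TokenLearnerSketch(nn.Module):
    # An MLP predicts S spatial attention maps over the N input tokens;
    # each learned token is an attention-weighted average of the inputs.
    def __init__(self, dim, num_tokens=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.Sequential(
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, num_tokens),
        )

    def forward(self, x):                 # x: [B, N, C]
        a = self.attn(self.norm(x))       # [B, N, S] attention logits
        a = a.softmax(dim=1)              # normalise over the N positions
        return a.transpose(1, 2) @ x      # [B, S, C] learned tokens

tl = TokenLearnerSketch(dim=128, num_tokens=8)
x = torch.randn(256, 32 * 32, 128)        # the image patches flattened into tokens
print(tl(x).shape)                        # torch.Size([256, 8, 128])

In a ViT with a CLS token, one option is to apply this only to the patch tokens and concatenate the CLS token back afterwards; that handling is an assumption here, not something the paper prescribes.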

 

 

When I ran it with the configuration we used in task 14.2, I did not notice a significant difference between the ViT with and without TokenLearner (maybe a ~5% improvement in iteration speed at best). So I tried making the model bigger: I increased the number of transformer layers to 4 and n_patches to 14. The model reached ~99% accuracy in roughly 20 epochs at an average of 26 iterations per second. Then I added a TokenLearner with learned_tokens=2 after the first transformer layer. The model reached the same ~99% accuracy in roughly the same number of epochs, but at an average of 72 iterations per second, which is a ~2.75x improvement. I am not sure this is an entirely fair comparison, since the dataset is probably not the most challenging one, but the numbers are pleasing, so ¯\_(ツ)_/¯

Slides

 


 

Tips & tricks

https://theaisummer.com/transformers-computer-vision/

? Patches, overlapping (see the sketch after these points)

? Conv vs. linear patch embedding

? Classifying with multiple tokens => BCE instead of CCE
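On the first two points, a small sketch: the standard ViT patch embedding is a convolution whose kernel size equals the patch size. With stride equal to the patch size the patches do not overlap (and the conv computes exactly the same thing as flattening each patch and applying a shared nn.Linear), while a smaller stride gives overlapping patches and more tokens.

import torch
import torch.nn as nn

x = torch.randn(2, 3, 224, 224)

# stride == kernel size: non-overlapping 16x16 patches (standard ViT)
non_overlap = nn.Conv2d(3, 768, kernel_size=16, stride=16)
# stride < kernel size: overlapping patches, more tokens
overlap = nn.Conv2d(3, 768, kernel_size=16, stride=8)

print(non_overlap(x).flatten(2).transpose(1, 2).shape)  # [2, 196, 768]
print(overlap(x).flatten(2).transpose(1, 2).shape)      # [2, 729, 768]

The last point presumably refers to classifying from several output tokens, each making its own independent (multi-label) prediction; in that setup BCEWithLogitsLoss is the natural fit, rather than categorical cross-entropy over a single CLS token.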