RTU HPC Usage Documentation - 2021-10-04


 

TORQUE 4.0 resource management system (jobs are scheduled in queues by priority): https://docs.adaptivecomputing.com/torque/4-0-2/help.htm

Tutorial

Tutorial video (in Latvian): https://youtu.be/tr6y-w0tnm4

Jamboard, session 11: https://jamboard.google.com/d/1p_TZ1bdndv5HtCKqxzbHIEzuuSl0CFQ2GgBA7YXh5J0/edit?usp=sharing

Templates:

http://share.yellowrobot.xyz/1628158950-vea-rtu-course-2020-q1/tensorboard_utils.py
http://share.yellowrobot.xyz/1628158950-vea-rtu-course-2020-q1/11_1_tensorboard_template.py
http://share.yellowrobot.xyz/1628158950-vea-rtu-course-2020-q1/11_2_classification_template.py
http://share.yellowrobot.xyz/1628158950-vea-rtu-course-2020-q1/11_3_hpc_template.sh
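
To copy the templates onto the cluster, you can fetch them directly in your working directory, e.g. with wget:

wget http://share.yellowrobot.xyz/1628158950-vea-rtu-course-2020-q1/tensorboard_utils.py
wget http://share.yellowrobot.xyz/1628158950-vea-rtu-course-2020-q1/11_1_tensorboard_template.py
wget http://share.yellowrobot.xyz/1628158950-vea-rtu-course-2020-q1/11_2_classification_template.py
wget http://share.yellowrobot.xyz/1628158950-vea-rtu-course-2020-q1/11_3_hpc_template.sh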

Manual: http://share.yellowrobot.xyz/1628158950-vea-rtu-course-2020-q1/rokasgramata_RTU.pdf

For data storage, use the folder /mnt/beegfs2/home/abstrac01/your_name (note: this is different from the one used in the tutorial).
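
For example (a sketch; replace your_name with your own directory name):

mkdir -p /mnt/beegfs2/home/abstrac01/your_name
cd /mnt/beegfs2/home/abstrac01/your_name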

Examples of how to automate task generation: http://share.yellowrobot.xyz/1631791795-hpc/task_gen_example_hpc.zip

 

Available GPUs

  1. A new compute node with 4 x NVIDIA A100 GPUs is available (wn44): qsub -l feature=a100 (details: https://hpc.rtu.lv/hpc/klastera-rokasgramata/klasteru-tehniskais-apraksts/#rudens ).

  2. Two V100 nodes:
     wn59  compute-gpu  CentOS 7  192 GB  Gold 6130  32  4  /scratch  dell  vasara  gpu v100
     wn60  compute-gpu  CentOS 7  192 GB  Gold 6130  32  4  /scratch  dell  vasara  gpu v100

 

 

Only 2 GPU combinations are available:

K40 - first try with this GPU

V100

 

Script that works for V100 and K40
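
A sketch of such a submit script (the job name, walltime, environment name, and Python entry point are illustrative assumptions; for V100, swap the resource line for the v100 variant from "GPU settings" below):

#!/bin/sh
#PBS -N my_experiment
#PBS -l nodes=1:ppn=12:gpus=1,feature=k40
#PBS -l walltime=24:00:00

# activate your conda environment (see "HPC commands" below);
# use "conda activate my_env" instead if your shell is set up for it
source activate my_env

cd /mnt/beegfs2/home/abstrac01/your_name
python3 11_2_classification_template.py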


Troubleshooting

When setting up the conda environment, if you get a "no space left" disk error, move conda's package cache and temporary files off the home partition (see the sketch below).
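
A sketch of this fix, assuming the data storage folder from above (the exact paths are illustrative; TMPDIR and CONDA_PKGS_DIRS are standard Unix/conda variables):

export TMPDIR=/mnt/beegfs2/home/abstrac01/your_name/tmp
export CONDA_PKGS_DIRS=/mnt/beegfs2/home/abstrac01/your_name/conda_pkgs
mkdir -p "$TMPDIR" "$CONDA_PKGS_DIRS"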

When installing anything, it is very important to use "pip3", not "pip". You must also check that pip3 itself is loaded from the current environment, not from the system installation.
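
For example (the package name is illustrative):

which pip3    # should print a path inside your conda environment, e.g. .../envs/my_env/bin/pip3
pip3 install torch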

If you get missing-library errors, you need to symlink the specific version inside the conda environment (conda_env).
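
A hedged sketch: if the error reports a library that exists only under a versioned file name, link it inside the environment's lib directory (the library name, version, and environment path below are hypothetical; take the real ones from your error message):

cd ~/.conda/envs/my_env/lib
ln -s libsomething.so.2 libsomething.so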

 


Simple example

Tasks must be run with a resource flag (see "GPU settings" below), e.g. -l nodes=1:ppn=12:gpus=1,feature=k40.

Example run.sh (notice the "&" signs at the end: the runs will execute in parallel; do not forget the "wait" command at the end). It is also very important not to waste resources: first try a single run and look into showq; if your run is executing, for example, on wn60, type "ssh wn60" and then, inside that node, "nvidia-smi" to see how much of the GPU resources you are using. You should be able to run roughly 16 runs in parallel. It is also advisable to write an automatic script generator so that you do not have to type in all of the hyperparameters by hand.
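
A sketch of such a run.sh, assuming the classification template above (the flag names and values are hypothetical; four parallel runs shown):

#!/bin/sh
python3 11_2_classification_template.py -run_name run_1 -learning_rate 0.001 &
python3 11_2_classification_template.py -run_name run_2 -learning_rate 0.0003 &
python3 11_2_classification_template.py -run_name run_3 -learning_rate 0.0001 &
python3 11_2_classification_template.py -run_name run_4 -learning_rate 0.00003 &
wait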

Execute it by using:
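
A sketch, assuming the K40 combination from "GPU settings" below and the run.sh above:

qsub -l nodes=1:ppn=12:gpus=1,feature=k40 run.sh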

Credits left

 

 

Check status

HPC commands

To see all tasks and nodes operating, and also the cluster usage:

showq

To see only our own jobs:

qstat -u $USER

Before executing scripts, you need to create your own conda environment and activate it.
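
A sketch (the environment name, Python version, and package are illustrative):

conda create -n my_env python=3.8
source activate my_env    # or "conda activate my_env", depending on your shell setup
pip3 install torch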

GPU settings (do not use other combinations)

nodes=1:ppn=12:gpus=1,feature=k40
^ use this one first (K40)

nodes=1:ppn=8:gpus=1,feature=v100

CPU settings for testing
nodes=1:ppn=1
nodes=1:ppn=8


Very simple script
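
A minimal sketch, using the CPU testing settings above (the job name, walltime, and script body are illustrative):

#!/bin/sh
#PBS -N test_job
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00

echo "Hello from $(hostname)"

Save it as test.sh and submit with: qsub test.sh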