TORQUE 4.0 Resource management system (tasks are queued by priority): https://docs.adaptivecomputing.com/torque/4-0-2/help.htm
Latvian Tutorial video: https://youtu.be/tr6y-w0tnm4
Jamboard, session 11: https://jamboard.google.com/d/1p_TZ1bdndv5HtCKqxzbHIEzuuSl0CFQ2GgBA7YXh5J0/edit?usp=sharing
Templates:
http://share.yellowrobot.xyz/1628158950-vea-rtu-course-2020-q1/tensorboard_utils.py
http://share.yellowrobot.xyz/1628158950-vea-rtu-course-2020-q1/11_1_tensorboard_template.py
http://share.yellowrobot.xyz/1628158950-vea-rtu-course-2020-q1/11_2_classification_template.py
http://share.yellowrobot.xyz/1628158950-vea-rtu-course-2020-q1/11_3_hpc_template.sh
Manual: http://share.yellowrobot.xyz/1628158950-vea-rtu-course-2020-q1/rokasgramata_RTU.pdf
For data storage, use the folder /mnt/beegfs2/home/abstrac01/your_name (this is different from the one used in the tutorial).
Examples of how to automate task generation: http://share.yellowrobot.xyz/1631791795-hpc/task_gen_example_hpc.zip
A new compute node with 4 x NVIDIA A100 GPUs (wn44) is available: qsub -l feature=a100 (details: https://hpc.rtu.lv/hpc/klastera-rokasgramata/klasteru-tehniskais-apraksts/#rudens ).
wn59 compute-gpu Centos 7 192 GB Gold 6130 32 4 /scratch dell vasara gpu v100
wn60 compute-gpu Centos 7 192 GB Gold 6130 32 4 /scratch dell vasara gpu v100
K40 (try this GPU first):
#PBS -l nodes=1:ppn=12:gpus=1:shared,feature=k40
V100:
#PBS -l nodes=1:ppn=8:gpus=1:shared,feature=v100
#!/bin/sh -v
#PBS -e /mnt/home/abstrac01/logs
#PBS -o /mnt/home/abstrac01/logs
#PBS -q batch
#PBS -p 1000
#PBS -l nodes=1:ppn=12:gpus=1,feature=k40
#PBS -l mem=40gb
#PBS -l walltime=96:00:00

# Make "conda activate" available in this non-interactive shell
eval "$(conda shell.bash hook)"
conda activate conda_env
# Put the environment's shared libraries first on the library search path
export LD_LIBRARY_PATH=/mnt/home/abstrac01/.conda/envs/conda_env/lib:$LD_LIBRARY_PATH

cd /your/task/dir
python taskgen.py -learning_rate 1e-4 1e-3
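The last line of the script passes several learning rates to one task generator. As a rough illustration of that idea, the sketch below expands multi-valued flags into one command line per combination. It is not the taskgen.py from the linked archive: the function name `build_commands`, the script name `train.py`, and the `-batch_size` flag are hypothetical and only serve the example.

```python
import argparse
import itertools
import shlex


def build_commands(argv, script="train.py"):
    """Expand multi-valued flags into one command line per combination.

    `script` and the -batch_size flag are hypothetical placeholders.
    """
    parser = argparse.ArgumentParser()
    # nargs="+" lets one flag take several values: -learning_rate 1e-4 1e-3
    parser.add_argument("-learning_rate", nargs="+", type=float, default=[1e-3])
    parser.add_argument("-batch_size", nargs="+", type=int, default=[32])
    args = parser.parse_args(argv)

    commands = []
    # itertools.product yields every (learning_rate, batch_size) pair
    for lr, bs in itertools.product(args.learning_rate, args.batch_size):
        parts = ["python", script,
                 "-learning_rate", str(lr),
                 "-batch_size", str(bs)]
        commands.append(" ".join(shlex.quote(p) for p in parts))
    return commands


if __name__ == "__main__":
    for cmd in build_commands(["-learning_rate", "1e-4", "1e-3"]):
        print(cmd)
```

Each generated command could then be written into a job script or submitted separately, depending on how many runs fit on one node.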
If you get a "no space left" disk error when setting up the conda environment, point pip at a different cache directory:
pip3 --cache-dir /home/tmp
When installing packages, it is very important to use "pip3", not "pip". Check that the last entry listed is loaded from the current environment:
whereis pip
which -a pip
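The same check can also be done from Python. The following is a minimal sketch that compares the location of pip3 on the PATH with the prefix of the interpreter that is currently running; the helper name `is_inside_prefix` is made up for this example, and it assumes the script is run with the Python of the activated environment.

```python
import os
import shutil
import sys


def is_inside_prefix(executable_path, prefix):
    """Return True if executable_path lives under the environment prefix."""
    if executable_path is None:
        return False
    real_exe = os.path.realpath(executable_path)
    real_prefix = os.path.realpath(prefix)
    # commonpath equals the prefix only when the executable is below it
    return os.path.commonpath([real_exe, real_prefix]) == real_prefix


if __name__ == "__main__":
    pip_path = shutil.which("pip3")  # first pip3 found on PATH, or None
    print("pip3 found at:", pip_path)
    print("active environment:", sys.prefix)
    print("pip3 belongs to this environment:",
          is_inside_prefix(pip_path, sys.prefix))
```

If the last line prints False, pip3 is most likely being picked up from the system or from another environment.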
If you get missing library errors, link the specific version inside conda_env:
export LD_LIBRARY_PATH=~/.conda/envs/conda_env/lib:$LD_LIBRARY_PATH
ln -s ~/.conda/envs/conda_env/lib/libstdc++.so.6.0.28 ~/.conda/envs/conda_env/lib/libstdc++.so.6
Tasks must be submitted with the account flag:
qsub -A ditf_ldi script.sh
Example run.sh. Note that commands ending with an ampersand (&) execute in parallel; do not forget the wait command at the end. It is also very important not to waste resources: start a single run and look at showq to see where it is executing, for example on wn60. Then type "ssh wn60" and, inside that node, "nvidia-smi" to see how much of the GPU you are using. I expect that you could run 16 runs in parallel. I also advise you to write an automatic script generator so that you do not have to type in all the hyperparameters by hand.
#!/bin/sh -v
#PBS -e /mnt/home/abstrac01/evalds_urtans/logs
#PBS -o /mnt/home/abstrac01/evalds_urtans/logs
#PBS -q batch
#PBS -l nodes=1:ppn=16:gpus=2:shared,feature=v100
#PBS -l mem=40gb
#PBS -l walltime=10:00:00

module load conda
eval "$(conda shell.bash hook)"
conda activate conda_env

export TEMP=$HOME/tmp
export TMPDIR=$HOME/tmp
export LD_LIBRARY_PATH=/mnt/home/abstrac01/.conda/envs/conda_env/lib:$LD_LIBRARY_PATH

cd /mnt/home/abstrac01/evalds_urtans
python ./11_2_classification_finished_tmp.py -learning_rate 1e-3 -is_cuda True &
python ./11_2_classification_finished_tmp.py -learning_rate 1e-4 -is_cuda True &
python ./11_2_classification_finished_tmp.py -learning_rate 1e-5 -is_cuda True
wait
Execute it with:
qsub ./run.sh
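The advice about an automatic script generator could be followed roughly as in the sketch below, which renders a run.sh like the one above from a list of learning rates. The function name `render_run_script`, its default arguments, and the placeholder directories are assumptions made for illustration; the PBS directives are copied from the examples in these notes.

```python
def render_run_script(learning_rates, workdir, log_dir,
                      train_script="./11_2_classification_template.py",
                      resources="nodes=1:ppn=8:gpus=1,feature=v100"):
    """Return the text of a PBS job script with one background run per rate."""
    lines = [
        "#!/bin/sh -v",
        f"#PBS -e {log_dir}",
        f"#PBS -o {log_dir}",
        "#PBS -q batch",
        f"#PBS -l {resources}",
        "#PBS -l mem=40gb",
        "#PBS -l walltime=10:00:00",
        "",
        "module load conda",
        'eval "$(conda shell.bash hook)"',
        "conda activate conda_env",
        "",
        f"cd {workdir}",
    ]
    for lr in learning_rates:
        # The trailing & sends each run to the background so they overlap
        lines.append(
            f"python {train_script} -learning_rate {lr} -is_cuda True &")
    # wait keeps the job alive until every background run has finished
    lines.append("wait")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder directories; replace them with your own paths
    print(render_run_script(["1e-3", "1e-4", "1e-5"],
                            workdir="/your/task/dir",
                            log_dir="/your/logs/dir"))
```

Writing the returned text to a file and passing that file to qsub gives one job per hyperparameter group without retyping the directives.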
Credits left:
mam-balance
Check status:
qstat -a ditf_ldi
To see all running tasks and the nodes in use, as well as usage:
showq -r
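Once showq shows which node a run is on, the nvidia-smi check mentioned earlier can be turned into numbers. The sketch below parses the CSV output of nvidia-smi's --query-gpu mode and makes a rough, memory-only estimate of how many identical runs might fit on one GPU. The helper names, the 90% headroom, and the sample values are assumptions; GPU utilization and CPU cores limit parallelism as well.

```python
# Command to run on the compute node (for example after "ssh wn60")
QUERY = ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_usage(csv_text):
    """Parse lines such as '0, 35, 1000, 16160' into dictionaries."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, util, used, total = [f.strip() for f in line.split(",")]
        gpus.append({"index": int(index),
                     "util_percent": int(util),
                     "mem_used_mib": int(used),
                     "mem_total_mib": int(total)})
    return gpus


def estimate_parallel_runs(gpu, headroom_percent=90):
    """Memory-only estimate, assuming exactly one run is active on the GPU."""
    if gpu["mem_used_mib"] <= 0:
        return 0
    usable = gpu["mem_total_mib"] * headroom_percent // 100
    return usable // gpu["mem_used_mib"]


if __name__ == "__main__":
    # Made-up sample output; on a node use subprocess.run(QUERY, ...) instead
    sample = "0, 35, 1000, 16160\n1, 0, 3, 16160\n"
    for gpu in parse_gpu_usage(sample):
        print(gpu, "-> estimated runs:", estimate_parallel_runs(gpu))
```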
Managing your own jobs (list, submit, delete):
qstat
qsub ./script.sh
qdel ID-from-qstat
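When many jobs are submitted from a generator, it can help to build the qsub call in code. The sketch below assembles the command with the -A account flag and an optional feature request, both taken from these notes; the function names are hypothetical, and `submit` assumes that qsub is on the PATH and prints the job identifier on standard output.

```python
import subprocess


def build_qsub_command(script_path, account="ditf_ldi", feature=None):
    """Return the qsub argument list; options go before the script path."""
    cmd = ["qsub", "-A", account]
    if feature:
        # For example feature="a100" gives: -l feature=a100
        cmd += ["-l", f"feature={feature}"]
    cmd.append(script_path)
    return cmd


def submit(script_path, **kwargs):
    """Submit the script and return what qsub printed (the job identifier)."""
    result = subprocess.run(build_qsub_command(script_path, **kwargs),
                            check=True, capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # Only print the command here; call submit() on the cluster itself
    print(" ".join(build_qsub_command("./run.sh", feature="a100")))
```

The identifier returned by `submit` is the value that qdel expects when a job has to be removed.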
Commands to load conda and CUDA:
module load conda
module load cuda
Before executing scripts, you need to create your own conda environment:
conda create -n conda_env
conda activate conda_env
conda list
conda install -c anaconda numpy
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
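After the installation it is worth confirming that PyTorch imports and can see a GPU before queuing long jobs. The following is a small sketch with a hypothetical helper name; note that on a login node without a GPU it is expected to report that CUDA is unavailable, so the meaningful check is the one run on a GPU node.

```python
import importlib.util


def describe_torch_gpu():
    """Return basic PyTorch/CUDA facts, or None if torch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    available = torch.cuda.is_available()
    return {
        "torch_version": torch.__version__,
        "cuda_available": available,
        # device_count is only meaningful when CUDA is usable
        "device_count": torch.cuda.device_count() if available else 0,
    }


if __name__ == "__main__":
    info = describe_torch_gpu()
    if info is None:
        print("torch is not installed in this environment")
    else:
        print(info)
```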
GPU settings (do not use other combinations):
nodes=1:ppn=12:gpus=1,feature=k40 (use this one)
nodes=1:ppn=8:gpus=1,feature=v100
CPU settings for testing
nodes=1:ppn=1
nodes=1:ppn=8
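Because only these combinations should be used, a script generator can look them up by name instead of accepting free-form resource strings. This is a minimal sketch that mirrors the settings listed above; the dictionary keys and the function name `resource_directive` are invented for the example.

```python
# Resource strings copied from the GPU and CPU settings listed above
ALLOWED_RESOURCES = {
    "k40": "nodes=1:ppn=12:gpus=1,feature=k40",
    "v100": "nodes=1:ppn=8:gpus=1,feature=v100",
    "cpu1": "nodes=1:ppn=1",
    "cpu8": "nodes=1:ppn=8",
}


def resource_directive(name):
    """Return the '#PBS -l ...' line for a named, allowed combination."""
    try:
        return "#PBS -l " + ALLOWED_RESOURCES[name]
    except KeyError:
        allowed = ", ".join(sorted(ALLOWED_RESOURCES))
        raise ValueError(
            f"unknown setting {name!r}; allowed: {allowed}") from None


if __name__ == "__main__":
    print(resource_directive("k40"))
```

Rejecting unknown names early avoids submitting a job that waits in the queue for a node configuration that does not exist.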
Very simple script