Lessons learned about DRAC
(by Jeffrey Rosenthal, 2024)

[These are some brief, rough point-form notes about my efforts to use the Digital Research Alliance of Canada (DRAC) parallel computing facilities. No guarantee of accuracy or correctness -- take them with a grain of salt!]

The main DRAC web page is at: alliancecan.ca
However, to log in, go to: ccdb.alliancecan.ca

There's lots of documentation, starting at: https://docs.scinet.utoronto.ca/
But it might be somewhat hard to learn from initially.

* Have to ssh to cedar.alliancecan.ca or niagara.alliancecan.ca.
  (No, actually also have: graham, beluga, narval -- with longer time limits!)
  -- Best (only?) option is to set up an ssh key (made with ssh-keygen) on their web interface.

* Have to load gcc (C compiler) and R, etc., with e.g.:
      module load gcc
      module load r

* Can then run small jobs directly, but larger jobs must be QUEUED.

* Queued jobs must normally be run in your "scratch" (not main) directory.
  -- e.g. for me it is in: /gpfs/fs0/scratch/j/jrosenth/jrosenth
     which is aliased for me from: ~/scratch

* Can queue jobs with e.g. "sbatch myscript", where myscript is a shell script starting with lines something like:
      #!/bin/bash
      #SBATCH --nodes=1
      #SBATCH --ntasks-per-node=40
      #SBATCH --time=24:00:00

* 24 hours is the maximum compute time you can request.
  -- In fact, anything more than 12 hours makes your jobs lower-priority.
  (No, actually some of their other servers offer more time!)
  (And, can move files between servers using: globus.alliancecan.ca)

* But you can request 40 cores in parallel on one node (as with --ntasks-per-node=40 above), so PARALLELISING your computation is very useful.
  -- One simple way to do this for Monte Carlo computations is to have your initial script call multiple (e.g. 40) instances of your main code separately, as background jobs, and then combine their output together later.
  -- If you do this, then be sure to SEED all your multiple runs differently, in a way that gives fresh, non-repeated seeds each time.
  -- I ended up doing seeding in C with the command "srand48(presenttime());", where presenttime() gives the current time in milliseconds. At first I used "srand48((long int) time(NULL));" but that repeats the same seed if multiple instances are started within the same second, since time() only has one-second resolution.

* If your script calls background jobs, then you should put the command "wait" at the end of your script, so that it keeps running until all of the background jobs have finished. (Otherwise, the script will exit right away and the background jobs will be killed.)

* To check your job status, type e.g. "squeue -u jrosenth" (and then e.g. "PD" means "pending", etc).
  To cancel a job, type "scancel" followed by the job ID.

If you're stuck, you can email e.g. niagara@tech.alliancecan.ca, and they will usually reply quickly and helpfully ...

Good luck!
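
As a rough illustration of the background-jobs approach in the notes above, here is a minimal sketch of a job script that starts several runs in the background, gives each a different seed, waits for them all, and then combines their output. Almost everything specific in it is an assumption made for illustration rather than anything from DRAC's documentation: run_one is a hypothetical stand-in for the real program (here a tiny awk Monte Carlo estimate of pi), NRUNS and OUTDIR are made-up names, and the seed formula is just one possible choice. SLURM_JOB_ID is the job-ID variable that Slurm sets inside a running job; outside Slurm the sketch falls back to the clock.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --time=24:00:00

# Hypothetical settings (illustrative names, not DRAC conventions).
NRUNS=${NRUNS:-40}            # number of background instances
OUTDIR=${OUTDIR:-./mc_out}    # where each run writes its output
mkdir -p "$OUTDIR"

# Stand-in for the real program: a small Monte Carlo estimate of pi.
# In practice this would be something like: ./mycode "$seed"
run_one() {
    local seed=$1
    awk -v s="$seed" 'BEGIN {
        srand(s); n = 10000; hits = 0
        for (k = 0; k < n; k++) {
            x = rand(); y = rand()
            if (x * x + y * y < 1) hits++
        }
        print s, 4 * hits / n
    }'
}

# Base value that changes between submissions: the Slurm job ID if set,
# otherwise the current time in seconds (assumed fallback outside Slurm).
BASE=${SLURM_JOB_ID:-$(date +%s)}

for ((i = 1; i <= NRUNS; i++)); do
    # Adding i makes the seeds distinct within one submission.
    seed=$(( (BASE % 1000000) * 1000 + i ))
    run_one "$seed" > "$OUTDIR/run_$i.txt" &
done

# Keep the script alive until every background run has finished.
wait

# Combine the separate outputs into one file.
cat "$OUTDIR"/run_*.txt > "$OUTDIR/combined.txt"
```

To adapt it, replace the body of run_one with a call to your own program that takes the seed as an argument (e.g. to pass to srand48()), and submit with "sbatch myscript" as above.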