Approaches to Debugging at Scale on the Peregrine System
On the Peregrine system, it is occasionally necessary to debug programs at relatively large scale, on more nodes than are available through the short queue. Because many jobs on Peregrine run for several days (up to 10 days per job), it may take a long time to acquire a large number of nodes.
To debug applications that use many nodes, there are two possible approaches. The first provides the nodes as soon as possible, but the time at which they become available is unknown. The second provides the nodes at a specific time but requires assistance from system administrators.
For the first approach, submit an interactive job that requests the number of nodes you will need by adding the -I option to the qsub command. For example:
qsub -I -l nodes=100,walltime=1:00:00:00 -A CSC001
This asks for 100 nodes for 1 day. When the nodes become available, you "land" in an interactive session (shell) on one of the 100 compute nodes. From there you may run scripts, execute parallel programs across any of the 100 nodes, or use an interactive debugger such as TotalView. When you are done working, exit the interactive session.
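As a quick check on the walltime format, the sketch below converts a value such as 1:00:00:00 into seconds. It assumes the colon-separated fields are read right to left as seconds, minutes, hours, and days; walltime_to_seconds is a hypothetical helper name, not a Peregrine or scheduler command.

```shell
# Hypothetical helper: convert a walltime string to seconds.
# Assumption: fields are read right to left as seconds, minutes, hours, days.
walltime_to_seconds() {
    printf '%s\n' "$1" | awk -F: '
        NF < 1 || NF > 4 { exit 1 }   # reject anything that is not 1 to 4 fields
        {
            mult[0] = 1; mult[1] = 60; mult[2] = 3600; mult[3] = 86400
            total = 0
            for (i = NF; i >= 1; i--) total += $i * mult[NF - i]
            print total
        }'
}

# The walltime from the example: 1 day.
walltime_to_seconds 1:00:00:00
```

Running it prints 86400, the number of seconds in one day.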
A request of this size and duration will rarely start right away, so running it within a screen session lets you wait for the job to start without staying connected to Peregrine. With this method you must periodically check whether the job has started by reconnecting to the screen session.
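Another way to check, without reattaching, may be to query the scheduler. The sketch below assumes that qstat on Peregrine produces Torque-style "qstat -f" output containing a line such as "job_state = R". Here job_state is a hypothetical helper, and JOBID is a placeholder for the job ID that qsub prints when the job is submitted.

```shell
# Hypothetical helper: read `qstat -f` output on stdin and print the
# one-letter job state (for example Q = queued, R = running).
job_state() {
    sed -n 's/^[[:space:]]*job_state = \([A-Z]\).*/\1/p'
}

# Only query the scheduler when qstat exists and JOBID has been set.
if command -v qstat >/dev/null 2>&1 && [ -n "${JOBID:-}" ]; then
    qstat -f "$JOBID" | job_state
fi
```

If the job ID is not at hand, qstat -u followed by your user name lists your jobs.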
Using screen sessions:
1) On a login node, type "screen"
2) Check to see whether your environment is correct within the screen session. If needed, purge modules and reload.
[user@login2 ~]$ screen
LD_LIBRARY_PATH: Undefined variable.
[user@login2 ~]$ module purge
[user@login2 ~]$ module load comp-intel
3) Request an interactive job
[user@login2 ~]$ qsub -I -l nodes=X -A project-handle -q queue-name
When you want to disconnect from the session, press Control-A, then d. The interactive job request stays active on Peregrine inside the detached screen session.
Later, to continue working in the interactive job session, reattach to the screen session. If you have logged out of Peregrine, first log in to the same login node where you started screen, then type screen -r. If your interactive job has started, you will land on the compute node that the system assigned to you.
When you are done with your work, type exit to end the interactive job, and then type exit again to end the screen session.
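If you keep more than one screen session, giving the session a name can make it easier to find again. The sketch below is one possible way to do that: peregrine-debug is a hypothetical session name and has_screen_session is a hypothetical helper that reads the output of screen -ls on standard input. The screen commands are left as comments so the snippet has no side effects when pasted.

```shell
# Hypothetical session name, used only for illustration.
SESSION=peregrine-debug

# Hypothetical helper: succeed if `screen -ls` output (on stdin) lists a
# session whose name, after the leading "PID.", equals the first argument.
has_screen_session() {
    awk -v name="$1" '
        $1 ~ /^[0-9]+\./ {
            n = $1
            sub(/^[0-9]+\./, "", n)   # strip the "PID." prefix
            if (n == name) found = 1
        }
        END { exit !found }'
}

# Typical use on the login node:
#   screen -S "$SESSION"      # start a named session, then run qsub -I inside it
#   screen -ls | has_screen_session "$SESSION" && screen -r "$SESSION"
```

The check only reports sessions on the current login node, which is why you must log in to the same node before reattaching.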
The second approach, which may be more convenient, is to request a reservation for the number of nodes you need. A reservation may be shared by multiple users, and it starts and ends at specific times. The start time for a reservation will be at least 5 days after it is requested.
To request a reservation for a debugging session, please contact us and include:
- Project handle
- Number of nodes
- Requested time (start and end of the reservation)
When the work is complete, please inform the Peregrine system administrators so that the reservation can be released. The project allocation will be charged for the reserved time until the reservation is released, whether that time is used or not.
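To get a rough sense of what an unreleased reservation holds, the sketch below multiplies nodes by hours. It assumes usage is counted in node-hours, which this page does not state, so treat the result as an estimate of reserved capacity and not as the actual charge. reserved_node_hours is a hypothetical helper name.

```shell
# Hypothetical helper: node-hours held by a reservation.
# Assumption: usage is counted as nodes multiplied by hours reserved.
reserved_node_hours() {
    nodes=$1
    hours=$2
    echo $(( nodes * hours ))
}

# Example: 100 nodes held for 24 hours before being released.
reserved_node_hours 100 24
```

This prints 2400. Releasing the same reservation after 6 hours would hold 600 node-hours instead.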
When your reserved time starts, you may run either interactive (-I) jobs or regular batch jobs on the nodes in the reservation.