Approaches to Debugging at Scale on the Eagle System
On the Eagle system, it is occasionally necessary to debug programs at relatively large scale, on more nodes than are available via the short queue. Because many jobs on Eagle run for several days (up to 10 days per job), it may take a long time to acquire a large number of nodes.
To debug applications that use many nodes, there are three possible approaches.
The first approach is to submit an interactive job that requests the number of tasks you need. For example:
srun -n 3600 -t 1-00 -A <handle> --pty $SHELL
This asks for 3600 cores (100 nodes) for 1 day. When the nodes are available for your job, you "land" in an interactive session (shell) on one of the 100 compute nodes. From there you may run scripts, execute parallel programs across any of the 100 nodes, or use an interactive debugger such as ARM DDT.
When you are done working, exit the interactive session.
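The task count in the request above follows from the node count. The sketch below shows that arithmetic and prints the matching request without submitting it. It assumes 36 cores per node, which is inferred from the "3600 cores (100 nodes)" figure; the function names and the handle myhandle are hypothetical.

```shell
#!/bin/bash
# Assumption: 36 cores per node, inferred from "3600 cores (100 nodes)".
CORES_PER_NODE=36

# Print the number of tasks needed to fill the given number of nodes.
tasks_for_nodes() {
  local nodes="$1"
  echo $(( nodes * CORES_PER_NODE ))
}

# Print (without running) the interactive request for a node count,
# a time limit, and a project handle.
interactive_request() {
  local nodes="$1" walltime="$2" handle="$3"
  echo "srun -n $(tasks_for_nodes "$nodes") -t ${walltime} -A ${handle} --pty \$SHELL"
}

# Hypothetical handle "myhandle"; substitute your own project handle.
interactive_request 100 1-00 myhandle
# prints: srun -n 3600 -t 1-00 -A myhandle --pty $SHELL
```

Printing the command first makes it easy to check the task count before committing to a request that may wait in the queue for a long time.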
A request of this size and duration rarely starts right away. Running it within a screen session lets you wait for the job to start without staying connected to Eagle. With this method, you must periodically reconnect to the screen session to check whether the job has started.
Using screen sessions:
1) On a login node, type "screen".
2) Check to see whether your environment is correct within the screen session. If needed, purge modules and reload:
[user@login2 ~]$ screen
LD_LIBRARY_PATH: Undefined variable.
[user@login2 ~]$ module purge
[user@login2 ~]$ module load comp-intel
3) Request an interactive job:
$ srun -n 3600 -t 1-00 -A <handle> --pty $SHELL
When you want to disconnect from the session, press Ctrl-A and then d. The interactive job continues to run on Eagle.
Later, to continue working in the interactive job session, reconnect to this screen session. If you have logged out of Eagle, first log in to the same login node, then type screen -r to reattach to the screen session. If your interactive job has started, you will land on the compute node that the system assigned to you.
When you are done with your work, type exit to end the interactive job, and then type exit again to end the screen session.
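Before submitting a request that may wait for hours, it can help to confirm that the current shell really is inside a screen session. The following is a minimal sketch, assuming GNU screen, which sets the STY environment variable in every shell it starts. The function names and the session name eagle-debug are hypothetical.

```shell
#!/bin/bash
# Return success if this shell is running inside a GNU screen session.
# screen exports STY (the session name) to every shell it starts.
in_screen() {
  [ -n "${STY:-}" ]
}

# Report whether it is safe to submit a long-wait interactive request here.
check_screen() {
  if in_screen; then
    echo "inside screen session: ${STY}"
  else
    echo "not inside screen; start one with: screen -S eagle-debug"
  fi
}

check_screen
```

Starting screen with -S gives the session a name, so screen -ls shows it in the list of sessions and screen -r eagle-debug reattaches to that one session if several exist on the same login node.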
The second approach, which may be more convenient, is to request a reservation for the number of nodes you need. A reservation may be shared by multiple users, and it starts and ends at specific times. The start time for a reservation will be at least 5 days from when it is requested.
To request a reservation for a debugging session, please contact us and include:
- Project handle
- Number of nodes
- Requested start time and duration
When the work is complete, please inform the Eagle system administrators so that the reservation can be released. The project allocation is charged for all reserved time until the reservation is released, whether or not that time is used.
When your reserved time starts, you may run either interactive jobs or regular batch jobs on the nodes in the reservation.
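Jobs are directed to a reservation with the standard Slurm --reservation option, which both srun and sbatch accept. The sketch below prints, without running, one interactive and one batch request. The reservation name debug_res, the handle myhandle, the script job.sh, and the function names are all hypothetical; use the reservation name the administrators give you.

```shell
#!/bin/bash
# Print (without running) an interactive request that targets a reservation.
reserved_interactive_request() {
  local reservation="$1" ntasks="$2" walltime="$3" handle="$4"
  echo "srun --reservation=${reservation} -n ${ntasks} -t ${walltime} -A ${handle} --pty \$SHELL"
}

# Print (without running) a batch submission to the same reservation.
reserved_batch_request() {
  local reservation="$1" handle="$2" script="$3"
  echo "sbatch --reservation=${reservation} -A ${handle} ${script}"
}

# Hypothetical values: reservation "debug_res", handle "myhandle", script "job.sh".
reserved_interactive_request debug_res 3600 4:00:00 myhandle
reserved_batch_request debug_res myhandle job.sh
```

To confirm the name, start time, end time, and node list of a reservation before submitting, scontrol show reservation lists the reservations known to Slurm.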
The third approach is offline debugging. Debugging a large parallel job on Eagle interactively can be difficult; an alternative is to submit a batch job that runs the debugger offline.
First, scale the problem down so that it fits in an interactive job that starts quickly (around 4 nodes). Start an interactive session and open the ARM DDT debugger (GUI). Run the program and set evaluations, tracepoints, watchpoints, etc. in the DDT session, then save the session file.
You can then submit a larger job that runs ARM DDT in offline mode, pointing it to the session file created in the previous step. At the end of the run, you can view the generated debugging report in HTML or text format.
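The offline run can be sketched as a single command line placed in a batch script. The sketch below only prints that command. It assumes the DDT options --offline, --session=FILE, and --output=FILE, and that DDT can launch the program through srun; check ddt --help for the version installed on Eagle before relying on them. The file names debug.session and report.html, the program ./my_app, and the function name are hypothetical.

```shell
#!/bin/bash
# Print (without running) an offline DDT command line.
# Assumed options: --offline, --session=FILE, --output=FILE.
# Assumption: an output name ending in .html requests an HTML report,
# and one ending in .txt requests a plain-text report.
ddt_offline_command() {
  local session="$1" report="$2" ntasks="$3" app="$4"
  echo "ddt --offline --session=${session} --output=${report} srun -n ${ntasks} ${app}"
}

# Hypothetical names: debug.session, report.html, ./my_app.
ddt_offline_command debug.session report.html 3600 ./my_app
```

The printed command would go in a job script submitted with sbatch, after loading whichever module provides DDT on the system. Because no GUI is involved, the job can wait in the queue unattended and the report can be read after it finishes.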