Webinar 2018 All about job wait times in the Graham queue


Job wait times in the scheduling queue of large, shared, and heavily allocated general-purpose systems like Graham and Cedar can be a non-trivial portion of a research project's time-to-result. Learning to estimate wait times appropriately, and adopting strategies to minimize them, can be a valuable part of a researcher's workflow.

The Slurm scheduler on the Compute Canada general-purpose clusters Cedar and Graham is configured to accommodate a heterogeneous job load on heterogeneous hardware. Traditionally, SHARCNET systems were relatively homogeneous (the nodes on a given system had similar properties), and the heterogeneity of researchers' workloads was handled across systems. For example, GPU nodes were on an isolated system (Monk) with its own scheduler; similarly, large-memory nodes were on an isolated system (Redfin) with its own scheduler. On the national general-purpose systems, base nodes, GPU nodes, and large-memory nodes are all within the same cluster and are serviced by the same scheduler instance. The current configuration of the Slurm scheduler on the general-purpose systems uses partitions to accommodate this heterogeneity of job load and hardware. Understanding how these systems are partitioned, as well as how usage is prioritized and billed, can help establish appropriate wait time estimates and minimize job wait times.
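As a rough sketch of how to explore this layout yourself, the standard Slurm commands below list the partitions and the node properties they carve up (exact partition names and output columns vary by cluster and Slurm version):

```
# Summarize the partitions: names, availability, time limits, node counts.
sinfo --summarize

# Show the full configuration of each partition
# (limits, priority tiers, allowed accounts and QOS).
scontrol show partition

# List per-node properties (CPUs, memory, features such as GPU types)
# to see the hardware heterogeneity the partitions divide up.
sinfo --Node --long
```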

This webinar describes the layout of the partitions on Graham, as well as methods for examining relevant parameters of the scheduling environment and the queue.
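By way of illustration, the following standard Slurm commands (not necessarily the exact examples used in the webinar) are typical starting points for examining the queue and one's own scheduling priority:

```
# Pending and running jobs for the current user, with the scheduler's
# estimated start times for pending jobs (estimates, not guarantees).
squeue -u $USER --start

# Decompose the priority of pending jobs into its weighted factors
# (age, fair-share, job size, partition, QOS).
sprio -u $USER

# Fair-share standing of your account(s); a low fair-share value
# generally translates into longer wait times.
sshare -l -u $USER
```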