Webinar 2021

From SHARCNETHelp

Compute Canada staff monitor job record summaries to identify accounts whose compute jobs are experiencing unusual wait times in the scheduler queue. Staff then assess each identified account to determine whether the job submission can be changed to reduce the wait, whether an alternative solution exists, or whether the scheduler configuration should be adjusted to accommodate a wider range of workloads. Most of the identified cases can be resolved by modifying the job submission parameters that affect the number of nodes a job can access. This presentation begins with the operational definition used to identify “unusual wait times”, explains how jobs relate to node partitions, and then describes best practices for recognizing when wait times can be reduced by changing sbatch parameters. It includes practical examples of exploring the state of node partitions on the clusters and of the job submission parameters that affect the number of nodes available to a job.
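
The link between a requested walltime and the number of nodes a job can access can be sketched in a few lines of Python. This is only an illustration, assuming walltime-tiered partitions in which nodes that accept longer jobs also accept shorter ones. The tier limits and node counts in `TIERS`, and the helper names `parse_walltime` and `eligible_nodes`, are hypothetical and are not taken from the presentation or from any cluster's actual configuration.

```python
# Hypothetical walltime tiers: (maximum walltime in hours, nodes whose limit is that tier).
# These numbers are invented for illustration and do not describe a real cluster.
TIERS = [(3, 400), (12, 300), (24, 200), (72, 100), (168, 60), (672, 40)]


def parse_walltime(spec: str) -> int:
    """Convert a Slurm-style --time value to seconds.

    Accepted forms: minutes, minutes:seconds, hours:minutes:seconds,
    days-hours, days-hours:minutes, days-hours:minutes:seconds.
    """
    days = 0
    if "-" in spec:
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        if len(fields) > 3:
            raise ValueError(f"unrecognized time format: {spec!r}")
        # Pad missing minutes/seconds with zeros.
        fields += [0] * (3 - len(fields))
        hours, minutes, seconds = fields
    else:
        fields = [int(f) for f in spec.split(":")]
        if len(fields) == 1:        # minutes
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:      # minutes:seconds
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:      # hours:minutes:seconds
            hours, minutes, seconds = fields
        else:
            raise ValueError(f"unrecognized time format: {spec!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def eligible_nodes(spec: str, tiers=TIERS) -> int:
    """Count nodes in every tier whose walltime limit covers the request."""
    requested = parse_walltime(spec)
    return sum(nodes for limit_hours, nodes in tiers
               if limit_hours * 3600 >= requested)


if __name__ == "__main__":
    for request in ("3:00:00", "3:00:01", "1-0", "7-00:00:00"):
        print(f"--time={request:>10}  eligible nodes: {eligible_nodes(request)}")
```

Under these made-up numbers, a request one second over a tier boundary loses access to every node in the tier below it, which is why trimming a walltime estimate to fit a shorter tier can widen the pool of nodes the scheduler may use.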