I have a special node, an SGI UV300, which recently (a couple of months ago) started having an issue with jobs completing. After a reboot, things work great for some period of time, days or in some cases a few weeks, until jobs begin to fail to complete and the slurmstepd processes leave behind zombie sleep processes.

Using pstree, here is an example of a job that has exited, but its slurmstepd is left behind:

```
~]# pstree -a -c -p -S 421660
```

The zombies seem to result after this message is logged:

```
Aug 9 20:14:12 sgiuv300-srcf-d10-01 slurmstepd: error: *** EXTERN STEP FOR 3092791 STEPD TERMINATED ON sgiuv300-srcf-d10-01 AT T20:14:11 DUE TO JOB NOT ENDING WITH SIGNALS ***
```

And for the wait4() slurmstepd, the slurm logs look completely normal. Since none of our other nodes are having this behavior, I'm inclined to blame the UV300 (a node which is not my friend). But the problem doesn't affect anything else on the machine, just stuff started via SLURM. The first instance of this problem I can recall was with 17.11.7 (it may have happened before and gone unnoticed, as the system does get rebooted more often than I'd like). The node was upgraded to CentOS 7.5 from 7.4, but worked fine for a while after that upgrade. Since the end result is slurmstepd in wait4() or zombie sleep processes, I'm not quite sure where to prod for more information. Suggestions for the next troubleshooting step are appreciated.

> This issue is very similar to your other bug 5443.

Sorry, I should have mentioned this is a different cluster. Kilian's bug is with Sherlock; this system is the SCG cluster. You can think of us as a poor little system that lives in subsidised housing on Sherlock's street. It's possible they share underlying causes, but the behavior I'm seeing is different: my jobs either complete but leave behind the zombie sleeps, or they stay in Running state forever, even after the slurmstepd has been killed (again leaving behind an optional zombie sleep).

> It would be useful to get a "thread apply all bt full" to see the full dump.

In any case I guess the process failed to send the proper signal to slurmstepd for some reason, so the bt may be correct.

Here's a sample; I'll need to increase loglevels and send more later. Job 3092804 ended, but its slurmstepd is still hanging around and it's not in completing state:

```
srv/logs/messages-20180810:Aug 10 10:07:24 sgiuv300-srcf-d10-01 slurmd: _handle_stray_script: Purging vestigial job script /var/spool/slurm/slurmd/job3092804/slurm_script
srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld: Batch JobId=3092804 missing from node 0 (not found BatchStartTime after startup), Requeuing job
srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld: _job_complete: JobID=3092804 State=0x1 NodeCnt=1 cancelled by node failure
srv/logs/messages-20180810:Aug 10 10:07:39 frankie slurmctld: _job_complete: requeue JobID=3092804 State=0x8000 NodeCnt=1 due to node failure
```
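For reference, the backtrace requested above can be captured non-interactively with gdb. This is a minimal sketch; the PID (421660, taken from the pstree example) is illustrative and would need to be replaced with the actual stuck slurmstepd's PID:

```bash
# Attach to the hung slurmstepd, dump a full backtrace of every thread,
# then detach. 421660 is the example PID from the pstree output above.
gdb -p 421660 -batch \
    -ex 'set pagination off' \
    -ex 'thread apply all bt full' > slurmstepd-421660-bt.txt
```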
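To confirm where a stepd is actually blocked before reaching for gdb, the /proc interface is usually enough. A sketch, again assuming the example PID; note that /proc/&lt;pid&gt;/stack requires root and a kernel built with stack-trace support:

```bash
# Syscall the task is currently executing: the first field is the syscall
# number (61 == wait4 on x86_64), followed by its arguments.
cat /proc/421660/syscall

# Kernel-side stack of the blocked task (root only).
cat /proc/421660/stack

# List zombie children still parented to the stepd.
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/ && $2 == 421660'
```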
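On the "increase loglevels" point: one way to raise verbosity without a restart is scontrol, though this only affects slurmctld; slurmd's own level is set via SlurmdDebug in slurm.conf. A sketch, assuming a 17.11-era Slurm:

```bash
# Raise slurmctld verbosity on the fly (reverts at the next restart unless
# SlurmctldDebug is also changed in slurm.conf).
scontrol setdebug debug2

# Add step-tracking debug output; Steps is a standard DebugFlag.
scontrol setdebugflags +steps
```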