Long work units run better with Linux (and so do short ones)

Jim1348

Joined: 11 Jan 17
Posts: 98
Credit: 224,673
RAC: 9
Message 708 - Posted: 20 Apr 2021, 13:18:54 UTC
Last modified: 20 Apr 2021, 13:55:59 UTC

I have noticed some long (over 1 hour, and even over 2 hours) work units in the last day or so that completed successfully. Those always used to error out for me. But in checking the results, I noticed that they had all failed on several other machines, all running Windows 10. So it appears that Windows 10 VirtualBox does not run the same as Linux VirtualBox.
https://boinc.nanohub.org/nanoHUB_at_home/results.php?hostid=11493&offset=0&show_names=0&state=4&appid=

Then I looked at some of the short ones too, which ran for the usual couple of minutes. I don't see any valid Windows ones there either.

So I will leave this machine running. It seems to be one of the few that works. Good luck to all.

PS - I am seeing only the ones that failed on Windows. Apparently "validate" does not mean comparing two machines, as on most projects. I get a failure rate of about 10%. Maybe someone with Windows could report on what they see.
marmot

Joined: 21 Apr 19
Posts: 25
Credit: 12,699
RAC: 0
Message 709 - Posted: 21 Apr 2021, 2:04:34 UTC
Last modified: 21 Apr 2021, 2:06:12 UTC

Boinc2Docker doesn't work well with VirtualBox 6.x.x on Windows machines.

I dropped back to VBox 5 on all machines that run Boinc2Docker.
The machine that's actually running them most reliably is VBox 5.1.26 on Windows 8.1 with HDDTurbo RAM cache added (thanks, Jim, for the tip).

I went through the valids and found only 6 over 5k seconds that completed in the last 2 days (but I have been aborting many that went over 59 mins):
https://boinc.nanohub.org/nanoHUB_at_home/workunit.php?wuid=7960858
https://boinc.nanohub.org/nanoHUB_at_home/workunit.php?wuid=7937999
https://boinc.nanohub.org/nanoHUB_at_home/workunit.php?wuid=8021573
https://boinc.nanohub.org/nanoHUB_at_home/workunit.php?wuid=8049227
https://boinc.nanohub.org/nanoHUB_at_home/workunit.php?wuid=7919527
https://boinc.nanohub.org/nanoHUB_at_home/workunit.php?wuid=8047507

But I am nervous about letting them run past 1000 seconds, as 79 tasks that I did not abort, and that ran over 3000 seconds in the last 2 days, got compute errors.

The failure rate of the long runs is 79/85 = 93%.
Too much wasted computing time.
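
For anyone who wants to repeat that arithmetic on their own host, here is a minimal Python sketch. It assumes the 85 is simply the 79 compute errors plus the 6 long valids linked above; the function name `failure_rate` is made up for illustration and is not part of BOINC.

```python
def failure_rate(failed: int, succeeded: int) -> float:
    """Return the fraction of finished tasks that failed."""
    total = failed + succeeded
    if total == 0:
        raise ValueError("no finished tasks to measure")
    return failed / total


# Counts taken from this post: 79 compute errors, 6 long valids (assumed total 85).
rate = failure_rate(failed=79, succeeded=6)
print(f"Long-run failure rate: {rate:.0%}")
```

Swapping in your own counts of errored and valid long tasks gives a comparable number for other hosts.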

I'll allow 1 long run to complete in each batch of 8 WUs running and see if the failure rate improves.
Jim1348

Joined: 11 Jan 17
Posts: 98
Credit: 224,673
RAC: 9
Message 711 - Posted: 21 Apr 2021, 5:59:59 UTC - in response to Message 709.  

Thanks for the info. I really did not have the statistics.
But you might try letting the longer ones just run until they time out. I am seeing longer ones that actually work, which is a change.
marmot

Joined: 21 Apr 19
Posts: 25
Credit: 12,699
RAC: 0
Message 713 - Posted: 22 Apr 2021, 7:22:41 UTC - in response to Message 711.  
Last modified: 22 Apr 2021, 7:51:53 UTC

I'm seeing massive numbers of invalids today that are completed by others.

Not sure why this happens.
Going to shut down VBox service and restart it to see if it's a management issue.

EDIT 1: This might be the cause of a string of invalids. After VBox was restarted there was a missing media ('?' icon VM) NanoHub task in the VBox manager that had to be removed.
Is it possible that the entire project has its slot indexes corrupted so it is attempting to interact with the wrong data\slot folder?
If all the WUs still return invalid then I might have to manually delete the slot folders.

EDIT 2: There are just 16 slots and the 1st 9 filled and the rest are empty, so they look fine. (One of the other machines getting all invalids had 450 data\slot folders filled with NanoHub WU remnants.)
SAS RAID fragmentation is minimal at 0.45%, which is also an improvement over the other machine with the 450 BOINC data\slots folders. Its fragmentation had risen to 15.5%.
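
A rough way to do the same slot check without clicking through folders is sketched below in Python. It assumes the slot folders are numbered subdirectories of a `slots` folder inside the BOINC data directory; the function name `summarize_slots` and the example path are hypothetical, so point it at wherever your own data directory lives.

```python
from pathlib import Path


def summarize_slots(slots_dir):
    """Split the subfolders of slots_dir into filled and empty ones.

    Returns two lists of folder names: (filled, empty).
    Plain files sitting directly in slots_dir are ignored.
    """
    filled, empty = [], []
    for entry in sorted(Path(slots_dir).iterdir()):
        if not entry.is_dir():
            continue
        # A slot counts as filled if it contains at least one file or folder.
        if any(entry.iterdir()):
            filled.append(entry.name)
        else:
            empty.append(entry.name)
    return filled, empty


# Example call (the path is an assumption, adjust it for your install):
# filled, empty = summarize_slots(r"C:\ProgramData\BOINC\slots")
# print(len(filled), "filled,", len(empty), "empty")
```

A host showing hundreds of filled slots while only a handful of tasks are running would match the leftover-remnants situation described above.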


