2 machines, thousands of WU's, 10 credit in 36 hours. What's wrong?

marmot

Joined: 21 Apr 19
Posts: 25
Credit: 12,699
RAC: 0
Message 695 - Posted: 17 Apr 2021, 11:47:18 UTC
Last modified: 17 Apr 2021, 11:54:53 UTC

Machine 1 (https://boinc.nanohub.org/nanoHUB_at_home/results.php?hostid=1675) stayed with Virtual Box 5.1.28 because Boinc2Docker hasn't been updated in over a year (I learned that from Kryptos@Home earlier this year; see their forums). It has run about 700 tasks.

39 valid, 630 invalid and about 30 errors.

I know the machine ran out of RAM (even though BOINC auto-limited it to 8 WU's), but I manually reduced the WU's to 4, and the process manager reports that total commits never broke 31.4GB (32GB of RAM installed), so it would be barely into the swap.

These WU's thrash the drives terribly upon copying a fresh VM to each new BOINC data slot folder.

The logs for the invalid tasks don't give a clue as to why they are invalid (unless I missed something; the errors all seem to be normal Boinc2Docker complaints about using a RAM drive).

Machine 2 (https://boinc.nanohub.org/nanoHUB_at_home/results.php?hostid=1687) is updated to the Virtual Box release from March 2021 (6.1.2?) and wouldn't run Kryptos@Home Boinc2Docker WU's. It seems to be running nanoHUB@home, but no valids are shown, there are 445 slots in the BOINC data directory, and WUProps says it can't find any more drive storage even though the OS reports the drive has 143GB left.
I think the naming index is full, so I'm doing maintenance after completely deleting all BOINC data slots (bye bye ATLAS WU's *sigh*).

Something about VBoxManage ver 6.x.x with Boinc2Docker isn't removing the old WU's, so project data slots are building up under BOINC.
Also, the VBox manager had hundreds of invalid VM's that had to be removed.


Summary:
I thought it'd be nice to get my 100k credit but the drive thrashing causes too much wear on the equipment, especially since the WU's are only a few minutes long. Only getting 40 valid WU's out of 660 for a total of 10 (TEN) credits in 18 hours of trying is just a ridiculous return for my effort.

I'm going to finish maintenance then limit this project to 1 WU at a time and 0 resource share and see if things improve by Tuesday.
ID: 695 · Rating: 0
[VENETO] boboviz
Joined: 7 Apr 17
Posts: 60
Credit: 26,471
RAC: 0
Message 696 - Posted: 17 Apr 2021, 11:58:17 UTC - in response to Message 695.  

Summary:
I thought it'd be nice to get my 100k credit but the drive thrashing causes too much wear on the equipment, especially since the WU's are only a few minutes long. Only getting 40 valid WU's out of 660 for a total of 10 (TEN) credits in 18 hours of trying is just a ridiculous return for my effort.
I'm going to finish maintenance then limit this project to 1 WU at a time and 0 resource share and see if things improve by Tuesday.


Seems they have some work to do on the app...
ID: 696 · Rating: 0
Jim1348
Joined: 11 Jan 17
Posts: 99
Credit: 224,673
RAC: 0
Message 698 - Posted: 17 Apr 2021, 12:16:14 UTC - in response to Message 695.  

These WU's thrash the drives terribly upon copying a fresh VM to each new BOINC data slot folder.

I posted on that shortly after the project started several years ago. They will kill an SSD fast. I have never seen such high writes.
In fact, I measured it a couple of days ago on an i9-10900F (the low-power version, not the fastest) running four virtual cores on nanoHUB under Ubuntu 20.04.2.
The writes were 52 GB/hour, or 1.25 TB/day. I never allow more than 70 GB/day, though that may be a bit conservative.
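
A quick check of that arithmetic, as a minimal Python sketch. The 52 GB/hour and 70 GB/day figures are from the post; the decimal convention (1000 GB = 1 TB) and the function names are assumptions made for illustration.

```python
# Figures quoted in the post: 52 GB/hour measured, 70 GB/day personal budget.
GB_PER_TB = 1000  # assumption: decimal units, as drive endurance is usually quoted


def daily_writes_gb(gb_per_hour):
    """Extrapolate an hourly write rate to a full 24-hour day."""
    return gb_per_hour * 24


def budget_ratio(gb_per_hour, budget_gb_per_day):
    """How many times over the daily write budget the measured rate runs."""
    return daily_writes_gb(gb_per_hour) / budget_gb_per_day


daily = daily_writes_gb(52)
print(f"{daily:.0f} GB/day = {daily / GB_PER_TB:.2f} TB/day")
print(f"{budget_ratio(52, 70):.1f}x over a 70 GB/day budget")
```

So the measured rate is roughly 18 times the stated daily budget, which is why some form of write cache comes up next.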

But I always use a write-cache anyway. On that machine, I set it to 8 GB and 1 hour latency (write delay), using the built-in cache that Linux provides.
https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
You would probably be OK with less memory, maybe 2 GB and 10 minutes latency, since the work units are so short, but use something.
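
For readers who want a starting point, here is a hedged Python sketch that turns a cache size and write delay into Linux `vm.dirty_*` sysctl values. The four sysctl names are real kernel tunables, but mapping "8 GB and 1 hour" onto them this way is an assumption (the post does not list the exact settings), and setting the background threshold to half the cache size is an arbitrary illustrative choice.

```python
def dirty_cache_sysctls(cache_gb, delay_minutes):
    """Map a cache size and write delay onto Linux vm.dirty_* tunables.

    Assumption: the post's "cache size" corresponds to vm.dirty_bytes and
    its "latency" to the expire/writeback intervals (in centiseconds).
    """
    cache_bytes = int(cache_gb * 1024**3)
    delay_cs = int(delay_minutes * 60 * 100)  # minutes -> centiseconds
    return {
        "vm.dirty_bytes": cache_bytes,
        # Illustrative choice: start background flushing at half the cache.
        "vm.dirty_background_bytes": cache_bytes // 2,
        "vm.dirty_expire_centisecs": delay_cs,
        "vm.dirty_writeback_centisecs": delay_cs,
    }


def as_sysctl_conf(settings):
    """Render the settings as lines suitable for a sysctl.conf-style file."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items())


# The two configurations mentioned in the post.
print(as_sysctl_conf(dirty_cache_sysctls(8, 60)))
print()
print(as_sysctl_conf(dirty_cache_sysctls(2, 10)))
```

Long write delays keep more unwritten data in RAM, so anything still cached is lost on a power failure; that trade-off is easier to accept for short work units that can simply be rerun.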

On Windows, I use PrimoCache. You need it.
https://www.romexsoftware.com/en-us/primo-cache/index.html
ID: 698 · Rating: 0
marmot
Joined: 21 Apr 19
Posts: 25
Credit: 12,699
RAC: 0
Message 699 - Posted: 17 Apr 2021, 13:16:54 UTC - in response to Message 698.  
Last modified: 17 Apr 2021, 13:31:29 UTC


On Windows, I use PrimoCache. You need it.
https://www.romexsoftware.com/en-us/primo-cache/index.html


One server is shut down with the warmer seasons approaching.
I'll fire it back up today, install PrimoCache on it, have it run 8 NanoHub WU's at once, and see how things go.
Thanks.

I'm going to stop all NanoHub work on the other 2 machines.

-------
Oh, a correction: Machine 2 (https://boinc.nanohub.org/nanoHUB_at_home/results.php?hostid=1687) was downgraded to VBox 5.2.44 so it could do Kryptos@Home.
ID: 699 · Rating: 0
marmot
Joined: 21 Apr 19
Posts: 25
Credit: 12,699
RAC: 0
Message 701 - Posted: 17 Apr 2021, 18:07:41 UTC - in response to Message 698.  


On Windows, I use PrimoCache. You need it.
https://www.romexsoftware.com/en-us/primo-cache/index.html


OK, so the performance increase is amazing. Thanks.

The test machine is running 12 NanoHub WU's in 32GB RAM. (That should be too many, but they are peaking at different times and not actually using all of the 3GB RAM requested in the VM settings.)
The RAM cache is peaking at about 5GB.

My total credit is at 5700 as I post.
We'll see what it hits by Monday.

Most of the projects I'm trying to get 100k hours on have WU's that are RAM hogs (Rosetta, YoYo ECM) but not disk intensive, so a RAM cache seemed optional, and leaving it out let me fit as many WU's in RAM as possible.
ID: 701 · Rating: 0
Fardringle
Joined: 11 Jan 19
Posts: 6
Credit: 254,285
RAC: 0
Message 703 - Posted: 17 Apr 2021, 23:27:20 UTC

The project was actually running quite well for the first few days after new work became available, but it appears that all of the tasks that failed previously are being sent out again now, and almost everything is failing at this point since there aren't any more good tasks.
ID: 703 · Rating: 0
marmot
Joined: 21 Apr 19
Posts: 25
Credit: 12,699
RAC: 0
Message 704 - Posted: 18 Apr 2021, 1:13:26 UTC - in response to Message 703.  
Last modified: 18 Apr 2021, 1:14:40 UTC

The project was actually running quite well for the first few days after new work became available, but it appears that all of the tasks that failed previously are being sent out again now, and almost everything is failing at this point since there aren't any more good tasks.



Yours are showing Compute Errors; my bad WU's are marked Invalid. (You also have a machine configured with a 1060 3GB / R9 280.)

I've added RAM cache to all 3 servers but the two that were running WU's yesterday are only getting Invalid results, even though they appear to be running much more reliably (creating the entry in Virtual Box quickly, then entering computation quickly, and the drive thrashing is greatly reduced).

The third machine, which just got WU's this afternoon, has 231 Valid, 0 Invalid and 3 Compute Errors.

I'll reset the project on the first two machines and see if that has a positive result.
ID: 704 · Rating: 0
Fardringle
Joined: 11 Jan 19
Posts: 6
Credit: 254,285
RAC: 0
Message 705 - Posted: 18 Apr 2021, 1:45:54 UTC - in response to Message 704.  
Last modified: 18 Apr 2021, 2:31:27 UTC

All of my recent failed tasks were failed by at least one other person previously. That's why I said they are re-sends. And the project is out of work now so it's only sending back out tasks that have failed or that are aborted/reset.
ID: 705 · Rating: 0
Jim1348
Joined: 11 Jan 17
Posts: 99
Credit: 224,673
RAC: 0
Message 706 - Posted: 19 Apr 2021, 20:00:31 UTC - in response to Message 705.  

Most of mine are _1 through _4. But I am getting a number of _0 too, so it appears that they are generating new work.
It may not show up on the server list, since the demand about equals supply.
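
To make the `_0` versus `_1` through `_4` distinction concrete, here is a small Python sketch that reads the trailing index off a BOINC result name. It assumes the project issues a single initial copy per work unit, so that any index above 0 is a resend; the sample task names are hypothetical.

```python
import re


def attempt_index(result_name):
    """Return the trailing _N index of a BOINC result name."""
    match = re.search(r"_(\d+)$", result_name)
    if match is None:
        raise ValueError(f"no trailing _N index in {result_name!r}")
    return int(match.group(1))


def is_resend(result_name):
    """Assumption: one initial copy per work unit, so index > 0 is a resend."""
    return attempt_index(result_name) > 0


# Hypothetical task names, for illustration only.
for name in ["example_wu_12345_0", "example_wu_12345_3"]:
    label = "resend" if is_resend(name) else "new work"
    print(f"{name}: {label}")
```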
ID: 706 · Rating: 0
marmot
Joined: 21 Apr 19
Posts: 25
Credit: 12,699
RAC: 0
Message 707 - Posted: 20 Apr 2021, 5:52:40 UTC - in response to Message 705.  
Last modified: 20 Apr 2021, 6:50:52 UTC

All of my recent failed tasks were failed by at least one other person previously. That's why I said they are re-sends. And the project is out of work now so it's only sending back out tasks that have failed or that are aborted/reset.


I do see that most of my tasks are also ending in Computation Error and are the 3rd to 5th attempt from prior sends.

But my one machine, dedicated to only running these WU's, has completed 3 that were previously errors on other computers.
(example: https://boinc.nanohub.org/nanoHUB_at_home/workunit.php?wuid=8021849).


I think the RAM cache is crucial and I'm going to reduce my WU's from 10 at once to 8 and see if the reduced load improves the error rate.
Are these WU's extremely time dependent, so that they must finish in under 30 minutes? (I'm now aborting any that have run over 30.)

At least, for the 1st time ever, I broke 2000 credit in a day.
Yeah!
ID: 707 · Rating: 0
Fardringle
Joined: 11 Jan 19
Posts: 6
Credit: 254,285
RAC: 0
Message 712 - Posted: 22 Apr 2021, 1:29:53 UTC

There are still some bad resends being issued, but there is also a decent amount of good new work available again, so the ratio of good to bad work units should improve dramatically.
ID: 712 · Rating: 0
marmot
Joined: 21 Apr 19
Posts: 25
Credit: 12,699
RAC: 0
Message 714 - Posted: 22 Apr 2021, 21:02:09 UTC
Last modified: 22 Apr 2021, 21:03:46 UTC

All it takes is one corrupted VM in the Virtual Box manager and every one of these WU's is then Invalid.

You need to kill the VBox service, start the management window, remove the corrupted entries, and then restart the computer.
A restart is not usually needed, but here I could not get valid WU's until after one.
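
The manual cleanup above could be partly scripted. The Python sketch below parses the output of `VBoxManage list vms` and prepares `VBoxManage unregistervm` commands for entries shown as `<inaccessible>`. Treating the corrupted entries from the post as VirtualBox's `<inaccessible>` entries is an assumption, and the sample output is made up. The commands are only printed, not run.

```python
import re

# Each line of `VBoxManage list vms` looks like: "name" {uuid}
VM_LINE = re.compile(r'^"(?P<name>.*)" \{(?P<uuid>[0-9a-fA-F-]{36})\}$')


def inaccessible_vm_uuids(list_vms_output):
    """Collect UUIDs of VMs that VirtualBox reports as <inaccessible>."""
    uuids = []
    for line in list_vms_output.splitlines():
        match = VM_LINE.match(line.strip())
        if match and match.group("name") == "<inaccessible>":
            uuids.append(match.group("uuid"))
    return uuids


def unregister_commands(uuids):
    """Build (but do not run) one unregistervm command per stale entry."""
    return [["VBoxManage", "unregistervm", uuid] for uuid in uuids]


# Hypothetical sample output, for illustration only.
SAMPLE_LIST = '''"boinc_example_vm" {11111111-2222-3333-4444-555555555555}
"<inaccessible>" {aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee}
'''

for command in unregister_commands(inaccessible_vm_uuids(SAMPLE_LIST)):
    print(" ".join(command))
```

On a real machine the text would come from something like `subprocess.run(["VBoxManage", "list", "vms"], capture_output=True, text=True).stdout`, run as the same user that BOINC uses for its VMs.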

This means that any computer running these WU's will need a daily restart and VBox cleanup check. (This project has been the most annoying of all the ones I've done in the last 5 years.)

I added a script that kills any WU running longer than 3000 seconds. The error rate on those is over 90%, so they are not worth the risk of wasted time.
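
The script itself was not posted. As one possible shape for it, the Python sketch below parses `boinccmd --get_tasks` style output and builds `boinccmd --task <url> <name> abort` commands for tasks past the 3000 second limit. The field names (`name`, `project URL`, `elapsed task time`) are assumptions about the output format and should be checked against your own client; the sample text and task names are hypothetical.

```python
def overdue_tasks(get_tasks_output, project_url_part, limit_seconds=3000):
    """Return (project_url, task_name) pairs for tasks over the time limit.

    Assumes each task block starts with a "name:" line and carries
    "project URL:" and "elapsed task time:" fields.
    """
    tasks, current = [], None
    for raw in get_tasks_output.splitlines():
        line = raw.strip()
        if ": " not in line:
            continue  # separators such as "1) -----------"
        key, value = line.split(": ", 1)
        if key == "name":
            current = {"name": value}
            tasks.append(current)
        elif current is not None:
            current[key] = value

    overdue = []
    for task in tasks:
        url = task.get("project URL", "")
        elapsed = float(task.get("elapsed task time", 0))
        if project_url_part in url and elapsed > limit_seconds:
            overdue.append((url, task["name"]))
    return overdue


def abort_commands(overdue):
    """Build (but do not run) one boinccmd abort command per overdue task."""
    return [["boinccmd", "--task", url, name, "abort"] for url, name in overdue]


# Hypothetical sample in the assumed format, for illustration only.
SAMPLE_TASKS = """\
1) -----------
   name: example_task_a_0
   WU name: example_task_a
   project URL: https://boinc.nanohub.org/nanoHUB_at_home/
   elapsed task time: 3412.500000
2) -----------
   name: example_task_b_1
   WU name: example_task_b
   project URL: https://boinc.nanohub.org/nanoHUB_at_home/
   elapsed task time: 640.000000
"""

for command in abort_commands(overdue_tasks(SAMPLE_TASKS, "nanohub.org")):
    print(" ".join(command))
```

Keeping the parsing and the command building separate from any actual `subprocess` call makes it easy to dry-run the selection before letting it abort anything.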
ID: 714 · Rating: 0