Request for feedback

[SG-FC] oki

Joined: 28 Feb 19
Posts: 2
Credit: 19,308
RAC: 0
Message 235 - Posted: 24 Mar 2019, 10:47:30 UTC

Hello

As long as only one or two workunits are running on my PC, it works fine.

But with 8 concurrent WUs my machine freezes; I had to unplug the power cord and reboot to get it going again. (Yes, I am sure it was nanoHUB...)

I would like to install an app_config.xml with <max_concurrent>4</max_concurrent> but
don't know the app name.

Does anybody have a working app_config.xml file?

Best Regards.
Henk Haneveld

Joined: 20 Nov 18
Posts: 20
Credit: 5,560
RAC: 109
Message 236 - Posted: 24 Mar 2019, 13:09:14 UTC - in response to Message 235.  
Last modified: 24 Mar 2019, 13:09:49 UTC

Hello

As long as only one or two workunits are running on my PC, it works fine.

But with 8 concurrent WUs my machine freezes; I had to unplug the power cord and reboot to get it going again. (Yes, I am sure it was nanoHUB...)

I would like to install an app_config.xml with <max_concurrent>4</max_concurrent> but
don't know the app name.

Does anybody have a working app_config.xml file?

Best Regards.


<app_config>
<app>
<name>boinc2docker</name>
<max_concurrent>4</max_concurrent>
</app>
</app_config>
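
For anyone adapting this file: one way to catch a malformed tag (for example a closing tag missing its slash) before BOINC reads it is to parse the file first. The sketch below is only an illustration. The `check_app_config` helper is made up for this example and is not part of BOINC, and the XML string simply repeats the file above.

```python
import xml.etree.ElementTree as ET

# The app_config.xml content from the post above.
APP_CONFIG = """<app_config>
<app>
<name>boinc2docker</name>
<max_concurrent>4</max_concurrent>
</app>
</app_config>"""


def check_app_config(text):
    """Return {app name: max_concurrent}, or raise ValueError if the XML is bad.

    Hypothetical helper for illustration: it checks well-formedness and
    only the two elements used in this thread.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError as exc:
        raise ValueError(f"app_config.xml is not well-formed: {exc}") from exc
    if root.tag != "app_config":
        raise ValueError(f"unexpected root element: {root.tag}")
    limits = {}
    for app in root.findall("app"):
        name = app.findtext("name")
        limit = app.findtext("max_concurrent")
        if name is None or limit is None:
            raise ValueError("each <app> needs <name> and <max_concurrent>")
        limits[name.strip()] = int(limit)
    return limits


if __name__ == "__main__":
    print(check_app_config(APP_CONFIG))
```

A string with an unclosed `<max_concurrent>` tag makes the helper raise `ValueError` instead of returning a limit.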
[SG-FC] oki

Joined: 28 Feb 19
Posts: 2
Credit: 19,308
RAC: 0
Message 237 - Posted: 24 Mar 2019, 14:48:29 UTC - in response to Message 236.  

[nanoHUB_at_home] Found app_config.xml

Thank you!
adrianxw

Joined: 8 Jan 19
Posts: 20
Credit: 2,501
RAC: 0
Message 238 - Posted: 24 Mar 2019, 20:40:45 UTC

I noticed another bad one today. I'm sure they know about this and will fix it; I have set No New Tasks (NNT) for the present.
adrianxw

Joined: 8 Jan 19
Posts: 20
Credit: 2,501
RAC: 0
Message 253 - Posted: 12 Apr 2019, 4:16:34 UTC

I stayed away for a few weeks, came back expecting better, but overnight...

197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED

... more of these. They have been failed by everyone who has had them and are obviously resends, but the first send date was only a few days ago, so they are not hangovers from weeks ago. No new tasks set AGAIN.
Jim1348

Joined: 11 Jan 17
Posts: 46
Credit: 143,698
RAC: 2
Message 254 - Posted: 12 Apr 2019, 12:25:13 UTC - in response to Message 253.  
Last modified: 12 Apr 2019, 12:47:15 UTC

If a very small number of errors bothers you, then you really should find another project. (Try WCG).
They are running really great for me.
adrianxw

Joined: 8 Jan 19
Posts: 20
Credit: 2,501
RAC: 0
Message 255 - Posted: 13 Apr 2019, 8:33:38 UTC - in response to Message 254.  
Last modified: 13 Apr 2019, 8:47:25 UTC

Reporting errors is potentially useful to the project team. They can see what tasks errored, on what machine, under which operating system etc. They probably test on various systems, but no project can try everything.

My systems are connected to about a dozen projects at any one time. I do not crunch WCG; a poor moderating decision in their forum was the start of a small exodus from the project about a year ago.

>>> If a very small number of errors

· Valid (591) · Invalid (188) · Error (30)

The bad figures here are probably more than the sum of all the other projects I have been connected to since 1999.
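
As a rough check of what those counts amount to, here is a small sketch that uses only the three numbers listed above (valid, invalid, error) and works out the shares:

```python
# Task counts quoted in the post above.
valid, invalid, error = 591, 188, 30

total = valid + invalid + error
bad = invalid + error

# Shares of all returned tasks, in percent.
invalid_pct = 100 * invalid / total
error_pct = 100 * error / total
bad_pct = 100 * bad / total

print(f"total tasks: {total}")
print(f"invalid: {invalid_pct:.1f}%  error: {error_pct:.1f}%  combined: {bad_pct:.1f}%")
```

With these counts, 218 of 809 tasks (roughly 27%) were invalid or errored.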
Jim1348

Joined: 11 Jan 17
Posts: 46
Credit: 143,698
RAC: 2
Message 256 - Posted: 13 Apr 2019, 12:45:27 UTC - in response to Message 255.  
Last modified: 13 Apr 2019, 13:00:28 UTC

OK, suit yourself. But they are still in the startup phase. I expect it is communicating the problems to all of their submitters. They do not generate the work themselves, but act as a clearinghouse for dozens or hundreds of scientists. They all have to get used to the requirements of Virtualbox, and use the right versions of software, etc. to generate work units. (They can explain it better than I can.)

The projects that cause me problems are the ones that hang up your machine with work units that never end. I have not seen that here. They have built-in timers that limit the run times to something on the order of an hour. And even the bad ones tell them something (about how to generate good ones). Go ahead and report, but I think they keep track of the errors themselves. The reports here are more for the benefit of other users who might be having similar problems.

While we are on the subject, Virtualbox is difficult to debug, since it hides what is going on inside it. It is used because the scientists use different software than we do, and it allows us to run their stuff. But it is notoriously difficult to work with, especially in a heterogeneous environment where a lot of different people are generating work on a lot of different systems. I automatically take that into account, having run just about every Vbox project there is.
Nonexistent_Admin
Volunteer moderator
Project administrator

Joined: 27 Sep 18
Posts: 58
Credit: 0
RAC: 0
Message 257 - Posted: 13 Apr 2019, 13:59:44 UTC - in response to Message 256.  

We do track the errors, and we greatly appreciate the error reports.
marmot

Joined: 21 Apr 19
Posts: 15
Credit: 4,656
RAC: 1
Message 261 - Posted: 22 Apr 2019, 8:23:13 UTC
Last modified: 22 Apr 2019, 9:04:21 UTC

I plan on aborting hundreds because the project sent down 1200 per machine.

Haven't gotten even 1 valid WU on either machine.
All invalid and about 1% errors.

The error occurred once on each machine in the last 24 hours and is:
197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED.


One machine, running 2 WUs concurrently, has 127 invalids; the other runs 1 at a time and has 47 invalids.

All appear to end successfully with:

2019-04-22 01:07:27 (3188): Guest Log: Running... 

2019-04-22 01:07:27 (3188): Guest Log: 07174842_032.boinc

2019-04-22 01:07:27 (3188): Guest Log: 07174842_032.sh

2019-04-22 01:09:04 (3188): Guest Log: boinc_app exited (0)

2019-04-22 01:09:04 (3188): Guest Log: Saving results...

2019-04-22 01:09:04 (3188): Guest Log: 07174842_032_output.tar.gz

2019-04-22 01:09:04 (3188): VM Completion File Detected.
2019-04-22 01:09:04 (3188): Powering off VM.
2019-04-22 01:09:06 (3188): Successfully stopped VM.
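
To put a number on how short the actual run is, the timestamps in the excerpt can be subtracted. This is a minimal sketch, assuming every line starts with the 19-character date and time stamp seen above; the helper names are illustrative only.

```python
from datetime import datetime

# Two lines copied from the log excerpt above.
LOG_LINES = [
    "2019-04-22 01:07:27 (3188): Guest Log: Running... ",
    "2019-04-22 01:09:04 (3188): Guest Log: boinc_app exited (0)",
]


def timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' stamp of a log line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def seconds_between(start_line, end_line):
    """Elapsed wall-clock seconds between two log lines."""
    return (timestamp(end_line) - timestamp(start_line)).total_seconds()


if __name__ == "__main__":
    print(seconds_between(LOG_LINES[0], LOG_LINES[1]))
```

For these two lines it gives 97 seconds between "Running..." and "boinc_app exited (0)".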



There are various warnings and errors throughout the log.
Early:
2019-04-22 01:02:55 (3188): Guest Log: BIOS: KBD: unsupported int 16h function 03

2019-04-22 01:02:55 (3188): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000 

2019-04-22 01:02:55 (3188): VM state change detected. (old = 'poweroff', new = 'running')
2019-04-22 01:03:05 (3188): Preference change detected
2019-04-22 01:03:05 (3188): Setting CPU throttle for VM. (100%)
2019-04-22 01:03:05 (3188): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 600 seconds))
2019-04-22 01:03:41 (3188): Guest Log: net.ipv4.ip_forward = 1

2019-04-22 01:03:41 (3188): Guest Log: sysctl: cannot stat /proc/sys/net/ipv6/conf/all/forwarding: No such file or directory

2019-04-22 01:03:41 (3188): Guest Log: sysctl: setting key "cannot stat %s": No such file or directory

2019-04-22 01:03:41 (3188): Guest Log: sysctl: "cannot stat %s" is an unknown key

2019-04-22 01:03:41 (3188): Guest Log: sysctl: setting key "cannot stat %s": No such file or directory

2019-04-22 01:03:41 (3188): Guest Log: Segmentation fault


This error happens 3/3:
2019-04-22 01:03:46 (3188): Guest Log: 00:00:00.061615 vminfo   Error: Unable to connect to system D-Bus (1/3): D-Bus not installed


Not sure what normal execution of the work app is but here's what logs show:

2019-04-22 01:03:52 (3188): Guest Log: -------------------

2019-04-22 01:03:52 (3188): Guest Log: Linking /etc/docker to /var/lib/boot2docker for persistence

2019-04-22 01:03:52 (3188): Guest Log: Waiting for Docker daemon to start...

2019-04-22 01:03:52 (3188): Guest Log: REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE

2019-04-22 01:03:52 (3188): Guest Log: Running boinc_app...

2019-04-22 01:03:52 (3188): Guest Log: Importing Docker image from BOINC...

2019-04-22 01:03:52 (3188): Guest Log: Filesystem                Size      Used Available Use% Mounted on

2019-04-22 01:03:52 (3188): Guest Log: tmpfs                     2.6G    164.2M      2.5G   6% /

2019-04-22 01:03:52 (3188): Guest Log: tmpfs                     1.5G         0      1.5G   0% /dev/shm

2019-04-22 01:03:52 (3188): Guest Log: cgroup                    1.5G         0      1.5G   0% /sys/fs/cgroup

2019-04-22 01:03:52 (3188): Guest Log: shared                  407.9G    121.6G    286.3G  30% /root/shared

2019-04-22 01:03:52 (3188): Guest Log: tmpfs                     2.6G    164.2M      2.5G   6% /var/lib/docker/aufs

2019-04-22 01:03:57 (3188): Guest Log: 00:00:10.078023 vminfo   Error: Unable to connect to system D-Bus (3/3): D-Bus not installed

2019-04-22 01:04:58 (3188): Guest Log: doing docker load...

2019-04-22 01:06:00 (3188): Guest Log: Filesystem                Size      Used Available Use% Mounted on

2019-04-22 01:06:00 (3188): Guest Log: tmpfs                     2.6G    631.9M      2.0G  23% /

2019-04-22 01:06:00 (3188): Guest Log: tmpfs                     1.5G         0      1.5G   0% /dev/shm

2019-04-22 01:06:00 (3188): Guest Log: cgroup                    1.5G         0      1.5G   0% /sys/fs/cgroup

2019-04-22 01:06:00 (3188): Guest Log: shared                  407.9G    121.5G    286.4G  30% /root/shared

2019-04-22 01:06:00 (3188): Guest Log: tmpfs                     2.6G    631.9M      2.0G  23% /var/lib/docker/aufs

2019-04-22 01:06:00 (3188): Guest Log:               total        used        free      shared  buff/cache   available

2019-04-22 01:06:00 (3188): Guest Log: Mem:           3007          48        2289         631         669        2289

2019-04-22 01:06:00 (3188): Guest Log: Swap:           701           0         701

2019-04-22 01:06:00 (3188): Guest Log: Building apps directory...

2019-04-22 01:07:21 (3188): Guest Log: Prerun diagnostics...

2019-04-22 01:07:21 (3188): Guest Log: REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE

2019-04-22 01:07:21 (3188): Guest Log: nanohub_apps_base   11                  42db2bf7db3d        16 months ago       458.1 MB

2019-04-22 01:07:26 (3188): Guest Log: CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

2019-04-22 01:07:26 (3188): Guest Log: 467.8M	/var/lib/docker

2019-04-22 01:07:27 (3188): Guest Log:               total        used        free      shared  buff/cache   available

2019-04-22 01:07:27 (3188): Guest Log: Mem:           3007          48        2287         631         671        2288

2019-04-22 01:07:27 (3188): Guest Log: Swap:           701           0         701

2019-04-22 01:07:27 (3188): Guest Log: Filesystem                Size      Used Available Use% Mounted on

2019-04-22 01:07:27 (3188): Guest Log: tmpfs                     2.6G    631.9M      2.0G  23% /

2019-04-22 01:07:27 (3188): Guest Log: tmpfs                     1.5G         0      1.5G   0% /dev/shm

2019-04-22 01:07:27 (3188): Guest Log: cgroup                    1.5G         0      1.5G   0% /sys/fs/cgroup

2019-04-22 01:07:27 (3188): Guest Log: shared                  407.9G    121.6G    286.3G  30% /root/shared

2019-04-22 01:07:27 (3188): Guest Log: tmpfs                     2.6G    631.9M      2.0G  23% /var/lib/docker/aufs

2019-04-22 01:07:27 (3188): Guest Log: Running... 

2019-04-22 01:07:27 (3188): Guest Log: 07174842_032.boinc

2019-04-22 01:07:27 (3188): Guest Log: 07174842_032.sh

2019-04-22 01:09:04 (3188): Guest Log: boinc_app exited (0)


Machines are running Windows 8.1. One machine has VB 5.1.26 and the other 5.1.28.
Both run my own custom VMs all day, every day, and have run Theory (40,000 hours), ATLAS, CMS and Cosmology VMs heavily.

They are running right up against their RAM limit and at 99.97% core usage. I've seen heartbeat errors before, but almost all WUs are hardened against those, and the only mention of heartbeat in the logs is:
2019-04-22 01:03:41 (3188): Guest Log: vgdrvHeartbeatInit: Setting up heartbeat to trigger every 2000 milliseconds


I'll stop after 100 WUProps hours.
Sorry, I don't unhide my machines except in rare circumstances.

EDIT:

Additional info.

The WUs did use a full core for their run. I didn't observe network or drive usage but will if that would be helpful.
One work unit did end in abort status, and I had to terminate the VirtualBox service with a process manager because VirtualBox Manager would accept commands but not perform them. It was the oddest behavior I'd ever seen from VBox.

These machines are actually downclocked because the weather is warming. The Xeons are staying at 55C and the RAM is under 35C.

EDIT2:
I'll assume this is a case of the VM being unable to claim enough host virtual memory, even though used RAM is 200MB below maximum and commits are just 300MB above maximum real RAM. I'm shutting down management software and lowering a competing project to 1 WU, as well as limiting nanoHUB to 1 WU.
My laptop with Windows 7 and VB 5.1.28r just completed 7/7 WUs successfully.
marmot

Joined: 21 Apr 19
Posts: 15
Credit: 4,656
RAC: 1
Message 264 - Posted: 22 Apr 2019, 11:33:33 UTC - in response to Message 261.  

So it's certainly not an issue with available RAM; I freed up 10GB of 32GB.
I shut down 2 of my 8 custom VMs that were running, so there were plenty of free cores.

The WU's succeeded on another similar machine with Windows 7 Professional 64 (instead of 8.1) installed and Virtual Box 5.1.30.
They also succeeded on a Windows 7 Professional 64 laptop with Virtual Box 5.1.28 (the same version that failed on Windows 8.1) that had very limited RAM available (the WU's caused the OS to swap to the SSD vigorously).

What is different about the WU functioning under VBox on Windows 8.1 versus Windows 7?
adrianxw

Joined: 8 Jan 19
Posts: 20
Credit: 2,501
RAC: 0
Message 266 - Posted: 23 Apr 2019, 12:16:18 UTC
Last modified: 23 Apr 2019, 12:35:33 UTC

Seeing some more failures. This is the reported error, the same as I was seeing in March:

Exit status -2135228415 (0x80BB0001) Unknown error code

Windows 8.1 x64
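
The decimal and hex forms of these exit statuses are the same 32-bit value read as signed or unsigned. Here is a small sketch of the conversion, using the statuses quoted in this thread; the helper names are made up for illustration.

```python
def exit_code_hex(code):
    """Format a (possibly negative) exit status as unsigned 32-bit hex."""
    return f"0x{code & 0xFFFFFFFF:08X}"


def hex_to_signed(hex_text):
    """Convert a 32-bit hex string back to the signed decimal form."""
    value = int(hex_text, 16)
    return value - (1 << 32) if value & 0x80000000 else value


if __name__ == "__main__":
    # Exit statuses quoted in this thread.
    for status in (-2135228415, 197, 196):
        print(status, exit_code_hex(status))
```

This only shows that the two notations agree; it does not say what 0x80BB0001 means, which the client itself reports as an unknown error code.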
Ray Murray

Joined: 23 Apr 19
Posts: 2
Credit: 1,092
RAC: 0
Message 267 - Posted: 23 Apr 2019, 23:21:33 UTC

I wasn't expecting a 3GB memory requirement, so it locked up my machine, which has only 4GB RAM, as I had an LHC task running as well. Had to "hard reboot" to get control back.
Some ran through without any issue, but I had quite a few "Postponed:unmanagables" which didn't recover after BOINC restarts (which usually works), or even after removing the offending VM from VBox, so I gave up and aborted them. Some DID recover and returned successfully.
I sacrificed the LHC task to allow this to run uncontested, one at a time, and I turned the work buffer down to get "a few" more to try, and got sent 21 rather than the 2 or 3 I expected. The first 2 of that batch became unmanageable, so I have aborted them all rather than trying to nurse them through. Sorry, I just don't think that machine is up to the task. I have a couple of other machines which have more memory, so I'll let one of them try a few tomorrow night.

All on BOINC 7.14.2 and VBox 6.0.6, in case that's relevant.
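
For picking a `max_concurrent` value from installed RAM, a back-of-the-envelope estimate along these lines may help. It is only a sketch: the 3GB per task comes from this post, while the 1GB reserved for the host OS and the 8GB comparison machine are assumptions for illustration. It ignores memory used by other projects running at the same time.

```python
def max_concurrent_tasks(host_ram_mb, per_task_mb, reserve_mb=1024):
    """Estimate how many VM tasks fit in RAM after reserving some for the host.

    reserve_mb is an assumed allowance for the operating system, not a
    value published by the project.
    """
    usable = host_ram_mb - reserve_mb
    if usable < per_task_mb:
        return 0
    return usable // per_task_mb


if __name__ == "__main__":
    per_task_mb = 3 * 1024  # 3GB per task, as reported in this post
    for host_ram_mb in (4096, 8192):  # 4GB is from the post, 8GB is hypothetical
        print(host_ram_mb, max_concurrent_tasks(host_ram_mb, per_task_mb))
```

Under these assumptions a 4GB machine fits one task at most, and only when nothing else large (such as an LHC task) is running.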
marmot

Joined: 21 Apr 19
Posts: 15
Credit: 4,656
RAC: 1
Message 269 - Posted: 24 Apr 2019, 7:24:20 UTC - in response to Message 264.  


The WU's succeeded on another similar machine with Windows 7 Professional 64 (instead of 8.1) installed and Virtual Box 5.1.30.

What is different about the WU functioning under VBox on Windows 8.1 versus Windows 7?


My Windows 7 Professional 64 Xeon E5-2660 machine with VBox 5.1.30 finished 397 valid, 0 invalid and 1 error (196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED).

Runs great on that machine, 100% invalid rate on the nearly identical hardware machines but with Windows 8.1.


I'll try out some more on the next version.
G'luck with the alpha testing.
Ray Murray

Joined: 23 Apr 19
Posts: 2
Credit: 1,092
RAC: 0
Message 271 - Posted: 24 Apr 2019, 8:45:33 UTC - in response to Message 267.  
Last modified: 24 Apr 2019, 9:19:14 UTC

All looking good on the machine with more memory, running 3x LHC-dev and 1x nano (restricted via app_config) but despite having turned the work buffer down to 0.1 and 0.1 spare, I still got 73 tasks.
Hal Bregg

Joined: 24 Apr 19
Posts: 34
Credit: 86,245
RAC: 605
Message 287 - Posted: 27 Apr 2019, 23:43:19 UTC

Almost 300 WUs completed with 28 invalid ones due to some troubles with VirtualBox. All tasks were short.

I already enquired about it in this thread

https://boinc.nanohub.org/nanoHUB_at_home/forum_thread.php?id=69
marmot

Joined: 21 Apr 19
Posts: 15
Credit: 4,656
RAC: 1
Message 289 - Posted: 2 May 2019, 9:32:30 UTC - in response to Message 269.  
Last modified: 2 May 2019, 9:38:20 UTC


The WU's succeeded on another similar machine with Windows 7 Professional 64 (instead of 8.1) installed and Virtual Box 5.1.30.

What is different about the WU functioning under VBox on Windows 8.1 versus Windows 7?


My Windows 7 Professional 64 Xeon E5-2660 machine with VBox 5.1.30 finished 397 valid, 0 invalid and 1 error (196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED).

Runs great on that machine, 100% invalid rate on the nearly identical hardware machines but with Windows 8.1.




Just wanted to add that the same Windows 8.1 machine with VBox 5.1.28 installed just successfully ran 227 valid boinc2docker work units received from BOINC@TACC.

Not sure what is different between their version and yours that would make yours fail on identical hardware that runs theirs without an issue.

EDIT: I wonder if that machine would run this boinc2docker if I reverted to VBox versions 5.1.12 – 5.1.18 or 5.1.22 – 5.1.26.
Henk Haneveld

Joined: 20 Nov 18
Posts: 20
Credit: 5,560
RAC: 109
Message 298 - Posted: 13 May 2019, 9:47:49 UTC
Last modified: 13 May 2019, 9:49:14 UTC

13/05/2019 10:52:11 | nanoHUB_at_home | Aborting task 07209902_011_2: exceeded disk limit: 3226.00MB > 2048.00MB

It looks like the setting for allowed disk use is too low.
Henk Haneveld

Joined: 20 Nov 18
Posts: 20
Credit: 5,560
RAC: 109
Message 303 - Posted: 18 May 2019, 8:22:21 UTC

The setting has been raised but is still too low. Try 4096 MB.

18/05/2019 10:15:54 | nanoHUB_at_home | Aborting task 07216014_116_3: exceeded disk limit: 3228.48MB > 3072.00MB
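
To compare the reported usage against a candidate limit, the abort lines can be parsed directly. This is a sketch that assumes the message format of the two event-log lines quoted in this thread; the pattern and helper name are illustrative.

```python
import re

# Event-log lines quoted in this thread.
LINES = [
    "13/05/2019 10:52:11 | nanoHUB_at_home | Aborting task 07209902_011_2: "
    "exceeded disk limit: 3226.00MB > 2048.00MB",
    "18/05/2019 10:15:54 | nanoHUB_at_home | Aborting task 07216014_116_3: "
    "exceeded disk limit: 3228.48MB > 3072.00MB",
]

PATTERN = re.compile(r"exceeded disk limit: ([\d.]+)MB > ([\d.]+)MB")


def disk_overrun(line):
    """Return (used_mb, limit_mb) from a disk-limit abort line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


if __name__ == "__main__":
    proposed_mb = 4096  # the value suggested in this post
    for line in LINES:
        used, limit = disk_overrun(line)
        fits = used <= proposed_mb
        print(f"used {used:.2f} MB, limit {limit:.2f} MB, fits in {proposed_mb} MB: {fits}")
```

Both reported usages are a little over 3.2GB, so they fit under 4096 MB but not under the 3072 MB limit.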
Hal Bregg

Joined: 24 Apr 19
Posts: 34
Credit: 86,245
RAC: 605
Message 304 - Posted: 19 May 2019, 16:31:42 UTC

I still get the EXIT_TIME_LIMIT_EXCEEDED error, and I can only crunch one task at a time as those WUs are memory hungry. This way I don't get as many errors.


©2019 COPYRIGHT 2017-2018 NCN