Am I the only one to have a problem with...



Aselert
02-13-2014, 03:30 PM
.... graphics cards that stop working for no reason???

This problem still persists after many months of headaches trying to find the cause. I still can't find it, and it's driving me mad!!!

Let me explain: when I launch a long render (a few hours, for example) on my render machine (3 GTX 580), it starts with 3 GPUs, then 2 GPUs... then 1 GPU! And sometimes it finishes on the CPU!!!

I've tried changing cards and drivers, reinstalling BS, moving the cards to different PCI Express slots, testing with 2 cards, checking for overheating, etc... But nothing changes.

I do not know what to do... Any help would be appreciated before I go really mad!


Thank you

Aselert
02-19-2014, 05:15 PM
Thanks for the reply devett, but no, it isn't a temperature issue. My 580s get hot, but not that hot; the tower has a lot of ventilation and is very often open. But it doesn't change anything. I thought it might be the PCI Express speed, but it isn't that either. Drivers? Again, no. Maybe the BIOS of the MSI mobo. But I tried an app other than BS that uses CUDA over a weekend, and it worked without a problem... So... I don't know.

The main problem is: SOMETIMES it works... and... SOMETIMES it doesn't! So I can't pin it down :(

blitz
02-19-2014, 11:48 PM
I have been having a similar issue but very rarely. I'll post if it happens again.

haknslash
02-20-2014, 06:01 AM
Which motherboard do you have?

Aselert
02-20-2014, 08:38 AM
I have an MSI Big Bang Marshal ;)

andy
02-25-2014, 12:44 PM
So, is it the graphics card that stops working? Or is it Bunkspeed that stops working? lol

Aselert
02-25-2014, 02:51 PM
Lol andy

It's simple: when I launch a render in PRO, it begins with all 3 cards, so OK. Then one card stops (??), then another one (???)... And sometimes (but very rarely, and in Hybrid mode) it finishes on the CPU (????)!

So I would say... PRO keeps working without a problem during this "no-reason stopping" and there is no crash. But I tested another CUDA app (non-rendering) over a weekend and there was no problem at all: it began with 3 cards and finished with all 3 cards.

The main problem is that it's RANDOM. Sometimes no card stops working, and sometimes it happens as described above.
And the timing is just as random: a card could stop working after 5 min or after 2 h...

It looks like the MOBO or the cards are going into a "safe mode" or an "efficient mode", I don't know... But the thing is, I've seen this behavior on different MOBOs.

So how could a GRAPHICS CARD stop by itself?? Really, as far as I can see, there is no overheating issue, no fan issue, no driver issue, no zebra screen, etc...

EVERYTHING is working perfectly -or looks perfect!- except for the fact that the cards, sometimes but often, stop working!

Lazy cards :( :( :(

andy
02-26-2014, 12:27 PM
This sounds a lot like the problem I get on my machine. Do you have the EVGA Precision X software installed? If not, get it. It's just a great GPU monitor. You can customize the views. You'll want Temperature, and GPU%. That should be good to see if this is the problem.
When a card overheats it will enter a safe mode. You'll notice the GPU is still working, but its speed drops dramatically when it hits a certain temp.

My bet is that you don't have enough space between the cards. So one of the bottom two cards will cut out first, the other one might be close behind...then the last one may keep on going for a bit? Well, if that's not the case, you should be able to have a better guess by seeing what happens to the temp and GPU. If you notice the GPU just seems to fall off at a certain temperature, then you at least have a good idea of how to troubleshoot.

To test out my theory, remove the middle card. That will create quite a bit of space between them and they shouldn't overheat as easily. I haven't had it quite as bad as you though. One card keeps going, and the other only drops to 75% or so. 50% if it's really warm in the office as well.

It might not be that, but monitoring a few things can help you figure out the issue by making things seem less random. Hopefully you get those cards off their ass, and they can get to work.
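For anyone who wants a log rather than a live graph, here is a minimal sketch of the same idea (temperature and GPU% per card) using `nvidia-smi` instead of Precision X. That swap is an assumption, not something used in this thread, and older GeForce cards may report some fields as "[Not Supported]" or "[N/A]", which the parser below turns into `None`. The helper names `parse_samples` and `read_gpus` are made up for this example.

```python
import csv
import io
import subprocess

# nvidia-smi query; which fields a card actually reports depends on GPU and driver.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_samples(text):
    """Parse CSV text into (gpu_index, temp_c, util_pct) tuples.

    Any field that is not a plain number (e.g. "[Not Supported]") becomes None.
    """
    def to_int(field):
        field = field.strip()
        return int(field) if field.isdigit() else None

    rows = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        idx, temp, util = (to_int(f) for f in row)
        if idx is not None:
            rows.append((idx, temp, util))
    return rows


def read_gpus():
    """Run nvidia-smi once and return one tuple per card (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_samples(out)


# Made-up sample text, only to show the shape of the parsed result.
sample = "0, 78, 99\n1, 81, 98\n2, 60, 0\n"
print(parse_samples(sample))
```

Calling `read_gpus()` every few seconds during a render and appending the result to a file with a timestamp would show whether a card's GPU% falls off at a particular temperature or at no particular temperature at all.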

Aselert
02-26-2014, 01:27 PM
Thank you Andy,

But it is not a "simple" temperature problem, unfortunately, because the cards are stopping without reason, unrelated to temperature.

These are Gainward GTX 580 3GB cards, and the maximum temperature I have seen (with GPU-Z, which I run during almost every render) was 87°C, but that was exceptional, during the hottest days of summer 2013. And the limit for these cards before safe operation mode is 97°C, or 105°C for the "maximax". In normal weather, the cards sit at around 70-80°C.

Moreover, when I say "stop working", it isn't just a drop to 50% load, it goes directly to 0%! And from the outside, that means the fans quickly drop back to idle speed.

No, really, there is no link with temperature, because it often happens that the hottest card isn't the one that stops first... It can occur after 5 minutes of operation, while the card is sometimes at just 60°C... It's really not related to this.

For now I have only 2 cards, well spaced and well ventilated, and nothing changes, it's the same.
It really is as if the "signal" between BS and the card was broken.
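The distinction described above (a card falling straight to 0% while cool, versus a hot card slowing down) can be sketched as a small check over logged readings. This is only an illustration: the function name `classify`, the 90% "busy" cutoff and the input format are assumptions, and the 97°C default is simply the safe-mode limit quoted in this post.

```python
def classify(samples, throttle_temp_c=97, busy_pct=90):
    """Classify one card's log of (temp_c, util_pct) readings, oldest first.

    Returns:
      "hard_stop"         - load fell to 0% while below the throttle temperature
      "possible_throttle" - load fell while at or above the throttle temperature
      "ok"                - the card never dropped after it started working
    """
    was_busy = False
    for temp, util in samples:
        if util >= busy_pct:
            was_busy = True
            continue
        if not was_busy:
            continue  # the card has not started rendering yet
        if util == 0 and temp < throttle_temp_c:
            return "hard_stop"
        if temp >= throttle_temp_c:
            return "possible_throttle"
    return "ok"


# Made-up readings: a card at 60°C that goes from full load straight to 0%.
print(classify([(58, 99), (60, 99), (60, 0)]))
```

A "hard_stop" result on a cool card points away from overheating and toward something else cutting the card off.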


It depresses me ... :(

andy
02-27-2014, 12:23 PM
Well that's sort of good news. I can't change my MOBO, so I'm just stuck with one card always reducing its speed. At least for you, it's just software. That's a pain we All share with you. And it Is depressing.

Aselert
04-17-2014, 05:11 PM
So... no solution to this problem? :confused: :(

Aselert
05-22-2014, 05:01 PM
Hello everyone,

I finally found the solution to my problem!!!
After harassing Bunkspeed, NVIDIA, MSI and Gainward, I got my answer by looking at my PSUs...

In fact, one of our towers (3 GPUs) runs on 2 PSUs: one feeds the mobo, CPU, RAM, etc., and one feeds 2 GPUs and other fans.
So, without knowing it, I had an asymmetrical system. As my second PSU was not connected to the motherboard, it did not exchange information with it.

What was misleading is that the system worked for every normal task, so the problem was invisible. But it was unstable.
During low GPU load, the standalone PSU frequently cut power to the GPUs, and even at high loads its safety system cut the power, and with it the GPUs, because it received no signal from the motherboard to keep working.

Et voilà! I apologize for the inconvenience, the software (Bunkspeed) had nothing to do with this. But it took me months to realize it! I think this experience will help others one day, maybe!

;)