Rather long read.
NVIDIA HAS RECENTLY been saying a lot about how its chips are not bad, and giving people reasons why the problem is contained. Unfortunately, these disingenuous half-truths don't stand up to an explanation of why the problem is happening.
The problem is extremely complex and defies a simple explanation. It involves multiple poor choices, multiple engineering failures, and likely a few bad accounting choices. This piece could also have been entitled: "More than you ever wanted to know about bumping, and then some: How not to do things". But we will simplify the science and technical details as much as possible to make it accessible, so some things may be oversimplified.
The defective parts appear to make up the entire line-up of Nvidia parts on the 65nm and 55nm processes, no exceptions. The question is not whether these parts are defective, it is simply what the failure rate of each line is, with field reports on specific parts hitting up to 40 per cent early life failures. This is obviously not acceptable.
The end result of the failures is that bumps crack at the joint between the bump and the substrate, not on the bump-to-die side. When this happens to a signal bump, it is game over for the GPU or MCP. What is a bump, a die and a substrate? Why is it happening? That is a long and technical story.
A Via CN chip, note the die in the centre
First, let's start out with some terminology, illustrated here by the lovely and talented Via CN/Nano chip. As you can see, the total package is about the same area as a US quarter. The most important part is the black square at the centre, that is the die, or the silicon chip itself. The green fibreglass-like part around it is the substrate, a complex multi-layered organic material that routes signals from the pads on top to the pins on the bottom, and serves as an attachment point for the die and various passive components. Those are the little silver things around the edges.
The die itself looks a little rough around the edges, but in reality it is very, very angular. It has four corners at 90 degree angles, this one being almost square. Some, like the Intel Atom for example, are much more rectangular. The blurry edges are due to a material called the underfill, which looks like glue seeping from the edges, and serves as mechanical support for the die-to-substrate bonds and as a moisture barrier to protect the bumps.
Via CN about as thick as a quarter
The parts you don't see are the bumps, and they are the most critical part. This type of packaging is called flip-chip because the connectors between the die and the substrate are put on the bottom of the die, and it is flipped over onto the package. The connectors are called bumps, and they are literally little balls of solder. A typical chip that is a little more than a centimetre on a side might have over 1,000 bumps on it, so spacing is incredibly small and tolerances amazingly tight.
As you can see, the package is about the same height as a quarter as well, so the vertical tolerances are also pretty slim. The bumps act like pins on a normal chip, they carry signals, power and ground to and from the die. They also are the primary attachment mechanism of the die to the substrate. The precision needed to put these things together should not be underestimated.
Those are the biggest players in our little drama, now let's move on to some basic physics and related science. Chips consume power, and in return they give you heat and a few electrons in the right places, occasionally they also give you a flash of light and smoke as well, but few chips do that twice. Heat is not an intended product, it is a consequence, and has to be carried away or bad things happen.
Modern chips consume electricity in an uneven manner, as different parts of the chip use power at different rates. Sometimes parts of the chip are never used at all for a given workload. If you have a modern GPU and don't game or are smart enough to not run Vista, you will likely never touch the transistors that do all the 3D work. Think about it this way, there are hot spots on the chip as well as cold spots, it is uneven and changing constantly.
A typical IR photograph of a multi-core CPU
Related to this is the fact that the chip uses electricity in a non-uniform manner. Parts that are heavily used pull much more current than idle parts, and once again, those parts change over time. Some bumps may pull a lot of Amps, others may pull very few, and this again changes over time and use. The bumps also have a limited current capacity each, too much and they melt or burn out, so there are far more than are strictly needed to supply the chip with power.
The idea is to make sure no one bump will ever reach the maximum current it can handle. This is done by putting more power bumps on the die in high-power areas than average current draw alone would call for. If things are done right, no single bump will ever exceed the maximum current it can safely deliver.
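To make that budgeting concrete, here is a toy sketch in Python. All the numbers, the bump rating, the margin and the regional peak current, are invented for illustration; real budgets come out of the vendor's own power simulations.

```python
import math

# Toy bump-count sizing with invented numbers: size a region's power
# bumps so that no single bump ever exceeds its derated limit.
def bumps_needed(peak_region_amps, bump_max_ma=1000, margin=0.20):
    usable_ma = bump_max_ma * (1 - margin)  # e.g. 800mA per bump
    return math.ceil(peak_region_amps * 1000 / usable_ma)

# A hot region that can pull 12A at peak needs 15 bumps, not the 12
# that the raw 1000mA rating would suggest.
print(bumps_needed(12))  # 15
```

The point of the margin is that the derated figure, not the datasheet maximum, is what the layout is sized against.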
The defective Nvidia chips use a type of bump called high lead, and are now transitioning to a type called eutectic, see here and here. Eutectic materials have two important properties: they have a low melting point, and all components crystallise at the same temperature. This means they are easier to work with, and they form a good solid bond. Eutectic bumps may have lead in them or they may not, some are gold/tin, others are lead based, it depends on what characteristics you want and how much you want to pay. It is a property, not a formula.
Most if not all substrates use eutectic pads to attach the bumps to as well. If you use a eutectic pad with a eutectic bump, you get a much better connection than you do if you use a high-lead bump with a eutectic pad. This is reflected in much higher yields, lower assembly costs, and a physically stronger connection as well. At this time, we have no good explanation as to why Nvidia chose to go the high-lead bump on eutectic pad route.
High-lead bumps have a much higher current capacity than eutectic bumps. When power is run through eutectic bumps, you also get an effect called electromigration. This means that some of the materials are essentially pushed around by the current, and you get voids in the bump. These voids lessen the capacity of the bump, and eventually they burn out.
The more current you run through a eutectic bump, the quicker the electromigration. If you keep the current to a reasonable level, the time it takes for this to happen will be so long it isn't worth worrying about. This is why chip vendors say that upping the voltage will shorten the lifespan of parts, it literally does cause them to burn out quicker.
On the good side, eutectic bumps are generally more flexible than high lead. This means they are a bit more forgiving to stress. Some forces that would fracture a lead bump may be absorbed by a eutectic one without problems.
Bumps overall are a multi-dimensional trade-off between cost, assembly yield, current capacity and mechanical resilience among other things. To call it a complex mess is being overly kind, package engineering is not for the faint of heart.
FROM BUMP properties, we move on to thermal expansion of materials, and that is another piece to the puzzle. Most materials expand as they warm up. If you have ever seen a mechanic trying to free a stuck bolt, they usually heat the nut with a blowtorch, this expands the nut and loosens it. The same thing happens with the die and substrate. When you turn on a chip, it heats and expands a little. This expansion is not much, but it is measurable. The substrate also heats and expands.
The problem is that the die gets hot, and heats the substrate secondarily. The silicon on the die has one rate of thermal expansion, the substrate has another, basically they get bigger at different rates. To complicate things further, remember the uneven and changing heating bit above? Parts of the die heat up and expand differently from other parts of the die. This changes quite quickly while things are in use.
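The size of that mismatch can be sketched with a toy calculation. The rough figures below, silicon expanding at about 3 parts per million per degree and an organic substrate at about 17, are typical published ballpark values, not measurements of any Nvidia part, and the die size and temperature swing are invented.

```python
# Toy estimate of die-vs-substrate shear at a bump, driven by the
# mismatch in coefficients of thermal expansion (CTE).
def corner_shear_um(dist_from_centre_mm, delta_t_c,
                    cte_si_ppm=3.0, cte_sub_ppm=17.0):
    delta_cte = (cte_sub_ppm - cte_si_ppm) * 1e-6  # per degree C
    # Shear displacement grows with distance from the die centre.
    return dist_from_centre_mm * 1000 * delta_cte * delta_t_c  # microns

# A bump 5mm from die centre over a 60C swing moves about 4.2 microns
# relative to the pad it is soldered to.
print(round(corner_shear_um(5, 60), 2))  # 4.2
```

A few microns does not sound like much until you remember the bumps themselves are only tens of microns tall, and the corner bumps, farthest from the centre, take the worst of it.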
The result? The bumps take a lot of stress, and it changes from second to second. This can be very accurately simulated, and you can engineer bump placement at points of lower thermal expansion and therefore lower stress. If you lose a power bump here and there, the chip will very likely survive, but if you lose a signal bump, game over. This is why bump placement is very important.
Engineering what bumps go where is a very complex process, and is done basically when the chip is laid out, near the end of the development process. You don't do it on a whim, you don't make pretty patterns because they are cool, you do it scientifically to minimise the potential for damage.
Getting back to the stress, it is what makes bumps fracture. Think of the old trick of taking a fork and bending it back and forth. It bends several times, then it breaks. The same thing happens to bumps. Heating leads to stress, aka bending, and then it cools and bends back. Eventually this thermal cycling kills chips.
Once again, if you did your engineering right, this won't happen in any timeframe that matters to mere humans, if it takes ten years of on and off switching to make it happen, once a day power cycling won't matter in our lifetimes. Chip makers tend to engineer for timelines like the ten-year horizon, and are pretty safe in assuming it will live for five years of casual use.
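Reliability engineers often model this kind of cyclic fatigue with a Coffin-Manson-style power law: cycles to failure fall off steeply as the temperature swing per cycle grows. The sketch below uses that general form with made-up constants; real values for C and the exponent come from testing the actual materials stack.

```python
# Simplified Coffin-Manson-style scaling: the hotter each heat cycle,
# the fewer cycles the joint survives. C and n are invented here.
def cycles_to_failure(delta_t_c, c=2.0e9, n=2.5):
    return c * delta_t_c ** -n

# Doubling the swing from 30C to 60C cuts the lifetime by 2**2.5,
# roughly 5.7x, with this exponent.
ratio = cycles_to_failure(30) / cycles_to_failure(60)
print(round(ratio, 1))  # 5.7
```

This is why both the number of cycles and the severity of each cycle matter, a point that comes back later when comparing desktops to laptops.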
If you recall, high-lead bumps are stiffer than eutectic and more prone to stress fractures. The high-lead-to-eutectic substrate bond is also weaker than a eutectic-to-eutectic bond. What is happening to Nvidia is that the substrate to bump joint is cracking, and the chips die. High lead bumps are a poor choice to use in this application.
One other bit to bring into the mix is underfill. If things were as simple as heat leads to cracking, no chips would work for any length of time. Underfill not only protects the bumps from moisture and contamination, but it also provides mechanical support as well. It is designed to take some of the stress that the bumps take, making them live longer.
Underfill can range from rock hard to soft and squishy, it depends on your application. The harder the underfill, the more mechanical support it provides, and the less stress the bumps take. Simple enough.
That brings us to another material, the polyimide layer. When chips went to a low-K dielectric material, which is not the same as the high-K gate material, it posed a problem for packaging, bumps and underfill. The solution was to put down a polyimide layer, sometimes called a stress layer, to cover the bottom of the chip. This prevents contamination and mechanical damage.
If you pick an underfill that is too soft, it doesn't provide enough mechanical support for the bumps, they crack and your chip dies an early death. Pick one that is too hard and it rips the polyimide layer off. In the words of one packaging engineer talked to for this article, if you used too hard an underfill, the chip "wouldn't survive the first heat cycle". The magic is in the middle, you have to pick a bowl of porridge, er, underfill, that is strong enough to provide the support you need, but not so strong as to rip layers off your chip. Like we said, package engineering is not for the faint of heart, but it can make baby bear happy.
That brings us to the billion-dollar question: why not simply change bump types to eutectic if they are that much better, which they are, in some ways? The answer is in the current capacity, more specifically the average current capacity. We mentioned this earlier, and the idea ties into the hot spots and functional units.
Take a hypothetical simple CPU that has an integer unit and a floating point unit. If you are doing heavy integer work, the power bumps that supply that part of the chip will be loaded heavily and the FP bumps will not be doing much of anything at all. When the FP load gets heavy, the opposite happens.
The layout of the bumps is designed so that neither set will be overloaded at peak times, and in fact won't get all that close to their maximum. To use completely made up numbers, take a bump that has a peak capacity of 1000mA, where for longevity you don't want to exceed 800mA, basically a 20 per cent safety margin.
If the chip's TDP divided by the number of bumps, that is the average current per bump, is 200mA, there are likely many bumps drawing 100mA and a few under heavily loaded areas that draw 600mA. This draw moves around with the work the chip is doing. Some bumps may never break 100mA, others may be at 600mA for their entire lives. All are well below the 800mA limit, much less the 1000mA max.
The problem with eutectic bumps is that they have a lower current capacity, and the closer you get to it, the worse the problem of electromigration becomes. Let's pick a hypothetical eutectic bump that has a peak capacity of 500mA and the same 20 per cent safety margin, 400mA max for long life. If Nvidia wants to swap in eutectic bumps for the high lead it is using, there is a slight problem: the chips are well over the current capacity of the new bumps.
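Running the article's made-up numbers through a quick check shows the squeeze. The per-bump draws below are hypothetical, as above; the point is that the same worst-case bump that fits comfortably under a high-lead limit blows straight through a eutectic one.

```python
# Flag any per-bump current draw that exceeds the derated limit for a
# given bump rating. All figures are the article's invented examples.
def over_limit(draws_ma, bump_max_ma, margin=0.20):
    limit = bump_max_ma * (1 - margin)
    return [d for d in draws_ma if d > limit]

draws = [100, 150, 200, 600]        # hypothetical per-bump currents

print(over_limit(draws, 1000))      # high lead, 800mA limit: []
print(over_limit(draws, 500))       # eutectic, 400mA limit: [600]
```

Nothing about the chip changed, only the bump material, yet the hottest bump went from a 25 per cent cushion to 50 per cent over the line.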
If the chip actually powers up without letting the smoke out, the first time you fire up a massive game of Telengard, it will most assuredly go pop. In the rare case that the gods of luck are staring right at you and the thing doesn't fry immediately, electromigration will ensure it has the lifespan of a mayfly, basically worse than the current crop of defective Nvidia chips.
What do you do? You can either cut the power used by the GPU way way down, ie, clock it at a point where no one would ever buy it, or rearrange where the bumps go. The rearrangement is not a trivial thing, and may require moving large parts of the chip around, basically a partial relayout. This is expensive, time consuming, and likely can't be done and validated in the time the chip is on sale for.
The other option is basically just as bad, you need a power plane or power grid on the die. This is a metal layer that distributes power across the die, and it means adding a layer to the chip. That means expense, slightly lower yield, and can have other detrimental effects to power draw and clocking.
All of these things can be dealt with if you see this coming when you start making the GPU. It is pretty painfully obvious that Nvidia didn't, otherwise they wouldn't have used high lead bumps and gotten into the hole that they are in. They have switched to eutectic bumps, but given the way it is being done, and the supplier grumbles we are hearing, it appears to be very poorly planned. It will be interesting to see the lifespan of these new parts. µ
GETTING BACK to the underfill, this is probably the key to the problem. There is one more property of underfill called the glass transition temperature, Tg for short. Tg is not melting, it is more the temperature at which the material goes soft and loses most of its structural rigidity. The underfill that Nvidia used, Namics 8439-1, is what's called a low Tg material, while the Hitachi 3730 has a higher Tg.
To be fair to Nvidia, about the time the G84 and G86 were hitting the market, high Tg underfills were pretty rare and new to the market. Low Tg underfills, such as the Namics material that NV used, had been available for a while, and were 'known'. The last thing you want to do is put a high risk part on a new and market-untested material, so it looks like they went with the safe choice, low Tg.
If Nvidia did their homework right, the Tg of the material should never be hit, the chip should always run below that temp, and the underfill should provide the mechanical support needed to keep the high lead bumps from fracturing. This is why you engineer, test, retest, simulate, pray a lot, and pick your materials very carefully.
Namics 8439-1 underfill temp vs strength curve
Here is the Tg curve for Namics 8439-1. Let us be the first to say there appears to be nothing, repeat, nothing wrong with this material, it does exactly what it says it does. It starts to lose strength at about 60C, and by a little over 80C it has 100 times less rigidity. Think going from hard plastic to jello. What temps do GPUs run at again? What is the Tj (transistor junction temperature) for them? Oops. A big hundreds-of-millions-of-dollars oopsie right here.
So, the failure chain happens like this. NV for some unfathomable reason decides to design their chips for high lead bumps, something that was likely decided at the layout phase or before because the bump placement is closely tied to the floorplan. At this point, they are basically stuck with the bump type they chose for the life of the chip.
The next choice was the underfill material, and again, they chose the known low Tg part that had far less tolerance than the newer-to-market high Tg materials. It was a risk vs risk proposition, likely with a lot of cost differences as well. They chose wrong, very wrong. The stiffness of the Namics material might be perfect below the Tg, but once you hit it, it is almost like it isn't there, and the stress transfers to the bumps while they are hot and weak.
Fanbois will cry that their $.23 temp sensor is reading much lower temps than that, so there is no way this could be an issue. Well, the temp sensors on many cards are not on die, much less between the die and the substrate. They are also cheap and notoriously inaccurate. To top it off, they only measure average temp across the chip, not hot and cold spots. If you look at the IR photo in the previous part of this story, you can see that if you move the sensor from the right side to the left, you will get vastly differing readings. In this case, a real current chip, it will vary by as much as 30C depending on placement.
Many people also don't realize that it is easier for heat to travel down through the pins, they are mini-heat pipes, than it is to cross the three thermal barriers (die -> thermal paste -> heat spreader -> thermal paste -> heatsink) to the heatsink. That means those little bumps take a huge thermal pounding, and are usually hotter than the surface of the heat spreader.
To make matters worse, the bumps that are under the hot spots get hotter still. Piling on the pain, they carry the most current, and the hotter they get, the more resistance they usually have, and the more heat they generate.
Could it get worse? Of course it could. Remember thermal stress? The expansion is highest at the point, wait for it, that is hottest. That would be under the hot spots, and it puts the most stress on the bumps that are weakest.
This is why you have to pick your underfill very carefully, you have to relieve as much stress as you can from the bumps. Too little and they go snap, and the chip dies. Too much and you pull the polyimide layer off and the chip dies. Basically, you go as stiff as you dare, then test the hell out of it under the conditions your simulations tell you will be present. Test, test, test, test, or dies die.
When the underfill glassifies, it means you are at the hottest point on the die, the bumps that it is protecting are under the most heat, pulling the most current, and under the most thermal stress. If the underfill essentially turns to jello, it is very bad. If you compound that by using bumps that bond poorly to the substrate, it makes things worse. If those bumps are stiffer than the other option, it is worse yet.
Let's go down the checklist for Nvidia. High thermal load? Check. Unforgiving high-lead bumps? Check. Eutectic pads? Check. Low Tg underfill? Check. Hot spots that exceed the underfill Tg? Check. If you are thinking this looks bad, you are right, and expensive too.
If it was just as simple as the underfill glassifying, the parts would have never made it to market. It is much more complex than that. The problem with thermal stress is that it is somewhat additive, it weakens parts long before they actually break unless it is quite extreme.
An example of extreme thermal stress would be to take a glass cup, preferably non-tempered, and put it in the oven on max. Pull it out and drop it in a bucket of ice water, and voila, instant thermal stress demonstration. Wear eye protection. The thermal stress that the bumps see is much more like the fork example earlier, it gets weaker and weaker with each bend, until snap, black screen.
If you recall, the Nvidia parts are breaking at the bump to substrate connection. This is the weakest point in the chain, and it is where they made the worst possible materials choice. It is not really a surprise that it failed. It is simply shoddy engineering.
So, what can be done by Nvidia at this point? Well, changing to high Tg underfills is a start, as is changing to eutectic bumps. The high Tg underfill option has come down in risk substantially since the G84 and G86 parts were introduced, so that is a no-brainer, and guess what Nvidia did to the G86? And the G92 as well.
The problem of changing bump types is far thornier, but Nvidia is doing that as well. From the intelligence we have been able to gather, Nvidia has not made any power distribution changes to the parts, there is no power grid, no power plane, nor anything else to protect the eutectic bumps from electromigration. They may be able to keep them under their current capacity, but by how much?
This is emblematic of the 'pants are on fire' school of engineering, and reports from inside Nvidia confirm that they are in full panic mode over this snafu. With short time horizons to fix a massive batch of defective parts, reliability engineering usually draws the short stick. µ
SOURCES CLOSE to Dell say they knew about the problem a year ago, and HP is on record as being aware in November, so there has been about a year to characterise the problem, design a solution and test it. Multiple sources involved with package engineering tell us that this is not nearly enough time to do a proper test regime, much less long-term reliability studies.
This new package and materials set does not appear to have been nearly as carefully vetted as it should have been. It may work, but then again, it may not. If the lack of power distribution changes is accurate, we may very well be reading about Nvidia Defective Chipsgate II in a couple of years.
How widespread is the problem? We told you about G84 and G86s as well as G92 and G94s. From the materials side, it appears that all non-R and non-F lot numbered parts made on the 65nm and 55nm processes are defective. The flaw is a downright idiotic choice of multiple materials coupled with poor chip design and inadequate testing. It is a case of errors compounding errors. They are all defective.
If this is the case, why aren't we seeing more defective desktop parts? That one is easy... thermal stress. It has two components that lead to a bump fracturing, the amount of the stress, that is the hot-cold temperature delta, and the number of times the part is powered up and down, that is the heat cycle. The glass cup in the oven would be the amount of stress, the bent fork would be the number of cycles.
Remember the Nvidia 8-K where it announced that "...customer use patterns are contributing factors." By customer usage patterns, they are referring mainly to thermal cycles, but you could also credit them with meaning high temperatures while the GPU is being pushed hard in gaming and the like.
Desktop systems are usually turned on once a day or so. Some people leave them on for weeks at a time, others may turn them on and off a few times in a day. The average desktop probably sees about one heat cycle a day.
Laptops on the other hand are woken up and put to sleep many times a day. Take a typical student who wakes up, checks his email, goes to three classes and takes notes, goes to a coffee shop for a bit, goes home, watches a video or two, then goes to sleep: it is not hard to make a case for 10 or more power cycles a day. Every wake/sleep or hibernate cycle is a heat cycle, so dozens are not out of the question.
The more cycles you put on it, and the more severe they are, the quicker these defective parts will die. A good way to look at it is to assign each critical bump an amount of stress it can take before it cracks. Let's call this number 100AU, for Arbitrary Units. If a power-on cycle is worth 4AU, and a hardcore gaming session with the GPU overclocked to within 1MHz of crashing is worth 15AU, you can figure out when it should die. Remember, these are hypothetical numbers... the theory is the point.
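That budget arithmetic can be sketched in a few lines. The AU figures are the article's own invented numbers; the only real conclusion to draw is the ratio between the usage patterns, not the absolute days.

```python
# Toy stress budget: each critical bump absorbs a fixed amount of
# Arbitrary Units (AU) before it cracks. All costs are hypothetical.
def days_to_failure(budget_au=100, power_cycles=1, au_per_cycle=4,
                    gaming_sessions=0, au_per_session=15):
    au_per_day = (power_cycles * au_per_cycle
                  + gaming_sessions * au_per_session)
    return budget_au / au_per_day

# With identical invented costs, a 10-cycle-a-day laptop burns through
# its budget ten times faster than a once-a-day desktop.
print(days_to_failure(power_cycles=1))   # 25.0
print(days_to_failure(power_cycles=10))  # 2.5
```

That ratio is the whole story of why the failures showed up in laptops first.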
When Dell, HP and others announce a BIOS 'fix', the reason it is so humorous is that all they are doing is lowering the amount of thermal stress on the chips when the fan would not normally be on. When the fan is going full tilt without the 'fix', the new 'updated thermal profiles' won't make a difference. When the fans are normally off or on low, the profiles will essentially lessen the stress from a four to a three. It is just there to allow the laptop to live through the warranty period so the companies don't have to pay for the fix. After that, if the defective chips burn out, it isn't their problem. The 'fix' doesn't fix anything at all.
In the end, it comes down to Nvidia screwing up badly on package engineering and testing, then trying as best they can to bury the problem while passing the buck. It appears that every Nvidia 65nm and 55nm part with high lead bumps and/or low Tg underfill are defective, it is just a question of how defective they are, and when they will die.
As far as we are able to tell, contrary to Nvidia's vague statements blaming suppliers, there are no materials defects at work here. Every material they used lived up to its claimed specs, and every material would have done the job if kept within the advertised parameters. Nvidia's engineering failures put undue stress on the parts, and several failures compounded to make two generations of defective parts. The suppliers and subcontractors did exactly what they were told, Nvidia just told them to do the wrong thing.
When it started talking about this, Nvidia failed crisis management 101, and the coverup shows it doesn't care about consumers, just its bottom line. NV is doing exactly the wrong thing for the wrong reasons, and the lawyers circling with class action paperwork in hand are going to eat them alive.
The last time you had such a huge batch of defective GPUs, the company that did it swore up and down - just like Nvidia - that there was no problem despite forums filled with evidence to the contrary.
A few weeks later, it turned around and admitted there was a problem, and took a $1.1 billion charge, placating customers and fending off lawsuits.
You know that as the Xbox 360 Red Ring of Death.
I wonder why Nvidia can't be that smart?