
CynicalUnicorn · Super Moderator · 9,302 Posts · Discussion Starter · #1
Let's look at a classic CPU, the i5 2500k:

Quote:
Originally Posted by CPU-world.com
Level 1 cache size:
4 x 32 KB 8-way set associative instruction caches
4 x 32 KB 8-way set associative data caches

Level 2 cache size:
4 x 256 KB 8-way set associative caches

Level 3 cache size:
6 MB 12-way set associative shared cache

Cache latency:
4 (L1 cache)
11 (L2 cache)
25 (L3 cache)

(Source: http://www.cpu-world.com/CPUs/Core_i5/Intel-Core%20i5-2500K%20CM8062300833803.html)
I think it's safe to assume that latency is how many clock cycles it takes to read or write to that level of cache, right? I assume also that the cache speed is linked to the CPU's frequency. That would make sense, because cache is an important part of the core. Accessing instructions stored in the L1, for example, would take four cycles or, at a stock speed of 3.3GHz, ~1.21ns.

Past that, well, I'm lost. The 4x32KiB part (L1) makes sense - that means there are four of these (one per core) with 32768 bytes of memory each. What do "8-way" and "12-way" mean? Is this like RAM channels? If yes, how wide is the bus? Further down on the page, it says each cache level has a "line size" of 64 bytes. Is that a 512-bit bus per "channel?" If yes, then it sounds like - assuming these don't use anything like DDR (double data rate) or QDR (quadruple) - the L3's bandwidth is:

(12 "channels") * (64B / channel) * (3.3GHz) = 2.534TB/s of bandwidth, shared among all four cores.

But I'm not sure if I'm right.

Let's look at another, the good ol' FX-8350:

Quote:
Originally Posted by CPU-world.com
Level 1 cache size:
4 x 64 KB 2-way set associative shared instruction caches
8 x 16 KB 4-way set associative data caches

Level 2 cache size:
4 x 2 MB 16-way set associative shared exclusive caches

Level 3 cache size:
8 MB 64-way set associative shared cache

(Source: http://www.cpu-world.com/CPUs/Bulldozer/AMD-FX-Series%20FX-8350.html)
It doesn't give latency numbers. Too bad. Again, we see a "line size," and again, it is 64 bytes. Plugging that into the previous equation and accounting for the higher frequency and the 52 extra "channels" in the 64-way L3 gets us:

(2.534TB/s) * (64 / 12) * (4 / 3.3) = 16.39TB/s of raw bandwidth, shared among all four modules.

Am I on the right track, or am I totally lost? These numbers sound way too high, and I might be totally wrong in my interpretation of whatever "##-way set associative cache" means.
 

CynicalUnicorn · Super Moderator · 9,302 Posts · Discussion Starter · #4
Quote:
Originally Posted by TheBlademaster01 View Post

Quote:
Originally Posted by CynicalUnicorn View Post

http://www.overclock.net/t/1541624/how-much-bandwidth-is-in-cpu-cache-and-how-is-it-calculated/0_100

Blade, PR, Artik, other people, and Oob especially, please help.
You're looking at it wrong. Caching isn't that easy to predict. It depends on hit rate and the uarch of the processor. L1 bandwidth depends on the instructions per tick and the stride of the instructions (AVX = 256-bit, SSE = 128-bit etc.). IIRC, Sandy Bridge has 1 instruction per tick and Haswell can do 2 instructions per tick (necessary to meet the FMA spec). It's also part of the reason why a lot of the Xeon E5 v3 SKUs have lame clock targets...

You could calculate it like this:

L1BW = core clock * 32 Bytes (assuming the largest stride possible i.e. AVX) * instructions per tick * #cores

L2 and L3 depend on hit rates. I found that during benchmarks for Sandy Bridge the total L2 BW is about 60% of L1 and L3 BW 50% of L2. I should have a SiSandra run in this thread somewhere (I think I ran it around September 8 last year).

Also, associativity in caching relates to address mapping, not channel width. You have direct-mapped cache (each piece of data can only be written/read at one specific location), which requires less control logic and has crazy fast searching speed (low latency), but as a result craptastic hit rates and therefore a significant amount of wasted address space. The alternative would be fully associative cache (data can be written/read at any location), which has amazing hit rates but, as expected, exceptionally long search queries and complex search algorithms.

n-way associativity is a compromise between the two. It means that your processor can only write/read data to/from one certain block (set) of the cache, but can use any of the n slots (ways) within that set. This way you improve hit rates and keep search queries a lot simpler. This is also why, for the larger but slower L3 cache, associativity is turned way up (hit rate for L3 cache is virtually 100%). The bigger n, the higher the hit rates, but also the higher the latency (longer search query). But L3 is a last effort to avoid going out to RAM, so the tradeoff makes sense here.
Thank you @TheBlademaster01.
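To make the n-way mapping concrete, here is a minimal C sketch using the 2500K's L3 figures from the quote above (6MB, 12-way, 64-byte lines). The constant names and the example address are mine; the set/tag split is the textbook scheme, not anything vendor-specific.

Code:

#include <stdio.h>
#include <stdint.h>

/* i5-2500K L3 per the CPU-World quote: 6 MB, 12-way, 64 B lines. */
#define CACHE_BYTES (6u * 1024 * 1024)
#define WAYS        12
#define LINE_BYTES  64
#define NUM_SETS    (CACHE_BYTES / (WAYS * LINE_BYTES))   /* = 8192 sets */

int main(void)
{
    uint64_t addr = 0xDEADBEEFULL;        /* arbitrary example address     */
    uint64_t line = addr / LINE_BYTES;    /* which 64 B line it belongs to */
    uint64_t set  = line % NUM_SETS;      /* the ONE set it may live in    */
    uint64_t tag  = line / NUM_SETS;      /* identifies the line in a set  */

    printf("address 0x%llx -> set %llu of %u, tag 0x%llx\n",
           (unsigned long long)addr, (unsigned long long)set,
           (unsigned)NUM_SETS, (unsigned long long)tag);

    /* A lookup compares at most WAYS (12) tags within that one set,
     * instead of 1 candidate (direct mapped) or all 98304 lines in the
     * whole cache (fully associative) -- the compromise described above. */
    return 0;
}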

So, let's try this again. 2500K, 256-bit instructions, and his claim of one instruction per cycle:

(3.3GHz) * (1 instruction / cycle) * (256 bits / instruction) * (4 cores) = 3.38Tb/s = 422GB/s of L1 bandwidth.

He was also nice enough to link to a benchmark from a few months back using his dual-socket, eight-core Sandy-EP Xeons at 2.7GHz. L1 bandwidth reached 1631GB/s. Does the math agree?

(2.7GHz) * (1 instruction / cycle) * (256 bits / instruction) * (16 cores) = 11.1Tb/s = 1.38TB/s.

Well... it's reasonably close, and they might have been overclocked. L2, despite having the same number of cache blocks (I'm not sure of the technical term...), takes a huge hit down to 980GB/s. I'm going to have to assume that's due to the comparatively huge latency.
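For anyone who wants to plug in their own numbers, here is that formula as a quick C sketch. The function name is mine; the inputs are the figures quoted in this thread.

Code:

#include <stdio.h>

/* TheBlademaster01's formula: BW = clock * stride * instructions per tick * cores.
 * GHz times bytes comes out directly in GB/s. */
static double l1_bw_gbs(double ghz, int insn_per_tick, int stride_bytes, int cores)
{
    return ghz * insn_per_tick * stride_bytes * cores;
}

int main(void)
{
    /* i5-2500K: 3.3 GHz, 1 AVX load/tick, 32 B stride, 4 cores -> ~422 GB/s */
    printf("2500K L1:   %7.1f GB/s\n", l1_bw_gbs(3.3, 1, 32, 4));

    /* Dual 8-core Sandy-EP Xeons at the 2.7 GHz base clock -> ~1382 GB/s */
    printf("2x Xeon L1: %7.1f GB/s\n", l1_bw_gbs(2.7, 1, 32, 16));
    return 0;
}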
 

TheBlademaster01 · Looking Ahead · 13,043 Posts
Yeah, my all-core turbo multi is 31x and I typically have it overclocked to a 103MHz BCLK for a little more performance (apparently that gives me an additional 51.2 GB/s of bandwidth to L1). You can easily find the clock speed if you know the arch and the number of cores: 1630 / (32 * 16) = ~3.2GHz (= 103MHz * 31).
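Running the measured number back through the same formula confirms that (a sketch, assuming the same 32 bytes per tick per core as above).

Code:

#include <stdio.h>

int main(void)
{
    double measured_gbs = 1631.0;        /* SiSandra L1 result from the thread */
    int cores = 16, bytes_per_tick = 32;

    /* Invert BW = clock * bytes_per_tick * cores to recover the clock. */
    printf("effective clock: %.2f GHz\n",
           measured_gbs / (cores * bytes_per_tick)); /* ~3.19, i.e. 103 MHz x 31 */
    return 0;
}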

What these benchmarks typically do to measure the lower-level caches' bandwidth is flush the higher-level caches by feeding them data that won't fit. The core then gets a cache miss at the higher level, stalls, and hits the lower-level cache ('higher' and 'lower' referring to the hierarchy and not the number, if you still follow).
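A minimal single-threaded sketch of that kind of block-size sweep is below. Real benchmarks like SiSandra use vectorized, multi-threaded reads, so this only reproduces the shape of the curve, not the peak numbers. Build with something like gcc -O2.

Code:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    /* Sweep block sizes from 16KB (fits in L1) to 64MB (misses even L3). */
    for (size_t size = 16 * 1024; size <= 64 * 1024 * 1024; size *= 2) {
        size_t n = size / sizeof(uint64_t);
        uint64_t *buf = malloc(size);
        for (size_t i = 0; i < n; i++)
            buf[i] = i;                              /* touch every line first */

        size_t passes = (256u * 1024 * 1024) / size; /* ~256MB of total reads  */
        uint64_t sum = 0;
        double t0 = now_sec();
        for (size_t p = 0; p < passes; p++)
            for (size_t i = 0; i < n; i++)
                sum += buf[i];                       /* sequential reads       */
        double dt = now_sec() - t0;

        /* Printing sum keeps the compiler from deleting the read loop. */
        printf("%8zu KB: %7.2f GB/s (sum %llu)\n", size / 1024,
               (double)size * passes / dt / 1e9, (unsigned long long)sum);
        free(buf);
    }
    return 0;
}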

TechReport typically includes a block size vs bandwidth plot when doing Xeon reviews. Here is one comparing Haswell, Ivy Bridge and Sandy Bridge:



[Image: TechReport block size vs. bandwidth plot] (Source)

I edited in the dashed lines at which point L1, L2 and complete cache misses occur. Keep in mind they all have 32KB of L1 cache per core, 256KB of L2 per core, and the L3 is shared across the ringbus per CPU.

This gives:

dual 2687W: 512KB L1, 4096KB L2 and 40MB of L3

dual 2687Wv2: 512KB L1, 4096KB L2 and 40MB of L3

dual 2687Wv3: 640KB L1, 5120KB L2 and 50MB of L3
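Those totals fall straight out of the per-core numbers (32KB L1 and 256KB L2 per core). A quick C sketch; the function name is mine, and the per-CPU L3 sizes are just the totals above divided by two.

Code:

#include <stdio.h>

static void totals(const char *name, int sockets, int cores, int l3_mb_per_cpu)
{
    printf("%-13s %4d KB L1, %5d KB L2, %3d MB L3\n", name,
           sockets * cores * 32,    /* 32 KB L1 per core  */
           sockets * cores * 256,   /* 256 KB L2 per core */
           sockets * l3_mb_per_cpu);
}

int main(void)
{
    totals("dual 2687W",   2,  8, 20);
    totals("dual 2687Wv2", 2,  8, 20);
    totals("dual 2687Wv3", 2, 10, 25);
    return 0;
}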

With the all-core turbo bins you can easily calculate L1 bandwidth at slightly below 4TB/s for the Haswell chips, slightly below 2TB/s for the Ivy chips, and about 1.75TB/s for the Sandy chips. For some reason I can't really see at this point, turbo kicks in to the highest bin at 128KB block size for Sandy and Ivy and 256KB for Haswell. Then at 512KB block size cache misses start to impair the bandwidth (apparently Haswell's L2 cache can still only push 1 instruction per tick, so the decline is more apparent). At 4MB block size the hit rate for L2 cache also starts to decline and data needs to be fetched at the L3/shared level. Past 16MB block size the hit rate in all cache levels is pretty abysmal and RAM paging occurs (pun not intended :p)



 

CynicalUnicorn · Super Moderator · 9,302 Posts · Discussion Starter · #6
I had to promise not to upload any offensive material, but I think I lied. Have you seen my handwriting?!

[Image: hand-drawn chart of cache bandwidths]

Log scale of bandwidths. 4P Haswell-EP should be slightly slower than 8P Ivy-EX due to clocks.
 

Iconoclast · 32,438 Posts
Quote:
Originally Posted by CynicalUnicorn View Post

What do "8-way" and "12-way" mean?
That's associativity.

http://en.wikipedia.org/wiki/CPU_cache#Associativity
Quote:
Originally Posted by CynicalUnicorn View Post

Is this like RAM channels?
No.
Quote:
Originally Posted by CynicalUnicorn View Post

If yes, how wide is the bus?
Depends on the CPU. It's independent of associativity and line size.

An example: http://www.anandtech.com/show/6355/intels-haswell-architecture/9
Quote:
Originally Posted by CynicalUnicorn View Post

Further down on the page, it says each cache level has a "line size" of 64 bytes. Is that a 512-bit bus per "channel?"
No. Line size is the fixed block size of transfers between cache levels or memory. It doesn't necessarily have anything to do with the bus/interface width itself.
Quote:
Originally Posted by CynicalUnicorn View Post

Am I on the right track, or am I totally lost?
Mostly lost.

All you need to do to figure out maximum theoretical bandwidth is find the bandwidth per cycle and multiply it by the number of cycles per second (the clock speed). You already know best-case latency.

Many other factors will influence efficiency (including line size, associativity, exclusivity, prefetchers, et al, etc, so on and so forth) so predicting actual performance from theoretical performance, without a similar architecture to compare it to, is very difficult.
 