Oleg Zabluda's blog
Tuesday, September 25, 2012
 
Blast from the past: accidentally stumbled upon my original reaction to, now classic, article:

=== http://tech.groups.yahoo.com/group/CoolTechClub/message/355 ===

From Oleg Zabluda ozabluda@... Fri Jan 07 10:40:57 2005
To: cooltechclub@yahoogroups.com
Message-ID: <20050107184056.87096.qmail@...>
Subject: "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software" from Herb Sutter

I have been preaching the same thing ever since I realized it in the summer of 2003. Here is what is essentially a word-for-word narrative of my thoughts on the subject (minus the modern CPU architecture details):

http://www.gotw.ca/publications/concurrency-ddj.htm

=== http://tech.groups.yahoo.com/group/CoolTechClub/message/xxxxx ===

Anupam Kapoor wrote:

I think his assertion regarding "old ways of increasing CPU speeds are hitting practical limits" is flawed. Transistor counts are still rising; for example, the Nvidia 6800 has approx. 222M transistors, while the latest P4s have approx. 150M (most of which is cache).

Some more discussion about this at:
http://lambda-the-ultimate.org/node/view/458#comment. Lots of goodies here!

=== http://tech.groups.yahoo.com/group/CoolTechClub/message/356 ===

From ozabluda@... Fri Jan 07 12:38:08 2005

Transistor count is rising, but it no longer leads to meaningfully increasing CPU
speeds on typical applications, except for the cache effect. Hence the move to
multicore CPUs: it's a better use of transistors. The cache effect is irrelevant
for single-vs-multicore because the same cache can be used by all cores.

=== http://tech.groups.yahoo.com/group/CoolTechClub/message/357 ===

From: "Vladimir Bronstein" Fri Jan 07 13:08:54 2005

Actually, it is an interesting thought experiment: let's say you have 100
million transistors - what would be the most efficient architecture?
Let's assume that a relatively simple processor takes about 2 million
transistors - I believe that is a reasonable assumption.
So would it be 50 processors without cache, 25 processors each with
40KB of cache (each cache bit takes approx. 6 transistors), or 10
processors with...
It definitely depends on the application, but I believe that with a proper
level of optimization, transistors are better spent on processing power
than on cache. This follows from the trivial fact that transistors in a
cache are used very rarely (only when a particular memory location is read),
while in the processor they are used much more often (e.g., the ALU is used
for a large percentage of all operations). In addition, those processors
should operate at relatively low speeds - this alleviates the problem of
power dissipation, and at the same time helps to match the performance of
the processor with the performance of memory - of course, assuming the
emergence of multiport memories for such designs.
Vladimir
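
A quick sanity check of the arithmetic above, using the message's own figures (about 2 million transistors per simple core, ~6 transistors per cache bit); this is a back-of-the-envelope sketch in Python, not a design:

# Split a fixed transistor budget between simple cores and cache, using the
# figures from the message above (assumptions, not measurements):
#   total budget: 100 million transistors
#   a simple core: ~2 million transistors
#   SRAM cache: ~6 transistors per bit
TOTAL_TRANSISTORS = 100_000_000
CORE_TRANSISTORS = 2_000_000
TRANSISTORS_PER_CACHE_BYTE = 6 * 8

for cores in (50, 25, 10):
    leftover = TOTAL_TRANSISTORS - cores * CORE_TRANSISTORS  # left for cache
    cache_bytes = leftover / TRANSISTORS_PER_CACHE_BYTE
    per_core_kb = cache_bytes / cores / 1024                 # if split evenly
    print(f"{cores:3d} cores -> about {per_core_kb:4.0f} KB of cache per core")

# Prints roughly: 50 cores -> 0 KB, 25 cores -> 41 KB, 10 cores -> 163 KB.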

=== http://tech.groups.yahoo.com/group/CoolTechClub/message/359 ===

From Oleg Zabluda Fri Jan 07 14:20:03 2005

Here is the important point that I want to put across, which in my view
clarifies things a lot:

To a first approximation, the optimal number of cores is independent of the
proportion of transistors used for cache, and the cache size is independent
of the number of cores.

Explanation:

Assumption: for server applications the only thing you care about is total
throughput, not latency.

To a first approximation, the total number of transistors, clock speed,
memory bus speed and width, etc., are independent of the number of cores
or their architecture. This is true for simple cores.

Given that, you simply come up with the core that has the highest ratio of
operations per second to transistors in both the core and the cache.
Given the workload, the ratio of transistors in the core to transistors in
the cache is a function only of the ratio of CPU speed to memory access speed.

Then you try to share the cache among as many cores as possible, to take
advantage of statistical multiplexing, but without hitting too much
synchronization overhead. I will call this a core cluster. Then you place as
many of the clusters on a die as possible. This gives you the total number of cores.

We know experimentally that even adding a second ALU to a core is a net loss
for that ratio. The only thing that is a win is a small pipeline.
For maximal throughput, we should be producing 64-128-core dies right now.
The only reason it is not happening is that software development
is not ready and it's not clear how to share the cost between
server and desktop hardware.
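
A minimal sketch of that optimization, with made-up placeholder numbers for core size and sustained IPC (none of these figures come from the thread; they only illustrate why the ops-per-second-per-transistor ratio favors simple cores):

# Toy "throughput per transistor" model: for each candidate core design,
# estimate aggregate throughput when a fixed die budget is tiled with that
# core plus its share of cache. All numbers are illustrative placeholders.
DIE_BUDGET = 100_000_000   # transistors available on the die
CLOCK_HZ = 1_000_000_000   # assume the clock is the same for every design

candidates = {
    # name: (core transistors, sustained IPC, cache transistors per core)
    "simple in-order core":     (2_000_000, 1.0, 2_000_000),
    "2-way superscalar core":   (6_000_000, 1.3, 2_000_000),
    "aggressive out-of-order": (12_000_000, 1.5, 2_000_000),
}

for name, (core_tr, ipc, cache_tr) in candidates.items():
    per_core = core_tr + cache_tr
    n_cores = DIE_BUDGET // per_core        # how many fit on the die
    total_gops = n_cores * ipc * CLOCK_HZ / 1e9
    ratio = ipc * CLOCK_HZ / per_core       # ops/sec per transistor
    print(f"{name:25s}: {n_cores:3d} cores, {total_gops:5.1f} Gops/s, "
          f"{ratio:6.1f} ops/s per transistor")

# The simple core wins on both total throughput and throughput per transistor.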

=== http://tech.groups.yahoo.com/group/CoolTechClub/message/358 ===

From Oleg Zabluda Fri Jan 07 13:19:46 2005

Let me expand some more. The only really useful thing you can do with more
transistors is to add more calculating units (ALUs, FPUs, etc.), that is, to
increase superscalarity. The rest is just baggage to keep them busy. All this
insane crap with out-of-order reads/writes, speculative execution, branch
prediction, register renaming, etc., is just baggage to keep the ALUs occupied
(and the pipelines filled). Guess what? Currently we spend about half the
transistors on that, and we can still keep the second ALU busy maybe 30% of the time.

That is, unless you hand-optimize specifically with that in mind, which almost
nobody can do because nobody really understands what is going on except in the
simplest cases. The people who can optimize best are Intel engineers with
diagnostic tools unavailable to others. Even then, the optimizations are not
portable across CPUs and are not always deterministic.

This is the end of the line for that.

Fortunately, on the server side, almost all major applications are trivially
parallelizable. So there will be a divergence between server CPUs (a larger
number of simple, slow cores with higher total throughput and higher latency)
and desktop CPUs (the other way around). If not, as Herb says, we can
always run spyware on the other cores.
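
A minimal illustration of "trivially parallelizable" in this sense: when requests share no state, a pool of worker processes spreads them across however many cores exist, with no change to the per-request code. The request handler below is a made-up placeholder for real server work:

# Sketch of an embarrassingly parallel server workload: each "request" is
# independent, so a process pool uses all cores with no shared state.
from multiprocessing import Pool

def handle_request(request_id: int) -> int:
    # Placeholder for per-request work (parsing, lookups, rendering, ...).
    return sum(i * i for i in range(10_000)) + request_id

if __name__ == "__main__":
    with Pool() as pool:                       # one worker per core by default
        results = pool.map(handle_request, range(1_000))
    print(f"handled {len(results)} independent requests")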

=== http://tech.groups.yahoo.com/group/CoolTechClub/message/360 ===

From Oleg Zabluda Fri Jan 07 17:27:49 2005

I am looking for a graph of MIPS/clock/transistors, but I am too lazy to make one myself [1]. In the process, I came across this URL from Patterson:

http://www.cs.berkeley.edu/~pattrsn/talks/NAE.ppt

Page 12 shows that even for the relatively simple Pentium III, with 10M
transistors, about 80% of them are basically wasted as far as throughput
is concerned.

[1] In the absence of a graph, just use your common sense for now.
How many instructions per clock cycle (IPC) can a modern general-purpose
CPU sustain on a typical server workload? Maybe 1.3 if you are lucky. That's
despite the fact that about 80-90% [2] of the transistors in the core (not
in the caches) are dedicated to squeezing out that extra 30%. It's not going
to improve much. Now that clock speed is stalled for a while, the game
is over.

[2] In fact I am not sure about the exact percentage. The total amount
of "wasted" transistors is about 97%, but it's hard for me to tell
exactly how much of it is wasted on the CPU-vs-memory speed mismatch and how much is wasted on superscalarity.
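
To make that arithmetic concrete, compare throughput per core transistor for a simple in-order core against a complex superscalar one, using the thread's own rough figures (~2M transistors and IPC ~1.0 for a simple core, ~10M transistors and IPC ~1.3 for a Pentium III-class core); these are estimates, not measurements:

# Rough arithmetic behind footnote [1]: instructions per cycle delivered
# per million core transistors. Figures are the thread's rough estimates.
cores = {
    "simple in-order (486/Pentium-class)": (2_000_000, 1.0),
    "complex superscalar (P-III-class)":  (10_000_000, 1.3),
}

for name, (transistors, ipc) in cores.items():
    ipc_per_m = ipc / (transistors / 1_000_000)
    print(f"{name:38s}: {ipc_per_m:.2f} IPC per million transistors")

# ~0.50 vs ~0.13: roughly 5x the transistors buy only ~1.3x the sustained IPC.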

=== http://tech.groups.yahoo.com/group/CoolTechClub/message/363 ===

From Sergiy Lozovsky Tue Jan 11 06:09:29 2005

Hi,

What is so new in this information? I had seen a physical explanation of why the CPU clock limit is around 4GHz (for ordinary technology, not optics, etc.) long before 2003 (it was in some article about optical or neurocomputers, I think). Many companies create applications that use multiple CPUs (computers) - Google, for example. There are even open source projects to support clustering, so people who need performance haven't relied on a single CPU for a long, long time. All supercomputers have multiple CPUs (usually it is a cluster, so each CPU or group of CPUs has its own RAM, cache, etc.).

=== http://tech.groups.yahoo.com/group/CoolTechClub/message/365 ===

From Oleg Zabluda Tue Jan 11 11:07:55 2005

CPUs are getting faster in two major ways: clock speed and the number of
transistors. Many different people were saying many different
things for a long time. Some were saying that clock speed would not increase
any more. Others were saying that the number of transistors would not increase
any more. It was hard to tell the truth from BS.
Now some of the predictions have turned out to be correct. Some people knew
what they were talking about, and some were right due to the stopped-clock
effect.

Here is what's new and what's not:

1. The ones who claimed that clock speed would not increase any more turned out to be correct, because it happened. That's new.

2. Those who claimed that the number of transistors would not keep increasing
are wrong so far. At this time it appears to me that the number of
transistors on a die will increase by at least a factor of 100 over the next 10
years using regular lithography. That's not new.

3. It turns out that a larger number of transistors no longer leads to faster
single-core CPUs. That's new.

4. In order to use all those transistors, people are building multi-core CPUs.
It appears that the natural progression of things will lead to thousands or
tens of thousands of 486- to Pentium III-class CPUs on a die. That's new.

5. Existing software will not be any faster on those multi-core CPUs. In fact,
it might be slower, because the individual CPUs might be slower. The free lunch
is over. That's new.

6. Very few people know how to create software for SMP computers. Nobody knows how to do it for a 1024-way SMP machine. Things you don't know how to do are hard. Whether the difficulty is inherent or accidental, we will know
after we learn how to do it. The time to learn is now. This is new.

7. Existing SMP machines, clusters, etc., give us a place to start. But
it's nothing compared to where we are heading. If this progresses as I
envision, in 10 years we will have 1024 CPUs per die, 4 dies per stack,
4 stacks per chip, 4 chips per blade, 8 blades per box, and 4M boxes per
Google. The only reason this will not happen is if software doesn't keep up.
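
Multiplying out the (admittedly speculative) factors in point 7:

# The hierarchy from point 7, multiplied out; the factors are exactly the
# ones stated in the text, nothing more.
cores_per_die    = 1024
dies_per_stack   = 4
stacks_per_chip  = 4
chips_per_blade  = 4
blades_per_box   = 8
boxes_per_google = 4_000_000

cores_per_box = (cores_per_die * dies_per_stack * stacks_per_chip
                 * chips_per_blade * blades_per_box)
print(f"cores per box:    {cores_per_box:,}")                     # 524,288
print(f"cores per Google: {cores_per_box * boxes_per_google:,}")  # ~2.1 trillion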

==========================

Original threaded view:
http://tech.groups.yahoo.com/group/CoolTechClub/messages/355?threaded=1&m=e
