I believe that Bjorn is right, and it's the fence following volatile stores on a multiprocessor that's causing the problem.  That sounds far more plausible than anything else I've seen here, including my own explanations.
Note that volatile doesn't force anything out of the cache; it just forces the processor to execute an mfence instruction after each store, to enforce ordering between a volatile store and a subsequent volatile load.  On a P4 that typically costs you more than 100 cycles.  On a Core 2 Duo I believe it's much less, but still significant.
(Since the volatiles are only accessed by a single thread, I also believe it's actually correct to effectively optimize out the volatile qualifier in this case, or to optimize away the whole loop for that matter.   I'd be mildly impressed if a compiler actually did that.  As a general rule, it's poor practice to put empty loops in microbenchmarks.  It makes the benchmark very dependent on aspects of compiler optimization that don't matter for real code.)
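
To make the store cost visible without relying on an empty loop, something like the following rough sketch could be used.  The class and method names (VolatileStoreCost, storeVolatile, storePlain) are made up for illustration and are not from the original benchmark.  Each loop returns a value that depends on every iteration, so the work can't simply be thrown away, though a JIT is still free to simplify the plain-field loop.  Treat any timings as indicative only.

```java
public class VolatileStoreCost {
    // Hypothetical fields for illustration; not taken from the original Worker class.
    static volatile long volatileSlot; // volatile store: ordered against later volatile loads
    static long plainSlot;             // ordinary store: no ordering requirement

    // Store to and reload a volatile field n times; returns the sum of the values read.
    static long storeVolatile(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            volatileSlot = i;    // volatile store
            sum += volatileSlot; // subsequent volatile load
        }
        return sum;
    }

    // Same loop against a plain field, for comparison.
    static long storePlain(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            plainSlot = i;
            sum += plainSlot;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1000000; // same count as the volatile accesses discussed in this thread

        // Warm up so both loops are compiled before timing.
        for (int r = 0; r < 20; r++) {
            storeVolatile(n);
            storePlain(n);
        }

        long t0 = System.nanoTime();
        long a = storeVolatile(n);
        long t1 = System.nanoTime();
        long b = storePlain(n);
        long t2 = System.nanoTime();

        // Print the sums too, so the results are observably used.
        System.out.println("volatile: " + (t1 - t0) + " ns (sum " + a + ")");
        System.out.println("plain:    " + (t2 - t1) + " ns (sum " + b + ")");
    }
}
```

Running this on a single-CPU machine and on an SMP machine should show whether the gap between the two loops tracks the fence cost described above.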


	From what I can read in this thread so far - it's either a scheduling issue with the OS, or
	I'm being too aggressive with use of the volatile (I chose this since I wanted to see what
	the processors would act like when forced to go to main memory, rather than fetching 
	from their 4MB cache).
	Oh, and it's Linux kernel 2.6.17.
	On 12/12/06, Bjorn Antonsson <ban at bea.com > wrote: 

		I would say that a lot of the extra time it takes comes from the fact that the volatile stores/loads in the Worker class, actually 1000000 of them, do mean something on a multi CPU. 
		On a typical x86 SMP machine the load/store/load pattern on volatiles results in an mfence instruction, which is quite costly. This is a normal load/store/load without mfence on a single CPU machine, since we are guaranteed that the next thread will have the same view of the memory. 
		>       David Holmes wrote:
		>       > I've assumed the platform is Windows, but if it is Linux then that opens
		>       > other possibilities. The problem can be explained if the busy-wait thread
		>       > doesn't get descheduled (which is easy to test by changing it to not be a
		>       > busy-wait). The issue as to why it doesn't get descheduled is then the
		>       > interesting part. I suspect an OS scheduling quirk on multi-core, but need
		>       > more information.
		>       >>>>>    private long doIt() {
		>       >>>>>        long startTime = System.currentTimeMillis();
		>       >>>>>        for(int i = 0; i < howMany; i++) { 
		>       >>>>>            new Thread(new Worker()).start();
		>       >>>>>        }
		>       >>>>>        while(!finished);
		>       >>>>>        long endTime = System.currentTimeMillis();
		>       >>>>>        return (endTime - startTime);
		>       >>>>>
		>       >>>>>    }
		>       Historically, I've found that busy waits like the above are problematic.  I'd go
		>       along with David's comment/thought and try
		>               while(!finished) Thread.yield();
		>       or something else to cause it to get descheduled for a whole quantum for each
		>       check rather than busy waiting for a whole quantum, which will keep at least one
		>       CPU busy doing nothing productive.
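
For anyone who wants to try the Thread.yield() suggestion quoted above, here is a minimal sketch of the same start-then-wait pattern.  The class name YieldingWait, the runWorkers method and the two counters are assumptions made for illustration; the original Worker class is not shown in this thread, so the worker body here is only a stand-in.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class YieldingWait {
    // Set by the last worker to finish; volatile so the waiting thread sees the update.
    static volatile boolean finished;

    // Starts howMany stand-in workers, then waits for them while yielding the CPU.
    // Returns the number of workers that ran.
    static int runWorkers(int howMany) {
        if (howMany <= 0) {
            return 0; // nothing to wait for
        }
        finished = false;
        final AtomicInteger remaining = new AtomicInteger(howMany);
        final AtomicInteger completed = new AtomicInteger();

        for (int i = 0; i < howMany; i++) {
            new Thread(() -> {
                completed.incrementAndGet();          // stand-in for the real work
                if (remaining.decrementAndGet() == 0) {
                    finished = true;                  // last worker signals completion
                }
            }).start();
        }

        // Offer the CPU to other threads on each check instead of spinning flat out.
        while (!finished) {
            Thread.yield();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        long startTime = System.currentTimeMillis();
        int ran = runWorkers(100);
        long endTime = System.currentTimeMillis();
        System.out.println(ran + " workers in " + (endTime - startTime) + " ms");
    }
}
```

Thread.yield() is only a hint to the scheduler, so whether the waiting thread actually gets descheduled depends on the JVM and OS.  Comparing this against the original busy wait on the same machine is the point of the experiment.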
