[concurrency-interest] Multi-core testing, help with findings

Boehm, Hans hans.boehm at hp.com
Tue Dec 12 14:38:07 EST 2006

I believe that Bjorn is right, and it's the fence following volatile stores on a multiprocessor that's causing the problem.  That sounds far more plausible than anything else I've seen here, including my own explanations.
Note that volatile doesn't force anything out of the cache; it just forces the processor to execute an mfence instruction for each store, to enforce ordering between a volatile store and a subsequent volatile load.  On a P4 that typically costs you > 100 cycles.  On a Core 2 Duo I believe it's much less, but still significant.  
(Since the volatiles are only accessed by a single thread, I also believe it's actually correct to effectively optimize out the volatile qualifier in this case, or to optimize away the whole loop for that matter.   I'd be mildly impressed if a compiler actually did that.  As a general rule, it's poor practice to put empty loops in microbenchmarks.  It makes the benchmark very dependent on aspects of compiler optimization that don't matter for real code.)
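[Editorial note: the original Worker class is not reproduced in this thread, so the field names and iteration count below are assumptions. This is a minimal sketch of the pattern under discussion, with Gregg's Thread.yield() suggestion substituted for the bare busy-wait:]

```java
// Sketch of the benchmark pattern discussed in this thread.
// Names (count, finished) and the 1000000 count are assumptions;
// the original Worker source is not shown in the thread.
public class VolatileSpin {
    // On an x86 SMP machine, each volatile store is followed by a fence
    // (mfence or a lock-prefixed instruction), which is what makes the
    // loop below expensive on multi-core hardware.
    static volatile int count;
    static volatile boolean finished;

    static class Worker implements Runnable {
        public void run() {
            for (int i = 0; i < 1000000; i++) {
                count++;   // single writer thread, so the non-atomic ++ is safe here
            }
            finished = true;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        Thread t = new Thread(new Worker());
        t.start();
        // Yield instead of spinning bare, so the waiting thread can be
        // descheduled rather than pinning a core doing nothing productive.
        while (!finished) Thread.yield();
        t.join();
        System.out.println("took " + (System.currentTimeMillis() - start) + " ms");
    }
}
```

[On a single-CPU machine no fence is needed for the same ordering guarantee, which is consistent with Bjorn's observation below that the identical code runs much faster there.]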


	From: concurrency-interest-bounces at cs.oswego.edu [mailto:concurrency-interest-bounces at cs.oswego.edu] On Behalf Of David Harrigan
	Sent: Tuesday, December 12, 2006 1:52 AM
	To: concurrency-interest at cs.oswego.edu
	Subject: Re: [concurrency-interest] Multi-core testing, help with findings
	Hi All,
	I would love to explore this further. I could certainly try out the Thread.yield()... but I have
	a small problemo now! My Core 2 Duo is going back to the factory since the screen doesn't
	appear to want to play ball :-( I'll have to wait until I can try the suggestions out. However, 
	this of course does not mean no-one else can give it a whirl. All this is very interesting, and
	I think it highlights an issue that is going to become more and more prevalent - as more
	developers have multi-core machines to develop on, these things are going to come 
	up more often...
	From what I can read in this thread so far, it's either a scheduling issue with the OS, or
	I'm being too aggressive with my use of volatile (I chose it because I wanted to see how
	the processors would act when forced to go to main memory, rather than fetching 
	from their 4MB cache).
	Oh, and it's Linux kernel 2.6.17.
	On 12/12/06, Bjorn Antonsson <ban at bea.com> wrote: 

		I would say that a lot of the extra time it takes comes from the fact that the volatile stores/loads in the Worker class, actually 1000000 of them, do mean something on a multi-CPU machine. 
		On a typical x86 SMP machine the load/store/load pattern on volatiles results in an mfence instruction, which is quite costly. It is a normal load/store/load without mfence on a single-CPU machine, since we are guaranteed that the next thread will have the same view of memory. 
		> -----Original Message-----
		> From: concurrency-interest-bounces at cs.oswego.edu
		> [mailto:concurrency-interest-bounces at cs.oswego.edu] On Behalf
		> Of David Harrigan
		> Sent: den 12 december 2006 07:57
		> To: concurrency-interest at cs.oswego.edu 
		> Subject: Re: [concurrency-interest] Multi-core testing, help
		> with findings
		> Hi,
		> I completely forgot to mention that platform is Linux (Ubuntu 6.10).
		> Just scanning thru the mail, will read when I get to work... 
		> -=david=-
		> On 12/12/06, Gregg Wonderly <gregg at cytetech.com> wrote:
		>       David Holmes wrote:
		>       > I've assumed the platform is Windows, but if it is 
		> linux then that opens
		>       > other possibilities. The problem can be explained if
		> the busy-wait thread
		>       > doesn't get descheduled (which is easy to test by
		> changing it to not be a 
		>       > busy-wait). The issue as to why it doesn't get
		> descheduled is then the
		>       > interesting part. I suspect an OS scheduling quirk on
		> multi-core, but need
		>       > more information. 
		>       >>>>>    private long doIt() {
		>       >>>>>        long startTime = System.currentTimeMillis();
		>       >>>>>        for(int i = 0; i < howMany; i++) { 
		>       >>>>>            new Thread(new Worker()).start();
		>       >>>>>        }
		>       >>>>>        while(!finished);
		>       >>>>>        long endTime = System.currentTimeMillis();
		>       >>>>>        return (endTime - startTime);
		>       >>>>>
		>       >>>>>    }
		>       Historically, I've found that busy waits like the above 
		> are problematic.  I'd go
		>       along with David's comment/thought and try
		>               while(!finished) Thread.yield();
		>       or something else to cause it to get descheduled for a 
		> whole quanta for each
		>       check rather than busy waiting for a whole quanta which
		> will keep at least one
		>       CPU busy doing nothing productive.
		>       Gregg Wonderly
