[concurrency-interest] Multi-core testing, help with findings

David Harrigan dharrigan at gmail.com
Wed Dec 13 01:52:54 EST 2006


Hi,

Thanks to everyone who has spent the time looking at my little sample. It
has been very educational. Yes, I was wrong to use the volatile on the
counter - I thought it was a good way to test real-world performance, but
I was wrong, and it's clear why now. All very interesting stuff :-)

By way of a test, what could I try in order to see the difference that two
cores make over one? I want to prove to myself that I've spent money on a
worthwhile thing ;-)
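
Would something like the following be a fair comparison? It's only a rough
sketch (the class name, task and numbers are purely illustrative): run the
same fixed amount of CPU-bound work, with no shared state, on a pool of one
thread and then on a pool of two, and compare the wall-clock times.

    import java.util.concurrent.*;

    public class CoreScalingTest {

        // CPU-bound work with no shared state, so thread count is the only variable.
        static long burn() {
            long x = 17;
            for (int i = 0; i < 50000000; i++) {
                x = x * 31 + i;
            }
            return x;
        }

        static long timeWith(int threads, int tasks) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            CompletionService<Long> done = new ExecutorCompletionService<Long>(pool);
            long start = System.currentTimeMillis();
            for (int i = 0; i < tasks; i++) {
                done.submit(new Callable<Long>() {
                    public Long call() {
                        return burn();
                    }
                });
            }
            for (int i = 0; i < tasks; i++) {
                done.take();                 // wait for every task to complete
            }
            pool.shutdown();
            return System.currentTimeMillis() - start;
        }

        public static void main(String[] args) throws Exception {
            timeWith(2, 4);                  // warm-up so JIT compilation doesn't skew the timings
            System.out.println("1 thread:  " + timeWith(1, 8) + " ms");
            System.out.println("2 threads: " + timeWith(2, 8) + " ms");
        }
    }

With two cores the two-thread run should finish in roughly half the time;
with one core the two timings should be about the same.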

-=david=-

On 12/13/06, David Holmes <dcholmes at optusnet.com.au> wrote:
>
>  I really must learn to read what is written in sample code rather than
> what I expect to see :) The volatile on the per-thread counter had escaped
> my notice. I most definitely agree that that is the cause of the performance
> loss - volatiles are free on UP systems but not on MP systems, and this is
> pathological usage.
>
> Cheers,
> David Holmes
>
> -----Original Message-----
> From: concurrency-interest-bounces at cs.oswego.edu On Behalf Of Boehm, Hans
> Sent: Wednesday, 13 December 2006 5:38 AM
> To: David Harrigan
> Cc: concurrency-interest at cs.oswego.edu
> Subject: Re: [concurrency-interest] Multi-core testing, help with findings
>
> I believe that Bjorn is right, and it's the fence following volatile
> stores on a multiprocessor that's causing the problem. That sounds far more
> plausible than anything else I've seen here, including my own explanations.
>
> Note that volatile doesn't force anything out of the cache; it just forces
> the processor to execute an mfence instruction for each store, to enforce
> ordering between a volatile store and a subsequent volatile load. On a P4
> that typically costs you > 100 cycles. On a Core 2 Duo I believe it's much
> less, but still significant.
>
> (Since the volatiles are only accessed by a single thread, I also believe
> it's actually correct to effectively optimize out the volatile qualifier in
> this case, or to optimize away the whole loop for that matter. I'd be
> mildly impressed if a compiler actually did that. As a general rule, it's
> poor practice to put empty loops in microbenchmarks. It makes the benchmark
> very dependent on aspects of compiler optimization that don't matter for
> real code.)
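>
> To make the pattern concrete - the original Worker isn't quoted in this
> thread, so this is only a sketch with illustrative names - the expensive
> and the cheap versions of the per-thread counter loop look roughly like:
>
>     // Pathological on MP: every increment is a volatile read plus a
>     // volatile store, and each store is followed by a fence.
>     class VolatileCounterWorker implements Runnable {
>         private volatile int count;
>
>         public void run() {
>             for (int i = 0; i < 1000000; i++) {
>                 count++;
>             }
>         }
>     }
>
>     // A plain local counter lets the JIT keep the value in a register
>     // (or drop the loop entirely), so it costs almost nothing.
>     class LocalCounterWorker implements Runnable {
>         public void run() {
>             int count = 0;
>             for (int i = 0; i < 1000000; i++) {
>                 count++;
>             }
>         }
>     }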
>
> Hans
>
>  ------------------------------
> From: concurrency-interest-bounces at cs.oswego.edu On Behalf Of David Harrigan
> Sent: Tuesday, December 12, 2006 1:52 AM
> To: concurrency-interest at cs.oswego.edu
> Subject: Re: [concurrency-interest] Multi-core testing, help with findings
>
> Hi All,
>
> I would love to explore this further. I could certainly try out the
> Thread.yield()... but I have a small problemo now! My Core 2 Duo is going
> back to the factory, since the screen doesn't appear to want to play ball
> :-( I'll have to wait until I can try the suggestions out. However, this of
> course does not mean no one else can give it a whirl. All this is very
> interesting, and I think it highlights an area that is going to become more
> and more prevalent - as more developers have multi-core machines to develop
> on, these things are going to come up more often...
>
> From what I can read in this thread so far, it's either a scheduling issue
> with the OS, or I'm being too aggressive with my use of volatile (I chose
> it because I wanted to see how the processors would behave when forced to
> go to main memory rather than fetching from their 4MB cache).
>
> Oh, and it's Linux kernel 2.6.17.
>
> -=david=-
>
> On 12/12/06, Bjorn Antonsson <ban at bea.com > wrote:
> >
> > Hi,
> >
> > I would say that a lot of the extra time comes from the fact that the
> > volatile stores/loads in the Worker class - 1,000,000 of them - actually
> > do mean something on a multi-CPU machine.
> >
> > On a typical x86 SMP machine the load/store/load pattern on volatiles
> > results in an mfence instruction, which is quite costly. On a single-CPU
> > machine it is a normal load/store/load without the mfence, since we are
> > guaranteed that the next thread will have the same view of memory.
> >
> > /Björn
> >
> > > -----Original Message-----
> > > From: concurrency-interest-bounces at cs.oswego.edu
> > > [mailto: concurrency-interest-bounces at cs.oswego.edu] On Behalf
> > > Of David Harrigan
> > > Sent: den 12 december 2006 07:57
> > > To: concurrency-interest at cs.oswego.edu
> > > Subject: Re: [concurrency-interest] Multi-core testing, help
> > > with findings
> > >
> > > Hi,
> > >
> > > I completely forgot to mention that platform is Linux (Ubuntu 6.10).
> > >
> > > Just scanning thru the mail, will read when I get to work...
> > >
> > > -=david=-
> > >
> > >
> > > On 12/12/06, Gregg Wonderly <gregg at cytetech.com> wrote:
> > >
> > >
> > >
> > >       David Holmes wrote:
> > >       > I've assumed the platform is Windows, but if it is Linux then
> > >       > that opens other possibilities. The problem can be explained if
> > >       > the busy-wait thread doesn't get descheduled (which is easy to
> > >       > test by changing it to not be a busy-wait). The issue as to why
> > >       > it doesn't get descheduled is then the interesting part. I
> > >       > suspect an OS scheduling quirk on multi-core, but need more
> > >       > information.
> > >
> > >       >>>>>    private long doIt() {
> > >       >>>>>        long startTime = System.currentTimeMillis();
> > >       >>>>>        for(int i = 0; i < howMany; i++) {
> > >       >>>>>            new Thread(new Worker()).start();
> > >       >>>>>        }
> > >       >>>>>        while(!finished);
> > >       >>>>>        long endTime = System.currentTimeMillis();
> > >       >>>>>        return (endTime - startTime);
> > >       >>>>>
> > >       >>>>>    }
> > >
> > >       Historically, I've found that busy-waits like the above are
> > >       problematic. I'd go along with David's comment/thought and try
> > >
> > >               while(!finished) Thread.yield();
> > >
> > >       or something else to cause it to get descheduled for a whole
> > >       quantum on each check, rather than busy-waiting for a whole
> > >       quantum, which will keep at least one CPU busy doing nothing
> > >       productive.
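> > >
> > >       For example (only a sketch - it assumes the Worker runnable and
> > >       the howMany field from the original code, and needs an import of
> > >       java.util.concurrent.CountDownLatch): have each worker thread
> > >       count down a latch, and let the timing thread block in await()
> > >       instead of spinning on a flag:
> > >
> > >               private long doIt() throws InterruptedException {
> > >                   final CountDownLatch done = new CountDownLatch(howMany);
> > >                   long startTime = System.currentTimeMillis();
> > >                   for (int i = 0; i < howMany; i++) {
> > >                       new Thread(new Runnable() {
> > >                           public void run() {
> > >                               try {
> > >                                   new Worker().run();
> > >                               } finally {
> > >                                   done.countDown();   // signal completion
> > >                               }
> > >                           }
> > >                       }).start();
> > >                   }
> > >                   done.await();   // blocks without keeping a CPU busy
> > >                   long endTime = System.currentTimeMillis();
> > >                   return (endTime - startTime);
> > >               }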
> > >
> > >       Gregg Wonderly
> > >
> > >
> > >
> > >
> >
>
>
> _______________________________________________
> Concurrency-interest mailing list
> Concurrency-interest at altair.cs.oswego.edu
> http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest
>
>
>

