[concurrency-interest] Recursive Directory checker

Benedict Elliott Smith lists at laerad.com
Sun Feb 26 14:32:38 EST 2012


It depends on what you want to achieve. If I were solving this problem I would
choose simply not to follow sym-links, as this is the simplest way to
ensure you never visit a directory more than once (barring concurrent
modifications of the directory structure, which are difficult to deal with
on a remote file system anyway).
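
For what it's worth, if you can use Java 7, java.nio.file.Files.walkFileTree does
not follow symbolic links unless you explicitly pass FileVisitOption.FOLLOW_LINKS,
so even a minimal single-threaded visitor gets that behaviour for free. A rough
sketch (the /home/user/ path is just the one from your pseudocode):

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class SizeWalker {
    public static void main(String[] args) throws IOException {
        final long[] total = { 0 };
        // walkFileTree does not follow symbolic links unless FOLLOW_LINKS is
        // passed, so each directory is visited at most once.
        Files.walkFileTree(Paths.get("/home/user/"), new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                if (attrs.isRegularFile()) {
                    total[0] += attrs.size();
                }
                return FileVisitResult.CONTINUE;
            }
            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                return FileVisitResult.CONTINUE; // ignore unreadable entries
            }
        });
        System.out.println("total bytes: " + total[0]);
    }
}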


On 24 February 2012 19:21, Aleksandar Lazic <
al-javaconcurrencyinterest at none.at> wrote:

>
> Hi,
>
> We scan over a NAS share (an NFS NetApp filer), so I don't think that the
> low-level disk handling is in my hands.
>
>
>
> I currently use the treescan program from the IO::AIO Perl module
>
> http://cvs.schmorp.de/IO-AIO/bin/treescan?view=markup
>
> which uses 8 threads to collect the necessary data.
>
>
>
> The two links below show my description of the problem from the Perl point of view:
>
> http://lists.schmorp.de/pipermail/anyevent/2012q1/000227.html
>
> http://lists.schmorp.de/pipermail/anyevent/2012q1/000231.html
>
>
>
> The reason I want to switch to Java is that I need a solution which I can
> 'just' extract and run, without having to install a lot of modules for a
> particular scripting language.
>
> Can you please tell me what you would suggest for handling the directories
> which have already been scanned?
>
> Best regards
>
> Aleks
>
> On 24-02-2012 19:29, Benedict Elliott Smith wrote:
>
> I hate to nitpick, but this is only true for sequential reads; as soon as
> you devolve to random IO (and for large directory trees metadata traversal
> is unlikely at best to remain sequential, even if there are no other
> competing IO requests) you are much better off with multiple ops in flight,
> so the disk can choose the order in which it services them and, to some
> degree, maximize throughput. When performance-testing new file servers I
> have found single-threaded random IOPS are typically dreadful, even with
> dozens of disks. In my experience a multi-threaded directory traversal has
> usually been considerably faster than a single-threaded one.
>
> I don't think the choice of queue is likely to have a material impact on
> the performance of this algorithm, Aleksandar; IO will be your bottleneck.
> However, I think the use of a queue defeats the point of using the ForkJoin
> framework.
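>
> To illustrate that last point: a bare-bones RecursiveTask that sums sizes
> without any explicit queue might look roughly like the sketch below. The
> class name DirSizeTask, the /home/user/ path and the sym-link filter are
> just illustrative; error handling and task-size thresholds are omitted.
>
> import java.io.File;
> import java.nio.file.Files;
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.ForkJoinPool;
> import java.util.concurrent.RecursiveTask;
>
> public class DirSizeTask extends RecursiveTask<Long> {
>     private final File dir;
>
>     DirSizeTask(File dir) { this.dir = dir; }
>
>     @Override
>     protected Long compute() {
>         long size = 0;
>         List<DirSizeTask> subtasks = new ArrayList<DirSizeTask>();
>         File[] entries = dir.listFiles();
>         if (entries != null) {
>             for (File f : entries) {
>                 if (Files.isSymbolicLink(f.toPath())) {
>                     continue;            // don't follow sym-links (see above)
>                 }
>                 if (f.isDirectory()) {
>                     DirSizeTask t = new DirSizeTask(f);
>                     t.fork();            // hand the subdirectory to the pool
>                     subtasks.add(t);
>                 } else {
>                     size += f.length();
>                 }
>             }
>         }
>         for (DirSizeTask t : subtasks) {
>             size += t.join();            // gather the subdirectory totals
>         }
>         return size;
>     }
>
>     public static void main(String[] args) {
>         ForkJoinPool pool = new ForkJoinPool();   // defaults to one worker per core
>         long bytes = pool.invoke(new DirSizeTask(new File("/home/user/")));
>         System.out.println("total MBytes: " + bytes / (1024 * 1024));
>     }
> }
>
> There is no shared queue and no AtomicLong here: every task returns its own
> subtotal, and the pool's work-stealing distributes the subdirectories across
> the workers.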
> On 24 February 2012 17:52, Nathan Reynolds <nathan.reynolds at oracle.com> wrote:
>
>> I would like to point out that hard disks perform best when accessed in a
>> single threaded manner.  If you have 2 threads making requests, then the
>> disk head will have to swing back and forth between the 2 locations.  With
>> only 1 thread, the disk head doesn't have to travel as much.  Flash disks
>> (SSDs) are a different story.  We have seen optimal throughput when 16
>> threads hit the disk concurrently.  Your mileage will vary depending upon
>> the SSD.  So, you may not get much better performance from your directory
>> size counter by using multiple threads.
>>
>> I have found on Windows that defragmenting the hard drive and placing all
>> of the directory metadata together makes this kind of thing run really
>> fast (see MyDefrag); the disk head simply has to sit on the directory
>> metadata section of the hard disk. I realize you aren't running on
>> Windows, but you might consider something similar.
>>
>> Nathan Reynolds <http://psr.us.oracle.com/wiki/index.php/User:Nathan_Reynolds> | Consulting Member of Technical Staff | 602.333.9091
>> Oracle PSR Engineering <http://psr.us.oracle.com/> | Server Technology
>>
>> On 2/24/2012 8:59 AM, Aleksandar Lazic wrote:
>>
>> Dear list members,
>>
>> I'm in the process of writing a directory size counter.
>>
>> I'm new to all this thread/fork stuff, so please accept my apologies
>> for such a 'simple' question ;-)
>>
>> What is the 'best' class for such a program?
>>
>> ForkJoinTask
>> RecursiveAction
>> RecursiveTask
>>
>> This is what I plan to use for the main program.
>>
>> pseudocode
>> ###
>> main:
>>
>>  File startdir = new File("/home/user/");
>>  File[] files = startdir.listFiles();
>>
>>  add directories to the Queue.
>>
>> -----
>> I'm unsure which Queue is best for this.
>>
>> http://gee.cs.oswego.edu/dl/jsr166/dist/docs/java/util/Queue.html
>>
>> I tend towards BlockingDeque.
>> -----
>>
>>   ForkJoinPool fjp = new ForkJoinPool(5);
>>
>>   foreach worker
>>     get filesizes and $SummAtomicLong.addAndGet(filesizes);
>>
>> print "the Directory and there subdirs have {} Mbytes", $SummAtomicLong
>>
>> ####
>>
>> Worker:
>>
>>   foreach directory
>>     if directory is not in queue
>>       add directory to the Queue.
>>
>>   foreach file
>>     add filesize to $workerAtomicLong.addAndGet(file.size);
>> ###
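>>
>> To make this more concrete, here is a rough sketch of how I imagine the main
>> program and the workers could look with a shared LinkedBlockingDeque and a
>> global AtomicLong. The class name QueueDirSizer and the extra 'pending'
>> counter (so the workers know when the scan is finished) are just placeholders
>> of mine.
>>
>> import java.io.File;
>> import java.util.concurrent.*;
>> import java.util.concurrent.atomic.AtomicLong;
>>
>> public class QueueDirSizer {
>>     // shared queue of directories still to be scanned, plus the running total
>>     static final BlockingDeque<File> queue = new LinkedBlockingDeque<File>();
>>     static final AtomicLong totalBytes = new AtomicLong();
>>     // directories queued but not yet fully processed; 0 means we are done
>>     static final AtomicLong pending = new AtomicLong();
>>
>>     public static void main(String[] args) throws InterruptedException {
>>         pending.incrementAndGet();
>>         queue.add(new File("/home/user/"));
>>
>>         ExecutorService pool = Executors.newFixedThreadPool(5);
>>         for (int i = 0; i < 5; i++) {
>>             pool.execute(new Runnable() {
>>                 public void run() {
>>                     try {
>>                         while (pending.get() > 0) {
>>                             File dir = queue.poll(100, TimeUnit.MILLISECONDS);
>>                             if (dir == null) continue;
>>                             File[] entries = dir.listFiles();
>>                             if (entries != null) {
>>                                 for (File f : entries) {
>>                                     // note: sym-links are not filtered here
>>                                     if (f.isDirectory()) {
>>                                         pending.incrementAndGet();
>>                                         queue.add(f);
>>                                     } else {
>>                                         totalBytes.addAndGet(f.length());
>>                                     }
>>                                 }
>>                             }
>>                             pending.decrementAndGet();
>>                         }
>>                     } catch (InterruptedException e) {
>>                         Thread.currentThread().interrupt();
>>                     }
>>                 }
>>             });
>>         }
>>         pool.shutdown();
>>         pool.awaitTermination(1, TimeUnit.HOURS);
>>         System.out.println("the directory and its subdirs have "
>>                 + totalBytes.get() / (1024 * 1024) + " MBytes");
>>     }
>> }
>>
>> In this sketch each directory is enqueued exactly once by its parent, so the
>> "is it already in the queue" check isn't needed; sym-links are not filtered,
>> though, so a sym-link cycle would make it loop forever.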
>>
>> I hope it is a little bit clear what I want to do ;-)
>>
>> No, this is not homework ;-)
>>
>> Should I use a global variable for the SummAtomicLong?
>> Should I use a global variable for the DirectoryQueue?
>>
>> I expect that there will be no more than 'ForkJoinPool(5)' threads/processes
>> working on the disk; is that right?
>>
>> I have tried to understand some of the examples in
>>
>> http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/test/loops/
>>
>> but I still have some questions.
>>
>> Many thanks for all your help.
>>
>> Cheers
>> Aleks
>> _______________________________________________
>> Concurrency-interest mailing list
>> Concurrency-interest at cs.oswego.edu
>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest

