[concurrency-interest] Recursive Directory checker

Aleksandar Lazic al-javaconcurrencyinterest at none.at
Fri Feb 24 14:21:47 EST 2012


 

Hi,

we scan over a NAS share (an NFS Netapp filer), so I don't think the low-level disk handling is in my hands.

I currently use the treescan program from the IO::AIO Perl module,

http://cvs.schmorp.de/IO-AIO/bin/treescan?view=markup

which uses 8 threads to collect the necessary data.

The two links below show my description of the problem from the Perl point of view:

http://lists.schmorp.de/pipermail/anyevent/2012q1/000227.html
http://lists.schmorp.de/pipermail/anyevent/2012q1/000231.html

The reason I want to switch to Java is that I need a solution which I 'just' need to extract and run, rather than installing a lot of modules for a dedicated scripting language.

Can you please tell me what you would suggest for handling the directories which have already been scanned?

Best regards

Aleks
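[Editor's note: one common answer to the "already scanned" question is a concurrent set of canonical paths, so that only the first worker to claim a directory descends into it. This is a minimal sketch, not something proposed in the thread; the class and method names are invented here.]

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class VisitedDirs {
    // Thread-safe set of directories already claimed by some worker.
    private static final Set<String> visited =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    // Set.add() returns true only for the first caller, so exactly one
    // worker wins the right to scan each directory.
    static boolean claim(String canonicalPath) {
        return visited.add(canonicalPath);
    }

    public static void main(String[] args) {
        System.out.println(claim("/mnt/nas/projects")); // true: first visit
        System.out.println(claim("/mnt/nas/projects")); // false: already scanned
    }
}
```

On NFS, symlinks or mounts can make the same directory reachable under two names, which is why a canonical path (File.getCanonicalPath()) rather than the raw name is the safer key.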

On 24-02-2012 19:29, Benedict Elliott Smith wrote:

> I hate to nitpick, but this is only true for sequential reads; as soon as you devolve to random IO (and for large directory trees metadata traversal is unlikely at best to remain sequential, even if there are no other competing IO requests) you are much better with multiple ops in flight so the disk can select the order it services them and to some degree maximize throughput. When performance testing new file servers I have found single threaded random IOPs are typically dreadful, even with dozens of disks.
>
> In my experience a multi-threaded directory traversal has usually been considerably faster than single threaded.
>
> I don't think the choice of queue is likely to have a material impact on the performance of this algorithm, Aleksandar; IO will be your bottleneck. However, I think the use of a queue defeats the point of using the ForkJoin framework.
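[Editor's note: Benedict's point about the queue can be made concrete. In the fork/join style each subdirectory becomes a forked subtask, and the pool's internal work-stealing deques replace any hand-rolled queue. A minimal sketch, with the class name invented here and error handling elided:]

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums file sizes under a directory. Each subdirectory is forked as a
// subtask, so no explicit queue is needed -- the pool's work-stealing
// deques do that job.
public class DirSizeTask extends RecursiveTask<Long> {
    private final File dir;

    public DirSizeTask(File dir) { this.dir = dir; }

    @Override
    protected Long compute() {
        long size = 0;
        File[] entries = dir.listFiles();
        if (entries == null) return 0L;          // unreadable directory
        List<DirSizeTask> subtasks = new ArrayList<DirSizeTask>();
        for (File f : entries) {
            if (f.isDirectory()) {
                DirSizeTask t = new DirSizeTask(f);
                t.fork();                        // hand the subtree to the pool
                subtasks.add(t);
            } else {
                size += f.length();
            }
        }
        for (DirSizeTask t : subtasks) {
            size += t.join();                    // collect subtree totals
        }
        return size;
    }

    public static void main(String[] args) {
        File start = new File(args.length > 0 ? args[0] : ".");
        long total = new ForkJoinPool().invoke(new DirSizeTask(start));
        System.out.printf("%s: %d bytes%n", start, total);
    }
}
```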
> On 24 February 2012 17:52, Nathan Reynolds <nathan.reynolds at oracle.com [9]> wrote:
>
>> I would like to point out that hard disks perform best when accessed in a single threaded manner. If you have 2 threads making requests, then the disk head will have to swing back and forth between the 2 locations. With only 1 thread, the disk head doesn't have to travel as much. Flash disks (SSDs) are a different story. We have seen optimal throughput when 16 threads hit the disk concurrently. Your mileage will vary depending upon the SSD. So, you may not get much better performance from your directory size counter by using multiple threads.
>>
>> I have found on Windows that defragmenting the hard drive and placing all of the directory metadata together makes this kind of thing run really fast. (See MyDefrag.) The disk head simply has to sit on the directory metadata section of the hard disk. I realize you aren't running on Windows. But, you might consider something similar.
>>
>> Nathan Reynolds [5] | Consulting Member of Technical Staff | 602.333.9091
>> Oracle PSR Engineering [6] | Server Technology
>>
>> On 2/24/2012 8:59 AM, Aleksandar Lazic wrote:
>>
>>> Dear list members,
>>>
>>> I'm about to write a directory counter.
>>>
>>> I'm new to all this thread/fork stuff, so please accept my apologies for such a 'simple' question ;-)
>>>
>>> What is the 'best' class for such a program?
>>>
>>> ForkJoinTask
>>> RecursiveAction
>>> RecursiveTask
>>>
>>> This is what I plan to use for the main program.
>>>
>>> pseudocode
>>> ###
>>> main:
>>>
>>> File startdir = new File("/home/user/");
>>> File[] files = startdir.listFiles();
>>>
>>> add directories to the queue.
>>>
>>> -----
>>> I'm unsure which Queue is the best for this:
>>>
>>> http://gee.cs.oswego.edu/dl/jsr166/dist/docs/java/util/Queue.html [1]
>>>
>>> I tend towards BlockingDeque.
>>> -----
>>>
>>> ForkJoinPool fjp = new ForkJoinPool(5);
>>>
>>> foreach worker
>>>     get filesizes and sumAtomicLong.addAndGet(filesizes);
>>>
>>> print "the directory and its subdirs have {} Mbytes", sumAtomicLong
>>>
>>> ####
>>>
>>> Worker:
>>>
>>> foreach directory
>>>     if directory is not in queue
>>>         add directory to the queue.
>>>
>>> foreach file
>>>     workerAtomicLong.addAndGet(file.size);
>>> ###
>>>
>>> I hope it is a little bit clear what I want to do ;-)
>>>
>>> No, this is not homework ;-)
>>>
>>> Should I use a global variable for the sumAtomicLong?
>>> Should I use a global variable for the DirectoryQueue?
>>>
>>> I expect that there are no more than 'ForkJoinPool(5)' threads/processes which work on the disk, is that right?
>>>
>>> I have tried to understand some of the examples in
>>>
>>> http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/test/loops/ [2]
>>>
>>> but I still have some questions.
>>>
>>> Many thanks for all your help.
>>>
>>> Cheers
>>> Aleks
>>> _______________________________________________
>>> Concurrency-interest mailing list
>>> Concurrency-interest at cs.oswego.edu [3]
>>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest [4]
>>
>> _______________________________________________
>> Concurrency-interest mailing list
>> Concurrency-interest at cs.oswego.edu [7]
>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest [8]
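[Editor's note: Aleks's queue-plus-workers pseudocode above can be turned into a runnable sketch roughly as follows. The class name and the `pending` counter are invented here; the counter is needed so the workers know when the whole traversal is finished, a detail the pseudocode leaves out.]

```java
import java.io.File;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class QueueScanner {
    // Scans 'start' with nThreads workers; returns the total file size in bytes.
    static long scan(File start, int nThreads) throws InterruptedException {
        final BlockingDeque<File> queue = new LinkedBlockingDeque<File>();
        final AtomicLong totalBytes = new AtomicLong();
        // Directories enqueued but not yet fully processed; starts at 1 for 'start'.
        final AtomicLong pending = new AtomicLong(1);
        queue.add(start);

        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (int i = 0; i < nThreads; i++) {
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        while (pending.get() > 0) {
                            File dir = queue.poll(50, TimeUnit.MILLISECONDS);
                            if (dir == null) continue;    // queue momentarily empty
                            File[] entries = dir.listFiles();
                            if (entries != null) {
                                for (File f : entries) {
                                    if (f.isDirectory()) {
                                        pending.incrementAndGet();
                                        queue.add(f);     // hand subdir to some worker
                                    } else {
                                        totalBytes.addAndGet(f.length());
                                    }
                                }
                            }
                            pending.decrementAndGet();    // this directory is done
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return totalBytes.get();
    }

    public static void main(String[] args) throws InterruptedException {
        File start = new File(args.length > 0 ? args[0] : ".");
        long bytes = scan(start, 5);
        System.out.printf("%s and its subdirs hold %d Mbytes%n", start, bytes / (1024 * 1024));
    }
}
```

Note that Benedict's caveat still applies: in the ForkJoin style the explicit queue and the `pending` bookkeeping disappear, because forked subtasks already serve that role.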



Links:
------
[1] http://gee.cs.oswego.edu/dl/jsr166/dist/docs/java/util/Queue.html
[2] http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/test/loops/
[3] mailto:Concurrency-interest at cs.oswego.edu
[4] http://cs.oswego.edu/mailman/listinfo/concurrency-interest
[5] http://psr.us.oracle.com/wiki/index.php/User:Nathan_Reynolds
[6] http://psr.us.oracle.com/
[7] mailto:Concurrency-interest at cs.oswego.edu
[8] http://cs.oswego.edu/mailman/listinfo/concurrency-interest
[9] mailto:nathan.reynolds at oracle.com

