Re: [netcdfgroup] [gdsjaar@xxxxxxxxxx: strlen calls in NC_finddim and NC_findvar]

To: parallel-netcdf@xxxxxxxxxxx, "netcdfgroup@xxxxxxxxxxxxxxxx" <netcdfgroup@xxxxxxxxxxxxxxxx>
Subject: Re: [netcdfgroup] [gdsjaar@xxxxxxxxxx: strlen calls in NC_finddim and NC_findvar]
From: "Greg Sjaardema" <gdsjaar@xxxxxxxxxx>
Date: Fri, 4 Dec 2009 08:33:53 -0700

I modified my fix somewhat from what is described below.  The NC_string
'nchars' field is not what is needed since it is modified for alignment
issues and can be incorrect after a rename operation.  Instead, I added
a 'lenstr' field to both NC_dim and NC_var which maintains the length of
the name.  This reduced the number of strlen calls in one case from
476,952,472 to 389,810 (43.6% of execution time down to 0.25%).  There
are still several calls to strncmp.

I think that perhaps a better fix than caching the name string length
may be to compute a hash of the name and store that instead.  The
finddim and findvar functions can then hash the name they are searching
for.  The inner loop could then just compare the hash values and if they
match, do the further strncmp check to catch hash collisions.
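To make the hash idea concrete, here is a minimal sketch of what such a
lookup might look like.  It is not code from netcdf or parallel-netcdf:
the hashed_entry struct and the hashed_* functions are hypothetical
stand-ins for NC_dim/NC_var and the find routines, and 32-bit FNV-1a is
used only because it is short.  The collision check uses strcmp here
rather than strncmp for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for NC_dim / NC_var; not the real structs. */
typedef struct {
    const char *name;  /* NUL-terminated name, owned by the caller */
    uint32_t    hash;  /* hash of name, computed once when name is set */
} hashed_entry;

/* 32-bit FNV-1a; any reasonable string hash would do here. */
static uint32_t hashed_name(const char *s)
{
    uint32_t h = 2166136261u;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

/* Use this for both create and rename so the hash never goes stale. */
static void hashed_set_name(hashed_entry *e, const char *name)
{
    e->name = name;
    e->hash = hashed_name(name);
}

/* Returns the index of the matching entry, or -1 if there is none. */
static int hashed_find(const hashed_entry *tab, size_t n, const char *name)
{
    const uint32_t want = hashed_name(name);  /* hashed once per lookup */
    for (size_t i = 0; i < n; i++) {
        /* Cheap integer compare first; the string compare runs only on
           a hash match and rules out collisions. */
        if (tab[i].hash == want && strcmp(tab[i].name, name) == 0)
            return (int)i;
    }
    return -1;
}
```

The search is still linear in the number of entries, but the inner loop
touches only one integer per entry unless the hashes agree.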



Rob Latham wrote:

Greg S. found something noteworthy on the serial netcdf list.  We do
something similar (not surprising: I'm sure our NC_finddim and
NC_findvar functions are 99% unchanged from serial netcdf).

In NC_finddim we have a call to strlen as part of the condition of a
for loop.  If there are a lot of dimensions as in Greg's case, then
yeah, we too would call strlen a lot.

http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/src/lib/dim.c#L135

Our ncmpii_NC_findvar calls strlen inside a loop for each variable in
a dataset.

http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/src/lib/var.c#L317
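For readers who do not want to follow the links, the search pattern
described above can be sketched roughly as follows.  This is an
illustration, not the actual dim.c or var.c source: find_by_name,
counted_strlen and the strlen_calls counter are hypothetical names,
added only so the number of strlen calls can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Instrumentation for this sketch only. */
static unsigned long strlen_calls = 0;

static size_t counted_strlen(const char *s)
{
    assert(s != NULL);
    strlen_calls++;
    return strlen(s);
}

/* Linear search that re-measures every candidate name inside the loop
   condition.  Returns the index of the match, or -1 if none. */
static int find_by_name(const char *const *names, size_t n, const char *name)
{
    const size_t slen = counted_strlen(name);
    size_t i;
    for (i = 0;
         i < n && (counted_strlen(names[i]) != slen
                   || strncmp(names[i], name, slen) != 0);
         i++)
        ;  /* all the work happens in the loop condition */
    return i < n ? (int)i : -1;
}
```

A single lookup can cost up to n + 1 strlen calls, so looking up each of
n names in turn grows roughly quadratically with n.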

How common are datasets with thousands of dimensions and thousands of
variables?

In a followup message, Greg found at least one case where "size" was
not the same as strlen(name) for one of these NC_dim types, so it
looks like the easy optimization won't work out after all.

The status quo isn't awful if you've got a small number of dimensions
and variables: if anybody else has a dataset like Greg's, though,
reply to this email and we'll put optimizing this workload on the todo
list.

thanks
==rob

----- Forwarded message from Greg Sjaardema <gdsjaar@xxxxxxxxxx> -----

Sender: netcdfgroup-bounces@xxxxxxxxxxxxxxxx
From: Greg Sjaardema <gdsjaar@xxxxxxxxxx>
Subject: [netcdfgroup] strlen calls in NC_finddim and NC_findvar
Date: Thu, 3 Dec 2009 15:41:49 -0700
Message-ID: <4B183EAD.20808@xxxxxxxxxx>
User-Agent: Thunderbird 2.0.0.23 (X11/20090812)
To: "netcdfgroup@xxxxxxxxxxxxxxxx" <netcdfgroup@xxxxxxxxxxxxxxxx>

I have a monstrous file with several thousand dimensions and variables
which is running slower than it should.  I investigated the runtime
and found that strlen was the major time user in the NC_finddim and
NC_findvar calls.  The obvious optimization was to cache the length of
the name instead of calling strlen each time.  However, when I went to
do this, I discovered that the length is already cached as the nchars
field in the NC_string struct.

I did some checks in the code and also added some assertions to the
code and verified that, as far as I can tell, nchars is the correct
length of the string.  Is there a reason that it isn't used and
strlen() is called instead?  Switching the code to use nchars dropped
my execution time from 20 units to 6 units.  I would like to make the
switch, but wondered if there was some strange corner case where the
nchars value is incorrect and will cause problems.
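As an illustration of the length-caching idea, here is a minimal sketch
under stated assumptions.  The cached_entry struct, its lenstr field and
the cached_* functions are hypothetical and are not the real NC_string,
NC_dim or NC_var definitions.  The point it tries to show is that a
cached length stays trustworthy only if every write to the name,
including a rename, goes through the one routine that updates it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical entry; the real NC_dim and NC_var structs differ. */
typedef struct {
    char   name[64];  /* fixed buffer keeps the sketch free of malloc */
    size_t lenstr;    /* cached strlen(name), with no alignment padding */
} cached_entry;

/* The only place that writes the name: used for create and rename.
   Returns 0 on success, -1 if the name does not fit. */
static int cached_set_name(cached_entry *e, const char *name)
{
    size_t len;
    assert(e != NULL && name != NULL);
    len = strlen(name);
    if (len >= sizeof e->name)
        return -1;
    memcpy(e->name, name, len + 1);  /* copy includes the NUL */
    e->lenstr = len;
    return 0;
}

/* Returns the index of the matching entry, or -1 if there is none. */
static int cached_find(const cached_entry *tab, size_t n, const char *name)
{
    const size_t slen = strlen(name);  /* one strlen per lookup */
    for (size_t i = 0; i < n; i++) {
        /* Compare cached lengths first; bytes only when they agree. */
        if (tab[i].lenstr == slen && memcmp(tab[i].name, name, slen) == 0)
            return (int)i;
    }
    return -1;
}
```

With this layout the search calls strlen once per lookup instead of once
per entry examined.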

Thanks,
--Greg

----- End forwarded message -----