- To: parallel-netcdf@xxxxxxxxxxx, "netcdfgroup@xxxxxxxxxxxxxxxx" <netcdfgroup@xxxxxxxxxxxxxxxx>
- Subject: Re: [netcdfgroup] [gdsjaar@xxxxxxxxxx: strlen calls in NC_finddim and NC_findvar]
- From: "Greg Sjaardema" <gdsjaar@xxxxxxxxxx>
- Date: Fri, 4 Dec 2009 08:33:53 -0700
I modified my fix somewhat from what is described below. The NC_string
'nchars' field is not what is needed since it is modified for alignment
issues and can be incorrect after a rename operation. Instead, I added
a 'lenstr' field to both NC_dim and NC_var which maintains the length of
the name. This reduced the number of strlen calls in one case from
476,952,472 to 389,810 (43.6% of execution time down to 0.25%). There
are still several calls to strncmp.
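
For readers following along, here is a minimal sketch of the cached-length idea. It is not the actual netCDF patch: `demo_dim`, `demo_dim_set_name` and `demo_finddim` are hypothetical, simplified stand-ins for NC_dim and NC_finddim, and only the `lenstr` field name comes from the description above. The key is measured once outside the loop, and the cached length rejects most candidates before any bytes are compared.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for NC_dim; not the real netCDF struct. */
typedef struct {
    const char *name;   /* NUL-terminated dimension name */
    size_t      lenstr; /* cached strlen(name), refreshed whenever name changes */
} demo_dim;

/* Set (or rename) a dimension. The name and its cached length are always
 * updated together so they cannot drift apart after a rename. */
void demo_dim_set_name(demo_dim *d, const char *name)
{
    assert(d != NULL && name != NULL);
    d->name = name;
    d->lenstr = strlen(name);
}

/* Return the index of the dimension called `name`, or -1 if there is none.
 * strlen runs once for the search key and never inside the loop. */
int demo_finddim(const demo_dim *dims, size_t ndims, const char *name)
{
    assert(name != NULL);
    const size_t slen = strlen(name);
    for (size_t i = 0; i < ndims; i++) {
        /* Cheap length check first; compare bytes only when lengths match. */
        if (dims[i].lenstr == slen && memcmp(dims[i].name, name, slen) == 0)
            return (int)i;
    }
    return -1;
}
```

Routing every rename through the one setter is what keeps the cached length trustworthy, which is the property the 'nchars' field turned out not to have.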
I think that perhaps a better fix than caching the name string length
may be to compute a hash of the name and store that instead. The
finddim and findvar functions can then hash the name they are searching
for. The inner loop could then just compare the hash values and if they
match, do the further strncmp check to catch hash collisions.
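
A rough sketch of the hash variant, under stated assumptions: `demo_var`, `demo_hash`, `demo_var_set_name` and `demo_findvar` are hypothetical names, 32-bit FNV-1a is just one convenient hash picked for illustration (nothing in this thread settles on a particular function), and the sketch uses strcmp for the confirming comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for NC_var; not the real netCDF struct. */
typedef struct {
    const char *name; /* NUL-terminated variable name */
    uint32_t    hash; /* cached hash of name, refreshed whenever name changes */
} demo_var;

/* 32-bit FNV-1a over a NUL-terminated string (an illustrative choice). */
uint32_t demo_hash(const char *s)
{
    assert(s != NULL);
    uint32_t h = 2166136261u;       /* FNV offset basis */
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;             /* FNV prime */
    }
    return h;
}

/* Set (or rename) a variable, keeping the cached hash in step with the name. */
void demo_var_set_name(demo_var *v, const char *name)
{
    assert(v != NULL && name != NULL);
    v->name = name;
    v->hash = demo_hash(name);
}

/* Return the index of the variable called `name`, or -1 if there is none.
 * The key is hashed once; the loop compares integers and falls back to a
 * string comparison only on a hash match, which catches collisions. */
int demo_findvar(const demo_var *vars, size_t nvars, const char *name)
{
    const uint32_t key = demo_hash(name);
    for (size_t i = 0; i < nvars; i++) {
        if (vars[i].hash == key && strcmp(vars[i].name, name) == 0)
            return (int)i;
    }
    return -1;
}
```

The loop is still linear in the number of names, but each miss costs one integer compare instead of a pass over the string.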
Rob Latham wrote:
Greg S. found something noteworthy on the serial netcdf list. We do something similar (not surprising: I'm sure our NC_finddim and NC_findvar functions are 99% unchanged from serial netcdf).

In NC_finddim we have a call to strlen as part of the condition of a for loop. If there are a lot of dimensions, as in Greg's case, then yeah, we too would call strlen a lot.
http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/src/lib/dim.c#L135

Our ncmpii_NC_findvar calls strlen inside a loop for each variable in a dataset.
http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/src/lib/var.c#L317

How common are datasets with thousands of dimensions and thousands of variables?

In a followup message, Greg found at least one case where "size" was not the same as strlen(name) for one of these NC_dim types, so it looks like the easy optimization won't work out after all.

The status quo isn't awful if you've got a small number of dimensions and variables. If anybody else has a dataset like Greg's, though, reply to this email and we'll put optimizing this workload on the todo list.

thanks
==rob

----- Forwarded message from Greg Sjaardema <gdsjaar@xxxxxxxxxx> -----
From: Greg Sjaardema <gdsjaar@xxxxxxxxxx>
Subject: [netcdfgroup] strlen calls in NC_finddim and NC_findvar
Date: Thu, 3 Dec 2009 15:41:49 -0700
Message-ID: <4B183EAD.20808@xxxxxxxxxx>
To: "netcdfgroup@xxxxxxxxxxxxxxxx" <netcdfgroup@xxxxxxxxxxxxxxxx>

I have a monstrous file with several thousand dimensions and variables which is running slower than it should. I investigated the runtime and found that strlen was the major time user in the NC_finddim and NC_findvar calls.

The obvious optimization was to cache the length of the name instead of calling strlen each time. However, when I went to do this, I discovered that the length is already cached as the nchars field in the NC_string struct. I did some checks in the code, added some assertions, and verified that, as far as I can tell, nchars is the correct length of the string.

Is there a reason that it isn't used and strlen() is called instead? Switching the code to use nchars dropped my execution time from 20 units to 6 units. I would like to make the switch, but wondered if there was some strange corner case where the nchars value is incorrect and will cause problems.

Thanks,
--Greg

----- End forwarded message -----