Re: rrdcached performance problem

11 messages Options
Embed this post
Permalink
Sebastian Harl

Re: rrdcached performance problem

Reply Threaded More More options
Print post
Permalink
Hi,

On Wed, Oct 28, 2009 at 11:25:03PM -0700, Thorsten von Eicken wrote:
> it's clearly active both receiving and flushing, and realloc is the
> top primitive that I see. Mhhh, I haven't dealt with these things in
> eons so I'll have to see whether there's a different distro I can use
> to get a better realloc implementation.

Disclaimer: I have not looked into the whole issue in detail yet, this
is just a more or less wild guess … ;-)

You're on Linux, right? So, I presume, you're using the GNU libc. IIrc,
the glibc memory management implementation (*alloc, free) is optimized
for memory usage rather than speed, so the speed is (supposedly - did
not check that myself) rather slow.

This might also further explain why Kevin did not stumble across that
issue, since (iIrc) he's on FreeBSD which might have a "better" (speed-
wise) implementation of those functions.

Since we're using the glib (as in GNOME lib - note the missing 'c' ;-))
anyway, we might give it a try to use their memory management functions
(g_*malloc*, g_free) - those are supposed to be quite a bit faster (yet,
I did not check that myself neither).

Cheers,
Sebastian

--
Sebastian "tokkee" Harl +++ GnuPG-ID: 0x8501C7FC +++ http://tokkee.org/

Those who would give up Essential Liberty to purchase a little Temporary
Safety, deserve neither Liberty nor Safety.         -- Benjamin Franklin



_______________________________________________
rrd-developers mailing list
[hidden email]
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-developers

signature.asc (204 bytes) Download Attachment
kevin brintnall

Re: rrdcached performance problem

Reply Threaded More More options
Print post
Permalink
> spends all of its time (more than 99??%) in ???realloc???. So in consequence
> 37.1??% of the time it spent in ???handle_request_update??? the daemon is
> actually waiting for ???realloc???. This is (to me) very unexpected and a
> schoolbook example of ???measure before you optimize???.

Florian et al,

I'd rather see the realloc in linear chunks vs. exponential.. i.e. when we
run out, allocate another X pointers.  We'll still reduce calls to
realloc(), but without risking exponential jumps in memory utilization.

i.e. we could add cache_item_t->values_alloc to keep track of how many
values have been allocated.  Then, we can realloc() a new chunk on demand.
This chunk size could even be configurable..  a clever admin could
pre-allocate all the pointers (i.e. "-w"/rrd_step).

P.S.: is_power_of_two = old_size && !(old_size & (old_size-1));

--
 kevin brintnall =~ /[hidden email]/

> +  if (old_size > 0)
> +  {
> +    is_power_of_two = old_size;
> +    while ((is_power_of_two & 0x01) == 0)
> +      is_power_of_two >>= 1;
> +  }
> +
> +  /* If the current size is a power of two (or zero), increase the array size.
> +   * Otherwise enough space has been allocated previously. */
> +  if (old_size == 0)
> +    new_size = 1;
> +  else if (is_power_of_two == 1)
> +    new_size = 2 * old_size;
> +  else
> +    new_size = old_size;
> +
> +  /* We need to allocate more space. */
> +  if (new_size != old_size)
> +  {
> +    char **tmp;
> +
> +    tmp = realloc (*dest, new_size * sizeof (char *));
> +    if (tmp == NULL)
> +      return (ENOMEM);
> +    *dest = tmp;
> +  }
> +
> +  /* Duplicate the string. On error, the array size is not decreased. This
> +   * means the callee may have to free `dest' even if this function failed. */
> +  (*dest)[old_size] = strdup (src);
> +  if ((*dest)[old_size] == NULL)
> +    return (ENOMEM);
> +
> +  *dest_size = old_size + 1;
> +  return (0);
> +} /* }}} int add_strdup_exp */
> +
>  static int count_lines(char *str) /* {{{ */
>  {
>    int lines = 0;
> @@ -1433,7 +1486,7 @@ static int handle_request_update (HANDLER_PROTO) /* {{{ */
>      else
>        ci->last_update_stamp = stamp;
>  
> -    if (!rrd_add_strdup(&ci->values, &ci->values_num, value))
> +    if (add_strdup_exp (&ci->values, &ci->values_num, value) != 0)
>      {
>        RRDD_LOG (LOG_ERR, "handle_request_update: rrd_add_strdup failed.");
>        continue;
> --
> 1.6.3.3
>



_______________________________________________
rrd-developers mailing list
[hidden email]
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-developers
Thorsten von Eicken-2

Re: rrdcached performance problem

Reply Threaded More More options
Print post
Permalink
In reply to this post by Sebastian Harl
Florian Forster wrote:
>> 547,131,570  < rrd_daemon.c:queue_thread_main'2 (263x) [/usr/bin/rrdcached]
>>   3,186,524  < rrd_daemon.c:queue_thread_main (3x) [/usr/bin/rrdcached]
>> 550,318,094  *  rrd_update.c:rrd_update_r [/usr/lib64/librrd_th.so.4.1.0]
>>    
>
> I wonder where this second update thread is coming from. Did you
> configure one or two? If you configured two, why isn't the second one
> doing any real work?
>  
Yes, there are two configured. Your question is a good one. If I get to
run callgrind again I'll run it a bit longer and we'll see whether it
was due to the short observation window. I don't know how the
locking/dequeueing primitives work, is there perhaps something that
prefers one thread and the second doesn't run if the first one is fast
enough at dequeueing?

> spends all of its time (more than 99 %) in “realloc”. So in consequence
> 37.1 % of the time it spent in “handle_request_update” the daemon is
> actually waiting for “realloc”. This is (to me) very unexpected and a
> schoolbook example of “measure before you optimize”.
>
> I think we can get rid of this bottleneck by writing a specialized
> version of “rrd_add_strdup” which reallocates powers of ten. Something
> like:
>
> [...]
>
> It'd be great if you could give the attached patch a try
Test is running, including Kevin's simplification... Thanks for the help!!!

By the way, we were calculating some stats the other day and noticed
that we're collecting 680k RRDs (most have 1 DS) every 20 seconds.
That's 35K RRD updates per second. No, this is not on one server...

Cheers,
Thorsten

_______________________________________________
rrd-developers mailing list
[hidden email]
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-developers
Thorsten von Eicken-2

Re: rrdcached performance problem

Reply Threaded More More options
Print post
Permalink
Thorsten von Eicken wrote:

>> 37.1 % of the time it spent in “handle_request_update” the daemon is
>> actually waiting for “realloc”. This is (to me) very unexpected and a
>> schoolbook example of “measure before you optimize”.
>>
>> I think we can get rid of this bottleneck by writing a specialized
>> version of “rrd_add_strdup” which reallocates powers of ten. Something
>> like:
>>
>> [...]
>>
>> It'd be great if you could give the attached patch a try
> spends all of its time (more than 99 %) in “realloc”. So in consequence
> Test is running, including Kevin's simplification... Thanks for the
> help!!!
Things are again looking much better, almost great I should say! The one
thing that still makes me a bit uncomfortable is that at the end of the
second hour of run-time there was a cpu spike which caused collectd to
grow rapidly. (Still using -w 3600 -z 3600 -f 7200, I put a load of ~50k
tree nodes right from the start.) You can see the graphs at
http://www.voneicken.com/dl/rrd/ look for the rrdcached-6* series. It
flattens out nicely after the spike, but it's one of those things that
tend to bite sooner or later. I'm not sure what to do about it.
Thorsten

_______________________________________________
rrd-developers mailing list
[hidden email]
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-developers
kevin brintnall

Re: rrdcached performance problem

Reply Threaded More More options
Print post
Permalink
On Fri, Oct 30, 2009 at 09:19:30AM -0700, Thorsten von Eicken wrote:

> Thorsten von Eicken wrote:
> >> 37.1 % of the time it spent in ?handle_request_update? the daemon is
> >> actually waiting for ?realloc?. This is (to me) very unexpected and a
> >> schoolbook example of ?measure before you optimize?.
> >>
> >> I think we can get rid of this bottleneck by writing a specialized
> >> version of ?rrd_add_strdup? which reallocates powers of ten. Something
> >> like:
> >>
> >> [...]
> >>
> >> It'd be great if you could give the attached patch a try
> > spends all of its time (more than 99 %) in ?realloc?. So in consequence
> > Test is running, including Kevin's simplification... Thanks for the
> > help!!!
> Things are again looking much better, almost great I should say! The one
> thing that still makes me a bit uncomfortable is that at the end of the
> second hour of run-time there was a cpu spike which caused collectd to
> grow rapidly. (Still using -w 3600 -z 3600 -f 7200, I put a load of ~50k
> tree nodes right from the start.) You can see the graphs at
> http://www.voneicken.com/dl/rrd/ look for the rrdcached-6* series. It
> flattens out nicely after the spike, but it's one of those things that
> tend to bite sooner or later. I'm not sure what to do about it.

Thorsten,

What's the start time on that daemon?  If it was started slightly before
the graphs (~21:30-21:35), then it's possible that the CPU increase is
associated with the first flush (-f).

Something is queueing a lot of files; I would only expect that to result
from the flush process if writing had stopped for a set of your RRDs.  For
example, if you have a large number of RRD files in the journal that
aren't being re-written afterwards, then the first flush would contain all
those.

Did you use journal on startup for this test?

If you could profile during that "hump" it would be very instructive...

--
 kevin brintnall =~ /[hidden email]/


_______________________________________________
rrd-developers mailing list
[hidden email]
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-developers
Thorsten von Eicken-2

Re: rrdcached performance problem

Reply Threaded More More options
Print post
Permalink
kevin brintnall wrote:

> On Fri, Oct 30, 2009 at 09:19:30AM -0700, Thorsten von Eicken wrote:
>  
>> Thorsten von Eicken wrote:
>>    
>>>> 37.1 % of the time it spent in ?handle_request_update? the daemon is
>>>> actually waiting for ?realloc?. This is (to me) very unexpected and a
>>>> schoolbook example of ?measure before you optimize?.
>>>>
>>>> I think we can get rid of this bottleneck by writing a specialized
>>>> version of ?rrd_add_strdup? which reallocates powers of ten. Something
>>>> like:
>>>>
>>>> [...]
>>>>
>>>> It'd be great if you could give the attached patch a try
>>>>        
>>> spends all of its time (more than 99 %) in ?realloc?. So in consequence
>>> Test is running, including Kevin's simplification... Thanks for the
>>> help!!!
>>>      
>> Things are again looking much better, almost great I should say! The one
>> thing that still makes me a bit uncomfortable is that at the end of the
>> second hour of run-time there was a cpu spike which caused collectd to
>> grow rapidly. (Still using -w 3600 -z 3600 -f 7200, I put a load of ~50k
>> tree nodes right from the start.) You can see the graphs at
>> http://www.voneicken.com/dl/rrd/ look for the rrdcached-6* series. It
>> flattens out nicely after the spike, but it's one of those things that
>> tend to bite sooner or later. I'm not sure what to do about it.
>>    
>
> Thorsten,
>
> What's the start time on that daemon?  If it was started slightly before
> the graphs (~21:30-21:35), then it's possible that the CPU increase is
> associated with the first flush (-f).
>  
It was right at the start of the graphs. If you look at the
"write-data_sets" graph, you can see very clearly how the flushing
starts at 22:38-22:39 and the first hour of flushing ends 23:46-23:47.
The cpu spike starts to build at 23:26-23:28 so it's not on a clean
boundary at all.If you look at the if-packets graph you can see how
inbound traffic is 100% stable throughout the whole run.
> Something is queueing a lot of files; I would only expect that to result
> from the flush process if writing had stopped for a set of your RRDs.  For
> example, if you have a large number of RRD files in the journal that
> aren't being re-written afterwards, then the first flush would contain all
> those.
>
> Did you use journal on startup for this test?
>  
I cleared the journals before starting rrdcached.
> If you could profile during that "hump" it would be very instructive...
>  
Yeah, no kidding. The question is how to reproduce this given the
callgrind slow-down. I may try later next week when I hope to have some
more time.

Thorsten

_______________________________________________
rrd-developers mailing list
[hidden email]
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-developers
kevin brintnall

Re: rrdcached performance problem

Reply Threaded More More options
Print post
Permalink
On Fri, Oct 30, 2009 at 09:47:43AM -0700, Thorsten von Eicken wrote:

> > What's the start time on that daemon?  If it was started slightly before
> > the graphs (~21:30-21:35), then it's possible that the CPU increase is
> > associated with the first flush (-f).
> >  
> It was right at the start of the graphs. If you look at the
> "write-data_sets" graph, you can see very clearly how the flushing
> starts at 22:38-22:39 and the first hour of flushing ends 23:46-23:47.
> The cpu spike starts to build at 23:26-23:28 so it's not on a clean
> boundary at all.If you look at the if-packets graph you can see how
> inbound traffic is 100% stable throughout the whole run.

OK..  That's what I thought :|

> > Something is queueing a lot of files; I would only expect that to
> > result from the flush process if writing had stopped for a set of your
> > RRDs.  For example, if you have a large number of RRD files in the
> > journal that aren't being re-written afterwards, then the first flush
> > would contain all those.

When the queue length jumps up, try issuing the "QUEUE" command to the
daemon.  Maybe the list of queued files (or their order) will be
instructive.

--
 kevin brintnall =~ /[hidden email]/

_______________________________________________
rrd-developers mailing list
[hidden email]
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-developers
Thorsten von Eicken-2

Re: rrdcached performance problem

Reply Threaded More More options
Print post
Permalink
In reply to this post by Thorsten von Eicken-2
Quick follow-up. I decided to add another 3k updates per second (extra
30k tree nodes) to my test run. See results in
http://www.voneicken.com/dl/rrd/rrdcached-7.png
What's interesting is that the server got somewhat overloaded sitting a
lot in I/O wait. By and large the flush queue length remained under
control, except when doing backups (10pm, 8:30am). Memory usage by
rrdcached and collectd remained under control, but there is a long term
upward-trending slope to rrdcached's memory usage which is not good.
Possibly related to the power-of-two allocator patch that Florian
provided. The graph I find the most interesting one is the disk sdk disk
ops (3rd from the end). Before adding the last chunk of traffic the disk
load was write-dominated, which means that rrds were mostly cached in
memory (5-6 GB left after the processes). After adding the extra load
the disk load became read-dominated indicating that the rrd working set
exceeded memory.

Thorsten


Thorsten von Eicken wrote:

> Thorsten von Eicken wrote:
>  
>>> 37.1 % of the time it spent in “handle_request_update” the daemon is
>>> actually waiting for “realloc”. This is (to me) very unexpected and a
>>> schoolbook example of “measure before you optimize”.
>>>
>>> I think we can get rid of this bottleneck by writing a specialized
>>> version of “rrd_add_strdup” which reallocates powers of ten. Something
>>> like:
>>>
>>> [...]
>>>
>>> It'd be great if you could give the attached patch a try
>>>      
>> spends all of its time (more than 99 %) in “realloc”. So in consequence
>> Test is running, including Kevin's simplification... Thanks for the
>> help!!!
>>    
> Things are again looking much better, almost great I should say! The one
> thing that still makes me a bit uncomfortable is that at the end of the
> second hour of run-time there was a cpu spike which caused collectd to
> grow rapidly. (Still using -w 3600 -z 3600 -f 7200, I put a load of ~50k
> tree nodes right from the start.) You can see the graphs at
> http://www.voneicken.com/dl/rrd/ look for the rrdcached-6* series. It
> flattens out nicely after the spike, but it's one of those things that
> tend to bite sooner or later. I'm not sure what to do about it.
> Thorsten
>
> _______________________________________________
> rrd-developers mailing list
> [hidden email]
> https://lists.oetiker.ch/cgi-bin/listinfo/rrd-developers
>
>  

_______________________________________________
rrd-developers mailing list
[hidden email]
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-developers
kevin brintnall

Re: rrdcached performance problem

Reply Threaded More More options
Print post
Permalink
On Sat, Oct 31, 2009 at 09:52:15AM -0700, Thorsten von Eicken wrote:

> Quick follow-up. I decided to add another 3k updates per second (extra
> 30k tree nodes) to my test run. See results in
> http://www.voneicken.com/dl/rrd/rrdcached-7.png
> What's interesting is that the server got somewhat overloaded sitting a
> lot in I/O wait. By and large the flush queue length remained under
> control, except when doing backups (10pm, 8:30am). Memory usage by
> rrdcached and collectd remained under control, but there is a long term
> upward-trending slope to rrdcached's memory usage which is not good.
> Possibly related to the power-of-two allocator patch that Florian
> provided. The graph I find the most interesting one is the disk sdk disk
> ops (3rd from the end). Before adding the last chunk of traffic the disk
> load was write-dominated, which means that rrds were mostly cached in
> memory (5-6 GB left after the processes). After adding the extra load
> the disk load became read-dominated indicating that the rrd working set
> exceeded memory.

Thorsten,

If you're becoming read dominated, you should consider lowering your file
update/sec rate by increasing your -w/-f timers.  This just trades one
kind of cache memory (f/s blocks) for another (update strings).

I'm sending a linear chunk allocator along for allocating
cache_item_t.values in operator-defined block sizes..  I'd appreciate if
you'd test it with your load to see if it reduces your CPU usage related
to frequent realloc().

--
 kevin brintnall =~ /[hidden email]/


> Thorsten
>
>
> Thorsten von Eicken wrote:
> > Thorsten von Eicken wrote:
> >  
> >>> 37.1 % of the time it spent in ?handle_request_update? the daemon is
> >>> actually waiting for ?realloc?. This is (to me) very unexpected and a
> >>> schoolbook example of ?measure before you optimize?.
> >>>
> >>> I think we can get rid of this bottleneck by writing a specialized
> >>> version of ?rrd_add_strdup? which reallocates powers of ten. Something
> >>> like:
> >>>
> >>> [...]
> >>>
> >>> It'd be great if you could give the attached patch a try
> >>>      
> >> spends all of its time (more than 99 %) in ?realloc?. So in consequence
> >> Test is running, including Kevin's simplification... Thanks for the
> >> help!!!
> >>    
> > Things are again looking much better, almost great I should say! The one
> > thing that still makes me a bit uncomfortable is that at the end of the
> > second hour of run-time there was a cpu spike which caused collectd to
> > grow rapidly. (Still using -w 3600 -z 3600 -f 7200, I put a load of ~50k
> > tree nodes right from the start.) You can see the graphs at
> > http://www.voneicken.com/dl/rrd/ look for the rrdcached-6* series. It
> > flattens out nicely after the spike, but it's one of those things that
> > tend to bite sooner or later. I'm not sure what to do about it.
> > Thorsten
> >
> > _______________________________________________
> > rrd-developers mailing list
> > [hidden email]
> > https://lists.oetiker.ch/cgi-bin/listinfo/rrd-developers
> >
> >  
>
> _______________________________________________
> rrd-developers mailing list
> [hidden email]
> https://lists.oetiker.ch/cgi-bin/listinfo/rrd-developers

_______________________________________________
rrd-developers mailing list
[hidden email]
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-developers
Thorsten von Eicken-2

Re: rrdcached performance problem

Reply Threaded More More options
Print post
Permalink
kevin brintnall wrote:

> On Sat, Oct 31, 2009 at 09:52:15AM -0700, Thorsten von Eicken wrote:
>  
>> Quick follow-up. I decided to add another 3k updates per second (extra
>> 30k tree nodes) to my test run. See results in
>> http://www.voneicken.com/dl/rrd/rrdcached-7.png
>> What's interesting is that the server got somewhat overloaded sitting a
>> lot in I/O wait. By and large the flush queue length remained under
>> control, except when doing backups (10pm, 8:30am). Memory usage by
>> rrdcached and collectd remained under control, but there is a long term
>> upward-trending slope to rrdcached's memory usage which is not good.
>> Possibly related to the power-of-two allocator patch that Florian
>> provided. The graph I find the most interesting one is the disk sdk disk
>> ops (3rd from the end). Before adding the last chunk of traffic the disk
>> load was write-dominated, which means that rrds were mostly cached in
>> memory (5-6 GB left after the processes). After adding the extra load
>> the disk load became read-dominated indicating that the rrd working set
>> exceeded memory.
>>    
>
> Thorsten,
>
> If you're becoming read dominated, you should consider lowering your file
> update/sec rate by increasing your -w/-f timers.  This just trades one
> kind of cache memory (f/s blocks) for another (update strings).
>  
Yeah, but given that I'm already using 1 hour of caching it starts
getting a bit uncomfortable. I haven't re-tested the journal reading
yet, but at some point restarting rrdcached becomes really difficult.
But regardless, I suspect there are 3 performance regimes:
 - working set much smaller than memory - cpu scales linearly/smoothly
with update rate
 - working set much larger than memory - it's all about random small
disk I/O throughput
 - working set similar to memory - relatively sharp (?) transition from
cpu/memory speed to disk speed
It's pretty clear that the performance when operating in memory is very
nice and smooth, but when memory is exceeded things become bumpy. It
degrades nicely and recovers well: I saw 9k queue items at one point and
that got worked off over close to an hour but did work itself out as
opposed to spinning out of control. But it doesn't look like a safe
operating regime. So basically gotta keep the rrd working set in memory.

> I'm sending a linear chunk allocator along for allocating
> cache_item_t.values in operator-defined block sizes..  I'd appreciate if
> you'd test it with your load to see if it reduces your CPU usage related
> to frequent realloc().
>
>  
What value of -m do you suggest? -w divided by step size?

Thanks,
Thorsten

_______________________________________________
rrd-developers mailing list
[hidden email]
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-developers
kevin brintnall

Re: rrdcached performance problem

Reply Threaded More More options
Print post
Permalink
> It's pretty clear that the performance when operating in memory is very
> nice and smooth, but when memory is exceeded things become bumpy. It
> degrades nicely and recovers well: I saw 9k queue items at one point and
> that got worked off over close to an hour but did work itself out as
> opposed to spinning out of control.

9k/hour seems awfully slow.  Even my old development box can flush
55k/hour.  Maybe more entries are being enqueued at the same time...

> What value of -m do you suggest? -w divided by step size?

That's the maximum; any more than that is just waste.  IIRC you had around
~400 cached values per RRD file.  So, if you use '-m 400' you'll only have
call realloc() one time.  However, the 401st value will cause a large
realloc()...

The ideal value is highly dependent on your malloc() implementation..  I
would start with a power of two large enough to get you relatively few
realloc() but small enough that you don't waste too much...  In your case,
maybe '-m 64'.

--
 kevin brintnall =~ /[hidden email]/

_______________________________________________
rrd-developers mailing list
[hidden email]
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-developers