More information on SNMP poll failure during ONMS restart

John A. Sullivan III

More information on SNMP poll failure during ONMS restart

Hello, all.  I have some more details on this ongoing problem several of
us have had with SNMP polling failing for random devices when ONMS is
restarted.

Unfortunately, we needed to restart last night when we encountered:

Oct 28 00:40:10 monitor01 kernel: BUG: soft lockup - CPU#0 stuck for 4124s! [swapper:0]
Oct 28 00:40:10 monitor01 kernel: Modules linked in: ipv6 autofs4 hidp l2cap bluetooth tun dm_mirror dm_multipath scsi_dh sbs sbshc battery acpi_memhotplu
Oct 28 00:40:10 monitor01 kernel: CPU 0:
Oct 28 00:40:10 monitor01 kernel: Modules linked in: ipv6 autofs4 hidp l2cap bluetooth tun dm_mirror dm_multipath scsi_dh sbs sbshc battery acpi_memhotplu
Oct 28 00:40:10 monitor01 kernel: Pid: 0, comm: swapper Tainted: G S         2.6.29.1 #2
Oct 28 00:40:10 monitor01 kernel: RIP: 0010:[<ffffffff80238e1c>]  [<ffffffff80238e1c>] native_safe_halt+0x2/0x3
Oct 28 00:40:10 monitor01 kernel: RSP: 0018:ffffffff8075df48  EFLAGS: 00000246
Oct 28 00:40:10 monitor01 kernel: RAX: ffffffff8075dfd8 RBX: ffffffff80767140 RCX: 0000000000000000
Oct 28 00:40:10 monitor01 kernel: RDX: ffffffff80238e1c RSI: 0000000000000001 RDI: ffffffff806caa10
Oct 28 00:40:10 monitor01 kernel: RBP: ffffffff80224c6e R08: 0000000000000000 R09: 0000000000000001
Oct 28 00:40:10 monitor01 kernel: R10: ffff88000103d580 R11: ffff88000103d580 R12: ffffffff807dc580
Oct 28 00:40:10 monitor01 kernel: R13: ffff88000103d580 R14: ffffffff8024d3bb R15: ffffffff806c7360
Oct 28 00:40:10 monitor01 kernel: FS:  0000000045d1f940(0000) GS:ffffffff807ea000(0000) knlGS:0000000000000000
Oct 28 00:40:10 monitor01 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Oct 28 00:40:10 monitor01 kernel: CR2: 00007f5382cf4580 CR3: 0000000000201000 CR4: 00000000000006e0
Oct 28 00:40:10 monitor01 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 28 00:40:10 monitor01 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 28 00:40:10 monitor01 kernel: Call Trace:
Oct 28 00:40:10 monitor01 kernel:  [<ffffffff80229f77>] ? default_idle+0x2a/0x46
Oct 28 00:40:10 monitor01 kernel:  [<ffffffff80222c9d>] ? cpu_idle+0x47/0x65

I don't think this is related.  This looks like a known bug in the
2.6.29 kernel we're running.  I've tossed it in here just in case it is
related or in case someone else has had this other problem and can shed
some light on it.

I've attached a more targeted trace than last time.  So much for
expunging data but we really need to get to the bottom of this :-(  The
device for which polling failed is 172.30.10.1.

Our OpenNMS server polls from several different addresses.  Because it
has elevated privileges, it communicates to anything off its local
network via OpenVPN tunnels authenticated via X.509 certificate to
ensure no one can spoof the privileged IP address.  The extended
credentials are enforced throughout the entire WAN - something we've
only found possible via the ISCS project (iscs.sourceforge.net).  Thus,
the ONMS addresses are 172.30.10.31, 192.168.124.125, 10.68.6.238, and
10.68.6.254.  Disregard the CRC errors - the checksum is being
calculated elsewhere.

Not knowing enough about the SNMP packet exchange, I'm not sure where the
problem lies.  In successful polls, there appears to be an initial
two-packet exchange followed shortly thereafter by another four packets.
The content is hard to discern because we are using privacy.

In our failed exchange, we see the first pair of packets exchanged.  We
then see the next two exchanged.  However, the third of those four
packets is sent from ONMS and there is no reply.  ONMS then resends a
packet of the same size roughly five seconds later.

This would seem to imply the problematic system truly is not responding.
However, looking at the exact same packet sequence in the logs on the
failing station, we see what appears to be a normal exchange - in fact
we see all four packets:

Oct 28 03:04:44 fw01 snmpd[22813]: Connection from UDP: [172.30.10.31]:37027
Oct 28 03:04:44 fw01 snmpd[22813]: Received SNMP packet(s) from UDP: [172.30.10.31]:37027
Oct 28 03:04:44 fw01 snmpd[22813]: Connection from UDP: [172.30.10.31]:37027
Oct 28 03:04:44 fw01 snmpd[22813]: Connection from UDP: [172.30.10.31]:37027
Oct 28 03:04:49 fw01 snmpd[22813]: Connection from UDP: [172.30.10.31]:37027
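
For anyone comparing their own agent logs against a trace, here is a minimal Python sketch that tallies the snmpd messages per source address and port. It assumes the syslog format shown in the excerpt above; the names `LINE_RE` and `summarize` are made up for illustration and are not part of net-snmp or OpenNMS.

```python
import re
from collections import Counter

# Matches the two snmpd message types seen in the excerpt above.
# Assumption: other net-snmp builds may word these messages differently.
LINE_RE = re.compile(
    r"snmpd\[\d+\]: (?P<kind>Connection from|Received SNMP packet\(s\) from) "
    r"UDP: \[(?P<ip>[0-9.]+)\]:(?P<port>\d+)"
)


def summarize(lines):
    """Count snmpd log events per (source ip, source port, event kind)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            key = (match["ip"], int(match["port"]), match["kind"])
            counts[key] += 1
    return counts


# The five log lines quoted above, copied verbatim.
sample = [
    "Oct 28 03:04:44 fw01 snmpd[22813]: Connection from UDP: [172.30.10.31]:37027",
    "Oct 28 03:04:44 fw01 snmpd[22813]: Received SNMP packet(s) from UDP: [172.30.10.31]:37027",
    "Oct 28 03:04:44 fw01 snmpd[22813]: Connection from UDP: [172.30.10.31]:37027",
    "Oct 28 03:04:44 fw01 snmpd[22813]: Connection from UDP: [172.30.10.31]:37027",
    "Oct 28 03:04:49 fw01 snmpd[22813]: Connection from UDP: [172.30.10.31]:37027",
]

for key, count in sorted(summarize(sample).items()):
    print(key, count)
```

Run against a longer log, the same tally makes it easier to line up the agent's view of each source port with the packets seen in a capture.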

Unfortunately, I did not trace from the other side to see if the missing
reply was indeed placed on the network and was lost in transmission to
ONMS (unlikely as they are on the same local network) or if it was never
sent.

I do notice my snmp-config.xml is set to only 1 retry.  I'll change it
to 2 and try to test during a scheduled outage.  That may make the
system a little more resilient to a failed response like this.
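
As a rough illustration of what the extra retry buys, here is a small Python sketch of the worst-case wait before a poll is given up. It rests on two assumptions that should be checked against the actual configuration: that the timeout is about five seconds (inferred only from the resend interval in the trace) and that each retry adds one send on top of the initial request. The function name `worst_case_wait` is hypothetical.

```python
def worst_case_wait(timeout_s, retries):
    """Seconds until a poll is declared failed.

    Assumption: one initial send plus `retries` resends, each waiting
    the same fixed timeout before giving up.
    """
    return timeout_s * (retries + 1)


# Assumed 5 s timeout, compared for the current and the planned retry count.
for retries in (1, 2):
    print(f"retries={retries}: up to {worst_case_wait(5.0, retries)} s")
```

Under those assumptions, going from 1 retry to 2 tolerates one more lost reply per poll at the cost of a longer wait before a failure is reported.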

However, we still do not know why there was no response.  It is not
likely the device was too busy to respond.  We are in pre-launch of the
company so the systems are very robust and lightly loaded.

It could be a bug in net-snmp.  172.30.10.1 is a Linux device running
CentOS 5.4 and the yum-based net-snmp (net-snmp-5.3.2.2-7.el5).
However, we are also seeing the problem with ProCurve switches and
SnapGear SG565 VPN gateways, so it does not seem to be limited to
net-snmp-5.3.2.2-7.el5.

It is possible the packets are getting lost in the network.

Is it possible that ONMS is sending a malformed packet and thus does not
receive a response? Next time, I'll trace to include ICMP packets just
in case there is some notification of a problem.  Is there someplace in
the logs where we can see what SNMPv3 parameters were used in each
query? Perhaps it occasionally mis-sends one of them.
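
A minimal sketch of a capture filter that would include ICMP alongside the SNMP traffic, written as a small Python helper. It assumes the agent listens on the standard SNMP port 161 and that the capture tool accepts pcap filter syntax (tcpdump does); the function name `capture_filter` is made up for illustration.

```python
def capture_filter(device_ip, snmp_port=161):
    """Build a pcap filter for SNMP and ICMP traffic to or from one device.

    Assumption: the agent uses UDP port 161; pass another port if not.
    """
    return f"host {device_ip} and (udp port {snmp_port} or icmp)"


# The device whose polling failed in the trace above.
print(capture_filter("172.30.10.1"))
```

The resulting string can be passed as the filter expression to tcpdump, for example together with `-n`, `-i` for the interface, and `-w` to write the capture to a file.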

So, not a whole lot further but hopefully this can spark something in
someone else's mind. I know there were several others reporting the same
issue. Have any of you made any further progress? I'll report back on
how changing the retry value in snmp-config.xml worked now that I've
removed those settings from the capsd/pollerd/collectd configurations.
Thanks - John
--
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
[hidden email]

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


------------------------------------------------------------------------------
Come build with us! The BlackBerry(R) Developer Conference in SF, CA
is the only developer event you need to attend this year. Jumpstart your
developing skills, take BlackBerry mobile applications to market and stay
ahead of the curve. Join us from November 9 - 12, 2009. Register now!
http://p.sf.net/sfu/devconference
_______________________________________________
Please read the OpenNMS Mailing List FAQ:
http://www.opennms.org/index.php/Mailing_List_FAQ

opennms-discuss mailing list

To *unsubscribe* or change your subscription options, see the bottom of this page:
https://lists.sourceforge.net/lists/listinfo/opennms-discuss

Attachment: onms-prob-fw01.trace (326K)
John Blake

Re: More information on SNMP poll failure during ONMS restart


 John,
 Do you have route tables on your server? I'm wondering if ONMS sends the data out one interface (sourced from one IP); you reboot, and then the data is sourced from a different interface/IP.
Some firewalls do not like that at all. I have a Data Domain box that will send responses out any interface, and the VPN won't let them through because I sent an ICMP packet to a specific IP and it's expecting a response from that IP, not a different one. So I'll get responses, then suddenly stop getting them for, say, 30 minutes, and then they come back.
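
One way to check for this in a capture is to look for replies whose source address was never polled. The following is only a sketch under stated assumptions: packets have already been reduced to source/destination pairs, the `Packet` type and `mismatched_replies` function are hypothetical names, and the 192.0.2.x device addresses are documentation placeholders rather than addresses from this thread.

```python
from typing import Iterable, List, NamedTuple, Set


class Packet(NamedTuple):
    src: str  # source IP address
    dst: str  # destination IP address


def mismatched_replies(
    requests: Iterable[Packet],
    replies: Iterable[Packet],
    poller_ips: Set[str],
) -> List[Packet]:
    """Return replies sent to the poller from an address it never polled."""
    polled = {p.dst for p in requests if p.src in poller_ips}
    return [r for r in replies if r.dst in poller_ips and r.src not in polled]


poller = {"172.30.10.31"}  # one of the ONMS addresses mentioned in this thread
requests = [Packet("172.30.10.31", "192.0.2.10")]  # placeholder device address
replies = [
    Packet("192.0.2.10", "172.30.10.31"),  # reply from the address polled
    Packet("192.0.2.11", "172.30.10.31"),  # reply sourced from another interface
]

for reply in mismatched_replies(requests, replies, poller):
    print("reply from unexpected source:", reply.src)
```

Any reply it flags is a candidate for being dropped by a stateful firewall or VPN that expects the answer to come from the address that was queried.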


John




From: "John A. Sullivan III" <[hidden email]>
To: General OpenNMS Discussion <[hidden email]>
Date: 10/28/2009 12:31 PM
Subject: [opennms-discuss] More information on SNMP poll failure during ONMS restart





John A. Sullivan III

Re: More information on SNMP poll failure during ONMS restart

Thanks, John.  An interesting idea but I don't think that is our issue.
There are indeed routes on the ONMS server from the VPN connections.
There are no iptables rules on the ONMS server so there is no connection
tracking there.

The problem isn't after a reboot but after an ONMS restart.  Frequently,
the problems are with devices which are on the local network and not on
the other side of the firewall, e.g., the local switch.

Another strange thing I did not mention here but did in the earlier
thread - it never comes back.  It does not matter how long we wait, the
SNMP polling never succeeds once it fails.  That's why I don't think it
is dropped packets on the network or the monitored devices not
responding and why I wonder if something has become confused in ONMS's
view of the end point and that confusion is causing it to make the
queries with incorrect credentials / data / format.  Thanks - John

On Wed, 2009-10-28 at 14:15 -0400, John Blake wrote:

>
>  John,
>  Do you have route tables on your server? I'm wondering if ONMS sends
> the data out one interface (sourced from one IP); you reboot, and then
> the data is sourced from a different interface/IP.
> Some firewalls do not like that at all. I have a Data Domain box that
> will send responses out any interface, and the VPN won't let them
> through because I sent an ICMP packet to a specific IP and it's
> expecting a response from that IP, not a different one. So I'll get
> responses, then suddenly stop getting them for, say, 30 minutes, and
> then they come back.
>
>
> John
John Blake

Re: More information on SNMP poll failure during ONMS restart


 I would only start getting a response from my Data Domain when it decided to send a packet back down the correct interface. I couldn't do anything about it on my side, so we decided to just ping one interface, get syslog info from it, and open a ticket with them about the routing in their device.

The only way I see to narrow it down is to watch the traffic leave the ONMS server for the node in question.




From: "John A. Sullivan III" <[hidden email]>
To: General OpenNMS Discussion <[hidden email]>
Date: 10/28/2009 02:54 PM
Subject: Re: [opennms-discuss] More information on SNMP poll failure during ONMS restart





>
> However, we still do not know why there was no response.  It is not
> likely the device was too busy to respond.  We are in pre-launch of
> the
> company so the systems are very robust and lightly loaded.
>
> It could be a bug in net-snmp.  172.30.10.1 is a Linux device running
> CentOS 5.4 and the yum based net-snmp (net-snmp-5.3.2.2-7.el5).
> However, we are seeing the problem with ProCurve switches and SnapGear
> SG565 VPN gateways. It does not seem to be limited to
> net-snmp-5.3.2.2-7.el5.
>
> It is possible the packets are getting lost in the network.
>
> Is it possible that ONMS is sending a malformed packet and thus does
> not
> receive a response? Next time, I'll trace to include ICMP packets just
> in case there is some notification of a problem.  Is there someplace
> in
> the logs where we can see want SNMPv3 parameters were used in each
> query? Perhaps it occasionally mis-sends one of them.
>
> So, not a whole lot further but hopefully this can spark something in
> someone else's mind. I know there were several others reporting the
> same
> issue. Have any of you made any further progress? I'll report back on
> how changing the retry value in snmp-config.xml worked now that I've
> removed those settings from the capsd/pollerd/collectd configurations.
> Thanks - John
> --
> John A. Sullivan III
> Open Source Development Corporation
> +1 207-985-7880
> [hidden email]
>
>
http://www.spiritualoutreach.com
> Making Christianity intelligible to secular society
> [attachment "onms-prob-fw01.trace" deleted by John Blake/Cary/IBM]
--
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
[hidden email]

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


John A. Sullivan III

Re: More information on SNMP poll failure during ONMS restart

Ah, so it sounds like you're talking about monitoring the firewall
itself.

Indeed, the trace I attached is watching the traffic entering and
leaving the ONMS server.  Thanks - John
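
For anyone who wants to check a trace of traffic entering and leaving the ONMS server for requests that never got an answer, here is a minimal Python sketch. It assumes the capture has already been reduced to (time, source, destination, length) records. The record layout, the function name and the sample times and sizes are hypothetical; only the two addresses come from this thread. Because privacy hides the request IDs, it simply pairs each request with the next reply in arrival order.

```python
from typing import NamedTuple

class Packet(NamedTuple):
    time: float   # seconds since the start of the capture
    src: str      # source IP address
    dst: str      # destination IP address
    length: int   # payload size in bytes

def unanswered_requests(packets, manager, agent):
    """Return manager->agent packets that saw no agent->manager reply
    before the next request (or before the end of the capture)."""
    pending = None
    missing = []
    for pkt in packets:
        if pkt.src == manager and pkt.dst == agent:
            if pending is not None:
                # A new request went out while the previous one was unanswered.
                missing.append(pending)
            pending = pkt
        elif pkt.src == agent and pkt.dst == manager:
            pending = None  # the outstanding request has been answered
    if pending is not None:
        missing.append(pending)
    return missing

ONMS = "172.30.10.31"  # one of the ONMS addresses named in the thread
FW = "172.30.10.1"     # the device for which polling failed

# Hypothetical records shaped like the failed exchange described in the
# thread: two answered requests, then a request and a resend with no reply.
trace = [
    Packet(0.00, ONMS, FW, 64), Packet(0.01, FW, ONMS, 110),
    Packet(0.02, ONMS, FW, 140), Packet(0.03, FW, ONMS, 150),
    Packet(0.04, ONMS, FW, 141),
    Packet(5.04, ONMS, FW, 141),
]

if __name__ == "__main__":
    for pkt in unanswered_requests(trace, ONMS, FW):
        print(f"no reply to {pkt.length}-byte request sent at t={pkt.time:.2f}s")
```

Pairing by order is a deliberate simplification: it cannot tell a late reply from a reply to the resend, so treat its output as a pointer to where to look in the full capture rather than as proof.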

On Wed, 2009-10-28 at 15:29 -0400, John Blake wrote:

>
>  I would only start getting a response from my Data Domain when it
> would decide to send a packet back down the correct interface. I
> couldn't do anything about it on my side. So we decided to just ping 1
> interface, get syslog info from it and open a ticket with them about
> the routing in their device.
>
> The only way I see to narrow it down is to try to watch the traffic
> leave the ONMS server to the node in question.
>
>
>
>
> From: "John A. Sullivan III" <[hidden email]>
> To: General OpenNMS Discussion <[hidden email]>
> Date: 10/28/2009 02:54 PM
> Subject: Re: [opennms-discuss] More information on SNMP poll failure during ONMS restart
>
>
> ______________________________________________________________________
>
>
>
> Thanks, John.  An interesting idea but I don't think that is our issue.
> There are indeed routes on the ONMS server from the VPN connections.
> There are no iptables rules on the ONMS server so there is no connection
> tracking there.
>
> The problem isn't after a reboot but after an ONMS restart.  Frequently,
> the problems are with devices which are on the local network and not on
> the other side of the firewall, e.g., the local switch.
>
> Another strange thing I did not mention here but did in the earlier
> thread - it never comes back.  It does not matter how long we wait, the
> SNMP polling never succeeds once it fails.  That's why I don't think it
> is dropped packets on the network or the monitored devices not
> responding and why I wonder if something has become confused in ONMS's
> view of the end point and that confusion is causing it to make the
> queries with incorrect credentials / data / format.  Thanks - John
>
> On Wed, 2009-10-28 at 14:15 -0400, John Blake wrote:
> >
> >  John,
> >  Do you have route tables in your server? I'm wondering if ONMS sends
> > the data out one interface (sourced from one IP), you reboot, then the
> > data is sourced from a different interface/IP.
> > Some firewalls do not like that at all. I have a Data Domain box that
> > will send responses out any interface and the VPN won't let it through
> > because I sent an ICMP to a specific IP and it's expecting a response
> > from that IP, not a different one. So I'll get responses and then
> > suddenly stop getting them for say 30 minutes, then it comes back.
> >
> >
> > John
> >
> >
> >
> >
--
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
[hidden email]

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


John A. Sullivan III

Re: More information on SNMP poll failure during ONMS restart

On Wed, 2009-10-28 at 12:26 -0400, John A. Sullivan III wrote:

> Hello, all.  I have some more details on this ongoing problem several of
> us have had with SNMP polling failing for random devices when ONMS is
> restarted.
>
> Unfortunately, we needed to restart last night when we encountered:
>
> Oct 28 00:40:10 monitor01 kernel: BUG: soft lockup - CPU#0 stuck for 4124s! [swapper:0]
> Oct 28 00:40:10 monitor01 kernel: Modules linked in: ipv6 autofs4 hidp l2cap bluetooth tun dm_mirror dm_multipath scsi_dh sbs sbshc battery acpi_memhotplu
> Oct 28 00:40:10 monitor01 kernel: CPU 0:
> Oct 28 00:40:10 monitor01 kernel: Modules linked in: ipv6 autofs4 hidp l2cap bluetooth tun dm_mirror dm_multipath scsi_dh sbs sbshc battery acpi_memhotplu
> Oct 28 00:40:10 monitor01 kernel: Pid: 0, comm: swapper Tainted: G S         2.6.29.1 #2
> Oct 28 00:40:10 monitor01 kernel: RIP: 0010:[<ffffffff80238e1c>]  [<ffffffff80238e1c>] native_safe_halt+0x2/0x3
> Oct 28 00:40:10 monitor01 kernel: RSP: 0018:ffffffff8075df48  EFLAGS: 00000246
> Oct 28 00:40:10 monitor01 kernel: RAX: ffffffff8075dfd8 RBX: ffffffff80767140 RCX: 0000000000000000
> Oct 28 00:40:10 monitor01 kernel: RDX: ffffffff80238e1c RSI: 0000000000000001 RDI: ffffffff806caa10
> Oct 28 00:40:10 monitor01 kernel: RBP: ffffffff80224c6e R08: 0000000000000000 R09: 0000000000000001
> Oct 28 00:40:10 monitor01 kernel: R10: ffff88000103d580 R11: ffff88000103d580 R12: ffffffff807dc580
> Oct 28 00:40:10 monitor01 kernel: R13: ffff88000103d580 R14: ffffffff8024d3bb R15: ffffffff806c7360
> Oct 28 00:40:10 monitor01 kernel: FS:  0000000045d1f940(0000) GS:ffffffff807ea000(0000) knlGS:0000000000000000
> Oct 28 00:40:10 monitor01 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> Oct 28 00:40:10 monitor01 kernel: CR2: 00007f5382cf4580 CR3: 0000000000201000 CR4: 00000000000006e0
> Oct 28 00:40:10 monitor01 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Oct 28 00:40:10 monitor01 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Oct 28 00:40:10 monitor01 kernel: Call Trace:
> Oct 28 00:40:10 monitor01 kernel:  [<ffffffff80229f77>] ? default_idle+0x2a/0x46
> Oct 28 00:40:10 monitor01 kernel:  [<ffffffff80222c9d>] ? cpu_idle+0x47/0x65
>
> I don't think this is related.  This looks like a known bug in the
> 2.6.29 kernel we're running.  I've tossed it in here just in case it is
> related or in case someone else has had this other problem and can shed
> some light on it.
>
> I've attached a more targeted trace than last time.  So much for
> expunging data but we really need to get to the bottom of this :-(  The
> device for which polling failed is 172.30.10.1.
>
> Our OpenNMS server polls from several different addresses.  Because it
> has elevated privileges, it communicates to anything off its local
> network via OpenVPN tunnels authenticated via X.509 certificate to
> ensure no one can spoof the privileged IP address.  The extended
> credentials are enforced throughout the entire WAN - something we've
> only found possible via the ISCS project (iscs.sourceforge.net).  Thus,
> the ONMS addresses are 172.30.10.31, 192.168.124.125, 10.68.6.238, and
> 10.68.6.254.  Disregard the CRC errors - the checksum is being
> calculated elsewhere.
>
> Not knowing enough about SNMP packet exchange, I'm not sure of where the
> problem lies.  In looking at successful polls, there appears to be an
> initial two packet exchange followed shortly thereafter by another four
> packets.  The content is hard to discern because we are using privacy.
>
> In our failed exchange, we see the first pair of packets exchanged.  We
> then see the next two exchanged.  However, the third is sent from ONMS
> and there is no reply.  It then resends the same sized packet roughly
> five seconds later.
>
> This would seem to imply the problematic system truly is not responding.
> However, looking at the exact same packet sequence in the logs on the
> failing station, we see what appears to be a normal exchange - in fact
> we see all four packets:
>
> Oct 28 03:04:44 fw01 snmpd[22813]: Connection from UDP: [172.30.10.31]:37027
> Oct 28 03:04:44 fw01 snmpd[22813]: Received SNMP packet(s) from UDP: [172.30.10.31]:37027
> Oct 28 03:04:44 fw01 snmpd[22813]: Connection from UDP: [172.30.10.31]:37027
> Oct 28 03:04:44 fw01 snmpd[22813]: Connection from UDP: [172.30.10.31]:37027
> Oct 28 03:04:49 fw01 snmpd[22813]: Connection from UDP: [172.30.10.31]:37027
>
> Unfortunately, I did not trace from the other side to see if the missing
> reply was indeed placed on the network and was lost in transmission to
> ONMS (unlikely as they are on the same local network) or if it was never
> sent.
>
> I do notice my snmp-config.xml is set to only 1 retry.  I'll change it
> to 2 and try to test during a scheduled outage.  That may make the
> system a little more resilient to a failed response like this.
>
> However, we still do not know why there was no response.  It is not
> likely the device was too busy to respond.  We are in pre-launch of the
> company so the systems are very robust and lightly loaded.
>
> It could be a bug in net-snmp.  172.30.10.1 is a Linux device running
> CentOS 5.4 and the yum based net-snmp (net-snmp-5.3.2.2-7.el5).
> However, we are seeing the problem with ProCurve switches and SnapGear
> SG565 VPN gateways. It does not seem to be limited to
> net-snmp-5.3.2.2-7.el5.
>
> It is possible the packets are getting lost in the network.
>
> Is it possible that ONMS is sending a malformed packet and thus does not
> receive a response? Next time, I'll trace to include ICMP packets just
> in case there is some notification of a problem.  Is there someplace in
> the logs where we can see what SNMPv3 parameters were used in each
> query? Perhaps it occasionally mis-sends one of them.
>
> So, not a whole lot further but hopefully this can spark something in
> someone else's mind. I know there were several others reporting the same
> issue. Have any of you made any further progress? I'll report back on
> how changing the retry value in snmp-config.xml worked now that I've
> removed those settings from the capsd/pollerd/collectd configurations.
> Thanks - John
<snip>
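
To repeat the check against the snmpd log excerpt quoted above, a small Python sketch can tally the entries per timestamp and message type. The regular expression is an assumption fitted to the five quoted lines only, not a general parser for snmpd output, and the names LOG, LINE and tally are made up for this illustration.

```python
import re
from collections import Counter

# The five snmpd log lines quoted in this thread, unwrapped.
LOG = """\
Oct 28 03:04:44 fw01 snmpd[22813]: Connection from UDP: [172.30.10.31]:37027
Oct 28 03:04:44 fw01 snmpd[22813]: Received SNMP packet(s) from UDP: [172.30.10.31]:37027
Oct 28 03:04:44 fw01 snmpd[22813]: Connection from UDP: [172.30.10.31]:37027
Oct 28 03:04:44 fw01 snmpd[22813]: Connection from UDP: [172.30.10.31]:37027
Oct 28 03:04:49 fw01 snmpd[22813]: Connection from UDP: [172.30.10.31]:37027
"""

# Pattern fitted to the lines above (an assumption, not a full syslog grammar).
LINE = re.compile(
    r"^(?P<ts>\w{3} +\d+ [\d:]{8}) \S+ snmpd\[\d+\]: "
    r"(?P<event>Connection from|Received SNMP packet\(s\) from) "
    r"UDP: \[(?P<ip>[\d.]+)\]:(?P<port>\d+)$"
)

def tally(log_text):
    """Count log entries per (timestamp, event) pair; skip lines that do not match."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if match:
            counts[(match["ts"], match["event"])] += 1
    return counts

if __name__ == "__main__":
    for (ts, event), n in sorted(tally(LOG).items()):
        print(f"{ts}  {event}: {n}")
```

The tally only restates what the log says: four "Connection from" entries, the last one five seconds after the others, and a single "Received SNMP packet(s)" entry. What snmpd means by each message is not something this sketch can answer.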
Unfortunately, we had another kernel event this morning which gave us an
opportunity to test our changes and do some more tracing.  Also
unfortunately, even with retries now set to 2, we still had a problem.
It took us four restarts this time instead of dozens but sometimes it
doesn't take any - still very random.

Alas, each time I set up a trace, the failure moved to a different node!
Hence, no additional information there.  Still digging - John
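
As a rough illustration of what the retry change buys, here is a minimal Python sketch of when requests go out and when the poller gives up. It rests on three assumptions that the thread does not confirm: the timeout is a fixed 5 seconds (inferred from the resend seen roughly five seconds later), there is no backoff between attempts, and the retry count is in addition to the first attempt. Both function names are hypothetical.

```python
def send_times(timeout_s, retries):
    """Times (seconds after the first request) at which each attempt is sent,
    assuming a fixed timeout and no backoff."""
    return [attempt * timeout_s for attempt in range(retries + 1)]

def give_up_after(timeout_s, retries):
    """Seconds until the poller stops waiting if no attempt is answered."""
    return (retries + 1) * timeout_s

if __name__ == "__main__":
    for retries in (1, 2):
        print(retries, send_times(5.0, retries), give_up_after(5.0, retries))
```

Under these assumptions, going from 1 retry to 2 adds one more request and five more seconds of waiting. That would only help with a transient loss; it would not help if the device never answers again, which matches the earlier report in this thread that polling never recovers once it fails.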
--
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
[hidden email]

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


John A. Sullivan III

Re: More information on SNMP poll failure during ONMS restart

On Thu, 2009-10-29 at 10:28 -0400, John A. Sullivan III wrote:

> On Wed, 2009-10-28 at 12:26 -0400, John A. Sullivan III wrote:
> > It could be a bug in net-snmp.  172.30.10.1 is a Linux device running
> > CentOS 5.4 and the yum based net-snmp (net-snmp-5.3.2.2-7.el5).
> > However, we are seeing the problem with ProCurve switches and SnapGear
> > SG565 VPN gateways. It does not seem to be limited to
> > net-snmp-5.3.2.2-7.el5.
> >
> > It is possible the packets are getting lost in the network.
> >
> > Is it possible that ONMS is sending a malformed packet and thus does not
> > receive a response? Next time, I'll trace to include ICMP packets just
> > in case there is some notification of a problem.  Is there someplace in
> > the logs where we can see what SNMPv3 parameters were used in each
> > query? Perhaps it occasionally mis-sends one of them.
> >
> > So, not a whole lot further but hopefully this can spark something in
> > someone else's mind. I know there were several others reporting the same
> > issue. Have any of you made any further progress? I'll report back on
> > how changing the retry value in snmp-config.xml worked now that I've
> > removed those settings from the capsd/pollerd/collectd configurations.
> > Thanks - John
> <snip>
> Unfortunately, we had another kernel event this morning which gave us an
> opportunity to test our changes and do some more tracing.  Also
> unfortunately, even with retries now set to 2, we still had a problem.
> It took us four restarts this time instead of dozens but sometimes it
> doesn't take any - still very random.
>
> Alas, each time I set up a trace, it moved to a different node! Hence,
> no additional information there.  Still digging - John
I was able to gather more data during a restart - this time from both
sides of the connection.  The view was the same on both sides, viz.,
ONMS SNMP -> reply -> ONMS SNMP -> reply -> ONMS SNMP -> no reply.
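
A minimal sketch of how that pattern could be pulled out of a capture
automatically, assuming the packets have already been exported as simple
records.  The Packet class, the unanswered_requests function and the
5-second timeout are illustrative names and values, not anything OpenNMS
or the capture tool provides:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    ts: float   # capture timestamp in seconds
    src: str    # source IP address
    dst: str    # destination IP address
    sport: int  # source UDP port
    dport: int  # destination UDP port


def unanswered_requests(packets, manager_ip, timeout=5.0):
    """Return packets sent by manager_ip that were not answered.

    A reply is any later packet on the reversed address/port pair.  Each
    reply answers at most one request (the oldest one still pending on
    that flow).  Requests older than `timeout` when a reply arrives, or
    still pending when the capture ends, are reported as unanswered.
    """
    pending = {}  # flow key -> list of requests still awaiting a reply
    missing = []
    for pkt in sorted(packets, key=lambda p: p.ts):
        if pkt.src == manager_ip:
            key = (pkt.src, pkt.sport, pkt.dst, pkt.dport)
            pending.setdefault(key, []).append(pkt)
        else:
            queue = pending.get((pkt.dst, pkt.dport, pkt.src, pkt.sport), [])
            # Requests that timed out before this reply are not matched to it.
            while queue and pkt.ts - queue[0].ts > timeout:
                missing.append(queue.pop(0))
            if queue:
                queue.pop(0)
    for queue in pending.values():
        missing.extend(queue)
    return sorted(missing, key=lambda p: p.ts)
```

Fed the exchange above (three requests, two replies, then a retry about
five seconds later), this would flag the third request and its retry.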

So the reply never left the failed node . . . or at least so we
suspect.  This device happens to be multi-homed with bonded interfaces,
so it is not a good test candidate: the reply could have left on an
interface I was not tracing.  The next time we trace on this device,
I'll capture on all interfaces to see if the reply is simply being
misdirected.

However, based upon empirical evidence, I don't think so because we have
seen failures on non-multihomed devices.

By the way, are the others who are having this problem using alb bonding
by any chance? Most of the devices in our environment are.  Thanks -
John

PS - if anyone wants, I can send the traces but it is more of the same -
John
--
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
[hidden email]

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


------------------------------------------------------------------------------
Come build with us! The BlackBerry(R) Developer Conference in SF, CA
is the only developer event you need to attend this year. Jumpstart your
developing skills, take BlackBerry mobile applications to market and stay
ahead of the curve. Join us from November 9 - 12, 2009. Register now!
http://p.sf.net/sfu/devconference
_______________________________________________
Please read the OpenNMS Mailing List FAQ:
http://www.opennms.org/index.php/Mailing_List_FAQ

opennms-discuss mailing list

To *unsubscribe* or change your subscription options, see the bottom of this page:
https://lists.sourceforge.net/lists/listinfo/opennms-discuss
John Blake

Re: More information on SNMP poll failure during ONMS restart


 Do you see the packet leaving the ONMS server going out the correct interface to the proper IP each time?
 




John A. Sullivan III

Re: More information on SNMP poll failure during ONMS restart

Yes, although, honestly, the trace was only on a single interface.
However, the problem does not seem to be the ONMS server sending the
packet.  The problem is the target does not reply.  The question is why?

1) Is the reply being lost? We do not see it being placed on the wire at
all, but the definitive test will be either a trace on all interfaces of
a multi-homed device or a trace of the same failure on a single-homed
device, preferably one without bonding, like the switch.  (Of course,
the switch has plenty of opportunity to lose a packet, too, especially
with some stations doing ALB bonding.)

2) Is the target overwhelmed? Not likely: these are powerful, lightly
loaded servers with no spikes in CPU.

3) Is it a net-snmp bug? Possible, but we are seeing it on non-net-snmp
devices.

4) Is the packet malformed by ONMS? This would explain the random,
cross-platform phenomenon, but how do we tell?

Any suggestions, especially on that last point, would be most helpful.
Thanks - John
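
One possible way to approach point 4, sketched under assumptions: even
with privacy enabled, only the scoped PDU of an SNMPv3 message is
encrypted, so the message header (version, msgID, msgMaxSize, msgFlags,
security model) can be read in the clear and compared between an
answered and an unanswered request.  The helper names below are
hypothetical, and getting the raw UDP payload out of the trace is left
out:

```python
def read_tlv(buf, pos):
    """Decode one BER tag-length-value at pos; return (tag, value, next_pos)."""
    tag = buf[pos]
    length = buf[pos + 1]
    pos += 2
    if length & 0x80:
        # Long form: the low 7 bits give the number of length octets.
        count = length & 0x7F
        length = int.from_bytes(buf[pos:pos + count], "big")
        pos += count
    return tag, buf[pos:pos + length], pos + length


def read_int(buf, pos):
    """Decode a BER INTEGER at pos; return (value, next_pos)."""
    tag, value, nxt = read_tlv(buf, pos)
    assert tag == 0x02, "expected INTEGER"
    return int.from_bytes(value, "big", signed=True), nxt


def snmpv3_header(datagram):
    """Extract the cleartext header fields of an SNMPv3 message."""
    tag, msg, _ = read_tlv(datagram, 0)
    assert tag == 0x30, "expected outer SEQUENCE"
    version, pos = read_int(msg, 0)
    tag, glob, pos = read_tlv(msg, pos)  # msgGlobalData SEQUENCE
    assert tag == 0x30, "expected msgGlobalData SEQUENCE"
    msg_id, g = read_int(glob, 0)
    max_size, g = read_int(glob, g)
    tag, flags, g = read_tlv(glob, g)    # msgFlags, a one-byte OCTET STRING
    assert tag == 0x04, "expected OCTET STRING"
    sec_model, g = read_int(glob, g)
    return {
        "version": version,
        "msg_id": msg_id,
        "max_size": max_size,
        "auth": bool(flags[0] & 0x01),
        "priv": bool(flags[0] & 0x02),
        "reportable": bool(flags[0] & 0x04),
        "security_model": sec_model,
    }
```

If the header of the unanswered request differs from the answered ones
in something other than msgID, that would point at the sender; if they
match, the header at least is probably not the culprit.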

On Thu, 2009-10-29 at 16:52 -0400, John Blake wrote:

>
>  Do you see the packet leaving the ONMS server going out the correct
> interface to the proper IP each time?
>  
--
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
[hidden email]

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


James Masson

Re: More information on SNMP poll failure during ONMS restart




John A. Sullivan III wrote:

> Yes, although, honestly, the trace was only on a single interface.
> However, the problem does not seem to be the ONMS server sending the
> packet.  The problem is the target does not reply.  The question is why?
>
> 1) Is the reply being lost? We do not see it being placed on the wire at
> all but the definitive determination will be when we either do a trace
> on all interfaces of a multi-homed device or trace the same problem
> happening on a single homed device especially one without bonding like
> the switch (of course, the switch has plenty of opportunity to lose a
> packet, too, especially with some stations doing ALB bonding)
>

I've seen a few similar problems with OpenNMS and multihomed machines. 99% of the machines here are
multihomed, often with 5 different interfaces, some routable, some non-routable, and some which are
partially routable if the firewalls aren't doing source IP filtering.

Because SNMP uses UDP, and is connectionless, it's possible for an SNMP packet from OpenNMS to get to
an interface which hasn't got a direct route back to OpenNMS. The reply to OpenNMS goes out of the
machine's default gateway on a different interface, with a _source_ IP of the original interface,
and makes it back to OpenNMS. An snmpwalk of this partially routable interface will yield the first
line of the SNMP reply, and will then hang.

This seems to:

 1) make OpenNMS think SNMP is available on this interface
 2) depending on the interfaces available, lead OpenNMS to select this dodgy interface as the Primary
interface for the node.


My solution was to enable source IP filtering on the firewalls between the networks. This ensures
only packets with a source IP on that network will be accepted at the local firewall.
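
As a rough illustration of the asymmetric path described above, here is a small sketch that checks
whether a reply would leave on a different interface from the one the request arrived on. The
route table, interface names and addresses are made up for the example, and the lookup is a plain
longest-prefix match rather than anything a real kernel or firewall exposes:

```python
import ipaddress


def egress_interface(routes, dest_ip):
    """Pick the outgoing interface for dest_ip by longest-prefix match.

    routes is a list of (network, interface) pairs, e.g. ("0.0.0.0/0", "eth0").
    Returns None when no route matches.
    """
    dest = ipaddress.ip_address(dest_ip)
    best = None
    for network, iface in routes:
        net = ipaddress.ip_network(network)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, iface)
    return best[1] if best else None


def reply_is_asymmetric(routes, arrived_on, manager_ip):
    """True when the reply to manager_ip would not leave via arrived_on."""
    return egress_interface(routes, manager_ip) != arrived_on


# Hypothetical multihomed host: two directly attached networks plus a
# default route out of eth0.
routes = [
    ("192.168.50.0/24", "eth1"),
    ("10.20.0.0/16", "eth2"),
    ("0.0.0.0/0", "eth0"),
]
```

With these made-up routes, a request from a manager at 172.16.5.10 that arrives on eth1 would be
answered via eth0, which is the case source IP filtering is meant to catch.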

James M

John A. Sullivan III

Re: More information on SNMP poll failure during ONMS restart

On Fri, 2009-10-30 at 09:26 +0000, James Masson wrote:

<snip>
Thanks, James.  That makes a great deal of sense and is why we set SNMP
as unmanaged on all such interfaces. The failure is always on directly
connected interfaces.  Although I'm sure this is a valid issue, I'm not
sure it's the problem going on here since we see the problem manifest
itself even on the local switch.
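
One way to test the asymmetric-reply theory on a given Linux target might be to ask the kernel which interface and source address it would use to reach the poller, and then watch SNMP traffic on every interface at once. This is only a sketch: the function names are made up for illustration, 192.0.2.10 is a placeholder for the ONMS server address, and it assumes the iproute2 `ip` tool and `tcpdump` are installed on the target (tcpdump normally needs root).

```shell
# Hypothetical helpers; run on the polled host, not on the ONMS server.

# Show the route (outgoing device and source address) the kernel would
# pick for a reply to the poller.
reply_path() {
    ip route get "$1"
}

# Capture SNMP traffic to or from the poller on all interfaces
# (the Linux "any" pseudo-interface), without name resolution.
trace_snmp() {
    tcpdump -n -i any udp port 161 and host "$1"
}

# Example (192.0.2.10 is a placeholder for the ONMS server):
#   reply_path 192.0.2.10
#   trace_snmp 192.0.2.10
```

If the device reported by the first command is not the interface the request arrived on, the reply is probably leaving by a different path; if the capture shows requests arriving and no reply on any interface, the agent itself is not answering.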

We might have a kernel and a distribution problem although I think the
kernel problem is unrelated. I noticed an occasional hiccup where ONMS
would stop responding altogether for a little while and then come
roaring back.  This was always preceded in the logs with a message about
the mcelog running for too long.  This looks like a brand new bug in
CentOS 5.4 so I commented out the offending part of the cron.hourly
script which invokes it.

However, we are still seeing complete hangs as I briefly described
earlier.  These do not resolve nor do they allow us to stop opennms or
even gracefully shut down the device.  The only option is to destroy the
VM and restart it - in effect cutting the power.  This is a known,
rarely occurring bug in 2.6.28 and 2.6.29 which are the two kernels we
use (lucky us).  I'll be working on upgrading the kernel shortly as it
must have crashed six times during the night.  However, when it does, it
is not responsive at all so I believe it is a different issue.  Then
again, cleaning some of the mud out of the water may make it easier to
see what this problem is.  Thanks - John
--
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
[hidden email]

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


Les Mikesell

Re: More information on SNMP poll failure during ONMS restart

John A. Sullivan III wrote:

>
>
> We might have a kernel and a distribution problem although I think the
> kernel problem is unrelated. I noticed an occasional hiccup where ONMS
> would stop responding altogether for a little while and then come
> roaring back.  This was always preceded in the logs with a message about
> the mcelog running for too long.  This looks like a brand new bug in
> CentOS 5.4 so I commented out the offending part of the cron.hourly
> script which invokes it.
>
> However, we are still seeing complete hangs as I briefly described
> earlier.  These do not resolve nor do they allow us to stop opennms or
> even gracefully shut down the device.  The only option is to destroy the
> VM and restart it - in effect cutting the power.

VM?  Is this problem unique to running under xen?

--
   Les Mikesell
    [hidden email]



Brian Fertig-2

Re: More information on SNMP poll failure during ONMS restart

We use XenServer 5.5 w/ CentOS 5.4 and have no issues.  I didn't catch the
complete thread, but if you would, what problems have you experienced?  I
haven't done much in the way of customizations to our system apart from
fixing some timer issues that gave me false positives.  I just upped the
timer and everything is working fine now.


Brian



On 10/30/09 10:47 AM, "Les Mikesell" <[hidden email]> wrote:

> John A. Sullivan III wrote:
>>
>>
>> We might have a kernel and a distribution problem although I think the
>> kernel problem is unrelated. I noticed an occasional hiccup where ONMS
>> would stop responding altogether for a little while and then come
>> roaring back.  This was always preceded in the logs with a message about
>> the mcelog running for too long.  This looks like a brand new bug in
>> CentOS 5.4 so I commented out the offending part of the cron.hourly
>> script which invokes it.
>>
>> However, we are still seeing complete hangs as I briefly described
>> earlier.  These do not resolve nor do they allow us to stop opennms or
>> even gracefully shut down the device.  The only option is to destroy the
>> VM and restart it - in effect cutting the power.
>
> VM?  Is this problem unique to running under xen?


John A. Sullivan III

Re: More information on SNMP poll failure during ONMS restart

In reply to this post by Les Mikesell
On Fri, 2009-10-30 at 09:47 -0500, Les Mikesell wrote:

> John A. Sullivan III wrote:
> >
> >
> > We might have a kernel and a distribution problem although I think the
> > kernel problem is unrelated. I noticed an occasional hiccup where ONMS
> > would stop responding altogether for a little while and then come
> > roaring back.  This was always preceded in the logs with a message about
> > the mcelog running for too long.  This looks like a brand new bug in
> > CentOS 5.4 so I commented out the offending part of the cron.hourly
> > script which invokes it.
> >
> > However, we are still seeing complete hangs as I briefly described
> > earlier.  These do not resolve nor do they allow us to stop opennms or
> > even gracefully shut down the device.  The only option is to destroy the
> > VM and restart it - in effect cutting the power.
>
> VM?  Is this problem unique to running under xen?
>
I don't know (we're actually running KVM).  Are the rest of you who have
reported this problem using OpenNMS in a VM? Thanks - John
--
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
[hidden email]

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


John A. Sullivan III

Re: More information on SNMP poll failure during ONMS restart

In reply to this post by Brian Fertig-2
In summary, all works fine until ONMS is restarted. Then, random devices
fail their SNMP poll and never get it back until ONMS is restarted (at
which time some other random target permanently fails its SNMP poll).
Thanks - John

On Fri, 2009-10-30 at 10:55 -0400, Brian Fertig wrote:

> We use XenServer 5.5 w/ CentOS 5.4 and have no issues.  I didn't catch the
> complete thread but if you would what problems have you experienced?  I
> havent done much in the way of customizations to our system except some
> timer issues I have had with false positives.  I just upped the timer and
> everything is working fine now.
>
>
> Brian
>
>
>
> On 10/30/09 10:47 AM, "Les Mikesell" <[hidden email]> wrote:
>
> > John A. Sullivan III wrote:
> >>
> >>
> >> We might have a kernel and a distribution problem although I think the
> >> kernel problem is unrelated. I noticed an occasional hiccup where ONMS
> >> would stop responding altogether for a little while and then come
> >> roaring back.  This was always preceded in the logs with a message about
> >> the mcelog running for too long.  This looks like a brand new bug in
> >> CentOS 5.4 so I commented out the offending part of the cron.hourly
> >> script which invokes it.
> >>
> >> However, we are still seeing complete hangs as I briefly described
> >> earlier.  These do not resolve nor do they allow us to stop opennms or
> >> even gracefully shut down the device.  The only option is to destroy the
> >> VM and restart it - in effect cutting the power.
> >
> > VM?  Is this problem unique to running under xen?
>
>
--
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
[hidden email]

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


John A. Sullivan III

Re: More information on SNMP poll failure during ONMS restart

In reply to this post by John A. Sullivan III
On Fri, 2009-10-30 at 10:15 -0400, John A. Sullivan III wrote:

<snip>
We did upgrade the kernel on our ONMS server from 2.6.29.1 to 2.6.31.5
and this has eliminated the runaway processes that were crashing the
server, so a warning to anyone running 2.6.29 or 2.6.28: ONMS is likely
to constantly hang and require a cold boot to fix.

However, despite fixing that problem, it did not fix this SNMP poll
failure problem.  I just restarted seven times after the last change to
get a clean start.

I have noticed some false events for some services on the other side of
OpenVPN tunnels.  We can correlate these to packets that were supposed
to be intercepted by the tun driver but were instead sent to the default
gateway and consequently dropped.

However, this does not appear to be the issue with the SNMP problem for
two reasons.

1) Those services always come back on the next poll in 30 seconds -
we're working on tuning our OpenVPN setup
2) This problem extends to local devices.  In fact, over the weekend,
we lost the ability to poll the snmp agent running on the ONMS server
itself on 127.0.0.1.  It took many restarts to get it back.  I would
think that tells me this is not a network issue.
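
Because the loopback agent takes the network out of the picture, it may be worth polling it by hand with the net-snmp command-line tools while ONMS reports it as failed. A rough sketch follows; the function name is hypothetical, and the SNMP version and the `public` community string are assumptions to be replaced with whatever the agent is actually configured for.

```shell
# Hypothetical helper: fetch sysUpTime.0 (.1.3.6.1.2.1.1.3.0) from an agent
# with a 2-second timeout and a single retry.
# Usage: poll_agent HOST [COMMUNITY]
poll_agent() {
    snmpget -v 2c -c "${2:-public}" -t 2 -r 1 "$1" .1.3.6.1.2.1.1.3.0
}

# Example:
#   poll_agent 127.0.0.1
```

If this answers while the ONMS poll of the same address keeps failing, that would suggest the agent is responding and the fault lies on the poller side.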

So I'm certainly open to the next round of suggestions and willing to do
the troubleshooting.  I'm just not sure where to look next.  Thanks -
John
--
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
[hidden email]

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


jcat

Re: More information on SNMP poll failure during ONMS restart

Hi John,


I know this is happening across the board, but it might be useful on the nodes that run net-snmp to run the daemon in debug mode, just to see if you can get any further info on the problem from the client's point of view.



Cheers,
Just

On Mon, 2009-11-02 at 14:02 -0500, John A. Sullivan III wrote:
<snip>
John A. Sullivan III

Re: More information on SNMP poll failure during ONMS restart

That's exactly the kind of patching of my ignorance that I need! Do I
simply add -D to /etc/sysconfig/snmpd.options or do I need something
more specific? Thanks, Just - John

On Mon, 2009-11-02 at 19:48 +0000, jcat wrote:

> Hi John,
>
>
> I know this is happening across the board, but it might be useful on
> the nodes that run net-snmp to run the daemon in debug mode, just to
> see if you can get any further info on the problem from the clients
> point of view.
>
>
>
>
> Cheers,
> Just
>
> On Mon, 2009-11-02 at 14:02 -0500, John A. Sullivan III wrote:
> > <snip>
--
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
[hidden email]

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


jcat

Re: More information on SNMP poll failure during ONMS restart

Yes, and also specify a log file (if there isn't one already). So..
"-D -Lf /var/log/somelogfile"

That will give you incredibly verbose output. I believe you can pass particular "tokens" to -D to make the debug output more specific (to debug one particular feature, for example).  "man snmpd" will tell you more than I can :)



Cheers,
Just

On Mon, 2009-11-02 at 15:58 -0500, John A. Sullivan III wrote:
That's exactly the kind of patching of my ignorance that I need! Do I
simply add -D to /etc/sysconfig/snmpd.options or do I need something
more specific? Thanks, Just - John

On Mon, 2009-11-02 at 19:48 +0000, jcat wrote:
> Hi John,
> 
> 
> I know this is happening across the board, but it might be useful on
> the nodes that run net-snmp to run the daemon in debug mode, just to
> see if you can get any further info on the problem from the client's
> point of view.
> 
> 
> 
> 
> Cheers,
> Just
> 
> On Mon, 2009-11-02 at 14:02 -0500, John A. Sullivan III wrote: 
> > On Fri, 2009-10-30 at 10:15 -0400, John A. Sullivan III wrote:
> > > On Fri, 2009-10-30 at 09:26 +0000, James Masson wrote:
> > > > 
> > > > 
> > > > John A. Sullivan III wrote:
> > > > > Yes, although, honestly, the trace was only on a single interface.
> > > > > However, the problem does not seem to be the ONMS server sending the
> > > > > packet.  The problem is the target does not reply.  The question is why?
> > > > > 
> > > > > 1) Is the reply being lost? We do not see it being placed on the wire at
> > > > > all but the definitive determination will be when we either do a trace
> > > > > on all interfaces of a multi-homed device or trace the same problem
> > > > > happening on a single homed device especially one without bonding like
> > > > > the switch (of course, the switch has plenty of opportunity to lose a
> > > > > packet, too, especially with some stations doing ALB bonding)
> > > > > 
> > > > 
> > > > I've seen a few similar problems with OpenNMS and multihomed machines. 99% of the machines here are
> > > > multihomed, often with 5 different interfaces, some routable, some non-routable, and some which are
> > > > partially routable if the firewalls aren't doing source IP filtering.
> > > > 
> > > > Because SNMP uses UDP, and is connectionless, it's possible for an SNMP packet from OpenNMS to get to
> > > > an interface which hasn't got a direct route back to OpenNMS. The reply to OpenNMS goes out of the
> > > > machine's default gateway on a different interface, with a _source_ IP of the original interface,
> > > > and makes it back to OpenNMS. A snmpwalk of this partially routable interface will yield the first
> > > > line of the SNMP reply, and will then hang.
> > > > 
> > > > This seems to:
> > > > 
> > > >  1) make OpenNMS think SNMP is available on this interface
> > > >  2) Depending on the interfaces available, OpenNMS might select this dodgy interface as the Primary
> > > > interface for the node.
> > > > 
> > > > 
> > > > My solution was to enable source IP filtering on the firewalls between the networks. This ensures
> > > > only packets with a source IP on that network will be accepted at the local firewall.
> > > <snip>
> > > Thanks, James.  That makes a great deal of sense and is why we set SNMP
> > > as unmanaged on all such interfaces. The failure is always on directly
> > > connected interfaces.  Although I'm sure this is a valid issue, I'm not
> > > sure it's the problem going on here since we see the problem manifest
> > > itself even on the local switch.
> > > 
> > > We might have a kernel and a distribution problem although I think the
> > > kernel problem is unrelated. I noticed an occasional hiccup where ONMS
> > > would stop responding altogether for a little while and then come
> > > roaring back.  This was always preceded in the logs with a message about
> > > the mcelog running for too long.  This looks like a brand new bug in
> > > CentOS 5.4 so I commented out the offending part of the cron.hourly
> > > script which invokes it.
> > > 
> > > However, we are still seeing complete hangs as I briefly described
> > > earlier.  These do not resolve nor do they allow us to stop opennms or
> > > even gracefully shut down the device.  The only option is to destroy the
> > > VM and restart it - in effect cutting the power.  This is a known,
> > > rarely occurring bug in 2.6.28 and 2.6.29 which are the two kernels we
> > > use (lucky us).  I'll be working on upgrading the kernel shortly as it
> > > must have crashed six times during the night.  However, when it does, it
> > > is not responsive at all so I believe it is a different issue.  Then
> > > again, cleaning some of the mud out of the water may make it easier to
> > > see what this problem is.  Thanks - John
> > <snip>
> > We did upgrade the kernel on our ONMS server from 2.6.29.1 to 2.6.31.5
> > and this has eliminated the runaway processes that were crashing the
> > server so a warning to anyone running 2.6.29 or 2.6.28 - ONMS is likely
> > to constantly hang and require a cold boot to fix.
> > 
> > However, despite fixing that problem, it did not fix this SNMP poll
> > failure problem.  I just restarted seven times after the last change to
> > get a clean start.
> > 
> > I have noticed some false events for some services on the other side of
> > OpenVPN tunnels.  We can correlate these to packets that were supposed
> > to be intercepted by the tun driver and were not but sent to the default
> > gateway instead and consequently dropped.
> > 
> > However, this does not appear to be the issue with the SNMP problem for
> > two reasons.
> > 
> > 1) Those services always come back on the next poll in 30 seconds -
> > we're working on tuning our OpenVPN setup
> > 2) This problem extends to local devices.  In fact, over the weekend,
> > we lost the ability to poll the snmp agent running on the ONMS server
> > itself on 127.0.0.1.  It took many restarts to get it back.  I would
> > think that tells me this is not a network issue.
> > 
> > So I'm certainly open to the next round of suggestions and willing to do
> > the troubleshooting.  I'm just not sure where to look next.  Thanks -
> > John



John A. Sullivan III

Re: More information on SNMP poll failure during ONMS restart

LOL - I don't think this will work.  I turned on debugging with just -D
and generated a 100MB text file in a minute! Talk about a needle in a
haystack! Any suggestions about what TOKEN to observe? Thanks - John
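
One rough way to shrink the haystack before choosing a token is to count which debug tokens dominate the log and then keep only the lines for the ones that look relevant. The sketch below is only an illustration: it assumes each debug line begins with "<token>: " (adjust the pattern if your log looks different), and the helper names and the sample token prefix in the usage note are made up for the example, not taken from net-snmp.

```python
import re
import sys
from collections import Counter
from typing import Iterable, Iterator, Optional

# Assumption: a debug line looks like "<token>: message". The token is taken
# as the shortest run of non-space characters followed by ": ".
TOKEN_RE = re.compile(r"^(\S+?): ")


def token_of(line: str) -> Optional[str]:
    """Return the leading debug token of a line, or None if there is none."""
    match = TOKEN_RE.match(line)
    return match.group(1) if match else None


def count_tokens(lines: Iterable[str]) -> Counter:
    """Count how many lines each token produced."""
    counts: Counter = Counter()
    for line in lines:
        tok = token_of(line)
        if tok is not None:
            counts[tok] += 1
    return counts


def keep_tokens(lines: Iterable[str], prefixes: Iterable[str]) -> Iterator[str]:
    """Yield only the lines whose token starts with one of the given prefixes."""
    wanted = tuple(prefixes)
    for line in lines:
        tok = token_of(line)
        if tok is not None and tok.startswith(wanted):
            yield line


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: script.py LOGFILE            -> print the 20 noisiest tokens
    #        script.py LOGFILE PREFIX...  -> print only the matching lines
    path, prefixes = sys.argv[1], sys.argv[2:]
    with open(path, errors="replace") as log:  # streamed line by line
        if prefixes:
            sys.stdout.writelines(keep_tokens(log, prefixes))
        else:
            for tok, n in count_tokens(log).most_common(20):
                print(f"{n:10d}  {tok}")
```

Run it once with just the log file (for example /var/log/somelogfile) to see which tokens are noisiest, then either rerun it with the prefixes you care about or restart snmpd with -D limited to those tokens so the log stays small in the first place.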

On Mon, 2009-11-02 at 21:18 +0000, jcat wrote:

> Yes, and also specify a log file (if there isn't one already). So..
> "-D -Lf /var/log/somelogfile"
>
> That will give you incredibly verbose output. I believe you can pass
> particular "tokens" to -D to make the debug output more specific (to
> debug one particular feature, for example).  "man snmpd" will tell you
> more than I can :)
>
>
>
>
> Cheers,
> Just
>
--
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
[hidden email]

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society

