Looking for HF2 error data - Important

cobraguy 17y ago

Hello all,
I am trying to help track down and fix the cause of HF2 errors in NION. This is an error condition in which the CM-1 CobraNet module stops responding to the NION host processor and appears dead.
It may be accompanied by a blinking light sequence of 7,6,2 on the CobraNet Ethernet port.
I am looking for any and all info from the community of the following nature:
1) Do you know of a way to consistently reproduce this error?
2) Under what conditions have you seen this error?
3) If you have seen this error, what do you know that can be done to minimize or eliminate its occurrence?
Thank you in advance for your help

zhangye 17y ago

We have this problem. We have 12 NIONs, with every 3 NIONs in a VLAN, and the NIONs in each group are linked by XDAB. We have four switches, all with STP enabled. If the STP topology changes, some of the NIONs show this problem. The problem does not hit the same NION every time. We don't know how to eliminate it. A CAB 4n under the same conditions is OK.

zhangye 17y ago

By the way, if only the NION's LAN port is in use and the CobraNet port is not, the problem does not occur. If a NION loses XDAB, the problem also occurs.

cobraguy 17y ago

Thank you zhangye. Can you tell me if you are using Spanning Tree or are you actually using Rapid Spanning Tree?

jvalenzuela 17y ago

I am also currently experiencing this problem with one of our systems. The system previously worked without issue. This problem started after an upgrade where we added a CAB and two switches.
In response to your questions:
1. I currently have no idea as to the cause, much less how to reproduce it.
2. The fault can occur two times a day, or not for several days in a row. I have made no progress in determining the conditions which cause it.
3. Same as above. I have made several attempts to change the behavior of the errors, if not to fix them outright then at least to make them move to the other n3 in the system.
The only consistent symptom I have found is that the fault always occurs with the same n3. I've moved bundles around and even replaced the unit altogether to no avail. I currently have swapped that n3's Cobranet connection with another unit to see if the problem moves. If I get feedback showing no change in the problem, my next step is to temporarily install a laptop with a network monitoring program to try and capture all non-Cobranet traffic on the port connected to the faulting n3's Cobranet port.
If you have more specific questions regarding the network, system, etc. I can provide more detail as you require.

zhangye 17y ago

We are using MSTP (Multiple Spanning Tree) in the LAN. When we first saw this problem, we thought MSTP was causing it. After the switches completed the change to STP, the problem still occurred. So we think MSTP is not the cause.

cobraguy 17y ago

Zhangye,
MSTP is a newer variant of STP that allows separation of spanning tree domains and allows greater efficiency. However, the core protocol used underneath this scheme is RSTP. RSTP has been proven to be a cause of the HF2 error. If you can, try disabling MSTP and/or RSTP and use standard spanning tree and see if this eliminates HF2 errors. This is (hopefully) a temporary fix. We are working with Cirrus right now to try to get this problem fixed.

jvalenzuela 17y ago

cobraguy wrote:
RSTP has been proven to be a cause of the HF2 error.
Really.....? Is this directly due to the increased non-Cobranet traffic received by the CM-1 or indirectly by the reconfiguration of the LAN during reconvergence? Either way, an interesting bit of knowledge, however I have disabled STP in my system so it can't be causing the error in my system.

zhangye 17y ago

We have tried disabling MSTP, and the problem still occurs. Our test result is that the problem appears when the STP topology changes. We use H3C switches (H3C is the Huawei and 3Com joint venture). We did not enable RSTP. Our test setup has three core switches and one edge switch. The core switches are each connected with two fibre optic cables. The 12 NIONs connect to the edge switch. The edge switch has two fibre optic cables, connected to two of the core switches.

cobraguy 17y ago

Reply to #8 and #9
MSTP behaves much like RSTP and both act much faster than standard STP. There is very fast convergence which generates more traffic. And from what I have been told by a network guru, the philosophy of STP vs. RSTP is different. In STP, a new connection will not be allowed to become active until STP knows it will not create a loop. In RSTP and MSTP a new connection will be allowed right away and is then taken out if a loop is detected. I have not verified this behavior myself yet and am just relating what I have been told.
In any case, if the problems you are seeing are not STP related, then you must look elsewhere. Can you get any statistics from the switch or use a snooper like Wireshark to identify sources of high burst traffic?

jvalenzuela 17y ago

Interesting. I guess M/RSTP actually alters the sequence of states a port goes through on startup, something I'll have to learn more about. It kind of leaves the door open to a loop being created when a port is activated, but it would be short-lived.
As far as stats for the port connected to the failing device, I don't believe I have anything useful yet. I have of course sniffed the port, but because of how infrequently the faults occur (sometimes several days between errors) and my current inability to associate the faults with any other events, I haven't seen anything unusual with the network sniffer. I only see the Cobranet traffic I would expect, and some very infrequent CDP frames. If my current tests don't reveal anything useful, my next plan is to grab a company laptop with Wireshark and leave it at the site on a port that mirrors the failing CM-1's port. If I set up Wireshark's filter to exclude Cobranet traffic, protocol 0x8819, it should be capable of running for several days. I would hope to see something interesting with the same time stamp as a failure.
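The filtering idea described above can be sketched in a few lines of Python. This is only an illustration, not what Wireshark does internally: it assumes the capture is available as raw Ethernet frames (bytes), and the function names `ethertype_of` and `non_cobranet` are made up for this example. It skips a single 802.1Q VLAN tag if one is present and drops every frame whose EtherType is 0x8819.

```python
import struct

COBRANET_ETHERTYPE = 0x8819  # CobraNet protocol identifier, as given in the post
VLAN_TPID = 0x8100           # 802.1Q tag; the real EtherType follows the 4-byte tag


def ethertype_of(frame: bytes) -> int:
    """Return the EtherType/length field of a raw Ethernet frame.

    A single 802.1Q VLAN tag is skipped if present (hypothetical helper).
    """
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    (value,) = struct.unpack_from("!H", frame, 12)
    if value == VLAN_TPID and len(frame) >= 18:
        (value,) = struct.unpack_from("!H", frame, 16)
    return value


def non_cobranet(frames):
    """Yield only the frames that are not CobraNet traffic."""
    for frame in frames:
        if ethertype_of(frame) != COBRANET_ETHERTYPE:
            yield frame
```

In Wireshark itself the same effect should be achievable with a capture filter along the lines of `not ether proto 0x8819`, which keeps the capture file small enough to run for days.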

zhangye 17y ago

cobraguy,
Have you tested the NION with STP? What was the result?
We ran a test with STP. We have 12 NIONs, with every three NIONs in a VLAN, and each group uses XDAB. We used 3 switches with only STP enabled, and we still have this problem. Our switch engineer thinks BPDU packets may be causing it. BPDUs are what the switches use to negotiate STP between themselves.
We also used sniffer software to capture packets. We captured a lot of CobraNet packets, fewer BPDU packets, and some UDP packets on port 1234. I think the UDP packets are sent by pandad.

cobraguy 17y ago

Zhangye,
I have not done any new testing with STP. I have been working with Cirrus to find the root cause of the problem, which is a stack overflow, and correct it. I think we have a fix. Please see the announcements section.
BTW, STP, MSTP or RSTP are not the root cause but can contribute to this error. Usually it is not STP; MSTP and RSTP can contribute to the problem when the network topology changes.

zhangye 17y ago

cobraguy,
What you said matches our test results. Tonight we will test whether blocking BPDU packets fixes it.

Fergy 17y ago

CobraGuy has written a NioNote about this issue, which can be found here:
http://downloads.peavey.com/mm/index.cf … umentation
It will be placed on the public site and incorporated into the help files when it is ready. But we thought it might be helpful to the conversation to make the draft version available to forum members.

cobraguy 17y ago

A recent post in another thread has forced me to go back and dig deeper into how STP, RSTP and MSTP work. I've got some new things to try. I'll post more on this as soon as I can.
One thing that I was just told early today from a person trying the Beta firmware is that it worked great in a system with three switches but started to fail again when he added two more.
This points to a possible issue with the RSTP BPDU frames themselves, as only their quantity would be a meaningful change in that scenario. So he configured his switches to block BPDU frames (EtherType 0x0000) on all the edge ports and the system stopped failing with HF2 errors. There is more to investigate here. Either the presence or frequency of BPDU frames at the CobraNet port is looking like an issue. I have contacted Cirrus about this and they are looking at it. More to come as we find out more. I really appreciate all the great feedback and participation on this topic from everyone.
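For anyone who wants to measure how many BPDUs actually reach a CobraNet port, here is a rough Python sketch for counting them in a capture. It is an assumption-laden illustration, not the switch configuration described above: it does not use the 0x0000 value quoted in the post, but instead recognizes untagged spanning tree BPDUs by the IEEE bridge group address 01:80:C2:00:00:00 and the LLC header 0x42 0x42 0x03, then reads the protocol version byte (0 for STP, 2 for RSTP, 3 for MSTP). The names `bpdu_version` and `count_bpdus` are hypothetical.

```python
import struct
from collections import Counter

STP_BRIDGE_GROUP_MAC = bytes.fromhex("0180c2000000")
LLC_STP = b"\x42\x42\x03"  # DSAP, SSAP, control for spanning tree BPDUs
VERSION_NAMES = {0: "STP", 2: "RSTP", 3: "MSTP"}


def bpdu_version(frame: bytes):
    """Return 'STP', 'RSTP', 'MSTP' or 'unknown' for an untagged BPDU, else None."""
    if len(frame) < 20:
        return None
    if frame[0:6] != STP_BRIDGE_GROUP_MAC:
        return None
    (length,) = struct.unpack_from("!H", frame, 12)
    if length > 1500:  # this is an EtherType, not an 802.3 length, so no LLC header
        return None
    if frame[14:17] != LLC_STP:
        return None
    protocol_id, version = struct.unpack_from("!HB", frame, 17)
    if protocol_id != 0:  # spanning tree BPDUs carry protocol identifier 0x0000
        return None
    return VERSION_NAMES.get(version, "unknown")


def count_bpdus(frames):
    """Count BPDUs per spanning tree version in an iterable of raw frames."""
    counts = Counter()
    for frame in frames:
        version = bpdu_version(frame)
        if version is not None:
            counts[version] += 1
    return counts
```

Comparing these counts between the three-switch and five-switch setups would show whether the BPDU quantity at the edge port really changes. VLAN-tagged BPDUs from vendor-specific per-VLAN spanning tree variants are not handled here.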

phils 17y ago

Steve, just read the NIONote HF2: fantastic for an old audio guy still trying to catch up on networking finesse!!
More documentation like this would certainly help us avoid unnecessary grief.
Something in a similar vein that pulled the network "specification" out of the depths of the Programmers Reference Guide, and could be handed to network administrators would be great!
PS It's my role in life not to have to learn all about every related field: that's what other specialists are for!

cobraguy 17y ago

I've been playing around with trying to cause an HF2 error some more by using a little utility I wrote to blast layer 2 frames onto the net through a gigabit port, including BPDU frames. So far, using the beta firmware, I have not been able to induce an HF2 error.
So we know that a broadcast storm can cause a problem. We know that using RSTP or MSTP vs STP seems to allow for the problem to occur.
And we know that the new CM-1 beta firmware offers an improvement but does not ensure a fix in all cases according to Zhangye.
What we need is a reproducible method of causing the problem to appear.
Can anyone help with this? Does anyone out there have a solidly reproducible way (using a minimum of equipment and the beta firmware) to cause the HF2 error to occur? Zhangye?
Please let me know. I need to be able to consistently reproduce the error and then snoop the net and find out what is going on.
Thanks

cobraguy 17y ago

Has anyone ever seen the HF-2 error occur in a system that does not contain at least one CAB-4n populated with a CM-2 module?

jvalenzuela 17y ago

I'm currently in the field for the next month or so, but when I get back I can check the system I'm having problems with. It has a bunch of 4ns, but I'm not sure what's in them.

jvalenzuela 17y ago

cobraguy wrote:
I've been playing around with trying to cause an HF2 error some more by using a little utility I wrote to blast layer 2 frames onto the net through a gigabit port, including BPDU frames. So far, using the beta firmware, I have not been able to induce an HF2 error.
If you have your packet generator connected to a switch via a gigabit port and a CM-1 connected to some other port, are you sure that BPDUs generated by your host are forwarded on to the CM-1? BPDUs are not forwarded through a switch in the same manner as other frames. BPDUs are normally processed internally and the switch may then generate its own BPDUs, which may in turn be propagated to other ports depending on configuration. Different types of BPDUs also travel in specific directions with respect to the root switch. If the BPDUs you generate are injected into a port not expecting such a message, the switch may ignore them completely. For example, by definition your generator's port is not a root port and should not receive BPDUs that would normally originate from the root switch.

cobraguy 17y ago

My switch does not have STP so it wouldn't know what to do with a BPDU anyway.
But to answer your question specifically, I turned on port mirroring and observed the BPDU's being forwarded to the target CM-1.

jvalenzuela 17y ago

I've returned from an out of town project and had a chance to load the firmware into the project in which I had first experienced the HF-2 errors. It's been a week running the new firmware and so far no problems.

cobraguy 17y ago

Jason,
That's good news. Please let us know after a while if the beta firmware continues to mitigate this problem over time.

jvalenzuela 17y ago

Looks like I spoke too soon. Two failures within the last four days. Logs show the same error on the exact same unit.

jvalenzuela 17y ago

I found a spare laptop and loaded up Wireshark. I'm planning on leaving it connected at the site to capture all frames other than EtherType 0x8819 from a port that mirrors the port of the device that keeps locking up. Hopefully I can go back after a failure and see something interesting.
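Once a failure is logged, the capture has to be lined up against the failure time. A minimal Python sketch of that step follows, assuming the capture has been exported as (timestamp, summary) pairs and the failure times have been read from the unit's log by hand. The function name `frames_near_failures` and the 30 second window are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def frames_near_failures(captured, failures, window=timedelta(seconds=30)):
    """Map each failure time to the captured frames within +/- window of it.

    captured: iterable of (datetime, summary) pairs from the capture export
    failures: iterable of datetime values taken from the error log
    """
    captured = list(captured)  # allow more than one pass over the capture
    result = {}
    for failure in failures:
        result[failure] = [
            (ts, summary)
            for ts, summary in captured
            if abs(ts - failure) <= window
        ]
    return result
```

This only works if the laptop clock and the clock behind the log timestamps agree, so it is worth noting the offset between them when the laptop is installed.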

jvalenzuela 17y ago

I've had some more HF2 failures, but the network analyzer has not captured any non-Cobranet traffic, save for the periodic bootp requests, which are quite infrequent. By chance I did happen to notice a difference between the Nion that is failing and the one that is not: temperature. The one that always faults is running at 53 degrees while the other n3 is running at 41, as reported by the front panel temperature display. These two units are mounted next to each other in adjacent racks that are bolted together. The racks are open between each other, so there is no difference in cooling or internal rack air temperature between the two racks. Is this range of temperature normal?

cobraguy 17y ago

I'm not sure what the case temp should be. But 53 sounds high. I will see if we have any data on that. In the meantime is there anything you can do to get the temp down? Are all the fans running properly? Try to get the temp down and see if that has an effect on the HF2 err.

jvalenzuela 17y ago

I checked both Nions, the one running at 53 and the other at 40 degrees. Both appear to be drawing air in the front and both of their power supply fans are running. I don't know if there are any other fans. I loaded a dummy project with no processing at all and that didn't seem to make a difference.

cobraguy 17y ago

Jason. The 53 degrees reading clearly seems high; I think we all agree on that. Can you try to get the temp down? Possibly remove the NION from the rack, pop the top and blow air on it. I think it would be useful to try and isolate temperature as a factor in the HF2 errors you are seeing. If you can get it cooler and the errors stop then that tells us something. After that, determining why the box is running so hot and how to fix it become issues apart from the HF2 issue.

jvalenzuela 17y ago

I returned to the site today to see if I could look further into the temperature differences. I pulled the failing n3 out of the rack and placed it on a table. Before I pulled it and while it sat on the table, it would run at about 53 degrees while the other n3, which has never exhibited this problem, runs at about 38 degrees. I removed the top cover to check out the fans, the three of which I could see appeared to be working properly. As soon as I removed the cover, the temperature as reported via the front panel began to drop. I reinstalled the unit in the rack without its top cover and let it warm up for a few minutes. Its temperature leveled off at about 42 degrees. Given that it has four input cards as compared to the other unit with three output cards, that may account for the slightly elevated temperature. I've left it like that and we'll see what happens....