Ticket #210 (closed defect: invalid)

Opened 2 years ago

Last modified 5 weeks ago

Controller starts heart beat timer for own AvND causing whole cluster to restart

Reported by: bertil.engelholm@…
Owned by: nagendra.kumar@…
Priority: major
Milestone:
Component: AvSv
Version: 1.1.0
Keywords:
Cc:
patch waiting for maintainer: no

Description

Sometimes, when an active controller is rebooted, the new active controller also reboots, causing the whole cluster to restart.

The new active controller reboots because it is mistakenly marked as a payload node in the cb->avnd_anchor data structure. As a result, the HB timer towards its own AvND is started when the controller becomes active. When that timer expires, the controller reboots, which causes all payload nodes to reboot because they have lost contact with both controllers.
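To make the failure mode above concrete, here is a minimal sketch in C. It is not taken from the AvSv sources: the type `avnd_rec_t`, the enum values, and both functions are hypothetical stand-ins. It models a controller that starts HB timers purely by node type, and a possible defensive variant that skips its own node ID no matter how the record is tagged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the real AvSv definitions. */
typedef enum { NODE_TYPE_CONTROLLER, NODE_TYPE_PAYLOAD } node_type_t;

typedef struct {
    uint32_t    node_id;
    node_type_t type;             /* mirrors the role of avnd->type */
    bool        hb_timer_running;
} avnd_rec_t;

/* Simplified model of the reported behaviour: on becoming active, start an
 * HB timer for every record tagged as payload. A controller whose own record
 * is wrongly tagged as payload ends up supervising itself. */
static size_t start_hb_timers_by_type(avnd_rec_t *nodes, size_t n)
{
    size_t started = 0;
    assert(nodes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (nodes[i].type == NODE_TYPE_PAYLOAD) {
            nodes[i].hb_timer_running = true;
            started++;
        }
    }
    return started;
}

/* Assumed defensive variant: never start a timer towards the local node,
 * even if its record carries the wrong type. */
static size_t start_hb_timers_guarded(avnd_rec_t *nodes, size_t n,
                                      uint32_t own_node_id)
{
    size_t started = 0;
    assert(nodes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (nodes[i].node_id == own_node_id)
            continue; /* skip the local AvND regardless of its type */
        if (nodes[i].type == NODE_TYPE_PAYLOAD) {
            nodes[i].hb_timer_running = true;
            started++;
        }
    }
    return started;
}
```

The guard only masks the symptom. The wrong type value would still be present in the record and could affect other decisions that depend on it.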

After debugging this for some time, I have seen that the avnd->type variable is already wrong the first time the standby controller is updated with AvND data from the active controller via ckpt. So when this standby controller later becomes active, it starts the HB timer. However, I have not seen the active controller (which sends the data to the standby) start its own timer. My conclusion is that avnd->type is mistakenly changed to payload at some point between being set to controller and the AvND data being sent to the standby.

This problem sometimes occurs about every 10th reboot, but when debug printouts are added it sometimes disappears completely. So it feels like it could be timing related, or possibly caused by some interference between different priority levels.
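Since debug printouts appear to disturb the timing, one lower-overhead way to find where the type field changes is to route every write through a small setter that records the call site in memory instead of printing. The sketch below is only an illustration of that idea. `traced_type_t`, `traced_type_set` and `TRACED_TYPE_SET` are hypothetical names, not part of OpenSAF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real AvSv definitions. */
typedef enum { NODE_TYPE_CONTROLLER, NODE_TYPE_PAYLOAD } node_type_t;

typedef struct {
    node_type_t type;        /* the value being tracked */
    const char *last_file;   /* source file of the most recent write */
    int         last_line;   /* source line of the most recent write */
    unsigned    write_count; /* number of writes made through the setter */
} traced_type_t;

/* Store the new value together with the call site. No I/O is done here,
 * so the cost is a few memory writes. */
static void traced_type_set(traced_type_t *t, node_type_t v,
                            const char *file, int line)
{
    assert(t != NULL);
    t->type = v;
    t->last_file = file;
    t->last_line = line;
    t->write_count++;
}

/* Convenience macro that captures the caller's file and line. */
#define TRACED_TYPE_SET(t, v) traced_type_set((t), (v), __FILE__, __LINE__)
```

The recorded file and line can be inspected from a core dump or a debugger once the wrong value has been seen. This only catches writes that go through the setter. A stray memory overwrite would leave the recorded call site unchanged, which is itself a useful hint.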

Change History

Changed 2 years ago by hafe

  • milestone set to PL 2.0.1

Changed 23 months ago by scon

  • milestone changed from PL 2.0.1 to PL 2.0.2

Changed 16 months ago by nagendra

Bertil, can I close this bug?

Changed 3 months ago by marioa

  • milestone 2.0.2 deleted

Milestone 2.0.2 deleted

Changed 5 weeks ago by murthy

  • status changed from new to closed
  • resolution set to invalid

With OpenSAF 4.0 there are no heartbeats between AvD and AvND. If the specific behaviour is still observed, please reopen the issue.

Invalid for OpenSAF4.0
