Ticket #210 (closed defect: invalid)
Controller starts heart beat timer for own AvND causing whole cluster to restart
| Reported by: | bertil.engelholm@… | Owned by: | nagendra.kumar@… |
|---|---|---|---|
| Priority: | major | Milestone: | |
| Component: | AvSv | Version: | 1.1.0 |
| Keywords: | | Cc: | |
| patch waiting for maintainer: | no | | |
Description
Sometimes, when the active controller is rebooted, the new active controller also reboots, causing the whole cluster to restart.
The new active controller reboots because it is mistakenly marked as a payload node in the cb->avnd_anchor data structure. As a result, the HB timer towards its own AvND is started when the controller becomes active. When that timer expires, the controller reboots, which causes all payload nodes to reboot as well because they have lost contact with both controllers.
After debugging this for some time, I have seen that the avnd->type variable is already wrong the first time the standby controller is updated with AvND data from the active controller via ckpt. So when this standby controller later becomes active, it starts the HB timer. However, I have not seen the active controller (the one that sends the data to the standby) start its timer. My conclusion is that the avnd->type variable is mistakenly changed to payload at some point between when it is set to controller and when the AvND data is sent to the standby.
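To make the suspected sequence easier to follow, here is a small toy model of it. This is not AvSv code: the names AvndRecord, checkpoint_to_standby and hb_timer_targets are hypothetical, and the rule "start an HB timer for every record typed as payload" is an assumption taken from the description above. It only illustrates how a type that flips after the active controller has made its timer decision, but before the ckpt update, would leave the old active unaffected and make the new active monitor itself.

```python
from dataclasses import dataclass, replace
from enum import Enum


class NodeType(Enum):
    CONTROLLER = "controller"
    PAYLOAD = "payload"


@dataclass(frozen=True)
class AvndRecord:
    # Hypothetical stand-in for one entry in cb->avnd_anchor.
    node_id: int
    node_type: NodeType


def checkpoint_to_standby(active_db):
    """Copy the active controller's records to the standby unchanged."""
    return {node_id: replace(rec) for node_id, rec in active_db.items()}


def hb_timer_targets(db):
    """Node ids that get an HB timer (assumed rule: every payload record)."""
    return sorted(
        node_id for node_id, rec in db.items()
        if rec.node_type is NodeType.PAYLOAD
    )


def monitors_itself(db, own_node_id):
    """True if a controller would start an HB timer towards its own AvND."""
    return own_node_id in hb_timer_targets(db)


def demo():
    # Node 1 is the active controller, node 2 the standby, node 3 a payload.
    active_db = {
        1: AvndRecord(1, NodeType.CONTROLLER),
        2: AvndRecord(2, NodeType.CONTROLLER),
        3: AvndRecord(3, NodeType.PAYLOAD),
    }
    # The active controller decides on its timers while the types are correct.
    timers_on_old_active = hb_timer_targets(active_db)

    # Suspected fault: the standby's record flips to payload afterwards.
    active_db[2] = replace(active_db[2], node_type=NodeType.PAYLOAD)

    # The standby receives the already wrong record through the checkpoint.
    standby_db = checkpoint_to_standby(active_db)

    # Node 1 is rebooted and node 2 takes over with the data it was given.
    timers_on_new_active = hb_timer_targets(standby_db)
    return timers_on_old_active, timers_on_new_active, monitors_itself(standby_db, 2)
```

In this model the old active only ever monitors node 3, while the new active ends up with a timer towards node 2, which is itself. A check like monitors_itself, run on the standby right after each ckpt update, is one possible way to catch the bad record at the moment it arrives.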
The problem occurs roughly once every 10 reboots, but when debug printouts are added it sometimes disappears completely. So it appears to be timing related, or possibly caused by some interference between different priority levels.
