Bug 35354 - Increase uvmmd / libvirt connection robustness
Status: CLOSED WORKSFORME
Product: UCS
Classification: Unclassified
Component: Virtualization - UVMM
Version: UCS 4.2
Hardware: Other Linux
Importance: P5 normal
Assigned To: Philipp Hahn
QA Contact: Erik Damrose
Duplicates: 23529 33186
Reported: 2014-07-14 16:49 CEST by Erik Damrose
Modified: 2019-12-09 10:34 CET (History)
What kind of report is it?: Development Internal
Bug group (optional): Error handling, Usability

Description Erik Damrose univentionstaff 2014-07-14 16:49:30 CEST
In a larger Xen environment, uvmmd / libvirt often loses the connection to its nodes. The environment consists of 4 Xen nodes with about 10 VMs each and varying load. The connection losses have become more visible with the updated uvmmd interface from Bug #35122. The affected node and its domains are marked as unavailable.

Various error messages appear in the virtual-machine-manager-daemon.log:

Most errors are caused by timeouts and concern domainEventDeregister:
2014-07-14 16:30:25,076 - uvmmd.node - WARNING - 'xen://schulr1.ucs.local/' broken? next check in 0:00:30.000. Cannot write data: Input/output error
2014-07-14 16:30:25,077 - uvmmd.node - ERROR - xen://schulr1.ucs.local/: Exception in domainEventDeregister
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/univention/uvmm/node.py", line 608, in update_autoreconnect
    self.conn.domainEventDeregister(self.domainCB)
  File "/usr/lib/python2.6/dist-packages/libvirt.py", line 3343, in domainEventDeregister
    if ret == -1: raise libvirtError ('virConnectDomainEventDeregister() failed', conn=self)
libvirtError: Cannot write data: Input/output error


But I also saw the following error:
2014-07-14 15:59:19,435 - uvmmd.node - WARNING - 'xen://schulr2.ucs.local/' broken? next check in 0:00:30.000. Domain not found: xenUnifiedDomainLookupByID
2014-07-14 15:59:21,649 - uvmmd.unix - ERROR - [834] Exception: Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/univention/uvmm/unix.py", line 149, in handle_command
    res = cmd(self, command)
  File "/usr/lib/pymodules/python2.6/univention/uvmm/commands.py", line 208, in DOMAIN_STATE
    node.domain_state(request.uri, request.domain, request.state)
  File "/usr/lib/pymodules/python2.6/univention/uvmm/node.py", line 1450, in domain_state
    dom = conn.lookupByUUIDString(domain)
AttributeError: 'NoneType' object has no attribute 'lookupByUUIDString'
Comment 1 Erik Damrose univentionstaff 2014-07-14 17:06:45 CEST
The following traceback can also be seen:

2014-07-14 17:10:16,172 - uvmmd.node - WARNING - 'xen://schulr2.ucs.local/' broken? next check in 0:00:30.000. Domain not found: xenUnifiedDomainLookupByUUID
2014-07-14 17:10:16,183 - uvmmd.node - ERROR - xen://schulr2.ucs.local/: Exception handling callback
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/univention/uvmm/node.py", line 690, in domain_callback
    domStat.update( dom )
  File "/usr/lib/pymodules/python2.6/univention/uvmm/node.py", line 243, in update
    info = domain.info()
  File "/usr/lib/python2.6/dist-packages/libvirt.py", line 1730, in info
    if ret is None: raise libvirtError ('virDomainGetInfo() failed', dom=self)
libvirtError: An error occurred, but the cause is unknown
Comment 2 Philipp Hahn univentionstaff 2014-07-15 08:32:33 CEST
What I've now seen several times is our check scripts restarting libvirtd / UVMMd because our "virsh" check did not return in time while a long-running operation like snapshot-create was in progress (Bug #35101). This will definitely break any existing connection and leads to the host being shown as unreachable.

Especially with Xen there is Bug #20910, which has been unfixed for 4 years: when a VM is started, the dom0 seems to run into a resource problem, which breaks something in libvirtd. Our libvirt package contains patches/libvirt/3.1-0-0-ucs/0.9.12-5-ucs3.1-1/64_xen-hypervisor-reopen.patch to provide a band-aid for Bug #20024, but the underlying problem remains.

Without the restart scripts adding a timestamp to their output (Bug #35069), it is nearly impossible to correlate those events.
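
As a rough illustration of what such stamping could look like, here is a minimal Python sketch. The helper name stamp() is hypothetical and is not taken from the actual check scripts; it only assumes the scripts could prefix each output line with a time in the same layout as the uvmmd log lines quoted in the description, so that both logs can be lined up.

```python
import datetime


def stamp(message, now=None):
    """Prefix a message with a timestamp in the style of the uvmmd log lines.

    'stamp' is a hypothetical helper, not part of the real check scripts.
    """
    if now is None:
        now = datetime.datetime.now()
    # Same layout as '2014-07-14 16:30:25,076' in the log excerpts:
    # date, time, then milliseconds after a comma.
    millis = now.microsecond // 1000
    return '%s,%03d %s' % (now.strftime('%Y-%m-%d %H:%M:%S'), millis, message)


if __name__ == '__main__':
    print(stamp('virsh check timed out, restarting libvirtd'))
```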
Comment 3 Florian Best univentionstaff 2016-05-27 11:20:49 CEST
*** Bug 33186 has been marked as a duplicate of this bug. ***
Comment 4 Florian Best univentionstaff 2016-05-27 11:21:40 CEST
*** Bug 23529 has been marked as a duplicate of this bug. ***
Comment 5 Florian Best univentionstaff 2016-05-27 11:22:39 CEST
(In reply to Florian Best from comment #4)
> *** Bug 23529 has been marked as a duplicate of this bug. ***
Part of the fix mentioned there:
The cause is that after a connection error Node.conn is set to None, which is then nevertheless used without any check by all the domain_*() functions. The following should be added here:
  if conn is None:
    raise NodeError(_('Node is currently unconnected.'))

At least this would display a human-readable description instead of an unhelpful error message.
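
A self-contained sketch of how that check could be shared by all domain_*() functions, rather than repeated in each one. Everything here except the quoted check and lookupByUUIDString (which appears in the traceback in the description) is an assumption for illustration: the reduced Node class, the require_conn() helper, and the stand-ins for NodeError and _() are hypothetical and do not mirror the real node.py.

```python
class NodeError(Exception):
    """Stand-in for uvmmd's NodeError (assumption: the real class differs)."""


def _(message):
    # Placeholder for the gettext translation function used in the snippet.
    return message


class Node(object):
    """Hypothetical, heavily reduced node holding a libvirt connection."""

    def __init__(self, uri, conn=None):
        self.uri = uri
        self.conn = conn  # reset to None after a connection error

    def require_conn(self):
        """Return the current connection or raise a readable NodeError."""
        conn = self.conn
        if conn is None:
            raise NodeError(_('Node is currently unconnected.'))
        return conn

    def domain_state(self, domain):
        # Each domain_*() function would obtain the connection via the guard
        # instead of dereferencing self.conn directly.
        conn = self.require_conn()
        return conn.lookupByUUIDString(domain)
```

With this, a request against a disconnected node fails with "Node is currently unconnected." instead of the AttributeError on 'NoneType' shown in the description.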
Comment 6 Philipp Hahn univentionstaff 2019-09-13 14:39:55 CEST
Xen has not been relevant since UCS-4.0.
It works for me with KVM.
Comment 7 Erik Damrose univentionstaff 2019-12-09 10:34:40 CET
No reports of this with KVM, and Xen is not supported anymore -> close as worksforme