Univention Bugzilla – Bug 35354
Increase uvmmd / libvirt connection robustness
Last modified: 2019-12-09 10:34:40 CET
In a larger Xen environment, uvmmd / libvirt often loses the connection to its nodes. The environment consists of 4 Xen nodes with about 10 VMs each and varying load. The connection losses have become more visible with the updated uvmmd interface from Bug #35122: the affected node and its domains get marked as unavailable.

Various error messages appear in virtual-machine-manager-daemon.log. Most errors are logged due to timeouts and concern domainEventDeregister:

2014-07-14 16:30:25,076 - uvmmd.node - WARNING - 'xen://schulr1.ucs.local/' broken? next check in 0:00:30.000. Cannot write data: Input/output error
2014-07-14 16:30:25,077 - uvmmd.node - ERROR - xen://schulr1.ucs.local/: Exception in domainEventDeregister
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/univention/uvmm/node.py", line 608, in update_autoreconnect
    self.conn.domainEventDeregister(self.domainCB)
  File "/usr/lib/python2.6/dist-packages/libvirt.py", line 3343, in domainEventDeregister
    if ret == -1: raise libvirtError ('virConnectDomainEventDeregister() failed', conn=self)
libvirtError: Cannot write data: Input/output error

But I also saw the following error:

2014-07-14 15:59:19,435 - uvmmd.node - WARNING - 'xen://schulr2.ucs.local/' broken? next check in 0:00:30.000. Domain not found: xenUnifiedDomainLookupByID
2014-07-14 15:59:21,649 - uvmmd.unix - ERROR - [834] Exception:
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/univention/uvmm/unix.py", line 149, in handle_command
    res = cmd(self, command)
  File "/usr/lib/pymodules/python2.6/univention/uvmm/commands.py", line 208, in DOMAIN_STATE
    node.domain_state(request.uri, request.domain, request.state)
  File "/usr/lib/pymodules/python2.6/univention/uvmm/node.py", line 1450, in domain_state
    dom = conn.lookupByUUIDString(domain)
AttributeError: 'NoneType' object has no attribute 'lookupByUUIDString'
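The first traceback shows the teardown call itself failing on a connection that is already dead. One possible way to keep that from interrupting the reconnect logic is a best-effort teardown, sketched below. This is only an illustration and not the actual uvmmd code: `FakeLibvirtError`, `BrokenConn` and `drop_connection` are hypothetical names, and the fake classes stand in for libvirt so the sketch runs without it.

```python
import logging

logger = logging.getLogger("uvmmd.node.sketch")


class FakeLibvirtError(Exception):
    """Hypothetical stand-in for libvirt.libvirtError."""


class BrokenConn(object):
    """Hypothetical connection whose transport is already dead."""

    def __init__(self):
        self.closed = False

    def domainEventDeregister(self, callback):
        # Mimics the failure from the log excerpt.
        raise FakeLibvirtError("Cannot write data: Input/output error")

    def close(self):
        self.closed = True


def drop_connection(conn, callback):
    """Run every teardown step, log failures, never raise.

    Returns the names of the steps that failed.
    """
    steps = (
        ("domainEventDeregister", lambda: conn.domainEventDeregister(callback)),
        ("close", conn.close),
    )
    failed = []
    for name, step in steps:
        try:
            step()
        except FakeLibvirtError as ex:
            # A dead connection is expected here; log it and continue.
            logger.warning("%s failed during teardown: %s", name, ex)
            failed.append(name)
    return failed
```

With this shape, a failing deregister is logged as a warning and the remaining cleanup still runs, so the caller can proceed to reconnect.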
The following traceback can also be seen:

2014-07-14 17:10:16,172 - uvmmd.node - WARNING - 'xen://schulr2.ucs.local/' broken? next check in 0:00:30.000. Domain not found: xenUnifiedDomainLookupByUUID
2014-07-14 17:10:16,183 - uvmmd.node - ERROR - xen://schulr2.ucs.local/: Exception handling callback
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/univention/uvmm/node.py", line 690, in domain_callback
    domStat.update( dom )
  File "/usr/lib/pymodules/python2.6/univention/uvmm/node.py", line 243, in update
    info = domain.info()
  File "/usr/lib/python2.6/dist-packages/libvirt.py", line 1730, in info
    if ret is None: raise libvirtError ('virDomainGetInfo() failed', dom=self)
libvirtError: An error occurred, but the cause is unknown
What I've now seen several times is our check scripts restarting libvirtd / UVMMd because our "virsh" check did not return in time while a long-running operation like snapshot-create was in progress (Bug #35101). This will definitely break any existing connection and leads to the host being shown as unreachable.

Especially with Xen there is Bug #20910, which has been unfixed for 4 years: when a VM is started, the dom0 seems to run into a resource problem, which breaks something in libvirtd. Our libvirt package contains patches/libvirt/3.1-0-0-ucs/0.9.12-5-ucs3.1-1/64_xen-hypervisor-reopen.patch to provide some band-aid for Bug #20024, but the underlying problem remains.

Unless the restart scripts add a time-stamp to their output (Bug #35069), it is nearly impossible to correlate those events.
*** Bug 33186 has been marked as a duplicate of this bug. ***
*** Bug 23529 has been marked as a duplicate of this bug. ***
(In reply to Florian Best from comment #4)
> *** Bug 23529 has been marked as a duplicate of this bug. ***

Part of the fix mentioned there:

The cause is that after a connection error Node.conn is set to None, which is then still used by all the domain_*() functions without any check. The following should be added here:

    if conn is None:
        raise NodeError(_('Node is currently unconnected.'))

At least this would display a human-readable description instead of an unhelpful error message.
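A minimal sketch of how the suggested check could be shared by the domain_*() functions instead of being repeated in each one. It is an assumption about structure, not the actual node.py code: the `Node` class here is reduced to the bare minimum, `_require_conn` is a hypothetical helper name, and `NodeError` and `_` are simple stand-ins so the example runs on its own.

```python
class NodeError(Exception):
    """Stand-in for the NodeError named in the comment above."""


def _(msg):
    # Placeholder for the gettext translation function.
    return msg


class Node(object):
    """Reduced, hypothetical version of a UVMM node object."""

    def __init__(self, conn=None):
        # conn is reset to None after a connection error.
        self.conn = conn

    def _require_conn(self):
        """Return the live connection or raise a readable NodeError."""
        conn = self.conn
        if conn is None:
            raise NodeError(_('Node is currently unconnected.'))
        return conn

    def domain_state(self, domain):
        # Every domain_*() function would start with the same guard.
        conn = self._require_conn()
        return conn.lookupByUUIDString(domain)
```

Calling `Node().domain_state(...)` on an unconnected node then raises `NodeError` with the message above rather than the `AttributeError` on `NoneType` seen in the log.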
Xen is no longer relevant since UCS-4.0. It works for me with KVM.
No reports of this with KVM, and Xen is not supported anymore -> close as worksforme