Bug 49737 - Within one day the virtual-machine-manager-daemon.log increases to 400Gb
Within one day the virtual-machine-manager-daemon.log increases to 400Gb
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: UMC - Virtual machines (UVMM)
UCS 4.4
Other Linux
: P5 normal (vote)
: UCS 4.4-1-errata
Assigned To: Florian Best
Philipp Hahn
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2019-06-27 15:27 CEST by Christina Scheinig
Modified: 2019-08-07 15:44 CEST (History)
3 users (show)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 7: Crash: Bug causes crash or data loss
Who will be affected by this bug?: 1: Will affect a very few installed domains
How will those affected feel about the bug?: 3: A User would likely not purchase the product
User Pain: 0.120
Enterprise Customer affected?: Yes
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2019062521000734
Bug group (optional):
Max CVSS v3 score:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Christina Scheinig univentionstaff 2019-06-27 15:27:10 CEST
In a customer environment the logfile of two backupservers increases to 400Gb within one day. This causes the hard disk to ran out of space. The logfile is filled with tracebacks as follows. I have renamed the doamin name.
 
-rw-r----- 1 root adm 415G Jun 25 12:16 /var/lib/libvirt/images2/virtual-machine-manager-daemon.log

2019-06-24 06:25:04,255 - uvmmd - INFO - Logfile reopened.
2019-06-24 06:28:12,048 - uvmmd.node - WARNING - 'qemu://master.schein.ig/system' broken? next check in 0:05:00.000. unable to connect to server at 'master.schein.ig
:16514': Connection timed out
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 834, in update_autoreconnect
    self.conn = libvirt.open(self.pd.uri)
  File "/usr/lib/python2.7/dist-packages/libvirt.py", line 255, in open
    if ret is None:raise libvirtError('virConnectOpen() failed')
libvirtError: unable to connect to server at 'master.schein.ig:16514': Connection timed out
2019-06-24 06:35:22,128 - uvmmd.node - WARNING - 'qemu://master.schein.ig/system' broken? next check in 0:05:00.000. unable to connect to server at 'master.schein.ig
:16514': Connection timed out
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 834, in update_autoreconnect
    self.conn = libvirt.open(self.pd.uri)
  File "/usr/lib/python2.7/dist-packages/libvirt.py", line 255, in open
    if ret is None:raise libvirtError('virConnectOpen() failed')
libvirtError: unable to connect to server at 'master.schein.ig:16514': Connection timed out
2019-06-24 06:42:32,208 - uvmmd.node - WARNING - 'qemu://master.schein.ig/system' broken? next check in 0:05:00.000. unable to connect to server at 'master.schein.ig
:16514': Connection timed out
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 834, in update_autoreconnect
    self.conn = libvirt.open(self.pd.uri)
  File "/usr/lib/python2.7/dist-packages/libvirt.py", line 255, in open
    if ret is None:raise libvirtError('virConnectOpen() failed')
libvirtError: unable to connect to server at 'master.schein.ig:16514': Connection timed out
[...]

libvirtError: unable to connect to server at 'master.schein.ig:16514': Connection timed out
2019-06-24 10:31:54,768 - uvmmd.node - WARNING - 'qemu://master.schein.ig/system' broken? next check in 0:05:00.000. unable to connect to server at 'master.schein.ig
:16514': Connection timed out
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 834, in update_autoreconnect
    self.conn = libvirt.open(self.pd.uri)
  File "/usr/lib/python2.7/dist-packages/libvirt.py", line 255, in open
    if ret is None:raise libvirtError('virConnectOpen() failed')
libvirtError: unable to connect to server at 'master.schein.ig:16514': Connection timed out
2019-06-24 10:39:04,848 - uvmmd.node - WARNING - 'qemu://master.schein.ig/system' broken? next check in 0:05:00.000. unable to connect to server at 'master.schein.ig
:16514': Connection timed out
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 834, in update_autoreconnect
    self.conn = libvirt.open(self.pd.uri)
  File "/usr/lib/python2.7/dist-packages/libvirt.py", line 255, in open
    if ret is None:raise libvirtError('virConnectOpen() failed')
libvirtError: unable to connect to server at 'master.schein.ig:16514': Connection timed out
2019-06-24 10:46:14,928 - uvmmd.node - WARNING - 'qemu://master.schein.ig/system' broken? next check in 0:05:00.000. unable to connect to server at 'master.schein.ig
:16514': Connection timed out
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 834, in update_autoreconnect
    self.conn = libvirt.open(self.pd.uri)
  File "/usr/lib/python2.7/dist-packages/libvirt.py", line 255, in open
    if ret is None:raise libvirtError('virConnectOpen() failed')
libvirtError: unable to connect to server at 'master.schein.ig:16514': Connection timed out
2019-06-24 10:51:15,349 - uvmmd.node - INFO - Connected to 'qemu://master.schein.ig/system'
[..]

2019-06-24 11:19:19,145 - uvmmd.node - INFO - Connected to 'qemu://backup5-10.schein.ig/system'
2019-06-24 12:34:25,040 - uvmmd.node - WARNING - 'qemu://master.schein.ig/system' broken? next check in 0:00:30.000. Cannot recv data: Input/output error
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 842, in update_autoreconnect
    self.update()
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1123, in update
    for dom in self.conn.listAllDomains():
  File "/usr/lib/python2.7/dist-packages/libvirt.py", line 5011, in listAllDomains
    raise libvirtError("virConnectListAllDomains() failed", conn=self)
libvirtError: Cannot recv data: Input/output error
2019-06-24 12:34:25,041 - uvmmd.node.connection - INFO - Connection qemu://master.schein.ig/system closed: Error
2019-06-24 12:34:25,041 - uvmmd.node - ERROR - Cannot recv data: Input/output error
2019-06-24 12:34:25,041 - uvmmd.node - ERROR - qemu://master.schein.ig/system: Exception in domainEventDeregisterAny
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1097, in _unregister
    self.conn.domainEventDeregisterAny(self.domainCB.pop())
  File "/usr/lib/python2.7/dist-packages/libvirt.py", line 4820, in domainEventDeregisterAny
    if ret == -1: raise libvirtError ('virConnectDomainEventDeregisterAny() failed', conn=self)
libvirtError: internal error: client socket is closed
2019-06-24 12:34:25,041 - uvmmd.node - ERROR - An error occurred, but the cause is unknown
2019-06-24 12:34:25,041 - uvmmd.unix - WARNING - [58861] Error doing command "DOMAIN_STATE": Error managing domain "7f75c911-cea5-49de-8ffb-ab6090733bef": An error occurred, but the cause is unknown
2019-06-24 12:34:25,041 - uvmmd.node - ERROR - qemu://master.schein.ig/system: Exception in domainEventDeregisterAny
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1097, in _unregister
    self.conn.domainEventDeregisterAny(self.domainCB.pop())
AttributeError: 'NoneType' object has no attribute 'domainEventDeregisterAny'
2019-06-24 12:34:25,041 - uvmmd.unix - WARNING - [58898] Error doing command "DOMAIN_STATE": Error managing domain "7f75c911-cea5-49de-8ffb-ab6090733bef": An error occurred, but the cause is unknown
2019-06-24 12:34:25,041 - uvmmd.node - ERROR - qemu://master.schein.ig/system: Exception in domainEventDeregisterAny
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1097, in _unregister
    self.conn.domainEventDeregisterAny(self.domainCB.pop())

[...]

2019-06-25 04:06:32,064 - uvmmd.node - ERROR - qemu://master.schein.ig/system: Exception in domainEventDeregisterAny
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1097, in _unregister
    self.conn.domainEventDeregisterAny(self.domainCB.pop())
AttributeError: 'NoneType' object has no attribute 'domainEventDeregisterAny'
2019-06-25 04:06:32,064 - uvmmd.node - ERROR - qemu://master.schein.ig/system: Exception in domainEventDeregisterAny
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1097, in _unregister
    self.conn.domainEventDeregisterAny(self.domainCB.pop())
AttributeError: 'NoneType' object has no attribute 'domainEventDeregisterAny'
2019-06-25 04:06:32,064 - uvmmd.node - ERROR - qemu://master.schein.ig/system: Exception in domainEventDeregisterAny
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1097, in _unregister
    self.conn.domainEventDeregisterAny(self.domainCB.pop())
AttributeError: 'NoneType' object has no attribute 'domainEventDeregisterAny'
2019-06-25 04:06:32,064 - uvmmd.node - ERROR - qemu://master.schein.ig/system: Exception in domainEventDeregisterAny
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1097, in _unregister
    self.conn.domainEventDeregisterAny(self.domainCB.pop())
AttributeError: 'NoneType' object has no attribute 'domainEventDeregisterAny'
2019-06-25 04:06:32,064 - uvmmd.node - ERROR - qemu://master.schein.ig/system: Exception in domainEventDeregisterAny
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1097, in _unregister
    self.conn.domainEventDeregisterAny(self.domainCB.pop())
AttributeError: 'NoneType' object has no attribute 'domainEventDeregisterAny'
2019-06-25 04:06:32,064 - uvmmd.node - ERROR - qemu://master.schein.ig/system: Exception in domainEventDeregisterAny
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1097, in _unregister
    self.conn.domainEventDeregisterAny(self.do2019-06-25 12:16:02,649 - uvmmd.node - INFO - timer_callback(qemu://backup0.schein.ig/system) start
2019-06-25 12:16:02,741 - uvmmd.node - INFO - timer_callback(qemu://backup2.schein.ig/system) start
2019-06-25 12:16:02,750 - uvmmd.node - INFO - Connected to 'qemu://backup2.schein.ig/system'
2019-06-25 12:16:02,774 - uvmmd.node - INFO - timer_callback(qemu://master.schein.ig/system) start
2019-06-25 12:16:02,799 - uvmmd.node - INFO - timer_callback(qemu://backup5-10.schein.ig/system) start
2019-06-25 12:16:03,053 - uvmmd.node - INFO - Connected to 'qemu://backup0.schein.ig/system'
2019-06-25 12:16:03,395 - uvmmd.node - INFO - Connected to 'qemu://master.schein.ig/system'
2019-06-25 12:16:03,408 - uvmmd.node - INFO - Connected to 'qemu://backup5-10.schein.ig/system'
Comment 1 Erik Damrose univentionstaff 2019-06-27 15:46:42 CEST
(In reply to Christina Scheinig from comment #0)
> -rw-r----- 1 root adm 415G Jun 25 12:16
> /var/lib/libvirt/images2/virtual-machine-manager-daemon.log

Where does this file come from? From which dates does that logfile contain logentries?

UVMMd uses the hardcoded logfile /var/log/univention/virtual-machine-manager-daemon.log, and that file is rotated by logrotate.

Which type of error in the logfiles caused the massive increase of filesize?

In comment0 i see errors that a hypervisor is not reachable, but this has a backoff logic and should only be logged every 5 minutes after the connection breaks of.

Another error seems to be with virConnectListAllDomains...
Comment 2 Christina Scheinig univentionstaff 2019-06-27 16:41:19 CEST
(In reply to Erik Damrose from comment #1)
> (In reply to Christina Scheinig from comment #0)
> > -rw-r----- 1 root adm 415G Jun 25 12:16
> > /var/lib/libvirt/images2/virtual-machine-manager-daemon.log
> 
> Where does this file come from? From which dates does that logfile contain
> logentries?

I think the customer moved the logfile to get free space again.
The logfile starts on 
2019-06-24 06:25:04,255 - uvmmd - INFO - Logfile reopened and ends
2019-06-25 12:16:03,408 - uvmmd.node - INFO - Connected to 'qemu://....

> UVMMd uses the hardcoded logfile
> /var/log/univention/virtual-machine-manager-daemon.log, and that file is
> rotated by logrotate.
> 
> Which type of error in the logfiles caused the massive increase of filesize?
> 
This message is repeated all the time:
2019-06-24 12:34:25,041 - uvmmd.node - ERROR - qemu://master.schein.ig/system: Exception in domainEventDeregisterAny
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1097, in _unregister
    self.conn.domainEventDeregisterAny(self.domainCB.pop())

> In comment0 i see errors that a hypervisor is not reachable, but this has a
> backoff logic and should only be logged every 5 minutes after the connection
> breaks of.
> 
> Another error seems to be with virConnectListAllDomains...
Comment 3 Ingo Steuwer univentionstaff 2019-07-05 16:12:04 CEST
(In reply to Christina Scheinig from comment #2)
> > Which type of error in the logfiles caused the massive increase of filesize?
> > 
> This message is repeated all the time:
> 2019-06-24 12:34:25,041 - uvmmd.node - ERROR -
> qemu://master.schein.ig/system: Exception in domainEventDeregisterAny
> Traceback (most recent call last):
>   File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1097, in
> _unregister
>     self.conn.domainEventDeregisterAny(self.domainCB.pop())

If this message is the reason for the log file growth: do we know the root cause?
Comment 4 Christina Scheinig univentionstaff 2019-07-05 16:29:35 CEST
(In reply to Ingo Steuwer from comment #3)
> (In reply to Christina Scheinig from comment #2)
> > > Which type of error in the logfiles caused the massive increase of filesize?
> > > 
> > This message is repeated all the time:
> > 2019-06-24 12:34:25,041 - uvmmd.node - ERROR -
> > qemu://master.schein.ig/system: Exception in domainEventDeregisterAny
> > Traceback (most recent call last):
> >   File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1097, in
> > _unregister
> >     self.conn.domainEventDeregisterAny(self.domainCB.pop())
> 
> If this message is the reason for the log file growth: do we know the root
> cause?

Unfortunatly, we don't know for sure. According to the logfile and the statement of the customer the connection to the master was broken caused by a failed vpn connection. After that you can see two messages of input/output error and then the spamming starts.
Comment 5 Florian Best univentionstaff 2019-07-17 16:56:39 CEST
Node._unregister() / Node.update_autoreconnect() / Node.run() is called concurrently by multiple threads.
I think the Node.close_event() callback destroys the connection which causes the exceptions in Node._unregister().

Maybe some thread locking has to be done?
Comment 6 Florian Best univentionstaff 2019-07-19 20:37:19 CEST
So far from the Ticket information I understand the setup in the following way:
* There is 1 DC Master and 3 DC Backups
* The UVMM is running on one of the DC Backups
* The DC Master also have a libvirt instance which is used from the DC Backup
* The general network connection to the DC Master doesn't work
* On the DC Backup slapd is not running(?)
Comment 7 Florian Best univentionstaff 2019-07-26 17:30:42 CEST
(In reply to Christina Scheinig from comment #0)

> 2019-06-24 10:46:14,928 - uvmmd.node - WARNING -
> 'qemu://master.schein.ig/system' broken? next check in 0:05:00.000. unable
> to connect to server at 'master.schein.ig
> :16514': Connection timed out
> Traceback (most recent call last):
>   File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 834, in
> update_autoreconnect
>     self.conn = libvirt.open(self.pd.uri)
>   File "/usr/lib/python2.7/dist-packages/libvirt.py", line 255, in open
>     if ret is None:raise libvirtError('virConnectOpen() failed')
> libvirtError: unable to connect to server at 'master.schein.ig:16514':
> Connection timed out
Tracebacks in update_autoreconnect() are only logged once now. Otherwise only a warning is logged.

> Traceback (most recent call last):
>   File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 842, in
> update_autoreconnect
>     self.update()
>   File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1123, in
> update
>     for dom in self.conn.listAllDomains():
>   File "/usr/lib/python2.7/dist-packages/libvirt.py", line 5011, in
> listAllDomains
>     raise libvirtError("virConnectListAllDomains() failed", conn=self)
> libvirtError: Cannot recv data: Input/output error
Same for this traceback.

> 2019-06-24 12:34:25,041 - uvmmd.node - ERROR -
> qemu://master.schein.ig/system: Exception in domainEventDeregisterAny
> Traceback (most recent call last):
>   File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1097, in
> _unregister
>     self.conn.domainEventDeregisterAny(self.domainCB.pop())
>   File "/usr/lib/python2.7/dist-packages/libvirt.py", line 4820, in
> domainEventDeregisterAny
>     if ret == -1: raise libvirtError ('virConnectDomainEventDeregisterAny()
> failed', conn=self)
> libvirtError: internal error: client socket is closed
Exceptions in _unregister() are now logged as warning without the traceback.

> 2019-06-24 12:34:25,041 - uvmmd.node - ERROR - An error occurred, but the
> cause is unknown
> 2019-06-24 12:34:25,041 - uvmmd.unix - WARNING - [58861] Error doing command
> "DOMAIN_STATE": Error managing domain
> "7f75c911-cea5-49de-8ffb-ab6090733bef": An error occurred, but the cause is
> unknown
> 2019-06-24 12:34:25,041 - uvmmd.node - ERROR -
> qemu://master.schein.ig/system: Exception in domainEventDeregisterAny
> Traceback (most recent call last):
>   File "/usr/lib/pymodules/python2.7/univention/uvmm/node.py", line 1097, in
> _unregister
>     self.conn.domainEventDeregisterAny(self.domainCB.pop())
> AttributeError: 'NoneType' object has no attribute 'domainEventDeregisterAny'
domainEventDeregisterAny'

This actually was a race condition between the update_autoreconnect() thread and the main thread calling the event close_event().
Storing self.conn in a local variable ensures the instance is accessible and the callbacks can be deregistered.

univention-virtual-machine-manager-daemon (8.0.1-8)
6bf9160197c8 | Bug #49737 UVMM: Improve Exception handling on connection close

univention-virtual-machine-manager-daemon.yaml
43c6b1b8791d | YAML Bug #28771, Bug #49695, Bug #49737
Comment 8 Philipp Hahn univentionstaff 2019-07-30 19:22:47 CEST
OK: Code review
OK: tail -f /var/log/univention/virtual-machine-manager-daemon.log
    /etc/init.d/libvirtd stop
OK: errata-announce  -V --only univention-virtual-machine-manager-schema.yaml
OK: YAML
Comment 9 Erik Damrose univentionstaff 2019-08-07 15:44:32 CEST
<http://errata.software-univention.de/ucs/4.4/218.html>