Bug 50660 - replication.py: UDL not stopped on disk full
Status: NEW
Product: UCS
Classification: Unclassified
Component: Listener (univention-directory-listener)
Version: UCS 4.4
Hardware: Other Linux
Importance: P5 normal
Target Milestone: ---
Assigned To: UCS maintainers
QA Contact: UCS maintainers
URL:
Depends on:
Blocks:
Reported: 2019-12-18 17:18 CET by Philipp Hahn
Modified: 2020-04-21 15:03 CEST (History)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 7: Crash: Bug causes crash or data loss
Who will be affected by this bug?: 2: Will only affect a few installed domains
How will those affected feel about the bug?: 2: A Pain – users won’t like this once they notice it
User Pain: 0.160
Enterprise Customer affected?:
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2019121621000516
Bug group (optional): Error handling
Max CVSS v3 score:
hahn: Patch_Available+


Description Philipp Hahn univentionstaff 2019-12-18 17:18:53 CET
/usr/lib/univention-directory-listener/system/replication.py # check_file_system_space checks the free disk space and stops UDL if it falls below the limit:
> 775         listener.run('/etc/init.d/univention-directory-listener', ['univention-directory-listener', 'stop'], uid=0, wait=True)

Before doing this the module tries to send an email:
> 770         s = smtplib.SMTP()
> 771         s.connect()
> 772         s.sendmail(sender, [recipient], msg.as_string())
> 773         s.close()

If the MTA is not available, the module crashes and UDL is *not* stopped:

>18.12.19 16:53:19.833  LISTENER    ( ERROR   ) : replication: Critical disk space. The Univention LDAP Listener was stopped
>Traceback (most recent call last):
>  File "/usr/lib/univention-directory-listener/system/replication.py", line 783, in handler
>    check_file_system_space()
>  File "/usr/lib/univention-directory-listener/system/replication.py", line 771, in check_file_system_space
>    s.connect()
>  File "/usr/lib/python2.7/smtplib.py", line 316, in connect
>    self.sock = self._get_socket(host, port, self.timeout)
>  File "/usr/lib/python2.7/smtplib.py", line 291, in _get_socket
>    return socket.create_connection((host, port), timeout)
>  File "/usr/lib/python2.7/socket.py", line 575, in create_connection
>    raise err
>socket.error: [Errno 111] Connection refused
>18.12.19 16:53:19.834  LISTENER    ( WARN    ) : handler: replication (failed)
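
The failure mode above suggests wrapping the notification so that the stop step runs regardless of whether the mail could be sent. The following is only a minimal sketch of that idea, not the patch itself: `notify_then_stop`, `send_mail` and `stop_listener` are hypothetical names, standing in for the `smtplib` calls on lines 770-773 and the `listener.run(...)` call on line 775.

```python
import logging

log = logging.getLogger("replication-sketch")  # hypothetical logger name


def notify_then_stop(send_mail, stop_listener):
    """Call send_mail(), then call stop_listener() no matter what happened."""
    try:
        send_mail()
    except Exception:
        # For example socket.error: [Errno 111] Connection refused
        # when no MTA is listening. Log it and carry on.
        log.exception("Sending the warning mail failed; stopping UDL anyway")
    finally:
        # Runs whether or not the mail was sent.
        stop_listener()
```

In `check_file_system_space` the two callables would wrap the existing mail code and the existing stop call, so a refused SMTP connection would only produce a log entry instead of aborting the handler.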

This can be reproduced easily by
 ucr set ldap/replication/filesystem/check=true ldap/replication/filesystem/limit=40187996
 /etc/init.d/postfix stop
 /etc/init.d/univention-directory-listener restart
 udm users/user modify --dn "uid=Administrator,cn=users,$(ucr get ldap/base)" --set description="$(date)"
 tail -f /var/log/univention/listener.log


Also there are 2 UCRVs:

> # ucr search listener/freespace ldap/replication/filesystem/limit
> ldap/replication/filesystem/limit: 40187996
>  This variable configures the lower limit for free space in the directory '/var/lib/univention-ldap/', when replication will be stopped. Default is 10 [MiB].

This is implemented in `replication.py`.
The module only exists on Backups and Slaves.
The module is always executed first.
If the check triggers, UDL is stopped via `service stop udl`.

> listener/freespace: 10
>  This variable configures the lower limit for free space in the directories '/var/lib/univention-ldap/' and '/var/lib/univention-directory-listener/', when the Listener will be stopped. Default is 10 [MiB].

This is implemented in the main loop of UDL.
This is also checked on Master and Member.
If the check fails, UDL abort()s: management/univention-directory-listener/src/notifier.c
> 89 »···»···abort();
But UDL is then restarted by `systemd` in an endless loop, which fills the remaining disk space with error messages.
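
Both checks described above compare the free space of a directory against a configured lower limit. As a rough illustration of that comparison (not the code from `replication.py` or `notifier.c`), it could be sketched as below. The helper names are hypothetical, and interpreting the limit in MiB is an assumption taken from the "Default is 10 [MiB]" wording of the UCR descriptions.

```python
import os

MIB = 1024 * 1024


def free_space_mib(path):
    """Return the space available to unprivileged users under path, in MiB."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize // MIB


def below_limit(path, limit_mib):
    """Return True if the free space under path is below limit_mib (assumed MiB)."""
    return free_space_mib(path) < limit_mib
```

A call such as `below_limit('/var/lib/univention-ldap/', 10)` would then correspond to the documented default. Setting a very large limit, as in the reproduction steps, makes the check trigger immediately.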


If both limits are set to the same value, both implementations will race. In the customer's case the logfile contains the beginning of the Traceback, but it is overwritten in the middle by the next incarnation of UDL:

> 16.12.19 10:05:44.816  LISTENER    ( ERROR   ) : replication: Critical disk space. The Univention LDAP Listener was stopped
> Traceback (most recent call last):
>   File "/usr/lib/univention-directory-listener/system/replication.py", line 783, in handler
>     check_file_system_space()
>   File "/usr/lib/univention-directory16.12.19 10:05:51.019  DEBUG_INIT


Tasks:
1. Catch exception in replication.py to always stop UDL
2. Or remove code from replication.py completely now that UDL has a check itself
3. Make systemd not restart UDL in that case; either by UDL stopping itself or by using RestartPreventExitStatus= or by ...
4. Document to set listener/freespace << ldap/replication/filesystem/limit
Comment 1 Philipp Hahn univentionstaff 2020-04-21 15:03:05 CEST
Patch in git:phahn/replication