Bug 50660 - replication.py: UDL not stopped on disk full
Status: NEW
Product: UCS
Classification: Unclassified
Component: Listener (univention-directory-listener)
Version: UCS 5.0
Hardware: Other Linux
Importance: P5 normal
Target Milestone: ---
Assigned To: UCS maintainers
UCS maintainers
Duplicates: 53413
Depends on:
Blocks:
Reported: 2019-12-18 17:18 CET by Philipp Hahn
Modified: 2021-06-08 13:54 CEST (History)

See Also:
What kind of report is it?: Bug Report
What type of bug is this?: 7: Crash: Bug causes crash or data loss
Who will be affected by this bug?: 2: Will only affect a few installed domains
How will those affected feel about the bug?: 2: A Pain – users won’t like this once they notice it
User Pain: 0.160
Enterprise Customer affected?:
School Customer affected?: Yes
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number: 2019121621000516, 2021060421000439
Bug group (optional): Error handling
Max CVSS v3 score:
hahn: Patch_Available+


Description Philipp Hahn univentionstaff 2019-12-18 17:18:53 CET
/usr/lib/univention-directory-listener/system/replication.py # check_file_system_space checks the free disk space and stops UDL if it is too low:
> 775         listener.run('/etc/init.d/univention-directory-listener', ['univention-directory-listener', 'stop'], uid=0, wait=True)

Before doing this the module tries to send an email:
> 770         s = smtplib.SMTP()
> 771         s.connect()
> 772         s.sendmail(sender, [recipient], msg.as_string())
> 773         s.close()

If the MTA is not available, the module crashes and UDL is *not* stopped:

>18.12.19 16:53:19.833  LISTENER    ( ERROR   ) : replication: Critical disk space. The Univention LDAP Listener was stopped
>Traceback (most recent call last):
>  File "/usr/lib/univention-directory-listener/system/replication.py", line 783, in handler
>    check_file_system_space()
>  File "/usr/lib/univention-directory-listener/system/replication.py", line 771, in check_file_system_space
>    s.connect()
>  File "/usr/lib/python2.7/smtplib.py", line 316, in connect
>    self.sock = self._get_socket(host, port, self.timeout)
>  File "/usr/lib/python2.7/smtplib.py", line 291, in _get_socket
>    return socket.create_connection((host, port), timeout)
>  File "/usr/lib/python2.7/socket.py", line 575, in create_connection
>    raise err
>socket.error: [Errno 111] Connection refused
>18.12.19 16:53:19.834  LISTENER    ( WARN    ) : handler: replication (failed)

This can be reproduced easily by
 ucr set ldap/replication/filesystem/check=true ldap/replication/filesystem/limit=40187996
 /etc/init.d/postfix stop
 /etc/init.d/univention-directory-listener restart
 udm users/user modify --dn "uid=Administrator,cn=users,$(ucr get ldap/base)" --set description="$(date)"
 tail -f /var/log/univention/listener.log


Also there are 2 UCRVs:

> # ucr search listener/freespace ldap/replication/filesystem/limit
> ldap/replication/filesystem/limit: 40187996
>  This variable configures the lower limit for free space in the directory '/var/lib/univention-ldap/', when replication will be stopped. Default is 10 [MiB].

This is implemented in `replication.py`.
The module only exists on Backups and Slaves.
The module is always executed first.
If the check triggers, UDL is stopped via `service stop udl`.

> listener/freespace: 10
>  This variable configures the lower limit for free space in the directories '/var/lib/univention-ldap/' and '/var/lib/univention-directory-listener/', when the Listener will be stopped. Default is 10 [MiB].

This is implemented in the main loop of UDL.
This is also checked on Master and Member.
If the check fails, UDL abort()s: management/univention-directory-listener/src/notifier.c
> 89 »···»···abort();
But UDL is then restarted by `systemd` in an endless loop, which fills the remaining disk space with error messages.
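
Both checks come down to comparing the free space of a directory against a configured limit. A minimal Python 3 sketch of such a comparison follows. It is not the code from `replication.py` or `notifier.c`: the function names are hypothetical, the MiB unit is taken from the UCR descriptions quoted above, and using `f_bavail` (space available to unprivileged users) is an assumption.

```python
import os


def free_space_mib(path):
    """Return the free space of the file system holding `path`, in MiB.

    Assumption: f_bavail is used, i.e. the blocks available to unprivileged
    users; the real implementations may count differently.
    """
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / (1024.0 * 1024.0)


def below_limit(path, limit_mib):
    """Return True if the free space for `path` is below `limit_mib`."""
    return free_space_mib(path) < limit_mib
```

With two independent limits, a check like `below_limit('/var/lib/univention-ldap/', 10)` can be true for both mechanisms at the same moment, which is the situation described next.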


If both limits are set to the same value, both implementations will race. In the customer's case, the log file contains the beginning of the traceback, but it is cut off in the middle by the next incarnation of UDL:

> 16.12.19 10:05:44.816  LISTENER    ( ERROR   ) : replication: Critical disk space. The Univention LDAP Listener was stopped
> Traceback (most recent call last):
>   File "/usr/lib/univention-directory-listener/system/replication.py", line 783, in handler
>     check_file_system_space()
>   File "/usr/lib/univention-directory16.12.19 10:05:51.019  DEBUG_INIT


Tasks:
1. Catch exception in replication.py to always stop UDL
2. Or remove code from replication.py completely now that UDL has a check itself
3. Make systemd not restart UDL in that case; either by UDL stopping itself or by using RestartPreventExitStatus= or by ...
4. Document to set listener/freespace << ldap/replication/filesystem/limit
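
Task 1 could look roughly like the following. This is only a sketch under assumptions, not the patch itself: `notify_and_stop`, `send_mail`, `stop_listener` and `log` are hypothetical names standing in for the `smtplib` calls and the `listener.run(...)` call quoted in the description, and it targets Python 3, where `socket.error` is an alias of `OSError`.

```python
import smtplib


def notify_and_stop(send_mail, stop_listener, log=print):
    """Try to send the notification mail, then stop UDL in any case.

    send_mail and stop_listener are hypothetical callables wrapping the
    smtplib code and the listener.run(...) call from replication.py.
    """
    try:
        send_mail()
    except (smtplib.SMTPException, OSError) as exc:
        # A refused connection (MTA down) ends up here instead of
        # aborting the handler before UDL is stopped.
        log("replication: could not send notification mail: %s" % (exc,))
    finally:
        # Runs whether or not the mail could be sent.
        stop_listener()
```

The `finally` clause is the relevant part: stopping UDL no longer depends on the mail delivery succeeding, and an unexpected exception type from `send_mail` would still propagate after the stop has been issued.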
Comment 1 Philipp Hahn univentionstaff 2020-04-21 15:03:05 CEST
Patch in git:phahn/replication
Comment 3 Jürn Brodersen univentionstaff 2021-04-30 14:30:32 CEST
I had a similar problem during product testing for ucs5. Just not with replication.py but with the s4 connector. The listener module "s4-connector.py" was writing the changes faster to "/var/lib/univention-connector/s4" than the connector was processing them. This filled up all space on the drive.

While the check for free space in the main loop worked, systemd restarted the service and the little space left was used up with log messages.

As a side note: 10MiB as the default value feels a bit low? Otherwise I would have hoped the s4 connector could have freed up some space while the listener was stopped.
Comment 4 Philipp Hahn univentionstaff 2021-06-07 16:56:19 CEST
*** Bug 53413 has been marked as a duplicate of this bug. ***