Bug 34105 - ntpdate by u-system-setup-base/40_ssl/10ssl causes fatal UMC process exit and breaks whole system
ntpdate by u-system-setup-base/40_ssl/10ssl causes fatal UMC process exit and...
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: UMC (Generic)
UCS 3.2
Other Linux
: P5 normal (vote)
: UCS 3.2-2-errata
Assigned To: Janek Walkenhorst
Stefan Gohmann
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-02-11 23:52 CET by Philipp Hahn
Modified: 2014-08-07 17:43 CEST (History)
3 users (show)

See Also:
What kind of report is it?: ---
What type of bug is this?: ---
Who will be affected by this bug?: ---
How will those affected feel about the bug?: ---
User Pain:
Enterprise Customer affected?:
School Customer affected?:
ISV affected?:
Waiting Support:
Ticket number:
Bug group (optional):
Max CVSS v3 score:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Philipp Hahn univentionstaff 2014-02-11 23:52:16 CET
The UMC-module-setup process exits when the system time is updated and the delta is too large.
This aborts any running univention-system-setup-script.

I was able to reliably trigger this in a VM reverting back to a previous snapshot and by running USS again to change the SSL certificate settings.
As the module process is the sole reader of the pipe, to which all scripts in /usr/lib/univention-system-setup/scripts/ write their STDOUT/STDERR, the next script doing so is aborted with SIGPIPE, which also aborts all following scripts.

For me it was 40_ssl/10ssl, which results in an almost empty directory /etc/univention/ssl/ only containing the sub-directory ucsCA/ and some dangling symlinks.
This completely breaks the system, as slapd, u-d-listener, apache2, ... will miss the host certificate when they're next restarted and will no longer work, rendering the whole system unusable.

Sometimes the script 10ssl succeeds, but then a later script fails.

management-console-server.log:
...
09:59:53.227 PROTOCOL  (INFO   ): Received UMCP RESPONSE 139210919284144-42
20:27:36.145 PARSER    (INFO   ): UMCP REQUEST 139210919344386-43 parsed successfully
20:27:36.145 MAIN      (INFO   ): Incoming request of type COMMAND
20:27:36.145 RESOURCES (INFO   ): Searching for module providing command setup/finished
20:27:36.145 RESOURCES (INFO   ): Found module setup
20:27:36.146 MAIN      (INFO   ): Passing new request to running module setup
20:27:36.146 PROTOCOL  (INFO   ): Sending UMCP COMMAND REQUEST 139210919344386-43
20:27:36.246 MAIN      (PROCESS): ModuleProcess: child died
20:27:36.246 MAIN      (INFO   ): Module process setup died on purpose
20:27:36.246 MAIN      (INFO   ): Checking for kill timer ()
20:27:36.246 MAIN      (WARN   ): Cleaning up requests
20:27:36.246 MAIN      (WARN   ): Invalidating all pending requests 139210919344386-43
20:27:36.246 MAIN      (WARN   ): Remove inactivity timer
...

setup.log:
...
=== 40_ssl/10ssl (2014-02-11 09:59:53) ===
__NAME__:40_ssl/10ssl Generating SSL certificate
rdate: time.fu-berlin.de: Name or service not known
Tue Feb 11 20:27:36 CET 2014

management-console-module-setup.log:
...
09:59:53.198 MODULE     (INFO  ): Running script /usr/lib/univention-system-setup/scripts/40_ssl/10ssl
09:59:53.227 PROTOCOL   (INFO  ): Sending UMCP RESPONSE 139210919284144-42
20:27:36.082 MODULE     (INFO  ): Commiting suicide
20:27:36.082 MAIN       (WARN  ): Shutting down all open connections
20:28:16.161 DEBUG_INIT
20:28:16.161 MODULE     (ERROR ): The specified locale is not available (cmdline: en_US.UTF-8)
...

1. univention-system-setup/scripts/40_ssl/10ssl probably should not call ntpdate. See Bug #27728 for further issues regarding NTP.

2. A UMC module process should use better use a "monotonic clock" instead of "wall clock", which is immune to NTP time changes: <http://stackoverflow.com/questions/1205722/how-do-i-get-monotonic-time-durations-in-python>

3. U-S-S should probably better write the output into some log file to guarantee the process continue to run even when the sole reader quits. Similar to the updater-log-polling the Univention System Setup UMC-module should poll that log file and display/parse its tail.
Comment 1 Lukas Walter univentionstaff 2014-06-13 12:58:09 CEST
1. 40_ssl/10ssl needs to call rdate right now to prevent the generation of invalid SSL certificates, see Bug #13549

2. As UMC uses python-notifier for testing timeouts, and python-notifier uses wall-clock time, a switch to "monotonic clock" would be difficult.
Instead, UMC module processes will not die by timeout anymore if the time delta is bigger then one and a half the timeout, which can be considered a safe hint for the fact that something went wrong.

3. U-S-S will now directly write the setup script output to it's log and watch the file end for new content, as suggested. I've tested the progress bar after theses changes without being able to find any problems with them. SIGPIPES should not occur anymore.


svn 51039
Comment 2 Alexander Kläser univentionstaff 2014-06-15 08:18:01 CEST
(In reply to Lukas Walter from comment #1)
> 1. 40_ssl/10ssl needs to call rdate right now to prevent the generation of
> invalid SSL certificates, see Bug #13549
> 
> 2. As UMC uses python-notifier for testing timeouts, and python-notifier
> uses wall-clock time, a switch to "monotonic clock" would be difficult.
> Instead, UMC module processes will not die by timeout anymore if the time
> delta is bigger then one and a half the timeout, which can be considered a
> safe hint for the fact that something went wrong.

Would it not be easier to adjust the timer similarly to the timer in the UMC web server, i.e., register a timer every 1sec that counts down the timeout? The check for 1.5 * self.__timeout might still leave cases where the problem occurs (although the probability of theses cases to happen is much lower, indeed), and this would then be really awkward to debug + understand, IMHO.

> 3. U-S-S will now directly write the setup script output to it's log and
> watch the file end for new content, as suggested. I've tested the progress
> bar after theses changes without being able to find any problems with them.
> SIGPIPES should not occur anymore.

Which kind of problem does this adjustment solve exactly?
Comment 3 Lukas Walter univentionstaff 2014-06-16 16:06:11 CEST
(In reply to Alexander Kläser from comment #2)
> (In reply to Lukas Walter from comment #1)
> > 1. 40_ssl/10ssl needs to call rdate right now to prevent the generation of
> > invalid SSL certificates, see Bug #13549
> > 
> > 2. As UMC uses python-notifier for testing timeouts, and python-notifier
> > uses wall-clock time, a switch to "monotonic clock" would be difficult.
> > Instead, UMC module processes will not die by timeout anymore if the time
> > delta is bigger then one and a half the timeout, which can be considered a
> > safe hint for the fact that something went wrong.
> 
> Would it not be easier to adjust the timer similarly to the timer in the UMC
> web server, i.e., register a timer every 1sec that counts down the timeout?
> The check for 1.5 * self.__timeout might still leave cases where the problem
> occurs (although the probability of theses cases to happen is much lower,
> indeed), and this would then be really awkward to debug + understand, IMHO.

Implemented. In addition, the USS module will no more return successfull execution before even processing the request, which will keep the USS module process from terminating by timeout while there are still requests to be processed, as UMC module processes do not time out with remaining active requests (svn 51088)

> > 3. U-S-S will now directly write the setup script output to it's log and
> > watch the file end for new content, as suggested. I've tested the progress
> > bar after theses changes without being able to find any problems with them.
> > SIGPIPES should not occur anymore.
> 
> Which kind of problem does this adjustment solve exactly?

This prevents the currently running setup script from receiving SIGPIPE when the USS module process dies for some reason. Before, the USS module process read the STDOUT/STDERR of each executed setup script via pipe, which always results in SIGPIPEs for the writing process (the setup scripts) if no reader is left at the other side.
Comment 4 Janek Walkenhorst univentionstaff 2014-07-01 14:47:27 CEST
When running univention-system-setup-boot on an unjoined 3.2 UCS with large (2 year) time behind upstream does not lead to the same errors as without the fixes.

But when starting the configuration process a dialog pops up immediately saying the configuration process was completed, although the progress bar is still showing and the setup scripts are still running.

Also there are still SIGPIPE-errors in the setup.log:
 […]
 Please enter the following 'extra' attributes
 to be sent with your certificate request
 A challenge password []:An optional company name [Univention GmbH]:yes: Standardausgabe: Datenübergabe unterbrochen (broken pipe)
 yes: Schreibfehler
 Using configuration from openssl.cnf
 Check that the request matches the signature
 Signature ok
 […]

Interestingly the files in /etc/univention/ssl do exist.


Additionally the UMC does not show most modules, but this might be unrelated.
Comment 5 Alexander Kläser univentionstaff 2014-07-15 12:28:38 CEST
I do not understand the following line in the YAML file:

  * Request successfull request processing will not be reported before the actual
    request processing anymore.
Comment 6 Janek Walkenhorst univentionstaff 2014-07-29 16:49:05 CEST
(In reply to Alexander Kläser from comment #5)
> I do not understand the following line in the YAML file:
> 
>   * Request successfull request processing will not be reported before the
> actual
>     request processing anymore.
Reworded.

(In reply to Janek Walkenhorst from comment #4)
> When running univention-system-setup-boot on an unjoined 3.2 UCS with large
> (2 year) time behind upstream does not lead to the same errors as without
> the fixes.
> 
> But when starting the configuration process a dialog pops up immediately
> saying the configuration process was completed, although the progress bar is
> still showing and the setup scripts are still running.
With the other changes to u-s-s-boot the system setup works when the time is changed.
Comment 7 Florian Best univentionstaff 2014-07-29 17:41:01 CEST
Well, there is still a typo in the YAML "Successfull" → Successful. Also i wouldn't write that implementation detail but more something which is related to the bug title.
Comment 8 Janek Walkenhorst univentionstaff 2014-07-30 15:06:20 CEST
(In reply to Florian Best from comment #7)
> Well, there is still a typo in the YAML "Successfull" → Successful.
Fixed.
Comment 9 Stefan Gohmann univentionstaff 2014-08-01 13:27:42 CEST
YAML: OK

Code review: OK 
 - univention-system-setup: OK
 - univention-management-console: OK

Tests: OK
 date -s "$(LC_ALL=C date -d '2 years ago')"
 mv /etc/univention/ssl /etc/univention/ssl.BAK
 /var/lib/dpkg/info/univention-ssl.postinst
 /etc/init.d/univention-system-setup-boot start
Comment 10 Janek Walkenhorst univentionstaff 2014-08-07 17:42:21 CEST
http://errata.univention.de/ucs/3.2/159.html
Comment 11 Janek Walkenhorst univentionstaff 2014-08-07 17:43:19 CEST
http://errata.univention.de/ucs/3.2/160.html