Bug 39482 - atd hangs using 100% CPU in docker
Status: CLOSED FIXED
Product: UCS
Classification: Unclassified
Component: Docker
Version: UCS 4.1
Hardware: Other Linux
Importance: P5 normal
Target Milestone: UCS 4.1
Assigned To: Daniel Tröder
QA Contact: Felix Botner
: interim-2
Depends on:
Blocks: 40134
Reported: 2015-10-06 15:42 CEST by Daniel Tröder
Modified: 2015-11-30 19:43 CET
CC List: 3 users

Description Daniel Tröder univentionstaff 2015-10-06 15:42:41 CEST
Whenever a container from the image docker.software-univention.de/ucs-appbox-amd64:4.1-0 is started, the atd process inside the container hangs, using a full CPU core.
Comment 1 Daniel Tröder univentionstaff 2015-10-23 14:29:23 CEST
I cannot reproduce this on any host other than 10.200.3.26.

From the atd man page:

WARNING
       atd won't work if its spool directory is mounted via NFS even if no_root_squash is set.

Of course it's not, but maybe the storage on that host differs in some way that is not apparent to me.

[..]

I used "strace -f /usr/sbin/atd" to see what was going on.
After some initialization, the following repeats indefinitely:

stat(".", {st_mode=S_IFDIR|S_ISVTX|0770, st_size=4096, ...}) = 0
open(".", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
getdents(4, /* 4 entries */, 32768)     = 112
stat("a00001016f9298", {st_mode=S_IFREG|0700, st_size=767, ...}) = 0
unlink("=00001016f9298")                = -1 ENOENT (No such file or directory)
getdents(4, /* 0 entries */, 32768)     = 0
close(4)                                = 0

The file "a00001016f9298" exists in /var/spool/cron/atjobs/; it is the job scheduled by univention-mail-postfix to run create-dh-parameter-files.sh in 1 day. Why atd is (unsuccessfully) trying to unlink "=00001016f9298" is not clear.
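The spool file name itself can be decoded. A minimal sketch, assuming the naming scheme of the Debian `at` package: one queue letter, a 5-digit hexadecimal job number, then 8 hexadecimal digits giving the scheduled run time in minutes since the Unix epoch. As far as we know, atd uses the same name with the queue letter replaced by "=" as a lock link while a job runs, which would explain the name in the failing unlink call. The function `decode_at_job` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decode an atd spool file name such as "a00001016f9298".
# Assumed layout (Debian at): <queue: 1 char><job number: 5 hex><minutes since epoch: 8 hex>
decode_at_job() {
    local name=$1
    local queue=${name:0:1}
    # 16#... makes bash read the substrings as hexadecimal numbers
    local jobno=$((16#${name:1:5}))
    local minutes=$((16#${name:6:8}))
    printf 'queue=%s job=%d run_time=%d\n' "$queue" "$jobno" "$((minutes * 60))"
}

decode_at_job a00001016f9298   # queue=a job=1 run_time=1445354400
```

Under that assumption the job was due at Unix time 1445354400, i.e. 2015-10-20 15:20 UTC, so it was already overdue when Comment 1 was written.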

There seemed to be an error in the base image. After deleting all images and installing a docker-app from scratch, the problem disappeared.
Comment 2 Felix Botner univentionstaff 2015-10-28 14:50:02 CET
21885 daemon    20   0 16676  144    0 R  96,7  0,0  45:47.95 atd


UCS 4.1 dudle-docker app installed
Comment 3 Stefan Gohmann univentionstaff 2015-10-30 08:18:14 CET
Maybe overlayfs is the problem?
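One way to probe this, sketched under an assumption that was not established in this bug: that atd (as we read the Debian `at` sources) treats a job file with more than one hard link as still locked by an earlier run, and, once its run time plus a check interval (assumed here to be 3600 s) has passed, removes the "=" lock name and rescans immediately. If the container's storage reported a wrong link count for the job file, that would produce exactly the stat/unlink loop from the strace in Comment 1. The function `check_atjobs` is hypothetical and relies on GNU `stat -c %h`.

```bash
#!/usr/bin/env bash
# Hypothetical diagnostic: flag at jobs that, under the assumption above, would make
# atd retry immediately: more than one hard link and a run time that is overdue.
check_atjobs() {
    local dir=${1:-/var/spool/cron/atjobs}
    local check_interval=3600   # assumed value of atd's check interval, in seconds
    local now f name links run_time
    now=$(date +%s)
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        # Only real job files (queue letter + 13 hex digits); skips "=" lock links
        [[ $name =~ ^[a-zA-Z][0-9a-f]{13}$ ]] || continue
        links=$(stat -c %h "$f")   # GNU coreutils: number of hard links
        run_time=$((16#${name:6:8} * 60))
        if ((links > 1 && run_time + check_interval <= now)); then
            echo "SUSPECT $name links=$links"
        fi
    done
}
```

Running `check_atjobs` as root inside an affected container would show whether the job file from the image carries an unexpected link count.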
Comment 4 Daniel Tröder univentionstaff 2015-11-03 10:31:58 CET
The scheduling of the Postfix DH parameters regeneration for the day after the Docker _image_ generation didn't make much sense. It was moved to one day after the creation of the _container_.

Commit: 65106
Packages: univention-mail-postfix, univention-appcenter

Please rebuild the images that install univention-container-role-server-common, as that depends on univention-mail-postfix. The scripts installing those are:
* /usr/share/univention-docker-dev/scripts/create-docker-ucs-appbox-image.sh
* /usr/share/univention-docker-dev/scripts/create-docker-ucs-role-image.sh

This leaves open the question of whether images for other virtualization platforms have the same problem.

IMHO we could completely remove the recreation of the 2048-bit DH parameters file. It is the one from the RFC and is considered safe. The immediate recreation of the 512-bit file could be done at package post-installation time, as it costs very little time and resources.
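A minimal sketch of what such a postinst step could look like, assuming OpenSSL's `dhparam` command is available. The function name and the output path are hypothetical; they are not taken from univention-mail-postfix or its create-dh-parameter-files.sh.

```bash
#!/usr/bin/env bash
# Hypothetical postinst helper: regenerate only the cheap 512-bit DH parameter
# file and leave the shipped 2048-bit RFC 3526 group untouched.
generate_dh512() {
    local out=$1                  # e.g. the file named by Postfix's smtpd_tls_dh512_param_file
    local tmp="$out.tmp.$$"
    # Write to a temporary file first so a failed run never leaves a truncated file
    if openssl dhparam -out "$tmp" 512 2>/dev/null; then
        chmod 0644 "$tmp"
        mv -f "$tmp" "$out"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Writing to a temporary name and renaming keeps the service from ever reading a half-written parameter file.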
Comment 5 Sönke Schwardt-Krummrich univentionstaff 2015-11-03 12:01:43 CET
(In reply to Daniel Tröder from comment #4)
> The scheduling of the Postfix DH parameters regeneration for the day after
> the Docker _image_ generation didn't make much sense. It was moved to one
> day after the creation of the _container_.

The following line does not work for detecting whether the postinst is run within a container. The component ucs-container will be removed (soon).

> if ! is_ucr_true repository/online/component/ucs-container; then

The following might help to detect whether it is run within a container. Maybe the at job itself has to test, after 1 day, whether it is run within a Docker container and not on bare metal.

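# Extract the 64-character Docker container ID from the process's cgroup
# entries; the variable stays empty when no such ID is found.
container_uuid=$(grep -oe "docker.*" /proc/self/cgroup | head -1 | \
    sed -rn 's|docker.*([0-9a-f]{64}).*|\1|p')
if [ -n "$container_uuid" ]; then
    echo "RUNNING IN DOCKER CONTAINER"
else
    echo "RUNNING ON 'BARE' METAL"
fi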

→ REOPEN
Comment 6 Sönke Schwardt-Krummrich univentionstaff 2015-11-03 12:12:46 CET
Alternatively the UCR variable docker/container/uuid should be set automatically during startup of the container. So if the variable is present, it should be a container.
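A sketch of that check, assuming the UCS `ucr get` command prints the variable's value and prints nothing when it is unset; `in_docker_container` is a hypothetical name. Like the cgroup test, it can only succeed in a started container, not in the chroot used to build the image.

```bash
#!/usr/bin/env bash
# Hypothetical helper: exit status 0 if the UCR variable docker/container/uuid
# is set, which is taken here to mean "running inside a started container".
in_docker_container() {
    local uuid
    uuid=$(ucr get docker/container/uuid 2>/dev/null)
    [ -n "$uuid" ]
}
```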
Comment 7 Daniel Tröder univentionstaff 2015-11-03 12:57:38 CET
The at call in the postinst can safely run inside a started container. It should just not run when the image is created; that is done in a chroot weeks or months ahead.

Detection via docker/container/uuid or /proc/self/cgroup does not work in the chroot, as it is not a started Docker container.

The UCR variable repository/online/component/ucs-container does exist in the chroot at that time (at least, that is what I gather from reading create-docker-ucs-appbox-image.sh).
Comment 8 Arvid Requate univentionstaff 2015-11-03 15:59:01 CET
Regarding Comment 6: This is already done in /sbin/init, but it doesn't help here, since the at job is currently created in a postinst script at build time of the container image.


I support the proposal of Comment 4: Removal of the scheduled automatic creation of individualized DH parameters. If people need that, they can run the creation script once manually. An SDB article should be created to explain the details.

Proposal for a corresponding UCS 4.1-0 changelog entry:

 "The automatic generation of individual DH parameters has been disabled to avoid issues with pre-installed images like Docker containers and appliances. Individual DH parameters can be created manually at any time by running /usr/share/univention-mail-postfix/create-dh-parameter-files.sh once and restarting the affected services (e.g. postfix). A forthcoming SDB article will explain the details."


If this is not acceptable to someone, my next proposal would be to move the creation of the at job from the postinst to a joinscript. This would fix the issue for all Docker containers that are not Masters. For Master appliances we may want to disable running joinscripts at build time anyway and postpone them to the time of instantiation.
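A minimal sketch of the scheduling step a joinscript could perform instead of the postinst, so the job is queued on the running system one day after the join. Only the scheduling is shown; the usual joinscript boilerplate is omitted and the function name is hypothetical. The script path is the one named in the changelog proposal.

```bash
#!/usr/bin/env bash
# Hypothetical joinscript fragment: queue the DH parameter regeneration for one
# day after the join instead of one day after package installation.
schedule_dh_regeneration() {
    local script=/usr/share/univention-mail-postfix/create-dh-parameter-files.sh
    # at(1) reads the command to run from stdin; "now + 1 day" is standard at time syntax
    printf '%s\n' "$script" | at now + 1 day
}
```

Passing the script path on stdin (rather than `at -f`) runs the script with its own interpreter instead of feeding its contents to /bin/sh.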
Comment 9 Sönke Schwardt-Krummrich univentionstaff 2015-11-04 09:34:44 CET
(In reply to Arvid Requate from comment #8)
> Regarding Comment 6: This is done already in /sbin/init but doesn't help
> here since the at-Job is currently created in a postinst-Script at the
> build-time of a container.

I think the at job should be set up in any case, but the at job itself should check whether it is executed in a container environment, and may behave differently in that case.

> I support the proposal of Comment 4: Removal of the scheduled automatic
> creation of individualized DH parameters. If people need that, they can run
> the creation script once manually. An SDB article should be created to
> explain the details.

Creating one's own DH 2048 file avoids/impedes brute-force attacks based on precalculated tables. In the past, the main problem was that everyone used the hard-coded, compiled-in DH file. I think the DH file should be calculated automatically in the background after the instance has been instantiated. Does installing haveged on the host/container help with creating the DH file?
If not, I would suggest that we precalculate a number of DH 2048 files, ship them with univention-mail-postfix and select one randomly during postinst. The precalculated one would later be replaced by the at job.
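A sketch of that fallback, assuming a hypothetical directory of pre-generated 2048-bit files shipped with the package; neither the directory nor the function name exists in univention-mail-postfix. `shuf` is from GNU coreutils.

```bash
#!/usr/bin/env bash
# Hypothetical postinst fragment: install one randomly chosen, pre-generated DH
# parameter file; the scheduled job would later replace it with a fresh one.
install_random_dh2048() {
    local pool=$1 target=$2 choice
    choice=$(find "$pool" -maxdepth 1 -type f -name '*.pem' | shuf -n 1)
    [ -n "$choice" ] || return 1        # empty pool: leave the target untouched
    install -m 0644 "$choice" "$target"
}
```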

> If this is not acceptable for anybody, my next proposal would be to move the
> creation of the at-job from the postinst to a joinscript. This would fix the
> issue for all docker containes that are not Master. For Master appliances we
> may want to disable running joinscripts at build-time anyway and postpone it
> to the time of instanciation.

I second that. Aren't the join scripts skipped by the scripts themselves during DC master installation (→ master not joined during join script execution)?
Comment 10 Arvid Requate univentionstaff 2015-11-04 11:22:52 CET
See Bug 38685#c4 and Bug 38685#c7 for the rationale behind using group 14 of RFC 3526. For additional reference, https://weakdh.org/sysadmin.html says:

"It is fine to leave diffie-hellman-group14-sha1, which uses a 2048-bit prime."

Remember also that, so far, only groups smaller than 1024 bits have been shown to be precomputable with *nation state* scale budgets.
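For reference, RFC 3526 defines group 14 as a 2048-bit MODP group with generator g = 2 and the prime below (the brackets denote the integer part). This is quoted from the RFC, not derived in this bug:

```latex
p = 2^{2048} - 2^{1984} - 1 + 2^{64} \cdot \left( \left\lfloor 2^{1918}\,\pi \right\rfloor + 124476 \right), \qquad g = 2
```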


But at the end of the day, I agree that it is always safer to create individual DH parameter groups per server. This should be done once for each server, not for each service individually. That is planned via Bug 39158.

And yes, haveged helps, but that is not the issue of this bug.

As far as I understood Daniel, *atd* is having a problem, regardless of what the script does.


If we want to stick with automatic scheduling of DH parameter creation, then we should move the creation of the at-Job to the joinscript. I guess everybody would be happy with that?
Comment 11 Stefan Gohmann univentionstaff 2015-11-04 13:36:02 CET
(In reply to Arvid Requate from comment #10)
> If we want to stick with automatic scheduling of DH parameter creation, then
> we should move the creation of the at-Job to the joinscript. I guess
> everybody would be happy with that?

Yes, that would be fine.
Comment 12 Daniel Tröder univentionstaff 2015-11-04 14:28:25 CET
* The scheduling was removed from the appcenter code.
* The DH parameter generation is now scheduled in the join script.
* 512-bit DH parameters are created in postinst, as they are the only ones that are possibly weak and can be generated in about 1 second.

Commit: 65166
4.1 changelog entry: 65168 + 65169
Comment 13 Felix Botner univentionstaff 2015-11-05 11:12:43 CET
OK - atd in docker
OK - DH parameters in univention-mail-postfix
OK - changelog entry
Comment 14 Stefan Gohmann univentionstaff 2015-11-17 12:11:55 CET
UCS 4.1 has been released:
 https://docs.software-univention.de/release-notes-4.1-0-en.html
 https://docs.software-univention.de/release-notes-4.1-0-de.html

If this error occurs again, please use "Clone This Bug".