Bug 33993 – LDAP server restart is not reliable

Bug 33993 - LDAP server restart is not reliable


Summary:	LDAP server restart is not reliable

Status:	CLOSED FIXED

Product:	UCS
Classification:	Unclassified
Component:	LDAP
Version:	UCS 4.1
Hardware:	Other Linux

Importance:	P5 normal (vote)
Target Milestone:	UCS 4.1-2-errata
Assigned To:	Felix Botner
QA Contact:	Arvid Requate

URL:
Keywords:

Depends on:	33992
Blocks:
	Show dependency tree / graph

Reported:	2014-01-25 09:32 CET by Stefan Gohmann
Modified:	2016-08-03 15:56 CEST (History)
CC List:	2 users (show)

See Also:
What kind of report is it?:	---
What type of bug is this?:	---
Who will be affected by this bug?:	---
How will those affected feel about the bug?:	---
User Pain:
Enterprise Customer affected?:
School Customer affected?:
ISV affected?:
Waiting Support:
Flags outvoted (downgraded) after PO Review:
Ticket number:
Bug group (optional):
Max CVSS v3 score:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Stefan Gohmann

2014-01-25 09:32:06 CET

See Bug #33992, the LDAP server restart is not reliable.


+++ This bug was initially created as a clone of Bug #33992 +++

It happens on S3 master only:
-------------------------------------------------------------------------------
*** BEGIN *** ['/bin/bash', '68univention-admingrp-user-passwordreset-protected-groups'] ***
E: Object exists: cn=autotest090.local,cn=domain,cn=mail,dc=autotest090,dc=local
info 2014-01-24 19:05:05         create user k704j4ok
Object created: uid=k704j4ok,cn=users,dc=autotest090,dc=local
Object modified: cn=User Password Admins,cn=groups,dc=autotest090,dc=local
info 2014-01-24 19:05:06         create group lixucryo with Mailaddress 6qsp677a3unf8sszbarx
Object created: cn=lixucryo,cn=groups,dc=autotest090,dc=local
info 2014-01-24 19:05:07         create user d8goq5kb
Object created: uid=d8goq5kb,cn=users,dc=autotest090,dc=local
Object modified: cn=lixucryo,cn=groups,dc=autotest090,dc=local
info 2014-01-24 19:05:08         create user da1üy6cö
Object created: uid=da1üy6cö,cn=users,dc=autotest090,dc=local
Object modified: uid=da1üy6cö,cn=users,dc=autotest090,dc=local
info 2014-01-24 19:05:09         create user r3gmu7bh
Object created: uid=r3gmu7bh,cn=users,dc=autotest090,dc=local
Waiting for replication:
OK: replication complete (nid=4112 lid=4112)
Done: replication complete.
Waiting for postrun
Setting ldap/acl/user/passwordreset/protected/gid
Script: /etc/univention/templates/scripts/admingrp-user-passwordreset
Create ldap/acl/user/passwordreset/internal/groupmemberlist/lixucryo
Multifile: /etc/ldap/slapd.conf
Multifile: /etc/ldap/slapd.conf
Initiating graceful reload of ldap server(s).
Sending HUP to ldap server(s): slapd ...done.
Check database: ...Could not determine BDB version of /var/lib/univention-ldap/ldap.
Skipping /usr/bin/db4.8_recover to avoid damage.
Starting ldap server(s): slapd ...failed.
52e2ffd1 OVER: Loading Translog Overlay 52e2ffd1 OVER: db_init 52e2ffd1 OVER: Configuring Translog Overlay 52e2ffd1 OVER: Configured Translog Overlay to use file "/var/lib/univention-ldap/listener/listener" 52e2ffd1 OVER: db_close 52e2ffd1 OVER: db_destroy.
Multifile: /etc/ldap/slapd.conf
Traceback (most recent call last):
  File "/usr/share/univention-directory-manager-tools/univention-cli-server", line 222, in doit
    output = univention.admincli.admin.doit(arglist)
  File "/usr/lib/pymodules/python2.6/univention/admincli/admin.py", line 393, in doit
    out=_doit(arglist)
  File "/usr/lib/pymodules/python2.6/univention/admincli/admin.py", line 540, in _doit
    co=univention.admin.config.config(configRegistry['ldap/master'])
  File "/usr/lib/pymodules/python2.6/univention/admin/config.py", line 38, in __init__
    base=univention.admin.uldap.getBaseDN(host)
  File "/usr/lib/pymodules/python2.6/univention/admin/uldap.py", line 62, in getBaseDN
    return result[0][1]['namingContexts'][0]
KeyError: 'namingContexts'
-------------------------------------------------------------------------------
From this point all LDAP operations fail. The slapd can not be stopped through the init script:

root@master090:~# pidof slapd
17993
root@master090:~# /etc/init.d/slapd stop
Stopping ldap server(s): slapd ...done.
root@master090:~# pidof slapd
17993 
root@master090:~# cat /var/run/slapd/slapd.pid
cat: /var/run/slapd/slapd.pid: No such file or directory
root@master090:~#

The LDAP server is restarted very often in the LDAP test.

Comment 1 Stefan Gohmann

2016-05-06 17:01:28 CEST

Happened again: Ticket #2016050621000287

Comment 2 Stefan Gohmann

2016-05-06 17:03:26 CEST

Please provide the Erratum for UCS 4.1-1 and 4.1-2.

Comment 3 Felix Botner

2016-06-28 13:40:10 CEST

On a i386 system with ldap mdb and a mdb maxsize of at least 2147483648, slapd sometimes refuses to start (exactly the same setup as on Tiket #2016050621000287).
amd64 works fine.

-> while true; do /etc/init.d/slapd restart; done
Jun 28 00:31:54 slave2 slapd[23930]: mdb_db_open: database "dc=w2k12,dc=test" cannot be opened: Cannot allocate memory (12). Restore from backup!
Jun 28 00:31:54 slave2 slapd[23930]: backend_startup_one (type=mdb, suffix="dc=w2k12,dc=test"): bi_db_open failed! (12)
Jun 28 00:31:54 slave2 slapd[23930]: slapd stopped.

-> strace
4006  open("/var/lib/univention-ldap/ldap/data.mdb", O_RDWR|O_CREAT|O_LARGEFILE, 0600) = 17
4006  fstatfs64(17, 84, {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=12486714, f_bfree=11816933, f_bavail=11176882, f_files=3180464, f_ffree=3059825, f_fsid={11195914, 1546682992}, f_namelen=255, f_frsize=4096}) = 0
4006  uname({sys="Linux", node="slave2", ...}) = 0
4006  pread64(17, "\0\0\0\0\0\0\10\0\0\0\0\0\336\300\357\276\1\0\0\0\0\0\0\0\0\0\0\200\0\20\0\0\10\0\1\0\0\0\0\0\1\0\0\0\0\0\0\0\r\0\0\0\26\0\0\0\0\0\0\0\0\0\2\0\1\0\0\0\2\0\0\0\0\0\0\0T\0\0\0a\1\0\0t\1\0\0P\2\0\0", 92, 0) = 92
4006  pread64(17, "\1\0\0\0\0\0\10\0\0\0\0\0\336\300\357\276\1\0\0\0\0\0\0\0\0\0\0\200\0\20\0\0\10\0\1\0\0\0\0\0\1\0\0\0\0\0\0\0\16\0\0\0v\0\0\0\0\0\0\0\0\0\2\0\1\0\0\0\2\0\0\0\0\0\0\0T\0\0\0\33\0\0\0t\1\0\0Q\2\0\0", 92, 4096) = 92
4006  mmap2(NULL, 2147483648, PROT_READ, MAP_SHARED, 17, 0) = -1 ENOMEM (Cannot allocate memory)

-> gdb --args /usr/sbin/slapd -d -1 -h "ldapi:/// ldap://:7389/ ldap://:389/ ldaps://:7636/ ldaps://:636/"
break mdb_env_map
run
3739    {
(gdb) step
3741            unsigned int flags = env->me_flags;
(gdb) 
3739    {
(gdb) 
3782            if (flags & MDB_WRITEMAP) {
(gdb) 
3739    {
(gdb) 
3782            if (flags & MDB_WRITEMAP) {
(gdb) 
3781            int prot = PROT_READ;
(gdb) 
3787            env->me_map = mmap(addr, env->me_mapsize, prot, MAP_SHARED,
(gdb) print pro
No symbol "pro" in current context.
(gdb) print prot
$1 = 1
(gdb) print flags
$2 = 805306368
(gdb) print env->me_mapsize
$3 = 2147483648
(gdb) print addr
$4 = (void *) 0x0
(gdb) step
3789            if (env->me_map == MAP_FAILED) {
(gdb) 
3787            env->me_map = mmap(addr, env->me_mapsize, prot, MAP_SHARED,
(gdb) 
3789            if (env->me_map == MAP_FAILED) {
(gdb) print env->me_map
$5 = 0xffffffff <Address 0xffffffff out of bounds>
(gdb) 
$6 = 0xffffffff <Address 0xffffffff out of bounds>
(gdb) step
3790                    env->me_map = NULL;
(gdb) 
3791                    return ErrCode();
(gdb) 
0xb7a81440 in __errno_location () from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0

Seems that slapd can not allocate enough memory for the database memory map. Strange thing is, that this happens sporadically. And my test script

#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>

int main (int argc, char *argv[])
{
        char *p;
        int fd;
        fd = open (argv[0], O_RDONLY);
        int prot = PROT_READ;
        int size = 2147483648;


        fd = open (argv[0], O_RDONLY);
        p = mmap (NULL, size, prot, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                printf ("p: %d", p);
                perror ("mmap");
                return 1;
        }
        close (fd);
        return 0;
}
-> gcc -g -D_FILE_OFFSET_BITS=64 test.c

always works.

But here https://symas.com/getting-down-and-dirty-with-lmdb-qa-with-symas-corporations-howard-chu-about-symass-lightning-memory-mapped-database/ is explained, that there is indeed a limit for the size of database on i386 systems (which happens to be our default).

A: No. The maximum database size is constrained only by the amount of disk space available and by the size of the address space on the machine. For 32-bit implementations it’s restricted to approximately 2^31 bytes (2 GB), and for 64-bit implementations, which typically bring 48 address bits out of the CPU, it’s restricted to 2^47 bytes (128 TB). The operating system takes care of moving data in and out of available memory as needed. This means that database sizes can be many multiples of available physical memory.

So maybe we should reduce the mdb maxsize default?

All i can do for now, is to triple check slapd start in /etc/init.d/slapd on i686 systems. If the start fails, just try it again.


univention-ldap: 12.1.6-24.815.201606281334

Comment 4 Sönke Schwardt-Krummrich

2016-07-07 15:09:44 CEST

Customer reported that reducing the maxsize made his slapd restart reliable again. Maybe we want to change the maxsize by default?

Comment 5 Arvid Requate

2016-07-11 11:19:30 CEST

+1 from my side. It's a bit unclear though which value to choose. So, it might not be bad to have this fallback in place too? My recommendation to Felix was: In case a failure has been detected decrease maxsize by 1 and try again. This should converge to the the optimal value :-)

Comment 6 Felix Botner

2016-07-25 16:11:54 CEST

Changed default for i686 system to 1.9GB in univention-ldap.

Not changed documentation as the recommendation is to not migrate i386 system to MDB.

   "If BDB is still in use, a migration to MDB should be performed for amd64 
    systems (not i386).

Comment 7 Arvid Requate

2016-08-02 20:32:12 CEST

Ok, works.

Comment 8 Janek Walkenhorst

2016-08-03 15:56:26 CEST

<http://errata.software-univention.de/ucs/4.1/222.html>