Wednesday, February 27, 2008

More OpenLDAP

I tested OpenLDAP multimaster replication some more today. I went back to the official release version, 2.4.8. I ran my test on a 64-bit AMD machine with both servers on the same box--different ports and different installation folders. Things ran fine for the most part. Synchronization was consistent in loading and removing 5000 entries like this:

dn: cn=Fred XX_0,dc=mgm,dc=com
objectClass: inetOrgPerson
objectClass: top
givenName: Fred
sn: XX
cn: Fred XX_0

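For anyone wanting to reproduce the test, here's a rough sketch of how the bulk-load LDIF could be generated. It follows the `cn=Fred XX_n` naming of the entry above; the output filename is my own choice, not part of my actual setup:

```python
# Generate a bulk-load LDIF of numbered inetOrgPerson entries
# matching the "cn=Fred XX_n" pattern shown above.
def make_ldif(count, base_dn="dc=mgm,dc=com"):
    entries = []
    for i in range(count):
        cn = "Fred XX_%d" % i
        entries.append(
            "dn: cn=%s,%s\n" % (cn, base_dn)
            + "objectClass: inetOrgPerson\n"
            + "objectClass: top\n"
            + "givenName: Fred\n"
            + "sn: XX\n"
            + "cn: %s\n" % cn
        )
    # Entries in an LDIF file are separated by blank lines.
    return "\n".join(entries)

if __name__ == "__main__":
    with open("bulk.ldif", "w") as f:
        f.write(make_ldif(5000))
```

The resulting file can then be fed to ldapadd against one of the two servers.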
I don't know much about Berkeley DB, but my issues seemed to happen after abrupt shutdowns and what I think are database corruptions. slapd cored on each server in the same place at different times during the testing:

Server 1 Core:

#0 is_ad_subtype (sub=0x0, super=0x832430) at ad.c:489
#1 0x00000000004213a7 in attrs_find (a=0x936998, desc=0x832430) at attr.c:647
#2 0x00000000004352bf in test_ava_filter (op=0x41801640, e=0x914a18, ava=0x41800fb0, type=163) at filterentry.c:617
#3 0x0000000000435761 in test_filter (op=0x41801640, e=0x914a18, f=0x41800fd0) at filterentry.c:88
#4 0x0000000000480d9b in bdb_search (op=0x41801640, rs=0x41800ec0) at search.c:845
#5 0x0000000000470be2 in overlay_op_walk (op=0x41801640, rs=0x41800ec0, which=op_search, oi=0x87b2b0, on=0x0) at backover.c:653
#6 0x00000000004710d5 in over_op_func (op=0x41801640, rs=0x41800ec0, which=op_search) at backover.c:705
#7 0x000000000046a56e in syncrepl_entry (si=0x8a31b0, op=0x41801640, entry=0x90add8, modlist=0x418015a8, syncstate=1,
syncUUID=, syncCSN=0x0) at syncrepl.c:1989
#8 0x000000000046c395 in do_syncrep2 (op=0x41801640, si=0x8a31b0) at syncrepl.c:844
#9 0x000000000046de8c in do_syncrepl (ctx=0x41801df0, arg=) at syncrepl.c:1226
#10 0x000000000041a692 in connection_read_thread (ctx=0x41801df0, argv=) at connection.c:1213
#11 0x00000000004fa6f4 in ldap_int_thread_pool_wrapper (xpool=0x83c890) at tpool.c:625
#12 0x00000035f1e06407 in start_thread () from /lib64/
#13 0x00000035f12d4b0d in clone () from /lib64/

After this happened, a restart of slapd would always hang on either server with this stack trace:

#0 0x00000035f1e076dd in pthread_join () from /lib64/
#1 0x00000000004e4501 in syncprov_db_open (be=0x8a27c0, cr=) at syncprov.c:2632
#2 0x0000000000470868 in over_db_func (be=0x8a27c0, cr=0x7fffad390610, which=) at backover.c:62
#3 0x0000000000427350 in backend_startup_one (be=0x8a27c0, cr=0x7fffad390610) at backend.c:224
#4 0x000000000042761a in backend_startup (be=0x8a27c0) at backend.c:316
#5 0x000000000040505a in main (argc=4, argv=0x7fffad3908d8) at main.c:932

I'm happy to see Gavin's interest in my blog. It shows his dedication--a tribute to the open-source community.

OpenLDAP MultiMaster Replication Redemption

After some tweaking of my slapd.conf, I was able to get multimaster replication working a lot more reliably than in my earlier attempts. I pulled in the latest source code--whatever was committed to CVS after the 2.4.8 release--and tested with this version. Here are the syncrepl settings from my current slapd.conf:

#serverID 1 ldap://server1:9009
serverID 2

overlay syncprov

syncRepl rid=1
retry="5 + 5 +"

syncRepl rid=2
retry="5 + 5 +"

mirrormode true
database monitor

It seems that if you switch replication types from refreshAndPersist to refreshOnly, things get messed up. I prefer refreshAndPersist. I left the interval option in my config, though it's not used in refreshAndPersist mode. Previously, my retry intervals were very short, so I think this is why I was getting unpredictable results in the number of entries replicated. The timeouts may have been hit and I wasn't waiting long enough.
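For reference, here is a fuller sketch of what one node's syncrepl stanza looks like with the settings discussed above. The host name, port, searchbase, and credentials below are placeholders, not my actual config, and the elided options from my posted snippet stay elided:

```
serverID 2

overlay syncprov

syncrepl rid=1
  provider=ldap://server1:9009
  type=refreshAndPersist
  searchbase="dc=mgm,dc=com"
  bindmethod=simple
  binddn="cn=admin,dc=mgm,dc=com"
  credentials=secret
  retry="5 +"

mirrormode true
database monitor
```

With type=refreshAndPersist, the consumer holds a persistent connection to the provider, so a retry of "5 +" (retry every 5 seconds indefinitely) only matters when that connection drops.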

I stress tested OpenLDAP much more than I have FDS. I would stress test by ctrl-c'ing a running slapd process while thousands of adds or deletes were being applied to another server. Without hard-stopping a server, replication seemed to work well. With hard stops, the only issue I had was that the stopped server would sometimes freeze and hang when coming back up while servicing an ldapsearch at the same time--some sort of connection deadlock. I had to kill -9 slapd and start it again, and then it would sync back up. In most deployments, servers being killed and restarted over and over would not be a common occurrence.

I still saw a few SEGVs when bringing up a server after stopping it during bulk adds or deletes--like this one I just got. I deleted 5000 entries on server1, then, as server2 was processing the deletes, I ctrl-c'd server2. Bringing server2 back up, I got this:

#0 0x00ace375 in memmove () from /lib/
#1 0x0810a51c in bdb_dn2id_children (op=0x8d71028, txn=0x0, e=0x8be1cd4) at dn2id.c:351
#2 0x0810606f in bdb_cache_children (op=0x8d71028, txn=0x0, e=0x8be1cd4) at cache.c:1008
#3 0x080e4cac in bdb_hasSubordinates (op=0x8d71028, e=0x8be1cd4, hasSubordinates=0xa15caa0c) at operational.c:54
#4 0x080e4e09 in bdb_operational (op=0x8d71028, rs=0xa168c168) at operational.c:101
#5 0x080d3001 in overlay_op_walk (op=0x8d71028, rs=0xa168c168, which=op_aux_operational, oi=0x8b74868, on=0x8b74e38) at backover.c:653
#6 0x080d351d in over_op_func (op=0x8d71028, rs=0xa168c168, which=op_aux_operational) at backover.c:705
#7 0x0808066b in fe_aux_operational (op=0x8d71028, rs=0xa168c168) at backend.c:1868
#8 0x080802b9 in backend_operational (op=0x8d71028, rs=0xa168c168) at backend.c:1885
#9 0x08084531 in slap_send_search_entry (op=0x8d71028, rs=0xa168c168) at result.c:764
#10 0x080e7953 in bdb_search (op=0x8d71028, rs=0xa168c168) at search.c:869
#11 0x080d3001 in overlay_op_walk (op=0x8d71028, rs=0xa168c168, which=op_search, oi=0x8b74868, on=0x8b74e38) at backover.c:653
#12 0x080d351d in over_op_func (op=0x8d71028, rs=0xa168c168, which=op_search) at backover.c:705
#13 0x08076256 in fe_op_search (op=0x8d71028, rs=0xa168c168) at search.c:368
#14 0x08076a47 in do_search (op=0x8d71028, rs=0xa168c168) at search.c:217
#15 0x0807416c in connection_operation (ctx=0xa168c238, arg_v=0x8d71028) at connection.c:1084
#16 0x080748e0 in connection_read_thread (ctx=0xa168c238, argv=0x10) at connection.c:1211
#17 0x0816c9a4 in ldap_int_thread_pool_wrapper (xpool=0x8b4eea0) at tpool.c:663
#18 0x00d0650b in start_thread () from /lib/
#19 0x00b30b2e in clone () from /lib/

Running slapd again, it came up just fine and synchronization started again.

OpenLDAP multimaster replication is working a lot better than I first experienced. It's not bulletproof, but it's probably adequate now for many situations.

Sunday, February 24, 2008

OpenDS Looks Promising

Sun's OpenDS project looks to be a very promising LDAP implementation. I haven't gotten into it much, but as I installed it this morning, I was pleasantly surprised.

The install was easier than either FDS or OpenLDAP. A nice GUI steps you through the initial install. Replication setup was simple, as the GUI prompts you to identify another server already participating in the replication. OpenDS supports multi-master replication by default. I believe this is, in fact, the only replication it supports. I think it would be useful to be able to force read-only replicas, but I didn't see whether this was possible.

I easily set up 3 servers on my machine (a dual-core Opteron 185 with 2GB of memory running Fedora 8 64-bit). Using OpenDS, I generated the example LDIF of 10k users and loaded it up. Replication started immediately. OpenDS provides a nice GUI for simple monitoring, so it was easy to see the updates going to the other 2 servers participating in the replicated cluster.

The LDIF additions were slow--it took several minutes to load the 10k users. My machine load went up past 7, and with several servers running, my computer was swapping memory quite a bit. During the load, I shut down one server, brought it up for a minute or two, then took it down and brought it up again. I wanted to see how this server would handle synchronizing the changes it missed while down.

When the load finished, the replicated server that stayed up the entire time had the same number of entries as the server I loaded the LDIF into--10,003. The server I shut down, however, was 50+ entries short, with some errors in the replication log:
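To figure out exactly which entries never made it to the lagging replica, one approach is to export both servers to LDIF (e.g. with ldapsearch) and diff the DNs. A rough sketch of that comparison, where the exports are passed in as strings (the file handling is left out and the helper names are my own):

```python
# Compare the DNs present in two LDIF exports to spot entries
# that never made it to a lagging replica.
def dns_in_ldif(ldif_text):
    # Collect every "dn: ..." line from the export.
    return {line[len("dn: "):].strip()
            for line in ldif_text.splitlines()
            if line.startswith("dn: ")}

def missing_entries(master_ldif, replica_ldif):
    # DNs present on the master but absent from the replica.
    return sorted(dns_in_ldif(master_ldif) - dns_in_ldif(replica_ldif))
```

The sorted output makes it easy to see whether the missing entries are a contiguous run (one dropped batch) or scattered throughout the load.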

[24/Feb/2008:09:22:26 -0700] category=SYNC severity=MILD_ERROR msgID=14876739 msg=Could not replay operation AddOperation(connID=-1, opID=47, dn=uid=user.1333,ou=People,dc=example,dc=com) with ChangeNumber 000001184c39fe467f4300000537 error Canceled

Apart from that, OpenDS is off to an excellent start--especially for its age. It's by far the easiest server to get up and running. I'll be watching as it matures to see how it performs and stabilizes.

Friday, February 22, 2008

OpenLDAP vs Fedora Directory Server (cont)

Today I tried to get OpenLDAP going one last time. I erased the database on each machine and recompiled the code from scratch. My simple 10-person adds and deletes were working and synchronizing OK, so I was starting to think it was working. I bumped the LDIF up to 5000 names. Synchronization started, but then things went crazy and eventually one of the servers seg-faulted. That was enough OpenLDAP for me.

I went ahead and shut down OpenLDAP and fired up FDS. In no time at all, I had a simple master-slave replication set up, and adding and removing 5000 names (using ldapadd) worked perfectly. I still need to get it going in a multi-master setup--it looks easy enough.

My recommendation: if you're jumping into LDAP, start with FDS.

Thursday, February 21, 2008

OpenLDAP vs Fedora Directory Server

I've recently been coming up to speed on LDAP. Eager and ready, I grabbed OpenLDAP version 2.4.7, learned the LDAP basics, and had a server up and running fairly quickly.

While working with OpenLDAP, editing and loading LDIFs by hand, I quickly found myself hoping some tool existed to manage the basic LDAP tasks. I installed phpLdapAdmin, which seemed to do the job. I enabled the ppolicy module and found that trying to clear a user's password inside phpLdapAdmin (setting the text box to empty and then committing) caused OpenLDAP to exit with an assertion error. Ouch!! Figuring this was a pretty obvious bug, I found that the latest source code had already fixed the issue. I applied the patch, and slapd no longer exited when setting an empty password.

Today, I noticed 2.4.8 was released which also had the password fix, so I pulled that in and upgraded.

The next OpenLDAP task was to get multi-master replication up and going. After getting two servers set up, I was able to add a single user and remove it from either server. Everything looked good. I decided to try refreshOnly syncing instead of refreshAndPersist. However, after changing the sync method on both servers, as soon as I restarted them and they connected, one would seg-fault. I changed both back to refreshAndPersist, tested the single add and delete, and went on to the next step--bulk loads.

I added 10 users to an LDIF. When I loaded them, all ten would load fine into the local server, but only one or two users from the list would get replicated to the other server. After deleting and adding several times, I could never get all 10 to replicate. I thought the computers not being NTP-synced was an issue, but syncing them up did not fix it.

I realize N-way multi-master has only been around since October or so. It would appear to me that it's not yet ready for production use if you are planning to do multi-master replication.

While working with OpenLDAP, I learned about the existence of Fedora Directory Server, and since I run Fedora myself, I got that up and going too. The experience has been completely different. The initial setup was simple (RPMs--no compiles). The web-based Java management tool is far more functional than phpLdapAdmin, the documentation is incredible, and it has yet to crash on me. FDS now manages my simple home network user accounts and is my LDAP server of choice.

Tomorrow, I will test FDS multi-master replication and report back on my findings. FDS's multi-master replication is more mature than OpenLDAP's, so I have high expectations.