Wednesday, February 27, 2008

OpenLDAP MultiMaster Replication Redemption

After some tweaking of my slapd.conf, I was able to get multimaster replication working a lot more reliably than my earlier attempts. I pulled in the latest source code - whatever was comitted to cvs after 2.4.8 release, and tested with this version. Here is my current slapd.conf file syncrepl settings:

#serverID 1 ldap://server1:9009
serverID 2

overlay syncprov

syncRepl rid=1
retry="5 + 5 +"

syncRepl rid=2
retry="5 + 5 +"

mirrormode true
database monitor

It seems if you switch replication types from refreshAndPersist to refreshOnly, things get messed up. I prefer the refreshAndPersist. I left the interval option in my config, though it's not used in refreshAndPersist mode. Previously, my retry intervals were very short, so I think this is why I was getting un-predictable results in the number of entities replicated. The timeouts may have been hit and I wasn't waiting long enough.

I stress tested OpenLDAP much more than I have FDS. I would stress test by ctrl-c'ing a running slapd process while thousands of adds or deletes were being done to another server. Without hard stopping a server, replication seemed to work well. With hard stopping the server, the only issue I had was sometimes the stopped server would freeze and hang when coming back up and performing an ldapsearch at the same time--some sort of connection deadlock or something. I had to kill -9 slapd and start it again, then it would sync back up. Under most situations, servers being killed and started and killed and started again would not be a common occurrence.

I still saw a few SEGV's sometimes when bringing up a server after stopping it during bulk adds or deletes--like this one I just got. I deleted 5000 entries on server1, then as server2 was processing the deletes, I ctrl-c'd server2. Then bringing server2 back up. I got this:

#0 0x00ace375 in memmove () from /lib/
#1 0x0810a51c in bdb_dn2id_children (op=0x8d71028, txn=0x0, e=0x8be1cd4) at dn2id.c:351
#2 0x0810606f in bdb_cache_children (op=0x8d71028, txn=0x0, e=0x8be1cd4) at cache.c:1008
#3 0x080e4cac in bdb_hasSubordinates (op=0x8d71028, e=0x8be1cd4, hasSubordinates=0xa15caa0c) at operational.c:54
#4 0x080e4e09 in bdb_operational (op=0x8d71028, rs=0xa168c168) at operational.c:101
#5 0x080d3001 in overlay_op_walk (op=0x8d71028, rs=0xa168c168, which=op_aux_operational, oi=0x8b74868, on=0x8b74e38) at backover.c:653
#6 0x080d351d in over_op_func (op=0x8d71028, rs=0xa168c168, which=op_aux_operational) at backover.c:705
#7 0x0808066b in fe_aux_operational (op=0x8d71028, rs=0xa168c168) at backend.c:1868
#8 0x080802b9 in backend_operational (op=0x8d71028, rs=0xa168c168) at backend.c:1885
#9 0x08084531 in slap_send_search_entry (op=0x8d71028, rs=0xa168c168) at result.c:764
#10 0x080e7953 in bdb_search (op=0x8d71028, rs=0xa168c168) at search.c:869
#11 0x080d3001 in overlay_op_walk (op=0x8d71028, rs=0xa168c168, which=op_search, oi=0x8b74868, on=0x8b74e38) at backover.c:653
#12 0x080d351d in over_op_func (op=0x8d71028, rs=0xa168c168, which=op_search) at backover.c:705
#13 0x08076256 in fe_op_search (op=0x8d71028, rs=0xa168c168) at search.c:368
#14 0x08076a47 in do_search (op=0x8d71028, rs=0xa168c168) at search.c:217
#15 0x0807416c in connection_operation (ctx=0xa168c238, arg_v=0x8d71028) at connection.c:1084
#16 0x080748e0 in connection_read_thread (ctx=0xa168c238, argv=0x10) at connection.c:1211
#17 0x0816c9a4 in ldap_int_thread_pool_wrapper (xpool=0x8b4eea0) at tpool.c:663
#18 0x00d0650b in start_thread () from /lib/
#19 0x00b30b2e in clone () from /lib/

Running slapd again and it came up just fine and synchronized started again.

OpenLDAP multimaster replication is working a lot better than I first experienced. The multimaster replication is not bullet proof, but it's probably adequate now for many situations.


GNU Yoga said...

so what is ur final conclusion ?? do u still prefer fedora-ds over openldap ?

Michael Martin said...

Yes--I don't think you can go wrong with Fedora-DS.

Scott said...

Hey thanks for this example, I now have N-way multimaster working between cygwin and Suse yea!

e-m-y said...

I am trying the proposed solution, but i am getting the database is already shadowed?

Jan said...


If you are using in your slapd.conf
'mirrormode true'

Do you use "N-way Multimaster" or "MirrorMode" replication ?