Wednesday, February 27, 2008

OpenLDAP MultiMaster Replication Redemption

After some tweaking of my slapd.conf, I was able to get multimaster replication working a lot more reliably than my earlier attempts. I pulled in the latest source code - whatever was comitted to cvs after 2.4.8 release, and tested with this version. Here is my current slapd.conf file syncrepl settings:


#serverID 1 ldap://server1:9009
serverID 2

overlay syncprov

syncRepl rid=1
provider=ldap://server1:9009
binddn="cn=Manager,dc=mgm,dc=com"
bindmethod=simple
credentials=ldap
searchbase="dc=mgm,dc=com"
type=refreshAndPersist
retry="5 + 5 +"
interval=00:00:00:05

syncRepl rid=2
provider=ldap://server2:9009
binddn="cn=Manager,dc=mgm,dc=com"
bindmethod=simple
credentials=ldap
searchbase="dc=mgm,dc=com"
type=refreshAndPersist
retry="5 + 5 +"
interval=00:00:00:05


mirrormode true
database monitor


It seems if you switch replication types from refreshAndPersist to refreshOnly, things get messed up. I prefer the refreshAndPersist. I left the interval option in my config, though it's not used in refreshAndPersist mode. Previously, my retry intervals were very short, so I think this is why I was getting un-predictable results in the number of entities replicated. The timeouts may have been hit and I wasn't waiting long enough.

I stress tested OpenLDAP much more than I have FDS. I would stress test by ctrl-c'ing a running slapd process while thousands of adds or deletes were being done to another server. Without hard stopping a server, replication seemed to work well. With hard stopping the server, the only issue I had was sometimes the stopped server would freeze and hang when coming back up and performing an ldapsearch at the same time--some sort of connection deadlock or something. I had to kill -9 slapd and start it again, then it would sync back up. Under most situations, servers being killed and started and killed and started again would not be a common occurrence.

I still saw a few SEGV's sometimes when bringing up a server after stopping it during bulk adds or deletes--like this one I just got. I deleted 5000 entries on server1, then as server2 was processing the deletes, I ctrl-c'd server2. Then bringing server2 back up. I got this:


#0 0x00ace375 in memmove () from /lib/libc.so.6
#1 0x0810a51c in bdb_dn2id_children (op=0x8d71028, txn=0x0, e=0x8be1cd4) at dn2id.c:351
#2 0x0810606f in bdb_cache_children (op=0x8d71028, txn=0x0, e=0x8be1cd4) at cache.c:1008
#3 0x080e4cac in bdb_hasSubordinates (op=0x8d71028, e=0x8be1cd4, hasSubordinates=0xa15caa0c) at operational.c:54
#4 0x080e4e09 in bdb_operational (op=0x8d71028, rs=0xa168c168) at operational.c:101
#5 0x080d3001 in overlay_op_walk (op=0x8d71028, rs=0xa168c168, which=op_aux_operational, oi=0x8b74868, on=0x8b74e38) at backover.c:653
#6 0x080d351d in over_op_func (op=0x8d71028, rs=0xa168c168, which=op_aux_operational) at backover.c:705
#7 0x0808066b in fe_aux_operational (op=0x8d71028, rs=0xa168c168) at backend.c:1868
#8 0x080802b9 in backend_operational (op=0x8d71028, rs=0xa168c168) at backend.c:1885
#9 0x08084531 in slap_send_search_entry (op=0x8d71028, rs=0xa168c168) at result.c:764
#10 0x080e7953 in bdb_search (op=0x8d71028, rs=0xa168c168) at search.c:869
#11 0x080d3001 in overlay_op_walk (op=0x8d71028, rs=0xa168c168, which=op_search, oi=0x8b74868, on=0x8b74e38) at backover.c:653
#12 0x080d351d in over_op_func (op=0x8d71028, rs=0xa168c168, which=op_search) at backover.c:705
#13 0x08076256 in fe_op_search (op=0x8d71028, rs=0xa168c168) at search.c:368
#14 0x08076a47 in do_search (op=0x8d71028, rs=0xa168c168) at search.c:217
#15 0x0807416c in connection_operation (ctx=0xa168c238, arg_v=0x8d71028) at connection.c:1084
#16 0x080748e0 in connection_read_thread (ctx=0xa168c238, argv=0x10) at connection.c:1211
#17 0x0816c9a4 in ldap_int_thread_pool_wrapper (xpool=0x8b4eea0) at tpool.c:663
#18 0x00d0650b in start_thread () from /lib/libpthread.so.0
#19 0x00b30b2e in clone () from /lib/libc.so.6



Running slapd again and it came up just fine and synchronized started again.

OpenLDAP multimaster replication is working a lot better than I first experienced. The multimaster replication is not bullet proof, but it's probably adequate now for many situations.

5 comments:

Unknown said...

so what is ur final conclusion ?? do u still prefer fedora-ds over openldap ?

Michael Martin said...

Yes--I don't think you can go wrong with Fedora-DS.

Unknown said...

Hey thanks for this example, I now have N-way multimaster working between cygwin and Suse yea!

Eman M. Hammad said...

I am trying the proposed solution, but i am getting the database is already shadowed?

savrukj said...

Hi,

If you are using in your slapd.conf
'mirrormode true'

Do you use "N-way Multimaster" or "MirrorMode" replication ?

Thanks.