#serverID 1 ldap://server1:9009
retry="5 + 5 +"
retry="5 + 5 +"
It seems if you switch replication types from refreshAndPersist to refreshOnly, things get messed up. I prefer the refreshAndPersist. I left the interval option in my config, though it's not used in refreshAndPersist mode. Previously, my retry intervals were very short, so I think this is why I was getting un-predictable results in the number of entities replicated. The timeouts may have been hit and I wasn't waiting long enough.
I stress tested OpenLDAP much more than I have FDS. I would stress test by ctrl-c'ing a running slapd process while thousands of adds or deletes were being done to another server. Without hard stopping a server, replication seemed to work well. With hard stopping the server, the only issue I had was sometimes the stopped server would freeze and hang when coming back up and performing an ldapsearch at the same time--some sort of connection deadlock or something. I had to kill -9 slapd and start it again, then it would sync back up. Under most situations, servers being killed and started and killed and started again would not be a common occurrence.
I still saw a few SEGV's sometimes when bringing up a server after stopping it during bulk adds or deletes--like this one I just got. I deleted 5000 entries on server1, then as server2 was processing the deletes, I ctrl-c'd server2. Then bringing server2 back up. I got this:
#0 0x00ace375 in memmove () from /lib/libc.so.6
#1 0x0810a51c in bdb_dn2id_children (op=0x8d71028, txn=0x0, e=0x8be1cd4) at dn2id.c:351
#2 0x0810606f in bdb_cache_children (op=0x8d71028, txn=0x0, e=0x8be1cd4) at cache.c:1008
#3 0x080e4cac in bdb_hasSubordinates (op=0x8d71028, e=0x8be1cd4, hasSubordinates=0xa15caa0c) at operational.c:54
#4 0x080e4e09 in bdb_operational (op=0x8d71028, rs=0xa168c168) at operational.c:101
#5 0x080d3001 in overlay_op_walk (op=0x8d71028, rs=0xa168c168, which=op_aux_operational, oi=0x8b74868, on=0x8b74e38) at backover.c:653
#6 0x080d351d in over_op_func (op=0x8d71028, rs=0xa168c168, which=op_aux_operational) at backover.c:705
#7 0x0808066b in fe_aux_operational (op=0x8d71028, rs=0xa168c168) at backend.c:1868
#8 0x080802b9 in backend_operational (op=0x8d71028, rs=0xa168c168) at backend.c:1885
#9 0x08084531 in slap_send_search_entry (op=0x8d71028, rs=0xa168c168) at result.c:764
#10 0x080e7953 in bdb_search (op=0x8d71028, rs=0xa168c168) at search.c:869
#11 0x080d3001 in overlay_op_walk (op=0x8d71028, rs=0xa168c168, which=op_search, oi=0x8b74868, on=0x8b74e38) at backover.c:653
#12 0x080d351d in over_op_func (op=0x8d71028, rs=0xa168c168, which=op_search) at backover.c:705
#13 0x08076256 in fe_op_search (op=0x8d71028, rs=0xa168c168) at search.c:368
#14 0x08076a47 in do_search (op=0x8d71028, rs=0xa168c168) at search.c:217
#15 0x0807416c in connection_operation (ctx=0xa168c238, arg_v=0x8d71028) at connection.c:1084
#16 0x080748e0 in connection_read_thread (ctx=0xa168c238, argv=0x10) at connection.c:1211
#17 0x0816c9a4 in ldap_int_thread_pool_wrapper (xpool=0x8b4eea0) at tpool.c:663
#18 0x00d0650b in start_thread () from /lib/libpthread.so.0
#19 0x00b30b2e in clone () from /lib/libc.so.6
Running slapd again and it came up just fine and synchronized started again.
OpenLDAP multimaster replication is working a lot better than I first experienced. The multimaster replication is not bullet proof, but it's probably adequate now for many situations.