First reported 08/2008 by Shahaf Abileah, then found by Bill Moran at CFI.

http://www.mail-archive.com/slony1-general@lists.slony.info/msg02754.html
http://www.nabble.com/duplicate-key-sl_nodelock-pkey-td22246434.html

In both cases, the users report that duplicate running slon processes seem to be the cause (hence, duplicate "node locks"). Error #1 is the PostgreSQL error shown below. In our case, during a switchover drill, the SWITCHOVER command fails silently, eventually timing out. No errors from slon/PostgreSQL are reported back to the user, which is problem #2.

-------------------
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [20-1] [27776] FATAL  localListenThread: "select "_database_name_prod".cleanupNodelock(); insert into
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [20-2] "_database_name_prod".sl_nodelock values ( 2, 0, "pg_catalog".pg_backend_pid()); " - ERROR:  duplicate key value violates
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [20-3] unique constraint "sl_nodelock-pkey"
-------------------

Summary: duplicate key on sl_nodelock-pkey, and duplicate slon(8) processes not detected.
I can duplicate this issue in REL_2_0_STABLE cvs tip as of today...
Is there anything I can do to help you debug this issue?
(In reply to comment #2)
> Is there anything I can do to help you debug this issue?

Can you clarify what you're seeing? If I start up, on the same PostgreSQL cluster:

1) one slon on my origin database
2) one slon on my receiver database
3) a second slon on my receiver database

the third slon gives me

----------------------------------------------
2010-03-31 10:31:44 EDT FATAL  localListenThread: "select "_test".cleanupNodelock(); insert into "_test".sl_nodelock values ( 2, 0, "pg_catalog".pg_backend_pid()); " - ERROR:  duplicate key value violates unique constraint "sl_nodelock-pkey"
2010-03-31 10:31:44 EDT INFO   slon: shutdown requested
2010-03-31 10:31:44 EDT INFO   slon: notify worker process to shutdown
2010-03-31 10:32:04 EDT INFO   slon: child termination timeout - kill child
2010-03-31 10:32:04 EDT CONFIG slon: child terminated status: 9; pid: 5071, current worker pid: 5071
2010-03-31 10:32:04 EDT INFO   slon: done
2010-03-31 10:32:04 EDT INFO   slon: exit(0)
----------------------------------------------

and exits. Isn't this the behavior one would expect, or are you seeing something else?
(In reply to comment #0)
> First reported 08/2008 by Shahaf Abileah, then found by Bill Moran at CFI.
>
> http://www.mail-archive.com/slony1-general@lists.slony.info/msg02754.html
> http://www.nabble.com/duplicate-key-sl_nodelock-pkey-td22246434.html

The report by Shahaf Abileah doesn't seem to me to indicate an error; if you try to run a slon for the same node on 2 servers, then there's no way to see, via /usr/bin/ps, that there were two instances. That case doesn't seem like a bug to me; it looks instead like the slon is working correctly. The only possible change would be to catch the exception and give a more explicit warning to the effect that "it looks like you have a slon running already!"

Bill Moran's report isn't about the same thing. In his case, the attempt to start another slon leads to the same symptom (i.e., slon complains that it can't get the exclusive lock on the tuple). The *problem*, however, seems to be that the "old" slon has somehow gotten 'wedged', and I'd think it likely that it is blocked waiting for a lock to be granted. What's likely necessary to diagnose the issue is a query against pg_locks on the node being managed, showing what the "stalled" connection is waiting on.
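A diagnostic query along the lines suggested above might look like this. This is only a sketch against the standard pg_locks and pg_stat_activity views; it assumes a PostgreSQL version contemporary with this report (8.x/9.0/9.1), where pg_stat_activity exposes procpid and current_query. On 9.2 and later those columns are named pid and query.

```sql
-- Sketch: show ungranted locks and what each waiting backend is running,
-- to find what the "stalled" slon connection is blocked on.
SELECT l.pid,
       l.locktype,
       l.relation::regclass AS relation,
       l.mode,
       a.current_query          -- "query" on PostgreSQL 9.2+
  FROM pg_locks l
  JOIN pg_stat_activity a
    ON a.procpid = l.pid        -- "a.pid" on PostgreSQL 9.2+
 WHERE NOT l.granted;
```

Run against the database the wedged slon is managing, this should show which lock the old slon's backend is queued behind.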
FWIW, I could not reproduce this error after connecting PostgreSQL and Slony directly, instead of going through pgbouncer in the middle...
The proposed patch to bug 132 might address this
I agree with Steve that this is covered by bug 132. I am moving this to devel since there is still an issue: it is possible that the OS's PIDs "almost" wrapped around after the slon worker process died, so that another process got its PID reassigned before the cleanup of sl_nodelock happened. The solution to that, however, is too much change for 2.0, so we will do this in 2.1.

The idea is to replace sl_nodelock as it is now with a temp table. Temp tables are visible in pg_class, so an existing sl_nodelock temp table indicates an existing worker backend connection. If, after creating the temp table, there is more than one matching entry in pg_class, we had a race condition and need to retry the whole thing.
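The temp-table scheme described above could be sketched roughly as follows. This is a hypothetical illustration, not Slony's actual implementation; the table name sl_nodelock_n2 is made up for the example. It relies on the fact that each backend's temp tables live in that backend's own pg_temp_N schema but share a relname, so counting matching relnames across pg_temp schemas detects a second worker.

```sql
-- Hypothetical sketch of the proposed temp-table node lock.
-- The table vanishes automatically when this backend's connection
-- dies, so a stale PID can never hold the lock.
CREATE TEMP TABLE sl_nodelock_n2 (dummy int);

-- Race detection: after creating our own temp table, count all
-- temp tables of that name across every backend's pg_temp schema.
-- A count > 1 means two slons raced; both should drop and retry.
SELECT count(*)
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE c.relname = 'sl_nodelock_n2'
   AND n.nspname LIKE 'pg_temp%';
```

The attraction over the current sl_nodelock row is that cleanup is tied to the backend's lifetime rather than to a recorded PID, which is exactly what breaks when PIDs wrap around.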
Hi, according to the comments above, issue #81 was to be fixed in Slony-I 2.1, but the issue is still standing in Slony-I 2.2beta1 (the sl_nodelock table is still used for detecting a duplicate slon start). Is there any problem preventing the fix?