Bug 132

Summary:	ERROR: duplicate key value violates unique constraint "sl_nodelock-pkey"
Product:	Slony-I	Reporter:	Steve Singer <ssinger>
Component:	slon	Assignee:	Steve Singer <ssinger>
Status:	RESOLVED FIXED
Severity:	normal	CC:	slony1-bugs
Priority:	medium
Version:	2.0
Hardware:	PC
OS:	Linux
See Also:	http://bugs.slony.info/bugzilla/show_bug.cgi?id=81

Description Steve Singer 2010-06-01 08:03:52 UTC

A slon exitted with 

"_disorder_replica".cleanupNodelock(); insert into "_disorder_replica".sl_nodelock values (    4, 0, "pg_catalog".pg_backend_pid()); " - ERROR:  duplicate key value violates unique constraint "sl_nodelock-pkey"


What appears to have happened is that following a move set the slon called slon_retry() and exitted and was restarted  The restarted slon did some processing and eventually encountered the above error.

At this stage I can't say if the bug is:

1) The first slon instance exitted without cleaning up
2) The second slon instance should have cleaned up after the first (check if the backend is still alive and if not delete the row)
3) something else




2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG enableSubscription: sub_set=2
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - subscription is complete:2010-06-01 10:34:17 EDTCONFIG enableSubscription: sub_set=2
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTWARN   remoteWorker_wakeup: node 3 - no worker thread
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTDEBUG2 sched_wakeup_node(): no_id=3 (0 threads + worker signaled)
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG main: last local event sequence = 5000000002
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG main: configuration complete - starting threads
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTINFO   localListenThread: thread starts
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG version for "dbname=test4 host=localhost user=slony password=slony" is 80309
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTDEBUG1 local_listen "dbname=test4 host=localhost user=slony password=slony": backend pid = 4464
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG slon: worker process created - pid = 4460
2010-06-01 10:34:17,447 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTFATAL  localListenThread: "select "_disorder_replica".cleanupNodelock(); insert into "_disorder_replica".sl_nodelock values (    4, 0, "pg_catalog".pg_backend_pid()); " - ERROR:  duplicate key value violates unique constraint "sl_nodelock-pkey"
2010-06-01 10:34:17,447 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 
2010-06-01 10:34:17,447 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTDEBUG2 slon_abort() from pid=4460
2010-06-01 10:34:17,447 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTINFO   slon: shutdown requested
2010-06-01 10:34:17,447 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTINFO   slon: notify worker process to shutdown
2010-06-01 10:34:17,448 [db4stderr] ERROR info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - NOTICE:  Slony-I: cleanup stale sl_nodelock entry for pid=4072
2010-06-01 10:34:37,450 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:37 EDTINFO   slon: child termination timeout - kill child
2010-06-01 10:34:37,450 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:37 EDTCONFIG slon: child terminated status: 9; pid: 4460, current worker pid: 4460
2010-06-01 10:34:37,450 [db4 stdout] INFO  info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - abnormal termination of slon2010-06-01 10:34:37 EDTCONFIG slon: child terminated status: 9; pid: 4460, current worker pid: 4460
2010-06-01 10:34:37,450 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:37 EDTINFO   slon: done



2010-06-01 10:34:17,412 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG remoteWorkerThread_5: update provider configuration
2010-06-01 10:34:17,412 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTINFO   remoteWorkerThread_5: thread done
2010-06-01 10:34:17,412 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTDEBUG1 cleanupThread: thread done
2010-06-01 10:34:17,412 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG main: done
2010-06-01 10:34:17,412 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG slon: child terminated status: 0; pid: 3974, current worker pid: 3974
2010-06-01 10:34:17,413 [db4 stdout] INFO  info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - abnormal termination of slon2010-06-01 10:34:17 EDTCONFIG slon: child terminated status: 0; pid: 3974, current worker pid: 3974
2010-06-01 10:34:17,413 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG slon: restart of worker
2010-06-01 10:34:17,413 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG main: slon version 2.0.3 starting up
2010-06-01 10:34:17,413 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTINFO   slon: watchdog process started



2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG enableSubscription: sub_set=2
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - subscription is complete:2010-06-01 10:34:17 EDTCONFIG enableSubscription: sub_set=2
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTWARN   remoteWorker_wakeup: node 3 - no worker thread
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTDEBUG2 sched_wakeup_node(): no_id=3 (0 threads + worker signaled)
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG main: last local event sequence = 5000000002
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG main: configuration complete - starting threads
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTINFO   localListenThread: thread starts
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG version for "dbname=test4 host=localhost user=slony password=slony" is 80309
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTDEBUG1 local_listen "dbname=test4 host=localhost user=slony password=slony": backend pid = 4464
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG slon: worker process created - pid = 4460
2010-06-01 10:34:17,447 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTFATAL  localListenThread: "select "_disorder_replica".cleanupNodelock(); insert into "_disorder_replica".sl_nodelock values (    4, 0, "pg_catalog".pg_backend_pid()); " - ERROR:  duplicate key value violates unique constraint "sl_nodelock-pkey"
2010-06-01 10:34:17,447 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 
2010-06-01 10:34:17,447 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTDEBUG2 slon_abort() from pid=4460
2010-06-01 10:34:17,447 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTINFO   slon: shutdown requested
2010-06-01 10:34:17,447 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTINFO   slon: notify worker process to shutdown
2010-06-01 10:34:17,448 [db4stderr] ERROR info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - NOTICE:  Slony-I: cleanup stale sl_nodelock entry for pid=4072
2010-06-01 10:34:37,450 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:37 EDTINFO   slon: child termination timeout - kill child
2010-06-01 10:34:37,450 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:37 EDTCONFIG slon: child terminated status: 9; pid: 4460, current worker pid: 4460
2010-06-01 10:34:37,450 [db4 stdout] INFO  info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - abnormal termination of slon2010-06-01 10:34:37 EDTCONFIG slon: child terminated status: 9; pid: 4460, current worker pid: 4460
2010-06-01 10:34:37,450 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:37 EDTINFO   slon: done

Comment 1 Steve Singer 2010-08-03 10:51:05 UTC

I've seen this happen some more often.

I think the following is happening.

-a slon_retry() method is called following the move set
-slon_retry signals the parent slon process which in turn sends a kill to the child
-The child exits, I can find no 'cleanup' process that the child runs on a kill to ensure the postgresql connections are closed or to remove entries from sl_nodelock
-The child is restarted by the parent.
-The child calls cleanupNodelock() which checks to see if the pid for the backend registered with sl_nodelock is still around.  I think sometimes the old backend process is still around (hasn't yet exited) maybe because it is in the middle of a query and hasn't yet noticed that the slon it is talking to has gone away
-Since the backend process is still around the row isn't deleted from sl_nodelock causing the insert into sl_nodelock to fail.

Since I've seen thsi happen more than an isolated incident and it causes the watchdog to exit as well I am bumping the priority.

Options to fix this include
1) Having a the slon worker properly exit and remove itself from the sl_nodelock table before exiting
2) Increase the 'sleep' time before restarting the child.  This doesn't really fix the problem it just makes it less likely

Comment 2 Steve Singer 2010-08-27 12:23:48 UTC

This has been committed to REL_2_0_STABLE and master