Bug 81 - duplicate key sl_nodelock-pkey and duplicate slon(8) processes not detected
Status: NEW
Product: Slony-I
Component: slon
Version: devel
Hardware: PC All
Importance: medium normal
Assigned To: Slony Bugs List
Reported: 2009-04-13 07:47 PDT by Brian A. Seklecki
Modified: 2012-08-29 19:43 PDT

See Also: http://bugs.slony.info/bugzilla/show_bug.cgi?id=132




Description Brian A. Seklecki 2009-04-13 07:47:05 PDT
First reported 08/2008 by Shahaf Abileah, then found by Bill Moran at CFI.


http://www.mail-archive.com/slony1-general@lists.slony.info/msg02754.html

http://www.nabble.com/duplicate-key-sl_nodelock-pkey-td22246434.html

--------

In both cases, the users report that duplicate running SLON processes seem to
be the cause (hence, duplicate "node locks").  

Error #1 is the PgSQL error.

In our case, during a switchover drill, the SWITCHOVER command fails silently,
eventually timing out.

No errors from Slon/PgSQL are reported back to the user, which is problem #2.

-------------------

2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] 
slon[27776]: [20-1] [27776] FATAL localListenThread: "select 
"_database_name_prod".cleanupNodelock(); insert into
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] 
slon[27776]: [20-2] "_database_name_prod".sl_nodelock values ( 2, 0, 
"pg_catalog".pg_backend_pid()); " - ERROR: duplicate key value violates
2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] 
slon[27776]: [20-3] unique constraint "sl_nodelock-pkey"

---------------------

Comment 1 Devrim GUNDUZ 2010-03-21 13:45:27 PDT
I can duplicate this issue in REL_2_0_STABLE cvs tip as of today...
Comment 2 Devrim GUNDUZ 2010-03-31 06:46:16 PDT
Is there anything that I can help you to debug this issue?
Comment 3 Steve Singer 2010-03-31 07:38:13 PDT
(In reply to comment #2)
> Is there anything that I can help you to debug this issue?

Can you clarify what you're seeing?

If I start up, on the same PostgreSQL cluster:
1) 1 slon on my origin database
2) 1 slon on my receiver database
3) A second slon on my receiver database

The third slon gives me
----------------------------------------------
2010-03-31 10:31:44 EDTFATAL  localListenThread: "select
"_test".cleanupNodelock(); insert into "_test".sl_nodelock values (    2, 0,
"pg_catalog".pg_backend_pid()); " - ERROR:  duplicate key value violates unique
constraint "sl_nodelock-pkey"

2010-03-31 10:31:44 EDTINFO   slon: shutdown requested
2010-03-31 10:31:44 EDTINFO   slon: notify worker process to shutdown



2010-03-31 10:32:04 EDTINFO   slon: child termination timeout - kill child
2010-03-31 10:32:04 EDTCONFIG slon: child terminated status: 9; pid: 5071,
current worker pid: 5071
2010-03-31 10:32:04 EDTINFO   slon: done
2010-03-31 10:32:04 EDTINFO   slon: exit(0)

-----------------------------------------------------------

and exits.   Isn't this the behavior one would expect, or are you seeing
something else?
Comment 4 Christopher Browne 2010-03-31 14:40:51 PDT
(In reply to comment #0)
> First reported 08/2008 by Shahaf Abileah, then found by Bill Moran at CFI.
> 
> http://www.mail-archive.com/slony1-general@lists.slony.info/msg02754.html
> 
> http://www.nabble.com/duplicate-key-sl_nodelock-pkey-td22246434.html

The report by Shahaf Abileah doesn't seem to me to indicate an error; if you
try to run a slon on 2 servers, then there's no way to find, via /usr/bin/ps,
that there were two instances.  That case doesn't seem like a bug to me - it
looks instead like the slon is working correctly.  The only possible change
would be to catch the exception and give a more explicit warning to the effect
that "it looks like you have a slon running already!"

Bill Moran's report isn't about the same thing.

In his case, the attempt to start another slon leads to the same symptom (e.g.
- slon complains that it can't get the exclusive lock on the tuple).

The *problem*, however, seems to be that the "old" slon has gotten somehow
'wedged,' and I'd think it likely that the problem is that it's waiting for a
lock to be granted.

What's likely necessary to diagnose the issue is a query against pg_locks on
the node being managed showing what the "stalled" connection is waiting on.
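A diagnostic along the lines this comment suggests could look like the query below. This is a hypothetical query, not from the Slony source, held in a Python string purely for illustration; it would be run via psql against the node being managed. Column names follow modern PostgreSQL (pg_stat_activity.query; older releases call it current_query).

```python
# Hypothetical pg_locks diagnostic (not from Slony): list backends that
# are waiting on a lock, with the lock details and the query they are
# stuck in, to see what a "stalled" slon connection is waiting on.
STALLED_LOCK_QUERY = """
SELECT a.pid,
       l.locktype,
       l.mode,
       l.relation::regclass AS relation,
       a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;
"""

print(STALLED_LOCK_QUERY)
```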
Comment 5 Devrim GUNDUZ 2010-04-21 01:00:17 PDT
FWIW, I could not reproduce this error after connecting slony to postgresql
directly, instead of going through pgbouncer in the middle...
Comment 6 Steve Singer 2010-08-11 14:23:02 PDT
The proposed patch to bug 132 might address this
Comment 7 Jan Wieck 2010-08-24 07:40:10 PDT
I agree with Steve that this is covered by 132.

I am moving this to devel since there is still an issue. It is possible that
the OS's PID counter has "almost" wrapped around when the slon worker process
dies, so that another process gets its PID reassigned before the cleanup of
sl_nodelock happens.

The solution to that, however, is too much change for 2.0, so we will do this in
2.1. The idea is to replace sl_nodelock as it is now with a temp
table. Temp tables are visible in pg_class, so an existing sl_nodelock temp
table indicates an existing worker backend connection. If after creating the
temp table there is more than one entry in pg_class, we had a race condition
and need to retry the whole thing.
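The temp-table scheme described above can be modeled as control flow. This is a hypothetical Python sketch, not Slony code: a shared list stands in for pg_class, and each entry stands in for one backend's temp sl_nodelock table.

```python
# Hypothetical model of the temp-table scheme: a shared list stands in
# for pg_class, one entry per live worker backend's temp sl_nodelock
# table.  Not Slony code; just the check/create/recheck control flow.
pg_class = []  # visible temp tables, one entry per live worker backend

def acquire_worker_lock(backend_pid, max_retries=3):
    for _ in range(max_retries):
        if pg_class:                      # an existing temp table means
            return False                  # another worker is already connected
        pg_class.append(backend_pid)      # "CREATE TEMP TABLE sl_nodelock"
        if len(pg_class) > 1:             # two workers raced: both created one
            pg_class.remove(backend_pid)  # drop ours and retry from scratch
            continue
        return True                       # we are the only worker
    return False

print(acquire_worker_lock(5071))  # True: first worker wins
print(acquire_worker_lock(5072))  # False: duplicate worker detected
```

Because the check is on catalog visibility of a per-backend temp table rather than on a stored PID, a reassigned (wrapped-around) PID can no longer make a stale lock row look live.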
Comment 8 Tomonari Katsumata 2012-08-29 19:43:12 PDT
Hi,

According to the comments above, issue #81 was to be fixed in Slony-I 2.1,
but the issue is still present in Slony-I 2.2beta1
(the sl_nodelock table is still used to detect a duplicated slon start).

Is there anything blocking the fix?