Bug 156 - Slon on startup should have "health check"
Summary: Slon on startup should have "health check"
Status: RESOLVED FIXED
Alias: None
Product: Slony-I
Classification: Unclassified
Component: slon (show other bugs)
Version: devel
Hardware: All All
: low enhancement
Assignee: Christopher Browne
URL:
: 253 (view as bug list)
Depends on: 128
Blocks:
  Show dependency tree
 
Reported: 2010-09-08 12:18 UTC by Christopher Browne
Modified: 2011-11-30 13:14 UTC (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Christopher Browne 2010-09-08 12:18:32 UTC
Stored function performs some tests, returning t/f, and if f, slon will fail with fatal error.

Tests we know we ought to run:

- verify that tables match sl_table, complete with triggers

- possibly similar for sl_sequence
Comment 1 Christopher Browne 2010-09-08 14:50:19 UTC
Set up a branch at GitHub for this:

http://github.com/cbbrowne/slony1-engine/commits/bug156

I do not have a regression test for this yet, but do have:
a) Health check function
b) slon calls it at startup time, and fails FATAL if it fails
c) documentation indicating why the error might come up, and what one might
do about it.
Comment 2 Steve Singer 2010-10-28 12:05:09 UTC
I've looked over the patch.

I'm wondering if we shouldn't try to print a more helpful error message.

Would we not be friendlier if we printed the names of the tables (both as listed in sl_table and as listed in pg_class as part of the slon output.

Asking the user to check the postgresql.log is kind of a pain (for them).

We also might want to explicitly mention REPAIR CONFIG in the error message.
Comment 3 Christopher Browne 2010-10-29 10:30:18 UTC
(In reply to comment #2)
> I've looked over the patch.
> 
> I'm wondering if we shouldn't try to print a more helpful error message.
> 
> Would we not be friendlier if we printed the names of the tables (both as
> listed in sl_table and as listed in pg_class as part of the slon output.
> 
> Asking the user to check the postgresql.log is kind of a pain (for them).
> 
> We also might want to explicitly mention REPAIR CONFIG in the error message.

I have done all of this...

- slon pulls the warning messages using PQresultErrorMessage(), and displays them in its logs.

- I augmented the health check function to suggest that REPAIR CONFIG might fix the problem.
Comment 4 Christopher Browne 2010-11-01 13:44:51 UTC
Per discussion on list, add in check that the timezone is GMT/UTC (or, I suppose, one of the timezones which is stable, that is, does not have Daylight Savings Time), and raise a WARNING if not.
Comment 5 Steve Singer 2010-11-03 07:22:28 UTC
The latest version of the patch seems to introduce the following compiler warning

slon.c: In function ‘SlonMain’:
slon.c:460: warning: comparison between pointer and integer

from the line 
if (PQgetvalue(res, 0, 0) == 'f') {
				slon_log(SLON_FATAL, 
				 "slon_node_health_check

I am surprised that your testing was able to get this warning to show up.

PQgetvalue(..) will return  a char * to a null terminated string not a char.
strncmp is probably what we want to be using.

2.  If I modify the code to work around that issue this is all I see when I see in the slon output.

2010-11-03 10:19:30 EDTCONFIG main: local node id = 1
2010-11-03 10:19:30 EDTFATAL  could not call slon_node_health_check() - 2010-11-03 10:19:30 EDTCONFIG slon: child terminated status: 9; pid: 31474, current worker pid: 31474
2010-11-03 10:19:30 EDTINFO   slon: done
2010-11-03 10:19:30 EDTINFO   slon: exit(0)


I'm not seeing the details. (ie the specific table that is the issue)
Comment 6 Christopher Browne 2010-11-04 09:13:34 UTC
(In reply to comment #5)
> The latest version of the patch seems to introduce the following compiler
> warning
> 
> slon.c: In function ‘SlonMain’:
> slon.c:460: warning: comparison between pointer and integer
> 
> from the line 
> if (PQgetvalue(res, 0, 0) == 'f') {
>                 slon_log(SLON_FATAL, 
>                  "slon_node_health_check
> 
> I am surprised that your testing was able to get this warning to show up.
> 
> PQgetvalue(..) will return  a char * to a null terminated string not a char.
> strncmp is probably what we want to be using.
> 
> 2.  If I modify the code to work around that issue this is all I see when I see
> in the slon output.
> 
> 2010-11-03 10:19:30 EDTCONFIG main: local node id = 1
> 2010-11-03 10:19:30 EDTFATAL  could not call slon_node_health_check() -
> 2010-11-03 10:19:30 EDTCONFIG slon: child terminated status: 9; pid: 31474,
> current worker pid: 31474
> 2010-11-03 10:19:30 EDTINFO   slon: done
> 2010-11-03 10:19:30 EDTINFO   slon: exit(0)
> 
> 
> I'm not seeing the details. (ie the specific table that is the issue)

Right you are...

I have changed it to 
	if (*(PQgetvalue(res, 0, 0)) == 'f') {

which is consistent with what is done elsewhere:

postgres@cbbrowne [12:08:05] [~/slony1-engine.github/src/slon] [bug156 *]
-> % grep -n PQgetvalue *.c | grep "*" | grep "'"
remote_worker.c:2661:           if (*(PQgetvalue(res1, 0, 0)) == 't')
slon.c:460:                             if (*(PQgetvalue(res, 0, 0)) == 'f') {
slon.c:550:             int                     no_active = (*PQgetvalue(res, i, 1) == 't') ? 1 : 0;


(Of course, the second instance is for this patch! :-))

Here is a self-contained test of this, run against a node that was initialized and had tables added in.  I mess with it by breaking sl_table.tab_reloid values.

postgres@cbbrowne [12:10:00] [~/slony1-engine.github/src/slon] [bug156 *]
-> % psql -d slonyregress1 -c "update _slony_regress1.sl_table set tab_reloid=tab_id;"
UPDATE 4
postgres@cbbrowne [12:10:47] [~/slony1-engine.github/src/slon] [bug156 *]
-> % slon slony_regress1 'dbname=slonyregress1' > /tmp/stdout.txt 2> /tmp/stderr.txt
postgres@cbbrowne [12:10:55] [~/slony1-engine.github/src/slon] [bug156 *]
-> % cat /tmp/stdout.txt
2010-11-04 12:10:55 EDTCONFIG main: slon version 2.0.5 starting up
2010-11-04 12:10:55 EDTINFO   slon: watchdog process started
2010-11-04 12:10:55 EDTCONFIG slon: watchdog ready - pid = 29512
2010-11-04 12:10:55 EDTCONFIG slon: worker process created - pid = 29513
2010-11-04 12:10:55 EDTCONFIG main: Integer option vac_frequency = 3
2010-11-04 12:10:55 EDTCONFIG main: Integer option log_level = 0
2010-11-04 12:10:55 EDTCONFIG main: Integer option sync_interval = 2000
2010-11-04 12:10:55 EDTCONFIG main: Integer option sync_interval_timeout = 10000
2010-11-04 12:10:55 EDTCONFIG main: Integer option sync_group_maxsize = 20
2010-11-04 12:10:55 EDTCONFIG main: Integer option desired_sync_time = 60000
2010-11-04 12:10:55 EDTCONFIG main: Integer option syslog = 0
2010-11-04 12:10:55 EDTCONFIG main: Integer option quit_sync_provider = 0
2010-11-04 12:10:55 EDTCONFIG main: Integer option quit_sync_finalsync = 0
2010-11-04 12:10:55 EDTCONFIG main: Integer option sync_max_rowsize = 8192
2010-11-04 12:10:55 EDTCONFIG main: Integer option sync_max_largemem = 5242880
2010-11-04 12:10:55 EDTCONFIG main: Integer option remote_listen_timeout = 300
2010-11-04 12:10:55 EDTCONFIG main: Boolean option log_pid = 0
2010-11-04 12:10:55 EDTCONFIG main: Boolean option log_timestamp = 1
2010-11-04 12:10:55 EDTCONFIG main: Boolean option cleanup_deletelogs = 0
2010-11-04 12:10:55 EDTCONFIG main: Real option real_placeholder = 0.000000
2010-11-04 12:10:55 EDTCONFIG main: String option cluster_name = slony_regress1
2010-11-04 12:10:55 EDTCONFIG main: String option conn_info = dbname=slonyregress1
2010-11-04 12:10:55 EDTCONFIG main: String option pid_file = [NULL]
2010-11-04 12:10:55 EDTCONFIG main: String option log_timestamp_format = %Y-%m-%d %H:%M:%S %Z
2010-11-04 12:10:55 EDTCONFIG main: String option archive_dir = [NULL]
2010-11-04 12:10:55 EDTCONFIG main: String option sql_on_connection = [NULL]
2010-11-04 12:10:55 EDTCONFIG main: String option lag_interval = [NULL]
2010-11-04 12:10:55 EDTCONFIG main: String option command_on_logarchive = [NULL]
2010-11-04 12:10:55 EDTCONFIG main: String option syslog_facility = LOCAL0
2010-11-04 12:10:55 EDTCONFIG main: String option syslog_ident = slon
2010-11-04 12:10:55 EDTCONFIG main: String option cleanup_interval = 10 minutes
2010-11-04 12:10:55 EDTCONFIG main: local node id = 1
2010-11-04 12:10:55 EDTFATAL  slon_node_health_check() returned false - fatal health problem!

REPAIR CONFIG may be helpful to rectify this problem2010-11-04 12:10:55 EDTCONFIG slon: child terminated status: 9; pid: 29513, current worker pid: 29513
2010-11-04 12:10:55 EDTINFO   slon: done
2010-11-04 12:10:55 EDTINFO   slon: exit(0)
postgres@cbbrowne [12:11:03] [~/slony1-engine.github/src/slon] [bug156 *]
-> % cat /tmp/stderr.txt
WARNING:  table [id,nsp,name]=[1,table1,public] - sl_table does not match pg_class/pg_namespace
WARNING:  table [id,nsp,name]=[2,table2,public] - sl_table does not match pg_class/pg_namespace
WARNING:  table [id,nsp,name]=[4,table4,public] - sl_table does not match pg_class/pg_namespace
WARNING:  table [id,nsp,name]=[5,table5,public] - sl_table does not match pg_class/pg_namespace
WARNING:  Mismatch found between sl_table and pg_class.  Slonik command REPAIR CONFIG may be useful to rectify this.
postgres@cbbrowne [12:11:07] [~/slony1-engine.github/src/slon] [bug156 *]
-> %
Comment 7 Steve Singer 2010-11-10 07:01:37 UTC
I have reviewed version 726a3f0ee1775519696b1e1ebfca804c059da6e5 from your github repository and am now happy with this being applied against master/HEAD.


Since this is a behavior change I'm inclined to not apply it against 2.0
Comment 8 Christopher Browne 2010-11-10 11:44:06 UTC
Committed to master...

http://git.postgresql.org/gitweb?p=slony1-engine.git;a=commit;h=48bd40e006448b00d7ab413016562e13277102e5

I cherry picked the relevant patches into master, then squashed them into a single patch.  

The following was rather interesting in describing how to do such:

http://stackoverflow.com/questions/598672/git-how-to-squash-the-first-two-commits
Comment 9 Steve Singer 2011-11-30 13:14:35 UTC
*** Bug 253 has been marked as a duplicate of this bug. ***