Stored function performs some tests, returning t/f, and if f, slon will fail with fatal error. Tests we know we ought to run: - verify that tables match sl_table, complete with triggers - possibly similar for sl_sequence
Set up a branch at GitHub for this: http://github.com/cbbrowne/slony1-engine/commits/bug156 I do not have a regression test for this yet, but do have: a) Health check function b) slon calls it at startup time, and fails FATAL if it fails c) documentation indicating why the error might come up, and what one might do about it.
I've looked over the patch. I'm wondering if we shouldn't try to print a more helpful error message. Would we not be friendlier if we printed the names of the tables (both as listed in sl_table and as listed in pg_class as part of the slon output. Asking the user to check the postgresql.log is kind of a pain (for them). We also might want to explicitly mention REPAIR CONFIG in the error message.
(In reply to comment #2) > I've looked over the patch. > > I'm wondering if we shouldn't try to print a more helpful error message. > > Would we not be friendlier if we printed the names of the tables (both as > listed in sl_table and as listed in pg_class as part of the slon output. > > Asking the user to check the postgresql.log is kind of a pain (for them). > > We also might want to explicitly mention REPAIR CONFIG in the error message. I have done all of this... - slon pulls the warning messages using PQresultErrorMessage(), and displays them in its logs. - I augmented the health check function to suggest that REPAIR CONFIG might fix the problem.
Per discussion on list, add in check that the timezone is GMT/UTC (or, I suppose, one of the timezones which is stable, that is, does not have Daylight Savings Time), and raise a WARNING if not.
The latest version of the patch seems to introduce the following compiler warning slon.c: In function ‘SlonMain’: slon.c:460: warning: comparison between pointer and integer from the line if (PQgetvalue(res, 0, 0) == 'f') { slon_log(SLON_FATAL, "slon_node_health_check I am surprised that your testing was able to get this warning to show up. PQgetvalue(..) will return a char * to a null terminated string not a char. strncmp is probably what we want to be using. 2. If I modify the code to work around that issue this is all I see when I see in the slon output. 2010-11-03 10:19:30 EDTCONFIG main: local node id = 1 2010-11-03 10:19:30 EDTFATAL could not call slon_node_health_check() - 2010-11-03 10:19:30 EDTCONFIG slon: child terminated status: 9; pid: 31474, current worker pid: 31474 2010-11-03 10:19:30 EDTINFO slon: done 2010-11-03 10:19:30 EDTINFO slon: exit(0) I'm not seeing the details. (ie the specific table that is the issue)
(In reply to comment #5) > The latest version of the patch seems to introduce the following compiler > warning > > slon.c: In function ‘SlonMain’: > slon.c:460: warning: comparison between pointer and integer > > from the line > if (PQgetvalue(res, 0, 0) == 'f') { > slon_log(SLON_FATAL, > "slon_node_health_check > > I am surprised that your testing was able to get this warning to show up. > > PQgetvalue(..) will return a char * to a null terminated string not a char. > strncmp is probably what we want to be using. > > 2. If I modify the code to work around that issue this is all I see when I see > in the slon output. > > 2010-11-03 10:19:30 EDTCONFIG main: local node id = 1 > 2010-11-03 10:19:30 EDTFATAL could not call slon_node_health_check() - > 2010-11-03 10:19:30 EDTCONFIG slon: child terminated status: 9; pid: 31474, > current worker pid: 31474 > 2010-11-03 10:19:30 EDTINFO slon: done > 2010-11-03 10:19:30 EDTINFO slon: exit(0) > > > I'm not seeing the details. (ie the specific table that is the issue) Right you are... I have changed it to if (*(PQgetvalue(res, 0, 0)) == 'f') { which is consistent with what is done elsewhere: postgres@cbbrowne [12:08:05] [~/slony1-engine.github/src/slon] [bug156 *] -> % grep -n PQgetvalue *.c | grep "*" | grep "'" remote_worker.c:2661: if (*(PQgetvalue(res1, 0, 0)) == 't') slon.c:460: if (*(PQgetvalue(res, 0, 0)) == 'f') { slon.c:550: int no_active = (*PQgetvalue(res, i, 1) == 't') ? 1 : 0; (Of course, the second instance is for this patch! :-)) Here is a self-contained test of this, run against a node that was initialized and had tables added in. I mess with it by breaking sl_table.tab_reloid values. postgres@cbbrowne [12:10:00] [~/slony1-engine.github/src/slon] [bug156 *] -> % psql -d slonyregress1 -c "update _slony_regress1.sl_table set tab_reloid=tab_id;" UPDATE 4 postgres@cbbrowne [12:10:47] [~/slony1-engine.github/src/slon] [bug156 *] -> % slon slony_regress1 'dbname=slonyregress1' > /tmp/stdout.txt 2> /tmp/stderr.txt postgres@cbbrowne [12:10:55] [~/slony1-engine.github/src/slon] [bug156 *] -> % cat /tmp/stdout.txt 2010-11-04 12:10:55 EDTCONFIG main: slon version 2.0.5 starting up 2010-11-04 12:10:55 EDTINFO slon: watchdog process started 2010-11-04 12:10:55 EDTCONFIG slon: watchdog ready - pid = 29512 2010-11-04 12:10:55 EDTCONFIG slon: worker process created - pid = 29513 2010-11-04 12:10:55 EDTCONFIG main: Integer option vac_frequency = 3 2010-11-04 12:10:55 EDTCONFIG main: Integer option log_level = 0 2010-11-04 12:10:55 EDTCONFIG main: Integer option sync_interval = 2000 2010-11-04 12:10:55 EDTCONFIG main: Integer option sync_interval_timeout = 10000 2010-11-04 12:10:55 EDTCONFIG main: Integer option sync_group_maxsize = 20 2010-11-04 12:10:55 EDTCONFIG main: Integer option desired_sync_time = 60000 2010-11-04 12:10:55 EDTCONFIG main: Integer option syslog = 0 2010-11-04 12:10:55 EDTCONFIG main: Integer option quit_sync_provider = 0 2010-11-04 12:10:55 EDTCONFIG main: Integer option quit_sync_finalsync = 0 2010-11-04 12:10:55 EDTCONFIG main: Integer option sync_max_rowsize = 8192 2010-11-04 12:10:55 EDTCONFIG main: Integer option sync_max_largemem = 5242880 2010-11-04 12:10:55 EDTCONFIG main: Integer option remote_listen_timeout = 300 2010-11-04 12:10:55 EDTCONFIG main: Boolean option log_pid = 0 2010-11-04 12:10:55 EDTCONFIG main: Boolean option log_timestamp = 1 2010-11-04 12:10:55 EDTCONFIG main: Boolean option cleanup_deletelogs = 0 2010-11-04 12:10:55 EDTCONFIG main: Real option real_placeholder = 0.000000 2010-11-04 12:10:55 EDTCONFIG main: String option cluster_name = slony_regress1 2010-11-04 12:10:55 EDTCONFIG main: String option conn_info = dbname=slonyregress1 2010-11-04 12:10:55 EDTCONFIG main: String option pid_file = [NULL] 2010-11-04 12:10:55 EDTCONFIG main: String option log_timestamp_format = %Y-%m-%d %H:%M:%S %Z 2010-11-04 12:10:55 EDTCONFIG main: String option archive_dir = [NULL] 2010-11-04 12:10:55 EDTCONFIG main: String option sql_on_connection = [NULL] 2010-11-04 12:10:55 EDTCONFIG main: String option lag_interval = [NULL] 2010-11-04 12:10:55 EDTCONFIG main: String option command_on_logarchive = [NULL] 2010-11-04 12:10:55 EDTCONFIG main: String option syslog_facility = LOCAL0 2010-11-04 12:10:55 EDTCONFIG main: String option syslog_ident = slon 2010-11-04 12:10:55 EDTCONFIG main: String option cleanup_interval = 10 minutes 2010-11-04 12:10:55 EDTCONFIG main: local node id = 1 2010-11-04 12:10:55 EDTFATAL slon_node_health_check() returned false - fatal health problem! REPAIR CONFIG may be helpful to rectify this problem2010-11-04 12:10:55 EDTCONFIG slon: child terminated status: 9; pid: 29513, current worker pid: 29513 2010-11-04 12:10:55 EDTINFO slon: done 2010-11-04 12:10:55 EDTINFO slon: exit(0) postgres@cbbrowne [12:11:03] [~/slony1-engine.github/src/slon] [bug156 *] -> % cat /tmp/stderr.txt WARNING: table [id,nsp,name]=[1,table1,public] - sl_table does not match pg_class/pg_namespace WARNING: table [id,nsp,name]=[2,table2,public] - sl_table does not match pg_class/pg_namespace WARNING: table [id,nsp,name]=[4,table4,public] - sl_table does not match pg_class/pg_namespace WARNING: table [id,nsp,name]=[5,table5,public] - sl_table does not match pg_class/pg_namespace WARNING: Mismatch found between sl_table and pg_class. Slonik command REPAIR CONFIG may be useful to rectify this. postgres@cbbrowne [12:11:07] [~/slony1-engine.github/src/slon] [bug156 *] -> %
I have reviewed version 726a3f0ee1775519696b1e1ebfca804c059da6e5 from your github repository and am now happy with this being applied against master/HEAD. Since this is a behavior change I'm inclined to not apply it against 2.0
Committed to master... http://git.postgresql.org/gitweb?p=slony1-engine.git;a=commit;h=48bd40e006448b00d7ab413016562e13277102e5 I cherry picked the relevant patches into master, then squashed them into a single patch. The following was rather interesting in describing how to do such: http://stackoverflow.com/questions/598672/git-how-to-squash-the-first-two-commits
*** Bug 253 has been marked as a duplicate of this bug. ***