Author: Zhang Luodan
Formerly a member of the Aikesheng DBA team, now a member of the Lufax DBA team, with a persistent pursuit of technology!
Source of this article: original contribution.
* Produced by the Aikesheng open source community. Original content may not be used without authorization; for reprints, please contact the editor and credit the source.
Background
One night a database hung. The symptoms were:
- The application reported errors:
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection,pool error Timeout waiting for idle object
- It was impossible to log in to MySQL: the login command simply hung with no response.
With no other option, we forcibly killed the process and restarted the database to recover.
This article does not analyze the cause of the hang itself; it only looks at why MHA did not trigger a failover even though the database was hung.
Conclusion
Let's start with the conclusion. By default, MHA uses a persistent (long) connection to ping the database for health checks (it executes select 1 as Value), and it only triggers a failover after failing to connect to MySQL 4 times in a row.
In this incident the hang only prevented new connections from being established; existing connections were unaffected. MHA's health-check statement is so simple that it is handled entirely in the server layer and never touches the InnoDB layer, so MHA considered MySQL healthy and made no failover decision.
Solution
Starting with version 0.53, MHA supports the ping_type parameter to configure how the master's availability is checked. Three values are supported:
- select: use a persistent connection to MySQL and execute select 1 as Value. The connection is reused, but the check is too simple and misses many kinds of failures.
- connect: create a fresh connection before each select 1 as Value and disconnect it afterwards, which can detect more failures at the TCP connection level. Note: in this case, the MHA monitoring process forks a child process to perform the check.
- insert: execute an INSERT statement over an existing MySQL connection, which is better at detecting failures caused by a full disk or exhausted disk I/O.
By changing ping_type to connect, MHA must establish a new connection every time it checks the master's status; when that new connection cannot be established, the failover is triggered (a configuration sketch follows).
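A sketch of the change, assuming it is added to the same [server default] section as the test configuration shown later in this article (the uppercase value follows the MHA documentation; ping_interval is only repeated here for context):
[server default]
# use connection-based health checks instead of the default select ping
ping_type=CONNECT
# seconds between health checks (see the note in the Test section)
ping_interval=5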
The code for the three detection mechanisms:
## Returns 2 if the advisory lock is already held (another monitor is running), 0 on success, 1 on error
sub ping_connect($) {
my $self = shift;
my $log = $self->{logger};
my $dbh;
my $rc = 1;
my $max_retries = 2;
eval {
my $ping_start = [gettimeofday];
# Try to connect up to max_retries times; give up if the connection still fails
while ( !$self->{dbh} && $max_retries-- ) {
eval { $rc = $self->connect( 1, $self->{interval}, 0, 0, 1 ); };
if ( !$self->{dbh} && $@ ) {
die $@ if ( !$max_retries );
}
}
# call ping_select
$rc = $self->ping_select();
# To hold advisory lock for some periods of time
$self->sleep_until( $ping_start, $self->{interval} - 1.5 );
$self->disconnect_if();
};
if ($@) {
my $msg = "Got error on MySQL connect ping: $@";
undef $@;
$msg .= $DBI::err if ($DBI::err);
$msg .= " ($DBI::errstr)" if ($DBI::errstr);
$log->warning($msg) if ($log);
$rc = 1;
}
return 2 if ( $self->{_already_monitored} );
return $rc;
}
# Returns 0 on success, 1 on error
sub ping_select($) {
my $self = shift;
my $log = $self->{logger};
my $dbh = $self->{dbh};
my ( $query, $sth, $href );
eval {
$dbh->{RaiseError} = 1;
$sth = $dbh->prepare("SELECT 1 As Value");
$sth->execute();
$href = $sth->fetchrow_hashref;
if ( !defined($href)
|| !defined( $href->{Value} )
|| $href->{Value} != 1 )
{
die;
}
};
if ($@) {
my $msg = "Got error on MySQL select ping: ";
undef $@;
$msg .= $DBI::err if ($DBI::err);
$msg .= " ($DBI::errstr)" if ($DBI::errstr);
$log->warning($msg) if ($log);
return 1;
}
return 0;
}
# Returns 0 on success, 1 on error
sub ping_insert($) {
my $self = shift;
my $log = $self->{logger};
my $dbh = $self->{dbh};
my ( $query, $sth, $href );
eval {
$dbh->{RaiseError} = 1;
$dbh->do("CREATE DATABASE IF NOT EXISTS infra");
$dbh->do(
"CREATE TABLE IF NOT EXISTS infra.chk_masterha (`key` tinyint NOT NULL primary key,`val` int(10) unsigned NOT NULL DEFAULT '0')"
);
$dbh->do(
"INSERT INTO infra.chk_masterha values (1,unix_timestamp()) ON DUPLICATE KEY UPDATE val=unix_timestamp()"
);
};
if ($@) {
my $msg = "Got error on MySQL insert ping: ";
undef $@;
$msg .= $DBI::err if ($DBI::err);
$msg .= " ($DBI::errstr)" if ($DBI::errstr);
$log->warning($msg) if ($log);
return 1;
}
return 0;
}
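As a side note, when ping_type is set to insert, the check above leaves a visible trace on the master that can be inspected directly; a small verification sketch against the infra.chk_masterha table created by ping_insert():
-- val stores the unix timestamp written by the most recent insert ping
SELECT `key`, val, FROM_UNIXTIME(val) AS last_ping FROM infra.chk_masterha;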
Test
The MHA configuration file:
[server default]
manager_log=/Data/mha/log/workdir/my3306tst.log
manager_workdir=/Data/mha/workdir/my3306tst
remote_workdir=/Data/mysql/my3306/mha
master_binlog_dir=/Data/mysql/my3306/log
password=xxx
ping_interval=5
repl_password=xxx
repl_user=xxx
ssh_user=mysql
ssh_port=xxx
user=mha
master_ip_online_change_script="/usr/local/bin/master_ip_online_change"
master_ip_failover_script="master_ip_failover"
[server1]
hostname=xxx
port=3306
candidate_master=1
[server2]
hostname=xxx
port=3306
candidate_master=1
Note: for this test, ping_interval is set to 5 so that the failover can be observed quickly; in production it can be tuned according to how much downtime the business can tolerate.
Simulate full CPU load on the server so that the database cannot establish new connections.
Write a simple C program as follows:
#include <stdio.h>
int main()
{
while(1);
return 0;
}
Compile:
gcc -o out test_cpu.c
Run one instance per CPU:
for i in `seq 1 $(cat /proc/cpuinfo | grep "physical id" | wc -l)`; do ./out & done
Then start another two mysqlslap stress-test runs:
mysqlslap -c 30000 -i 100 --detach=1 --query="select 1 from dual" --delimiter=";" -uxxx -pxxx -S /xxxx/xxx.sock
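While the CPU burner and the mysqlslap sessions are running, a new connection attempt from another shell should hang or time out, which is exactly the condition the connect ping detects. A quick manual check (credentials and socket path are placeholders, as in the commands above):
# should hang and then fail while the server is saturated
mysql -uxxx -pxxx -S /xxxx/xxx.sock --connect-timeout=10 -e "select 1"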
- With ping_type=connect, 4 consecutive connection failures trigger the failover.
In the MHA failover log, the error reported when connecting to the database looks like this:
Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'waiting for initial communication packet',system error: 110)
- With ping_type=select, no failover is triggered.
Interested readers can verify this themselves.
MHA's health check mechanism
Call chain:
MasterMonitor.pm|MHA::MasterMonitor::main()
-->
MasterMonitor.pm|MHA::MasterMonitor::wait_until_master_is_dead()
-->
MasterMonitor.pm|MHA::MasterMonitor::wait_until_master_is_unreachable()
-->
HealthCheck.pm|MHA::HealthCheck::wait_until_unreachable()
-->
HealthCheck.pm|MHA::HealthCheck::ping_select() (or)
HealthCheck.pm|MHA::HealthCheck::ping_insert() (or)
HealthCheck.pm|MHA::HealthCheck::ping_connect()
After the MHA monitoring process starts, it continuously monitors the status of the master node; the core health-check function is wait_until_unreachable().
PS: when the monitoring process starts, it first reads the configuration file and runs a series of checks against the servers listed there, including liveness, version information, replica configuration (read_only, relay_log_purge, log-bin, replication filters, etc.), SSH connectivity, and so on. If any of these checks fails, the monitor will not start. The typical commands are sketched below.
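For reference, these pre-flight checks and the monitor itself are run with the standard MHA tools; a sketch, assuming the configuration file is saved as /etc/mha/my3306tst.cnf (the paths here are illustrative):
# verify SSH connectivity and replication health before starting the monitor
masterha_check_ssh --conf=/etc/mha/my3306tst.cnf
masterha_check_repl --conf=/etc/mha/my3306tst.cnf
# start the monitoring process in the background
nohup masterha_manager --conf=/etc/mha/my3306tst.cnf > /tmp/my3306tst_manager.log 2>&1 &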
wait_until_unreachable() contains an infinite loop that keeps performing health checks:
1. First, it tests the connection: a successful connection returns 0, otherwise 1.
- If the connection to MySQL succeeds, it then tries to acquire the advisory (distributed) lock; if acquiring the lock fails, the status value 1 is returned.
- If the connection to MySQL fails, the status value 1 is returned along with the connection error. For the following connection-failure error codes (the most common being 1040, too many connections, and 1045, access denied), MHA considers the MySQL process to still be alive, does not trigger a failover, and simply keeps checking the connection:
our @ALIVE_ERROR_CODES = (
1040, # ER_CON_COUNT_ERROR
1042, # ER_BAD_HOST_ERROR
1043, # ER_HANDSHAKE_ERROR
1044, # ER_DBACCESS_DENIED_ERROR
1045, # ER_ACCESS_DENIED_ERROR
1129, # ER_HOST_IS_BLOCKED
1130, # ER_HOST_NOT_PRIVILEGED
1203, # ER_TOO_MANY_USER_CONNECTIONS
1226, # ER_USER_LIMIT_REACHED
1251, # ER_NOT_SUPPORTED_AUTH_MODE
1275, # ER_SERVER_IS_IN_SECURE_AUTH_MODE
);
2. After the connection test succeeds, the actual health check is performed (using one of the three methods described above). If the connection fails 4 times in a row, the result of the secondary check script is used on the 4th failure (if one is defined); if that check confirms the failure, the master is considered down (see the secondary_check_script sketch below).
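The secondary check mentioned in step 2 is configured with secondary_check_script; MHA ships a masterha_secondary_check helper for this purpose. A sketch of how it is typically added to [server default] (the two hosts are placeholder remote servers used to double-check whether the master is reachable):
secondary_check_script=masterha_secondary_check -s remote_host1 -s remote_host2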
The code of the key function wait_until_unreachable():
# main function
sub wait_until_unreachable($) {
my $self = shift;
my $log = $self->{logger};
my $ssh_reachable = 2;
my $error_count = 0;
my $master_is_down = 0;
eval {
while (1) {
$self->{_tstart} = [gettimeofday];
## Determine if a connection needs to be established
if ( $self->{_need_reconnect} ) {
my ( $rc, $mysql_err ) =
$self->connect( undef, undef, undef, undef, undef, $error_count );
if ($rc) {
if ($mysql_err) {
# If the error code is in ALIVE_ERROR_CODES, do not trigger a failover; a common example is a wrong user password
if (
grep ( $_ == $mysql_err, @MHA::ManagerConst::ALIVE_ERROR_CODES )
> 0 )
{
$log->info(
"Got MySQL error $mysql_err, but this is not a MySQL crash. Continue health check.."
);
# next: skip straight to the next iteration of the loop
$self->sleep_until();
next;
}
}
$error_count++;
$log->warning("Connection failed $error_count time(s)..");
$self->handle_failing();
if ( $error_count >= 4 ) {
$ssh_reachable = $self->is_ssh_reachable();
# returns 1 if the master is down, 0 if it is not
$master_is_down = 1 if ( $self->is_secondary_down() );
# if the master is down, break out of the loop
last if ($master_is_down);
$error_count = 0;
}
$self->sleep_until();
next;
}
# connection ok
$self->{_need_reconnect} = 0;
$log->info(
"Ping($self->{ping_type}) succeeded, waiting until MySQL doesn't respond.."
);
}
# if ping_type is connect, disconnect first
$self->disconnect_if()
if ( $self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_CONNECT );
# Parent process forks one child process. The child process queries
# from MySQL every <interval> seconds. The child process may hang on
# executing queries.
# DBD::mysql 4.022 or earlier does not have an option to set
# read timeout, executing queries might take forever. To avoid this,
# the parent process kills the child process if it won't exit within
# <interval> seconds.
my $child_exit_code;
eval {
# Call the detection function
if ( $self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_CONNECT ) {
$child_exit_code = $self->fork_exec( sub { $self->ping_connect() },
"MySQL Ping($self->{ping_type})" );
}
elsif ( $self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_SELECT ) {
$child_exit_code = $self->fork_exec( sub { $self->ping_select() },
"MySQL Ping($self->{ping_type})" );
}
elsif ( $self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_INSERT ) {
$child_exit_code = $self->fork_exec( sub { $self->ping_insert() },
"MySQL Ping($self->{ping_type})" );
}
else {
die "Not supported ping_type!\n";
}
};
if ($@) {
my $msg = "Unexpected error heppened when pinging! $@";
$log->error($msg);
undef $@;
$child_exit_code = 1;
}
if ( $child_exit_code == 0 ) {
#ping ok
## if the ping succeeds, update the status and reset the error counter to 0
$self->update_status_ok();
if ( $error_count > 0 ) {
$error_count = 0;
}
$self->kill_sec_check();
$self->kill_ssh_check();
}
elsif ( $child_exit_code == 2 ) {
$self->{_already_monitored} = 1;
croak;
}
else {
## the ping failed (with ping_type=connect this typically means a new connection could not be established)
# failed on fork_exec
$error_count++;
$self->{_need_reconnect} = 1;
$self->handle_failing();
}
$self->sleep_until();
}
$log->warning("Master is not reachable from health checker!");
};
if ($@) {
my $msg = "Got error when monitoring master: $@";
$log->warning($msg);
undef $@;
return 2 if ( $self->{_already_monitored} );
return 1;
}
return 1 unless ($master_is_down);
return ( 0, $ssh_reachable );
}
1;