Author: Zhang Luodan
Formerly a member of the Aikesheng DBA team, now a member of the Lufax DBA team, with a persistent pursuit of technology!
Source of this article: original contribution.
* Produced by the Aikesheng open source community. Original content may not be used without authorization; for reprints, please contact the editor and credit the source.
Background
One night a database hung. The symptoms were:
- The application reported errors:
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection,pool error Timeout waiting for idle object
- It was impossible to log in to MySQL: the login command simply hung with no response.
With no other option, we forcibly killed the process and restarted the database to recover.
This article does not analyze the cause of the hang itself; it only looks at why MHA did not trigger a failover even though the database was hung.
Conclusion
Let's start with the conclusion. By default, MHA uses a persistent (long) connection to ping the database for health checks (it executes select 1 as Value), and it only triggers a failover after failing to connect to MySQL 4 times in a row.
In this incident the hang only prevented new connections from being established; existing connections were unaffected. MHA's health-check statement is so simple that it is handled entirely in the server layer and never touches the InnoDB layer, so MHA considered MySQL healthy and made no failover decision.
Solution
Starting with version 0.53, MHA supports the ping_type parameter to configure how the master's availability is checked. Three values are supported:
- select: use a persistent connection to MySQL and execute select 1 as Value. The connection is reused, but the check is too simple and misses many kinds of failures.
- connect: create a fresh connection before each select 1 as Value and disconnect it afterwards, which can detect more failures at the TCP connection level. Note: in this case, the MHA monitoring process forks a child process to perform the check.
- insert: execute an INSERT statement over an existing MySQL connection, which is better at detecting failures caused by a full disk or exhausted disk I/O.
By changing ping_type to connect, MHA must establish a new connection every time it checks the master's status; when that new connection cannot be established, the failover is triggered (a configuration sketch follows).
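A sketch of the change, assuming it is added to the same [server default] section as the test configuration shown later in this article (the uppercase value follows the MHA documentation; ping_interval is only repeated here for context):
[server default]
# use connection-based health checks instead of the default select ping
ping_type=CONNECT
# seconds between health checks (see the note in the Test section)
ping_interval=5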
The code for the three detection mechanisms:
## Returns 2 if the advisory lock is already held (another monitor is running), 0 on success, 1 on error
sub ping_connect($) {
my $self = shift;
my $log = $self->{logger};
my $dbh;
my $rc = 1;
my $max_retries = 2;
eval {
my $ping_start = [gettimeofday];
# Try to connect up to max_retries times; give up if the connection still fails
while ( !$self->{dbh} && $max_retries-- ) {
eval { $rc = $self->connect( 1, $self->{interval}, 0, 0, 1 ); };
if ( !$self->{dbh} && $@ ) {
die $@ if ( !$max_retries );
}
}
# call ping_select
$rc = $self->ping_select();
# To hold advisory lock for some periods of time
$self->sleep_until( $ping_start, $self->{interval} - 1.5 );
$self->disconnect_if();
};
if ($@) {
my $msg = "Got error on MySQL connect ping: $@";
undef $@;
$msg .= $DBI::err if ($DBI::err);
$msg .= " ($DBI::errstr)" if ($DBI::errstr);
$log->warning($msg) if ($log);
$rc = 1;
}
return 2 if ( $self->{_already_monitored} );
return $rc;
}
# Returns 0 on success, 1 on error
sub ping_select($) {
my $self = shift;
my $log = $self->{logger};
my $dbh = $self->{dbh};
my ( $query, $sth, $href );
eval {
$dbh->{RaiseError} = 1;
$sth = $dbh->prepare("SELECT 1 As Value");
$sth->execute();
$href = $sth->fetchrow_hashref;
if ( !defined($href)
|| !defined( $href->{Value} )
|| $href->{Value} != 1 )
{
die;
}
};
if ($@) {
my $msg = "Got error on MySQL select ping: ";
undef $@;
$msg .= $DBI::err if ($DBI::err);
$msg .= " ($DBI::errstr)" if ($DBI::errstr);
$log->warning($msg) if ($log);
return 1;
}
return 0;
}
# Returns 0 on success, 1 on error
sub ping_insert($) {
my $self = shift;
my $log = $self->{logger};
my $dbh = $self->{dbh};
my ( $query, $sth, $href );
eval {
$dbh->{RaiseError} = 1;
$dbh->do("CREATE DATABASE IF NOT EXISTS infra");
$dbh->do(
"CREATE TABLE IF NOT EXISTS infra.chk_masterha (`key` tinyint NOT NULL primary key,`val` int(10) unsigned NOT NULL DEFAULT '0')"
);
$dbh->do(
"INSERT INTO infra.chk_masterha values (1,unix_timestamp()) ON DUPLICATE KEY UPDATE val=unix_timestamp()"
);
};
if ($@) {
my $msg = "Got error on MySQL insert ping: ";
undef $@;
$msg .= $DBI::err if ($DBI::err);
$msg .= " ($DBI::errstr)" if ($DBI::errstr);
$log->warning($msg) if ($log);
return 1;
}
return 0;
}
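As a side note, when ping_type is set to insert, the check above leaves a visible trace on the master that can be inspected directly; a small verification sketch against the infra.chk_masterha table created by ping_insert():
-- val stores the unix timestamp written by the most recent insert ping
SELECT `key`, val, FROM_UNIXTIME(val) AS last_ping FROM infra.chk_masterha;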
Test
The MHA configuration file:
[server default]
manager_log=/Data/mha/log/workdir/my3306tst.log
manager_workdir=/Data/mha/workdir/my3306tst
remote_workdir=/Data/mysql/my3306/mha
master_binlog_dir=/Data/mysql/my3306/log
password=xxx
ping_interval=5
repl_password=xxx
repl_user=xxx
ssh_user=mysql
ssh_port=xxx
user=mha
master_ip_online_change_script="/usr/local/bin/master_ip_online_change"
master_ip_failover_script="master_ip_failover"
[server1]
hostname=xxx
port=3306
candidate_master=1
[server2]
hostname=xxx
port=3306
candidate_master=1
Note: for this test, ping_interval is set to 5 so that the failover can be observed quickly; in production it can be tuned according to how much downtime the business can tolerate.
Simulate full CPU load on the server so that the database cannot establish new connections.
Write a simple C program as follows:
#include <stdio.h>
int main()
{
while(1);
return 0;
}
Compile:
gcc -o out test_cpu.c
Run one instance per CPU:
for i in `seq 1 $(cat /proc/cpuinfo | grep "physical id" | wc -l)`; do ./out & done
Then start another two mysqlslap stress-test runs:
mysqlslap -c 30000 -i 100 --detach=1 --query="select 1 from dual" --delimiter=";" -uxxx -pxxx -S /xxxx/xxx.sock
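While the CPU burner and the mysqlslap sessions are running, a new connection attempt from another shell should hang or time out, which is exactly the condition the connect ping detects. A quick manual check (credentials and socket path are placeholders, as in the commands above):
# should hang and then fail while the server is saturated
mysql -uxxx -pxxx -S /xxxx/xxx.sock --connect-timeout=10 -e "select 1"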
- With ping_type=connect, 4 consecutive connection failures trigger the failover.
In the MHA failover log, the error reported when connecting to the database looks like this:
Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'waiting for initial communication packet',system error: 110)
- With ping_type=select, no failover is triggered.
Interested readers can verify this themselves.
MHA's health check mechanism
Call chain:
MasterMonitor.pm|MHA::MasterMonitor::main()
-->
MasterMonitor.pm|MHA::MasterMonitor::wait_until_master_is_dead()
-->
MasterMonitor.pm|MHA::MasterMonitor::wait_until_master_is_unreachable()
-->
HealthCheck.pm|MHA::HealthCheck::wait_until_unreachable()
-->
HealthCheck.pm|MHA::HealthCheck::ping_select() (or)
HealthCheck.pm|MHA::HealthCheck::ping_insert() (or)
HealthCheck.pm|MHA::HealthCheck::ping_connect()
After the MHA monitoring process starts, it continuously monitors the status of the master node; the core health-check function is wait_until_unreachable().
PS: when the monitoring process starts, it first reads the configuration file and runs a series of checks against the servers listed there, including liveness, version information, replica configuration (read_only, relay_log_purge, log-bin, replication filters, etc.), SSH connectivity, and so on. If any of these checks fails, the monitor will not start. The typical commands are sketched below.
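For reference, these pre-flight checks and the monitor itself are run with the standard MHA tools; a sketch, assuming the configuration file is saved as /etc/mha/my3306tst.cnf (the paths here are illustrative):
# verify SSH connectivity and replication health before starting the monitor
masterha_check_ssh --conf=/etc/mha/my3306tst.cnf
masterha_check_repl --conf=/etc/mha/my3306tst.cnf
# start the monitoring process in the background
nohup masterha_manager --conf=/etc/mha/my3306tst.cnf > /tmp/my3306tst_manager.log 2>&1 &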
wait_until_unreachable() contains an infinite loop that keeps performing health checks:
1. First, it tests the connection: a successful connection returns 0, otherwise 1.
- If the connection to MySQL succeeds, it then tries to acquire the advisory (distributed) lock; if acquiring the lock fails, the status value 1 is returned.
- If the connection to MySQL fails, the status value 1 is returned along with the connection error. For the following connection-failure error codes (the most common being 1040, too many connections, and 1045, access denied), MHA considers the MySQL process to still be alive, does not trigger a failover, and simply keeps checking the connection:
our @ALIVE_ERROR_CODES = (
1040, # ER_CON_COUNT_ERROR
1042, # ER_BAD_HOST_ERROR
1043, # ER_HANDSHAKE_ERROR
1044, # ER_DBACCESS_DENIED_ERROR
1045, # ER_ACCESS_DENIED_ERROR
1129, # ER_HOST_IS_BLOCKED
1130, # ER_HOST_NOT_PRIVILEGED
1203, # ER_TOO_MANY_USER_CONNECTIONS
1226, # ER_USER_LIMIT_REACHED
1251, # ER_NOT_SUPPORTED_AUTH_MODE
1275, # ER_SERVER_IS_IN_SECURE_AUTH_MODE
);
2. After the connection test succeeds, the actual health check is performed (using one of the three methods described above). If the connection fails 4 times in a row, the result of the secondary check script is used on the 4th failure (if one is defined); if that check confirms the failure, the master is considered down (see the secondary_check_script sketch below).
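The secondary check mentioned in step 2 is configured with secondary_check_script; MHA ships a masterha_secondary_check helper for this purpose. A sketch of how it is typically added to [server default] (the two hosts are placeholder remote servers used to double-check whether the master is reachable):
secondary_check_script=masterha_secondary_check -s remote_host1 -s remote_host2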
The code of the key function wait_until_unreachable():
# main function
sub wait_until_unreachable($) {
my $self = shift;
my $log = $self->{logger};
my $ssh_reachable = 2;
my $error_count = 0;
my $master_is_down = 0;
eval {
while (1) {
$self->{_tstart} = [gettimeofday];
## Determine if a connection needs to be established
if ( $self->{_need_reconnect} ) {
my ( $rc, $mysql_err ) =
$self->connect( undef, undef, undef, undef, undef, $error_count );
if ($rc) {
if ($mysql_err) {
# If the error code is in ALIVE_ERROR_CODES, do not trigger a failover; a common example is a wrong user password
if (
grep ( $_ == $mysql_err, @MHA::ManagerConst::ALIVE_ERROR_CODES )
> 0 )
{
$log->info(
"Got MySQL error $mysql_err, but this is not a MySQL crash. Continue health check.."
);
# next: skip straight to the next iteration of the loop
$self->sleep_until();
next;
}
}
$error_count++;
$log->warning("Connection failed $error_count time(s)..");
$self->handle_failing();
if ( $error_count >= 4 ) {
$ssh_reachable = $self->is_ssh_reachable();
# returns 1 if the master is down, 0 if it is not
$master_is_down = 1 if ( $self->is_secondary_down() );
# if the master is down, break out of the loop
last if ($master_is_down);
$error_count = 0;
}
$self->sleep_until();
next;
}
# connection ok
$self->{_need_reconnect} = 0;
$log->info(
"Ping($self->{ping_type}) succeeded, waiting until MySQL doesn't respond.."
);
}
# if ping_type is connect, disconnect first
$self->disconnect_if()
if ( $self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_CONNECT );
# Parent process forks one child process. The child process queries
# from MySQL every <interval> seconds. The child process may hang on
# executing queries.
# DBD::mysql 4.022 or earlier does not have an option to set
# read timeout, executing queries might take forever. To avoid this,
# the parent process kills the child process if it won't exit within
# <interval> seconds.
my $child_exit_code;
eval {
# Call the detection function
if ( $self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_CONNECT ) {
$child_exit_code = $self->fork_exec( sub { $self->ping_connect() },
"MySQL Ping($self->{ping_type})" );
}
elsif ( $self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_SELECT ) {
$child_exit_code = $self->fork_exec( sub { $self->ping_select() },
"MySQL Ping($self->{ping_type})" );
}
elsif ( $self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_INSERT ) {
$child_exit_code = $self->fork_exec( sub { $self->ping_insert() },
"MySQL Ping($self->{ping_type})" );
}
else {
die "Not supported ping_type!\n";
}
};
if ($@) {
my $msg = "Unexpected error heppened when pinging! $@";
$log->error($msg);
undef $@;
$child_exit_code = 1;
}
if ( $child_exit_code == 0 ) {
#ping ok
## if the ping succeeds, update the status and reset the error counter to 0
$self->update_status_ok();
if ( $error_count > 0 ) {
$error_count = 0;
}
$self->kill_sec_check();
$self->kill_ssh_check();
}
elsif ( $child_exit_code == 2 ) {
$self->{_already_monitored} = 1;
croak;
}
else {
## the ping failed (with ping_type=connect this typically means a new connection could not be established)
# failed on fork_exec
$error_count++;
$self->{_need_reconnect} = 1;
$self->handle_failing();
}
$self->sleep_until();
}
$log->warning("Master is not reachable from health checker!");
};
if ($@) {
my $msg = "Got error when monitoring master: $@";
$log->warning($msg);
undef $@;
return 2 if ( $self->{_already_monitored} );
return 1;
}
return 1 unless ($master_is_down);
return ( 0, $ssh_reachable );
}
1;