How to limit the impact of a bad application on a MariaDB Galera Cluster with MaxScale?
I have a Galera replication cluster with three MariaDB nodes, with a MaxScale active-passive pair in front that presents a single-node image to its clients.
I have a badly behaving client that opens connections and doesn't close them, so the number of connections keeps increasing until the database limits are hit. To limit the number of connections I have configured the two parameters below (`max_connections` and `max_user_connections`).
My situation is this: when only `max_connections` is configured and the limit is reached, the Galera node stops accepting further connections with the error "Too many connections". When MaxScale sees these connection rejections n times, it puts the server into Maintenance mode. I can understand this behaviour; it's expected.

When I also configure `max_user_connections`, the badly behaving application keeps trying to open new connections, and once the user-specific limit is reached, further connection attempts to the back-end MariaDB nodes fail. MaxScale observes these failures and again puts the server into Maintenance mode. I believe that during this time it only sees connection attempts from the bad client; no other application tried to connect.
Over time, MaxScale puts all three nodes into Maintenance mode this way, which makes the whole DB service unavailable.
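For reference, the two limits trigger different server-side errors, which may help when diagnosing what MaxScale is reacting to (error numbers as documented for MariaDB; worth verifying on your version):

```sql
-- Check the limits currently in effect (values from the question's config):
SHOW GLOBAL VARIABLES LIKE 'max_connections';       -- 1500
SHOW GLOBAL VARIABLES LIKE 'max_user_connections';  -- 200

-- Exceeding max_connections yields error 1040 (ER_CON_COUNT_ERROR,
-- "Too many connections"); exceeding max_user_connections yields
-- error 1203 (ER_TOO_MANY_USER_CONNECTIONS). MaxScale sees both
-- simply as failed connection attempts to the back end.
```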
For me as an administrator the situation is the same either way: putting a user-specific limit doesn't achieve anything. I would like to ask two questions here:
Q1. How can I prevent connection failures from just one user from putting the back-end MariaDB node into Maintenance mode?
Q2. Is there any documentation, tutorial, or article on how and when MaxScale decides to put a server into Maintenance mode?
Below are the details about the environment
Galera - 25.3.23, MariaDB - 10.3.12, MaxScale - 2.4.11, OS - RHEL 7.4 (Maipo)
Here is my configuration
MariaDB Galera Configuration
```ini
[server]

# this is only for the mysqld standalone daemon
[mysqld]
# user statistics
userstat=1
performance_schema
#wait_timeout=600
max_allowed_packet=1024M
# lower_case_table_names=1
#
max_connections=1500
max_user_connections=200
#
# * Galera-related settings
#
[galera]
# Mandatory settings
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_provider_options="gcache.size=300M; gcache.page_size=300M; pc.ignore_sb=false; pc.ignore_quorum=false"
# wsrep_cluster_address defines members of the cluster
wsrep_cluster_address=gcomm://x.x.x.1,x.x.x.2,x.x.x.3
wsrep_cluster_name="mariadb-cluster"
wsrep_node_address=x.x.x.1
wsrep_node_incoming_address=x.x.x.1
wsrep_debug=OFF
#
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_doublewrite=1
query_cache_size=0
innodb_flush_log_at_trx_commit=0
innodb_buffer_pool_size=5G
#
bind-address=x.x.x.1
#
[mariadb]
# performance
wait_timeout=31536000
#
# query logging
log_output=FILE
# slow queries
slow_query_log
slow_query_log_file=/var/log/mariadb/mariadb-slow.log
long_query_time=10.0
log_queries_not_using_indexes=ON
min_examined_row_limit=1000
log_slow_rate_limit=1
log_slow_verbosity=query_plan,explain
#
# error logs
log_error=/var/log/mariadb/mariadb-error.log
log_warnings=2
```
All three Galera nodes are configured similarly.
```ini
[maxscale]
threads=auto

# Server definitions
[mariadb1]
type=server
address=x.x.x.1
port=3306
protocol=MariaDBBackend
#priority=0

[mariadb2]
type=server
address=x.x.x.2
port=3306
protocol=MariaDBBackend
#priority=1

[mariadb3]
type=server
address=x.x.x.3
port=3306
protocol=MariaDBBackend
#priority=1

# Monitor for the servers
[Galera-Monitor]
type=monitor
module=galeramon
servers=mariadb1, mariadb2, mariadb3
user=xxx
password=xxx
#disable_master_role_setting=true
monitor_interval=1000
#use_priority=true
#disable_master_failback=true
available_when_donor=true

# Service definitions
[Galera-Service]
type=service
router=readwritesplit
master_accept_reads=true
connection_keepalive=300s
master_reconnection=true
master_failure_mode=error_on_write
connection_timeout=3600s
servers=mariadb1, mariadb2, mariadb3
user=xxx
password=xxx
#filters=Query-Log-Filter

# Listener
[Galera-Listener]
type=listener
service=Galera-Service
protocol=MariaDBClient
port=4306
```
I have tried the `connection_timeout`, `max_connections`, and `max_user_connections` settings on the database server nodes, but it didn't help. When the bad application keeps making connection attempts and the threshold is reached, the database servers reject connections with "Too many connections". MaxScale observes this for some time and puts the back-end servers into Maintenance mode. Setting `max_user_connections` to some value, say 200, causes the back-end server to reject connections once the limit is reached for a single user. When there are then multiple connection failures because the bad application breaches the `max_user_connections` threshold, MaxScale again marks the server as being in the Maintenance state. It doesn't differentiate whether the attempts come from a single user or from many users collectively; it just sees the connection failures from the server.
To solve the situation, I created a separate service in MaxScale for the badly behaving application, with its own `max_connections` limit set, and a separate listener on a different port for that service. Because it is a separate service, other clients are unaffected when the `max_connections` threshold is reached in MaxScale. Additionally, I made sure that the `max_connections` limit on the back-end MariaDB servers is higher than the value configured in MaxScale, so the threshold is reached first in MaxScale and it never puts the back-end servers into Maintenance mode. The new configuration block for MaxScale is as below -
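A minimal sketch of such a separate service and listener (the service name, port number, and limit value here are illustrative assumptions, not taken from the original post):

```ini
# Separate service for the badly behaving application, capped
# well below the back-end max_connections (1500 in this setup).
[BadApp-Service]
type=service
router=readwritesplit
servers=mariadb1, mariadb2, mariadb3
user=xxx
password=xxx
max_connections=100

# Dedicated listener on its own port, so only the bad
# application is throttled when the cap is reached.
[BadApp-Listener]
type=listener
service=BadApp-Service
protocol=MariaDBClient
port=4307
```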
I don't think MaxScale is the component you want to use to fix this problem; it's possible to do this in the MariaDB server itself. I've experienced exactly the same problem and solved it by imposing limits on the database users with the per-account `MAX_USER_CONNECTIONS` resource limit.
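As a sketch, a per-account limit can be set like this (the account name and value are illustrative):

```sql
-- Cap the badly behaving account at 200 simultaneous connections.
-- Exceeding the cap rejects only this account's new connections;
-- other accounts are unaffected.
ALTER USER 'bad_app'@'%' WITH MAX_USER_CONNECTIONS 200;

-- Verify the limit that is now in effect:
SELECT User, Host, max_user_connections
FROM mysql.user
WHERE User = 'bad_app';
```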
Why is `wait_timeout=31536000` so large? Do your applications keep connections open rather than creating new ones? While that might sound like a good idea, it means that connections accidentally left open or idle won't be closed until much later.
I don't think this is correct.
If you limit the database users so that the sum of `max_user_connections` over all users is less than `max_connections` on each node, then the users will never be able to reach the server's `max_connections` limit. For example, with `max_connections=1500`, seven accounts capped at 200 connections each can hold at most 1400 connections in total.
I don't think there is one single document for this; it is scattered across the MaxScale documentation. I think maintenance mode started out as a way for the admin to schedule planned downtime, but it has since been used for other things as well, see e.g. `maintenance_on_low_disk_space`.
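For reference, maintenance mode can also be toggled manually with the `maxctrl` client (server names here are the ones from the configuration above):

```sh
# Put a server into maintenance mode for planned work:
maxctrl set server mariadb1 maintenance

# Bring it back into rotation once the problem is resolved:
maxctrl clear server mariadb1 maintenance

# Check the current server states:
maxctrl list servers
```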