We have just launched the next generation case portal. Please visit our blog to find out what changes this brings. If you are experiencing any issues accessing the case portal please reach out to: Support.

We are also aware that supportcases.lithium.com is displaying a certificate error and are working to resolve the issue.

Better MySQL Metrics in Datadog

Introduction

MySQL is the most popular open source database server that is used by some of the biggest web properties, including Lithium. It is also the most critical part of the infrastructure because it stores data - and data is what drives us.

So it is of paramount importance to keep tabs on the performance and health of the MySQL infrastructure. And that is where Datadog comes into picture, as a monitoring service that monitors metrics of key systems. These metrics can then be used to analyze, alert and report on the health and performance of the MySQL architecture, giving you immediate and actionable insights into MySQL’s performance. To summarize, monitoring the metrics allows us to achieve the following purposes:

  • Alert us immediately when a problem arises
  • Alert us on a problem that we may have in the future

Adding Comprehensive MySQL Metrics to Datadog

Using Datadog, we soon realized that it does not monitor all of the desired key metrics; Most of the InnoDB related metrics were missing. Cacti had the same problem until Baron Schwartz started the Better Cacti Templates project which later on became a part of Percona Monitoring Plugins. Having previously worked at Percona for several years, solving performance problems, I do truly appreciate how the Better Cacti Templates project makes life easier to troubleshoot and foresee problems.

So I embarked on porting the same set of metrics to Datadog and decided to give it back to the community so that Lithium and other Datadog users could monitor MySQL more comprehensively.

 

The details of my changes are available in the Pull Request on GitHub here:

https://github.com/DataDog/dd-agent/pull/1269

 

Let me share a few screenshots that show the metrics as seen when utilized in Datadog dashboard.

datadog_msql_threads.png 

As you can see the graphs in the dashboard show pretty nice details that can be used to analyze a trend or to analyze a problem at hand.

Where do I see the Query Response Times?

There is one key metric that I think is the single most important metric when it comes to MySQL performance : the “95th percentile query response time”.

 

Wikipedia has a detailed post on “percentile” that is a good read to understand the logic behind wanting to monitor “95th percentile query response time”.

“95th percentile query response time” shows the query response time for 95% of the queries executed against the MySQL server.

It seemed there would be no better metric to highlight immediate problems but also problems that might be building up slowly. After all, we want applications to be able to execute queries against MySQL server as fast as possible.

 

So I also wrote another Datadog custom metric that fetches information from the performance_schema database to calculate the 95th percentile query response times and send it over to Datadog.

 

I have filed a Pull Request on GitHub to have this integrated into the official Datadog checks so that the community can use it too. The details of my changes are available here on GitHub:

https://github.com/DataDog/dd-agent/pull/1284

 

Note that this check requires MySQL versions greater than or equal to 5.6.5. The check exposes average query response time per schema as well, apart from reporting the 95th percentile query response time across all the queries. Exposing average query response time per schema gives you a good idea as to which schema is responsible for most of the load on the MySQL server. This is particularly important for multi-tenant database servers that host more than one application, each with its own schema. An example would be a Wordpress Multisite Blog application.

 

Here are a few screenshots that show the metrics as seen when utilized in Datadog dashboard.

 datadog_95th_graph.png

datadog_avg_responsetime.png

The query response time metrics are collected using microseconds as the unit. This allows for fine-grained resolution. Once enough data has been collected for the 95th percentile query response time metric, a threshold can be setup to alert you when query performance becomes unacceptable.

 

These contributions are now out there in community space so they can be used by the community at large. I hope you enjoy them, and look forward to your comments!

1 Comment

Hi @OvaisT Thanks for this nice post, I need your help here.

 

I executed this query on my database server

SELECT schema_name, SUM(count_star) cnt, ROUND(AVG(avg_timer_wait)/1000000) AS avg_us 
             FROM performance_schema.events_statements_summary_by_digest 
             WHERE schema_name IS NOT NULL 
             GROUP BY schema_name;

and got the following result:

+--------------------+------+--------+
| schema_name        | cnt  | avg_us |
+--------------------+------+--------+
| db_name         | 2324 |  58756 |
| information_schema |  345 |   1793 |
| mysql              |  684 |   1873 |
+--------------------+------+--------+

And this query too:

SELECT s2.avg_us avg_us,
                 IFNULL(SUM(s1.cnt)/NULLIF((SELECT COUNT(*) FROM performance_schema.events_statements_summary_by_digest), 0), 0) percentile
             FROM (SELECT COUNT(*) cnt, ROUND(avg_timer_wait/1000000) AS avg_us
                     FROM performance_schema.events_statements_summary_by_digest
                     GROUP BY avg_us) AS s1
             JOIN (SELECT COUNT(*) cnt, ROUND(avg_timer_wait/1000000) AS avg_us
                     FROM performance_schema.events_statements_summary_by_digest
                     GROUP BY avg_us) AS s2
             ON s1.avg_us <= s2.avg_us
             GROUP BY s2.avg_us
             HAVING percentile > 0.95
             ORDER BY percentile
             LIMIT 1;

gave this result:

+--------+------------+
| avg_us | percentile |
+--------+------------+
| 379062 |     0.9516 |
+--------+------------+

I tried to interpret these results like this:

 

The first one means that 2324 queries were executed on db_name database and their average response time was 0.05 seconds?

The second one means 95.16% percent of these queries took on average 0.3 seconds to execute.

Is that correct???

 

Also how to read your second dashboard? What does the vertical axis shows??

 

Thanks a lot.

Lithys

Lithy Awards 2017

The winners in digital CX have been crowned!

See the winners!!