The World of 5Σ in 2012

Large MySQL Installation tuning and optimisation techniques explored

December 22nd, 2012 - There are two general methods for creating a query plan. A rule-based optimizer walks through your query, checks whether you're doing X or Y, and optimizes with method A or B accordingly. Rule-based optimizers grow rather large, as there are many possible cases to account for. Sometime in the past couple of decades, rule-based optimizers fell out of favour and cost-based optimizers took over instead. When the query is submitted, the optimizer does some simple calculations of the cost of the various ways of executing it and picks the one that is probably best. It turns out the simplicity of the optimizer, combined with the "it generally works" nature of the beast, means it totally wins the popularity contest.
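
If you're curious what the cost-based optimizer decided for a particular query, you can peek at its estimate. A minimal sketch, assuming a local mysql client and an imaginary orders table in an imaginary shop database:

    # Hypothetical example: show the chosen plan, then the optimizer's
    # cost estimate for the last query it compiled.
    mysql shop -e "
      EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
      SHOW SESSION STATUS LIKE 'Last_query_cost';"

The Last_query_cost number is in the optimizer's internal cost units, so it is only useful for comparing plans against each other, not as an absolute measure.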

My Tuning Methodology, the Overview

With this in mind, I wanted to evaluate my own tuning methodology. At first blush, I thought my method rather coarse and unrefined. I generally look at every variable in my.cnf, do a quick search if I don't know what it is, set it to something that “seems about right,” and then go to the next one.

My tuning experience is almost exclusively in high-volume shops, which isn't representative of the many MySQL installations of all different sizes in the world. In high-volume environments, there are no good rules of thumb for tuning, and if you followed any such rules, your MySQLs would crash into the dust. So this rather naïve approach is actually a good start. Because there's that word again. Rule.

You can tune my.cnf by keeping several rules of thumb handy: a very long ruleset of "if this, then that," either documented or carried as a mental checklist. This works well on typical MySQL installations, but it isn't my methodology. Mine is iterative and experimentally based.

My Tuning Methodology, the Details

So now I've got a my.cnf that I’ve touched line-by-line and picked values that seemed reasonable. Anyone who's ever administered a MySQL knows this my.cnf is going to be a disaster if left unchecked. But here's where my magic begins and why my methodology works in high-volume environments where no rules of thumb apply.

I place the my.cnf onto a machine that is taking representative load. That is, it receives queries much like it would if it were performing its production role. There are a lot of ways of doing this, but my favorite is to put the slave into a production pool at a small fraction (perhaps 1%) of the workload of a currently-production slave.
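
There's no single right way to send that 1%, but here is a sketch of one approach, assuming an HAProxy layer in front of the read pool (the listener name, hosts, and addresses below are made up):

    # haproxy.cfg (hypothetical excerpt). Weights are relative, so the
    # candidate slave receives roughly 1% of the read traffic.
    listen mysql_read_pool
        bind *:3307
        mode tcp
        balance roundrobin
        server slave1    10.0.0.11:3306 check weight 100
        server slave2    10.0.0.12:3306 check weight 100
        server candidate 10.0.0.99:3306 check weight 2

Any other mechanism that dribbles a known fraction of real queries at the candidate (application-side weighting, a DNS trick, replaying a sampled query log) works just as well.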

Next, I point trending software at the slave. What do I monitor? That's easy. EVERYTHING. Every single statistic I can think of, I graph, from the OS and DBMS. Everything sar would ever tell you. Everything from "show global status". Everything in log files I can turn into numbers.
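
A minimal sketch of that kind of collection, assuming passwordless local access for the mysql client and a scratch directory to dump into (in practice this feeds a proper trending system rather than flat files):

    #!/bin/bash
    # Naive collector: snapshot OS and MySQL counters once a minute.
    mkdir -p /var/tmp/metrics
    while true; do
        ts=$(date +%s)
        # Everything sar will ever tell you (CPU, memory, disk, network, ...)
        sar -A 1 1                       > /var/tmp/metrics/sar.$ts
        # Every counter the DBMS exposes
        mysql -e 'SHOW GLOBAL STATUS'    > /var/tmp/metrics/status.$ts
        mysql -e 'SHOW GLOBAL VARIABLES' > /var/tmp/metrics/variables.$ts
        sleep 60
    done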

Then I measure the "black-box expected throughput" of the machine. That is, it's getting a fraction of the query traffic, so is it performing obviously better than the other hosts performing full work? Say this machine is supposed to be able to answer its typical queries in 250ms. Is it doing it in at most 250ms, and perhaps better? If so, I increase the load on the machine. I keep increasing the load until it starts to struggle. Say I get it up to 150% normal load, and it's now taking 260ms to answer its queries. Now I know it's "stressed."

At this point, I start to look at every statistic I've graphed and look for things at their limits. Is CPU maxed out? Is disk I/O at a plateau? Are read_rnd_next handler stats off-the-charts? I find it useful to have the same metrics coming from an existing reference slave, but sometimes they are obviously in a bad state (like if read_rnd_next is pushing 100,000,000/s for example, we probably need some indexes).

From here, I begin to get an idea of tunings or, more precisely, "experiments I can run." Maybe the tables opened is going crazy. So I tune table_open_cache, restart the server, and see if the graph corrects (secondary concern) and if my server is responding better, say 10ms faster per query (primary concern, which is most-correctly measured as "does the application seem happy with this server in its current state?"). Maybe disk I/O is crazy, but InnoDB free pages is nothing. That suggests an experiment to increase innodb_buffer_pool_size.
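
As a sketch of that first experiment (the numbers are illustrative, not recommendations): confirm how fast Opened_tables is climbing, raise table_open_cache, restart, and then judge the result primarily by application latency:

    # Is the table cache thrashing? A fast-climbing Opened_tables is the hint.
    mysql -e "SHOW GLOBAL STATUS LIKE 'Opened_tables'"; sleep 60
    mysql -e "SHOW GLOBAL STATUS LIKE 'Opened_tables'"

    # Hypothetical change: bump table_open_cache in my.cnf and restart.
    # (On older servers the variable is called table_cache.)
    sed -i 's/^table_open_cache.*/table_open_cache = 4096/' /etc/my.cnf
    /etc/init.d/mysql restart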

These are perhaps obvious examples, but there can be some subtler and surprising tuning choices to be made when you start viewing graphs of statistics you don't even know the meaning of. For example, suppose you didn't know what handler_read_rnd_next meant. You look at it and see it's 10,000x larger than handler_read_next which is 50x larger than handler_read_first. You start looking up what those stats mean, et voila, you've made a useful discovery.
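
A quick way to eyeball those relative magnitudes straight from the server:

    # One look at all the handler counters; a Handler_read_rnd_next value
    # dwarfing the others generally means full table scans (missing indexes).
    mysql -e "SHOW GLOBAL STATUS LIKE 'Handler_read%'"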

At this point, I've formulated a hypothesis, which sounds a bit like this: "The graphs indicate that my system is doing suboptimal X. There is a tuning A that affects X. Thus I will change its current value from M to N, and I expect it will perform more optimally in X." I make the change, and verify the evidence on the graphs. Most importantly, the primary concern is that the application, the thing using my DBMS, should perform better. If the graphs indicate everything is better, but the application is performing worse, we must accept the failed experiment!

This leads us to an anecdote about InnoDB tuning.

Why Experimental Evidence Matters

At a previous company, I was running several 100% InnoDB MySQL servers in a particular slave query pool, and was getting a particular performance characteristic. I was looking at my.cnf and noted innodb_buffer_pool_size was rather small (2GB out of 24GB total available RAM) and that I had an extremely low cache hit rate. I formulated the obvious hypothesis: that I should increase the buffer pool to improve my deplorable cache hit rate. Upon increasing the innodb_buffer_pool_size to 6GB (I believe in changing things in small increments), however, I discovered that although my cache hit rate was much nicer, my query latency was also up (and the application was clearly suffering).

The first thing to do was tune it back to 2GB to ensure things returned to the status quo. Then I formulated a new experiment which made no sense, but I had to try: I tuned innodb_buffer_pool_size to 1GB... Performance INCREASED over the original 2GB setting!

Further experimentation settled on about a 750MB innodb_buffer_pool_size being optimal for the machines in this particular query pool (that is, smaller values began leading to worse performance again).

This was extremely counter-intuitive, as it should be to you if you're a MySQL DBA. It took me a long time to accept the truth of this situation and move on. I formulated many other hypotheses in the process of trying to understand this phenomenon. MyISAM? Bad trending data? Matter/antimatter collision? Practical joke by my co-workers who were in the datacenter siphoning off the blue smoke from my machines?

To add to the mystery, I observed two additional phenomena: First, in other slave pools, increasing innodb_buffer_pool_size indeed had the expected result: a better cache hit rate and better performance. Second, several months later these machines were replaced with a new set of hardware configured similarly in RAM and disk, but with AMD processors instead of Intel. Upon loading them up with MySQL and putting them into the pool, they performed extremely badly compared to their peers.

It took almost as long to rediscover the obvious (that increasing innodb_buffer_pool_size decreased latency on this new configuration) as it had originally taken to learn to ignore that intuition. But the graphs coming out of the trending software didn't lie. On this new configuration, we could see performance improve as innodb_buffer_pool_size was tuned up. I believe we settled on about 18GB of the total 24GB of RAM as the optimal tuning.

Wrapping up the Experiment, Standardize on your Configuration

At some point, after staring at graphs, modifying parameters, altering tables, changing sysctl.conf settings, and remounting filesystems, it becomes more difficult to suggest tunings that will have a great effect. Keep in mind that the business has its own goals.

After a while, getting another 5% performance out of your servers isn't worth the same to the business as, say, helping the developers push out the next release. So recognize that although you could spend the next several months staring at graphs and performing experiments (which are fun), at this point you should be satisfied with your increased performance and get back to the daily grind.

Roll your tunings out to the pool of DBMSes, and monitor the cluster to be sure that, on the whole, it's performing as well as the individual test machine was (this is important! Sometimes a subset of the machines will behave in a deviant fashion you didn't expect, so be prepared for it!). If so, move on to something else more important. Don't worry. Changes in workload, hardware configuration, or application code will necessitate revisiting this procedure in the future.

5Σ _____

 

In-Terminal ASCII Histograms to make visualising log data and commands trivial

December 15th, 2012 - In the course of a large cluster administrator's job, there are dozens of times a week when it would be useful to visualise some data. You're constantly working with machines that have hundreds of databases, directories, and files, and log files that often have millions of entries each.

Wouldn't it be nice if you could visualise these situations?

  1. You have dozens of directories full of files. What's the relative size of each?
  2. You have a logfile with 1.5M entries. You grep out "ERROR" and that is 50,000 timestamped entries. What times of day have the most errors?
  3. You have a process list with 30,000 entries, comprised of 70 unique commands. What is the relative frequency of each command?
  4. You have a database with 10,000 tables, each having several million rows. Which are the 30 biggest, and what is their relative size?
  5. You have slow query logs spanning a couple of weeks. You wish to know the days and times-of-day that had the most slow queries.

Typically, you can use awk, sort, grep -c, and uniq -c to give some output that contains "keys" and "values" and you can eyeball it to get a general sense. In fact, if the output is decimal, the number of digits is a base-10 logarithmic graph of the relative sizes of things. But sometimes that isn't good enough. For example, a 7x difference will not be obvious from eyeballing a bunch of numbers.
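
For instance, use-case 2 above usually reduces to a pipeline like this (the log format is made up, with an HH:MM:SS timestamp in the second field):

    # Count ERROR lines per hour of day, busiest hours first.
    grep ERROR application.log \
      | awk '{ split($2, t, ":"); print t[1] }' \
      | sort | uniq -c | sort -rn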

You can copy the output, paste into a CSV, load it up in Libre Office Calc, and generate graphs if you want to spend another couple minutes every iteration. But wouldn't it be nice to just do it right there in the terminal where you already are?

If you did web searches for this sort of thing, up until a couple of months ago, the best you'd've found was a page referencing an amazing awk script that takes a list of numeric keys and presents an ASCII histogram. However, it couldn't take non-numeric keys, and it could not help with several of the use-cases presented above.

Enter "distribution," a radically simple tool by Fifth Sigma. It will take a large input, munge it, tokenise it, tally it, and give you beautiful in-terminal graphs with ANSI colour coding, unicode partial-width characters, and more.

So go to the project page, download it, and play with it. You'll find plenty of examples, and using a little imagination, you will begin to get a different view of the log files and Un*x commands you're constantly working with.

5Σ _____

 

The Redis Persistence Model

November 17th, 2012 - I am often asked about the benefits of using key-value stores, such as Redis, for high-volume environments. One of the key benefits that is often cited is the durability offered by Redis persistence (as described in detail here).

Developer Salvatore ‘antirez’ Sanfilippo delves into the topic on his blog. Here are some key questions you should consider as you evaluate Redis in your own environment.

Redis Replication and the AOF

Redis replication (and one of its two methods of persistence) is achieved using the AOF (append-only file), which logs all statements that modify data. In the example in the article, a DELETE is issued on a non-existent key, and that statement gets neither logged nor replicated to a slave Redis. But in real life, masters and slaves get out of sync, and it can be handy to have a statement that does nothing on the master but, when replicated to the slave, has the tangible effect of moving the master and slave closer toward convergence. It seems it should at least be an option to have "no-effect" statements replicate from the master to the slave.

When the AOF gets too large, an AOF rewrite occurs: the AOF is replaced by the minimal set of statements needed to reproduce the data set as it currently is in memory. As the article explains:

You may wonder what happens to data that is written to the server while the rewrite is in progress. This new data is simply also written to the old (current) AOF file, and at the same time queued into an in-memory buffer, so that when the new AOF is ready we can write this missing part inside it, and finally replace the old AOF file with the new one.

So what happens when we run out of RAM? That would be a very interesting behaviour to have defined unambiguously. He also doesn't talk about how long it takes to write the in-memory delta to the AOF before the swap-over. I suspect during that time, the server will be largely unresponsive (or should be, to preserve data integrity). On a busy server, there might be millions of entries in the in-memory delta buffer once the new AOF is finished writing.
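
If you want some control over when that rewrite (and therefore that delta buffer) kicks in, the knobs are straightforward; the threshold values below match the stock defaults:

    # Kick off an AOF rewrite by hand:
    redis-cli BGREWRITEAOF

    # Or let Redis trigger it automatically once the AOF has grown 100%
    # past its post-rewrite size, ignoring files smaller than 64MB:
    redis-cli CONFIG SET auto-aof-rewrite-percentage 100
    redis-cli CONFIG SET auto-aof-rewrite-min-size 67108864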

The Purpose of Redis Persistence

If you think Redis having persistence means you can serve data from disk, you'd be wrong. The point of persistence is just to get the in-memory-only dataset back after a crash or restart.

AOF rewrites are generated only using sequential I/O operations, so the whole dump process is efficient even with rotational disks (no random I/O is performed). This is also true for RDB snapshots generation. The complete lack of Random I/O accesses is a rare feature among databases, and is possible mostly because Redis serves read operations from memory, so data on disk does not need to be organized for a random access pattern, but just for a sequential loading on restart.

So Redis really is just (a fast) memcached but with some persistence methods.

There is an option for fsync'ing data to the AOF in various ways. Be careful: the "appendfsync everysec" setting is actually worst-case every TWO seconds; it's only every second on average.
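
For reference, these are the appendonly/appendfsync settings in question, shown here being set at runtime (they can equally live in redis.conf); the choice among the three fsync policies is exactly the trade-off being discussed:

    # Turn the AOF on and pick an fsync policy.
    # "everysec": fsync roughly once a second; worst case ~2 seconds of writes at risk.
    # The alternatives are "always" (fsync at the end of every event loop
    # iteration) and "no" (leave flushing entirely to the OS).
    redis-cli CONFIG SET appendonly yes
    redis-cli CONFIG SET appendfsync everysec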

When you set appendfsync always, Redis still doesn't do an fsync after every write. If there are multiple clients writing, it'll batch the writes together, doing what he calls "group commit." That is, every client that performs a write in the current event loop will get its write fsync'ed at the same time, at the end of the event loop. I can't think of any downside to this, as I don't think clients get their response that data was written until the end of the event loop.

Restarting Redis requires either re-loading an RDB (Redis snapshot) or replaying AOF transactions to get to the state before the server was stopped. Redis is an in-memory database, so as you might expect, the start-up times are fairly onerous.

Redis server will load an RDB file at the rate of 10 ~ 20 seconds per gigabyte of memory used, so loading a dataset composed of tens of gigabytes can take even a few minutes.

That's the best case, as we note from the following:

Loading an AOF file [takes about twice as long] per gigabyte in Redis 2.6, but of course if a lot of writes reached the AOF file after the latest compaction it can take longer (however Redis in the default configuration triggers a rewrite automatically if the AOF size reaches 200% of the initial size).

It isn't pretty, but I'm really glad the author is giving so much transparency here. An optimisation mentioned is to run a Redis slave and have it continue serving the application while the Redis master is restarted.

The author notes that while, in high-volume environments, a traditional RDBMS can in theory serve reads from the moment it's started, in practice that can leave the database fairly unresponsive as the disks seek wildly to pull in the required data. He further notes that Redis, once it starts serving reads, serves them at full speed.

5Σ _____

 

Fifth Sigma designs core database architecture for Obama for America 2012 Campaign

November 4th, 2012 - Fifth Sigma was central to the architecture of the database back-end of the Obama for America Campaign 2012, the infamous "Narwhal" project. It was a wild ride, and ended in a flawless execution of the database infrastructure.

The underlying problem was that databases in Amazon are difficult to set up in a proper redundant fashion. The "availability zones" are actually just different parts of the same datacentre, and doing cross-region replication with RDS (a simple wrapper around MySQL, really) is not possible.

We set up vanilla MySQL in US-East and US-West and created a proper replication chain. The design called for failover, but was kept simple. Automatic failover, which has been known to cause more downtime than it prevents, was avoided.
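
As a sketch of what pointing a US-West slave at the US-East master looks like (the hostnames, credentials, and binlog coordinates below are placeholders, not the campaign's actual values):

    # Run on the US-West slave.
    mysql -e "
      CHANGE MASTER TO
        MASTER_HOST='mysql-master.us-east.example.com',
        MASTER_USER='repl',
        MASTER_PASSWORD='xxxxxxxx',
        MASTER_LOG_FILE='mysql-bin.000123',
        MASTER_LOG_POS=4;
      START SLAVE;"
    mysql -e 'SHOW SLAVE STATUS\G'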

5Σ _____