Graphite is a powerful tool for receiving, persisting, and (along with Grafana), reporting and alerting on metrics. While many cloud vendors provide excellent solutions as well, many organizations, including Zuora, deploy this open-source tooling as a central part of our observability infrastructure.
If you are unfamiliar, or just getting started with graphite, purchase a copy of “Monitoring with Graphite” by Jason Dixon. While this post glosses over the basics, Dixon’s book provides an excellent discussion of everything you need from getting started to advanced, reliable, highly scalable graphite infrastructures. Yes, this is 2017 and in the days of quick searches and thorough answers on StackOverflow, it’s unusual to see a recommendation to purchase a book. Just buy it. It’s that good.
At Zuora we manage a hybrid cloud infrastructure supporting our Zuora Central platform, including utilizing Amazon Web Services (AWS) for a variety of computing and storage needs. When it came time to implement graphite, AWS was a good choice for us. After quickly browsing through “Monitoring with Graphite”, I downloaded and installed all the components on a common, general purpose Amazon EC2 instance type (M3), running Centos 7, with general purpose Elastic Block Storage drives for persisting the metrics. Although we knew more work was required for the long term, this setup scaled well for our early users of up to roughly 8K metrics per second and 3 - 5TB of data.
First Attempts to Scale
While this was my first time setting and administering a metrics infrastructure, several of my colleagues as well as people I’ve met outside of Zuora, warned me that metrics adoption grows extremely fast, so be prepared. Curious about how far this system would scale, I set up a staging system, and utilized another tool I learned about in “Monitoring with Graphite” called hagger, to test the system at scale.
Hagger has the capability to send data very quickly to graphite, and once I got up to about 30K per second, I could see high CPU wait times, large caches sizes and delays in committing metrics to disk; in other words, my infrastructure would become I/O bound when metric production tripled from the early adopter level. This test, of course did not include the metric retrieval nor peaks and valleys of metrics production in an actual environment. So I had room for growth, but not much.
Monitoring the metrics infrastructure itself is accomplished with combination of htop, and yet another helpful resource supplied by Dixon (aka @obfuscurity), a Grafana dashboard titled Graphite Carbon Metrics. Yes, Graphite has an extensive set of self-reported metrics that can be quite overwhelming at first, but Dixon’s dashboard is an excellent way to organize and present them.
Improving Scale using Provisioned IOPS
So now that I was about to be I/O bound, the easiest way to remedy this, especially now that AWS lets you modify live EBS volumes without downtime was to change from general purpose (gp2) solid state drives (SSD) volumes to “provisioned IOPS”, (io1).
Provisioned IOPS guarantees a specific number of Input Output Operations per second (IOPS). But of course, like all things, AWS “easy” does not mean “inexpensive”. While this setup allowed me to scale to 80-100K metrics per second during performance tests, the projected persistence 8TB of metrics was getting more expensive than we liked.
Instance Store to the Rescue!
It was about this time that a buddy of mine explained that at his company, a highly scalable cloud based server log aggregator, likes the storage-optimized I3 AWS EC2 instance type for its fast I/O. These instances are called “instance store”, because the SSD storage is on the virtual server itself, not on a storage area network (SAN). This provides for very high throughput I/O at a much lower cost than provisioned EBS drives.
I decided to give these instance types a try, and using haggar, was able to achieve the same (and likely higher, but I didn’t test it) throughput as the most expensive io1 drives. This seemed like a very promising way to scale graphite cost effectively for the foreseeable future.
- Instance store drives have some disadvantages:
- 2TB maximum size
- Ephemeral, designed to be temporary storage
- Can’t snapshot, a common backup mechanism on AWS
- Can’t “modify volume” to change the size of the instance store drive
So, you need to chunk your data into groups less than 2TB, and more perhaps more importantly, implement either periodic backups, replicate your metrics or both.
It turns out that the 2TB chunks is an advantage in Zuora’s case because along with graphite’s “rule” relay method, it allows us both to allocate storage by team and easily expand capacity by adding an additional instance store.
Zuora’s Current Metric Collection, Persistence and Reporting Architecture
Given the above experience and performance testing, our Graphite system looks like this:
- A “front-end” M4 instance type that is the termination point for graphite carbon cache and graphite-web, the metric query system.
- Carbon’s RELAY_METHOD = rules, and relay-rules.conf to distribute the metrics on several “back-end” instance store
- Each back-end instance has one EBS gp2 SSD drive that matches the size of the instance store drive.
- rysnc (or carbonate) running on each backend, periodically to the EBS drive. Currently we are doing this every 60 minutes. Your frequency will depend on how fast you get through your rysnc, which of course, is dependent on how many metrics have changed between your previous rsync and the current one as well as the file-size of each metric. Do not judge your rsync times by the first one, because you typically only need to copy ALL of the data once, or if you change your storage_schema (the precision or resolution of the metrics themselves).
- EBS snapshots once per day
- Archive in S3/Glacier once per week
Graphite-web Caveat: Switching from local to remote metrics
Hopefully this part saves you the day it took for me to figure out why metrics retrieval was flaky when we first switched all users to the distributed system. Although clearly documented in “Monitoring with Graphite”, I missed the fact that when a local graphite receives a query and sends it to all of its remotes, it typically takes longer than the default settings of 3 seconds for REMOTE_FIND_TIMEOUT and REMOTE_FETCH_TIMEOUT. Make sure you set these significantly higher in graphite/webapp/graphite/local_settings.py.
Especially important if you are doing complex queries against large data sets, like we do here at Zuora, otherwise your Grafana graphs will likely show you (and alert on) “no data”.
AWS for hosting Graphite at scale is working well at Zuora and can be an excellent choice, for your company as well. Buy “Monitoring with Graphite” for the details of configuring graphite. If you are running into scale problems on gp2 EBS drives, consider switching to Provisioned IOPS. If Provisioned IOPS is too expensive, consider the I3 setup outlined above. Performance test with haggar before unleashing new systems on your users.
Did you find this post helpful? What is your experience with Graphite or other open-source metrics tooling? Reply to this blog to participate in the discussion!
Principal Software Engineer
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.