As mentioned in our previous Reinventing the wheel post, as well as our Earlybird and Custom Event Service posts, Zuora has embarked on a journey towards a microservices architecture for our entire product stack. As our older, monolithic application is rebuilt as individual services by different teams, we are able to make independent design decisions about how to build each service. One of these decisions centers around how a previously centralized database dissolves into smaller, disparate databases. That’s the topic of this blog post.
The Zuora Central platform provides powerful usage-based billing services. Usage information is processed by rating & billing engines, and is exposed through the REST API and data sources. Usage information can also be presented on invoices.
Because many customers integrate with Zuora for their usage-based business, there are billions of usage records occupying terabytes of space in our transactional databases. Currently, these usage records - together with other platform data - are sharded (distributed) across multiple database instances in our US and EU data centers.
One of the most common problems with our current architecture is that when a bill run processes a large volume of usage data, it consumes substantial system resources and performs heavy database I/O. This slows down other backend jobs running on the same application server and database shard.
As we are reinventing Zuora’s usage billing service, we would like to build a new standalone microservice and break out the usage data from our sharded MySQL platform. The new Usage service will provide a new set of data access API operations for all related business areas, including:
Usage record create, update, delete
Usage record query
Usage record aggregation
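To make the three groups of operations concrete, here is a minimal in-memory sketch. It is only an illustration of the shape such an API could take, not the actual Usage service interface: the names `UsageRecord`, `InMemoryUsageStore`, and all field names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, Dict, List


@dataclass
class UsageRecord:
    # Hypothetical fields, chosen only for illustration.
    record_id: str
    account_id: str
    unit: str
    quantity: Decimal


class InMemoryUsageStore:
    """Toy stand-in for create/update/delete, query, and aggregation."""

    def __init__(self) -> None:
        self._records: Dict[str, UsageRecord] = {}

    def create(self, record: UsageRecord) -> None:
        if record.record_id in self._records:
            raise KeyError(f"duplicate record id: {record.record_id}")
        self._records[record.record_id] = record

    def update(self, record_id: str, quantity: Decimal) -> None:
        self._records[record_id].quantity = quantity

    def delete(self, record_id: str) -> None:
        del self._records[record_id]

    def query(self, predicate: Callable[[UsageRecord], bool]) -> List[UsageRecord]:
        # Ad hoc filter: any predicate over the record is allowed.
        return [r for r in self._records.values() if predicate(r)]

    def aggregate(self, account_id: str) -> Dict[str, Decimal]:
        # Sum quantities per unit of measure for one account.
        totals: Dict[str, Decimal] = defaultdict(Decimal)
        for r in self._records.values():
            if r.account_id == account_id:
                totals[r.unit] += r.quantity
        return dict(totals)
```

A real implementation would push the filtering and aggregation down into the database rather than scanning records in application memory.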
After analyzing existing customer data and behavior, we finalized a set of business requirements that represent the key problems that need to be solved. In addition, because each service team owns its own full stack, operational effort and maintenance complexity are aspects that we also need to consider as part of our set of requirements.
Fast write and read concurrency
Support real-time API integration for data loading and retrieval
Efficient querying and support for ad hoc queries
Filters on columns are based on functional requirements and are not predictable during the initial table design phase
Consistency after each transaction
Basic reporting and analytics support
Large data volume
Tens of billions of objects and TBs of data
Thousands of records loaded per second
Hot and cold data separation and archiving ease
Durability, availability and resilience
On-demand and scalable
We need a flexible solution that we can quickly scale to handle increasing volumes of data
As we evaluated different database technologies, our first candidate was MySQL; it’s the database we’re currently using for the Zuora Central platform. We also looked into non-relational solutions such as MongoDB and Cassandra. Since Zuora’s microservice architecture is being implemented on AWS, RDS Aurora, RDS MySQL/Postgres and DynamoDB were also on the evaluation candidate list.
We’ll assume that you have the ability to look up each database technology to learn more about some of their characteristics, so we’ll skip straight to why we ended up choosing Aurora :-)
It’s probably more interesting to start with why none of the other database technologies fit the bill, so here goes:
MongoDB does meet the requirements on read & write performance, ad-hoc query and data volume support. However, it doesn’t support BigDecimal out of the box, meaning that we’d have to write some custom logic to handle it. Also, AWS does not have a native MongoDB service, and there’s a chance that any third-party MongoDB technology might fail in our security review.
We did consider setting up our own MongoDB cluster, but the operational effort would weigh too heavily on the service team; a MongoDB cluster would have to be company infrastructure maintained by a dedicated team.
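As an illustration of the kind of custom logic the missing BigDecimal support implies, the sketch below stores exact decimal quantities as scaled integers, which any database can hold without rounding. It uses Python's `Decimal` as a stand-in for BigDecimal. The fixed scale of six fractional digits and the function names are assumptions made for this example, not something we actually built.

```python
from decimal import Decimal

SCALE = 6  # assumed number of fractional digits to preserve


def encode_quantity(value: Decimal, scale: int = SCALE) -> int:
    """Convert an exact decimal into a scaled integer for storage."""
    scaled = value.scaleb(scale)  # shift the decimal point right by `scale`
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{value} has more than {scale} fractional digits")
    return int(scaled)


def decode_quantity(stored: int, scale: int = SCALE) -> Decimal:
    """Convert a stored scaled integer back into an exact decimal."""
    return Decimal(stored).scaleb(-scale)


# Sums computed on the stored integers stay exact, unlike binary floats.
total = encode_quantity(Decimal("0.1")) + encode_quantity(Decimal("0.2"))
exact = decode_quantity(total)
```

The cost of this approach is that every read and write path, as well as any query that compares or aggregates quantities, has to know about the scale.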
Cassandra meets most of the requirements, but “ad hoc” query performance would suffer. A rule of thumb for Cassandra table design is to have clearly defined access patterns, with partition keys and clustering keys chosen to match them. With this in mind, it would be hard to get good performance if the key design doesn’t account for the query. Also, secondary indexes are tricky to use and can greatly impact performance. This posed too much risk, because our features are developed incrementally, and the data model and access patterns can change unpredictably.
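The toy model below sketches why key design matters so much. It is plain Python, not Cassandra: a dictionary keyed by a hypothetical `account_id` partition key, with rows in each partition kept sorted by a `usage_date` clustering key. A query that follows the key design touches only the rows it returns, while an ad hoc filter on another column has to examine every row in every partition.

```python
from bisect import bisect_left, insort
from collections import defaultdict
from typing import Dict, List, Tuple

# (usage_date, unit, quantity); usage_date plays the role of the clustering key.
Row = Tuple[str, str, int]


class PartitionedTable:
    def __init__(self) -> None:
        self.partitions: Dict[str, List[Row]] = defaultdict(list)

    def insert(self, account_id: str, row: Row) -> None:
        # Keep each partition ordered by clustering key.
        insort(self.partitions[account_id], row)

    def by_key(self, account_id: str, start: str, end: str) -> Tuple[List[Row], int]:
        """Rows for one partition with start <= usage_date < end."""
        rows = self.partitions.get(account_id, [])
        lo = bisect_left(rows, (start,))
        hi = bisect_left(rows, (end,))
        return rows[lo:hi], hi - lo  # rows examined == rows returned

    def by_unit(self, unit: str) -> Tuple[List[Row], int]:
        """Ad hoc filter on a non-key column: scans every partition."""
        examined = 0
        matches: List[Row] = []
        for rows in self.partitions.values():
            for row in rows:
                examined += 1
                if row[1] == unit:
                    matches.append(row)
        return matches, examined


table = PartitionedTable()
table.insert("acct-1", ("2017-01-01", "GB", 5))
table.insert("acct-1", ("2017-01-02", "API_CALL", 40))
table.insert("acct-2", ("2017-01-01", "GB", 7))
table.insert("acct-3", ("2017-01-03", "GB", 1))

keyed_rows, keyed_cost = table.by_key("acct-1", "2017-01-01", "2017-01-02")
adhoc_rows, adhoc_cost = table.by_unit("GB")
```

With tens of billions of rows, the gap between the two costs is the difference between a targeted read and a scan across the whole cluster.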
DynamoDB can retrieve data quickly using keys and also supports secondary indexes per table. However, creating a secondary index actually creates another table that references back to the primary key of the original table. This means that clients have to take the primary key returned by the index query and use it to query the original table for the remaining attributes. Also, the number of indexes that can be created per table is limited, which doesn’t bode well for complex business use cases where access patterns change.
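The two-step lookup described above can be sketched with plain dictionaries. This is a simulation, not DynamoDB client code: it models an index that holds only keys, and the table contents and the `build_keys_only_index` and `query_via_index` helpers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Base table: primary key -> full item.
base_table: Dict[str, dict] = {
    "u1": {"id": "u1", "account_id": "acct-1", "unit": "GB", "quantity": 5},
    "u2": {"id": "u2", "account_id": "acct-2", "unit": "GB", "quantity": 7},
    "u3": {"id": "u3", "account_id": "acct-1", "unit": "API_CALL", "quantity": 40},
}


def build_keys_only_index(table: Dict[str, dict], attribute: str) -> Dict[str, List[str]]:
    """Secondary index: attribute value -> primary keys of matching items."""
    index: Dict[str, List[str]] = defaultdict(list)
    for primary_key, item in table.items():
        index[item[attribute]].append(primary_key)
    return dict(index)


def query_via_index(table: Dict[str, dict], index: Dict[str, List[str]], value: str) -> List[dict]:
    # Step 1: the index query returns only primary keys.
    primary_keys = index.get(value, [])
    # Step 2: each key is looked up in the base table for the other attributes.
    return [table[pk] for pk in primary_keys]


account_index = build_keys_only_index(base_table, "account_id")
items = query_via_index(base_table, account_index, "acct-1")
```

Each new access pattern needs its own index of this kind, which is where the limit on the number of indexes starts to hurt.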
DynamoDB also has a 1 MB limit on the result set size of a query, so clients have to paginate, passing the LastEvaluatedKey from each query response to retrieve the next page. In addition, DynamoDB requires us to benchmark and provision throughput for tables and indexes; if the values are not chosen well, we either pay for unused read and write throughput or have queries rejected when throughput limits are hit.
MySQL is what we’re using today, maintained by a dedicated Technical DBA Operations team. From a functional perspective, it meets all the requirements for our service. Also, our developers are more familiar with MySQL compared to other candidates. Generally speaking, MySQL is our plan B if we don’t have any other choices.
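The pagination loop that every client ends up writing looks roughly like the sketch below. `iter_items` takes any callable that accepts an `ExclusiveStartKey` argument and returns a response with `Items` and an optional `LastEvaluatedKey`, which is the shape of a DynamoDB query response. So that the example runs without AWS access, it pages through a fake in-memory query; `make_fake_query` and its page size are assumptions for illustration only.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

Page = Dict[str, Any]


def iter_items(query_page: Callable[..., Page], **params: Any) -> Iterator[dict]:
    """Yield every item, following LastEvaluatedKey until it is absent."""
    start_key: Optional[dict] = None
    while True:
        kwargs = dict(params)
        if start_key is not None:
            kwargs["ExclusiveStartKey"] = start_key
        page = query_page(**kwargs)
        yield from page.get("Items", [])
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:
            break  # no more pages


def make_fake_query(items: List[dict], page_size: int) -> Callable[..., Page]:
    """Fake query over items whose integer "id" equals their position."""

    def fake_query(ExclusiveStartKey: Optional[dict] = None, **_ignored: Any) -> Page:
        start = 0 if ExclusiveStartKey is None else ExclusiveStartKey["id"] + 1
        chunk = items[start:start + page_size]
        response: Page = {"Items": chunk}
        if start + page_size < len(items):
            # More data remains: hand back the key of the last item returned.
            response["LastEvaluatedKey"] = {"id": chunk[-1]["id"]}
        return response

    return fake_query


all_items = [{"id": i} for i in range(7)]
fetched = list(iter_items(make_fake_query(all_items, page_size=3)))
```

Every page is a separate round trip that consumes read throughput, so large result sets make both latency and capacity planning harder.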
So, then, why did we choose Aurora?
Aurora is MySQL 5.6 compatible. It has good read and write performance, and also supports ad hoc queries. Amazon claims that it has up to five times the performance of MySQL. Despite the fact that our service needs to deal with its own partition or sharding strategy, we have more to gain than compromise or lose if we go with Aurora:
Backup & Recovery - In many SaaS platforms, including Zuora's, data backup and recovery capabilities are among the most important items in the system design. Aurora reduces the backup and recovery operational overhead and any out-of-band automation required to perform replicated backups across AZs and regions.
Storage Auto-scaling - We need support for a large volume of data in our service, even though we relocate “cold” data to cheaper storage like S3 after a certain period of time. We also don’t want to provision storage in advance. So we need the storage to scale automatically; Aurora can grow in increments of 10 GB up to a maximum of 64 TB. The scaling is totally transparent and doesn’t introduce any downtime when we need to resize data.
Encryption - We need to secure data both in transit and at rest. Aurora supports SSL for data in transit and encrypted data at rest in the underlying storage. Encryption and decryption are handled seamlessly, making it much easier for our service to support encryption.
Given all this, we decided to move forward with Aurora for our Usage Service!