Happy Business Starts Here

Your Top Questions On Data Query JOINs: Answered

Kate
Zuora Staff

Your Top Questions On Data Query JOINs: Answered

While many  of you may have landed on this post to read more about the complexities of Index Join, let’s begin with a simple introduction to SELECT statements with and without JOINs. Take the following scenario:

 

Zuri, a Data Analyst, is writing a query to get the Rate Plan and Rate Plan Charge names for the Subscriptions for each account. A query without JOINs will look like the following query and output:

 

Query: Select rateplan.name as RPN, rateplancharge.name as RPCN, subscription.name as SUBN from RatePlan, RatePlanCharge, Subscription

Output: ERROR: Reached maximum number of rows 

 

Why did it reach the maximum rows? 

This is because one or more of the tables referenced in the query (Rate Plan table, the Rate Plan Charge table, or the Subscription table) stores more than 1,000,000 rows. When you list tables separated by commas, the SQL engine scans each row in each table to obtain every result. With each scan, it adds that record to the results set. Then the results set is a list of all rate plan names, rate plan charge names, and subscription names regardless of the relationship between the two. We usually want more information than this query is telling us. 

 

Thus, the best practice is to use JOINs.  

 

What does a query with JOINs  look like?

A query with JOINs will look like the following query and output:

 

Query: Select rateplan.name as RPN, rateplancharge.name as RPCN, rateplan.amendmenttype, subscription.name as SUBN

FROM Subscription JOIN Rateplan ON subscription.id = rateplan.subscriptionid JOIN rateplancharge ON rateplan.id = rateplancharge.rateplanid

Output: 98 Rows

 

 

Using the JOINs helped narrow down the search. Instead of pulling all of the data and creating a huge list of records, Zuri optimized the query to only search for the records with matching Subscription to Rate Plan to Rate Plan Charge.

 

In this example, the JOIN was quite simple. No added prefix or suffix, just JOIN across 3 tables. However, subscription billing data can be much more complicated than that. 

 

What types of JOINs are available in Data Query?

Data Query offers different types of JOINs depending on the data in the tables you are working with and the results that you want. Outside of the four SQL JOINS (INNER, RIGHT, LEFT, SELF, and FULL) users can specify an index as a filter in the query and turn on Index Join. 

 

When is it best to JOIN?

 

Use JOIN: 

  • If the scan after the filter is applied to table A (the larger table) is less than 1,000,000 records
  • If the scan after the filter is applied to table B (the smaller table) is between 20,000 and 1,000,000 records

 

Data Query SQL engine is supported by a Presto Coordinator that prepares queries to be pushed down to the databases for data extraction. The coordinator decides the ordering in which it will scan the tables and then lookup records to JOIN ON. In Presto, the first table listed in the query (before JOIN) is our Probe table and the second table (after JOIN) is the Build table. Presto filters the Build table by any relevant filters in the WHERE clause, then creates a new filtered table, Hash table, of the remaining records. Presto will then probe, or examine, the Probe table using the Hash table to find the records that match the JOIN criteria listed after ON. 

 

Let’s take a look at an example:

The B table is the Build table and the A table is the Probe table. The Hash table is built from the filtered results of the Build table, B. Because there is no filter on B, the Hash table is just as Big as the Build table (not something we want to do because the Hash table is the table we send to get scanned… and not reach 1,000,000 records! Try adding a filter on the Build table too… the more filters, the faster things go!) Once the Hash table is built, Presto iterates through a filtered table A to match A.B_id = B.id. The Probe table will get filtered by Presto because of the A.B_id = B.id filter. In this case, there are 2 table scans: filtered Probe table A and create Hash table from B.


SELECT * FROM A JOIN B ON A.B_id = B.id WHERE A.status = ‘Active’ 

 

 

 

In order to more effectively scan, the filter is now placed on the B table. Therefore, making it smaller before the scan to the database. 

 

In this case, there are 2 table scans: filtered Probe table A and create filtered Hash table from B. But this time, the Hash table is much smaller with a filter step applying B.id = ‘abc’.

 

SELECT * FROM A JOIN B ON A.B_id = B.id WHERE B.updatedDate = ‘2020-01-08’

 

 

When do I turn on Index Join?

 

Turn ON Index Join:

 

 

  • If the scan after the filter is applied to table A (the larger table) is less than 1,000,000 records
  • If the scan after the filter is applied to table B (the smaller table) is less than 20,000 records


When Index Join is turned ON for the query, the Presto coordinator changes the order in which it scans the queried tables and matches the JOIN criteria. In this case, Presto filters records in table A based on the filter criteria in the WHERE clause. The filter in this query will significantly filter the A table to only include records where B_id = ‘abc’. Once filtered, the coordinator performs an index lookup using the index B.id = ‘abc’.

 

Instead of the creation of the Hash table, turning on Index Join causes the engine to perform an index lookup to match the results and bypass the scan required to match results from a Hash table to a Probe table. In this case, there is 1 table scan and 1 index lookup: scan to filter A and index lookup to match records in B.

 

SELECT * FROM A JOIN B ON A.B_id = B.id WHERE A.B_id = ‘abc’



It is important to note the limitations around this JOIN. When the Presto coordinator does the index lookup, it creates a SQL statement in which all of the records associated with that index value are listed (i.e. WHERE id IN {xxx, xxx, xxx}). If this list is too long (there are many records with this index value), then SQL statement character limits will be hit and the query will not perform quickly. Therefore, we keep the limit at 20,000. If you have a filter with less than 20,000 records, turn index join ON via API flag useIndexJoin or UI checkbox for Index Join.

 

Which fields to filter on?

The fields that are to be filtered on must be chosen with purpose. These are the fields used in the WHERE clause or After the ON. Be aware of the different fields you choose in your filters. For most Zuora Standard Objects, UUIDs, dates, and statuses are good options for optimized filters. UUIDs and updateddate are optimized for filtering. For custom objects, filter on the indexed fields. 


JOINs are extremely important for the optimization of queries to properly connect data across tables, gain insights, and reduce the compute time needed for the SQL engine to scan the records you need. By understanding the different JOINs available and how they each have benefits on query optimization, you can now create beautiful SQL queries for your unique data! 

 

Hope you learned something new; Happy Querying 🙂 

 

Have a question about JOINs? Email me at kmcnally@zuora.com.

 

2 REPLIES 2
gauri
Master

Re: Your Top Questions On Data Query JOINs: Answered

This is very helpful. I have emailed you with a question. 

msonsma
Savvy Scholar

Re: Your Top Questions On Data Query JOINs: Answered

Hi Kate,

 

Nice post but....just wondering why this is the only place in the entire Zuora documentation sources that refers to the Presto as the underlying Data Query's database engine ? 

 

Isn't an idea to mention the Presto DB on the Zuora's UI Data Query page / Workflow Retrieve  task itself with a link to the Presto's SQL syntax and functions page: https://prestosql.github.io/docs.prestosql.io/current/functions.html. 

 

That's more helpful that a  'show functions' statement....

 

BR,
Manfred