Dear Analyst
Dear Analyst #32: How to use the QUERY function in Google Sheets on COVID-19 data
The QUERY() function in Google Sheets gives you the ability to quickly filter and sort your data similar to how you might get data from a database. If you write SQL queries, the QUERY() function feels easy and natural to use. There are a few caveats as I discuss in this episode. If you want to follow along with the exercises I discuss in this episode, make a copy of this Google Sheet which contains the QUERY() functions I mention in the episode.
Basic query to find confirmed cases greater than 50,000
Our data set is from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. The data shows confirmed cases, deaths, and recovered cases by country (188 countries) on May 1st:
The first query simply pulls back the list of countries and confirmed cases where the number of confirmed cases is greater than 50,000. Notice how you reference the column letter name versus the actual name of the column in the header row:
The first parameter is covid_data which is a named range in Google Sheets. In this case, it references cells A1:E188 in our data set.
More SQL-like commands
You can do many database-like commands with the QUERY() function. The next example shows how you can use the ORDER BY command to find countries with deaths between 0 and 5 and the resulting list is sorted in descending order:
Check out Ben Collins' blog post about the QUERY() function to see some of the other SQL commands you can use.
Adding in new calculated columns
In the third query, we get a little more advanced and use the LABEL command to create a new "column" called Case Fatality Rate. This calculation is simply Confirmed / Deaths. Unlike SQL, you put the LABEL at the end of the command instead of in the beginning of the SELECT statement:
Coming from SQL, you'll need to account for the difference in the order of commands in the query in order for it to work correctly.
Inability to select column names
You'll notice that you don't put the actual names of the columns in your header row in the query. This can be a pro or con of the QUERY() function depending on how your underlying data set is structured.
Columns are changing a lot
If you underlying data is constantly "shuffling" where columns are moving around and the structure of the data is not set in stone, the QUERY() function will most likely break because you're referencing the column letter instead of the column name like in a traditional SQL query.
Columns are fixed
If your columns are not shuffling around a lot, this syntax of selecting the column letter may actually be easier for you. This is because you don't have to type out the long column name in the QUERY() function. If data is simply getting appended to the bottom of your data set, then the QUERY() function should work fine for you because the letters of the columns will always reference the correct columns of data.
PivotTables vs. the QUERY() function