Dear Analyst

Dear Analyst


Dear Analyst #45: Thinking long-term for structuring your dataset using U.S. public food assistance data

September 28, 2020

When you need to capture some data in a structured way, you'll open up an Excel file or Google Sheet and just start throwing data into the spreadsheet. Not much thinking; just copy and paste. As that dataset grows, the original structure you had set up for that spreadsheet may not be ideal. Specifically, the dataset is not ideal for putting into a PivotTable. Long-term, I'd argue that all your spreadsheets should be structured in a way that's suitable for a PivotTable (which makes it ready for storing in a traditional database). This post explores how you can structure a dataset that looks like 99% of data out there into a structure you can analyze in a PivotTable. Link to the Google Sheet is here.

Video walkthrough of Google Sheet:

https://www.youtube.com/watch?v=qFhluLJQpfA

Why this is important

Telling someone that their data should be structured is a platitude like "such is life" and "forgive and forget." Let's be more specific in how this statement can impact your work.

To be specific: 9 times out of 10, structure your data so that it can always be analyzed in a PivotTable.

Consider this scenario:

* Your accounting team needs your group to start forecasting expense for next month's budget* You start gathering the data and throw it into a spreadsheet* Every month new data gets added to the spreadsheet, and perhaps the CFO wants to get more granular analyses on the forecast* You start adding additional columns to the spreadsheet and perhaps summary tables in other sheets in the file* Other teams now need to see your data to understand how your team's decisions will impact their decisions* This spreadsheet ends up being too hard to maintain, so there's an internal project to put this data into a real database (some ERP solution)* One quarter of planning goes by, and another quarter for implementation* 6 months later, the business has changed, the structure of the database needs to be adjusted, and the data engineer role still needs to be filled

This concocted scenario is quite extreme, but the key lesson is this:

Focusing on the schema and structure of your spreadsheet today takes time and requires you to think about how your data will be used and maintained in the future.

U.S. public food assistance dataset

I've started browsing Kaggle's to find interesting datasets recently, and this one caught my attention since it looks at spending and household participation related to a public food assistance program called SNAP. As the creator of the dataset discusses, there are many issues with collecting government datasets. Data is spread out across different agencies, there are multiple formats, and data is sometimes aggregated. This makes consolidating the data a pain. These problems may sound familiar if you're working at a large organization.

The "Raw" sheet in the Google Sheet simply shows the cost, households participating, and total people associated with the SNAP program for the 2019 fiscal year across four states (CA, IL, LA, NY):

In your organization, this could be sales data, headcount data, COGS, whatever. The key thing about this dataset is that you have all the numbers organized by month across the top.