Dear Analyst
Dear Analyst #63: Cleaning Bitcoin Tweet data with OpenRefine, a free and open source alternative to Power Query
Numerous studies claim that data scientists spend too much time cleaning and preparing data (although this article claims it is a bullshit measure). I agree with some points in that article in that you should get your hands dirty with cleaning data to understand what eventually goes into the analysis. You may already be cleaning up messy data today with Power Query, an add-in for Excel ten years ago which now is its own standalone application. For those who don't have Office 365, a recent version of Excel, or a Mac, what tool can you use for cleaning up data? The main tool I've been using is OpenRefine. The main reason I reach for this tool: it's free. It's like Power Query for the masses. I've been wanting to do this episode for a while, so get your messy dataset ready. The Google Sheet for the examples in this episode is here.
If you want to watch just the tutorial portion of this episode, see the video below:
https://www.youtube.com/watch?v=0Fyh5x1QjMs
OpenRefine history
You can read more about OpenRefine's history on this blog post. The tool started as an open source project in 2010 before Google bought the company that created the tool (Metaweb). The tool was renamed to Google Refine for two years, but Google eventually stopped supporting it in 2012. The blog post cites a few reasons why Google stopped supporting the tool. I think one of the main reasons is that it's a desktop application and not run in the cloud. This probably conflicted with Google's own cloud ambitions for what is now Google Cloud Platform where they have data cleaning tools all in the cloud.
Since Google dropped support in 2012, it's exciting to see a good number of contributors to the project and an active mailing list. One feature I think that will keep OpenRefine relevant among analysts and data scientists who need to clean their data is the reconciliation service (similar to Excel's rich data types). More on this later.
Clean messy data, not organize clean data
As I've been using OpenRefine over the years, I've found that I reach for OpenRefine for specific use cases. It doesn't aim to be an all-in-one tool. When you first launch OpenRefine, you'll see the main tagline for the tool in the top left:
A power tool for working with messy data.
It just says it like it is. It doesn't do PivotTables, charts, or other things you might find in a spreadsheet tool. It does one thing and one thing well: clean messy data. I also think it does a good job of exploring outliers of your data, but it's all in service of ridding your dataset of inconsistencies.
This post from Alex Petralia is a good read with regards to how you should think about OpenRefine:
In fact, what differentiates clean data from messy data is not organizational structure but data consistency. While clean datasets may not be organized as you’d like (eg. datetimes are stored as strings), they are at least consistent. Messy datasets, on the other hand, are defined by inconsistency: typos abound and there is no standardization on how the data should be input.