Tidy Data

Written by Steve Shideler | Oct 14, 2020 5:32:00 PM

As a teenager, I was notorious in my family for having the messiest of rooms. While I preferred to call it “organized chaos” rather than a mess, I cannot deny the amount of time and effort I spent (wasted) whenever I needed to locate something in that chaos. My struggles to find a missing sock, or the essay I had printed off for school, were met with my parents’ oft-repeated question: “why don’t you just tidy up?”

In data analysis, the workflow can generally be thought of as three stages: structural organization, visualization, and finally modelling of the data set. However, the data sets I typically work with are what data analysts would call “messy”. Much like the time wasted searching through my childhood mess of a room, messy data sets drastically increase the time and cost of the structural organization stage of the data analysis workflow.

Being able to reduce costs and find efficiencies in processes is paramount. Having “tidy” data sets not only reduces costs but allows data analysts (consultants) to deliver results faster and with less potential for error. You might be wondering: what exactly is tidy data? Those tables and charts looked good in your PDF report, right? Aren’t they tidy?

Let me share a few tips on how you and your organization can save time and money by tidying up those data sets that took valuable resources to collect and produce, so that you get the maximum return on that investment.

  1. Call before you dig!

    While not always possible, before expensive field data collection begins, I would encourage you and your organization to speak with those who will be analyzing the data produced. If certain types of visualization or statistical modelling are desired, you need to make sure your sampling program can produce the data required for that particular analysis.
  2. Ditch the PDF!

    Sending only PDF reports to a data analyst is something to avoid if possible. One data set that I worked with consisted of 3,000+ pages of PDF documents full of tables. This amounted to over 1 million samples, with over 300 columns of data for each sample. The data had to be manually extracted from the PDF files and then quality controlled for the accuracy of that extraction. It would have saved a great deal of time, and considerable expense, if the tables in the PDFs had also been submitted as a database file or even as Excel sheets. Those tables were likely made in Excel in the first place; send the source files along with the PDF if possible. Nobody needs to pay twice to get their own data into tables. Stop hoarding the data; it’s time to make it readily available and shareable, especially to the client that paid for it.
  3. Tidy Data

    Data wrangling can be a real struggle. There are outliers to deal with, data to parse, and missing values to impute. While providing your data analyst with database files or Excel tables is a great start, the table you made to present to your colleagues, while impressive looking, may not be that useful for downstream applications. Data analysts often use programming languages such as Python and R to provide an in-depth analysis of large data sets in a short period of time. Having your data in a standard format for computing, like the tidy data standard (Wickham, 2014), facilitates this speed. There are three rules to follow to achieve this.

Rules for Tidy Data in Data Analysis:

      1. Each of your variables, like depth, compound concentrations, and sample names, needs to form a column.
      2. Each observation, or sample, should then form a row.
      3. The observations (samples) should be organized in a single table, rather than spread over multiple tables.

That’s it, three rules: straightforward and simple. Once data is in a tidy format, then your data analyst can use powerful tools to visualize and model the data set.
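To make this concrete, here is a minimal sketch in Python with pandas, one of the tools mentioned above. It assumes a hypothetical lab-report-style layout, with analytes and field measurements running down the rows and one column per sample; the analyte, sample, and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical "messy" lab-report layout: analytes and field
# measurements run down the rows, with one column per sample.
messy = pd.DataFrame({
    "Analyte": ["Benzene", "Toluene", "Depth_m"],
    "Sample-01": [0.12, 0.45, 1.5],
    "Sample-02": [0.08, 0.30, 3.0],
})

# Rules 1 and 2: melt to long form, then pivot so that each sample
# forms a row and each variable (Benzene, Toluene, Depth_m) forms
# a column.
tidy = (
    messy.melt(id_vars="Analyte", var_name="Sample", value_name="Value")
         .pivot(index="Sample", columns="Analyte", values="Value")
         .reset_index()
)
print(tidy)  # one row per sample, one column per variable

# Rule 3: if the same kind of observations arrive spread across
# several tables (for example, one per report page), stack them
# into a single table:
# all_samples = pd.concat([tidy_page1, tidy_page2], ignore_index=True)
```

A few lines of reshaping like this are all it takes once the data arrives in a machine-readable table; the same cannot be said for data locked inside a PDF.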

Now, these suggestions might not be applicable or possible in every situation. However, as a data analyst, I can say that more than 75% of my billable time is spent on the structural organization portion of the data analysis workflow. If you are able to implement even some of what I have discussed here, it will result in considerable savings in time and resources for your organization. If you have any questions about this, or about a potential project you are looking to begin, don’t hesitate to reach out. At Chemistry Matters, we have the expertise and experience to help you get your data collected and analyzed right the first time! Tidy data can then be turned into slick visuals; more to come on designing the right visualizations to get your point across to your audience.

References:

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10