by Serena Peruzzo

When I went to university to study statistics, data science didn’t really exist1. An article2 published in the Harvard Business Review in 2012, claiming that ‘data scientist’ was the sexiest job of the century, made the term popular, and since then the subject has drawn an incredible amount of attention. There is a huge number of resources available online on what data scientists do, how they do it, and what skills are most needed to become one, but they all tend to focus on the more glamorous aspects of modelling, often dismissing the earlier stages as mundane.

A popular ‘statistic’ states that 60 to 80 percent of a data scientist’s time is dedicated to data cleaning (also referred to as data wrangling or data munging) while only a much smaller percentage is devoted to their real specialty: building models and algorithms.

I don’t disagree with the numbers: if you’re starting a new project, you’ll be spending an awful lot of time collecting, cleaning and pre-processing data. What I disagree with is the tendency to use this statement to argue that this step is a waste of time and that there is a need for tools that automate the process of data cleaning in order to free data scientists to focus on models and algorithms. Those making this argument are often trying to sell you a piece of software or to convince you to use a particular library.

The truth is, when you’re cleaning your data you’re not just dealing with missing values, errors, noise or inconsistencies; you’re fulfilling the two main requirements for building a good model: understanding the problem you’re trying to solve, and gaining perspective on the data available in that domain.

At the very beginning of a data science project not much is known about the data, and decisions are made under conditions of uncertainty, often relying on intuition and experience. This is the very first step towards transforming data into actionable insight: correcting inconsistencies and imputing3 missing data may seem like trivial tasks, but it is at this stage that raw data is transformed into variables that can be analysed and reveal statistical relationships. The decisions made at this stage can make the difference between introducing biases and facilitating the training of the final model.

In a nutshell: data cleaning isn’t as straightforward as it looks, and it is key to producing valid and reliable models. Data wrangling generally results in writing scripts that address a variety of issues in the raw data. Technical aspects include type conversion, string parsing, outlier detection and imputation of missing data. The following sections describe some of the most common issues data cleaning deals with, and how they can be addressed.

Messy data vs tidy data

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types4. In statistics, variables are collections of values measuring the same attribute, observations contain all values measured on the same unit, and observational units represent the entities whose characteristics are being measured. This means that in tidy data:

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table

Most statistical packages provide datasets that can be used to play with various methods. To allow you to focus on the analysis, these datasets are provided in their tidy form, which makes it easy to extract variables and information. This is great for learning, but leaves junior analysts quite unprepared for what real-world data looks like.

Messy data is any arrangement of the data that doesn’t respect the above matchings and may present any combination of:

  • Column headers as values, not variable names
  • Multiple variables stored in one column
  • Variables stored in both rows and columns
  • Multiple types of observational units stored in the same table
  • A single observational unit stored in multiple tables

Here’s a simple example of messy data. The table is an extract from a report published by Pew Research Center on Religion and Education5:

religion        men   women
jews            13.4  13.4
christians       9.5   9.1
unaffiliated     9.2   8.3
buddhists        8.5   7.4
muslims          6.4   4.9
hindus           6.9   4.2
global average   8.3   7.2

There are three variables here: religion, gender, and average years of formal schooling, but only one of them (religion) forms a column, and the column headers are values (men, women) rather than variable names. To fix this, the men and women columns should be turned into rows.

The result is:

religion        gender  value
jews            men     13.4
christians      men      9.5
unaffiliated    men      9.2
buddhists       men      8.5
muslims         men      6.4
hindus          men      6.9
global average  men      8.3
jews            women   13.4
christians      women    9.1
unaffiliated    women    8.3
buddhists       women    7.4
muslims         women    4.9
hindus          women    4.2
global average  women    7.2
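
As an illustration, here is one possible way to do this reshaping in R with the tidyr package; the religion_wide and religion_tidy names are just placeholders for this sketch.

library(tidyr)

# messy table: one row per religion, one column per gender
religion_wide <- data.frame(
  religion = c("jews", "christians", "unaffiliated", "buddhists",
               "muslims", "hindus", "global average"),
  men   = c(13.4, 9.5, 9.2, 8.5, 6.4, 6.9, 8.3),
  women = c(13.4, 9.1, 8.3, 7.4, 4.9, 4.2, 7.2)
)

# turn the men and women columns into a single gender variable
# (row order may differ from the table above)
religion_tidy <- pivot_longer(religion_wide,
                              cols = c(men, women),
                              names_to = "gender",
                              values_to = "value")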

Going from messy to tidy data may require several steps, and it is only the first component of the cleaning process. It’s therefore important to keep every step clear and documented, so that the resulting process is reproducible in case it needs to be run again. Chances are it will.

From tidy data to clean data

Tidy data isn’t necessarily clean data. Clean data is consistent, i.e. fit for statistical analysis. To reach this status, one first needs to deal with outliers, missing data, errors and special values. At a high level, this involves three steps: detecting an inconsistency, selecting the field (or fields) responsible for it, and rectifying it via deletion, correction or imputation.

Obvious inconsistencies

Obvious inconsistencies are values that do not reflect real-world situations (e.g. a person’s age cannot be negative), or values of the same variable that are not consistent with each other, like different spellings of the same category in a categorical variable. Constraints of this type are called edit rules, or edits.

Obvious inconsistencies are generally easy to identify and correct, depending on auxiliary information available.
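
As a minimal sketch, edits like these can often be expressed directly as logical conditions in R; the data frame and column names below are hypothetical.

# hypothetical data with an age and a country column
df <- data.frame(
  age = c(34, -2, 51, NA),
  country = c("UK", "U.K.", "United Kingdom", "uk")
)

# edit rule: age must be non-negative; flag violations as missing
df$age[!is.na(df$age) & df$age < 0] <- NA

# harmonise different spellings of the same category
df$country[df$country %in% c("U.K.", "uk", "United Kingdom")] <- "UK"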

Outliers

An outlier can be defined as an observation (or a set of observations) that appears to be inconsistent with the rest of the set. A quick and easy way of detecting outliers is the boxplot rule, or Tukey’s test.

In a boxplot, the box represents the difference between the 75th and 25th percentiles (the interquartile range), with the darker line inside it representing the median, while the whiskers outside the box represent the limits for outlier detection. Conventionally, the upper whisker extends to the largest observation that does not exceed the third quartile plus 1.5 times the interquartile range; the lower whisker is computed likewise below the first quartile.

This is easily implemented in most statistical packages. R provides a built-in function that makes it easy to extract outliers using the boxplot rule:

# create a vector of 15 randomly selected numbers between 0 and 10, plus 20 and 30
x <- c(sample(x=0:10, size=15, replace=TRUE), 20, 30)
x
## [1]  3  8  4  9 10  0  5  9  6  5 10  4  7  6  1 20 30
boxplot.stats(x)$out
## [1] 20 30

In a boxplot, these outliers would be drawn as individual points beyond the whiskers.
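
The same limits can also be computed by hand from the quartiles; the sketch below reuses the vector x from above (note that boxplot.stats uses Tukey’s hinges, which can differ slightly from quantile’s default estimates).

q <- quantile(x, c(0.25, 0.75))   # first and third quartile
iqr <- q[2] - q[1]                # interquartile range
x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]   # candidate outliers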

The boxplot rule works well for unimodal and symmetrical variables but fails when the distribution of the data is skewed. In that case, one can apply a logarithmic or square root transformation to the data before applying the boxplot rule, or use a method that takes the skewness into account.
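
For example, for a right-skewed variable one might apply the rule on the log scale; this is just a sketch with simulated lognormal data, assuming strictly positive values.

# right-skewed data: the raw boxplot rule tends to flag many points in the long tail
set.seed(42)
y <- rlnorm(100)

boxplot.stats(y)$out        # outliers on the original scale
boxplot.stats(log(y))$out   # usually far fewer on the log scale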

It’s important to note that outliers are not errors. Identifying them helps in understanding the data, but whether to include them in the analysis or not is a statistical decision.

Missing data

A missing value is a datum for which the type is known but the value isn’t, e.g. an unanswered question about age in a questionnaire is a missing datum: it should be a number, but the value is not given. Statistical analysis can’t be performed on missing data, so the choice here is between removing them and imputing them.

Imputation is the process used to determine and assign replacement values for missing, invalid or inconsistent data that have failed edits. The main benefit of imputation, in terms of analysis, is that it maintains the full sample size; however, it can introduce bias depending on the approach adopted.

The easiest way to impute is to replace all missing values with the average (or another centrality measure) of that variable. This is called mean imputation. The main risk of this approach is that it can severely distort the distribution of the variable, more specifically by underestimating its variability.

An alternative to mean imputation is hot deck imputation, where missing values are imputed by copying values from similar records in the same dataset. The main question in hot decking is how to choose the replacement. In random hot deck imputation, values are chosen randomly from the set of observed values. In sequential hot deck imputation, records are sorted by some auxiliary variable and missing values are imputed with the value from the first following record that has an observed value. In predictive mean matching, a distance function is used to determine the nearest neighbour, and the missing value is copied from it.
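
As a rough sketch of the simplest variant, random hot deck imputation for a single numeric vector could look like this (the helper name is just for illustration):

# random hot deck: fill each missing value with a randomly chosen observed value
random_hot_deck <- function(x) {
  x[is.na(x)] <- sample(x[!is.na(x)], size = sum(is.na(x)), replace = TRUE)
  x
}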

To demonstrate the difference between mean imputation and hot deck imputation, we’ll randomly introduce 20% missing data in the iris dataset6.

library(VIM)
data(iris)

# add some missing values: randomly set 20% of Sepal.Length to NA
iris$Sepal.Length[sample(1:nrow(iris), nrow(iris) * 0.20, replace=FALSE)] <- NA

# mean imputation: replace missing values with the variable's mean
iris.mean <- iris
iris.mean$Sepal.Length[is.na(iris.mean$Sepal.Length)] <- mean(iris$Sepal.Length, na.rm=TRUE)

# hot deck imputation: k-nearest-neighbours (kNN from the VIM package)
iris.knn <- kNN(iris, k=5, numFun=median)

In iris.mean missing values in Sepal.Length are replaced with the average of the variable, while in iris.knn they are replaced with the median value of the 5 nearest neighbors, where the distance is based on the other variables in the set.

summary(iris$Sepal.Length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  4.400   5.100   5.800   5.843   6.425   7.900      30 

summary(iris.mean$Sepal.Length)
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  4.400   5.325   5.843   5.843   6.300   7.900 

summary(iris.knn$Sepal.Length)
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  4.400   5.100   5.800   5.863   6.475   7.900

In the first case (mean imputation) the mean of the variable has remained unchanged; however, there has been a significant reduction in variability, with the first and third quartiles moving closer to each other. In the second case (hot deck imputation with kNN), there is no significant change in the distribution.
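
One way to make the loss of variability visible is to compare the standard deviations of the original and imputed variables directly (continuing from the code above; exact numbers will vary with the random missingness):

sd(iris$Sepal.Length, na.rm = TRUE)   # original, missing values excluded
sd(iris.mean$Sepal.Length)            # shrinks under mean imputation
sd(iris.knn$Sepal.Length)             # typically much closer to the original under kNN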

The choice of imputation method depends on the auxiliary information available and on whether there are restrictions on the data to be imputed; however, the ultimate goal is to preserve the distribution of the data and avoid introducing bias. For this reason, there is no single best imputation method, and different variables of the same set will often require different imputation procedures.

Epilogue

This is by no means a comprehensive guide to data cleaning, but rather a small sample of data wrangling issues and how they can be solved. In reality, there isn’t a unique definition of clean data or a single set of tasks to perform to reach that status. Data cleanliness is a set of conditions that changes depending on the problem to be solved, and even the definition of an inconsistency, and how it is identified and dealt with, might vary across projects.

In terms of tools, data cleaning scripts often combine various resources such as SQL, command line code, R and Python, to cite the most common. Part of the challenge is also identifying the best workflow given the goals and constraints of the project.

As private and public organisations increasingly rely on algorithms to make decisions and carry out their activities, it’s important that the data they’re based on is reliable, consistent and unbiased. With all the steps and decision making involved in the process, data cleaning is far from a trivial task, and the expertise it requires puts it right at the core of data science as a building block of any analysis.

Serena is a statistician and a self-confessed PhD dropout, currently working as a data scientist in London. She enjoys using math and code to solve real-world problems and is obsessed with data, food and travels, not always in this order.


  1. To be completely fair, the term Data Science existed, but it didn’t have the same meaning and drew far less attention. 
  2. Here is the original HBR article. 
  3. Imputation is the process of replacing missing data with substitute values. More on this topic below. 
  4. H. Wickham, 2014, Tidy Data. 
  5. Original table here. 
  6. The iris dataset is a public dataset widely used to demonstrate various machine learning and statistical techniques.