Definition
Data cleaning is the process of identifying and correcting data that are inaccurate, missing, or incomplete. Data cleaning tasks can include removing duplicate records, investigating extreme values (e.g., outliers), converting dates from one format to another, removing unwanted text, splitting multiple data points in a cell into separate cells, or coding missing or NA values.
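Several of the tasks listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a prescribed workflow: the records, the field names, the MM/DD/YYYY input format, and the -999 missing-value code are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical raw records; field names and values are invented for illustration.
raw_rows = [
    {"id": "1", "date": "03/15/2021", "location": "Austin, TX", "count": "12"},
    {"id": "1", "date": "03/15/2021", "location": "Austin, TX", "count": "12"},  # exact duplicate
    {"id": "2", "date": "04/02/2021", "location": "Denver, CO", "count": "NA"},
]

MISSING_CODE = -999  # assumed code for missing values


def clean(rows):
    """Return de-duplicated rows with ISO dates, split locations, and coded missing values."""
    seen = set()
    cleaned = []
    for row in rows:
        # Remove duplicate records: skip any row whose full content was already seen.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)

        # Convert dates from MM/DD/YYYY (assumed input format) to YYYY-MM-DD.
        iso_date = datetime.strptime(row["date"], "%m/%d/%Y").date().isoformat()

        # Split two data points stored in one cell into separate fields.
        city, state = [part.strip() for part in row["location"].split(",", 1)]

        # Code missing or NA values explicitly.
        count = MISSING_CODE if row["count"] in ("", "NA") else int(row["count"])

        cleaned.append(
            {"id": int(row["id"]), "date": iso_date, "city": city, "state": state, "count": count}
        )
    return cleaned


cleaned_rows = clean(raw_rows)
```

Each step changes the data on the basis of an assumption (what counts as a duplicate, which strings mean "missing"), so it is worth recording these choices alongside the cleaned file.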
Example: Using a regular expression (regex) in OpenRefine to identify a pattern or sequence within text and remove or replace it (e.g., finding all “NA” values and replacing them with -999).
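The same find-and-replace idea can be sketched outside OpenRefine with Python's built-in re module. This is a swapped-in illustration, not OpenRefine's own syntax; the sample cell values and the helper name replace_na are hypothetical.

```python
import re

# Matches a cell whose entire content is "NA", allowing surrounding whitespace.
# Anchoring with ^ and $ keeps values such as "NASA" from being altered.
NA_PATTERN = re.compile(r"^\s*NA\s*$")


def replace_na(cells, code="-999"):
    """Replace every cell that is exactly 'NA' with the given missing-value code."""
    return [NA_PATTERN.sub(code, cell) for cell in cells]


cells = ["14", "NA", " NA ", "NASA", "7"]  # hypothetical column values
result = replace_na(cells)
```

Anchoring the pattern to the whole cell is a deliberate choice: an unanchored search for "NA" would also rewrite any longer value that happens to contain those letters.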
Similar Terms
Tidyverse is a collection of open source R packages, several of which can be used for data wrangling and cleaning.
Pandas is an open source Python library for data manipulation and analysis.
OpenRefine is a user-friendly, point-and-click tool for working with messy data.
Relevant Literature
Hadley Wickham’s well-known article “Tidy Data” (2014) explains the tidy approach, in which every variable is a column, every observation is a row, and each type of observational unit is a table. This approach, in turn, is the basis for preferring “long” rather than “wide” data.
Critical Perspective: In “Against Cleaning”, Katie Rawson and Trevor Muñoz discuss the impact of data cleaning and the need to scrutinize the bias and assumptions involved in the data cleaning process and the implications of words like “cleaning” and “messy.”
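The difference between “wide” and “long” data in the tidy approach described above can be shown with a small reshaping sketch. The table, the column names, and the helper to_long are hypothetical and written in plain Python for illustration only.

```python
# Hypothetical "wide" table: one row per country, one column per year.
wide_rows = [
    {"country": "A", "2019": 10, "2020": 12},
    {"country": "B", "2019": 7, "2020": 9},
]


def to_long(rows, id_field, var_name, value_name):
    """Reshape wide rows so that each output row holds a single observation."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue  # the identifier is repeated on every output row instead
            long_rows.append({id_field: row[id_field], var_name: column, value_name: value})
    return long_rows


# Each (country, year) pair becomes its own row, with "year" and "cases" as variables.
long_rows = to_long(wide_rows, "country", "year", "cases")
```

In the long form, year is a variable with its own column rather than being spread across column headers. The pandas library offers DataFrame.melt for the same kind of reshaping.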
