Data Cleaning

Definition

Data cleaning is the process of identifying and correcting data that are inaccurate, missing, or incomplete. Data cleaning tasks can include removing duplicate records, investigating extreme values (e.g., outliers), converting dates from one format to another, removing unwanted text, splitting multiple data points in a cell into separate cells, or coding missing or NA values.
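As a rough illustration, several of the tasks listed above can be sketched with the pandas library in Python. The table, its column names (patient_id, visit_date, bp), and its values are invented for this example; a real cleaning workflow would depend on the dataset and on documented decisions about each step.

```python
import pandas as pd

# Hypothetical messy records; column names and values are invented for illustration.
raw = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "visit_date": ["01/15/2023", "02/03/2023", "02/03/2023", "03/20/2023"],
    "bp": ["120/80", "135/85", "135/85", "NA"],
})

# Remove exact duplicate records.
clean = raw.drop_duplicates().reset_index(drop=True)

# Convert dates from MM/DD/YYYY text into real datetime values.
clean["visit_date"] = pd.to_datetime(clean["visit_date"], format="%m/%d/%Y")

# Code the literal text "NA" as a true missing value.
clean["bp"] = clean["bp"].mask(clean["bp"] == "NA")

# Split two data points stored in one cell into separate numeric columns.
parts = clean["bp"].str.split("/", expand=True)
clean["systolic"] = pd.to_numeric(parts[0])
clean["diastolic"] = pd.to_numeric(parts[1])
clean = clean.drop(columns="bp")

print(clean)
```

Working on a copy (clean) rather than overwriting raw keeps the original values available for comparison, which makes each cleaning decision easier to review.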

Tools-link

Data Documentation Initiative

DCAT Application Profiles for data portals in Europe

Hanson, Surkis, and Yacobucci. (2012) Data Sharing Snafu in Three Short Acts, A…

Nature Publishing Group - Scientific Data - Data Descriptor

Similar Terms

Data Wrangling

Examples

Using a regular expression (regex) in OpenRefine to identify a pattern or sequence within text and remove or replace it (e.g., finding all “NA” values and replacing them with -999).
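The same idea can be sketched outside OpenRefine. The following is a minimal Python version using the standard re module, not OpenRefine's own expression syntax; the cell values are made up, and the choice to also match lowercase or space-padded "NA" is an assumption for illustration.

```python
import re

# Hypothetical cell values exported from a spreadsheet.
cells = ["12.5", "NA", " na ", "NASA", "7"]

# Match a cell that consists only of "NA" (any case, optional surrounding spaces).
# Anchoring with ^ and $ prevents partial matches inside words such as "NASA".
na_pattern = re.compile(r"^\s*NA\s*$", flags=re.IGNORECASE)

# Replace every matching cell with the placeholder code -999.
recoded = [na_pattern.sub("-999", cell) for cell in cells]

print(recoded)
```

If a placeholder such as -999 is used, it helps to record it in the dataset's documentation so that it is not later mistaken for a real measurement.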

Tools

Tidyverse is a collection of open source R packages, several of which can be used for data wrangling and cleaning.

Pandas is an open source Python library for data manipulation and analysis.

OpenRefine is a user-friendly, point-and-click tool for working with messy data.

Similar Terms

Deduplication

Data Standardization

Data Cleansing

Data Scrubbing

Further Resources

Hadley Wickham’s well-known article “Tidy Data” (2014) explains the tidy approach, in which every variable is a column, every observation is a row, and each type of observational unit is a table. This approach is the basis for “long” rather than “wide” data.
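To make the “wide” versus “long” distinction concrete, here is a small sketch using pandas in Python. The table, with a site column and one column per year, is invented for illustration and is not taken from the article.

```python
import pandas as pd

# Hypothetical "wide" table: one row per site, one column per year.
wide = pd.DataFrame({
    "site": ["A", "B"],
    "2021": [10, 7],
    "2022": [12, 9],
})

# Reshape to "long" form: each row is one observation (site, year, cases),
# and each variable has its own column.
tidy = wide.melt(id_vars="site", var_name="year", value_name="cases")

# The year was stored as a column header (text); convert it to a number.
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```

In the wide table the year is hidden in the column headers; in the long table it becomes a variable of its own that can be filtered, grouped, or plotted like any other column.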

Critical Perspective: In “Against Cleaning”, Katie Rawson and Trevor Muñoz discuss the impact of data cleaning and the need to scrutinize the bias and assumptions involved in the data cleaning process and the implications of words like “cleaning” and “messy.”
