Web scraping is the process of programmatically and systematically collecting information from the web and processing it into formats that are easier to analyze, can be serialized (JSON, XML, etc.), and can be stored for later use. Online information is often unstructured, but the HTML markup of a web page can be leveraged to locate and collect information and to establish relationships between data points. The information gained by scraping a website can then be organized into highly structured formats such as databases, which supports deeper forms of analysis for research.
Researchers often use specialized software and customized scripts to scrape data from websites. Web scraping extracts selected data from pages, but it is not the only method for pulling data from the web. For instance, web crawling (or web archiving) refers to systematically downloading entire pages into WARC files, usually for archival or preservation purposes (e.g., Archive-It, Conifer, Internet Archive). Additionally, some websites (like Twitter and Wikipedia) provide Application Programming Interfaces (APIs), which can be used to request downloads of their data.
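As a rough illustration of requesting data through an API rather than scraping HTML, the sketch below builds a request to the MediaWiki Action API on English Wikipedia using only the Python standard library. The function names (build_query_url, fetch_wikitext), the User-Agent string, and the contact address are placeholders chosen for this example, and the parameters shown are one reasonable way to ask for a page's current wikitext, not the only one.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint for English Wikipedia; other MediaWiki sites use their own api.php URL.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Return a URL asking the Action API for the current wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def fetch_wikitext(title):
    """Download the page and return its wikitext, or None if no revision is returned."""
    # Placeholder User-Agent: replace with a string identifying your own project.
    headers = {"User-Agent": "example-research-script/0.1 (contact: you@example.org)"}
    request = Request(build_query_url(title), headers=headers)
    with urlopen(request, timeout=30) as response:
        data = json.load(response)
    page = data["query"]["pages"][0]
    revisions = page.get("revisions")
    if not revisions:
        return None  # e.g. the title does not exist
    return revisions[0]["slots"]["main"]["content"]


# Building the URL needs no network access; fetch_wikitext() does.
print(build_query_url("Web scraping"))
```

Calling fetch_wikitext("Web scraping") would perform the actual download. Before running requests at scale, it is worth checking the site's API usage policies and rate limits.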
Web scraping challenges include gaining access to the desired data and confronting the ethical and legal issues around harvesting, analyzing, storing, or sharing the collected data.
The MediaWiki Action API can be used to interact with information on Wikipedia.org, allowing users to systematically edit or download entries:
Beautiful Soup is a popular Python library for scraping data from HTML and XML files: https://beautiful-soup-4.readthedocs.io/
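A minimal sketch of how Beautiful Soup can turn HTML structure into serializable records is shown below. The HTML snippet, its class names, and the extract_reports function are invented for illustration and do not come from any real website; a real page would need selectors matched to its own markup. The example assumes the beautifulsoup4 package is installed and uses Python's built-in "html.parser" backend.

```python
import json

from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
SAMPLE_HTML = """
<ul class="reports">
  <li class="report"><a href="/reports/1">First example report</a> <span class="year">2019</span></li>
  <li class="report"><a href="/reports/2">Second example report</a> <span class="year">2020</span></li>
</ul>
"""


def extract_reports(html):
    """Collect title, link, and year from each list item marked class="report"."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.find_all("li", class_="report"):
        link = item.find("a")
        year = item.find("span", class_="year")
        records.append(
            {
                "title": link.get_text(strip=True),
                "url": link["href"],
                "year": int(year.get_text(strip=True)),
            }
        )
    return records


records = extract_reports(SAMPLE_HTML)
# Serialize the structured records to JSON for storage or later analysis.
print(json.dumps(records, indent=2))
```

Each list item becomes one dictionary, so the relationship between a title, its link, and its year is preserved when the records are written to JSON or loaded into a database.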
This is an overview of the use of web scraping for population health research:
This article provides an explanation of web scraping while also presenting areas of concern for researchers to consider:
Krotov, V., Johnson, L., & Silva, L. (2020). Tutorial: Legality and Ethics of Web Scraping. Communications of the Association for Information Systems, 47. https://doi.org/10.17705/1CAIS.04724
This article explores the ethics of web scraping in public health research by reviewing a project that scrapes public websites of U.S. county jails to construct a database for HIV surveillance among incarcerated populations:
Rennie, S., Buchbinder, M., Juengst, E., Brinkley-Rubinstein, L., Blue, C., & Rosen, D. L. (2020). Scraping the Web for Public Health Gains: Ethical Considerations from a 'Big Data' Research Project on HIV and Incarceration. Public Health Ethics, 13(1), 111-121. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7392638/