Webpage for the University of Chicago Data Science Clinic
Data Science clinic projects may require downloading or “scraping” information from a website. Examples may include using Yelp to identify restaurants and businesses in a specific area or downloading prices from Amazon. While downloading publicly available data from a website is not illegal in and of itself, there are some things that need to be considered before starting a project. The purpose of this document is to present a checklist of items to be considered before starting.
Before presenting the following, please note that we are not lawyers and if there is any gray area it should be discussed with your mentor. If you are interested in learning more about the specific legal considerations, this article has quite a bit of information.
Our primary legal concern is making sure not to violate any terms of service specified by the data provider. If they explicitly state that a behavior is not allowed as part of an agreement that you are expected to enter into when using the site, then do not violate the terms of that agreement. There is a notion of fair use which is invoked in legal proceedings against web scrapers. Fair use is not an automatic get-out-of-jail-free card and only governs a specific
Beyond the legal aspects, there are additional ethical considerations that should be verified before undertaking any web scraping project. The first four ethical considerations are taken from the aforementioned article:
Check the site's robots.txt file (which can usually be found at the root of a URL, so if you are going to https://some-company.com, then navigate to https://some-company.com/robots.txt). The purpose of robots.txt is to define the rules for scraping or crawling a website. Before doing any serious scraping, please consult this file, as it may contain information about who is allowed to scrape and when.
If your data science clinic project involves web scraping, please ensure you adhere to the ethical guidelines outlined above. If you are not sure about any of them, please ask your project mentor or the clinic administrative staff.
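As a rough illustration of the robots.txt check described above, the sketch below uses Python's standard-library `urllib.robotparser` to ask whether a given URL may be fetched. The robots.txt content, the user agent name `clinic-bot`, and the helper names `build_parser` and `may_fetch` are all made up for this example; a real site's rules will differ, and passing this check does not replace reading the site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "clinic-bot" is an assumed user agent name, not a real crawler.
allowed_public = may_fetch(parser, "clinic-bot", "https://some-company.com/menu")
allowed_private = may_fetch(parser, "clinic-bot", "https://some-company.com/private/data")

# Requested pause between requests in seconds, or None if not specified.
delay = parser.crawl_delay("clinic-bot")

print(allowed_public, allowed_private, delay)
```

Against a live site, one would typically call `set_url("https://some-company.com/robots.txt")` followed by `read()` on the parser instead of `parse()`, and then wait at least the reported crawl delay between requests.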
Many organizations (especially non-profits and research groups) may have alternative methods of getting data. It’s recommended to send an email to an organization to ask whether they have the data available in a way that doesn’t require scraping.
Secondly, when dealing with government sources there is the “Freedom of Information Act” (“FOIA”), a mechanism by which citizens are allowed to request information from the government. At the federal level, the FOIA process begins here. Every state in the United States has implemented its own version of this concept, and getting information through this process can be surprisingly quick. If you are going to use FOIA, you need to identify the agency that has the data you are interested in and the FOIA office that has jurisdiction over it. There are often online forms that you can fill out to request specific data.
Web scraping can be an important part of a data science clinic project, but one that can easily veer into illegal or (more likely) unethical areas. Please make sure to go through the guidelines above before beginning any such project.