Data Scraping is Revolutionary
Scraping programs allow researchers, statisticians, and other data users to collect information from nearly any public online webpage in a matter of seconds.
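As a rough illustration of what such a program does, the sketch below uses only Python's standard library to pull the cells of an HTML table out of a page's markup. The sample markup, the class name `TableScraper`, and the helper `scrape_table` are hypothetical names chosen for this example; a real scraper would first download the page (for instance with `urllib.request.urlopen`) before parsing it.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being read, or None
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table rows found in the given HTML as lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up page content, standing in for a downloaded webpage.
SAMPLE_PAGE = """
<table>
  <tr><th>Ticker</th><th>Price</th></tr>
  <tr><td>ABC</td><td>10.50</td></tr>
  <tr><td>XYZ</td><td>7.25</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in scrape_table(SAMPLE_PAGE):
        print(row)
```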
Furthermore, many scraping programs can function dynamically. Such dynamic programs do not simply scrape the source webpage a single time; rather, dynamic scrapers repeatedly pull data from the desired online source, allowing users to create data spreadsheets that update themselves automatically.
This dynamic function can be particularly useful for industries that rely on quick, real-time updates for large sets of data, such as trade and investment firms that need to continuously monitor price movements.
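The repeated-pull behavior described above can be sketched as a simple polling loop. This is a minimal sketch under stated assumptions, not how any particular product works: `poll`, `fetch`, and `fake_price_feed` are hypothetical names, and the fake feed stands in for a function that would download and parse a live page.

```python
import time


def poll(fetch, interval_seconds, max_polls, sleep=time.sleep):
    """Call fetch() repeatedly and record a snapshot each time the data changes."""
    history = []
    last = None
    for i in range(max_polls):
        snapshot = fetch()
        if snapshot != last:         # keep only updates, skip unchanged pulls
            history.append(snapshot)
            last = snapshot
        if i < max_polls - 1:
            sleep(interval_seconds)  # wait before the next pull
    return history


def fake_price_feed():
    """Stand-in for a real download-and-parse step (an assumption for the demo)."""
    prices = iter([10.50, 10.50, 10.75, 10.60])
    return lambda: {"ABC": next(prices)}


if __name__ == "__main__":
    updates = poll(fake_price_feed(), interval_seconds=0.1, max_polls=4)
    print(updates)
```

Passing `fetch` in as a parameter keeps the loop independent of how the page is actually downloaded, so the same loop could feed a spreadsheet, a database, or a price monitor.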
Moreover, many data scraping programs are accessible and inexpensive: Microsoft Excel has its own built-in scraping tool, and the Google Chrome Web Store offers several free scraping extensions. Data scraping technology is improving rapidly, but these improvements have raised ethical concerns about the potential applications of scraping programs.
Scraping programs can be engineered to extract information from any public webpage. This includes any personal information that is publicly shared via social media, including on platforms such as Facebook, Twitter, Instagram, and YouTube.
In other words, if you upload any personal information to a public social media profile, a scraping program could potentially retrieve and store such information in an instant. This could include pictures, names, locations, phone numbers, and email addresses.
The possibility of personal information being discreetly scraped and stored is very alarming, and prompts the following questions: Is this legal? How can I prevent this? Is this happening right now?
There are legal and corporate regulations that address these questions and concerns.
The Computer Fraud and Abuse Act (CFAA) forbids programs that have “unauthorized access” to a webpage from retrieving online information from it. Furthermore, Twitter, Facebook, YouTube, and Venmo explicitly prohibit scraping of user information in their Automated Data Collection Terms.
Does this mean that your social media profiles are protected from scrapers? Not exactly. Unfortunately, the protection offered by the CFAA does not necessarily apply to public social media profiles; profiles set to a “Public” setting technically grant “authorized access” to all web visitors, including automated scrapers.
Social media users can prevent unwanted scraping by switching their profile settings from “Public” to “Private,” as this would limit the amount of information that is made publicly available and also legally protect such information from any automated programs.
But what if you would rather have a public profile?
Do company regulations protect public profiles from being scraped?
In practice, no.
While Twitter, Facebook, and other social media companies prohibit scraping on their platforms, programmers and software can simply ignore these rules and scrape user information regardless.
A current and noteworthy example of such software is Clearview AI, a state-of-the-art facial recognition application that has recently caused controversy regarding the future of data scraping technology.
Law enforcement agencies currently use Clearview AI to identify potential suspects and persons of interest. The application has an incredibly large database made up of pictures that the program has scraped from online webpages, including social media profiles.
Law enforcement officers upload a picture of an unidentified suspect, and the app returns matching pictures from its database, along with corresponding names and source links.
The software has garnered praise from law enforcement for its ability “to identify a subject in a matter of seconds.” Clearview’s database currently has nearly 3 billion pictures, and is being used by over 600 law enforcement agencies in the United States.
On the other hand, the software has received harsh criticism from the public, conjuring fears of a dystopian society that completely lacks privacy. In March 2020, Vermont Attorney General TJ Donovan sued Clearview for violating Vermont’s Consumer Protection Act, and described the software as “unscrupulous, unethical, and contrary to public policy.”
While Clearview maintains that its software is intended for law enforcement, a recent report from The New York Times revealed that the software has been used by investors and wealthy individuals. These findings have further amplified public worry and disapproval, as privacy advocates warn of the potential for the software to be used with malicious intent.
Facebook, Twitter, YouTube – each of these companies forbids scraping on its platform. How are they responding to Clearview’s practices? Each of these companies has sent a cease-and-desist letter to Clearview, asserting that Clearview’s methods directly violate its data collection policy. Clearview has responded defensively to these claims, arguing that the use of public information is a “First Amendment right.”
The conflict between the companies has yet to be settled, and with no current federal laws prohibiting Clearview’s practices, internet users appear to be at risk of having their personal information retrieved and stored by Clearview and other scraping programs.
Clearview’s emergence and public controversy should perhaps serve as a preliminary warning as we look towards the future of technological innovation. Data scraping, a common practice enjoyed by researchers and data scientists, poses an inherent risk to the individual’s right to privacy.
Written by Alexandar Ristic & Edited by Alexander Fleiss