The UK’s independent data privacy authority, the Information Commissioner’s Office (ICO), announced on Monday that it intends to fine Clearview AI, a facial recognition company, $22.6 million for breaking UK data privacy laws. Since 2017, the company has raised $38 million in funding. As technology advances to let companies collect and use data at ever larger scale, laws surrounding data privacy will be tested and shaped in real time.
Clearview sells facial recognition software used by federal and state law enforcement to identify criminals. The company’s founder, 33-year-old Hoan Ton-That, describes the product as a “search engine for faces. Anyone in law enforcement can upload a face to the system and it finds any other publicly available material that matches that particular face.” Using computer vision, companies like Clearview can apply machine learning to teach a computer to recognize a face in an image and then match different images of the same person, much the way you can recognize someone you’ve seen many times before, even if they are wearing a hoodie or new glasses, or you can only see the side of their face. Machine learning models learn from data to improve at a target task, such as identifying an image, and the more images they have to train on, the more accurate they become. Clearview has amassed a database of 3 billion images.
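The matching step described above can be sketched in a few lines. This is a hypothetical illustration, not Clearview’s actual system: it assumes a trained model has already converted each face photo into a numeric embedding vector, so that photos of the same person produce nearby vectors. Matching then reduces to comparing vectors with cosine similarity. All names and the toy vectors below are invented for the example.

```python
import numpy as np

def cosine_similarity(a, b):
    # Similarity of two embedding vectors: 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_matches(query, database, threshold=0.8):
    # Return indices of database embeddings similar enough to the query
    # to be considered the same person (threshold is a tunable assumption).
    return [i for i, emb in enumerate(database)
            if cosine_similarity(query, emb) >= threshold]

# Toy embeddings: two photos of the same person yield nearby vectors,
# even if one photo shows them in a hoodie; a different person does not.
person_a        = np.array([0.9, 0.1, 0.2])
person_a_hoodie = np.array([0.85, 0.15, 0.25])  # same person, different photo
person_b        = np.array([0.1, 0.9, 0.3])     # different person

database = [person_a_hoodie, person_b]
print(find_matches(person_a, database))  # only the first entry matches
```

In a real system the embeddings would come from a deep neural network trained on many labeled face images, which is why the size of the training set matters so much to accuracy.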
The ICO intends to fine Clearview AI for collecting those billions of images from public sites without informing British citizens. A major component of the General Data Protection Regulation (GDPR) — a set of regulations introduced in Europe to protect online privacy — is the requirement that companies be “transparent about how your data is handled, and get your permission before starting to use it.” The ICO enforces GDPR laws in the UK.
As more and more information becomes available online, companies have started building powerful models trained on large amounts of public data. Take, for example, BERT, GPT-3, and Codex. BERT and GPT-3 are popular language models trained on Wikipedia, online books, and other online texts. Codex, which powers GitHub’s Copilot, was trained on over 100 gigabytes of Python code across millions of public GitHub repositories, per OpenAI’s blog. (Some individuals were not thrilled.)
Clearview’s spokesperson defended the company’s actions by saying it only “provides publicly available information from the internet to law enforcement agencies.” However, many companies, including Facebook, Venmo, YouTube, Twitter, and Instagram, prohibit the kind of scraping required to amass the inventory Clearview has, and Twitter even goes so far as to explicitly ban the use of its data for facial recognition. As for Britain’s ICO, it seems less focused on the public nature of the data and more focused on the fact that Clearview has amassed so much personal data without the data creators’ knowledge.
Machine learning models require large amounts of data, and internet platforms are often the largest source of it. Yet regulations around collecting and using that data are still catching up. Cases like this one between Clearview and the ICO will be telling as to how regulators plan to deal with companies that use computers to harness human-generated data at scale.