ChatGPT has taken the world by storm. Within two months of its release, the chatbot reached 100 million active users, making it the fastest-growing consumer application ever launched. Users are attracted to the tool’s advanced capabilities – and concerned by its potential to cause disruption in various sectors. Much less discussed are the privacy risks ChatGPT poses. This week, Google unveiled its own conversational AI called Bard, and others will surely follow. Technology companies working on AI have well and truly entered an arms race. The problem is that this race is being fueled, in part, by our personal data, which raises serious privacy concerns.
Launched in November 2022 by OpenAI, ChatGPT is underpinned by a large language model that requires massive amounts of data to function and improve. The more data the model is trained on, the better it gets at detecting patterns, anticipating what will come next, and generating plausible text. OpenAI fed the tool some 300 billion words systematically scraped from the internet – namely, from books, articles, websites, and posts – including personal information obtained without consent. If you have ever written a blog post or product review, for example, or commented on an article online, there is a chance this information was consumed by ChatGPT.
Why is that an issue?
The data collection used to train ChatGPT is problematic for several reasons. First, none of us were asked whether OpenAI could use our data. This is a violation of privacy, especially when the data is sensitive and can be used to identify us, our family members, or our location. Even when the data is publicly available, its use can breach what we call contextual integrity, which is a fundamental principle in legal discussions of privacy. It requires that individuals’ information is not revealed outside of the context in which it was originally produced.
More than that, OpenAI does not offer procedures for individuals to check whether their personal information is being stored and to request that it be deleted. This right is guaranteed under the European General Data Protection Regulation (“GDPR”) – although it is still under debate whether ChatGPT is compliant with GDPR requirements. The right matters most in cases where the stored information is inaccurate or misleading, which seems to be a regular occurrence with ChatGPT.
In addition, the scraped data ChatGPT was trained on can be proprietary or subject to copyright protection. For instance, when I prompted it, the tool produced the first few paragraphs of Peter Carey’s novel “True History of the Kelly Gang” – a copyright-protected text. Finally, OpenAI did not pay for the data it scraped from the internet. In other words, the individuals, website owners, and companies that produced it were not compensated. This is noteworthy considering OpenAI was recently valued at $29 billion, more than double its value in 2021. OpenAI also just announced ChatGPT Plus, a paid subscription plan that will offer customers ongoing access to the tool, faster response times and priority access to new features. This plan will contribute to expected revenue of $1 billion by 2024.
None of this would have been possible without data collected and used without authorization.
A flimsy privacy policy
Another privacy risk involves the data provided to ChatGPT in the form of user prompts. When we ask the tool to answer questions or perform tasks, we may inadvertently hand over sensitive information and put it in the public domain. For instance, an attorney may prompt the tool to review a draft divorce agreement, or a programmer may ask it to check a piece of code. The agreement and the code, along with the text the tool produces in response, are now part of ChatGPT’s database. This means they can be used to further train the tool and may be included in responses to other people’s prompts.
Beyond this, OpenAI gathers a broad scope of other user information. According to the company’s privacy policy, it collects users’ IP address, browser type and settings, and data on users’ interactions with the site – including the type of content users engage with, features they use and actions they take. It also collects information about users’ browsing activities over time and across websites. OpenAI states that it may share users’ personal information with unspecified third parties, without informing the users, to meet its business objectives.
Some experts believe ChatGPT is a tipping point for AI – a realization of technological development that can revolutionize the way we work, learn, write and even think. Its potential benefits notwithstanding, we must remember that OpenAI is a private, for-profit company whose interests and commercial imperatives do not necessarily align with greater societal needs. The privacy risks that come attached to ChatGPT should sound a warning. And as consumers of a growing number of AI technologies, we should be extremely careful about what information we share with such tools.
A representative for OpenAI did not respond to a request for comment by the time of publication.
Uri Gal is a Professor in Business Information Systems at the University of Sydney. (This article was initially published by The Conversation.)