Artificial intelligence firm OpenAI has launched its latest tool, a web crawler called GPTBot. According to a blog post by the company, the crawler will be used to collect publicly available data from websites to train future versions of ChatGPT.
OpenAI claims GPTBot will help “improve the accuracy and expand the capabilities” of its language-based AI models.
A web crawler, or web spider, is a type of bot that search engines typically use to index the contents of websites so that their pages can appear in search results. Google, Bing, and DuckDuckGo all rely heavily on such crawlers.
It is called a web “crawler” because crawling is the technical term for automatically visiting a website and collecting its data with a software program.
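To make the idea of crawling concrete, here is a minimal sketch in Python of what a crawler does with a single page: it reads the HTML, collects the visible text, and gathers the links it could visit next. This is an illustration only and not how GPTBot or any search engine is implemented. The names `LinkExtractor`, `crawl_page`, and `SAMPLE_HTML`, as well as the example.com address, are hypothetical, and the page is an in-memory string so that no network request is made.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects visible text and outgoing links from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Every <a href="..."> is a candidate page to visit next.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page address.
                    self.links.append(urljoin(self.base_url, value))

    def handle_data(self, data):
        # Keep non-empty text fragments; this is the "data" a crawler gathers.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def crawl_page(base_url, html):
    """Return (text, links) for a single page given its HTML source."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


# Hypothetical page content, used instead of a real download.
SAMPLE_HTML = (
    "<html><body><h1>Hello</h1>"
    '<p>Read <a href="/about">about us</a>.</p>'
    "</body></html>"
)

text, links = crawl_page("https://example.com/", SAMPLE_HTML)
print(text)
print(links)
```

A real crawler would repeat this step for every link it discovers, after first checking whether the site permits it to do so.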
OpenAI Releases Web Crawler to Enhance the Scope of Data Provided to ChatGPT
OpenAI says GPTBot will source all publicly available data from the internet but will remove sources that require paywall access, are known to gather personally identifiable information (PII) of internet users, or have text that violates company policies.
The World Wide Web is the open field that supplies the data needed to feed language models such as OpenAI’s ChatGPT and Google’s Bard. Although the companies have not confirmed it, most of the information used to train these models is sourced from social media posts, blogs, online games, and similar content.
Companies Suggesting Methods To Tackle the Problem of AI Data Scraping
Scraping the internet for data has become increasingly contentious, with many users and companies raising security and privacy concerns. Recently, Reddit and Twitter began pushing back against AI companies by cracking down on the mining of users’ posts to train models.
Meanwhile, authors Sarah Silverman, Christopher Golden, and Richard Kadrey sued OpenAI and Meta for copyright infringement, accusing the companies of illegally acquiring datasets that contained their works to train ChatGPT and LLaMA.
Companies like Adobe have floated the idea of an anti-impersonation law under which data could be marked as not suitable for training AI models. Last month, OpenAI, Meta, Amazon, Google, Inflection, and Microsoft promised in a White House meeting to address the many risks posed by artificial intelligence.
The companies agreed to invest in cybersecurity and online discrimination research and, most importantly, to add a new watermarking system that will tell internet users whether the content they view is AI-generated or human-made.
OpenAI Allows Users to Stop GPTBot From Accessing Their Websites
Website owners can now stop GPTBot from crawling their sites to mine data, either partially or fully. In another blog post, OpenAI has detailed how to do this.
Websites can block GPTBot’s IP addresses or disallow the web crawler by adding it to the site’s robots.txt file. This file tells web crawlers which parts of a site they may access.
This is not the first attempt to stop AI models from freely accessing data on the internet. Last year, online art platform DeviantArt released a “NoAI” tag that flags a piece of content for exclusion from AI training.
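As a rough illustration of the robots.txt approach, the sketch below uses Python's standard `urllib.robotparser` module to show how a well-behaved crawler would interpret a rule set aimed at GPTBot. The rules, the `/private/` directory, the example.com addresses, and the helper name `is_allowed` are assumptions made for the example; site owners should consult OpenAI's own instructions for the exact directives it honors.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep GPTBot out of /private/ only,
# and leave every other crawler unrestricted.
PARTIAL_BLOCK = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""

# Hypothetical robots.txt: keep GPTBot out of the entire site.
FULL_BLOCK = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Partial block: only the listed directory is off limits to GPTBot.
print(is_allowed(PARTIAL_BLOCK, "GPTBot", "https://example.com/private/report"))
print(is_allowed(PARTIAL_BLOCK, "GPTBot", "https://example.com/blog/post"))

# Full block: nothing on the site is available to GPTBot.
print(is_allowed(FULL_BLOCK, "GPTBot", "https://example.com/blog/post"))
```

Note that robots.txt is a request rather than an enforcement mechanism: it only works if the crawler chooses to obey it, which is why blocking by IP address is listed as a separate option.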
Regulators Want AI Companies to Seek Consent From Internet Users
Governments have also taken the matter seriously. In June, Japan’s privacy watchdog, the Personal Information Protection Commission, asked OpenAI not to collect sensitive data for machine learning purposes without users’ consent. The commission also highlighted the responsibility of AI companies to respect privacy concerns while fostering innovation and the benefits of the new technology.
In April, Italy’s Data Protection Authority (IDPA) placed a temporary ban on ChatGPT after alleging that the language model breached European privacy laws. The agency ordered OpenAI to stop processing the data of Italian users until it complied with the European Union’s General Data Protection Regulation (GDPR).
OpenAI is currently facing a class-action lawsuit in California, brought by 16 plaintiffs, over sourcing data from social media comments, blog posts, Wikipedia articles, and family recipes without users’ permission.
If the allegations are proven, the Microsoft-backed company could be held liable for breaching the Computer Fraud and Abuse Act, a law with precedent in cases involving web scraping.