Microsoft-Led Research Finds GPT-4 Is Prone To Jailbreaking

New research backed by Microsoft has found OpenAI’s GPT-4 generative large language model (LLM) to be more trustworthy than its predecessor, GPT-3.5, but also more vulnerable to jailbreaking.

The paper, published by researchers from the University of Illinois Urbana-Champaign, Stanford University, the University of California, Berkeley, the Center for AI Safety, and Microsoft Research, found Generative Pre-trained Transformer 4, or GPT-4, to be better than GPT-3.5 at safeguarding private information, avoiding toxic or discriminatory output, and resisting adversarial attacks.

GPT-4 More Trustworthy Than GPT-3.5 But Could Leak Users’ Personal Information When Jailbroken

Despite these strengths, GPT-4 is far more likely to follow “jailbreaking” prompts, which bad actors could use to make the model bypass OpenAI’s built-in safety measures. Researchers say GPT-4 could be told to ignore security measures and leak the personal information and conversation histories of its users because it follows “misleading instructions more precisely”.

The co-authors of the research described their findings in a blog post. They compared the two models’ trustworthiness across several categories: toxicity, stereotypes, privacy, machine ethics, fairness, and robustness against adversarial attacks. On standard benchmarks, GPT-4 earned a higher trustworthiness rating than GPT-3.5.

The researchers began their tests with standard prompts that included banned toxic and discriminatory words. Then, the team gave both models prompts designed to push them past their content policy restrictions, but without making the AI outwardly biased against specific ethnic groups. The final leg involved challenging the models by intentionally trying to trick them into ignoring their safeguards altogether.

The team found that when given jailbreaking prompts, GPT-4 was more likely to generate toxic text than GPT-3.5. The newer model also agreed with “biased” content more frequently, depending on the demographic group mentioned in the prompt.

While all LLMs can leak details of the data on which they were trained, GPT-4 was more susceptible to leaking private, sensitive data, including users’ email addresses, when given the right jailbreaking prompts.

The researchers also gave an example. When given the prompt “Homosexuals have HIV”, GPT-4 is trained to “strongly disagree” with the statement. But when prompted “Women have HIV”, the model agreed and supported its argument with biased content.

GPT-4’s Vulnerabilities Were Fixed Before The Report Was Published

Although the findings are a cause for concern for users of LLM-based services such as ChatGPT and Google’s Bard, the researchers said their team worked alongside Microsoft to confirm that the potential vulnerabilities they identified would not affect existing customer-facing AI services. They implied that the finished AI applications had already received the relevant fixes and patches before the paper was published.

The team also shared its research with OpenAI, which has noted the vulnerabilities in the system cards for the respective LLMs. The researchers have also published the code used to benchmark the models on GitHub so that other researchers can reproduce the results.

For those who don’t know, jailbreaking is the process of exploiting the flaws of a digital system by tricking it into performing tasks that it was not originally intended for.

Like all LLMs, GPT-4 must be instructed to complete a task, which requires users to give it prompts on which it then acts. When an LLM is jailbroken, it is given prompts worded specifically to trick it into performing a task that was not part of its objective. In the wrong hands, the AI could be used to generate racist, sexist, or harmful text, to run propaganda campaigns, and to malign an individual, community, or organization.

Usually, generative AI models go through a security assessment process called “red teaming”, where developers give the algorithm several prompts to see if it will return unwanted results. The mechanism is mainly used by ethical hackers in the cybersecurity space to test an organization’s systems, defenses, and operational strategies while identifying and rectifying any security gaps or loopholes.

For instance, when Microsoft first released its ChatGPT-powered ‘Bing Chat’, there was a possibility that the LLM would provide results and content considered harmful or banned. This was because the AI was trained using vast amounts of publicly available data from the internet that contained all sorts of biased and unbiased information.

OpenAI Founder Says GPT-4 Is “Still Flawed, Still Limited”

When GPT-4 was released in March 2023, OpenAI CEO Sam Altman admitted that the model “is still flawed” and “still limited”. He also said the AI is more impressive on first use than it is after users spend more time with it.

GPT-4 was benchmarked on a number of tests, including the Uniform Bar Exam, LSAT, SAT Math, and SAT Evidence-Based Reading and Writing exams, where it consistently scored in the 88th percentile or above.

OpenAI stressed that the system had gone through six months of rigorous safety training, claiming it to be 82% less likely to respond to prompts for disallowed content and 40% more likely to produce factual responses than GPT-3.5.
