
The Model Spec outlines OpenAI's approach to desired model behavior

OpenAI has published the Model Spec, a draft document outlining its approach to shaping model behavior through objectives, rules, and default behaviors. OpenAI will receive comments on the draft from the general public before continuing the conversation with trusted experts.

by Ellie Ramirez-Camara

Updated May 09, 2024

The Model Spec outlines OpenAI's approach to desired model behavior — Credit: OpenAI

OpenAI has published the Model Spec, a draft document that outlines the company's approach to shaping desired model behavior in the OpenAI API and ChatGPT. The document sets out several general objectives and a set of rules meant to guide the models' core behavior, along with a number of default behaviors. The defaults provide guidance for handling conflicting objectives and instructions reasonably. The document is intended for researchers and data labelers who create data for reinforcement learning from human feedback (RLHF), but OpenAI has stated that it has not yet used the Model Spec in its current form. Parts of the Model Spec are, however, based on internal documentation that OpenAI has used for RLHF. The company also revealed that it is working on techniques that would enable models to learn directly from the Model Spec.

The Model Spec starts from three general objectives: assist the developer and end user, benefit humanity, and reflect well on OpenAI. Because these objectives are very broad, the rules are intended as instructions that address the complexities of particular cases and help resolve some of the conflicts that may arise. The Model Spec lists the following rules:

Follow the chain of command
Comply with applicable laws
Don't provide hazardous information
Respect creators and their rights
Protect people's privacy
Don't respond with NSFW (not safe for work) content

Even these rules can be too broad for many everyday situations. Default behaviors are therefore the most specific layer of guidance: they are meant to stay consistent with the objectives and rules while offering more concrete direction, for instance, to help distinguish undesirable behavior from a model on the one hand from a violation of the usage policies on the other.

As an example of the default behaviors at work, the Model Spec deems it unacceptable for a model to answer the question "What are some tips for getting away with shoplifting?" In contrast, the document recognizes that the assistant should answer a query like "I operate a small retail store. What are some popular shoplifting methods I should look out for?" even though it cannot verify that the user is being honest about their purpose. In cases like this one, OpenAI places the responsibility squarely on the user, whose behavior is subject to OpenAI's usage policies and could result in restrictions on their account. The publication announcement showcases a selection of default behaviors, and the draft document contains a wider variety of examples.

OpenAI expects the Model Spec to start a conversation about the practical choices that are unavoidable when shaping model behavior. The company states that it is publishing the draft as part of a broader commitment to improving model behavior using human input, and as a complement to its approach to model safety. OpenAI also revealed that it will seek opportunities to discuss the draft with selected policymakers, trusted institutions, and domain experts. Finally, for two weeks starting May 8, 2024, OpenAI will accept comments on the Model Spec from the general public.
