Closed AI Models Make Bad Baselines

24 minute read

This post was authored by Anna Rogers, with much invaluable help and feedback from Niranjan Balasubramanian, Leon Derczynski, Jesse Dodge, Alexander Koller, Sasha Luccioni, Maarten Sap, Roy Schwartz, Noah A. Smith, Emma Strubell (listed alphabetically)
Header image credit: Sasha Luccioni

What comes below is an attempt to bring together some discussions on the state of NLP research post-chatGPT.1 We are NLP researchers, and at the absolute minimum our job is to preserve the fundamentals of scientific methodology. This post is primarily addressed to junior NLP researchers, but is also relevant for other members of the community who are wondering how the existence of such models should change their next paper. We make the case that as far as research and scientific publications are concerned, the “closed” models (as defined below) cannot be meaningfully studied, and they should not become a “universal baseline”, the way BERT was for some time widely considered to be. The TLDR for this post is a simple proposed rule for reviewers and chairs (akin to the Bender rule that requires naming the studied languages):

That which is not open and reasonably reproducible cannot be considered a requisite baseline.

By “open” we mean here that the model is available for download, can be run offline (even if it takes non-trivial compute resources), and can be shared with other users even if the original provider no longer offers the model for download. “Open” models support versioning, and document for each model version what training data they used. A model is “closed” if it is not open.

By “reasonably reproducible” we mean that the creators released enough information publicly such that the model can be reproduced with the provided code, data, and specified compute resources, with some variation reasonably expected due to hardware/software variance, data attrition factors and non-determinism in neural networks. For instance, reproducing BLOOM would require a super-computer - but at least theoretically it is possible, given the measures to open-source the code, collect and document the data. So it is “reasonably reproducible” by our definition, even though not everybody could do it.

Relevance != popularity

Here’s a question many graduate students in NLP have been asking themselves recently:

This anxiety seems to be due partly to the fact that in our field, “relevance” has been extremely popularity-driven. For the last decade, there has always been a Thing-Everybody-Is-Talking-About: a model or approach that would become a yardstick, a baseline that everybody would be wise to have in their papers to show that what they’ve done is a meaningful improvement. This can be understood, since one of the driving values of the ML community is improving upon past work – otherwise, how would we know we are making progress, right? Post-2013 we had word2vec/GloVe, then there was a similar craze about BERT. Then GPT-3. And now – ChatGPT and GPT-4.

Why does this happen? There are two lines of reasoning behind this:

  1. The-Thing-Everybody-Is-Talking-About is either likely to be truly state-of-the-art for whatever I’m doing, or a reasonable baseline, so I better have it in my paper and beat it with my model.
  2. As an author, my chances of publication depend in part on the reviewers liking my work, and hence the safest bet for me is to talk about something that most people are likely to be interested in - a.k.a The-Thing-Everybody-Is-Talking-About.

(b) is actually a self-fulfilling prophecy: the more authors think this way, the more papers they write using The-Thing-Everybody-Is-Talking-About, which in turn reinforces the reviewers in the belief that that thing is really prerequisite. We see this cycle manifested as a mismatch between the beliefs of individual community members and their perception of others’ views on what research directions should be prioritized (e.g. focus on benchmarks or scale), as documented in the NLP Community Metasurvey. Though it takes effort, members of the research community can push back against that kind of cycle (and we will discuss specific strategies for that below). As for (a) - it made sense while The-Thing-Everybody-Is-Talking-About was actually something that one could meaningfully compare to.

The main point we would like to make is that this kind of reasoning simply no longer applies to closed models that do not disclose enough information about their architecture, training setup, data, and operations happening at inference time. It just doesn’t matter how many people say that they work well. Even without going into the dubious ethics of commercial LLMs, with copyright infringement lawsuits over code and art already underway, and unethically sourced labeled data – the basic research methodology demands it. Many, many people are bringing up the fact that as researchers, we are now in an impossible position:

  • We have very little idea what these models are trained on, or how:
  • The said black box is constantly changing:
  • Both our incoming prompts and outgoing answers may be undergoing unspecified edits via unspecified mechanisms. E.g. chatGPT “self-censors” with content filters which people have so much fun bypassing, and has proprietary prompt prefixes:

Yes, these models do seem impressive to many people in practice – but as researchers, our job is not to buy into hype. The companies training these models have the right to choose to be wholly commercial and therefore not open to independent scrutiny – that is expected of for-profit entities whose main purpose is to generate profits for their stakeholders. But this necessarily means that they relinquish the role of scientific researchers. As Gary Marcus put it,

I don’t expect Coca Cola to present its secret formula. But nor do I plan to give them scientific credibility for alleged advances that we know nothing about.

Why closed models as requisite baselines would break NLP research narratives

To make things more concrete, let us consider a few frequent “research narratives” in NLP papers, and how they would be affected by using such “closed” models as baselines. We will use GPT-4 as a running example of a “closed” model that was released with almost no technical details, despite being the subject of a 100-page report singing its praises, but the same points apply to other such models.

“We propose a machine learning model that improves on the state-of-the-art”:

  • To make the claim that our algorithm improves over whatever it is that a commercial model is doing, we need to at least know that we are doing something qualitatively different. If we are proposing some modification of a currently-popular approach (e.g., Transformers), without documentation, we simply cannot exclude that the “closed” model might be doing something similar.
  • Even if we believe that we are doing something qualitatively different, we still need to be able to claim that any improvements are due to our proposed modification and not model size, the type and amount of data, important hyperparameters, “lucky” random seed, etc. Since we don’t have any of this information for the closed “baseline”, we cannot meaningfully compare our model to it.
  • And even if we ignore all the above factors – to make a fair comparison with these models on some performance metric, we have to at least know that neither of our models has observed the test data. Which, for the “closed” model, we also don’t know. Even OpenAI itself was initially concerned about test data contamination with GPT-3, which could not possibly have improved - especially after the whole world has obligingly tested chatGPT for months. And it hasn’t improved.

The only thing that we as model developers can learn from the existence of GPT-4, is that this is the kind of performance that can be obtained with some unspecified combination of current methods and data. An upper bound or existence proof, which seems higher than existing alternatives. Upper bounds are important, and could serve as a source of motivation for our work, but they cannot be used as a point of comparison.

“We propose a new challenging task/benchmark/metric”:

Constructing good evaluation data is very hard and expensive work, and it makes sense to invest in it when we believe that it can be used as a public benchmark to measure progress in NLP models, at least for a few months. Examples of such benchmarks that have driven NLP research in the past include SQuAD, GLUE and BigBench. But public benchmarks can only work if the test data remains hidden (and even then eventually people evaluate too many times and start to implicitly overfit). This is obviously incompatible with the scenario where the developer of the popular “closed” models, only accessible via API, keeps our submitted data and may use it for training. And unless the models explicitly describe and share their training data, we have no way of auditing this.

This means that our efforts will be basically single-use as far as the models by that developer are concerned. The next iteration will likely “ace” it (but not for the right reasons).

Let us consider OpenAI policies in this respect:

  • ChatGPT by default keeps your data and may use it for training. It is said to provide an opt-out of data collection.
  • The OpenAI API policy was updated on March 1 2023, and currently states that by default data is not retained and not used for training. Whatever was submitted before this date, can be used, so we can safely assume that much if not all of existing public benchmark data has been submitted to GPT-3 since 2020, including the labels or “gold” answers - at least those that were used as few-shot prompts. Interestingly, OpenAI then uses the contamination as a reason to exclude some evaluations but not others: the GPT4 tech report says that they did not evaluate on BIG-bench because of data contamination (in v.3 of the report it’s footnote 5 on p.6), although they do present their results for 100% contaminated GRE writing exams (Table 9).

The overall problem is that opt-outs and even opt-ins are not sufficient in the case of something meant to be a public benchmark: as dataset creators, the future of our work might be affected not only by our own use of our data - but also by anybody else using it! It takes just one other researcher who wasn’t careful to opt-out, or wasn’t able to – and our data is “poisoned” with respect to future models by that developer. Even if only some few-shot examples are submitted, they might be used to somehow auto-augment similar user prompts. Last but not least, if we make our data public, the model developers themselves could also proactively add it to the training data, looking to improve their model. If the labels or the “gold” answers are not public for an important benchmark, it would be worthwhile for them to create some similar data.

It’s unclear yet how to solve this problem. Perhaps there will soon appear some special version of robots.txt that both prohibits use for AI training, and requires that any resharing of this data keeps the same flag. And, hopefully, the large companies will eventually be required to comply, and be subject to audits. In the short-term, it seems like the only option is to simply not trust or produce benchmark results for models where test-train overlap analysis cannot be performed.

“We show that model X does/doesn’t do Y: (model analysis and interpretability)

Since we only have access to GPT-4 via the API, we can only probe model outputs. If the plan is to use existing probing datasets or construct new ones, we have the same resource problem described above (the previously used probing datasets might have been trained on, the previously used techniques could have been optimized for, the new work will be single-use, and still have the train-test overlap problem to an unknown extent).

Furthermore, at least some of these models seem to intentionally not produce identical outputs when queried with the same probe and settings (perhaps via random seeds or different versions of the model being used in parallel). In this case, whatever results we get may already be different for someone else, which puts our basic conclusions at risk. This could include, for instance, the reviewers of the paper, who will rightfully conclude that what our report may not be true. Moreover, if the developer keeps tweaking the model as we go, then by the time we finish writing the paper, the model could change (perhaps even based on our own data). Which would also make our work not only obsolete before it is even reviewed, but also incorrect.

This issue might be addressed by “freezing” given versions of the model and committing to keep them available to researchers, but there is hardly any incentive2 for for-profit companies to do so. For instance, some popular models including Codex/code-davinci-002 have already been deprecated. We also have no public information about what changes lead or do not lead to a new version number (and it is likely that at least the filters are updated continually, as users are trying to break the model).

Last but not least, consider the effect of showing that model X does/doesn’t do Y:

  • “Model does Y”: without test-train overlap guarantees this is not necessarily a statement about the model. For example, chatGPT was reported to be able to play chess (badly). That looks unexpected of something that you consider a language model, but if you knew that it has seen a lot of chess data - it is hardly newsworthy that a language model can predict a plausible-looking sequence of moves. Basically, instead of discovering properties of a language model (which could be a research finding), we’re discovering that the internet dump it was trained on contained some chess data.
  • Model doesn’t do Y”: by collecting cases where the model seems to fail, we implicitly help the commercial entity controlling that model to “fix” those specific cases, and further blur the line between “emergent” language model properties and test cases leaked in training. In fact, GPT-4 was already trained on user interactions gathered during the mass testing of ChatGPT, which provided Open AI with millions of free examples, including “corrected” responses to prompts submitted by users. In the long run, our work would make it harder for the next researcher to examine the next “closed” model. What’s even worse, it would decrease the number of easy-to-spot errors that might prevent ordinary users from falling for the Eliza effect, hence increasing their trust in these systems (even though they are still fundamentally unreliable).

In summary, by showing that a closed model X does/doesn’t do Y we would likely not contribute to the general understanding of such models, and/or exacerbate the evaluation issues.

“We show that model X is (un)fair/biased etc”: (AI ethics)

Let us say that we somehow showed that the closed model yields some specific type of misinformation or misrepresents a given identity group (as it was done e.g. for anti-Muslim bias in GPT-3). The most likely outcome for such work is that this specific kind of output will be quickly “patched”, perhaps before we even publish the paper. The result is that (a) our hard work is short-lived, which may matter for researcher careers, (b) we actively helped the company make their model seem more ethical, while their training data probably didn’t fundamentally change, and hence the model probably still encodes the harmful stereotypes that could manifest themselves in other ways. Consider how in Dall-E 2 the gender and identity terms were randomly added to make outputs seem more diverse, as opposed to showing the default identity groups (read: White Men).

So, should we just forgo studying “closed” models from the ethics angle? Of course not: independent analysis on commercial systems is strictly necessary. But we need to figure out ways to do this without providing companies with free data with which to mask the symptoms of the underlying problem. Here are some alternatives that may lean on skillsets that NLP researchers are still developing, and perhaps will be strengthened by collaborations with experts in HCI and social sciences:

  • User studies on whether people trust the over-simplified chatbot answers, how likely they are to verify information, whether students use it in ways that actually improves their learning outcomes, and interventions that promote safer use practices. This kind of work focuses on the potential effects of these models, given the known phenomenon of automation bias, and any negative findings can only be refuted with a public user study.
  • Discussing and documenting instances of real-world harms, where they can be traced to the model (akin to the Stochastic Parrots paper). Ideally, such cases would require not only a fix, but also public acknowledgment and hopefully compensation.
  • User studies of various demographic cohorts, to see if the system works equally well for them in different real-world tasks: something with qualitative evaluation, where a fix would require obtaining better training data for that cohort. But this kind of work would need to somehow avoid producing too much concrete evidence that could be used to simply “patch” the output.
  • Studies not just of these systems, but on their intended and real impact on society. We need a lot of research on system-level issues where a “fix” would require changes to the business model and/or the way these systems are presented and marketed. An obvious example is the jobs that are too risky to be automated with the unreliable, biased, hallucination-prone systems that we currently have. For instance, do policy-makers jump on the opportunity to hire fewer teachers, and what kinds of schools are more likely to be sent down that path?

“We develop a more efficient solution than model X”:

The reviewers would likely (and rightly) expect us to show that we improve efficiency while maintaining a similar level of performance, which means we inherit all the above evaluation issues. Also, we likely don’t even have enough details about the training of the “baseline”, including its computational costs, the amount and source of energy invested in it, etc.

We Do Have Options!

Dear members of the NLP community: the good news is that if you’d like to do… you know… actual research on language models, you do have open options, and more of them will probably be coming, as the cost of training goes down. Here are a few examples of models that come not only with reasonable descriptions of their training data, but even tools to query it:

Model Type Size Data sourcing Corpus Searchable training data
BLOOM multilingual LLM 560M-176B documentation efforts ROOTS Roots Search Tool
GPT-Neo models mostly-English LLMs 125M-2.7B Pile datasheet The Pile The Pile Data Portraits
T5 English LLM 60M-11B partial C4 documentation C4 C4 search

What about the reviewers who might say “but where’s GPT-4?” Here’s what you can do:

  • Preemptively discuss in your paper why you don’t provide e.g. chatGPT results as a baseline, before your paper is submitted. If necessary, use the arguments in this post in your rebuttals to reviewers.
  • Preemptively raise the issue with the chairs of the conference you plan to submit to, to ask if they have a policy against such superficial popularity-driven reviews. The ACL 2023 policy didn’t cover this, since the problem became apparent after the submission deadline, but it can be extended by future chairs. We will be following any policy discussions related to this in ACL conferences; if you have any comments, or if there are any major developments and if you’d like us to keep you in the loop - please use this form.
  • As a reviewer or chair, if you see someone insisting on closed baselines - side with the authors and push back.
  • Discuss these matters openly in your own community; as reviewers, we can continue to educate and influence each other to drive our norms to a better place.

Another question outside of the scope of this post, but that could be brought up for community discussion in the future, is whether the “closed” models should be accepted as regular conference submissions (in direct competition with “open” work for conference acceptance and best paper awards) – or perhaps it is time to reconsider the role of the “industry” track.

Our community is at a turning point, and you can help to direct the new community norms to follow science rather than hype – both as an author and as a reviewer. The more people cite and study the best available open solutions, the more we incentivize open and transparent research, and the more likely it is that the next open solution will be much better. After all, it is our tradition of open research that has made our community so successful.

Addendum: counter-arguments

Train-test overlap and uninspected training data has always been an issue, ever since we started doing transfer learning with word2vec and onwards. Why protest now?

People have in fact been raising that issue many times before. Again, even OpenAI itself devoted a big chunk of the GPT-3 paper to the issues with benchmark data contamination. The fact that an issue is old doesn’t make it a non-issue; it rather makes us a field with a decade of methodological debt, which doesn’t make sense to just keep accruing.

The-Closed-Model-Everybody-Is-Talking-About does seem to work better for this task than my model or open alternatives, how can I just ignore it and claim state-of-the-art?

Don’t. “State-of-the-art” claims expire in a few months anyway. Be more specific, and just show improvement over the best open solution. Let’s say that in your task ChatGPT is clearly, obviously better than open alternatives, based on your own small testing with your own examples. What you don’t know is whether this is mostly due to some clever model architecture, or some proprietary data. In the latter case, your scientific finding would be… that models work best on data similar to what they were trained on. Not exactly revolutionary.

Also, ask yourself: are you sure that the impressive behavior you are observing is the result of pure generalization? As mentioned above, there is no way to tell how similar your test examples are to the training data. And that training data could include examples submitted by other researchers working on this topic, examples that were not part of any public dataset.

The-Closed-Model-Everybody-Is-Talking-About does seem to work better for this task than my model or open alternatives, how can I just ignore it and not build on it?

That has indeed been the pathway to many, many NLP publications in the past: take an existing problem and the newest thing-that-everybody-is-talking-about, put them together, show improvement over previous approaches, publish. The problem is that with an API-access closed model you do not actually “build” on it; at best you formulate new prompts (and hope that they transfer across different model versions). If your goal is engineering, if you just need something that works - this might be sufficient. But if you are after a scientific contribution to machine learning theory or methods - this will necessarily reduce the perceived value of your work for the reviewers. And if the claim is that you found some new “behavior” that enables your solution, and hasn’t been noticed before - you will still need to show that this “behavior” cannot be explained by the training data.

Whatever we may say, The-Closed-Model-Everybody-Is-Talking-About is on everyone’s minds. People are interested in it. If I don’t publish on it, somebody else will and get more credit than me.

Well, that is a personal choice: what do you want the credit for and who do you want recognition from? Publishing on the “hottest” thing might work short-term, but, as shown above, if we simply follow the traditional NLP research narratives with these models as new requisite baselines in place of BERT, our work will be either fundamentally divorced from the basic principles of research methodology, or extremely short-lived, or both. Imagine looking at the list of your published papers 10 years from now: do you want it to be longer, or containing more things that you are proud of long-term?

Are there other ways to study these models that would not run into these issues? We discussed some such ways for ethics-oriented research, perhaps there are other options as well.

Can’t we just study very made-up examples that are unlikely to have been in training data?

First of all, if the point is to learn something about what that model does with real data - very artificial examples could be handled in some qualitatively different way.

Second, at this point you need to be sure that you are way more original than all those other folks who tested chatGPT for several months. Especially since the data used for RLHF comes from interactions with GPT3 - perhaps even your own!

Third, you would still need to know what part actually hasn’t been seen. For example, ChatGPT was reported to write a fable about a peanut butter sandwich stuck in a VCR, in King James Bible style, and that example got subsequently shared in dozens of media articles. This is a cool example, but what exactly is it that we believe to be impressive? The style transfer, the knowledge that things get stuck in VCRs, the plausible instructions? The degree of impressiveness of each one of these depends on what was in the training data. And even the impressiveness of the ability to tie these things together still depends on what combinations of “skills” were seen in training, and whether this is in fact a pure language model behavior, and not some combination of pipeline components.

We tried to reproduce that answer, but accidentally typed “CVR” instead of “VCR”. The result was very illuminating. We got generic instructions that could have come from something like WikiHow: how to wipe off something sticky off something electric. Which, of course, is no good here: a sandwich includes a large piece of bread, which you would need to remove by hand rather than by wiping it off. But the best part is that the model later “admitted” it had no idea what “CVR” was! (Indeed, large language models don’t inherently “know” anything about the world). And then, when prompted for “VCR”, apparently the directive to maintain consistency within the dialogue overruled whatever it could have said about ‘VCR”… so we got the same wrong instructions.

What did go without a hitch is the paraphrasing in the King James style. But it’s hard to imagine that paraphrasing was not an intended-and-trained-for “capability”, or that this style was not well represented in a large web-based corpus - o ye of little faith.

Does it work well? Yes. Is it a magical “emergent” property? No. Can we develop another paraphrasing system and meaningfully compare it to this one? Also no. And this is where it stops being relevant for NLP research. That which is not open and reasonably reproducible cannot be considered a requisite baseline.

Share / cite / discuss this post


  1. The work on this post started a while ago, and has nothing to do with either longtermism or the plea for democratization of GPU resources. 

  2. There are in fact incentives for such companies to close and deprecate previous versions of their models, for the sake of (a) reducing attack surface, (b) capping technical debt. These are legitimate concerns for commercial entities, but they are intrinsically in tension with their models being objects of scientific inquiry.