Peer review in NLP: reject-if-not-SOTA

Everything wrong with reject-if-not-SOTA

After each reviewing round for a major conference, #NLProc Twitter erupts with bitter reports of methods rejected for failing to achieve state-of-the-art (SOTA) status.

This kind of attitude gives the impression of a completely broken peer-review system, discouraging new minds from even trying to enter the field. Only last month I gave an invited talk for an advanced NLP class at UMass Lowell, telling the students about a new QA benchmark that they could try. After the class a few students came up to me and said they were interested, but they were concerned that it would be a priori futile: whatever they did, they probably would not be able to beat the huge models released monthly by the top industry labs. Note that this was a class at a major US university, so students in less favorable environments probably feel even more discouraged.

Moreover, the coveted SOTA does not even necessarily advance the field. Looking at a popular leaderboard like GLUE, can we really conclude that the top system has the best architecture? When the test score differences are marginal, any of the following could be in play:

In addition to all the above issues, the leaderboards put us in a hamster wheel. They are updated so quickly that SOTA claims should really be taken as “SOTA at the time of submitting this paper”. If the paper is accepted, it will likely lose the SOTA status even before publication. If it is rejected, the authors have to try their luck at the next conference without being able to claim SOTA anymore.

The SOTA chase takes an absurd twist when a tired reviewer glances at the leaderboard and dislikes the paper for not including the very latest models. For instance, at least two EMNLP 2019 reviewers requested a comparison with XLNet (Yang et al., 2019), which topped the leaderboards after the EMNLP submission deadline:

How did we get here?

All of the above makes so much sense that one has to wonder how we even got to reject-if-not-SOTA. Surely, the people who get asked to review for top NLP conferences know all this?

I would conjecture that two factors are in play:

  • The fact that we are drowning in papers, and need heuristics to decide what to read/tweet/publish. SOTA is just one such heuristic.
  • Glorification of benchmarks, coupled with the initial trajectory of the deep learning community within NLP.

The first factor deserves its own post. The lack of time, the low prestige, and the lack of career or monetary compensation for reviewing mean that people are strongly incentivized to rely on heuristics, of which SOTA is just one example. To combat that, we need deep, systemic changes, which will take a long time to implement.

The second factor is specific to reject-if-not-SOTA. Ehud Reiter makes a useful distinction between “evaluation metrics” and “scoring functions”. Language is complex, and our benchmarks are far from perfect, so ideally we would have (1) the introduction of a benchmark, (2) a wave of system papers that hopefully reaches human performance, and then (3) a massive switch to an improved benchmark. Instead, we get stuck in step 2, and the benchmark becomes a scoring function that simply enables the community to publish tons of SOTA-claiming papers.

For example, we now have SuperGLUE and over 80 QA datasets, but new system papers will still mostly evaluate on SQuAD and GLUE, because these are the names that the reviewers most likely know and expect. Since both SQuAD and GLUE are solved well past human baselines, the result is likely an exercise in overfitting.

Additionally, while the benchmark problem is nothing new, the current SOTA chase might have had an extra push from the fact that there was a massive wave of papers with a common trajectory: taking some task/dataset and showing that a neural method could handle it better than was possible before. Many of these papers were written by new authors, and they might still expect the same kind of contributions. But that expectation is outdated. As discussed above, the current leaderboards do not necessarily indicate superiority of the architecture, and the very possibility of using neural nets for different NLP tasks is now taken for granted.

Solution: guidelines on what constitutes an acceptable contribution

Once again, SOTA is just one of the heuristics that tired and underpaid reviewers resort to in order to cope with the deluge of papers, and in the long run we need to implement systemic changes to the review system. But there is something we could do right now to mitigate this particular heuristic: we could expand it. Performance is just one of many factors that could make a system interesting. What if we had guidelines for authors, reviewers and ACs, which would contain a list of publication-worthy contributions – of which SOTA would be just one? The authors would then have a fighting chance against the SOTA heuristic in the rebuttals, and the reviewers would hopefully be discouraged from using it in the first place.

Now, compiling such guidelines is admittedly not easy: no paper is perfect. The best reviewers weigh the strengths and weaknesses of papers on a case-by-case basis, necessarily comparing apples to oranges with some degree of subjectivity. Still, for starters, here is a list compiled from many Twitter discussions (suggestions welcome).

A new system may be publication-worthy if it has a strong edge over the competition in one or more of the following ways:

  • better performance (significantly and consistently higher than the competition, and surpassing variability due to random initializations);
  • more computation-efficient (fewer resources to train and/or deploy);
  • more data-efficient (requires less data to train, or less high-quality data);
  • more stable over possible hyperparameters and/or random initializations, easier to tune;
  • better generalizability (less biased, able to avoid learning from data artifacts, better generalizing across datasets and domains, more adversarially robust);
  • having different properties (e.g. different output type, making different kinds of predictions and errors);
  • more interpretable (humans can engage with the output better, easier to understand where it goes wrong and how to fix it);
  • conceptually simpler (this would likely overlap with computation efficiency and stability);
  • more cognitively plausible (more consistent with what is known about human language processing);
  • making unexpected connections between subfields, bringing some technique into a completely new context.

It goes without saying that for any of these criteria the study should clearly state its hypothesis (doing X as opposed to Y is expected to have the effect Z), and prove/disprove it for the reviewers with appropriate experiments. If the proposal is only a minor incremental modification of an existing model, and its only hope of publication was beating SOTA, then the authors would be unlikely to be able to claim any of the other factors retroactively.
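For the first criterion in the list above, a rough sanity check is to ask whether the average gain of a system is larger than the run-to-run spread caused by random seeds. The following is a minimal sketch under stated assumptions: the function names, the threshold `k`, and all scores are hypothetical, and the comparison is a crude heuristic rather than a proper significance test.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def gain_exceeds_seed_noise(baseline, proposed, k=2.0):
    """Heuristic check (an assumption, not a standard test): is the mean
    gain larger than k times the larger per-seed standard deviation?"""
    base_mean, base_std = summarize(baseline)
    prop_mean, prop_std = summarize(proposed)
    gain = prop_mean - base_mean
    return gain, gain > k * max(base_std, prop_std)


# Hypothetical dev-set accuracies from five random seeds per system.
baseline_runs = [88.1, 87.6, 88.4, 87.9, 88.0]
proposed_runs = [88.3, 87.8, 88.9, 88.0, 88.5]

gain, clear = gain_exceeds_seed_noise(baseline_runs, proposed_runs)
print(f"mean gain: {gain:.2f}, exceeds seed noise: {clear}")
```

With these made-up numbers the proposed system is 0.3 points better on average, but its own runs vary by more than that, so the check reports that the gain does not stand out from seed noise. A real comparison would use more seeds and an appropriate statistical test.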

The above list aims to give a fighting chance to systems that perform well while offering some other kind of advantage, such as generalizability/efficiency etc. But given the history of deep learning, it should not be impossible to publish a valuable idea, even if for some reason it could not be made to perform well yet. However, the idea should be actually novel, rather than “just make it bigger”. A rule-of-thumb criterion for a paper with an interesting idea (attributed to Serge Abiteboul) is that you’d feel tempted to have your students read it.

Reject-if-not-SOTA and non-modeling papers

It would seem that NLP system papers are the ones most affected by the reject-if-not-SOTA heuristic, but they are actually the privileged class because at least they contain the kind of experiments that the reject-if-not-SOTA reviewers expect. All other kinds of papers are just unacceptable by definition:

  • systematic parameter and tuning studies;
  • model analysis, representation probing papers, ablation studies;
  • resource papers;
  • surveys;
  • work on ethical considerations in NLP;
  • opinion pieces, especially retrospectives (bridging DL and prior methods), cross-disciplinary contributions, papers connecting subfields that work on similar phenomena under different names.

Reviewing all these different kinds of papers properly deserves separate posts, but they are all a legitimate part of *ACL conferences. For resources in particular, consider again that ideally the field should cycle through (1) the introduction of a benchmark, (2) a wave of system papers that hopefully reaches human performance, and then (3) a massive switch to an improved benchmark. If the difficult interdisciplinary work on improving benchmarks is not rewarded on par with system engineering work, who would bother?

Rachel Bawden cites an ACL 2019 reviewer who gave the following account of her MT-mediated bilingual dialogue resource:

The paper is mostly a description of the corpus and its collection and contains little scientific contribution.

Reviewers with CS backgrounds who are not interested in methodological, theoretical, linguistic, or psychological work should not simply reject these kinds of contributions, recommending that the authors try LREC or workshops. They should decline the assignment and ask the ACs to find a better match. NLP is an interdisciplinary field, human language is incredibly complex, and we need all the help we can get.

Update (09.05.2020): Here is a post specifically on dos and don’ts in reviewing resource papers.


SOTA is just one of many heuristics used by reviewers and everybody else to decide what is worth paying attention to. Heuristics stem from the paper deluge and the difficulty of navigating an interdisciplinary field with just one degree. The field is in dire need of systemic changes to make reviewing visible, compensated, and high-prestige work.

But one thing we could realistically do about the SOTA heuristic right now is to at least have clear guidelines for both the authors and reviewers of NLP system papers. These guidelines should emphasize that there are many possible publication-worthy types of contributions: we need breakthroughs in models that are energy- and data-efficient, transparent, cognitively plausible, generalizable etc. Welcoming them would stimulate intellectual diversity of approaches, greener solutions, cross-disciplinary collaboration, and participation by less well-funded labs from all over the world.


A lot of amazing #NLProc people contributed to the Twitter discussions on which this post is based. In alphabetical order:

Niranjan Balasubramanian, Emily Bender, Kyunghyun Cho, Leshem Choshen, Aleksandr Drozd, Greg Durrett, Matt Gardner, Alvin Grissom II, Kristian Kersting, Tal Linzen, Zachary Lipton, Florian Mai, Marten van Schijndel, Evpok Padding, Ehud Reiter, Stephen Roller, Anna Rumshisky, Jesse Thomason


2020 events for your SOTA-free paper

If you’re concerned about the above issues, here are some events and workshops this year that work towards mitigating them:

Note also that EMNLP 2020 implements a reproducibility checklist based on work by Joelle Pineau and (Dodge, Gururangan, Card, Schwartz, & Smith, 2019), which includes the number of hyperparameter search trials and some measure of performance “mean and variance as a function of the number of hyperparameter trials”. Hopefully that by itself should draw some of the reviewers’ attention towards model efficiency.
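To make “performance as a function of the number of hyperparameter trials” concrete, here is a minimal sketch of one way to estimate the expected best validation score after n trials, in the spirit of Dodge et al. (2019). It treats the observed scores as an empirical distribution and computes the expected maximum of n draws with replacement. The function name and the scores are hypothetical; this is not the checklist's or the paper's reference code, and it covers only the mean, not the variance.

```python
def expected_max_performance(scores, n_trials):
    """Expected best score among n_trials draws (with replacement)
    from the empirical distribution of the observed scores."""
    ordered = sorted(scores)
    total = len(ordered)
    expected = 0.0
    for rank, value in enumerate(ordered, start=1):
        # Probability that the maximum of n_trials draws equals the
        # rank-th smallest observed score.
        prob = (rank / total) ** n_trials - ((rank - 1) / total) ** n_trials
        expected += value * prob
    return expected


# Hypothetical validation accuracies from eight hyperparameter trials.
val_scores = [0.71, 0.74, 0.69, 0.80, 0.77, 0.73, 0.79, 0.75]

for n in (1, 2, 4, 8):
    print(n, round(expected_max_performance(val_scores, n), 4))
```

With a single trial the estimate equals the plain mean of the observed scores, and it climbs towards the best observed score as the budget grows. Plotting this curve for two systems shows which one is better at a given tuning budget, not just which one wins after an extensive search.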


  1. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., & Le, Q. V. (2019). XLNet: Generalized Autoregressive Pretraining for Language Understanding. ArXiv:1906.08237 [Cs].
  2. Sugawara, S., Stenetorp, P., Inui, K., & Aizawa, A. (2020). Assessing the Benchmarking Capacity of Machine Reading Comprehension Datasets. AAAI.
  3. McCoy, T., Pavlick, E., & Linzen, T. (2019). Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3428–3448. doi:10.18653/v1/P19-1334
  4. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv:1907.11692 [Cs].
  5. Jia, R., & Liang, P. (2017). Adversarial Examples for Evaluating Reading Comprehension Systems. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2021–2031. doi:10.18653/v1/D17-1215
  6. Dodge, J., Ilharco, G., Schwartz, R., Farhadi, A., Hajishirzi, H., & Smith, N. (2020). Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping. ArXiv:2002.06305 [Cs].
  7. Dodge, J., Gururangan, S., Card, D., Schwartz, R., & Smith, N. A. (2019). Show Your Work: Improved Reporting of Experimental Results. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2185–2194. doi:10.18653/v1/D19-1224
  8. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186.
  9. Crane, M. (2018). Questionable Answers in Question Answering Research: Reproducibility and Variability of Published Results. Transactions of the Association for Computational Linguistics, 6, 241–252. doi:10.1162/tacl_a_00018