# How the Transformers broke NLP leaderboards

This post summarizes some of the recent XLNet-prompted discussions on Twitter and offline. Idea credits go to Yoav Goldberg, Sam Bowman, Jason Weston, Alexis Conneau, Ted Pedersen, fellow members of Text Machine Lab, and many others. Any misconfiguration of those ideas is my own.

A big reason why NLP is such an actively developed area is the leaderboards: they are the core of multiple shared tasks, benchmark systems like GLUE, and individual datasets such as SQuAD and the AllenAI datasets. Leaderboards stimulate competition between engineering teams, helping them to develop better and better models to tackle human language.

Or do they?

## So what’s wrong with the leaderboards?

Typically a leaderboard for an NLP task X looks roughly as follows:

| System | Citation | Performance |
|---|---|---|
| System A | Smith et al. 2018 | 76.05 |
| System B | Li et al. 2018 | 75.85 |
| System C | Petrov et al. 2018 | 75.62 |

This format is followed both by online leaderboards (such as the GLUE benchmark), and academic papers (when comparing the proposed model to the baselines).

Now, the test performance of the model is far from the only thing that makes it novel or even interesting, but it is the only thing that is in the leaderboard. Since DL is such a big zoo with different architectures, there is no standard way to present additional information such as model parameters and training data. In the papers, sometimes these details are in the methodology section, sometimes in the appendices, sometimes in the comments in the GitHub repo, or nowhere at all. In an online leaderboard, the details of each system can only be retrieved from the link to the paper (if one is available), or by going through the code in the repository.

In an increasingly busy world, how many of us actually look for those details, unless we are reviewing or re-implementing? The simple leaderboard already gives us the information we most care about: who SOTA-ed. Generally, our minds are lazy and tend to receive such messages uncritically, ignoring any caveats even if they are immediately present (Kahneman, 2013). And if we have to actively hunt for the caveats… well, no chance. The winner receives all the Twitter hype, potentially gaining an unfair advantage in the blind review.

There has been a lot of discussion of the dangers of the SOTA-centric approach. If the reader’s main takeaway is going to be the leaderboard, that increases the perception that publication-worthiness is only achieved by beating the SOTA. That perception results in a flood of papers with marginal and often unreproducible performance gains (Crane, 2018). It also creates a huge problem for shared tasks, when non-winners feel like it’s not even worth their while to write the paper on their work (Escartín et al., 2017; see also the recent discussion of the issue by Ted Pedersen).

The focus of this post is yet another problem with the leaderboards that is relatively recent. Its cause is simple: fundamentally, a model may be better than its competitors by building better representations from the available data - or it may simply use more data, and/or throw a deeper network at it. When we have a paper presenting a new model that also uses more data/compute than its competitors, credit attribution becomes hard.

The most popular NLP leaderboards are currently dominated by Transformer-based models. BERT (Devlin, Chang, Lee, & Toutanova, 2019) received the best paper award at NAACL 2019 after months of holding SOTA on many leaderboards. Now the hot topic is XLNet (Yang et al., 2019), which is said to have overtaken BERT on GLUE and some other benchmarks. Other Transformers include GPT-2 (Radford et al., 2019) and ERNIE (Zhang et al., 2019), and the list is growing.

The problem we’re starting to face is that these models are HUGE. While the source code is available, in reality it is beyond the means of an average lab to reproduce these results, or to produce anything comparable. For instance, XLNet is trained on 32B tokens, and the price of using 500 TPUs for 2 days is over $250,000. Even fine-tuning this model is getting expensive.

## Wait, this was supposed to happen!

On the one hand, this trend looks predictable, even inevitable: people with more resources will use more resources to get better performance. One could even argue that a huge model proves its scalability and fulfils the inherent promise of deep learning, i.e. being able to learn more complex patterns from more information. Nobody knows how much data we actually need to solve a given NLP task, but more should be better, and limiting data seems counter-productive.

On that view - well, from now on top-tier NLP research is going to be something possible only for industry. Academics will have to somehow up their game, either by getting more grants or by collaborating with high-performance computing centers. They are also welcome to switch to analysis, building something on top of the industry-provided huge models, or making datasets.

However, in terms of overall progress in NLP that might not be the best thing to do. Here is why.

## Why huge models + leaderboards = trouble

The chief problem with the huge models is simply this:

“More data & compute = SOTA” is NOT research news.

If leaderboards are to highlight the actual progress, we need to incentivize new architectures rather than teams outspending each other. Obviously, huge pretrained models are valuable, but unless the authors show that their system consistently behaves differently from its competition with comparable data & compute, it is not clear whether they are presenting a model or a resource.

Furthermore, much of this research is not reproducible: nobody is going to spend $250,000 just to repeat XLNet training. Given the fact that its ablation study showed only 1-2% gain over BERT in 3 datasets out of 4 (Yang et al., 2019), we don’t actually know for sure that its masking strategy is more successful than BERT’s.

At the same time, the development of leaner models is dis-incentivized, as their task is fundamentally harder and the leaderboard-oriented community only rewards the SOTA. That, in turn, prices academic teams out of the competition, which does not help students become better engineers by the time they graduate.

Last but not least, huge DL models are often overparametrized (Frankle & Carbin, 2019; Wu, Fan, Baevski, Dauphin, & Auli, 2019). As an example, the smaller version of BERT achieves better scores on a number of syntax-testing experiments than the larger one (Goldberg, 2019). The fact that DL models require a lot of compute is not necessarily a bad thing in itself, but wasting compute is not ideal for the environment (Strubell, Ganesh, & McCallum, 2019).

## Possible solutions

NLP leaderboards are in real danger of turning into something where we give up on reproducibility and just watch one Google model outperform another Google model every couple of months. To avoid that, the leaderboards need to change.

In principle, there are two possible solutions:

1) For a specific task, it should be possible to provide a standard training corpus, and limit the amount of compute to that used by a strong baseline. If the baseline is itself something like BERT, this will incentivize the development of models that make better use of resources. If a system uses pre-trained representations (word embeddings, BERT, etc.), the size of pre-training data should be factored into the final score.

2) For a suite of tasks like GLUE, we could let the participants use as much data and compute as they want, but factor that into the final score. The leaderboard itself should make it immediately clear how much a model gains over the baseline relative to the amount of resources it consumed.
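To make the second option a bit more concrete, here is a minimal sketch of what a resource-adjusted score could look like. Everything in it is an assumption made for illustration: the `Entry` fields, the choice of a geometric mean to combine the compute and data ratios, and all of the numbers. It is one possible formulation, not a proposal for the actual formula.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    score: float        # task metric, higher is better
    compute: float      # any consistent unit, e.g. FLOPs or device-hours
    data_tokens: float  # size of the (pre-)training data


def relative_cost(entry: Entry, baseline: Entry) -> float:
    # Assumed aggregation: geometric mean of the compute and data ratios.
    compute_ratio = entry.compute / baseline.compute
    data_ratio = entry.data_tokens / baseline.data_tokens
    return (compute_ratio * data_ratio) ** 0.5


def gain_per_cost(entry: Entry, baseline: Entry) -> float:
    # Improvement over the baseline, divided by how much more it cost.
    return (entry.score - baseline.score) / relative_cost(entry, baseline)


# Hypothetical numbers, for illustration only.
baseline = Entry("baseline", score=75.0, compute=1.0, data_tokens=1.0)
big = Entry("big model", score=77.0, compute=16.0, data_tokens=4.0)
lean = Entry("lean model", score=76.0, compute=1.0, data_tokens=1.0)

for entry in (big, lean):
    print(f"{entry.name}: raw gain {entry.score - baseline.score:+.2f}, "
          f"gain per unit of relative cost {gain_per_cost(entry, baseline):.2f}")
```

With these made-up numbers the lean model ranks above the big one on the adjusted score even though its raw score is lower, which is the kind of signal a plain leaderboard hides.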

Both of these approaches require a reliable way to estimate the computation cost. At the minimum, it could be the inference time as estimated by the task organizers. Aleksandr Drozd (RIKEN CCS) suggests the best way is to just report the FLOPs count, which seems to be already possible for both PyTorch and TensorFlow. Perhaps it would also be possible to build a general service for shared tasks that would receive a DL model, train it for one epoch on one batch of data, and provide the researchers with the estimate.
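As a rough illustration of the "inference time as estimated by the task organizers" option, the sketch below times an arbitrary prediction function over a fixed set of inputs using only the Python standard library. The function name `median_inference_seconds`, the `repeats` parameter and the toy `toy_predict` stand-in are all hypothetical. A real harness would also have to fix the hardware and batch size, which this does not attempt.

```python
import statistics
import time


def median_inference_seconds(predict, inputs, repeats=5):
    """Median wall-clock time of running `predict` over all `inputs`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for x in inputs:
            predict(x)
        timings.append(time.perf_counter() - start)
    # The median is less sensitive to a single slow run than the mean.
    return statistics.median(timings)


def toy_predict(text):
    # Stand-in for a submitted model: counts whitespace-separated tokens.
    return len(text.split())


elapsed = median_inference_seconds(toy_predict, ["a b c"] * 1000)
print(f"median time over 1000 inputs: {elapsed:.6f} s")
```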

Estimating the training data is also not straightforward: a plain text corpus should be worth less than an annotated corpus or Freebase. However, this should be possible to weigh. For example, unstructured data could be estimated as raw token count $N$, augmented/parsed data - as $aN$, and structured data such as dictionaries - as $N^2$.
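The weighting scheme above could be sketched as follows. The three categories and the formulas $N$, $aN$ and $N^2$ come from the paragraph. The value of `a` is not specified anywhere, so the default of `2.0` is purely a placeholder, and the category labels are my own.

```python
def effective_data_size(n_tokens, kind, a=2.0):
    """Weigh a corpus of `n_tokens` tokens by how rich its annotation is.

    `a` is a placeholder multiplier for augmented/parsed data (assumed value).
    """
    if kind == "raw":          # unstructured plain text: N
        return float(n_tokens)
    if kind == "annotated":    # augmented or parsed data: a * N
        return a * n_tokens
    if kind == "structured":   # dictionaries, knowledge bases: N ** 2
        return float(n_tokens) ** 2
    raise ValueError(f"unknown data kind: {kind!r}")


# Hypothetical corpora of 1,000 tokens each, for illustration only.
for kind in ("raw", "annotated", "structured"):
    print(kind, effective_data_size(1_000, kind))
```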

One counter-argument to the above is that some models may inherently require more data than others, and can only be fairly evaluated in large-scale experiments. But even in this case, a convincing paper would need to show that the new model can “hold” more data than its competitors, and so multiple rounds of training all models on the same data are still necessary.

## Summing up

This is the leaderboard discussion so far, and it’s far from over. If you have anything to add, especially any other possible solutions - please let me know on Twitter or in the comments below. I’ll update the post with any major developments.

Let me stress that huge pretrained models like BERT are an undeniable achievement, and did help to push the state-of-the-art on numerous tasks. Obviously, there is nothing wrong methodologically with using any <muppetName> as pretrained representations, as long as the paper is about something else and does not rest on any properties of <muppetName> that have not been fully validated. There is also nothing wrong with analysing <muppetName>: the steady stream of BERTology papers by itself suggests how little we understood about BERT while it was all over the leaderboards (Voita, Talbot, Moiseev, Sennrich, & Titov, 2019; Clark, Khandelwal, Levy, & Manning, 2019; Coenen et al., 2019; Jawahar, Sagot, & Seddah, n.d.; Lin, Tan, & Frank, 2019).

But we do have a methodological problem if a paper introduces another <muppetName> without factoring in its stability and the resources it took to train vs. the competition, and then everybody takes the leaderboard performance as an indicator of a breakthrough architecture.

Imagine that tomorrow we wake up to a paper presenting a Don’t-Even-Try-Net (model name © Olga Kovaleva), a new architecture that achieves superhuman performance on every NLP task after being trained for a year on every computer in North America. Even with the source code we would not be able to verify that claim. We could use the pretrained weights, but without multiple runs for ablation and stability evaluation the authors would not have proven the superiority of their approach. In a sense, they would be presenting a resource rather than a model.

If we are to make actual progress, we need to make sure new systems get fame and awards only with rigorous proofs - including the multiple runs of training on the same data as the baselines, ablation studies, estimates of compute and stability. This would inherently encourage more hypothesis-driven research. For instance, the dependency objective in XLNet looks really interesting, and I would love to know how much advantage it actually confers on different tasks, given that dependency-based word embeddings turned out to be of limited use (Li et al., 2017; Lapesa & Evert, 2017).

## Update of 22.07.2019

Oh wow, this post was retweeted over 100 times and made it to Sebastian Ruder’s NLP newsletter! Clearly, the issue of fair evaluation of huge models resonates with the community deeply.

Sebastian points out that Transformers make an important contribution in showing us the limitations of the more-data-and-compute approach and, ironically, are also starting to encourage research on leaner models. I fully agree with both points, and of course the Transformer in itself is an undeniable breakthrough. My point is simply that the current leaderboards implicitly encourage a blend of architectures, data and compute that are impossible to disentangle and replicate. If we are on a quest for the best possible NLP model, this is a problem we are going to have to solve.

Another update from a later discussion with Sam Bowman: leaderboards where you win by whatever combination of means do have a place in the world. Like Kaggle, they stimulate competition in ML engineering for NLP, and the results they showcase may in themselves be interesting and useful. But by themselves they are not a proof of architecture superiority, which they are commonly mistaken for. It seems to me that it would be easiest for the most influential leaderboards such as GLUE to change so as to help correct this perception, since all eyes are on them, but I can also see why they may want to remain Kaggle-style.

Model training cost clarification. The price of training XLNet was estimated as follows: the paper states that it was trained on 512 TPU v3 chips for 2.5 days, i.e. 60 hours. The Google on-demand price for TPU v3 is currently $8 per hour, which amounts to $245,760 before fine-tuning. James Bradbury points out that the authors could actually mean “devices” or “cores”, which would bring it down to $61,440 or $30,720, respectively. I would add that even in this most optimistic scenario the model would still cost more than the stipend of the graduate student working on it, and still be unrealistic for most labs.
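For anyone who wants to check the arithmetic, the three estimates can be reproduced in a few lines. The divisors 4 and 8 are not taken from any hardware documentation: they are simply the factors implied by the three dollar figures quoted above, so treat the mapping to “devices” and “cores” as an assumption.

```python
UNITS = 512          # figure reported in the XLNet paper
HOURS = 2.5 * 24     # 2.5 days = 60 hours
PRICE_PER_HOUR = 8.0 # on-demand price per billable unit, in USD


def training_cost(billable_units, hours=HOURS, price=PRICE_PER_HOUR):
    return billable_units * hours * price


# Divisors are inferred from the quoted totals, not from a spec sheet.
costs = {
    "512 counted as billable units": training_cost(UNITS),      # 245760.0
    "512 divided by 4":              training_cost(UNITS / 4),  # 61440.0
    "512 divided by 8":              training_cost(UNITS / 8),  # 30720.0
}

for reading, usd in costs.items():
    print(f"{reading}: ${usd:,.0f}")
```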

## References

1. Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., & Liu, Q. (2019). ERNIE: Enhanced Language Representation with Informative Entities. ACL 2019. http://arxiv.org/abs/1905.07129
2. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., & Le, Q. V. (2019). XLNet: Generalized Autoregressive Pretraining for Language Understanding. ArXiv:1906.08237 [Cs]. http://arxiv.org/abs/1906.08237
3. Wu, F., Fan, A., Baevski, A., Dauphin, Y., & Auli, M. (2019). Pay Less Attention with Lightweight and Dynamic Convolutions. International Conference on Learning Representations. https://openreview.net/forum?id=SkVhlh09tX
4. Voita, E., Talbot, D., Moiseev, F., Sennrich, R., & Titov, I. (2019). Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. ArXiv:1905.09418 [Cs]. http://arxiv.org/abs/1905.09418
5. Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. ACL 2019. http://arxiv.org/abs/1906.02243
6. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models Are Unsupervised Multitask Learners. OpenAI Blog, 1, 8. https://openai.com/blog/better-language-models/
7. Lin, Y., Tan, Y. C., & Frank, R. (2019). Open Sesame: Getting Inside BERT’s Linguistic Knowledge. ArXiv:1906.01698 [Cs]. http://arxiv.org/abs/1906.01698
8. Li, B., Liu, T., Zhao, Z., Tang, B., Drozd, A., Rogers, A., & Du, X. (2017). Investigating Different Syntactic Context Types and Context Representations for Learning Word Embeddings. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2411–2421. Copenhagen, Denmark, September 7–11, 2017. http://aclweb.org/anthology/D17-1257
9. Lapesa, G., & Evert, S. (2017). Large-Scale Evaluation of Dependency-Based DSMs: Are They Worth the Effort? Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 394–400. Association for Computational Linguistics. http://www.aclweb.org/anthology/E17-2063
10. Kahneman, D. (2013). Thinking, Fast and Slow (1st pbk. ed). New York: Farrar, Straus and Giroux.
11. Jawahar, G., Sagot, B., & Seddah, D. (n.d.). What Does BERT Learn about the Structure of Language? ACL 2019, 8.
12. Goldberg, Y. (2019). Assessing BERT’s Syntactic Abilities. ArXiv:1901.05287 [Cs]. http://arxiv.org/abs/1901.05287
13. Frankle, J., & Carbin, M. (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. International Conference on Learning Representations. https://openreview.net/forum?id=rJl-b3RcF7
14. Escartín, C. P., Reijers, W., Lynn, T., Moorkens, J., Way, A., & Liu, C.-H. (2017). Ethical Considerations in NLP Shared Tasks. Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 66–73. https://doi.org/10.18653/v1/W17-1608 https://aclweb.org/anthology/papers/W/W17/W17-1608/
15. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. https://aclweb.org/anthology/papers/N/N19/N19-1423/
16. Crane, M. (2018). Questionable Answers in Question Answering Research: Reproducibility and Variability of Published Results. Transactions of the Association for Computational Linguistics, 6, 241–252. https://doi.org/10.1162/tacl_a_00018 https://aclweb.org/anthology/papers/Q/Q18/Q18-1018/
17. Coenen, A., Reif, E., Yuan, A., Kim, B., Pearce, A., Viégas, F., & Wattenberg, M. (2019). Visualizing and Measuring the Geometry of BERT. ArXiv:1906.02715 [Cs, Stat]. http://arxiv.org/abs/1906.02715
18. Clark, K., Khandelwal, U., Levy, O., & Manning, C. D. (2019). What Does BERT Look At? An Analysis of BERT’s Attention. ArXiv:1906.04341 [Cs]. http://arxiv.org/abs/1906.04341

## Cite this post

If you’d like to cite this post, please use the following bibtex:

@misc{Rogers_2019_leaderboards,
title = { How the Transformers broke NLP leaderboards},
journal = {Hacking Semantics},
url = { https://hackingsemantics.xyz/2019/leaderboards/ },
author = {Rogers, Anna},
day = { 30 },
month = { Jun },
year = { 2019 }
}