NLP Evaluation Metrics


As @jabalazs suggested, this is a spin-off from a question in the Slack #nlp channel.

Metrics for NLG were briefly discussed there, but here I would like to begin with an under-appreciated paper (despite its 2.6k citations): Evaluation: from Precision, Recall and F-measure to ROC, Informedness, Markedness and Correlation (Powers, 2011). My take on this paper is that even for a seemingly straightforward binary classification task, the widely used F-measure is biased.
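To make that concrete, here is a minimal sketch with toy numbers of my own (not taken from the paper). It uses the standard definitions informedness = recall + inverse recall - 1 and markedness = precision + inverse precision - 1. The function name `binary_metrics` and the confusion-matrix counts are hypothetical, and there is no guard against empty rows or columns.

```python
def binary_metrics(tp, fp, fn, tn):
    """F1, informedness and markedness from a 2x2 confusion matrix.

    Assumes every denominator is non-zero (no zero-division guard).
    """
    precision = tp / (tp + fp)          # positive predictive value
    recall = tp / (tp + fn)             # true positive rate
    inverse_recall = tn / (tn + fp)     # true negative rate
    inverse_precision = tn / (tn + fn)  # negative predictive value
    f1 = 2 * precision * recall / (precision + recall)
    informedness = recall + inverse_recall - 1
    markedness = precision + inverse_precision - 1
    return {"f1": f1, "informedness": informedness, "markedness": markedness}


# Hypothetical data: 90 positives, 10 negatives, and a "classifier" that
# answers "positive" 90% of the time regardless of the input.
scores = binary_metrics(tp=81, fp=9, fn=9, tn=1)
print(scores)
```

With these counts F1 comes out at 0.9, while informedness and markedness are both 0, i.e. the guesser is no better than chance even though its F1 looks respectable.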

Unfortunately, according to The price of debiasing automatic metrics in natural language evaluation (Chaganty et al., 2018), “there is no unbiased estimator with lower cost.” That paper did not test the unbiased metrics from Powers (2011), though, and personally I suspect the corpora studied in Chaganty et al. (2018) are somewhat misleading, because I don’t think the CNN/Daily Mail dataset and MS MARCO v1.0 are sufficient for summarization and question answering, respectively.

So far this may sound disappointing, but acknowledging the weaknesses is important and sometimes insightful, in my opinion. Several papers (I won’t list them here, for readability) have also stated that, in spite of all the shortcomings of automatic evaluation metrics, some of them are still useful for helping researchers spot wrong or bad outputs systematically. At the very least I would say: just don’t blindly trust the state of the art by the numbers.


Back to the original question, I’m going to rearrange the thread here for reference.

Starting from @jabalazs’s recommendation of automatic evaluation metrics:

How to evaluate generated output is an open area of research. In machine translation people will often use BLEU score because it correlates well with human judgement, but there’s no agreement as to what is the best metric to use for other generation tasks. I recommend you to check the ROUGE metric (often used in summarization), and METEOR.


I followed up:

@jabalazs is right and that was why I asked about the usage of NLG.

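  • BLEU (for MT) is essentially (modified) n-gram precision against reference (gold-standard/ground-truth) sentences, combined with a brevity penalty.
  • ROUGE (for summarization), as its name states, is recall-oriented: it measures how much of the reference summary (n-grams, longest common subsequence, etc.) is covered by the output. As far as I know, matching WordNet synonyms is only available in some later extensions, not in the original package.
  • METEOR (for MT) measures alignment and, in addition to exact matches, can match stems, synonyms, and paraphrases.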
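As a rough illustration of the precision/recall distinction in the list above, here is a minimal sketch of clipped n-gram overlap. It is not the full BLEU or ROUGE definition (no brevity penalty, no multiple references, no geometric mean over n-gram orders); the helper names `ngrams` and `overlap` and the sentences are made up for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(candidate, reference, n):
    """Clipped n-gram precision (BLEU-style) and recall (ROUGE-N-style)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Counter intersection keeps the minimum count per n-gram, i.e. "clipping":
    # a candidate cannot get credit for repeating a word more often than the
    # reference contains it.
    matched = sum((cand & ref).values())
    precision = matched / max(sum(cand.values()), 1)  # share of candidate n-grams found
    recall = matched / max(sum(ref.values()), 1)      # share of reference n-grams covered
    return precision, recall


reference = "the cat sat on the mat".split()
candidate = "the the the cat".split()
p1, r1 = overlap(candidate, reference, 1)
print(p1, r1)
```

Here the unigram precision is 3/4 (the third "the" is clipped) while the unigram recall is only 3/6, which is why a short output can look good on a precision-style metric and poor on a recall-style one.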

You may have already noticed that NLG usually comes with a specific task: MT, summarization, dialogue systems, etc. Measured against the ultimate goals of those tasks, automatic evaluation metrics can only find a middle ground between effectiveness and efficiency. If human evaluation is affordable, there are at least two approaches worth checking for insight:

  1. Pyramid method for summarization:
  2. ARPA-like method for MT:

There are more human evaluation methods out there, such as PARADISE for dialogue systems, but the design principles behind them are similar.

An interesting EMNLP 2016 paper is How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. Here are some quotes (with my highlights) from its last section:

… do not correlate strongly with human judgement

… a natural language generation component for applications in constrained domains, may find stronger correlations with the BLEU metric … an empirical investigation is still necessary to justify this

Future work should examine how retrieving additional responses affects the correlation with word-overlap metrics

Despite the poor performance of the word embedding-based metrics in this survey, we believe that metrics based on distributed sentence representations hold the most promise for the future.

(Please note that the paper didn’t test Word Mover’s Distance as an embedding-based method.)


To complement @mike’s awesome replies, here are the references to the original papers:

  • Bleu: A Method for Automatic Evaluation of Machine Translation
  • Rouge: A Package for Automatic Evaluation of Summaries
  • Meteor Universal: Language Specific Translation Evaluation for Any Target Language

Further, BLEU can be used with different smoothing functions. For a comparison between these, see A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU.
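To show why smoothing matters at the sentence level, here is a minimal single-reference sketch. The `smooth=True` branch uses add-one smoothing for n > 1, which I believe is one of the techniques compared in that paper; treat the details as my assumption rather than a reference implementation. The function names and example sentences are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4, smooth=False):
    """Single-reference sentence BLEU with optional add-one smoothing."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
        matched = sum((cand & ref).values())  # clipped matches
        total = sum(cand.values())
        if smooth and n > 1:
            # Add-one smoothing: avoids a zero precision for higher-order n-grams.
            matched, total = matched + 1, total + 1
        if matched == 0 or total == 0:
            return 0.0  # one empty n-gram order zeroes the geometric mean
        log_precisions.append(math.log(matched / total))
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
plain = sentence_bleu(candidate, reference)
smoothed = sentence_bleu(candidate, reference, smooth=True)
print(plain, smoothed)
```

The candidate differs from the reference by a single word, yet it shares no 4-gram with it, so the unsmoothed score collapses to 0 while the smoothed one stays positive.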

Additionally, for those of you who want a deeper understanding of ROUGE, see Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics, written by the same authors as the ROUGE paper referenced above (this one is the conference paper and the one above is the workshop paper; both were published the same year).
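For a feel of the longest-common-subsequence idea behind ROUGE-L, here is a minimal sentence-level sketch. The function names are mine, `beta` is the usual F-measure weight, and the example sentences are, if I remember correctly, the ones used in that paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level LCS precision, recall and F-measure."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    f = (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
    return precision, recall, f


reference = "police killed the gunman".split()
good = "police kill the gunman".split()
bad = "the gunman kill police".split()
print(rouge_l(good, reference))
print(rouge_l(bad, reference))
```

Both candidates share the same unigrams with the reference, but the LCS rewards the one that keeps the word order: 3/4 for the first candidate versus 2/4 for the second.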

For more insights on BLEU, and a short description of other, less-used evaluation metrics, see this Stack Overflow answer. For a short comparison between BLEU and ROUGE, see this one.

Finally, there’s a lesser-known method, also created by the people who created ROUGE, called ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation, which, as its name implies, is a meta-metric for evaluating metrics. This method, however, makes what is in my opinion a strange assumption: that machine-translated sentences are worse than their reference translations:

One basic assumption of all automatic evaluation metrics for machine translation is that reference translations are good translations and the more a machine translation is similar to its reference translations the better. We adopt this assumption and add one more assumption that automatic translations are usually worst than their reference translations.

The authors acknowledge this as a caveat, since there are instances where it might not hold, and in the conclusions of the paper they suggest a way of checking for it:

One caveat of the ORANGE method is that what if machine translations are as good as reference translations? To rule out this scenario, we can sample instances where machine translations are ranked higher than human translations. We then check the portion of the cases where machine translations are as good as the human translations. If the portion is small then the ORANGE method can be confidently applied. We conjecture that this is the case for the currently available machine translation systems.
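To illustrate the ranking idea, here is a rough sketch based on my own reading: score the machine translations and the references with the metric under test, rank them together, and look at where the references land. The normalization (average reference rank divided by the list size), the tie handling, the function name and the scores below are all my assumptions; please check the paper for the exact definition.

```python
def orange_like_score(machine_scores, reference_scores):
    """Average rank of the references in the combined list, scaled to (0, 1].

    Lower is better: a good metric should rank references near the top.
    This is a simplification, not the paper's exact formula.
    """
    combined = [(s, False) for s in machine_scores]
    combined += [(s, True) for s in reference_scores]
    # Best score first; on ties the reference is placed last (pessimistic choice).
    combined.sort(key=lambda item: (-item[0], item[1]))
    ref_ranks = [rank for rank, (_, is_ref) in enumerate(combined, start=1) if is_ref]
    return sum(ref_ranks) / len(ref_ranks) / len(combined)


# Hypothetical metric scores for one source sentence.
nbest = [0.42, 0.35, 0.31, 0.20]   # machine translations
references = [0.55, 0.38]          # human references
score = orange_like_score(nbest, references)
print(score)
```

Here the references land at ranks 1 and 3 out of 6, giving (1 + 3) / 2 / 6 = 1/3. A metric that pushed both references to the bottom of the list would score close to 1.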


Great replies to NLP evaluation metrics @jabalazs and @mike !

Wanted to add a few more approaches for further thought.
Since collecting human evaluation scores can be a very tedious process, I have been looking for alternative methods that evaluate generated samples independently of their resemblance to the targets. In particular, if we want to evaluate qualitative features such as style or appropriateness, human evaluation seems unavoidable. So I thought an ML-style evaluation technique could be interesting to try, and I came across some papers that propose such a method. Basically, a separate model is trained to score generated samples, as in a regression task.

Here are my references:

While adversarial evaluation seems doable as long as you have samples representing bad generation, with ADEM you would need human labels on one dataset and could then benefit from reusing the trained evaluator on another dataset with a similar task.


There’s a real nice blog post published recently that examines the pitfalls of BLEU for NLG and related tasks.


For BLEU and many similar automatic evaluation metrics, an important prerequisite is recognizing which uses are appropriate. In machine translation, a common recommendation (often implicit and rarely known to the general public) is that BLEU is best suited to measuring performance differences within the same system, or at most between systems that share the same architecture (e.g., comparing only phrase-based MT systems).

The MT community has been working on quality estimation and meta-evaluation (which somewhat resembles meta-analysis in medical research). Other key aspects include which significance test and/or correlation coefficient to use, which human judgement method to employ, which data to curate and apply, etc.
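As a small example of the correlation part, here is a dependency-free sketch that computes the Spearman rank correlation between a metric's scores and human judgements. The scores are invented for illustration and the function names are mine; in practice one would probably use a statistics library and also report a significance test.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five system outputs.
human = [1, 2, 3, 4, 5]
metric = [0.10, 0.30, 0.20, 0.80, 0.90]
rho = spearman(human, metric)
print(rho)
```

With one swapped pair among five items, rho is 0.9. Which coefficient is appropriate (Pearson, Spearman, Kendall) depends on whether you care about the actual score values or only about the ranking.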

I guess the NLG community will catch up soon. This event may be a start:


In the related fields of IR, summarization, and QA, this may be a place to dig deeper (do check the box for “evaluation issues and methodology”):
Prof. Jimmy Lin also co-authored the chapter on evaluation in The Handbook of Computational Linguistics and Natural Language Processing.