Besides the NLG metrics briefly discussed there, here I would like to begin with an under-appreciated paper (despite its 2.6k citations): Evaluation: from Precision, Recall and F-measure to ROC, Informedness, Markedness and Correlation (Powers, 2011). My takeaway from this paper is that even on a seemingly straightforward binary classification task, the widely used F-measure is biased.
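The bias Powers points to is concrete: F-measure never looks at true negatives, so a classifier can score well simply by over-predicting the positive class. Below is a minimal sketch (plain Python, no dependencies; the confusion-matrix counts are invented purely for illustration) contrasting F1 with Powers' informedness (a.k.a. Youden's J) and markedness on such a case.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compare F1 with informedness and markedness for one binary confusion matrix.

    Counts are illustrative; informedness = recall + specificity - 1 and
    markedness = precision + NPV - 1 follow Powers (2011).
    """
    precision = tp / (tp + fp)        # positive predictive value
    recall = tp / (tp + fn)           # true positive rate (sensitivity)
    specificity = tn / (tn + fp)      # true negative rate -- F1 never uses this
    npv = tn / (tn + fn)              # negative predictive value

    f1 = 2 * precision * recall / (precision + recall)
    informedness = recall + specificity - 1   # Youden's J; chance level is 0
    markedness = precision + npv - 1          # the "dual" of informedness
    return f1, informedness, markedness


# Hypothetical classifier that labels most examples positive on an imbalanced set:
# F1 still looks respectable (~0.67), while informedness (~0.1) exposes
# near-chance discrimination because it accounts for true negatives.
print(binary_metrics(tp=90, fp=80, fn=10, tn=20))
```

This is only a toy example, but it shows why a single prevalence-sensitive score can flatter a degenerate classifier.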
Unfortunately, according to The price of debiasing automatic metrics in natural language evaluation (Chaganty et al., 2018), “there is no unbiased estimator with lower cost.” That said, the paper did not test the unbiased metrics from Powers (2011), and personally I suspect the corpora studied in Chaganty et al. (2018) are somewhat misleading, because I don’t think the CNN/Daily Mail dataset and MS MARCO v1.0 are sufficient benchmarks for summarization and question answering, respectively.
So far this may sound disappointing, but acknowledging the weakness is important and sometimes insightful, in my opinion. Several papers (which I won’t list here, for readability) have also noted that despite all the ineffectiveness of automatic evaluation metrics, some of them are still useful for helping researchers spot wrong or bad outputs systematically. At the very least, I would say: don’t blindly trust the state of the art by its numbers alone.