
question about the bleu and meteor #49

Open

yeshenpy opened this issue Jun 18, 2019 · 2 comments

@yeshenpy

    def compute_score(self, gts, res):
        assert(gts.keys() == res.keys())
        imgIds = gts.keys()

        bleu_scorer = BleuScorer(n=self._n)
        for id in imgIds:
            hypo = res[id]
            ref = gts[id]

            # Sanity check.
            assert(type(hypo) is list)
            assert(len(hypo) == 1)
            assert(type(ref) is list)
            assert(len(ref) >= 1)

            bleu_scorer += (hypo[0], ref)

        # score, scores = bleu_scorer.compute_score(option='shortest')
        score, scores = bleu_scorer.compute_score(option='closest', verbose=1)
        # score, scores = bleu_scorer.compute_score(option='average', verbose=1)

        # return (bleu, bleu_info)
        return score, scores
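
For reference, a minimal sketch of how this scorer is usually invoked (this assumes the standard pycocoevalcap package layout; the toy captions are hypothetical):

    from pycocoevalcap.bleu.bleu import Bleu

    # gts and res map each image id to a list of caption strings;
    # res must hold exactly one candidate per id (see the asserts above).
    gts = {1: ["a dog runs on the grass", "a dog is running outside"],
           2: ["a man rides a bike down the street"]}
    res = {1: ["a dog is running on grass"],
           2: ["a person rides a bicycle"]}

    score, scores = Bleu(4).compute_score(gts, res)
    print(score)   # four corpus-level values: BLEU-1 ... BLEU-4
    print(scores)  # four lists of per-image values; mean(scores[n]) is generally != score[n]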

I see that compute_score returns two values, score and scores, and I found that mean(scores) is not equal to score. I would like to know what these two return values mean and under what circumstances mean(scores) == score. The same problem occurs with METEOR, whose compute_score is shown below:

    def compute_score(self, gts, res):
        assert(gts.keys() == res.keys())
        imgIds = gts.keys()
        scores = []

        eval_line = 'EVAL'
        self.lock.acquire()
        for i in imgIds:
            assert(len(res[i]) == 1)
            stat = self._stat(res[i][0], gts[i])
            eval_line += ' ||| {}'.format(stat)

        self.meteor_p.stdin.write('{}\n'.format(eval_line).encode())
        self.meteor_p.stdin.flush()
        for i in range(0, len(imgIds)):
            scores.append(float(self.meteor_p.stdout.readline().strip()))
        score = float(self.meteor_p.stdout.readline().strip())
        self.lock.release()

        return score, scores
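
A minimal calling sketch for this wrapper as well (this assumes pycocoevalcap's Meteor class and a working Java runtime for the bundled METEOR jar; the captions are hypothetical):

    from pycocoevalcap.meteor.meteor import Meteor  # spawns the METEOR Java subprocess

    gts = {1: ["a dog runs on the grass"], 2: ["a man rides a bike down the street"]}
    res = {1: ["a dog is running on grass"], 2: ["a person rides a bicycle"]}

    score, scores = Meteor().compute_score(gts, res)
    print(scores)                     # one METEOR value per image
    print(sum(scores) / len(scores))  # plain average of the per-image values
    print(score)                      # final corpus value read back from the METEOR process;
                                      # it is computed from aggregated statistics, not this average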

and the following are my results:

{'testlen': 14006, 'reflen': 14927, 'guess': [14006, 12389, 10773, 9166], 'correct': [2367, 22, 1, 0]}
ratio: 0.9382997253298762
Bleu_1:  0.1582435446030457
Bleu_2:  0.016220982225013343
Bleu_3:  0.0028384843308123897
Bleu_4:  2.198519789887133e-07
METEOR:  0.04443493208767419
ROUGE_L: 0.16704389834453118
CIDEr:   0.028038780435183798
{'testlen': 14006, 'reflen': 14927, 'guess': [14006, 12389, 10773, 9166], 'correct': [2367, 22, 1, 0]}
ratio: 0.9382997253298762
       val_Bleu_1    val_Bleu_2    val_Bleu_3    val_Bleu_4  val_METEOR  val_ROUGE_L  val_CIDEr
    0  1.312883e-01  2.181574e-03  1.214780e-04  1.884038e-08    0.046652     0.167044   0.028039

We find that only the CIDEr and ROUGE_L values match between the two.
I hope to get your help, thanks

@XinhaoMei

same question!

@pdpino

pdpino commented Oct 19, 2021

(I had the same question not long ago, and as far as I understand:)

At least for BLEU, it is usual that mean(scores) != score over a corpus:

  • Consider the formula for the modified precision p_n in the original paper (subsection 2.1.1)
  • The outer sum over candidates, in both the numerator and the denominator, runs over all candidate sentences in the corpus
  • This implies that the average over individual sentences and the corpus-level calculation may differ; for example (a numeric sketch follows this list):
    • Sentence 1 has p_n = A/B and Sentence 2 has p_n = C/D
    • Corpus score with both sentences: (A+C) / (B+D)
    • Mean of the individual scores: (A/B + C/D) / 2, which is not necessarily equal
  • This can be spotted in the NLTK code for the corpus_bleu() function, or in this library in the BleuScorer class, by comparing the comps (individual scores) and totalcomps (corpus score) variables
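
For example, with hypothetical clipped n-gram counts (not taken from the library), the corpus-level modified precision and the mean of the per-sentence precisions already disagree:

    # Sentence 1: A matching n-grams out of B candidate n-grams
    # Sentence 2: C matching n-grams out of D candidate n-grams
    A, B = 3, 4    # p_n for sentence 1 alone: 0.75
    C, D = 1, 10   # p_n for sentence 2 alone: 0.10

    corpus_p = (A + C) / (B + D)       # corpus-level precision: 4 / 14, about 0.286
    mean_p = ((A / B) + (C / D)) / 2   # mean of per-sentence precisions: 0.425

    print(corpus_p, mean_p)
    # The corpus computation weights each sentence by its n-gram count,
    # so it is not the arithmetic mean of the per-sentence precisions.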

Hope this helps
