
question about the bleu and meteor #49

Open

yeshenpy opened this issue Jun 18, 2019 · 2 comments

@yeshenpy

    def compute_score(self, gts, res):
        assert(gts.keys() == res.keys())
        imgIds = gts.keys()

        bleu_scorer = BleuScorer(n=self._n)
        for id in imgIds:
            hypo = res[id]
            ref = gts[id]

            # Sanity check.
            assert(type(hypo) is list)
            assert(len(hypo) == 1)
            assert(type(ref) is list)
            assert(len(ref) >= 1)

            bleu_scorer += (hypo[0], ref)

        # score, scores = bleu_scorer.compute_score(option='shortest')
        score, scores = bleu_scorer.compute_score(option='closest', verbose=1)
        # score, scores = bleu_scorer.compute_score(option='average', verbose=1)

        # return (bleu, bleu_info)
        return score, scores
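
For reference, a minimal sketch of how this scorer is usually invoked (this assumes the standard pycocoevalcap package layout; the toy captions are hypothetical):

    from pycocoevalcap.bleu.bleu import Bleu

    # gts and res map each image id to a list of caption strings;
    # res must hold exactly one candidate per id (see the asserts above).
    gts = {1: ["a dog runs on the grass", "a dog is running outside"],
           2: ["a man rides a bike down the street"]}
    res = {1: ["a dog is running on grass"],
           2: ["a person rides a bicycle"]}

    score, scores = Bleu(4).compute_score(gts, res)
    print(score)   # four corpus-level values: BLEU-1 ... BLEU-4
    print(scores)  # four lists of per-image values; mean(scores[n]) is generally != score[n]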

I see that compute_score returns two values, score and scores, and I found that mean(scores) is not equal to score. I would like to know what these two return values mean and under what circumstances mean(scores) == score. The same problem occurs with METEOR, whose compute_score is shown below:

    def compute_score(self, gts, res):
        assert(gts.keys() == res.keys())
        imgIds = gts.keys()
        scores = []

        eval_line = 'EVAL'
        self.lock.acquire()
        for i in imgIds:
            assert(len(res[i]) == 1)
            stat = self._stat(res[i][0], gts[i])
            eval_line += ' ||| {}'.format(stat)

        self.meteor_p.stdin.write('{}\n'.format(eval_line).encode())
        self.meteor_p.stdin.flush()
        for i in range(0, len(imgIds)):
            scores.append(float(self.meteor_p.stdout.readline().strip()))
        score = float(self.meteor_p.stdout.readline().strip())
        self.lock.release()

        return score, scores
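
A minimal calling sketch for this wrapper as well (this assumes pycocoevalcap's Meteor class and a working Java runtime for the bundled METEOR jar; the captions are hypothetical):

    from pycocoevalcap.meteor.meteor import Meteor  # spawns the METEOR Java subprocess

    gts = {1: ["a dog runs on the grass"], 2: ["a man rides a bike down the street"]}
    res = {1: ["a dog is running on grass"], 2: ["a person rides a bicycle"]}

    score, scores = Meteor().compute_score(gts, res)
    print(scores)                     # one METEOR value per image
    print(sum(scores) / len(scores))  # plain average of the per-image values
    print(score)                      # final corpus value read back from the METEOR process;
                                      # it is computed from aggregated statistics, not this average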

and the following are my results:

{'testlen': 14006, 'reflen': 14927, 'guess': [14006, 12389, 10773, 9166], 'correct': [2367, 22, 1, 0]}
ratio: 0.9382997253298762
Bleu_1:  0.1582435446030457
Bleu_2:  0.016220982225013343
Bleu_3:  0.0028384843308123897
Bleu_4:  2.198519789887133e-07
METEOR:  0.04443493208767419
ROUGE_L: 0.16704389834453118
CIDEr:   0.028038780435183798
{'testlen': 14006, 'reflen': 14927, 'guess': [14006, 12389, 10773, 9166], 'correct': [2367, 22, 1, 0]}
ratio: 0.9382997253298762
       val_Bleu_1    val_Bleu_2    val_Bleu_3    val_Bleu_4  val_METEOR  val_ROUGE_L  val_CIDEr
    0  1.312883e-01  2.181574e-03  1.214780e-04  1.884038e-08    0.046652     0.167044   0.028039

We find that only the CIDEr and ROUGE_L values match between the two.
I hope to get your help, thanks

@XinhaoMei

same question!

@pdpino

pdpino commented Oct 19, 2021

(I had the same question not long ago, and as far as I understand:)

At least for BLEU, it is usual that mean(scores) != score over a corpus:

  • Consider the formula for the modified precision p_n in the original paper (subsection 2.1.1)
  • The outer sum over candidates, in both the numerator and the denominator, runs over all candidate sentences in the corpus
  • This implies that the average over individual sentences and the corpus-level calculation may differ; for example (a numeric sketch follows this list):
    • Sentence 1 has p_n = A/B and Sentence 2 has p_n = C/D
    • Corpus score with both sentences: (A+C) / (B+D)
    • Mean of the individual scores: (A/B + C/D) / 2, which is not necessarily equal
  • This can be spotted in the NLTK code for the corpus_bleu() function, or in this library in the BleuScorer class, by comparing the comps (individual scores) and totalcomps (corpus score) variables
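
For example, with hypothetical clipped n-gram counts (not taken from the library), the corpus-level modified precision and the mean of the per-sentence precisions already disagree:

    # Sentence 1: A matching n-grams out of B candidate n-grams
    # Sentence 2: C matching n-grams out of D candidate n-grams
    A, B = 3, 4    # p_n for sentence 1 alone: 0.75
    C, D = 1, 10   # p_n for sentence 2 alone: 0.10

    corpus_p = (A + C) / (B + D)       # corpus-level precision: 4 / 14, about 0.286
    mean_p = ((A / B) + (C / D)) / 2   # mean of per-sentence precisions: 0.425

    print(corpus_p, mean_p)
    # The corpus computation weights each sentence by its n-gram count,
    # so it is not the arithmetic mean of the per-sentence precisions.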

Hope this helps
