Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

diffcore-delta: avoid ignoring final 'line' of file #1637

Closed
wants to merge 1 commit into from

Conversation

newren
Copy link

@newren newren commented Jan 11, 2024

Found while experimenting with converting portions of diffcore-delta to Rust.

Changes since v1:

  • Removed the unnecessary similarity threshold specification
  • Munged the discovered similarity percentage so we are only checking that a rename is detected
  • Fixed up filenames

cc: Taylor Blau me@ttaylorr.com
cc: Elijah Newren newren@gmail.com

@newren newren force-pushed the fix-diffcore-final-line branch from 2dcaa9c to f62b22a Compare January 11, 2024 17:46
@newren
Copy link
Author

newren commented Jan 11, 2024

/submit

Copy link

gitgitgadget bot commented Jan 11, 2024

Submitted as pull.1637.git.1705006074626.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1637/newren/fix-diffcore-final-line-v1

To fetch this version to local tag pr-1637/newren/fix-diffcore-final-line-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1637/newren/fix-diffcore-final-line-v1

Copy link

gitgitgadget bot commented Jan 11, 2024

On the Git mailing list, Taylor Blau wrote (reply to this):

On Thu, Jan 11, 2024 at 08:47:54PM +0000, Elijah Newren via GitGitGadget wrote:
> ---
>  diffcore-delta.c       |  4 ++++
>  t/t4001-diff-rename.sh | 19 +++++++++++++++++++
>  2 files changed, 23 insertions(+)

Nice find and fix. This patch LGTM.

Thanks,
Taylor

Copy link

gitgitgadget bot commented Jan 11, 2024

User Taylor Blau <me@ttaylorr.com> has been added to the cc: list.

Copy link

gitgitgadget bot commented Jan 11, 2024

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/diffcore-delta.c b/diffcore-delta.c
> index c30b56e983b..7136c3dd203 100644
> --- a/diffcore-delta.c
> +++ b/diffcore-delta.c
> @@ -159,6 +159,10 @@ static struct spanhash_top *hash_chars(struct repository *r,
>  		n = 0;
>  		accum1 = accum2 = 0;
>  	}
> +	if (n > 0) {
> +		hashval = (accum1 + accum2 * 0x61) % HASHBASE;
> +		hash = add_spanhash(hash, hashval, n);
> +	}

OK, so we were ignoring the final short bit that is not terminated
with LF due to the "continue".  Nicely found.

> diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
> index 85be1367de6..29299acbce7 100755
> --- a/t/t4001-diff-rename.sh
> +++ b/t/t4001-diff-rename.sh
> @@ -286,4 +286,23 @@ test_expect_success 'basename similarity vs best similarity' '
>  	test_cmp expected actual
>  '
>  
> +test_expect_success 'last line matters too' '
> +	test_write_lines a 0 1 2 3 4 5 6 7 8 9 >nonewline &&
> +	printf "git ignores final up to 63 characters if not newline terminated" >>nonewline &&
> +	git add nonewline &&
> +	git commit -m "original version of file with no final newline" &&

I found it misleading that the file whose name is nonewline has
bunch of LF including on its last line ;-).

> +	# Change ONLY the first character of the whole file
> +	test_write_lines b 0 1 2 3 4 5 6 7 8 9 >nonewline &&

Same here, but it is too much to bother rewriting it ...

	{
		test_write_lines ...
		printf ...
	} >incomplete

... like so ("incomplete" stands for "file ending with an incomplete line"),
so I'll let it pass.

> +	printf "git ignores final up to 63 characters if not newline terminated" >>nonewline &&


> +	git add nonewline &&
> +	git mv nonewline still-no-newline &&
> +	git commit -a -m "rename nonewline -> still-no-newline" &&
> +	git diff-tree -r -M01 --name-status HEAD^ HEAD >actual &&
> +	cat >expected <<-\EOF &&
> +	R097	nonewline	still-no-newline

I am not very happy with the hardcoded 97.  You are already using
the non-standard 10% threshold.  If the delta detection that
forgets about the last line is so broken as your proposed log
message noted, shouldn't you be able to construct a sample pair of
preimage and postimage for which the broken version gives so low
similarity to be judged not worth treating as a rename, while the
fixed version gives reasonable similarity to be made into a rename,
by the default threshold?  That way, the test only needs to see if
we got a rename (with any similarity) or a delete and an add.

Copy link

gitgitgadget bot commented Jan 12, 2024

This branch is now known as en/diffcore-delta-final-line-fix.

Copy link

gitgitgadget bot commented Jan 12, 2024

This patch series was integrated into seen via git@ab79343.

@gitgitgadget gitgitgadget bot added the seen label Jan 12, 2024
Copy link

gitgitgadget bot commented Jan 12, 2024

There was a status update in the "New Topics" section about the branch en/diffcore-delta-final-line-fix on the Git mailing list:

Rename detection logic ignored the final line of a file if it is an
incomplete line.

Expecting a reroll.
cf. <xmqqedenearc.fsf@gitster.g>
source: <pull.1637.git.1705006074626.gitgitgadget@gmail.com>

Copy link

gitgitgadget bot commented Jan 13, 2024

This patch series was integrated into seen via git@620f616.

hash_chars() would hash lines to integers, and store them in a spanhash,
but cut lines at 64 characters.  Thus, whenever it reached 64 characters
or a newline, it would create a new spanhash.  The problem is, the final
part of the file might not end 64 characters after the previous 'line'
and might not end with a newline.  This could, for example, cause an
85-byte file with 12 lines and only the first character in the file
differing to appear merely 23% similar rather than the expected 97%.
Ensure the last line is included, and add a testcase that would have
caught this problem.

Signed-off-by: Elijah Newren <newren@gmail.com>
@newren newren force-pushed the fix-diffcore-final-line branch from f62b22a to e0223bb Compare January 13, 2024 01:43
Copy link

gitgitgadget bot commented Jan 13, 2024

On the Git mailing list, Elijah Newren wrote (reply to this):

On Thu, Jan 11, 2024 at 3:00 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > diff --git a/diffcore-delta.c b/diffcore-delta.c
> > index c30b56e983b..7136c3dd203 100644
> > --- a/diffcore-delta.c
> > +++ b/diffcore-delta.c
> > @@ -159,6 +159,10 @@ static struct spanhash_top *hash_chars(struct repository *r,
> >               n = 0;
> >               accum1 = accum2 = 0;
> >       }
> > +     if (n > 0) {
> > +             hashval = (accum1 + accum2 * 0x61) % HASHBASE;
> > +             hash = add_spanhash(hash, hashval, n);
> > +     }
>
> OK, so we were ignoring the final short bit that is not terminated
> with LF due to the "continue".  Nicely found.
>
> > diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
> > index 85be1367de6..29299acbce7 100755
> > --- a/t/t4001-diff-rename.sh
> > +++ b/t/t4001-diff-rename.sh
> > @@ -286,4 +286,23 @@ test_expect_success 'basename similarity vs best similarity' '
> >       test_cmp expected actual
> >  '
> >
> > +test_expect_success 'last line matters too' '
> > +     test_write_lines a 0 1 2 3 4 5 6 7 8 9 >nonewline &&
> > +     printf "git ignores final up to 63 characters if not newline terminated" >>nonewline &&
> > +     git add nonewline &&
> > +     git commit -m "original version of file with no final newline" &&
>
> I found it misleading that the file whose name is nonewline has
> bunch of LF including on its last line ;-).

Yeah, sorry.  It's been a while, but I think I started with a file
with only one line that had no LF, but then realized for inexact
rename detection to kick in I needed some other lines, at least one of
which didn't match...but I forgot to update the filename after adding
the other lines...

> > +     # Change ONLY the first character of the whole file
> > +     test_write_lines b 0 1 2 3 4 5 6 7 8 9 >nonewline &&
>
> Same here, but it is too much to bother rewriting it ...
>
>         {
>                 test_write_lines ...
>                 printf ...
>         } >incomplete
>
> ... like so ("incomplete" stands for "file ending with an incomplete line"),
> so I'll let it pass.
>
> > +     printf "git ignores final up to 63 characters if not newline terminated" >>nonewline &&
>
>
> > +     git add nonewline &&
> > +     git mv nonewline still-no-newline &&
> > +     git commit -a -m "rename nonewline -> still-no-newline" &&
> > +     git diff-tree -r -M01 --name-status HEAD^ HEAD >actual &&
> > +     cat >expected <<-\EOF &&
> > +     R097    nonewline       still-no-newline
>
> I am not very happy with the hardcoded 97.  You are already using
> the non-standard 10% threshold.  If the delta detection that
> forgets about the last line is so broken as your proposed log
> message noted, shouldn't you be able to construct a sample pair of
> preimage and postimage for which the broken version gives so low
> similarity to be judged not worth treating as a rename, while the
> fixed version gives reasonable similarity to be made into a rename,
> by the default threshold?  That way, the test only needs to see if
> we got a rename (with any similarity) or a delete and an add.

Oops, the threshold is entirely unnecessary here; not sure why I
didn't remember to take it out (originally used the threshold while
testing without the fix to just how low of a similarity git thought
these nearly identical files had).

Since you don't like the threshold, and since we don't seem to have a
summary format that reports on the rename without the percentage, I
guess I need to munge the output with sed:

      sed -e "s/^R[0-9]* /R /" actual >actual.munged &&

and then compare the expected output to that.  I'll send in a patch
doing so and fix up the filenames and drop the rename threshold while
at it.

Copy link

gitgitgadget bot commented Jan 13, 2024

User Elijah Newren <newren@gmail.com> has been added to the cc: list.

@newren
Copy link
Author

newren commented Jan 13, 2024

/submit

Copy link

gitgitgadget bot commented Jan 13, 2024

Submitted as pull.1637.v2.git.1705119973690.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1637/newren/fix-diffcore-final-line-v2

To fetch this version to local tag pr-1637/newren/fix-diffcore-final-line-v2:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1637/newren/fix-diffcore-final-line-v2

Copy link

gitgitgadget bot commented Jan 13, 2024

On the Git mailing list, Junio C Hamano wrote (reply to this):

Elijah Newren <newren@gmail.com> writes:

>> I am not very happy with the hardcoded 97.  You are already using
>> the non-standard 10% threshold.  If the delta detection that
>> forgets about the last line is so broken as your proposed log
>> message noted, shouldn't you be able to construct a sample pair of
>> preimage and postimage for which the broken version gives so low
>> similarity to be judged not worth treating as a rename, while the
>> fixed version gives reasonable similarity to be made into a rename,
>> by the default threshold?  That way, the test only needs to see if
>> we got a rename (with any similarity) or a delete and an add.
>
> Oops, the threshold is entirely unnecessary here; not sure why I
> didn't remember to take it out (originally used the threshold while
> testing without the fix to just how low of a similarity git thought
> these nearly identical files had).
>
> Since you don't like the threshold, and since we don't seem to have a
> summary format that reports on the rename without the percentage, I
> guess I need to munge the output with sed:
>
>       sed -e "s/^R[0-9]* /R /" actual >actual.munged &&

Heh, I was hoping that we should be able to use "diff --name-only".

 $ git mv Makefile Breakfile
 $ git diff --name-only -M HEAD
 Breakfile
 $ git reset --hard
 $ git rm Makefile
 $ >Breakfile && git add Breakfile
 $ git diff --name-only -M HEAD
 Breakfile
 Makefile
 $ git reset --hard

Copy link

gitgitgadget bot commented Jan 16, 2024

This patch series was integrated into seen via git@a4af72f.

Copy link

gitgitgadget bot commented Jan 16, 2024

There was a status update in the "Cooking" section about the branch en/diffcore-delta-final-line-fix on the Git mailing list:

Rename detection logic ignored the final line of a file if it is an
incomplete line.

Expecting a reroll.
cf. <xmqqedenearc.fsf@gitster.g>
source: <pull.1637.git.1705006074626.gitgitgadget@gmail.com>

Copy link

gitgitgadget bot commented Jan 17, 2024

This patch series was integrated into seen via git@25b2467.

Copy link

gitgitgadget bot commented Jan 18, 2024

This patch series was integrated into seen via git@0b3f533.

Copy link

gitgitgadget bot commented Jan 18, 2024

This patch series was integrated into seen via git@dd8ffa8.

Copy link

gitgitgadget bot commented Jan 19, 2024

On the Git mailing list, Elijah Newren wrote (reply to this):

On Fri, Jan 12, 2024 at 10:22 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> >> I am not very happy with the hardcoded 97.  You are already using
> >> the non-standard 10% threshold.  If the delta detection that
> >> forgets about the last line is so broken as your proposed log
> >> message noted, shouldn't you be able to construct a sample pair of
> >> preimage and postimage for which the broken version gives so low
> >> similarity to be judged not worth treating as a rename, while the
> >> fixed version gives reasonable similarity to be made into a rename,
> >> by the default threshold?  That way, the test only needs to see if
> >> we got a rename (with any similarity) or a delete and an add.
> >
> > Oops, the threshold is entirely unnecessary here; not sure why I
> > didn't remember to take it out (originally used the threshold while
> > testing without the fix to just how low of a similarity git thought
> > these nearly identical files had).
> >
> > Since you don't like the threshold, and since we don't seem to have a
> > summary format that reports on the rename without the percentage, I
> > guess I need to munge the output with sed:
> >
> >       sed -e "s/^R[0-9]* /R /" actual >actual.munged &&
>
> Heh, I was hoping that we should be able to use "diff --name-only".
>
>  $ git mv Makefile Breakfile
>  $ git diff --name-only -M HEAD
>  Breakfile
>  $ git reset --hard
>  $ git rm Makefile
>  $ >Breakfile && git add Breakfile
>  $ git diff --name-only -M HEAD
>  Breakfile
>  Makefile
>  $ git reset --hard

I guess we could, since the only thing in this repository is a file
which is being renamed, and we can then deduce based on the setup that
the change we expected must have happened.

However, I didn't like the deduce bit; since --name-only only lists
one of the two filenames and doesn't provide any hint that the change
involved is a rename, it felt to me like using --name-only would make
the test not really be a rename test.

Copy link

gitgitgadget bot commented Jan 19, 2024

On the Git mailing list, Junio C Hamano wrote (reply to this):

Elijah Newren <newren@gmail.com> writes:

>> Heh, I was hoping that we should be able to use "diff --name-only".
>>
>>  $ git mv Makefile Breakfile
>>  $ git diff --name-only -M HEAD
>>  Breakfile
>>  $ git reset --hard
>>  $ git rm Makefile
>>  $ >Breakfile && git add Breakfile
>>  $ git diff --name-only -M HEAD
>>  Breakfile
>>  Makefile
>>  $ git reset --hard
>
> I guess we could, since the only thing in this repository is a file
> which is being renamed, and we can then deduce based on the setup that
> the change we expected must have happened.
>
> However, I didn't like the deduce bit; since --name-only only lists
> one of the two filenames and doesn't provide any hint that the change
> involved is a rename, it felt to me like using --name-only would make
> the test not really be a rename test.

Hmph, I am not quite seeing what you are saying.  If the "mv" from
Makefile to Breakfile in the above example is between preimage and
postimage that are similar enough, then we will see "one" paths,
i.e. the file in the "after" side of the diff.  But if the "mv" from
Makefile to Breakfile involves too large a content change (like,
say, going from 3800+ lines to an empty file, in the second example
above), then because such a change from Makefile to Breakfile is too
dissimilar, we do not judge it as "renamed" and show "two" paths.  I
do not quite see where we need to "deduce".

Copy link

gitgitgadget bot commented Jan 19, 2024

On the Git mailing list, Elijah Newren wrote (reply to this):

On Thu, Jan 18, 2024 at 7:06 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> >> Heh, I was hoping that we should be able to use "diff --name-only".
> >>
> >>  $ git mv Makefile Breakfile
> >>  $ git diff --name-only -M HEAD
> >>  Breakfile
> >>  $ git reset --hard
> >>  $ git rm Makefile
> >>  $ >Breakfile && git add Breakfile
> >>  $ git diff --name-only -M HEAD
> >>  Breakfile
> >>  Makefile
> >>  $ git reset --hard
> >
> > I guess we could, since the only thing in this repository is a file
> > which is being renamed, and we can then deduce based on the setup that
> > the change we expected must have happened.
> >
> > However, I didn't like the deduce bit; since --name-only only lists
> > one of the two filenames and doesn't provide any hint that the change
> > involved is a rename, it felt to me like using --name-only would make
> > the test not really be a rename test.
>
> Hmph, I am not quite seeing what you are saying.  If the "mv" from
> Makefile to Breakfile in the above example is between preimage and
> postimage that are similar enough, then we will see "one" paths,
> i.e. the file in the "after" side of the diff.  But if the "mv" from
> Makefile to Breakfile involves too large a content change (like,
> say, going from 3800+ lines to an empty file, in the second example
> above), then because such a change from Makefile to Breakfile is too
> dissimilar, we do not judge it as "renamed" and show "two" paths.  I
> do not quite see where we need to "deduce".

You just wrote a very well worded paragraph going through the
reasoning involved to prove that a rename is involved.  You reasoned,
or deduced, the necessary conclusion quite well.  Sure, it might be a
simple deduction given the knowledge of the test setup, but it's still
a deduction.

But perhaps I can put it another way:  You can't just look at the
output of `diff --name-only` and say a rename was involved -- unless
you know the test setup and the previous operations.  In fact, how
about a possibly contrived alternate scenario: What if `git mv $1 $2`
had a bug where, after doing the expected work, it did the equivalent
of running `git checkout HEAD -- $1` at the end of its operation.
Then we'd see:

   $ <tweak Makefile slightly>
   $ git mv Makefile Breakfile
   $ git diff --name-only -M HEAD
   Breakfile

i.e. we get the same output as without the git-mv bug (note that
Makefile will not be listed since it is unmodified), but with the bug
in git-mv there definitely is no rename involved (since there's no
delete to pair with the addition of Breakfile).  As such, we'd still
pass the test despite there being no rename.  Sure, the example is
somewhat contrived, but I'm just saying that the --name-only output
doesn't actually test or prove that a rename occurred.  And since this
test, and all other tests in this particular testfile, are
specifically about renames, the fact that we aren't specifically
testing for something being renamed feels odd to me.

If you still like `diff --name-only` better anyway, that's fine and
I'll switch it.  I'm just stating why it seems suboptimal to me.

Copy link

gitgitgadget bot commented Jan 19, 2024

On the Git mailing list, Junio C Hamano wrote (reply to this):

Elijah Newren <newren@gmail.com> writes:

> But perhaps I can put it another way:  You can't just look at the
> output of `diff --name-only` and say a rename was involved -- unless
> you know the test setup and the previous operations.

That is true.  But that is exactly what a test is about.  You have
this and that file, and you do this operation, now what should
happen?  Does the observation match the expectation?  That is what
our tests are done.

And your argument should not have to rely on a bug in "git mv".
After all, you should be able to do the same with "mv A B && git add
B && git add -u" (or "git rm -f A") and you won't be affected by
such a bug.

> If you still like `diff --name-only` better anyway, that's fine and
> I'll switch it.  I'm just stating why it seems suboptimal to me.

I'd prefer to omit "sed" involved, but I'd even more prefer not
waste more time on the test.  As long as we can make a robust test
(which an extra process running sed would certainly give us), I'd be
fine.

Thanks.

Copy link

gitgitgadget bot commented Jan 20, 2024

This patch series was integrated into seen via git@4b4242f.

Copy link

gitgitgadget bot commented Jan 20, 2024

There was a status update in the "Cooking" section about the branch en/diffcore-delta-final-line-fix on the Git mailing list:

Rename detection logic ignored the final line of a file if it is an
incomplete line.

Will merge to 'next'.
source: <pull.1637.v2.git.1705119973690.gitgitgadget@gmail.com>

Copy link

gitgitgadget bot commented Jan 20, 2024

This patch series was integrated into seen via git@587dcc6.

Copy link

gitgitgadget bot commented Jan 22, 2024

This patch series was integrated into seen via git@bb72a21.

Copy link

gitgitgadget bot commented Jan 22, 2024

This patch series was integrated into next via git@7141d20.

@gitgitgadget gitgitgadget bot added the next label Jan 22, 2024
Copy link

gitgitgadget bot commented Jan 23, 2024

This patch series was integrated into seen via git@737df8f.

Copy link

gitgitgadget bot commented Jan 23, 2024

This patch series was integrated into seen via git@b94ec41.

Copy link

gitgitgadget bot commented Jan 24, 2024

There was a status update in the "Cooking" section about the branch en/diffcore-delta-final-line-fix on the Git mailing list:

Rename detection logic ignored the final line of a file if it is an
incomplete line.

Will merge to 'master'.
source: <pull.1637.v2.git.1705119973690.gitgitgadget@gmail.com>

Copy link

gitgitgadget bot commented Jan 24, 2024

This patch series was integrated into seen via git@d97eb69.

Copy link

gitgitgadget bot commented Jan 26, 2024

This patch series was integrated into seen via git@d2267b2.

Copy link

gitgitgadget bot commented Jan 26, 2024

This patch series was integrated into seen via git@c26e8bb.

Copy link

gitgitgadget bot commented Jan 27, 2024

There was a status update in the "Cooking" section about the branch en/diffcore-delta-final-line-fix on the Git mailing list:

Rename detection logic ignored the final line of a file if it is an
incomplete line.

Will merge to 'master'.
source: <pull.1637.v2.git.1705119973690.gitgitgadget@gmail.com>

Copy link

gitgitgadget bot commented Jan 29, 2024

This patch series was integrated into seen via git@c61e2f8.

Copy link

gitgitgadget bot commented Jan 30, 2024

This patch series was integrated into seen via git@d3bf8d3.

Copy link

gitgitgadget bot commented Jan 30, 2024

This patch series was integrated into master via git@d3bf8d3.

Copy link

gitgitgadget bot commented Jan 30, 2024

This patch series was integrated into next via git@d3bf8d3.

@gitgitgadget gitgitgadget bot added the master label Jan 30, 2024
@gitgitgadget gitgitgadget bot closed this Jan 30, 2024
Copy link

gitgitgadget bot commented Jan 30, 2024

Closed via d3bf8d3.

@newren newren deleted the fix-diffcore-final-line branch June 13, 2024 02:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant