-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
diffcore-delta: avoid ignoring final 'line' of file #1637
Conversation
2dcaa9c
to
f62b22a
Compare
/submit |
Submitted as pull.1637.git.1705006074626.gitgitgadget@gmail.com To fetch this version into
To fetch this version to local tag
|
On the Git mailing list, Taylor Blau wrote (reply to this): On Thu, Jan 11, 2024 at 08:47:54PM +0000, Elijah Newren via GitGitGadget wrote:
> ---
> diffcore-delta.c | 4 ++++
> t/t4001-diff-rename.sh | 19 +++++++++++++++++++
> 2 files changed, 23 insertions(+)
Nice find and fix. This patch LGTM.
Thanks,
Taylor |
User |
On the Git mailing list, Junio C Hamano wrote (reply to this): "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
> diff --git a/diffcore-delta.c b/diffcore-delta.c
> index c30b56e983b..7136c3dd203 100644
> --- a/diffcore-delta.c
> +++ b/diffcore-delta.c
> @@ -159,6 +159,10 @@ static struct spanhash_top *hash_chars(struct repository *r,
> n = 0;
> accum1 = accum2 = 0;
> }
> + if (n > 0) {
> + hashval = (accum1 + accum2 * 0x61) % HASHBASE;
> + hash = add_spanhash(hash, hashval, n);
> + }
OK, so we were ignoring the final short bit that is not terminated
with LF due to the "continue". Nicely found.
> diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
> index 85be1367de6..29299acbce7 100755
> --- a/t/t4001-diff-rename.sh
> +++ b/t/t4001-diff-rename.sh
> @@ -286,4 +286,23 @@ test_expect_success 'basename similarity vs best similarity' '
> test_cmp expected actual
> '
>
> +test_expect_success 'last line matters too' '
> + test_write_lines a 0 1 2 3 4 5 6 7 8 9 >nonewline &&
> + printf "git ignores final up to 63 characters if not newline terminated" >>nonewline &&
> + git add nonewline &&
> + git commit -m "original version of file with no final newline" &&
I found it misleading that the file whose name is nonewline has
bunch of LF including on its last line ;-).
> + # Change ONLY the first character of the whole file
> + test_write_lines b 0 1 2 3 4 5 6 7 8 9 >nonewline &&
Same here, but it is too much to bother rewriting it ...
{
test_write_lines ...
printf ...
} >incomplete
... like so ("incomplete" stands for "file ending with an incomplete line"),
so I'll let it pass.
> + printf "git ignores final up to 63 characters if not newline terminated" >>nonewline &&
> + git add nonewline &&
> + git mv nonewline still-no-newline &&
> + git commit -a -m "rename nonewline -> still-no-newline" &&
> + git diff-tree -r -M01 --name-status HEAD^ HEAD >actual &&
> + cat >expected <<-\EOF &&
> + R097 nonewline still-no-newline
I am not very happy with the hardcoded 97. You are already using
the non-standard 10% threshold. If the delta detection that
forgets about the last line is so broken as your proposed log
message noted, shouldn't you be able to construct a sample pair of
preimage and postimage for which the broken version gives so low
similarity to be judged not worth treating as a rename, while the
fixed version gives reasonable similarity to be made into a rename,
by the default threshold? That way, the test only needs to see if
we got a rename (with any similarity) or a delete and an add. |
This branch is now known as |
This patch series was integrated into seen via git@ab79343. |
There was a status update in the "New Topics" section about the branch Rename detection logic ignored the final line of a file if it is an incomplete line. Expecting a reroll. cf. <xmqqedenearc.fsf@gitster.g> source: <pull.1637.git.1705006074626.gitgitgadget@gmail.com> |
This patch series was integrated into seen via git@620f616. |
hash_chars() would hash lines to integers, and store them in a spanhash, but cut lines at 64 characters. Thus, whenever it reached 64 characters or a newline, it would create a new spanhash. The problem is, the final part of the file might not end 64 characters after the previous 'line' and might not end with a newline. This could, for example, cause an 85-byte file with 12 lines and only the first character in the file differing to appear merely 23% similar rather than the expected 97%. Ensure the last line is included, and add a testcase that would have caught this problem. Signed-off-by: Elijah Newren <newren@gmail.com>
f62b22a
to
e0223bb
Compare
On the Git mailing list, Elijah Newren wrote (reply to this): On Thu, Jan 11, 2024 at 3:00 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > diff --git a/diffcore-delta.c b/diffcore-delta.c
> > index c30b56e983b..7136c3dd203 100644
> > --- a/diffcore-delta.c
> > +++ b/diffcore-delta.c
> > @@ -159,6 +159,10 @@ static struct spanhash_top *hash_chars(struct repository *r,
> > n = 0;
> > accum1 = accum2 = 0;
> > }
> > + if (n > 0) {
> > + hashval = (accum1 + accum2 * 0x61) % HASHBASE;
> > + hash = add_spanhash(hash, hashval, n);
> > + }
>
> OK, so we were ignoring the final short bit that is not terminated
> with LF due to the "continue". Nicely found.
>
> > diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
> > index 85be1367de6..29299acbce7 100755
> > --- a/t/t4001-diff-rename.sh
> > +++ b/t/t4001-diff-rename.sh
> > @@ -286,4 +286,23 @@ test_expect_success 'basename similarity vs best similarity' '
> > test_cmp expected actual
> > '
> >
> > +test_expect_success 'last line matters too' '
> > + test_write_lines a 0 1 2 3 4 5 6 7 8 9 >nonewline &&
> > + printf "git ignores final up to 63 characters if not newline terminated" >>nonewline &&
> > + git add nonewline &&
> > + git commit -m "original version of file with no final newline" &&
>
> I found it misleading that the file whose name is nonewline has
> bunch of LF including on its last line ;-).
Yeah, sorry. It's been a while, but I think I started with a file
with only one line that had no LF, but then realized for inexact
rename detection to kick in I needed some other lines, at least one of
which didn't match...but I forgot to update the filename after adding
the other lines...
> > + # Change ONLY the first character of the whole file
> > + test_write_lines b 0 1 2 3 4 5 6 7 8 9 >nonewline &&
>
> Same here, but it is too much to bother rewriting it ...
>
> {
> test_write_lines ...
> printf ...
> } >incomplete
>
> ... like so ("incomplete" stands for "file ending with an incomplete line"),
> so I'll let it pass.
>
> > + printf "git ignores final up to 63 characters if not newline terminated" >>nonewline &&
>
>
> > + git add nonewline &&
> > + git mv nonewline still-no-newline &&
> > + git commit -a -m "rename nonewline -> still-no-newline" &&
> > + git diff-tree -r -M01 --name-status HEAD^ HEAD >actual &&
> > + cat >expected <<-\EOF &&
> > + R097 nonewline still-no-newline
>
> I am not very happy with the hardcoded 97. You are already using
> the non-standard 10% threshold. If the delta detection that
> forgets about the last line is so broken as your proposed log
> message noted, shouldn't you be able to construct a sample pair of
> preimage and postimage for which the broken version gives so low
> similarity to be judged not worth treating as a rename, while the
> fixed version gives reasonable similarity to be made into a rename,
> by the default threshold? That way, the test only needs to see if
> we got a rename (with any similarity) or a delete and an add.
Oops, the threshold is entirely unnecessary here; not sure why I
didn't remember to take it out (originally used the threshold while
testing without the fix to just how low of a similarity git thought
these nearly identical files had).
Since you don't like the threshold, and since we don't seem to have a
summary format that reports on the rename without the percentage, I
guess I need to munge the output with sed:
sed -e "s/^R[0-9]* /R /" actual >actual.munged &&
and then compare the expected output to that. I'll send in a patch
doing so and fix up the filenames and drop the rename threshold while
at it. |
User |
/submit |
Submitted as pull.1637.v2.git.1705119973690.gitgitgadget@gmail.com To fetch this version into
To fetch this version to local tag
|
On the Git mailing list, Junio C Hamano wrote (reply to this): Elijah Newren <newren@gmail.com> writes:
>> I am not very happy with the hardcoded 97. You are already using
>> the non-standard 10% threshold. If the delta detection that
>> forgets about the last line is so broken as your proposed log
>> message noted, shouldn't you be able to construct a sample pair of
>> preimage and postimage for which the broken version gives so low
>> similarity to be judged not worth treating as a rename, while the
>> fixed version gives reasonable similarity to be made into a rename,
>> by the default threshold? That way, the test only needs to see if
>> we got a rename (with any similarity) or a delete and an add.
>
> Oops, the threshold is entirely unnecessary here; not sure why I
> didn't remember to take it out (originally used the threshold while
> testing without the fix to just how low of a similarity git thought
> these nearly identical files had).
>
> Since you don't like the threshold, and since we don't seem to have a
> summary format that reports on the rename without the percentage, I
> guess I need to munge the output with sed:
>
> sed -e "s/^R[0-9]* /R /" actual >actual.munged &&
Heh, I was hoping that we should be able to use "diff --name-only".
$ git mv Makefile Breakfile
$ git diff --name-only -M HEAD
Breakfile
$ git reset --hard
$ git rm Makefile
$ >Breakfile && git add Breakfile
$ git diff --name-only -M HEAD
Breakfile
Makefile
$ git reset --hard |
This patch series was integrated into seen via git@a4af72f. |
There was a status update in the "Cooking" section about the branch Rename detection logic ignored the final line of a file if it is an incomplete line. Expecting a reroll. cf. <xmqqedenearc.fsf@gitster.g> source: <pull.1637.git.1705006074626.gitgitgadget@gmail.com> |
This patch series was integrated into seen via git@25b2467. |
This patch series was integrated into seen via git@0b3f533. |
This patch series was integrated into seen via git@dd8ffa8. |
On the Git mailing list, Elijah Newren wrote (reply to this): On Fri, Jan 12, 2024 at 10:22 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> >> I am not very happy with the hardcoded 97. You are already using
> >> the non-standard 10% threshold. If the delta detection that
> >> forgets about the last line is so broken as your proposed log
> >> message noted, shouldn't you be able to construct a sample pair of
> >> preimage and postimage for which the broken version gives so low
> >> similarity to be judged not worth treating as a rename, while the
> >> fixed version gives reasonable similarity to be made into a rename,
> >> by the default threshold? That way, the test only needs to see if
> >> we got a rename (with any similarity) or a delete and an add.
> >
> > Oops, the threshold is entirely unnecessary here; not sure why I
> > didn't remember to take it out (originally used the threshold while
> > testing without the fix to just how low of a similarity git thought
> > these nearly identical files had).
> >
> > Since you don't like the threshold, and since we don't seem to have a
> > summary format that reports on the rename without the percentage, I
> > guess I need to munge the output with sed:
> >
> > sed -e "s/^R[0-9]* /R /" actual >actual.munged &&
>
> Heh, I was hoping that we should be able to use "diff --name-only".
>
> $ git mv Makefile Breakfile
> $ git diff --name-only -M HEAD
> Breakfile
> $ git reset --hard
> $ git rm Makefile
> $ >Breakfile && git add Breakfile
> $ git diff --name-only -M HEAD
> Breakfile
> Makefile
> $ git reset --hard
I guess we could, since the only thing in this repository is a file
which is being renamed, and we can then deduce based on the setup that
the change we expected must have happened.
However, I didn't like the deduce bit; since --name-only only lists
one of the two filenames and doesn't provide any hint that the change
involved is a rename, it felt to me like using --name-only would make
the test not really be a rename test. |
On the Git mailing list, Junio C Hamano wrote (reply to this): Elijah Newren <newren@gmail.com> writes:
>> Heh, I was hoping that we should be able to use "diff --name-only".
>>
>> $ git mv Makefile Breakfile
>> $ git diff --name-only -M HEAD
>> Breakfile
>> $ git reset --hard
>> $ git rm Makefile
>> $ >Breakfile && git add Breakfile
>> $ git diff --name-only -M HEAD
>> Breakfile
>> Makefile
>> $ git reset --hard
>
> I guess we could, since the only thing in this repository is a file
> which is being renamed, and we can then deduce based on the setup that
> the change we expected must have happened.
>
> However, I didn't like the deduce bit; since --name-only only lists
> one of the two filenames and doesn't provide any hint that the change
> involved is a rename, it felt to me like using --name-only would make
> the test not really be a rename test.
Hmph, I am not quite seeing what you are saying. If the "mv" from
Makefile to Breakfile in the above example is between preimage and
postimage that are similar enough, then we will see "one" paths,
i.e. the file in the "after" side of the diff. But if the "mv" from
Makefile to Breakfile involves too large a content change (like,
say, going from 3800+ lines to an empty file, in the second example
above), then because such a change from Makefile to Breakfile is too
dissimilar, we do not judge it as "renamed" and show "two" paths. I
do not quite see where we need to "deduce".
|
On the Git mailing list, Elijah Newren wrote (reply to this): On Thu, Jan 18, 2024 at 7:06 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> >> Heh, I was hoping that we should be able to use "diff --name-only".
> >>
> >> $ git mv Makefile Breakfile
> >> $ git diff --name-only -M HEAD
> >> Breakfile
> >> $ git reset --hard
> >> $ git rm Makefile
> >> $ >Breakfile && git add Breakfile
> >> $ git diff --name-only -M HEAD
> >> Breakfile
> >> Makefile
> >> $ git reset --hard
> >
> > I guess we could, since the only thing in this repository is a file
> > which is being renamed, and we can then deduce based on the setup that
> > the change we expected must have happened.
> >
> > However, I didn't like the deduce bit; since --name-only only lists
> > one of the two filenames and doesn't provide any hint that the change
> > involved is a rename, it felt to me like using --name-only would make
> > the test not really be a rename test.
>
> Hmph, I am not quite seeing what you are saying. If the "mv" from
> Makefile to Breakfile in the above example is between preimage and
> postimage that are similar enough, then we will see "one" paths,
> i.e. the file in the "after" side of the diff. But if the "mv" from
> Makefile to Breakfile involves too large a content change (like,
> say, going from 3800+ lines to an empty file, in the second example
> above), then because such a change from Makefile to Breakfile is too
> dissimilar, we do not judge it as "renamed" and show "two" paths. I
> do not quite see where we need to "deduce".
You just wrote a very well worded paragraph going through the
reasoning involved to prove that a rename is involved. You reasoned,
or deduced, the necessary conclusion quite well. Sure, it might be a
simple deduction given the knowledge of the test setup, but it's still
a deduction.
But perhaps I can put it another way: You can't just look at the
output of `diff --name-only` and say a rename was involved -- unless
you know the test setup and the previous operations. In fact, how
about a possibly contrived alternate scenario: What if `git mv $1 $2`
had a bug where, after doing the expected work, it did the equivalent
of running `git checkout HEAD -- $1` at the end of its operation.
Then we'd see:
$ <tweak Makefile slightly>
$ git mv Makefile Breakfile
$ git diff --name-only -M HEAD
Breakfile
i.e. we get the same output as without the git-mv bug (note that
Makefile will not be listed since it is unmodified), but with the bug
in git-mv there definitely is no rename involved (since there's no
delete to pair with the addition of Breakfile). As such, we'd still
pass the test despite there being no rename. Sure, the example is
somewhat contrived, but I'm just saying that the --name-only output
doesn't actually test or prove that a rename occurred. And since this
test, and all other tests in this particular testfile, are
specifically about renames, the fact that we aren't specifically
testing for something being renamed feels odd to me.
If you still like `diff --name-only` better anyway, that's fine and
I'll switch it. I'm just stating why it seems suboptimal to me. |
On the Git mailing list, Junio C Hamano wrote (reply to this): Elijah Newren <newren@gmail.com> writes:
> But perhaps I can put it another way: You can't just look at the
> output of `diff --name-only` and say a rename was involved -- unless
> you know the test setup and the previous operations.
That is true. But that is exactly what a test is about. You have
this and that file, and you do this operation, now what should
happen? Does the observation match the expectation? That is what
our tests are done.
And your argument should not have to rely on a bug in "git mv".
After all, you should be able to do the same with "mv A B && git add
B && git add -u" (or "git rm -f A") and you won't be affected by
such a bug.
> If you still like `diff --name-only` better anyway, that's fine and
> I'll switch it. I'm just stating why it seems suboptimal to me.
I'd prefer to omit "sed" involved, but I'd even more prefer not
waste more time on the test. As long as we can make a robust test
(which an extra process running sed would certainly give us), I'd be
fine.
Thanks. |
This patch series was integrated into seen via git@4b4242f. |
There was a status update in the "Cooking" section about the branch Rename detection logic ignored the final line of a file if it is an incomplete line. Will merge to 'next'. source: <pull.1637.v2.git.1705119973690.gitgitgadget@gmail.com> |
This patch series was integrated into seen via git@587dcc6. |
This patch series was integrated into seen via git@bb72a21. |
This patch series was integrated into next via git@7141d20. |
This patch series was integrated into seen via git@737df8f. |
This patch series was integrated into seen via git@b94ec41. |
There was a status update in the "Cooking" section about the branch Rename detection logic ignored the final line of a file if it is an incomplete line. Will merge to 'master'. source: <pull.1637.v2.git.1705119973690.gitgitgadget@gmail.com> |
This patch series was integrated into seen via git@d97eb69. |
This patch series was integrated into seen via git@d2267b2. |
This patch series was integrated into seen via git@c26e8bb. |
There was a status update in the "Cooking" section about the branch Rename detection logic ignored the final line of a file if it is an incomplete line. Will merge to 'master'. source: <pull.1637.v2.git.1705119973690.gitgitgadget@gmail.com> |
This patch series was integrated into seen via git@c61e2f8. |
This patch series was integrated into seen via git@d3bf8d3. |
This patch series was integrated into master via git@d3bf8d3. |
This patch series was integrated into next via git@d3bf8d3. |
Closed via d3bf8d3. |
Found while experimenting with converting portions of diffcore-delta to Rust.
Changes since v1:
cc: Taylor Blau me@ttaylorr.com
cc: Elijah Newren newren@gmail.com