Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

for-each-ref optimizations & usability improvements #1609

Closed
wants to merge 10 commits into from
Closed

for-each-ref optimizations & usability improvements #1609

wants to merge 10 commits into from

Conversation

vdye
Copy link

@vdye vdye commented Nov 6, 2023

This series is a bit of an informal follow-up to [1], adding some more substantial optimizations and usability fixes around ref filtering/formatting. Some of the changes here affect user-facing behavior, some are internal-only, but they're all interdependent enough to warrant putting them together in one series.

[1] https://lore.kernel.org/git/pull.1594.v2.git.1696888736.gitgitgadget@gmail.com/

Patch 1 changes the behavior of the '--no-sort' option in 'for-each-ref', 'tag', and 'branch'. Currently, it just removes previous sort keys and, if no further keys are specified, falls back on ascending refname sort (which, IMO, makes the name '--no-sort' somewhat misleading). Now, '--no-sort' completely disables sorting (unless subsequent '--sort' options are provided).

Patches 2-7 incrementally refactor various parts of the ref filtering/formatting workflows in order to create a 'filter_and_format_refs()' function. If certain conditions are met (sorting disabled, no reachability filtering or ahead-behind formatting), ref filtering & formatting is done within a single 'for_each_fullref_in' callback. Especially in large repositories, this makes a huge difference in memory usage & runtime for certain usages of 'for-each-ref', since it's no longer writing everything to a 'struct ref_array' then repeatedly whittling down/updating its contents.

Patch 8 updates the 'for-each-ref' documentation, making the '--format' description a bit less jumbled and more clearly explaining the '*' prefix (to be updated in the next patch)

Patch 9 changes the dereferencing done by the '*' format prefix from a single dereference to a recursive peel. See [1] + replies for the discussion that led to this approach (as opposed to a command line option or new format specifier).

[1] https://lore.kernel.org/git/ZUoWWo7IEKsiSx-C@tanuki/

Finally, patch 10 adds performance tests for 'for-each-ref', showing the effects of optimizations made throughout the series. Here are some sample results from my Ubuntu VM (test names shortened for space):

Test                                                         HEAD
----------------------------------------------------------------------------
6300.2: (loose)                                              4.68(0.98+3.64)
6300.3: (loose, no sort)                                     4.65(0.91+3.67)
6300.4: (loose, --count=1)                                   4.50(0.84+3.60)
6300.5: (loose, --count=1, no sort)                          4.24(0.46+3.71)
6300.6: (loose, tags)                                        2.41(0.45+1.93)
6300.7: (loose, tags, no sort)                               2.33(0.48+1.83)
6300.8: (loose, tags, dereferenced)                          3.65(1.66+1.95)
6300.9: (loose, tags, dereferenced, no sort)                 3.48(1.59+1.87)
6300.10: for-each-ref + cat-file (loose, tags)               4.48(2.27+2.22)
6300.12: (packed)                                            0.90(0.68+0.18)
6300.13: (packed, no sort)                                   0.71(0.55+0.06)
6300.14: (packed, --count=1)                                 0.77(0.52+0.16)
6300.15: (packed, --count=1, no sort)                        0.03(0.01+0.02)
6300.16: (packed, tags)                                      0.45(0.33+0.10)
6300.17: (packed, tags, no sort)                             0.39(0.33+0.03)
6300.18: (packed, tags, dereferenced)                        1.83(1.67+0.10)
6300.19: (packed, tags, dereferenced, no sort)               1.42(1.28+0.08)
6300.20: for-each-ref + cat-file (packed, tags)              2.36(2.11+0.29)
  • Victoria

Changes since V1

  • Restructured commit message of patch 1 for better readability
  • Re-added 'ref_sorting_release(sorting)' to 'ls-remote'
  • Dropped patch 2 so we don't commit to behavior we don't want in 'for-each-ref --omit-empty --count'
  • Split patch 6 into one that renames 'ref_filter_handler()' to 'filter_one()' and another that creates helper functions from existing code
  • Added/updated code comments in patch 7, changed ref iteration "break" return value from -1 to 1
  • Added a patch to reword 'for-each-ref' documentation in anticipation of updating the description of what '*' does in the format
  • Removed command-line option '--full-deref' for peeling tags in '*' format fields in favor of simply cutting over from the current single dereference to recursive dereference in all cases. Updated tests to match new behavior.
  • Added the '--count=1' tests back to p6300 (I must have unintentionally removed them before submitting V1)

cc: Patrick Steinhardt ps@pks.im
cc: Øystein Walle oystwa@gmail.com
cc: "Kristoffer Haugsbakk" code@khaugsbakk.name

@vdye vdye self-assigned this Nov 6, 2023
Copy link

gitgitgadget bot commented Nov 6, 2023

There are issues in commit a55a854:
ref-filter.c: filter & format refs in one iteration if possible
Lines in the body of the commit messages should be wrapped between 60 and 76 characters.
Indented lines, and lines without whitespace, are exempt

Copy link

gitgitgadget bot commented Nov 6, 2023

There are issues in commit 4f69311:
ref-filter.c: filter & format refs in the same callback
Lines in the body of the commit messages should be wrapped between 60 and 76 characters.
Indented lines, and lines without whitespace, are exempt

@vdye vdye changed the title More for-each-ref optimizations + usability improvements for-each-ref optimizations + usability improvements Nov 7, 2023
@vdye vdye changed the title for-each-ref optimizations + usability improvements for-each-ref optimizations & usability improvements Nov 7, 2023
@vdye vdye changed the title for-each-ref optimizations & usability improvements for-each-ref optimizations, usability improvements, and new option --full-deref Nov 7, 2023
@vdye vdye changed the title for-each-ref optimizations, usability improvements, and new option --full-deref for-each-ref optimizations & usability improvements Nov 7, 2023
@vdye
Copy link
Author

vdye commented Nov 7, 2023

/submit

Copy link

gitgitgadget bot commented Nov 7, 2023

Submitted as pull.1609.git.1699320361.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1609/vdye/vdye/for-each-ref-optimizations-v1

To fetch this version to local tag pr-1609/vdye/vdye/for-each-ref-optimizations-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1609/vdye/vdye/for-each-ref-optimizations-v1

Copy link

gitgitgadget bot commented Nov 7, 2023

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Victoria Dye via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This series is a bit of an informal follow-up to [1], adding some more
> substantial optimizations and usability fixes around ref
> filtering/formatting. Some of the changes here affect user-facing behavior,
> some are internal-only, but they're all interdependent enough to warrant
> putting them together in one series.
>
> [1]
> https://lore.kernel.org/git/pull.1594.v2.git.1696888736.gitgitgadget@gmail.com/
>
> Patch 1 changes the behavior of the '--no-sort' option in 'for-each-ref',
> 'tag', and 'branch'. Currently, it just removes previous sort keys and, if
> no further keys are specified, falls back on ascending refname sort (which,
> IMO, makes the name '--no-sort' somewhat misleading).

We can read it changes the behaviour and what the current behaviour
is, but I presume that the untold new behaviour with --no-sort is to
show the output in an unspecified order of implementation's
convenience?  I think it makes quite a lot of sense if that is what
is done.

> Patch 2 updates the 'for-each-ref' docs to clearly state what happens if you
> use '--omit-empty' and '--count' together. I based the explanation on what
> the current behavior is (i.e., refs omitted with '--omit-empty' do count
> towards the total limited by '--count').

OK.

> Patches 3-7 incrementally refactor various parts of the ref
> filtering/formatting workflows in order to create a
> 'filter_and_format_refs()' function. If certain conditions are met (sorting
> disabled, no reachability filtering or ahead-behind formatting), ref
> filtering & formatting is done within a single 'for_each_fullref_in'
> callback. Especially in large repositories, this makes a huge difference in
> memory usage & runtime for certain usages of 'for-each-ref', since it's no
> longer writing everything to a 'struct ref_array' then repeatedly whittling
> down/updating its contents.

OK.  I was wondering if you are going threaded implementation, until
I read into 6th line ;-)

> Patch 8 introduces a new option to 'for-each-ref' called '--full-deref'.
> When provided, any format fields for the dereferenced value of a tag (e.g.
> "%(*objectname)") will be populated with the fully peeled target of the tag;
> right now, those fields are populated with the immediate target of a tag
> (which can be another tag). This avoids the need to pipe 'for-each-ref'
> results to 'cat-file --batch-check' to get fully-peeled tag information. It
> also benefits from the 'filter_and_format_refs()' single-iteration
> optimization, since 'peel_iterated_oid()' may be able to read the
> pre-computed peeled OID from a packed ref. A couple notes on this one:
>
>  * I went with a command line option for '--full-deref' rather than another
>    format specifier (like ** instead of *) because it seems unlikely that a
>    user is going to want to perform a shallow dereference and a full
>    dereference in the same 'for-each-ref'. There's also a NEEDSWORK going
>    all the way back to the introduction of 'for-each-ref' in 9f613ddd21c
>    (Add git-for-each-ref: helper for language bindings, 2006-09-15) that (to
>    me) implies different dereferencing behavior corresponds to different use
>    cases/user needs.

Makes quite a lot of sense.

>  * I'm not attached to '--full-deref' as a name - if someone has an idea for
>    a more descriptive name, please suggest it!

Another candidate verb may be "to peel", and I have no strong
opinion between it and "to dereference".  But I have a mild aversion
to an abbreviation that is not strongly established.

> Finally, patch 9 adds performance tests for 'for-each-ref', showing the
> effects of optimizations made throughout the series. Here are some sample
> results from my Ubuntu VM (test names shortened for space):

Nice.

Copy link

gitgitgadget bot commented Nov 7, 2023

On the Git mailing list, Victoria Dye wrote (reply to this):

Junio C Hamano wrote:
> "Victoria Dye via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> This series is a bit of an informal follow-up to [1], adding some more
>> substantial optimizations and usability fixes around ref
>> filtering/formatting. Some of the changes here affect user-facing behavior,
>> some are internal-only, but they're all interdependent enough to warrant
>> putting them together in one series.
>>
>> [1]
>> https://lore.kernel.org/git/pull.1594.v2.git.1696888736.gitgitgadget@gmail.com/
>>
>> Patch 1 changes the behavior of the '--no-sort' option in 'for-each-ref',
>> 'tag', and 'branch'. Currently, it just removes previous sort keys and, if
>> no further keys are specified, falls back on ascending refname sort (which,
>> IMO, makes the name '--no-sort' somewhat misleading).
> 
> We can read it changes the behaviour and what the current behaviour
> is, but I presume that the untold new behaviour with --no-sort is to
> show the output in an unspecified order of implementation's
> convenience?  I think it makes quite a lot of sense if that is what
> is done.

Ah sorry, I over-edited my cover letter and accidentally removed the
explanation of what this patch does! Yes - the new behavior is that
'--no-sort' (assuming there are no subsequent --sort=<something> options)
will completely skip sorting the filtered refs. 

>>  * I'm not attached to '--full-deref' as a name - if someone has an idea for
>>    a more descriptive name, please suggest it!
> 
> Another candidate verb may be "to peel", and I have no strong
> opinion between it and "to dereference".  But I have a mild aversion
> to an abbreviation that is not strongly established.
> 

Makes sense. I got the "deref" abbreviation for 'update-ref --no-deref', but
'show-ref' has a "--dereference" option and protocol v2's "ls-refs" includes
a "peel" arg. "Dereference" is the term already used in the 'for-each-ref'
documentation, though, so if no one comes in with an especially strong
opinion on this I'll change the option to '--full-dereference'. Thanks!

Copy link

gitgitgadget bot commented Nov 7, 2023

On the Git mailing list, Junio C Hamano wrote (reply to this):

Victoria Dye <vdye@github.com> writes:

> Ah sorry, I over-edited my cover letter and accidentally removed the
> explanation of what this patch does! Yes - the new behavior is that
> '--no-sort' (assuming there are no subsequent --sort=<something> options)
> will completely skip sorting the filtered refs. 

Makes sense.

And the way to countermand "--no-sort" that appears earlier on the
command line to revert to the default sort order is "--sort" that
uses "refname" as the sort key, which is also nice.

Copy link

gitgitgadget bot commented Nov 7, 2023

This branch is now known as vd/for-each-ref-unsorted-optimization.

Copy link

gitgitgadget bot commented Nov 7, 2023

This patch series was integrated into seen via git@a4313b6.

@gitgitgadget gitgitgadget bot added the seen label Nov 7, 2023
Copy link

gitgitgadget bot commented Nov 7, 2023

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Mon, Nov 06, 2023 at 06:48:29PM -0800, Victoria Dye wrote:
> Junio C Hamano wrote:
> > "Victoria Dye via GitGitGadget" <gitgitgadget@gmail.com> writes:
[snip]
> >>  * I'm not attached to '--full-deref' as a name - if someone has an idea for
> >>    a more descriptive name, please suggest it!
> > 
> > Another candidate verb may be "to peel", and I have no strong
> > opinion between it and "to dereference".  But I have a mild aversion
> > to an abbreviation that is not strongly established.
> > 
> 
> Makes sense. I got the "deref" abbreviation for 'update-ref --no-deref', but
> 'show-ref' has a "--dereference" option and protocol v2's "ls-refs" includes
> a "peel" arg. "Dereference" is the term already used in the 'for-each-ref'
> documentation, though, so if no one comes in with an especially strong
> opinion on this I'll change the option to '--full-dereference'. Thanks!

But doesn't dereferencing in the context of git-update-ref(1) refer to
something different? It's not about tags, but it is about symbolic
references and whether we want to update the symref or the pointee. But
true enough, in git-show-ref(1) "dereference" actually means that we
should peel the tag.

To me it feels like preexisting commands are confused already. In my
mind model:

    - "peel" means that an object gets resolved to one of its pointees.
      This also includes the case here, where a tag gets peeled to its
      pointee.

    - "dereference" means that a symbolic reference gets resolved to its
      pointee. This matches what we do in `git update-ref --no-deref`.

But after reading through the code I don't think we distinguish those
terms cleanly throughout our codebase. Still, "peeling" feels like a
better match in my opinion.

Patrick

Copy link

gitgitgadget bot commented Nov 7, 2023

User Patrick Steinhardt <ps@pks.im> has been added to the cc: list.

@@ -767,7 +767,13 @@ int cmd_branch(int argc, const char **argv, const char *prefix)
if (argc == 2 && !strcmp(argv[1], "-h"))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Tue, Nov 07, 2023 at 01:25:53AM +0000, Victoria Dye via GitGitGadget wrote:
> From: Victoria Dye <vdye@github.com>
> 
> Update 'ref_sorting_options()' to return a NULL 'struct ref_sorting *' if
> the string list provided to it is empty, rather than returning the default
> refname sort structure. Also update 'ref_array_sort()' to explicitly skip
> sorting if its 'struct ref_sorting *' arg is NULL. Other functions using
> 'struct ref_sorting *' do not need any changes because they already properly
> ignore NULL values.
> 
> The goal of this change is to have the '--no-sort' option truly disable
> sorting in commands like 'for-each-ref, 'tag', and 'branch'. Right now,
> '--no-sort' will still trigger refname sorting by default in 'for-each-ref',
> 'tag', and 'branch'.
> 
> To match existing behavior as closely as possible, explicitly add "refname"
> to the list of sort keys in 'for-each-ref', 'tag', and 'branch' before
> parsing options (if no config-based sort keys are set). This ensures that
> sorting will only be fully disabled if '--no-sort' is provided as an option;
> otherwise, "refname" sorting will remain the default. Note: this also means
> that even when sort keys are provided on the command line, "refname" will be
> the final sort key in the sorting structure. This doesn't actually change
> any behavior, since 'compare_refs()' already falls back on comparing
> refnames if two refs are equal w.r.t all other sort keys.
> 
> Finally, remove the condition around sorting in 'ls-remote', since it's no
> longer necessary. Unlike 'for-each-ref' et. al., it does *not* set any sort
> keys by default. The default empty list of sort keys will produce a NULL
> 'struct ref_sorting *', which causes the sorting to be skipped in
> 'ref_array_sort()'.

I found the order in this commit message a bit funny because you first
explain what you're doing, then explain the goal, and then jump into the
changes again. The message might be a bit easier to read if the goal was
stated up front.

I was also briefly wondering whether it would make sense to split up
this commit, as you're doing two different things:

    - Refactor how git-for-each-ref(1), git-tag(1) and git-branch(1) set
      up their default sorting.

    - Change `ref_array_sort()` to not sort when its sorting option is
      `NULL`.

If this was split up into two commits, then the result might be a bit
easier to reason about. But I don't feel strongly about this.

> Signed-off-by: Victoria Dye <vdye@github.com>
> ---
>  builtin/branch.c        |  6 ++++
>  builtin/for-each-ref.c  |  3 ++
>  builtin/ls-remote.c     | 10 ++----
>  builtin/tag.c           |  6 ++++
>  ref-filter.c            | 19 ++----------
>  t/t3200-branch.sh       | 68 +++++++++++++++++++++++++++++++++++++++--
>  t/t6300-for-each-ref.sh | 21 +++++++++++++
>  t/t7004-tag.sh          | 45 +++++++++++++++++++++++++++
>  8 files changed, 152 insertions(+), 26 deletions(-)
> 
> diff --git a/builtin/branch.c b/builtin/branch.c
> index e7ee9bd0f15..d67738bbcaa 100644
> --- a/builtin/branch.c
> +++ b/builtin/branch.c
> @@ -767,7 +767,13 @@ int cmd_branch(int argc, const char **argv, const char *prefix)
>  	if (argc == 2 && !strcmp(argv[1], "-h"))
>  		usage_with_options(builtin_branch_usage, options);
>  
> +	/*
> +	 * Try to set sort keys from config. If config does not set any,
> +	 * fall back on default (refname) sorting.
> +	 */
>  	git_config(git_branch_config, &sorting_options);
> +	if (!sorting_options.nr)
> +		string_list_append(&sorting_options, "refname");
>  
>  	track = git_branch_track;
>  
> diff --git a/builtin/for-each-ref.c b/builtin/for-each-ref.c
> index 350bfa6e811..93b370f550b 100644
> --- a/builtin/for-each-ref.c
> +++ b/builtin/for-each-ref.c
> @@ -67,6 +67,9 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix)
>  
>  	git_config(git_default_config, NULL);
>  
> +	/* Set default (refname) sorting */
> +	string_list_append(&sorting_options, "refname");
> +
>  	parse_options(argc, argv, prefix, opts, for_each_ref_usage, 0);
>  	if (maxcount < 0) {
>  		error("invalid --count argument: `%d'", maxcount);
> diff --git a/builtin/ls-remote.c b/builtin/ls-remote.c
> index fc765754305..436249b720c 100644
> --- a/builtin/ls-remote.c
> +++ b/builtin/ls-remote.c
> @@ -58,6 +58,7 @@ int cmd_ls_remote(int argc, const char **argv, const char *prefix)
>  	struct transport *transport;
>  	const struct ref *ref;
>  	struct ref_array ref_array;
> +	struct ref_sorting *sorting;
>  	struct string_list sorting_options = STRING_LIST_INIT_DUP;
>  
>  	struct option options[] = {
> @@ -141,13 +142,8 @@ int cmd_ls_remote(int argc, const char **argv, const char *prefix)
>  		item->symref = xstrdup_or_null(ref->symref);
>  	}
>  
> -	if (sorting_options.nr) {
> -		struct ref_sorting *sorting;
> -
> -		sorting = ref_sorting_options(&sorting_options);
> -		ref_array_sort(sorting, &ref_array);
> -		ref_sorting_release(sorting);
> -	}
> +	sorting = ref_sorting_options(&sorting_options);
> +	ref_array_sort(sorting, &ref_array);

We stopped calling `ref_sorting_release()`. Doesn't that cause us to
leak memory?

>  	for (i = 0; i < ref_array.nr; i++) {
>  		const struct ref_array_item *ref = ref_array.items[i];
> diff --git a/builtin/tag.c b/builtin/tag.c
> index 3918eacbb57..64f3196cd4c 100644
> --- a/builtin/tag.c
> +++ b/builtin/tag.c
> @@ -501,7 +501,13 @@ int cmd_tag(int argc, const char **argv, const char *prefix)
>  
>  	setup_ref_filter_porcelain_msg();
>  
> +	/*
> +	 * Try to set sort keys from config. If config does not set any,
> +	 * fall back on default (refname) sorting.
> +	 */
>  	git_config(git_tag_config, &sorting_options);
> +	if (!sorting_options.nr)
> +		string_list_append(&sorting_options, "refname");
>  
>  	memset(&opt, 0, sizeof(opt));
>  	filter.lines = -1;
> diff --git a/ref-filter.c b/ref-filter.c
> index e4d3510e28e..7250089b7c6 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -3142,7 +3142,8 @@ void ref_sorting_set_sort_flags_all(struct ref_sorting *sorting,
>  
>  void ref_array_sort(struct ref_sorting *sorting, struct ref_array *array)
>  {
> -	QSORT_S(array->items, array->nr, compare_refs, sorting);
> +	if (sorting)
> +		QSORT_S(array->items, array->nr, compare_refs, sorting);
>  }
>  
>  static void append_literal(const char *cp, const char *ep, struct ref_formatting_state *state)
> @@ -3248,18 +3249,6 @@ static int parse_sorting_atom(const char *atom)
>  	return res;
>  }
>  
> -/*  If no sorting option is given, use refname to sort as default */
> -static struct ref_sorting *ref_default_sorting(void)
> -{
> -	static const char cstr_name[] = "refname";
> -
> -	struct ref_sorting *sorting = xcalloc(1, sizeof(*sorting));
> -
> -	sorting->next = NULL;
> -	sorting->atom = parse_sorting_atom(cstr_name);
> -	return sorting;
> -}
> -
>  static void parse_ref_sorting(struct ref_sorting **sorting_tail, const char *arg)
>  {
>  	struct ref_sorting *s;
> @@ -3283,9 +3272,7 @@ struct ref_sorting *ref_sorting_options(struct string_list *options)
>  	struct string_list_item *item;
>  	struct ref_sorting *sorting = NULL, **tail = &sorting;
>  
> -	if (!options->nr) {
> -		sorting = ref_default_sorting();
> -	} else {
> +	if (options->nr) {
>  		for_each_string_list_item(item, options)
>  			parse_ref_sorting(tail, item->string);
>  	}
> diff --git a/t/t3200-branch.sh b/t/t3200-branch.sh
> index 3182abde27f..9918ba05dec 100755
> --- a/t/t3200-branch.sh
> +++ b/t/t3200-branch.sh
> @@ -1570,9 +1570,10 @@ test_expect_success 'tracking with unexpected .fetch refspec' '
>  
>  test_expect_success 'configured committerdate sort' '
>  	git init -b main sort &&
> +	test_config -C sort branch.sort "committerdate" &&
> +
>  	(
>  		cd sort &&
> -		git config branch.sort committerdate &&
>  		test_commit initial &&
>  		git checkout -b a &&
>  		test_commit a &&
> @@ -1592,9 +1593,10 @@ test_expect_success 'configured committerdate sort' '
>  '
>  
>  test_expect_success 'option override configured sort' '
> +	test_config -C sort branch.sort "committerdate" &&
> +
>  	(
>  		cd sort &&
> -		git config branch.sort committerdate &&
>  		git branch --sort=refname >actual &&
>  		cat >expect <<-\EOF &&
>  		  a
> @@ -1606,10 +1608,70 @@ test_expect_success 'option override configured sort' '
>  	)
>  '
>  
> +test_expect_success '--no-sort cancels config sort keys' '
> +	test_config -C sort branch.sort "-refname" &&
> +
> +	(
> +		cd sort &&
> +
> +		# objecttype is identical for all of them, so sort falls back on
> +		# default (ascending refname)
> +		git branch \
> +			--no-sort \
> +			--sort="objecttype" >actual &&

This test is a bit confusing to me. Shouldn't we in fact ignore the
configured sorting order as soon as we pass `--sort=` anyway? In other
words, I would expect the `--no-sort` option to not make a difference
here. What should make a difference is if you _only_ passed `--no-sort`.

> +		cat >expect <<-\EOF &&
> +		  a
> +		* b
> +		  c
> +		  main
> +		EOF
> +		test_cmp expect actual
> +	)
> +
> +'
> +
> +test_expect_success '--no-sort cancels command line sort keys' '
> +	(
> +		cd sort &&
> +
> +		# objecttype is identical for all of them, so sort falls back on
> +		# default (ascending refname)
> +		git branch \
> +			--sort="-refname" \
> +			--no-sort \
> +			--sort="objecttype" >actual &&
> +		cat >expect <<-\EOF &&
> +		  a
> +		* b
> +		  c
> +		  main
> +		EOF
> +		test_cmp expect actual
> +	)
> +'
> +
> +test_expect_success '--no-sort without subsequent --sort prints expected branches' '
> +	(
> +		cd sort &&
> +
> +		# Sort the results with `sort` for a consistent comparison
> +		# against expected
> +		git branch --no-sort | sort >actual &&
> +		cat >expect <<-\EOF &&
> +		  a
> +		  c
> +		  main
> +		* b
> +		EOF
> +		test_cmp expect actual
> +	)
> +'
> +
>  test_expect_success 'invalid sort parameter in configuration' '
> +	test_config -C sort branch.sort "v:notvalid" &&
> +
>  	(
>  		cd sort &&
> -		git config branch.sort "v:notvalid" &&
>  
>  		# this works in the "listing" mode, so bad sort key
>  		# is a dying offence.
> diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
> index 00a060df0b5..0613e5e3623 100755
> --- a/t/t6300-for-each-ref.sh
> +++ b/t/t6300-for-each-ref.sh
> @@ -1335,6 +1335,27 @@ test_expect_success '--no-sort cancels the previous sort keys' '
>  	test_cmp expected actual
>  '
>  
> +test_expect_success '--no-sort without subsequent --sort prints expected refs' '
> +	cat >expected <<-\EOF &&
> +	refs/tags/multi-ref1-100000-user1
> +	refs/tags/multi-ref1-100000-user2
> +	refs/tags/multi-ref1-200000-user1
> +	refs/tags/multi-ref1-200000-user2
> +	refs/tags/multi-ref2-100000-user1
> +	refs/tags/multi-ref2-100000-user2
> +	refs/tags/multi-ref2-200000-user1
> +	refs/tags/multi-ref2-200000-user2
> +	EOF
> +
> +	# Sort the results with `sort` for a consistent comparison against
> +	# expected
> +	git for-each-ref \
> +		--format="%(refname)" \
> +		--no-sort \
> +		"refs/tags/multi-*" | sort >actual &&
> +	test_cmp expected actual
> +'
> +
>  test_expect_success 'do not dereference NULL upon %(HEAD) on unborn branch' '
>  	test_when_finished "git checkout main" &&
>  	git for-each-ref --format="%(HEAD) %(refname:short)" refs/heads/ >actual &&
> diff --git a/t/t7004-tag.sh b/t/t7004-tag.sh
> index e689db42929..b41a47eb943 100755
> --- a/t/t7004-tag.sh
> +++ b/t/t7004-tag.sh
> @@ -1862,6 +1862,51 @@ test_expect_success 'option override configured sort' '
>  	test_cmp expect actual
>  '
>  
> +test_expect_success '--no-sort cancels config sort keys' '
> +	test_config tag.sort "-refname" &&
> +
> +	# objecttype is identical for all of them, so sort falls back on
> +	# default (ascending refname)
> +	git tag -l \
> +		--no-sort \
> +		--sort="objecttype" \
> +		"foo*" >actual &&
> +	cat >expect <<-\EOF &&
> +	foo1.10
> +	foo1.3
> +	foo1.6
> +	EOF
> +	test_cmp expect actual
> +'

Same question here.

Patrick

> +test_expect_success '--no-sort cancels command line sort keys' '
> +	# objecttype is identical for all of them, so sort falls back on
> +	# default (ascending refname)
> +	git tag -l \
> +		--sort="-refname" \
> +		--no-sort \
> +		--sort="objecttype" \
> +		"foo*" >actual &&
> +	cat >expect <<-\EOF &&
> +	foo1.10
> +	foo1.3
> +	foo1.6
> +	EOF
> +	test_cmp expect actual
> +'
> +
> +test_expect_success '--no-sort without subsequent --sort prints expected tags' '
> +	# Sort the results with `sort` for a consistent comparison against
> +	# expected
> +	git tag -l --no-sort "foo*" | sort >actual &&
> +	cat >expect <<-\EOF &&
> +	foo1.10
> +	foo1.3
> +	foo1.6
> +	EOF
> +	test_cmp expect actual
> +'
> +
>  test_expect_success 'invalid sort parameter on command line' '
>  	test_must_fail git tag -l --sort=notvalid "foo*" >actual
>  '
> -- 
> gitgitgadget
> 
> 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Victoria Dye wrote (reply to this):

Patrick Steinhardt wrote:
> On Tue, Nov 07, 2023 at 01:25:53AM +0000, Victoria Dye via GitGitGadget wrote:
>> From: Victoria Dye <vdye@github.com>
>>
>> Update 'ref_sorting_options()' to return a NULL 'struct ref_sorting *' if
>> the string list provided to it is empty, rather than returning the default
>> refname sort structure. Also update 'ref_array_sort()' to explicitly skip
>> sorting if its 'struct ref_sorting *' arg is NULL. Other functions using
>> 'struct ref_sorting *' do not need any changes because they already properly
>> ignore NULL values.
>>
>> The goal of this change is to have the '--no-sort' option truly disable
>> sorting in commands like 'for-each-ref, 'tag', and 'branch'. Right now,
>> '--no-sort' will still trigger refname sorting by default in 'for-each-ref',
>> 'tag', and 'branch'.
>>
>> To match existing behavior as closely as possible, explicitly add "refname"
>> to the list of sort keys in 'for-each-ref', 'tag', and 'branch' before
>> parsing options (if no config-based sort keys are set). This ensures that
>> sorting will only be fully disabled if '--no-sort' is provided as an option;
>> otherwise, "refname" sorting will remain the default. Note: this also means
>> that even when sort keys are provided on the command line, "refname" will be
>> the final sort key in the sorting structure. This doesn't actually change
>> any behavior, since 'compare_refs()' already falls back on comparing
>> refnames if two refs are equal w.r.t all other sort keys.
>>
>> Finally, remove the condition around sorting in 'ls-remote', since it's no
>> longer necessary. Unlike 'for-each-ref' et. al., it does *not* set any sort
>> keys by default. The default empty list of sort keys will produce a NULL
>> 'struct ref_sorting *', which causes the sorting to be skipped in
>> 'ref_array_sort()'.
> 
> I found the order in this commit message a bit funny because you first
> explain what you're doing, then explain the goal, and then jump into the
> changes again. The message might be a bit easier to read if the goal was
> stated up front.

I'll try to restructure it.

> 
> I was also briefly wondering whether it would make sense to split up
> this commit, as you're doing two different things:
> 
>     - Refactor how git-for-each-ref(1), git-tag(1) and git-branch(1) set
>       up their default sorting.
> 
>     - Change `ref_array_sort()` to not sort when its sorting option is
>       `NULL`.
> 
> If this was split up into two commits, then the result might be a bit
> easier to reason about. But I don't feel strongly about this.

The addition of "refname" to the sorting defaults really only makes sense in
the context of needing it to update 'ref_array_sort()', though. While you
can convey some of that in a commit message, when reading through commits
(mine and others') I find it much easier to contextualize small refactors
with their associated behavior change if they're done in a single patch.
There's a limit to that, of course; even within this series I have a lot of
"this will make sense later" commit messages (more than I'd like really)
because the refactors are large & varied enough that they'd be overwhelming
if squashed into a single patch.

So, while I definitely see where you're coming from, I think this patch is
better off not being split.

>> diff --git a/builtin/ls-remote.c b/builtin/ls-remote.c
>> index fc765754305..436249b720c 100644
>> --- a/builtin/ls-remote.c
>> +++ b/builtin/ls-remote.c
>> @@ -58,6 +58,7 @@ int cmd_ls_remote(int argc, const char **argv, const char *prefix)
>>  	struct transport *transport;
>>  	const struct ref *ref;
>>  	struct ref_array ref_array;
>> +	struct ref_sorting *sorting;
>>  	struct string_list sorting_options = STRING_LIST_INIT_DUP;
>>  
>>  	struct option options[] = {
>> @@ -141,13 +142,8 @@ int cmd_ls_remote(int argc, const char **argv, const char *prefix)
>>  		item->symref = xstrdup_or_null(ref->symref);
>>  	}
>>  
>> -	if (sorting_options.nr) {
>> -		struct ref_sorting *sorting;
>> -
>> -		sorting = ref_sorting_options(&sorting_options);
>> -		ref_array_sort(sorting, &ref_array);
>> -		ref_sorting_release(sorting);
>> -	}
>> +	sorting = ref_sorting_options(&sorting_options);
>> +	ref_array_sort(sorting, &ref_array);
> 
> We stopped calling `ref_sorting_release()`. Doesn't that cause us to
> leak memory?

Nice catch, thanks! It should have been moved to the end of this function
(right before the 'ref_array_clear()').

>> diff --git a/t/t3200-branch.sh b/t/t3200-branch.sh
>> index 3182abde27f..9918ba05dec 100755
>> --- a/t/t3200-branch.sh
>> +++ b/t/t3200-branch.sh
>> @@ -1606,10 +1608,70 @@ test_expect_success 'option override configured sort' '
>>  	)
>>  '
>>  
>> +test_expect_success '--no-sort cancels config sort keys' '
>> +	test_config -C sort branch.sort "-refname" &&
>> +
>> +	(
>> +		cd sort &&
>> +
>> +		# objecttype is identical for all of them, so sort falls back on
>> +		# default (ascending refname)
>> +		git branch \
>> +			--no-sort \
>> +			--sort="objecttype" >actual &&
> 
> This test is a bit confusing to me. Shouldn't we in fact ignore the
> configured sorting order as soon as we pass `--sort=` anyway? In other
> words, I would expect the `--no-sort` option to not make a difference
> here. What should make a difference is if you _only_ passed `--no-sort`.

The existing behavior (as demonstrated by this test) is that the command
line sort keys append to, rather than replace, the config-based sort keys. I
don't see any evidence in the commit history to indicate that this was an
intentional design decision, but it's not necessarily incorrect either.

For one, it's not universal in string list options that the command line
replaces the config. There are examples of both approaches to string list
options in other commands:

- in 'git push', specifying '--push-option' on the command line even once
  will remove any values set by 'push.pushoption'
- in 'git blame', any values specified with '--ignore-revs-file' are
  appended to those set by 'blame.ignorerevsfile'

In the case of 'git (tag|branch)', I can see why users might not want
command line sort keys to completely remove config-based ones. The only time
the config-based keys will come into play is when two entries are identical
w.r.t _all_ of the command line sort keys. In that scenario, I'd expect a
user would want to use their configured defaults to "break the tie" instead
of the hardcoded ascending refname sort. If they do actually want to remove
the config keys, they can set '--no-sort' before their other sort keys.

ref-filter.c Outdated
@@ -2764,8 +2764,6 @@ static int filter_ref_kind(struct ref_filter *filter, const char *refname)
struct ref_filter_cbdata {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Tue, Nov 07, 2023 at 01:25:56AM +0000, Victoria Dye via GitGitGadget wrote:
> From: Victoria Dye <vdye@github.com>
> 
> Move the 'contains_cache' and 'no_contains_cache' used in filter_refs into
> an 'internal' struct of the 'struct ref_filter'. In later patches, the
> 'struct ref_filter *' will be a common data structure across multiple
> filtering functions. These caches are part of the common functionality the
> filter struct will support, so they are updated to be internally accessible
> wherever the filter is used.
> 
> The design used here is mirrors what was introduced in 576de3d956

Nit: s/is //

Patrick

> (unpack_trees: start splitting internal fields from public API, 2023-02-27)
> for 'unpack_trees_options'.
> 
> Signed-off-by: Victoria Dye <vdye@github.com>
> ---
>  ref-filter.c | 14 ++++++--------
>  ref-filter.h |  6 ++++++
>  2 files changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/ref-filter.c b/ref-filter.c
> index 7250089b7c6..5129b6986c9 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -2764,8 +2764,6 @@ static int filter_ref_kind(struct ref_filter *filter, const char *refname)
>  struct ref_filter_cbdata {
>  	struct ref_array *array;
>  	struct ref_filter *filter;
> -	struct contains_cache contains_cache;
> -	struct contains_cache no_contains_cache;
>  };
>  
>  /*
> @@ -2816,11 +2814,11 @@ static int ref_filter_handler(const char *refname, const struct object_id *oid,
>  			return 0;
>  		/* We perform the filtering for the '--contains' option... */
>  		if (filter->with_commit &&
> -		    !commit_contains(filter, commit, filter->with_commit, &ref_cbdata->contains_cache))
> +		    !commit_contains(filter, commit, filter->with_commit, &filter->internal.contains_cache))
>  			return 0;
>  		/* ...or for the `--no-contains' option */
>  		if (filter->no_commit &&
> -		    commit_contains(filter, commit, filter->no_commit, &ref_cbdata->no_contains_cache))
> +		    commit_contains(filter, commit, filter->no_commit, &filter->internal.no_contains_cache))
>  			return 0;
>  	}
>  
> @@ -2989,8 +2987,8 @@ int filter_refs(struct ref_array *array, struct ref_filter *filter, unsigned int
>  	save_commit_buffer_orig = save_commit_buffer;
>  	save_commit_buffer = 0;
>  
> -	init_contains_cache(&ref_cbdata.contains_cache);
> -	init_contains_cache(&ref_cbdata.no_contains_cache);
> +	init_contains_cache(&filter->internal.contains_cache);
> +	init_contains_cache(&filter->internal.no_contains_cache);
>  
>  	/*  Simple per-ref filtering */
>  	if (!filter->kind)
> @@ -3014,8 +3012,8 @@ int filter_refs(struct ref_array *array, struct ref_filter *filter, unsigned int
>  			head_ref(ref_filter_handler, &ref_cbdata);
>  	}
>  
> -	clear_contains_cache(&ref_cbdata.contains_cache);
> -	clear_contains_cache(&ref_cbdata.no_contains_cache);
> +	clear_contains_cache(&filter->internal.contains_cache);
> +	clear_contains_cache(&filter->internal.no_contains_cache);
>  
>  	/*  Filters that need revision walking */
>  	reach_filter(array, &filter->reachable_from, INCLUDE_REACHED);
> diff --git a/ref-filter.h b/ref-filter.h
> index d87d61238b7..0db3ff52889 100644
> --- a/ref-filter.h
> +++ b/ref-filter.h
> @@ -7,6 +7,7 @@
>  #include "commit.h"
>  #include "string-list.h"
>  #include "strvec.h"
> +#include "commit-reach.h"
>  
>  /* Quoting styles */
>  #define QUOTE_NONE 0
> @@ -75,6 +76,11 @@ struct ref_filter {
>  		lines;
>  	int abbrev,
>  		verbose;
> +
> +	struct {
> +		struct contains_cache contains_cache;
> +		struct contains_cache no_contains_cache;
> +	} internal;
>  };
>  
>  struct ref_format {
> -- 
> gitgitgadget
> 
> 

@@ -2716,15 +2716,18 @@ static struct ref_array_item *new_ref_array_item(const char *refname,
return ref;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Tue, Nov 07, 2023 at 01:25:58AM +0000, Victoria Dye via GitGitGadget wrote:
> From: Victoria Dye <vdye@github.com>
> 
> Factor out parts of 'ref_array_push()', 'ref_filter_handler()', and
> 'filter_refs()' into new helper functions ('ref_array_append()',
> 'apply_ref_filter()', and 'do_filter_refs()' respectively), as well as
> rename 'ref_filter_handler()' to 'filter_one()'. In this and later
> patches, these helpers will be used by new ref-filter API functions. This
> patch does not result in any user-facing behavior changes or changes to
> callers outside of 'ref-filter.c'.
> 
> The changes are as follows:
> 
> * The logic to grow a 'struct ref_array' and append a given 'struct
>   ref_array_item *' to it is extracted from 'ref_array_push()' into
>   'ref_array_append()'.
> * 'ref_filter_handler()' is renamed to 'filter_one()' to more clearly
>   distinguish it from other ref filtering callbacks that will be added in
>   later patches. The "*_one()" naming convention is common throughout the
>   codebase for iteration callbacks.
> * The code to filter a given ref by refname & object ID then create a new
>   'struct ref_array_item' is moved out of 'filter_one()' and into
>   'apply_ref_filter()'. 'apply_ref_filter()' returns either NULL (if the ref
>   does not match the given filter) or a 'struct ref_array_item *' created
>   with 'new_ref_array_item()'; 'filter_one()' appends that item to
>   its ref array with 'ref_array_append()'.
> * The filter pre-processing, contains cache creation, and ref iteration of
>   'filter_refs()' is extracted into 'do_filter_refs()'. 'do_filter_refs()'
>   takes its ref iterator function & callback data as an input from the
>   caller, setting it up to be used with additional filtering callbacks in
>   later patches.

To me, a bulleted list spelling out the different changes I'm doing
often indicates that I might want to split up the commit into one for
each of the items. I don't feel strongly about this, but think that it
might help the reviewer in this case.

Patrick

> Signed-off-by: Victoria Dye <vdye@github.com>
> ---
>  ref-filter.c | 115 ++++++++++++++++++++++++++++++---------------------
>  1 file changed, 69 insertions(+), 46 deletions(-)
> 
> diff --git a/ref-filter.c b/ref-filter.c
> index 8992fbf45b1..ff00ab4b8d8 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -2716,15 +2716,18 @@ static struct ref_array_item *new_ref_array_item(const char *refname,
>  	return ref;
>  }
>  
> +static void ref_array_append(struct ref_array *array, struct ref_array_item *ref)
> +{
> +	ALLOC_GROW(array->items, array->nr + 1, array->alloc);
> +	array->items[array->nr++] = ref;
> +}
> +
>  struct ref_array_item *ref_array_push(struct ref_array *array,
>  				      const char *refname,
>  				      const struct object_id *oid)
>  {
>  	struct ref_array_item *ref = new_ref_array_item(refname, oid);
> -
> -	ALLOC_GROW(array->items, array->nr + 1, array->alloc);
> -	array->items[array->nr++] = ref;
> -
> +	ref_array_append(array, ref);
>  	return ref;
>  }
>  
> @@ -2761,46 +2764,36 @@ static int filter_ref_kind(struct ref_filter *filter, const char *refname)
>  	return ref_kind_from_refname(refname);
>  }
>  
> -struct ref_filter_cbdata {
> -	struct ref_array *array;
> -	struct ref_filter *filter;
> -};
> -
> -/*
> - * A call-back given to for_each_ref().  Filter refs and keep them for
> - * later object processing.
> - */
> -static int ref_filter_handler(const char *refname, const struct object_id *oid, int flag, void *cb_data)
> +static struct ref_array_item *apply_ref_filter(const char *refname, const struct object_id *oid,
> +			    int flag, struct ref_filter *filter)
>  {
> -	struct ref_filter_cbdata *ref_cbdata = cb_data;
> -	struct ref_filter *filter = ref_cbdata->filter;
>  	struct ref_array_item *ref;
>  	struct commit *commit = NULL;
>  	unsigned int kind;
>  
>  	if (flag & REF_BAD_NAME) {
>  		warning(_("ignoring ref with broken name %s"), refname);
> -		return 0;
> +		return NULL;
>  	}
>  
>  	if (flag & REF_ISBROKEN) {
>  		warning(_("ignoring broken ref %s"), refname);
> -		return 0;
> +		return NULL;
>  	}
>  
>  	/* Obtain the current ref kind from filter_ref_kind() and ignore unwanted refs. */
>  	kind = filter_ref_kind(filter, refname);
>  	if (!(kind & filter->kind))
> -		return 0;
> +		return NULL;
>  
>  	if (!filter_pattern_match(filter, refname))
> -		return 0;
> +		return NULL;
>  
>  	if (filter_exclude_match(filter, refname))
> -		return 0;
> +		return NULL;
>  
>  	if (filter->points_at.nr && !match_points_at(&filter->points_at, oid, refname))
> -		return 0;
> +		return NULL;
>  
>  	/*
>  	 * A merge filter is applied on refs pointing to commits. Hence
> @@ -2811,15 +2804,15 @@ static int ref_filter_handler(const char *refname, const struct object_id *oid,
>  	    filter->with_commit || filter->no_commit || filter->verbose) {
>  		commit = lookup_commit_reference_gently(the_repository, oid, 1);
>  		if (!commit)
> -			return 0;
> +			return NULL;
>  		/* We perform the filtering for the '--contains' option... */
>  		if (filter->with_commit &&
>  		    !commit_contains(filter, commit, filter->with_commit, &filter->internal.contains_cache))
> -			return 0;
> +			return NULL;
>  		/* ...or for the `--no-contains' option */
>  		if (filter->no_commit &&
>  		    commit_contains(filter, commit, filter->no_commit, &filter->internal.no_contains_cache))
> -			return 0;
> +			return NULL;
>  	}
>  
>  	/*
> @@ -2827,11 +2820,32 @@ static int ref_filter_handler(const char *refname, const struct object_id *oid,
>  	 * to do its job and the resulting list may yet to be pruned
>  	 * by maxcount logic.
>  	 */
> -	ref = ref_array_push(ref_cbdata->array, refname, oid);
> +	ref = new_ref_array_item(refname, oid);
>  	ref->commit = commit;
>  	ref->flag = flag;
>  	ref->kind = kind;
>  
> +	return ref;
> +}
> +
> +struct ref_filter_cbdata {
> +	struct ref_array *array;
> +	struct ref_filter *filter;
> +};
> +
> +/*
> + * A call-back given to for_each_ref().  Filter refs and keep them for
> + * later object processing.
> + */
> +static int filter_one(const char *refname, const struct object_id *oid, int flag, void *cb_data)
> +{
> +	struct ref_filter_cbdata *ref_cbdata = cb_data;
> +	struct ref_array_item *ref;
> +
> +	ref = apply_ref_filter(refname, oid, flag, ref_cbdata->filter);
> +	if (ref)
> +		ref_array_append(ref_cbdata->array, ref);
> +
>  	return 0;
>  }
>  
> @@ -2967,26 +2981,12 @@ void filter_ahead_behind(struct repository *r,
>  	free(commits);
>  }
>  
> -/*
> - * API for filtering a set of refs. Based on the type of refs the user
> - * has requested, we iterate through those refs and apply filters
> - * as per the given ref_filter structure and finally store the
> - * filtered refs in the ref_array structure.
> - */
> -int filter_refs(struct ref_array *array, struct ref_filter *filter, unsigned int type)
> +static int do_filter_refs(struct ref_filter *filter, unsigned int type, each_ref_fn fn, void *cb_data)
>  {
> -	struct ref_filter_cbdata ref_cbdata;
> -	int save_commit_buffer_orig;
>  	int ret = 0;
>  
> -	ref_cbdata.array = array;
> -	ref_cbdata.filter = filter;
> -
>  	filter->kind = type & FILTER_REFS_KIND_MASK;
>  
> -	save_commit_buffer_orig = save_commit_buffer;
> -	save_commit_buffer = 0;
> -
>  	init_contains_cache(&filter->internal.contains_cache);
>  	init_contains_cache(&filter->internal.no_contains_cache);
>  
> @@ -3001,20 +3001,43 @@ int filter_refs(struct ref_array *array, struct ref_filter *filter, unsigned int
>  		 * of filter_ref_kind().
>  		 */
>  		if (filter->kind == FILTER_REFS_BRANCHES)
> -			ret = for_each_fullref_in("refs/heads/", ref_filter_handler, &ref_cbdata);
> +			ret = for_each_fullref_in("refs/heads/", fn, cb_data);
>  		else if (filter->kind == FILTER_REFS_REMOTES)
> -			ret = for_each_fullref_in("refs/remotes/", ref_filter_handler, &ref_cbdata);
> +			ret = for_each_fullref_in("refs/remotes/", fn, cb_data);
>  		else if (filter->kind == FILTER_REFS_TAGS)
> -			ret = for_each_fullref_in("refs/tags/", ref_filter_handler, &ref_cbdata);
> +			ret = for_each_fullref_in("refs/tags/", fn, cb_data);
>  		else if (filter->kind & FILTER_REFS_ALL)
> -			ret = for_each_fullref_in_pattern(filter, ref_filter_handler, &ref_cbdata);
> +			ret = for_each_fullref_in_pattern(filter, fn, cb_data);
>  		if (!ret && (filter->kind & FILTER_REFS_DETACHED_HEAD))
> -			head_ref(ref_filter_handler, &ref_cbdata);
> +			head_ref(fn, cb_data);
>  	}
>  
>  	clear_contains_cache(&filter->internal.contains_cache);
>  	clear_contains_cache(&filter->internal.no_contains_cache);
>  
> +	return ret;
> +}
> +
> +/*
> + * API for filtering a set of refs. Based on the type of refs the user
> + * has requested, we iterate through those refs and apply filters
> + * as per the given ref_filter structure and finally store the
> + * filtered refs in the ref_array structure.
> + */
> +int filter_refs(struct ref_array *array, struct ref_filter *filter, unsigned int type)
> +{
> +	struct ref_filter_cbdata ref_cbdata;
> +	int save_commit_buffer_orig;
> +	int ret = 0;
> +
> +	ref_cbdata.array = array;
> +	ref_cbdata.filter = filter;
> +
> +	save_commit_buffer_orig = save_commit_buffer;
> +	save_commit_buffer = 0;
> +
> +	ret = do_filter_refs(filter, type, filter_one, &ref_cbdata);
> +
>  	/*  Filters that need revision walking */
>  	reach_filter(array, &filter->reachable_from, INCLUDE_REACHED);
>  	reach_filter(array, &filter->unreachable_from, EXCLUDE_REACHED);
> -- 
> gitgitgadget
> 
> 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Victoria Dye wrote (reply to this):

Patrick Steinhardt wrote:
> On Tue, Nov 07, 2023 at 01:25:58AM +0000, Victoria Dye via GitGitGadget wrote:
>> From: Victoria Dye <vdye@github.com>
>>
>> Factor out parts of 'ref_array_push()', 'ref_filter_handler()', and
>> 'filter_refs()' into new helper functions ('ref_array_append()',
>> 'apply_ref_filter()', and 'do_filter_refs()' respectively), as well as
>> rename 'ref_filter_handler()' to 'filter_one()'. In this and later
>> patches, these helpers will be used by new ref-filter API functions. This
>> patch does not result in any user-facing behavior changes or changes to
>> callers outside of 'ref-filter.c'.
>>
>> The changes are as follows:
>>
>> * The logic to grow a 'struct ref_array' and append a given 'struct
>>   ref_array_item *' to it is extracted from 'ref_array_push()' into
>>   'ref_array_append()'.
>> * 'ref_filter_handler()' is renamed to 'filter_one()' to more clearly
>>   distinguish it from other ref filtering callbacks that will be added in
>>   later patches. The "*_one()" naming convention is common throughout the
>>   codebase for iteration callbacks.
>> * The code to filter a given ref by refname & object ID then create a new
>>   'struct ref_array_item' is moved out of 'filter_one()' and into
>>   'apply_ref_filter()'. 'apply_ref_filter()' returns either NULL (if the ref
>>   does not match the given filter) or a 'struct ref_array_item *' created
>>   with 'new_ref_array_item()'; 'filter_one()' appends that item to
>>   its ref array with 'ref_array_append()'.
>> * The filter pre-processing, contains cache creation, and ref iteration of
>>   'filter_refs()' is extracted into 'do_filter_refs()'. 'do_filter_refs()'
>>   takes its ref iterator function & callback data as an input from the
>>   caller, setting it up to be used with additional filtering callbacks in
>>   later patches.
> 
> To me, a bulleted list spelling out the different changes I'm doing
> often indicates that I might want to split up the commit into one for
> each of the items. I don't feel strongly about this, but think that it
> might help the reviewer in this case.

While that's a good guideline to keep in mind, it's not universally
applicable. In this case, (almost) all of the changes are done the same way,
focused on the same goal: extract bits of 'filter_refs()' into generic,
internal helpers so we can use those bits elsewhere in later patches.
Splitting those extractions into multiple patches would essentially lead to
a handful of very small patches that more-or-less have the same commit
message. As I mentioned in [1], I think there's value to having the
immediate context of related changes in a single patch (as long as that
single patch doesn't become unwieldy), so I'm not inclined to split this up.

That said, I did say "(almost) all" of the changes are conceptually similar.
Looking at this now, the rename of 'ref_filter_handler()' => 'filter_one()'
doesn't really fit the "extract into helper functions" theme of the rest of
the patch, I'll pull that out into its own.

[1] https://lore.kernel.org/git/a833b5a7-0201-4c2e-8821-f2a1930cb403@github.com/

@@ -2851,6 +2863,44 @@ static void free_array_item(struct ref_array_item *item)
free(item);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Tue, Nov 07, 2023 at 01:25:59AM +0000, Victoria Dye via GitGitGadget wrote:
> From: Victoria Dye <vdye@github.com>
> 
> Update 'filter_and_format_refs()' to try to perform ref filtering &
> formatting in a single ref iteration, without an intermediate 'struct
> ref_array'. This can only be done if no operations need to be performed on a
> pre-filtered array; specifically, if the refs are
> 
> - filtered on reachability,
> - sorted, or
> - formatted with ahead-behind information
> 
> they cannot be filtered & formatted in the same iteration. In that case,
> fall back on the current filter-then-sort-then-format flow.
> 
> This optimization substantially improves memory usage due to no longer
> storing a ref array in memory. In some cases, it also dramatically reduces
> runtime (e.g. 'git for-each-ref --no-sort --count=1', which no longer loads
> all refs into a 'struct ref_array' to printing only the first ref).
> 
> Signed-off-by: Victoria Dye <vdye@github.com>
> ---
>  ref-filter.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 74 insertions(+), 6 deletions(-)
> 
> diff --git a/ref-filter.c b/ref-filter.c
> index ff00ab4b8d8..384cf1595ff 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -2863,6 +2863,44 @@ static void free_array_item(struct ref_array_item *item)
>  	free(item);
>  }
>  
> +struct ref_filter_and_format_cbdata {
> +	struct ref_filter *filter;
> +	struct ref_format *format;
> +
> +	struct ref_filter_and_format_internal {
> +		int count;
> +	} internal;
> +};
> +
> +static int filter_and_format_one(const char *refname, const struct object_id *oid, int flag, void *cb_data)
> +{
> +	struct ref_filter_and_format_cbdata *ref_cbdata = cb_data;
> +	struct ref_array_item *ref;
> +	struct strbuf output = STRBUF_INIT, err = STRBUF_INIT;
> +
> +	ref = apply_ref_filter(refname, oid, flag, ref_cbdata->filter);
> +	if (!ref)
> +		return 0;
> +
> +	if (format_ref_array_item(ref, ref_cbdata->format, &output, &err))
> +		die("%s", err.buf);
> +
> +	if (output.len || !ref_cbdata->format->array_opts.omit_empty) {
> +		fwrite(output.buf, 1, output.len, stdout);
> +		putchar('\n');
> +	}
> +
> +	strbuf_release(&output);
> +	strbuf_release(&err);
> +	free_array_item(ref);
> +
> +	if (ref_cbdata->format->array_opts.max_count &&
> +	    ++ref_cbdata->internal.count >= ref_cbdata->format->array_opts.max_count)
> +		return -1;

It feels a bit weird to return a negative value here, which usually
indicates that an error has happened whereas we only use it here to
abort the iteration. But we ignore the return value of
`do_iterate_refs()` anyway, so it doesn't make much of a difference.

> +	return 0;
> +}
> +
>  /* Free all memory allocated for ref_array */
>  void ref_array_clear(struct ref_array *array)
>  {
> @@ -3046,16 +3084,46 @@ int filter_refs(struct ref_array *array, struct ref_filter *filter, unsigned int
>  	return ret;
>  }
>  
> +static inline int can_do_iterative_format(struct ref_filter *filter,
> +					  struct ref_sorting *sorting,
> +					  struct ref_format *format)
> +{
> +	/*
> +	 * Refs can be filtered and formatted in the same iteration as long
> +	 * as we aren't filtering on reachability, sorting the results, or
> +	 * including ahead-behind information in the formatted output.
> +	 */

Do we want to format this as a bulleted list so that it's more readily
extensible if we ever need to pay attention to new options here? Also, I
noted that this commit doesn't add any new tests -- do we already
exercise all of these conditions?

More generally, I worry a bit about maintainability of this code snippet
as we need to remember to always update this condition whenever we add a
new option, and this can be quite easy to miss. The performance benefit
might be worth the effort though.

Patrick

> +	return !(filter->reachable_from ||
> +		 filter->unreachable_from ||
> +		 sorting ||
> +		 format->bases.nr);
> +}
> +
>  void filter_and_format_refs(struct ref_filter *filter, unsigned int type,
>  			    struct ref_sorting *sorting,
>  			    struct ref_format *format)
>  {
> -	struct ref_array array = { 0 };
> -	filter_refs(&array, filter, type);
> -	filter_ahead_behind(the_repository, format, &array);
> -	ref_array_sort(sorting, &array);
> -	print_formatted_ref_array(&array, format);
> -	ref_array_clear(&array);
> +	if (can_do_iterative_format(filter, sorting, format)) {
> +		int save_commit_buffer_orig;
> +		struct ref_filter_and_format_cbdata ref_cbdata = {
> +			.filter = filter,
> +			.format = format,
> +		};
> +
> +		save_commit_buffer_orig = save_commit_buffer;
> +		save_commit_buffer = 0;
> +
> +		do_filter_refs(filter, type, filter_and_format_one, &ref_cbdata);
> +
> +		save_commit_buffer = save_commit_buffer_orig;
> +	} else {
> +		struct ref_array array = { 0 };
> +		filter_refs(&array, filter, type);
> +		filter_ahead_behind(the_repository, format, &array);
> +		ref_array_sort(sorting, &array);
> +		print_formatted_ref_array(&array, format);
> +		ref_array_clear(&array);
> +	}
>  }
>  
>  static int compare_detached_head(struct ref_array_item *a, struct ref_array_item *b)
> -- 
> gitgitgadget
> 
> 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Victoria Dye wrote (reply to this):

Patrick Steinhardt wrote:
>> diff --git a/ref-filter.c b/ref-filter.c
>> index ff00ab4b8d8..384cf1595ff 100644
>> --- a/ref-filter.c
>> +++ b/ref-filter.c
>> @@ -2863,6 +2863,44 @@ static void free_array_item(struct ref_array_item *item)
>>  	free(item);
>>  }
>>  
>> +struct ref_filter_and_format_cbdata {
>> +	struct ref_filter *filter;
>> +	struct ref_format *format;
>> +
>> +	struct ref_filter_and_format_internal {
>> +		int count;
>> +	} internal;
>> +};
>> +
>> +static int filter_and_format_one(const char *refname, const struct object_id *oid, int flag, void *cb_data)
>> +{
>> +	struct ref_filter_and_format_cbdata *ref_cbdata = cb_data;
>> +	struct ref_array_item *ref;
>> +	struct strbuf output = STRBUF_INIT, err = STRBUF_INIT;
>> +
>> +	ref = apply_ref_filter(refname, oid, flag, ref_cbdata->filter);
>> +	if (!ref)
>> +		return 0;
>> +
>> +	if (format_ref_array_item(ref, ref_cbdata->format, &output, &err))
>> +		die("%s", err.buf);
>> +
>> +	if (output.len || !ref_cbdata->format->array_opts.omit_empty) {
>> +		fwrite(output.buf, 1, output.len, stdout);
>> +		putchar('\n');
>> +	}
>> +
>> +	strbuf_release(&output);
>> +	strbuf_release(&err);
>> +	free_array_item(ref);
>> +
>> +	if (ref_cbdata->format->array_opts.max_count &&
>> +	    ++ref_cbdata->internal.count >= ref_cbdata->format->array_opts.max_count)
>> +		return -1;
> 
> It feels a bit weird to return a negative value here, which usually
> indicates that an error has happened whereas we only use it here to
> abort the iteration. But we ignore the return value of
> `do_iterate_refs()` anyway, so it doesn't make much of a difference.

I'll update it to 1, and also add a comment that the non-zero return value
stops iteration since it's not immediately clear from other 'each_ref_fn's
what that means. For reference, there appears to only be one other
'each_ref_fn' that even has the potential to return a nonzero return value
('ref_present()' in 'refs/files-backend.c).

> 
>> +	return 0;
>> +}
>> +
>>  /* Free all memory allocated for ref_array */
>>  void ref_array_clear(struct ref_array *array)
>>  {
>> @@ -3046,16 +3084,46 @@ int filter_refs(struct ref_array *array, struct ref_filter *filter, unsigned int
>>  	return ret;
>>  }
>>  
>> +static inline int can_do_iterative_format(struct ref_filter *filter,
>> +					  struct ref_sorting *sorting,
>> +					  struct ref_format *format)
>> +{
>> +	/*
>> +	 * Refs can be filtered and formatted in the same iteration as long
>> +	 * as we aren't filtering on reachability, sorting the results, or
>> +	 * including ahead-behind information in the formatted output.
>> +	 */
> 
> Do we want to format this as a bulleted list so that it's more readily
> extensible if we ever need to pay attention to new options here? Also, I
> noted that this commit doesn't add any new tests -- do we already
> exercise all of these conditions?

Sure, I'll convert it to a bulleted list. I don't really expect it to change
much, though; to have any effect on this condition, the new filter/format
would need to act on the pre-filtered ref_array, which isn't particularly
common.

And yes, the existing tests cover scenarios where this function returns true
(e.g. 'git for-each-ref --no-sort') & where it returns false (essentially
anything else).

> 
> More generally, I worry a bit about maintainability of this code snippet
> as we need to remember to always update this condition whenever we add a
> new option, and this can be quite easy to miss. The performance benefit
> might be worth the effort though.

I'll add more detailed comments to clarify what's going on here.

In practice, though, I don't think this would be all that easy to miss. As I
noted above, the only filters/formats that affect this are ones that need to
loop over an entire filtered ref_array after the initial
'for_each_fullref_in()'. To have it actually apply to commands that use
'filter_and_format_refs()', they'll need to add that behavior here (like
'filter_ahead_behind()'), where it should be apparent that
'can_do_iterative_format()' is relevant to their change. 

> 
> Patrick

@@ -11,6 +11,7 @@ SYNOPSIS
'git for-each-ref' [--count=<count>] [--shell|--perl|--python|--tcl]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Tue, Nov 07, 2023 at 01:26:00AM +0000, Victoria Dye via GitGitGadget wrote:
> From: Victoria Dye <vdye@github.com>
> 
> Add a boolean flag '--full-deref' that, when enabled, fills '%(*fieldname)'
> format fields using the fully peeled target of tag objects, rather than the
> immediate target.
> 
> In other builtins ('rev-parse', 'show-ref'), "dereferencing" tags typically
> means peeling them down to their non-tag target. Unlike these commands,
> 'for-each-ref' dereferences only one "level" of tags in '*' format fields
> (like "%(*objectname)"). For most annotated tags, one level of dereferencing
> is enough, since most tags point to commits or trees. However, nested tags
> (annotated tags whose target is another annotated tag) dereferenced once
> will point to their target tag, different a full peel to e.g. a commit.
> 
> Currently, if a user wants to filter & format refs and include information
> about the fully dereferenced tag, they can do so with something like
> 'cat-file --batch-check':
> 
>     git for-each-ref --format="%(objectname)^{} %(refname)" <pattern> |
>         git cat-file --batch-check="%(objectname) %(rest)"
> 
> But the combination of commands is inefficient. So, to improve the
> efficiency of this use case, add a '--full-deref' option that causes
> 'for-each-ref' to fully dereference tags when formatting with '*' fields.

I do wonder whether it would make sense to introduce this feature in the
form of a separate field prefix, as you also mentioned in your cover
letter. It would buy the user more flexibility, but the question is
whether such flexibility would really ever be needed.

The only thing I could really think of where it might make sense is to
distinguish tags that peel to a commit immediately from ones that don't.
That feels rather esoteric to me and doesn't seem to be of much use. But
regardless of whether or not we can see the usefulness now, if this
wouldn't be significantly more complex I wonder whether it would make
more sense to use a new field prefix instead anyway.

In any case, I think it would be helpful if this was discussed in the
commit message.

Patrick

> Signed-off-by: Victoria Dye <vdye@github.com>
> ---
>  Documentation/git-for-each-ref.txt |  9 ++++++++
>  builtin/for-each-ref.c             |  2 ++
>  ref-filter.c                       | 26 ++++++++++++++---------
>  ref-filter.h                       |  1 +
>  t/t6300-for-each-ref.sh            | 34 ++++++++++++++++++++++++++++++
>  5 files changed, 62 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/git-for-each-ref.txt b/Documentation/git-for-each-ref.txt
> index 407f624fbaa..2714a87088e 100644
> --- a/Documentation/git-for-each-ref.txt
> +++ b/Documentation/git-for-each-ref.txt
> @@ -11,6 +11,7 @@ SYNOPSIS
>  'git for-each-ref' [--count=<count>] [--shell|--perl|--python|--tcl]
>  		   [(--sort=<key>)...] [--format=<format>]
>  		   [ --stdin | <pattern>... ]
> +		   [--full-deref]
>  		   [--points-at=<object>]
>  		   [--merged[=<object>]] [--no-merged[=<object>]]
>  		   [--contains[=<object>]] [--no-contains[=<object>]]
> @@ -77,6 +78,14 @@ OPTIONS
>  	the specified host language.  This is meant to produce
>  	a scriptlet that can directly be `eval`ed.
>  
> +--full-deref::
> +	Populate dereferenced format fields (indicated with an asterisk (`*`)
> +	prefix before the fieldname) with information about the fully-peeled
> +	target object of a tag ref, rather than its immediate target object.
> +	This only affects the output for nested annotated tags, where the tag's
> +	immediate target is another tag but its fully-peeled target is another
> +	object type (e.g. a commit).
> +
>  --points-at=<object>::
>  	Only list refs which points at the given object.
>  
> diff --git a/builtin/for-each-ref.c b/builtin/for-each-ref.c
> index 1c19cd5bd34..7a2127a3bc4 100644
> --- a/builtin/for-each-ref.c
> +++ b/builtin/for-each-ref.c
> @@ -43,6 +43,8 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix)
>  		OPT_INTEGER( 0 , "count", &format.array_opts.max_count, N_("show only <n> matched refs")),
>  		OPT_STRING(  0 , "format", &format.format, N_("format"), N_("format to use for the output")),
>  		OPT__COLOR(&format.use_color, N_("respect format colors")),
> +		OPT_BOOL(0, "full-deref", &format.full_deref,
> +			 N_("fully dereference tags to populate '*' format fields")),
>  		OPT_REF_FILTER_EXCLUDE(&filter),
>  		OPT_REF_SORT(&sorting_options),
>  		OPT_CALLBACK(0, "points-at", &filter.points_at,
> diff --git a/ref-filter.c b/ref-filter.c
> index 384cf1595ff..a66ac7921b1 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -237,7 +237,14 @@ static struct used_atom {
>  		char *head;
>  	} u;
>  } *used_atom;
> -static int used_atom_cnt, need_tagged, need_symref;
> +static int used_atom_cnt, need_symref;
> +
> +enum tag_dereference_mode {
> +	NO_DEREF = 0,
> +	DEREF_ONE,
> +	DEREF_ALL
> +};
> +static enum tag_dereference_mode need_tagged;
>  
>  /*
>   * Expand string, append it to strbuf *sb, then return error code ret.
> @@ -1066,8 +1073,8 @@ static int parse_ref_filter_atom(struct ref_format *format,
>  	memset(&used_atom[at].u, 0, sizeof(used_atom[at].u));
>  	if (valid_atom[i].parser && valid_atom[i].parser(format, &used_atom[at], arg, err))
>  		return -1;
> -	if (*atom == '*')
> -		need_tagged = 1;
> +	if (*atom == '*' && !need_tagged)
> +		need_tagged = format->full_deref ? DEREF_ALL : DEREF_ONE;
>  	if (i == ATOM_SYMREF)
>  		need_symref = 1;
>  	return at;
> @@ -2511,14 +2518,13 @@ static int populate_value(struct ref_array_item *ref, struct strbuf *err)
>  	 * If it is a tag object, see if we use a value that derefs
>  	 * the object, and if we do grab the object it refers to.
>  	 */
> -	oi_deref.oid = *get_tagged_oid((struct tag *)obj);
> +	if (need_tagged == DEREF_ALL) {
> +		if (peel_iterated_oid(&obj->oid, &oi_deref.oid))
> +			die("bad tag");
> +	} else {
> +		oi_deref.oid = *get_tagged_oid((struct tag *)obj);
> +	}
>  
> -	/*
> -	 * NEEDSWORK: This derefs tag only once, which
> -	 * is good to deal with chains of trust, but
> -	 * is not consistent with what deref_tag() does
> -	 * which peels the onion to the core.
> -	 */
>  	return get_object(ref, 1, &obj, &oi_deref, err);
>  }
>  
> diff --git a/ref-filter.h b/ref-filter.h
> index 0ce5af58ab3..0caa39ecee5 100644
> --- a/ref-filter.h
> +++ b/ref-filter.h
> @@ -92,6 +92,7 @@ struct ref_format {
>  	const char *rest;
>  	int quote_style;
>  	int use_color;
> +	int full_deref;
>  
>  	/* Internal state to ref-filter */
>  	int need_color_reset_at_eol;
> diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
> index 0613e5e3623..3c2af785cdb 100755
> --- a/t/t6300-for-each-ref.sh
> +++ b/t/t6300-for-each-ref.sh
> @@ -1839,6 +1839,40 @@ test_expect_success 'git for-each-ref with non-existing refs' '
>  	test_must_be_empty actual
>  '
>  
> +test_expect_success 'git for-each-ref with nested tags' '
> +	git tag -am "Normal tag" nested/base HEAD &&
> +	git tag -am "Nested tag" nested/nest1 refs/tags/nested/base &&
> +	git tag -am "Double nested tag" nested/nest2 refs/tags/nested/nest1 &&
> +
> +	head_oid="$(git rev-parse HEAD)" &&
> +	base_tag_oid="$(git rev-parse refs/tags/nested/base)" &&
> +	nest1_tag_oid="$(git rev-parse refs/tags/nested/nest1)" &&
> +	nest2_tag_oid="$(git rev-parse refs/tags/nested/nest2)" &&
> +
> +	# Without full dereference
> +	cat >expect <<-EOF &&
> +	refs/tags/nested/base $base_tag_oid tag $head_oid commit
> +	refs/tags/nested/nest1 $nest1_tag_oid tag $base_tag_oid tag
> +	refs/tags/nested/nest2 $nest2_tag_oid tag $nest1_tag_oid tag
> +	EOF
> +
> +	git for-each-ref --format="%(refname) %(objectname) %(objecttype) %(*objectname) %(*objecttype)" \
> +		refs/tags/nested/ >actual &&
> +	test_cmp expect actual &&
> +
> +	# With full dereference
> +	cat >expect <<-EOF &&
> +	refs/tags/nested/base $base_tag_oid tag $head_oid commit
> +	refs/tags/nested/nest1 $nest1_tag_oid tag $head_oid commit
> +	refs/tags/nested/nest2 $nest2_tag_oid tag $head_oid commit
> +	EOF
> +
> +	git for-each-ref --full-deref \
> +		--format="%(refname) %(objectname) %(objecttype) %(*objectname) %(*objecttype)" \
> +		refs/tags/nested/ >actual &&
> +	test_cmp expect actual
> +'
> +
>  GRADE_FORMAT="%(signature:grade)%0a%(signature:key)%0a%(signature:signer)%0a%(signature:fingerprint)%0a%(signature:primarykeyfingerprint)"
>  TRUSTLEVEL_FORMAT="%(signature:trustlevel)%0a%(signature:key)%0a%(signature:signer)%0a%(signature:fingerprint)%0a%(signature:primarykeyfingerprint)"
>  
> -- 
> gitgitgadget
> 
> 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Victoria Dye wrote (reply to this):

Patrick Steinhardt wrote:
> On Tue, Nov 07, 2023 at 01:26:00AM +0000, Victoria Dye via GitGitGadget wrote:
>> From: Victoria Dye <vdye@github.com>
>>
>> Add a boolean flag '--full-deref' that, when enabled, fills '%(*fieldname)'
>> format fields using the fully peeled target of tag objects, rather than the
>> immediate target.
>>
>> In other builtins ('rev-parse', 'show-ref'), "dereferencing" tags typically
>> means peeling them down to their non-tag target. Unlike these commands,
>> 'for-each-ref' dereferences only one "level" of tags in '*' format fields
>> (like "%(*objectname)"). For most annotated tags, one level of dereferencing
>> is enough, since most tags point to commits or trees. However, nested tags
>> (annotated tags whose target is another annotated tag) dereferenced once
>> will point to their target tag, different a full peel to e.g. a commit.
>>
>> Currently, if a user wants to filter & format refs and include information
>> about the fully dereferenced tag, they can do so with something like
>> 'cat-file --batch-check':
>>
>>     git for-each-ref --format="%(objectname)^{} %(refname)" <pattern> |
>>         git cat-file --batch-check="%(objectname) %(rest)"
>>
>> But the combination of commands is inefficient. So, to improve the
>> efficiency of this use case, add a '--full-deref' option that causes
>> 'for-each-ref' to fully dereference tags when formatting with '*' fields.
> 
> I do wonder whether it would make sense to introduce this feature in the
> form of a separate field prefix, as you also mentioned in your cover
> letter. It would buy the user more flexibility, but the question is
> whether such flexibility would really ever be needed.
> 
> The only thing I could really think of where it might make sense is to
> distinguish tags that peel to a commit immediately from ones that don't.
> That feels rather esoteric to me and doesn't seem to be of much use. But
> regardless of whether or not we can see the usefulness now, if this
> wouldn't be significantly more complex I wonder whether it would make
> more sense to use a new field prefix instead anyway.
> 
> In any case, I think it would be helpful if this was discussed in the
> commit message.
I've been going back and forth on this, but I think a field specifier might
be the way to go after all. Using a field specifier would inherently be more
complex than the command line option (since the formatting code is a bit
complicated), but that's not an insurmountable problem. The thing I kept
getting caught up on was which symbol (or symbols?) to use to indicate a full
object peel. I mentioned `**fieldname` in the cover letter, but that looks
more like a double dereference than a recursive one.

I think `^{}fieldname` would be a good candidate, but it's *extremely*
important (for the sake of avoiding user confusion/frustration) that it
produces the same object & associated info as the standard revision parsing
machinery [1]. One notable difference (it might be the only one) from
`*fieldname` would be, if a ref points to a non-tag object, then that
object's information would printed (rather than an empty string). But maybe
that difference is what we'd want anyway, since it's a better one-for-one
replacement of 'git for-each-ref | git cat-file --batch-check'.

I'll try implementing that for V2. If it doesn't work for some reason,
though, I'll explain why in the commit message.

[1] https://git-scm.com/docs/git-rev-parse#Documentation/git-rev-parse.txt-emltrevgtemegemv0998em

> 
> Patrick
> 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Victoria Dye <vdye@github.com> writes:

> I think `^{}fieldname` would be a good candidate, but it's *extremely*

Gaah.  Why?  fieldname^{} I may understand, but in the prefix form?

In any case, has anybody considered that we may be better off to
declare that "*field" peeling a tag only once is a longstanding bug?

IOW, can we not add "fully peel" command line option or a new syntax
and instead just "fix" the bug to fully peel when "*field" is asked
for?

An application that cares about handling a chain of annotatetd tags
would want to be able to say "this is the outermost tag's
information; one level down, the tag was signed by this person;
another level down, the tag was signed by this person, etc."  which
would mean either

 * we have a syntax that shows the information from all levels
   (e.g., "**taggername" may say "Victoria\nPatrick\nGitster")

 * we have a syntax that allows to specify how many levels to peel,
   (e.g., "*0*taggername" may be the same as "taggername",
   "*1*taggername" may be the same as "*taggername") plus some
   programming construct like variables and loops.

but the repertoire being proposed that consists only of "peel only
once" and "peel all levels" is way too insufficient.

Note that I do not advocate for allowing inspection of each levels
separately.  Quite the contrary.  I would say that --format=<>
placeholder should not be a programming language to satisify such a
niche need.  And my conclusion from that stance is "peel once" plus
"peel all" are already one level too many, and "peel once" was a
very flawed implementation from day one, when 9f613ddd (Add
git-for-each-ref: helper for language bindings, 2006-09-15)
introduced it.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Wed, Nov 08, 2023 at 12:14:02PM +0900, Junio C Hamano wrote:
> Victoria Dye <vdye@github.com> writes:
> 
> > I think `^{}fieldname` would be a good candidate, but it's *extremely*
> 
> Gaah.  Why?  fieldname^{} I may understand, but in the prefix form?
> 
> In any case, has anybody considered that we may be better off to
> declare that "*field" peeling a tag only once is a longstanding bug?
> 
> IOW, can we not add "fully peel" command line option or a new syntax
> and instead just "fix" the bug to fully peel when "*field" is asked
> for?

I see where you're coming from, but I wonder whether this wouldn't break
scripts. To me, the documentation seems to explicitly state that this
will only deref tags once:

    If fieldname is prefixed with an asterisk (*) and the ref points at
    a tag object, use the value for the field in the object which the
    tag object refers to (instead of the field in the tag object).

So changing that now would break both the documented and the actual
behaviour. Now whether anybody actually cares about such a breaking
change is of course a different question, and you're probably correct
that in practice nobody does.

Patrick

> An application that cares about handling a chain of annotatetd tags
> would want to be able to say "this is the outermost tag's
> information; one level down, the tag was signed by this person;
> another level down, the tag was signed by this person, etc."  which
> would mean either
> 
>  * we have a syntax that shows the information from all levels
>    (e.g., "**taggername" may say "Victoria\nPatrick\nGitster")
> 
>  * we have a syntax that allows to specify how many levels to peel,
>    (e.g., "*0*taggername" may be the same as "taggername",
>    "*1*taggername" may be the same as "*taggername") plus some
>    programming construct like variables and loops.
> 
> but the repertoire being proposed that consists only of "peel only
> once" and "peel all levels" is way too insufficient.
> 
> Note that I do not advocate for allowing inspection of each levels
> separately.  Quite the contrary.  I would say that --format=<>
> placeholder should not be a programming language to satisify such a
> niche need.  And my conclusion from that stance is "peel once" plus
> "peel all" are already one level too many, and "peel once" was a
> very flawed implementation from day one, when 9f613ddd (Add
> git-for-each-ref: helper for language bindings, 2006-09-15)
> introduced it.
> 
> 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Victoria Dye wrote (reply to this):

Junio C Hamano wrote:
> Victoria Dye <vdye@github.com> writes:
> 
>> I think `^{}fieldname` would be a good candidate, but it's *extremely*
> 
> Gaah.  Why?  fieldname^{} I may understand, but in the prefix form?

'fieldname^{}' seemed like more of a misuse of "^{}" than the prefixed form,
since we're not peeling "fieldname" but instead getting the value of
"fieldname" from the peeled tag. But then we're not dereferencing
"fieldname" in '*fieldname' either, so 'fieldname^{}' is no worse than what
already exists.

> 
> In any case, has anybody considered that we may be better off to
> declare that "*field" peeling a tag only once is a longstanding bug?
> 
> IOW, can we not add "fully peel" command line option or a new syntax
> and instead just "fix" the bug to fully peel when "*field" is asked
> for?

I'd certainly prefer that from a technical standpoint; it simplifies this
patch if I can just replace 'get_tagged_oid' with 'peel_iterated_oid'. The
two things that make me hesitate are:

1. There isn't a straightforward 1:1 substitute available for getting info
   on the immediate target of a list of tags. 
2. The performance of a recursive peel can be worse than that of a single
   tag dereference, since (unless the formatting is done in a ref_iterator
   iteration *and* the tag is a packed ref) the dereferenced object needs to
   be resolved to determine whether it's another tag or not.

#1 may not be an issue in practice, but I don't have enough information on
how applications use that formatting atom to say for sure. #2 is a bigger
issue, IMO, since one of the goals of this series was to improve performance
for some cases of 'for-each-ref' without hurting it in others.

> An application that cares about handling a chain of annotatetd tags
> would want to be able to say "this is the outermost tag's
> information; one level down, the tag was signed by this person;
> another level down, the tag was signed by this person, etc."  which
> would mean either
> 
>  * we have a syntax that shows the information from all levels
>    (e.g., "**taggername" may say "Victoria\nPatrick\nGitster")
> 
>  * we have a syntax that allows to specify how many levels to peel,
>    (e.g., "*0*taggername" may be the same as "taggername",
>    "*1*taggername" may be the same as "*taggername") plus some
>    programming construct like variables and loops.
> 
> but the repertoire being proposed that consists only of "peel only
> once" and "peel all levels" is way too insufficient.
> 
> Note that I do not advocate for allowing inspection of each levels
> separately.  Quite the contrary.  I would say that --format=<>
> placeholder should not be a programming language to satisify such a
> niche need.  And my conclusion from that stance is "peel once" plus
> "peel all" are already one level too many, and "peel once" was a
> very flawed implementation from day one, when 9f613ddd (Add
> git-for-each-ref: helper for language bindings, 2006-09-15)
> introduced it.

I can (and would like to) deprecate the "peel once" behavior and replace it
with "peel all", but with how long it's been around and the potential
performance impact, such a change should probably be clearly communicated.
How that happens depends on how aggressively we want to cut over. We could:

1. Change the behavior of '*' from single dereference to recursive
   dereference, make a note of it in the documentation.
2. Same as #1, but also add an option like '--no-recursive-dereference' or
   something to use the old behavior. Remove the option after 1-2 release
   cycles?
3. Add a new format specifier '^{}', note that '*' is deprecated in the
   docs.
4. Same as #3, but also show a warning/advice if '*' is used.
5. Same as #3, but die() if '*' is used.

I'm open to other options, those were just the first few I could think of. 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Victoria Dye <vdye@github.com> writes:

> I can (and would like to) deprecate the "peel once" behavior and replace it
> with "peel all", but with how long it's been around and the potential
> performance impact, such a change should probably be clearly communicated.

I've written a fairly detailed response on this about the reason why
I think that "leave a mention in the backward compatibility notes
section of the release notes" (your #1) is sufficient, but it seems
to have been lost in the ether.  I'll wait a bit and if the previous
response does not materialize, I may type it again.

But in addition to what I wrote there, there is this thread [*] from
2019 that indicates that our position is to mildly discourage
tag-to-tag in the first place.


[Reference]

* https://lore.kernel.org/git/20190404020226.GG4409@sigill.intra.peff.net/

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Victoria Dye <vdye@github.com> writes:

> I'd certainly prefer that from a technical standpoint; it simplifies this
> patch if I can just replace 'get_tagged_oid' with 'peel_iterated_oid'. The
> two things that make me hesitate are:
>
> 1. There isn't a straightforward 1:1 substitute available for getting info
>    on the immediate target of a list of tags. 
> 2. The performance of a recursive peel can be worse than that of a single
>    tag dereference, since (unless the formatting is done in a ref_iterator
>    iteration *and* the tag is a packed ref) the dereferenced object needs to
>    be resolved to determine whether it's another tag or not.
>
> #1 may not be an issue in practice, but I don't have enough information on
> how applications use that formatting atom to say for sure. #2 is a bigger
> issue, IMO, since one of the goals of this series was to improve performance
> for some cases of 'for-each-ref' without hurting it in others.

In a repository without any tag-to-tag at tips of refs, would #2
above still be an issue?  My assumption when I raised "isn't this
simply a bug?" question was that the use of tag-to-tag is a mere
intellectual curiosity, there is no serious use case, and they are
not heavily used.  Hence I was envisioning that #1 below (i.e., a
mention in the Release Notes' backward compatibility notes section)
would be sufficient.

If it weren't the case, then I do not think any "transition" would
work, either.

And stepping back a bit, even though "peel only once" is how
for-each-ref works, I do not think anybody who really cares about
tag-to-tag and inspecting each level of peeled tag is helped by it
all that much.  Yes, you can get the result of single level peeling
via "git format-patch --format=%(*objectname)", but then what would
you do to dig further from that point?  You cannot ask rev-parse to
peel the result with "^{}", as that will peel all the way down.

You have to feed it to "git cat-file tag" and parse the contents of
the tag obbject yourself to manually peel further levels of onion.
Anybody who do care must already have such a machinery, and such a
machinery does not depend on "git for-each-ref --format='%(*field)'"
peeling just once, I would say.  They would most likely learn the
"%(objectname) %(objecttype) %(refname)" from the command, and for
those that are tags, they would manually peel the object with such a
machinery, because they have to do that for second and further
levels anyway.

And that is why I am not so worried about "breaking" existing users
in this particular case.  Our existing support with tag-to-tag is so
poor that those who truly need it would have invented necessary
support without relying on for-each-ref's peeling (if such people
did exist, that is).

But perhaps I am so overly optimistic against Hyrum's law.

> I can (and would like to) deprecate the "peel once" behavior and replace it
> with "peel all", but with how long it's been around and the potential
> performance impact, such a change should probably be clearly communicated.
> How that happens depends on how aggressively we want to cut over. We could:
>
> 1. Change the behavior of '*' from single dereference to recursive
>    dereference, make a note of it in the documentation.
> 2. Same as #1, but also add an option like '--no-recursive-dereference' or
>    something to use the old behavior. Remove the option after 1-2 release
>    cycles?
> 3. Add a new format specifier '^{}', note that '*' is deprecated in the
>    docs.
> 4. Same as #3, but also show a warning/advice if '*' is used.
> 5. Same as #3, but die() if '*' is used.
>
> I'm open to other options, those were just the first few I could think of. 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Junio C Hamano <gitster@pobox.com> writes:

> You have to feed it to "git cat-file tag" and parse the contents of
> the tag obbject yourself to manually peel further levels of onion.

Alternatively, you can drive "git show -s" with "--format" and you
probably can produce a machine parseable output.

But it does not change the argument fundamentally.  The point is
that "for-each-ref --format=%(*field)" that peels only the first
layer would not have helped all that much, if somebody really cares
about each levels of nested tags.  They would have been relying on
a solution to deal with the second and further layers anyway, and
that solution would have been working with the first layer, too.

@@ -101,7 +101,8 @@ OPTIONS

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Øystein Walle wrote (reply to this):

Hi Victoria,

Victoria Dye <vdye@github.com> writes:

> Update the 'for-each-ref' builtin documentation to clarify that refs
> "omitted" by --omit-empty are still counted toward the limit specified
> by --count. The use of the term "omit" would otherwise be somewhat
> ambiguous and could incorrectly be construed as excluding empty refs
> entirely (i.e. not counting them towards the total ref count).

I implemented --omit-empty and I completely overlooked --count!

(If I were to do it all over again I probably would have implemented it
so that so-called omitted refs did not count towards the total. It makes
sense to me since e.g.  `git log -3 -- git.c` prints the three most
recent commits that touch git.c regardless of how many commits were
walked in the process.)

This is a good and welcome clarification. 

Acked-by: Øystein Walle <oystwa@gmail.com>

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Victoria Dye wrote (reply to this):

Øystein Walle wrote:
> Hi Victoria,
> 
> Victoria Dye <vdye@github.com> writes:
> 
>> Update the 'for-each-ref' builtin documentation to clarify that refs
>> "omitted" by --omit-empty are still counted toward the limit specified
>> by --count. The use of the term "omit" would otherwise be somewhat
>> ambiguous and could incorrectly be construed as excluding empty refs
>> entirely (i.e. not counting them towards the total ref count).
> 
> I implemented --omit-empty and I completely overlooked --count!
> 
> (If I were to do it all over again I probably would have implemented it
> so that so-called omitted refs did not count towards the total. It makes
> sense to me since e.g.  `git log -3 -- git.c` prints the three most
> recent commits that touch git.c regardless of how many commits were
> walked in the process.)

Since the interaction isn't clearly defined at the moment, we could probably
still update it to work like you're describing here. I'm happy to drop this
patch and implement your recommendation in a follow-up series. Let me know
what you think!

> 
> This is a good and welcome clarification. 
> 
> Acked-by: Øystein Walle <oystwa@gmail.com>

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Øystein Walle wrote (reply to this):

On Tue, 7 Nov 2023 at 20:30, Victoria Dye <vdye@github.com> wrote:

> Since the interaction isn't clearly defined at the moment, we could probably
> still update it to work like you're describing here. I'm happy to drop this
> patch and implement your recommendation in a follow-up series. Let me know
> what you think!

Regardless of whether the logic is changed in a follow-up series or not
I think the current behavior is worth documenting even if it doesn't
exist for much longer in the tree. So I am favor of having this patch as
part of this series.

I am also in favor of changing the behavior but that warrants a separate
series and discussion.

Øsse

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, "Kristoffer Haugsbakk" wrote (reply to this):

On Wed, Nov 8, 2023, at 08:53, Øystein Walle wrote:
> On Tue, 7 Nov 2023 at 20:30, Victoria Dye <vdye@github.com> wrote:
>
>> Since the interaction isn't clearly defined at the moment, we could probably
>> still update it to work like you're describing here. I'm happy to drop this
>> patch and implement your recommendation in a follow-up series. Let me know
>> what you think!
>
> Regardless of whether the logic is changed in a follow-up series or not
> I think the current behavior is worth documenting even if it doesn't
> exist for much longer in the tree. So I am favor of having this patch as
> part of this series.

The funny thing though is that once it’s documented then you also kind of
commit yourself to it, right? That it’s how it’s supposed to behave.[1] If
you instead change the behavior (to the correct one) and document it in
the same series then there is no in-between time when people can claim to
rely on it via the documentation.

[1] Modulo “subject to change” hedging, but it seems that even
    experimental commands who are documented as that are now resistant to
    change in practice.

Copy link

gitgitgadget bot commented Nov 7, 2023

User Øystein Walle <oystwa@gmail.com> has been added to the cc: list.

Copy link

gitgitgadget bot commented Nov 8, 2023

On the Git mailing list, Victoria Dye wrote (reply to this):

Patrick Steinhardt wrote:
> On Mon, Nov 06, 2023 at 06:48:29PM -0800, Victoria Dye wrote:
>> Junio C Hamano wrote:
>>> "Victoria Dye via GitGitGadget" <gitgitgadget@gmail.com> writes:
> [snip]
>>>>  * I'm not attached to '--full-deref' as a name - if someone has an idea for
>>>>    a more descriptive name, please suggest it!
>>>
>>> Another candidate verb may be "to peel", and I have no strong
>>> opinion between it and "to dereference".  But I have a mild aversion
>>> to an abbreviation that is not strongly established.
>>>
>>
>> Makes sense. I got the "deref" abbreviation for 'update-ref --no-deref', but
>> 'show-ref' has a "--dereference" option and protocol v2's "ls-refs" includes
>> a "peel" arg. "Dereference" is the term already used in the 'for-each-ref'
>> documentation, though, so if no one comes in with an especially strong
>> opinion on this I'll change the option to '--full-dereference'. Thanks!
> 
> But doesn't dereferencing in the context of git-update-ref(1) refer to
> something different? It's not about tags, but it is about symbolic
> references and whether we want to update the symref or the pointee. But
> true enough, in git-show-ref(1) "dereference" actually means that we
> should peel the tag.

Since both annotated tags and symbolic refs are essentially pointers, it's
not surprising that they both use the term "dereference." Even though
"deref" refers to symbolic refs in 'update-ref', its existence as an
abbreviation for "dereference" is relevant when coming up with a way to
abbreviate "dereference" when referring to tags.

> 
> To me it feels like preexisting commands are confused already. In my
> mind model:
> 
>     - "peel" means that an object gets resolved to one of its pointees.
>       This also includes the case here, where a tag gets peeled to its
>       pointee.
> 
>     - "dereference" means that a symbolic reference gets resolved to its
>       pointee. This matches what we do in `git update-ref --no-deref`.
> 
> But after reading through the code I don't think we distinguish those
> terms cleanly throughout our codebase. Still, "peeling" feels like a
> better match in my opinion.

Hmm. I think I mostly agree on your definition of "peel". In the docs, it's
used to refer to:

- recursively resolving an OID to an object of a specified type [1]
- recursively resolving a tag OID to a non-tag object [2]

Notably, there seems to be a strong association of "peeling" to "recursive
resolution". Which means it doesn't necessarily describe what "*" currently
does.

"Dereference" generally seems like a looser term than what you've suggested.
It does refer to symbolic ref resolution as you describe [3], but "recursive
dereference" is definitely also a synonym for "peel" [4]. That, combined
with the fact that "*" is the "dereference operator", leads me to believe
that "%(*fieldname)" would accurately be described as a "tag dereference"
field in the context of 'for-each-ref'.

As I mentioned in [5], I'm going to try adding this functionality with a
field specifier rather than a command line option, so the name of the option
might be moot. But, since dereferencing/peeling will still be relevant to
the changes, I'll make sure the terminology I use in the documentation is as
precise as possible (i.e., use "peel" where I previously used "fully
dereference").

Separately, this has inspired me to revisit something I've been putting off,
which is to add a definition for "peel" (and now probably "dereference" as
well) in 'gitglossary'. I'll try to send that out in the next couple days.

Thanks!

[1] https://git-scm.com/docs/git-rev-parse#Documentation/git-rev-parse.txt---verify
[2] https://git-scm.com/docs/gitprotocol-v2#_ls_refs
[3] https://git-scm.com/docs/gitglossary#Documentation/gitglossary.txt-aiddefsymrefasymref
[4] https://git-scm.com/docs/gitglossary#Documentation/gitglossary.txt-aiddefcommit-ishacommit-ishalsocommittish
[5] https://lore.kernel.org/git/cf691b7c-288f-4cc9-a2ac-1a43972ae446@github.com/

> 
> Patrick

Copy link

gitgitgadget bot commented Nov 8, 2023

This patch series was integrated into seen via git@68b0b2f.

Copy link

gitgitgadget bot commented Nov 8, 2023

User "Kristoffer Haugsbakk" <code@khaugsbakk.name> has been added to the cc: list.

Copy link

gitgitgadget bot commented Nov 8, 2023

This patch series was integrated into seen via git@eb1fb79.

Copy link

gitgitgadget bot commented Nov 8, 2023

There was a status update in the "Cooking" section about the branch vd/for-each-ref-unsorted-optimization on the Git mailing list:

"git for-each-ref --no-sort" still sorted the refs alphabetically
which paid non-trivial cost.  It has been redefined to show output
in an unspecified order, to allow certain optimizations to take
advantage of.

Expecting a reroll.
cf. <dbcbcf0e-aeee-4bb9-9e39-e2e85194d083@github.com>
source: <pull.1609.git.1699320361.gitgitgadget@gmail.com>

Copy link

gitgitgadget bot commented Nov 9, 2023

This patch series was integrated into seen via git@09cb485.

vdye added 2 commits November 13, 2023 14:22
Update 'filter_and_format_refs()' to try to perform ref filtering &
formatting in a single ref iteration, without an intermediate 'struct
ref_array'. This can only be done if no operations need to be performed on a
pre-filtered array; specifically, if the refs are

- filtered on reachability,
- sorted, or
- formatted with ahead-behind information

they cannot be filtered & formatted in the same iteration. In that case,
fall back on the current filter-then-sort-then-format flow.

This optimization substantially improves memory usage due to no longer
storing a ref array in memory. In some cases, it also dramatically reduces
runtime (e.g. 'git for-each-ref --no-sort --count=1', which no longer loads
all refs into a 'struct ref_array' to printing only the first ref).

Signed-off-by: Victoria Dye <vdye@github.com>
Move the description of the `*` prefix from the --format option
documentation to the part of the command documentation that deals with other
object type-specific modifiers. Also reorganize and reword the remaining
--format documentation so that the explanation of the default format doesn't
interrupt the details on format string interpolation.

Signed-off-by: Victoria Dye <vdye@github.com>
Copy link

gitgitgadget bot commented Nov 14, 2023

This patch series was integrated into seen via git@d04cc19.

Copy link

gitgitgadget bot commented Nov 14, 2023

This patch series was integrated into seen via git@313630d.

Copy link

gitgitgadget bot commented Nov 14, 2023

This patch series was integrated into seen via git@6376dce.

Copy link

gitgitgadget bot commented Nov 14, 2023

There was a status update in the "Cooking" section about the branch vd/for-each-ref-unsorted-optimization on the Git mailing list:

"git for-each-ref --no-sort" still sorted the refs alphabetically
which paid non-trivial cost.  It has been redefined to show output
in an unspecified order, to allow certain optimizations to take
advantage of.

Expecting a reroll.
cf. <dbcbcf0e-aeee-4bb9-9e39-e2e85194d083@github.com>
source: <pull.1609.git.1699320361.gitgitgadget@gmail.com>

vdye added 2 commits November 14, 2023 11:01
In most builtins ('rev-parse <revision>^{}', 'show-ref --dereference'),
"dereferencing" a tag refers to a recursive peel of the tag object. Unlike
these cases, the dereferencing prefix ('*') in 'for-each-ref' format
specifiers triggers only a single, non-recursive dereference of a given tag
object. For most annotated tags, a single dereference is all that is needed
to access the tag's associated commit or tree; "recursive" and
"non-recursive" dereferencing are functionally equivalent in these cases.
However, nested tags (annotated tags whose target is another annotated tag)
dereferenced once return another tag, where a recursive dereference would
return the commit or tree.

Currently, if a user wants to filter & format refs and include information
about a recursively-dereferenced tag, they can do so with something like
'cat-file --batch-check':

    git for-each-ref --format="%(objectname)^{} %(refname)" <pattern> |
        git cat-file --batch-check="%(objectname) %(rest)"

But the combination of commands is inefficient. So, to improve the
performance of this use case and align the defererencing behavior of
'for-each-ref' with that of other commands, update the ref formatting code
to use the peeled tag (from 'peel_iterated_oid()') to populate '*' fields
rather than the tag's immediate target object (from 'get_tagged_oid()').

Additionally, add a test to 't6300-for-each-ref' to verify new nested tag
behavior and update 't6302-for-each-ref-filter.sh' to print the correct
value for nested dereferenced fields.

Signed-off-by: Victoria Dye <vdye@github.com>
Add performance tests for 'for-each-ref'. The tests exercise different
combinations of filters/formats/options, as well as the overall performance
of 'git for-each-ref | git cat-file --batch-check' to demonstrate the
performance difference vs. 'git for-each-ref' with "%(*fieldname)" format
specifiers.

All tests are run against a repository with 40k loose refs - 10k commits,
each having a unique:

- branch
- custom ref (refs/custom/special_*)
- annotated tag pointing at the commit
- annotated tag pointing at the other annotated tag (i.e., a nested tag)

After those tests are finished, the refs are packed with 'pack-refs --all'
and the same tests are rerun.

Signed-off-by: Victoria Dye <vdye@github.com>
@vdye
Copy link
Author

vdye commented Nov 14, 2023

/submit

Copy link

gitgitgadget bot commented Nov 14, 2023

Submitted as pull.1609.v2.git.1699991638.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1609/vdye/vdye/for-each-ref-optimizations-v2

To fetch this version to local tag pr-1609/vdye/vdye/for-each-ref-optimizations-v2:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1609/vdye/vdye/for-each-ref-optimizations-v2

@@ -767,7 +767,13 @@ int cmd_branch(int argc, const char **argv, const char *prefix)
if (argc == 2 && !strcmp(argv[1], "-h"))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Victoria Dye via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Victoria Dye <vdye@github.com>
>
> When '--no-sort' is passed to 'for-each-ref', 'tag', and 'branch', the
> printed refs are still sorted by ascending refname. Change the handling of
> sort options in these commands so that '--no-sort' to truly disables
> sorting.

"to truly disables" -> "truly disables" I think?

> '--no-sort' does not disable sorting in these commands is because their

"'--no-sort' does not" -> "The reason why '--no-sort' does not", or
"is because" -> "because".

> option parsing does not distinguish between "the absence of '--sort'"
> (and/or values for tag.sort & branch.sort) and '--no-sort'. Both result in
> an empty 'sorting_options' string list, which is parsed by
> 'ref_sorting_options()' to create the 'struct ref_sorting *' for the
> command. If the string list is empty, 'ref_sorting_options()' interprets
> that as "the absence of '--sort'" and returns the default ref sorting
> structure (equivalent to "refname" sort).
>
> To handle '--no-sort' properly while preserving the "refname" sort in the
> "absence of --sort'" case, first explicitly add "refname" to the string list
> *before* parsing options. This alone doesn't actually change any behavior,
> since 'compare_refs()' already falls back on comparing refnames if two refs
> are equal w.r.t all other sort keys.
>
> Now that the string list is populated by default, '--no-sort' is the only
> way to empty the 'sorting_options' string list. Update
> 'ref_sorting_options()' to return a NULL 'struct ref_sorting *' if the
> string list is empty, and add a condition to 'ref_array_sort()' to skip the
> sort altogether if the sort structure is NULL. Note that other functions
> using 'struct ref_sorting *' do not need any changes because they already
> ignore NULL values.

Nice.

> Finally, remove the condition around sorting in 'ls-remote', since it's no
> longer necessary. Unlike 'for-each-ref' et. al., it does *not* do any
> sorting by default. This default is preserved by simply leaving its sort key
> string list empty before parsing options; if no additional sort keys are
> set, 'struct ref_sorting *' is NULL and sorting is skipped.

Doubly nice.

> diff --git a/ref-filter.c b/ref-filter.c
> index e4d3510e28e..7250089b7c6 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -3142,7 +3142,8 @@ void ref_sorting_set_sort_flags_all(struct ref_sorting *sorting,
>  
>  void ref_array_sort(struct ref_sorting *sorting, struct ref_array *array)
>  {
> -	QSORT_S(array->items, array->nr, compare_refs, sorting);
> +	if (sorting)
> +		QSORT_S(array->items, array->nr, compare_refs, sorting);
>  }

Nice.  We do allow passing NULL to ref_sorting_release(), and we can
return NULL from ref_sorting_options(), and allowing NULL to be
passed to this function makes it easier for the callers to deal with
the case where no sorting is specified.

> diff --git a/t/t3200-branch.sh b/t/t3200-branch.sh
> index 3182abde27f..9918ba05dec 100755
> --- a/t/t3200-branch.sh
> +++ b/t/t3200-branch.sh
> @@ -1570,9 +1570,10 @@ test_expect_success 'tracking with unexpected .fetch refspec' '
>  
>  test_expect_success 'configured committerdate sort' '
>  	git init -b main sort &&
> +	test_config -C sort branch.sort "committerdate" &&
> +
>  	(
>  		cd sort &&
> -		git config branch.sort committerdate &&
>  		test_commit initial &&
>  		git checkout -b a &&
>  		test_commit a &&
> @@ -1592,9 +1593,10 @@ test_expect_success 'configured committerdate sort' '
>  '
>  
>  test_expect_success 'option override configured sort' '
> +	test_config -C sort branch.sort "committerdate" &&
> +
>  	(
>  		cd sort &&
> -		git config branch.sort committerdate &&
>  		git branch --sort=refname >actual &&
>  		cat >expect <<-\EOF &&
>  		  a

The above two are not strictly necessary for the purpose of this
patch, in that the tests that come after these tests do not care if
the branch.sort configuration variable is set in the "sort"
repository, as they set their own value before doing their test.

But of course, cleaning up after yourself with test_config and
friends is a good idea regardless, and a handful of new tests added
after this point follow the same pattern.  Good.

> @@ -1606,10 +1608,70 @@ test_expect_success 'option override configured sort' '
>  	)
>  '
>  
> +test_expect_success '--no-sort cancels config sort keys' '
> +	test_config -C sort branch.sort "-refname" &&
> +
> +	(
> +		cd sort &&
> +
> +		# objecttype is identical for all of them, so sort falls back on
> +		# default (ascending refname)

Interesting.

> +		git branch \
> +			--no-sort \
> +			--sort="objecttype" >actual &&
> +		cat >expect <<-\EOF &&
> +		  a
> +		* b
> +		  c
> +		  main
> +		EOF
> +		test_cmp expect actual
> +	)
> +
> +'
> +
> +test_expect_success '--no-sort cancels command line sort keys' '
> +	(
> +		cd sort &&
> +
> +		# objecttype is identical for all of them, so sort falls back on
> +		# default (ascending refname)
> +		git branch \
> +			--sort="-refname" \
> +			--no-sort \
> +			--sort="objecttype" >actual &&

OK, this exercises the same "--no-sort cleans the slate" as before,
and for this one it is essential that we lack branch.sort after the
previous step is done, which is ensured thanks to the use of
test_config in the previous one.  Nice.

> +		cat >expect <<-\EOF &&
> +		  a
> +		* b
> +		  c
> +		  main
> +		EOF
> +		test_cmp expect actual
> +	)
> +'
> +
> +test_expect_success '--no-sort without subsequent --sort prints expected branches' '
> +	(
> +		cd sort &&
> +
> +		# Sort the results with `sort` for a consistent comparison
> +		# against expected
> +		git branch --no-sort | sort >actual &&
> +		cat >expect <<-\EOF &&
> +		  a
> +		  c
> +		  main
> +		* b
> +		EOF
> +		test_cmp expect actual
> +	)
> +'
> +
>  test_expect_success 'invalid sort parameter in configuration' '
> +	test_config -C sort branch.sort "v:notvalid" &&
> +
>  	(
>  		cd sort &&
> -		git config branch.sort "v:notvalid" &&
>  
>  		# this works in the "listing" mode, so bad sort key
>  		# is a dying offence.

With such an invalid configuration value set, running the command
with "--no-sort" would stop the command from failing?  Is that worth
protecting with a new test, I wonder.

Overall very nicely done.

Thanks.

@@ -438,8 +437,6 @@ static void print_ref_list(struct ref_filter *filter, struct ref_sorting *sortin
{
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Victoria Dye via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This consolidates much of the code used to filter and format refs in
> 'builtin/for-each-ref.c', 'builtin/tag.c', and 'builtin/branch.c', reducing
> duplication and simplifying the future changes needed to optimize the filter
> & format process.
>
> Signed-off-by: Victoria Dye <vdye@github.com>
> ---
>  builtin/branch.c       | 33 +++++++++++++++++----------------
>  builtin/for-each-ref.c | 27 +--------------------------
>  builtin/tag.c          | 23 +----------------------
>  ref-filter.c           | 35 +++++++++++++++++++++++++++++++++++
>  ref-filter.h           | 14 ++++++++++++++
>  5 files changed, 68 insertions(+), 64 deletions(-)

The amount of existing duplication of code is rather surprising, and
this patch nicely refactors to improve.  Good.

Thanks.

@@ -298,6 +295,10 @@ fields will correspond to the appropriate date or name-email-date tuple
from the `committer` or `tagger` fields depending on the object type.
These are intended for working on a mix of annotated and lightweight tags.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Victoria Dye via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Victoria Dye <vdye@github.com>
>
> In most builtins ('rev-parse <revision>^{}', 'show-ref --dereference'),
> "dereferencing" a tag refers to a recursive peel of the tag object. Unlike
> these cases, the dereferencing prefix ('*') in 'for-each-ref' format
> specifiers triggers only a single, non-recursive dereference of a given tag
> object. For most annotated tags, a single dereference is all that is needed
> to access the tag's associated commit or tree; "recursive" and
> "non-recursive" dereferencing are functionally equivalent in these cases.
> However, nested tags (annotated tags whose target is another annotated tag)
> dereferenced once return another tag, where a recursive dereference would
> return the commit or tree.

This may be the only potentially controversial step in the series.

> -	/*
> -	 * NEEDSWORK: This derefs tag only once, which
> -	 * is good to deal with chains of trust, but
> -	 * is not consistent with what deref_tag() does
> -	 * which peels the onion to the core.
> -	 */
>  	return get_object(ref, 1, &obj, &oi_deref, err);
>  }

Very nice to see an ancient comment I added at 9f613ddd (Add
git-for-each-ref: helper for language bindings, 2006-09-15) finally
go.

Thanks.

Copy link

gitgitgadget bot commented Nov 16, 2023

This patch series was integrated into seen via git@5f13dd4.

@@ -45,7 +45,6 @@ static const char *head;
static struct object_id head_oid;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Øystein Walle wrote (reply to this):

Victoria Dye <vdye@github.com> writes:

> diff --git a/ref-filter.h b/ref-filter.h
> index 1524bc463a5..d87d61238b7 100644
> --- a/ref-filter.h
> +++ b/ref-filter.h
> @@ -92,6 +92,11 @@ struct ref_format {
>  
>  	/* List of bases for ahead-behind counts. */
>  	struct string_list bases;
> +
> +	struct {
> +		int max_count;
> +		int omit_empty;
> +	} array_opts;
>  };

What the benefit of having them in a nested struct is compared to just
two distinct members?

Regardless this is the kind of deduplication I wanted to achieve when I
added --omit-empty, but never did. Either way, I meant to ack this in
the last round never got around to it. Nice work.

Acked-by: Øystein Walle <oystwa@gmail.com>

Øsse

Copy link

gitgitgadget bot commented Nov 17, 2023

This patch series was integrated into seen via git@cb7e83b.

Copy link

gitgitgadget bot commented Nov 17, 2023

This patch series was integrated into next via git@ff99420.

@gitgitgadget gitgitgadget bot added the next label Nov 17, 2023
Copy link

gitgitgadget bot commented Nov 17, 2023

There was a status update in the "Cooking" section about the branch vd/for-each-ref-unsorted-optimization on the Git mailing list:

"git for-each-ref --no-sort" still sorted the refs alphabetically
which paid non-trivial cost.  It has been redefined to show output
in an unspecified order, to allow certain optimizations to take
advantage of.

Will cook in 'next'.
source: <pull.1609.v2.git.1699991638.gitgitgadget@gmail.com>

Copy link

gitgitgadget bot commented Nov 20, 2023

This patch series was integrated into seen via git@784d2c3.

Copy link

gitgitgadget bot commented Nov 20, 2023

There was a status update in the "Cooking" section about the branch vd/for-each-ref-unsorted-optimization on the Git mailing list:

"git for-each-ref --no-sort" still sorted the refs alphabetically
which paid non-trivial cost.  It has been redefined to show output
in an unspecified order, to allow certain optimizations to take
advantage of.

Will cook in 'next'.
source: <pull.1609.v2.git.1699991638.gitgitgadget@gmail.com>

Copy link

gitgitgadget bot commented Nov 27, 2023

There was a status update in the "Cooking" section about the branch vd/for-each-ref-unsorted-optimization on the Git mailing list:

"git for-each-ref --no-sort" still sorted the refs alphabetically
which paid non-trivial cost.  It has been redefined to show output
in an unspecified order, to allow certain optimizations to take
advantage of.

Will merge to 'master'.
source: <pull.1609.v2.git.1699991638.gitgitgadget@gmail.com>

Copy link

gitgitgadget bot commented Dec 9, 2023

This patch series was integrated into seen via git@b38c0bd.

Copy link

gitgitgadget bot commented Dec 9, 2023

There was a status update in the "Cooking" section about the branch vd/for-each-ref-unsorted-optimization on the Git mailing list:

"git for-each-ref --no-sort" still sorted the refs alphabetically
which paid non-trivial cost.  It has been redefined to show output
in an unspecified order, to allow certain optimizations to take
advantage of.

Will merge to 'master'.
source: <pull.1609.v2.git.1699991638.gitgitgadget@gmail.com>

Copy link

gitgitgadget bot commented Dec 10, 2023

This patch series was integrated into seen via git@98d0a1f.

Copy link

gitgitgadget bot commented Dec 10, 2023

This patch series was integrated into master via git@98d0a1f.

Copy link

gitgitgadget bot commented Dec 10, 2023

This patch series was integrated into next via git@98d0a1f.

@gitgitgadget gitgitgadget bot added the master label Dec 10, 2023
Copy link

gitgitgadget bot commented Dec 10, 2023

Closed via 98d0a1f.

@gitgitgadget gitgitgadget bot closed this Dec 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant