Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trailer readability cleanups #1563

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion builtin/commit.c
Original file line number Diff line number Diff line change
Expand Up @@ -900,7 +900,7 @@ static int prepare_to_commit(const char *index_file, const char *prefix,
strbuf_stripspace(&sb, '\0');

if (signoff)
append_signoff(&sb, ignore_non_trailer(sb.buf, sb.len), 0);
append_signoff(&sb, ignored_log_message_bytes(sb.buf, sb.len), 0);

if (fwrite(sb.buf, 1, sb.len, s->fp) < sb.len)
die_errno(_("could not write commit template"));
Expand Down
2 changes: 1 addition & 1 deletion builtin/merge.c
Original file line number Diff line number Diff line change
Expand Up @@ -870,7 +870,7 @@ static void prepare_to_commit(struct commit_list *remoteheads)
_(no_scissors_editor_comment), comment_line_char);
}
if (signoff)
append_signoff(&msg, ignore_non_trailer(msg.buf, msg.len), 0);
append_signoff(&msg, ignored_log_message_bytes(msg.buf, msg.len), 0);
write_merge_heads(remoteheads);
write_file_buf(git_path_merge_msg(the_repository), msg.buf, msg.len);
if (run_commit_hook(0 < option_edit, get_index_file(), NULL,
Expand Down
2 changes: 1 addition & 1 deletion commit.c
Original file line number Diff line number Diff line change
Expand Up @@ -1769,7 +1769,7 @@ const char *find_commit_header(const char *msg, const char *key, size_t *out_len
* Returns the number of bytes from the tail to ignore, to be fed as
* the second parameter to append_signoff().
*/
size_t ignore_non_trailer(const char *buf, size_t len)
size_t ignored_log_message_bytes(const char *buf, size_t len)
{
size_t boc = 0;
size_t bol = 0;
Expand Down
4 changes: 2 additions & 2 deletions commit.h
Original file line number Diff line number Diff line change
Expand Up @@ -294,8 +294,8 @@ const char *find_header_mem(const char *msg, size_t len,
const char *find_commit_header(const char *msg, const char *key,
size_t *out_len);

/* Find the end of the log message, the right place for a new trailer. */
size_t ignore_non_trailer(const char *buf, size_t len);
/* Find the number of bytes to ignore from the end of a log message. */
size_t ignored_log_message_bytes(const char *buf, size_t len);

typedef int (*each_mergetag_fn)(struct commit *commit, struct commit_extra_header *extra,
void *cb_data);
Expand Down
2 changes: 1 addition & 1 deletion sequencer.c
Original file line number Diff line number Diff line change
Expand Up @@ -345,7 +345,7 @@ static int has_conforming_footer(struct strbuf *sb, struct strbuf *sob,
if (ignore_footer)
sb->buf[sb->len - ignore_footer] = saved_char;

if (info.trailer_start == info.trailer_end)
if (info.trailer_block_start == info.trailer_block_end)
return 0;

for (i = 0; i < info.trailer_nr; i++)
Expand Down
85 changes: 51 additions & 34 deletions trailer.c
Original file line number Diff line number Diff line change
Expand Up @@ -809,28 +809,57 @@ static ssize_t last_line(const char *buf, size_t len)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jonathan Tan wrote (reply to this):

"Linus Arver via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Linus Arver <linusa@google.com>
> 
> Previously, trailer_info_get() computed the trailer block end position
> by
> 
> (1) checking for the opts->no_divider flag and optionally calling
>     find_patch_start() to find the "patch start" location (patch_start), and
> (2) calling find_trailer_end() to find the end of the trailer block
>     using patch_start as a guide, saving the return value into
>     "trailer_end".
> 
> The logic in (1) was awkward because the variable "patch_start" is
> misleading if there is no patch in the input. The logic in (2) was
> misleading because it could be the case that no trailers are in the
> input (yet we are setting a "trailer_end" variable before even searching
> for trailers, which happens later in find_trailer_start()). The name
> "find_trailer_end" was misleading because that function did not look for
> any trailer block itself --- instead it just computed the end position
> of the log message in the input where the end of the trailer block (if
> it exists) would be (because trailer blocks must always come after the
> end of the log message).

I might be biased since I wrote the text in question in 022349c3b0
(trailer: avoid unnecessary splitting on lines, 2016-11-02), but the
concept of patch_start and trailer_end being where the patch would start
and where the trailer would end (if they were present) goes back to
2013d8505d (trailer: parse trailers from file or stdin, 2014-10-13). I
don't remember exactly my thoughts in 2016, but today, this makes sense
to me.

As it is, the reader still needs to know that trailer_start is where
the trailer would start if it was present, and then I think it's quite
natural to have trailer_end be where the trailer would end if it was
present.

I believe the code is simpler this way, because trailer absence now no
longer needs to be special-cased when we use these variables (or maybe
they sometimes do, but not as often, since code that writes to the end
of the trailers, for example, can now just write at trailer_end instead
of having to check whether trailers exist). Same comment for patch 4
regarding using the special value 0 if no trailers are found (I think
the existing code makes more sense).

> Combine the logic in (1) and (2) together into find_patch_start() by
> renaming it to find_end_of_log_message(). The end of the log message is
> the starting point which find_trailer_start() needs to start searching
> backward to parse individual trailers (if any).

Having said that, if patch_start is too confusing for whatever reason,
this refactoring makes sense. (Avoid the confusing name by avoiding
needing to name it in the first place.)

> -static size_t find_patch_start(const char *str)
> +static size_t find_end_of_log_message(const char *input, int no_divider)
>  {
> +	size_t end;
> +
>  	const char *s;
>  
> -	for (s = str; *s; s = next_line(s)) {
> +	/* Assume the naive end of the input is already what we want. */
> +	end = strlen(input);
> +
> +	/* Optionally skip over any patch part ("---" line and below). */
> +	for (s = input; *s; s = next_line(s)) {
>  		const char *v;
>  
> -		if (skip_prefix(s, "---", &v) && isspace(*v))
> -			return s - str;
> +		if (!no_divider && skip_prefix(s, "---", &v) && isspace(*v)) {
> +			end = s - input;
> +			break;
> +		}
>  	}
>  
> -	return s - str;
> +	/* Skip over other ignorable bits. */
> +	return end - ignored_log_message_bytes(input, end);
>  }

This sometimes redundantly calls strlen and sometimes redundantly loops.
I think it's better to do as the code currently does - so, have a big
if/else at the beginning of this function that checks no_divider.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Linus Arver wrote (reply to this):

Hi Jonathan, it's been a while because I was on vacation. I've forgotten
about most of the intricacies of this patch (I think this was a good
thing, read on below).

Jonathan Tan <jonathantanmy@google.com> writes:

> "Linus Arver via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> From: Linus Arver <linusa@google.com>
>> 
>> Previously, trailer_info_get() computed the trailer block end position
>> by
>> 
>> (1) checking for the opts->no_divider flag and optionally calling
>>     find_patch_start() to find the "patch start" location (patch_start), and
>> (2) calling find_trailer_end() to find the end of the trailer block
>>     using patch_start as a guide, saving the return value into
>>     "trailer_end".
>> 
>> The logic in (1) was awkward because the variable "patch_start" is
>> misleading if there is no patch in the input. The logic in (2) was
>> misleading because it could be the case that no trailers are in the
>> input (yet we are setting a "trailer_end" variable before even searching
>> for trailers, which happens later in find_trailer_start()). The name
>> "find_trailer_end" was misleading because that function did not look for
>> any trailer block itself --- instead it just computed the end position
>> of the log message in the input where the end of the trailer block (if
>> it exists) would be (because trailer blocks must always come after the
>> end of the log message).
>
> [...]
>
> As it is, the reader still needs to know that trailer_start is where
> the trailer would start if it was present, and then I think it's quite
> natural to have trailer_end be where the trailer would end if it was
> present.
>
> I believe the code is simpler this way, because trailer absence now no
> longer needs to be special-cased when we use these variables (or maybe
> they sometimes do, but not as often, since code that writes to the end
> of the trailers, for example, can now just write at trailer_end instead
> of having to check whether trailers exist). Same comment for patch 4
> regarding using the special value 0 if no trailers are found (I think
> the existing code makes more sense).

I think the root cause of my confusion with this codebase is due to the
variables being named as if the things they refer to exist, but without
any guarantees that they indeed exist. This applies to "patch_start"
(the patch part might not be present in the input),
"trailer_{start,end}" (trailers block might not exist (yet)). IOW these
variables are named as if the intent is to always add new trailers into
the input, which may not be the case (we have "--parse", after all).

Looking again at patch 4, I'm now leaning toward dropping it. Other
than the reasons you cited, we also add a new struct field which by
itself does not add new information (the information can already be
deduced from the other fields). In the near future I want to simplify
the data structures as much as possible, and adding a new field seems to
go against this desire of mine.

>> Combine the logic in (1) and (2) together into find_patch_start() by
>> renaming it to find_end_of_log_message(). The end of the log message is
>> the starting point which find_trailer_start() needs to start searching
>> backward to parse individual trailers (if any).
>
> Having said that, if patch_start is too confusing for whatever reason,
> this refactoring makes sense. (Avoid the confusing name by avoiding
> needing to name it in the first place.)

I think the existing code is confusing, and would prefer to keep this
patch.

>> -static size_t find_patch_start(const char *str)
>> +static size_t find_end_of_log_message(const char *input, int no_divider)
>>  {
>> +	size_t end;
>> +
>>  	const char *s;
>>  
>> -	for (s = str; *s; s = next_line(s)) {
>> +	/* Assume the naive end of the input is already what we want. */
>> +	end = strlen(input);
>> +
>> +	/* Optionally skip over any patch part ("---" line and below). */
>> +	for (s = input; *s; s = next_line(s)) {
>>  		const char *v;
>>  
>> -		if (skip_prefix(s, "---", &v) && isspace(*v))
>> -			return s - str;
>> +		if (!no_divider && skip_prefix(s, "---", &v) && isspace(*v)) {
>> +			end = s - input;
>> +			break;
>> +		}
>>  	}
>>  
>> -	return s - str;
>> +	/* Skip over other ignorable bits. */
>> +	return end - ignored_log_message_bytes(input, end);
>>  }
>
> This sometimes redundantly calls strlen and sometimes redundantly loops.
> I think it's better to do as the code currently does - so, have a big
> if/else at the beginning of this function that checks no_divider.

Will update, thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Linus Arver <linusa@google.com> writes:

> Hi Jonathan, it's been a while because I was on vacation. I've forgotten
> about most of the intricacies of this patch (I think this was a good
> thing, read on below).

Welcome back ;-).

> Will update, thanks.

Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Linus Arver via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Linus Arver <linusa@google.com>
>
> Previously, trailer_info_get() computed the trailer block end position
> by
>
> (1) checking for the opts->no_divider flag and optionally calling
>     find_patch_start() to find the "patch start" location (patch_start), and
> (2) calling find_trailer_end() to find the end of the trailer block
>     using patch_start as a guide, saving the return value into
>     "trailer_end".
>
> The logic in (1) was awkward because the variable "patch_start" is
> misleading if there is no patch in the input. The logic in (2) was
> misleading because it could be the case that no trailers are in the
> input (yet we are setting a "trailer_end" variable before even searching
> for trailers, which happens later in find_trailer_start()). The name
> "find_trailer_end" was misleading because that function did not look for
> any trailer block itself --- instead it just computed the end position
> of the log message in the input where the end of the trailer block (if
> it exists) would be (because trailer blocks must always come after the
> end of the log message).
>
> Combine the logic in (1) and (2) together into find_patch_start() by
> renaming it to find_end_of_log_message(). The end of the log message is
> the starting point which find_trailer_start() needs to start searching
> backward to parse individual trailers (if any).
>
> Helped-by: Jonathan Tan <jonathantanmy@google.com>
> Helped-by: Junio C Hamano <gitster@pobox.com>
> Signed-off-by: Linus Arver <linusa@google.com>
> ---
>  trailer.c | 64 +++++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 41 insertions(+), 23 deletions(-)
>
> diff --git a/trailer.c b/trailer.c
> index 3c54b38a85a..70c81fda710 100644
> --- a/trailer.c
> +++ b/trailer.c
> @@ -809,21 +809,50 @@ static ssize_t last_line(const char *buf, size_t len)
>  }
>  
>  /*
> - * Return the position of the start of the patch or the length of str if there
> - * is no patch in the message.
> + * Find the end of the log message as an offset from the start of the input
> + * (where callers of this function are interested in looking for a trailers
> + * block in the same input). We have to consider two categories of content that
> + * can come at the end of the input which we want to ignore (because they don't
> + * belong in the log message):
> + *
> + * (1) the "patch part" which begins with a "---" divider and has patch
> + * information (like the output of git-format-patch), and
> + *
> + * (2) any trailing comment lines, blank lines like in the output of "git
> + * commit -v", or stuff below the "cut" (scissor) line.
> + *
> + * As a formula, the situation looks like this:
> + *
> + *     INPUT = LOG MESSAGE + IGNORED
> + *
> + * where IGNORED can be either of the two categories described above. It may be
> + * that there is nothing to ignore. Now it may be the case that the LOG MESSAGE
> + * contains a trailer block, but that's not the concern of this function.
>   */
> -static size_t find_patch_start(const char *str)
> +static size_t find_end_of_log_message(const char *input, int no_divider)
>  {
> +	size_t end;
>  	const char *s;
>  
> -	for (s = str; *s; s = next_line(s)) {
> +	/* Assume the naive end of the input is already what we want. */
> +	end = strlen(input);
> +
> +	if (no_divider) {
> +		return end;
> +	}

OK.  The early return may make sense, as we are essentially
declaring that everything is the "INPUT (= message + ignored)".

You do not need {braces} around a single-statement block, though.

Other than that, I didn't find anything quesionable in any of the
patches in this round.  Looking good.

Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Linus Arver wrote (reply to this):

TL;DR: I'm working on a new approach.

Junio C Hamano <gitster@pobox.com> writes:
> Other than that, I didn't find anything quesionable in any of the
> patches in this round.  Looking good.

So actually, I'm now taking a much more aggressive approach to libifying
the trailer subsystem. Instead of incrementally simplifying/improving
things as in this series, I think I need to get to the root problem,
which is that the trailer.h API isn't rich enough to make it pleasant
for clients to use, including our own builtin/interpret-trailers.c
client. That is, the problem we have today is that the trailer subsystem
is not very ergonomic for internal use, much less external use (outside
of Git itself).

As an example, the current API exposes process_trailers() which does a
whole bunch of things that only builtin/interpret-trailers.c cares
about. Multiple other clients of trailer.h exist in our codebase (e.g.,
sequencer.c, pretty.c, ref-filter.c) but none of them use
process_trailers().

One really useful data structure is the trailer_iterator that was
introduced in f0939a0eb1 (trailer: add interface for iterating over
commit trailers, 2020-09-27). The only problem is that it is not generic
enough such that interpret-trailers.c can use it.

My new goal is to introduce a new API in trailer.h so that
interpret-trailers.c and everyone else can start using these new data
structures and associated functions (while preserving the
trailer_iterator interface). So the order of operations should be:

(1) enrich the trailer API (make trailer.h have simpler data structures
    and practical functions that clients can readily use), and
(2) make builtin/interpret-trailers.c, and other clients in the Git
    codebase use this new API.

This way when the unit test framework selection process is finalized we
can

(3) write unit tests for the functions in the (enriched) trailer API,

which is one of the major goals for my efforts around this area.

The work I've started locally for (1) does not depend on this series,
and I think it'll be cleaner (less churn) that way. So, feel free to
drop this series in favor of the forthcoming work described in this
message.

Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Linus Arver wrote (reply to this):

> (1) enrich the trailer API (make trailer.h have simpler data structures
>     and practical functions that clients can readily use), and
> (2) make builtin/interpret-trailers.c, and other clients in the Git
>     codebase use this new API.

I've done some more thinking/hacking and I'm realizing that changing the
data structures significantly as a first step will be difficult to get
right without being able to unit-test things as we go. As we don't have
unit tests (sorry, I keep forgetting...), I think changing the shape of
data structures right now is a showstopper.

Step (1) will have to be done without changing any of the existing data
structures.


/*
* Return the position of the start of the patch or the length of str if there
* is no patch in the message.
* Find the end of the log message as an offset from the start of the input
* (where callers of this function are interested in looking for a trailers
* block in the same input). We have to consider two categories of content that
* can come at the end of the input which we want to ignore (because they don't
* belong in the log message):
*
* (1) the "patch part" which begins with a "---" divider and has patch
* information (like the output of git-format-patch), and
*
* (2) any trailing comment lines, blank lines like in the output of "git
* commit -v", or stuff below the "cut" (scissor) line.
*
* As a formula, the situation looks like this:
*
* INPUT = LOG MESSAGE + IGNORED
*
* where IGNORED can be either of the two categories described above. It may be
* that there is nothing to ignore. Now it may be the case that the LOG MESSAGE
* contains a trailer block, but that's not the concern of this function.
*/
static size_t find_patch_start(const char *str)
static size_t find_end_of_log_message(const char *input, int no_divider)
{
size_t end;
const char *s;

for (s = str; *s; s = next_line(s)) {
/* Assume the naive end of the input is already what we want. */
end = strlen(input);

if (no_divider) {
return end;
}

/* Optionally skip over any patch part ("---" line and below). */
for (s = input; *s; s = next_line(s)) {
const char *v;

if (skip_prefix(s, "---", &v) && isspace(*v))
return s - str;
if (skip_prefix(s, "---", &v) && isspace(*v)) {
end = s - input;
break;
}
}

return s - str;
/* Skip over other ignorable bits. */
return end - ignored_log_message_bytes(input, end);
}

/*
* Return the position of the first trailer line or len if there are no
* trailers.
*/
static size_t find_trailer_start(const char *buf, size_t len)
static size_t find_trailer_block_start(const char *buf, size_t len)
{
const char *s;
ssize_t end_of_title, l;
Expand Down Expand Up @@ -925,12 +954,6 @@ static size_t find_trailer_start(const char *buf, size_t len)
return len;
}

/* Return the position of the end of the trailers. */
static size_t find_trailer_end(const char *buf, size_t len)
{
return len - ignore_non_trailer(buf, len);
}

static int ends_with_blank_line(const char *buf, size_t len)
{
ssize_t ll = last_line(buf, len);
Expand Down Expand Up @@ -1052,7 +1075,6 @@ void process_trailers(const char *file,
LIST_HEAD(head);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Linus Arver via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Linus Arver <linusa@google.com>
>
> Previously these fields in the trailer_info struct were of type "const
> char *" and pointed to positions in the input string directly (to the
> start and end positions of the trailer block).
>
> Use offsets to make the intended usage less ambiguous. We only need to
> reference the input string in format_trailer_info(), so update that
> function to take a pointer to the input.

Hmm, I am not sure if this is an improvement.  If the underlying
buffer can be reallocated (to grow), the approach to use the offsets
certainly is easier to deal with, as they will stay valid even after
such a reallocation.  But you lose the obvious sentinel value NULL
that can mean something special, and have to make the readers aware
of the local convention you happened to have picked with a comment
like ...

> Signed-off-by: Linus Arver <linusa@google.com>
> ---
>  trailer.c | 17 ++++++++---------
>  trailer.h |  7 +++----
>  2 files changed, 11 insertions(+), 13 deletions(-)
> ...
>  	/*
> -	 * Pointers to the start and end of the trailer block found. If there
> -	 * is no trailer block found, these 2 pointers point to the end of the
> -	 * input string.
> +	 * Offsets to the trailer block start and end positions in the input
> +	 * string. If no trailer block is found, these are set to 0.
>  	 */

... this, simply because there is no obvious sentinel value for an
unsigned integral type; even if you count MAX_ULONG and its friends,
they are not as obvious as NULL for pointer types.

So, I dunno.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Linus Arver wrote (reply to this):

Junio C Hamano <gitster@pobox.com> writes:

> "Linus Arver via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Linus Arver <linusa@google.com>
>>
>> Previously these fields in the trailer_info struct were of type "const
>> char *" and pointed to positions in the input string directly (to the
>> start and end positions of the trailer block).
>>
>> Use offsets to make the intended usage less ambiguous. We only need to
>> reference the input string in format_trailer_info(), so update that
>> function to take a pointer to the input.
>
> Hmm, I am not sure if this is an improvement.  If the underlying
> buffer can be reallocated (to grow), the approach to use the offsets
> certainly is easier to deal with, as they will stay valid even after
> such a reallocation.  But you lose the obvious sentinel value NULL
> that can mean something special

True.

> and have to make the readers aware
> of the local convention you happened to have picked with a comment
> like ...
>
>> Signed-off-by: Linus Arver <linusa@google.com>
>> ---
>>  trailer.c | 17 ++++++++---------
>>  trailer.h |  7 +++----
>>  2 files changed, 11 insertions(+), 13 deletions(-)
>> ...
>>  	/*
>> -	 * Pointers to the start and end of the trailer block found. If there
>> -	 * is no trailer block found, these 2 pointers point to the end of the
>> -	 * input string.
>> +	 * Offsets to the trailer block start and end positions in the input
>> +	 * string. If no trailer block is found, these are set to 0.
>>  	 */
>
> ... this, simply because there is no obvious sentinel value for an
> unsigned integral type; even if you count MAX_ULONG and its friends,
> they are not as obvious as NULL for pointer types.

I agree that there is no trustworthy sentinel value for an unsigned
integral type.

On the other hand, we never used NULL as a sentinel value before even
when they were const char pointers --- the current comment for these
fields which say ...

    If there is no trailer block found, these 2 pointers point to the end of the
    input string.

... sounds somewhat arbitrary to me (and I don't think we care about
this property in trailer.c, and AFAICS it's also not something that
consumers should be aware of). Consumers of the trailer_info struct
could also just see if "info->trailer_nr > 0" to check whether any
trailers were found, although if we're merging Patch 1 [1] of this
series the consumers will not have easy access any more to any of the
trailer_info fields, and they should instead be using a public-facing
function that does the "were trailers found" check.

> So, I dunno.

If the "we don't use NULL sentinel values currently anyway" argument is
convincing enough, I'm happy to add this to the commit message on a
reroll. But I'm also OK with dropping this patch. Thoughts?

[1] https://lore.kernel.org/git/pull.1563.git.1691211879.gitgitgadget@gmail.com/T/#m8f1b5f1eb346331658c8c7b3e057a4ee31223664

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Linus Arver wrote (reply to this):

Linus Arver <linusa@google.com> writes:

> Junio C Hamano <gitster@pobox.com> writes:
>
>> "Linus Arver via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Linus Arver <linusa@google.com>
>>>
>>> Previously these fields in the trailer_info struct were of type "const
>>> char *" and pointed to positions in the input string directly (to the
>>> start and end positions of the trailer block).
>>>
>>> Use offsets to make the intended usage less ambiguous. We only need to
>>> reference the input string in format_trailer_info(), so update that
>>> function to take a pointer to the input.
>>
>> Hmm, I am not sure if this is an improvement.  If the underlying
>> buffer can be reallocated (to grow), the approach to use the offsets
>> certainly is easier to deal with, as they will stay valid even after
>> such a reallocation.  But you lose the obvious sentinel value NULL
>> that can mean something special
>
> True.
>
>> and have to make the readers aware
>> of the local convention you happened to have picked with a comment
>> like ...
>>
>>> Signed-off-by: Linus Arver <linusa@google.com>
>>> ---
>>>  trailer.c | 17 ++++++++---------
>>>  trailer.h |  7 +++----
>>>  2 files changed, 11 insertions(+), 13 deletions(-)
>>> ...
>>>  	/*
>>> -	 * Pointers to the start and end of the trailer block found. If there
>>> -	 * is no trailer block found, these 2 pointers point to the end of the
>>> -	 * input string.
>>> +	 * Offsets to the trailer block start and end positions in the input
>>> +	 * string. If no trailer block is found, these are set to 0.
>>>  	 */

I've just realized that the new comment "If no trailer block is found,
these are set to 0" is perhaps not always true in the current version
of this patch. This is because this patch did a mechanical type
conversion of the old fields from "const char *" to "size_t", and we
still run

    trailer_end = find_trailer_end(str, trailer_search_boundary);
    trailer_start = find_trailer_start(str, trailer_end);

inside trailer_info_get() (not visible in the patch context lines). I
will update this patch to make the comment "If no trailer block is found,
these are set to 0" true.

struct strbuf sb = STRBUF_INIT;
struct trailer_info info;
size_t trailer_end;
FILE *outfile = stdout;

ensure_configured();
Expand All @@ -1063,11 +1085,10 @@ void process_trailers(const char *file,
outfile = create_in_place_tempfile(file);

parse_trailers(&info, sb.buf, &head, opts);
trailer_end = info.trailer_end - sb.buf;

/* Print the lines before the trailers */
if (!opts->only_trailers)
fwrite(sb.buf, 1, info.trailer_start - sb.buf, outfile);
fwrite(sb.buf, 1, info.trailer_block_start, outfile);

if (!opts->only_trailers && !info.blank_line_before_trailer)
fprintf(outfile, "\n");
Expand All @@ -1089,7 +1110,7 @@ void process_trailers(const char *file,

/* Print the lines after the trailers as is */
if (!opts->only_trailers)
fwrite(sb.buf + trailer_end, 1, sb.len - trailer_end, outfile);
fwrite(sb.buf + info.trailer_block_end, 1, sb.len - info.trailer_block_end, outfile);

if (opts->in_place)
if (rename_tempfile(&trailers_tempfile, file))
Expand All @@ -1101,24 +1122,19 @@ void process_trailers(const char *file,
void trailer_info_get(struct trailer_info *info, const char *str,
const struct process_trailer_options *opts)
{
int patch_start, trailer_end, trailer_start;
size_t end_of_log_message = 0, trailer_block_start = 0;
struct strbuf **trailer_lines, **ptr;
char **trailer_strings = NULL;
size_t nr = 0, alloc = 0;
char **last = NULL;

ensure_configured();

if (opts->no_divider)
patch_start = strlen(str);
else
patch_start = find_patch_start(str);

trailer_end = find_trailer_end(str, patch_start);
trailer_start = find_trailer_start(str, trailer_end);
end_of_log_message = find_end_of_log_message(str, opts->no_divider);
trailer_block_start = find_trailer_block_start(str, end_of_log_message);

trailer_lines = strbuf_split_buf(str + trailer_start,
trailer_end - trailer_start,
trailer_lines = strbuf_split_buf(str + trailer_block_start,
end_of_log_message - trailer_block_start,
'\n',
0);
for (ptr = trailer_lines; *ptr; ptr++) {
Expand All @@ -1139,9 +1155,9 @@ void trailer_info_get(struct trailer_info *info, const char *str,
strbuf_list_free(trailer_lines);

info->blank_line_before_trailer = ends_with_blank_line(str,
trailer_start);
info->trailer_start = str + trailer_start;
info->trailer_end = str + trailer_end;
trailer_block_start);
info->trailer_block_start = trailer_block_start;
info->trailer_block_end = end_of_log_message;
info->trailers = trailer_strings;
info->trailer_nr = nr;
}
Expand All @@ -1156,6 +1172,7 @@ void trailer_info_release(struct trailer_info *info)

static void format_trailer_info(struct strbuf *out,
const struct trailer_info *info,
const char *msg,
const struct process_trailer_options *opts)
{
size_t origlen = out->len;
Expand All @@ -1165,8 +1182,8 @@ static void format_trailer_info(struct strbuf *out,
if (!opts->only_trailers && !opts->unfold && !opts->filter &&
!opts->separator && !opts->key_only && !opts->value_only &&
!opts->key_value_separator) {
strbuf_add(out, info->trailer_start,
info->trailer_end - info->trailer_start);
strbuf_add(out, msg + info->trailer_block_start,
info->trailer_block_end - info->trailer_block_start);
return;
}

Expand Down Expand Up @@ -1220,7 +1237,7 @@ void format_trailers_from_commit(struct strbuf *out, const char *msg,
struct trailer_info info;

trailer_info_get(&info, msg, opts);
format_trailer_info(out, &info, opts);
format_trailer_info(out, &info, msg, opts);
trailer_info_release(&info);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Glen Choo wrote (reply to this):

As someone who isn't that familiar with trailer code, and will have less
time for the ML soon, this is more of a quick drive-by..

"Linus Arver via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +#define private __private_to_trailer_c__do_not_use
> +
>  void trailer_iterator_init(struct trailer_iterator *iter, const char *msg)
>  {
>  	struct process_trailer_options opts = PROCESS_TRAILER_OPTIONS_INIT;
>  	strbuf_init(&iter->key, 0);
>  	strbuf_init(&iter->val, 0);
>  	opts.no_divider = 1;
> -	trailer_info_get(&iter->info, msg, &opts);
> -	iter->cur = 0;
> +	trailer_info_get(&iter->private.info, msg, &opts);
> +	iter->private.cur = 0;
>  }
> --- a/trailer.h
> +++ b/trailer.h
> @@ -119,8 +119,10 @@ struct trailer_iterator {
>  	struct strbuf val;
>...
>  	/* private */
> -	struct trailer_info info;
> -	size_t cur;
> +	struct {
> +		struct trailer_info info;
> +		size_t cur;
> +	} __private_to_trailer_c__do_not_use;
>  };

Interesting approach to "private members". I like that it's fairly
lightweight and clear. On the other hand, I think this will fail to
autocomplete on most people's development setups, and I don't think this
is worth the tradeoff.

This is the first instance of this I could find in the codebase. I'm not
really opposed to having a new way of doing things, but it would be nice
for us to be consistent with how we handle private members. Other
approaches I've seen are:

- Using a "larger" struct to hold private members and "downcasting" for
  public users (struct dir_iterator and struct dir_iterator_int). I
  dislike this because I think this enables 'wrong' memory access too
  easily.

  (As an aside, if we really wanted to 'strictly' enforce privateness in
  this patch, shouldn't we move the "#define private" into the .c file,
  the way dir_iterator_int is in the .c file?)

- Prefixing private members with "__" (khash.h and other header-only
  libraries use this at least, not sure if we have this in the 'main
  tree'). I think this works pretty well most of the time.
- Just marking private members with a comment. IMO this is good enough
  the vast majority of the time - if something is private for a good
  reason, it's unlikely to get used accidentally anyway. But properly
  enforcing "privateness" is worthy goal anyway.

Personally, I think a decent tradeoff between enforcement and ergonomics
would be to use an inner struct like you do here, but name it something
autocomplete-friendly and obviously private, like "private" or
"_private". I suspect self-regulation and code review should be enough
to catch nearly all accidental uses of private members.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Phillip Wood wrote (reply to this):

On 07/08/2023 22:16, Glen Choo wrote:
> As someone who isn't that familiar with trailer code, and will have less
> time for the ML soon, this is more of a quick drive-by..

This is a bit of a drive-by comment as well ...

> "Linus Arver via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >> +#define private __private_to_trailer_c__do_not_use
>> +
>>   void trailer_iterator_init(struct trailer_iterator *iter, const char *msg)
>>   {
>>   	struct process_trailer_options opts = PROCESS_TRAILER_OPTIONS_INIT;
>>   	strbuf_init(&iter->key, 0);
>>   	strbuf_init(&iter->val, 0);
>>   	opts.no_divider = 1;
>> -	trailer_info_get(&iter->info, msg, &opts);
>> -	iter->cur = 0;
>> +	trailer_info_get(&iter->private.info, msg, &opts);
>> +	iter->private.cur = 0;
>>   }
>> --- a/trailer.h
>> +++ b/trailer.h
>> @@ -119,8 +119,10 @@ struct trailer_iterator {
>>   	struct strbuf val;
>> ...
>>   	/* private */
>> -	struct trailer_info info;
>> -	size_t cur;
>> +	struct {
>> +		struct trailer_info info;
>> +		size_t cur;
>> +	} __private_to_trailer_c__do_not_use;
>>   };
> > Interesting approach to "private members". I like that it's fairly
> lightweight and clear. On the other hand, I think this will fail to
> autocomplete on most people's development setups, and I don't think this
> is worth the tradeoff.
> > This is the first instance of this I could find in the codebase. We have something similar in unpack_trees.h see 576de3d9560 (unpack_trees: start splitting internal fields from public API, 2023-02-27). That adds an "internal" member to "sturct unpack_trees" of type "struct unpack_trees_internal which seems to be a easier naming scheme.

> I'm not
> really opposed to having a new way of doing things, but it would be nice
> for us to be consistent with how we handle private members. Other
> approaches I've seen are:
> > - Using a "larger" struct to hold private members and "downcasting" for
>    public users (struct dir_iterator and struct dir_iterator_int). I
>    dislike this because I think this enables 'wrong' memory access too
>    easily.
> >    (As an aside, if we really wanted to 'strictly' enforce privateness in
>    this patch, shouldn't we move the "#define private" into the .c file,
>    the way dir_iterator_int is in the .c file?)

That #define is pretty ugly

Another common scheme is to have an opaque pointer to the private struct   in the public struct (aka pimpl idiom). The merge machinery uses this - see merge-recursive.h. (I'm working on something similar for the sequencer so we can change the internals without having to re-compile everything that includes "sequencer.h")

> - Prefixing private members with "__" (khash.h and other header-only
>    libraries use this at least, not sure if we have this in the 'main
>    tree'). I think this works pretty well most of the time.

It is common but I think the C standard reserves names beginning with "__"

> - Just marking private members with a comment. IMO this is good enough
>    the vast majority of the time - if something is private for a good
>    reason, it's unlikely to get used accidentally anyway. But properly
>    enforcing "privateness" is worthy goal anyway.
>
> Personally, I think a decent tradeoff between enforcement and ergonomics
> would be to use an inner struct like you do here, but name it something
> autocomplete-friendly and obviously private, like "private" or
> "_private".

I agree, something like that would match the unpack_trees example

> I suspect self-regulation and code review should be enough
> to catch nearly all accidental uses of private members.

Agreed

Best Wishes

Phillip

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Linus Arver wrote (reply to this):

Glen Choo <chooglen@google.com> writes:

> As someone who isn't that familiar with trailer code, and will have less
> time for the ML soon, this is more of a quick drive-by..

Aren't you also going on vacation soon? ;-)

> "Linus Arver via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> +#define private __private_to_trailer_c__do_not_use
>> +
>>  void trailer_iterator_init(struct trailer_iterator *iter, const char *msg)
>>  {
>>  	struct process_trailer_options opts = PROCESS_TRAILER_OPTIONS_INIT;
>>  	strbuf_init(&iter->key, 0);
>>  	strbuf_init(&iter->val, 0);
>>  	opts.no_divider = 1;
>> -	trailer_info_get(&iter->info, msg, &opts);
>> -	iter->cur = 0;
>> +	trailer_info_get(&iter->private.info, msg, &opts);
>> +	iter->private.cur = 0;
>>  }
>> --- a/trailer.h
>> +++ b/trailer.h
>> @@ -119,8 +119,10 @@ struct trailer_iterator {
>>  	struct strbuf val;
>>...
>>  	/* private */
>> -	struct trailer_info info;
>> -	size_t cur;
>> +	struct {
>> +		struct trailer_info info;
>> +		size_t cur;
>> +	} __private_to_trailer_c__do_not_use;
>>  };
>
> [...]
>
> This is the first instance of this I could find in the codebase. I'm not
> really opposed to having a new way of doing things, but it would be nice
> for us to be consistent with how we handle private members. Other
> approaches I've seen are:
>
> - Using a "larger" struct to hold private members and "downcasting" for
>   public users (struct dir_iterator and struct dir_iterator_int). I
>   dislike this because I think this enables 'wrong' memory access too
>   easily.
>   [...]
> - Prefixing private members with "__" (khash.h and other header-only
>   libraries use this at least, not sure if we have this in the 'main
>   tree'). I think this works pretty well most of the time.
> - Just marking private members with a comment. IMO this is good enough
>   the vast majority of the time - if something is private for a good
>   reason, it's unlikely to get used accidentally anyway. But properly
>   enforcing "privateness" is worthy goal anyway.

Thanks for documenting these other approaches.

I prefer the "larger" struct to hold private members pattern. More
specifically I like the container_of approach pointed out by Jacob [2],
because it is an established pattern in the Linux Kernel and because we
already sort of use the same idea in the list_head type we imported from
the Kernel in 94e99012fc (http-walker: reduce O(n) ops with
doubly-linked list, 2016-07-11). That is, for example for the
new_trailer_item struct we have

    struct new_trailer_item {
        struct list_head list;
        <list item stuff>
    };

and to me this is symmetric to the container_of pattern described by Jacob:

    struct dir_entry_private {
        struct dir_entry entry;
        <private stuff>
    };

Accordingly, we are already doing the "structure pointer math" (which
Jacob described in [2]) for list_head in list.h:

    /* Get typed element from list at a given position. */
    #define list_entry(ptr, type, member) \
        ((type *) ((char *) (ptr) - offsetof(type, member)))

In this patch series though, I decided to just stick with giving the
struct a private-sounding name, because I don't think we reached
consensus on what the preferred approach is for separating
public/private fields.

>   (As an aside, if we really wanted to 'strictly' enforce privateness in
>   this patch, shouldn't we move the "#define private" into the .c file,
>   the way dir_iterator_int is in the .c file?)

I think you meant moving the struct into the .c file (the "#define" is
already in the .c file).

> Personally, I think a decent tradeoff between enforcement and ergonomics
> would be to use an inner struct like you do here, but name it something
> autocomplete-friendly and obviously private, like "private" or
> "_private".

SGTM. I think I'll go with "internal" though, to align with 576de3d956
(unpack_trees: start splitting internal fields from public API,
2023-02-27) which Phillip pointed out. Will reroll.

> I suspect self-regulation and code review should be enough
> to catch nearly all accidental uses of private members.

Ack. In the future, if and when we want compiler-level guarantees to
make it impossible (this was the discussion at [1]), we can revisit this
area.

[1] https://lore.kernel.org/git/16ff5069-0408-21cd-995c-8b47afb9810d@github.com/
[2] https://lore.kernel.org/git/CA+P7+xo02dGkjb5DwJ1Af_hoQ5HiuxASheZxoFz+r6B-6cQMug@mail.gmail.com/

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Linus Arver wrote (reply to this):

Phillip Wood <phillip.wood123@gmail.com> writes:

> We have something similar in unpack_trees.h see 576de3d9560 
> (unpack_trees: start splitting internal fields from public API, 
> 2023-02-27). That adds an "internal" member to "sturct unpack_trees" of 
> type "struct unpack_trees_internal which seems to be a easier naming scheme.

Ack, I will use "internal" as the member name in the next reroll.

>>> +#define private __private_to_trailer_c__do_not_use
> [...]
> That #define is pretty ugly

Haha, indeed. But I think that's the point though (i.e., the degree of
ugliness matches the strength of our codebase's posture to discourage
its use by external users).

I will drop the #define in the next reroll though, so, I guess it's a
moot point anyway.

> Another common scheme is to have an opaque pointer to the private struct 
>   in the public struct (aka pimpl idiom). The merge machinery uses this 
> - see merge-recursive.h. (I'm working on something similar for the 
> sequencer so we can change the internals without having to re-compile 
> everything that includes "sequencer.h")

Very interesting! I look forward to seeing your work. :)

>> - Prefixing private members with "__" (khash.h and other header-only
>>    libraries use this at least, not sure if we have this in the 'main
>>    tree'). I think this works pretty well most of the time.
>
> It is common but I think the C standard reserves names beginning with "__"

Indeed (see [1]).

[1] https://devblogs.microsoft.com/oldnewthing/20230109-00/?p=107685

}

Expand Down
10 changes: 5 additions & 5 deletions trailer.h
Original file line number Diff line number Diff line change
Expand Up @@ -32,16 +32,16 @@ int trailer_set_if_missing(enum trailer_if_missing *item, const char *value);
struct trailer_info {
/*
* True if there is a blank line before the location pointed to by
* trailer_start.
* trailer_block_start.
*/
int blank_line_before_trailer;

/*
* Pointers to the start and end of the trailer block found. If there
* is no trailer block found, these 2 pointers point to the end of the
* input string.
* Offsets to the trailer block start and end positions in the input
* string. If no trailer block is found, these are both set to the
* "true" end of the input (find_end_of_log_message()).
*/
const char *trailer_start, *trailer_end;
size_t trailer_block_start, trailer_block_end;

/*
* Array of trailers found.
Expand Down