16 changes: 12 additions & 4 deletions Makefile
@@ -4,10 +4,10 @@
DOCNAME = softid

# count up; you probably do not want to bother with versions <1.0
DOCVERSION = 1.0
DOCVERSION = 1.1

# Publication date, ISO format; update manually for "releases"
DOCDATE = 2021-05-28
DOCDATE = 2025-12-18

# What is it you're writing: NOTE, WD, PR, REC, PEN, or EN
DOCTYPE = NOTE
@@ -22,13 +22,21 @@ SOURCES = $(DOCNAME).tex

# List of image files to be included in submitted package (anything that
# can be rendered directly by common web browsers)
FIGURES =
FIGURES =

# List of PDF figures (figures that must be converted to pixel images to
# work in web browsers).
VECTORFIGURES =
VECTORFIGURES =

# Additional files to distribute (e.g., CSS, schema files, examples...)
AUX_FILES = tapstats.py

include ivoatex/Makefile

ivoatex/Makefile:
	@echo "*** ivoatex submodule not found. Initialising submodules."
	@echo
	git submodule update --init

test:
	@echo "*** No tests defined"
76 changes: 50 additions & 26 deletions softid.tex
@@ -18,6 +18,7 @@
\editor{Markus Demleitner}

% \previousversion[????URL????]{????Concise Document Label????}
\previousversion[https://www.ivoa.net/documents/Notes/softid/]{Version 1.0}
\previousversion{This is the first public release}

\newcommand{\headername}[1]{{\tt #1}}
@@ -64,7 +65,7 @@ \section*{Conformance-related definitions}
\section{Introduction}

Very early on in the construction of client-server architectures it was
found that it is useful to have mechanisms for
found that it is useful to have mechanisms for
discovering which software runs
at the other side of a connection, rather typically to aid in debugging.
In particular, HTTP, which is the basis of many of the VO's protocols
@@ -140,6 +141,15 @@ \subsubsection{Notifications}
request for a software update), server developers might want to contact
deployers of vulnerable or otherwise broken software.

\subsubsection{Mitigation of Reckless Crawling}

Around 2024, many crawlers doing indiscriminate bulk downloads at high
rates appeared on the internet; filtering their requests is a particular
challenge in the VO, where machine clients potentially running a large
number of legitimate queries are the norm. Services may want to more
strictly limit requests from user agents that do not comply with VO
rules on their identification.

\subsection{Security and Privacy Considerations}

Several guidelines on IT security discourage giving details on the
@@ -148,8 +158,8 @@ \subsection{Security and Privacy Considerations}

Following the practices proposed here will, indeed, weaken the
``security by obscurity'' put forward in these treatments; on the other
hand, when, as is the case in the VO, attackers only have
to scan perhaps several hundred URLs,
hand, when, as is the case in the VO, attackers only have
to scan perhaps several hundred sites,
relying on security by obscurity does not seem a promising policy.

On the other
@@ -163,7 +173,8 @@ \subsection{Security and Privacy Considerations}
seems unlikely that rogue services could be aided by information on the
client version when they target clients.

Software identification does play a role in user privacy; user agents
Software identification does play a role in user privacy;
user agent identifications
are regularly employed in user tracking on the WWW. While, presumably,
the generally non-profit operators in the VO will not use such data to
significantly violate their users' privacy, client authors may want to
@@ -215,10 +226,9 @@ \subsection{User-Agent Header IVOA Recommendations}
The Operations IG endorses and encourages use of these standard
rules concerning the \headername{User-Agent} header,
and adds a further convention, which does not
conflict with the above rules: clients whose primary purpose
is \emph{operational}, as opposed to \emph{scientific},
should indicate that purpose by including a
comment token of the form
conflict with the above rules. User agents written to interact with VO
services should indicate their purpose by including a
comment token of the form
$$\hbox{\verb|(IVOA-<op-purpose> <optional-extra-text>)|.}$$

Suggested {\tt op-purpose} values are currently:
@@ -229,14 +239,23 @@ \subsection{User-Agent Header IVOA Recommendations}
performance (monitoring) or standards-compliance (validation);
at this point,
no good reason to separate the different cases was identified.
\item[copy]

\item[copy]
The purpose of the access is to replicate (parts of) the content
published through
the service, be it for aggregation (harvesting) or re-publication
(mirroring).

\item[science]
The access was done to directly support a science case. This explicitly
includes education and training, in particular because we do not want to
suggest that software used in such settings -- which plausibly is going
to be the same as software used in pure research -- should be
reconfigured for them.
Comment on lines +250 to +254

I'm not sure about "to directly support a science case"; I'd suggest something a bit more woolly like "in support of science usage" or "in the context of science usage". I think the main target here is to differentiate clients that understand the VO/astronomy services they are engaging with from those that are just hitting anything they can find. From a practical point of view, at least for clients like topcat and stilts, it's not likely to be feasible to get them to present different user-agent headers on the basis of the user intention for particular requests, only on the basis of the tools in use.

Given that, I'm wondering if there's a different term than "science" that should be used here, but I don't have great suggestions. IVOA-voclient or just IVOA-client maybe?


\end{description}

This list may evolve in the future; extensions should be proposed on
This list may evolve in the future; extensions should be proposed on
the ops@ivoa.net mailing list. Custom {\tt op-purpose} values are permitted.
Case is significant in {\tt op-purpose} values and its ``{\tt IVOA-}'' prefix.

@@ -251,39 +270,40 @@ \subsection{User-Agent Header IVOA Recommendations}
Formally:

\begin{verbatim}
ivoa-comment = "(IVOA-" op-purpose *(
ivoa-comment = "(IVOA-" op-purpose *(
ctext | quoted-pair | comment ) ")"
op-purpose = "test" | "copy" | token
op-purpose = "test" | "copy" | "science" | token
\end{verbatim}
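
As an illustration of the grammar above, a client could assemble its comment token roughly as follows. This is only a sketch in Python, not part of the note: the helper names `make_ivoa_comment` and `make_user_agent` are hypothetical, the token check assumes the usual HTTP token character set, and the extra text is deliberately simplified (nested comments and quoted-pairs are rejected rather than supported).

```python
import re

# Characters allowed in an HTTP "token" (assumed: the RFC 7230 tchar set).
_TOKEN_RE = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


def make_ivoa_comment(op_purpose, extra_text=""):
    """Return an "(IVOA-<op-purpose> <extra text>)" comment string.

    Hypothetical helper; simplified relative to the grammar: nested
    comments and quoted-pairs are not supported in extra_text.
    """
    if not _TOKEN_RE.fullmatch(op_purpose):
        raise ValueError("op-purpose must be an HTTP token: %r" % op_purpose)
    if re.search(r"[()\\]", extra_text):
        raise ValueError("parentheses and backslashes are not supported")
    if extra_text:
        return "(IVOA-%s %s)" % (op_purpose, extra_text)
    return "(IVOA-%s)" % op_purpose


def make_user_agent(product, op_purpose, extra_text=""):
    """Append an IVOA comment to a product token such as "STILTS/3.1-4"."""
    return "%s %s" % (product, make_ivoa_comment(op_purpose, extra_text))
```

For instance, `make_user_agent("STILTS/3.1-4", "test")` yields a value with the comment placed directly after the product token; further product tokens (such as a Java version) could be appended by the caller.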

Tokens of the form \verb|ivoa-comment| should not appear in the
\headername{User-Agent} field
if the request is a ``normal'' user science query. There
are obviously grey areas between operational and science requests; this
convention does not attempt to provide a rigid definition of these
categories.

This arrangement allows service operators to test in their logs for
This arrangement allows service operators to filter their logs against
\headername{User-Agent} values
whose content matches the sequence ``\verb|(IVOA-|'', or
perhaps ``\verb|(IVOA-test|'', and adjust their usage statistics
whose content matches the sequence ``\verb|(IVOA-test|'' (or, if so
desired,
``\verb|(IVOA-copy|'' as well) and adjust their usage statistics
appropriately. Note, however, that it is not feasible to force operational
clients to follow this convention, so service operators will still need
to be careful in analysing their usage statistics.
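
The log filtering described above might be sketched as follows in Python. This is a hedged illustration only: the function names `op_purpose` and `count_by_purpose` are hypothetical, the input is assumed to be a list of already extracted User-Agent values (log parsing itself is out of scope), and the match is case-sensitive because case is significant in the prefix and the purpose value.

```python
import re
from collections import Counter

# Matches "(IVOA-<op-purpose>" and captures the purpose token; the match is
# case-sensitive on purpose.
_IVOA_RE = re.compile(r"\(IVOA-([!#$%&'*+\-.^_`|~0-9A-Za-z]+)")


def op_purpose(user_agent):
    """Return the op-purpose declared in a User-Agent value, or None."""
    match = _IVOA_RE.search(user_agent)
    return match.group(1) if match else None


def count_by_purpose(user_agents):
    """Count requests per declared purpose; None collects undeclared ones."""
    return Counter(op_purpose(ua) for ua in user_agents)
```

An operator could then subtract the `"test"` (and, if desired, `"copy"`) counts from the totals; requests collected under `None` still need manual inspection, since non-conforming operational clients end up there too.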

User agents intended for researchers should set their IVOA comment to
\verb|(IVOA-science)|. The purpose of this rule is to help operators throttle
indiscriminate downloads by ``stupid'' crawlers (like the harvesters
employed to gather training material for AI models around 2025) without
impacting common clients; for instance, rate limits could be tighter
for requests without a conforming \headername{User-Agent} header.
Comment on lines +288 to +292

I wouldn't say simply "the purpose of this rule is [throttling]", since there are other use cases, for instance managing usage statistics. Possible alternative wording:

Presence of this header provides a means to identify requests by known VO-aware clients as distinct from those by potentially indiscriminate crawlers like the harvesters employed to gather training material for AI models around 2025. This information may be used for instance to throttle indiscriminate downloads by applying tighter rate limits for requests without a conforming user-agent header, or for better understanding of usage statistics by distinguishing known science queries.


\subsection{Examples}

A science query from the STILTS tapquery TAP client might contain the
HTTP header
\begin{verbatim}
User-Agent: STILTS/3.1-4 Java/1.8.0_181
User-Agent: STILTS/3.1-4 (IVOA-science) Java/1.8.0_181
\end{verbatim}
while a query from the STILTS taplint TAP service validator might
contain the header
\begin{verbatim}
User-Agent: STILTS/3.1-4 (IVOA-test) Java/1.8.0_181
\end{verbatim}
or maybe
or maybe
\iftth
\begin{verbatim}
User-Agent: STILTS/3.1-4 (IVOA-test http://validators.org/results) Java/1.8.0_181
@@ -356,12 +376,16 @@ \subsection{Notes}
will serve many different resources), the use cases for global server
identification can probably be satisfied by running one request each
against these servers, access URLs for which can readily be discovered
in the Registry as it is.
in the Registry as it is.

\appendix
\section{Changes from Previous Versions}

No previous versions yet.
\subsection{Changes from Version 1.0}

Now recommending the use of an \verb|IVOA-science| ivoa-comment as a
potential mitigation strategy against AI crawlers.

% these would be subsections "Changes from v. WD-..."
% Use itemize environments.
