Skip to content

Conversation

@supervacuus
Copy link
Collaborator

@supervacuus supervacuus commented Nov 25, 2025

📜 Description

As discussed last week with @markushi, the process will be to make minimal atomic changes in each PR and merge directly to main rather than accumulating on an uber feature branch. This allows for easier review/feedback/corrections, and we can already test a subset of the entire feature "in the field".

The first minimal change is a basic tombstone integration:

  • exposes internal option to enable/disable (disabled by default)
  • the integration will only ever be enabled if the runtime system is at least Android 12
  • the current setup suggests an operation where users will have to disable NDK support (or only depend on sentry-android-core) or get two reports for the same crash
  • only retrieves the most current tombstone (the current implementation is not entirely correct: we should either remove any remaining ApplicationExitInfo entries with REASON_CRASH_NATIVE or report them too, including the option for the latter; I left this out for minimal interface exposure in the first step, but either variant is easy to add to this PR or later)
  • adds a basic tombstone parser/decoder and accompanying snapshot test
  • introduces the following dependencies:
    • runtime: protobuf-javalite (the entire features adds ca. 75KiB to the Android sample release APK)
    • build: the protobuf plugin and the protoc compiler to automate protocol updates

Open Issues:

  • While the protobuf runtime is relatively small, there is still the possibility of conflicting with client-side protobuf versions (major versions often introduce quite severe breakage, but I haven't tested this yet, only reviewed change logs).
  • finding clarity in how to proceed with older tombstones (ignore or handling similarly to ANRv2)
  • decision if this minimal setup already makes sense to release as an internal API
  • add ManifestMetadataReader to configure conveniently? (or not since the corresponding options are only internal?)
  • No changes yet to the UI (there are quite a few aspects that would make these stack-traces more readable, from my pov this is currently out of scope, but i am writing stuff down)

💡 Motivation and Context

First sensible release step for #3295

Part of https://linear.app/getsentry/project/tombstone-support-android-0024cb6e3ffd/

💚 How did you test it?

  • Added a basic parser snapshot test for a serialized tombstone protobuf.
  • Manual testing.

📝 Checklist

  • I added GH Issue ID & Linear ID
  • I added tests to verify the changes.
  • No new PII added or SDK only sends newly added PII if sendDefaultPII is enabled.
  • I updated the docs if needed.
  • I updated the wizard if needed.
  • Review from the native team if needed.
  • No breaking change or entry added to the changelog.
  • No breaking change for hybrid SDKs or communicated to hybrid SDKs.

🔮 Next steps

  • Decide if we want to go forward with client-side tombstone processing. This decision should happen with this PR.
    • If no, consider sending a tombstone as an attachment, similar to how we attach minidumps and let ingestion/processing deal with decoding (I can prototype this in parallel, since I think we can do most of it in relay, as a first step)
    • If yes, decide on adding the protobuf runtime library as a dependency to sentry-android-core (or shade/relocate, or implement our own decoder, given that this is a stable format which only requires a subset of protobuf).
      • Depending on feedback, adaptation to this minimal change (primarily handling older tombstones).
      • Otherwise, integrate an EventProcessor that merges crash events from sentry-native with tombstones.

#skip-changelog (for now)

private void reportAsSentryEvent(ApplicationExitInfo exitInfo) {
SentryEvent event;
try {
TombstoneParser parser = new TombstoneParser(exitInfo.getTraceInputStream());

This comment was marked as outdated.

Copy link
Collaborator Author

@supervacuus supervacuus Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but also left it open, since I don't know yet how the general propagation of errors is handled in integration runs (top-level handler? silent/logging discard? ...?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually use a wide try/catch and silently swallow those errors in combination with logging the error using options.getLogger()

latestTombstone = applicationExitInfo;
// remove it, so it's not reported twice
// TODO: if we fail after this, we effectively lost the ApplicationExitInfo (maybe only remove after we reported it)
exitInfos.remove(applicationExitInfo);

This comment was marked as outdated.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentionally marked as TODO, so we can discuss handling history in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point, for ANRs we write a timestamp to disk (impl) to ensure we don't double report on the next app launch. @romtsn what do you think about this? I guess what's to discuss here:

  • How many days back should we consider crashes?
    the 90 days threshold sounds good to me

  • Should we report them all?
    In my opinion yes. TBD: How to deal with persisted scope data for enrichment?

  • How to deal with unprocessable tombstones?
    In my opinion we should skip them, otherwise the processing will get stuck until the crash is out of the 90 day window. TBD: should we report those as dropped events?

private void reportAsSentryEvent(ApplicationExitInfo exitInfo) {
SentryEvent event;
try {
TombstoneParser parser = new TombstoneParser(exitInfo.getTraceInputStream());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Missing null check for trace input stream

ApplicationExitInfo.getTraceInputStream() can return null according to Android documentation (and as handled correctly in AnrV2Integration.parseThreadDump()). When null is passed to TombstoneParser, the subsequent TombstoneProtos.Tombstone.parseFrom(null) call in parse() will throw a NullPointerException. This exception isn't caught by the IOException catch block, causing the background TombstoneProcessor thread to crash silently without reporting the tombstone.

Fix in Cursor Fix in Web

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final Class<?> sentryNdkClass = loadClass.loadClass(SENTRY_NDK_CLASS_NAME, options.getLogger());
options.addIntegration(new NdkIntegration(sentryNdkClass));

if (buildInfoProvider.getSdkInfoVersion() >= Build.VERSION_CODES.R) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not entirely sure about this one:

  • not sure if we want to add the integration conditionally here, or instead only run it conditionally (with a log that explains that the integration isn't really running)
  • independent of where we check this: while it is true that REASON_CRASH_NATIVE is available with R, the tombstone will not be available before S.
  • We can ignore the latter, since the API is available with R and we have to handle a null InputStream anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As without tombstones the reported events won't be any useful I suggest to check against S instead, >= Build.VERSION_CODES.S

SentryEvent event;
try {
TombstoneParser parser = new TombstoneParser(exitInfo.getTraceInputStream());
event = parser.parse();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: InputStream from getTraceInputStream is never closed

The InputStream returned by exitInfo.getTraceInputStream() is passed to TombstoneParser but never closed. The existing AnrV2Integration properly uses try-with-resources to close this stream (line 306). The tombstone stream is stored in TombstoneParser.tombstoneStream and used by parseFrom(), but neither the parser nor the caller closes it afterward. This resource leak could exhaust file descriptors if the integration processes tombstones repeatedly.

Fix in Cursor Fix in Web

event = parser.parse();
event.setTimestamp(DateUtils.getDateTime(exitInfo.getTimestamp()));
} catch (IOException e) {
throw new RuntimeException(e);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Uncaught exception crashes background thread silently

When IOException occurs during tombstone parsing, wrapping it in RuntimeException and throwing terminates the background thread without logging. Since this runs in an ExecutorService submitted task, the exception is swallowed and never reported. The try-catch at line 77 only catches submission failures, not execution failures. This makes tombstone parsing failures invisible, preventing debugging and potentially losing crash reports.

Fix in Cursor Fix in Web

final Class<?> sentryNdkClass = loadClass.loadClass(SENTRY_NDK_CLASS_NAME, options.getLogger());
options.addIntegration(new NdkIntegration(sentryNdkClass));

if (buildInfoProvider.getSdkInfoVersion() >= Build.VERSION_CODES.R) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Wrong Android API level check for tombstone support

The integration checks for Build.VERSION_CODES.R (Android 11, API 30), but according to the author's own PR comment and Android documentation, tombstones via getTraceInputStream() are only available from Build.VERSION_CODES.S (Android 12, API 31). On Android R, getTraceInputStream() returns null for native crashes, which combined with the missing null check in the parser, causes the integration to fail on devices running Android 11.

Fix in Cursor Fix in Web

@supervacuus
Copy link
Collaborator Author

supervacuus commented Nov 26, 2025

@markushi, I just realized I cannot omit having a separate "tombstone" marker, even if I report all events, without repeating them. I mean, this was clear to me, but, in addition to it being a must for this PR already, unlike the ANR marker, I must also align it with crashedLastRun.

I wonder if it would make the most sense to do it similarly to ANR by using a TombstoneHint to ensure writing the marker timestamp at the correct life-cycle stage of the event, but introduce a new interface analog to AbnormalExit called CrashExit, so that the hint can take a different route (since it isn't truly an abnormal exit and should affect similar paths to the Native SDK crash marker).

The biggest issue with that approach is that the crashedLastRun is handled entirely in EnvelopeCache rather than AndroidEnvelopeCache, whereas ApplicationExitInfo hint/marker handling happens in the latter.

The PR still makes sense for a first review from you (since, if the general direction makes sense and you have todos not in my list, I can also add a test for the integration itself). Still, I would appreciate a short sync on how to align these execution paths (maybe I don't need to align tombstones with crashedLastRun and can abuse AbnormalExit for the same outcome, though it feels like a significant shortcut, even for minimal internal exposure, which is not at all what I aimed for).

I can convert the PR back to a draft if you prefer.

Copy link
Member

@markushi markushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments - great work so far!

import java.util.Locale;
import java.util.Map;

public class TombstoneParser {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't want classes to be consumed outside our SDK we usually mark them as Internal

Suggested change
public class TombstoneParser {
@ApiStatus.Internal
public class TombstoneParser {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I already have a change in progress that annotates this more strictly.

private final InputStream tombstoneStream;
private final Map<String, String> excTypeValueMap = new HashMap<>();

public TombstoneParser(InputStream tombstoneStream) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually are very keen on marking as much as possible as final as well as annotating any non-primitive method parameters as @Nullable or @NotNull

Suggested change
public TombstoneParser(InputStream tombstoneStream) {
public TombstoneParser(final @NotNull InputStream tombstoneStream) {

import java.util.Locale;
import java.util.Map;

public class TombstoneParser {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there are different ways to deal with closing the input stream at some point, a pattern we've used in the past would be to implement the Closable interface and then initiate the class using a try-with-resources wrapping call.

Suggested change
public class TombstoneParser {
public class TombstoneParser implements Closable {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this makes sense.

I haven't been closing anything in the integration yet because I wasn't sure about the semantics of the trace InputStream. The API docs mention that the system ring buffer evicts tombstones, which made me a bit wary RE: statefulness; however, after reading the code, it turns out that only means the returned InputStream could be null (which it can be for many reasons actually; curious choice to highlight the tombstone rotation).

All of the above operate on the same set of files (shared between system daemons and all apps, where binder delivery per app/user is decided in the system_server module linked above), however, we still operate in classic UNIX territory, aka unlink removes the path name but not the inode if there are open fds.

thread.setStacktrace(stacktrace);
if (tombstone.getTid() == threadEntry.getValue().getId()) {
thread.setCrashed(true);
// even though we refer to the thread_id from the exception,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say that's the perfect level of code comments I'd like to see, nice!

// https://android.googlesource.com/platform/system/unwinding/+/refs/heads/main/libunwindstack/Regs.cpp#175
// prevent "processing" from doing it again.
unknown.put("instruction_addr_adjustment", "none");
stacktrace.setUnknown(unknown);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unknown is mainly used for hybrid SDKs or when the Java SDK is used as a sending mechanism to ensure no data is lost due to parsing. In this case I'd favor extending the SentryStackTrace class with a private @Nullable String instructionAddressAdjustment field instead.

final Class<?> sentryNdkClass = loadClass.loadClass(SENTRY_NDK_CLASS_NAME, options.getLogger());
options.addIntegration(new NdkIntegration(sentryNdkClass));

if (buildInfoProvider.getSdkInfoVersion() >= Build.VERSION_CODES.R) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As without tombstones the reported events won't be any useful I suggest to check against S instead, >= Build.VERSION_CODES.S

latestTombstone = applicationExitInfo;
// remove it, so it's not reported twice
// TODO: if we fail after this, we effectively lost the ApplicationExitInfo (maybe only remove after we reported it)
exitInfos.remove(applicationExitInfo);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point, for ANRs we write a timestamp to disk (impl) to ensure we don't double report on the next app launch. @romtsn what do you think about this? I guess what's to discuss here:

  • How many days back should we consider crashes?
    the 90 days threshold sounds good to me

  • Should we report them all?
    In my opinion yes. TBD: How to deal with persisted scope data for enrichment?

  • How to deal with unprocessable tombstones?
    In my opinion we should skip them, otherwise the processing will get stuck until the crash is out of the 90 day window. TBD: should we report those as dropped events?

private void reportAsSentryEvent(ApplicationExitInfo exitInfo) {
SentryEvent event;
try {
TombstoneParser parser = new TombstoneParser(exitInfo.getTraceInputStream());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually use a wide try/catch and silently swallow those errors in combination with logging the error using options.getLogger()

@@ -0,0 +1,204 @@
// Added and adapted from: https://android.googlesource.com/platform/system/core/+/refs/heads/main/debuggerd/proto/tombstone.proto
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add some license attribution here, e.g. like we did here.

otel-semconv = { module = "io.opentelemetry.semconv:opentelemetry-semconv", version.ref = "otelSemanticConventions" }
otel-semconv-incubating = { module = "io.opentelemetry.semconv:opentelemetry-semconv-incubating", version.ref = "otelSemanticConventionsAlpha" }
p6spy = { module = "p6spy:p6spy", version = "3.9.1" }
protobuf-javalite = { module = "com.google.protobuf:protobuf-javalite", version.ref = "protobuf"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romtsn Not sure if you followed the conversations, but as you can see here, protobuf requires a runtime dependency. It will have an impact of around 10kb. IMHO fine for now, we should still check how stable this library is to avoid and consumer version mismatch issues.

Co-authored-by: Markus Hintersteiner <m.hintersteiner@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants