-
-
Notifications
You must be signed in to change notification settings - Fork 464
feat: minimal tombstone integration #4933
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: minimal tombstone integration #4933
Conversation
| private void reportAsSentryEvent(ApplicationExitInfo exitInfo) { | ||
| SentryEvent event; | ||
| try { | ||
| TombstoneParser parser = new TombstoneParser(exitInfo.getTraceInputStream()); |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, but also left it open, since I don't know yet how the general propagation of errors is handled in integration runs (top-level handler? silent/logging discard? ...?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually use a wide try/catch and silently swallow those errors in combination with logging the error using options.getLogger()
| latestTombstone = applicationExitInfo; | ||
| // remove it, so it's not reported twice | ||
| // TODO: if we fail after this, we effectively lost the ApplicationExitInfo (maybe only remove after we reported it) | ||
| exitInfos.remove(applicationExitInfo); |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is intentionally marked as TODO, so we can discuss handling history in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good point, for ANRs we write a timestamp to disk (impl) to ensure we don't double report on the next app launch. @romtsn what do you think about this? I guess what's to discuss here:
-
How many days back should we consider crashes?
the 90 days threshold sounds good to me -
Should we report them all?
In my opinion yes. TBD: How to deal with persisted scope data for enrichment? -
How to deal with unprocessable tombstones?
In my opinion we should skip them, otherwise the processing will get stuck until the crash is out of the 90 day window. TBD: should we report those as dropped events?
| private void reportAsSentryEvent(ApplicationExitInfo exitInfo) { | ||
| SentryEvent event; | ||
| try { | ||
| TombstoneParser parser = new TombstoneParser(exitInfo.getTraceInputStream()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Missing null check for trace input stream
ApplicationExitInfo.getTraceInputStream() can return null according to Android documentation (and as handled correctly in AnrV2Integration.parseThreadDump()). When null is passed to TombstoneParser, the subsequent TombstoneProtos.Tombstone.parseFrom(null) call in parse() will throw a NullPointerException. This exception isn't caught by the IOException catch block, causing the background TombstoneProcessor thread to crash silently without reporting the tombstone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See: #4933 (comment)
| final Class<?> sentryNdkClass = loadClass.loadClass(SENTRY_NDK_CLASS_NAME, options.getLogger()); | ||
| options.addIntegration(new NdkIntegration(sentryNdkClass)); | ||
|
|
||
| if (buildInfoProvider.getSdkInfoVersion() >= Build.VERSION_CODES.R) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not entirely sure about this one:
- not sure if we want to add the integration conditionally here, or instead only run it conditionally (with a log that explains that the integration isn't really running)
- independent of where we check this: while it is true that
REASON_CRASH_NATIVEis available withR, the tombstone will not be available beforeS. - We can ignore the latter, since the API is available with
Rand we have to handle anullInputStreamanyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As without tombstones the reported events won't be any useful I suggest to check against S instead, >= Build.VERSION_CODES.S
| SentryEvent event; | ||
| try { | ||
| TombstoneParser parser = new TombstoneParser(exitInfo.getTraceInputStream()); | ||
| event = parser.parse(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: InputStream from getTraceInputStream is never closed
The InputStream returned by exitInfo.getTraceInputStream() is passed to TombstoneParser but never closed. The existing AnrV2Integration properly uses try-with-resources to close this stream (line 306). The tombstone stream is stored in TombstoneParser.tombstoneStream and used by parseFrom(), but neither the parser nor the caller closes it afterward. This resource leak could exhaust file descriptors if the integration processes tombstones repeatedly.
| event = parser.parse(); | ||
| event.setTimestamp(DateUtils.getDateTime(exitInfo.getTimestamp())); | ||
| } catch (IOException e) { | ||
| throw new RuntimeException(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Uncaught exception crashes background thread silently
When IOException occurs during tombstone parsing, wrapping it in RuntimeException and throwing terminates the background thread without logging. Since this runs in an ExecutorService submitted task, the exception is swallowed and never reported. The try-catch at line 77 only catches submission failures, not execution failures. This makes tombstone parsing failures invisible, preventing debugging and potentially losing crash reports.
| final Class<?> sentryNdkClass = loadClass.loadClass(SENTRY_NDK_CLASS_NAME, options.getLogger()); | ||
| options.addIntegration(new NdkIntegration(sentryNdkClass)); | ||
|
|
||
| if (buildInfoProvider.getSdkInfoVersion() >= Build.VERSION_CODES.R) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Wrong Android API level check for tombstone support
The integration checks for Build.VERSION_CODES.R (Android 11, API 30), but according to the author's own PR comment and Android documentation, tombstones via getTraceInputStream() are only available from Build.VERSION_CODES.S (Android 12, API 31). On Android R, getTraceInputStream() returns null for native crashes, which combined with the missing null check in the parser, causes the integration to fail on devices running Android 11.
|
@markushi, I just realized I cannot omit having a separate "tombstone" marker, even if I report all events, without repeating them. I mean, this was clear to me, but, in addition to it being a must for this PR already, unlike the ANR marker, I must also align it with I wonder if it would make the most sense to do it similarly to The biggest issue with that approach is that the The PR still makes sense for a first review from you (since, if the general direction makes sense and you have todos not in my list, I can also add a test for the integration itself). Still, I would appreciate a short sync on how to align these execution paths (maybe I don't need to align tombstones with I can convert the PR back to a draft if you prefer. |
markushi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments - great work so far!
| import java.util.Locale; | ||
| import java.util.Map; | ||
|
|
||
| public class TombstoneParser { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't want classes to be consumed outside our SDK we usually mark them as Internal
| public class TombstoneParser { | |
| @ApiStatus.Internal | |
| public class TombstoneParser { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I already have a change in progress that annotates this more strictly.
| private final InputStream tombstoneStream; | ||
| private final Map<String, String> excTypeValueMap = new HashMap<>(); | ||
|
|
||
| public TombstoneParser(InputStream tombstoneStream) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually are very keen on marking as much as possible as final as well as annotating any non-primitive method parameters as @Nullable or @NotNull
| public TombstoneParser(InputStream tombstoneStream) { | |
| public TombstoneParser(final @NotNull InputStream tombstoneStream) { |
| import java.util.Locale; | ||
| import java.util.Map; | ||
|
|
||
| public class TombstoneParser { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess there are different ways to deal with closing the input stream at some point, a pattern we've used in the past would be to implement the Closable interface and then initiate the class using a try-with-resources wrapping call.
| public class TombstoneParser { | |
| public class TombstoneParser implements Closable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this makes sense.
I haven't been closing anything in the integration yet because I wasn't sure about the semantics of the trace InputStream. The API docs mention that the system ring buffer evicts tombstones, which made me a bit wary RE: statefulness; however, after reading the code, it turns out that only means the returned InputStream could be null (which it can be for many reasons actually; curious choice to highlight the tombstone rotation).
- Each returned trace stream is wrapped as an
AutoCloseInputStreamon a fresh parcel fd - The tombstones the app searches through were retrieved by
system_servervia a file watcher on/data/tombstones - and the rotation mentioned in the API docs only happens in the
CrashQueueoftombstonedviaunlinkatandlinkat.
All of the above operate on the same set of files (shared between system daemons and all apps, where binder delivery per app/user is decided in the system_server module linked above), however, we still operate in classic UNIX territory, aka unlink removes the path name but not the inode if there are open fds.
| thread.setStacktrace(stacktrace); | ||
| if (tombstone.getTid() == threadEntry.getValue().getId()) { | ||
| thread.setCrashed(true); | ||
| // even though we refer to the thread_id from the exception, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say that's the perfect level of code comments I'd like to see, nice!
| // https://android.googlesource.com/platform/system/unwinding/+/refs/heads/main/libunwindstack/Regs.cpp#175 | ||
| // prevent "processing" from doing it again. | ||
| unknown.put("instruction_addr_adjustment", "none"); | ||
| stacktrace.setUnknown(unknown); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unknown is mainly used for hybrid SDKs or when the Java SDK is used as a sending mechanism to ensure no data is lost due to parsing. In this case I'd favor extending the SentryStackTrace class with a private @Nullable String instructionAddressAdjustment field instead.
| final Class<?> sentryNdkClass = loadClass.loadClass(SENTRY_NDK_CLASS_NAME, options.getLogger()); | ||
| options.addIntegration(new NdkIntegration(sentryNdkClass)); | ||
|
|
||
| if (buildInfoProvider.getSdkInfoVersion() >= Build.VERSION_CODES.R) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As without tombstones the reported events won't be any useful I suggest to check against S instead, >= Build.VERSION_CODES.S
| latestTombstone = applicationExitInfo; | ||
| // remove it, so it's not reported twice | ||
| // TODO: if we fail after this, we effectively lost the ApplicationExitInfo (maybe only remove after we reported it) | ||
| exitInfos.remove(applicationExitInfo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good point, for ANRs we write a timestamp to disk (impl) to ensure we don't double report on the next app launch. @romtsn what do you think about this? I guess what's to discuss here:
-
How many days back should we consider crashes?
the 90 days threshold sounds good to me -
Should we report them all?
In my opinion yes. TBD: How to deal with persisted scope data for enrichment? -
How to deal with unprocessable tombstones?
In my opinion we should skip them, otherwise the processing will get stuck until the crash is out of the 90 day window. TBD: should we report those as dropped events?
| private void reportAsSentryEvent(ApplicationExitInfo exitInfo) { | ||
| SentryEvent event; | ||
| try { | ||
| TombstoneParser parser = new TombstoneParser(exitInfo.getTraceInputStream()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually use a wide try/catch and silently swallow those errors in combination with logging the error using options.getLogger()
| @@ -0,0 +1,204 @@ | |||
| // Added and adapted from: https://android.googlesource.com/platform/system/core/+/refs/heads/main/debuggerd/proto/tombstone.proto | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also add some license attribution here, e.g. like we did here.
| otel-semconv = { module = "io.opentelemetry.semconv:opentelemetry-semconv", version.ref = "otelSemanticConventions" } | ||
| otel-semconv-incubating = { module = "io.opentelemetry.semconv:opentelemetry-semconv-incubating", version.ref = "otelSemanticConventionsAlpha" } | ||
| p6spy = { module = "p6spy:p6spy", version = "3.9.1" } | ||
| protobuf-javalite = { module = "com.google.protobuf:protobuf-javalite", version.ref = "protobuf"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@romtsn Not sure if you followed the conversations, but as you can see here, protobuf requires a runtime dependency. It will have an impact of around 10kb. IMHO fine for now, we should still check how stable this library is to avoid and consumer version mismatch issues.
Co-authored-by: Markus Hintersteiner <m.hintersteiner@gmail.com>
📜 Description
As discussed last week with @markushi, the process will be to make minimal atomic changes in each PR and merge directly to
mainrather than accumulating on an uber feature branch. This allows for easier review/feedback/corrections, and we can already test a subset of the entire feature "in the field".The first minimal change is a basic tombstone integration:
sentry-android-core) or get two reports for the same crashApplicationExitInfoentries withREASON_CRASH_NATIVEor report them too, including the option for the latter; I left this out for minimal interface exposure in the first step, but either variant is easy to add to this PR or later)protobuf-javalite(the entire features adds ca. 75KiB to the Android sample release APK)protobufplugin and theprotoccompiler to automate protocol updatesOpen Issues:
protobufruntime is relatively small, there is still the possibility of conflicting with client-sideprotobufversions (major versions often introduce quite severe breakage, but I haven't tested this yet, only reviewed change logs).ANRv2)ManifestMetadataReaderto configure conveniently? (or not since the corresponding options are only internal?)💡 Motivation and Context
First sensible release step for #3295
Part of https://linear.app/getsentry/project/tombstone-support-android-0024cb6e3ffd/
💚 How did you test it?
📝 Checklist
sendDefaultPIIis enabled.🔮 Next steps
protobufruntime library as a dependency tosentry-android-core(or shade/relocate, or implement our own decoder, given that this is a stable format which only requires a subset ofprotobuf).EventProcessorthat merges crash events fromsentry-nativewith tombstones.#skip-changelog (for now)