GregoryComer
Member

We are seeing intermittent SIGILL crashes when running NEON dot product kernels on certain UNISOC-based phones. The previous change (#265) addressed most of these crashes, but some remain on a small subset of hardware running Meta apps.

I tracked this down to a failure of the chipset detection logic in CPUINFO, which causes the existing logic not to recognize the SoC as one of the UNISOC chips that shouldn't run NEON dot product instructions. While it would be nice to fix the chipset detection itself, the failure only happens on a very small subset of devices, and I can't reproduce it on a local Itel A50 (one of the affected production devices), possibly because of differing firmware or OS images.

To solve it, I've added logic in arm/linux/aarch32-isa.c to disable NEON dot product instructions on unknown chipsets. Running this internally at Meta for a few weeks has cleared up the remaining crashes on UNISOC devices (zero instances with this patch). This should solve the issue with minimal collateral damage.
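
A minimal sketch of the policy described above, not the actual patch: the enum, its values, and the function name are hypothetical stand-ins, and cpuinfo's real types in arm/linux/aarch32-isa.c differ. It only illustrates the decision "disable dot product when the vendor is UNISOC or could not be detected".

```c
#include <stdbool.h>

/* Hypothetical vendor enum for illustration; not cpuinfo's real type. */
enum soc_vendor {
    soc_vendor_unknown = 0,
    soc_vendor_unisoc,
    soc_vendor_qualcomm,
};

/*
 * Decide whether NEON dot product may be reported as usable.
 * hw_reports_dot is what the kernel advertises for the feature.
 */
static bool allow_neon_dot(enum soc_vendor vendor, bool hw_reports_dot) {
    if (!hw_reports_dot) {
        return false;
    }
    /* Known-bad vendor, plus any chipset whose vendor was not detected. */
    if (vendor == soc_vendor_unisoc || vendor == soc_vendor_unknown) {
        return false;
    }
    return true;
}
```

Under this policy, any device whose vendor detection fails loses dot product, even when the hardware supports it.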

GregoryComer force-pushed the unknown-chipsets-no-dot branch from 5396889 to b210c01 on June 3, 2025 07:18
GregoryComer merged commit 6c9eb84 into pytorch:main on Jun 3, 2025
12 checks passed
bunnitha pushed a commit to bunnitha/open-cpuinfo that referenced this pull request Jun 27, 2025
@fbarchard
Collaborator

This causes dot product not to be recognized on Samsung Exynos, Samsung Qualcomm, and Pixel Tensor devices, all of which have vendor->unknown.

Pixel 6
SoC name: Unknown
Microarchitectures:
2x Cortex-X1
2x Cortex-A76
4x Cortex-A55

Samsung S22 Exynos
SoC name: Unknown
Microarchitectures:
1x Cortex-X2
3x Cortex-A710
4x Cortex-A510

Samsung S23 Qualcomm
SoC name: Unknown
Microarchitectures:
1x Cortex-X3
4x Cortex-A715
3x Cortex-A510

Pixel Watch
SoC name: Unknown
Microarchitectures:
4x Cortex-A53
Granted, the A53 doesn't have dot product, but the Qualcomm SW series might improve in the future and is still an 'unknown' vendor.

I think a safer fix would be to disable dot product on all known Unisoc chips, as it looks like future chips will continue to have this problem (a Linux kernel bug).

@GregoryComer
Member Author

> This causes dot product not to be recognized on Samsung Exynos, Samsung Qualcomm, and Pixel Tensor, all which have vendor->unknown.

Thanks for the detailed info. We should be able to add a more fine-grained check, since this one is causing issues. CPUINFO doesn't currently detect the vendor on the problem chips, but we can likely add some logic to properly detect UNISOC phones in the currently failing cases.

It's a little tricky, as I don't have easy access to the problem devices. The issue seems to be specific to certain hardware or system firmware revisions, as there are differences in how some of these phones report hardware even among the same model + kernel version. But I was able to narrow it down to conflicting detected manufacturers, unisoc vs spreadtrum (unisoc bought spreadtrum, I think?). Maybe we can update the detection logic to allow this conflict and treat it as unisoc when there's a mismatch, instead of bailing out.
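
One way the "treat a unisoc/spreadtrum mismatch as unisoc" idea could look, as a rough sketch only. The enum and both function names are hypothetical, and which two detection sources disagree on the affected phones is an assumption here, since the thread does not say.

```c
#include <stdbool.h>

/* Hypothetical vendor enum for illustration; not cpuinfo's real type. */
enum vendor_id {
    vendor_id_unknown = 0,
    vendor_id_unisoc,
    vendor_id_spreadtrum,
    vendor_id_mediatek,
};

static bool is_unisoc_family(enum vendor_id v) {
    return v == vendor_id_unisoc || v == vendor_id_spreadtrum;
}

/*
 * Merge two independent detection results. A unisoc/spreadtrum mismatch
 * resolves to unisoc; any other disagreement still yields unknown.
 */
static enum vendor_id reconcile_vendor(enum vendor_id a, enum vendor_id b) {
    if (a == b) {
        return a;
    }
    if (a == vendor_id_unknown) {
        return b;
    }
    if (b == vendor_id_unknown) {
        return a;
    }
    if (is_unisoc_family(a) && is_unisoc_family(b)) {
        return vendor_id_unisoc;
    }
    return vendor_id_unknown;
}
```

The point of the special case is that a mismatch inside the same vendor family no longer falls through to "unknown", so a UNISOC-specific dot product check can still apply.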

CC @digantdesai @kimishpatel

@fbarchard
Collaborator

You noted 'Itel A50'
https://www.gsmarena.com/itel_a50-13183.php

Android 14 (Go edition)
Chipset Unisoc T603

Looking at the match_t function, it should detect any 3- or 4-digit model number in the T series.
Does match_t take its input from /proc/cpuinfo?
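
For illustration, a standalone sketch of the kind of match described above: a 'T' followed by exactly three or four digits. This is not cpuinfo's match_t; the function name, signature, and the strictness about trailing characters are all assumptions.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Parse a name such as "T603": 'T' followed by exactly 3 or 4 digits and
 * nothing else. On success, stores the numeric model in *model_out.
 */
static bool parse_t_series(const char* name, uint32_t* model_out) {
    if (name == NULL || model_out == NULL || name[0] != 'T') {
        return false;
    }
    uint32_t model = 0;
    size_t digits = 0;
    for (const char* p = name + 1; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p)) {
            return false;
        }
        model = model * 10 + (uint32_t)(*p - '0');
        if (++digits > 4) {
            return false;
        }
    }
    if (digits < 3) {
        return false;
    }
    *model_out = model;
    return true;
}
```

A matcher like this only helps if the string it is given actually contains the chipset name, which is the open question for the affected devices.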

Modern CPUs don't fill in /proc/cpuinfo, so we fall back on getprop, e.g.
cpuinfo_arm_android_decode_chipset_from_ro_mediatek_platform

e.g. Pixel 4 cpuinfo
adb shell more /proc/cpuinfo | grep Hardware
Hardware : Qualcomm Technologies, Inc SM8150

Pixel 4 getprop
adb shell getprop | grep 8150
[gsm.version.baseband]: [g8150-00063-200702-B-6648947]
[persist.vendor.radio.cnv.ver_info]: [MCFG-g8150-00063-200702-B-6648947
[ro.boot.hardware.platform]: [sm8150]
[ro.build.expect.baseband]: [g8150-00063-200702-B-6648947]
[vendor.sys.slpi.firmware.version]: [sm8150-slpi-gfc68f2efb55-6761976 Fri Aug 14 00:45:57 UTC 2020]

The function
cpuinfo_arm_android_decode_chipset_from_ro_board_platform
would likely work on Pixel 4

On a Pixel Watch 2, /proc/cpuinfo does not provide 'Hardware'.
getprop might work, but this is what it reports:
adb shell getprop | grep platform
[ro.board.platform]: [monaco]
[ro.boot.hardware.platform]: [sdw5100]
[ro.cw_build.platform_mr]: [5]
[ro.cw_build.platform_qpr.version]: [3]

There is not much consistency with getprop
cpuinfo_arm_android_parse_properties fetches the following properties:
ro.product.board
ro.board.platform
ro.mediatek.platform
ro.arch
ro.chipname
ro.hardware.chipname
But it looks like Qualcomm Pixel 4 and Pixel Watch would want
ro.boot.hardware.platform

On a Qualcomm Samsung S23, ro.product.board exists but is not useful:
[ro.product.board]: [kalama]
There are friendly strings for the device
[ro.product.manufacturer]: [samsung]
[ro.product.model]: [SM-S918U1]
but we probably want the SoC properties:
[ro.soc.manufacturer]: [QTI]
[ro.soc.model]: [SM8550]

The ro.soc.* properties also work on the Samsung S25:
[ro.soc.manufacturer]: [QTI]
[ro.soc.model]: [SM8750]
Pixel 6
[ro.soc.manufacturer]: [Google]
[ro.soc.model]: [Tensor]
Pixel Watch 2
[ro.soc.manufacturer]: [Qualcomm]
[ro.soc.model]: [SW5100]
Samsung S22 Exynos
[ro.soc.manufacturer]: [Samsung]
[ro.soc.model]: [s5e9925]

So for new ARM Android devices, ro.soc.model would fill in the vendor more completely.
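
As a rough sketch of how the ro.soc.manufacturer values listed above could be mapped to a vendor: the enum and function name are hypothetical, and the table contains only the strings seen in this thread. What the affected UNISOC devices report for this property is not shown here, so no UNISOC entry is included. On a device, the string would typically be read with `__system_property_get` from `<sys/system_properties.h>`; that call is left out so the example compiles off-device.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical vendor enum for illustration; not cpuinfo's real type. */
enum soc_maker {
    soc_maker_unknown = 0,
    soc_maker_qualcomm,
    soc_maker_google,
    soc_maker_samsung,
};

/* Map a ro.soc.manufacturer string to a vendor, or unknown if unlisted. */
static enum soc_maker soc_maker_from_property(const char* manufacturer) {
    static const struct {
        const char* text;
        enum soc_maker maker;
    } table[] = {
        {"QTI", soc_maker_qualcomm},      /* Samsung S23, Samsung S25 */
        {"Qualcomm", soc_maker_qualcomm}, /* Pixel Watch 2 */
        {"Google", soc_maker_google},     /* Pixel 6 */
        {"Samsung", soc_maker_samsung},   /* Samsung S22 Exynos */
    };
    if (manufacturer == NULL) {
        return soc_maker_unknown;
    }
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (strcmp(manufacturer, table[i].text) == 0) {
            return table[i].maker;
        }
    }
    return soc_maker_unknown;
}
```

Note that the same vendor appears under two spellings ("QTI" and "Qualcomm"), so an exact-match table needs one entry per observed string.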

I'm not sure how to detect your 'unknown' T603 other than to try the existing /proc/cpuinfo method.

There are many T-series chips with Cortex-A55 cores that likely have the issue:
https://en.wikipedia.org/wiki/List_of_UNISOC_systems_on_chips

FWIW I don't have the T603 but do have a T310

@fbarchard
Collaborator

Filed a new bug #322
and PR #321
