Delete duplicate geometries: Incorrect result when the output is generated in the same geopackage containing the source layer. #60023

Open
2 tasks done
ludovico85 opened this issue Dec 27, 2024 · 5 comments
Labels
Bug Either a bug report, or a bug fix. Let's hope for the latter! Data Provider Related to specific vector, raster or mesh data providers Processing Relating to QGIS Processing framework or individual Processing algorithms Regression Something which used to work, but doesn't anymore

Comments

@ludovico85

What is the bug or the crash?

Hi everyone,
I have a geopackage layer where all the geometries are duplicated (344,918 geometries).
The "Delete duplicate geometries" algorithm behaves strangely depending on the output:

  • Result saved as a temporary layer: 172,459 geometries
  • Result saved as a shapefile: 172,459 geometries
  • Result saved in a new geopackage: 172,459 geometries
  • Result saved in the source layer's geopackage: 167,281 geometries

Steps to reproduce the issue

  1. Load the layer contained in the GeoPackage
  2. Run the Delete duplicate geometries algorithm
  3. Save the output to the same GeoPackage as the source layer
  4. The number of returned geometries is wrong
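
One way to compare the feature counts of the different outputs outside QGIS is to read the GeoPackage directly as an SQLite file. The following is a minimal sketch, not part of the original report: it assumes the file follows the GeoPackage convention of listing feature tables in `gpkg_contents`, and the function name `feature_counts` is hypothetical.

```python
import sqlite3


def feature_counts(gpkg_path):
    """Return {table_name: row_count} for each feature table in a GeoPackage.

    Assumption: feature tables are registered in gpkg_contents with
    data_type = 'features', as the GeoPackage standard specifies.
    """
    con = sqlite3.connect(gpkg_path)
    try:
        tables = [row[0] for row in con.execute(
            "SELECT table_name FROM gpkg_contents WHERE data_type = 'features'"
        )]
        counts = {}
        for name in tables:
            # Quote the identifier so table names with spaces or quotes work.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = con.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        con.close()
```

Calling `feature_counts("source.gpkg")` (a hypothetical path) after each run would show whether the output table holds the 172,459 features obtained with the other output formats.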

Versions

QGIS version 3.34.14-Prizren
QGIS code revision 0cdaf6d
Qt version 5.15.13
Python version 3.12.8
GDAL/OGR version 3.9.3
PROJ version 9.5.0
EPSG Registry database version v11.016 (2024-08-31)
GEOS version 3.13.0-CAPI-1.19.0
SQLite version 3.46.1
PDAL version 2.8.1
PostgreSQL client version 16.2
SpatiaLite version 5.1.0
QWT version 6.3.0
QScintilla2 version 2.14.1
OS version Windows 11 Version 2009

Active Python plugins
Cxf_in 9.2
FreehandRasterGeoreferencer 0.8.3
GroupStats 2.2.7
LAStools 2.1.1
latlontools 3.6.20
lizmap 3.14.3
multi_filter 1.0
profile-manager 0.31
project_report 1.2
qfieldsync v4.9.1
QPackage 1.5
QuickOSM 2.2.3
raster_tracer 0.3.3
redLayer 2.2
SelectByRelationship 0.3.3
ViewshedAnalysis 1.9
db_manager 0.1.20
processing 2.12.99

Supported QGIS version

  • I'm running a supported QGIS version according to the roadmap.

New profile

Additional context

No response

@ludovico85 ludovico85 added the Bug Either a bug report, or a bug fix. Let's hope for the latter! label Dec 27, 2024
@pigreco
Contributor

pigreco commented Dec 27, 2024

I confirm

@pigreco pigreco added the Regression Something which used to work, but doesn't anymore label Dec 27, 2024
@agiudiceandrea
Contributor

agiudiceandrea commented Dec 27, 2024

I can also confirm the issue running QGIS LTR 3.34.14 and QGIS 3.40.2 (both with GDAL/OGR 3.9.2) and QGIS 3.41.0-Master (with GDAL/OGR 3.11.0-dev) on Windows 10 from OSGeo4W.

Neither the Processing log nor the Log Messages panel reports any error, so users cannot know that the issue occurred and are misled into thinking the output layer was created correctly with all the non-duplicated features, which is not the case.

The issue also occurs with a layer containing randomly generated duplicated points (1M features): it looks like it occurs when the layer contains a large number of features, but not with a small number of features.

The issue didn't occur running QGIS 3.22.0 (with GDAL/OGR 3.4.0) and previous versions.

The issue doesn't occur if the OGR_SQLITE_JOURNAL=WAL environment variable is set.
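
Since a GeoPackage is an SQLite database, the journal mode currently recorded for the file can be inspected with a standard `PRAGMA journal_mode` query. This is a small sketch for checking a file from Python. The function name `journal_mode` is hypothetical, and the sketch does not show how GDAL or QGIS choose the mode when they open the file.

```python
import sqlite3


def journal_mode(path):
    """Return the journal mode SQLite reports for the file (e.g. 'delete', 'wal')."""
    con = sqlite3.connect(path)
    try:
        return con.execute("PRAGMA journal_mode").fetchone()[0]
    finally:
        con.close()
```

WAL mode is persistent in the database file, so a file that was last switched to WAL reports `wal` here, while a file using the default rollback journal reports `delete`.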

The issue also doesn't occur on subsequent runs of the algorithm shortly after the first incorrect run (only if the input layer is open in QGIS): on the first run the .gpkg-shm and .gpkg-wal files are created only when the algorithm reaches 99%, while on the second and subsequent runs they are created right at the start of the algorithm's execution.

@rouault, I guess the PRs #47098 (implemented since QGIS 3.22.6 and 3.24.0) and OSGeo/gdal#5207 (implemented since GDAL/OGR 3.4.2 and 3.5.0) may have triggered this issue.

@agiudiceandrea agiudiceandrea added Processing Relating to QGIS Processing framework or individual Processing algorithms Data Provider Related to specific vector, raster or mesh data providers labels Dec 27, 2024
@gdt
Contributor

gdt commented Dec 30, 2024

It would be interesting to get a trace of the sqlite calls to figure out if this is a sqlite bug or a qgis/gdal bug. sqlite should be safe to use without WAL. Are there perhaps operations on the same file in two threads, if that's not allowed?

@agiudiceandrea
Contributor

agiudiceandrea commented Dec 31, 2024

@gdt, how can I get a useful trace of the sqlite calls?

Anyway, the issue also occurs on Windows 10 when running the algorithm with the qgis_process tool in the OSGeo4W Shell, even when explicitly forcing it to use only 1 core/thread (with START /AFFINITY 1), and when running the algorithm in the QGIS GUI with the algorithm's "Number of threads to use" parameter set to 1 (even with the GDAL_NUM_THREADS environment variable set to 1).

It's interesting to note that the issue doesn't seem to occur when running other algorithms, even on the same layer, e.g. "Delete duplicates by attribute".

@gdt
Contributor

gdt commented Dec 31, 2024

I don't really know; perhaps there is debug code already or perhaps it needs to be written.
In theory WAL mode provides the same atomicity/transaction guarantees. Therefore, if one gets different results with different modes, I would suspect that something in the code is not quite right with respect to transactions.

I would naively expect that the dedup code might be beginning a transaction on the source layer, reading all the features, and ending that read transaction. And also beginning a transaction on the destination layer, inserting results, and ending that transaction. But maybe it isn't using transactions, and maybe it's more complicated. This is the sort of thing I was suggesting to try to understand.
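
To make the read-while-write scenario concrete, here is a minimal sketch using plain Python `sqlite3`, with no QGIS or GDAL involved. It only demonstrates SQLite's documented locking difference between the rollback journal and WAL; whether this is related to the incorrect result reported here is not established. The table names `source` and `destination` and the function name are hypothetical.

```python
import os
import sqlite3
import tempfile


def commit_while_reading(journal_mode):
    """Return True if a writer can commit while another connection is mid-read."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "demo.sqlite")

        setup = sqlite3.connect(path)
        setup.execute(f"PRAGMA journal_mode={journal_mode}").fetchone()
        setup.execute("CREATE TABLE source(id INTEGER PRIMARY KEY)")
        setup.execute("CREATE TABLE destination(id INTEGER PRIMARY KEY)")
        setup.executemany("INSERT INTO source VALUES (?)",
                          [(i,) for i in range(100)])
        setup.commit()
        setup.close()

        # timeout=0 makes a lock conflict fail immediately instead of waiting.
        reader = sqlite3.connect(path, timeout=0)
        writer = sqlite3.connect(path, timeout=0)
        try:
            cursor = reader.execute("SELECT id FROM source")
            cursor.fetchone()  # rows remain, so the read stays open
            try:
                writer.execute("INSERT INTO destination VALUES (1)")
                writer.commit()
                return True
            except sqlite3.OperationalError:
                # Typically "database is locked" in rollback-journal mode.
                writer.rollback()
                return False
        finally:
            reader.close()
            writer.close()
```

Under SQLite's locking rules, `commit_while_reading("DELETE")` is expected to return `False`, because the commit needs an exclusive lock while the reader still holds a shared one, and `commit_while_reading("WAL")` is expected to return `True`, because WAL lets readers and a writer proceed concurrently. An application that silently ignored such a failed commit could lose rows without reporting an error, which is one kind of behaviour a trace of the sqlite calls might reveal.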

sqlite3 documents that the default threading mode is "serialized", which should be fine if two threads access the same geopackage: https://www.sqlite.org/threadsafe.html Of course, it would be good to check whether qgis/gdal/sqlite3 is really operating in this mode. I would be highly surprised if not, but checking the things you can check is still a good strategy.
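
As a rough first check, the compile-time threading mode of an SQLite library can be read from `PRAGMA compile_options`. The sketch below reports the setting for the SQLite library that Python's `sqlite3` module links against. It is an assumption that this is the same library QGIS/GDAL use, and the mode can also be changed at run time, so this only reports the compiled default. The function name is hypothetical.

```python
import sqlite3


def sqlite_threadsafe_setting():
    """Return the THREADSAFE compile option (0, 1 or 2), or None if not listed.

    Per the SQLite documentation: 0 = single-thread, 1 = serialized,
    2 = multi-thread.
    """
    con = sqlite3.connect(":memory:")
    try:
        for (option,) in con.execute("PRAGMA compile_options"):
            if option.startswith("THREADSAFE="):
                return int(option.split("=", 1)[1])
        return None
    finally:
        con.close()
```

Running this from the QGIS Python console, rather than a system Python, is more likely to reflect the library QGIS actually loads, though that is not guaranteed.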
