Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting "IO Error: GDAL Error (4): `<file>.xlsx' not recognized as a supported file format." when calling st_read on registered .xlsx file #1870

Open
rpbouman opened this issue Sep 28, 2024 · 7 comments

Comments

@rpbouman
Copy link

What happens?

  • WASM is initialized
  • spatial extension loaded and installed
  • .xlsx file registered with registerFileHandle using DuckDBDataProtocol.BROWSER_FILEREADER
  • query SELECT * FROM st_read( '<filename>' ) fails with

Error: IO Error: GDAL Error (4): test.xlsx' not recognized as a supported file format.`

To Reproduce

I'm running this page locally:

<!doctype html>
<html>
  <head>
    <title>DuckDB WASM: Reading XSLX</title>
  </head>
  <body>
    <div id="content">
      <div id="initialization">
        <div id="status">Initializing, please wait...</div>
        <ul id="initActions">
        </ul>
      </div>
      <div style="display:none" id="ui">
        <label>Choose a file<label>: <input type="file" id="fileInput" accept=".xlsx, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"/>
      </div>
    </div>
    <script type="application/javascript">
      function statusUpdate(action) {
        var list = document.getElementById('initActions');
        list.innerHTML += `<li>${action}</li>`;
      }
      
      function alertError(message, error){
        var message = `${message}: ${error.message}\n\n${error.stack}`;
        console.error(message);
        alert(message);
      }
      
      function afterInit(){
        document.getElementById('initialization').style.display = 'none';
        document.getElementById('fileInput').addEventListener('change', async function(event){
          var files = event.currentTarget.files;
          if (files.length === 0){ 
            alert('No file selected!');
            return;
          }
          
          var file = files[0];
          var duckdb = window.duckdb;
          try {
            await duckdb.instance.registerFileHandle(
              file.name, 
              file, 
              duckdb.api.DuckDBDataProtocol.BROWSER_FILEREADER, 
              true
            );
            alert(`File registered!`);
          }
          catch (e){
            alertError('Error registering file', e);
            return;
          }

          try {
            var sql = `SELECT * FROM st_read('${file.name}')`;
            var result = await duckdb.connection.query(sql);
            debugger;
          }
          catch(e){
            alertError('Error executing query on file', e);
            return;
          }
        });
        document.getElementById('ui').style.display = 'block';
      }
    </script>
    <script type="module">
      import * as duckdb from 'https://cdn.jsdelivr.net/npm/@duckdb/duckdb-wasm@1.28.1-dev287.0/+esm';

      statusUpdate('module loaded');

      const JSDELIVR_BUNDLES = duckdb.getJsDelivrBundles();
      const bundle = await duckdb.selectBundle(JSDELIVR_BUNDLES);
      statusUpdate('jsdelivr bundle selected');

      const worker_url = URL.createObjectURL(
        new Blob([`importScripts("${bundle.mainWorker}");`], {type: 'text/javascript'})
      );
      const worker = new Worker(worker_url);
      const logger = new duckdb.ConsoleLogger();
      const db = new duckdb.AsyncDuckDB(logger, worker);
      URL.revokeObjectURL(worker_url);

      await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
      statusUpdate('DuckDB instance created');

      const connection = await db.connect();
      statusUpdate('DuckDB connection created');
      
      var versionResult = await connection.query(`SELECT version() as versioninfo`);
      var versionInfo = versionResult.get(0)['versioninfo'];
      statusUpdate(`DuckDB ${versionInfo} initialized.`);
      console.log(`DuckDB version: ${versionInfo}`) ;

      // expose duckdb as global
      window.duckdb = {
        api: duckdb,
        instance: db,
        connection: connection
      };
      
      statusUpdate('Installing spatial extension');
      await connection.query('INSTALL SPATIAL');
      statusUpdate('Loading spatial extension');
      await connection.query('LOAD SPATIAL');
      afterInit();
    </script>
    
  </body>
</html>

Sample xlsx file is attached.
test.xlsx

Browser/Environment:

Chrome 129.0.6668.71, MS Edge 129.0.2792.65

Device:

Windows 11 Desktop

DuckDB-Wasm Version:

using https://cdn.jsdelivr.net/npm/@duckdb/duckdb-wasm@1.28.1-dev287.0/+esm

DuckDB-Wasm Deployment:

https://cdn.jsdelivr.net/npm/@duckdb/duckdb-wasm@1.28.1-dev287.0/+esm

Full Name:

Roland Bouman

Affiliation:

EPAM Systems BV Netherlands

@rpbouman
Copy link
Author

I should have included: I can open the file fine using duckdb cli:

DuckDb> SELECT * FROM st_read( 'C:\roland\projects\duckdb-wasm-snippets\3. reading xlsx\test.xlsx' );
┌────────────┬──────────┐
│   Fruit    │ Calories │
│  varchar   │  int32   │
├────────────┼──────────┤
│ Apple      │       52 │
│ Banana     │       89 │
│ Cantaloupe │       39 │
└────────────┴──────────┘
Run Time (s): real 0.045 user 0.000000 sys 0.015625
changes:   0   total_changes: 0
DuckDb>

@carlopi
Copy link
Collaborator

carlopi commented Sep 29, 2024

Question, can the file be opened when queried from an http endpoint?

Something like:

LOAD spatial;
SELECT * FROM st_read( 'http://127.0.0.1:8081/<filename>' )

while having a local http-server?

Question is basically whether this problem is with reading the format at all in Wasm (possibly due to different config in DuckDB-Wasm spatial) or this is a filesystem mismatch within duckdb-wasm.
Thanks

@carlopi
Copy link
Collaborator

carlopi commented Sep 29, 2024

Actually this seems to work, at least on the shell, from a remote endpoint:
https://shell.duckdb.org/#queries=v0,LOAD-spatial~,SUMMARIZE-FROM-st_read('https%3A%2F%2Fraw.githubusercontent.com%2Fduckdb%2Fduckdb%20rs%2Fmain%2Fcrates%2Fduckdb%2Fexamples%2FMovies_Social_metadata.xlsx')~
(it's another file, but should likely be independent)

@rpbouman
Copy link
Author

rpbouman commented Sep 29, 2024

Hi @carlopi - I only just noticed you had two replies.

I can confirm that I can read it with DuckDB WASM when I call st_read explicitly and pass in the http url.

Interestingly, the example with the http url currently fails in the CLI (no WASM) with DuckDB 1.1.0 GA and the nightly build v1.1.2-dev44:

image

@carlopi
Copy link
Collaborator

carlopi commented Sep 29, 2024

I think in the CLI explicitly loading httpfs is needed:

LOAD https;
LOAD spatial;
SUMMARIZE FROM st_read('https://raw.githubusercontent.com/duckdb/duckdb-rs/main/crates/duckdb/examples/Movies_Social_metadata.xlsx');

then works on my end.

@carlopi
Copy link
Collaborator

carlopi commented Sep 29, 2024

And @rpbouman, on your previous comment: I was trying to test (with very low effort) whether this was supported at all by spatial in wasm / find where things go wrong, and it appears it's connected to the File-based filesystem. I will need to dig there, log system calls, and find where stuff start to differ (and why / how to fix it).

@rpbouman
Copy link
Author

I think in the CLI explicitly loading httpfs is needed:
...
then works on my end.

Ah! Ok, I thought that was enabled by default. After installing it explicitly, it works for me too. THanks for clearing that up!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants