Insert record example missing schema #4

Open
jesseVast opened this issue Feb 5, 2025 · 4 comments

Comments

@jesseVast

In https://vast-data.github.io/data-platform-field-docs/vast_database/sdk_ref/manipulation.html#insert, the example code fails with the following error.

2025-02-05 09:27:59,897: errors:from_response:229:WARNING - RPC failed: {'code': 'TabularMismatchColumnType', 'message': 'Mismatched column type between file to import and existing table.', 'method': 'POST', 'url': 'http://main.selab-var204.selab.vastdata.com/jthaloor-db/test2/tbl_1?rows', 'status': 404, 'headers': {'x-amz-id-2': '50131000547d9', 'x-amz-request-id': '50131000547d9', 'Date': 'Wed, 05 Feb 2025 14:27:10 GMT', 'Strict-Transport-Security': 'max-age=86400', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'Access-Control-Allow-Origin': '*', 'Transfer-Encoding': 'chunked', 'Server': 'vast 5.2.0.131'}}
Couldn't insert data: {'code': 'TabularMismatchColumnType', 'message': 'Mismatched column type between file to import and existing table.', 'method': 'POST', 'url': 'http://main.selab-var204.selab.vastdata.com/jthaloor-db/test2/tbl_1?rows', 'status': 404, 'headers': {'x-amz-id-2': '50131000547d9', 'x-amz-request-id': '50131000547d9', 'Date': 'Wed, 05 Feb 2025 14:27:10 GMT', 'Strict-Transport-Security': 'max-age=86400', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'Access-Control-Allow-Origin': '*', 'Transfer-Encoding': 'chunked', 'Server': 'vast 5.2.0.131'}}
<pyarrow.lib.RecordBatchReader object at 0x135ae1f50>

It appears that there is a problem with the schema.

ROWS = { 
    'Citizen_Name': ['Alice','Bob'], 'Citizen_Age': [25,24]
}
PA_RECORD_BATCH = pa.RecordBatch.from_pydict(ROWS)

Digging into it, RecordBatch.from_pydict takes an optional "schema" argument. Providing a schema solves the issue.

I think we should always provide a schema when creating the RecordBatch. Alternatively, the .insert function could automatically infer the schema at runtime.

Versions:
vastdb: 1.3.6
Vast: 5.2 SP10

@snowch
Collaborator

snowch commented Feb 5, 2025

@jesseVast what was the output of running list_rows() from the previous cell?

@jesseVast
Author

jesseVast commented Feb 5, 2025

Did not follow the exact example. But here is what I did.

Results from the list:
[{'id': 2, 'name': 'hello', '$row_id': 0}, {'id': 3, 'name': 'world', '$row_id': 1}]

Function to insert data (vastdbs is a session class that does all the initial setup with the access keys and endpoints):

def insert_data(vastdbs,tablename):
    ROWS = {
        'id': [4, 5],
        'name': ['hello', 'world'],
    }
    # PA_RECORD_BATCH = pa.RecordBatch.from_pydict(ROWS,schema=vastdbs.tableschema[tablename])
    PA_RECORD_BATCH = pa.RecordBatch.from_pydict(ROWS)

    with vastdbs._session.transaction() as tx:   
        try:
            table = vastdbs.get_table(tx,tablename)
            table.insert(PA_RECORD_BATCH)
            print("Data inserted.")
        except Exception as e:
            print("Couldn't insert data:", e)

Table schema and table creation

tableschema = pa.schema([('id', pa.int32()), ('name', pa.string())])
vast.create_table('tbl_1',tableschema)

The create_table instance method from the vastdbs class:

def create_table(self, tablename: str, tableschema: pa.Schema) -> None:
    '''
    create_table(self, tablename: str, tableschema: pa.Schema)
    Creates table "tablename" if it does not exist.
    '''
    logger.info(f"Checking table {tablename} exists.")
    if self._ready:
        with self._session.transaction() as tx:
            # Connect to the bucket
            bucket = tx.bucket(self.bucket)
            schema = bucket.schema(self.dbschema, fail_if_missing=False)
            # Check if the table exists
            table = schema.table(tablename, fail_if_missing=False)
            if table is None:
                # newschema=self.include_rowID(tableschema)
                table = schema.create_table(tablename, tableschema)
                logger.info(f"Table '{tablename}' created with schema {tableschema}")
            else:
                logger.warning(f"Table '{tablename}' already exists")
            # Remember the schema so inserts can reuse it
            self.tableschema[tablename] = tableschema
    else:
        logger.error("Database not ready for transactions.")
        raise Exception("Database not ready for transactions.")

@snowch
Collaborator

snowch commented Feb 5, 2025

Thanks @jesseVast. I've added the guard code - does this look better? https://vast-data.github.io/data-platform-field-docs/vast_database/sdk_ref/manipulation.html

@jesseVast
Author

This catches the error. We should add more details. The "TabularMismatchColumnType" error is painful to troubleshoot.

In the RecordBatch.from_pydict definition, the description of the schema argument says "If not passed, will be inferred from the Mapping values." The functions (.from_pylist/.from_pydict) will try to infer the schema, and the result MAY not match the table because a Python dict or list carries no column type info. Here, for example, Python ints are inferred as int64 while the table column is int32. Guessing column types may not be the best approach in this case. On the other hand, *.from_pandas(..) will succeed since pandas -> pa types have a fixed translation.

Can we make the following change in addition to the error checking?
PA_RECORD_BATCH = pa.RecordBatch.from_pydict(ROWS, schema=ARROW_SCHEMA)
