Listing existing columns in table and unique columns #344

dgutman · 2024-10-27T14:50:53Z

Perhaps I am just missing something obvious, but what is the simplest way to list the columns in an existing table? Similarly, is there a way to define a column as unique to prevent re-inserting the same files?

I am currently experimenting with this in a Jupyter notebook, which is a non-linear world in terms of rerunning code snippets.

I am creating a set of different columns / computed columns, and if I rerun the same cell block it, not surprisingly, throws an error. I see there is a .column() property attached to the table, but that seems to return a tuple, which makes a check like the following awkward:

from pixeltable.functions.audio import get_metadata

# Only add the computed column if it is not already there
if 'audio_metadata' not in audio_table:
    audio_table['audio_metadata'] = get_metadata(audio_table.audio)

In a similar vein... I have a directory of audio files I am ingesting, but at least initially I am only inserting the first ten. It seems if I rerun that code block, it will keep inserting the same audio files (again.. not surprisingly) because I don't define any sort of unique index for the audio_file path...

I see my most likely future use case is not wanting to necessarily keep track of every audio file I insert, and just doing a glob.glob(//*.mp3) or something similar, and ideally letting the database figure out what's new and what's not new.

dgutman · 2024-10-27T14:51:50Z

Author

Sorry-- finished this too soon. I realize there are of course ways around all of this using queries, etc.. just trying to not build a much more complicated code base if there's something as simple as a unique flag or some other similar syntactic sugar somewhere I haven't stumbled upon.

aaron-siegel · 2024-10-28T19:04:33Z

Maintainer

Thanks for the feedback! To print the table schema, you can type t.describe() or simply t in a notebook. If you want to get the column names as a list, try this:

list(t.get_metadata()['schema'].keys())

To see whether a column name exists in the table, col_name in t.get_metadata()['schema'] should work. (Note that here I'm calling get_metadata() on the table, not to be confused with the audio.get_metadata() UDF that you cited above.) We've also discussed potentially adding an if_exists='ignore' option to add_column(), which would obviate the need for the explicit check, and it sounds like that'd be useful to you.
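
As a minimal sketch of the explicit check described above, the guard can be wrapped in a small helper so that rerunning a notebook cell is harmless. The helper name add_column_if_missing and the FakeTable stand-in are hypothetical and not part of Pixeltable; the only behavior assumed from this thread is t.get_metadata()['schema'] and the t[name] = expr assignment used in the original question.

```python
from typing import Any, Callable


def add_column_if_missing(table: Any, name: str, make_expr: Callable[[], Any]) -> bool:
    """Add a column only when `name` is not yet in the table schema.

    Returns True if the column was added, False if it already existed.
    `make_expr` is called lazily, so the expression is only built when needed.
    """
    schema = table.get_metadata()['schema']
    if name in schema:
        return False
    table[name] = make_expr()
    return True


class FakeTable:
    """Hypothetical stand-in so the sketch runs without Pixeltable installed."""

    def __init__(self) -> None:
        self._schema: dict[str, Any] = {'audio': 'Audio'}

    def get_metadata(self) -> dict[str, Any]:
        return {'schema': dict(self._schema)}

    def __setitem__(self, name: str, expr: Any) -> None:
        # Mimic the error a rerun cell would hit without the guard.
        if name in self._schema:
            raise ValueError(f'duplicate column: {name}')
        self._schema[name] = expr


t = FakeTable()
first = add_column_if_missing(t, 'audio_metadata', lambda: 'placeholder expression')
second = add_column_if_missing(t, 'audio_metadata', lambda: 'placeholder expression')
columns = list(t.get_metadata()['schema'].keys())
```

With a real table, the call would presumably look like add_column_if_missing(audio_table, 'audio_metadata', lambda: get_metadata(audio_table.audio)), using the UDF import from the original question.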

You've raised a good question regarding skipping duplicates - that's currently not a feature but I can certainly see the value of it. It is possible to enforce uniqueness by defining a primary key that includes the column in question, but if you do that, inserting duplicate values will raise an exception, which presumably isn't what you want. I think this is worth pursuing but needs more discussion (e.g. what do we do in the following scenario?: the table has two columns col_x and col_y with col_x unique. I insert a row with col_x=a and col_y=b. Later, I try to insert a row with col_x=a and col_y=c. Do we skip it because it's a duplicate, or do we raise an exception because it's a duplicate but with a conflicting value for col_y?)
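
To make the open question concrete, here is a rough, library-independent sketch that sorts incoming rows into three buckets: unseen keys, exact duplicates, and rows that reuse a key with different values. The function classify_rows is hypothetical and only illustrates the col_x / col_y scenario above; what to do with each bucket (skip, raise, overwrite) is exactly the policy decision still under discussion.

```python
def classify_rows(existing, incoming, key):
    """Split `incoming` rows into new, exact-duplicate, and conflicting rows."""
    seen = {row[key]: row for row in existing}
    new, duplicates, conflicts = [], [], []
    for row in incoming:
        prior = seen.get(row[key])
        if prior is None:
            new.append(row)
            # Later rows in the same batch are compared against this one.
            seen[row[key]] = row
        elif prior == row:
            duplicates.append(row)
        else:
            conflicts.append(row)
    return new, duplicates, conflicts


existing = [{'col_x': 'a', 'col_y': 'b'}]
incoming = [
    {'col_x': 'a', 'col_y': 'b'},  # same key, same values
    {'col_x': 'a', 'col_y': 'c'},  # same key, different col_y
    {'col_x': 'd', 'col_y': 'e'},  # unseen key
]
new, duplicates, conflicts = classify_rows(existing, incoming, key='col_x')
```

Under a "skip duplicates" policy only `new` would be inserted; whether `conflicts` should be ignored silently or raise an exception is the part that needs a decision.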

dgutman · 2024-10-29T16:26:51Z

Author

So.. exactly! I've dealt with all of this in the past, and previously had a pgvector database I was using for this (before I started experimenting with pixeltable). I was just looking for convenience functions / methods instead of list(t.get_metadata()['schema'].keys()), which is just uglier to read. And I've spent an annoying amount of time dealing with the unique thing: writing try/except statements to avoid the unique error, or writing a wrapper to first check if something is there and then only inserting the ones that are not there. All of this is of course possible; I just didn't want to write it if there was already a simple ignore_duplicates=True flag or something like that. Also, is there a way to get the filenames that are already uploaded? At least for now I can just do that. I haven't dug deep enough into the audio metadata yet.
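
For illustration, the "check first, then insert only what is missing" wrapper can be sketched in a few lines of plain Python. The function paths_to_insert is hypothetical; how the list of already-inserted paths is read back from the table is not shown here and is an assumption, and the candidate list would normally come from something like a glob over the input directory.

```python
from typing import Iterable


def paths_to_insert(candidates: Iterable[str], already_inserted: Iterable[str]) -> list[str]:
    """Return candidate paths that are not yet in the table, preserving order.

    Paths are compared as plain strings, so both inputs should use the same
    form (for example, both absolute).
    """
    seen = set(already_inserted)
    fresh = []
    for path in candidates:
        if path not in seen:
            seen.add(path)  # also drops repeats within the candidate list
            fresh.append(path)
    return fresh


# Assumed inputs: paths already stored in the table, and a fresh directory scan.
existing = ['audio/a.mp3', 'audio/b.mp3']
found = ['audio/a.mp3', 'audio/b.mp3', 'audio/c.mp3', 'audio/c.mp3']
fresh = paths_to_insert(found, existing)
```

Only the paths in `fresh` would then be handed to the insert call, so rerunning the cell after a rescan adds nothing twice.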

-- David A Gutman, M.D. Ph.D. Associate Professor of Pathology Emory University School of Medicine

dgutman · 2024-11-03T15:19:51Z

Author

BTW re this scenario:

You've raised a good question regarding skipping duplicates - that's currently not a feature but I can certainly see the value of it. It is possible to enforce uniqueness by defining a primary key that includes the column in question, but if you do that, inserting duplicate values will raise an exception, which presumably isn't what you want. I think this is worth pursuing but needs more discussion (e.g. what do we do in the following scenario?: the table has two columns col_x and col_y with col_x unique. I insert a row with col_x=a and col_y=b. Later, I try to insert a row with col_x=a and col_y=c. Do we skip it because it's a duplicate, or do we raise an exception because it's a duplicate but with a conflicting value for col_y?)

So throwing an exception all the time is just more boilerplate I'd like to avoid. Generally speaking, I feel like the data scientist should have a fairly good understanding of their data model, but not necessarily deep knowledge of every single file they are trying to load into their data set. In the case you're talking about, for example, I may be loading some sort of LABEL file along with the img/audio/whatever data. In case of a conflict, maybe at least initially it should just IGNORE those entries unless you have explicitly set the flag. This happens in my use case(s) a lot, where the ground truth may have been updated in some CSV file, and I don't necessarily want to overwrite my existing answer in all cases.

dgutman · 2024-11-03T15:47:06Z

Author

Sorry I didn't mean throwing an exception.. I meant adding a lot of code to deal with that error.... I imagine rescanning / reparsing input directories over time would be a common use case as data sets grow/evolve over time..
