Replies: 5 comments
-
Sorry, finished this too soon. I realize there are of course ways around all of this using queries, etc. I'm just trying not to build a much more complicated code base if there's something as simple as a unique flag, or some similar syntactic sugar, that I haven't stumbled upon yet.
-
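Thanks for the feedback! To print the table schema, you can type t.describe() or simply t in a notebook. If you want to get the column names as a list, try this:
list(t.get_metadata()['schema'].keys())
To see whether a column name exists in the table, col_name in t.get_metadata()['schema'] should work. (Note that here I'm calling get_metadata() on the table, not to be confused with the audio.get_metadata() UDF that you cited above.) We've also discussed potentially adding an if_exists='ignore' option to add_column(), which would obviate the need for the explicit check, and it sounds like that'd be useful to you.
You've raised a good question regarding skipping duplicates - that's currently not a feature but I can certainly see the value of it. It is possible to enforce uniqueness by defining a primary key that includes the column in question, but if you do that, inserting duplicate values will raise an exception, which presumably isn't what you want. I think this is worth pursuing but needs more discussion (e.g. what do we do in the following scenario?: the table has two columns col_x and col_y with col_x unique. I insert a row with col_x=a and col_y=b. Later, I try to insert a row with col_x=a and col_y=c. Do we skip it because it's a duplicate, or do we raise an exception because it's a duplicate but with a conflicting value for col_y?)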
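For anyone who wants something shorter to type, the schema lookup mentioned above can be wrapped in a couple of small helpers. This is only a sketch: column_names, has_column and FakeTable are hypothetical names, not part of Pixeltable, and FakeTable exists purely so the snippet runs without Pixeltable installed. The one thing assumed about a real table is what the reply states, namely that t.get_metadata()['schema'] is a mapping keyed by column name.

```python
def column_names(table):
    """Return the column names of any object whose get_metadata()
    result contains a 'schema' mapping keyed by column name."""
    return list(table.get_metadata()['schema'].keys())


def has_column(table, name):
    """True if `name` is a key of the table's schema mapping."""
    return name in table.get_metadata()['schema']


class FakeTable:
    """Hypothetical stand-in for a table; NOT a Pixeltable class."""

    def __init__(self, schema):
        self._schema = dict(schema)

    def get_metadata(self):
        return {'schema': self._schema}


# The schema values here are placeholders; only the keys matter for the helpers.
t = FakeTable({'audio': 'Audio', 'metadata': 'Json'})
print(column_names(t))
print(has_column(t, 'metadata'))
```

In a notebook cell that gets rerun, a guard like `if not has_column(t, 'metadata'):` in front of the column definition would make the cell safe to execute twice.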
-
So.. exactly! I've dealt with all of this in the past, and previously had
a pgvector database I was using for this (before I started experimenting
with pixeltable...).
I was just looking for convenience functions / methods instead of
list(t.get_metadata()['schema'].keys()), which is just uglier to read.
And I've spent an annoying amount of time dealing with the unique thing:
writing try/except statements to avoid the unique error, or writing a
wrapper to first check if something is there and then only inserting the
ones that are not there. All of this is of course possible.. I just didn't
want to write it if there was already a simple
ignore_duplicates=True flag or something like that.
Also, is there a way to get the filenames that are already uploaded? At
least for now I can just do that.. I haven't dug deep enough into the
audio metadata yet.
…On Mon, Oct 28, 2024 at 3:04 PM Aaron Siegel ***@***.***> wrote:
Thanks for the feedback! To print the table schema, you can type
t.describe() or simply t in a notebook. If you want to get the column
names as a list, try this:
list(t.get_metadata()['schema'].keys())
To see whether a column name exists in the table, col_name in
t.get_metadata()['schema'] should work. (Note that here I'm calling
get_metadata() on the table, not to be confused with the
audio.get_metadata() UDF that you cited above.) We've also discussed
potentially adding an if_exists='ignore' option to add_column(), which
would obviate the need for the explicit check, and it sounds like that'd be
useful to you.
You've raised a good question regarding skipping duplicates - that's
currently not a feature but I can certainly see the value of it. It *is*
possible to enforce uniqueness by defining a primary key that includes the
column in question, but if you do that, inserting duplicate values will
raise an exception, which presumably isn't what you want. I think this is
worth pursuing but needs more discussion (e.g. what do we do in the
following scenario?: the table has two columns col_x and col_y with col_x
unique. I insert a row with col_x=a and col_y=b. Later, I try to insert a
row with col_x=a and col_y=c. Do we skip it because it's a duplicate, or
do we raise an exception because it's a duplicate but with a conflicting
value for col_y?)
--
David A Gutman, M.D. Ph.D.
Associate Professor of Pathology
Emory University School of Medicine
-
BTW re this scenario:
"You've raised a good question regarding skipping duplicates - that's currently not a feature but I can certainly see the value of it. It is possible to enforce uniqueness by defining a primary key that includes the column in question, but if you do that, inserting duplicate values will raise an exception, which presumably isn't what you want. I think this is worth pursuing but needs more discussion (e.g. what do we do in the following scenario?: the table has two columns col_x and col_y with col_x unique. I insert a row with col_x=a and col_y=b. Later, I try to insert a row with col_x=a and col_y=c. Do we skip it because it's a duplicate, or do we raise an exception because it's a duplicate but with a conflicting value for col_y?)"
So throwing an exception all the time is just more boilerplate I'd like to avoid. Generally speaking, I feel like the data scientist should have a fairly good understanding of their data model, but not necessarily a deep knowledge of every single file they are trying to load into their data set. In the case you're talking about, for example, I may be loading in some sort of LABEL file along with the img/audio/whatever data. In case of a conflict, maybe at least initially it should just IGNORE those entries unless you have explicitly set the flag. This happens in my use case(s) a lot, where the ground truth may have been updated in some CSV file. I don't necessarily want to overwrite my existing answer in all cases.
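To make the two candidate behaviours concrete, here is a minimal pure-Python sketch of the col_x / col_y scenario. It is not Pixeltable code and not a proposed API: insert_unique and its on_conflict parameter are hypothetical, and the "table" is just a dict keyed by the unique column.

```python
def insert_unique(existing, incoming, key, on_conflict='ignore'):
    """Add rows from `incoming` to `existing` (a dict: key value -> row dict).

    on_conflict='ignore': silently skip any row whose key is already present.
    on_conflict='error':  skip exact duplicates, but raise ValueError when the
                          key is present with different values.
    Returns the list of rows that were actually added.
    """
    if on_conflict not in ('ignore', 'error'):
        raise ValueError(f'unknown on_conflict value: {on_conflict!r}')
    added = []
    for row in incoming:
        k = row[key]
        if k not in existing:
            existing[k] = row
            added.append(row)
        elif on_conflict == 'error' and existing[k] != row:
            raise ValueError(f'conflicting row for {key}={k!r}')
    return added


table = {}
insert_unique(table, [{'col_x': 'a', 'col_y': 'b'}], key='col_x')

# Same col_x, different col_y: skipped under 'ignore', original row is kept.
skipped = insert_unique(table, [{'col_x': 'a', 'col_y': 'c'}], key='col_x')

# The same insert under 'error' raises instead of skipping.
conflict = None
try:
    insert_unique(table, [{'col_x': 'a', 'col_y': 'c'}], key='col_x',
                  on_conflict='error')
except ValueError as exc:
    conflict = str(exc)
```

Defaulting to 'ignore' matches the preference described above: existing answers are never overwritten unless the caller opts into stricter handling.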
-
Sorry, I didn't mean throwing an exception.. I meant adding a lot of code to deal with that error. I imagine rescanning / reparsing input directories would be a common use case as data sets grow and evolve over time.
-
Perhaps I am just missing something obvious, but what is the simplest way to list the existing columns in an existing table? Similarly, is there a way to define a column as unique to prevent re-inserting the same files?
I am currently experimenting with this in a jupyter notebook, which is a non-linear world in terms of rerunning code snippets.
I am creating a set of different columns / computed columns, and if I rerun the same cell block it, not shockingly, throws an error. I see there is a .column() property attached to the table, but that seems to return a tuple, so something like this doesn't work:
from pixeltable.functions.audio import get_metadata
if 'metadata' not in audio_table:
    audio_table['metadata'] = get_metadata(audio_table.audio)
In a similar vein... I have a directory of audio files I am ingesting, but at least initially I am only inserting the first ten. It seems if I rerun that code block, it will keep inserting the same audio files (again.. not surprisingly) because I don't define any sort of unique index for the audio_file path...
I see my most likely future use case is not wanting to necessarily keep track of every audio file I insert, and just doing a glob.glob(//*.mp3) or something similar, and ideally letting the database figure out what's new and what's not new.
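Until something like that is built in, the "figure out what's new" step can be sketched on the client side. This is a hedged example, not a Pixeltable feature: find_new_files is a hypothetical helper, and how the list of already-ingested paths is obtained from the table is deliberately left out. The demo builds a throwaway directory so the snippet runs on its own.

```python
import tempfile
from pathlib import Path


def find_new_files(root, known_paths, pattern='*.mp3'):
    """Recursively list files under `root` matching `pattern` that are
    not in `known_paths`. Paths are compared after resolving to absolute form."""
    known = {str(Path(p).resolve()) for p in known_paths}
    found = sorted(str(p.resolve()) for p in Path(root).rglob(pattern))
    return [p for p in found if p not in known]


# Demo on a temporary directory: one file already ingested, one new, one non-audio.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / 'sub').mkdir()
    (Path(root) / 'a.mp3').touch()
    (Path(root) / 'sub' / 'b.mp3').touch()
    (Path(root) / 'notes.txt').touch()

    already_ingested = [Path(root) / 'a.mp3']
    new_files = find_new_files(root, already_ingested)
    new_names = [Path(p).name for p in new_files]

print(new_names)
```

Only the paths returned by find_new_files would then be passed to the insert call, so rerunning the cell after the directory grows inserts just the additions.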