Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] introduce BIDSLayoutV2 #863

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

erdalkaraca
Copy link

In agreement with @gkiar, the following approach allows for a better migration to pybids-core:

  • introduce a new interface BIDSLayoutV2 which uses ancpBIDS to mimic functionality from the BIDSLayout legacy interface
  • ancpBIDS will be dynamically imported at run-time and raise a warning if it is not installed
  • users can decide which interface to use and experiment with BIDSLayoutV2 without breaking their existing code

Once we get sufficient user feedback such as bugs or critical implementation gaps, the BIDSLayoutV2 can be renamed to BIDSLayout while the old implementation may exist as BIDSLayoutLegacy within the next releases.

@erdalkaraca
Copy link
Author

Note:
there is a new test module: test_layout_v2.py

I have not yet fixed all unit tests in that module as this is work-in-progress

@codecov
Copy link

codecov bot commented May 28, 2022

Codecov Report

Base: 86.24% // Head: 83.96% // Decreases project coverage by -2.28% ⚠️

Coverage data is based on head (7f08cdb) compared to base (de95cb5).
Patch coverage: 12.09% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #863      +/-   ##
==========================================
- Coverage   86.24%   83.96%   -2.29%     
==========================================
  Files          32       33       +1     
  Lines        3904     4028     +124     
  Branches      947      966      +19     
==========================================
+ Hits         3367     3382      +15     
- Misses        346      455     +109     
  Partials      191      191              
Impacted Files Coverage Δ
bids/layout/layout_v2.py 7.62% <7.62%> (ø)
bids/layout/__init__.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@adelavega
Copy link
Collaborator

Awesome. I think a full integration test using BIDSLayoutV2 would be most informative.

That is, is those of use that use BIDSLayout heavily try to use BIDSLayoutV2 instead in our more complex apps. I will try to set aside some time for this soon.

bids/layout/__init__.py Outdated Show resolved Hide resolved
erdalkaraca and others added 3 commits May 31, 2022 23:24
define BIDSLayoutV2 as function if ancpbids package not installed

Co-authored-by: Chris Markiewicz <effigies@gmail.com>
@gkiar
Copy link
Contributor

gkiar commented Jun 16, 2022

Hey @erdalkaraca are you around this week to discuss?

@effigies effigies added this to the 1.0.0 milestone Jul 22, 2022
@effigies
Copy link
Collaborator

To me this feels like a big enough change to warrant a 1.0 release when we can make it happen. I won't have any time to work on this until at the absolute earliest August 22, but I do think this is a pretty important thing to push on as people are able.

Could have a ~30 minute call next week if people are around and want to discuss strategy.

@erdalkaraca
Copy link
Author

Depends on ANCPLabOldenburg/ancp-bids#60

Copy link
Collaborator

@adelavega adelavega left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello!

I finally had time to run these tests myself. Honestly, all in all the API differences between ancpbids and pybids seem minimal at this point, and I think we can get this done soon.

The main issue I identified were:

  • ancpbids seems to load all derivatives eagerly, even if derivatives=True is not set. This actually is fairly important as you may specifically not want to load derivatives. We may want to discuss if its better to "filter" derivatives prior to instantiation of a layout, or at the time of a query. I think the latter is more flexible and generic, but I think there may be times you really don't want to index a derivative.
  • ancpbids currently uses the "literal" name of entities (e.g. "ses" instead of "session")
  • some __repr__s are different and could be improved. Don't need to be identical but it's something to make a decision on
  • get_fieldmap is not working. Looks like a bug in ancpbids's getattr method

the rest seem like small issue that are very doable to fix, and i've commented on all known test failrues

)
assert len(unvalidated.get()) == 4
with pytest.raises(ValueError):
unvalidated.get(desc="preproc")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like ancpbids returns files here, because desc is not a valid entity.

By default, BIDSLayout does not allow unrecognized entities, and raises an error.

You can allow them using invalid_filters='allow', but even then, since it doesn't match, it returns empty lis: []


def test_dataset_missing_generatedby_fails_validation(self):
dataset_path = Path("ds005_derivs", "format_errs", "no_pipeline_description")
with pytest.raises(BIDSDerivativesValidationError):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expected error:

*** bids.exceptions.BIDSDerivativesValidationError: Every valid BIDS-derivatives dataset must have a GeneratedBy.Name field set inside 'dataset_description.json'. 
Example: {'GeneratedBy': [{'Name': 'Example pipeline'}]}

That seems like an easy shim

target = 'sub-01/ses-1/func/sub-01_ses-1_task-rest_acq-fullbrain_run-1_bold.nii.gz'
target = target.split('/')
result = layout_7t_trt.get_metadata(
join(layout_7t_trt.root, *target), include_entities=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

include_entities doesn't seem to work due to:

ANCPLabOldenburg/ancp-bids#66

with pytest.raises(TargetError) as exc:
layout_7t_trt.get(target='unicorn')
msg = str(exc.value)
assert 'subject' in msg and 'reconstruction' in msg and 'proc' in msg
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference here is ancpbids is using the literal entity value, and pybids uses the full name.

ancpbids:

"Unknown target 'unicorn'. Valid targets are: ['task', 'acq', 'run', 'sub', 'ses']"

pybids:

"Unknown target 'unicorn'. Valid targets are: ['subject', 'session', 'sample', 'task', 'acquisition', 'ceagent', 'staining', 'tracer', 'reconstruction', 'direction', 'run', 'proc', 'modality', 'echo', 'flip', 'inv', 'mt', 'part', 'recording', 'space', 'chunk', 'suffix', 'scans', 'fmap', 'datatype', 'extension', 'EchoTime2', 'EchoTime1', 'IntendedFor', 'CogAtlasID', 'EchoTime', 'EffectiveEchoSpacing', 'PhaseEncodingDirection', 'RepetitionTime', 'SliceEncodingDirection', 'SliceTiming', 'TaskName', 'StartTime', 'SamplingFrequency', 'Columns']"

I also see that pybids lists all of the metadata keys, not just core entities. This may cause problems elsewhere (although I'm not suggesting we need to keep this logic)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pybids returning also metadata keys is a bit confusing as they are no entities in terms of the spec... not sure, but maybe there is an overlap between BIDS spec "entity" and sqlalchemy DB table name "Entity"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will raise an issue about this in pybids-refactor, but it is useful functionality to be able to query by meta-data keys



def test_get_bvals_bvecs(layout_ds005):
dwifile = layout_ds005.get(subject="01", datatype="dwi")[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Entities associated with the targetfile

ancpbids: [{'key': 'sub', 'value': '01'}]
pybids: {'datatype': 'dwi', 'extension': '.nii.gz', 'subject': '01', 'suffix': 'dwi'}

the ancpbids repr also lists: {'name': 'sub-01_dwi.nii.gz', 'extension': '.nii.gz', 'suffix': 'dwi'}

Not sure why ancpbids misses the datatype entity? Is it related to the new schema?

BTW, this also makes me realize we currently treat extension as an "entity" even though it's technically not. This will for sure cause problems in many pipeline that use the BIDSImageFile.entities attribute (cc: @effigies)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay so in general datatype seems to not be a valid entity in ancpbids

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay so in general datatype seems to not be a valid entity in ancpbids

Yes, datatype is not listed as an entity in the schema. Maybe it makes sense to treat it the same way as extension/suffix, i.e. as an additional property of the Artifact class.

bold_files = layout_ds005.get(suffix='bold', run=1, subject='01', session='*')
assert not bold_files
bold_files = layout_ds005.get(suffix='bold', run=1, subject='01')
assert len(bold_files) == 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fails because ancpbids returns:


[
 {'name': 'sub-01_task-mixedgamblestask_run[...]', 'extension': '.nii.gz', 'suffix': 'bold'}, 
 {'name': 'sub-01_task-mixedgamblestask_run[...]', 'extension': '.nii.gz', 'suffix': 'bold'}, 
 {'name': 'sub-01_task-mixedgamblestask_run[...]', 'extension': '.json', 'suffix': 'bold'}
]

which map to these three filenames:

[
  'sub-01_task-mixedgamblestask_run-01_bold.nii.gz', 
  'sub-01_task-mixedgamblestask_run-01_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz', 
  'sub-01_task-mixedgamblestask_run-01_space-MNI152NLin2009cAsym_desc-preproc_bold.json'
]

Looks like pybids was not indexing the last two files as they were "not valid", whereas ancpbids does index them, and returns them here. This seems like a pretty big change in behavior we'll have to address, as a lot of applications may be using the fact that invalid files are not indexed as a form of filtering on files.

Also, ancpbids's repr makes the two first files look identical, I would suggest not truncating so fast, as the filename can be quite important

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that layout_ds005 is instantiated without derivatives=True so IMO those files should not be returned (unless we want to deliberately change this behavior)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, syntactically, the files comply to the BIDS naming scheme and are considered "valid".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only due to derivatives being indexed by ancpbids by default it turns out

def test_layout_with_derivs(layout_ds005_derivs):
assert layout_ds005_derivs.root == join(get_test_data_path(), 'ds005')
assert isinstance(layout_ds005_derivs.files, dict)
assert len(layout_ds005_derivs.derivatives) == 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, the issue here is all derivatives are loaded, whereas in the pybids tests only one set of derivs (events) is deliberately loaded

assert dd['Name'] == 'Mixed-gambles task'
dd = layout_ds005_derivs.get_dataset_description('all', True)
assert isinstance(dd, list)
assert len(dd) == 2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, the issue here is loading 2 vs 3 derivs.

I think the fact that the object return by ancpbids is different is acceptable, although a breaking change.

ancpbids:

[{'name': 'dataset_description.json', 'Name': 'Mixed-gambles task', 'BIDSVersion': '1.0.0rc2', 'License': 'This dataset is made available u[...]', 'ReferencesAndLinks': 'Tom, S.M., Fox, C.R., Trepel, C.[...]'}, {'name': 'dataset_description.json', 'Name': 'Mixed-gambles task', 'BIDSVersion': '1.0.0rc2', 'License': 'This dataset is made available u[...]', 'ReferencesAndLinks': 'Tom, S.M., Fox, C.R., Trepel, C.[...]'}, {'name': 'dataset_description.json', 'Name': 'fMRIPrep - fMRI PREProcessing wo[...]', 'BIDSVersion': '1.4.0', 'DatasetType': 'derivative', 'License': 'CC0', 'HowToAcknowledge': 'Please cite our paper (https://d[...]'}]

pybids;

{'PipelineDescription': {'Name': 'events'}, 'BIDSVersion': '1.0.0rc2', 'License': 'This dataset is made available under the Public Domain Dedication and License \nv1.0, whose full text can be found at \nhttp://www.opendatacommons.org/licenses/pddl/1.0/. \nWe hope that all users will follow the ODC Attribution/Share-Alike \nCommunity Norms (http://www.opendatacommons.org/norms/odc-by-sa/); \nin particular, while not legally required, we hope that all users \nof the data will acknowledge the OpenfMRI project and NSF Grant \nOCI-1131441 (R. Poldrack, PI) in any publications.', 'Name': 'Mixed-gambles task', 'ReferencesAndLinks': 'Tom, S.M., Fox, C.R., Trepel, C., Poldrack, R.A. (2007). The neural basis of loss aversion in decision-making under risk. Science, 315(5811):515-8'}

l = layout_ds005
# Raise error with suggestions
with pytest.raises(ValueError, match='session'):
l.get(subject='12', ses=True, invalid_filters='error')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two issues:

  • Literal vs display name (e.g. ses vs session)
  • invalid_filters doesn't seem to be implemented

for run in (1, "1", "01"):
res = layout_ds005.get(subject="01", task="mixedgamblestask",
run=run, extension=".nii.gz")
assert len(res) == 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, derivatives are loaded when they should not be

@adelavega
Copy link
Collaborator

Update: all tests in test_layout_on_example are passing w/ ancpbids.

@adelavega
Copy link
Collaborator

adelavega commented Nov 19, 2022

Additional observations from trying to test functionality in other modules:

Missing functionality:

  • BIDSLayoutFile.get_df
    • get_dict
    • get_entities
    • path and relpath (ancpbids Artifact only has 'name')
  • layout.regex_search
  • .get function on derivatives (probably it can access in different way)
  • BIDSLayout.build_path write_to_file and copy_file
  • As @gkiar pointed out BIDSLayout.files has a meaning of all files that are valid (if validation=True), but this is a different meaning in ancpbids. I also recommend renaming to extra_files, and adding a shim or another way to access all files in a flat list.
  • validation seems to produce slightly different results-- and one difference is pybids simply skips invalid files, ancpbids seems to throw an error. @erdalkaraca? In particular this causes issues on these files on ds005: `models/ds-005_type interceptonlyrunlevel_model.json'

There's also many functions that look like get_{entity} but are not and need to be added (I'm open to adding w/ a new, less confusing name):

  • get_collections -- load variable collections

Indexing discrepancies

  • participants.tsv is not indexed (as @gkiar previously said, this is ill defined in the schema)

@erdalkaraca WDYT about functions like get_collections? Should we rename them, or add exceptions to __getattr__?

@erdalkaraca
Copy link
Author

Additional observations from trying to test functionality in other modules:

Missing functionality:

  • BIDSLayoutFile.get_df

    • get_dict
    • get_entities
    • path and relpath (ancpbids Artifact only has 'name')
  • layout.regex_search

  • .get function on derivatives (probably it can access in different way)

  • BIDSLayout.build_path write_to_file and copy_file

  • As @gkiar pointed out BIDSLayout.files has a meaning of all files that are valid (if validation=True), but this is a different meaning in ancpbids. I also recommend renaming to extra_files, and adding a shim or another way to access all files in a flat list.

  • validation seems to produce slightly different results-- and one difference is pybids simply skips invalid files, ancpbids seems to throw an error. @erdalkaraca? In particular this causes issues on these files on ds005: `models/ds-005_type interceptonlyrunlevel_model.json'

There's also many functions that look like get_{entity} but are not and need to be added (I'm open to adding w/ a new, less confusing name):

  • get_collections -- load variable collections

Indexing discrepancies

  • participants.tsv is not indexed (as @gkiar previously said, this is ill defined in the schema)

@erdalkaraca WDYT about functions like get_collections? Should we rename them, or add exceptions to __getattr__?

I am not really familiar with the variable concept, but sounds like a higher level API. What is the source of the variables that pybids extracts those from?

@adelavega
Copy link
Collaborator

@erdalkaraca yes it is a higher-level API that requires reading in the _events.tsv files.

It is independent than the querying/indexing functionality. I will admit in hindsight its a bit messy to have this type of functionality mixed in with the core querying functionality in the same object. That said, we could address that later in a backwards breaking change.

In the meantime, we will implement these functions to allow for deeper integration testing.

I see you already saw, but we will continue development in https://github.com/bids-standard/pybids-refactor/ so that we can have multiple PRs by different authors, and then merge that fork to pybids

@erdalkaraca
Copy link
Author

@erdalkaraca yes it is a higher-level API that requires reading in the _events.tsv files.

It is independent than the querying/indexing functionality. I will admit in hindsight its a bit messy to have this type of functionality mixed in with the core querying functionality in the same object. That said, we could address that later in a backwards breaking change.

In the meantime, we will implement these functions to allow for deeper integration testing.

I see you already saw, but we will continue development in https://github.com/bids-standard/pybids-refactor/ so that we can have multiple PRs by different authors, and then merge that fork to pybids

It is possible to mixin the functionality from a different class to better separate APIs (monkey-patching, plugin) without a breaking change.

@adelavega
Copy link
Collaborator

@erdalkaraca yes it is a higher-level API that requires reading in the _events.tsv files.
It is independent than the querying/indexing functionality. I will admit in hindsight its a bit messy to have this type of functionality mixed in with the core querying functionality in the same object. That said, we could address that later in a backwards breaking change.
In the meantime, we will implement these functions to allow for deeper integration testing.
I see you already saw, but we will continue development in https://github.com/bids-standard/pybids-refactor/ so that we can have multiple PRs by different authors, and then merge that fork to pybids

It is possible to mixin the functionality from a different class to better separate APIs (monkey-patching, plugin) without a breaking change.

Yes, that might be the way to go. Maybe your BIDSLayout object should be called BIDSBaseLayout or something to indicate its the core indexer/querying engine, vs the additional high level API we will add to make it backwards compatible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants