Skip to content
This repository has been archived by the owner on Mar 1, 2024. It is now read-only.

Added handling of filename_as_id and file_extractor to SharePointReader #934

Conversation

ferdinandosimonetti
Copy link
Contributor

@ferdinandosimonetti ferdinandosimonetti commented Feb 8, 2024

Description

I've taken MinioReader's handling of file_extractor parameter for SimpleDirectoryReader
This allows to choose a customized matching between file extension and its Reader/Decoder, and *shouldn't wreak havoc on SharePointReader's functionality.

Type of Change

Please delete options that are not relevant.

  • Bug fix / Smaller change

How Has This Been Tested?

  • I stared at the code and made sure it makes sense

Suggested Checklist:

  • I have added a library.json file if a new loader/tool was added
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

…er (mimicking what is done, for example, for MinioReader
Copy link
Collaborator

@anoopshrma anoopshrma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ferdinandosimonetti!,
added one comment, rest everything looks great!

@@ -28,6 +28,8 @@ def __init__(
client_id: str,
client_secret: str,
tenant_id: str,
filename_as_id: bool = False,
file_extractor: Optional[Dict[str, Union[str, BaseReader]]] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a small mistake here:
file_extractor: Optional[Dict[str, BaseReader]] = None,

For ref: https://github.com/run-llama/llama_index/blob/0393b081f3aed854e0a628f49b8e51f8da7906ef/llama_index/readers/file/base.py#L118

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully my second commit should solve the issue that prevented the first run to work out... previously I just forgot to add the appropriate imports from typing (Optional and Union).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't have to add union for basereader. It'll be always a baseReader class

Replace file_extractor line with the below one

file_extractor: Optional[Dict[str, BaseReader]] = None,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote it that way because I was shamelessly copying line 25 of llama_hub/minio/minio-client/base.py

file_extractor: Optional[Dict[str, Union[str, BaseReader]]] = None,

however, I'll rewrite it that way

@anoopshrma
Copy link
Collaborator

You'll need to look at linting and test case as well on this.

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@ferdinandosimonetti
Copy link
Contributor Author

Solved the last complaint about importing Union, that is unused

Copy link
Collaborator

@anoopshrma anoopshrma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the prompt action on the lint part @ferdinandosimonetti
Highly appreciated!

@anoopshrma anoopshrma merged commit 642dc8a into run-llama:main Feb 12, 2024
3 checks passed
@ferdinandosimonetti ferdinandosimonetti deleted the feat/file_extractor-for-SharePointReader branch February 12, 2024 14:23
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants