Roadmap
Albert Villanova del Moral edited this page Apr 22, 2021 · 1 revision
- Datasets Hub
- Datasets Viewer
- AutoNLP
- External integrations
- Tasks + Evaluations
- Datasets Streaming
- Image/Audio support
- Researchers usage
- GitHub repository
- Community/Contributors
- Make the dataset script optional
- Load processed datasets
- Use cold storage (parquet)
- More documentation + concrete tutorials
- Integrate a validation tool in the CI for YAML tags + dataset card
- Fix running out of disk space
- Update the dependencies
- Fix methods that have memory issues: cast (WIP), filter, concatenate_datasets
- Add audio type
- How to download a processed dataset from the Hub
- How to implement a universal dataset loader
- Improve error messages per file
- Test using big JSON files
- Allow getting dataset metadata without loading the datasets
- Allow using the dataset builders as iterators
- Add task-specific preparation
- Define task-specific feature templates
- Add task argument in load_dataset
- Automatic post-processing based on the supervised_keys passed in the info and the queried task
- User-defined post-processing to cover cases that automatic post-processing can't handle (maybe using the post_process method of the builder)
- Sync with AutoNLP
- Use fsspec
- Create a new class StreamingDataset
- Enable the streaming of csv/text/json data
- Set the format of a streaming dataset
- Implement new feature types Image and Audio
- Implement a decoding step
- Either keep storing the path in the arrow data, or write the encoded bytes in the arrow data
- Keep small datasets in memory and without caching
- Load one split without downloading and processing the others
- Update Wikipedia
- Complete the dataset card with usage examples to show how to use a specific dump date
- Preprocess recent Wikipedia dumps (en, fr, es, de...)
- Optimize Beam pipelines
- Process Wikipedia systematically
- Add FAQs in the documentation or as a markdown file in the repo
- Try Git LFS for dummy data
- Fix conda build
- Share Roadmap
- Add all the tasks on the Roadmap as GitHub Issues
- Create GitHub Projects:
  - Core library
  - Addition of new datasets
- Improve the docs on how to contribute to the core library
- Refactor the code to make it simpler
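
The streaming items in this roadmap (streaming csv/text/json data, using dataset builders as iterators) could be prototyped roughly as below. This is only a minimal sketch built on the Python standard library: `stream_json_lines` is a hypothetical helper, not part of the library's API, and the in-memory buffer is an assumption standing in for a remote file opened lazily.

```python
import io
import json
from typing import Iterable, Iterator


def stream_json_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed example per non-empty line (hypothetical helper).

    Lines are consumed lazily, so the whole file is never held in memory.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# An in-memory buffer stands in for a file object opened on remote data.
buffer = io.StringIO(
    '{"text": "hello", "label": 0}\n'
    '{"text": "world", "label": 1}\n'
)

examples = stream_json_lines(buffer)
first = next(examples)  # only the first line has been read and parsed so far
rest = list(examples)   # consume the remaining examples
```

Because the generator pulls lines on demand, the same function would work unchanged on any file-like object that yields text lines, which is the property a streaming dataset needs.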