Image Captioning with Transformers: Transforming Visual Content into Audio for the Visually Impaired
Sample output video: Sample.Output.mp4
The "Image Captioning with Transformers" project was initially undertaken as a team assignment for our Modern Analytics course, focusing on the integration of Convolutional Neural Networks (CNNs) and transformers for image captioning. During this phase, we primarily used CNNs for image feature extraction and transformers for caption generation. However, over the break, I took the initiative to further refine and enhance the project's accuracy, to improve the overall performance of the image captioning system.
- Programming Language: Python
- Deep Learning Framework: PyTorch
- Convolutional Neural Network (CNN): ResNet
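
Since the stack above lists PyTorch and a ResNet backbone, a minimal sketch of how the CNN encoder and transformer decoder might be wired together is shown below. The ResNet-50 variant, layer sizes, and vocabulary handling are illustrative assumptions rather than the project's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class CaptioningModel(nn.Module):
    """CNN encoder + transformer decoder, as described in the project overview."""

    def __init__(self, vocab_size: int, d_model: int = 512, nhead: int = 8, num_layers: int = 3):
        super().__init__()
        # ResNet backbone with the pooling/classification head removed; it yields a
        # 7x7 grid of 2048-d features that the decoder attends over as its "memory".
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Linear(2048, d_model)

        self.embed = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers)
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, 224, 224), captions: (B, T) token ids
        feats = self.backbone(images)              # (B, 2048, 7, 7)
        feats = feats.flatten(2).permute(0, 2, 1)  # (B, 49, 2048)
        memory = self.proj(feats)                  # (B, 49, d_model)

        tgt = self.embed(captions)                 # (B, T, d_model)
        # Causal mask so each position only attends to earlier caption tokens.
        T = captions.size(1)
        tgt_mask = torch.triu(
            torch.full((T, T), float("-inf"), device=captions.device), diagonal=1
        )
        out = self.decoder(tgt, memory, tgt_mask=tgt_mask)  # (B, T, d_model)
        return self.fc_out(out)                              # (B, T, vocab_size)
```

During training, the output logits would typically be compared against the shifted caption tokens with cross-entropy loss; at inference time, tokens are generated one at a time (e.g., greedily or with beam search).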
The primary objective of this project is to support the independence of visually impaired individuals by transforming visual content into spoken audio descriptions. While this project represents only an initial step, the overarching goal is to make a meaningful difference in the lives of the visually impaired.
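
The README does not name a specific text-to-speech component, so the final caption-to-audio step is sketched here under the assumption that a library such as gTTS is used; the function name and output path are hypothetical.

```python
from gtts import gTTS  # assumed TTS library; the project does not specify one


def caption_to_audio(caption: str, out_path: str = "caption.mp3") -> str:
    """Convert a generated caption into a spoken audio file for playback."""
    tts = gTTS(text=caption, lang="en")
    tts.save(out_path)
    return out_path


# Example: speak a caption produced by the captioning model
audio_file = caption_to_audio("a dog running across a grassy field")
```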
- Positional Embedding: Understanding and implementing positional embedding techniques in the context of image captioning (see the sketch after this list).
- Natural Language Processing (NLP): Exploration and application of NLP techniques for generating human-like captions from visual content.
- Transformers and Autoencoders: Proficiency in working with transformers and insight into how they can be applied in conjunction with autoencoders.
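
Because the transformer itself is order-agnostic, positional information has to be injected into the caption token embeddings (and optionally the flattened image features) before they reach the decoder. A minimal sketch of the standard sinusoidal positional encoding, with illustrative hyperparameters, could look like this; a learned alternative would simply be an `nn.Embedding` indexed by position.

```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Standard sinusoidal positional encoding added to a batch-first sequence."""

    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                 # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)                  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)                  # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))                   # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the encodings for the first seq_len positions
        return x + self.pe[:, : x.size(1)]
```

In the encoder-decoder sketch above, this module would be applied to the caption embeddings (the output of `self.embed(captions)`) before they are passed to the transformer decoder.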
This project is a result of dedicated coursework in Modern Analytics, and the development was guided by a commitment to improving accessibility and inclusivity for individuals with visual impairments.
While this project is a significant achievement, future work could include refining the model, expanding the dataset for better generalization, and exploring additional technologies to further improve the user experience.
This project is licensed under the MIT License.