Image Captioning with Transformers: Transforming Visual Content into Audio for the Visually Impaired

Sample Output

Sample.Output.mp4

Project Overview

The "Image Captioning with Transformers" project began as a team assignment for our Modern Analytics course. It combines Convolutional Neural Networks (CNNs) with transformers for image captioning: a CNN extracts image features, and a transformer generates the caption from them. Over the break, I continued refining the project on my own to improve the accuracy of the captioning system.
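To make the caption-generation step concrete, here is a minimal sketch of greedy decoding, one common way to turn a decoder's per-step scores into a caption. It is written in plain Python so it runs without PyTorch. The names `greedy_caption`, `next_token_scores`, and `toy_scores`, along with the `<start>` and `<end>` tokens, are assumptions made for illustration and are not taken from this repository. In the real system the scores would come from the transformer decoder conditioned on the CNN image features.

```python
from typing import Callable, Dict, List


def greedy_caption(
    next_token_scores: Callable[[List[str]], Dict[str, float]],
    max_len: int = 20,
    start: str = "<start>",
    end: str = "<end>",
) -> List[str]:
    """Build a caption one token at a time, always taking the top-scoring token."""
    tokens = [start]
    for _ in range(max_len):
        scores = next_token_scores(tokens)      # score every candidate next token
        best = max(scores, key=scores.get)      # greedy choice: highest score wins
        if best == end:                         # stop once the model emits the end token
            break
        tokens.append(best)
    return tokens[1:]                           # drop the start token


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a trained decoder: it always follows a fixed script."""
    script = ["a", "dog", "on", "grass", "<end>"]
    target = script[min(len(prefix) - 1, len(script) - 1)]
    return {tok: (1.0 if tok == target else 0.0) for tok in script}


caption = greedy_caption(toy_scores)
print(" ".join(caption))
```

Greedy decoding is the simplest choice. Beam search is a common alternative that keeps several candidate captions alive at each step instead of one.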

Technologies and Tools Used

  • Programming Language: Python
  • Deep Learning Framework: PyTorch
  • Convolutional Neural Network (CNN): ResNet

Objectives

The primary objective of this project is to support the independence of visually impaired individuals by transforming visual content into audio descriptions. The project is an initial step toward that broader goal.

Learning Outcomes

  1. Positional Embedding: Understanding and implementing positional embeddings in the context of image captioning.
  2. Natural Language Processing (NLP): Exploring and applying NLP techniques to generate human-like captions from visual content.
  3. Transformers and Autoencoders: Working with transformers and understanding how they can be used in conjunction with autoencoders.
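As an illustration of the positional embedding idea in the first item, the sketch below computes the fixed sinusoidal encoding from the original Transformer paper. This README does not say whether the project uses sinusoidal or learned positional embeddings, so treat this as one possible variant rather than the project's actual implementation. The function name `sinusoidal_positions` and the sizes used are assumptions. It is plain Python for portability. In PyTorch the same table would typically be stored as a tensor and added to the token embeddings.

```python
import math
from typing import List


def sinusoidal_positions(seq_len: int, d_model: int) -> List[List[float]]:
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even so sin/cos values can be paired")
    table = []
    for pos in range(seq_len):
        row = [0.0] * d_model
        # i walks over the even dimensions; each sin/cos pair shares one frequency.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row[i] = math.sin(angle)        # even dimension
            row[i + 1] = math.cos(angle)    # odd dimension
        table.append(row)
    return table


# Illustrative sizes only: 4 caption positions, 6 embedding dimensions.
pe = sinusoidal_positions(seq_len=4, d_model=6)
print(pe[0])
```

Each position gets a distinct vector, which lets the transformer tell word order apart even though it processes all caption tokens in parallel.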

Acknowledgements

This project is the result of coursework in Modern Analytics. Its development was guided by a commitment to improving accessibility and inclusivity for individuals with visual impairments.

Future Enhancements

Future enhancements could include refining the model, expanding the dataset for better generalization, and exploring additional technologies to improve the user experience.

License

This project is licensed under the MIT License.