
Built with PyTorch, this project combines a CNN for feature extraction with a Transformer for natural language generation, enhancing accessibility for visually impaired users by converting visual content into precise audio descriptions.


Image Captioning with Transformers: Transforming Visual Content into Audio for the Visually Impaired

Sample Output

See `Sample.Output.mp4` in the repository for a demo.

Project Overview

The "Image Captioning with Transformers" project began as a team assignment for our Modern Analytics course, focusing on integrating Convolutional Neural Networks (CNNs) and Transformers for image captioning. In that phase, we used a CNN for image feature extraction and a Transformer for caption generation. Over the break, I took the initiative to refine the project further and improve the captioning system's overall accuracy and performance.

Technologies and Tools Used

  • Programming Language: Python
  • Deep Learning Framework: PyTorch
  • Convolutional Neural Network (CNN): ResNet

Objectives

The primary objective of this project is to support the independence of visually impaired individuals by transforming visual content into audio descriptions. While this project is an initial step, the overarching goal is to make a meaningful impact on the lives of the visually impaired.

Learning Outcomes

  1. Positional Embedding: Understanding and implementation of positional embedding techniques in the context of image captioning.
  2. Natural Language Processing (NLP): Exploration and application of NLP techniques for generating human-like captions from visual content.
  3. Transformers and Autoencoders: Proficiency in working with transformers and gaining insights into their application in conjunction with autoencoders.
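Of the topics above, positional embedding is the easiest to show concretely. The standard sinusoidal scheme (the fixed encodings introduced with the original Transformer) can be written in plain Python; the function name below is my own:

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: even dims get sin, odd dims get cos.

    pe[pos][2i]   = sin(pos / 10000^(2i / d_model))
    pe[pos][2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe
```

These vectors are added to the token embeddings before the decoder, giving the otherwise order-agnostic attention layers a notion of token position.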

Acknowledgements

This project is a result of dedicated coursework in Modern Analytics, and the development was guided by a commitment to improving accessibility and inclusivity for individuals with visual impairments.

Future Enhancements

While this project is a solid first step, future enhancements could include refining the model, expanding the training dataset for better generalization, and exploring additional technologies to further improve the user experience.

License

This project is licensed under the MIT License.
