Gmail Extractor is an automated system for processing email attachments from a Gmail account. It downloads attachments, processes them based on their file type, and saves the processed data in a structured format. The system is designed to handle various file types including PDFs, Word documents, Excel spreadsheets, CSVs, and images.
- Project Structure
- Prerequisites
- Installation
- Configuration
- Usage + Process Flow
- File Processing
- Troubleshooting
- Deployment
- Contributing
- License
gmail-extractor/
│
├── config/
│ └── constants.js
│
├── logs/
│
├── src/
│ ├── attachments/
│ │ ├── fileHandler/
│ │ │ ├── imageHandler.js
│ │ │ ├── pdfHandler.js
│ │ │ ├── spreadsheetHandler.js
│ │ │ └── wordHandler.js
│ │ └── attachmentProcessor.js
│ │
│ ├── auth/
│ │ └── authHandler.js
│ │
│ ├── email/
│ │ ├── emailProcessor.js
│ │ ├── imapListener.js
│ │ └── resetEmailsAndAttachments.js
│ │
│ ├── google-sheets/
│ │ └── google-sheets-api.js
│ │
│ ├── utils/
│ │ ├── combineEmailData.js
│ │ ├── convertPdfToImage.js
│ │ ├── createDataDirectories.js
│ │ ├── deleteFile.js
│ │ ├── fileUtils.js
│ │ └── logger.js
│ │
│ └── zod-json/
│ ├── emailDataProcessor.js
│ └── emailDataSchema.js
│
├── .env
├── .gitignore
├── credentials.json
├── Dockerfile
├── index.js
├── package.json
├── README.md
└── token.json
- Node.js (v14 or later)
- npm or Yarn
- A Gmail account
- Google Cloud Console project with Gmail API enabled
- Clone the repository:
    git clone https://github.com/yourusername/gmail-extractor.git
    cd gmail-extractor
- Install dependencies:
    npm install
  or, if you're using Yarn:
    yarn install
- Copy the .env.example file to .env:
    cp .env.example .env
- Edit the .env file and fill in your specific details:
  - EMAIL_ADDRESS: Your Gmail address
  - PROCESSED_DIR: Directory for processed attachments (e.g., processed_attachments)
  - Add any other necessary environment variables
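The two .env settings listed above can be checked at startup so that a missing value fails early. The following is a minimal sketch, assuming the values have already been loaded into process.env (for example with the dotenv package; whether this project uses it is an assumption). The function name loadConfig and the fallback directory are hypothetical and chosen only for illustration.

```javascript
// Hypothetical helper: reads the settings described above from an env-like object.
// EMAIL_ADDRESS and PROCESSED_DIR come from this README; the function name and
// the fallback value are assumptions made for this sketch.
function loadConfig(env = process.env) {
  const emailAddress = (env.EMAIL_ADDRESS || '').trim();
  if (!emailAddress) {
    throw new Error('EMAIL_ADDRESS is not set; add it to your .env file');
  }
  // Fall back to the directory name this README uses as an example.
  const processedDir = (env.PROCESSED_DIR || '').trim() || 'processed_attachments';
  return { emailAddress, processedDir };
}

// Example with an explicit object instead of process.env:
const config = loadConfig({ EMAIL_ADDRESS: 'you@gmail.com', PROCESSED_DIR: 'out' });
console.log(config.processedDir);
```

Passing the environment as a parameter keeps the check easy to exercise without touching the real .env file.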
- Go to the Google Cloud Console.
- Create a new project or select an existing one.
- Enable the Gmail API for your project.
- Go to "Credentials" and create an OAuth 2.0 Client ID.
- Set up the OAuth consent screen if prompted.
- For "Application type", choose "Web application".
- Add http://localhost:3000/auth/google/callback to the "Authorized redirect URIs".
Create a credentials.json file in the root directory with the following structure:
{
  "web": {
    "client_id": "YOUR_CLIENT_ID.apps.googleusercontent.com",
    "project_id": "your-project-name",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_secret": "YOUR_CLIENT_SECRET",
    "redirect_uris": ["http://localhost:3000/auth/google/callback"]
  }
}
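To show how the fields in credentials.json fit together, here is a rough sketch that builds the Google consent URL from a parsed credentials object. It is not the code in authHandler.js: buildAuthUrl is a hypothetical name, and the default scope (https://mail.google.com/, the scope Gmail requires for IMAP access) is an assumption about what this project requests.

```javascript
// Sketch: build the OAuth2 consent URL from a parsed credentials.json.
// buildAuthUrl is a hypothetical helper; the scope is an assumption.
function buildAuthUrl(credentials, scope = 'https://mail.google.com/') {
  const { client_id, auth_uri, redirect_uris } = credentials.web;
  const url = new URL(auth_uri);
  url.searchParams.set('client_id', client_id);
  url.searchParams.set('redirect_uri', redirect_uris[0]);
  url.searchParams.set('response_type', 'code');
  url.searchParams.set('scope', scope);
  url.searchParams.set('access_type', 'offline'); // ask Google for a refresh token
  return url.toString();
}

// Example using the placeholder values from the structure above:
const sample = {
  web: {
    client_id: 'YOUR_CLIENT_ID.apps.googleusercontent.com',
    auth_uri: 'https://accounts.google.com/o/oauth2/auth',
    redirect_uris: ['http://localhost:3000/auth/google/callback'],
  },
};
console.log(buildAuthUrl(sample));
```

The redirect_uri sent here has to match one of the "Authorized redirect URIs" registered in the Google Cloud Console, otherwise Google rejects the request.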
- Enable IMAP in your Gmail settings.
- If not using OAuth, create an App Password (this requires 2-Step Verification to be enabled on the account):
  - Go to your Google Account settings.
  - Select "Security".
  - Under "Signing in to Google," select "App Passwords".
  - Generate a new App Password for "Mail" and "Other (Custom name)".
  - Use this password in your .env file instead of your regular Gmail password.
To start the Gmail extractor:
npm start
On first run, you'll be prompted to authorize the application. Follow the URL provided in the console to complete the OAuth2 flow.
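Once the OAuth2 flow has produced an access token, Gmail's IMAP server accepts it through the SASL XOAUTH2 mechanism, which expects a base64-encoded string in a fixed format. The sketch below shows only that encoding step. It assumes the project authenticates its IMAP connection this way; buildXoauth2Token is a hypothetical name, not a function from authHandler.js.

```javascript
// Sketch: encode a user and access token in the SASL XOAUTH2 format
// (user=<email>^Aauth=Bearer <token>^A^A, base64-encoded; ^A is byte 0x01).
// buildXoauth2Token is a hypothetical helper name.
function buildXoauth2Token(user, accessToken) {
  const raw = `user=${user}\x01auth=Bearer ${accessToken}\x01\x01`;
  return Buffer.from(raw, 'utf8').toString('base64');
}

// Example with placeholder values:
const xoauth2 = buildXoauth2Token('you@gmail.com', 'ACCESS_TOKEN');
console.log(xoauth2);
```

Many IMAP client libraries accept either this encoded string or the raw access token and do the encoding themselves, so check which form your library expects.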
Below is a sequence diagram illustrating the main process flow of the Gmail Extractor:
sequenceDiagram
participant User
participant ImapListener
participant EmailProcessor
participant AttachmentProcessor
participant FileHandlers
participant AuthHandler
participant ZodProcessor
participant OpenAIProcessor
participant Gmail
participant GoogleSheets
User->>ImapListener: Start application
ImapListener->>AuthHandler: Request authentication
AuthHandler->>Gmail: Authenticate (OAuth2)
Gmail-->>AuthHandler: Return access token
AuthHandler-->>ImapListener: Authentication successful
loop Listen for new emails
ImapListener->>Gmail: Check for new emails
Gmail-->>ImapListener: New email notification
ImapListener->>EmailProcessor: Process new email
EmailProcessor->>Gmail: Fetch email content
Gmail-->>EmailProcessor: Return email content
EmailProcessor->>AttachmentProcessor: Process attachments
AttachmentProcessor->>FileHandlers: Handle specific file types
FileHandlers-->>AttachmentProcessor: Return processed data
AttachmentProcessor-->>EmailProcessor: Return processed attachments
EmailProcessor->>EmailProcessor: Combine email data (all_{emailId}.json)
EmailProcessor->>ZodProcessor: Validate combined data
ZodProcessor-->>EmailProcessor: Return validated data
EmailProcessor->>OpenAIProcessor: Process data with OpenAI
OpenAIProcessor-->>EmailProcessor: Return structured data
EmailProcessor->>EmailProcessor: Save processed_offer_{emailId}.json
EmailProcessor->>GoogleSheets: Update spreadsheet with processed data
GoogleSheets-->>EmailProcessor: Confirmation
end
ImapListener->>User: Notification of processed emails
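The per-email part of the diagram above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the code in emailProcessor.js: the step functions (handleAttachment, validate, extract), the object shapes, and the idea of writing both JSON files to one directory are hypothetical. Only the two file name patterns come from the diagram. The steps are synchronous here to keep the sketch short; in practice they would likely be asynchronous.

```javascript
const path = require('path');

// File names taken from the process flow: all_{emailId}.json and
// processed_offer_{emailId}.json. Using a single directory is an assumption.
function outputPaths(dir, emailId) {
  return {
    combined: path.join(dir, `all_${emailId}.json`),
    processed: path.join(dir, `processed_offer_${emailId}.json`),
  };
}

// steps.handleAttachment, steps.validate and steps.extract are hypothetical
// stand-ins for the file handlers, the Zod validation and the OpenAI step.
function processEmail(email, steps) {
  const attachments = email.attachments.map((a) => steps.handleAttachment(a));
  const combined = { emailId: email.id, subject: email.subject, attachments };
  const validated = steps.validate(combined);
  const structured = steps.extract(validated);
  return { combined, structured, paths: outputPaths(steps.outputDir, email.id) };
}

// Usage with trivial stand-in steps:
const result = processEmail(
  { id: '42', subject: 'Offer', attachments: [{ filename: 'offer.pdf' }] },
  {
    outputDir: 'processed_attachments',
    handleAttachment: (a) => ({ filename: a.filename, text: '' }),
    validate: (data) => data,
    extract: (data) => ({ emailId: data.emailId, attachmentCount: data.attachments.length }),
  }
);
console.log(path.basename(result.paths.combined));
```

Injecting the steps as plain functions makes the order of operations explicit and lets each stage be replaced with a stub while testing.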
The system processes the following file types:
- PDF: Handled by pdfHandler.js
- Word (.doc, .docx): Handled by wordHandler.js
- Excel (.xls, .xlsx), CSV: Handled by spreadsheetHandler.js
- Images (.png, .jpg, .jpeg): Handled by imageHandler.js
Processed files and their extracted data are managed by attachmentProcessor.js.
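The mapping from file type to handler listed above amounts to a lookup by file extension. The following is a minimal sketch of such a dispatch. The extensions and handler file names come from the list; the names HANDLERS_BY_EXTENSION and pickHandler are hypothetical, and how attachmentProcessor.js actually selects a handler is not shown in this README.

```javascript
const path = require('path');

// Extension-to-handler table, mirroring the list above.
const HANDLERS_BY_EXTENSION = {
  '.pdf': 'pdfHandler.js',
  '.doc': 'wordHandler.js',
  '.docx': 'wordHandler.js',
  '.xls': 'spreadsheetHandler.js',
  '.xlsx': 'spreadsheetHandler.js',
  '.csv': 'spreadsheetHandler.js',
  '.png': 'imageHandler.js',
  '.jpg': 'imageHandler.js',
  '.jpeg': 'imageHandler.js',
};

// Returns the handler file name for an attachment, or null if the type is unsupported.
function pickHandler(filename) {
  const ext = path.extname(filename).toLowerCase();
  return HANDLERS_BY_EXTENSION[ext] || null;
}

console.log(pickHandler('Quarterly-Report.PDF'));
console.log(pickHandler('notes.txt'));
```

Lowercasing the extension matters in practice because attachment names often arrive with mixed case.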
- If you encounter authentication issues, ensure your credentials.json file is correctly set up and your Gmail account settings are properly configured.
- Check the logs in the logs/ directory for detailed error messages.
- For IMAP connection issues, verify that IMAP is enabled in your Gmail settings and that your network allows the connection.
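When tracking down authentication problems, a quick structural check of credentials.json can rule out the most common mistakes, such as leftover template placeholders. The helper below is a hypothetical sketch (findCredentialProblems is not part of the project) that checks only the fields shown in the Configuration section.

```javascript
// Sketch: list obvious problems in a parsed credentials.json object.
// findCredentialProblems is a hypothetical helper for troubleshooting.
function findCredentialProblems(credentials) {
  const web = credentials && credentials.web;
  if (!web || typeof web !== 'object') {
    return ['missing top-level "web" object'];
  }
  const problems = [];
  for (const key of ['client_id', 'client_secret', 'auth_uri', 'token_uri']) {
    const value = web[key];
    if (typeof value !== 'string' || value === '') {
      problems.push(`missing "${key}"`);
    } else if (value.startsWith('YOUR_')) {
      // The template in this README uses YOUR_... placeholders.
      problems.push(`"${key}" still contains the template placeholder`);
    }
  }
  if (!Array.isArray(web.redirect_uris) || web.redirect_uris.length === 0) {
    problems.push('missing "redirect_uris"');
  }
  return problems;
}

// Example: an object with no "web" section.
console.log(findCredentialProblems({}));
```

To run it against the real file, parse it first, for example with JSON.parse(fs.readFileSync('credentials.json', 'utf8')).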
For deploying to a production environment:
- Ensure all sensitive data (like credentials.json and .env) are properly secured and not exposed in your repository.
- Consider using environment variables for all sensitive information.
- If deploying to a cloud service, follow their specific guidelines for Node.js applications.
- Use a process manager like PM2 to keep the application running continuously.
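For the PM2 suggestion above, one option is an ecosystem file in the project root. The fragment below is a sketch, assuming the entry point is index.js as shown in the project structure; the app name, restart limits, and NODE_ENV value are illustrative choices, not settings taken from this project.

```javascript
// ecosystem.config.js (hypothetical file; values are illustrative)
module.exports = {
  apps: [
    {
      name: 'gmail-extractor',
      script: 'index.js',
      autorestart: true,    // restart the process if it exits
      max_restarts: 10,     // stop retrying after repeated crashes
      restart_delay: 5000,  // wait 5 seconds between restarts
      env: {
        NODE_ENV: 'production',
      },
    },
  ],
};
```

With this file in place, the application can be started with pm2 start ecosystem.config.js and inspected with pm2 logs gmail-extractor.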
Contributions are welcome! Please feel free to submit a Pull Request.
[Specify your license here, e.g., MIT, GPL, etc.]