Skip to content

get_doc() in Loop for Sync and Daily Tasks #45

@mrrobot47

Description

@mrrobot47

Metadata

  • File(s):
    • frappe_gmail_thread/tasks/sync.py:12
    • frappe_gmail_thread/tasks/daily.py:20
  • Category: Database / Scheduled Tasks
  • Severity: Medium
  • Effort to Fix: Low
  • Estimated Performance Gain: 20-40%

Problem Description

Both scheduled tasks (sync_emails and enable_pubsub_everyday) fetch Gmail Account names with get_all(), then immediately load the full document with get_doc() in a loop. This creates unnecessary database queries when only specific fields are needed.

The sync_emails task runs every 30 minutes, and enable_pubsub_everyday runs daily. With many Gmail accounts, these N+1 query patterns add up.

Code Location

sync.py (line 12):

def sync_emails():
    gmail_accounts = frappe.get_all(
        "Gmail Account",
        filters={"gmail_enabled": 1},
        fields=["name"],  # Only fetches name
    )
    for gmail_account in gmail_accounts:
        gaccount = frappe.get_doc("Gmail Account", gmail_account.name)  # Full doc load
        if gaccount.refresh_token:
            user = gaccount.linked_user
            # ...

daily.py (line 20):

def enable_pubsub_everyday():
    # ...
    gmail_accounts = frappe.get_all(
        "Gmail Account", filters={"gmail_enabled": 1}, fields=["name"]
    )
    for gmail_account in gmail_accounts:
        gaccount = frappe.get_doc("Gmail Account", gmail_account.name)  # Full doc load
        try:
            enable_pubsub(gaccount)
        # ...

Root Cause

The pattern of:

  1. Fetch list with minimal fields
  2. Loop and load full document for each item

This is inefficient because:

  • Initial query returns only name
  • Each get_doc() requires a separate database round-trip
  • Full document loads all fields, child tables, and runs controller logic

Proposed Solution

For sync.py - Only need refresh_token and linked_user:

def sync_emails():
    gmail_accounts = frappe.get_all(
        "Gmail Account",
        filters={"gmail_enabled": 1},
        fields=["name", "refresh_token", "linked_user"],  # Fetch needed fields
    )
    for gmail_account in gmail_accounts:
        if gmail_account.refresh_token:
            user = gmail_account.linked_user
            job_name = f"gmail_thread_sync_{user}"
            if not is_job_enqueued(job_name):
                frappe.enqueue(
                    "frappe_gmail_thread.frappe_gmail_thread.doctype.gmail_thread.gmail_thread.sync",
                    user=user,
                    queue="long",
                    job_name=job_name,
                    job_id=job_name,
                )

For daily.py - enable_pubsub needs full doc, but can batch check first:

If enable_pubsub() truly needs the full document, at minimum filter accounts that have the necessary tokens:

def enable_pubsub_everyday():
    google_settings = frappe.get_single("Google Settings")
    if not google_settings.enable:
        return
    if (
        not google_settings.custom_gmail_sync_in_realtime
        or not google_settings.custom_gmail_pubsub_topic
    ):
        return

    # Only fetch accounts with refresh tokens (they're the only ones that can use pubsub)
    gmail_accounts = frappe.get_all(
        "Gmail Account", 
        filters={
            "gmail_enabled": 1,
            "refresh_token": ["is", "set"]  # Filter out accounts without tokens
        }, 
        fields=["name"]
    )
    for gmail_account in gmail_accounts:
        gaccount = frappe.get_doc("Gmail Account", gmail_account.name)
        try:
            enable_pubsub(gaccount)
        except Exception as e:
            frappe.log_error(title="PubSub Error", message=e)
            continue

Implementation Steps

  1. sync.py:

    • Update fields to include ["name", "refresh_token", "linked_user"]
    • Remove get_doc() call
    • Access fields directly from the dict returned by get_all()
  2. daily.py:

    • Add filter for refresh_token to reduce iterations
    • If enable_pubsub() can be refactored to accept specific fields, do so
    • Otherwise, the filter optimization still reduces unnecessary iterations
  3. Test both scheduled tasks with multiple Gmail accounts

Additional Notes

  • The sync() function called by enqueue() does need the full document, but that's inside the background job
  • The scheduler only needs to check if tokens exist and get the user email
  • This pattern appears across multiple rtCamp repos (frappe-slack-connector, frappe-github-connector)

Metadata

Metadata

Labels

mediumMedium severityperformancePerformance optimization

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions