Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to optimize the performance of large documents? #677

Open
OTime-star opened this issue Aug 26, 2024 · 3 comments
Open

How to optimize the performance of large documents? #677

OTime-star opened this issue Aug 26, 2024 · 3 comments

Comments

@OTime-star
Copy link

OTime-star commented Aug 26, 2024

I'm trying to build a program for Online Excel, but I'm having performance issues.
In my implementation, a workbook is a document. The json format of the document is as follows:

{
    "activeSheetId": "aaa",
    "sheets": {
        "aaa": {
            "cells": {
                "0": {
                    "0": {
                        "value": "123"
                    }
                }
            }
        },
        "bbb": {
            "cells": {}
        }
    }
}

op format is as follows.

//set value
{ type: 'setValue', sheetId: 'aaa', row: 0, col: 0, value: '123' }
//add worksheet
{ type: 'addSheet', sheetId: 'bbb', sheetJson: { cells: {} } }

The performance issue is that the workbook's json is so large, and each time applyOp reads the entire workbook's json from the database.

I expect each worksheet to be a subdocument, and each applyOp reads only the main document (workbook) and the corresponding subdocument (worksheet) based on op.

@alecgibson
Copy link
Collaborator

In general, the smaller you can make the documents the better. Even Excel is notoriously bad at processing large datasets.

For this particular application, I'd recommend:

  • making sure your sheet is sparsely populated (which it looks like it is if you're using objects rather than arrays?)
  • having worksheets as their own documents if possible (though this may make referencing values between worksheets difficult if that's a feature you have)
  • if you need further gains, considering chunking worksheets up even further if possible (eg into rows, or columns, or subgrids), which would help with performance, but at the cost of increased complexity when moving things between subgrids, or trying to apply operations to an entire row, etc.
  • maybe working with projections to work on a single worksheet at a time

@OTime-star
Copy link
Author

Thank you for your help. treating each worksheet as a separate document seems like a good idea. Here are some of my attempts and the issues I've encountered:

"workbook": {
    "activeSheetId": "aaa",
    "sheets": [
        "aaa",
        "bbb"
    ]
}
"aaa": {
    "cells": {
        "0": {
            "0": {
                "value": "123"
            }
        }
    }
},
"bbb": {
    "cells": {}
}

Using the above data as an example, I would create three documents ['aaa', 'bbb', 'workbook']. There are some issues:

  1. Adding a sheet will change from one op to two ops: one op to create the sheet document and another op to modify the workbook document. How can these two ops be guaranteed to occur within a single transaction?
   // add sheet (before)
   { type: 'addSheet', sheetId: 'bbb', sheetJson: { cells: {} } }

   // add sheet (now)
   { create: { sheetJson: { cells: {} } } }, // for sheet doc
   { op: { type: 'addSheet', sheetId: 'bbb' } }, // for workbook doc
  1. Each document has its own ops and versions, making it complex to view the history of the workbook.
  2. I am using PostgreSQL database and custom ot type, and the projections feature seems can't to work together.

@showonne
Copy link

showonne commented Oct 8, 2024

Maybe you can compose op to ops by transaction. eg: ops: [op1, ...opn]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants