Document Processing Pipelines
Where to start? How about ImportJob: our progress-reporting mechanism. No matter how you upload files, Overview can always tell you:
- The DocumentSet the files will go into. (Overview always creates a document set first and adds files to it second.)
- Progress-reporting information: a way to set the user's expectations.
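For orientation, here is a minimal Scala sketch of the kind of information an ImportJob conveys. The field names are illustrative assumptions, not Overview's actual class:

```scala
// Illustrative sketch only: the shape of a progress report.
// Field names are assumptions, not Overview's real ImportJob class.
case class ImportJob(
  documentSetId: Long,        // the DocumentSet the files will go into
  progress: Option[Double],   // fraction complete (0.0 to 1.0), if known
  description: Option[String] // a hint for the user, e.g. "Processing 3 of 10 files"
)
```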
Beyond that, our import pipelines have a bit in common:
- Every pipeline creates Document objects.
- Documents are always generated in Overview's "worker" process (as opposed to its "web server" process).
Each import pipeline creates Documents. Document data is stored in a few places:
- Most document data is in the Postgres database, in the `document` table. In particular, document text and title (which Overview generates within these pipelines) and document notes and metadata (which the user provides) are stored here.
- Tags are in the `tag` table, and document-tag lists are in the `document_tag` table.
- Processed uploaded files are Files, with metadata in the `file` table and file contents in BlobStorage (Amazon S3 or the filesystem). Alongside each uploaded file is a generated PDF file Overview lets the user view.
- When the user chooses to split by page, Overview generates a PDF per page for the user to view: that's in the `page` table and in BlobStorage.
- Thumbnails are in BlobStorage.
- Each document set also has a Lucene index containing document titles, text and metadata. The worker maintains those indexes on the filesystem.
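To summarize where everything lives, here is a hedged Scala sketch. The names are hypothetical and only mirror the storage layout described above:

```scala
// Illustrative only: where a document's pieces live. Names are hypothetical.
case class StoredDocument(
  id: Long,                         // Postgres "document" table
  title: String,                    // Postgres: generated by the import pipeline
  text: String,                     // Postgres: generated by the import pipeline
  metadataJson: String,             // Postgres: supplied by the user
  fileId: Option[Long],             // Postgres "file" table; file bytes live in BlobStorage
  pageId: Option[Long],             // Postgres "page" table (split-by-page); bytes in BlobStorage
  thumbnailLocation: Option[String] // BlobStorage
)
// Tags live in the "tag" and "document_tag" tables; full-text search data
// lives in a per-document-set Lucene index on the worker's filesystem.
```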
See Ingest Pipeline. We plan to make it the only pipeline in Overview.
Right now, a bit of a hack remains:
- User uploads files into GroupedFileUploads (and Postgres Large Objects).
  - On demand, the server creates a FileGroup to hold all the files the user will upload. (There is one FileGroup per User+DocumentSet, and the DocumentSet may be null here.)
  - The user streams each file into a GroupedFileUpload, assigning a client-generated GUID to handle resuming. See js-mass-upload for design details.
  - The user clicks "Finish". Overview creates the DocumentSet if it's a new document set, then Overview sets `FileGroup.addToDocumentSetId` and kicks off the worker process.
- For each file (see the sketch after this list):
  - Worker converts `GroupedFileUpload` to `WrittenFile`. It deletes `GroupedFileUpload`'s associated Postgres Large Object.
  - Overview runs the Ingest Pipeline on the `WrittenFile`.
- Worker sorts the documents and writes the result to `document_set.sorted_document_ids`.
- Worker deletes the FileGroup.
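Here is a hedged Scala sketch of the worker-side flow described in the list above. Every name and signature is an assumption for illustration, not Overview's actual API:

```scala
// Illustrative sketch of the worker's file-group processing; all names,
// types and signatures are assumptions, not Overview's real code.
case class FileGroup(id: Long, addToDocumentSetId: Long)
case class GroupedFileUpload(id: Long, guid: java.util.UUID)
case class WrittenFile(id: Long)

def convertToWrittenFile(upload: GroupedFileUpload): WrittenFile = ??? // also deletes the Postgres Large Object
def runIngestPipeline(file: WrittenFile): Unit = ???                   // see Ingest Pipeline
def writeSortedDocumentIds(documentSetId: Long): Unit = ???            // fills document_set.sorted_document_ids
def deleteFileGroup(fileGroup: FileGroup): Unit = ???

def processFileGroup(fileGroup: FileGroup, uploads: Seq[GroupedFileUpload]): Unit = {
  uploads.foreach { upload =>
    val written = convertToWrittenFile(upload)
    runIngestPipeline(written)
  }
  writeSortedDocumentIds(fileGroup.addToDocumentSetId)
  deleteFileGroup(fileGroup)
}
```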
When the user asks for a progress report, the web server builds an ImportJob from the `file_group` table.
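For example (again a sketch, reusing the hypothetical ImportJob shape above; the `file_group` column names here are assumptions), the mapping might look like:

```scala
// Illustrative only: deriving a progress report from a file_group row.
// Column names are assumptions, not Overview's actual schema.
case class FileGroupRow(addToDocumentSetId: Long, nFilesProcessed: Int, nFilesTotal: Int)

def importJobOf(row: FileGroupRow): ImportJob = ImportJob(
  documentSetId = row.addToDocumentSetId,
  progress = Some(row.nFilesProcessed.toDouble / row.nFilesTotal),
  description = Some(s"Processed ${row.nFilesProcessed} of ${row.nFilesTotal} files")
)
```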
TODO
TODO