Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not useable newspaper migration site #4384

Closed
henning-gerhardt opened this issue May 6, 2021 · 10 comments · Fixed by #4417
Closed

Not useable newspaper migration site #4384

henning-gerhardt opened this issue May 6, 2021 · 10 comments · Fixed by #4417
Labels

Comments

@henning-gerhardt
Copy link
Collaborator

If you have a large database with a lot of newspaper processes (over 300,000) and corresponding batches (> 150) using the newspaper migration site is not possible as it takes hours or days to display this page. While loading this page a lot of SELECT statements - which is ok - but even the same amount of UPDATE statements are started. It is for me unclear why UPDATE statements are used / fired on displaying this page.

Displaying the batches tab on the process page takes one or two minutes and is much faster then displaying the nearly same data on the newspaper migration site.

@matthias-ronge
Copy link
Collaborator

Can you see what is updated? Process? Project? Batch?

@henning-gerhardt
Copy link
Collaborator Author

So far as I can say process table get updated and of every single newspaper process the field processBaseUri get populated by a string interpretation of the process id.

@matthias-ronge
Copy link
Collaborator

I am guessing that the save action is fired from ProcessService.getProcessDataDirectory(). The NewspaperProcessesMigrator checks for the existence of a file with the name meta_year.xml to decide if a batch is a newspaper batch. To look for the file, it needs to retrieve the process data directory. If I read the source code correctly, this should only happen, if the database field processBaseUri is null (which is the case right after migration), but once the page has loaded it should be fast every next time.

@henning-gerhardt
Copy link
Collaborator Author

henning-gerhardt commented May 11, 2021

I can only confirm, that this long loading of this page (takes hours on our migration test system) appears only once and later access of this page is more or less fast - at least faster than on first access. But as migration of our data is not done yet I will struggle in this issue after every data migration :-(

I discovered that storing the changed value in the method storeObject of class BaseDAO take a lot of time: up to 15 seconds.

Edit: if a non-null value is needed in processBaseUri and is set to the used process id, why did this is not done in database migration?

@matthias-ronge
Copy link
Collaborator

if a non-null value is needed in processBaseUri and is set to the used process id, why did this is not done in database migration?

I assume it was because, when this was implemented, it was first a blob field and these cannot be set by SQL language. So the getter obtains the value from the file management module when it is null, and saves it to the database.

But this is an idea, you should can set the field manually in the database, like:

UPDATE process SET processBaseUri = CONCAT(id, "/") WHERE processBaseUri IS NULL;

Another consideration is whether we should allow the file system check to be switched off via the kitodo_config.properties. Then all batches are displayed, including those that do not contain newspapers, but it would be much faster. Unfortunately, the route via the filemanagement module is very slow.

@henning-gerhardt
Copy link
Collaborator Author

henning-gerhardt commented May 11, 2021

Your suggestion is wrong or not correct as in database only a string representation of the process is stored without the ending slash! Even with the ending slash I don't understand the benefit of storing the process id with an ending slash. Accessing this field from the database has a higher cost then combining this two values in program code - only true if the process id is already available if not costs are similar.

Edit start:
database dump diff between before and after accessing the newspaper migration page:

-INSERT INTO `process` VALUES (134049,'EiseveNa_435567721-17520408',17,'100000000',6,0,3,0,'2015-07-21 14:00:59',5,24,'',1,'INDEX',NULL,1,NULL,NULL,0);
+INSERT INTO `process` VALUES (134049,'EiseveNa_435567721-17520408',17,'100000000',6,0,3,0,'2015-07-21 14:00:59',5,24,'',1,'INDEX','134049',1,NULL,NULL,0);

It is visible that after the indexAction column with the content 'INDEX' only the process id without a trailing slash is inserted.

Edit end

Issue is not proving the existing of a file which is working really fast the remaining issue is to update the processBaseUri field which takes a long time on accessing this page.

@matthias-ronge
Copy link
Collaborator

matthias-ronge commented May 11, 2021

Your suggestion is wrong or not correct as in database only a string representation of the process is stored without the ending slash!

This is what I used to know, too, yes, but I checked my database and on my local system it is stored with an ending slash. I was really surprised to see that this morning, because I also thought it would be stored without slash. The question is, why is there a difference? Are there are more methods that write this entry and generate different values? I have no idea where this comes from. If it is without slash, the database statement to set it is trivial.

Even with the ending slash I don't understand the benefit of storing the process id with an ending slash.

I think this needs clarification. When I create a process in version 3, the field is set and it is with slash.

Issue is not proving the existing of a file which is working really fast

This is good to know, I wasn’t shure about it.

@matthias-ronge
Copy link
Collaborator

After digging deeper into this: Reproducibly, when creating a process in version 3, the field processBaseUri is populated with the process ID with trailing slash; but if it is null and it is accessed (for example, accept a task which has read or write images property), it is set to process ID without trailing slash.

On process creation (CreateProcessForm.createProcessesLocation(List<TempProcess>)):

URI processBaseUri = ServiceManager.getFileService().createProcessLocation(tempProcess.getProcess());
tempProcess.getProcess().setProcessBaseUri(processBaseUri);

public URI createProcessLocation(Process process) throws IOException, CommandException {
    URI processLocationUri = fileManagementModule.createProcessLocation(process.getId().toString());
    /* ... */
}

public URI createProcessLocation(String processId) throws IOException {
    File processRootDirectory = new File(KitodoConfig.getKitodoDataDirectory() + File.separator + processId);
    /* ... */
    return fileMapper.unmapUriFromKitodoDataDirectoryUri(Paths.get(processRootDirectory.getPath()).toUri());
}

processRootDirectoryD:\Kitodo_Data.v3\metadata\3
.getPath()D:\Kitodo_Data.v3\metadata\3
Paths.get(…)D:\Kitodo_Data.v3\metadata\3
.toUri()file:///D:/Kitodo_Data.v3/metadata/3/
fileMapper.unmapUriFromKitodoDataDirectoryUri(…)3/

On access (FileService.getProcessBaseUriForExistingProcess(Process)):

process.setProcessBaseUri(fileManagementModule.createUriForExistingProcess(process.getId().toString()));

public URI createUriForExistingProcess(String processId) {
    return URI.create(processId);
}

process.getId().toString()3
URI.create(…)3

Either both spellings are allowed, or either is a bug and we haven't noticed it yet.

@henning-gerhardt
Copy link
Collaborator Author

reopen as still not complete fixed

@henning-gerhardt
Copy link
Collaborator Author

Closing as #4421 is now merged and this issue solved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants