Not useable newspaper migration site #4384

henning-gerhardt · 2021-05-06T07:44:00Z

If you have a large database with a lot of newspaper processes (over 300,000) and corresponding batches (> 150) using the newspaper migration site is not possible as it takes hours or days to display this page. While loading this page a lot of SELECT statements - which is ok - but even the same amount of UPDATE statements are started. It is for me unclear why UPDATE statements are used / fired on displaying this page.

Displaying the batches tab on the process page takes one or two minutes and is much faster then displaying the nearly same data on the newspaper migration site.

matthias-ronge · 2021-05-10T13:02:29Z

Can you see what is updated? Process? Project? Batch?

henning-gerhardt · 2021-05-10T14:21:42Z

So far as I can say process table get updated and of every single newspaper process the field processBaseUri get populated by a string interpretation of the process id.

matthias-ronge · 2021-05-11T08:11:46Z

I am guessing that the save action is fired from ProcessService.getProcessDataDirectory(). The NewspaperProcessesMigrator checks for the existence of a file with the name meta_year.xml to decide if a batch is a newspaper batch. To look for the file, it needs to retrieve the process data directory. If I read the source code correctly, this should only happen, if the database field processBaseUri is null (which is the case right after migration), but once the page has loaded it should be fast every next time.

henning-gerhardt · 2021-05-11T08:18:58Z

I can only confirm, that this long loading of this page (takes hours on our migration test system) appears only once and later access of this page is more or less fast - at least faster than on first access. But as migration of our data is not done yet I will struggle in this issue after every data migration :-(

I discovered that storing the changed value in the method storeObject of class BaseDAO take a lot of time: up to 15 seconds.

Edit: if a non-null value is needed in processBaseUri and is set to the used process id, why did this is not done in database migration?

matthias-ronge · 2021-05-11T09:19:33Z

if a non-null value is needed in processBaseUri and is set to the used process id, why did this is not done in database migration?

I assume it was because, when this was implemented, it was first a blob field and these cannot be set by SQL language. So the getter obtains the value from the file management module when it is null, and saves it to the database.

But this is an idea, you should can set the field manually in the database, like:

UPDATE process SET processBaseUri = CONCAT(id, "/") WHERE processBaseUri IS NULL;

Another consideration is whether we should allow the file system check to be switched off via the kitodo_config.properties. Then all batches are displayed, including those that do not contain newspapers, but it would be much faster. Unfortunately, the route via the filemanagement module is very slow.

henning-gerhardt · 2021-05-11T10:53:06Z

Your suggestion is wrong or not correct as in database only a string representation of the process is stored without the ending slash! Even with the ending slash I don't understand the benefit of storing the process id with an ending slash. Accessing this field from the database has a higher cost then combining this two values in program code - only true if the process id is already available if not costs are similar.

Edit start:
database dump diff between before and after accessing the newspaper migration page:

-INSERT INTO `process` VALUES (134049,'EiseveNa_435567721-17520408',17,'100000000',6,0,3,0,'2015-07-21 14:00:59',5,24,'',1,'INDEX',NULL,1,NULL,NULL,0);
+INSERT INTO `process` VALUES (134049,'EiseveNa_435567721-17520408',17,'100000000',6,0,3,0,'2015-07-21 14:00:59',5,24,'',1,'INDEX','134049',1,NULL,NULL,0);

It is visible that after the indexAction column with the content 'INDEX' only the process id without a trailing slash is inserted.

Edit end

Issue is not proving the existing of a file which is working really fast the remaining issue is to update the processBaseUri field which takes a long time on accessing this page.

matthias-ronge · 2021-05-11T12:43:06Z

Your suggestion is wrong or not correct as in database only a string representation of the process is stored without the ending slash!

This is what I used to know, too, yes, but I checked my database and on my local system it is stored with an ending slash. I was really surprised to see that this morning, because I also thought it would be stored without slash. The question is, why is there a difference? Are there are more methods that write this entry and generate different values? I have no idea where this comes from. If it is without slash, the database statement to set it is trivial.

Even with the ending slash I don't understand the benefit of storing the process id with an ending slash.

I think this needs clarification. When I create a process in version 3, the field is set and it is with slash.

Issue is not proving the existing of a file which is working really fast

This is good to know, I wasn’t shure about it.

matthias-ronge · 2021-05-11T13:17:26Z

After digging deeper into this: Reproducibly, when creating a process in version 3, the field processBaseUri is populated with the process ID with trailing slash; but if it is null and it is accessed (for example, accept a task which has read or write images property), it is set to process ID without trailing slash.

On process creation (`CreateProcessForm.createProcessesLocation(List<TempProcess>)`):

URI processBaseUri = ServiceManager.getFileService().createProcessLocation(tempProcess.getProcess());
tempProcess.getProcess().setProcessBaseUri(processBaseUri);

public URI createProcessLocation(Process process) throws IOException, CommandException {
    URI processLocationUri = fileManagementModule.createProcessLocation(process.getId().toString());
    /* ... */
}

public URI createProcessLocation(String processId) throws IOException {
    File processRootDirectory = new File(KitodoConfig.getKitodoDataDirectory() + File.separator + processId);
    /* ... */
    return fileMapper.unmapUriFromKitodoDataDirectoryUri(Paths.get(processRootDirectory.getPath()).toUri());
}

processRootDirectory → D:\Kitodo_Data.v3\metadata\3
.getPath() → D:\Kitodo_Data.v3\metadata\3
Paths.get(…) → D:\Kitodo_Data.v3\metadata\3
.toUri() → file:///D:/Kitodo_Data.v3/metadata/3/
fileMapper.unmapUriFromKitodoDataDirectoryUri(…) → 3/

On access (`FileService.getProcessBaseUriForExistingProcess(Process)`):

process.setProcessBaseUri(fileManagementModule.createUriForExistingProcess(process.getId().toString()));

public URI createUriForExistingProcess(String processId) {
    return URI.create(processId);
}

process.getId().toString() → 3
URI.create(…) → 3

Either both spellings are allowed, or either is a bug and we haven't noticed it yet.

henning-gerhardt · 2021-05-20T07:51:48Z

reopen as still not complete fixed

henning-gerhardt · 2021-05-27T08:09:52Z

Closing as #4421 is now merged and this issue solved.

henning-gerhardt added 3.x blocking labels May 6, 2021

Kathrin-Huber mentioned this issue May 18, 2021

speed up migrationPage #4417

Merged

Kathrin-Huber closed this as completed in #4417 May 20, 2021

henning-gerhardt reopened this May 20, 2021

henning-gerhardt closed this as completed May 27, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Not useable newspaper migration site #4384

Not useable newspaper migration site #4384

henning-gerhardt commented May 6, 2021

matthias-ronge commented May 10, 2021

henning-gerhardt commented May 10, 2021

matthias-ronge commented May 11, 2021

henning-gerhardt commented May 11, 2021 •

edited

Loading

matthias-ronge commented May 11, 2021

henning-gerhardt commented May 11, 2021 •

edited

Loading

matthias-ronge commented May 11, 2021 •

edited

Loading

matthias-ronge commented May 11, 2021

henning-gerhardt commented May 20, 2021

henning-gerhardt commented May 27, 2021

Not useable newspaper migration site #4384

Not useable newspaper migration site #4384

Comments

henning-gerhardt commented May 6, 2021

matthias-ronge commented May 10, 2021

henning-gerhardt commented May 10, 2021

matthias-ronge commented May 11, 2021

henning-gerhardt commented May 11, 2021 • edited Loading

matthias-ronge commented May 11, 2021

henning-gerhardt commented May 11, 2021 • edited Loading

matthias-ronge commented May 11, 2021 • edited Loading

matthias-ronge commented May 11, 2021

On process creation (CreateProcessForm.createProcessesLocation(List<TempProcess>)):

On access (FileService.getProcessBaseUriForExistingProcess(Process)):

henning-gerhardt commented May 20, 2021

henning-gerhardt commented May 27, 2021

henning-gerhardt commented May 11, 2021 •

edited

Loading

henning-gerhardt commented May 11, 2021 •

edited

Loading

matthias-ronge commented May 11, 2021 •

edited

Loading

On process creation (`CreateProcessForm.createProcessesLocation(List<TempProcess>)`):

On access (`FileService.getProcessBaseUriForExistingProcess(Process)`):