Report scraper progress #52

Open
benoit74 opened this issue Oct 10, 2023 · 2 comments
benoit74 (Collaborator) commented Oct 10, 2023

Add support for generating the task_progress.json file, so that it can be reported by Zimfarm workers and displayed in the Zimfarm UI.
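
For illustration, a minimal sketch of writing such a file, assuming a simple {"done": N, "total": M} shape (the exact schema expected by Zimfarm workers is an assumption here and would need to be confirmed):

    import json
    from pathlib import Path

    def write_task_progress(path: Path, done: int, total: int) -> None:
        """Write progress atomically so a worker polling the file never
        reads partial JSON. The schema is an assumption, not confirmed."""
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps({"done": done, "total": total}))
        tmp.replace(path)  # atomic rename on POSIX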

benoit74 added the enhancement label Oct 10, 2023
benoit74 added this to the 1.2.0 milestone Oct 10, 2023
benoit74 modified the milestones: 1.2.0, 1.3.0 Feb 14, 2024
benoit74 modified the milestones: 1.3.0, 2.0.0 Mar 27, 2024
githyuvi (Contributor) commented
I went through this issue. This is what I suppose needs to be implemented, after going through this code:

    def populate_nodes_executor(self):
        """Loop on content nodes to create zim entries from kolibri DB"""

        def schedule_node(item):
            future = self.nodes_executor.submit(self.add_node, item=item)
            self.nodes_futures.add(future)

        # schedule root-id
        schedule_node((self.db.root["id"], self.db.root["kind"]))

        # fill queue with (node_id, kind) tuples for all root node's descendants
        for node in self.db.get_node_descendants(self.root_id):
            if self.node_ids is None or node["id"] in self.node_ids:
                schedule_node((node["id"], node["kind"]))

I suppose I should track self.nodes_futures.
Let me know if I am on the right track.
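
For illustration, tracking those futures could look roughly like the sketch below; report_progress is a hypothetical method and write_task_progress is the hypothetical helper sketched above, neither is an existing scraper API:

    def schedule_node(self, item):
        future = self.nodes_executor.submit(self.add_node, item=item)
        # hypothetical: refresh the progress file whenever a node future completes
        future.add_done_callback(lambda _: self.report_progress())
        self.nodes_futures.add(future)

    def report_progress(self):
        # total keeps growing while scheduling is still ongoing, so the ratio
        # can temporarily decrease; callbacks run on worker threads, so a real
        # implementation would likely guard this with a lock
        total = len(self.nodes_futures)
        done = sum(1 for f in self.nodes_futures if f.done())
        write_task_progress(Path("task_progress.json"), done, total)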

benoit74 (Collaborator, Author) commented
Yes, plus the videos_futures; those are particularly important since they are populated when a video needs reencoding, which quite often takes way longer than node processing.
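
For illustration only, the hypothetical report_progress sketched above would then need to aggregate both sets, e.g.:

    def report_progress(self):
        # hypothetical aggregation over both futures sets; video futures can
        # dominate wall-clock time when reencoding is needed, even though
        # each future counts equally here
        futures = self.nodes_futures | self.videos_futures
        done = sum(1 for f in futures if f.done())
        write_task_progress(Path("task_progress.json"), done, len(futures))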

However, I suspect this multiprocessing code is significantly broken; see #106.

I suspect we will not use these methods anymore in the future, or at least we will most probably consolidate the multiprocessing logic in a shared module.

I don't know if it is really worthwhile to implement this scraper progress feature now, given that it might be difficult to debug due to the other issue, and the functions might change in the future.
