Entity network #54

Closed
wants to merge 3 commits into from

caiocmello
Collaborator

Hi @Ferdaous-af,

Could you please review this notebook? Please refer to the reviewer's guidelines below.

Notebook file: https://github.com/impresso/impresso-datalab-notebooks/blob/main/explore-vis/entity_network.ipynb

Reviewer's guidelines:

  1. Is the code consistent? E.g. use of variables, formatting.
  2. Is the explanation of the code correct? Is any information imprecise? Could we expand on any aspect you consider important for understanding the code?
  3. Are there any references to external resources that could enrich this notebook?
  4. Is the information (text) contained in the NB enough to perform the proposed task? Is there something missing?
  5. Do the objectives under 'what you will learn' match the content? Do we provide everything we promise in the objectives?

@caiocmello caiocmello requested a review from Ferdaous-af April 16, 2025 14:47
@caiocmello caiocmello marked this pull request as ready for review April 16, 2025 14:47
@Ferdaous-af Ferdaous-af Apr 17, 2025

We define people's connections based on whether they occur in the same content item

Small rewording suggestion for clarity:

We define people's connections based on whether their names are mentioned in the same content item.

Please note that when a newspaper does not have segmentation (OLR)

Perhaps add here what OLR is and a link to the FAQ section on it: https://impresso-project.ch/app/faq#What-OCR

Suggestion:

Please note that when a newspaper lacks segmentation (OLR – Optical Layout Recognition), content items for this title correspond to entire pages.

To unveil the reasons why they occur together, further analysis using different methods is necessary.

Suggested clarification:

Understanding the reason behind a co-occurrence typically requires further contextual or qualitative analysis.
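
To make the co-occurrence definition above concrete, here is a minimal sketch of how such connections could be counted. It is not taken from the notebook: the `content_items` mapping, the placeholder person names and the `cooccurrence_counts` helper are all hypothetical, and only the Python standard library is used.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: content item id -> person names mentioned in it.
content_items = {
    "item-1": ["Person A", "Person B", "Person C"],
    "item-2": ["Person A", "Person B"],
    "item-3": ["Person C"],
}


def cooccurrence_counts(items):
    """Count, for each unordered pair of persons, the content items mentioning both."""
    counts = Counter()
    for persons in items.values():
        # Deduplicate and sort so that each pair is counted once per item
        # and always appears in the same order.
        for pair in combinations(sorted(set(persons)), 2):
            counts[pair] += 1
    return counts


edges = cooccurrence_counts(content_items)
```

A pair's count only says that two names appear in the same content item. When a title has no OLR segmentation, a content item is a whole page, so a high count may simply reflect two unrelated articles printed side by side.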



@Ferdaous-af Ferdaous-af Apr 17, 2025

I think it would be clearer to harmonize the structure across the list (suggestion below). Also, it could be useful to clarify that “persons” refer to named entities extracted from content items, and to specify what formats are meant by “different formats” when exporting the network graph.

Suggested rewording:

What will you learn?

By completing this notebook, you will learn how to:

  • Retrieve a list of named persons mentioned in content items for a given query;
  • Transform this list of entities into a dataframe suitable for generating co-occurrence network graphs;
  • Create and display an interactive network graph to visualise connections between persons mentioned together in Impresso content;
  • Export the resulting dataframes as CSV files to support reproducibility;
  • Save the network graph in different formats (PNG, SVG, GEXF, and JSON) for further analysis.
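
As an illustration of the two export objectives, the sketch below writes an edge list to CSV and a node-link structure to JSON using only the standard library. The `edges` data, the file names and the `export_edges` helper are hypothetical; the notebook itself presumably relies on pandas and NetworkX for this (for example `DataFrame.to_csv` and `networkx.write_gexf`).

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical edge list: (source, target, co-occurrence count).
edges = [("Person A", "Person B", 2), ("Person A", "Person C", 1)]


def export_edges(edges, out_dir):
    """Write the edges to edges.csv and a node-link graph to graph.json."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    csv_path = out_dir / "edges.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "target", "count"])
        writer.writerows(edges)

    # Collect every name that appears as a source or a target.
    nodes = sorted({name for s, t, _ in edges for name in (s, t)})
    graph = {
        "nodes": [{"id": n} for n in nodes],
        "links": [{"source": s, "target": t, "count": c} for s, t, c in edges],
    }
    json_path = out_dir / "graph.json"
    json_path.write_text(json.dumps(graph, indent=2), encoding="utf-8")
    return csv_path, json_path


# Write into a temporary directory and read the JSON back as a sanity check.
with tempfile.TemporaryDirectory() as tmp:
    csv_path, json_path = export_edges(edges, tmp)
    exported = json.loads(json_path.read_text(encoding="utf-8"))
```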


@Ferdaous-af Ferdaous-af Apr 17, 2025

If it's okay to make this part a little longer, it would be helpful to briefly describe what each link offers and perhaps also add links to the documentation for the Impresso API and Python library, NetworkX and ipysigma. It would also be good to swap the order of the two suggested resources, as "Exploring and Analyzing etc." assumes familiarity with "From Hermeneutics to data etc."

Suggestion:

Useful resources

If you’d like to go deeper into network analysis or its use in historical research, the following resources are recommended:

Additional references:

- Impresso Public API documentation

- Impresso Python library

- NetworkX documentation

- ipysigma documentation

Also, a suggestion of other resources:

Introduction to Social Network Analysis: YouTube tutorials by Martin Grandjean that review the main concepts of social network analysis and highlight the challenges that arise when analyzing relational historical objects.

Demystifying Networks, Parts I & II by Scott B. Weingart: an older but still interesting resource with a simple introduction to networks, including concept definitions and key vocabulary.

The Six Degrees of Francis Bacon Project: a DH project that reconstructs the social network of early modern intellectual life in Britain and includes publications and methodology.

Historical Network Research Community: A hub for scholars working at the intersection of history and network analysis. Offers conference proceedings, reading lists, and tutorials.



@Ferdaous-af Ferdaous-af Apr 17, 2025

The sentence “all person entities mentioned in all articles that talk about the Prague Spring” is a bit misleading and could be read as exhaustive. If I'm not mistaken, the query in the code returns the top 100 most frequently mentioned person entities associated with that query, not all possible results.

Suggested rephrasing:

First, we retrieve the top 100 most frequently mentioned person entities in all articles that talk about the Prague Spring, using the search facets method from the Impresso Python library.

Also, a brief explanation of what the "search facets method" is would be helpful.



@Ferdaous-af Ferdaous-af Apr 17, 2025

I think an explanation of the output of this query would be helpful to avoid any confusion on the results.

The query output reads "Contains 100 items (0 - 100) of 2355 total items". The word "item" here can be confused with the "content item" explained above, but these results are named entities, not content items.

Suggestion:

The result is a list of the 100 most frequently mentioned person entities, where each entry includes:

  • a unique identifier (value),
  • the number of times the person is mentioned (count),
  • and the display name (label).

Note: these 100 entries are the most frequent out of a total of 2,355 persons mentioned in all matched content items.
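
A short sketch could accompany this explanation in the notebook. The entries below are made-up placeholders that only mirror the three fields (`value`, `count`, `label`) described above; the `summarise` helper is hypothetical and does not call the Impresso library.

```python
# Hypothetical facet entries with the three fields described above.
facet_entries = [
    {"value": "id-person-a", "count": 120, "label": "Person A"},
    {"value": "id-person-b", "count": 45, "label": "Person B"},
    {"value": "id-person-c", "count": 80, "label": "Person C"},
]
total_persons = 2355  # total reported in the query output quoted above


def summarise(entries, total):
    """Return readable lines: a header, then one line per person, most mentioned first."""
    ranked = sorted(entries, key=lambda e: e["count"], reverse=True)
    header = f"Showing {len(ranked)} of {total} person entities"
    lines = [f"{e['label']} ({e['value']}): {e['count']} mentions" for e in ranked]
    return [header] + lines


summary = summarise(facet_entries, total_persons)
for line in summary:
    print(line)
```

Printing a header like this makes it explicit that the entries are persons rather than content items, and that only a subset of the total is shown.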



@Ferdaous-af Ferdaous-af Apr 17, 2025

If running in Colab - activate custom widgets to allow ipysigma to render the graph.

It would be helpful to add more details here on how to activate custom widgets; I'm not sure how one does this in Colab. In any case, the graph rendered correctly.
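
For reference, a commonly used pattern is to call `enable_custom_widget_manager()` from `google.colab.output` before rendering the widget. The sketch below assumes this is what the notebook intends; the `IN_COLAB` flag is a hypothetical name added so the cell also runs outside Colab.

```python
# Enable Colab's custom widget manager when running in Colab;
# do nothing in other environments (JupyterLab, plain Python, ...).
try:
    from google.colab import output
except ImportError:
    IN_COLAB = False
else:
    output.enable_custom_widget_manager()
    IN_COLAB = True
```

A cell like this, placed before the graph is displayed, would make the instruction actionable for Colab users without affecting local runs.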



@Ferdaous-af Ferdaous-af Apr 17, 2025

The output will prompt you to choose 'what should represent the size of the nodes' in your graph. Select it before you continue.

I suggest rephrasing this to make "what should represent the size of the nodes" clearer, by stating that the options are centrality measures.

Suggestion:

The output will prompt you to choose from a dropdown list 'what should represent the size of the nodes', i.e. which centrality measure should determine the size of the nodes in your graph. Select it before you continue. These measures help reveal the structural importance of each node within the network.
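
To show readers what a centrality measure computes, a tiny worked example might help. The sketch below takes degree centrality as an example (the dropdown's actual options are not listed here): a node's number of neighbours divided by the number of other nodes, which is the normalisation NetworkX's `degree_centrality` uses for graphs with more than one node. The edge list and the helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical undirected co-occurrence edges.
edges = [
    ("Person A", "Person B"),
    ("Person A", "Person C"),
    ("Person C", "Person D"),
]


def degree_centrality(edges):
    """Return each node's degree divided by (number of nodes - 1)."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    n = len(neighbours)
    if n < 2:
        # With fewer than two nodes the ratio is undefined; this sketch returns 0.0.
        return {node: 0.0 for node in neighbours}
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbours.items()}


centrality = degree_centrality(edges)
```

Here "Person A" and "Person C" each have two neighbours out of three possible, so they would be drawn larger than "Person B" and "Person D".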



@Ferdaous-af Ferdaous-af Apr 17, 2025

Refresh the next cell after changing the value above.

This needs more clarity, perhaps something along the lines of "If you want to change the centrality measure above, re-run the next cell to update the visualisation".



@Ferdaous-af Ferdaous-af Apr 17, 2025

Line #24.    # Displaying the graph with a size mapped on degree and
# a color mapped on a categorical attribute of the nodes
Sigma(g, node_size=node_size, edge_size='count', clickable_edges=True, )

I believe node size depends on the user’s selection in the dropdown, not only on degree? Also, there is no color mapping defined in the code.

Suggestion: change the comment to "node size based on the selected centrality measure", add a comment on edge size such as "edge thickness based on co-occurrence count", and either drop the mention of color mapping or add a line of code that maps color to a node attribute.



@Ferdaous-af Ferdaous-af Apr 17, 2025

the Visualising Place Entities on Maps notebook
the Named Entity Recognition with impresso-pipelines notebook

The links here don't redirect to the mentioned notebooks.

The correct links I believe are:

which demonstrates how to visualise in a map mentions to places in the Impresso corpus.

Suggestion for a smoother rephrasing of this sentence:

which shows how to visualise mentions of places from the Impresso corpus on a map.



@caiocmello caiocmello linked an issue Apr 25, 2025 that may be closed by this pull request
@caiocmello caiocmello closed this Apr 25, 2025
@caiocmello caiocmello deleted the entity_network branch April 28, 2025 12:56
Development

Successfully merging this pull request may close these issues.

NB#07 Exploring Entity Co-occurrence Networks
2 participants