Entity network #54
@@ -132,17 +132,8 @@ | |||
}, |
We define people's connections based on whether they occur in the same content item
Small rewording suggestion for clarity:
We define people's connections based on whether their names are mentioned in the same content item.
Please note that when a newspaper does not have segmentation (OLR)
Perhaps add here what OLR is and a link to the FAQ section on it: https://impresso-project.ch/app/faq#What-OCR
Suggestion:
Please note that when a newspaper lacks segmentation (OLR – Optical Layout Recognition), content items for this title correspond to entire pages.
To unveil the reasons why they occur together, further analysis using different methods is necessary.
Suggested clarification:
Understanding the reason behind a co-occurrence typically requires further contextual or qualitative analysis.
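To make the co-occurrence definition above concrete, here is a minimal sketch of how pairs of names sharing a content item could be counted. The sample data, the item identifiers, and the `cooccurrence_counts` helper are all hypothetical and only illustrate the idea; the notebook itself may build its edges differently.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample: content item id -> person names mentioned in it.
mentions_by_item = {
    "item-1": ["Alexander Dubček", "Leonid Brezhnev", "Ludvík Svoboda"],
    "item-2": ["Alexander Dubček", "Leonid Brezhnev"],
    "item-3": ["Ludvík Svoboda"],
}


def cooccurrence_counts(mentions):
    """Count how often each unordered pair of names shares a content item."""
    counts = Counter()
    for names in mentions.values():
        # sorted(set(...)) drops repeated mentions and fixes the pair order.
        for pair in combinations(sorted(set(names)), 2):
            counts[pair] += 1
    return counts


edge_counts = cooccurrence_counts(mentions_by_item)
for (source, target), count in sorted(edge_counts.items()):
    print(f"{source} - {target}: {count}")
```

A count only says that two names appear in the same item. For titles without OLR, where an item is a whole page, the same count reflects a much looser relation.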
@@ -132,17 +132,8 @@ | |||
}, |
I think it would be clearer to harmonize the structure across the list (suggestion below). Also, it could be useful to clarify that “persons” refers to named entities extracted from content items, and to specify which formats are meant by “different formats” when exporting the network graph.
Suggested rewording:
What will you learn?
By completing this notebook, you will learn how to:
Retrieve a list of named persons mentioned in content items for a given query;
Transform this list of entities into a dataframe suitable for generating co-occurrence network graphs;
Create and display an interactive network graph to visualise connections between persons mentioned together in Impresso content;
Export the resulting dataframes as CSV files to support reproducibility;
Save the network graph in different formats (png, svg, gexf, and json) for further analysis.
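As an illustration of the export steps in the list above, the following sketch writes an edge list to CSV and a simple node-link structure to JSON using only the standard library. The edge data, file names, JSON layout, and the `export_edges` helper are assumptions made for this example; the notebook's own export code (and the png, svg, and gexf outputs) is not reproduced here.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical edge list: (source, target, co-occurrence count).
edges = [
    ("Person A", "Person B", 3),
    ("Person A", "Person C", 1),
]


def export_edges(edges, out_dir):
    """Write the edges as CSV and as a node-link JSON file; return both paths."""
    out_dir = Path(out_dir)
    csv_path = out_dir / "edges.csv"
    json_path = out_dir / "graph.json"

    # CSV: one row per edge, with a header row for reproducibility.
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["source", "target", "count"])
        writer.writerows(edges)

    # JSON: the unique node names plus the weighted links between them.
    nodes = sorted({name for s, t, _ in edges for name in (s, t)})
    graph = {
        "nodes": [{"id": name} for name in nodes],
        "links": [{"source": s, "target": t, "count": c} for s, t, c in edges],
    }
    json_path.write_text(json.dumps(graph, indent=2), encoding="utf-8")
    return csv_path, json_path


out_dir = tempfile.mkdtemp()
csv_path, json_path = export_edges(edges, out_dir)
print(csv_path, json_path)
```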
@@ -132,17 +132,8 @@ | |||
}, |
If it's okay to make this part a little longer, it would be helpful to briefly describe what each link offers and perhaps also add links to the documentation for the Impresso API and Python library, NetworkX, and ipysigma. It would also be good to swap the order of the two suggested resources, as "Exploring and Analyzing etc." assumes familiarity with "From Hermeneutics to Data etc."
Suggestion:
Useful resources
If you’d like to go deeper into network analysis or its use in historical research, the following resources are recommended:
From Hermeneutics to Data to Networks: Data Extraction and Network Visualization of Historical Sources: A conceptual and practical guide to extracting structured data from historical sources and creating meaningful network visualizations.
Exploring and Analyzing Network Data with Python: An introduction to working with the NetworkX package and drawing conclusions from network metrics when working with humanities data.
Additional references:
- Impresso Public API documentation
Also, a suggestion of other resources:
Introduction to Social Network Analysis: YouTube tutorials by Martin Grandjean reviewing the main concepts of social network analysis and highlighting the challenges that arise when analyzing relational historical objects.
Demystifying Networks, Parts I & II by Scott B. Weingart: an older but still interesting resource with a simple introduction to networks, including concept definitions and key vocabulary.
The Six Degrees of Francis Bacon Project: a DH project that reconstructs the social network of early modern intellectual life in Britain and includes publications and methodology.
Historical Network Research Community: A hub for scholars working at the intersection of history and network analysis. Offers conference proceedings, reading lists, and tutorials.
@@ -132,17 +132,8 @@ | |||
}, |
The sentence “all person entities mentioned in all articles that talk about the Prague Spring” is a bit misleading and could be read as exhaustive. In reality, if I'm not mistaken, the query in the code returns the top 100 most frequently mentioned person entities associated with that query, not all possible results.
Suggested rephrasing:
First, we retrieve the top 100 most frequently mentioned person entities in all articles that talk about the Prague Spring, using the search facets method from the Impresso Python library.
Also, a brief explanation of what the "search facets method" is would be helpful.
@@ -132,17 +132,8 @@ | |||
}, |
I think an explanation of the output of this query would help avoid confusion about the results.
Actual results of the query: "Contains 100 items (0 - 100) of 2355 total items". The word "item" can be confused with the "content item" explained above, but here the results are named entities, not content items.
Suggestion:
The result is a list of the 100 most frequently mentioned person entities, where each entry includes:
a unique identifier (value),
the number of times the person is mentioned (count),
and the display name (label).
Note: these 100 entries are the most frequent out of a total of 2,355 persons mentioned in all matched content items.
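A small sketch may help readers see how entries of this shape can be used. It assumes each entry behaves like a dictionary with the `value`, `count`, and `label` fields described above; the sample identifiers, names, and counts are invented, and the Impresso library may return a richer object that needs converting first.

```python
# Hypothetical facet entries, shaped like the fields described above.
facet_entries = [
    {"value": "id-person-b", "count": 12, "label": "Person B"},
    {"value": "id-person-a", "count": 40, "label": "Person A"},
    {"value": "id-person-c", "count": 7, "label": "Person C"},
]

# Rank the persons from most to least frequently mentioned.
ranked = sorted(facet_entries, key=lambda entry: entry["count"], reverse=True)

# Map each unique identifier to its display name, e.g. for labelling nodes.
labels_by_id = {entry["value"]: entry["label"] for entry in ranked}

for entry in ranked:
    print(f'{entry["label"]} ({entry["value"]}): {entry["count"]} mentions')
```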
@@ -132,17 +132,8 @@ | |||
}, |
If running in Colab - activate custom widgets to allow ipysigma
to render the graph.
It would be helpful to add more detail here on how to activate custom widgets; I'm not sure how one does this in Colab. In any case, the code rendered correctly.
@@ -132,17 +132,8 @@ | |||
}, |
The output will prompt you to choose 'what should represent the size of the nodes' in your graph. Select it before you continue.
I suggest rephrasing this to make "what should represent the size of the nodes" clearer, by explaining that the options are centrality measures.
Suggestion:
The output will prompt you to choose from a dropdown list 'what should represent the size of the nodes', i.e. which centrality measure should determine the size of the nodes in your graph. Select it before you continue. These measures help reveal the structural importance of each node within the network.
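To show what "a centrality measure determines the node size" can mean in practice, here is a minimal pure-Python sketch. The edge list, the two measures offered, and the `selected` variable standing in for the dropdown value are assumptions for illustration; the notebook's actual dropdown options are not listed in this thread and may differ.

```python
from collections import defaultdict

# Hypothetical weighted edges: (source, target, co-occurrence count).
edges = [
    ("Person A", "Person B", 3),
    ("Person A", "Person C", 1),
    ("Person B", "Person C", 2),
    ("Person C", "Person D", 1),
]


def degree(edges):
    """Number of distinct neighbours of each node."""
    neighbours = defaultdict(set)
    for source, target, _ in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    return {node: len(others) for node, others in neighbours.items()}


def weighted_degree(edges):
    """Sum of co-occurrence counts over all edges touching each node."""
    totals = defaultdict(int)
    for source, target, count in edges:
        totals[source] += count
        totals[target] += count
    return dict(totals)


# Stand-in for the dropdown: the chosen name selects the measure to apply.
MEASURES = {"degree": degree, "weighted degree": weighted_degree}
selected = "degree"
node_size = MEASURES[selected](edges)
print(node_size)
```

Changing `selected` and recomputing `node_size` mirrors re-running the visualisation cell after picking a different measure.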
@@ -132,17 +132,8 @@ | |||
}, |
Refresh the next cell after changing the value above.
More clarity is needed here, perhaps something along the lines of "If you change the centrality measure above, re-run the next cell to update the visualisation".
@@ -132,17 +132,8 @@ | |||
}, |
Line #24.
# Displaying the graph with a size mapped on degree and
# a color mapped on a categorical attribute of the nodes
Sigma(g, node_size=node_size, edge_size='count', clickable_edges=True, )
I believe node size depends on the user's selection in the dropdown, not only on degree. Also, there is no color mapping defined in the code.
Suggestion: change the comment to "node size based on the selected centrality measure", add a comment on the edge size such as "edge thickness based on co-occurrence count", and either drop the mention of color mapping or add a line to the code that maps color to a node attribute.
@@ -132,17 +132,8 @@ | |||
}, |
the Visualising Place Entities on Maps notebook
the Named Entity Recognition with impresso-pipelines notebook
The links here don't redirect to the mentioned notebooks.
The correct links I believe are:
- Visualising Place Entities on Maps: https://github.com/impresso/impresso-datalab-notebooks/blob/main/explore-vis/place-entities_map.ipynb
- Named Entity Recognition with impresso-pipelines: https://github.com/impresso/impresso-datalab-notebooks/blob/main/annotate/NE-processing_ImpressoAPI.ipynb
which demonstrates how to visualise in a map mentions to places in the Impresso corpus.
Suggestion for a smoother rephrasing of this sentence:
which shows how to visualise mentions of places from the Impresso corpus on a map.
Hi @Ferdaous-af,
Can I ask you please to review this notebook? Please, refer to the reviewer's guidelines below.
Notebook file: https://github.com/impresso/impresso-datalab-notebooks/blob/main/explore-vis/entity_network.ipynb
Reviewer's guidelines: