- Todo: Meetings Before?
- n/a
- This talk by Michael Garris on the history of MNIST was useful for contextualising the history of public datasets in my background section.
- Text extractor now very functional with some smart regexes.
- And yet, annotation is still needed.
- Will need to re-run experiments...
- Writing my report, I have no idea how I will reach the expected 15 pages.
n/a
- My deadline is in 3 days; am I doomed to a re-exam?
- I have ANOTHER concurrent project due on the same day. So it looks like I might need to hold off on one of them.
- Gave some motivation to Catthrine and Trine.
- A reminder from my boss at work that sometimes the Minimum Viable Product is the best product.
- I discovered a serious issue in my pipeline
- Some articles were not parsing.
- It turns out the text was being jumbled.
- Avoiding manual annotation.
- Solution: Incorporate manual annotation into methodology.
- The goal of automation now is to make annotation easier; however, anyone repeating or extending my work will also need to do some manual work afterwards.
- The last step of automation will now produce excerpts which contain the references to datasets. It is then the responsibility of the annotator to extract the final details into the dataset they produce.
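A minimal sketch of what that excerpt step could look like: grab a window of text around each keyword match for the annotator to read. The keyword and window size here are placeholders, not my actual settings.

```python
import re

def excerpts_around(text: str, keyword: str, window: int = 300) -> list[str]:
    """Return a snippet of surrounding text for every case-insensitive keyword match."""
    snippets = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start = max(match.start() - window, 0)
        end = min(match.end() + window, len(text))
        snippets.append(text[start:end])
    return snippets

# e.g. excerpts_around(paper_text, "Data Science Bowl") -> list of excerpts to hand to the annotator
```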
- How can I contact you next week if I need to ask anything?
- No one :|
- Going over past notes on Notion has been helpful. I was much better at keeping track of everything when I first started the project, and that has given me ample material for the report.
- Time Management: Part-time work and my other research project have been quite draining.
- Trying to minimize manual annotation.
- Results:
- Research observations have given us new problems to examine.
- Bibliometric Exclusion: The consequences of Datasets being used even when not cited in the bibliography.
- Broken Links: Frequency of ML Challenge Web Addresses being noted incorrectly or being unreachable.
- How do we demonstrate under-representation:
- Statistical power?
- F.x. the Power Inequality Effect definition from the "The Glass Ceiling in NLP" paper.
- Does it make sense to include raw PDFs or .txt files in the dataset?
- These are retrievable by my script, anyway.
- No one :|
- Christina's Thesis Notes have been helpful for figuring out what to do next.
- Data Schema for Datasets in Research Papers (rough dataclass sketch after the list):
- Can fully automate:
- Venue
- Title
- Dataset Origin (given URL)
- Can partially automate (on keyword matches, f.x. Kaggle):
- Citation Category (Footnote, Journal Publication or URL)
- Access (Public or Private)
- Bibliography Mentions
- Not so easy to automate:
- Dataset Identifier
- Multiple references; sometimes the full name, sometimes abbreviated.
- F.x. Usiigaci is not trivially associable with the title "Nucleus segmentation across imaging experiments: the 2018 Data Science Bowl".
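A rough sketch of the schema above as a Python dataclass; the field names and allowed values are my own placeholders and not settled yet.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetMention:
    """One dataset reference found in a research paper (schema sketch)."""
    # Fully automatable
    venue: str                                 # e.g. "MIDL 2021"
    paper_title: str
    dataset_origin: Optional[str]              # resolved from the URL, when one is present
    # Partially automatable (keyword matches, f.x. Kaggle)
    citation_category: Optional[str] = None    # "footnote" | "journal" | "url"
    access: Optional[str] = None               # "public" | "private"
    bibliography_mentions: int = 0
    # Not so easy to automate: manual annotation
    dataset_identifier: Optional[str] = None   # canonical name, e.g. "Usiigaci"
```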
- Notes
- When keyword match isn't present, annotation requires further context.
- No matter what I do, some manual annotation is needed.
- This hurts the scalability of my results.
- Behind on plotting, but I expect to catch up next week.
- I realized it might be a bit overkill to include every dataset used in a research paper.
- If I only include challenge datasets, I will only be able to compare them amongst each other:
- Footnotes vs. Bibliographic Citations. Broken, Rotten and Active Links.
- But if I should exclude any datasets, I don't know which ones to exclude.
- Can compare dataset hosts (GitHub vs. Institutional), frequency of challenge datasets against baseline.
- Results, results, results!
- If I complete the pipeline on the automatable data, then I can manually fill in the remaining fields and update figures in future weeks.
- This would require me to churn out most of the report within the next 2 weeks.
- What should my inclusion/exclusion criteria be on datasets from our collected research papers?
- My deadline is December 15, but I am starting to worry that I might need an extension.
- What are my options? Scaling down my scope or extending my deadline?
- No one :|
- Reading past notes.
- Some notes on challenge dataset hunting:
- In papers comparing multiple datasets there is often a dedicated section listing all datasets.
- Appendices matter! There may be additional datasets only mentioned after the main text.
- Challenge datasets are presented in one of three ways (rough regex sketch after the list):
- The X competition
- Ex. moco-cxr performance on the chexpert competition task pathologies.
- The X challenge
- Ex. the 2019 fastmri challenge, as opposed to "The goal of the conference is to foster excellent research that addresses the unique challenges and opportunities".
- URL identifier
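A minimal regex sketch for these three patterns. The exact expressions and the example challenge-hosting URLs are assumptions on my part, and the second pattern will still over-match generic phrasing like "the unique challenge".

```python
import re

# Rough patterns for the three presentation styles noted above.
CHALLENGE_PATTERNS = [
    # "the X competition", e.g. "the chexpert competition"
    re.compile(r"\bthe\s+((?:[\w-]+\s+){1,4})competition\b", re.IGNORECASE),
    # "the X challenge", e.g. "the 2019 fastmri challenge" (can still over-match generic uses)
    re.compile(r"\bthe\s+((?:[\w-]+\s+){1,4})challenge\b", re.IGNORECASE),
    # URL identifiers (the hosting sites here are just examples)
    re.compile(r"https?://(?:www\.)?(?:kaggle\.com|grand-challenge\.org)/\S+", re.IGNORECASE),
]

def find_challenge_mentions(text: str) -> list[str]:
    """Return every raw match for the three patterns in a paper's extracted text."""
    hits = []
    for pattern in CHALLENGE_PATTERNS:
        hits.extend(match.group(0) for match in pattern.finditer(text))
    return hits
```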
- Analysis Paralysis:
- I'm not sure what I want to annotate, or with what tool.
- How to scale it to 128 PDFs?
- Comparing non-challenge datasets to challenge datasets
- Annotation seems infeasible.
- Improving the bibliography checker.
- Retrieving the contest identifier by pattern-matching.
- Any other querying ideas?
- Good practices for annotation?
- What should I tag and why?
- My hypothesis is that challenge datasets are under-represented in bibliographies relative to how often they are actually used.
- Forgot to add checks on Title :_)
- No one :|
- I did attend the Digital Tech Summit though, which I'd love to talk about this week.
- Earlier this month Bethany helped me with some suggestions. I also gave her some ideas, but they came a bit late into her project.
- I realized that it's a good idea to check for keywords both in the bibliography and before the bibliography (sketch below):
- I now actually have an objective way to demonstrate the representation of online datasets (through the bibliography).
- Now, do I build a hypothesis that this is under-representation?
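A minimal sketch of that check, splitting the extracted text at the first references heading. The heading regex is an assumption and will need tuning per venue.

```python
import re

def keyword_before_and_in_bib(text: str, keyword: str) -> tuple[bool, bool]:
    """Return (found before the bibliography, found inside the bibliography)."""
    parts = re.split(r"\n\s*(?:References|Bibliography)\s*\n", text,
                     maxsplit=1, flags=re.IGNORECASE)
    body = parts[0]
    bibliography = parts[1] if len(parts) > 1 else ""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return bool(pattern.search(body)), bool(pattern.search(bibliography))

# e.g. keyword_before_and_in_bib(paper_text, "kaggle") -> (True, False) would flag a dataset
# used in the body but never cited in the bibliography.
```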
- Misc. refactoring
- Main dataset is venues/titles only
- Will later enable output for alternative match rules.
- The context-sensitivity of the word "challenge" makes it difficult as a keyword:
- "Challenges in Machine Learning" vs. "Machine Learning Challenges"
- Improving the bibliography checker.
- Querying even more venues!
- Can now do some basic quantitative analysis.
- Any other querying ideas?
- What are some meaningful comparisons? (rough counting sketch after the list)
- Kaggle mentioned in body vs. Bibliography?
- Competition mentioned in title + body vs. Bibliography?
- Number of times competition is mentioned in title + body vs. Bibliography?
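A rough sketch of how those counts could be tallied per paper, so that one match is not weighted the same as a hundred; the field names are placeholders.

```python
import re

def mention_counts(title: str, body: str, bibliography: str, keyword: str) -> dict[str, int]:
    """Count case-insensitive keyword occurrences in each field of a paper."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return {
        "title": len(pattern.findall(title)),
        "body": len(pattern.findall(body)),
        "bibliography": len(pattern.findall(bibliography)),
    }

# e.g. mention_counts(title, body, bib, "competition") and mention_counts(title, body, bib, "Kaggle")
```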
- Will have to do some tagging (e.g. if the name is Data Science Bowl instead of Kaggle).
- Workaround ideas?
- Forgot to add checks on Title :_)
- Just sharing some ideas with the MentalHealth project members over MS Teams.
- Bethany's session on Surveys and Zotero from last week.
- Scrapy docs for webcrawling.
- Coded a working pipeline!
- Producing a new dataset checking name/keyword mentions of Data Science Bowl 2017 in MIDL 2021 papers.
- DOIs are missing. They were missing from the proceedings and I don’t yet have a strategy to scan for them.
- I'm not filtering, for example, the preface of the proceedings. It was also downloaded and saved as Proceedings.pdf as part of the dataset.
- Match frequency isn't taken into consideration. One match is just as good as 100 in this scheme.
- Also, I have no way to check for homographic false positives. Hence the DSB 2017 vs. 2018 confusion.
As far as future work goes, searching for kaggle was the most revealing thing I came up with. Once we have the collection of publications from a particular venue (or set of venues), we can find references to datasets that are missed in most relevance-based indexing systems. Then perhaps, instead of focusing on DSB alone, we might try querying the word kaggle to extract a family of interrelated datasets across publications.
- Should we increase the number of venues?
- Which venues?
- What years?
- Is the kaggle/competition querying for discovery a good idea?
I am also going to a conference next week (Belgium, from the 12th to the 14th).
It's not much, but Amelia & Olivia seem to be interested in the (creative, critical & feminist) methods of ETHOS lab, which I told them I joined recently.
- The pdfminer.six project gave me a decent way to bulk-extract text from PDFs (rough sketch of the extraction step below).
- Thu Vu's https://github.com/thu-vu92/the_witcher_network project was some nice inspo.
- Text extraction from PDFs
- My previous tool, https://github.com/kingaling/pydf2json was outdated and too verbose.
- Recovered thanks to pdfminer.six.
- Coding :)
- Determining some nice keywords.
- No quantitative analysis yet, only direct inspection. Maybe relevant in the future, though.
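A minimal sketch of the bulk extraction step with pdfminer.six; the directory layout is made up and the real script differs.

```python
from pathlib import Path
from pdfminer.high_level import extract_text  # pdfminer.six high-level API

def extract_all(pdf_dir: str, out_dir: str) -> None:
    """Dump plain text for every PDF in pdf_dir into matching .txt files in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        text = extract_text(str(pdf_path))
        (out / f"{pdf_path.stem}.txt").write_text(text, encoding="utf-8")

# e.g. extract_all("papers/midl2021", "texts/midl2021")
```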
- Nothing for this week :)
Maybe I should meet with Bethany some time next week to exchange ideas.
n/a
Some resources on semantic search for flexible keyword queries (a rough sketch of the idea follows the list):
- Hugging Face: https://www.youtube.com/watch?v=OATCgQtNX2o
- Amir Shamsi on Gensim: https://stackoverflow.com/a/71828372/2089784
- Trey Grainger on Data Philosophy and Solr: https://www.youtube.com/watch?v=4fMZnunTRF8
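One possible direction from these resources: a minimal sketch with sentence-transformers. The model name is just a common default, not something I have settled on.

```python
from sentence_transformers import SentenceTransformer, util

# Rank candidate sentences from a paper against a flexible query by cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any small sentence encoder works here

def rank_sentences(sentences: list[str], query: str, top_k: int = 5):
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
    query_embedding = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, sentence_embeddings, top_k=top_k)[0]
    return [(sentences[hit["corpus_id"]], float(hit["score"])) for hit in hits]

# e.g. rank_sentences(paper_sentences, "evaluated on a public challenge dataset")
```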
- Collected MIDL 2018 papers
- Annotated for DOI, if available
- Annotated for keyword matches
- Did not find any matches for Data Science Bowl 2017.
- Precise keyword searches can miss (false-negative matches).
- Figure out PDF extraction pipeline. Searching for datasets in conference papers is like searching for a needle in a haystack, so we need to be able to process more PDFs faster.
- Is it a good idea to stick to MIDL (future years?)
- MICCAI?
- Try different keywords maybe