After the closure of the Twitter Academic API on 24 June 2023, this collection has stopped. If you cannot rehydrate tweets because of the API closure, feel free to contact us in the issues or via email.
Twitter (and maybe, later, other social media) data around the invasion of Ukraine in February 2022
Data reaches back to 1 February 2022 and was updated daily until the collection stopped.
The updates were automated and usually available every afternoon for the preceding day. Let us know if there are any problems!
Please cite as:
Münch, F. V., & Kessling, P. (2022, March 1). ukraine_twitter_data. https://doi.org/10.17605/OSF.IO/RTQXN
Other citation styles can be found on OSF:
https://doi.org/10.17605/OSF.IO/RTQXN
We provide data on all tweets that contain the hashtag or word 'ukraine' (see the queries below) in several languages since 1 February 2022, as returned by the Twitter Academic API.
Furthermore, data on tweets containing the term 'bucha' is available in the same languages since 1 March 2022.
Collections:
| language | query | data nearly complete from |
|---|---|---|
| English | #ukraine AND lang:en | 1 February 2022 |
| German | ukraine AND lang:de | 1 February 2022 |
| Russian | Украина AND lang:ru | 1 February 2022 |
| Ukrainian | Україна AND lang:uk | 1 February 2022 |
| English | bucha AND lang:en | 1 March 2022 |
| German | (bucha OR butscha) AND lang:de | 1 March 2022 |
| Russian | (Бу́ча OR bucha) AND lang:ru | 1 March 2022 |
| Ukrainian | (Бу́ча OR bucha) AND lang:uk | 1 March 2022 |
To comply with the Twitter TOS and to protect people who have decided to delete their tweets, we share only tweet IDs, creation dates, and metadata about our collection methods and dates.
If you are eligible for Twitter's Academic API access and want to add further languages, let us know and we will be happy to support you.
We collect with the focalevents tool by @ryanjgallagher, using our Academic Twitter API access.
We query tweets that contain the keywords stated above and filter for languages detected by Twitter.
We started collecting tweets on 24 February 2022 and backfilled tweets back to 1 February.
The data itself indicates whether an ID was collected via the search endpoint ('backfilled') or streamed. Backfilled data will not contain tweets that were deleted before the collection time.
We share the data in language-specific folders.
The filenames indicate the date of the tweets.
Furthermore, every file is available in two CSV versions:
- one with the IDs only, for easy hydration with the tools mentioned below;
- one with metadata on how the data was collected in every row, for you to filter it to your needs (see the sketch below).
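For illustration, a minimal filtering sketch with pandas. The file name and the `backfill` column are assumptions, not the repository's documented schema, so check them against the actual CSV headers before use:

```python
# Filtering sketch -- file name and column names are assumptions.
import pandas as pd

# Read IDs as strings: tweet IDs lose precision if parsed as numbers.
df = pd.read_csv("en/2022-02-24.csv", dtype={"id": str})

# Keep only rows that were streamed live (assuming a boolean 'backfill' flag).
streamed = df[~df["backfill"]]

# Write an ID-only file, ready for hydration (see below).
streamed["id"].to_csv("streamed_ids.txt", index=False, header=False)
```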
Tweet counts per collection are shown in the figures in the repository; these will be updated periodically.
Via the Twitter API, e.g. with twarc or, if you prefer a graphical user interface, with the Hydrator by @DocNow.
We provide the ID-only files for exactly this purpose.
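As an illustration, a minimal hydration sketch with twarc's Python client. The bearer token and file name are placeholders, and, as noted at the top, the Academic API closure may make this impossible now:

```python
# Hydration sketch using twarc (v2). Token and file name are placeholders.
from twarc.client2 import Twarc2

client = Twarc2(bearer_token="YOUR_BEARER_TOKEN")

with open("streamed_ids.txt") as f:
    tweet_ids = [line.strip() for line in f if line.strip()]

# tweet_lookup batches the IDs and yields one API response page at a time.
for page in client.tweet_lookup(tweet_ids):
    for tweet in page.get("data", []):
        print(tweet["id"], tweet["text"][:80])
```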
If you need any data that is not available this way, we might be able to help you, pending an ethical evaluation of your research goals.
Due to connection and other problems, there can and always will be gaps in such a large-scale data collection. We are in the process of meticulously backfilling any gaps that we discover.
Here we compare our data with the estimated counts returned by the API (number of collected tweets per hour divided by the Twitter API count estimates).
We aim for 95% of Twitter's hourly count estimates. As you can see, this is not always possible, most likely due to tweet deletions, account bans, account protections, or wrong estimates by Twitter.
In the English and Bucha datasets, our count for one hour is 16-18 times higher than the Twitter estimate we got. We will take a closer look at that as soon as possible, but more is usually better than less. Maybe it's a glitch caused by daylight saving time (even though we should then see it in the other languages as well) or the 'spikiness' of the event 🤷. Most hourly counts reach >= 95%; fewer than 10 hours fall between 90% and 95% of the estimated count.
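To make the metric concrete, here is a small sketch of the coverage computation. The column names and numbers are made up purely for illustration; this is not our actual monitoring code:

```python
# Coverage = collected tweets per hour / Twitter's hourly count estimate.
import pandas as pd

counts = pd.DataFrame({
    "hour": pd.date_range("2022-02-24", periods=4, freq="h"),
    "collected": [9800, 9720, 9950, 8900],
    "estimated": [10000, 10000, 10000, 10000],
})
counts["coverage"] = counts["collected"] / counts["estimated"]

# Hours that miss the 95% target described above.
print(counts[counts["coverage"] < 0.95])
```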
Because we publish tweet IDs only, we comply with the Twitter TOS.
Given the public interest in this data, and that it will be indispensable for presenting research findings on events in contemporary history, we also comply with the GDPR, especially its German implementation, the DSGVO (Art. 6 (1) f GDPR in connection with Art. 85 GDPR, § 27 BDSG).
From an ethical standpoint, we do not share any data the conflict parties would not have collected or be able to collect anyway.
As we share only Tweet IDs, accounts can protect themselves at any time.
We think sharing this collection contributes to the cause of open science.
Furthermore, while much of the information contained in the tweets will be dis- and misinformation, this dataset at least provides transparency by enabling researchers and OSINT experts to analyse it independently, which is in the public interest of democratic states.
However, we still ask you to assess your respective use of this data with your ethical review board, and/or with our ethical and legal guidance questionnaire SOCRATES.
You can use Git's sparse checkout feature: https://dev.to/kiwicopple/quick-tip-clone-a-single-folder-from-github-44h6
If you are just interested in single days, the easiest way is to download single files manually in the GitHub interface, or automated via their URL with curl/wget; a sketch follows below.
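For example, a download sketch in Python. The raw URL is hypothetical, so adapt the repository path and file name to the actual folder structure:

```python
# Download one day's ID file without cloning the repository.
# The URL below is a hypothetical example, not the real path.
import requests

url = ("https://raw.githubusercontent.com/<user>/<repo>/main/"
       "en/2022-02-24.csv")
resp = requests.get(url, timeout=30)
resp.raise_for_status()

with open("2022-02-24.csv", "wb") as f:
    f.write(resp.content)
```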
This data is mainly limited by the fidelity of the Twitter API and data degradation over time:
- Tweet IDs of Tweets that have been deleted, suspended, hidden or protected before the collection time will not be in the dataset
- Tweets that have been deleted or otherwise depublished after collection time will not be returned during hydration
- The collection depends on Twitter's language detection, which is known to be far from perfect, but good enough for large-scale assessments:
  - Tweets that have not been detected as being in one of the collected languages will not be in the collection.
  - Also, there will be mislabeled tweets (e.g. Dutch as German, or maybe even Ukrainian as Russian) in the collection.
- Tweets that do not contain any text (e.g. links or pictures only) might be missing from the collection.
Furthermore, while we have backfilled all gaps that occurred in the data so far, there might be gaps in the future due to system failures or errors in our code or the software we use. We plan to automatically publish Twitter's count estimates alongside the data in the near future so that researchers can double-check for themselves. In the meantime, researchers with access to the Twitter Academic API can query the count endpoints themselves and compare the counts (see the sketch below). Please let us know in the Issues if you see any major deviations.
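As an illustration, a sketch of such a comparison using twarc's Python client against the full-archive counts endpoint. This again assumes working Academic API credentials, which, as noted at the top, may no longer be obtainable:

```python
# Pull hourly count estimates for one collection query via the Academic API
# counts endpoint, using twarc. The bearer token is a placeholder.
import datetime
from twarc.client2 import Twarc2

client = Twarc2(bearer_token="YOUR_BEARER_TOKEN")

start = datetime.datetime(2022, 2, 24, tzinfo=datetime.timezone.utc)
end = datetime.datetime(2022, 2, 25, tzinfo=datetime.timezone.utc)

# counts_all yields response pages; each bucket covers one hour.
for page in client.counts_all("#ukraine lang:en", start_time=start,
                              end_time=end, granularity="hour"):
    for bucket in page.get("data", []):
        print(bucket["start"], bucket["tweet_count"])
```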
We do not guarantee any ongoing collection, mainly because Twitter limits the number of tweets we can collect per month. So please do not plan around anything beyond what is already here, e.g. in project plans or grant proposals. (Or approach us and we will help you apply for Academic access to the Twitter API yourself and set up your own collection.)