There seems to be an issue in the format of the value of some sports #1
Comments
Hi @reallyyy. All the data is what was returned from the OBS server at the time of scraping; no processing or changes were made to the data at all. There may well be inaccuracies, and it's concerning if there are. I have not audited the data or carried out sanity checks like you seem to have done. Where are you finding this information? The way events are laid out is a little confusing; from memory, there are many different data types for different sections of an event: SubEventUnit, Stage, Result, Phase, Event, and EventUnit. Is it possible that the data you're finding is only partial, perhaps for one section of the measurements taken (i.e. the first 50 metres in a swimming race could be timed separately, in different legs)? The USDF messages may also be useful for troubleshooting.
Where are you finding this information?
As you can see, only the last row is wrong.
Ah, I see what's happening. For the men's marathon (event unit ID I've attached a CSV of this subset that should help you understand what's going wrong here immediately: mensmarathon.csv. I think what's being missed here is that a Result is not final: there are unofficial, partial, and final official results. To quote from the ODF spec (the data in this repo is, for the most part, a parsed form of the ODF spec, although not always identical in structure)
So in this case:
For a marathon, there would be 10 intermediate frame results sent at different checkpoints. For 100m swimming, a frame would be sent for each lap, which matches up with the data that you were seeing.
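If it helps, here is a minimal sketch of how the intermediate frames could be collapsed down to one finishing time per athlete. It assumes every `value` is a cumulative time already converted to seconds, so the largest value per participant is the last frame sent. The row layout and the sample numbers are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical rows: (participantName, cumulative time in seconds).
# These values are invented for illustration only.
rows = [
    ("Swimmer A", 31.0),  # frame sent after the first 50 m
    ("Swimmer A", 65.5),  # frame sent at the finish
    ("Swimmer B", 32.4),
    ("Swimmer B", 66.1),
]


def final_times(rows):
    """Keep the largest cumulative time seen for each participant.

    Assumes every value is a cumulative time, so the largest one
    corresponds to the last frame that was sent.
    """
    best = defaultdict(float)
    for name, seconds in rows:
        best[name] = max(best[name], seconds)
    return dict(best)


print(final_times(rows))
```

This would still report a split as the "final" time for anyone who did not finish, so if the data carries a field that marks a result as intermediate or official, filtering on that field is the safer route.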
Also, I have data from the 2020 Paralympics and the 2022 Beijing Olympics/Paralympics in the same format that I never got around to uploading to Kaggle, if you'd like it.
Wow, thank you so much for the fast reply and for spending time exploring the issue.
For example:
Here is the data for "Women's 100m Breaststroke"
Some participants' times are as short as roughly 30 seconds. For context, the value according to Google is around 1:04 to 1:06, depending on the participant in question.
| eventTile | value | participantName |
| --- | --- | --- |
| Heat 2 | 31.77 | Dalma Sebestyen |
| Heat 2 | 31.86 | Remedy Rule |
| Heat 3 | 32.42 | Erin Gallagher |
| Heat 4 | 36.08 | Benedetta Pilato |
| Heat 2 | 40.94 | Claudia Verdino |
The values for "Men's Marathon" are also questionable. Here are some examples:
| eventTile | value | participantName |
| --- | --- | --- |
| Men's Marathon Final | 10:09 | Cameron Levins |
| Men's Marathon Final | 10:52 | Ivan Zarco Alvarez |
| Men's Marathon Final | 11:28 | Yuma Hattori |
| Men's Marathon Final | 12:07 | Christian Pacheco |
| Men's Marathon Final | 12:07 | Hassan Chahdi |
| Men's Marathon Final | 15:36 | Stephen Scullion |
| Men's Marathon Final | 15:44 | Mykola Nyzhnyk |
| Men's Marathon Final | 15:48 | Lemawork Ketema |
| Men's Marathon Final | 16:12 | Oleksandr Sitkovskiy |
The names of the participants are right, but the values are wrong. You can't run a marathon in 10 minutes, and 10 hours is too long. The average time is about 2 to 3 hours or so.
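The check described above can be sketched roughly like this. A value such as `10:09` is ambiguous, so the sketch reads it both as minutes:seconds and as hours:minutes and tests each reading against a plausible window. The helper names and the 2 to 3 hour window are assumptions for illustration, not part of the dataset or its tooling.

```python
def candidate_seconds(value):
    """Return every reading of a time string, in seconds.

    "31.77" is read as seconds. "10:09" is ambiguous, so it is read both
    as minutes:seconds and as hours:minutes. "2:08:38" is read as
    hours:minutes:seconds.
    """
    parts = [float(p) for p in value.split(":")]
    if len(parts) == 1:
        return [parts[0]]
    if len(parts) == 2:
        a, b = parts
        return [a * 60 + b, a * 3600 + b * 60]
    h, m, s = parts
    return [h * 3600 + m * 60 + s]


def is_plausible(value, low_s, high_s):
    """True if any reading of the value falls inside [low_s, high_s]."""
    return any(low_s <= s <= high_s for s in candidate_seconds(value))


# Assumed window for a marathon finish, based on the rough
# "2 to 3 hours" estimate above.
MARATHON = (2 * 3600, 3 * 3600)

for value in ["10:09", "15:36", "16:12"]:
    print(value, is_plausible(value, *MARATHON))
```

None of the three sample values fits the window under either reading, which is why they look wrong as finishing times.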