Storing treebanks? #96
Replies: 9 comments
-
Firstly, do not compress or tar the files if they are to be tracked in Git. Git compresses its objects itself, and when the files are uncompressed text it can compute deltas between versions, so the history footprint doesn't balloon with each change. My preference is also not to store full profiles directly in the repositories, but only skeletons (possibly in a second repository, both to avoid storage limits and to let people use the data without having to get the grammar). The filled-out profiles could be attached (tarred and compressed as needed) to a release on GitHub, just as you would attach binaries to releases of compiled software. Alternatively, these could be stored in Git LFS. Full profiles, especially tarred/compressed ones, should not be stored in the grammar repositories. It's bad practice, and GitHub may even refuse to accept such large files.
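To make the skeleton/full-profile split concrete, here is a minimal Python sketch of deriving a skeleton directory from a filled profile. It assumes a skeleton needs only the `relations` schema file and the `item` file; that file list and the function name `make_skeleton` are illustrative assumptions, not an established tool, so adjust them to whatever your skeletons actually contain.

```python
import shutil
from pathlib import Path

# Assumption: a skeleton needs only the schema file and the input items.
SKELETON_FILES = ("relations", "item")


def make_skeleton(profile_dir, skeleton_dir, keep=SKELETON_FILES):
    """Copy only the files named in `keep` from a profile into a new directory.

    Returns the names that were actually copied, in the order of `keep`.
    """
    src = Path(profile_dir)
    dst = Path(skeleton_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in keep:
        path = src / name
        if path.is_file():  # silently skip files the profile does not have
            shutil.copy2(path, dst / name)
            copied.append(name)
    return copied
```

The resulting directory holds small, uncompressed text files, which is the kind of content Git can track and delta efficiently; the full profile stays outside the repository.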
-
Great, thanks for the answer and the info! I am not so much looking for info on what not to do, however, as for some advice on what to do :). Suppose I have a treebank for the SRG; where do I put it? This is super important, because some of the treebanks (those associated with the SRG in particular) have already been lost because they weren't checked in anywhere... I am aware that GitHub doesn't like large files. Git LFS sounds like a possible solution, but what I am asking is: do we have a strategy for dealing with this, e.g. to prevent losing treebanks in the future?
-
Regarding my suggestion about what not to do, I framed it in this way because it is in contrast to what has been done with the Subversion repositories, so I wanted to emphasize what we should do differently. If I frame it more positively, it might be:
Of the options for (2) above, I prefer attaching them to releases, for a few reasons:
You mean like you have Tibidabo or some other treebank for the SRG already parsed for some version of the grammar? As mentioned above, you could create a new repository, like |
-
Regarding full profiles, particularly the large |
-
No, I mean releases on GitHub, which are Git tags, yes, but with some additional information. E.g., you can see some Zhong releases here: https://github.com/delph-in/zhong/releases. You'll notice that they have "Assets", which are archives of the source code as it was when the release was created (these assets are created automatically). When you create a release (or when you edit it later), you can drag and drop more assets onto that page. Or you can attach them with the GitHub CLI, as Francis and I do for the OMW (here is the relevant code).
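As a rough illustration of the CLI route (this is not the OMW code mentioned above), the Python sketch below tars and compresses a profile directory and builds the matching `gh release upload` command. The function names, the tag, and the repository shown in the usage comments are hypothetical, and the release is assumed to exist already.

```python
import tarfile
from pathlib import Path


def archive_profile(profile_dir, out_dir="."):
    """Create <profile name>.tar.gz in out_dir and return its path."""
    profile = Path(profile_dir)
    archive = Path(out_dir) / f"{profile.name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths in the archive relative to the profile name
        tar.add(profile, arcname=profile.name)
    return archive


def upload_command(tag, archive, repo=None):
    """Build the argument list for `gh release upload <tag> <file>`."""
    cmd = ["gh", "release", "upload", tag, str(archive)]
    if repo:
        cmd += ["--repo", repo]  # OWNER/REPO; otherwise gh uses the current repo
    return cmd


# Hypothetical usage (requires an authenticated GitHub CLI):
#   import subprocess
#   archive = archive_profile("tsdb/gold/tibidabo", out_dir="dist")
#   subprocess.run(upload_command("v1.0", archive), check=True)
```

Building the command as a list rather than a shell string avoids quoting problems with unusual file names.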
-
Hi,
we normally store just the thinned profile, that is, only the selected
trees and the derivations used to choose them. In that case, the results
should not be so big. If you have the grammar you can rebuild the forest,
so there is no point in keeping more.
On Wed, 14 Sept 2022 at 16:09, Olga Zamaraeva wrote:
What's the current plan for storing the tsdb folders associated with
grammars? I see that jacy has them checked in; I am assuming they aren't
large. For zhong, I am not seeing the tsdb folder checked in. Suppose I
would like to check in a treebank for which the result file is larger
than 50MB. What do I do?
--
Francis Bond <https://fcbond.github.io/>
-
@goodmami so the "Assets" option means the file size limit is no longer a problem, do I understand correctly?
-
That is mostly correct. See here: https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-large-files-on-github. In particular, this part is relevant:
NB: they say "binary" but text files are fine, too.
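For checking a profile before committing, a small Python sketch like the following can flag the files that would be a problem inside a repository. The thresholds are assumptions based on my reading of the GitHub page linked above (a warning above 50 MiB and a hard block above 100 MiB per file); please verify them against the current documentation. The function names are illustrative.

```python
from pathlib import Path

MIB = 1024 * 1024
WARN_BYTES = 50 * MIB    # assumed: pushes above this size trigger a warning
BLOCK_BYTES = 100 * MIB  # assumed: pushes above this size are rejected


def classify(size, warn=WARN_BYTES, block=BLOCK_BYTES):
    """Return 'ok', 'warning', or 'blocked' for a file size in bytes."""
    if size > block:
        return "blocked"
    if size > warn:
        return "warning"
    return "ok"


def flag_large_files(root, warn=WARN_BYTES, block=BLOCK_BYTES):
    """List (relative path, size, status) for every file that is not 'ok'."""
    root = Path(root)
    flagged = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            status = classify(size, warn, block)
            if status != "ok":
                flagged.append((str(path.relative_to(root)), size, status))
    return flagged
```

Anything this reports is a candidate for a release asset (or Git LFS) rather than a regular commit.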
-
Great, thanks, Mike! I think this is the way to go then.
-
What's the current plan for storing the tsdb folders associated with grammars? I see that jacy has them checked in; I am assuming they aren't large. For zhong, I am not seeing the tsdb folder checked in. Suppose I would like to check in a treebank for which the result file is larger than 50MB. What do I do?