-
-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Tutorial Videos] Add transcripts and subtitles script #3251
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #3251 +/- ##
==========================================
- Coverage 74.59% 74.55% -0.04%
==========================================
Files 279 279
Lines 10687 10687
Branches 1288 1288
==========================================
- Hits 7972 7968 -4
- Misses 2352 2355 +3
- Partials 363 364 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 8 files at r2, all commit messages.
Reviewable status: 3 of 15 files reviewed, all discussions resolved (waiting on @imnasnainaec)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 14 files at r1, 2 of 8 files at r2, 1 of 2 files at r3, all commit messages.
Reviewable status: 8 of 16 files reviewed, 1 unresolved discussion (waiting on @imnasnainaec)
docs/tutorial_subtitles/README.md
line 3 at r3 (raw file):
This folder contains the transcripts for tutorial videos and timestamps for generating video subtitles. The `_in_progress_transcripts_` subfolder holds transcripts that are awaiting a video recording.
Is there a reason for the leading and trailing underscores in the directory name? I have been using filenames that start with _
to prevent them from being added to the git repo. In addition, leading _
s often have special meanings in other languages (e.g. C, Python). I can change my key for not adding files to git.
One issue I have run into is that when I tried to mention the directory in Slack, it kept showing "in_progress_transcripts" in italics!
It's not a big deal and if you have a reason you would like to keep it this way, that is fine with me.
Code quote:
_in_progress_transcripts_
docs/tutorial_subtitles/export_1_flex_to_combine/times.txt
line 44 at r3 (raw file):
5:13 5:16 5:19
3 seconds seems like a very short time for a caption - even one as simple as this since it may be longer in other languages. In merge_dups_1_basics
it's only 2 seconds.
Code quote:
5:19
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 14 files at r1.
Reviewable status: 8 of 16 files reviewed, 1 unresolved discussion (waiting on @jmgrady)
docs/tutorial_subtitles/README.md
line 3 at r3 (raw file):
Previously, jmgrady (Jim Grady) wrote…
Is there a reason for the leading and trailing underscores in the directory name? I have been using filenames that start with
_
to prevent them from being added to the git repo. In addition, leading_
s often have special meanings in other languages (e.g. C, Python). I can change my key for not adding files to git.One issue I have run into is that when I tried to mention the directory in Slack, it kept showing "in_progress_transcripts" in italics!
It's not a big deal and if you have a reason you would like to keep it this way, that is fine with me.
I want to distinguish it from the other folders (and have it autosort to the start or end of the containing folder), but a _
prefix may not be the best way to do that, given what you've described.
docs/tutorial_subtitles/export_1_flex_to_combine/times.txt
line 44 at r3 (raw file):
Previously, jmgrady (Jim Grady) wrote…
3 seconds seems like a very short time for a caption - even one as simple as this since it may be longer in other languages. In
merge_dups_1_basics
it's only 2 seconds.
The times are identified to keep the caption up until the next sentence begins or the video ends. Perhaps I could try to pause a bit more between sentences in future recordings to allow for longer translations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: 8 of 16 files reviewed, all discussions resolved (waiting on @imnasnainaec)
docs/tutorial_subtitles/README.md
line 3 at r3 (raw file):
Previously, imnasnainaec (D. Ror.) wrote…
I want to distinguish it from the other folders (and have it autosort to the start or end of the containing folder), but a
_
prefix may not be the best way to do that, given what you've described.
That is a fine reason. I should have been clearer - in C & Python, a leading _
has special meaning in identifiers, not file names. I'll pick something else for files I do not want tracked (in a different PR). _
is more of a pain to type anyway! 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 14 files at r1, 2 of 8 files at r2, all commit messages.
Reviewable status: 14 of 16 files reviewed, 3 unresolved discussions (waiting on @imnasnainaec)
docs/tutorial_subtitles/_in_progress_transcripts_/merge_dups_2_senses.eng.txt
line 10 at r5 (raw file):
Voila! Now the correct senses are together. Wait a second… the two glosses of the first word are not different senses. They are the exact same idea expressed redundantly. To delete the unnecessary sense, click-and-drag it over the garbage icon in the bottom corner.
In ./docs/tutorial_subtitles/_in_progress_transcripts_/data_entry_1_basics.eng.txt
this same icon is called the delete icon.
Perhaps use drag it over the delete icon in the bottom corner. It looks like a garbage can. As an additional nit, it should be garbage can - garbage is what goes into the garbage can.
Code quote:
garbage icon
docs/tutorial_subtitles/_in_progress_transcripts_/review_entries_2_editing.eng.txt
line 7 at r5 (raw file):
In this video, we look at how to edit an entry. Each word in your project shows up as a row in this table. To delete a word, click the rubbish bin icon at the right end.
This is yet a different term than is used in other tutorials for the same icon.
Code quote:
rubbish bin icon
docs/tutorial_subtitles/_in_progress_transcripts_/sem_dom_1_basics.eng.txt
line 6 at r5 (raw file):
Let’s go to thecombine.app and log in. When you click on a project, the semantic domain tree appears. If you are doing something else in the project (such as data cleanup or project settings), you can get back to the semantic domain tree by clicking the “Data Entry” button in the top bar.
This sentence seems awkward to me. doing something else should refer to a task. data cleanup is a task but not project settings. Should be something like reviewing the project settings or changing the project settings.
Code quote:
If you are doing something else in the project (such as data cleanup or project settings), you can get back to the semantic domain tree by clicking the “Data Entry” button in the top bar.
docs/tutorial_subtitles/merge_dups_1_basics/merge_dups_1_basics.eng.txt
line 27 at r4 (raw file):
If we click “Yes”, it will look for more sets of potential duplicates. After The Combine can’t find any more words with identical vernacular form, it will suggest words with similar vernacular forms. This will help catch duplicates where one of the words has a typo or is using alternative spelling.
How about "an alternate" instead of "alternative"?
Alternate is an adjective (or a verb if you pronounce differently).
Alternative is generally a noun although its use as an adjective is increasing I believe.
Perhaps it's my age showing.
Code quote:
alternative
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r4, 1 of 1 files at r5.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @imnasnainaec)
scripts/subtitle_tutorial_video.py
line 101 at r5 (raw file):
for line in file.readlines(): if not (line): continue
Strip the line before testing if the line exists.
for line in file.readlines():
line = line.strip()
if not line:
...
Code quote:
for line in file.readlines():
if not (line):
continue
scripts/subtitle_tutorial_video.py
line 104 at r5 (raw file):
min_sec = line.strip().split(":") # Split into minutes and seconds sec_mil = min_sec[1].split(".") # Check for partial seconds mil = sec_mil[1] if len(sec_mil) > 1 else "0" # Compute milliseconds
I find it simpler to split on :
or .
import re
...
times = re.split(":|\.", line) # See comment above about strip
# now the time components can be referenced as
min = times[0]
sec = times[1]
msec = times[2] if len(times) > 2 else 0.
(also use the standard abbreviation for milliseconds).
Code quote:
min_sec = line.strip().split(":") # Split into minutes and seconds
sec_mil = min_sec[1].split(".") # Check for partial seconds
mil = sec_mil[1] if len(sec_mil) > 1 else "0" # Compute milliseconds
scripts/subtitle_tutorial_video.py
line 138 at r5 (raw file):
" ".join(metadata_strings), f"{args.output}", ]
Since you join the arguments below, you don't need to join them here (which will make it easier to read in the debugger if you need to). Use this form instead:
exec_command_parts: List[str] = [
"ffmpeg", "-i",
args.input] +
i_strings +
map_strings) +
["-c",
"copy",
"-c:s mov_text"] +
metadata_strings,
args.output,
]
(And use subprocess.run
as in the following comment. It takes an array for the command).
Code quote:
exec_command_parts: List[str] = [
"ffmpeg -i",
f"{args.input}",
" ".join(i_strings),
" ".join(map_strings),
"-c copy -c:s mov_text",
" ".join(metadata_strings),
f"{args.output}",
]
scripts/subtitle_tutorial_video.py
line 141 at r5 (raw file):
exec_command = " ".join(exec_command_parts) logging.info(f"Executing: {exec_command}") system(exec_command)
Use subprocess.run()
instead of os.system()
. It is generally considered more "Pythonic" and does not require you to join the arguments and prevents some problems quoting of arguments.
The only time I have had to resort to os.system
is when I need to print stdout
and stderr
while the command is running. It is usually sufficient to print it at the end if I want it. (There is a way to do it with the subprocess
module but it is cumbersome.)
Code quote:
system(exec_command)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 10 of 16 files reviewed, 4 unresolved discussions (waiting on @jmgrady)
docs/tutorial_subtitles/_in_progress_transcripts_/merge_dups_2_senses.eng.txt
line 10 at r5 (raw file):
Previously, jmgrady (Jim Grady) wrote…
In
./docs/tutorial_subtitles/_in_progress_transcripts_/data_entry_1_basics.eng.txt
this same icon is called the delete icon.Perhaps use drag it over the delete icon in the bottom corner. It looks like a garbage can. As an additional nit, it should be garbage can - garbage is what goes into the garbage can.
(Discussed at standup), to avoid the various international interpretations of "trash" and "garbage", I've just gone with "delete icon" everywhere.
docs/tutorial_subtitles/_in_progress_transcripts_/sem_dom_1_basics.eng.txt
line 6 at r5 (raw file):
Previously, jmgrady (Jim Grady) wrote…
This sentence seems awkward to me. doing something else should refer to a task. data cleanup is a task but not project settings. Should be something like reviewing the project settings or changing the project settings.
Reworded.
scripts/subtitle_tutorial_video.py
line 101 at r5 (raw file):
Previously, jmgrady (Jim Grady) wrote…
Strip the line before testing if the line exists.
for line in file.readlines(): line = line.strip() if not line: ...
Done.
scripts/subtitle_tutorial_video.py
line 104 at r5 (raw file):
Previously, jmgrady (Jim Grady) wrote…
I find it simpler to split on
:
or.
import re ... times = re.split(":|\.", line) # See comment above about strip # now the time components can be referenced as min = times[0] sec = times[1] msec = times[2] if len(times) > 2 else 0.
(also use the standard abbreviation for milliseconds).
Done.
scripts/subtitle_tutorial_video.py
line 138 at r5 (raw file):
Previously, jmgrady (Jim Grady) wrote…
Since you join the arguments below, you don't need to join them here (which will make it easier to read in the debugger if you need to). Use this form instead:
exec_command_parts: List[str] = [ "ffmpeg", "-i", args.input] + i_strings + map_strings) + ["-c", "copy", "-c:s mov_text"] + metadata_strings, args.output, ]
(And use
subprocess.run
as in the following comment. It takes an array for the command).
Done.
scripts/subtitle_tutorial_video.py
line 141 at r5 (raw file):
Previously, jmgrady (Jim Grady) wrote…
Use
subprocess.run()
instead ofos.system()
. It is generally considered more "Pythonic" and does not require you to join the arguments and prevents some problems quoting of arguments.The only time I have had to resort to
os.system
is when I need to printstdout
andstderr
while the command is running. It is usually sufficient to print it at the end if I want it. (There is a way to do it with thesubprocess
module but it is cumbersome.)
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 6 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @imnasnainaec)
Script uses ffmpeg per https://www.bannerbear.com/blog/how-to-add-subtitles-to-a-video-file-using-ffmpeg
Example videos:
This change is