Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tutorial Videos] Add transcripts and subtitles script #3251

Merged
merged 18 commits into from
Aug 12, 2024
Merged

Conversation

imnasnainaec
Copy link
Collaborator

@imnasnainaec imnasnainaec commented Jul 18, 2024

@imnasnainaec imnasnainaec added the documentation Improvements or additions to documentation label Jul 18, 2024
@imnasnainaec imnasnainaec self-assigned this Jul 18, 2024
Copy link

codecov bot commented Jul 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.55%. Comparing base (8c51260) to head (f828da9).
Report is 35 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3251      +/-   ##
==========================================
- Coverage   74.59%   74.55%   -0.04%     
==========================================
  Files         279      279              
  Lines       10687    10687              
  Branches     1288     1288              
==========================================
- Hits         7972     7968       -4     
- Misses       2352     2355       +3     
- Partials      363      364       +1     
Flag Coverage Δ
backend 83.79% <ø> (-0.09%) ⬇️
frontend 66.45% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@imnasnainaec imnasnainaec marked this pull request as ready for review July 24, 2024 21:08
@imnasnainaec imnasnainaec mentioned this pull request Jul 24, 2024
26 tasks
Copy link
Collaborator

@jmgrady jmgrady left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 8 files at r2, all commit messages.
Reviewable status: 3 of 15 files reviewed, all discussions resolved (waiting on @imnasnainaec)

@imnasnainaec imnasnainaec marked this pull request as draft August 6, 2024 16:05
@jmgrady jmgrady added the python label Aug 6, 2024
Copy link
Collaborator

@jmgrady jmgrady left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 14 files at r1, 2 of 8 files at r2, 1 of 2 files at r3, all commit messages.
Reviewable status: 8 of 16 files reviewed, 1 unresolved discussion (waiting on @imnasnainaec)


docs/tutorial_subtitles/README.md line 3 at r3 (raw file):

This folder contains the transcripts for tutorial videos and timestamps for generating video subtitles.

The `_in_progress_transcripts_` subfolder holds transcripts that are awaiting a video recording.

Is there a reason for the leading and trailing underscores in the directory name? I have been using filenames that start with _ to prevent them from being added to the git repo. In addition, leading _s often have special meanings in other languages (e.g. C, Python). I can change my key for not adding files to git.

One issue I have run into is that when I tried to mention the directory in Slack, it kept showing "in_progress_transcripts" in italics!

It's not a big deal and if you have a reason you would like to keep it this way, that is fine with me.

Code quote:

_in_progress_transcripts_

docs/tutorial_subtitles/export_1_flex_to_combine/times.txt line 44 at r3 (raw file):

5:13
5:16
5:19

3 seconds seems like a very short time for a caption - even one as simple as this since it may be longer in other languages. In merge_dups_1_basics it's only 2 seconds.

Code quote:

5:19

@imnasnainaec imnasnainaec marked this pull request as ready for review August 6, 2024 18:18
Copy link
Collaborator Author

@imnasnainaec imnasnainaec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 14 files at r1.
Reviewable status: 8 of 16 files reviewed, 1 unresolved discussion (waiting on @jmgrady)


docs/tutorial_subtitles/README.md line 3 at r3 (raw file):

Previously, jmgrady (Jim Grady) wrote…

Is there a reason for the leading and trailing underscores in the directory name? I have been using filenames that start with _ to prevent them from being added to the git repo. In addition, leading _s often have special meanings in other languages (e.g. C, Python). I can change my key for not adding files to git.

One issue I have run into is that when I tried to mention the directory in Slack, it kept showing "in_progress_transcripts" in italics!

It's not a big deal and if you have a reason you would like to keep it this way, that is fine with me.

I want to distinguish it from the other folders (and have it autosort to the start or end of the containing folder), but a _ prefix may not be the best way to do that, given what you've described.


docs/tutorial_subtitles/export_1_flex_to_combine/times.txt line 44 at r3 (raw file):

Previously, jmgrady (Jim Grady) wrote…

3 seconds seems like a very short time for a caption - even one as simple as this since it may be longer in other languages. In merge_dups_1_basics it's only 2 seconds.

The times are identified to keep the caption up until the next sentence begins or the video ends. Perhaps I could try to pause a bit more between sentences in future recordings to allow for longer translations.

Copy link
Collaborator

@jmgrady jmgrady left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 8 of 16 files reviewed, all discussions resolved (waiting on @imnasnainaec)


docs/tutorial_subtitles/README.md line 3 at r3 (raw file):

Previously, imnasnainaec (D. Ror.) wrote…

I want to distinguish it from the other folders (and have it autosort to the start or end of the containing folder), but a _ prefix may not be the best way to do that, given what you've described.

That is a fine reason. I should have been clearer - in C & Python, a leading _ has special meaning in identifiers, not file names. I'll pick something else for files I do not want tracked (in a different PR). _ is more of a pain to type anyway! 😉

Copy link
Collaborator

@jmgrady jmgrady left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 14 files at r1, 2 of 8 files at r2, all commit messages.
Reviewable status: 14 of 16 files reviewed, 3 unresolved discussions (waiting on @imnasnainaec)


docs/tutorial_subtitles/_in_progress_transcripts_/merge_dups_2_senses.eng.txt line 10 at r5 (raw file):

Voila! Now the correct senses are together.
Wait a second… the two glosses of the first word are not different senses. They are the exact same idea expressed redundantly.
To delete the unnecessary sense, click-and-drag it over the garbage icon in the bottom corner.

In ./docs/tutorial_subtitles/_in_progress_transcripts_/data_entry_1_basics.eng.txt this same icon is called the delete icon.

Perhaps use drag it over the delete icon in the bottom corner. It looks like a garbage can. As an additional nit, it should be garbage can - garbage is what goes into the garbage can.

Code quote:

garbage icon

docs/tutorial_subtitles/_in_progress_transcripts_/review_entries_2_editing.eng.txt line 7 at r5 (raw file):

In this video, we look at how to edit an entry.
Each word in your project shows up as a row in this table.
To delete a word, click the rubbish bin icon at the right end.

This is yet a different term than is used in other tutorials for the same icon.

Code quote:

rubbish bin icon

docs/tutorial_subtitles/_in_progress_transcripts_/sem_dom_1_basics.eng.txt line 6 at r5 (raw file):

Let’s go to thecombine.app and log in.
When you click on a project, the semantic domain tree appears.
If you are doing something else in the project (such as data cleanup or project settings), you can get back to the semantic domain tree by clicking the “Data Entry” button in the top bar.

This sentence seems awkward to me. doing something else should refer to a task. data cleanup is a task but not project settings. Should be something like reviewing the project settings or changing the project settings.

Code quote:

If you are doing something else in the project (such as data cleanup or project settings), you can get back to the semantic domain tree by clicking the “Data Entry” button in the top bar.

docs/tutorial_subtitles/merge_dups_1_basics/merge_dups_1_basics.eng.txt line 27 at r4 (raw file):

If we click “Yes”, it will look for more sets of potential duplicates.
After The Combine can’t find any more words with identical vernacular form, it will suggest words with similar vernacular forms.
This will help catch duplicates where one of the words has a typo or is using alternative spelling.

How about "an alternate" instead of "alternative"?
Alternate is an adjective (or a verb if you pronounce differently).
Alternative is generally a noun although its use as an adjective is increasing I believe.

Perhaps it's my age showing.

Code quote:

alternative

Copy link
Collaborator

@jmgrady jmgrady left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r4, 1 of 1 files at r5.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @imnasnainaec)


scripts/subtitle_tutorial_video.py line 101 at r5 (raw file):

        for line in file.readlines():
            if not (line):
                continue

Strip the line before testing if the line exists.

for line in file.readlines():
  line = line.strip()
  if not line:
...

Code quote:

        for line in file.readlines():
            if not (line):
                continue

scripts/subtitle_tutorial_video.py line 104 at r5 (raw file):

            min_sec = line.strip().split(":")  # Split into minutes and seconds
            sec_mil = min_sec[1].split(".")  # Check for partial seconds
            mil = sec_mil[1] if len(sec_mil) > 1 else "0"  # Compute milliseconds

I find it simpler to split on : or .

import re
...
times = re.split(":|\.", line)  # See comment above about strip
# now the time components can be referenced as
 min = times[0]
sec = times[1]
msec = times[2] if len(times) > 2 else 0.

(also use the standard abbreviation for milliseconds).

Code quote:

            min_sec = line.strip().split(":")  # Split into minutes and seconds
            sec_mil = min_sec[1].split(".")  # Check for partial seconds
            mil = sec_mil[1] if len(sec_mil) > 1 else "0"  # Compute milliseconds

scripts/subtitle_tutorial_video.py line 138 at r5 (raw file):

            " ".join(metadata_strings),
            f"{args.output}",
        ]

Since you join the arguments below, you don't need to join them here (which will make it easier to read in the debugger if you need to). Use this form instead:

        exec_command_parts: List[str] = [
            "ffmpeg", "-i",
            args.input] +
            i_strings +
            map_strings) +
            ["-c",
            "copy",
             "-c:s mov_text"] +
            metadata_strings,
            args.output,
        ]

(And use subprocess.run as in the following comment. It takes an array for the command).

Code quote:

        exec_command_parts: List[str] = [
            "ffmpeg -i",
            f"{args.input}",
            " ".join(i_strings),
            " ".join(map_strings),
            "-c copy -c:s mov_text",
            " ".join(metadata_strings),
            f"{args.output}",
        ]

scripts/subtitle_tutorial_video.py line 141 at r5 (raw file):

        exec_command = " ".join(exec_command_parts)
        logging.info(f"Executing: {exec_command}")
        system(exec_command)

Use subprocess.run() instead of os.system(). It is generally considered more "Pythonic" and does not require you to join the arguments and prevents some problems quoting of arguments.

The only time I have had to resort to os.system is when I need to print stdout and stderr while the command is running. It is usually sufficient to print it at the end if I want it. (There is a way to do it with the subprocess module but it is cumbersome.)

Code quote:

        system(exec_command)

Copy link
Collaborator Author

@imnasnainaec imnasnainaec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 10 of 16 files reviewed, 4 unresolved discussions (waiting on @jmgrady)


docs/tutorial_subtitles/_in_progress_transcripts_/merge_dups_2_senses.eng.txt line 10 at r5 (raw file):

Previously, jmgrady (Jim Grady) wrote…

In ./docs/tutorial_subtitles/_in_progress_transcripts_/data_entry_1_basics.eng.txt this same icon is called the delete icon.

Perhaps use drag it over the delete icon in the bottom corner. It looks like a garbage can. As an additional nit, it should be garbage can - garbage is what goes into the garbage can.

(Discussed at standup), to avoid the various international interpretations of "trash" and "garbage", I've just gone with "delete icon" everywhere.


docs/tutorial_subtitles/_in_progress_transcripts_/sem_dom_1_basics.eng.txt line 6 at r5 (raw file):

Previously, jmgrady (Jim Grady) wrote…

This sentence seems awkward to me. doing something else should refer to a task. data cleanup is a task but not project settings. Should be something like reviewing the project settings or changing the project settings.

Reworded.


scripts/subtitle_tutorial_video.py line 101 at r5 (raw file):

Previously, jmgrady (Jim Grady) wrote…

Strip the line before testing if the line exists.

for line in file.readlines():
  line = line.strip()
  if not line:
...

Done.


scripts/subtitle_tutorial_video.py line 104 at r5 (raw file):

Previously, jmgrady (Jim Grady) wrote…

I find it simpler to split on : or .

import re
...
times = re.split(":|\.", line)  # See comment above about strip
# now the time components can be referenced as
 min = times[0]
sec = times[1]
msec = times[2] if len(times) > 2 else 0.

(also use the standard abbreviation for milliseconds).

Done.


scripts/subtitle_tutorial_video.py line 138 at r5 (raw file):

Previously, jmgrady (Jim Grady) wrote…

Since you join the arguments below, you don't need to join them here (which will make it easier to read in the debugger if you need to). Use this form instead:

        exec_command_parts: List[str] = [
            "ffmpeg", "-i",
            args.input] +
            i_strings +
            map_strings) +
            ["-c",
            "copy",
             "-c:s mov_text"] +
            metadata_strings,
            args.output,
        ]

(And use subprocess.run as in the following comment. It takes an array for the command).

Done.


scripts/subtitle_tutorial_video.py line 141 at r5 (raw file):

Previously, jmgrady (Jim Grady) wrote…

Use subprocess.run() instead of os.system(). It is generally considered more "Pythonic" and does not require you to join the arguments and prevents some problems quoting of arguments.

The only time I have had to resort to os.system is when I need to print stdout and stderr while the command is running. It is usually sufficient to print it at the end if I want it. (There is a way to do it with the subprocess module but it is cumbersome.)

Done.

Copy link
Collaborator

@jmgrady jmgrady left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 6 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @imnasnainaec)

@imnasnainaec imnasnainaec merged commit fa18e58 into master Aug 12, 2024
19 checks passed
@imnasnainaec imnasnainaec deleted the tutorial branch August 12, 2024 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants