Subtitle Generator Using Whisper

I wanted to generate subtitles for the Normal People TV series on my laptop using an LLM. After searching a bit, OpenAI's Whisper looked like a proper fit.

Step 1: Extracting Audio from Video

The first step is to extract the audio from the video file using ffmpeg and store it separately. The -vn flag drops the video stream, and -acodec copy copies the audio stream as-is without re-encoding.

ffmpeg -i /Users/kracekumar/Movies/TV/Normal.People.S01/Normal.People.S01E01.mp4 -vn -acodec copy /Users/kracekumar/Movies/TV/Normal.People.S01/audio/Normal.People.S01E01.aac

Step 2: Converting Audio to Text

The second step is to run the audio file through OpenAI's Whisper model. I use uv to install whisper and run it inside a project.

uv run whisper /Users/kracekumar/Movies/TV/Normal.People.S01/audio/Normal.People.S01E01.aac --model turbo -f srt --output_dir /Users/kracekumar/Movies/TV/Normal.People.S01/generated_subs/

Here are the first ten subtitles generated by the model:

1
00:00:00,000 --> 00:00:24,000
It's a simple game. You have 15 players. Give one of them the ball. Get it into the net.

2
00:00:24,000 --> 00:00:26,000
Very simple. Isn't it?

3
00:00:26,000 --> 00:00:31,000
Brilliant. How's it going, Rachel? Talking tactics there for the big game.

4
00:00:31,000 --> 00:00:35,000
We're getting a masterclass. How incredibly boring of you.

5
00:00:35,000 --> 00:00:39,000
Yeah. Did you use your hair though? I did, yeah.

6
00:00:39,000 --> 00:00:44,000
It's very pretty. Thanks. Can I use my locker? By any chance?

7
00:00:44,000 --> 00:00:50,000
Yeah. Yeah, I sorta need you to move, Connell.

8
00:00:50,000 --> 00:00:55,000
Oh, sorry. Excuse me. Sorry. Excuse me. Right, relax, will ya?

9
00:00:55,000 --> 00:01:00,000
Okay, now that's important because it's turned up in the exam twice out of the last three years.

10
00:01:02,000 --> 00:01:03,000
Marianne.

Here are the subtitles for the same scene from another subtitle site:

1
00:00:18,989 --> 00:00:20,269
It's a simple game.

2
00:00:20,320 --> 00:00:22,642
You have 15 players.
Give one of them the ball,

3
00:00:22,693 --> 00:00:24,048
get it into the net.

4
00:00:24,099 --> 00:00:25,708
- Very simple.
- Isn't it?

5
00:00:26,052 --> 00:00:27,192
Oh, what?

6
00:00:27,415 --> 00:00:28,535
Brilliant.

7
00:00:28,833 --> 00:00:31,520
How's it going, Rachel?
Talking tactics, there, for the game.

8
00:00:31,571 --> 00:00:33,200
We're getting a master class.

9
00:00:33,598 --> 00:00:35,965
- How incredibly boring of you.
- Yeah.

10
00:00:36,601 --> 00:00:38,570
- Did you get your hair done?
- I did, yeah.

The complete generated subtitles can be found in a gist.

Comparison with original subtitles

The model produced close-to-perfect subtitles in terms of text, and they are highly useful, but with a few annoying behaviours:

  1. Text appears before characters start to speak: The first generated subtitle appears as soon as the video starts, whereas in the original file it starts at the 18th second. Likewise, when there is a long pause in the video, the next dialogue appears immediately.
  2. Subtitle length: The first subtitle, "It's a simple game. You have 15 players. Give one of them the ball. Get it into the net.", is long and covers 6 seconds of dialogue. Splitting these into multiple sequences would be useful, especially for movie subtitles, though it may not matter for plain speech-to-text. (I couldn't find any CLI options for this.)
  3. Inconsistent punctuation: While some generated text includes proper punctuation, other sections lack it. The CLI offers --append_punctuations and --prepend_punctuations, which may help here, but I haven't tried them.
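(Newer whisper releases appear to ship options like --max_line_width and --max_words_per_line, which may be worth checking for the length issue.) As a post-processing alternative, an oversized SRT cue can be split at sentence boundaries, with the time range shared out in proportion to text length. A minimal standard-library sketch; the 42-character limit and the proportional-time rule are my own assumptions, not anything Whisper provides:

```python
import re
from datetime import timedelta

def parse_ts(ts: str) -> timedelta:
    """Parse an SRT timestamp like 00:00:24,000 into a timedelta."""
    h, m, rest = ts.split(":")
    s, ms = rest.split(",")
    return timedelta(hours=int(h), minutes=int(m), seconds=int(s), milliseconds=int(ms))

def format_ts(td: timedelta) -> str:
    """Format a timedelta back into SRT's HH:MM:SS,mmm form."""
    total_ms = int(td.total_seconds() * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def split_cue(start: str, end: str, text: str, max_chars: int = 42):
    """Split one long cue into smaller cues at sentence boundaries."""
    pieces, current = [], ""
    for sentence in re.split(r"(?<=[.?!])\s+", text.strip()):
        if current and len(current) + len(sentence) + 1 > max_chars:
            pieces.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        pieces.append(current)

    # Allot each piece a share of the original time range by text length.
    t0, t1 = parse_ts(start), parse_ts(end)
    total_len = sum(len(p) for p in pieces)
    cues, cursor = [], t0
    for p in pieces:
        share = (t1 - t0) * (len(p) / total_len)
        cues.append((format_ts(cursor), format_ts(cursor + share), p))
        cursor += share
    return cues

cues = split_cue(
    "00:00:00,000", "00:00:24,000",
    "It's a simple game. You have 15 players. "
    "Give one of them the ball. Get it into the net.",
)
for s, e, t in cues:
    print(s, "-->", e, t)
```

This only redistributes time evenly by character count, so it cannot fix the leading silence (problem 1); that needs actual timing information, e.g. from --word_timestamps.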

Script to bulk convert

import os
import subprocess
import argparse

def extract_audio(input_dir, output_dir):
    """
    Extracts audio from video files in the input directory and saves them to the output directory.

    Args:
        input_dir: Path to the directory containing video files.
        output_dir: Path to the directory where audio files will be saved.
    """
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    for filename in os.listdir(input_dir):
        if filename.endswith(('.mp4', '.avi', '.mov')):  # Add more video extensions if needed
            input_path = os.path.join(input_dir, filename)
            output_path = os.path.join(output_dir, os.path.splitext(filename)[0] + '.aac')
            # Use an argument list instead of shell=True so paths with
            # spaces or shell metacharacters are handled safely.
            command = ["ffmpeg", "-i", input_path, "-vn", "-acodec", "copy", output_path]
            # Log the command to track progress
            print(f"Running the command: {' '.join(command)}")
            subprocess.run(command, check=True)

def generate_subtitles(input_dir, output_dir):
    """
    Generates subtitles for audio files using the Whisper LLM model.

    Args:
        input_dir: Path to the directory containing audio files.
        output_dir: Path to the directory where subtitle files will be saved.
    """
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    for filename in os.listdir(input_dir):
        if filename.endswith(('.aac', '.wav')):
            input_path = os.path.join(input_dir, filename)
            # Argument list form avoids quoting issues with shell=True.
            command = ["whisper", input_path, "--model", "turbo", "-f", "srt", "--output_dir", output_dir]
            # Adjust the model ('tiny', 'base', 'small', 'medium', 'large', 'turbo') as needed
            print(f"Running the command: {' '.join(command)}")
            subprocess.run(command, check=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Extract audio and generate subtitles.")
    parser.add_argument("input_dir", help="Path to the directory containing video files.")
    parser.add_argument("audio_dir", help="Path to the directory to save extracted audio files.")
    parser.add_argument("subtitle_dir", help="Path to the directory to save generated subtitles.")

    args = parser.parse_args()

    extract_audio(args.input_dir, args.audio_dir)
    generate_subtitles(args.audio_dir, args.subtitle_dir)
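Since transcribing a whole season takes a while, a small resumability check could let the bulk script skip episodes that already have subtitles. This sketch assumes whisper names its output after the input file's stem, which I believe is its default behaviour; the function name is my own:

```python
import os

def needs_subtitles(audio_path: str, subtitle_dir: str) -> bool:
    """Return True when no .srt file exists yet for this audio file.

    Assumes whisper writes <audio stem>.srt into the output directory.
    """
    stem = os.path.splitext(os.path.basename(audio_path))[0]
    return not os.path.exists(os.path.join(subtitle_dir, stem + ".srt"))

print(needs_subtitles("audio/Normal.People.S01E01.aac", "generated_subs"))
```

Guarding the whisper call in generate_subtitles with this check would make a rerun after an interruption pick up where it left off.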

I asked Gemini to generate the code for this task; the script above is what the model produced. The prompt was basic and as follows:

Write a Python program that takes command line arguments to do the following tasks

1) Get a directory that contains video files and extracts the audio from the video and stores the audio in a separate directory using ffmpeg command. If the output directory is missing, new directory should be created.

2) Then each audio file is passed to the whisper llm model command to produce the subtitles for the audio file. The output should be stored in a new directory

From the generated code, I modified two things:

  1. The command-line arguments to the ffmpeg and whisper commands.
  2. Added a log line that prints the current command to track progress.

Chinese language

After the successful English subtitle generation, I was tempted to try non-English audio with the movie In the Mood for Love. The Whisper model failed to translate the Chinese audio to English: it produced empty segments, hallucinated text, and plain Chinese transcription (for example, 謝謝你, 那我先走了 is ordinary dialogue meaning "Thank you, I'll be off then"), and finally crashed while writing the SRT file.

$ uv run whisper /Users/kracekumar/Movies/In.the.Mood.for.Love/audio/In.the.Mood.for.Love.mp4 --model turbo -f srt --output_dir /Users/kracekumar/Movies/In.the.Mood.for.Love/generated_sub --language zh --task translate
[00:00.000 --> 00:00.180]
[00:30.000 --> 00:30.180]
[01:00.000 --> 01:00.160]  The frustrating is for thoseatks. It's beautiful and adorable and significant. It's adorable and typical. There's a blank blank blank blank blank blank ["blank" repeated for the rest of the segment]
[01:30.000 --> 01:30.180]
[02:00.000 --> 02:05.200] 謝謝你, 那我先走了
[02:05.260 --> 02:06.680] of息, 再見
[02:07.800 --> 02:11.360] 請問你們有嗎?
[02:11.520 --> 02:15.200] 對不起, 房間剛剛租給一位太太
[02:15.520 --> 02:16.520] 謝謝你
...
[01:28:35.420 --> 01:28:36.420] 謝謝
[01:28:36.420 --> 01:28:37.420]
Traceback (most recent call last):
  File "/Users/kracekumar/code/s2t/.venv/lib/python3.12/site-packages/whisper/transcribe.py", line 598, in cli
    writer(result, audio_path, **writer_args)
  File "/Users/kracekumar/code/s2t/.venv/lib/python3.12/site-packages/whisper/utils.py", line 101, in __call__
    self.write_result(result, file=f, options=options, **kwargs)
  File "/Users/kracekumar/code/s2t/.venv/lib/python3.12/site-packages/whisper/utils.py", line 257, in write_result
    for i, (start, end, text) in enumerate(
                                 ^^^^^^^^^^
  File "/Users/kracekumar/code/s2t/.venv/lib/python3.12/site-packages/whisper/utils.py", line 197, in iterate_result
    for subtitle in iterate_subtitles():
                    ^^^^^^^^^^^^^^^^^^^
  File "/Users/kracekumar/code/s2t/.venv/lib/python3.12/site-packages/whisper/utils.py", line 147, in iterate_subtitles
    last: float = get_start(result["segments"]) or 0.0
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kracekumar/code/s2t/.venv/lib/python3.12/site-packages/whisper/utils.py", line 72, in get_start
    return next(
           ^^^^^
  File "/Users/kracekumar/code/s2t/.venv/lib/python3.12/site-packages/whisper/utils.py", line 73, in <genexpr>
    (w["start"] for s in segments for w in s["words"]),
                                           ~^^^^^^^^^
KeyError: 'words'
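The KeyError points at the SRT writer assuming word-level timestamps (s["words"]) that this run never produced. I haven't checked how upstream handles it, but a defensive version of get_start would fall back to segment-level start times. A hypothetical sketch, not the project's actual code:

```python
def get_start(segments):
    # Prefer the first word-level start; fall back to the first
    # segment's start when the "words" key is absent entirely.
    return next(
        (w["start"] for s in segments for w in s.get("words", [])),
        segments[0]["start"] if segments else None,
    )

# Shaped like the failing run's output: segments without "words".
segments = [{"start": 2.0, "end": 5.2, "text": "Thank you, I'll be off then"}]
print(get_start(segments))  # 2.0
```

Rerunning the CLI with --word_timestamps True might also sidestep the crash, since that populates the words field, though I haven't verified it against this file.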

Conclusion

The Whisper model was able to provide usable, close-to-accurate subtitles for English audio. There are rough edges, like producing long subtitles without proper splitting, that hamper the experience. I'm pretty sure there are tweaks that would get near-perfect subtitles with enough effort.