
Subtitle synchronization with STT and sequence alignment (part I)

2024-11-18

Combining speech-to-text and sequence alignment for fast and flexible resynchronization of subtitles


Audio/transcription alignment problem

Assume given:

  - an audio track
  - a transcription of its content with timestamps (in our case, subtitles), which may be misaligned with the audio

For instance, misalignment can occur when the transcription corresponds to a slightly different audio track, caused by a different framerate or by extra/missing content at the beginning or throughout. This can lead to delays ranging from less than a second to tens of seconds, all quite noticeable.


Different examples of misalignments between audio and transcription.

An example that many will have encountered is video subtitles, which will be the main case considered here. Misalignments usually originate from advertisements or slightly different cuts (extra/missing scenes).


All examples on this page are from the Blender Studio open films,
(CC) Blender Foundation. This is Sprite Fright (2021).

The natural question that will be discussed is:

Can the transcription be automatically (re-)aligned to the audio track?

General method: time transformation fitting

Formally, the alignment problem is about finding a transformation $T$ from the time frame of the transcription to that of the audio.

A general method for solving this is to:

  1. Obtain samples from that transformation, namely pairs of timestamps (transcription, audio).
  2. Reconstruct the transformation using the samples as well as priors on what shape the transformation can take (e.g. piecewise linear). This must be robust to noisy or erroneous samples.
  3. Apply the transformation to the transcription to resynchronize it.

Fitting a linear time transformation using samples (left),
before using it to retime the transcription (right).
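
To make steps 1-3 concrete in the simplest case pictured above, here is a minimal sketch (not the implementation described later in this note): it fits a linear $T(x) = ax + b$ by least squares on (transcription time, audio time) samples and applies it to retime a timestamp. Robustness to erroneous samples is deliberately ignored, and the function names are made up for the example.

/// Fit T(x) = a*x + b by ordinary least squares on (transcription, audio) timestamp pairs.
/// Note: no robustness to outliers; this is only a sketch of the general method.
fn fit_linear(samples: &[(f64, f64)]) -> (f64, f64) {
    let n = samples.len() as f64;
    let (sx, sy) = samples
        .iter()
        .fold((0.0, 0.0), |(sx, sy), &(x, y)| (sx + x, sy + y));
    let (mx, my) = (sx / n, sy / n);
    let (mut cov, mut var) = (0.0, 0.0);
    for &(x, y) in samples {
        cov += (x - mx) * (y - my);
        var += (x - mx) * (x - mx);
    }
    let a = cov / var;
    (a, my - a * mx)
}

/// Step 3: map a transcription timestamp (in seconds) into the audio time frame.
fn retime(t: f64, (a, b): (f64, f64)) -> f64 {
    a * t + b
}

The piecewise-linear case is discussed in step 4 below.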

Identifying pairs of timestamps

The first step, obtaining sample pairs from the transformation, must of course use information present in the transcription. In the case of subtitles, this information consists of sentences with their start and end times. In SubRip notation:

23
00:01:40,708 --> 00:01:44,667
Oy, fungus freak! Make yourself useful and fix the tent.
 
24
00:01:44,667 --> 00:01:48,917
Oh, a bit of friendly campsite ribbing. Fun.

Waveform and spectrogram (computed with Audacity),
corresponding to the dialogue segment above from Sprite Fright.

From there, the following strategies naturally arise:

  - Voice Activity Detection (VAD): detect the time intervals containing speech in the audio and match them against the intervals covered by the subtitles.
  - Transcription: run speech-to-text on the audio and match the transcribed words against the subtitle text.

Existing solutions

The most important distinguishing factors of existing tools for subtitle synchronization are:

  - the method used to obtain the timestamp pairs (VAD or transcription)
  - the class of time shifts supported (constant "C", linear "L", ...)
  - the implementation language

A semi-exhaustive list of projects (as of 2023), ordered by repository creation date, is given in the following table:

Name                               | Method           | Shifts | Language | Commits/Stars | Date
kaegi/alass                        | VAD              | C      | Rust     | 50/828        | 2017-02
AlbertoSabater [*]                 | VAD              | L      | Py       | 7/66          | 2017-09
tympanix/subsync [*]               | VAD              | L      | Py       | 70/125        | 2018-03
subsync.online, sc0ty/subsync      | Transcription    | L      | C++/Py   | 316/1003      | 2018-10
oseiskar/autosubsync [*]           | VAD              | Fixed  | Py       | 70/305        | 2018-09
pums974/srtsync                    | VAD, FFT align.  | L      | Py       | 19/50         | 2018-12
smacke/ffsubsync                   | VAD, FFT align.  | L      | Py       | 363/6100      | 2019-02
captionpal, Overdrivr/autosync [*] | VAD              | L      | Py       | 98/10         | 2019-04
baxtree/subaligner                 | Local+global RNN | C      | Py       | 266/336       | 2019-12
Amazon Prime Video                 | VAD, drift model | C      | ?        | -             | 2023-03

[*]: based on Sabater’s write-up on a CNN for VAD combined with a linear alignment search.

There are also a number of scientific articles on the subject, albeit rarely with accompanying source code.

VAD methods

Methods using Voice Activity Detection (VAD, see above) are the most common ones.

The voice activity detection is carried out either with classical signal-processing detectors or with a small neural network (such as the CNN mentioned in the note above).

Alignment is usually performed by minimizing the cross-entropy between the VAD probabilities and the indicator function of the subtitle intervals, either by searching over a constant offset (possibly accelerated with an FFT-based correlation) or over an offset together with a framerate ratio.
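
As an illustration only (not taken from any of the tools above), a sketch of the constant-offset variant could look as follows, assuming time has been discretized into frames and a 0/1 speech indicator has been derived from the subtitles:

/// Sketch: exhaustive search for the frame offset minimizing the cross-entropy between
/// VAD speech probabilities and the 0/1 indicator derived from the subtitle intervals.
fn best_offset(vad_probs: &[f32], subtitle_speech: &[bool], max_offset: isize) -> isize {
    let cross_entropy = |offset: isize| -> f32 {
        let mut total = 0.0;
        let mut count = 0u32;
        for (i, &p) in vad_probs.iter().enumerate() {
            let j = i as isize + offset;
            if j < 0 || j as usize >= subtitle_speech.len() {
                continue; // frames falling outside the subtitle range are ignored
            }
            let p = p.clamp(1e-6, 1.0 - 1e-6);
            total += if subtitle_speech[j as usize] { -p.ln() } else { -(1.0 - p).ln() };
            count += 1;
        }
        if count == 0 { f32::INFINITY } else { total / count as f32 }
    };
    (-max_offset..=max_offset)
        .min_by(|&a, &b| cross_entropy(a).partial_cmp(&cross_entropy(b)).unwrap())
        .expect("max_offset must be non-negative")
}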

Transcription methods

Transcription (see above) is the technique used by sc0ty/subsync, employing Sphinx for STT.
The synchronization algorithm fits a line on the timings of the transcription/subtitle words, based on time and word similarity.

Obvious drawbacks compared to other methods are speed and the loss of language-agnosticism.

Other methods

The approach used by subaligner is more “end-to-end”, as a neural network directly predicts offsets. To handle more complex transformations than simple offsets, this is applied in two stages, first globally and then locally. Note that this does not quite fit into the general method described above.

Synchronization with STT and sequence alignment

Goals

After evaluating some of the implementations above, I wanted to try implementing an alternative method with the following goals:

  - faster-than-real-time processing on a typical CPU
  - support for transformations beyond a constant offset, in particular piecewise-linear ones arising from extra or missing content

Method

The idea follows the general method outlined above, with:

  - timestamp pairs obtained by running speech-to-text on the audio and aligning the transcribed words with the subtitle words (sequence alignment)
  - a piecewise-linear model for the time transformation $T$

Altogether, there are five steps:

  1. Subtitle parsing and word-level decomposition
  2. Speech-to-text
  3. Word alignment between the STT output and the decomposed subtitle
  4. Fitting
  5. Resynchronization

Architecture of the alignment pipeline.

These are described in the next sections.

Implementation

Step 1: Subtitle parsing and decomposition

Restricting to the common SubRip format illustrated above, a simple parser can be written using the chumsky parser combinator crate. A .srt file is parsed as:

pub struct Subtitle {
    entries: Vec<SubtitleEntry>,
}
pub struct SubtitleEntry {
    pub sequence: u64,
    pub start: Timestamp,
    pub end: Timestamp,
    pub text: String,
}

The earlier description of the extraction of pairs of timestamps using STT remained vague on how the STT transcription is paired with the original subtitles. This is done at the word level: right after parsing, the subtitle is decomposed into a Vec<Word>:

pub struct Word {
    pub start: Timestamp,
    pub end: Timestamp,
    pub text: String,
    pub confidence: f32,
}

To do so, it is assumed that the number of characters per second is constant in each SubtitleEntry, which is a good enough approximation. The confidence field will be used later with the transcriptions; here it is simply set to 1.0.

For example, the entry

4
00:00:28,375 --> 00:00:30,833
Can’t believe we got stuck with the tree hugger.

gets decomposed as

00:00:28,375 --> 00:00:28,719 [1.00]: Can’t
00:00:28,719 --> 00:00:29,063 [1.00]: believe
00:00:29,063 --> 00:00:29,161 [1.00]: we
00:00:29,161 --> 00:00:29,308 [1.00]: got
00:00:29,308 --> 00:00:29,554 [1.00]: stuck
00:00:29,554 --> 00:00:29,751 [1.00]: with
00:00:29,751 --> 00:00:29,898 [1.00]: the
00:00:29,898 --> 00:00:30,095 [1.00]: tree
00:00:30,095 --> 00:00:30,439 [1.00]: hugger.
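
A minimal sketch of this proportional decomposition, with simplified stand-ins for the Timestamp and Word types above (milliseconds, no confidence handling beyond the constant 1.0):

// Simplified stand-ins for the `Timestamp` and `Word` types above (milliseconds).
#[derive(Clone, Copy)]
pub struct Timestamp(pub u64);

pub struct Word {
    pub start: Timestamp,
    pub end: Timestamp,
    pub text: String,
    pub confidence: f32,
}

/// Split a subtitle entry into words, assuming a constant number of characters per second.
pub fn decompose(start: Timestamp, end: Timestamp, text: &str) -> Vec<Word> {
    let words: Vec<&str> = text.split_whitespace().collect();
    if words.is_empty() {
        return Vec::new();
    }
    let total_chars: usize = words.iter().map(|w| w.chars().count()).sum();
    let duration = (end.0 - start.0) as f64;
    let mut out = Vec::new();
    let mut consumed = 0;
    for w in words {
        let t0 = start.0 + (duration * consumed as f64 / total_chars as f64) as u64;
        consumed += w.chars().count();
        let t1 = start.0 + (duration * consumed as f64 / total_chars as f64) as u64;
        out.push(Word {
            start: Timestamp(t0),
            end: Timestamp(t1),
            text: w.to_string(),
            confidence: 1.0, // subtitle words are taken as ground truth
        });
    }
    out
}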

It would also have been possible to use sentences as the input to the alignment, which would have avoided this decomposition. However, STT does not necessarily generate full sentences, so the analogous step to decomposition here would have been to assemble sentences. Later, instead of scoring pairs of words, pairs of sentences would have been scored.

Speed: Instantaneous.

Step 2: Speech-to-text

Speech-to-text is the most computationally expensive step of the process, and care has to be taken to meet the first goal of a faster-than-real-time overall processing.

To that goal, the Silero STT models are used. They are accurate, very fast even on the CPU, and easy to integrate thanks to an ONNX export. They are available in English, German, Spanish and Ukrainian. Details about the models are provided in this article; the starting point is Facebook’s wav2letter.
The model takes as input audio samples at 16 kHz and produces probabilities over a set of 999 tokens. More precisely, chunks of 2 seconds are fed to the model, and 13 token distributions per second are retrieved.
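
Each output frame thus covers about 77 ms. A small helper (hypothetical, but matching the 2 s chunks and 13 frames per second above) maps a chunk and frame index back to a timestamp, consistent with the timings in the decoder output shown further down:

/// Constants from the description above: 2-second chunks, 13 output frames per second.
const CHUNK_SECONDS: f64 = 2.0;
const FRAMES_PER_SECOND: f64 = 13.0;

/// Timestamp (in seconds) of the `frame`-th output of the chunk with index `chunk_index`.
fn frame_time(chunk_index: usize, frame: usize) -> f64 {
    chunk_index as f64 * CHUNK_SECONDS + frame as f64 / FRAMES_PER_SECOND
}

fn main() {
    // The chunk starting at 32 s (chunk 16) yields frames at 32.000, 32.077, 32.154, ...
    for frame in 0..3 {
        println!("{:.3}", frame_time(16, frame));
    }
}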

Audio decoding and resampling are done using the gstreamer Rust bindings.

Inference is implemented with the Rust onnxruntime bindings and takes ~50-100 ms for each chunk on a laptop CPU (AMD Ryzen 7 3700U), namely ~20-40x real-time speed. Despite the inputs and outputs having a fixed size, there is quite some variance, in particular in the first few batches (possibly from onnxruntime optimizations).

A simple greedy decoding is performed, where the most probable token is selected for each output sample. Then, words are reassembled from tokens, by considering blanks and repetitions.

          _ 1.00 00:00:32,000
        for 0.99 00:00:32,077
            0.95 00:00:32,154
            0.98 00:00:32,231
        one 1.00 00:00:32,308
            0.77 00:00:32,385
            1.00 00:00:32,462
       more 1.00 00:00:32,539
            0.89 00:00:32,616
            1.00 00:00:32,693
         an 1.00 00:00:32,770
         an 1.00 00:00:32,847
         im 1.00 00:00:32,924
         al 0.93 00:00:33,001
         al 0.76 00:00:33,078
            0.78 00:00:33,155
            0.70 00:00:33,232
            0.93 00:00:33,309
            0.60 00:00:33,386
          s 0.79 00:00:33,463
         we 0.92 00:00:33,540
         ar 0.96 00:00:33,617
          _ 0.50 00:00:33,694
          _ 0.51 00:00:33,771
          _ 0.89 00:00:33,848
          _ 0.98 00:00:33,925
 for / one / more / animal / swear
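
A minimal sketch of such a greedy decoding (assuming one probability row per output frame and, hypothetically, that index 0 is the blank token; the assembly of words from sub-word tokens is omitted):

/// Greedy decoding sketch: argmax per frame, then drop blanks and collapse repetitions.
/// `frames` holds one probability distribution over the vocabulary per output frame.
fn greedy_decode(frames: &[Vec<f32>], vocab: &[&str]) -> Vec<(String, f32)> {
    const BLANK: usize = 0; // assumed index of the blank token
    let mut tokens = Vec::new();
    let mut prev = BLANK;
    for probs in frames {
        // Most probable token for this frame.
        let (best, conf) = probs
            .iter()
            .enumerate()
            .fold((BLANK, f32::NEG_INFINITY), |acc, (i, &p)| {
                if p > acc.1 { (i, p) } else { acc }
            });
        // Keep the token only if it is neither a blank nor a repetition of the previous frame.
        if best != BLANK && best != prev {
            tokens.push((vocab[best].to_string(), conf));
        }
        prev = best;
    }
    tokens
}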

This could be improved in many ways, for example by using beam search, scoring with the word probabilities and/or a language model. It would also be possible to check and correct words that do not exist (e.g. nat,j,ure is decoded to natjure), but robustness to misspellings will be taken into account in the score used for alignment.

Speed: 20-40x real time on the CPU of an average laptop.

Step 3: Word alignment

The previous steps, STT and subtitle parsing/decomposition, yield two sets of Words, namely text with timestamp and confidence.

In this step, the Needleman-Wunsch algorithm is applied to find a global alignment between these two sets. Gaps can be inserted, displayed below as _.


Extract from an alignment, with a high density of matches.
The subtitles are on the left, the transcription on the right.
Note the transcription errors: missing “it was”, “catch up” vs “ketchup”,
“oh” is “or”, “them” vs “their”, and “hat so” vs “hats”.

Every pair of words is compared and assigned a score. Each word is first normalized with respect to punctuation and capitalization. The score is then simply the product of the two confidences if the normalized words match, and 0 otherwise.
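
As a sketch, using the Word struct from step 1 (the exact normalization is not spelled out above, so the one below is only an example):

/// Example normalization: drop punctuation and lowercase.
fn normalize(word: &str) -> String {
    word.chars()
        .filter(|c| c.is_alphanumeric())
        .collect::<String>()
        .to_lowercase()
}

/// Score of a (subtitle word, transcription word) pair: product of the confidences
/// on an exact match of the normalized texts, zero otherwise.
fn word_score(a: &Word, b: &Word) -> f32 {
    if normalize(&a.text) == normalize(&b.text) {
        a.confidence * b.confidence
    } else {
        0.0
    }
}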

More complex normalization could of course be considered, as well as less strict metrics, such as the Levenshtein edit distance.

Gaps are assigned a constant score, which should reflect the fact that gaps are more acceptable than incorrect matches, which would pollute the transformation fitting.

The Needleman-Wunsch dynamic programming algorithm then finds the alignment that maximizes the score.

pub struct Alignment<'a, T> {
    pub score: f32,
    pub choices: Vec<(Option<&'a T>, Option<&'a T>)>,
}
impl<'a, T> Alignment<'a, T> {
    pub fn compute(
        left: &'a [T],
        right: &'a [T],
        gap_penalty: f32,
        score_fn: impl Fn(&T, &T) -> f32,
    ) -> Result<Self, errors::Error> {
        // ..
    }
}
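
For reference, here is a self-contained, index-based sketch of the Needleman-Wunsch recurrence and traceback (not the actual compute implementation above):

/// Needleman-Wunsch sketch: returns aligned index pairs into `left`/`right`, `None` marking gaps.
fn needleman_wunsch<T>(
    left: &[T],
    right: &[T],
    gap_penalty: f32,
    score_fn: impl Fn(&T, &T) -> f32,
) -> Vec<(Option<usize>, Option<usize>)> {
    let (n, m) = (left.len(), right.len());
    // dp[i][j]: best score for aligning left[..i] with right[..j].
    let mut dp = vec![vec![0.0f32; m + 1]; n + 1];
    for i in 1..=n {
        dp[i][0] = dp[i - 1][0] + gap_penalty;
    }
    for j in 1..=m {
        dp[0][j] = dp[0][j - 1] + gap_penalty;
    }
    for i in 1..=n {
        for j in 1..=m {
            let diag = dp[i - 1][j - 1] + score_fn(&left[i - 1], &right[j - 1]);
            let up = dp[i - 1][j] + gap_penalty; // gap on the right sequence
            let lft = dp[i][j - 1] + gap_penalty; // gap on the left sequence
            dp[i][j] = diag.max(up).max(lft);
        }
    }
    // Traceback from the bottom-right corner.
    let mut choices = Vec::new();
    let (mut i, mut j) = (n, m);
    while i > 0 || j > 0 {
        if i > 0 && j > 0 && dp[i][j] == dp[i - 1][j - 1] + score_fn(&left[i - 1], &right[j - 1]) {
            choices.push((Some(i - 1), Some(j - 1)));
            i -= 1;
            j -= 1;
        } else if i > 0 && dp[i][j] == dp[i - 1][j] + gap_penalty {
            choices.push((Some(i - 1), None));
            i -= 1;
        } else {
            choices.push((None, Some(j - 1)));
            j -= 1;
        }
    }
    choices.reverse();
    choices
}

Combined with a word score like the one sketched earlier and a suitable gap penalty, this produces alignments like the one in the figure above.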

The actual implementation runs quite fast:

Computing scores with 486 vs 849 words
Done computing scores, 2393 nonzero values
Done computing alignment in 63 ms

The second line shows that less than 1% of the words have a nonzero score. A simple sparse matrix structure is used to hold these.

Speed: 60 ms on the same hardware as above, for a 10-minute short film. In general, the complexity is proportional to the product of the numbers of subtitle and transcription words.

Interlude: Testing Steps 1-3

The pipeline so far can be tested by plotting the two sets of words with the alignments.

In the following plots, for the Blender Studio films, the STT words are at the top (in orange) and the subtitle words at the bottom (in blue). The horizontal axis is time.

Given that the subtitles were synchronized, non-vertical lines denote errors in the alignment. As the alignment does not depend on the timestamp, the observations would still apply if the subtitles were not synchronized.

Name              | Word ratio | Aligned | Bad matches | Distance [s]
Agent 327         | 199%       | 55%     | 0           | -0.26 ± 0.34
Cosmos Laundromat | 113%       | 57%     | 0           | -0.33 ± 0.49
Elephants Dream   | 145%       | 37%     | 0           | -2.51 ± 10.3
Sprite Fright     | 175%       | 40%     | 0           | 0.65 ± 5.82

The alignment percentage is computed with respect to the words from the subtitles.
The “word ratio” is the ratio of words in the transcriptions over words in the subtitles.
The “Distance” column indicates the mean and standard deviation of the time difference between matches. For already aligned subtitles, this should be close to zero.

Alignment plots for Agent 327, Cosmos Laundromat, Elephants Dream, and Sprite Fright.

This is great overall: there is a large number of matches in each of the sections with spoken content, and most of them are correct.

Keeping only the matched pairs, a set of (Word, Word) pairs is obtained, corresponding respectively to a subtitle and transcription word.

Step 4: Fitting

As explained earlier, this step is responsible for estimating the time transformation $T$ from the time frame of the subtitles to that of the transcription, using the pairs of timestamps $(t_1, t_2)$ arising from the pairs of matched words $(w_1, w_2)$ from the previous step.


A linear and a piecewise-linear transformation (extra transcription content).

Many of the existing methods (see above) assume that the transformation is a simple offset $T(x)=x+b$ or a linear transformation $T(x)=ax+b$ (to handle e.g. framerate changes). In this case, a simple least squares fit is already quite efficient (see the left plot above).

To handle the case where there is extra content in either the transcription or the audio (e.g. right plot above), piecewise linear transformations are considered.

It is important to note that “segmented regression” libraries such as pwlf (Python), piecewise-regression (Python) or plr (Rust) solve a different problem, where the functions are assumed to be continuous. They perform either a grid search over breakpoints, or a joint optimization of breakpoints and coefficients. In the case at hand, the functions are discontinuous by definition, corresponding rather to the “stegmented” model handled (for a single breakpoint) by chngpt.

The strategy used for the fit is to:

  1. Remove outlier pairs.
  2. Cluster the remaining timestamp pairs, each cluster corresponding to one linear piece.
  3. Fit a linear function on each cluster by least squares.

The number of clusters is selected by optimizing a criterion that trades off a low fitting loss against a low number of clusters (a complexity/generalization trade-off).
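
A sketch of the per-cluster least-squares fit and of the penalized selection of the number of clusters; the clustering itself is abstracted behind a hypothetical `cluster` closure, the penalty weight is an arbitrary example parameter, and the least-squares helper from the earlier sketch is repeated for completeness:

/// Ordinary least-squares fit of y ≈ a·x + b on one cluster of timestamp pairs.
fn fit_line(points: &[(f64, f64)]) -> (f64, f64) {
    let n = points.len() as f64;
    let (sx, sy) = points
        .iter()
        .fold((0.0, 0.0), |(sx, sy), &(x, y)| (sx + x, sy + y));
    let (mx, my) = (sx / n, sy / n);
    let (mut cov, mut var) = (0.0, 0.0);
    for &(x, y) in points {
        cov += (x - mx) * (y - my);
        var += (x - mx) * (x - mx);
    }
    let a = if var == 0.0 { 1.0 } else { cov / var };
    (a, my - a * mx)
}

/// Sum of squared residuals over all clusters, each fitted with its own line.
fn piecewise_loss(clusters: &[Vec<(f64, f64)>]) -> f64 {
    clusters
        .iter()
        .map(|c| {
            let (a, b) = fit_line(c);
            c.iter().map(|&(x, y)| (y - a * x - b).powi(2)).sum::<f64>()
        })
        .sum()
}

/// Select the number of clusters by trading off fit quality against complexity.
/// `cluster` is a hypothetical helper splitting the samples into `k` groups.
fn select_clusters(
    samples: &[(f64, f64)],
    max_k: usize,
    penalty: f64,
    cluster: impl Fn(&[(f64, f64)], usize) -> Vec<Vec<(f64, f64)>>,
) -> usize {
    (1..=max_k)
        .min_by(|&k1, &k2| {
            let c1 = piecewise_loss(&cluster(samples, k1)) + penalty * k1 as f64;
            let c2 = piecewise_loss(&cluster(samples, k2)) + penalty * k2 as f64;
            c1.partial_cmp(&c2).unwrap()
        })
        .expect("max_k must be at least 1")
}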


Fitting a piecewise-linear time transformation.
The blue line is the true function and the red lines are the fits.

The following plots display the same data as earlier — the result of the alignment — but with the setup of the above plot (i.e. transcription on the vertical axis). The dotted diagonal line is $y=x$, on which all matches should lie given that the subtitles are aligned.

Timestamp-pair plots (subtitle time vs. transcription time) for Agent 327, Cosmos Laundromat, Elephants Dream, and Sprite Fright.

The outliers would not be an issue when fitting a global linear transformation, but it is important to remove them to be able to handle more complex functions $T$.

Speed: Almost instantaneous.

Step 5: Resynchronization

Once the time transformation $T$ has been estimated, resynchronization consists simply in applying $T$ to the original subtitles.
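
A sketch of this final step, with a hypothetical piecewise-linear representation of $T$ and simplified (start, end, text) entries, all times in seconds:

/// One linear piece of the fitted transformation, valid on [start, end) in subtitle time.
struct Piece {
    start: f64,
    end: f64,
    a: f64,
    b: f64,
}

/// Apply the fitted piecewise-linear transformation to a timestamp (in seconds).
/// Timestamps outside every piece fall back to the last piece (an arbitrary sketch choice).
fn apply(pieces: &[Piece], t: f64) -> f64 {
    let piece = pieces
        .iter()
        .find(|p| t >= p.start && t < p.end)
        .unwrap_or_else(|| pieces.last().expect("at least one piece"));
    piece.a * t + piece.b
}

/// Resynchronize: retime the start and end of every (start, end, text) entry.
fn resynchronize(entries: &mut [(f64, f64, String)], pieces: &[Piece]) {
    for (start, end, _text) in entries.iter_mut() {
        *start = apply(pieces, *start);
        *end = apply(pieces, *end);
    }
}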

Speed: Almost instantaneous.

In part II

The second part of this note will (when time permits):