Zola syntax highlighting with Pygments

2023-07-29 #rust

Replacing syntect with pygments as Zola’s syntax highlighter.

This website is created using the Zola static website engine, combining HTML templates (with Tera) and Markdown files.

Zola highlights code blocks such as

async fn test() {
    todo!()
}

using the syntect Rust crate, which is based on the syntaxes from Sublime Text.

Issues with `syntect` and possible solutions

Unfortunately, syntect uses relatively old Sublime syntaxes, and updating them is not trivial. This results, for example, in:

async and await keywords not being highlighted correctly in Rust.
Useful syntaxes missing, such as a console mode.

This problem is discussed at length in this Zola issue; the two options considered there are migrating from syntect to:

tree-sitter; the GitHub issue contains a prototype branch using the tree-painter crate.
a Rust port of the pygments Python library, which works quite well despite performing simpler parsing than tree-sitter.

Syntax highlighting with Pygments

In the meantime, we can do a really dirty hack and call pygments from Rust, not even via pyo3 but via the pygmentize CLI, paying for the initialization of the Python runtime at each call.

Processing an average-sized code block with pygmentize takes roughly 100 ms. Given that Zola is a static website generator and that pages are built in parallel, increased compilation time is a problem mostly for live previewing (zola serve).

To speed things up a bit, we can implement caching based on the code block contents, so that each code block needs to be processed only once. Overall, the change is focused on the CodeBlock::highlight method:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::io::Write;
use std::process::{Command, Stdio};

impl CodeBlock {
    pub fn highlight(&mut self, content: &str) -> errors::Result<String> {
        // Set up the cache directory (`dirs` is an external crate)
        let cache = dirs::cache_dir().unwrap();
        let cache = cache.join("zola");
        std::fs::create_dir_all(&cache)?;

        // Check the cache: the key is a hash of the block contents and its language
        let mut hasher = DefaultHasher::new();
        content.hash(&mut hasher);
        self.language.hash(&mut hasher);
        let hash = hasher.finish();
        let filename = cache.join(hash.to_string());
        if filename.is_file() {
            return Ok(std::fs::read_to_string(filename)?);
        }
        // This is critical to not mess up the recursive parsing in markdown_to_html:
        // empty lines are replaced with a single space
        let content = content.lines().map(|l| if l.is_empty() { " " } else { l })
                             .collect::<Vec<_>>().join("\n");
        // Run through pygmentize if necessary
        let mut child = Command::new("pygmentize")
            .args(["-l", &self.language, "-P", "style=nord", "-P", "nowrap",
                   "-f", "html"])
            .stdout(Stdio::piped())
            .stdin(Stdio::piped())
            .spawn()?;

        // `write_all` rather than `write`, which may only write part of the buffer
        let mut stdin = child.stdin.take().unwrap();
        stdin.write_all(content.as_bytes())?;
        // Close stdin so that pygmentize sees EOF and starts producing output
        drop(stdin);

        let output = child.wait_with_output()?;
        let output = String::from_utf8(output.stdout)?;
        // Write cache
        std::fs::write(filename, &output)?;
        Ok(output)
    }
}
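
The caching logic can also be looked at separately from the pygmentize call. The following is a minimal, standard-library-only sketch of that idea, not the code that went into Zola: `cache_key` and `cached_or_render` are hypothetical names, the cache directory is passed in rather than looked up with `dirs`, and the renderer is an arbitrary closure standing in for the pygmentize invocation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::Path;

/// Derive a cache key from the code block contents and its language,
/// hashing them in the same order as `highlight` does.
fn cache_key(content: &str, language: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    language.hash(&mut hasher);
    hasher.finish()
}

/// Return the cached output for this (content, language) pair if a cache
/// file exists; otherwise call `render`, store its result, and return it.
/// `render` is a stand-in for the actual highlighter call.
fn cached_or_render<F>(
    cache_dir: &Path,
    content: &str,
    language: &str,
    render: F,
) -> io::Result<String>
where
    F: FnOnce() -> io::Result<String>,
{
    fs::create_dir_all(cache_dir)?;
    let path = cache_dir.join(cache_key(content, language).to_string());
    if path.is_file() {
        // Cache hit: the renderer is never invoked
        return fs::read_to_string(path);
    }
    let output = render()?;
    fs::write(&path, &output)?;
    Ok(output)
}
```

Calling `cached_or_render` twice with the same contents and language runs the closure only the first time. One caveat: the standard library documents the algorithm behind `DefaultHasher` as unspecified and subject to change between Rust releases, so after a toolchain update old entries may simply never be hit again. For a cache, that only costs one extra render per block.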

The CSS must be included on the page separately after being generated with:

$ pygmentize -S nord -f html > highlighting.css

Timings

This results in the following global generation timings for this website:

Zola highlighting	Generation time [ms]
`syntect`	190
`pygments`	950
`pygments` (cached)	90

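Assuming that only one block of code changes at a time, and that such changes are infrequent compared to text edits, this is reasonable: only the modified block goes through pygmentize again, while all others are served from the cache.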

Follow-ups

With a bit more care, we could:

Read the style from the configuration rather than hard-coding it.
Support the line annotations from Zola (numbering, selecting, highlighting, hiding).
Avoid loading Python and the pygments module repeatedly, although that might conflict with the GIL.
Implement caching for syntect too.
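
As an illustration of the first item, the pygmentize arguments could be built from a style name instead of a literal. This is only a sketch: `pygmentize_args` is a hypothetical helper, and where the `style` value would come from in Zola's configuration is left open.

```rust
/// Build the argument list passed to `pygmentize` for one code block.
/// `style` is assumed to come from the site configuration (hypothetical);
/// the other flags are the ones used in `highlight`.
fn pygmentize_args(language: &str, style: &str) -> Vec<String> {
    vec![
        "-l".to_string(),
        language.to_string(),
        "-P".to_string(),
        format!("style={}", style),
        "-P".to_string(),
        "nowrap".to_string(),
        "-f".to_string(),
        "html".to_string(),
    ]
}
```

The result can be passed to `Command::new("pygmentize").args(...)` in place of the hard-coded array. The same style name would also have to be used when generating `highlighting.css`.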
