Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace client-side highlight.js by server-side syntect highlighting #1842

Merged
merged 15 commits into from
Nov 13, 2022

Conversation

Nemo157
Copy link
Member

@Nemo157 Nemo157 commented Sep 17, 2022

This is setup to only support the same languages we targeted before: Rust, Markdown and TOML.

fixes #1652

TODO in this PR:

  • some screenshots/comparisons with current highlighting
  • is there a better alternative than submodules for getting the syntax files

Future extensions:

  • we could load more syntax definitions, either from the main Packages subrepository, or there's a collection of third-party syntaxes at https://github.com/sharkdp/bat/tree/master/assets/syntaxes/02_Extra
  • the mapping from syntect classes to colors could be extended, I just added a minimal set of rules to get some samples rendering well, but I almost certainly missed some common ones

@github-actions github-actions bot added the S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed label Sep 17, 2022
@syphar syphar added S-waiting-on-author Status: This PR is incomplete or needs to address review comments and removed S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed labels Sep 17, 2022
@Nemo157 Nemo157 force-pushed the syntect branch 2 times, most recently from 1880e7c to c1dcb38 Compare October 15, 2022 13:14
@Nemo157
Copy link
Member Author

Nemo157 commented Oct 15, 2022

Screenshots (original on left, updated on right):

TOML and Rust rendered as part of a readme, https://docs.rs/crate/bpaf/0.6.1:
image

The same section of readme in markdown source, https://docs.rs/crate/bpaf/0.6.1/source/README.md :
image

Unsupported filetype, https://docs.rs/crate/bpaf/0.6.1/source/.gitignore :
image

@GuillaumeGomez
Copy link
Member

Wouldn't it be better to store the rendered content directly?

@Nemo157
Copy link
Member Author

Nemo157 commented Oct 15, 2022

I couldn't find a better alternative than submodules for the assets.

@GuillaumeGomez
Copy link
Member

I'm not sure if you're answering my question. In case you are, I'll try to reword it better:

Instead of highlighting every time the page is requested, why not store the rendered page to not recompute it every time?

@Nemo157
Copy link
Member Author

Nemo157 commented Oct 15, 2022

Wouldn't it be better to store the rendered content directly?

That would mean that any improvements to the highlighting wouldn't apply to old files. Highlighting is very cheap, and these pages aren't requested that often compared to rustdoc pages.

@GuillaumeGomez
Copy link
Member

I see, makes sense. Thanks for the explanation! 👍

@Nemo157 Nemo157 marked this pull request as ready for review October 15, 2022 14:11
@Nemo157 Nemo157 added S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed and removed S-waiting-on-author Status: This PR is incomplete or needs to address review comments labels Oct 15, 2022
Copy link
Member

@syphar syphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this!

Parts of the restructuring could be a chance to write some unit-tests for these ( like from_path_and_mime or web::markdown, but that's not a blocker.

I did some manual testing, comparing the current source view with this one.

Looking at: https://docs.rs/crate/regex/1.6.0

  • the code examples in the readme are not all hilighted (example: "Usage: Avoid compiling the same regex in a loop")

Looking at: https://docs.rs/crate/itertools/0.10.3/source/
Files that are currently highlited but aren't with this PR any more:

  • .rustfmt.toml
  • Cargo.lock
  • Cargo.toml.orig
  • .github/workflows/ci.yml (here I'm not sure which highlighting is applied currently, could not yaml)

looking at https://docs.rs/crate/regex/1.6.0/source/

  • test

Of course we could decide not to highlight certain file-types, though I would prefer to kepe highlighting for commonly used things (cargo.lock is a good example for this )

README.md Show resolved Hide resolved
@syphar syphar added S-waiting-on-author Status: This PR is incomplete or needs to address review comments and removed S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed labels Oct 16, 2022
@Nemo157
Copy link
Member Author

Nemo157 commented Oct 16, 2022

Fixed the readme block, .rustfmt.toml, Cargo.lock and Cargo.toml.orig. I hadn't realised the highlight.min.js actually contains some default languages, including bash used for the test file (I guess via first-line detection) and less used for the ci.yml (not sure why it detected this instead of the builtin yaml syntax it also supports). I'll take a look at what it supports vs. what's in the default sublime package and add some extras to our set.

@syphar
Copy link
Member

syphar commented Oct 17, 2022

I'll take a look at what it supports vs. what's in the default sublime package and add some extras to our set.

Feel free to ping me / request a review when I should have another look.

Thank you!

@Nemo157
Copy link
Member Author

Nemo157 commented Oct 17, 2022

Extracted the full list of languages from the version of highlight.js we had: apache, bash, c, coffeescript, cpp, csharp, css, diff, go, http, ini, java, javascript, json, kotlin, less, lua, makefile, markdown, nginx, objectivec, perl, php, php-template, plaintext, properties, python, python-repl, r, ruby, rust, scss, shell, sql, swift, typescript, vbnet, xml, yaml

The languages included in the base sublime syntax packages: ActionScript, AppleScript, ASP, Batch, File, C#, C++, Clojure, CSS, D, Diff, Erlang, Git, Formats, Go, Graphviz, Groovy, Haskell, HTML, Java, JavaScript, JSON, LaTeX, LICENSE, Lisp, Lua, Makefile, Markdown, Matlab, Objective-C, OCaml, Pascal, Perl, PHP, Python, R, Rails, Regular Expressions, RestructuredText, Ruby, Rust, Scala, ShellScript, SQL, TCL, Text, Textile, XML, YAML

The major missing languages I can see are: ini, kotlin, swift, typescript, but these are all very niche and could be left for a future PR.

@Nemo157
Copy link
Member Author

Nemo157 commented Oct 17, 2022

To load all the syntaxes it turns out syntect doesn't support all the latest definitions features (trishume/syntect#323) so had to revert the submodule to the v4050 tag. The rust syntax seem not to have had many changes, but the older markdown syntax doesn't support the embedded languages that let it highlight the rust code blocks while rendering the markdown source.

YAML (https://docs.rs/crate/itertools/0.10.3/source/.github/workflows/ci.yml):

image

JSON (https://docs.rs/crate/itertools/0.10.3/source/.cargo_vcs_info.json):

image

Copy link
Member

@syphar syphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did more testing with the last state:

not hilighted:

templates/crate/source.html Show resolved Hide resolved
@Nemo157
Copy link
Member Author

Nemo157 commented Oct 18, 2022

I don't know how we are even finding that readme to render, it's not set via package.readme in the Cargo.toml and crates.io doesn't see it https://crates.io/crates/itertools/0.10.1, the builder code is only supposed to look for README.md:

let readme_path = source_dir.join(pkg.readme.as_deref().unwrap_or("README.md"));

highlight.js is detecting that codeblock as apache for some reason. There is no first_line_match defined for the Rust syntax file so syntect can't ever detect an unannotated block as it, in general it seems it's much less aggressive about detecting every codeblock as something and only detects relatively unambiguous things like shebang lines or DOCTYPE.

@Nemo157
Copy link
Member Author

Nemo157 commented Oct 18, 2022

Ah, it's not the readme, that's the crate-level docs being very hackily extracted:

fn read_rust_doc(file_path: &Path) -> Result<Option<String>> {
let reader = fs::File::open(file_path).map(BufReader::new)?;
let mut rustdoc = String::new();
for line in reader.lines() {
let line = line?;
if line.starts_with("//!") {
// some lines may or may not have a space between the `//!` and the start of the text
let line = line.trim_start_matches("//!").trim_start();
if !line.is_empty() {
rustdoc.push_str(line);
}
rustdoc.push('\n');
}
}
if rustdoc.is_empty() {
Ok(None)
} else if rustdoc.len() > 51200 {
Ok(Some(format!(
"(Library doc comment ignored due to being too long. ({} > 51200))",
rustdoc.len()
)))
} else {
Ok(Some(rustdoc))
}
}

We should probably stop doing that, it doesn't make sense when those docs are being rendered as part of the documentation anyway...

@syphar
Copy link
Member

syphar commented Oct 18, 2022

now that's a neat edge-case :D

We should probably stop doing that, it doesn't make sense when those docs are being rendered as part of the documentation anyway...

I agree, IMO we don't have to show more than crates.io here.

Copy link
Member

@jyn514 jyn514 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really hate manually managing submodules; can we clone them in the build script instead of requiring people to deal with it themselves?

Other than that this seems reasonable; @jsha asked a while ago why are we even syntax highlighting in the first place, and that seems good to keep in the back of our minds, but the new highlighting looks reasonable.

Copy link
Member

@syphar syphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While doing a last test I locally saw some higher response times with bigger source files ( > 2 seconds, for example from the regex crate), but this was better in the release build.

Also these pages are candidates for caching anyways when that's live.

@Nemo157
Copy link
Member Author

Nemo157 commented Oct 19, 2022

I really hate manually managing submodules; can we clone them in the build script instead of requiring people to deal with it themselves?

I doubt these will get more than 1 update per year, the overhead of working out how to manage them seems higher than the overhead of running git submodule update --init once per clone. (First thought I had was: how do you distinguish between an outdated submodule and a wip update of a submodule?).

@syphar
Copy link
Member

syphar commented Nov 12, 2022

@Nemo157 I saw you rebased / added a new commit, feel free to ping me when this is ready again

@Nemo157
Copy link
Member Author

Nemo157 commented Nov 12, 2022

We can only clone it if the directory doesn't exist yet.

Finally got round to implementing this and rebasing to resolve the workflow conflicts. (Then removed it again after the following changes 😁)

@Nemo157 one thought about this: if these will really get only that little updates, why not just copy them over? Similar to what we do with font-awesome?

It's ~1.3k files/2.8MB, which I guess is actually smaller than the font-awesome files. I've switched to using git subtree --squash to vendor them (I didn't realize it supported --squash before, otherwise I probably would have done this from the start).

@Nemo157 Nemo157 added S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed and removed S-waiting-on-author Status: This PR is incomplete or needs to address review comments labels Nov 12, 2022
@syphar syphar merged commit a43b0fb into rust-lang:master Nov 13, 2022
@github-actions github-actions bot added S-waiting-on-deploy This PR is ready to be merged, but is waiting for an admin to have time to deploy it and removed S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed labels Nov 13, 2022
@syphar syphar removed the S-waiting-on-deploy This PR is ready to be merged, but is waiting for an admin to have time to deploy it label Nov 13, 2022
@Nemo157 Nemo157 deleted the syntect branch May 31, 2023 22:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Look into replacing highlight.js with server-side highlighting
4 participants