Database build scripts #319

@zfLQ2qx2

I converted the database build scripts to Go over the holiday. When I compared the Go output to the original Bash + Python scripts, I noticed the following issues with the original results:

  1. The last item in each import statement is either being dropped or is picking up the trailing ); as part of the title or link target.
  2. Titles or link targets with embedded commas are being truncated at the first comma.
  3. While the page IDs in each record of links.with_counts.txt.gz are sorted (alphabetically, not numerically), the records themselves are in a non-deterministic hash order.
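To make the three points above concrete, here is a minimal Go sketch of one way to avoid them. It is not the code from the conversion described in this issue. It assumes the input is a MySQL-style `INSERT ... VALUES (...),(...);` line; the table name `links`, the two-column layout, and the helpers `parseTuples` and `sortedKeys` are hypothetical. A small state machine tracks whether it is inside a quoted string, so embedded commas and a `);` inside a title stay part of the field, and the final tuple is closed by its own `)` rather than by a pattern match. Escape handling is deliberately simplified: the byte after a backslash is kept as-is.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseTuples extracts the rows from the VALUES part of a MySQL-style
// INSERT statement, e.g. "INSERT INTO t VALUES (1,'a,b'),(2,'c');".
// Commas, parentheses and semicolons inside single-quoted strings are
// treated as data. Simplification: the byte following a backslash is
// copied literally, so \' becomes ' (and \n would become n).
func parseTuples(stmt string) [][]string {
	var rows [][]string
	var row []string
	var field strings.Builder
	inTuple, inQuote := false, false

	start := strings.Index(stmt, "VALUES")
	if start < 0 {
		return nil
	}
	for i := start + len("VALUES"); i < len(stmt); i++ {
		c := stmt[i]
		switch {
		case inQuote && c == '\\' && i+1 < len(stmt):
			i++ // skip the backslash, keep the escaped byte
			field.WriteByte(stmt[i])
		case inQuote && c == '\'':
			inQuote = false
		case inQuote:
			field.WriteByte(c)
		case !inTuple && c == '(':
			inTuple = true
		case !inTuple:
			// Between tuples: skip separators, whitespace and the final ';'.
		case c == '\'':
			inQuote = true
		case c == ',':
			row = append(row, field.String())
			field.Reset()
		case c == ')':
			// End of tuple: flush the last field, so it is never dropped.
			row = append(row, field.String())
			field.Reset()
			rows = append(rows, row)
			row = nil
			inTuple = false
		default:
			field.WriteByte(c)
		}
	}
	return rows
}

// sortedKeys returns the map's page IDs in ascending numeric order, so
// output written by ranging over the result is the same on every run.
func sortedKeys(links map[int][]string) []int {
	ids := make([]int, 0, len(links))
	for id := range links {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	return ids
}

func main() {
	// Hypothetical sample line; real dumps have more columns.
	stmt := "INSERT INTO `links` VALUES (10,'Foo,_bar'),(9,'Baz\\'s_(x);'),(10,'Qux');"
	links := map[int][]string{}
	for _, row := range parseTuples(stmt) {
		id, err := strconv.Atoi(row[0])
		if err != nil {
			continue // skip rows whose first column is not a numeric ID
		}
		links[id] = append(links[id], row[1])
	}
	for _, id := range sortedKeys(links) {
		fmt.Printf("%d\t%s\n", id, strings.Join(links[id], "|"))
	}
}
```

Because the IDs are sorted as integers, record 9 is written before record 10; a string sort would put them the other way round. Ranging over the map directly would give a different order on each run, since Go randomizes map iteration order.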

Just FYI: there is so much garbage in the link targets in particular, it’s amazing that sed is doing as well as it does!
