I converted the database build scripts to Go over the holiday. When I compared the Go output to the original Bash + Python scripts, I noticed the following issues with the original results:
- The last item in each import statement is either dropped or picks up the trailing `);` as part of its title or link target.
- Titles or link targets with embedded commas are being truncated at the first comma.
- While the page IDs in each record of links.with_counts.txt.gz are sorted (alphabetically, not numerically), the records themselves are in a non-deterministic hash order.
Just FYI: there is so much garbage in the link targets in particular, it’s amazing that sed is doing as well as it does!