This repository contains a Python script that crawls a list of Flow-related sites, GitHub repositories, and GitHub discussions every day and converts them into Markdown files. The resulting .md files are intended for AI ingestion, Retrieval-Augmented Generation (RAG) pipelines, chatbots, or any other knowledge base platform that benefits from structured text.
We want a single repository that periodically crawls all relevant Flow ecosystem content—documentation, code examples, and community discussions—and stores them in a consolidated Markdown format. You can then feed these files into:
- ChatGPT plugins (for enhanced Q&A)
- Retrieval-Augmented Generation (indexing and searching them in a vector database)
- Discord/Telegram bots that cite official doc sections
- Any other knowledge base for advanced Q&A or search.
The Python script performs a domain-limited BFS (Breadth-First Search) and applies specialized scraping logic based on each URL:
- Non-GitHub URLs are treated as “normal” websites.
  - The script fetches each page and removes <script>, <style>, and <noscript> tags.
  - Then it uses markdownify to convert the remaining HTML into Markdown.
  - It recurses only within the same domain to avoid crawling unrelated pages.
- For GitHub repo links like https://github.com/onflow/flow-ft/, the script visits:
  - The repo root
  - tree/(main|master)/... subdirectories
  - blob/(main|master)/... file pages
  - Files with certain extensions (like .cdc, .md, .json, etc.) or any README are downloaded in their raw form from raw.githubusercontent.com.
  - The file contents are saved in a .md file, wrapped in triple backticks for easy code parsing.
- For https://github.com/orgs/onflow/discussions, the script:
  - Crawls the listing pages and discovers discussion links like /orgs/onflow/discussions/1330.
  - For each thread, it extracts only the text from user posts (skipping the GitHub UI) and converts it to Markdown.
  - This yields .md files containing the original question and comments/replies.
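The domain-limited BFS described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in scraper.py: the function name `crawl_same_domain`, the `get_links` callback, and the `max_pages` cap are all assumptions made for the example. In a real run, `get_links` would fetch the page and return its href values; here a small in-memory "site" stands in for the network.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl_same_domain(start_url, get_links, max_pages=100):
    """Breadth-first crawl that never leaves the domain of start_url.

    get_links(url) is a caller-supplied (hypothetical) function that
    returns the raw href values found on that page.
    """
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    order = []

    while queue and len(order) < max_pages:
        url = queue.popleft()  # FIFO order is what makes this breadth-first
        order.append(url)
        for href in get_links(url):
            # Resolve relative links and drop any "#fragment" suffix.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc != domain:
                continue  # off-domain link: skip it
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


# Illustrative in-memory site; example.org is a placeholder domain.
fake_site = {
    "https://example.org/": ["/a", "/b", "https://other.example/x"],
    "https://example.org/a": ["/b", "/c#top"],
    "https://example.org/b": [],
    "https://example.org/c": ["/"],
}
pages = crawl_same_domain("https://example.org/", lambda u: fake_site.get(u, []))
print(pages)
```

The `seen` set is what keeps the crawl from looping when pages link back to each other, and comparing `netloc` values is one simple way to stay on a single domain.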
- Python 3.7+
- requests, beautifulsoup4, markdownify
Install all dependencies:
pip install requests beautifulsoup4 markdownify
Clone or download this repo locally.
In the repo directory, run:
python scraper.py
The script will crawl each site listed in SITES (inside scraper.py) and output the results under scraped_docs/.
Inside scraper.py, near the top, you’ll see:
SITES = [
    "https://developers.flow.com/",
    "https://academy.ecdao.org/en/cadence-by-example",
    ...
    "https://github.com/onflow/flow-ft/",
    ...
    "https://github.com/orgs/onflow/discussions"
]
- Add a docs site by appending its URL if it’s not on GitHub.
- Add a GitHub repo by appending the base URL (e.g. "https://github.com/onflow/another-repo").
- Add another GitHub Discussions page if needed.
- Remove any site by deleting or commenting out its line.
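As a rough illustration of how the three kinds of SITES entries could be told apart, here is a small sketch. The function name `classify_site`, the two regular expressions, and the choice to fall back to the generic website crawler for other GitHub pages are assumptions made for this example; scraper.py may decide differently.

```python
import re
from urllib.parse import urlparse

# Organisation discussions listing, e.g. /orgs/onflow/discussions
DISCUSSIONS_RE = re.compile(r"^/orgs/[^/]+/discussions/?$")
# Repository root, e.g. /onflow/flow-ft/
REPO_ROOT_RE = re.compile(r"^/[^/]+/[^/]+/?$")


def classify_site(url):
    """Return 'website', 'github_repo', or 'discussions' for a SITES entry."""
    parsed = urlparse(url)
    if parsed.netloc.lower() != "github.com":
        return "website"
    if DISCUSSIONS_RE.match(parsed.path):
        return "discussions"
    if REPO_ROOT_RE.match(parsed.path):
        return "github_repo"
    # Assumption: any other GitHub page is handled like a normal website.
    return "website"


for site in [
    "https://developers.flow.com/",
    "https://github.com/onflow/flow-ft/",
    "https://github.com/orgs/onflow/discussions",
]:
    print(site, "->", classify_site(site))
```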
For private sites or repos, you may need authentication tokens/cookies to see content that’s not public.
You can merge all the .md files into a single file, or into a file containing only the essentials (with code blocks and similar content removed).
The merged output is useful for indexing, searching, or feeding a chatbot.
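To make the idea concrete, a merge step along these lines might look like the sketch below. It is not the contents of merge.py: the function names, the `# Source:` heading added before each file, and the regular expression used to drop fenced code blocks are all assumptions for illustration. Only the output file names are taken from this README.

```python
import re
from pathlib import Path

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
# A fenced block: a line starting with the fence, up to the next such line.
CODE_BLOCK_RE = re.compile(
    r"^" + FENCE + r".*?^" + FENCE + r"[ \t]*\n?",
    re.DOTALL | re.MULTILINE,
)


def strip_code_blocks(markdown_text):
    """Remove fenced code blocks, keeping the surrounding prose."""
    return CODE_BLOCK_RE.sub("", markdown_text)


def merge_markdown(src_dir, out_dir):
    """Write all_merged.md and essentials_merged.md into out_dir."""
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    parts = []
    for path in sorted(src.rglob("*.md")):
        body = path.read_text(encoding="utf-8").strip()
        # Hypothetical per-file heading so each chunk can be traced back.
        parts.append(f"# Source: {path.relative_to(src).as_posix()}\n\n{body}\n")
    merged = "\n".join(parts)
    (out / "all_merged.md").write_text(merged, encoding="utf-8")
    (out / "essentials_merged.md").write_text(
        strip_code_blocks(merged), encoding="utf-8"
    )
    return merged
```

Calling `merge_markdown("scraped_docs", "merged_docs")` would then produce both files in one pass. Keeping the output directory outside the source directory avoids re-reading the merged files on the next run.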
python merge.py
After a successful run, you’ll see:
scraped_docs/
├─ developers_flow_com/
│ ├─ index.md
│ ├─ docs_tutorial_somepage.md
│ └─ ...
├─ github_com_onflow_flow_ft/
│ ├─ blob_main_contracts_exampletoken_cdc.md
│ └─ ...
├─ github_com_orgs_onflow_discussions/
│ ├─ discussion_1330.md
│ ├─ discussion_1514.md
│ └─ ...
└─ ...
merged_docs/
├─ all_merged.md
└─ essentials_merged.md
- Docs directories for each site
- Repos with code files in .md (wrapped code blocks)
- Discussions as discussion_<id>.md, each containing Q&A text.
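The sketch below shows one way the repo files in the tree above could come about: mapping a blob page URL to its raw.githubusercontent.com counterpart, wrapping the file contents in a fenced block, and deriving directory and file names from URLs. The helper names and the naming rule (lowercase, with each run of non-alphanumeric characters replaced by an underscore) are assumptions chosen to reproduce the names shown above. They are not taken from scraper.py, and the ExampleToken.cdc URL is only illustrative.

```python
import re
from urllib.parse import urlparse

FENCE = "`" * 3  # three backticks


def blob_to_raw(blob_url):
    """Map a github.com blob page URL to its raw.githubusercontent.com URL."""
    parts = urlparse(blob_url).path.strip("/").split("/")
    # Expected shape: owner / repo / "blob" / branch / path...
    if len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a blob URL: {blob_url}")
    owner, repo, _blob, branch, *file_path = parts
    return "https://raw.githubusercontent.com/" + "/".join(
        [owner, repo, branch, *file_path]
    )


def wrap_as_markdown(title, content):
    """Put raw file contents inside a fenced block under a heading."""
    return f"# {title}\n\n{FENCE}\n{content.rstrip()}\n{FENCE}\n"


def slugify(text):
    """Lowercase and turn each run of non-alphanumerics into one underscore."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def output_name(site_url, page_url):
    """Return a (directory, file name) pair for a page crawled from site_url."""
    site, page = urlparse(site_url), urlparse(page_url)
    prefix = site.path.rstrip("/")
    relative = page.path[len(prefix):] if page.path.startswith(prefix) else page.path
    return slugify(site.netloc + site.path), (slugify(relative) or "index") + ".md"


blob = "https://github.com/onflow/flow-ft/blob/main/contracts/ExampleToken.cdc"
print(blob_to_raw(blob))
print(output_name("https://github.com/onflow/flow-ft/", blob))
```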
The script can be scheduled to run daily using GitHub Actions.