Bug Report
#612
Replies: 2 comments
-
The namespace id conflict is fine; I need to fix the dump, but that shouldn't cause it to fail to build the db.
-
It seems to fail with any wiki dump, so it's probably my implementation that's broken. This is no longer a bug; it's now a feature request, with example code on what not to do.
-
Bug Report
I did a fresh install in a new conda environment, so nothing else should affect this.
I added a few lines to ingest.py to support MediaWiki dumps, and it errors out only on MediaWiki dumps.
Here's the LangChain documentation I followed:
https://python.langchain.com/en/latest/integrations/mediawikidump.html
The wiki dump I am working with is the "Current pages (This version is usually best for bot use)" dump, dated 2022-07-09 01:13:59, from:
https://fallout.fandom.com/wiki/Special:Statistics
from langchain.document_loaders import (
    MWDumpLoader,  # added: loader for MediaWiki XML dumps
)

LOADER_MAPPING = {
    ".xml": (MWDumpLoader, {}),  # added: route .xml files to MWDumpLoader
}
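Since ingest.py picks a loader by file extension through this mapping, a single loader that raises can abort the whole ingest run. Below is a minimal sketch of a more tolerant dispatch that skips a failing file instead of crashing. It is not privateGPT's actual code: `FakeLoader` and `load_single_document_safe` are hypothetical names, and `FakeLoader` only stands in for a real loader such as `MWDumpLoader` so the example runs on its own.

```python
import os
from typing import Any, Dict, List, Tuple, Type


class FakeLoader:
    """Hypothetical stand-in for a LangChain loader such as MWDumpLoader."""

    def __init__(self, file_path: str, **kwargs: Any) -> None:
        self.file_path = file_path

    def load(self) -> List[str]:
        # Simulate a dump that the underlying parser cannot read.
        if "broken" in self.file_path:
            raise ValueError("invalid literal for int() with base 10: ''")
        return [f"document from {self.file_path}"]


# Same shape as the mapping above: extension -> (loader class, loader kwargs).
LOADER_MAPPING: Dict[str, Tuple[Type[FakeLoader], Dict[str, Any]]] = {
    ".xml": (FakeLoader, {}),
}


def load_single_document_safe(file_path: str) -> List[str]:
    """Load one file, returning [] instead of raising if the loader fails."""
    ext = os.path.splitext(file_path)[1].lower()
    if ext not in LOADER_MAPPING:
        print(f"Skipping {file_path}: unsupported extension {ext!r}")
        return []
    loader_class, loader_kwargs = LOADER_MAPPING[ext]
    try:
        return loader_class(file_path, **loader_kwargs).load()
    except Exception as exc:  # broad on purpose: any loader error skips the file
        print(f"Skipping {file_path}: {type(exc).__name__}: {exc}")
        return []
```

Skipping keeps the db build going, but it also means a broken dump contributes no documents at all, so the printed message is the only sign that something was left out.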
Here is the full output of running it with no db folder:
(pgpt) C:\Users\Name\AI\privateGPT>python ingest.py
Creating new vectorstore
Loading documents from source_documents
Loading new documents: 58%|███████████▌ | 11/19 [01:00<01:18, 9.75s/it]Namespace id conflict detected. <title>=Fallout 3 The Pitt trailer, =401, mapped_namespace=0
Namespace id conflict detected. <title>=��MAD����������� �������OP�FALLOUT3�, =401, mapped_namespace=0
Namespace id conflict detected. <title>=Fallout 3: Point Lookout - E3 2009 Trailer, =401, mapped_namespace=0
Namespace id conflict detected. <title>=Resident Evil 5 OST - Wind Of Madness HQ (Wesker Battle), =401, mapped_namespace=0
Namespace id conflict detected. <title>=Ghost seen at old school, =401, mapped_namespace=0
Loading new documents: 63%|████████████▋ | 12/19 [01:55<02:43, 23.39s/it]Namespace id conflict detected. <title>=GET 28 PERKS & INFINITE XP - Fallout New Vegas Glitch, =401, mapped_namespace=0
Loading new documents: 95%|██████████████████▉ | 18/19 [12:13<00:40, 40.73s/it]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Users\Name\miniconda3\envs\pgpt\lib\multiprocessing\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "C:\Users\Name\AI\privateGPT\ingest.py", line 91, in load_single_document
return loader.load()
File "C:\Users\Name\miniconda3\envs\pgpt\lib\site-packages\langchain\document_loaders\mediawikidump.py", line 49, in load
for revision in page:
File "C:\Users\Name\miniconda3\envs\pgpt\lib\site-packages\mwxml\iteration\page.py", line 32, in __iter__
for revision in self.__revisions:
File "C:\Users\Name\miniconda3\envs\pgpt\lib\site-packages\mwxml\iteration\page.py", line 44, in load_revisions
yield Revision.from_element(first_revision)
File "C:\Users\Name\miniconda3\envs\pgpt\lib\site-packages\mwxml\iteration\revision.py", line 85, in from_element
contents.insert(0, Content(
File "C:\Users\Name\miniconda3\envs\pgpt\lib\site-packages\jsonable\type.py", line 42, in __new__
return super().__new__(cls, *args, **kwargs)
File "C:\Users\Name\miniconda3\envs\pgpt\lib\site-packages\jsonable\self_constructor.py", line 16, in __new__
inst.initialize(*args, **kwargs)
File "C:\Users\Name\miniconda3\envs\pgpt\lib\site-packages\mwtypes\slots.py", line 55, in initialize
self.bytes = none_or(bytes, int)
File "C:\Users\Name\miniconda3\envs\pgpt\lib\site-packages\mwtypes\util.py", line 5, in none_or
return func(val)
ValueError: invalid literal for int() with base 10: ''
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\Name\AI\privateGPT\ingest.py", line 168, in <module>
main()
File "C:\Users\Name\AI\privateGPT\ingest.py", line 158, in main
texts = process_documents()
File "C:\Users\Name\AI\privateGPT\ingest.py", line 120, in process_documents
documents = load_documents(source_directory, ignored_files)
File "C:\Users\Name\AI\privateGPT\ingest.py", line 109, in load_documents
for i, docs in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
File "C:\Users\Name\miniconda3\envs\pgpt\lib\multiprocessing\pool.py", line 870, in next
raise value
ValueError: invalid literal for int() with base 10: ''
(pgpt) C:\Users\Name\AI\privateGPT>
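The traceback ends in `none_or(bytes, int)`, and `int('')` raises exactly this ValueError, so the parser appears to be receiving an empty string for a revision's byte count. One possible workaround, assuming the empty value comes from a `bytes=""` attribute on a `<text>` element in the dump (I have not confirmed this against the file), is to strip that attribute before ingesting. The helper names below are hypothetical, and whether mwxml then accepts the cleaned dump is untested.

```python
import re

# Matches an empty bytes attribute, e.g. bytes="" inside a <text ...> tag.
# Assumption: this attribute is the source of the '' that int() rejects.
EMPTY_BYTES_ATTR = re.compile(r'(<text\b[^>]*?)\s+bytes=""')


def strip_empty_bytes(line: str) -> str:
    """Remove empty bytes="" attributes from <text> tags in one line."""
    return EMPTY_BYTES_ATTR.sub(r"\1", line)


def sanitize_dump(src_path: str, dst_path: str) -> int:
    """Copy a dump line by line, fixing <text> tags; return lines changed."""
    changed = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            fixed = strip_empty_bytes(line)
            changed += fixed != line  # True counts as 1
            dst.write(fixed)
    return changed
```

Calling `sanitize_dump("dump.xml", "dump_clean.xml")` (file names are placeholders) writes a cleaned copy and reports how many lines were modified. A result of 0 would suggest the empty value comes from somewhere else.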