cliche crawler problem, #93

Open
miaekim opened this issue Mar 5, 2015 · 5 comments

@miaekim
Contributor

miaekim commented Mar 5, 2015

I tried to run the crawler, but it didn't work.
Here's my command:

$ celery worker -A cliche.services.wikipedia.crawler --config dev.yml

And this is my dev.yml:

database_url: 'postgresql:///cliche_db_'
broker_url: 'redis://localhost/0'
WIKIPEDIA_RETRY_LIMIT: 30
DEBUG: True
SECRET_KEY: 'abcd'
SENTRY_DSN: 'https://1:2@3:4/5'

This is the error message:

[2015-03-05 20:01:11,658: WARNING/Worker-3] /Users/miaekim/rdflib/lib/python3.4/site-packages/celery/app/trace.py:364: RuntimeWarning: Exception raised outside body: TypeError("report_task_failure() got an unexpected keyword argument 'signal'",):
Traceback (most recent call last):
  File "/Users/miaekim/rdflib/lib/python3.4/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/Users/miaekim/rdflib/lib/python3.4/site-packages/celery/app/trace.py", line 437, in __protected_call__
    return self.run(*args, **kwargs)
  File "/Users/miaekim/cliche/cliche/services/tvtropes/crawler.py", line 181, in crawl_link
    result, tree, namespace, name, url = fetch_link(url, session)
  File "/Users/miaekim/cliche/cliche/services/tvtropes/crawler.py", line 145, in fetch_link
    name = tree.xpath('//div[@class="pagetitle"]/span')[0].text.strip()
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/miaekim/rdflib/lib/python3.4/site-packages/celery/app/trace.py", line 253, in trace_task
    I, R, state, retval = on_error(task_request, exc, uuid)
  File "/Users/miaekim/rdflib/lib/python3.4/site-packages/celery/app/trace.py", line 201, in on_error
    R = I.handle_error_state(task, eager=eager)
  File "/Users/miaekim/rdflib/lib/python3.4/site-packages/celery/app/trace.py", line 85, in handle_error_state
    }[self.state](task, store_errors=store_errors)
  File "/Users/miaekim/rdflib/lib/python3.4/site-packages/celery/app/trace.py", line 125, in handle_failure
    einfo=einfo)
  File "/Users/miaekim/rdflib/lib/python3.4/site-packages/celery/utils/dispatch/signal.py", line 166, in send
    response = receiver(signal=self, sender=sender, **named)
TypeError: report_task_failure() got an unexpected keyword argument 'signal'

  exc, exc_info.traceback)))
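
Note on the second exception: the dispatch line in the traceback calls each receiver as `receiver(signal=self, sender=sender, **named)`, so a receiver whose signature has no `signal` parameter and no `**kwargs` fails exactly this way. The sketch below reproduces that call shape without Celery. `send`, `strict_receiver`, `tolerant_receiver`, and `try_send` are hypothetical names for illustration; the real signature of `report_task_failure` is not shown in this thread.

```python
def send(receiver, sender, **named):
    """Call a receiver with the same keyword shape as the traceback's
    dispatch line: receiver(signal=self, sender=sender, **named)."""
    return receiver(signal="task_failure", sender=sender, **named)


def strict_receiver(task_id, exception, sender=None):
    # Hypothetical handler with a fixed signature: no `signal`, no **kwargs.
    return "reported %s" % task_id


def tolerant_receiver(task_id=None, exception=None, **kwargs):
    # Hypothetical handler that absorbs `signal`, `sender`, and any other
    # keyword the dispatcher passes along.
    return "reported %s" % task_id


def try_send(receiver):
    """Dispatch a fake failure and return the result or the TypeError text."""
    try:
        return send(
            receiver,
            sender="crawl_link",
            task_id="abc",
            exception=IndexError("list index out of range"),
        )
    except TypeError as exc:
        return "TypeError: %s" % exc


if __name__ == "__main__":
    print(try_send(strict_receiver))    # fails on the `signal` keyword
    print(try_send(tolerant_receiver))  # succeeds
```

If the handler here has a fixed signature like `strict_receiver`, adding `**kwargs` would likely make the original `IndexError` surface on its own instead of being masked by the `TypeError`.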
@tkiapril
Contributor

tkiapril commented Mar 5, 2015

Did you pull the latest source from upstream master? I recall seeing the same message a while ago, but the crawler ran normally with the prod server configuration.

@miaekim
Contributor Author

miaekim commented Mar 5, 2015

The latest commit is 62fd990.
Can I get your configuration?

@tkiapril
Contributor

tkiapril commented Mar 5, 2015

DEBUG: True
SECRET_KEY: [REDACTED]
database_url: 'postgres://localhost/tki_cliche_dummy'

# http://docs.celeryproject.org/en/latest/configuration.html
broker_url: 'redis://localhost/1'

CELERYBEAT_SCHEDULE:
  'tvtropes-sync':
    task: 'cliche.services.tvtropes.crawler.crawl'
    schedule: !!python/object/apply:datetime.timedelta [1, 0, 0]
  'wikipedia-sync':
    task: 'cliche.services.wikipedia.crawler.crawl'
    schedule: !!python/object/apply:datetime.timedelta [1, 0, 0]

CELERY_TIMEZONE: 'UTC'

SENTRY_DSN: [REDACTED]

uwsgi:
  chdir: /home/tki/cliche
  chmod-socket: 666
  callable: app
  wsgi-file: /home/tki/cliche/deploy/etc/wsgi.py
  socket: /tmp/cliche-uwsgi.sock
  plugins: python34

Here you go, but I don't really think the configuration is the problem. Judging from the error log, if there is a problem, it is in the processing of the crawled page.

@miaekim
Contributor Author

miaekim commented Mar 6, 2015

@tkiapril
Yes, the TV Tropes website has changed a lot.

@miaekim
Contributor Author

miaekim commented Mar 6, 2015

You need to change this line:

name = tree.xpath('//div[@class="pagetitle"]/span')[0].text.strip()
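
One way to make this line tolerate markup changes is to try several XPath expressions in order and return `None` when nothing matches, instead of indexing `[0]` unconditionally. This is a minimal sketch, not the fix that was committed: `extract_title` and `TITLE_XPATHS` are hypothetical names, and the second XPath is a placeholder, since the new TV Tropes markup is not quoted in this thread.

```python
# Candidate XPath expressions, tried in order. The first is the expression
# from the traceback; the second is a HYPOTHETICAL placeholder and must be
# replaced with whatever matches the current page markup.
TITLE_XPATHS = [
    '//div[@class="pagetitle"]/span',
    '//h1',
]


def extract_title(tree, xpaths=TITLE_XPATHS):
    """Return the stripped text of the first element matched by any
    expression in `xpaths`, or None if no expression matches.

    `tree` is any object with an lxml-style `.xpath(expr)` method that
    returns a list of elements exposing a `.text` attribute.
    """
    for expr in xpaths:
        nodes = tree.xpath(expr)
        # Guard against both an empty result and an element without text.
        if nodes and nodes[0].text:
            return nodes[0].text.strip()
    return None
```

With this helper, `fetch_link` could call `name = extract_title(tree)` and skip or log the page when the result is `None`, rather than raising `IndexError` inside the task.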

miaekim added a commit to miaekim/cliche that referenced this issue Mar 7, 2015
miaekim added a commit to miaekim/cliche that referenced this issue Mar 7, 2015
miaekim added a commit to miaekim/cliche that referenced this issue Apr 21, 2015
miaekim added a commit to miaekim/cliche that referenced this issue May 9, 2015
tkiapril added a commit that referenced this issue May 9, 2015
#93 Change xpath based on 'changed tvtropes'