cliche crawler problem, #93

miaekim · 2015-03-05T13:03:33Z

I tried to run the crawler, but it didn't work.
Here's my command:

$ celery worker -A cliche.services.wikipedia.crawler --config dev.yml

And this is my dev.yml:

database_url: 'postgresql:///cliche_db_'
broker_url: 'redis://localhost/0'
WIKIPEDIA_RETRY_LIMIT: 30
DEBUG: True
SECRET_KEY: 'abcd'
SENTRY_DSN: 'https://1:2@3:4/5'

This is the error message:

[2015-03-05 20:01:11,658: WARNING/Worker-3] /Users/miaekim/rdflib/lib/python3.4/site-packages/celery/app/trace.py:364: RuntimeWarning: Exception raised outside body: TypeError("report_task_failure() got an unexpected keyword argument 'signal'",):
Traceback (most recent call last):
  File "/Users/miaekim/rdflib/lib/python3.4/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/Users/miaekim/rdflib/lib/python3.4/site-packages/celery/app/trace.py", line 437, in __protected_call__
    return self.run(*args, **kwargs)
  File "/Users/miaekim/cliche/cliche/services/tvtropes/crawler.py", line 181, in crawl_link
    result, tree, namespace, name, url = fetch_link(url, session)
  File "/Users/miaekim/cliche/cliche/services/tvtropes/crawler.py", line 145, in fetch_link
    name = tree.xpath('//div[@class="pagetitle"]/span')[0].text.strip()
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/miaekim/rdflib/lib/python3.4/site-packages/celery/app/trace.py", line 253, in trace_task
    I, R, state, retval = on_error(task_request, exc, uuid)
  File "/Users/miaekim/rdflib/lib/python3.4/site-packages/celery/app/trace.py", line 201, in on_error
    R = I.handle_error_state(task, eager=eager)
  File "/Users/miaekim/rdflib/lib/python3.4/site-packages/celery/app/trace.py", line 85, in handle_error_state
    }[self.state](task, store_errors=store_errors)
  File "/Users/miaekim/rdflib/lib/python3.4/site-packages/celery/app/trace.py", line 125, in handle_failure
    einfo=einfo)
  File "/Users/miaekim/rdflib/lib/python3.4/site-packages/celery/utils/dispatch/signal.py", line 166, in send
    response = receiver(signal=self, sender=sender, **named)
TypeError: report_task_failure() got an unexpected keyword argument 'signal'

  exc, exc_info.traceback)))

tkiapril · 2015-03-05T14:25:55Z

Did you pull the latest source from upstream master? I recall seeing the same message a while ago, but the crawler ran normally with the prod server configuration.

miaekim · 2015-03-05T14:36:51Z

The latest commit is 62fd990.
Can I get your configuration?

tkiapril · 2015-03-05T15:10:54Z

DEBUG: True
SECRET_KEY: [REDACTED]
database_url: 'postgres://localhost/tki_cliche_dummy'

# http://docs.celeryproject.org/en/latest/configuration.html
broker_url: 'redis://localhost/1'

CELERYBEAT_SCHEDULE:
  'tvtropes-sync':
    task: 'cliche.services.tvtropes.crawler.crawl'
    schedule: !!python/object/apply:datetime.timedelta [1, 0, 0]
  'wikipedia-sync':
    task: 'cliche.services.wikipedia.crawler.crawl'
    schedule: !!python/object/apply:datetime.timedelta [1, 0, 0]

CELERY_TIMEZONE: 'UTC'

SENTRY_DSN: [REDACTED]

uwsgi:
  chdir: /home/tki/cliche
  chmod-socket: 666
  callable: app
  wsgi-file: /home/tki/cliche/deploy/etc/wsgi.py
  socket: /tmp/cliche-uwsgi.sock
  plugins: python34

Here you go, but I don't really think the configuration is the problem. Judging from the error log, if there is a problem, it is in the processing of the crawled page.
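
Separately, the last frame of the log shows the signal dispatcher calling `receiver(signal=self, sender=sender, **named)`, and the `TypeError` says `report_task_failure()` does not accept `signal`. Below is a minimal sketch of that mismatch. It uses a stand-in `send` function rather than Celery itself; `strict_receiver` and `tolerant_receiver` are hypothetical names, and the real signature of `report_task_failure()` is not shown in this thread.

```python
def send(receivers, sender, **named):
    """Stand-in dispatcher that mirrors the call shape seen in the traceback."""
    return [receiver(signal="task_failure", sender=sender, **named)
            for receiver in receivers]


def strict_receiver(task_id, exception):
    """Hypothetical receiver with a fixed signature; it rejects `signal`."""
    return task_id, type(exception).__name__


def tolerant_receiver(sender=None, task_id=None, exception=None, **kwargs):
    """Hypothetical receiver that absorbs extra keywords such as `signal`."""
    return task_id, type(exception).__name__


error = IndexError("list index out of range")

# The strict receiver fails before it can report the original error.
try:
    send([strict_receiver], sender="crawl_link", task_id="abc", exception=error)
    strict_failed = False
except TypeError:
    strict_failed = True

# The tolerant receiver ignores the keywords it does not use.
tolerant_result = send([tolerant_receiver], sender="crawl_link",
                       task_id="abc", exception=error)
```

If the receiver accepted `**kwargs`, the original `IndexError` would likely be reported on its own instead of being followed by a second exception.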

miaekim · 2015-03-06T11:00:51Z

@tkiapril
Yes, the "tvtropes" site has changed a lot.

miaekim · 2015-03-06T11:07:09Z

You need to change this line:

name = tree.xpath('//div[@class="pagetitle"]/span')[0].text.strip()
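
A minimal sketch of one way to guard that lookup, assuming the failure is simply that the XPath matches nothing on the redesigned page. `first_text` is a hypothetical helper, not part of cliche, and the standard library's `xml.etree.ElementTree` stands in for the parser used in `crawler.py` so the example runs on its own. The replacement selector for the new markup is not shown because the thread does not state it.

```python
import xml.etree.ElementTree as ET


def first_text(nodes, default=None):
    """Return the stripped text of the first node, or `default` if none matched."""
    if not nodes:
        return default
    text = nodes[0].text
    return text.strip() if text is not None else default


# Hypothetical markup: one page with the old title element, one without it.
old_page = ET.fromstring(
    '<html><body><div class="pagetitle"><span> Foo </span></div></body></html>'
)
new_page = ET.fromstring('<html><body><h1 class="title">Foo</h1></body></html>')

query = './/div[@class="pagetitle"]/span'

old_name = first_text(old_page.findall(query))  # element present
new_name = first_text(new_page.findall(query))  # no match, no IndexError
```

With a guard like this, the caller can skip or log a page whose title is missing rather than letting the task fail on `[0]`.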

#93 Change xpath based on 'changed tvtropes'

miaekim added the bug label Mar 5, 2015

miaekim assigned tkiapril Mar 5, 2015

miaekim added a commit to miaekim/cliche that referenced this issue Mar 7, 2015

clicheio#93 Change xpath based on 'changed tvtropes'

73228c8

miaekim added a commit to miaekim/cliche that referenced this issue Mar 7, 2015

clicheio#93 Change xpath based on 'changed tvtropes'

1794f1f

miaekim added a commit to miaekim/cliche that referenced this issue Apr 21, 2015

clicheio#93 Change xpath based on 'changed tvtropes'

257bb46

miaekim mentioned this issue Apr 22, 2015

#93 Change xpath based on 'changed tvtropes' #94

Merged

miaekim added a commit to miaekim/cliche that referenced this issue May 9, 2015

clicheio#93 Change xpath based on 'changed tvtropes'

5e3dc92

tkiapril added a commit that referenced this issue May 9, 2015

Merge pull request #94 from miaekim/dummies

dcfa34b

#93 Change xpath based on 'changed tvtropes'
