
Exception when running the crawler #28

Open
xiayank opened this issue May 12, 2017 · 2 comments
xiayank commented May 12, 2017

It used to run without any errors. The error occurs in AmazonCrawler.java at this line:

Document doc = Jsoup.connect(url).headers(headers).userAgent(USER_AGENT).timeout(100000).get();

The exception log and my code are below.

Exception in thread "main" java.lang.IllegalArgumentException: String must not be empty
	at org.jsoup.helper.Validate.notEmpty(Validate.java:92)
	at org.jsoup.nodes.Attribute.setKey(Attribute.java:51)
	at org.jsoup.parser.ParseSettings.normalizeAttributes(ParseSettings.java:54)
	at org.jsoup.parser.HtmlTreeBuilder.insert(HtmlTreeBuilder.java:185)
	at org.jsoup.parser.HtmlTreeBuilderState$7.process(HtmlTreeBuilderState.java:553)
	at org.jsoup.parser.HtmlTreeBuilder.process(HtmlTreeBuilder.java:113)
	at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:50)
	at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:43)
	at org.jsoup.parser.HtmlTreeBuilder.parse(HtmlTreeBuilder.java:56)
	at org.jsoup.parser.Parser.parseInput(Parser.java:32)
	at org.jsoup.helper.DataUtil.parseByteData(DataUtil.java:136)
	at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:666)
	at org.jsoup.helper.HttpConnection.get(HttpConnection.java:225)
	at io.bittiger.crawler.AmazonCrawler.GetAdBasicInfoByQuery(AmazonCrawler.java:167)
	at io.bittiger.crawler.CrawlerMain.main(CrawlerMain.java:54)
hackjutsu (Member) commented:

@xiayank

This exception comes from inside the JSoup parser; Amazon may have returned a 50X page. If the error doesn't occur every time, try wrapping the call in a try-catch and simply skipping the failed page?
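A minimal sketch of that try-catch approach. The `Fetcher` interface and `fetchOrSkip` helper are hypothetical names introduced here for illustration; the jsoup call shown in the comment mirrors the line from the original report. jsoup's `get()` throws `IOException` for network/HTTP failures, while the "String must not be empty" failure surfaces as an unchecked `IllegalArgumentException`, so the catch covers both.

```java
import java.io.IOException;

public class SafeFetch {
    // Hypothetical functional interface so the skip logic can be
    // exercised without a network connection or the jsoup jar.
    interface Fetcher<T> {
        T fetch(String url) throws IOException;
    }

    // Returns the fetched value, or null when the fetch throws
    // IOException (network/HTTP errors) or IllegalArgumentException
    // (jsoup's "String must not be empty" parse failure).
    static <T> T fetchOrSkip(Fetcher<T> fetcher, String url) {
        try {
            return fetcher.fetch(url);
        } catch (IOException | IllegalArgumentException e) {
            System.err.println("Skipping " + url + ": " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        // With jsoup on the classpath, the real call would look like:
        // Document doc = fetchOrSkip(u -> Jsoup.connect(u)
        //         .headers(headers).userAgent(USER_AGENT)
        //         .timeout(100000).get(), url);
        String ok = fetchOrSkip(u -> "<html>stub</html>", "http://example.com");
        String bad = fetchOrSkip(u -> {
            throw new IllegalArgumentException("String must not be empty");
        }, "http://example.com/broken");
        System.out.println(ok != null);   // true: fetch succeeded
        System.out.println(bad == null);  // true: page was skipped
    }
}
```

With this pattern, a single bad page from Amazon no longer kills the whole crawl; the crawler logs the URL and moves on to the next query.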

jygan (Contributor) commented May 12, 2017

Print url and check what its value is.
