
Jsoup Crawler #43

Open
xiayank opened this issue May 31, 2017 · 13 comments

@xiayank commented May 31, 2017

I have a question about using the jsoup API to select a target element.
Here is the HTML.
[screenshot of the page's HTML]
I want to get the href attribute value of the <a> tag that sits under the <div class="bxc-grid__image bxc-grid__image--light">.
I tried using

Elements elements = doc.select("div[class=bxc-grid__image   bxc-grid__image--light]");

to locate the div, and it works. Following the API's E > F rule (an F that is a direct child of E), the selector should be li[class=sub-categories__list__item]>a. However, it throws an exception.

Does anyone know how to locate the <a> tag?

Thanks in advance!
Jsoup select API
URL OF ORIGINAL PAGE
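
For reference, a minimal sketch of one way to reach that anchor, assuming the class names shown in the screenshot and that doc is the parsed page. The "." class syntax is usually more robust than matching the whole class attribute string with [class=...]:

Element link = doc.select("div.bxc-grid__image.bxc-grid__image--light a").first();
if (link != null) {
    System.out.println(link.attr("abs:href")); // resolve href to an absolute URL
}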

Here is the exception log (note that the failure happens inside jsoup's HTML parser, at Attribute.setKey, while the fetched page is being parsed, not while the selector runs):

Exception in thread "main" java.lang.IllegalArgumentException: String must not be empty
	at org.jsoup.helper.Validate.notEmpty(Validate.java:92)
	at org.jsoup.nodes.Attribute.setKey(Attribute.java:51)
	at org.jsoup.parser.ParseSettings.normalizeAttributes(ParseSettings.java:54)
	at org.jsoup.parser.HtmlTreeBuilder.insert(HtmlTreeBuilder.java:185)
	at org.jsoup.parser.HtmlTreeBuilderState$7.process(HtmlTreeBuilderState.java:553)
	at org.jsoup.parser.HtmlTreeBuilder.process(HtmlTreeBuilder.java:113)
	at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:50)
	at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:43)
	at org.jsoup.parser.HtmlTreeBuilder.parse(HtmlTreeBuilder.java:56)
	at org.jsoup.parser.Parser.parseInput(Parser.java:32)
	at org.jsoup.helper.DataUtil.parseByteData(DataUtil.java:135)
	at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:747)
	at org.jsoup.helper.HttpConnection.get(HttpConnection.java:250)
	at test.main(test.java:26)
@jygan (Contributor) commented May 31, 2017

Are you using "Copy selector" in Chrome?
#nav-subnav > a:nth-child(7)

@xiayank (Author) commented Jun 1, 2017

When you all use jsoup, do you keep running into this: even for the same page and the same CSS selector, the scraped Element sometimes works fine and returns what you want, but other times it comes back empty, or throws IllegalArgumentException: String must not be empty?
jsoup does not feel very stable to me; it fails a lot of the time.

@bihjuchiu commented

Same here. I thought it was Amazon blocking the crawler...

@xiayank (Author) commented Jun 1, 2017

If so, shouldn't there be a 503 error?

@xiayank (Author) commented Jun 1, 2017

@bihjuchiu
In class, John got an exception like org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=http://amazon.com.
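
If blocking is the suspicion, one way to check (a sketch, reusing the url, headers, and USER_AGENT variables from the code later in this thread) is to fetch with ignoreHttpErrors and inspect the status code before parsing:

Connection.Response res = Jsoup.connect(url)
        .headers(headers)
        .userAgent(USER_AGENT)
        .ignoreHttpErrors(true) // don't throw HttpStatusException on 4xx/5xx
        .execute();
if (res.statusCode() == 503) {
    System.out.println("Blocked by the server (503)");
} else {
    Document doc = res.parse(); // a parser-side failure would surface here instead
}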

@bihjuchiu commented

Good point, maybe it's a Jsoup problem...

@jygan (Contributor) commented Jun 1, 2017

@xiayank Can you post the URL and the selector you are using, and tell me which item you want to crawl?
I will take a look.

@xiayank (Author) commented Jun 1, 2017

@jygan

URL list

https://www.amazon.com/workout-clothes/b/ref=nav_shopall_sa_sp_athclg?ie=UTF8&node=11444071011
https://www.amazon.com/Exercise-Equipment-Gym-Equipment/b/ref=nav_shopall_sa_sp_exfit?ie=UTF8&node=3407731
https://www.amazon.com/Hunting-Fishing-Gear-Equipment/b/ref=nav_shopall_hntfsh?ie=UTF8&node=706813011
https://www.amazon.com/soccer-store-soccer-shop/b/ref=nav_shopall_sa_sp_team?ie=UTF8&node=706809011
https://www.amazon.com/Fan-Shop-Sports-Outdoors/b/ref=nav_shopall_sa_sp_fan?ie=UTF8&node=3386071
https://www.amazon.com/Golf/b/ref=nav_shopall_sa_sp_golf?ie=UTF8&node=3410851
https://www.amazon.com/man-cave/b/ref=nav_shopall_sa_sp_gamerm?ie=UTF8&node=706808011
https://www.amazon.com/Sports-Collectibles/b/ref=nav_shopall_sa_sp_sptcllct?ie=UTF8&node=3250697011
https://www.amazon.com/Sports-Fitness/b/ref=nav_shopall_sa_sp_allsport?ie=UTF8&node=10971181011
https://www.amazon.com/b/ref=nav_shopall_lpd_gno_sports?ie=UTF8&node=12034909011
https://www.amazon.com/camping-hiking/b/ref=nav_shopall_sa_out_camphike?ie=UTF8&node=3400371
https://www.amazon.com/Cycling-Wheel-Sports-Outdoors/b/ref=nav_shopall_sa_out_cyc?ie=UTF8&node=3403201
https://www.amazon.com/Outdoor-Recreation-Clothing/b/ref=nav_shopall_sa_out_outcloth?ie=UTF8&node=11443874011
https://www.amazon.com/skateboarding-scooters-skates/b/ref=nav_shopall_sa_out_scooskate?ie=UTF8&node=11051398011
https://www.amazon.com/water-sports/b/ref=nav_shopall_sa_out_water?ie=UTF8&node=11051399011
https://www.amazon.com/winter-sports/b/ref=nav_shopall_sa_out_wintersport?ie=UTF8&node=2204518011
https://www.amazon.com/climbing/b/ref=nav_shopall_sa_out_climb?ie=UTF8&node=3402401
https://www.amazon.com/outdoor-accessories/b/ref=nav_shopall_sa_out_accout?ie=UTF8&node=11051400011
https://www.amazon.com/outdoor-recreation/b/ref=nav_shopall_sa_out_alloutrec?ie=UTF8&node=706814011

Selector:

Elements elements = doc.select("span[class=nav-a-content]");
System.out.println(elements.size());
// The element count sometimes equals zero, sometimes not.

for (int i = 2; i <= elements.size(); i++) {
    String css = "#nav-subnav > a:nth-child(" + Integer.toString(i) + ")";
    Element element = doc.select(css).first();
}
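
As a side note, a sketch of a less index-sensitive way to walk the menu, assuming the anchors are direct children of #nav-subnav as in the copied selector. Selecting them all at once means an empty result simply indicates the menu was not found:

Elements menuLinks = doc.select("#nav-subnav > a");
for (Element link : menuLinks) {
    System.out.println(link.attr("abs:href"));
}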

Item

The links in the menu.
[screenshot of the navigation menu]

Thanks!

@jygan (Contributor) commented Jun 1, 2017

Just hard-code up to 10 categories and catch the exception:

for (int i = 2; i <= 10; i++) {
    try {
        Element element = doc.select("#nav-subnav > a:nth-child(" + i + ")").first();
        // first() returns null when the nth child does not exist
    } catch (Exception e) {
        // skip this index and keep going
    }
}

@xiayank (Author) commented Jun 1, 2017

The thing is that the next time we crawl the page, it may or may not work. So some products will be crawled the first time but never crawled again.
Does that affect our project? We need to compare prices across the different times we crawl.

@jygan (Contributor) commented Jun 1, 2017

We need to crawl the product again even if it has already been crawled.
Does your code sometimes fail at this line? Element element = doc.select(css).first();

@xiayank (Author) commented Jun 1, 2017

My code has two problems:

1. Document doc = Jsoup.connect(url).headers(headers).userAgent(USER_AGENT).timeout(1000000).get(); sometimes throws the exception IllegalArgumentException: String must not be empty.

2. Sometimes elements.size() will be zero, sometimes not.

My solution for the link-exploring crawler is to initialize all the URLs into a queue and put any URL that fails back into the queue, quitting the loop only when the queue is empty. At the end, I get all the links.

But when I design the product-detail crawler, it is normal for elements.size() to be 0, since not all sub-category pages match the same CSS selector; some do not have a product list on them. So I cannot use the queue approach there.
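
A minimal self-contained sketch of that retry queue, assuming a list of seed URLs (a real crawler would also cap retries per URL so a permanently dead link cannot loop forever, and pause between attempts):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RetryQueueCrawler {
    // Seed the queue with every URL, re-enqueue failures, stop once it drains.
    public static List<Document> crawlAll(List<String> urls) {
        Deque<String> queue = new ArrayDeque<>(urls);
        List<Document> pages = new ArrayList<>();
        while (!queue.isEmpty()) {
            String url = queue.poll();
            try {
                pages.add(Jsoup.connect(url).get());
            } catch (Exception e) {
                queue.add(url); // failed: put it back and retry later
            }
        }
        return pages;
    }
}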

@jygan (Contributor) commented Jun 1, 2017

Document doc = Jsoup.connect(url).headers(headers).userAgent(USER_AGENT).timeout(1000000).get(); sometimes throws the exception IllegalArgumentException: String must not be empty

For this error, can you check whether url is empty?

2. Sometimes elements.size() will be zero, sometimes not.

This might be related to the maximum body size the crawler will load; try:

Document doc = Jsoup.connect(url)
        .headers(headers)
        .userAgent(USER_AGENT)
        .maxBodySize(0)
        .get();
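
For context: jsoup limits the downloaded body size by default (1 MB in the versions current at the time), so on a page larger than that the HTML is truncated and any elements past the cutoff silently disappear from the parsed Document; maxBodySize(0) removes the cap.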
