
Jsoup Crawler #43

Open
xiayank opened this issue May 31, 2017 · 13 comments

@xiayank commented May 31, 2017

I have a question about using the jsoup API to select a target element.
Here is the HTML.
[screenshot of the page's HTML]
I want to get the href attribute value of the <a> tag that sits under the <div class="bxc-grid__image bxc-grid__image--light">.
I tried using

Elements elements = doc.select("div[class=bxc-grid__image   bxc-grid__image--light]");

to locate the div, and it works. Following the API's E > F rule (an F that is a direct child of E), the selector should be li[class=sub-categories__list__item]>a. However, it throws an exception.

Does anyone know how to locate the <a> tag?

Thanks in advance!
Jsoup select API
URL OF ORIGINAL PAGE
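
For reference, a minimal sketch of one way to reach that anchor, assuming the class names shown in the screenshot and that doc is the parsed page. The "." class syntax is usually more robust than matching the whole class attribute string with [class=...]:

Element link = doc.select("div.bxc-grid__image.bxc-grid__image--light a").first();
if (link != null) {
    System.out.println(link.attr("abs:href")); // resolve href to an absolute URL
}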

Here is the exception log (note that the failure happens inside jsoup's HTML parser, at Attribute.setKey, while the fetched page is being parsed, not while the selector runs):

Exception in thread "main" java.lang.IllegalArgumentException: String must not be empty
	at org.jsoup.helper.Validate.notEmpty(Validate.java:92)
	at org.jsoup.nodes.Attribute.setKey(Attribute.java:51)
	at org.jsoup.parser.ParseSettings.normalizeAttributes(ParseSettings.java:54)
	at org.jsoup.parser.HtmlTreeBuilder.insert(HtmlTreeBuilder.java:185)
	at org.jsoup.parser.HtmlTreeBuilderState$7.process(HtmlTreeBuilderState.java:553)
	at org.jsoup.parser.HtmlTreeBuilder.process(HtmlTreeBuilder.java:113)
	at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:50)
	at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:43)
	at org.jsoup.parser.HtmlTreeBuilder.parse(HtmlTreeBuilder.java:56)
	at org.jsoup.parser.Parser.parseInput(Parser.java:32)
	at org.jsoup.helper.DataUtil.parseByteData(DataUtil.java:135)
	at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:747)
	at org.jsoup.helper.HttpConnection.get(HttpConnection.java:250)
	at test.main(test.java:26)
@jygan (Contributor) commented May 31, 2017

Are you using "Copy selector" in Chrome?
#nav-subnav > a:nth-child(7)

@xiayank (Author) commented Jun 1, 2017

When you all use jsoup, do you keep running into this: even for the same page and the same CSS selector, the scraped Element sometimes works fine and returns what you want, but other times it comes back empty, or throws IllegalArgumentException: String must not be empty?
jsoup does not feel very stable to me; it fails a lot of the time.

@bihjuchiu commented

Same here. I thought it was Amazon blocking the crawler...

@xiayank (Author) commented Jun 1, 2017

If so, shouldn't there be a 503 error?

@xiayank (Author) commented Jun 1, 2017

@bihjuchiu
In class, John got an exception like org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=http://amazon.com.
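
If blocking is the suspicion, one way to check (a sketch, reusing the url, headers, and USER_AGENT variables from the code later in this thread) is to fetch with ignoreHttpErrors and inspect the status code before parsing:

Connection.Response res = Jsoup.connect(url)
        .headers(headers)
        .userAgent(USER_AGENT)
        .ignoreHttpErrors(true) // don't throw HttpStatusException on 4xx/5xx
        .execute();
if (res.statusCode() == 503) {
    System.out.println("Blocked by the server (503)");
} else {
    Document doc = res.parse(); // a parser-side failure would surface here instead
}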

@bihjuchiu commented

Good point, maybe it's a Jsoup problem...

@jygan (Contributor) commented Jun 1, 2017

@xiayank Can you post the URL and the selector you are using, and tell me which item you want to crawl?
I will take a look.

@xiayank (Author) commented Jun 1, 2017

@jygan

URL list

https://www.amazon.com/workout-clothes/b/ref=nav_shopall_sa_sp_athclg?ie=UTF8&node=11444071011
https://www.amazon.com/Exercise-Equipment-Gym-Equipment/b/ref=nav_shopall_sa_sp_exfit?ie=UTF8&node=3407731
https://www.amazon.com/Hunting-Fishing-Gear-Equipment/b/ref=nav_shopall_hntfsh?ie=UTF8&node=706813011
https://www.amazon.com/soccer-store-soccer-shop/b/ref=nav_shopall_sa_sp_team?ie=UTF8&node=706809011
https://www.amazon.com/Fan-Shop-Sports-Outdoors/b/ref=nav_shopall_sa_sp_fan?ie=UTF8&node=3386071
https://www.amazon.com/Golf/b/ref=nav_shopall_sa_sp_golf?ie=UTF8&node=3410851
https://www.amazon.com/man-cave/b/ref=nav_shopall_sa_sp_gamerm?ie=UTF8&node=706808011
https://www.amazon.com/Sports-Collectibles/b/ref=nav_shopall_sa_sp_sptcllct?ie=UTF8&node=3250697011
https://www.amazon.com/Sports-Fitness/b/ref=nav_shopall_sa_sp_allsport?ie=UTF8&node=10971181011
https://www.amazon.com/b/ref=nav_shopall_lpd_gno_sports?ie=UTF8&node=12034909011
https://www.amazon.com/camping-hiking/b/ref=nav_shopall_sa_out_camphike?ie=UTF8&node=3400371
https://www.amazon.com/Cycling-Wheel-Sports-Outdoors/b/ref=nav_shopall_sa_out_cyc?ie=UTF8&node=3403201
https://www.amazon.com/Outdoor-Recreation-Clothing/b/ref=nav_shopall_sa_out_outcloth?ie=UTF8&node=11443874011
https://www.amazon.com/skateboarding-scooters-skates/b/ref=nav_shopall_sa_out_scooskate?ie=UTF8&node=11051398011
https://www.amazon.com/water-sports/b/ref=nav_shopall_sa_out_water?ie=UTF8&node=11051399011
https://www.amazon.com/winter-sports/b/ref=nav_shopall_sa_out_wintersport?ie=UTF8&node=2204518011
https://www.amazon.com/climbing/b/ref=nav_shopall_sa_out_climb?ie=UTF8&node=3402401
https://www.amazon.com/outdoor-accessories/b/ref=nav_shopall_sa_out_accout?ie=UTF8&node=11051400011
https://www.amazon.com/outdoor-recreation/b/ref=nav_shopall_sa_out_alloutrec?ie=UTF8&node=706814011

Selector:

Elements elements = doc.select("span[class=nav-a-content]");
System.out.println(elements.size());
// The element count sometimes equals zero, sometimes not.

for (int i = 2; i <= elements.size(); i++) {
    String css = "#nav-subnav > a:nth-child(" + Integer.toString(i) + ")";
    Element element = doc.select(css).first();
}
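
As a side note, a sketch of a less index-sensitive way to walk the menu, assuming the anchors are direct children of #nav-subnav as in the copied selector. Selecting them all at once means an empty result simply indicates the menu was not found:

Elements menuLinks = doc.select("#nav-subnav > a");
for (Element link : menuLinks) {
    System.out.println(link.attr("abs:href"));
}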

Item

The links in the menu.
[screenshot of the navigation menu]

Thanks!

@jygan (Contributor) commented Jun 1, 2017

Just hard-code up to 10 categories and catch the exception:

for (int i = 2; i <= 10; i++) {
    try {
        Element element = doc.select("#nav-subnav > a:nth-child(" + i + ")").first();
        // first() returns null when the nth child does not exist
    } catch (Exception e) {
        // skip this index and keep going
    }
}

@xiayank (Author) commented Jun 1, 2017

The thing is that the next time we crawl the page, it may or may not work. So some products will be crawled the first time but never crawled again.
Does that affect our project? We need to compare prices across the different times we crawl.

@jygan (Contributor) commented Jun 1, 2017

We need to crawl the product again even if it has already been crawled.
Does your code sometimes fail at this line? Element element = doc.select(css).first();

@xiayank (Author) commented Jun 1, 2017

My code has two problems:

1. Document doc = Jsoup.connect(url).headers(headers).userAgent(USER_AGENT).timeout(1000000).get(); sometimes throws the exception IllegalArgumentException: String must not be empty.

2. Sometimes elements.size() will be zero, sometimes not.

My solution for the link-exploring crawler is to initialize all the URLs into a queue and put any URL that fails back into the queue, quitting the loop only when the queue is empty. At the end, I get all the links.

But when I design the product-detail crawler, it is normal for elements.size() to be 0, since not all sub-category pages match the same CSS selector; some do not have a product list on them. So I cannot use the queue approach there.
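
A minimal self-contained sketch of that retry queue, assuming a list of seed URLs (a real crawler would also cap retries per URL so a permanently dead link cannot loop forever, and pause between attempts):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RetryQueueCrawler {
    // Seed the queue with every URL, re-enqueue failures, stop once it drains.
    public static List<Document> crawlAll(List<String> urls) {
        Deque<String> queue = new ArrayDeque<>(urls);
        List<Document> pages = new ArrayList<>();
        while (!queue.isEmpty()) {
            String url = queue.poll();
            try {
                pages.add(Jsoup.connect(url).get());
            } catch (Exception e) {
                queue.add(url); // failed: put it back and retry later
            }
        }
        return pages;
    }
}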

@jygan (Contributor) commented Jun 1, 2017

Document doc = Jsoup.connect(url).headers(headers).userAgent(USER_AGENT).timeout(1000000).get(); sometimes throws the exception IllegalArgumentException: String must not be empty

For this error, can you check whether url is empty?

2. Sometimes elements.size() will be zero, sometimes not.

This might be related to the maximum body size the crawler will load; try:

Document doc = Jsoup.connect(url)
        .headers(headers)
        .userAgent(USER_AGENT)
        .maxBodySize(0)
        .get();
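
For context: jsoup limits the downloaded body size by default (1 MB in the versions current at the time), so on a page larger than that the HTML is truncated and any elements past the cutoff silently disappear from the parsed Document; maxBodySize(0) removes the cap.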
