Jsoup Crawler #43
Comments
Are you using "Copy selector" in Chrome?
When using jsoup, does anyone else find that even for the same page and the same CSS selector, the Element you get sometimes works normally and returns what you want to scrape, but sometimes is empty and sometimes throws an error?
Same here. I thought it was Amazon blocking the crawler...
If so, shouldn't there be a 503 error?
@bihjuchiu
Good point, maybe it's a Jsoup problem...
@xiayank Can you post the URL and selector you are using, and also tell me which item you want to crawl?
URL list:
https://www.amazon.com/workout-clothes/b/ref=nav_shopall_sa_sp_athclg?ie=UTF8&node=11444071011
https://www.amazon.com/Exercise-Equipment-Gym-Equipment/b/ref=nav_shopall_sa_sp_exfit?ie=UTF8&node=3407731
https://www.amazon.com/Hunting-Fishing-Gear-Equipment/b/ref=nav_shopall_hntfsh?ie=UTF8&node=706813011
https://www.amazon.com/soccer-store-soccer-shop/b/ref=nav_shopall_sa_sp_team?ie=UTF8&node=706809011
https://www.amazon.com/Fan-Shop-Sports-Outdoors/b/ref=nav_shopall_sa_sp_fan?ie=UTF8&node=3386071
https://www.amazon.com/Golf/b/ref=nav_shopall_sa_sp_golf?ie=UTF8&node=3410851
https://www.amazon.com/man-cave/b/ref=nav_shopall_sa_sp_gamerm?ie=UTF8&node=706808011
https://www.amazon.com/Sports-Collectibles/b/ref=nav_shopall_sa_sp_sptcllct?ie=UTF8&node=3250697011
https://www.amazon.com/Sports-Fitness/b/ref=nav_shopall_sa_sp_allsport?ie=UTF8&node=10971181011
https://www.amazon.com/b/ref=nav_shopall_lpd_gno_sports?ie=UTF8&node=12034909011
https://www.amazon.com/camping-hiking/b/ref=nav_shopall_sa_out_camphike?ie=UTF8&node=3400371
https://www.amazon.com/Cycling-Wheel-Sports-Outdoors/b/ref=nav_shopall_sa_out_cyc?ie=UTF8&node=3403201
https://www.amazon.com/Outdoor-Recreation-Clothing/b/ref=nav_shopall_sa_out_outcloth?ie=UTF8&node=11443874011
https://www.amazon.com/skateboarding-scooters-skates/b/ref=nav_shopall_sa_out_scooskate?ie=UTF8&node=11051398011
https://www.amazon.com/water-sports/b/ref=nav_shopall_sa_out_water?ie=UTF8&node=11051399011
https://www.amazon.com/winter-sports/b/ref=nav_shopall_sa_out_wintersport?ie=UTF8&node=2204518011
https://www.amazon.com/climbing/b/ref=nav_shopall_sa_out_climb?ie=UTF8&node=3402401
https://www.amazon.com/outdoor-accessories/b/ref=nav_shopall_sa_out_accout?ie=UTF8&node=11051400011
https://www.amazon.com/outdoor-recreation/b/ref=nav_shopall_sa_out_alloutrec?ie=UTF8&node=706814011
Selector:
Elements elements = doc.select("span[class=nav-a-content]");
System.out.println(elements.size());
// The element count is sometimes zero and sometimes not.
for (int i = 2; i <= elements.size(); i++) {
    String css = "#nav-subnav > a:nth-child(" + i + ")";
    Element element = doc.select(css).first();
}
Item
Thanks!
Just hard-code up to 10 categories and catch the exception.
The thing is that the next time we crawl the page, it may or may not work. So some products will be crawled the first time but never crawled again.
We need to crawl a product again even if it has already been crawled.
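One way to act on the point above is to treat an empty selection or an exception as a failed attempt and retry a bounded number of times, rather than dropping the page for good. The following is only a minimal sketch, not code from this project: `RetryingFetcher`, `fetchWithRetry`, and the `Supplier` that stands in for the fetch-and-select step are hypothetical names introduced for illustration.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

public class RetryingFetcher {

    /**
     * Runs the attempt up to maxAttempts times and returns the first
     * non-empty result. A RuntimeException or an empty list counts as a
     * failed attempt. Returns an empty list if every attempt fails.
     */
    public static <T> List<T> fetchWithRetry(Supplier<List<T>> attempt,
                                             int maxAttempts,
                                             long delayMillis) {
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                List<T> result = attempt.get();
                if (result != null && !result.isEmpty()) {
                    return result;
                }
            } catch (RuntimeException e) {
                // Treat the exception like an empty result and try again.
            }
            if (i < maxAttempts && delayMillis > 0) {
                try {
                    Thread.sleep(delayMillis); // pause before the next attempt
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return Collections.emptyList();
    }
}
```

In the crawler, the supplier would wrap whatever fetches the page and runs `doc.select(...)`; a URL that still returns nothing after the last attempt can be put back on the queue for a later pass instead of being marked as done.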
My code has two problems: [...] 2. Sometimes [...] My solution for the link-exploring crawler is to initialize all the URLs into a [...] But when I design the product detail crawler [...] It is normal [...] I have the [...]
For this error, can you check whether the URL is empty? 2. Sometimes
I have a question about using the jsoup API to select the target element.
Here is the HTML.
I want to get the href attribute value of the <a> tag, which is under the <div class=bxc-grid__image bxc-grid__image--light>.
I tried to locate the div, and that works. I followed the API: "E > F: an F direct child of E". So the CSS selector will be li[class=sub-categories__list__item]>a. However, there is an exception.
Does anyone know how to locate the <a> tag? Thanks in advance!
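A minimal sketch of one possible approach follows. The page's real HTML is not shown in this thread, so the markup in `main`, the assumption that the `<a>` is nested inside the `<div>` rather than being a direct child of the `<li>`, and the `LinkSelectorSketch` and `firstCategoryLink` names are all hypothetical. If the nesting is as assumed, `li > a` matches nothing, `first()` returns `null`, and calling a method on that result throws a `NullPointerException`. The sketch uses a descendant combinator (a space) and a class selector instead, because `[class=...]` only matches when the whole `class` attribute equals that single value.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkSelectorSketch {

    /** Returns the absolute href of the first matching link, or null if nothing matches. */
    public static String firstCategoryLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        // "E F" matches an F anywhere under E; "E > F" would require a direct child.
        Element link = doc.select("li.sub-categories__list__item a").first();
        if (link == null) {
            return null; // selector matched nothing; avoid dereferencing null
        }
        return link.absUrl("href"); // resolves a relative href against baseUri
    }

    public static void main(String[] args) {
        // Hypothetical markup, for illustration only.
        String html = "<ul><li class=\"sub-categories__list__item\">"
                + "<div class=\"bxc-grid__image bxc-grid__image--light\">"
                + "<a href=\"/golf\">Golf</a></div></li></ul>";
        System.out.println(firstCategoryLink(html, "https://www.amazon.com/"));
    }
}
```

If the `<a>` really is a direct child of the `<li>`, `li.sub-categories__list__item > a` should work as well; either way, checking the result of `first()` for `null` keeps a missed selector from surfacing as an exception.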
Jsoup select API
URL of original page
Here is the exception log: