Crawler: using doc.select("li[data-asin]") to count the result ids on a page, why is the count different on every run? #30

Open
bluecode2017 opened this issue May 14, 2017 · 3 comments

bluecode2017 commented May 14, 2017

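I ran into a question while adding paging to the crawler assignment.

The instructor's code uses doc.select("li[data-asin]") to get the number of ids and then builds the detail link of each product from its index. For example: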

//detail url
String detail_path = "#result_" + Integer.toString(i) + " > div > div > div > div.a-fixed-left-grid-col.a-col-left > div > div > a";

I used the same method. For clarity I have left out the part that parses the products on each page; the code is:

Elements results = doc.select("li[data-asin]");
System.out.println("this url page has num of results = " + results.size());

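Code:
[screenshot]
After compiling, I found that results.size(), i.e. the number of li[data-asin] elements on the page, differs from run to run: sometimes 18, sometimes 20. Why is that? Because of this variation, the products extracted from the page are wrong.
However, when I inspect the page on Amazon, there are always 18 products, so the ranges should be page1 0-17 page1 16-33, page2 32-49.
Since results.size() is different every time, the product indices I compute are wrong too, and so is the crawled data.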

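Please help me figure out what the problem is.

See the screenshots below for details.

When I view page 2 on Amazon, I see only 18 products:
[screenshot]

After compiling, the first run:

[screenshot]

A second run immediately afterwards:

[screenshot]

Strangely, for page=2 the count is sometimes 18 and sometimes 20, and the same happens on other pages.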

Member

hackjutsu commented May 14, 2017

Perhaps the full HTML page contains other elements that match li[data-asin] besides the items in the product list.
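One way to check this hypothesis is to look at the id of every matched element and separate the ones that follow the result_N pattern from everything else. The sketch below is plain Java and works on a list of id strings; in the crawler these would presumably be collected by calling id() on each element in results. The class name MatchAudit, the method name, and the sample ids are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class MatchAudit {
    // Ids of real product entries are assumed to look like "result_16".
    private static final Pattern RESULT_ID = Pattern.compile("result_\\d+");

    /** Returns the ids that do NOT follow the "result_N" pattern. */
    static List<String> unexpectedIds(List<String> ids) {
        List<String> unexpected = new ArrayList<>();
        for (String id : ids) {
            if (id == null || !RESULT_ID.matcher(id).matches()) {
                unexpected.add(id);
            }
        }
        return unexpected;
    }

    public static void main(String[] args) {
        // Hypothetical ids, standing in for the ids of the matched elements.
        List<String> ids = Arrays.asList("result_16", "result_17", "", "ad-slot");
        List<String> unexpected = unexpectedIds(ids);
        System.out.println("matched elements: " + ids.size());
        System.out.println("not product entries: " + unexpected.size());
    }
}
```

If the second number is non-zero on the runs that report 20, the extra matches are probably not products, and results.size() should not be used as the product count.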

My approach is to extract the index of the first product on each page and then count upward from it, one product at a time.

Document doc = Jsoup.connect(url).headers(headers).userAgent(USER_AGENT).timeout(100000).get();

Elements results = doc.select("li[data-asin]");
int startId = 0;
Element first = results.first(); // null when nothing matched
if (first != null) {
    System.out.println("-----> The id for the first element is " + first.id());
    try {
        // Parse the first product id, for example "16" from "result_16"
        startId = Integer.parseInt(first.id().split("_")[1]);
    } catch (NumberFormatException | ArrayIndexOutOfBoundsException e) {
        // The id is not of the form "result_N"; keep startId = 0
    }
}

System.out.println("num of results = " + results.size());
System.out.println("----> startId is " + startId);

for (int i = startId; i < startId + results.size(); i++) {
    System.out.println("-----> ProductId is " + i);
    // Put your code here
}
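The loop above still takes its upper bound from results.size(), so if the selector also matches elements outside the product list, it may run past the last real product. A possible variation, sketched here under that assumption, is to read the index from each matched id instead of counting. The code is plain Java over a list of id strings (in the crawler, presumably the id() of each matched element); ResultIndex, parseIndices, and detailPath are hypothetical names, and the selector string is the one quoted in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResultIndex {
    // Captures N from "result_N"; at most 9 digits so parseInt cannot overflow.
    private static final Pattern RESULT_ID = Pattern.compile("result_(\\d{1,9})");

    /** Returns the indices of all ids of the form "result_N", in input order. */
    static List<Integer> parseIndices(List<String> ids) {
        List<Integer> indices = new ArrayList<>();
        for (String id : ids) {
            if (id == null) {
                continue;
            }
            Matcher m = RESULT_ID.matcher(id);
            if (m.matches()) {
                indices.add(Integer.parseInt(m.group(1)));
            }
        }
        return indices;
    }

    /** Builds the detail-link selector for one product index. */
    static String detailPath(int i) {
        return "#result_" + i
                + " > div > div > div > div.a-fixed-left-grid-col.a-col-left > div > div > a";
    }

    public static void main(String[] args) {
        // Hypothetical ids; the last two stand for non-product matches.
        List<String> ids = Arrays.asList("result_16", "result_17", "", "ad-slot");
        for (int i : parseIndices(ids)) {
            System.out.println(detailPath(i));
        }
    }
}
```

With this variation the number of products handled per page no longer depends on how many extra elements happen to match the selector.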

Author

bluecode2017 commented May 15, 2017 via email

Member

@bluecode2017 You can use the approach from the previous reply: extract the index of the first product separately for each page.
