Crawler: why does doc.select("li[data-asin]") return a different number of result_ids for the same page on every run? #30

bluecode2017 · 2017-05-14T09:11:23Z

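Hi, I ran into a question while adding paging to the crawler assignment:

Your code uses doc.select("li[data-asin]") to get the number of ids, and then builds the detail link for each product. For example: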

//detail url
String detail_path = "#result_" + Integer.toString(i) + " > div > div > div > div.a-fixed-left-grid-col.a-col-left > div > div > a";

I used the same approach. For clarity I have left out the part that parses the products on each page; the code is:

Elements results = doc.select("li[data-asin]");
System.out.println("this url page has num of results = " + results.size());

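Code:

After compiling, I found that results.size(), i.e. the number of li[data-asin] elements on the page, differs from run to run: sometimes it is 18, sometimes 20. Why is that? This variation means the products extracted from the page are wrong.
But when I inspect the page on Amazon, there are always 18 products, so it should be page1 0-17 page1 16-33, page2 32-49
Because results.size() is different each time, the product indices I compute are wrong too, and so is the crawled data.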

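Please help me figure out what the problem is.

See the screenshots below for details.

When I view page 2 on Amazon, I see only 18 products:

After compiling, the first run:

Running it again immediately:

Strangely, for page=2 the count is sometimes 18 and sometimes 20, and the same happens on other pages.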

hackjutsu · 2017-05-14T19:33:20Z

Perhaps the full HTML page contains other elements that match li[data-asin] besides the product list.
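One way to check this would be to collect the id of every matched element (for example with Element.id()) and see which ones do not look like search results. Below is a minimal sketch using only the standard library. It assumes that real results carry ids of the form result_N, as in the detail selector earlier in this thread; the class name, the method name, and the sample ids are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ResultIdFilter {
    // Assumed id format for real search results, e.g. "result_16".
    private static final Pattern RESULT_ID = Pattern.compile("result_\\d+");

    /** Returns only the ids that look like search-result ids, in their original order. */
    static List<String> keepResultIds(List<String> ids) {
        List<String> kept = new ArrayList<>();
        for (String id : ids) {
            if (id != null && RESULT_ID.matcher(id).matches()) {
                kept.add(id);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical ids, as they might be collected from doc.select("li[data-asin]").
        List<String> ids = Arrays.asList("result_16", "result_17", "", "sponsored_1");
        List<String> kept = keepResultIds(ids);
        System.out.println("matched = " + ids.size() + ", real results = " + kept.size());
    }
}
```

If the two counts differ, the extra matches are elements outside the result list, which would explain a count of 20 on a page that shows 18 products.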

My approach is to extract the index of the first product on each page, and then count up by one from there.

Document doc = Jsoup.connect(url).headers(headers).userAgent(USER_AGENT).timeout(100000).get();

Elements results = doc.select("li[data-asin]");
System.out.println("-----> " + "The id for the first element is " + results.first().id());
int startId = 0;
try {
    // Parse the first product id, for example, parse "16" from "result_16"
    startId = Integer.parseInt(results.first().id().split("_")[1]);
} catch(Exception e) {
    // Intentionally left blank
}

System.out.println("num of results = " + results.size());
System.out.println("----> startId is " + Integer.toString(startId));

for(int i = startId; i < startId + results.size(); i++) {
    System.out.println("-----> ProductId is " + Integer.toString(i));
    // Put your code here
}
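The id parsing above silently falls back to 0 when the id is not in the expected format. A slightly more defensive variant could look like the sketch below, which reports a malformed id as an empty OptionalInt instead. It assumes the result_N id format shown above; the class ResultIndexParser and both method names are hypothetical, and the selector string is the one quoted earlier in this thread.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResultIndexParser {
    // Assumed id format: "result_" followed by digits, e.g. "result_16".
    private static final Pattern RESULT_ID = Pattern.compile("result_(\\d+)");

    /** Parses 16 from "result_16"; returns empty for null or any other format. */
    static OptionalInt parseIndex(String id) {
        if (id == null) {
            return OptionalInt.empty();
        }
        Matcher m = RESULT_ID.matcher(id);
        if (!m.matches()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(m.group(1)));
        } catch (NumberFormatException e) {
            // The digits do not fit into an int.
            return OptionalInt.empty();
        }
    }

    /** Builds the detail-link selector quoted in this thread for one result index. */
    static String detailSelector(int index) {
        return "#result_" + index
                + " > div > div > div > div.a-fixed-left-grid-col.a-col-left > div > div > a";
    }

    public static void main(String[] args) {
        OptionalInt start = parseIndex("result_16");
        if (start.isPresent()) {
            System.out.println(detailSelector(start.getAsInt()));
        } else {
            System.out.println("first id is not in the result_N format");
        }
    }
}
```

With this, the caller can decide whether to skip the page or stop when the first id cannot be parsed, rather than starting from 0 by accident.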

bluecode2017 · 2017-05-15T04:18:46Z

No, I looked and there is no such element. That is why I find it strange.

hackjutsu · 2017-05-15T17:36:25Z

@bluecode2017 You can refer to the solution in my previous reply, i.e. extract the index of the first product separately for each page.

hackjutsu added the 爬虫 label May 14, 2017
