Py selenium j script click #1

miztiik · 2016-11-04T17:16:10Z

Working: able to scrape the forum web pages using Selenium and store them under the "output" directory

Signed-off-by: Mystique <[email protected]>

- Uses both wrapper display and xvfb (need to get this right)
- Need to get the delay for browser wait right

Signed-off-by: Mystique <[email protected]>

- Scrapes URLs and prints out the list
- Has subroutines to scrape a given page; still has to iterate through to the end

Signed-off-by: Mystique <[email protected]>

Signed-off-by: Mystique <[email protected]>

Wait condition works but is not enough for this site. To do: working on the next-page click.

Signed-off-by: Mystique <[email protected]>

- Handles the 'Next Page' button click
- Uses 'execute_script' for the JS button click
- Scrapes multiple pages
- Returns the URL list
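A minimal sketch of the JavaScript click described above, not the code from this PR. The function name `click_next_page` and the default selector `a.next` are hypothetical; the real page's selector is not given here. The function only relies on the `find_elements` and `execute_script` methods that a Selenium WebDriver provides.

```python
def click_next_page(driver, css_selector="a.next"):
    """Click the first element matching css_selector via JavaScript.

    Returns True if an element was found and clicked, False otherwise.
    The default selector is a placeholder, not the site's real one.
    """
    # "css selector" is the string value behind selenium's By.CSS_SELECTOR,
    # so this helper needs no selenium import of its own.
    matches = driver.find_elements("css selector", css_selector)
    if not matches:
        return False
    # Trigger the click from JavaScript instead of element.click(); this can
    # help when the button is overlapped or reported as not interactable.
    driver.execute_script("arguments[0].click();", matches[0])
    return True
```

A `False` return value can double as the signal that there is no further page to visit.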

Signed-off-by: Mystique <[email protected]>

- Scrapes and returns a dictionary with AWS tag, sourceUrl, uri, pages crawled, and crawl success
- Stores them in a JSON output file named in the format 'acloudguru-<awsTag>.json'

Signed-off-by: Mystique <[email protected]>
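As an illustration of the output step above, here is a small sketch that writes one result dictionary to `acloudguru-<awsTag>.json`. The helper name `save_crawl_result`, the key spellings `pagesCrawled` and `crawlSuccess`, and the example values are assumptions; only the file name pattern and the list of fields come from the commit message.

```python
import json
import os


def save_crawl_result(result, output_dir="output"):
    """Write one crawl result to <output_dir>/acloudguru-<awsTag>.json."""
    os.makedirs(output_dir, exist_ok=True)
    file_name = "acloudguru-{}.json".format(result["awsTag"])
    path = os.path.join(output_dir, file_name)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(result, fh, indent=2)
    return path


# Hypothetical record; key names and values are illustrative only.
example = {
    "awsTag": "s3",
    "sourceUrl": "https://example.com/forums/s3",
    "uri": ["https://example.com/forums/s3/question-1"],
    "pagesCrawled": 1,
    "crawlSuccess": True,
}
```

Calling `save_crawl_result(example)` would create `output/acloudguru-s3.json` relative to the current directory.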

Signed-off-by: Mystique <[email protected]>

- Output also committed

Signed-off-by: Mystique <[email protected]>

Signed-off-by: Mystique <[email protected]>

- Added another area to scrape

Signed-off-by: Mystique <[email protected]>

- Ability to read URLs from a file and scrape them
- Each URL's output is dumped to a separate file with a tag and date prefix

Working code can be re-used as is.

Signed-off-by: Mystique <[email protected]>
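The two points above could look roughly like the sketch below. The function names, the one-URL-per-line input layout, and the exact `<tag>-<date>.json` naming are assumptions made for illustration; the commit only states that URLs are read from a file and that each output file carries a tag and date prefix.

```python
import datetime


def read_urls(path):
    """Return the non-empty lines of a text file, one URL per line."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def output_filename(tag, when=None):
    """Build a file name that starts with the tag and an ISO date."""
    when = when or datetime.date.today()
    return "{}-{}.json".format(tag, when.isoformat())
```

Passing the date explicitly, as in `output_filename("s3", datetime.date(2016, 11, 4))`, keeps the naming reproducible when re-running an old crawl.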

- Scrape the link and store the output as JSON in the "output" directory
- Added a date-time stamp to the JSON

Signed-off-by: Mystique <[email protected]>

- Added timestamp
- Added the page-load wait time in the url, no more hardcoding

Signed-off-by: Mystique <[email protected]>

Signed-off-by: Mystique <[email protected]>

- Added the available tags, hard-coded in a dictionary; will probably iterate over them in future.

Signed-off-by: Mystique <[email protected]>

miztiik · 2016-11-04T17:17:30Z

LGTM

Mystique added 22 commits October 17, 2016 23:59

Trying to re-organize into function routines

50f8a5a

Signed-off-by: Mystique <[email protected]>

Working code to scrape the message board of acloudguru

7b36f84

- Uses both wrapper display and xvfb (need to get this right)
- Need to get the delay for browser wait right

Signed-off-by: Mystique <[email protected]>

WORKING - REMOVED UNNECESSARY ELEMENTS

8bb4b2d

- Scrapes URLs and prints out the list
- Has subroutines to scrape a given page; still has to iterate through to the end

Signed-off-by: Mystique <[email protected]>

Removed settings relevant to "Scrapy Splash" as that was no longer used

823fa62

Signed-off-by: Mystique <[email protected]>

Removed superfluous comments & lines

034e3fd

Wait condition works but is not enough for this site. To do: working on the next-page click.

Signed-off-by: Mystique <[email protected]>

Working code

05a4e6f

- Handles the 'Next Page' button click
- Uses 'execute_script' for the JS button click
- Scrapes multiple pages
- Returns the URL list

Almost there

caa3c63

Signed-off-by: Mystique <[email protected]>

WORKING CODE:

ce1dcd0

- Scrapes and returns a dictionary with AWS tag, sourceUrl, uri, pages crawled, and crawl success
- Stores them in a JSON output file named in the format 'acloudguru-<awsTag>.json'

Signed-off-by: Mystique <[email protected]>

WORKING CODE:

977ca5f

- Scrapes and returns a dictionary with AWS tag, sourceUrl, uri, pages crawled, and crawl success
- Stores them in a JSON output file named in the format 'acloudguru-<awsTag>.json'

Signed-off-by: Mystique <[email protected]>

Changed the name to reflect the function of "urlCollection"

284a236

Signed-off-by: Mystique <[email protected]>

Bot to collect the question text and answers ( without comments )

7bc6767

- Output also committed

Signed-off-by: Mystique <[email protected]>

minor improvements

cf4e596

Signed-off-by: Mystique <[email protected]>

Removed these temporary scraps

34d3a0d

Signed-off-by: Mystique <[email protected]>

Created "input" & "output" directories to organize the data

22ece0d

Signed-off-by: Mystique <[email protected]>

- Improved Error messages,

9927a7b

- Added another area to scrape

Signed-off-by: Mystique <[email protected]>

Lots of improvements,

06b5223

- Ability to read URLs from a file and scrape them
- Each URL's output is dumped to a separate file with a tag and date prefix

Working code can be re-used as is.

Signed-off-by: Mystique <[email protected]>

- Readfile from a particular location

69c9a33

- Scrape the link and store the output as JSON in the "output" directory
- Added a date-time stamp to the JSON

Signed-off-by: Mystique <[email protected]>

- Scrap and store output in a folder

83cd752

- Added timestamp
- Added the page-load wait time in the url, no more hardcoding

Signed-off-by: Mystique <[email protected]>

No longer required, moved them to another directory "LnksToScrape"

7c5ced3

Signed-off-by: Mystique <[email protected]>

This is how the input links look

66503a4

Signed-off-by: Mystique <[email protected]>

This is how the output links look

6d680af

Signed-off-by: Mystique <[email protected]>

Improved the script to stop when reaching the end of pagination

4dd7f94

- Added the available tags, hard-coded in a dictionary; will probably iterate over them in future.

Signed-off-by: Mystique <[email protected]>
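One way to stop at the end of pagination, sketched under assumptions: the names `crawl_all_pages`, `scrape_page`, `go_to_next_page`, and the `max_pages` safety cap are all hypothetical and do not come from this PR. The loop is written against two plain callables so it does not depend on any particular browser API.

```python
def crawl_all_pages(scrape_page, go_to_next_page, max_pages=100):
    """Collect items page by page until there is no next page.

    scrape_page() returns the items found on the current page.
    go_to_next_page() returns True if it advanced, False on the last page.
    max_pages is a safety cap in case the end is never detected.
    """
    items = []
    pages_crawled = 0
    while pages_crawled < max_pages:
        items.extend(scrape_page())
        pages_crawled += 1
        if not go_to_next_page():
            break  # no further page: end of pagination reached
    return items, pages_crawled
```

The returned page count maps naturally onto a "pages crawled" field in the stored result.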

miztiik merged commit 4891dd2 into master Nov 4, 2016
