Py selenium j script click #1

Merged
merged 22 commits into from
Nov 4, 2016

Conversation

miztiik (Owner) commented on Nov 4, 2016

Working: the scraper can fetch the web pages under the forums using Selenium and store them in the "output" directory.

Mystique added 22 commits October 17, 2016 23:59
 - Uses both the wrapper display and Xvfb (need to get this right)
 - Need to get the browser wait delay right

Signed-off-by: Mystique <[email protected]>
- Scrapes URLs and prints out the list
- Has subroutines to scrape a given page; still has to iterate through to the last page

Signed-off-by: Mystique <[email protected]>
The wait condition works but is not sufficient for this site.
To do:
- Working on the next-page click

Signed-off-by: Mystique <[email protected]>
 - For the 'Next Page' button click
  - Uses 'execute_script' for the JS button click
 - Scraping multiple pages
 - Returns the URL list
Signed-off-by: Mystique <[email protected]>
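
A minimal sketch of the JavaScript-driven 'Next Page' click described in this commit, not the code from the PR itself. It assumes a Selenium WebDriver instance is passed in as `driver`; the function names, the CSS selectors, and the `max_pages` cap are hypothetical. The string `"css selector"` is the locator strategy that `By.CSS_SELECTOR` stands for.

```python
def click_with_js(driver, css_selector):
    """Click the first element matching css_selector via JavaScript.

    Returns True if an element was found and clicked, False otherwise.
    """
    elements = driver.find_elements("css selector", css_selector)
    if not elements:
        return False
    # arguments[0] refers to the element passed after the script string.
    driver.execute_script("arguments[0].click();", elements[0])
    return True


def collect_urls(driver, link_selector, next_selector, max_pages=50):
    """Gather link hrefs page by page until no 'Next Page' button remains."""
    urls = []
    for _ in range(max_pages):
        for link in driver.find_elements("css selector", link_selector):
            href = link.get_attribute("href")
            if href:
                urls.append(href)
        # Stop when the 'Next Page' button is no longer present.
        if not click_with_js(driver, next_selector):
            break
    return urls
```

On a real site the loop would also need a wait after each click so the next page has loaded before its links are read; that step is left out here because the right wait condition depends on the site.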
 - Scrapes and returns a dictionary with AWS tag, sourceUrl, uri, pages crawled, crawl success
 - Stores them in a JSON output file named 'acloudguru-<awsTag>.json'

Signed-off-by: Mystique <[email protected]>
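
As an illustration of the output step in this commit, the sketch below writes one result dictionary to a file named `acloudguru-<awsTag>.json`. Only the file name pattern comes from the commit message; the function name, the `output_dir` parameter, and the exact dictionary keys (`awsTag` and so on) are assumptions.

```python
import json
import os


def save_crawl_result(result, output_dir="output"):
    """Write one crawl result to output_dir/acloudguru-<awsTag>.json."""
    os.makedirs(output_dir, exist_ok=True)
    # The "awsTag" key name is assumed for illustration.
    file_name = "acloudguru-{}.json".format(result["awsTag"])
    path = os.path.join(output_dir, file_name)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(result, handle, indent=2)
    return path
```

A call might look like `save_crawl_result({"awsTag": "s3", "sourceUrl": "https://example.com/forum", "uri": [], "pagesCrawled": 0, "crawlSuccess": True})`, with every value here being a placeholder.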
     - Scrapes and returns a dictionary with AWS tag, sourceUrl, uri, pages crawled, crawl success
     - Stores them in a JSON output file named 'acloudguru-<awsTag>.json'

Signed-off-by: Mystique <[email protected]>
Signed-off-by: Mystique <[email protected]>
 - Added another area to scrape

Signed-off-by: Mystique <[email protected]>
- Ability to read URLs from a file and scrape them
- Each URL's output is dumped to a separate file with a tag and date prefix

Working code can be re-used as is.

Signed-off-by: Mystique <[email protected]>
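
One way the file-driven flow in this commit could look, as a hedged sketch: the input format (one URL per line, with `#` comments ignored), both function names, and the `<tag>-<date>.json` naming are assumptions, since the commit message does not spell them out.

```python
from datetime import date


def read_urls(path):
    """Return non-empty, non-comment lines from a text file of URLs."""
    with open(path, encoding="utf-8") as handle:
        lines = (line.strip() for line in handle)
        return [line for line in lines if line and not line.startswith("#")]


def output_name(tag, day=None):
    """Build a file name prefixed with the tag and an ISO date."""
    day = day or date.today()
    return "{}-{}.json".format(tag, day.isoformat())
```

Each URL returned by `read_urls` would then be scraped and its result written under the name produced by `output_name`.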
- Scrape the link and store the output as JSON in the "output" directory
- Added a date-time stamp to the JSON

Signed-off-by: Mystique <[email protected]>
- Added timestamp
- Added the page load wait time in the URL, no more hardcoding

Signed-off-by: Mystique <[email protected]>
- Added the available tags, hard-coded in a dictionary; will probably iterate over them in future.

Signed-off-by: Mystique <[email protected]>
miztiik (Owner, Author) commented on Nov 4, 2016

LGTM

@miztiik miztiik merged commit 4891dd2 into master Nov 4, 2016