Need help with python -m ichrome.web #140

juanfrilla · 2023-09-26T19:53:23Z

If I launch a browser as a service:
python -m ichrome.web
and then run:

import requests
from bs4 import BeautifulSoup

headers = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
   'Accept-Language': 'es-ES,es;q=0.9',
   'Cache-Control': 'max-age=0',
   'Connection': 'keep-alive',
   'Sec-Fetch-Dest': 'document',
   'Sec-Fetch-Mode': 'navigate',
   'Sec-Fetch-Site': 'none',
   'Sec-Fetch-User': '?1',
   'Upgrade-Insecure-Requests': '1',
   'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
   'sec-ch-ua': '"Google Chrome";v="117", "Not;A=Brand";v="8", "Chromium";v="117"',
   'sec-ch-ua-mobile': '?0',
   'sec-ch-ua-platform': '"macOS"',
}

params = (
   ('url', "https://oficinajudicialvirtual.pjud.cl/home/index.php"),
)

response = requests.get('http://127.0.0.1:8080/chrome/preview', headers=headers, params=params)

soup = BeautifulSoup(response.text, 'html.parser')

recaptcha_url = soup.select('iframe[title="reCAPTCHA"]')[0]["src"]

This gives me the reCAPTCHA URL.

But if I do it like this instead:

from bs4 import BeautifulSoup
from torequests import tPool
from inspect import getsource
req = tPool()



async def tab_callback(task, tab, data, timeout):
    await tab.wait_loading(20)
    return await tab.html

json = {
    'tab_callback': getsource(tab_callback),
    "timeout": 20,
    "incognito_args": {
        "url": "https://oficinajudicialvirtual.pjud.cl/home/index.php",
        "proxyServer": "37.19.220.129:8443"
    }
}

response = req.post('http://127.0.0.1:8080/chrome/do',json=json)

soup = BeautifulSoup(response.text, 'html.parser')

recaptcha_url = soup.select('iframe[title="reCAPTCHA"]')[0]["src"]

The soup I get back is not the fully loaded page. I guess it could be some security measure on the origin website I'm scraping.
Any help?
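Both snippets above index `[0]` on the selector result, so a page without the iframe raises an `IndexError` instead of showing that the iframe is simply missing. A minimal sketch of a check that returns `None` in that case, written with only the standard library so it runs anywhere; the names `find_iframe_src` and `_IframeFinder` are hypothetical helpers, not part of ichrome or BeautifulSoup:

```python
from html.parser import HTMLParser
from typing import Optional


class _IframeFinder(HTMLParser):
    """Remembers the src of the first <iframe> with a matching title and a src."""

    def __init__(self, title: str) -> None:
        super().__init__()
        self.title = title
        self.src: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything once a match has been recorded.
        if tag != "iframe" or self.src is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("title") == self.title:
            self.src = attr_map.get("src")


def find_iframe_src(html: str, title: str = "reCAPTCHA") -> Optional[str]:
    """Return the iframe src, or None when no matching iframe is in the HTML."""
    finder = _IframeFinder(title)
    finder.feed(html)
    finder.close()
    return finder.src
```

With BeautifulSoup, `soup.select_one('iframe[title="reCAPTCHA"]')` gives a similar result, since it returns `None` when nothing matches.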

ClericPy · 2023-09-27T14:20:58Z

Try "proxyServer": "http://37.19.220.129:8443".
Use await tab.screenshot(save_path='image_path') and look at the image to see what happened.
Use python -m ichrome.web --disable-headless and watch what happens in the browser while you make the request.

juanfrilla · 2023-10-01T08:49:05Z

Thanks @ClericPy. It opens the browser on the page, and once the browser stops loading, the reCAPTCHA loads. But it looks like the response I get back does not contain the reCAPTCHA URL.
Maybe it is an async/await issue.

I tried this:
python -m ichrome.web --disable-headless

from bs4 import BeautifulSoup
from torequests import tPool
from inspect import getsource
req = tPool()



async def tab_callback(task, tab, data, timeout):
    await tab.wait_loading(5000)
    await tab.screenshot(save_path='./screenshot.png')
    return await tab.html

json = {
    'tab_callback': getsource(tab_callback),
    "timeout": 5000,
    "incognito_args": {
        "url": "https://oficinajudicialvirtual.pjud.cl/home/index.php",
        "proxyServer": "http://37.19.220.129:8443"
    }
}

response = req.post('http://127.0.0.1:8080/chrome/do',json=json)

soup = BeautifulSoup(response.text, 'html.parser')

recaptcha_url = soup.select('iframe[title="reCAPTCHA"]')[0]["src"]
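One way to test the timing hypothesis is to poll the page source until the iframe shows up, instead of relying on a single load wait. The sketch below shows only the generic polling loop, run against a fake page; `poll_until` and `make_fake_page` are hypothetical names, and nothing here calls ichrome. The assumption is that the widget is injected some time after the initial load finishes.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


async def poll_until(
    fetch: Callable[[], Awaitable[str]],
    accept: Callable[[str], bool],
    timeout: float = 20.0,
    interval: float = 0.5,
) -> Optional[str]:
    """Call fetch() repeatedly until accept(value) is true or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        value = await fetch()
        if accept(value):
            return value
        # Give up if the next sleep would run past the deadline.
        if time.monotonic() + interval > deadline:
            return None
        await asyncio.sleep(interval)


def make_fake_page(ready_after: int) -> Callable[[], Awaitable[str]]:
    """Stand-in for a tab: the iframe appears only from the Nth fetch onward."""
    calls = {"n": 0}

    async def fetch() -> str:
        calls["n"] += 1
        if calls["n"] < ready_after:
            return "<html><body>loading</body></html>"
        return (
            '<html><body><iframe title="reCAPTCHA" '
            'src="https://example.test/anchor"></iframe></body></html>'
        )

    return fetch
```

Inside a `tab_callback`, the `fetch` step would presumably be `await tab.html`, as in the callbacks above. Because only the callback's source is sent to the server via `getsource`, the loop would have to be written inside the callback body rather than imported from a helper module.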

ClericPy · 2023-10-10T11:45:32Z

What did you see in screenshot.png?
I can't see the HTML, so I can't reproduce the problem.

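Use python -m ichrome.web --disable-headless with this callback:

async def tab_callback(task, tab, data, timeout):
    # imported inside the function because only this function's source is sent to the server
    import asyncio
    # keep the tab open (10000 seconds) so the page can be inspected by hand
    await asyncio.sleep(10000)
    return await tab.html

to check the HTML in the real Chrome window.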
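When comparing the HTML in the Chrome window with what the client receives, it may also be worth checking the shape of the response body. This thread does not say what `/chrome/do` returns. If it wraps the callback's return value in a JSON envelope (an assumption), then `response.text` is JSON with escaped quotes rather than raw HTML, and a selector like `iframe[title="reCAPTCHA"]` would not match it. A defensive sketch, where `extract_html` and the `"result"` key are hypothetical:

```python
import json
from typing import Any


def extract_html(body: str, key: str = "result") -> str:
    """Return the HTML from a response body that may or may not be JSON-wrapped.

    `key` is a hypothetical field name; check it against a real response.
    """
    try:
        payload: Any = json.loads(body)
    except ValueError:
        return body  # not JSON at all: treat the body as raw HTML
    if isinstance(payload, str):
        return payload  # the HTML was sent as a JSON-encoded string
    if isinstance(payload, dict) and isinstance(payload.get(key), str):
        return payload[key]  # the HTML sits under the assumed key
    return body  # unknown shape: leave the body untouched
```

Printing the first few hundred characters of `response.text` is a quick way to see which case applies before parsing it.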

juanfrilla · 2023-10-12T09:06:35Z

@ClericPy could you one day support an API request like the one below, with the proxy passed as a parameter in the payload of the API call?

It would be better this way because it removes the need for async/await.

import requests
from bs4 import BeautifulSoup

headers = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
   'Accept-Language': 'es-ES,es;q=0.9',
   'Cache-Control': 'max-age=0',
   'Connection': 'keep-alive',
   'Sec-Fetch-Dest': 'document',
   'Sec-Fetch-Mode': 'navigate',
   'Sec-Fetch-Site': 'none',
   'Sec-Fetch-User': '?1',
   'Upgrade-Insecure-Requests': '1',
   'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
   'sec-ch-ua': '"Google Chrome";v="117", "Not;A=Brand";v="8", "Chromium";v="117"',
   'sec-ch-ua-mobile': '?0',
   'sec-ch-ua-platform': '"macOS"',
}

params = (
   ('url', "https://oficinajudicialvirtual.pjud.cl/home/index.php"),
)

data = {
   "proxyServer": "http://37.19.220.129:8443"
}

response = requests.get('http://127.0.0.1:8080/chrome/preview', headers=headers, params=params, data=data)

soup = BeautifulSoup(response.text, 'html.parser')

recaptcha_url = soup.select('iframe[title="reCAPTCHA"]')[0]["src"]

ClericPy · 2023-10-13T12:10:33Z

The headers are not used by ichrome yet.
I need some time to think about the API.
