[Data] Scrape list of these events #34
Comments
|
For MindCORE, there is a link to download all events as iCal format. Shall we try parsing with this https://pypi.org/project/icalendar/? |
@bluenex yes, using |
Thanks so much @Tanvvenk! You can keep updating the list here! |
Oh my, that is such a big list 😮 |
@bluenex is there a way to extract calendar from this page https://www.lrsm.upenn.edu/calendar/? |
@bluenex @titipata I tried to scrape the following website: http://go.activecalendar.com/UPennMINS/?view=list2&search=y |
@bluenex @titipata I tried to scrape the following website: https://www.law.upenn.edu/institutes/legalhistory/workshops-lectures.php However, the href for the event URL is different from the actual event_url. The links go through external JavaScript to reach the event page. Could someone check this? Thank you. |
@bluenex, so they use
Then, they use JavaScript to call the event in the following format: |
@Tanvvenk so basically, you got
<a href="/live/events/50930-legal-history-workshop-intisar-a-rabb">Legal History Workshop: Intisar A. Rabb</a>
The You can use which is an encoded URL of a bunch of XML tags:
https://www.law.upenn.edu/live/calendar/view/event/event_id/50930?user_tz=IT&syntax=<widget type="events_calendar" priority="high">
<arg id="mini_cal_heat_map">true</arg>
<arg id="exclude_tag">Library Hours</arg>
<arg id="exclude_tag">Library Event Private</arg>
<arg id="exclude_tag">Careers Calendar ONLY</arg>
<arg id="exclude_tag">Docs Only Event</arg>
<arg id="exclude_tag">Pending Event</arg>
<arg id="exclude_tag">Private Event</arg>
<arg id="exclude_tag">Offcampus</arg>
<arg id="exclude_group">Registrar</arg>
<arg id="exclude_group">Admitted JD</arg>
<arg id="placeholder">Search Calendar</arg>
<arg id="disable_timezone">true</arg>
<arg id="thumb_width">144</arg>
<arg id="thumb_height">144</arg>
<arg id="modular">true</arg>
<arg id="default_view">week</arg>
</widget>
So the conclusion is that we should be able to retrieve event data (for any event that shares the same XML config) from the encoded URL. In other words, you can get the event data as JSON with this Python snippet:
import requests
url = '''
https://www.law.upenn.edu/live/calendar/view/event/event_id/50930?user_tz=IT&syntax=
%3Cwidget%20type%3D%22events_calendar%22%20priority%3D%22high%22%3E%3
Carg%20id%3D%22mini_cal_heat_map%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%
22exclude_tag%22%3ELibrary%20Hours%3C%2Farg%3E%3Carg%20id%3D%22exclud
e_tag%22%3ELibrary%20Event%20Private%3C%2Farg%3E%3Carg%20id%3D%22excl
ude_tag%22%3ECareers%20Calendar%20ONLY%3C%2Farg%3E%3Carg%20id%3D%22ex
clude_tag%22%3EDocs%20Only%20Event%3C%2Farg%3E%3Carg%20id%3D%22exclud
e_tag%22%3EPending%20Event%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22
%3EPrivate%20Event%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22%3EOffca
mpus%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ERegistrar%3C%2Far
g%3E%3Carg%20id%3D%22exclude_group%22%3EAdmitted%20JD%3C%2Farg%3E%3Ca
rg%20id%3D%22placeholder%22%3ESearch%20Calendar%3C%2Farg%3E%3Carg%20i
d%3D%22disable_timezone%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22thumb_w
idth%22%3E144%3C%2Farg%3E%3Carg%20id%3D%22thumb_height%22%3E144%3C%2F
arg%3E%3Carg%20id%3D%22modular%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22
default_view%22%3Eweek%3C%2Farg%3E%3C%2Fwidget%3E
'''
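# The URL above is wrapped for readability, so remove the newlines before requesting.
resp = requests.get(url=url.replace('\n', ''))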
data = resp.json()
data
Out[5]:
{'date': 'Date',
'title': 'March 17, 2016',
'event': {'id': 50930,
...
Thank you very much! I will try this out @bluenex |
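The steps described above (take the numeric event id from the href, then call the JSON endpoint with the percent-encoded widget XML as the `syntax` parameter) could be sketched as follows. This is only a minimal sketch that builds the URL with the standard library and sends no request; the helper names `event_id_from_href` and `build_event_url` are hypothetical, and the widget XML is shortened to two arguments for illustration.

```python
import re
from urllib.parse import quote

# Base of the per-event JSON endpoint quoted in the thread.
BASE = "https://www.law.upenn.edu/live/calendar/view/event/event_id/{}"

# Shortened widget config for illustration only; the real page sends more <arg> tags.
WIDGET_XML = (
    '<widget type="events_calendar" priority="high">'
    '<arg id="mini_cal_heat_map">true</arg>'
    '<arg id="default_view">week</arg>'
    '</widget>'
)


def event_id_from_href(href):
    """Return the leading numeric id of an href like /live/events/50930-some-slug."""
    match = re.search(r"/live/events/(\d+)", href)
    if match is None:
        raise ValueError(f"no event id found in {href!r}")
    return match.group(1)


def build_event_url(event_id, widget_xml=WIDGET_XML):
    """Build the JSON URL, percent-encoding the widget XML for the syntax parameter."""
    # safe="" also encodes "/", so "</arg>" becomes "%3C%2Farg%3E" as in the thread.
    return BASE.format(event_id) + "?user_tz=IT&syntax=" + quote(widget_xml, safe="")


href = "/live/events/50930-legal-history-workshop-intisar-a-rabb"
url = build_event_url(event_id_from_href(href))
print(url)
```

The resulting string could then be passed to `requests.get(url).json()` in the same way as the snippet above.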
@bluenex @titipata Page.soup is giving unexpected information that doesn't seem to contain the event_url href. Thank you |
@Tanvvenk checking XHR of the page, it sent a request to this link: https://penntoday.upenn.edu/events-feed?_format=json You get JSON from that right away. |
Nice, thanks @bluenex!
Extra questions
So, we have a few more URLs to scrape from
import requests
event_id = '701084'
text = """https://events.med.upenn.edu/live/calendar/view/event/event_id/{}?user_tz=IT&synta
x=%3Cwidget%20type%3D%22events_calendar%22%3E%3Carg%20id%3D%22mini_cal_heat_map%22%3Etrue%3C%2Fa
rg%3E%3Carg%20id%3D%22thumb_width%22%3E200%3C%2Farg%3E%3Carg%20id%3D%22thumb_height%22%3E200%3C%
2Farg%3E%3Carg%20id%3D%22hide_repeats%22%3Efalse%3C%2Farg%3E%3Carg%20id%3D%22show_groups%22%3Efa
lse%3C%2Farg%3E%3Carg%20id%3D%22show_tags%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22default_view%22%
3Eday%3C%2Farg%3E%3Carg%20id%3D%22group%22%3ECenter%20for%20Clinical%20Epidemiology%20and%20Bios
tatistics%20%28CCEB%29%3C%2Farg%3E%3C%21--start%20excluded%20groups--%3E%3C%21--%20if%20you%20ar
e%20changing%20the%20excluded%20groups%2C%20they%20should%20also%20be%20changed%20in%20this%20fi
le%3A%0A%09%2Fpublic_html%2Fpsom-excluded-groups.php--%3E%3Carg%20id%3D%22exclude_group%22%3EAdm
in%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ETest%20Import%3C%2Farg%3E%3Carg%20id%3D%22excl
ude_group%22%3ELiveWhale%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ETest%20Department%3C%2Fa
rg%3E%3C%21--end%20excluded%20groups--%3E%3Carg%20id%3D%22webcal_feed_links%22%3Etrue%3C%2Farg%3
E%3C%2Fwidget%3E""".format(event_id).replace('\n', '').replace(' ', '')
event_json = requests.get(text).json() |
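To check which arguments a request URL like the one above carries, the `syntax` parameter can be decoded back to XML and its `<arg>` tags listed. A minimal sketch, assuming the decoded value is well-formed XML; the function name `widget_args` and the shortened example URL are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlsplit


def widget_args(url):
    """Return (id, text) pairs for every <arg> in the URL's syntax parameter."""
    query = urlsplit(url).query
    # parse_qs percent-decodes the parameter values.
    syntax_xml = parse_qs(query)["syntax"][0]
    root = ET.fromstring(syntax_xml)
    return [(arg.get("id"), arg.text) for arg in root.iter("arg")]


# Shortened example in the same shape as the URLs in the thread.
example = (
    "https://events.med.upenn.edu/live/calendar/view/event/event_id/701084"
    "?user_tz=IT&syntax=%3Cwidget%20type%3D%22events_calendar%22%3E"
    "%3Carg%20id%3D%22default_view%22%3Eday%3C%2Farg%3E"
    "%3Carg%20id%3D%22exclude_group%22%3EAdmin%3C%2Farg%3E"
    "%3Carg%20id%3D%22exclude_group%22%3ELiveWhale%3C%2Farg%3E"
    "%3C%2Fwidget%3E"
)

for arg_id, value in widget_args(example):
    print(arg_id, "=", value)
```

Listing the arguments this way makes it easier to spot which ones (for example `group`) differ between calendars that share the same endpoint.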
https://www.gse.upenn.edu/event
This one is odd in that it doesn't make a request at all. However, all the event data seems to be packed into the HTML file. You should be able to get all the event data from
<div class="views-row views-row-7 views-row-odd">
<div class="views-field views-field-nothing-1">
<span class="field-content">
<a href="#" title="Add to Calendar" class="addthisevent">
<img src="http://addthisevent.com/gfx/icon-calendar-t2.png" />
<span class="_start">2018-12-10 10:30</span>
<span class="_end">2018-12-10 10:30</span>
<span class="_zonecode">15</span>
<span class="_summary"
>Penn GSE Event: Colloquium Presented by Dr. John Papay</span
>
<span class="_description"
>Dr. John Papay, Assistant Professor of Education at Brown University,
will present a colloquium titled: Where Teachers Succeed and Improve:
The Importance of School Context for Teacher Effectiveness and
Development.
https://www.gse.upenn.edu/event/colloquium-presented-dr-john-papay</span
>
<span class="_location">3700 Walnut St, Room 203</span>
<span class="_organizer"></span> <span class="_organizer_email"></span>
<span class="_date_format">DD/MM/YYYY</span>
</a>
</span>
</div>
<div class="views-field views-field-nothing">
<span class="field-content">
<span class="event-month-date">
<span class="date-display-single">Dec</span>
</span>
<span class="event-day-date">
<span class="date-display-single">10</span>
</span>
</span>
</div>
<div class="views-field views-field-field-date-4">
<div class="field-content">
<span class="date-display-single">1544455800</span>
</div>
</div>
<div class="views-field views-field-title">
<span class="field-content">
<a href="/event/colloquium-presented-dr-john-papay">
<h2>Colloquium Presented by Dr. John Papay</h2>
</a>
</span>
</div>
<div class="views-field views-field-field-date-5">
<div class="field-content">
<span class="glyphicon glyphicon-time"></span>
<span class="date-display-single">10:30am</span>
</div>
</div>
<div class="views-field views-field-field-location-1">
<div class="field-content">
<span class="glyphicon glyphicon-map-marker"></span> 3700 Walnut St, Room
203
</div>
</div>
<div class="views-field views-field-term-node-tid">
<span class="field-content">
<div class="item-list">
<ul>
<li class="first">On Campus</li>
<li>Students</li>
<li>Faculty</li>
<li>Staff</li>
<li class="last">Lecture/Panel/Presentation</li>
</ul>
</div>
</span>
</div>
<div class="views-field views-field-field-event-short-description-1">
<div class="field-content">
Dr. John Papay, Assistant Professor of Education at Brown University, will
present a colloquium titled: Where Teachers Succeed and Improve: The
Importance of School Context for Teacher Effectiveness and Development.
</div>
</div>
</div>
The URL to a specific event can be found in
<div class="views-field views-field-title">
<span class="field-content">
<a href="/event/colloquium-presented-dr-john-papay">
<h2>Colloquium Presented by Dr. John Papay</h2>
</a>
</span>
</div>
Still, it doesn't make a request, so you will need to scrape from the HTML anyway.
https://events.med.upenn.edu/
For this, I found that if we click on Response from a link above is a JSON object containing 50 events out of
https://www.med.upenn.edu/inclusion-and-diversity/Upcoming-Events/
When we open a link to an event on this page, we are brought to another page of This may be hard to read, so let's look at the decoded XML:
https://events.med.upenn.edu/live/calendar/view/all?user_tz=IT&syntax=<widget type="events_calendar">
<arg id="mini_cal_heat_map">true</arg>
<arg id="thumb_width">200</arg>
<arg id="thumb_height">200</arg>
<arg id="hide_repeats">false</arg>
<arg id="show_groups">false</arg>
<arg id="show_tags">true</arg>
<arg id="default_view">day</arg>
<arg id="group">Office of Inclusion and Diversity</arg> <!--start excluded groups--> <!-- if you are changing the excluded groups, they should also be changed in this file:
/public_html/psom-excluded-groups.php-->
<arg id="exclude_group">Admin</arg>
<arg id="exclude_group">Test Import</arg>
<arg id="exclude_group">LiveWhale</arg>
<arg id="exclude_group">Test Department</arg> <!--end excluded groups-->
<arg id="webcal_feed_links">true</arg>
</widget>
I suspect that https://events.med.upenn.edu/#!view/all doesn't actually show all the events (for instance, each request gets 50 items out of 1086, starting from today). Events from the Office of Inclusion and Diversity are also not included in those 50 items; instead, you need to make a request with this argument
One thing to notice: Main event page for med.upenn: or |
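For the GSE page discussed earlier in this comment, where everything has to come from the HTML itself, a minimal sketch with the standard library's `html.parser` might look like this. It collects only the `_start`, `_summary`, and `_location` spans plus the event link; the class name `GSEEventParser` and the trimmed sample markup are assumptions for illustration, and a library such as BeautifulSoup would work as well.

```python
from html.parser import HTMLParser

# Span classes to collect from the addthisevent block.
WANTED = {"_start", "_summary", "_location"}


class GSEEventParser(HTMLParser):
    """Collect selected addthisevent spans and the /event/ link from one views-row."""

    def __init__(self):
        super().__init__()
        self.event = {}
        self._field = None  # class of the span currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") in WANTED:
            self._field = attrs["class"]
            self.event[self._field] = ""
        elif tag == "a" and (attrs.get("href") or "").startswith("/event/"):
            self.event["url"] = "https://www.gse.upenn.edu" + attrs["href"]

    def handle_data(self, data):
        if self._field is not None:
            self.event[self._field] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._field is not None:
            # Collapse the whitespace left by line-wrapped markup.
            self.event[self._field] = " ".join(self.event[self._field].split())
            self._field = None


# Trimmed version of the markup quoted in the thread.
sample = """
<div class="views-row">
  <span class="_start">2018-12-10 10:30</span>
  <span class="_summary"
    >Penn GSE Event: Colloquium Presented by Dr. John Papay</span
  >
  <span class="_location">3700 Walnut St, Room 203</span>
  <a href="/event/colloquium-presented-dr-john-papay"><h2>Colloquium Presented by Dr. John Papay</h2></a>
</div>
"""

parser = GSEEventParser()
parser.feed(sample)
print(parser.event)
```

For a full page, one parser instance per `views-row` block (or a reset between rows) would be needed, since this sketch keeps only a single event dictionary.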
Adding the scraper for MINS (http://go.activecalendar.com/upennmins) |
@bluenex can you help on getting the JSON format from the following sites: |
@titipata fortunately, they are fairly simple. Don't forget to extract the arguments from the request URL (if there are any).
|
The events are available at https://web.sas.upenn.edu/mindcore/events/. Not sure whether the page offers an easy-to-download format.