[Data] Scrape list of these events #34

Open
titipata opened this issue Sep 10, 2018 · 25 comments
@titipata
Owner

The events are listed at https://web.sas.upenn.edu/mindcore/events/. Not sure whether the page offers an easy-to-download format.

@titipata
Owner Author

titipata commented Sep 11, 2018

@bluenex
Collaborator

bluenex commented Sep 12, 2018

For MindCORE, there is a link to download all events in iCal format. Shall we try parsing it with https://pypi.org/project/icalendar/?

@titipata
Owner Author

titipata commented Sep 12, 2018

@bluenex yes, icalendar should be the right tool. Also, we're actually hiring people to find all the sources of events now. I will forward you an email later this week.

@titipata titipata changed the title Scrape MindCORE events Scrape list of these events Oct 24, 2018
@Tanvvenk
Collaborator

Tanvvenk commented Oct 24, 2018

10/26/18

@titipata
Owner Author

Thanks so much @Tanvvenk! You can keep updating the list here!

@bluenex
Collaborator

bluenex commented Oct 29, 2018

Oh my, that is such a big list 😮

@titipata
Owner Author

titipata commented Oct 30, 2018

@titipata
Owner Author

titipata commented Nov 8, 2018

@bluenex is there a way to extract the calendar from this page: https://www.lrsm.upenn.edu/calendar/?

@Tanvvenk
Collaborator

@bluenex @titipata I tried to scrape the following website: http://go.activecalendar.com/UPennMINS/?view=list2&search=y
However, the event URL is not in PageSoup. Can someone check whether there is a hidden link? Thank you.

@Tanvvenk
Collaborator

@bluenex @titipata I tried to scrape the following website: https://www.law.upenn.edu/institutes/legalhistory/workshops-lectures.php

However, the href for the event differs from the actual event_url. The page relies on external JavaScript to reach the event page. Could someone check this? Thank you.

@titipata
Owner Author

@bluenex, so they use href in the following format:

<a href="/live/events/50930-legal-history-workshop-intisar-a-rabb">Legal History Workshop: Intisar A. Rabb</a>

Then, they use JavaScript to load the event at a URL in the following format: https://www.law.upenn.edu/newsevents/calendar.php#!event_id/50930/view/event. I guess we can take the event ID from the href and find a way to retrieve JSON from it, as we did previously.
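A minimal sketch of that idea, assuming every event href follows the `/live/events/<id>-<slug>` pattern shown above. The helper names `event_id_from_href` and `event_page_url` are made up for illustration and are not part of the project.

```python
import re


def event_id_from_href(href):
    """Return the numeric event ID from a '/live/events/<id>-<slug>' href, or None."""
    match = re.search(r"/live/events/(\d+)", href)
    return match.group(1) if match else None


def event_page_url(event_id):
    """Build the JavaScript-routed event page URL for a given event ID."""
    return (
        "https://www.law.upenn.edu/newsevents/calendar.php"
        f"#!event_id/{event_id}/view/event"
    )


href = "/live/events/50930-legal-history-workshop-intisar-a-rabb"
event_id = event_id_from_href(href)
print(event_id)
print(event_page_url(event_id))
```

The ID is kept as a string because it is only ever substituted back into URLs.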

@bluenex
Collaborator

bluenex commented Nov 20, 2018

> @bluenex @titipata I tried to scrape the following website: https://www.law.upenn.edu/institutes/legalhistory/workshops-lectures.php
>
> However, the href for the event url is different from the actual event_url. They link to the external javascript to get to the event-page. Could someone check this? Thank you.

@Tanvvenk so basically, you get a tag like the one @titipata mentioned:

<a href="/live/events/50930-legal-history-workshop-intisar-a-rabb">Legal History Workshop: Intisar A. Rabb</a>

The /live/events/50930-legal-history-workshop-intisar-a-rabb path in the href is used to request the JSON data that fills the event page template. Because what we actually need is the event data, it is better to get that JSON directly (meaning we don't even need to scrape the HTML).

You can use https://www.law.upenn.edu/newsevents/calendar.php#!event_id/50930/view/event as a pattern to build a link to the event page, but for the event data, check the requests the page makes and you will see this link:

https://www.law.upenn.edu/live/calendar/view/event/event_id/50930?user_tz=IT&syntax=%3Cwidget%20type%3D%22events_calendar%22%20priority%3D%22high%22%3E%3Carg%20id%3D%22mini_cal_heat_map%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22%3ELibrary%20Hours%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22%3ELibrary%20Event%20Private%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22%3ECareers%20Calendar%20ONLY%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22%3EDocs%20Only%20Event%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22%3EPending%20Event%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22%3EPrivate%20Event%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22%3EOffcampus%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ERegistrar%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3EAdmitted%20JD%3C%2Farg%3E%3Carg%20id%3D%22placeholder%22%3ESearch%20Calendar%3C%2Farg%3E%3Carg%20id%3D%22disable_timezone%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22thumb_width%22%3E144%3C%2Farg%3E%3Carg%20id%3D%22thumb_height%22%3E144%3C%2Farg%3E%3Carg%20id%3D%22modular%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22default_view%22%3Eweek%3C%2Farg%3E%3C%2Fwidget%3E

which is the URL-encoded form of a set of XML tags:

https://www.law.upenn.edu/live/calendar/view/event/event_id/50930?user_tz=IT&syntax=<widget type="events_calendar" priority="high">
  <arg id="mini_cal_heat_map">true</arg>
  <arg id="exclude_tag">Library Hours</arg>
  <arg id="exclude_tag">Library Event Private</arg>
  <arg id="exclude_tag">Careers Calendar ONLY</arg>
  <arg id="exclude_tag">Docs Only Event</arg>
  <arg id="exclude_tag">Pending Event</arg>
  <arg id="exclude_tag">Private Event</arg>
  <arg id="exclude_tag">Offcampus</arg>
  <arg id="exclude_group">Registrar</arg>
  <arg id="exclude_group">Admitted JD</arg>
  <arg id="placeholder">Search Calendar</arg>
  <arg id="disable_timezone">true</arg>
  <arg id="thumb_width">144</arg>
  <arg id="thumb_height">144</arg>
  <arg id="modular">true</arg>
  <arg id="default_view">week</arg>
</widget>
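To inspect the widget configuration without decoding it by hand, the standard library can do the work. This is only a sketch: `widget_args` is a hypothetical helper, and the URL below is a shortened version of the real request that keeps three of its `<arg>` entries.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlsplit


def widget_args(url):
    """Return (id, text) pairs for every <arg> in the URL's `syntax` parameter."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs percent-decodes the value, so it is already plain XML here.
    widget = ET.fromstring(query["syntax"][0])
    return [(arg.get("id"), arg.text) for arg in widget.iter("arg")]


# Shortened request URL: only three of the original <arg> entries are kept.
url = (
    "https://www.law.upenn.edu/live/calendar/view/event/event_id/50930"
    "?user_tz=IT&syntax="
    "%3Cwidget%20type%3D%22events_calendar%22%20priority%3D%22high%22%3E"
    "%3Carg%20id%3D%22mini_cal_heat_map%22%3Etrue%3C%2Farg%3E"
    "%3Carg%20id%3D%22exclude_tag%22%3ELibrary%20Hours%3C%2Farg%3E"
    "%3Carg%20id%3D%22default_view%22%3Eweek%3C%2Farg%3E"
    "%3C%2Fwidget%3E"
)

for arg_id, value in widget_args(url):
    print(arg_id, "=", value)
```

Listing the arguments this way makes it easier to compare the configs used by different calendar pages.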

So the conclusion is that we should be able to retrieve event data (for any event that shares the same XML config) from the encoded URL. In other words, you can get the event data as JSON with this Python snippet:

import requests

url = '''
  https://www.law.upenn.edu/live/calendar/view/event/event_id/50930?user_tz=IT&syntax=
  %3Cwidget%20type%3D%22events_calendar%22%20priority%3D%22high%22%3E%3
  Carg%20id%3D%22mini_cal_heat_map%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%
  22exclude_tag%22%3ELibrary%20Hours%3C%2Farg%3E%3Carg%20id%3D%22exclud
  e_tag%22%3ELibrary%20Event%20Private%3C%2Farg%3E%3Carg%20id%3D%22excl
  ude_tag%22%3ECareers%20Calendar%20ONLY%3C%2Farg%3E%3Carg%20id%3D%22ex
  clude_tag%22%3EDocs%20Only%20Event%3C%2Farg%3E%3Carg%20id%3D%22exclud
  e_tag%22%3EPending%20Event%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22
  %3EPrivate%20Event%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22%3EOffca
  mpus%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ERegistrar%3C%2Far
  g%3E%3Carg%20id%3D%22exclude_group%22%3EAdmitted%20JD%3C%2Farg%3E%3Ca
  rg%20id%3D%22placeholder%22%3ESearch%20Calendar%3C%2Farg%3E%3Carg%20i
  d%3D%22disable_timezone%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22thumb_w
  idth%22%3E144%3C%2Farg%3E%3Carg%20id%3D%22thumb_height%22%3E144%3C%2F
  arg%3E%3Carg%20id%3D%22modular%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22
  default_view%22%3Eweek%3C%2Farg%3E%3C%2Fwidget%3E
'''
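# Remove the line breaks and indentation that the triple-quoted string adds to the URL.
url = ''.join(url.split())
resp = requests.get(url=url)
data = resp.json()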

data
Out[5]:
{'date': 'Date',
 'title': 'March 17, 2016',
 'event': {'id': 50930,
...

Thank you very much! I will try this out @bluenex

@Tanvvenk
Collaborator

Tanvvenk commented Dec 3, 2018

@bluenex @titipata
Is information being drawn from an external or hidden link for the following: https://penntoday.upenn.edu/events

Page.soup is giving unexpected information that doesn't seem to contain event_url href

Thank you

@bluenex
Collaborator

bluenex commented Dec 8, 2018

> @bluenex @titipata
> Is information being drawn from an external or hidden link for the following: https://penntoday.upenn.edu/events
>
> Page.soup is giving unexpected information that doesn't seem to contain event_url href
>
> Thank you

@Tanvvenk checking the page's XHR requests, it sends a request to this link:

https://penntoday.upenn.edu/events-feed?_format=json

You get JSON from that right away.
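A minimal sketch of fetching that feed with only the standard library (the snippets in this thread use `requests`, which works just as well). `fetch_json` and `fake_opener` are hypothetical names, and the fake payload is invented so the example runs offline; it says nothing about the real feed's fields.

```python
import io
import json
from urllib.request import urlopen

FEED_URL = "https://penntoday.upenn.edu/events-feed?_format=json"


def fetch_json(url, opener=urlopen):
    """Download `url` and decode the response body as JSON."""
    with opener(url) as response:
        return json.loads(response.read().decode("utf-8"))


def fake_opener(url):
    """Offline stand-in for urlopen; the payload is made up for illustration."""
    return io.BytesIO(b'[{"placeholder_field": "placeholder value"}]')


# Swap fake_opener for the default (urlopen) to hit the real feed.
events = fetch_json(FEED_URL, opener=fake_opener)
print(len(events))
```

Passing the opener as a parameter keeps the parsing logic testable without network access.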

@titipata
Owner Author

Nice, thanks @bluenex!

Extra questions: we have a few more URLs to scrape from Med events for which I couldn't get the event ID. These include https://www.gse.upenn.edu/event, https://events.med.upenn.edu/, and https://www.med.upenn.edu/inclusion-and-diversity/Upcoming-Events/. I can get an event's details using the following code, but I cannot use BeautifulSoup to get the event_id:

import requests

event_id = '701084'
text = """https://events.med.upenn.edu/live/calendar/view/event/event_id/{}?user_tz=IT&synta
x=%3Cwidget%20type%3D%22events_calendar%22%3E%3Carg%20id%3D%22mini_cal_heat_map%22%3Etrue%3C%2Fa
rg%3E%3Carg%20id%3D%22thumb_width%22%3E200%3C%2Farg%3E%3Carg%20id%3D%22thumb_height%22%3E200%3C%
2Farg%3E%3Carg%20id%3D%22hide_repeats%22%3Efalse%3C%2Farg%3E%3Carg%20id%3D%22show_groups%22%3Efa
lse%3C%2Farg%3E%3Carg%20id%3D%22show_tags%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22default_view%22%
3Eday%3C%2Farg%3E%3Carg%20id%3D%22group%22%3ECenter%20for%20Clinical%20Epidemiology%20and%20Bios
tatistics%20%28CCEB%29%3C%2Farg%3E%3C%21--start%20excluded%20groups--%3E%3C%21--%20if%20you%20ar
e%20changing%20the%20excluded%20groups%2C%20they%20should%20also%20be%20changed%20in%20this%20fi
le%3A%0A%09%2Fpublic_html%2Fpsom-excluded-groups.php--%3E%3Carg%20id%3D%22exclude_group%22%3EAdm
in%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ETest%20Import%3C%2Farg%3E%3Carg%20id%3D%22excl
ude_group%22%3ELiveWhale%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ETest%20Department%3C%2Fa
rg%3E%3C%21--end%20excluded%20groups--%3E%3Carg%20id%3D%22webcal_feed_links%22%3Etrue%3C%2Farg%3
E%3C%2Fwidget%3E""".format(event_id).replace('\n', '').replace(' ', '')
event_json = requests.get(text).json()

@bluenex
Collaborator

bluenex commented Dec 15, 2018

@titipata

https://www.gse.upenn.edu/event

This one is odd in that it doesn't make a request at all. However, all the event data seems to be packed into the HTML file itself. You should be able to get it from <div class="views-row views-row-{n} views-row-odd">, where {n} is the row number of the event being shown. Also note that some of the events in the HTML may not be rendered on the website. I still can't figure out what decides whether an event is rendered or not.

<div class="views-row views-row-7 views-row-odd">
  <div class="views-field views-field-nothing-1">
    <span class="field-content">
      <a href="#" title="Add to Calendar" class="addthisevent">
        <img src="http://addthisevent.com/gfx/icon-calendar-t2.png" />
        <span class="_start">2018-12-10 10:30</span>
        <span class="_end">2018-12-10 10:30</span>
        <span class="_zonecode">15</span>
        <span class="_summary"
          >Penn GSE Event: Colloquium Presented by Dr. John Papay</span
        >
        <span class="_description"
          >Dr. John Papay, Assistant Professor of Education at Brown University,
          will present a colloquium titled: Where Teachers Succeed and Improve:
          The Importance of School Context for Teacher Effectiveness and
          Development.
          https://www.gse.upenn.edu/event/colloquium-presented-dr-john-papay</span
        >
        <span class="_location">3700 Walnut St, Room 203</span>
        <span class="_organizer"></span> <span class="_organizer_email"></span>
        <span class="_date_format">DD/MM/YYYY</span>
      </a>
    </span>
  </div>
  <div class="views-field views-field-nothing">
    <span class="field-content">
      <span class="event-month-date">
        <span class="date-display-single">Dec</span>
      </span>
      <span class="event-day-date">
        <span class="date-display-single">10</span>
      </span>
    </span>
  </div>
  <div class="views-field views-field-field-date-4">
    <div class="field-content">
      <span class="date-display-single">1544455800</span>
    </div>
  </div>
  <div class="views-field views-field-title">
    <span class="field-content">
      <a href="/event/colloquium-presented-dr-john-papay">
        <h2>Colloquium Presented by Dr. John Papay</h2>
      </a>
    </span>
  </div>
  <div class="views-field views-field-field-date-5">
    <div class="field-content">
      <span class="glyphicon glyphicon-time"></span>
      <span class="date-display-single">10:30am</span>
    </div>
  </div>
  <div class="views-field views-field-field-location-1">
    <div class="field-content">
      <span class="glyphicon glyphicon-map-marker"></span> 3700 Walnut St, Room
      203
    </div>
  </div>
  <div class="views-field views-field-term-node-tid">
    <span class="field-content">
      <div class="item-list">
        <ul>
          <li class="first">On Campus</li>
          <li>Students</li>
          <li>Faculty</li>
          <li>Staff</li>
          <li class="last">Lecture/Panel/Presentation</li>
        </ul>
      </div>
    </span>
  </div>
  <div class="views-field views-field-field-event-short-description-1">
    <div class="field-content">
      Dr. John Papay, Assistant Professor of Education at Brown University, will
      present a colloquium titled: Where Teachers Succeed and Improve: The
      Importance of School Context for Teacher Effectiveness and Development.
    </div>
  </div>
</div>

The URL of a specific event can be found in the href at:

<div class="views-field views-field-title">
    <span class="field-content">
      <a href="/event/colloquium-presented-dr-john-papay">
        <h2>Colloquium Presented by Dr. John Papay</h2>
      </a>
    </span>
  </div>

Still, the page doesn't make a request, so you will need to scrape the HTML anyway.
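As a rough sketch of that scraping step, the parser below pulls the `_start`, `_end`, `_summary` and `_location` spans plus the event link out of each row. It uses the standard library's `html.parser` instead of BeautifulSoup so it runs without extra installs. `AddThisEventParser` is a hypothetical name, and the sample markup is a trimmed copy of the row shown above.

```python
from html.parser import HTMLParser


class AddThisEventParser(HTMLParser):
    """Collect selected addthisevent spans and the event link for each row."""

    FIELDS = {"_start", "_end", "_summary", "_location"}

    def __init__(self):
        super().__init__()
        self.events = []
        self._field = None  # name of the span currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        href = attrs.get("href") or ""
        if tag == "a" and "addthisevent" in classes:
            self.events.append({})  # a new event row starts here
        elif tag == "span" and self.events:
            self._field = next((c for c in classes if c in self.FIELDS), None)
        elif tag == "a" and self.events and href.startswith("/event/"):
            self.events[-1]["url"] = "https://www.gse.upenn.edu" + href

    def handle_data(self, data):
        if self._field:
            previous = self.events[-1].get(self._field, "")
            self.events[-1][self._field] = (previous + data).strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


sample = """
<div class="views-row views-row-7 views-row-odd">
  <a href="#" title="Add to Calendar" class="addthisevent">
    <span class="_start">2018-12-10 10:30</span>
    <span class="_summary">Penn GSE Event: Colloquium Presented by Dr. John Papay</span>
    <span class="_location">3700 Walnut St, Room 203</span>
  </a>
  <a href="/event/colloquium-presented-dr-john-papay"><h2>Colloquium Presented by Dr. John Papay</h2></a>
</div>
"""

parser = AddThisEventParser()
parser.feed(sample)
print(parser.events)
```

Since rows present in the HTML are not always rendered on the site, the result may contain more events than the page shows.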

https://events.med.upenn.edu/

For this one, I found that clicking List loads https://events.med.upenn.edu/#!view/all, and the following request is then made (notice ...calendar/view/all?):

https://events.med.upenn.edu/live/calendar/view/all?user_tz=IT&syntax=%3Cwidget%20type%3D%22events_calendar%22%3E%3Carg%20id%3D%22mini_cal_heat_map%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22thumb_width%22%3E200%3C%2Farg%3E%3Carg%20id%3D%22thumb_height%22%3E200%3C%2Farg%3E%3Carg%20id%3D%22hide_repeats%22%3Efalse%3C%2Farg%3E%3Carg%20id%3D%22show_groups%22%3Efalse%3C%2Farg%3E%3Carg%20id%3D%22show_tags%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22default_view%22%3E%3C%2Farg%3E%3Carg%20id%3D%22group%22%3E%3C%2Farg%3E%3C%21--start%20excluded%20groups--%3E%3C%21--%20if%20you%20are%20changing%20the%20excluded%20groups%2C%20they%20should%20also%20be%20changed%20in%20this%20file%3A%0A%09%2Fpublic_html%2Fpsom-excluded-groups.php--%3E%3Carg%20id%3D%22exclude_group%22%3EAdmin%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ETest%20Import%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ELiveWhale%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ETest%20Department%3C%2Farg%3E%3C%21--end%20excluded%20groups--%3E%3Carg%20id%3D%22webcal_feed_links%22%3Etrue%3C%2Farg%3E%3C%2Fwidget%3E

The response from the link above is a JSON object containing 50 events, starting from today, out of event_count in total. The keys you probably care about are events and event_count.

https://www.med.upenn.edu/inclusion-and-diversity/Upcoming-Events/

When we open a link to an event on this page, we are taken to another page on events.med.upenn.edu, namely https://events.med.upenn.edu/inclusion-and-diversity/#!view/all. This page sends a request in the same format but with a different value for <arg id="group">, as you can see below:

https://events.med.upenn.edu/live/calendar/view/all?user_tz=IT&syntax=%3Cwidget%20type%3D%22events_calendar%22%3E%3Carg%20id%3D%22mini_cal_heat_map%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22thumb_width%22%3E200%3C%2Farg%3E%3Carg%20id%3D%22thumb_height%22%3E200%3C%2Farg%3E%3Carg%20id%3D%22hide_repeats%22%3Efalse%3C%2Farg%3E%3Carg%20id%3D%22show_groups%22%3Efalse%3C%2Farg%3E%3Carg%20id%3D%22show_tags%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22default_view%22%3Eday%3C%2Farg%3E%3Carg%20id%3D%22group%22%3EOffice%20of%20Inclusion%20and%20Diversity%3C%2Farg%3E%3C%21--start%20excluded%20groups--%3E%3C%21--%20if%20you%20are%20changing%20the%20excluded%20groups%2C%20they%20should%20also%20be%20changed%20in%20this%20file%3A%0A%09%2Fpublic_html%2Fpsom-excluded-groups.php--%3E%3Carg%20id%3D%22exclude_group%22%3EAdmin%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ETest%20Import%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ELiveWhale%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ETest%20Department%3C%2Farg%3E%3C%21--end%20excluded%20groups--%3E%3Carg%20id%3D%22webcal_feed_links%22%3Etrue%3C%2Farg%3E%3C%2Fwidget%3E

This may be hard to read, so here is the decoded XML:

https://events.med.upenn.edu/live/calendar/view/all?user_tz=IT&syntax=<widget type="events_calendar">
    <arg id="mini_cal_heat_map">true</arg>
    <arg id="thumb_width">200</arg>
    <arg id="thumb_height">200</arg>
    <arg id="hide_repeats">false</arg>
    <arg id="show_groups">false</arg>
    <arg id="show_tags">true</arg>
    <arg id="default_view">day</arg>
    <arg id="group">Office of Inclusion and Diversity</arg>    <!--start excluded groups-->    <!-- if you are changing the excluded groups, they should also be changed in this file:
	/public_html/psom-excluded-groups.php-->
    <arg id="exclude_group">Admin</arg>
    <arg id="exclude_group">Test Import</arg>
    <arg id="exclude_group">LiveWhale</arg>
    <arg id="exclude_group">Test Department</arg>    <!--end excluded groups-->
    <arg id="webcal_feed_links">true</arg>
</widget>

I suspect that https://events.med.upenn.edu/#!view/all doesn't actually show all the events (for instance, each request returns 50 items out of 1086, starting from today). Events from the Office of Inclusion and Diversity are also not included in those 50 items. To get them, you need to make a request with the argument <arg id="group">Office of Inclusion and Diversity</arg>.

One thing to notice:

Main event page for med.upenn: https://events.med.upenn.edu/#!view/all.
Event page for the Office of Inclusion and Diversity: https://events.med.upenn.edu/inclusion-and-diversity/#!view/all.

In general the pattern is https://events.med.upenn.edu/{office or institute}/#!view/all, where {office or institute} corresponds to the group argument in the decoded XML, e.g. <arg id="group">Office of Inclusion and Diversity</arg>.
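The group-specific request can be assembled by filling the group name into the widget XML and percent-encoding the result. This is a sketch under assumptions: `listing_url` is a made-up helper, the template keeps only a few of the `<arg>` entries from the decoded request, and whether the server accepts a trimmed config has not been checked, so copy the full XML for real use.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# Trimmed widget config (assumption): the real request carries more <arg> entries.
WIDGET_TEMPLATE = (
    '<widget type="events_calendar">'
    '<arg id="hide_repeats">false</arg>'
    '<arg id="default_view">day</arg>'
    '<arg id="group">{group}</arg>'
    '<arg id="exclude_group">Admin</arg>'
    "</widget>"
)


def listing_url(group=""):
    """Build the calendar/view/all request URL for one group (empty means no group)."""
    widget = WIDGET_TEMPLATE.format(group=escape(group))
    return (
        "https://events.med.upenn.edu/live/calendar/view/all"
        "?user_tz=IT&syntax=" + quote(widget, safe="")
    )


print(listing_url("Office of Inclusion and Diversity"))
```

Calling `listing_url()` with no argument leaves the group empty, which mirrors the main-page request shown earlier.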

@titipata
Owner Author

Adding the scraper for MINS (http://go.activecalendar.com/upennmins)

@bluenex bluenex changed the title Scrape list of these events [Data] Scrape list of these events Oct 23, 2019
@titipata
Owner Author

@bluenex can you help with getting the JSON format from the following sites:

@bluenex
Collaborator

bluenex commented Oct 25, 2019

@titipata fortunately, they are fairly simple. Don't forget to extract the arguments from the request URL (if there are any).

@titipata
Owner Author

titipata commented Nov 2, 2019

@titipata
Owner Author

titipata commented Nov 14, 2019
