Conversation

@mikehquan19
Contributor

Related to #57,
In addition to changing the selector's tag so that there wouldn't be any hangs, I've also updated the code so that it gets as much information as possible (the description, location, email, etc.).

Please let me know if the change benefits the data for the API, or if I need to revert those changes.
Thank you!

@TyHil
Member

TyHil commented Apr 26, 2025

I was poking around since it seems this only scrapes events from https://calendar.utdallas.edu/calendar/1, which is less than a day's worth of events. Check this out:

https://calendar.utdallas.edu/api/2/events?days=365&pp=100&page=1
https://calendar.utdallas.edu/api/2/events?days=365&pp=100&page=2
https://help.concept3d.com/hc/en-us/articles/11940613344915-Localist-API

Seems like this returns all the events; we'd just need to loop through all the pages.

@mikehquan19
Contributor Author

@TyHil, no problem! I'll update the code so that it loops through the pages. How many pages would you like to scrape before the events start getting irrelevant time-wise?

@TyHil
Member

TyHil commented Apr 26, 2025

Awesome! I think a year's worth of data is good, since that's in line with the Mazevo scraper. With the API URL I provided, it automatically stops after 365 days and returns an empty array. Also, page.total in the result says how many pages are available (12 in this case).
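For reference, a minimal Go sketch of reading page.total from one API response. The struct below models only the fields needed for pagination and the names (eventsPage, totalPages) are my own guesses at the JSON shape, not the actual scraper's types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// eventsPage models just the parts of the Localist API envelope we
// need for pagination (hypothetical names, partial shape).
type eventsPage struct {
	Events []json.RawMessage `json:"events"`
	Page   struct {
		Current int `json:"current"`
		Total   int `json:"total"`
	} `json:"page"`
}

// sample mimics the envelope returned by
// https://calendar.utdallas.edu/api/2/events?days=365&pp=100&page=1
var sample = []byte(`{"events":[{},{}],"page":{"current":1,"total":12}}`)

// totalPages decodes one response body and reports how many pages
// of events are available in total.
func totalPages(body []byte) (int, error) {
	var p eventsPage
	if err := json.Unmarshal(body, &p); err != nil {
		return 0, err
	}
	return p.Page.Total, nil
}

func main() {
	n, err := totalPages(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // prints 12 for the sample above
}
```

Once the first page is fetched, the loop bound comes straight from this value rather than being hardcoded.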

@mikehquan19
Contributor Author

I just visited https://calendar.utdallas.edu/calendar and realized that if we traverse pages 1 to 54 in the pagination bar, it will show all the events until next year, so I'm inclined to do that. But would I need to use the URL you provided in the scraping process, or is it just for reference?

@TyHil
Member

TyHil commented Apr 26, 2025

That would certainly work, and you wouldn't need to use the URL I provided. But I think the slowness of scraping each page and event with chromedp might make just calling the API a better choice. Then all our scraper really needs to do is make 12 API calls and save them to a file.
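Something like this, sketched in Go. The hardcoded page count and the commented-out fetch/save steps are placeholders for illustration, not the real implementation:

```go
package main

import "fmt"

// pageURL builds the Localist events URL for one page, using the
// query parameters from the links above (365 days, 100 per page).
func pageURL(page int) string {
	return fmt.Sprintf("https://calendar.utdallas.edu/api/2/events?days=365&pp=100&page=%d", page)
}

func main() {
	// ~12 pages cover a year at 100 events per page; a real run
	// would read page.total from the first response instead of
	// hardcoding the count.
	for page := 1; page <= 12; page++ {
		url := pageURL(page)
		fmt.Println(url)
		// Here the scraper would http.Get(url), decode the
		// "events" array, append it to a slice, and finally
		// write the combined slice out to a JSON file.
	}
}
```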

@mikehquan19
Contributor Author

I'd certainly go for your approach of calling the API. What I'll do is still add the looping to the existing function, and write another function in calendar.go that calls the API to get the data. Later on, we can get rid of the scraping function once we're confident in the API calls. Would that work?

And in case you agree, would you want me to fetch the data into the exact same struct as in scraping, or would it have to change a bit?

@TyHil
Member

TyHil commented Apr 26, 2025

Sounds great!

I don't think the data has to change, that'll be a job for the parser.

@TyHil TyHil linked an issue May 2, 2025 that may be closed by this pull request
@mikehquan19
Contributor Author

Hey @TyHil, if you have time, please review the API approach, which gets a whole year's worth of data. If there's anything you want me to change, let me know. If all goes well, I can even let go of the old HTML scraping one.

Member

@TyHil TyHil left a comment

This looks great! Maybe some of this parsing logic could be moved to the parser, but that's up to you. Definitely better than the HTML scraper, and if we ever need it again, it'll be in the git history.

@mikehquan19
Contributor Author

Great! In that case, we can let go of the HTML scraper (since we can find it in history; kinda reminds me of Doctor Who lol). I'll make the quick changes and then close this PR once and for all.

@mikehquan19
Contributor Author

@TyHil, can you approve my PR? I need someone's approval to merge this lol.

BTW, I've updated the env in GCP and merged your PR. If there's anything else, let me know and I'll address it ASAP.

Member

@TyHil TyHil left a comment

:shipit:

@mikehquan19 mikehquan19 merged commit 3c26acd into UTDNebula:develop Aug 4, 2025
2 checks passed

Development

Successfully merging this pull request may close these issues.

Refactor UTD Comet Calendar Scraper

3 participants