Conversation

@mikehquan19
Contributor

Related to #57,
In addition to changing the selector's tag so that there wouldn't be any hangs, I've also updated the code so that it gets as much information as possible (the description, location, email, etc.).

Please let me know if the change benefits the data for the API, or if I need to revert those changes.
Thank you!

@TyHil
Member

TyHil commented Apr 26, 2025

I was poking around since it seems this only scrapes events from https://calendar.utdallas.edu/calendar/1, which is less than a day's worth of events. Check this out:

https://calendar.utdallas.edu/api/2/events?days=365&pp=100&page=1
https://calendar.utdallas.edu/api/2/events?days=365&pp=100&page=2
https://help.concept3d.com/hc/en-us/articles/11940613344915-Localist-API

Seems like this returns all the events; we'd just need to loop through all the pages.

@mikehquan19
Contributor Author

@TyHil, no problem! I'll update the code so that it loops through the pages. How many pages would you like to scrape before the events start getting irrelevant time-wise?

@TyHil
Member

TyHil commented Apr 26, 2025

Awesome! I think a year's worth of data is good, since that's in line with the Mazevo scraper. With the API URL I provided, it automatically stops after 365 days and returns an empty array. Also, page.total in the result says how many pages are available (12 in this case).
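For reference, a minimal Go sketch of reading page.total from one API response. The struct below models only the fields needed for pagination and the names (eventsPage, totalPages) are my own guesses at the JSON shape, not the actual scraper's types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// eventsPage models just the parts of the Localist API envelope we
// need for pagination (hypothetical names, partial shape).
type eventsPage struct {
	Events []json.RawMessage `json:"events"`
	Page   struct {
		Current int `json:"current"`
		Total   int `json:"total"`
	} `json:"page"`
}

// sample mimics the envelope returned by
// https://calendar.utdallas.edu/api/2/events?days=365&pp=100&page=1
var sample = []byte(`{"events":[{},{}],"page":{"current":1,"total":12}}`)

// totalPages decodes one response body and reports how many pages
// of events are available in total.
func totalPages(body []byte) (int, error) {
	var p eventsPage
	if err := json.Unmarshal(body, &p); err != nil {
		return 0, err
	}
	return p.Page.Total, nil
}

func main() {
	n, err := totalPages(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // prints 12 for the sample above
}
```

Once the first page is fetched, the loop bound comes straight from this value rather than being hardcoded.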

@mikehquan19
Contributor Author

I just visited https://calendar.utdallas.edu/calendar and realized that if we traverse pages 1 to 54 in the pagination bar, it will show all the events until next year, so I'm inclined to do that. But would I need to use the URL you provided in the scraping process, or is it just for reference?

@TyHil
Member

TyHil commented Apr 26, 2025

That would certainly work, and you wouldn't need to use the URL I provided. But I think the slowness of scraping each page and event with chromedp might make just calling the API a better choice. Then all our scraper really needs to do is make 12 API calls and save them to a file.
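Something like this, sketched in Go. The hardcoded page count and the commented-out fetch/save steps are placeholders for illustration, not the real implementation:

```go
package main

import "fmt"

// pageURL builds the Localist events URL for one page, using the
// query parameters from the links above (365 days, 100 per page).
func pageURL(page int) string {
	return fmt.Sprintf("https://calendar.utdallas.edu/api/2/events?days=365&pp=100&page=%d", page)
}

func main() {
	// ~12 pages cover a year at 100 events per page; a real run
	// would read page.total from the first response instead of
	// hardcoding the count.
	for page := 1; page <= 12; page++ {
		url := pageURL(page)
		fmt.Println(url)
		// Here the scraper would http.Get(url), decode the
		// "events" array, append it to a slice, and finally
		// write the combined slice out to a JSON file.
	}
}
```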

@mikehquan19
Contributor Author

I'd certainly go for your approach of calling the API. What I'll do is still add the looping to the existing function, and write another function in calendar.go that calls the API to get the data. Later on, we can get rid of the scraping function once we're confident in the API calls. Would that work?

And in case you agree, would you want me to fetch the data into the exact same struct as in scraping, or would it have to change a bit?

@TyHil
Member

TyHil commented Apr 26, 2025

Sounds great!

I don't think the data has to change, that'll be a job for the parser.

@TyHil TyHil linked an issue May 2, 2025 that may be closed by this pull request
@mikehquan19
Contributor Author

Hey @TyHil, if you have time, please review the API approach, which gets a whole year's worth of data. If there's anything you want me to change, let me know. If all goes well, I can even let go of the old HTML scraping one.

Member

@TyHil TyHil left a comment

This looks great! Maybe some of this parsing logic could be moved to the parser, but that's up to you. Definitely better than the HTML scraper, and if we ever need it again, it'll be in the git history.

@mikehquan19
Contributor Author

Great! In that case, we can let go of the HTML scraper (since we can find it in history; kinda reminds me of Doctor Who lol). I'll make the quick changes and then close this PR once and for all.

@mikehquan19
Contributor Author

@TyHil, can you approve my PR? I need someone's approval to merge this lol.

BTW, I've updated the env in GCP and merged your PR. If there's anything else, let me know and I'll address it ASAP.

Member

@TyHil TyHil left a comment

:shipit:

@mikehquan19 mikehquan19 merged commit 3c26acd into UTDNebula:develop Aug 4, 2025
2 checks passed

Development

Successfully merging this pull request may close these issues.

Refactor UTD Comet Calendar Scraper

3 participants