edx-crawler

edx-crawler is a cross-platform, Python-based tool for mining text data from the edX courses a user is enrolled in, as listed on the user's dashboard. It was developed by teaching assistants at the Tokyo Tech Online Education Development Office as an extension of edx-dl.

Prerequisites

Python libraries and modules:

Python - version 3.5+
beautifulsoup - a Python library for pulling data out of HTML and XML files
webvtt-py - a Python module for reading/writing WebVTT caption files
youtube-dl - a command-line program for downloading videos from YouTube.com

How to run

Run the Python script edx_crawler.py, passing the edX course link (-url), your username (-u), and your password (-p) as parameters.

python edx_crawler.py -url [course_url] -u [edx_user_name] -p [edx_user_password]
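The same invocation can also be assembled from another Python script. The following is a minimal sketch, not part of edx-crawler itself: `build_command` and `run_crawler` are hypothetical helper names, and the sketch assumes that edx_crawler.py is in the current working directory. It prompts for the password with `getpass` so that the password does not have to be typed into the shell history, although it is still passed to the script as a command-line argument.

```python
import subprocess
import sys
from getpass import getpass


def build_command(course_url, username, password, html_dir=None):
    """Return the argument list for one edx_crawler.py run (hypothetical helper)."""
    cmd = [
        sys.executable, "edx_crawler.py",
        "-url", course_url,
        "-u", username,
        "-p", password,
    ]
    # -d is optional; it is one of the flags listed under OPTIONS.
    if html_dir is not None:
        cmd += ["-d", html_dir]
    return cmd


def run_crawler(course_url, username, html_dir=None):
    """Prompt for the password and run the crawler as a child process."""
    password = getpass("edX password: ")
    return subprocess.run(
        build_command(course_url, username, password, html_dir),
        check=True,  # raise CalledProcessError if the script exits with an error
    )
```

Calling `run_crawler("[course_url]", "[edx_user_name]")` with real values in place of the placeholders would then start one crawl.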

OPTIONS

-url, --course-urls		Specify the target course URLs, as given on the edX dashboard
-u, --username			Specify your edX username (email)
-p, --password			Specify your edX password
-d, --html-dir			Specify the directory in which to store the data
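For readers who want to see how such a set of options fits together, here is a minimal `argparse` sketch that mirrors the flags listed above. It is an illustration only and not the parser used by edx_crawler.py: the function name `build_parser`, the choice of required flags, and accepting more than one URL after -url are assumptions. The default of "HTMLs" for -d follows the default output folder described in this README.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="edx_crawler.py")
    # Assumption: one or more course URLs may follow -url.
    parser.add_argument("-url", "--course-urls", dest="course_urls",
                        nargs="+", required=True,
                        help="target course URLs from the edX dashboard")
    parser.add_argument("-u", "--username", required=True,
                        help="edX username (email)")
    parser.add_argument("-p", "--password", required=True,
                        help="edX password")
    parser.add_argument("-d", "--html-dir", dest="html_dir", default="HTMLs",
                        help="directory in which to store the data")
    return parser
```

With this sketch, `build_parser().parse_args(["-url", "[course_url]", "-u", "[edx_user_name]", "-p", "[edx_user_password]"])` yields a namespace whose `html_dir` falls back to "HTMLs".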

The output follows the week-by-week structure of the course and is stored by default in the "HTMLs" folder. It contains the following files:

seq_contents_#.html - text data of the course unit in .html format
seq_contents_#.txt - text data of the course unit in .txt format
seq_contents_#_prob.txt - text data of the quiz sections in .txt format
seq_contents_#_vdo.json - video transcript information in .json format
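The naming scheme above makes it straightforward to sort the crawled files by type. The following sketch is not part of edx-crawler: `classify` and `collect_outputs` are hypothetical helpers, and the sketch assumes that "#" stands for an integer index and that the files may sit in any subfolder of the output directory.

```python
import re
from collections import defaultdict
from pathlib import Path

# Patterns derived from the file list above; "#" is assumed to be an integer.
PATTERNS = {
    "html": re.compile(r"^seq_contents_\d+\.html$"),
    "text": re.compile(r"^seq_contents_\d+\.txt$"),
    "problem": re.compile(r"^seq_contents_\d+_prob\.txt$"),
    "video": re.compile(r"^seq_contents_\d+_vdo\.json$"),
}


def classify(filename):
    """Return the output type of a file name, or None if it does not match."""
    for kind, pattern in PATTERNS.items():
        if pattern.match(filename):
            return kind
    return None


def collect_outputs(root="HTMLs"):
    """Group all crawler output files found under root by their type."""
    grouped = defaultdict(list)
    for path in sorted(Path(root).rglob("seq_contents_*")):
        kind = classify(path.name)
        if kind is not None and path.is_file():
            grouped[kind].append(path)
    return dict(grouped)
```

For example, `collect_outputs("HTMLs")["video"]` would list every transcript file found, provided at least one exists.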

Converting data to JSON format

After crawling the courses, you can run txtcomp2json.py to store the crawled data in .json format.

python txtcomp2json.py

The script aggregates the crawled course data in the "HTMLs" folder and writes the following .json files:

all text components -> all_textcomp.json
all problem components -> all_probcomp.json
all video components -> all_videocomp.json
all components (text, quizzes, videos) -> all_comp.json
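Once the aggregated files exist, a quick sanity check is to load each one and look at its size. The sketch below makes no assumption about the internal schema of the files, which this README does not describe; it only reports the top-level JSON type and the number of top-level entries. `summarize` and `summarize_all` are hypothetical helper names, and the files are assumed to be UTF-8 encoded.

```python
import json
from pathlib import Path

# File names as listed above.
AGGREGATED_FILES = (
    "all_textcomp.json",
    "all_probcomp.json",
    "all_videocomp.json",
    "all_comp.json",
)


def summarize(path):
    """Return (top-level type name, number of top-level entries) for a JSON file."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size


def summarize_all(folder=".", names=AGGREGATED_FILES):
    """Summarize every aggregated file that is present in folder."""
    report = {}
    for name in names:
        path = Path(folder) / name
        if path.exists():
            report[name] = summarize(path)
    return report
```

Files that have not been produced yet are simply left out of the report, so the check can be run after a partial crawl.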

Extra files and folders

transcript_error_report.txt contains information about video transcripts that are not provided by edX or YouTube.

