xTraktor

Structured data extractor for the modern world wide web.

What is this?

This project covers 3 approaches to structured data extraction: regular expressions, XPath, and a RoadRunner-like approach. Only the first two are implemented.

Usage is demonstrated on sample pages from 3 websites: overstock.com, rtvslo.si and avto.net, with two pages gathered from each.

This is the second assignment in the Web information extraction and retrieval course.

Setup

[Optional] Create a virtualenv and activate it.

$ virtualenv --python=python3 --system-site-packages wiervenv
$ source wiervenv/bin/activate

Install required dependencies.

$ pip3 install -r requirements.txt

Install in dev mode.

$ python3 setup.py develop

Running the parser

implementation/ contains the regular-expression-based (regex.py) and XPath-based (xpath.py) approaches. The RoadRunner-like approach is not implemented. Running these files produces JSON outputs for the pages in the input/ folder.

Assuming you are inside the implementation/ directory:

$ python3 regex.py
$ python3 xpath.py
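To give a rough idea of how the two implemented approaches differ, here is a minimal sketch that extracts the same records once with a regular expression and once with an XPath query. It is not the code from regex.py or xpath.py: the sample markup, the field names (title, price) and both function names are hypothetical, and it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and requires well-formed input, so real pages would likely need a more tolerant HTML parser.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample; real pages in input/ are far messier.
SAMPLE = """<html><body>
<div class="item"><h2>Silver ring</h2><span class="price">$19.99</span></div>
<div class="item"><h2>Gold chain</h2><span class="price">$249.00</span></div>
</body></html>"""


def extract_with_regex(html):
    """Match each item block directly in the raw markup."""
    pattern = re.compile(
        r'<div class="item">\s*<h2>(?P<title>.*?)</h2>\s*'
        r'<span class="price">(?P<price>.*?)</span>',
        re.DOTALL,
    )
    return [match.groupdict() for match in pattern.finditer(html)]


def extract_with_xpath(html):
    """Parse the markup into a tree, then select nodes by path."""
    root = ET.fromstring(html)
    records = []
    for item in root.findall(".//div[@class='item']"):
        records.append({
            "title": item.findtext("h2"),
            "price": item.findtext("span[@class='price']"),
        })
    return records


if __name__ == "__main__":
    # Both approaches should yield the same list of records.
    print(json.dumps(extract_with_regex(SAMPLE), indent=2))
    print(json.dumps(extract_with_xpath(SAMPLE), indent=2))
```

The regex variant is tied to the exact textual layout of the markup, while the XPath variant depends only on the element structure, which is the usual trade-off between the two.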

Project structure

.
├── input/               # web pages used to test the approaches
├── output/              # JSON outputs generated by the methods
├── implementation/      # source code of our implemented approaches
└── report.pdf           # final report PDF

2019, Jaka Stavanja, Matej Klemen & Andraž Povše
