add the web scraping process for nlp paper and current version #100
5e9e1bf to
3c44256
Compare
| @@ -0,0 +1,20 @@ | |||
| *** This is the web scraping process *** | |||
|
|
|||
| [] Wikipedia_articles/Web_scraping_wiki.py is the script for web scraping, and process the whole text as a header-content pair, which is the web scraping process of EN Wiki articles used for prompt V_3 | |||
Use `-` instead of `[]` for bullet points so they render properly in Markdown.
| type=str, | ||
| ) | ||
| parser.add_argument( | ||
| "-h", |
You should not name any flag -h because that's the shorthand flag for --help. If you keep -h here, this is what a user would see when trying to get the docs of this script:
$ poetry run python3 Database/Wikipedia_articles/Web_scraping_wiki.py --help
Traceback (most recent call last):
File "/Users/shorouqza/Code/Wikimpacts/Database/Wikipedia_articles/Web_scraping_wiki.py", line 38, in <module>
parser.add_argument(
File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1468, in add_argument
return self._add_action(action)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1850, in _add_action
self._optionals._add_action(action)
File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1670, in _add_action
action = super(_ArgumentGroup, self)._add_action(action)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1482, in _add_action
self._check_conflict(action)
File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1619, in _check_conflict
conflict_handler(action, confl_optionals)
File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1628, in _handle_conflict_error
raise ArgumentError(action, message % conflict_string)
argparse.ArgumentError: argument -h/--header: conflicting option string: -h
| "-h", | ||
| "--header", | ||
| dest="header", | ||
| help="The header for web scraping", |
Can you give an example of what the header could look like? Maybe in the README.md
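For reference, here is a minimal sketch of how the `-h` conflict could be avoided, assuming the option only needs a long form. The `build_parser` helper is hypothetical and not part of the PR; the `dest` and `help` values are copied from the diff. Dropping the short flag leaves `-h` free for argparse's built-in help.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper for illustration; the PR builds its parser inline.
    parser = argparse.ArgumentParser(
        description="Sketch of the scraping CLI (illustrative only)."
    )
    # Register only the long option so that -h stays reserved for --help.
    parser.add_argument(
        "--header",
        dest="header",
        type=str,
        help="The header for web scraping",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.header)
```

With this shape, `--help` and `-h` both print the usage text instead of raising `argparse.ArgumentError`, and `--header "123"` is still accepted as in the command used in this review.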
| @@ -0,0 +1,143 @@ | |||
| import argparse | |||
I tried running both this file and Web_scraping_wiki.py with this command and in both cases I get a similar error:
$ poetry run python3 Database/Wikipedia_articles/Web_scraping_wiki.py --raw_dir Database/Wiki_dev_test_articles --filename wiki_test_whole_infobox_20240729_159single_events.json --output_dir Database --header "123"
web_scraping: 2024-09-08 21:17:40 INFO Passed args: Namespace(filename='wiki_test_whole_infobox_20240729_159single_events.json', raw_dir='Database/Wiki_dev_test_articles', output_dir='Database', header='123')
web_scraping: 2024-09-08 21:17:40 INFO Creating Database if it does not exist!
Traceback (most recent call last):
File "/Users/shorouqza/Code/Wikimpacts/Database/Wikipedia_articles/Web_scraping_wiki.py", line 64, in <module>
raw = pd.read_csv(f"{args.raw_dir}/{args.filename}", encoding="ISO-8859-1")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/shorouqza/Library/Caches/pypoetry/virtualenvs/wikimpacts-1uvlbl-K-py3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
return _read(filepath_or_buffer, kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/shorouqza/Library/Caches/pypoetry/virtualenvs/wikimpacts-1uvlbl-K-py3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 583, in _read
return parser.read(nrows)
^^^^^^^^^^^^^^^^^^
File "/Users/shorouqza/Library/Caches/pypoetry/virtualenvs/wikimpacts-1uvlbl-K-py3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1704, in read
) = self._engine.read( # type: ignore[attr-defined]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/shorouqza/Library/Caches/pypoetry/virtualenvs/wikimpacts-1uvlbl-K-py3.11/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
chunks = self._reader.read_low_memory(nrows)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas/_libs/parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 850, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._check_tokenize_status
File "pandas/_libs/parsers.pyx", line 2029, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2
| @@ -0,0 +1,148 @@ | |||
| import argparse | |||

@i-be-snek, this is also low priority; we just need to check that the files do not break the main branch. Thanks!
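Regarding the `ParserError` reported above: one possible cause, judging only from the command, is that a `.json` file was passed to a script that calls `pd.read_csv`. That is an assumption, not a confirmed diagnosis. If the script is meant to accept both formats, a rough sketch of picking the pandas reader from the file extension could look like this. The helper names `reader_name_for` and `load_raw` are hypothetical; `pd.read_json` and `pd.read_csv` are existing pandas functions.

```python
from pathlib import Path


def reader_name_for(path) -> str:
    """Return the name of the pandas reader that matches the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".json":
        return "read_json"
    if suffix == ".csv":
        return "read_csv"
    # Fail early with a readable message instead of a tokenizer error.
    raise ValueError(f"Unsupported input file type: {suffix!r}")


def load_raw(raw_dir: str, filename: str):
    # pandas is imported lazily so the extension check works without it.
    import pandas as pd

    path = Path(raw_dir) / filename
    reader = getattr(pd, reader_name_for(path))
    return reader(path)
```

If only CSV input is intended, the same check could instead be used to reject other extensions up front with a clear error message.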