
Conversation

@liniiiiii (Collaborator)

@i-be-snek, this is also low priority; we just need to check that the files are not damaging the main branch, thanks!

@liniiiiii liniiiiii requested a review from i-be-snek September 5, 2024 09:32
@liniiiiii liniiiiii self-assigned this Sep 5, 2024
@liniiiiii liniiiiii linked an issue Sep 5, 2024 that may be closed by this pull request
@i-be-snek i-be-snek force-pushed the 99-add-web-scraping-script-for-current-wikipedia-article-text-collection branch from 5e9e1bf to 3c44256 Compare September 8, 2024 19:07
@@ -0,0 +1,20 @@
*** This is the web scraping process ***

[] Wikipedia_articles/Web_scraping_wiki.py is the script for web scraping, and process the whole text as a header-content pair, which is the web scraping process of EN Wiki articles used for prompt V_3
Collaborator

Use `-` instead of `[]` for bullet points so they render properly in Markdown.
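For reference, a minimal sketch of the suggested bullet style (description shortened from the line under review):

```markdown
- Wikipedia_articles/Web_scraping_wiki.py is the script for web scraping of EN Wiki articles
```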

type=str,
)
parser.add_argument(
"-h",
Collaborator

You should not name any flag `-h` because that's the shorthand flag for `--help`. If you keep `-h` here, this is what a user would see when trying to view the docs for this script:

 $ poetry run python3 Database/Wikipedia_articles/Web_scraping_wiki.py --help

Traceback (most recent call last):
  File "/Users/shorouqza/Code/Wikimpacts/Database/Wikipedia_articles/Web_scraping_wiki.py", line 38, in <module>
    parser.add_argument(
  File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1468, in add_argument
    return self._add_action(action)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1850, in _add_action
    self._optionals._add_action(action)
  File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1670, in _add_action
    action = super(_ArgumentGroup, self)._add_action(action)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1482, in _add_action
    self._check_conflict(action)
  File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1619, in _check_conflict
    conflict_handler(action, confl_optionals)
  File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1628, in _handle_conflict_error
    raise ArgumentError(action, message % conflict_string)
argparse.ArgumentError: argument -h/--header: conflicting option string: -h
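A minimal sketch of the fix: keep only the long `--header` flag so argparse's built-in `-h`/`--help` stays available (the `--header` name and help text are taken from the diff above; the description string is an assumption):

```python
import argparse

# argparse reserves -h for --help by default, so the header option
# keeps only its long form; a different short flag (e.g. -H) would
# also avoid the conflict.
parser = argparse.ArgumentParser(description="Web scraping of EN Wiki articles")
parser.add_argument(
    "--header",
    dest="header",
    type=str,
    help="The header for web scraping",
)
args = parser.parse_args(["--header", "Impact"])
print(args.header)
```

With this change, `--help` prints the usage text instead of raising `ArgumentError` at import time.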

"-h",
"--header",
dest="header",
help="The header for web scraping",
Collaborator

Can you give an example of what the header could look like? Maybe in the README.md?

@@ -0,0 +1,143 @@
import argparse
Collaborator

I tried running both this file and `Web_scraping_wiki.py` with this command, and in both cases I get a similar error:

$ poetry run python3 Database/Wikipedia_articles/Web_scraping_wiki.py --raw_dir Database/Wiki_dev_test_articles --filename wiki_test_whole_infobox_20240729_159single_events.json  --output_dir Database --header "123"

web_scraping: 2024-09-08 21:17:40 INFO     Passed args: Namespace(filename='wiki_test_whole_infobox_20240729_159single_events.json', raw_dir='Database/Wiki_dev_test_articles', output_dir='Database', header='123')
web_scraping: 2024-09-08 21:17:40 INFO     Creating Database if it does not exist!
Traceback (most recent call last):
  File "/Users/shorouqza/Code/Wikimpacts/Database/Wikipedia_articles/Web_scraping_wiki.py", line 64, in <module>
    raw = pd.read_csv(f"{args.raw_dir}/{args.filename}", encoding="ISO-8859-1")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shorouqza/Library/Caches/pypoetry/virtualenvs/wikimpacts-1uvlbl-K-py3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shorouqza/Library/Caches/pypoetry/virtualenvs/wikimpacts-1uvlbl-K-py3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 583, in _read
    return parser.read(nrows)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/shorouqza/Library/Caches/pypoetry/virtualenvs/wikimpacts-1uvlbl-K-py3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1704, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shorouqza/Library/Caches/pypoetry/virtualenvs/wikimpacts-1uvlbl-K-py3.11/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 850, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "pandas/_libs/parsers.pyx", line 2029, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2
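The input filename in the command ends in `.json`, so the `pd.read_csv` call in the traceback is likely tripping over JSON syntax rather than a malformed CSV. A stdlib-only sketch of loading such a file as JSON instead (the sample record and field names here are hypothetical):

```python
import json
import os
import tempfile

# Hypothetical sample mimicking a small .json input file.
records = [{"Event_Name": "Test event", "Source": "wiki"}]
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f)

# Parse as JSON rather than CSV; with pandas, pd.read_json(path)
# would return a DataFrame directly.
with open(path, encoding="utf-8") as f:
    data = json.load(f)
print(data[0]["Event_Name"])
```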

@@ -0,0 +1,148 @@
import argparse
Collaborator

The general convention in the repo is to give Python scripts names in snake_case. I recommend renaming these files to snake_case.
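A small sketch of the kind of conversion meant here (the regex rule is an illustration, not a repo utility):

```python
import re

def to_snake_case(name: str) -> str:
    """Lower-case a file name, inserting underscores at CamelCase joins."""
    stem, dot, ext = name.rpartition(".")
    base = stem if dot else ext  # ext holds the whole name when there is no dot
    snake = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", base).lower()
    if dot:
        return snake + dot + ext
    return snake

print(to_snake_case("Web_scraping_wiki.py"))
```

In practice `git mv` would do the renaming so the files keep their history.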




Development

Successfully merging this pull request may close these issues.

Add web scraping script for current wikipedia article text collection

3 participants