
Conversation

@liniiiiii (Collaborator)

@i-be-snek, this is also low priority; we just need to check that the files are not damaging the main branch, thanks!

@liniiiiii liniiiiii requested a review from i-be-snek September 5, 2024 09:32
@liniiiiii liniiiiii self-assigned this Sep 5, 2024
@liniiiiii liniiiiii linked an issue Sep 5, 2024 that may be closed by this pull request
@i-be-snek i-be-snek force-pushed the 99-add-web-scraping-script-for-current-wikipedia-article-text-collection branch from 5e9e1bf to 3c44256 Compare September 8, 2024 19:07
@@ -0,0 +1,20 @@
*** This is the web scraping process ***

[] Wikipedia_articles/Web_scraping_wiki.py is the script for web scraping, and process the whole text as a header-content pair, which is the web scraping process of EN Wiki articles used for prompt V_3
Collaborator

Use `-` instead of `[]` for bullet points so they render properly in Markdown.
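For reference, a minimal sketch of the suggested bullet style (description shortened from the line under review):

```markdown
- Wikipedia_articles/Web_scraping_wiki.py is the script for web scraping of EN Wiki articles
```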

type=str,
)
parser.add_argument(
"-h",
Collaborator

You should not name any flag `-h` because that's the shorthand flag for `--help`. If you keep `-h` here, this is what a user would see when trying to view the docs for this script:

 $ poetry run python3 Database/Wikipedia_articles/Web_scraping_wiki.py --help

Traceback (most recent call last):
  File "/Users/shorouqza/Code/Wikimpacts/Database/Wikipedia_articles/Web_scraping_wiki.py", line 38, in <module>
    parser.add_argument(
  File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1468, in add_argument
    return self._add_action(action)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1850, in _add_action
    self._optionals._add_action(action)
  File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1670, in _add_action
    action = super(_ArgumentGroup, self)._add_action(action)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1482, in _add_action
    self._check_conflict(action)
  File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1619, in _check_conflict
    conflict_handler(action, confl_optionals)
  File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1628, in _handle_conflict_error
    raise ArgumentError(action, message % conflict_string)
argparse.ArgumentError: argument -h/--header: conflicting option string: -h
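A minimal sketch of the fix: keep only the long `--header` flag so argparse's built-in `-h`/`--help` stays available (the `--header` name and help text are taken from the diff above; the description string is an assumption):

```python
import argparse

# argparse reserves -h for --help by default, so the header option
# keeps only its long form; a different short flag (e.g. -H) would
# also avoid the conflict.
parser = argparse.ArgumentParser(description="Web scraping of EN Wiki articles")
parser.add_argument(
    "--header",
    dest="header",
    type=str,
    help="The header for web scraping",
)
args = parser.parse_args(["--header", "Impact"])
print(args.header)
```

With this change, `--help` prints the usage text instead of raising `ArgumentError` at import time.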

"-h",
"--header",
dest="header",
help="The header for web scraping",
Collaborator

Can you give an example of what the header could look like? Maybe in the README.md?

@@ -0,0 +1,143 @@
import argparse
Collaborator

I tried running both this file and `Web_scraping_wiki.py` with this command, and in both cases I get a similar error:

$ poetry run python3 Database/Wikipedia_articles/Web_scraping_wiki.py --raw_dir Database/Wiki_dev_test_articles --filename wiki_test_whole_infobox_20240729_159single_events.json  --output_dir Database --header "123"

web_scraping: 2024-09-08 21:17:40 INFO     Passed args: Namespace(filename='wiki_test_whole_infobox_20240729_159single_events.json', raw_dir='Database/Wiki_dev_test_articles', output_dir='Database', header='123')
web_scraping: 2024-09-08 21:17:40 INFO     Creating Database if it does not exist!
Traceback (most recent call last):
  File "/Users/shorouqza/Code/Wikimpacts/Database/Wikipedia_articles/Web_scraping_wiki.py", line 64, in <module>
    raw = pd.read_csv(f"{args.raw_dir}/{args.filename}", encoding="ISO-8859-1")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shorouqza/Library/Caches/pypoetry/virtualenvs/wikimpacts-1uvlbl-K-py3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shorouqza/Library/Caches/pypoetry/virtualenvs/wikimpacts-1uvlbl-K-py3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 583, in _read
    return parser.read(nrows)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/shorouqza/Library/Caches/pypoetry/virtualenvs/wikimpacts-1uvlbl-K-py3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1704, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shorouqza/Library/Caches/pypoetry/virtualenvs/wikimpacts-1uvlbl-K-py3.11/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 850, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "pandas/_libs/parsers.pyx", line 2029, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2
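The input filename in the command ends in `.json`, so the `pd.read_csv` call in the traceback is likely tripping over JSON syntax rather than a malformed CSV. A stdlib-only sketch of loading such a file as JSON instead (the sample record and field names here are hypothetical):

```python
import json
import os
import tempfile

# Hypothetical sample mimicking a small .json input file.
records = [{"Event_Name": "Test event", "Source": "wiki"}]
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f)

# Parse as JSON rather than CSV; with pandas, pd.read_json(path)
# would return a DataFrame directly.
with open(path, encoding="utf-8") as f:
    data = json.load(f)
print(data[0]["Event_Name"])
```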

@@ -0,0 +1,148 @@
import argparse
Collaborator

The general convention in the repo is to give Python scripts names in snake_case. I recommend renaming these files to snake_case.
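A small sketch of the kind of conversion meant here (the regex rule is an illustration, not a repo utility):

```python
import re

def to_snake_case(name: str) -> str:
    """Lower-case a file name, inserting underscores at CamelCase joins."""
    stem, dot, ext = name.rpartition(".")
    base = stem if dot else ext  # ext holds the whole name when there is no dot
    snake = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", base).lower()
    if dot:
        return snake + dot + ext
    return snake

print(to_snake_case("Web_scraping_wiki.py"))
```

In practice `git mv` would do the renaming so the files keep their history.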




Development

Successfully merging this pull request may close these issues.

Add web scraping script for current wikipedia article text collection

3 participants