Web Scraping Project

Overview

This is a program I wrote to collect product data from an online board game store based in Taiwan.

Script Development Process

Before writing any code, I outlined my plan and ran some experiments:
1️⃣ Identify the goal: the data I need and its data types, for a future database
2️⃣ Navigate the target website: observe its HTML and URL structure
3️⃣ Experiment with different methods and time the code
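
As a rough illustration of the third step, two candidate approaches can be timed with the standard library's `time.perf_counter`. The functions compared here (`build_urls_loop`, `build_urls_comprehension`) and the helper `time_call` are hypothetical stand-ins chosen for this sketch, not code taken from the project.

```python
import time

BASE_URL = "http://www.swanpanasia.com/products/{}"


def build_urls_loop(names):
    """Build product URLs with an explicit for loop."""
    urls = []
    for name in names:
        urls.append(BASE_URL.format(name))
    return urls


def build_urls_comprehension(names):
    """Build the same URLs with a list comprehension."""
    return [BASE_URL.format(name) for name in names]


def time_call(func, *args, repeat=5):
    """Return the best wall-clock time (in seconds) over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    sample = ["10-dwarves", "13-clues"] * 1000
    for func in (build_urls_loop, build_urls_comprehension):
        print(f"{func.__name__}: {time_call(func, sample):.6f} s")
```

Taking the best of several runs reduces noise from other processes; the actual numbers depend on the machine.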

The script can be broken down into two major modules: web scraping and writing to a CSV file.

I Web scraping

1. Loop through every product page and retrieve all the product names and save them into a list

2. Loop through the list, format each name into a URL like so: "http://www.swanpanasia.com/products/{product's_english_name}", and request that URL

A. If the request succeeds (the HTTP response status code is 200), do the following:

Give it a product Id (as the primary key for a future database)
Save the product tags as a list of strings
Save its image URL
Save the description as a string
Save the price as an integer
Save additional information as a list of strings

B. If the request fails (the HTTP response status code is 400 or 404), do the following:

Give it a product Id of 0 (so it can be tracked by its Id and removed afterwards)
Print out the product's English name so the missing data can be handled in the future; for now, these products are simply removed

3. Print out success count and failure count

4. Save the completed data as a list of dictionaries, and delete data that is incomplete

5. Print the total number of records, which should match the success count above
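
The steps above can be sketched roughly as follows. This is a minimal outline under several assumptions, not the project's actual code: `scrape_products` is a hypothetical name, the `fetch` callable stands in for a real HTTP request plus HTML parsing, `fake_fetch` exists only so the example runs without network access, and incrementing `pId` only on success is a guess.

```python
BASE_URL = "http://www.swanpanasia.com/products/{}"


def scrape_products(names, fetch):
    """Visit each product URL and return the complete records.

    `fetch(url)` is a hypothetical callable returning a
    (status_code, fields) tuple, where `fields` is a dict of scraped values.
    """
    products = []
    success_count = 0
    failure_count = 0

    for name in names:
        url = BASE_URL.format(name)
        status_code, fields = fetch(url)

        if status_code == 200:
            success_count += 1
            # Assumption: pId starts at 1 and grows with each successful page.
            record = {"eng_name": name, "pId": success_count, **fields}
        else:
            failure_count += 1
            # pId 0 marks an incomplete record so it can be removed later.
            record = {"eng_name": name, "pId": 0}
            print(f"Failed to reach product page: {name}")

        products.append(record)

    print(f"success: {success_count}, failure: {failure_count}")

    # Drop the incomplete records (pId == 0).
    completed = [p for p in products if p["pId"] != 0]
    print(f"total records: {len(completed)}")
    return completed, success_count, failure_count


def fake_fetch(url):
    """Stand-in for a real HTTP request, used for demonstration only."""
    pages = {
        BASE_URL.format("10-dwarves"): {"price": 120},
        BASE_URL.format("13-clues"): {"price": 990},
    }
    if url in pages:
        return 200, pages[url]
    return 404, {}


if __name__ == "__main__":
    scrape_products(["10-dwarves", "missing-game", "13-clues"], fake_fetch)
```

Passing the fetcher in as a parameter keeps the counting and filtering logic testable on its own; a real fetcher would wrap whichever HTTP and HTML parsing libraries the script uses.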

The final result should look something like the following:

[
   {'eng_name': '10-dwarves',
  'ch_name': '矮人十兄弟',
  'img_src': 'https://something.jpg',
  'pId': 1,
  'description': '這些礦工矮人們整天在坑道中挖掘寶石，雖然累積的財富無數，但長年灰頭土臉，生活習慣相當糟糕，衛生兩字他們從沒聽過。自從白雪公主來了以後，豐盛的晚餐上桌前，必定要求他們將手洗淨，但是，頑固的矮人們由於長相穿著都很相似，經常魚目混珠、瞞騙公主。',
  'tags': ['兒童', '紙牌', '合作', '1-2人', '台灣作家', '數學', '綜合'],
  'price': 120,
  'notes': ['8歲以上', '2-4人', '10分鐘']},
  
 {'eng_name': '13-clues',
  'ch_name': '13道線索',
  'img_src': 'https://something.jpg',
  'pId': 2,
  'description': '數起兇殘的犯罪案件震驚了1899年的倫敦，謎樣的案情掩蓋了真相，蘇格蘭警場在黑暗中摸索，號召一群優秀的偵探前來協助破案。每位偵探必須利用敏銳的直覺，從13道線索中找出蛛絲馬跡，負責解開自己的謎題，比其他人更快偵破自己的案件！',
  'tags': ['6人＋', '大腦', '猜心', '1-2人', '社會', '科技'],
  'price': 990,
  'notes': ['8歲以上', '2-6人', '30分鐘']},
]

II Compile to CSV file

1. Write the first row as header

2. To avoid square brackets in the cells, an extra function is needed to convert each list of strings into a plain string, so that a cell shows the items themselves rather than the Python list literal with its brackets and quotes.
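
Both CSV steps can be sketched with the standard library's `csv.DictWriter`. The names `list_to_str` and `write_products`, the column order in `FIELDNAMES`, and the `", "` separator are assumptions made for illustration; the project's own helper may differ.

```python
import csv
import io

# Assumed column order; the keys match the sample records shown earlier.
FIELDNAMES = ["pId", "eng_name", "ch_name", "img_src",
              "description", "tags", "price", "notes"]


def list_to_str(items, sep=", "):
    """Join a list of strings into one string without brackets or quotes."""
    return sep.join(items)


def write_products(products, file_obj):
    """Write a header row, then one row per product."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDNAMES)
    writer.writeheader()
    for product in products:
        row = dict(product)  # copy so the original record is not modified
        row["tags"] = list_to_str(product["tags"])
        row["notes"] = list_to_str(product["notes"])
        writer.writerow(row)


if __name__ == "__main__":
    sample = [{
        "pId": 1,
        "eng_name": "10-dwarves",
        "ch_name": "矮人十兄弟",
        "img_src": "https://something.jpg",
        "description": "...",
        "tags": ["兒童", "紙牌"],
        "price": 120,
        "notes": ["8歲以上", "2-4人", "10分鐘"],
    }]
    buffer = io.StringIO()
    write_products(sample, buffer)
    print(buffer.getvalue())
```

When writing to a real file instead of a `StringIO` buffer, open it with `newline=""` (as the `csv` module documentation recommends) and an explicit encoding such as UTF-8 so the Chinese text is preserved.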

Future Improvement

Use generators to reduce memory usage
Minimize the use of explicit for loops to improve performance
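
One possible shape for the generator idea, offered as a sketch only: the functions `iter_product_urls` and `iter_completed` are hypothetical and do not exist in the current script. Each yields one item at a time, so the full list never has to sit in memory.

```python
BASE_URL = "http://www.swanpanasia.com/products/{}"


def iter_product_urls(names):
    """Yield product URLs lazily instead of building a full list."""
    for name in names:
        yield BASE_URL.format(name)


def iter_completed(records):
    """Return a generator over the complete records only (pId != 0)."""
    return (record for record in records if record["pId"] != 0)


if __name__ == "__main__":
    for url in iter_product_urls(["10-dwarves", "13-clues"]):
        print(url)
```

The generator expression in `iter_completed` also replaces an explicit loop-and-append, which touches on the second improvement.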

JCEleanor/web-sraping-with-python
