Commit 2a28805

fix based on comment
Signed-off-by: guangli.bao <guangli.bao@daocloud.io>
1 parent c3dada5 commit 2a28805

3 files changed, with 13 additions and 5 deletions

contrib/sharegpt_preprocess/README.md

Lines changed: 6 additions & 0 deletions
@@ -2,6 +2,12 @@

 You can use ShareGPT_V3_unfiltered_cleaned_split.json as benchmark datasets.

+## Prerequisites
+Before you begin, ensure you have the following installed:
+
+* Python 3.9 or higher
+* pip (Python package manager)
+
 ## Example Commands

 Download and prepare the ShareGPT dataset; You can specify the proportion of data to process by providing a number between 0 and 1 as an argument to the script.
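
Reviewer note: the README context line above says the script accepts a proportion between 0 and 1, but the diff does not show how that argument is handled. The sketch below is one plausible reading, not the script's actual code. The names `parse_proportion` and `take_proportion` are hypothetical, and taking a prefix of the records (rather than a random sample) is an assumption.

```python
def parse_proportion(argv: list) -> float:
    """Hypothetical helper: read an optional proportion from argv.

    Assumes argv[0] is the script name and argv[1], if present, is the
    proportion. Defaults to 1.0 (process everything) when omitted.
    """
    if len(argv) < 2:
        return 1.0
    value = float(argv[1])
    if not 0 < value <= 1:
        raise ValueError(f"proportion must be in (0, 1], got {value}")
    return value


def take_proportion(records: list, proportion: float) -> list:
    """Hypothetical helper: keep the leading share of the records.

    The count is rounded down; the real script may select records differently.
    """
    keep = int(len(records) * proportion)
    return records[:keep]
```

Under these assumptions, `take_proportion(list(range(10)), parse_proportion(["script.py", "0.5"]))` keeps the first five records.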

contrib/sharegpt_preprocess/preprocessing_sharegpt_data.py

Lines changed: 2 additions & 1 deletion
@@ -62,8 +62,9 @@ def extract_and_save_with_filtering(file):
 and
 # except special characters
 not re.search(r"[<>{}[\]\\]", prompt_text)
+# except pure numbers
 and not prompt_text.isdigit()
-): # except pure numbers
+):
 filtered_prompts.append(
 {
 "from": turn.get("from"),
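
Reviewer note: the two conditions visible in this hunk can be exercised in isolation. The sketch below wraps them in a standalone predicate. `keep_prompt` and `SPECIAL_CHARS` are hypothetical names that do not appear in the commit, and any earlier conditions in the original `if` statement (cut off above the hunk) are not reproduced.

```python
import re

# Same pattern as in the hunk: matches any of < > { } [ ] or a backslash.
SPECIAL_CHARS = re.compile(r"[<>{}[\]\\]")


def keep_prompt(prompt_text: str) -> bool:
    """Hypothetical helper: True if the prompt passes both checks from the hunk."""
    return (
        # except special characters
        not SPECIAL_CHARS.search(prompt_text)
        # except pure numbers
        and not prompt_text.isdigit()
    )
```

With this predicate, a prompt such as `"12345"` or `"a[0]"` is rejected, while `"12 34"` is kept because `str.isdigit()` returns `False` when the string contains a space.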
Lines changed: 5 additions & 4 deletions
@@ -1,4 +1,5 @@
-tqdm
-pandas
-openai
-pyyaml
+tqdm==4.67.1
+pandas==2.3.1
+openai==1.99.9
+datasets==4.0.0
+transformers==4.55.4