Commit 76801b9

Added data generator code
1 parent 29c8fbc commit 76801b9

11 files changed

+2417
-0
lines changed

data_generator/README.md

Lines changed: 390 additions & 0 deletions
@@ -0,0 +1,390 @@
# Data Generator Project - Summary

## What You Have

A complete, production-ready **Multi-Table Data Generator** with:
- ✅ Python module (`.py` file)
- ✅ Interactive Jupyter notebook (`.ipynb` file)
- ✅ Configuration files (`.yaml` files)
- ✅ Complete documentation (`README.md`)
- ✅ Setup guide (`PROJECT_SETUP.md`)

---

## All Files Created

### Core Files (Required)
1. **`data_generator.py`** - Main Python module
   - Contains `MultiTableDataGenerator` class
   - All generation logic

2. **`requirements.txt`** - Dependencies
   - pyyaml (required)
   - pandas (recommended)

### Tutorial & Examples
3. **`DataGenerator_Tutorial.ipynb`** - Jupyter Notebook
   - Interactive tutorial
   - Step-by-step examples
   - Data analysis examples
   - Ready to run

4. **`config_simple.yaml`** - Simple Example
   - 2 tables (users + orders)
   - Foreign key relationship
   - Easy to understand

5. **`config_ecommerce.yaml`** - E-commerce Example
   - 4 tables (customers, products, orders, reviews)
   - Multiple foreign keys
   - Realistic scenario

6. **`config_all_types.yaml`** - Complete Reference
   - Shows ALL column types
   - Reference documentation
   - Copy-paste templates

### Documentation
7. **`README.md`** - Complete Documentation
   - Features overview
   - Installation guide
   - API reference
   - Examples
   - Troubleshooting

8. **`PROJECT_SETUP.md`** - Setup Guide
   - Step-by-step setup
   - Directory structure
   - Testing instructions
   - Troubleshooting

9. **`COMPLETE_PROJECT_SUMMARY.md`** - This File
   - Quick overview
   - Usage instructions
   - File descriptions

---

## Quick Start (3 Steps)

### Step 1: Setup
```bash
# Create directory
mkdir data-generator
cd data-generator

# Save all 9 files in this directory

# Create requirements.txt with the following dependencies.
# Also add the same dependencies to the cluster.
printf 'pyyaml\npandas\n' > requirements.txt
```

### Step 2: Test
```python
# Test basic generation
from data_generator import MultiTableDataGenerator

MultiTableDataGenerator(seed=42).generate_from_config('config_simple.yaml')
```

### Step 3: Explore

Open the `DataGenerator_Tutorial.ipynb` notebook in AIDP and run the cells. Change the file paths to match your own folder.

---

## 📋 File Purposes

| File | What It Does | When to Use |
|------|--------------|-------------|
| `data_generator.py` | Core generator class | Import in your code |
| `DataGenerator_Tutorial.ipynb` | Interactive tutorial | Learning & examples |
| `config_simple.yaml` | Basic 2-table example | Quick testing |
| `config_ecommerce.yaml` | Real-world scenario | Complex relationships |
| `config_all_types.yaml` | All features demo | Reference guide |
| `README.md` | Full documentation | When stuck |
| `requirements.txt` | Dependencies | Installation |

---

## 💡 Usage Examples

### Example 1: Python Script
```python
from data_generator import MultiTableDataGenerator

# Simple usage
generator = MultiTableDataGenerator(seed=42)
results = generator.generate_from_config('config_simple.yaml')

# View sample
generator.print_sample('users', n=5)

# Get as DataFrame
df = generator.get_dataframe('users')
```

### Example 2: Custom Configuration
```python
from data_generator import MultiTableDataGenerator

# Single-table config passed as a dict instead of a YAML file
config = {
    'table_name': 'my_data',
    'rows_count': 100,
    'output_format': 'both',
    'columns': [
        {'name': 'id', 'type': 'integer', 'range': [1, 1000], 'unique': True},
        {'name': 'name', 'type': 'string', 'length': 8},
        {'name': 'email', 'type': 'email', 'unique': True}
    ]
}

generator = MultiTableDataGenerator(seed=42)
generator.generate_from_config(config)
```

### Example 3: From Config File
```python
from data_generator import MultiTableDataGenerator

# Use existing config
generator = MultiTableDataGenerator(seed=42)
results = generator.generate_from_config('config_ecommerce.yaml')

# Access tables
df_customers = generator.get_dataframe('customers')
df_orders = generator.get_dataframe('orders')
```

---

## Learning Path

1. **Start**: Read `README.md`
2. **Learn**: Open `DataGenerator_Tutorial.ipynb`
3. **Practice**: Modify `config_simple.yaml`
4. **Build**: Create your own config

---

## Key Features

### Multi-Table Support
```yaml
tables:
  - table_name: users
    rows_count: 10
  - table_name: orders
    rows_count: 50
```

### Foreign Keys
```yaml
- name: user_id
  type: reference
  ref_table: users
  ref_column: user_id
```

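To make the `reference` type concrete, here is a minimal sketch of how a child column could draw its values from an already generated parent table. It is an illustration only, not the logic inside `data_generator.py`: the helper name `fill_reference_column` and the list-of-dicts row layout are assumptions made for this example.

```python
import random


def fill_reference_column(parent_rows, ref_column, rows_count, rng):
    """Pick `rows_count` foreign-key values from the parent table's column."""
    parent_keys = [row[ref_column] for row in parent_rows]
    if not parent_keys:
        raise ValueError("parent table has no rows to reference")
    # Sampling with replacement: many child rows may point at one parent row.
    return [rng.choice(parent_keys) for _ in range(rows_count)]


rng = random.Random(42)
users = [{"user_id": i} for i in range(1, 11)]  # 10 hypothetical parent rows
order_user_ids = fill_reference_column(users, "user_id", 50, rng)

# Every generated key exists in the parent table.
valid_ids = {u["user_id"] for u in users}
assert all(uid in valid_ids for uid in order_user_ids)
```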
### 11+ Column Types
- integer, float, string
- choice (with weights)
- boolean
- date, datetime
- email, phone, uuid
- reference (foreign key)

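As an illustration of what "choice (with weights)" means, the sketch below draws values with Python's standard `random.choices`. The status values and weights are hypothetical, and the generator's own config keys for this type are not shown here; see `config_all_types.yaml` for those.

```python
import random
from collections import Counter

rng = random.Random(42)
statuses = ["active", "inactive", "banned"]  # hypothetical choice values
weights = [0.7, 0.2, 0.1]                    # relative weights, one per value

# Draw 1000 values; heavier weights are picked proportionally more often.
sample = rng.choices(statuses, weights=weights, k=1000)
counts = Counter(sample)
```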
### Automatic Features
- Dependency resolution
- Unique constraints
- Progress indicators
- CSV/JSON export
- Pandas integration

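One way dependency resolution could work is a topological sort over the `reference` columns, so that every parent table is generated before the tables that point at it. The sketch below uses the standard-library `graphlib` (Python 3.9+). The function name `generation_order` is hypothetical, and this is not necessarily how `MultiTableDataGenerator` orders tables.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def generation_order(tables):
    """Return table names so that referenced tables come before their children."""
    graph = {}
    for table in tables:
        # Collect the parent tables named by this table's reference columns.
        parents = {
            col["ref_table"]
            for col in table.get("columns", [])
            if col.get("type") == "reference"
        }
        graph[table["table_name"]] = parents
    # static_order() yields each node after all of its predecessors;
    # it raises graphlib.CycleError if tables reference each other in a loop.
    return list(TopologicalSorter(graph).static_order())


tables = [
    {"table_name": "orders", "columns": [
        {"name": "user_id", "type": "reference",
         "ref_table": "users", "ref_column": "user_id"}]},
    {"table_name": "users", "columns": [
        {"name": "user_id", "type": "integer"}]},
]
order = generation_order(tables)
```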
---

## 📊 Output Structure

After running, you'll get:

```
data-generator/
├── [All your source files]

└── output/  (or ecommerce_data/, etc.)
    ├── users.csv
    ├── users.json
    ├── orders.csv
    └── orders.json
```

---

## 🎯 Common Use Cases

### 1. Testing Databases
```python
from data_generator import MultiTableDataGenerator

# Generate test data
gen = MultiTableDataGenerator(seed=42)
gen.generate_from_config('config_ecommerce.yaml')
# Import CSVs into your database
```

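The "import CSVs into your database" step depends on your database. As a rough sketch, this is how a generated CSV could be loaded into SQLite with only the standard library. The CSV text and its `user_id`/`name` columns are stand-ins for a real `users.csv`; with an actual file you would pass `open(path, newline="")` to `csv.DictReader` instead of the `StringIO` buffer.

```python
import csv
import io
import sqlite3

# Stand-in for the contents of a generated users.csv (columns are hypothetical).
csv_text = "user_id,name\n1,alice\n2,bob\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")

# Parse the CSV and convert each record to a tuple matching the table columns.
reader = csv.DictReader(io.StringIO(csv_text))
rows = [(int(r["user_id"]), r["name"]) for r in reader]
conn.executemany("INSERT INTO users (user_id, name) VALUES (?, ?)", rows)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```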
### 2. Prototyping Applications
```python
from data_generator import MultiTableDataGenerator

# Quick demo data
gen = MultiTableDataGenerator()
gen.generate_from_config('config_simple.yaml')
# Use in your app prototype
```

### 3. Data Science Practice
```python
from data_generator import MultiTableDataGenerator

# Generate training data
gen = MultiTableDataGenerator(seed=100)
results = gen.generate_from_config('my_ml_config.yaml')
df = gen.get_dataframe('features')
# Use for ML experiments
```

### 4. API Testing
```python
from data_generator import MultiTableDataGenerator

# Generate test payloads
gen = MultiTableDataGenerator()
results = gen.generate_from_config('api_test_config.yaml')
# Use in API tests
```

---

## Configuration Cheat Sheet

### Basic Structure
```yaml
table_name: my_table
rows_count: 100
output_format: both
columns: [...]
```

### Multi-Table Structure
```yaml
output_path: ./output
output_format: both
tables:
  - table_name: table1
    rows_count: 10
    columns: [...]
  - table_name: table2
    rows_count: 50
    columns: [...]
```

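Before generating, it can help to sanity-check a multi-table config. The sketch below works on the config as a plain dict (for example, the result of loading the YAML). The function `validate_config` is hypothetical, and treating `table_name`, `rows_count`, and `columns` as required keys is an assumption based on the structure above, not a documented rule of the generator.

```python
REQUIRED_TABLE_KEYS = ("table_name", "rows_count", "columns")  # assumed required


def validate_config(config):
    """Return a list of problems found in a multi-table config dict."""
    problems = []
    tables = config.get("tables")
    if not isinstance(tables, list) or not tables:
        return ["'tables' must be a non-empty list"]
    for index, table in enumerate(tables):
        # Fall back to the list position when the table has no name.
        label = table.get("table_name", f"tables[{index}]")
        for key in REQUIRED_TABLE_KEYS:
            if key not in table:
                problems.append(f"{label}: missing '{key}'")
        rows_count = table.get("rows_count")
        if rows_count is not None and (
            not isinstance(rows_count, int) or rows_count < 0
        ):
            problems.append(f"{label}: 'rows_count' must be a non-negative integer")
    return problems


config = {
    "output_path": "./output",
    "output_format": "both",
    "tables": [
        {"table_name": "table1", "rows_count": 10, "columns": []},
        {"table_name": "table2", "columns": []},  # rows_count missing on purpose
    ],
}
problems = validate_config(config)
```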
### Column Template
```yaml
- name: column_name
  type: column_type
  # type-specific options...
  unique: false  # optional
```

---

## Commands Reference

```python
# Python interactive
>>> from data_generator import MultiTableDataGenerator
>>> gen = MultiTableDataGenerator(seed=42)
>>> gen.generate_from_config('config_simple.yaml')

# Check output (shell escape, works in a notebook cell)
! ls -la output/
```

---

## 📈 Scaling Tips

| Dataset Size | Recommendation |
|--------------|----------------|
| < 1K rows | Any format, instant |
| 1K - 100K rows | Prefer CSV, seconds |
| 100K - 1M rows | CSV only, minutes |
| 1M+ rows | Batch generation |

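"Batch generation" for very large tables can be sketched as writing the CSV in fixed-size chunks, so only one batch is held in memory at a time. The helper `write_in_batches` and its two-column layout are assumptions for illustration. The example writes to an in-memory buffer; for a real file you would pass `open("big.csv", "w", newline="")` instead.

```python
import csv
import io
import random


def write_in_batches(out_file, total_rows, batch_size, rng):
    """Write `total_rows` rows to `out_file`, holding one batch in memory at a time."""
    writer = csv.writer(out_file)
    writer.writerow(["id", "value"])  # header row
    written = 0
    while written < total_rows:
        size = min(batch_size, total_rows - written)
        # Build only the current batch; ids continue from the previous batch.
        batch = [(written + i + 1, rng.randint(0, 999)) for i in range(size)]
        writer.writerows(batch)
        written += size
    return written


buffer = io.StringIO()
written = write_in_batches(
    buffer, total_rows=2500, batch_size=1000, rng=random.Random(42)
)
line_count = len(buffer.getvalue().splitlines())  # header plus data rows
```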
---

## Quick Troubleshooting

| Problem | Solution |
|---------|----------|
| Module not found | `pip install pyyaml pandas` |
| Config not found | Check the file path; use `ls` to confirm the file exists |
| Can't generate enough unique values | Increase the column's `range` |
| Referenced table error | Define the parent table before the tables that reference it |

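The "can't generate unique" failure comes down to counting: a unique integer column needs at least as many candidate values as rows. The sketch below shows that check with the standard library. `unique_integers` is a hypothetical helper, not part of `data_generator.py`.

```python
import random


def unique_integers(low, high, rows_count, rng):
    """Draw `rows_count` distinct integers from the inclusive range [low, high]."""
    available = high - low + 1
    if rows_count > available:
        raise ValueError(
            f"need {rows_count} unique values but range [{low}, {high}] "
            f"only holds {available}; increase the range or lower rows_count"
        )
    # random.sample draws without replacement, so all values are distinct.
    return rng.sample(range(low, high + 1), rows_count)


rng = random.Random(42)
ids = unique_integers(1, 1000, 100, rng)  # fine: 1000 candidates for 100 rows

try:
    unique_integers(1, 50, 100, rng)      # impossible: only 50 candidates
    failed = False
except ValueError:
    failed = True
```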
---

## Success Checklist

- [ ] All 9 files saved
- [ ] Dependencies installed (`pip install pyyaml pandas`)
- [ ] Can import: `from data_generator import MultiTableDataGenerator`
- [ ] `config_simple.yaml` generates successfully
- [ ] Output files created in `./output/`
- [ ] Jupyter notebook opens and runs
- [ ] Can create custom configs

**All checked? You're ready to generate data!**

## Notes

- **Reproducibility**: Pass a seed (`seed=42`) for consistent results across runs
- **Performance**: CSV is faster than JSON for large datasets
- **Testing**: Start with a small `rows_count` (10-20) while testing a config
- **Safety**: Generated data is saved automatically
- **Pandas**: Use `get_dataframe()` for easy data analysis

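The reproducibility note rests on how seeded pseudo-random generators behave: the same seed yields the same sequence. The sketch below demonstrates this with Python's `random.Random`. It assumes nothing about how `MultiTableDataGenerator` uses its `seed` argument internally, and `sample_rows` is a hypothetical helper.

```python
import random


def sample_rows(seed, n=5):
    """Generate `n` pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.randint(1, 1000) for _ in range(n)]


first = sample_rows(42)
second = sample_rows(42)

# Same seed, same sequence: reruns reproduce the data exactly.
assert first == second
```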
---

## What You Can Build

With this generator, you can create:
- ✅ E-commerce databases
- ✅ Social media datasets
- ✅ School management systems
- ✅ Hospital records
- ✅ Banking transactions
- ✅ IoT sensor data
- ✅ Any relational database!

---

## 🚀 Get Started Now

```bash
# 1. Install
pip install pyyaml pandas

# 2. Test
python -c "from data_generator import MultiTableDataGenerator; \
MultiTableDataGenerator(seed=42).generate_from_config('config_simple.yaml')"
```

---

**You have everything you need to generate professional-quality test data!**

**Questions?** Check `README.md` for complete documentation.

**Want examples?** Open `DataGenerator_Tutorial.ipynb` for interactive tutorials.

**Ready to build?** Start with `config_simple.yaml` and customize it!

---

**Happy Data Generating!**
