Every developer needs realistic test data — for unit tests, integration tests, database seeding, performance testing and demos. Handcrafting that data is slow, and the results are usually unrealistic. This guide covers the best approaches.
Why test data matters
Poor test data causes hidden bugs. If your test users all have the same name, the same email format and sequential IDs, you're not testing edge cases that real users will hit. Realistic test data includes:
- Names from different cultures and languages
- Phone numbers in different formats
- Emails from different providers
- Dates spread across realistic ranges
- Null values and edge cases (empty strings, maximum lengths)
- Referential integrity across related tables
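Edge cases are the easiest of these to forget. A minimal stdlib-only sketch (the field name and probabilities here are illustrative, not from any particular schema) that deliberately mixes nulls, empty strings and maximum-length values into a generated batch:

```python
import random
import string

def make_edge_case_name(max_len=255):
    """Return a name value, occasionally an edge case: None, '', or max length."""
    roll = random.random()
    if roll < 0.05:
        return None                    # simulate a NULL
    if roll < 0.10:
        return ''                      # empty string
    if roll < 0.15:
        return 'x' * max_len           # maximum-length value
    return ''.join(random.choices(string.ascii_letters, k=8))

names = [make_edge_case_name() for _ in range(1000)]
print(sum(n is None for n in names), 'nulls')
print(sum(n == '' for n in names), 'empty strings')
```

Sprinkling a few percent of edge-case values into every text column catches truncation and null-handling bugs that uniformly "nice" data never will.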
Method 1: Generate test data with Python Faker
The faker library is the most popular Python tool for generating realistic fake data:
from faker import Faker
import csv

fake = Faker('en_IN')  # Indian locale

users = []
for i in range(100):
    users.append({
        'id': i + 1,
        'name': fake.name(),
        'email': fake.email(),
        'phone': fake.phone_number(),
        'city': fake.city(),
        'created_at': fake.date_between(
            start_date='-2y', end_date='today'
        ).isoformat()
    })

with open('users.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=users[0].keys())
    writer.writeheader()
    writer.writerows(users)

print(f"Generated {len(users)} users")
💡 Locales: Faker supports en_IN (India), en_US (USA), en_GB (UK), hi_IN (Hindi) and 70+ other locales. Always use the right locale for your target market.
Method 2: Generate relational data with referential integrity
The hardest part of test data generation is maintaining referential integrity — ensuring that foreign key values in child tables actually exist in parent tables. Here's an approach:
from faker import Faker
import random

fake = Faker('en_IN')

# Generate parent table first
departments = [
    {'id': i + 1, 'name': dept}
    for i, dept in enumerate([
        'Engineering', 'Product', 'Sales',
        'Marketing', 'HR', 'Finance'
    ])
]

# Generate child table referencing parent IDs
employees = []
for i in range(200):
    dept = random.choice(departments)
    employees.append({
        'id': i + 1,
        'name': fake.name(),
        'email': fake.email(),
        'department_id': dept['id'],  # valid FK
        'salary': random.randint(40000, 200000),
        'hire_date': fake.date_between(
            start_date='-5y', end_date='today'
        ).isoformat()
    })

print(f"Generated {len(employees)} employees")
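Whichever generator you use, it's worth asserting referential integrity before loading the data. A stdlib-only check (the helper name and the sample rows are illustrative; the shapes mirror the tables above):

```python
def check_foreign_keys(child_rows, fk_field, parent_rows, pk_field='id'):
    """Return child rows whose foreign key has no matching parent row."""
    parent_ids = {row[pk_field] for row in parent_rows}
    return [row for row in child_rows if row[fk_field] not in parent_ids]

departments = [{'id': 1, 'name': 'Engineering'}, {'id': 2, 'name': 'Sales'}]
employees = [
    {'id': 1, 'name': 'Asha', 'department_id': 1},
    {'id': 2, 'name': 'Ravi', 'department_id': 99},  # dangling FK
]

orphans = check_foreign_keys(employees, 'department_id', departments)
print(orphans)
```

Running a check like this on every generated dataset catches dangling foreign keys before they surface as confusing JOIN failures in your tests.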
Method 3: Generate test data from SQL schema without code
Sylvaera's Schema to Dataset tool accepts your SQL CREATE TABLE statements and generates realistic, relationally consistent data automatically. Paste your schema, set the row count, and download as CSV, JSON or Excel.
It understands:
- Foreign key constraints — child rows reference valid parent IDs
- NOT NULL constraints — required fields are always populated
- UNIQUE constraints — no duplicate values in unique columns
- Data types — integers get integers, dates get dates, booleans get true/false
- Column names — a column named email gets email values, phone gets phone numbers
Generating data for performance testing
Performance tests need large volumes — often millions of rows. For this scale, pure Python loops are too slow. Use vectorized generation:
import pandas as pd
import numpy as np

n = 1_000_000  # 1 million rows

df = pd.DataFrame({
    'id': range(1, n + 1),
    'amount': np.random.lognormal(mean=9, sigma=1, size=n).round(2),
    'status': np.random.choice(
        ['completed', 'pending', 'failed', 'refunded'],
        size=n, p=[0.7, 0.15, 0.1, 0.05]
    ),
    'created_at': pd.date_range(
        start='2022-01-01', periods=n, freq='1min'
    )
})

df.to_csv('transactions.csv', index=False)
print(f"Generated {n:,} rows in seconds")
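If even a single DataFrame of that size doesn't fit in memory, stream the rows to disk in batches instead. A stdlib-only sketch (the filename, batch size and column set are arbitrary choices, not part of any API):

```python
import csv
import random

def write_transactions(path, total_rows, batch_size=100_000):
    """Stream rows to CSV in batches so memory use stays flat."""
    statuses = ['completed', 'pending', 'failed', 'refunded']
    weights = [0.7, 0.15, 0.1, 0.05]
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['id', 'amount', 'status'])
        row_id = 1
        while row_id <= total_rows:
            batch = []
            for _ in range(min(batch_size, total_rows - row_id + 1)):
                batch.append([
                    row_id,
                    round(random.lognormvariate(9, 1), 2),
                    random.choices(statuses, weights=weights)[0],
                ])
                row_id += 1
            writer.writerows(batch)  # one write call per batch, not per row

write_transactions('transactions.csv', 250_000)
```

Because only one batch of rows is ever held in memory, the same function scales to hundreds of millions of rows, limited only by disk space.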
The most important rule of test data: make it look like production data, but make sure it isn't.
Try Schema to Dataset — Free
Paste any SQL schema and get realistic, relationally consistent test data instantly. Understands foreign keys, constraints and data types. Export as JSON, CSV or Excel.
Open Schema to Dataset →