Sylvaera / Blog / Testing & QA
TESTING & QA

How to Generate Realistic test data for SQL databases

Every developer needs realistic test data — for unit tests, integration tests, database seeding, performance testing and demos. Handcrafting test data manually is slow and the results are usually unrealistic. This guide shows the best approaches.

Why test data matters

Poor test data causes hidden bugs. If your test users all have the same name, the same email format and sequential IDs, you're not testing edge cases that real users will hit. Realistic test data includes:

Method 1: Generate test data with Python Faker

The faker library is the most popular Python tool for generating realistic fake data:

from faker import Faker
import csv

fake = Faker('en_IN')  # Indian locale

users = []
for i in range(100):
    users.append({
        'id': i + 1,
        'name': fake.name(),
        'email': fake.email(),
        'phone': fake.phone_number(),
        'city': fake.city(),
        'created_at': fake.date_between(
            start_date='-2y', end_date='today'
        ).isoformat()
    })

with open('users.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=users[0].keys())
    writer.writeheader()
    writer.writerows(users)

print(f"Generated {len(users)} users")

💡 Locales: Faker supports en_IN (India), en_US (USA), en_GB (UK), hi_IN (Hindi) and 70+ other locales. Always use the right locale for your target market.

Method 2: Generate data from SQL schema

The hardest part of test data generation is maintaining referential integrity — ensuring that foreign key values in child tables actually exist in parent tables. Here's an approach:

from faker import Faker
import random

fake = Faker('en_IN')

# Generate parent table first
departments = [
    {'id': i+1, 'name': dept}
    for i, dept in enumerate([
        'Engineering', 'Product', 'Sales',
        'Marketing', 'HR', 'Finance'
    ])
]

# Generate child table referencing parent IDs
employees = []
for i in range(200):
    dept = random.choice(departments)
    employees.append({
        'id': i + 1,
        'name': fake.name(),
        'email': fake.email(),
        'department_id': dept['id'],  # valid FK
        'salary': random.randint(40000, 200000),
        'hire_date': fake.date_between(
            start_date='-5y', end_date='today'
        ).isoformat()
    })

print(f"Generated {len(employees)} employees")

Method 3: Generate test data from SQL schema without code

Sylvaera's Schema to Dataset tool accepts your SQL CREATE TABLE statements and generates realistic, relationally consistent data automatically. Paste your schema, set the row count, and download as CSV, JSON or Excel.

It understands:

Generating data for performance testing

Performance tests need large volumes — often millions of rows. For this scale, pure Python loops are too slow. Use vectorized generation:

import pandas as pd
import numpy as np
from faker import Faker

fake = Faker('en_IN')
n = 1_000_000  # 1 million rows

df = pd.DataFrame({
    'id': range(1, n + 1),
    'amount': np.random.lognormal(mean=9, sigma=1, size=n).round(2),
    'status': np.random.choice(
        ['completed', 'pending', 'failed', 'refunded'],
        size=n, p=[0.7, 0.15, 0.1, 0.05]
    ),
    'created_at': pd.date_range(
        start='2022-01-01', periods=n, freq='1min'
    )
})

df.to_csv('transactions.csv', index=False)
print(f"Generated {n:,} rows in seconds")

The most important rule of test data: make it look like production data, but make sure it isn't.

Try Schema to Dataset — Free

Paste any SQL schema and get realistic, relationally consistent test data instantly. Understands foreign keys, constraints and data types. Export as JSON, CSV or Excel.

Open Schema to Dataset →