Random Data Generation: Then and Now

Modern approaches to generating test data with Python Faker


In 2018, I wrote about using SQL functions to generate random test data in MySQL. While that approach served its purpose, the landscape of test data generation has evolved significantly. Today, I want to share my experience with using the Faker library, which has become my go-to tool for creating realistic test datasets.

The Traditional SQL Approach

The traditional approach to generating test data relied heavily on SQL functions like RAND() and string manipulation. This method worked but had limitations:

  • Generated data often looked artificial
  • Creating complex patterns required intricate SQL
  • Maintaining consistent relationships between fields was challenging
  • Each new data type needed custom SQL logic
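To make the contrast concrete, here is a rough Python sketch of what those SQL pipelines typically produced. It imitates the common SUBSTRING(MD5(RAND()), 1, 8)-style pattern; the field names and lengths are illustrative, not taken from my original 2018 post.

import random
import string

random.seed(42)  # reproducible, like pinning a seed in tests

def sql_style_value(length=8):
    # Random hex fragment, the hallmark of MD5(RAND())-based generation:
    # it fills the column but looks nothing like real data
    return "".join(random.choices("0123456789abcdef", k=length))

row = {
    "name": sql_style_value(),
    "email": f"{sql_style_value()}@example.com",
}
print(row)  # values like 'b0a3...' — clearly artificial

The output satisfies the schema, but a human reviewer can tell at a glance that it is synthetic, which is exactly the first limitation listed above.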

Moving to Faker

The Faker library represents a modern approach to test data generation. Available through PyPI (faker package), this library provides a comprehensive solution for creating realistic test data.

Here’s a basic example using Faker:

import random
from faker import Faker
import json  # import json for pretty printing

fake = Faker("en_CA")
Faker.seed(4321)

job_department_map = {
    "delivery": ["delivery driver"],
    "counter": ["counterperson", "front manager"],
    "kitchen": ["pizza chef", "prep cook", "sub maker"],
    "marketing": ["marketing lead", "social media manager"],
}


records = []
for _ in range(5):
    department = random.choice(list(job_department_map.keys()))
    job_title = random.choice(job_department_map[department])
    # use first_name()/last_name() rather than splitting fake.name(),
    # which can include honorifics like "Dr." or suffixes like "MD"
    first, last = fake.first_name(), fake.last_name()
    name = f"{first} {last}"
    domain = fake.free_email_domain()
    email = f"{first.lower()}.{last.lower()}@{domain}"

    user_data = {
        "name": name,
        "email": email,
        "address": fake.address(),
        "phone": fake.phone_number(),
        "company": fake.company() + " Pizza",
        "job_title": job_title,
        "department": department,
        "product_sku": fake.bothify(text="PZA-###??"),
    }
    records.append(user_data)

# pretty print all records at once
print(json.dumps(records, indent=4, ensure_ascii=False))

What I particularly appreciate about Faker is its built-in support for:

  1. Localized data: Generate region-specific content (in the example, English-speaking Canada)
  2. Related fields: Create logically connected data points (in the example, the email matches the record’s name)
  3. Company context: Layer your own domain logic on top of Faker’s output (in the example, custom department names and job titles)
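The localization point is worth a quick illustration. Faker ships with many locale providers; this sketch compares two of them ("fr_CA" and "ja_JP" are standard Faker locale codes):

from faker import Faker

# Each locale produces region-appropriate formats out of the box
fake_ca = Faker("fr_CA")
fake_jp = Faker("ja_JP")

print(fake_ca.address())  # Canadian street formats and postal codes
print(fake_jp.name())     # Japanese names in native script

Swapping the locale string is all it takes; no custom generation logic is needed per region.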

Helpful Testing Tip!

If your test suite requires generating test data repeatedly, you can pin the seed value to create consistently reproducible datasets. See “Faker.seed(4321)” in the example above.
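To show what that reproducibility looks like in practice, here is a minimal sketch: Faker.seed() seeds the random generator shared by Faker instances, so re-seeding with the same value replays the same sequence.

from faker import Faker

Faker.seed(4321)
fake = Faker("en_CA")
first_run = [fake.name() for _ in range(3)]

# Re-seed with the same value to replay the identical sequence
Faker.seed(4321)
second_run = [fake.name() for _ in range(3)]

assert first_run == second_run  # identical datasets across runs

This is especially handy when a failing test needs to be re-run against the exact dataset that triggered it.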

Key Benefits I’ve Found

In my database work, Faker has proven valuable for several reasons:

  1. Time savings: What used to take dozens of lines of SQL now takes just a few lines of Python
  2. Data quality: The generated data looks and feels realistic
  3. Consistency: Related fields maintain logical relationships
  4. Flexibility: Easy to adapt for different use cases

When to Use Each Approach

Based on my experience, here’s when I would choose each method:

SQL Generation works best for:

  • Quick, simple random values
  • Direct database operations
  • When avoiding external dependencies is crucial

Faker shines when:

  • Data needs to look realistic
  • Working with complex data types
  • Building comprehensive test scenarios
  • Supporting multiple locales

Read more about Faker: https://faker.readthedocs.io/en/master/

Summary

The evolution from SQL-based random data to tools like Faker reflects our growing need for realistic test data. While both approaches have their place, I’ve found Faker to be an invaluable addition to my toolkit for generating test data that better represents data found in the real world.