Data Modeling

Random Data Generation: Then and Now

Random Data Generation: Then and Now

Modern approaches to generating test data with Python Faker

Valerie Parham-Thompson

In 2018, I wrote about using SQL functions to generate random test data in MySQL. While that approach served its purpose, the landscape of test data generation has evolved significantly. Today, I want to share my experience with using the Faker library, which has become my go-to tool for creating realistic test datasets.

The Traditional SQL Approach

The traditional approach to generating test data relied heavily on SQL functions like RAND() and string manipulation. This method worked but had limitations:

Code as Instructional Technology

Code as Instructional Technology

Writing an interactive command-line tool as a learning tool for YugabyteDB REST APIs

Valerie Parham-Thompson

I’ve had the chance to share my database expertise in a variety of venues: speaking at meetups and conferences, leading hands-on workshops, mentoring new technologists, and of course writing.

I had been brewing a new idea for sharing content when a great opportunity landed in my lap.

The idea was: share what I know about managing a specific database product in code. Instead of creating a runbook for how to set up replication, I would write code that sets up replication. The key part is that it would have to be well-organized, commented, and documented to be useful to learners. Making it interactive would help users understand the options and parameters as they chose the commands and added flags. Even the error statements would give them insight into how it all works.

Database transformation from SQL Server to YugabyteDB

Database transformation from SQL Server to YugabyteDB

Migrating data from SQL Server to YugabyteDB

Valerie Parham-Thompson

A database transformation and migration project takes solid planning and testing. I’ve found that three common changes required when transforming a SQL Server database to YugabyteDB YSQL are related to syntax, performance, and stored procedures. These will get you started on your transformation project.

Syntax

Transforming a schema from MS SQL to YugabyteDB requires some minor syntax changes. This is true for any cross-database transformation. The YugabyteDB YSQL API utilizes PostgreSQL syntax.

Why You Need a Default Partition

Why You Need a Default Partition

Required default partitions to avoid lost data in Postgres and YugabyteDB

Valerie Parham-Thompson

Postgres and YugabyteDB allow you to define partitions of parent tables. Partitions are useful in at least two ways:

  1. You can take advantage of partition pruning. The database doesn’t need to look at partitions it knows won’t meet the parameters of the query.
  2. You can easily archive data by disconnecting and/or dropping partitions instead of managing expensive delete queries.

Here’s one gotcha I ran into recently. What happens if you insert a row into a partitioned table, but there’s no partition for it? The insert fails with an error – see below for a reproduction of this scenario.

Generate Random Data

Generate Random Data

Generating random data for testing in YugabyteDB

Valerie Parham-Thompson

I had to create a 10 million row table for testing recently, and put together a query to generate random data for it.

INSERT INTO my_table
(id,
mydatetime,
string1,
string2)

SELECT
(random() * 70 + 10)::int,
TIMESTAMP '2024-01-01 00:00:00.000000' + interval '1 millisecond' * (random() * 86400 * 1000 * 365),
(array['alligator','bear','cat','dog'])[(random() * 3 + 1)::int],
substr(md5(random()::text), 1, 10)

FROM generate_series(1, 10);

The id field is just a random integer in this example, but you’d probably use an identity column.

Cleaning College Scorecard Data

Cleaning College Scorecard Data

Tips on cleaning the College Scorecard data

Valerie Parham-Thompson

Cleaning the College Scorecard data before using it locally to query columns of interest to us in a college search allowed me to use correct datatypes and to fit the data into a Postgres table. In case you haven’t had a chance to see other walkthroughs of my automation process for various demo needs, the full Ansible setup will download data from a source and then load it into a table. Previously, I have used the process for MoMA art and artists data, and for generating a million-row table, and storing these in different YugabyteDB topologies. I leveraged this recently to load College Scorecard data for our child’s college search.

Timestamps in PostgreSQL Migration

Timestamps in PostgreSQL Migration

Handling timestamps across database systems like Postgres

Valerie Parham-Thompson

Math… the universal language. Timestamps, not so much.

The way we decide to denote date and time differs across both computer languages and human languages. The format also differs across implementations of SQL. For example, Oracle and Postgres allow very different formats to be entered in the timestamp data type.

Oracle allows a wide variety of punctuation in dates: hyphens, slashes, commas, periods, colons. Postgres supports a more limited list.

String Search

String Search

Link to YFTT fuzzy matching for string searches

Valerie Parham-Thompson

Quick post to share my presentation last week at the YugabyteDB Friday Tech Talk. It was on fuzzy matching, and more generally string searches. Got to nerd out on two of my favorite topics: words (broadly, linguistics and specifically, names) and databases. Check it out!

(Code for scenarios in my repo, here: https://github.com/dataindataout/xtest_ansible/tree/main/scenarios/fuzzy)

https://www.youtube.com/watch?v=vmHRnR1nFdQ

Fuzzy matching in YugabyteDB

Fuzzy matching in YugabyteDB

Techniques for matching similar strings in YugabyteDB

Valerie Parham-Thompson

Fuzzy string matching in YugabyteDB can be done with wildcard lookups, phonetic algorithms (Soundex, Metaphone), and trigram similarity. I’ll show a demo of practical examples using artist names, highlighting the performance differences between wildcard searches and phonetic indices. A combination of indexed double metaphone and trigram methods works best for both speed and precision. Also, while YugabyteDB supports PostgreSQL-style extensions, some indexing optimizations behave differently due to its distributed storage layer.

Postgres Partial Indexes on Email Address Domains

Postgres Partial Indexes on Email Address Domains

Improve query performance for email domain searches using PostgreSQL partial indexes to avoid full table scans

Valerie Parham-Thompson

Partial indexes in PostgreSQL dramatically improve query performance for searches on specific email domains. Standard indexes may not help with wildcard searches, but a partial index targeting a particular domain can reduce query times by avoiding full table scans. Practical examples and execution plans illustrate the speedup, and partial indexes are best for data distributions that remain stable over time.

Read more!