Data Engineering

Correct Partition Endpoints

Using the correct endpoints in YugabyteDB database partitioning

Valerie Parham-Thompson

I was recently reviewing a database partitioning definition in YugabyteDB (via its Postgres-compatible YSQL API), and realized the partition distribution might not be what the developer intended.

What is database partitioning?

Database partitioning is used to divide large tables into smaller tables (partitions). While the data is physically separate, the application can access the data logically as a single table.

This can help performance through a process called partition pruning. The database planner skips partitions that don’t hold the data. For example, if a table is partitioned on months of the year, a query on a single month only has to access the rows in the single partition for that month.
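
Here's a minimal sketch of a month-partitioned table (the table and column names are hypothetical). One detail worth calling out, since it's the source of most endpoint mistakes: in Postgres and YSQL, a range partition's FROM bound is inclusive and its TO bound is exclusive.

CREATE TABLE events (
    id int,
    created_at timestamp
) PARTITION BY RANGE (created_at);

-- The TO bound is exclusive: a row at exactly '2024-02-01 00:00:00'
-- lands in the February partition, not January's.
CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE events_2024_02 PARTITION OF events
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Partition pruning: this query only has to touch events_2024_01.
SELECT count(*) FROM events
WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01';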

Why You Need a Default Partition

Using default partitions to avoid losing data in Postgres and YugabyteDB

Valerie Parham-Thompson

Postgres and YugabyteDB allow you to define partitions of parent tables. Partitions are useful in at least two ways:

  1. You can take advantage of partition pruning. The database doesn’t need to look at partitions it knows won’t meet the parameters of the query.
  2. You can easily archive data by disconnecting and/or dropping partitions instead of managing expensive delete queries.

Here’s one gotcha I ran into recently. What happens if you insert a row into a partitioned table, but there’s no partition for it? The insert fails with an error – see below for a reproduction of this scenario.
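
A minimal sketch of that reproduction, using a hypothetical list-partitioned table:

CREATE TABLE orders (
    id int,
    region text
) PARTITION BY LIST (region);

CREATE TABLE orders_east PARTITION OF orders
    FOR VALUES IN ('east');

-- No partition accepts 'west', so this insert fails with an error
-- along the lines of:
--   ERROR:  no partition of relation "orders" found for row
INSERT INTO orders (id, region) VALUES (1, 'west');

-- A default partition catches rows that match no other partition.
CREATE TABLE orders_default PARTITION OF orders DEFAULT;

INSERT INTO orders (id, region) VALUES (1, 'west'); -- now succeeds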

Generate Random Data

Generating random data for testing in YugabyteDB

Valerie Parham-Thompson

I had to create a 10 million row table for testing recently, and put together a query to generate random data for it.
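
The query below assumes the table already exists. A minimal definition, with column types inferred from the generating expressions, might look like this:

CREATE TABLE my_table (
    id int,
    mydatetime timestamp,
    string1 text,
    string2 text
);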

INSERT INTO my_table
(id,
mydatetime,
string1,
string2)

SELECT
-- random integer between 10 and 80 (see the note on identity columns below)
(random() * 70 + 10)::int,
-- random millisecond-resolution timestamp within the year starting 2024-01-01
TIMESTAMP '2024-01-01 00:00:00.000000' + interval '1 millisecond' * (random() * 86400 * 1000 * 365),
-- pick one of the four animal names; floor() gives a uniform 1..4 index,
-- where rounding with a bare ::int would favor the middle elements
(array['alligator','bear','cat','dog'])[floor(random() * 4 + 1)::int],
-- random 10-character hex string
substr(md5(random()::text), 1, 10)

-- generate 10 million rows (lower this number to experiment)
FROM generate_series(1, 10000000);

The id field is just a random integer in this example, but you’d probably use an identity column.

Open Source Database

What does it mean to be a database engineer of multiple open-source databases?

Valerie Parham-Thompson

I’m an open-source database consultant. But which open-source database? Well, several of them.

I made the decision several years ago to take every opportunity to work with multiple databases. Why?

  1. Learning a new language teaches you more about your own. For example, taking time to understand SSTables in Cassandra gave me more insight into how storage works in MySQL. Having these experiences across multiple databases forced me to question what I knew about internals, thereby deepening my understanding overall.

Database Scaling for Seasonal Increases

Understanding seasonal patterns when planning for database scaling

Valerie Parham-Thompson

The timing of seasonal demand depends on the industry, but cyclical increases in traffic apply to every industry. Maybe your cycle is shorter than a full year. Or maybe it’s tied to things like weather patterns or fashion.

You know when your business gets the most traffic.

  • Retail? Black Friday/Cyber Monday.
  • Health and fitness? New Year’s Day.
  • Pizza restaurant? Halloween and Thanksgiving, surprisingly.

You do know your business well. But do you know how to make sure your database infrastructure can keep up with traffic during these busy periods?

Cleaning College Scorecard Data

Tips on cleaning the College Scorecard data

Valerie Parham-Thompson

Cleaning the College Scorecard data before querying it locally let me use correct datatypes and fit the data into a Postgres table. In case you haven’t seen my other walkthroughs of this automation process, the full Ansible setup downloads data from a source and then loads it into a table. I’ve previously used the process for the MoMA art and artists data and for generating a million-row table, storing these in different YugabyteDB topologies. I leveraged it recently to load College Scorecard data for our child’s college search.
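
As one illustration of the kind of cleaning involved, a common pattern is to load the raw CSV into an all-text staging table and then cast into proper types. This is only a sketch: the column names are a small hypothetical subset of the real file, which has many more columns, and the placeholder strings are assumptions about the raw values.

CREATE TABLE scorecard_raw (
    unitid text,
    instnm text,
    stabbr text,
    adm_rate text
);

-- load the raw file, e.g. in psql:
-- \copy scorecard_raw FROM 'scorecard.csv' WITH (FORMAT csv, HEADER)

-- Cast into real datatypes, turning placeholder strings into NULLs.
CREATE TABLE scorecard AS
SELECT
    unitid::int AS unitid,
    instnm,
    stabbr,
    NULLIF(NULLIF(adm_rate, 'NULL'), 'PrivacySuppressed')::numeric AS adm_rate
FROM scorecard_raw;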

Open Data College Scorecard

Exploring the College Scorecard open data set from the Department of Education

Valerie Parham-Thompson

I’ve been pretty quiet recently, I know. My youngest has been going through the college application phase, which has taken a lot of time for both of us. I wouldn’t give up all the lovely college visits and overnights, but I might easily part ways with the paperwork.

I do have something fun to share from the experience. I didn’t like the limitations of common college search websites. In particular, we were looking for a college in a subset of surrounding states. Most college search forms allow you to enter only a region, and the Southeastern region was too broad for her search. I also didn’t like that signing up for the sites subjected you to a lot of marketing.

Timestamps in PostgreSQL Migration

Handling timestamps across database systems like Postgres

Valerie Parham-Thompson

Math… the universal language. Timestamps, not so much.

The way we decide to denote date and time differs across both computer languages and human languages. The format also differs across implementations of SQL. For example, Oracle and Postgres allow very different formats to be entered in the timestamp data type.

Oracle allows a wide variety of punctuation in dates: hyphens, slashes, commas, periods, colons. Postgres supports a more limited list.
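
As a sketch of what that means in practice: a plain cast in Postgres handles ISO-style input directly, while an Oracle-style dotted, day-first string may not be accepted by the cast and is better parsed with an explicit format mask (the sample values here are hypothetical).

-- ISO-style input casts directly:
SELECT '2024-01-15 10:30:00'::timestamp;

-- to_timestamp() with a format mask parses a dotted, day-first
-- string unambiguously during a migration:
SELECT to_timestamp('15.01.2024 10:30:00', 'DD.MM.YYYY HH24:MI:SS');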

Leveraging time to live (TTL)

Using TTL to expire records in YugabyteDB and Cassandra

Valerie Parham-Thompson

In both MySQL and Postgres, expiring records after a set period of time takes a couple of timestamps and a little creativity. With Cassandra, or in this case the YugabyteDB ycql API, TTL (time to live) can be leveraged to handle this functionality, simplifying both the table definition and amount of work required by your code.

Here’s a short test to demonstrate. Reminder that you can set up a quick 3-node cluster using the code here: https://github.com/dataindataout/xtest_ansible.
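
As a sketch of the idea (the keyspace and table here are hypothetical), a row-level TTL in the ycql shell looks like this:

CREATE KEYSPACE IF NOT EXISTS demo;

CREATE TABLE demo.sessions (
    id int PRIMARY KEY,
    token text
);

-- The row expires 60 seconds after the insert.
INSERT INTO demo.sessions (id, token) VALUES (1, 'abc123') USING TTL 60;

-- TTL() reports the seconds remaining before the column expires.
SELECT token, TTL(token) FROM demo.sessions WHERE id = 1;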

How do you keep up with technology?

Using Obsidian for close reading and annotation of technical articles

Valerie Parham-Thompson

One of my favorite interview questions is, “How do you keep up with technology?” The answer to this question shows a lot about a candidate. Do they use downtime at work to read the recent blogs? Are they asking for new assignments to stretch their skillsets? And what about the thought leaders in the space: are they connected to them?

But we do have to face that there is more new technology, and more news about technology, than anyone could possibly read in a day. In my chosen slice of technology, which is the comparatively small slice of open-source databases, there seems to be a new major database every six months. That’s not even counting feature updates, security bugs, and the broad ecosystem around databases. My inbox has been full to overflowing with invitations to review something, attend conferences, notice the hot new feature in a top technology, and so on.