Data Engineering

Query Optimization with HypoPG

Query Optimization with HypoPG

Using HypoPG to test hypothetical indexes for query optimization in YugabyteDB

Valerie Parham-Thompson

Query optimization is a critical aspect of database performance tuning. While YugabyteDB’s YSQL API provides powerful tools for analyzing query performance through EXPLAIN plans, sometimes we need to experiment with different indexing strategies without the overhead of actually creating the indexes. This is where HypoPG comes in handy.

Understanding HypoPG

HypoPG is a PostgreSQL extension that allows you to create hypothetical indexes and see how they would affect your query plans without actually creating the indexes. This is particularly useful when:

Random Data Generation: Then and Now

Random Data Generation: Then and Now

Modern approaches to generating test data with Python Faker

Valerie Parham-Thompson

In 2018, I wrote about using SQL functions to generate random test data in MySQL. While that approach served its purpose, the landscape of test data generation has evolved significantly. Today, I want to share my experience with using the Faker library, which has become my go-to tool for creating realistic test datasets.

The Traditional SQL Approach

The traditional approach to generating test data relied heavily on SQL functions like RAND() and string manipulation. This method worked but had limitations:

College Scorecard Application

College Scorecard Application

Building an application to serve College Scorecard open data

Valerie Parham-Thompson

Still working on the College Scorecard dataset. Previously I explored the dataset in a real-world application, talked about how to clean the data, and worked with the data API.

Now I’ve decided I want to put this in a web application so others can use the dataset in the same flexible way that I have been using it. (Reminder that several popular college-search websites exist, but they are limited in the ways you can filter the data. Also they tend to gather personal data for I suppose ad generation.)

Handling Reserved Keywords in DSBulk for Seamless Data Migration

Handling Reserved Keywords in DSBulk for Seamless Data Migration

How to handle reserved keywords using Datastax DSBulk in YugabyteDB migration

Valerie Parham-Thompson

Migrating to YugabyteDB offers significant advantages in terms of high availability, global distribution, and horizontal scalability—features essential for managing modern database workloads. However, data migration can be a complex process, particularly when transforming your schema definition. Differences in datatype support, query syntax, and core features across systems can complicate the transformation.

One of the challenges is dealing with reserved keywords in the source schema that cannot be directly used in the target system. This can require changes not only in the database schema during transformation but also in application code and related tooling.

College Scorecard API

College Scorecard API

Mapping College Scorecard data using the API

Valerie Parham-Thompson

After I finished the YugabyteDB universe network mapping example, I started thinking about other things to map. Anything with latitude and longitude will work. College locations from my previous work on the College Scorecard data set were an obvious choice.

Previously, I had exported the data and transformed it to allow for sorting and analysis. That’s still a valid method if you want to play with the pull data set, since the API allows only page size of max 100 at a time. However, with the right filters, that might be enough, and the API is a quicker path to getting the data.

Plotly Network Map

Plotly Network Map

Using the Plotly library to work with geographic data

Valerie Parham-Thompson

I’ve added a new feature to the day 2 ops tool.

With the diagram command, you can create a map of your Yugabyte cluster overlaid on a map of the world. Here’s an example:

Yugabyte Network Map

The Plotly library is very powerful, with a lot of options. I used the network map option, which allows you to define nodes and the edges between the nodes. In this case, the nodes are an abstraction of the database instances in a YugabyteDB cluster, and the edges represent the network connections between them.

Code as Instructional Technology

Code as Instructional Technology

Writing an interactive command-line tool as a learning tool for YugabyteDB REST APIs

Valerie Parham-Thompson

I’ve had the chance to share my database expertise in a variety of venues: speaking at meetups and conferences, leading hands-on workshops, mentoring new technologists, and of course writing.

I had been brewing a new idea for sharing content when a great opportunity landed in my lap.

The idea was: share what I know about managing a specific database product in code. Instead of creating a runbook for how to set up replication, I would write code that sets up replication. The key part is that it would have to be well-organized, commented, and documented to be useful to learners. Making it interactive would help users understand the options and parameters as they chose the commands and added flags. Even the error statements would give them insight into how it all works.

Count Large Partitions in YCQL

Count Large Partitions in YCQL

Counting large partitions in the YugabyteDB Cassandra API

Valerie Parham-Thompson

One thing that can really wreck your performance in Cassandra and the similar YugabyteDB YCQL is large partitions due to an imbalanced key. Without the robust nodetool commands of Cassandra, it can be challenging to find these large partitions in YugabyteDB.

dsbulk is a tool used for migrating data, and YugabyteDB has a fork that takes into consideration slight differences from Cassandra. That tool can be leveraged to list the top largest partitions.

Database transformation from SQL Server to YugabyteDB

Database transformation from SQL Server to YugabyteDB

Migrating data from SQL Server to YugabyteDB

Valerie Parham-Thompson

A database transformation and migration project takes solid planning and testing. I’ve found that three common changes required when transforming a SQL Server database to YugabyteDB YSQL are related to syntax, performance, and stored procedures. These will get you started on your transformation project.

Syntax

Transforming a schema from MS SQL to YugabyteDB requires some minor syntax changes. This is true for any cross-database transformation. The YugabyteDB YSQL API utilizes PostgreSQL syntax.

Correct Partition Endpoints

Correct Partition Endpoints

Using the correct endpoints in YugabyteDB database partitioning

Valerie Parham-Thompson

I was recently reviewing a database partitioning definition in YugabyteDB (the postgres “ysql” API), and realized the partition distribution might not be what the developer intended.

What is database partitioning?

Database partitioning is used to divide large tables into smaller tables (partitions). While the data is physically separate, the application can access the data logically as a single table.

This can help performance through a process called partition pruning. The database planner skips partitions that don’t hold the data. For example, if a table is partitioned on months of the year, a query on a single month only has to access the rows in the single partition for that month.