Database Load Generator¶
db-load-generator
is a Python framework and toolbox for generating artificial database loads with as little code as necessary.
It uses Java and JDBC drivers to connect to databases, and it comes with a command-line tool and ready-to-use scenarios out of the box.
What db-load-generator can do
Generate load scenarios out of annotated SQL code
Provide convenient decorators to build queries and scenarios
Fully compatible with Python’s DBAPI2
Invoke any scenario or query using a powerful CLI tool
Configurable through config files, environment variables, and CLI arguments
Can run as a Dramatiq background task queue processor
Supported databases
Microsoft SQL Server
Oracle
IBM DB2 (both LUW and z/OS)
SAP Hana DB
Teradata DB
MySQL
PostgreSQL
SQLite
any other database that comes with a JDBC driver…
Getting Started¶
Dynamic documentation
All code, command samples, and outputs below are dynamically generated during the documentation build process using the latest version of db-load-generator.
Basic setup¶
Create an annotated SQL file with a couple of queries in it. We are going to start with two queries: one creates a simple table and the other drops it.
queries.sql:
-- name: create_departments_table, scenario: setup
CREATE TABLE departments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE
);
-- name: drop_departments_table, scenario: teardown
DROP TABLE IF EXISTS departments;
Launch the dbload CLI to see how the queries were parsed:
dbload --sql queries.sql show queries
The output below shows the queries that were parsed from the SQL file:
+-------------------------------+---------------------------------------------------------------------------------+
| Key | Value |
+-------------------------------+---------------------------------------------------------------------------------+
| add_department | |
| sql | INSERT INTO departments (name) VALUES (?); |
| add_employee | |
| sql | INSERT INTO employees (name, title, dep_id) VALUES (?, ?, ?); |
| create_departments_table | |
| sql | CREATE TABLE departments ( id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT U... |
| create_employees_table | |
| sql | CREATE TABLE employees ( id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, ti... |
| drop_departments_table | |
| sql | DROP TABLE IF EXISTS departments |
| get_departments | |
| sql | SELECT * FROM departments; |
| get_departments_return_random | |
| sql | SELECT * FROM departments; |
| sort_employees | |
| sql | SELECT * FROM employees ORDER BY name; |
+-------------------------------+---------------------------------------------------------------------------------+
Launch the dbload CLI once again to show what scenarios were parsed:
dbload --sql queries.sql show scenarios
One scenario called setup
and the other one called teardown
were parsed and created based on the annotated SQL file.
In this case, each scenario contains a link to the query that mentioned it in the annotation.
These queries will be executed in the order specified by the auto_run_queries
list belonging to the scenario.
+--------------------+--------------------------+
| Key | Value |
+--------------------+--------------------------+
| setup | |
| auto_run_queries | |
| - | create_departments_table |
| - | create_employees_table |
| teardown | |
| auto_run_queries | |
| - | drop_departments_table |
+--------------------+--------------------------+
You can see that these scenarios were created implicitly (without being explicitly declared in a Python module),
because they were mentioned in the scenario: setup
and scenario: teardown
parts of the query annotations.
Each annotated SQL query can mention several scenarios in its annotation, and these
scenarios will be created automatically if they were not explicitly declared before.
If a query mentions a scenario in its annotation, that query will be executed as part of the mentioned scenario.
Since several queries can mention the same scenario, within that scenario they are executed in the order they were defined in the SQL file.
Alternatively, the execution order can be specified in square brackets after the name of the scenario: scenario: setup[100].
The lower the number, the higher the execution priority of that query.
Any such queries are executed before any other logic inside the defined scenario proceeds.
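To make the annotation grammar concrete, here is a minimal, hypothetical sketch of how such an annotation line could be parsed; the regex and field names are illustrative and are not the library's actual parser:

```python
import re

# Hypothetical parser for annotation lines such as:
#   -- name: add_department, scenario: setup[100]
ANNOTATION = re.compile(
    r"--\s*name:\s*(?P<name>\w+)"
    r"(?:,\s*scenario:\s*(?P<scenario>\w+)(?:\[(?P<priority>\d+)\])?)?"
)

def parse_annotation(line):
    """Extract the query name, optional scenario, and optional priority."""
    m = ANNOTATION.match(line.strip())
    if not m:
        return None
    return {
        "name": m.group("name"),
        "scenario": m.group("scenario"),
        # Lower number = higher priority; None keeps SQL-file order.
        "priority": int(m.group("priority")) if m.group("priority") else None,
    }

parsed = parse_annotation("-- name: create_departments_table, scenario: setup[100]")
```

Given such parsed annotations, a scenario's auto_run_queries list could simply be the matching queries sorted by priority and then by position in the file.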
Connect to database¶
It’s time to connect to the actual database to run our scenarios and queries!
At least Java 8 must be installed, and the JAVA_HOME
environment variable should point to the location of the JVM installation.
If no suitable JVM is found there, common directories for the platform will be searched.
On Windows, the registry will be consulted as well.
JDBC driver for the destination database must be present on the system as well.
In this example we will connect to an SQLite database. A suitable driver,
sqlite-jdbc-3.34.0.jar
(the latest GA release at the time of writing),
can be downloaded from SQLite JDBC Driver GitHub.
In order to connect, two parameters must be specified: the DSN string (URI) and the class path to the driver files. Let’s test our connection:
dbload \
--dsn jdbc:sqlite:sample.db \
--classpath sqlite-jdbc-3.34.0.jar \
test
More feature-rich databases such as SQL Server, Oracle, SAP Hana DB, and others might require additional parameters:
a JDBC driver class name (--driver com.ibm.db2.jcc.DB2Jcc)
or driver arguments such as the user, password,
or database name (--driver-arg user=sa).
But for SQLite, using just the dsn
and classpath
is enough. The connection was successful:
Successfully connected to the database.
Configuration sources¶
Specifying the connection parameters as command line arguments every time can get out of hand really quickly. Instead, let’s put them in a config file.
dbload.json:
{
    "classpath": ["sqlite-jdbc-3.34.0.jar"],
    "dsn": "jdbc:sqlite:sample.db",
    "sql": ["queries.sql"]
}
Every command line argument has its matching counterpart in the config file. If some of the arguments are not specified in either of the locations, the default values are used.
By default, dbload looks for the dbload.json
file in the current directory.
It is possible to specify a different configuration path using the --config other_dir/dbload.cfg
option.
If a non-default configuration file lives outside the current directory, any relative paths inside it are still resolved
against the current working directory of dbload (the directory from which dbload was launched).
Every configuration parameter can also be supplied as an environment variable with the DBLOAD_
prefix.
For example:
export DBLOAD_DSN="jdbc:sqlite:sample.db"
is equivalent to --dsn jdbc:sqlite:sample.db
.
Configuration sources are loaded in the following order, from least to most important, with new values merged on top of the old ones:
Default values
JSON file
Environment variables
Command line arguments
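The layering above can be sketched in a few lines. This is an illustration of the merge order only, not the library's actual configuration loader, and the parameter names are taken from the examples in this guide:

```python
import json
import os

def load_config(cli_args, config_path="dbload.json"):
    """Merge configuration sources from least to most important."""
    config = {"dsn": None, "classpath": [], "sql": []}  # 1. defaults

    # 2. The JSON file overrides defaults.
    if os.path.exists(config_path):
        with open(config_path) as f:
            config.update(json.load(f))

    # 3. Environment variables with the DBLOAD_ prefix override the file.
    for key, value in os.environ.items():
        if key.startswith("DBLOAD_"):
            config[key[len("DBLOAD_"):].lower()] = value

    # 4. Command line arguments win over everything.
    config.update(cli_args)
    return config

os.environ["DBLOAD_DSN"] = "jdbc:sqlite:sample.db"
merged = load_config({"verbose": 1}, config_path="missing.json")
# merged["dsn"] now comes from the environment variable.
```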
As mentioned above, more sophisticated databases require other parameters, such as driver arguments with user names and passwords, to initiate a connection. For some databases, the driver arguments must be presented as a dictionary and for others, as a list. Both ways are supported. Choose based on what the particular database’s driver needs. Example for Microsoft SQL Server:
{
    "driver_arg": { "databaseName": "master", "user": "sa", "password": "StrongPassword!" }
}
and for IBM DB2:
{
    "driver_arg": ["db2demo", "PaSsWoRdforDB2"]
}
Execute a statement¶
You can execute an arbitrary statement against the database. Let’s see what tables we have in our fresh SQLite database. It currently contains no tables:
$ dbload execute "SELECT * FROM sqlite_master where type='table'"
Executing: SELECT * FROM sqlite_master where type='table'
+---+------+------+----------+----------+-----+
| # | type | name | tbl_name | rootpage | sql |
+---+------+------+----------+----------+-----+
+---+------+------+----------+----------+-----+
Prepared statements with properties are also supported. This command will produce the exact same result:
$ dbload execute "SELECT * FROM sqlite_master where type=?" --property table
Executing: SELECT * FROM sqlite_master where type=?
Properties: ['table']
+---+------+------+----------+----------+-----+
| # | type | name | tbl_name | rootpage | sql |
+---+------+------+----------+----------+-----+
+---+------+------+----------+----------+-----+
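The same parameterized pattern is plain DBAPI2 and can be tried with Python's built-in sqlite3 module, which also uses ? placeholders:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Equivalent of:
#   dbload execute "SELECT * FROM sqlite_master WHERE type=?" --property table
cur.execute("SELECT * FROM sqlite_master WHERE type=?", ("table",))
rows = cur.fetchall()

# A fresh database contains no tables, so the result set is empty.
conn.close()
```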
Run queries¶
Let’s run some of the queries we’ve declared earlier in the queries.sql file. Create a table:
dbload query create_departments_table
Executing: create_departments_table
+---------------+
| Rows affected |
+---------------+
| 0 |
+---------------+
and then drop it:
dbload query drop_departments_table
Executing: drop_departments_table
+---------------+
| Rows affected |
+---------------+
| 0 |
+---------------+
Run scenarios¶
Similarly, we can run entire scenarios from the command line.
dbload scenario setup
Executing: setup
Our implicit scenario “setup” merely executes its auto_run_queries
and does nothing else,
so nothing besides the general “Executing…” line is printed to the terminal.
Insert data¶
Let’s insert some data into the table we’ve just created.
Add a new query that inserts data into the departments
table:
-- name: create_departments_table, scenario: setup
CREATE TABLE departments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE
);
-- name: drop_departments_table, scenario: teardown
DROP TABLE IF EXISTS departments;
-- name: add_department
INSERT INTO departments (name) VALUES (?);
As you can see, add_department
is a prepared statement.
It requires a property to be supplied before execution:
dbload query add_department --property 'accounting'
dbload query add_department -p 'r&d'
Notice that -p
is a shortcut for --property.
The dbload CLI has shortcuts for almost all supported configuration properties.
The same approach of supplying data via properties applies to updating and deleting data as well.
Select data¶
Now, to compose some queries that retrieve data, let’s add a query that selects all rows from the departments table.
-- name: create_departments_table, scenario: setup
CREATE TABLE departments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE
);
-- name: drop_departments_table, scenario: teardown
DROP TABLE IF EXISTS departments;
-- name: add_department
INSERT INTO departments (name) VALUES (?);
-- name: get_departments, option: return_random
SELECT * FROM departments;
But there is a new annotation tag: we’ve added the option: return_random
part to the annotation of the query.
It automatically generates an additional variant of the get_departments query,
which returns a single random row instead of all the rows.
That can come in handy when we need to quickly retrieve a random foreign key required for some other update or insert operation.
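On SQLite, one plausible way to implement such a random-row variant is ORDER BY RANDOM() LIMIT 1. The sketch below uses the built-in sqlite3 module and may differ from what db-load-generator does internally:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE departments ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " name TEXT UNIQUE)"
)
cur.executemany(
    "INSERT INTO departments (name) VALUES (?)",
    [("accounting",), ("r&d",), ("sales",)],
)

# A "return_random"-style variant: fetch one random row instead of all rows.
cur.execute("SELECT * FROM departments ORDER BY RANDOM() LIMIT 1")
dep_id, dep_name = cur.fetchone()
conn.close()
```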
dbload query get_departments
Executing: get_departments
+---+----+------+
| # | id | name |
+---+----+------+
+---+----+------+
Now, let’s try the “return_random” variant of that query.
dbload query get_departments_return_random
Executing: get_departments_return_random
+---+----+------+
| # | id | name |
+---+----+------+
+---+----+------+
Explicit scenarios¶
We will now add employees to our database. First, we’ll add a couple more queries that will accomplish that.
-- name: create_departments_table, scenario: setup
CREATE TABLE departments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE
);
-- name: drop_departments_table, scenario: teardown
DROP TABLE IF EXISTS departments;
-- name: add_department
INSERT INTO departments (name) VALUES (?);
-- name: get_departments, option: return_random
SELECT * FROM departments;
-- name: create_employees_table, scenario: setup
CREATE TABLE employees (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    title TEXT,
    dep_id INTEGER,
    FOREIGN KEY(dep_id) REFERENCES departments(id)
)
-- name: add_employee
INSERT INTO employees (name, title, dep_id) VALUES (?, ?, ?);
Now, let’s define our first explicit scenario in a Python module.
scenarios.py:
import dbload
from faker import Faker


@dbload.scenario
def load_employees(connection):
    fake = Faker()

    for _ in range(10):
        name = fake.name()
        title = fake.job()
        dep_id = dep_name = None

        with connection.cursor() as c:
            dep_id, dep_name = load_employees.get_departments_return_random(c).first

        with connection.cursor() as c:
            load_employees.add_employee(c, name, title, dep_id)

        print(f"New employee: {name} working as '{title}' in '{dep_name}'.")
Boy, that escalated quickly. Let’s digest it piece by piece. First, two modules are imported: dbload and Faker, a Python package that generates fake data.
import dbload
from faker import Faker
Note
For all intents and purposes, the scenarios.py
file we’ve created is a normal Python module.
It can be part of a package or contain any other logic or libraries.
When we point the dbload
CLI to this file, it imports the file as a module.
After that, we define a scenario using the @dbload.scenario
decorator.
This decorator automatically registers the scenario within dbload
and does some other things, like providing a valid open connection.
@dbload.scenario
def load_employees(connection):
    fake = Faker()
For each employee we want to add to the database, we’ll open a new cursor and generate some fake data.
The department ID (dep_id), however, is a foreign key that has to reference an actual ID from the departments table.
That’s where the return_random query variant comes in handy: we will use it to retrieve a random department ID.
    for _ in range(10):
        name = fake.name()
        title = fake.job()
        dep_id = dep_name = None

        with connection.cursor() as c:
            dep_id, dep_name = load_employees.get_departments_return_random(c).first

        with connection.cursor() as c:
            load_employees.add_employee(c, name, title, dep_id)

        print(f"New employee: {name} working as '{title}' in '{dep_name}'.")
Notice how the get_departments_return_random
function is not actually defined anywhere in our Python module:
it was implicitly created from the annotated SQL file.
By default, each scenario gets infused with all parsed queries, which are attached to the scenario function as attributes.
That’s why we access the query as an attribute of the scenario.
The .first
attribute then retrieves the first row returned by the query (our return_random query returns only one row),
and we unpack that tuple to get both the random department ID we need and its name.
Finally, we invoke the add_employee
query and supply it with the data we’ve generated and collected.
Like get_departments_return_random, the add_employee
query was parsed from the SQL file and attached to the scenario object as an attribute.
Note
You don’t have to pass a cursor
argument to a query when calling it.
If no open cursor or connection is passed to a query or a scenario, one will be created automatically.
The downside is that, in the scenario above, a new connection would then be created for each query
invoked without a valid cursor.
That would be 20 new connections sequentially opened and closed to invoke 2 queries 10 times each.
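That is why the scenario above creates its cursors from the single connection it receives. The difference is easy to picture with sqlite3: one connection, many cheap cursors. This is an illustration of the pattern, not dbload code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")

# Reuse one connection for all iterations: cursors are cheap,
# opening and closing a fresh connection per query is not.
for i in range(10):
    cur = conn.cursor()
    cur.execute("INSERT INTO employees (name) VALUES (?)", (f"employee-{i}",))
    cur.close()
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
conn.close()
```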
Let’s make sure the employees table is created and then add the employees to it by running the scenario we’ve created.
dbload --ignore scenario setup
Since we’ve added a new query to the “setup” scenario and want to run it again to create the employees table, we’ll get errors
because the other table (departments) already exists.
To ignore those errors we can specify the --ignore
flag.
This flag works when invoking queries as well.
Executing: setup
Now, let’s run the load_employees scenario:
dbload --module scenarios.py scenario load_employees
Explicit queries¶
Let’s add a new query but this time we will define it in the Python module.
import dbload
from faker import Faker


@dbload.query
def sort_employees(cursor):
    sql = "SELECT * FROM employees ORDER BY name"
    with cursor:
        cursor.execute(sql)
        rows = cursor.fetchall()
        print(f"Alphabetically first employee: {rows[0][1]}")
In this query we are not relying on the annotated SQL file at all; instead, we do everything manually:
define the SQL statement in a string, execute it with the cursor, and fetch the results.
We could automate the result-fetching part in the same way it happens within implicit queries.
To do that, we feed the cursor holding the executed statement to the QueryResult
class.
from dbload import query, scenario, QueryResult
from faker import Faker


@query
def sort_employees(cursor):
    sql = "SELECT * FROM employees ORDER BY name"
    with cursor:
        cursor.execute(sql)
        name = QueryResult.from_cursor(cursor).first[1]
        print(f"Alphabetically first employee: {name}")
We’ve changed the import line to include QueryResult.
Now, when building a query result instance from the cursor, we get its conveniences in return, such as the .first
attribute.
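For intuition, a stripped-down stand-in for such a result wrapper might look like the class below. This is a hypothetical sketch over a plain DBAPI2 cursor (shown with sqlite3), not the real QueryResult implementation:

```python
import sqlite3

class MiniResult:
    """Hypothetical stand-in for dbload's QueryResult convenience wrapper."""

    def __init__(self, rows):
        self.rows = rows

    @classmethod
    def from_cursor(cls, cursor):
        # Drain the executed cursor into a result object.
        return cls(cursor.fetchall())

    @property
    def first(self):
        # First returned row, or None when the result set is empty.
        return self.rows[0] if self.rows else None

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO employees (name) VALUES ('Ada')")
cur.execute("SELECT * FROM employees ORDER BY name")
result = MiniResult.from_cursor(cur)
conn.close()
```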
dbload --module scenarios.py query sort_employees
Executing: sort_employees
+---+----+------+-------+--------+
| # | id | name | title | dep_id |
+---+----+------+-------+--------+
+---+----+------+-------+--------+
How about combining annotated SQL with an explicitly defined query?
It’s convenient to store the actual SQL statements in the SQL file: we get proper syntax highlighting,
reusability (the same .sql
file can be opened in any SQL IDE), and cleaner Python code.
queries.sql:
-- name: sort_employees
SELECT * FROM employees ORDER BY name;
scenarios.py:
@query
def sort_employees(cursor):
    with cursor:
        cursor.execute(sort_employees.sql)
        name = QueryResult.from_cursor(cursor).first[1]
        print(f"Alphabetically first employee: {name}")
The annotated SQL query sort_employees got matched with the explicitly defined query of the same name in the Python module.
Queries are created implicitly only when they are not defined in the Python module.
In this case, the query was defined, so the annotated SQL statement was attached to the query function as a .sql
attribute
that contains the actual statement.
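The matching idea itself fits in a few lines: parse the annotated SQL file into name/statement pairs, then attach each statement to the same-named function. This is a hypothetical sketch of the mechanism, not the library's code:

```python
import re

SQL_TEXT = """\
-- name: sort_employees
SELECT * FROM employees ORDER BY name;
"""

def parse_sql(text):
    """Split an annotated SQL file into {name: statement} pairs."""
    queries = {}
    for block in re.split(r"^--\s*name:\s*", text, flags=re.M)[1:]:
        header, _, body = block.partition("\n")
        name = header.split(",")[0].strip()
        queries[name] = body.strip().rstrip(";")
    return queries

def sort_employees(cursor):
    cursor.execute(sort_employees.sql)

# Attach each parsed statement to the function of the same name.
functions = {"sort_employees": sort_employees}
for name, sql in parse_sql(SQL_TEXT).items():
    if name in functions:
        functions[name].sql = sql
```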
By executing the rewritten query we’ll get the same result:
dbload --module scenarios.py query sort_employees
Executing: sort_employees
+---+----+------+-------+--------+
| # | id | name | title | dep_id |
+---+----+------+-------+--------+
+---+----+------+-------+--------+
Standalone script¶
You are not required to use the dbload
CLI to invoke queries and scenarios.
Instead, we can turn our scenarios.py into a standalone script.
#!/usr/bin/env python
from dbload import (
    query,
    scenario,
    QueryResult,
    get_context,
    get_connection,
    get_config,
)
from faker import Faker


@query
def sort_employees(cursor):
    with cursor:
        cursor.execute(sort_employees.sql)
        name = QueryResult.from_cursor(cursor).first[1]
        print(f"Alphabetically first employee: {name}")


@scenario
def load_employees(connection):
    fake = Faker()
    for _ in range(10):
        name = fake.name()
        title = fake.job()
        dep_id = dep_name = None
        with connection.cursor() as c:
            dep_id, dep_name = load_employees.get_departments_return_random(c).first
        with connection.cursor() as c:
            load_employees.add_employee(c, name, title, dep_id)
        print(f"New employee: {name} working as '{title}' in '{dep_name}'.")


if __name__ == "__main__":
    # Initialize config: will search and parse config file, env vars, etc.
    config = get_config({"driver": "org.sqlite.JDBC", "verbose": 0})

    # Get global context
    context = get_context()

    # Infuse context with content parsed from annotated SQL file
    context.infuse()

    # Initialize new connection
    connection = get_connection(config)

    # Invoke something...
    with connection.cursor() as cur:
        sort_employees(cur)
We can run it as a normal Python script:
python scenarios.py
2021-06-18 12:41:17.617 | DEBUG | dbload.context:register_query:72 - Registering 'sort_employees' query in the context.
2021-06-18 12:41:17.618 | DEBUG | dbload.context:register_scenario:94 - Registering 'load_employees' scenario in the context.
Installation¶
db-load-generator
can be installed from PyPI.
It needs Java installed and a JDBC driver to connect to the actual database.
For other installation options and requirements, please see Installation.
pip install db-load-generator
Attributions¶
DBLoad is built on top of Python, Java/JDBC, and the excellent JPype library.
Icons made by Nikita Golubev and Smashicons from Flaticon.