Database Load Generator¶
db-load-generator
is a Python framework and toolbox for generating artificial database loads with as little code as necessary.
It uses Java and JDBC drivers to connect to databases, and it comes with a command-line tool and ready-to-use scenarios out of the box.
What db-load-generator can do
Generate load scenarios out of annotated SQL code
Provide convenient decorators to build queries and scenarios
Fully compatible with Python’s DBAPI2
Invoke any scenario or query using a powerful CLI tool
Configurable through config files, environment variables, and CLI arguments
Can run as a Dramatiq background task queue processor
Supported databases
Microsoft SQL Server
Oracle
IBM DB2 (both LUW and z/OS)
SAP Hana DB
Teradata DB
MySQL
PostgreSQL
SQLite
any other database that comes with a JDBC driver…
Getting Started¶
Dynamic documentation
All code, command samples, and outputs below are dynamically generated during the documentation build process using the latest version of db-load-generator.
Basic setup¶
Create an annotated SQL file with a couple of queries in it. We are going to start with two queries: one creates a simple table and the other drops it.
queries.sql:
-- name: create_departments_table, scenario: setup
CREATE TABLE departments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE
);
-- name: drop_departments_table, scenario: teardown
DROP TABLE IF EXISTS departments;
Launch the dbload CLI to see how the queries were parsed:
dbload --sql queries.sql show queries
The output below shows the queries that were parsed from the SQL file:
+-------------------------------+---------------------------------------------------------------------------------+
| Key | Value |
+-------------------------------+---------------------------------------------------------------------------------+
| add_department | |
| sql | INSERT INTO departments (name) VALUES (?); |
| add_employee | |
| sql | INSERT INTO employees (name, title, dep_id) VALUES (?, ?, ?); |
| create_departments_table | |
| sql | CREATE TABLE departments ( id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT U... |
| create_employees_table | |
| sql | CREATE TABLE employees ( id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, ti... |
| drop_departments_table | |
| sql | DROP TABLE IF EXISTS departments |
| get_departments | |
| sql | SELECT * FROM departments; |
| get_departments_return_random | |
| sql | SELECT * FROM departments; |
| sort_employees | |
| sql | SELECT * FROM employees ORDER BY name; |
+-------------------------------+---------------------------------------------------------------------------------+
Launch the dbload CLI once again to show what scenarios were parsed:
dbload --sql queries.sql show scenarios
One scenario called setup
and the other one called teardown
were parsed and created based on the annotated SQL file.
In this case, each scenario contains a link to the query that mentioned it in the annotation.
These queries will be executed in the order specified by the auto_run_queries
list belonging to the scenario.
+--------------------+--------------------------+
| Key | Value |
+--------------------+--------------------------+
| setup | |
| auto_run_queries | |
| - | create_departments_table |
| - | create_employees_table |
| teardown | |
| auto_run_queries | |
| - | drop_departments_table |
+--------------------+--------------------------+
You can see that these scenarios were created implicitly (without being explicitly declared in a Python module),
because they were mentioned in the scenario: setup
and scenario: teardown
parts of the query annotations.
Each annotated SQL query can mention several scenarios in its annotation, and these
scenarios will be created automatically if they were not explicitly declared before.
If a query mentions a scenario in its annotation, that query will be executed as part of the mentioned scenario.
Since several queries can mention the same scenario, within that scenario they are executed in the order they were defined in the SQL file.
Alternatively, the execution order can be specified in square brackets after the name of the scenario: scenario: setup[100].
The lower the number, the higher the execution priority of that query.
Any such queries are executed before any other logic inside the defined scenario proceeds.
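To make the annotation grammar concrete, here is a minimal, hypothetical sketch of how such an annotation line could be parsed; the regex and field names are illustrative and are not the library's actual parser:

```python
import re

# Hypothetical parser for annotation lines such as:
#   -- name: add_department, scenario: setup[100]
ANNOTATION = re.compile(
    r"--\s*name:\s*(?P<name>\w+)"
    r"(?:,\s*scenario:\s*(?P<scenario>\w+)(?:\[(?P<priority>\d+)\])?)?"
)

def parse_annotation(line):
    """Extract the query name, optional scenario, and optional priority."""
    m = ANNOTATION.match(line.strip())
    if not m:
        return None
    return {
        "name": m.group("name"),
        "scenario": m.group("scenario"),
        # Lower number = higher priority; None keeps SQL-file order.
        "priority": int(m.group("priority")) if m.group("priority") else None,
    }

parsed = parse_annotation("-- name: create_departments_table, scenario: setup[100]")
```

Given such parsed annotations, a scenario's auto_run_queries list could simply be the matching queries sorted by priority and then by position in the file.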
Connect to database¶
It’s time to connect to the actual database to run our scenarios and queries!
At least Java 8 must be installed, and the JAVA_HOME
environment variable should point to the location of the JVM installation.
If no suitable JVM is found there, common directories for the platform will be searched.
On Windows, the registry will be consulted as well.
JDBC driver for the destination database must be present on the system as well.
In this example we will connect to an SQLite database. A suitable driver,
sqlite-jdbc-3.34.0.jar
(the latest GA release at the time of writing),
can be downloaded from SQLite JDBC Driver GitHub.
In order to connect, two parameters must be specified: the DSN string (URI) and the class path to the driver files. Let’s test our connection:
dbload \
--dsn jdbc:sqlite:sample.db \
--classpath sqlite-jdbc-3.34.0.jar \
test
More feature-rich databases such as SQL Server, Oracle, SAP Hana DB, and others might require additional parameters:
a JDBC driver class name (--driver com.ibm.db2.jcc.DB2Jcc)
or driver arguments such as the user, password,
or database name (--driver-arg user=sa).
But for SQLite, using just the dsn
and classpath
is enough. The connection was successful:
Successfully connected to the database.
Configuration sources¶
Specifying the connection parameters as command line arguments every time can get out of hand really quickly. Instead, let’s put them in a config file.
dbload.json:
{
    "classpath": ["sqlite-jdbc-3.34.0.jar"],
    "dsn": "jdbc:sqlite:sample.db",
    "sql": ["queries.sql"]
}
Every command line argument has its matching counterpart in the config file. If some of the arguments are not specified in either of the locations, the default values are used.
By default, dbload looks for the dbload.json
file in the current directory.
It is possible to specify a different configuration path using the --config other_dir/dbload.cfg
option.
If a non-default configuration file lives outside the current directory, any relative paths inside it are still resolved
against the current working directory of dbload (the directory from which dbload was launched).
Every configuration parameter can also be supplied as an environment variable with the DBLOAD_
prefix.
For example:
export DBLOAD_DSN="jdbc:sqlite:sample.db"
is equivalent to --dsn jdbc:sqlite:sample.db
.
Configuration sources are loaded in the following order, from least to most important, with new values merged on top of the old ones:
Default values
JSON file
Environment variables
Command line arguments
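The layering above can be sketched in a few lines. This is an illustration of the merge order only, not the library's actual configuration loader, and the parameter names are taken from the examples in this guide:

```python
import json
import os

def load_config(cli_args, config_path="dbload.json"):
    """Merge configuration sources from least to most important."""
    config = {"dsn": None, "classpath": [], "sql": []}  # 1. defaults

    # 2. The JSON file overrides defaults.
    if os.path.exists(config_path):
        with open(config_path) as f:
            config.update(json.load(f))

    # 3. Environment variables with the DBLOAD_ prefix override the file.
    for key, value in os.environ.items():
        if key.startswith("DBLOAD_"):
            config[key[len("DBLOAD_"):].lower()] = value

    # 4. Command line arguments win over everything.
    config.update(cli_args)
    return config

os.environ["DBLOAD_DSN"] = "jdbc:sqlite:sample.db"
merged = load_config({"verbose": 1}, config_path="missing.json")
# merged["dsn"] now comes from the environment variable.
```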
As mentioned above, more sophisticated databases require other parameters, such as driver arguments with user names and passwords, to initiate a connection. For some databases, the driver arguments must be presented as a dictionary and for others, as a list. Both ways are supported. Choose based on what the particular database’s driver needs. Example for Microsoft SQL Server:
{
    "driver_arg": { "databaseName": "master", "user": "sa", "password": "StrongPassword!" }
}
and for IBM DB2:
{
    "driver_arg": ["db2demo", "PaSsWoRdforDB2"]
}
Execute a statement¶
You can execute an arbitrary statement against the database. Let’s see what tables we have in our fresh SQLite database. It currently contains no tables:
$ dbload execute "SELECT * FROM sqlite_master where type='table'"
Executing: SELECT * FROM sqlite_master where type='table'
+---+------+------+----------+----------+-----+
| # | type | name | tbl_name | rootpage | sql |
+---+------+------+----------+----------+-----+
+---+------+------+----------+----------+-----+
Prepared statements with properties are also supported. This command will produce the exact same result:
$ dbload execute "SELECT * FROM sqlite_master where type=?" --property table
Executing: SELECT * FROM sqlite_master where type=?
Properties: ['table']
+---+------+------+----------+----------+-----+
| # | type | name | tbl_name | rootpage | sql |
+---+------+------+----------+----------+-----+
+---+------+------+----------+----------+-----+
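The same parameterized pattern is plain DBAPI2 and can be tried with Python's built-in sqlite3 module, which also uses ? placeholders:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Equivalent of:
#   dbload execute "SELECT * FROM sqlite_master WHERE type=?" --property table
cur.execute("SELECT * FROM sqlite_master WHERE type=?", ("table",))
rows = cur.fetchall()

# A fresh database contains no tables, so the result set is empty.
conn.close()
```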
Run queries¶
Let’s run some of the queries we’ve declared earlier in the queries.sql file. Create a table:
dbload query create_departments_table
Executing: create_departments_table
+---------------+
| Rows affected |
+---------------+
| 0 |
+---------------+
and then drop it:
dbload query drop_departments_table
Executing: drop_departments_table
+---------------+
| Rows affected |
+---------------+
| 0 |
+---------------+
Run scenarios¶
Similarly, we can run entire scenarios from the command line.
dbload scenario setup
Executing: setup
Our implicit scenario “setup” merely executes its auto_run_queries
and does nothing else,
so nothing besides the general “Executing…” line is printed to the terminal.
Insert data¶
Let’s insert some data into the table we’ve just created.
Add a new query that inserts data into the departments
table:
-- name: create_departments_table, scenario: setup
CREATE TABLE departments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE
);
-- name: drop_departments_table, scenario: teardown
DROP TABLE IF EXISTS departments;
-- name: add_department
INSERT INTO departments (name) VALUES (?);
As you can see, add_department
is a prepared statement.
It requires a property to be supplied before execution:
dbload query add_department --property 'accounting'
dbload query add_department -p 'r&d'
Notice that -p
is a shortcut for --property.
The dbload CLI has shortcuts for almost all supported configuration properties.
The same approach of supplying data via properties applies to updating and deleting data as well.
Select data¶
Now, to compose some queries that retrieve data, let’s add a query that selects all rows from the departments table.
-- name: create_departments_table, scenario: setup
CREATE TABLE departments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE
);
-- name: drop_departments_table, scenario: teardown
DROP TABLE IF EXISTS departments;
-- name: add_department
INSERT INTO departments (name) VALUES (?);
-- name: get_departments, option: return_random
SELECT * FROM departments;
But there is a new annotation tag: we’ve added the option: return_random
part to the annotation of the query.
It automatically generates an additional variant of the get_departments query,
which returns a single random row instead of all the rows.
That can come in handy when we need to quickly retrieve a random foreign key required for some other update or insert operation.
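On SQLite, one plausible way to implement such a random-row variant is ORDER BY RANDOM() LIMIT 1. The sketch below uses the built-in sqlite3 module and may differ from what db-load-generator does internally:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE departments ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " name TEXT UNIQUE)"
)
cur.executemany(
    "INSERT INTO departments (name) VALUES (?)",
    [("accounting",), ("r&d",), ("sales",)],
)

# A "return_random"-style variant: fetch one random row instead of all rows.
cur.execute("SELECT * FROM departments ORDER BY RANDOM() LIMIT 1")
dep_id, dep_name = cur.fetchone()
conn.close()
```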
dbload query get_departments
Executing: get_departments
+---+----+------+
| # | id | name |
+---+----+------+
+---+----+------+
Now, let’s try the “return_random” variant of that query.
dbload query get_departments_return_random
Executing: get_departments_return_random
+---+----+------+
| # | id | name |
+---+----+------+
+---+----+------+
Explicit scenarios¶
We will now add employees to our database. First, we’ll add a couple more queries that will accomplish that.
-- name: create_departments_table, scenario: setup
CREATE TABLE departments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE
);
-- name: drop_departments_table, scenario: teardown
DROP TABLE IF EXISTS departments;
-- name: add_department
INSERT INTO departments (name) VALUES (?);
-- name: get_departments, option: return_random
SELECT * FROM departments;
-- name: create_employees_table, scenario: setup
CREATE TABLE employees (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    title TEXT,
    dep_id INTEGER,
    FOREIGN KEY(dep_id) REFERENCES departments(id)
)
-- name: add_employee
INSERT INTO employees (name, title, dep_id) VALUES (?, ?, ?);
Now, let’s define our first explicit scenario in a Python module.
scenarios.py:
import dbload
from faker import Faker


@dbload.scenario
def load_employees(connection):
    fake = Faker()

    for _ in range(10):
        name = fake.name()
        title = fake.job()
        dep_id = dep_name = None

        with connection.cursor() as c:
            dep_id, dep_name = load_employees.get_departments_return_random(c).first

        with connection.cursor() as c:
            load_employees.add_employee(c, name, title, dep_id)

        print(f"New employee: {name} working as '{title}' in '{dep_name}'.")
Boy, that escalated quickly. Let’s digest it piece by piece. First, two modules are imported: dbload and Faker, a Python package that generates fake data.
import dbload
from faker import Faker
Note
For all intents and purposes, the scenarios.py
file we’ve created is a normal Python module.
It can be part of a package or contain any other logic or libraries.
When we point the dbload
CLI to this file, it imports the file as a module.
After that, we define a scenario using the @dbload.scenario
decorator.
This decorator automatically registers the scenario within dbload
and does some other things, like providing a valid open connection.
@dbload.scenario
def load_employees(connection):
    fake = Faker()
For each employee we want to add to the database, we’ll open a new cursor and generate some fake data.
The department ID (dep_id), however, is a foreign key that has to reference an actual ID from the departments table.
That’s where the return_random query variant comes in handy: we will use it to retrieve a random department ID.
    for _ in range(10):
        name = fake.name()
        title = fake.job()
        dep_id = dep_name = None

        with connection.cursor() as c:
            dep_id, dep_name = load_employees.get_departments_return_random(c).first

        with connection.cursor() as c:
            load_employees.add_employee(c, name, title, dep_id)

        print(f"New employee: {name} working as '{title}' in '{dep_name}'.")
Notice how the get_departments_return_random
function is not actually defined anywhere in our Python module:
it was implicitly created from the annotated SQL file.
By default, each scenario gets infused with all parsed queries, which are attached to the scenario function as attributes.
That’s why we access the query as an attribute of the scenario.
The .first
attribute then retrieves the first row returned by the query (our return_random query returns only one row),
and we unpack that tuple to get both the random department ID we need and its name.
Finally, we invoke the add_employee
query and supply it with the data we’ve generated and collected.
Like get_departments_return_random, the add_employee
query was parsed from the SQL file and attached to the scenario object as an attribute.
Note
You don’t have to pass a cursor
argument to a query when calling it.
If no open cursor or connection is passed to a query or a scenario, one will be created automatically.
The downside is that, in the scenario above, a new connection would then be created for each query
invoked without a valid cursor.
That would be 20 new connections sequentially opened and closed to invoke 2 queries 10 times each.
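That is why the scenario above creates its cursors from the single connection it receives. The difference is easy to picture with sqlite3: one connection, many cheap cursors. This is an illustration of the pattern, not dbload code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")

# Reuse one connection for all iterations: cursors are cheap,
# opening and closing a fresh connection per query is not.
for i in range(10):
    cur = conn.cursor()
    cur.execute("INSERT INTO employees (name) VALUES (?)", (f"employee-{i}",))
    cur.close()
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
conn.close()
```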
Let’s make sure the employees table is created and then add the employees to it by running the scenario we’ve created.
dbload --ignore scenario setup
Since we’ve added a new query to the “setup” scenario and want to run it again to create the employees table, we’ll get errors
because the other table (departments) already exists.
To ignore those errors we can specify the --ignore
flag.
This flag works when invoking queries as well.
Executing: setup
Now, let’s run the load_employees scenario:
dbload --module scenarios.py scenario load_employees
Explicit queries¶
Let’s add a new query but this time we will define it in the Python module.
import dbload
from faker import Faker


@dbload.query
def sort_employees(cursor):
    sql = "SELECT * FROM employees ORDER BY name"
    with cursor:
        cursor.execute(sql)
        rows = cursor.fetchall()
        print(f"Alphabetically first employee: {rows[0][1]}")
In this query we are not relying on the annotated SQL file at all; instead, we do everything manually:
define the SQL statement in a string, execute it with the cursor, and fetch the results.
We could automate the result-fetching part in the same way it happens within implicit queries.
To do that, we feed the cursor holding the executed statement to the QueryResult
class.
from dbload import query, scenario, QueryResult
from faker import Faker


@query
def sort_employees(cursor):
    sql = "SELECT * FROM employees ORDER BY name"
    with cursor:
        cursor.execute(sql)
        name = QueryResult.from_cursor(cursor).first[1]
        print(f"Alphabetically first employee: {name}")
We’ve changed the import line to include QueryResult.
Now, when building a query result instance from the cursor, we get its conveniences in return, such as the .first
attribute.
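For intuition, a stripped-down stand-in for such a result wrapper might look like the class below. This is a hypothetical sketch over a plain DBAPI2 cursor (shown with sqlite3), not the real QueryResult implementation:

```python
import sqlite3

class MiniResult:
    """Hypothetical stand-in for dbload's QueryResult convenience wrapper."""

    def __init__(self, rows):
        self.rows = rows

    @classmethod
    def from_cursor(cls, cursor):
        # Drain the executed cursor into a result object.
        return cls(cursor.fetchall())

    @property
    def first(self):
        # First returned row, or None when the result set is empty.
        return self.rows[0] if self.rows else None

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO employees (name) VALUES ('Ada')")
cur.execute("SELECT * FROM employees ORDER BY name")
result = MiniResult.from_cursor(cur)
conn.close()
```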
dbload --module scenarios.py query sort_employees
Executing: sort_employees
+---+----+------+-------+--------+
| # | id | name | title | dep_id |
+---+----+------+-------+--------+
+---+----+------+-------+--------+
How about combining annotated SQL with an explicitly defined query?
It’s convenient to store the actual SQL statements in the SQL file: we get proper syntax highlighting,
reusability (the same .sql
file can be opened in any SQL IDE), and cleaner Python code.
queries.sql:
-- name: sort_employees
SELECT * FROM employees ORDER BY name;
scenarios.py:
@query
def sort_employees(cursor):
    with cursor:
        cursor.execute(sort_employees.sql)
        name = QueryResult.from_cursor(cursor).first[1]
        print(f"Alphabetically first employee: {name}")
The annotated SQL query sort_employees got matched with the explicitly defined query of the same name in the Python module.
Queries are created implicitly only when they are not defined in the Python module.
In this case, the query was defined, so the annotated SQL statement was attached to the query function as a .sql
attribute
that contains the actual statement.
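The matching idea itself fits in a few lines: parse the annotated SQL file into name/statement pairs, then attach each statement to the same-named function. This is a hypothetical sketch of the mechanism, not the library's code:

```python
import re

SQL_TEXT = """\
-- name: sort_employees
SELECT * FROM employees ORDER BY name;
"""

def parse_sql(text):
    """Split an annotated SQL file into {name: statement} pairs."""
    queries = {}
    for block in re.split(r"^--\s*name:\s*", text, flags=re.M)[1:]:
        header, _, body = block.partition("\n")
        name = header.split(",")[0].strip()
        queries[name] = body.strip().rstrip(";")
    return queries

def sort_employees(cursor):
    cursor.execute(sort_employees.sql)

# Attach each parsed statement to the function of the same name.
functions = {"sort_employees": sort_employees}
for name, sql in parse_sql(SQL_TEXT).items():
    if name in functions:
        functions[name].sql = sql
```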
By executing the rewritten query we’ll get the same result:
dbload --module scenarios.py query sort_employees
Executing: sort_employees
+---+----+------+-------+--------+
| # | id | name | title | dep_id |
+---+----+------+-------+--------+
+---+----+------+-------+--------+
Standalone script¶
You are not required to use the dbload
CLI to invoke queries and scenarios.
Instead, we can turn our scenarios.py into a standalone script.
#!/usr/bin/env python
from dbload import (
    query,
    scenario,
    QueryResult,
    get_context,
    get_connection,
    get_config,
)
from faker import Faker


@query
def sort_employees(cursor):
    with cursor:
        cursor.execute(sort_employees.sql)
        name = QueryResult.from_cursor(cursor).first[1]
        print(f"Alphabetically first employee: {name}")


@scenario
def load_employees(connection):
    fake = Faker()
    for _ in range(10):
        name = fake.name()
        title = fake.job()
        dep_id = dep_name = None
        with connection.cursor() as c:
            dep_id, dep_name = load_employees.get_departments_return_random(c).first
        with connection.cursor() as c:
            load_employees.add_employee(c, name, title, dep_id)
        print(f"New employee: {name} working as '{title}' in '{dep_name}'.")


if __name__ == "__main__":
    # Initialize config: will search and parse config file, env vars, etc.
    config = get_config({"driver": "org.sqlite.JDBC", "verbose": 0})

    # Get global context
    context = get_context()

    # Infuse context with content parsed from annotated SQL file
    context.infuse()

    # Initialize new connection
    connection = get_connection(config)

    # Invoke something...
    with connection.cursor() as cur:
        sort_employees(cur)
We can run it as a normal Python script:
python scenarios.py
2021-06-18 12:41:17.617 | DEBUG | dbload.context:register_query:72 - Registering 'sort_employees' query in the context.
2021-06-18 12:41:17.618 | DEBUG | dbload.context:register_scenario:94 - Registering 'load_employees' scenario in the context.
Installation¶
db-load-generator
can be installed from PyPI.
It needs Java installed and a JDBC driver to connect to the actual database.
For other installation options and requirements, please see Installation.
pip install db-load-generator
Attributions¶
DBLoad is built on top of Python, Java/JDBC, and the excellent JPype library.
Icons made by Nikita Golubev and Smashicons from Flaticon.