Data is crucial for all applications. When an application repeatedly fetches significant amounts of data from the database, faster data load times directly improve performance.
This post compares SQLAlchemy select statements, the SQLAlchemy ORM, Psycopg2, and psql for measuring query latency. Jupyter Notebook's %timeit magic times the Python tools, and psql provides the baseline for the lowest achievable time.
Table Structure
annotation=> \d data;
Table "public.data"
 Column |   Type    |                     Modifiers
--------+-----------+---------------------------------------------------
 id     | integer   | not null default nextval('data_id_seq'::regclass)
 value  | integer   |
 label  | integer   |
 x      | integer[] |
 y      | integer[] |
Indexes:
"data_pkey" PRIMARY KEY, btree (id)
"ix_data_label" btree (label)
annotation=> select count(*) from data;
count
---------
1050475
(1 row)
SQLAlchemy ORM Declaration
from sqlalchemy import Column, Integer
from sqlalchemy.dialects import postgresql
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Data(Base):
    __tablename__ = 'data'

    id = Column(Integer, primary_key=True)
    value = Column(Integer)
    # 0 => Training, 1 => test
    label = Column(Integer, default=0, index=True)
    x = Column(postgresql.ARRAY(Integer))
    y = Column(postgresql.ARRAY(Integer))
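The benchmark functions below call a create_session helper that the post does not show. A minimal sketch of what it might look like, assuming a local Postgres database (the connection string mirrors the one passed to psycopg2 later):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Hypothetical setup; the post does not show this part.
engine = create_engine("postgresql://user:password@localhost/db")
Session = sessionmaker(bind=engine)

def create_session():
    return Session()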
SQLAlchemy ORM
def sa_orm(limit=20):
    sess = create_session()
    try:
        return sess.query(Data.value, Data.label).limit(limit).all()
    finally:
        sess.close()
Time Taken
%timeit sa_orm(1)
28.9 ms ± 4.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The time taken, in milliseconds, to fetch 1, 20, 100, 1000, and 10000 rows:
{1: 28.9, 20: 26.6, 100: 26.2, 1000: 29.5, 10000: 70.2}
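The per-limit dicts throughout this post follow the same pattern. The original numbers came from the notebook's %timeit magic; here is a minimal sketch of how similar dicts could be collected programmatically with the standard timeit module (the helper name time_queries is illustrative):

import timeit

def time_queries(fn, limits=(1, 20, 100, 1000, 10000), runs=10):
    """Return {limit: mean time in milliseconds} for fn(limit)."""
    results = {}
    for limit in limits:
        total = timeit.timeit(lambda: fn(limit), number=runs)
        results[limit] = round(total / runs * 1000, 1)
    return results

# Usage (numbers will vary): time_queries(sa_orm)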
SQLAlchemy Select statement
from sqlalchemy import select

def sa_select(limit=20):
    tbl = Data.__table__
    sess = create_session()
    try:
        stmt = select([tbl.c.value, tbl.c.label]).limit(limit)
        return sess.execute(stmt).fetchall()
    finally:
        sess.close()
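Note that passing a list to select([...]) is the SQLAlchemy 1.x style. In SQLAlchemy 1.4 and 2.0 the columns are positional arguments, so the equivalent statement would be:

# SQLAlchemy 1.4+/2.0 style: columns are passed positionally
stmt = select(tbl.c.value, tbl.c.label).limit(limit)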
Time Taken
The time taken, in milliseconds, to fetch 1, 20, 100, 1000, and 10000 rows:
{1: 24.7, 20: 24.5, 100: 24.9, 1000: 26.8, 10000: 39.6}
Psycopg2
Psycopg2 is one of the PostgreSQL drivers for Python. You can use psycopg2 together with SQLAlchemy or as a standalone tool.
import psycopg2

def psycopg2_select(limit=20):
    conn = psycopg2.connect("dbname=db user=user password=password host=localhost")
    cur = conn.cursor()
    try:
        # Note: in production, use a parameterized query instead of
        # interpolating the value into the SQL string.
        stmt = f"select value, label from data limit {limit}"
        cur.execute(stmt)
        return cur.fetchall()
    finally:
        cur.close()
        conn.close()
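As the comment in the code notes, interpolating values into SQL strings is unsafe outside a controlled benchmark. The parameterized equivalent, using psycopg2's standard placeholder syntax, would be:

# Let the driver bind the value instead of formatting it into the string.
cur.execute("select value, label from data limit %s", (limit,))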
Psycopg2 Time Taken
The time taken, in milliseconds, to fetch 1, 20, 100, 1000, and 10000 rows:
{1: 17.0, 20: 16.9, 100: 17.3, 1000: 18.1, 10000: 30.1}
Psql
annotation=> explain (analyze, timing off) select label, value from data limit 10000;
QUERY PLAN
------------------------------------------------------------------------------------------------
Limit (cost=0.00..322.22 rows=10000 width=8) (actual rows=10000 loops=1)
-> Seq Scan on data (cost=0.00..33860.40 rows=1050840 width=8) (actual rows=10000 loops=1)
Total runtime: 7.654 ms
(3 rows)
Time: 7.654 ms
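The Time line at the bottom is psql's client-side measurement, which is enabled with the \timing meta-command.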
Psql Time Taken
The time taken, in milliseconds, to fetch 1, 20, 100, 1000, and 10000 rows:
{1: 0.9, 20: 0.463, 100: 1.029, 1000: 1.643, 10000: 7.654}
All timings
{'psycopg2': {1: 17.0, 20: 16.9, 100: 17.3, 1000: 18.1, 10000: 30.1},
'sa_orm': {1: 28.9, 20: 26.6, 100: 26.2, 1000: 29.5, 10000: 70.2},
'sa_select': {1: 24.7, 20: 24.5, 100: 24.9, 1000: 26.8, 10000: 39.6},
'sql_select': {1: 0.9, 20: 0.463, 100: 1.029, 1000: 1.643, 10000: 7.654}}
Chart of all the timings
The lower the bar, the better the performance.
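The original chart image is not reproduced here. A minimal matplotlib sketch that draws a similar grouped bar chart, assuming the all-timings dict above is bound to a variable named timings:

import matplotlib.pyplot as plt

limits = [1, 20, 100, 1000, 10000]
tools = ['psycopg2', 'sa_orm', 'sa_select', 'sql_select']
width = 0.2

for i, tool in enumerate(tools):
    values = [timings[tool][limit] for limit in limits]
    positions = [j + i * width for j in range(len(limits))]
    plt.bar(positions, values, width=width, label=tool)

plt.xticks([j + 1.5 * width for j in range(len(limits))], [str(l) for l in limits])
plt.xlabel('rows fetched')
plt.ylabel('time (ms)')
plt.legend()
plt.show()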
As you can see, SQLAlchemy ORM is the slowest of the Python tools and psycopg2 is the fastest. The SQLAlchemy select statement comes close to psycopg2's performance and offers a sweet spot: high-level data handling without writing raw SQL queries.
See also
- Python Typing Koans
- Model Field - Django ORM Working - Part 2
- Structure - Django ORM Working - Part 1
- jut - render jupyter notebook in the terminal
- Five reasons to use Py.test
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.