Data migration with SQLAlchemy and Alembic

I needed to optimize an unruly table filled with floats, but I also didn’t want to lose my data. Unfortunately, the documentation on the Alembic website doesn’t give any hints on how to do a data migration as opposed to just a schema migration.

Fortunately I was able to run a symbolic debugger against Alembic and figured out that each of the `op.<method>` calls runs as soon as it is called. If you have an add_column call, it adds the column when it executes that method. So that opened the door to data migrations.

One note before I paste the code. You don’t need to specify all of the columns of the source table when it is used in a data migration scope. This makes your code a lot cleaner, as the working model code is specific to the data you plan on using.

Alright, no more babbling, here is the example code.

A while back I downloaded my Google location and history data and ran into these strange lat7 and long7 columns (paraphrasing as I don’t remember their exact names). The data were these large integer numbers that I couldn’t figure out how to decode. Suddenly it became obvious when I noticed all of the latitude fields started with 35 and the longitude started with -104. 35, -104 is within a few hundred miles of where I live. By doing lat7 / 10000000 (1e7 or 10**7) I was able to get floating point GPS coordinates.
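The arithmetic is small enough to sketch in a few lines. The helper names and the sample coordinates below are made up for illustration. Note that in Python's float notation the scale factor is `1e7`; `10e7` is ten times larger.

```python
E7 = 10**7  # 10,000,000, the same value as 1e7

def decode_e7(value: int) -> float:
    """Turn a scaled integer coordinate back into decimal degrees."""
    return value / E7

def encode_e7(degrees: float) -> int:
    """Scale decimal degrees up to an integer, rounding away float noise."""
    return round(degrees * E7)

# Hypothetical values in the same neighborhood as the ones described above.
lat = decode_e7(351234567)     # 35.1234567
lon = decode_e7(-1040000000)   # -104.0
```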

Since then, whenever it comes time to optimize a database schema, I start by figuring out whether I can shift the decimal point out and use integers instead. In sqlite3, a float always takes 8 bytes, while a signed integer only takes as many bytes as its value needs (four for coordinates scaled like this). Throw a million records on and it can add up to 30-40% of wasted disk space.
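A rough way to check the savings on your own data is to load the same values both ways and compare how many pages SQLite allocates. This sketch uses only the standard library `sqlite3` module. The table layout and the generated coordinates are hypothetical, and the exact ratio will depend on what other columns and indexes your real table has.

```python
import sqlite3

def table_bytes(column_type, values):
    """Load rows into a fresh in-memory database and return its size in bytes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE location (lat {column_type}, lon {column_type})")
    conn.executemany("INSERT INTO location VALUES (?, ?)", values)
    conn.commit()
    (pages,) = conn.execute("PRAGMA page_count").fetchone()
    (page_size,) = conn.execute("PRAGMA page_size").fetchone()
    conn.close()
    return pages * page_size

# Made-up coordinates clustered around 35, -104.
as_floats = [(35 + i * 1e-6, -104 - i * 1e-6) for i in range(100_000)]
# The same coordinates with the decimal point shifted out.
as_ints = [(round(lat * 1e7), round(lon * 1e7)) for lat, lon in as_floats]

float_bytes = table_bytes("REAL", as_floats)
int_bytes = table_bytes("INTEGER", as_ints)
saved = 1 - int_bytes / float_bytes  # fraction of space saved by the integers
```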

Anyway, where was I? Since I wanted to get rid of all of the floats and replace the real fields with @hybrid_property and @hybrid_property.expression, I renamed latitude to _latitude, shifted out the decimal point, and used the aforementioned decorators to transform the integers back to floats on demand.
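A model along those lines could look roughly like the sketch below. This is an illustration under assumptions rather than the actual model: the `Location` class, its table name, and the setter are hypothetical, and only latitude is shown.

```python
from sqlalchemy import Column, Integer, create_engine, select
from sqlalchemy.ext.hybrid import hybrid_property
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()
E7 = 10_000_000  # scale factor used to shift the decimal point out

class Location(Base):
    __tablename__ = "location"

    id = Column(Integer, primary_key=True)
    _latitude = Column(Integer, nullable=False)  # stored as degrees * 1e7

    @hybrid_property
    def latitude(self):
        # Python side: integer back to float when read from an instance.
        return self._latitude / E7

    @latitude.setter
    def latitude(self, degrees):
        self._latitude = round(degrees * E7)

    @latitude.expression
    def latitude(cls):
        # SQL side: dividing by a float keeps the result a float in queries.
        return cls._latitude / float(E7)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Location(latitude=35.1234567))
    session.commit()

    # What actually sits in the table is the scaled integer.
    stored = session.execute(select(Location._latitude)).scalar_one()

    # Filtering on the hybrid uses the SQL expression version.
    found = session.execute(
        select(Location).where(Location.latitude > 35.0)
    ).scalars().one()
    restored = found.latitude
```

Code that reads or filters on `Location.latitude` keeps working with floats, while the table itself only ever holds integers.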