Category Archives: howto

Data migration with SQLAlchemy and Alembic

I needed to optimize an unruly table filled with floats but I also didn’t want to lose my data. Unfortunately the documentation on the alembic website doesn’t mention anything or give any hints on how to do a data migration versus just a schema migration.

Fortunately I was able to run a symbolic debugger against alembic and figured out that all of the op.<method>`calls are atomic. If you have an add_column call, it adds the column when it executes that method. So that opened the door to data migrations.

One note before I pasted the code. You don’t need to specify all of the columns of the source table when used in a data migration scope. This makes your code a lot cleaner as the working model code is specific to what data you plan on using.

Alright, no more babbling, here is the example code.

A while back I downloaded my google location and history data and ran into these strange lat7 and long7 columns (paraphrasing as I don’t remember their exact names). The data were these large integer numbers that I couldn’t figure out how to decode. Suddenly it became obvious when I noticed all of the latitude fields started with 35 and the longitude started with -104. 35, -104 is approximately a few hundred miles from where I live. By doing lat7 / 10000000 (10e7 or 10**7) I was able to get floating point GPS coordinates.

Since then, when it comes time to optimize database schemas I’ve always started with figuring out if I can shift the percentage out and use integers instead. If using sqlite3, a Float is actually a varchar and that’s huge in comparison to using a byte or two of signed integers. Throw a million records on and it can get up to 30-40% of wasted diskspace.

Anyway where was I. Since I wanted to get rid of all of the floats and replace the real fields with @hybrid_propertyand @hybrid_property.expression I renamed latitude to _latitude, shifted out the percent, and used the aforementioned decorators to transform the integers back to floats on demand.

Non-blocking python subprocess

I am working on a pet project to compress a terabyte of video into a slimmer format. While I have been able to automate working with ffmpeg, I didn’t like the fact that I couldn’t follow along with the subprocess running ffmpeg.

I tried a few different ideas of how to watch ffmpeg but also avoid the script from blocking because I wanted to be able to time and monitor it’s progress

import subprocess

process = subprocess.pOpen

stdout, stderr = process communicate blocks until the process is finished

subprocess.stdout.readline() and subprocess.stderr.readline() will both block until there is sufficient data. In ffmpeg’s case there is never stdout output so it will block indefinitely.

https://gist.github.com/devdave/9b8553d63e24ef19eea7e56f7cb95c78

By using threading Queue and constantly polling the process, I can watch the output as fast as it can come in but not worry about the main process blocking, just the threads.

A further improvement on the idea would be to have two threads (for stdout and stderr respectively) with the queue items put with sentinels like queue.put((STDERR, line_from_stderr)) and a sentinel for STDOUT.

To use - 

r = Runner(["some_long_running_process", "-arg1", "arg1 value"])

for stdout, stderr in r.start():
    print("STDOUT", stdout)
    print("STDERR", stderr)

WSL GNU/Debian Python 3.6.5

Relevant date: 2018 May 30th
This works today but there is no guarantee this is the right way in a month and especially a year later.

After install WSL and then Debian linux, the first thing I did was something recommended from the reviews:

sudo apt update && sudo apt full-upgrade && sudo apt install aptitude
sudo aptitude install ~pstandard ~prequired ~pimportant -F%

Googling led to some goofy answers like using a PPA (which isn’t a bad thing) but you can screw up your environment if not careful

So….. time to go old school and build from source.

sudo apt-get install -y make build-essential libssl-dev zlib1g-dev   
sudo apt-get install -y libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm 
sudo apt-get install -y libncurses5-dev  libncursesw5-dev xz-utils tk-dev

Those are the basic requirements BUT I am fairly certain somethings like lxml (for Beautiful Soup) will be missing. Not an issue if you don’t intend to use lxml AND they are not required for building the python 3.6 interpreter.

wget https://www.python.org/ftp/python/3.6.5/Python-3.6.5.tgz
tar xvf Python-3.6.5.tgz
cd Python-3.6.5
./configure --enable-optimizations --with-ensurepip=install
make
sudo make altinstall 

NOTE! you can do

make -j8

for a faster build time BUT ideally using

-j# where the # is the number of cores on your CPU 

is better

NOTE! if you have a faster computer, the actual compile process is fairly quick but than it runs a post build unit testing suite that is absolutely massive because of the `–enable-optimizations` flag. If you need speed for a commercial environment AND you are 100% certain your environment across all machines is the same, just retar up the source directory with the pre-built binaries, send it to each machine, and then run `make altinstall`.

From there you can run

user@machine:~/$python3.6.5

some people recommend fiddling with

alternatives

so you can just do

user@machine:~/$python3.6.5

but generally I use virtualenv and explicitly make a new environment with my version of python to keep me sane.

Flask CRUD with sqlalchemy and jinja2 contextfilters

Quick disclaimer, the Flask CRUD thing is not public domain yet and is very volatile.

The project is here
https://github.com/devdave/wfmastery/tree/revamp_1/wfmastery
And the outline for the crud thing is in this commit https://github.com/devdave/wfmastery/commit/e249895ddc53c0696f59d3def5718e76855af5b9

https://github.com/devdave/wfmastery/blob/revamp_1/wfmastery/crud.py
https://github.com/devdave/wfmastery/blob/revamp_1/wfmastery/views.py
https://github.com/devdave/wfmastery/blob/revamp_1/wfmastery/templates/equipment_list.j2.html

First is how the crud is currently constructed

class Equipment(CrudAPI):

    def populate(self):

        self.record_cls = db.Equipment
        self.identity = "equipment"

        self.template_form = "equipment_form.j2.html"
        self.template_list = "equipment_list.j2.html"

        self._listColumn("id")
        self._listColumn("hidden")
        self._listColumn("name", magic_field="magic-string")
        self._listColumn("pretty_name", magic_field="magic-string")

        self._addRelationship("category", "name", magic_field="magic-filter")
        self._addRelationship("subcategory", "name", magic_field="magic-filter")

both vars “template_form” and “template_list” are going to be preset once I am certain that the templates can stand on their own with the context vars provided. The “magic-” params and their use are very much magic (eg really toxic) and would recommend ignoring them.

From there the CrudAPI takes over. Skipping ahead to how this relates to context filters. I had this tag mess here in the template

-{%-      for column_name in origin.list_columns -%}
 -{%-          if column_name in origin.magic_columns -%}
 -        {{ cell("", column_name|title, classes=origin.magic_columns[column_name]) -}}
 -{%-          else -%}
 -        {{ cell("", column_name|title) -}}
 -{%          endif %}
 -{%-      endfor %}

and was really not happy with it. So I dived into Flask and Jinja2’s documentation and code to figure out if I could apply Python code inline.

The answer is yes via jinja2’s contextfilters which are not exposed to Flask but can still be used.

@App.template_filter("render_header")
def render_header(context, column_name, value="", **kwargs):
    result = ""
    if column_name in context['origin'].magic_columns:
        result = context['cell'](value, column_name.capitalize(), classes=context['origin'].magic_columns[column_name])
    else:
        result = context['cell'](value, column_name.capitalize())


    return result

render_header.contextfilter=True

The trick to going from filter to contextfilter is just applying `my_func.contextfilter = True` outside of your functions scope. From there you have access to almost everything (if not everything). The var “origin” is the CrudAPI’s instance passed to the template.

This has opened a lot more opportunities to do clean up. Taking


{% macro data_attributes(data_map, prefix="data-") -%}

    {%- for name, value in data_map.items() -%}
    {{" "}}{{prefix}}{{name}}="{{value}}"
    {%- endfor -%}
{%- endmacro %}
  

{% macro cell(name, value, classes=None, data_attrs={}) %}
        
      {{- caller() if caller else value -}}
  {%- endmacro -%}

and condensing it down to

{% macro cell(name, value, classes=None, data_attrs={}) %}
        
    {{- caller() if caller else value -}}
{%- endmacro -%}

via a simple non-context filter

@App.template_filter("dict2attrs")
def dict_to_attributes(attributes, prefix=None):
    results = []
    name2dash = lambda *x: "-".join(x)
    format_str = "%s-{}=\"{}\"" % prefix if prefix else "{}=\"{}\""

    for key, value in attributes.items():
        results.append(format_str.format(key, value))

    #TODO disable autoescape
    return " ".join(results)

Just note that at the moment output is still managed by Jinja’s autoescape and I’d rather not shut that off so calls MUST be suffixed with “|safe” as used above.

As for the Crud API, I feel like that is coming along nicely.

Flask list routes (rake equivalent).

While working on a pet project I really wanted a rails rake equivalent for Flask.

Googling led to http://flask.pocoo.org/snippets/117/ which gave me enough direction to figure out how to make that work with Python 3.5 and Flask 0.12.

The biggest problem I had with that snippet is that it didn’t account for URL variable rules.

/foo/<int:bar>

as it would blow up because werkzeug & Flask sanitize inputs to ensure they match the expected type.

I started doing some seriously crazy stuff like monkey patching the rules temporarily to make
ALL converters run through a String/Unicode converter. It’s at this point that I noticed in dbgp (symbolic debugger) that it was naturally converting the rules to strings.

@App.cli.command("list_routes")
def list_routes():
    """
        Roll through Flask's URL rules and print them out
        Thank you to Jonathan Tushman
            And Thank you to Roger Pence
            Sourced http://flask.pocoo.org/snippets/117/ "Helper to list routes (like Rail's rake routes)"

        Note that a lot has possibly changed since that snippet and rule classes have a __str__
            which greatly simplifies all of this
    """


    format_str = lambda *x: "{:30s} {:40s} {}".format(*x)#pylint: disable=W0108
    clean_map = defaultdict(list)


    for rule in App.url_map.iter_rules():
        methods = ",".join(rule.methods)
        clean_map[rule.endpoint].append((methods, str(rule),))

    print(format_str("View handler", "HTTP METHODS", "URL RULE"))
    print("-"*80)
    for endpoint in sorted(clean_map.keys()):
        for rule, methods in sorted(clean_map[endpoint], key=lambda x: x[1]):
            print(format_str(endpoint, methods, rule))

Example output

"HOSTED@LOCAL:5000"
View handler                   HTTP METHODS                             URL RULE
--------------------------------------------------------------------------------
Equipment                      /equipment/                              OPTIONS,POST
Equipment                      /equipment/                              HEAD,OPTIONS,GET
Equipment                      /equipment/               HEAD,OPTIONS,GET,PUT,DELETE
index                          /                                        HEAD,OPTIONS,GET
static                         /static/                  HEAD,OPTIONS,GET

Follow up on Google API with service accounts and ruby

refer to the previous post on this subject

Using analytics.data.qa.get doesn’t work as desired, so instead you need to do something like:

visitCount = client.execute(analytics.to_h["analytics.data.ga.get"], {
                              :ids=>"ga:" + YOUR_PROFILE_NUMBER,
                              "start-date" => "2012-06-12",
                              "end-date" => "2012-07-01",
                              :metrics => "ga:visits"}
                            )

Note the part of using to_h[‘analytics.data.ga.get”] to get the right signature to pass onto the APIClient execute method. That will work and works well. To make it easier to formulate queries against GA, I’d recommend using Google’s query explorer tool here http://ga-dev-tools.appspot.com/explorer/

Service accounts with Ruby and the google api client

  1. Goto https://code.google.com/apis/console
  2. Create a new project
  3. Under services, select Google Analytics
  4. Goto API Access of a new project
  5. generate a new oAuth credential, selecting “service account”.
  6. Download your p12 cert file

 

In Google analytics, use the service account email address and assign it as a new administrator to the desired GA profile.

On ALL servers destined to use the service account to GA, install and syncronized with ntp. ┬áIf your even 500ms off, you’re going to have a bad time with “Grant_invalid” messages.

In a testbed project, make 2 files: Gemfile & testbed1.rb ( you’re going to create a lot of these testbed files on your own ).

Gemfile

source "http://rubygems.org"
gem 'google-api-client'
gem 'pry'

testbed1.rb

require 'pry'
require 'rubygems'
require 'google/api_client'

keyFile = "SOME_KEY_GOES_HERE-privatekey.p12"

cID = "YOUR_SOOPA_SECRET_SA"

scope = 'https://www.googleapis.com/auth/analytics.readonly'
email = 'YOUR_EMAIL_GOES_HERE@developer.gserviceaccount.com'


#Normally it would be bad to put the passphrase here, but aftet talking to several dozen devs, everyone's pass phrase is not a secret
key = Google::APIClient::PKCS12.load_key(keyFile, "notasecret")

asserter = Google::APIClient::JWTAsserter.new(
   email,
   scope,
   key)

puts asserter.authorize()

binding.pry

If all goes well, the JWTAsserter instance will get a valid token and no exceptions/errors will be thrown. If you’ve followed all of the steps listed earlier and things are still breaking, troubleshooting paths are: Verify your machine time is correct, verify your service account email address is bound to google analytics, and lastly you might need to wait until Google platform catches up with the changes. For myself, it took 9 hours until my account finally authenticated through. Google has potentially a million or more servers organized into cells, the fact that they seem to cooperate well doesn’t mean they’re perfectly in sync at all times.

Otherwise if all else fails, I recommend stalking this guy Nick https://groups.google.com/d/msg/google-analytics-data-export-api/maa_fyjD2cM/sT8tDDh0wNsJ – It seems like he’s a Google employee on the service account dev/implementation team and generally stuff gets fixed if he says it’s getting fixed.

As I run into any more troubles, I will update and add more notes.

Ubuntu 11.10 Google native client dependancies

sudo apt-get install ia32-libs
sudo apt-get install gcc-multilib g++-multilib libsdl1.2-dev texinfo libcrypto++-dev libssl-dev lib32ncurses5-dev m4 libelf-dev

This is about 300Mb of stuff, some of it can be culled but I didn’t feel like messing around with that.

Allows for pepper 19 examples to be compiled successfully.

SSH SOCKS proxy and Amazon EC2 to the rescue

I’m currently somewhere in the process of building a hadoop clouster in EC2 for one of my clients and one of the most important parts for keeping my sanity is the ability to access all of the node’s web interfaces ( jobtracker, namenode, tasktrackers’, datanodes, etc ). If you aren’t abs(crazy) all of these machines are jailed inside a locked down security group, a micro walled garden.

SSH -D 8080 someMachine.amazon-publicDNS.com

That will setup a socks between your machine and some instance that should be in the same SG as the hadoop cluster… now unless you are a saddist and like to write dozens of host file entries, the SOCKS proxy is useless.

But wait! Proxy Auto-configuration to the rescue! All you really need to get started is here at Wikipedia ( http://en.wikipedia.org/wiki/Proxy_auto-config ) but to be fair a dirt simple proxy might look like:

hadoop_cluster.pac
function FindProxyForURL(url, host) {
if (shExpMatch(host, "*.secret.squirrel.com")) {
return "SOCKS5 127.0.0.1:8080";
}
if (shExpMatch(host, "*.internal")) {
return "SOCKS5 127.0.0.1:8080";
}

return "DIRECT";
}

Save this to your harddrive then find the correct “file:///path/2/hadoop_cluster.pac” from there go into your browsers proxy configuration dialog window and paste that URL into the Proxy Auto-configuration box. After that, going to http://ip-1-2-3-4.amazon.internal in a web browser will automatically go through the SSH proxy into Amazon EC2 cloud space, resolve against Amazon DNS servers, and voila you’re connected.

NOTE: Windows users

It shouldn’t be a surprise that Microsoft has partially fucked up the beauty that is the PAC. Fortunately, they provide directions for resolving the issue here ( http://support.microsoft.com/kb/271361 ).

tl;dwrite – Microsoft’s network stack caches the results of the PAC script instead of checking it for every request. If your proxy goes down or you edit the PAC file, those changes can take sometime to actually come into play. Fortunately Firefox has a nifty “reload” button on their dialog box, but Microsoft Internet Explorer and the default Chrome for windows trust Microsofts netstack.

HTML5 Canvas windowing proof of concept/prototype

Code is here ( https://gist.github.com/a17216d5b1db068ab41b )

Taking a break from a client project today and decided to try something real quick out. For one of my unpublished pet projects ( a near real time web based MUD ) I wanted a little map window to give visual feedback to the user on where exactly they were. Fortunately the Canvas API makes this stupid easy.

       var oX = this.player.x * (this.mMX / this.rMX);
    var oY = this.player.y * (this.mMX / this.rMY);
    
    var halfX = ( this.wMX / 2 );
    var halfY = ( this.wMY / 2 );
    
    var startX = Math.max(0, oX - halfX);
    var startY = Math.max(0, oY - halfY);
    
    var endX = Math.max(startX + this.wMX);
    var endY = Math.max(startY + this.wMY);
    
    var buffer = this.ctx.getImageData(startX, startY, endX, endY);
    this.window.putImageData(buffer, 0, 0);

First block takes the player or avatars position then scales up their grid position to the map canvas position.

Second block gets middle point of the window canvas, in this case if the window is 150×150, the origin is (75,75)

Now given the player’s position in the canvas map it finds the upper left and lower right corners of the map that is going to be copied from the map canvas into the window canvas. The Math.Max’s are there to prevent the upper left corner from being an impossible point (123, -50 ).

The end*’s use of Math.Max is actually crap on my part and isn’t needed.