Want to know when someone opens your emails? In the era of iMessage and WhatsApp, where the concept of “left on read” has entered the common lexicon, tracking the “when” of your message being read is near and dear to our hearts. Email clients, such as Outlook, have offered “read receipts” (technically MDN – Message Disposition Notifications) since as far back as 1998 (see the original RFC). Read receipts rely on the recipient’s agent (email client) supporting the feature and opting in.
An alternative that has existed for a while but has recently become more popular is the “tracker pixel” (known less affectionately as the “spy pixel”). This technique involves putting a small (think 1x1) pixel in the HTML body of the email1. The user’s agent will in general2 load the image when the email message is opened. The server delivering the image is able to record the image access along with the User Agent string and IP of the requesting client3.
I set out to build an email open tracking system that employs a tracking-pixel approach. I assume the sender uses Gmail / Google Workspace (fka G Suite) and accesses the site through Chrome (later I expand to the mobile apps). No such requirements are imposed on the recipient. There are of course already a number of extensions in the Chrome Web Store that offer this functionality, such as Mailtrack and Yesware; however, both as a challenge to myself and because of privacy concerns, I decided to build my own.
The system boils down to 3 parts:

1. Injecting the tracking pixel into outgoing messages (and giving the sender control over when to track)
2. Serving the pixel and recording each access
3. Surfacing the open events back to the sender

I built parts (1) and (3) using a Chrome extension that modifies the UI of the Gmail webpage. Part (2) was built using a server that served up a minimal pixel that could be tied to the email it was inserted into, while at the same time capturing information on the client. I later re-built parts (1) and (3) using a Google Apps Script so that I could send and see the email open tracking from within the Gmail mobile app.
The repos for the components can be found here:
The Chrome extension needs to (1) modify, as needed, the body of email messages as sent, (2) provide a UI for the sender to control when tracking occurs, (3) display when previously tracked messages are opened, and (4) prevent self-tracks (i.e. we want to exclude our own opens of an email from being counted as a view).
I leveraged this boilerplate project to get started developing my Chrome extension using TypeScript, React, and Webpack. After I had already been building for a while, I realized the project is no longer maintained4 and there are probably better starters out there now (e.g. 1 or 2). Fortunately the project had been updated to use Manifest V3, which, while more limited in functionality than V2, will be needed as V2 is deprecated in 2023.
I then found gmail-js, which is a library purpose built for browser extensions that look to interface with the Gmail page. It offers functions to modify the UI (e.g. inserting buttons on the compose window and modifying the toolbars) as well as hooks for various events (e.g. a message being sent). The library is under active development, well documented, and supports TypeScript. It also has a great boilerplate project to help get started. When looking for other libraries to help with building a browser extension for Gmail, I also found InboxSDK, a commercial offering similar to gmail-js, but with more of a focus on extending the Gmail UI. Ultimately gmail-js offered enough functionality for extending the UI that InboxSDK was unnecessary.
There are some limitations in gmail-js that I had to work around. The first limitation, stated right in the README, is that “Gmail.js does not work as a content-script.” From the Chrome docs: “Content scripts are files that run in the context of web pages.” Basically, content scripts are the standard (and simplest) way an extension injects JavaScript into a webpage. And unfortunately they don’t work with gmail-js5. An example workaround is shown in the boilerplate project and involves a content script that inserts a script tag to then download the actual JS code.
The next limitation that I ran into was that even though there is a “before send” hook for email messages, there is no way from within that hook to actually modify the body of the message. From the docs, it seems like this functionality used to exist. Unfortunately, due to changes to the Gmail page, there is no longer a way to modify the message as it is sent. The workaround here is to modify the body of the email message before the Gmail send button is “clicked.” This can be done either by eagerly inserting the HTML into the body of the message when the compose window is first opened or by “proxying” the button click with a new button that runs the injection code and then calls the original button (using the `send` method). The proxy button (created using `add_compose_button`) may be a totally new button or it might just be made to obscure the original button. I opted for the latter, which has the benefit of providing a UI for the user to choose to track vs. not track.
I used a similar approach to extend the “Scheduled send” button:
The flow for sending a tracked message now looks like:

1. A unique tracker (with an `id`) is injected into the email body (as additional HTML). We may also want to remove / modify old trackers (i.e. by removing HTML).
2. The `id` for the tracker is sent to our server.

The last limitation is that there is no easy way to modify the contents of the prior messages in the compose window. In other words, while I can insert HTML into the current message, I cannot access the prior portion of the email thread (e.g. to remove old trackers). The hack here would be to trigger the “Show trimmed content” button so that the full contents of the thread are downloaded.
Now that we are able to successfully track sent messages using our Chrome extension, I needed a way for the sender of the email to see that their messages had been opened. I first added a UI element to the Inbox view that shows when the last email was opened:
and when clicked opens a modal with a list of the most recent message opens. Clicking on any of them directly opens the relevant thread:
On the thread view, I also added a button to the toolbar:
which when clicked also opens a modal showing all of the views of the current thread along with additional information:
Another challenge I faced was being able to use React for developing my UI. Unfortunately gmail-js uses jQuery (and has an imperative API), and the Gmail page itself likes to kill, and then rebuild, the toolbar whenever it wants. I was able to work around this with a combination of `MutationObserver` and `unmountComponentAtNode` (see code). I was then able to use the React Developer Tools to confirm that I was not leaking React Elements.
The Chrome extension is also responsible for making sure that the sender doesn’t trigger their own View event each time they open their sent email message. There are two ways to handle this: one is to prevent the image from being loaded at all by the sender’s agent; the other is to add logic on the backend to detect and then exclude these loads. I chose the former because, due to privacy limitations, identifying the sender’s loads of the image proved impractical. Ideally, if both the sender and recipient are users of my email tracking system, I do want the view event to trigger when the recipient opens the email message, even if they have the extension installed.
The simplest way to block the loads, and the way the other extensions I looked at6 accomplished this, was using the `webRequestBlocking` permission as part of the `webRequest` API. Unfortunately this functionality was removed in Manifest V3 and replaced with `declarativeNetRequest`.
Instead of figuring out how to use the new API, I decided to use injected CSS to prevent the image from being downloaded. I also think the rules for `declarativeNetRequest` are static and thus cannot be customized based upon the logged-in user (i.e. the blocking would prevent any images used by this tracking system from loading, not just those sent by the specific sender). Instead I am able to generate a CSS rule dynamically that excludes only images sent by the sender. Even though CSS doesn’t support regex matching in selectors, I am able to use a string match to accomplish this. I ended up using a `div` tag with a `background-image` rather than a standard `img` tag, as I could not get the CSS to prevent the browser from downloading the image file when using the `img` tag.
The injected CSS rule uses an attribute substring match against the tracker image URL to hide any pixels sent by the current sender.
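A sketch of what the injected rule might look like (the `/t/<sender-id>/` path fragment and the inline-style matching are illustrative assumptions):

```css
/* Hide any tracker whose inline background-image URL contains this
   sender's ID, so the sender's own opens are never recorded.
   [style*="..."] is a plain substring match -- no regex needed. */
div[style*="/t/sender-123/"] {
  background-image: none !important;
}
```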
Stay tuned for Part 2, where I go into more depth on the Backend Server.
The TL;DR is that the server has a few key endpoints:

1. The tracking pixel endpoint (`/t/<sender-id>/<tracker-id>/image.gif`), returning the smallest possible valid GIF image and returning as quickly as possible.
2. An ingestion path that captures the requester’s IP and User Agent string and then runs through a series of rules and calls to third-party APIs to gather information on the requester.
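For illustration, the “smallest possible valid GIF” can be precomputed as a 43-byte transparent 1x1 GIF89a; a sketch of the hot path (the handler name and return shape are my own, not the actual server’s):

```python
# A hand-assembled 1x1 transparent GIF89a -- 43 bytes, about the
# smallest widely-accepted valid GIF. Precomputing it lets the
# endpoint respond as quickly as possible.
PIXEL_GIF = (
    b"GIF89a"                            # header
    b"\x01\x00\x01\x00"                  # 1x1 logical screen
    b"\x80\x00\x00"                      # 2-entry global color table
    b"\x00\x00\x00\xff\xff\xff"          # black, white
    b"\x21\xf9\x04\x01\x00\x00\x00\x00"  # graphics control (transparent)
    b"\x2c\x00\x00\x00\x00\x01\x00\x01\x00\x00"  # image descriptor
    b"\x02\x02\x44\x01\x00"              # one clear-coded pixel of LZW data
    b"\x3b"                              # trailer
)

def serve_pixel(sender_id: str, tracker_id: str):
    """Hypothetical handler for /t/<sender-id>/<tracker-id>/image.gif."""
    # Recording the view event would be queued here, off the hot path.
    headers = {"Content-Type": "image/gif", "Cache-Control": "no-store"}
    return 200, headers, PIXEL_GIF
```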
Stay tuned for subsequent posts!
This same technique is also used on webpages, where it is known as a “web beacon”.↩
This assumes the email client supports HTML and the user has not disabled loading images (e.g. on Gmail). There are those who lobby for using plaintext email.↩
Gmail and other email services proxy the image so the actual IP address and User Agent are not revealed to the server.↩
A number of the dependencies are out of date, leading to various security warnings on GitHub. Further, `npm dedupe` should be run to simplify and speed up the install.↩
The limitation is due to how gmail-js subscribes to XHR events and limitations due to security-context / namespaces (see discussion).↩
I found CRX Extractor super helpful in getting access to the source for other extensions in the Chrome Web Store.↩
`sqlalchemy-airtable`.
Airtable is a no-code / low-code SaaS tool, which Wikipedia describes as:
a spreadsheet-database hybrid, with the features of a database but applied to a spreadsheet…Users can create a database, set up column types, add records, link tables to one another
Airtable empowers non-technical (or lazy, technical) users to rapidly build various CRUD applications such as CRMs, project trackers, product catalogs, etc. The Airtable team has done a great job publishing a collection of templates to help users get started. Airtable offers a lot of great UI elements for CRUD operations, including a very powerful grid view, detail (record-level) views, and “forms”, which are great for data entry. There are a whole bunch more view types documented here, including Gantt, Kanban, and calendar.
That being said, performing data visualization and data analysis is somewhat limited within Airtable itself. Over time, the Airtable team has rolled out additional functionality, from the summary bar, which offers column-by-column summary stats, to apps (fka blocks), such as the pivot table, to now more comprehensive data visualization apps. Even with the app / dashboard concept and the associated marketplace, there is still a gap between what an experienced business or data analyst might be looking for and what is offered. Further, Airtable might be just one part of a larger data storage system used within the organization, such that even if a whole bunch more functionality were added, there would still be a desire to perform analysis elsewhere. For example, while you might start by building a v1 of your CRM and ticketing systems in Airtable, over time you may find yourself migrating certain systems to purpose-built tools such as Salesforce or Zendesk. Ideally visualizations and analysis can be performed in one place that isn’t vertically integrated with the data solution (not to mention in general avoiding vendor lock-in).
Recognizing the desire of its users to use existing BI tools, the Airtable team has written a helpful blog piece, “How to connect Airtable to Tableau, Google Data Studio, and Power BI”. The prebuilt integrations discussed in the post are helpful if you use one of those BI tools, although, in the case of Tableau and Data Studio, they actually require an expensive, enterprise Airtable account.
Railsware built Coupler.io (fka Airtable Importer) to make synchronizing data from Airtable into Google Sheets easy. This opens up using Google Sheets as a BI tool or simple data warehouse. There are a few other tools that attempt to synchronize from Airtable to traditional databases, including Sequin (for CRM data) and CData, as well as others, like BaseQL, that expose Airtable with a GraphQL API.
I do some of my analysis in Pandas and some in Superset, and I wanted a way to connect either of these to Airtable. I am also “lazy” and looking for a solution that doesn’t involve setting up a data warehouse and then running an ETL / synchronization tool.
I figured I could achieve my objective by building a Shillelagh Adapter which would in effect create a SQLAlchemy Dialect exposing Airtable as a database and allow any Python tool leveraging SQLAlchemy or expecting a DB-API driver to connect.
Fortunately Airtable offers a rich RESTful API for both querying data as well as updating and inserting data. There are a number of client libraries for different languages wrapping this API. The library for JavaScript is “official” and is maintained by Airtable itself. There is a well-maintained, unofficial Python library with blocking I/O and a less well maintained Python library based on asyncio (non-blocking). For the purposes of building the DB Dialect, I stick with blocking I/O, as that library is better maintained and the underlying libraries used (Shillelagh → APSW) won’t be able to take advantage of non-blocking I/O.
In order to expose an Airtable Base as a database, we need to know the schema. The “schema” in practice is the list of table names and, for each table, the list of column names and each column’s data type. Fortunately Airtable offers a Metadata API that provides this info. Unfortunately, the feature is currently invite-only / wait-listed; I applied but have not been lucky enough to be accepted. I document a number of different workarounds for obtaining the metadata here. They roughly boil down to a) hardcoding the table list, pulling down some sample data from each table, and then guessing the field types, b) allowing the user to pass in a schema definition (which can be scraped offline from the API docs page), or c) hitting an endpoint that exposes the metadata (e.g. a shared view or base). Option a) is the most seamless / low-friction for the user, whereas c) is the most reliable. I ended up starting with option a) but offering partial support for option c).
There are a few gotchas with the guessing approach, the most annoying of which is that the Airtable API entirely omits fields within a record if the value is null or “empty”. This is particularly problematic when peeking at the first `N` rows in cases in which new fields were added later on (and currently there is no easy way to pull the last `N` rows). Other fields may be polymorphic, such as those coming from a formula. We err on the side of being conservative, for example assuming that even if we see only integers the field may at some point contain a float.
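A sketch of the kind of conservative guessing involved (the function and type names are my own, not the library’s):

```python
def guess_field_type(samples: list) -> str:
    """Guess a SQL type from sampled Airtable values, conservatively."""
    # Fields can be entirely absent when null/empty, so we may see nothing.
    present = [v for v in samples if v is not None]
    if not present:
        return "TEXT"  # no evidence; fall back to the most permissive type
    if all(isinstance(v, bool) for v in present):
        return "BOOLEAN"
    # Even if every sample is an int, a formula or number field could
    # later produce a float -- so never narrow all the way to INTEGER.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "REAL"
    return "TEXT"
```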
The Airtable “backing store” supports sorting, filtering, and page size limits2. Shillelagh by way of APSW by way of SQLite virtual tables is able to opt-in to these features to reduce the amount of work that has to be done in memory by SQLite and instead push the sorting and filtering work to the backing store. We just need to translate from what Shillelagh gives us to what the API expects.
Sorting is relatively easy, as it just involves passing the field names and the desired sort order to the API. Translating filters involves building an Airtable formula that evaluates as `true` for the rows we want to return. This was pretty straightforward except for 1) dealing with `IS NULL` / `IS NOT NULL` filters and 2) filtering by `ID` or `createdTime` (these two values are returned on every record). Writing a filter to select for nulls was tricky in that Airtable’s concept of null is `BLANK()`, which is also equal to the number `0`. Filters for `ID` and `createdTime` are able to leverage the built-in functions `RECORD_ID()` and `CREATED_TIME()`.
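A sketch of the translation from SQL-style filters into Airtable formulas (the helper names are mine; the library’s internals may differ):

```python
def eq(field: str, value) -> str:
    """Translate `field = value` into an Airtable formula fragment."""
    if isinstance(value, str):
        escaped = value.replace("'", "\\'")
        return f"{{{field}}} = '{escaped}'"
    return f"{{{field}}} = {value}"

def is_null(field: str) -> str:
    # Caveat: BLANK() also compares equal to the number 0, so this can
    # match zero in numeric columns -- the real translation needs care here.
    return f"{{{field}}} = BLANK()"

def by_record_id(record_id: str) -> str:
    # ID isn't a regular field; use the built-in RECORD_ID() function.
    return f"RECORD_ID() = '{record_id}'"

def all_of(*clauses: str) -> str:
    """AND together the translated clauses."""
    return f"AND({', '.join(clauses)})"
```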
Once a given column’s data type is known, decoding is generally pretty straightforward. The gotchas were: parsing relationships (Link records and Lookups), dealing with floats, and dealing with formulas. Relationships are always represented in the API as arrays, regardless of whether “allow linking to multiple records” is checked; in these cases, the array value needs to be flattened for support in SQLite / APSW. Floats are generally encoded according to the JSON spec; however, JSON does not support “special” values like Infinity or NaN. Rather than sending a string value for these, the Airtable API returns a JSON object with a key of `specialValue` and a corresponding value. Errors resulting from formulas are likewise encoded as an object with a key of `error` and a corresponding error code (such as `#ERROR`).
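A sketch of decoding those cases (based on the behavior described above; the exact key handling and the choice to surface errors as NULL are assumptions):

```python
import math

def decode_cell(value):
    """Decode one Airtable cell value into something SQLite can store."""
    if isinstance(value, dict):
        if "specialValue" in value:
            # e.g. {"specialValue": "NaN"} or {"specialValue": "Infinity"}
            return float(value["specialValue"])
        if "error" in value:
            # Formula errors such as {"error": "#ERROR"}; surface as NULL.
            return None
    if isinstance(value, list):
        # Link/Lookup fields are always arrays -- flatten single values.
        return value[0] if len(value) == 1 else ", ".join(map(str, value))
    return value
```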
A “database engine spec” is required in order to fully utilize Superset when connected to a new datasource type (i.e. our Dialect). Basically the spec tells Superset what features the DB supports. The spec is required for any sort of time series work that requires time-binning. There is a pretty confusing message about “Grain” and how the options are defined in “source code”:
Ultimately I found this blog post from Preset and this implementation for the GSheets datasource, which I based mine on. You do also have to register the spec using an `entry_point`. Once I added this spec, I was then offered a full set of time grains to choose from. Superset does also now call out the datasource’s “backend”:
I am not currently taking advantage of the UI / parameters (it is all pretty undocumented), but from what I see it looks as if I can tweak the UI to gather specific values from the user.
With our Airtable tables now accessible in SQLAlchemy as database tables, I figured I would try using SQL to perform a `JOIN` between two tables in the Base. The `JOIN` worked, in that the data returned was correct and as expected; however, the performance was terrible: I observed far more calls to the Airtable API than expected, and filters were not being passed in. For example, if I were joining tables `A` and `B` on `B.a_id = A.id`, I would see `get_data` for `A` called once (with no filter) and then `get_data` for `B` called `n` times (with no filters passed in), where `n` is the number of rows in `A`.
I thought this might be due to SQLite’s use of a sequential scan rather than an index scan and tried writing a `get_cost` function to provide additional hints to the query planner. While this does cause SQLite to issue filtered queries to the API, cutting down on the amount of data fetched, it does not reduce the number of queries sent to the API. With a `get_cost` function, I see a single call to `get_data` on `A` and then `n` calls to `get_data` for `B`, but I now see `ID=` filters being passed in on each call. `EXPLAIN QUERY PLAN` confirms the difference: with a `get_cost`, SQLite scans `B` using the virtual table index; without one, it falls back to sequential scans.
Unfortunately the root issue is that SQLite does not try to hold the tables in memory during a query and instead relies on the virtual table implementation to do so. Without any caching in our Adapter, this means that our `get_data` method is called `n` times on one of the two tables.
I opened this discussion on the SQLite forum, where the tentative conclusion is to add a `SQLITE_INDEX_SCAN_MATERIALIZED` flag to tell SQLite to materialize the whole virtual table (keyed on the used constraints).3
For example building a “Source Handler” for GraphQL Mesh for Airtable or a clone of BaseQL.↩
While the API supports a `LIMIT` (page size) concept, it uses cursor-based pagination, so specifying `OFFSET` is not possible.↩
There are some other hints to SQLite that might be important for query optimization, such as `SQLITE_INDEX_SCAN_UNIQUE`:
the idxFlags field may be set to SQLITE_INDEX_SCAN_UNIQUE to indicate that the virtual table will return only zero or one rows given the input constraints.
I wanted a way for Superset to pull data from a GraphQL API so that I could run this powerful and well used business intelligence (BI) tool on top of the breadth of data sources that can be exposed through GraphQL APIs. Unfortunately Superset wants to connect directly with the underlying database (from its landing page: “Superset can connect to any SQL based datasource through SQLAlchemy”).
Generally this constraint isn’t that restrictive as data engineering best practices are to set up a centralized data warehouse to which various production databases would be synchronized. The database used for the data warehouse (e.g. Amazon Redshift, Google BigQuery) can then be queried directly by Superset.
The alternative, “lazy” approach is to eschew setting up this secondary data storage system and instead point the BI tool directly at the production database (or, slightly better, at a read replica). This alternative is the route often chosen by small, fast-moving, early-stage product engineering teams that either lack the resources or the know-how to engage in proper data engineering. Or the engineers just need something “quick and dirty” to empower business users to explore and analyze the data themselves and to take building ad-hoc reports off the engineering team’s plate.
I found myself in the latter scenario, where I wanted to quickly connect a BI tool to our systems without needing to set up a full data warehouse. There was a twist, though. While our primary application’s database is Postgres (which is supported by Superset), we augment that data using various external / third-party APIs and expose all of this through a GraphQL API1. In some cases there are data transformations and business logic in the resolvers that I wanted to avoid having to re-implement in the BI tool.
And with that I set out to connect Superset (my BI tool of choice) directly to my GraphQL API. It turns out someone else also had this idea and had kindly filed a GitHub ticket back in 2018. The ticket unfortunately was closed with:
a) It’s possible to extend the built-in connectors. “The preferred way is to go through SQLAlchemy when possible.”
b) “All datasources ready for Superset consumption are single-table”.
While (b) could be an issue, as GraphQL allows arbitrary return shapes, if we confine ourselves to top-level `List`s or `Connection`s and flatten related entities, the data looks “table-like”. (a) boiled down to writing a SQLAlchemy `Dialect` and an associated Python DB-API database driver. A quick skim of both the SQLAlchemy docs and the PEP got me a little worried about what I might be attempting to bite off.
Before I gave up hope, I decided to take a quick look at the other databases that Superset listed to see if there was one I could perhaps base my approach off of. Support for Google Sheets caught my attention, and upon digging further, I saw it was based on a library called `shillelagh`, whose docs say “Shillelagh allows you to easily query non-SQL resources” and “New adapters are relatively easy to implement. There’s a step-by-step tutorial that explains how to create a new adapter to an API.” Intriguing. On the surface this was exactly what I was looking for.
I took a look at the step-by-step tutorial, which very cleanly walks you through the process of developing a new `Adapter`. From the docs: “Adapters are plugins that make it possible for Shillelagh to query APIs and other non-SQL resources.” Perfect!
The last important detail that I needed to figure out was how to map the arbitrary return shape from the API to something tabular. In particular, I wanted the ability to leverage the graph nature of GraphQL and pull in related entities. However, the graph may be infinitely deep, so I can’t just crawl all related entities. I solved this problem by adding an “include” concept that specifies which related entities should be loaded. This include is specified as a query parameter on the table name. In the case of SWAPI, I might want to query the `allPeople` connection. This would then create columns for all the scalar fields of the `Node`s returned. If I wanted to include the fields from the `species`, the table name would be `allPeople?include=species`. This would add to the columns the fields on the linked `Species`. This path could be further traversed as `allPeople?include=species,species__homeworld`.
The majority of the logic is in the `Adapter` class, and it roughly boils down to:

1. Taking the table name (exposed on the GraphQL API as a field on `query`) along with any path traversal instructions and resolving the set of “columns” (flattened fields) and their types. This leverages the introspection functionality of GraphQL APIs.
2. Generating the needed GraphQL query and then mapping the results to a row-like structure.
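A sketch of the flattening step (the field names mirror the SWAPI example above; the actual adapter’s internals may differ):

```python
def flatten_node(node: dict, prefix: str = "") -> dict:
    """Flatten a GraphQL node into a single row, prefixing nested
    entity fields with `entity__` (e.g. species__name)."""
    row = {}
    for key, value in node.items():
        if isinstance(value, dict):
            # An included related entity: recurse with a path prefix.
            row.update(flatten_node(value, f"{prefix}{key}__"))
        else:
            row[f"{prefix}{key}"] = value
    return row

# e.g. a node from `allPeople?include=species`:
node = {"name": "Chewbacca", "species": {"name": "Wookiee"}}
```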
I also wrote my own subclass of `APSWDialect`, which, while not documented in shillelagh’s tutorial, was the approach taken for the `gsheets://` Dialect. This comes with the benefit of being able to expose the list of tables supported by the `Dialect`, which this class does by leveraging GraphQL introspection.
Once the `graphql` `Dialect` is registered with SQLAlchemy, adding the GraphQL API to Superset as a Database is as easy as creating a standard database URL where the `host` and `port` (optional) are the host and port of the API and the `database` portion is the path: `graphql://host(:port)/path/to/api`2.
Once the Database is defined, the standard UI for defining Datasets can then be used. The list of “tables” is auto-populated through introspection of the GraphQL API3:
The columns names and their types can also be sync’ed from the Dataset:
Once the Database and the Dataset are defined, the standard Superset tools like charting can be used:
and exploring:
Longer term, there are a whole bunch of great benefits of accessing data through a GraphQL API, which this blog post explores further. GraphQL presents a centralized platform / solution for stitching together data from a number of internal (e.g. microservices) and external sources (SaaS tools + third-party APIs). It is also a convenient location to implement authorization and data redaction.
I love to do early-stage development work in a Jupyter Notebook with the autoreload extension (basically hot reloading for Python). However, this presented a bit of a problem, as both SQLAlchemy and shillelagh expect their respective `Dialect`s and `Adapter`s to be registered as entry points.
SQLAlchemy’s docs provided a way to do this in-process via `sqlalchemy.dialects.registry.register`, which takes the dialect name, module path, and class name.
However, shillelagh did not (feature now requested), but I was able to work around the issue by registering my `Adapter` with shillelagh manually at runtime.
We use Redis caching + DataLoaders to mitigate the cost of accessing these APIs.↩
We do still need to specify the scheme (`http://` vs `https://`) to use when connecting to the GraphQL API. Options considered:

1. Two different Dialects (e.g. `graphql://` vs `graphqls://`),
2. Different Drivers (e.g. `graphql+http://` vs `graphql+https://`), or
3. A query parameter attached to the URL (e.g. `graphql://host:port/path?is_https=0`).

I went with (3) as most consistent with other URLs (e.g. for Postgres: `?sslmode=require`).↩
In order to set query params on the dataset name, I did have to toggle the `Virtual (SQL)` radio button, which then let me edit the name as free text:
↩
As of my writing this blog post, scikit-learn does not support MLPs (see this GSoC for plans to add this feature). Instead I turn to pylearn2, the machine learning library from the LISA lab. While pylearn2 is not as easy to use as scikit-learn, there are some great tutorials to get you started.
I added a line to the bottom of my last notebook to dump the Hu moments to a CSV so that I could start working in a fresh notebook:
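Something along these lines (the DataFrame name, column names, and sample values here are assumptions, not the notebook’s actual contents):

```python
import pandas as pd

# Hypothetical: `hu_df` holds one row per training image, with the
# seven Hu moments plus a `state` label column.
hu_df = pd.DataFrame(
    [[0.16, 0.001, 2e-05, 3e-06, 1e-11, 9e-08, -2e-11, "MA"]],
    columns=[f"hu{i}" for i in range(1, 8)] + ["state"],
)
hu_df.to_csv("hu_moments.csv", index=False)  # the actual dump
```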
In my new notebook, I performed the same feature normalization that I had used for the SVM: taking the log of the Hu moments and then mean-centering and std-dev scaling those log-transformed values. I did actually test the MLP classifier without performing these normalizations, and, as with the SVM, classification performance degraded.
With the data normalized, I shuffled the data and then split it into test, validation, and training sets (15%, 15%, and 70% respectively). The simplest way I found to shuffle the rows of a Pandas DataFrame was: `df.ix[np.random.permutation(df.index)]`. I initially tried `df.apply(np.random.permutation)`, which led to each column being shuffled independently (and my fits being horrible).
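A sketch of the shuffle-and-split step (note that `df.ix` has since been removed from Pandas; `iloc` with a permutation of positions is the modern equivalent):

```python
import numpy as np
import pandas as pd

def shuffle_split(df: pd.DataFrame, seed: int = 0):
    """Shuffle rows, then split 15% test / 15% validation / 70% train."""
    rng = np.random.RandomState(seed)
    # Permute row *positions* so the whole row moves together --
    # shuffling each column independently would destroy the labels.
    shuffled = df.iloc[rng.permutation(len(df))]
    n = len(df)
    n_test = int(0.15 * n)
    n_valid = int(0.15 * n)
    test = shuffled.iloc[:n_test]
    valid = shuffled.iloc[n_test:n_test + n_valid]
    train = shuffled.iloc[n_test + n_valid:]
    return test, valid, train
```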
I saved the three data sets out to three separate CSV files. I could then use the CSVDataset in the training YAML to load in the files. I created my YAML file by taking the one used for classifying the MNIST digits in the MLP tutorial, modifying the datasets and changing `n_classes` (output classes) from 10 to 50 and `nvis` (input features) from 784 to 7. I also had to reduce `sparse_init` from 15 to 7 after a mildly helpful assertion error was thrown (I am not really sure what this variable does, but it has to be <= the number of input features). For good measure I also reduced the size of the first layer from 500 nodes to 50, given that I have far fewer features than the 784 pixels in the MNIST images.
After that I ran the fit and saw that my `test_y_misclass` went to 0.0 in the final epochs. I was able to plot this convergence using the `plot_monitor` script. Since I like to have my entire notebook self-contained, I ended up modifying the `plot_monitor` script to run as a function call from within IPython Notebook (see my fork).
Here is the entire IPython Notebook.
I didn’t tinker with regularization, but this topic is well addressed in the pylearn2 tutorials.
In a future post, I plan to use CNN to classify the state map images without having to rely on domain specific knowledge for feature design.
I wanted to write a program to identify a map of a US state.
To make life a little more challenging, this program had to work even if the maps are rotated and are no longer in the standard “North is up” orientation1. Further the map images may be off-center and rescaled:
The first challenge was getting a set of 50 solid-filled maps, one for each of the states. Some Google searching around led to this page which has outlined images for each state map. Those images have not just the outline of the state, but also text with the name of the state, the website’s URL, a star showing the state capital and dots indicating what I assume are major cities. In order to standardize the images, I removed the text and filled in the outlines. The fill took care of the star and the dots.
First, some Python to get the list of all state names:
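One simple way to build that list (hardcoded here for self-containment; the derived file-name scheme is an assumption about how the source page names its images):

```python
# All 50 US state names.
STATE_NAMES = [
    "Alabama", "Alaska", "Arizona", "Arkansas", "California",
    "Colorado", "Connecticut", "Delaware", "Florida", "Georgia",
    "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa",
    "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland",
    "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri",
    "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey",
    "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio",
    "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina",
    "South Dakota", "Tennessee", "Texas", "Utah", "Vermont",
    "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming",
]

# e.g. derive a candidate GIF file name for each state:
file_names = [name.lower().replace(" ", "") + ".gif" for name in STATE_NAMES]
```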
Next I tried to open one of these images using OpenCV’s `imread`, only to discover that OpenCV does not handle GIFs, so I built this utility method to convert the GIF to PNG and then read it with `imread`:
With the images loaded, I believed the following operations should suffice to get my final, filled-in image: find the contours (`findContours`), keep only the significant polygons, and fill them in. Here when I say significant polygons, I want to make sure the polygon bounds a non-trivial area and is not a stray mark.
I use the following methods to convert the image and to display it using matplotlib:
This is my first attempt at the algorithm:
Which seems to work for most states, but fails for Massachusetts:
This seems to be due to a break somewhere in the outline. A simple `dilate` with a `3x3` kernel solved the problem.
The final algorithm is here:
which works for all states:
I needed some way of comparing two images which may have different rotations, scalings and/or translations. My first thought was to develop a process that normalizes an image to a canonical orientation, scaling and position. I could then take an unlabeled image, and compare its standardized form to my set of canonicalized maps pixel-by-pixel with some distance metric such as MAE or MSE.
How to normalize the image? Image moments immediately came to mind. The zeroth moment is the “mass” of the image (actually the area, since I am working with black and white images, where the black pixels have unit mass and the white pixels have no mass). The area can be used to normalize for image scaling.
The first moment gives the center of mass, or centroid, of the image. Calculating the centroid is just an arithmetic average of the black pixel coordinates. The center of mass is a reference point that allows me to normalize for translation of the image. Finally, I would like an “axis of orientation” which I can use to normalize rotation. Wikipedia provides the answer, but a more complete explanation can be found here.
At this point I realized that rather than going through the trouble of standardizing the image and then comparing pixel-by-pixel, I could just compare a set of so-called “rotation invariant moments”. These values are invariant for a given image under translation, scale and rotation. OpenCV provides a function to calculate the Hu moments. The Hu moments are the most frequently used, but Flusser shows there are superior choices. For convenience I will use the seven Hu moments as feature values in a machine learning classifier.
In order to classify the images using the seven feature values, I decided to use scikit-learn’s SVM Classifier. I generated training data by taking each of the 50 state images, and then rotating, shifting and scaling them and then getting the Hu moments for the resulting image. I took this set of examples, split 20% out as test data and 80% for training. After observing that each of the seven features takes on very different ranges of values, I made sure to scale my features.
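The classifier setup can be sketched with scikit-learn as below; the features here are random stand-in clusters rather than real Hu moments (and only 10 classes instead of 50), just to show the shape of the pipeline:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes, per_class = 10, 40  # the real problem has 50 classes (states)

# Stand-in for the 7 Hu-moment features of rotated/shifted/scaled images:
# one well-separated cluster of 7-dimensional points per class.
centers = rng.normal(scale=5.0, size=(n_classes, 7))
X = np.vstack([c + rng.normal(scale=0.1, size=(per_class, 7))
               for c in centers])
y = np.repeat(np.arange(n_classes), per_class)

# 80/20 train/test split, and scale the features since the seven
# moments take on very different ranges of values.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```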
An SVM with a linear kernel resulted in 31% classification accuracy. This is certainly better than the 2% I would get from blind guessing, but a long way from what a gifted elementary school student would get after studying US geography. Using an “rbf” kernel with the default setting for gamma gives an accuracy of 20%. With some tweaking of gamma, I was able to get this up to 99%, but it required gamma=10000. Needing this large a gamma made me question my approach. I reviewed the OpenCV method matchShapes, which also uses the Hu moments, and noticed that rather than comparing the Hu moments directly, it compares the logs of the moments.
I changed my features to be the signed log of the absolute values of the Hu moments, and then applied standard scaling. This log transform of the Hu moments is pretty common in the literature; see Conley2. After doing that, my linear kernel SVM was able to achieve a score of 100%. An rbf kernel with default gamma got a score of 99% (the confusion matrix shows some confusion between Utah and Arizona and between Georgia and Missouri, which do look similar):
While not part of the original problem specification, I wanted to see how this classification technique would fare under image blurring. I show below classification scores as a function of increasing amounts of blur. Using a normalized box filter:
and a Gaussian Blur:
My IPython Notebook is available here.
It would be cool to evaluate this same technique on maps of countries. I have found two sources of country maps: this and this. Using the second link, the URLs for the country pages can be extracted using:
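That snippet was also lost in formatting; as a stand-in, here is a stdlib-only sketch of pulling the per-country links out of a listing page (the `/maps/` URL pattern is hypothetical):

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every anchor tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Usage against a downloaded listing page (inline HTML for illustration):
html = ('<ul><li><a href="/maps/france.html">France</a></li>'
        '<li><a href="/maps/japan.html">Japan</a></li></ul>')
parser = LinkExtractor()
parser.feed(html)
country_urls = [h for h in parser.links if h.startswith("/maps/")]
```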
In reality, not only do I need to be concerned with scale, rotation, and translation, but I also should care about the map projection.↩
“To improve classification rates, the natural logarithm of the absolute value of the seven invariants are used instead of the invariants themselves, because the invariants often have very low absolute values (less than 0.001), so taking the logarithm of the invariants reduces the density of the invariants near the origin. Also, the values of the natural logarithms are then converted into standardized z-scores before classification”↩
@amesbah The current release is a bit neglected right now. The next version is being worked on quite a bit though. https://t.co/a3cogy8z2O
— Octopress (@octopress) May 15, 2014
The first issue that I came across was the vertical alignment of the “social buttons” on the bottom of each post. You can see the Facebook “Like” and “Share” buttons are lower than the Twitter and Google+ buttons:
This was fixed by following this comment.
The next issue was the rendering of embedded Gists. The rendered Gists look pretty crappy:
There are a number of people talking about the issue: here and here. The solution I ended up using was a combination of these plus some of my own CSS.
Currently, meta tags for keywords and description are only generated on post pages, not on site-level pages. A solution is offered here, along with a Pull Request.
I addressed this issue in a previous post.
In order to use post_url to allow intelligent linking between posts, I had to follow this post.
Then I ran into an error message.
Following this comment fixed that issue.
Since I wanted the ability to use named anchors, I also made a change to the post_url.rb file.
I doubt these will be the last of the paper cuts that I uncover, so stay tuned for more fixes. I am starting to think the tweet above applies to Octopress in general, but I still enjoy the challenge and the learning experience.
Before RBC acquired this supposed state-of-the-art electronic-trading firm, Katsuyama’s computers worked as he expected them to. Suddenly they didn’t. It used to be that when his trading screens showed 10,000 shares of Intel offered at $22 a share, it meant that he could buy 10,000 shares of Intel for $22 a share. He had only to push a button. By the spring of 2007, however, when he pushed the button to complete a trade, the offers would vanish. In his seven years as a trader, he had always been able to look at the screens on his desk and see the stock market. Now the market as it appeared on his screens was an illusion.
For someone making $1.5 million a year running RBC’s electronic-trading operation, I am surprised at how little Brad understands about liquidity. I don’t just mean the complicated American equity markets; I mean liquidity in general. A simple analogy illustrates why the market is in fact not an illusion and how a similar experience can happen in other markets.
Let’s say you are looking to buy airline tickets to fly your entire extended family, all 30 of them, from Toronto to New York City. The first thing you, as a bargain-hunting traveler, do is log onto your favorite travel agent site (be it Expedia, Travelocity, etc.). You search for the cheapest fare and get ready to buy. Unfortunately most sites allow you to buy only six tickets at a time. So now you, as a savvy shopper, open your next four favorite travel agent sites and find the same flight. Assuming you are lucky (and the sites have access to the same fares), you see the same price on all of them. Now it’s go time! You successfully buy six tickets on the first site and move on to the second site. Once again you are successful. Great: 12 tickets down, 18 to go.
Now when you try to checkout on the third site, you receive a strange error message: “Unable to complete checkout. Please search again.” You scratch your head and hit the refresh button. The flight results reload, but the rates have gone up! You decide to move on to the fourth site and see what happens there. The same! You aren’t able to buy the tickets from your initial search and are once again presented with a higher fare.
So what happened? The airline has a bunch of seats it wants to sell. Since it’s a profit-maximizing corporation, it employs “revenue management” to make as much money as possible on those seats. In the case of my analogy, the airline has 12 seats at some low price that it is trying to sell, with another batch offered at a higher price. The problem is there are only 12 seats at the low price, and the airline doesn’t know which travel site its customers will use, so it offers those same 12 seats on all of the sites. Once those 12 are sold, a new batch of seats is offered at a higher price1.
This is what Brad observes in the American equities market. There are, give or take, 12 public exchanges whose aggregate offers he is observing. However, like the airline, market makers don’t know to which exchange liquidity takers (i.e. anyone who sends a market order) will go. So they have some amount of stock they are comfortable selling, say 2,000 shares, and they offer 1,000 shares on each of 12 venues. Brad then observes a total of 12,000 shares offered; he tries to buy all 12,000 but ultimately only buys 2,000. This doesn’t mean the market is rigged, but rather that his notion of liquidity, summing up everything he sees, is naive. It’s like summing up the tickets offered on all of the travel websites and expecting the airline to sell all of them at the same price. This is even crazier for an airline, as the total number of tickets seen might exceed the number of seats on the flight!
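The arithmetic is simple enough to put in a toy model; the numbers mirror the example above:

```python
venues = 12
displayed_per_venue = 1_000  # shares the market maker shows on each venue
true_inventory = 2_000       # shares the maker is actually willing to sell

displayed_total = venues * displayed_per_venue   # what the screens show
filled = min(displayed_total, true_inventory)    # what a taker can actually buy

# Summing the screens overstates real liquidity six-fold.
overstatement = displayed_total / true_inventory
```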
The new fare is indicated on the ticket as a different booking code.↩
I have chosen to run my blog with Octopress rather than WordPress or Blogger. Why? This blogger and this blogger do a good job of explaining. Not only does Octopress allow me to source control the entire blog, but also lets me very easily host the site using GitHub Pages.
The guide here does a great job of explaining setting up Octopress on GitHub. While I am using GitHub to host the static content, I wanted to use my own custom domain. Following the instructions here to create a CNAME file and then adding a simple “CNAME Record” (an alias) on GoDaddy was all it took for blog.alexrothberg.com to direct to cancan101.github.io.
The default behavior for Octopress is to put the blog content in the /blog/ subdirectory of the URL. Since I am using a subdomain for my namespacing, this proved to be awkward: the URLs started with http://blog.alexrothberg.com/blog/. This post offers an easy solution, and I also found this GitHub Issue discussing the problem. You can see the changes I made here. When done with the edits, I also had to run rake update_source.