Bytes of Pi

Aug 07

Forcing SSL in a Sinatra App

When deploying on Heroku, you can piggyback on their SSL certificate, which gives you a secure connection right away without any SSL configuration of your own. I think this is a great solution for a lot of people, at least until you need a really pretty URL. Because this is possible, you should use it, and if you're building an API you should also think about requiring SSL in some environments. Here is a simple implementation in my `app.rb` file:

class Myapi < Sinatra::Application
  ActiveRecord::Base.default_timezone = :utc

  configure :development, :test do
    set :host, 'localhost:9999'
    set :force_ssl, false
  end
  configure :staging do
    set :host, 'my-api-staging.herokuapp.com'
    set :force_ssl, true
  end
  configure :production do
    set :host, 'my-api.herokuapp.com'
    set :force_ssl, true
  end

  before do
    content_type :json
    ssl_whitelist = ['/calendar.ics']
    if settings.force_ssl && !request.secure? && !ssl_whitelist.include?(request.path_info)
      halt json_status 400, "Please use SSL at https://#{settings.host}"
    end
  end
end

require_relative 'application/helpers/init'
require_relative 'application/routes/init'
require_relative 'application/models/init'
require_relative 'application/core_extensions/init'

I have left some of my app-specific code in there as well, but I am sure you can dig around that to see how SSL is forced. Notice that `/calendar.ics` is included in a whitelist array: downloading `.ics` files shouldn’t require SSL, or in other words, the request shouldn’t fail if the user uses `http`.
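The check in the `before` filter boils down to a three-part condition. Here is a minimal sketch of it pulled out of Sinatra so it can be run on its own. The `reject_insecure?` name and its arguments are made up for illustration; they are not part of my app or of Sinatra.

```ruby
# Paths that may be served over plain http (mirrors the whitelist above).
SSL_WHITELIST = ['/calendar.ics'].freeze

# Returns true when a request should be rejected for not using SSL.
#   force_ssl - the per-environment setting
#   secure    - what request.secure? would report
#   path      - what request.path_info would report
def reject_insecure?(force_ssl, secure, path, whitelist = SSL_WHITELIST)
  force_ssl && !secure && !whitelist.include?(path)
end

puts reject_insecure?(true, false, '/users')         # true: plain http, SSL enforced
puts reject_insecure?(true, false, '/calendar.ics')  # false: whitelisted path
puts reject_insecure?(false, false, '/users')        # false: SSL not enforced here
puts reject_insecure?(true, true, '/users')          # false: already on https
```

Keeping the condition in one small method like this would also make it easy to cover with a unit test, separately from the routes.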

Multi-environment post_build_hook using tddium, Heroku, and Ruby Sinatra

We have been using tddium as a deployment tool for our Ruby Sinatra API for some time now. It has been working great and now manages deployment for 7 of our API environments. We have run into a few small issues, but there are some sharp engineers on support, so issues get fixed fast. The most recent development in our system is a post build hook that runs migrations after the app is deployed on Heroku.

One of the straightforward adjustments I had to make to their gist was to dynamically set the app name based on the branch that was pushed to GitHub. In our case, each Heroku app (“environment”) has a git branch, and this is also how tddium sets up test suites, so everything jibes.

The portion that required some support was apparently due to some dependency issues with my gems and the way the git repo was being pushed from tddium. The first step was to add the heroku gem to my Gemfile; from there, I modified the tddium post_build_hook a little. Here is the full version:

namespace :tddium do
  desc "post_build_hook"
  task :post_build_hook do
    # use next, not return: a bare return inside a task block raises LocalJumpError
    next unless ENV["TDDIUM_MODE"] == "ci"
    next unless ENV["TDDIUM_BUILD_STATUS"] == "passed"

    dir = File.expand_path("~/.heroku/")
    heroku_email = ENV["HEROKU_EMAIL"]
    heroku_api_key = ENV["HEROKU_API_KEY"]
    current_branch = `git symbolic-ref HEAD 2>/dev/null | cut -d"/" -f 3-`.strip
    # backticks always return a String, so check for an empty one rather than nil
    abort "invalid current branch" if current_branch.empty?
    puts "Current Branch: #{current_branch}"

    case current_branch
      when 'integration'
        app_name = 'matts-api-integration'
      when 'staging'
        app_name = 'matts-api-staging'
      when 'production'
        app_name = 'matts-api'
      when 'demo'
        app_name = 'matts-api-demo'
      when 'demo-staging'
        app_name = 'matts-api-demo-staging'
      when 'pilot'
        app_name = 'matts-api-pilot'
      when 'pilot-staging'
        app_name = 'matts-api-pilot-staging'
    end

    # app_name is nil when the branch has no matching app; skip the deploy in that case
    next unless app_name

    puts "App Name: #{app_name}"
    push_sha = `git rev-parse HEAD`
    push_target = "git@heroku.com:#{app_name}.git"

    FileUtils.mkdir_p(dir) or abort "Could not create #{dir}"

    puts "Writing Heroku Credentials"
    File.open(File.join(dir, "credentials"), "w") do |f|
      f.write([heroku_email, heroku_api_key].join("\n"))
      f.write("\n")
    end

    File.open(File.expand_path("~/.netrc"), "a+") do |f|
      ['api', 'code'].each do |host|
        f.puts "machine #{host}.heroku.com"
        f.puts "  login #{heroku_email}"
        f.puts "  password #{heroku_api_key}"
      end
    end
    
    # Kernel#system returns false/nil when the command fails, which triggers the abort
    puts "Pushing to Heroku: #{push_target}..."
    system "git push #{push_target} HEAD:master --force" or abort "could not push to #{push_target}"

    puts "Running Heroku Migrations..."
    system "heroku run rake db:migrate --app #{app_name}" or abort "aborted migrations"

    puts "Restarting Heroku..."
    system "bundle exec heroku restart --app #{app_name}" or abort "aborted heroku restart"

    puts "Post Build Complete!"
  end
end
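As a side note, the branch-to-app `case` statement could also be written as a lookup table. This is only a sketch of that alternative, not what we actually run; `HEROKU_APPS` and `heroku_app_for` are names I made up for the example.

```ruby
# Branch name => Heroku app name (same pairs as the case statement above).
HEROKU_APPS = {
  'integration'   => 'matts-api-integration',
  'staging'       => 'matts-api-staging',
  'production'    => 'matts-api',
  'demo'          => 'matts-api-demo',
  'demo-staging'  => 'matts-api-demo-staging',
  'pilot'         => 'matts-api-pilot',
  'pilot-staging' => 'matts-api-pilot-staging'
}.freeze

# Returns the app name for a branch, or nil when the branch is not deployed.
def heroku_app_for(branch)
  HEROKU_APPS[branch.to_s.strip]
end

puts heroku_app_for('staging').inspect       # "matts-api-staging"
puts heroku_app_for("production\n").inspect  # "matts-api" (trailing newline stripped)
puts heroku_app_for('feature/x').inspect     # nil
```

A `nil` result is the signal to skip the deploy, which keeps the "unknown branch" case explicit.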

Jul 20

Lazy Levenshtein: Using Abbreviations and Spellchecked Inputs in Ruby

I have been spending a lot of time writing Ruby programs that take in data through the terminal. One of the problems is that misspelling something can cause the program to crash, and I want to be as quick as possible when doing data entry.

One of my programs asks which server environment I would like to use before I start messing with any data (development, integration, staging, production). It would be great if all of the following abbreviations or misspellings would choose the development environment, and keep the program rolling:

You get the idea: abbreviations and spellchecking against known inputs. To accomplish this I leverage the Levenshtein distance algorithm, more commonly known as “edit distance”. This algorithm compares two strings and returns an integer equal to the number of edits needed to transform the first string into the second.
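To make “number of edits” concrete, here is a small pure-Ruby sketch of the textbook dynamic-programming version, where every insert, delete, or substitution costs 1. It is only for illustration (the `edit_distance` name is mine); the Amatch library does the real work in my programs.

```ruby
# Classic dynamic-programming edit distance (insert, delete, substitute = 1 each).
def edit_distance(a, b)
  # previous[j] = distance between the prefix of a handled so far and the first j chars of b
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,        # delete char_a
        current[j - 1] + 1,     # insert char_b
        previous[j - 1] + cost  # substitute, or keep when the chars match
      ].min
    end
    previous = current
  end
  previous.last
end

puts edit_distance('kitten', 'sitting')         # 3
puts edit_distance('dev', 'development')        # 8
puts edit_distance('john smith', 'john.smith')  # 1
```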

Here is the GitHub Gist for Lazy Levenshtein, with the sample code below so we can dig through it.

# test by running:
# ruby lazy.rb
# use control+c to exit
require 'amatch'
include Amatch

def lazy input, matches, abbreviations=true
  # setup the Levenshtein comparator
  distances = {}
  m = Levenshtein.new(input)
  matches.reverse.each do |match|
    # get the edit distance
    tests = []
    tests << m.match(match)
    tests << m.match(match[0..3]) if abbreviations
    
    # lowest score gets placed
    distances[tests.min] = match
  end
  # return the closest match, or the original input if matches is empty
  distances.empty? ? input : distances.min.last
end

environments = %w(development integration staging production)
while true
  puts "\nenvironments #{environments.join(', ')}"  
  print "choose environment: "
  input = gets.chomp
  puts "match: #{lazy(input, environments)}"
end

The three parameters are the input itself, an array of possible matches, and a boolean that tells the method whether or not you want to match abbreviations. The method sets up a Levenshtein comparison for each potential match (using the Ruby Amatch library), and scores the comparison. We are playing golf here, because the lowest score wins the game!

The method also reverses the array in the main loop, which gives priority to the first items in the array if there happens to be a tie between matches. Unlike a typical “spellcheck”, this method will never return “not found”; it will always return a match, and if the “matches” array is empty, it simply returns the provided input.

This has helped me make inputting much faster with smarter defaults, and given me the peace of mind that my misspellings will always turn into known/safe values.

Jul 13

Article Analysis: Matching People’s Names to Email Addresses

*Code examples are written in Ruby

The problem

Suppose you are scraping web articles to build a list of contacts for a PR company. Getting email addresses is as simple as a regular expression.

string = 'John Smith can be contacted at john.smith@gmail.com'
emails = []
string.scan(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i).each {|x| emails << x}

However, an email address is only so powerful; it would be really great if we could match a name to that email address. The problem statement is pretty simple: match email addresses in the article to the names in the article.

Finding names

So if you are thinking that finding names in text can’t be that hard, let’s take a stroll down that dark alley really quick. Maybe just a regular expression that matches two capitalized strings in a row could do the trick.

/([A-Z]+[a-zA-Z]*)\s+([A-Z]+[a-zA-Z]*)/

Well, that’s cool, but what if there is a middle name, or even worse, an abbreviated name? So now, we add an optional third string to the regular expression, and allow for abbreviations.

/([A-Z]+[a-zA-Z.]*)\s+([A-Z]+[a-zA-Z.]*)+(\s+([A-Z]+[a-zA-Z]*))?/

This is looking great, and then you hit a name like Michael D’hunt, or De’Angelo Munez. Okay, so now we allow some apostrophes.

/([A-Z']+[a-zA-Z'.]+)\s+([A-Z]+[a-zA-Z'.]*)+(\s+([A-Z']+[a-zA-Z']*))?/

So right now, you run this against a list of 10,000 common names and a boatload of Lorem Ipsum, and you have some great accuracy. However, in the real world you get a sentence like this: “And Will Smith was not alone, he included his wife Jada on his trip to San Francisco, where they stayed at Hotel Palomar”. Your regular expression just got roasted in so many ways.
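To see the roasting in action, here is a quick sketch that runs the very first two-word pattern from above against that sentence. The `NAME_PATTERN` constant name is just for the example.

```ruby
# The first two-capitalized-words pattern from above.
NAME_PATTERN = /([A-Z]+[a-zA-Z]*)\s+([A-Z]+[a-zA-Z]*)/

sentence = 'And Will Smith was not alone, he included his wife Jada ' \
           'on his trip to San Francisco, where they stayed at Hotel Palomar'

# scan returns one [first, second] pair per match; join each pair back together
matches = sentence.scan(NAME_PATTERN).map { |pair| pair.join(' ') }
puts matches.inspect
# => ["And Will", "San Francisco", "Hotel Palomar"]
```

Three matches and not one real name: “Will Smith” is lost because the match starts at “And”, “Jada” is skipped because she has no second capitalized word, and a city and a hotel are reported as people.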

Natural Language Processing (NLP)

To parse out things like names, places, and dates, or to go further and differentiate languages within text, we have to get a little fancier. Natural language processing concerns itself with the study of linguistics, using a mixture of machine learning, statistics, and artificial intelligence to provide a meaningful analysis of human languages. For all intents and purposes, we can say that it breaks apart sentences so we can interpret them better using a computer.

This is the crucial link between knowing if ‘San Francisco’ is a person’s name, or a physical place. The Stanford Natural Language Processing Group has a set of core open-source tools that can take care of some of this by implementing known language patterns, and using massive libraries of common naming schemes for people, places, and things.

Finding names, the right way

By implementing the Stanford CoreNLP Toolset, we can essentially throw some text at it, and with a couple filters we can have a list of names contained within the text. So from the sentence above, we may get a result like this.  

[  
  {
    :name => "Will",
    :start => 4,
    :end => 8
  },
  {
    :name => "Smith",
    :start => 9,
    :end => 14
  },
  {
    :name => "Jada",
    :start => 51,
    :end => 55
  }
]

The ‘start’ and ‘end’ numbers refer to the string position of the name itself, and with a little magic it is possible to concatenate names if they appear together, giving a final result as follows.
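The “little magic” can be as simple as walking the list in order and gluing a name onto the previous one whenever the two sit side by side. This is a minimal sketch under the assumption that adjacent means a gap of at most one character (a single space); `merge_adjacent_names` is a name I made up for it.

```ruby
# Joins name tokens that sit next to each other in the source text.
# Two tokens count as adjacent when the gap between one token's :end
# and the next token's :start is at most one character.
def merge_adjacent_names(tokens)
  merged = []
  tokens.sort_by { |token| token[:start] }.each do |token|
    previous = merged.last
    if previous && token[:start] - previous[:end] <= 1
      # extend the previous entry instead of starting a new one
      previous[:name] = "#{previous[:name]} #{token[:name]}"
      previous[:end] = token[:end]
    else
      merged << token.dup # copy so the caller's hashes are not modified
    end
  end
  merged.map { |token| token[:name] }
end

tokens = [
  { name: 'Will',  start: 4,  end: 8 },
  { name: 'Smith', start: 9,  end: 14 },
  { name: 'Jada',  start: 51, end: 55 }
]
puts merge_adjacent_names(tokens).inspect
# => ["Will Smith", "Jada"]
```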

["Will Smith", "Jada"]

Names to emails

This matching problem is quite problematic considering the format that an article or piece of text might come in. In a perfect world, a person’s name would appear right next to their email address.

'John Smith can be contacted at john.smith@gmail.com, and Jill Ruth can be contacted at jill.ruth@gmail.com.'

If we know the position of the name and the position of the email address, this is no problem: we just write a routine to find the closest email to the person’s name. However, this breaks down pretty quickly.

'John Smith and Jill Ruth can be contacted respectively at john.smith@gmail.com, and jill.ruth@gmail.com.'

Or even worse…

'Article written by Edward Jones

... article body ...

Contact the writer at ejones@gmail.com'

The article body itself could also contain important names and email addresses as well, scattered as they please, so this calls for some more advanced parsing techniques.
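To make the position-based idea (and its failure) concrete, here is a rough sketch that pairs a name with the first email address appearing after it. “Closest” is interpreted here as “first one following the name”, which is my assumption, and `nearest_following_email` is a made-up helper name.

```ruby
EMAIL_PATTERN = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i

# Returns the first email address that appears after the name, or nil.
def nearest_following_email(text, name)
  name_at = text.index(name)
  return nil unless name_at
  # start searching right after the end of the name
  found = EMAIL_PATTERN.match(text, name_at + name.length)
  found && found[0]
end

easy = 'John Smith can be contacted at john.smith@gmail.com, ' \
       'and Jill Ruth can be contacted at jill.ruth@gmail.com.'
hard = 'John Smith and Jill Ruth can be contacted respectively at ' \
       'john.smith@gmail.com, and jill.ruth@gmail.com.'

puts nearest_following_email(easy, 'Jill Ruth')  # jill.ruth@gmail.com
puts nearest_following_email(hard, 'Jill Ruth')  # john.smith@gmail.com (wrong person)
```

The first sentence works, and the second hands Jill the wrong address, which is exactly the breakdown described above.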

The Levenshtein distance

My take on this problem is that we can throw name-position vs. email-position out the window; it is not reliable. The one thing we can rely on is that, in most professional situations, a person’s email has some reference to their name. 

This is where the Levenshtein distance algorithm comes in: it calculates the number of edits needed to transform one string into another. In our case, we are comparing a person’s name to their email. It quickly becomes apparent that the domain (everything from the @ onward) and any numbers can be removed from the email address, and normalizing the case is helpful before making the comparisons. Let’s look at some results for John Smith (or more specifically in lower case, “john smith”).

john.smith@gmail.com -> john.smith : 1 edit
j.smith123@gmail.com -> j.smith : 4 edits
john.walker.smith@gmail.com -> john.walker.smith : 8 edits
jwsmith@gmail.com -> jwsmith : 4 edits
jws3319@gmail.com -> jws : 8 edits

So that is pretty neat, and the next step is pretty clear: we need to test a bunch of common email address patterns against the person’s name and use the best score. So instead of making comparisons with just “john smith”, we can abstract the name into some common formats.

require 'amatch'
include Amatch

person = "John Smith"
email = "jsmith44@gmail.com"

# remove the domain and everything besides letters
m = Levenshtein.new(email.split('@').first.downcase.scan(/[a-z]/).join(''))

# run a standard set of tests against the person's name
first, last = person.downcase.split(' ').values_at(0, -1)
tests = []
tests << m.match(person.downcase.scan(/[a-z]/).join(''))  # full name: johnsmith
tests << m.match("#{first[0]}#{last[0]}")                  # initials: js
tests << m.match(first)                                    # first name: john
tests << m.match(last)                                     # last name: smith
# ... more patterns go here

best_result = tests.min

If this is run for every person and every email address found in an article, it will provide the best score for each person vs. email address.
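The name formats themselves are easy to generate up front. Here is a small sketch of how that list could be built, assuming the name has at least one word. The helper names are made up, and the “first initial + last name” variant is my own addition to the tests shown above.

```ruby
# Lowercases a name and builds a few address-style variants of it.
# The "first initial + last name" variant is an assumption added for this sketch.
def name_patterns(person)
  parts = person.downcase.split(' ').map { |part| part.gsub(/[^a-z]/, '') }
  first = parts.first
  last = parts.last
  [
    parts.join,               # johnsmith
    "#{first[0]}#{last[0]}",  # js
    first,                    # john
    last,                     # smith
    "#{first[0]}#{last}"      # jsmith
  ].uniq
end

# Drops the domain and anything that is not a letter.
def normalize_email(email)
  email.split('@').first.downcase.gsub(/[^a-z]/, '')
end

puts name_patterns('John Smith').inspect
# => ["johnsmith", "js", "john", "smith", "jsmith"]
puts normalize_email('jsmith44@gmail.com')
# => jsmith
```

Each pattern would then be compared to the normalized email with the Levenshtein comparator, keeping the lowest score.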

Scores into results

With any type of artificial intelligence, there is rarely a concept of “passing a test”; there are only various levels of failure. The goal is to simply minimize failure in the best way possible, and developing with any other intention can be a destructive process. Using our scores from the previous step, we attempt to award all the emails we found to the person most deserving. Consider the following sample set.

people = ["Matt Gaidica", "Brad Birdsall", "John Smith", "Grant Olidapo", "Minh Nguyen"]
emails = ["mattyg@gmail.com", "bradbirdman17@gmail.com", "grant.olidapo@gmail.com", "mn1@gmail.com"]

The scores we produced account for every name vs. every email, or 20 (5x4) unique values. We need some sort of reduction step to bring this set of 20 data points down to only 4 that directly relate names to emails, leaving one of our people email-less. After about 20 lines of magic, our algorithm spits out the results.

{
  "Matt Gaidica" => "mattyg@gmail.com",
  "Brad Birdsall" => "bradbirdman17@gmail.com",
  "Grant Olidapo" => "grant.olidapo@gmail.com",
  "Minh Nguyen" => "mn1@gmail.com"
}
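One simple way to do that reduction is a greedy pass: sort every (person, email) pair by score and hand out each email to the best-scoring person who does not have one yet. This is a sketch of that idea and not necessarily what the library does; `assign_emails` is a made-up name, and the scores below are invented purely to show the mechanics.

```ruby
# scores: { [person, email] => edit distance }, lower is better.
# Returns { person => email }, giving each email to at most one person.
def assign_emails(scores)
  assigned = {}
  taken = {}
  scores.sort_by { |_pair, score| score }.each do |(person, email), _score|
    # skip pairs where the person or the email is already spoken for
    next if assigned.key?(person) || taken.key?(email)
    assigned[person] = email
    taken[email] = true
  end
  assigned
end

# Made-up scores for three people and two addresses, for illustration only.
scores = {
  ['Ann Lee',  'alee@example.com'] => 1,
  ['Ann Lee',  'bobk@example.com'] => 6,
  ['Bob King', 'alee@example.com'] => 5,
  ['Bob King', 'bobk@example.com'] => 2,
  ['Cy Young', 'alee@example.com'] => 4,
  ['Cy Young', 'bobk@example.com'] => 7
}

assign_emails(scores).each { |person, email| puts "#{person} -> #{email}" }
# Ann Lee -> alee@example.com
# Bob King -> bobk@example.com
```

Cy Young ends up email-less, just like one of the five people in the sample set. A greedy pass does not guarantee the best overall pairing, and how ties are broken is left open here.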

Tip of the iceberg

I look at this as just one of the ways to accomplish this goal. This process can be heavily supplemented with machine learning techniques to produce better name recognition, and further develop common email address patterns for your specific type of article, document, or data set.

I have opened a library on GitHub called Textract, which includes the code for this entire process. My goal is to keep the problems simple, and the solutions simpler.

Jul 12

A Ruby wrapper for the (new) Basecamp API

This is about as structurally simple as it could get to interact with the new Basecamp API (using HTTParty for the RESTful stuff). Very few endpoints are supported right now, but this might help you get going.

Exporting to CSV on Mac’s Excel using PHP’s fgetcsv

PHP has a nice way of working with CSV files using the fgetcsv function. One of the drawbacks of the CSV format is dealing with which character is used for new lines, which is how nearly all CSV files separate rows.

If you are using Excel on Mac and do not want to set your own newline character in PHP (or don’t have access to it!), choose Windows Comma Separated (.csv) as the file type when you save the file.

Jul 11

The Stanford Natural Language Processing Group

Jul 06

String to Boolean in Ruby

A great couple of lines that add a `to_bool` method to the Ruby String class. This is really helpful when passing in boolean values from a query string (like in an API). Just put this in a file and require it!
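For reference, a version of the idea might look something like this. It is my own sketch and not the linked code, and the list of accepted spellings is an assumption.

```ruby
class String
  # Converts common truthy/falsy strings to real booleans.
  # The accepted spellings are an assumption; adjust them to taste.
  def to_bool
    return true  if match?(/\A(true|t|yes|y|1)\z/i)
    return false if empty? || match?(/\A(false|f|no|n|0)\z/i)
    raise ArgumentError, "invalid value for Boolean: \"#{self}\""
  end
end

puts 'true'.to_bool  # true
puts 'NO'.to_bool    # false
puts '1'.to_bool     # true
```

Raising on unknown values (instead of quietly returning false) makes a bad query string parameter fail loudly.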

Jun 19

rspec_helper.rb for RSpec in Ruby

require File.join(File.dirname(__FILE__), '..', 'app.rb')
require 'rack/test'

set :environment, :test
set :run, false
set :raise_errors, true
set :logging, false

def access_token
  "xxxxx"
end

def api_token
  "xxxxx"
end

def app
  Sinatra::Application
end

RSpec.configure do |config|
  config.include Rack::Test::Methods
end

#no database debug messages
ActiveRecord::Base.logger.level = Logger::INFO