Tag: parsing

Rack/Rails middleware that will add rel=”nofollow” to all your links

Few years ago I wrote a post about adding rel=”nofollow” to all the links in your comments, news, posts, messages in Ruby on Rails. I've been using this solution for a long time, but recently it started to be a pain in the ass. More and more models, more and more content - having to always declare some sort of filtering logic in the models don't seem legit any more. Instead I've decided to use a different approach. Why not use a Rack middleware that would add the nofollow rel to all "outgoing" links? That way models would not be "polluted" with stuff that is directly related only to views.

Nokogiri is the answer

To replace all the rel attributes, we can use Nokogiri. It is both convenient and fast:

require 'nokogiri'

doc = Nokogiri::HTML.parse(content)

doc.css('a').each do |a|
  a.set_attribute('rel', 'noindex nofollow')
end

doc.to_s

Small corner cases that we need to cover

Unfortunately there are some cases that we need to cover, so simple replacing all the links is not an option. We should not add nofollow when:

  • There's already a rel defined on an anchor
  • There are local links that should be "followable"
  • There are local links with a full domain in them
  • We want to narrow anchor seeking to a given css selector (we want to leave links that are in layout, etc)

If we include all of above, our code should look like this:

require 'nokogiri'

doc = Nokogiri::HTML.parse(content)
scope = '#main-content'
host = 'mensfeld.pl'

doc.css(scope + ' a').each do |a|
  # If there's a rel already don't change it
  next unless a.get_attribute('rel').blank?
  # If this is a local link don't change it
  next unless a.get_attribute('href') =~ /\Awww|http/i
  # Don't change it also if it is a local link with host
  next if a.get_attribute('href') =~ /#{host}/

  a.set_attribute('rel', 'noindex nofollow')
end

Hooking it up to Rack middleware

There's a great Rails on Rack tutorial, so I will skip some details.

Our middleware needs to accept following options:

  • whitelisted host
  • css scope (if we decide to norrow anchor seeking)

So, the initialize method for our middleware should look like this:

# @param app [SenpuuV7::Application]
# @param host [String] host that should be allowed - we should allow our internal
#   links to be without nofollow
# @param scope [String] we can norrow to a given part of HTML (id, class, etc)
def initialize(app, host, scope = 'body')
  @app = app
  @host = host
  @scope = scope
end

Each middleware needs to have a call method:

# @param [Hash] env hash
# @return [Array] full rack response
def call(env)
  response = @app.call(env)
  proxy = response[2]

  # Ignore any non text/html requests
  if proxy.is_a?(Rack::BodyProxy) &&
    proxy.content_type == 'text/html'
    proxy.body = sanitize(proxy.body)
  end

  response
end

and finally, the sanitize method that encapsulates the Nokogiri logic:

# @param [String] content of a response (body)
# @return [String] sanitized content of response (body)
def sanitize(content)
  doc = Nokogiri::HTML.parse(content)
  # Stop if we could't parse with HTML
  return content unless doc

  doc.css(@scope + ' a').each do |a|
    # If there's a rel already don't change it
    next unless a.get_attribute('rel').blank?
    # If this is a local link don't change it
    next unless a.get_attribute('href') =~ /\Awww|http/i
    # Don't change it also if it is a local link with host
    next if a.get_attribute('href') =~ /#{@host}/

    a.set_attribute('rel', 'noindex nofollow')
  end

  doc.to_s
# If anything goes wrong, return original content
rescue
  return content
end

Usage example

To use it, just create an initializer in config/initializers of your app with following code:

require 'nofollow_anchors'

MyApp::Application.config.middleware.use NofollowAnchors, 'mensfeld.pl', 'body #main-content'

also don't forget to add gem 'nokogiri' to your gemfile.

Performance

Nokogiri is quite fast and based on benchmark that I did, it takes about 5-30 miliseconds to parse the whole content. Below you can see time and number of links (up to 488) per page. Keep that in mind when you will use this middleware.

perf

TL;DR - Whole middleware

require 'nokogiri'

# Middleware used to ensure that we don't allow any links outside without a
# nofollow rel
# @example
#   App.middleware.use NofollowAnchors, 'example.com', 'body'
class NofollowAnchors
  # @param app [SenpuuV7::Application]
  # @param host [String] host that should be allowed - we should allow our internal
  #   links to be without nofollow
  # @param scope [String] we can norrow to a given part of HTML (id, class, etc)
  def initialize(app, host, scope = 'body')
    @app = app
    @host = host
    @scope = scope
  end

  # @param [Hash] env hash
  # @return [Array] full rack response
  def call(env)
    response = @app.call(env)
    proxy = response[2]

    if proxy.is_a?(Rack::BodyProxy) &&
      proxy.content_type == 'text/html'
      proxy.body = sanitize(proxy.body)
    end

    response
  end

  private

  # @param [String] content of a response (body)
  # @return [String] sanitized content of response (body)
  def sanitize(content)
    doc = Nokogiri::HTML.parse(content)
    # Stop if we could't parse with HTML
    return content unless doc

    doc.css(@scope + ' a').each do |a|
      # If there's a rel already don't change it
      next unless a.get_attribute('rel').blank?
      # If this is a local link don't change it
      next unless a.get_attribute('href') =~ /\Awww|http/i
      # Don't change it also if it is a local link with host
      next if a.get_attribute('href') =~ /#{@host}/

      a.set_attribute('rel', 'noindex nofollow')
    end

    doc.to_s
  rescue
    return content
  end
end

Using MongoDB to store and retrieve CSV files content in Ruby

There come cases, when we want to store CSV or any other sort of files data in a database. The problem occurs when the input files differs (they might have different columns). This would not be a problem if we would be able to know the files specification before parsing and inserting into DB. One of the solutions (the fastest) would be to store them in separate DB tables (each filetype / table). But what we should do, when column amount and their names are unknown? We could use SQL database, create table for data and another table for mapping appropriate DB columns to CSV columns. This might work, but it would not be an elegant solution. So what could we do?

MongoDB to the rescue!

This case is just perfect for MongoDB. MongoDB  is a scalable, high-performance, open source NoSQL database that allows us to create documents what have different attributes assigned to them. To make it work with Ruby you just need to add this to your gem file:

gem "mongoid"

Then you need to specify Mongoid yaml config file and you are ready to go (the MongoDB installation instructions can be found here):

Mongoid.load!("./mongoid.yml", :production)

MongoID example config file

Here you have really small Mongoid config file:

production:
  sessions:
    default:
      hosts:
        - localhost:27017
      database: csv_files
      username: csv
      password: csv
  options:
    allow_dynamic_fields: true
    raise_not_found_error: false
    skip_version_check: false

You can use your own config file, just remember to set allow_dynamic_fields to true!

CSV parsing

We will do a really simple CSV parsing. We will store all the values as strings, so we just need to read CSV file and create all needed attributes in objects that should represent each file row:

class StoredCSV
  include Mongoid::Document
  include Mongoid::Timestamps

  def self.import!(file_path)
    columns = []
    instances = []
    CSV.foreach(file_path) do |row|
      if columns.empty?
        # We dont want attributes with whitespaces
        columns = row.collect { |c| c.downcase.gsub(' ', '_') }
        next
      end

      instances << create!(build_attributes(row, columns))
    end
    instances
  end

  private

  def self.build_attributes(row, columns)
    attrs = {}
    columns.each_with_index do |column, index|
      attrs[column] = row[index]
    end
    attrs
  end
end

Thats all! Instead of creating SQL tables and doing some mappings - we just allow MongoDB to dynamically create all the fields that we need for given CSV file.

Usage

StoredCSV.import!('data.csv')
stored_data = StoredCSV.all

Checking attribute names for given object - don't use Mongoid attribute_names method

You need to remember, that various instances might have (since they come from different files) different attributes, so you cannot just assume that all will have field "name". There is a Mongoid method attribute_names, but this method will return only predefined attributes:

StoredCSV.first.attribute_names => ["_type", "_id", "created_at", "updated_at"]

To obtain all the fields for given instance you need to do something like this

StoredCSV.first.attributes.collect{|k,v| k} => ["_id", "name", "format", "description"]

Summary

This was just a simple example but this should also be a good base for a bigger and better solution. There should be implemented more complex key extracting mechanism with prefix (this would protect from reassigning protected values (like "_id") and whole bunch of other improvements ;)

Copyright © 2024 Closer to Code

Theme by Anders NorenUp ↑