
StreamingGuide Crawler

Author: Charli

Feb 01, 2021 - Reading time: 13 min

Project Background

While developing the StreamingGuide, we decided that it would make sense to extract the crawling and scraping part of the application into an independent service.
At the start of the project, we got some ballpark estimates and price ranges for such a service. I decided that it was too expensive and that the money was better spent elsewhere, so I decided to, once again, get my hands dirty and start coding.

The task was clear: we need to know which movies (and series) are available on which services. So if you go to the Game of Thrones series profile, we want a link to the show at HBO. This way, users can navigate to the streaming service once they have decided to watch a particular movie or series.

So I created a service that crawls the streaming providers' content and URLs every night, matches it on IMDb id, and exposes the URLs in an API.

Tech stack

For this project, I used Ruby (and Ruby on Rails). It might not be the obvious choice, since both Python with Beautiful Soup and Node.js with Puppeteer are well suited to these projects, and both should be faster than plain old Ruby.
I chose Ruby because it is my favorite language and the language in which I am most productive. I did also worry about single-threaded JavaScript, but I am sure it wouldn't have been a problem.

Another reason for choosing Ruby was that crawling/scraping speed wasn't important. I would rather crawl slower and keep the server cost at a minimum (which, of course, is also possible with the other languages).
The issue with choosing Ruby is that it might be hard to find a company to take over its maintenance.
For that reason, I plan to rewrite it in another language at some point. It is a great opportunity to learn a new language or get more familiar with one of the others. :)
Maybe I should do it in Go?
I'll figure it out when I take up the challenge.

Application interface

I started by creating an application interface. I knew that I needed a range of data points from the crawlers to be able to create the content.
The application interface is a set of general methods that the crawlers call with a number of arguments.
Every crawler should invoke the methods with arguments containing the relevant data. It is a perfect example of how to use metaprogramming to create a general interface that works for all the different crawlers.
The whole point of doing this is that it should be very easy to add new crawlers and adjust current ones.

With the application interface done, all the remaining coding happens inside the individual crawlers for each streaming provider.

Here is an example of one of the methods:

# This method takes the newly crawled items, the provider and the content type.
# It compares the crawled items with the records stored for that provider,
# deletes the stored records that are not present in the crawl, and returns
# the crawled items that are not stored yet so they can be created.

def self.create_or_delete_provider_content(new_items, provider, content_type)

  # starts a new report
  report = Report.new 
  # Add provider and content_type
  report.provider = provider
  report.content_type = content_type
  report.total_records_crawled = new_items.count

  # get all provider content records for this provider
  old_items = ProviderContent
              .where(provider_id: provider.id)
              .joins(:content)
              .merge(Content.where(content_type: content_type))

  # To be created

  # Here we need to compare new_items urls with old_items urls
  # and save the new item structs in the create_records variable
  # create_records = new_items - old_items
  old_urls = old_items.pluck(:url)

  # Keep only the crawled items whose URL is not stored yet
  create_records = new_items.reject { |item| old_urls.include?(item.url) }

  # Adds total records to be created to report
  report.content_created = create_records.count
  # Run the create content on the returned array and pass the provider and
  # content_type as arguments

  # To be deleted
  # Here we need to compare the 2 again, BUT we need to
  # save the old_item provider_content records in delete_records
  # delete_records = old_items - new_items
  new_urls = new_items.map(&:url)

  # Keep only the stored records whose URL was not crawled this time
  delete_records = old_items.reject { |item| new_urls.include?(item.url) }

  # Adds total records to be deleted to report
  report.content_deleted = delete_records.count

  # Delete all records that are no longer available on the streaming service
  delete_records.each(&:destroy)

  report.save!

  # Returns the array of new items to be created
  return create_records
end

This is one of the main methods in the interface. It is called with data from the crawlers, which is then compared with the current data in the database.
This is important because we need to know what content was added, but also whether anything was removed from the streaming service.
Doing this comparison in memory is a simple way to make sure that the correct movies/shows are added and/or deleted.
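
The comparison can also be tried out without a database. The following is only a minimal sketch, not the method used in the service: CrawledItem, StoredItem and diff_by_url are made-up names, and a Set is swapped in for the URL arrays so that each lookup stays cheap when a provider has thousands of records.

```ruby
require 'set'

# Hypothetical stand-ins for a crawled item and a stored record.
CrawledItem = Struct.new(:title, :url)
StoredItem  = Struct.new(:id, :url)

# Returns [to_create, to_delete]: crawled items whose URL is not stored yet,
# and stored items whose URL no longer shows up in the crawl.
def diff_by_url(new_items, old_items)
  old_urls = old_items.map(&:url).to_set
  new_urls = new_items.map(&:url).to_set

  to_create = new_items.reject { |item| old_urls.include?(item.url) }
  to_delete = old_items.reject { |item| new_urls.include?(item.url) }

  [to_create, to_delete]
end

crawled = [
  CrawledItem.new('Movie A', 'https://streamingprovider/a'),
  CrawledItem.new('Movie C', 'https://streamingprovider/c')
]
stored = [
  StoredItem.new(1, 'https://streamingprovider/a'),
  StoredItem.new(2, 'https://streamingprovider/b')
]

to_create, to_delete = diff_by_url(crawled, stored)
puts to_create.map(&:title).inspect # => ["Movie C"]
puts to_delete.map(&:id).inspect    # => [2]
```

Items that exist on both sides are left alone, which is why only the URL is compared here.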

Another example of one of the methods in the application interface is one where content is created.

def self.create_content(title, imdb, content_type)
  if imdb.nil?
    # Without an IMDb id we fall back to the title. Using find_or_create_by
    # means we don't create the same content over and over.
    Content.find_or_create_by(title: title) do |content|
      content.title = title
      content.content_type = content_type
    end
  else
    Content.find_or_create_by(imdb: imdb) do |content|
      content.title = title
      content.imdb = imdb
      content.content_type = content_type
    end
  end
end

This method is called with each crawled element from the crawlers. The point is simply to match the movie or show with one in the database and, if there is no match, to create a new one.
The IMDb id is important since this is the value the API uses to query the movie, but it is also the value I use to match the movies.

Some streaming providers do not include any reference to IMDb, which makes it harder to match the content with a movie/show.
Therefore if there is no IMDb from the crawlers, another method in the application interface will do a lookup to retrieve the IMDb id with the data that's available.
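
How that lookup works is not shown here, so the following is only a sketch of one possible approach: match on a normalized title plus the production year, and give up when the match is ambiguous. CatalogueEntry, normalize_title, lookup_imdb and the ids in the sample data are all made up for illustration; the real data source could be anything.

```ruby
# Hypothetical catalogue entry; the real lookup source is not described.
CatalogueEntry = Struct.new(:imdb, :title, :year)

# Lowercase the title and strip punctuation so "Example Movie!" and
# "example movie" compare equal.
def normalize_title(title)
  title.downcase.gsub(/[^[:alnum:] ]/, '').squeeze(' ').strip
end

# Returns the IMDb id when exactly one catalogue entry matches, else nil.
# A year of nil means "match on title only".
def lookup_imdb(title, year, catalogue)
  wanted = normalize_title(title)
  matches = catalogue.select do |entry|
    normalize_title(entry.title) == wanted && (year.nil? || entry.year == year)
  end
  matches.size == 1 ? matches.first.imdb : nil
end

# Placeholder data, not real IMDb ids.
catalogue = [
  CatalogueEntry.new('tt0000001', 'Example Movie', 1999),
  CatalogueEntry.new('tt0000002', 'Example Movie', 2015),
  CatalogueEntry.new('tt0000003', 'Another Show', 2011)
]

lookup_imdb('Example Movie!', 2015, catalogue) # => "tt0000002"
lookup_imdb('Example Movie', nil, catalogue)   # => nil, the title alone is ambiguous
```

Returning nil on an ambiguous match is a deliberate choice in this sketch: a missing link is easier to spot and fix by hand than a link to the wrong movie.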

Crawlers

The crawlers come in two kinds: one where I can access the streaming provider's API and fetch the data directly, and one where I scrape the data through a simulated web browser.

Fetching the data directly from the API is way better, not only for performance but also for maintainability and reliability.
You will rarely see changes in the API or the data.

When scraping the site, I rely on the HTML and its structure to fetch the right data.
This is slower, harder to maintain, and less reliable.

Every time the streaming provider updates their UI, I have to adjust the crawler to fit the new structure, so there is more maintenance.
It seems to be less reliable too, since it depends on an actual browser visiting the site. A lot of things can go wrong in the process :)
And it is a lot slower to scrape compared to fetching the data directly from the API.
I found that the GraphQL APIs in particular are very fast, but I guess that's no surprise.

The crawlers are written in plain old Ruby and run as cron jobs every night.
Their only responsibility is to fetch all the content from the streaming providers and then send it to the application interface described above.

Here is an example with GraphQL:

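  # The class header was cut from this snippet. The class name and base_uri
  # below are placeholders, and `post` / `parsed_response` are assumed to
  # come from the HTTParty gem.
  class StreamingProviderCrawler
    include HTTParty
    base_uri 'https://streamingprovider'

    QUERY = <<~GRAPHQL.freeze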
      query Search($type: String!, $page: Int) {
        search(type: $type, limit: 20, page: $page, sortBy: "title") {
          totalHits
          items {
            id
            title
            description

            ... on Movie {
              relativeUrl
              year
              imdb {
                id
              }
            }

            ... on Series {
              relativeUrl
              year
              imdb {
                id
              }
            }
          }
        }
      }
    GRAPHQL

    def all_movies
      extract('movie')
    end

    def all_series
      extract('series')
    end

    private

    def extract(type)
      page = 0
      products = []

      klass = Struct.new(:title, :production_year, :url, :imdb)

      loop do
        results = self.class.post(
          '/graphql',
          body: {
            query: QUERY,
            variables: {
              type: type,
              page: page
            }
          }.to_json,
          headers: {
            'Content-Type' => 'application/json'
          }
        )

        results = results.parsed_response['data']['search']

        # An empty page means we have fetched everything
        break if results['items'].empty?

        # concat mutates products in place, so no reassignment is needed
        products.concat(
          results['items'].map do |result|
            klass.new(
              result['title'],
              result['year'],
              "https://streamingprovider#{result['relativeUrl']}",
              result.dig('imdb', 'id')
            )
          end
        )

        page += 1
      end

      products
    end
  end

In the end, it returns products, an array of items with title, year, URL, and IMDb id.
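
To show how such a crawler could be plugged into a general interface, here is a small sketch. It is an assumption about the wiring, not the actual code: FakeCrawler, CRAWLER_METHODS and run_crawler are invented names, and the block stands in for the application interface. The only convention taken from the crawler above is that it responds to all_movies and all_series and returns structs with title, production_year, url and imdb.

```ruby
# Hypothetical item shape, mirroring the struct built in the crawler above.
Item = Struct.new(:title, :production_year, :url, :imdb)

# Stand-in crawler returning canned data instead of calling a real API.
class FakeCrawler
  def all_movies
    [Item.new('Example Movie', 2015, 'https://streamingprovider/m/1', 'tt0000002')]
  end

  def all_series
    [Item.new('Another Show', 2011, 'https://streamingprovider/s/1', nil)]
  end
end

# Maps a content type to the crawler method that fetches it.
CRAWLER_METHODS = { 'movie' => :all_movies, 'series' => :all_series }.freeze

# Calls each fetch method dynamically and hands the result to a block,
# which plays the role of the application interface here.
def run_crawler(crawler)
  CRAWLER_METHODS.each do |content_type, method_name|
    # Skip content types this crawler does not support
    next unless crawler.respond_to?(method_name)

    yield crawler.public_send(method_name), content_type
  end
end

received = {}
run_crawler(FakeCrawler.new) do |items, content_type|
  received[content_type] = items.map(&:title)
end
puts received.inspect
```

With this shape, adding a provider only means writing a class that responds to the same two methods; a provider that only has movies can simply leave out all_series.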

Hosting

This is one of the places where some unexpected issues occurred.
As mentioned earlier, speed was not something I had to worry about. On the other hand, I wanted the server to be strong enough not to crash when handling 10,000 records in memory.

I started out thinking: let's be fancy and host it on AWS. Some kind of autoscaling solution that would be super fast and awesome.
After the first bill, it became very clear to me that it would actually be cheaper to host on some kind of VPS, where you don't pay for data.
Hello to you, heroku.com!
My favorite place to host. Not because it is cheap or fancy, just because it's really, really, really easy!

As soon as the hosting situation was settled, another issue crept in.
When you access a streaming service, it looks at your IP address to determine your location.

This is important because different content is available in different countries.
The reason for this is that distribution rights are bought (or rented) per country. So maybe Netflix has distribution rights for Fifty Shades of Grey in Germany, but not in Denmark. For whatever reason ...
Since I am in Denmark and the StreamingGuide's only current purpose is to serve a Danish audience, I only needed content that's available in Denmark.
That required a Danish IP. Heroku uses AWS for hosting, but there is no option to select Denmark as a country, neither at AWS nor at Heroku. This makes sense, since AWS does not have servers in Denmark.
So unless I wanted to relocate the API and migrate it to some Danish host, I needed to find a way to proxy the IP.

At this point, we were slowly moving out of my comfort zone, so I got some help. (Thank you, Ian Murray!)
I found a cheap VPS host, and on a plain Linux machine we set up a proxy that we could route the traffic through.

Now we have a cheap host and a Danish IP. That's all we wanted.
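
On the Ruby side, routing requests through a proxy can look roughly like this. It is a sketch using Net::HTTP from the standard library rather than the HTTP client the crawlers actually use, and the proxy host, port and target URL are placeholders, since the real values are not part of this post.

```ruby
require 'net/http'
require 'uri'

# Placeholder values: the real proxy host and port are not given here.
PROXY_HOST = 'proxy.example.dk'
PROXY_PORT = 3128

# Builds a Net::HTTP client that sends its requests through the proxy.
# No connection is opened until a request is actually made.
def proxied_client(url, proxy_host: PROXY_HOST, proxy_port: PROXY_PORT)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port, proxy_host, proxy_port)
  http.use_ssl = (uri.scheme == 'https')
  http
end

client = proxied_client('https://streamingprovider.example/graphql')
puts client.proxy?        # => true
puts client.proxy_address # => proxy.example.dk
```

The streaming provider then sees the IP of the proxy machine instead of the IP of the application host.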

Notifications and reports

Since this is a service that's running all the time with a lot of stuff going on, I needed some kind of notification and reporting system.
As you might have noticed, the application interface creates a report each time a crawler feeds it content.
This report counts the records that are crawled, created, and deleted. If crawled is suddenly zero, I know something went wrong. :)

Created and deleted are not being used currently, but at some point I will make an API endpoint with "new content". Once that happens, they will become relevant.
On a side note, if the number of deleted records equals the number of crawled records, something is wrong too! :)
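
These two rules of thumb are easy to turn into an automatic check. The sketch below assumes a plain struct with the same three counters as the report; ReportSummary and suspicious_report_reasons are made-up names, and how the warnings are delivered is left out.

```ruby
# Hypothetical stand-in for the Report model, with the same three counters.
ReportSummary = Struct.new(:total_records_crawled, :content_created, :content_deleted)

# Returns a list of warnings; an empty list means the crawl looks plausible.
def suspicious_report_reasons(report)
  reasons = []
  if report.total_records_crawled.zero?
    reasons << 'nothing was crawled'
  elsif report.content_deleted == report.total_records_crawled
    reasons << 'as many records were deleted as were crawled'
  end
  reasons
end

suspicious_report_reasons(ReportSummary.new(0, 0, 120))     # => ["nothing was crawled"]
suspicious_report_reasons(ReportSummary.new(500, 500, 500)) # => ["as many records were deleted as were crawled"]
suspicious_report_reasons(ReportSummary.new(500, 3, 1))     # => []
```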

I added a very small notification system too, where I create a notification when a movie or show wasn't found on IMDb.

Actually, I added a few other checks, where I create a notification if something could be wrong with the content.
This goes for specific content, whereas the reports are for the crawling process.
The notifications help me with the manual work, because sometimes the crawler finds a movie but is unable to match it (or matches it with the wrong movie).

Lastly, I also added Rollbar to get a notification when there's been an application error. Again, this can happen, and I would like to give myself the best possible chance of fixing it quickly.
Rollbar is awesome.

Other features

I realized that it wasn't possible to have this API running in production without some kind of admin interface where we could edit and adjust the content.
From time to time, a piece of content will be matched with the wrong movie or show.
Also, the option to create something manually might become relevant.

I added Active Admin, and with a standard setup it serves the purpose. It is also something I can expand upon if needed.

Features we should have developed

The only thing that worries me about this running in production is the uncertainty. I do not know for certain whether a crawler crawled all the content, or whether something is missing or matched wrong.
There will be times when we notice that a link to one of the streaming providers' content points to the wrong content because the matching failed, or points to a page that doesn't exist.
And then there are all the things I haven't thought of.

So far I've been pleasantly surprised by how well it's working and with how little maintenance I had to do.

At the time of writing, it's crawling 15 streaming providers, and it's been running for 8 months with only a few minor issues.