Because the term "low-cost airline" is difficult to define, we agreed to bet on something easy to check, such as the ticket price per kilometer. After some rough checks the bet was on: travel around the world, always flying in one direction (west or east), for less than €0.03 (3 euro cents) per kilometer (imagine a Bologna-New York trip for less than €180).
TL;DR
I think I won. With some help from Python and many hours of coding, I was able to find all the necessary tickets and stay below the price criterion. I have learnt a lot about airline sales strategies... and something about the ant colony optimization algorithm. The trip starts on November 15th, so maybe you will meet me in the following months ;-)
Problem
Now the long story. Searching for a cheap plane ticket is a relatively easy task if you have strict dates and a simple route (one or two stops). It gets more complicated if you want to stop in 3 places. But what if you have 6 stops or more? None of the existing online tools allows you to make such a query (if you know one, let me know). Things get fuzzier if you don't have specific dates and just want to travel cheaply. Searching manually is not an option: ticket prices are likely to change daily and the number of possible queries is big, I mean really big. Imagine you want to make 5 stops and search with a margin of ±10 days. That gives you 9,765,625 queries (5^10). Grabbing that data directly from the airline databases is not doable either. There is no standard approach: most of the small companies have their own systems, and the others use providers that are much too expensive for a single user.
Sooner or later you will start to write a script.
Scrapy
I started to write the script using the Scrapy framework. I had used it many times before, so it was a pleasant surprise to see how much the community has grown. If you don't know Scrapy, you know Python and you like to extract data from the web, you must check it out. It's highly pluggable, it has several great add-ons (more on that topic later) and a friendly community.
Scrapy has two principal concepts: spider and pipeline. Spiders seem quite obvious: they are the place where you define the custom behavior for crawling and parsing pages for a particular site (or, in some cases, a group of sites). Pipelines, on the other hand, have several responsibilities:
- cleaning HTML data (if needed)
- validating scraped data (checking that the items contain certain fields)
- checking for duplicates (and dropping them)
- storing the scraped item in a database
After a while I had my Scrapy application ready. I had one spider for each of the websites that I wanted to search for tickets and several pipelines that validated each ticket and saved it if all the requirements were met.
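The pipeline responsibilities listed above can be sketched roughly as follows. This is only an illustration, not the code from my app: the item fields (`origin`, `destination`, `date`, `price`) and both class names are hypothetical, and the `try`/`except` around the import is there only so the sketch also runs without Scrapy installed. The `process_item(self, item, spider)` method and the `DropItem` exception are the parts that come from Scrapy itself.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch also runs without Scrapy installed
    class DropItem(Exception):
        """Stand-in for scrapy.exceptions.DropItem."""

# Hypothetical item fields, chosen for illustration only.
REQUIRED_FIELDS = ("origin", "destination", "date", "price")


class ValidateTicketPipeline:
    """Drop items that lack any of the required fields."""

    def process_item(self, item, spider):
        missing = [f for f in REQUIRED_FIELDS if item.get(f) is None]
        if missing:
            raise DropItem(f"missing fields: {missing}")
        return item


class DuplicateTicketPipeline:
    """Drop tickets that were already seen during this crawl."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        key = tuple(item[f] for f in REQUIRED_FIELDS)
        if key in self.seen:
            raise DropItem(f"duplicate ticket: {key}")
        self.seen.add(key)
        return item
```

In a real project such classes would be enabled through the `ITEM_PIPELINES` setting, with the validation step ordered before the storage step.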
Validation
This was the first problem :) The price per kilometer is relatively easy. You need to calculate the distance between the airports and you've got it:
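A minimal sketch of such a check, not the original snippet. It assumes the great-circle (haversine) distance between airport coordinates given as (latitude, longitude) in degrees; the function names `distance_km` and `is_cheap_enough` are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius
MAX_PRICE_PER_KM = 0.03   # the bet: less than EUR 0.03 per kilometer


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, using the haversine formula."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    # min() guards against rounding pushing sqrt(a) slightly above 1
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


def is_cheap_enough(price_eur, origin, destination):
    """True if the ticket costs less than MAX_PRICE_PER_KM per kilometer.

    origin and destination are (latitude, longitude) tuples in degrees.
    """
    km = distance_km(*origin, *destination)
    return km > 0 and price_eur / km < MAX_PRICE_PER_KM
```

The haversine formula treats the Earth as a sphere, which is accurate enough here: an error of a fraction of a percent in the distance does not change whether a ticket is far above or far below the limit.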
To simplify the calculation I populated my local Redis with the geolocations of all the main airports. Using Scrapy's pipeline behavior, I was then able to draw a new airport whenever the condition was not met.
The direction criterion was more complicated. It was quite important: I didn't want to send the spider into an infinite loop :) Going from Bologna to Paris should block all the airports east of Paris (like Munich), but the spider should still be able to cross long distances. Thanks to SO, the code looks like this:
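A minimal sketch of such a direction check, not the original snippet. It assumes the direction is decided by the signed longitude difference wrapped into [-180, 180), so that crossing the date line works; the function names are hypothetical.

```python
def longitude_delta(lon_from, lon_to):
    """Signed shortest longitude difference in degrees, in [-180, 180).

    Negative means the destination lies to the west, positive to the east.
    """
    return (lon_to - lon_from + 180.0) % 360.0 - 180.0


def keeps_direction(lon_from, lon_to, westward=True):
    """True if flying from lon_from to lon_to moves in the chosen direction."""
    delta = longitude_delta(lon_from, lon_to)
    return delta < 0 if westward else delta > 0
```

With this approach Bologna (about 11.3° E) to Paris (about 2.4° E) counts as westward, and Paris to Munich (about 11.6° E) is rejected. One limitation of the sketch: a single hop covering more than 180° of longitude is read as a move in the opposite direction.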
Scaling
Performance is an issue if the data you are searching for is likely to change each day. I couldn't wait a few days for the whole query to finish (a single-threaded crawl for my initial query took 27 hours). Most of those problems were solved thanks to the scrapy-redis plugin. It allows you to write distributed crawling and post-processing. After plugging it into my app I was able to start a small cloud server with several crawlers for a few hours each time I wanted to make a search. A Redis queue was more than enough for that purpose.
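For reference, this is roughly what plugging scrapy-redis into a project involves. It is a sketch of a `settings.py` fragment based on the plugin's documented settings, not a copy of my configuration; the Redis URL is an assumption and should point at your own instance.

```python
# settings.py (fragment)

# Keep the request queue in Redis so several crawler processes can share it.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across all crawlers through Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue between runs, so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True

# Assumed local instance; replace with the address of your Redis server.
REDIS_URL = "redis://localhost:6379"
```

Every crawler started with these settings pulls requests from the same Redis queue, so adding more processes or machines spreads the work without changes to the spiders.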
Next step
The app did its job. When we discussed it at RedTurtle, a lot of new ideas showed up. One of the annoying issues of the current implementation is the complexity of the results it produces: all the tickets are sorted by departure/arrival airport, not by the final route. This problem could be solved by storing the data in a graph and using one of the many graph algorithms to optimize it.
The code itself has never been published and, as it stands, probably shouldn't be :) I wrote it to solve my specific goal, so it's not properly documented. However, if you are interested in the topic, let me know. I can share the repository without any problems.
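To give an idea of the graph approach, here is a minimal sketch that is not part of the current app. It assumes tickets are simple `(origin, destination, price)` tuples, ignores dates entirely, and uses Dijkstra's algorithm to find the cheapest chain of tickets between two different airports; the function name `cheapest_route` is hypothetical.

```python
import heapq


def cheapest_route(tickets, start, goal):
    """Return (total_price, [airports]) for the cheapest chain of tickets.

    tickets is an iterable of (origin, destination, price) tuples.
    Returns None if the goal cannot be reached from the start.
    """
    # Build an adjacency list: airport -> [(next_airport, price), ...]
    graph = {}
    for origin, destination, price in tickets:
        graph.setdefault(origin, []).append((destination, price))

    queue = [(0.0, start, [start])]  # (cost so far, airport, path taken)
    best = {}                        # cheapest known cost per settled airport
    while queue:
        cost, airport, path = heapq.heappop(queue)
        if airport == goal:
            return cost, path
        if best.get(airport, float("inf")) <= cost:
            continue  # already settled this airport at a lower or equal cost
        best[airport] = cost
        for nxt, price in graph.get(airport, []):
            heapq.heappush(queue, (cost + price, nxt, path + [nxt]))
    return None
```

For a real round-the-world search the start and the final arrival are the same airport, so they would have to be modelled as two separate nodes, and the departure dates would need to become part of each node.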
See you back in 2015...