I've spent a good part of the last week improving importing to make sure that the nightmare import that happened last week won't repeat itself. The amount of data had to be processed has been doubled since last time, and still the live import took only 30 minutes. To be honest it took 10 minutes, the rest was the purging of the expired disk cache. So, what changed?
This is the third generation of the importing strategy at Jabbithole, to understand where we are and how we arrived here let me explain the old and dismissed strategies.
The first gen
It is no secret that Jabbithole runs on Ruby on Rails, so it was natural to utilize the Rails task system to implement importing. The task processes the xml files you send, pulls up the related database objects, updates them where necessary, then writes them back. This is what's still happening, except it is not being run on the live site anymore. It is reliable, it works great, but it is as slow as hell, especially when running parallel with the live system. The first approach was to shut down the site and let the importer do its job and it was fine for a while, but beyond a few hundred xml files it ran so long that it was clear that it is not a good practice.
The second gen
Once it was clear that the import task can not be run on the live server the problem changed: how to get the changes into the live database while keeping database records consistent? Digging deeper into Rails revealed that the internal database updates can be logged, and with some magic added the second gen importer logged all the database updates in a format that could be executed directly against the live database. The only problem is that these logs are huge, and running it on the live site still interferes with normal content serving. It can't be run in a transaction or else it will keep the site hanging while the import is running, and outside of a transaction it runs extremely slow. It was clear that this approach is also far from being optimal, especially after what happened last week.
And now
Analyzing the files produced by the second gen importer revealed a few potential bottlenecks (and bugs: why do I update schematics 350.000 times?), so at first I tried to optimize and fix those. The result was a greatly reduced SQL file, but it was still too large to be able to import it efficiently with minimal downtime. Then came the idea: the live database and the last phase of the importing database are practically identical, except for the user generated content that might occur while the import is running. So, why not scrap the live database and overwrite it directly with the result of the import, instead of juggling with the updates? Carefully saving the user generated content of course, and re-importing that as well at the end, so nothing is lost. The result: 2500+ files imported in 10 minutes versus 1100+ files imported in 22 hours. The preparation of the data still runs for about a day, but that does not interfere with the live site at all. I think I'll stick to this strategy for a while. ;)
Thanks for sharing; it's always interesting to hear the story behind the improvements.
ReplyDeleteQuestion: Why not import each XML file when it's uploaded? This prevents a backlog requiring downtime from forming, and gets the item DB updated more regularly.
It would be unwise to run the import directly against the live database. First, processing XMLs this large requires a lot of resources and that would affect the performance of the website in an uncontrolled manner, which would be bad. Second, data coming from the outside world shall not be trusted right away. There might be issues with the XML that need to be caught, like inconsistencies due to a client crash, or an unnoticed change in the game API. Once the XMLs are found to be good they are imported into a test database to see that the import process executes without an error, as there still might be obscure issues with individual XML files. The process logs all the changes which goes through a manual review to make sure that everything's gone as planned. If things are fine then the same import process is executed against an identical copy of the live database which is again reviewed. If this last step is done without a glitch then the data is ready to go live. The live import itself takes about 10-15 minutes, the majority of the downtime is the waiting for the disk cache to be purged (which will not be an issue anymore after next week). So it all comes down to 15 minutes of weekly downtime vs a constantly overloaded server and potentially compromised data integrity.
Delete