This post can only possibly appeal to about 12 people and only when they’re really in the mood for weedsy code stuff. However, that’s about my normal readership, so here we go…
For the OU Create project, we now have 5 servers that are managed by Reclaim Hosting. We’ve got more than 4000 users, and, collectively, they have more than 5000 websites. Keeping track of all of the usernames, domains, and apps in use is difficult.
One of the ways that we study the digital ecology of these servers is by looking at the data.db files created by each server’s instance of Installatron. These database files keep track of all of the web apps installed using Installatron. Thus we have a record of every installation of WordPress, Drupal, Omeka, MediaWiki, or any of the other 140 or so apps with Installatron one-click installation packages. I can find the user’s name, the date, and the domain, subdomain, or subdirectory used for each installation. However, within the data.db file, there are five tables for all of this data and it’s a SQLite file, so it’s not exactly a quick or easy read. Further complicating everything is that each server has it’s own data.db file and each one is buried deep in the directory structure amongst the tens of thousands of files on the server.
Here at OU, we have a couple of websites that were built as derivatives of studying the data.db files. The first was community.oucreate.com/activity.
This site is built on Feed WordPress. We feed it the urls for each WordPress site in OU Create and it aggregates the blog activity into one feed averaging 300+ posts a week.
The other site is community.oucreate.com/sites.
Sites provides a filterable set of cards displaying screen captures, links, and app metadata for all of the sites in OU Create. You can see what sites are running Omeka or Vanilla Forums or whatever other app you’d like.
To maintain these sites, I would normally go into each server and navigate up one level from the default to the server root, then go into the var directory, then the installatron directory, and then download the data.db file. Once I’ve downloaded it, I use some software to open the file, and then export the i_installs table to a csv. Then I find the installations since my last update of the system, copy the urls, and paste them into the Activity site or run a script against them for the Sites site. I repeat this for each server. This process is terrible and I hate doing it, so it doesn’t get done as often as it should.
This week, I wrote some code to do these updates for me. At the most basic level, the code uses secure shell (SSH) to login to a server and download a desired file. My version of the code loops (repeats) for each of my five servers downloading the data.db file and storing them all in one folder. Here is the code and below I’ll explain how I got here and why:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
import os import paramiko import sqlite3 import csv #The keypasses list holds the passwords for the keys to each of the servers keypasses=["xxxxxxxxxx", "xxxxxxxxxx", "xxxxxxxxxx", "xxxxxxxxxx", "xxxxxxxxxx"] counter = 1 csvWriter = csv.writer(open("output.csv", "w")) #loop through the keypass list getting the data.db files for keypass in keypasses: db = "data%s.db" % counter servername = "oklahoma%s.reclaimhosting.com" % counter keyloc="/Users/johnstewart/Desktop/CreateUserLogs/id_oklahoma%s" % counter k = paramiko.RSAKey.from_private_key_file(keyloc, password=keypass) ssh = paramiko.SSHClient() ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy()) print ("oklahoma%s connecting" % counter) ssh.connect(servername, username="root", pkey = k) print ("connected") sftp = ssh.open_sftp() localpath = '/Users/johnstewart/Desktop/CreateUserLogs/data%s.db' % counter remotepath = '../var/installatron/data.db' sftp.get(remotepath,localpath) sftp.close() print ("data%s.db downloaded" % counter) ssh.close() #export the desired table from the database to the aggregated csv with sqlite3.connect(db) as connection: c = connection.cursor() c.execute("SELECT * FROM i_users") rows = c.fetchall() csvWriter.writerows(rows) print ("%s i_users exported to csv" % db) counter=counter+1
On Monday, I wrote the first draft of this in bash and then rewrote it in a similar language called expect. With expect, I could ssh into a server and then respond to the various login prompts with the relevant usernames and passwords. However, this exposed the passwords to Man in the Middle attacks. If someone where listening to the traffic between my computer and the server, they would be able to intercept the username and password from within the file. This is obviously not the best way to do things.
The solution was to use an ssh key. These keys are saved to your local computer and provide an encrypted code to login to the server. You in turn use a password to activate the key on your own computer, so there’s no ‘man in the middle.’ Unfortunately, I had no idea how to do this. Luckily for me, Tim Owens is a fantastical web host and has a video explaining how to set up keys on Reclaim accounts:
I set up keys for each of the servers and saved them into my project folder. This also denecessitated the ‘expect’ script because I no longer needed to enter a password for each server.
I turned back to a bash shell script, but couldn’t figure out what to do with my .db files once I had downloaded them all. This morning I turned from bash to python which is very good at handling data files. Python also has the paramiko library, which simplifies the process for logging into and downloading files from servers. You can see in the loop part of the code above where I call several paramiko functions to establish a connection to the server and then use sftp to get the file I want.
Our servers are labeled numerically oklahoma1 through oklahoma5, and I had saved my keys for each server as id_oklahoma1 through id_oklahoma5, so it was easy to repeat the basic procedure for each server by writing a loop that repeated 5 times. Each time the loop occurs it just increases the number associated with the servers and keys. The loop also saves the data.db files locally as data1, data2, etc.
The last step was to use Python to compile the desired data from each of these data.db files. SQLite3 provided the needed methods for handling data files. I could connect to each database after I downloaded them. Then I called the table that I wanted from each table and “fetched” all of the rows from that table. From there, I can use the csv library to write those rows to a csv (an excel like, comma separated variable table). This whole process was part of the larger programmatic loop, so each time I pulled a database from a server, I was adding it’s table rows to the collective csv.
For me, this process will make it easy to pull a daily update of all of the domains in OU Create and then upload those to my two community websites. As we follow Tom Woodward and Marie Selvanadin’s work on the Georgetown Community site, these up-to-date lists of sites will make it easier to build sites that pull together information from the API’s of the many OU Create websites. The process could also be generalized to pull or post any files from a set of similar servers allowing for quicker maintenance and analysis projects. Hopefully Lauren, Jim, and Tim will find fun applications as they continue their Reclaim Hosting server tech trainings.