
[Image: a 404 error web page]

Checking the status for 5200 websites

OU Create

The University of Oklahoma offers all of its students, staff, and faculty their own web domains through the OU Create program. Anyone who wants to can sign up for a free subdomain (something like yourname.oucreate.com) or pay $12/year for a top-level domain (something like yourname.com). We provide 5GB of storage, tech support, and workshops to help people get started.

Once someone has registered for OU Create, they can use the Installatron tool that is built into the system to launch a one-click install of WordPress, Drupal, Omeka, and about 150 other web-apps. Right now our Installatron logs show a little more than 5200 active installations of WordPress, accounting for more than 80% of all the active registered domains. This does not include the inactive installations of people who have deleted their accounts or left the university and taken their websites with them.

I’ve been curious for a while as to the status of all of these sites. How many of these installations are actively being used? How many were used for a course at some point but have not been updated for a while? How many installations were set up but never really used? Are there any sites that are glitching, and do the users know about those glitches?

Checking Statuses

As an administrator, I can use the Installatron logs to see the URLs for all of the WordPress installations on a server. We have 5 servers, so I wrote a Python script that quickly pulls that information from each server and compiles it into one big list.
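That compile step can be as simple as concatenating the per-server exports. Here is a minimal sketch of the idea; the file names and helper function are hypothetical, not the actual script:

```python
import csv

def combine_url_lists(server_files, combined_path):
    # append every row from each per-server export into one master list
    with open(combined_path, 'w', newline='') as out:
        writer = csv.writer(out)
        for path in server_files:
            with open(path, newline='') as f:
                writer.writerows(csv.reader(f))

# e.g. combine_url_lists(['urls1.csv', 'urls2.csv'], 'urls.csv')
```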

Next I wrote another Python script (the link goes to a GitHub gist, and the code is also below) that ingested that list of URLs, requested each URL, and recorded the “HTTP status code” for the site. A status code of 200 means that the site loaded as expected. A status code of 301 means the URL redirected me to another page that then loaded. A status code of 400, 401, or 404 means there was some sort of error loading the page. There were also pages where the system couldn’t find anything at all, so I told my script to record these as “no connection.”

import urllib3
import csv

#create a urllib3 connection pool for making the requests
http = urllib3.PoolManager()

rows = []
#load the csv list of URLs to check
with open('urls.csv', 'r') as input:
    csv_reader = csv.reader(input)
    for blog in csv_reader:
        #check blogs in list for status
        try:
            #this get request will only retry a url 3 times, does not preload the site data, and records the status but does not follow a redirect
            resp = http.request('GET', blog[0], retries=3, preload_content=False, redirect=False)
            line = [blog[0], resp.status]
        except urllib3.exceptions.MaxRetryError:
            #if there's a domain not found error, it will be caught by this except
            line = [blog[0], "no connection"]
        rows.append(line)

#write the results out to a new csv
with open('urlStatus.csv', 'w') as output:
    csv_writer = csv.writer(output)
    csv_writer.writerows(rows)

When I tested the script on a list of 50 URLs, it worked, but it took several minutes. When I tried to run it on all 5000+ URLs, it ran for over two hours with no end in sight. Ultimately, I split the list up into five smaller lists and ran the script simultaneously on each of the five lists. It worked, but I left it running overnight, so I’m not entirely sure how long it took. This morning I compiled the five output .csv files.
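Splitting the list by hand and running five copies gets the job done, but the same speed-up can come from checking URLs concurrently within one script. A minimal sketch using Python’s concurrent.futures; the worker count and helper names are my own, not from the original gist:

```python
import concurrent.futures
import urllib3

http = urllib3.PoolManager()

def check(url):
    # return the status code, or "no connection" if the lookup fails
    try:
        resp = http.request('GET', url, retries=3,
                            preload_content=False, redirect=False)
        return [url, resp.status]
    except urllib3.exceptions.MaxRetryError:
        return [url, "no connection"]

def check_all(urls, workers=20):
    # run up to 20 checks at once instead of one at a time
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(check, urls))
```

With 20 threads, most of the two-hour wall-clock time (which is nearly all network waiting) collapses, since each thread spends its time blocked on a different slow site.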

The end result was that 3800 sites loaded with no problem. An additional 1100 sites redirected once, and then loaded. I think many of these were simply making a minor edit to my URL by adding a “/” to the end, pushing me from something like mysite.oucreate.com/blog to mysite.oucreate.com/blog/.
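That trailing-slash hypothesis is easy to test against the data: a 301 response carries the redirect target in its Location header, which can be compared with the original URL. A small sketch (the helper name is mine):

```python
def is_trailing_slash_redirect(url, location):
    # True when the redirect target is just the original URL plus "/"
    return location is not None and location == url + '/'

# the target would come from something like resp.headers.get('Location')
assert is_trailing_slash_redirect('http://mysite.oucreate.com/blog',
                                  'http://mysite.oucreate.com/blog/')
```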

Of the 23 sites that had 400-level status errors, 9 were actually redirects that my script didn’t recognize for some reason. The other 14 have real issues, and I’ll follow up with their owners to help them either fix the sites or delete them if they’re no longer using them.

The remaining 282 sites returned “no connection.” Many of these are smaller sites that were once installed as subdomains and subdirectories of main sites, but are no longer in use. Others are sites whose URLs have expired. I’m going to follow up with the owners of these expired sites. Where they were intentionally allowed to expire, I can delete them from our system. Where they expired unintentionally and haven’t been noticed yet, we can see about re-registering them.

Once I get to a good place with the sites that returned errors or “no connection,” I’m going to run a third script to see how many of the sites that are working properly are in active use. If a site only has the Hello World blog post that comes with the software and the default theme, I will reach out to the owners to see if they want any help getting set up or want me to delete an unused account. After I get through all of these WordPress sites, I’ll go back to Installatron and run similar checks on the rest of the OU Create domains.
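One cheap way that third script could spot never-customized sites is to look for the stock WordPress markers in the fetched HTML. This is only a rough heuristic of my own, not the actual script, and the marker strings assume an unmodified English-language install:

```python
# signs of a stock, never-customized WordPress install (assumed defaults)
DEFAULT_SIGNS = ("Hello world!", "Just another WordPress site")

def looks_unused(html):
    # the default first post or default tagline still being present
    # suggests the owner never got past the one-click install
    return any(sign in html for sign in DEFAULT_SIGNS)
```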

Multiserver DoOO Data Management

This post can only possibly appeal to about 12 people and only when they’re really in the mood for weedsy code stuff. However, that’s about my normal readership, so here we go…

For the OU Create project, we now have 5 servers that are managed by Reclaim Hosting. We’ve got more than 4000 users, and, collectively, they have more than 5000 websites. Keeping track of all of the usernames, domains, and apps in use is difficult.

One of the ways that we study the digital ecology of these servers is by looking at the data.db files created by each server’s instance of Installatron. These database files keep track of all of the web apps installed using Installatron. Thus we have a record of every installation of WordPress, Drupal, Omeka, MediaWiki, or any of the other 140 or so apps with Installatron one-click installation packages. I can find the user’s name, the date, and the domain, subdomain, or subdirectory used for each installation. However, within the data.db file, there are five tables for all of this data, and it’s a SQLite file, so it’s not exactly a quick or easy read. Further complicating everything is that each server has its own data.db file, and each one is buried deep in the directory structure amongst the tens of thousands of files on the server.
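Because data.db is SQLite, at least its table names can be pulled programmatically from the file’s own catalog rather than by opening it in a browser. A minimal sketch (the helper name is mine):

```python
import sqlite3

def list_tables(db_path):
    # ask SQLite's master catalog which tables the file contains
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
        return [row[0] for row in cur.fetchall()]
```

Running this against a downloaded data.db shows the five Installatron tables, including the i_installs and i_users tables used below.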

Here at OU, we have a couple of websites that were built as derivatives of studying the data.db files. The first was community.oucreate.com/activity.

This site is built on FeedWordPress. We feed it the URLs for each WordPress site in OU Create, and it aggregates the blog activity into one feed averaging 300+ posts a week.

The other site is community.oucreate.com/sites.

[Screenshot (2017-12-06) of community.oucreate.com/sites]

Sites provides a filterable set of cards displaying screen captures, links, and app metadata for all of the sites in OU Create. You can see what sites are running Omeka or Vanilla Forums or whatever other app you’d like.

To maintain these sites, I would normally log into each server, navigate up one level from the default directory to the server root, go into the var directory, then the installatron directory, and download the data.db file. Once I’ve downloaded it, I use some software to open the file and export the i_installs table to a csv. Then I find the installations since my last update of the system, copy the URLs, and paste them into the Activity site or run a script against them for the Sites site. I repeat this for each server. This process is terrible and I hate doing it, so it doesn’t get done as often as it should.

This week, I wrote some code to do these updates for me. At the most basic level, the code uses secure shell (SSH) to login to a server and download a desired file. My version of the code loops (repeats) for each of my five servers downloading the data.db file and storing them all in one folder. Here is the code and below I’ll explain how I got here and why:

import csv
import paramiko
import sqlite3

#The keypasses list holds the passwords for the keys to each of the servers
keypasses = ["xxxxxxxxxx", "xxxxxxxxxx", "xxxxxxxxxx", "xxxxxxxxxx", "xxxxxxxxxx"]
counter = 1
csvWriter = csv.writer(open("output.csv", "w"))
#loop through the keypass list getting the data.db files
for keypass in keypasses:
    db = "data%s.db" % counter
    servername = "oklahoma%s.reclaimhosting.com" % counter
    keyloc = "/Users/johnstewart/Desktop/CreateUserLogs/id_oklahoma%s" % counter
    k = paramiko.RSAKey.from_private_key_file(keyloc, password=keypass)
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    print("oklahoma%s connecting" % counter)
    ssh.connect(servername, username="root", pkey=k)
    print("connected")
    #download the data.db file over sftp
    sftp = ssh.open_sftp()
    localpath = '/Users/johnstewart/Desktop/CreateUserLogs/data%s.db' % counter
    remotepath = '../var/installatron/data.db'
    sftp.get(remotepath, localpath)
    sftp.close()
    ssh.close()
    print("data%s.db downloaded" % counter)
    #export the desired table from the database to the aggregated csv
    with sqlite3.connect(db) as connection:
        c = connection.cursor()
        c.execute("SELECT * FROM i_users")
        rows = c.fetchall()
        csvWriter.writerows(rows)
    print("%s i_users exported to csv" % db)
    counter += 1


On Monday, I wrote the first draft of this in bash and then rewrote it with expect, a tool for scripting interactive programs. With expect, I could ssh into a server and then respond to the various login prompts with the relevant usernames and passwords. However, this exposed the passwords to man-in-the-middle attacks. If someone were listening to the traffic between my computer and the server, they would be able to intercept the username and password from within the file. This is obviously not the best way to do things.

The solution was to use an SSH key. These keys are saved to your local computer and provide an encrypted credential for logging in to the server. You in turn use a password to unlock the key on your own computer, so there’s no ‘man in the middle.’ Unfortunately, I had no idea how to do this. Luckily for me, Tim Owens is a fantastical web host and has a video explaining how to set up keys on Reclaim accounts:

I set up keys for each of the servers and saved them into my project folder. This also made the ‘expect’ script unnecessary because I no longer needed to type a password for each server.

I turned back to a bash shell script, but couldn’t figure out what to do with my .db files once I had downloaded them all. This morning I turned from bash to Python, which is very good at handling data files. Python also has the paramiko library, which simplifies the process of logging into and downloading files from servers. You can see in the loop part of the code above where I call several paramiko functions to establish a connection to the server and then use sftp to get the file I want.

Our servers are labeled numerically oklahoma1 through oklahoma5, and I had saved my keys for each server as id_oklahoma1 through id_oklahoma5, so it was easy to repeat the basic procedure for each server by writing a loop that repeated 5 times. Each time the loop occurs it just increases the number associated with the servers and keys. The loop also saves the data.db files locally as data1, data2, etc.

The last step was to use Python to compile the desired data from each of these data.db files. The sqlite3 library provided the needed methods for handling the database files. I could connect to each database after downloading it, select the table that I wanted, and “fetch” all of the rows from that table. From there, I used the csv library to write those rows to a csv (a comma-separated values table that opens in Excel). This whole process was part of the larger programmatic loop, so each time I pulled a database from a server, I was adding its table rows to the collective csv.

For me, this process will make it easy to pull a daily update of all of the domains in OU Create and then upload those to my two community websites. As we follow Tom Woodward and Marie Selvanadin’s work on the Georgetown Community site, these up-to-date lists of sites will make it easier to build sites that pull together information from the APIs of the many OU Create websites. The process could also be generalized to pull or post any files from a set of similar servers, allowing for quicker maintenance and analysis projects. Hopefully Lauren, Jim, and Tim will find fun applications as they continue their Reclaim Hosting server tech trainings.
