Identifying Unused DoOO Sites with Google Scripts

As a graduate student, I made a short-lived attempt to start an online journal for graduate students in the history of science using a PHP-based open source platform called Open Journal Systems. When I became a lecturer, I used a Drupal installation package called Transcribr to build a site that hosted three different editions of an 18th-century chemistry textbook, and I had my students study how chemistry courses changed during the chemical revolution. However, at the time (2014) OU IT only gave faculty, staff, and students an HTML-based website and something like 20 MB of storage. After months of trying to get support, I ended up hosting both of these projects on BlueHost and paying for them out of pocket.

OU Create started as a pilot in July 2014 when Adam Croom brought Jim Groom to campus to proselytize for the Domain of One’s Own concept. I jumped on the opportunity and was in the room during the initial pitch. I registered this domain and started blogging about my class, which ultimately landed me a job working for Adam and helping to maintain OU Create.

While signing up for a domain of my own literally changed my career, some of the other initial users didn't make as much use of theirs. Some people sat through Adam's pitch, registered a site, but never confirmed their email addresses and let their sites lapse. Others made a couple of posts but never developed their site or committed to blogging. Some students used their sites for a course but stopped using them when the course ended. However, part of the pilot project was that OU paid for the sites' registration and hosting and continued to pay unless notified that the person didn't want the site anymore. We knew anecdotally that some of these sites were in use, but we hadn't tried to conduct a full audit of them until now.

There were about 2,500 sites on our first server before we migrated some to lighten the load. Working manually, I can look at about two sites a minute to see whether or not they are active and note in my spreadsheet their date of last activity and any unusual characteristics. At that pace (2,500 sites at two per minute is roughly 1,250 minutes), it would take me something like 20 hours. This is not a super scalable method.

So, I did what any good DoOO admin would do and asked Tom Woodward for suggestions. Obviously, we started by dumping my CSV of URLs into a Google Sheet, and then Tom wrote three Google Script functions:

function checkPosts(url) {
  var api = '/wp-json/wp/v2/posts/';
  var response = UrlFetchApp.fetch(url+api);
  var json = JSON.parse(response.getContentText()); 
  var mostRecentPost = json[0].modified;//note last modified rather than pub date just in case . . . 
  Logger.log(mostRecentPost);
  return mostRecentPost;
}
function checkPages(url) {
  var api = '/wp-json/wp/v2/pages/';
  var response = UrlFetchApp.fetch(url+api);
  var json = JSON.parse(response.getContentText()); 
  var mostRecentPage = json[0].modified;
  Logger.log(mostRecentPage);
  return mostRecentPage;
}
function checkFeed(url) {
  var feed = '/feed/';
  var xml = UrlFetchApp.fetch(url + feed).getContentText();
  var document = XmlService.parse(xml);
  var response = document.getRootElement();
  var channel = response.getChild('channel');
  var build = channel.getChild('lastBuildDate').getValue();
  return build;
}

The first two functions use version 2 of WordPress's REST API to pull the last modification date of the most recent post and the most recent page, respectively. The third function calls the site's RSS feed and parses the XML to pull the feed's lastBuildDate, which reflects the last time the site's content changed. Because many of the sites I was looking at were built prior to 2015, and all of them before 2016, the API calls were prone to returning page not found (404) errors. (The REST API endpoints only shipped in WordPress core in late 2016, so sites that haven't been updated since then don't expose them at all.)
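To make those return values concrete, here is a quick sketch of how you might test the three helpers against a single site. The URL is a made-up placeholder, and the dates in the comments are invented just to show the formats you'd expect back: an ISO 8601 timestamp from the REST API's modified field and an RFC 822 date from the feed's lastBuildDate element.

function checkOneSite() {
  var url = 'http://example.oucreate.com'; // hypothetical URL, for illustration only
  Logger.log(checkPosts(url)); // e.g. "2015-03-04T16:20:11" (modified date of the newest post)
  Logger.log(checkPages(url)); // e.g. "2014-11-20T13:02:33" (modified date of the most recently edited page)
  Logger.log(checkFeed(url));  // e.g. "Wed, 04 Mar 2015 16:20:11 +0000" (the feed's lastBuildDate)
}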

To get around these 404 errors, I wrapped the API functions in a try...catch block that tries the API call and, if it fails, catches the error and reports a 404:

function checkPosts(url) {
  var api = '/wp-json/wp/v2/posts/';
  try {
    var response = UrlFetchApp.fetch(url+api);
    var json = JSON.parse(response.getContentText()); 
    var mostRecentPost = json[0].modified;
    Logger.log(mostRecentPost);
    return mostRecentPost;
  }
  catch(err) {
    var errorMessage = "404";
    return errorMessage;
  }
}

I then wrote a loop that goes through all of the URLs in my Google Sheet, runs each function, and writes the results to a new sheet containing the URL, the date of the last post from the API, the date of the last page from the API, and the date of the last post from the RSS feed.

function collectURLs() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheets()[0];
  var rowsNumber = sheet.getLastRow();
  // data starts in row 2, below the header row, so read rowsNumber - 1 rows
  var values = sheet.getRange(2, 1, rowsNumber - 1).getValues();
  Logger.log(values[0]);

  ss.insertSheet('Dates Output', 1);
  var outputSheet = ss.getSheets()[1];
  var headers = ["URL", "apiPosts", "apiPages", "RSSPosts"];
  outputSheet.appendRow(headers);

  for (var i in values) {
    var originalurl = values[i];
    var newURL = originalurl.toString();
    var apiPosts = checkPosts(newURL);
    var apiPages = checkPages(newURL);
    var rssPosts = checkFeed(newURL);
    var printArray = [newURL, apiPosts, apiPages, rssPosts];
    outputSheet.appendRow(printArray);
  }
}

function checkPosts(url) {
  var api = '/wp-json/wp/v2/posts/';
  try {
    var response = UrlFetchApp.fetch(url + api);
    var json = JSON.parse(response.getContentText());
    var mostRecentPost = json[0].modified; // last modified rather than publish date, just in case
    Logger.log(mostRecentPost);
    return mostRecentPost;
  }
  catch(err) {
    var errorMessage = "404";
    return errorMessage;
  }
}

function checkPages(url) {
  var api = '/wp-json/wp/v2/pages/';
  try {
    var response = UrlFetchApp.fetch(url + api);
    var json = JSON.parse(response.getContentText());
    var mostRecentPage = json[0].modified;
    Logger.log(mostRecentPage);
    return mostRecentPage;
  }
  catch(err) {
    var errorMessage = "404";
    return errorMessage;
  }
}

function checkFeed(url) {
  var feed = '/feed/';
  try {
    var xml = UrlFetchApp.fetch(url + feed).getContentText();
    var document = XmlService.parse(xml);
    var response = document.getRootElement();
    var channel = response.getChild('channel');
    var build = channel.getChild('lastBuildDate').getValue();
    return build;
  }
  catch(err) {
    var errorMessage = "404";
    return errorMessage;
  }
}

The code thus takes my initial list of URLs and returns three possible dates that the domain was last updated. If the site's registration has lapsed, I get three 404 errors. Ultimately, I ended up dropping the API call for the date of the last post, because it either returned a 404 for older sites or the same date as the RSS call. When the code's running, here's what I see:

[Screenshot: CreateActivityDates, the Dates Output sheet filling in as the script runs]
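Dropping that posts call just means deleting one lookup and one column from the loop in collectURLs. A minimal sketch of the trimmed loop, using the same helper functions and variables defined above, would look something like this (with the header row shrunk to match):

for (var i in values) {
  var newURL = values[i].toString();
  var apiPages = checkPages(newURL); // last page modification via the REST API
  var rssPosts = checkFeed(newURL);  // lastBuildDate via the RSS feed
  outputSheet.appendRow([newURL, apiPages, rssPosts]);
}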

Before talking to Tom, I could check about 4 URLs every two minutes. Now I can check about 60 URLs every two minutes, roughly a fifteenfold speedup. It's still going to take me several hours, but the process is much better. If we were working with an order of magnitude more URLs, I would probably rewrite the functions in Python and run the code over batches of 500 sites to generate my CSV from the command line.

Once we have identified all of the sites that have lapsed, we'll contact the owners and offer them migration assistance. We'll then back up everything before terminating the remaining accounts on the servers.


3 Comments

  1. Tom

    Are you looping it automatically or resetting by hand? I don’t see the former in this but I might have missed it. If you wanted to do the latter, you could check for the lastRow on the destination sheet and then auto run it vs that every X amount of time. You might burn up all your processing time though . . . which I’ve done a few times. :)

    • John Stewart

      That’s a good idea. I’ve just been resetting manually. As the code runs, I’m visually checking the records that return 404 for the RSS feed. Most of these are DNS expirations. Once the script times out, I finish checking the records, move them over into a ‘visually confirmed’ sheet, then restart the code.

      • Tom

        Long-term it feels like a light php app could do this nicely.

        I’m thinking installatron triggers a URL log to the database (with other useful info) then it could check in every X amount of time and write that response to the appropriate field with a fresh screenshot. That’d also serve some other needs.

        If updates lack, it could trigger a user email or a super-admin one with the option to auto email on confirmation.
