
Identifying Unused DoOO with Google Scripts

As a graduate student, I made a short-lived attempt to start an online journal for graduate students in the history of science using a PHP-based open source platform called Open Journal Systems. When I became a lecturer, I used a Drupal installation package called Transcribr to build a site that hosted three different editions of an 18th-century chemistry textbook, and I had my students study how chemistry courses changed during the chemical revolution. However, at the time (2014) OU IT only gave faculty, staff, and students an HTML-based website and something like 20 MB of storage. After months of trying to get support, I ended up hosting both of these projects on Bluehost and paying for them out of pocket.

OU Create started as a pilot in July 2014 when Adam Croom brought Jim Groom to campus to proselytize for the Domain of One’s Own concept. I jumped on the opportunity and was in the room during the initial pitch. I registered this domain and started blogging about my class, which ultimately landed me a job working for Adam and helping to maintain OU Create.

While signing up for a domain of my own literally changed my career, some of the other initial users didn't make as much use of theirs. Some people sat through Adam's pitch and registered a site, but never confirmed their email addresses and let their sites lapse. Others made a couple of posts but never developed their site or committed to blogging. Some students used their sites for a course but stopped using them when the course ended. However, part of the pilot project was that OU paid for the sites' registration and hosting, and continued to pay unless notified that the person no longer wanted the site. We knew anecdotally that some of these sites were still in use, but we hadn't tried to conduct a full audit of them until now.

There were about 2,500 sites on our first server before we migrated some of them to lighten the load. Checking manually, I can look at about two sites a minute to see whether or not they are active and note in my spreadsheet the date of last activity and any unusual characteristics. At that pace the audit would take something like 20 hours. This is not a scalable method.

So I did what any good DoOO admin would do and asked Tom Woodward for suggestions. Obviously, we started by dumping my CSV of URLs into a Google Sheet, and then Tom wrote three Google Apps Script functions:

function checkPosts(url) {
  var api = '/wp-json/wp/v2/posts/';
  var response = UrlFetchApp.fetch(url + api);
  var json = JSON.parse(response.getContentText());
  var mostRecentPost = json[0].modified; // note: last modified rather than pub date, just in case . . .
  Logger.log(mostRecentPost);
  return mostRecentPost;
}

function checkPages(url) {
  var api = '/wp-json/wp/v2/pages/';
  var response = UrlFetchApp.fetch(url + api);
  var json = JSON.parse(response.getContentText());
  var mostRecentPage = json[0].modified;
  Logger.log(mostRecentPage);
  return mostRecentPage;
}

function checkFeed(url) {
  var feed = '/feed/';
  var xml = UrlFetchApp.fetch(url + feed).getContentText();
  var document = XmlService.parse(xml);
  var response = document.getRootElement();
  var channel = response.getChild('channel');
  var build = channel.getChild('lastBuildDate').getValue();
  return build;
}

The first two functions use version 2 of the WordPress REST API to pull the modification date of the most recent post and the most recent page, respectively. The third function calls the site's RSS feed and parses the XML to pull the date of the last post that way. Because many of the sites I was looking at were built prior to 2015, and all of them before 2016, the API calls were prone to returning page not found (404) errors; the REST API content endpoints didn't land in WordPress core until version 4.7 at the end of 2016, so older sites that haven't been updated often don't expose them.

To get around these 404 errors, I wrapped the API calls in a try-catch block that tries to call the API and, if the request fails, catches the error and reports a 404 instead:

function checkPosts(url) {
  var api = '/wp-json/wp/v2/posts/';
  try {
    var response = UrlFetchApp.fetch(url + api);
    var json = JSON.parse(response.getContentText());
    var mostRecentPost = json[0].modified;
    Logger.log(mostRecentPost);
    return mostRecentPost;
  }
  catch(err) {
    var errorMessage = "404";
    return errorMessage;
  }
}

I then wrote a loop that goes through all of the URLs in my Google Sheet, runs each function on each URL, and writes the results to a new sheet containing the URL, the date of the last post according to the API, the date of the last page according to the API, and the date of the last post according to RSS.

function collectURLs() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheets()[0];
  var rowsNumber = sheet.getLastRow();
  // start at row 2 to skip the header row
  var values = sheet.getRange(2, 1, rowsNumber - 1).getValues();
  Logger.log(values[0]);

  ss.insertSheet('Dates Output', 1);
  var outputSheet = ss.getSheets()[1];
  var headers = ["URL", "apiPosts", "apiPages", "RSSPosts"];
  outputSheet.appendRow(headers);

  // run all three checks on each URL and append one row per site
  for (var i in values) {
    var originalurl = values[i];
    var newURL = originalurl.toString();
    var apiPosts = checkPosts(originalurl);
    var apiPages = checkPages(originalurl);
    var rssPosts = checkFeed(originalurl);
    var printArray = [newURL, apiPosts, apiPages, rssPosts];
    outputSheet.appendRow(printArray);
  }
}

function checkPosts(url) {
  var api = '/wp-json/wp/v2/posts/';
  try {
    var response = UrlFetchApp.fetch(url + api);
    var json = JSON.parse(response.getContentText());
    var mostRecentPost = json[0].modified;
    Logger.log(mostRecentPost);
    return mostRecentPost;
  }
  catch(err) {
    var errorMessage = "404";
    return errorMessage;
  }
}

function checkPages(url) {
  var api = '/wp-json/wp/v2/pages/';
  try {
    var response = UrlFetchApp.fetch(url + api);
    var json = JSON.parse(response.getContentText());
    var mostRecentPage = json[0].modified;
    Logger.log(mostRecentPage);
    return mostRecentPage;
  }
  catch(err) {
    var errorMessage = "404";
    return errorMessage;
  }
}

function checkFeed(url) {
  var feed = '/feed/';
  try {
    var xml = UrlFetchApp.fetch(url + feed).getContentText();
    var document = XmlService.parse(xml);
    var response = document.getRootElement();
    var channel = response.getChild('channel');
    var build = channel.getChild('lastBuildDate').getValue();
    return build;
  }
  catch(err) {
    var errorMessage = "404";
    return errorMessage;
  }
}

The code thus takes my initial list of URLs and returns three possible dates for when each domain was last updated. If the site's registration has lapsed, I get three 404 errors. Ultimately, I ended up dropping the API call for the date of the last post, because it either returned a 404 for older sites or the same date as the RSS call. Here's what I see while the code is running:

[Screenshot: CreateActivityDates, the output sheet filling in as the script runs]

Before talking to Tom, I could check about 4 URLs every two minutes. Now I can check about 60 URLs every two minutes. It will still take several hours, but the process is much better. If we were working with an order of magnitude more URLs, I would probably rewrite the functions in Python and run the code over batches of 500 sites to generate my CSV from the command line.
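For the curious, here is a minimal sketch of what that Python rewrite might look like, limited to the RSS check (the one I ended up relying on). It assumes the third-party requests library and uses hypothetical file and function names; it illustrates the batching idea rather than a script we actually run.

# Rough sketch, not production code: batch-check a list of URLs from the
# command line via each site's RSS feed and write a CSV of URL + lastBuildDate.
import csv
import sys
import xml.etree.ElementTree as ET

import requests

def last_feed_date(url):
    # Mirror checkFeed(): fetch /feed/ and pull lastBuildDate, or report "404" on any failure.
    try:
        response = requests.get(url.rstrip('/') + '/feed/', timeout=10)
        response.raise_for_status()
        root = ET.fromstring(response.content)
        build = root.find('./channel/lastBuildDate')
        return build.text if build is not None else '404'
    except Exception:
        return '404'

def main(url_file, output_csv):
    # Read one URL per line, write a two-column CSV of URL and last feed date.
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    with open(output_csv, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(['URL', 'RSSPosts'])
        for url in urls:
            writer.writerow([url, last_feed_date(url)])

if __name__ == '__main__':
    # e.g. python check_sites.py urls_batch_01.txt dates_batch_01.csv
    main(sys.argv[1], sys.argv[2])

Splitting the master list into files of roughly 500 URLs and running the script once per batch would keep each run short and make it easy to pick up where a failed run left off.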

Once we have identified all of the sites that have lapsed, we'll contact the owners and offer them migration assistance. We'll then back everything up before removing the remaining accounts from the servers.

Visualizing Domains Projects

One of the big challenges in running a Domains project, and part of my feeling of being adrift at sea, is highlighting particularly good work from individual users while also making the broader activity of all users visible in an intelligible way.

At OU, we have a couple of sites for this purpose. The Activity site shows the most recent blog posts from every site that FeedWordPress can read. You get a preview of each blog post and can click through to the site itself. Each week during the school year, we put out our own blog post featuring the top 5 or 6 posts from that week's Activity feed.

A second project, called Sites, is a filterable list of all of the web apps currently running on all four of the OU servers. This is really nice for finding the links to every site running Vanilla Forums or DokuWiki, or more commonly used apps like Drupal and Omeka. However, if you filter for WordPress, you still get a plain list of the 3,400 or so sites using it.

At Domains17, Marie Selvanadin, Tom Woodward, and Yianna Vovides talked about the answer they are currently developing for Georgetown. Their design starts with a search bar that searches across all blogs. A second piece, called themes, groups content thematically. Another piece implements the TimelineJS library to visualize posts by a given user, or in a given theme, along a timeline.

Currently, Tom is running the first iteration of this visualization of the Georgetown domains off of a Google Spreadsheet, as he is wont to do. He then generates a front page with basic metadata and screenshots of the sites. Each of these sites also gets a dynamically generated page with the text of its posts, word counts, and a chart visualizing the word counts. You can also see the TimelineJS view by category for each site.

The next phase of research is to dig into the visualization of community sites. How do we include closer analysis of Drupal, Omeka, and other apps, along with plain HTML sites, rather than just WordPress sites? What types of questions can and should we ask? What are the ethical questions around mining this data?

Searching and Discoverability at Domains17

In our Information Age, I think we are all lost in a sea so vast that neither shore nor bottom is visible. We grab onto the first lifesavers thrown at us, Facebook and other social websites, and, grateful to stay afloat, we do not venture far. It is not the fear of drowning that makes us cling to the known, but rather the complete lack of markers to tell us which way to swim.

Martha Burtis challenged us to think about the concrete space of the internet and the metaphors for understanding it, and I keep coming back to this infinite sea. If our Domains provide a place to stand, and perhaps even an archipelago with acquaintances on it, we are still only small specks in the infinite sea.

I think the great challenge for us is in building our boats and learning to navigate. We have grown reliant on guides like Google and the social media echo chambers, but with these we only skim the surface and touch briefly at nearby, easily accessible ports of call.

As we think about giving our students a voice and a domain to stand upon, we also need to teach them to navigate and give them the means to be found by others. I think we need to spend more time thinking about discoverability. How do we help people find us, and how do we know which way to set out to find them?

If Facebook and Google are navigation by sight, how do we learn to follow the stars or create a new compass, so that we can turn away from the shores and chart new paths? As we discover new lands, how do we learn and respect the cultures of others and share the paths that connect more people?

The internet has opened a world of information to us, but how do we venture forth? Do we trust the colonizing guides that have always reached out to us, or can we find a new way and extend our hands to the students who will go further?
