Tag: Digital Humanities

Multiserver DoOO Data Management

This post can only possibly appeal to about 12 people and only when they’re really in the mood for weedsy code stuff. However, that’s about my normal readership, so here we go…

For the OU Create project, we now have 5 servers that are managed by Reclaim Hosting. We’ve got more than 4000 users, and, collectively, they have more than 5000 websites. Keeping track of all of the usernames, domains, and apps in use is difficult.

One of the ways that we study the digital ecology of these servers is by looking at the data.db files created by each server’s instance of Installatron. These database files keep track of all of the web apps installed using Installatron, so we have a record of every installation of WordPress, Drupal, Omeka, MediaWiki, or any of the other 140 or so apps with Installatron one-click installation packages. I can find the user’s name, the date, and the domain, subdomain, or subdirectory used for each installation. However, within the data.db file, there are five tables for all of this data, and it’s a SQLite file, so it’s not exactly a quick or easy read. Further complicating everything is that each server has its own data.db file, and each one is buried deep in the directory structure amongst the tens of thousands of files on the server.
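If you just want to see what is inside one of these files, a minimal sketch with Python’s built-in sqlite3 module might look like the following, assuming you have already downloaded a copy locally as data.db (the table peeked at here, i_installs, is the one I export by hand below):

import sqlite3

# Open a local copy of one server's Installatron database
connection = sqlite3.connect("data.db")
c = connection.cursor()

# List the tables in the file (there are five of them)
c.execute("SELECT name FROM sqlite_master WHERE type='table'")
print([row[0] for row in c.fetchall()])

# Peek at a few rows of the installs table
c.execute("SELECT * FROM i_installs LIMIT 5")
for row in c.fetchall():
    print(row)

connection.close()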

Here at OU, we have a couple of websites that grew out of studying the data.db files. The first is community.oucreate.com/activity.

This site is built on Feed WordPress. We feed it the URLs for each WordPress site in OU Create, and it aggregates the blog activity into one feed averaging 300+ posts a week.

The other site is community.oucreate.com/sites.

Screen Shot 2017-12-06 of community.oucreate.com/sites

Sites provides a filterable set of cards displaying screen captures, links, and app metadata for all of the sites in OU Create. You can see what sites are running Omeka or Vanilla Forums or whatever other app you’d like.

To maintain these sites, I would normally go into each server, navigate up one level from the default directory to the server root, then go into the var directory, then the installatron directory, and download the data.db file. Once I’ve downloaded it, I use some software to open the file and then export the i_installs table to a CSV. Then I find the installations since my last update of the system, copy the URLs, and paste them into the Activity site or run a script against them for the Sites site. I repeat this for each server. This process is terrible and I hate doing it, so it doesn’t get done as often as it should.

This week, I wrote some code to do these updates for me. At the most basic level, the code uses secure shell (SSH) to log in to a server and download a desired file. My version of the code loops (repeats) over each of my five servers, downloading the data.db file from each and storing them all in one folder. Here is the code, and below I’ll explain how I got here and why:

import os
import paramiko
import sqlite3
import csv

#The keypasses list holds the passwords for the keys to each of the servers
keypasses = ["xxxxxxxxxx", "xxxxxxxxxx", "xxxxxxxxxx", "xxxxxxxxxx", "xxxxxxxxxx"]
counter = 1
csvWriter = csv.writer(open("output.csv", "w"))

#loop through the keypass list, getting the data.db file from each server
for keypass in keypasses:

    db = "data%s.db" % counter
    servername = "oklahoma%s.reclaimhosting.com" % counter
    keyloc = "/Users/johnstewart/Desktop/CreateUserLogs/id_oklahoma%s" % counter

    #load the server's private key and open an SSH connection with it
    k = paramiko.RSAKey.from_private_key_file(keyloc, password=keypass)
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    print("oklahoma%s connecting" % counter)
    ssh.connect(servername, username="root", pkey=k)
    print("connected")

    #download the server's data.db file over SFTP
    sftp = ssh.open_sftp()
    localpath = '/Users/johnstewart/Desktop/CreateUserLogs/data%s.db' % counter
    remotepath = '../var/installatron/data.db'
    sftp.get(remotepath, localpath)
    sftp.close()
    print("data%s.db downloaded" % counter)
    ssh.close()

    #export the desired table from the database to the aggregated csv
    with sqlite3.connect(db) as connection:
        c = connection.cursor()
        c.execute("SELECT * FROM i_users")
        rows = c.fetchall()
        csvWriter.writerows(rows)
    print("%s i_users exported to csv" % db)
    counter = counter + 1

 

On Monday, I wrote the first draft of this in bash and then rewrote it in a similar language called expect. With expect, I could ssh into a server and then respond to the various login prompts with the relevant usernames and passwords. However, this exposed the passwords to man-in-the-middle attacks. If someone were listening to the traffic between my computer and the server, they would be able to intercept the username and password from within the file. This is obviously not the best way to do things.

The solution was to use an SSH key. These keys are saved to your local computer and provide an encrypted credential for logging in to the server. You in turn use a password to activate the key on your own computer, so there’s no ‘man in the middle.’ Unfortunately, I had no idea how to do this. Luckily for me, Tim Owens is a fantastic web host and has a video explaining how to set up keys on Reclaim accounts:

I set up keys for each of the servers and saved them into my project folder. This also eliminated the need for the ‘expect’ script because I no longer needed to enter a password for each server.
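As an aside, paramiko can also generate a keypair for you if you’d rather stay in Python than use ssh-keygen or cPanel. Here’s a quick sketch, with placeholder filenames and a placeholder passphrase:

import paramiko

#Generate a new 2048-bit RSA key and save the private half, passphrase-protected
key = paramiko.RSAKey.generate(2048)
key.write_private_key_file("id_oklahoma_example", password="choose-a-passphrase")

#The public half is what gets added to the server's authorized_keys file
with open("id_oklahoma_example.pub", "w") as pub:
    pub.write("%s %s\n" % (key.get_name(), key.get_base64()))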

I turned back to a bash shell script but couldn’t figure out what to do with my .db files once I had downloaded them all. This morning I turned from bash to Python, which is very good at handling data files. Python also has the paramiko library, which simplifies the process of logging into and downloading files from servers. You can see in the loop part of the code above where I call several paramiko functions to establish a connection to the server and then use SFTP to get the file I want.

Our servers are labeled numerically oklahoma1 through oklahoma5, and I had saved my keys for each server as id_oklahoma1 through id_oklahoma5, so it was easy to repeat the basic procedure for each server by writing a loop that repeats five times. Each time the loop runs, it just increments the number associated with the server and key. The loop also saves the data.db files locally as data1, data2, etc.

The last step was to use Python to compile the desired data from each of these data.db files. The sqlite3 library provided the needed methods for handling the database files. I could connect to each database after I downloaded it, select the table that I wanted, and “fetch” all of the rows from that table. From there, I can use the csv library to write those rows to a CSV (an Excel-like, comma-separated values table). This whole process was part of the larger programmatic loop, so each time I pulled a database from a server, I was adding its table rows to the collective CSV.
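One thing my script doesn’t do is label the columns in the combined CSV. If you want a header row, the cursor’s description attribute holds the column names after a SELECT runs. Here’s a small standalone sketch of that idea, run against one of the downloaded files (the output filename is just an example):

import sqlite3
import csv

#Export the i_users table from one downloaded database to a CSV,
#this time with a header row taken from the cursor's column description
with sqlite3.connect("data1.db") as connection, \
        open("output_with_header.csv", "w", newline="") as handle:
    writer = csv.writer(handle)
    c = connection.cursor()
    c.execute("SELECT * FROM i_users")
    writer.writerow([column[0] for column in c.description])
    writer.writerows(c.fetchall())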

For me, this process will make it easy to pull a daily update of all of the domains in OU Create and then upload those to my two community websites. As we follow Tom Woodward and Marie Selvanadin’s work on the Georgetown Community site, these up-to-date lists of sites will make it easier to build sites that pull together information from the APIs of the many OU Create websites. The process could also be generalized to pull or post any files across a set of similar servers, allowing for quicker maintenance and analysis projects. Hopefully Lauren, Jim, and Tim will find fun applications as they continue their Reclaim Hosting server tech trainings.
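To generalize the approach, the paramiko calls could be wrapped in a function that takes a list of servers and whatever remote path you’re after. The sketch below is one way that might look; the hostnames, usernames, key locations, and paths are placeholders rather than our actual configuration:

import os
import paramiko

def fetch_from_servers(servers, remote_path, local_dir):
    """Download the same remote file from each server in the list.

    Each entry in servers is a dict with 'host', 'user', 'keyfile', and
    'passphrase' keys -- placeholder fields, not OU Create's real setup.
    """
    for i, server in enumerate(servers, start=1):
        key = paramiko.RSAKey.from_private_key_file(
            server["keyfile"], password=server["passphrase"])
        ssh = paramiko.SSHClient()
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        ssh.connect(server["host"], username=server["user"], pkey=key)
        sftp = ssh.open_sftp()
        #keep each server's copy separate by numbering the local files
        local_path = os.path.join(local_dir, "file%s" % i)
        sftp.get(remote_path, local_path)
        sftp.close()
        ssh.close()

#Example call (hypothetical values):
#fetch_from_servers([{"host": "server1.example.com", "user": "root",
#                     "keyfile": "/path/to/id_rsa", "passphrase": "xxxx"}],
#                   "../var/installatron/data.db", ".")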

Omeka + HTML5Up

I have been working on a couple of Omeka sites this semester. I really like the Omeka system for its handling of metadata and the Neatline map extension for easy map creation. Omeka S just came out and looks fantastic, though I haven’t had a chance to build with it yet.

However, I do not like the Omeka themes. For a while I was using Denison, as seen here, but the theme mishandles the four images on the front page and the drop-down menu has a tendency to malfunction.

I also tried Big Picture but abandoned it when I was unable to merge the beautiful Browse Collections page into the index page.

Ultimately I turned away from Omeka themes and instead used HTML5Up templates. I transitioned Dr. Daniela Busiciglio’s site that had been built with Denison from this:

to this (progettovitaliano.org):

Screen Shot 2017-12-04 of progettovitaliano.com

I’ve just finished transitioning another site that I built with Dr. Mirelsie Velazquez and her students from the Big Picture Omeka theme to this (http://docnarr.oucreate.com/):

Because Omeka’s API does not provide images, I was not able to load the images into these HTML-based pages programmatically. Thus, these front pages will require a bit more maintenance in their HTML code than Omeka themes, which are controlled from the administrative GUI. However, I really like the flexibility that I get from stripping out the front page of the Omeka themes and replacing it with something that I can tweak to feature the best parts of the projects.
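Item metadata, at least, can be pulled programmatically when a site has Omeka’s REST API switched on in its settings, so titles and links for these front pages don’t have to be typed by hand. A rough sketch, assuming an Omeka Classic site with the API enabled; the base URL and field handling here are illustrative only:

import requests

#Ask the Omeka REST API for a page of items and print their titles
base_url = "http://docnarr.oucreate.com/api/items"

response = requests.get(base_url, params={"per_page": 10})
for item in response.json():
    #pull the Dublin Core Title out of the item's element texts, if present
    titles = [et["text"] for et in item.get("element_texts", [])
              if et.get("element", {}).get("name") == "Title"]
    print(item["id"], titles[0] if titles else "(no title)")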

 

Time for R

Here at OU, there are at least four or five classes that teach introductory R. You can take a statistics class, an economics class, a biological stats class, or a class just on R as a programming language. R has become one of the most commonly used languages for computational statistics and data visualization, so it’s not surprising to see it pop up in a number of different departments. However, it has not yet made its way into the humanities.

For those of us in the humanities, then, I wanted to pull together a few online resources that can help you get started.

My favorite introduction is TryR from Code School. This pirate-themed introduction is great for people with little coding experience. It walks you through basic expressions, variables, arrays, loops, and graphing in a lightly gamified, campy platform.

Screen Shot of the Try R platform from Chapter 1

In a recent blog post, Jesse Sadler from UCLA offered a more targeted ‘Introduction to Network Analysis using R.’ Jesse does a great job of explaining how nodes and edges come together in network graphs and how various R libraries make it relatively easy to produce these graphs. Jesse’s research involves mapping the correspondence of the 16th-century Dutch merchant Daniel van der Meulen, which serves as a great example of the promise of R for DH research.

Screen Shot of Jesse Sadler's network graphing research project

Lincoln Mullen is currently composing an open textbook called Computational Historical Thinking, which uses and teaches R. The resources he’s already assembled are fantastic, and his book serves as an excellent example of open writing and review.

Screen Shot of Lincoln Mullen's Computational Historical Thinking Website

Less open but more complete, Matthew Jockers has produced a book and website with Springer called Text Analysis with R for Students of Literature. Taylor Arnold and Lauren Tilton also have a Springer book out called Humanities Data in R.

Last, but certainly not least, are the workshops provided by Software Carpentry, here at OU and throughout the world. Software Carpentry runs two-day workshops that introduce command line programming, versioning (usually with Git and GitHub), and R. These workshops are great because they work from a very introductory level and are meant to ease people into coding and data management. The group on OU’s campus is based out of the library and is particularly eager to help graduate students who are venturing into data analysis for the first time.

If you haven’t tried out R yet, take a minute to poke around at one of the resources above and think about how you already use maps, graphs, and tables in your work. Rather than hand-drawing your next map or searching for something to represent a network graph, take the same time to learn a new skill.
