Grepping Internet History via USENET and Gopher

I just wanted to touch upon some interesting historical archives I found related to USENET and Gopher.

First, there's the USENET archive covering the years 1981 to 1991 that is available on the Internet. I'm not going to reiterate that whole story here. Suffice it to say you can download it here or here.

Download all the archive pieces first. You can use a method like this:

$ wget -r -A "*.tgz" http://www.skrenta.com/rt/utzoo-usenet/

After you get all the files, move them into the directory where you intend to store them, then decompress them all using a script like this:

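#!/bin/bash
# Extract every .tgz archive in the current directory.
for i in *.tgz
do
    # Quote "$i" so filenames with spaces survive; -p preserves permissions.
    tar -xpvf "$i"
done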

That will decompress each archive file into its own individual directory.
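If an archive turns out not to carry its own top-level directory, its contents land in whatever directory you ran tar from. A minimal sketch that forces one directory per archive, assuming GNU or BSD tar; the function name extract_all is my own invention for illustration:

```shell
#!/bin/bash
# extract_all [DIR]: unpack each DIR/*.tgz into DIR/<archive name without .tgz>.
# extract_all is a hypothetical helper name, not something from the archives.
extract_all() {
    local src_dir="${1:-.}"
    local archive dest
    for archive in "$src_dir"/*.tgz; do
        # With no matching files the glob stays literal, so skip it.
        [ -e "$archive" ] || continue
        dest="$src_dir/$(basename "$archive" .tgz)"
        mkdir -p "$dest"
        # -C makes tar change into the destination before extracting.
        tar -xpf "$archive" -C "$dest" || echo "failed: $archive" >&2
    done
}
```

Run it as "extract_all ." from the directory holding the downloads. If the archives already contain a top-level directory, this just adds one extra level of nesting.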

Now that you have it all decompressed, you can delete the source tgz files and start grepping. Change into the top-level directory and grep the files for keywords like this. The | character means a boolean OR, and the -w switch restricts matches to whole words; drop it if you are interested in substrings within words:

$ egrep -l -r -i -w 'wholeword1|wholeword2|wholeword3' ./*

To search for more than one word on the same line using AND logic (note that this only matches when wholeword1 appears before wholeword2 on the line):

$ grep -l -r -i -w 'wholeword1.*wholeword2' ./*
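The pattern above is order-sensitive and line-bound. If what you want is every file that contains all of the words anywhere, in any order, one way is to chain greps so that each pass only looks at the files that survived the previous one. This is a rough sketch assuming GNU grep and xargs (for -Z, -0 and -r); files_with_all_words is a made-up name:

```shell
#!/bin/bash
# files_with_all_words DIR WORD [WORD...]
# Prints the files under DIR that contain every WORD (whole word, any case).
# Hypothetical helper; assumes GNU grep and GNU xargs.
files_with_all_words() {
    local dir="$1"; shift
    local first="$1"; shift
    local list next word
    list="$(mktemp)"
    # First pass: NUL-separated list of files containing the first word.
    grep -r -l -i -w -Z -e "$first" "$dir" > "$list" || true
    for word in "$@"; do
        next="$(mktemp)"
        # Later passes: re-grep only the files still on the list.
        xargs -0 -r grep -l -i -w -Z -e "$word" < "$list" > "$next" || true
        mv "$next" "$list"
    done
    # Convert the NUL separators to newlines for display.
    tr '\0' '\n' < "$list"
    rm -f "$list"
}
```

For example, "files_with_all_words . wholeword1 wholeword2" lists files that mention both words, whether or not they share a line. Each pass reads fewer files than the one before, so putting the rarest word first should keep it reasonably quick.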

To search for a multi-word phrase that appears exactly as you typed it (apart from case, because of -i):

$ grep -l -r -i -w 'wholeword1 wholeword2' ./*

Depending on the processing speed of your box, this might take a while. The total size of this archive is 10.25 GiB unarchived. I've found the above method to be the most straightforward way to grep through large text archives. There are numerous indexing programs available for Linux that index large data sets and build a database for the user to search through very quickly. The problem I've found with such programs is that they produce an index file that is as big as or bigger than the original text archive, and the indexing itself takes forever to complete. Those are the primary reasons I don't use them. Another reason is that it's only me looking at the data; if I had it on a website, then of course there would need to be a database. YMMV.
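If the grep is bound by CPU rather than disk, spreading the files over several grep processes may shorten the wait. Here is a sketch of that idea using find and xargs -P, assuming GNU findutils; parallel_grep, the default of 4 jobs and the batch size of 500 are arbitrary choices of mine, and I make no promise it is faster on any given box:

```shell
#!/bin/bash
# parallel_grep PATTERN [DIR] [JOBS]
# Lists files under DIR with a whole-word, case-insensitive match for the
# extended regex PATTERN, running up to JOBS greps at once.
# Hypothetical helper; assumes GNU find and GNU xargs.
parallel_grep() {
    local pattern="$1"
    local dir="${2:-.}"
    local jobs="${3:-4}"
    # -print0 / -0 keep odd filenames intact; -n 500 hands each grep a batch.
    # xargs exits with 123 when some batch had no match, which is not an error.
    find "$dir" -type f -print0 |
        xargs -0 -r -n 500 -P "$jobs" grep -l -i -w -E -e "$pattern" -- ||
        [ "$?" -eq 123 ]
}
```

Usage would look like "parallel_grep 'wholeword1|wholeword2' . 8". The output order is not stable between runs, so pipe it through sort if you care.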

I found to my surprise and delight a couple of Gopherspace archives available here and here. One of the collections is a gopherbot spider archive of all of gopherspace as it was in 2007. The other collection is a mirror of gopher.quux.org from 2006. Apparently both of these collections are available due to the efforts of someone named John Goerzen. Based upon the content of his webpage, he seems like my kind of guy.

FYI for anyone who delves into these collections: the 2007 gopherspace collection is 28 GiB unpacked with 541094 total files, and the 2006 QUUX mirror is 3.6 GiB unpacked.
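To check numbers like these against your own unpacked copy, something along these lines should do. It is a sketch that assumes GNU find and du; collection_stats is a name I made up:

```shell
#!/bin/bash
# collection_stats [DIR]: print the number of regular files and the disk usage.
# Hypothetical helper; assumes GNU find (-printf) and du (-h).
collection_stats() {
    local dir="${1:-.}"
    local count size
    # Emit one dot per file and count the dots, so filenames that
    # contain newlines are not counted twice.
    count="$(find "$dir" -type f -printf '.' | wc -c | tr -d ' ')"
    # du -h reports in powers of 1024; keep only the size column.
    size="$(du -sh "$dir" | cut -f1)"
    printf 'files: %s\nsize: %s\n' "$count" "$size"
}
```

Keep in mind that du reports space used on disk, which can differ a little from the sum of the file sizes depending on the filesystem.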

I utilize the egrep method as outlined above to grep words within text files inside these collections. Collections like these also beg to be searched for specific filenames or extensions. To that end you can make a simple script in /usr/local/bin that looks like this:

#!/bin/bash
# Recursively list anything under the current directory whose name contains $1, ignoring case.
find . -mindepth 1 -iname "*$1*"

I call the script "findm", so I can just open a terminal and do "findm mp3" or something like that, and it will do a recursive directory search for the specified term anywhere in the filename. This is useful for directories with a small to medium number of files, and of course the speed of your box plays into it as well.
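Before hunting for a particular extension, it can help to see which extensions a collection holds at all. A small sketch that tallies them, assuming GNU find; count_extensions is a hypothetical name, and dotfiles such as ".bashrc" will show up as an extension of their own:

```shell
#!/bin/bash
# count_extensions [DIR]: tally files by extension (case-folded), most common first.
# Hypothetical helper; assumes GNU find (-printf).
count_extensions() {
    local dir="${1:-.}"
    # %f prints the bare filename; sed keeps what follows the last dot.
    find "$dir" -type f -name '*.*' -printf '%f\n' |
        sed 's/.*\.//' |
        tr '[:upper:]' '[:lower:]' |
        sort | uniq -c | sort -rn
}
```

Something like "count_extensions . | head -n 20" then shows the twenty most common extensions, which gives you terms to feed to findm.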

A more efficient way of searching through really massive collections such as these is to use the mlocate program, which is really a lifesaver. All you need to do is create a special index for the directory whose filenames you want indexed, then you can search that specific index file at blazing speeds.

Create the index file for the directory where you unpacked the Gopher archive:
$ updatedb -v -l 0 -o index_file -U ./some_directory

After that's finished, you can use locate to search the index file:
$ locate -d index_file -i searchterm