Grepping Internet History via USENET and Gopher
I just wanted to touch upon some interesting historical archives I found related to USENET and Gopher.
First, there's the USENET archive covering the years 1981 to 1991 that is
available on the Internet. I'm not
going to reiterate that whole story here.
Suffice it to say you can download it here or here.
Download all the archive pieces first. You can use a method like this:
$ wget -r -A "*.tgz" http://www.skrenta.com/rt/utzoo-usenet/
After you get all the filez, move them into the directory where you
intend to store them, then decompress them all with a short loop:
for i in *.tgz; do
    tar -pxvf "$i"
done
That will decompress each archive file into its own individual directory.
Now that you have it all decompressed you can delete the source tgz
files and start to do some grepping. Change into the top
level directory and grep for keywords on the files like this. The |
character means a boolean OR, and the -w switch restricts matching to
whole words; get rid of it if you are interested in sub-strings:
$ egrep -l -r -i -w 'wholeword1|wholeword2|wholeword3' ./*
To search for more than one word on the same line using AND logic (the words must appear in the given order):
$ egrep -l -r -i -w 'wholeword1.*wholeword2' ./*
To search for a multi-word phrase that appears exactly as you search for it:
$ egrep -l -r -i -w 'wholeword1 wholeword2' ./*
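To see the three patterns in action before pointing them at 10 GiB of news spools, here's a throwaway demo; the /tmp path and sample text are invented for illustration:

```shell
#!/bin/sh
# Scratch demo of the three egrep patterns; directory and text are made up.
mkdir -p /tmp/grepdemo
printf 'talk of usenet here\nusenet news archive\ngopher menus\n' > /tmp/grepdemo/sample.txt

# OR: either whole word anywhere in the tree
egrep -l -r -i -w 'usenet|gopher' /tmp/grepdemo

# AND, in this order: both words on the same line
egrep -l -r -i 'usenet.*news' /tmp/grepdemo

# Exact phrase
egrep -l -r -i -w 'usenet news' /tmp/grepdemo
```

All three print /tmp/grepdemo/sample.txt, since every pattern matches at least one line in it.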
Depending on the processing speed of your box, this might take
a while. The total size of this archive is 10.25 GiB
unarchived. I've found the above method to be the most
straightforward way to grep through large text archives. There
are numerous indexing style programs available for Linux that index
large data sets and build a database or whatever for the user to search
through very quickly. The problem I've found with indexing
programs like that is that they make an index file that is as big
as or bigger than the original text archive, and it also takes forever for
the indexing to complete. Those are the primary reasons I don't
use indexing programs. Another reason is that it's only me
looking at the data; if I had it on a website then of course there
would need to be a database. YMMV.
I found to my surprise and delight a couple of Gopherspace archives available here and here.
One of the collections is a gopherbot spider archive of all of
gopherspace as it was in 2007. The other collection is a mirror
of gopher.quux.org from 2006. Apparently both of these
collections are available due to the efforts of someone named John Goerzen. Based upon the content of his webpage, he seems like my kind of guy.
An FYI for anyone who delves into these collections: the 2007
gopherspace collection is 28 GiB unpacked with 541,094 total
files. The 2006 QUUX mirror is 3.6 GiB unpacked.
I utilize the egrep method as outlined above to grep words within text files inside
these collections. Collections like these also beg to be searched
for specific filenames or extensions. To that end you can make a simple
script in /usr/local/bin that looks like this:
#!/bin/sh
find . -mindepth 1 -iname "*$1*"
I call the script "findm", so I can just open a terminal and do "findm
mp3" or something like that and it will do a recursive directory search
for the specified term anywhere in the filename. This is useful
for directories with a small to medium number of files, and of course
the speed of your box plays into it as well.
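As a sanity check, here's the same find invocation run against a scratch tree; the directory and filenames are invented for the demo:

```shell
#!/bin/sh
# Demo of the findm-style search on a made-up directory tree.
mkdir -p /tmp/findmdemo/music
touch /tmp/findmdemo/music/oldtime.mp3 /tmp/findmdemo/readme.txt
cd /tmp/findmdemo

# Same invocation as the findm script, with "mp3" as the search term:
find . -mindepth 1 -iname "*mp3*"
```

The search is recursive and case-insensitive, so it prints ./music/oldtime.mp3 and nothing else.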
A more efficient way of searching through really massive collections such as these is to use the mlocate
program which is really a lifesaver. All you need to do is create
a special index for the directory you want the filenames indexed for,
then you can search that specific index file at blazing speeds.
Create the index file for the directory that you unpacked the Gopher archive:
$ updatedb -v -l 0 -o index_file -U ./some_directory
After that's finished, you can use locate to search the index file:
$ locate -d index_file -i searchterm
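If you end up with several index files, one per archive, a tiny wrapper in the findm style keeps the invocation short. This is just a sketch of my own; the script name "lsearch" and the /tmp install path are inventions for the example (in practice it would live in /usr/local/bin next to findm):

```shell
#!/bin/sh
# Install a hypothetical "lsearch" helper that searches a named mlocate index.
# /tmp is used here only for illustration.
cat > /tmp/lsearch <<'EOF'
#!/bin/sh
# usage: lsearch index_file searchterm
locate -d "$1" -i "$2"
EOF
chmod +x /tmp/lsearch
```

Then something like "lsearch gopher2007.index mp3" searches just that one index, case-insensitively, at the same blazing speed.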