The Old Scholar's Historical Thoughts

May 14, 2010

November 21, 2009

Just have a lot of random thoughts this week.

Looking at the articles on Open Access News, I saw that the British Library had digitized its 500,000th item, but it charges you for looking at its material. On the other hand, the British National Archives are free. Many of the collections at the Library of Congress are free, but our National Archives is having someone else do the digitizing, and it will cost you a fee to see the data. Digitized US maps are free; British maps cost money. Government logic and consistency seem to be mutually exclusive.

By the way – did everybody notice the note published on the Open Access News site that said the blog would not be kept as up to date as the Open Access Tracking Project? That is a wiki with updates about OA.

A problem with Open Access is the chaos that occurs when there are no “accepted” places to go. It may be limiting to have only the 40 or so journals on Victorian England to look at for research, but that is a lot smaller than trying to link up every scholar’s site from every university and make sense of them. If you want to see chaos, just try to track down all the standards for data sharing that Willinsky talks about. He references the Open Archives Initiative, but that is only one of many. Which ones are being maintained? I have worked with the W3C before, so I know they are a standards body with clout, but how about MINH – who follows them, and who uses them?

Now take that chaos, magnify it millions of times, and try to sift through scholarly output using the methods proposed by Wineburg and discussed in chapter 11. Where did this come from? Who is the guy who wrote it? How can I trust what I found on the web?

One of the great things about coming back to school for me has been the access I get through the library to journals. I have found some great information. I went looking for an article the other day and found an abstract of exactly what I was looking for. I could not access it through the Library Catalog System. I went to the library and Mike was working there so I asked for some help – after all he’s in CLIO-1 like me, he must know all the answers. Well, we both learned something that day. The George Mason Library subscribes to the Oxford Journals. But they do not subscribe to the Oxford Journals before 1996. For those you have to try and get an inter-library loan or you can pay $36 per one day use of an article. Needless to say I did not pay the $36 to see how the one day use was enforced and I was not able to use the research someone else had done. So this is another form of Open Access that needs to go into Appendix 1 of Willinsky’s book. I think it is called “Open – but not for you”.

November 14, 2009

There are times when it is better to remain quiet and be thought a fool, than open your mouth and remove all doubt. (Attributed to Abraham Lincoln, but I can’t find the source)

This week I am going to sort of take this to heart. There are two great posts, one at Lynn’s site, and one at Carl’s site, that discuss data visualization very well. Instead of highlighting my foolishness on this site, I will remain quiet and provide my observations as comments on their sites.

I will comment on Professor Cohen’s article about trying to make sense of digital research with data mining techniques. We are at the forefront of data mining in the digital age. In the past, historians were limited by time and distance in what they could review and try to correlate. Looking at the court records of 19th century Britain is possible; correlating the data between jurisdictions and developing time analysis is daunting. Taking this court data and correlating it with social data such as parish records, or economic data such as tax receipts, is impossible except for isolated cases. If all this data were digitized and correctly tagged, historians could write queries that asked for correlations between data sets for whole sections or all of Britain. Trends of the whole country could be reviewed. Even if the data wasn’t tagged, if an API existed like the Google API for filtering queries within parameters, or if H-bot could be used, the data could yield new correlations which historians have not yet theorized or investigated. We are lucky to be at the forefront of this trend, but only if we take advantage of what is there and get involved in setting the direction for future historians.

Getting to the problem of abundance or scarcity, it seems as if historians will have some great tools for data born digital. For instance, the British court system now digitizes all records and even has a link that allows researchers to sign up for and use these digital records. Not only are some records born digital – they are being prepared for researchers of the future.


OOPS! I guess I removed all doubt about my being a fool. 😉

November 11, 2009

Moretti (p. 51) said there is a change in how people used pronouns in novels which showed an increased sense of community. He references Elizabeth Gaskell’s Cranford. He said it starts off with the word “Our” and ends with “Us.” However, if you use Wordle and accept common words (Cranford Wordleized), plural pronouns are small and singular pronouns are large.

I think Moretti was just seeing what he wanted to see and not what was really there.
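
For anyone who wants to check the Wordle impression with actual numbers, here is a minimal sketch of the same idea: count first-person singular and plural pronouns in a text. The pronoun lists, the function name, and the sample sentence are my own assumptions for illustration; the sample is not a quotation from Cranford, and contractions such as “I’m” are simply ignored.

```python
import re
from collections import Counter

# Hypothetical pronoun lists chosen for this sketch.
SINGULAR = {"i", "me", "my", "mine", "myself"}
PLURAL = {"we", "us", "our", "ours", "ourselves"}


def pronoun_counts(text):
    """Return (singular, plural) first-person pronoun counts for a text."""
    # Lowercase the text and split it into runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    singular = sum(counts[w] for w in SINGULAR)
    plural = sum(counts[w] for w in PLURAL)
    return singular, plural


# Made-up sample text, NOT taken from the novel.
sample = ("Our society was small. I told my sister that we must visit, "
          "and she told me to go.")

singular, plural = pronoun_counts(sample)
print(f"singular: {singular}, plural: {plural}")
```

Running the same function over the full text of the novel, or over each chapter separately, would show whether the balance shifts from beginning to end, which a single word cloud cannot show.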

November 6, 2009

The readings this week were very interesting. I was fascinated with the talk by Norvig. I did some AI work a long time ago using neural nets and “computer learning” for Military Intelligence. We were trying to determine the bad guys’ future course of action by viewing what our sensors were telling us in the present. Norvig’s whole section where he talks about Google feeding in raw data and then attempting to develop classes of similar words and concepts through statistical computations takes that thinking to a much higher level. The software does not have to be programmed with preconceived notions. Pattern matching and statistical algorithms transform raw data into abstract concepts.

As Norvig pointed out in the drug example, what the publishers of web pages provide is different from what searchers are looking for. Wouldn’t it be neat to take a corpus of work such as Victorian British parliamentary debates and develop candidate classes of data using Google’s algorithms? Then you could take the British newspapers from the same time period and see what similar classes would yield. Finally, you could take pamphlets from trade unions or religious sermons or popular novels and correlate all of the different classes. Comparing generated classes from different data sources could provide insights into controversy and tension between groups that people have not yet researched. It might reveal links between people and organizations that no one has realized were there before.

Of course this all depends upon the amount of data you can process. Norvig wants billions of data points. As Leary pointed out, the Victorians on the web can give us more data than other time periods because we don’t have to worry about copyright. But Victorian data is also limited because of limitations on the scanning of primary sources. As he said, only the Scotsman and the Times are fully available and most others are not.

Using data mining with a robust set of works could give a researcher a starting point for further research. From that starting point, the researcher would still need to go to collections of letters, background notes and other sources to put context around the candidate data classes. I think it would be a lot of fun figuring out the puzzle of why the algorithm found different relationships depending upon the data sources. We could all play Sherlock Holmes.

Fascinating, fascinating way to look at research.

November 1, 2009

As usual we have an abundance of material and a scarcity of time to process it all. The only way New Media changes the equation is that now ABUNDANCE is in capital letters and time is in subscript. Many of the readings are right up my alley. They explain the need for tagging of data and what it can be used for. Of course the readings also ask the question “Do we really need to expend all of this effort?”

There is something to be said for the old stubby pencil solution. I currently work on a project where the customer wants a process automated. So far they have spent 7 million dollars on the development of a web based automated system. I showed them that they could get the same functionality by hiring 4 people who updated by hand. The automated system will need maintenance of about $250,000/year and will probably need to be replaced in 10 years. Even if the total cost was $100,000 per year per person, it would be cheaper doing it manually than automating it. Just because something CAN be done, does not mean it SHOULD be done.
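
The arithmetic behind that claim can be laid out in a few lines. This is only a rough sketch using the figures above; it assumes the $100,000 is a fully loaded yearly cost per person, and it ignores inflation and the cost of replacing the system after 10 years. The variable and function names are mine.

```python
# Figures from the paragraph above.
DEVELOPMENT_COST = 7_000_000        # spent so far on the automated system
MAINTENANCE_PER_YEAR = 250_000      # yearly maintenance of the automated system
LIFETIME_YEARS = 10                 # expected life before replacement
STAFF = 4                           # people updating by hand
COST_PER_PERSON_PER_YEAR = 100_000  # assumed fully loaded cost per person


def automated_total(years):
    """Total cost of the automated system over the given number of years."""
    return DEVELOPMENT_COST + MAINTENANCE_PER_YEAR * years


def manual_total(years):
    """Total cost of doing the same work by hand over the same period."""
    return STAFF * COST_PER_PERSON_PER_YEAR * years


print(f"Automated over {LIFETIME_YEARS} years: ${automated_total(LIFETIME_YEARS):,}")
print(f"Manual over {LIFETIME_YEARS} years:    ${manual_total(LIFETIME_YEARS):,}")
```

Under these assumptions the automated system costs $9,500,000 over its 10 year life and the four people cost $4,000,000. The manual approach costs $150,000 more per year to run, so the automated system would need roughly 47 years to earn back its development cost, far longer than it is expected to last.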

Now that I got that off my chest, how does that relate to tagging for media on the web? I guess the biggest thing I can say is I don’t know what the long term benefits would be. If you had asked someone 20 years ago how putting a Universal Product Code (UPC) on every item would be beneficial, they would not have dreamed of an iPhone app which allows you to find the cheapest place to purchase something. Not only that, but I’m sure the Department of Defense did not envision GPS being used to hook into the iPhone app to know where you were when you asked the question, so the app could tell you the closest place where the item is the cheapest. However, if products weren’t tagged with UPC codes and GPS data was not available through web services, this functionality would not work.

I have been thinking of Terese’s project of providing resources for teachers. If every teacher who used the web, and every software program that allowed teachers to create lesson plans, syllabi, and web sites, included tagging for the suggested age group or grade their work was aimed at, Terese could easily create a service that gathered that information and presented it to her users. You can’t do a Google search on “9-12” and hope you get information about classroom work aimed at 9 to 12 year olds. How do you know “9-12” doesn’t mean grades 9 through 12? But Terese can’t tag her information with one keyword while someone else tags theirs with another. The computer needs some standards so programs can find and provide the right data. Google had an interface that allowed CHNM to provide this type of content, but once they removed their API that function no longer worked. If everyone played well together the information would still be available.
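
To make the “9-12” problem concrete, here is a small sketch of what an agreed tag might look like. Everything in it is hypothetical: the field names, the two resource titles, and the search function are mine and do not come from any real standard. The point is only that once the unit (“age” or “grade”) is part of the tag, a program can tell the two meanings apart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudienceTag:
    """Hypothetical shared tag: the unit removes the '9-12' ambiguity."""
    unit: str  # either "age" or "grade"
    low: int
    high: int


@dataclass(frozen=True)
class Resource:
    title: str
    audience: AudienceTag


def find_for_audience(resources, unit, value):
    """Return titles of resources whose tagged range covers the given value."""
    return [
        r.title
        for r in resources
        if r.audience.unit == unit and r.audience.low <= value <= r.audience.high
    ]


# Made-up example records; both would match a plain text search for "9-12".
resources = [
    Resource("Victorian newspapers lesson plan", AudienceTag("age", 9, 12)),
    Resource("Boer War primary sources unit", AudienceTag("grade", 9, 12)),
]

print(find_for_audience(resources, "age", 10))
print(find_for_audience(resources, "grade", 10))
```

The search only works because both records use the same field names and the same two unit values. If one author wrote “grade” and another wrote “year level”, the program would be back to guessing, which is the argument for a shared standard.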

Krista had a post about the need for curation because, as the link she posted said, “Presenting endless volumes of content is no longer the defining characteristic of a good digital publisher. Instead, the core competency must shift to presenting the most relevant information.” Tagging allows a site to provide this context. I recommend you read her post and her link.

So, getting back to the theme for this post: with tagging, we can allow computers to assist in figuring out the context of information, which the authors themselves provide. Without tagging we are forced to wade through hundreds and thousands of links returned by Google using their algorithms, over which we have no influence. If we want to help future scholars use our work, we need to provide the context for it.

October 26, 2009

I came across this article about John Dean fighting to stop a recording of him being posted on a web site. Stifling the publication of primary evidence by threatening a copyright infringement claim doesn’t seem right. I cannot find any other references to this fight, but those particular recordings are not on the site. It is easier if you are an ancient scholar. I don’t think Julius Caesar will be coming back to sue over copyright infringement or libel.

October 5, 2009

I don’t know if you all had a chance to review the link Professor Cohen gave us on Twitter concerning how people disseminate their scholarly research, but the Communicating Knowledge Report had some interesting surveys. The surveys were taken from scholars and researchers in the UK on how and why they publish their research. Peer reviewed journals are very important to 94% of the respondents, whereas Internet blogs and forums are either not important (70%) or not applicable (18%). When asked why they considered peer reviewed journals important, 74% said they needed them for career advancement (18). But hidden in the numbers is the ironic statistic that “and more than one third of respondents say that open access repositories are important to their research” (17).

So most of these researchers consider web self publishing as not important, but value “open access.” I guess it all goes back to the “trust and rigor” implied in being published versus putting something on the web. Some of the initiatives for providing peer reviewed web material, talked about in the Bell article from the American Historical Association, may change some people’s minds. However, that article was written in 2005. Does anyone know the state of peer reviewed web publishing sites?

I also found it interesting that people in the humanities put so much more emphasis on publishing a chapter in a book, than the other disciplines.

Also, for a truly misleading bar graph I highly recommend Figure 2 on page 17, which shows the importance of peer reviewed journals for the different fields. All other bar graphs in the paper have a scale of 0% to 100%. Figure 2 starts at 80% and goes to 100%. The impression of importance is very skewed.
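
To show how much a scale starting at 80% can skew things, here is a rough sketch comparing how tall two bars look on each kind of axis. The 94% is the overall figure quoted above; the 85% is a hypothetical second value that I made up for the comparison, not a number from Figure 2.

```python
def visual_ratio(value_a, value_b, axis_start):
    """How many times taller bar A is drawn than bar B on this axis."""
    return (value_a - axis_start) / (value_b - axis_start)


FIELD_A = 94  # overall figure quoted from the report
FIELD_B = 85  # hypothetical value, for illustration only

print(f"Axis from 0%:  {visual_ratio(FIELD_A, FIELD_B, 0):.2f} times as tall")
print(f"Axis from 80%: {visual_ratio(FIELD_A, FIELD_B, 80):.2f} times as tall")
```

On a full 0% to 100% scale the first bar is drawn about 1.11 times as tall as the second. On a scale starting at 80% it is drawn 2.80 times as tall, even though the underlying numbers have not changed.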

My reason for liking digital books doesn’t touch the problems discussed in the article for this week, but it is an aspect of digital books that wasn’t covered. My wife works with people with disabilities. Digital books allow them to increase the size of the font (which is also great for us older folks) or listen to the book. That computer generated voice may be annoying and grating to you, but it allows blind people to have access to material they could not get to before. Digital books allow people with other disabilities to manipulate the book in ways that make no sense to me, but make perfect sense to them.

There are problems with digital books – a business model, the lack of review processes, the stamp of respectability a book gets just because it got through a publisher to be published, but these books allow people to have access to material they otherwise would not get to read.

September 25, 2009


I think the readings about collaboration have a direct relevance to what we have been talking about in our blogs. Mark Kornbluh concludes his presentation with:

“It is essential, however, to understand that librarians, archivists, curators, and scholars are as essential to the development of digital humanities as computer scientists and programmers. Digital humanities content requires curation. If we do not get the metadata right, all we have is junk. And if we do not figure out how to preserve digital objects, than scholarship will be fleeting.”

Isn’t that what we have been talking about with Carl’s digitization project and Lynn’s digitizing of the Arlington slave register? If we think about the digital projects we looked at in class, I can see the one concerning medieval canon law being around for a long time. The data is tagged following accepted standards and is presented in a Web 2.0 environment. Another content developer could easily take feeds from that site and create another site with other value added. Compare that with the Cleveland Corridor project, which is very ephemeral – in 20 years that train/bus line won’t exist and the ability to access the Flash will be gone. Since the data is accessed from Flash, other sites will not be able to use or access the data. This limits the amount of collaboration that can be achieved.

I was very impressed with Zayna’s post about the Wikipedia articles. In fact I was so impressed I thought I would take our web design discussions to heart and just steal her approach and design. Using Zayna’s successful way to get a starting point, I went to look at featured articles and saw a recent featured article was the Ross Sea Party. So that is where I started my review.

Ross Sea Party

The discussion page only covered trivial, i.e. non-historical, issues. One person thought there was too much detail and, since nothing had been done in 8 months, went in and fixed the article the way they thought it should be. The other topic of discussion was the use of English grammar versus American grammar. I was intrigued by the amount of vandalism this article experienced after it was named a “featured site.” People put up all sorts of things about Sarah Palin and other foolishness. Perhaps becoming a featured site is not such a good thing for a Wikipedia article. It gets you a lot of traffic, but that traffic includes a lot of crazy people. This problem was covered in great detail in Roy Rosenzweig’s article.

Battle of Grand Port

This article was found using the Featured content link off of the main page. This one had very little discussion (none), but had a lot of history as people changed dates of ship launchings and other small details. However, there was very little evidence of vandalism of the article. If your site is not a top featured site, I guess vandalism is not nearly as much of a problem.

Second Boer War

I went to this site because the Boer War was the focus of my web site for Clio2. One thing I find amusing is the policy, discussed in Rosenzweig’s article, that all entries have to be neutral. Neutrality is almost impossible in discussing history, as is demonstrated quite convincingly in this article’s history and discussions. In the very beginning there was a controversy about what it should be named. To most South Africans today it should be called the Anglo-Boer war. To the majority black population it was basically between two groups of foreigners, not native Africans. To some people, naming the war one way or another is not only not neutral; they consider it racist.

An interesting observation is the timing of changes to the article. It seems as if the article is left alone for months, and then all of a sudden there is a flurry of activity with more than one individual making changes, defending the changes and going back and forth. In some ways, reading the different discussions is more illuminating than reading the article itself.

Can Wikipedia be used by historians and should historians contribute to the project? Rosenzweig gives arguments on both sides, and takes great exception to the “no original research” and “neutrality” policies. I think we have to look at Wikipedia as a tool that can give some insights but needs to be used with a healthy dose of skepticism. After all, Wikipedia only shows the conventional wisdom, and going by that the Wright Brothers would have believed flight was impossible.

