Google Scholar’s susceptibility to spam

February 25, 2011

Bill Denton, a colleague of mine at York University, pointed me in the direction of an article titled “Academic Search Engine Spam and Google Scholar’s Resilience Against it.”

I was hoping to glean some ideas to legitimately help faculty disseminate their work, but this article was not useful in this regard. Nevertheless, it was fascinating!

I was impressed with the authors and their creativity in devising tests to see if Google Scholar could be manipulated by researchers. Some of the tests, such as creating a fake paper where a devious researcher would cite his or her own articles to increase rankings and visibility, were eye-opening.

To me, trying to game the system in such an open and public way would be unbelievably unwise, as one’s colleagues would see through this behaviour rather quickly, resulting in a severely damaged reputation.

Nevertheless, I highly recommend reading this paper as a method of raising awareness of what is possible with Google Scholar. It is important to be able to recognize spam as a means of protecting oneself against being misrepresented.

Categories: publishing, seo

RDA and the open access disappointment

April 6, 2010

I admit that I have not been following the development of RDA (Resource Description and Access) closely. In short, RDA is a new library cataloguing standard that is being introduced to replace our current standard, AACR2R. I’ve seen a couple of conference presentations on the topic, read a couple of blog posts, but that’s about it.

Today our department organized a viewing of an archived webcast of A Guided Tour of the RDA Toolkit. As I watched, I was impressed by the resource that has been created. As a non-cataloguer, I’m in no position to critique the content with authority, but I was very impressed with the layout and the multiple access points for information.

Near the end of the presentation, the talk shifted to pricing models. Pricing starts at $325, with an added cost of approximately $50 for each additional simultaneous user.

I sadly realized at that point that the open access movement still has a long way to go.  As librarians, we’re fighting for access to information by asking authors and publishers worldwide to provide open access to published works.  And here we are, enforcing toll access to our own rules that govern one of our most critical functions: describing items to enable access! If RDA is expected to become a worldwide standard for cataloguing, and is only available through a web interface, access to this information will only be available to the privileged class who can afford it.

While the subscription fee is not a crushing cost to bear for a large academic library, it’s a different story for libraries in the developing world, small nonprofits, or even struggling small-town public libraries. For them, these costs are simply not affordable.

I’m upset by this fee structure, because I believe that by perpetuating barriers to access for cataloguing rules, we will see lower quality description of data from smaller organizations and the developing world, which will translate into these resources being less visible and less interoperable with our future work on linked data and the semantic web. In short, we’re working against our own mandate of broadening equitable access to information.

What I’m told by colleagues is that the development of RDA was a joint effort, including lots of free hours of work from many librarians who collaborated to come up with these new rules.  What I’m proposing is to continue this collaborative effort to ensure that this information is openly available to the global community.

The hosting and maintenance of RDA is not without cost, of course. That fancy interface took many hours of labour to construct, and will take many more to maintain and update. How about adopting a consortial approach to funding, similar to SCOAP3 or arXiv, where the larger institutions that can afford to contribute pay an annual subscription fee to cover costs? This way, RDA is funded, but the world has access. Wouldn’t this be a wonderful opportunity to provide an example to the world of how new publishing models can work?

Come on library world…let’s practice what we preach!

Categories: open access

Can we further subdivide nuances of open access?

July 15, 2009

The Public Knowledge Project 2009 conference ultimately made me re-think the way that open access (OA) is defined and subdivided.

The current subdivision is dichotomous.  Open access is subdivided into the gratis and the libre models as described by Peter Suber in his Open Access Newsletter, where gratis OA refers to access without price barriers alone, while libre OA involves the removal of price and at least some permission barriers.  I perceive this to be a hierarchy of use, where gratis OA is less usable as permissions for further use of these items are not clear.

The concept of the hierarchy was echoed in a workshop that I attended at the PKP Scholarly Publishing Conference on Lemon8-XML (L8X). One of the speakers, MJ Suhonos, underscored that not all document dissemination formats are created equal. If one compares an XML encoded article to the same article available in PDF, we see that the XML encoded article enables enhanced access to the content. The strength is in the modularity of the XML, which enables the content to be labeled and described explicitly in a standardized way. The usefulness of XML can be illustrated using the example of citations. In a PDF, the citations sit lumped with the rest of the document and cannot be reliably harvested or parsed as discrete citations, because to a machine they appear identical to the text of the article. In XML, the citations are denoted as citations and hence can be parsed and analyzed as such.
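To make the citation example concrete, here is a minimal sketch in Python of what “parsed as such” can look like. The element names (article, references, citation, author, year, source) are my own invention for illustration and are not Lemon8-XML’s actual output schema; the sample records are made up as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup: these element names are invented for
# illustration and do not reproduce any real publishing schema.
ARTICLE_XML = """
<article>
  <title>Sample article</title>
  <body>
    <p>Prose that mentions Smith (2007) in passing.</p>
  </body>
  <references>
    <citation id="c1">
      <author>Smith, A.</author>
      <year>2007</year>
      <source>Journal of Examples</source>
    </citation>
    <citation id="c2">
      <author>Lee, B.</author>
      <year>2008</year>
      <source>Example Review</source>
    </citation>
  </references>
</article>
"""


def extract_citations(xml_text):
    """Return one dict per <citation> element, ignoring the body text."""
    root = ET.fromstring(xml_text)
    citations = []
    for node in root.iter("citation"):
        citations.append({
            "id": node.get("id"),
            "author": node.findtext("author"),
            "year": node.findtext("year"),
            "source": node.findtext("source"),
        })
    return citations


if __name__ == "__main__":
    for citation in extract_citations(ARTICLE_XML):
        print(citation)
```

The mention of “Smith (2007)” in the body paragraph is never mistaken for a citation, because only elements explicitly labeled as citations are collected. Doing the same with a PDF would mean guessing where the reference list begins and where each entry ends.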

One can’t help but imagine a world where every document has semantically encoded citations! We would not need to rely on ISI and Scopus anymore (or pay the Crossref fees). Everyone would have equal access to citation harvesting and analysis. (Two years ago, a Scopus vendor told me their indexing rejection rate was approximately 80%…talk about an elite society!) XML markup could enable global barrier-free citation analysis, where elite membership would no longer be necessary.

In this same L8X session, Juan Pablo Alperin discussed other benefits of XML markup besides the infinite possibilities of enhanced bibliometric analysis. He asked us to imagine the benefits of discovering collaboration networks, where enhanced author markup, for example, would enable us to see which institutions collaborate with one another. Enhanced document discovery would also be a benefit, where the availability of complete metadata means that we can find related works in many ways, such as by the same author, on the same subject, in the same journal, or by the same publisher.
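As a rough sketch of the collaboration network idea, the Python below counts how often pairs of institutions appear together on the same article. It assumes each author element carries an explicit institution; the element names and the placeholder institutions (“University A” and so on) are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations

# Hypothetical records: element names and institutions are placeholders.
RECORDS_XML = """
<articles>
  <article>
    <author><name>Author One</name><institution>University A</institution></author>
    <author><name>Author Two</name><institution>University B</institution></author>
  </article>
  <article>
    <author><name>Author One</name><institution>University A</institution></author>
    <author><name>Author Three</name><institution>University B</institution></author>
    <author><name>Author Four</name><institution>University C</institution></author>
  </article>
</articles>
"""


def collaboration_counts(xml_text):
    """Count co-authored articles for each pair of institutions."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for article in root.iter("article"):
        # Collect each institution once per article, skipping authors
        # whose institution is missing from the markup.
        institutions = {
            author.findtext("institution")
            for author in article.iter("author")
        }
        institutions.discard(None)
        # Sorting gives every pair a stable (first, second) order.
        for pair in combinations(sorted(institutions), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    for (first, second), total in collaboration_counts(RECORDS_XML).items():
        print(f"{first} and {second}: {total} shared article(s)")
```

The counting itself is trivial. The hard part is having author and institution labeled consistently in the first place, which is exactly what the enhanced markup would provide.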

While we are already seeing some of these benefits in Google Scholar, not all articles are marked up in a way to be able to fully benefit from what Google Scholar has to offer.

We see, then, that there is a divide between articles in static formats such as PDF and articles that are marked up in such a way that all their components have meaning associated with them. I argue that articles that are not marked up in XML are less usable than those that are, just as research that is available as gratis OA is somewhat less usable than libre OA.

The PKP team recognized the benefits of XML early on and responded by creating the Lemon8-XML software. They recognize the need for equal semantic exposure for all scholarship and have created a tool that puts this ability within everyone’s reach.

The Lemon8 software enables an editor to upload an article, and takes them step by step through marking up that document in XML while shielding them from the gory details. Lemon8 identifies document metadata such as title and author and, among other features, searches multiple databases to help verify citations, automatically suggesting additional data in a user-friendly way. Article markup is still not a quick venture, but if editors were to incorporate Lemon8 into their workflow, it could actually save them time, as it would greatly reduce the time it takes to verify citations while at the same time enabling their semantic markup.

I am excited to learn that integration of Lemon8 into the Open Journal Systems software is on the development roadmap for the Public Knowledge Project, and am looking forward to working with this added functionality.

What comes after cloud computing?

May 7, 2009

On Tuesday May 5, my colleague Jeff Newman and I presented The Library in the Cloud at the TRY Conference.  In the days leading up to the presentation, I was asking myself why I had agreed to add yet another item to my “to do” list which seems to continually grow longer instead of becoming more manageable.

After the presentation, however, I realized that I was glad to be part of the show.  First off, I got a front row seat to Jeff’s part of the talk…and boy did he blow me away as a presenter!  He’s an absolute natural…I wouldn’t be surprised if he’s doing the keynote circuit in a few years.

I was also grateful for an audience member’s question.  It was something along the lines of “what comes after cloud computing?”.

During the talk I remember discussing the benefits of cloud computing, and one of them was shared standards. I think that we still have much further to go in this respect. The ability to share and aggregate information via RSS feeds is great, but I would love to see more semantic interoperability taking place. Creating a Friend of a Friend (FOAF) profile is still a challenge, and I’m finding that I’m still having to re-enter much of my information in too many places.

While I do see the cloud continuing to enable more information to be created and made available online, I hope that further interoperability between cloud-based platforms develops so that information can be mined and shared much more efficiently and creatively.

Categories: interoperability, software

PRONOM

March 25, 2009

Paving the way towards sustainability of electronic records is PRONOM, an online registry of technical information.

An initiative of the National Archives (the UK government’s official archive in Surrey), the PRONOM registry was “originally developed to support the accession and long-term preservation of electronic records”. The National Archives have graciously made this valuable resource available to all.

As described on the site, “PRONOM holds information about file formats, and the software products which can process (read, write, identify etc) each format. Information related to the file formats, such as documentation about them, their compression types, character encoding schemes and intellectual property rights is also held.”

When browsing the site, I was pleased to find that in addition to a simple search, one can search by file format, vendor, software, lifecycle, migration pathway and PRONOM unique identifier. The search also allows you to find file formats by extension, and to search for software that can process files with a particular extension (or file format name). An online submission form is available to encourage user contributions and to help keep the registry current.

What an important step towards tackling the challenges of digital preservation!

Green OA vs. Gold OA

March 11, 2009

A post by Stevan Harnad on the JISC repositories listserv directed me to Richard Poynder’s post and article. Poynder’s article “Open Access: Whom would you back” provides an excellent history of the Open Access movement. It chronicles the devious and fascinating approaches employed by scholarly publishers in adapting to the evolving publishing landscape.

As a librarian in the trenches promoting Green OA in support of our institutional repository, I see on a daily basis the tactics that some publishers are using to appear compliant with open access while at the same time doing their very best to undermine the authority and validity of self-archived works. As Poynder points out, publishers are doing what they can to shift focus from Green OA to Gold OA, as this is a more profitable avenue for them. He warns that we may be disappointed if all that Gold OA accomplishes is a shift from the paying of subscriptions to the paying of APCs.

While I do agree with many of Richard’s arguments, there’s one point I’d like to make. To me, who profits most in the transition to OA is less important, because we are all winners in the end. I agree that Green OA would solve both the affordability and the access challenges, and I back it wholeheartedly as an ideal solution. But no matter which OA model wins out, as a global community we will all benefit from barrier-free access to peer-reviewed scholarship.

SIMILE makes me smile…

January 8, 2009

I’ve been researching semantic web applications and have finally set aside some time to try out some of the wonderful applications developed by the folks at MIT.

The SIMILE project creates open source applications that allow users to “access, manage, visualize and reuse digital assets”.  I recently downloaded and installed the Seek add-on for the Thunderbird e-mail application.

Seek essentially allows me to toggle a view in Thunderbird that enables me to browse my email in multiple faceted views that I can add or subtract, and even allows me to view and sort through threads.

This add-on has changed the way I tackle my email, especially when I become overwhelmed after coming back from a conference or vacation.   By clicking on a facet for a certain person’s email address, for example, I can quickly view and track all their correspondence in one screen.  It is also really helpful to find that one missing email that you know is lurking somewhere in a folder but is just not coming up in your searches.

Check out the Google Code location for the project to see updates and a great instructional video. Congratulations to the SIMILE team, and especially David François Huynh, for an extremely useful tool.

Next on my list for trying out SIMILE projects is Longwell, a web-based RDF-powered highly-configurable faceted browser.

Categories: open source, software

FSOSS 2008

October 26, 2008

I attended the Free Software and Open Source Symposium October 23-24, 2008.  As always, it was a rewarding experience, and I intend to go back next year.  For the low price of only $50 (early registration), a delegate has access to two days of sessions, several workshops, and an assortment of goodies including a tee shirt, lunches and a reception. All sessions are recorded and available on-line for viewing after the conference.

I am definitely going to try out the TikiWiki CMS/Groupware application.  It has a very handy database tool that enables the user to create databases through an easy web interface.  Web forms with customized fields can be also easily created to populate your databases. I like the TikiWiki philosophy where each release comes with every available add-on which can then be enabled or disabled through site administration pages.  This ensures that all modules are updated at the time of a new release, and saves one from having to go module-hunting when new functionality is required.

I was happy to see that FSOSS featured a session on open access.  Leslie Chan discussed the convergence of open access with open source. His session reminded us of the significance of the open source contribution to the open access revolution.  John Willinsky was visionary in realizing that a major barrier to publishing journals on-line barrier-free was the cost of creating journal publishing software.  His Open Journal Systems project has enabled over 2000 journals worldwide to make journal content available on-line, most of it without barriers to access. Open source projects like his are contributing to the steady increase of peer-reviewed scholarship freely available on-line.

Marcus Bornfreund was absolutely swarmed with questions after his talk on Creative Commons and creative copyright licensing. His session helped to bring home the message that assigning a Creative Commons license to a work does not limit one’s ability to make a commercial profit from said work. The CC license only sets the base standard for all who have not made alternate arrangements with the copyright holder. It is necessary to remember that once a Creative Commons license is assigned to a work, any further arrangements made with respect to that work cannot be exclusive. Marcus will be speaking at York University with Professor Pina D’Agostino on November 19th about copyright in the academy.

Follow-up post regarding DSpace 1.5 adoption

October 3, 2008

I strongly encourage readers of the CARL IR post below to view Mark Diggory’s comment. He discusses migration from DSpace 1.4 to DSpace 1.5. He’s made a good point: in presenting perceptions of IR meeting attendees without further elaboration, I may lead readers to draw conclusions that are not entirely accurate.

I agree with Mark’s comments and to add to his case would like to share our migration story.

Our migration to DSpace 1.5 was delayed because of other pressing projects and the fact that we were upgrading our server environment. The actual migration from DSpace 1.4 to 1.5 was rather quick.  It took us only a few weeks to pilot the migration on the test server,  present the test version to our user communities, integrate suggestions, and run the actual upgrade.  When we showed the upgrade to our users on our test environment, their suggestions were related to our new design, and not DSpace functionality itself.

I am a fan of DSpace 1.5’s integration of selecting a Creative Commons licence at the point of item submission. (This feature alone is worth the upgrade!) The Manakin interface has enabled us to create custom submission templates and to easily display customized searching and browsing capabilities.

Finally, my observations should be taken in balance with the reality that many IR managers are pulled in multiple other directions such as faculty liaison, policy development, advocacy for open access, and digital projects such as journal publishing initiatives. Repository management is only one aspect of their daily work. As a result, time to explore new features and carefully plan migrations is increasingly scarce, and this is likely the overarching factor behind slower adoption.

Categories: DSpace, repositories

CARL IR Meeting at Access 2008

October 2, 2008

The Canadian Association of Research Libraries hosted an Institutional Repository meeting in Hamilton, Ontario on Wednesday October 1st, 2008 to coincide with Access 2008, and I was happy to be in attendance.

This was an extremely worthwhile meeting where participants were able to trade stories of their successes, challenges, and plans for the future. With over 40 of us in the room, it took almost the entire duration of the meeting to do a round of introductions discussing individual repositories. I sincerely hope that this becomes an annual event!

There were three major themes that emerged from the meeting:

Theme 1 – DSpace is commonly used, with most institutions running version 1.4

Most of the repositories in attendance were hosted using DSpace software.  I was very surprised to hear that most of the DSpace hosted repositories were versions of the 1.4 release, and that only two institutions had migrated to the 1.5 release.  I was slightly relieved because we finally completed our migration to 1.5 and I thought that we were behind!

As a result, the Manakin XML interface layer for DSpace was also not being used.  We were the only ones to have a production version of Manakin running.

Reasons cited for not migrating to 1.5 included:

  • customizations made to DSpace 1.4 will take a lot of programming time to move over to 1.5
  • certain plug-ins and enhancements that are in heavy use in 1.4 have not yet been made available for 1.5
  • administrators are evaluating other platforms and are not willing to invest the time in upgrading to 1.5 if they end up switching platforms
  • programmers are hard to find, train and retain

Please visit my follow-up post to this section that elaborates on these observations.

Theme 2 – Electronic Theses and Dissertations (ETDs)

ETDs were discussed at length as they are very popular and make up a sizable percentage of most Canadian repository content.  Only one institution has mandated electronic thesis deposit, but many have effective relationships with their respective Graduate Faculties where procedures have been established to enable the depositing of theses into repositories on an ongoing basis.

Copyright has been tackled in many ways: seeking legal advice from campus legal counsel, sending letters to alumni, taking out an ad in institutional alumni magazines, and re-writing agreements to be signed by current graduates.

The availability of past ProQuest theses was discussed, and common problems were echoed: poor-quality scans for certain year ranges, ProQuest MARC records that do not tie to the digital copies of theses by filename, a lack of OCR, and the need to remove signature pages have all slowed down the workflow for ingesting these items.

Theme 3 – Scholarly Communications Programs

Many of the participating institutions are hosting outreach programs to discuss Scholarly Communications challenges with faculty.  Efforts include hosting speaker events and creating websites/supporting materials.

Categories: DSpace, repositories, software