Amazon, privacy, and the DHS

Thu, 7 Aug 2003

The other day, a KCLS patron emailed in and requested Amazon-like features for iPac (the web catalog), such as “people who checked out this author’s work also checked out these other authors.” Now, of course we can’t do that right now because it’s not one of the features available in iPac, but the question got me thinking about all the reasons why one wouldn’t want to do something like that in a library system. All I said to the patron at first was that it was contrary to KCLS’ privacy policy. He wrote back, asking whether it would be possible to remove the patron data, so that privacy wouldn’t be violated. Since he asked, and since he referred to the so-called Patriot Act, I decided it was probably safe to talk to him honestly.

There are some technological barriers to this kind of feature. KCLS licenses its OPAC software from other companies; this means that it doesn’t have to develop, maintain, or support the software itself, but it also means that there’s less flexibility and autonomy. We are dependent on what Dynix decides to develop. (I have issues with Dynix, both as a company and as a product, but of course it’s really hard to move everything to another software system once the data is there, the staff is trained, etc. Thus administrative inertia and the ability of Dynix to say “Screw you, haha,” because it’s really hard for us to leave once they’ve got us. But I digress.)

There are a couple of ways I could think of to implement this kind of functionality. One is fairly simple, complete, and horribly wrong. The other isn’t as wrong but doesn’t actually provide the functionality it promises. Let me elaborate.

The first solution is to save patron circulation data and use that to generate book suggestions. This would be simple to implement and would be able to link items that a single person checked out over time. However, it’s also gross, horrendous, and horribly wrong, because there’s absolutely no way to sever the patron from the circulation data. Without the patron connected to the bibliographic data, there is no “patrons who…”, which is the whole point of the feature. Not only is all that against KCLS’ privacy policy (KCLS’ system connects patrons to items only while the book is checked out and up to 4 days after it’s checked in, or sooner if someone else checks it out before that time), but the mere notion gives any librarian nightmares.

The other possibility would be to maintain data only for individual checkout sessions. This would require keeping another (kind of redundant) file of data and would link only items checked out in the same checkout session. This implementation would sever patron information and might even have a greater likelihood of linking only relevant works, but it would also miss related items that the same patron checks out on different visits.
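For the curious, here is a minimal sketch of what that second option might look like in Python. It is only an illustration, not anything iPac or Dynix actually offers; the names `record_session` and `also_checked_out` are made up. The point is that only pairs of items from one checkout session are counted, and no patron identifier is ever passed in or stored.

```python
from collections import Counter
from itertools import combinations

def record_session(pair_counts, session_items):
    """Count every unordered pair of distinct items checked out in one session.

    No patron identifier is passed in or stored (hypothetical helper).
    """
    for a, b in combinations(sorted(set(session_items)), 2):
        pair_counts[(a, b)] += 1

def also_checked_out(pair_counts, item, limit=5):
    """Return the items most often checked out in the same session as `item`."""
    related = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            related[b] += n
        elif b == item:
            related[a] += n
    return [title for title, _ in related.most_common(limit)]

# Two sessions; the counts say nothing about who borrowed what.
pair_counts = Counter()
record_session(pair_counts, ["Emma", "Persuasion", "Dune"])
record_session(pair_counts, ["Emma", "Persuasion"])
suggestions = also_checked_out(pair_counts, "Emma")
```

Because a session is forgotten as soon as its pairs are counted, two books borrowed by the same person a week apart never get linked, which is exactly the trade-off described above.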

The other problem with implementing either of these options, whether there actually is a privacy violation or not, is that it looks like one. Patrons don’t know how this stuff is implemented. There are a lot of people who are concerned about personal circulation data, or about the idea that it could be subpoenaed, and we don’t want to give the impression that we’re violating privacy. We also don’t want to give law enforcement professionals the impression that we have data that could be subpoenaed. (But you have it in your catalog…)

I am reminded of the power of information, and this attempt by our government to control people’s access to it. (Filtering too… bah. But that’s another rant entirely.) Knowledge is power. The mission of the library is empowering people through information. No child left behind, my foot. What about the adults? What about self-learning? Mutter grumble ultra-conservative pigs.

Comments

Stephen says:

Option 1: Unfortunately, even if you could separate the patron identity from the record via some means of one-way encryption, the system would be self-defeating. All it would take would be to assign a unique "book" to the target patron, and voila: "Patrons who checked out 'unique book' also checked out......"

Option 2: I actually like this one. I don't think the scope is as limiting as you suggest; in fact, it may add some focus, unmuddy the waters so to speak. (I mean, really, I've been buying from Amazon.com for years and they still have very little idea what interests me.) Instead of associating books with patrons, you associate books with each other. Over time, books (and authors) that are often checked out together would become associated with each other.

Of course, as you say, you can only work with the software you're given to work with.

Laurabelle says:

Stephen, somehow your comment about Option 1 isn't parsing properly in my head. Could you elaborate?

You're right that Option 2 isn't totally worthless. There is some logical likelihood that books that are checked out together will be more related than books that are checked out by the same patron at different times. On the other hand, practice is not the same as theory. It depends on what patrons are actually doing.

Your remark that Amazon still has trouble recommending good stuff to you is very pertinent. I often get recommendations for books that are already the most-read volumes on my shelf. On the other hand, I also take the time to go through and rate books for liking or disliking (and whether I own them).

The trouble with such automated systems is that they operate on raw checkout data of some sort, and the results of the analysis may or may not be at all useful. Statistically, the titles which are checked out together the most would rise to the top, but that depends on a mass of data, and there's no guarantee of relevance. There are lots of recommendation tools which have been consciously designed by librarians or other information professionals (like What Do I Read Next?), but there's also something to be said for letting the market decide. [Insert debate about the nature of relevance.]

I still think that appearances are very important, and that this sort of feature looks like a privacy violation even if it isn't. I also think that the library shouldn't imitate Amazon at every step. There's not much we can do about it right now anyway, if we wanted to add such a feature. We could put pressure on Dynix to implement it, but there's no guarantee.

Stephen says:

What I meant:

Let's say, hypothetically, you have a fool-proof method of logging a patron's check-outs without linking it to the name. For instance, John Smith's patron ID (say, 57100936) is encrypted so that his check-outs are logged under record #387J658W. We would use something like public-key cryptography (with the computer NOT recording a private key) so that it would be a one-way deal; John Smith's stuff would always go to 387J658W, but we could not go backwards and find out who record #387J658W belonged to.

The problem: If the Feds (or whoever) wanted to find out what John Smith was reading, they could input into his check-out data a "book" that doesn't actually exist. They could then locate his file by finding the one with that book in it, since no one else could possibly have that book in their record.

Theoretically, the Feds could have a computer insert a different unique "book" for each patron and cross-connect the names to the encrypted IDs that way. Then they would have a complete database to check against a terrorist literary profile (or whatever).
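A rough sketch of the marker trick in Python, under stated assumptions: a truncated SHA-256 hash stands in for the one-way encryption (a swap for the public-key idea above, chosen only because it is easy to show), and `pseudonym`, `check_out`, `find_by_marker`, and the marker title are all made-up names for illustration.

```python
import hashlib

def pseudonym(patron_id: str) -> str:
    """One-way mapping from patron ID to record key (SHA-256 is an assumption)."""
    return hashlib.sha256(patron_id.encode()).hexdigest()[:8]

# Record key -> set of titles; the patron ID itself is never stored.
history = {}

def check_out(patron_id: str, title: str) -> None:
    history.setdefault(pseudonym(patron_id), set()).add(title)

def find_by_marker(marker: str) -> list:
    """Return every record key whose history contains the marker title."""
    return [key for key, titles in history.items() if marker in titles]

check_out("57100936", "Walden")
check_out("11111111", "Walden")

# The attacker "checks out" a nonexistent book to the target patron...
check_out("57100936", "MARKER-0001")

# ...and the only record holding that title must belong to the target.
target_key = find_by_marker("MARKER-0001")[0]
target_reading = history[target_key] - {"MARKER-0001"}
```

Nothing here reverses the hash. The marker simply labels the record from the outside, so the strength of the encryption never comes into it.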

Laurabelle says:

Twice already I've started to type out something to the effect that it wouldn't be quite as easy as you suggest to circumvent the encryption... but indeed, it would be. One wouldn't even need to check out a false item to that patron; as long as the patron had a couple of books out, one could match records of patrons who had checked out those same titles, and the list of matching checkout histories should be fairly short.
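As a small illustration (made-up data, and a hypothetical helper called `candidates`), the matching step is just a subset test over the anonymized histories:

```python
def candidates(history, known_titles):
    """Return record keys whose history contains every known title."""
    known = set(known_titles)
    return [key for key, titles in history.items() if known <= titles]

# Anonymized histories keyed by opaque record IDs (illustrative only).
sample_history = {
    "rec-a": {"Walden", "Dune"},
    "rec-b": {"Walden", "Emma", "Ulysses"},
    "rec-c": {"Emma", "Dune"},
}

# Knowing just two titles the patron currently has out narrows it to one record.
matches = candidates(sample_history, ["Walden", "Emma"])
```

The more titles one already knows, the shorter the list of candidates gets.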

Scary thoughts. It makes me want to smash all the computers... except that a paper trail would be even worse. At least bits and bytes are easy to erase.

Jeff says:

I think a system like Amazon's "people who borrowed this book also borrowed" would work. It's just a matter of hitting the threshold of data where the law of large numbers starts to take effect. I could see the setup being in the book record: a table associated with each book which lists each book the primary book has been checked out with, along with a frequency field, sorted by frequency. There would be no reference to the actual check-out operation; the check-out process would merely update these tables in the database.

Perhaps a date field could be added to say when the particular book was added to the record, and used to cull books of dubious relationship to the original book (i.e., have the date field just be month/year, and if the book doesn't reach a certain frequency in x number of months, then that record is removed). Just a thought.
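Here is one way the table-plus-date idea might be sketched in Python. Everything in it is an assumption for illustration: the names (`Link`, `record_checkout`, `cull`, `related_books`), the in-memory dictionaries standing in for database tables, and the default thresholds of 6 months and 3 check-outs.

```python
from dataclasses import dataclass
from itertools import permutations

@dataclass
class Link:
    count: int       # how often the two books were checked out together
    first_seen: int  # month the link was first recorded (year * 12 + month - 1)

def month_index(year, month):
    return year * 12 + (month - 1)

def record_checkout(tables, items, year, month):
    """Update each book's table; nothing about the check-out itself is kept."""
    now = month_index(year, month)
    for a, b in permutations(set(items), 2):
        link = tables.setdefault(a, {}).setdefault(b, Link(0, now))
        link.count += 1

def cull(tables, year, month, max_age_months=6, min_count=3):
    """Drop links that are old enough but never reached the minimum frequency."""
    now = month_index(year, month)
    for related in tables.values():
        stale = [b for b, link in related.items()
                 if now - link.first_seen >= max_age_months
                 and link.count < min_count]
        for b in stale:
            del related[b]

def related_books(tables, book):
    """List the books linked to `book`, most frequent first."""
    related = tables.get(book, {})
    return sorted(related, key=lambda b: related[b].count, reverse=True)
```

For example, if "Emma" and "Persuasion" go out together three times but "Emma" and "Dune" only once, calling `cull` six or more months later would keep the first link and remove the second.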

I base this on the assumption of how I use a library. When I'm in a library, I'm looking for something specific, and usually check out books of the same or similar topic. If there is an actual correlation, this data could be used to help patrons find books more easily; e.g., they look at one record, and up pops a list of other books that are closely associated. At that point, there would be feedback in the system, and it would tend to reinforce itself.

The big problem with these types of systems is volume, and getting enough data that the relationships are meaningful. It is a matter of getting past the threshold where the law of large numbers applies.

I'm just rambling anyway. Not necessarily a response to anyone, just thoughts inspired.

Jeff
