Kuan0: 2011

Thursday, 9 June 2011

'Personal data' in the UK, Anonymisation and Encryption

Footnote 97 of our paper 'Personal Data' in Cloud Computing - What Information is Regulated? The Cloud of Unknowing, Part 1 mentioned that the appeal against an Information Tribunal decision, Department Of Health v Information Commissioner & The Pro Life Alliance [2009] UKIT EA/2008/0074, was going to be heard by the courts (erratum: it should have read High Court, not Court of Appeal).

That judgment has now been published, and is important regarding the interpretation of the 'personal data' definition in the UK.

This definition is critical when considering whether anonymised or encrypted personal data processed in the cloud should be treated as 'anonymous data', and therefore, eg, be transferable outside the EEA free of data protection law constraints - or whether such data would remain 'personal data' subject to the provisions of data protection legislation.

The UK legislation implementing the EU Data Protection Directive was the Data Protection Act 1998 (DPA), s.1(1) of which provides, in words that differ from the Directive's definition:

'"personal data" means data which relate to a living individual who can be identified--
(a) from those data, or
(b) from those data and other information which is in the possession of, or is likely to come into the possession of, the data controller,
and includes any expression of opinion about the individual and any indication of the intentions of the data controller or any other person in respect of the individual;'

The s.1(1) interpretation question is as follows. Suppose that a data controller holds information which is personal data. It then attempts to anonymise this information, and intends to disclose the resulting anonymised information to a third party. What is the status of that anonymised information? Is it still 'personal data', or can it be treated as anonymous data?

Now, para. (b) of the definition, above, requires that, when considering whether a living individual can be identified from the 'anonymised' data, account must be taken of 'other information' held by the data controller.

A strict, 'hard-line' interpretation of that provision might suggest that 'anonymised' information can never be treated as anonymous data for so long as the controller retains the original personal data from which the anonymised data were derived, because if you put together the anonymised data ('those data') with the original personal data ('other information') still possessed by the data controller, then of course people can still be identified - by the data controller, from the original personal data.

Or, suppose that the data controller has key-coded the original personal data (changed names to code numbers, with a 'key' showing which number corresponds to which name), and destroyed the original personal data, but still possesses the key. Again, people can be identified from the key-coded data in combination with the key. So, on the 'hard-line' view, the key-coded data would remain 'personal data'.
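To make the key-coding scenario concrete, here is a minimal Python 3 sketch. The record layout, the `key_code` function and the "P0001" code format are all hypothetical, chosen only for illustration; real key-coding schemes vary. It swaps each name for a code number and returns the 'key' separately, so that whoever holds both the coded data and the key can still re-identify people.

```python
def key_code(records, name_field="name"):
    """Replace each name with a code number; return (coded_records, key).

    Hypothetical illustration only: 'key' maps code number -> original name.
    """
    key = {}            # code number -> name (the 'key' described in the text)
    codes_by_name = {}  # name -> code number, so a repeated name reuses its code
    coded = []
    for record in records:
        name = record[name_field]
        if name not in codes_by_name:
            code = "P%04d" % (len(codes_by_name) + 1)
            codes_by_name[name] = code
            key[code] = name
        coded_record = dict(record)          # copy, so the original is untouched
        coded_record[name_field] = codes_by_name[name]
        coded.append(coded_record)
    return coded, key


# Made-up sample records
records = [
    {"name": "Alice Example", "year": 2009},
    {"name": "Bob Example", "year": 2010},
    {"name": "Alice Example", "year": 2010},
]
coded, key = key_code(records)

# The coded records carry no names, but anyone holding 'key' can reverse them.
reidentified = [key[r["name"]] for r in coded]
```

A recipient who receives only `coded` sees code numbers, whereas the controller, holding `key`, can map every record back to a name.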

If encryption is applied to a data set, the whole data set would be transformed, and not just names within the data set. However, where the data controller possesses the decryption key, encrypted personal data might be viewed as similar to key-coded data, and if so would, on the hard-line view, always be considered 'personal data'.

So the recent judgment is very relevant to cloud computing as well as other areas of computing, and indeed more generally.

PLA

On 21 April 2011, Cranston J dismissed the Department of Health's appeal, in Department of Health v Information Commissioner, [2011] EWHC 1430 (Admin) (PLA).

After the UK Department of Health changed their approach to the release of anonymised abortion statistics, the Pro Life Alliance requested from the Department, under the Freedom of Information Act 2000 (FOIA), anonymised statistics in the more detailed form in which they had previously been released.

The Information Tribunal considered that the requested information was 'personal data' in the hands of the Department of Health under s.1(1)(b) DPA. However, it held that, as it considered the possibility of identification by a third party from the requested statistics was 'extremely remote', the disclosure would not contravene the data protection principles of Sch 1 to that Act and was proportionate and justified when balanced against important legitimate public interests in disclosure.

On appeal, the judge held that the Tribunal's interpretation was wrong: the requested information was not 'personal data', and therefore the Tribunal should have held that the disclosure of the information to the public did not constitute the processing of personal data. (He went on to rule that, even if he were wrong and the information was 'personal data', the Tribunal had acted properly in its overall assessment, from the statistical evidence and its own judgement, that it was 'extremely remote' that the public to whom the statistical data was disclosed would be able to identify individuals from it, and in deciding that disclosure was justified under the DPA.)

Although Cranston J was invited to adopt Lady Hale's reasoning in Common Services Agency v Scottish Information Commissioner [2008] UKHL 47 (CSA), he felt bound by precedent to follow the judgment of Lord Hope, as the majority of the judges had agreed with Lord Hope's leading speech.

He acknowledged that the CSA judgments were not easy to interpret, but concluded from the wording of Lord Hope's proposed order that Lord Hope had recognised that:

although the Agency [data controller] held the information as to the identities of the children to whom the requested information related, it did not follow from that that the information, sufficiently anonymised, would still be personal data when publicly disclosed. All members of the House of Lords agreed with Lord Hope's order demonstrating, in my view, their shared understanding that anonymised data which does not lead to the identification of a living individual does not constitute personal data... The status of information in the data controller's hands did not arise for decision in the CSA case. It was concerned with the implications of disclosure by the data controller... The opening sentence of paragraph 27 [of CSA] acknowledges that the Agency holds the key to identifying the children, but continues that, in his Lordship's opinion, the fact that the Agency had access to this information did not disable it from processing it in such a way consistent with recital 26 of the Directive, "that it becomes data from which a living individual can no longer be identified". That must relate to whether any living individuals can be identified by the public following the disclosure of the information. It cannot relate to whether any living individuals can be identified by the Agency, since that is addressed in the first sentence of the paragraph. Thus the order made by the House of Lords in the CSA case was concerned with the question of fact, whether barnardisation could preclude identification of the relevant individuals by the public.

(paras. 51-52)

Cranston J said that this conclusion reflected recital 26 of the Data Protection Directive, which recognises that the Directive does not apply to data rendered anonymous, giving that recital greater force than a suggestion that the Article 29 Working Party's opinion on 'personal data' required a broader initial interpretation of 'personal data' (para. 53).

Indeed, any other conclusion seemed to him to be:

divorced from reality. The Department of Health's interpretation is that any statistical information derived from reporting forms or patient records constitutes personal data. If that were the case, any publication would amount to the processing of sensitive personal data. That would be so notwithstanding the statistical exemption in Section 33, since that exemption does not exclude the requirement to satisfy Schedule 3 of the DPA. Thus, the statistic that 100,000 women had an abortion in a particular year would constitute personal data about each of those women, provided that the body that publishes this statistic has access to information which would enable it to identify each of them. That is not a sensible result and would seriously inhibit the ability of healthcare organisations and other bodies to publish medical statistics.

(para. 54)

APGER

A discussion of the interpretation issue had previously appeared in a recent Upper Tribunal decision, All Party Parliamentary Group on Extraordinary Rendition v The Information Commissioner & The Ministry of Defence [2011] UKUT 153 (AAC) (APGER), published in April 2011. This decision was briefly discussed in PLA.

The APGER asked the MoD for information on individuals detained or captured by UK soldiers operating jointly with forces of another country in Iraq or Afghanistan (including, in the case of Iraq, detentions or captures jointly with US forces), and for information on their subsequent transfer to Guantanamo Bay or other detention facilities.

The Tribunal did not think that the dates of detention and dates and locations of any transfers would enable identification of individuals and therefore constitute personal data, based on the content of the information (especially the shortness of the detention periods) and the absence of any evidence that individuals would be identifiable from the information by reason of other knowledge held in the relevant communities (para. 109).

The MoD had also argued, based on (b) of the 'personal data' definition, that information on the numbers of individuals transferred to particular detention facilities or particular kinds of detention facilities remained personal data, even when anonymised, 'because the individuals remained identifiable by the MOD from other information in the possession of the MOD (ie, the unredacted information)'.

In this context, the Tribunal considered CSA and said (para. 127) that:

'Anonymisation by redaction is itself a form of processing. If the data controller carries out such anonymisation, but also retains the unredacted data, or retains the key by which the living individuals can be identified, the anonymised data remains “personal data” within the meaning of paragraph (b) of the definition and the data controller remains under a duty to process it only in compliance with the data protection principles.'

They also said (para. 128), emphasis added:

'However, we remain concerned at the use of this analysis in such a way as would have the effect of treating truly anonymised information as if it required the protection of the DPA, in circumstances where that is plainly not the case and indeed would be absurd. Lord Hope’s reasoning appears to lead to the result that, in a case where the data controller retains the ability to identify the individuals, the processing of the data by disseminating it in a fully anonymised form, from which no recipient can identify individuals, can only be justified by showing that it is effected in compliance with the data protection principles. Certainly the whole of the information still needs the protection of the DPA in the hands of the data controller, for as long as the data controller retains the other information which makes individuals identifiable by him. But outside the hands of the data controller the information is no longer personal data, because no individual can be identified. We therefore think, with diffidence given the difficulties of interpretation which led to such divergent reasoning among their Lordships, the best analysis is that disclosure of fully anonymised information is not a breach of the protection of the Act because at the moment of disclosure the information loses its character as personal data. It remains personal data in the hands of the data controller, because the controller holds the key, but it is not personal data in the hands of the recipients, because the public cannot identify any individual from it. That which escapes from the data controller to the outside world is only plain vanilla data. We think this was the reasoning that Baroness Hale had in mind, when she said at [92]:
“For the purpose of this particular act of processing, therefore, which is disclosure of these data in this form to these people, no living individual to whom they relate is identifiable”.'

Also of interest is the MoD's further argument against the release of the requested information even with redaction of names, because it might constitute personal data of third parties within s.40(2) FOIA: 'where small numbers of persons were involved, redaction of the names was insufficient and that individuals would be identifiable from information known to the public in areas where the detainees had been located prior to their detention.' (para. 124)

There, the Tribunal noted that, while the Information Commissioner had referred in argument to whether there was an “appreciable risk” of identification, that did not appear to them to be the statutory test, which uses the phrase “can be identified”. However, in considering the facts of the matter, the Tribunal then said (para. 129) that 'On the evidence that we have received, our conclusion on the balance of probabilities is that publication of the information the subject of the MOD’s appeal will not render individuals identifiable.' Thus, the test of identifiability applied in practice there was 'the balance of probabilities', which supports the suggestion in our paper of 'more likely than not' (p. 40, last paragraph).

Summary and comment

As mentioned in footnote 97 of our paper, a Scottish court had previously remarked that the 'hard-line' approach to s.1(1) DPA, whereby the original personal data would have to be destroyed before anonymised information could be released, seemed 'hardly consistent' with recital 26 of the Directive.

While the Tribunal in APGER based their decision on Lady Hale's judgment and Cranston J in PLA followed Lord Hope, both have now firmly rejected the hard-line interpretation. It is not yet known whether the Department of Health will be appealing PLA.

Pending the outcome of any appeal, it is at least now clear that in the UK a data controller should be able to anonymise originally-personal data and then disclose or process the anonymised data, as long as the data are sufficiently anonymised so that the public cannot identify living individuals from the anonymised data. It should not matter that the data controller itself can identify living individuals from the anonymised data and/or the original personal data.

This makes it much more likely that securely-encrypted personal data may be stored in the cloud as 'anonymous data', and should also mean sufficiently-anonymised personal data may be stored or otherwise processed in the cloud.

However, the difficulty of 'sufficiently' or 'fully' anonymising personal data still remains. How much and what anonymisation will be good enough? It seems data will not be 'personal data' if the likelihood of identification is 'extremely remote' (PLA), or perhaps if 'on the balance of probabilities' disclosure of that data will not render individuals identifiable (as applied in APGER).

Also, it's still not entirely clear how the data controller must handle the anonymised data. Consider key-coded data, on the assumption that key-coding sufficiently anonymises the data (which may itself be problematic). Under APGER, while the controller still holds the key-coded data and the key it must process that data only in compliance with the DPA - even though it may release the data without breaching the DPA because, on disclosure, the data would, in the hands of third parties, lose 'personal data' character. In contrast, Cranston J seems to consider that sufficiently anonymised personal data would not be personal data. The exact factual circumstances may well affect the position - key-coding is not the same as aggregation, and it may also make a difference whether the controller retains the original personal data and/or the key.

Note further, in relation to the status of the anonymisation or encryption process itself (discussed in our paper at 3.3.1), that the Tribunal in APGER has stated that 'Anonymisation by redaction is itself a form of processing.' (para 127)

Sunday, 24 April 2011

Amazon SimpleDB Developer Guide - unofficial errata etc

Updated 28 April 2011: my review of this book has now been published on Slashdot. They edited it down. Here's the complete review as submitted, complete with links to Amazon's current free-tier offer, and cloud computing cartoons!

These are my notes of errata, typos, queries/issues, and hoped-for improvements to the 2010 Packt book "Amazon SimpleDB Developer Guide" by Prabhakar Chaganti and Rich Helms - aka, all the mistakes I made or spotted, so you don’t have to!

I’ve no comments on the PHP code as I only tried the Java and Python, using Windows. Forgive the ugly "pre" blocks for some of the code, but that was the only way I could stop WordPress from turning normal quotes into the dreaded "smart" curly quotes that prevent code from running.

Page by page

p 5 - link at the bottom is wrong – extra slash, link doesn’t work in the PDF.

pp 25-26 - needs “keep with next” for pics and captions.

p 27 - link at the top doesn't work.

p 28 and throughout - they should have put "awsAccessId, awsSecretKey" in a different font to make it really obvious that you insert your actual keys there rather than, eg, thinking there'd be a prompt to enter your keys when you run the code. Going further, the book should have made it crystal clear that you need quotes around the keys - they're strings.

pp 28-31 – no typica imports were given. The book should provide them at least the first time (they can then be reused throughout), especially given that this is a "getting started" book, because Eclipse suggests several options and it's not clear which is the correct one. From Chapter 2 onwards the minimum imports needed (some examples need more) are generally:
import com.xerox.amazonws.sdb.Domain;
import com.xerox.amazonws.sdb.SDBException;
import com.xerox.amazonws.sdb.SimpleDB;
(alternatively, the easiest if laziest solution is to import com.xerox.amazonws.sdb.*; )

Cf Chapters 9 and 10, eg p 194, which do give all the code, complete with all imports and even “main” – why the inconsistency? It would be easier for readers if the full code were provided in the early chapters. Contrast with this SimpleDB typica tutorial, which gives all imports (and makes it crystal clear that the keys go in as strings).

There are also inconsistencies in the Python code, eg p 211 gives the preliminary code to import boto and set up the connection etc, whereas some earlier chapters leave that out. All the Python code should be similarly complete, for the convenience of those readers who (as seems most likely) try different chapters at different times: don't assume readers will work through the whole book in a single sitting. In contrast, the Amazon Web Services toolkit for Eclipse took seconds to install, a few more seconds to enter my credentials, and the SimpleDB sample code given ran immediately.

p 38 – this Chapter should explain installation for Windows too, ie open a command window in the boto-[whatever] folder, then it's python setup.py install. Add environment variables for your keys as user variables in the normal way (eg through Computer Properties -> Advanced System Settings -> Advanced -> Environment Variables). This is a strange omission as it’s in an IBM Developerworks tutorial on SimpleDB/Python/boto written by one of this book's authors.
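As a quick way to check that the environment variables are visible to Python, here is a small Python 3 sketch (the book's own Python is version 2). The `load_aws_keys` helper is hypothetical, not part of boto; the variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the ones boto conventionally reads, as far as I know. It also underlines that the keys are plain strings.

```python
import os


def load_aws_keys(env=None):
    """Return (access_id, secret_key) as strings read from environment variables.

    Hypothetical helper for illustration; pass a dict to test without touching
    the real environment.
    """
    env = os.environ if env is None else env
    try:
        return env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]
    except KeyError as missing:
        raise RuntimeError("Environment variable not set: %s" % missing)


# Example with a stand-in mapping (placeholder values, not real credentials):
access_id, secret_key = load_aws_keys({
    "AWS_ACCESS_KEY_ID": "YOUR-ACCESS-ID",
    "AWS_SECRET_ACCESS_KEY": "YOUR-SECRET-KEY",
})
```

Calling `load_aws_keys()` with no argument reads the real environment and raises a RuntimeError naming the variable that is missing, which is easier to diagnose than a failed connection.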

p 40 – there should be a True at the bottom of the page for the output you get after creating a new item. Similarly with the top of p 41.

p 42 - the last one:

sdb_connection.get_attributes('prabhakar-dom-1',car1')

should be:

sdb_connection.get_attributes('prabhakar-dom-1','car1')

- ie there's a missing open quote.

p 59 – I don’t get Domain:Cars as the output in the penultimate line. Also, the code for creating the domain has a double underscore in the name cars__domain – but it needs to be single underscore ie cars_domain, or else copy/pasting the subsequent code (which uses cars_domain) won’t work.

p 60 – needs a space after the import ie it's import SPACE inspect. Also, pp.pprint(inspect.getmembers(cars_domain, inspect.ismethod)) won’t work because the name has a single underscore here, see p 59. And so on.

p 63 - The line

Domain domain = sdb.getDomain("songs");

should be

Domain domain = sdb.getDomain("Cars");

p 64 – pasting the Python code shown here won’t work unless the double underscore on p 59 is fixed, or you use a double underscore here instead, ie cars__domain

p 70 – “It makes sure you call save() to actually persist your additions to SimpleDB” is misleading and gives the impression that add_value automatically includes a save() - cf p.71, which reads (correctly) “You must once again call save() in order to persist the changes.” The p 70 sentence should read something like “After calling add_value, make sure that you also call save()...”

p 75 - cars __domain should be (see p 59) cars_domain
Missing code (this should be line 4):

myitem2 = cars_domain.get_item('Car 2')

p 78 – I get

u'dealer'

in the results of running the code, not

'dealer'

p 88 – Java code gives the body of the method provided for zeropadding; but readers may be more interested in the use of the method, eg
String encoded = DataUtils.encodeZeroPadding(234, 7); // arguments: int number, int maxNumDigits
or

int decoded = DataUtils.decodeZeroPaddingInt("0000234");
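For anyone wondering why zero-padding matters, here is a rough Python 3 sketch of the idea. The two functions are my own hypothetical stand-ins, not typica's or boto's code, and they only handle non-negative integers. SimpleDB compares values as strings, so unpadded numbers sort in the wrong order.

```python
def encode_zero_padding(number, max_num_digits):
    """Left-pad a non-negative integer with zeros to a fixed width."""
    if number < 0:
        raise ValueError("this sketch handles non-negative integers only")
    return str(number).zfill(max_num_digits)


def decode_zero_padding(value):
    """Turn a zero-padded string back into an int."""
    return int(value)


encoded = encode_zero_padding(234, 7)
decoded = decode_zero_padding("0000234")

# Compared as strings, "1000" sorts before "234"; padded strings sort numerically.
unpadded_order = sorted(["234", "1000", "5"])
padded_order = sorted(encode_zero_padding(n, 7) for n in (234, 1000, 5))
```

The padded list comes out in numeric order because every string has the same width, which is the whole point of encoding numbers before storing them.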

p 93 – again it would be more useful if rather than providing the method body this page provided code showing its use, like: Date aDate = new Date(); String encodedDate = DataUtils.encodeDate(aDate); System.out.println(encodedDate); - and similarly with the decodeDate() method.
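The same usage-first approach in Python 3 might look like the sketch below. The `encode_date` and `decode_date` functions and the DATE_FORMAT layout are assumptions of mine for illustration; I have not checked that the layout matches what typica's DataUtils produces. The point is only that a fixed-width, most-significant-field-first timestamp sorts chronologically when compared as a string.

```python
from datetime import datetime, timezone

DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"   # assumed layout, not necessarily typica's


def encode_date(moment):
    """Encode a timezone-aware datetime as a fixed-width UTC string."""
    return moment.astimezone(timezone.utc).strftime(DATE_FORMAT)


def decode_date(text):
    """Parse a string produced by encode_date back into an aware datetime."""
    return datetime.strptime(text, DATE_FORMAT).replace(tzinfo=timezone.utc)


earlier = encode_date(datetime(1985, 6, 1, 12, 0, tzinfo=timezone.utc))
later = encode_date(datetime(2011, 4, 24, 9, 30, tzinfo=timezone.utc))

# String comparison agrees with chronological order for this layout.
in_order = earlier < later
```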

p 111 – the p 116 info on quoting should be given here, and in the main body of the text rather than a side “warning” - I personally find those warnings easily missed, possibly because they’re in a smaller font. Using the backtick ` (above the Tab key) instead of a single quote ‘ isn’t obvious, especially to someone typing out the code instead of copy/pasting, so it merits major highlighting. The Amazon guide is much clearer on when ` must be used. To emphasise, in SELECT queries you must use ` around the domain name if the name contains, eg, a hyphen or underscore, or else it won't work. (And you're not "escaping" here, you're quoting with a backtick.)
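One way to avoid forgetting the backticks is to build the query string with a small helper. This is a Python 3 sketch with hypothetical function names (`quote_name`, `build_select`); it is not part of boto. It assumes that quoting a name which doesn't strictly need it is harmless, and that an embedded backtick is escaped by doubling it, which is how I read Amazon's guide.

```python
def quote_name(name):
    """Wrap a domain or attribute name in backticks for a SELECT expression.

    An embedded backtick is doubled (assumed escaping rule, per Amazon's guide).
    """
    return "`" + name.replace("`", "``") + "`"


def build_select(domain, where=None):
    """Build a simple 'select * from ...' expression (hypothetical helper)."""
    query = "select * from " + quote_name(domain)
    if where:
        query += " where " + where
    return query


# Domain name taken from the book's earlier examples; the condition is made up.
query = build_select("prabhakar-dom-1", "`Year` = '1985'")
```

Note the two kinds of quote in the result: backticks around the domain and attribute names, ordinary single quotes around the value.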

p 115 – there's info missing about You’re a Strange Animal, whereas info about that item was added in p 108 and 110 - the example should be carried through in full ie:

>> 1045845425 {u'Genre': u'Rock', u'Rating': u'****', u'Song': u"You're a Strange Animal", u'Artist': u'Gowan', u'Year': u'1985'}

(cf pp 122, 124, 125, 129, 130, 131, 132, 134 which are consistent on that front).

p 136 – getAttributes in Java - this code won’t run, and I can’t find the getItemsAttributes() method in http://typica.s3.amazonaws.com/com/xerox/amazonws/sdb/Domain.html

p 146 – the download link for JetS3t is now http://jets3t.s3.amazonaws.com/downloads.html. And the info here is incomplete – “Add the jets3t-0.7.2.jar to your classpath” is not good enough. You also have to add commons-httpclient-*.jar (in the jets3t libs directory) to the classpath, or else it won’t work.

By the way, this isn't mentioned in the book but, when testing stuff on S3, a good way to check the results of running the code is to use JetS3t Cockpit (run the script in the JetS3t bin directory eg cockpit.bat if you're on Windows). And if you try the book's examples, you might want to use a different bucket name from packt_songs, or, alternatively, don't forget to delete that bucket when you're through. Bucket names are unique throughout the whole of AWS, not just to your account, so if you don't delete it, no other readers will be able to use the same bucket name.

p 148 – “We will use a MD5 hash that is generated from the name of the song, name of the artist, and year.” – but, the code given doesn’t in fact use the year.
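For comparison, a key that really does include the year could be built along these lines. This is a Python 3 sketch, not the book's code: the `song_key` function and the "|" separator are my own arbitrary choices, so it won't reproduce whatever keys the book's examples generate.

```python
import hashlib


def song_key(song, artist, year):
    """Return an MD5 hex digest built from song name, artist and year."""
    raw = "%s|%s|%s" % (song, artist, year)   # '|' separator is an arbitrary choice
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


key_with_year = song_key("You're a Strange Animal", "Gowan", "1985")
key_other_year = song_key("You're a Strange Animal", "Gowan", "1986")

# Changing only the year changes the key, which shows the year is really used.
year_matters = key_with_year != key_other_year
```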

p 149 – why is the line with user key details commented out?

pp 149-151 – isn't there more efficient “for” code to do this, like the Python version on p 152, instead of going through each item individually?

p 151 – code is missing for “You’re a Strange Animal”.

p 154 – “/songs_folder” is used here, cf "/Users/prabhakar/Documents/SimpleDB Book/songs/” on p 160 – another inconsistency. More importantly, the code doesn’t run unless the mime.types file from the jets3t configs folder is added to the classpath (I copied it to a lib folder in my Eclipse SimpleDB project, then added that lib folder to the project’s build path as a Class folder in the project’s properties). Also, this code doesn’t allocate keys for the uploaded files using the relevant data from SimpleDB; the keys here are just the filenames. Either the book should provide code that uses SimpleDB data as the keys (as the Python code on p 159 does), or else it should explain clearly to readers that this can’t be done in Java.

p 160 – songs.select should be songs_domain.select in order to work with the previously-given code. Also, it wouldn't hurt to remind Windows users to escape the backslashes in the file/folder path eg C:\\path to\\songs/%s.mp3
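On the backslash point, this short Python sketch shows what goes wrong and two ways round it. The folder names are made up, picked so that the unescaped version visibly breaks ("\t" and "\n" are real escape sequences).

```python
escaped = "C:\\temp\\new songs/%s.mp3"    # backslashes doubled
raw = r"C:\temp\new songs/%s.mp3"         # raw string: backslashes left alone
unescaped = "C:\temp\new songs/%s.mp3"    # "\t" becomes a tab, "\n" a newline

same = (escaped == raw)                                   # both spell the same path
damaged = ("\t" in unescaped and "\n" in unescaped)       # the path has been mangled
target = raw % "song1"                                    # fill in the %s placeholder
```

Either doubling the backslashes or using a raw string gives the intended path; forward slashes generally work on Windows too.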

p 161 – why not use more efficient code with a “for” loop? In any event, this code wouldn’t run: first, “The method downloadObjects(S3Bucket, DownloadPackage[]) in the type S3ServiceSimpleMulti is not applicable for the arguments (S3Bucket, DownloadPackage[])”, then on casting downloadPackages to DownloadPackage[] and trying to run it: “Unable to determine S3 Object key name from signed URL: null”. There were also warnings of deprecated methods/types. Also, it’s not clear which local location the files get downloaded to, cf p 164 for Python which makes clear what the specified download directory will be. The info on http://www.ibm.com/developerworks/library/ar-cloudaws2/ with the comments in the code is clearer as to which code is meant to do what, and it would have helped if the code in the book had been similarly commented.

p 164 – see p 160 comment on “songs_domain” – occurs twice.

p 172 – “This sample will print the following values to the console:” – not exactly, the requestID will of course vary with the user.

pp 186-188 – memcached is also available for Windows http://www.splinedancer.com/memcached-win32/ - installation instructions are on that page, your directory structure may vary of course.

p 189 – why so specific on “Copy the JAR file named java_memcached-release_2.5.0.jar to a folder that is on your classpath.”? Why not just say, add it to your classpath? (adding it as an external jar also works, for instance). This page should also include instructions for memcached on Windows, as p 38 sort of does - ie download the python-memcached library, extract the files, run cmd, cd to the folder, use “python setup.py install”; start the server with the command “c:\path\to\memcached.exe -d start”.

p 190 – it can't be a bad idea to remind readers to start the memcached server running first, here.

p 192 – “mc = memcache.Client(['127.0.0.1:12312'])” – why is the port said to be 12312 here? Cf p 190 where it’s port 11211 for the Java. Only 11211 works for me, at least when using memcached for Windows with Python.
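If you're not sure which port your memcached server is listening on, a quick check from Python 3 is sketched below. The `port_is_open` helper is my own, using only the standard library (it doesn't speak the memcached protocol, it just tries to open a TCP connection), so a True result only tells you that something is listening there.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        connection = socket.create_connection((host, port), timeout)
    except OSError:
        # Refused, timed out, or unreachable: treat all as "not open".
        return False
    connection.close()
    return True


if __name__ == "__main__":
    # 11211 is memcached's usual default port; 12312 is the port printed on p 192.
    for port in (11211, 12312):
        print(port, port_is_open("127.0.0.1", port))
```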

pp 194-196 – the Java code didn’t work for me, it still keeps retrieving the data afresh from SimpleDB – even though the Java test on p 190-191 showed that the memcached server is working fine, and the caching certainly works in Python (p 202).

p 201 – the code starting at the bottom of the page should be saved into a file called sdb_memcache.py – a big omission. Newbies – best to save the py files to the same folder as your Python installation eg the Lib subfolder; and NB you have to fix the indents if you copy/paste.

p 202 – if using memcached for Windows, it won’t work unless you use port number 11211 ie: sdb_mc = SDBMemcache("127.0.0.1","11211")

p 205 - "In this chapter, we will explore how to run parallel operations against SimpleDB using boto." - but it's not just using boto. The page number's missing from this page.

p 213 - "Here is a simple Python script that updates items by making three different calls to SimpleDB, but in a serial fashion, that is one call after another." - but, no script was actually given?? And why not give the code for "Running this through time"?

pp 213-216 - it would have been more helpful to give the explanations as comments against the relevant parts of the code, so that it's clear which bit of the code does what. That's a general point about the earlier Java code in this book too.

p 221 - to install eggs you have to first install easy_install. Although I'd already installed setuptools, I still had to download ez_setup.py for this to work.
