Monday, 16 September 2013

Data protection law: basic guide & info (including for open data / big data startups)

This is a 1-page basic guide to data protection law, particularly relevant to open data / big data in cases where the data processed involve 'personal data'.

Data protection law in a nutshell

To tech folk, 'data protection' usually means IT security. To lawyers, 'data protection' usually means data protection laws. There's some overlap, but they're not the same. I'm just going to use 'data protection' in the legal sense.


Data protection is also not the same as privacy. Again, there's some overlap, but technically they're different. Data protection laws can even apply to public data, ie non-private personal data. Privacy law in the UK has largely been developed by the courts under Article 8 of the European Convention on Human Rights to protect people against the misuse of their private information (mostly, celebs who can afford to litigate).

There are also laws about the use of confidential information, which could cover some corporate commercial data:imageSo what's data protection really? Well, EU data protection laws apply to the 'processing' of 'personal data', with exceptions eg for national security, or processing for personal purposes like keeping your personal contacts in an electronic addressbook(though at least one council has tried to argue that bloggers should be liable under data protection laws - see the full correspondence).

Data protection laws are really broad because 'processing' is almost anything you can do with or to personal data where it's been digitised at some point in the process, including just storing, transmitting or disclosing personal data as well as actively working on it. And 'personal data' is basically anything that can be linked to an identified or identifiable living individual ('data subject'), so something that's not 'personal data' one minute could become 'personal data' the next if it's become linked to an identifiable person through big data crunching, for instance.

Data protection law requires 'controllers', ie anyone who controls the 'purposes and means' of processing personal data, to process personal data according to certain key principles (regarding not only the use or abuse of personal data but also issues such as data accuracy and security), with tighter rules for certain sensitive information like health-related data. Failure to do so may be punished, mainly by the regulator (who in serious cases can fine up to £500k in the UK), or in some very limited cases the affected data subjects could try to stop the processing or sue for compensation. Breach of principles could also be a criminal offence in some situations. Controllers must register with or notify the national regulator and pay fees.

The concept of anonymous data is recognised. The approach taken is quite binary, in the sense that if something is 'personal data', all data protection law rules apply to it, so it must be processed in compliance with the principles etc; whereas if it's not 'personal data' but anonymous data, then none of them do. Of course in actuality the dividing line is harder to draw, but the law is what it is. Many laws are like this, claiming to apply to things in different ways depending on whether or not they fit within a set category or categories, implicitly assuming that there are bright lines between them, when in fact it's often hard to work out which if any category a real situation fits into, and technological, social and business developments can make the dividing lines even blurrier over time.

Something is not 'personal data' if it's been anonymised so that individuals can't be identified by any means 'likely reasonably' to be used to attempt de-anonymisation, including by combining it with other data (note: that refers to the means likely to be used, not the means actually used: if you can re-identify, eg because the anonymisation hasn't been done very well, the 'anonymous' data are still personal data even if you don't actually do it). This again means that as re-identification methods improve, something which used to be anonymous data could become 'personal data' when techniques get to the point that the data could be deanonymised. [Clarification: the 'likely reasonably' wording is from the EU Directive. For the UK-specific position and summaries of cases, see the Anonymisation Code of Practice]

EU data protection law comes from the Data Protection Directive. This applies to countries in the European Economic Area (I've done a Venn diagram showing the differences between EEA, EU, Europe and Council of Europe). As this is a Directive, not a Regulation, EEA countries have room to implement it differently, so detailed data protection laws may vary with the country - and do, sometimes significantly. For example, some countries protect the 'personal data' of organisations as well as people (the UK doesn't). The rules on security are about a few paragraphs long in the UK, several pages long in Italy.

The ICO is the UK's data protection (and freedom of information) regulator. It's published lots of useful info both for data subjects and for those who process personal data, so do rootle around its website.

I should also mention the Article 29 Working Party, the group of EU data protection regulators collectively. It's produced many opinions and other documentation, including on:

So there's lots of guidance out there, it's just that most people who aren't data protection specialists don't know about it (and, of course, may not know how to understand or apply it in practice).

But note that regulators' guides and opinions aren't legally binding - only a court case can provide definitive guidance. However, if you follow a regulator's guidance, you're of course less likely to find yourself in its enforcement sights.

General info

There's basic info on UK data protection law plus guide to data protection including:

Remember that the ICO can take enforcement action for breaches (and has a policy on regulatory action). This can include imposing monetary penalties (framework, guide, procedures - and see what enforcement action it's taken so far including criminal prosecutions, and CSV of fines issued so far).

For organisations like startups

There's a checklist on data protection compliance, a checklist on collecting personal data, and a brief general guide for small businesses.

The ICO website has free training materials including videos and security guidance. You can ask the ICO for help, eg request an advisory visit to your organisation.

On privacy notices in particular, there are guides on:

The ICO has sectoral guidance including for non-profits and the health sector, and you can see its full list of guidance material, including specific guidance on certain areas like:

Regarding sensors etc and the infamous cookie law:

To keep up to date - the ICO has:

For data subjects (whose data are being processed)

You have some data protection law rights, here's info on two main ones:

You can complain to the ICO, for free:

You could also sue for compensation in some situations but they're very limited, as you can tell from that link being directed at organisations and the lack of similar general info for individuals about suing! And you'd have to get lawyers to help you litigate. You could try to DIY, but that didn't work out very well for Mr Smeaton (short summary (scroll down), longer summary, another summary, full judgment).

However, an eminent data protection expert has argued that even the non-rich could, instead of suing, try complaining about privacy breaches (not just data protection breaches) to the ICO, ie 'ask [the ICO] for an assessment with respect to lawful processing with respect of Article 8' - and I think he's got a point there. So if you try this, good luck and please keep me informed!

"We can't, because of data protection"

Let's just dispel one myth. Too many organisations hide behind 'data protection' to refuse to do something that they can and indeed should do. Maybe because they just don't want to, or couldn't be bothered, or they're covering themselves and think it's just easier and safer not to do it. And they often get away with this, because too many people don't understand data protection law and believe their 'It's data protection' excuse.

That's partly the fault of data protection law and regulators, because the law is very complex and detailed, and there's tons of legislation and guidance to wade through (as well as some cases interpreting the law). But the basic principles are mostly quite straightforward (listed earlier).

The ICO has tried hard to address these practices by organisations, which it calls 'data protection duck outs' (eg myths and realities about data protection), but believe it or not there have been 'data protection' incidents regarding animals, trees and kids (plus a Superman suit). There are also myths about data sharing, and myths about marketing calls too.

Occurrences like this don't exactly fill one with confidence that things may change for the better. We can only hope that more people will learn about these myth debunkers, and that bureaucratic organisations will start applying common sense and stop using 'data protection' as a justification for introducing more unnecessary 'get in the way' red tape.

Usual weaselly disclaimers (and why you should use lawyers, and where to get free legal advice)

May I stress that all the above is general info only, not legal advice!

Lawyers say this sort of thing because legal advice needs to be tailored to your individual situation, and inevitably everyone's is different.

Also, laws don't always mean what they literally say. We'd love them to (as would the Good Law initiative), but sometimes, maybe even often, they don't. This may be because there can be layers of meaning, or qualifications, conditions and/or exceptions, so that it's sometimes necessary to wade through provision after provision, following the trail of definitions through to still further legislation, before it's possible to get even the bare bones of what something means.

For instance, 'fair and lawful' in the first principle means more than just 'fair and lawful': for processing to be 'fair and lawful', it must first fit within one of several defined boxes ('consent' is one), and it also has to be generally fair and lawful. And I've put quotes around 'consent' because 'consent' itself has a specific meaning, it's not 'consent' unless the consent was a freely-given, specific and informed indication of the data subject's agreement to their data being processed.

Or, legislation can be drafted obscurely, so it's hard to figure out what it means, and it would take a court case to find out what judges think it means. Or, legislation can be drafted by people who don't understand how technology works (yes it happens!), whether it's websites, or cloud computing. Or, the legislation is so old that it didn't properly envisage future technological developments - like copyright law controlling the right to copy rather than the right to use (book), leading to effectively all computer usage being copyright infringement because the technology works by copying. It's often hard to apply old or unsuitable laws to modern technology.

Even when an issue has gone to the courts for decision, while some judges are admirably easy to understand, with others even seasoned lawyers may get even wrinklier-browed desperately trying to figure out exactly what m'lud meant. Sometimes, it's because the judge isn't as clear as he or she could be. Other times, it's because judges are trying to do what they feel is the fair and right thing, and so may suggest or say that the law means something other than you might think it means (I dub this the Denning dimension, aka 'The little old lady wins!', sometimes manifested as 'hard cases make bad law'). That's why, while technologists may think in binary, in either/or, lawyers have to think in analogue - in shades of grey:

(Image reproduced by kind permission of

And that's also why attempts to translate laws into algorithms and code are almost certainly doomed to failure; it's near impossible, as for example an experiment in implementing supposedly simple road traffic rules in software showed.

Lawyers with expertise in particular fields, whether data protection, intellectual property or computer law, have been trained to understand the jargon and to know or be able to work out how to reconcile all these different elements in order to determine what the workable paramenters are, and to arrive at something that can make some kind of sense in practice.

In addition, experienced practitioners should have a feel for how the law is actually enforced in real life, eg by regulators, so that they can give you some idea of how likely it is that you'll be fined or worse, and what the penalties are. Then you can decide, particularly in the (too many) areas where the law isn't clear, whether to take the risk that (a) whatever you plan to do. that might be a breach, will be found out, (b) authorities will take enforcement action against you, and (c) you'll be fined or prosecuted for it.

Of course, if you use lawyers rather than DIY, you might be able to sue them if things go wrong and it's their fault - because practising lawyers should be insured!

Finally, the internet may be global but laws are national, so different countries' laws may apply in different (or indeed the same) situations, and so you may need advice from lawyers qualified in the relevant countries.

Therefore, at some point a startup will need a lawyer. Not just to keep certain lawyers (alas not me) in mansions and private school tuition fees, but for its own benefit in terms of protecting its IP, making sure it's not breaking data protection or other laws, and certainly when it comes to that hoped-for cashing-in IPO.

Law centres, citizens advice bureaux and the Bar pro bono unit are free, but may lack specialist IT or data protection expertise. Own-IT can give free intellectual property law advice, and Queen Mary, University of London (where I'm a PhD student and working part-time) has an advice service including a Law for the Arts Centre that offers free IP law advice, but again may not necessarily have IT or data protection expertise. However, Queen Mary is also launching a new free advisory service for startups, qLegal, aimed at providing legal and regulatory advice specifically to ICT startups, where postgrad students will work with collaborating law firms and academics - so please feel free to try that!

Disclaimer: the book I linked to above is by my PhD supervisor, but I linked to it because it makes very salient points on why many laws don't work in cyberspace and how they could be made work, plus it's a good read (even for non-lawyers) - not because I'm trying to curry favour!

Monday, 2 September 2013

Basic tutorial: Map/Reduce example with R & Hadoop, including Amazon Elastic MapReduce (EMR)

This is my write-up of Anette Bergo's very useful session for Women in Data in August 2013, but reordered and with some extra notes and screenshots of my own.

Anette showed exactly how this sort of thing should be done - basic foundation, enough code to demo the key principles without over-complicating things, talk through the code, run it!

Any errors are mine alone, if you spot any please let me know.



  • Download and install R - it's multi-platform so there are Linux, Mac and Windows versions
    • RStudio IDE helps provide a friendlier interface
  • (To clone Anette's example repo) Download and install Git
  • (For the EMR bit only) Sign up for an Amazon Web Services account.
    • If you have an Amazon account you can login with that, but you still need to sign up specifically for AWS.
  • (For EMR only, as it costs you money to run the demo) Sign up for Elastic MapReduce (circled in blue in the screenshot below, accessible via the AWS console - you'll need to enter credit card details and possibly go through a phone verification and wait for their confirmation email before you can use EMR.

What's the R programming language?

R is a DSL for statistical/mathematical analysis.

Everything is a vector in R (just as in Git everything is a directed graph).

What's MapReduce?

MapReduce is a programming framework for parallel distributed processing of large data sets. (Originally devised by Google engineers - MapReduce paper.)

Effectively, Hadoop is the open source version of Google's MapReduce (it also includes an open source version of Google File System and increasingly other components).

Amazon Web Services' Elastic MapReduce lets you set up and tear down Hadoop clusters (master and slaves). The demo uses R but EMR will accept eg Python, shell scripts, Ruby. You can deploy with the Boto library and Python scripts.

MapReduce involves: Input reader - map function - partition function - compare function - reduce function - output writer.

A map is a set of key/value pairs. The compare function usually sorts based on key/map. The reduce function collapses the map to the results. The output writer moves data to a meaningful easily-accessible single location (preventing data loss if the cluster powers down).

The master (ie the framework) organises execution, controlling one or more mapper nodes and one or more reduce nodes. The framework reads input (data file), and passes chunks to the mappers. Each mapper creates a map of the input. The framework sorts the map based on keys. It allocates a number of reducers to each mapper (the number can be specified). Reduce is called once per unique key (producing 0 or more outputs). Output is written to persistent storage.

Usually a mapper is more complex than in the demo, eg it may filter what's to be analysed etc. For less than 10 GB of data, you might run analyses on your own computer, for 10-100 GB your own servers, probably using MapReduce only for over 100GB pf data. It can process images, video files etc too - although the demo analyses words in a text file.

Canonical example of MapReduce: wordcount

Input - a series of different words eg: bla bla bla and so and.
Mapped - bla 1, bla 1, bla 1, and 1, so 1, and 1. (Ie maps 'bla' to value '1').
Reduced - and 2, bla 3, so 1.

Note: this assumes all input info is important, but often only part is, eg to check how often names are mentioned in a series of articles you wouldn't map everything.

The framework has readymade reducers for common map formats but you can write your own reducer.

Anette's example

Clone the demo repo at (see bottom right hand side - there are buttons to clone in desktop or get the clone URL; the command is git clone <url>).

Ensure everything's executable as necessary.

The input file is data.txt, the mapper is mapper.R and the reducer is reducer.R.

A shell script will demo the map/reduce locally - it reads data.txt to the mapper, sorts the output and puts the output into the reducer.

Going through the code (RStudio helps):

mapper.R - see last function in the code: it reads input from stdin. hsLineReader takes and reads chunks up to 3 lines, doesn't skip anything (eg headers), then applies emit function to each chunk read. The emit function (top of code) transforms the output (1-3 lines) to a uniform processable stream, turns chunks into words (strsplit). sapply applies an anonymous inner function to each word. (paste is used for string concatenate.) The sorted results go to the reducer.

reducer.R - the final function reads from stdin and runs the reduce function on the input. This creates an array of names - vector of columns. (The chunksize can be tweaked to make it more performant depending on the calculation to be run; the default separator is tab, here it's been set to a space.) Then the process function is applied to it (written as a separate function for clarity, but it could be an anonymous inner function). This function takes each piece of map and aggregates by words using an anonymous inner function producing sums.

Running locally

Run - this emulates what the framework does.

NB must install further packages, HadoopStreaming and getopt:


(If that doesn't work, install them from the R_packages folder: R cmd install packagename.tar.gz).

Running on Amazon Web Services

NB this isn't part of Amazon's free tier, so running these demos will cost you - not very much, probably less than a quid?

Go to AWS console

Create a new S3 bucket (click S3 - towards the bottom left at the moment, under 'Storage and Content Delivery'; click Create bucket; give it a unique name. NB the name must be unique for all buckets on AWS, not just for you!).


Edit the script at the line
to replace it with your new bucket's name. (The code is self-explanatory, see the comments)

Open the bucket by clicking on it, rightclick inside and upload the code from Anette's model repo. (You may need to rename the R_packages folder to just R, or change it to R_packages in the script.)

All nodes in the cluster get the code applied to them.

Now in the AWS console go to Elastic MapReduce (under 'Compute and Networking') - best do this in a new browser window or it'll break your upload! Click to sign up for it, if you haven't already, including adding credit card information etc.

Using Amazon's standard example. In EMR, click create a new job flow (see screenshot below):

  • Job Flow Name - anything you like
  • Hadoop version - Amazon Distribution
  • AMI - latest
  • Create a job flow - Run a sample application, pick Word Count (Streaming), then
  • click Continue.


In the next screen (see below):

  • Input Location is prepopulated (a public bucket), leave it
  • Output location - change <yourbucket> to your own new bucket's name (case sensitive I think)
  • Mapper and Reducer - use theirs
  • click Continue.


In the next screen (screenshot below):

  • Instance Type - small
  • Instance Count - 2, and
  • Continue.


In the next screen (see below):

  • Amazon EC2 Key Pair - leave it as Proceed without key pair (you may get an error, if so see below)
  • Amazon VPC Subnet ID - no preference
  • Amazon S3 Log Path - here enter your own path to your bucket, eg s3n://yourbucketname/log (note: s3n is an internal AWS protocol)
  • Enable debugging - Yes, and
  • Continue.


Leave it as Proceed with no Bootstrap Actions, click Continue:

imageThe next screen shows a summary of all the settings for your review, use Back to change any errors etc. When happy, click Create job flow to run it (and you'll get charged, so click Cancel if you'd rather not run it yet!).


It takes a few minutes to run. Click on the job name and click Debug to see the progress. There's a Refresh button to check if it's gone any further. Click on View jobs to see the jobs set up.

Error? If you get errors, at the top right hand side of the AWS Console click on your username, select Security Credentials, expand Access Keys and click Create New Set of Keys, then try again with Proceed without keypair (it seems that creating a new set of keys then enables you to proceed without actually using the created keys!)

Using the uploaded demo files. This is similar. In EMR create a new job flow, but this time under 'Create a job flow' choose 'Run your own application', with job type 'Streaming'.

For the Input Location use s3n://<yourbucketname>/data.txt, for the Output Location similarly the path to your bucket folder (eg Rtest.output) - it will be created if not already in existence, and can be downloaded to your own location. For Mapper, use the uploaded mapper.R file in your bucket, for Reducer the reducer.R file. Instance type small etc.

Proceed without key pair (see above if there are errors). Bootstrap action - this time choose your own custom action, and enter the path to your bucket and the file. Continue. Create. View. Run when you're happy! (NB again it costs you money.)


Further notes: in jobs, tasks can be viewed too - you can see eg 12 mappers and 3 reducers. Output files are created one per reader, you have to stitch them back together. 0 byte files are created where there was no output from the relevant chunk.