Sunday, 13 April 2014

Cloud computing: IaaS, SaaS, PaaS

What's the difference between IaaS, PaaS and SaaS? There still seems to be confusion, especially about PaaS. I hope this will help.

Consider what lies behind using a software application, like email. (I will be simplifying and generalising below to get the point across, so there's no need to point out, eg, that some languages are interpreted, that some programs can be run directly without installation, or that PaaS applications may need to be coded to integrate with the specific PaaS provider’s libraries!)

  1. The application is coded – someone writes the application in a programming language like C++, Python etc.
  2. The application is compiled – the code is converted into a form that can be run on a particular operating system eg Windows, Mac, Linux, Android, iPhone (iOS) etc.
  3. The application is acquired – eg downloaded from a website, obtained on DVD.
  4. The application is installed on the operating system – eg double-clicking an .msi file in Windows.
  5. The application is run and used by the user – eg double-clicking on the program filename.

Non-cloud – the end user of the application typically only takes steps 3-5, or even just 5 on a corporate network where the IT department has already taken care of 3 and 4.

SaaS – the cloud user only takes step 5, typically by logging into the SaaS service over the Internet (or company network) to access the application, instead of clicking on a local program name; the SaaS provider takes care of all the rest.

IaaS – the cloud user must take care of ALL of steps 1-5. In addition (consider this a step 3.5!) it must also manage its own VMs, including creating VMs and installing the operating systems on them (though it can use snapshots). But it could use someone else’s code (eg open source software) rather than writing the code itself (in which case it skips step 1). Or it could use someone else’s application, go straight to step 3 and install the application in its cloud VM on top of the operating system it installed, assuming the application licence allows installations in VMs. In step 5 the individual end users could be the employees of the cloud user organisation, or its customers, or both.

PaaS – the cloud user only takes care of step 1, again writing its own code (normally using an SDK, or software development kit, downloadable from the provider) or obtaining code from elsewhere. The PaaS provider handles steps 2-4. Step 1 can be, and often is, done locally, with the code then uploaded to the PaaS provider. Again, in step 5 the end users could be employees of the cloud user organisation or its customers. Hence startups offering new services over the web, eg mobile applications, like using IaaS or PaaS: they don’t have to buy equipment to service their customers, and can just focus on running their systems (in IaaS) and coding (in both). With PaaS, they don’t even have to manage IT systems - they can concentrate just on coding. Hence the ‘platform’ in PaaS – it provides a ‘platform’ for PaaS users to code their applications, deploy them (to servers provided by the PaaS provider) and host them (on those same servers), so that the applications are available for use by their end users over the Internet or corporate network.
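To make step 1 concrete, here is a minimal sketch of the kind of code a PaaS user might write and upload: a bare WSGI application using only the Python standard library. The greeting text is invented, and a real deployment would follow the specific provider's SDK and packaging conventions; the point is simply that the application code is all the PaaS user supplies, while the provider runs it:

```python
# A minimal WSGI application: the only artefact a PaaS user need produce.
# Compiling/packaging, servers, hosting and scaling (steps 2-4) would be
# handled by the PaaS provider.
def application(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello from the cloud\n"]

# Simulate one request locally (no server needed) to sanity-check the app
# before uploading it to the provider:
from wsgiref.util import setup_testing_defaults

env = {}
setup_testing_defaults(env)
statuses = []
body = b"".join(application(env, lambda status, headers: statuses.append(status)))
```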

Wednesday, 12 February 2014

9 Ds of Cloud Computing - what's different about cloud?

Here are my 9 Ds of Cloud Computing - D for Differences (which I produced for my Information Security FS 2013 presentation).

Cloud computing is a form of sourcing / outsourcing, of IT resources. But -

  1. Disassociation - separation of the physical from the logical is common (eg physical access to data vs logical, often remote, access); and so is separation of ownership vs control vs use
  2. Diverse supply chain (hardware, software, services); even layers of services are possible, eg a SaaS service built on a PaaS service that itself runs on an IaaS service

  3. Don’t always know or have influence over all suppliers - customers are in quite a different position from traditional outsourcing; it's often a 'cloud of unknowing' for customers, who may not always be able to find out full information about sub-providers etc, or to negotiate providers' standard contract terms
  4. ‘Direction of travel’ is reversed - if using sub-providers. In traditional outsourcing, a customer may go out to tender with details of the service it seeks, discuss the position with several shortlisted potential providers and narrow it down; the provider finds sub-contractors to help it deliver the service requested by the customer. In cloud, SaaS (or even PaaS) providers often build their services on top of pre-existing IaaS or PaaS services, then offer their services to customers, ie the 'direction of travel' is the opposite from that in traditional outsourcing; and opportunities for customising the service are limited
  5. DIY - cloud involves the self-service use by customers of IT hardware / software infrastructure, offered as services, such as software applications in SaaS or virtual servers in IaaS; the provider doesn't actively process data for customers
  6. Design – the design of the individual service (as well as user measures eg encryption, which the service may or may not facilitate) will affect the extent to which the provider has access to user data, including encrypted data. Key access is also critical - if the user has encrypted the data but the provider can access the key, it can still access intelligible data. Conversely if the provider has encrypted user data and manages the key securely, any sub-provider(s) may not be able to access intelligible user data.
  7. Data – cloud-processed data are often:
    1. distributed, which overlaps with the following, that cloud data may be
    2. divided into chunks / fragments which are stored, and sometimes processed, separately 
    3. duplicated (multiple replicas or copies of data may be taken, perhaps to different geographical locations, for backup/business continuity purposes),
    4. 'deleted' in different ways - deletion may only delete 'pointers' to data rather than scrubbing the underlying data, which are gradually over-written over time; even scrubbing may achieve different degrees of deletion (and security), and duplicates of data stored in backups etc may not get deleted
  8. Dependence – on shared, third party resources - including the customer's Internet connectivity
  9. Degrees of control, eg regarding security issues, differ with the situation - it's not one size fits all (see table below)


Table © Cloud Security Alliance reproduced with permission
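The key-access point under 'Design' above can be illustrated with a toy sketch (a deliberately insecure XOR-keystream 'cipher', for illustration only; a real service would use a vetted cipher, and all data values here are invented). Whoever holds the key sees intelligible data; without it, a provider or sub-provider stores only noise:

```python
import hashlib
import secrets

def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy XOR stream 'cipher' - NOT secure, purely to show that
    intelligibility of data depends on who holds the key."""
    stream = b""
    counter = 0
    while len(stream) < len(plaintext):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(p ^ s for p, s in zip(plaintext, stream))

toy_decrypt = toy_encrypt  # XOR is symmetric: the same operation decrypts

user_key = secrets.token_bytes(32)      # key held only by the cloud user
ciphertext = toy_encrypt(user_key, b"payroll data")

# The provider stores only ciphertext; if it also manages the key it can
# recover the plaintext - if not, it cannot.
recovered = toy_decrypt(user_key, ciphertext)
```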
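Similarly, the 'deleted in different ways' point under 'Data' above can be modelled in a few lines (a toy model of a storage system; all names and values are invented). Logical deletion removes only the pointer; the underlying data survive until actually scrubbed:

```python
# Toy model: an object store with an index of 'pointers' over raw blocks.
store = {"blk1": b"secret"}        # underlying storage blocks
index = {"file.txt": "blk1"}       # pointers (directory entries / metadata)

# 'Deletion' as commonly implemented: remove the pointer only.
del index["file.txt"]
logically_deleted = "file.txt" not in index     # the file looks gone...
data_still_there = store["blk1"] == b"secret"   # ...but the bytes remain

# Scrubbing: actually overwrite the underlying block.
store["blk1"] = b"\x00" * len(b"secret")
scrubbed = store["blk1"] == b"\x00" * 6
```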

See also: previous post about the 12 Cs of Cloud Computing (here's the full SCL article: The 12 Cs of Cloud Computing: A Culinary Confection), including explanations of SaaS, PaaS and IaaS for those not familiar with the terms.

Monday, 6 January 2014

OECD Privacy Guidelines – changes between 1980 and 2013 versions – comparison / markup / redline

Here's a markup showing changes between the 1980 and 2013 versions of the OECD Privacy Guidelines (ie Annexes to the Recommendations), aka the OECD Guidelines on the Protection of Privacy and Transborder Flows of Personal Data, as I've not found anything similar online.

I used Word's automated comparison feature and tidied it up a bit; the last few paragraphs are not as clear as they could be, but they're usable enough. Obviously I've not compared the explanatory memoranda, as they're very different.




1. For the purposes of these Guidelines:

a) "data“Data controller" means a party who, according to domesticnational law, is competent to decide about the contents and use of personal data regardless of whether or not such data are collected, stored, processed or disseminated by that party or by an agent on its behalf;.

b) "personal“Personal data" means any information relating to an identified or identifiable individual (data subject);).

c) “Laws protecting privacy” means national laws or regulations, the enforcement of which has the effect of protecting personal data consistent with these Guidelines.

d) “Privacy enforcement authority” means any public body, as determined by each Member country, that is responsible for enforcing laws protecting privacy, and that has powers to conduct investigations or pursue enforcement proceedings.

e) "transborder“Transborder flows of personal data" means movements of personal data across national borders.

Scope of the Guidelines

2. These Guidelines apply to personal data, whether in the public or private sectors, which, because of the manner in which they are processed, or because of their nature or the context in which they are used, pose a dangerrisk to privacy and individual liberties.

3. TheseThe principles in these Guidelines are complementary and should be read as a whole. They should not be interpreted:

a) as preventing:a) the application, of different protective measures to different categories of personal data, of different protective measures depending upon their nature and the context in which they are collected, stored, processed or disseminated; or

b) the exclusion from the application of the Guidelines of personal data which obviously do not contain any risk to privacy and individual liberties; or

c) the application of the Guidelines only to automatic processing of personal data.

b) in a manner which unduly limits the freedom of expression.

4. Exceptions to the Principles contained in Parts Two and Three of these Guidelines, including those relating to national sovereignty, national security and public policy ("ordre public"), should be:

a) as few as possible, and

b) made known to the public.

5. In the particular case of Federalfederal countries the observance of these Guidelines may be affected by the division of powers in the Federation.federation.

6. These Guidelines should be regarded as minimum standards which are capable of beingcan be supplemented by additional measures for the protection of privacy and individual liberties., which may impact transborder flows of personal data.


Collection Limitation Principle

7. There should be limits to the collection of personal data and any such data should be obtained by lawful and fair means and, where appropriate, with the knowledge or consent of the data subject.

Data Quality Principle

8. Personal data should be relevant to the purposes for which they are to be used, and, to the extent necessary for those purposes, should be accurate, complete and kept up-to-date.

Purpose Specification Principle

9. The purposes for which personal data are collected should be specified not later than at the time of data collection and the subsequent use limited to the fulfilment of those purposes or such others as are not incompatible with those purposes and as are specified on each occasion of change of purpose.

Use Limitation Principle

10. Personal data should not be disclosed, made available or otherwise used for purposes other than those specified in accordance with Paragraph 9 except:

a) with the consent of the data subject; or

b) by the authority of law.

Security Safeguards Principle

11. Personal data should be protected by reasonable security safeguards against such risks as loss or unauthorised access, destruction, use, modification or disclosure of data.

Openness Principle

12. There should be a general policy of openness about developments, practices and policies with respect to personal data. Means should be readily available of establishing the existence and nature of personal data, and the main purposes of their use, as well as the identity and usual residence of the data controller.

Individual Participation Principle

13. An individualIndividuals should have the right:

a) to obtain from a data controller, or otherwise, confirmation of whether or not the data controller has data relating to him;them;

b) to have communicated to himthem, data relating to himthem

i. within a reasonable time;

ii. at a charge, if any, that is not excessive;

iii. in a reasonable manner; and

iv. in a form that is readily intelligible to him;them;

c) to be given reasons if a request made under subparagraphs (a) and (b) is denied, and to be able to challenge such denial; and

d) to challenge data relating to himthem and, if the challenge is successful to have the data erased, rectified, completed or amended.

Accountability Principle

14. A data controller should be accountable for complying with measures which give effect to the principles stated above.


15. A data controller should:

a) Have in place a privacy management programme that:

i. gives effect to these Guidelines for all personal data under its control;

ii. is tailored to the structure, scale, volume and sensitivity of its operations;

iii. provides for appropriate safeguards based on privacy risk assessment;

iv. is integrated into its governance structure and establishes internal oversight mechanisms;

v. includes plans for responding to inquiries and incidents;

vi. is updated in light of ongoing monitoring and periodic assessment;

b) Be prepared to demonstrate its privacy management programme as appropriate, in particular at the request of a competent privacy enforcement authority or another entity responsible for promoting adherence to a code of conduct or similar arrangement giving binding effect to these Guidelines; and

c) Provide notice, as appropriate, to privacy enforcement authorities or other relevant authorities where there has been a significant security breach affecting personal data. Where the breach is likely to adversely affect data subjects, a data controller should notify affected data subjects.


16. A data controller remains accountable15. Member countries should take into consideration the implications for other Member countries of domestic processing and re-export of personal data. under its control without regard16. Member countries should take all reasonable and appropriate steps to ensure that transborder flows of personal data, including transit through a Member country, are uninterrupted and secure.the location of the data.

17. A Member country should refrain from restricting transborder flows of personal data between itself and another Member country except where (a) the latter does not yetother country substantially observeobserves these Guidelines or where(b) sufficient safeguards exist, including effective enforcement mechanisms and appropriate measures put in place by the re-export of such data would circumvent its domestic privacy legislation. A Member country may also impose restrictions in respect of certain categories of personal data for which its domestic privacy legislation includes specific regulations in view of the nature of those data and for which the other Member country provides no equivalent controller, to ensure a continuing level of protection. consistent with these Guidelines.

18. Member countries should avoid developing laws, policies and practices in the name of the protection of privacy and individual liberties, which would create obstaclesAny restrictions to transborder flows of personal data that would exceed requirements for such protection.should be proportionate to the risks presented, taking into account the sensitivity of the data, and the purpose and context of the processing.



19. In implementing domestically the principles set forth in Parts Two and Threethese Guidelines, Member countries should:

a) develop national privacy strategies that reflect a co-ordinated approach across governmental bodies;

b) adopt laws protecting privacy;

c) establish legal, administrative or other procedures or institutions for the protection of privacy and individual liberties in respect of personal data. Member countries should in particular endeavour to:and maintain privacy enforcement authorities with the governance, resources and technical expertise necessary to exercise their powers effectively and to make decisions on an objective, impartial and consistent basis;

a) adopt appropriate domestic legislation;

bd) encourage and support self-regulation, whether in the form of codes of conduct or otherwise;

ce) provide for reasonable means for individuals to exercise their rights;

df) provide for adequate sanctions and remedies in case of failures to comply with laws protecting privacy;

g) consider the adoption of complementary measures which implement the principles set forth in Parts Two and Three; and, including education and awareness raising, skills development, and the promotion of technical measures which help to protect privacy;

eh) consider the role of actors other than data controllers, in a manner appropriate to their individual role; and

i) ensure that there is no unfair discrimination against data subjects.



20. Member countries should, where requested, make known to other Member countries details of the observance of the principles set forth in these Guidelines. Member countries should also ensure that procedures for transborder flows of personal data and for the protection of privacy and individual liberties are simple and compatible with those of other Member countries which comply with these Guidelines.

21. Member countries should establish procedures to facilitate:

information exchange related to these Guidelines,

20. Member countries should take appropriate measures to facilitate cross-border privacy law enforcement co-operation, in particular by enhancing information sharing among privacy enforcement authorities.

21. Member countries should encourage and mutual assistance in the procedural and investigative matters involved. support 22. Member countries should work towards the development of principles, domestic and international arrangements that promote interoperability among privacy frameworks that give practical effect to these Guidelines.

22. , to governMember countries should encourage the applicable law indevelopment of internationally comparable metrics to inform the case of policy making process related to privacy and transborder flows of personal data.

23. Member countries should make public the details of their observance of these Guidelines.

Thursday, 2 January 2014

Cloud security principles - UK guidance

The UK government issued some concise but fairly comprehensive cloud service security principles (edit: available in the preceding link, in HTML format only) in mid-December 2013, as guidance for UK public sector organisations considering the use of cloud services to process official information. They were described by Government Digital Service COO Tony Singleton and on a related webpage (edit: see the preceding link for associated info) as being in public beta, although this is not stated on the page containing the principles. (Feedback should be sent to [email protected])

Below I set out the text of these security principles (licensed under the Open Government Licence), but adding some suggestions, highlighting and comments of my own with deletions and insertions marked (eg I've highlighted the sentence regarding a named senior executive being responsible for security).

These principles apply to proposed UK public sector use of cloud services and are not limited to personal data, of course. It is interesting that there is much use of 'should', rather than 'must', and no recommendations are explicitly made on whether obligations on the part of the service provider regarding these issues should be made legally-binding by embodying them in the contract terms.

Cloud Service Security Principles

This document describes principles which should be considered when evaluating the security features of cloud services. Some cloud services will provide all of the security principles, while others only a subset. It is for the consumer of the service to decide which of the security principles are important to them in the context of how they expect to use the service.

Some service providers will be able to offer higher levels of confidence in how they implement the different security principles. Consumers[1] will need to decide how much, if any, assurance they require in the different security principles which matter to them.

These principles apply equally to Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS) as defined by NIST.

1. Data in transit protection

The confidentiality and integrity of data should be adequately protected whilst in transit.

The following aspects should be specifically considered:

• Consumer to service

• End user to service [ie individual citizens using the service]

• Within the service (for example, between data centres)

2. Asset protection and resilience

Data should be physically secure as it is processed by and stored within the service. This security should be based on suitable physical security controls within data processing, storage and management locations.

The business requirements for availability of the service should be an important consideration when choosing a cloud service. The consumer should ensure that a contractual agreement is in place with the service provider which adequately supports their business needs for availability of the service.

The legal jurisdiction of the service will be an important consideration[2] for many consumers, especially if they wish to use the service to store or process personal data. This principle may depend on the physical locations of processing, storage, transit and/or management of the service, as well as jurisdictions where the service provider or any relevant sub-provider(s) is incorporated or established, or where it has operations and/or assets.

The following aspects should be specifically considered:

• Location of data centres hosting the service

• Security surrounding those data centres

• Location of service management facilities

• Countries of relevant service providers’ establishment or incorporation, and where they may have operations or assets

• How the confidentiality and integrity of data-at-rest will be maintained as appropriate to the nature of the data and service concerned (eg encrypting or tokenising data before upload to the service, bearing in mind the importance of adequate key access and key management; and paying for backups by the service provider if not included in the service, or for backups to another service provider, or even backing up to internal systems)

• Availability of the service

3. Separation between consumers

Logical separation between different consumers of a service (guaranteed to a level appropriate to the requirements of the [consumer]) should be achieved at all points within the service, including across compute, storage and networking resources.

An important consideration will be whether the service is a public, private, or community cloud service; if all tenants[3] of the service are known to be trustworthy then less confidence in the separation properties of the service may be acceptable.

4. Governance

The service provider should have a security governance framework that coordinates and directs their overall approach to the management of IT systems, services and information. A clearly identified, and named, senior executive should be responsible for security of the cloud service.

5. Operational security

The service provider should have processes and procedures in place to ensure the operational security of the service.

The following aspects should be specifically considered:

• Configuration and change management

• Vulnerability management

• Protective monitoring

• Incident management

6. Personnel security

Service provider staff and any relevant sub-contractor staff should be subjected to adequate personnel security screening for their role. At a minimum this should include identity, unspent criminal convictions, and right to work checks. For roles with a higher level of service access, the service provider should undertake and maintain appropriate additional personnel security checks. Each individual’s access to [consumer] data, including metadata, should be limited to only data they are required to access to perform their role.

7. Secure development

The service should be developed in a secure fashion and should evolve to mitigate new threats as they emerge.

8. Supply chain security

Cloud services often rely upon third party services. Those third parties can have an impact on the overall security of the services. The service provider should ensure that its supply chain satisfactorily supports all of the security principles that the service claims to deliver[4] .

9. Secure consumer management

Consumers should be provided the tools they need to securely manage their usage of the service.

The following aspects should be specifically considered:

• Authentication of consumers to management interfaces

• Separation of consumers within management interfaces

• Authentication of consumers within support channels

• Separation of consumers within support channels

10. Secure on-boarding and off-boarding

The service should be provisioned to consumers in a known good state, and their data must be satisfactorily deleted when they leave the service. What is ‘satisfactory’ may vary with the requirements of the [consumer] as regards the nature (eg sensitivity) of the data and usage concerned, eg number of passes, confirmation of deletion from all duplicates and backups etc. When physical storage components reach their end of life, the service provider should make appropriate arrangements to securely destroy or purge any consumer data they held.

11. Service interface protection

All external or less trusted interfaces of the service should be identified and have appropriate protections to defend against attacks through them.

The following aspects should be specifically considered:

• Connections to external services on which the service depends

• Dedicated connections to tenants[5]

• Remote access by service provider

• Publicly exposed services

12. Secure service administration

The methods used by the service provider’s administrators to manage the operational service (monitor system health, apply patches, update configuration etc.) should be designed to mitigate any risk of exploitation which could undermine the security of the service. The security of the networks and devices used to perform this function should be specifically considered.

13. Audit information provision to tenants[6]

Consumers should be provided with the audit records they need in order to monitor access to their service and the data held within it. [What’s meant by ‘audit’ here? Just access to logs? Or formal annual audit reports by independent third party security experts? It would be helpful to have more detail on what kind of logs should be required, eg logs of accesses to metadata as well as data, and what kind of formal audits should be required.]

14. Secure use of the service by the consumer

Consumers will have certain responsibilities when using the service in order for their use of it to remain secure, and for their data to be adequately protected.

Depending on the type of service, the consumer will have responsibilities relating to the following topics:

• Audit and monitoring

• Storage

• Networking

• Authentication

• Development security

• End user devices used to access the service

• Secure configuration of the service

• Patching

15. Glossary

Management interface

a service exposed to consumers or service provider administrators to allow administrative tasks to be performed.

Support channel

an online, or out of band (e.g. telephone), communication channel which consumers can use to obtain support from the service provider.


On-boarding

the process of a consumer moving on to the service.


Off-boarding

the process of migrating a consumer away from a service.

Public, private and community cloud

refer to the NIST definitions of these terms.


Consumer

a tenant[7] of the cloud service.

[1] May be confusing to use the term ‘consumer’, which many think of as individual end users, eg citizens who use a government department’s service that the department provides using a cloud service. ‘Customer’ of cloud service?

[2] Absolutely. This is the critical point. I argue that, legally, physical location should be relevant not per se but only because it may determine legal jurisdiction and/or IT security.

[3] See comment on Glossary below on use of ‘tenant’

[4] What if the provider makes no or limited claims regarding its service’s security?

[5] As before.

[6] Or ‘consumer’?

[7] ‘Tenant’ applies to IaaS but less so to PaaS and SaaS. And see previous comment on ‘consumer’.

Monday, 16 September 2013

Data protection law: basic guide & info (including for open data / big data startups)

This is a 1-page basic guide to data protection law, particularly relevant to open data / big data in cases where the data processed involve 'personal data'.

Data protection law in a nutshell

To tech folk, 'data protection' usually means IT security. To lawyers, 'data protection' usually means data protection laws. There's some overlap, but they're not the same. I'm just going to use 'data protection' in the legal sense.


Data protection is also not the same as privacy. Again, there's some overlap, but technically they're different. Data protection laws can even apply to public data, ie non-private personal data. Privacy law in the UK has largely been developed by the courts under Article 8 of the European Convention on Human Rights to protect people against the misuse of their private information (mostly, celebs who can afford to litigate).

There are also laws about the use of confidential information, which could cover some corporate commercial data.

So what's data protection really? Well, EU data protection laws apply to the 'processing' of 'personal data', with exceptions eg for national security, or processing for personal purposes like keeping your personal contacts in an electronic address book (though at least one council has tried to argue that bloggers should be liable under data protection laws - see the full correspondence).

Data protection laws are really broad because 'processing' is almost anything you can do with or to personal data where it's been digitised at some point in the process, including just storing, transmitting or disclosing personal data as well as actively working on it. And 'personal data' is basically anything that can be linked to an identified or identifiable living individual ('data subject'), so something that's not 'personal data' one minute could become 'personal data' the next if it's become linked to an identifiable person through big data crunching, for instance.

Data protection law requires 'controllers', ie anyone who controls the 'purposes and means' of processing personal data, to process personal data according to certain key principles (regarding not only the use or abuse of personal data but also issues such as data accuracy and security), with tighter rules for certain sensitive information like health-related data. Failure to do so may be punished, mainly by the regulator (who in serious cases can fine up to £500k in the UK), or in some very limited cases the affected data subjects could try to stop the processing or sue for compensation. Breach of principles could also be a criminal offence in some situations. Controllers must register with or notify the national regulator and pay fees.

The concept of anonymous data is recognised. The approach taken is quite binary, in the sense that if something is 'personal data', all data protection law rules apply to it, so it must be processed in compliance with the principles etc; whereas if it's not 'personal data' but anonymous data, then none of them do. Of course in actuality the dividing line is harder to draw, but the law is what it is. Many laws are like this, claiming to apply to things in different ways depending on whether or not they fit within a set category or categories, implicitly assuming that there are bright lines between them, when in fact it's often hard to work out which if any category a real situation fits into, and technological, social and business developments can make the dividing lines even blurrier over time.

Something is not 'personal data' if it's been anonymised so that individuals can't be identified by any means 'likely reasonably' to be used to attempt de-anonymisation, including by combining it with other data (note: that refers to the means likely to be used, not the means actually used: if you can re-identify, eg because the anonymisation hasn't been done very well, the 'anonymous' data are still personal data even if you don't actually do it). This again means that as re-identification methods improve, something which used to be anonymous data could become 'personal data' when techniques get to the point that the data could be deanonymised. [Clarification: the 'likely reasonably' wording is from the EU Directive. For the UK-specific position and summaries of cases, see the Anonymisation Code of Practice]

EU data protection law comes from the Data Protection Directive. This applies to countries in the European Economic Area (I've done a Venn diagram showing the differences between EEA, EU, Europe and Council of Europe). As this is a Directive, not a Regulation, EEA countries have room to implement it differently, so detailed data protection laws may vary with the country - and do, sometimes significantly. For example, some countries protect the 'personal data' of organisations as well as people (the UK doesn't). The rules on security are about a few paragraphs long in the UK, several pages long in Italy.

The ICO is the UK's data protection (and freedom of information) regulator. It's published lots of useful info both for data subjects and for those who process personal data, so do rootle around its website.

I should also mention the Article 29 Working Party, the group of EU data protection regulators collectively. It's produced many opinions and other documentation, including on:

So there's lots of guidance out there, it's just that most people who aren't data protection specialists don't know about it (and, of course, may not know how to understand or apply it in practice).

But note that regulators' guides and opinions aren't legally binding - only a court case can provide definitive guidance. However, if you follow a regulator's guidance, you're of course less likely to find yourself in its enforcement sights.

General info

There's basic info on UK data protection law plus a guide to data protection, including:

Remember that the ICO can take enforcement action for breaches (and has a policy on regulatory action). This can include imposing monetary penalties (framework, guide, procedures - and see what enforcement action it's taken so far including criminal prosecutions, and CSV of fines issued so far).

For organisations like startups

There's a checklist on data protection compliance, a checklist on collecting personal data, and a brief general guide for small businesses.

The ICO website has free training materials including videos and security guidance. You can ask the ICO for help, eg request an advisory visit to your organisation.

On privacy notices in particular, there are guides on:

The ICO has sectoral guidance including for non-profits and the health sector, and you can see its full list of guidance material, including specific guidance on certain areas like:

Regarding sensors etc and the infamous cookie law:

To keep up to date - the ICO has:

For data subjects (whose data are being processed)

You have some data protection law rights, here's info on two main ones:

You can complain to the ICO, for free:

You could also sue for compensation in some situations but they're very limited, as you can tell from that link being directed at organisations and the lack of similar general info for individuals about suing! And you'd have to get lawyers to help you litigate. You could try to DIY, but that didn't work out very well for Mr Smeaton (short summary (scroll down), longer summary, another summary, full judgment).

However, an eminent data protection expert has argued that even the non-rich could, instead of suing, try complaining about privacy breaches (not just data protection breaches) to the ICO, ie 'ask [the ICO] for an assessment with respect to lawful processing with respect of Article 8' - and I think he's got a point there. So if you try this, good luck and please keep me informed!

"We can't, because of data protection"

Let's just dispel one myth. Too many organisations hide behind 'data protection' to refuse to do something that they can and indeed should do. Maybe because they just don't want to, or couldn't be bothered, or they're covering themselves and think it's just easier and safer not to do it. And they often get away with this, because too many people don't understand data protection law and believe their 'It's data protection' excuse.

That's partly the fault of data protection law and regulators, because the law is very complex and detailed, and there's tons of legislation and guidance to wade through (as well as some cases interpreting the law). But the basic principles are mostly quite straightforward (listed earlier).

The ICO has tried hard to address these practices by organisations, which it calls 'data protection duck outs' (eg myths and realities about data protection), but believe it or not there have been 'data protection' incidents regarding animals, trees and kids (plus a Superman suit). There are also myths about data sharing, and myths about marketing calls too.

Occurrences like this don't exactly fill one with confidence that things may change for the better. We can only hope that more people will learn about these myth debunkers, and that bureaucratic organisations will start applying common sense and stop using 'data protection' as a justification for introducing more unnecessary 'get in the way' red tape.

Usual weaselly disclaimers (and why you should use lawyers, and where to get free legal advice)

May I stress that all the above is general info only, not legal advice!

Lawyers say this sort of thing because legal advice needs to be tailored to your individual situation, and inevitably everyone's is different.

Also, laws don't always mean what they literally say. We'd love them to (as would the Good Law initiative), but sometimes, maybe even often, they don't. This may be because there can be layers of meaning, or qualifications, conditions and/or exceptions, so that it's sometimes necessary to wade through provision after provision, following the trail of definitions through to still further legislation, before it's possible to get even the bare bones of what something means.

For instance, 'fair and lawful' in the first principle means more than just 'fair and lawful': for processing to be 'fair and lawful', it must first fit within one of several defined boxes ('consent' is one), and it also has to be generally fair and lawful. And I've put quotes around 'consent' because 'consent' itself has a specific meaning, it's not 'consent' unless the consent was a freely-given, specific and informed indication of the data subject's agreement to their data being processed.

Or, legislation can be drafted obscurely, so it's hard to figure out what it means, and it would take a court case to find out what judges think it means. Or, legislation can be drafted by people who don't understand how technology works (yes it happens!), whether it's websites, or cloud computing. Or, the legislation is so old that it didn't properly envisage future technological developments - like copyright law controlling the right to copy rather than the right to use (book), leading to effectively all computer usage being copyright infringement because the technology works by copying. It's often hard to apply old or unsuitable laws to modern technology.

Even when an issue has gone to the courts for decision, while some judges are admirably easy to understand, with others even seasoned lawyers may get even wrinklier-browed desperately trying to figure out exactly what m'lud meant. Sometimes, it's because the judge isn't as clear as he or she could be. Other times, it's because judges are trying to do what they feel is the fair and right thing, and so may suggest or say that the law means something other than you might think it means (I dub this the Denning dimension, aka 'The little old lady wins!', sometimes manifested as 'hard cases make bad law'). That's why, while technologists may think in binary, in either/or, lawyers have to think in analogue - in shades of grey:

(Image reproduced by kind permission of

And that's also why attempts to translate laws into algorithms and code are almost certainly doomed to failure; it's near impossible, as for example an experiment in implementing supposedly simple road traffic rules in software showed.

Lawyers with expertise in particular fields, whether data protection, intellectual property or computer law, have been trained to understand the jargon and to know or be able to work out how to reconcile all these different elements in order to determine what the workable parameters are, and to arrive at something that can make some kind of sense in practice.

In addition, experienced practitioners should have a feel for how the law is actually enforced in real life, eg by regulators, so that they can give you some idea of how likely it is that you'll be fined or worse, and what the penalties are. Then you can decide, particularly in the (too many) areas where the law isn't clear, whether to take the risk that (a) whatever you plan to do that might be a breach will be found out, (b) authorities will take enforcement action against you, and (c) you'll be fined or prosecuted for it.

Of course, if you use lawyers rather than DIY, you might be able to sue them if things go wrong and it's their fault - because practising lawyers should be insured!

Finally, the internet may be global but laws are national, so different countries' laws may apply in different (or indeed the same) situations, and so you may need advice from lawyers qualified in the relevant countries.

Therefore, at some point a startup will need a lawyer. Not just to keep certain lawyers (alas not me) in mansions and private school tuition fees, but for its own benefit in terms of protecting its IP, making sure it's not breaking data protection or other laws, and certainly when it comes to that hoped-for cashing-in IPO.

Law centres, citizens advice bureaux and the Bar pro bono unit are free, but may lack specialist IT or data protection expertise. Own-IT can give free intellectual property law advice, and Queen Mary, University of London (where I'm a PhD student and working part-time) has an advice service including a Law for the Arts Centre that offers free IP law advice, but again may not necessarily have IT or data protection expertise. However, Queen Mary is also launching a new free advisory service for startups, qLegal, aimed at providing legal and regulatory advice specifically to ICT startups, where postgrad students will work with collaborating law firms and academics - so please feel free to try that!

Disclaimer: the book I linked to above is by my PhD supervisor, but I linked to it because it makes very salient points on why many laws don't work in cyberspace and how they could be made work, plus it's a good read (even for non-lawyers) - not because I'm trying to curry favour!

Monday, 2 September 2013

Basic tutorial: Map/Reduce example with R & Hadoop, including Amazon Elastic MapReduce (EMR)

This is my write-up of Anette Bergo's very useful session for Women in Data in August 2013, but reordered and with some extra notes and screenshots of my own.

Anette showed exactly how this sort of thing should be done - basic foundation, enough code to demo the key principles without over-complicating things, talk through the code, run it!

Any errors are mine alone, if you spot any please let me know.



  • Download and install R - it's multi-platform so there are Linux, Mac and Windows versions
    • RStudio IDE helps provide a friendlier interface
  • (To clone Anette's example repo) Download and install Git
  • (For the EMR bit only) Sign up for an Amazon Web Services account.
    • If you have an Amazon account you can login with that, but you still need to sign up specifically for AWS.
  • (For EMR only, as it costs you money to run the demo) Sign up for Elastic MapReduce (circled in blue in the screenshot below, accessible via the AWS console). You'll need to enter credit card details and possibly go through a phone verification and wait for their confirmation email before you can use EMR.

What's the R programming language?

R is a DSL for statistical/mathematical analysis.

Everything is a vector in R (just as in Git everything is a directed graph).

What's MapReduce?

MapReduce is a programming framework for parallel distributed processing of large data sets. (Originally devised by Google engineers - MapReduce paper.)

Effectively, Hadoop is the open source version of Google's MapReduce (it also includes an open source version of Google File System and increasingly other components).

Amazon Web Services' Elastic MapReduce lets you set up and tear down Hadoop clusters (master and slaves). The demo uses R but EMR will accept eg Python, shell scripts, Ruby. You can deploy with the Boto library and Python scripts.

MapReduce involves: Input reader - map function - partition function - compare function - reduce function - output writer.

A map is a set of key/value pairs. The compare function usually sorts the map based on keys. The reduce function collapses the map to the results. The output writer moves the data to a meaningful, easily-accessible single location (preventing data loss if the cluster powers down).

The master (ie the framework) organises execution, controlling one or more mapper nodes and one or more reduce nodes. The framework reads input (data file), and passes chunks to the mappers. Each mapper creates a map of the input. The framework sorts the map based on keys. It allocates a number of reducers to each mapper (the number can be specified). Reduce is called once per unique key (producing 0 or more outputs). Output is written to persistent storage.

Usually a mapper is more complex than in the demo, eg it may filter what's to be analysed etc. For less than 10 GB of data, you might run analyses on your own computer; for 10-100 GB, on your own servers; you'd probably use MapReduce only for over 100 GB of data. It can process images, video files etc too - although the demo analyses words in a text file.

Canonical example of MapReduce: wordcount

Input - a series of different words eg: bla bla bla and so and.
Mapped - bla 1, bla 1, bla 1, and 1, so 1, and 1. (Ie maps 'bla' to value '1').
Reduced - and 2, bla 3, so 1.
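The wordcount flow above can be sketched in a few lines of Python (an illustration of the pattern only, not the R demo code):

```python
from itertools import groupby
from operator import itemgetter

def mapper(text):
    # Map step: emit a (word, 1) pair for every word in the input.
    return [(word, 1) for word in text.split()]

def reducer(pairs):
    # The framework sorts the pairs by key, so all pairs for a word
    # are adjacent; reduce then collapses each group into a sum.
    counts = {}
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        counts[key] = sum(value for _, value in group)
    return counts

mapped = mapper("bla bla bla and so and")
# mapped: [('bla', 1), ('bla', 1), ('bla', 1), ('and', 1), ('so', 1), ('and', 1)]
reduced = reducer(mapped)
# reduced: {'and': 2, 'bla': 3, 'so': 1}
```

In a real cluster the map and reduce calls run on different nodes, of course; this just shows the data shapes at each stage.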

Note: this assumes all input info is important, but often only part is, eg to check how often names are mentioned in a series of articles you wouldn't map everything.

The framework has readymade reducers for common map formats but you can write your own reducer.

Anette's example

Clone the demo repo (see the bottom right-hand side of the repo page - there are buttons to clone in desktop or get the clone URL; the command is git clone <url>).

Ensure everything's executable as necessary.

The input file is data.txt, the mapper is mapper.R and the reducer is reducer.R.

A shell script will demo the map/reduce locally - it reads data.txt to the mapper, sorts the output and puts the output into the reducer.
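That map | sort | reduce shape can be mimicked with standard Unix tools alone - a hypothetical one-liner (not the demo script itself) that wordcounts a string:

```shell
# tr splits the input into one word per line (the "map"), sort groups
# identical keys together (the "compare"), and uniq -c collapses each
# group into a count (the "reduce").
echo "bla bla bla and so and" | tr ' ' '\n' | sort | uniq -c
```

This is the same pipeline structure the demo script sets up, just with the R mapper and reducer swapped out for tr and uniq.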

Going through the code (RStudio helps):

mapper.R - see the last function in the code: it reads input from stdin. hsLineReader reads chunks of up to 3 lines and doesn't skip anything (eg headers), then applies the emit function to each chunk read. The emit function (top of the code) transforms the output (1-3 lines) into a uniform processable stream and turns chunks into words (strsplit). sapply applies an anonymous inner function to each word. (paste is used for string concatenation.) The sorted results go to the reducer.

reducer.R - the final function reads from stdin and runs the reduce function on the input. This creates an array of names - a vector of columns. (The chunk size can be tweaked to make it more performant depending on the calculation to be run; the default separator is tab, but here it's been set to a space.) Then the process function is applied to it (written as a separate function for clarity, but it could be an anonymous inner function). This function takes each piece of the map and aggregates by words, using an anonymous inner function to produce the sums.

Running locally

Run - this emulates what the framework does.

NB you must install two further packages, HadoopStreaming and getopt, eg with install.packages(c("HadoopStreaming", "getopt")) in the R console.


(If that doesn't work, install them from the R_packages folder: R CMD INSTALL packagename.tar.gz.)

Running on Amazon Web Services

NB this isn't part of Amazon's free tier, so running these demos will cost you - not very much, probably less than a quid?

Go to AWS console

Create a new S3 bucket (click S3 - towards the bottom left at the moment, under 'Storage and Content Delivery'; click Create bucket; give it a unique name. NB the name must be unique for all buckets on AWS, not just for you!).


Edit the script at the line that references the bucket, replacing it with your new bucket's name. (The code is self-explanatory - see the comments.)

Open the bucket by clicking on it, rightclick inside and upload the code from Anette's demo repo. (You may need to rename the R_packages folder to just R, or change it to R_packages in the script.)

All nodes in the cluster get the code applied to them.

Now in the AWS console go to Elastic MapReduce (under 'Compute and Networking') - best do this in a new browser window or it'll break your upload! Click to sign up for it, if you haven't already, including adding credit card information etc.

Using Amazon's standard example. In EMR, click create a new job flow (see screenshot below):

  • Job Flow Name - anything you like
  • Hadoop version - Amazon Distribution
  • AMI - latest
  • Create a job flow - Run a sample application, pick Word Count (Streaming), then
  • click Continue.


In the next screen (see below):

  • Input Location is prepopulated (a public bucket), leave it
  • Output location - change <yourbucket> to your own new bucket's name (case sensitive I think)
  • Mapper and Reducer - use theirs
  • click Continue.


In the next screen (screenshot below):

  • Instance Type - small
  • Instance Count - 2, and
  • Continue.


In the next screen (see below):

  • Amazon EC2 Key Pair - leave it as Proceed without key pair (you may get an error, if so see below)
  • Amazon VPC Subnet ID - no preference
  • Amazon S3 Log Path - here enter your own path to your bucket, eg s3n://yourbucketname/log (note: s3n is an internal AWS protocol)
  • Enable debugging - Yes, and
  • Continue.


Leave it as Proceed with no Bootstrap Actions, click Continue:

The next screen shows a summary of all the settings for your review; use Back to change any errors etc. When happy, click Create job flow to run it (and you'll get charged, so click Cancel if you'd rather not run it yet!).


It takes a few minutes to run. Click on the job name and click Debug to see the progress. There's a Refresh button to check if it's gone any further. Click on View jobs to see the jobs set up.

Error? If you get errors, at the top right hand side of the AWS Console click on your username, select Security Credentials, expand Access Keys and click Create New Set of Keys, then try again with Proceed without key pair (it seems that creating a new set of keys then enables you to proceed without actually using the created keys!).

Using the uploaded demo files. This is similar. In EMR create a new job flow, but this time under 'Create a job flow' choose 'Run your own application', with job type 'Streaming'.

For the Input Location use s3n://<yourbucketname>/data.txt, for the Output Location similarly the path to your bucket folder (eg Rtest.output) - it will be created if not already in existence, and can be downloaded to your own location. For Mapper, use the uploaded mapper.R file in your bucket, for Reducer the reducer.R file. Instance type small etc.

Proceed without key pair (see above if there are errors). Bootstrap action - this time choose your own custom action, and enter the path to your bucket and the file. Continue. Create. View. Run when you're happy! (NB again it costs you money.)


Further notes: in jobs, tasks can be viewed too - you can see eg 12 mappers and 3 reducers. Output files are created one per reader, you have to stitch them back together. 0 byte files are created where there was no output from the relevant chunk.
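Stitching the per-reader output parts back together is usually just a concatenation once you've downloaded them - a sketch with made-up file contents (the part-NNNNN names are the typical Hadoop Streaming output names):

```shell
# Illustrative part files as produced by the reducers (contents invented):
printf 'and\t2\n' > part-00000
: > part-00001            # a 0-byte part where a chunk produced no output
printf 'bla\t3\n' > part-00002

# Stitch all parts into a single output file; empty parts are harmless.
cat part-* > combined_output.txt
```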

Thursday, 27 June 2013

TTIP: how to lobby the EU and US, etc – Sidley cloud computing roundtable


At the June Sidley cloud computing roundtable, held under the Chatham House Rule, one major topic discussed was the proposed EU-US Transatlantic Trade & Investment Partnership, aka TTIP.

In TTIP, both cloud computing and data protection law will be horizontal issues spanning specific areas such as financial services, telecommunications services, computing services and global standards. It isn’t yet clear how the draft Data Protection Regulation will affect TTIP. Or indeed vice versa.

However, in terms of lobbying the EU and US on TTIP, a very helpful outline was given by Yohan Benizri. Some of this may seem self-evident, but I think it’s still useful to set it out.

Participating in consultations is very important, but that’s not in fact the most effective tool available to stakeholders. It seems that direct engagement with negotiators is more likely to lead to better results.

TTIP negotiators, on the European Commission side, will include Ignacio Garcia Bercero and Damien Levie, in DG Trade (under De Gucht), but other DGs, such as DG Connect and Justice (for cloud and privacy/data protection issues) will also be involved. DG Trade is playing a leading role, but positions and text will be developed in close cooperation with other DGs.

On the other side of the Atlantic, Dan Mullaney will probably be the key person, working with Mike Froman (USTR).

The best approach, again at the risk of stating the obvious, is to explain the issues and their (even if speculative) potential implications, and then suggest draft text or drafting changes to address those issues. In other words, don’t just raise the problem, but offer a possible solution too.

Forming ad hoc coalitions of organisations with common interests may also be useful, to voice collective concerns to both the EU and US sides. Indeed, suggesting the same text to both USTR and EU may help.

Other topics

More generally regarding the draft Data Protection Regulation, some EU governments have reportedly expressed the view that the draft legislation might not go through at all, because the vast gulf between the Council and the European Parliament makes agreement between them, at least within the next year or so, seem unlikely. (Of course, others have also expressed this view, eg Chris Pounder at Amberhawk, with Lionel de Souza at Hogan Lovells reporting the French government's serious reservations about the draft Regulation.)

Also discussed at the roundtable were the EU cloud strategy including cloud standards; and competition law issues, notably the actions against Google in relation to search (and now see Google’s subsequent blog on the subject).

Full disclosure: I gave the firestarter presentation on the EU cloud strategy at this roundtable. I used to work for Sidley. But Sidley didn’t pay me for my participation, or for this blog. This blog is, obviously, mine alone.