Kuan0: training
Showing posts with label training.

Sunday, 9 February 2025

AI literacy: EU AI Act

The EU AI Act's "AI literacy" obligation applied from 2 Feb 25, alongside its prohibition on certain AI uses (commencement dates 1-pager). But what, if anything, should you do about it? Points to consider:

  • Who's caught? This obligation applies to "providers" and, especially, "deployers" (i.e. users) of AI systems
    • For non-compliance, you can't be fined yet (fining provisions don't kick in until 2 Aug 2025), and perhaps not at all (Art.4 isn't listed in Ch.12 on penalties, and it's unclear whether individual EU Member States can or will penalise breaches of this obligation - we'll find out by 2 Aug 2026!)
  • But... train anyway? However, if you use any AI system caught by the EU AI Act, an AI upskilling/training and awareness program for staff is good practice and should help to boost your business's competitiveness as well as legal compliance, so you may want to roll it out anyway - if not yet, then ideally by 2 Aug 2026
    • Who? Train at least staff/contractors developing/adapting/integrating, operating/maintaining or using any high-risk AI systems (including third-party AI systems), and also train staff providing human oversight of AI; and ensure they have appropriate authority to perform their tasks properly. Train them similarly even if your AI systems aren't high-risk (the Art.4 AI literacy obligation applies to all AI systems)
    • What on? Train them (as appropriate to their role, technical knowledge, experience, education and training) on AI technicalities, use/safeguards, and interpretation of output, as well as raising their awareness about AI's opportunities, risks and possible harms, taking into account the context the relevant AI system is to be used in, and the persons or groups of persons on whom your AI system is to be used
      • The "AI literacy" definition (below) mentions skills, knowledge and understanding, "taking into account" AI Act rights and obligations, to make an informed deployment of AI systems. This might imply that relevant staff should also be trained, at least to a basic level, on what your obligations are under the AI Act as deployer/provider - even engineers who aren't in the Legal/Regulatory/Compliance teams
    • How?
      • Others' experiences. To see what other organisations are doing on AI literacy, you can review the Commission's "living repository" compilation of AI literacy practices of many organisations (15 currently) from different sectors and of different sizes (direct link). Added: on 2 Apr 25 the Commission conducted a survey to gather practices for the repository.
        Also consider attending/viewing the AI Pact webinar on AI literacy on 20 Feb 25 (YouTube livestream). If you don't have enough internal resources/expertise to train your staff, external third-party resources are available, but do check that whoever you engage is sufficiently knowledgeable. There are now many out there who offer AI training (a nice market since ChatGPT, and it can only get bigger!) - but how well qualified or expert are they? Many big, well-known AI companies already provide online AI training (typically tailored to their own services but covering the basics too), often for free, so it's worth checking those out.
      • AI jargon. My free YouTube video demystifying key AI jargon/terminology may also be of use 😉 - do incorporate it if you wish
  • What else? Consider contributing to any sectoral/industry initiatives on training/awareness of people ("affected persons") who may be affected by your use of AI systems, and/or of other actors in the AI value chain
    • Surely an "affected person" won't be making any "deployment" of AI systems, so the "AI literacy" definition doesn't work very well in relation to them... for affected persons it seems to be more about awareness-raising on AI opportunities and risks/harms than about training
  • What to monitor for?
    • Monitor relevant Member States' national laws for any local penalties that might be imposed for infringement of this obligation (seems unlikely, but you never know)
    • Watch out for any voluntary codes of practice on promoting AI literacy "facilitated" by the EU AI Office/relevant Member States under Art.95(2)(c), and take on board anything from them if you can
    • The AI Board is supposed to support the Commission in promoting AI literacy, public awareness and understanding of the benefits, risks, safeguards and rights and obligations in relation to the use of AI systems. If and when they put anything out, again see whether what they produce can usefully be incorporated into your own AI literacy program.
  • (Added) Update - more resources

Key background info for ease of reference

  • Art.4 AI literacy obligation: Providers and deployers of AI systems shall take measures to ensure, to their best extent, a sufficient level of AI literacy of their staff and other persons dealing with the operation and use of AI systems on their behalf, taking into account their technical knowledge, experience, education and training and the context the AI systems are to be used in, and considering the persons or groups of persons on whom the AI systems are to be used.
    • Rec.91: ... deployers should ensure that the persons assigned to implement the instructions for use [of high-risk AI systems] and human oversight as set out in this Regulation have the necessary competence, in particular an adequate level of AI literacy, training and authority to properly fulfil those tasks...

  • "AI literacy" definition: skills, knowledge and understanding that allow providers, deployers and affected persons, taking into account their respective rights and obligations in the context of this Regulation, to make an informed deployment of AI systems, as well as to gain awareness about the opportunities and risks of AI and possible harm it can cause
    • Rec.20: In order to obtain the greatest benefits from AI systems while protecting fundamental rights, health and safety and to enable democratic control, AI literacy should equip providers, deployers and affected persons with the necessary notions to make informed decisions regarding AI systems. Those notions may vary with regard to the relevant context and can include understanding the correct application of technical elements during the AI system’s development phase, the measures to be applied during its use, the suitable ways in which to interpret the AI system’s output, and, in the case of affected persons, the knowledge necessary to understand how decisions taken with the assistance of AI will have an impact on them. In the context of the application of this Regulation, AI literacy should provide all relevant actors in the AI value chain with the insights required to ensure the appropriate compliance and its correct enforcement. Furthermore, the wide implementation of AI literacy measures and the introduction of appropriate follow-up actions could contribute to improving working conditions and ultimately sustain the consolidation, and innovation path of trustworthy AI in the Union. The European Artificial Intelligence Board (the ‘Board’) should support the Commission, to promote AI literacy tools, public awareness and understanding of the benefits, risks, safeguards, rights and obligations in relation to the use of AI systems. In cooperation with the relevant stakeholders, the Commission and the Member States should facilitate the drawing up of voluntary codes of conduct to advance AI literacy among persons dealing with the development, operation and use of AI.

Monday, 18 November 2024

AI: legitimate interests, controller/processor questions - data protection/privacy

Under GDPR, can personal data be processed based on legitimate interests for AI-related purposes, whether for training, deployment, or beyond? That was the key focus of the EDPB stakeholder event on AI models on 5 Nov 24, which I registered for and was fortunate enough to get a place at - many thanks to the EDPB for holding this event!

The event was intended to gather cross-stakeholder views to inform the EDPB's drafting of an Art.64(2) consistency opinion on AI models (defined quite broadly) requested by the Irish DPC. The EDPB said it would issue this opinion by the end of 2024 but, unlike EDPB guidelines, such consistency opinions can't be updated - which is concerning given how important this area is.

The specific questions were:

  1. AI models and "personal data" - technical ways to evaluate whether an AI model trained using personal data still processes personal data? Any specific tools / methods to assess risks of regurgitation and extraction of personal data from AI models trained using personal data? Which measures (upstream or downstream) can help reduce risks of extracting personal data from such AI models trained using personal data? (including effectiveness, metrics, residual risk)
  2. Can "legitimate interest" be relied on as a lawful basis for processing personal data in AI models?
    1. When training AI models - and what measures to ensure an appropriate balance of interests, considering both first-party and third-party personal data?
    2. In the post-training phase, like deployment or retraining - and what measures to ensure an appropriate balance, and what if the competent supervisory authority found the model's initial training involved unlawful processing?

There wasn't enough time for me to explain my planned input properly or to comment on some issues, given the number of attendees, so I'm doing that here. I'll take the second set of questions first.

Training AI models - legitimate interests

I strongly believe legitimate interest should be a valid legal basis for training AI with personal data - particularly training AI to reduce the risk of bias or discrimination against people when the AI is used in relation to them.

I had a negative experience with facial biometrics. The UK Passport Office's system kept insisting my eyes were shut, when they were wide open - they're just small East Asian eyes; white people's eyes are usually bigger. Others have suffered far worse from facial biometrics and facial recognition, including wrongful arrests, denial of food purchases and debanking (see my book and 23.5 of the free companion PDF under Facial recognition).

Had the AI concerned been trained on more, and enough, non-white faces, it would be much less likely to claim that facial features which didn't match typical white facial features (like eye size or hair shape) were "inappropriate", or to misidentify non-white people, leading to their wrongful arrests.

The EU AI Act is aware of this risk: Art.10(5) (and see Rec.70) specifically permits providers of high-risk AI systems to process special categories of personal data, subject to appropriate safeguards and meeting certain conditions:

  1. the bias detection and correction cannot be effectively fulfilled by processing other data, including synthetic or anonymised data;
  2. the special categories of personal data are subject to technical limitations on the re-use of the personal data, and state-of-the-art security and privacy-preserving measures, including pseudonymisation;
  3. the special categories of personal data are subject to measures to ensure that the personal data processed are secured, protected, subject to suitable safeguards, including strict controls and documentation of the access, to avoid misuse and ensure that only authorised persons have access to those personal data with appropriate confidentiality obligations;
  4. the special categories of personal data are not to be transmitted, transferred or otherwise accessed by other parties;
  5. the special categories of personal data are deleted once the bias has been corrected or the personal data has reached the end of its retention period, whichever comes first;
  6. (GDPR) records of processing activities include reasons why processing of special categories of personal data was strictly necessary to detect and correct biases, and why that objective could not be achieved by processing other data.
(Aside: I know that Article also mentions "appropriate safeguards", but I'd argue that meeting those conditions would provide the minimum required safeguards - although in some cases others could be considered necessary.)

The Act confines this permission to the use of special category data in high-risk AI systems, but I'd argue that legitimate interests should permit the use of non-special category personal data through meeting the above conditions (and any other appropriate safeguards). 

Recall that personal data can be processed under GDPR's legitimate interests legal basis if "necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data, in particular where the data subject is a child." The EDPB's recent guidelines on processing based on Art.6(1)(f) note three cumulative conditions to enable processing based on legitimate interests:

  • the pursuit of a legitimate interest by the controller or by a third party;
  • the need to process personal data for the purposes of the legitimate interest(s) pursued; and
  • the interests or fundamental freedoms and rights of the concerned data subjects do not take precedence over the legitimate interest(s) of the controller or of a third party

Let's review those in turn.

Detecting and correcting bias involves the pursuit of a legitimate interest of the controller, i.e. AI developer, and third parties. I'd argue that many, many third parties, being those in relation to whom the AI is to be used, have a legitimate interest in not being discriminated against due to biased AI. I've already mentioned biased AI resulting in wrongful arrests, denial of services important to life like food buying, and debanking (see 23.5 of my free PDF under Facial recognition). 

It is indeed necessary to process personal data of people in certain groups in order to train AI models to reduce bias, as per the experiences noted above and much more.

Finally, the balancing test in the final limb must clearly consider the legitimate interests, not just of the controller, but also of "a third party" - in this case, the legitimate interest of third parties, in relation to whom the AI is to be used, not to be discriminated against. (While fairness is a core principle of the GDPR, this only concerns fairness to the individual whose personal data is being processed. Processing A's personal data to try to ensure fairness to B isn't a concept explicitly provided for in GDPR. There are mentions of "rights and freedoms of others" or other data subjects, but more in the sense of not adversely affecting their rights/freedoms, rather than positive obligations in their favour.) 

I argue that, if the conditions in Art.10(5) AI Act are implemented as a minimum when training AI using personal data, that should tilt the balancing test in favour of the controller and those third parties, and enable legitimate interests to be used as the legal basis for the training - at least in the case of non-special category data - even when training non-high-risk AI. I really hope the EDPB will agree. 

However, the problem remains of how to use special category data to train non-high-risk AI systems to detect and address bias. Some examples I mentioned could fall through the cracks. 

The UK Passport Office's AI system, designed to reject photos with "inappropriate" facial features, is probably a high-risk AI system within Annex III para.5(a) (if the Act applied in the UK). Yet, para.5 (and Annex III more generally) does not protect anyone from being refused a private bank account or being debanked as a result of biased AI being applied to them.

And here is a huge hole in the AI Act: Annex III para.1(a) excludes "AI systems intended to be used for biometric verification the sole purpose of which is to confirm that a specific natural person is the person he or she claims to be". What if an AI biometric verification system used by a bank mistakenly says someone is not who they claim to be, because it can't verify the identity of non-white people properly due to not having been trained on the faces of enough non-white people - and the bank's systems therefore automatically debank that individual? How can such a biased AI biometric verification system be "fixed", if it can't be fully trained in this way?

Such an AI system is not classed as a high-risk AI system, because of the biometric verification exclusion. Therefore, the developer isn't allowed to train the AI using special category data, because Art.10(5) AI Act only allows this for high-risk AI systems! (Yes, I know there's the odd situation where biometric data is "special category" data only when used for the purpose of uniquely identifying someone, so it could be argued that using non-white people's facial biometrics to train AI isn't processing their special category data, because the processing purpose isn't to identify those specific people, and I'd certainly be putting that argument and pushing for being able to use legitimate interests for that training. But - really? Why should those arguments be necessary?)

It was argued that Art.9(2)(g) (processing necessary for reasons of substantial public interest, etc.) doesn't allow processing of special category data to train AI, even though there is a substantial public interest in addressing bias. I agree there is a huge public interest there, but I also agree that, due to the wording of that provision, it can't apply unless proportionate EU or Member State law provides a basis for such processing. EU law in the form of AI Act Art.10(5) does provide a basis for processing special category data in high-risk AI systems, but it doesn't provide such a basis in the case of non-high-risk AI, or non-special category data - hence the need to argue that biometric data isn't special category when used for training! I guess it'll have to be down to national laws to provide for this clearly enough. France, Germany or Ireland, perhaps?

(Consent isn't feasible in practice here, given the volumes involved, and issues like having to repeat AI training after removing, from the training data, any personal data where consent has been withdrawn. It was argued that financial costs or training time for AI developers shouldn't be relevant in data protection, but equally it was argued that environmental costs etc. of repeating training are relevant. I'll only briefly mention practical workarounds, like not removing that data but preventing it from appearing in outputs using technical measures whose efficacy is debated.)

If including my personal data in training datasets can help to reduce the risk of otherwise biased AI systems discriminating against you (should you be in the same ethnic or other grouping as me) when deployed, personally I'd be OK with that - partly informed by my own bad experiences with AI biometrics. Shouldn't such processing of data for AI training be permitted, even encouraged? But, currently, this issue is not properly or fully addressed, as I've shown above. So, there's a big data dilemma here that still remains to be dealt with.

AI models and personal data

Does an AI model "contain" personal data, given that strictly it's not a database per se? Or is it just something that can be used to produce personal data when used in deployment, with personal data being processed only at the usage stage? Much debate, and diametrically opposing views (and difficult questions like: can a GPAI model developer be said to control the purposes and means for which deployers of the model use it?). [Added: I meant to expand on that to clarify the question - is the model developer controlling the purposes of processing personal data, particularly with general-purpose/foundation models, or is it merely providing part of the means of processing to others, i.e. is it really a "controller"?]

Rather than pinhead-dancing around that question, personally I think that use of a deployed AI system is the most relevant processing here, because that's the main point at which LLMs/large language models (which the event focused on, pretty much exclusively) could regurgitate accurate or inaccurate personal data - whether through prompt injection attacks or similar, or because a model's guardrails weren't strong enough.

I feel the EDPB's query on technical ways to evaluate whether an AI model trained using personal data "still processes personal data" is really more one for technical AI experts to answer, and that what merits more attention is preventing training data's regurgitation/extraction at the deployment/use stage, whether personal data or otherwise. It's well known that attacks have successfully obtained training personal data from models - although with some limitations and caveats (paper & article; another article). This has been shown to be possible not only with open source models (where attackers obviously have access to more info about the model, its parameters etc., and indeed to the model itself), but even with semi-open and closed source models like ChatGPT.
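
For a flavour of what such an evaluation might involve, here's a deliberately naive sketch of a memorisation probe (my own illustration, not a method from the papers linked above): prompt the model with the start of a known training record and check whether it completes the rest verbatim. The generate stub is a placeholder for whatever inference call you actually use.

```python
# Naive memorisation probe - an illustrative sketch only, not a robust method.
# `generate` is a placeholder: swap in a call to the model under test.

def generate(prompt: str) -> str:
    """Placeholder for the model's completion of a prompt."""
    return ""  # replace with a real inference call

def memorisation_rate(training_records: list[str], prefix_len: int = 50) -> float:
    """Fraction of sampled training records whose suffix the model reproduces verbatim."""
    if not training_records:
        return 0.0
    leaked = 0
    for record in training_records:
        prefix, suffix = record[:prefix_len], record[prefix_len:]
        if suffix.strip() and suffix.strip() in generate(prefix):
            leaked += 1
    return leaked / len(training_records)

# Example: probe with a few records you know were in the training set.
print(memorisation_rate(["Jane Doe, 12 High Street, Anytown, born 1 Jan 1980, customer since 2015"]))
```

Real evaluations are far more sophisticated (sampling strategies, fuzzy matching, membership-inference statistics), but even this toy version shows why the question is fundamentally an empirical, technical one.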

Again, my view is that assessing and reducing training data regurgitation/extraction risks are essentially questions for technical AI experts. Reducing such risks mainly involves technical measures, and this is an emerging area where much research continues to be conducted, so I feel it's premature to rule on such measures at this point in time (although organisational measures are also possible, and recommended, like deployers prohibiting their users from trying to extract personal data from any AI).
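
To make "technical measures" a little more concrete, one very simple downstream example (my own sketch - real deployments use far more sophisticated filters, and the patterns here are purely illustrative) is scanning a model's output for obvious personal-data patterns before it is returned to the user:

```python
import re

# Illustrative output filter: redact obvious personal-data patterns from model output.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def filter_output(text: str, placeholder: str = "[REDACTED]") -> str:
    """Redact email addresses and phone-number-like strings from a model's output."""
    text = EMAIL.sub(placeholder, text)
    text = PHONE.sub(placeholder, text)
    return text

print(filter_output("Contact Jane at jane.doe@example.com or +44 20 7946 0958."))
# Contact Jane at [REDACTED] or [REDACTED].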

AI value chain: controllers, processors

More interesting, and difficult, from a GDPR perspective are the crucial questions of: who is a controller, who is a processor, who is liable for what, and at which stages in the AI lifecycle? 

Unfortunately, these weren't really discussed at the event. To be fair, the focus of the event was meant to be legitimate interests, not the controller/processor position of AI model/system providers. 

I still tried to raise them, but wasn't allowed to speak again to clarify my points, so I'll do that below in the form of some "exam questions". But, first, I want to spell out some issues with the AI supply chain that I couldn't expand on during the event.

If a developer organisation makes its own AI model available for customers to use, depending on the business model adopted by the organisation (and the following isn't comprehensive!), the supply chain can involve several alternative options:

  • The model could be accessed via the model developer's API, and/or
  • The model could be permitted to be:
    • Downloaded by customers as a standalone model, then 
      • Embedded/integrated within an AI system developed by the customer (which the customer could use internally only, or offer to its own customers in turn), or
      • Accessed by a customer-developed AI system (which the customer could use internally only, or offer to its own customers in turn) via API, where the downloaded model is hosted
        • on-prem, or 
        • (more likely) in-cloud, using the customer's IaaS/PaaS provider, but with all AI-related operations being self-managed by the customer, or 
    • (Common nowadays) deployed and used by the customer for the customer's AI system  (which the customer could use internally only, or offer to its own customers in turn), through the customer using a provider's cloud AI management platform with the benefit of tools/services available from the cloud provider to ease AI-related operations like fine-tuning models, building AI systems, using RAG, etc.
      • Note: the model used could be one of the cloud provider's own models (i.e. where the cloud provider is the model developer), or it could be a third-party model offered through the cloud provider's own AI marketplace or similar. Exactly what licence/contract terms apply to the customer in such a scenario, particularly with third-party models, let alone what the controller/processor position is there, is still clear as mud (see below).
Note that an AI system can use or integrate more than one AI model.

Also note that the above applies equally to how an AI system is accessed, i.e. via API, or by embedding the system within an AI product/solution/tool, or using a cloud AI management platform, and that an AI system can use or integrate more than one other AI system (i.e. rinse and repeat the above, on AI models, to AI systems). See my PDF that I'd previously uploaded to LinkedIn (with a small clarificatory update):

And I won't even mention the twists introduced by using RAG/retrieval-augmented generation in LLMs, at this point.
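
To illustrate the first two options above in code (a hedged sketch only - the endpoint URL, API key, model name and response schema are all hypothetical, and nothing here reflects any particular provider's actual API), the practical difference between accessing a developer's hosted model via its API and downloading the model to run inside your own AI system might look like this:

```python
import requests  # third-party HTTP library, for the hosted-API option
from transformers import pipeline  # Hugging Face library, for the downloaded-model option

# Option 1: the model stays on the developer's infrastructure; the customer only
# sends prompts (which may contain personal data) to the developer's API.
def call_hosted_model(prompt: str) -> str:
    resp = requests.post(
        "https://api.example-model-developer.com/v1/generate",  # hypothetical endpoint
        headers={"Authorization": "Bearer YOUR_API_KEY"},        # hypothetical key
        json={"model": "example-model-1", "prompt": prompt},     # hypothetical schema
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["output"]  # hypothetical response field

# Option 2: the customer downloads open model weights and runs inference itself,
# on-prem or on its own cloud infrastructure, embedded in its own AI system.
# (gpt2 is used purely as a small, freely downloadable example model.)
local_generator = pipeline("text-generation", model="gpt2")

def call_local_model(prompt: str) -> str:
    return local_generator(prompt, max_new_tokens=50)[0]["generated_text"]
```

Who hosts the model, and who decides what it is used for, is exactly what drives the controller/processor questions that follow.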

All that spelt out, now on to my exam questions!

  1. After an organisation deploys a third-party model in an AI system
    1. If a user in the organisation deliberately extracts personal data from the AI without the deploying/employing organisation's authorisation
      1. Is the rogue user a controller in their own right, so that the organisation is not responsible as a controller under data protection law (as with the Morrisons case in the UK)?
      2. Does or should the AI model developer bear any responsibility or liability at the deployment and use stage as a controller in some way, if the guardrails they implemented against the extraction weren't appropriate? Or could it be a processor, particularly if the model is hosted by the model or system developer?
        1. Even if the model is considered not to "contain" any personal data, so that the model developer is not a controller of the model itself, could the model developer be considered to have some responsibility if and when personal data is extracted from the AI at the deployment and use stage?
        2. Remember, for security measures under GDPR, a security breach alone doesn't mean the security measures weren't appropriate; it's quite possible for an organisation that had implemented appropriate security measures to suffer a personal data breach nevertheless.
        3. Also to reiterate, measures to reduce the risk of extracting training data from AI models are still being developed, this is very much a nascent research area.
        4. Recall that a developer providing software for download/on-prem install is not generally considered a processor or controller, but when it offers software via the cloud as SaaS, it is at least a processor, even a controller to the extent it uses customer data for its own purposes. If a model developer makes available a model (software), but doesn't host it for customers, it seems the developer shouldn't even be a processor?

    2. If a user in the organisation deliberately extracts personal data from an AI with the deploying/employing organisation's authorisation (e.g. for research, or for the organisation's own purposes)
      1. Is the organisation a controller, responsible/liable for that extraction as "processing"? (and could the GDPR research exemption apply there, if for research?)
      2. Could the AI model developer and/or AI system developer bear any responsibility or liability for this extraction as a controller in some way, if the guardrails they implemented against the extraction weren't appropriate, as above, or as a processor? 
        1. Note the same points/queries apply as in 1.1 above!
    3. If a user in the organisation uses the AI in such a way that, without the user intending it, the AI regurgitates personal data, who is responsible as controller for the output, which is "processing"?
      1. Remember, a user could process personal data by including it in the input provided to the AI (not discussed further here), but personal data could also be processed if it is included in the AI's output
      2. Does or should the AI model developer and/or AI system developer bear any responsibility or liability as a controller in some way, if the guardrails they implemented against inadvertent regurgitation weren't appropriate, or could it be a processor, or neither?
        1. Note the same points/queries apply as with deliberate extraction, 1.1 above.
      3. What difference if any does it make if the personal data in the output is accurate, or inaccurate (e.g. defamatory of the individual concerned)? 
    4. If a person unrelated to the organisation, e.g. a third-party hacker, manages to access the deployed AI to extract training data such as personal data, is the deploying organisation responsible as controller? What about the model/system developer?

  2. Do any of the above apply, and are they relevant, when an AI developer makes its model available to customers via the developer's API only? Is the model developer/provider a processor for customers in that situation?
    1. Again see 1.1 above. In particular, it seems the AI developer hosting the model offered to customers would at least be a processor here.

  3. What if a customer uses a third-party AI model hosted by the customer's cloud provider? Is the cloud provider only a processor for the customer, or could it be a controller in any way?
    1. Does it make a difference if the model used by the customer is the cloud provider's own model, or another party's model?
    2. Does it make a difference if the model's use is completely self-managed by the customer, or if the customer is using a cloud provider's cloud AI management platform?
    3. Do the licence terms, cloud agreement terms and/or other terms applicable to the customer's use of the cloud service/AI platform affect the position? (Under GDPR it's the factual control of purposes and means that matters, and contract terms are not determinative, but terms could nevertheless influence the factual position in some cases, especially in what they permit or prohibit...)
    4. Indeed, back to the AI Act, who is the model provider - the AI platform provider, or the model developer?

  4. Rinse and repeat for AI system developers/providers - could they be responsible/liable as controllers and/or processors especially if a model provider hosts its model or AI systems using its model for customers in-cloud?
(There are many more questions and issues; these are just the key ones that spring to mind most immediately, believe it or not!)

Answers on a postcard...?

Sunday, 10 April 2022

Security training - review of Security Innovation's Cmd+Ctrl Shred cyber range & security training

GDPR supervisory authorities (SAs) emphasise data protection training (e.g. the UK Information Commissioner's personal data breach notification form asks, "Had the staff member involved in this breach received data protection training in the last two years?", and "Please describe the data protection training you provide, including an outline of training content and frequency").

What about security? Security of personal data is of course important under GDPR, and organisations can be fined for not having appropriate security measures in place. While security training for developers is not specifically mentioned in GDPR as such, developers do also need training on application security issues that can lead to breaches of websites, online services and any databases or other data storage behind them (including any personal data in those systems). Most IT staff, developers and otherwise, are not necessarily cyber security (or even security) experts, and must be educated on what to look for and how to address, at least, the most common key security issues.

Many online training courses on cybersecurity for developers are now available. There are also "cyber ranges" offering deliberately vulnerable systems, websites or online applications that users can attack and seek to exploit, to learn how hackers think and the kinds of actions they take, and therefore be able to defend against them better.

As part of OWASP London CTF 2021, in Nov 2021 Security Innovation generously offered participants free access for a month to a fake e-commerce website "Shred Skateboards" on its CMD+Ctrl CTF (Capture the Flag) web application cyber range, and for 6 weeks to its Bootcamp Learning Path, a self-paced online training course incorporating 32 selected courses from its full catalog of training courses.

This blog reviews the Shred range, then the online training courses. These cover some of the issues referenced in the recently-finalised European Data Protection Board (EDPB) Guidelines 01/2021 on Examples regarding Personal Data Breach Notification, as those Guidelines include some recommended security measures as well as breach notification, and also mention OWASP for secure web application development. 

Cmd+Ctrl Ranges and Shred

Cmd+Ctrl's ranges are generally available only to paying organisations to train their staff (but not to paying individuals, sadly - a missed trick, as I think individuals wanting to improve their ethical hacking skills would pay a reasonable fee or subscription for access). People who signed up for the event were, however, given free access to Shred for a month. Shred is meant to be one of the easy ranges.

The Cmd+Ctrl login page provides some sensible disclaimers and warnings: 

After logging in, you need to click on the relevant range's name and wait a few minutes for it to start up (each user gets their own virtual machine, I suspect on Amazon Web Services) as a real website available on the Internet with its own URL (hence the exhortation not to enter sensitive information on the website - I would expand that to real names, real email addresses and basically any real personal data, because real hackers can access that website just as much as you can!).

Then, basically, explore the website and try different things to find vulnerabilities, e.g. click the links, register user accounts, try different URLs, enter different things into the search or login forms, etc. I won't share screenshots of Shred so as not to give anything away, but it emulates an online shop for skateboards and related accessories, with pages and user accounts that can store user details including payment cards, the ability to purchase gift cards, etc. Each machine is up for, I believe, 48 hours, and each time you start it, it may have a different URL and IP address. If things go badly wrong you may have to reset the database (which loses your changes, e.g. a fake user you registered) or even do a full reset, but you're not penalised for that: the system retains the record of scores you achieved for previous exploits.

When you successfully exploit a vulnerability, a banner slides in from the top of the webpage indicating what challenge was solved and how many points you gained for it. You can also see what broad types of other challenges remain unsolved. 

Via the My Stats link, you can see a Challenges page, which also gives similar broad information about the types of challenges remaining unsolved. Unfortunately, only Category information was provided regarding unsolved challenges (see the Category column of the Solved table shown below for examples). 

No detailed information about the exact nature of any challenge (i.e. the info under the Challenge column, such as "Unsafe File Upload" in the table above) was provided. It appeared only after you actually solved the challenge, whereupon it was listed in the Solved table (as well as the banner appearing). The "Get Hints" link was disabled for this event - but presumably hints are available in the paid versions of the ranges. However, Security Innovation provided a live online introduction on the first day of the CTF event, access to a one-page basic cheat sheet tutorial with a guide to Burp Proxy for intercepting HTTP traffic, and weekly emails with some hints and links to helpful videos. A chat icon at the bottom right of every webpage allowed the user to ask questions of support staff. I tried to confine my range attempts to the afternoon/evening given that Cmd+Ctrl is US-based, but I was very impressed with how quickly responses were given to my chat queries, even though I was using the range as an unpaid user. The support staff did not give away any answers, but instead provided some hints, often very cryptic - I suspect similar to the tips that users for whom the "Get Hints" link is enabled would receive.

Under My Stats there was also a Report Card link giving detailed information about your performance, also in comparison to others who had attempted the range, including the maximum score reached. Challenges were again shown here, broken down by category and percentage solved. 

As well as repeating the solved challenges table further down on this page, there's also a time-based view of the user's stats. As you'll see, I had a go over the first weekend, solving a few basic and easy challenges, then left it until I realised that I would lose access to Shred soon, so I made a concerted effort over the last few days, though I ran out of energy with an hour or two to spare!

 

I was rather chuffed that, as a mere lawyer and not a cybersecurity professional, I managed to complete 25 out of the 35 challenges and reach rank 7 out of the 54 people who at least attempted Shred (in the screenshots below I've redacted names and handles other than common ones like Mark or David). I admit I have attended some pen testing training: one excellent 2-day course with renowned web security expert Troy Hunt (yes, I was very lucky), and one terrible week-long course with someone whose name should never be mentioned again (but at least the food was great). However, those courses were several years ago, and this is the first time that I've attempted a range or CTF event. (I've signed up for other services with some similarities, Hack the Box and RangeForce Community Edition, but I haven't had time to try them properly yet.)

 

Prerequisites for trying these ranges

You do need some prior knowledge, particularly about HTML; how URLs, query parameters and web forms work; HTTP, cookies, databases and SQL; and concepts like base64 encoding and hashes. You also have to know how to use tools like Chrome's built-in developer tools to edit Shred webpages' HTML. I'd not used those developer tools before tackling Shred, but searched online for how to (I didn't resort to Burp for Shred, myself). I probably have a better foundation than most tech lawyers as I have computing science degrees as well as the pen testing training, coupled with a deep and abiding interest in computing and security since my childhood days. So I'd strongly recommend that those without such a foundation take the courses before attempting any ranges (the courses are covered in more detail below).
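
By way of a quick illustration of one of those concepts (my own example, not taken from the range or the courses): base64 is just a reversible encoding, whereas a hash is a one-way digest - a distinction that several web challenges rely on.

```python
import base64
import hashlib

secret = "admin:password123"

# Base64 is encoding, not encryption - anyone can reverse it.
encoded = base64.b64encode(secret.encode()).decode()
print(encoded)                             # YWRtaW46cGFzc3dvcmQxMjM=
print(base64.b64decode(encoded).decode())  # admin:password123

# A hash is a one-way digest - you can compare values, but not decode them.
print(hashlib.sha256(secret.encode()).hexdigest())
```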

Positives

The range provided an excellent assortment of different vulnerabilities to try to exploit, most of them of types that exist in real life (indeed, I recently spotted a common one on a site I shop from, when I mistyped my order number into its order tracking form!). The chat support staff were very prompt, although I couldn't figure out some of their hints.

Negatives

Shred included 3 challenges (maybe more?) that involved solving certain puzzles (at least one of which scored quite a few points). However, I think the range would have been better without them, as you wouldn't find them on actual websites - they were simply puzzles to solve, not realistic website vulnerabilities. OK perhaps for some fun factor, not so much for learning about web vulnerabilities, particularly as access to the range is time-limited.

The biggest negative in my view is that no model answers are given at the end. If you haven't managed to solve some of the challenges, tough luck, they won't tell you how. A support person said they felt that these ranges could be devalued by "giving away too much", because customers pay to access the ranges. However, I think that view is misconceived.

It depends on how customers use these ranges internally. I believe they would be best used as hands-on training for tech staff (developers, security), and I can't see why previous users would give away the answers to colleagues or indeed people in other organisations, as that defeats the object of trying these ranges. If organisations required staff to achieve a minimum score on these ranges, then yes, that might incentivise "cheating" and disclosure of solutions. But it's not uncommon, and in fact often a good thing, to form teams to solve challenges together and share knowledge. For this and many other reasons, such a requirement would not make sense. And it would make no sense for one customer of Security Innovation to give the answers away to other customers - what would be the purpose of that?

Conversely, it would be very frustrating for someone who had paid to use the range to find out that they would not be told any outstanding answers at the end. If you haven't managed to teach yourself the solutions, you don't know what you don't know - how will you learn if they refuse to fill in the gaps? Security Innovation already impose a condition on the login page that users cannot post public write-ups or answer guides, which they could expand if they wish (though I don't think that's necessary or desirable).

In a similar vein, I think they should at least give hints about the detailed challenges (e.g. "Unsafe file upload" as one challenge), not just categories of challenges. The cheat sheet mentioned a few types of vulnerabilities that I spent too many hours trying to find, and it was only on the last day or two before expiry that I asked on the chat, only to be told Shred didn't actually have those types of vulnerabilities! I appreciate Cmd+Ctrl doesn't want to give too much away, but knowing there's an unsafe file upload issue to try to exploit still doesn't tell you how to exploit it, and it would have saved me so much time, particularly given that access to Shred was time-limited. Again, I think paying customers would appreciate more detailed hints so that they can be more targeted and productive in tackling the challenges during the limited time available (and perhaps "Get Hints" would have done that, but access was disabled for this event).

Also, I'm not sure how time-limited access would be for the paid version, but organisations wanting to subscribe should of course check the details and ensure the time period is sufficient for their purposes, as staff also have to do their jobs! (I tried the range during my annual leave).

Final comments

I think it's definitely worth it for organisations to pay for their developers to try these ranges, subject to the negatives mentioned above (and see below for my review of the training courses). These ranges can be more interesting and fun for users, and certainly involve more active learning (looking into various issues in context as part of attempting to exploit those types of vulnerabilities), which research has shown improves understanding, absorption and retention. And of course, gamification is known to increase engagement. Attempting these ranges would help to consolidate knowledge gained during the security training. 

But, as mentioned above, I believe the best way would be to give staff enough time to tackle the ranges, over a reasonable period during which the relevant range is open. Don't make staff do this exercise during their weekends or leave, or require each person to reach a minimum score; instead, hold a debrief at the end of the period, for staff to discuss the exercise and share their thoughts (and hopefully receive the answers to challenges none of them could solve, so that they can learn what they didn't know). I appreciate that leaderboards and rankings can bring out the competitive streak and make some people try harder, but I believe team members need to cooperate with each other, and staff shouldn't be appraised based on their leaderboard ranking (or be required to reach a minimum score) - the joint debrief and "howto" at the end is, I feel, the most critical aspect of getting developer teams to work together better in future to reduce or hopefully eliminate vulnerabilities in their online applications.

Cmd+Ctrl offers a good variety of ranges with the stats and other features covered above, which seem very up to date in their scope: banking (two), HR portal, social media, mobile/IoT (Android fitness tracker), cryptocurrency exchange, products marketplace, and cloud. I wish I'd had the chance to try the cloud ones! In fact, there now seem to be 3 separate cloud-focused ranges: cloud infrastructure, cloud file storage, and what seems to be a cloud mailing list management app, i.e. both IaaS and SaaS. 

Wishlist

A range that actually allows the user to edit the application code to try to address each vulnerability, then test again for the vulnerability, would be great for developers!

Online training courses

Alongside access to Shred, for those who signed up to the Nov 2021 bootcamp, Security Innovation kindly offered access for 6 weeks to 32 online courses from its full catalog of training courses. I provide some comments on format and functionality first, then end with thoughts on the content.

I took the bootcamp courses, but the vast majority of them only after I'd finished the Shred range. The information in some of those courses would help with the Shred challenges, but not all of them, and they are aimed at developers, so to follow those courses you would also still need some prior computing and coding knowledge.

It was great that many courses were based on the MITRE CWE (Common Weakness Enumeration) classifications often used in the security industry, e.g. incorrect authorization (CWE-863), and on the OWASP 2017 top 10 security risks, but I won't list them all here. The topics covered by the bootcamp included: fundamentals of application security, secure software development, fundamentals of security testing, testing for execution with unnecessary privileges, testing for incorrect authorization, broken access control, broken authentication, database security fundamentals, testing for injection vulnerabilities, injection and SQL injection, testing for reliance on untrusted inputs in a security decision, testing for open redirect, security misconfiguration, cross-site scripting (XSS), essential session management security, sensitive data exposure (e.g. encryption), deserialization, use of components with known vulnerabilities, logging and monitoring, and XML external entities.

Several courses were split logically into one course on the problem, and the next on mitigating it, or testing for it. Personally, I learn best by being told the point, then seeing practical concrete worked examples, and I would have liked to see more concrete examples of e.g. XSS attacks or SQL injection attacks. A couple were given occasionally, but not enough in my view. (I appreciate some examples can be found by searching online.) 
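
To illustrate the kind of concrete worked example I'd have liked to see (my own sketch, not taken from the courses), here's the classic SQL injection pattern and its standard fix, using Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

def login_vulnerable(username, password):
    # Builds SQL by string interpolation - attacker-controlled input becomes SQL.
    query = f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'"
    return conn.execute(query).fetchone() is not None

def login_safe(username, password):
    # Parameterised query: input is treated as data, never as SQL.
    query = "SELECT * FROM users WHERE username = ? AND password = ?"
    return conn.execute(query, (username, password)).fetchone() is not None

# The classic injection payload bypasses the check only in the vulnerable version.
print(login_vulnerable("alice", "' OR '1'='1"))  # True - logged in without the password
print(login_safe("alice", "' OR '1'='1"))        # False
```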

The screenshot above shows Completed, but a course's status could also be displayed as in progress. You need to click against a particular course (where it shows Completed above) to enrol in the first place - an extra step whose purpose I couldn't fathom (why not just "Start"?). The 3-dot "action menu" enables you to copy the direct link to a particular course for sharing, or to pin individual courses.

Clicking on a course name takes you to a launch page, from where you can also open a PDF of the text transcription of the audio.

You can leave a course part-completed, and resume later: 

When you launch or resume a course, a video appears for playing. There are 3 icons on the top right, above the video, for a glossary (the book icon), help regarding how to use the video (the question mark), and the text version of the course (the printer icon).

Positives

These courses cater for people with different learning styles, by providing both videos and PDF transcriptions. Personally, I scan text a zillion times faster than if I had to watch a video linearly at the slower pace at which people speak, so for learning I much prefer text over video (plus the ability to ask questions, but I didn't see a chat icon - I don't know if that's available with the paid version?). So, I always clicked the printer icon to read the PDF (which opens in another tab) rather than watch the video.

A TOC button on the bottom right brings up a table of contents on the left, where you can click to go straight to a particular section of the video. It also shows progress, with a tick against the sections that you've watched.

Another positive, from an accessibility perspective: the CC (closed captions) button at the bottom right brings up the text transcript for the current part of the video, synchronised to the audio. 


Negatives

The PDF didn't always show all the slides from the video, especially in the first few courses - not all the slides contained substantive content, but some slides with example URLs or code were missing from the PDF version. So, personally, I only played the videos to check for any useful slides missing from the PDFs. 

If you play a video, it stops occasionally and you have to click the play button again to start the next section, which may not be obvious. Sometimes it stops to provide interactivity, i.e. the user has to click on one part of the slide to learn about that issue, click on another part to learn about another issue, etc. I hate these types of features, myself. I would prefer videos to just play continuously, moving on from section to section, unless and until the user pauses them. Stopping a video to force the user to click on something just to get to the next portion seems popular, particularly with the periodic online staff training that many are compelled to undergo for regulatory compliance reasons, but really it's not the same as active learning, in my view! Forced stops like these just break the train of thought and get in the way when the user wants to get a move on. But perhaps this is a matter of personal preference, so allow me my rant about "interactive" online training courses!

Exam

At the end of a video, you can take an exam (and there are also Knowledge Check quizzes to answer throughout the video). As I had scanned the PDFs rather than watch the videos, I generally went straight to the exam via the TOC or by dragging the position arrow. 

If you pass an exam, you get a certificate of completion that you can download under the Transcripts section of the site, which also allows printing of the list of courses and marks (niggle: all certificate PDFs had the same filename; it would be great if certificate filenames followed the course name, and if you could download a single zipped file of all certificates in one go).

You're allowed to take the exam multiple times until you pass. Most exams comprise about 4-5 questions, although one had 3, a few 6-8, and another 12 questions. They estimate it takes about 5 mins per exam (10 mins sometimes), which I found was about right. 

It doesn't seem possible to go back and amend your answer if you change your mind about a previous question - when I tried to do that in one exam, it threw a fit and I ended up having to retake the exam (with the same answers) twice before it would register as completed.

 

At the end of the exam, your full results are shown (it doesn't show results per question as you go through):

Tips

The obvious answer is usually the right one, and if you think "Yes, but only if...", then the answer is probably "No"! I felt a few of the questions or multiple-choice answers were unclearly or ambiguously phrased. I also thought some of the questions were more about categorising vulnerabilities by type (e.g. broken authentication), or about the vulnerabilities themselves, rather than about how to mitigate them.

If you didn't pass, you can click Review Exam to see where you went wrong, which is helpful. I only had to retake one to pass (because of the "No" answer above when I had answered "Yes"!), but didn't bother to retake a few others where I'd passed with less than 100%.

I discovered that I actually knew more than I thought I did, so the courses didn't actually help me with Shred (although the support staff tips did). But I still learned some useful things that I didn't already know, and I strongly recommend that those without the necessary foundation should take these courses before trying the ranges.

Final thoughts

Overall, I would recommend the Cmd+Ctrl ranges as an excellent way for developers and security staff to learn about online application vulnerabilities, subject to taking the courses first for those without the prior knowledge. They really are aimed at developers/programmers, so most lawyers may struggle, even tech lawyers. I do think it's helpful for lawyers to have a basic knowledge of the common vulnerabilities and how they are exploited and mitigated when discussing cybersecurity measures and breaches with clients that have suffered incidents, but you probably don't need to tackle the courses or ranges to gain that knowledge.

Thanks very much again to Security Innovation for making Shred and the courses available for the OWASP London CTF 2021 event!

(I wrote this back in Dec 2021 but for various reasons couldn't publish it till now.)