Saturday, 14 January 2012

Architectures for Pseudonymisation (overview)

In this blog post, I analyse various topologies for delivering pseudonymisation technology in line with the accepted standards. Software architectures are numerous, however, so this post addresses only the most recent or topical options available in what is becoming a crowded marketplace.

Option 1: Data Warehouse, Behind the Firewall

This approach is the most widely adopted, especially in the NHS, and the benefits are obvious. Data is housed behind the organisation's firewall and is centralised to provide a single version of the truth in terms of patient identification. Data into and out of the organisation is strictly controlled and tracked by deploying the pseudonymisation ‘engine’ within a new or already established data warehouse. New sources of patient data are integrated into the central data warehouse, which is intended to function as a reporting repository, not necessarily as an ‘interfacing’ technology. The final section discusses this in greater detail, as interfacing data warehouses to applications used within an organisation can fundamentally weaken the pseudonymisation engine.

Organisations select this strategy with the following in mind:

· No dilution or ambiguity in terms of responsibility. The organisation processes its patients.

· Data Warehouses are the ideal way to control pseudonyms in a centralised manner.

· Connecting for Health, via the “Pseudonymisation Implementation Plan”, encouraged NHS trusts to implement pseudonymisation technology in a Data Warehousing environment.

· Key IG Toolkit requirements can only be deployed in a Data Warehousing environment.

· Ability to deploy ‘military grade’ encryption.

· Ability to control costs.

· Ability to centrally audit access to patient identifiable data.
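As a rough illustration of what centrally controlled pseudonyms can look like, the sketch below derives a pseudonym from a patient identifier using a keyed hash (HMAC-SHA256). The choice of HMAC, the key value, the function name and the sample identifiers are all assumptions made for illustration; none of them is taken from any particular product or from the Connecting for Health guidance.

```python
import hashlib
import hmac

# Hypothetical secret held only inside the data warehouse. In practice it
# would come from a protected key store, never from source code.
SECRET_KEY = b"example-key-held-behind-the-firewall"


def pseudonymise(patient_id: str, key: bytes = SECRET_KEY) -> str:
    """Return a deterministic keyed pseudonym for a patient identifier."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Made-up identifiers, used purely for demonstration.
first = pseudonymise("EXAMPLE-0001")
again = pseudonymise("EXAMPLE-0001")
other = pseudonymise("EXAMPLE-0002")

print(first == again)  # same identifier, same pseudonym
print(first == other)  # different identifier, different pseudonym
```

Because the mapping is deterministic for a given key, records arriving from different source systems can still be linked inside the warehouse, while the key itself never has to leave the organisation.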

Option 2: Deployed externally, data transfer

This approach takes organisational data feeds into an externally hosted data centre to create a ‘super repository’ of pseudonyms. The super repository has been attempted before at national level (NHS), and it has the benefit that all pseudonyms are managed centrally, as a single system.

This approach is not as widely deployed as Option 1, as many organisations could view it with a degree of suspicion. It involves data transfer, and therefore data interception (at worst) or technical issues (at best) could occur. Also, if this type of solution were deployed to a significant number of trusts, and interfacing were required to ‘wire in’ pseudo data into organisational applications, the sheer number of applications that the pseudo data would need to be technically compatible with could limit the strength of the pseudonym technology used (e.g. 16-bit encryption), which could result in a breach of privacy. Finally, ‘cloud’ solutions of this type are usually billed on a subscription basis, which is a new type of costing model compared to the traditional “licence + support” model commonly used across many organisations, not just the NHS.

Option 3: ‘Interfaced’ Solution, Behind the Firewall

This option combines many of the facets of Options 1 and 2, in that a centralised repository is required, but that repository must service multiple (modern and legacy) applications and services. The organisation benefits from having a centralised system, but that system must support both:

· Reporting

· Interfacing

The solution must therefore meet the exacting interface standards of all the applications mentioned, and of all applications that may be procured in the future. The algorithms used in the pseudonymisation engine must be capable of creating values that are compatible with every application that is to be protected by the pseudonymisation solution. In the opinion of this post, this would inevitably weaken the pseudonymisation engine, as it has many obligations to fulfil beyond the production of patient-level reports.
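The weakening described here can be made concrete with a small experiment. The sketch below assumes pseudonyms are produced by a keyed hash, and that a hypothetical legacy application can only store a 16-bit value (the width is borrowed from the example given under Option 2). The key, the function names and the synthetic identifiers are illustrative assumptions, not a description of any real engine.

```python
import hashlib
import hmac

KEY = b"illustrative-key"  # hypothetical key, for demonstration only


def full_pseudonym(patient_id: str) -> str:
    """Full-length (256-bit) keyed pseudonym, as hex."""
    return hmac.new(KEY, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def truncated_pseudonym(patient_id: str, bits: int = 16) -> int:
    """Keep only the leading `bits` bits, as a constrained field might force."""
    digest = hmac.new(KEY, patient_id.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


# Synthetic identifiers; more of them than a 16-bit field can distinguish.
patient_ids = [f"PATIENT-{n:06d}" for n in range(70_000)]

distinct_full = len({full_pseudonym(p) for p in patient_ids})
distinct_truncated = len({truncated_pseudonym(p) for p in patient_ids})

print("patients:", len(patient_ids))
print("distinct full-length pseudonyms:", distinct_full)
print("distinct 16-bit pseudonyms:", distinct_truncated)
```

A 16-bit field has only 65,536 possible values, so with 70,000 patients some must share a pseudonym. A space that small is also easy to enumerate, which is the privacy risk in forcing pseudonyms to fit the narrowest application.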

To conclude, multiple local government and advisory bodies have recommended pseudonymisation as the best, and indeed only, way of preventing breaches of the Data Protection Act. In the case of the NHS, this requirement has been passed by NHS Connecting for Health to local organisations, to be implemented as they see fit. By default, this creates an issue: an individual will possess multiple pseudonyms, and pseudonym types, when their data is stored at different trusts in different parts of the country. This scenario is now the reality facing NHS organisations, and data management (incorporating pseudonymisation) across multiple organisations must therefore be placed into a “next best fit” model, as so many versions of the truth exist.

Saturday, 19 November 2011

New Frontiers - Protecting Nottinghamshire Patient Identity with Oka-Bi Pseudonymisation Toolkit on the Microsoft Platform

The Oka-Bi Pseudonymisation Solution was put through a battery of tests during 2011, and all aspects of the solution have been verified, from functionality testing, to end-user usability, to performance tests. What follows is a brief overview of the test environment, shared so that the wider NHS can benefit from the feedback from this unique test scenario, in which an enterprise data warehouse containing 3 Primary Care Trusts’ data was triplicated and loaded through the Oka-Bi Pseudonymisation Engine.

The test technical platform was as follows: a 64-bit Windows Server 2008 machine with 8 GB of RAM, a dual-core processor and 100 GB of available disk space.

Data was prepared in a SQL Server 2000 loading database, and the Oka-Bi New Safe Haven and Pseudonymisation Engine was prepared in under 1 hour by use of the “Code Generator” application.

The data to be loaded consisted of the following, for each PCT:

A&E Historical Data – 7 years

Outpatient Data – 7 years

Inpatient Data – 5 years

Registered Population Data – 1 year

This data was then triplicated, with the aim of conducting an intensive test of the Oka-Bi Pseudonymisation Engine. Overall, then, a simulated 9 PCT load was carried out by Oka-Bi pseudonymisation specialists. This can be most easily thought of as follows: the 4 datasets for each organisation constitute the data warehouse, and the data warehouse was triplicated, with a single set of pseudonyms being used for each PCT's (triplicated) data. This provided a thorough test of the accuracy of the multi-pseudonym technology embedded in the Oka-Bi Pseudonymisation Engine, which achieved 100% accuracy.
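One way to picture the consistency check implied by this test is sketched below. It is not the Oka-Bi test harness: the per-PCT keys, the HMAC-based pseudonym function, the PCT labels and the synthetic identifiers are all assumptions made for illustration. The idea is simply that, within one PCT, every copy of the warehouse should yield the same pseudonym for the same patient, while different PCTs use different pseudonym sets.

```python
import hashlib
import hmac

# Hypothetical per-PCT keys: one pseudonym set per organisation.
PCT_KEYS = {"PCT_A": b"key-a", "PCT_B": b"key-b", "PCT_C": b"key-c"}


def pseudonymise(pct: str, patient_id: str) -> str:
    """Keyed pseudonym for a patient, using the given PCT's key."""
    key = PCT_KEYS[pct]
    return hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def load_copies(pct: str, patient_ids: list, copies: int = 3) -> list:
    """Simulate loading the same warehouse `copies` times for one PCT."""
    return [{pid: pseudonymise(pct, pid) for pid in patient_ids}
            for _ in range(copies)]


def accuracy(copies: list) -> float:
    """Fraction of identifiers whose pseudonym is identical in every copy."""
    first = copies[0]
    consistent = sum(all(c[pid] == first[pid] for c in copies) for pid in first)
    return consistent / len(first)


patient_ids = [f"PATIENT-{n:04d}" for n in range(1000)]
results = {pct: accuracy(load_copies(pct, patient_ids)) for pct in PCT_KEYS}
print(results)
```

In this toy version the pseudonym function is deterministic, so the check passes by construction. In a real load, the same comparison would be run against the pseudonyms actually written by the ETL process.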

After loading had taken place (over a 2 day period), the data was linked to the new Nottinghamshire Enterprise Data Warehouse, which is the delivery database for the data passing through the Oka-Bi solution. It is important to note that end users should not have direct access to the New Safe Haven, as this is an administrative database only, according to Connecting for Health guidelines. Some notable challenges were faced in this phase of development, as de-pseudonymised data needed to be available to end users via many different end user applications (Access and SQL Server sessions, to name but a few). These challenges were overcome through extensions to the toolkit, which meant that end user applications appeared in the precise context required by the de-pseudonymisation engine.

End user testing took place over a period of 2 months, with no major changes required to the engine. In terms of technical feedback, this was the richest part of the programme from the supplier perspective, as the final specifications were formalised based on ETL and end user query timings. Oka-Bi now possess an engine which can operate in a range of scenarios, and can scale as required in multiple architectural environments, with no compromise in terms of accuracy and performance.

The toolkit was significantly augmented during the last 6 months of test activity (some of which took place outside the load scenario above), so as to provide assurance to Oka-Bi that the toolkit can scale on demand to multiple scenarios. We were determined to make best use of the opportunity to test against such a huge dataset, and are looking forward to assisting new customers with such a new technical concept (i.e. the delivery of full scale pseudonymisation solutions on the Microsoft SQL Server platform).

It is also very important to note that the audit trail facility was a significant feature of the toolkit, and passed all tests with 100% accuracy. Every individual access request to pseudonymised and non-pseudonymised data was successfully recorded in the Oka-Bi Audit Log, giving senior management across the trusts insight into the usage behaviour of the tri-PCT community.
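To give a sense of how an audit trail of this kind might be structured, here is a minimal sketch in which every data request is recorded before any rows are returned. The class, field and function names are hypothetical and are not taken from the Oka-Bi Audit Log; a real implementation would write to a protected database table, not an in-memory list.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditLog:
    """In-memory stand-in for a central audit table."""
    entries: list = field(default_factory=list)

    def record(self, user: str, dataset: str, identifiable: bool) -> None:
        self.entries.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "dataset": dataset,
            "identifiable": identifiable,  # True for de-pseudonymised access
        })


def fetch(log: AuditLog, user: str, dataset: str, identifiable: bool, rows: list) -> list:
    """Record the request first, then hand back the requested rows."""
    log.record(user, dataset, identifiable)
    return rows


log = AuditLog()
fetch(log, "analyst_1", "outpatient", False, [])
fetch(log, "analyst_1", "inpatient", True, [])
fetch(log, "analyst_2", "outpatient", False, [])

# Simple usage summary of the kind senior management might review.
usage = Counter(entry["user"] for entry in log.entries)
print(usage)
```

Routing every request through a single function such as `fetch` is what makes the log complete: there is no path to the data that bypasses the recording step.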

To conclude, we at Oka-Bi now possess a unique perspective on the demands of a requirement as rigorous and multi-faceted as a pseudonymisation solution. In the changing NHS, we are proud to have developed a toolkit which operates efficiently in multiple contexts, with identical results, and which provides the ability to automate the most demanding parts of pseudonymisation software development, leading to dramatic reductions in development time.

Oka-Blog

Tuesday, 21 February 2012

"Practical Pseudonymisation" EHI Live 7 November 2011