Privacy & Security in Times of AI

By virtue of dealing with HR data, we are tapping into privacy and security-sensitive areas. In this post, Orgnostic’s co-founder and CTO touches on how we are managing such risks with AI tooling in a rather practical and down-to-earth way.

If you are in the business of building software tech products, it has been hard to miss the product and platform breakthroughs that have happened in the last two years with LLMs, and almost impossible to miss all the hype with the release of GPT-3.5 and GPT-4.

Even though these were years in the making, it seems to have happened almost overnight.

Humanity had its “AHA” moment witnessing the human-like conversational and summarization capabilities of such models, and, for better or for worse, the global software product race around them started.

As usual, there are many selling snake oil, but there are also some incredible new products and extensions of already existing products being built — the ability to improve the human-machine interaction using human language has never been so easy, precise, and, ironically, controllable.

Moreover, there are two camps, one that is calling for at least a temporary stop and the other that wants to press the pedal to the metal.

Personally, I am closer to the latter, but I believe that we still need strong ethical controls and reasonable legal controls in place.

However, as we step on the AI bandwagon and build products using relatively new AI platforms and infrastructure, what are the privacy, security, and ethical risks we face, and how do we deal with them?

In our case, we are building what we believe to be, by far, the most sophisticated and simple-to-use People Analytics platform to date.

By virtue of dealing with HR data, we are tapping into privacy and security-sensitive areas, and I would like to touch on how we are managing such risks with AI tooling in a rather practical and down-to-earth way.

What privacy means to us

A large portion of our technical and product team, myself included, previously worked in the biomedical space, building platforms that handled petabytes of biomedical data, primarily cancer genomics data, including sensitive patient data. So we do not take data privacy and security lightly.

Quite the opposite, actually.

We are a bit obsessed, sitting at the opposite end of the spectrum, and we hope the simple recipe below will help you deal with the FUD and mania around AI platforms and tooling, whatever you might be building.

First things first

First of all, never, ever send private or confidential data over to the OpenAI platform, or any other vendor for that matter, unless they have at least SOC 2 Type 2 in place, unless you have a clear NDA/DPA in place with them, and unless you know how your data would be handled and used.

The recent Samsung data leak, in which employees reportedly pasted confidential material into ChatGPT, is a clear sign that “it’s not there yet.” If you need data to be summarized or otherwise processed by an external provider, treat that provider like any other service whose inner workings you cannot adequately control or inspect.

Never, ever send private or confidential data as-is. Even if you are only summarizing over larger demographics and groups rather than individual records, be sure that you have (pseudo)anonymization in place.

It’s not trivial, but completely doable by controlling the data masking on your end. 
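
As a rough illustration of masking on your end, here is a minimal Python sketch. It is not our production code: the secret key, the e-mail pattern, the `pseudonym` and `mask_text` helpers, and the idea of passing in a list of known names are all assumptions made for the example. It replaces e-mail addresses and known names with keyed tokens before any text leaves your systems, and keeps the token-to-original mapping locally.

```python
import hashlib
import hmac
import re

# Hypothetical secret kept on your side; it is never sent to the vendor.
SECRET_KEY = b"replace-with-a-secret-from-your-key-store"

# Simplified e-mail pattern, good enough for a sketch.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonym(value: str, prefix: str = "person") -> str:
    """Derive a stable keyed token for a value (HMAC-SHA256, truncated)."""
    digest = hmac.new(SECRET_KEY, value.lower().encode("utf-8"), hashlib.sha256)
    return f"{prefix}_{digest.hexdigest()[:10]}"


def mask_text(text: str, known_names: list[str]) -> tuple[str, dict[str, str]]:
    """Replace e-mail addresses and known names with pseudonyms.

    Returns the masked text plus a local mapping (token -> original),
    so results can be re-identified on your side only.
    """
    mapping: dict[str, str] = {}

    def replace_email(match: re.Match) -> str:
        token = pseudonym(match.group(0), "email")
        mapping[token] = match.group(0)
        return token

    masked = EMAIL_RE.sub(replace_email, text)
    # Longest names first, so a full name is masked before a shorter one.
    for name in sorted(known_names, key=len, reverse=True):
        pattern = re.compile(rf"\b{re.escape(name)}\b", re.IGNORECASE)
        if pattern.search(masked):
            token = pseudonym(name)
            mapping[token] = name
            masked = pattern.sub(token, masked)
    return masked, mapping
```

Only `masked` would be sent to an external service; `mapping` stays with you. Keep in mind that this is pseudonymization of direct identifiers only. Free text can still identify a person indirectly (a job title in a small team, for example), so a real pipeline needs more than this.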

How to handle private data

If you need to handle private data, you can use some of the incredible alternatives that you can self-host and that give results comparable to GPT-3.5. For example, go with:

  • GPT4All if you’re building a non-commercial product (as it is based on the LLaMA model, whose license doesn’t allow commercial usage), or
  • Bloom if you are building a commercial product. Be sure you’re in line with Bloom’s great RAIL ethical AI license. It is worth reading whether you are using Bloom or not, as I am sure that we will see more and more similar licenses complementing regular open-source and commercial licenses in the future.

Also, if you’re using self-hosted models, you will have the benefit of being able to additionally optimize and fine-tune the models for a specific domain and usage, which is, at the time of writing, not available with some hosted models such as GPT-3.5 and GPT-4.

If you’re working with content that doesn’t contain any confidential or private data and is publicly available for access and usage, such as blog posts, publications, and papers, you should feel comfortable using it with GPT-3.5/4 and similar services.

Chances are that such content was used to train some of the large models in the first place, and you can use it to scope down the summarization and give the model better guidance.

If applicable, be a good citizen of the internet and refer to the original sources and authors if you’re using their work to generate the content. 

On building a semantic search

If you are building a semantic search using embeddings from LLMs, you will most likely be using a vector store, either an open-source option or a service such as Pinecone. If you are, be sure that the metadata enriching your vector data doesn’t contain any private information.

You can easily resolve any private or sensitive data once you retrieve the search results, using keys that refer to the records in the database or data warehouse where you store and process that data, provided that its vendor has the right security and privacy controls in place (such as AWS RDS, Redshift, Google BigQuery, Azure Database, etc.).
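
The split between the vector index and the private store can be sketched as follows. This is an illustrative toy, not a real integration: the in-memory `index` list stands in for whichever vector store you use, the in-memory SQLite table stands in for your database or warehouse, and the `employees` schema, the `employee_key` metadata field, and the 3-dimensional vectors are all made up for the example.

```python
import math
import sqlite3

# Private store: stands in for your own database or data warehouse.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employees (id TEXT PRIMARY KEY, name TEXT, email TEXT)")
db.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("emp_001", "Ana Kovac", "ana@example.com"),
        ("emp_002", "Marko Ilic", "marko@example.com"),
    ],
)

# Vector index: stands in for a hosted or open-source vector store.
# Each entry carries only an embedding and an opaque key, no names or e-mails.
# The vectors are toy values, not real embeddings.
index = [
    {"vector": [0.9, 0.1, 0.0], "metadata": {"employee_key": "emp_001"}},
    {"vector": [0.1, 0.8, 0.3], "metadata": {"employee_key": "emp_002"}},
]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, top_k=1):
    """Return the opaque keys of the entries closest to the query."""
    ranked = sorted(
        index, key=lambda e: cosine(query_vector, e["vector"]), reverse=True
    )
    return [e["metadata"]["employee_key"] for e in ranked[:top_k]]


def resolve(keys):
    """Look up private fields inside your own perimeter, after retrieval."""
    if not keys:
        return []
    placeholders = ",".join("?" for _ in keys)
    rows = db.execute(
        f"SELECT id, name, email FROM employees WHERE id IN ({placeholders})",
        keys,
    ).fetchall()
    by_id = {row[0]: {"name": row[1], "email": row[2]} for row in rows}
    return [by_id[k] for k in keys if k in by_id]
```

A call such as `resolve(search(query_vector))` then joins the two sides. The vector store only ever sees embeddings and keys like `emp_001`, which mean nothing without access to the private table.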

One important thing to remember is not to use LLMs for anything in the critical path.

Instead, use them as a helper and guide to get to the information you need. Current models are not yet ready to substitute for either human control or existing, battle-tested methods of automation. Think of GPT-era LLMs in production as an exoskeleton rather than a fully autonomous system.

It’s easy to be so impressed by some of the results you get from LLMs that you forget they lack precision and consistency for some tasks.
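
One way to read “not in the critical path” in code is sketched below, under assumptions of our own choosing: the numbers come from a deterministic function, and the model only contributes an optional summary that is dropped if the call fails or the text contradicts the computed figure. The `summarize` callable is a hypothetical wrapper around whichever LLM you use, and the headcount example is invented for illustration.

```python
import re
from typing import Callable, Optional


def total_headcount(rows: list[dict]) -> int:
    """Deterministic, testable computation: this is the critical path."""
    return sum(row["headcount"] for row in rows)


def report(
    rows: list[dict], summarize: Optional[Callable[[str], str]] = None
) -> dict:
    """Build a report whose numbers never depend on the model."""
    total = total_headcount(rows)
    result = {"total_headcount": total, "summary": None}
    if summarize is None:
        return result
    try:
        text = summarize(f"Describe in one sentence: total headcount is {total}.")
        # Crude guard: keep the summary only if it states the computed figure.
        if str(total) in re.findall(r"\d+", text):
            result["summary"] = text
    except Exception:
        # A model error or outage only costs us the optional summary.
        pass
    return result
```

The guard here is deliberately simple and would not catch every wrong statement. The point is the shape: the report is complete and correct without the model, and the model’s output is treated as a suggestion to be checked.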

In a nutshell, this is how we divide dos and don’ts.

This allows us to never expose sensitive and private customer data to vendors that don’t (yet) have the right controls in place, while still giving our customers the full benefit of the human-language interface through externally available data, including the search and summarization of the relevant content.

These recipes are simple and general enough that they could guide you in most of the cases or products that you’re trying to build. 

Last but not least, at the moment of writing this text, I am sitting at the Madrid airport, which comes as a great reminder. Every time I am in an airport, I sit in awe as I witness the incredible feat of how such a complex and beautiful system, a dance of organization and logistics, gets us from one part of the world to another in less than a day. With that, I am reminded of what human civilization, despite all its flaws, is capable of.

No matter how impressive, LLMs are hardly the most complex or incredible thing humanity has created.

Despite being an incredible milestone in research and engineering and incredibly useful for many product use cases, they are far from general intelligence. I strongly believe that machines are “not there to get us” or steal our data. 

We, the people building products, should simply continue doing what we have been doing with every other piece of the software stack and infrastructure: use our reason, assess the security and privacy risks and the ways of managing them one integration, problem, and vendor at a time, and be mindful of how we use AI tooling by understanding what it actually does.

And even beyond that, what are the ethical implications of the products that we are building?

They are not trivial, but definitely manageable.

As AI tooling and vendors mature, I expect that platforms such as the OpenAI-hosted GPT API/Plugins will become more enterprise- and privacy-friendly, with traditional compliance controls in place.

Remember where cloud vendors were just ten years ago?
