Turning the ‘Black-Boxness’ of Neural Networks into a Privacy Feature

How we can harness the lack of interpretability of neural nets for privacy-preserving AI.

Brackly Murunga
7 min read · Feb 7, 2024
Synthetically generated image of a neural network as a black box, by Stability AI

Neural nets are black boxes, often criticized for the opacity with which they perform tasks: it is difficult to understand their inner workings and how they arrive at solutions. In an effort to make neural net models interpretable, explainable AI (XAI) has gained renewed attention and has sought to bring transparency and accountability to deep learning and artificial intelligence in general. However, most of the work on explaining how neural nets behave relies on mapping correlations between inputs and outputs through the target model. Consequently, explaining a model’s behavior given only its weights, without the context of its inputs and outputs, remains quite difficult.

This raises two key points, which form the basis of this article. First, model weights hold the knowledge extracted from a given dataset that is necessary for generalization; second, they hold it in a form that cannot yet be interpreted. In theory, we can leverage both properties for privacy-preserving AI, which is another area of concern in deep learning.

Privacy preservation in AI matters because neural networks often end up memorizing the data they are trained on, which becomes a particular problem when that data contains sensitive information such as personally identifiable information (PII). This memorized data can be extracted from a model using model inversion attacks. A good example is Fredrikson et al. (2015), who reconstruct images of people that a facial recognition model saw during training. Another is He et al. (2019), which extracts information from a model in the context of collaborative inference.

FYI, I implemented a variant of the model inversion attack proposed by Fredrikson et al. (2015). So as not to digress, I will write a separate article that goes into the details of how the inversion attack works.
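For intuition only, here is a minimal sketch of the gradient-based idea behind such an attack. It assumes a hypothetical trained target_model that outputs class logits for flattened 28×28 images, and it leaves out the refinements (such as denoising) that the real attack uses:

import torch

# Minimal sketch of a gradient-based model inversion attack (illustrative only).
# target_model is a hypothetical trained classifier over flattened 28x28 images;
# we optimize a candidate input so the model assigns it high confidence for target_class.
def invert_class(target_model, target_class, steps=500, lr=0.1):
    target_model.eval()
    x = torch.zeros(1, 784, requires_grad=True)   # start from a blank image
    optimizer = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        logits = target_model(x)
        # maximize confidence in the target class (minimize its negative log-probability)
        loss = -torch.log_softmax(logits, dim=1)[0, target_class]
        loss.backward()
        optimizer.step()
        x.data.clamp_(0, 1)                       # keep pixel values in a valid range
    return x.detach().reshape(28, 28)             # the reconstructed candidate image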

A major contributor to these privacy leaks is feeding raw images directly into the models. Intuitively, when a model receives a raw image at its input layer, it transforms the image down into latent representations (the activations of its later layers) and finally into a predicted class. This creates a direct relationship between the raw image input and the model weights, which increases the likelihood of successful reconstruction attacks: the whole transformation from image to output is captured in the model weights, and that transformation can implicitly be ‘reversed’ to recreate the input image from its class.

How, then, can we limit the possibility of an input image being reconstructed from a model?

One way is to obfuscate the input image using autoencoders (Bank et al., 2023; Michelucci, 2022) before feeding it to the model. Autoencoders excel at finding latent representations of data: given an image, for example, an autoencoder can learn a lower-dimensional representation that captures the important information in the image.

A simple autoencoder implementation that learns representations of the MNIST dataset is shown below:

import torch
import torch.nn as nn
import torch.optim as optim

# device, train_loader and val_loader (MNIST images flattened to 784 values)
# are assumed to be set up elsewhere; the full code is linked further below.
class Autoencoder(nn.Module):
    def __init__(self, dim, encoding_dim):
        super(Autoencoder, self).__init__()

        self.encoder = nn.Sequential(
            nn.Linear(dim, 700),
            nn.ReLU(),
            nn.Linear(700, 500),
            nn.ReLU(),
            nn.Linear(500, 300),
            nn.ReLU(),
            nn.Linear(300, encoding_dim),
            nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 300),
            nn.ReLU(),
            nn.Linear(300, 500),
            nn.ReLU(),
            nn.Linear(500, 700),
            nn.ReLU(),
            nn.Linear(700, dim),
            nn.Sigmoid()
        )

    def forward(self, x):
        encoded = self.encoder(x)        # these are the latent representations
        decoded = self.decoder(encoded)  # reconstruction of the original input
        return encoded, decoded

autoencoder = Autoencoder(dim=784, encoding_dim=32)
autoencoder = autoencoder.to(device)

criterion = nn.MSELoss()
encoder_optimizer = optim.Adam(autoencoder.parameters(), lr=0.001)

num_epochs = 100

for epoch in range(num_epochs):
    autoencoder.train()
    total_loss = 0
    val_total_loss = 0
    for batch in train_loader:
        inputs, labels = batch
        inputs = inputs.to(device)
        outputs = autoencoder(inputs)
        # outputs is an (encoded, decoded) tuple; the reconstruction loss uses the decoded images
        loss = criterion(outputs[1], inputs)
        total_loss += loss.item()
        encoder_optimizer.zero_grad()
        loss.backward()
        encoder_optimizer.step()
    total_loss = total_loss / len(train_loader)

    autoencoder.eval()
    with torch.no_grad():
        for val_batch in val_loader:
            val_inputs, val_labels = val_batch
            val_inputs = val_inputs.to(device)
            val_outputs = autoencoder(val_inputs)

            val_loss = criterion(val_outputs[1], val_inputs)
            val_total_loss += val_loss.item()
    val_total_loss = val_total_loss / len(val_loader)

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Train Loss: {total_loss:.4f} | Validation Loss: {val_total_loss:.4f}')
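As a quick sanity check (assuming the autoencoder above has been trained and val_loader is available), the 32-dimensional codes can be pulled out like this:

# Encode one validation batch and inspect the resulting latent codes.
autoencoder.eval()
with torch.no_grad():
    images, _ = next(iter(val_loader))
    codes, reconstructions = autoencoder(images.to(device))

print(codes.shape)   # e.g. torch.Size([batch_size, 32]) -- the obfuscated representation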

I’ll also write a separate article on the implementation details of autoencoders; for the curious, you can find the full code implementation on my Kaggle page here.

These latent representations (the encoder’s outputs) are what we can use to train a facial recognition model instead of the raw images. A quick glimpse of what this might look like:

class Latent_Classifier(nn.Module):
    def __init__(self, dim, output_dim):
        super(Latent_Classifier, self).__init__()
        self.Ln1 = nn.Linear(dim, 600)
        self.Ln2 = nn.Linear(600, 500)
        self.Ln3 = nn.Linear(500, 300)
        self.Ln4 = nn.Linear(300, 100)
        self.classifier_layer = nn.Linear(100, output_dim)
        self.actvfn = nn.ReLU()

    def forward(self, x):
        out = self.actvfn(self.Ln1(x))
        out = self.actvfn(self.Ln2(out))
        out = self.actvfn(self.Ln3(out))
        out = self.actvfn(self.Ln4(out))
        out = self.classifier_layer(out)   # raw logits; cross_entropy applies softmax internally
        return out

latent_model = Latent_Classifier(32, 10)
latent_model.to(device)

# The training loop
from sklearn.metrics import accuracy_score

autoencoder.eval()
latent_optimizer = optim.Adam(latent_model.parameters(), lr=0.001)

num_epochs = 100

for epoch in range(num_epochs):
    latent_model.train()
    total_loss = 0
    total_val_loss = 0
    train_preds = []
    train_labels = []
    for batch in train_loader:
        inputs, labels = batch
        inputs = inputs.to(device)
        labels = labels.to(device)
        with torch.no_grad():
            latent_inputs = autoencoder(inputs)   # (encoded, decoded); only the encoded part is used
        outputs = latent_model(latent_inputs[0])

        loss = nn.functional.cross_entropy(outputs, labels)
        total_loss += loss.item()
        train_preds.extend(torch.argmax(outputs, dim=1).cpu().numpy())
        train_labels.extend(labels.cpu().numpy())
        latent_optimizer.zero_grad()
        loss.backward()
        latent_optimizer.step()
    train_accuracy = accuracy_score(train_labels, train_preds)
    total_loss = total_loss / len(train_loader)

    latent_model.eval()
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for val_batch in val_loader:
            val_inputs, val_labels = val_batch
            val_inputs = val_inputs.to(device)
            val_labels = val_labels.to(device)
            latent_val_inputs = autoencoder(val_inputs)
            val_outputs = latent_model(latent_val_inputs[0])
            val_loss = nn.functional.cross_entropy(val_outputs, val_labels)
            total_val_loss += val_loss.item()
            all_preds.extend(torch.argmax(val_outputs, dim=1).cpu().numpy())
            all_labels.extend(val_labels.cpu().numpy())

    val_accuracy = accuracy_score(all_labels, all_preds)
    total_val_loss = total_val_loss / len(val_loader)
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}] | Training Loss: {total_loss:.4f} | Validation Loss: {total_val_loss:.4f} | '
              f'Training Accuracy: {train_accuracy:.4f} | Validation Accuracy: {val_accuracy:.4f}')

NB: I found that using these latent representations gives fairly good accuracy compared to using the raw images. On my best run, the latent model achieved 88% accuracy versus 98% for the raw-image model. The difference can be attributed to the information lost during the compression stage of the autoencoder. Anyway, I digress; back to the topic.

The latent features produced by the autoencoder inherit the neural net’s lack of interpretability: we know they store information about the images in some form, we just do not know what that information is. This is very helpful for desensitizing data that contains personally identifiable information, and a step closer to privacy-preserving AI.

Additionally, we prevented the whole transformation process from being captured by a single model’s weights: the latent representations are captured by the autoencoder, and the classification task by a separate model.

In the code example above, using latent features instead of raw input images potentially gives two big privacy benefits.

  1. We make the reconstruction task much harder. The 10-point accuracy drop (98% to 88%) from using the latent features, which I attributed to compression, is in a sense a feature: even if an attacker could reconstruct the image, the information lost during compression means they would end up with a less faithful representation than if we had used the raw images.
  2. If we can create latent representations of sensitive data, in this case images, then we can share those representations with other people to build models, leveraging the fact that it would be difficult for them to work out what the representations mean. This would be a big advancement in the field of collaborative AI.

To strengthen the privacy guarantee, we could borrow ideas from differential privacy and add noise to the latent representations, at the cost of some additional accuracy, of course.
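As a rough illustration only (not a formally calibrated DP mechanism), Gaussian noise could be added to the latent codes before they are shared or used for training; noise_scale is a hypothetical knob here:

# Sketch: perturb latent codes with Gaussian noise before sharing them.
# A proper differential privacy guarantee would also require clipping the codes
# and calibrating the noise to a chosen epsilon/delta budget.
def noisy_latents(codes, noise_scale=0.1):
    return codes + noise_scale * torch.randn_like(codes)

# e.g. private_codes = noisy_latents(codes) before training or releasing the dataset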

Points to note for a real-world deployment:

  • We would not combine the autoencoding step and the latent model as in the training code above. Instead, we would encode all the sensitive data with the encoder as a first step, then train the latent model on those codes as a second step (a rough sketch follows this list). The latent model only ever receives the latent representations and no information about the encoder.
  • After training the autoencoder, we would discard the decoder and retain only the encoder. The encoder would not be included in the latent model training phase.
  • We would not release any information about the autoencoder to the public, to prevent an adversary from working out how the sensitive data was encoded.
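A rough sketch of that two-step workflow, assuming the trained encoder from above and the same train_loader of raw, sensitive images:

# Step 1 (data owner): encode the sensitive images once; afterwards the decoder is
# discarded and the encoder is kept private. Only the latent dataset is released.
autoencoder.eval()
latent_features, latent_labels = [], []
with torch.no_grad():
    for images, labels in train_loader:               # raw data never leaves the owner
        codes, _ = autoencoder(images.to(device))
        latent_features.append(codes.cpu())
        latent_labels.append(labels)

latent_dataset = torch.utils.data.TensorDataset(
    torch.cat(latent_features), torch.cat(latent_labels)
)
torch.save(latent_dataset, 'latent_dataset.pt')       # this file is what gets shared

# Step 2 (collaborator): train the classifier on the shared latents only, with no
# access to the encoder or the original images.
shared_loader = torch.utils.data.DataLoader(latent_dataset, batch_size=64, shuffle=True)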

To conclude, we have discussed how neural nets pose a challenge of unexplainability and how we can leverage that to transform sensitive data into an obfuscated form that inherits the same characteristic. We have also seen that these latent representations (the encoder’s outputs) still yield good accuracy while preserving the privacy of the training data, helping turn the ‘black-boxness’ of neural networks into a privacy feature.

References.

Buhrmester, V., Münch, D., & Arens, M. (2021). Analysis of explainers of black box deep neural networks for computer vision: A survey. Machine Learning and Knowledge Extraction, 3(4), 966–989.

Fredrikson, M., Jha, S., & Ristenpart, T. (2015, October). Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security (pp. 1322–1333).

He, Z., Zhang, T., & Lee, R. B. (2019, December). Model inversion attacks against collaborative inference. In Proceedings of the 35th Annual Computer Security Applications Conference (pp. 148–162).

Bank, D., Koenigstein, N., & Giryes, R. (2023). Autoencoders. Machine learning for data science handbook: data mining and knowledge discovery handbook, 353–374.

Michelucci, U. (2022). An introduction to autoencoders. arXiv preprint arXiv:2201.03898.
