Video Recordings for SaTML 2024
Opening Remarks
Nicolas Papernot, Carmela Troncoso
Keynotes
Watermarking (The State of the Union)
Somesh Jha
Audits and Accountability in the Age of 'Artificial Intelligence'
Deborah Raji
To hold those who build AI systems accountable for the consequences of their actions, we need to operationalize a system for auditing. AI audits have for years been part of the conversation in the context of online platforms but are only now beginning to emerge as a mode of external oversight and evaluation for the deployment of a broader range of "automated decision systems" (ADS) and other AI-branded products, including the latest crop of "generative AI" models. As AI auditing makes its way into critical policy proposals as a primary mechanism for algorithmic accountability, we must think critically about the technical and institutional infrastructure required for this form of oversight to be successful.
Tutorials
Detecting the use of copyright-protected content by LLMs
Yves-Alexandre de Montjoye
(Formal) Languages Help AI Agents Learn and Reason
Sheila McIlraith
Session A
Probabilistic Dataset Reconstruction from Interpretable Models [OpenReview]
Julien Ferry (École Polytechnique de Montréal, Université de Montréal), Ulrich Aïvodji (École de technologie supérieure, Université du Québec), Sébastien Gambs (Université du Québec à Montréal), Marie-José Huguet (LAAS / CNRS), Mohamed Siala (LAAS / CNRS)
Shake to Leak: Amplifying the Generative Privacy Risk through Fine-tuning [OpenReview]
Zhangheng Li (University of Texas at Austin), Junyuan Hong (University of Texas at Austin), Bo Li (University of Illinois, Urbana Champaign), Zhangyang Wang (University of Texas at Austin)
Improved Differentially Private Regression via Gradient Boosting [OpenReview]
Shuai Tang (Amazon Web Services), Sergul Aydore (Amazon), Michael Kearns (University of Pennsylvania), Saeyoung Rho (Columbia University), Aaron Roth (Amazon), Yichen Wang (Amazon), Yu-Xiang Wang (UC Santa Barbara), Steven Wu (Carnegie Mellon University)
SoK: A Review of Differentially Private Linear Models For High Dimensional Data [OpenReview]
Amol Khanna (Booz Allen Hamilton), Edward Raff (Booz Allen Hamilton), Nathan Inkawhich (Air Force Research Laboratory)
Concentrated Differential Privacy for Bandits [OpenReview]
Achraf Azize (INRIA), Debabrota Basu (INRIA)
PILLAR: How to make semi-private learning more effective [OpenReview]
Yaxi Hu (Max Planck Institute for Intelligent Systems), Francesco Pinto (University of Oxford), Fanny Yang (Swiss Federal Institute of Technology), Amartya Sanyal (Max-Planck Institute)
Session B
Fair Federated Learning via Bounded Group Loss [OpenReview]
Shengyuan Hu (Carnegie Mellon University), Steven Wu (Carnegie Mellon University), Virginia Smith (Carnegie Mellon University)
Estimating and Implementing Conventional Fairness Metrics With Probabilistic Protected Features [OpenReview]
Hadi Elzayn (Stanford University), Emily Black (Barnard College), Patrick Vossler (Stanford University), Nathanael Jo (Massachusetts Institute of Technology), Jacob Goldin (University of Chicago), Daniel E. Ho (Stanford University)
Session C
Evaluating Superhuman Models with Consistency Checks [OpenReview]
Lukas Fluri (ETH Zurich), Daniel Paleka (Department of Computer Science, ETH Zurich), Florian Tramèr (ETH Zurich)
Certifiably Robust Reinforcement Learning through Model-Based Abstract Interpretation [OpenReview]
Chenxi Yang (University of Texas, Austin), Greg Anderson (Reed College), Swarat Chaudhuri (University of Texas at Austin)
Fast Certification of Vision-Language Models Using Incremental Randomized Smoothing [OpenReview]
Ashutosh Kumar Nirala (Iowa State University), Ameya Joshi (InstaDeep), Soumik Sarkar (Iowa State University), Chinmay Hegde (New York University)
Session D
Backdoor Attack on Un-paired Medical Image-Text Pretrained Models: A Pilot Study on MedCLIP [OpenReview]
Ruinan Jin (University of British Columbia), Chun-Yin Huang (University of British Columbia), Chenyu You (Yale University), Xiaoxiao Li (University of British Columbia)
REStore: Black-Box Defense against DNN Backdoors with Rare Event Simulation [OpenReview]
Quentin Le Roux (INRIA), Kassem Kallas (INRIA), Teddy Furon (INRIA)
EdgePruner: Poisoned Edge Pruning in Graph Contrastive Learning [OpenReview]
Hiroya Kato (KDDI Research, Inc.), Kento Hasegawa (KDDI Research, Inc.), Seira Hidano (KDDI Research, Inc.), Kazuhide Fukushima (KDDI Research, Inc.)
Indiscriminate Data Poisoning Attacks on Pre-trained Feature Extractors [OpenReview]
Yiwei Lu (University of Waterloo), Matthew Y. R. Yang (University of Waterloo), Gautam Kamath (University of Waterloo), Yaoliang Yu (University of Waterloo)
ImpNet: Imperceptible and blackbox-undetectable backdoors in compiled neural networks [OpenReview]
Eleanor Clifford (Imperial College London), Ilia Shumailov (Google DeepMind), Yiren Zhao (Imperial College London), Ross Anderson (University of Edinburgh), Robert D. Mullins (University of Cambridge)
The Devil's Advocate: Shattering the Illusion of Unexploitable Data using Diffusion Models [OpenReview]
Hadi Mohaghegh Dolatabadi (University of Melbourne), Sarah Monazam Erfani (The University of Melbourne), Christopher Leckie (The University of Melbourne)
Session E
SoK: Pitfalls in Evaluating Black-Box Attacks [OpenReview]
Fnu Suya (University of Maryland, College Park), Anshuman Suri (University of Virginia), Tingwei Zhang (Cornell University), Jingtao Hong (Columbia University), Yuan Tian (UCLA), David Evans (University of Virginia)
Evading Black-box Classifiers Without Breaking Eggs [OpenReview]
Edoardo Debenedetti (Department of Computer Science, ETH Zurich), Nicholas Carlini (Google), Florian Tramèr (ETH Zurich)
Segment (Almost) Nothing: Prompt-Agnostic Adversarial Attacks on Segmentation Models [OpenReview]
Francesco Croce (EPFL), Matthias Hein (University of Tübingen)
Session F
Improving Privacy-Preserving Vertical Federated Learning by Efficient Communication with ADMM [OpenReview]
Chulin Xie (University of Illinois, Urbana Champaign), Pin-Yu Chen (International Business Machines), Qinbin Li (University of California, Berkeley), Arash Nourian (University of California, Berkeley), Ce Zhang (University of Chicago), Bo Li (University of Illinois, Urbana Champaign)
Differentially Private Multi-Site Treatment Effect Estimation [OpenReview]
Tatsuki Koga (University of California, San Diego), Kamalika Chaudhuri (University of California, San Diego), David Page (Duke University)
ScionFL: Efficient and Robust Secure Quantized Aggregation [OpenReview]
Yaniv Ben-Itzhak (VMware), Helen Möllering (Technical University of Darmstadt), Benny Pinkas (Bar-Ilan University), Thomas Schneider (Technische Universität Darmstadt), Ajith Suresh (Technology Innovation Institute (TII)), Oleksandr Tkachenko (Technische Universität Darmstadt), Shay Vargaftik (VMware Research), Christian Weinert (Royal Holloway, University of London), Hossein Yalame (TU Darmstadt), Avishay Yanai (VMware)
Differentially Private Heavy Hitter Detection using Federated Analytics [OpenReview]
Karan Chadha (Stanford University), Hanieh Hashemi (Apple), John Duchi (Stanford University), Vitaly Feldman (Apple AI Research), Omid Javidbakht (Apple), Audra McMillan (Apple), Kunal Talwar (Apple)
OLYMPIA: A Simulation Framework for Evaluating the Concrete Scalability of Secure Aggregation Protocols [OpenReview]
Ivoline Ngong (University of Vermont), Nicholas Gibson (University of Vermont), Joseph Near (University of Vermont)
Session G
Model Reprogramming Outperforms Fine-tuning on Out-of-distribution Data in Text-Image Encoders [OpenReview]
Andrew Geng (University of Wisconsin, Madison), Pin-Yu Chen (International Business Machines)
Data Redaction from Conditional Generative Models [OpenReview]
Zhifeng Kong (NVIDIA), Kamalika Chaudhuri (University of California, San Diego)
Towards Scalable and Robust Model Versioning [OpenReview]
Wenxin Ding (University of Chicago), Arjun Nitin Bhagoji (University of Chicago), Ben Y. Zhao (University of Chicago), Haitao Zheng (University of Chicago)
Session H
SoK: AI Auditing: The Broken Bus on the Road to AI Accountability [OpenReview]
Abeba Birhane (Trinity College Dublin), Ryan Steed (Carnegie Mellon University), Victor Ojewale (Brown University), Briana Vecchione (Cornell University), Inioluwa Deborah Raji (Mozilla Foundation)
Under manipulations, are there AI models harder to audit? [OpenReview]
Augustin Godinot (Université Rennes I), Gilles Tredan (LAAS / CNRS), Erwan Le Merrer (INRIA), Camilla Penzo (PEReN - French Center of Expertise for Digital Platform Regulation), Francois Taiani (INRIA Rennes)
Session I
SoK: Unifying Corroborative and Contributive Attributions in Large Language Models [OpenReview]
Theodora Worledge (Computer Science Department, Stanford University), Judy Hanwen Shen (Stanford University), Nicole Meister (Stanford University), Caleb Winston (Computer Science Department, Stanford University), Carlos Guestrin (Stanford University)
CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [OpenReview]
Hossein Hajipour (CISPA, Saarland University, Saarland Informatics Campus), Keno Hassler (CISPA Helmholtz Center for Information Security), Thorsten Holz (CISPA Helmholtz Center for Information Security), Lea Schönherr (CISPA Helmholtz Center for Information Security), Mario Fritz (CISPA Helmholtz Center for Information Security)
Navigating the Structured What-If Spaces: Counterfactual Generation via Structured Diffusion [OpenReview]
Nishtha Madaan (Indian Institute of Technology Delhi), Srikanta J. Bedathur (Indian Institute of Technology, Delhi)
Understanding, Uncovering, and Mitigating the Causes of Inference Slowdown for Language Models [OpenReview]
Kamala Varma (University of Maryland, College Park), Arda Numanoğlu (Middle East Technical University), Yigitcan Kaya (University of California, Santa Barbara), Tudor Dumitras (University of Maryland, College Park)
Competitions
Find the Trojan: Universal Backdoor Detection in Aligned Large Language Models [Website]
organized by Javier Rando, Stephen Casper, and Florian Tramèr.
CNN Interpretability Competition [Website]
organized by Stephen Casper and Dylan Hadfield-Menell.
Large Language Models Capture-the-Flag [Website]
organized by Sahar Abdelnabi, Nicholas Carlini, Edoardo Debenedetti, Mario Fritz, Kai Greshake, Richard Hadzic, Thorsten Holz, Daphne Ippolito, Daniel Paleka, Javier Rando, Lea Schönherr, Florian Tramèr, and Yiming Zhang.
Closing Remarks
Nicolas Papernot, Carmela Troncoso