Back to top

Video Recordings for SaTML 2023

Opening Remarks

Nicolas Papernot


Robustness in Machine Learning: A Five-Year Retrospective

Zico Kolter

Abstract. In order to deploy deep learning models in safety critical domains, we need something more than models that work well "in expectation": we need models that are (perhaps certifiably) robust to worst-case scenarios. This argument has been a common refrain in adversarial deep learning, but is there much evidence that it is true? More than five years after the development of several practical techniques for improving adversarial robustness of large-scale models, these techniques are still not widely deployed, nor do they seem to be on the verge of deployment. Meanwhile, for better or worse, the wide-scale deployment of (non-robust) models seems to have only accelerated. In this talk, I will offer a bit of historical context about how these certified robustness methods evolved, the notable advances that have been happening over the past five years, and thoughts on why we still seem so far from deploying robust models in practice. I will discuss some of the largest challenges in the space, and possible directions for moving forward.
Eugenics and the Promise of Utopia through Artificial General Intelligence

Timnit Gebru

Abstract. The stated goal of many organizations in the field of artificial intelligence (AI) is to develop artificial general intelligence (AGI), an imagined system with more intelligence than anything we have ever seen. Without questioning whether such a system can and should be built, researchers working towards AGI have created the field of AI safety to attempt to make AGI that is “beneficial for all of humanity.” I argue that unlike systems with specific applications which can be evaluated following standard engineering principles, undefined systems like “AGI” cannot be appropriately tested for safety. With specific examples, I outline how the march towards building AGI has resulted in systems that harm marginalized groups and centralize power, while their creators claim to “benefit humanity." I conclude by urging the field to work on defined tasks and applications for which we can develop safety protocols, rather than attempting to build a presumably all knowing system such as AGI.


An Introduction to Differential Privacy

Gautam Kamath

Abstract. Differential privacy is a rigorous framework for privacy-preserving data analysis. It is widely adopted in both theory and practice, including deployments at Google, Apple, Microsoft, and the US Census Bureau. This tutorial will cover the basics of differential privacy, including the definition, basic properties, fundamental algorithms and building blocks, and applications to machine learning. Time permitting, recent trends in differential privacy research may also be highlighted.
Aligning ML Systems with Human Intent

Jacob Steinhardt

Abstract. ML systems are "aligned" if their behavior matches the intended goals of the system designer, many of which are implicit and informal. Alignment is difficult to achieve, due to misspecified reward functions (Goodhart's law), unexpected behavior that appears emergently at scale, and feedback loops arising from multi-agent interactions. For example, language models trained to predict tokens might give untruthful answers if truth and likelihood diverge, and recommender systems might optimize short-term engagement at the expense of long-term well-being. In this tutorial, I will discuss empirically observed alignment issues in large-scale systems, as well as several techniques for addressing them, based on (1) improving human feedback to reduce reward misspecification, (2) extracting latent knowledge from models' hidden states using unsupervised learning, and (3) large-scale evaluation to detect novel failures. I will also briefly touch on the challenges of emergence and multi-agent interactions and current approaches to addressing them

Session A

Explainable Global Fairness Verification of Tree-Based ClassifiersOpenReview

Stefano Calzavara (Università Ca' Foscari Venezia, Italy), Lorenzo Cazzaro (Università Ca' Foscari Venezia, Italy), Claudio Lucchese (Università Ca' Foscari Venezia, Italy), and Federico Marcuzzi (Università Ca' Foscari Venezia, Italy)

Abstract. We present a new approach to the global fairness verification of tree-based classifiers. Given a tree-based classifier and a set of sensitive features potentially leading to discrimination, our analysis synthesizes sufficient conditions for fairness, expressed as a set of traditional propositional logic formulas, which are readily understandable by human experts. The verified fairness guarantees are global, in that the formulas predicate over all the possible inputs of the classifier, rather than just a few specific test instances. Our analysis is formally proved both sound and complete. Experimental results on public datasets show that the analysis is precise, explainable to human experts and efficient enough for practical adoption.
Exploiting Fairness to Enhance Sensitive Attributes ReconstructionOpenReview

Julien Ferry (LAAS-CNRS, Université de Toulouse, CNRS, France), Ulrich Aïvodji (Ecole de Technologie Supérieure, Canada), Sébastien Gambs (Université du Québec à Montréal, Canada), Marie-José Huguet (LAAS-CNRS, Université de Toulouse, CNRS, INSA, France), and Mohamed Siala (LAAS-CNRS, Université de Toulouse, CNRS, INSA, France)

Abstract. In recent years, a growing body of work has emerged on how to learn machine learning models under fairness constraints, often expressed with respect to some sensitive attributes. In this work, we consider the setting in which an adversary has black-box access to a target model and show that information about this model's fairness can be exploited by the adversary to enhance his reconstruction of the sensitive attributes of the training data. More precisely, we propose a generic reconstruction correction method, which takes as input an initial guess made by the adversary and corrects it to comply with some user-defined constraints (such as the fairness information) while minimizing the changes in the adversary's guess. The proposed method is agnostic to the type of target model, the fairness-aware learning method as well as the auxiliary knowledge of the adversary. To assess the applicability of our approach, we have conducted a thorough experimental evaluation on two state-of-the-art fair learning methods, using four different fairness metrics with a wide range of tolerances and with three datasets of diverse sizes and sensitive attributes. The experimental results demonstrate the effectiveness of the proposed approach to improve the reconstruction of the sensitive attributes of the training set.
Wealth Dynamics Over Generations: Analysis and InterventionsOpenReview

Krishna Acharya (Georgia Institute of Technology, USA), Eshwar Ram Arunachaleswaran (University of Pennsylvania, USA), Sampath Kannan (University of Pennsylvania, USA), Aaron Roth (University of Pennsylvania, USA), and Juba Ziani (Georgia Institute of Technology, USA)

Abstract. We present a stylized model with feedback loops for the evolution of a population's wealth over generations. Individuals have both talent and wealth: talent is a random variable distributed identically for everyone, but wealth is a random variable that is dependent on the population one is born into. Individuals then apply to a downstream agent, which we treat as a university throughout the paper (but could also represent an employer) who makes a decision about whether to admit them or not. The university does not directly observe talent or wealth, but rather a signal (representing e.g. a standardized test) that is a convex combination of both. The university knows the distributions from which an individual's type and wealth are drawn, and makes its decisions based on the posterior distribution of the applicant's characteristics conditional on their population and signal. Each population's wealth distribution at the next round then depends on the fraction of that population that was admitted by the university at the previous round. We study wealth dynamics in this model, and give conditions under which the dynamics have a single attracting fixed point (which implies population wealth inequality is transitory), and conditions under which it can have multiple attracting fixed points (which implies that population wealth inequality can be persistent). In the case in which there are multiple attracting fixed points, we study interventions aimed at eliminating or mitigating inequality, including increasing the capacity of the university to admit more people, aligning the signal generated by individuals with the preferences of the university, and making direct monetary transfers to the less wealthy population.
Learning Fair Representations Through Uniformly Distributed Sensitive AttributesOpenReview

Patrik Joslin Kenfack (Innopolis University, Russia), Adín Ramírez Rivera (University of Oslo, Norway), Adil Mehmood Khan (Innopolis University, Russia; University of Hull, UK), and Manuel Mazzara (Innopolis University, Russia)

Abstract. Machine Learning (ML) models trained on biased data can reproduce and even amplify these biases. Since such models are deployed to make decisions that can affect people's lives, ensuring their fairness is critical. On approach to mitigate possible unfairness of ML models is to map the input data into a less-biased new space by means of training the model on fair representations. Several methods based on adversarial learning have been proposed to learn fair representation by fooling an adversary in predicting the sensitive attribute (e.g., gender or race). However, adversarial-based learning can be too difficult to optimize in practice; besides, it penalizes the utility of the representation. Hence, in this research effort we train bias-free representations from the input data by inducing a uniform distribution over the sensitive attributes in the latent space. In particular, we propose a probabilistic framework that learns these representations by enforcing the correct reconstruction of the original data, plus the prediction of the attributes of interest while eliminating the possibility of predicting the sensitive ones. Our method leverages the inability of Deep Neural Networks (DNNs) to generalize when trained on a noisy label space to regularize the latent space. We use a network head that predicts a noisy version of the sensitive attributes in order to increase the uncertainty of their predictions at test time. Our experiments in two datasets demonstrates that the proposed model significantly improves fairness while maintaining the prediction accuracy of downstream tasks.

Session B

Can Stochastic Gradient Langevin Dynamics Provide Differential Privacy for Deep Learning?OpenReview

Guy Heller (University of Bar-Ilan, Ramat Gan, Israel) and Ethan Fetaya (University of Bar-Ilan, Ramat Gan, Israel)

Abstract. Bayesian learning via Stochastic Gradient Langevin Dynamics (SGLD) has been suggested for differentially private learning. While previous research provides differential privacy bounds for SGLD at the initial steps of the algorithm or when close to convergence, the question of what differential privacy guarantees can be made in between remains unanswered. This interim region is of great importance, especially for Bayesian neural networks, as it is hard to guarantee convergence to the posterior. This paper shows that using SGLD might result in unbounded privacy loss for this interim region, even when sampling from the posterior is as differentially private as desired.
Kernel Normalized Convolutional Networks for Privacy-Preserving Machine LearningOpenReview

Reza Nasirigerdeh (Technical University of Munich, Germany), Javad Torkzadehmahani (Azad University of Kerman, Iran), Daniel Rueckert (Technical University of Munich, Germany; Imperial College London, United Kingdom), and Georgios Kaissis (Technical University of Munich, Germany; Helmholtz Zentrum Munich, Germany; Imperial College London, United Kingdom)

Abstract. Normalization is an important but understudied challenge in privacy-related application domains such as federated learning (FL), differential privacy (DP), and differentially private federated learning (DP-FL). While the unsuitability of batch normalization for these domains has already been shown, the impact of the other normalization methods on the performance of federated or differentially private models is not well-known. To address this, we draw a performance comparison among layer normalization (LayerNorm), group normalization (GroupNorm), and the recently proposed kernel normalization (KernelNorm) in FL, DP, and DP-FL settings. Our results indicate LayerNorm and GroupNorm provide no performance gain compared to the baseline (i.e. no normalization) for shallow models in FL and DP. They, on the other hand, considerably enhance the performance of shallow models in DP-FL and deeper models in FL and DP. KernelNorm, moreover, significantly outperforms its competitors in terms of accuracy and convergence rate (or communication efficiency) for both shallow and deeper models in all considered learning environments. Given these key observations, we propose a kernel normalized ResNet architecture called KNResNet-13 for differentially private learning environments. Using the proposed architecture, we provide new state-of-the-art accuracy values on the CIFAR-10 and Imagenette datasets, when trained from scratch.
Model Inversion Attack with Least Information and an In-depth Analysis of its Disparate VulnerabilityOpenReview

Sayanton V. Dibbo (Dartmouth College), Dae Lim Chung (Dartmouth College), and Shagufta Mehnaz (The Pennsylvania State University)

Abstract. In this paper, we study model inversion attribute inference (MIAI), a machine learning (ML) privacy attack that aims to infer sensitive information about the training data given access to the target ML model. We design a novel black-box MIAI attack that assumes the least adversary knowledge/capabilities to date while still performing similar to the state-of-the-art attacks. Further, we extensively analyze the disparate vulnerability property of our proposed MIAI attack, i.e., elevated vulnerabilities of specific groups in the training dataset (grouped by gender, race, etc.) to model inversion attacks. First, we investigate existing ML privacy defense techniques-- (1) mutual information regularization, and (2) fairness constraints, and show that none of these techniques can mitigate MIAI disparity. Second, we empirically identify possible disparity factors and discuss potential ways to mitigate disparity in MIAI attacks. Finally, we demonstrate our findings by extensively evaluating our attack in estimating binary and multi-class sensitive attributes on three different target models trained on three real datasets.
Distribution inference risks: Identifying and mitigating sources of leakageOpenReview

Valentin Hartmann (EPFL), Léo Meynent (EPFL), Maxime Peyrard (EPFL), Dimitrios Dimitriadis (Amazon), Shruti Tople (Microsoft Research), and Robert West (EPFL)

Abstract. A large body of work shows that machine learning (ML) models can leak sensitive or confidential information about their training data. Recently, leakage due to distribution inference (or property inference) attacks is gaining attention. In this attack, the goal of an adversary is to infer distributional information about the training data. So far, research on distribution inference has focused on demonstrating successful attacks, with little attention given to identifying the potential causes of the leakage and to proposing mitigations. To bridge this gap, as our main contribution, we theoretically and empirically analyze the sources of information leakage that allows an adversary to perpetrate distribution inference attacks. We identify three sources of leakage: (1) memorizing specific information about the E[Y|X] (expected label given the feature values) of interest to the adversary, (2) wrong inductive bias of the model, and (3) finiteness of the training data. Next, based on our analysis, we propose principled mitigation techniques against distribution inference attacks. Specifically, we demonstrate that causal learning techniques are more resilient to a particular type of distribution inference risk termed distributional membership inference than associative learning methods. And lastly, we present a formalization of distribution inference that allows for reasoning about more general adversaries than was previously possible.
Dissecting Distribution InferenceOpenReview

Anshuman Suri (University of Virginia), Yifu Lu (University of Michigan), Yanjin Chen (University of Virginia), and David Evans (University of Virginia)

Abstract. A distribution inference attack aims to infer statistical properties of data used to train machine learning models. These attacks are sometimes surprisingly potent, but the factors that impact distribution inference risk are not well understood and demonstrated attacks often rely on strong and unrealistic assumptions such as full knowledge of training environments even in supposedly black-box threat scenarios. To improve understanding of distribution inference risks, we develop a new black-box attack that even outperforms the best known white-box attack in most settings. Using this new attack, we evaluate distribution inference risk while relaxing a variety of assumptions about the adversary's knowledge under black-box access, like known model architectures and label-only access. Finally, we evaluate the effectiveness of previously proposed defenses and introduce new defenses. We find that although noise-based defenses appear to be ineffective, a simple re-sampling defense can be highly effective.

Session C

ExPLoit: Extracting Private Labels in Split LearningOpenReview

Sanjay Kariyappa (Georgia Institute of Technology) and Moinuddin K Qureshi (Georgia Institute of Technology)

Abstract. Split learning is a popular technique used to perform vertical federated learning, where the goal is to jointly train a model on the private input and label data held by two parties. To preserve privacy of the input and label data, this technique uses a split model trained end-to-end, by exchanging the intermediate representations (IR) of the inputs and gradients of the IR between the two parties. We propose ExPLoit – a label-leakage attack that allows an adversarial input-owner to extract the private labels of the label-owner during split-learning. ExPLoit frames the attack as a supervised learning problem by using a novel loss function that combines gradient-matching and several regularization terms developed using key properties of the dataset and models. Our evaluations on a binary conversion prediction task and several multi-class image classification tasks show that ExPLoit can uncover the private labels with near- perfect accuracy of up to 99.53%, demonstrating that split learning provides negligible privacy benefits to the label owner. Furthermore, we evaluate the use of gradient noise to defend against ExPLoit. While this technique is effective for simpler datasets, it significantly degrades utility for datasets with higher input dimensionality. Our findings underscore the need for better privacy-preserving training techniques for vertically split data.
SafeNet: The Unreasonable Effectiveness of Ensembles in Private Collaborative LearningOpenReview

Harsh Chaudhari (Northeastern University), Matthew Jagielski (Google Research), and Alina Oprea (Northeastern University)

Abstract. Secure multiparty computation (MPC) has been proposed to allow multiple mutually distrustful data owners to jointly train machine learning (ML) models on their combined data. However, by design, MPC protocols faithfully compute the training functionality, which the adversarial ML community has shown to leak private information and can be tampered with in poisoning attacks. In this work, we argue that model ensembles, implemented in our framework called SafeNet, are a highly MPC-amenable way to avoid many adversarial ML attacks. The natural partitioning of data amongst owners in MPC training allows this approach to be highly scalable at training time, provide provable protection from poisoning attacks, and provably defends against a number of privacy attacks. We demonstrate SafeNet's efficiency, accuracy, and resilience to poisoning on several machine learning datasets and models trained in end-to-end and transfer learning scenarios. For instance, SafeNet reduces backdoor attack success significantly, while achieving 39x faster training and 36x less communication than the four-party MPC framework of Dalskov et al. Our experiments show that ensembling retains these benefits even in many non-iid settings. The simplicity, cheap setup, and robustness properties of ensembling make it a strong first choice for training ML models privately in MPC.
Reprogrammable-FL: Improving Utility-Privacy Tradeoff in Federated Learning via Model ReprogrammingOpenReview

Huzaifa Arif (Rensselaer Polytechnic Institute, USA), Alex Gittens (Rensselaer Polytechnic Institute, USA), and Pin-Yu Chen (IBM Research, USA)

Abstract. Model reprogramming (MR) is an emerging and powerful technique that provides cross-domain machine learning by enabling a model that is well-trained on some source task to be used for a different target task without finetuning the model weights. In this work, we propose Reprogrammable-FL, the first framework adapting MR to the setting of differentially private federated learning (FL), and demonstrate that it significantly improves the utility-privacy tradeoff compared to standard transfer learning methods (full/partial finetuning) and training from scratch in FL. Experimental results on several deep neural networks and datasets show up to over 60\% accuracy improvement given the same privacy budget. The code repository can be found at
Optimal Data Acquisition with Privacy-Aware AgentsOpenReview

Rachel Cummings (Columbia University), Hadi Elzayn (Stanford University), Emmanouil Pountourakis (Drexel University), Vasilis Gkatzelis (Drexel University), and Juba Ziani (Georgia Institute of Technology)

Abstract. We study the problem faced by a data analyst or platform that wishes to collect private data from privacy-aware agents. To incentivize participation, in exchange for this data, the platform provides a service to the agents in the form of a statistic computed using all agents' submitted data. The agents decide whether to join the platform (and truthfully reveal their data) or not participate by considering both the privacy costs of joining and the benefit they get from obtaining the statistic. The platform must ensure the statistic is computed differentially privately and chooses a central level of noise to add to the computation, but can also induce personalized privacy levels (or costs) by giving different weights to different agents in the computation as a function of their heterogeneous privacy preferences (which are known to the platform). We assume the platform aims to optimize the accuracy of the statistic, and must pick the privacy level of each agent to trade-off between i) incentivizing more participation and ii) adding less noise to the estimate. We provide a semi-closed form characterization of the optimal choice of agent weights for the platform in two variants of our model. In both of these models, we identify a common nontrivial structure in the platform's optimal solution: an instance-specific number of agents with the least stringent privacy requirements are pooled together and given the same weight, while the weights of the remaining agents decrease as a function of the strength of their privacy requirement. We also provide algorithmic results on how to find the optimal value of the noise parameter used by the platform and of the weights given to the agents.

Session D

A Light Recipe to Train Robust Vision TransformersOpenReview

Edoardo Debenedetti (ETH Zurich, Switzerland), Vikash Sehwag (Princeton University, USA), and Prateek Mittal (Princeton University, USA)

Abstract. In this paper, we ask whether Vision Transformers (ViTs) can serve as an underlying architecture for improving the adversarial robustness of machine learning models against evasion attacks. While earlier works have focused on improving Convolutional Neural Networks, we show that also ViTs are highly suitable for adversarial training to achieve competitive performance. We achieve this objective using a custom adversarial training recipe, discovered using rigorous ablation studies on a subset of the ImageNet dataset. The canonical training recipe for ViTs recommends strong data augmentation, in part to compensate for the lack of vision inductive bias of attention modules, when compared to convolutions. We show that this recipe achieves suboptimal performance when used for adversarial training. In contrast, we find that omitting all heavy data augmentation, and adding some additional bag-of-tricks ( ε-warmup and larger weight decay), significantly boosts the performance of robust ViTs. We show that our recipe generalizes to different classes of ViT architectures and large-scale models on full ImageNet-1k. Additionally, investigating the reasons for the robustness of our models, we show that it is easier to generate strong attacks during training when using our recipe and that this leads to better robustness at test time. Finally, we further study one consequence of adversarial training by proposing a way to quantify the semantic nature of adversarial perturbations and highlight its correlation with the robustness of the model. Overall, we recommend that the community should avoid translating the canonical training recipes in ViTs to robust training and rethink common training choices in the context of adversarial training.
Less is More: Dimension Reduction Finds On-Manifold Adversarial Examples in Hard-Label AttacksOpenReview

Washington Garcia (University of Florida), Pin-Yu Chen (IBM Research), Hamilton Clouse (Air Force Research Laboratory), Somesh Jha (University of Wisconsin), and Kevin Butler (University of Florida)

Abstract. Designing deep networks robust to adversarial examples remains an open problem. Likewise, recent zeroth-order hard-label attacks on image classification models have shown comparable performance to their first-order, gradient-level alternatives. It was recently shown in the gradient-level setting that regular adversarial examples leave the data manifold, while their on-manifold counterparts are in fact generalization errors. In this paper, we argue that query efficiency in the zeroth-order setting is connected to an adversary's traversal through the data manifold. To explain this behavior, we propose an information-theoretic argument based on a noisy manifold distance oracle, which leaks manifold information through the adversary's gradient estimate. Through numerical experiments of manifold-gradient mutual information, we show this behavior acts as a function of the effective problem dimensionality. On high-dimensional real-world datasets and multiple zeroth-order attacks using dimension reduction, we observe the same behavior to produce samples closer to the data manifold. This can result in up to 10x decrease in the manifold distance measure, regardless of the model robustness. Our results suggest that taking the manifold-gradient mutual information into account can thus inform better robust model design in the future, and avoid leakage of the sensitive data manifold information.
Publishing Efficient On-device Models Increases Adversarial VulnerabilityOpenReview

Sanghyun Hong (Oregon State University), Nicholas Carlini (Google Brain), and Alexey Kurakin (Google Brain)

Abstract. Recent increases in the computational demands of deep neural networks (DNNs) have sparked interest in efficient deep learning mechanisms, e.g., quantization or pruning. These mechanisms enable the construction of a small, efficient version of commercial-scale models with comparable accuracy, accelerating their deployment to resource-constrained devices. In this paper, we study the security considerations of publishing on-device variants of large-scale models. We first show that an adversary can exploit on-device models to make attacking the large model easier. In evaluations across 19 DNNs, by exploiting the published on-device models as a prior, the adversarial vulnerability of the original commercial-scale models increases by up to 100x. We then show that the vulnerability increases as the similarity between a full-scale and its efficient model increase. Based on the insights, we propose a defense, similarity-unpairing, that fine-tunes smaller models with the objective of reducing the similarities. We evaluated our defense on all the 19 DNNs and found that it reduces the transferability up to 90% and the number of queries required by a factor of 10-100x. Our results suggest that further research is needed on the security (or even privacy) threats caused by publishing those efficient siblings.
EDoG: Adversarial Edge Detection For Graph Neural Networks (virtual)OpenReview

Xiaojun Xu (University of Illinois at Urbana-Champaign), Hanzhang Wang (eBay), Alok Lal (eBay), Carl Gunter (University of Illinois at Urbana-Champaign), and Bo Li (University of Illinois at Urbana-Champaign)

Abstract. Graph Neural Networks (GNNs) have been widely applied to different tasks such as bioinformatics, drug design, and social networks. However, recent studies have shown that GNNs are vulnerable to adversarial attacks which aim to mislead the node (or subgraph) classification prediction by adding subtle perturbations. In particular, several attacks against GNNs have been proposed by adding/deleting a small amount of edges, which have caused serious security concerns. Detecting these attacks is challenging due to the small magnitude of perturbation and the discrete nature of graph data. In this paper, we propose a general adversarial edge detection pipeline EDoG without requiring knowledge of the attack strategies based on graph generation. Specifically, we propose a novel graph generation approach combined with link prediction to detect suspicious adversarial edges. To effectively train the graph generative model, we sample several sub-graphs from the given graph data. We show that since the number of adversarial edges is usually low in practice, with low probability the sampled sub-graphs will contain adversarial edges based on the union bound. In addition, considering the strong attacks which perturb a large number of edges, we propose a set of novel features to perform outlier detection as the preprocessing for our detection. Extensive experimental results on three real-world graph datasets including a private transaction rule dataset from a major company and two types of synthetic graphs with controlled properties (e.g., Erdos-Renyi and scalefree graphs) show that EDoG can achieve above 0.8 AUC against four state-of-the-art unseen attack strategies without requiring any knowledge about the attack type (e.g., degree of the target victim node); and around 0.85 with knowledge of the attack type. EDoG significantly outperforms traditional malicious edge detection baselines. We also show that an adaptive attack with full knowledge of our detection pipeline is difficult to bypass it. Our results shed light on several principles to improve the robustness of GNNs.
Counterfactual Sentence Generation with Plug-and-Play PerturbationOpenReview

Nishtha Madaan (IBM Research India; Indian Institute of Technology), Diptikalyan Saha (IBM Research India), and Srikanta Bedathur (Indian Institute of Technology)

Abstract. Generating counterfactual test-cases is an important backbone for testing NLP models and making them as robust and reliable as traditional software. In generating the test-cases, a desired property is the ability to control the test-case generation in a flexible manner to test for a large variety of failure cases and to explain and repair them in a targeted manner. In this direction, significant progress has been made in the prior works by manually writing rules for generating controlled counterfactuals. However, this approach requires heavy manual supervision and lacks the flexibility to easily introduce new controls. Motivated by the impressive flexibility of the plug-and-play approach of PPLM, we propose bringing the framework of plug-and-play to counterfactual test case generation task. We introduce CASPer, a plug-and-play counterfactual generation framework to generate test cases that satisfy goal attributes on demand. Our plug-and-play model can steer the test case generation process given any attribute model without requiring attribute-specific training of the model. In experiments, we show that CASPer effectively generates counterfactual text that follow the steering provided by an attribute model while also being fluent, diverse and preserving the original content. We also show that the generated counterfactuals from CASPer can be used for augmenting the training data and thereby fixing and making the test model more robust.
Rethinking the Entropy of Instance in Adversarial TrainingOpenReview

Minseon Kim (KAIST, South Korea), Jihoon Tack (KAIST, South Korea), Jinwoo Shin (KAIST, South Korea), and Sung Ju Hwang (KAIST, South Korea; AITRICS, South Korea)

Abstract. Adversarial training, which minimizes the loss of adversarially-perturbed training examples, has been extensively studied as a solution to improve the robustness of deep neural networks. However, most adversarial training methods treat all training examples equally, while each example may have a different impact on the model's robustness during the course of adversarial training. A couple of recent works have exploited such unequal importance of adversarial samples to the model's robustness by proposing to assign more weights to the misclassified samples or to the samples that violate the margin more severely, which have been shown to obtain high robustness against untargeted PGD attacks. However, we empirically find that they make the feature spaces of adversarial samples across different classes overlap and thus yield more high-entropy samples whose labels could be easily flipped. This makes them more vulnerable to adversarial perturbations, and their seemingly good robustness against PGD attacks is actually achieved by a false sense of robustness. To address such limitations, we propose simple yet effective re-weighting scheme that weighs the loss for each adversarial training example proportionally to the entropy of its predicted distribution to focus on examples whose labels are more uncertain.
Towards Transferable Unrestricted Adversarial Examples with Minimum ChangesOpenReview

Fangcheng Liu (Peking University), Chao Zhang (Peking University), and Hongyang Zhang (University of Waterloo

Abstract. Transfer-based adversarial example is one of the most important classes of black-box attacks. However, there is a trade-off between transferability and imperceptibility of the adversarial perturbation. Prior work in this direction often requires a fixed but large -norm perturbation budget to reach a good transfer success rate, leading to perceptible adversarial perturbations. On the other hand, most of the current unrestricted adversarial attacks that aim to generate semantic-preserving perturbations suffer from weaker transferability to the target model. In this work, we propose a \emph{geometry-aware framework} to generate transferable adversarial examples with minimum changes. Analogous to model selection in statistical machine learning, we leverage a validation model to select the best perturbation budget for each image under both the -norm and unrestricted threat models. Extensive experiments verify the effectiveness of our framework on {balancing} imperceptibility and transferability of the crafted adversarial examples. The methodology is the foundation of our entry to the adversarial competition of a 2021 conference, in which we ranked 1st place out of 1,559 teams and surpassed the runner-up submissions by 4.59\% and 23.91\% in terms of final score and average image quality level, respectively.
Position: “Real Attackers Don’t Compute Gradients”: Bridging the Gap Between Adversarial ML Research and PracticeOpenReview

Giovanni Apruzzese (University of Liechtenstein), Hyrum S. Anderson (Robust Intelligence), Savino Dambra (Norton Research Group), David Freeman (Meta), Fabio Pierazzi (King's College London), and Kevin A. Roundy (Norton Research Group)

Abstract. Recent years have seen a proliferation of research on adversarial machine learning. Numerous papers demonstrate powerful algorithmic attacks against a wide variety of machine learning (ML) models, and numerous other papers propose defenses that can withstand most attacks. However, abundant real-world evidence suggests that actual attackers use simple tactics to subvert ML-driven systems, and as a result security practitioners have not prioritized adversarial ML defenses. Motivated by the apparent gap between researchers and practitioners, this position paper aims to bridge these two domains. We first present three real-world case studies from which we can glean practical insights unknown or neglected in research. Next, we analyze all adversarial ML papers recently published in top security conferences and highlight positive trends and blind spots. Finally, we state positions on precise and cost-driven threat modeling, collaboration between industry and academia, and reproducible research. If adopted, our positions will increase the real-world impact of future endeavours in adversarial ML, bringing both researchers and practitioners closer to their shared goal of improving the security of ML systems.
What Are Effective Labels for Augmented Data? Improving Calibration and Robustness with AutoLabelOpenReview

Yao Qin (Google Research, USA), Xuezhi Wang (Google Research, USA), Balaji Lakshminarayanan (Google Research, USA), Ed H. Chi (Google Research, USA), and Alex Beutel (Google Research, USA)

Abstract. A wide breadth of research has devised data augmentation approaches that can improve both accuracy and generalization performance for neural networks. However, augmented data can end up being far from the clean training data and what is the appropriate label is less clear. Despite this, most existing work simply uses one-hot labels for augmented data. In this paper, we show re-using one-hot labels for highly distorted data might run the risk of adding noise and degrading accuracy and calibration. To mitigate this, we propose a generic method AutoLabel to automatically learn the confidence in the labels for augmented data, based on the transformation distance between the clean distribution and augmented distribution. AutoLabel is built on label smoothing and is guided by the calibration-performance over a hold-out validation set. We successfully apply AutoLabel to three different data augmentation techniques: the state-of-the-art RandAug, AugMix, and adversarial training. Experiments on CIFAR-10, CIFAR-100 and ImageNet show that AutoLabel significantly improves existing data augmentation techniques over models' calibration and accuracy, especially under distributional shift.

Session E

Sniper Backdoor: Single Client Targeted Backdoor Attack in Federated LearningOpenReview

Gorka Abad (Radboud University, The Netherlands; Ikerlan research centre, Spain), Servio Paguada (Radboud University, The Netherlands; Ikerlan research centre, Spain), Oguzhan Ersoy (Radboud University, The Netherlands), Stjepan Picek (Radboud University, The Netherlands), Víctor Julio Ramírez-Durán (Ikerlan research centre, Spain), and Aitor Urbieta (Ikerlan research centre, Spain)

Abstract. Federated Learning (FL) enables collaborative training of Deep Learning (DL) models where the data is retained locally. Like DL, FL has severe security weaknesses that the attackers can exploit, e.g., model inversion and backdoor attacks. Model inversion attacks reconstruct the data from the training datasets, whereas backdoors misclassify only classes containing specific properties, e.g., a pixel pattern. Backdoors are prominent in FL and aim to poison every client model, while model inversion attacks can target even a single client. This paper introduces a novel technique to allow backdoor attacks to be client-targeted, compromising a single client while the rest remain unchanged. The attack takes advantage of state-of-the-art model inversion and backdoor attacks. Precisely, we leverage a generative adversarial network to perform the model inversion. Afterward, we shadow-train the FL network, in which, using a siamese neural network, we can identify, target, and backdoor the victim's model. Our attack has been validated using the MNIST, F-MNIST, EMNIST, and CIFAR-100 datasets under different settings---achieving up to 99\% accuracy on both source (clean) and target (backdoor) classes and against state-of-the-art defenses, e.g., Neural Cleanse, opening a novel threat model to be considered in the future.
Backdoor Attacks on Time Series: A Generative ApproachOpenReview

Yujing Jiang (University of Melbourne), Xingjun Ma (Fudan University), Sarah Monazam Erfani (University of Melbourne), and James Bailey (University of Melbourne)

Abstract. Backdoor attacks have emerged as one of the major security threats to deep learning models as they can easily control the model's test-time predictions by pre-injecting a backdoor trigger into the model at training time. While backdoor attacks have been extensively studied on images, few works have investigated the threat of backdoor attacks on time series data. To fill this gap, in this paper we present a novel generative approach for time series backdoor attacks against deep learning based time series classifiers. Backdoor attacks have two main goals: high stealthiness and high attack success rate. We find that, compared to images, it can be more challenging to achieve the two goals on time series. This is because time series have fewer input dimensions and lower degrees of freedom, making it hard to achieve a high attack success rate without compromising stealthiness. Our generative approach addresses this challenge by generating trigger patterns that are as realistic as real-time series patterns while achieving a high attack success rate without causing a significant drop in clean accuracy. We also show that our proposed attack is resistant to potential backdoor defenses. Furthermore, we propose a novel universal generator that can poison any type of time series with a single generator that allows universal attacks without the need to fine-tune the generative model for new time series datasets.
VENOMAVE: Targeted Poisoning Against Speech RecognitionOpenReview

Hojjat Aghakhani (University of California, Santa Barbara), Lea Schönherr (CISPA Helmholtz Center for Information Security), Thorsten Eisenhofer (Ruhr University Bochum), Dorothea Kolossa (Technische Universität Berlin), Thorsten Holz (CISPA Helmholtz Center for Information Security), Christopher Kruegel (University of California, Santa Barbara), and Giovanni Vigna (University of California, Santa Barbara)

Abstract. Despite remarkable improvements, automatic speech recognition is susceptible to adversarial perturbations. Compared to standard machine learning architectures, these attacks against speech recognition are significantly more challenging, especially since the inputs to a speech recognition system are time series that contain both acoustic and linguistic properties of speech. Extracting all recognition-relevant information requires more complex pipelines and an ensemble of specialized components. Consequently, an attacker needs to consider the entire pipeline. In this paper, we present VENOMAVE, the first training-time poisoning attack against speech recognition. Similar to the pre-dominantly studied evasion attacks, we pursue the same goal: leading the system to an incorrect, and attacker-chosen transcription of a target audio waveform. In contrast to evasion attacks, however, we assume that the attacker can only manipulate a small part of the training data without altering the target audio waveform at run time. We evaluate our attack on two datasets: TIDIGITS and Speech Commands. When poisoning less than 0.17% of the dataset, VENOMAVE achieves attack success rates of over 80.0%, without access to the victim's network architecture or hyperparameters. In a more realistic scenario, when the target audio waveform is played over the air in different rooms, VENOMAVE maintains a success rate of up to 73.3%. Finally, VENOMAVE achieves an attack transferability rate of 36.4% between two different model architectures.

Session F

Endogenous Macrodynamics in Algorithic RecourseOpenReview

Patrick Altmeyer (Delft University of Technology, The Netherlands), Giovan Angela (Delft University of Technology, The Netherlands), Aleksander Buszydlik (Delft University of Technology, The Netherlands), Karol Dobiczek (Delft University of Technology, The Netherlands), Arie van Deursen (Delft University of Technology, The Netherlands), and Cynthia C. S. Liem (Delft University of Technology, The Netherlands)

Abstract. Existing work on Counterfactual Explanations (CE) and Algorithmic Recourse (AR) has largely been limited to the static setting and focused on single individuals: given some estimated model, the goal is to find valid counterfactuals for an individual instance that fulfill various desiderata. The ability of such counterfactuals to handle dynamics like data and model drift remains a largely unexplored research challenge at this point. There has also been surprisingly little work on the related question of how the actual implementation of recourse by one individual may affect other individuals. Through this work we aim to close that gap by systematizing and extending existing knowledge. We first show that many of the existing methodologies can be collectively described by a generalized framework. We then argue that the existing framework fails to account for a hidden external cost of recourse, that only reveals itself when studying the endogenous dynamics of recourse at the group level. Through simulation experiments involving various state-of-the-art counterfactual generators and several benchmark datasets, we generate large numbers of counterfactuals and study the resulting domain and model shifts. We find that the induced shifts are substantial enough to likely impede the applicability of Algorithmic Recourse in situations that involve large groups of individuals. Fortunately, we find various potential mitigation strategies that can be used in combination with existing approaches. Our simulation framework for studying recourse dynamics is fast and open-sourced.
ModelPred: A Framework for Predicting Trained Model from Training DataOpenReview

Yingyan Zeng (Virginia Tech, USA), Jiachen T. Wang (Princeton University, USA), Si Chen (Virginia Tech, USA), Hoang Anh Just (Virginia Tech, USA), Ran Jin (Virginia Tech, USA), and Ruoxi Jia (Virginia Tech, USA)

Abstract. In this work, we propose ModelPred, a framework that helps to understand the impact of changes in training data on a trained model. This is critical for building trust in various stages of a machine learning pipeline: from cleaning poor-quality samples and tracking important ones to be collected during data preparation, to calibrating uncertainty of model prediction, to interpreting why certain behaviors of a model emerge during deployment. Specifically, ModelPred learns a parameterized function that takes a dataset S as the input and predicts the model obtained by training on S. Our work differs from the recent work of Datamodels \citep{ilyas2022datamodels} as we aim for predicting the trained model parameters directly instead of the trained model behaviors. We demonstrate that a neural network-based set function class is capable of learning the complex relationships between the training data and model parameters. We introduce novel global and local regularization techniques to prevent overfitting and we rigorously characterize the expressive power of neural networks (NN) in approximating the end-to-end training process. Through extensive empirical investigations, we show that ModelPred enables a variety of applications that boost the interpretability and accountability of machine learning (ML), such as data valuation, data selection, memorization quantification, and model calibration.
SoK: Harnessing Prior Knowledge for Explainable Machine Learning: An OverviewOpenReview

Katharina Beckh (Fraunhofer IAIS, Germany), Sebastian Müller (University of Bonn, Germany), Matthias Jakobs (TU Dortmund University, Germany), Vanessa Toborek (University of Bonn, Germany), Hanxiao Tan (TU Dortmund University, Germany), Raphael Fischer (TU Dortmund University, Germany), Pascal Welke (University of Bonn, Germany), Sebastian Houben (Hochschule Bonn-Rhein-Sieg, Germany), and Laura von Rueden (Fraunhofer IAIS, Germany)

Abstract. The application of complex machine learning models has elicited research to make them more explainable. However, most explainability methods cannot provide insight beyond the given data, requiring additional information about the context. We argue that harnessing prior knowledge improves the accessibility of explanations. We hereby present an overview of integrating prior knowledge into machine learning systems in order to improve explainability. We introduce a categorization of current research into three main categories which integrate knowledge either into the machine learning pipeline, into the explainability method or derive knowledge from explanations. To classify the papers, we build upon the existing taxonomy of informed machine learning and extend it from the perspective of explainability. We conclude with open challenges and research directions.
SoK: Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural NetworksOpenReview

Tilman Rauker (n/a), Anson Ho (Epoch), Stephen Casper (MIT CSAIL), and Dylan Hadfield-Menell (MIT CSAIL)

Abstract. The last decade of machine learning has seen drastic increases in scale and capabilities. Deep neural networks (DNNs) are increasingly being deployed in the real world. However, they are difficult to analyze, raising concerns about using them without a rigorous understanding of how they function. Effective tools for interpreting them will be important for building more trustworthy AI by helping to identify problems, fix bugs, and improve basic understanding. In particular, "inner" interpretability techniques, which focus on explaining the internal components of DNNs, are well-suited for developing a mechanistic understanding, guiding manual modifications, and reverse engineering solutions. Much recent work has focused on DNN interpretability, and rapid progress has thus far made a thorough systematization of methods difficult. In this survey, we review over 300 works with a focus on inner interpretability tools. We introduce a taxonomy that classifies methods by what part of the network they help to explain (weights, neurons, subnetworks, or latent representations) and whether they are implemented during (intrinsic) or after (post hoc) training. To our knowledge, we are also the first to survey a number of connections between interpretability research and work in adversarial robustness, continual learning, modularity, network compression, and studying the human visual system. We discuss key challenges and argue that the status quo in interpretability research is largely unproductive. Finally, we highlight the importance of future work that emphasizes diagnostics, debugging, adversaries, and benchmarking in order to make interpretability tools more useful to engineers in practical applications.

Session G

Reducing Certified Regression to Certified Classification for General Poisoning AttacksOpenReview

Zayd Hammoudeh (University of Oregon, USA) and Daniel Lowd (University of Oregon, USA)

Abstract. Adversarial training instances can severely distort a model’s behavior. This work investigates certified regression defenses, which provide guaranteed limits on how much a regressor’s prediction may change under a training-set attack. Our key insight is that certified regression reduces to certified classification when using median as a model’s primary decision function. Coupling our reduction with existing certified classifiers, we propose six new provably-robust regressors. To the extent of our knowledge, this is the first work that certifies the robustness of individual regression predictions without any assumptions about the data distribution and model architecture. We also show that the assumptions made by existing state-of-the-art certified classifiers are often overly pessimistic. We introduce a tighter analysis of model robustness, which in many cases results in significantly improved certified guarantees. Lastly, we empirically demonstrate our approaches’ effectiveness on both regression and classification data, where the accuracy of up to 50% of test predictions can be guaranteed under 1% training-set corruption and up to 30% of predictions under 4% corruption. Our source code is available at
Neural Lower Bounds For VerificationOpenReview

Florian Jaeckle (University of Oxford, UK) and M. Pawan Kumar (University of Oxford, UK)

Abstract. Recent years have witnessed the deployment of branch-and-bound (BaB) frameworks for formal verification in deep learning. The main computational bottleneck of BaB is the estimation of lower bounds. Past work in this field has relied on traditional optimization algorithms whose inefficiencies have limited their scope. To alleviate this deficiency, we propose a novel graph neural network (GNN) based approach. Our GNN architecture closely resembles the network we wish to verify. During inference, it performs forward-backward passes through the GNN layers to compute a dual solution of the convex relaxation, thereby providing a valid lower bound. During training, its parameters are estimated via a loss function that encourages large lower bounds over a time horizon. We show that our approach provides a significant speedup for formal verification compared to state-of-the-art solvers and achieves good generalization performance on unseen networks.
Toward Certified Robustness Against Real-World Distribution ShiftsOpenReview

Haoze Wu (Stanford University, USA), Teruhiro Tagomori (Stanford University, USA; NRI Secure, Japan), Alexander Robey (University of Pennsylvania, USA), Fengjun Yang (University of Pennsylvania, USA), Nikolai Matni (University of Pennsylvania, USA), George Pappas (University of Pennsylvania, USA), Hamed Hassani (University of Pennsylvania, USA), Corina Pasareanu (Carnegie Mellon University, USA), and Clark Barrett (Stanford University, USA)

Abstract. We consider the problem of certifying the robustness of deep neural networks against real-world distribution shifts. To do so, we bridge the gap between hand-crafted specifications and realistic deployment settings by considering a neural-symbolic verification framework in which generative models are trained to learn perturbations from data and specifications are defined with respect to the output of these learned models. A pervasive challenge arising from this setting is that although S-shaped activations (e.g., sigmoid, tanh) are common in the last layer of deep generative models, existing verifiers cannot tightly approximate S-shaped activations. To address this challenge, we propose a general meta-algorithm for handling S-shaped activations which leverages classical notions of counter-example-guided abstraction refinement. The key idea is to ``lazily'' refine the abstraction of S-shaped functions to exclude spurious counter-examples found in the previous abstraction, thus guaranteeing progress in the verification process while keeping the state-space small. For networks with sigmoid activations, we show that our technique outperforms state-of-the-art verifiers on certifying robustness against both canonical adversarial perturbations and numerous real-world distribution shifts. Furthermore, experiments on the MNIST and CIFAR-10 datasets show that distribution-shift-aware algorithms have significantly higher certified robustness against distribution shifts.
CARE: Certifiably Robust Learning with Reasoning via Variational InferenceOpenReview

Jiawei Zhang (University of Illinois Urbana-Champaign, USA), Linyi Li (University of Illinois Urbana-Champaign, USA), Ce Zhang (ETH Zürich, Switzerland), and Bo Li (University of Illinois Urbana-Champaign, USA)

Abstract. Despite great recent advances achieved by deep neural networks (DNNs), they are often vulnerable to adversarial attacks. Intensive research efforts have been made to improve the robustness of DNNs; however, most empirical defenses can be adaptively attacked again, and the theoretically certified robustness is limited, especially on large-scale datasets. One potential root cause of such vulnerabilities for DNNs is that although they have demonstrated powerful expressiveness, they lack the reasoning ability to make robust and reliable predictions. In this paper, we aim to integrate domain knowledge to enable robust learning with the reasoning paradigm. In particular, we propose a certifiably robust learning with reasoning pipeline (CARE), which consists of a learning component and a reasoning component. Concretely, we use a set of standard DNNs to serve as the learning component to make semantic predictions (e.g., whether the input is furry), and we leverage the probabilistic graphical models, such as Markov logic networks (MLN), to serve as the reasoning component to enable knowledge/logic reasoning (e.g., IsPanda -> IsFurry). However, it is known that the exact inference of MLN (reasoning) is #P-complete, which limits the scalability of the pipeline. To this end, we propose to approximate the MLN inference via variational inference based on an efficient expectation maximization algorithm. In particular, we leverage graph convolutional networks (GCNs) to encode the posterior distribution during variational inference and update the parameters of GCNs (E-step) and the weights of knowledge rules in MLN (M-step) iteratively. We conduct extensive experiments on different datasets such as AwA2, Word50, GTSRB, and PDF malware, and we show that CARE achieves significantly higher certified robustness (e.g., the certified accuracy is improved from 36.0% to 61.8% under l_2 radius 2.0 on AwA2) compared with the state-of-the-art baselines. We additionally conducted different ablation studies to demonstrate the empirical robustness of CARE and the effectiveness of different knowledge integration.
FaShapley: Fast and Approximated Shapley Based Model Pruning Towards Certifiably Robust DNNsOpenReview

Mintong Kang (University of Illinois at Urbana-Champaign), Linyi Li (University of Illinois at Urbana-Champaign), and Bo Li (University of Illinois at Urbana-Champaign)

Abstract. Despite the great success achieved by deep neural networks (DNNs) recently, several concerns have been raised regarding their robustness against adversarial perturbations as well as large model size in resource-constrained environments. Recent studies on robust learning indicate that there is a tradeoff between robustness and model size. For instance, larger smoothed models would provide higher robustness certification. Recent works have tried to weaken such a tradeoff by training small models via optimized pruning. However, these methods usually do not directly take specific neuron properties such as their importance into account. In this paper, we focus on designing a quantitative criterion, neuron Shapley, to evaluate the neuron weight/filter importance within DNNs, leading to effective unstructured/structured pruning strategies to improve the certified robustness of the pruned models. However, directly computing Shapley value for neurons is of exponential computational complexity, and thus we propose a fast and approximated Shapley (FaShapley) method via gradient-based approximation and optimized sample-size. Theoretically, we analyze the desired properties (e.g, linearity and symmetry) and sample complexity of FaShapley. Empirically, we conduct extensive experiments on different datasets with both unstructured pruning and structured pruning. The results on several DNN architectures trained with different robust learning algorithms show that FaShapley achieves state-of-the-art certified robustness under different settings.

Session H

PolyKervNets: Activation-free Neural Networks For Private InterferenceOpenReview

Toluwani Aremu (Mohamed Bin Zayed Institute of Artificial Intelligence, UAE) and Karthik Nandakumar (Mohamed Bin Zayed Institute of Artificial Intelligence, UAE

Abstract. With the advent of cloud computing, machine learning as a service (MLaaS) has become a growing phenomenon with the potential to address many real-world problems. In an untrusted cloud environment, privacy concerns of users is a major impediment to the adoption of MLaaS. To alleviate these privacy issues and preserve data confidentiality, several private inference (PI) protocols have been proposed in recent years based on cryptographic tools like Fully Homomorphic Encryption (FHE) and Secure Multiparty Computation (MPC). Deep neural networks (DNN) have been the architecture of choice in most MLaaS deployments. One of the core challenges in developing PI protocols for DNN inference is the substantial costs involved in implementing non-linear activation layers such as Rectified Linear Unit (ReLU). This has spawned a search for accurate, but efficient approximations of the ReLU function and neural architectures that operate on a stringent ReLU budget. While these methods improve efficiency and ensure data confidentiality, they often come at a significant cost to prediction accuracy. In this work, we propose a DNN architecture based on polynomial kervolution called \emph{PolyKervNet} (PKN), which completely eliminates the need for non-linear activation and max pooling layers. PolyKervNets are both FHE and MPC-friendly - they enable FHE-based encrypted inference without any approximations and improve the latency on MPC-based PI protocols without any use of garbled circuits. We demonstrate that it is possible to redesign standard convolutional neural networks (CNN) architectures such as ResNet-18 and VGG-16 with polynomial kervolution and achieve approximately 30x improvement in latency of MPC-based PI with minimal loss in accuracy on many image classification tasks.
Theoretical Limits of Provable Security Against Model Extraction by Efficient Observational DefensesOpenReview

Ari Karchmer (Boston University)

Abstract. Can we hope to provide provable security against model extraction attacks? As a step towards a theoretical study of this question, we unify and abstract a wide range of "observational" model extraction defenses (OMEDs) --- roughly, those that attempt to detect model extraction by analyzing the distribution over the adversary's queries. To accompany the abstract OMED, we define the notion of complete OMEDs --- when benign clients can freely interact with the model --- and sound OMEDs --- when adversarial clients are caught and prevented from reverse engineering the model. Our formalism facilitates a simple argument for obtaining provable security against model extraction by complete and sound OMEDs, using (average-case) hardness assumptions for PAC-learning, in a way that abstracts current techniques in the prior literature. The main result of this work establishes a partial computational incompleteness theorem for the OMED: any efficient OMED for a machine learning model computable by a polynomial size decision tree that satisfies a basic form of completeness cannot satisfy soundness, unless the subexponential Learning Parity with Noise (LPN) assumption does not hold. To prove the incompleteness theorem, we introduce a class of model extraction attacks called natural Covert Learning attacks based on a connection to the Covert Learning model of Canetti and Karchmer (TCC '21), and show that such attacks circumvent any defense within our abstract mechanism in a black-box, nonadaptive way. As a further technical contribution, we extend the Covert Learning algorithm of Canetti and Karchmer to work over any "concise" product distribution, by showing that the technique of learning with a distributional inverter of Binnendyk et al. (ALT '22) remains viable in the Covert Learning setting. Finally, we further expose the tension between Covert Learning and OMEDs by proving that the existence of Covert Learning algorithms requires the nonexistence of provable security via efficient OMEDs. Therefore, we observe a ``win-win" result, by obtaining a characterization of the existence of provable security via efficient OMEDs by the nonexistence of natural Covert Learning algorithms.
No Matter How You Slice It: Machine Unlearning with SISA Comes at the Expense of Minority ClassesOpenReview

Korbinian Koch (Universität Hamburg, Germany) and Marcus Soll (NORDAKADEMIE gAG Hochschule der Wirtschaft, Germany)

Abstract. Machine unlearning using the SISA technique promises a significant speedup in model retraining with only minor sacrifices in performance. Even greater speedups can be achieved in a distribution-aware setting, where training samples are sorted by their individual unlearning likelihood. Yet, the side effects of these techniques on model performance are still poorly understood. In this paper, we lay out the impact of SISA unlearning in settings where classes are imbalanced, as well as in settings where class membership is correlated with unlearning likelihood. We show that the performance decrease that is associated with using SISA is primarily carried by minority classes and that conventional techniques for imbalanced datasets are unable to close this gap. We demonstrate that even for a class imbalance of just 1:10, simply down-sampling the dataset to a more balanced single shard outperforms SISA while providing the same unlearning speedup. We show that when minority class membership is correlated with a higher- or lower-than-average unlearning likelihood, the accuracy of those classes can be either improved or diminished in distribution-aware SISA models. This relationship makes the model sensitive to naturally occurring unlearning likelihood correlations. While SISA models tend to be sensitive to class distribution we found no impact on imbalanced subgroups or model fairness. Our work contributes to a better understanding of the side effects and trade-offs that are associated with SISA trainin
Data Redaction from Pre-trained GANsOpenReview

Zhifeng Kong (University of California San Diego, USA) and Kamalika Chaudhuri (University of California San Diego, USA)

Abstract. Large pre-trained generative models are known to occasionally output undesirable samples, which undermines their trustworthiness. The common way to mitigate this is to re-train them differently from scratch using different data or different regularization -- which uses a lot of computational resources and does not always fully address the problem. In this work, we take a different, more compute-friendly approach and investigate how to post-edit a model after training so that it ``redacts'', or refrains from outputting certain kinds of samples. We show that redaction is a fundamentally different task from data deletion, and data deletion may not always lead to redaction. We then consider Generative Adversarial Networks (GANs), and provide three different algorithms for data redaction that differ on how the samples to be redacted are described. Extensive evaluations on real-world image datasets show that our algorithms out-perform data deletion baselines, and are capable of redacting data while retaining high generation quality at a fraction of the cost of full re-training.

Session I

Position: Tensions Between the Proxies of Human Values in AIOpenReview

Teresa Datta (Arthur), Daniel Nissani (Arthur), Max Cembalest (Arthur), Akash Khanna (Arthur), Haley Massa (Arthur), and John Dickerson (Arthur)

Abstract. Motivated by mitigating potentially harmful impacts of technologies, the AI community has formulated and accepted mathematical definitions for certain pillars of accountability: e.g. privacy, fairness, and model transparency. Yet, we argue this is fundamentally misguided because these definitions are imperfect, siloed constructions of the human values they hope to proxy, while giving the guise that those values are sufficiently embedded in our technologies. Under popularized methods, tensions arise when practitioners attempt to achieve each pillar of fairness, privacy, and transparency in isolation or simultaneously. In this position paper, we push for redirection. We argue that the AI community needs to consider all the consequences of choosing certain formulations of these pillars---not just the technical incompatibilities, but also the effects within the context of deployment. We point towards sociotechnical research for frameworks for the latter, but push for broader efforts into implementing these in practice.
SoK: A Validity Perspective on Evaluating the Justified Use of Data-driven Decision-making AlgorithmsOpenReview

Amanda Coston (Carnegie Mellon University, USA), Anna Kawakami (Carnegie Mellon University, USA), Haiyi Zhu (Carnegie Mellon University, USA), Ken Holstein (Carnegie Mellon University, USA), andHoda Heidari (Carnegie Mellon University, USA)

Abstract. Recent research increasingly brings to question the appropriateness of using predictive tools in complex, real-world tasks. While a growing body of work has explored ways to improve value alignment in these tools, comparatively less work has centered concerns around the fundamental justifiability of using these tools. This work seeks to center validity considerations in deliberations around whether and how to build data-driven algorithms in high-stakes domains. Toward this end, we translate key concepts from validity theory to predictive algorithms. We apply the lens of validity to re-examine common challenges in problem formulation and data issues that jeopardize the justifiability of using predictive algorithms and connect these challenges to the social science discourse around validity. Our interdisciplinary exposition clarifies how these concepts apply to algorithmic decision making contexts. We demonstrate how these validity considerations could distill into a series of high-level questions intended to promote and document reflections on the legitimacy of the predictive task and the suitability of the data.


Model Attribution ChallengeWebsite

organized by Deepesh Chaudhari, Hyrum Anderson, Keith Manville, Lily Wong, Jonathan Broadbent, Christina Liaghati, Joao Gante, Ram Shankar, Yonadav Shavit, Elizabeth Merkhofer

Final report here
Improving training data extraction attacks on large language modelsWebsite

organized by Nicholas Carlini, Christopher Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Milad Nasr, Florian Tramer, Chiyuan Zhang.

MICO: A Membership Inference Competition Website

organized by Giovanni Cherubin, Ana-Maria Cretu, Andrew Paverd, Ahmed Salem, Santiago Zanella-Beguelin.

Results here

Closing Remarks

Nicolas Papernot