Below is the list of accepted papers at SaTML 2026, organized by category. To learn more about the three categories of papers, please visit the Call for Papers.
Research Papers
Smudged Fingerprints: A Systematic Evaluation of the Robustness of AI Image Fingerprints
Kai Yao and Marc Juarez (University of Edinburgh)
Model fingerprint detection techniques have emerged as a promising approach for attributing AI-generated images to their source models, with high detection accuracy in clean settings. Yet, their robustness under adversarial conditions remains largely unexplored. We present the first systematic security evaluation of these techniques, formalizing threat models that encompass both white- and black-box access and two attack goals: fingerprint removal, which erases identifying traces to evade attribution, and fingerprint forgery, which seeks to cause misattribution to a target model. We implement five attack strategies and evaluate 14 representative fingerprinting methods across RGB, frequency, and learned-feature domains on 8 state-of-the-art image generators. Our experiments reveal a pronounced gap between clean and adversarial performance. Removal attacks are highly effective, often achieving success rates above 80% in white-box settings and over 50% under constrained black-box access. While forgery is more challenging than removal, its success varies significantly across targeted models. We also identify a utility-robustness trade-off: methods with the highest attribution accuracy are often vulnerable to attacks, whereas more robust approaches tend to be less accurate. Notably, residual- and manifold-based fingerprints show comparatively stronger black-box resilience than others. These findings highlight the urgent need to develop model fingerprinting techniques that remain robust in adversarial settings.
Privacy Risks in Time Series Forecasting: User- and Record-Level Membership Inference
Nicolas Johansson, Tobias Olsson (Chalmers University of Technology), Daniel Nilsson, Johan Östman and Fazeleh Hoseini (AI Sweden)
Membership inference attacks (MIAs) aim to determine whether specific data were used to train a model. While extensively studied on classification models, their impact on time series forecasting remains largely unexplored. We address this gap by introducing two new attacks: (i) an adaptation of multivariate LiRA, a state-of-the-art MIA originally developed for classification models, to the time-series forecasting setting, and (ii) a novel end-to-end learning approach called Deep Time Series (DTS) attack. We benchmark these methods against adapted versions of other leading attacks from the classification setting.
We evaluate all attacks in realistic settings on the TUH-EEG and ELD datasets, targeting two strong forecasting architectures, LSTM and the state-of-the-art N-HiTS, under both record- and user-level threat models. Our results show that forecasting models are vulnerable, with user-level attacks often achieving perfect detection. The proposed methods achieve the strongest performance in several settings, establishing new baselines for privacy risk assessment in time series forecasting. Furthermore, vulnerability increases with longer prediction horizons and smaller training populations, echoing trends observed in large language models.
Certifiably Robust RAG against Retrieval Corruption
Chong Xiang (NVIDIA), Tong Wu, Zexuan Zhong (Princeton University), David Wagner (University of California, Berkeley), Danqi Chen and Prateek Mittal (Princeton University)
Retrieval-augmented generation (RAG) is susceptible to retrieval corruption attacks, where malicious passages injected into retrieval results can lead to inaccurate model responses. We propose RobustRAG, the first defense framework with certifiable robustness against retrieval corruption attacks. The key insight of RobustRAG is an isolate-then-aggregate strategy: we isolate passages into disjoint groups, generate LLM responses based on the concatenated passages from each isolated group, and then securely aggregate these responses for a robust output. To instantiate RobustRAG, we design keyword-based and decoding-based algorithms for securely aggregating unstructured text responses. Notably, RobustRAG achieves certifiable robustness: for certain queries in our evaluation datasets, we can formally certify non-trivial lower bounds on response quality---even against an adaptive attacker with full knowledge of the defense and the ability to arbitrarily inject a bounded number of malicious passages. We evaluate RobustRAG on the tasks of open-domain question-answering and free-form long text generation and demonstrate its effectiveness across three datasets and three LLMs.
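As a rough illustration of the isolate-then-aggregate strategy described above, the sketch below shows one way a keyword-based secure aggregation could be wired up; the grouping rule, voting threshold, and the `llm` callable are simplifying assumptions for illustration, not the authors' implementation.

```python
from collections import Counter

def robust_rag_keyword_sketch(query, passages, llm, group_size=1, min_votes=2):
    """Illustrative isolate-then-aggregate sketch (assumed interface, not the paper's code)."""
    # 1. Isolate: split retrieved passages into disjoint groups.
    groups = [passages[i:i + group_size] for i in range(0, len(passages), group_size)]

    # 2. Generate one independent LLM response per isolated group.
    answers = [llm(f"Context: {' '.join(g)}\nQuestion: {query}") for g in groups]

    # 3. Securely aggregate: keep only keywords that enough independent answers agree on,
    #    so a bounded number of corrupted passages can sway only a bounded number of votes.
    votes = Counter(word.lower() for answer in answers for word in set(answer.split()))
    kept = [word for word, count in votes.items() if count >= min_votes]

    # 4. Produce the final response from the agreed-upon keywords only.
    return llm(f"Answer the question '{query}' using only these keywords: {', '.join(kept)}")
```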
FedSpy-LLM: Towards Scalable and Generalizable Data Reconstruction Attacks from Gradients on LLMs
Syed Irfan Ali Meerza (University of Tennessee Knoxville), Feiyi Wang (Oak Ridge National Laboratory) and Jian Liu (University of Georgia)
Given the growing reliance on private data in training Large Language Models (LLMs), Federated Learning (FL) combined with Parameter-Efficient Fine-Tuning (PEFT) has garnered significant attention for enhancing privacy and efficiency.
Despite FL's privacy benefits, prior studies have shown that private data can still be extracted from shared gradients. However, these studies, which focus mainly on full-parameter model training, are limited to reconstructing small batches and short input sequences, and to specific model architectures, such as encoder-based or decoder-based models.
Reconstruction quality degrades even further when dealing with gradients from PEFT methods.
To fully understand the practical attack surface of federated LLMs, this paper proposes FedSpy-LLM, a scalable and generalizable data reconstruction attack designed to reconstruct training data with larger batch sizes and longer sequences while generalizing across diverse model architectures, even when PEFT methods are deployed for training.
At the core of FedSpy-LLM is a novel gradient decomposition strategy that exploits the rank deficiency and subspace structure of gradients, enabling efficient token extraction while preserving key signal components at scale. This approach further mitigates the reconstruction challenges introduced by PEFT's substantial null space, ensuring robustness across encoder-based, decoder-based, and encoder-decoder model architectures. Additionally, by iteratively aligning each token’s partial-sequence gradient with the full-sequence gradient, FedSpy-LLM ensures accurate token ordering in reconstructed sequences.
Extensive evaluations demonstrate that FedSpy-LLM consistently outperforms prior attacks and maintains strong reconstruction quality under realistic and challenging settings, revealing a broader and more severe privacy risk landscape in federated LLMs. These findings underscore the urgent need for more robust privacy-preserving techniques in future FL systems.
Training Set Reconstruction from Differentially Private Forests: How Effective is DP?
Alice Gorgé (École Polytechnique, Palaiseau), Julien Ferry (Polytechnique Montréal), Sébastien Gambs (Université du Québec à Montréal) and Thibaut Vidal (Polytechnique Montréal)
Recent research has shown that structured machine learning models such as tree ensembles are vulnerable to privacy attacks targeting their training data. To mitigate these risks, differential privacy (DP) has become a widely adopted countermeasure, as it offers rigorous privacy protection.
In this paper, we introduce a reconstruction attack targeting state-of-the-art $\epsilon$-DP random forests. By leveraging a constraint programming model that incorporates knowledge of the forest's structure and DP mechanism characteristics, our approach formally reconstructs the most likely dataset that could have produced a given forest. Through extensive computational experiments, we examine the interplay between model utility, privacy guarantees and reconstruction accuracy across various configurations.
Our results reveal that random forests trained with meaningful DP guarantees can still leak portions of their training data. Specifically, while DP reduces the success of reconstruction attacks, the only forests fully robust to our attack exhibit predictive performance no better than a constant classifier. Building on these insights, we also provide practical recommendations for the construction of DP random forests that are more resilient to reconstruction attacks while maintaining a non-trivial predictive performance.
Evaluating Deep Unlearning in Large Language Models
Ruihan Wu (University of California, San Diego), Chhavi Yadav, Ruslan Salakhutdinov (CMU) and Kamalika Chaudhuri (University of California, San Diego)
Machine unlearning has emerged as an important component in developing safe and trustworthy models. Prior work on fact unlearning in LLMs has mostly focused on removing a specified target fact robustly, but often overlooks its deductive connections to other knowledge. We propose a new setting for fact unlearning, deep unlearning, where the goal is not only to remove a target fact but also to prevent it from being deduced via retained knowledge in the LLM and logical reasoning. We propose three novel metrics: Success-DU and Recall to measure unlearning efficacy, and Accuracy to measure the utility of the remaining model. To benchmark this setting, we leverage both (1) an existing real-world knowledge dataset, MQuAKE, that provides one-step deduction instances, and (2) Eval-DU, a newly constructed semi-synthetic dataset that allows multiple steps of realistic deductions among synthetic facts. Experiments reveal that current methods struggle with deep unlearning: they either fail to deeply unlearn, or excessively remove unrelated facts. Our results suggest that targeted algorithms may have to be developed for robust/deep fact unlearning in LLMs.
Efficient and Scalable Implementation of Differentially Private Deep Learning without Shortcuts
Sebastian Rodriguez Beltran, Marlon Tobaben, Joonas Jälkö (University of Helsinki), Niki Loppi (NVIDIA) and Antti Honkela (University of Helsinki)
Differentially private stochastic gradient descent (DP-SGD) is the standard algorithm for training machine learning models under differential privacy (DP). The most common DP-SGD privacy accountants rely on Poisson subsampling to ensure the theoretical DP guarantees. Implementing computationally efficient DP-SGD with Poisson subsampling is not trivial, which leads many implementations to take a shortcut by using computationally faster subsampling. We quantify the computational cost of training deep learning models under DP by implementing and benchmarking efficient methods with the correct Poisson subsampling. We find that the naive implementation of DP-SGD with Opacus in PyTorch has 2.6 to 8 times lower throughput than SGD. However, efficient gradient clipping implementations like Ghost Clipping can roughly halve this cost. We propose an alternative computationally efficient implementation of DP-SGD with JAX that uses Poisson subsampling and performs comparably to efficient clipping optimizations based on PyTorch. We study the scaling behavior using up to 80 GPUs and find that DP-SGD scales better than SGD.
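For readers unfamiliar with the distinction the abstract draws, the snippet below is a minimal NumPy sketch of Poisson subsampling, the sampling scheme the standard accountants assume (each example is included independently, so the batch size is random rather than fixed); the function and variable names are ours, not the paper's.

```python
import numpy as np

def poisson_subsample(num_examples, sample_rate, rng):
    """Include each example independently with probability `sample_rate`."""
    mask = rng.random(num_examples) < sample_rate
    return np.flatnonzero(mask)

rng = np.random.default_rng(0)
for step in range(3):
    batch = poisson_subsample(num_examples=60_000, sample_rate=1e-3, rng=rng)
    # The batch size fluctuates around 60; shortcut implementations instead
    # draw fixed-size batches, which the standard accountants do not cover.
    print(f"step {step}: batch size = {len(batch)}")
```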
Defeating Prompt Injections by Design
Edoardo Debenedetti (ETH Zurich), Ilia Shumailov (Google DeepMind), Tianqi Fan (Google), Jamie Hayes (Google DeepMind), Nicholas Carlini (Anthropic), Daniel Fabian, Christoph Kern (Google), Chongyang Shi, Andreas Terzis (Google DeepMind) and Florian Tramèr (ETH Zurich)
Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment. However, LLM agents are vulnerable to prompt injection attacks when handling untrusted data. In this paper, we propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models may be susceptible to attacks. To operate, CaMeL explicitly extracts the control and data flows from the (trusted) query; therefore, the untrusted data retrieved by the LLM can never impact the program flow. To further improve security, CaMeL relies on a notion of capability to prevent the exfiltration of private data over unauthorized data flows. We demonstrate the effectiveness of CaMeL by solving 67% of tasks with provable security in AgentDojo [NeurIPS 2024], a recent agentic security benchmark.
ConCap: Practical Network Traffic Generation for (ML- and) Flow-based Intrusion Detection Systems
Miel Verkerken (Ghent University - imec), Laurens D'hooge (Ghent University - imec, Department of Information Technology, IDLab), Bruno Volckaert (IDLab-imec, Ghent University), Filip De Turck (Ghent University - imec) and Giovanni Apruzzese (University of Liechtenstein)
Network Intrusion Detection Systems (NIDS) have been studied in research for almost four decades. Yet, despite thousands of papers claiming scientific advances, a non-negligible number of recent works suggest that the findings of prior literature may be questionable. At the root of such a disagreement is the well-known challenge of obtaining data representative of a real-world network---and, hence, usable for security assessments.
We tackle such a challenge in this paper. We propose $ConCap$, a practical tool meant to facilitate experimental research on NIDS. Through $ConCap$, a researcher can set up an isolated and lightweight network environment and configure it to produce network-related data, such as packets or NetFlows, that are automatically labeled---hence ready for fine-grained experiments. $ConCap$ is built on open-source software and is designed to foster experimental reproducibility across the scientific community by sharing just one configuration file. Through comprehensive experiments on 10 different network activities, further expanded via in-depth analyses of 21 variants of two specific activities and of 100 repetitions of four other ones, we empirically verify that $ConCap$ produces network data resembling that of a real-world network. We also carry out experiments on well-known benchmark datasets as well as on a real ``smart-home'' network, showing that, from a cyber-detection viewpoint, $ConCap$'s automatically-labeled NetFlows are functionally equivalent to those collected in other environments. Finally, we show that $ConCap$ enables researchers to safely reproduce sophisticated attack chains (e.g., to test/enhance existing NIDS). Altogether, $ConCap$ is a solution to the ``data problem'' that is plaguing NIDS research.
Beyond the TESSERACT: Trustworthy Dataset Curation for Sound Evaluations of Android Malware Classifiers
Theo Chow (King's College London, University College London), Mario D'Onghia (University College London), Lorenz Linhardt (Technische Universität Berlin, BIFOLD), Zeliang Kan (HiddenLayer, King's College London), Daniel Arp (Technische Universität Wien), Lorenzo Cavallaro and Fabio Pierazzi (University College London)
The reliability of machine learning critically depends on dataset quality. While machine learning applied to computer vision and natural language processing benefits from high-quality benchmark datasets, cyber security often falls behind, as quality hinges on access to hard-to-obtain, realistic datasets that may evolve over time. Android is, however, positioned uniquely in this ecosystem thanks to AndroZoo and other sources, which provide large-scale, continuously updated, and timestamped repositories of benign and malicious apps.
Since their release, such data sources have provided access to populations of Android apps that researchers can sample from to evaluate learning-based methods in realistic settings, i.e., over temporal frames to account for app evolution (natural distribution shift) and with test datasets that reflect in-the-wild class ratios. Surprisingly, we observe that despite this abundance of data, performance discrepancies of learning-based Android malware classifiers still persist even after satisfying such realistic requirements, which challenges our ability to understand what the state of the art in this field is. In this work, we identify five novel factors that influence such discrepancies: we show how these factors have been largely overlooked and the impact they have on providing sound evaluations. Our findings and recommendations help define a methodology for creating trustworthy datasets towards sound evaluations of Android malware classifiers.
Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
Jean-Charles Noirot Ferrand, Yohan Beugin (University of Wisconsin-Madison), Eric Pauley (Virginia Tech), Ryan Sheatsley and Patrick McDaniel (University of Wisconsin-Madison)
Alignment in large language models (LLMs) is used to enforce guidelines such as safety. Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs. In this paper, we introduce and evaluate a new technique for jailbreak attacks. We observe that alignment embeds a safety classifier in the LLM responsible for deciding between refusal and compliance, and seek to extract an approximation of this classifier: a surrogate classifier. To this end, we build candidate classifiers from subsets of the LLM. We first evaluate the degree to which candidate classifiers approximate the LLM's safety classifier in benign and adversarial settings. Then, we attack the candidates and measure how well the resulting adversarial inputs transfer to the LLM.
Our evaluation shows that the best candidates achieve accurate agreement (an F1 score above 80%) using as little as 20% of the model architecture. Further, we find that attacks mounted on the surrogate classifiers can be transferred to the LLM with high success. For example, a surrogate using only 50% of the Llama 2 model achieved an attack success rate (ASR) of 70% with half the memory footprint and runtime---a substantial improvement over attacking the LLM directly, where we only observed a 22% ASR. These results show that extracting surrogate classifiers is an effective and efficient means for modeling (and therein addressing) the vulnerability of aligned models to jailbreaking attacks.
Optimal Robust Recourse with $L^p$-Bounded Model Change
Phone Kyaw, Kshitij Kayastha and Shahin Jabbari (Drexel University)
Recourse provides individuals who received undesirable labels (e.g., denied a loan) from algorithmic decision-making systems with a minimum-cost improvement suggestion to achieve the desired outcome. However, in practice, models often get updated to reflect changes in the data distribution or environment, invalidating the recourse recommendations (i.e., following the recourse will not lead to the desirable outcome). The robust recourse literature addresses this issue by providing a framework for computing recourses whose validity is resilient to slight changes in the model. However, since the optimization problem of computing robust recourse is non-convex (even for linear models), most of the current approaches do not have any theoretical guarantee on the optimality of the recourse. Recent work by~\citet{KayasthaGJ24} provides the first \emph{provably} optimal algorithm for robust recourse with respect to generalized linear models when the model changes are measured using the $L^{\infty}$ norm. However, using the $L^{\infty}$ norm can lead to recourse solutions with a high price. To address this shortcoming, we consider more constrained model changes defined by the $L^p$ norm, where $p\geq 1$ but $p\neq \infty$, and provide a new algorithm that provably computes the optimal robust recourse for generalized linear models. Empirically, for both linear and non-linear models, we demonstrate that our algorithm achieves a significantly lower price of recourse (up to several orders of magnitude) compared to prior work and also exhibits a better trade-off between the implementation cost of recourse and its validity. Our empirical analysis also illustrates that our approach provides more sparse recourses compared to prior work and remains resilient to post-processing approaches that guarantee feasibility.
Towards Zero Rotation and Beyond: Architecting Neural Networks for Fast Secure Inference with Homomorphic Encryption
Yifei Cai (Iowa State University), Yizhou Feng (Old Dominion University), Qiao Zhang (Shandong University), Chunsheng Xin (Iowa State University) and Hongyi Wu (University of Arizona)
Privacy-preserving deep learning addresses privacy concerns in Machine Learning as a Service (MLaaS) using Homomorphic Encryption (HE) for linear computations. Nevertheless, the high computational cost remains a challenge. While prior work has attempted to improve the efficiency, most are built upon models originally designed for plaintext inference. These models are inherently limited by architectural inefficiencies when adapted to HE settings. We argue that substantial efficiency improvements can be achieved by designing networks specifically tailored to the unique computational characteristics of HE, rather than retrofitting existing plaintext models. Our design comprises two main components: the building block and the overall architecture. The first, StriaBlock, targets the most expensive HE operation—Rotation. It integrates ExRot-Free Convolution and a novel Cross Kernel, completely eliminating the need for external Rotation and requiring only 19% of the internal Rotation operations compared to plaintext models. The second component, the architectural principle, includes the Focused Constraint Principle, which limits cost-sensitive factors while preserving flexibility in others, and the Channel Packing-Aware Scaling Principle, which dynamically adapts bottleneck ratios based on ciphertext channel capacity that varies with network depth. These strategies efficiently control the local and overall HE cost, enabling a balanced architecture for HE settings. The resulting network, StriaNet, is comprehensively evaluated. While prior works primarily focus on small-scale datasets such as CIFAR-10, we conduct an extensive evaluation of StriaNet across datasets of varying scales, including large-scale (ImageNet), medium-scale (Tiny ImageNet), and small-scale (CIFAR-10) benchmarks. At comparable accuracy levels, StriaNet achieves speedups of 9.78 times, 6.01 times, and 9.24 times on ImageNet, Tiny ImageNet, and CIFAR-10, respectively.
Evaluating Black-Box Vulnerabilities with Wasserstein-Constrained Data Perturbations
Adriana Laurindo Monteiro (FGV - EMap) and Jean-Michel Loubes (Université Paul Sabatier)
The growing use of Machine Learning (ML) tools comes with critical challenges, such as limited model explainability. We propose a global explainability framework that leverages Optimal Transport and Distributionally Robust Optimization to analyze how ML algorithms respond to constrained data perturbations. We provide a model-agnostic testing bench for both regression and classification tasks with theoretical guarantees. We establish convergence results and validate the approach on examples and real-world datasets.
Off-The-Shelf Image-to-Image Models Are All You Need To Defeat Image Protection Schemes
Xavier Pleimling, Sifat Muhammad Abdullah (Virginia Tech), Gunjan Balde (Indian Institute of Technology Kharagpur), Peng Gao (Virginia Tech), Mainack Mondal (Indian Institute of Technology Kharagpur), Murtuza Jadliwala (University of Texas at San Antonio) and Bimal Viswanath (Virginia Tech)
Advances in Generative AI (GenAI) have led to the development of various protection strategies to prevent the unauthorized use of images. These methods rely on adding imperceptible \textit{protective perturbations} to images to thwart misuse such as style mimicry or deepfake manipulations. Although previous attacks on these protections required specialized, purpose-built methods, we demonstrate that this is no longer necessary. We show that off-the-shelf image-to-image GenAI models can be repurposed as generic ``denoisers" using a simple text prompt, effectively removing a wide range of protective perturbations. Across 8 case studies spanning 6 diverse protection schemes, our general-purpose attack not only circumvents these defenses but also outperforms existing specialized attacks while preserving the image's utility for the adversary. Our findings reveal a critical and widespread vulnerability in the current landscape of image protection, indicating that many schemes provide a false sense of security. We stress the urgent need to develop robust defenses and establish that any future protection mechanism must be benchmarked against attacks from off-the-shelf GenAI models.
Membership Inference Attacks for Retrieval Based In-Context Learning for Document Question Answering
Tejas Kulkarni, Antti Koskela (Nokia Bell Labs) and Laith Zumot (Nokia)
We show that remotely hosted applications employing in-context learning, when augmented with a retrieval function to select in-context examples, can be vulnerable to membership inference attacks even when the service provider and users are separate parties.
We propose two black-box membership-inference attacks that exploit query text prefixes to distinguish member from non-member inputs. The first attack uses a reference model to estimate an otherwise unavailable loss metric. The second attack improves upon it by eliminating the reference model and instead computing a membership statistic through a simple but novel weighted-averaging scheme. Our comprehensive empirical evaluations consider a stricter case in which the adversary has a paraphrased version of the text in the queries and show that our attacks can exhibit stronger resilience to paraphrasing and outperform three prior attacks in many cases with a small number of prefixes. We also adapt an existing ensemble prompting defense to our setting, demonstrating that it substantially mitigates the privacy leakage caused by our second attack.
Differentially Private Adaptation of Diffusion Models via Noisy Aggregated Embeddings
Pura Peetathawatchai (ETH Zurich), Wei-Ning Chen (Microsoft), Berivan Isik (Google), Sanmi Koyejo (Stanford University) and Albert No (Yonsei University)
Personalizing large-scale diffusion models poses serious privacy risks, especially when adapting to small, sensitive datasets. A common approach is to fine-tune the model using differentially private stochastic gradient descent (DP-SGD), but this suffers from severe utility degradation due to the high noise needed for privacy, particularly in the small data regime. We propose an alternative that leverages Textual Inversion (TI), which learns an embedding vector for an image or set of images, to enable adaptation under differential privacy (DP) constraints. Our approach, Differentially Private Aggregation via Textual Inversion (DPAgg-TI), adds calibrated noise to the aggregation of per-image embeddings to ensure formal DP guarantees while preserving high output fidelity. We show that DPAgg-TI outperforms DP-SGD finetuning in both utility and robustness under the same privacy budget, achieving results closely matching the non-private baseline on style adaptation tasks using private artwork from a single artist and Paris 2024 Olympic pictograms. In contrast, DP-SGD fails to generate meaningful outputs in this setting.
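A minimal sketch of the kind of noisy embedding aggregation the abstract describes, assuming per-image embeddings are clipped and averaged before Gaussian noise is added; the clipping rule, noise calibration, and names below are illustrative assumptions rather than the exact DPAgg-TI procedure.

```python
import numpy as np

def noisy_embedding_aggregate(embeddings, clip_norm, noise_multiplier, rng):
    """Clip each per-image embedding, average, and add calibrated Gaussian noise
    (illustrative sketch; the exact calibration depends on the accountant and the
    neighboring-dataset definition used for the DP guarantee)."""
    clipped = [e * min(1.0, clip_norm / (np.linalg.norm(e) + 1e-12)) for e in embeddings]
    mean = np.mean(clipped, axis=0)
    sigma = noise_multiplier * clip_norm / len(embeddings)  # sensitivity of the mean ~ clip_norm / n
    return mean + rng.normal(0.0, sigma, size=mean.shape)
```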
Provably Safe Model Updates
Leo Elmecker-Plakolm (Imperial College London), Pierre Fasterling (EPFL), Philip Sosnin, Calvin Tsay and Matthew Wicker (Imperial College London)
Safety-critical environments are inherently dynamic. Distribution shifts, emerging vulnerabilities, and evolving requirements demand continuous updates to machine learning models. Yet even benign parameter updates can have unintended consequences, such as catastrophic forgetting in classical models or alignment drift in foundation models. Existing heuristic approaches (e.g., regularization, parameter isolation) can mitigate these effects but cannot certify that updated models continue to satisfy required performance specifications. We address this problem by introducing a framework for provably safe model updates. Our approach first formalizes the problem as computing the largest locally invariant domain (LID): a connected region in parameter space where all points are certified to satisfy a given specification. While exact maximal LID computation is intractable, we show that relaxing the problem to parameterized abstract domains (orthotopes, zonotopes) yields a tractable primal-dual formulation. This enables efficient certification of updates—independent of the data or algorithm used—by projecting them onto the safe domain. Our formulation further allows computation of multiple approximately optimal LIDs, incorporation of regularization-inspired biases, and use of look-ahead data buffers. Across continual learning and foundation model fine-tuning benchmarks, our method matches or exceeds heuristic baselines for avoiding forgetting while providing formal safety guarantees.
Exact Unlearning of Finetuning Data via Model Merging at Scale
Kevin Kuo, Amrith Setlur, Kartik Srinivas, Aditi Raghunathan and Virginia Smith (Carnegie Mellon University)
Approximate unlearning has gained popularity as an approach to efficiently update an LLM so that it behaves (roughly) as if it was not trained on a subset of data to begin with. However, existing methods are brittle in practice and can easily be attacked to reveal supposedly unlearned information. To alleviate issues with approximate unlearning, we instead propose SIFT-Masks (SIgn-Fixed Tuning-Masks), a method for one-shot finetuning and model merging that enables exact unlearning at scale. SIFT-Masks addresses two key limitations of standard model merging: (1) merging a large number of tasks can severely harm utility; and (2) methods that boost utility by sharing extra information across tasks make exact unlearning prohibitively expensive. SIFT-Masks solves these issues by (1) applying local masks to recover task-specific performance; and (2) constraining finetuning to align with a global sign vector as a lightweight approach to determine masks independently before merging. Across four settings where we merge up to 2,500 models, SIFT-Masks improves accuracy by 5-80% over naive merging and uses up to 250x less compute for exact unlearning compared to other merging baselines.
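A rough, assumption-laden sketch of the masked-merging idea: flat 1-D parameter vectors, a post-hoc sign projection, and a top-k mask rule are our simplifications, whereas the paper constrains finetuning itself and determines masks independently before merging.

```python
import numpy as np

def sift_masks_sketch(base, task_weights, global_sign, top_k):
    """Keep, for each task, only the largest finetuning deltas that agree with a shared
    global sign vector, then merge by summing the masked deltas (illustrative only)."""
    masked_deltas = []
    for w in task_weights:
        delta = w - base
        delta = np.where(np.sign(delta) == global_sign, delta, 0.0)  # sign-fixed constraint
        keep = np.argsort(np.abs(delta))[-top_k:]                    # local task-specific mask
        masked = np.zeros_like(delta)
        masked[keep] = delta[keep]
        masked_deltas.append(masked)
    merged = base + np.sum(masked_deltas, axis=0)
    # Exact unlearning of task i: drop masked_deltas[i] and re-sum, no retraining needed.
    return merged, masked_deltas
```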
Architectural Backdoors for Within-Batch Data Stealing and Model Inference Manipulation
Nicolas Küchler (ETH Zurich), Ivan Petrov, Conrad Grobler and Ilia Shumailov (Google DeepMind)
For nearly a decade the academic community has investigated backdoors in neural networks, primarily focusing on classification tasks where adversaries manipulate the model prediction. While demonstrably malicious, the immediate real-world impact of such prediction-altering attacks has remained unclear.
In this paper we introduce a novel and significantly more potent class of backdoors that builds upon recent advancements in architectural backdoors.
We demonstrate how these backdoors can be specifically engineered to exploit batched inference, a common technique for hardware utilization, enabling large-scale user data manipulation and theft.
By targeting the batching process, these architectural backdoors facilitate information leakage between concurrent user requests and allow attackers to fully control model responses directed at other users within the same batch.
In other words, an attacker who can change the model architecture can set and steal model inputs and outputs of other users within the same batch. We show that such attacks are not only feasible but also alarmingly effective, can be readily injected into prevalent model architectures (e.g., Transformers), and represent a truly malicious threat to user privacy and system integrity.
Critically, to counteract this new class of vulnerabilities, we propose a deterministic mitigation strategy that provides formal guarantees against this new attack vector, unlike prior work that relied on Large Language Models to find the backdoors.
Our mitigation strategy employs a novel Information Flow Control mechanism that analyzes the model graph and proves non-interference between different user inputs within the same batch.
Using our mitigation strategy we perform a large scale analysis of models hosted through Hugging Face and find over 200 models that introduce (unintended) information leakage between batch entries due to the use of dynamic quantization.
BinaryShield: Cross-Service Threat Intelligence in LLM Services using Privacy-Preserving Fingerprints
Waris Gill, Natalie Isak and Matthew Dressman (Microsoft)
The widespread deployment of LLMs across enterprise services has created a critical security blind spot. Organizations operate multiple LLM services handling billions of queries daily, yet regulatory compliance boundaries prevent these services from sharing threat intelligence about prompt injection attacks, the top security risk for LLMs. When an attack is detected in one service, the same threat may persist undetected in others for months, as privacy regulations prohibit sharing user prompts across compliance boundaries.
We present BinaryShield, the first privacy-preserving threat intelligence system that enables secure sharing of attack fingerprints across compliance boundaries. BinaryShield transforms suspicious prompts through a unique pipeline combining PII redaction, semantic embedding, binary quantization, and a randomized response mechanism to generate fingerprints designed to be non-invertible, preserving attack patterns while providing privacy. Our evaluations demonstrate that BinaryShield achieves an F1-score of 0.94, significantly outperforming SimHash (0.77), the privacy-preserving baseline, while reducing storage and achieving 38x faster similarity search compared to dense embeddings.
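As a rough sketch of the fingerprinting pipeline's last two stages (binary quantization followed by randomized response), assuming a real-valued embedding is sign-quantized and each bit flipped independently; the flip probability and function names are illustrative, not BinaryShield's actual parameters.

```python
import numpy as np

def binary_fingerprint(embedding, flip_prob, rng):
    """Sign-quantize a semantic embedding to bits, then apply randomized response
    by flipping each bit independently with probability `flip_prob` (sketch only)."""
    bits = (np.asarray(embedding) > 0).astype(np.uint8)             # binary quantization
    flips = (rng.random(bits.shape) < flip_prob).astype(np.uint8)   # randomized response
    return np.bitwise_xor(bits, flips)

def hamming_similarity(fp_a, fp_b):
    """Similarity search over binary fingerprints reduces to Hamming distance."""
    return 1.0 - np.mean(fp_a != fp_b)
```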
Oblivious Exact (Un)Learning of Extremely Randomized Trees
Sofiane Azogagh, Zelma Aubin Birba, Sébastien Gambs and Marc-Olivier Killijian (Université du Québec à Montréal)
Recent regulations such as the GDPR have given the right to be forgotten to users, which requires that they can ask for the deletion of their data. Yet, enforcing such deletions for machine learning (ML) models remains challenging, especially when servers are untrusted or may ignore requests. To address this issue, we present the first ML model to support oblivious exact unlearning, in which the deletion is computationally indistinguishable from regular training or inference. This ensures that unlearning can be enforced without revealing its occurrence to the server. Our construction is based on Extremely Randomized Trees (ERTs), which are well-suited for encrypted training and efficient unlearning. More precisely, their randomized data-independent structure enables exact sample removal without retraining. We instantiate our protocol within the TFHE framework by designing a non-interactive procedure for encrypted updates, traversals and inference. Our implementation shows that encrypted ERTs train up to 2.4x faster than prior encrypted random forests while maintaining a comparable accuracy.
DeepLeak: Privacy Enhancing Hardening of Model Explanations Against Membership Leakage
Firas Ben Hmida, Zain Sbeih, Philemon Hailemariam and Birhanu Eshete (University of Michigan, Dearborn)
Machine learning (ML) explainability is central to algorithmic transparency in high-stakes settings such as predictive diagnostics and loan approval. Yet these same domains demand rigorous privacy guarantees, creating tension between interpretability and privacy. While prior work has shown that explanation methods can leak membership information, practitioners still lack systematic guidance for selecting or deploying explanation techniques that balance transparency with privacy.
We present DeepLeak, a system to audit and mitigate privacy risks in post-hoc explanation methods. DeepLeak advances the state-of-the-art in three ways: (1) comprehensive leakage profiling: we develop a stronger explanation-aware membership inference attack (MIA) to quantify how much representative explanation methods leak membership information under default configurations; (2) lightweight hardening strategies: we introduce practical, model-agnostic mitigations, including sensitivity-calibrated noise, attribution clipping, and masking, that substantially reduce membership leakage while preserving explanation utility; and (3) root-cause analysis: through controlled experiments, we pinpoint algorithmic properties (e.g., attribution sparsity and sensitivity) that drive leakage.
Evaluating 15 explanation techniques across four families on image benchmarks, DeepLeak shows that default settings can leak up to 74.9% more membership information than previously reported. Our mitigations cut leakage by up to 95% (minimum 46.5%) with only 3.3% utility loss on average. DeepLeak offers a systematic, reproducible path to safer explainability in privacy-sensitive ML.
On the Fragility of Contribution Evaluation in Federated Learning
Balázs Pejó (EGroup), Marcell Frank (VIK - BME), Krisztian Varga, Peter Veliczky (TTK - BME) and Gergely Biczok (HUN-REN)
This paper investigates the fragility of contribution evaluation in federated learning, a critical mechanism for ensuring fairness and incentivizing participation. We argue that contribution scores are susceptible to significant distortions from two fundamental perspectives: architectural sensitivity and intentional manipulation. First, we explore how different model aggregation methods impact these scores. While most research assumes a basic averaging approach, we demonstrate that advanced techniques, including those designed to handle unreliable or diverse clients, can unintentionally yet significantly alter the final scores. Second, we examine the threat posed by poisoning attacks, where malicious participants strategically manipulate their model updates to either inflate their own contribution scores or reduce others'. Through extensive experiments across diverse datasets and model architectures, implemented within the Flower framework, we rigorously show that both the choice of aggregation method and the presence of attackers can substantially skew contribution scores, highlighting the need for more robust contribution evaluation schemes.
Stealthy Fake News and Lost Profits: Manipulating Headlines in LLM-Driven Algorithmic Trading
Advije Rizvani, Giovanni Apruzzese and Pavel Laskov (University of Liechtenstein)
We study the security of news–driven algorithmic trading systems (ATS) that integrate sentiment extracted from financial headlines by large language models (LLMs).
While adversarial risks to LLMs and perturbations against machine-learning (ML) predictors for stock prices have been examined, the impact of attacks against _financial LLMs for news ingestion_ is still unknown. Specifically, we wonder: can we quantify the economic losses stemming from an LLM misled by "fake news" that is integrated in a full-fledged algorithmic trading system?
We hypothesize a constrained but realistic adversary with no access to the targeted LLM-driven ATS, but who can alter a stock-related headline on a single day.
Within this setting, we evaluate two human-imperceptible manipulations in the finance context: _Unicode homoglyph_ substitutions that misroute headlines during stock-name recognition; and _hidden-text_ clauses that are invisible to humans but parsed by models.
We implement a realistic ATS in Backtrader that fuses an LSTM-based price forecast with LLM-derived sentiment (FinBERT, FinGPT, FinLLaMA, and six general-purpose LLMs), and we quantify the _economic impact_ using portfolio metrics.
Experiments on real-world data show that manipulating a single headline over the course of 14 months can reliably distort sentiment and reduce annual returns by up to 17.7 percentage points.
To further assess the real-world feasibility of these adversarial tactics, we analyze widely used scraping libraries and trading platforms, and survey 27 practitioners from the FinTech sector, confirming our hypotheses. We have alerted the owners of the affected trading platforms to this security issue.
“Org-Wide, We’re Not Ready”: C-Level Lessons on Securing Generative AI Systems
Elnaz Rabieinejad Balagafsheh, Ali Dehghantanha (Cyber Science Lab, Canada Cyber Foundry, University of Guelph) and Fattane Zarrinkalam (College of Engineering, University of Guelph)
Enterprises are adopting generative AI (GenAI) faster than they can secure it. We report an empirical study of 20 Canadian Chief Information Security Officers (CISO) that combined semi-structured interviews with a full-day, practitioner-led think tank. We ask (RQ1) how leaders prioritize GenAI threats, (RQ2) where organizations are prepared across the lifecycle, and (RQ3) where governance and assurance frameworks fall short. CISOs consistently rank three exposures as high-likelihood, high-impact: (1) data movement and leakage via everyday assistant use and downstream logs/backups, (2) prompt/model misuse that steers assistants, especially RAG-backed ones, outside intended retrieval scope, and (3) deepfake voice used for authority spoofing and urgent fraud. Readiness is strongest upstream (intake reviews, data classification/lineage, architectural zoning) and weakest at runtime: few teams have EDR-like telemetry for prompts, tool calls, or agent routing, so detection remains largely human-in-the-loop. Current frameworks are principle-heavy but procedure-light and insufficiently sector-tuned. We translate these observations into actionable controls: minimum “AI-EDR” telemetry, sector-ready governance runbooks, and a red-teaming program that moves from single-prompt tests to end-to-end exercises spanning data to tools/APIs. Our findings align investment and policy with the blast-radius CISOs face today, and provide a pragmatic path from static compliance to operational assurance.
Reconstructing Training Data from Models Trained with Transfer Learning
Yakir Oz, Gilad Yehudai, Gal Vardi, Itai Antebi, Michal Irani and Niv Haim (Weizmann Institute of Science)
Current methods for reconstructing training data from trained classifiers are restricted to very small models, limited training set sizes, and low-resolution images. Such restrictions hinder their applicability to real-world scenarios. In this paper, we present a novel approach enabling data reconstruction in more realistic settings. Our method adapts the reconstruction scheme of Haim et al. 2022 to real-world scenarios -- specifically, targeting models trained via transfer learning over image embeddings of large pre-trained models like DINO-ViT and CLIP. Our work employs data reconstruction in the embedding space rather than in the image space, showcasing its applicability beyond visual data. Moreover, we introduce a novel clustering-based method to identify good reconstructions from thousands of candidates. This significantly improves on previous works that relied on knowledge of the training set to identify good reconstructed images. Our findings shed light on a potential privacy risk for data leakage from models trained using transfer learning.
Private Blind Model Averaging – Distributed, Non-interactive, and Convergent
Moritz Kirschte, Sebastian Meiser (University of Lubeck), Saman Ardalan (UKSH Kiel) and Esfandiar Mohammadi (University of Lubeck)
Distributed differentially private learning techniques enable a large number of users to jointly learn a model without having to first centrally collect the training data. At the same time, neither the communication between the users nor the resulting model shall leak information about the training data. This kind of learning technique can be deployed to edge devices if it can be scaled up to a large number of users, particularly if the communication is reduced to a minimum: no interaction, i.e., each party only sends a single message. The best previously known methods are based on gradient averaging, which inherently requires many synchronization rounds.
A promising non-interactive alternative to gradient averaging relies on so-called output perturbation: each user first locally finishes training and then submits its model for secure averaging without further synchronization. We analyze this paradigm, which we coin blind model averaging (BlindAvg), in the setting of convex and smooth empirical risk minimization (ERM) like a support vector machine (SVM). While the required noise scale is asymptotically the same as in the centralized setting, it is not well understood how close BlindAvg comes to centralized learning, i.e., its utility cost.
We characterize and boost the privacy-utility tradeoff of BlindAvg with two contributions:
First, we prove that BlindAvg converges towards the centralized setting under sufficiently strong L2-regularization for a non-smooth SVM learner. Second, we introduce SoftmaxReg, a novel differentially private convex and smooth ERM learner with a better privacy-utility tradeoff than an SVM in a multi-class setting.
We evaluate our findings on three datasets (CIFAR-10, CIFAR-100, and Federated EMNIST) and provide an ablation in an artificially extreme non-IID scenario.
Your Privacy Depends on Others: Collusion Vulnerabilities in Individual Differential Privacy
Johannes Kaiser (Technical University of Munich), Alexander Ziller (Technical Universtiy of Munich), Eleni Triantafillou (Google Deepmind), Daniel Rückert (Technical University of Munich, Imperial College London) and Georgios Kaissis (Technical University of Munich, Google Deepmind)
Individual Differential Privacy (iDP) promises users control over their privacy, but this promise can be broken in practice.
We reveal a previously overlooked vulnerability in sampling-based iDP mechanisms: while conforming to the iDP guarantees, an individual's privacy risk is not solely governed by their own privacy budget, but critically depends on the privacy choices of all other data contributors.
This creates a mismatch between the promise of individual privacy control and the reality of a system where risk is collectively determined.
We demonstrate empirically and theoretically that certain distributions of privacy preferences can unintentionally inflate the privacy risk of individuals, even when their formal guarantees are met.
Moreover, this excess risk provides an exploitable attack vector.
A central adversary or a set of colluding adversaries can deliberately choose privacy budgets to amplify vulnerabilities of targeted individuals.
Most importantly, this adversarial attack operates entirely within the guarantees of DP, hiding this excess vulnerability from data contributors.
Our empirical evaluation demonstrates successful attacks against 62% of targeted individuals, substantially increasing their membership inference susceptibility.
To mitigate this, we propose the formulation of $(\varepsilon_i, \delta_i, \overline{\Delta})$-iDP, a \textit{privacy contract} that uses $\Delta$-divergences to provide users with a hard upper bound on this \textit{excess vulnerability}, while offering flexibility in mechanism design.
Our findings expose a fundamental challenge to the current paradigm, demanding a re-evaluation of how iDP systems are designed, audited, communicated, and deployed to make excess risks transparent and controllable.
Counterfactual Training: Teaching Models Plausible and Actionable Explanations
Patrick Altmeyer, Aleksander Buszydlik, Arie van Deursen and Cynthia C. S. Liem (Delft University of Technology)
We propose a novel training regime termed counterfactual training that leverages counterfactual explanations to increase the explanatory capacity of models. Counterfactual explanations have emerged as a popular post-hoc explanation method for opaque machine learning models: they inform how factual inputs would need to change in order for a model to produce some desired output. To be useful in real-world decision-making systems, counterfactuals should be plausible with respect to the underlying data and actionable with respect to the feature mutability constraints. Much existing research has therefore focused on developing post-hoc methods to generate counterfactuals that meet these desiderata. In this work, we instead hold models directly accountable for the desired end goal: counterfactual training employs counterfactuals during the training phase to minimize the divergence between learned representations and plausible, actionable explanations. We demonstrate empirically and theoretically that our proposed method facilitates training models that deliver inherently desirable counterfactual explanations and additionally exhibit improved adversarial robustness.
The Feature-Space Illusion: Exposing Practical Vulnerabilities in Blockchain GNN Fraud Detection
François Frankart, Thibault Simonetto, Maxime Cordy, Orestis Papageorgiou, Nadia Pocher and Gilbert Fridgen (University of Luxembourg)
Graph Neural Networks are becoming essential for detecting fraudulent transactions on Ethereum. Yet, their robustness against realistic adversaries remains unexplored. We identify a fundamental gap: existing adversarial machine learning assumes arbitrary feature manipulation, while blockchain adversaries face an inverse feature-mapping problem—they must synthesize costly, cryptographically-valid transactions that produce desired perturbations through deterministic feature extraction. We present the first adversarial framework tailored to blockchain's constraints. Our gradient-guided search exploits partial differentiability: leveraging GNN gradients to identify promising directions, then employing derivative-free optimization to synthesize concrete transactions. For fraud rings controlling multiple accounts, we introduce a probability-weighted objective that naturally prioritizes evasion bottlenecks. Evaluating on real Ethereum transactions reveals architectural vulnerabilities with immediate security implications. Attention mechanisms significantly fail—GATv2 suffers 78.4% attack success rate with merely 2-3 transactions costing negligible amounts relative to fraud proceeds. Remarkably, GraphSAGE exhibits both superior detection (F1=0.905) and robustness (85.2% resistance), suggesting sampling-based aggregation inherently produces more stable decision boundaries than adaptive attention. As GNN-based detection becomes critical DeFi infrastructure, our work exposes the urgent need for architectures explicitly designed for adversarial resilience under real-world constraints. We release our framework and the ETHFRAUD-30K dataset to enable rigorous security evaluation of deployed systems.
Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated
Hanna Foerster (University of Cambridge), Ilia Shumailov (Google Deepmind), Yiren Zhao (Imperial College London), Harsh Chaudhari (Northeastern University), Jamie Hayes (Google Deepmind), Robert Mullins (University of Cambridge) and Yarin Gal (University of Oxford)
Early research into data poisoning attacks against Large Language Models (LLMs) demonstrated the ease with which backdoors could be injected. More recent LLMs add step-by-step reasoning, expanding the attack surface to include the intermediate chain-of-thought (CoT) and its inherent trait of decomposing problems into subproblems. Using these vectors for more stealthy poisoning, we introduce "decomposed reasoning poison", in which the attacker modifies only the reasoning path, leaving prompts and final answers clean, and splits the trigger across multiple, individually harmless components.
Fascinatingly, while it remains possible to inject these decomposed poisons, reliably activating them to change final answers (rather than just the CoT) is surprisingly difficult. This difficulty arises because the models can often recover from backdoors that are activated within their thought processes. Ultimately, it appears that an emergent form of backdoor robustness is originating from the reasoning capabilities of these advanced LLMs, as well as from the architectural separation between reasoning and final answer.
Structured Command Hijacking against Embodied Artificial Intelligence with Text-based Controls
Luis Burbano, Diego Ortiz (University of California, Santa Cruz), Qi Sun (Johns Hopkins University), Siwei Yang, Haoqin Tu, Cihang Xie (University of California, Santa Cruz), Yinzhi Cao (Johns Hopkins University) and Alvaro A Cardenas (University of California, Santa Cruz)
Embodied Artificial Intelligence (AI) promises to handle edge cases in robotic vehicle systems where data is scarce by using common-sense reasoning grounded in perception and action to generalize beyond training distributions and adapt to novel real-world situations. These capabilities, however, also create new security risks. In this paper, we introduce CHAI (Command Hijacking against embodied AI), a new class of prompt-based attacks that exploit the multimodal language interpretation abilities of Large Visual-Language Models (LVLMs). CHAI embeds deceptive natural language instructions, such as misleading signs, in visual input, systematically searches the token space, builds a dictionary of prompts, and guides an attacker model to generate Visual Attack Prompts. We evaluate CHAI on four LVLM agents, spanning drone emergency landing, autonomous driving, and aerial object tracking, as well as on a real robotic vehicle. Our experiments show that CHAI consistently outperforms state-of-the-art attacks. By exploiting the semantic and multimodal reasoning strengths of next-generation embodied AI systems, CHAI underscores the urgent need for defenses that extend beyond traditional adversarial robustness.
Are Robust Fingerprints Adversarially Robust?
Anshul Nasery (University of Washington), Edoardo Contente (Sentient Research), Alkin Kaz, Pramod Viswanath (Princeton University) and Sewoong Oh (University of Washington)
Model fingerprinting has emerged as a promising paradigm for claiming model ownership. However, robustness evaluation of these schemes has mostly focused on benign perturbations such as incremental fine-tuning, model merging, and prompting. Lack of systematic investigations into adversarial robustness against a malicious adversary leaves current systems vulnerable.
To fill this gap, we first define a concrete, practical threat model against model fingerprinting. We then take a critical look at existing model fingerprinting schemes to identify their fundamental vulnerabilities. This leads to adaptive adversarial attacks tailored for each vulnerability, that can bypass model authentication completely for several fingerprinting schemes while maintaining high utility of the model for the rest of the users.
Our work encourages fingerprint designers to adopt adversarial robustness by design. We end with recommendations for future fingerprinting methods.
One RNG to Rule Them All - How Randomness Becomes an Attack Vector in Machine Learning
Kotekar Annapoorna Prabhu, Andrew Gan and Zahra Ghodsi (Purdue University)
Machine learning relies on randomness as a fundamental component in various steps such as data sampling, data augmentation, weight initialization, and optimization. Most machine learning frameworks use pseudorandom number generators as the source of randomness. However, variations in design choices and implementations across different frameworks, software dependencies, and hardware backends along with the lack of statistical validation can lead to previously unexplored attack vectors on machine learning systems. Such attacks on randomness sources can be extremely covert, and have a history of exploitation in real-world systems. In this work, we examine the role of randomness in the machine learning development pipeline from an adversarial point of view, and analyze the implementations of PRNGs in major machine learning frameworks. We present RNGGUARD to help machine learning engineers secure their systems with low effort. RNGGUARD statically analyzes a target library’s source code and identifies instances of random functions and modules that use them. At runtime, RNGGUARD enforces secure execution of random functions by replacing insecure function calls with RNGGUARD’s implementations that meet security specifications. Our evaluations show that RNGGUARD presents a practical approach to close existing gaps in securing randomness sources in machine learning systems.
Safe But Not Robust: Security Evaluation of VLM by Jailbreaking MSTS
Wenxin Ding (University of Chicago), Cong Chen, Jean-Philippe Monteuuis and Jonathan Petit (Qualcomm)
Vision Language Models (VLMs) have been integrated into AI assistants to improve the user interface. Users can input an image and ask for recommendations (also called Visual Question Answering). However, certain requests for recommendations may trigger an unsafe answer from the VLM. For instance, a VLM may answer ``yes'' when asked by the user ``Should I drink this?'' followed by the image of a bottle of bleach. To evaluate the safety risk of such a scenario, the Multimodal Safety Test Suite (MSTS) was created to assess the safety of VLM outputs in realistic settings.
While MSTS provides a foundation for evaluating VLM safety, it does not address the growing threat of jailbreak attacks: adversarial manipulations that induce unsafe model outputs. In this work, we introduce Robust-MSTS, an extension of MSTS that incorporates realistic jailbreak scenarios through targeted image perturbations. First, we provide a new dataset that includes jailbreak attacks tailored to MSTS requirements to mitigate the absence of adversarial scenarios. Second, we provide an extensive evaluation of the robustness of several VLMs when tested against our adversarial MSTS dataset. Our evaluation not only demonstrates attacks with a high attack success rate (98.5%) but also identifies ways to mitigate these attacks (e.g., model quantization). Lastly, we show that automated safety assessment using VLM-as-a-Judge can be further improved in the context of jailbreak attacks.
Gauss-Newton Unlearning for the LLM Era
Lev McKinney, Anvith Thudi, Juhan Bae (University of Toronto), Tara Rezaei Kheirkhah (Massachusetts Institute of Technology), Nicolas Papernot, Sheila A. McIlraith and Roger Baker Grosse (University of Toronto)
Large language models (LLMs) can learn to produce sensitive outputs which model developers may wish to reduce. These outputs can be suppressed using methods such as LLM unlearning.
However, unlearning a set of data (called the forget set) often also degrades model performance on other datasets where we want to retain the model's behavior. To improve this trade-off, we demonstrate that using the forget set to compute only a few uphill Gauss-Newton steps provides a conceptually simple, state-of-the-art unlearning algorithm for LLMs. While Gauss-Newton steps adapt Newton's method to non-linear models, it is non-trivial to efficiently and accurately compute such steps for LLMs. Hence, our approach crucially relies on parametric Hessian approximations such as Kronecker-Factored Approximate Curvature (K-FAC). We call this combined approach K-FAC for Distribution Erasure (K-FADE). Our evaluation on the WMDP and ToFU benchmarks demonstrates that K-FADE suppresses outputs from the forget set, while altering the model's outputs on the retain set less than previous methods. This is because K-FADE allows us to transform a constraint on the model's outputs across the entire retain set into a constraint on the model's weights, allowing us to ensure that each step of our algorithm satisfies this constraint. Moreover, K-FADE is more robust to subsequent changes made to the model: the unlearning update computed by K-FADE can be reapplied later on if the model is trained further.
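As a rough illustration of what an "uphill" curvature-preconditioned step on the forget set looks like, the sketch below takes a single preconditioned ascent step in PyTorch. It is not the authors' K-FADE: a crude diagonal empirical Fisher stands in for K-FAC, the retain-set constraint described in the abstract is omitted, and the helper names (forget_loss_fn, forget_batch) are placeholders.

```python
import torch

def uphill_gauss_newton_step(model, forget_loss_fn, forget_batch,
                             lr: float = 1e-3, damping: float = 1e-4) -> float:
    """One preconditioned ascent step on the forget-set loss.

    Minimal sketch: curvature is approximated by a diagonal empirical Fisher
    (squared gradients) instead of K-FAC, and no retain-set constraint is
    enforced.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    loss = forget_loss_fn(model, forget_batch)          # scalar loss on forget data
    grads = torch.autograd.grad(loss, params)

    with torch.no_grad():
        for p, g in zip(params, grads):
            fisher_diag = g.pow(2)                       # crude diagonal curvature estimate
            p.add_(lr * g / (fisher_diag + damping))     # ascend (unlearn) the forget loss
    return loss.item()
```

In practice the quality of the curvature model and the retain-set constraint are where the method's benefits come from, so this should be read only as a pointer to the structure of the update.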
A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
Rui Xin (University of Washington), Niloofar Mireshghallah (CMU), Shuyue Stella Li, Michael Duan (University of Washington), Hyunwoo Kim (NVIDIA), Yejin Choi (Stanford), Yulia Tsvetkov, Sewoong Oh and Pang Wei Koh (University of Washington)
Sanitizing sensitive text data for release often relies on methods that remove personally identifiable information (PII) or generate synthetic data. However, evaluations of these methods have focused on measuring surface-level privacy leakage (e.g., revealing explicit identifiers like names). We propose the first semantic privacy evaluation framework for sanitized textual datasets, leveraging re-identification attacks. On medical records and chatbot dialogue datasets, we demonstrate that seemingly innocuous auxiliary information, such as a mention of specific speech patterns, can be used to deduce sensitive attributes like age or substance use history. PII removal techniques make only surface-level textual manipulations: e.g., the industry-standard Azure PII removal tool fails to protect 89% of the original information. On the other hand, synthesizing data with differential privacy protects sensitive information but garbles the data, rendering it much less useful for downstream tasks. Our findings reveal that current data sanitization methods create a false sense of privacy, and underscore the urgent need for more robust methods that both protect privacy and preserve utility.
Temporal Misalignment Attacks against Multimodal Perception in Autonomous Driving
Md Hasan Shahriar, Md Mohaimin Al Barat (Virginia Tech), Harshavardhan Sundar (Amazon.com, Inc.), Ning Zhang (Washington University in St. Louis), Naren Ramakrishnan, Y. Thomas Hou and Wenjing Lou (Virginia Tech)
Multimodal fusion (MMF) plays a critical role in the perception of autonomous driving, which primarily fuses camera and LiDAR streams for a comprehensive and efficient scene understanding. However, its strict reliance on precise temporal synchronization exposes it to new vulnerabilities. In this paper, we introduce DejaVu, an attack that exploits the in-vehicular network and induces delays across sensor streams to create subtle temporal misalignments, severely degrading downstream MMF-based perception tasks. Our comprehensive attack analysis across different models and datasets reveals the sensors' task-specific imbalanced sensitivities: object detection is overly dependent on LiDAR inputs, while object tracking is highly reliant on the camera inputs. Consequently, with a single-frame LiDAR delay, an attacker can reduce the car detection mAP by up to 88.5%, while with a three-frame camera delay, multiple object tracking accuracy (MOTA) for cars drops by 73%. We further demonstrate two attack scenarios using an automotive Ethernet testbed for hardware-in-the-loop validation and the Autoware stack for end-to-end AD simulation, confirming the feasibility of the DejaVu attack and its severe impact, such as collisions and phantom braking. Code and data will be released upon acceptance.
RobPI: Robust Private Inference against Malicious Client
Jiaqi Xue, Mengxin Zheng and Qian Lou (University of Central Florida)
The increased deployment of machine learning inference in various applications has sparked privacy concerns. In response, private inference (PI) protocols have been created to allow parties to perform inference without revealing their sensitive data. Despite recent advancements in the efficiency of PI, most current methods assume a semi-honest threat model where the data owner is honest and adheres to the protocol. However, in reality, data owners can have different motivations and act in unpredictable ways, making this assumption unrealistic. To demonstrate how a malicious client can compromise the semi-honest model, we first design an inference manipulation attack against a range of state-of-the-art private inference protocols. This attack allows a malicious client to modify the model output using 3$\times$ to 8$\times$ fewer queries relative to current black-box attacks. Motivated by these attacks, we propose and implement RobPI, a fortified and resilient private inference protocol that can withstand malicious clients. RobPI integrates a distinctive cryptographic protocol that bolsters security by weaving encryption-compatible noise into the logits and features of private inference, thereby efficiently warding off malicious-client attacks. Our extensive experiments on various neural networks and datasets show that RobPI achieves a $\sim 91.9\%$ reduction in attack success rate and increases the number of queries required by malicious-client attacks by more than $10\times$.
Defending Against Prompt Injection with DataFilter
Yizhu Wang, Sizhe Chen (UC Berkeley), Raghad Alkhudair, Basel Alomair (KACST) and David Wagner (UC Berkeley)
As large language model (LLM) agents are increasingly deployed to automate tasks and interact with untrusted external data, prompt injection emerges as a significant security threat. By injecting malicious instructions into the data that LLMs access, an attacker can arbitrarily override the original user task and redirect the agent toward unintended, potentially harmful actions. Existing defenses either require access to model weights (fine-tuning), incur substantial utility loss (detection-based), or demand non-trivial system redesign (system-level). Motivated by this, we propose DataFilter, a test-time model-agnostic defense that removes malicious instructions from the data before it reaches the backend LLM. DataFilter is trained with supervised fine-tuning on simulated injections and leverages both the user's instruction and the data to selectively strip adversarial content while preserving benign information. Across multiple benchmarks, DataFilter consistently reduces the attack success rates to near zero while maintaining the utility of undefended models. DataFilter delivers strong security, high utility, and plug-and-play deployment, making it a strong practical defense to secure black-box commercial LLMs against prompt injection. Code and scripts for reproducing our results will be made publicly available.
Accelerating Targeted Hard-Label Adversarial Attacks in Low-Query Black-Box Settings
Arjhun Swaminathan and Mete Akgün (Eberhard Karls Universität Tübingen)
Deep neural networks for image classification remain vulnerable to adversarial examples: small, imperceptible perturbations that induce misclassifications. In black-box settings, where only the final prediction is accessible, crafting targeted attacks that aim to misclassify into a specific target class is particularly challenging due to narrow decision regions. Current state-of-the-art methods often exploit the geometric properties of the decision boundary separating a source image and a target image rather than incorporating information from the images themselves. In contrast, we propose Targeted Edge-informed Attack (TEA), a novel attack that utilizes edge information from the target image to carefully perturb it, thereby producing an adversarial image that is closer to the source image while still achieving the desired target classification. Our approach consistently outperforms current state-of-the-art methods across different models in low-query settings (using nearly 70% fewer queries), a scenario especially relevant in real-world applications with limited queries and black-box access. Furthermore, by efficiently generating a suitable adversarial example, TEA provides an improved target initialization for established geometry-based attacks.
RobustBlack: Challenging Black-Box Adversarial Attacks on State-of-the-Art Defenses
Mohamed Djilani (University of Luxembourg), Salah Ghamizi (Luxembourg Institute of Health) and Maxime Cordy (University of Luxembourg)
Although adversarial robustness has been extensively studied in white-box settings, recent advances in black-box attacks (including transfer- and query-based approaches) are primarily benchmarked against weak defenses, leaving a significant gap in the evaluation of their effectiveness against more recent and moderately robust models (e.g., those featured in the RobustBench leaderboard).
In this work, we argue that this gap is problematic and that the conclusions of previous benchmarks do not hold under more robust evaluation frameworks, leading to contradictory conclusions on the transferability of adversarial attacks.
We extensively evaluate the effectiveness of 13 popular black-box attacks, representative of the top ten popular transferability theories. We benchmark these attacks against eight top-performing and standard defense mechanisms on the ImageNet dataset. Our empirical evaluation reveals the following key findings: (1) the most advanced black-box attacks struggle to succeed even against simple adversarially trained models; (2) robust models that are optimized to withstand strong white-box attacks, such as AutoAttack, also exhibit enhanced resilience against black-box attacks; and (3) robustness alignment between the surrogate models and the target model can significantly impact the success rate of transfer-based attacks.
On the Effectiveness of Membership Inference in Targeted Data Extraction from Large Language Models
Ali Al Sahili, Ali Chehab and Razan Tajeddine (American University of Beirut)
Large Language Models (LLMs) are prone to memorizing training data, which poses serious privacy risks. Two of the most prominent concerns are training data extraction and Membership Inference Attacks (MIAs). Prior research has shown that these threats are interconnected: adversaries can extract training data from an LLM by querying the model to generate a large volume of text and subsequently applying MIAs to verify whether a particular data point was included in the training set. In this study, we integrate multiple MIA techniques into the data extraction pipeline to systematically benchmark their effectiveness. We then compare their performance in this integrated setting against results from conventional MIA benchmarks, allowing us to evaluate their practical utility in real-world extraction scenarios.
On the Robustness of Tabular Foundation Models: Test-Time Attacks and In-Context Defenses
Mohamed Djilani, Thibault Simonetto, Karim Tit, Florian Tambon (University of Luxembourg), Salah Ghamizi (Luxembourg Institute of Health), Maxime Cordy and Mike Papadakis (University of Luxembourg)
Recent tabular foundation models (FMs), such as TabPFN and TabICL, leverage in-context learning to achieve strong performance without gradient updates or fine-tuning. However, their robustness to adversarial manipulation remains largely unexplored. In this work, we present a comprehensive study of the adversarial vulnerabilities of tabular FMs, focusing on both their fragility to targeted test-time attacks and their potential misuse as adversarial tools. We show, on three benchmarks in finance, cybersecurity, and healthcare, that small, structured perturbations to test inputs can significantly degrade prediction accuracy, even when the training context remains fixed.
Additionally, we demonstrate that tabular FMs can be repurposed to generate evasion attacks that transfer to conventional models such as random forests and XGBoost and, to a lesser extent, to deep tabular models.
To improve tabular FMs, we formulate the robustification problem as an optimisation of either the weights (adversarial fine-tuning) or the context (adversarial in-context learning). We introduce an in-context adversarial training strategy that incrementally replaces the context with adversarially perturbed instances, without updating model weights. Our approach improves robustness across multiple tabular benchmarks. Together, these findings position tabular FMs as both a target and a source of adversarial threats, highlighting the urgent need for robust training and evaluation practices in this emerging paradigm.
Protecting Facial Biometrics from Malicious Generative Editing via Latent Optimization
Fahad Shamshad, Hashmat Shadab Malik (Mohamed bin Zayed University of Artificial Intelligence, UAE), Muzammal Naseer (Khalifa University, UAE), Salman Khan (Mohamed bin Zayed University of Artificial Intelligence, UAE. The Australian National University) and Karthik Nandakumar (Mohamed bin Zayed University of Artificial Intelligence, UAE. Michigan State University, USA)
Instruction-guided diffusion models enable realistic and instruction-driven image edits of facial images. However, these capabilities raise severe privacy risks, as malicious actors can fabricate convincing but harmful edits while preserving the subject’s identity. Existing defenses, such as pixel-space adversarial perturbations, are either fragile to real-world transformations (e.g., compression, blurring, rotation) or fail under advanced diffusion-based purification techniques. In this paper, we introduce FaceGuardian, a semantic and instruction-agnostic defense framework that operates in the low-dimensional latent manifold of a pretrained generative model, producing protected images that are visually indistinguishable from the original while ensuring that downstream diffusion-based edits fail to preserve biometric identity. FaceGuardian optimizes adversarial perturbations in the identity-sensitive subspaces of the facial manifold, ensuring that the protected image preserves the global structure and context of the original face. FaceGuardian remains robust not only to common input transformations but also to advanced diffusion-based purification strategies.
To quantify biometric protection more accurately, we further introduce a face-centric evaluation protocol that focuses exclusively on identity-relevant regions. Extensive experiments across diverse editing prompts and real-world degradations demonstrate that FaceGuardian achieves up to a 12% improvement in FR score over recent methods, while maintaining robustness against input transformations.
They’re Closer Than We Think: Tackling Near-OOD Problem
Shaurya Bhatnagar, Ishika Sharma, Ranjitha Prasad (Indraprastha Institute of Information Technology Delhi (IIIT-Delhi)), Vidya T (LightMetrics), Ramya Hebbalaguppe (TCS Research) and Ashish Sethi (LightMetrics)
Out-of-Distribution (OoD) detection plays a vital role in the robustness of models in real-world applications. While traditional approaches are effective at detecting samples that are significantly different from the training distribution (far-OoD), they often falter with near-OoD samples, where subtle variations in images pose a challenge for standard methods like likelihood-based detection. In practical applications, near-OoD samples are more prevalent, particularly in fine-grained tasks where instances from different classes exhibit high perceptual similarity. In these scenarios, OoD detection relies on subtle, localized features. For instance, in bird species classification, accurate OoD detection requires discerning fine-grained attributes such as beak shape, tail, and feather pattern, which exhibit substantial structural overlap across both ID and OoD classes. We propose the novel NORD-F framework to detect near-OoD samples by disentangling coarse structural features from fine-grained discriminative features. We use gradient-reversal-based disentangled representation learning, which helps isolate class-invariant features, allowing the classifier to employ class-specific features. We present a theoretical analysis that motivates the design of the novel architecture, consisting of invariance, classification, and reconstruction branches. Empirically, we demonstrate that NORD-F outperforms well-known baselines on fine-grained datasets such as CUB, Stanford Dogs, and Aircraft for near-OoD detection.
Systematization of Knowledge Papers
SoK: The Hitchhiker’s Guide to Efficient, End-to-End, and Tight DP Auditing
Meenatchi Sundaram Muthu Selva Annamalai (University College London), Borja Balle, Jamie Hayes, Georgios Kaissis (Google DeepMind) and Emiliano De Cristofaro (UC Riverside)
In this paper, we systematize research on auditing Differential Privacy (DP) techniques, aiming to identify key insights and open challenges. First, we introduce a comprehensive framework for reviewing work in the field and establish three cross-contextual desiderata that DP audits should target—namely, efficiency, end-to-end-ness, and tightness. Then, we systematize the modes of operation of state-of-the-art DP auditing techniques, including threat models, attacks, and evaluation functions. This allows us to highlight key details overlooked by prior work, analyze the limiting factors to achieving the three desiderata, and identify open research problems. Overall, our work provides a reusable and systematic methodology geared to assess progress in the field and identify friction points and future directions for our community to focus on.
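For readers unfamiliar with how tightness is usually quantified in this line of work, the snippet below sketches the standard attack-based lower bound on $\varepsilon$ (a generic building block, not specific to this SoK): any $(\varepsilon,\delta)$-DP mechanism forces a membership attack to satisfy TPR $\le e^{\varepsilon}\cdot$ FPR $+ \delta$, so confidence bounds on the observed rates yield an empirical lower bound that can be compared against the analytical $\varepsilon$.

```python
import math
from scipy.stats import beta

def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact two-sided (1 - alpha) confidence interval for a binomial rate."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

def empirical_epsilon_lower_bound(tp: int, n_members: int,
                                  fp: int, n_nonmembers: int,
                                  delta: float = 1e-5, alpha: float = 0.05) -> float:
    """Simplified attack-based audit: eps >= log((TPR - delta) / FPR),
    instantiated with a conservative (low) TPR and (high) FPR estimate."""
    tpr_lo, _ = clopper_pearson(tp, n_members, alpha)
    _, fpr_hi = clopper_pearson(fp, n_nonmembers, alpha)
    if tpr_lo <= delta or fpr_hi <= 0.0:
        return 0.0
    return max(0.0, math.log((tpr_lo - delta) / fpr_hi))

# Example: an attack flags 900/1000 members and 50/1000 non-members.
print(empirical_epsilon_lower_bound(900, 1000, 50, 1000))
```

A large gap between such an empirical lower bound and the analytical $\varepsilon$ is precisely what the tightness desideratum is about.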
SoK: On the Survivability of Backdoor Attacks on Unconstrained Face Recognition Systems
Quentin Le Roux (Thales Group, Inria), Yannick Teglia (Thales Group), Teddy Furon (Inria), Philippe Loubet Moundi and Eric Bourbao (Thales Group)
The widespread deployment of Deep Learning-based Face Recognition Systems raises multiple security concerns. While prior research has identified backdoor vulnerabilities on isolated components, Backdoor Attacks on real-world, unconstrained pipelines remain underexplored. This SoK paper presents the first comprehensive system-level analysis of Backdoor Attacks targeting fully-fledged Face Recognition Systems. We exploit the existing Supervised Learning backdoor literature and show for the first time that face feature extractors trained with large margin metric learning losses can fall to Backdoor Attacks. By analyzing 20 pipeline configurations and 15 attack scenarios in a holistic manner, we then reveal that a single model backdoor can compromise an entire Face Recognition System. Finally, we discuss the impact of such attacks and propose best practices and countermeasures for stakeholders.
SoK: Data Minimization in Machine Learning
Robin Staab, Nikola Jovanović (ETH Zurich), Kimberly Mai (University College London), Prakhar Ganesh (McGill University / Mila), Martin Vechev (ETH Zurich), Ferdinando Fioretto (University of Virginia) and Matthew Jagielski (Anthropic)
Data minimization (DM) describes the principle of collecting only the data strictly necessary for a given task. It is a foundational principle across major data protection regulations like GDPR and CPRA. Violations of this principle have substantial real-world consequences, with regulatory actions resulting in fines reaching hundreds of millions of dollars. Notably, the relevance of data minimization is particularly pronounced in machine learning (ML) applications, which typically rely on large datasets, resulting in an emerging research area known as Data Minimization in Machine Learning (DMML). At the same time, existing work on other ML privacy and security topics often addresses concerns relevant to DMML without explicitly acknowledging the connection.
This disconnect leads to confusion among practitioners, complicating their efforts to implement DM principles and interpret the terminology, metrics, and evaluation criteria used across different research communities. To address this gap, we present the first systematization of knowledge (SoK) for DMML. We introduce a general framework for DMML, encompassing a unified data pipeline, adversarial models, and points of minimization. This framework allows us to systematically review the literature on data minimization as well as DM-adjacent methodologies whose link to DM was often overlooked. Our structured overview is designed to help practitioners and researchers effectively adopt and apply DM principles in AI/ML, by helping them identify relevant techniques and understand their underlying assumptions and trade-offs through a unified DM-centric lens.
SoK: Decentralized AI (DeAI)
Elizabeth Lui (FLock.io), Rui Sun (Newcastle University & University of Manchester), Vatsal Shah (FLock.io), Xihan Xiong (Imperial College London), Jiahao Sun (FLock.io), Davide Crapis (Ethereum Foundation & PIN AI), William Knottenbelt (Imperial College London) and Zhipeng Wang (University of Manchester)
Centralization enhances the efficiency of Artificial Intelligence (AI) but also introduces critical challenges, including single points of failure, inherent biases, data privacy risks, and scalability limitations. To address these issues, blockchain-based Decentralized Artificial Intelligence (DeAI) has emerged as a promising paradigm that leverages decentralization and transparency to improve the trustworthiness of AI systems. Despite rapid adoption in industry, the academic community lacks a systematic analysis of DeAI's technical foundations, opportunities, and challenges. This work presents the first Systematization of Knowledge (SoK) on DeAI, offering a formal definition and precise mathematical model, a taxonomy of existing solutions based on the AI lifecycle, and an in-depth investigation of the roles of blockchain in enabling secure and incentive-compatible collaboration. We further review security risks across the DeAI lifecycle and empirically evaluate representative mitigation techniques. Finally, we highlight open research challenges and future directions for advancing blockchain-based DeAI.
SoK: Privacy Risks and Mitigations in Retrieval-Augmented Generation Systems
Andreea-Elena Bodea, Stephen Meisenbacher, Alexandra Klymenko and Florian Matthes (Technical University of Munich)
The continued promise of Large Language Models (LLMs), particularly in their natural language understanding and generation capabilities, has driven a rapidly increasing interest in identifying and developing LLM use cases. In an effort to complement the ingrained "knowledge" of LLMs, Retrieval-Augmented Generation (RAG) techniques have become widely popular. At its core, RAG involves the coupling of LLMs with domain-specific knowledge bases, whereby the generation of a response to a user question is augmented with contextual and up-to-date information. The proliferation of RAG has sparked concerns about data privacy, particularly with the inherent risks that arise when leveraging databases with potentially sensitive information. Numerous recent works have explored various aspects of privacy risks in RAG systems, from adversarial attacks to proposed mitigations. With the goal of surveying and unifying these works, we ask one simple question: What are the privacy risks in RAG, and how can they be measured and mitigated? To answer this question, we conduct a systematic literature review of RAG works addressing privacy, and we systematize our findings into a comprehensive set of privacy risks, mitigation techniques, and evaluation strategies. We supplement these findings with two primary artifacts: a Taxonomy of RAG Privacy Risks and a RAG Privacy Process Diagram. Our work contributes to the study of privacy in RAG not only by conducting the first systematization of risks and mitigations, but also by uncovering important considerations when mitigating privacy risks in RAG systems and assessing the current maturity of proposed mitigations from the literature.
SoK: Enhancing Cryptographic Collaborative Learning with Differential Privacy
Francesco Capano, Jonas Boehler (SAP SE) and Benjamin Weggenmann (Technische Hochschule Würzburg Schweinfurt)
In collaborative learning (CL), multiple parties jointly train a machine learning model on their private datasets. However, data cannot be shared directly due to privacy concerns. To ensure input confidentiality, cryptographic techniques, e.g., multi-party computation (MPC), enable training on encrypted data. Yet, even securely trained models are vulnerable to inference attacks aiming to extract, e.g., memorized data from model outputs. To ensure output privacy and mitigate inference attacks, differential privacy (DP) injects calibrated noise during training. While cryptography and DP offer complementary guarantees, combining them efficiently for cryptographic and differentially private collaborative learning (CPCL) is challenging. Cryptography incurs performance overheads, while DP degrades accuracy, creating a privacy-accuracy-performance trade-off that requires careful consideration and design choices.
This work systematizes solutions combining cryptography and DP in CL. We generalize and detail common phases across CPCL paradigms, and identify secure noise sampling as the foundational phase for achieving CPCL. We analyze the trade-offs of different secure noise sampling techniques, noise types, and DP mechanisms, discussing their implementation challenges and evaluating their accuracy and cryptographic overhead across CPCL paradigms. Additionally, we implement the identified secure noise sampling options in MPC and evaluate their computation and communication costs in WAN and LAN settings. Finally, we propose future research directions based on identified key observations, gaps, and possible enhancements in the literature.
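As one concrete, deliberately simplified example of the secure noise sampling problem this SoK centers on, a common option in the literature is distributed Gaussian noise generation: each of the $n$ parties contributes a Gaussian share with variance $\sigma^2/n$, so the aggregate carries the full $\sigma^2$ required by the DP analysis while no single party knows the total noise. The plain-Python simulation below only illustrates the statistics; in an actual CPCL system the shares would be sampled and combined inside the cryptographic protocol (e.g., under MPC or secure aggregation).

```python
import numpy as np

def partial_gaussian_noise(shape, sigma_total: float, n_parties: int, rng):
    """One party's additive noise share: N(0, sigma_total^2 / n_parties)."""
    return rng.normal(0.0, sigma_total / np.sqrt(n_parties), size=shape)

def simulate_noisy_aggregation(updates, sigma_total: float, seed: int = 0):
    """Sum local updates, each masked with a partial noise share.

    Because the shares are independent Gaussians, the total added noise on
    the aggregate has variance sigma_total^2, matching a centrally applied
    Gaussian mechanism.
    """
    rng = np.random.default_rng(seed)
    n = len(updates)
    noisy = [u + partial_gaussian_noise(u.shape, sigma_total, n, rng) for u in updates]
    return np.sum(noisy, axis=0)

if __name__ == "__main__":
    parties = [np.ones(4) * i for i in range(3)]   # toy local updates from 3 parties
    print(simulate_noisy_aggregation(parties, sigma_total=2.0))
```

The cryptographic overhead and the verifiability of each party's share are exactly the trade-offs the SoK evaluates across CPCL paradigms.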
Position Papers
Position: Research in Collaborative Learning Does Not Serve Cross-Silo Federated Learning in Practice
Kevin Kuo, Chhavi Yadav and Virginia Smith (Carnegie Mellon University)
Cross-silo federated learning (FL) is a promising approach to enable cross-organization collaboration in machine learning model development without directly sharing private data. Despite growing organizational interest driven by data protection regulations such as GDPR and HIPAA, the adoption of cross-silo FL remains limited in practice. In this paper, we conduct an interview study to understand the practical challenges to cross-silo FL adoption. With interviews spanning a diverse set of stakeholders such as user organizations, software providers, and academic researchers, we uncover various barriers, from concerns about model performance to questions of incentives and trust between participating organizations. Our study shows that cross-silo FL faces a set of challenges that have yet to be well-captured by existing research in the area and are quite distinct from other forms of federated learning such as cross-device FL. We end with a discussion on future research directions that can help overcome these challenges.
Position: Gaussian DP for Reporting Differential Privacy Guarantees in Machine Learning
Juan Felipe Gomez (Harvard University), Bogdan Kulynych (Lausanne University Hospital), Georgios Kaissis, Borja Balle, Jamie Hayes (Google DeepMind), Flavio du Pin Calmon (Harvard University) and Antti Honkela (University of Helsinki)
Current practices for reporting the level of differential privacy (DP) protection for machine learning (ML) algorithms such as DP-SGD provide an incomplete and potentially misleading picture of the privacy guarantees. For instance, if only a single $(\varepsilon,\delta)$ is known about a mechanism, standard analyses suggest that there could exist highly accurate inference attacks against training data records when, in fact, such accurate attacks might not exist. In this position paper, we argue that using non-asymptotic Gaussian Differential Privacy (GDP) as the primary means of communicating DP guarantees in ML avoids these potential downsides. Using two recent developments in the DP literature: (i) open-source numerical accountants capable of computing the privacy profile and $f$-DP curves of DP-SGD to arbitrary accuracy, and (ii) a decision-theoretic metric over DP representations, we show how to provide non-asymptotic bounds on GDP using numerical accountants, and show that GDP can capture the entire privacy profile of DP-SGD and related algorithms with virtually no error, as quantified by the metric. To support our claims, we investigate the privacy profiles of state-of-the-art DP large-scale image classification, and the TopDown algorithm for the U.S. Decennial Census, observing that GDP fits their profiles remarkably well in all cases. We conclude with a discussion on the strengths and weaknesses of this approach, and discuss which other privacy mechanisms could benefit from GDP.
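As background on why a single GDP parameter can stand in for a whole privacy profile, recall the standard conversion from the GDP literature (Dong, Roth, and Su): a $\mu$-GDP mechanism satisfies $(\varepsilon, \delta(\varepsilon))$-DP for every $\varepsilon \ge 0$ with $\delta(\varepsilon) = \Phi(-\varepsilon/\mu + \mu/2) - e^{\varepsilon}\,\Phi(-\varepsilon/\mu - \mu/2)$. The snippet below evaluates this profile for a reported $\mu$; it is a generic conversion, not the paper's numerical accountants.

```python
import numpy as np
from scipy.stats import norm

def delta_from_gdp(mu: float, eps) -> np.ndarray:
    """Privacy profile of a mu-GDP mechanism (Dong, Roth & Su):
    delta(eps) = Phi(-eps/mu + mu/2) - exp(eps) * Phi(-eps/mu - mu/2)."""
    eps = np.asarray(eps, dtype=float)
    delta = norm.cdf(-eps / mu + mu / 2) - np.exp(eps) * norm.cdf(-eps / mu - mu / 2)
    return np.maximum(delta, 0.0)  # guard against tiny negative round-off

# Example: a training run reported as mu = 0.5 GDP; read off (eps, delta) pairs.
eps_grid = np.linspace(0.0, 4.0, 5)
for eps, delta in zip(eps_grid, delta_from_gdp(0.5, eps_grid)):
    print(f"eps = {eps:.1f}  ->  delta <= {delta:.3e}")
```

Reporting $\mu$ therefore communicates an entire $(\varepsilon, \delta)$ curve at once, which is the core of the position argued above.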