Dr. Yufei Han currently works as Senior Principal Researcher. His research interests include robust learning with imperfect telemetry data, adversarial learning, and privacy-preserving learning, which aim at providing a trusted machine learning service.
Prior to joining the research group at NortonLifeLock, formerly Symantec, he conducted post-doctoral research in French Institute of Research in Computer Science and Automation (INRIA) in Paris from 2010-2014. He received his Ph.D. degree in National Laboratory of Pattern Recognition, Chinese Academy of Sciences, Beijing in 2010.
Selected Academic Papers
In proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI 2020)
For federated learning systems deployed in the wild, data flaws hosted on local agents are widely witnessed. On one hand, given a large amount of training data are corrupted by systematic sensor noise and environmental perturbations, the performances of federated model training can be degraded significantly. On the other hand, it is prohibitively expensive for either clients or service providers to set up manual sanitary checks to verify the quality of data instances. In our study, we echo this challenge by proposing a collaborative and privacy-preserving machine teaching method. Specifically, we use a few trusted instances provided by teachers as benign examples in the teaching process. Our collaborative teaching approach seeks jointly the optimal tuning on the distributed training set, such that the model learned from the tuned training set predicts labels of the trusted items correctly. The proposed method couples the process of teaching and learning and thus produces directly a robust prediction model despite the extremely pervasive systematic data corruption. The experimental study on real benchmark data sets demonstrates the validity of our method.
In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN 2017)
This paper proposes a practical approach to learn spectral clustering based on adaptive stochastic gradient optimization. Crucially, the proposed approach recovers the exact spectrum of Laplacian matrices in the limit of the iterations, and the cost of each iteration is linear in the number of samples. Extensive experimental validation on data sets with up to half a million samples demonstrate its scalability and its ability to outperform state-of-the-art approximate methods to learn spectral clustering for a given computational budget.
In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI 2016)
We propose to encode the weakly supervised information in PU learning tasks into pairwise constraints between training in-stances. Violation of pairwise constraints are measured and incorporated into a partially supervised graph embedding model.
In Proceedings of the 24th ACM Conference on Computer and Communications Security (ACM SIGSAC 2017)
In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN 2019)
In this work, we define a collaborative and privacy-preserving machine teaching paradigm with multiple distributed teachers. The focus is to find strategies to organize distributed agents to jointly select a compact subset of data that can be used to train a global model. The global model should achieve nearly the same performance as if the central learner had access to all the data, but the central learner only has access to the selected subset, and each agent only has access to their own data. The goal of this research is to find good strategies to train global models while giving some control back to agents.
In Proceedings of the 24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2018)
We proposed a weakly supervised multi-label learning approach, based on the idea of collaborative embedding. It provides a flexible framework to conduct efficient multi-label classification at both transductive and inductive mode by coupling the process of reconstructing missing features and weak label assignments in a joint optimisation framework.
In Proceedings of the 13th ACM International Conference on Web Search and Data Mining (WSDM 2020) In this paper, we study the graph-based semi-supervised learning for classifying nodes in attributed networks, where the nodes and edges possess content information. Recent approaches like graph convolution networks and attention mechanisms have been pro-posed to ensemble the first-order neighbors and incorporate the relevant neighbors. However, it is costly (especially in memory) to consider all neighbors without a prior differentiation. We propose to explore the neighborhood in a reinforcement learning setting and find a walk path well-tuned for classifying the unlabeled target nodes. We let an agent (of node classification task) walk over the graph and decide where to move to maximize classification accuracy. We define the graph walk as a partially observable Markov decision process (POMDP). The proposed method is flexible for working in both transductive and inductive setting. Extensive experiments on four data sets demonstrate that our proposed method outperforms several state-of-the-art methods. Several case studies also illustrate the meaningful movement trajectory made by the agent.
In Proceedings of the 33th Annual computer Security Applications Conference (ACSAC 2017)
We set out to predict which security events and incidents a security product would have detected had it been deployed, based on the events produced by other security products that were in place. We discovered that the problem is tractable, and that some security products are much harder to model than others, which makes them more valuable.
In Proceedings of the 33rd Annual Computer Security Applications Conference (ACSAC 2017)
We presented Marmite, a system that can detect malicious files by leveraging a global download graph and label propagation with Bayesian confidence.