Academic Talks (Grace Y. Yi, Wenqing He, Wanhua Su, 9.21)
Prof. Grace Y. Yi (University of Waterloo, Canada); Prof. Wenqing He (University of Western Ontario, Canada); Prof. Wanhua Su (MacEwan University)
- Thanks to advances in modern data-acquisition technology, massive data with diverse features and large volume are more accessible than ever. The impact of big data is significant. While the abundant volume of data presents great opportunities for researchers to extract useful information for new knowledge and sensible decision making, big data also present great challenges. A very important, sometimes overlooked, challenge is the quality and provenance of the data. Big data are not automatically useful; they are often raw and involve considerable noise. Typically, the challenges presented by noisy data with measurement error, missing observations and high dimensionality are particularly intriguing. Noisy data with these features arise ubiquitously in various fields, including health sciences, epidemiological studies, environmental studies, survey research, economics, and so on. In this talk, I will discuss the issues induced by noisy data and some methods of handling such data.
- The perturbation resampling method can be employed to estimate the covariance matrix of an estimator when the estimator is obtained by minimizing a U-process. We propose using perturbation resampling to establish general tests for detecting model misspecification or for model checking. The proposed tests are simple and theoretically justified. We apply the proposed method to modify the tests proposed by Shih (1998) for the assessment of Clayton models in multivariate survival analysis, where the asymptotic variance is intractable. The proposed tests show promising performance in simulation studies and have simpler procedures than the nonparametric bootstrap, which can also be applied to approximate the covariance matrix. A colon cancer study further illustrates the proposed methods.
- There are two kinds of medical tests: screening tests and diagnostic tests. The purpose of a screening test is to determine whether an asymptomatic individual has a certain disease or not, while diagnostic tests are used to confirm the disease status. The gold standard for evaluating the effectiveness of a diagnostic test is the area under the receiver operating characteristic (ROC) curve (AUC). For screening tests, a performance metric that accentuates the true positive rate at the early part of the ROC curve is more attractive than a metric, such as the AUC, that treats the true positive rate as equally important across the entire curve. For the same type of evaluation, a different metric known as the average precision (AP) is used much more widely in the information retrieval (IR) community, where the task is to retrieve relevant documents from a collection of documents. In this talk, we elucidate the difference and relationship between the AUC and the AP. More specifically, we explain mathematically why the AP may be more appropriate if the earlier part of the ROC curve is of interest, and hence is more appropriate for screening tests. We also propose a framework for making statistical inferences on the AUC and the AP based on the multinomial model and the delta method. The performance of this framework is demonstrated with real-world examples concerning the evaluation of protein biomarkers for prostate cancer and the assessment of digital versus film mammography for breast cancer screening. This is joint work with Dr. Yan Yuan and Dr. Mu Zhu.
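
To make the contrast between the AUC and the AP in the last abstract concrete, here is a minimal Python sketch. It is not the speakers' inference framework; it only computes the two metrics from a list of 0/1 labels sorted by decreasing test score, assuming no tied scores. The function names and the two toy rankings are hypothetical and chosen purely for illustration.

```python
def auc(ranked_labels):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive is ranked above the negative. Assumes no tied scores."""
    n_pos = sum(ranked_labels)
    n_neg = len(ranked_labels) - n_pos
    concordant = 0
    negatives_below = 0
    # Walk from the bottom of the ranking upward; each positive is
    # concordant with every negative already seen (ranked below it).
    for label in reversed(ranked_labels):
        if label == 1:
            concordant += negatives_below
        else:
            negatives_below += 1
    return concordant / (n_pos * n_neg)


def average_precision(ranked_labels):
    """AP as the mean of the precision values at the rank of each positive."""
    hits = 0
    precisions = []
    for k, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions)


# Hypothetical toy rankings (1 = diseased, 0 = healthy), best score first.
early_heavy = [1, 0, 0, 0, 0, 1]  # one positive at the very top, one at the bottom
mid_heavy = [0, 0, 1, 1, 0, 0]    # both positives in the middle

for name, ranking in [("early_heavy", early_heavy), ("mid_heavy", mid_heavy)]:
    print(name, "AUC =", auc(ranking), "AP =", round(average_precision(ranking), 4))
```

In this toy case both rankings have the same AUC (0.5), but the ranking that places a positive at the very top has the higher AP, which is consistent with the abstract's point that the AP puts more weight on the early part of the ROC curve.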