Hi, welcome to my homepage! I am Connie, and I am currently a senior data scientist at Novartis. I have a diverse background in statistics, bioinformatics and biomedicine, and extensive experience in pre-clinical projects and in both early- and late-phase clinical studies.
I specialize in deep learning, machine learning and multi-omics data analysis. I have worked on several projects in both clinical and non-clinical fields, including analyzing digital device data for diagnostic purposes, supporting the R Shiny exploratory app for the Hidradenitis Suppurativa task force, providing statistical support for an early-phase Hematology trial as trial statistician, and supporting biomarker analysis for late-phase clinical trials across the AMD/MDS disease area. I also performed exploratory analyses of time-series EEG (electroencephalogram) data and high-dimensional SOMAScan data in collaboration with experts in the Neuroscience and Cardiovascular disease areas.
Before joining Novartis, I studied Human Genetics and Statistics at the University of California, Los Angeles. I studied genomics with Professor Leonid Kruglyak, chair of the Human Genetics department, and worked with Professor Yinglian Wu on a statistics project.
During my research, I built machine learning and deep learning models to answer biological questions. I also performed linkage and association studies using genomic data to identify causal genes for disease-related traits. I love travelling, trying out good food, and cute animals. I have a one-year-old German Shepherd and I truly enjoy every walk with him.
Digital technologies have the potential to provide objective and precise tools to detect depression-related symptoms. Deployment of digital technologies in clinical research can enable collection of large volumes of clinically relevant data that may not be captured using conventional psychometric questionnaires and patient-reported outcomes. Rigorous methodology studies to develop novel digital endpoints in depression are warranted.
Together with the team, I conducted an exploratory, cross-sectional study to evaluate several digital technologies in subjects with major depressive disorder (MDD) or persistent depressive disorder (PDD) and in healthy controls. The study aimed to assess the utility and accuracy of the digital technologies as potential diagnostic tools for unipolar depression, and to correlate digital biomarkers with clinically validated psychometric questionnaires in depression. In many cases, I obtained simple, parsimonious models with reasonably high diagnostic accuracy and the potential to predict standard clinical outcomes in depression. This study generated many useful insights for future methodology studies of digital technologies and for proof-of-concept clinical trials in depression and possibly other indications.
Electroencephalography (EEG) is a widely available, noninvasive diagnostic method that provides direct insight into brain synaptic activity in real time. EEG biomarkers play a valuable role in the early diagnosis of many neurological disorders. However, current EEG biomarkers are limited to pre-defined frequency bins and can only identify certain frequency waves (e.g. alpha waves) that contribute to group differences. There is an urgent need to identify new EEG biomarkers without pre-defined bins to enable more sensitive and robust detection of biologically relevant phenotypes and drug effects. Here we explored novel EEG biomarkers without pre-defined bins using EEG data from mouse models.
I leveraged the power of Recurrent Neural Networks (RNNs) to build models that accurately distinguish groups of mice carrying different genetic modifications from wildtype mice. I applied data augmentation to substantially increase the diversity of the data available for training the RNN, and thereby improve predictive accuracy. I also compared different RNN methods and tuned the hyperparameters to obtain a reliable model with high AUROC. By extracting feature importance from the best-performing models, I obtained the specific frequency bins that are potential EEG biomarkers for the group differences. As a next step, I will apply this RNN-based approach to other Novartis neuroscience mouse studies across multiple years (500-1000 mice) to identify novel EEG biomarkers for different disease phenotypes.
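One common way to augment time-series recordings is to cut each trace into overlapping fixed-length windows, so that a single recording yields many training segments. The text above does not say which augmentation was used, so the sketch below is only an illustration under that assumption; the function name, window length, step and toy trace are all hypothetical.

```python
def sliding_windows(signal, window, step):
    """Cut one recording into overlapping fixed-length segments.

    Assumed augmentation scheme for illustration only; not necessarily
    the method used in the EEG project described above.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = len(signal) - window
    return [signal[start:start + window]
            for start in range(0, last_start + 1, step)]


# Hypothetical 10-sample trace standing in for an EEG recording.
recording = [0.1 * i for i in range(10)]
segments = sliding_windows(recording, window=4, step=2)
print(len(segments), "segments of length", len(segments[0]))
```

A smaller step gives more, but more strongly correlated, segments, so windows from the same animal should stay on the same side of any train/test split.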
Heart failure (HF) is a medical condition affecting at least 26 million people worldwide and is increasing in prevalence. HF can be classified into three groups based on the ejection fraction (EF): heart failure with reduced EF (HFrEF), heart failure with mid-range EF (HFmrEF), and heart failure with preserved EF (HFpEF). HFmrEF can progress into either HFrEF or HFpEF. While recent medical advances have resulted in efficient and specific treatments for HFrEF, these treatments lack efficacy in HFpEF patients. The differential response rates highlight the importance of understanding the distinct pathogenesis of HFrEF and HFpEF.
In this study, I aim to understand the biological pathways and underlying mechanisms of HFrEF and HFpEF. Around 5000 proteins were profiled with the Slow Off-rate Modified Aptamer (SOMA) scan assay in plasma samples from 1548 HFrEF patients (PARADIGM trial) and 1177 HFpEF patients (PARAGON trial). I performed cross-study batch correction to minimize the effect of batch differences. The 4828 SOMAmers shared between the two studies were used for downstream analysis. With a minimal model controlling for gender, age and anticoagulant status, I identified 40 proteins that were significantly differentially expressed between HFpEF and HFrEF. Some of these are sex hormone-related proteins, indicating sex differences between the two patient populations. This is consistent with previous observations that men are predisposed to HFrEF, whereas women predominate in HFpEF. The analysis provides insight into specific protein biomarkers of HF subpopulations and is expected to support drug development for HFpEF.
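When thousands of SOMAmers are tested at once, per-protein p-values are usually adjusted for multiple testing before proteins are called significant. The text does not state which correction was applied; the sketch below shows the Benjamini-Hochberg procedure as one widely used option, with made-up p-values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values for four hypothetical proteins.
raw = [0.01, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adj) if q < 0.05]
print(adj, significant)
```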
Assessment of measurable residual disease (MRD) to evaluate the depth of remission at the time of achieving morphological complete remission (CR) or CR with incomplete blood count recovery (CRi) by multiparameter flow cytometry (MFC) or quantitative polymerase chain reaction (qPCR) has been shown to be predictive of outcome in patients (pts) with acute myeloid leukemia (AML), and is recommended according to European LeukemiaNet guidelines. However, accurate MRD assessment is still limited by lack of suitable markers in all pts (qPCR) and/or limited specificity/sensitivity (MFC). Next generation sequencing (NGS) holds promise to overcome some of these limitations and allows versatile and sensitive MRD assessment in almost all AML pts. However, data on NGS-MRD in this setting are still limited.
To further extend the experience with NGS-based MRD assessment, I analyzed the NGS MRD data collected in a randomized, placebo-controlled phase 3 study (UNIFY; NCT03512197) to explore the prognostic implications of MRD detected by NGS at CR/CRi in AML patients. I found a high correlation of variant allele frequency (Pearson's r >= 0.84) and a high level of concordance in MRD calls between bone marrow (BM) and peripheral blood (PB) (91% overall percent agreement). Detection of MRD in either BM or PB in CR/CRi pts at the end of induction was associated with significantly lower event-free survival (EFS) and overall survival (OS) compared to MRD-negative status (1-year EFS of 26% vs 72% in BM and 34% vs 71% in PB; 1-year OS of 73% vs 93% in BM and 67% vs 94% in PB) (Figure). I also performed a multivariate analysis, controlling for age and sex, to confirm the independent prognostic value of NGS-MRD for OS.
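For readers unfamiliar with the two concordance measures quoted above, here is a minimal sketch of how overall percent agreement and Pearson's r can be computed for paired BM and PB samples. The calls and values are invented for illustration and are not study data.

```python
import math


def overall_percent_agreement(calls_a, calls_b):
    """Percentage of paired samples with the same positive/negative call."""
    if len(calls_a) != len(calls_b) or not calls_a:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(a == b for a, b in zip(calls_a, calls_b))
    return 100.0 * matches / len(calls_a)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Invented MRD calls (True = MRD detected) for five paired samples.
bm_calls = [True, True, False, False, True]
pb_calls = [True, False, False, False, True]
opa = overall_percent_agreement(bm_calls, pb_calls)

# Invented, perfectly proportional values, used only to exercise pearson_r.
r = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(opa, r)
```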
B-cell activating factor receptor (BAFF-R) enhances the survival and regulation of normal and malignant B cells. VAY736 is a human monoclonal antibody (mAb) against BAFF-R that targets BAFF-R+ B cells for elimination by antibody-dependent cell-mediated cytotoxicity (ADCC). VAY736 has anti-leukemia activity in preclinical CLL models that is superior to that of anti-CD20 mAbs. Although Bruton's tyrosine kinase inhibitors (BTKis; acalabrutinib, ibrutinib) are the current standard of care for CLL, the indefinite length of monotherapy required may result in cumulative clinical or economic toxicity and/or acquired treatment resistance. In preclinical models, adding VAY736 to ibrutinib significantly improved survival and reduced disease burden, suggesting that this combination may augment the anti-leukemia response and allow patients (pts) to discontinue ibrutinib.
I evaluated the safety and efficacy of the VAY736 + ibrutinib combination. The treatment was well tolerated, with an acceptable safety profile enabling dose expansion. Clinical activity was observed, including multiple pts attaining undetectable MRD status in blood and BM, allowing 6 pts to discontinue ibrutinib therapy for an extended period. The data provide clinical evidence of the potent anti-leukemia activity of VAY736 and of the potential to safely discontinue ibrutinib or other BTKis with VAY736 add-on therapy.
Mutations are the root source of genetic variation and underlie the process of evolution. Although the rates at which mutations occur vary considerably between species, little is known about differences within species, or the genetic and molecular basis of these differences. In this project, I leveraged the power of the yeast Saccharomyces cerevisiae as a model system to uncover natural genetic variants that underlie variation in mutation rate.
I developed a high-throughput fluctuation assay and used it to quantify mutation rates in natural yeast isolates and in 1040 segregant progeny from a cross between BY, a lab strain, and RM, a wine strain. I observed that mutation rate varies among yeast strains and is highly heritable (H² = 0.46). I performed linkage mapping in the segregants and identified four quantitative trait loci (QTLs) underlying mutation rate variation in the cross. I fine-mapped two of these QTLs to the underlying causal genes, RAD5 and MKT1. These genes also underlie sensitivity to the DNA-damaging agents 4NQO and MMS, suggesting a connection between spontaneous mutation rate and mutagen sensitivity.
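To illustrate how a fluctuation assay turns colony counts into a mutation rate, the sketch below uses the classic p0 method: if a fraction p0 of parallel cultures shows no mutants, the expected number of mutations per culture is m = -ln(p0), and the rate per cell division is approximately m divided by the final number of cells per culture. This is only one of several estimators and not necessarily the one used in the project; the counts and culture size are invented.

```python
import math


def p0_mutation_rate(mutant_counts, cells_per_culture):
    """Estimate mutation rate with the p0 method.

    mutant_counts: number of mutant colonies in each parallel culture.
    cells_per_culture: final cell number per culture (assumed equal
    across cultures for this sketch).
    """
    if not mutant_counts:
        raise ValueError("need at least one culture")
    p0 = sum(1 for c in mutant_counts if c == 0) / len(mutant_counts)
    if p0 == 0.0 or p0 == 1.0:
        raise ValueError("p0 method needs some, but not all, cultures without mutants")
    m = -math.log(p0)  # expected mutation events per culture
    return m / cells_per_culture


# Invented counts: 7 of 10 cultures have no mutant colonies.
counts = [0, 0, 1, 0, 3, 0, 0, 2, 0, 0]
rate = p0_mutation_rate(counts, cells_per_culture=1e7)
print(f"{rate:.2e} mutations per cell division")
```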
CRISPR base editors are programmable DNA editing systems that induce single-nucleotide changes in DNA using a fusion protein containing a catalytically defective Cas9, a cytidine or adenine deaminase, and an inhibitor of base excision repair. This genome editing approach has the advantage that it does not require double-stranded DNA breaks or a donor DNA template. Adenine and cytidine deaminases convert their target nucleotides to other DNA bases, enabling versatile DNA editing. Base editors built on natural or engineered Cas9 variants can target genomes at different protospacer-adjacent motif (PAM) sites, which significantly expands the number of sites that can be targeted by base editing. However, a systematic analysis of the performance of different base editors has not been conducted.
I evaluated the editing efficiency and the targeting window of ten different base editors in the model organism Saccharomyces cerevisiae using a large-scale guide RNA library and MiSeq amplicon sequencing. For the pilot experiments, ten different guide RNAs (gRNAs) were designed for each base editor to target ten different regions of the genome. Specific primers were designed to amplify each target region with a high-fidelity polymerase, and the PCR products were used for amplicon sequencing. I analyzed the sequencing data to estimate the efficiency of each base editor at each site and to identify the editing window of each base editor.
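As a rough illustration of how per-position editing efficiency can be read off amplicon data, the sketch below counts, for every reference C in a target region, the fraction of reads showing a T at that position (the C-to-T change made by a cytidine base editor). It assumes reads are already aligned and trimmed to the same length as the reference; the sequences are toy examples, and the actual analysis pipeline may differ.

```python
def editing_efficiency(reference, reads, ref_base="C", edited_base="T"):
    """Fraction of reads edited at each reference position carrying ref_base.

    Returns a dict mapping 1-based position -> edited fraction.
    Assumes every read is aligned to, and as long as, the reference.
    """
    if not reads:
        raise ValueError("need at least one read")
    for read in reads:
        if len(read) != len(reference):
            raise ValueError("reads must match the reference length")
    efficiency = {}
    for pos, base in enumerate(reference):
        if base == ref_base:
            edited = sum(1 for read in reads if read[pos] == edited_base)
            efficiency[pos + 1] = edited / len(reads)
    return efficiency


# Toy target region and four toy reads.
reference = "ACGCA"
reads = ["ATGCA", "ATGTA", "ACGCA", "ATGCA"]
eff = editing_efficiency(reference, reads)
print(eff)
```

Positions with an efficiency clearly above background outline the editing window; for an adenine base editor the same function can be called with ref_base="A" and edited_base="G".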
The fundamental goal of genetics is to understand the link between genetic variation and trait variation across the entire range of biological phenomena, from genetic diseases to pathogen virulence. However, the regions identified by current trait-mapping methods often span many genes, and narrowing them down to the specific underlying genes and genetic variants can be painstaking. Recently, my lab fine-mapped manganese sensitivity to a single polymorphism in yeast by generating a mapping panel with targeted recombination events using CRISPR. I aim to apply this approach to human cells.
I start with H9 human embryonic stem cells, given that they are diploid and pluripotent. I have generated a stable H9 cell line that carries a heterozygous selectable marker on chromosome 16 close to the telomere. By introducing double-strand breaks at designed sites along chromosome 16 using CRISPR-Cas9, followed by selection for cells that have undergone loss of heterozygosity (LOH), I will be able to generate a mapping panel of cells carrying homozygous recombinant segments of different sizes extending to the telomere. Once developed, this mapping panel will allow us to map loci that control gene expression and other traits in human cells at high resolution, without being limited by recombination rates or the length of linkage blocks. By differentiating the H9 stem cells into different cell types, this mapping panel will enable fine mapping of a wide variety of cell-type-specific traits, shedding light on genetic effects in different cells.