titles & abstracts

The evolutionary study of recombination requires genetic maps from related species. However, experimental approaches to estimating recombination rates are impractical for many species of interest. Here, we construct the first fine-scale genetic map for a non-model species through sequencing the genomes of ten western chimpanzees, Pan troglodytes verus, and analysing patterns of haplotype structure. We show that chimpanzee recombination is dominated by hotspots, which show no overlap with humans, but which are similarly enriched around CpG islands and depleted within genes. At broad scales, recombination rates are correlated, but show divergence, particularly in regions of chromosomal rearrangement and most strikingly around the site of ancestral fusion in human chromosome 2. We find that chimpanzees have extensive variation in the hotspot-localising protein PRDM9 and show little evidence for sequence motifs enriched in hotspots. The changing locations of hotspots provide natural experiments through which to analyse the consequences of recombination on patterns of molecular evolution.

TBA

The overall aim of the talk is to present a general framework for
representing and studying the genetic evolution of a population
spread over some region of space. These recent models rely on a
'duality' relation between the reproduction model and the corresponding
genealogies of a sample, which greatly helps in understanding
the large-scale behaviour of local (or global) genetic diversity.
Furthermore, a great variety of scenarios can be described, ranging, e.g.,
from very local reproduction events to very rare and massive
extinction/recolonization events. This is joint work with Alison
Etheridge and Nick Barton.

Analysis of high-dimensional regression models is ubiquitous in
postgenomics. The purpose of this talk is to
understand the statistical limitations occurring when "too much" data
is available.

Consider the standard Gaussian linear regression model Y = X theta + E, where Y is a response vector in R^n and X is a design matrix in R^(n x p). Numerous works have been devoted to building efficient estimators of theta when p is much larger than n. In such a situation, a classical approach is to assume that theta is approximately sparse. I will talk about the minimax risks of estimation and testing over classes of sparse vectors as a way to grasp the limitations due to high dimensionality.

As a byproduct, we obtain a characterization of high-dimensional statistical problems that are intrinsically too difficult to be addressed.

Cancer genomes often display copy number alterations (CNAs) and/or losses of heterozygosity (LOH) (Hanahan and Weinberg, 2011). Genetic abnormalities in specific regions may be related to the aggressiveness of a cancer and be associated with clinical outcomes. In cancer, tumor suppressor genes can be deleted or mutated, whereas oncogenes can be amplified or mutated with a gain of function. At the same time, translocations can result in cancer-causing fusion proteins (BCR/ABL fusion in CML, BCL1/IGH in multiple myeloma, EWS/FLI1 in Ewing sarcoma, etc.).
With the arrival of new high-throughput sequencing technologies, our power to detect genetic abnormalities has improved significantly. Genomic breakpoints of large structural variants (i.e., translocations or large duplications and deletions) can be identified using two complementary approaches: (1) calculation and segmentation of copy number and allelic content profiles and (2) analysis of 'discordant' mate-pair/paired-end mappings (PEMs).
Investigation of copy number profiles allows identification of genomic regions of gain and loss. Three obstacles frequently arise in the analysis of cancer genomes: absence of an appropriate control sample for normal tissue, possible polyploidy, and contamination of a tumor sample by normal cells. We therefore developed a bioinformatics tool, called FREEC [2], able to automatically detect CNAs with or without the use of a control dataset. If a control sample is not available, FREEC normalizes copy number profiles using read mappability and GC-content. FREEC applies a LASSO-based segmentation procedure to the normalized profiles to predict CNAs. FREEC is able to analyze over-diploid tumor samples and samples contaminated by normal cells. If sequencing coverage is large enough (>15x), FREEC's extension, Control-FREEC, is able to calculate allelic content profiles and consequently predict regions of loss of heterozygosity.
For PEM data, one can complement the information about CNAs and LOH (i.e., the output of FREEC) with the predictions of structural variants made by another tool that we have developed, SVDetect [3]. SVDetect finds clusters of 'discordant' PEMs and uses all the characteristics of reads inside the clusters (orientation, order and clone insert size) to identify the type of structural variant. SVDetect allows identification of a large spectrum of rearrangements, including large insertions-deletions, duplications, inversions, insertions of genomic shards and balanced/unbalanced intra/inter-chromosomal translocations.
The intersection of FREEC and SVDetect outputs allows one to (1) refine the coordinates of CNAs using PEM data and (2) improve confidence in calling true positive rearrangements (particularly in ambiguous satellite/repetitive regions).
Both SVDetect and FREEC are compatible with the SAM/BAM alignment format and provide output files for graphical visualization of predicted genomic rearrangements.

1. Hanahan, D. and Weinberg, R.A. (2011) Cell, 144, 646-674.

2. V. Boeva et al. (2011), Bioinformatics, 27(2):268-9.

3. B. Zeitouni et al. (2010), Bioinformatics, 26: 1895-1896.

I present a pipeline and methodology for simultaneously estimating isoform expression and allelic imbalance in diploid organisms using RNA-seq data. This is achieved by modelling the expression of haplotype-specific isoforms. If unknown, the two parental isoform sequences can be individually reconstructed. A statistical method, MMSEQ, deconvolves the mapping of reads to multiple transcripts (isoforms or haplotype-specific isoforms). Non-uniform read generation and the insert size distribution are taken into account. The method is fast and scales well with the number of reads. I demonstrate the utility of this approach with a synthetic dataset constructed using reads from the X chromosomes of two males, as well as an F1 hybrid mouse dataset.

Next generation sequencing is quickly replacing microarrays as a technique to probe different molecular levels of the cell, such as DNA or mRNA. The technology has the advantage of providing higher resolution while reducing biases, in particular at the lower end of the spectrum. mRNA sequencing (RNAseq) data consist of counts of pieces of RNA called tags. This type of data poses new challenges for statistical analysis. We present a novel approach to model and analyze these data.

Methods and software for differential expression analysis usually use some generalization of the Poisson or Binomial distribution that accounts for overdispersion. A popular choice is the negative binomial (i.e. Poisson-Gamma) model. However, there is no consensus on which model best fits RNAseq data, and this may depend on the technology used (and also vary per tag). With RNAseq, the number of features strongly exceeds the sample size. This implies that shrinkage of variance-related parameters may lead to more stable estimates and inference. Methods to do so are available, but only for restrictive study designs, e.g. two-group comparisons or fixed-effect designs.

We present a framework that allows for a) various count models b) flexible designs c) random effects and d) shrinkage across tags by Empirical Bayes estimation of (multiple) priors. Moreover, it implements Bayesian multiplicity correction. We show the performance of our methods on simulation and illustrate our approach on a challenging data set. The data motivates use of the zero-inflated negative binomial as a powerful alternative to the negative binomial, because it leads to less bias of the overdispersion parameter and improved detection power for the low-count tags.
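
As a rough illustration of the zero-inflated negative binomial mentioned above, the sketch below writes out its probability mass function in plain Python. The parameterization (mean `mu`, size `r`, zero-inflation probability `pi0`) and the function names are assumptions chosen for illustration; they are not the interface of the framework described in the abstract.

```python
import math

def nb_pmf(k, mu, r):
    """Negative binomial pmf with mean mu > 0 and size r > 0 (variance mu + mu^2 / r)."""
    log_p = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
             + r * math.log(r / (r + mu))
             + k * math.log(mu / (r + mu)))
    return math.exp(log_p)

def zinb_pmf(k, mu, r, pi0):
    """Zero-inflated NB: with probability pi0 the count is a structural zero,
    otherwise it is drawn from the negative binomial above."""
    base = (1.0 - pi0) * nb_pmf(k, mu, r)
    return pi0 + base if k == 0 else base

# Illustrative values only: the extra mass at zero is what the plain NB lacks.
p_zero_nb = nb_pmf(0, mu=2.0, r=1.5)
p_zero_zinb = zinb_pmf(0, mu=2.0, r=1.5, pi0=0.3)
```

With `pi0 = 0` the two functions coincide, so the negative binomial is recovered as a special case.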

High-throughput RNA sequencing (RNA-seq) offers unprecedented power to capture the real dynamics of gene expression. Experimental designs with extensive biological replication give a unique opportunity to exploit this feature and distinguish expression profiles with higher resolution. RNA-seq data analysis methods so far have been based on two well-known data distributions, the Poisson and the negative binomial (NB), which accounts for overdispersion. In this talk we will show, however, that the rich diversity of expression profiles requires additional count data distributions to capture other oddities such as zero count inflation (i.e., in lowly expressed genes, the proportion of zero counts may be greater than expected under an NB) and heavy tail behavior (i.e., a large dynamic range within the same expression profile). Through simulation studies and using real datasets, we will demonstrate that our proposed method results in shorter and more accurate lists of differentially expressed genes than existing methods such as edgeR or DESeq.

I will describe how Approximate Bayesian Computation can be used for statistical inference of demographic parameters in population genetics. I will provide two examples of ABC applications in population genetics. The first example pertains to the information provided by human gene trees to distinguish between models of human origins. In the second example, I will show that resequencing data provide no evidence for a bottleneck in Africa during the penultimate ice age.
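
For readers unfamiliar with the technique, the following is a minimal sketch of ABC rejection sampling on a toy problem, not on the demographic models of the talk. The uniform prior, the Gaussian simulator, the sample mean as summary statistic, and all names are assumptions made for illustration.

```python
import random
import statistics

def abc_rejection(observed_mean, n_obs, n_draws, tolerance, seed=0):
    """Keep prior draws whose simulated summary statistic lands close to the observed one."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, 10.0)                       # draw a parameter from the prior
        sim = [rng.gauss(theta, 1.0) for _ in range(n_obs)]  # simulate data under theta
        if abs(statistics.fmean(sim) - observed_mean) <= tolerance:
            accepted.append(theta)                           # approximate posterior sample
    return accepted

posterior = abc_rejection(observed_mean=4.0, n_obs=20, n_draws=5000, tolerance=0.2)
```

The accepted values approximate the posterior of `theta`; a smaller `tolerance` gives a closer approximation at the price of a lower acceptance rate.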

To understand how existing organisms evolved to their present form, one can compare statistical features of genomes within a population to predictions of evolutionary models. Theoretical developments have produced a good understanding of how positive selection at a few sites affects genetic variation at linked neutral sites and of how strong selection at many sites affects variation at linked neutral sites. Recent sequence data from a variety of populations indicate that moderate selection acting on linked sites may be common, and simulations show that it can have a significant impact on observed sequence variation. I will present a newly developed framework which allows us to understand the expected patterns of genetic variation when weak or moderate selection acts on many linked sites. I will show that in this limit the probability of allelic configurations cannot be described by any neutral model, indicating that it is possible to detect selection from patterns of sampled allelic diversity. I will then combine this analysis with the structured coalescent approach to trace the ancestry of individuals through the distribution of fitnesses within the population. I will show that selection alters the statistics of genealogies compared to a neutral population of varying size, providing a basis for detecting negative selection in sequence data.

Phylogenetic trees of present-day species allow the inference of the rates of speciation and extinction that led to present-day diversity. Classically, inference methods assume a constant rate of diversification. I will present a new methodology which can infer changes in diversification rates through time, can detect mass extinction events, and can account for density-dependent speciation.

I use the method to test the hypothesis of accelerated mammalian diversification following the extinction of the dinosaurs (65 Ma); none of the analyzed mammalian phylogenies showed a change in diversification rates at 65 Ma. Application of the method to bird data (Dendroica) reveals a density-dependent speciation process, in agreement with previous studies. The new method further allows the extinction rate to be quantified; it is estimated to be significantly larger than zero for these birds.

The methods can easily be applied to other phylogenies using the R package TreePar available on CRAN.

The molecular phylogenies of present-day species are widely used to understand what ecological and evolutionary processes have shaped current biodiversity patterns. The classical macroevolutionary models that are currently used for such inferences suffer from a major limitation: they consider diversification at the level of lineages, ignoring the underlying individuals that make up these lineages. In addition, they typically assume that diversification rates are homogeneous across lineages, and as a result, the trees they produce are much more balanced than empirical trees. Alternative models, stemming from the Neutral Theory of Biodiversity, take individuals into account and produce trees that are much more consistent with empirical trees in terms of balance. However, under the basic Neutral Model of Biodiversity with constant metacommunity size, the branch-length patterns of reconstructed phylogenies are not realistic.

Here, we relax the assumption of a constant metacommunity size, and we analyze the impact of demographic parameters on the shape of reconstructed phylogenies. In particular, we investigate the likelihood of a reconstructed phylogeny (with or without sampling, with or without contemporary abundances) under simple population dynamics models subjected to a point model of speciation. We propose a dynamic programming algorithm that allows us to evaluate this likelihood numerically, and thus to infer the model parameters from empirical data. By bridging gaps between micro- and macroevolutionary models, our work offers promising perspectives for integrating ecology into macroevolutionary research.

With the availability of many 'omics' datasets measured on the same set of samples, the development of methods capable of jointly analyzing multiple datasets becomes crucial. Such development remains a major technical and computational challenge, as most approaches suffer from high data dimensionality. (Regularized) Canonical Correlation Analysis and PLS regression are classically applied to this type of structured dataset but handle only the two-block configuration. Regularized Generalized Canonical Correlation Analysis (RGCCA), proposed in [Tenenhaus and Tenenhaus, 2011], extends regularized Canonical Correlation Analysis to the case of more than two blocks. One distinct advantage of the optimization problem that defines RGCCA is that a remarkably large number of well-known multi-block data analysis methods are recovered as particular cases.
Tenenhaus, A. and Tenenhaus, M. (2011) Regularized Generalized Canonical Correlation Analysis, Psychometrika, 76:257–284

In silico prediction of drug-target interactions from heterogeneous
biological data is critical in the search for drugs and therapeutic
targets for known diseases such as cancers. In this study, we
investigate the correlation between the chemical space of compound
structures, the genomic space of genes/proteins, the pharmacological
space of phenotypic effects, and the topology of drug-target
interaction networks. We then develop a new method to predict unknown
drug-target interactions from chemical, genomic, and pharmacological
data on a large scale. The originality of the proposed method lies in
the formalization of the drug-target interaction inference as a
supervised learning problem for a bipartite graph, the lack of need
for 3D structure information of the target proteins, and in the
integration of chemical, genomic, and pharmacological spaces in a
unified framework. In the results, we make predictions for four
classes of important drug-target interactions involving enzymes, ion
channels, GPCRs, and nuclear receptors. Our comprehensively predicted
drug-target interaction networks enable us to suggest many potential
drug-target interactions and to increase research productivity toward
genomic drug discovery.

We consider multivariate two-sample tests of means, where the location shift between the two populations is expected to be related to a known graph structure. An important application of such tests is the detection of differentially expressed genes between two patient populations, as shifts in expression levels are expected to be coherent with the structure of graphs reflecting gene properties such as biological process, molecular function, regulation, or metabolism. For a fixed graph of interest, we demonstrate that accounting for graph structure can yield more powerful tests under the assumption of a smooth distribution shift on the graph. We also investigate the identification of non-homogeneous subgraphs of a given large graph, which poses both computational and multiple testing problems. The relevance and benefits of the proposed approach are illustrated on synthetic data and on breast cancer gene expression data analyzed in the context of KEGG pathways.

Recent developments in comparative studies at the macro-evolutionary level have revived old questions about patterns of species diversification or life-history evolution, or about the coupling between diversification rate, life-history, and substitution rate. All these developments generally rely on time-calibrated phylogenies obtained using independent means, usually a Bayesian or maximum penalized likelihood divergence-time estimation method. Doing so, however, amounts to ignoring uncertainty about divergence times, while overlooking potentially relevant cross-talk between the estimation of divergence times, diversification patterns and life-history evolution.

We will present a Bayesian probabilistic framework for modeling the macroevolutionary process in an integrative manner. The framework takes as an input a multiple sequence alignment, data about life-history traits of extant species, and fossil calibrations. The underlying probabilistic model assumes correlated variation of life-history traits and substitution parameters along the phylogeny, while relying on an explicit species diversification process for specifying the prior on divergence times. We will present an application at the level of placental mammals, showing that extensive correlations between life-history evolution and substitution patterns are detected, providing stimulating observations for testing molecular evolutionary models. In addition, divergence time estimation and life-history reconstruction can be significantly impacted by considerations about the diversification process, or by the correlation structure between substitution rate and life-history.

The common disease-rare variant (CDRV) hypothesis has recently received much attention. Indeed, for complex diseases, susceptibility variants are likely to span a wide range of allele frequencies and effect sizes. Classical GWAS are powerful in detecting common susceptibility variants, but less so under the CDRV hypothesis. As a complementary tool, linkage analysis may help to detect rare variants of recent origin that association studies cannot.
Traditionally, linkage mapping used microsatellite markers spaced evenly across the genome. An alternative approach is to use high-density maps of single nucleotide polymorphisms (SNPs). SNPs have several advantages over microsatellites but, due to their lower heterozygosity, a larger number of SNPs is necessary to achieve similar levels of information content. Moreover, several studies have demonstrated that dense SNP arrays can offer equal or superior power to detect linkage compared with low-density microsatellite maps. However, the presence of high linkage disequilibrium (LD) between SNPs can increase type I error rates in classical nonparametric multipoint linkage analysis.
Focusing on isolated populations, characterized by large pedigrees and a small number of founders, may be a promising strategy to map quantitative trait loci (QTL). Although simulations and theoretical power calculations have demonstrated that large pedigrees provide more power for QTL mapping than smaller families do, the complexity of the large genealogies characterizing isolated populations often prohibits using them directly for linkage analysis [Dyer et al, 2001]. The variance components (VC) method [Amos, 1994] is a popular solution to such complexity. However, few studies have evaluated the performance of the VC method using dense SNP arrays.
Here, we investigate approaches to linkage analysis with a dense SNP map in a semi-isolated Alpine population [Pattaro et al, 2007]. To evaluate the amount of linkage information that can be extracted from our whole population, we compare, via extensive simulation, type I error and power estimates of the VC test using either extended families or smaller subsets of families generated by a multiple splitting approach [Bellenguez et al, 2009].

In recent years gene expression studies have increasingly made use of next generation sequencing technology, and in turn, research concerning the appropriate statistical methods for the analysis of digital gene expression has flourished. In this work, we focus on the question of clustering digital gene expression profiles as a means to discover groups of co-expressed genes. Clustering analyses based on metric criteria such as the K-means algorithm and hierarchical clustering have been used in the past to cluster microarray-based measures of gene expression as they are rapid, simple, and stable. However, such methods require both the choice of metric and criterion to be optimized, as well as the selection of the number of clusters. An alternative to such methods is provided by probabilistic clustering models, where the objects to be classified (genes) are considered to be a sample of a random vector and a clustering of the data is obtained by analyzing the density of this vector. In this work, as in previous methods defined for the clustering of serial analysis of gene expression (SAGE) data, we use Poisson loglinear models to cluster count-based high-throughput sequencing (HTS) observations; however, rather than using such a model to define a distance metric to be used in a K-means or hierarchical clustering algorithm, we make use of finite mixtures of Poisson loglinear models. This framework has the advantage of providing straightforward procedures for parameter estimation and model selection, as well as an a posteriori probability for each gene of belonging to each cluster. A set of simulation studies compares the performance of the proposed model with that of two previously proposed approaches for SAGE data. We also study the performance of the proposed Poisson mixture model on real high-throughput sequencing data.
Keywords: Mixture models, clustering, co-expression, RNA-seq, EM-type algorithms, HTSCluster (R package)
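
To make the mixture idea concrete, here is a deliberately simplified sketch of an EM algorithm for a finite mixture of independent Poisson laws. It omits the loglinear structure (library-size and gene-level terms) of the model in the abstract and is not the HTSCluster implementation; the function names, the starting values, and the toy counts are all assumptions for illustration.

```python
import math

def poisson_logpmf(y, lam):
    """Log-probability of count y under a Poisson law with mean lam > 0."""
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def em_poisson_mixture(counts, init_means, n_iter=50):
    """counts: one count vector per gene; init_means: one starting mean vector per cluster."""
    n, d, n_clusters = len(counts), len(counts[0]), len(init_means)
    means = [list(m) for m in init_means]
    weights = [1.0 / n_clusters] * n_clusters
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each gene (log-sum-exp for stability).
        resp = []
        for y in counts:
            logp = [math.log(weights[k])
                    + sum(poisson_logpmf(y[j], means[k][j]) for j in range(d))
                    for k in range(n_clusters)]
            top = max(logp)
            unnorm = [math.exp(v - top) for v in logp]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: weighted proportions and weighted mean counts (floored to stay positive).
        for k in range(n_clusters):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = [max(sum(r[k] * y[j] for r, y in zip(resp, counts)) / nk, 1e-8)
                        for j in range(d)]
    return weights, means, resp

# Toy data: six "genes" measured in two conditions, with two obvious expression patterns.
counts = [[1, 20], [2, 18], [0, 22], [19, 1], [21, 3], [20, 0]]
weights, means, resp = em_poisson_mixture(counts, init_means=[[5.0, 10.0], [10.0, 5.0]])
labels = [max(range(len(r)), key=lambda k: r[k]) for r in resp]
```

Each row of `resp` holds the a posteriori cluster membership probabilities for one gene, which is the quantity the abstract highlights as an advantage over metric-based clustering.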

Change-point detection problems arise in genome analysis for the discovery of CNVs or new transcripts. Microarray data have a length of less than 1 million points, so the Dynamic Programming Algorithm (DPA) introduced by Bellman in 1961 [1], with a complexity of O(Kn²), can be used to recover the optimal segmentation. Recent NGS technology produces data of length up to n = 1 billion points, for which the DPA is not efficient. Two main alternatives exist: the CART algorithm (Breiman et al., 1984 [2]; Gey and Lebarbier, 2008 [3]), a heuristic that approximates the optimal solution with a complexity C bounded by O(n log n) ≤ C ≤ O(n²), and the Pruned DPA (PDPA) (Rigaill, 2010 [4]), an exact algorithm with a worst-case complexity of O(Kn²) that is empirically faster than O(Kn log n) and is suited to one-parameter loss functions.
In this talk, we will focus on the PDPA and its application to NGS data analysis. The Negative Binomial model is very commonly used to describe sequencing data. The algorithm recovers the exact optimal solution for losses such as the homoscedastic Gaussian and the Poisson, but the procedure has to be adapted to take into account the overdispersion parameter φ of the Negative Binomial, which cannot be estimated within the PDPA. Because of its effect on the resulting segmentation, we propose a procedure that estimates φ separately from the PDPA.
The simulation section of the talk will show that our algorithm recovers the same break-points as the DPA, but much faster. It will then show the influence of the choice of the overdispersion parameter of the Negative Binomial law on the recovered cost and segmentation. As an application, the location of genes and UTRs on S. cerevisiae chromosome 1 will be studied with our segmentation method.
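
For orientation, the sketch below implements the classical, unpruned dynamic programming recursion with a Gaussian least-squares cost, which is the O(Kn²) baseline the abstract refers to. It is not the PDPA and does not handle the Negative Binomial loss; the function name and the toy signal are assumptions for illustration.

```python
def segment_dp(signal, n_segments):
    """Optimal least-squares segmentation of `signal` into `n_segments` pieces."""
    n = len(signal)
    # Prefix sums let us evaluate any segment cost in O(1).
    s1, s2 = [0.0], [0.0]
    for x in signal:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(i, j):
        """Sum of squared deviations of signal[i:j] from its mean."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: minimal cost of cutting signal[:j] into k segments.
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):  # i is the start of the last segment
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j] = c
                    back[k][j] = i
    # Backtrack through the stored segment starts, then drop the leading 0.
    starts, j = [], n
    for k in range(n_segments, 0, -1):
        j = back[k][j]
        starts.append(j)
    return starts[::-1][1:], best[n_segments][n]

change_points, total_cost = segment_dp([1, 1, 1, 5, 5, 5, 5, 2, 2], n_segments=3)
```

The triple loop over `k`, `j` and `i` is what gives the O(Kn²) cost; pruning strategies such as the one discussed in the talk aim to avoid visiting most candidate values of `i`.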

References

[1] Richard Bellman. On the approximation of curves by line segments using dynamic programming. Commun. ACM, 4(6):284, 1961.

[2] Breiman, Friedman, Olshen, and Stone. Classification and regression trees. Wadsworth and Brooks, 1984.

[3] Servane Gey and Emile Lebarbier. Using CART to detect multiple change points in the mean for large sample. http://hal.archives-ouvertes.fr/hal..., February 2008.

[4] Guillem Rigaill. Pruned dynamic programming for optimal multiple change-point detection. arXiv:1004.0887, April 2010.

Context: GWAS are widely used to investigate the connection between genotypic and phenotypic variation with respect to a given trait (e.g. a given disease). Assessing the statistical power of such studies is crucial. Power is empirically estimated by simulating realistic samples under a disease model H1. For this purpose, the gold standard consists in simulating the genotypes given the observed phenotypes (case or control), thus ensuring that the total number of cases stays unchanged. This method is implemented in the reference software Hapgen. We study an alternative approach for simulating samples under H1 that does not require generating new genotypes for each simulation, but only phenotypes.
Methods: We propose to simulate new phenotypic datasets such that a) the phenotypes are in accordance with the corresponding observed genotypes under the chosen model H1; b) the total number of cases is the same as in the observed dataset. To do so, we suggest three algorithms: i) a simple rejection algorithm; ii) an MCMC approach; and iii) an exact and efficient backward sampling algorithm. We validated our three algorithms both on a toy dataset and by comparing them with Hapgen on a more realistic dataset. As an application, we then conducted a simulation study on a 1000 Genomes Project dataset consisting of 629 individuals (314 cases) and 8,048 SNPs from chromosome X. We arbitrarily defined an additive disease model with two susceptibility SNPs and an epistatic effect.
Results: The three algorithms are consistent, but backward sampling is dramatically faster than the others. Our approach also gives results consistent with Hapgen. On our application data we showed that our limited design requires biological prior knowledge to restrict the investigated region. We also showed that epistatic effects can play a significant role even when simple marker statistics (e.g. trend) are used. We finally showed that the overall performance of a GWAS strongly depends on the prevalence of the disease: the larger the prevalence, the better the power.
Conclusions: Our approach is a valid alternative to Hapgen-type methods; it is not only dramatically faster but also has two main advantages: 1) there is no need for sophisticated genotype models (e.g. haplotype frequencies or recombination rates); 2) the choice of the disease model is completely unconstrained (number of SNPs involved, gene-environment interactions, hybrid genetic models, etc.). Our three algorithms will soon be available in an R package called waffect.
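
The first of the three algorithms can be sketched as follows, under one natural reading of "simple rejection": draw each phenotype independently from its disease probability under H1, and keep the draw only if the total number of cases matches the observed one. This is an illustrative sketch, not the waffect code; the additive risk function, its coefficients, and all names are hypothetical.

```python
import random

def additive_risk(genotype, baseline=0.2, effect=0.2):
    """Hypothetical additive disease model: risk grows linearly with the risk-allele count."""
    return baseline + effect * genotype

def simulate_phenotypes(case_probs, n_cases, rng, max_tries=100000):
    """Rejection sampling of 0/1 phenotypes conditional on exactly n_cases cases."""
    for _ in range(max_tries):
        pheno = [1 if rng.random() < p else 0 for p in case_probs]
        if sum(pheno) == n_cases:  # accept only draws with the observed number of cases
            return pheno
    raise RuntimeError("no accepted draw; the acceptance rate is too low")

genotypes = [0, 1, 2, 0, 1, 2, 0, 0, 1, 2]       # observed genotypes are kept fixed
probs = [additive_risk(g) for g in genotypes]    # P(case | genotype) under H1
phenotypes = simulate_phenotypes(probs, n_cases=4, rng=random.Random(1))
```

The acceptance rate drops quickly as the sample grows, which is consistent with the abstract's remark that backward sampling is dramatically faster.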

In cancer studies, allele-specific copy-number variations (aCNV) of DNA along the chromosomes are used to identify markers for survival analysis and/or treatment response. aCNV combined with high-density SNP arrays provide precise localization and estimation of genomic alterations [1]. To obtain an adequate signal-to-noise ratio, aCNV analyses in cancer studies are usually carried out with matched samples. For each patient, the information from two SNP arrays is combined, one from a tumor biopsy and the other from healthy tissue. This matching has several drawbacks, the main one being its technical and financial cost [5].
In this study, we show that it is possible to recover from a single tumor sample the same aCNV information, at least up to the unknown proportion of tumor tissue in the sample. This result is achieved by a two-step procedure: a segmentation of the total copy number based on the minimization of a penalized least-squares contrast over a very large class of models [4], followed by appropriate modeling of the allelic imbalance combined with an EM algorithm [3]. This two-step procedure allows proper estimation of the aCNV. Our aCNV segmentation is moreover implemented with efficient algorithms, allowing estimation on a full genome in about a minute. We have developed our approach on Affymetrix Mapping250K_Nsp SNP arrays in order to fill the gaps left by the aroma package of Bioconductor [2].

References

[1] Henrik Bengtsson, Pierre Neuvial, and Terence Speed. TumorBoost: Normalization of allele-specific tumor copy numbers from a single pair of tumor-normal genotyping microarrays. BMC Bioinformatics, 11(1):245, 2010.

[2] Bioconductor. http://www.bioconductor.org/.

[3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 1977.

[4] Y. Rozenholc, T. Mildenberger, and U. Gather. Combining regular and irregular histograms by penalized likelihood. Comput. Stat. Data Anal., 54:3313–3323, December 2010.

[5] J. Staaf, D. Lindgren, J. Vallon-Christersson, A. Isaksson, H. Goransson, G. Juliusson, R. Rosenquist, M. Hoglund, A. Borg, and M. Ringner. Segmentation-based detection of allelic imbalance and loss-of-heterozygosity in cancer cells using whole genome SNP arrays. Genome Biology, 9(9):R136, 2008.

Many methods have been developed to estimate the set of relevant variables in a sparse linear model Y = Xb + e, where the dimension p of b can be much higher than the length n of Y. In particular, many model selection methods based on a penalized criterion have been developed. The best known is probably the Lasso (Tibshirani, 1996), an l1 penalization of the least squares estimate that shrinks some irrelevant coefficients to zero and hence yields an estimate of the set of relevant variables. Many results on the Lasso are available, e.g. consistency of the Lasso in high-dimensional linear regression (Zhang, 2006) or sparsity oracle inequalities (Bunea, 2007).
Here we approach the problem of variable selection from a different point of view. We propose two new methods based on multiple hypotheses testing, either for ordered or non-ordered variables. Our procedures are inspired by the testing procedure proposed by Baraud et al (2003).
When the variables are ordered, the procedure is a simple sequential multiple hypotheses testing procedure using F-tests. When the variables are not pre-ordered, we distinguish whether the variance is known or not. The procedures are two-stage procedures: the first step orders the variables taking Y into account, and the second performs sequential multiple hypotheses testing using new statistics.
All the new procedures are proved to be powerful under some conditions on the data, and their properties are non-asymptotic. They gave better results in estimating the set of relevant variables than the False Discovery Rate (FDR) procedure (used for variable selection by Bunea et al, 2006), the Lasso, and the Bolasso technique (Bach, 2009), both in the common case (pn).

The interpretation of data-driven experiments in genomics often involves a search for biological categories that are enriched for the responder genes identified by the experiments. I will present Model-based Gene Set Analysis (MGSA), in which we tackle the problem by turning the question around. Instead of searching for all significantly enriched groups, we search for a minimal set of groups that can explain the data. Our model penalizes the number of active groups, thus naturally providing parsimonious solutions. Applications on yeast and the HeLa cell line demonstrate that MGSA provides high-level, summarized views of core biological processes and can correctly eliminate confounding associations.

Motivation: Target enrichment, also referred to as DNA capture, provides an effective way to focus sequencing efforts on a genomic region of interest. This approach is commonly employed to interrogate exons, which are likely to harbor variants that have a causal role in disease. Capture data is typically used to detect single nucleotide variants and small insertions or deletions. However, it can also be used to detect copy number alterations (CNAs), which is particularly useful in the context of cancer, where such changes occur frequently. Methods for detecting CNAs in array-based data are ill suited for sequencing data. Here we present a statistical modeling approach for detecting CNAs specifically developed for capture data.

Results: In copy number analysis it is common practice to determine ratios between test and control samples, but this approach results in a loss of information. Calculating ratios from sequencing data disregards the total coverage at a locus and is prone to outliers based on low coverage in the reference sample. Rather than modeling the ratio, we instead modeled the coverage of the test sample as a linear function of the control sample. Another benefit of this approach is that it is able to deal with regions that are completely deleted, which are problematic for methods that use log ratios. To demonstrate the utility of our approach, we used capture data to determine copy number for a set of 600 genes including the entire human kinome in a panel of nine breast cancer cell lines. We found high concordance between our results and those generated using a SNP genotyping platform (Affymetrix SNP6.0). When we compared our results to other methods, including ExomeCNV, we found that our approach produced better overall correlation with SNP data and was less prone to outliers.
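
The contrast with log ratios can be illustrated with a small sketch. The abstract does not specify how the linear relationship is fitted, so the least-squares line through the origin used here, the function names, and the toy coverage values are all assumptions rather than the authors' model.

```python
def fit_slope(control, test):
    """Least-squares slope through the origin for the model test ~ slope * control."""
    return sum(c * t for c, t in zip(control, test)) / sum(c * c for c in control)

def coverage_residuals(control, test, slope):
    """Observed minus expected test coverage; strongly negative values suggest loss."""
    return [t - slope * c for c, t in zip(control, test)]

# Loci assumed copy-neutral (hypothetical coverage values) calibrate the slope.
neutral_control = [100, 200, 150, 120]
neutral_test = [50, 100, 75, 60]
slope = fit_slope(neutral_control, neutral_test)

# A completely deleted locus: test coverage is 0, so log(test / control) is undefined,
# whereas the residual against the fitted line is still a finite number.
deleted_residual = coverage_residuals([180], [0], slope)[0]
```

Working on the coverage scale keeps fully deleted regions in the analysis and lets low-coverage control loci contribute less than they would through a ratio.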