Methods detecting rhythmic gene expression are only reliable for strong signal

The nycthemeral transcriptome embodies all genes displaying a rhythmic variation of their mRNAs periodically every 24 hours, including but not restricted to circadian genes. In this study, we show that the nycthemeral rhythmicity at the gene expression level is biologically functional and that this functionality is more conserved between orthologous genes than between random genes. We used this conservation of the rhythmic expression to assess the ability of seven methods (ARSER, Lomb Scargle, RAIN, JTK, empirical-JTK, GeneCycle, and meta2d) to detect rhythmic signal in gene expression. We have contrasted them to a naive method, not based on rhythmic parameters. By taking into account the tissue-specificity of rhythmic gene expression and different species comparisons, we show that no method is strongly favored. The results show that these methods designed for rhythm detection, in addition to having quite similar performances, are consistent only among genes with a strong rhythm signal. Rhythmic genes defined with a standard p-value threshold of 0.01 for instance, could include genes whose rhythmicity is biologically irrelevant. Although these results were dependent on the datasets used and the evolutionary distance between the species compared, we call for caution about the results of studies reporting or using large sets of rhythmic genes. Furthermore, given the analysis of the behaviors of the methods on real and randomized data, we recommend using primarily ARS, empJTK, or GeneCycle, which verify expectations of a classical distribution of p-values. Experimental design should also take into account the circumstances under which the methods seem more efficient, such as giving priority to biological replicates over the number of time-points, or to the number of time-points over the quality of the technique (microarray vs RNAseq). GeneCycle, and to a lesser extent empirical-JTK, might be the most robust method when applied to weakly informative datasets. Finally, our analyzes suggest that rhythmic genes are mainly highly expressed genes.

The nycthemeral transcriptome is characterized by the set of genes that display a 2 rhythmic change in their mRNAs levels with a periodicity of 24 hours. These include, 3 but are not limited to, circadian genes whose rhythm is endogenous and entrainable. 4

1/24
In baboon, 82% of protein-coding genes have been reported to be rhythmic in at least 5 one tissue [1]. The nycthemeral rhythmicity of these transcripts can be driven by the 6 internal oscillator clock or by other circadian input such as food-intake, the light-dark 7 cycle, sleep-wake behavior, or social activities. Moreover, the nycthemeral transcriptome 8 is tissue-specific [2] [3]. Given the importance of biological rhythms in understanding 9 biology and medicine, many algorithms have been proposed to detect such rhythms. 10 Some were developed specifically for biological data, while others were adapted from 11 other fields where periodicity is important, such as Lomb Scargle (LS). Most methods are 12 based on non-parametric models that search for referenced patterns, classically sinusoid, 13 called time-domain methods, while some are frequency-domain methods based on spectral 14 analysis [4]. Some of them have been designed to detect more diverse waveforms, including 15 asymmetric patterns, such as RAIN [5] or empirical JTK (empJTK) [6]. For instance, 16 RAIN outperformed the original JTK CYCLE algorithm for simulated data consisting 17 of sinusoidal and ramp waveforms [6]. Thus, methods differ in the conception of their 18 algorithm and in how they take into account features of the dataset such as curve shapes, 19 period, noise level, presence of missing data, phase shifts, sampling rates [7], asymmetry 20 of the waveform, or the number of cycles (total period length of the experiment). Each 21 method has in principle different strengths and weaknesses for some features of the 22 dataset. In Arabidopsis, HAYSTACK identified 45% more cycling transcripts than 23 COSOPT, mainly due to the inclusion of a 'spike' pattern in its model [8]. Deckard 24 et al. [7] studied how four methods (LS, JTK CYCLE, de Lichtenberg, and persistent 25 homology) performed across a variety of organisms and periodic processes. Based on 26 synthetic data, they investigated the algorithms' ability to distinguish periodic from 27 non-periodic profiles, to recover period, phase and amplitude, and they evaluated their 28 performance for different signal shapes, noise levels, and sampling rates. They proposed 29 a decision tree to recommend one of these four algorithms based on these features of 30 datasets [7]. 31 The performance of algorithms to identify such periodic signal has been assessed so 32 far based on synthetic (i.e., simulated) data, or on benchmark sets of known cycling 33 genes. Hughes et al. [9] recently published guidelines for the analysis of biological 34 rhythms and proposed a web-based application (CircaInSilico) for generating synthetic 35 genome biology data to benchmark statistical algorithms for studying biological rhythms. 36 While such benchmarks are useful to explore the behavior of methods in a set of cases, 37 the applicability of results to real data is limited. For example, simulations need to 38 impose an a priori fluctuation pattern, typically cosine. The fluctuation of transcript 39 abundance of core clock genes does seem to follow a cosine shape [10], but sometimes 40 follows non-sinusoidal periodic patterns in mouse liver (e.g., Nr1d1 or Arntl) [11] (based 41 on the data from [12]). The fluctuations of the nycthemeral transcriptome are entrained 42 by a complex network involving external cues [13] [14] [15] [16], as simplified in Figure 43 1a, which might yield non-sinusoidal periodic patterns among rhythmic genes even if 44 circadian genes were sinusoidal. But the biological relevance of these waveforms is still 45 not clear. This raises two issues: benchmarks based on simulations are biased towards 46 methods that detect the same types of patterns as simulated; and when an algorithm 47 detects more rhythmic genes, it could be more true positives or more false positives. 48 When pattern constraints are released this increases the number of genes detected as 49 rhythmic, but is not necessarily informative on the capacity of the algorithm to detect 50 genes whose rhythmicity is biologically relevant. 51 Using real data and randomization tests, we compared seven methods: JTK CYCLE 52 (JTK) [18], LS [19]& [20], ARSER (ARS) [4], and meta2d (Fisher integration of 53 LS, ARS, and JTK), are frequently used by many studies and are all included within 54 the MetaCycle R package [21]. We also included empirical JTK (empJTK) [6] and 55 RAIN [5], which have been recently developed to deal with more non cosine patterns 56 and with asymmetric waveforms. empJTK and RAIN aim to improve the original JTK 57 algorithm which assumed that any underlying rhythms have symmetric waveforms [5]. 58 Finally, robust.spectrum [22] extents a robust rank-based spectral estimator to the 59 detection of periodic signals. It is integrated in the GeneCycle R package [23] and called 60 GeneCycle in this paper. We excluded de Lichtenberg [24], Persistent Homology [25], 61 COSOPT [26], Fisher's G test [27], MAPES, Capon, and other algorithms for reasons 62 such as i. difficult accessibility of the software which limit their use by researchers, 63 ii. their higher sensitivity to certain features of the data such as the sampling density, 64 the number of replicates and/or periods, noise level, and waveform, iii. their weaker 65 efficiency on simulated data or known cycling genes, or iv. their previously reported 66 less good detection of non-sinusoidal periodic patterns [7] [4] [28] [29] [30] [6]. We first 67 analysed the behavior of these seven methods applied to a variety of real datasets in 68 animals, and within each dataset, we compared results between representative gene 69 subsets such as highly and lowly expressed genes, known cycling genes, and randomized 70 data. Contrary to real data, randomized data is not expected to show any signal of 71 rhythmicity, which we used to test proper statistical behavior under the null hypothesis. 72 Secondly, as function tends to be conserved between orthologs [31], true rhythmic genes 73 are expected to be enriched in orthologs that are themselves rhythmic in other species. 74 Indeed, evolutionary conservation provides a valuable filter through which to highlight 75 functional biological networks, notably for clock-controlled functions [32]. The biological 76 relevance of rhythmic genes is expected to be higher for rhythmic orthologs. On the 77 other hand, errors in the prediction of rhythmicity by each method are not expected 78 to be conserved between orthologs. Thus, the best methods are expected to report 79 rhythmic genes with a high proportion of rhythmic orthologs. We used this approach to 80 compare the algorithms based on their ability to capture the evolutionary conservation 81 signal within nycthemeral genes, and compared them to a Naive method.

83
We used gene expression time-series datasets that come from circadian experiments and 84 kept the data from "normal" individuals (apparently healthy, wild-type, no treatment) 85 for seven species (Table S1), allowing comparisons among vertebrates and among insects. 86 We benchmarked methods on animal data since organ homology allowed to compare 87 datasets for which we expect conservation of functional patterns (tissue-specific rhythms). 88 For readability, we present vertebrate results in the main figures and insect results in 89 supplementary results (Additional file 5). Apart from the rat and Anopheles datasets, 90 data with several biological replicates were obtained already normalized over replicates 91 (one value per time-point). 92 We define a rhythmic gene as a gene which displays a nycthemeral change in its mRNA 93 abundance, i.e. occurring over 24 hours and repeated every 24 hours. All these rhythmic 94 genes represent the nycthemeral transcriptome. Different organs have been reported 95 to have transcriptomes which are more or less rhythmic [2]. The rhythmic expression 96 of these genes can be entrained directly by the internal clock or indirectly by external 97 inputs, such as the light-dark cycle or food-intake [13] [14] [15] [16] (Fig 1a). We consider 98 the entirety of these rhythms to be a biologically relevant signal to detect. That is 99 why we preferred data from light-dark and ad-libitum experimental conditions whenever 100 possible (Additional Table 1), as providing a better representation of wild conditions.

Internal clock
Rhythmic gene 24h period Rhythmic gene Rhythmic gene a b Figure 1. The nycthemeral transcriptome is the group of genes whose mRNAs have periodic variations with a 24h period, called rhythmic genes. To detect these rhythmic genes, we applied seven methods to time-series datasets that produced different density distribution of p-values. a) Simplified diagram of the entrainment of nycthemeral gene expression. Environmental cues include the light-dark cycle, food-intake, sleep-wake behavior, social activities, or any other 24h periodic event. b) Density distribution of p-values obtained before (raw) and after the default correction (software) for the seven methods applied to mouse liver data (microarray) sub-categorized in: i. randomized data which represents the null hypothesis; ii. randomized data restricted to the first and fourth quartiles of the median gene expression level, to check for the impact of expression level under the null; iii. the full original dataset; iv. the first and fourth quartiles of the median gene expression level of the original data; and v. a subset of known cycling genes (99 genes from KEGG "circadian entrainment"). The default p-values of ARS, GeneCycle, and LS are uncorrected.

5/24
excess of high p-values for lowly expressed genes (Fig 2a). The p-values distributions 151 imply that most methods detect almost all highly expressed genes, and almost no lowly 152 expressed genes, as "rhythmic" (Fig 1b). This led us to be concerned about the risk of 153 a very high false-positive rate for highly expressed genes. This risk could be produced 154 by the high sensitivity to large variations of gene expression levels. To assess this, we 155 normalized gene expression values by a Z-score transformation so that they all had the 156 same mean and variance (see Methods), and applied methods after this transformation 157 (Fig 2b). This normalization of gene expression values did not change the p-values 158 distributions within highly expressed genes, and particularly did not recover rhythmicity 159 within lowly expressed genes (Fig 2c). This was not due to sampling biases of microarray 160 data since results are consistent with RNAseq data (Additional file 1: Fig S3c). Thus, 161 the differences obtained between highly and lowly expressed genes could be explained 162 either by a truly larger number of highly expressed genes among the rhythmic genes, or 163 by experimental noise which masks circadian signal for lowly expressed genes. Overall, 164 ARS, empJTK, and GeneCycle had the best behavior, producing a uniform distribution 165 under the null hypothesis, and a skew towards p-value=0 for all empirical data. Fig 2a 166 indicates that less rhythmic organs, such as the muscle or the aorta, generated a wider 167 distribution of p-values with ARS and GeneCycle, and to a lesser degree with empJTK, 168 for both the lowly and highly expressed genes.

Microarray Each 2h over 48h
RNAseq Each 6h over 48h b) The restriction of microarray time-series to the same time-points as in the RNAseq series produces similar p-value distributions to those obtained with RNAseq. This supports a major role of the temporal resolution for method results, relative to a minor role for the difference between RNAseq and microarrays.
From data of the same mouse experiment [2], we observed differences of p-value density 170 distributions between microarray and RNAseq, with the skew towards p-value=0 less 171 marked for RNAseq data (Additional file 1: Fig S4a and S4b). This can be due to the 172 more precise temporal resolution of the microarray time-series dataset, or to differences 173 in the detection of gene expression by RNAseq vs microarrays (Fig 3a). When we 174 restricted the microarray time-series to the same time-points as in the RNAseq series, 175 we obtained a p-value distribution very similar to that of the RNAseq data (Fig 3b). 176 The same time-series restriction applied to known cycling genes produced comparable 177 results (Additional file 1: Fig S4c). This supports a major role of the temporal resolution 178 for method results, relative to a minor role for the difference between RNAseq and 179 microarrays. That is why for the next steps, we only considered the microarray dataset 180 for the mouse. Methods lose in efficiency for detecting rhythmic patterns in gene expression when the number of 24h cycles decreases, or when the number of time-points sampled decreases. a) Default p-value distributions obtained for ARS, GeneCycle, and empJTK applied to different datasets and subcategorized in: i. randomized data which represents the null hypothesis; ii. randomized data restricted to the first and fourth quartiles of the median gene expression level, to check for the impact of expression level under the null; iii. the full original dataset; iv. the first and fourth quartiles of the median gene expression level of the original data; and v. a subset of known cycling genes (8 to 99 genes according to species, see Methods). For each dataset, the number of time-points with data and the temporal resolution is illustrated around a 24h clock. For the same number of time-points, performance seems better with two cycles than only one cycle (zebrafish vs baboon). b) The reduction of the number of time-points of the mouse liver microarray dataset shows increasingly weak rhythm detection by ARS, GeneCycle, and empJTK, shown by a flattening of the p-value distribution on the full dataset (red arrow). GeneCycle showed no difference between a few time-points over two cycles or more time-points over a single cycle (black arrow).
This observation can be generalized to diverse datasets. We see that each method loses 182 in efficiency when the number of 24h cycles decreases, or when the number of time-points 183 sampled decreases (Fig 4a). We show only results for ARS, GeneCycle, and empJTK 184 because they were the only methods with correct behavior in their p-value distributions 185 7/24 (Fig 1b). For the same number of time-points, performance seems better with two cycles 186 than only one cycle, as shown comparing zebrafish and baboon data which have both 187 twelve time-points (Fig 4a). But this observation could be confused by the comparison of 188 different species or different samples' quality. ARS performed better with a smaller total 189 number of time-points but over two cycles than with more total time-points over a single 190 cycle (mouse RNAseq vs baboon in Fig 4a), indicating that ARS is very dependant on 191 the repetitive nature of profiles. The reduction of the number of time-points of the mouse 192 microarray dataset shows similar effects on the rhythm detection by ARS, GeneCycle, 193 and empJTK (Fig 4b). Of note, GeneCycle presented more or less no differences between 194 having a few time-points over two cycles and having more time-points over a single cycle 195 (black arrow Fig 4b). 196

197
Among genes called rhythmic, we analysed the number of those called in common by the 198 different methods. For p-value thresholds of 0.05 or 0.01, we found a large proportion 199 of genes called rhythmic by only one or few methods (Fig 5a and additional file 1: Fig 200  S5). Nevertheless, the overlap between all methods was the largest category for the 201 mouse liver data (Fig 5a). If we ignore p-value thresholds and consider the first 6000 202 genes detected rhythmic for each method, the overlap becomes stronger (Fig 5d). We 203 obtained similar results from the most informative dataset (Additional file 1: Fig S6b). 204 Indeed, the rat lung dataset has 36 time-points spread over three 24h cycles (Fig 4a). 205 Thus, the same genes seem to be called rhythmic by all methods but the threshold of 206 significance appear inconsistent. These results suggest an issue with the significance 207 of p-value thresholds for the methods. With a smaller number of top rhythmic genes, 208 the overlap between methods was weaker (Fig 5c and 5d). Thus the methods agree 209 on a large number of rhythmic genes, but not necessarily on the order of significance 210 among them. Finally, for baboon liver data there was less overlap of methods (Fig 5b; 211 Additional file 1: Fig S6b and S6c), which might be due to the low information in that 212 data (Fig 4a).

213
Use of evolutionary conservation as a benchmark 214 Signal of evolutionary conservation 215 We expect biologically relevant rhythmic activity of genes to be more conserved between 216 species than putative false positives from detection methods. For each condition (species 217 and tissue), we defined the group of genes whose orthologs are called rhythmic in the 218 homologous tissue of another species (Fig 6a). For example, starting with all mouse 219 genes, we only kept mouse-zebrafish one-to-one orthologs. Considering the liver, these 220 orthologs were separated into two groups: genes for which the ortholog is detected as 221 rhythmic in zebrafish liver, called rhythmic orthologs; and the remaining one-to-one 222 orthologs (Fig 6b). To detect zebrafish liver rhythmic genes, we used the ARS method 223 following the results presented above and fixed a p-value threshold of 0.01. to call orthologs as rhythmic in zebrafish liver (Additional file 1: Fig S7). This result 230 obtained for distant species (Additional file 1: Fig S1) shows that the conservation of 231 rhythmicity at the transcriptomic level is strong and should be informative. Similar 232 results were obtained in other species comparisons (Additional file 1: Fig S9), with a 233 stronger signal for evolutionarily close species such as mouse and rat (Additional file 234 9/24 1: Fig S8). Thus, for the same homologous organ, rhythmic orthologs have a stronger 235 statistical signal of rhythmicity than non-rhythmic orthologs. We are going to use this 236 evolutionary conservation of the rhythmicity of gene expression in order to compare the 237 performance of methods.

Mouse-Zebrafish Orthologs
Genes rhythmic in the same tissue Orthologous genes detected as rhythmic in the same organ of two species have a stronger statistical signal of rhythmicity than those not detected as rhythmic in at least one species. a) Mouse and zebrafish share orthologous genes, some of which are rhythmic in the homologous tissues. b) Method used for ortholog benchmarking, as in panel d: From all mouse genes, only mouse-zebrafish one-to-one orthologs are kept. Considering the liver, these orthologs are separated into two groups: genes for which the ortholog is detected as rhythmic in zebrafish liver, called rhythmic orthologs; and the remaining one-to-one orthologs. c) Chart providing the legends to inform about the method and the threshold used to call genes rhythmic for each condition (species and tissue). d) p-values density distribution of rhythmic orthologs vs non-rhythmic orthologs obtained for the seven methods applied to mouse liver data. Mouse-zebrafish orthologs, that are detected rhythmic in zebrafish liver, are significantly more enriched in small p-values in mouse liver, for all methods (Kolmogorov-Smirnov test p-values < 0.001. Only strong rhythmic signals of gene expression are reliable 239 In this last part, we compared the performances of methods to detect the rhythmic 240 orthologs. For a given dataset, the best method is expected to report rhythmic genes 241 with the highest proportion of rhythmic orthologs. It should be noted that this does 242 not imply that we expect all rhythmic behavior to be conserved between orthologs, but 243 10/24   Figure 7. Only strong rhythmic signals of gene expression are reliable Methods designed for rhythm detection in gene expression show an advantage only for the genes with a strong rhythmic signal. For a fixed number of top genes called rhythmic, all the methods, despite their design differences, retrieve approximately the same proportion of biologically functional rhythmic genes and the same genes themselves. a) Method to obtain figure b: For a given p-value threshold, each method detects a certain number of rhythmic genes (genes with p-value threshold). At each threshold, we calculate the proportion of orthologs rhythmic in species2 (A) among one-to-one species1-species2 orthologs (B). The benchmark set is composed of one-to-one orthologs detected rhythmic in the second species (using method ARS, GeneCycle, or empJTK), called rhythmic orthologs. b) Variation of the proportion rhythmic orthologs/all orthologs in mouse as a function of the number of mouse orthologs detected rhythmic, for each method applied to the mouse lung dataset. The benchmark gene set is composed of mouse-rat orthologs, detected rhythmic in rat lung by the GeneCycle method with default p-value ≤ 0.01. The black line is the Naive method which orders genes according to their median expression levels (median of time-points), from highest expressed to lowest expressed gene, then, for each gene, calculates the proportion of rhythmic orthologs among those with higher expression. The proportion of the benchmark set among one-to-one orthologs is higher for highly expressed genes (4th quartile) than for lowly expressed genes (1st quartile) (∼60% vs ∼20% respectively). Diamonds correspond to a p-value threshold of 0.01. c) Upset diagram showing the number of rhythmic orthologs (figure a) called in common by the methods among the first 1000 mouse-rat orthologs that are called rhythmic in mouse lung.
rather that true rhythmic genes should have more rhythmic orthologs than false-positive 244 predictions. For a given p-value threshold, each method detects a certain number 245 of rhythmic genes (genes with p-value threshold). At each threshold we calculated 246 the proportion of orthologs rhythmic in species2 among one-to-one species1-species2 247 orthologs, as defined in Fig 7a. This proportion allows assessing how each method is 248 11/24 able to detect the conservation of rhythmicity and can be calculated for each p-value 249 threshold. The benchmark set is composed of orthologs detected rhythmic in the second 250 species, called rhythmic orthologs. To define this set, we chose a rhythmicity detection 251 method among ARS, empJTK, and GeneCycle, in agreement with results of previous 252 sections, and a p-value threshold of 0.01.

253
A risk is that orthologs have conservation of gene expression levels and that there is 254 a bias towards calling highly expressed genes "rhythmic". To control for this in the 255 benchmarking, we added a "Naive" method based only on expression levels. This Naive 256 method simply orders genes (orthologs here) according to their median expression levels 257 (median of time-points), from highest expressed to lowest expressed gene, then, for 258 each gene, we calculated the proportion of rhythmic orthologs among those with higher 259 expression. We also present results for subsets obtained from the division in four quartiles 260 of expression levels. Figure 7b shows the variation of the proportion defined above as a 261 function of the number of orthologs detected rhythmic, obtained for each method applied 262 to the mouse lung dataset. The benchmark gene set was defined by mouse-rat orthologs, 263 detected rhythmic in rat lung by the GeneCycle method (default p-value ≤ 0.01). Genes 264 are given by order of their detection by the methods. The genes with small p-values, 265 i.e. with a strong signal of rhythmicity, had a high proportion of rhythmic orthologs. 266 Importantly, for all methods, this proportion was higher than that obtained from the 267 Naive method (Fig. 7b). Results are consistent in almost all species comparisons, with 268 exceptions for cerebral tissues (Additional file 3). However, the thresholds of 0.01 are to 269 the right of the cross between the curves of rhythm detection methods and the Naive, 270 except for LS. This means that, for an apparently reasonable threshold (p-value ≤ 0.01), 271 ranking genes by expression level performed "better" than all methods designed specially 272 for rhythm detection. These methods performed better only for genes with very high 273 signal of rhythmicity. Finally, for the top 1000 mouse-rat orthologs detected as rhythmic 274 in mouse lung, all the methods reported a similar proportion of rhythmic orthologs, 275 around 62%, mainly highly expressed genes (fourth quartile of gene expression) (Fig. 276  7b). And the overlap between these orthologs was largely detected by all methods (Fig. 277  7c). Thus, for genes with a high signal of rhythmicity, all methods performed similarly 278 to detect the tissue-specific conservation of gene expression rhythmicity. Similar results 279 were obtained for other species comparisons (Additional file 4).

281
The methods designed for rhythm detection in gene expression perform 282 similarly and only for strong rhythmic signal.

283
In this study, we show that orthologous genes detected as rhythmic in the same organ 284 of two species have a stronger statistical signal of rhythmicity than those detected as 285 not-rhythmic in at least one species. These results support our hypothesis that the 286 nycthemeral rhythmicity at the gene expression level is biologically functional, and that 287 this functionality is more conserved between orthologous genes than between random 288 genes. We define the nycthemeral transcriptome as all genes displaying a rhythmic 289 expression repeated every 24 hours. In order to assess the performance of seven methods 290 to detect these rhythms, we used this concept of conservation of the rhythmicity between 291 species for benchmarking. We employed genes whose orthologs had a rhythmic expression 292 called in the same homologous organ as a proxy for a true positive set, as done in some 293 previous benchmarks. For instance, [33] assessed the quality of microarrays quality 294 control methods based on evolutionary conservation of expression profiles, and [34] 295 benchmarked tissue-specificity methods in the same way. This approach based on real 296 12/24 data, also used by Boyle et al. [3] to solve the issue of weak overlap between the same 297 tissues from the same species from different experiments, avoids relying on simulations 298 which tend to favor methods using the same model, e.g. the same patterns, and has the 299 advantage of not being based on specific assumptions, other than general evolutionary 300 conservation of function. By taking into account the tissue-specificity of rhythmic 301 gene expression and different species comparisons, we show that no method is strongly 302 favoured. For instance, one would have expected that the added features of RAIN and 303 empJTK allowing then to detect more diverse patterns than a classical sinusoidal, would 304 have favored them. But this flexibility did not provide them any advantage in the 305 benchmark. Furthermore, the comparison of the methods with a 'Naive' one, uninformed 306 about rhythmicity, shows an advantage for informed methods only for the genes with 307 a strong rhythmic signal. Thus, only genes with a strong rhythmic signal, i.e. the top 308 genes called rhythmic, can be considered as relevant. Even if the threshold of "relevance" 309 of these genes is dependant on the evolutionary distance of the species compared, these 310 results suggest a call for caution about the results of previous studies reporting or based 311 on large sets of rhythmic genes. For the same number of genes called rhythmic, all the 312 methods, despite their design differences, retrieved approximately the same proportion of 313 biologically functional rhythmic genes (Fig 7b) and the same genes themselves (Fig 7c). 314 The issue of significance 315 For the same p-value threshold, the number of genes called rhythmic is different from 316 one method to another, with a large proportion of these genes detected rhythmic by 317 only one or a few methods. But, if we consider the top genes called rhythmic for each 318 method, without taking into account any p-value threshold, the overlap of rhythmic 319 genes become strong between the methods (Fig 5). This highlights an issue with the 320 meaning of the p-value and the associated thresholds used. This is directly related to 321 the issue of correction that needs to be improved in this field. When a smaller number 322 of top rhythmic genes is used, the overlap between methods becomes weaker (Fig 5c and 323  5d). Thus, the order of calling genes rhythmic is different from one method to another. 324 Finally, since methods performed better than a Naive method only for genes with a 325 strong rhythmic signal, we can not conclude for the relevance of the other genes called 326 rhythmic, even when they have very low nominal p-values.

327
ARS, empJTK, and GeneCycle produce consistent p-values 328 ARS, empJTK, and GeneCycle were the methods that showed the best behavior on real 329 and randomized data (single species tests). They were the only methods displaying both 330 a uniform distribution of their p-values under the null hypothesis, and a left-skewed 331 distribution when applied to real data. For empJTK, its default correction allowed to 332 produce these expected results. However, each of these three methods is conceptually 333 completely different, which indicates that there is not one conceptual framework which 334 dominates rhythmic gene detection. ARS combines time-domain and frequency-domain 335 analyses. GeneCycle, which is the robust spectrum function of the R package, is 336 based on a robust spectral estimator which is incorporated into the hypothesis testing 337 framework using a so-called g-statistic together with correction for multiple testing. 338 And, empJTK improves the original JTK including additional reference waveforms in its 339 rhythm detection. The other methods all presented major issues. LS has a right-skewed 340 distribution of its initial uncorrected p-values suggesting an invalid null hypothesis. JTK, 341 RAIN, and meta2d had also issues with their null hypothesis displaying left-skewed 342 distributions of their uncorrected p-values. Their default adjustment was excessive, 343 13/24 favoring high p-values obtained after correction. Hutchison and Dinner [17] observed this 344 on simulated data, and proposed that it was due to non independence of measurements 345 from the same time series.

346
Biological insight into gene rhythmicity in animal tissues 347 The separation of genes according to their expression levels showed differences in the 348 p-values distributions, with more rhythmic signal among highly expressed genes, and 349 almost none among lowly expressed genes. This led us to be concerned about the risk of 350 a very high false-positive rate for highly expressed genes. However, the normalization 351 of gene expression values (by Z-score, see above and methods) did not change the 352 p-values distributions within highly expressed genes and particularly did not recover 353 rhythmicity within lowly expressed genes (same results for microarray and RNAseq data). 354 Thus, it is possible that the rhythmic genes are largely enriched in highly expressed 355 genes. Experimental noise that would mask the rhythmic signal of lowly expressed genes 356 could also explain this result in part, especially considering that the datasets with good 357 sampling used microarray technology.

358
The observation of known cycling genes in different organs seems to indicate different 359 profiles of rhythmicity possible for the same gene. For instance, Npas2 displays a cosine 360 shape in kidney and lung, and a peak/box shape in liver and muscle (Fig. 3a). This 361 observation suggests that methods might perform differently depending on the organ 362 studied. This is also one of the reasons why all our analyses were made for homologous 363 organs.

364
In mouse-baboon comparisons, there were no significant differences of p-value density 365 between rhythmic and non-rhythmic orthologs in cerebral tissues: brain stem, cerebellum, 366 and supra-chiasmatic nucleus, except for the hypothalamus (Additional file 3: Fig S4-S5). 367 This could be explained by the fact that there are only low amplitudes of expression of 368 clock genes and few rhythmic genes in almost all brain regions. This is assumed to be 369 due to an inefficient synchronization of individual cellular oscillators in brain cells to 370 avoid noise into the synchronizator element [35]. In addition, it could also be an essential 371 aspect for intrinsic brain processes which could require a constant expression of most 372 genes.

373
The importance of having an informative dataset 374 Because of the cost and complexity of circadian experiments, time-series datasets of 375 gene expression in animals are rare, especially in the same experimental conditions. 376 Algorithms must be able to deal with little data, but importantly experiments should take 377 into account the algorithms' sensitivity. All algorithms appeared to produce relatively 378 poor p-values distributions when applied to the available Drosophila or baboon datasets, 379 and, for the baboon dataset, were almost always less efficient than the Naive method 380 (Additional file 1: Fig S11). This baboon dataset is probably not very informative, which 381 raises questions about the biological conclusions from the associated study [1]. With 382 only one replicate per time-point, over only one cycle of 24 hours, the algorithms are 383 unable to detect repetitive patterns. Variations over a single 24 hours cycle appear to be 384 insufficient to detect rhythmic signal, when there is no evidence of repetition over several 385 cycles. Moreover, each data comes from different outbred individuals. The variations 386 of gene expression between two time-points can be due to individual variations or real 387 oscillation within the population. It is possible that sinusoidal patterns with a continuous 388 trend over successive time-points could be detected without replicates, although power 389 will be lacking, but patterns such as the peak pattern will be extremely sensitive to 390 14/24 inter-individual variation. Figure 4 generally suggests that datasets with one replicate 391 per time-point over a unique cycle of 24 hours do not provide enough information that 392 would allow to correctly detect the rhythmicity. It seems that ARS in peculiar is very 393 sensitive to the repetitive nature of profiles. Thus, one cycle with several biological 394 replicates should be favored. Our results support the conclusions of Hutchison et al. [6] 395 who indicate that for a fixed number of samples, better sensitivity and specificity are 396 achieved with higher numbers of replicates than with higher sampling density. We 397 propose that future experiments should produce data with two biological replicates per 398 time-points as a strict minimum. Obviously, we suggest considering biological replicates 399 as new cycles within one replicate, as proposed in recent guidelines [9]. GeneCycle, 400 and to a lesser extent empJTK, were the most robust methods when applied to weakly 401 informative datasets. Thus, the performance of the algorithms is dependent on techniques 402 and experimental designs used for the experiments. The optimization of experimental 403 plans (see section Recommendations) could improve the methods' performance for the 404 detection of rhythmically expressed genes. Finally, contrary to the mouse experiment, 405 the rat experiment has been done under zeitgeber conditions which have most likely 406 resulted in more genes being expressed rhythmically, so in proportion, more periodic 407 patterns. This might explain the higher density of small p-values obtained for the rat 408 dataset (Fig 4a).

409
Limitations and improvement of methods 410 ARS and GeneCycle need complete chronological data and cannot deal with biological 411 replicates. Except for LS and RAIN, all methods studied here assume equally spaced 412 time-points. Furthermore, ARS needs an integer sampling interval with regular time-413 series datasets and cannot deals with missing values, or with several replicates per 414 time-point. In this study, ARS appeared to be efficient only for the dataset with at 415 least two cycles of data. Indeed it produced aberrant p-value distributions when applied 416 to datasets restricted to one cycle of 24 hours. But, for these datasets, all algorithms 417 behaved poorly. The improvement of JTK by empJTK produced much better results 418 than the original JTK algorithm. We believe that LS could be a very interesting method 419 if its null hypothesis could be clarified and would thus provide p-values with proper 420 behavior. LS has advantages that other algorithms don't. For instance, it can deal with 421 irregular intervals, missing data, and has been shown to stay efficient on small sample 422 size [28], which constitutes one of the big issues of circadian transcriptomic data. On 423 the other hand, relative to JTK, ARS, or MICOP methods, LS has also been shown to 424 be highly sensitive to the increasing of sampling intervals and to noise for proteomic 425 data [36].

426
A good method must, at least, display a uniform distribution under the null hypothesis, 427 and a classic skewed distribution when applied to full dataset or even more to known 428 cycling genes. It should also be able to detect efficiently rhythmic orthologs, which 429 represent an important part of the functionally relevant nycthemeral rhythmicity. In 430 this study, we did not assess the amplitudes, phases, and precise period provided by the 431 algorithms. We only analysed the performance of methods for nycthemeral or circadian 432 rhythms in gene expression data, and cannot conclude directly for ultradian or seasonal 433 rhythms, and for other types of datasets which are not gene expression data. 434 p-value per gene. Whenever the per-gene normalization was not necessary (unique data 521 for all genes), we obtained the original p-value for each gene. For each species comparison, orthologs relationships have been downloaded from OMA 530 [37]. For simplicity, we only considered one-to-one orthologs. In species comparisons, we 531 only kept orthologous genes that had available data in both species.

533
The Naive method is only based on expression levels of genes and is not informed about 534 rhythm detection. It simply orders genes according to their median expression levels 535 (median of time-points), from highest expressed to lowest expressed gene. Then, for 536 each gene i, we calculate the proportion of rhythmic orthologs among those with higher 537 expression, i.e. among the genes from the highest expressed one to the gene i.