Genome-wide analysis of the Firmicutes illuminates the diderm/monoderm transition

The transition between cell envelopes with one membrane (Gram-positive or monoderm) and those with two membranes (Gram-negative or diderm) is a fundamental open question in the evolution of Bacteria. Evidence of the presence of two independent diderm lineages, the Halanaerobiales and the Negativicutes, within the classically monoderm Firmicutes has blurred the monoderm/diderm divide and specifically anticipated that other members with an outer membrane (OM) might exist in this phylum. Here, by screening 1,639 genomes of uncultured Firmicutes for signatures of an OM, we highlight a third and deep branching diderm clade, the Limnochordia, strengthening the hypothesis of a diderm ancestor and the occurrence of independent transitions leading to the monoderm phenotype. Phyletic patterns of over 176,000 protein families constituting the Firmicutes pan-proteome identify those that strongly correlate with the diderm phenotype and suggest the existence of new potential players in OM biogenesis. In contrast, we find practically no largely conserved core of monoderms, a fact possibly linked to different ways of adapting to repeated OM losses. Phylogenetic analysis of a concatenation of main OM components totalling nearly 2,000 amino acid positions illustrates the common origin and vertical evolution of most diderm bacterial envelopes. Finally, mapping the presence/absence of OM markers onto the tree of Bacteria shows the overwhelming presence of diderm phyla and the non-monophyly of monoderm ones, pointing to an early origin of two-membraned cells and the derived nature of the Gram-positive envelope following multiple OM losses. Phylogenomic analysis supports a diderm ancestor of the Firmicutes and points to an early origin of two-membraned cells in Bacteria and the derived nature of the Gram-positive envelope following multiple outer membrane losses.


NaTurE EcOLOGy & EvOLuTION
entire diversity of this phylum, in particular the Clostridia. Most UBA genomes fall into known Firmicutes families consistently with their inferred taxonomy 13 , whereas others fall into clades that do not contain any genome representative of known families and have a loose taxonomic assignment 13 (Fig. 1). Only one UBA genome belongs to the Halanaerobiales, possibly a bias due to the difficulty of assembling these GC-rich genomes or to poor sampling from the environments where they thrive. In contrast, these new genomes significantly enrich the coverage for Negativicutes (39 UBAs). The deep branching of Halanaerobiales and the placement of Negativicutes within Clostridia are both well supported and are consistent with previous analyses 6,10,15 . In particular, we previously specifically tested alternative branchings of the Halanaerobiales and showed that they were all strongly rejected by the data 10 . Interestingly, while the monoderm lineage Natranaerobius thermophilus branched with Halanaerobiales in previous analyses 10,15 , it now groups with Dethiobacter alkaliphilus and 17 other UBA genomes at the base of Bacilli (Fig. 1). The increased genomic coverage probably enhanced the phylogenetic signal and helped correctly place these lineages, which is more consistent with their monoderm phenotype.
To investigate the existence of additional diderm lineages among the UBA Firmicutes, we screened them for the presence of four conserved genes for LPS biosynthesis and six protein domains previously used as markers of Gram negativity 7 (Methods) (Supplementary Table 2). Consistently with our previous study 10 , we found homologues of these OM markers in all Negativicutes and Halanaerobiales UBA genomes, but surprisingly we also found them in 46 unclassified UBA genomes belonging to a particularly interesting clade. It represents the second-deepest-branching group in the reference phylogeny after the Halanaerobiales (Fig. 1) and includes a single isolated member, Limnochorda pilosa, recently isolated from a brackish meromictic lake and defining the proposed class Limnochordia 16 . The sequenced L. pilosa genome revealed the presence of classical OM markers consistent with microscopy evidence of a diderm cell envelope 17 . Most UBA genomes that we assign to Limnochordia were assembled from anaerobic mud and digester samples (Supplementary Table 1) and form a diverse group ( Supplementary Fig. 1). These results show that Limnochordia represent a third and deep-branching diderm lineage within the Firmicutes, strengthening the hypothesis of a diderm ancestor.
Distribution of protein families and domains reveals the functional core of diderm Firmicutes. The existence of three distinct diderm lineages embedded in the classical monoderm Firmicutes provides an exceptional opportunity to gain insights into the diderm/monoderm transition by a large-scale comparative genomics analysis across this phylum. To this end, we assembled a reduced genome databank of 316 representative Firmicutes (including the newly identified UBA diderm Firmicutes): 47 Limnochordia, 101 Negativicutes and 17 Halanaerobiales, for a total of 165 diderm and 151 monoderm taxa (Methods). We then used this databank to compare the proteomes of monoderm and diderm lineages across the phylum. We calculated the pan-proteome of Firmicutes as composed of 176,024 protein families, 15,964 being present in at least five genomes and which were thus kept for further analysis (Methods).
We built a presence/absence matrix and calculated the distribution of these 15,964 protein families (Fig. 2). Some families are specific to Halanaerobiales (327), to Negativicutes (1,820) or   Table 3). However, these might include innovations specific to these clades or linked to their phylogenetic placement and not necessarily to the diderm phenotype . Of the 15,964 families, 131 are shared  uniquely between Halanaerobiales and Limnochordia, 73 between  Limnochordia and Negativicutes, and 26 between Negativicutes and Halanaerobiales. Finally, 41 protein families are present in at least one member of each of the three diderm firmicute lineages but are totally absent from monoderm Firmicutes, and will be referred to hereafter as the diderm strict core (Fig. 2a, in bold). Consistently, these families include components of known OM systems such as LpxABD (LPS synthesis), BamA (OM biogenesis) and FlgHI (flagellar rings) (Supplementary Table 3). However, some false negatives may arise from this approach. For example, LpxC and KdsABD (keto-deoxyoctulosonate synthesis) do not appear in the diderm strict core because homologues are present in a few monoderm genomes ( Fig. 3 and supporting data 14 ). In contrast, this analysis did not identify a strict core of monoderms. In fact, within the 3,863 protein families totally absent from diderm Firmicutes, only three are present in >50% of monoderm genomes (Supplementary  Table 3, highlighted in grey). Therefore, to relax our criteria, we implemented a complementary approach based on hierarchical clustering (HC) of the 15,964 protein families (Methods). Among the 3,500 clusters generated by the HC approach, four (HC5, HC7, HC8 and HC9) appeared particularly enriched in the three diderm clades with respect to their monoderm relatives and contained at least one of the previously identified strict core diderm families present in more than 40% of diderm Firmicutes. These four clusters total 120 protein families (Supplementary Table 4), 95 of which were also retrieved with an alternative clustering approach (Methods, Supplementary Table 5). The HC5 (14 protein families), HC7 (67 protein families) and HC9 (8 protein families) clusters show variable distribution patterns across diderm Firmicutes and seem to be mostly restricted to the Negativicutes (Fig. 2b and  Supplementary Table 4). In contrast, cluster HC8 has the sharpest pattern of enrichment in all three diderm Firmicutes clades (Fig. 2b). It includes 31 protein families, which comprise some of the OM markers that were missed by the strict presence/absence criterion, such as LpxC and KdsABD (Supplementary Table 4), confirming the highest sensitivity of the clustering approach. Consistently with the strict core analysis, here again we could not highlight any cluster specifically enriched in monoderms with respect to diderms (see supporting data 14 ).
Finally, we used a third approach based on the Pearson correlation coefficient (PCC) (Methods). We found that 83 families correlate with the diderm phenotype (PCC ≥ 0.5) (Fig. 2c). Of these, 26 are present in HC8, 56 are totally absent in monoderm Firmicutes and 35 are present in all three diderm Firmicutes lineages (Supplementary Table 6). Some important OM markers were nevertheless not recovered by the analysis of protein families. One example is TamB, an inner membrane (IM) protein involved in the insertion of proteins in the OM. Other examples are OmpM (a porin responsible for tethering the OM in Negativicutes) and some of the components of the Lpt system for export of LPS to the OM (LptB and LptD). These homologues are not sufficiently conserved at the sequence level and ended up being split among different families. We therefore complemented the protein family approach by one based on protein domains (Methods). We identified all Pfam domains in our Firmicutes databank and grouped proteins using three different approaches (Methods). We then computed the PCC of each of these groups with the diderm phenotype (Supplementary  Fig. 2 and Supplementary Table 7). This allowed us to identify OM markers that were missed by the protein family approach, such as TamB and OmpM. All results (protein families and Pfam domains) were merged. For the proteins identified by more than one method, we kept the corresponding family or domain presenting the best PCC. Finally, we kept only the protein families or domains present in at least one member of the three diderm Firmicutes lineages. These analyses identified 52 protein families/domains that are specifically enriched in the three diderm Firmicutes lineages, the majority of which are either totally or mostly absent from monoderm Firmicutes and include crucial functions linked to the cell envelope and the OM (Fig. 3a). Interestingly, they also include putative functions not immediately linked to OM biogenesis, such as RuvC (endonuclease) and RecX (a regulator of RecA, involved in DNA repair). Other proteins have only a generic annotation (for example, Peptidase_S55, Peptidase_M23 and AMIN) and might be linked to peptidoglycan (PG) related functions. Still other proteins are totally uncharacterized. One example is DUF3084, which is among the most highly correlated with the diderm phenotype and is practically absent in monoderms. Finally, we identified 51 protein families and Pfam domains that correlate with the monoderm phenotype (Fig. 3b, Supplementary Fig. 2 and Supplementary Table 8). However, and consistently with the strict core analysis, most of these proteins/domains are not widely distributed across monoderms, even those commonly assumed to be markers of Gram-positive bacteria (such as sortase and LPXTG anchor), and only one protein family (containing the S4 domain, loosely annotated as involved in RNA binding and translation regulation) shows a PCC higher than 0.9. Therefore, unlike diderm Firmicutes, it seems that there is de facto no conserved core for monoderm Firmicutes.
A large OM gene cluster is a distinguishing feature of all diderm Firmicutes and reveals potential new functions related to OM biogenesis. Very little experimental data are available on the nature of the cell envelope of diderm Firmicutes. To gather further information on the putative functions of the 52 families/domains most correlated with the diderm phenotype, we applied a guilty-by-association strategy by looking at their genomic environment. Twenty-nine of them were revealed to be part of a large gene cluster (up to 65 kilobases) that we previously identified in Negativicutes and Halanaerobiales 10 . This cluster, which will be hereafter referred to as the OM cluster, is also present in all Limnochordia genomes with a very similar gene order (Fig. 4) and therefore seems to be a distinguishing characteristic of all diderm Firmicutes.
The OM cluster codes for a number of important systems known to be involved in the biogenesis of the bacterial diderm cell envelope 18,19 (Fig. 4). It contains the main players in the synthesis of LPS and its transfer to the OM (Fig. 4, in blue). This includes the pathway responsible for the synthesis of Lipid A, the innermost component of LPS (LpxACD/LpxI/LpxB/LpxK). While the enzymes for the core oligosaccharide component of LPS are present in Halanaerobiales and Negativicutes (WaaM and KdsABCD), they are absent altogether in Limnochordia, where they might be replaced by a large number of unrelated glycosyl transferases present in the cluster (Fig. 4). All diderm Firmicutes seem to have a conserved LipidA flippase (MsbA) and the Lpt system for LPS transport to the OM (LptABC/LptFG/LptD) 20 . It is intriguing to note that the gene coding for LptD is not close to those encoding the other Lpt components but frequently lies next to two genes with no annotated function or conserved domains in both Negativicutes and Halanaerobiales (Hypothetical 1 and 2 in Fig. 4). This conserved three-gene arrangement suggests functional linkage and two potentially new components of the machinery for LPS transport to the OM in diderm Firmicutes. Similarly, a gene coding for a third hypothetical protein is always present in the OM gene cluster (Fig. 4, Hypothetical 3). No functional domains are annotated for this protein, but its strong conservation within the LPS genes strongly suggests some involvement in this pathway.
The OM cluster also includes another very important system responsible for the assembly of OM proteins (OMPs) (Fig. 4, in orange). This requires a coordinated process of folding into a β-barrel Articles NaTurE EcOLOGy & EvOLuTION structure and membrane integration, and it is accomplished by the β-barrel assembly machinery (BAM) complex, whose main component is BamA (Omp85) 18 . In E. coli and other Proteobacteria, an additional complex known as the translocation and assembly module (TAM) is present 21 . All three diderm Firmicutes lineages seem to possess a single system for OMP biogenesis composed of TamB and BamA, and one or multiple copies of the periplasmic chaperone Skp (Fig. 4). This TamB/BamA complex was previously proposed to constitute an ancestral configuration predating the emergence of the BAM and TAM machineries in the evolution of Bacteria 10,21 . The TamB homologues of Limnochordia are longer than their counterparts in Halanaerobiales and Negativicutes and frequently have an N-terminal extension with no identifiable domains. This might indicate the existence of yet unknown functions or interactions of this machinery in diderm Firmicutes. Curiously, the genome of L. pilosa and some other uncultured members of Limnochordia do not possess a homologue of the gene coding for TamB ( Fig. 4 and Supplementary Fig. 3). It is unclear at this stage whether these are true absences or repeated assembly errors.
One or multiple copies of the genes coding for OmpM can be observed in the OM cluster (Fig. 4, in red). OmpM is a porin with a surface-layer homology (SLH) domain that has been shown to non-covalently attach the OM to a modified PG in Negativicutes, constituting an atypical OM tethering system that is different from the well-known Braun's lipoprotein 11 . This suggests that diderm Firmicutes may all use a similar strategy to attach their OM. Curiously, in Halanaerobiales and Limnochordia, one of the ompM genes lies in a conserved context including homologues of genes coding for an ExbBD/TonB machinery, responsible for the active transport of molecules to the OM by exploiting the energized IM (ref. 22 ). Moreover, these genes lie together with a gene coding for cohesin, a protein generally involved in DNA repair and gene regulation (Fig. 4). Such strong synteny conservation may indicate a functional link among these proteins, which is intriguing and will need experimental verification.
In Negativicutes only, the OM cluster contains a conserved four-gene arrangement coding for the IM components of the Mla machinery (MlaEFD) and the OM channel TolC (Fig. 4, in green).

NaTurE EcOLOGy & EvOLuTION
The Mla system is responsible in diderm bacteria for maintaining lipid bilayer asymmetry and the OM barrier by the transport of phospholipids from and to the IM (refs. 23,24 ). No clear homologues of the periplasmic and OM components of the Mla system (MlaABC) could be identified in Negativicutes (Fig. 4). This may suggest a new mechanism to maintain lipid asymmetry in the Negativicutes which may involve TolC. Moreover, as Halanaerobiales and Limnochordia do not have any of these Mla coding gene homologues in their genomes, it remains to be understood how they cope with the absence of such a crucial system for membrane integrity or whether they use a non-homologous machinery.
In flagellated diderm Firmicutes, the OM cluster also contains six genes encoding the flagellar proteins FlgFGAHIJ, including those for the specific P-and L-ring structures that in diderm bacteria serve to anchor the flagellum to the OM (ref. 25 ) (Fig. 4, in pink).
Finally, a number of genes are highly conserved in the OM cluster in all three diderm Firmicutes clades but have only a generic annotation (Fig. 4, in yellow). Some of these are included in a conserved six-gene arrangement (sometimes interrupted by flagellar genes) that codes for a member of the S55 peptidase family, a protein annotated as N-acetylglucosamine-1-phosphodiester alpha-N-acetylglucosaminidase involved in protein glycosylation, a peptidase of the M23 family, a distant homologue of the lytic transglycosylase SpoIID, a homologue of the sporulation sigma factor SpoIIID and a MreB homologue. Interestingly, a SpoIID homologue was recently shown to be involved in cell division in Chlamydia trachomatis 26 , and it is tempting to speculate that some of these proteins are also involved in PG remodelling or daughter cell separation in diderm Firmicutes. Two other uncharacterized genes are always next to each other in the OM cluster ( Fig. 4): one codes for the uncharacterized protein DUF3084, and the other for RuvC (Holliday junction resolvase). Both proteins are among the most correlated with the diderm phenotype (see below and Fig. 3), strongly suggesting functional interaction and a still unexplored role in OM biogenesis.
Phylogenomic analysis does not support the acquisition of the OM by horizontal gene transfer. The presence of the OM cluster may suggest that it was acquired by HGT. We previously showed by phylogenetic analysis of a concatenation of the four core LPS proteins (LpxABCD) that the sequences from Halanaerobiales and Negativicutes are closely related, match their reference species phylogeny, and do not stem from within other diderm bacterial phyla. We interpreted this result as support that these genes (and by extension the whole OM cluster) were not acquired via HGT but were rather inherited from the ancestor of all Firmicutes, which would therefore have been a diderm with LPS (refs. 5,10 ). The possibility of an acquisition in the ancestor of either Halanaerobiales or Negativicutes followed by a further HGT between the ancestors of these two clades, although unlikely, remained however open. The presence of the cluster in Limnochordia weakens this scenario, as this would imply an additional ancient transfer event. Nevertheless, to further investigate the HGT hypothesis, we searched for OM gene clusters similar to the one present in diderm Firmicutes in our Firmicutes databank as well as in 377 genomes representative of major bacterial phyla (Methods). This confirmed that no other bacterial phylum possesses any gene cluster similar to the one in diderm Firmicutes ( Supplementary Fig. 3). In most diderm bacteria, the OM genes are in fact split into a number of small clusters, as in E. coli. Interestingly, bigger gene clusters could be observed in some diderm phyla that are evolutionarily close to the Firmicutes (Armatimonadetes and Synergistetes), but never as large as the one present in diderm Firmicutes ( Supplementary Fig. 3). Taken together, these results weaken the hypothesis of an acquisition of the OM in diderm Firmicutes.
An additional argument against the HGT scenario is that the OM cluster also contains genes that are not specific to diderm Firmicutes but are also present in monoderm lineages, such as fabZ (fatty acid metabolism), murA (cell wall synthesis), mreB (cell shape), spoIID and spoIIID (sporulation) (Fig. 4). Many of these genes lie at the beginning of the OM cluster in diderm Firmicutes but are also similarly clustered in monoderm Firmicutes ( Fig. 4 and Supplementary Fig. 3). Under the hypothesis of a diderm ancestor of all Firmicutes 10 , these genes could be remnants of an ancestral OM cluster, which would have been lost in monoderms. Conversely, under the triple HGT scenario, it is difficult to explain why the OM cluster would have been inserted at the same genomic position three times independently in Limnochordia, Halanaerobiales and Negativicutes.
Finally, the HGT hypothesis is not supported by phylogenetic analysis. Among the OM markers encoded in the gene clusters previously identified, we selected the ones most widely conserved in diderm bacterial phyla. These were gathered into a large concatenated dataset, a strategy routinely used to increase ancient phylogenetic signals 27 . We built two alternative concatenations, one comprising 11 markers (FabZ, LpxACD/LpxB/LpxK, KdsBADC and LptB) totalling 146 taxa and 1,998 amino acid positions, and a second one also including the flagellar proteins (FlgFGAHIJ) totalling 3,705 amino acid positions but fewer taxa (122 taxa) (Methods). A number of diderm phyla could not be included in the concatenation because they lacked more than half of these markers or because their genomes did not have them in cluster ( Supplementary Fig. 6), preventing their clear identification. These datasets are larger than the LPS concatenation we analysed previously 10 . Given the small number of positions with respect to the large evolutionary distances analysed, ML trees from both concatenations are not totally resolved ( Fig. 5a and Supplementary Fig. 4, respectively) but they display nevertheless a well-supported monophyly of major bacterial diderm phyla and an overall topology that is consistent with the known relationships within Bacteria, notably the two large clades of Terrabacteria and Gracilicutes 28,29 . These results indicate that the OM markers were present in the ancestors of each of these major diderm phyla and have not been subjected to extensive HGT during bacterial diversification. Consistently, we observe the monophyly of the three diderm Firmicutes clades; this indicates that their OM markers are more closely related among them than to other bacteria and do not stem from within any specific diderm bacterial phylum, which would have been expected if these markers were acquired through HGT (Fig. 5a). Moreover, the clade formed by diderm Firmicutes groups with other Terrabacteria phyla is in agreement with the phylogenetic placement of the Firmicutes.
Together, these results indicate that an OM with LPS was already present in the ancestor of the Halanaerobiales, Negativicutes and Limnochordia, which is by definition the ancestor of all Firmicutes. These results strengthen the diderm-first hypothesis for this phylum 5,10 and the emergence of the monoderm cell envelope by multiple losses of the OM (Fig. 5b). Moreover, they support the fact that

NaTurE EcOLOGy & EvOLuTION
LPS-OMs have a very ancient origin in Bacteria and were inherited remarkably vertically throughout the diversification of the major diderm phyla.

NaTurE EcOLOGy & EvOLuTION
consider the phylogenetic relationships among diderm and monoderm lineages and among outer membrane markers. The existence of three independent diderm lineages within the Firmicutes adds an important piece to the puzzle and shows that, at least in this phylum, the OM is ancestral and the monoderm phenotype is a derived character that arose multiple times independently through OM loss. The hypothesis of an ancestor with an LPS-OM is also supported by the evidence that the phylogeny of the Firmicutes is becoming increasingly populated by diderm lineages, notably at its deepest offshoots, and it cannot be excluded that even more will be discovered in the future. How the OM would have been lost multiple times in the Firmicutes remains to be understood. Endosporulation was probably important in the transition between monoderms and diderms 3,31 . In fact, during the process of endosporulation, the cell produces a spore that is transiently surrounded by a second membrane, which is then lost during maturation in sporulating monoderm Firmicutes, while it is retained in sporulating diderm Firmicutes 32 . Previous hypotheses proposed that endosporulation would have allowed the emergence of the OM 3,4,31,32 , but we think that the opposite occurred: that viable accidents during endosporulation allowed multiple OM losses in the Firmicutes 5 (Fig. 5b). The widespread presence and antiquity of endosporulation in this phylum (Supplementary Fig. 1 and Methods) could have allowed multiple occurrences of such accidents during its diversification, and this is probably the reason why the Firmicutes are currently the only phylum containing both monoderm and diderm envelopes (the presence of OMs in some Actinobacteria is likely an independent convergence, see below). The alternative scenario where the OM would have been acquired via multiple HGT events in the Firmicutes remains possible, but we

NaTurE EcOLOGy & EvOLuTION
believe it is less supported by our data. Moreover, rather than suggesting HGT, the clustering of the OM genes in diderm Firmicutes may indicate a tight coordination of the various OM biogenesis processes, which could represent a peculiarity of these lineages. Finally, if the OM was acquired multiple times by HGT, this is not a simpler process to imagine, as it would have necessitated that all the complex machineries for OM biogenesis become immediately functional in a monoderm context (notably, attaching the OM to the PG wall and stabilizing it, adapting existing flagella and secretion systems to span two membranes, and developing transport of key compounds to the OM). It may be argued that the benefit of having an OM is such that there would have been no selective advantage in losing it. However, no advantage has to be invoked if the loss of the OM was the result of viable accidents making it unstable. These might not have led to a decrease in fitness so dramatic as to immediately disadvantage the resulting monoderm phenotype, in particular if accompanied (either preceded or followed) by an increase in thickness of the cell wall 2,5,10 . Moreover, the monoderm phenotype with a thick PG wall might have been advantageous in certain environmental conditions (such as drought or high temperature). The absence of a core of protein families specific to monoderm Firmicutes may reflect the fact that different solutions were found independently to accommodate each of these multiple OM loss events, and it is interesting to note that PG structure is indeed highly variable in the Firmicutes 31 , and probably so is the arsenal of enzyme families needed to produce and remodel it. After loss of the OM, the genes involved in its biogenesis would have been progressively lost, and some perhaps were repurposed for In the absence of an outgroup, the tree is tentatively rooted between Terrabacteria and Gracilicutes according to previous analysis 34 . Even though the tree contains only diderm phyla, it shows that the presence of the main OM markers predates the divergence of these phyla (including that of Firmicutes) and that an OM with LPS has therefore an ancient origin. For the corresponding full phylogeny, see the supporting data 14 . b, Evolutionary scenario for the origin and evolution of the OM in the Firmicutes mapped on a schematic of the reference phylogeny in Fig. 1. The ancestor of Firmicutes is indicated as a sporulating diderm with LPS. The LPS-OM was inherited in the three diderm lineages (Halanaerobiales, Limnochordia and Negativicutes), while it was lost three times independently to give rise to the classical monoderm cell envelope.

NaTurE EcOLOGy & EvOLuTION
cell envelope functions in monoderm Firmicutes, a hypothesis that will need specific analysis and experimental evidence. The availability of genetic tools in the negativicute Veillonella parvula 5 opens the way to experimentally test some of these hypotheses and will allow researchers to gather precious insights about the biogenesis and functioning of both the diderm and the monoderm cell envelope.

NaTurE EcOLOGy & EvOLuTION
Whether the last bacterial common ancestor (LBCA) also had an OM has been unclear, notably due to uncertainties in the phylogenetic relationships among diderms and monoderms. We calculated a phylogeny including all the main bacterial phyla used in this study, and we inferred the diderm or monoderm nature of their cell envelopes by mapping the presence or absence, respectively, of two key markers of the OM, BamA and LpxA (Fig. 6a, Supplementary Table 9 and Methods). Although the cell envelope remains uncharacterized in most phyla (notably, the uncultured one for which the diderm or monoderm status can only be tentatively inferred in silico) 5,33 , it is already clear that the presence of diderms is overwhelming in Bacteria compared with that of monoderms. Determining whether the LBCA was already a diderm or a monoderm will require to firmly establish the root of Bacteria, a complex methodological issue that requires specific analyses. We chose here to display a root between Terrabacteria and Gracilicutes, supported by our recent phylogenomic analysis 34 . Combined with evidence presented here that the LPS-OM did not spread across diderm bacteria by HGT but was rather inherited vertically (Fig. 5a), this root may imply that the LBCA could have been a diderm (Fig. 6b, top). Alternative roots have been proposed in the literature, such as a possible one lying in the monoderm Chloroflexi or in the Candidate Phyla Radiation (CPR) 3,35,36 , which could support an LBCA with one membrane (Fig. 6b, bottom). Most importantly, and irrespective of whether the LBCA was a diderm or not, our results show that monoderm phyla do not constitute a natural group but are polyphyletic. This indicates that the monoderm cell envelope would in any case have appeared multiple times independently through OM loss, and the next important goal will be to work out all the evolutionary paths that led to these multiple transitions.
The mechanism that would have been involved in the loss of the OM in the monoderm phyla other than the Firmicutes remains unclear, but it may be linked to different types of viable accidents causing an instability of the OM. It is evident that the few known monoderm phyla are present only in the large clade of the Terrabacteria. This clade displays a larger range of cell envelopes than the more homogenous Gracilicutes, which are composed only of diderms mostly endowed with LPS (Fig. 6a) 5 . This larger diversity possibly suggests that the Terrabacteria cell envelopes have retained some ancestral characteristics. As such, diderm Firmicutes could represent good experimental models of the primordial bacterial cell envelope.
Currently, only the Firmicutes and the Actinobacteria display a mixture of monoderm and diderm lineages. However, the presence of the mycolic acid OM in Actinobacteria such as Corynebacteria 37 is probably an independent de novo origination, as these taxa do not possess any of the classical OM markers. The presence of complex cell envelope structure and potential OM-like structures has also been recently reported in the Chloroflexi 38 , a phylum that lacks all classical OM markers and is thought to be monoderm 39 . However, virtually nothing is known of the cell envelope structure of most bacterial phyla, in particular those with no representative cultured members, and obtaining ultrastructural data for these phyla is another fundamental challenge of future research.
Finally, how the OM initially came into being remains unknown. It is possible that it arose from a simpler cell surrounded by a single membrane (for example, of the monoderm type), but we have no means by using phylogeny to go this far back in time beyond the LBCA. However, we think it unlikely that endosporulation was at the origin of the OM in the LBCA 3,4 , as today this type of sporulation (which produces a transient OM) is specific to the Firmicutes and probably originated in this phylum. Alternatively, early speculation by Blobel postulated that the very first cell was already surrounded by a double membrane through a mechanism involving the formation of a "gastruloid" vesicle and the fusion of its extremities 40 . So, it is possible that early life never went through a 'simpler' monoderm phase but started already as diderm.
Some of these questions should be addressed through a more systematic characterization of a wide range of cell envelopes, both monoderm and diderm, across all bacterial phyla (notably the uncultured ones), combined with large-scale comparative genomics and phylogenomic analysis to fully reconstruct the evolutionary history that accompanied their evolution and the multiple transitions among them, as well as the development of new experimental models from unexplored branches of the tree of life.

Updating the Firmicutes reference tree and identifying diderm lineages.
We retrieved 1,639 genomes annotated as Firmicutes from the dataset of Parks et al. 13 , deposited under NCBI BioProject PRJNA348753. These genomes were isolated from different environments, and their quality ranges from partial to near complete (Supplementary Table 1). According to the taxonomy provided by Parks et al. 13 , these genomes consist of 351 Bacilli, 980 Clostridia, 61 Erysipelotrichia, 1 Tissierellia, 62 Negativicutes, 1 Halanaerobiales and 183 annotated as unclassified Firmicutes. Because these UBA genomes were in the nucleotide format at the time of this analysis, we used Prodigal 41 with the default parameters to predict genes. To analyse their phylogenetic placement within the reference phylogeny of Firmicutes, we added 230 complete genomes from representatives of all families available in the NCBI databases as of December 2017, including 61 Bacilli, 86 Clostridia, 4 Tissierellia, 62 Negativicutes, 16 Halanaerobiales and the genome of the only available representative of Limnochordia, L. pilosa. This resulted in a databank of 1,869 genomes (DB LARGE Firmicutes) (see supporting data 14 ). Exhaustive hidden Markov model (HMM)-based homology searches (with an e-value cut-off of 1 × 10 −4 ) were carried out by using the HMM profiles of the complete set of 54 bacterial ribosomal proteins from the Pfam v.29.0 database 42 as queries using the HMMER package 43 . Absences were checked with TBLASTN 44 . Forty-five ribosomal proteins present in >70% of the genomes were kept for analysis, 514 UBA Firmicutes genomes having less than 35 ribosomal proteins were considered as too partial and discarded from analysis, and 13 taxa were included as outgroups (see supporting data 14 ). The 45 ribosomal proteins of the 1,355 remaining UBA and outgroup genomes were aligned by CLUSTAL OMEGA 45 with the default parameters and trimmed using BMGE-1.1 (ref. 46 ) with the BLOSUM35 substitution matrix. The resulting trimmed alignments were concatenated into a supermatrix (1,368 taxa and 5,087 amino acid positions). An ML tree was generated using IQ-TREE v.1.4.4 (ref. 47 6 | Distribution of monoderm and diderm cell envelopes across Bacteria and two potential evolutionary scenarios for their origin. a, ML reference tree of Bacteria, with cell envelope types mapped on it. The tree was obtained from a concatenation of RNA polymerase subunits B and B′ and translation initiation factor IF-2 (2,144 amino acid positions and 377 taxa). The tree was inferred with IQ-TREE and the LG+C60+G+F model. Node supports higher than 70% are displayed. The scale bar represents the average number of substitutions per site. The tree does not include an outgroup but is tentatively rooted between two large clades corresponding to Terrabacteria and Gracilicutes, as in Raymann et al. 34 . For the corresponding full phylogeny, see the supporting data 14 . PVC s.l. (sensu lato) indicates the clade including the PVC superphylum (Planctomycetes, Verrucomicrobia and Chlamydiae) and related phyla; FCB s.l. indicates the clade including the FBC superphylum (Fibrobacteres, Bacteroidetes and Chlorobi) and related phyla; Proteobacteria s.l. indicates the clade including the Proteobacteria subdivisions and related phyla. For the corresponding full phylogeny, see the supporting data 14 . b, Evolutionary scenario mapped on a schematic version of the tree in a. Different roots of the Bacteria have been proposed in the literature; we show here two alternatives as an example: between Terrabacteria and Gracilicutes 34 (top) or in the branch leading to the CPR and Chloroflexi 35,36 (bottom). In the first scenario, the LBCA could have been a diderm. In the second scenario, the LBCA could have been a monoderm, and the OM would have appeared just after the divergence of Chloroflexi and CPR. Note that both scenarios imply multiple losses of the OM. The second scenario also leaves open the possibility that the LBCA was a diderm and that the OM was lost in the branch leading to the Chloroflexi and the CPR.
To identify sporulating taxa, we used HMMSEARCH to screen the DB SMALL Firmicutes using the Pfam domain spo0A_C (PF08769) with the option -cut_ga, and we mapped the results onto the corresponding tree of Firmicutes (see supporting data 14 ).

Distribution of protein families and domains in diderm and monoderm
Firmicutes. To carry out the large-scale comparative genomic analysis, we assembled a reduced databank of 316 genomes. It includes the 230 reference Firmicutes genomes and the newly identified UBA diderm Firmicutes (46 Limnochordia,39 Negativicutes and 1 Halanaerobiales), for a total of 165 diderm and 151 monoderm taxa (DB SMALL Firmicutes) (see supporting data 14 ). The 861,409 proteins contained in the DB SMALL Firmicutes were annotated using eggNOG-Mapper 51 with the default parameters. The eggNOG-Mapper tool uses precomputed orthologous groups and phylogenies from the eggNOG database 52 to transfer functional information from fine-grained orthologues only. In a second approach, Pfam domains were predicted for each protein using HMMSEARCH (e-value ≤ 1 × 10 −5 ) against the Pfam v.29.0 database. The results of the two approaches were merged, and each protein family was annotated according to the most frequent prediction of its members. We performed all-versus-all pairwise comparisons of the protein sequences contained in the DB SMALL Firmicutes using BLASTP v.2.6.0 (ref. 44 ) with the default parameters. Protein families were assembled with SILIX v.1.2.9 (ref. 53 ). Identity threshold values from 30% to 60% with intervals of 5% were tested, with a coverage of at least 80%. The resulting protein families were then refined using HIFIX v.1.0.5, which performs a three-step high-quality sequence clustering guided by network topology and multiple alignment likelihood 54 . To assess the most suitable identity threshold to group orthologous proteins, we tested different cut-offs by using as a positive control the clustering of 16 ribosomal proteins commonly used in phylogenetic analyses and of the four core LPS proteins (LpxABCD). The identity threshold that maximized the number of true positives and minimized the number of false positives was 35%. This 80% coverage-35% identity cut-off cannot, however, completely exclude some false negatives or false positives. Applying this threshold resulted in 176,024 protein families.
From these, we retained the families present in at least five taxa, resulting in 15,964 families for further analysis (see supporting data 14 ), and a presence/absence matrix was built. Among the 15,964 families, 41 were completely absent from monoderm Firmicutes while present in at least one member of all three diderm lineages (strict core diderm families).
To relax this strict criterion, we used two clustering approaches on the presence/absence matrix (HCLUST and K-MEANS, both implemented in R-3.5.1 (ref. 55 )). HCLUST hierarchically clusters the families according to Jaccard distances calculated on the presence/absence matrix. We tested different cut-offs and chose the one that allowed gathering the four core LPS protein families (LpxABCD) in the same cluster as a positive control (the number of generated clusters was set to 3,500) (see supporting data 14 ). Four clusters (HC5, HC7, HC8 and HC9) included at least one of the strict core diderm families with a large taxonomic distribution (present in more than 40% of the diderm Firmicutes genomes), totalling 120 families.
For the second method, we clustered families using K-MEANS (k = 500) on the basis of multiple correspondence analysis. As the K-MEANS approach involves defining a random set of starting points in a multidimensional space, ten iterations were run. In total, 173 protein families clustered with at least one of the 41 strict core diderm families and were present in more than 40% of the diderm Firmicutes genomes in all K-MEANS iterations. Among these families, 95 were common to HCLUST and K-MEANS.
In parallel to the distribution of protein families, we used a second approach based on protein domains. Because proteins can contain different domains or multiple occurrences of the same domain, we used three different approaches: in the first, we considered together all proteins containing exactly the same type and number of predicted domains (ALL); in the second, we considered together proteins containing the same domains even if some had more than one occurrence (COLLAPSED); and in the third, we counted the same proteins as many times as the different domains they contain (SINGLE) (Supplementary Tables 7 and 8) (see supporting data 14 ).
Finally, we used the four obtained datasets (protein families and the three Pfam domain approaches) to identify the diderm-and monoderm-specific protein families and Pfam domains using the PCC on the basis of their presence/absence in the diderm and monoderm taxa (higher than 0.5). The conserved genomic loci for cell envelope components in the Firmicutes were assessed using GeneSpy 56 .
Distribution of the OM cluster in Bacteria and evolutionary analysis. To study the distribution of the OM cluster, we built HMM profiles of all genes included in the OM cluster of diderm Firmicutes. We then used MacSyFinder 57 to identify clusters containing these genes in the DB SMALL Firmicutes and a second databank including 377 genomes representative of 58 main bacterial phyla (DB Bacteria) (see supporting data 14 ). We defined a cluster as a system with at least three of these OM components with a separation no greater than five other genes.
Among the gene clusters identified above, we selected 11 OM markers most conserved across Bacteria (FabZ, KdsA, KdsB, KdsC, KdsD, LptB, LpxA, LpxB, LpxC, LpxD and LpxK_WaaA). For each of them, homologues were aligned with MAFFT using the L-INS-I option 58 and trimmed using BMGE-1.1. The resulting alignments were concatenated by allowing a maximum of six missing markers per taxon and leading to a supermatrix of 146 taxa and 1,998 amino acid positions. A second matrix including the flagellar components FlgF, FlgG, FlgA, FlgH, FlgI and FlgJ was assembled by allowing a maximum of eight missing markers per taxon and included 122 taxa (because many do not have flagella) and 3,705 amino acid positions.
For the reference phylogeny of Bacteria, we used a concatenation of RNA polymerase subunits B, B′ and IF-2 (2,144 amino acid positions and 377 taxa).
For all concatenations, ML trees were generated using IQ-TREE v.1.4.4 with the profile mixture model LG+C60+F+G with ultrafast bootstrap supports calculated on 1,000 replicates from the original data.
To map cell envelope types onto the reference phylogeny of Bacteria, we built an HMM profile for BamA using the sequences in the corresponding family described above, and used it together with the HMM profile of LpxA to screen the DB Bacteria with HMMSEARCH (e-value ≤ 1 × 10 −4 ). The results were then refined and absences checked with TBLASTN. All trees were annotated using iToL (ref. 50 ).
Note added in proof: an analysis was deposited on BioRxiv 59 after this study was accepted for publication (https://doi.org/10.1101/2020.07.15.205187). It reports a number of alternative roots of the bacterial phylogeny, including one in between Terrabacteria and Gracilicutes. It also infers the proteome of the LCBA, which comprises a few OM markers, suggesting its diderm nature.
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All results and raw data relative to this analysis (databanks, sequence accession numbers, sequence datasets and corresponding trees, and protein families) are provided as supporting data at Mendeley Data repository 14  Corresponding author(s): Simonetta GRIBALDO Last updated by author(s): Jul 24, 2020 Reporting Summary Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.

n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly The statistical test(s) used AND whether they are one-or two-sided Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted

Software and code
Policy information about availability of computer code

Data collection
No software was used for data collection Data analysis -Homology searches and pfam domains prediction: HMMER-v3.3: HMMSEARCH for biosequence analysis using profile hidden Markov models (e-value <= 1e-5) BLAST 2.6.0+: tblastn and blastp with default parameters -Protein families building: SILIX V1.2.9: Ultra-fast sequence clustering from similarity networks HIFIX V1.0.5: High-quality sequence clustering guided by network topology and multiple alignment likelihood Identity thresholds values from 30% to 60% with intervals of 5% were tested, with a coverage of at least 80%. The resulting protein families were then refined using HIFIX v1.0.5, which performs a three-step high-quality sequence clustering guided by network topology and multiple alignment likelihood41. To assess the most suitable identity threshold to group orthologous proteins, we tested different cutoffs by using as positive control the clustering of 16 ribosomal proteins commonly used in phylogenetic analyses and of the four core LPS genes (lpxABCD). The identity threshold that maximized the number of true positives and minimized the number of false positives was 35%. This 80% coverage-35% identity cutoff cannot however completely exclude some false negatives or false positives -Protein families clustering: Hierarchical clustering (HCLUST) and Multiple Correspondence analysis (MCA) algorithms, both implemented in R. For HCLUST, we clustered hierarchically the families according to Jaccard distances calculated on the presence/absence matrix. We set the number of generated clusters to 3,500. MCA computes multiple correspondence analysis on the presence/absence matrix. Then, we clustered the families using k-means (k = 500). As the MCA approach involves defining a random set of starting points in a multidimensional space, 10 iterations were run.
-Protein families annotation: 2 nature research | reporting summary

October 2018
Protein families were functionally annotated using two approaches. First each protein was annotated by using the eggNOG-Mapper with default parameters. The eggNOG-Mapper tool uses precomputed orthologous groups and phylogenies from the eggNOG database to transfer functional information from fine-grained orthologs only. In parallel, Pfam domains were predicted for each protein using HMMSEARCH (e-value <= 1e-5) against the Pfam 29.0 database. Finally, each protein family was annotated according the most frequent prediction of its members.
-Correlation analysis: Protein families and Pfam domains were used to identify the diderm and monoderm specific protein families and Pfam domains using the Pearson correlation coefficient based on their presence/absence in the diderm and monoderm taxa (higher than 0.5).
-Synteny analyses: MacSyFinder to identify clusters of OM components in the genomes of Bacteria and Firmicutes. We defined a cluster as a system with at least 3 of these OM components with a separation no greater than 5 other genes GeneSpy to identify the conserved genomic locus for cell envelope components in the Firmicutes.
-Evolutionary analysis: Clustal Omega V1.2.3 with default parameters and MAFFT V7.222 with the L-INS-I option for the alignments BMGE-1.1 with the BLOSUM35 option for selection of phylogenetic informative regions from multiple sequence alignments IQTREE V1.4.4 with the model LG+C60+F+G for tree building For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
Policy information about availability of data All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: -Accession codes, unique identifiers, or web links for publicly available datasets -A list of figures that have associated raw data -A description of any restrictions on data availability https://data.mendeley.com/datasets/3pcn9779gc/draft?a=8cc3c448-b3e7-4b02-96fb-9b3a0af4625d Field-specific reporting Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection. All studies must disclose on these points even when the disclosure is negative.

Study description
We surveyed over 1,639 genomes from uncultured lineages of the Firmicutes that became recently available. We identified markers of an outer membrane in 86 new genomes, and identified a third whole new diderm clade. We then carried out a large comparative genomic analysis using the new diderm lineages identified in the Firmicutes. We identified 15,000 families constituting the Firmicutes pan-proteome, and we analyzed their distribution in the Firmicutes. Finally, we carried out the phylogenetic analysis of a large concatenation of the most conserved outer membrane markers in all bacteria.

Research sample
Three databanks were used in our analysis:

Reproducibility
The reliability of the phylogenetic trees was measured by the bootstrap probability of interior branches of the tree.
-Bacterial phylogeny: 100 resampled multiple sequences alignments (MSAs) were generated to construct a tree for each bootstrapped MSA. The resulting bootstrap trees are used to compute branch supports for the tree reconstructed from the original MSA.
-Firmicutes phylogeny: For this tree we used the ultrafast bootstrap approximation implemented in the IQ-TREE software with 1000 replicates. This approach relies on the trees encountered during the ML-tree search for the original MSA -instead of MSA resampling -to evaluate the tree likelihoods for the bootstrap MSAs.