Metagenomic analysis of soybean endosphere microbiome to reveal signatures of microbes for health and disease

Background Soil metagenomics is a cultivation-independent molecular strategy for investigating and exploiting the diversity of soil microbial communities. Soil microbial diversity is essential because it is critical to sustaining soil health for agricultural productivity and protection against harmful organisms. This study aimed to perform a metagenomic analysis of the soybean endosphere (all microbial communities found in plant leaves) to reveal signatures of microbes for health and disease. Results The dataset is based on the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) release “microbial diversity in soybean”. The quality control process rejected 21 of the evaluated sequences (0.03% of the total sequences). Dereplication determined that 68,994 sequences were artificial duplicate readings, and removed them from consideration. Ribosomal Ribonucleic acid (RNA) genes were present in 72,747 sequences that successfully passed quality control (QC). Finally, we found that hierarchical classification for taxonomic assignment was conducted using MG-RAST, and the considered dataset of the metagenome domain of bacteria (99.68%) dominated the other groups. In Eukaryotes (0.31%) and unclassified sequence 2 (0.00%) in the taxonomic classification of bacteria in the genus group, Streptomyces, Chryseobacterium, Ppaenibacillus, Bacillus, and Mitsuaria were found. We also found some biological pathways, such as CMP-KDO biosynthesis II (from D-arabinose 5-phosphate), tricarboxylic acid cycle (TCA) cycle (plant), citrate cycle (TCA cycle), fatty acid biosynthesis, and glyoxylate and dicarboxylate metabolism. Gene prediction uncovered 1,180 sequences, 15,172 of which included gene products, with the shortest sequence being 131 bases and maximum length 3829 base pairs. The gene list was additionally annotated using Integrated Microbial Genomes and Microbiomes. The annotation process yielded a total of 240 genes found in 177 bacterial strains. These gene products were found in the genome of strain 7598. Large volumes of data are generated using modern sequencing technology to sample all genes in all species present in a given complex sample. Conclusions These data revealed that it is a rich source of potential biomarkers for soybean plants. The results of this study will help us to understand the role of the endosphere microbiome in plant health and identify the microbial signatures of health and disease. The MG-RAST is a public resource for the automated phylogenetic and functional study of metagenomes. This is a powerful tool for investigating the diversity and function of microbial communities. Supplementary Information The online version contains supplementary material available at 10.1186/s43141-023-00535-4.


Background
Soybean, Glycine max (L.)Merrill is an annual, selfpollinated diploid legume (subfamily Fabaceae).Soybeans have been grown as a commercial crop, mainly in temperate ecologies, for thousands of years.One of the most widely cultivated legumes, soybeans, came initially from East Asia, but can now be found everywhere [1].The United States has grown soybeans over the most significant areas.It accounts for approximately 32% of the world's soybean production, followed by Brazil (31%), Argentina (19%), China (6%), and India (4%) [2].Madhya Pradesh has the highest soybean production in any other state.Soybeans have been cultivated in the Indian state of Madhya Pradesh during the last two years over an area of around 4.4 million hectares (ha), with a total yield of approximately 3.9 million tonnes and average productivity of 796-885 kg/ha [3].The conditions under which soybeans were grown were similar to those of maize.In addition to being used to make oil, crayons, and other products, soybeans are also used for a variety of other uses.Its output is almost identical to that of maize [4].The microbial population living in the roots of soybean plants is diverse and mostly composed of bacteria and fungi [5].Interactions between the plant host and its microbial communities determine microbiome diversity and taxonomy, and assist critical plant activities, such as nutrient absorption and tolerance to biotic and abiotic changes [6].Plant microbiomes may include both beneficial and harmful bacteria.Microbes inhabit the root rhizosphere and endosphere, which are composed of the outermost tissue layers of the root identified via research on experimental model plants and crops [7].The microbial communities that are isolated from the various root compartments each have unique taxonomic structures and functional compositions [8,9], highlighting the significance of the intricate connections between various bacterial and fungal communities and the role that these communities play in the formation of the microbiome [10].The Microbiomes of many plant species support plant defense against pathogens and environmental stress through mechanisms such as hormone induction, nutrient absorption, and transport [11].There has been a meteoric rise in the number of studies conducted in recent years with the objective of describing the human microbiome (the environment, including the microbiota, any proteins or metabolites they make, their metagenome, and host proteins and metabolites in this environment) in both healthy and diseased conditions.The field of microbiology has undergone a paradigm shift in the last 30 years, which has caused a change not only in our perspective on microorganisms, but also in the techniques that are used to investigate them.This has led to significant development [12].In the early part of the twentieth century, there was a widespread belief that microbes would not exist if they could not be cultivated in a laboratory [13].Metagenomics was first noted by [14], which encompassed information on the entire microbial community composition and function, widening the area of genomics where only genetic material is studied.A few prior studies, such as [15], on phylogenetic analyses of environmental microbial communities have also been reported [16].In the process of metagenomics study, genomic DNA from all organisms in a community (metagenome) was extracted for fragmentation, cloning, transformation, and subsequent screening of the constructed metagenomic library.Initially, the primary target of metagenomics was limited to screening environmental communities for a specific biological activity and to identifying the related genomics [17,18].Although it is considered to have a significant impact in determining the outcome of metagenome analysis, it circumvents the uncolorability and genomic diversity of most microbes, the biggest roadblocks to advances in microbiology that are not properly cultured in the laboratory and identification.Knowledge gaps in understanding unculturable microorganisms and functional and taxonomic analyses are fundamental limitations [19].Metagenomics studies can be tackled using the targeted metagenomics approach and shotgun metagenomics approach with fundamental differences based on methodology and objectives.In targeted metagenomics, a gene or a few genes are sequenced and used primarily to carry out phylogenetic studies, whereas in shotgun metagenomics, all the present DNA is sequenced and used in functional gene analysis assays (Morgan et al. 2013).This process usually involves next-generation sequencing (NGS) after DNA is extracted from the samples.This resulted in a large amount of data in the form of short reads.
In this study, we investigated the microbes present in the soybean endosphere and identified their taxonomy, function, and genes.The endosphere microbiome of soybean plants is composed of a wide variety of bacteria and fungi, which play an important role in plant health.Beneficial microbes can improve plant nutrition by increasing the availability of nutrients to plants.They can also protect plants from disease by competing with pathogens for space and nutrients, and by producing antibiotics.In addition, beneficial microbes can help plants tolerate stress by producing enzymes that detoxify the stress hormones.In this process, each piece of DNA is assigned to a particular taxonomic group, such as species, genus, or family.There are many different methods that can be used for taxonomic classification, but one of the most common is called "taxonomic hits distribution".In the taxonomic hit distribution, the sequenced DNA was compared with a reference database of known DNA sequences.This reference database can either be a collection of known genomes or a collection of known genes.The reference database was searched for the best match to each piece of DNA in the sample, and then the taxonomic group of the reference sequence was assigned to the piece of DNA in the sample.

Dataset acquired and processing
The SRR10740534 dataset has been retrived from the NCBI SRA and is based on the paper "microbial diversity in soybean".Fastq-dump SRA toolkit software was used to convert the data file from the Sequence Alignment Map (SAM) to FASTQ format.Most sequencers produce sequence files in FASTQ format, which is a standard.This is similar to the FASTA format, where Q represents the quality [17].Along with the sequence, it is recommended that the FASTQ file contain the sequence and quality of the sequence bases.The primary detail of the dataset is given in Table 1, and the basic statistics of the considered dataset is given in Table 2 below.

MG-Rast-server
We utilized MG-RAST (version 4.0.3) to control quality, predict proteins, and organize and annotate nucleic acid sequence databases.MG-RAST compares the predicted proteins to database proteins (for shotgun) and compares the 16S and 18S sequences to reads.MG-RAST allows access to phylogenetic and metabolic reconstructions [20,21].

Data processing
We selected the following pipeline for data processing.
1. Assembled: If the file contains assembly data, we choose the assembled input sequence option and include coverage information within each sequence header.2. Dereplication: This process includes removing artificially replicated sequences that are artificially processed.3. Screening-There is a filter for hot species in screening, and then we select the specific species.It removes any host species sequence, for example, plant, human, mouse, and others, with the help of DNA-level matching with a bowtie [22].4. Dynamic trimming: This method removes low-quality sequences using dynamic cutting.

Omics Box tools
Omics Box is bioinformatics software that converts readings into insights.For each collection of sequences, these tools enable the identification of pathways, function analysis, gene prediction, and other functions from multiple databases [23].The OmicsBox tools were used to predict gene function, pathway, and gene modules.

IMG/M
IMG/M is an integrated genome and metagenome comparative data analysis system that allows open access interactive analysis of publicly available datasets, whereas manual curation, submission, and access to private datasets and computationally intensive workspace-based analysis require login/password access to its expert review (ER) companion system (IMG/MER) [24].The core data model underlying IMG allows recording the primary sequence information and its organization in scaffolds and/or contigs [25].Metagenome bins can be stored in IMG as individual workspace scaffold datasets, and analyzed using many tools, such as function profiles [24].The new Scaffold Search under the Find Genomes menu provides two search modes: Quick Search allows querying of scaffolds in IMG using scaffold IDs, while Advanced Search allows querying of scaffolds using various metadata attributes [26].

Sequencing quality analysis
To process the metagenome data analysis, dataset quality analysis was performed using the FastQC program.The FastQC program provides a QC report on spot problems that originate either in the sequencer or in the starting library material.Many modules were used to evaluate the raw data, and an HTML report with a module summary was created.The pre-alignment steps are specified in the quality control report.Run the FastQC summary report and compare the read format information with the overall poor quality to filter out and cut low-quality sequence parts while maintaining high-quality sequences.The dataset contained 72,868 sequences, totalling 203,552,249 base pairs (bp) with an average length of 378 ± 77 bp.
The quality control process rejected 21 of the evaluated sequences (0.03% of the total), as shown in Fig. 1.Dereplication determined that 68,994 sequences were artificial duplicate readings, and removed them from consideration.Ribosomal RNA genes were present in each of the 72,747 sequences that successfully passed the QC.In Fig. 1(a, b), the feature breakdown and function of the QC are shown.

Source hits distribution
The biological interpretation of the source hit distribution is essential for providing information on how many sequences per dataset were found for each database.The source hists distribution has been investigated, we have found 16 hit databases including protein databases, protein databases with functional hierarchy information, and ribosomal RNA databases with maximum in RefSeq [27], TrEMBL [28], and Subsystems shown in Fig. 2. In the figure, the bars representing annotated reads are colored based on the e-value range.It is important to note that different databases may have varying numbers of hits and can also provide different types of annotation data.

Sequence GC distribution
The Sequence GC Distribution was evaluated as illustrated in Figs. 3 and 4. Histograms depict the sequence lengths in bp for this metagenome.Each position represents the number of base pairs (bp).The charts used raw upload and post-QC data.
In Sequence GC Distribution analysis, Guanine and Cytosine-rich areas were identified to predict the annealing temperature.Figure 4 shows the GC % distribution in

Taxonomic hits distribution
When conducting a metagenomic study, one of the key parameters that is often considered is taxonomic hit distribution.This provides insights into the species present in a given sample and their relative abundance.Taxonomic hit distribution can provide insights into the species present in a given sample and their relative abundance.This information can be used to help understand the ecology of a sample and can be used to help guide future studies.Hierarchical classification for taxonomic assignment was conducted using MG-RAST, and the considered dataset of the metagenome domain of Bacteria (99.68%) dominated other groups of eukaryotes (0.31%) and unclassified sequence 2(0.00%).The charts below represent the distribution of taxa using a contig LCA algorithm, finding a single consensus taxonomic entity for all features on each individual sequence.Similarly, Proteobacteria dominated over Actinobacteria, Bacteroidetes, and Firmicutes.In terms of bacterial taxonomy, the richest classes, orders, families, and genus are as follows Streptomyces (25.60%),Chryseobacterium (18.46%),Paenibacillus (15.94%),Bacillus (10.86%),Mitsuaria (8.57%), Dyadobacter (2.94%), Pseudomonas (2.69%), Rhizobium, (2.61%) Acinetobacter (2.49%), Burkholderia, (1.98%) unclassified (derived from Bacteria)-(1.6),Micromonospora (0.97%), Arthrobacter (0.53%), Serratia-(0.50%)These percentages indicate the prevalence of each taxonomic entity, as depicted in Fig. 5(a-f ).The distribution of taxa was determined using a contigLCA algorithm, which assigned a single consensus taxonomic classification to all features found in each individual sequence.Within the realm of bacterial taxonomy, the genus Streptomyces stands out by claiming a substantial portion, approximately 25.60%, of its corresponding taxonomic category.Notably, this genus exhibits remarkable capabilities as it produces antibiotics with efficacy against a wide range of biological adversaries, including fungi, bacteria, and parasites [29].Streptomyces has harnessed its capabilities to develop immunosuppressants and biocontrol agents specifically designed for agricultural purposes.These antibiotics exhibit the power to regulate and combat fungi and parasites, effectively safeguarding crops like soybeans from potential damage caused by these microorganisms.Moreover, Streptomyces demonstrates its prowess by suppressing or eradicating microbial adversaries, while simultaneously stimulating plant growth in various agricultural settings.This remarkable phenomenon has been observed across multiple crop types, leading to significant improvements in soybean crop production.Furthermore, the presence of Streptomyces contributes to the overall promotion and enhancement of soybean crop growth [30].Among the recorded sequences, it was determined that the genus Corynebacterium held the second-highest percentage, amounting to approximately 18.46%.The majority of Gram-positive bacteria falling under this classification exhibit the ability to thrive and persist in oxygen-rich environments.Corynebacterium species are ubiquitous, inhabiting the soil layers on the skin of mammals.According to the genetic characteristics, examination of the Corynebacterium genome revealed both harmful and non-pathogenic species [31].Within the genus, Paenibacillus emerges as the third prominent species, accounting for a substantial percentage of approximately 15.94%.This particular species comprises a multitude of sequences that play a pivotal role in facilitating the growth and development of soybean crops [32].Paenibacillus species that establish a symbiotic Fig. 5 We present taxa using contig LCA to determine a single consensus taxonomic item for each sequence feature.A domain B Phylum C Class D Order E Family F Genus relationship with plants possess the remarkable ability to produce auxin phytohormones, which exert a profound influence on plant development.In addition to this, these species facilitate the uptake of phosphorus by plant roots and some of them even engage in atmospheric nitrogen fixation, thereby providing substantial benefits to soybean plants.Furthermore, Paenibacillus plays a crucial role in suppressing phytopathogens through the production of biocides that contribute to systemic resistance [33].Bacillus emerges as the third prominent species, accounting for a substantial percentage of approximately 10.86%, Bacillus species are predominately found in food and have both beneficial and harmful effects on human health.This is because these microbes produce bioactive substances during fermentation.Consequently, eating food made from soybeans fermented by Bacillus ensures food safety [34].Additionally, Bacillus helps the seedlings of soybean plants to become more resistant to infection.Bacillus species possess a wide variety of advantageous tracts.Plants benefit from this process because they can obtain nutrients.Increased synthesis of phytohormones leads to better overall growth and increased resistance to both biotic and abiotic stressors [34].There are numerous varieties of bacteria found in this genus, which may be advantageous and helpful for soybean plant development and metabolic activity, and some aid as a biofertilizer and biotic and abiotic stress, and have a major function in soybean crops.Within the context of soybean plants, the genus Mitsuaria assumes a noteworthy position.It is worth mentioning that the sequences affiliated with Mitsuaria constitute the fourth largest segment, amounting to approximately 8.57% sequence.Mitsuaria isolates have been observed to inhibit fungal and oomycete plant pathogens in laboratory and in vivo experiments on soybean seedlings, leading to a reduction in disease severity.This study indicates the effectiveness of T-RFLP-derived markers for identifying microorganisms with pathogeninhibiting properties [35].The metagenomic sequences reveal that several genera, including Pseudomonas, Rhizobium, Burkholderia, and Mesorhizobium, co-exist in the rhizosphere and nodules of soybean plants [36].Additionally, endophytic bacteria, including Burkholderia, Rhizobium, Bradyrhizobium, Mesorhizobium, and Dyadobacter, have been identified as beneficial for plant growth and development.Some studies have also investigated the effects of plant growth promoting rhizobacteria (PGPR) on soybean growth and soil bacterial community composition.For example, Paenibacillus mucilaginosus, a PGPR strain, improved symbiotic nodulation, soybean growth parameters, nutrient contents, and yields in a field experiment [37].The certain genus Acinetobacter, Micromonospora, and Serratia species in the soybean metagenome promote plant growth through nitrogen fixation, phosphate solubilization, siderophore production, phytohormone synthesis, and enhanced tolerance to salinity.Their presence and activities contribute to the overall growth and development of soybean plants.

Rank abundance plot
To graphically depict taxonomic richness and evenness, Rank Abundance plots were arranged the taxonomic abundances in descending order from their most abundant to their least abundant values.In most cases, only the top 50 most prevalent cases are presented.On a logarithmic scale, the abundance of annotations is shown along the y axis.The most abundant sequences on the left are shown in Fig. 6.

Rarefaction curve
The rarefaction curve shows the total number of different species annotations as a function of the number of sampled sequences.This curve indicates the richness of the annotated species (Fig. 7).

Scaffold analysis from the genome sequence
A scaffold is a reconstructed genomic sequence from whole-genome shotgun clones, consisting of contigs and gaps.It is created by chaining contigs together and separating them by gaps.Whole-genome shotgun assembly aims to represent each genomic sequence in one scaffold, but it is not possible.Scaffolding improves the contiguity and quality of metagenomic bins by assembling short metagenomic reads into longer contiguous sequences based on sequence overlap.The distribution of scaffolds by gene count provided valuable insights into the prevalence and distribution patterns of genes within the metagenomic dataset.The analysis revealed varying numbers of scaffolds within specific gene count ranges, indicating varying levels of gene abundance and representation.The histogram tab in Scaffold Cart displays a histogram with the counts of protein-coding genes in the sample.Analysis of gene count distribution within metagenomic datasets plays a fundamental role in unraveling the complexity and functional diversity of microbial communities.In this study, the distribution of scaffolds by gene count was thoroughly examined, providing valuable insights into the prevalence and distribution of genes across a metagenomic dataset.The results demonstrated a comprehensive breakdown of scaffolds falling within specific gene count ranges, ranging from 1 to 12,972, as shown in Fig. 8.This detailed breakdown enabled a deeper understanding of the distribution patterns and relative abundance of genes within the metagenomic dataset.By elucidating the number of scaffolds within each gene count range, this analysis sheds light on the genetic composition and functional potential of microbial communities, thereby contributing to our knowledge of the intricate dynamics of these complex ecosystems.In this study, we identified 6,568 scaffolds, with gene counts ranging from 1 to 1,299.Of these, 515 scaffolds had gene counts ranging from 1,300 to 2,597; 613 scaffolds had gene counts ranging from 2,598 to 3,895; 577 scaffolds had gene counts ranging from 3,896 to 5,193; 362 scaffolds had gene counts ranging from 5,194 to 6,491; 172 scaffolds had gene counts ranging from 6,492 to 7,789; 105 scaffolds had gene counts ranging from 7,790 to 9,087; 69 scaffolds had gene counts ranging from 9,088 to 10,385; eight scaffolds had gene counts ranging from 10,386 to 11,683; and seven scaffolds had gene counts ranging from 11,684 to 12,972.The distribution of gene counts across scaffolds was nonuniform, with a higher proportion of scaffolds having fewer genes.This suggests that the genome is composed of a large number of small genes and a smaller number of larger genes.The distribution of gene counts may also be influenced by the assembly method used because different methods may have different biases in the number of genes that can be detected.The identification of scaffolds with different numbers of genes is important for understanding genome organization.Genes are often clustered together on scaffolds, and the number of genes on a scaffold can be used to infer their functions.For example, scaffolds with a large number of genes are often associated with metabolic pathways, whereas scaffolds with a small number of genes are often associated with regulatory functions.In addition to analyzing the gene count distribution, it is essential to examine other relevant parameters that provide further insights into the metagenomic dataset.This section focuses on the GC percentage, scaffold count, and combined sequence length.These parameters contribute to our understanding of the composition and structural characteristics of the datasets.Figure 9 presents the distribution of scaffolds based on their GC percentage range along with the corresponding scaffold count and combined sequence length.This analysis provided an overview of the dataset and presented the values for each GC percentage range.In the range of 13.54 to 20.54, there were 208,564 scaffolds with a combined sequence length of 208,564 base pairs.In the range of 21.54 to 27.54, we observed a significant increase in scaffold count, with 44,316,835 scaffolds and an equivalent combined sequence length.Similarly, the ranges of 28.54 54 show varying scaffold counts and combined sequence lengths.These values provide valuable insights into the composition and characteristics of the metagenomic dataset, offering a quantitative representation of the genomic content within each GC percent range.This helps to unravel the dataset's characteristics, genomic diversity, and structural properties of the microbial communities under study.

Function analysis
OmicsBox is a bioinformatics software platform that enables researchers to go from raw data to meaningful insights within a couple of hours [23].Functional studies of the Clusters of Orthologous Groups (COG), cellular activities and signalling, metabolism, and storage were carried out.The analysis was carried out on 827 sequences, each of which had an average length of 44.0 characters.Only 1.81% of the sequences had gene ontology (GO) annotations, leading to the discovery of 75 GO term annotations.Functional analysis of COG considers that, in the instance of Metabolic Processes, the predominant functions by sequence are amino acid translation, ribosomal structure and biogenesis, and replication, recombination, and repair.Metagenomic analysis of the fermented soybean product sikkam indicated that the sequence activities include translation, ribosomal structure and biogenesis, replication, recombination, and repair [38].Mapping of metagenomic sequences against databases of orthologous gene groups revealed many enriched recombination and repair functional sequences.

Gene prediction
Gene prediction is an important tool in metagenomics and in the study of the genetic material of an entire ecosystem.By examining the genes of organisms in a sample, scientists can learn about the functions of these genes and the organisms themselves.There are several methods of gene prediction, but they all center on one basic process: looking for regions of the genome that are likely to encode proteins.Proteins are the building blocks of all living organisms; therefore, genes encoding proteins are often the most important.There are many different types of proteins, each with a specific function.Some proteins are involved in metabolism, whereas others are involved in cell structure regulation.Regardless of the function of the protein, gene prediction can help to identify the genes that encode it.By identifying these genes, scientists can learn about the function of the proteins and the organisms that produce them.Gene prediction uncovered 1,180 sequences, 15,172 of which included gene products, with the shortest sequence being 131 bases and maximum length 3829 base pairs.These gene products were found in the genome of strain 7598.The maximum genes were discovered in Bacillus, Pseudomonas, Paenibacillus, Klebsiella, Lactobacillus, Streptomyces, Bradyrhizobium, Bordetella, Cronobacter, Salmonella, Corynebacterium, Micromonospora, Cupriavidus, Akkermansia, Leuconostoc, Xanthomonas, Priestia, Ligilactobacillus, and Candidatus (Table S1).The gene list was additionally annotated using Integrated Microbial Genomes and Microbiomes.The annotation process yielded a total of 240 genes found in 177 bacterial strains, as depicted in Table 3.

Pathway analysis
Pathway analysis is a powerful tool to understand the biological significance of gene lists generated from      high-throughput experiments.In this study, pathway analysis was used to identify 16 significant pathways enriched in the predicted genes.The network of pathway enrichment of the metagenome data has been shown in Fig. 10.These pathways are involved in a variety of essential cellular processes, including biosynthesis, energy production, and signaling.The CMP-KDO biosynthesis II (from D-arabinose 5-phosphate) pathway is one of the most significant pathways identified in this study [39].It is involved in the biosynthesis of lipopolysaccharide (LPS), an essential component of the outer membrane of gram-negative bacteria (Fig. 1S(i)).Two sequences, aconitate hydratase (K01681) and citrate synthase (K01647), are associated with the TCA cycle pathway.They have been found in Glycine max (soybeans) and Saccharomyces cerevisiae (yeast).The TCA cycle is essential for optimal functioning of primary carbon metabolism in plants (Fig. 1S(ii-iv)).Aconitate hydratase catalyzes the isomerization of citrate to isocitrate in the TCA cycle.The function of aconitate hydratase has been well studied in model plants, such as Arabidopsis thaliana.The TCA cycle is a metabolic process that occurs in plants, animals, fungi, and other bacteria.It is a series of chemical reactions that converts acetyl-CoA into carbon dioxide and energy.The TCA cycle is an important source of energy for cells and also plays a role in the synthesis of other molecules such as amino acids and fatty acids [40].
The next pathway involved the biosynthesis of fatty acids (Fig. 1S(v)).This is essential for the formation of membranes, which are necessary for the viability of all cells, except Archaea.Fatty acids are also a compact energy source for seed germination.Enenoyl-[acyl-carrier protein] reductase I (K00208) is an enzyme involved in fatty acid biosynthesis, prodigiosin biosynthesis, and biotin metabolism pathways.Another significant pathway that has been identified is biotin metabolism (Fig. 1S(vi)).The Biotin metabolism is a universal and essential process that is required for intermediary metabolism in all three domains of life: archaea, bacteria, and eukaryotes [41].It is an indispensable vitamin for human health and plays a vital role in well-being [42].This essential nutrient can be obtained through the consumption of a wide range of foods, including legumes, soybeans, tomatoes, romaine lettuce, eggs, cow's milk, and oats.One of the primary functions of biotin is to act as a cofactor for enzymes, facilitating carboxylation reactions that are crucial for processes, such as gluconeogenesis, amino acid catabolism, and fatty acid metabolism.In addition, it produces biochemicals that have a wide variety of applications in nutrition and industry.Another gene sequence, also been discovered that encodes a protein called 2-dehydro-3-deoxyphosphooctonate aldolase (K01627), also known as Kdo-8-phosphate synthase (KDO 8-P).This protein is a major constituent of the outer leaflet of the outer membrane of most Gram-negative bacteria.It is essential for the survival of bacteria and pathogens, and is involved in the biosynthesis of nucleotide sugars (Fig. 1S(vii)) and lipopolysaccharide biosynthesis pathways (Fig. 1S(viii)) [43].Another important pathway has been discovered that is essential for plant growth and development that is called the porphyrin and chlorophyll metabolism pathway (Fig. 1S(ix)).Porphyrins are a group of organic compounds essential for life.They are found in chlorophyll, which is a green pigment that plants use for photosynthesis.Porphyrins are also found in heme, a protein that carries oxygen in the blood.The porphyrin and chlorophyll metabolic pathways are complex processes that involve the synthesis of porphyrins and chlorophyll.This pathway is essential for plant growth and development because it provides plants with the materials required to produce chlorophyll and heme.The porphyrin and chlorophyll metabolic pathways are synthesized by a multistep pathway that involves eight enzymes [44], which is a complex process involving numerous chemical reactions catalyzed by enzymes.The regulation of chlorophyll and heme balance is important for the growth and development of plants [45].Porphyrin biosynthesis is one of the most conserved pathways known, with the same sequence of reactions occurring in all species.By associating different metals, porphyrins give rise to the "pigments of life": chlorophyll, heme, and cobalamin [46].The glyoxylate and dicarboxylate metabolic pathways play a pivotal role in the sustenance of organisms and their biochemical functions (Fig. 1S(x)).In the glyoxylate pathway, citrate synthase (K01647) is responsible for citrate formation from acetyl-CoA and oxaloacetate.Citrate is then converted into succinate, which is used to synthesize glucose [47].This process entails the conversion and application of these intermediates, which are generated via the catabolism of fatty acids, metabolism of amino acids, and fermentation of carbohydrates.These compounds can facilitate the synthesis of diverse molecules such as glucose, amino acids, and fatty acids.The glyoxylate and dicarboxylate pathway holds significant Fig. 10 The Pathway Enrichment analysis importance in the realm of plant physiology, owing to the fact that unlike animals, plants are unable to stockpile carbohydrates in the form of glycogen.Rather than undergoing direct utilization, fatty acids are transformed into glucose molecules, which play a crucial role in supporting growth and reproduction.The bacterial pathway is of paramount importance, as it facilitates the conversion of carbon dioxide into organic compounds, thereby enabling the acquisition of energy from carbon dioxide.In general, glyoxylate and dicarboxylate pathways are crucial and intricate metabolic pathways that are indispensable for the viability of diverse organisms.The aforementioned metabolic pathway is important for the generation of energy, carbon metabolism, and production of biomolecules.Comprehending the overall metabolic network and its implications in cellular function requires a thorough understanding of the intricacies of glyoxylate and dicarboxylate metabolism.The process of prodigiosin biosynthesis was initially characterized in the γ-proteobacterium, Serratia marcescens.Subsequently, it was studied and characterized in another bacterium, Pseudoalteromonas rubra.In these organisms, prodigiosin biosynthesis involves a series of biochemical reactions and enzymatic steps that lead to the production of this captivating red pigment (Fig. 1S(xi)).By exploring the biosynthetic pathways of prodigiosin in different bacterial species, researchers have gained valuable insights into the diversity and complexity of this intriguing pigment and its potential applications in various fields.ABC transporters pathway are a large, ancient protein superfamily found in all living organisms (Fig. 1S(xii)).They function as molecular machines by coupling ATP binding, hydrolysis, and phosphate release for the translocation of diverse substrates across membranes.ABC transporters are also known as efflux pumps because they mediate the cross-membrane transportation of various endo-and xenobiotic molecules energized by ATP hydrolysis [48] and the arginine transport system substrate-binding protein(K09996), which specifically binds to arginine molecules and facilitates their transport across the cell membrane.This protein is part of a large complex involved in cellular arginine uptake.Substratebinding proteins play a crucial role in the arginine transport system by recognizing and capturing arginine molecules from the extracellular environment and initiating their transport into cells.It ensures the specificity and efficiency of arginine uptake, contributing to various biological processes that require arginine as a nutrient or signaling molecule.The biosynthetic pathways of cofactors employ a greater quantity of innovative organic chemistry compared to other pathways in primary metabolism (Fig. 1S(j)).As a result, there is a wealth of research being conducted on the mechanisms of cofactor biosynthetic enzymes [49].There are two sequence of arginine transport system substrate-binding protein (ASBP) (K09996, K09997) have been associated with biosynthetic pathways of cofactors.The function of these protein is unknow.It is a protein that is involved in the transport of arginine across the cytoplasmic membrane of bacteria.ASBP is a member of the ABC transporter family, which is a large family of proteins that are involved in the transport of a variety of molecules across membranes [50].The small subunit ribosomal protein S12 (K02950) is present in both mitochondrial and bacterial ribosomes (Fig. 1S(xiv)).Ribosomal protein S12 is an essential component of the small subunit within the ribosome that is responsible for protein synthesis.In E. coli, S12 plays a significant role in facilitating translation initiation.This protein consists of approximately 120-150 amino acid residues and is fundamentally basic in nature [51].Two-component systems represent intricate signaling pathways that have gained significant attention during the initial stages of the 1980s.Their emergence into the spotlight can be primarily attributed to their identification within the paradigmatic microorganism, E. coli (Fig. 1S(xv)).These systems provide living organisms with the ability to detect and convert a diverse array of incoming signals, enabling them to adjust and respond in a highly adaptable manner to alterations in both their external and internal environments [52].Methyl-accepting chemotaxis proteins (K03406) are the most common receptors in bacteria and archaea.They are arranged as trimers of dimers that form hexagonal arrays in the cytoplasmic membrane or cytoplasm [53].Methyl-accepting chemotaxis proteins (MCPs) are also involved in bacterial chemotaxis, which is essential for the host colonization and virulence of many pathogenic bacteria causing human, animal, and plant diseases (Fig. 1S(xvi)).These receptors undergo reversible methylation during the adaptation of the bacterial cells to environmental attractants and repellents.They are also involved in bacterial chemotaxis.MCPs are concentrated at the cell poles in an evolutionarily diverse panel of bacteria and archae [54].They are classified into different classes according to their ligand-binding region and membrane topology.Chemotaxis is the process by which cells sense chemical gradients in their environment and move towards more favorable conditions.MCPs are a family of bacterial receptors that mediate chemotaxis to diverse signals and respond to changes [53] (Table 4).

Conclusion
The enormous amount of data generated from omics studies is possible only with fast and accurate handy omics tools, as they are available on today's scientific platform.Metagenomics is an area that is booming with the advent of NGS.Metagenomics is the study of the genes and genomes of microbes that cannot be cultured in a laboratory.Metagenomic data of a sample can be generated by shotgun sequencing of total community DNA.The gene sequences obtained from metagenomic shotgun sequencing can be Nucleotide-Nucleotide BLAST (blastn) mapped to the available microbial genomes in public domain databases, such as NCBI GenBank, RefSeq, and Integrated Microbial Genomes (IMG).This will provide an inventory of all microbial genera and species present in the sample.The unmapped reads were annotated using different in silico tools, such as DIAMOND, KEGG, CAZy, and eggNOG, for the identification of genes and their functions.Taxonomic and functional annotation of genes helps understand the metabolic pathways that are unique to a particular microbiome.Based on these inferences, it is possible to propose models for the role of microbes in health and disease.Soybean is one of the most important crops in the world and a major source of protein and oil.The soybean endosphere is a unique microenvironment colonized by a large number of bacteria, fungi, and viruses.The composition of the endospheric microbiome differs from that of the rhizospheric microbiome.The microbiome of the endosphere plays an important role in plant health.In this study, we aimed to elucidate the usefulness of the microbiome in revealing the signatures of microbes in healthy and diseased soybean agricultural lands.To identify the microbes associated with health and disease, microbial diversity in the soybean endosphere was analyzed by metagenomic analysis using the MG-RAST tool.The analysis of the soybean endosphere microbiome revealed signatures of microbes associated with health and disease.The most dominant group of bacteria in the endosphere is Streptomyces, followed by Chryseobacterium, Paenibacillus, Bacillus, and Mitsuaria.These bacteria play a role in a variety of biological pathways, including CMP-KDO biosynthesis II (from D-arabinose 5-phosphate), TCA cycle (plant), citrate cycle (TCA cycle), fatty acid biosynthesis, and glyoxylate and dicarboxylate metabolism.These data revealed that it is a rich source of potential biomarkers for soybean plants.The results of this study will help us understand the role of the endosphere microbiome in plant health and identify the microbial signatures of health and disease.

Fig. 1
Fig. 1 QC result of Sequences a Sequence Breakdown b Predicted Features

Fig. 2 Fig. 3 Fig. 4
Fig. 2 Source hit distribution of studied data set

Fig. 6 Fig. 7
Fig. 6 kmer rank abundance graph plots the kmer coverage as a function of abundance rank

Table 1
Primary detail of the dataset

Table 3
The gene list annotation using the metagenomic database

Table 4
Provides more specific breakdown of information on pathways and sequences