GStat: 10:05:50 02/13/12 GMT - @goc.grid.sinica.edu.tw




Introduction

Gstat
    
    About Gstat
    ---------------------------------------------------------------------
    GStat is an application designed to monitor EGEE/LCG compatible 
    Information Systems.  GStat's primary goals are to detect faults, 
    verify the validity of the published data and display useful data 
    from the Information System.

    GStat tests the Information System approximately every 30 minutes. 
    The test does not rely on any submitted job, but rather on queries 
    to site GIISes/BDIIs.  This is done to gather information and 
    perform so called sanity checks to point out any potential problems 
    with individual sites. The test covers the following areas:

     * Site and service information: Provides information about the site, 
       services, software and VOs supported at that site.

     * Usage information: Provides statistics on job slots, jobs and 
       storage space.

     * Information integrity: Checks whether the Information System is 
       publishing data that meets specific syntax and value rules.


    GStat Internals
    ---------------------------------------------------------------------
    GStat runs on a single server.  From this server, GStat executes 
    queries, processes the results and generates static HTML reports.  
    Execution of queries and result processing are accomplished by agents 
    and filters respectively.  Currently there are two servers running 
    GStat, at ASGC and CNAF.  Both servers can be accessed from 
    this alias: http://gstat.gridops.org/gstat/

    GStat agents are responsible for making queries and collecting raw 
    data for further analysis by filter components.  Filters execute test 
    logic and generate processed data, which is in turn used to create 
    web-based test reports.

    The configuration of GStat heavily depends on the data found in the 
    GOCDB.  GStat queries the GOCDB for the site's GIIS contact string, 
    nodes information and other basic site information.

    GStat currently stores numeric data in RRD databases with data 
    reduction.  Additional historical results can be found in daily 
    snapshot archives.

    Using Gstat
    ---------------------------------------------------------------------
    The GStat interface consists of these components:
     * Top menu
     * Summary table
     * Total statistics
     * Global tests
     * Table view
     * Site report

    At the top of the main page, you will find the following links:
     Home:     Brings you to the main page of the current instance.
     Alert:    Shows only sites with alert levels of warning and above.
     Service:  Displays services that are available to VOs.
     Regional: Shows only sites from selected regions.
     Service metric: Middleware version statistics
     Links:    Related links to GStat
     ?:        Main help and documentation page

    The remaining links to the right point to other GStat instances that 
    run on this server.  A few special instances are:
     prod:   Production certified EGEE sites
     pps:    Preproduction EGEE sites
     tests:  Sites that are either non-production or uncertified

    The summary table view shows site names and the most severe status of 
    all tests associated with each site.  Clicking on the site name will 
    display the detailed GStat report for the site.  Each site also has a 
    small table cell to the right.  This cell indicates, and links to, the 
    results on the SAM Tests page.  Multiple cells indicate that the site 
    hosts multiple CEs.  

    Below the summary table, you can find a total statistics table for 
    the entire instance.  The 'Total' link will display graphs associated 
    with these statistics.

    The Global Grid Test displays the results for the service duplicate 
    test.  This test primarily is designed to look for duplicate instances 
    of global LFC services for a single VO.  There should only be one 
    global LFC for each VO.

    At the bottom of the main page is the table view, which shows 
    individual test results for each site.  This table can be reorganized 
    into different perspectives with the 'sort by' links at the top of 
    the table.

    Finally there is a detailed report generated for each site.  At the 
    top of the report, you will find links to the site's homepage, SAM 
    results, GOCDB and graphs for all test result data associated with 
    the site.  The body of the report will consist of individual sections 
    for each test performed and their detailed results.  Each test 
    section displays the name of the test, the result status, a link 
    to the alert status history graph and help documentation for the test.  
    The bottom of the report shows test data results in both tables and 
    graphs.  Long term graphs can be located by following the link for 
    each graph.
    
    All GStat tests take scheduled downtime booked in GOCDB into account 
    when setting the alert level of a result.  There are two cases:
    
     * If the whole site is in downtime, the site alert level is changed 
       to maintenance status.  In addition, the test alert level of the 
       'GOC DB Info' section in the site report is marked as maintenance.  
       All tests associated with the site still run normally, so the 
       real status and the details of the test results are shown even 
       though the site is in downtime.  
       
     * If only some nodes of the site are in downtime, the alert levels 
       of tests associated with the maintained nodes are marked as 
       maintenance, but the site alert level is not affected.  In 
       particular, the 'Service Check' section in the site report ignores 
       the maintained nodes and takes the most severe status of the 
       remaining nodes as the test alert level.  Please note that if the 
       site-bdii is in downtime, the alert level of tests associated with 
       the bdii is marked as maintenance, but the details of the check 
       results are still shown.

    Feedback and comments for GStat can be sent to roc-dev at 
    lists.grid.sinica.edu.tw and issues can be raised with GGUS tickets 
    by adding GStat to the ticket title.
    
    The section below describes the filters available for Gstat.
    

BDIINode_Perf
BDIINode Performance Filter: Checks BDII node performance
            
            Column Name:    bnode
            
            Conditions               Alert level
            -------------------------------------
            No problems              OK
            Response time > 10 secs  INFO
            No entries found         ERROR
            -------------------------------------

            This filter sends ldapsearch queries to the top-level BDII 
            nodes found in the GOCDB.  The number of entries found and 
            the query response time (ms) are recorded.
            
            To query the bdii the following command and options are used: 
                ldapsearch -xL -s one -l 15 -h  -p 2170                    
                -b 'mds-vo-name=local,o=grid'
                    
            This query only searches one level below the search base 
            provided.  The number of entries represents the number of sites 
            found in the bdii query.
            
            The current results and graphs for each site's BDIIs can be 
            found in the site's detailed reports.  The following suffixes 
            are used after the BDII hostnames:
            
                BE    BDII Entries
                BT    BDII response Times
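
            The query and the alert table above can be sketched in Python 
            as follows.  This is only an illustration, not GStat's own 
            code: the function names and the host argument are 
            hypothetical, and checking "no entries" before the response 
            time is an assumption, since the table does not state which 
            condition takes precedence.

```python
from typing import List


def build_ldapsearch_args(host: str, port: int = 2170,
                          base: str = "mds-vo-name=local,o=grid",
                          time_limit: int = 15) -> List[str]:
    """Build the argument list for the one-level BDII query shown above.

    `host` is a placeholder; the real hostnames come from the GOCDB.
    """
    return ["ldapsearch", "-xL", "-s", "one", "-l", str(time_limit),
            "-h", host, "-p", str(port), "-b", base]


def bdii_alert_level(entries: int, response_ms: float) -> str:
    """Map the recorded values to the alert levels in the table above."""
    if entries == 0:
        return "ERROR"            # No entries found
    if response_ms > 10_000:      # Response time > 10 secs (recorded in ms)
        return "INFO"
    return "OK"


if __name__ == "__main__":
    # bdii.example.org is a made-up hostname used only for illustration.
    print(" ".join(build_ldapsearch_args("bdii.example.org")))
    print(bdii_alert_level(entries=320, response_ms=850.0))
```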

            

CERNSE_Check
CERNSE Check Filter: Checks if BDII has CERN SE
            
            Column Name:    bnode
            
            Conditions               Alert level
            -------------------------------------
            No problems              OK
            Problems with SE object  NOTE
            -------------------------------------

            This filter checks if CERN's SE samdpm001.cern.ch, used in SFT, 
            can be found in each site's BDII.  If this SE is missing,
            then the SFT replication test may fail. 

            

GIISQuery_Perf
GIIS Performance Filter: Checks GIIS Query performance
            
            Column Name:    gperf
            
            Conditions               Alert level
            -------------------------------------
            No problems              OK
            Response time > 40 secs  INFO
            No entries found         ERROR
            Old entries found        ERROR
            -------------------------------------

            The filter shares the same agent as the SanityCheck filter and
            uses the same ldapsearch query results.    
            
            The number of entries found, the number of old entries (not 
            modified within 10 minutes) and the query response time (ms) 
            are recorded.  If any old entry is found, the oldest value of 
            modifyTimestamp found in the information system and the 
            timestamp at which GStat started checking for old entries are 
            both listed, together with the difference between the two 
            timestamps.
            
            The current results and graphs for each site's GIIS can be 
            found in the site's detailed reports.  The following names 
            identify the data collected:
            
                giisEntry      GIIS Entries
                giisOld        GIIS Old entries - with modifyTimestamp 
                                            older than 10 minutes
                giisTime       GIIS response Times (ms)
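
            A minimal sketch of the old-entry check, assuming that 
            modifyTimestamp values arrive in the usual LDAP form 
            YYYYMMDDHHMMSSZ (UTC).  The function names are hypothetical 
            and the code only illustrates the 10 minute rule described 
            above.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Tuple

MAX_AGE = timedelta(minutes=10)   # entries older than this count as "old"


def parse_modify_timestamp(value: str) -> datetime:
    """Parse an LDAP modifyTimestamp such as '20120213095000Z' as UTC."""
    parsed = datetime.strptime(value, "%Y%m%d%H%M%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def find_old_entries(timestamps: Iterable[str], check_time: datetime
                     ) -> Tuple[int, Optional[datetime]]:
    """Return (number of old entries, oldest modifyTimestamp among them)."""
    old = [stamp for stamp in map(parse_modify_timestamp, timestamps)
           if check_time - stamp > MAX_AGE]
    return len(old), (min(old) if old else None)


if __name__ == "__main__":
    started = datetime(2012, 2, 13, 10, 5, 50, tzinfo=timezone.utc)
    count, oldest = find_old_entries(
        ["20120213100000Z", "20120213095000Z", "20120213093000Z"], started)
    if oldest is not None:
        # Report the count, the oldest timestamp and the difference.
        print(count, oldest.isoformat(), started - oldest)
```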

            

GIISQuery_SanityCheck
GIISQuery SanityCheck Filter: Performs syntax and logic checks on GIIS
    
        Column Name:    sanity

        -------------------------------------
        Conditions               Alert level
        -------------------------------------
        no problems               OK
        blank lines exists        NOTE
        blank values found        WARN
        invalid entries           WARN
        query failed              ERROR
        -------------------------------------
        
        This filter performs several types of checks on the GIIS output:
        1 - Syntax Checks
            a) Check for non-zero-length blank lines (lines containing 
                only spaces).  These may cause problems.
            b) Check for entries that have no values
            c) Check for lines without ":".  These should not exist.
            d) Check for a missing newline character between two 
                attributes.  This looks like two lines combined together.
            e) Check for duplicate GlueCEStateWorstResponseTime 
                in each CE. 
        2 - Missing attributes
            a) Check if GlueCEUnique & GlueSEUnique DN specified in 
                "dn: GlueCESEBindGroupCEUniqueID=" exist
            b) Check if srm_v1/edg-se SEs have consistent access
                rules between the GlueSARoot and GlueServiceURI DN entries
            c) Check if the following critical DNs and their attributes exist
                IN: dn: GlueSiteUniqueID=
                IN: dn: GlueServiceUniqueID=
                    GlueServiceType     .+
                    GlueServiceEndpoint .+


        Related wikis: 
         * http://goc.grid.sinica.edu.tw/gocwiki/Value_for_%22GlueSEUniqueID%22_not_published
         * http://goc.grid.sinica.edu.tw/gocwiki/Value_for_%22GlueCEUniqueID%22_not_published
         * http://goc.grid.sinica.edu.tw/gocwiki/Value_for_%22GlueSAStateAvailableSpace%22_is_not_published
         * http://goc.grid.sinica.edu.tw/gocwiki/Value_for_%22GlueCESEBindCEAccesspoint%22_is_not_published
         * http://goc.grid.sinica.edu.tw/gocwiki/GIIS_unreachable
         * http://goc.grid.sinica.edu.tw/gocwiki/Invalid_Installation_Date
         * http://goc.grid.sinica.edu.tw/gocwiki/Information_from_GRIS_not_published_to_GIIS
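
        Syntax checks (a), (b) and (c) listed above can be illustrated 
        with a short Python sketch.  It is not GStat's implementation: 
        the function name and result keys are hypothetical, and skipping 
        comment lines and LDIF continuation lines (lines starting with a 
        single space) is an assumption about how the query output is 
        folded.

```python
from typing import Dict, List


def syntax_check(ldif_lines: List[str]) -> Dict[str, List[int]]:
    """Return 1-based line numbers that fail checks (a), (b) and (c)."""
    problems: Dict[str, List[int]] = {
        "blank_with_spaces": [],   # check (a)
        "no_value": [],            # check (b)
        "no_colon": [],            # check (c)
    }
    for number, line in enumerate(ldif_lines, start=1):
        if line == "":
            continue               # a truly empty line separates entries
        if line.strip() == "":
            problems["blank_with_spaces"].append(number)
        elif line.startswith("#") or line.startswith(" "):
            continue               # assumed: comment or continuation line
        elif ":" not in line:
            problems["no_colon"].append(number)
        elif line.split(":", 1)[1].strip() == "":
            problems["no_value"].append(number)
    return problems


if __name__ == "__main__":
    sample = [
        "dn: GlueSiteUniqueID=EXAMPLE,mds-vo-name=local,o=grid",
        "GlueSiteName: EXAMPLE",
        "GlueSiteDescription:",
        "   ",
        "",
        "garbage line",
    ]
    print(syntax_check(sample))
```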
         

GIISQuery_Service
GIISQuery Service Filter: Checks for services in GIIS
            
        Column Name:    serv

        ----------------------------------------------------------------------------------
        Service registered in GOCDB        Missing Service in BDII            Alert level
        ----------------------------------------------------------------------------------
                                           none missing                       OK 
        RB                                 missing ResourceBroker             WARN
        MyProxy                            missing MyProxy                    WARN
        CE                                 missing GlueCE                     ERROR
        gLite-CE                           missing GlueCE                     ERROR
        CREAM-CE                           missing GlueCE                     ERROR
        ARC-CE                             missing GlueCE                     ERROR
        Classic-SE                         missing GlueSE                     ERROR
        Central-LFC                        missing lcg-file-catalog           ERROR
        Local-LFC                          missing lcg-local-file-catalog     ERROR
        WMS                                missing org.glite.wms.WMProxy      ERROR
        LB                                 missing org.glite.lb.Server        ERROR
        Site-BDII                          missing bdii_site                  INFO
        Top-BDII                           missing bdii_top                   INFO        
        ----------------------------------------------------------------------------------   
        Service published in BDII          Missing Service in GOCDB           Alert level
        ----------------------------------------------------------------------------------
        bdii_site                          missing Site-BDII                  INFO
        bdii_top                           missing Top-BDII                   INFO        
        ----------------------------------------------------------------------------------
            
        This filter takes the list of service nodes in GOCDB and 
        checks if the services are published in the information system 
        as a "GlueServiceUniqueID" or a "GlueCEUniqueID" or a 
        "GlueSEUniqueID" object.  This allows a site to notice 
        if an important service goes down and ceases to publish
        its presence into the information system.  This filter also 
        checks if the GlueService DNs in the information system are 
        registered in GOCDB as the corresponding service types.
        
        The node status of monitoring and downtime is shown in columns
        "Monitored" and "Downtime".  The column, "GOCDB NodeTypes",
        is the list of corresponding service types for the nodes 
        registered in GOCDB, and "BDII ServiceTypes" column contains 
        the list of GlueServiceType values in GlueServiceUniqueID DNs 
        which have the same hostname and corresponding to the node 
        in GOCDB.
             
        The history of the node status is also collected.  If the node 
        or service is missing then the alert level shown above is raised.
        If the node monitoring in the GOCDB is turned off, then the alert
        level is set to 0 or "NA".
            
        Note:
        This filter depends on the results from the GOCDB Agent plugin.
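
        The first half of the table above can be read as a lookup from 
        the GOCDB service type to the service expected in the BDII.  The 
        sketch below copies that mapping; the function name and the 
        input shapes (a list of GOCDB types and a set of published 
        service names) are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

# GOCDB service type -> (service expected in BDII, alert level if missing),
# copied from the first table of this section.
GOCDB_TO_BDII: Dict[str, Tuple[str, str]] = {
    "RB": ("ResourceBroker", "WARN"),
    "MyProxy": ("MyProxy", "WARN"),
    "CE": ("GlueCE", "ERROR"),
    "gLite-CE": ("GlueCE", "ERROR"),
    "CREAM-CE": ("GlueCE", "ERROR"),
    "ARC-CE": ("GlueCE", "ERROR"),
    "Classic-SE": ("GlueSE", "ERROR"),
    "Central-LFC": ("lcg-file-catalog", "ERROR"),
    "Local-LFC": ("lcg-local-file-catalog", "ERROR"),
    "WMS": ("org.glite.wms.WMProxy", "ERROR"),
    "LB": ("org.glite.lb.Server", "ERROR"),
    "Site-BDII": ("bdii_site", "INFO"),
    "Top-BDII": ("bdii_top", "INFO"),
}


def missing_in_bdii(gocdb_types: Iterable[str], published: Set[str]
                    ) -> List[Tuple[str, str, str]]:
    """List (GOCDB type, missing BDII service, alert level) per problem."""
    findings = []
    for service_type in gocdb_types:
        if service_type not in GOCDB_TO_BDII:
            continue               # type not covered by the table
        expected, level = GOCDB_TO_BDII[service_type]
        if expected not in published:
            findings.append((service_type, expected, level))
    return findings


if __name__ == "__main__":
    print(missing_in_bdii(["CE", "WMS", "Site-BDII"], {"GlueCE", "bdii_site"}))
```

        An empty result corresponds to the "none missing" row (OK).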
            
        

GIISQuery_ServiceEntry
GIISQuery Service Verify Filter: Checks GlueServiceUniqueID
            
        Column Name:    serEntry

        -------------------------------------
        Conditions               Alert level
        -------------------------------------
        no problems               OK
        srm check                 ERROR
        -------------------------------------
            
        This filter verifies syntax of GlueServiceUniqueID entities.
        The following checks are currently performed.
        
        1. Check if SRM has the following acceptable type and version
         * SRM         1.1.0, 2.2.0
         * srm         1.1.0, 2.2.0
         * srm_v1      1.1.0
         ** other types starting with "srm" are not acceptable
         
        Related wikis: 
         * http://goc.grid.sinica.edu.tw/gocwiki/How_to_publish_the_version_of_my_SRM
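
        The SRM type and version rule above could be expressed as the 
        following sketch.  The function name is hypothetical, and 
        treating the "starting with srm" rule as case-insensitive is an 
        assumption; the text does not say how mixed-case types are 
        handled.

```python
from typing import Dict, Set

# Accepted (type, versions) pairs, copied from the list above.
ACCEPTED_SRM: Dict[str, Set[str]] = {
    "SRM": {"1.1.0", "2.2.0"},
    "srm": {"1.1.0", "2.2.0"},
    "srm_v1": {"1.1.0"},
}


def srm_alert_level(service_type: str, version: str) -> str:
    """Return 'OK' or 'ERROR' for one GlueService type/version pair."""
    if not service_type.lower().startswith("srm"):
        return "OK"                # not an SRM service, rule does not apply
    if version in ACCEPTED_SRM.get(service_type, set()):
        return "OK"
    return "ERROR"                 # unknown srm* type or unaccepted version


if __name__ == "__main__":
    for pair in [("srm_v1", "1.1.0"), ("srm_v1", "2.2.0"), ("srm_v2", "2.2.0")]:
        print(pair, srm_alert_level(*pair))
```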
                     
        

GIISQuery_SiteInfo
GIISQuery SiteInfo Filter: 
        
        Column Name:        version
        
        -------------------------------------
        Conditions               Alert level
        -------------------------------------
        No problems              OK
        old dataGridVersion      NOTE
        sitename mismatch GOCDB  NOTE
        -------------------------------------
        
        Values:             dataGridVersion found

        Detailed site report includes the following information: 

        siteName:               Name of site
        dataGridVersion:        Middleware version installed
        UserSupportContact:     User support email contact
        SysAdminContact:        Administrator email contact
        GlueSiteLatitude:       -90 to 90 degrees
        GlueSiteLongitude:      -180 to 180 degrees
        GlueCEUniqueID:         List of CE found
        GlueSEUniqueID:         List of SE found
        GlueServiceURI:         List of services and their URI
        GlueHostApplicationSoftwareRunTimeEnvironment:
            List of softwares/packages installed on this subcluster
        
        The OS Name and Release are checked against the accepted 
        values registered in this wiki:
           http://goc.grid.sinica.edu.tw/gocwiki/How_to_publish_the_OS_name
        -------------------------------------------------------------
        GlueHostOperatingSystemName GlueHostOperatingSystemRelease:
		AIX 5.2
		CentOS 3.5
		CentOS 3.6
		CentOS 3.7
		CentOS 3.8
		CentOS 4.2
		CentOS 4.5
		CentOS 4.6
		CentOS 4.7
		CentOS 4.8
		CentOS 5.0
		CentOS 5.1
		CentOS 5.2
		CentOS 5.3
		CentOS 5.4
		CentOS 5.5
		Debian 3.1
		Debian 4.0
		FedoraCore 4
		Gentoo 2006.0
		RedHatEnterpriseAS 3
		RedHatEnterpriseAS 4
		linux-rocks-3.1 Rocks Linux
		linux-rocks-4.1 Rocks Linux
		Scientific Linux 3.0.3
		Scientific Linux 3.0.4
		Scientific Linux 3.0.5
		Scientific Linux 3.0.6
		Scientific Linux 3.0.7
		Scientific Linux 3.0.8
		Scientific Linux 3.0.9
		ScientificSL 4.2
		ScientificSL 4.3
		ScientificSL 4.4
		ScientificSL 4.5
		ScientificSL 4.6
		ScientificSL 4.7
		ScientificSL 4.8
		ScientificSL 5.0
		ScientificSL 5.1
		ScientificSL 5.2
		ScientificSL 5.3
		ScientificSL 5.4
		ScientificSL 5.5
		Scientific Linux CERN 3.0.4
		Scientific Linux CERN 3.0.5
		Scientific Linux CERN 3.0.6
		Scientific Linux CERN 3.0.8
		ScientificCERNSLC 4.3
		ScientificCERNSLC 4.4
		ScientificCERNSLC 4.5
		ScientificCERNSLC 4.6
		ScientificCERNSLC 4.7
		ScientificCERNSLC 4.8
		ScientificCERNSLC 5.2
		ScientificCERNSLC 5.3
		ScientificCERNSLC 5.4
		ScientificCERNSLC 5.5
		SUSE LINUX 9
		SUSE LINUX 10
		SUSE LINUX 10.2
		Ubuntu 5.10
		Ubuntu 6.06
		Ubuntu 8.04
		Ubuntu 8.10        
        GlueCEPolicyMaxTotalJobs: should be set to an accurate number 
        Related wikis:
         * http://goc.grid.sinica.edu.tw/gocwiki/Sitename_inconsistency
         * http://goc.grid.sinica.edu.tw/gocwiki/Contact_e-mail_address_inconsistency
         * http://goc.grid.sinica.edu.tw/gocwiki/How_to_publish_the_OS_name
        

GIISQuery_Usage
GIISQuery Usage Filter: Analyzes GIIS for CPU, Job & Storage Usage

        Column Name:    totalcpu, cpuUsed %, runjob, freejob
                        seAvail, seUsed %
        Values:         Number  - success
                        ""    - not available 

        -------------------------------------
        Conditions               Alert level
        -------------------------------------
        No problems               OK
        se percent usage > 80%    INFO (1)
        se percent usage > 90%    WARN (1)
        seAvail < 1GB             WARN
        waitJob > 50*totalCPU     WARN
        waitJob > 150*totalCPU    ERROR
        no cpu info found         ERROR
        no job info found         ERROR
        -------------------------------------
        
        (1)    Alert suppressed if more than 5 TB of storage is available.
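
        The conditions in the table and footnote (1) can be combined as 
        in the sketch below.  Several points are assumptions, not facts 
        from GStat: the severity ordering OK < INFO < WARN < ERROR, the 
        use of GB as the unit for seAvail, 5 TB taken as 5 * 1024 GB, 
        and the function and parameter names.

```python
from typing import Optional

SEVERITY = ["OK", "INFO", "WARN", "ERROR"]   # assumed ordering, least to most severe
SUPPRESS_ABOVE_GB = 5 * 1024                 # footnote (1): 5 TB, assumed binary units


def usage_alert_level(total_cpu: Optional[int], wait_job: Optional[int],
                      se_avail_gb: float, se_used_percent: float) -> str:
    """Return the most severe alert level raised by the usage conditions."""
    if total_cpu is None or wait_job is None:
        return "ERROR"                       # no cpu info / no job info found
    levels = ["OK"]
    if wait_job > 150 * total_cpu:
        levels.append("ERROR")
    elif wait_job > 50 * total_cpu:
        levels.append("WARN")
    if se_avail_gb < 1:
        levels.append("WARN")
    if se_avail_gb <= SUPPRESS_ABOVE_GB:     # percentage alerts not suppressed
        if se_used_percent > 90:
            levels.append("WARN")
        elif se_used_percent > 80:
            levels.append("INFO")
    return max(levels, key=SEVERITY.index)


if __name__ == "__main__":
    print(usage_alert_level(total_cpu=100, wait_job=10,
                            se_avail_gb=2000.0, se_used_percent=85.0))
```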
        
        totalCPU:   Total number of cpu for the site
        freeCPU:    Number of free cpus
        runJob:     Total running jobs on each CE queue
        waitJob:    Total waiting jobs on each CE queue
        seAvail:    SE storage space available
        seUsed:     SE storage space used

        Notes: 
        -------
        CPU
        -------
        * Physical CPU defined
        If the subcluster PhysicalCPU is set to a non-zero number for
        any CE of a given site, this number is used as the totalCPU.
        If another CE's PhysicalCPU is set to zero, then that CE's stats
        are excluded.  This is useful if a site has two or more CEs that 
        point to the same batch system.  In that case the site should set 
        the subcluster's PhysicalCPU only for CEs with unique batch systems. 
        
        * Physical CPU not defined
        If GlueSubClusterPhysicalCPU is not defined for a CE, the 
        TotalCPUs values of the queues on that CE are used.  To avoid 
        counting CPUs more than once for queues on the same CE that refer 
        to the same cluster, only queues with the maximum 
        "GlueCEInfoTotalCPUs" are added to the totalCPU and freeCPU 
        values.  Queues may publish different "GlueCEInfoTotalCPUs" 
        values while all referring to the same physical cluster, so the 
        best estimate of the site's CPUs is the value from the largest queue.  
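
        A rough sketch of the two CPU counting rules follows.  The data 
        layout (one dict per CE with hypothetical keys "physical_cpu" 
        and "queue_total_cpus") is invented for illustration.  Applying 
        the rules per CE when a site mixes defined and undefined 
        PhysicalCPU values, and counting the largest queue value only 
        once per CE, are assumptions.

```python
from typing import Any, Dict, List, Optional


def site_total_cpu(ces: List[Dict[str, Any]]) -> int:
    """Estimate totalCPU for a site from per-CE information.

    Each CE dict may carry:
      physical_cpu      subcluster PhysicalCPU, or absent if not defined
      queue_total_cpus  list of GlueCEInfoTotalCPUs values, one per queue
    """
    total = 0
    for ce in ces:
        physical: Optional[int] = ce.get("physical_cpu")
        if physical is not None:
            # Defined: use it directly; a value of zero excludes this CE.
            total += physical
        else:
            # Not defined: take the largest queue value once per CE.
            total += max(ce.get("queue_total_cpus", []), default=0)
    return total


if __name__ == "__main__":
    site = [
        {"physical_cpu": 200},                  # CE with its own batch system
        {"physical_cpu": 0},                    # second CE on the same batch system
        {"queue_total_cpus": [64, 128, 128]},   # PhysicalCPU not defined
    ]
    print(site_total_cpu(site))
```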
            
        -------
        Storage
        -------
        For site storage statistics, GStat prefers to use the summary information
        in GlueSE to calculate the total available space and storage usage.  If
        these summary values are unset or zero, GStat falls back to the
        information in GlueSA instead of GlueSE.

        If the information in GlueSE is used, GStat uses the version of the 
        GlueSchema to determine how to calculate the space:

        Glue 1.2: 
            Storage Available = GlueSESizeFree
            Storage Used      = GlueSESizeTotal - GlueSESizeFree
        Glue 1.3: 
            Storage Available = (GlueSETotalNearlineSize + GlueSETotalOnlineSize) - 
                                (GlueSEUsedNearlineSize + GlueSEUsedOnlineSize)
            Storage Used      = GlueSEUsedNearlineSize + GlueSEUsedOnlineSize
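
        The two GlueSE formulas above translate directly into code.  In 
        this sketch the attribute values are passed in a plain dict 
        keyed by attribute name; the function name and the version 
        strings "1.2" and "1.3" are assumptions made for illustration.

```python
from typing import Dict, Tuple


def se_storage(se: Dict[str, float], glue_version: str) -> Tuple[float, float]:
    """Return (storage available, storage used) from GlueSE summary values."""
    if glue_version == "1.2":
        available = se["GlueSESizeFree"]
        used = se["GlueSESizeTotal"] - se["GlueSESizeFree"]
    elif glue_version == "1.3":
        used = se["GlueSEUsedNearlineSize"] + se["GlueSEUsedOnlineSize"]
        total = se["GlueSETotalNearlineSize"] + se["GlueSETotalOnlineSize"]
        available = total - used
    else:
        raise ValueError(f"unsupported Glue schema version: {glue_version}")
    return available, used


if __name__ == "__main__":
    print(se_storage({"GlueSESizeTotal": 1000, "GlueSESizeFree": 400}, "1.2"))
    print(se_storage({"GlueSETotalNearlineSize": 500,
                      "GlueSETotalOnlineSize": 1000,
                      "GlueSEUsedNearlineSize": 100,
                      "GlueSEUsedOnlineSize": 300}, "1.3"))
```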

        The storage area (GlueSA) is a logical portion of storage extent assigned 
        to a VO.  Storage areas can overlap the same physical space, so different 
        VOs may contend for the same free space.  If the information in GlueSA 
        is used by GStat, checks are in place so that VOs sharing the same 
        physical partition are not counted twice.
        
        To determine if VOs are on the same partition, we assume that VOs with 
        identical GlueSAStateAvailableSpace values are sharing partitions. This 
        can cause problems only if two partitions have exactly the same available 
        disk space, which should have a low probability.
        
        Glue 1.2:
            Storage Available = add up the distinct values of GlueSAStateAvailableSpace
                                in the GlueSA.
            Storage Used      = add up the values of GlueSAStateUsedSpace if 
                                GlueSAStateAvailableSpace values are distinct in the GlueSA.
        Glue 1.3: 
            Storage Available = the same manner as Glue 1.2                        
            Storage Used      = add up the values of GlueSAStateUsedSpace in all GlueSA. 
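
        The GlueSA rules above can be sketched as follows.  Each storage 
        area is represented as a hypothetical (available, used) pair.  
        For Glue 1.2, taking the used space of the first area seen for 
        each distinct available value is an assumption about how the 
        "distinct" rule is applied.

```python
from typing import List, Set, Tuple


def sa_storage(areas: List[Tuple[int, int]], glue_version: str) -> Tuple[int, int]:
    """Return (storage available, storage used) from GlueSA values.

    `areas` holds (GlueSAStateAvailableSpace, GlueSAStateUsedSpace) pairs.
    """
    seen: Set[int] = set()
    available = 0
    used_distinct = 0
    for avail, used in areas:
        if avail in seen:
            continue               # assumed to share a partition already counted
        seen.add(avail)
        available += avail
        used_distinct += used
    if glue_version == "1.3":
        # Glue 1.3: used space is summed over all storage areas.
        return available, sum(used for _, used in areas)
    return available, used_distinct


if __name__ == "__main__":
    shared = [(500, 100), (500, 100), (200, 50)]
    print(sa_storage(shared, "1.2"))
    print(sa_storage(shared, "1.3"))
```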

        `empty entries` means that the information could not be obtained 
        from GIIS.
        
        Related wikis:
         * http://goc.grid.sinica.edu.tw/gocwiki/Unreliable_gathering_of_CE_Information

        

RRDFetch_Hist
RRDFetch Hist Filter: Displays average of RRD data

        Column Name:    maxcpu, avgcpu

        Values:         Number  - success
                        ""    - not available 

        maxcpu:     maximum of the daily maximum CPU counts found in the 
                    GIIS over the last 30 days
        avgcpu:     average of the daily average CPU counts found in the 
                    GIIS over the last 30 days

        These values indicate the relative size of the site
        and provide a reference for how many CPUs are normally
        available.
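
        As an illustration of the two values, the sketch below computes 
        them from plain lists of daily figures.  It does not read RRD 
        files; the function name and the list inputs are assumptions 
        standing in for data already fetched from the RRD databases.

```python
from typing import List, Optional, Tuple


def cpu_history(daily_max: List[int], daily_avg: List[float], days: int = 30
                ) -> Tuple[Optional[int], Optional[float]]:
    """Return (maxcpu, avgcpu) over the most recent `days` daily values.

    None stands for the "not available" case when no data exists.
    """
    recent_max = daily_max[-days:]
    recent_avg = daily_avg[-days:]
    maxcpu = max(recent_max) if recent_max else None
    avgcpu = sum(recent_avg) / len(recent_avg) if recent_avg else None
    return maxcpu, avgcpu


if __name__ == "__main__":
    print(cpu_history([100, 120, 110], [90.0, 100.0, 95.0]))
```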
        
        

DeployQuery_DeployInfo
DeployQuery Deployment Info Filter:

        Please check this page for more details:
        https://lcg-sam.cern.ch:8443/sam/sam.py?funct=StatusTable&sensors=CE&vo=ops
 
        This filter parses the LCG grid deployment Site Functional Test results 
        and integrates the information into this site. 

        

GridICE_Info
GridICE Info Filter: 
        
        Column Name:            gice
        
        -------------------------------------
        Conditions               Alert level
        -------------------------------------
        No problems              OK
        no GridICE service       INFO
        GridICE not accessible   INFO
        no host monitored        INFO
        no batch system          INFO
        -------------------------------------

        Detailed site report includes the following information: 
                
        GlueHostUniqueID:        Represents the hosts that are monitored
                                 by GridICE.  This should be > 0.
        GlueBatchSystemType:     The type of batch system monitored by 
                                 GridICE.  If no entries are found then 
                                 batch system monitoring is not enabled.
    
        This filter requires that GStat queries a bdii to collect the available
        GridICE 'GlueServiceAccessPointURL' values.  These available GridICE 
        agents are then matched to a given site if the agent domain matches 
        that of the GIIS server.  For some sites this can be a problem, and 
        a better approach may be needed.
        
        If multiple GridICE agents are found, then results for all matching
        agents are combined and provided to this filter for analysis.

        

AGocDB_Display
GocDB Display Filter: Display GOC DB maintenance information

        Column Name:    none
        Values:         Shows the maintenance periods for this site

        




Copyright © ASGC/CERN
All Rights Reserved
Comments to author: roc-dev at lists.grid.sinica.edu.tw
Generated: Mon Feb 13, 2012