Introduction
The Failover activity began as a COD task within EGEE SA1 Operations, with the goal of proposing, implementing and documenting failover procedures for the collaboration, management and monitoring tools used in the EGEE/WLCG Grid. These tools, listed further down this page, are used heavily every day by COD teams, regional and site operators and other user categories for grid management and control. The services (and their users) are spread across several time zones, and the interdependencies within this infrastructure are very strong. For these reasons the availability requirement is already high and is expected to grow.
This wiki collects the latest updates on the status of the Failover procedures and on which services we look after and where, with all the details that can be made public. It also reports the latest news, ideas and decisions about Failover that come out of technical work, COD meetings and dedicated meetings. The basic Failover concept was presented at the COD-7 meeting within the EGEE-1 project.
Involved People
Alessandro Cavalli (alessandro.cavalli@cnaf.infn.it) - INFN-CNAF
Alfredo Pagano (alfredo.pagano@cnaf.infn.it) - INFN-CNAF
- Christian Peter - ITWM ( GOC-DB frontend replica )
- Kostas Koumantaros - GRNET ( slave DNS for gridops.org )
- Christos Papachristos - FORTH-ICS ( GRIDICE )
- Philippa Strange - RAL (GOC-DB)
- Marcin Radecki - CYFRONET (SAM volunteer)
- Michael M. Roth - FZK/GGUS (administration of GGUS servers, including failover mechanism)
- all the main service administrators:
- Cyril L'Orphelin, Gilles Mathieu, Osman Aidel (CIC-PORTAL)
- Min Tsai, Joanna Huang (GSTAT)
- Rafal Lichwala (SAM Admin)
- Andy Newton, Cristina Del Cano Novales, Keir Hawker (GOC-DB)
- Piotr Nyczyk, David Collados (SAM)
Failover mailing list: (project-egee-sa1-failover@cern.ch)
Failover Web: www.gridops.org
Status and Plans
Operations Tools: the map
The Operations Tools mentioned in this wiki form a distributed system made of different linked applications. The system is fairly complex, so we keep this map of all the interconnections to make every critical point clear, for example when we need to react to one application being down:
Replication
Here is the list of what we are currently working on.
Note that the first column links to detailed pages.
| Service | Main Instance | Replica | Status | Notes |
|  |  |  | DONE |  |
|  |  |  | PARTLY-DONE | Work in progress on DB synchronization |
|  |  |  | DONE |  |
|  |  |  | PARTLY-DONE | DONE: frontend OK, DB hourly updated. |
|  |  | N.A. | TODO |  |
|  |  | FZK: local failover | In Progress | now can manually switch to the VM replica. Automatic switch: TODO |
|  |  |  | DONE | thanks to the Greek site FORTH-ICS |
The current replication process is:
- find a volunteer site to host a replica
- replicate the service
  - done by the replica site staff
  - supported by the main service owners/developers
- document it
  - to be done by the replica site staff
  - on the proper subwiki, for general and non-sensitive information
  - on a special, private Failover project on the CNAF Forge portal, for detailed and sensitive information
- manage the replica and keep it up to date with the main instance
  - done jointly by the admins of the main instance and of the replica. The main instance staff must give the support needed to ensure that the latest changes in code, features and data are committed to the replica.
Geographical Failover Idea
This idea is based on DNS and was proposed during the Failover session at the COD-6 meeting in Barcelona. After a DNS test phase at CNAF in a VM environment, focused on nsupdate, NS/zone configuration and fast TTLs, a new domain for Grid Operations was registered. The name, chosen through a quick poll among Operations people, is gridops.org. The current configuration is master-slave, with the master at CNAF and the slave at GRNET.
The purpose of the new domain is that:
- the Operations Tools are available under it:
  - www.gridops.org provides the access links
  - the gridops.org DNS servers use the built-in DNS master-slave feature
- the fast TTLs allow us to quickly remap services to replica sites
On www.gridops.org you can find a list of all the Operations tools currently mapped under the gridops domain.
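As an illustration of the remapping step, here is a minimal sketch that builds the input for the nsupdate tool mentioned above. It assumes that each tool is published as a CNAME under gridops.org and that a 60-second TTL counts as "fast"; neither detail is stated on this page. The function name and all host names in the usage note are hypothetical.

```python
def build_nsupdate_script(dns_master: str, zone: str, alias: str,
                          target: str, ttl: int = 60) -> str:
    """Return nsupdate input that repoints alias.zone (a CNAME) to target.

    Assumption: services are published as CNAME records in the zone.
    The short default TTL is an illustrative value, not a project setting.
    """
    if ttl <= 0:
        raise ValueError("ttl must be a positive number of seconds")
    fqdn = f"{alias}.{zone}."
    lines = [
        f"server {dns_master}",                     # send the update to the master
        f"zone {zone}",
        f"update delete {fqdn} CNAME",              # drop the current mapping
        f"update add {fqdn} {ttl} CNAME {target}.", # point the alias at the replica
        "send",
    ]
    return "\n".join(lines) + "\n"
```

Calling `build_nsupdate_script("ns-master.example.org", "gridops.org", "cic", "cic-replica.example.net")` returns a five-line script that could be piped to nsupdate, normally authenticated with a key. With a short TTL, resolvers drop the old mapping soon after the update, which is what makes the remapping quick for clients.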
Automatic Failover
The key points of the concept are:
- as a starting point, at least be able to switch manually from master to replica
- define the architecture for the control/monitoring framework
- deploy 3 or more monitoring sites that will:
  - take decisions based on the status of services
  - take actions, remapping services through the DNS
  - notify the administrators of the change
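The decide, act and notify sequence in the key points above can be sketched as a single control cycle. This is only an illustration under stated assumptions: the `Service` record, the `failover_step` function and the `remap` and `notify` callbacks are hypothetical names, and the decision itself (`main_down`) is taken as an input rather than computed here. The sketch deliberately does not switch back to the main instance automatically; that is a design assumption, not a documented choice.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Service:
    alias: str    # name published under the operations domain (hypothetical)
    main: str     # host of the main instance (hypothetical)
    replica: str  # host of the replica (hypothetical)
    active: str   # host the alias currently points to


def failover_step(service: Service, main_down: bool,
                  remap: Callable[[str, str], None],
                  notify: Callable[[str], None]) -> bool:
    """Run one control cycle; return True if a switch to the replica happened."""
    if not main_down or service.active != service.main:
        return False  # main is healthy, or the replica is already active
    remap(service.alias, service.replica)  # action: remap the service in the DNS
    service.active = service.replica
    notify(f"{service.alias}: remapped from {service.main} to {service.replica}")
    return True
```

In a test setup, `remap` can simply print the intended change and `notify` can be `print`; in a real deployment they would wrap the DNS update and the message to the administrators.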
Challenges, open issues
The idea is to base the monitoring agents on Nagios:
- start with a test of a multiple Nagios installation, e.g. on a VM
- try to combine the results from different Nagios, to take decisions
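One possible way to combine the results from several Nagios instances is a majority vote over the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below is an assumption about how such a combination could work, not the adopted design. The function name, the `min_valid` threshold and the choice to ignore UNKNOWN results are all hypothetical.

```python
from typing import Dict

# Standard Nagios plugin return codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def combined_state(results: Dict[str, int], min_valid: int = 2) -> int:
    """Combine per-monitor states for one service into OK, CRITICAL or UNKNOWN.

    results maps a monitoring site name to the state it reported.
    UNKNOWN reports are ignored; with fewer than min_valid usable reports
    no decision is taken and UNKNOWN is returned.
    """
    valid = [s for s in results.values() if s in (OK, WARNING, CRITICAL)]
    if len(valid) < min_valid:
        return UNKNOWN
    critical = sum(1 for s in valid if s == CRITICAL)
    # A strict majority of usable reports must be CRITICAL to declare failure.
    return CRITICAL if critical * 2 > len(valid) else OK
```

Requiring a strict majority means that a single monitoring site with a broken network path cannot trigger a remapping on its own, which is one reason to deploy at least three sites.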
Oracle is the preferred DB backend for production. We have investigated its high-availability features, such as Materialized Views, Data Guard and Streams. So far only the Materialized View approach has been used, because it was the quickest way to provide a first level of read-only replication (GOC-DB). Knowledge of the more complex Streams approach is available (see the CERN-LCG 3D project), and it may be chosen in future.
Here are some drawings used to present the work on the Failover concept:
Action list
Papers
Aidel, O.; Cavalli, A.; Cordier, H.; L'Orphelin, C.; Mathieu, G.; Pagano, A.; Reynaud, S.; CIC Portal: A Collaborative and Scalable Integration Platform for High Availability Grid; in Proc. of the 8th IEEE/ACM International Conference on Grid Computing (Grid 2007), Austin, Texas, Sep 2007 (paper)
Cavalli, A.; Pagano, A.; Aidel, O.; L'Orphelin, C.; Mathieu, G.; Lichwala, R.; Geographical failover for the EGEE-WLCG Grid collaboration tools; in Proc. of the Conference on Computing in High Energy and Nuclear Physics (CHEP 2007), Victoria BC (CA), Sep 2007 (paper)
