Introduction

The Failover activity started as a task of the EGEE SA1 Operations COD, with the goal of proposing, implementing and documenting failover procedures for the collaboration, management and monitoring tools used in the EGEE/WLCG Grid. These tools, listed below on this page, are used heavily every day by COD teams, regional and site operators and other categories of users for grid management and control. The services, like their users, are spread across several time zones, and the interdependencies within the infrastructure are very strong. For these reasons the availability requirement is high and is expected to grow in the future.

This Wiki collects the latest updates on the status of the Failover procedures: which services we take care of and where they run, with all the details that can be made public. It also reports the latest news, ideas and decisions about Failover that result from technical achievements, COD meetings and dedicated meetings. The basic Failover concept was presented at the COD-7 Meeting within the EGEE-1 project.

Involved People

Status and Plans

Operations Tools: the map

The Operations Tools mentioned in this wiki form a distributed system made of several linked applications. Because the system is fairly complex, we keep this map of all the interconnections so that every critical point is clearly understood, e.g. when we have to react to one application being down:

Operations Tools Map
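
As a rough illustration of how such a map can be used when one application goes down, the sketch below walks a dependency map to list every tool that is directly or indirectly affected. The tool names and links (tool_a ... tool_d) are placeholders, not the real interconnections shown in the map, and the function name affected_by is hypothetical.

```python
from collections import deque

# Hypothetical map: each key depends on the tools in its list.
# These are placeholder names, NOT the real Operations Tools links.
DEPENDS_ON = {
    "tool_a": ["tool_b", "tool_c"],
    "tool_b": ["tool_c"],
    "tool_c": [],
    "tool_d": ["tool_a"],
}


def affected_by(down, depends_on):
    """Return the set of tools that directly or indirectly depend on `down`."""
    # Invert the map: for each tool, record which tools depend on it.
    dependents = {}
    for tool, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(tool)

    # Breadth-first walk over the inverted map, starting from the failed tool.
    seen = set()
    queue = deque([down])
    while queue:
        current = queue.popleft()
        for tool in dependents.get(current, ()):
            if tool not in seen:
                seen.add(tool)
                queue.append(tool)

    # The failed tool itself is not reported, even if it sits on a cycle.
    seen.discard(down)
    return seen


affected = sorted(affected_by("tool_c", DEPENDS_ON))
print(affected)
```

With the real interconnections filled in, the same walk would show which replicas need attention first when a given instance fails.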

Replication

Here is the list of what we are currently working on.
Please note that the first column links to detailed pages.

| Service (see sub-wiki links) | Main Instance | Replica | Status | Notes |
|---|---|---|---|---|
| GSTAT | ASGC-TW | INFN-CNAF | DONE | |
| CIC PORTAL | IN2P3-CC | INFN-CNAF | PARTLY-DONE | Work in progress on DB synchronization |
| SAM ADMIN | IN2P3-CC | INFN-CNAF | DONE | |
| GOC DATABASE | RAL | ITWM | PARTLY-DONE | DONE: frontend OK, DB hourly updated. TODO: some performance issue and connections with other tools |
| SFT/SAM | CERN | N.A. | TODO | |
| GGUS | FZK | FZK: local failover | In Progress | now can manually switch to the VM replica. Automatic switch: TODO |
| GRIDICE | CNAF | FORTH-ICS | DONE | thanks to the Greek site FORTH-ICS |

The current replication process is:

Geographical Failover Idea

This idea is based on DNS and was proposed during the Failover session at the COD-6 meeting in Barcelona. After a DNS test phase at CNAF in a VM environment, focused on nsupdate, NS/zone configuration and fast TTLs, a new domain for Grid Operations was registered. The name, chosen through a quick poll among Operations people, is gridops.org. The current configuration is master-slave, with the master at CNAF and the slave at GRNET.
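
To make the nsupdate and fast-TTL part more concrete, here is a minimal sketch of how a switch to a replica could be prepared as nsupdate input. It is not the procedure actually used at CNAF: the server name ns-master.example.org, the record tool.gridops.org., the address 192.0.2.20 (a documentation-only range) and the 60-second TTL are all assumptions made for illustration, and build_nsupdate_script is a hypothetical helper.

```python
def build_nsupdate_script(server, zone, name, new_ip, ttl=60):
    """Return nsupdate input that repoints the A record of `name` to `new_ip`.

    A short TTL keeps resolvers from caching the old address for long,
    so clients follow the switch quickly.
    """
    lines = [
        f"server {server}",                     # master DNS server to update
        f"zone {zone}",                         # zone that holds the record
        f"update delete {name} A",              # drop the A record of the failed instance
        f"update add {name} {ttl} A {new_ip}",  # point the name at the replica
        "send",                                 # submit the update
    ]
    return "\n".join(lines) + "\n"


# All values below are hypothetical placeholders.
script = build_nsupdate_script(
    server="ns-master.example.org",
    zone="gridops.org",
    name="tool.gridops.org.",
    new_ip="192.0.2.20",
)
print(script)
```

The resulting text could then be passed to nsupdate, for example on its standard input together with a key file given through the -k option, assuming the master is configured to accept dynamic updates for the zone.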

The purpose of the new domain is that:

On www.gridops.org you can find a list of all the Operations tools currently mapped under the gridops domain.

Automatic Failover

The key points of the concept are:

Challenges, open issues

Here are some drawings used to present the work on the Failover concept:

Failover Drawings

Action list

Action List

Papers

News

Failover News

Failover mechanisms (last edited 2008-10-14 09:30:15 by AlessandroCavalli)