the simple rule of data centre
power management – actions have
consequences and consequences
require action.
The BA example demonstrates
again that power misunderstanding
is a common problem. Two-thirds of
data centre professionals in Eaton’s
research weren’t fully confident in
their understanding of power, and
until organisations get to grips
with power management we can
expect to see more power-related
outages. There is profound concern
around skills availability: it is
hard to acquire and retain the
relevant expertise, whether for
designing for energy efficiency,
managing consumption on an
ongoing basis, or dealing with
power-related failures quickly and
effectively to avoid or mitigate
outages.
Have you tried switching
it off and on again?
Should a full power outage occur
then it’s absolutely imperative to
have a disaster recovery process in
place that clearly defines the steps
to be taken when re-energising the
data centre, detailing which systems
must be brought back online first. In
a full outage situation where people
are in a state of panic and under
pressure to resume normal services,
staggering the re-energisation of
the systems in your data centre may
seem counterintuitive, as the goal
is to get back online as quickly as
possible, but such a process helps
to avoid further extension of the
outage. The restoration of a data
centre after going black needs to be
done gently and in a clearly defined,
methodical fashion. Simply trying to
get everything back up in a hasty
and unplanned way will only cause
in-rush current, which could trigger
further outages and quickly cripple
the data centre again. Power management
is all about understanding the
dependencies between the different
parts of the power system and the
IT load and having appropriate
levels of resilience in the hardware,
software and processes.
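As an illustration of writing down which systems must be brought back online first, the sketch below derives restart stages from a dependency map using Python’s standard graphlib module. The system names and the dependencies between them are hypothetical, invented for this example; a real site would take them from its own disaster recovery plan.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists what must be up before it.
DEPENDS_ON = {
    "cooling": {"power_distribution"},
    "management_network": {"power_distribution"},
    "core_network": {"cooling", "management_network"},
    "storage": {"core_network"},
    "databases": {"storage"},
    "applications": {"databases", "core_network"},
}


def restart_stages(depends_on):
    """Return a list of stages. Every system in a stage has all of its
    prerequisites in earlier stages, so a stage can be started together."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)  # mark the stage as up, unlocking the next one
    return stages


if __name__ == "__main__":
    for number, stage in enumerate(restart_stages(DEPENDS_ON), start=1):
        print(f"Stage {number}: {', '.join(stage)}")
```

Grouping systems into stages rather than one flat list makes it explicit which items may be started together and which must wait for an earlier stage to be confirmed healthy.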
Recovering from an outage requires
patience and a systematic process
– two things that were seemingly
missing according to reports on BA’s
outage. No data centre professional
has ever asked ‘have you tried
switching it off and on again?’ The
skill is to pace oneself and follow
each step in turn, controlling and
monitoring a phased restart so
that batches of systems are only
brought online when it’s safe
to do so and one is sure of the
correct phase balancing and loads.
Skipping any steps in the rush to
get back online can create a power
surge, overloading circuits, tripping
breakers and, to put it mildly, causing
chaos.
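The phased restart described above can be sketched as a simple batching calculation. This is a minimal illustration under stated assumptions, not a definitive procedure: the rack names, current figures, the 32 A per-phase limit and the 80% safety margin are all hypothetical, and real in-rush values would have to come from equipment data or measurement. A rack joins a batch only if, on its phase, the load already running plus the in-rush of that batch stays within the budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rack:
    name: str
    phase: str          # "L1", "L2" or "L3"
    steady_amps: float  # current drawn once running
    inrush_amps: float  # brief peak current at switch-on


def plan_batches(racks, limit_amps, margin=0.8):
    """Greedily group racks into restart batches so that, on every phase,
    running load plus the batch's in-rush stays within margin * limit_amps.
    Returns the batches and the final steady load per phase."""
    budget = limit_amps * margin
    running = {"L1": 0.0, "L2": 0.0, "L3": 0.0}
    pending = list(racks)
    batches = []
    while pending:
        peak = dict(running)  # worst-case current per phase during this batch
        batch, deferred = [], []
        for rack in pending:
            if peak[rack.phase] + rack.inrush_amps <= budget:
                peak[rack.phase] += rack.inrush_amps
                batch.append(rack)
            else:
                deferred.append(rack)
        if not batch:
            raise ValueError(f"{deferred[0].name} cannot start within budget")
        for rack in batch:  # once settled, racks draw only their steady load
            running[rack.phase] += rack.steady_amps
        batches.append(batch)
        pending = deferred
    return batches, running


# Hypothetical racks and figures, for illustration only.
RACKS = [
    Rack("A", "L1", 6, 15), Rack("B", "L1", 6, 15),
    Rack("C", "L2", 5, 12), Rack("D", "L2", 5, 12),
    Rack("E", "L3", 8, 20), Rack("F", "L1", 4, 10),
]

if __name__ == "__main__":
    batches, running = plan_batches(RACKS, limit_amps=32)
    for number, batch in enumerate(batches, start=1):
        print(f"Batch {number}: {[rack.name for rack in batch]}")
    print("Steady load per phase:", running)
```

The per-phase totals returned at the end give a quick view of phase balance; in practice each batch would be started only after monitoring confirms the previous one has settled.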
Resilience and
infrastructure upgrades
Alongside skills and power processes,
the facilities infrastructure itself
often needs upgrading to meet
today’s efficiency, reliability and
flexibility expectations. Around
half of respondents in Eaton’s
survey report that their core IT
infrastructure needs strengthening,
and this number is closer to two-
thirds when it comes to facilities
such as power and cooling.
Power management is increasingly
becoming a software-defined activity;
given the skills gap, software can
play an important role in bridging
the divide between IT and power
by presenting power management
options in dashboard styles that are
familiar to an IT audience, making
it easier to understand and even
automating management of power
infrastructure. This could have
prevented the outage that faced BA
as the automated processes would
have brought systems back online in
a controlled and monitored fashion.
We’ve moved towards more
virtualised environments in data
centres, and IT and data centre
professionals are familiar with using
virtualisation to maintain hardware,
so why not apply the same
principles to power? It is
important that all power distribution
designs, and associated resiliency
software tools, are compatible with
all the major virtualisation vendors
to ensure future-proofing of the
infrastructure. This approach will
enable data centre professionals
to carry out concurrent maintenance,
mitigating the risks of infrastructure
maintenance and upgrades.
Learning lessons
While we may never fully understand
what happened within BA’s data
centre, it’s near guaranteed that it
won’t be an isolated incident across
the wider data centre industry, even
if it’s unlikely we’ll see anything on
the same scale for a long time. The
issue comes down to either poor
preparation or poor implementation of
disaster recovery. Better preparation
of the data centre disaster recovery
process would have seen it
designed with resilience in mind,
meaning firstly the DR site should
have kicked in to cover the demand
during the outage and, secondly,
when restarting the hardware and
applications, it should have been
done in a far more controlled
manner. Reintroducing power
to systems in a slow and phased
manner would have allowed for a
smooth and steady recovery. We, as a data
centre industry, need to make sure
that we all learn lessons from BA’s
high-profile outage and take actions
to ensure that effective power
management is a ‘must have’ and
not a ‘nice to have’.
New-Tech Magazine Europe | 19