Difference: MuSunNCSAClustersInfo (1 vs. 4)

Revision 4 - 2011-04-22 - PeterWinter

META TOPICPARENT name="MuSunGroup"
SET ALLOWTOPICCHANGE = MuSunGroup

SET ALLOWTOPICVIEW = MuSunGroup

Cluster specs and docs

Computational requirements

MuCap:

  • 2GB memory per node
  • High Bandwidth to MSS

MuSun:

  • 2-3GB per node
  • High bandwidth to MSS
  • 300,000 SUs (on Abe) over the course of 3-5 years (a rough arithmetic check follows this list).
    • Several mu passes of 15000 SUs with a few mta passes of 2000? SUs per mu pass.
    • One mu pass = 15000 SUs, one mta = 2000 SUs
    • For 2011 data, ~5 mu passes with ~5 mta per mu pass
    • Same for 2013 data
  • 150TB MSS space
    • ~75 TB per run
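
A rough consistency check of the ~300,000 SU request, using only the per-pass numbers listed above (the 2,000 SU cost of an mta pass is marked as uncertain in the list), could look like this minimal Python sketch:

  # Rough check of the SU request from the per-pass numbers above.
  # The 2,000 SU cost per mta pass is marked as uncertain in the list.
  SU_PER_MU_PASS = 15000    # one mu pass
  SU_PER_MTA_PASS = 2000    # one mta pass (uncertain)
  MU_PASSES = 5             # ~5 mu passes per dataset
  MTA_PER_MU = 5            # ~5 mta passes per mu pass
  DATASETS = 2              # 2011 and 2013 data

  per_dataset = MU_PASSES * (SU_PER_MU_PASS + MTA_PER_MU * SU_PER_MTA_PASS)
  total = DATASETS * per_dataset
  print(per_dataset, total)  # 125000 250000

The result (~250,000 SUs) is consistent with the ~300,000 SU figure above, leaving some headroom.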

Draft of letter to Mike Pflugmacher

  • 1st draft. Michael, please edit or make note of your comments.

Dear Mike,

We have two experiments that currently make use of the Abe cluster. Our analysis software for both experiments operates in an embarrassingly parallel fashion, so we make no use of any MPI or shared-memory functionality. The processors on any of the mentioned clusters are sufficient for our needs, and we generally need about 2GB of memory per core. Given sufficient access to available cores, the bottleneck for running a pass over our dataset is the bandwidth from the MSS to the staging scratch space (a minimal sketch of this per-run workflow appears after this draft).

One of our experiments, MuCap, will be concluding this summer, so only a few final dataset passes will be performed (fewer than 30,000 SUs required on Abe-equivalent hardware). The other, MuSun, is just now moving into the mode of full-dataset analysis and will require ~300,000 SUs (on Abe) over the course of the next 3-5 years. Given this time scale and the inconvenience of updating batch job scripts and the like for a new cluster, we prefer to move to the cluster that will remain active the longest.

Of the clusters you mentioned (QueenBee, Steele, Ember, Lonestar, Trestles), we have the following impressions:
QueenBee: shutting down in the summer - not suitable.
Ember: hardware is acceptable, but the cluster is specialized for a highly parallel computing model that we would not use.
Steele: functionally similar to Abe for our purposes; however, it has fewer available cores. What is the bandwidth to the MSS?
Trestles/Lonestar: hardware acceptable, but do they have bandwidth to the MSS equivalent to Abe's?
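
To make the job structure concrete, here is a minimal sketch of the per-run workflow described in the first paragraph of the letter. The command names (stage_from_mss, mu_analysis) and the scratch path are placeholders, not the actual MuCap/MuSun tools, since those are not named on this page:

  # Minimal sketch (placeholder command names) of one embarrassingly
  # parallel batch job: stage a run from the MSS to scratch, analyze it,
  # then free the scratch space. Jobs never communicate, so there is no
  # MPI or shared memory; the MSS-to-scratch transfer is the bottleneck.
  import subprocess
  import sys
  from pathlib import Path

  SCRATCH = Path("/scratch/musun")          # hypothetical staging area

  def process_run(run_id):
      raw = SCRATCH / ("run%05d.mid" % run_id)
      # Stage-in from mass storage (placeholder command).
      subprocess.run(["stage_from_mss", raw.name, str(raw)], check=True)
      # Independent analysis of this run (~2 GB of memory per core).
      subprocess.run(["mu_analysis", str(raw)], check=True)
      raw.unlink()                          # free scratch for the next run

  if __name__ == "__main__":
      process_run(int(sys.argv[1]))         # one run per batch job / core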

Email from Mike Pflugmacher

Peter,

The recommendations for the general user community of Abe will be to continue on with QueenBee or Steele. If that is not workable, the next option will be to transfer to Trestles or Lonestar.
However, given your team's data in the MSS and the small amount of computational time used, we could consider a transfer to Ember for the upcoming needs. I would like to understand your estimated needs over the next few months to help place you with the best available resource. I will work with you and the TeraGrid resource providers to meet your needs.

Mike

FROM: Pflugmacher, Michael
(Concerning ticket No. 197114)

Peter's response to original NCSA email

Dear NCSA helpdesk,

 We have just read in a recent TERAGRID news email that the Abe cluster will retire at the end of April, and since no year was specified, we assume it is April 2011. This is rather short notice.

 We currently have two open allocations (TG-PHY060011N and TG-PHY080015N) and we are unsure about the consequences of this retirement with respect to our usage of Abe. At this moment, we are in the final stage of our analysis of the MuCap experiment. This is the culmination of 10 years of experimental effort by an international collaboration. We are under enormous pressure to perform the final data passes without distraction, as they determine both the thesis of one of our Illinois graduate students and the overall result for MuCap. We also promised this final analysis to our funding agencies (NSF and DOE) in our research proposals. In addition, the MuSun experiment took successful data last fall, and we have been working hard to adapt our analysis software for a pass over these data with the current allocation in May/June. We need to analyze and learn from these data before our main experimental run, scheduled for three months later this year, takes place.

 However, given the retirement of Abe, we are seriously concerned about the essential and timely analysis of these two experiments.

 Can you please advise us on our future options? We have a massive amount of data stored in the MSS (200 TB), so we would need an equivalent computing cluster at NCSA in order not to run into bandwidth limitations. Should we transition to Lincoln, even though that system will also be replaced soon? Which options do we have? We would like to start testing the future options immediately!

 We already experienced the transition from the Tungsten to the Abe cluster, and it unfortunately took quite some work until we could run all of our analysis scripts and software on the new cluster.

 Thank you very much for your help. We have really appreciated the advice that NCSA staff have given us in the past to move our research forward.

 With my best greetings

 Peter

>      * The NCSA Abe and LONI Queen Bee machines of the AQS pool will
>        retire in late April and July 31, respectively, and only the
>        Purdue Steele machine will be available as a separate allocable
>        system (Steele).
>      * NCSA: The Lincoln Tesla-GPU system will be replaced in June by a
>        150TF (peak) Fermi-GPU system named Forge. The new system will
>        have 32 nodes (Dell C6145 servers) with dual 8-core Magny-Cours
>        processors and 48GB per node. Each node supports 8 C2070 Fermi
>        GPUs. A 600TB GPFS file system provides an aggregate IO
>        bandwidth of 15GB/s. (Final configuration may vary.)
>      * NCSA: The new NCSA SGI Altix UV, Ember
>        (http://www.ncsa.illinois.edu/News/10/0302NCSAprovide.html),
>        replaced the Itanium based SGI Altix Cobalt system with roughly
>        twice the performance per core. System Info: 16 TFLOPS; 4 Intel
>        SGI Altix UV shared-memory systems each with 384 2.6 GHz
>        Nehalem-EX cores and 2 TB of memory; QDR IB.
