MetaCentrum Operational Statistics in 2012

This report summarizes the use of MetaCentrum VO computing resources in 2012. The overall results were as follows. Data from 2011 are in parentheses.

  • Number of completed jobs: 1,080 thousand (609 thousand)
  • CPU time used: 2.5 thousand CPU years (742 CPU years)
  • Number of users with an active account: 613 (491)
  • Number of extended accounts: 301 (314)
  • Number of new accounts: 312 (177)
  • Number of active users (at least one job run): 322 (240)
  • Amount of data in storage: 350 TB (85 TB)

At the end of 2012, MetaCentrum registered 613 users with an active account. Of these, 301 accounts were extended and 312 belonged to users who joined in 2012.
At least one job was run by 322 users. Some members use MetaCentrum only for access to storage capacity and other services, and some have never been active.

In 2012, users' jobs consumed 22 million CPU hours via Torque. Compared to 2011, both the number of CPU cores and the total CPU time increased by approximately the same factor of three. This strongly suggests that the massive investments in the infrastructure were appropriate, because they are matched by user demand. At the same time, the MetaCentrum services and organization were clearly able to handle this growth.
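The relation between the CPU-hour and CPU-year figures quoted in this report can be checked with a short calculation. This is only a sketch: it assumes a CPU year of 365 x 24 = 8760 hours (the report does not state the convention used), and the helper name `cpu_years` is illustrative.

```python
# Assumption: one CPU year = 365 days * 24 hours (leap day ignored).
HOURS_PER_YEAR = 365 * 24

def cpu_years(cpu_hours: float) -> float:
    """Convert a number of CPU hours into CPU years."""
    return cpu_hours / HOURS_PER_YEAR

cpu_hours_2012 = 22e6   # CPU hours in 2012, as reported
cpu_hours_2011 = 6.5e6  # CPU hours in 2011, as reported

print(round(cpu_years(cpu_hours_2012)))            # 2511, i.e. about 2.5 thousand
print(round(cpu_years(cpu_hours_2011)))            # 742
print(round(cpu_hours_2012 / cpu_hours_2011, 1))   # 3.4, the "factor of three"
```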

graf

The rapid growth in the number of available CPUs in 2011-2012 was made possible by significant investments in the national e-Infrastructure, supported by the ICT projects defined in the National Roadmap of Large Infrastructures for Research and Development, CESNET and CERIT-SC.
The major hardware owners are CERIT-SC (2200 CPUs) and CESNET (2600 CPUs), which together hold about 80% of all available CPU cores. The remaining resources are owned by other academic or research institutions, although the clusters are managed by MetaCentrum: NCBR/CEITEC, MU Brno (580 CPUs); Loschmidt Laboratories, MU Brno (260 CPUs); University of West Bohemia, Pilsen (140 CPUs); FEKT VUT, Brno (100 CPUs); University of South Bohemia, Ceske Budejovice (100 CPUs); FZU, Academy of Sciences in Prague (70 CPUs); and Faculty of Informatics, MU Brno (60 CPUs).

Utilization

The following figure shows the hardware utilization of all clusters in 2012, with the exception of those included in the experimental cloud environment (Dukan and some nodes of Zegox). The base (100%) is the total number of available CPU-core-seconds minus the CPU-core-seconds of machines under maintenance. The plotted values are the CPU-core-seconds of running jobs and reservations. Utilization above 60% is considered optimal; utilization higher than 90%, on the other hand, means that the cluster is fully saturated and users have to wait a long time before their jobs are executed. Most clusters fall between 60% and 90%, that is, within the optimal range.
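The utilization metric defined above can be written down as a minimal sketch. The function names, the label for the range below 60%, and the sample numbers are assumptions made for illustration; only the formula and the 60% and 90% thresholds come from the text.

```python
def utilization(used_core_s: float, total_core_s: float,
                maintenance_core_s: float) -> float:
    """Fraction of usable CPU-core-seconds taken by running jobs and reservations.

    The base (100%) is the total available CPU-core-seconds minus the
    CPU-core-seconds of machines under maintenance.
    """
    base = total_core_s - maintenance_core_s
    if base <= 0:
        raise ValueError("no usable CPU-core-seconds in this period")
    return used_core_s / base

def classify(u: float) -> str:
    """Map a utilization fraction to the ranges described in the report."""
    if u > 0.90:
        return "saturated"
    if u > 0.60:
        return "optimal"
    return "below optimal"  # label assumed; the report does not name this range

# Hypothetical cluster: 100 cores over 30 days, 10 cores down for 2 days.
SECONDS_PER_DAY = 86_400
total = 100 * 30 * SECONDS_PER_DAY
maintenance = 10 * 2 * SECONDS_PER_DAY
used = 190_000_000  # CPU-core-seconds of jobs and reservations (made up)

u = utilization(used, total, maintenance)
print(f"{u:.1%} -> {classify(u)}")
```

Subtracting maintenance time from the base means that a cluster is not penalized for planned or forced downtime, which is why hardware problems affect the metric only through capacity that was nominally available but could not be used.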

graf

The clusters in highest demand are the most powerful multiprocessor ones with sufficient memory. In general, demand for newer (hence faster) CPUs is higher, which yields higher utilization as well. On the other hand, scheduling many-core or huge-memory computations is more difficult (machines must first be drained of smaller computations), so the utilization of SMP clusters is slightly lower.

New SMP Clusters Zewura and Mandos

Zewura is the largest cluster, with 1600 (20x80) CPUs and the largest memory, 512 GB per node. The first part of the cluster became operational in December 2011, the second part half a year later. Mandos is the second largest cluster in MetaCentrum, with 896 (14x64) CPUs and 256 GB of RAM per node. Zewura and Mandos are SMP (symmetric multiprocessing) clusters whose nodes each have a single shared main memory. They are suitable for applications with enormous memory requirements and/or for parallel applications composed of a number of processes communicating through shared memory.

graf

graf

Both clusters are very popular for jobs with large memory demands that run on a small number of CPUs. It therefore often happens that one user occupies just a few CPUs but the entire memory of a machine; no other user can then use the node, yet its processors stay idle and the node appears free in the accounting.
The utilization of Mandos is above 70%. Zewura has faced hardware problems from the beginning and in both deliveries, and some nodes were set aside for the supplier's testing. Its utilization is therefore slightly below 65%.
The following figure of total computed time per cluster shows that Zewura and Mandos became very popular, with the highest computed time of all clusters in 2012.

graf
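The gap between CPU-based accounting and real node availability for large-memory jobs, noted above for Zewura and Mandos, can be illustrated with a small model. The `Node` and `Job` classes and the sample job sizes are hypothetical; only the per-node figures for Zewura (80 CPUs, 512 GB) come from the report.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    cores: int
    memory_gb: int

@dataclass
class Job:
    cores: int
    memory_gb: int

def cpu_utilization(node: Node, jobs: List[Job]) -> float:
    """Fraction of the node's cores allocated to jobs (what CPU accounting sees)."""
    return sum(j.cores for j in jobs) / node.cores

def memory_utilization(node: Node, jobs: List[Job]) -> float:
    """Fraction of the node's memory allocated to jobs."""
    return sum(j.memory_gb for j in jobs) / node.memory_gb

def can_accept(node: Node, jobs: List[Job], new_job: Job) -> bool:
    """A new job fits only if both enough cores and enough memory are free."""
    free_cores = node.cores - sum(j.cores for j in jobs)
    free_memory = node.memory_gb - sum(j.memory_gb for j in jobs)
    return new_job.cores <= free_cores and new_job.memory_gb <= free_memory

zewura_node = Node(cores=80, memory_gb=512)   # per-node figures from the report
running = [Job(cores=4, memory_gb=500)]       # hypothetical large-memory job

print(cpu_utilization(zewura_node, running))                      # 0.05
print(round(memory_utilization(zewura_node, running), 2))         # 0.98
print(can_accept(zewura_node, running, Job(cores=1, memory_gb=16)))  # False
```

In this model the node counts as 5% utilized by CPU time even though it cannot accept even a small additional job, which is one reason the CPU-based utilization of SMP clusters understates how busy they are.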

HD Clusters Zegox and Minos

The high-density cluster Zegox has 576 (48x12) CPUs and 90 GB of RAM per node; its utilization is nearly 80%. All nodes of the cluster are included in the OpenNebula cloud and are thus available to run user-supplied operating system images when required. Nodes that are not being used by cloud users run the “CERIT-SC standard” OS image, which is a worker node of the Torque batch system. Hence the nodes become available in the batch system transparently.

Minos is the third largest cluster, with 600 (50x12) CPUs in total and 24 GB of RAM per node. The cluster is partially dedicated to virtualization experiments, so its utilization by grid users is below the MetaCentrum average. Jobs from these experiments are not counted.

graf

graf

Clusters owned by research groups

Clusters with lower utilization (e.g. Loslab, Quark, and Orca) run in a restricted mode with significant capacity dedicated to their owners, so their total utilization is lower. In contrast, Perian is owned by the group (NCBR) that generates the largest fraction of the whole MetaCentrum load, so it is well utilized. Moreover, the jobs run there are specifically crafted to match the number of CPU cores per node and the available memory, and hence use the resources optimally.

graf

graf

graf

graf

graf

graf

graf

Jobs in MetaCentrum

In 2012, users' jobs consumed 22 million CPU hours in over one million jobs (6.5 million CPU hours in 609 thousand jobs in 2011). Compared to 2011, the total CPU time roughly tripled, which corresponds to the increase in the number of CPUs. The number of executed jobs increased only by a factor of about 1.7.

graf

This disproportion between the increase in CPU time and in the number of executed jobs can be explained by the significant increase in the number of parallel jobs in 2012. For example, jobs requiring more than 16 CPUs were rather unusual in the previous year, while this year they represent a significant proportion of all jobs. Users started to use the new large SMP clusters; the largest jobs requested over 500 CPUs.
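One way to quantify this disproportion is the mean CPU time per job, computed here from the yearly totals given in this report. It is a rough sketch: the mean hides the distribution of job sizes, and a higher value can reflect more CPUs per job as well as longer run times.

```python
# Yearly totals as reported (job counts rounded to thousands).
jobs = {2011: 609_000, 2012: 1_080_000}
cpu_hours = {2011: 6.5e6, 2012: 22e6}

# Mean CPU hours consumed by a single job in each year.
mean_cpu_hours = {year: cpu_hours[year] / jobs[year] for year in jobs}

for year in sorted(mean_cpu_hours):
    print(year, round(mean_cpu_hours[year], 1))   # 2011 10.7, then 2012 20.4

# How much the average job grew between the two years.
growth = mean_cpu_hours[2012] / mean_cpu_hours[2011]
print(round(growth, 2))   # 1.91
```

On these figures the average job consumed nearly twice as much CPU time in 2012 as in 2011, which is consistent with the shift toward parallel jobs described above.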

graf

Most jobs in 2012 (over 65% of all jobs) started immediately, within 60 seconds. Almost 80% of all jobs started within 24 hours. Compared to the previous year, we noticed an increase in the number of jobs that waited 1 hour or more in 2012.

graf

Longer waiting times are common for new, fast machines that are popular among MetaCentrum users, and also for parallel jobs with many CPUs or for memory-demanding jobs. Other reasons for a job to wait include a very rare combination of requested machine properties, a requested combination of properties that is not available in MetaCentrum, the user exceeding queue limits, MPI jobs waiting until all required nodes are free, etc.

The following figure shows the increase in job execution duration in 2012 compared to 2011.

graf

According to computed time, the most utilized queues were long@arien and default@wagap. The long queue is suitable for jobs with an expected duration from 24 hours to 30 days.

graf

In the MetaCentrum infrastructure, a job's maximum run-time (which influences its scheduling) is specified implicitly by placing the job into one of a set of pre-defined, time-limited queues (short, normal, long, etc.). In contrast, all jobs in the CERIT-SC computing infrastructure (except backfill ones) are placed into a single default queue.

According to the number of executed jobs, the most frequently used queue was backfill@arien, a low-priority queue dedicated to many single-CPU jobs (limit 1000 submitted jobs per user and 24 hours) that fill the nodes when they are free.

graf

Backfill is a low-priority queue; jobs from this queue "fill" free gaps in the schedule (e.g., while waiting for the completion of a job that holds resources requested by a starving job). The queue accepts only single-node jobs with a specified maximum run-time of up to 24 hours. When necessary (e.g., to reserve resources for a long-term job), the backfill jobs may be suspended or even terminated by us at any time.
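The idea of filling a gap in the schedule can be sketched as follows. This is a deliberately simplified model and not the Torque scheduler's actual algorithm: it assumes a single node that is free for a known number of hours, takes jobs in submission order, and runs them back to back. All names are hypothetical; only the single-node rule and the 24-hour limit come from the text.

```python
from dataclasses import dataclass
from typing import List

MAX_BACKFILL_WALLTIME_H = 24  # maximum run-time accepted by the backfill queue

@dataclass
class BackfillJob:
    name: str
    nodes: int
    walltime_h: float  # maximum run-time declared by the user

def accepted_by_queue(job: BackfillJob) -> bool:
    """Queue rule from the report: single-node jobs of at most 24 hours."""
    return job.nodes == 1 and 0 < job.walltime_h <= MAX_BACKFILL_WALLTIME_H

def fill_gap(candidates: List[BackfillJob], gap_h: float) -> List[str]:
    """Greedily pack jobs, in submission order, into one node free for gap_h hours."""
    chosen = []
    remaining = gap_h
    for job in candidates:
        if accepted_by_queue(job) and job.walltime_h <= remaining:
            chosen.append(job.name)
            remaining -= job.walltime_h
    return chosen

queue = [
    BackfillJob("A", nodes=1, walltime_h=6),
    BackfillJob("B", nodes=2, walltime_h=2),   # rejected: not single-node
    BackfillJob("C", nodes=1, walltime_h=30),  # rejected: over 24 hours
    BackfillJob("D", nodes=1, walltime_h=5),   # skipped: only 4 hours remain
    BackfillJob("E", nodes=1, walltime_h=2),
]
print(fill_gap(queue, gap_h=10))  # ['A', 'E']
```

Because backfill jobs may be suspended or terminated at any time, a real scheduler does not strictly need a job to fit the gap; the fit check here only illustrates why short, single-node jobs are the natural candidates.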

Institutions, groups and users

There are 322 users who executed at least one job during 2012, up from 240 in 2011. MetaCentrum is used by a number of research groups from several academic or scientific institutions that host computational groups.
A more detailed view of computed time by institution is shown in the following figure.

graf

The most active users (by consumed CPU time) are from Masaryk University; its members computed over 10 million CPU hours (more than 50% of the total computed CPU time). There are 91 users from Masaryk University who executed at least one job during this period; most of them come from the NCBR (part of CEITEC) or Loschmidt Laboratories groups. Both groups own clusters dedicated to their research, which are connected to MetaCentrum and also available to other users. The other universities have less computed time but also fewer active users: Charles University has 32 users, the University of West Bohemia 38 users, and the Academy of Sciences and the University of South Bohemia 15 users each.

The same trend can be observed among the groups. The majority of users are not assigned to research groups; the largest and most active group is once again NCBR (41 members in MetaCentrum, part of CEITEC).

graf

The top ten users computed more than 10 million CPU hours, which is almost 50% of the total computed time. In terms of the number of jobs, they ran almost 800 thousand jobs, nearly 70% of all jobs executed in MetaCentrum. The most active users include members of the NCBR (4 users) and Loschmidt Laboratories (2 users) groups, both from Masaryk University, and 3 users from Charles University.

graf

The majority of recognized applications used in MetaCentrum are from the chemical domain. The unrecognized applications are often programs created by the users themselves.

  • The Amber package contains about 50 applications that cover methods used in computational chemistry
  • The Gaussian package is based on quantum mechanics and is used to study molecules and chemical reactions
  • The VASP package performs ab-initio quantum-mechanical molecular dynamics
  • The Gromacs package computes molecular mechanics and dynamics: energy minimization and the dynamic behaviour of molecular systems

graf

Storage Usage

The main high-capacity online storage facility in MetaCentrum (an NFSv4 volume) has a total capacity of 600 TB (124 TB in 2011) and consists of five disk arrays in three geographic locations (3 x Brno, Pilsen, and Ceske Budejovice). At the end of 2012, user data occupied approximately 380 TB (about 60% of the total capacity) in 233 million files (155 million files in 2011).

 

Last changed:2013-02-08 09:50:24