SSCC News for August 2024

Farewell to Doug and Amanda!

Doug Hemken will retire as an SSCC statistical consultant at the end of September, after a long career during which he enhanced the work of thousands of researchers. Amanda Todd has left SSCC this week to move closer to family. Many of you have had the experience of talking to Amanda at the Help Desk and just knowing your problem will be solved. We wish them both all the best.

Doug’s departure leaves the SSCC with less expertise in SAS and a good bit less expertise in SPSS. (Russell Dimond is now the go-to person for both.) We’d love to teach the remaining SPSS users R or Stata. We’ll also have less capacity for training and statistical consulting in general.

Amanda’s departure will leave just Ryan St. Peters to staff the Help Desk, so we’ve made some changes so he can do that and carry out his other duties. The Help Desk will be available by email, voice mail, and video chat or in-person appointment (i.e. not walk-in or phone) and the hours will be reduced to 9AM-12PM and 1PM-3PM.

SSCC Training

Next week the SSCC’s statistical consultants will teach our core R and Stata workshops. The “Introduction to…” workshops will teach you how the software works and prepare you to excel in classes that use it (including the SSCC’s workshops). The “Data Wrangling in…” workshops will prepare you to do data-driven research.

Data wrangling is the process of taking raw data and putting it in a form that can be analyzed. If you’re not sure you need to take a data wrangling workshop, suppose you were given a data set of individuals living in households and think about whether, and how, you could carry out the following tasks:

  • Restructure the data set from one row per household to one row per person.
  • Calculate the household income.
  • Identify the households that contain children.
  • Combine the data set with one containing one row per county to add county-level variables.

If you don’t know how to carry out these kinds of tasks (and want to work with quantitative data in your research) you should take data wrangling. If your plan for carrying them out involves Excel, you should take data wrangling. (Excel is not a tool for research.)
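For a sense of what the four tasks above involve, here is a minimal sketch in plain Python. It is an illustration only, not the workshops' approach (they teach R and Stata). The data values, the column names (hh_id, age1, income1, and so on), and the helper to_long are all hypothetical, and it assumes anyone under 18 counts as a child.

```python
from collections import defaultdict

# Hypothetical raw data: one row per household, with person-level
# values stored in numbered columns (age1, income1, age2, income2, ...).
households = [
    {"hh_id": 1, "county": "A", "age1": 41, "income1": 52000,
     "age2": 39, "income2": 48000, "age3": 7, "income3": 0},
    {"hh_id": 2, "county": "B", "age1": 67, "income1": 30000,
     "age2": None, "income2": None, "age3": None, "income3": None},
]

# Hypothetical county-level data: one entry per county.
counties = {"A": {"urban": True}, "B": {"urban": False}}


def to_long(rows, max_people=3):
    """Reshape from one row per household to one row per person."""
    people = []
    for row in rows:
        for i in range(1, max_people + 1):
            age = row.get(f"age{i}")
            if age is None:  # empty slot: no such person in this household
                continue
            people.append({
                "hh_id": row["hh_id"],
                "county": row["county"],
                "person": i,
                "age": age,
                "income": row[f"income{i}"],
            })
    return people


# Task 1: one row per person.
people = to_long(households)

# Task 2: household income is the sum of its members' incomes.
hh_income = defaultdict(int)
for p in people:
    hh_income[p["hh_id"]] += p["income"]

# Task 3: households with at least one member under 18 (assumed cutoff).
has_children = {p["hh_id"] for p in people if p["age"] < 18}

# Task 4: add county-level and household-level variables to each person row.
merged = [
    {**p,
     **counties[p["county"]],
     "hh_income": hh_income[p["hh_id"]],
     "hh_has_children": p["hh_id"] in has_children}
    for p in people
]
```

Real data sets add complications this sketch ignores, such as missing incomes or counties that appear in one file but not the other.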

Visit the SSCC’s training page for details and to register.

Summer Tech Update

The Summer Tech Update was carried out on August 10, 2024. As part of it, all software was updated to the latest version. A few things to note:

  • You’ll need to reinstall your R packages to ensure they’re compatible with the latest R.
  • If upgrading Python or Anaconda packages caused problems for your Python code, using a conda environment will ensure that never happens again.
  • Matlab Parallel Server has changed how clusters are configured. See the documentation for updated instructions (it’s easy).

Slurm is Busy! Now What?

More and more people are learning to use the Slurm cluster, and it’s now common for the cluster to be too busy to start new jobs the moment they’re submitted. In the last SSCC News we talked about priorities: someone who puts hundreds of jobs in the queue becomes the lowest priority, so you should submit your jobs anyway. This time we’ll focus on how you can get your job to run as soon as possible.

  • Visit the Slurm Status page and craft your job to fit in the resources available. Your job may run fastest if it gets 128 cores, but if only asking for 100 cores lets it start right now, it will probably get done sooner. Or maybe you’ll find 128 cores are available.
  • Be sure not to ask for more resources than you need. Jobs that need the high memory servers (>250GB) are the most likely to have to wait in the queue. (The email you get when your job finishes will tell you how much memory it actually used.) Jobs that use very large numbers of cores may reach a point of diminishing returns.
  • Use the short partition if your job will take less than six hours. Some servers only run short jobs.
  • Matlab Parallel Server can run workers on multiple servers. We’ve advertised it as a way to run jobs that use more than 128 cores, but it can also take advantage of unused cores on multiple servers to assemble whatever number of cores you need. To use it, you may not need to change anything in your code other than the cluster name.
  • Linstat will always start jobs right away, though if it gets busy it will slow down and become much less efficient than Slurm. Use it when you need some computing power right now.
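The trade-off in the first tip (fewer cores now versus more cores after a wait) can be sketched with some rough arithmetic. This is a back-of-the-envelope illustration, not a Slurm tool: the function finish_time_hours, the 512 core-hour job, and the two-hour wait are hypothetical, and it assumes run time scales perfectly with the number of cores, which real jobs rarely achieve.

```python
def finish_time_hours(core_hours, cores, wait_hours):
    """Estimated hours until a job finishes: queue wait plus run time.

    Assumes perfect scaling, so run time = total work / number of cores.
    """
    return wait_hours + core_hours / cores


work = 512  # hypothetical job that needs 512 core-hours in total

# Option 1: ask for 100 cores that are free right now.
start_now = finish_time_hours(work, cores=100, wait_hours=0)

# Option 2: ask for 128 cores and wait an assumed two hours in the queue.
start_later = finish_time_hours(work, cores=128, wait_hours=2)

# Longest wait for which the 128-core request still finishes first.
break_even_wait = work / 100 - work / 128

print(f"100 cores now:        done in {start_now:.2f} hours")
print(f"128 cores after wait: done in {start_later:.2f} hours")
print(f"Break-even wait:      {break_even_wait:.2f} hours")
```

Under these assumptions the smaller request finishes first whenever the wait for 128 cores exceeds the break-even value. Because real jobs scale less than perfectly, the smaller request is often even more attractive than this estimate suggests.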

TreeSize Storage Manager

We appreciate the effort so many of you have made to use SSCC disk space more efficiently. One challenge is knowing what is taking up space. TreeSize can analyze your disk usage, including on SSCC network drives, so you can focus your efforts to save space where they’ll do the most good. TreeSize is available on the lab computers and Winstat, or you can install it on your Windows computer from SSCC Software Center or the Campus Software Library.

Changes to 4218 and 2470

The computer lab in 4218 Sewell Social Sciences Building is being remodeled into a small teaching space for SSCC training. 2470 Sewell Social Sciences Building will replace it as a place to work, and will be ready for use (with brand-new computers) by the start of the semester. Remember that neither lab will have printers.