Three new papers lay out the opportunities and challenges

Pangeo got involved in cloud computing almost by accident. In our original 2017 NSF proposal, the program manager asked us to trim our budget. So we removed the servers we had planned to buy and instead asked to be included in the NSF BIGDATA program (now defunct), which provided direct grants of credits from the big cloud providers. What followed was a period of intense experimentation, building things, breaking things, and growing a community around cloud-native geoscience research.

Four years later, we are convinced that cloud computing holds the power to transform scientific research and data science education in fundamental…

New Funding and New Directions

On behalf of the Pangeo Steering Council

TLDR: Pangeo is launching new efforts in the areas of education, cloud data storage, and machine learning. We are pivoting away from directly operating cloud-based JupyterHubs, migrating this role to new service providers in this space.

A Brief History of Pangeo

The Pangeo project effectively began in 2016 with a workshop at Columbia University. The schedule is still online and is fun to review. The workshop was an exciting mix of science and technology, a dynamic that continues to characterize Pangeo today. The mission for Pangeo developed at that workshop has stood the test of time:

Our mission…

This short blog post is a technical followup to a post from earlier this year:

In that post, I shared the following about my struggles with collaboration and project management tools, and the need for a better system, particularly given that we are now all working remotely because of COVID-19:

Keeping track of many different projects and collaborations requires organization. This is especially important when team members are not regularly meeting face-to-face. I tend to struggle in this department. However, I have found that technology solutions can really help. In the past, we used a mix of…

The 2020–2021 academic year is shaping up to be drastically different from those that came before it, both in general and for me particularly. The aim of this blog post is to summarize where I am at and how I will be doing business for my collaborators and advisees during this time.

My Family’s Current Situation

This year will be my first sabbatical since starting my position at Columbia. We originally planned to spend a good chunk of our time in Italy, but circumstances redirected us instead to my home state of Vermont, a place whose exemplary handling of the pandemic has made for…

by Ryan Abernathey and Tom Augspurger

TLDR: this post describes a new Python library called rechunker, which performs efficient on-disk rechunking of chunked array storage formats. Rechunker allows you to write code like this:

from rechunker import rechunk
target_chunks = (100, 10, 1)
max_mem = "2GB"
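plan = rechunk(source_array, target_chunks, max_mem,
               "target_store.zarr",           # placeholder output store path (assumed)
               temp_store="temp_store.zarr")  # placeholder scratch store path (assumed)
plan.execute()  # run the rechunking plan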

…and have the operation parallelized over any number of Dask workers.
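
As a rough illustration of why the chunk shape passed as target_chunks matters, the sketch below counts how many chunks a read has to touch under two different layouts. It is plain Python and does not use rechunker; the array shape, the (time, y, x) axis order, and the helper name chunks_touched are all assumptions made up for this example.

```python
def chunks_touched(selection, chunks):
    """Count the chunks that a rectangular read overlaps.

    selection: one (start, stop) pair per axis, with stop exclusive.
    chunks: chunk length along each axis.
    """
    count = 1
    for (start, stop), size in zip(selection, chunks):
        first = start // size        # index of the first chunk hit on this axis
        last = (stop - 1) // size    # index of the last chunk hit on this axis
        count *= last - first + 1
    return count


# Hypothetical array with axes (time, y, x) and shape (1000, 100, 100).
# Layout A stores one whole map per time step (assumed for contrast).
map_chunks = (1, 100, 100)
# Layout B uses the target_chunks value from the snippet above.
series_chunks = (100, 10, 1)

# Read the full time series at a single grid point.
point_series = ((0, 1000), (50, 51), (50, 51))
# Read one full spatial map at a single time step.
single_map = ((0, 1), (0, 100), (0, 100))

for name, chunks in [("map-oriented", map_chunks), ("series-oriented", series_chunks)]:
    print(name,
          "time series:", chunks_touched(point_series, chunks),
          "map:", chunks_touched(single_map, chunks))
```

Under these assumptions, each layout is cheap for one access pattern and expensive for the other, which is the situation rechunking is meant to address.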


Chunked arrays are a key part of the modern scientific software stack in fields such as geospatial analytics and bioinformatics. Chunked arrays take a large multidimensional array dataset, such as an image captured over many…

Ryan Abernathey wrote this post, with input from the rest of the Pangeo Steering Council.

On the 10th of June, 2020, a historic event brought academia, science, and tech together for a strike in solidarity with the anti-racist Black Lives Matter movement.

This event was sparked by the tragic deaths of George Floyd, Breonna Taylor, Ahmaud Arbery, countless others like them, and the systematic anti-Black racism that makes such killings commonplace in America. The strike was spearheaded by a group of physicists but soon engulfed many disciplines, as scientists sought a conduit for the outrage we feel.

The aim of…

By Ryan Abernathey & Joe Hamman

Anyone working with large-scale Earth System data today faces the same general problems:

  • The data we want to work with are huge (typical analyses involve several TB at least)
  • The data we need are produced and distributed by many different organizations (NASA, NOAA, ESGF, Copernicus, etc.)
  • We want to apply a wide range of different analysis methodologies to the data, from simple statistics to signal processing to machine learning.

The community is waking up to the idea that we can’t simply expect scientists to download all this data to their personal computers for processing.

Download-based workflow. From Abernathey, Ryan (2020): Data Access Modes in Science. figshare. Figure.

“The cloud” — here defined broadly…

Last week we announced that Pangeo has partnered with Google Cloud to bring CMIP6 climate data to Google Cloud’s Public Datasets Program. You can read more about the dataset and the process here.

CMIP6 stands for Coupled Model Intercomparison Project Phase 6, a project that is part of the World Climate Research Programme.

The concept of cloud computing and cloud-based data is still very new to many scientists. In this post, I’ll try to explain what cloud-based data analysis looks like by running through five different scenarios for interacting with the CMIP6 data. …

We are thrilled to announce the launch of the Lamont-Doherty Earth Observatory Climate Data Science Lab. A $2.3M grant from the Gordon and Betty Moore Foundation, to principal investigator Ryan Abernathey, will enable us to tackle some of the most difficult scientific problems and data challenges in Climate Data Science. We are extremely grateful to the Moore Foundation for their generous support and for their progressive vision for data-driven science.

Motivation and Goals

Oceanographers and climate scientists have access to a wealth of data–from instruments, satellite observations, and numerical simulations–to help confront the intellectual and societal challenges posed by climate change. But…

Ryan Abernathey and Joe Hamman cowrote this blog post following discussion with the Pangeo Steering Council.

The basic ingredients for a Pangeo deployment are a fast parallel storage system, scalable high-performance compute nodes, access to the internet, and software which makes all these elements work together for an amazing interactive data-analysis experience.

Pangeo deployment architecture. Via

We have deployed Pangeo on a wide range of different systems, from small university clusters to Top 500 supercomputers. We have also found that commercial cloud environments (e.g. Amazon Web Services, Google Cloud Platform, Microsoft Azure) are a particularly good fit for Pangeo. The ability to quickly scale…

Ryan Abernathey

Associate Professor, Earth & Environmental Sciences, Columbia University.
