Exploring the DANDI Archive¶
This notebook serves as a quick-start guide for the Distributed Archive for Neurophysiology Data Integration (DANDI).
The DANDI Archive holds hundreds of Dandisets with a diverse range of neurodata modalities.
These modalities span the spectrum of microscopy, optogenetics, intracellular and extracellular electrophysiology, and optophysiology.
While we cannot hope to completely showcase this diversity here, there are two key examples which provide a good starting point:
- 000728 - Visual Coding - Optical Physiology by the Allen Institute for Brain Science (AIBS)
- 000409 - Brain Wide Map by the International Brain Laboratory (IBL)
For even more usage guides, dandiset-specific tutorials, and general documentation, please read the main DANDI Docs.
Q: How do I navigate the archive and its datasets?¶
DANDI provides a web interface, REST API, and command-line interface (CLI) to help users intuitively navigate the contents.
The easiest place to start is the primary Dandiset listing page.
After scrolling around for a while, we choose our first Dandiset from the web interface: 000728.
We can see the contents by going to the "Files" tab.
From here, we can see that a Dandiset is organized as a collection of folders named by subject ID.
Each folder contains files named according to session ID and other unique identifiers:
000728/
├── sub-691657859/
│   ├── sub-691657859_ses-712919679-StimB_ophys.nwb
│   ├── sub-691657859_ses-710504563-StimA_behavior+image+ophys.nwb
│   └── ...
├── sub-501800590/
│   └── ...
└── ...
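This `sub-`/`ses-` naming convention can also be parsed programmatically. Below is a minimal sketch; the `parse_dandi_path` helper is our own invention (not part of the DANDI library), assuming paths follow the `sub-<ID>/sub-<ID>_ses-<ID>_<suffix>.nwb` pattern shown above:

```python
import re

def parse_dandi_path(path):
    """Extract entity key-value pairs (subject and session IDs) from an asset path."""
    return dict(re.findall(r"(sub|ses)-([^_/.]+)", path))

entities = parse_dandi_path("sub-691657859/sub-691657859_ses-712919679-StimB_ophys.nwb")
print(entities)  # {'sub': '691657859', 'ses': '712919679-StimB'}
```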
Setup¶
Before we start accessing data contents, we will need to install and import some Python libraries:
!pip install -q dandi matplotlib remfile opencv-python-headless
from pathlib import Path
import cv2
import h5py
import remfile
import matplotlib.pyplot as plt
import numpy as np
from dandi.dandiapi import DandiAPIClient
from pynwb import read_nwb, NWBHDF5IO
Next, we will initialize our DANDI API client to interact with the archive database and list a few of the available Dandisets:
client = DandiAPIClient()
dandisets = list(client.get_dandisets())
# Print the dandiset IDs and titles of the first 3 dandisets
for dandiset in dandisets[:3]:
    print(f"{dandiset.identifier}: {dandiset.get_raw_metadata()['name']}")
000003: Physiological Properties and Behavioral Correlates of Hippocampal Granule Cells and Mossy Cells
000004: A NWB-based dataset and processing pipeline of human single-neuron activity during a declarative memory task
000005: Electrophysiology data from thalamic and cortical neurons during somatosensation
Now let's return to our first example Dandiset and list out a few of its contents:
dandiset = client.get_dandiset(dandiset_id="000728", version_id="0.240827.1809")
assets = list(dandiset.get_assets())
# Print the file paths as seen on the DANDI web interface
for asset in assets[:3]:
    print(asset.get_raw_metadata()["path"])
sub-691657859/sub-691657859_ses-712919679-StimB_ophys.nwb
sub-501800590/sub-501800590_ses-509522931-StimC_ophys.nwb
sub-572569275/sub-572569275_ses-591563201-StimB_ophys.nwb
Note that we pinned the Dandiset to a specific published version by passing a `version_id` in this case.
Dandisets that are published on the archive are given a citable DOI, such as:
DOI: 10.48324/dandi.000728/0.240827.1809
These metadata models are very rich; their full potential is best showcased in the Advanced Search Tutorial.
Q: What kinds of data are hosted and what formats do they use?¶
DANDI accepts a relatively small number of open, community-driven file formats designed according to NIH-accepted data standards*.
| Standard | Data modalities | File formats |
| --- | --- | --- |
| NWB (Neurodata Without Borders) | Neurophysiology and behavior | HDF5, Zarr |
| BIDS (Brain Imaging Data Structure) | Neuroimaging (MRI, EEG, etc.) | JSON, TSV |
| NGFF (Next Generation File Format) | Microscopy (cloud-optimized) | Zarr |
These data standards are specifically designed to integrate multi-modal raw and processed neurodata alongside behavioral data and metadata annotations.
The S3 bucket hosting the DANDI archive allows users to take advantage of cloud-native services for scalable data access, computation, visualization, and analysis.
This allows DANDI to integrate with many external visualization tools, accessible via the "Open With" button on the web interface.
*The difference between data formats and standards is elaborated in greater detail in the Data Standards section of the documentation.
Q: How do I access the contents of a Dandiset?¶
Data assets from a Dandiset can either be downloaded directly from the web page, through the CLI, or programmatically:
# Look up a specific file asset from a different Dandiset
dandiset = client.get_dandiset(dandiset_id="000728")
dandi_filename = "sub-495727015/sub-495727015_ses-501559087-StimB_behavior+image+ophys.nwb"
asset = dandiset.get_asset_by_path(path=dandi_filename)
# Download the entire file (alter the base directory as needed)
output_path = Path.cwd() / Path(dandi_filename).name
if not output_path.exists():
    asset.download(filepath=output_path)
To open the file after the download completes, we can use the PyNWB library to read the NWB file and display the basic content layout:
nwbfile = read_nwb(path=output_path)
print(nwbfile)
root pynwb.file.NWBFile at 0x1491548314544
Fields:
data_collection: Generated by pipeline Brain Observatory version 3.0.
devices: {
Camera <class 'pynwb.device.Device'>,
Microscope <class 'pynwb.device.Device'>,
StimulusDisplay <class 'pynwb.device.Device'>
}
epochs: epochs <class 'pynwb.epoch.TimeIntervals'>
experiment_description: For more information, please see http://help.brain-map.org/display/observatory/Allen+Brain+Observatory
file_create_date: [datetime.datetime(2024, 3, 19, 15, 55, 50, 497145, tzinfo=tzoffset(None, -14400))]
identifier: 211c0e3c-4b28-4375-b539-cdb71a42851b
imaging_planes: {
ImagingPlane <class 'pynwb.ophys.ImagingPlane'>
}
institution: Allen Institute for Brain Science
intervals: {
epochs <class 'pynwb.epoch.TimeIntervals'>
}
notes: Container ID: 511510736
Mouse ID (from genotype white paper): 222426
Session type: three_session_B
processing: {
behavior <class 'pynwb.base.ProcessingModule'>,
ophys <class 'pynwb.base.ProcessingModule'>
}
protocol: 20160204_222426_3StimB
session_description: Auto-generated by neuroconv
session_id: 501559087-StimB
session_start_time: 2016-02-04 10:25:24-08:00
stimulus: {
natural_movie_one_stimulus <class 'pynwb.image.IndexSeries'>,
natural_scenes_stimulus <class 'pynwb.image.IndexSeries'>,
spontaneous_stimulus <class 'pynwb.epoch.TimeIntervals'>,
static_gratings <class 'pynwb.epoch.TimeIntervals'>
}
stimulus_template: {
natural_movie_one <class 'pynwb.image.ImageSeries'>,
natural_scenes_template <class 'pynwb.base.Images'>
}
subject: subject pynwb.file.Subject at 0x1492272903952
Fields:
age: P104D
age__reference: birth
description: Mus musculus in vivo.
genotype: Cux2-CreERT2/wt;Camk2a-tTA/wt;Ai93(TITL-GCaMP6f)/Ai93(TITL-GCaMP6f)
sex: M
species: Mus musculus
strain: Cux2-CreERT2;Camk2a-tTA;Ai93-222426
subject_id: 495727015
timestamps_reference_time: 2016-02-04 10:25:24-08:00
If you are interested in learning more about two-photon calcium imaging, or have any questions about the rest of this experiment, check out the OpenScope Databook.
A common way of analyzing fluorescence imaging data is to quantify the change in the amount of light emitted by specific regions.
One such measure is the $\Delta F/F$ time series, which is derived from the raw two-photon calcium imaging data.
This data stream can be found under the 'processing' module:
df_over_f_array = nwbfile.processing["ophys"]["DfOverF"]["DfOverF"].data
# Get a subset of the data for visualization.
# `df_over_f_array` has shape (number of frames, number of ROIs), i.e. time x ROIs.
time_series_data = df_over_f_array[:1000, :5]
plt.figure(figsize=(7, 3))
for i in range(time_series_data.shape[1]):
    plt.plot(time_series_data[:, i], alpha=0.7)
plt.xlabel('Time (frames)')
plt.ylabel('ΔF/F')
plt.title('Calcium Imaging Time Series (ΔF/F)')
plt.show()
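For intuition, $\Delta F/F$ is defined as $(F - F_0)/F_0$, where $F$ is the raw fluorescence and $F_0$ is a baseline estimate. A toy sketch on synthetic data, using a simple percentile baseline (a crude stand-in, not the exact AIBS pipeline):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Synthetic raw fluorescence: flat baseline around 100 a.u. plus noise
raw_f = 100 + rng.normal(loc=0, scale=1, size=1000)
raw_f[200:210] += 50  # inject one calcium-transient-like event

f0 = np.percentile(raw_f, 10)   # crude baseline estimate
df_over_f = (raw_f - f0) / f0   # ΔF/F

print(df_over_f[205])  # large positive deflection at the transient
```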
Such lazy slicing is particularly useful when working with large (> 60 GB) datasets that may not otherwise fit into memory.
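The reason slicing helps is that `h5py` datasets are read lazily: indexing fetches only the requested region. A self-contained illustration using an in-memory HDF5 file (the file name is arbitrary; with `backing_store=False` nothing is written to disk):

```python
import h5py
import numpy as np

with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset(
        "data", data=np.arange(1_000_000, dtype=np.float32).reshape(1000, 1000)
    )
    # Only these 50 values are materialized as a NumPy array
    subset = dset[:10, :5]
    print(subset.shape)  # (10, 5)
```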
Some files on the DANDI archive can be quite large - even hundreds of gigabytes - which makes downloading a file just to explore its contents impractical.
Thankfully, instead of downloading, you can stream data directly from S3!
Let's give that a try:
s3_url = asset.get_content_url(follow_redirects=1, strip_query=True)
rem_file = remfile.File(url=s3_url)
h5py_file = h5py.File(name=rem_file, mode="r")
io = NWBHDF5IO(file=h5py_file)
streamed_nwbfile = io.read()
streamed_df_over_f_array = streamed_nwbfile.processing["ophys"]["DfOverF"]["DfOverF"].data
streamed_time_series_data = streamed_df_over_f_array[:1000, :5]
plt.figure(figsize=(7, 3))
for i in range(streamed_time_series_data.shape[1]):
    plt.plot(streamed_time_series_data[:, i], alpha=0.7)
plt.xlabel('Time (frames)')
plt.ylabel('ΔF/F')
plt.title('Calcium Imaging Time Series (ΔF/F)')
plt.show()
While we just showcased a simple data array, DANDI assets can also include beautiful images and videos!
Now let's get a better understanding of how the identified regions of interest relate to our underlying imaging data.
These data streams can be accessed and displayed in nearly the same manner:
summary_image = nwbfile.processing["ophys"]["SummaryImages"]["maximum_intensity_projection"][:]
plane_segmentation = nwbfile.processing["ophys"]["ImageSegmentation"]["PlaneSegmentation"]
combined_image_masks = np.zeros(shape=summary_image.shape)
for pixel_mask in plane_segmentation["pixel_mask"][:]:
    for x, y, w in pixel_mask:
        combined_image_masks[x, y] += w
masked_image = np.ma.masked_where(combined_image_masks == 0, combined_image_masks)
plt.figure(figsize=(7, 7))
plt.imshow(summary_image, cmap="gray")
plt.imshow(masked_image, cmap="viridis", alpha=0.5)
plt.axis('off');
The gray background represents the imaging space, while the colored overlay indicates the identified regions where neural activity was measured.
As mentioned, DANDI hosts a diverse range of neurophysiology data modalities - not just optophysiology!
Let's also showcase some electrophysiology and behavioral data:
# Setup streaming like we did before
ecephys_dandiset = client.get_dandiset("000409", "draft")
subject_id = "sub-NYU-39"
session_id = "ses-6ed57216-498d-48a6-b48b-a243a34710ea"
ecephys_path = f"{subject_id}/{subject_id}_{session_id}_desc-processed_behavior+ecephys.nwb"
ecephys_asset = ecephys_dandiset.get_asset_by_path(path=ecephys_path)
ecephys_s3_url = ecephys_asset.get_content_url(follow_redirects=1, strip_query=True)
ecephys_rem_file = remfile.File(url=ecephys_s3_url)
ecephys_h5py_file = h5py.File(name=ecephys_rem_file, mode="r")
ecephys_io = NWBHDF5IO(file=ecephys_h5py_file)
ecephys_nwbfile = ecephys_io.read()
# Filter by 'good units' (those with non-NaN waveforms) and select one for visualization
units_dataframe = ecephys_nwbfile.units.to_dataframe()
good_unit = units_dataframe[
    [not np.isnan(row["waveform_mean"]).any() for _, row in units_dataframe.iterrows()]
].iloc[10]
# Extract waveform and convert from volts to microvolts
waveform_uV = good_unit["waveform_mean"] * 1e6
trace = waveform_uV[:, 19]  # select one recording channel from the multi-channel waveform
# Define the time axes of the waveform
num_samples = waveform_uV.shape[0]
sampling_rate = 30_000.0
time_ms = (np.arange(num_samples) - num_samples // 2) / sampling_rate * 1000
# Plot
fig, ax = plt.subplots(figsize=(7, 3))
ax.plot(time_ms, trace, color="black", linewidth=1.0)
ax.fill_between(time_ms, 0, trace, where=(trace > 0), color="orange", alpha=0.8)
ax.fill_between(time_ms, 0, trace, where=(trace < 0), color="slateblue", alpha=0.8)
ax.axhline(0, color="gray", linestyle="--", linewidth=0.5)
ax.set_xlabel("Time (ms)")
ax.set_ylabel("Amplitude (uV)")
plt.title("Average waveform of a spiking unit")
plt.show()
If you are interested in learning more about this experiment, check out the International Brain Lab: The Brain-Wide Map website.
Alongside the recordings of neural activity, a video captures the animal performing the simple task of turning a wheel!
Various points on the body (such as hands and eyes) are then tracked using a technique known as 'pose estimation':
video_directory = f"{subject_id}/{subject_id}_{session_id}_ecephys+image"
video_path = f"{video_directory}/{subject_id}_{session_id}_VideoLeftCamera.mp4"
video_asset = ecephys_dandiset.get_asset_by_path(path=video_path)
video_s3_url = video_asset.get_content_url(follow_redirects=1, strip_query=True)
cap = cv2.VideoCapture(video_s3_url)
cap.set(cv2.CAP_PROP_POS_FRAMES, 100000)
_, frame = cap.read()
frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
pose_estimation_module = ecephys_nwbfile.processing["pose_estimation"]
left_camera_pose_estimation = pose_estimation_module["LeftCamera"]
pose_estimation_series = left_camera_pose_estimation.pose_estimation_series
plt.figure(figsize=(7, 6))
plt.imshow(frame_rgb)
colors = plt.cm.tab20(np.linspace(0, 1, len(pose_estimation_series)))
for color, keypoint in zip(colors, pose_estimation_series):
    series = pose_estimation_series[keypoint]
    first_frame_point = series.data[0, :]
    plt.scatter(
        first_frame_point[0],
        first_frame_point[1],
        label=series.name.replace("PoseEstimationSeries", ""),
        color=color,
        s=40,
        zorder=5,
    )
plt.legend(loc="lower right", fontsize=7, framealpha=0.8, ncol=2)
plt.axis("off")
plt.show()
Q: What is one scientific question that has been answered using these data?¶
Focusing on the optophysiology example used above - the Visual Coding project by the Allen Institute - one question that was addressed involves characterizing population-level response characteristics across visual cortex.
Multiple stimuli (natural scenes, drifting gratings, static gratings) were presented to each subject over the course of the experiment. Different structures within the visual cortex were targeted across subjects. The neural responses during each presentation were then characterized to show differing response properties across visual areas. This demonstrated that different cortical layers have distinct response properties and tuning characteristics. The experiments also quantified how correlated activity between neurons affects information coding by showing that noise correlations are stronger between neurons with similar tuning properties. Additional findings demonstrate that correlations are modulated by behavioral state (running vs. stationary periods).
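As a concrete illustration of the noise-correlation measure mentioned above, here is a toy sketch on synthetic trial data (not the paper's analysis): shared trial-to-trial variability produces correlated residuals after subtracting each stimulus's mean response.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_stimuli, n_repeats = 10, 50

# Two synthetic units share the same tuning and part of their trial-to-trial noise
tuning = rng.normal(loc=5, scale=2, size=(n_stimuli, 1))
shared_noise = rng.normal(scale=1.0, size=(n_stimuli, n_repeats))
unit_a = tuning + shared_noise + rng.normal(scale=0.5, size=(n_stimuli, n_repeats))
unit_b = tuning + shared_noise + rng.normal(scale=0.5, size=(n_stimuli, n_repeats))

# Noise correlation: correlate the residuals left after removing stimulus-evoked means
resid_a = (unit_a - unit_a.mean(axis=1, keepdims=True)).ravel()
resid_b = (unit_b - unit_b.mean(axis=1, keepdims=True)).ravel()
noise_correlation = np.corrcoef(resid_a, resid_b)[0, 1]
print(noise_correlation)  # high, since most trial-to-trial noise is shared
```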
A full reproducible analysis of this work can be found through its more detailed tutorial notebook.
Q: What is one unanswered question that you think could be answered using these data?¶
One such question might be: how do different visual cortical areas (V1, LM, AL, PM) coordinate their activity over time during naturalistic scene viewing, and can we identify temporal "routing" patterns that predict behavioral state transitions?
While the Visual Coding datasets have characterized individual area responses, differences across cell types, and other correlations, the temporal dynamics of information flow between areas during natural scene processing remain less explored - particularly how running vs. stationary states modulate inter-area communication.
This dataset is just one small part of the greater Allen Brain Observatory effort, which has seen considerable reuse based on questions like these. You can read more about this project in the following publication:
de Vries, S. E., Siegle, J. H., & Koch, C. (2023). Sharing neurophysiology data from the Allen Brain Observatory. Elife, 12, e85550. DOI: https://doi.org/10.7554/eLife.85550
It is worth mentioning in this context that the NWB group hosts a regular NeuroDataReHack event where researchers are brought together to work precisely on such questions of how to analyze existing datasets in novel ways, rather than running entirely new experiments. Check the NWB Events page and sign up for the newsletter to stay informed about these kinds of events!