Skip to content

merging data across tables

In this example, we will use pandas to merge data from two blocks in a RELION particle STAR file particles.star.

In RELION (>=3.1) particle STAR files contain an optics table and a particles table. Data about optics common to many particles is stored in the optics table.

particles.star
data_optics

loop_
_rlnOpticsGroup #1
_rlnOpticsGroupName #2
_rlnSphericalAberration #3
_rlnVoltage #4
_rlnImagePixelSize #5
_rlnImageSize #6
_rlnImageDimensionality #7
1   opticsGroup1    2.700000    300.000000  3.300000    160 3

data_particles

loop_
_rlnCoordinateX #1
_rlnCoordinateY #2
_rlnCoordinateZ #3
_rlnAngleRot #4
_rlnAngleTilt #5
_rlnAnglePsi #6
_rlnImageName #7
_rlnCtfImage #8
_rlnRandomSubset #9
_rlnPixelSize #10
_rlnMicrographName #11
_rlnOpticsGroup #12
_rlnGroupNumber #13
_rlnOriginXAngst #14
_rlnOriginYAngst #15
_rlnOriginZAngst #16
_rlnClassNumber #17
_rlnNormCorrection #18
_rlnLogLikeliContribution #19
_rlnMaxValueProbDistribution #20
_rlnNrOfSignificantSamples #21
880.274000  895.057900  1107.245000 131.246000  106.820800  45.228430   subtomo_05Feb21/TS.mrc/TS.mrc_22redo_combined_0000000_3.30A.mrc subtomo_05Feb21/TS.mrc/TS.mrc_22redo_combined_0000000_ctf_3.30A.mrc 2   3.300000    TS.mrc.tomostar 1   1   0.000000    0.000000    0.000000    6   1.000000    4150981.000000  1.000000    1
973.165500  958.641800  1115.101000 87.247110   109.236500  45.323810   subtomo_05Feb21/TS.mrc/TS.mrc_22redo_combined_0000001_3.30A.mrc subtomo_05Feb21/TS.mrc/TS.mrc_22redo_combined_0000001_ctf_3.30A.mrc 1   3.300000    TS.mrc.tomostar 1   1   0.000000    0.000000    0.000000    4   1.000000    4150837.000000  1.000000    1
955.949400  960.465100  629.459100  9.331370    99.309980   11.024810   subtomo_05Feb21/TS.mrc/TS.mrc_22redo_combined_0000002_3.30A.mrc subtomo_05Feb21/TS.mrc/TS.mrc_22redo_combined_0000002_ctf_3.30A.mrc 1   3.300000    TS.mrc.tomostar 1   1   0.000000    0.000000    0.000000    3   1.000000    4150650.000000  1.000000    1
1175.177000 1135.731000 882.605200  -147.261200 81.605380   77.325710   subtomo_05Feb21/TS.mrc/TS.mrc_22redo_combined_0000003_3.30A.mrc subtomo_05Feb21/TS.mrc/TS.mrc_22redo_combined_0000003_ctf_3.30A.mrc 2   3.300000    TS.mrc.tomostar 1   1   0.000000    0.000000    0.000000    1   1.000000    4151420.000000  1.000000    1
1190.658000 1122.102000 1073.642000 40.404590   98.061770   42.563060   subtomo_05Feb21/TS.mrc/TS.mrc_22redo_combined_0000004_3.30A.mrc subtomo_05Feb21/TS.mrc/TS.mrc_22redo_combined_0000004_ctf_3.30A.mrc 1   3.300000    TS.mrc.tomostar 1   1   0.000000    0.000000    0.000000    6   1.000000    4150168.000000  1.000000    1

The column rlnOpticsGroup is present in both tables.

Merging data from multiple dataframes is an example of a join in relational algebra. In pandas, this is implemented as DataFrame.merge().

import starfile

star = starfile.read('particles.star')
df = star['particles'].merge(star['optics'], on='rlnOpticsGroup')

The resulting dataframe contains columns with data from both the particles and optics dataframes.

df.head()
   rlnCoordinateX  rlnCoordinateY  rlnCoordinateZ  rlnAngleRot  rlnAngleTilt  ...  rlnSphericalAberration rlnVoltage rlnImagePixelSize  rlnImageSize  rlnImageDimensionality
0        880.2740        895.0579       1107.2450    131.24600     106.82080  ...                     2.7      300.0               3.3           160                       3
1        973.1655        958.6418       1115.1010     87.24711     109.23650  ...                     2.7      300.0               3.3           160                       3
2        955.9494        960.4651        629.4591      9.33137      99.30998  ...                     2.7      300.0               3.3           160                       3
3       1175.1770       1135.7310        882.6052   -147.26120      81.60538  ...                     2.7      300.0               3.3           160                       3
4       1190.6580       1122.1020       1073.6420     40.40459      98.06177  ...                     2.7      300.0               3.3           160                       3

[5 rows x 27 columns]

This table has the following properties - every column is a variable - every row is an observation - every cell contains a single value

Data in this form is sometimes referred to as tidy data. Tidy data is easier to manipulate.