There is no reason to choose: you can use both, depending on your needs or the processing stage. For example:
All of the steps below can be done in either language, but the effort and level of support differ from case to case.
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt  # roughly ~ base R plotting functionality
import pandas as pd  # roughly ~ base R & tidyr functionality
import seaborn as sns  # roughly ~ ggplot2 functionality
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings
warnings.filterwarnings('ignore')
#to make the plots appear inline, and saved in notebook:
%matplotlib inline
sns.set_context("talk") # seaborn function to make plots according to purpose (talk, paper, poster, notebook)
# We'll show people what versions we use
import datetime
now = datetime.datetime.now().isoformat()
print('Ran on ' + now)
import IPython
print(IPython.sys_info())
!pip freeze | grep -E 'seaborn|matplotlib|pandas|statsmodels'

Multiple great tutorials already exist on how to handle, clean, and shape data into the right format(s) for your visualizations and analyses in Python. I was helped by the 10 minutes to pandas tutorial and this brief overview, I still regularly need to check this cheat sheet, and of course Google/Stack Overflow gives you access to the very bright and helpful Python data science community.
If you want a more thorough treatment of pandas, check this book on Python for data analysis or this Python Data Science Handbook.
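As a tiny taste of the kind of operations those resources cover (the data frame and column names below are made up purely for illustration):
toy = pd.DataFrame({'subject': [1, 1, 2, 2],
                    'rt': [0.43, 0.51, None, 0.62],
                    'acc': [1, 0, 1, 1]})
print(toy.dtypes)                        # inspect column types
toy = toy.dropna(subset=['rt'])          # drop rows with a missing RT
print(toy.groupby('subject').rt.mean())  # mean RT per subject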
Rather than rehash what's in those works, we will learn by example, using data from a Navon experiment with hierarchical letters that probes the precedence of, and interference from, global processing. For more background and a description of the experiment, see Chamberlain et al. (2017):
"The Navon task was a modified version of a hierarchical letter paradigm (Navon, 1977), designed to reduce the potential influence of factors confounding global processing such as spatial location and variation in shape characteristics. Participants were required to distinguish whether a global letter shape made up of local letters or the local letters themselves were vowels or consonants (Fig. 1). Vowel and consonant letter shapes were kept as comparable as possible. There were 5 consonant types (C, D, F, H, T) and 5 vowel types (A, E, I, O, U). Trial congruency was defined by the category type (vowel/consonant). In some congruent trials the exact letter identity matched between local and global stimulus levels, whilst in all other congruent trials only the category type matched. Presentation location of the test stimulus was randomized on a trial-by-trial basis, in order to eliminate the ability of participants to fixate on local spatial locations to determine global shape. The stimulus was presented in one of four corners of a 100x100-pixel square around central fixation. There were 10 practice trials followed by two blocks of 100 experimental trials. In alternate blocks whose order was randomized, participants were instructed to either focus on the global letter shape (global selective attention) or on the local letter shapes (local selective attention) and press the ‘J’ key if the letter was a vowel, and the ‘F’ key if the letter was a consonant. Each trial began with a fixation cross presented for 1000 ms. The fixation cross then disappeared and was followed by the experimental stimulus (a white letter shape on a black background). The stimuli were presented for 300 ms followed by a 4 s response window. Feedback was presented in the form of a coloured (red/green) fixation cross which also encouraged central fixation for the next trial. Both accuracy and reaction time(s) were recorded. Stimulus presentation and data collection were controlled using the Psychopy package (Peirce, 2007) and stimuli were created using the MATLAB toolbox GERT (v1.20) (Demeyer & Machilsen, 2012)."

!pwd
!ls data/ #we can use shell commands to see where the data is (ls=dir on windows)
import glob, os  # to work with paths
df = pd.DataFrame()
folders = ['CO10', 'CO12']  # the raw data is in two different folders
for folder in folders:
    if folder == 'CO10':
        sep = ';'  # data in the two folders use different field separators
    else:
        sep = ','
    all_files = glob.glob(os.path.join('data', folder, "*.csv"))  # list of individual data files
    df_from_each_file = (pd.read_csv(f, sep=sep, index_col=0) for f in all_files)
    concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
    df = pd.concat([df, concatenated_df], ignore_index=True)  # append this folder's data to the full frame
df.head()
# add demographics
all_files = glob.glob(os.path.join('data', "*.csv"))
df_from_each_file = (pd.read_csv(f, sep=sep, index_col=0) for f in all_files)
dfdemo = (pd.concat(df_from_each_file, ignore_index=True, axis=0)
          .drop_duplicates(['Subject'], keep='first', inplace=False)  # drop duplicate rows per subject
          .sort_values('Subject')
          )
dfdemo.head()
Note the method chaining in the example above. Most pandas methods return a DataFrame, so another pandas method can be applied directly to the result with the dot notation. Using the pandas-ply library you can also get dplyr-style piping in pandas. Depending on how much you value code readability, performance, or ease of debugging, you'll prefer one style or the other.
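For comparison, here is a rough sketch of what dplyr-style piping could look like with pandas-ply. The install_ply helper and the symbolic X are taken from the pandas-ply README; treat this as an illustration and check the library's documentation for the exact API.
from pandas_ply import install_ply, X
install_ply(pd)  # adds .ply_where() / .ply_select() methods to pandas objects
(dfdemo
   .ply_where(X.age > 20)                 # dplyr::filter-style row selection (cutoff chosen arbitrarily)
   .ply_select(Subject=X.Subject,         # dplyr::mutate/select-style column construction
               age_centered=X.age - X.age.mean()))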
print(dfdemo.age.value_counts())
print(df.groupby(['participant']).age.apply(lambda x: x.iloc[0]).value_counts())
# age column is the same between dfs so we can drop it in the original data before merging
df = df.drop(['age'], axis=1)
df = pd.merge(df, dfdemo, how='left', left_on='participant', right_on='Subject')
df.head()
sns.distplot(df.age.dropna(), kde=False);
print('Variables:\n', df.columns)
print('Data types:\n', df.dtypes)
print('number of participants:', df['participant'].nunique())
print('trials per participant:', len(df) / df['participant'].nunique())  # not exact
# rename a variable/column
df = df.rename(columns={'acc_raw': 'acc', 'rt_raw': 'rt'})
# properly format RTs: replace comma decimal separators with dots and cast to float
df['rt'] = df['rt'].apply(lambda x: str(x).replace(',', '.')).astype('float64')
# mark impossible (negative) RTs as missing
df.loc[df.rt < 0, ['rt']] = np.nan
# we have 2 gender columns now, how do they compare?
df.gender_x.value_counts()
# harmonize the coding in the gender column: 'vrouw' (Dutch for 'woman') -> 'F'
df.loc[df['gender_x'] == 'vrouw', 'gender_x'] = 'F'
df.groupby('gender_x')['participant'].nunique()
# check gender stats
df.groupby('gender_y')['participant'].nunique()
# count missing values
df.isnull().sum()
# Save merged data
df.to_csv('dfmerged.csv')
# Read the merged data back in
df = pd.read_csv('dfmerged.csv', index_col=0)
A common operation is split-apply-combine: split the data into groups with groupby(), apply a function to each group, and combine the results back together. The apply step can be about aggregation, transformation, or filtering.
It helps to start with the result you want and work backwards from there. If you want a single value for each group, use aggregate() or apply() (or one of the shortcuts). If you want a subset of the original rows, use filter(). And if you want a new value for each original row, use transform(). Check this page for some more advanced groupby recipes. Note that apply() is the most flexible, because it can implement an aggregate or a transform, and anything in between. Examples follow...
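To make the distinction concrete, here is a toy illustration of the three flavours (not the Navon data; the frame below exists only for this example):
tiny = pd.DataFrame({'grp': ['a', 'a', 'b', 'b'], 'val': [1, 2, 3, 4]})
tiny.groupby('grp').val.agg('mean')                     # aggregate: one value per group
tiny.groupby('grp').val.transform('mean')               # transform: one value per original row
tiny.groupby('grp').filter(lambda g: g.val.mean() > 2)  # filter: keep or drop whole groups of rows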
df.groupby(['cond', 'congruence']).acc.mean() #.reset_index()
Insert a column with the RT standardized within each participant (using transform):
zscore = lambda x: (x - x.mean()) / x.std()
df.insert(17, 'Zrt', df.groupby(['participant'])['rt'].transform(zscore))
# proportion of incorrect trials
1-df.acc.mean()
sns.distplot(df.groupby(['participant']).acc.mean(), bins=20);
?sns.distplot
Or just press Shift-Tab when the cursor is on a method or module to view its documentation. Note that you can find all JupyterLab commands in the left panel, under the Commands tab. You can also open a console for the notebook (via the File menu) to quickly try out code before entering it in a cell.
df = df.groupby('participant').filter(
    lambda x: x['acc'].mean() > df.acc.mean() - (2.5 * df.groupby(['participant']).acc.mean().std()))
print(len(df.participant.unique()))
sns.distplot(df.groupby(['participant']).acc.mean(), bins=20);
# discard incorrect trials for RT analyses
dfrt = df[df.acc==1]
sns.distplot(dfrt.rt);
dfrt = dfrt.groupby('participant').filter(
    lambda x: x['rt'].mean() < df.rt.mean() + (2.5 * df.groupby(['participant']).rt.mean().std()))
print(len(dfrt.participant.unique()))
sns.distplot(dfrt.rt);
len(dfrt)
# keep only trials faster than the participant's mean RT + 2.5 SD
dfrt = dfrt[dfrt.rt < dfrt.groupby('participant').rt.transform('mean')
            + 2.5 * dfrt.groupby('participant').rt.transform('std')]
len(dfrt)
sns.distplot(dfrt.rt);
fig, ax = plt.subplots()
conditions = ['congruent', 'incongruent']
for condition in conditions:
    condition_data = dfrt[(dfrt['congruence'] == condition) & (dfrt['cond'] == 'selAttGlob')]['rt']
    sns.kdeplot(condition_data, shade=True, label=condition)
sns.despine()
More ways to explore RT distributions.
# summary statistics of RT per condition
dfrt.groupby(['condition','cond','congruence']).rt.describe()
g = sns.factorplot(x="condition", y="rt", hue="congruence",
                   col="cond", units="participant", data=dfrt, kind="point");
g.set(xlabel="Condition", yticklabels=np.linspace(0, .6, 6));
# we could define a function for saving plots for use outside of the notebook (e.g. in a paper)
def save_fig(fig, figname):
    if not os.path.exists('figs'):
        os.makedirs('figs')
    fig.savefig("figs/%s.pdf" % figname, dpi=300)  # high dpi for printing
    fig.savefig("figs/%s.png" % figname, dpi=300)
    fig.savefig("figs/%s.svg" % figname, dpi=300)
dfsum = dfrt.groupby(['participant', 'congruence', 'cond']).rt.mean().reset_index()
g = sns.boxplot(x="cond", y="rt", hue="congruence", data=dfsum);
g.legend(loc="upper left");
save_fig(g.get_figure(),'box')
def interference(group):
    return group[group.congruence == 'incongruent'].rt.mean() - group[group.congruence == 'congruent'].rt.mean()
dfIF = dfrt.groupby(['participant', 'cond']).apply(interference).reset_index()
dfIF = dfIF.rename(columns={0: 'interference'})
dfIF.head()
sns.swarmplot(x="cond", y="interference", data=dfIF);
sns.set_context("poster", font_scale=1.3)
sns.set_palette('deep') # options: deep, muted, bright, pastel, dark, colorblind
fig, ax = plt.subplots(figsize=(10, 6))
conditions = ['selAttGlob', 'selAttLoc']
for i, condition in enumerate(conditions):
    condition_data = dfIF[dfIF['cond'] == condition]['interference']
    sns.distplot(condition_data, label=condition);
    ax.axvline(x=condition_data.mean(), linestyle='--', color=sns.color_palette()[i])
# embellish plot
sns.despine()
ax.set_title('Navon interference on reaction times')
ax.set_xlabel("Interference")
# Improve the legend
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, ['local interference', 'global interference'], loc="best");
save_fig(fig, "interference")
For more customization options, see the seaborn tutorials. Even more flexibility is provided by the matplotlib core that seaborn is built on (a cheat sheet exists for that as well).
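As an example of dropping down to matplotlib, here is a sketch with arbitrary settings, just to show the kind of fine control you get over an existing axes object (the limits and tick format below are illustrative choices, not recommendations):
import matplotlib.ticker as mticker
fig, ax = plt.subplots(figsize=(6, 4))
sns.kdeplot(dfrt.rt, shade=True, ax=ax)
ax.set_xlim(0, 2)                                                   # arbitrary axis limits
ax.xaxis.set_major_formatter(mticker.FormatStrFormatter('%.1f s'))  # format ticks as seconds
ax.set_xlabel('Reaction time')
sns.despine()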
from statsmodels.stats.weightstats import ztest
out = ztest(dfIF[dfIF.cond == 'selAttLoc']['interference'], value=0)
print("z-test for global interference different from zero:\nz =", round(out[0], 2), "p =", round(out[1], 4))
out = ztest(dfIF[dfIF.cond == 'selAttGlob']['interference'], value=0)
print("z-test for local interference different from zero:\nz =", round(out[0], 2), "p =", round(out[1], 4))
Luckily these conclusions are consistent with those reported in the paper...
md = smf.mixedlm("rt ~ congruence * cond", dfrt, groups=dfrt["participant"])
mdf = md.fit()
print(mdf.summary())
sns.factorplot(x="condition", y="acc", hue="congruence",
               col="cond", units="participant", data=df, kind="point");
Incidentally, it is possible to use LaTeX in your markdown cells; for example, drop that Bayes!
$ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} $
Just kidding, I'm not going to show you how to do a Bayesian analysis on these data...
Statsmodels has an alternative method to deal with clustered data: Generalized Estimating Equations (GEE). GEEs have a few attractive advantages over hierarchical linear models ("mixed models"), for example fewer assumptions about the random effects and a more intuitive interpretation of the coefficients. Especially for discrete dependent variables (e.g., our binary accuracy data), likelihood-based mixed models can show convergence difficulties and a lack of robustness to misspecification of the covariance structure. See also: McNeish, D., Stapleton, L. M., & Silverman, R. D. (2016). On the Unnecessary Ubiquity of Hierarchical Linear Modeling. Psychological Methods.
# model formulation
fml = "acc ~ cond * congruence"
# covariance structure
ex = sm.cov_struct.Exchangeable()
# link function
fa = sm.families.Binomial(sm.families.links.logit())
model = sm.GEE.from_formula(fml, "participant", df, cov_struct=ex, family=fa)
result = model.fit()
print(result.summary())
print(result.cov_struct.summary())
import rpy2.rinterface
%reload_ext rpy2.ipython
%R -n require(lme4);
%R -n require(tidyr); require(ggplot2)
A killer feature of Jupyter notebooks is magic commands.
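Besides the R magics used below, a few line magics worth knowing (all built into IPython; shown here only as illustration):
%lsmagic                                # list all available magic commands
%timeit np.sort(np.random.rand(1000))   # quick micro-benchmark of a statement
%whos DataFrame                         # list the DataFrame objects currently in memory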
%%R
# ^ Tells the notebook that this code cell is actually R
# Now we can write R code as if this were an R notebook
X_vals <- c(1, 2, 3, 4, 5)
y_vals <- c(1, 2, 4, 8, 16)
plot(X_vals, y_vals,
     col='purple', pch=12,
     main='R plot in Python')
np.random.seed(42)
# Make a pandas DataFrame
testdf = pd.DataFrame(np.random.normal(0,1,size=(100, 3)), columns=list('ABC'))
testdf['C'] = testdf['C'] + 2
testdf.head()
%%R -i testdf
testdf %>%
  gather("Category", "X") %>%
  ggplot(aes(x = Category, y = X, fill = Category)) +
    geom_violin() +
    stat_summary(fun.y=mean, color='black', geom='point', size = 3) +
    theme_bw()
dfmelted = pd.melt(testdf, value_vars=["A","B","C"], var_name="category", value_name="X")
print(dfmelted.head())
dfmelted["xbin"] = dfmelted.X > .5
sns.violinplot(x="category", y="X", data=dfmelted, inner="quart");
Note that we used the pandas melt function to go from a dataset in wide form (testdf above) to long form (done in R with the melt or gather functions). There are many more reshaping and pivoting functions in pandas, which we won't cover here. The best way to get a feel for them is to go through this brief visual guide; further details can be found in the pandas documentation.
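For instance, pivot_table takes us in the other direction, from the long form back to a wide, aggregated layout; a minimal sketch using the melted data from above:
# mean of X per xbin/category cell; the categories become columns again
dfmelted.pivot_table(index='xbin', columns='category', values='X', aggfunc='mean')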
from plotnine import *
(ggplot(dfmelted, aes(x = 'category', y = 'X', fill = 'category'))
+ geom_violin()
+ stat_summary(fun_y=np.mean, color='black', geom='point', size = 3)
+ theme_bw())
dfreduc = df.iloc[:5000]  # take a subset of the trials
dfreduc.head()
%%R
lr.test = function(m1, m2, name){
    print(summary(m1))
    print(summary(m2))
    out = anova(m1, m2)
    chi2 = out$Chisq[2]
    dof = out$"Chi Df"[2]
    p = out$"Pr(>Chisq)"[2]
    test_str = "Likelihood ratio test for %s:\n Chisq(%d) = %.2f; p = %.3g"
    writeLines(sprintf(test_str, name, dof, chi2, p))
}
%%R -i dfreduc
m = glmer(acc ~ cond * congruence + (1 | participant), dfreduc, family=binomial)
m.null = glmer(acc ~ cond + congruence + (1 | participant), dfreduc, family=binomial)
lr.test(m, m.null, "interaction effect of selective attention * congruence")
%%R -i dfrt
m = lmer(rt ~ cond * congruence + (1 | participant), dfrt)
m.null = lmer(rt ~ cond + congruence + (1 | participant), dfrt)
lr.test(m, m.null, "interaction effect of selective attention * congruence")
Aside from the examples, this tutorial is basically a mashup of sections of these sources:
https://github.com/ujjwalkarn/DataSciencePython