In the article “Differential testing for machine learning: an analysis for classification algorithms beyond deep learning”, recently published in Empirical Software Engineering, researchers from our group investigated whether different machine learning frameworks can serve as ground truth for testing each other. The idea is to, for example, take the random forest from scikit-learn and compare its results to the implementation in Weka. We found that while the approach has large potential in principle, because many algorithms are implemented in multiple large frameworks, there are only relatively few feasible combinations in which the design choices are the same. Moreover, even for these algorithms, testing did not really work: different frameworks often produce subtly different results, which limits the practical usefulness of this approach for testing such algorithms.
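The pseudo-oracle idea can be sketched in a few lines of plain Python. This is only an illustration, not the paper's setup (the study compares frameworks such as scikit-learn and Weka): here, two standard-library variance functions stand in for two "frameworks" whose design choices differ, which is exactly the kind of mismatch that makes a combination infeasible for differential testing.

```python
import statistics

def differential_test(impl_a, impl_b, inputs, tol=1e-12):
    """Run two implementations on the same inputs and collect disagreements.

    Hypothetical harness for illustration: impl_a acts as the pseudo-oracle
    for impl_b; any output difference beyond the tolerance is flagged.
    """
    disagreements = []
    for x in inputs:
        a, b = impl_a(x), impl_b(x)
        if abs(a - b) > tol:
            disagreements.append((x, a, b))
    return disagreements

# statistics.pvariance divides by n, statistics.variance by n - 1:
# a deliberate design choice, not a bug, yet the harness reports mismatches.
inputs = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0], [0.5, 1.5, 2.5, 3.5]]
mismatches = differential_test(statistics.pvariance, statistics.variance, inputs)
for x, a, b in mismatches:
    print(f"disagreement on {x}: {a} vs {b}")
```

Note that the harness cannot tell an intended design difference from an actual defect; distinguishing the two is precisely the manual analysis the study had to perform for each candidate algorithm pair.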