



Association of environmental mixtures with human semen quality

(Blog Post 1)

Lu Lu, Yihang Dong, Irina Su, Sunny Deng

Brown University

In recent years, heavy pollution in China has become a serious problem due to rapid industrialization and urbanization, and environmental exposures have become an increasing concern because of their potential risks to human health, including effects on reproductive health. Specifically, we consider phthalate exposure, which is associated with adverse male reproductive health. By examining the relations between 8 phthalate metabolites at environmental levels and semen quality in a Chinese population, we hope that our data analysis will bring more concrete insights into the effect of phthalate exposure on human reproductive health and help raise environmental awareness through the findings.

The dataset we plan to use consists of semen evaluations from 1052 men, collected by the Reproductive Center of Tongji Hospital in Wuhan, China. Each entry contains semen quality parameters and the concentrations of 8 phthalate metabolites in two urine samples.

Before we dive into any analysis, we first clean the data. In the data set, we find that there are three types of errors, including missing values, wrong values, and 0 (which means that the corresponding concentration is below the limit of detection). The procedure of data cleaning is as follows:

As we mentioned in the introduction, we have two measurements for each concentration. In the first step, we use the average of these two measurements adjusted by each creatinine level as our features.
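The averaging step can be sketched in pandas as follows. This is a minimal illustration, not our exact cleaning code: the column names (`MEP_1`, `creatinine_1`, and so on) are hypothetical, the toy values are made up, and we assume here that "adjusted" means dividing each measurement by the creatinine level of the same urine sample before averaging.

```python
import pandas as pd

def creatinine_adjusted_mean(df, metabolite):
    """Average the two creatinine-adjusted measurements of one metabolite.

    Assumes hypothetical columns '<metabolite>_1', '<metabolite>_2',
    'creatinine_1' and 'creatinine_2'.
    """
    adjusted_1 = df[f"{metabolite}_1"] / df["creatinine_1"]
    adjusted_2 = df[f"{metabolite}_2"] / df["creatinine_2"]
    return (adjusted_1 + adjusted_2) / 2

# Toy values for illustration only, not taken from the study.
toy = pd.DataFrame({
    "MEP_1": [10.0, 30.0],
    "MEP_2": [20.0, 10.0],
    "creatinine_1": [1.0, 2.0],
    "creatinine_2": [2.0, 0.5],
})
toy["MEP"] = creatinine_adjusted_mean(toy, "MEP")
```

Adjusting each sample before averaging keeps a diluted urine sample from dragging the feature down; averaging first and adjusting afterwards would be a different choice.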

After cleaning the data, we took a look at the summary statistics. We realized that the variability in the data is high, especially for MMP, MEP, and MBP: the standard deviations for these three metabolites are all larger than 100. MOP is the least variable. Additionally, the percentiles show that every metabolite has a distribution skewed to the right: the mean and the 75th percentile are both much smaller than the maximum value. A normal distribution may therefore not describe our data well. We also suspect that there is some noise in our dataset. Mistakes could have happened when these medical data were recorded, which could explain the outliers at both extremes of the distribution. Therefore, we decided to continue our data processing with a logarithmic transformation and removal of outliers.

To get a sense of the distributions of the input variables and the dependent variable, we perform distribution fitting for all of them. We first tried fitting a Normal distribution to every variable. However, as shown in the plots below, it is not a good fit for the input variables.

The dependent variable, progressive motility, does look normally distributed. Since the input variables mostly show positive skewness, we then fit a Lognormal distribution to each of them, which gave the following result.

graphs fitting Lognormal distributions

The parameters of the fitted Lognormal distributions are noted on each plot. We can see that they are consistent with the basic statistics for each variable that we analyzed above.
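For readers who want to reproduce this kind of fit, here is a minimal NumPy sketch. It relies on the fact that the maximum-likelihood parameters of a Lognormal distribution are the mean and standard deviation of the log-transformed values. The function names are hypothetical, the data are simulated rather than taken from the study, and the sketch assumes all values are strictly positive.

```python
import numpy as np

def fit_lognormal(values):
    """Return (mu, sigma) of a lognormal fitted by maximum likelihood.

    Assumes every value is strictly positive.
    """
    logs = np.log(np.asarray(values, dtype=float))
    mu = logs.mean()
    sigma = logs.std()  # ddof=0 gives the maximum-likelihood estimate
    return mu, sigma

def lognormal_mean(mu, sigma):
    """Mean implied by lognormal parameters: exp(mu + sigma^2 / 2)."""
    return np.exp(mu + sigma**2 / 2)

# Simulated concentrations for illustration only.
rng = np.random.default_rng(0)
toy = rng.lognormal(mean=2.0, sigma=0.8, size=5000)

mu_hat, sigma_hat = fit_lognormal(toy)
implied_mean = lognormal_mean(mu_hat, sigma_hat)
```

Comparing `implied_mean` with the plain sample mean `toy.mean()` is one quick way to check that fitted parameters agree with the summary statistics.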

To summarize our observations, the dependent variable is approximately normally distributed, and each input variable in the dataset is approximately lognormally distributed, which implies that the log of each input variable is approximately normally distributed. We therefore transform all input data logarithmically to obtain symmetric distributions. The transformed data are indeed more symmetric, as shown below.

distributions of the logarithmically transformed data

We apply this transformation because symmetric distributions are more desirable when we later apply regression and hypothesis testing, and when we visualize the data.
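The effect of the transformation on symmetry can be checked numerically with the sample skewness, which is close to 0 for a symmetric distribution and positive for a long right tail. The sketch below uses simulated lognormal data, not our dataset, and `sample_skewness` is a hypothetical helper written for this illustration.

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment of the sample (0 for symmetric data)."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5

# Simulated positively skewed data for illustration only.
rng = np.random.default_rng(1)
toy = rng.lognormal(mean=1.0, sigma=1.0, size=5000)

skew_raw = sample_skewness(toy)          # clearly positive
skew_log = sample_skewness(np.log(toy))  # close to zero
```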

After finding that the data are approximately lognormally distributed, we create box-plots for each concentration (cleaned as described in the “Data Cleaning” section). Based on this finding, the data are plotted as their log values.

We first plot the box-plots using the standard reach of the whiskers, which extends 1.5 x IQR (interquartile range) below the first quartile and above the third quartile. However, this reach flags a relatively large number of outliers: around 200 outlier samples among the total 1070 samples (a sample is an outlier sample if at least one of its concentrations is deemed an outlier).

Thus, we loosen the restriction by extending the reach of the whiskers to be 3 x IQR, which reduces the number of the outlier samples to 55. The resulting box-plots are shown below:

box-plots with 3 x IQR as the reach of the whiskers

From the box-plots, we can see that concentrations such as MEP, MMP, and MEHP still have many outliers, but the number has been reduced enough that removing them should not affect our later analysis much. The extended reach of the whiskers is still within a reasonable range, which preserves the persuasiveness of our analysis.

In the end, we remove the outlier samples for the following analysis.
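The outlier rule described above can be sketched with NumPy as follows. This is an illustration rather than our exact code: `outlier_rows` is a hypothetical helper, the toy matrix is made up, and the input is assumed to be already log-transformed, with one row per sample and one column per metabolite.

```python
import numpy as np

def outlier_rows(log_data, reach=3.0):
    """Flag samples with at least one value outside the whiskers.

    A value is outside if it lies below Q1 - reach * IQR or above
    Q3 + reach * IQR, computed separately for each column.
    """
    q1 = np.percentile(log_data, 25, axis=0)
    q3 = np.percentile(log_data, 75, axis=0)
    iqr = q3 - q1
    lower = q1 - reach * iqr
    upper = q3 + reach * iqr
    outside = (log_data < lower) | (log_data > upper)
    return outside.any(axis=1)

# Toy matrix: 9 samples, 2 features, with two planted extreme values.
toy = np.column_stack([np.arange(9.0), np.arange(9.0)])
toy[0, 0] = -40.0  # extreme under both reaches
toy[8, 1] = 15.0   # extreme only under the 1.5 x IQR reach

mask_standard = outlier_rows(toy, reach=1.5)
mask_extended = outlier_rows(toy, reach=3.0)
cleaned = toy[~mask_extended]  # keep only non-outlier samples
```

In Matplotlib, the corresponding whisker reach of a box-plot is controlled by the `whis` argument of `boxplot`, for example `whis=3.0`.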

We also checked the pairwise correlations between all of the features and the label by drawing scatter plots of all the data points. The input data were already log-transformed, as discussed earlier. Here is the graph that we obtained.

graphs for pairwise correlations between concentration data

From the above graph, we can see that there is no significant positive or negative correlation between our label and any of the input features, so it is highly likely that no single feature can determine progressive motility. However, there is evidence of positive correlation between some pairs of input features (e.g. between MEHHP and MEOHP), which suggests that we may reach more concrete conclusions if we apply dimension reduction to the input data.
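Alongside the scatter plots, the same pattern can be summarized as a correlation matrix. The sketch below uses simulated data in which two features share a common component and the label is independent of both; the variable names are hypothetical and the numbers are not from our dataset.

```python
import numpy as np

# Simulated data for illustration only.
rng = np.random.default_rng(2)
n = 500
shared = rng.normal(size=n)
feature_a = shared + 0.3 * rng.normal(size=n)  # correlated with feature_b
feature_b = shared + 0.3 * rng.normal(size=n)
label = rng.normal(size=n)                     # unrelated to both features

# np.corrcoef treats each row as one variable.
corr = np.corrcoef(np.vstack([feature_a, feature_b, label]))

feature_feature = corr[0, 1]  # strongly positive
feature_label = corr[0, 2]    # near zero
```

A large off-diagonal entry between two features, combined with near-zero entries in the label column, is the numeric counterpart of what the scatter plots show.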

Next, we will begin to investigate the relations between inputs and outputs. The methods we will use are:

We would like to thank the developers of Matplotlib, NumPy, pandas, and scikit-learn.
