"Data is the new science. Big data holds the answers.
Are you asking the right questions?"
-- Pat Gelsinger

About Me

Hi, this is Wenjie!

I'm a Data Scientist with previous experience in business development. I am very passionate about processing complex data and using machine learning to tell stories and solve business problems.

Having worked and lived in Paris, Shanghai, Seattle, and San Francisco, I really enjoy working with diverse teams. In June, I graduated from the Master of Science in Data Science (MSDS) program at the University of San Francisco, where I developed strong coding and statistics skills for solving problems with data.

Below, I'd like to share some of my projects.

Click "more" for details.

Featured Projects

Predict Air Quality Index with PySpark

Processed large (>2 GB) datasets with PySpark to predict the Air Quality Index in Los Angeles.

(AWS S3, AWS EMR, PySpark, time series, plotly, h2o.ai, distributed computing)

Get Alerts for Supermarket Delivery Slots During COVID-19

Run the script to find delivery slots for fresh vegetables. Stay safe and healthy.

(JSON, data scraping)

Web Product: HireReady.APP Tailored Interviews

A complete web product that provides tailored interviews in Data Science.

(TF-IDF, AWS Elastic Beanstalk, HTML, Bootstrap, Flask, Jinja, Google Analytics)

Machine Learning

Feature Importance

Implemented feature importance analysis from scratch using PCA, correlation, drop-column importance, and permutation importance.

(PCA, regression, feature selection)
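As a rough illustration of the permutation method mentioned above (not the project's actual code), the sketch below shuffles one feature column at a time and measures how much a model's score drops. The helper names `permutation_importance` and `r2_score`, the toy data, and the least-squares model are all assumptions made for this example.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination, written out to keep the example self-contained."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def permutation_importance(predict, X, y, metric, n_repeats=5, seed=0):
    """Mean drop in `metric` when each column of X is shuffled (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column j breaks its link to y but keeps its distribution.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - metric(y, predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances

# Toy data (assumed): feature 0 matters most, feature 2 not at all.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + data_rng.normal(scale=0.1, size=200)

# A plain least-squares fit stands in for whatever model is being explained.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
predict = lambda A: A @ coef

importances = permutation_importance(predict, X, y, r2_score)
print(importances)
```

Because the model is never refit, this is cheaper than drop-column importance, which retrains once per removed feature.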

K-means Clustering

Implemented k-means from scratch and applied it to binary prediction, image compression, and spectral clustering.

(Python, clustering, image compression)
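For readers unfamiliar with the algorithm, here is a minimal sketch of standard k-means (Lloyd's algorithm) in NumPy. It is not the project's code; the function name `kmeans`, its parameters, and the two-blob toy data are assumptions made for illustration.

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Basic Lloyd's algorithm: returns (centroids, labels). Hypothetical helper."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        new_centroids = centroids.copy()
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[c] = members.mean(axis=0)
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment against the converged centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)

# Toy data (assumed): two well-separated blobs of 50 points each.
data_rng = np.random.default_rng(1)
blob_a = data_rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = data_rng.normal(loc=10.0, scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(X, k=2)
print(centroids)
```

For image compression, the same function could be run on an image reshaped to one row per pixel, with each pixel then replaced by its cluster's centroid color.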

NLP with Quora Question Pairs

Identified Quora question pairs that have the same intent by calculating their "distance" with a GRU deep learning model.

(PyTorch, GRU, natural language processing)

Facial Key Points Detection

Implemented a convolutional neural network (CNN) with data augmentation, using the fast.ai API, for facial key point detection.

(CNN, PyTorch, ResNet, deep learning, fast.ai, image processing)

Grab-A-Cab Strategy

Uber or Lyft: which one is cheaper? Combined cab and weather datasets for Boston to analyze the best cab strategy.

(Python, DataFrame, pandas, ggplot, data visualization, data storytelling)

Classify Buildings’ Damage Level

Multi-class classification of buildings' earthquake damage grades; ranked in the top 10% of the DrivenData competition.

(SVM, TensorFlow, LightGBM, Random Forest, classification)

Stay Connected