My name is Carlos Monsivais, and I am currently working as a Data Scientist at Accenture. I earned my Master's degree in Data Science and Engineering from the University of California, San Diego. I am passionate about data science and specialize in big data analytics, creating end-to-end data science processes that cover data extraction, data validation, feature engineering, modeling, and ML Ops monitoring. To accomplish this, I use Python or PySpark and apply machine learning models in TensorFlow or Scikit-Learn. I have experience running these processes in cloud environments on AWS and GCP, spinning up clusters or compute engines and executing pipeline steps in Airflow DAGs or on a scheduler. For visualization, I have experience using Tableau to create streaming analytics in Python through TabPy, as well as Power BI. I am always learning new methods and techniques within the data science field and look forward to continuing to solve interesting business problems through the power of data!
● Produced fraud detection models using machine learning to detect fraudulent claims for Aetna CVS Healthcare by applying rule-based logic and XGBoost classifiers with 98% accuracy, saving approximately $20,000,000 per year in three markets (see the classifier sketch after this list).
● Achieved industry-standard documentation by creating ML Ops processes for Aetna CVS Healthcare for ongoing model analysis, with thresholds that signal when a model needs to be retrained.
● Built regression models combining non-parametric statistical methods and linear regression to recommend the top 10 locations for placing 5 new Rolex training facilities and stores within the United States. This automated process saved the team 60 days of delay time by removing the need to consult 12 different department subject matter experts.
● Saved 8 hours per week by building efficient data pipelines in which API requests from 3 different data sources were automated with AWS Lambda functions and a rule-based system to clean, merge, and place data in the correct S3 and Redshift locations (see the Lambda sketch after this list).
● Established data validation processes for streaming data captured through APIs, using schema checks and data transformations in TensorFlow Data Validation, Python, and PySpark to ensure good data quality (see the validation sketch after this list).
● Automated client PowerPoint analytics using Python, generating Plotly graphs and text that were embedded in client slide decks to show the results of the Accenture AI service, saving about 40 hours per month.
● Created an outlier detection system using TensorFlow Extended tools, TensorFlow models, and BigQuery modeling tools through a Vertex AI pipeline for products in the fraud and home price index spaces. This established an end-to-end data science process covering data extraction, modeling, and metadata processing, saving 10 hours per week between the engineering and modeling teams.
● Established ML Ops documentation and processes by saving models in GCS buckets and determining the best way to document outliers for the outlier detection system, including how to analyze metadata in TensorFlow Extended by reviewing model changes and data quality changes for each data extract.
● Recognized as a top-10 finalist in the “Innovation Challenge” for proposing a solution to bring more accurate data to affected zip codes across more states by using clustering techniques to separate high-quality from low-quality data.
● Saved 16 hours of monthly analyst time by automating manual calculations in the S&P CoreLogic Case-Shiller Home Price Indices, which track changes in U.S. residential real estate prices.
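A minimal sketch of the rule-plus-classifier approach described in the fraud detection bullet, assuming a tabular claims extract with a binary fraud label; the file path, feature columns, and rule thresholds are hypothetical placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Hypothetical claims extract with engineered features and a binary fraud label
claims = pd.read_csv("claims_features.csv")
X = claims.drop(columns=["is_fraud"])
y = claims["is_fraud"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Rule-based pre-filter: flag claims that violate simple business rules outright
rule_flags = (X_test["claim_amount"] > 50_000) & (X_test["member_tenure_days"] < 30)

# XGBoost classifier handles the remaining, harder cases
model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)

ml_flags = model.predict(X_test).astype(bool)
combined = (ml_flags | rule_flags.to_numpy()).astype(int)
print("Accuracy:", accuracy_score(y_test, combined))
```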
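A minimal sketch of the Lambda-based ingestion described in the pipeline bullet, assuming a JSON API and an S3 landing bucket; the URL, bucket name, routing rule, and key layout are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Pull the latest payload from one of the source APIs (hypothetical URL)
    with urllib.request.urlopen("https://api.example.com/v1/records") as resp:
        records = json.loads(resp.read())

    # Rule-based routing: each record lands in the S3 prefix for its source type
    for record in records:
        prefix = "claims" if record.get("type") == "claim" else "misc"
        key = f"raw/{prefix}/{datetime.now(timezone.utc):%Y/%m/%d}/{record['id']}.json"
        s3.put_object(Bucket="my-data-bucket", Key=key, Body=json.dumps(record))

    # A downstream COPY job would then load the cleaned files into Redshift
    return {"statusCode": 200, "recordsProcessed": len(records)}
```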
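A minimal sketch of the schema check described in the data validation bullet, using TensorFlow Data Validation; both file paths are hypothetical.

```python
import tensorflow_data_validation as tfdv

# Infer a schema from a trusted baseline extract (hypothetical path)
baseline_stats = tfdv.generate_statistics_from_csv("data/baseline_extract.csv")
schema = tfdv.infer_schema(baseline_stats)

# Validate each new extract against the baseline schema
new_stats = tfdv.generate_statistics_from_csv("data/new_extract.csv")
anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)

# Missing columns, type changes, and out-of-range values surface as anomalies
tfdv.display_anomalies(anomalies)
```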
Python→ Pandas, Numpy, Matplotlib, Plotly, TensorFlow, TensorFlow Extended, Scikit-Learn, PyTorch, Scrapy, Dash
PySpark
RStudio
Bash
Git
Google Cloud Platform (GCP)→ BigQuery, Dataproc, Kubernetes, Cloud Composer, Vertex AI, AutoML, Dataprep, Google Cloud SDK, GCS
Amazon Web Services (AWS)→ EMR, Lambda, Amazon SageMaker, EC2, Redshift, S3
SQL→ ML Model Queries, Query Optimization
PostgreSQL
Neo4j Graph Database
Tableau→ TabPy, Calculated Field
Power BI
Microsoft→ Excel, Word and PowerPoint
(June 2022 - Present)
Taking stock data from the S&P 500 index and analyzing it in PySpark on a cluster spun up in GCP through Dataproc. Using big data techniques in PySpark, I am applying machine learning models from MLlib and TensorFlow to select an optimal subset of stocks while still maintaining a diverse portfolio. A minimal sketch of this workflow is shown below.
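A minimal sketch of the Dataproc/PySpark workflow, assuming daily S&P 500 price data landed in GCS; the bucket path, column names, and cluster count are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# On Dataproc the SparkSession is already configured for the cluster
spark = SparkSession.builder.appName("sp500-analysis").getOrCreate()

# Hypothetical GCS path to daily S&P 500 price data
prices = spark.read.csv("gs://my-bucket/sp500/daily_prices.csv",
                        header=True, inferSchema=True)

# Per-ticker return and volatility features
features = (
    prices.withColumn("daily_return", (F.col("close") - F.col("open")) / F.col("open"))
          .groupBy("ticker")
          .agg(F.avg("daily_return").alias("mean_return"),
               F.stddev("daily_return").alias("volatility"))
)

# Cluster tickers so the final portfolio can be diversified across risk/return profiles
assembler = VectorAssembler(inputCols=["mean_return", "volatility"], outputCol="features")
vectors = assembler.transform(features)
clustered = KMeans(k=5, seed=42).fit(vectors).transform(vectors)
clustered.show()
```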
(September 2020 - March 2021)
In order to compare the two greatest soccer players of all time, I used my data analysis skills to create a dashboard using Python's Dash framework. It shows which of the two players has the edge in a head-to-head comparison of goals and effectiveness in crucial situations, and includes a time series analysis forecasting career goals scored. A minimal Dash layout in the style of this dashboard is sketched below.
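A minimal Dash layout in the spirit of the head-to-head dashboard; the data frame here uses illustrative placeholder numbers, not the real scraped statistics.

```python
import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html

# Placeholder goal totals per season (illustrative values only)
goals = pd.DataFrame({
    "season": ["Season 1", "Season 2", "Season 3"] * 2,
    "player": ["Messi"] * 3 + ["Ronaldo"] * 3,
    "goals": [40, 50, 30, 45, 35, 25],
})

app = Dash(__name__)
app.layout = html.Div([
    html.H1("Messi vs Ronaldo: Goals by Season"),
    dcc.Graph(figure=px.bar(goals, x="season", y="goals",
                            color="player", barmode="group")),
])

if __name__ == "__main__":
    app.run(debug=True)
```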
(June 2020 - September 2020)
This dashboard, created in Python's Dash framework, analyzes the FIFA 2020 video game mode called Ultimate Team. This game mode contains thousands of soccer players with individual statistics and daily prices. In the dashboard I analyzed player prices, ratings, and the player market, and gave a price prediction using a regression, sketched below.
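A minimal sketch of the price-prediction regression, assuming a flat table of player ratings, attributes, and observed market prices; the file path and column names are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical FIFA Ultimate Team table: ratings, attributes, and market price
players = pd.read_csv("fut_players.csv")
X = players[["overall_rating", "pace", "shooting", "passing"]]
y = players["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Simple linear regression as a baseline price predictor
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out players:", model.score(X_test, y_test))
```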
(August 2020 - September 2020)
Used Python's Scrapy to extract data from FBREF, where I scraped every player statistic for Lionel Messi and Cristiano Ronaldo using ethical web scraping practices. This was the first step in formatting and cleaning the data for my Messi vs Ronaldo Dashboard. A minimal spider in this style is sketched below.
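A minimal Scrapy spider in this style; the start URL, CSS selectors, and settings are hypothetical placeholders, with throttling and robots.txt compliance standing in for the ethical scraping practices mentioned above.

```python
import scrapy

class PlayerStatsSpider(scrapy.Spider):
    name = "player_stats"
    # Hypothetical listing page; the real spider pointed at specific FBREF player pages
    start_urls = ["https://fbref.com/en/players/"]

    # Throttle requests and obey robots.txt as part of ethical scraping
    custom_settings = {"DOWNLOAD_DELAY": 2, "ROBOTSTXT_OBEY": True}

    def parse(self, response):
        # Placeholder selectors for a per-season statistics table
        for row in response.css("table.stats_table tbody tr"):
            yield {
                "season": row.css("th::text").get(),
                "goals": row.css("td[data-stat='goals']::text").get(),
            }
```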
(April 2020 - May 2020)
Used Python's Scrapy to extract data from FUTWIZ, where I scraped every player in the game along with their attributes and pricing history using ethical web scraping practices. This was the first step in formatting and cleaning the data for my FIFA Ultimate Team 2020 Dashboard.
Please feel free to contact me regarding any opportunities or questions about my projects.
San Diego, California