My name is Carlos Monsivais, and I am currently working as a Data Scientist at Accenture. I earned my Master's degree in Data Science and Engineering from the University of California, San Diego. I am passionate about data science and specialize in big data analytics, creating end-to-end data science processes that cover data extraction, data validation, feature engineering, modeling, and ML Ops monitoring. To accomplish this, I use Python or PySpark and apply machine learning models in TensorFlow or Scikit-Learn. I have experience running these processes in cloud environments on AWS and GCP, spinning up clusters or compute engines and executing pipeline steps in Airflow DAGs or on a scheduler. For visualization, I have experience using Tableau to create streaming analytics in Python through TabPy, as well as Power BI. I am always learning new methods and techniques within the data science field and look forward to continuing to solve interesting business problems through the power of data!
● Produced fraud detection models using machine learning to detect fraudulent claims for Aetna CVS Healthcare by applying rule-based logic and XGBoost classifiers with 98% accuracy, saving approximately $20,000,000 per year in three markets (see the classifier sketch after this list).
● Achieved industry-standard documentation by creating ML Ops processes for Aetna CVS Healthcare for ongoing model analysis, with thresholds that signal when a model needs to be retrained.
● Built regression models combining non-parametric statistical methods and linear regression to recommend the top 10 locations for placing 5 new Rolex training facilities and stores within the United States. This automated process saved the team 60 days of delay time by removing the need to consult 12 different department subject matter experts.
● Saved 8 hours per week by building efficient data pipelines in which API requests from 3 different data sources were automated with AWS Lambda functions and a rule-based system to clean, merge, and place data in the correct S3 and Redshift locations (see the Lambda sketch after this list).
● Established data validation processes for streaming data captured through APIs, using schema checks and data transformations in TensorFlow Data Validation, Python, and PySpark to ensure good data quality (see the validation sketch after this list).
● Automated client PowerPoint analytics using Python, generating Plotly graphs and text that were embedded in client slide decks to show the results of the Accenture AI service, saving about 40 hours per month.
● Created an outlier detection system using TensorFlow Extended tools, TensorFlow models, and BigQuery modeling tools through a Vertex AI pipeline for products in the fraud and home price index spaces. This established an end-to-end data science process covering data extraction, modeling, and metadata processing, saving 10 hours per week between the engineering and modeling teams.
● Established ML Ops documentation and processes by saving models in GCS buckets and determining the best way to document outliers for the outlier detection system, including how to analyze metadata in TensorFlow Extended by reviewing model changes and data quality changes for each data extract.
● Recognized as a top-10 finalist in the “Innovation Challenge” for proposing a solution to bring more accurate data to affected zip codes across more states by using clustering techniques to separate high-quality from low-quality data.
● Saved 16 hours of monthly analyst time by automating manual calculations in the S&P CoreLogic Case-Shiller Home Price Indices, which track changes in U.S. residential real estate prices.
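A minimal sketch of the rule-plus-classifier approach described in the fraud detection bullet, assuming a tabular claims extract with a binary fraud label; the file path, feature columns, and rule thresholds are hypothetical placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Hypothetical claims extract with engineered features and a binary fraud label
claims = pd.read_csv("claims_features.csv")
X = claims.drop(columns=["is_fraud"])
y = claims["is_fraud"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Rule-based pre-filter: flag claims that violate simple business rules outright
rule_flags = (X_test["claim_amount"] > 50_000) & (X_test["member_tenure_days"] < 30)

# XGBoost classifier handles the remaining, harder cases
model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)

ml_flags = model.predict(X_test).astype(bool)
combined = (ml_flags | rule_flags.to_numpy()).astype(int)
print("Accuracy:", accuracy_score(y_test, combined))
```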
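A minimal sketch of the Lambda-based ingestion described in the pipeline bullet, assuming a JSON API and an S3 landing bucket; the URL, bucket name, routing rule, and key layout are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Pull the latest payload from one of the source APIs (hypothetical URL)
    with urllib.request.urlopen("https://api.example.com/v1/records") as resp:
        records = json.loads(resp.read())

    # Rule-based routing: each record lands in the S3 prefix for its source type
    for record in records:
        prefix = "claims" if record.get("type") == "claim" else "misc"
        key = f"raw/{prefix}/{datetime.now(timezone.utc):%Y/%m/%d}/{record['id']}.json"
        s3.put_object(Bucket="my-data-bucket", Key=key, Body=json.dumps(record))

    # A downstream COPY job would then load the cleaned files into Redshift
    return {"statusCode": 200, "recordsProcessed": len(records)}
```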
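A minimal sketch of the schema check described in the data validation bullet, using TensorFlow Data Validation; both file paths are hypothetical.

```python
import tensorflow_data_validation as tfdv

# Infer a schema from a trusted baseline extract (hypothetical path)
baseline_stats = tfdv.generate_statistics_from_csv("data/baseline_extract.csv")
schema = tfdv.infer_schema(baseline_stats)

# Validate each new extract against the baseline schema
new_stats = tfdv.generate_statistics_from_csv("data/new_extract.csv")
anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)

# Missing columns, type changes, and out-of-range values surface as anomalies
tfdv.display_anomalies(anomalies)
```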
Python→ Pandas, Numpy, Matplotlib, Plotly, TensorFlow, TensorFlow Extended, Scikit-Learn, PyTorch, Scrapy, Dash
PySpark
RStudio
Bash
Git
Google Cloud Platform (GCP)→ BigQuery, Dataproc, Kubernetes, Cloud Composer, Vertex AI, AutoML, Dataprep, Google Cloud SDK, GCS
Amazon Web Services (AWS)→ EMR, Lambda, Amazon SageMaker, EC2, Redshift, S3
SQL→ ML Model Queries, Query Optimization
PostgreSQL
Neo4j Graph Database
Tableau→ TabPy, Calculated Field
Power BI
Microsoft→ Excel, Word and PowerPoint
(June 2022 - Present)
Taking stock data from the S&P 500 index and analyzing it in PySpark on a cluster spun up in GCP through Dataproc. Using big data techniques in PySpark, I am applying machine learning models from MLlib and TensorFlow to select an optimal subset of stocks while still maintaining a diverse portfolio. A minimal sketch of this workflow is shown below.
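A minimal sketch of the Dataproc/PySpark workflow, assuming daily S&P 500 price data landed in GCS; the bucket path, column names, and cluster count are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# On Dataproc the SparkSession is already configured for the cluster
spark = SparkSession.builder.appName("sp500-analysis").getOrCreate()

# Hypothetical GCS path to daily S&P 500 price data
prices = spark.read.csv("gs://my-bucket/sp500/daily_prices.csv",
                        header=True, inferSchema=True)

# Per-ticker return and volatility features
features = (
    prices.withColumn("daily_return", (F.col("close") - F.col("open")) / F.col("open"))
          .groupBy("ticker")
          .agg(F.avg("daily_return").alias("mean_return"),
               F.stddev("daily_return").alias("volatility"))
)

# Cluster tickers so the final portfolio can be diversified across risk/return profiles
assembler = VectorAssembler(inputCols=["mean_return", "volatility"], outputCol="features")
vectors = assembler.transform(features)
clustered = KMeans(k=5, seed=42).fit(vectors).transform(vectors)
clustered.show()
```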
(September 2020 - March 2021)
In order to compare the two greatest soccer players of all time, I used my data analysis skills to create a dashboard using Python's Dash framework. It shows which of the two players has the edge in a head-to-head comparison of goals and effectiveness in crucial situations, and includes a time series analysis forecasting career goals scored. A minimal Dash layout in the style of this dashboard is sketched below.
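A minimal Dash layout in the spirit of the head-to-head dashboard; the data frame here uses illustrative placeholder numbers, not the real scraped statistics.

```python
import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html

# Placeholder goal totals per season (illustrative values only)
goals = pd.DataFrame({
    "season": ["Season 1", "Season 2", "Season 3"] * 2,
    "player": ["Messi"] * 3 + ["Ronaldo"] * 3,
    "goals": [40, 50, 30, 45, 35, 25],
})

app = Dash(__name__)
app.layout = html.Div([
    html.H1("Messi vs Ronaldo: Goals by Season"),
    dcc.Graph(figure=px.bar(goals, x="season", y="goals",
                            color="player", barmode="group")),
])

if __name__ == "__main__":
    app.run(debug=True)
```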
(June 2020 - September 2020)
This dashboard, created in Python's Dash framework, analyzes the FIFA 2020 video game mode called Ultimate Team. This game mode contains thousands of soccer players with individual statistics and daily prices. In the dashboard I analyzed player prices, ratings, and the player market, and gave a price prediction using a regression, sketched below.
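A minimal sketch of the price-prediction regression, assuming a flat table of player ratings, attributes, and observed market prices; the file path and column names are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical FIFA Ultimate Team table: ratings, attributes, and market price
players = pd.read_csv("fut_players.csv")
X = players[["overall_rating", "pace", "shooting", "passing"]]
y = players["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Simple linear regression as a baseline price predictor
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out players:", model.score(X_test, y_test))
```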
(August 2020 - September 2020)
Used Python's Scrapy to extract data from FBREF, where I scraped every player statistic for Lionel Messi and Cristiano Ronaldo using ethical web scraping practices. This was the first step in formatting and cleaning the data for my Messi vs Ronaldo Dashboard. A minimal spider in this style is sketched below.
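A minimal Scrapy spider in this style; the start URL, CSS selectors, and settings are hypothetical placeholders, with throttling and robots.txt compliance standing in for the ethical scraping practices mentioned above.

```python
import scrapy

class PlayerStatsSpider(scrapy.Spider):
    name = "player_stats"
    # Hypothetical listing page; the real spider pointed at specific FBREF player pages
    start_urls = ["https://fbref.com/en/players/"]

    # Throttle requests and obey robots.txt as part of ethical scraping
    custom_settings = {"DOWNLOAD_DELAY": 2, "ROBOTSTXT_OBEY": True}

    def parse(self, response):
        # Placeholder selectors for a per-season statistics table
        for row in response.css("table.stats_table tbody tr"):
            yield {
                "season": row.css("th::text").get(),
                "goals": row.css("td[data-stat='goals']::text").get(),
            }
```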
(April 2020 - May 2020)
Used Python's Scrapy to extract data from FUTWIZ, where I scraped every player in the game along with their attributes and pricing history using ethical web scraping practices. This was the first step in formatting and cleaning the data for my FIFA Ultimate Team 2020 Dashboard.
Please feel free to contact me regarding any opportunities or questions about my projects.
San Diego, California