Multiple projects in Big Data, Microservices, Cloud Architecture, Machine Learning, Software Development
Go, Scala, Java, Python, NodeJS.
Project 1: near-real-time data replication from MySQL to Redshift on AWS, built with Kafka and Spark Streaming in Scala and Python. The system also used Prometheus, Kibana, Sqoop, Flume and GitLab. Worked closely with other data engineers, the product owner and data scientists. Replicated very large tables in near real time. Team size: 6-7. Duration: 6 months.
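The core replication step can be sketched as a toy transform from a CDC change event to the delete-then-insert pattern commonly used for upserts on Redshift. The event shape, table and column names below are illustrative, not the project's actual schema:

```python
def change_event_to_sql(event):
    """Translate a CDC change event (a dict) into a parameterized
    DELETE+INSERT pair, a common upsert pattern on Redshift."""
    table = event["table"]
    key = event["key"]    # primary-key column name
    row = event["row"]    # column -> value mapping
    delete = f"DELETE FROM {table} WHERE {key} = %s;"
    cols = ", ".join(row)
    placeholders = ", ".join(["%s"] * len(row))
    insert = f"INSERT INTO {table} ({cols}) VALUES ({placeholders});"
    return delete, insert
```

In the real pipeline, events arrive continuously from Kafka and statements are applied in batches; this sketch only shows the per-event translation.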
Project 2: end-to-end data processing architecture and implementation, from log structuring, to storage in Redshift and S3, to machine learning with NLTK, scikit-learn and Python, to serving recommendations and analytics from Elasticsearch via a NodeJS API. Worked closely with the CEO, CTO, senior product managers and DevOps engineers to ship to production. A near-real-time recommendation system was deployed and proven effective in an independent A/B test. Helped interview, hire and onboard data engineers. Duration: 8 months. Team size: 1 core, 3 non-core.
Project 3: prepared training materials on Spark architecture and sketching algorithms, commissioned by a top-tier IT company; trained data scientists (now senior data scientists, managers and CTOs) in efficient algorithms for Big Data with Spark (Scala) and PySpark, using a hands-on, problem-solving approach.
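One of the sketching algorithms typically covered in such training, the Count-Min sketch, can be illustrated in a few lines of plain Python. The parameters and hashing scheme below are illustrative, not the training material itself:

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counting in sub-linear space.
    Estimates never undercount; they may overcount on hash collisions."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, item):
        # One independent-ish hash per row, derived from md5.
        for i in range(self.depth):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.width

    def add(self, item, count=1):
        for i, idx in enumerate(self._hashes(item)):
            self.table[i][idx] += count

    def estimate(self, item):
        # Take the minimum across rows to damp collision noise.
        return min(self.table[i][idx] for i, idx in enumerate(self._hashes(item)))
```

On a cluster, the same table layout makes the sketch mergeable, which is why it suits distributed aggregation in Spark.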
Project 4: participated in design thinking workshops as a data engineer to inform future data products. In a team of three, delivered a clickable proof-of-concept prototype for large financial data in just three days. The system ingested and structured gigabytes of financial data in Redshift; the aggregated results were moved to MySQL, from where a Go backend served them to a React app. Developed a simple idempotent pipeline in Bash. Deployed the infrastructure on AWS. Facilitated discussions between data scientists, backend engineers and financial data experts to inform the product.
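The idempotency idea behind the Bash pipeline (each step runs at most once, so the whole pipeline can be safely re-run after a failure) can be sketched in Python with marker files. Function and directory names are illustrative:

```python
from pathlib import Path

def run_step(name, action, done_dir="pipeline_state"):
    """Run `action` exactly once; a marker file makes re-runs no-ops."""
    marker = Path(done_dir) / f"{name}.done"
    if marker.exists():
        return False  # step already completed; safe to skip
    action()
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()
    return True
```

Re-running the whole pipeline then only executes the steps that have not yet succeeded, which is the property that made the three-day prototype robust.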
Project 5: external member of an in-house team building a modern data pipeline tool supporting pipelines on AWS Redshift and EMR (Spark). Informed software and cloud architecture in technical discussions with developers, the team lead and the product owner. Implemented features, unit and integration tests, and end-user demos. Created and hosted Python and Anaconda packages on Artifactory. Set up test jobs on Jenkins. Supported peers with code reviews and debugging help. Duration: 4-5 months. Team size: 5-8.
Project 6: optimized slow database queries in Postgres. Optimized data storage via Change Data Capture and Slowly Changing Dimensions. Implemented a SWIFT message parser in Go via code generation. Implemented comprehensive integration tests. Delivered extensive documentation for project handover. Used Docker, Postgres and Go. Duration: 4 months. Team size: 4.
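The Slowly Changing Dimensions technique (type 2) mentioned above keeps full history by closing the old row and appending a new current version. A minimal sketch in plain Python, with illustrative column names and date handling:

```python
from datetime import date

def scd2_apply(history, updates, key="id", today=None):
    """Apply updates to a type-2 Slowly Changing Dimension table.

    `history` is a list of row dicts carrying `valid_from`/`valid_to`;
    the current version of each key has valid_to=None. A changed row is
    closed (valid_to set) and a new open version is appended."""
    today = today or date.today()
    current = {r[key]: r for r in history if r["valid_to"] is None}
    for upd in updates:
        old = current.get(upd[key])
        if old is not None:
            if all(old.get(c) == v for c, v in upd.items()):
                continue  # nothing changed; keep the open row as-is
            old["valid_to"] = today  # close the previous version
        history.append({**upd, "valid_from": today, "valid_to": None})
    return history
```

The same close-and-append logic is normally expressed in SQL against the warehouse; the Python version just makes the bookkeeping explicit.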
Project 7: large financial institution: extracted and presented visitor analytics from ELB logs in Jupyter at the request of a product owner.
Project 8: developed, in collaboration, an open-source repository for analysis of stock market data to showcase the data analytics capabilities of a consulting company. The repository is now featured on the AWS Open Data Registry. Used Jupyter, TensorFlow, scikit-learn and AWS SageMaker.
Project 9: evaluated cloud (AWS, Google Cloud, Azure), third-party and desktop OCR services. Implemented a zonal OCR prototype in a team of two using open-source image processing libraries (OpenCV) and cloud services.
Project 10: participated as a machine learning expert in design thinking workshops for building NLP-based and image search products. Facilitated discussion around product and engineering requirements and the application of state-of-the-art approaches. Delivered a 30-page report on related work, including output of the models on data similar to the customer's.
Demos:
I have developed multiple demos to showcase my abilities to prospective customers: https://stefansavev.com/demos.html