This project, developed in a Jupyter notebook, builds a model that predicts whether a person has cancer from a microRNA sequencing exam, using data from TCGA. It also serves as a study of dimensionality reduction (PCA), logistic regression, and model training.
Data were collected from The Cancer Genome Atlas Program (TCGA), an international, world-class program that has characterized 33 types of cancer. The data are real and have been properly anonymized. Each row represents a sample taken from a person. Each column is a type of microRNA, and each entry is the intensity with which that microRNA is expressed. Expression values are non-negative, in the range [0, ∞): values close to zero indicate low expression, and larger values indicate higher expression. The data are also labeled (see the `class` attribute), with TP (primary solid tumor) indicating tumor and NT (normal tissue) indicating no tumor.
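The workflow described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it uses randomly generated data in place of the TCGA file, the column names (`mirna_0`, ...) are hypothetical, and the `log1p` transform, the scaling step, and the choice of 10 principal components are assumptions made only for the example. The label column is named `class`, as in the data description.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the TCGA table: rows are samples, columns are microRNAs.
# Expression values are non-negative, matching the [0, inf) range described above.
n_samples, n_mirnas = 200, 50
labels = np.where(rng.random(n_samples) < 0.5, "TP", "NT")
expression = rng.lognormal(mean=1.0, sigma=1.0, size=(n_samples, n_mirnas))
# Artificially raise a few microRNAs in tumor samples so there is signal to learn.
expression[labels == "TP", :5] *= 4.0

df = pd.DataFrame(expression, columns=[f"mirna_{i}" for i in range(n_mirnas)])
df["class"] = labels

# Features are the microRNA columns; the target is 1 for TP (tumor), 0 for NT (normal).
X = df.drop(columns="class")
y = (df["class"] == "TP").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# log1p compresses the long right tail of the expression values,
# scaling standardizes each microRNA, PCA reduces the dimensionality,
# and logistic regression classifies the samples in the reduced space.
model = make_pipeline(
    FunctionTransformer(np.log1p),
    StandardScaler(),
    PCA(n_components=10),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy on synthetic data: {accuracy:.2f}")
```

Wrapping the steps in one pipeline means the scaler and PCA are fitted on the training split only, so no information from the test split leaks into the model. With the real data, the synthetic block would be replaced by loading the TCGA table into `df`.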
- The Cancer Genome Atlas Program (TCGA) - (https://www.cancer.gov/ccg/research/genome-sequencing/tcga)
- What is microRNA? (https://www.bbc.com/news/articles/c79nrgp97x9o)
Python
- pandas
- NumPy
- seaborn
- Matplotlib
- scikit-learn (sklearn)