DATA SCIENCE
1. Download 5 clustering datasets from the UCI Machine Learning Repository ([login to view URL] mIns=&type=&sort=nameUp&view=table).
a) Run the k-means algorithm with different numbers of clusters, including the correct number. Use the Euclidean distance as the dissimilarity function. Plot the convergence of the algorithm and the distribution of the data across clusters. Compare the latter with the true distribution of the data across clusters, as given by the creators of the datasets («ground truths»). What do you notice about the value of the objective function as the number of clusters changes? How do you explain it?
b) Repeat the above using Gaussian mixture models, and answer the same questions as in (a). Consider both full and diagonal covariance matrices in your analysis. What differences in performance do you observe between the two setups?
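For orientation, the sweeps described in (a) and (b) could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the Iris data bundled with scikit-learn stands in for the five UCI datasets, the adjusted Rand index is one possible way to compare against the ground truth, and the helper names `kmeans_sweep` and `gmm_sweep` are hypothetical. It records only final objective values and iteration counts, not the per-iteration curves a convergence plot would need.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Assumption: Iris (bundled with scikit-learn, 3 ground-truth classes)
# stands in for the UCI datasets named in the task.
X, y_true = load_iris(return_X_y=True)


def kmeans_sweep(X, y_true, ks=(2, 3, 4, 5), seed=0):
    """For each k, return (objective, agreement with ground truth, iterations)."""
    results = {}
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        # inertia_ is the k-means objective: the sum of squared Euclidean
        # distances from each sample to its closest cluster centre.
        ari = adjusted_rand_score(y_true, km.labels_)
        results[k] = (km.inertia_, ari, km.n_iter_)
    return results


def gmm_sweep(X, y_true, ks=(2, 3, 4, 5), seed=0):
    """Same sweep for Gaussian mixtures with full and diagonal covariances."""
    results = {}
    for cov in ("full", "diag"):
        for k in ks:
            gmm = GaussianMixture(n_components=k, covariance_type=cov,
                                  max_iter=200, random_state=seed).fit(X)
            ari = adjusted_rand_score(y_true, gmm.predict(X))
            # lower_bound_ is the final value of the EM lower bound
            # on the log-likelihood.
            results[(cov, k)] = (gmm.lower_bound_, ari, gmm.n_iter_)
    return results


kmeans_results = kmeans_sweep(X, y_true)
gmm_results = gmm_sweep(X, y_true)

for k, (inertia, ari, n_iter) in kmeans_results.items():
    print(f"k-means k={k}: inertia={inertia:.1f} ARI={ari:.2f} iters={n_iter}")
for (cov, k), (bound, ari, n_iter) in gmm_results.items():
    print(f"GMM {cov} k={k}: bound={bound:.2f} ARI={ari:.2f} iters={n_iter}")
```

The k-means objective normally keeps falling as k grows, so on its own it cannot identify the correct number of clusters; that is why the sketch also reports agreement with the ground-truth labels.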
2. Based on the relevant demo from the lesson (see footnote 1), use the 20newsgroups corpus to train a topic model using nonnegative matrix factorization. As in the demo, represent every text through its tf-idf representation, in which each document is described by a combination of the relative frequency of each token in that text and the inverse of the token's relative frequency across all texts of the corpus.
a. Try the algorithm with at least 5 different numbers of latent features. What do you observe regarding convergence and computational cost? What explanation do you give?
b. Try the algorithm with at least 5 different numbers of inferred topics. What do you observe regarding convergence and computational cost? What explanation do you give?
c. What happens to convergence and computational cost if we increase the number of samples? What explanation do you give?
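A rough sketch of how one such run could be measured is given below. It is a sketch under assumptions, not the demo's code: the six toy documents are hypothetical stand-ins for the 20newsgroups corpus (which `sklearn.datasets.fetch_20newsgroups` would download), the helper name `fit_nmf_topics` and its defaults are made up for illustration, and only 2 and 3 topics are tried because the toy corpus is so small.

```python
import time

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for 20newsgroups.
toy_docs = [
    "the rocket launch reached orbit around the moon",
    "nasa plans a new space shuttle launch",
    "the hockey team won the game in overtime",
    "the goalie saved the game for the hockey team",
    "the graphics card renders the image quickly",
    "new graphics software improves image rendering",
]


def fit_nmf_topics(docs, n_topics, max_features=1000, seed=0):
    """Fit NMF on tf-idf features; report iterations, error, and fit time."""
    vectorizer = TfidfVectorizer(max_features=max_features,
                                 stop_words="english")
    tfidf = vectorizer.fit_transform(docs)

    start = time.perf_counter()
    model = NMF(n_components=n_topics, init="nndsvd", max_iter=500,
                random_state=seed)
    doc_topic = model.fit_transform(tfidf)  # documents x topics weights
    seconds = time.perf_counter() - start

    return {
        "n_features": len(vectorizer.vocabulary_),
        "doc_topic_shape": doc_topic.shape,
        "n_iter": model.n_iter_,                  # iterations actually run
        "error": model.reconstruction_err_,       # final reconstruction error
        "seconds": seconds,                       # wall-clock time of the fit
    }


results = {k: fit_nmf_topics(toy_docs, k) for k in (2, 3)}
for k, r in results.items():
    print(f"topics={k}: iters={r['n_iter']} error={r['error']:.3f} "
          f"time={r['seconds']:.4f}s features={r['n_features']}")
```

On the real corpus the same function could be called in a loop over the number of topics, over `max_features`, and over the number of documents passed in, which gives one measurement series for each of the questions above.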
3. Build a system that provides
4. Download the datasets from: https://archive.ics.uci.edu/ml/datasets/UJI+Pen+Characters.
Use HMMs trained on these data to build a system that can recognize handwriting. Try HMMs with different numbers of states. What do you notice about convergence, recognition accuracy, and computational cost? How do you interpret these observations based on the theory you have been taught?
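The recognition step of such a system is commonly done by scoring a sample under one HMM per character and picking the best-scoring model. The sketch below illustrates only that step, under loud assumptions: pen strokes are taken to be already quantized into four direction symbols (0 = right, 1 = up, 2 = left, 3 = down), the two 2-state models for "L" and "7" are hand-set and hypothetical rather than trained, and training itself (for example with Baum-Welch) is not shown.

```python
import numpy as np


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    alpha = start * emit[:, obs[0]]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by the
            # probability of emitting the symbol observed at time t.
            alpha = (alpha @ trans) * emit[:, obs[t]]
        scale = alpha.sum()
        log_prob += np.log(scale)
        alpha = alpha / scale  # rescale to avoid numerical underflow
    return log_prob


def recognize(obs, models):
    """Return the label whose HMM gives the observation the highest score."""
    scores = {label: log_likelihood(obs, *params)
              for label, params in models.items()}
    return max(scores, key=scores.get)


# Hypothetical hand-set left-to-right models with 2 states each.
start = np.array([1.0, 0.0])
trans = np.array([[0.8, 0.2],
                  [0.0, 1.0]])
mostly_right = [0.85, 0.05, 0.05, 0.05]
mostly_down = [0.05, 0.05, 0.05, 0.85]
models = {
    "L": (start, trans, np.array([mostly_down, mostly_right])),  # down, then right
    "7": (start, trans, np.array([mostly_right, mostly_down])),  # right, then down
}

print(recognize([3, 3, 3, 0, 0], models))  # stroke going down, then right
print(recognize([0, 0, 3, 3, 3], models))  # stroke going right, then down
```

With trained models, the number of states would be the size of `start` and `trans`, so repeating the experiment with different state counts only changes the model parameters, not the scoring code.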
1 http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf.html#example-applications-topics-extraction-with-nmf-py
Dear Madam or Sir,
I have extensive experience in machine learning and Python. As a mathematician and software developer, I understand both the theoretical and the practical side of machine learning. I have implemented k-means and the expectation-maximization algorithm for Gaussian mixture models from scratch in Matlab and Python.
Moreover, I am familiar with the built-in functions from scikit-learn that should be used here.
You will get the Python source code with lots of comments, the plots, and the answers to all questions for each task.
I am looking forward to discussing your exact needs in the chat.
Best regards
€44 EUR in 3 days
5.0 (7 reviews)
3.7
11 freelancers are bidding on average €195 EUR for this job
Hi, I am a Python software developer. I have been working with Python for the last 3 years and have good knowledge of this technology. I also have a good understanding of data mining and clustering algorithms. Looking forward to your response. Let's chat about this.
I have helped people at Sussex, George Mason, and Kings London with their work on natural language processing, predictive analytics, document classification, etc., and delivered it on time with 100% accuracy. I believe I can help you on time as well.
My milestone plan is designed to reduce any risk to you.
By the way, your 3rd point is not complete. Please share it; if it is complex, it may change the bid or timeline.
Hi, I am a Computational Biologist. I have advanced skills in R and Python and a lot of experience in machine learning algorithms and statistics on biological datasets.
Does it have to be done in Python?
Happy to help you with your assignment.
Cheers,
Narendra Meena
Dear,
I've read your job description and understood the job.
I have experience in the analysis of Yelp reviews.
I think that k-means is meant for clustering discrete data and GMM is for continuous data.
GMM is somewhat more accurate than k-means on UCI datasets.
TF-IDF is one of the topic models, but it is weaker than LSA and LDA.
If you hire me, I will finish it on time.
Let's get in touch on Freelancer and discuss some more details.
Regards.