Create a Python script that does the following:
1. Using the following tweets dataset, identify the most influential users using the PageRank algorithm:
Link: [login to view URL] (Files from 2020/01 to 2020/05)
See the attached file for details.
List the 100 most influential users (user id, user name, PageRank score).
PageRank example implementation script: [login to view URL]
You should use the Spark GraphX or GraphFrames library.
Limit your analysis to tweets in Portuguese (language filter).
To create the network, add a directed edge from each retweeter to the author of the original tweet.
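A minimal sketch of the graph construction and ranking in task 1, written in plain Python rather than Spark so it can be tried on a handful of records. It assumes each tweet is a dict in the Twitter API v1.1 layout (`lang`, `user.id`, `user.screen_name`, `retweeted_status`); the dataset's actual schema may differ. The names `build_edges`, `pagerank`, and `top_users` are illustrative, and the power-iteration loop is only a stand-in for the GraphX or GraphFrames call the task asks for.

```python
from collections import defaultdict


def build_edges(tweets, lang="pt"):
    """Return (edges, names) for retweets written in `lang`.

    Each edge points from the retweeter to the author of the original tweet.
    """
    edges, names = [], {}
    for tweet in tweets:
        original = tweet.get("retweeted_status")
        # Skip tweets in other languages and tweets that are not retweets.
        if tweet.get("lang") != lang or original is None:
            continue
        src, dst = tweet["user"], original["user"]
        names[src["id"]] = src["screen_name"]
        names[dst["id"]] = dst["screen_name"]
        edges.append((src["id"], dst["id"]))
    return edges, names


def pagerank(edges, damping=0.85, iterations=20):
    """Plain power-iteration PageRank over a list of (src, dst) edges."""
    nodes = {node for edge in edges for node in edge}
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by users with no outgoing edges is spread over everyone.
        dangling = sum(rank[n] for n in nodes if n not in out_links)
        base = (1 - damping + damping * dangling) / len(nodes) if nodes else 0.0
        new_rank = {node: base for node in nodes}
        for src, targets in out_links.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


def top_users(tweets, k=100):
    """Return up to k (user id, user name, PageRank score) tuples."""
    edges, names = build_edges(tweets)
    ordered = sorted(pagerank(edges).items(), key=lambda item: item[1], reverse=True)
    return [(uid, names[uid], score) for uid, score in ordered[:k]]
```

In a Spark version, the same edge list would become a DataFrame with `src` and `dst` columns, alongside a vertex DataFrame keyed by `id`, and be handed to GraphFrames' `pageRank`. Scores in this sketch sum to 1; a library implementation may scale them or treat dangling nodes differently, so the numbers are illustrative only.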
2. Process a Twitter stream to identify the most popular users within a time window.
a) Continually collect the stream using the filter endpoint of the Twitter Streaming API to select
tweets from the United Kingdom.
Suggestion: Use the bounding box [-8.6, 49.5, 1.46, 60.5] (west longitude, south latitude, east longitude, north latitude).
b) Apply the exponentially decaying window approach to keep smoothed counts of mentions in the
collected tweets, and display the 10 most popular users at user-defined intervals of t seconds.
See example tweets and output (10 sec window; some outputs removed) below.
c) Apply a sentiment classifier to the collected tweets as they arrive, and keep two counts
to track the most popular and the most unpopular users.
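For steps b) and c), the following is a rough sketch of one common formulation of the exponentially decaying window: on every arriving tweet, all scores are multiplied by (1 - c), each mentioned user gets +1, and scores that fall below a threshold are dropped. The decay constant `c`, the threshold, the regex-based mention extraction, and `toy_sentiment` are all assumptions made for illustration; a real solution would plug in an actual sentiment classifier and could read mentions from the tweet's entities instead of the text.

```python
import re

MENTION = re.compile(r"@(\w+)")


class DecayingCounts:
    """Smoothed per-user counts under an exponentially decaying window."""

    def __init__(self, c=0.01, threshold=0.5):
        self.c = c                  # decay constant, assumed value
        self.threshold = threshold  # scores below this are forgotten
        self.scores = {}

    def update(self, users):
        # Age every existing score and drop the ones that decayed away.
        decayed = {u: s * (1 - self.c) for u, s in self.scores.items()}
        self.scores = {u: s for u, s in decayed.items() if s >= self.threshold}
        # Each mention in the newly arrived tweet adds 1 to that user's score.
        for user in users:
            self.scores[user] = self.scores.get(user, 0.0) + 1.0

    def top(self, k=10):
        return sorted(self.scores.items(), key=lambda item: item[1], reverse=True)[:k]


def toy_sentiment(text):
    """Hypothetical stand-in classifier; replace with a trained model."""
    lowered = text.lower()
    if any(word in lowered for word in ("love", "great", "thanks")):
        return "pos"
    if any(word in lowered for word in ("hate", "awful", "worst")):
        return "neg"
    return "neutral"


def process(text, popular, unpopular, classify=toy_sentiment):
    """Update both decaying counts with the mentions of one tweet."""
    users = MENTION.findall(text)
    label = classify(text)
    # Both windows age on every tweet; only the matching one gains mentions.
    popular.update(users if label == "pos" else [])
    unpopular.update(users if label == "neg" else [])
```

A driver loop would call `process` for each tweet from the stream and print `popular.top(10)` and `unpopular.top(10)` every t seconds. For step b) alone, a single `DecayingCounts` fed with all mentions, regardless of sentiment, is enough.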