I am trying to get Twitter data for narendramodi using the query below.
SELECT b.t_id,
       a.profile_image,
       b.tweet_text,
       e.media_media_url,
       b.retweet_count,
       b.favorite_count AS like_count,
       count(reply_to_status_id) AS reply_count,
       f.imp_count,
       f.eng_count,
       f.eng_rate
FROM twitter_users a
LEFT JOIN twitter_tweets b ON a.user_id = b.user_id
LEFT JOIN replies c ON b.t_id = c.t_id
LEFT JOIN media e ON b.t_id = e.t_id
LEFT JOIN twitter_tweet_metric_aggregates f
       ON f.metric_timestamp = (SELECT max(metric_timestamp)
                                FROM twitter_tweet_metric_aggregates g
                                WHERE g.t_id = f.t_id AND g.t_id = b.t_id)
WHERE a.twitter_screen_name = 'narendramodi'
GROUP BY b.t_id, a.profile_image, b.tweet_text, b.retweet_count, b.favorite_count,
         e.media_media_url, f.imp_count, f.eng_count, f.eng_rate;
The query returns correct results, but it uses a correlated sub-select to fetch the most recent imp_count (by metric_timestamp) for each tweet. Because of this sub-select the query cost is huge, and execution takes more than 15 minutes. I want to bring that down so the query runs within 10 seconds. For that reason I tried rewriting it with a WITH (CTE) expression:
WITH metric_counts AS (
    SELECT max(metric_timestamp), f.t_id, f.imp_count, f.eng_count, f.eng_rate
    FROM twitter_tweet_metric_aggregates f
    LEFT JOIN tweets b ON f.t_id = b.t_id
)
SELECT b.t_id,
       a.profile_image,
       b.tweet_text,
       e.media_media_url,
       b.retweet_count,
       b.favorite_count AS like_count,
       count(reply_to_status_id) AS reply_count,
       metric_counts.imp_count,
       metric_counts.eng_count,
       metric_counts.eng_rate
FROM twitter_users AS a
LEFT JOIN tweets AS b ON a.twitter_user_id = b.twitter_user_id
LEFT JOIN replies c ON b.t_id = c.t_id
LEFT JOIN media e ON b.t_id = e.t_id
LEFT JOIN metric_counts ON metric_counts.t_id = b.t_id
WHERE lower(a.twitter_screen_name) = lower('narendramodi')
GROUP BY b.t_id, a.profile_image, b.tweet_text, e.media_media_url,
         b.retweet_count, b.favorite_count,
         metric_counts.imp_count, metric_counts.eng_count, metric_counts.eng_rate;
The WITH query above does return imp_count values for each tweet, but not the latest record/value. Can anyone help me achieve this?
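For reference, one common PostgreSQL pattern for "latest row per key" is DISTINCT ON. The sketch below is untested against the real schema; it assumes PostgreSQL (the plan output further down suggests it) and reuses the table and column names from the WITH query above. The alias mc is introduced only for illustration.

```sql
-- Sketch only: assumes PostgreSQL and the column names used in the query above.
WITH metric_counts AS (
    -- DISTINCT ON keeps the first row per t_id in ORDER BY order,
    -- i.e. the row with the greatest metric_timestamp for that tweet.
    SELECT DISTINCT ON (f.t_id)
           f.t_id,
           f.metric_timestamp,
           f.imp_count,
           f.eng_count,
           f.eng_rate
    FROM twitter_tweet_metric_aggregates f
    ORDER BY f.t_id, f.metric_timestamp DESC
)
SELECT b.t_id,
       mc.imp_count,
       mc.eng_count,
       mc.eng_rate
FROM tweets b
LEFT JOIN metric_counts mc ON mc.t_id = b.t_id;
```

Note that this CTE still reads every row of twitter_tweet_metric_aggregates, so on its own it may fix the "latest value" problem without necessarily meeting the time target.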
Here is the query cost of the WITH query:
HashAggregate (cost=1734856.13..1735618.48 rows=76235 width=673)
Can anyone help me reduce the cost even further, so that the query returns results within 15 seconds?
Query_cost (full EXPLAIN ANALYZE output):
HashAggregate  (cost=4923196.15..4923958.50 rows=76235 width=673) (actual time=51871.524..51872.333 rows=1513 loops=1)
  Group Key: imp_counts.tweet_status_id, a.profile_image, b.tweet_text, b.tweet_created_at, d.sentiment, d.emotion, b.retweet_count, b.favorite_count, e.media_media_url, imp_counts.tweet_impression_count, imp_counts.tweet_engagement_count, imp_counts.tweet_engagement_rate
  CTE imp_counts
    ->  Seq Scan on twitter_tweet_metric_aggregates f  (cost=0.00..3356389.12 rows=17676 width=47) (actual time=37516.805..41642.354 rows=60699 loops=1)
          Filter: (metric_timestamp = (SubPlan 2))
          Rows Removed by Filter: 3475596
          SubPlan 2
            ->  Result  (cost=0.90..0.91 rows=1 width=0) (actual time=0.011..0.011 rows=1 loops=3536295)
                  InitPlan 1 (returns $1)
                    ->  Limit  (cost=0.56..0.90 rows=1 width=8) (actual time=0.010..0.010 rows=1 loops=3536295)
                          ->  Index Only Scan Backward using pk_twitter_tweet_metric_aggregates on twitter_tweet_metric_aggregates g  (cost=0.56..55.57 rows=158 width=8) (actual time=0.010..0.010 rows=1 loops=3536295)
                                Index Cond: ((tweet_status_id = (f.tweet_status_id)::text) AND (metric_timestamp IS NOT NULL))
                                Heap Fetches: 3536295
  ->  Nested Loop Left Join  (cost=1202381.45..1564329.39 rows=76235 width=673) (actual time=50478.887..51854.010 rows=10672 loops=1)
        ->  Nested Loop Left Join  (cost=1202380.90..1472231.15 rows=76235 width=641) (actual time=50478.871..51781.010 rows=10649 loops=1)
              ->  Hash Right Join  (cost=1202380.34..1362479.87 rows=76235 width=626) (actual time=50478.841..51702.556 rows=10649 loops=1)
                    Hash Cond: ((c.tweet_status_id)::text = (b.tweet_status_id)::text)
                    ->  Seq Scan on twitter_tweet_replies c  (cost=0.00..150216.68 rows=2606068 width=38) (actual time=0.041..692.358 rows=2606068 loops=1)
                    ->  Hash  (cost=1201427.40..1201427.40 rows=76235 width=607) (actual time=50477.837..50477.837 rows=1499 loops=1)
                          Buckets: 131072  Batches: 1  Memory Usage: 1497kB
                          ->  Merge Left Join  (cost=1200957.10..1201427.40 rows=76235 width=607) (actual time=50446.646..50476.746 rows=1499 loops=1)
                                Merge Cond: ((b.tweet_status_id)::text = (imp_counts.tweet_status_id)::text)
                                ->  Sort  (cost=1199356.58..1199547.17 rows=76235 width=293) (actual time=8608.183..8608.530 rows=1499 loops=1)
                                      Sort Key: b.tweet_status_id
                                      Sort Method: quicksort  Memory: 597kB
                                      ->  Hash Right Join  (cost=29591.36..1193174.62 rows=76235 width=293) (actual time=3.714..8604.233 rows=1499 loops=1)
                                            Hash Cond: ((b.twitter_user_id)::text = (a.twitter_user_id)::text)
                                            ->  Seq Scan on twitter_tweets b  (cost=0.00..1095151.30 rows=18045230 width=230) (actual time=0.026..5037.377 rows=18044981 loops=1)
                                            ->  Hash  (cost=29214.36..29214.36 rows=30160 width=89) (actual time=0.025..0.025 rows=1 loops=1)
                                                  Buckets: 32768  Batches: 1  Memory Usage: 257kB
                                                  ->  Index Scan using twitter_screen_name_idx on twitter_users a  (cost=0.56..29214.36 rows=30160 width=89) (actual time=0.020..0.021 rows=1 loops=1)
                                                        Index Cond: (lower((twitter_screen_name)::text) = 'narendramodi'::text)
                                ->  Sort  (cost=1600.52..1644.71 rows=17676 width=314) (actual time=41838.456..41847.253 rows=60655 loops=1)
                                      Sort Key: imp_counts.tweet_status_id
                                      Sort Method: quicksort  Memory: 8540kB
                                      ->  CTE Scan on imp_counts  (cost=0.00..353.52 rows=17676 width=314) (actual time=37516.810..41684.563 rows=60699 loops=1)
              ->  Index Scan using tweet_sentiment_index on tweet_sentiment d  (cost=0.56..1.43 rows=1 width=34) (actual time=0.006..0.007 rows=0 loops=10649)
                    Index Cond: ((b.tweet_status_id)::text = (tweet_status_id)::text)
        ->  Index Only Scan using twitter_tweet_media_pkey on twitter_tweet_media e  (cost=0.55..1.20 rows=1 width=70) (actual time=0.006..0.006 rows=0 loops=10649)
              Index Cond: (tweet_status_id = (b.tweet_status_id)::text)
              Heap Fetches: 1074
Planning time: 2.913 ms
Execution time: 51875.165 ms
(43 rows)
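
In the plan above, SubPlan 2 runs 3,536,295 times (once per row of twitter_tweet_metric_aggregates), although only 1,499 tweets belong to the selected user. A possible alternative, sketched here under assumptions, is to look up the latest metrics row per tweet with a LATERAL join, so the lookup runs only for that user's tweets. It assumes PostgreSQL 9.3 or later and uses the table and column names from the first query in the question; the alias m is hypothetical, and the sketch has not been run against the real schema.

```sql
-- Sketch only: assumes PostgreSQL 9.3+ (LATERAL) and the names from the first query.
SELECT b.t_id,
       b.tweet_text,
       m.imp_count,
       m.eng_count,
       m.eng_rate
FROM twitter_users a
LEFT JOIN twitter_tweets b ON b.user_id = a.user_id
LEFT JOIN LATERAL (
    -- Evaluated once per tweet of the selected user. With an index on
    -- (t_id, metric_timestamp), each evaluation can be one backward index probe.
    SELECT f.imp_count, f.eng_count, f.eng_rate
    FROM twitter_tweet_metric_aggregates f
    WHERE f.t_id = b.t_id
    ORDER BY f.metric_timestamp DESC
    LIMIT 1
) m ON true
WHERE a.twitter_screen_name = 'narendramodi';
```

The plan also shows a sequential scan over roughly 18 million rows of twitter_tweets. Whether an index on its user-id column exists is not stated in the question, so that may be worth checking as well.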