04. Mining Data Streams
4.1 The Stream Data Model
Example: Imagine you are a data scientist at a social media platform. Users' posts and comments arrive in real time. That flow of data is a stream, and you want to analyze it as it arrives.
- Data-Stream-Management System (DSMS): A system that lets you manage data streams. For example, you can build a dashboard that tracks user engagement in real time.
- Stream Sources: Streams come from sources such as Twitter, Facebook, or sensors.
- Stream Queries: You can run a query such as “Which post is the most popular right now?” and the system answers it from the live data.
- Issues in Stream Processing: Challenges include data loss, latency, and keeping processing speed up with the arrival rate. Because a stream is too large to store in full, answers usually have to be computed from limited memory.
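To make the "most popular post right now" query concrete, here is a minimal sketch of a standing query over a sliding window. It assumes "right now" means "among the last `window_size` events"; the class name `SlidingWindowTopPost` and the post IDs are made up for illustration and do not come from any real DSMS.

```python
from collections import Counter, deque


class SlidingWindowTopPost:
    """Tracks the most-seen post among the last `window_size` events (hypothetical helper)."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()    # the most recent post IDs, oldest on the left
        self.counts = Counter()  # how often each post ID occurs inside the window

    def observe(self, post_id):
        """Feed one stream event (for example, a like or comment on post_id)."""
        self.window.append(post_id)
        self.counts[post_id] += 1
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            self.counts[expired] -= 1
            if self.counts[expired] == 0:
                del self.counts[expired]

    def most_popular(self):
        """Answer the standing query from the current window only."""
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]
```

Memory stays bounded by the window size no matter how long the stream runs, which is the point of answering from a window instead of from the full history.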
4.2 Sampling Data Techniques in a Stream
Example: Imagine an online shopping site that thousands of customers visit every day. You want to take a sample of those customers and study their preferences.
- Sampling Techniques: You can use random sampling or reservoir sampling to get a representative sample. Reservoir sampling keeps a fixed-size sample in which every item seen so far is equally likely to appear, without knowing the stream's length in advance.
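Reservoir sampling is easier to follow in code. This is a minimal sketch of the standard single-pass version; the function name `reservoir_sample` and the `rng` parameter are made up for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item number i+1 replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

For the shopping-site example you would call something like `reservoir_sample(customer_ids, 100)` on the stream of visiting customer IDs. If the stream has fewer than `k` items, all of them are returned.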
4.3 Filtering Streams
Example: You need to filter out unwanted spam messages arriving in a messaging app.
- Bloom Filter: A space-efficient probabilistic data structure. You can build a filter that checks whether a message is spam without storing the full data: it hashes each known spam item to a few positions in a bit array and sets those bits.
- Analysis: Lookups are fast, but false positives are possible, so a few normal messages may be flagged as spam. False negatives are not possible: anything that was added is always reported as present.
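A minimal Bloom filter sketch, assuming the filter stores known spam senders. The class name, the default sizes, and the way several hash functions are derived from SHA-256 with a seed prefix are all illustration choices, not a tuned implementation.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter sketch: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly added"; False means "definitely never added".
        return all(self.bits[pos] for pos in self._positions(item))
```

A `False` from `might_contain` is always trustworthy. A `True` can be wrong when other items happened to set the same bits, and that is where the false positives come from.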
4.4 Counting Distinct Elements in a Stream
Count-Distinct Problem: Imagine you need to count the unique visitors in a stream of data.
- Flajolet-Martin Algorithm: Estimates the number of unique elements. It hashes each element to a bit string and tracks R, the largest number of trailing zeros seen in any hash value; the estimate is 2^R.
- Combining Estimates: A single hash function gives a noisy estimate, so use many hash functions, split their estimates into small groups, average within each group, and take the median of the averages for a more accurate count.
- Space Requirements: The algorithm is memory-efficient, since it keeps only one small integer per hash function, but it gives an approximation, not an exact count.
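The three bullets above can be sketched together as follows. This assumes 32-bit hash values derived from SHA-256 with a seed prefix; the function names and the defaults `num_hashes=32` and `group_size=4` are made up for illustration.

```python
import hashlib
from statistics import mean, median


def trailing_zeros(n, width=32):
    """Number of 0 bits at the right-hand end of n (width if n is 0)."""
    if n == 0:
        return width
    return (n & -n).bit_length() - 1


def fm_estimate(stream, num_hashes=32, group_size=4):
    """Flajolet-Martin style estimate of the number of distinct items."""
    max_tail = [0] * num_hashes  # one small integer R per hash function
    for item in stream:
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            value = int.from_bytes(digest[:4], "big")
            max_tail[seed] = max(max_tail[seed], trailing_zeros(value))
    estimates = [2 ** r for r in max_tail]
    # Combine: average inside small groups, then take the median of the averages.
    groups = [estimates[i:i + group_size] for i in range(0, num_hashes, group_size)]
    return median([mean(group) for group in groups])
```

Seeing the same visitor again never changes the result, because a repeated item produces the same hash values. The memory used is just `num_hashes` small integers, however long the stream is.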
4.5 Counting Frequent Items in a Stream
Example: You need to find the most popular songs in a music streaming app.
- Sampling Methods for Streams: You can take a sample of the plays and see which songs show up frequently.
- Frequent Itemsets in Decaying Windows: Instead of a fixed time window, a decaying window gives each play a weight that shrinks as it gets older, so the analysis focuses on recently played songs.
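A minimal sketch of the decaying-window idea for single songs, assuming a decay constant `c` and a drop threshold of 0.5. The function name `decayed_counts` and both defaults are illustration choices.

```python
def decayed_counts(stream, c=0.1, threshold=0.5):
    """Keep an exponentially decayed play score per song; drop songs that fade out."""
    scores = {}
    for song in stream:
        # Every existing score decays a little each time a new play arrives.
        for key in list(scores):
            scores[key] *= (1 - c)
            if scores[key] < threshold:
                del scores[key]  # too old or too rare to keep tracking
        # The song that was just played gains one full point.
        scores[song] = scores.get(song, 0.0) + 1.0
    return scores
```

Songs that stop being played fall below the threshold and disappear, so the dictionary only holds songs that are popular recently. A smaller `c` makes the window remember further back.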
4.6 Counting Ones in a Window
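Example: Imagine a sensor that sends a temperature reading every second, and you turn the readings into a bit stream: 1 if the reading is above a threshold, 0 otherwise.
- The Cost of Exact Counts: To answer exactly how many 1s are in the last k bits, for any k up to the window size N, you have to store all N bits, which costs a lot of memory.
- DGIM Algorithm: Estimates the number of 1s in the last N bits using only O(log^2 N) bits of memory. It groups the 1s into buckets whose sizes are powers of 2 and keeps at most two buckets of each size.
- Query Answering in the DGIM Algorithm: You can run real-time queries such as “In the last 10 seconds, how many readings were above the threshold?” Add up the sizes of the buckets that fall in the range, counting only half of the oldest one; the estimate is off by at most 50%.
- Decaying Windows: An exponentially decaying window gives older readings smaller weights, so recent readings count for more.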
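Here is a compact sketch of DGIM bucket maintenance and query answering. The class and method names are made up for illustration, and buckets are kept in a plain Python list for readability, not in the most compact form.

```python
class DGIM:
    """Approximate count of 1s among the last `window_size` bits of a stream."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.time = 0
        self.buckets = []  # newest first; each bucket is [timestamp_of_latest_1, size]

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its latest 1 has left the window.
        while self.buckets and self.buckets[-1][0] <= self.time - self.window_size:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, [self.time, 1])
        # Whenever three buckets share a size, merge the two oldest of them.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            self.buckets[i + 1][1] *= 2  # merged bucket keeps the newer timestamp
            del self.buckets[i + 2]
            i += 1

    def count_ones(self, k=None):
        """Estimate the number of 1s in the last k bits (k <= window_size)."""
        k = self.window_size if k is None else k
        cutoff = self.time - k
        total = 0
        oldest = 0
        for timestamp, size in self.buckets:
            if timestamp <= cutoff:
                break
            total += size
            oldest = size
        # Count only half of the oldest bucket; a size-1 bucket counts in full.
        return total - oldest // 2
```

After feeding four 1s into `DGIM(8)`, the buckets have sizes 1, 1, 2 and `count_ones()` returns 3 against a true count of 4, which is inside the 50% error bound.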
05. Finding Similar Items and Clustering
5.1 Distance Measures
Example: Imagine you need to recommend similar movies.
- Definition of a Distance Measure: A function that says how far apart two items (movies) are; a smaller distance means more similar. It is never negative, is zero only for identical items, is symmetric, and obeys the triangle inequality.
- Euclidean Distance: If each movie is plotted as a point whose coordinates are its ratings, this is the straight-line distance between the two points, so it shows how close the two movies' ratings are.
- Jaccard Distance: Used to compare two sets, such as two users' movie preferences. It is 1 minus the size of the intersection divided by the size of the union.
- Cosine Distance: The angle between two vectors, used when each movie has a feature vector (such as genre, actors).
- Edit Distance: For comparing two strings (such as movie titles). It is the number of edits (insertions, deletions) needed to convert one string into the other.
- Hamming Distance: Used for binary strings of equal length. It counts the positions at which the values differ.
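The five measures above fit in a few lines each. This is a sketch with made-up function names; cosine distance is returned as an angle in degrees, and edit distance allows only insertions and deletions, as in the bullet above, so it is computed from the longest common subsequence.

```python
import math


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def jaccard_distance(s, t):
    s, t = set(s), set(t)
    if not s and not t:
        return 0.0  # convention used here: two empty sets are identical
    return 1 - len(s & t) / len(s | t)


def cosine_distance(x, y):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    cosine = max(-1.0, min(1.0, dot / norms))  # clamp against rounding error
    return math.degrees(math.acos(cosine))


def edit_distance(a, b):
    """Insert/delete-only edit distance: len(a) + len(b) - 2 * LCS(a, b)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    return len(a) + len(b) - 2 * prev[-1]


def hamming_distance(x, y):
    if len(x) != len(y):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(x, y))
```

For example, `edit_distance("abcde", "acfdeg")` is 3 (delete `b`, insert `f`, insert `g`), and `hamming_distance("10101", "11110")` is 3.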
5.2 Clustering Algorithms
Example: You need to divide customers into segments based on their buying behavior.
- CURE Algorithm: Represents each cluster by a few well-scattered representative points and merges clusters whose representatives are close, so it can handle clusters with complex shapes.
- Stream-Computing: In the stream-computing model, points arrive continuously and the clusters must be kept up to date over the most recent points, so you can see results right away.
- A Stream-Clustering Algorithm: The algorithm stores the stream's points in buckets and keeps a summary of the clusters inside each bucket.
- Initializing & Merging Buckets: When new data arrives it goes into a new bucket, which is clustered on its own; when there are too many buckets of the same size, the older buckets are merged and their clusters are combined.
- Answering Queries: You can run queries such as “Which customers in this segment buy the most?” by pooling the clusters from the buckets that cover the recent points you care about.
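The bucket-merging step can be sketched under some loud assumptions: points live in Euclidean space, each cluster is summarized as a `(count, centroid)` pair, and clusters from two buckets are paired greedily by nearest centroid. The notes do not specify any of this, and both function names are made up for illustration.

```python
import math


def merge_clusters(a, b):
    """Merge two cluster summaries, each given as (count, centroid)."""
    count_a, centroid_a = a
    count_b, centroid_b = b
    total = count_a + count_b
    # The merged centroid is the count-weighted average of the two centroids.
    centroid = tuple(
        (count_a * x + count_b * y) / total for x, y in zip(centroid_a, centroid_b)
    )
    return total, centroid


def merge_buckets(older, newer):
    """Combine two buckets' cluster lists (assumption: greedy nearest-centroid pairing)."""
    remaining = list(older)
    merged = []
    for cluster in newer:
        if not remaining:
            merged.append(cluster)
            continue
        nearest = min(remaining, key=lambda c: math.dist(c[1], cluster[1]))
        remaining.remove(nearest)
        merged.append(merge_clusters(nearest, cluster))
    return merged + remaining
```

Because only counts and centroids are stored, merging never needs the original points. That is what lets old data be summarized and then discarded.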
06. Real-Time Big Data Models
6.1 PageRank Overview
Example: You want to improve search results for Google.
- Efficient Computation of PageRank: You analyze the quantity and quality of the links pointing to each page and assign it a rank. In practice the ranks are computed by repeatedly multiplying a rank vector by the web's transition matrix until the values settle.
- PageRank Iteration Using MapReduce: Each iteration is run as a distributed job, so very large web graphs can be processed.
- Use of Combiners to Consolidate the Result Vector: A combiner adds up partial results on each machine before they are sent over the network, which makes the computation faster.
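A single-machine sketch of the PageRank iteration, assuming a damping factor `beta` of 0.85 and a graph small enough to fit in a dictionary. The function name, the defaults, and the choice to spread a dead end's rank evenly over all pages are illustration choices. This is not a MapReduce job, but the per-destination summing in the inner loop is the same kind of partial aggregation a combiner would do locally.

```python
def pagerank(links, beta=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    nodes = sorted(set(links) | {dst for dsts in links.values() for dst in dsts})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {node: (1 - beta) / n for node in nodes}
        for src in nodes:
            out_links = links.get(src, [])
            if out_links:
                share = beta * rank[src] / len(out_links)
                for dst in out_links:
                    new_rank[dst] += share  # sum contributions per destination
            else:
                # Dead end (no out-links): spread its rank evenly over all pages.
                for dst in nodes:
                    new_rank[dst] += beta * rank[src] / n
        rank = new_rank
    return rank
```

With `links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}`, page C ends up ranked above B, because C receives links from both A and B while B receives only half of A's vote.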
6.2 A Model for Recommendation Systems
Example: You need to recommend products on an e-commerce site.
- Content-Based Recommendations: You recommend products similar to what the user has bought or shown interest in before.
- Collaborative Filtering: You analyze users' behavior, find users similar to the target user, and recommend what those similar users liked.
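A minimal user-based collaborative filtering sketch. It assumes each user is described by the set of products they bought and that similarity between users is Jaccard similarity of those sets. The function name `recommend` and the sample data are made up for illustration.

```python
def recommend(target, purchases, top_n=3):
    """Suggest products bought by users whose purchase sets resemble the target's."""
    owned = purchases[target]
    scores = {}
    for user, items in purchases.items():
        if user == target:
            continue
        union = owned | items
        similarity = len(owned & items) / len(union) if union else 0.0
        if similarity == 0:
            continue  # nothing in common, so this user's taste tells us nothing
        for item in items - owned:
            # Each candidate product is weighted by how similar its buyer is.
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


purchases = {
    "asha": {"phone", "case"},
    "ben": {"phone", "case", "charger"},
    "chen": {"phone", "headphones"},
    "dia": {"novel"},
}
print(recommend("asha", purchases))  # ['charger', 'headphones']
```

Ben shares two of asha's purchases and chen shares one, so ben's extra item ranks first. Dia has nothing in common with asha and contributes nothing.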
6.3 Social Networks as Graphs
Example: You need to analyze a social network like Facebook.
- Clustering of Social-Network Graphs: You analyze the connections (edges) between nodes (users) and identify clusters (friend groups).
- Direct Discovery of Communities in a Social Graph: This technique finds sets of users who are densely connected to one another and reports each set as a community.
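One simple way to look for closely connected users is to list the maximal cliques, meaning groups in which everyone is friends with everyone else. The sketch below uses the Bron-Kerbosch procedure, which these notes do not name; it is swapped in here only as an illustration of finding tightly knit groups directly. The function name and the graph format (a dict of neighbor sets) are assumptions.

```python
def maximal_cliques(graph):
    """graph maps each user to the set of users they are connected to (undirected)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))  # nothing can be added: clique is maximal
            return
        for user in list(candidates):
            expand(current | {user}, candidates & graph[user], excluded & graph[user])
            candidates = candidates - {user}
            excluded = excluded | {user}

    expand(set(), set(graph), set())
    return sorted(cliques)


friends = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(maximal_cliques(friends))  # [['a', 'b', 'c'], ['c', 'd']]
```

Real communities are rarely perfect cliques, so treat this as a starting point. It also gets slow on large graphs, where the number of maximal cliques can grow very quickly.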
https://chatgpt.com/share/6707308c-86b4-800b-b7fc-2c2800bee326