04. Mining Data Streams
4.1 The Stream Data Model

Example: Suppose you are a data scientist at a social media platform. Users' posts and comments arrive in real time. That is a data stream, and you want to analyze it.

  • Data-Stream-Management System (DSMS): This system lets you manage data streams. For example, you could build a dashboard that tracks user engagement in real time.

  • Stream Sources: Data streams come from sources such as Twitter, Facebook, or sensors.

  • Stream Queries: You can run a query such as “Which post is the most popular right now?” and it is answered from the real-time data.

  • Issues in Stream Processing: Challenges include data loss, latency, and keeping the processing speed up with the rate at which data arrives.

4.2 Sampling Data Techniques in a Stream

Example: Suppose an online shopping site gets thousands of customer visits every day. You want to take a sample of customers and understand their preferences.

  • Sampling Techniques: You can use random sampling or reservoir sampling to obtain representative data. Reservoir sampling keeps a fixed-size sample of a stream whose length is not known in advance, with every item seen so far equally likely to be in the sample.
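
A minimal sketch of reservoir sampling, assuming the stream is any Python iterable. The function name `reservoir_sample` and the `rng` parameter are illustrative choices, not part of these notes.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item i (0-based) is kept with probability k / (i + 1),
            # replacing a uniformly chosen slot.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical usage: sample 5 customer IDs out of 10,000 visits.
sample = reservoir_sample(range(10_000), k=5, rng=random.Random(42))
```

Only `k` items are held in memory at any time, regardless of how long the stream runs.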

4.3 Filtering Streams

Example: You need to filter out unwanted spam messages arriving in a messaging app.

  • Bloom Filter: This is a space-efficient probabilistic data structure. You can build a filter that checks whether a message matches known spam, without storing the full data.

  • Analysis: Lookups are fast, but false positives occasionally occur, meaning some normal messages may be treated as spam. False negatives do not occur: anything added to the filter is always reported as present.
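
The following is a small sketch of a Bloom filter, under a few assumptions: the class name `BloomFilter`, its default sizes, and the use of seeded SHA-256 as the hash family are illustrative choices, and one byte is used per bit to keep the code short.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit for readability; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical usage: remember known spam texts, then test new messages.
spam_filter = BloomFilter()
spam_filter.add("win a free prize now")
is_suspect = spam_filter.might_contain("win a free prize now")
```

A `True` result should be read as "possibly spam", since unrelated messages can collide on the same bit positions.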

4.4 Counting Distinct Elements in a Stream

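Count-Distinct Problem: Suppose you need to count the unique visitors in a data stream.

  • Flajolet-Martin Algorithm: This algorithm estimates the number of unique elements. It hashes each element to a bit string and tracks R, the largest number of trailing zeros seen in any hash value; the estimate is 2^R.

  • Combining Estimates: A single hash function gives a noisy estimate, so you use several hash functions and combine their estimates, for example by averaging within small groups and then taking the median of the group averages, to get a more accurate count.

  • Space Requirements: The algorithm is memory-efficient because it stores only one small integer per hash function, but it gives an approximation, not an exact count.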
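
A rough sketch of the Flajolet-Martin idea, assuming seeded SHA-256 stands in for a family of independent hash functions. The names `trailing_zeros` and `fm_estimate` and the group sizes are hypothetical.

```python
import hashlib
import statistics


def trailing_zeros(value, width=32):
    """Number of zero bits at the low end of value (width if value is 0)."""
    if value == 0:
        return width
    return (value & -value).bit_length() - 1


def fm_estimate(stream, num_groups=8, group_size=4):
    """Estimate the number of distinct items in stream."""
    num_hashes = num_groups * group_size
    max_tail = [0] * num_hashes  # one small integer per hash function
    for item in stream:
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            value = int.from_bytes(digest[:4], "big")
            max_tail[seed] = max(max_tail[seed], trailing_zeros(value))
    estimates = [2 ** r for r in max_tail]
    # Average within each group, then take the median of the group averages.
    group_means = [
        statistics.mean(estimates[g * group_size:(g + 1) * group_size])
        for g in range(num_groups)
    ]
    return statistics.median(group_means)


# Hypothetical usage: repeated visitors do not change the estimate.
visits = ["u1", "u2", "u3", "u2", "u1", "u4"]
approx_unique = fm_estimate(visits)
```

Because only the maximum tail length per hash function is stored, seeing the same visitor again leaves the state unchanged.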

4.5 Counting Frequent Items in a Stream

Example: You need to find the most popular songs in a music streaming app.

  • Sampling Methods for Streams: You can take samples of the stream to see which songs are being played frequently.

  • Frequent Itemsets in Decaying Windows: Instead of a fixed time window, a decaying window weights each play by how recent it is, so the analysis focuses on recently played songs.
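
One way to picture a decaying window is the sketch below. It assumes an exponentially decaying score per song: on every arrival all scores are multiplied by (1 - c), the arriving song gains 1, and scores below a threshold are dropped. The function name `decayed_scores` and the values of `c` and `threshold` are illustrative.

```python
def decayed_scores(stream, c=0.01, threshold=0.5):
    """Track exponentially decayed play counts for items in a stream."""
    scores = {}
    for item in stream:
        # Age every tracked score and forget the ones that faded away.
        for key in list(scores):
            scores[key] *= (1 - c)
            if scores[key] < threshold:
                del scores[key]
        # The item that just arrived gets full weight.
        scores[item] = scores.get(item, 0.0) + 1.0
    return scores


# Hypothetical usage: an old hit followed by a new hit.
plays = ["old_song"] * 50 + ["new_song"] * 50
scores = decayed_scores(plays, c=0.1)
```

This version touches every tracked song on each arrival, which is fine for a sketch but would need a lazier update for a busy stream.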

4.6 Counting Ones in a Window

Example: Suppose you have a sensor that sends a temperature reading every second, and each reading is turned into a bit: 1 if it is above a threshold, 0 otherwise.

  • The Cost of Exact Counts: If you need an exact count of the 1s in the last N bits, you have to store the whole window, which uses a lot of memory.

  • DGIM Algorithm: This algorithm efficiently estimates how many 1s are in the window. Instead of storing every bit, it summarizes the window as a small number of buckets whose sizes are powers of two.

  • Query Answering in the DGIM Algorithm: You can run real-time queries such as “How many readings in the last 10 seconds were above the threshold?” The answer is an estimate with an error of at most 50%.

  • Decaying Windows: A decaying window gives older readings less weight, so that recent readings matter more.
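
A compact sketch of DGIM-style bucket maintenance, assuming the input is a stream of 0/1 bits (for example, a flag that a reading exceeded some threshold). The class name `DGIM` and its methods are hypothetical.

```python
class DGIM:
    def __init__(self, window_size):
        self.window_size = window_size
        self.time = 0
        self.buckets = []  # (end_timestamp, size) pairs, newest first

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its end leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window_size:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.time, 1))
        # Whenever three buckets share a size, merge the two oldest of them.
        i = 0
        while i + 2 < len(self.buckets):
            a, b, c = self.buckets[i:i + 3]
            if a[1] == b[1] == c[1]:
                # The merged bucket keeps the newer end time, doubles the size.
                self.buckets[i + 1:i + 3] = [(b[0], b[1] * 2)]
            i += 1

    def count_ones(self, k=None):
        """Estimate the number of 1s among the last k bits (k <= window_size)."""
        k = self.window_size if k is None else k
        cutoff = self.time - k
        total, oldest_size = 0, 0
        for end, size in self.buckets:
            if end <= cutoff:
                break
            total += size
            oldest_size = size
        # Only half of the oldest overlapping bucket is counted.
        return total - oldest_size / 2


# Hypothetical usage: two out of every three bits are 1.
bits = [1 if i % 3 else 0 for i in range(500)]
dgim = DGIM(window_size=100)
for b in bits:
    dgim.add(b)
exact = sum(bits[-100:])
estimate = dgim.count_ones()
```

The number of buckets grows with the logarithm of the window size, which is where the memory saving over an exact count comes from.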

05. Finding Similar Items and Clustering
5.1 Distance Measures

Example: Suppose you need to recommend similar movies.

  • Definition of a Distance Measure: It measures how far apart two items (movies) are; the smaller the distance, the more similar the items.

  • Euclidean Distance: If the movies are plotted as points by their ratings, this is the straight-line distance between them, showing how close the two movies' ratings are.

  • Jaccard Distance: Used when you need to compare two sets, such as two users' movie preferences. It is 1 minus the size of the intersection divided by the size of the union.

  • Cosine Distance: This is the angle between two vectors, used when each movie has a feature vector (such as genre and actors).

  • Edit Distance: To compare two strings (such as movie titles), it gives the number of edits (insertions, deletions) needed to convert one string into the other.

  • Hamming Distance: Used for binary strings; it counts the number of positions at which the values differ.
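
The measures listed above can be sketched in a few lines each. The function names are illustrative; the edit distance shown allows only insertions and deletions, as in the bullet above, and is computed from the longest common subsequence.

```python
import math


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def jaccard_distance(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)


def cosine_distance(x, y):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if norms == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norms))))


def edit_distance(s, t):
    """Insertions plus deletions needed to turn s into t."""
    prev = [0] * (len(t) + 1)  # LCS lengths for the previous row
    for a in s:
        curr = [0]
        for j, b in enumerate(t, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return len(s) + len(t) - 2 * prev[-1]


def hamming_distance(x, y):
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(a != b for a, b in zip(x, y))
```

For example, `edit_distance("abcde", "acfdeg")` is 3 (delete `b`, insert `f` and `g`), and `hamming_distance("10101", "11110")` is 3.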

5.2 Clustering Algorithms

Example: You need to divide customers into segments based on their buying behavior.

  • CURE Algorithm: This algorithm represents each cluster by a set of scattered representative points and merges clusters whose representatives are close, so it can handle clusters with complex shapes.

  • Stream-Computing: Clustering can also be applied to real-time data streams, so that you see results immediately.

  • A Stream-Clustering Algorithm: In this algorithm, the stream's points are stored in buckets, and clusters are then created within each bucket.

  • Initializing & Merging Buckets: As new data arrives, new buckets are created and existing buckets are merged, with their clusters combined into new clusters.

  • Answering Queries: You can run queries such as “Which customers in this segment buy the most?”
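
For the CURE bullet, the sketch below shows only one step: choosing scattered representative points for a single cluster and moving them a fraction of the way toward the centroid. The function name `cure_representatives`, the greedy farthest-point selection, and the default `shrink` value are assumptions made for illustration; points are assumed to be tuples of numbers.

```python
import math


def cure_representatives(points, num_reps=4, shrink=0.2):
    """Pick scattered points from one cluster and shrink them to the centroid."""
    centroid = tuple(sum(coord) / len(points) for coord in zip(*points))
    unique = list(dict.fromkeys(points))  # drop duplicates, keep order
    # Start with the point farthest from the centroid.
    reps = [max(unique, key=lambda p: math.dist(p, centroid))]
    while len(reps) < min(num_reps, len(unique)):
        candidates = [p for p in unique if p not in reps]
        # Next representative: farthest from the ones already chosen.
        reps.append(
            max(candidates, key=lambda p: min(math.dist(p, r) for r in reps))
        )
    # Move each representative a fraction `shrink` toward the centroid.
    return [
        tuple(r + shrink * (c - r) for r, c in zip(rep, centroid))
        for rep in reps
    ]


# Hypothetical usage: a square of customers plus one in the middle.
cluster = [(0, 0), (0, 2), (2, 0), (2, 2), (1, 1)]
reps = cure_representatives(cluster, num_reps=4, shrink=0.5)
```

Two clusters would then be merged when some pair of their representatives is close enough, which lets elongated or curved clusters join up.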

06. Real-Time Big Data Models
6.1 PageRank Overview

Example: You want to improve search results for Google.

  • Efficient Computation of PageRank: You analyze the quantity and quality of the links pointing to each page and assign a rank accordingly.

  • PageRank Iteration Using MapReduce: Each iteration is a matrix-vector multiplication that can be run with distributed computing, so large web graphs can be processed.

  • Use of Combiners to Consolidate the Result Vector: Combiners add up intermediate results on each machine before they are sent on, which makes the computation faster.
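
A single-machine sketch of the PageRank iteration, assuming a damping factor `beta` of 0.85, a graph given as a dictionary of outgoing links, and no dead ends (every page has at least one outgoing link and every target is a key). The function name and the tiny graph are illustrative.

```python
def pagerank(links, beta=0.85, iterations=50):
    """Power iteration for PageRank on a graph with no dead ends."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page starts with the share handed out by random jumps.
        new_rank = {node: (1.0 - beta) / n for node in nodes}
        for src, targets in links.items():
            # "Map" step: a page splits its rank evenly over its out-links.
            share = beta * rank[src] / len(targets)
            for dst in targets:
                # "Reduce" step: contributions to the same page are summed.
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical four-page web graph.
web = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "C"],
}
ranks = pagerank(web)
```

In a MapReduce setting the inner loop would emit `(dst, share)` pairs, and a combiner would sum the pairs for the same `dst` on each worker before the shuffle.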

6.2 A Model for Recommendation Systems

Example: You need to recommend products on an e-commerce site.

  • Content-Based Recommendations: You recommend products similar to the user's past purchases or interests.

  • Collaborative Filtering: You analyze users' behavior and recommend what similar users liked.
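
A small sketch of user-based collaborative filtering, assuming ratings are stored as a dictionary per user and that cosine similarity is the similarity measure. The user names, products, and function names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse rating dictionaries."""
    dot = sum(a[item] * b[item] for item in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=2):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical purchase ratings on a 1-5 scale.
ratings = {
    "asha": {"laptop": 5, "mouse": 4},
    "ben": {"laptop": 5, "mouse": 5, "keyboard": 4},
    "chen": {"novel": 5, "bookmark": 4},
}
suggestions = recommend(ratings, "asha")
```

Here `asha` and `ben` rate the same products similarly, so `ben`'s extra purchase is suggested, while `chen` shares no products with `asha` and is ignored.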

6.3 Social Networks as Graphs

Example: You need to analyze a social network like Facebook.

  • Clustering of Social-Network Graphs: You analyze the connections between nodes (users) and identify clusters (groups of friends).

  • Direct Discovery of Communities in a Social Graph: With this technique you identify which users are closely connected and group them into communities.
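
One common way to cluster a social graph is to remove the edges with the highest betweenness, since those tend to bridge separate groups. The notes do not name a specific method, so the sketch below is an assumption: it computes edge betweenness with a breadth-first search from every node, on a made-up graph stored as a dictionary of neighbour sets.

```python
from collections import deque


def edge_betweenness(graph):
    """Edge betweenness for an undirected, unweighted graph."""
    scores = {frozenset((u, v)): 0.0 for u in graph for v in graph[u]}
    for source in graph:
        dist = {source: 0}
        num_paths = {source: 1}
        parents = {node: [] for node in graph}
        order = []
        queue = deque([source])
        while queue:  # BFS that also counts shortest paths
            node = queue.popleft()
            order.append(node)
            for nb in graph[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    num_paths[nb] = 0
                    queue.append(nb)
                if dist[nb] == dist[node] + 1:
                    num_paths[nb] += num_paths[node]
                    parents[nb].append(node)
        # Push credit from the farthest nodes back toward the source.
        credit = {node: 1.0 for node in order}
        for node in reversed(order):
            for parent in parents[node]:
                share = credit[node] * num_paths[parent] / num_paths[node]
                scores[frozenset((parent, node))] += share
                credit[parent] += share
    # Every shortest path was counted once from each of its two ends.
    return {edge: value / 2 for edge, value in scores.items()}


# Hypothetical friendship graph: two tight groups joined by the edge B-D.
friends = {
    "A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"},
    "D": {"B", "E", "F", "G"}, "E": {"D", "F"},
    "F": {"D", "E", "G"}, "G": {"D", "F"},
}
betweenness = edge_betweenness(friends)
bridge = max(betweenness, key=betweenness.get)
```

Removing the highest-scoring edge splits this graph into the two friend groups, which is the intuition behind betweenness-based clustering.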

Team Posted new comment October 10, 2024