BDA – short mini example notes

Team October 10, 2024 1 Comment

04. Mining Data Streams
4.1 The Stream Data Model

Example: Suppose you are a data scientist at a social media platform. Users' posts and comments arrive in real time. This is a data stream that you want to analyze.

Data-Stream-Management System (DSMS): This system lets you manage data streams. For example, you can build a dashboard that tracks user engagement in real time.
Stream Sources: Data streams come from sources such as Twitter, Facebook, or sensors.
Stream Queries: You can run a query such as “Which post is the most popular right now?” and it is answered from the real-time data.
Issues in Stream Processing: Challenges include data loss, latency, and keeping the processing speed up with the rate at which data arrives.

4.2 Sampling Data Techniques in a Stream

Example: Suppose an online shopping site gets thousands of customer visits every day. You want to take a sample of those customers and understand their preferences.

Sampling Techniques: You can use random sampling or reservoir sampling to obtain a representative sample without storing the whole stream.
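
As a rough illustration of reservoir sampling, the sketch below keeps a fixed-size uniform sample from a stream whose length is not known in advance. The function name `reservoir_sample`, the customer IDs, and the seed are all made up for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i+1 replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stream of customer IDs; only 5 are ever held in memory.
customers = (f"customer_{n}" for n in range(10_000))
sample = reservoir_sample(customers, k=5, rng=random.Random(42))
print(sample)
```

Passing a seeded `random.Random` only makes the demo repeatable; in practice the default generator is enough.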

4.3 Filtering Streams

Example: You need to filter out unwanted spam messages arriving in a messaging app.

Bloom Filter: This is a space-efficient probabilistic data structure. You can build a filter that checks whether a message is spam without storing the full data.
Analysis: Lookups are fast, but false positives can occasionally occur, meaning some normal messages may also be treated as spam. False negatives do not occur: anything added to the filter is always reported as present.
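
A minimal Bloom filter sketch, assuming the filter stores known spam sender addresses. The class name, the bit-array size, the number of hash functions, and the example addresses are illustrative choices, and the "several hash functions" are simulated by salting SHA-256 with a seed.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive num_hashes positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


spam_senders = BloomFilter()
for address in ["offers@spam.example", "win@lottery.example"]:
    spam_senders.add(address)

print(spam_senders.might_contain("offers@spam.example"))
print(spam_senders.might_contain("friend@mail.example"))
```

A larger bit array or a better-tuned number of hash functions lowers the false-positive rate at the cost of memory or lookup time.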

4.4 Counting Distinct Elements in a Stream

Count-Distinct Problem: Suppose you need to count the unique visitors in streaming data.

Flajolet-Martin Algorithm: This algorithm estimates the number of unique elements. It hashes each element to a bit string and tracks the longest run of trailing zeros R seen so far; the estimate is 2^R.
Combining Estimates: A single hash function gives a noisy estimate, so you use several hash functions and combine their estimates, for example by averaging within small groups and taking the median of the group averages.
Space Requirements: The algorithm is memory-efficient because it keeps only one small integer per hash function, but it gives an approximation, not an exact count.
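
The idea can be sketched as follows. This is a simplified version for illustration: the hash functions are simulated by salting SHA-256, and the number of hash functions, the group size, and the visitor IDs are arbitrary assumptions.

```python
import hashlib
import statistics


def trailing_zeros(n, width=32):
    """Number of trailing zero bits in n (width if n is 0)."""
    if n == 0:
        return width
    return (n & -n).bit_length() - 1


def fm_estimate(stream, num_hashes=64, group_size=8):
    """Estimate the number of distinct items using 2**R per hash function."""
    max_tail = [0] * num_hashes  # one small integer per hash function
    for item in stream:
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            value = int.from_bytes(digest[:4], "big")
            max_tail[seed] = max(max_tail[seed], trailing_zeros(value))
    estimates = [2 ** r for r in max_tail]
    # Average within groups, then take the median of the group averages.
    groups = [estimates[i:i + group_size] for i in range(0, num_hashes, group_size)]
    return statistics.median(sum(g) / len(g) for g in groups)


# Hypothetical stream: 2000 visits from 500 distinct visitors.
visits = [f"visitor_{n % 500}" for n in range(2000)]
print("estimated distinct visitors:", fm_estimate(visits))
```

Repeated visits do not change the result, because only the maximum tail length per hash function is stored.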

4.5 Counting Frequent Items in a Stream

Example: You need to count the most popular songs in a music streaming app.

Sampling Methods for Streams: You can take a sample of the plays and see which songs appear frequently in it.
Frequent Itemsets in Decaying Windows: Instead of counting over all time, you give recent plays more weight so that older plays fade out and the counts focus on recently played songs.
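
One way to picture a decaying window for popular songs is the sketch below: every arrival multiplies all scores by (1 - c), the arriving song gains 1, and scores below a threshold are dropped. The decay constant `c`, the threshold of 0.5, and the song names are assumptions chosen for the demo.

```python
def decayed_counts(stream, c=0.01, threshold=0.5):
    """Exponentially decayed play counts; small scores are dropped."""
    scores = {}
    for item in stream:
        # Every existing score fades a little on each arrival.
        for key in scores:
            scores[key] *= 1 - c
        # The arriving item gets full weight.
        scores[item] = scores.get(item, 0.0) + 1.0
        # Forget items whose score has decayed below the threshold.
        scores = {k: v for k, v in scores.items() if v >= threshold}
    return scores


# Hypothetical play log: "old_hit" was popular earlier, "new_hit" is popular now.
plays = ["old_hit"] * 50 + ["new_hit"] * 50
scores = decayed_counts(plays, c=0.1)
print(scores)
```

A smaller `c` makes the window longer, so older plays keep influencing the scores for more time.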

4.6 Counting Ones in a Window

Example: Suppose a sensor sends a temperature reading every second, and each reading is reduced to one bit: 1 if the temperature is above a threshold, 0 otherwise.

The Cost of Exact Counts: If you need exact counts of the 1s in the window, you have to store the entire window, which takes a lot of memory.
DGIM Algorithm: This algorithm counts the 1s in the window approximately using far less memory. It groups the 1s into buckets whose sizes are powers of two and keeps only a timestamp and a size per bucket.
Query Answering in the DGIM Algorithm: You can run real-time queries such as “How many of the last 10 readings were above the threshold?” The answer is approximate, with an error of at most 50%.
Decaying Windows: Instead of a fixed window, older readings are given exponentially smaller weights, so recent readings matter more.
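
The sketch below is a simplified DGIM counter, assuming the stream has already been reduced to bits. The class and method names are made up for this example, and the half-bucket correction is rounded down so the result stays an integer.

```python
class DGIM:
    """Approximate count of 1s in the last `window_size` bits of a stream."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.time = 0
        self.buckets = []  # (end_time, size) pairs, newest first

    def add(self, bit):
        self.time += 1
        # Drop buckets whose most recent 1 has left the window.
        while self.buckets and self.buckets[-1][0] <= self.time - self.window_size:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.time, 1))
        # Whenever three buckets share a size, merge the two oldest of them.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            end_time, size = self.buckets[i + 1]
            self.buckets[i + 1:i + 3] = [(end_time, 2 * size)]
            i += 1

    def count_ones(self, k=None):
        """Estimate the number of 1s among the last k bits (k <= window_size)."""
        k = self.window_size if k is None else k
        cutoff = self.time - k
        sizes = [size for end_time, size in self.buckets if end_time > cutoff]
        if not sizes:
            return 0
        # Count every bucket fully except the oldest, which counts for half.
        return sum(sizes) - sizes[-1] // 2


# Hypothetical bit stream: 1 means "reading above threshold".
bits = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1] * 20
counter = DGIM(window_size=50)
for b in bits:
    counter.add(b)
print("estimate:", counter.count_ones(25), "exact:", sum(bits[-25:]))
```

Only a handful of (timestamp, size) pairs are stored, so memory grows with the logarithm of the window size instead of the window size itself.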

05. Finding Similar Items and Clustering
5.1 Distance Measures

Example: Suppose you need to recommend similar movies.

Definition of a Distance Measure: A function that says how far apart two items (movies) are; the smaller the distance, the more similar they are. It must be non-negative, zero only for identical items, symmetric, and satisfy the triangle inequality.
Euclidean Distance: If two movies are plotted as points by their ratings, this is the straight-line distance between them, showing how close the ratings are.
Jaccard Distance: Used to compare two sets, such as two users' movie preferences. It is one minus the size of the intersection divided by the size of the union.
Cosine Distance: The angle between two feature vectors, for example when each movie is described by features such as genre and actors.
Edit Distance: Used to compare two strings (such as movie titles). It is the number of edits (insertions, deletions) needed to convert one string into the other.
Hamming Distance: Used for equal-length strings, typically binary ones. It counts the number of positions at which the values differ.
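
The five measures can be sketched in a few lines each. The function names and sample inputs are illustrative; cosine distance is returned as an angle in degrees, and edit distance here allows insertions and deletions only, computed through the longest common subsequence.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def jaccard_distance(a, b):
    a, b = set(a), set(b)
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0


def cosine_distance(x, y):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.hypot(*x) * math.hypot(*y)
    # Clamp to [-1, 1] to guard against floating-point drift.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norms))))


def lcs_length(x, y):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_distance(x, y):
    """Insertions plus deletions needed to turn x into y."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


def hamming(x, y):
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(a != b for a, b in zip(x, y))


print(euclidean((0, 0), (3, 4)))
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))
print(edit_distance("abcde", "acfdeg"))
print(hamming("10101", "11110"))
```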

5.2 Clustering Algorithms

Example: You need to divide customers into segments based on their buying behavior.

CURE Algorithm: This algorithm represents each cluster by a small set of well-scattered representative points, moved part of the way toward the cluster's centroid, and merges clusters whose representatives are close. This lets it handle clusters with complex shapes.
Stream-Computing: Here the points arrive as a real-time stream, and you want to see results about the most recent points right away.
A Stream-Clustering Algorithm: In this algorithm the stream is divided into buckets, each bucket stores a summary of its points, and clusters are created per bucket.
Initializing & Merging Buckets: When new data arrives, a new bucket is created and clustered; existing buckets are then merged as needed, and their clusters are combined into new clusters.
Answering Queries: You can run queries such as “Which customers in this segment buy the most?”

06. Real-Time Big Data Models
6.1 PageRank Overview

Example: You want to improve search results for Google.

Efficient Computation of PageRank: You analyze the quantity and quality of the links pointing to each page and assign a rank accordingly. A link from a highly ranked page counts for more than a link from a low-ranked one.
PageRank Iteration Using MapReduce: Each iteration is run as a distributed computation, so very large web graphs can be processed.
Use of Combiners to Consolidate the Result Vector: A combiner consolidates intermediate results on each node before they are sent on, which cuts communication and makes the computation faster.
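
A single-machine sketch of the PageRank iteration is shown below. It is not a MapReduce job: the dictionary that accumulates contributions per destination page only imitates what a combiner followed by a reducer would do. The damping factor 0.85, the iteration count, and the three-page graph are assumptions, and every link target is assumed to appear as a key in `links`.

```python
def pagerank(links, beta=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # "Map" step: each page sends rank / out-degree along each link.
        # Summing per destination plays the role of combiner plus reducer.
        incoming = {node: 0.0 for node in nodes}
        for src, dests in links.items():
            if dests:
                share = rank[src] / len(dests)
                for dest in dests:
                    incoming[dest] += share
            else:
                # Dead end: spread its rank evenly over all pages.
                for dest in nodes:
                    incoming[dest] += rank[src] / n
        # Taxation: with probability 1 - beta the surfer jumps to a random page.
        rank = {node: beta * incoming[node] + (1 - beta) / n for node in nodes}
    return rank


# Hypothetical tiny web graph.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In a real distributed run, summing contributions locally before shuffling is what keeps the amount of data sent between machines small.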

6.2 A Model for Recommendation Systems

Example: You need to recommend products on an e-commerce site.

Content-Based Recommendations: You recommend products similar to the ones a user has bought or shown interest in before.
Collaborative Filtering: You analyze users' behavior and recommend to a user the items chosen by other users who are similar to them.
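
A small user-based collaborative filtering sketch, assuming similarity between users is measured with Jaccard similarity on their purchase sets. The user names, products, and scoring rule (sum of similarities of the users who bought an item) are invented for illustration.

```python
def jaccard_similarity(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, purchases, top_n=3):
    """Suggest items bought by users whose purchases overlap with the target's."""
    owned = purchases[target]
    scores = {}
    for user, items in purchases.items():
        if user == target:
            continue
        similarity = jaccard_similarity(owned, items)
        if similarity == 0:
            continue  # users with nothing in common carry no signal
        for item in items - owned:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Hypothetical purchase history.
purchases = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse", "monitor"},
    "carol": {"laptop", "keyboard", "monitor", "webcam"},
    "dave": {"novel", "bookmark"},
}
print(recommend("alice", purchases))
```

A content-based recommender would instead compare product features with the user's own history, without looking at other users.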

6.3 Social Networks as Graphs

Example: You need to analyze a social network like Facebook.

Clustering of Social-Network Graphs: You analyze the connections between nodes (users) and identify clusters (groups of friends).
Direct Discovery of Communities in a Social Graph: With this technique you identify which users are closely connected and group them into communities.

Team commented October 10, 2024

https://chatgpt.com/share/6707308c-86b4-800b-b7fc-2c2800bee326