This repository contains scripts to manage and analyze social media data stored in an Amazon Neptune graph database. The system synchronizes Twitter following data, calculates influence scores, and performs routine cleanups.
The system consists of three main scripts:
- updateAccounts.py: Syncs Twitter following data into the graph database.
- calculateInfluenceScore.py: Computes influence scores using a custom PageRank-like algorithm.
- performRoutineCleanUp.py: Removes low-influence accounts from the graph.
- Purpose (updateAccounts.py): Fetches an influencer's Twitter followings and updates the graph database.
- Key Features:
- Connects to PostgreSQL (for tracking collections) and Neptune (graph DB).
- Uses Twitter API v2 to retrieve followings with pagination and rate limit handling.
- Adds/updates nodes (accounts/collections) and edges (`follows` relationships).
- Tracks progress using `progress.txt` to resume after interruptions.
- Skips accounts with >5M followers (e.g., celebrities like Ronaldo).
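
The pagination, rate-limit, and follower-cap behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code in updateAccounts.py: `iter_followings` and the injected `fetch_page` callable are hypothetical names, and the sketch assumes the Twitter API v2 response shape (`data`, `meta.next_token`, `public_metrics.followers_count`) and the `x-rate-limit-reset` header on HTTP 429 responses.

```python
import time
from typing import Callable, Dict, Iterator, Optional, Tuple

MAX_FOLLOWERS = 5_000_000  # accounts above this are skipped, per the feature list

# Hypothetical transport signature (an assumption for this sketch):
# (user_id, pagination_token) -> (status_code, headers, json_body)
FetchPage = Callable[[str, Optional[str]], Tuple[int, Dict[str, str], dict]]


def iter_followings(
    user_id: str,
    fetch_page: FetchPage,
    start_token: Optional[str] = None,
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.time,
) -> Iterator[dict]:
    """Yield followed accounts page by page, waiting out HTTP 429 responses."""
    token = start_token
    while True:
        status, headers, body = fetch_page(user_id, token)
        if status == 429:
            # x-rate-limit-reset holds the epoch second at which the window reopens.
            reset_at = float(headers.get("x-rate-limit-reset", now() + 60))
            sleep(max(reset_at - now(), 1.0))
            continue  # retry the same page
        if status != 200:
            raise RuntimeError(f"unexpected status {status} for user {user_id}")
        for account in body.get("data", []):
            followers = account.get("public_metrics", {}).get("followers_count", 0)
            if followers > MAX_FOLLOWERS:
                continue  # skip very large accounts
            yield account
        token = body.get("meta", {}).get("next_token")
        if token is None:
            return  # last page reached
```

Passing `start_token` lets a caller resume from a saved pagination token instead of starting at the first page.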
- Purpose (calculateInfluenceScore.py): Computes influence scores for non-influencer nodes.
- Key Features:
- Implements a simplified PageRank algorithm:
  - Score = Σ (Inbound node's score / Outbound degree of inbound node).
- Updates scores iteratively (configurable with `max_iterations`).
- Avoids influencers (nodes marked as `type=influencer`).
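
As a rough illustration of the score formula above, the sketch below runs the update on an in-memory edge list rather than on Neptune. The function name `compute_scores`, the fixed seed score of 1.0 for influencers, and the starting score of 0.0 for other nodes are assumptions made for this example; the README does not state how calculateInfluenceScore.py initializes scores. Like the formula above, the sketch uses no damping factor.

```python
from typing import Dict, Iterable, List, Set, Tuple


def compute_scores(
    edges: Iterable[Tuple[str, str]],
    influencers: Set[str],
    max_iterations: int = 10,
    influencer_score: float = 1.0,  # assumed seed value, not from the README
) -> Dict[str, float]:
    """Score = sum(inbound score / inbound node's out-degree); influencers stay fixed."""
    out_degree: Dict[str, int] = {}
    inbound: Dict[str, List[str]] = {}
    nodes: Set[str] = set(influencers)
    for src, dst in edges:  # an edge (src, dst) means "src follows dst"
        nodes.update((src, dst))
        out_degree[src] = out_degree.get(src, 0) + 1
        inbound.setdefault(dst, []).append(src)

    scores = {n: (influencer_score if n in influencers else 0.0) for n in nodes}
    for _ in range(max_iterations):
        updated = dict(scores)
        for node in nodes:
            if node in influencers:
                continue  # influencer scores are never recomputed
            updated[node] = sum(
                scores[src] / out_degree[src] for src in inbound.get(node, [])
            )
        if updated == scores:
            break  # nothing changed, so further iterations are redundant
        scores = updated
    return scores
```

For example, with edges `inf1→a`, `inf1→b`, `inf2→a`, `a→c` and both `inf1` and `inf2` marked as influencers, `a` receives 1/2 + 1/1 = 1.5 and `b` receives 0.5 in the first pass, and `c` picks up `a`'s score in the second.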
- Purpose (performRoutineCleanUp.py): Removes low-impact accounts from the graph.
- Key Features:
- Deletes accounts with an influence score below the current average.
- Targets nodes marked as `type=account`.
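
The selection rule above can be illustrated on plain dictionaries. This is a sketch only: `select_accounts_to_remove` is a hypothetical helper, and it assumes the average is taken over `type=account` nodes alone, which the README does not specify.

```python
from statistics import mean
from typing import Dict, List


def select_accounts_to_remove(nodes: Dict[str, dict]) -> List[str]:
    """Return ids of type=account nodes whose score is strictly below the average."""
    account_scores = {
        node_id: props["score"]
        for node_id, props in nodes.items()
        if props.get("type") == "account"  # influencers and collections are untouched
    }
    if not account_scores:
        return []
    average = mean(account_scores.values())  # assumed: average over accounts only
    return sorted(node_id for node_id, s in account_scores.items() if s < average)
```

Because the threshold is the current average, each run removes only the accounts below it at that moment; rerunning after a cleanup recomputes the average over the remaining accounts.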
- Python 3.9+
- Libraries:
pip install nest-asyncio httpx psycopg2-binary gremlinpython python-dotenv
- Databases:
  - PostgreSQL: Stores tracked collections (`social_media` table).
  - Amazon Neptune: Graph database for storing accounts and relationships.
- Twitter API v2 Bearer Token (for `updateAccounts.py`).
- Create a `.env` file with:

      DB_NAME=your_db_name
      DB_USER=your_db_user
      DB_PASSWORD=your_db_password
      DB_HOST=your_db_host
      DB_PORT=your_db_port
      TWITTER_BEARER_TOKEN=your_twitter_bearer_token
- Ensure Neptune is running at `localhost:8182` (modify `host`/`port` in scripts if needed).
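
One way the settings above might be read and validated is sketched here. `load_settings` and `neptune_endpoint` are hypothetical helpers, not functions from the scripts; the sketch assumes the variables are already in the environment (for instance after calling `load_dotenv()` from python-dotenv) and that Neptune is reached over a `wss://` Gremlin endpoint.

```python
import os
from typing import Mapping

REQUIRED_KEYS = (
    "DB_NAME", "DB_USER", "DB_PASSWORD", "DB_HOST", "DB_PORT",
    "TWITTER_BEARER_TOKEN",
)


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the settings named in the .env file, failing early if any is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"missing settings: {', '.join(missing)}")
    return {
        # Keyword names match what psycopg2.connect(**settings["postgres"]) accepts.
        "postgres": {
            "dbname": env["DB_NAME"],
            "user": env["DB_USER"],
            "password": env["DB_PASSWORD"],
            "host": env["DB_HOST"],
            "port": int(env["DB_PORT"]),
        },
        "twitter_bearer_token": env["TWITTER_BEARER_TOKEN"],
    }


def neptune_endpoint(host: str = "localhost", port: int = 8182) -> str:
    """Build the Gremlin WebSocket URL (assumed wss scheme and /gremlin path)."""
    return f"wss://{host}:{port}/gremlin"
```

Keeping the host and port as parameters means a non-local Neptune endpoint needs a change in one place only.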
- Sync Data (Run first):
python updateAccounts.py
- Calculate Influence Scores (Run periodically):
python calculateInfluenceScore.py
- Clean Up Graph (Run after score updates):
python performRoutineCleanUp.py
- Order: Run scripts in sequence: `updateAccounts.py` → `calculateInfluenceScore.py` → `performRoutineCleanUp.py`.
- Rate Limits: `updateAccounts.py` handles Twitter API rate limits and daily caps.
- Security: SSL verification is disabled for Neptune connections (not recommended for production).
- Progress Tracking: `progress.txt` stores the last processed influencer and pagination token.
- Performance: A 3-second delay between influencers in `updateAccounts.py` ensures Neptune stability.
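
The progress-tracking note above could be implemented along these lines. The on-disk format of `progress.txt` is not documented here, so this sketch assumes a small JSON object; `save_progress` and `load_progress` are hypothetical names.

```python
import json
import os
from pathlib import Path
from typing import Optional, Tuple


def save_progress(path: str, influencer_id: str, pagination_token: Optional[str]) -> None:
    """Write the checkpoint to a temp file, then swap it in with os.replace."""
    tmp = Path(str(path) + ".tmp")
    tmp.write_text(json.dumps({
        "influencer_id": influencer_id,
        "pagination_token": pagination_token,
    }))
    os.replace(tmp, path)  # avoids leaving a half-written progress file


def load_progress(path: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (influencer_id, pagination_token), or (None, None) on a fresh start."""
    p = Path(path)
    if not p.exists():
        return None, None
    data = json.loads(p.read_text())
    return data.get("influencer_id"), data.get("pagination_token")
```

Saving after every page, rather than after every influencer, would let an interrupted run resume mid-pagination at the cost of more frequent writes.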
For questions or issues, contact the repository maintainer.