First Steps in Machine Learning: From Kaggle to Bittensor’s Masa Network
How you can use X and TikTok for something more than doom scrolling
So I'm diving into building my first basic Machine Learning model and have been looking for ways to incorporate this into A Bittensor Journey... Who knew that loading datasets and preparing data is so crucial to using Python for AI?
To a beginner, this sounds like the equivalent of an intern filing boxes in the office because they can't be trusted with anything else.
Now I realize I had that all wrong. And it's not boring at all.
Kaggle Training Wheels
Anyone who has an interest in Data Science and is using Kaggle to upskill or make connections will know of the Titanic competition, which feels like the longest-running competition in history and is a testament to its success.
Predicting passenger survival on that fateful journey offers a great chance to get started with predictive modeling and gain some practical experience in Machine Learning.
Today, I want to explore how foundational data skills from Kaggle translate to real-world, incentivized data work on Bittensor.
Masa’s Offer
In my Bittensor subnet research, I came across Subnet 42: Real-Time Data by Masa.
This immediately caught my interest: the subnet analyzes real-time data and aims to democratize access to specialized data, compute resources, and AI development tools. Masa is fueling a more equitable and fair AI ecosystem, which speaks to the why I outlined in my first post.
Masa scrapes and structures real-time data from social media platforms like X, Discord, and TikTok for AI training. This is the perfect opportunity to apply my basic data exploration skills and see how I can contribute to Masa's data quality.
Let's get stuck in.
Loading the Dataset
What's exciting at this stage is that any block of code that runs without returning an error is a boon to my confidence, so even importing the pandas library is cause for me to run around the room celebrating.
import pandas as pd

# load the Titanic training data into a DataFrame
df = pd.read_csv('/kaggle/input/titanic/train.csv')
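A quick sanity check that the load worked, peeking at the shape and the first few rows:

print(df.shape)  # (891, 12) for the Titanic training set
df.head()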
I also tried scraping a dataset from X using the Masa API. Here's an example of the data structure:
{ "Id": "1741580150234570760", "Content": "5/ The Masa zkData Network & Marketplace\n\nSet to launch in Q1 2024, Masa is solving the data privacy problem by empowering users to take control of their data, and monetize it to data consumers who train #AI models and build a variety of applications.\n\nhttps://t.co/6PFkp5bQg1", "Metadata": { "author": "", "conversation_id": "1741580142449979723", "created_at": "2023-12-31T22:00:39Z", "lang": "", "likes": 18, "newest_id": "", "oldest_id": "", "possibly_sensitive": false, "public_metrics": { "BookmarkCount": 0, "ImpressionCount": 0, "LikeCount": 0, "QuoteCount": 0, "ReplyCount": 0, "RetweetCount": 0 }, "tweet_id": 1741580150234570800, "user_id": "1419111693112676353", "username": "getmasafi" }, "Score": 1 },
and then loading it into a pandas DataFrame:
import pandas as pd

# read the scraped records (a JSON array of objects like the one above)
df = pd.read_json('masa_twitter_data.json')
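One wrinkle: because Metadata is nested, read_json leaves it as a column of dictionaries. A minimal sketch of an alternative, assuming the file holds a JSON array of records shaped like the example above, flattens everything with json_normalize:

import json
import pandas as pd

with open('masa_twitter_data.json') as f:
    records = json.load(f)

# nested fields become dotted columns, e.g. 'Metadata.likes'
df = pd.json_normalize(records)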
Exploratory Data Analysis
To understand the dataset, I perform some initial explorations: pulling up the basic information, summary statistics, and a count of missing values:
df.info()          # column names, dtypes, and non-null counts
df.describe()      # summary statistics for the numeric columns
df.isnull().sum()  # missing values per column
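For the Titanic data specifically, a couple of extra views are worth a look (a small sketch; run it before the encoding step in the next section, while the Sex column is still categorical):

df['Survived'].value_counts(normalize=True)  # class balance of the target
df.groupby('Sex')['Survived'].mean()         # survival rate by sex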
Data Cleaning
I then try to handle missing values, filling gaps in Age with the median, and encode categorical variables as numeric dummy columns:
For Titanic:
# fill missing ages with the median (assigning back avoids the deprecated chained inplace pattern)
df['Age'] = df['Age'].fillna(df['Age'].median())
# one-hot encode the categorical columns, dropping the first level of each to avoid redundancy
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
For Masa, I clean the text data by lowercasing it and stripping URLs, then remove duplicates and filter for relevant content (the filter is sketched after the snippet below):
df['Content'] = df['Content'].str.lower()  # normalize case
df['Content'] = df['Content'].str.replace(r'http\S+|www\S+|https\S+', '', regex=True)  # strip URLs
df.drop_duplicates(subset='Content', inplace=True)  # dedupe on the tweet text (the nested Metadata column isn't hashable)
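For the relevance filter, here is a minimal sketch of one possibility; the length threshold is an arbitrary assumption of mine, not anything Masa prescribes:

# drop rows whose cleaned text is too short to carry useful signal
df = df[df['Content'].str.strip().str.len() > 10]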
Model Training and Validation
I need to train a model and validate its performance, like using a Decision Tree Classifier or a more advanced Random Forest Classifier (survival is a yes/no outcome, so this is a classification problem rather than regression):
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# features and target, dropping free-text columns the model can't use directly
X = df.drop(['Survived', 'Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1)
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier()
model.fit(X_train, y_train)
In Masa, as a miner I could focus on data quality for validator scoring, checking the accuracy, relevance, and completeness of the data.
# evaluate on the held-out test set
accuracy = model.score(X_test, y_test)
print(f'Validator Score (Accuracy): {accuracy}')
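A single train/test split can be a noisy measure; cross-validation, sketched below on the same X and y as above, gives a steadier estimate:

from sklearn.model_selection import cross_val_score

# average accuracy over 5 different train/test folds
scores = cross_val_score(model, X, y, cv=5)
print(f'CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')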
To do this I would have to track validator feedback and look at ways of refining my data so it better meets their expectations. A little out of my league for now, but it's encouraging to see how Kaggle's foundational skills can directly apply to Bittensor's real-world data pipelines.
The added benefit of TAO token incentives makes this journey even more rewarding.
For more advanced cases, I would look to use Feature Engineering, deriving new predictive features from the existing columns, to improve model performance. But that's well beyond my skills for this week!
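For the curious, here's a classic Titanic example of what that might look like (a sketch I haven't tried myself): combining the sibling/spouse and parent/child counts into a single family-size feature:

# hypothetical engineered feature: total family members aboard, including the passenger
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1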
Is there anything you would add that would help or speed up my progress in Machine Learning, or any tips on using Python or the Bittensor SDK? I'm all ears, and thanks for reading.
If you’d like to receive new posts directly, subscribe here and forward to a friend who would appreciate it:
Until next week.
Cheers,
Brian