  1. #1
    Injured_Knee_Brah Auraria's Avatar
    Join Date: Jan 2016
    Location: New York, United States
    Age: 25
    Posts: 2,864
    Rep Power: 21392
    Auraria is offline

    Tensorflow/Keras brah GTFIH

I have a working Sequential model, but it seems like I'm getting worse results the more values I add to the training set.

It went from around 300k values to 4.4 million, and my accuracy increased from 78% to around 83% and loss decreased, but the actual predictions are way fukken off.

I know there are issues with overfitting a model, but can there be issues with giving it too much data?

I currently have my test size set to 0.9 (90%) of the data.

I don't have a ton of experience in Python; I just make tools for work and random apps for home use.
    Last edited by Auraria; 02-12-2020 at 04:42 AM.
    **Ret USAF Crew**
    **Cyber Security Crew**
    **Death Metal Crew**
    **Fishing Crew**
    **Firearms Crew**
    Silverback1996 "We all start some where brah, even the girthiest, veiniest, tree trunk of a cawk started out as just a sperm and an egg brah"

  2. #2
    Branned Didlid's Avatar
    Join Date: Oct 2012
    Posts: 7,027
    Rep Power: 8441
    Didlid is offline
Why is your test data 90% of all the data?

You're essentially giving it 10% of the data and hoping it knows what's going on in the other 90%. Maybe try an 80% training / 20% test split? Or 70/30. Typically you want to train on most of the data and test on the smaller chunk, not the other way around.

I mean, with ML there are a lot of variables as to why your results may not be as expected, but the first thing that sticks out is that you're using a 10/90 train/test split based on your post.
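
If you're on sklearn's train_test_split (just guessing, since you haven't posted code yet), this untested toy sketch is what I mean; test_size is the fraction held out for testing:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy rows of features
y = np.arange(10)                 # 10 toy labels
# test_size=0.2 keeps 80% of the rows for training, holds out 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 8 2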

  3. #3
    Injured_Knee_Brah Auraria's Avatar
    Join Date: Jan 2016
    Location: New York, United States
    Age: 25
    Posts: 2,864
    Rep Power: 21392
    Auraria is offline
    Originally Posted by Didlid View Post
Why is your test data 90% of all the data?

You're essentially giving it 10% of the data and hoping it knows what's going on in the other 90%. Maybe try an 80% training / 20% test split? Or 70/30. Typically you want to train on most of the data and test on the smaller chunk, not the other way around.

I mean, with ML there are a lot of variables as to why your results may not be as expected, but the first thing that sticks out is that you're using a 10/90 train/test split based on your post.
So here's my train split; from my understanding, test_size=0.9 meant 90% of the data was allocated for training?
    X_train, X_test, y_train, y_test = train_test_split(input_data, output_data, test_size=0.9, random_state=33)

EDIT: If I misunderstood that in the docs, I'll actually drop that to 0.3 then.

  4. #4
    Some idiot MrBourbon's Avatar
    Join Date: Aug 2014
    Posts: 17,805
    Rep Power: 376287
    MrBourbon is online now
    Originally Posted by Auraria View Post
So here's my train split; from my understanding, test_size=0.9 meant 90% of the data was allocated for training?
X_train, X_test, y_train, y_test = train_test_split(input_data, output_data, test_size=0.9, random_state=33)

EDIT: If I misunderstood that in the docs, I'll actually drop that to 0.3 then.
I don't know TensorFlow, but in scikit-learn that's a 90% test size and a 10% training size. Run it the other way around and see what you get.

What does your cross-validation look like?
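
A quick way to sanity-check what the split is doing is to print shapes on dummy data (untested sketch, sklearn assumed):

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.zeros((1000, 5)), np.zeros(1000)  # 1000 dummy rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.9)
print(X_train.shape, X_test.shape)  # (100, 5) (900, 5): 90% went to test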
    Smooth Seas don't make Strong Sailors. Keep your head up.

Blowing out someone else's candle won't make yours glow any brighter (no homo).

  5. #5
    Injured_Knee_Brah Auraria's Avatar
    Join Date: Jan 2016
    Location: New York, United States
    Age: 25
    Posts: 2,864
    Rep Power: 21392
    Auraria is offline
    Originally Posted by MrBourbon View Post
I don't know TensorFlow, but in scikit-learn that's a 90% test size and a 10% training size. Run it the other way around and see what you get.

What does your cross-validation look like?
Thanks brahs, you were both right; I misunderstood what the docs were telling me.

Swapped it to 0.2, and now training jumped from 53k to 426k samples per epoch.

Sorry, not entirely sure what cross-validation looks like from a Keras standpoint, but I do test the model against the dataset and compare it to the real values by doing the following:

input_data = data.drop("rating", axis=1)  # drop the rating column so the model can fill in its predicted value
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])  # how I compile the model (all the layers I've created are omitted here)
model.fit(X_train, np.array(y_train), epochs=200, batch_size=1000)  # this is where the training happens
scores = model.evaluate(X_test, y_test)  # evaluate the model on the held-out test set
print("\nAccuracy: %.2f%%" % (scores[1]*100))  # formatted accuracy

EDIT:

I have a separate section for prediction where I run the model against the test data, print the top 10 predicted values, and print the 10 real values next to them as a spot check.

prediction = model.predict(X_test)  # predicted probabilities for the X_test data
prediction1 = pd.DataFrame({'Good Url': prediction[:,0], 'Susp Url': prediction[:,1], 'Malicious Url': prediction[:,2]})  # DataFrame of prediction data with the url types as labels
pred = prediction1.round(decimals=4).head(10)  # prediction output; drop .head(10) for the full list
#pred_string = prediction1.round(decimals=4).to_csv('pred1.csv')


real_values = pd.DataFrame({'Good Url': y_test[:,0], 'Susp Url': y_test[:,1], 'Malicious Url': y_test[:,2]})  # pulling the values from the test data for comparison
real = real_values.head(10)  # real data output; drop .head(10) for the full list
#real_string = real_values.to_csv('real1.csv')

print("Machine Learning Predicted url:" + '\n' + str(pred) + '\n')
print("Real url:" + '\n' + str(real))
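
For a fuller check than eyeballing 10 rows, comparing the argmax of each prediction row against the argmax of its one-hot label should work too (untested sketch; assumes y_test is one-hot encoded, which is what categorical_crossentropy expects):

pred_classes = np.argmax(prediction, axis=1)  # index of the highest-probability url type per row
true_classes = np.argmax(y_test, axis=1)      # index of the 1 in each one-hot label row
print("Spot-check accuracy: %.2f%%" % (100 * np.mean(pred_classes == true_classes)))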
    Last edited by Auraria; 02-12-2020 at 05:38 AM.

  6. #6
    Branned Didlid's Avatar
    Join Date: Oct 2012
    Posts: 7,027
    Rep Power: 8441
    Didlid is offline
    Originally Posted by Auraria View Post
Thanks brahs, you were both right; I misunderstood what the docs were telling me.

Swapped it to 0.2, and now training jumped from 53k to 426k samples per epoch.

Sorry, not entirely sure what cross-validation looks like from a Keras standpoint, but I do test the model against the dataset and compare it to the real values by doing the following:

input_data = data.drop("rating", axis=1)  # drop the rating column so the model can fill in its predicted value
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])  # how I compile the model (all the layers I've created are omitted here)
model.fit(X_train, np.array(y_train), epochs=200, batch_size=1000)  # this is where the training happens
scores = model.evaluate(X_test, y_test)  # evaluate the model on the held-out test set
print("\nAccuracy: %.2f%%" % (scores[1]*100))  # formatted accuracy
Hah, I was midway through quoting MrBourbon to say that yeah, using that function you're allocating 90% of the data to testing, aka your model is only learning from 10% of it, but then I refreshed and saw you replied.

Anyway, I think he is talking about a validation dataset.

It's usually good practice to split your data three ways: train/test/validation. You can use the train and test datasets like you have in that code; it's also possible to take the validation set and do things like k-fold cross-validation, which you can use to tune the hyperparameters of the model. This should improve performance and help iron out some of the bias in the model.

Might be worth googling train/test/validation split to see what I mean.

Nothing wrong with what you're doing in that code, though. Validation sets just give you a better understanding of how the model is performing, with extra metrics to go off essentially.


edit: to do this, most people use a 60/20/20 split, 60 being training data, as a best practice IIRC.
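
Something like this untested sketch, reusing the variable names from your code: hold out 20% for test first, then carve 25% of the remaining 80% out for validation, which lands you at 60/20/20:

from sklearn.model_selection import train_test_split

# first split: hold out 20% of everything as the final test set
X_trainval, X_test, y_trainval, y_test = train_test_split(input_data, output_data, test_size=0.2, random_state=33)
# second split: 25% of the remaining 80% = 20% of the total, used for validation
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=33)
# Keras then reports loss/accuracy on the validation set after every epoch
model.fit(X_train, np.array(y_train), epochs=200, batch_size=1000, validation_data=(X_val, np.array(y_val)))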

  7. #7
    Injured_Knee_Brah Auraria's Avatar
    Join Date: Jan 2016
    Location: New York, United States
    Age: 25
    Posts: 2,864
    Rep Power: 21392
    Auraria is offline
    Originally Posted by Didlid View Post
Hah, I was midway through quoting MrBourbon to say that yeah, using that function you're allocating 90% of the data to testing, aka your model is only learning from 10% of it, but then I refreshed and saw you replied.

Anyway, I think he is talking about a validation dataset.

It's usually good practice to split your data three ways: train/test/validation. You can use the train and test datasets like you have in that code; it's also possible to take the validation set and do things like k-fold cross-validation, which you can use to tune the hyperparameters of the model. This should improve performance and help iron out some of the bias in the model.

Might be worth googling train/test/validation split to see what I mean.

Nothing wrong with what you're doing in that code, though. Validation sets just give you a better understanding of how the model is performing, with extra metrics to go off essentially.


edit: to do this, most people use a 60/20/20 split, 60 being training data, as a best practice IIRC.
Interesting, thank you for the information.

Like I said, I do this in my off time; I'm working on models for work just for chits and gigs.

I really do appreciate it, guys. I'll look into that after I finish training this new model!

    Enjoy my measlies

  8. #8
    Registered User MaximumCapacity's Avatar
    Join Date: Dec 2012
    Location: Fort Pierce, Florida, United States
    Posts: 4,838
    Rep Power: 1731
    MaximumCapacity is offline
What kind of data is this?
    legal disclaimer: OP sucks cock
    Computer Science autist race
    aspie
